1 Introduction

With the development of machine learning and deep learning, these techniques are increasingly used to detect malicious URLs. To convert a URL string into a numerical vector that a machine learning or deep learning model can process, these methods segment the URL and embed it into a feature vector, as in natural language processing. Chen’s research shows that most malicious URL detection methods combine a segmentation method, an embedding method, and a machine learning algorithm, which means that each of these three components affects the performance of the resulting detector.

A key advantage of machine learning based detection is that it can detect malicious URLs efficiently while keeping the false detection rate low. Accordingly, accuracy is an important evaluation index for malicious URL detection methods, and research on machine learning based detection has focused on increasing it.

As an important component of malicious URL detection methods, the method that turns URLs into feature vectors, called the URL embedding method, also significantly affects detection performance. However, in related research the only way to evaluate a URL embedding method is the accuracy obtained after training a machine learning model. Detection accuracy depends not only on the detection method itself but also on the training and test sets; in other words, the measured accuracy changes with the test set, so evaluating URL embedding methods only by the detection accuracy on a single test set is not comprehensive.

To solve this problem, an evaluation from another aspect, in addition to accuracy, becomes particularly important. An evaluation method that focuses on the embedded feature vectors themselves is called an intrinsic evaluation method. Unlike an extrinsic evaluation method, it does not depend on the other parts of the detector or on the training and test sets. Because the embedding method is the only variable, there is no need to worry about the influence of other factors.

The main contributions of this paper are shown as follows:

  1.

    We proposed an intrinsic evaluation method for URL embedding methods based on cosine similarity. The intrinsic evaluation method can evaluate a URL embedding method without the effect of machine learning models and data sets.

  2.

    Besides, we evaluated several URL embedding methods with both intrinsic and extrinsic methods, found that traditional extrinsic evaluation has difficulty distinguishing URL embedding methods, and demonstrated the usefulness of the intrinsic method.

  3.

    At last, we offered guidance on selecting a suitable embedding method for malicious URL detection according to the results of the evaluation.

The structure of this paper is as follows: in the Preliminary section we introduce the important algorithms and URL embedding methods used in our work, and in the Related Work section we introduce related research on malicious URL detection and evaluation methods. The structure of the intrinsic evaluation method is explained in Sect. 4, and Sect. 5 shows the whole evaluation process, both extrinsic and intrinsic. The Evaluation section contains the experimental data, experimental results, and results analysis. Finally, we discuss some unsolved problems of URL embedding methods and evaluation methods.

2 Preliminary

In this section, we will introduce the F-1 score and cosine similarity, which are used as the indicator of the extrinsic evaluation method and the evaluation algorithm of the intrinsic evaluation method, respectively. Besides, we will introduce the URL embedding methods used in our test.

2.1 F-1 score

The F-1 score is the harmonic mean of precision, which answers the question of what proportion of positive identifications was actually correct, and recall, which answers the question of what proportion of actual positives was identified correctly:

$$\begin{aligned} & Precision\!=\frac{tp}{tp+fp} \end{aligned}$$
(1)
$$\begin{aligned} & Recall\!=\frac{tp}{tp+fn} \end{aligned}$$
(2)
$$\begin{aligned} & F\!=2\frac{Precision \cdot Recall}{Precision+Recall} \end{aligned}$$
(3)
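As a minimal illustration, Eqs. (1)–(3) can be computed directly from confusion-matrix counts; the counts below are made-up numbers, not results from our experiments:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F-1 score from raw confusion-matrix counts, following Eqs. (1)-(3)."""
    precision = tp / (tp + fp)  # Eq. (1)
    recall = tp / (tp + fn)     # Eq. (2)
    return 2 * precision * recall / (precision + recall)  # Eq. (3)

# Hypothetical detector: 90 malicious URLs caught, 10 false alarms, 30 missed
print(f1_score(tp=90, fp=10, fn=30))  # 0.8181818181818182
```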

2.2 Cosine similarity

We used cosine similarity as an indicator of measuring how much information is retained. Specifically, the URLs are embedded as vectors in an inner product space and the cosine similarity is defined as the cosine of the angle between two vectors, that is, the dot product of the vectors divided by the product of their lengths.

$$\begin{aligned} S\!=\frac{v_{x} \cdot v_{y}}{\parallel v_{x} \parallel \parallel v_{y} \parallel } \end{aligned}$$
(4)

As shown in Eq. (4), \(v_{x}\) and \(v_{y}\) are two feature vectors and \(\parallel v_{x} \parallel \) and \(\parallel v_{y} \parallel \) are their L2 norms. The advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates must be considered. Cosine similarity represents the relationship between two Tokens: when it is close to 1, the two Tokens are very similar in the sense of the embedding method; conversely, when it is close to 0, the two Tokens are not similar in that sense.
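Equation (4) can be sketched in a few lines of plain Python, independent of any particular embedding library:

```python
import math

def cosine_similarity(vx: list[float], vy: list[float]) -> float:
    """Eq. (4): dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(vx, vy))
    norm_x = math.sqrt(sum(x * x for x in vx))
    norm_y = math.sqrt(sum(y * y for y in vy))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))  # 1.0 (same direction: similar Tokens)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal: dissimilar Tokens)
```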

2.3 URL embedding methods

The method that turns a URL into feature vectors that can be used for training is called an embedding method. In this section, we introduce several well-known embedding methods, dividing them into context-considering and context-agnostic embedding methods.

2.3.1 Context-considering embedding methods

Context-considering embedding methods are those in which the generation of word vectors takes the context of the corpus into account. For example, the CBOW and Skip-gram algorithms in Word2Vec predict a word from its context or predict the context from a word. When they turn words into word vectors, they consider the context, which increases prediction accuracy. In this paper, we used Word2Vec [2], FastText [3], and GloVe [4] as the target context-considering URL embedding methods.

2.3.2 Context-agnostic embedding methods

Context-agnostic embedding methods such as One-hot Code and TF-IDF [5] are basic embedding methods. They turn words into vectors in a simple way, using only word counts or positions.
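As a rough sketch of how these two context-agnostic schemes work on URL Tokens (the tiny two-document corpus below is invented for illustration):

```python
import math
from collections import Counter

# Toy corpus: two tokenized URLs (invented for illustration)
docs = [["www", "google", "com"], ["www", "amazon", "com", "shop"]]
vocab = sorted({tok for doc in docs for tok in doc})

def one_hot(token: str) -> list[float]:
    """One-hot Code: a single 1 at the token's vocabulary index."""
    return [1.0 if v == token else 0.0 for v in vocab]

def tf_idf(doc: list[str]) -> list[float]:
    """TF-IDF: term frequency weighted by inverse document frequency."""
    tf = Counter(doc)
    vec = []
    for v in vocab:
        df = sum(1 for d in docs if v in d)  # document frequency
        idf = math.log(len(docs) / df)
        vec.append((tf[v] / len(doc)) * idf)
    return vec

print(one_hot("google"))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Note how tokens such as "www" that appear in every document get a TF-IDF weight of zero: neither scheme encodes any contextual relationship between tokens.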

3 Related work

The Related Work section introduces research on embedding evaluators in NLP and on malicious URL detectors that use URL embedding and machine learning. We refer to the survey of G. Pradeepa and R. Devi, "Review of Malicious URL Detection Using Machine Learning" [6], which summarizes 12 related studies on malicious URL detection using machine learning, including their machine learning models and features; it is very helpful for surveying relevant research.

3.1 A three-step framework for detecting malicious URLs

Chen [7] proposed a three-step framework to review 14 methods of detecting malicious URLs. They divided malicious URL detection using machine learning into three parts: segmentation, embedding, and machine learning. They evaluated several machine learning models and context-considering methods with the three-step framework, verified the importance of considering context, and found that context-considering embedding methods improved detection accuracy by about 6\(\%\). Chen’s research uses the F-1 score to evaluate the suitability of each embedding method and detection method for a specific malicious URL detection task. However, once the training and test sets of that task change, the F-1 score also changes, which affects the evaluation results. In this sense, their evaluation of embedding methods is incomplete.

3.2 The extrinsic and intrinsic evaluating method in NLP

Wang [8] categorizes NLP evaluators into two types: intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independent of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. Although the Tokens produced by URL segmentation differ from natural language, and a Token is not a word in the linguistic sense, the process and methods of URL embedding and word embedding are similar, so we can refer to the evaluation methods for word embeddings.

3.3 The segmentation methods of the malicious URLs detection research

URL2Vec, proposed by Yuan et al. [9] in "URL2Vec: URL Modeling with Character Embeddings for Fast and Accurate Phishing Website Detection", is a typical study that uses machine learning to detect malicious URLs. They divided URLs by structure into 5 parts: protocol, sub-domain name, domain name, domain suffix, and URL path.

The method proposed by Kaneko et al. [10], "Detecting Malicious Websites by Query Templates", used the machine learning algorithm DBSCAN to cluster malicious and benign URLs. In the segmentation step, they chose a different way to divide URLs: splitting at all delimiters in the URL. Each part of the split URL is called a Token; we call this approach the Token segmentation method, and we use it to split URLs in this paper.

URLnet, proposed by Le et al. [11] in "Learning a URL Representation with Deep Learning for Malicious URL Detection", trained a Convolutional Neural Network model to detect malicious URLs and obtained good results. They proposed two different ways to divide URLs: Char-level-CNN separates the URL by each letter, which we call the Alphabet segmentation method, and Word-level-CNN uses the separators "/", "." and "-" to divide the URLs.
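The three segmentation styles above can be sketched as follows; the exact delimiter set for Token segmentation is our assumption, since the papers differ in which characters they treat as delimiters:

```python
import re

url = "https://www.google.com/search?q=test"

# Token segmentation (Kaneko et al.): split at every delimiter character
tokens = [t for t in re.split(r"[:/.\-_?=&@#%~+]", url) if t]

# Alphabet segmentation (Char-level-CNN): one token per character
chars = list(url)

# Word-level segmentation (Word-level-CNN): split only at "/", "." and "-"
words = [t for t in re.split(r"[/.\-]", url) if t]

print(tokens)  # ['https', 'www', 'google', 'com', 'search', 'q', 'test']
print(words)   # ['https:', 'www', 'google', 'com', 'search?q=test']
```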

3.4 The URL embedding methods of the malicious URLs detection research

In the embedding step of URL2Vec, each part of the URL is embedded as a feature vector using Skip-gram. eXpose, proposed by Joshua et al. [12] in "A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys", and URLnet used the one-hot code embedding method. The UE Model proposed by Yan et al. [13] introduced a new URL embedding method that uses Huffman coding and a Huffman tree to embed URLs. The phishing URL detection system proposed by Ozgur et al. [14] improves on previous research by using NLP-based features to consider more word information. The malicious URL detection systems proposed by Cho et al. [15], Ripon et al. [16], Kamel et al. [17], Yogendra et al. [18], Patil et al. [19], Ammara et al. [20], Ferhat et al. [21], and Mohammad et al. [22] are typical detection systems based on feature engineering. They used the number of '.' characters, the number of subdomain levels, the length of the URL, and a series of other lexical features to train machine learning and deep learning models.

4 Intrinsic evaluation method

Intrinsic evaluation methods focus on the embedding performance of URL embedding methods. They test the quality of a representation independent of a specific malicious URL detection task and measure the relationships among domains in the URL directly. In other words, the embedded feature vector contains the relative information of the URL Token, and how accurately this information is retained when the URL string is turned into a numerical vector reflects the performance of the URL embedding method.

4.1 Intrinsic score

With the premise in Sect. 2.2, we know that if two Tokens have similar meanings in the URL and their cosine similarity is close to 1, the two Tokens are well embedded. To evaluate the similarity of a group of Tokens, we calculate the average value, as shown in Eq. (5): the Tokens in the group compute cosine similarity with each other and we take the average. Because the Tokens in the group are similar to each other, the closer \(S_{Similar}\) is to 1, the better they are embedded.

$$\begin{aligned} S_{Similar} = \frac{1}{n(n-1)} \sum _{ v_{x} \in A, v_{y} \in A, v_{x} \ne v_{y}} S(v_{x},v_{y}) \end{aligned}$$
(5)

On the other hand, if two Tokens are not similar in URL meaning and their cosine similarity is close to 0, this also means the two Tokens are well embedded. The computation is similar to Eq. (5) but requires two groups of Tokens, where group A is not similar to group B in URL meaning. As shown in Eq. (6), we calculate the average cosine similarity between group A and group B; because the Tokens in group A are not similar to those in group B, the closer \(S_{Dissimilar}\) is to 0, the better they are embedded.

$$\begin{aligned} S_{Dissimilar} = \frac{1}{n(n-1)} \sum _{ v_{x} \in A, v_{y} \in B} S(v_{x},v_{y}) \end{aligned}$$
(6)

Expanding on this, a good embedding has three characteristics: \(S_{Similar}\) is close to 1, \(S_{Dissimilar}\) is close to 0, and the difference between \(S_{Similar}\) and \(S_{Dissimilar}\) is large. We therefore propose the following score to evaluate the performance of a URL embedding method; the larger the score, the better the embedding:

$$\begin{aligned} \begin{aligned} Score\!= (100 \cdot S_{Similar} - 100 \cdot S_{Dissimilar})^{2} \\ + 100 \cdot S_{Similar} \\ \end{aligned} \end{aligned}$$
(7)
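Putting Eqs. (5)–(7) together, a plain-Python sketch of the intrinsic score might look like this; we keep the \(n(n-1)\) normalization exactly as written in Eqs. (5) and (6), and the toy vectors are invented for illustration:

```python
import itertools
import math

def cosine(vx, vy):
    """Eq. (4): cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(vx, vy))
    return dot / (math.sqrt(sum(x * x for x in vx)) * math.sqrt(sum(y * y for y in vy)))

def s_similar(group):
    """Eq. (5): average cosine similarity over ordered pairs within one group."""
    n = len(group)
    return sum(cosine(vx, vy)
               for vx, vy in itertools.permutations(group, 2)) / (n * (n - 1))

def s_dissimilar(group_a, group_b):
    """Eq. (6): average cosine similarity between two groups,
    with the n(n-1) normalization as written in the paper."""
    n = len(group_a)
    return sum(cosine(vx, vy)
               for vx in group_a for vy in group_b) / (n * (n - 1))

def intrinsic_score(group_a, group_b):
    """Eq. (7): rewards a high S_Similar, a low S_Dissimilar, and a large gap."""
    s_sim = s_similar(group_a)
    s_dis = s_dissimilar(group_a, group_b)
    return (100 * s_sim - 100 * s_dis) ** 2 + 100 * s_sim

# Toy vectors: group A points one way, group B is orthogonal to it
a = [[1.0, 0.0], [2.0, 0.0]]
b = [[0.0, 1.0], [0.0, 3.0]]
print(intrinsic_score(a, b))  # 10100.0 (perfect separation)
```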

4.2 URL token pair

In order to verify the relationship between two feature vectors, we need pairs of URL Tokens whose relationship is already known. Tokens like "amazon" and "google" usually play the role of the domain name in URLs such as "www.amazon.com" or "www.google.com", so they should be similar both in the URL and in vector space. We collected the top 50 domains from AlexaTop and selected 15 of them to form the similar Token set. We calculated the cosine similarity of these Tokens under the embedding methods Word2Vec, FastText, and GloVe to ensure that they are similar not only as domain names but also in the sense of the embeddings. After manual selection and verification with different embedding methods, the similar Token set is shown in Table 1. We likewise selected 15 Tokens that are dissimilar from those in the similar Token set, such as domain-suffix parts of the URL like "www" or "com". After selection and verification, the dissimilar Token set is shown in Table 2.

Table 1 Similar token set
Table 2 Dissimilar token set
Fig. 1
figure 1

Process of extrinsic evaluation

Fig. 2
figure 2

Process of intrinsic evaluation

5 Evaluation process

5.1 Process of extrinsic evaluation

The extrinsic evaluation method uses the URL embedding method as input features to a downstream task and measures changes in performance metrics specific to that task. This means we set up a specific malicious URL detection task as the downstream task and used several indicators to evaluate the performance of the resulting detectors.

Figure 1 shows the outline of the extrinsic evaluation process. The original URL is split by the segmentation method into URL Tokens, and the Tokens are then embedded into feature vectors according to the different embedding methods. A machine learning model is trained on the feature vectors, and after training the output model predicts the labels of the test URLs. To evaluate the different URL embedding methods, we varied the methods in the embedding and machine learning steps, including Random Forest [23], LightGBM [24], Decision Tree, Logistic Regression, and CNN. The embedding dimension was also treated as a variable.

5.2 Process of intrinsic evaluation

Figure 2 shows the outline of the intrinsic evaluation process. We split the URLs in the corpus at all delimiters into URL Tokens and set up the embedding methods using the corpus. As experimental subjects, we used Word2Vec/Skip-gram, Word2Vec/CBOW, FastText/Skip-gram, FastText/CBOW, GloVe, TF-IDF, and One-hot Code. Each embedding method to be evaluated then computes \(S_{Similar}\) and \(S_{Dissimilar}\) with the similar and dissimilar Token sets described in Sect. 4.2.

Table 3 Extrinsic score (F-1 score) comparison of URL embedding method with 50\(\%\) ratio of benign/malicious
Table 4 Extrinsic score (F-1 score) comparison of URL embedding method with 10\(\%\) ratio of benign/malicious
Table 5 Extrinsic score (AUC score) comparison of URL embedding method with 50\(\%\) ratio of benign/malicious
Table 6 Extrinsic score (AUC score) comparison of URL embedding method with 10\(\%\) ratio of benign/malicious
Table 7 Intrinsic score comparison of URL embedding methods

6 Evaluation

In this section, we show and analyze the results obtained according to the evaluation process described in Sect. 5.

6.1 Data set

The extrinsic evaluation method requires a complete malicious URL detection task, so we prepared a URL set for use as a corpus and training/test sets for training and detection. We set up a crawler to collect 140 thousand URLs from AlexaTop [25], a ranking of the most-used domain names maintained by Amazon. We selected 5 thousand malicious URLs with a classic URL structure from URLhaus [26], a manually maintained malicious URL database, as the malicious set, and 5 thousand benign URLs with a classic URL structure from the crawl results. The training and test sets are produced from these malicious and benign URLs by cross-validation with a random seed. In addition, to demonstrate that the extrinsic method changes with the dataset, we used different random seeds to construct different datasets: one experiment used equal numbers of malicious and benign URLs for training and testing, while the other used a dataset with a benign-to-malicious ratio of 1:10.
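The seeded dataset construction can be sketched as follows; the function and parameter names are hypothetical, and the real pipeline draws its URLs from the AlexaTop crawl and URLhaus rather than the placeholder strings used here:

```python
import random

def make_dataset(benign, malicious, benign_per_malicious, seed):
    """Sample a labeled URL set at the requested benign:malicious ratio.

    Hypothetical helper; label 0 = benign, 1 = malicious.
    """
    rng = random.Random(seed)
    n_benign = int(len(malicious) * benign_per_malicious)
    data = [(u, 0) for u in rng.sample(benign, n_benign)]
    data += [(u, 1) for u in malicious]
    rng.shuffle(data)  # different seeds yield different datasets
    return data

# Placeholder URL pools (illustration only)
benign = [f"benign{i}.example" for i in range(100)]
malicious = [f"evil{i}.example" for i in range(10)]

balanced = make_dataset(benign, malicious, benign_per_malicious=1.0, seed=42)
skewed = make_dataset(benign, malicious, benign_per_malicious=0.1, seed=42)
print(len(balanced), len(skewed))  # 20 11
```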

6.2 Experiment results

We took several URL embedding methods as variables and tested them with the extrinsic and intrinsic evaluation methods. We also compared vectors embedded in 64 and 2 dimensions, because higher-dimensional vectors usually contain more information than lower-dimensional ones and provide more features for training, which can improve prediction accuracy. Tables 3 and 4 show the F-1 scores on both data sets and Tables 5 and 6 show the AUC scores of the extrinsic evaluation described in Sect. 5.1, while Table 7 shows the results of the intrinsic evaluation described in Sect. 5.2. Table 9 shows the 64D \(S_{Similar}\) and \(S_{Dissimilar}\) of each URL embedding method.

6.3 Results analysis

As shown in Tables 3, 4, 5, 6 and 7, in both the extrinsic and intrinsic evaluation results the context-considering embedding methods outperform the context-agnostic ones, which means that considering context is essential not only in NLP but also in URL embedding.

Table 8 Evaluation of related works

6.4 The evaluation of related works

We summarized 14 related studies using machine learning for malicious URL detection in Sect. 3, and in this section we evaluate their detection methods with the intrinsic and extrinsic evaluation methods. As shown in Table 8, the related research can be divided into 5 parts. Because the related studies use many different machine learning algorithms, we uniformly apply the random forest algorithm to achieve a unified evaluation. These studies have also used different segmentation methods such as Alphabet, Word, and Token; here we uniformly use the Token segmentation method. The reason is that, according to our previous study, the Token segmentation method is a context-considering segmentation method whose detection accuracy is among the highest of all segmentation methods. In addition, the Token pairs used for the intrinsic evaluation method in this study are produced by Token segmentation and can therefore be scored by the intrinsic evaluation method.

As shown in Table 8, the studies "URLnet" [11] and "eXpose" [12] used One-hot Code, a context-agnostic embedding method, and obtained almost the lowest extrinsic scores among the related works, because context-agnostic embedding methods do not include contextual information during embedding, as mentioned in Sect. 2.3.2. Not only the extrinsic score but also the intrinsic score clearly shows that the embedding performance of One-hot Code lags far behind the context-considering embedding methods. Given that these two studies were proposed relatively early, they provided new ideas for other machine learning based malicious URL detection methods.

The studies "URL2Vec" [9] and "UE Model" [13] achieved the highest scores among the related works. Skip-gram is a well-known context-considering model that embeds Tokens with similar meanings into nearby regions of the vector space, and during detection the distance in vector space plays a significant role in predicting URL properties.

The remaining studies used numerical embedding, which is not a word embedding algorithm: they rely on traditional feature engineering, feeding features such as character length and character counts into machine learning models for detection. This traditional approach is more common in network attack detection such as DDoS detection. It obtained the lowest score, indicating that it is not suitable for current malicious URL detection.

Furthermore, studies using context-considering embedding methods typically achieve higher detection accuracy, which is what the extrinsic evaluation method measures. However, the extrinsic results are generally not significantly different from each other, and to make the differences more obvious, studies often reduce the training set or the number of training iterations. Although this makes the differences in detection accuracy between methods more apparent, the accuracy measured with a reduced training set or training frequency is not the accuracy the detection method would actually achieve.

As shown in Fig. 3, the results of the intrinsic evaluation method are consistent with those of the extrinsic evaluation method, but the differences between the results are much clearer. However, the intrinsic evaluation method only evaluates the performance of the embedding method in the malicious URL detection task; the actual detection performance still needs further verification by the extrinsic evaluation method. Therefore, the combination of intrinsic and extrinsic evaluation methods is the most ideal.

Fig. 3
figure 3

Comparison of intrinsic and extrinsic results about related works

6.5 The intrinsic method solves the disadvantages of the extrinsic method

6.5.1 Comparing URL embedding methods more easily

As shown in Tables 3, 4, 5 and 6, even when the dimension of the word vector is reduced from 64 to 2, the F-1 scores remain very close under different machine learning models. In this case, it is difficult to compare the performance of the URL embedding methods and to select the right one. With the help of the intrinsic evaluation method, we can see the embedding quality more clearly. As Table 7 shows, the Skip-gram methods from Word2Vec and FastText and the GloVe method have a huge difference between their 64D and 2D intrinsic scores, which shows that 64D embedding can improve detection performance with Word2Vec/Skip-gram, FastText/Skip-gram, and GloVe. Besides, although the AUC score is an important metric for detectors, Tables 5 and 6 show that apart from DT there is little difference among the machine learning models, so the AUC score likewise cannot be used to compare the embedding methods.

6.5.2 Not affected by machine learning models and datasets

As shown in Tables 3 and 5, 64D GloVe with CNN has a lower F-1 score than 64D Word2Vec/Skip-gram with CNN at a 50\(\%\) benign/malicious ratio, but a higher F-1 score at a 10\(\%\) ratio. This illustrates a drawback of the extrinsic evaluation method: it can be influenced by the dataset. It is also difficult to compare embedding methods such as Word2Vec and FastText, since their relative ranking differs across machine learning models; with the help of the intrinsic evaluation method, as shown in Table 7, we can clearly distinguish the differences between the methods.

7 Discussion

7.1 Problems with existing URL embedding methods

Even though most URL embedding methods achieved good detection accuracy in the extrinsic tests, the specific cosine similarities in Table 9 show that these URL embedding methods are not the most suitable for malicious URL detection. They usually obtain a high \(S_{Similar}\), but their \(S_{Dissimilar}\) is not low enough, which means existing URL embedding methods do not distinguish different URL Tokens well. The most notable example is the CBOW method: whether from Word2Vec or FastText, it obtained the highest \(S_{Similar}\), but the difference between its \(S_{Similar}\) and \(S_{Dissimilar}\) is small. In general, Word2Vec is more suitable for malicious URL detection, and the Skip-gram algorithm is more suitable for URL embedding.

Table 9 \(S_{Similar}\) and \(S_{Dissimilar}\) of URL embedding methods

However, the common problem with existing URL embedding methods is that they were originally designed for NLP. They identify similar and related words from the perspective of natural language, based on the relative positions of words in the corpus. These algorithms are not the most suitable for URL embedding because the relative position of Tokens in a URL differs from natural language. Besides, the treatment of polysemous Tokens is also unsatisfactory. For example, when the Token 'zoom' is used as a domain name, its meaning differs from when it appears as part of a path, which makes the cosine similarity of Tokens related to 'zoom' very poor. In conclusion, URL embedding methods need to solve these problems to obtain better embedding performance.

7.2 Limitations and weaknesses

Although our intrinsic method can solve the disadvantages of extrinsic evaluation methods, namely that they are affected by the training and test sets and that the differences in detection accuracy are small, it still has limitations and weaknesses.

7.2.1 The token pairs

The intrinsic evaluation method does not need a test set based on a specific downstream task, but it requires a corpus for embedding and a set of Token pairs. Firstly, regarding the selection of Token pairs: as mentioned above, the Token pairs used for the intrinsic evaluation method were chosen manually by us, and changing a few of them can greatly alter the final evaluation results. That is, for URL embedding methods that are relatively similar, such as Word2Vec and FastText, changing one or two Token pairs is likely to change their intrinsic score ranking, so the Token pairs must be chosen with caution. However, such small changes do not affect the significant difference in intrinsic scores between context-considering and context-agnostic embedding methods.

Besides, as described in Sect. 4.2, the Token pairs must also be generated according to the corpus, which means the Token pair sets are bound to the current corpus: if the corpus does not contain one of the tokens in a pair, that token cannot be embedded. This forces an adjustment of the Token pair composition, and the corresponding corpus must be adjusted as well. Adjusting a corpus containing a large number of URLs requires reacquiring a large amount of URL data, which makes the evaluation process more complex.

7.2.2 Evaluation for actual tasks

The intrinsic evaluation method evaluates how well the target embedding method embeds URLs; it cannot evaluate the performance of a specific method on a particular task. For example, for the malicious URL detection task described in this article, the intrinsic evaluation method cannot evaluate how well the related detection methods detect malicious URLs. In conclusion, the intrinsic evaluation method we propose assists the extrinsic method in evaluating specific URL embedding methods; the performance of a detection system must still be judged with specific machine learning models.

7.2.3 Establish a standard test collection

The Token pair set in this article contains only 30 pairs, which is not enough. For example, the WordSim353 collection for word vectors has 353 pairs and was proposed in 2002; in recent years, most collections have included thousands of pairs. At present, our research has demonstrated the feasibility of the intrinsic evaluation method for evaluating URL embedding methods, but it is not yet a standardized evaluation method. In the future, we will work to improve the Token pair collection and its corresponding corpus so that our proposed method can truly be used for standardized evaluation.

7.2.4 Conclusion of strengths and weaknesses

We conclude with the strengths and weaknesses of our proposed intrinsic evaluation method, as listed in this section.

Strengths of The Intrinsic Evaluation Method:

  1.

    As a specialized method for evaluating embedding methods, it makes URL embedding methods easier to compare.

  2.

    It is not affected by machine learning models or datasets.

Weaknesses of The Intrinsic Evaluation Method:

  1.

    Changing the Token pairs, along with the corresponding corpus, may significantly change the evaluation results.

  2.

    It cannot evaluate the performance of a specific method on a particular task such as malicious URL detection.

  3.

    To become a standardized evaluation method for URL embedding methods, a Token pair test collection and its corresponding corpus must be built.

8 Conclusion

In this paper, we proposed an intrinsic evaluation method for URL embedding methods that can evaluate them without the effect of machine learning models and data sets. We evaluated several URL embedding methods with both the intrinsic and the extrinsic method, found that the results of traditional extrinsic evaluation are hard to compare when evaluating URL embedding methods, and showed that the intrinsic evaluation method plays its role in that evaluation. Finally, according to the evaluation results, we found that the Word2Vec embedding method and the Skip-gram algorithm are well suited for URL embedding.

In future work, we will focus on improving the intrinsic evaluation method, including increasing the number of Token pairs and addressing how to perform intrinsic evaluation without the Token segmentation method. Using larger and more diverse datasets covering various domains, languages, and collection periods will also be a future challenge for further verifying the feasibility of our proposed method.