1 Introduction

With the development of machine learning and deep learning, these techniques are increasingly used to detect malicious URLs. To convert a URL string into a numerical vector that a machine learning or deep learning model can process, these methods segment the URL and embed it into a feature vector, as in natural language processing. Chen’s research shows that most malicious URL detection methods combine a segmentation method, an embedding method, and a machine learning algorithm, which means that each of these three components affects the performance of the resulting detector.

A key advantage of machine learning based detection is that it can detect malicious URLs efficiently while keeping the false detection rate low. Accordingly, accuracy is an important evaluation index for malicious URL detection methods, and research on machine learning based detection has focused on increasing it.

As an important component of malicious URL detection methods, the method that turns URLs into feature vectors, called the URL embedding method, also significantly affects detection performance. However, in related research the only way to evaluate a URL embedding method is the accuracy obtained after training a machine learning model. Detection accuracy depends not only on the detection method itself but also on the training and test sets; in other words, the measured accuracy changes with the test set, so evaluating URL embedding methods only by the detection accuracy on a single test set is not comprehensive.

To solve this problem, an evaluation from another aspect, in addition to accuracy, becomes particularly important. An evaluation method that focuses on the embedded feature vectors themselves is called an intrinsic evaluation method. Unlike an extrinsic evaluation method, it does not depend on the other parts of the detector or on the training and test sets. Because the embedding method is the only variable, there is no need to worry about the influence of other factors.

The main contributions of this paper are shown as follows:

  1.

    We proposed an intrinsic evaluation method for URL embedding methods based on cosine similarity. The intrinsic evaluation method can evaluate a URL embedding method without the effect of machine learning models and data sets.

  2.

    Besides, we evaluated several URL embedding methods with both intrinsic and extrinsic methods, found that traditional extrinsic evaluation has difficulty distinguishing URL embedding methods, and demonstrated the usefulness of the intrinsic method.

  3.

    At last, we offered guidance on selecting a suitable embedding method for malicious URL detection according to the results of the evaluation.

The structure of this paper is as follows: in the Preliminary section we introduce the important algorithms and URL embedding methods used in our work, and in the Related Work section we introduce related research on malicious URL detection and evaluation methods. The structure of the intrinsic evaluation method is explained in Sect. 4, and Sect. 5 shows the whole evaluation process, both extrinsic and intrinsic. The Evaluation section contains the experimental data, experimental results, and results analysis. Finally, we discuss some unsolved problems of URL embedding methods and evaluation methods.

2 Preliminary

In this section, we will introduce the F-1 score and cosine similarity, which are used as the indicator of the extrinsic evaluation method and the evaluation algorithm of the intrinsic evaluation method, respectively. Besides, we will introduce the URL embedding methods used in our test.

2.1 F-1 score

The F-1 score is the harmonic mean of precision, which answers the question of what proportion of positive identifications was actually correct, and recall, which answers the question of what proportion of actual positives was identified correctly:

$$\begin{aligned} & Precision\!=\frac{tp}{tp+fp} \end{aligned}$$
(1)
$$\begin{aligned} & Recall\!=\frac{tp}{tp+fn} \end{aligned}$$
(2)
$$\begin{aligned} & F\!=2\frac{Precision \cdot Recall}{Precision+Recall} \end{aligned}$$
(3)
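As a minimal illustration, Eqs. (1)–(3) can be computed directly from confusion-matrix counts; the counts below are made-up numbers, not results from our experiments:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F-1 score from raw confusion-matrix counts, following Eqs. (1)-(3)."""
    precision = tp / (tp + fp)  # Eq. (1)
    recall = tp / (tp + fn)     # Eq. (2)
    return 2 * precision * recall / (precision + recall)  # Eq. (3)

# Hypothetical detector: 90 malicious URLs caught, 10 false alarms, 30 missed
print(f1_score(tp=90, fp=10, fn=30))  # 0.8181818181818182
```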

2.2 Cosine similarity

We used cosine similarity as an indicator of measuring how much information is retained. Specifically, the URLs are embedded as vectors in an inner product space and the cosine similarity is defined as the cosine of the angle between two vectors, that is, the dot product of the vectors divided by the product of their lengths.

$$\begin{aligned} S\!=\frac{v_{x} \cdot v_{y}}{\parallel v_{x} \parallel \parallel v_{y} \parallel } \end{aligned}$$
(4)

As shown in Eq. (4), \(v_{x}\) and \(v_{y}\) are two feature vectors and \(\parallel v_{x} \parallel \) and \(\parallel v_{y} \parallel \) are their L2 norms. The advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero coordinates must be considered. Cosine similarity represents the relationship between two Tokens: when it is close to 1, the two Tokens are very similar in the sense of the embedding method; conversely, when it is close to 0, the two Tokens are not similar in that sense.
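Equation (4) can be sketched in a few lines of plain Python, independent of any particular embedding library:

```python
import math

def cosine_similarity(vx: list[float], vy: list[float]) -> float:
    """Eq. (4): dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(vx, vy))
    norm_x = math.sqrt(sum(x * x for x in vx))
    norm_y = math.sqrt(sum(y * y for y in vy))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))  # 1.0 (same direction: similar Tokens)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal: dissimilar Tokens)
```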

2.3 URL embedding methods

The method that turns a URL into feature vectors that can be used for training is called an embedding method. In this section, we introduce several well-known embedding methods, dividing them into context-considering and context-agnostic embedding methods.

2.3.1 Context-considering embedding methods

Context-considering embedding methods are those in which the generation of word vectors takes the context of the corpus into account. For example, the CBOW and Skip-gram algorithms in Word2Vec predict a word from its context or predict the context from a word. When they turn words into word vectors, they consider the context, which increases prediction accuracy. In this paper, we used Word2Vec [2], FastText [3], and GloVe [4] as the target context-considering URL embedding methods.

2.3.2 Context-agnostic embedding methods

Context-agnostic embedding methods such as One-hot Code and TF-IDF [5] are basic embedding methods. They turn words into vectors in a simple way, using only word counts or positions.
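As a rough sketch of how these two context-agnostic schemes work on URL Tokens (the tiny two-document corpus below is invented for illustration):

```python
import math
from collections import Counter

# Toy corpus: two tokenized URLs (invented for illustration)
docs = [["www", "google", "com"], ["www", "amazon", "com", "shop"]]
vocab = sorted({tok for doc in docs for tok in doc})

def one_hot(token: str) -> list[float]:
    """One-hot Code: a single 1 at the token's vocabulary index."""
    return [1.0 if v == token else 0.0 for v in vocab]

def tf_idf(doc: list[str]) -> list[float]:
    """TF-IDF: term frequency weighted by inverse document frequency."""
    tf = Counter(doc)
    vec = []
    for v in vocab:
        df = sum(1 for d in docs if v in d)  # document frequency
        idf = math.log(len(docs) / df)
        vec.append((tf[v] / len(doc)) * idf)
    return vec

print(one_hot("google"))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Note how tokens such as "www" that appear in every document get a TF-IDF weight of zero: neither scheme encodes any contextual relationship between tokens.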

3 Related work

The Related Work section introduces research on embedding evaluators in NLP and on malicious URL detectors that use URL embedding and machine learning. We refer to the survey of G. Pradeepa and R. Devi, "Review of Malicious URL Detection Using Machine Learning" [6], which summarizes 12 related studies on malicious URL detection using machine learning, including their machine learning models and features; it is very helpful for surveying relevant research.

3.1 A three-step framework for detecting malicious URLs

Chen [7] proposed a three-step framework to review 14 methods of detecting malicious URLs. They divided malicious URL detection using machine learning into three parts: segmentation, embedding, and machine learning. They evaluated several machine learning models and context-considering methods with the three-step framework, verified the importance of considering context, and found that context-considering embedding methods improved detection accuracy by about 6\(\%\). Chen’s research uses the F-1 score to evaluate the suitability of each embedding method and detection method for a specific malicious URL detection task. However, once the training and test sets of that task change, the F-1 score also changes, which affects the evaluation results. In this sense, their evaluation of embedding methods is incomplete.

3.2 The extrinsic and intrinsic evaluating method in NLP

Wang [8] categorizes NLP evaluators into two types: intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independent of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. Although the Tokens produced by URL segmentation differ from natural language, and a Token is not a word in the linguistic sense, the process and methods of URL embedding and word embedding are similar, so we can refer to the evaluation methods for word embeddings.

3.3 The segmentation methods of the malicious URLs detection research

URL2Vec, proposed by Yuan et al. [9] in "URL2Vec: URL Modeling with Character Embeddings for Fast and Accurate Phishing Website Detection", is a typical study that uses machine learning to detect malicious URLs. They divided URLs by structure into 5 parts: protocol, sub-domain name, domain name, domain suffix, and URL path.

The method proposed by Kaneko et al. [10], "Detecting Malicious Websites by Query Templates", used the machine learning algorithm DBSCAN to cluster malicious and benign URLs. In the segmentation step, they chose a different way to divide URLs: splitting at all delimiters in the URL. Each part of the split URL is called a Token; we call this approach the Token segmentation method, and we use it to split URLs in this paper.

URLnet, proposed by Le et al. [11] in "Learning a URL Representation with Deep Learning for Malicious URL Detection", trained a Convolutional Neural Network model to detect malicious URLs and obtained good results. They proposed two different ways to divide URLs: Char-level-CNN separates the URL by each letter, which we call the Alphabet segmentation method, and Word-level-CNN uses the separators "/", "." and "-" to divide the URLs.
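The three segmentation styles above can be sketched as follows; the exact delimiter set for Token segmentation is our assumption, since the papers differ in which characters they treat as delimiters:

```python
import re

url = "https://www.google.com/search?q=test"

# Token segmentation (Kaneko et al.): split at every delimiter character
tokens = [t for t in re.split(r"[:/.\-_?=&@#%~+]", url) if t]

# Alphabet segmentation (Char-level-CNN): one token per character
chars = list(url)

# Word-level segmentation (Word-level-CNN): split only at "/", "." and "-"
words = [t for t in re.split(r"[/.\-]", url) if t]

print(tokens)  # ['https', 'www', 'google', 'com', 'search', 'q', 'test']
print(words)   # ['https:', 'www', 'google', 'com', 'search?q=test']
```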

3.4 The URL embedding methods of the malicious URLs detection research

In the embedding step of URL2Vec, each part of the URL is embedded as a feature vector using Skip-gram. eXpose, proposed by Joshua et al. [12] in "A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys", and URLnet used the one-hot code embedding method. The UE Model proposed by Yan et al. [13] introduced a new URL embedding method that uses Huffman coding and a Huffman tree to embed URLs. The phishing URL detection system proposed by Ozgur et al. [14] improves on previous research by using NLP-based features to consider more word information. The malicious URL detection systems proposed by Cho et al. [15], Ripon et al. [16], Kamel et al. [17], Yogendra et al. [18], Patil et al. [19], Ammara et al. [20], Ferhat et al. [21], and Mohammad et al. [22] are typical detection systems based on feature engineering. They used the number of '.' characters, the number of subdomain levels, the length of the URL, and a series of other lexical features to train machine learning and deep learning models.

4 Intrinsic evaluation method

Intrinsic evaluation methods focus on the embedding performance of URL embedding methods. They test the quality of a representation independent of a specific malicious URL detection task and measure the relationships among domains in the URL directly. In other words, the embedded feature vector contains the relative information of the URL Token, and how accurately this information is retained when the URL string is turned into a numerical vector reflects the performance of the URL embedding method.

4.1 Intrinsic score

With the premise in Sect. 2.2, we know that if two Tokens have similar meanings in the URL and their cosine similarity is close to 1, the two Tokens are well embedded. To evaluate the similarity of a group of Tokens, we calculate the average value, as shown in Eq. (5): the Tokens in the group compute cosine similarity with each other and we take the average. Because the Tokens in the group are similar to each other, the closer \(S_{Similar}\) is to 1, the better they are embedded.

$$\begin{aligned} S_{Similar} = \frac{1}{n(n-1)} \sum _{ v_{x} \in A, v_{y} \in A, v_{x} \ne v_{y}} S(v_{x},v_{y}) \end{aligned}$$
(5)

On the other hand, if two Tokens are not similar in URL meaning and their cosine similarity is close to 0, this also means the two Tokens are well embedded. The computation is similar to Eq. (5) but requires two groups of Tokens, where group A is not similar to group B in URL meaning. As shown in Eq. (6), we calculate the average cosine similarity between group A and group B; because the Tokens in group A are not similar to those in group B, the closer \(S_{Dissimilar}\) is to 0, the better they are embedded.

$$\begin{aligned} S_{Dissimilar} = \frac{1}{n(n-1)} \sum _{ v_{x} \in A, v_{y} \in B} S(v_{x},v_{y}) \end{aligned}$$
(6)

Expanding on this, a good embedding has three characteristics: \(S_{Similar}\) is close to 1, \(S_{Dissimilar}\) is close to 0, and the difference between \(S_{Similar}\) and \(S_{Dissimilar}\) is large. We therefore propose the following score to evaluate the performance of a URL embedding method; the larger the score, the better the embedding:

$$\begin{aligned} \begin{aligned} Score\!= (100 \cdot S_{Similar} - 100 \cdot S_{Dissimilar})^{2} \\ + 100 \cdot S_{Similar} \\ \end{aligned} \end{aligned}$$
(7)
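Putting Eqs. (5)–(7) together, a plain-Python sketch of the intrinsic score might look like this; we keep the \(n(n-1)\) normalization exactly as written in Eqs. (5) and (6), and the toy vectors are invented for illustration:

```python
import itertools
import math

def cosine(vx, vy):
    """Eq. (4): cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(vx, vy))
    return dot / (math.sqrt(sum(x * x for x in vx)) * math.sqrt(sum(y * y for y in vy)))

def s_similar(group):
    """Eq. (5): average cosine similarity over ordered pairs within one group."""
    n = len(group)
    return sum(cosine(vx, vy)
               for vx, vy in itertools.permutations(group, 2)) / (n * (n - 1))

def s_dissimilar(group_a, group_b):
    """Eq. (6): average cosine similarity between two groups,
    with the n(n-1) normalization as written in the paper."""
    n = len(group_a)
    return sum(cosine(vx, vy)
               for vx in group_a for vy in group_b) / (n * (n - 1))

def intrinsic_score(group_a, group_b):
    """Eq. (7): rewards a high S_Similar, a low S_Dissimilar, and a large gap."""
    s_sim = s_similar(group_a)
    s_dis = s_dissimilar(group_a, group_b)
    return (100 * s_sim - 100 * s_dis) ** 2 + 100 * s_sim

# Toy vectors: group A points one way, group B is orthogonal to it
a = [[1.0, 0.0], [2.0, 0.0]]
b = [[0.0, 1.0], [0.0, 3.0]]
print(intrinsic_score(a, b))  # 10100.0 (perfect separation)
```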

4.2 URL token pair

In order to verify the relationship between two feature vectors, we need pairs of URL Tokens whose relationship is already known. Tokens like "amazon" and "google" usually play the role of the domain name in URLs such as "www.amazon.com" or "www.google.com", so they should be similar both in the URL and in vector space. We collected the top 50 domains from AlexaTop and selected 15 of them to form the similar Token set. We calculated the cosine similarity of these Tokens under the embedding methods Word2Vec, FastText, and GloVe to ensure that they are similar not only as domain names but also in the sense of the embeddings. After manual selection and verification with different embedding methods, the similar Token set is shown in Table 1. We likewise selected 15 Tokens that are dissimilar from those in the similar Token set, such as domain-suffix parts of the URL like "www" or "com". After selection and verification, the dissimilar Token set is shown in Table 2.

Table 1 Similar token set
Table 2 Dissimilar token set
Fig. 1
figure 1

Process of extrinsic evaluation

Fig. 2
figure 2

Process of intrinsic evaluation

5 Evaluation process

5.1 Process of extrinsic evaluation

The extrinsic evaluation method uses the URL embedding method as input features to a downstream task and measures changes in performance metrics specific to that task. This means we set up a specific malicious URL detection task as the downstream task and used several indicators to evaluate the performance of the resulting detectors.

Figure 1 shows the outline of the extrinsic evaluation process. The original URL is split by the segmentation method into URL Tokens, and the Tokens are then embedded into feature vectors according to the different embedding methods. A machine learning model is trained on the feature vectors, and after training the output model predicts the labels of the test URLs. To evaluate the different URL embedding methods, we varied the methods in the embedding and machine learning steps, including Random Forest [23], LightGBM [24], Decision Tree, Logistic Regression, and CNN. The embedding dimension was also treated as a variable.

5.2 Process of intrinsic evaluation

Figure 2 shows the outline of the intrinsic evaluation process. We split the URLs in the corpus at all delimiters into URL Tokens and set up the embedding methods using the corpus. As experimental subjects, we used Word2Vec/Skip-gram, Word2Vec/CBOW, FastText/Skip-gram, FastText/CBOW, GloVe, TF-IDF, and One-hot Code. Each embedding method to be evaluated then computes \(S_{Similar}\) and \(S_{Dissimilar}\) with the similar and dissimilar Token sets described in Sect. 4.2.

Table 3 Extrinsic score (F-1 score) comparison of URL embedding method with 50\(\%\) ratio of benign/malicious
Table 4 Extrinsic score (F-1 score) comparison of URL embedding method with 10\(\%\) ratio of benign/malicious
Table 5 Extrinsic score (AUC score) comparison of URL embedding method with 50\(\%\) ratio of benign/malicious
Table 6 Extrinsic score (AUC score) comparison of URL embedding method with 10\(\%\) ratio of benign/malicious
Table 7 Intrinsic score comparison of URL embedding methods

6 Evaluation

In this section, we show and analyze the results obtained according to the evaluation process described in Sect. 5.

6.1 Data set

The extrinsic evaluation method requires a complete malicious URL detection task, so we prepared a URL set for use as a corpus and training/test sets for training and detection. We set up a crawler to collect 140 thousand URLs from AlexaTop [25], a ranking of the most-used domain names maintained by Amazon. We selected 5 thousand malicious URLs with a classic URL structure from URLhaus [26], a manually maintained malicious URL database, as the malicious set, and 5 thousand benign URLs with a classic URL structure from the crawl results. The training and test sets are produced from these malicious and benign URLs by cross-validation with a random seed. In addition, to demonstrate that the extrinsic method changes with the dataset, we used different random seeds to construct different datasets: one experiment used equal numbers of malicious and benign URLs for training and testing, while the other used a dataset with a benign-to-malicious ratio of 1:10.
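The seeded dataset construction can be sketched as follows; the function and parameter names are hypothetical, and the real pipeline draws its URLs from the AlexaTop crawl and URLhaus rather than the placeholder strings used here:

```python
import random

def make_dataset(benign, malicious, benign_per_malicious, seed):
    """Sample a labeled URL set at the requested benign:malicious ratio.

    Hypothetical helper; label 0 = benign, 1 = malicious.
    """
    rng = random.Random(seed)
    n_benign = int(len(malicious) * benign_per_malicious)
    data = [(u, 0) for u in rng.sample(benign, n_benign)]
    data += [(u, 1) for u in malicious]
    rng.shuffle(data)  # different seeds yield different datasets
    return data

# Placeholder URL pools (illustration only)
benign = [f"benign{i}.example" for i in range(100)]
malicious = [f"evil{i}.example" for i in range(10)]

balanced = make_dataset(benign, malicious, benign_per_malicious=1.0, seed=42)
skewed = make_dataset(benign, malicious, benign_per_malicious=0.1, seed=42)
print(len(balanced), len(skewed))  # 20 11
```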

6.2 Experiment results

We took several URL embedding methods as variables and tested them with the extrinsic and intrinsic evaluation methods. We also compared vectors embedded in 64 and 2 dimensions, because higher-dimensional vectors usually contain more information than lower-dimensional ones and provide more features for training, which can improve prediction accuracy. Tables 3 and 4 show the F-1 scores on both data sets and Tables 5 and 6 show the AUC scores of the extrinsic evaluation described in Sect. 5.1, while Table 7 shows the results of the intrinsic evaluation described in Sect. 5.2. Table 9 shows the 64D \(S_{Similar}\) and \(S_{Dissimilar}\) of each URL embedding method.

6.3 Results analysis

As shown in Tables 3, 4, 5, 6 and 7, in both the extrinsic and intrinsic evaluation results the context-considering embedding methods outperform the context-agnostic ones, which means that considering context is essential not only in NLP but also in URL embedding.

Table 8 Evaluation of related works

6.4 The evaluation of related works

We summarized 14 related studies using machine learning for malicious URL detection in Sect. 3, and in this section we evaluate their detection methods with the intrinsic and extrinsic evaluation methods. As shown in Table 8, the related research can be divided into 5 parts. Because the related studies use many different machine learning algorithms, we uniformly apply the random forest algorithm to achieve a unified evaluation. These studies have also used different segmentation methods such as Alphabet, Word, and Token; here we uniformly use the Token segmentation method. The reason is that, according to our previous study, the Token segmentation method is a context-considering segmentation method whose detection accuracy is among the highest of all segmentation methods. In addition, the Token pairs used for the intrinsic evaluation method in this study are produced by Token segmentation and can therefore be scored by the intrinsic evaluation method.

As shown in Table 8, the studies "URLnet" [11] and "eXpose" [12] used One-hot Code, a context-agnostic embedding method, and obtained almost the lowest extrinsic scores among the related works, because context-agnostic embedding methods do not include contextual information during embedding, as mentioned in Sect. 2.3.2. Not only the extrinsic score but also the intrinsic score clearly shows that the embedding performance of One-hot Code lags far behind the context-considering embedding methods. Given that these two studies were proposed relatively early, they provided new ideas for other machine learning based malicious URL detection methods.

The studies "URL2Vec" [9] and "UE Model" [13] achieved the highest scores among the related works. Skip-gram is a well-known context-considering model that embeds Tokens with similar meanings into nearby regions of the vector space, and during detection the distance in vector space plays a significant role in predicting URL properties.

The remaining studies used numerical embedding, which is not a word embedding algorithm: they rely on traditional feature engineering, feeding features such as character length and character counts into machine learning models for detection. This traditional approach is more common in network attack detection such as DDoS detection. It obtained the lowest score, indicating that it is not suitable for current malicious URL detection.

Furthermore, studies using context-considering embedding methods typically achieve higher detection accuracy, which is what the extrinsic evaluation method measures. However, the extrinsic results are generally not significantly different from each other, and to make the differences more obvious, studies often reduce the training set or the number of training iterations. Although this makes the differences in detection accuracy between methods more apparent, the accuracy measured with a reduced training set or training frequency is not the accuracy the detection method would actually achieve.

As shown in Fig. 3, the results of the intrinsic evaluation method are consistent with those of the extrinsic evaluation method, but the differences between the results are much clearer. However, the intrinsic evaluation method only evaluates the performance of the embedding method in the malicious URL detection task; the actual detection performance still needs further verification by the extrinsic evaluation method. Therefore, the combination of intrinsic and extrinsic evaluation methods is the most ideal.

Fig. 3
figure 3

Comparison of intrinsic and extrinsic results about related works

6.5 The intrinsic method solves the disadvantages of the extrinsic method

6.5.1 Comparing URL embedding methods more easily

As shown in Tables 3, 4, 5 and 6, even when the dimension of the word vector is reduced from 64 to 2, the F-1 scores remain very close under different machine learning models. In this case, it is difficult to compare the performance of the URL embedding methods and to select the right one. With the help of the intrinsic evaluation method, we can see the embedding quality more clearly. As Table 7 shows, the Skip-gram methods from Word2Vec and FastText and the GloVe method have a huge difference between their 64D and 2D intrinsic scores, which shows that 64D embedding can improve detection performance with Word2Vec/Skip-gram, FastText/Skip-gram, and GloVe. Besides, although the AUC score is an important metric for detectors, Tables 5 and 6 show that apart from DT there is little difference among the machine learning models, so the AUC score likewise cannot be used to compare the embedding methods.

6.5.2 Not affected by machine learning models and datasets

As shown in Tables 3 and 5, 64D GloVe with CNN has a lower F-1 score than 64D Word2Vec/Skip-gram with CNN at a 50\(\%\) benign/malicious ratio, but a higher F-1 score at a 10\(\%\) ratio. This illustrates a drawback of the extrinsic evaluation method: it can be influenced by the dataset. It is also difficult to compare embedding methods such as Word2Vec and FastText, since their relative ranking differs across machine learning models; with the help of the intrinsic evaluation method, as shown in Table 7, we can clearly distinguish the differences between the methods.

7 Discussion

7.1 Problems with existing URL embedding methods

Even though most URL embedding methods achieved good detection accuracy in the extrinsic tests, the specific cosine similarities in Table 9 show that these URL embedding methods are not the most suitable for malicious URL detection. They usually obtain a high \(S_{Similar}\), but their \(S_{Dissimilar}\) is not low enough, which means existing URL embedding methods do not distinguish different URL Tokens well. The most notable example is the CBOW method: whether from Word2Vec or FastText, it obtained the highest \(S_{Similar}\), but the difference between its \(S_{Similar}\) and \(S_{Dissimilar}\) is small. In general, Word2Vec is more suitable for malicious URL detection, and the Skip-gram algorithm is more suitable for URL embedding.

Table 9 \(S_{Similar}\) and \(S_{Dissimilar}\) of URL embedding methods

However, the common problem with existing URL embedding methods is that they were originally designed for NLP. They identify similar and related words from the perspective of natural language, based on the relative positions of words in the corpus. These algorithms are not the most suitable for URL embedding because the relative position of Tokens in a URL differs from natural language. Besides, the treatment of polysemous Tokens is also unsatisfactory. For example, when the Token 'zoom' is used as a domain name, its meaning differs from when it appears as part of a path, which makes the cosine similarity of Tokens related to 'zoom' very poor. In conclusion, URL embedding methods need to solve these problems to obtain better embedding performance.

7.2 Limitations and weaknesses

Although our intrinsic method can solve the disadvantages of extrinsic evaluation methods, namely that they are affected by the training and test sets and that the differences in detection accuracy are small, it still has limitations and weaknesses.

7.2.1 The token pairs

The intrinsic evaluation method does not need a test set based on a specific downstream task, but it requires a corpus for embedding and a set of Token pairs. Firstly, regarding the selection of Token pairs: as mentioned above, the Token pairs used for the intrinsic evaluation method were chosen manually by us, and changing a few of them can greatly alter the final evaluation results. That is, for URL embedding methods that are relatively similar, such as Word2Vec and FastText, changing one or two Token pairs is likely to change their intrinsic score ranking, so the Token pairs must be chosen with caution. However, such small changes do not affect the significant difference in intrinsic scores between context-considering and context-agnostic embedding methods.

Besides, as described in Sect. 4.2, the Token pairs must also be generated according to the corpus, which means the Token pair sets are bound to the current corpus: if the corpus does not contain one of the tokens in a pair, that token cannot be embedded. This forces an adjustment of the Token pair composition, and the corresponding corpus must be adjusted as well. Adjusting a corpus containing a large number of URLs requires reacquiring a large amount of URL data, which makes the evaluation process more complex.

7.2.2 Evaluation for actual tasks

The intrinsic evaluation method evaluates how well the target embedding method embeds URLs; it cannot evaluate the performance of a specific method on a particular task. For example, for the malicious URL detection task described in this article, the intrinsic evaluation method cannot evaluate how well the related detection methods detect malicious URLs. In conclusion, the intrinsic evaluation method we propose assists the extrinsic method in evaluating specific URL embedding methods; the performance of a detection system must still be judged with specific machine learning models.

7.2.3 Establish a standard test collection

The Token pair set in this article contains only 30 pairs, which is not enough. For example, the WordSim353 collection for word vectors has 353 pairs and was proposed in 2002; in recent years, most collections have included thousands of pairs. At present, our research has demonstrated the feasibility of the intrinsic evaluation method for evaluating URL embedding methods, but it is not yet a standardized evaluation method. In the future, we will work to improve the Token pair collection and its corresponding corpus so that our proposed method can truly be used for standardized evaluation.

7.2.4 Conclusion of strengths and weaknesses

We conclude with the strengths and weaknesses of our proposed intrinsic evaluation method, as listed in this section.

Strengths of The Intrinsic Evaluation Method:

  1.

    As a specialized method for evaluating embedding methods, it makes URL embedding methods easier to compare.

  2.

    It is not affected by machine learning models or datasets.

Weaknesses of The Intrinsic Evaluation Method:

  1.

    Changing the Token pairs, along with the corresponding corpus, may significantly change the evaluation results.

  2.

    It cannot evaluate the performance of a specific method on a particular task such as malicious URL detection.

  3.

    To become a standardized evaluation method for URL embedding methods, a Token pair test collection and its corresponding corpus must be built.

8 Conclusion

In this paper, we proposed an intrinsic evaluation method for URL embedding methods that can evaluate them without the effect of machine learning models and data sets. We evaluated several URL embedding methods with both the intrinsic and the extrinsic method, found that the results of traditional extrinsic evaluation are hard to compare when evaluating URL embedding methods, and showed that the intrinsic evaluation method plays its role in that evaluation. Finally, according to the evaluation results, we found that the Word2Vec embedding method and the Skip-gram algorithm are well suited for URL embedding.

In future work, we will focus on improving the intrinsic evaluation method, including increasing the number of Token pairs and addressing how to perform intrinsic evaluation without the Token segmentation method. Using larger and more diverse datasets covering various domains, languages, and collection periods will also be a future challenge for further verifying the feasibility of our proposed method.