research-article

Identification of Chinese dark jargons in Telegram underground markets using context-oriented and linguistic features

Authors:

Haizhou Wang

Volume 59, Issue 5

https://doi.org/10.1016/j.ipm.2022.103033

Published: 01 September 2022

Abstract

When cybercriminals communicate with their customers in underground markets, they tend to use secure and customizable instant messaging (IM) software such as Telegram, a popular IM application with over 700 million monthly active users (MAU) as of June 2022. In recent years, dark jargons (i.e., innocent-looking replacements for sensitive terms) have appeared with increasing frequency on Telegram. Jargon identification is therefore an important research direction for tracking online underground markets and cybercrime. This paper proposes a novel Chinese Jargons Identification Framework (CJI-Framework) to identify dark jargons. First, we collect chat history from Telegram groups related to underground markets and construct the corpus TUMCC (Telegram Underground Market Chinese Corpus), which is the first Chinese corpus in the jargon identification research field. Second, we extract seven new features, which fall into three categories: Vectors-based Features (VF), Lexical analysis-based Features (LF), and Dictionary analysis-based Features (DF), to distinguish Chinese dark jargons from commonly used words. Based on these features, we then run statistical outlier detection to decide whether a word is a jargon. Furthermore, we employ a word vector projection method and a transfer learning method to improve the framework's performance. Experimental results show that CJI-Framework achieves an F1-score of 89.66%. After adaptation for English, it also outperforms the state-of-the-art English jargon identification method. Our corpus and code have been publicly released to facilitate the reproduction and extension of our work.
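
The abstract does not specify the seven features or the outlier test, so the following is only an illustrative sketch of the general idea, not the CJI-Framework itself. It assumes a single vector-style feature (the cosine similarity between a word's embedding in a market corpus and in a general corpus, with both embeddings already projected into one space) and swaps in a modified z-score based on the median absolute deviation as the outlier test. The function names, the toy vectors, and the 3.5 threshold are all hypothetical choices made for this example.

```python
import math
from statistics import median


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def robust_z_scores(values):
    """Modified z-scores using the median and median absolute deviation (MAD)."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0.0:
        # All values are (nearly) identical, so nothing stands out.
        return [0.0 for _ in values]
    return [0.6745 * (x - med) / mad for x in values]


def flag_low_outliers(feature_by_word, threshold=3.5):
    """Return words whose feature value is an unusually LOW outlier.

    Assumption: a word used very differently in the two corpora has a low
    cross-corpus similarity, so only the low tail is flagged.
    """
    words = list(feature_by_word)
    scores = robust_z_scores([feature_by_word[w] for w in words])
    return [w for w, z in zip(words, scores) if z < -threshold]


# Hypothetical toy embeddings; real ones would come from trained word vectors.
general_vectors = {
    "apple": [1.0, 0.0], "water": [0.0, 1.0], "car": [1.0, 1.0],
    "book": [1.0, 2.0], "tea": [2.0, 1.0], "kite": [1.0, 0.0],
}
market_vectors = {
    "apple": [0.9, 0.1], "water": [0.1, 0.9], "car": [1.0, 0.9],
    "book": [1.0, 1.9], "tea": [2.0, 1.1], "kite": [0.0, 1.0],
}

similarity = {
    word: cosine_similarity(general_vectors[word], market_vectors[word])
    for word in general_vectors
}
flagged = flag_low_outliers(similarity)
print(flagged)
```

In this toy data only "kite" changes direction between the two corpora, so it is the only candidate the sketch flags. A framework with several features would need to combine them before or after the outlier test, which this example does not attempt.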

Cited By

Ma J., Wang L., Fu Z., Shao H., Guo W. (2023). Capturing mental models. Advanced Engineering Informatics 57:C. 10.1016/j.aei.2023.102083. Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.1016/j.aei.2023.102083

Index Terms

Identification of Chinese dark jargons in Telegram underground markets using context-oriented and linguistic features
1. Applied computing
  1. Law, social and behavioral sciences
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

Information Processing and Management: an International Journal Volume 59, Issue 5

Sep 2022

730 pages

ISSN:0306-4573

Elsevier Ltd.

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 01 September 2022

Qualifiers

Research-article
