research-article

Identification of Chinese dark jargons in Telegram underground markets using context-oriented and linguistic features

Published: 01 September 2022

Abstract

When cybercriminals communicate with their customers in underground markets, they tend to use secure and customizable instant messaging (IM) software such as Telegram, a popular IM platform with over 700 million monthly active users (MAU) as of June 2022. In recent years, dark jargons (i.e., innocent-looking replacements for sensitive terms) have appeared with increasing frequency on Telegram. Jargon identification is therefore one of the most important research directions for tracking online underground markets and cybercrime. This paper proposes a novel Chinese Jargons Identification Framework (CJI-Framework) to identify dark jargons. First, we collect chat history from Telegram groups related to underground markets and construct the corpus TUMCC (Telegram Underground Market Chinese Corpus), the first Chinese corpus in the jargon identification research field. Second, we extract seven new features, grouped into three categories: Vectors-based Features (VF), Lexical analysis-based Features (LF), and Dictionary analysis-based Features (DF), to distinguish Chinese dark jargons from commonly used words. Based on these features, we then run statistical outlier detection to decide whether a word is a jargon. Furthermore, we employ a word vector projection method and a transfer learning method to improve the framework's performance. Experimental results show that CJI-Framework achieves an F1-score of 89.66%. After adaptation for English, it also outperforms the state-of-the-art English jargon identification method. Our corpus and code have been publicly released to facilitate the reproduction and extension of our work.
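The abstract does not say which outlier test the framework uses, so the following is only a minimal sketch of the general idea of flagging words whose feature values deviate strongly from the rest of the vocabulary. It assumes a median/MAD-based modified z-score, a cutoff of 3.5, and a toy table of two made-up feature values per word; the function names, the threshold, and the data are all hypothetical and are not taken from the paper or its released code.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or at least half of them) coincide; nothing stands out.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(features, threshold=3.5):
    """Return the words that are outliers on at least one feature.

    features: dict mapping each word to a list of numeric feature values
              (every list has the same length).
    threshold: assumed cutoff on the absolute modified z-score.
    """
    words = list(features)
    n_features = len(next(iter(features.values())))
    flagged = set()
    for j in range(n_features):
        column = [features[w][j] for w in words]
        for word, z in zip(words, robust_z_scores(column)):
            if abs(z) > threshold:
                flagged.add(word)
    return sorted(flagged)


# Hypothetical toy data: two invented feature values per word.
toy_features = {
    "apple": [0.91, 0.10],
    "table": [0.88, 0.12],
    "river": [0.93, 0.11],
    "window": [0.90, 0.09],
    "garden": [0.89, 0.13],
    "slang_x": [0.15, 0.12],  # deviates sharply on the first feature
}

candidates = flag_outliers(toy_features)
```

In this toy table only `slang_x` is flagged, because its first feature value lies far from the median of the other words. A real pipeline would substitute the seven features described above for the invented columns.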

Published In

Information Processing and Management: an International Journal, Volume 59, Issue 5
Sep 2022
730 pages

Publisher

Pergamon Press, Inc.
United States

Author Tags

1. Jargons identification
2. Information security
3. Feature engineering
4. Word embedding
5. Transfer learning
6. Vectors projection
