One major challenge facing the intelligence and security community is monitoring online media for terrorist group communications. This study addresses the online anonymity problem by applying authorship analysis to English and Arabic extremist group Web forum messages. The study evaluates the performance impact of different feature categories and techniques across both languages. To enhance writing style identification, researchers incorporated a comprehensive list of online authorship features. Additionally, they created an Arabic language model by adopting specific features and techniques, including an elongation filter and a root-clustering algorithm, to handle challenging linguistic characteristics. A series of experiments indicated a high level of efficacy in the models. Finally, the authors compare the English and Arabic language models and messages to aid the research community's understanding of the dynamics of these groups' authorship tendencies. This article is part of a special issue on Homeland Security.
References
[1] R. Zheng, et al., "A Framework of Authorship Identification for Online Messages: Writing Style Features and Classification Techniques," to be published in J. Am. Soc. Information Science and Technology (Jasist), 2005.
[2] J.F. Burrows, "Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style," Literary and Linguistic Computing, vol. 2, 1987, pp. 61–67.
[3] F. Peng, et al., "Automated Authorship Attribution with Character Level Language Models," presented at the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003); http://users.cs.dal.ca/~vlado/papers/2003-EACL03-139.pdf.
[4] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, "Computer-Based Authorship Attribution without Lexical Measures," Computers and the Humanities, vol. 35, no. 2, 2001, pp. 193–214.
[5] S.S. Al-Fedaghi and F. Al-Anzi, "A New Algorithm to Generate Arabic Root-Pattern Forms," Proc. 11th Nat'l Computer Conf., KFUPM, Saudi Arabia, 1989, pp. 391–400.
[6] L.S. Larkey and M.E. Connell, "Arabic Information Retrieval at UMass in TREC-10," Proc. 10th Text Retrieval Conf. (TREC 2001), Nat'l Inst. of Standards and Technology, 2001.
[7] I. Hmeidi, G. Kanaan, and M. Evens, "Design and Implementation of Automatic Indexing for Information Retrieval with Arabic Documents," J. Am. Soc. Information Science, vol. 48, no. 10, 1997, pp. 867–881.
[8] A.N. De Roeck and W. Al-Fares, "A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots," Proc. Assoc. for Computational Linguistics (ACL 00), 2000; www.informatik.uni-trier.de/~ley/db/conf/acl/acl2000.html.
The question of whether an author leaves an unconscious but statistically discernable "signature" on his or her writing was first visited by Wake at Oxford in 1911. Wake was an eminent classicist, but he was not a statistician, and his sentence length statistics did not prove useful. In the 1960s, a Church of England minister and New Testament scholar, A.Q. Morton, who was a statistician, developed a statistical authorship test for Greek, and used it successfully on the Pauline Epistles, the Gospel of Luke, and the Acts of the Apostles. He and others later used it on Homer's Iliad, also with notable success. The test was very simple, but useful for Greek text; he simply counted the number of times kai was used in each sentence. Kai is a coordinating conjunction in Greek 95 percent of the time (it is an adverb the other five percent), and performs the combined roles of all the coordinating conjunctions in English (and, or, but, and so on). Alvar Ellegard developed a much more sophisticated statistical method [1] for his doctoral dissertation at Uppsala, and used it to prove that Sir Philip Francis, a British civil servant, had written the scathing Junius Letters to the London Public Advertiser criticizing King George III and his war against the American colonies. Junius Brutus killed Julius Caesar, but George III would certainly have hanged this Junius, Philip Francis, for sedition if he knew he was the author of the letters.
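The per-sentence count described above can be illustrated with a short Python sketch. This is not Morton's actual procedure: the function name, the naive punctuation-based sentence splitter, and the transliterated sample text are all assumptions made for illustration only.

```python
import re
from collections import Counter


def word_counts_per_sentence(text, word):
    """Return how many times `word` occurs in each sentence of `text`.

    Assumption: sentences end at '.', '!' or '?'. A real study of Greek
    text would need a tokenizer suited to the source edition.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    target = word.lower()
    return [
        sum(1 for token in re.findall(r"\w+", s.lower()) if token == target)
        for s in sentences
    ]


# Hypothetical transliterated sample, used only to exercise the function.
sample = (
    "en arche en ho logos kai ho logos en pros ton theon kai theos en ho logos. "
    "houtos en en arche pros ton theon. "
    "panta di autou egeneto kai choris autou egeneto oude hen."
)

counts = word_counts_per_sentence(sample, "kai")
# Distribution of sentences by number of occurrences: the kind of
# profile that could then be compared between candidate authors.
profile = Counter(counts)
```

Comparing such profiles between texts of known and disputed authorship is the general idea; how the comparison was carried out statistically is not described here.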
This fascinating paper takes the unconscious authorship signature problem into new theoretical (but also very practical) realms. The paper presents new methods that go beyond Greek and English literary texts to the analysis of extremist multi-language polemics on Internet Web sites. This extension of the technology opens up new vistas. For example, Internet Web sites are a very new literary genre, and the Arabic language, with its 5,000 roots or stems, is very highly inflected. Arabic has 15 verbal conjugations, compared to Hebrew with only eight, and Indo-European languages with even fewer. The liaison issues in Arabic, which is only written cursively, and which has initial, medial, and final forms for many letters, along with infixes and consonant stacking, add to the morphological, grammatical, and syntactical interface of the language. The authors find that this craggy linguistic interface, while complex, does add some statistical hand and toe holds. Their methods show significant discriminating power in the application of authorship identification techniques to both English and Arabic messages. KKK polemics were used as a sort of English language control in the development of the methods.
This well-presented, well-written paper illustrates an important and very current application of computer-based statistical methods for authorship identification. It is so good, and so relevant to our times, that I am surprised it wasn't classified by the US National Security Agency (NSA).
Sarwar R, Perera M, Teh P, Nawaz R, Hassan M (2024) Crossing Linguistic Barriers: Authorship Attribution in Sinhala Texts. ACM Transactions on Asian and Low-Resource Language Information Processing 23:5 (1-14). DOI: 10.1145/3655620. Online publication date: 10-May-2024
Adebayo G, Yampolskiy R (2023) Automatic IQ Estimation from Written text using Stylometry Methods. Proceedings of the 2023 7th International Conference on Information System and Data Mining (56-65). DOI: 10.1145/3603765.3603769. Online publication date: 10-May-2023
Alqahtani F, Dohler M (2023) Survey of Authorship Identification Tasks on Arabic Texts. ACM Transactions on Asian and Low-Resource Language Information Processing 22:4 (1-24). DOI: 10.1145/3564156. Online publication date: 12-Apr-2023
Guo J (2022) Design and Implementation of a Machine Learning-Based English Intelligent Test System. Wireless Communications & Mobile Computing 2022. DOI: 10.1155/2022/5875380. Online publication date: 1-Jan-2022
Chadoulis R, Nikolaou A, Kotropoulos C (2022) Authorship Attribution in Greek Literature Using Word Adjacencies. Proceedings of the 12th Hellenic Conference on Artificial Intelligence (1-9). DOI: 10.1145/3549737.3549750. Online publication date: 7-Sep-2022
Casimiro G, Digiampietri L (2022) Authorship Attribution with Temporal Data in Reddit. Proceedings of the XVIII Brazilian Symposium on Information Systems (1-8). DOI: 10.1145/3535511.3535515. Online publication date: 16-May-2022
Zagalsky A, Te'eni D, Yahav I, Schwartz D, Silverman G, Cohen D, Mann Y, Lewinsky D (2021) The Design of Reciprocal Learning Between Human and Artificial Intelligence. Proceedings of the ACM on Human-Computer Interaction 5:CSCW2 (1-36). DOI: 10.1145/3479587. Online publication date: 18-Oct-2021
Ai Z, Yijia Z, Hao W, Mingyu L (2021) LDA-Transformer Model in Chinese Poetry Authorship Attribution. Information Retrieval (59-73). DOI: 10.1007/978-3-030-88189-4_5. Online publication date: 29-Oct-2021