1. Introduction
Digital forensics is an interdisciplinary domain of research encompassing contributions from computer science, criminology, and law. It involves the collection, analysis, interpretation, and presentation of evidence sourced from various digital devices. This evidence includes data from smartphones, desktop and laptop computers, gaming consoles, and a wide range of Internet of Things (IoT) devices, such as smart home systems, wearable technology, and network-connected sensors [
1]. Much of the data analyzed in digital forensics investigations is user-generated [
2]. According to a report by EdgeDelta in 2023, the average consumer of digital devices generates approximately 16 terabytes of data per day [
3]. While many digital forensics practitioners utilize automated tools to assist in analyzing this vast amount of data, the process of parsing digital device contents for relevant evidentiary data often remains largely manual [
4]. For example, a forensic investigator may employ tools like Cellebrite to search for specific evidence on a suspect’s smartphone; however, they must often rely on search features within the tool to identify pertinent information [
5].
Natural Language Processing (NLP) is a subdomain of computer science focused on spoken and written human languages [
6]. It aims to enhance the efficiency of human–computer interaction [
7]. NLP enables computer systems to process human input and provide meaningful output [
8]. Numerous algorithms exist to process textual input and produce applications in various domains. Notable NLP algorithms include Latent Dirichlet Allocation, word embedding models, pairwise correlation, n-gram analysis, named entity recognition, and sentiment analysis [
9].
One of AI’s most prominent features is its ability to identify patterns in large data streams that are otherwise not easily discernible through manual analysis [
10]. Healthcare information systems use pattern recognition to identify potential health risks in patients, while cybersecurity organizations utilize similar algorithms to detect malware and other threats to critical infrastructure [
11]. Studies have attempted to predict human behavior using statistics and NLP [
12]. For example, one study employed log files, user logins, email communications, and other data to predict insider threats within an organization [
13]. Developing a machine learning model that can accurately predict human behavior or decision-making is challenging due to the difficulty of quantifying psychological variables, which often suffer from bias and insufficient information [
14].
Research has shown that ongoing antisocial behavior can be identified in social media using machine learning classifiers and semantic analysis [
15]. Additionally, sentiment data can be used as a variable in machine learning models to predict criminal behavior [
16,
17]. Considering these data points, we propose a technique that employs pre-trained deep learning emotion classifiers along with other NLP tools to identify persons of interest within a large corpus of text [
18]. The corpus could be an aggregated dataset of emails and text messages or hours of transcribed audio footage. In our research, the corpus consists of a dialog between nine fictitious characters over a thirty-day period. The contribution of this research is a practical criminological application of NLP techniques to the domain of digital forensics that has not been explored in previous studies. Furthermore, this research will demonstrate how our emotions and use of language can betray us under scientific scrutiny. This hypothesis will be demonstrated through the creation of a sociolinguistic chronology that (through inference) ties a speaker to an event as a Person of Interest (PoI). A PoI can either be eliminated as a suspect through further investigation or validated by linguistically tying them to other suspects and to key themes in a case.
The remainder of this paper is organized as follows:
Section 2 presents a review of relevant literature.
Section 3 describes the generation of our LLM-based text dataset.
Section 4 outlines the methodology, followed by findings in
Section 5.
Section 6 provides a discussion of the results, and
Section 7 concludes the paper.
2. Literature Review
According to the literature, Digital Forensics (originally known as Computer Forensic Science) was introduced in 1984 [
19]. In its infancy, the discipline was more concerned with expertise in examining evidence that came from desktop computers and large servers [
20]. Forty years later (at the time of this research), the discipline has expanded dramatically in scope. It now includes smartphones, gaming consoles, cameras, IoT devices, the cloud, and wireless technologies [
21]. AI has now been added as a fundamental component of forensic research [
22]. The use of digital forensics with text mining and other NLP tools first started to appear in the academic literature around 2012 [
23]. Some of the earliest publications of digital forensics combined with sentiment analysis date to approximately 2014 [
24]. We reviewed research across criminology, political science, computational linguistics, and machine learning, identifying several significant themes throughout these fields. The literature emphasizes the criminal decision-making process, exploring the variables that may influence an individual’s choice to engage in criminal behavior. Additionally, emotions associated with crime, such as anger, fear, and guilt, play a central role in understanding offenders’ psychological states. Another theme addresses the distinction between sentiment and intent, providing insights into underlying motivations that may not align with expressed emotions. The correlation of individuals with specific themes and events is also crucial, helping to construct timelines that can be used to link people to actions or evidence. Finally, various NLP tools facilitate these analyses, offering methods to parse complex datasets and uncover hidden patterns within forensic investigations.
2.1. The Criminal Decision-Making Process
Studies of criminals have shown that the decision-making process used by offenders is crucial to understanding the genesis of a crime. Criminology researchers often ask which variables enter into the model that determines the intent to commit a criminal act. Some of these variables include economic status, education, self-control, and personal morality. Studies often decompose the criminal decision-making model into three classes: “before”, “during”, and “after”. In our research, we are more concerned with the “after” part of the model [
25].
2.2. The Emotions of Crime
According to the literature, the five most prevalent emotions associated with crime are anger, fear, shame, guilt, and anxiety. There could be one or more of these emotions present simultaneously depending on what stage the offender was in with regard to the commission of a criminal act [
26]. In the average model of criminal behavior (especially in violent crimes), anger is often present before and during a crime [
27]. According to some studies, fear is usually the most frequent emotion felt following the commission of a crime. This may also be accompanied by anxiety. One study referred to anxiety as fear + anticipation [
28]. According to studies of prison inmates, shame often accompanied the completion of a crime. It is unclear as to whether shame was more correlated with getting caught or if it was attributed to a sense of morality [
29].
2.3. Sentiment and Intent as Separate Constructs
Some studies, particularly in Political Science, have asserted that there is a distinction between a person’s sentiment and a policy position they may hold. The authors of these studies determined that the reasons for such a distinction are complex and multidimensional [
30]. We can extrapolate this distinction beyond Political Science to other domains, and the distance between sentiment and intent can be relabeled as a difference between emotion and intent. If we apply this notion to criminal justice, the following example serves as an analog. Suppose we see a man in his forties in a coffee shop who looks rather nervous and agitated. He frequently checks who is around him and then returns to checking his smartphone. To an outside observer, the appearance of nervous and agitated behavior might suggest that the person is involved in something nefarious. Without additional information, we do not have the context to attribute criminal behavior to the man. If the same man is arrested and his text messages reveal that he is involved in a violent crime, there is context for the nervous and agitated behavior [
31].
2.4. Correlation of People to Themes and Events (Timelines)
Timelines are a key piece of evidence in any forensic investigation [
32]. In digital forensics, artifacts are extracted from digital devices and then placed in a linear chart based on the timestamps assigned to each artifact by the file system. The result is a chronological timeline of computer files that suggests the behavior patterns of a specific person and associates them with certain events or topics. This chronological ordering of artifacts provides investigators with evidence to either corroborate or eliminate suspects [
33]. There are other forensic studies that used multiclass classification to identify content related to criminal behavior in emails [
34]. Items extracted from text for analysis can be sentences, unigrams, bigrams, and trigrams [
35]. Some forms of this classification do not use timelines. The focus is on the correlation between topics or named entities [
36]. In one study, social media accounts were correlated with their owners through classification. Features were extracted from a corpus using topic modeling, and the extracted features were then used in a classification task with a Support Vector Machine (SVM) [
37].
2.5. NLP Tools
The task of correlating a suspect with topics and keywords can be conducted using a wide variety of tools used in NLP. We found through a review of the literature and empirical research that Latent Dirichlet Allocation (LDA), pairwise correlation, and word embedding plots yielded the best results. LDA was introduced in 2003, so it is one of the older techniques in the canon of tools used in computational linguistics. LDA works on the assumption that every dataset of text consists of a finite number of latent topics that can be visually represented [
38]. The LDAvis graphing tool is a good way of visualizing topics and their associated words [
39]. Pairwise correlation is a statistical function that takes pairs of words and calculates the strength of their relationship to other pairs of words in the dataset. The relationship between word pairs is visualized in a network graph of nodes and edges. Clusters of nodes and edges represent topics. These clusters visualize relationships between named entities and keywords [
40]. Word embeddings were first introduced in 2003. Since that time, they have been very popular in NLP. With word embeddings, a word is represented as a vector of numbers that is recognizable to a computer. Different word vectors can be compared using distance metrics such as cosine distance. When word vectors are displayed on a two-dimensional graph, words that appear closer together are closer in their context [
41].
We synthesized key data points from the literature that we surveyed. Based on the five previously discussed themes we extracted during the review, we made the following inferences which served as guiding parameters for our methodology. First, it is not possible to assign guilt to a person based on circumstantial evidence. There are not enough given variables in the data to unequivocally assign attribution to one or two people. Instead, we can identify Persons of Interest (PoI) based on emotion and correlation with keywords. If a suspect has noticeable and consistent spikes in fear, anger, anxiety, shame, or guilt, they will warrant more scrutiny [
27]. Also, if a suspect is highly correlated with topics and keywords associated with a crime, the associations will require closer scrutiny.
Second, sentiment by itself does not possess the full context necessary to predict behavior patterns. However, sentiment can reasonably serve as an initial indicator suggesting possible miscreant behavior. Sentiment, in this case, refers to a discrete range of emotions within the negative sentiment spectrum. Specifically, these emotions are anger, anxiety, guilt, shame, and fear. We must acknowledge that these emotions will not be universally manifested in suspects. There will be false positives and false negatives. One example of this is a person who possesses a sociopathic personality. Such a person may not feel any guilt, shame, or fear following the commission of a criminal act. We make the assumption that the negative spectrum of emotions will manifest in people with an average personality or temperament [
42].
Third, our objective is to identify correlations and associations between people and topics [
43]. An association between a person and a keyword or topic does not identify guilt per se. When we are dealing with a discrete pool of candidates for an event such as a homicide investigation, a strong correlation between a person and a relevant topic suggests one of three possibilities. First, the person has engaged in frequent conversations with others in the same candidate pool about the subject matter. Second, the person has second-hand involvement in the homicide. This involvement could be as a witness or as someone who has some peripheral involvement in the crime. Third, the person is directly involved with or is a co-conspirator in the criminal activity. The relevant keywords identified in the person-to-topic correlations for a homicide could be, for example, “alibi”, “confess”, “hiding”, and “body”. The relevant keywords will depend on the background of the investigation.
3. LLM Sentiment Data Creation
The primary goal for developing the dataset was to create a narrative-driven corpus that could be utilized to train and test NLP models for automatically identifying culprits in murder mystery scenarios. This was intended to simulate the complexities found in real-world deceptive communication and relationship dynamics, thereby providing a robust testing ground for predictive analytics in forensic linguistics.
The dataset was generated using OpenAI’s GPT-4, a large language model known for its ability to produce human-like text. GPT-4 was instrumental in formulating dialog entries that are coherent, contextually rich, and intricate in terms of plot development. The dialogs include direct and indirect interactions among characters, embedded with emotional undercurrents and strategic misinformation to mimic deceptive behaviors typical in criminal scenarios. The process involved iterative dialog generation, where GPT-4 was prompted to produce conversations between fictional characters, each with distinct personalities and hidden agendas.
3.1. The Generation Process
Character Development: Each character was designed with specific traits and backgrounds to ensure varied and plausible interactions. This foundational step was crucial for generating realistic dialogs.
Scenario Setting: GPT-4 was provided with a basic plot outline involving the murder of a character named Jordan. The model was tasked with creating dialogs that gradually unveiled the mystery through interactions among the suspects.
Dialog Expansion: GPT-4 generated dialogs based on the evolving storyline. Each entry was carefully crafted to include potential clues, red herrings, and character-specific sentiments, ensuring depth and engagement in the narrative.
3.2. Structure
Each entry in the dataset was structured to include:
Speaker: The character delivering the dialog.
Addressee: The intended recipient of the dialog.
Text: The dialog content, crafted to contain clues, misdirections, and truthful assertions.
Date and Time: Timestamps that provide a temporal context to each dialog, aiding in the sequence analysis.
Sentiment: A sentiment label (positive, negative, neutral, anxious, or suspicious) assigned to reflect the speaker’s emotional state during each interaction.
This structured format supports complex analytical approaches like sequence analysis and context-based inference, which are crucial for understanding the narrative flow and identifying behavioral patterns.
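For illustration, the sketch below shows how records of this form might be stored as CSV rows in Python. The field names, the first line of dialog, and the timestamps are placeholders rather than actual dataset entries; the second line of dialog is a quote that appears later in Section 5.

```python
import csv

# Hypothetical example rows illustrating the record layout described above.
# The first row's text and both timestamps are placeholders, not entries
# from the actual dataset; the second row quotes a line cited in Section 5.
rows = [
    {
        "speaker": "Eva",
        "addressee": "Bob",
        "text": "We need to talk about last night. Not here.",
        "datetime": "2024-01-10 21:14",
        "sentiment": "anxious",
    },
    {
        "speaker": "Henry",
        "addressee": "Diana",
        "text": "Bob looked really shaken up after they found out about the murder.",
        "datetime": "2024-01-11 09:02",
        "sentiment": "suspicious",
    },
]

with open("dialog_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```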
3.3. Annotation
With assistance from GPT-4, sentiments were systematically assigned to each dialog entry to simulate emotional variability inherent in human communication. This layer of emotional annotation enhances the dataset’s utility for training NLP models to recognize emotional expressions that may indicate stress, deception, or sincerity—key aspects in forensic analysis.
The use of GPT-4 in the dataset construction process ensured that the dialogs were not only realistic and engaging but also varied and complex enough to challenge advanced NLP techniques. These include sentiment analysis, relationship extraction, and deception detection, all of which are vital for automating culprit identification in narrative texts.
4. Methodology
Our task in this research was to identify a subset of suspects from a discrete pool of suspect candidates using language as our primary tool. Language is a manifestation of social interactive behavior between people. As such, it contains cognitive and psychological aspects in its taxonomy. The physical manifestation of language can be decomposed into smaller subunits. These smaller units include sentences, phrases, words, sounds, and syllables. The content of information expressed through words, sentences, and phrases offers us a window into human psychology. In other words, to understand a person’s speech is to understand their psyche. To that end, our approach to identifying the most significant persons of interest included emotion classification, topic modeling, pairwise correlation word network, and word embeddings. In the subsequent paragraphs of this section, we will discuss how each of these NLP techniques offered a unique view into the psyche of a suspect.
4.1. Emotion Classification
Emotion classification is a multiclass form of sentiment analysis. Since we are interested in emotions after a crime, we are primarily looking for elevated instances of fear, anxiety, guilt, and shame. We used a community open-source repository called Hugging Face, which hosts a collection of user-contributed machine learning models for different tasks. We used a class of Hugging Face pre-trained deep learning models that were developed specifically for text-based emotion classification. These user-contributed pre-trained models are built on the transformer architecture. Pre-trained transformers were especially beneficial to our research since our dataset consists of only 981 rows of user text. Training a custom deep learning model requires large amounts of data to produce a desirable outcome, and pre-trained transformers allowed us to make use of a fully trained deep learning model without having to train one ourselves. In addition, these transformers were specifically designed to classify emotion.
Emotion classifiers make use of word embeddings and cosine distance to determine contextual differences between words. We were interested in identifying instances of fear, anxiety, guilt, and shame in our text dataset. All of these words exist in the negative spectrum of sentiment. Since we used four different transformers, the cosine distances between the word vectors in the dataset may have a small degree of variance. What this means is that a sample of text may be classified as “fear” according to one transformer, whereas in another, it may be classified as “anxiety”. In this case, we are not interested in uniquely categorizing each emotion for individual analysis. We are more interested in consistent elevated instances of these emotions and where they fall in the time period of the dataset.
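As a rough sketch of this step (not the exact scripts used in the study), the following Python code applies a Hugging Face text-classification pipeline to each utterance and tallies negative-spectrum emotions per speaker per day. The file name, column names, and model identifier are assumptions for illustration; the four classifiers actually used are listed in Table 1, and their label sets differ.

```python
import pandas as pd
from transformers import pipeline

# Load the dialog dataset (file name and column names are assumptions;
# see the record structure in Section 3.2).
df = pd.read_csv("dialog_dataset.csv")

# Load a pre-trained emotion classifier from the Hugging Face hub. The model
# identifier below is illustrative only; the classifiers used in this study
# are listed in Table 1.
classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)

# Classify each utterance with the top-scoring emotion label.
def top_emotion(text):
    result = classifier(text)[0]  # [{'label': ..., 'score': ...}]
    return pd.Series([result["label"], result["score"]])

df[["emotion", "emotion_score"]] = df["text"].apply(top_emotion)

# Tally negative-spectrum emotions per speaker per day for the time-series
# plots described later; the label names depend on the chosen model.
negative = {"fear", "anxiety", "guilt", "shame", "sadness", "anger"}
df["date"] = pd.to_datetime(df["datetime"]).dt.date
daily = (
    df[df["emotion"].isin(negative)]
    .groupby(["speaker", "date", "emotion"])
    .size()
    .reset_index(name="n")
)
print(daily.head())
```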
4.2. Topic Modeling
Topic modeling is an unsupervised technique for discovering all of the latent topics that exist within a corpus of text. In the domain of NLP, a corpus refers to a comprehensive text dataset. The dataset may consist of a single large body of text, such as Leo Tolstoy’s War and Peace, or it can be an aggregation of many smaller documents such as social media posts or emails. Latent Dirichlet Allocation (LDA) is an example of an unsupervised topic modeling technique. LDA works on the assumption that there is a finite number of topics (k) in every corpus. The appropriate number of topics for the variable k can be determined using a statistical algorithm such as the CaoJuan2009 method, which is adaptive and selects the target size for k based on density. Other statistical methods for determining the size of k are the Deveaud2014, Arun2010, and Griffiths2004. All of the words that are relevant to a particular topic will appear under that topic name.
For this research, we used LDA as it is still a popular technique used in NLP problems for topic analysis. We used an ensemble of four statistical algorithms to determine the appropriate number of topics for our dataset. The algorithms we used were Griffiths2004, CaoJuan2009, Arun2010, and Deveaud2014. The composite graph in
Figure 1 shows that all four algorithms are in agreement. Nine is the appropriate number of topics for k in our dataset.
Our goal for this research is to use LDA to decompose the existing dialog between the nine suspects into a constituent set of aggregated topics. By breaking the dialog down into smaller partitioned chunks, we can more easily search for existing relationships and patterns. An example of a relationship we may look for is the appearance of characters together within a topic. Keywords that appear with character names provide a correlation between the characters and specific activities. For example, if the character Henry appears by himself within a topic along with the words “murder”, “coverup”, “missing”, and “clues”, we can reasonably conclude that Henry has an association with a questionable activity that requires further investigation. If characters appear together within a specific topic (along with associated words), it is reasonable to conclude that a relationship exists of some kind between the characters. The relationship suggested by a co-occurrence within a topic may be superficial and conversational or it may suggest stronger ties. The aggregation of words within a topic also suggests a possible pattern. Let us take, for example, the following word series: “body”, “alibi”, “missing”, “nervous”, and “argument”. If the aforementioned terms co-occur together within the same topic with significant frequency, it suggests a pattern of behavior subordinate to the bigger picture of a case that is being investigated.
For our research, we used an application called LDAvis to visualize the topics in our dataset. LDAvis uses web-based interactive features to visualize topics, with each topic represented by a circle. When a specific topic is clicked within the graph, its associated words are displayed on the right side of the screen along with horizontal bar graphs depicting the importance of each word within the topic. When words occur in more than one topic, this is represented as overlapping circles in the graph. This representation of word occurrence and topic overlap is a useful way of visualizing topic relevance and redundancy.
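Our topic models themselves were built in R (topicmodels) and visualized with LDAvis, as described above. As a rough Python analogue of that workflow, the sketch below fits a nine-topic LDA model with gensim and prints the top terms per topic; the file name, column names, and preprocessing choices are assumptions for illustration.

```python
import re
import pandas as pd
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import STOPWORDS

# Python analogue of the R workflow described above (topicmodels + LDAvis);
# file/column names and preprocessing choices are illustrative assumptions.
df = pd.read_csv("dialog_dataset.csv")

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stopwords and very short words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

docs = [tokenize(t) for t in df["text"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# k = 9 topics, matching the value selected by the ensemble of metrics
# (Griffiths2004, CaoJuan2009, Arun2010, Deveaud2014) shown in Figure 1.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=9,
               passes=10, random_state=42)

for topic_id, terms in lda.show_topics(num_topics=9, num_words=10,
                                        formatted=False):
    print(topic_id, [w for w, _ in terms])

# An interactive LDAvis-style view can be produced with pyLDAvis, e.g.:
#   import pyLDAvis, pyLDAvis.gensim_models as gensimvis
#   vis = gensimvis.prepare(lda, corpus, dictionary)
#   pyLDAvis.save_html(vis, "lda_topics.html")
```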
4.3. Pairwise Correlation
Pairwise correlation is a statistical function that takes pairs of variables and compares them to other pairs of variables. In this case, we take pairs of words and compare them to other word pairs. The resulting statistics are represented in a graph that displays the strength of relationships between word pairs. We use a network graph to display the word relationships in our dataset. To construct this graph, we use a library in the R programming language called widyr. The R script takes a correlation threshold and uses it to determine which word pairs meet the criteria. If, for example, we set the threshold to 40%, then only word pairs whose correlation is 40% or greater will be represented in the final graph. As a general rule, the higher the threshold, the fewer word pairs will be represented. The desired outcome is for less useful information to be trimmed from the network graph so that it is less cluttered and shows more valuable information. To create the best possible visualization, the threshold must be tuned experimentally.
The word network is visualized using a library called ggraph. Relationships between word pairs are represented using nodes and edges. If a relationship between a word pair is tenuous, then the edge on the graph is thin and lighter in color. If the relationship is stronger, then the edge is thicker and darker. When numerous word pairs are joined together in a cluster, it is seen as a topic. There will be variations in the strength of relationships within the cluster. For our research, we look at clusters and identify topics that contain information relevant to an investigation. In addition, if a character’s name is part of a cluster, it may suggest some kind of significance. For example, if there are multiple character names in a cluster along with other words, it might suggest a series of discussions and exchanges. If a character appears by themself within a cluster along with words of interest, it may suggest patterns of behavior that require questioning.
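The study used widyr and ggraph in R for this step. As a rough Python analogue, the sketch below computes pairwise (phi) correlations from a binary document-term matrix and keeps only word pairs above a threshold, grouping the surviving pairs into clusters; the file name, column names, minimum word frequency, and the 0.4 threshold are illustrative assumptions.

```python
import re
import pandas as pd
import networkx as nx

# Python analogue of the widyr/ggraph workflow described above; the study
# itself used R. File/column names and thresholds are illustrative.
df = pd.read_csv("dialog_dataset.csv")

def tokenize(text):
    return set(re.findall(r"[a-z']+", text.lower()))

# Binary document-term matrix: does word w occur in utterance d?
docs = [tokenize(t) for t in df["text"]]
vocab = sorted({w for d in docs for w in d if len(w) > 2})
dtm = pd.DataFrame([[int(w in d) for w in vocab] for d in docs],
                   columns=vocab)

# Keep words frequent enough to correlate meaningfully.
dtm = dtm.loc[:, dtm.sum() >= 5]

# Pairwise (phi) correlation between word columns, analogous to
# widyr's pairwise_cor on binary occurrence indicators.
corr = dtm.corr()

# Build a network graph keeping only pairs above the chosen threshold.
threshold = 0.4
graph = nx.Graph()
for i, w1 in enumerate(corr.columns):
    for w2 in corr.columns[i + 1:]:
        if corr.loc[w1, w2] >= threshold:
            graph.add_edge(w1, w2, weight=float(corr.loc[w1, w2]))

# Connected components approximate the topic clusters seen in Figure 4.
for cluster in nx.connected_components(graph):
    print(sorted(cluster))
```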
4.4. Word Embeddings
A computer does not understand text as it is written or spoken. As such, to perform any NLP-related tasks, text must be converted into a format that is understood by computers. Word embedding (also referred to as word vectors) is the task of converting words into a set of numbers that will be understood by a machine learning algorithm. The embedding process retains some of the context held by words so that the processing of the text might be closer to what humans might use in their interpretation of spoken language. After text has been vectorized, it can be represented visually in two or three dimensions in a graph. A distance algorithm such as cosine distance can be used to determine the similarity between words. Words that are similar in context will appear closer together in the graph. The greater the dissimilarity between words, the farther apart they will appear in the graph. This function is true not just for routine nouns, verbs, and adjectives but also for named entities. So, for example, if the name Joseph appears within close proximity to other names such as Betty, Frank, and Susan, then Joseph has a stronger contextual association with the other three people. Also, if the name Joseph appears within close proximity to words like “murder”, “alibi”, and “secret”, then he has strong contextual associations with these words within the dataset.
For our research, we converted our dataset into word embeddings after preprocessing using a Python library called Word2Vec. We displayed the word embeddings in a two-dimensional graph using a library called plotly. The plotly graph is interactive so we could zoom into specific regions of the graph to search for words that interested us. We used this technique so we could identify the locations of named entities from the dataset. Specifically, we wanted to identify the names of the nine characters from the dataset and see which words were immediately surrounding them. If a character’s name appeared next to a keyword of interest such as “alibi” or “evidence”, then we took a closer look at the distance between the named entity and the keyword. These distances suggested the strength of association between a character and a topic of interest.
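A minimal Python sketch of this step is shown below, assuming the same dataset file and column names used earlier; the embedding dimensionality, minimum word count, and the use of PCA for the two-dimensional projection are also assumptions, since the projection method is not fixed here.

```python
import re
import pandas as pd
import plotly.express as px
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Sketch of the word-embedding step; file/column names, vector size, and the
# PCA projection are assumptions for illustration.
df = pd.read_csv("dialog_dataset.csv")
sentences = [re.findall(r"[a-z']+", t.lower()) for t in df["text"]]

# Train Word2Vec on the dialog corpus.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=3, seed=42)

# Project the word vectors to two dimensions for plotting.
words = model.wv.index_to_key
coords = PCA(n_components=2).fit_transform(model.wv[words])

plot_df = pd.DataFrame({"word": words, "x": coords[:, 0], "y": coords[:, 1]})
fig = px.scatter(plot_df, x="x", y="y", text="word",
                 title="Word embeddings (2-D projection)")
fig.show()

# Cosine similarity between a character name and a keyword of interest,
# e.g. how close "bob" is to "hidden" in embedding space.
if "bob" in model.wv and "hidden" in model.wv:
    print(model.wv.similarity("bob", "hidden"))
```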
4.5. Approach
Our approach in this research is to mine thirty days of dialog among nine people for identifiable patterns that suggest a determinable association with an event. In this case, the event of interest is a murder. The graph displayed in
Figure 2 demonstrates the process flow for our methodology.
In our approach to identifying persons of interest, we use four different NLP techniques to look under the hood of human language, figuratively speaking. The emotions which are manifest in the language of all of the suspects, along with the syntax, vocabulary, and topical conversations associated with each person, are decomposed into individual chunks, each of which has significance. All of these chunks are interrelated and suggest patterns of behavior. If we follow a process flow for our methodology, each suspect is associated with a discrete set of text statements, which is their contribution to the dataset. The language that constitutes the dialog is analyzed for emotion, topical content, context, and relational connectivity (i.e., who is talking to whom). This analysis lays bare sociolinguistic patterns that are inherent in the language. These patterns take the form of correlations between characters and keywords as well as relationships between characters. The correlations and associations that we mine from the language suggest a degree of relationship to a crime. A “degree of relationship to a crime”, in this case, can be defined as direct involvement in a crime, witnessing some aspect of a crime, or extended discussion with others who have some peripheral association with the crime. Those persons whom we identify as having a stronger degree of relationship to a crime are Persons of Interest (PoI). We realize that some aspects of this approach rely on the subjective interpretation of data. When research deals with elements of social science, the reliance on subjective analysis, to some degree, is inevitable and necessary. For our research, the subjective analysis we provide is based on statistical data that have been corroborated with supplementary data. For example, the results of LDA can be supported by the results of pairwise correlations and cosine distance between word embeddings.
5. Findings
In our research, we used four NLP techniques with a dataset of 981 rows of LLM-generated text, which simulated a series of dialogs among nine people over thirty days. The primary goal was to identify a list of Persons of Interest (PoI) who had the strongest degrees of relationship to a crime. The “relationship” could be established through direct involvement, being a witness, or through frequent conversation with others who had knowledge of the crime. We withheld ground truth from our research activities until we completed all four NLP techniques. All we knew were the basic parameters, which had been given to the LLM when it generated the dataset. These parameters were to create a fictitious scenario involving ten people who do not exist, with random names. Two of the ten characters conspire to kill a third. The dataset consists of dialog between the nine remaining fictitious characters over thirty days. The LLM separately provided the names of the two conspirators. Using the four NLP techniques and the evidence we observed, we identified two characters as persons of interest: Eva and Bob.
Throughout the subsequent paragraphs of this section, we will lay out the evidence that allowed us to come to this conclusion. This evidence includes a number of graphs, statistical data, and logical inference. The logical inference can be viewed as a subjective postulation; however, we assert that our data points and observations are validated through three different forms of linguistic correlation. We will begin with the results of the emotion classification part of our research. This will be followed by the results of our use of LDA for topic modeling. We will conclude our results with a discussion vis-à-vis pairwise correlation and word embeddings. We will reveal the ground truth results in the closing of this section to properly put all of our data points in perspective.
5.1. Emotion Classification
The emotion classification portion of our NLP research provided some of the most noticeable results. For this task, we made use of a community-based online platform called Hugging Face, which hosts a large number of pre-trained algorithms that are open-source. The four emotion classifier algorithms we selected from the Hugging Face repository are listed below in
Table 1.
5.2. Emotions and Word Embeddings
Some NLP lexicons that perform sentiment analysis are designed to focus on specific emotions such as “anger”, “disgust”, “joy”, “anticipation”, and “fear”. Transformers are pre-trained deep-learning-based algorithms that use word embeddings and cosine distance to compare words from a dataset to specific emotion labels. In our research, we are interested in words that occur within the negative spectrum of emotions. These words are fear, anxiety, guilt, and shame. According to criminology research literature, these are some of the most prevalent emotions that occur following a criminal act. We tasked each of the four transformers we selected to search the dataset for words that are contextually close to fear, anxiety, guilt, and shame. In each case, cosine distance was used to compare the words and labels. Since there are differences in each of the four models, there were variances in how close certain words were to specific emotions. For example, in transformer model 1, a word may be classified as “guilt”, whereas in model 2 it may be classified as “fear”. For this research, we were not focused on the occurrence of one specific emotion. We were concerned with the occurrence of words that suggested a temperament that generally manifests within people who have engaged in subversive or antisocial behavior. Specifically, we were looking for consistently elevated instances of fear, anxiety, guilt, and shame over the thirty-day period.
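To make the cosine-similarity framing concrete, the minimal sketch below embeds a placeholder utterance and the four emotion labels with a sentence-embedding model and compares them by cosine similarity. The model name and the utterance are assumptions for illustration, and the actual classifiers in Table 1 are fine-tuned transformer classifiers rather than simple label-similarity scorers.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal sketch of the idea described above: embed an utterance and a set of
# emotion labels, then compare them with cosine similarity. The model name is
# an assumption; it is not one of the four classifiers listed in Table 1.
model = SentenceTransformer("all-MiniLM-L6-v2")

labels = ["fear", "anxiety", "guilt", "shame"]
utterance = "I keep thinking someone saw us that night. I can't sleep."  # placeholder

label_vecs = model.encode(labels)
utt_vec = model.encode(utterance)

# Cosine similarity between the utterance and each emotion label.
scores = util.cos_sim(utt_vec, label_vecs)[0]
for label, score in zip(labels, scores):
    print(f"{label}: {float(score):.3f}")
```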
5.3. Emotion Classification Results and Discussion
We used Python to leverage the power of the four Hugging Face transformers. In order to visualize the results in the most meaningful way, we used the ggplot2 library in the R programming language to display the emotions for all nine of the characters over the thirty-day period. The results were displayed as a density plot with the date on the x-axis and the overall amount (n) on the y-axis. We found that only one person out of the pool of nine candidates showed a consistent pattern of negative emotions. That person was Eva. For classifiers one and two, there were identical peaks showing a significant level of guilt that occurred around January 10. In classifier three, Eva’s graph (shown below in
Figure 3) showed a significant increase in fear that occurred around the same time period as the first two classifiers. We attribute the difference in emotion (guilt vs. fear) to variances in cosine distance between the three classifiers. The first two classifiers were similar enough to produce identical instances of guilt on the same date, whereas the third classifier interpreted the same data as fear. According to studies in the literature, fear and guilt are associated with each other. With this in mind, we interpret guilt, fear, and anxiety as comparable reactions to adverse stimuli. In terms of NLP, the difference between these terms lies in the cosine distance between their word vectors within the model. Shame, it can be argued, is more associated with people who have been caught engaging in antisocial behavior.
In the fourth emotion classifier, Eva experienced a significant increase in “anxiety”. However, unlike the first three classifiers, the spike in anxiety occurred several days after the increase in guilt/fear. We attribute this difference to the classifier itself. It appears to be more sensitive to terminology that correlates with anxiety. Eva experienced increases in anxiety on January 16 and on January 20. Eva’s increase in anxiety over this time period may be an outlier that is unsupported by other evidence. With that being said, three out of four different classifiers identified the same fear-based event over the same time period for Eva. Since none of the other eight people consistently demonstrated the same pattern, we interpret this data to suggest an underlying psychological state driven by an adverse reaction to a recent event. By itself, sentiment tells us little about the truth of a matter. To view the bigger picture, we must dig deeper into the language itself to find additional patterns.
5.4. Pairwise Correlation Word Graph
Pairwise correlation is the statistical comparison of word pairs to other word pairs. Word pairs that frequently occur with other word pairs have a correlation with each other. When several word pairs co-occur together in a related context, they form a topic. These relationships can be visualized as a graph using nodes and edges. In a graph, a topic is represented as a cluster of related word pairs. We used the R programming language and a library called widyr to perform the pairwise correlation. We used a graphing library called ggraph to visualize the results. We used a correlation value of 70% to parse the dataset into topic clusters. This process of topic decomposition is a task of trial and error. If the correlation value is too small, the graph is overly cluttered and tells us very little. If the correlation value is too high, the graph will be sparse and again tell us very little. The graph shown in
Figure 4 demonstrates the topic distribution in our dataset. In our graph, there were four significant topic clusters and sixteen smaller topics. We will refer back to this graph later in this section as we gain additional insight through the results.
In the emotion classification part of this section, we have already identified Eva as a PoI due to the fact that she consistently demonstrated elevated levels of fear, guilt, and anxiety. In the graph in
Figure 4, Eva appears at the very center of one of the larger clusters. If we read her parent cluster, she is the sole person connected on one side to the keywords “message”, “signed”, and “decode”. On the other side of Eva’s parent cluster, she is connected to Irene and Frank. The correlation with Irene and Frank suggests a conversation that took place between the three individuals. The conversation was brief since the lines connecting them have a lower correlation, as evidenced by the thinner, lighter-colored gray line. In the next section, we will discuss the results of topic modeling with LDA and how some of the clusters in
Figure 4 are related.
5.5. Topic Modeling
We used the R programming language to create our topic models of the text dataset. The models themselves were created using a library called topicmodels. The topic models were visualized using a library called LDAvis. The graph shown in
Figure 5 displays the topic models using a series of plotted circles to represent the individual topics. We chose nine to be the value of k for the number of topics in our models.
We will not discuss the results of all nine of the topics as they were laid out by the LDA model. Instead, we will focus on the topic that was of most interest to us. Topic 8 stood out apart from the others as a data point of interest. Specifically, this was due to the fact that Eva co-occurred with another person’s name and no other person. We did not include instances where a person’s name appeared with an apostrophe, as this serves more as a descriptor (for example, Eva’s belongings). Instead, we focused on instances where a person’s name occurred by itself or with another person’s name. If multiple people occur within the same topic, we interpret this as a series of conversational exchanges. Topic 8 was the only instance in which two characters co-occurred as a unit within the topic. This observation suggests the possibility of a relationship of some kind. The red vertical bar graphs represent the level of relevance that a certain term has within the scope of a particular topic. The names Bob and Eva appear near the bottom of the topic list. This means that these two names did not occur with great frequency within the topic. The fact that Bob and Eva appear together in their own topic with a relatively weak frequency suggests possibly that the two took measures to not be seen or associated as being “together”. However, the evidence of association between the two people still exists. In the next section, we will take relevant clusters from the pairwise correlation graph and add them to the topic model graph as an additional layer. This additional layer will demonstrate the extended correlation of characters with forensic topics in this dataset.
5.6. Topic Model with Pairwise Correlation Clusters
Figure 6 displays the topic model graph from our research with the added layer of clusters from our pairwise correlations graph. This combination provides us with an extended view of relevant themes and their statistical correlations with other characters and other topics.
In this supplementary graph, we visualized the associations between LDA topic-relevant words to pairwise clusters that contain the same words. By doing this, we provide additional context to our topic analysis. In our observation, the two most notable clusters in this graph correlate with Bob. The larger of the two clusters related to Bob contains the keywords “hiding”, “map”, “confess”, and “guilty”. In the original dataset, there is a line of dialog between Alice and Bob where Alice asks, “You’ve been looking really guilty lately, anything you want to confess?” There is another cluster that is associated with Bob that contains the following words: “looked”, “shaken”, and “murder”. In the original dataset, there is a conversation between Henry and Diana where Henry says, “Bob looked really shaken up after they found out about the murder”.
There was one more correlation in this graph that was notable. To provide more context, we found out at the end of the research that the ground truth was that Bob and Eva were, in fact, the two conspirators who were guilty of murder. However, this is not what was interesting with regard to the correlation between Eva and her pairwise cluster. The pairwise correlation cluster that is associated with Eva in this topic includes two other people, i.e., Irene and Frank. Based on the current evidence (and knowing the ground truth), we inferred that the conversation between Eva, Irene, and Frank might have been an attempt by Eva to spread disinformation to remove suspicion from herself. We made this observation based on the assumption that since Eva knows she is guilty, anything she says on the topics related to the murder will be misleading or wholly false.
5.7. Word Embeddings
We used the Python programming language and the Word2Vec library to create word embeddings of our dataset. The 981 rows of dialog between the nine characters of the dataset underwent a stage of preprocessing wherein the text was first tokenized, meaning every word was treated as a single token or entity. Next, non-useful words (called stopwords) were removed, special characters (apostrophes, colons, semicolons, etc.) were removed, and finally, superfluous whitespace was removed. What remained was a set of words that provided the most significant content of the dataset. Through a process called term frequency–inverse document frequency (TF-IDF), each word was assigned a numerical value that represents its significance with regard to the dataset as a whole. The contextual significance of each word was also captured by Word2Vec and used as a parameter. Using all of the data, a process called vectorization (embedding) takes place. Each word in the dataset is converted to a set of numbers that attempts to capture its meaning and context. These word vectors can then be displayed in two or three dimensions for comparison. Words that are more similar will appear closer together, while words that share few or no similarities will be displayed farther apart. Named entities, such as place names or names of people, can appear next to other terms. If a person’s name appears near another person’s name or particular term, it indicates that the terms are close in context. For example, if the named entity “Fred” appears next to the words “IRS” and “Audit”, contextually, it means that these terms appeared together in their parent dataset frequently enough to be deemed contextually similar. So we can reasonably infer from this that a person named Fred was associated in some way with an IRS audit in the parent dataset. To this end, we used Word2Vec to create a word embedding graph of our dataset.
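A minimal sketch of the preprocessing and TF-IDF weighting described above is shown below, assuming the same file and column names as earlier; the exact tokenizer and stopword list used in the study may differ.

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Sketch of the preprocessing and TF-IDF weighting described above;
# file/column names and the stopword list are illustrative assumptions.
df = pd.read_csv("dialog_dataset.csv")

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())           # drop special characters
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)                                  # collapse whitespace

cleaned = df["text"].apply(preprocess)

# TF-IDF assigns each remaining term a weight reflecting its significance
# relative to the dataset as a whole.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(cleaned)

# Highest-weighted terms across the corpus (illustrative inspection step).
weights = tfidf.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
top = sorted(zip(terms, weights), key=lambda x: x[1], reverse=True)[:20]
print(top)
```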
To create the graph, we used an interactive library in Python called Plotly. When the graph is printed, there are hundreds of words plotted in two-dimensional space near each other. The ability to zoom in and out as well as pan (move in any direction) was essential in order to isolate instances of named entities that interested us. When we viewed the graph in its entirety, we could not identify any words that placed named entities near any forensic words (e.g., alibi, murder, belongings, shaken, confess). We were able to find each of the character names from the dataset as named entities in the graph. After panning and zooming numerous times through the interactive graph, we found only one instance where a character name appeared near other terms of interest. For this example, refer to
Figure 7. In this case, Bob appears close to the words “hidden”, “used”, and “shaken”. Although these words may appear somewhat distant from Bob, we must note that the graph in
Figure 7 is zoomed in to a considerable magnification. Bob is closest to the word “hidden”. The word “shaken” was also seen previously in
Figure 6 when it was associated with both Bob and Eva in the LDA topic model. Bob also appears close in proximity to Alice and Irene in the graph. Of the three people in this microregion of the graph, Bob is the closest person to these words. In addition, Irene and Alice are very close together and separate from Bob. For these reasons, we did not include Irene and Alice in the area highlighted in red. In the next section, we will show
Figure 7 with related clusters from the pairwise correlation graph.
5.8. Word Embedding Graph with Related Pairwise Clusters
In this composite graph (
Figure 8), three of the word embeddings relate to topic clusters from the pairwise correlation graph. First, Bob (a named entity) relates most strongly to a cluster that contains the words “confess” and “guilty”. This can be linked to a quote from Alice to Bob, which is: “You’ve been looking really guilty lately, anything you want to confess?” Second, the word “hidden” refers back to a pairwise cluster that contains the words “hidden”, “house”, “passages”, and “rumors”. These correlated terms link to a quote in the original dataset from Bob to Charlie: “There are rumors of hidden passages in this house. Seen anything?” Third, the word “shaken” refers to a pairwise cluster with the following terms: “looked”, “murder”, “shaken”, “Bob’s”. The first three words refer to a quote from Henry to Diana: “Bob looked really shaken up after they found out about the murder”. The possessive term “Bob’s” was not present in the aforementioned quote. However, the term “Bob’s” appeared in two statements that were spoken by multiple characters in the dataset. These two statements were the following: “I found a torn piece of a letter in Bob’s room. It might be a clue”. “I found a clue in Bob’s belongings. We need to talk”.
6. Discussion
In our research, we used four principal NLP techniques to look under the hood of spoken language to lay bare artifacts that suggest behavior patterns within a discrete group of people. The diagram in
Figure 9 illustrates our findings based on the available evidence. We started with a candidate pool of nine potential suspects. Going into this task, we were only aware of the original parameters given to an LLM when the dataset was created. These parameters were:
Create a fictitious scenario that lasts a period of thirty days.
In the fictitious scenario, there are ten people who do not exist. Their names are randomly chosen.
Two of the ten people conspire to kill a third.
Ground truth (the identities of the two conspirators) was kept from us until the initial battery of NLP techniques was run and prime persons of interest were chosen.
The dataset consists of a series of dialogs between the nine remaining characters over the thirty days.
The evidence we believe had the most predictive ability was emotion classification. We used four different transformers to search for text that could be classified as “guilt”, “shame”, “fear”, and “anxiety”. Of all the candidates, Eva was the only one who consistently demonstrated elevated levels of fear, anxiety, and guilt. We considered these to be good predictors because they all fall within the same spectrum of negative valence that is often manifested in average people who have committed a crime such as murder. However, emotion and sentiment by themselves do not provide enough context to support firm conclusions. For this reason, we decided to further analyze the language of the suspects using topic modeling, pairwise correlation, and word embeddings. Using this approach, any patterns that were identified could be validated using additional techniques.
Bob did not demonstrate a consistent pattern of elevated emotion the way that Eva did. Through topic modeling, we were able to associate him directly with Eva since the two named entities occurred within the same topic as an isolated pair. Furthermore, Bob had a significant correlation with key clusters in the pairwise correlation graph, and he was the only character whose name appeared in the word embedding graph in close proximity to forensic words of interest.
Through our NLP analysis, we made two observations:
Topic modeling or topic clustering is a good way to associate persons of interest who may be coordinating efforts in a criminal conspiracy. Two (or more) people may be operating in relative isolation; however, common variables that tie them together may be exposed through LDA, KNN, or K-means.
Emotion classification of the negative valence spectrum is a good initial predictor of persons of interest when done as a time series. When visualized as a density or line graph, such a method identifies average emotional baselines as well as peaks in specific emotions. This could be particularly useful if a peak occurs on a key date in a timeline.
7. Conclusions
In this research, we presented a novel technique that applies NLP to digital forensic investigations. The technique we propose seeks to lay bare the inherent patterns that exist in our language in order to identify persons of interest in an investigation. Guilt can not be assigned when it is based purely on circumstantial evidence. However, investigators can be led to people in a case who have more knowledge of relevant events than others. Persons of interest can either be eliminated as suspects or corroborated with additional evidence. In digital forensics, one of the most useful sources of evidence is text data generated by users. This can be text messages, emails, social media posts, and even Microsoft Office documents. This data can be mined for emotion, associations between people, and ties to specific themes. Emotion classification performed in conjunction with a timeline can be a useful predictor of persons of interest when certain emotions within the negative spectrum are identified. Specifically, if there are consistently elevated levels of guilt, fear, shame, and anxiety in a person’s timeline, there is a possibility that the person is manifesting evidence of exposure to criminal activity. Follow-up supplemental evidence might corroborate this. The use of topic modeling can be a good way to statistically associate a suspect with another suspect in an investigation. Using additional tools, such as pairwise correlation word graphs and word embedding graphs, named entities can be tied to forensic keywords. Using correlation variables and cosine distance, the strength of these associations can also be measured. Our proposed method is a proof-of-concept that uses a fictitious case that was created using a large language model. We arrived at some of our findings through inference that was supported by computational linguistic data. We wish to expand this research to further validate our findings. Our plan for future research is to increase the scope of the pool of suspects. We plan to increase the number of suspects to around twenty. We would also like to use transcribed audio data with timestamps as our dataset. We have high confidence that this technique will prove just as successful in a larger dataset as it was with our current dataset.
During the preparation of this manuscript, the authors used OpenAI’s ChatGPT to assist in drafting and refining portions of the text, particularly to improve clarity and ensure technical accuracy in the presentation of methods and findings. The tool provided initial language suggestions, and the authors subsequently reviewed and thoroughly edited all AI-generated content to align with the study’s goals and maintain scientific rigor. The authors take full responsibility for the accuracy and integrity of the content.