Abstract
Problem
Recognizing languages written in cuneiform is difficult because little data is available and tokenization is challenging. The Cuneiform Language Identification (CLI) dataset covers seven cuneiform languages and dialects: Sumerian and six dialects of Akkadian (Old Babylonian, Middle Babylonian Peripheral, Standard Babylonian, Neo-Babylonian, Late Babylonian, and Neo-Assyrian). However, this dataset suffers from class imbalance.
Aim
Therefore, this article aims to build a system capable of distinguishing between several cuneiform languages and solving the problem of unbalanced categories in the CLI dataset.
Methods
An oversampling technique was used to balance the dataset. The performance of machine learning algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), and Random Forest (RF), and of a deep neural network (DNN) was then investigated using unigram feature extraction.
Results
The proposed method using machine learning algorithms (SVM, KNN, DT, and RF) on the balanced dataset obtained accuracies of 88.15, 88.14, 94.13, and 95.46%, respectively, while the DNN model reached an accuracy of 93%. These results exceed those reported in related works.
Conclusion
These results show that the classifiers improve when trained on a balanced dataset. The use of unigram features also improved classifier performance, as it reduced the size of the data and sped up processing.
1 Introduction
Ancient civilizations invented many different writing systems, the most famous of which is cuneiform [1,2]. It was created by the Sumerians of Mesopotamia between 3500 and 3000 BC and is considered one of their most important cultural contributions [3]; it is attributed in particular to the Sumerian city of Uruk, where this writing was developed around 3200 BC.
The name derives from the Latin word cuneus, meaning wedge or nail, because the signs are made of wedge-like shapes pressed carefully into a slab of soft clay with a purpose-made reed stylus [4]. Cuneiform writing is an important part of human history, as it was the means of documenting events, laws, literature, and science at that time [5,6]. At the beginning of the nineteenth century, thousands of cuneiform tablets were discovered in Iraq and Iran, representing the various Assyrian-Babylonian and Persian civilizations [7]. Today, these tablets are held in many museums, and interpreting them requires experience and time. Information technology can therefore help with the task of recognizing cuneiform symbols. The motivations for this work are the following:
Preserving historical and cultural heritage: Cuneiform writing was one of the earliest forms of writing in human history, and it was used in ancient civilizations such as Sumer, Babylon, and Assyria. Designing systems to classify cuneiform symbols can help preserve this valuable historical and cultural heritage for future generations by facilitating the study and interpretation of cuneiform texts.
Facilitating research in various fields: Cuneiform texts cover a wide range of topics, including literature, religion, science, and history. By designing systems to classify cuneiform symbols, researchers can easily identify and analyze relevant texts in their fields of interest, which can help advance knowledge and understanding in various fields.
Thus, the main objective of this work is to build an intelligent system for recognizing cuneiform symbols by investigating the performance of deep learning and machine learning algorithms in classifying these symbols within a cuneiform language identification (CLI) dataset. Below is a summary of the most important contributions of this study:
To use machine learning algorithms (Support Vector Machine [SVM], Decision Tree [DT], K-Nearest Neighbors [KNN], and Random Forest [RF]) together with a deep learning model, a deep neural network (DNN), to classify cuneiform symbols in the CLI dataset.
To use the unigram method for extracting features from cuneiform texts to convert the text into the smallest possible units (symbols), which allows reducing data size and improving processing speed.
To solve the problem of class imbalance in the dataset by using the oversampling technique.
2 Related works
This section provides a summary of some previous research that made use of the dataset provided by Jauhiainen et al. [8] to participants in the CLI-shared task that was held at VarDial 2019 [9].
In 2019, Paetzold and Zampieri [10] applied machine learning techniques to determine the language of cuneiform texts. The authors used the CLI dataset of cuneiform texts, which covers Sumerian (SUX) and dialects of Akkadian. They extracted character n-grams of one to five characters from the texts and used these features to train an SVM classifier. Their method achieved an F1 score of 73.8% in identifying the language of cuneiform texts. The article shows that machine learning techniques can be effective for this task and that character-based features are particularly useful. The results could inform the development of automated tools for the analysis and classification of cuneiform texts.
However, using n-gram features from 1 to 5 has drawbacks: the computational cost increases, and the number of non-zero features may grow, which can make it difficult for the model to learn meaningful patterns.
In 2019, Benites et al. [11] concentrated on two subtasks: determining the dialect of German writings and the language of cuneiform texts. For cuneiform language identification, the authors employed a dataset of cuneiform writings in SUX and Akkadian, using character n-grams as features and a machine learning technique based on SVMs. For the German dialect identification (GDI) task, they employed a dataset of texts written in the four German dialects of Bavarian, Swabian, Low German, and High German. Accuracy, precision, recall, and F1 score are among the criteria the authors used to assess their models. Their models did well on both tasks, reaching a macro-averaged F1 of 74.6% for GDI and 74.7% for CLI. However, their method was tested with the SVM algorithm only; other machine learning algorithms and deep learning models were not investigated. The accuracy can also be further enhanced, and the dataset was unbalanced, which may cause poor system performance.
In 2019, Bernier-Colborne et al. [12] presented the systems developed by the National Research Council Canada for the CLI shared task of the 2019 VarDial evaluation campaign. Their study compared three approaches: a baseline that combines n-gram extraction with a statistical classifier, an ensemble of classifiers, and a deep learning technique using a Transformer network. They provided a detailed account of how each system was trained and examined the impact of various preprocessing and model estimation decisions. Their DNN achieved an accuracy of 77% on the testing data, surpassing all previous results reported for CLI. However, they trained the classifier on an unbalanced dataset, which limits classifier performance because of bias toward the majority class.
In 2019, Doostmohammadi and Nassajian [13] studied language and dialect identification of cuneiform texts by examining various machine learning techniques, including SVM, Naive Bayes, RF, Logistic Regression, and neural networks. The results indicated that an ensemble of SVM and Naive Bayes achieved the best performance. The study revealed that using only characters was sufficient to attain an F1-score of at least 72.10%. They used 1- to 3-gram or 1- to 4-gram features, which are computationally costly, and they used the dataset without balancing it or further verifying the results.
All of the above-mentioned studies were done on an imbalanced dataset. A dataset is imbalanced when some classes have significantly more instances than others. As a result, training may be biased in favor of the majority class, which can harm performance on the minority classes [14]. To the best of our knowledge, this article is the first to apply the oversampling method to balance the classes of the CLI dataset and to show its impact on machine learning and deep learning classifiers.
3 Materials and methods
This section describes the dataset and preprocessing methods used to improve the performance of machine learning and deep learning classifiers for the classification of cuneiform symbols, as shown in Figure 1.
3.1 Data description
The CLI [9] dataset contains 139,421 snippets of cuneiform text (the sum of the class counts in Table 1) for the task of recognizing seven languages and dialects written in cuneiform: SUX, Old Babylonian (OLB), Middle Babylonian Peripheral (MPB), Standard Babylonian (STB), Neo-Babylonian (NEB), Late Babylonian (LTB), and Neo-Assyrian (NEA). The dataset suffers from class imbalance: classes SUX and NEA together contain most of the data, as shown in Table 1.
Class | Language | Number of samples |
---|---|---|
LTB | “Late Babylonian” | 15,947 |
MPB | “Middle Babylonian Peripheral” | 5,508 |
NEA | “Neo-Assyrian” | 32,966 |
NEB | “Neo-Babylonian” | 9,707 |
OLB | “Old Babylonian” | 3,803 |
STB | “Standard Babylonian” | 17,817 |
SUX | “Sumerian” | 53,673 |
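The degree of imbalance in Table 1 can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the counts from the table; the variable names are illustrative and not taken from the original work.

```python
# Class counts copied from Table 1.
counts = {
    "LTB": 15_947, "MPB": 5_508, "NEA": 32_966, "NEB": 9_707,
    "OLB": 3_803, "STB": 17_817, "SUX": 53_673,
}

total = sum(counts.values())
# Share of the whole dataset held by each class.
shares = {label: n / total for label, n in counts.items()}

majority = max(counts, key=counts.get)
minority = min(counts, key=counts.get)
# Ratio between the largest and the smallest class.
imbalance_ratio = counts[majority] / counts[minority]

for label, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {n:>6} samples ({shares[label]:.1%})")
print(f"total: {total}")
print(f"{majority}/{minority} imbalance ratio: {imbalance_ratio:.1f}")
```

With these counts, SUX alone accounts for roughly 38% of the snippets, and the largest class is about 14 times the size of the smallest.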
3.2 Pre-processing
This part describes the steps applied to the dataset before it is passed to the algorithms for training: unigram extraction and counting, class balancing, and data splitting.
3.2.1 Unigram extraction and counting
N-gram extraction [15] is a method used in natural language processing to identify and extract contiguous sequences of n items from a given text. If n = 1, the extracted sequences are called “unigrams”; if n = 2, they are called “bigrams”; and if n = 3, they are called “trigrams.” In this work, unigrams were used. For example, if the “cuneiform” column contains a string of cuneiform signs, applying unigram extraction produces a “cuneiform_split” column containing the list of its individual signs.
Once the n-grams have been extracted, they can be counted to provide useful information about the text. The count of each n-gram can be used to calculate the frequency of occurrence of the n-gram in the text as shown in Figure 2, which can be used in the cuneiform symbols’ classification.
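The extraction and counting steps can be sketched as follows. This is an illustration under assumptions, not the authors' code: it treats every non-whitespace character as one sign, the sample snippet is built from three arbitrary code points of the Unicode Cuneiform block, and the function names are hypothetical.

```python
from collections import Counter


def extract_unigrams(text: str) -> list[str]:
    """Split a snippet into single-sign tokens, dropping whitespace (assumption)."""
    return [sign for sign in text if not sign.isspace()]


def unigram_counts(text: str) -> Counter:
    """Count how often each sign occurs in the snippet."""
    return Counter(extract_unigrams(text))


def vectorize(text: str, vocabulary: list[str]) -> list[int]:
    """Turn a snippet into a fixed-length count vector over a known vocabulary."""
    counts = unigram_counts(text)
    return [counts[sign] for sign in vocabulary]  # Counter returns 0 for unseen signs


# Illustrative snippet: arbitrary code points from the Unicode Cuneiform block.
snippet = "\U0001202D\U00012217 \U0001202D\U000121A0"

tokens = extract_unigrams(snippet)
vocabulary = sorted(set(tokens))
vector = vectorize(snippet, vocabulary)

print(tokens)
print(unigram_counts(snippet).most_common())
print(vector)
```

Because each feature is a single sign, the length of the count vector equals the number of distinct signs in the vocabulary, which is far smaller than the number of distinct bigrams or longer n-grams.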
3.2.2 Balancing of the classes
The CLI dataset suffers from unbalanced classes: classes LTB, OLB, MPB, NEA, NEB, and STB are the minority classes, and class SUX is the majority class, as shown in Table 2. This stage performs upsampling (oversampling) of the minority classes in the CLI dataset. It creates new samples by randomly drawing samples from the minority classes with replacement. The new samples are added to the original dataset so that each class contains 53,673 samples, as shown in Table 2.
Class | Number of samples | Number of samples after the balance |
---|---|---|
LTB | 15,947 | 53,673 |
MPB | 5,508 | 53,673 |
NEA | 32,966 | 53,673 |
NEB | 9,707 | 53,673 |
OLB | 3,803 | 53,673 |
STB | 17,817 | 53,673 |
SUX | 53,673 | 53,673 |
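Random oversampling with replacement, as described above, can be sketched in a few lines. This is a minimal illustration on toy data, assuming every class is raised to the size of the largest class; the function name, the seed, and the toy labels are hypothetical and not taken from the original work.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = rng.choices(items, k=target - len(items))
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 6 majority samples, 3 and 1 minority samples.
labels = ["SUX"] * 6 + ["NEA"] * 3 + ["OLB"] * 1
samples = [f"snippet_{i}" for i in range(len(labels))]

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(labels))
print(Counter(balanced_labels))
```

All original samples are kept, and the added samples are exact copies of existing minority samples; no new text is synthesized.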
3.2.3 Data splitting
After the class balancing process is performed, we use the holdout method to divide the dataset into 80% for training and 20% for testing classifier performance.
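A holdout split of this kind can be sketched as below. The sketch assumes a plain shuffled split with a fixed seed; whether the original split was stratified or seeded is not stated, and the function name and toy data are illustrative.

```python
import random


def holdout_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle once, then hold out `test_fraction` of the data for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)

    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    x_train = [samples[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: 100 snippets spread over 7 class ids.
samples = [f"snippet_{i}" for i in range(100)]
labels = [i % 7 for i in range(100)]

x_train, x_test, y_train, y_test = holdout_split(samples, labels)
print(len(x_train), len(x_test))
```

Applied to the balanced dataset of 7 × 53,673 = 375,711 samples, an 80/20 split leaves roughly 300,600 samples for training and 75,100 for testing.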
3.3 Classifier
Machine learning algorithms and deep learning models are investigated to classify cuneiform symbols and compare their performance.
Machine learning algorithms: SVM, RF, KNN, and DT were selected so that their performance can be compared with related works. Standard parameter settings were used for all algorithms.
Deep learning models: A DNN was designed with three dense layers of 128, 64, and 32 nodes, followed by a dropout layer with a rate of 50% after the 32-node dense layer to reduce overfitting. Finally, a dense layer with seven nodes was added to match the number of output classes, as shown in Figure 3. The network is trained for 50 epochs with a batch size of 32 and the Adam optimizer.
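To give a sense of the size of this network, the layer stack can be written as a list of widths and its trainable parameters counted. The sketch below is only an illustration: the input width (the number of distinct unigrams) is not given in the text, so `ASSUMED_INPUT_DIM` is a hypothetical placeholder, and activation functions are left out because they are not specified and do not affect the count.

```python
# Widths of the dense layers described above: three hidden layers and the output layer.
LAYER_WIDTHS = (128, 64, 32, 7)

# Hypothetical input width; the real value is the size of the unigram vocabulary.
ASSUMED_INPUT_DIM = 500


def dense_params(n_in: int, n_out: int) -> int:
    """A dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


def count_parameters(input_dim: int, layer_widths=LAYER_WIDTHS) -> int:
    """Sum the parameters of consecutive dense layers; dropout adds none."""
    total = 0
    n_in = input_dim
    for n_out in layer_widths:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total


print(count_parameters(ASSUMED_INPUT_DIM))
```

Under this assumption, almost all parameters sit in the first dense layer, so the model size grows roughly linearly with the number of distinct unigram features.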
3.4 Classifier evaluation
After the proposed classifiers have been trained, they are evaluated on the testing set to determine whether they can correctly categorize the cuneiform symbols. The most common metrics, accuracy, precision, recall, and F1-score, are used to measure the performance of the classifiers [16, 17].
Accuracy is calculated by dividing the number of correctly predicted cases by the total number of samples as in equation (1). Accuracy measures how many cases, both positive and negative, were correctly classified. Accuracy should not be used alone on imbalanced problems because it is easy to get a high accuracy score by simply classifying all cases as the majority class.
Precision is calculated by dividing the number of correctly predicted positive cases (true positives) by the sum of the true positives and the negative cases incorrectly predicted as positive (false positives). Precision is calculated by equation (2).
Recall is calculated by dividing the number of true positives by the sum of the true positives and the positive cases incorrectly predicted as negative (false negatives). Recall is a useful statistic when a false negative is more of a concern than a false positive. Equation (3) is used to calculate recall.
F1 score: Increasing a model's precision often decreases its recall, and vice versa. The F1 score is the harmonic mean of precision and recall and summarizes both measurements; it reaches its best value at 1 and its worst at 0. It therefore takes both false positives and false negatives into account. F1 is usually more useful than accuracy, especially when the class distribution is unbalanced. The F1 score is calculated by equation (4).
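Written in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the standard definitions of these four metrics, numbered to match the equation references in the text, are:

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3}\\
\text{F1}        &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```

In the seven-class setting, precision, recall, and F1 are computed per class by treating that class as positive and all others as negative, and are then averaged across classes.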
4 Results and discussions
This section is organized as follows: Section 4.1 presents the results without and with dataset balancing, Section 4.2 compares the obtained results with related works, and Section 4.3 presents the research limitations.
4.1 Results without and with balancing dataset
When training the classifiers on the unbalanced CLI dataset, the obtained results are shown in Table 3.
Classifier | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
---|---|---|---|---|
SVM | 83.25 | 83.3 | 83.25 | 82.52 |
KNN | 76.18 | 75.94 | 76.18 | 74.16 |
DT | 75.47 | 74.67 | 75.47 | 75 |
RF | 82.67 | 82.91 | 82.67 | 81.87 |
DNN | 83 | 82 | 83 | 83 |
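In Table 3, the recall column equals the accuracy column for every classifier. That pattern is what support-weighted averaging of per-class recall produces, so the sketch below assumes weighted averaging; the text does not state which averaging scheme was used. The confusion matrix is a made-up three-class example, and the function name is hypothetical.

```python
def weighted_metrics(confusion):
    """Return accuracy and support-weighted precision, recall, and F1.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for i in range(n_classes):
        tp = confusion[i][i]
        support = sum(confusion[i])                   # true samples of class i
        predicted = sum(row[i] for row in confusion)  # samples predicted as class i
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                      # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Made-up confusion matrix: rows are true classes, columns are predictions.
confusion = [
    [50, 5, 5],
    [4, 20, 6],
    [3, 2, 5],
]

accuracy, precision, recall, f1 = weighted_metrics(confusion)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Weighting each class's recall by its share of the test set makes the weights cancel the per-class denominators, which is why weighted recall always coincides with accuracy.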
Because the number of samples is unevenly distributed across the classes in the CLI dataset, the DT algorithm tends to favor the most represented class. Classification performance therefore improves for the most represented class and deteriorates for the least represented ones, which explains the poor performance of the DT algorithm. The same applies to the other algorithms.
When the classifiers were trained on the same dataset after class balancing with the oversampling method, all classifiers improved: the RF classifier obtained the highest accuracy of 95.46%, followed by DT, DNN, SVM, and KNN with accuracies of 94.13, 93, 88.15, and 88.14%, respectively, as shown in Table 4.
Classifier | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
---|---|---|---|---|
SVM | 88.15 | 88.63 | 88.15 | 88.18 |
KNN | 88.14 | 88.15 | 88.14 | 87.93 |
DT | 94.13 | 94.11 | 94.13 | 94.11 |
RF | 95.46 | 95.49 | 95.46 | 95.47 |
DNN | 93 | 93 | 93 | 93 |
When the dataset is balanced, the RF algorithm achieves better accuracy because it can generate DTs that are more representative of the entire dataset, rather than ones that are biased toward the majority class, as shown in Figure 4. In addition, when the dataset is balanced, the classifiers have more data to train on for the minority classes, which can improve their ability to classify cases in those classes.
4.2 Comparison with related works
To assess the proposed approach, its results are compared with those of related works, as shown in Table 5. The proposed approach obtained the highest performance, with 95.46, 95.49, 95.46, and 95.47% in terms of accuracy, precision, recall, and F1-score, respectively.
Author | Method | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
---|---|---|---|---|---|
Paetzold and Zampieri [10] | A meta-classifier trained on various SVM models | N/V | N/V | N/V | 73.8 |
Benites et al. [11] | SVM with character n-gram features | N/V | N/V | N/V | 74.7 |
Bernier-Colborne et al. [12] | The BERT model with character n-gram features | 77.11 | N/V | N/V | 76.95 |
Doostmohammadi and Nassajian [13] | SVM + NB | 72.39 | N/V | N/V | 72.10 |
Proposed approach | RF with character unigram and balancing dataset | 95.46 | 95.49 | 95.46 | 95.47 |
The superiority of the proposed approach over the previous works is due to the importance of class balancing to avoid biasing classifiers toward the majority classes. In addition, the unigram method is better for extracting features from cuneiform texts because it converts the text into the smallest possible units (symbols). These single symbols can be used in operations such as identifying symbol frequency in text and extracting information about the text. Also, it allows reducing data size and improving processing speed, which makes it very useful in machine learning and classification techniques.
4.3 Research limitations
Like most research studies, this work has limitations that can be addressed in future research. Some samples are still misclassified because of the great similarity between the writing symbols of different civilizations. Further pre-processing may reduce such errors.
5 Conclusions and future work
Cuneiform, also known as ancient cuneiform, is a writing system developed in Sumer (present-day southern Iraq) around 3200 BC. Cuneiform is one of the oldest writing systems known in history.
Sumerian (SUX) was written in this system, and cuneiform signs were inscribed on clay tablets. The signs consist of shapes resembling pointed nails and grooves that are pressed into the clay with a stick or stylus intended for writing, so interpreting cuneiform symbols is a difficult task that requires expertise. Therefore, this article aims to build an intelligent system that can distinguish the cuneiform symbols of different civilizations. Experiments were conducted on the CLI dataset to classify it into seven categories, but this dataset had a category imbalance.
This article investigated the performance of machine learning and deep learning algorithms, including SVM, KNN, DT, RF, and DNN, when class balancing is performed with the oversampling method. The best performance of the algorithms was achieved after balancing: the proposed method using machine learning algorithms (SVM, KNN, DT, and RF) obtained accuracies of 88.15, 88.14, 94.13, and 95.46%, respectively, while the DNN model obtained an accuracy of 93%. The use of unigram features improved the performance of the classifiers and sped up processing compared to previous studies. The class balancing process also addressed the problem of classifier bias toward majority classes, which to the best of our knowledge has not been done in previous studies. For future work, based on the results of previous studies and our findings, more pre-processing is still needed to increase the accuracy of the classifiers, and ensembles of different algorithms should be evaluated.
Author contributions: The authors contributed equally to the paper. The authors have read and agreed to the published version of the manuscript.
Conflict of interest: The authors declare that there is no conflict of interest.
Data availability statement: Data sharing is not applicable to this article since no datasets were generated during this study.
References
[1] Cuneiform - Hittite and other languages | Britannica. https://www.britannica.com/topic/cuneiform/Hittite-and-other-languages. (accessed Mar. 24, 2023).
[2] Boadt L, Clifford RJ, Harrington DJ. Reading the Old Testament: An Introduction. Mahwah, NJ: Paulist Press; 2012.
[3] Mara H, Krömker S, Jakob S, Breuckmann B. GigaMesh and gilgamesh – 3D multiscale integral invariant cuneiform character extraction. VAST 2010 - 11th Int. Symp. Virtual Reality, Archaeol. Intell. Cult. Herit.; January 2010. p. 131–8. 10.2312/VAST/VAST10/131-138.
[4] Rasheed NA, Nados WL. Recognition of cuneiform symbols using neural network. J Theor Appl Inf Technol. 2018;96(17):5857–68.
[5] Charpin D. Writing, law, and kingship in Old Babylonian Mesopotamia. Chicago: University of Chicago Press; 2010. 10.7208/chicago/9780226101590.001.0001.
[6] Uchida E, Watanabe R. Blackening of the surfaces of mesopotamian clay tablets due to manganese precipitation. Archaeol Discov. 2014;02(04):107–16. 10.4236/ad.2014.24012.
[7] Woods C. Visible language. Spring. 2011;45(1/2):155. 10.1037/020683.
[8] Jauhiainen T, Jauhiainen H, Alstola T, Lindén K. Language and dialect identification of cuneiform texts; 2019. p. 89–98. 10.18653/v1/w19-1409.
[9] Zampieri M, Malmasi S, Scherrer Y, Samardžić T, Tyers F, Silfverberg M, et al. A report on the third; 2019. p. 1–16. 10.18653/v1/w19-1401.
[10] Paetzold GH, Zampieri M. Experiments in cuneiform language identification. Vol. 2017; 2019. p. 209–13. 10.18653/v1/w19-1423.
[11] Benites F, von Däniken P, Cieliebak M. TwistBytes – Identification of Cuneiform Languages and German Dialects at VarDial 2019. Proc. Sixth Work. NLP Similar Lang. Var. Dialects; 2019. p. 194–201. https://aclanthology.org/W19-1421.
[12] Bernier-Colborne G, Goutte C, Léger S. Improving cuneiform language identification with; 2019. p. 17–25. 10.18653/v1/w19-1402.
[13] Doostmohammadi E, Nassajian M. Investigating machine learning methods for language and dialect identification of cuneiform texts; 2019. p. 188–93. 10.18653/v1/w19-1420.
[14] Mukhlif AA, Al-Khateeb B, Mohammed MA. Incorporating a novel dual transfer learning approach for medical images. Sensors. 2023;23(2):570. 10.3390/s23020570.
[15] Ali M, Shiaeles S, Bendiab G, Ghita Malgra B. Machine learning and N-GRAM malware feature extraction and detection system. Electron. 2020;9(11):1–20. 10.3390/electronics9111777.
[16] Anwar SM, Majid M, Qayyum A, Awais M, Alnowami M, Khan MK. Medical image analysis using convolutional neural networks: A review. J Med Syst. 2018;42(11):1–13. 10.1007/s10916-018-1088-1.
[17] Mukhlif AA, Al-Khateeb B, Mohammed MA. Breast cancer images classification using a new transfer learning technique. Iraqi J Comput Sci Math. 2023;4(1):167–80. 10.52866/ijcsm.2023.01.01.0014.
© 2023 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.