In this section, we will first introduce the texts and the features. Then, we will show the count results of different categories of words corresponding to the same features. Next, we will obtain the pair-wise dissimilarity among the texts and visualize the results. Finally, we will perform the spectral clustering based on the dissimilarity matrix to discuss the linguistic styles of the Middle Chinese texts.
4.1 Description of the Data and the Features
We analyze the texts with the help of the Tagged Corpus of Middle Chinese (http://lingcorpus.iis.sinica.edu.tw). We take 34 Middle Chinese texts, of which 6 are native Chinese texts and the remaining 28 are Chinese translations of Buddhist scriptures, to perform the quantitative analysis. These texts date from the Eastern Han Dynasty (25–220 A.D., tagged as P1), the Three Kingdoms (229–280 A.D., tagged as P2), the Western Jin Dynasty (265–317 A.D., tagged as P3), the Eastern Jin Dynasty (317–420 A.D., tagged as P4), the Sixteen Kingdoms (304–439 A.D., tagged as P5), and the Northern and Southern Dynasties (439–589 A.D., tagged as P6) to the Sui Dynasty (581–618 A.D., tagged as P7). They are listed in Table 1 with their author(s), book name, denotative label in this article, and category. In the category column, “T” denotes the Chinese translations of Buddhist scriptures and “N” the native Chinese texts. In addition, the CBETA (Chinese Buddhist Electronic Text Association) number is given for the translated sutras.
During the Middle Ages, wars were frequent, so few native texts were handed down, and even fewer of them are credible and annotated in the corpus. Since very few resources are available, we have tried to select all the usable texts in the corpus. If new textual research and annotation results become available in the future, the results of this article can be further enriched.
In Table 1, time periods are given in terms of dynasties, because we cannot accurately date most of the texts to an exact year. These dynasties also overlap somewhat, since more than one kingdom could exist at the same time, and we attribute each text to the dynasty into which most scholars classify it.
We filter the entries in the Swadesh list to construct the lexical features based on two criteria. First, each entry selected should correspond to a group of synonyms, of which some were often used in the early phase whereas the others were used in the later phase, i.e., there existed diachronic word substitutions among these synonyms. Second, the words related to such an entry should occur in the aforementioned Middle Chinese texts as ubiquitously as possible. Thanks to the survey on the evolution of core words in Middle Chinese [37], we finally take a total of 12 entries to construct features, as listed in Table 2. It is also worth noting that all the word substitutions are gradual and relative. In other words, the words preferred in the early phase may still occur in the later phase and vice versa.
4.2 Statistics of Featured Word Occurrences
With respect to each feature, we count its corresponding early words and later words in each text. The results are shown in Table 3 as ratios \(m:n\), in which \(m\) is the number of early word occurrences and \(n\) is that of later word occurrences. The first column of the table is the text label. Apart from the label column, the leftmost seven columns are for noun features, and the rightmost five columns are for verb features.
Table 3 shows that there are ratios of \(0:0\) even with the use of the Swadesh list, which contains only the most basic and common concepts. Note that the ratio \(0:0\) means that neither the early words nor the later ones appear in the given text, so we know nothing about its author’s habits in representing such a concept. In addition, if two texts both show the ratio \(0:0\), it does not mean that their authors followed the same word usage habit; it only means that we have no knowledge about the word usage in either text with respect to this feature.
Furthermore, identical ratios of early words to later words in two different texts do not necessarily imply identical wording preferences. For instance, with respect to the feature “eye,” we get \(6:4\) in the text labeled ZK and \(3:2\) in the text labeled TK. Thus, we know both authors preferred the early words to the later ones. Although the ratios \(6:4\) and \(3:2\) are equal, the preferences of the authors are not necessarily identical. Here, the two ratios lead to the posterior distributions \(\text{Beta}(7,5)\) and \(\text{Beta}(4,3)\), respectively. Both distributions reach their peaks at the value \(0.6=6/(6+4)=3/(3+2)\), but neither is concentrated at a single point, which means that the wording preference of each author may deviate from the expectation of the corresponding random variable. Moreover, \(\text{Beta}(7,5)\) is more concentrated than \(\text{Beta}(4,3)\), because it rests on more observations. Consequently, we cannot assert that the authors have the same wording habits even if the related texts have the same ratio of early words to later words.
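The comparison of the two posteriors can be checked numerically. The following is a minimal sketch, assuming a uniform \(\text{Beta}(1,1)\) prior, which is consistent with the posteriors quoted above but is an assumption of this example; the helper name `beta_posterior_summary` is hypothetical.

```python
import math


def beta_posterior_summary(early, later):
    """Summarize Beta(early + 1, later + 1), i.e., the posterior under an
    ASSUMED uniform Beta(1, 1) prior (hypothetical helper for illustration)."""
    a, b = early + 1, later + 1
    total = a + b
    # The mode is undefined for the flat Beta(1, 1) that a 0:0 count leaves behind.
    mode = (a - 1) / (total - 2) if early + later > 0 else None
    mean = a / total
    std = math.sqrt(a * b / (total ** 2 * (total + 1)))
    return {"a": a, "b": b, "mode": mode, "mean": mean, "std": std}


# Counts taken from the "eye" example in the text, plus the 0:0 case.
for label, (m, n) in {"ZK": (6, 4), "TK": (3, 2), "0:0": (0, 0)}.items():
    print(label, beta_posterior_summary(m, n))
```

Under this assumption, both ZK and TK yield a mode of 0.6, but the standard deviation is smaller for ZK, whose posterior is based on ten observations rather than five. A \(0:0\) count leaves the assumed prior unchanged, which matches the remark that such a ratio carries no information.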
4.3 Feature Parameters and Their Dissimilarity
In this subsection, based on the aforementioned statistics, we examine the linguistic styles through the lens of the lexical features. A parallel coordinates plot for some typical texts is shown in Figure 3.
In the horizontal direction, a range of features are listed, each of which corresponds to a vertical axis. For a given text, the mean values of the parameters for the usage of different features at all axes are connected to form a segmented line, which reflects the average measure for all features. The area around the line of the same color shows the 90% confidence interval.
We choose one native text (labeled as YZ) and two translated Buddhist texts (labeled as LO-6 and KU-1) for the plot. It shows that, for some features such as “belly” and “bird,” the usages across all three texts are similar, whereas for other features such as “dog” and “hear,” the usages are clearly different.
Overall, the translated Buddhist texts, i.e., LO-6 and KU-1, are more similar to each other, while the native text labeled as YZ is quite different from them. In this article, features are treated as random variables, which helps us take a deep and detailed look at the features. Not only can we observe the differences in the expectations (the mean values) of the feature random variables, but we can also analyze the overlap between their distributions. For these three texts, there is a large overlap in the distributions for some features (e.g., the feature “bird”) and little overlap for other features (e.g., the feature “hear”). The former implies that the corresponding texts are similar and the latter that they are dissimilar. This observation inspires us to visualize the dissimilarity matrix over all texts surveyed, which is shown in Figure 4.
As shown in each subfigure of Figure 4, the diagonal is rendered as white pixels, which means the dissimilarity of a given text to itself is zero. The other pixels, colored in shades of blue, show the pair-wise dissimilarity. The darker a pixel is, the more dissimilar the text pair is. Thus, the visualization helps us understand the dissimilarity of word form usage with respect to basic concepts in a quantitative manner.
If we compare the four subfigures in Figure 4, then subfigure (d), reflecting the cosine distance, shows the most significant difference in the pair-wise dissimilarity, followed by subfigure (a) with the Taxicab distance, then subfigure (b) with the Euclidean distance, and finally subfigure (c) with the Chebyshev distance, which shows the least significant difference. With the Chebyshev distance, we retain only the most different feature component and ignore the other components, resulting in a large loss of information. With the cosine distance, we mainly focus on the angle between the feature vectors and do not pay attention to the absolute values of the feature components. Both the Euclidean distance and the Taxicab distance are special cases of the Minkowski distance, with orders 2 and 1, respectively. These two distances both fully account for the differences of all components: the Taxicab distance weights every component difference equally, while the Euclidean distance gives a larger weight to the components with large differences. We can evaluate the dissimilarity between texts from different perspectives based on the aforementioned distances.
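The four distances compared above can be sketched as follows. This is an illustration only: the two feature vectors are made-up values, not data from the corpus, and the function names are chosen for this example.

```python
import math


def taxicab(u, v):
    """Minkowski distance of order 1: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Minkowski distance of order 2: squaring emphasizes large differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chebyshev(u, v):
    """Only the single most different component is retained."""
    return max(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine of the angle; insensitive to vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical per-feature values for two texts (NOT taken from the paper).
text_a = [0.6, 0.2, 0.9]
text_b = [0.5, 0.8, 0.9]

for name, fn in [("taxicab", taxicab), ("euclidean", euclidean),
                 ("chebyshev", chebyshev), ("cosine", cosine_distance)]:
    print(name, round(fn(text_a, text_b), 4))
```

In this toy case, the Chebyshev distance reports only the difference in the second component, whereas the Taxicab and Euclidean distances also reflect the small difference in the first component. Scaling either vector by a positive constant leaves the cosine distance unchanged.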
4.4 Clustering and Visualization
Since we can quantitatively evaluate the level of dissimilarity between different texts, we can display the texts as individual points in a low-dimensional space, for instance, a 2D Euclidean plane, according to the dissimilarity matrix. If the dissimilarity between two texts is large, then we position their corresponding points farther apart; otherwise, they are positioned closer together. Mathematically, this approach, called data embedding, projects high-dimensional data into a low-dimensional space so that the resulting low-dimensional configuration reflects the intrinsic structure of the data and performs better in subsequent processing.
Clustering can be performed with different combinations of distance type and the parameter \(\sigma\). Since the ranges of the distances differ, we set the parameter \(\sigma\) as shown in Table 4.
With the help of spectral embedding, we display the texts as points in a 2D Euclidean space, as shown in Figure 5. Note that the horizontal and vertical axes here only provide a basis for this Euclidean space.
We can perform data clustering on all the texts with the use of the dissimilarity matrix. If we classify the texts into two categories, then the results are divided exactly between the native and the translated texts. It can be observed that the native Chinese texts concentrate on the right side, while the translated Buddhist texts concentrate on the left side. In short, the native texts and the translated texts are distributed in different regions, which shows that the features selected in this article on the basis of lexical evolution indeed reflect the different linguistic styles of the native language itself and of the product of indirect language contact, respectively.
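The two-way split from a dissimilarity matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in this article: it assumes a Gaussian affinity \(\exp(-d^2/(2\sigma^2))\), uses the second eigenvector of the normalized affinity matrix (found by power iteration) as a one-dimensional embedding, and splits the texts by sign. The function names and the toy matrix are hypothetical.

```python
import math
import random


def gaussian_affinity(dissim, sigma):
    """Turn dissimilarities into affinities (ASSUMED Gaussian kernel form)."""
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row]
            for row in dissim]


def spectral_bipartition(dissim, sigma, iters=1000, seed=0):
    """Split items into two groups by the sign of a 1D spectral embedding."""
    n = len(dissim)
    w = gaussian_affinity(dissim, sigma)
    deg = [sum(row) for row in w]
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    # Shifted normalized affinity (M + I) / 2 has eigenvalues in [0, 1],
    # so power iteration converges to the algebraically largest ones.
    m = [[0.5 * (inv_sqrt[i] * w[i][j] * inv_sqrt[j] + (1.0 if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    # The leading eigenvector is proportional to sqrt(degree); it is trivial.
    total = math.sqrt(sum(deg))
    v1 = [math.sqrt(d) / total for d in deg]
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Remove the trivial direction, then apply one power-iteration step.
        dot = sum(a * b for a, b in zip(x, v1))
        x = [a - dot * b for a, b in zip(x, v1)]
        x = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in x))
        x = [a / length for a in x]
    embedding = [x[i] * inv_sqrt[i] for i in range(n)]
    labels = [1 if e > 0 else 0 for e in embedding]
    return embedding, labels


# Toy dissimilarity matrix: six hypothetical texts forming two tight groups.
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
toy_dissim = [[abs(p - q) for q in positions] for p in positions]
coords, groups = spectral_bipartition(toy_dissim, sigma=1.0)
print(groups)
```

Which group receives label 0 or 1 is arbitrary, since an eigenvector is only defined up to sign. A 2D display such as the one discussed here would use one further eigenvector as the second coordinate.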
It was reported in the literature that the arrival of Buddhism in China and the translation of Buddhist scriptures into Chinese introduced an unprecedentedly large number of Sanskrit words into Chinese. Obviously, we can utilize the loan words as the stylistic features to distinguish the translated texts from the native texts. However, we here conclude that the two categories of texts also greatly differ in the word usage for the common concepts.
Furthermore, the points representing different texts by the same author (for example, Lokakṣema, Zhi Qian, or Dharmarakṣa) are located relatively close to each other, indicating that the preferred usage of word forms does reflect the habit of a certain author and the stage of language evolution the author was in.
Inside the translated text category, the texts labeled with “LO-*” and “ZQ-*” are mainly located in the upper part, whereas most of the other translated Buddhist texts are in the lower part. The texts labeled with “LO-*,” where the symbol “*” refers to a certain number, were translated by Lokakṣema, a Buddhist monk of Gandharan origin who traveled to China during the Eastern Han dynasty and translated Buddhist texts into Chinese, and, as such, became a prominent figure in Chinese Buddhism. The texts labeled with “ZQ-*” were translated by Zhi Qian (支谦), a disciple of Zhi Liang (支亮), who in turn had been a disciple of the aforementioned famous translator Lokakṣema. Thus, it is very reasonable that they had similar translation styles. In addition, most texts located in the upper part date from the Eastern Han dynasty to the Three Kingdoms (25–280 A.D.), i.e., the early phase of the Middle Ages. As for the lower part of the translated text category, we notice that most of the texts there span from the Western Jin dynasty to the Sui dynasty (265–618 A.D.), i.e., the late phase of the Middle Ages. Thus, they manifest a different style compared with that of the early phase. We conclude that the word form features representing the same concepts can indeed reflect the evolution of language.
Inside the native text category, Qi Min Yao Shu (or Essential Techniques for the Welfare of the People, 齐民要术), compiled by Jia Sixie of the Northern Wei Dynasty, is one of the earliest and relatively systematic agricultural works in China. It has always been considered to be more colloquial, and from the perspective of common words, this text is the closest to the Buddhist scriptures among all native texts. The text labeled with “YX” is Luoyang Qielan Ji (or Record of Buddhist Temples in Luoyang, 洛阳伽蓝记, around 547 A.D.). The author is Yang Xuanzhi, a native writer who served as a government official at that time. The book records the origins and changes of the Buddhist temples of Luoyang City in the Northern Wei Dynasty, when Buddhism reached its pinnacle and there were over 1,000 Buddhist temples in Luoyang city alone. Even though there are many loan words in the text, we correctly classify it into the native category with the proposed lexical features.
In the visualization of all the texts by low-dimensional embedding, not only can we observe the above-mentioned trends, but we should also pay attention to the individuality of each text. For example, Lokakṣema belongs to the Eastern Han Dynasty, and Zhi Qian belongs to the Three Kingdoms period. Zhi Qian is later than Lokakṣema and would generally be expected to prefer newer words. In fact, however, Zhi Qian’s grandfather had already lived in China, and Zhi Qian had been living in the native Chinese environment. Thus, his works have stronger literary characteristics than those of Lokakṣema. For another example, Yan Shi Jia Xun (or The Family Motto of Yan), labeled as YZ, is a family motto written by Yan Zhitui, which describes his personal experience, thoughts, and knowledge to educate his descendants. Although the text was written in the Sui Dynasty, it has a strong literary flavor. On the contrary, the Qi Min Yao Shu (labeled as JS) from the earlier period has a stronger colloquial flavor.
Across the different types of distances, the results of text clustering and the visualizations based on low-dimensional embedding show slight differences but are largely the same. Specifically, the Buddhist scriptures generally fall into one category, and the native Chinese texts into the other. The distinction between the two categories is clear. Within the same type of texts, the arrangements are also similar. For example, the texts marked with ZQ-* (usually ZQ-1) are at one end and the text marked with ZF is at the other end. Among the native texts, the text marked with JS is closest to the Buddhist texts, while the text marked with YZ is relatively far away. The clustering results based on all four types of distance are reasonable, although the clustering based on the Chebyshev distance is relatively coarse and shows different characteristics from the results with the other three types of distance. Considering the relative positions of the texts by the same author or from the same period in the visualization, we conclude that the cosine distance and the Taxicab distance are more helpful for text clustering.
The parameter \(\sigma\) also plays an important role in the clustering and embedding results. The larger the parameter, the stronger the affinity of each text pair and the closer together the texts are placed; the smaller the parameter, the weaker the affinity between the text pairs and the farther apart the texts are placed.
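The effect of \(\sigma\) can be illustrated with a short sketch. It assumes that affinities are obtained from dissimilarities through a Gaussian kernel \(\exp(-d^2/(2\sigma^2))\); this kernel form and the values below are assumptions made for illustration, not settings taken from Table 4.

```python
import math


def gaussian_affinity(distance, sigma):
    """Affinity of a text pair under an ASSUMED Gaussian kernel."""
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))


# A fixed, hypothetical dissimilarity evaluated under three choices of sigma.
d = 1.0
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: affinity={gaussian_affinity(d, sigma):.4f}")
```

For a fixed dissimilarity, the affinity grows toward 1 as \(\sigma\) increases and shrinks toward 0 as \(\sigma\) decreases, which is why \(\sigma\) has to be chosen in relation to the range of each distance type.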
Encouragingly, the results of the data analysis in this section partially support the conclusions of previous research. In the transmission of Buddhist scriptures over thousands of years, a large number of false records inevitably arose. Taking the translation works of Lokakṣema as an example, only the texts labeled with LO-5 and LO-6 are credible [23]. It was also reported that there is a big difference between the Ban Zhou San Mei Jing (labeled with LO-5) and the reliable translation works by Lokakṣema as well as the Eastern Han Buddhist scriptures in terms of vocabulary, grammar, and other phenomena [6]. From the analysis of the visualized results, it is clear that the text labeled with LO-6 is relatively far from the other texts labeled with LO-1 to LO-5 and LO-7. Among all the texts by Lokakṣema except LO-6 itself, the text labeled with LO-5 is in a position closer to LO-6. This is consistent with the above-mentioned academic research.