research-article
Open access

Quantitative Stylistic Analysis of Middle Chinese Texts Based on the Dissimilarity of Evolutive Core Word Usage

Published: 26 June 2024

Abstract

Stylistic analysis enables open-ended and exploratory observation of languages. To fill the gap in the quantitative analysis of the stylistic systems of Middle Chinese, we construct lexical features based on evolutive core word usage and devise a Bayesian method for feature parameter estimation. The lexical features are drawn from the Swadesh list, each entry of which took different word forms as the language evolved during the Middle Ages. We therefore count the varying word forms of those entries as the linguistic features. With the Bayesian formulation, the feature parameters are estimated to construct a high-dimensional random feature vector, from which we obtain the pairwise dissimilarity matrix of all the texts under different distance measures. Finally, we perform spectral embedding and clustering to visualize, categorize, and analyze the linguistic styles of Middle Chinese texts. The quantitative results agree with existing qualitative conclusions and, furthermore, deepen our understanding of the linguistic styles of Middle Chinese from both the inter-category and intra-category aspects. They also help unveil the special styles induced by indirect language contact.

1 Introduction

Stylistic analysis is, by any measure, one of the most active topics among linguistic applications. Its applications range widely in depth and detail, from stylistic device analysis, text understanding, emotional coloring determination, and document classification to authorship attribution, to name a few [3, 16, 17, 18, 21, 36].
Since stylistic analysis enables open-ended and exploratory observation of languages, it has attracted much attention from academia. Quantitative methods of style analysis have increasingly entered the mainstream of current research, and there are fruitful quantitative achievements regarding different languages, such as English [5, 18], Portuguese [45], Hebrew-Aramaic [8], Latin [14], and Chinese [9, 10, 21, 40].
However, the amount of work devoted to the quantitative study of the stylistic systems of Middle Chinese is surprisingly small. Middle Chinese, regarded as the key transition from Ancient Chinese to Modern Chinese, is commonly defined as the variety of Chinese used during the historical period between roughly the Eastern Han dynasty (25–220 A.D.) and the Sui dynasty (581–618 A.D.). Furthermore, this period saw the largest-scale indirect language contact in history, caused by the translation of Buddhist sutras from their source languages, Sanskrit in most cases, into Chinese, starting in the Middle Ages and lasting for about a millennium. The Chinese Buddhist scriptures preserved to the present day contain about 70 million Chinese characters [33, 41]. Some of the translations also had great literary value and an abiding influence on Chinese literature [1]. It is worth noting that translation in the early phase was mainly conducted by foreign translators, which resulted in styles different from those of the literature by native writers [25, 39]. Thus, the style variation in Middle Chinese is mainly two-fold: the styles of the early phase versus those of the later phases, and the styles of the native literature versus those of the Chinese Buddhist scriptures.
The style variation emerges from various aspects of the language, including phonetics, lexis, and grammar [38]. Among these aspects, the style variation in lexis is fundamental, because the appearance of new words is the most conspicuous type of language change [27] and the evolution of the vocabulary is usually directly perceptible. Therefore, our attention is focused on the lexical style variation.
For a quantitative analysis of the lexical style variation of Middle Chinese texts, it is by no means easy to extract the required lexical features: such features should, for one thing, appear as pervasively as possible across all texts of the Middle Ages, and for another, reflect the dynamical change of the language over that period, which lasted for hundreds of years. Moreover, based on such lexical features, it is also indispensable and usually far from trivial to devise the mathematical methods and then discuss the quantitative results on the texts of Middle Chinese. To the best of our knowledge, it remains an open problem to investigate the linguistic styles of Middle Chinese in a quantitative manner, from both the inter-category aspect, i.e., between the native texts and the translated Buddhist scriptures, and the intra-category aspect, i.e., along the temporal evolution in the Middle Ages.
Deep learning, which utilizes billions of artificial neurons arranged in dozens of layers, has led to significant advances in many natural language processing (NLP) tasks [30]. Its excellent performance in NLP results from progress in computing power and the optimization of network architectures on the one hand, and from large-scale, sophisticated contemporary corpus resources on the other. For example, the word2vec technique relies on a large-scale corpus to obtain semantic vector representations of words through masked prediction [26]. However, due to the centuries-long wars from the Three Kingdoms period to the Northern and Southern Dynasties, the corpus handed down from the Middle Ages to the present day is relatively scarce. In addition, the Chinese language itself developed over the several hundred years of the Middle Ages. Deep learning faces great challenges on language analysis tasks with such low resources and time-varying characteristics. Furthermore, its results lack interpretability, which hinders understanding of the factors affecting the stylistic color of Middle Chinese texts. Therefore, this article focuses on statistical learning methods with manual feature extraction grounded in domain knowledge.
This article aims to fill the aforementioned gap. In particular, our contribution mainly lies in the following three aspects:
First, we provide a lexical lens through which to observe the styles of Middle Chinese texts, inspired by the core vocabulary transition [31]. The core vocabulary, also referred to as the Swadesh list [28], is a predefined compilation of frequent words, such as “head,” “foot,” and “eat,” for the purposes of historical-comparative linguistics. Words in this compilation, usually corresponding to basic concepts, are common across texts of different categories and periods. Moreover, differences in the use of the core vocabulary reflect not only historical linguistic characteristics but also the wording habits of the authors or translators. Therefore, the Swadesh list helps construct lexical features that appear pervasively in Middle Chinese texts, enabling us to explore the two-fold variation of the linguistic styles.
Second, we devise a mathematical method by formulating different authors’ habits in the use of words from the Swadesh list as stylistic features. There were usually various word forms for the same entry in the Swadesh list. For instance, at least two word forms, i.e., “目” (mu in pinyin) and “眼” (yan in pinyin), both commonly used in Middle Chinese, represented the concept “eye.” Different authors chose among these forms differently. Thus, we count the number of appearances of the different word forms and estimate the ratios of word forms as author behavior parameters. However, we must note that the ratios are indeed stochastic. Furthermore, limited by the content, the ratios are ill-defined if none of the word forms related to the same concept occurs. Thus, we carefully tackle the ratio estimation with Bayesian theory, then calculate the dissimilarity matrix of the texts and perform a spectral cluster analysis with different distance measures. We also show that these distance measures for random vectors are consistent with those for deterministic-valued vectors in classical machine learning.
Third, we obtain statistical data for 34 Middle Chinese texts over 12 entries in the Swadesh list and further discuss the lexical styles with the help of visualization for various combinations of distance measures and parameter settings. The styles of the Middle Chinese texts, as well as their affinities and dissimilarities, are rendered in a clear way. This deepens our understanding of the linguistic styles of Middle Chinese from both the inter-category and intra-category aspects and helps unveil the special styles induced by indirect language contact. We also conclude that authors’ habits in the use of different word forms for the same concept, together with the proposed mathematical method, are important references for text clustering and stylistic color analysis, especially in low-resource scenarios.
The rest of this article is organized as follows: Section 2 introduces the related work. Section 3 presents the mathematical foundations. Section 4 shows quantitative results and discusses the stylistic characteristics of Middle Chinese texts. Section 5 concludes this article.

2 Related Work

Any language in the world is indeed a complex of various stylistic varieties. Such stylistic varieties are multifold, reflecting the ways people use language in different social dialogic situations, at different locations and during different time periods. For instance, from the perspective of formality level, Martin Joos in his book The Five Clocks [13] simplified the range of language variation by cutting it into five styles, i.e., frozen, formal, consultative, casual, and intimate styles.
Stylistic analysis aims to bring to light the patterns in style that are related to the disciplinary concerns of literary and linguistic interpretation. Fruitful achievements have been made in stylistic studies of various languages. For English query texts, written and spoken styles were studied based on features including text length, duration, and part of speech (POS) [5]. For Russian texts, a stylistic analysis based on the Random Walk Model (RWM) was proposed [19]. For Modern Chinese texts, the styles, including the formal written style, the colloquial style, and the conversational style, were studied based on sentence length, word length, POS, and so forth [10]. Stylistic analysis also helps determine an anonymous author’s native language by mining a text for errors. To sum up, quantitative research on stylistic analysis has attracted wide academic attention, and a large number of linguistic features have been proposed. For instance, a total of 262 features were employed to analyze the linguistic style of online communities discussing different topics [15].
However, to the best of our knowledge, there are still many gaps in the research on the linguistic styles of Middle Chinese. The existing studies related to the stylistic analysis of Middle Chinese are roughly classified into three categories, i.e., the linguistic feature construction, the quantitative analysis method, and the stylistic visualization. Here, the linguistic feature construction provides a solid foundation for the further tasks.
The linguistic styles of Middle Chinese texts were very complex, stemming from the combined effects of several underlying factors. Obviously, the styles were directly related to the contents, the genres, and the authors’ writing habits. However, two latent clues influenced the styles of Middle Chinese texts in a profound manner. These two clues are also the keys to bringing to light some facts about the evolution of the Chinese language and the Sanskrit-Chinese language contact, and they thus lead us to construct the linguistic features.
The first clue is the evolution of the Chinese language in the Middle Ages. Middle Chinese is the key transition from Ancient Chinese to Modern Chinese [34]. From the lexical perspective, new words constantly come into use, whereas old words gradually drop out of use [27]. During the Middle Ages, conspicuous lexical replacement occurred. For instance, at least three words over time expressed the concept of “feeding,” namely, “食” (si in pinyin), “饲” (si in pinyin), and “喂” (wei in pinyin), and there was a diachronic substitution relationship among the three [12]. Note that diachronic lexical replacement happens not suddenly but gradually. Words denoting the same concept appeared simultaneously in the lexical system over several centuries in the Middle Ages, and their use reflected the stylistic characters of the texts. Thus, the evolution of the lexical system has an impact on the linguistic styles.
The second clue is the language change induced by the Sanskrit-Chinese contact. Buddhism was first introduced into China during the Eastern Han Dynasty. From that time onward, a large number of Buddhist scriptures were translated into Chinese. Many scholars have noticed stylistic differences between the native literature and the translated Buddhist sutras of Middle Chinese. In the early phase, Buddhist scriptures were mainly translated into Chinese by foreigners, whereas in the later phase native writers participated in the translation to a certain degree. The translators of Buddhist sutras developed a writing style that on the one hand carried an influence of the varied traditional Chinese education they had received, but on the other hand displayed an attempt to appeal to less formally educated readers or a more general audience, with the intent of achieving missionary success. During the translation process, not only some linguistic characters of the source languages (mainly Sanskrit) but also the wording habits of the translators themselves were introduced into the texts written in the target language, i.e., Chinese. The influences introduced by the translation ranged from loan words [1], bisyllablization [43], diachronic word replacement [12, 42], and semantic transfer [33] to the policies of the translator sects. Consequently, the Chinese translation of Buddhist sutras manifested a mixture of styles [25]. It has also been concluded, mainly from experience, that there are many more oral elements in the translations of Buddhist scriptures of the Eastern Han Dynasty than in the non-religious documents of the same period [44]. The Sanskrit-Chinese contact based on the translation of Buddhist scriptures was the first large-scale indirect language contact in human history. As a product of this language contact, the Chinese translations of Buddhist scriptures have a linguistic style different from that of the literature by native writers.
Consequently, to inspect the style of the Middle Chinese texts, the lexical features are required to be available not only over hundreds of years along with the language evolution, but also across different kinds of texts, mainly referring to the native Chinese texts and the Chinese Buddhist scriptures translated from the Sanskrit version. The features related to the core words, which appear ubiquitously, are thus preferred. Moreover, on the basis of the Swadesh list [28], Wang [37] has inspected all the core words in Middle Chinese and enumerated in detail the lexical replacement of these core words, which provides a good foundation to construct lexical features.
To analyze the style of Middle Chinese texts with the aforementioned lexical features inspired by core word substitution, the existing quantitative stylistic analysis methods mainly rely on direct ratio calculation. For example, there was a diachronic substitution from “吾” (wu in pinyin) to “我” (wo in pinyin), both first-person pronouns. Cao [2] counted the occurrences of the old word “吾” and the new word “我” and then used the proportion of old to new words as a quantitative feature. Although this direct method is reasonable to some extent, it runs into difficulties in practice. If both the new word and the old word appear once in some text, we obtain proportions of 50% versus 50%; if both appear 100 times, we also obtain 50% versus 50%. However, the former result is clearly more contingent and the latter more credible. Moreover, we must handle a zero denominator when calculating the proportion if neither the new word nor the old word appears in the text. If many entries in the Swadesh list are used for a comprehensive analysis, then we also face high-dimensional feature problems. Therefore, there are gaps in the quantitative analysis methods from the perspective of probability theory.
To render the linguistic analysis results in a vivid manner, clustering algorithms and visualization are also required. Clustering is the unsupervised classification of patterns, here referring to the linguistic styles, into groups (clusters). The clustering problem has been addressed in many disciplines, and the related techniques are used as steps in exploratory data analysis [11]. In the past decades, many clustering algorithms have been developed, such as K-means clustering, mixture models [24], and Spectral Clustering (SC) [29]. It is a challenging task to divide high-dimensional data into different clusters. Here, the high-dimensional data originates from the numerous features of the texts. In practice, however, much high-dimensional data exhibits dense grouping in a low-dimensional space. Principal Component Analysis (PCA) is a typical dimension reduction technique used in stylistic analysis [4]; it captures linear correlations between features but fails when this assumption is violated. In a general sense, the use of manifold information in spectral clustering, which has shown state-of-the-art clustering performance, can better support exploratory stylistic analysis.
Machine learning, especially deep learning, has made great progress in NLP tasks in recent years [30]. Traditional text clustering algorithms are mainly built on representations such as bag-of-words. Recent studies have demonstrated that deep learning approaches help text clustering, semantic comparison, and authorship verification [7, 35]. However, the existing deep learning frameworks are mainly aimed at contemporary languages and lack full consideration of temporal language change.

3 The Quantitative Approach and Its Mathematical Formulations

In this section, we propose the quantitative stylistic analysis approach as well as its mathematical formulation. The approach consists of four steps. First, we count the words of the texts according to the pre-defined features, each of which is linked to an entry in the Swadesh list, with the part of speech taken into account. Next, the count results, i.e., the feature data, are processed with the Bayes formula to determine the feature parameters, which are random variables obeying Beta probability distributions that represent the authors’ wording habits. Afterwards, we calculate the difference of the feature parameters in a quantitative manner for each pair of texts to obtain the dissimilarity matrix. Finally, we visualize and discuss the text styles via spectral embedding and clustering. The workflow with the aforementioned four steps is illustrated in Figure 1.
Fig. 1.
Fig. 1. The workflow of the proposed quantitative approach.

3.1 Feature Definition and Extraction

In this article, a feature is defined as the inclination to use particular words to express a given concept. Considering that the Swadesh list consists of basic concepts shared across various texts, we choose some entries from this list. Furthermore, each of these entries must correspond to a series of words in the lexis of Middle Chinese that are synonymous and substitutable in expressing the same basic concept. In different Middle Chinese texts, different authors would choose different words according to their wording habits and the related contexts. To observe the changes of the lexical styles in the Middle Ages, features should be carefully chosen such that the related words diverged in their linguistic styles along the language evolution. Precisely, some words were mainly used from the Ancient Ages (referring to the time span before the Middle Ages) to the early phase of the Middle Ages, and they usually carry a classical, literary style. The others were mainly used from the later phase of the Middle Ages to the Modern Ages (referring to the time span from the end of the Middle Ages to 1919) and usually carry a modernistic, colloquial style. Once the features are chosen, they help construct statistical parameters that reveal the styles of the Middle Chinese texts, because the decision of a specific author to use a particular word for a certain concept mainly depends on both the lexical evolution and the author’s own habits.
By \(F=\lbrace {{f}_{1}},{{f}_{2}},\ldots , {{f}_{M}}\rbrace\), we denote the features, i.e., the basic concepts chosen from the Swadesh list. For any feature \(f\in F\), let \(E(f)\subset V\) denote the set of its corresponding early words and \(L(f)\subset V\) denote the set of its corresponding later words, where V is the vocabulary of Middle Chinese, precisely, the set of all the Chinese words at that time. We also assume that the early words and later words are disjoint, explicitly, \(E(f)\cap L(f)=\emptyset\). It is worth noting that the part of speech is taken into account, i.e., for any feature \(f\in F\), the corresponding early words \(E(f)\) and later words \(L(f)\) have the same part of speech.
Let \(T=\lbrace {{T}_{1}},{{T}_{2}},\ldots ,{{T}_{N}}\rbrace\) be the set of the texts, whose styles are to be analyzed. Each text is regarded as a sequence of words. We denote the ith text \({{T}_{i}}\) as \({{T}_{i}}=({{v}_{i1}},{{v}_{i2}},\ldots ,{{v}_{i,{{L}_{i}}}})\), where \({{L}_{i}}\) is the length in words of the ith text and \({{v}_{i,s}}\in V\) is the sth word of the ith text for \(1\le i\le N\) and \(1\le s\le {{L}_{i}}\).
Considering that words of both categories, namely, the early and the later, could be mixed in the same text, we count the words of both categories. Let \({{m}_{i,k}}\) represent the number of appearances of the early words for the kth feature in the ith text and \({{n}_{i,k}}\) that of the later words, which satisfy
\begin{equation} m_{i,k} =\sum \limits _{s=1}^{{{L}_{i}}}{{{\mathbb {I}}_{E({{f}_{k}})}}\left(v_{i,s} \right)} \end{equation}
(1)
and
\begin{equation} n_{i,k} =\sum \limits _{s=1}^{{{L}_{i}}}{{{\mathbb {I}}_{L({{f}_{k}})}}\left(v_{i,s} \right),} \end{equation}
(2)
where the indicator function \(\mathbb {I}\) is given by
\begin{equation} {{\mathbb {I}}_{X}}(x)=\left\lbrace \begin{matrix} 1, & \text{if}\ x\in X\ \\ 0, & else. \\ \end{matrix} \right. \end{equation}
(3)
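As a concrete illustration of Equations (1) and (2), the counting step can be sketched as below. This is a minimal sketch, not the authors' code; the word sets and the sample token sequence are hypothetical placeholders.

```python
# Minimal sketch of Equations (1)-(2): count occurrences of early and
# later word forms for one feature in one tokenized text.

def count_feature(tokens, early_words, later_words):
    """Return (m, n): the counts of early- and later-word occurrences."""
    m = sum(1 for w in tokens if w in early_words)   # Equation (1)
    n = sum(1 for w in tokens if w in later_words)   # Equation (2)
    return m, n

# Hypothetical example for the concept "eye": early form "目", later form "眼".
tokens = ["目", "不", "见", "眼", "目"]
m, n = count_feature(tokens, early_words={"目"}, later_words={"眼"})
# Here m counts "目" and n counts "眼".
```

In practice the counts would be accumulated over all M features and all N texts, yielding the arrays \(m_{i,k}\) and \(n_{i,k}\) used in the rest of this section.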
Furthermore, with respect to the kth feature \({{f}_{k}}\) in the ith given text \({{T}_{i}}\), we assume that the author(s) chose the early words with probability \({{\theta }_{i,k}}\) and the later words with probability \(1-{{\theta }_{i,k}}\). However, the exact value of \({{\theta }_{i,k}}\) is unknown. Thus, we estimate it with the Bayesian approach. Since the likelihood is given by
\begin{equation} P\left(m_{i,k} ,n_{i,k}|{{\theta }_{i,k}} \right)={{\theta }_{i,k}}^{m_{i,k}}{{(1-{{\theta }_{i,k}})}^{n_{i,k}}}, \end{equation}
(4)
we can now derive the following estimation using Bayesian formula:
\begin{equation} P\left({{\theta }_{i,k}}|{{m}_{i,k}},{{n}_{i,k}} \right)=\frac{P\left({{m}_{i,k}},{{n}_{i,k}}|{{\theta }_{i,k}} \right)P({{\theta }_{i,k}})}{P\left({{m}_{i,k}},{{n}_{i,k}} \right)}. \end{equation}
(5)
We assume the prior distribution \(P({{\theta }_{i,k}})\) is the one that corresponds to the least amount of knowledge, in other words, the maximum entropy probability with the only constraint \({{\theta }_{i,k}}\in [0,1]\). Thus, the prior is the uniform distribution given by:
\begin{equation} P({{\theta }_{i,k}}=z)=\left\lbrace \begin{matrix} 1, & \text{if}\ z\in [0,1]\ \\ 0, & else. \\ \end{matrix} \right. \end{equation}
(6)
Consequently, the posterior distribution of the parameter \({{\theta }_{i,k}}\) obeys a Beta distribution given by:
\begin{equation} {{\theta }_{i,k}}\sim \text{Beta}({{m}_{i,k}}+1,{{n}_{i,k}}+1). \end{equation}
(7)
Here, the probability density of the distribution \(\text{Beta}(\alpha ,\beta)\) satisfies
\begin{equation} \rho (x)=\frac{{{x}^{\alpha -1}}{{(1-x)}^{\beta -1}}}{B(\alpha ,\beta)} , \end{equation}
(8)
and the symbol \(B(\alpha ,\beta)\) denotes the beta function.
Now, we illustrate the posterior distribution with examples. If neither the early words nor the later words appear in a given text, i.e., \(m=0\) and \(n=0\), then we have no knowledge about the author’s wording choice and the corresponding \(\theta\) obeys a \(\text{Beta}(1,1)\) distribution, in other words, the uniform distribution on \([0,1]\), the same as the prior distribution. For another instance, if \(m=3\) and \(n=2\), then we know that the author preferred the early words to the later words, and the corresponding \(\theta\) obeys a \(\text{Beta}(4,3)\) distribution. Moreover, if \(m=6\) and \(n=4\), then we also know that the author preferred the early words to the later words, and the corresponding \(\theta\) here obeys a \(\text{Beta}(7,5)\) distribution. Note that both the \(\text{Beta}(4,3)\) distribution and the \(\text{Beta}(7,5)\) distribution reach their maximum density at 0.6, since the ratios of the occurrence number of the early words to that of the later words are the same, i.e., \(3:2=6:4\). However, the latter distribution is more concentrated than the former, because we have more knowledge about the latter. All these instances are shown in Figure 2.
Fig. 2.
Fig. 2. Examples of the probability density function of Beta distribution.
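The worked examples above can be reproduced with a few lines of code. The following sketch uses only the closed-form Beta density of Equation (8); it is illustrative and not the authors' implementation.

```python
# Checking the posterior examples: Beta(m+1, n+1) for the counts
# (m, n) = (0, 0), (3, 2), and (6, 4) discussed in the text.
from math import gamma

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1); B(a, b) via the gamma function."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

def beta_mode(a, b):
    """Mode of Beta(a, b), valid for a, b > 1."""
    return (a - 1) / (a + b - 2)

# Beta(4, 3) and Beta(7, 5) both peak at 0.6, since 3:2 = 6:4 ...
assert abs(beta_mode(4, 3) - 0.6) < 1e-12
assert abs(beta_mode(7, 5) - 0.6) < 1e-12
# ... but Beta(7, 5) is sharper at the peak: more evidence, less uncertainty.
assert beta_pdf(0.6, 7, 5) > beta_pdf(0.6, 4, 3)
# With no occurrences at all, Beta(1, 1) is the uniform prior (density 1).
assert abs(beta_pdf(0.25, 1, 1) - 1.0) < 1e-12
```

The assertions mirror Figure 2: equal count ratios give the same peak location, while larger counts concentrate the posterior.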
Regarding the ith text \({{T}_{i}}\), since \({{\theta }_{i,k}}\) manifests how the authors chose words to represent a given concept \({{f}_{k}}\), we conclude that the vector \({{\boldsymbol {\theta }}_{i}}=({{\theta }_{i1}},{{\theta }_{i2}},\ldots , {{\theta }_{iM}})\) reflects the comprehensive usage of all the features. We remark that in the following discussion, the parameters \({{\theta }_{i,k}}\) and \({{\theta }_{j,k}}\) for different texts are assumed to be independent if \(i\ne j\), and the parameters \({{\theta }_{i,{{k}_{1}}}}\) and \({{\theta }_{i,{{k}_{2}}}}\) are also assumed to be independent if \({{k}_{1}}\ne {{k}_{2}}\).

3.2 Metric of Feature Vectors

The choice of the distance between feature vectors is a key factor affecting the text clustering effect. In machine learning, some common distances are listed below.
The Euclidean distance formula is given by
\begin{equation} {{d}_{E}}(x,y)=\sqrt {\sum \limits _{k=1}^{M}{{{\left({{x}_{k}}-{{y}_{k}} \right)}^{2}}}} , \end{equation}
(9)
where x and y are M-dimensional feature vectors and \({{x}_{k}}\) and \({{y}_{k}}\) are their kth coordinates, respectively. The Chebyshev distance is given by
\begin{equation} {{d}_{H}}(x,y)={{\left\Vert x-y \right\Vert }_{+\infty }}=\underset{k}{\mathop {\max }}\,\left| {{x}_{k}}-{{y}_{k}} \right| \end{equation}
(10)
and the Taxicab distance (or Manhattan distance) by
\begin{equation} {{d}_{T}}(x,y)={{\left\Vert x-y \right\Vert }_{1}}=\sum \limits _{k=1}^{M}{\left| {{x}_{k}}-{{y}_{k}} \right|}. \end{equation}
(11)
Note that the Euclidean, Chebyshev, and Manhattan distances are all Minkowski distances, expressed as
\begin{equation*} {{d}_{p}}(x,y)=\sqrt [p]{\sum \limits _{k=1}^{M}{{{\left| {{x}_{k}}-{{y}_{k}} \right|}^{p}}}} \end{equation*}
with the order p assigned to 2, \(+\infty\), and 1, respectively.
The Cosine distance indicates how dissimilar two vectors are using the cosine of the angle between them, given by
\begin{equation} {{d}_{C}}(x,y)=1-\frac{x\cdot y}{\left\Vert x \right\Vert \cdot \left\Vert y \right\Vert }=1-\frac{\sum \nolimits _{k=1}^{M}{{{x}_{k}}{{y}_{k}}}}{\sqrt {\sum \nolimits _{k=1}^{M}{x_{k}^{2}}}\sqrt {\sum \nolimits _{k=1}^{M}{y_{k}^{2}}}}. \end{equation}
(12)
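For concreteness, the four distances of Equations (9)–(12) can be written directly with NumPy. This is a generic sketch of the standard formulas, not code from the article.

```python
# The four deterministic distances of Equations (9)-(12).
import numpy as np

def d_euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))          # Equation (9)

def d_chebyshev(x, y):
    return float(np.max(np.abs(x - y)))                  # Equation (10)

def d_taxicab(x, y):
    return float(np.sum(np.abs(x - y)))                  # Equation (11)

def d_cosine(x, y):
    return float(1.0 - np.dot(x, y)
                 / (np.linalg.norm(x) * np.linalg.norm(y)))  # Equation (12)

x = np.array([0.2, 0.8, 0.5])
y = np.array([0.4, 0.6, 0.5])
# For the Minkowski family, Chebyshev <= Euclidean <= Taxicab always holds.
assert d_chebyshev(x, y) <= d_euclidean(x, y) <= d_taxicab(x, y)
```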
The aforementioned distances are all applicable to deterministic vectors. However, the features that we extract from the texts are random variables obeying statistical distributions. We now turn to the mathematical formulation measuring the difference of word usage in two specific texts. Inspired by the Łukaszyk–Karmowski metric [22], we define the distance of two feature vectors \({{\boldsymbol {\theta }}_{i}}\) and \({{\boldsymbol {\theta }}_{j}}\) as
\begin{equation*} d_{i,j}^{*}=\boldsymbol {E}\,{{d}^{*}}({{\boldsymbol {\theta }}_{i}},{{\boldsymbol {\theta }}_{j}}), \end{equation*}
where the symbol * indicates the choice of distance and can refer to any of E, H, T, and C. Specifically, the distance of the feature vectors is the expectation of the distance of the random vectors.
Taking the Taxicab distance as an example, it satisfies
\begin{equation*} d_{i,j}^{T}=\boldsymbol {E}{{d}^{T}}({{\boldsymbol {\theta }}_{i}},{{\boldsymbol {\theta }}_{j}})=\boldsymbol {E}\sum \limits _{k=1}^{M}{\left| {{\theta }_{ik}}-{{\theta }_{jk}} \right|}=\sum \limits _{k=1}^{M}{\boldsymbol {E}\left| {{\theta }_{ik}}-{{\theta }_{jk}} \right|}. \end{equation*}
Furthermore, we can define the dissimilarity of the two texts for a given feature. Using \(d({{\theta }_{ik}},{{\theta }_{jk}})\), we define the dissimilarity between the ith text and the jth text, where \(i\ne j\), with respect to the kth feature \({{f}_{k}}\) as
\begin{equation} \begin{aligned}d({{\theta }_{ik}},{{\theta }_{jk}}) &= \boldsymbol {E}\left| {{\theta }_{ik}}-{{\theta }_{jk}} \right| \\ & =\int _{0}^{1}{\int _{0}^{1}{\left| x-y \right|\frac{{{x}^{{{m}_{i,k}}}}{{(1-x)}^{{{n}_{i,k}}}}}{B(1+{{m}_{i,k}},1+{{n}_{i,k}})}\frac{{{y}^{{{m}_{j,k}}}}{{(1-y)}^{{{n}_{j,k}}}}}{B(1+{{m}_{j,k}},1+{{n}_{j,k}})}}}\,\text{d}x\,\text{d}y. \end{aligned} \end{equation}
(13)
It is indeed the mean absolute difference of the two parameters corresponding to the usage of the feature \({{f}_{k}}\) in the texts \({{T}_{i}}\) and \({{T}_{j}}\). Specifically, we assume that \(d({{\theta }_{ik}},{{\theta }_{jk}})=0\) if \(i=j\).
For two random variables \({{\theta }_{i,k}}\) and \({{\theta }_{j,k}}\), their distance \(d({{\theta }_{ik}},{{\theta }_{jk}})\) is small only when two factors hold. First, their expectations must be close. Second, their distributions must concentrate near their respective expectations. By taking both factors into account, the distances in this article reflect both the difference in the ratio of the early word count to the later word count and the uncertainty of the variables \({{\theta }_{i,k}}\) and \({{\theta }_{j,k}}\).
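One simple way to evaluate the double integral in Equation (13) numerically is a midpoint-rule grid sum, sketched below with NumPy. This is an illustrative implementation under our own discretization choices, not necessarily the authors' numerical procedure.

```python
# Numerical evaluation of Equation (13): E|theta_i - theta_j| for two
# independent Beta(m+1, n+1) posteriors, via a midpoint-rule double sum.
import numpy as np
from math import gamma

def beta_pdf(x, a, b):
    """Density of Beta(a, b); B(a, b) computed via the gamma function."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

def feature_dissimilarity(m_i, n_i, m_j, n_j, grid=500):
    """Approximate Equation (13) on a grid of midpoints of [0, 1]."""
    t = (np.arange(grid) + 0.5) / grid            # midpoints
    w_i = beta_pdf(t, m_i + 1, n_i + 1) / grid    # density times cell width
    w_j = beta_pdf(t, m_j + 1, n_j + 1) / grid
    diff = np.abs(t[:, None] - t[None, :])        # |x - y| on the grid
    return float(w_i @ diff @ w_j)

# Sanity check: with m = n = 0 both posteriors are uniform on [0, 1],
# and E|X - Y| for two independent uniforms is exactly 1/3.
assert abs(feature_dissimilarity(0, 0, 0, 0) - 1 / 3) < 1e-3
# More evidence at the same ratio concentrates the posteriors and
# shrinks the distance, as discussed above.
assert feature_dissimilarity(30, 20, 30, 20) < feature_dissimilarity(3, 2, 3, 2)
```

Note that \(d(\theta_{ik}, \theta_{jk})\) is strictly positive even for two identically distributed (but independent) parameters, which is why the text separately stipulates \(d(\theta_{ik}, \theta_{jk}) = 0\) when \(i = j\).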
For the other distance formulas, the distance of the random vectors depends, as a whole, on the joint distribution of all the components and cannot be decomposed into an accumulation of componentwise distances. The Monte Carlo method is used to obtain the numerical solution in this article.
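A Monte Carlo estimate of \(\boldsymbol{E}\,d^{*}(\boldsymbol{\theta}_i, \boldsymbol{\theta}_j)\) for such non-decomposable distances can be sketched as follows. The counts, sample size, and seed are illustrative assumptions, not values from the article.

```python
# Monte Carlo sketch: sample the Beta posteriors per feature, apply a
# deterministic distance to each sampled vector pair, and average.
import numpy as np

def mc_distance(counts_i, counts_j, dist, samples=5000, seed=0):
    """counts_* is a list of (m, n) count pairs, one per feature."""
    rng = np.random.default_rng(seed)
    # Each column k holds samples of theta_{.,k} ~ Beta(m_k + 1, n_k + 1).
    ti = np.column_stack([rng.beta(m + 1, n + 1, samples) for m, n in counts_i])
    tj = np.column_stack([rng.beta(m + 1, n + 1, samples) for m, n in counts_j])
    return float(np.mean([dist(a, b) for a, b in zip(ti, tj)]))

euclid = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
taxicab = lambda a, b: np.sum(np.abs(a - b))

# Hypothetical counts for two texts over three features.
counts_i = [(5, 1), (2, 8), (0, 0)]
counts_j = [(1, 5), (2, 8), (4, 4)]
d_e = mc_distance(counts_i, counts_j, euclid)
d_t = mc_distance(counts_i, counts_j, taxicab)
# The Euclidean norm never exceeds the Taxicab norm, sample by sample.
assert 0 < d_e <= d_t
```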
Finally, we will show the consistency between the distance of the random variable vectors and that of the deterministic ones. Since a Beta-distribution random variable \(\theta \sim \text{Beta}(u,v)\) has
\begin{equation*} \boldsymbol {E}\theta =\frac{u}{u+v} \end{equation*}
and
\begin{equation*} \text{Var}(\theta)=\frac{uv}{(u+v+1){{(u+v)}^{2}}}, \end{equation*}
thus, we know that \(\text{Var}(\theta)\) approaches 0 as either u or v tends to positive infinity. By Chebyshev’s inequality, the probability that \(\left| \theta -\boldsymbol {E}\theta \right|\gt \varepsilon\) holds approaches 0 for any fixed \(\varepsilon \gt 0\), and the Beta distribution converges to the Dirac delta distribution centered at \(\boldsymbol {E}\theta\). The random variables expressing the feature parameters thus converge to deterministic values, and the distance of such random vectors obviously converges to the distance of the corresponding deterministic vectors. In other words, the distances of random vectors defined in this section are consistent with those of deterministic vectors.
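This concentration can be checked directly from the closed-form moments above; the helper function and the growing counts below are hypothetical, with \(u = 1 + m\) and \(v = 1 + n\) as in the posterior:

```python
def beta_mean_var(u, v):
    """Mean and variance of Beta(u, v) from the closed-form expressions."""
    mean = u / (u + v)
    var = u * v / ((u + v + 1) * (u + v) ** 2)
    return mean, var

# As the word counts grow (here roughly in the same 3:2 proportion),
# the mean stabilizes while the variance shrinks toward zero.
for m, n in [(3, 2), (30, 20), (300, 200)]:
    mean, var = beta_mean_var(1 + m, 1 + n)
    print(f"m={m:3d}, n={n:3d}: mean={mean:.4f}, var={var:.6f}")
```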

3.3 Dissimilarity Matrix and Spectral Clustering

Clearly, the greater the distance \(d_{i,j}^{*}\) is, the more dissimilar the ith text and the jth text are. We construct the dissimilarity matrix as
\begin{equation} {{\boldsymbol {D}}^{*}}=\left(\begin{matrix} {{d}_{11}} & \ldots & {{d}_{1N}} \\ \vdots & \ddots & \vdots \\ {{d}_{N1}} & \cdots & {{d}_{NN}} \\ \end{matrix} \right) , \end{equation}
(14)
where the symbol * can refer to any of E, H, T, and C to indicate Euclidean distance, Chebyshev distance, Taxicab distance, and Cosine distance, respectively.
Based on the dissimilarity matrix, we will perform clustering. Clustering is one of the most widely used techniques for exploratory data analysis, helping researchers get a first impression of their data by identifying groups of “similar behavior.” Spectral clustering is a way to cluster data that relies on the eigenvalue decomposition of a matrix. With a small number of eigenvectors used for the spectral embedding, we will visualize the data in a low-dimensional representation, where the nodes, namely, the Middle Chinese texts, are positioned such that their distances reflect their dissimilarity. Spectral clustering has many fundamental advantages, and its results often outperform traditional approaches.
We use spectral clustering in scikit-learn package [32] to implement the clustering and low-dimensional embedding. The spectral clustering algorithm takes an \(N\times N\) affinity matrix A as the input. The element \({{a}_{ij}}\) can be regarded as the weight on the edge connecting the ith and jth texts, which is measured by a typical Gaussian function:
\begin{equation} {{a}_{ij}}=\exp \left(-\frac{d_{ij}^{2}}{2{{\sigma }^{2}}}\right), \end{equation}
(15)
where \(\sigma\) is a free parameter representing the width of the Gaussian kernel [20]. It is worth noting that different distance formulas yield different value ranges. For example, the Chebyshev distance lies between 0 and 1, while the Euclidean distance lies between 0 and \(\sqrt {M}\). We thus need to adjust the parameter \(\sigma\) to obtain a reasonable clustering.
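A minimal sketch of this step, assuming a precomputed dissimilarity matrix D and a hand-picked \(\sigma\) (the function name and the toy values are ours, for illustration only):

```python
import numpy as np

def affinity_from_dissimilarity(D, sigma):
    """Gaussian kernel: a_ij = exp(-d_ij^2 / (2 * sigma^2))."""
    D = np.asarray(D, dtype=float)
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))

# A toy 3x3 dissimilarity matrix with hypothetical values.
D = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
A = affinity_from_dissimilarity(D, sigma=0.5)
print(A.round(3))  # diagonal is exactly 1; closer pairs get higher affinity
```

Scaling \(\sigma\) with the range of the chosen distance (e.g., up to \(\sqrt{M}\) in the Euclidean case) keeps the affinities comparable across distance types.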

4 Result and Discussion

In this section, we will first introduce the texts and the features. Then, we will show the count results of different categories of words corresponding to the same features. Next, we will obtain the pair-wise dissimilarity among the texts and visualize the results. Finally, we will perform the spectral clustering based on the dissimilarity matrix to discuss the linguistic styles of the Middle Chinese texts.

4.1 Description of the Data and the Features

We will analyze the texts with the help of the Tagged Corpus of Middle Chinese (http://lingcorpus.iis.sinica.edu.tw). We take 34 Middle Chinese texts, of which 6 are native Chinese texts and the remaining 28 are Chinese translations of Buddhist scriptures, to perform the quantitative analysis. These texts, dating from the Eastern Han Dynasty (25–220 A.D., tagged as P1), Three Kingdoms (229–280 A.D., tagged as P2), Western Jin Dynasty (265–317 A.D., tagged as P3), Eastern Jin Dynasty (317–420 A.D., tagged as P4), Sixteen Kingdoms (304–439 A.D., tagged as P5), and Northern and Southern Dynasties (439–589 A.D., tagged as P6) to the Sui Dynasty (581–618 A.D., tagged as P7), are shown in Table 1, with respect to their author(s), book name, denotative label in this article, as well as category. In the column for category, by “T,” we denote the Chinese translations of Buddhist scriptures and by “N” the native Chinese texts. In addition, their number in CBETA (Chinese Buddhist Electronic Text Association) is given if the texts are translated sutras.
Table 1.
| Label | Author(s) | Book Name | No. in CBETA | Time Period | Category | Tagged Word Count |
| --- | --- | --- | --- | --- | --- | --- |
| LO-1 | Lokakema (支娄迦谶) | Fo Shuo Dou Sha Jing (佛说兜沙经) | T0280 | P1 | T | 1,472 |
| LO-2 | | Fo Shuo Yi Ri Mo Ni Bao Jing (佛说遗日摩尼宝经) | T0350 | P1 | T | 5,707 |
| LO-3 | | Fo Shuo A Du Shi Wang Jing (佛说阿阇世王经) | T0626 | P1 | T | 19,937 |
| LO-4 | | Wen Shu Shi Li Wen Pu Sa Shu Jing (文殊师利问菩萨署经) | T0458 | P1 | T | 7,198 |
| LO-5 | | Ban Zhou San Mei Jing (般舟三昧经) | T0418 | P1 | T | 16,191 |
| LO-6 | | Dao Xing Bo Re Jing (道行般若经) | T0224 | P1 | T | 53,495 |
| LO-7 | | A Chu Fu Guo Jing (阿閦佛国经) | T0313 | P1 | T | 12,101 |
| AX | An Xuan (安玄) | Fa Jing Jing (法镜经) | T0322 | P1 | T | 8,861 |
| ZK | Zhu Dali (竺大力) and Kang Mengxiang (康孟详) | Xiu Xing Ben Qi Jing (修行本起经) | T0184 | P1 | T | 11,305 |
| TK | Tan Guo (昙果) and Kang Mengxiang (康孟详) | Zhong Ben Qi Jing (中本起经) | T0196 | P1 | T | 17,034 |
| ZQ-1 | Zhi Qian (支谦) | Liao Ben Sheng Si Jing (了本生死经) | T0708 | P2 | T | 1,794 |
| ZQ-2 | | Fo Shuo Si Yuan Jing (佛说四愿经) | T0735 | P2 | T | 1,467 |
| ZQ-3 | | Fo Shuo Yi Zu Jing (佛说义足经) | T0198 | P2 | T | 15,154 |
| ZQ-4 | | Fo Shuo Pu Sa Ben Ye Jing (佛说菩萨本业经) | T0281 | P2 | T | 4,090 |
| ZQ-5 | | Da Ming Du Jing (大明度经) | T0225 | P2 | T | 33,342 |
| ZQ-6 | | Fan Mo Yu Jing (梵摩渝经) | T0076 | P2 | T | 3,237 |
| KS | Kang Senghui (康僧会) | Liu Du Ji Jing (六度集经) | T0152 | P2 | T | 56,210 |
| AF | An Faqin (安法钦) | A Yu Wang Chuan (阿育王传) | T2042 | P3 | T | 33,676 |
| DH-1 | Dharmaraka (竺法护) | Guang Zan Jing (光赞经) | T0222 | P3 | T | 67,694 |
| DH-2 | | Sheng Jing (生经) | T0154 | P3 | T | 37,417 |
| DH-3 | | Pu Yao Jing (普曜经) | T0186 | P3 | T | 48,854 |
| FF | Fa Li (法立) and Fa Jyu (法炬) | Da Lou Tan Jing (大楼炭经) | T0023 | P3 | T | 34,951 |
| GB | Gan Bao (干宝) | Sou Shen Ji (搜神记) | - | P4 | N | 48,190 |
| GH | Ge Hong (葛洪) | Bao Pu Zi Nei Pian (抱朴子内篇) | - | P4 | N | 64,233 |
| ZF | Zhu Fonian (竺佛念) | Chu Yao Jing (出曜经) | T0212 | P5 | T | 176,274 |
| KU-1 | Kumārajīva (鸠摩罗什) | Da Zhuang Yan Lun Jing (大庄严论经) | T0201 | P5 | T | 76,692 |
| KU-2 | | Miao Fa Lian Hua Jing (妙法莲华经) | T0262 | P5 | T | 51,857 |
| DA | Dharmakṣema (昙无谶) | Bei Hua Jing (悲华经) | T0157 | P5 | T | 60,413 |
| LY | Liu Yiqing (刘义庆) | Shi Shuo Xin Yu (世说新语) | - | P6 | N | 50,013 |
| GU | Guṇavṛddhi (求那毗地) | Bai Yu Jing (百喻经) | T0209 | P6 | T | 14,504 |
| JS | Jia Sixie (贾思勰) | Qi Min Yao Shu (齐民要术) | - | P6 | N | 93,229 |
| YX | Yang Xuanzhi (杨炫之) | Luo Yang Qie Lan Ji (洛阳伽蓝记) | - | P6 | N | 23,230 |
| YZ | Yan Zhitui (颜之推) | Yan Shi Jia Xun (颜氏家训) | - | P7 | N | 25,994 |
| JN | Jñānagupta (阇那崛多) | Fo Ben Xing Ji Jing (佛本行集经) | T0190 | P7 | T | 270,443 |
Table 1. The Texts for the Quantitative Stylistic Analysis
During the Middle Ages, wars were frequent; few native texts were handed down, and even fewer of them are credible and annotated in the corpus. We have tried our best to select all the eligible texts in the corpus, since very few resources are available. If new philological research and annotation results emerge in the future, the results of this article can be further enriched.
In Table 1, time periods are given in terms of dynasties, because we cannot accurately date most of the texts to the exact year. Also, these dynasties somewhat overlapped, since more than one kingdom could exist at a time; we attribute each text to the dynasty that most scholars tend to classify it into.
We filter the entries in the Swadesh list to construct the lexical features based on two criteria. First, each entry filtered should correspond to a group of synonyms, of which some were often used in the early phase whereas the others in the later phase, i.e., there existed diachronic word substitutions among these synonyms. Second, the words related to such an entry should occur in the aforementioned Middle Chinese texts as ubiquitously as possible. Thanks to the survey on the evolution of core words in Middle Chinese [37], we finally take a total of 12 entries to construct features as listed in Table 2. It is also worth noting that all the word substitutions are indeed gradual and relative. In other words, the words preferred in the early phase would possibly occur in the later phase and vice versa.
Table 2.
| Feature No. | Feature | Part of Speech | Words Preferred in the Early Phase | Words Preferred in the Later Phase |
| --- | --- | --- | --- | --- |
| 1 | Skin | Noun | fu (肤) | pi (皮) |
| 2 | Head | Noun | shou (首) | tou (头) |
| 3 | Eye | Noun | mu (目) | yan (眼) |
| 4 | Foot | Noun | zu (足), zhi (趾) | jiao (脚) |
| 5 | Belly | Noun | fu (腹) | du (肚) |
| 6 | Bird | Noun | qin (禽) | niao (鸟) |
| 7 | Dog | Noun | quan (犬) | gou (狗) |
| 8 | Eat | Verb | qi (喫), shi (食) | chi (吃) |
| 9 | Bite | Verb | nie (啮/囓), he (龁), shi (噬) | yao (咬) |
| 10 | Hear | Verb | wen (闻) | ting (听) |
| 11 | Sleep | Verb | qin (寝), mei (寐), mian (眠), ming (瞑) | shui (睡) |
| 12 | Burn | Verb | ran (燃), fen (焚), fan (燔) | shao (烧) |
Table 2. The Features to Be Observed in the Texts

4.2 Statistics of Featured Word Occurrences

With respect to each feature, we count the occurrences of its corresponding early words and later words in all the texts, respectively. The results are shown in Table 3 as the ratio of m to n, in which m is the number of early word occurrences and n is that of later word occurrences. Here, the first column of the table is the text label. Apart from the label column, the leftmost seven columns are for the noun features, and the rightmost five columns are for the verb features.
Table 3.
| Label | Skin | Head | Eye | Foot | Belly | Bird | Dog | Eat | Bite | Hear | Sleep | Burn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LO-1 | 0:0 | 0:0 | 0:0 | 1:0 | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 | 2:0 | 0:0 | 0:0 |
| LO-2 | 0:0 | 0:1 | 0:0 | 0:0 | 3:0 | 0:0 | 0:3 | 1:0 | 0:0 | 18:1 | 0:0 | 0:2 |
| LO-3 | 0:0 | 10:3 | 0:8 | 15:0 | 0:0 | 0:0 | 0:0 | 12:0 | 0:0 | 90:7 | 0:0 | 0:4 |
| LO-4 | 0:0 | 0:4 | 0:1 | 1:0 | 1:0 | 0:0 | 0:0 | 0:0 | 0:0 | 61:4 | 0:0 | 0:1 |
| LO-5 | 0:0 | 0:3 | 4:1 | 14:0 | 1:0 | 0:0 | 0:1 | 3:0 | 0:0 | 142:9 | 1:0 | 0:10 |
| LO-6 | 0:0 | 2:6 | 8:9 | 8:1 | 1:0 | 0:7 | 0:0 | 5:0 | 1:0 | 181:26 | 0:0 | 0:4 |
| LO-7 | 0:0 | 0:4 | 3:1 | 6:0 | 0:0 | 0:0 | 0:0 | 4:0 | 0:0 | 41:16 | 0:0 | 0:4 |
| AX | 0:0 | 2:1 | 1:0 | 3:0 | 0:0 | 0:1 | 0:0 | 3:0 | 0:0 | 23:4 | 0:0 | 0:0 |
| ZK | 0:4 | 2:7 | 6:4 | 13:0 | 1:0 | 0:2 | 1:0 | 13:0 | 0:0 | 23:3 | 5:0 | 1:4 |
| TK | 0:0 | 2:11 | 3:2 | 26:1 | 0:0 | 0:2 | 0:0 | 17:0 | 0:0 | 73:22 | 3:0 | 9:4 |
| ZQ-1 | 0:0 | 0:0 | 0:4 | 0:0 | 0:0 | 0:0 | 0:0 | 1:0 | 0:0 | 0:0 | 0:0 | 0:0 |
| ZQ-2 | 0:0 | 0:1 | 1:3 | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 | 6:2 | 0:0 | 0:0 |
| ZQ-3 | 0:1 | 0:7 | 1:13 | 20:0 | 2:0 | 0:0 | 0:0 | 14:0 | 0:0 | 77:7 | 0:1 | 6:2 |
| ZQ-4 | 0:0 | 0:2 | 0:1 | 4:0 | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 | 13:0 | 0:0 | 0:0 |
| ZQ-5 | 0:0 | 2:2 | 6:4 | 8:1 | 0:0 | 0:5 | 0:0 | 4:0 | 0:0 | 115:13 | 0:0 | 0:5 |
| ZQ-6 | 0:0 | 2:1 | 0:1 | 19:0 | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 | 9:3 | 1:0 | 0:0 |
| KS | 1:5 | 59:40 | 19:20 | 24:3 | 7:0 | 0:18 | 1:0 | 73:0 | 1:0 | 203:22 | 12:4 | 0:16 |
| AF | 0:2 | 2:37 | 8:66 | 28:13 | 3:0 | 0:7 | 0:1 | 42:0 | 0:0 | 131:21 | 9:4 | 0:15 |
| DH-1 | 0:3 | 2:5 | 6:11 | 67:2 | 2:0 | 0:1 | 1:1 | 3:0 | 0:0 | 47:28 | 1:0 | 0:2 |
| DH-2 | 0:1 | 1:26 | 10:13 | 26:9 | 6:0 | 1:14 | 0:1 | 48:0 | 1:0 | 162:24 | 4:0 | 6:5 |
| DH-3 | 0:4 | 7:30 | 41:14 | 37:5 | 1:0 | 0:13 | 2:1 | 10:0 | 0:1 | 143:47 | 14:0 | 0:28 |
| FF | 2:3 | 0:21 | 4:1 | 10:4 | 14:0 | 0:11 | 0:3 | 71:0 | 1:0 | 36:3 | 0:0 | 0:33 |
| GB | 0:9 | 19:109 | 26:8 | 33:5 | 14:0 | 2:30 | 50:38 | 71:0 | 6:0 | 0:10 | 26:1 | 10:18 |
| GH | 4:9 | 13:23 | 40:1 | 32:2 | 14:0 | 5:16 | 13:2 | 85:0 | 3:0 | 0:1 | 32:2 | 4:19 |
| ZF | 2:30 | 37:63 | 70:34 | 90:13 | 19:0 | 0:36 | 11:7 | 155:0 | 1:0 | 471:53 | 33:16 | 4:52 |
| KU-1 | 5:9 | 7:49 | 52:26 | 56:9 | 10:0 | 0:4 | 4:7 | 87:0 | 0:0 | 343:83 | 5:3 | 9:33 |
| KU-2 | 0:0 | 5:11 | 9:15 | 24:1 | 1:0 | 0:3 | 0:9 | 9:0 | 1:0 | 271:58 | 0:1 | 10:21 |
| DA | 0:8 | 0:16 | 17:7 | 51:1 | 8:0 | 0:5 | 0:0 | 24:0 | 0:0 | 284:65 | 0:0 | 1:6 |
| LY | 3:4 | 7:20 | 16:14 | 6:7 | 6:0 | 2:4 | 0:4 | 32:1 | 2:0 | 0:22 | 23:1 | 4:3 |
| GU | 0:5 | 0:26 | 8:19 | 3:4 | 1:0 | 0:0 | 0:2 | 43:0 | 0:0 | 48:4 | 0:1 | 0:8 |
| JS | 10:21 | 63:111 | 31:16 | 32:35 | 31:7 | 0:4 | 5:16 | 332:1 | 6:0 | 0:1 | 7:0 | 18:57 |
| YX | 0:4 | 10:16 | 17:2 | 2:0 | 2:0 | 1:1 | 11:1 | 16:0 | 3:0 | 0:5 | 1:1 | 3:9 |
| YZ | 1:4 | 4:9 | 22:3 | 3:0 | 4:0 | 1:4 | 5:0 | 13:0 | 3:0 | 0:5 | 0:1 | 2:2 |
| JN | 0:3 | 134:252 | 49:91 | 214:30 | 17:5 | 0:5 | 11:5 | 273:8 | 2:5 | 753:105 | 48:30 | 21:45 |
Table 3. The Statistics of the Word Occurrences
Table 3 shows that there are ratios of \(0:0\) even with the use of the Swadesh list, which contains only the most basic and common concepts. Note that the ratio of \(0:0\) means that neither the early words nor the later ones appeared in the given text. So, we know nothing about its author’s habits in representing such a concept. In addition, if two texts both demonstrate the ratio of \(0:0\), then it does not mean that their authors followed the same word usage habit; instead, it only means that we have no knowledge about the word usage in either text with respect to this feature.
Furthermore, identical ratios of the early words to the later words in two different texts do not necessarily imply identical wording preference. For instance, we get the ratios \(6:4\) in the text labeled ZK and \(3:2\) in the text labeled TK with respect to the feature “eye”; both correspond to the same proportion, so we know both authors preferred the early word. Although the ratios \(6:4\) and \(3:2\) are identical as proportions, the preferences of the authors are not identical. Here, the two ratios lead to the posterior distributions \(\text{Beta}(7,5)\) and \(\text{Beta}(4,3)\), respectively. Both distributions reach their peaks at the value \(0.6=6/(6+4)=3/(3+2)\), yet neither is concentrated at a single point, which means the author’s actual wording preference may deviate from the expectation of the corresponding random variable. Consequently, we cannot assert that two authors have the same wording habits even if the related texts show the same ratio of the early words to the later words.
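The point can be checked numerically: the two posteriors share the same mode, yet the expected absolute difference between them is clearly nonzero. The helper function and the sample size below are ours, for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Posteriors for the feature "eye": the ratio 6:4 yields Beta(7, 5)
# and the ratio 3:2 yields Beta(4, 3); a 0:0 ratio would yield the
# uninformative uniform prior Beta(1, 1).
theta_i = rng.beta(7, 5, size=200_000)
theta_j = rng.beta(4, 3, size=200_000)

def beta_mode(a, b):
    """Mode of Beta(a, b) for a, b > 1."""
    return (a - 1) / (a + b - 2)

print(beta_mode(7, 5), beta_mode(4, 3))  # both distributions peak at 0.6
print(np.abs(theta_i - theta_j).mean())  # expected absolute difference is nonzero
```

The sharper posterior Beta(7, 5) encodes more evidence than Beta(4, 3), so the two texts are not interchangeable even though the point estimates coincide.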

4.3 Feature Parameters and Their Dissimilarity

In this subsection, based on the aforementioned statistics, we will render the linguistic styles through the lens of the lexical features. A parallel coordinates plot for some typical texts is shown in Figure 3.
Fig. 3.
Fig. 3. A parallel coordinates plot for some typical texts.
In the horizontal direction, a range of features is listed, each of which corresponds to a vertical axis. For a given text, the mean values of the parameters for the usage of the different features are connected across all axes to form a segmented line, which reflects the average measure over all features. The shaded area in the same color around each line shows the 90% credible interval.
We choose one native text (labeled as YZ) and two translated Buddhist texts (labeled as LO-6 and KU-1) for plotting. It is shown that, for some features such as “belly” and “bird,” the usages across all three texts are similar; for some other features such as “dog” and “hear,” their usages are obviously different.
Overall, the two translated Buddhist texts, i.e., LO-6 and KU-1, are more similar to each other, while the native text labeled as YZ is quite different from both. In this article, features are treated as random variables, which helps us take a deep and detailed look at them. Not only can we observe the differences of the expectations (the mean values) of the feature random variables, but we can also analyze the overlap between their distributions. For these three texts, there is a large overlap in the distributions for some features (e.g., the feature “bird”) and little overlap for others (e.g., the feature “hear”). The former implies that the corresponding texts are similar, and the latter that they are dissimilar. This observation inspires us to visualize the dissimilarity matrix over all the texts surveyed, which is shown in Figure 4.
Fig. 4.
Fig. 4. The pair-wise dissimilarity matrix for all the texts.
As shown in each subfigure of Figure 4, the diagonal line is rendered as white pixels, which means the dissimilarity of a given text to itself is zero. The other pixels, dyed with shades of blue, show the pair-wise dissimilarity. The darker it is, the more dissimilar the text pair is. Thus, the visualization helps us understand the dissimilarity of word form usage with respect to basic concepts in a quantitative manner.
Comparing the four subfigures in Figure 4, subfigure (d), which reflects the cosine distance, shows the most significant differences in the pair-wise dissimilarity, followed by subfigure (a) with the Taxicab distance and subfigure (b) with the Euclidean distance, while subfigure (c) with the Chebyshev distance shows the least significant differences. With regard to the Chebyshev distance, only the most different feature component is retained and all the other components are ignored, resulting in a large loss of information. With regard to the cosine distance, we mainly focus on the angle between the feature vectors and pay no attention to the absolute values of the components. Both the Taxicab distance and the Euclidean distance are special cases of the Minkowski distance, with orders 1 and 2, respectively. These two distances fully account for the differences of all components: the Taxicab distance weights each component equally, whereas the Euclidean distance gives greater weight to the components with large differences. We can thus evaluate the dissimilarity between texts from different perspectives based on the aforementioned distances.
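The contrast between the four measures can be seen on a pair of hypothetical feature vectors (the values below are made up for illustration):

```python
import numpy as np

def taxicab(x, y):
    return np.abs(x - y).sum()            # weights every component equally

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())  # emphasizes large component differences

def chebyshev(x, y):
    return np.abs(x - y).max()            # keeps only the largest difference

def cosine(x, y):
    # depends only on the angle between the vectors, not their magnitudes
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([0.6, 0.9, 0.1])
y = np.array([0.5, 0.2, 0.1])
for f in (taxicab, euclidean, chebyshev, cosine):
    print(f"{f.__name__}: {f(x, y):.3f}")
```

On these vectors the single large difference (0.9 vs. 0.2) dominates the Chebyshev and Euclidean values, while the Taxicab distance also accumulates the small differences.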

4.4 Clustering and Visualization

Since we can quantitatively evaluate the level of dissimilarity between different texts, we can display the texts as individual points in a low-dimensional space, for instance, a 2D Euclidean plane, according to the dissimilarity matrix. If the dissimilarity between two texts is large, their corresponding points are positioned farther apart; otherwise, they are positioned closer. Mathematically, this approach, called data embedding, projects high-dimensional data into a low-dimensional space so that the resulting low-dimensional configuration reflects the intrinsic structure of the data and performs better in further processing.
Clustering can be performed based on the combination of different types of distances and different choices of the parameter \(\sigma\). Considering that the ranges of the distances differ, we set the parameter \(\sigma\) as shown in Table 4.
Table 4.
| Distance | Approximate Maximum Value | Choices of \(\sigma\) |
| --- | --- | --- |
| Taxicab distance (T) | 4 | 1.2, 1.5, 1.8 |
| Euclidean distance (E) | 1.6 | 0.5, 0.6, 0.7 |
| Chebyshev distance (H) | 0.8 | 0.25, 0.3, 0.35 |
| Cosine distance (C) | 0.35 | 0.11, 0.12, 0.13 |
Table 4. The Choices of Clustering Parameter \(\sigma\)
With the help of spectral embedding, we display the texts as points in a 2D Euclidean space, as shown in Figure 5. Note that the horizontal and vertical axes here only provide a basis for this Euclidean space.
Fig. 5.
Fig. 5. The spectral embedding and clustering of all the texts in a 2D Euclidean plane.
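A minimal sketch of the embedding-plus-clustering step with the scikit-learn package, run on synthetic data standing in for the feature-parameter vectors (the two groups, the \(\sigma\) value, and all numbers below are hypothetical):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(2)

# Two synthetic "style groups" of four texts each, 12 features per text.
X = np.vstack([rng.normal(0.2, 0.05, size=(4, 12)),
               rng.normal(0.8, 0.05, size=(4, 12))])

# Pair-wise Euclidean dissimilarity, then the Gaussian affinity of Eq. (15).
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
A = np.exp(-D ** 2 / (2 * 0.5 ** 2))  # sigma = 0.5

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
coords = SpectralEmbedding(n_components=2,
                           affinity="precomputed").fit_transform(A)
print(labels)        # cluster assignment for each text
print(coords.shape)  # 8 texts embedded in a 2D plane
```

Both estimators accept a precomputed affinity matrix, which is how the dissimilarity matrices of Section 3.3 plug into the pipeline; the 2D coordinates can then be scattered as in Figure 5.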
We can perform data clustering on all the texts with the use of the dissimilarity matrix. If we classify the texts into two categories, then the result divides exactly between the native and the translated texts. It can be observed that the native Chinese texts concentrate on the right side, while the translated Buddhist texts concentrate on the left side. In short, the native texts and the translated texts are distributed in different regions, which shows that the features selected on the basis of lexical evolution in this article indeed reflect the different linguistic styles of the native language itself and of the product of indirect language contact, respectively.
It was reported in the literature that the arrival of Buddhism in China and the translation of Buddhist scriptures into Chinese introduced an unprecedentedly large number of Sanskrit words into Chinese. Obviously, we can utilize the loan words as the stylistic features to distinguish the translated texts from the native texts. However, we here conclude that the two categories of texts also greatly differ in the word usage for the common concepts.
Furthermore, the points representing different texts by the same author (for example, Lokakema, Zhi Qian, or Dharmaraka) are located relatively close to each other, indicating that the preferred usage of word forms does reflect the habit of a certain author and the stage of language evolution the author was in.
Inside the translated text category, the texts labeled with “LO-*” and “ZQ-*” are mainly located in the upper side, whereas most of the other translated Buddhist texts are in the lower side. The texts labeled with “LO-*,” where the symbol “*” refers to a certain number, were translated by Lokakema, a Buddhist monk of Gandharan origin who traveled to China during the Eastern Han dynasty and translated Buddhist texts into Chinese, and, as such, became a prominent figure in Chinese Buddhism. The texts labeled with “ZQ-*” were translated by Zhi Qian (支谦), a disciple of Zhi Liang (支亮), who in turn had been a disciple of the aforementioned famous translator Lokakema. Thus, it is very reasonable that they had similar translation styles. In addition, most texts located in the upper side dated from Eastern Han dynasty to Three Kingdoms (25–280 A.D.), i.e., the early phase of the Middle Ages. As for the lower side in the translated text category, we notice that most of the texts here spanned from West Jin dynasty to Sui dynasty (265–618 A.D.), i.e., the late phase of the Middle Ages. Thus, they manifest a different style compared with that of the early phase. It is concluded that the word form features representing the same concepts can indeed reflect the evolution of language.
Inside the native text category, Qi Min Yao Shu (or Essential Techniques for the Welfare of the People, 齐民要术), compiled by Jia Sixie of the Northern Wei Dynasty, is one of the earliest and most systematic agricultural works in China. It has always been considered relatively colloquial, and from the perspective of common words, it is the closest to the Buddhist scriptures among all native texts. The text labeled with “YX” is Luoyang Qielan Ji (or Record of Buddhist Temples in Luoyang, 洛阳伽蓝记, around 547 A.D.). Its author, Yang Xuanzhi, was a native writer who served as a government official at that time. The book records the origins and changes of the Buddhist temples of Luoyang City in the Northern Wei Dynasty, when Buddhism reached its pinnacle there and over 1,000 Buddhist temples stood in Luoyang city alone. Even though there are many loan words in the text, we correctly classify it into the native category with the proposed lexical features.
In the visualization of all the texts by low-dimensional embedding, not only can we observe the above-mentioned trends, but we should also pay attention to the individuality of each text. For example, Lokakema belongs to the Eastern Han Dynasty, while Zhi Qian belongs to the Three Kingdoms period. Zhi Qian is later than Lokakema and would thus generally be expected to prefer newer words. In fact, however, Zhi Qian’s grandfather had already settled in China, and Zhi Qian had long been living in the native Chinese environment; hence, his works have stronger literary characteristics than those of Lokakema. For another example, Yan Shi Jia Xun (or The Family Motto of Yan), labeled as YZ, was created by Yan Zhitui, who recorded his personal experience, thoughts, and knowledge to educate his descendants. Although the text was written in the Sui Dynasty, it has a pronounced literary color. In contrast, the earlier Qi Min Yao Shu (labeled as JS) has a stronger colloquial color.
With regard to the different types of distances, the clustering results and the low-dimensional visualizations, though slightly different, are largely the same. Specifically, the Buddhist scriptures generally fall into one category and the native texts into the other, and the distinction between the two categories is significant. Within the same category, the arrangements are also similar: the texts labeled with ZQ-* (usually ZQ-1) lie at one end, and the text labeled with ZF at the other; among the native texts, the text labeled with JS is closest to the Buddhist texts, while the text labeled with YZ is relatively far away. The clustering results based on all four types of distance are reasonable; the clustering based on the Chebyshev distance is relatively coarse and shows characteristics different from the results with the other three. Considering the relative positions of texts by the same author or from the same period in the visualization, we conclude that the cosine distance and the Taxicab distance are more helpful for text clustering.
The parameter \(\sigma\) also plays an important role in the clustering and embedding results. The larger the parameter, the stronger the affinity of each text pair and the closer the texts are placed; the smaller the parameter, the lower the affinity between text pairs and the farther apart the texts are placed.
Encouragingly, the results of the data analysis in this section partially support the conclusions of previous research. In the transmission of Buddhist scriptures over thousands of years, a large number of false attributions inevitably arose. Taking the translation works of Lokakema as an example, only the texts labeled LO-5 and LO-6 are credible [23]. It was also reported that there is a big difference between the Ban Zhou San Mei Jing (labeled LO-5) and the reliable translation works by Lokakema as well as the Eastern Han Buddhist scriptures from the perspective of vocabulary, grammar, and other phenomena [6]. From the analysis of the visualized results, it is clear that the text labeled LO-6 is relatively far from the other texts labeled LO-1 to LO-5 and LO-7. Among all the texts attributed to Lokakema except LO-6 itself, the text labeled LO-5 lies closest to LO-6. This is consistent with the above-mentioned academic research.

5 Conclusion

To fill the gap in the quantitative analysis of the stylistic systems of Middle Chinese, we have formulated and implemented a novel approach consisting of a series of steps. In the preparatory stage, we screened some entries from the Swadesh list to construct the lexical features. Each feature corresponds to a group of synonyms, of which some were often used in the early phase, whereas the others in the later phase, i.e., there existed word substitutions among these synonyms along with the language evolution during the Middle Ages. In addition, the usage of the featured words reflects the authors’ wording habits. Next, we counted the different word forms of those features. Then, we estimated the feature parameters with Bayesian theory and calculated the dissimilarity matrix of the texts. Finally, we performed a spectral clustering analysis to visualize, categorize, and analyze the linguistic styles of Middle Chinese texts.
The quantitative results show that the features, which meet the requirements that they both appear as pervasively as possible and reflect the language dynamics, provide a reasonable and effective lexical lens to observe the Middle Chinese texts. By means of mathematical formulation, we obtained a quantitative estimate of the linguistic features. Moreover, such a mathematical formulation still works even when certain word forms are absent in a given text for a given feature, because the Bayesian method has been adopted. The quantitative conclusion deduced by our approach perfectly agrees with that based on scholars’ experience, and furthermore, it helps us get a direct and vivid impression of the linguistic styles of Middle Chinese texts and betters our understanding of the stylistic differences from both the inter-category and the intra-category aspects. It manifests the special styles induced by the indirect language contact in a quantitative manner.
Theoretically, we propose a machine learning framework based on random vectors obeying posterior probability distributions, which is very different from existing machine learning on datasets with deterministic feature values. For the unsupervised learning task of clustering, this article considers the expectation of distances between random vectors, which includes both the difference between the expectations of the two random vectors and their respective uncertainties. We also demonstrate the consistency between the proposed distances of random vectors and those of vectors with deterministic values. Therefore, the clustering framework proposed in this article is indeed a generalization of the existing ones. Also, we provide a basis that can be used for stylistic color analysis or text clustering from the perspective of language evolution. Along with the evolution of words, different forms can exist to express the same concept. Thus, we can compare the proportions of these forms. If the word forms related to some concept appear relatively rarely, then our estimates are less accurate; conversely, we can more accurately obtain the author’s preference for different word forms. This helps us analyze literature spanning thousands of years while fully accounting for the degree of uncertainty in the estimation of authors’ word preferences.
The applicability of the proposed approach and a more in-depth analysis of the Middle Chinese texts require further investigation. In future studies, we will take a bigger collection of texts and a broader range of features into account to get a fuller picture of the long-term lexical style dynamics along with the evolution of the Chinese language.

References

[1]
Purushottam Vishvanath Bapat (Ed.). 1956. 2500 years of Buddhism. New Delhi: Ministry of Information & Broadcasting.
[2]
Meng Cao. 2017. An examination of “I” and “Myself” in Lun Heng, theory of practice, the family instructions of Master Yan and one hundred Buddhist parables. J. Tongren Univ. 19, 7 (2017), 77–81.
[3]
Kristof Coussement and Dirk Van den Poel. 2008. Improving customer complaint management by automatic email classification using linguistic style features as predictors. Decis. Supp. Syst. 44, 4 (2008), 870–882. DOI:
[4]
Hugh Craig. 2004. Stylistic analysis and authorship studies. In A Companion to Digital Humanities, Susan Schreibman, Ray Siemens, and John Unsworth (Eds.). Blackwell Publishing Ltd, Malden, MA, 273–288.
[5]
Fabio Crestani and Heather Du. 2006. Written versus spoken queries: A qualitative and quantitative comparative analysis. J. Amer. Soc. Inf. Sci. Technol. 57, 7 (2006), 881–890. DOI:
[6]
Yixin Fang and Lieguo Gao. 2012. Linguistic Research on Suspected Pseudo-Buddhist Scriptures in the Eastern Han Dynasty. People’s Publishing House, Beijing, China.
[7]
Renchu Guan, Hao Zhang, Yanchun Liang, Fausto Giunchiglia, Lan Huang, and Xiaoyue Feng. 2022. Deep feature-based text clustering and its explanation. IEEE Trans. Knowl. Data Eng. 34, 8 (2022), 3669–3680. DOI:
[8]
Yaakov Hacohen-Kerner, Hananya Beck, Elchai Yehudai, and Dror Mughaz. 2010. Stylistic feature sets as classifiers of documents according to their historical period and ethnic origin. Appl. Artif. Intell. 24, 9 (2010), 847–862. DOI:
[9]
Renkui Hou and Minghu Jiang. 2016. Analysis on Chinese quantitative stylistic features based on text mining. Digit. Scholar. Human. 31, 2 (2016), 357–367. DOI:
[10]
Renkui Hou, Jiang Yang, and Minghu Jiang. 2014. A study on Chinese quantitative stylistic features and relation among different styles based on text clustering. J. Quantit. Ling. 21, 3 (2014), 246–280. DOI:
[11]
A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data clustering: A review. Comput. Surv. 31, 3 (1999), 264–323. DOI:
[12]
Yanzi Jia. 2018. The historical replacement of the causative verb si (食), si (饲) and wei (喂). Res. Ancient Chinese Lang. 2 (2018), 17–27.
[13]
Martin Joos. 1967. The Five Clocks: A Linguistic Excursion into the Five Styles of English Usage (5th ed.). Harcourt Trade Publishers, New York.
[14]
Jakub Kabala. 2018. Computational authorship attribution in medieval Latin corpora: The case of the Monk of Lido (ca. 1101–08) and Gallus Anonymous (ca. 1113–17). Lang. Resour. Eval. 54, 1 (2018), 25–56. DOI:
[15]
Osama Khalid and Padmini Srinivasan. 2020. Style matters! Investigating linguistic style in online communities. In Proceedings of the 14th International AAAI Conference on Web and Social Media (ICWSM’20). 10.
[16]
Moshe Koppel. 2002. Automatically categorizing written texts by author gender. Liter. Ling. Comput. 17, 4 (2002), 401–412. DOI:
[17]
Moshe Koppel and Yaron Winter. 2013. Determining if two documents are written by the same author. J. Assoc. Inf. Sci. Technol. 65, 1 (2013), 178–187. DOI:
[18]
Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Determining an author’s native language by mining a text for errors. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD’05). 624. DOI:
[19]
A. A. Kramarenko, K. A. Nekrasov, V. V. Filimonov, A. A. Zhivoderov, and A. A. Amieva. 2017. A stylistic classification of Russian-language texts based on the random walk model. In Proceedings of the 3rd Electronic and Green Materials International Conference (EGM’17). 020072. DOI:
[20]
Xin-Ye Li and Li-jie Guo. 2012. Constructing affinity matrix in spectral clustering based on neighbor propagation. Neurocomputing 97 (2012), 125–130. DOI:
[21]
Ying Liu and TianJiu Xiao. 2020. A stylistic analysis for Gu Long’s Kung Fu novels. J. Quantit. Ling. 27, 1 (2020), 32–61. DOI:
[22]
S. Lukaszyk. 2004. A new concept of probability metric and its applications in approximation of scattered data sets. Computat. Mechan. 33, 4 (March 2004), 299–304. DOI:
[23]
Cheng Lyu. 1980. A New Catalogue of the Tripitaka in Chinese. Qilu Press, Jinan, China.
[24]
Geoffrey J. McLachlan and David Peel. 2000. Finite Mixture Models. John Wiley & Sons, New York.
[25]
Barbara Meisterernst. 2018. Buddhism and Chinese linguistics. In Buddhism and Linguistics, Manel Herat (Ed.). Springer International Publishing, Cham, 123–148.
[26]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. Retrieved from http://arxiv.org/abs/1301.3781
[27]
Robert McColl Millar (Ed.). 2015. Trask’s Historical Linguistics (3rd ed.). Routledge, Taylor and Francis Group.
[28]
Morris Swadesh. 1952. Lexicostatistic dating of prehistoric ethnic contacts. Proc. Amer. Philos. Soc. 96 (1952), 452–463.
[29]
Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. 2001. On spectral clustering: Analysis and an algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems. Vancouver, British Columbia, Canada, 849–856.
[30]
Daniel W. Otter, Julian R. Medina, and Jugal K. Kalita. 2021. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32, 2 (February 2021), 604–624.
[31]
Mark Pagel, Quentin D. Atkinson, and Andrew Meade. 2007. Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 7163 (October 2007), 717–720. DOI:
[32]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
[33]
Bing Qiu. 2018. Exploring and analyzing the contact-induced semantic transferring cases based on a Sanskrit-Chinese parallel corpus. In Chinese Lexical Semantics, Jia-Fei Hong, Qi Su, and Jiun-Shiung Wu (Eds.). Vol. 11173. Springer International Publishing, Cham, 167–174.
[34]
Bing Qiu and Jie Li. 2014. Reconstruction of uncertain historical evolution of the polysyllablization of Chinese lexis. J. Appl. Math. 2014 (2014), 1–9. DOI:
[35]
Qige Song, Yafei Sang, Yongzheng Zhang, Shuhao Li, and Xiaolin Xu. 2024. SepBIN: Binary feature separation for better semantic comparison and authorship verification. IEEE Trans. Inf. Forens. Secur. 19 (2024), 1372–1387. DOI:
[36]
Sven Strömqvist, Victoria Johansson, Sarah Kriz, Hrafnhildur Ragnarsdóttir, Ravid Aisenman, and Dorit Ravid. 2002. Toward a crosslinguistic comparison of lexical quanta in speech and writing. Writ. Lang. Liter. 5 (2002), 45–67. DOI:
[37]
Weihui Wang. 2018. Research on the History and Current Situation of Chinese Core Words (汉语核心词的历史与现状研究). The Commercial Press, Beijing.
[38]
William S.-Y. Wang and Chaofen Sun (Eds.). 2015. The Oxford Handbook of Chinese Linguistics. Oxford University Press, Oxford, New York.
[39]
Jiajuan Xiong and Barbara Meisterernst. 2019. The syntax and the semantics of the deontic modals yīng 應 and dāng 當 in early Buddhist texts. In New Perspectives on Aspect and Modality in Chinese Historical Linguistics, Barbara Meisterernst (Ed.). Vol. 5. Springer Singapore, Singapore, 191–220.
[40]
Zheng-Sheng Zhang. 2019. Visualizing stylistic differences in Chinese synonyms. In Computational and Corpus Approaches to Chinese Language Learning, Xiaofei Lu and Berlin Chen (Eds.). Springer Singapore, Singapore, 145–172.
[41]
Qingzhi Zhu. 2001. A preliminary study of Buddhism mixed Chinese (佛教混合汉语初论 in Chinese). In Essays on Linguistics (语言学论丛 in Chinese). Shangwu Press, Beijing, China, 1–33.
[42]
Qingzhi Zhu. 2012. Determining the period when several first person pronouns disappeared in the spoken ancient Chinese. Stud. Chinese Lang. 3 (2012), 195–210.
[43]
Qingzhi Zhu and Bohan Li. 2018. The language of Chinese Buddhism: From the perspective of Chinese historical linguistics. Int. J. Chinese Ling. 5, 1 (August 2018), 1–32. DOI:
[44]
Erik Zürcher. 2007. The Buddhist Conquest of China: The Spread and Adaptation of Buddhism in Early Medieval China (3rd ed.). Brill, Leiden.
[45]
Sanja Štajner and Marcos Zampieri. 2013. Stylistic changes for temporal text classification. In Text, Speech, and Dialogue, Ivan Habernal and Václav Matoušek (Eds.). Vol. 8082. Springer, Berlin, 519–526.
Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 7
July 2024
254 pages
EISSN: 2375-4702
DOI: 10.1145/3613605

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 June 2024
    Online AM: 28 May 2024
    Accepted: 09 May 2024
    Revised: 24 January 2024
    Received: 21 October 2022
    Published in TALLIP Volume 23, Issue 7

    Author Tags

1. Stylistic analysis
2. Middle Chinese
3. Swadesh list
4. lexical feature
    Funding Sources

    • National Social Science Fund of China
    • National Youth Talent Support Program of China