8000 GitHub - jeremiepoiroux/kanji-list
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

jeremiepoiroux/kanji-list

Repository files navigation

A data-driven ordered list of Kanjis

When learning Japanese, kanjis are, soon after knowing Katakanas and Hiragana, one of the main time-consuming tasks. Many resources exist to learn (to write, to recognize or both) thousands of kanjis. However, as students may know and expect, the number of kanjis to learn and the order of the list is crucial. The state of the art about this topic is rather old, and many student still rely on learning the Kyoiku Kanjis (1 026 kanjis learnt in school) and the Jōyō Kanjis (1 110 kanjis learnt in high school) with a quite popular method: Remembering the Kanji (RTK), a 1977 book by James Heisig.

With this project, I would like to suggest a new list in order to mingle contemporary quantitative methods and learning goals. Note that this project will eventually fuel a scientific paper. If you are interested in publications on the topic, you may have a look at my reading list.

A new list: filtering the relevant kanjis

First of all, the task is to filter, from thousands of kanjis, a list with relevant (useful) kanjis. The most popular lists (from MEXT, RTK, JLPT, etc.) seem to have been filtered by popularity, although the criterias are unknown. They generally feature 2 000+ kanjis, a reliable number to read the majority of common texts (especially newspaper articles). Projects which try to objectify the filtering exist, like Kanji Usage Frequency. Dmitry Shpika had a similar goal to us, which was providing answers to the following question:

"in which order should I learn Japanese kanji if I have a goal of reading some specific type of texts, e.g. news or fiction?".

For our project, we decided to analyse the Japanese Wikipedia, a large database of texts about various topics. Thus, we assume the use of a great number of kanjis. We tested our calculations on the 5 millions sentences of the Japanese wiki dump sentence dataset by Ahmed Sabir. For the project, we then used a complete dump of the Japanese Wikipedia, as of 2022.08.08. The dataset (.json), prepared by Inari Kami is available here. For convenience, I transformed the .json into a .csv file (I will give access to it).

Here are some basic stats about the Japanese Wikipedia dataset:

  • The total number of articles is 1 317 738;
  • The number of occurences of:
    • kanjis, ie. all the 20 992 CJK Unified Ideographs (unicode \u4E00 to \u9FFF), is 858 165 384 (44.68%);
    • katakana is 550 415 364 (26.35%);
    • hiragana is 505 859 344 (28.97%);
  • The unique number of kanjis hits the limit of the CJK Unified Ideograph (20 992). Bonus: if we consider the CJK Unified Ideographs (20 992) as well as the characters in the CJK Extensions (97 870) and the CJK Compatibility Ideographs (1 054), the number of unique "kanjis" (actually, ideographs) would be 31 827.

The next step is to calculate the number of occurences of each kanji and the cumulative frequency. Here, I found the kanjis count fooled by two Wikipedia articles, which are (tadam), Unicode一覧 6000-6FFF and CJK統合漢字 (6300-77FF). As a result, the number of unique Kanjis used in the Japanese Wikipedia excepted in the mentioned articles is 20 001. We now have a .csv file with five columns, the kanjis, the rank (based on the number of occurences), the number of occurrences, the frequency of use and the cumulative frequency. The file is ordered by rank.

As a prelimenary interesting insight, to cover 10% of the total use of kanjis, one only need to know 7 of them. In other words, to know 7 kanjis should be enough to read 10% of the kanjis used in a text:

Kanji Rank Nb. of occ. Frequency Cum. Freq.
1 27 383 261 0.031909 0.031909
2 16 694 015 0.019453 0.051363
3 13 533 379 0.015770 0.067133
4 9 205 888 0.010728 0.077861
5 7 801 260 0.009091 0.086951
6 7 239 535 0.008436 0.095387
7 5 861 052 0.006830 0.102217

This trend can also be verified on the following (basic) graph:

Kanji Occurrences Trend

If we have a look at the .csv, we can see that:

  • 1 337 kanjis are only used once (6.7%):
  • to cover 33% of the total use of kanjis, one only need to know 72 of them;
  • to cover 50% of the total use of kanjis, one only need to know 165 of them;
  • to cover 90% of the total use of kanjis, one only need to know 910 of them;
  • Finally, to cover 99% of the total use of kanjis, one only need to know 2 172 of them.

We will use this threshold to filter our list accordingly. The filtered list may be downloaded here.

As a third step, we may now proof-test the list with regard to Patrick Kandráč's work which resulted in a comparison of the frequency rank of kanjis in seven databases, augmented by one average frequency and two weighted frequencies columns.

AVG FREQ: standard average frequency value of all seven database-specific average values (all databases are treated equally)

FREQ (1): weighted average frequency value of all seven database-specific average values (Google, KUF and 文化庁 treated as superior)

FREQ BIG 5: weighted average frequency value of five database-specific average values (excluding KD and WKFR) (KUF = superior, jisho.org = inferior)

I compare the rank of the frequency of kanjis in the Japanese 2022 Wikipedia with all the 10 datasets, first with the absolute frequency difference, then with a calculation of the Pearson correlation coefficient. Please note that from the merge of the two lists (mine with 2 172 kanjis, Kandráč's with 2 242), the comparison is effective on 1 977 kanjis.

Google KUF MCD 文化庁 jisho.org KD WKFR AVG FREQ FREQ (1) FREQ BIG 5
Diff. Abs. Freq. -15.26 -49.18 -148.75 -21.72 -11.21 -2.44 -6.51 -36.44 -34.17 -53.07
Pearson. Corr. 0.82 0.91 0.85 0.87 0.85 0.76 0.96 0.91 0.91 0.89
Spearman. Corr. 0.83 0.92 0.86 0.88 0.86 0.77 0.96 0.91 0.91 0.90
Kendall. Corr. 0.64 0.75 0.63 0.70 0.67 0.59 0.86 0.74 0.74 0.72

For instance, for a given kanji, if its frequency rank is x in the Japanese Wikipedia, it will be, in average, x - 15.26 in the Google Dataset (Shibano's Google Kanji Data, 2009). Our dataset happen to be, unsurprisingly, very similar to the WKFR one (another Wikipedia set, 2010), but also to the KD dataset (based on the Mainichi Newspaper's articles, 2000-2010). The correlation coefficients (Pearson, Spearman and Kendall) shows that our list is fairly comparable to Kandráč three average lists.

Fourth, we add some information to the .csv and calculate. We use the kanjiapi by Iridium Szreter to add the meaning of each kanji, as well as its Heisig keyword, readings, unicode, stoke count, and school grade level (from 1 to 6 for Kyōiku kanjis, 8 for the remaining Jōyō kanjis, 9 for Jinmeiyō kanjis). New JLPT level (from 5 to 1) was retrieved from a .json by David Gouveia and needed manual check for following kanjis (because of the JLPT change from 4 to 5 levels): 分, 的, 身, 無, 可, 里, 孝, 寿, 吾, 唐, 畑, 坊, 丈, 粧. If you know a reliable source to assign a JLPT level to those kanjis, please let me know. For comparison with other databases, we keep Kandráč's lists (AVG FREQ, FREQ (1) and FREQ BIG 5). The .csv file can be found here. NaN and None values were changed to 0.

We have now finished the first part of our study. In the second part, we will try new methods to order the kanjis.

A new list: ordering the kanjis

Work In Progress

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published
0