- Collect sufficient data in both Chinese and English
- Clean data
- Calculate the entropy of characters or tokens
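
The cleaning step is not spelled out here. As a minimal sketch, assuming that "clean" means keeping only Chinese characters for zh and only lowercase letters and single spaces for en, it could look like the following. The function name `clean_text` and the exact character ranges are illustrative assumptions, not the repository's actual code.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as Chinese text.
_NON_ZH = re.compile(r"[^\u4e00-\u9fff]+")
# Assumption: English text is lowercased and reduced to the letters a-z.
_NON_EN = re.compile(r"[^a-z]+")


def clean_text(text, language):
    """Return text stripped of characters outside the target language.

    Hypothetical helper: language is "zh" or "en", mirroring the --language flag.
    """
    if language == "zh":
        # Drop everything that is not a Chinese character.
        return _NON_ZH.sub("", text)
    if language == "en":
        # Lowercase, then collapse every run of non-letters into one space.
        return _NON_EN.sub(" ", text.lower()).strip()
    raise ValueError("language must be 'zh' or 'en'")


if __name__ == "__main__":
    print(clean_text("你好, world! 123", "zh"))
    print(clean_text("Hello, World! 123", "en"))
```

Dropping punctuation and digits keeps the symbol inventory small, which makes entropy estimates easier to compare across corpora; whether that is the right choice depends on what is being measured.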
Python 3.8 + Ubuntu 18.04
requests
bs4
pwd = tysb
web_crawl.py crawls Baike, wiki, and novel pages. Try:
python web_crawl.py --N 2000 --home_url "http://baike.baidu.com/view/"
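
The internals of web_crawl.py are not shown here. As a rough sketch of the text-extraction step that follows a page download, the example below pulls the visible text out of an HTML string. It deliberately uses the standard library's `html.parser` instead of bs4 so that it runs without extra packages; the class `TextExtractor` and the function `extract_text` are hypothetical names, and the real script may work differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><p>你好</p><script>var x = 1;</script><p>Hello world</p></body></html>"
    print(extract_text(page))
```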
crawl_novel.py crawls novels in both Chinese and English. Try:
python crawl_novel.py --language zh
To calculate the entropy, try:
python calculate_entropy --entropy_type characters --language zh
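
For reference, the quantity being computed can be sketched as the Shannon entropy H = -Σ p(x) log2 p(x) over the observed symbols, measured in bits per symbol. The snippet below is an illustrative implementation under that assumption, not the script's actual code; `shannon_entropy` is a hypothetical name. It accepts a string (characters) or a list of tokens.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the empirical Shannon entropy in bits per symbol.

    symbols: any iterable of hashable items, e.g. a string of characters
    or a list of tokens.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        # Define the entropy of an empty sample as zero.
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    # Character-level entropy of a string.
    print(shannon_entropy("aabb"))
    # Token-level entropy of a whitespace-split sentence.
    print(shannon_entropy("the cat saw the dog".split()))
```

This is the plain plug-in estimate from relative frequencies; on small samples it tends to underestimate the true entropy, which is one reason to collect a large corpus first.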