The universal integrated corpus-building environment.
Expanda is an integrated corpus-building environment that provides pipelines for building a corpus dataset. Building a corpus dataset requires several complicated steps such as parsing, shuffling, and tokenization. When the corpora are gathered from different applications, each source format also has to be parsed separately. Expanda runs all of these steps in a single build, driven by a build configuration.
- Easy to build, and simple to extend with new extensions
- Manages the build environment systematically
- Fast builds through performance optimization (even though it is written in Python)
- Supports multi-processing
- Very low memory usage
- No need to write new code for each corpus; adding a new corpus takes one line of configuration
Expanda depends on the following packages:
- nltk
- ijson
- tqdm>=4.46.0
- mwparserfromhell>=0.5.4
- tokenizers>=0.7.0
- kss==1.3.1
Expanda can be installed using pip as follows:
$ pip install expanda
You can install from source by cloning the repository and running:
$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install
Let's build a Wikipedia dataset with Expanda. First of all, install Expanda:
$ pip install expanda
Next, create a workspace for building the dataset by running:
$ mkdir workspace
$ cd workspace
Then, download a Wikipedia dump file from the Wikimedia dumps site (dumps.wikimedia.org).
In this example, we are going to test with only part of the wiki.
Download the file through the browser, move it to workspace/src, and rename it to wiki.xml.bz2. Alternatively, run the commands below:
$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
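Before moving on, it can be worth confirming that the downloaded file is a readable bzip2 archive. The following is a minimal sketch using only Python's standard `bz2` module; the helper name `peek_bz2` is hypothetical and is not part of Expanda.

```python
import bz2
from itertools import islice


def peek_bz2(path, num_lines=5):
    """Return the first `num_lines` decoded lines of a bzip2-compressed text file."""
    # Mode "rt" decompresses on the fly, so the whole dump is never loaded into memory.
    with bz2.open(path, "rt", encoding="utf-8") as fp:
        return [line.rstrip("\n") for line in islice(fp, num_lines)]
```

Calling `peek_bz2("src/wiki.xml.bz2")` from the workspace directory should return the opening lines of the XML dump; a corrupted or incomplete download would raise an error instead.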
After downloading the dump file, we need to set up the configuration file.
Create an expanda.cfg file with the following contents:
[expanda.ext.wikipedia]
num-cores           = 6
[tokenization]
unk-token           = <unk>
control-tokens      = <s>
                      </s>
                      <pad>
[build]
input-files         =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
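To make the structure of this file concrete, the sketch below parses the same settings with Python's standard `configparser`. Whether Expanda itself reads the file this way is an assumption, and the variable names are illustrative only. The point it shows is that multi-line values such as `control-tokens` and `input-files` are continuation lines, which INI-style parsers recognize only when they are indented.

```python
import configparser

# The same settings as expanda.cfg, with continuation lines indented.
CONFIG_TEXT = """
[expanda.ext.wikipedia]
num-cores = 6

[tokenization]
unk-token = <unk>
control-tokens = <s>
    </s>
    <pad>

[build]
input-files =
    --expanda.ext.wikipedia src/wiki.xml.bz2
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

num_cores = parser.getint("expanda.ext.wikipedia", "num-cores")

# A multi-line value comes back as one string joined by newlines.
control_tokens = parser.get("tokenization", "control-tokens").split()

# Each non-empty line pairs an extension (prefixed with "--") with an input path.
input_files = []
for line in parser.get("build", "input-files").splitlines():
    line = line.strip()
    if not line:
        continue
    extension, path = line.split(maxsplit=1)
    input_files.append((extension.lstrip("-"), path))

print(num_cores, control_tokens, input_files)
```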
The directory structure of workspace should now be as follows:
workspace
├── src
│ └── wiki.xml.bz2
└── expanda.cfg
Now we are ready to build! Run Expanda by using:
$ expanda build
The build then prints output like the following:
[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files ████████████████████ 100
[00:00:04] Tokenize words ████████████████████ 405802 / 405802
[00:00:00] Count pairs ████████████████████ 405802 / 405802
[00:00:01] Compute merges ████████████████████ 6332 / 6332
[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.
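The log above reports that the shuffling step works on offsets rather than on the text itself. As a rough illustration of that idea only, the sketch below shuffles a file's lines while keeping just their byte offsets in memory. It is not Expanda's implementation: the bucket files and stride selection mentioned in the log are omitted, and the function name `shuffle_lines_by_offset` is hypothetical.

```python
import random


def shuffle_lines_by_offset(src_path, dst_path, seed=None):
    """Shuffle the lines of a text file while holding only byte offsets in memory."""
    offsets = []
    with open(src_path, "rb") as src:
        # Record where each line starts instead of keeping its content.
        while True:
            pos = src.tell()
            line = src.readline()
            if not line:
                break
            offsets.append(pos)

        # Shuffle the small list of integers, not the lines themselves.
        random.Random(seed).shuffle(offsets)

        with open(dst_path, "wb") as dst:
            for pos in offsets:
                src.seek(pos)
                line = src.readline()
                # Make sure a final line without a newline does not merge with the next one.
                dst.write(line if line.endswith(b"\n") else line + b"\n")
    return len(offsets)
```

The trade-off in this simplified version is that every output line costs a random seek in the source file, which is the kind of overhead a bucketed approach could reduce.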
If the build succeeds, the workspace contains the following directory tree:
workspace
├── build
│ ├── corpus.raw.txt
│ ├── corpus.train.txt
│ ├── corpus.test.txt
│ └── vocab.txt
├── src
│ └── wiki.xml.bz2
└── expanda.cfg
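A quick way to inspect the result is to count the lines in each generated file. The sketch below assumes the files under build are line-oriented text, which the source does not state explicitly; the helper names `count_lines` and `summarize_build` are hypothetical and not part of Expanda.

```python
from pathlib import Path


def count_lines(path):
    """Count lines by streaming the file, so large corpora are not loaded into memory."""
    with open(path, "rb") as fp:
        return sum(1 for _ in fp)


def summarize_build(build_dir="build"):
    """Map each file name in the build directory to its line count."""
    files = sorted(p for p in Path(build_dir).iterdir() if p.is_file())
    return {p.name: count_lines(p) for p in files}
```

Running `summarize_build()` from the workspace directory returns a dictionary keyed by file name, such as `corpus.train.txt` and `vocab.txt`.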