Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.
A Python wrapper is also available here.
Wasm Demo (takes a little time to load the model.)
Vibrato is a Rust reimplementation of the fast tokenizer MeCab,
although its implementation has been simplified and optimized for even faster tokenization.
Especially for language resources with a large matrix (e.g., unidic-cwj-3.1.1 with a matrix of 459 MiB), Vibrato will run faster thanks to cache-efficient id mappings.
For example, the following figure shows tokenization times for MeCab and its reimplementations. The detailed experimental settings and other results are available on the Wiki.
Vibrato supports options for outputting tokenized results identical to MeCab, such as ignoring whitespace.
Vibrato also supports training parameters (or costs) in dictionaries from your corpora. The detailed description can be found here.
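The Viterbi-based tokenization mentioned above can be illustrated with a small, self-contained sketch. It is not Vibrato's implementation: the `Entry` and `Node` types, the `tokenize` and `toy_model` functions, and all ids and costs are hypothetical and chosen only for illustration. It assumes a MeCab-style model in which each word has a cost and each pair of adjacent words has a connection cost looked up in a matrix by (right id, left id).

```rust
/// A toy dictionary entry: surface form, context ids, and word cost.
struct Entry {
    surface: &'static str,
    left_id: usize,
    right_id: usize,
    cost: i32,
}

/// A lattice node: the word that ends here, the best total cost so far,
/// and a back pointer (end offset and node index of the best predecessor).
struct Node {
    entry: Option<usize>, // None for the BOS node
    right_id: usize,
    total: i32,
    prev: Option<(usize, usize)>,
}

/// Returns the minimum-cost segmentation of `text` and its total cost,
/// or `None` if the dictionary cannot cover the whole text.
/// `matrix[r][l]` is the cost of connecting right id `r` to left id `l`;
/// id 0 doubles as BOS/EOS in this sketch.
fn tokenize(text: &str, dict: &[Entry], matrix: &[Vec<i32>]) -> Option<(Vec<&'static str>, i32)> {
    let n = text.len();
    // nodes[end] lists every lattice node that ends at byte offset `end`.
    let mut nodes: Vec<Vec<Node>> = (0..=n).map(|_| Vec::new()).collect();
    nodes[0].push(Node { entry: None, right_id: 0, total: 0, prev: None });

    for start in 0..n {
        if nodes[start].is_empty() {
            continue; // no path reaches this offset
        }
        for (ei, e) in dict.iter().enumerate() {
            if !text[start..].starts_with(e.surface) {
                continue;
            }
            // Pick the cheapest predecessor, including the connection cost.
            let (pi, best) = nodes[start]
                .iter()
                .enumerate()
                .map(|(i, p)| (i, p.total + matrix[p.right_id][e.left_id]))
                .min_by_key(|&(_, c)| c)?;
            nodes[start + e.surface.len()].push(Node {
                entry: Some(ei),
                right_id: e.right_id,
                total: best + e.cost,
                prev: Some((start, pi)),
            });
        }
    }

    // Connect to EOS and choose the cheapest complete path.
    let (mut idx, total) = nodes[n]
        .iter()
        .enumerate()
        .map(|(i, p)| (i, p.total + matrix[p.right_id][0]))
        .min_by_key(|&(_, c)| c)?;

    // Follow back pointers from the end of the text to BOS.
    let mut pos = n;
    let mut out = Vec::new();
    while let (Some(ei), Some((p, i))) = (nodes[pos][idx].entry, nodes[pos][idx].prev) {
        out.push(dict[ei].surface);
        pos = p;
        idx = i;
    }
    out.reverse();
    Some((out, total))
}

/// A hypothetical model: ids are 0 = BOS/EOS, 1 = noun, 2 = suffix.
fn toy_model() -> (Vec<Entry>, Vec<Vec<i32>>) {
    let dict = vec![
        Entry { surface: "東京", left_id: 1, right_id: 1, cost: 3000 },
        Entry { surface: "京都", left_id: 1, right_id: 1, cost: 3200 },
        Entry { surface: "東", left_id: 1, right_id: 1, cost: 6000 },
        Entry { surface: "都", left_id: 2, right_id: 2, cost: 5000 },
    ];
    let matrix = vec![
        vec![0, 100, 2000],
        vec![200, 500, -1000],
        vec![100, 300, 800],
    ];
    (dict, matrix)
}

fn main() {
    let (dict, matrix) = toy_model();
    let (tokens, cost) = tokenize("東京都", &dict, &matrix).expect("no path");
    println!("{} (cost {})", tokens.join(" / "), cost); // prints "東京 / 都 (cost 7200)"
}
```

Every lattice node reads one matrix cell per predecessor, so with a real dictionary the matrix is accessed very frequently. This is why the memory layout of ids matters for a large matrix.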
This software is implemented in Rust.
First, install rustc and cargo following the official instructions.
You can easily get started with Vibrato by downloading a precompiled dictionary. The Releases page distributes several precompiled dictionaries from different resources.
Here, we use mecab-ipadic v2.7.0.
(Replace VERSION with an appropriate Vibrato release tag, such as v0.5.0.)
$ wget https://github.com/daac-tools/vibrato/releases/download/VERSION/ipadic-mecab-2_7_0.tar.xz
$ tar xf ipadic-mecab-2_7_0.tar.xz
You can also compile or train system dictionaries from your own resources. See the docs for more advanced usage.
To tokenize sentences using the system dictionary, run the following command.
$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst
The resulting tokens are output in the MeCab format.
本	名詞,一般,*,*,*,*,本,ホン,ホン
と	助詞,並立助詞,*,*,*,*,と,ト,ト
カレー	名詞,固有名詞,地域,一般,*,*,カレー,カレー,カレー
の	助詞,連体化,*,*,*,*,の,ノ,ノ
街	名詞,一般,*,*,*,*,街,マチ,マチ
神保	名詞,固有名詞,地域,一般,*,*,神保,ジンボウ,ジンボー
町	名詞,接尾,地域,*,*,*,町,マチ,マチ
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ	感動詞,*,*,*,*,*,ようこそ,ヨウコソ,ヨーコソ
。	記号,句点,*,*,*,*,。,。,。
EOS
If you want to output tokens separated by spaces, specify -O wakati.
$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -O wakati
本 と カレー の 街 神保 町 へ ようこそ 。
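If you post-process the tokenizer's output in your own program, the two output formats above are easy to handle. The following is a minimal sketch, assuming each MeCab-format line is `surface<TAB>features` with comma-separated features and that the sentence ends with an `EOS` line. The `Token` type and the `parse_mecab` and `to_wakati` functions are hypothetical helpers, not part of Vibrato.

```rust
/// One token of MeCab-format output: surface form plus its features.
#[derive(Debug, PartialEq)]
struct Token {
    surface: String,
    features: Vec<String>,
}

/// Parses MeCab-format lines up to (not including) the `EOS` marker.
/// Lines without a tab separator are skipped.
/// Features are split naively on commas; quoted commas are not handled.
fn parse_mecab(output: &str) -> Vec<Token> {
    output
        .lines()
        .take_while(|line| *line != "EOS")
        .filter_map(|line| {
            let (surface, features) = line.split_once('\t')?;
            Some(Token {
                surface: surface.to_string(),
                features: features.split(',').map(str::to_string).collect(),
            })
        })
        .collect()
}

/// Joins the surfaces with single spaces, similar to the `-O wakati` output.
fn to_wakati(tokens: &[Token]) -> String {
    tokens
        .iter()
        .map(|t| t.surface.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let output = "本\t名詞,一般,*,*,*,*,本,ホン,ホン\nと\t助詞,並立助詞,*,*,*,*,と,ト,ト\nEOS\n";
    let tokens = parse_mecab(output);
    println!("{}", to_wakati(&tokens)); // prints "本 と"
}
```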
The distributed models are compressed in zstd format.
If you want to load these compressed models with the vibrato API, you must decompress them outside of the API.
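// Requires the zstd crate (or the ruzstd crate) to decompress the model.
use std::fs::File;
use vibrato::Dictionary;
// Decompress outside the vibrato API, then pass the reader to Dictionary::read.
let reader = zstd::Decoder::new(File::open("path/to/system.dic.zst")?)?;
let dict = Dictionary::read(reader)?;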
Vibrato is a reimplementation of the MeCab algorithm, but with the default settings it can produce different tokens from MeCab.
For example, MeCab ignores spaces (more precisely, SPACE defined in char.def) in tokenization.
$ echo "mens second bag" | mecab
mens	名詞,固有名詞,組織,*,*,*,*
second	名詞,一般,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS
However, Vibrato handles such spaces as tokens with the default settings.
$ echo 'mens second bag' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst
mens	名詞,固有名詞,組織,*,*,*,*
 	記号,空白,*,*,*,*,*
second	名詞,固有名詞,組織,*,*,*,*
 	記号,空白,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS
If you want to obtain the same results as MeCab, specify the arguments -S and -M 24.
$ echo 'mens second bag' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -S -M 24
mens	名詞,固有名詞,組織,*,*,*,*
second	名詞,一般,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS
-S indicates whether spaces are ignored.
-M indicates the maximum grouping length for unknown words.
There are corner cases where tokenization results differ because of cost tiebreakers. However, this is not an essential problem.
You can use your user dictionary along with the system dictionary. The user dictionary must be in the CSV format.
<surface>,<left-id>,<right-id>,<cost>,<features...>
The first four columns are always required.
The others (i.e., <features...>) are optional.
For example,
$ cat user.csv
神保町,1293,1293,334,カスタム名詞,ジンボチョウ
本とカレーの街,1293,1293,0,カスタム名詞,ホントカレーノマチ
ようこそ,3,3,-1000,感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
To use the user dictionary, specify the file with the -u argument.
$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i ipadic-mecab-2_7_0/system.dic.zst -u user.csv
本とカレーの街	カスタム名詞,ホントカレーノマチ
神保町	カスタム名詞,ジンボチョウ
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ	感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
。	記号,句点,*,*,*,*,。,。,。
EOS
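Before passing a user dictionary to the tokenizer, it can help to check that each row follows the `<surface>,<left-id>,<right-id>,<cost>,<features...>` layout described above. The following is a minimal validation sketch, not Vibrato's own loader. The `UserEntry` type and the `parse_user_row` function are hypothetical, the integer widths (`u16` ids, `i16` cost) are an assumption borrowed from MeCab-style dictionaries, and rows are split naively on commas without CSV quoting support.

```rust
/// One row of a user dictionary.
#[derive(Debug, PartialEq)]
struct UserEntry {
    surface: String,
    left_id: u16,  // assumed width
    right_id: u16, // assumed width
    cost: i16,     // assumed width
    features: Vec<String>,
}

/// Parses one CSV row; the first four columns are required,
/// and any remaining columns are kept as features.
fn parse_user_row(row: &str) -> Result<UserEntry, String> {
    let cols: Vec<&str> = row.trim_end().split(',').collect();
    if cols.len() < 4 {
        return Err(format!("expected at least 4 columns, found {}", cols.len()));
    }
    Ok(UserEntry {
        surface: cols[0].to_string(),
        left_id: cols[1].parse::<u16>().map_err(|e| format!("bad left-id: {e}"))?,
        right_id: cols[2].parse::<u16>().map_err(|e| format!("bad right-id: {e}"))?,
        cost: cols[3].parse::<i16>().map_err(|e| format!("bad cost: {e}"))?,
        features: cols[4..].iter().map(|s| s.to_string()).collect(),
    })
}

fn main() {
    let rows = [
        "神保町,1293,1293,334,カスタム名詞,ジンボチョウ",
        "ようこそ,3,3,-1000",
        "神保町,1293",
    ];
    for row in rows {
        match parse_user_row(row) {
            Ok(entry) => println!("ok: {entry:?}"),
            Err(msg) => println!("invalid row {row:?}: {msg}"),
        }
    }
}
```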
The docs directory describes more advanced usage, such as training and benchmarking.
We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.
- https://daac-tools.slack.com/
- Please get an invitation from here.
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
See the guidelines.
Technical details of Vibrato are available in the following resources:
- 神田峻介, 赤部晃一, 後藤啓介, 小田悠介. 最小コスト法に基づく形態素解析におけるCPUキャッシュの効率化, 言語処理学会第29回年次大会 (NLP2023).
- 赤部晃一, 神田峻介, 小田悠介. CRFに基づく形態素解析器のスコア計算の分割によるモデルサイズと解析速度の調整, 言語処理学会第29回年次大会 (NLP2023).
- MeCab互換な形態素解析器Vibratoの高速化技法, LegalOn Technologies Engineering Blog (2022-09-20).