GitHub - WorksApplications/sudachi.rs: Sudachi in Rust 🦀 and new generation of SudachiPy

WorksApplications/sudachi.rs

sudachi.rs - English README

sudachi.rs is a Rust implementation of Sudachi, a Japanese morphological analyzer.

日本語 README (Japanese README).

A Python implementation is also available: see the SudachiPy Documentation.

TL;DR

Install Python version

pip install --upgrade 'sudachipy>=0.6.10'

or Rust version

$ git clone https://github.com/WorksApplications/sudachi.rs.git
$ cd ./sudachi.rs

$ cargo build --release
$ cargo install --path sudachi-cli/
$ ./fetch_dictionary.sh

$ echo "高輪ゲートウェイ駅" | sudachi
高輪ゲートウェイ駅  名詞,固有名詞,一般,*,*,*    高輪ゲートウェイ駅
EOS

Example

Multi-granular Tokenization

$ echo 選挙管理委員会 | sudachi
選挙管理委員会  名詞,固有名詞,一般,*,*,*        選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙    名詞,普通名詞,サ変可能,*,*,*    選挙
管理    名詞,普通名詞,サ変可能,*,*,*    管理
委員    名詞,普通名詞,一般,*,*,*        委員
会      名詞,普通名詞,一般,*,*,*        会
EOS

Normalized Form

$ echo 打込む かつ丼 附属 vintage | sudachi
打込む  動詞,一般,*,*,五段-マ行,終止形-一般     打ち込む
        空白,*,*,*,*,*
かつ丼  名詞,普通名詞,一般,*,*,*        カツ丼
        空白,*,*,*,*,*
附属    名詞,普通名詞,サ変可能,*,*,*    付属
        空白,*,*,*,*,*
vintage 名詞,普通名詞,一般,*,*,*        ビンテージ
EOS

Wakati (space-delimited surface form) Output

$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。
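
Wakati output is plain text with surface forms separated by spaces, so it is easy to post-process. The following is a minimal Python sketch, assuming the output has already been captured as a string. The function name `wakati_tokens` is hypothetical and is not part of sudachi.rs or SudachiPy.

```python
from collections import Counter


def wakati_tokens(output: str) -> list[list[str]]:
    """Split captured `sudachi --wakati` output into one token list per line.

    str.split() with no argument splits on any whitespace and drops empty
    strings, so tokens that consist only of whitespace are not preserved.
    """
    return [line.split() for line in output.splitlines() if line.strip()]


# Last output line of the lemon.txt example above.
sample = "それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。\n"

lines = wakati_tokens(sample)
counts = Counter(token for line in lines for token in line)

print(lines[0][:3])   # first three surface forms of the line
print(counts["た"])   # how often the token "た" occurs
```

From here the token lists can be passed to whatever consumes space-delimited text, such as a word-frequency count or a search indexer.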

Setup

You need sudachi.rs, the default plugins, and a dictionary. (This crate does not include a dictionary.)

1. Get the source code

git clone https://github.com/WorksApplications/sudachi.rs.git

2. Download a Sudachi Dictionary

Sudachi requires a dictionary to operate. Download a dictionary ZIP file from WorksApplications/SudachiDict (choose one of small, core, or full), unzip it, and place the system_*.dic file somewhere. With the default settings file, sudachi.rs expects the dictionary at resources/system.dic.
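
The manual placement step can be sketched in Python as follows. This is an illustration only, assuming the ZIP file has already been unzipped; the function name `install_dictionary` and both directory arguments are hypothetical, and only the `system_*.dic` pattern and the `resources/system.dic` target come from the text above.

```python
import shutil
from pathlib import Path


def install_dictionary(unzipped_dir: Path, repo_root: Path) -> Path:
    """Copy the system_*.dic found under unzipped_dir to resources/system.dic.

    Raises FileNotFoundError unless exactly one matching file is found, so an
    ambiguous directory (for example, one holding both small and core) is
    never resolved by guessing.
    """
    candidates = sorted(unzipped_dir.rglob("system_*.dic"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one system_*.dic under {unzipped_dir}, "
            f"found {len(candidates)}"
        )
    target = repo_root / "resources" / "system.dic"
    target.parent.mkdir(parents=True, exist_ok=True)
    # An existing resources/system.dic is overwritten.
    shutil.copyfile(candidates[0], target)
    return target
```

A call such as `install_dictionary(Path("path/to/unzipped"), Path("sudachi.rs"))` would then leave the dictionary where the default settings file looks for it.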

Convenience Script

Optionally, you can use the fetch_dictionary.sh shell script to download a dictionary and install it to resources/system.dic (this overwrites any existing file at that path).

# fetch latest core dictionary
./fetch_dictionary.sh

# fetch dictionary of specified version and type
./fetch_dictionary.sh 20241021 small

3. Build

cargo build --release

Build (bake dictionary into binary)

This feature is currently unimplemented and does not work; see #35.

Specify the bake_dictionary feature to embed a dictionary into the binary, so that the sudachi executable itself contains the dictionary. The baked dictionary is used if no dictionary is specified via a CLI option or the settings file.

When building, you must specify the path to the dictionary file in the SUDACHI_DICT_PATH environment variable. SUDACHI_DICT_PATH is either absolute or relative to the sudachi.rs directory.

Example on Unix-like system:

# Download dictionary to resources/system.dic
$ ./fetch_dictionary.sh

# Build with bake_dictionary feature (relative path)
$ env SUDACHI_DICT_PATH=resources/system.dic cargo build --release --features bake_dictionary

# or

# Build with bake_dictionary feature (absolute path)
$ env SUDACHI_DICT_PATH=/path/to/my-sudachi.dic cargo build --release --features bake_dictionary
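
The path rule for SUDACHI_DICT_PATH (absolute, or relative to the sudachi.rs directory) can be illustrated with a short Python sketch. It only mirrors the rule as stated here and is not the project's build script; the function name `resolve_dict_path` and the checkout location `/home/user/sudachi.rs` are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_dict_path(value: str, repo_root: PurePosixPath) -> PurePosixPath:
    """Use absolute paths as-is; join relative paths onto repo_root."""
    path = PurePosixPath(value)
    return path if path.is_absolute() else repo_root / path


# Hypothetical checkout location on a Unix-like system.
repo = PurePosixPath("/home/user/sudachi.rs")

print(resolve_dict_path("resources/system.dic", repo))
print(resolve_dict_path("/path/to/my-sudachi.dic", repo))
```

`PurePosixPath` is used so that the sketch behaves the same on every platform, matching the Unix-like examples above.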

4. Install

$ cd sudachi.rs/
$ cargo install --path sudachi-cli/

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer
...

Usage as a command

$ sudachi -h
A Japanese tokenizer

Usage: sudachi [OPTIONS] [FILE] [COMMAND]

Commands:
  build
          Builds system dictionary
  ubuild
          Builds user dictionary
  dump

  help
          Print this message or the help of the given subcommand(s)

Arguments:
  [FILE]
          Input text file: If not present, read from STDIN

Options:
  -r, --config-file <CONFIG_FILE>
          Path to the setting file in JSON format
  -p, --resource_dir <RESOURCE_DIR>
          Path to the root directory of resources
  -m, --mode <MODE>
          Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]
  -o, --output <OUTPUT_FILE>
          Output text file: If not present, use stdout
  -a, --all
          Prints all fields
  -w, --wakati
          Outputs only surface form
  -d, --debug
          Debug mode: Print the debug information
  -l, --dict <DICTIONARY_PATH>
          Path to sudachi dictionary. If None, it refer config and then baked dictionary
      --split-sentences <SPLIT_SENTENCES>
          How to split sentences [default: yes]
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a (--all) flag, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1 if a word is Out-of-Vocabulary (not in the dictionary)
  • Synonym group IDs
  • (OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachi -a
外国人参政権    名詞,普通名詞,一般,*,*,*        外国人参政権    外国人参政権    ガイコクジンサンセイケン      0       []
EOS
$ echo "阿quei" | sudachipy -a
阿      名詞,普通名詞,一般,*,*,*        阿      阿              -1      []      (OOV)
quei    名詞,普通名詞,一般,*,*,*        quei    quei            -1      []      (OOV)
EOS
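
The column layout described above maps naturally onto a small record type. The following Python sketch parses captured `-a` output. The names `Morpheme` and `parse_all_fields` are hypothetical. The sample lines are re-typed from the example with explicit tab characters, and the sketch assumes the reading form column is empty for the OOV word, as the spacing in the example suggests.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str
    pos: list[str]            # part-of-speech tags, split on commas
    normalized_form: str
    dictionary_form: str
    reading_form: str
    dictionary_id: int        # 0 system, 1+ user dictionaries, -1 OOV
    synonym_group_ids: str    # kept as the raw text, e.g. "[]"
    is_oov: bool


def parse_all_fields(output: str) -> list[Morpheme]:
    """Parse tab-separated `sudachi -a` output, skipping EOS marker lines."""
    morphemes = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        cols = line.split("\t")
        morphemes.append(Morpheme(
            surface=cols[0],
            pos=cols[1].split(","),
            normalized_form=cols[2],
            dictionary_form=cols[3],
            reading_form=cols[4],
            dictionary_id=int(cols[5]),
            synonym_group_ids=cols[6],
            is_oov=len(cols) > 7 and cols[7] == "(OOV)",
        ))
    return morphemes


sample = (
    "外国人参政権\t名詞,普通名詞,一般,*,*,*\t外国人参政権\t外国人参政権"
    "\tガイコクジンサンセイケン\t0\t[]\n"
    "EOS\n"
    "阿\t名詞,普通名詞,一般,*,*,*\t阿\t阿\t\t-1\t[]\t(OOV)\n"
    "EOS\n"
)

for m in parse_all_fields(sample):
    print(m.surface, m.dictionary_id, m.is_oov)
```

The synonym group IDs are left as raw text because the example only shows the empty list `[]`.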

When you add the -w (--wakati) flag, it outputs space-delimited surface forms instead.

$ echo "外国人参政権" | sudachi -m A -w
外国 人 参政 権

API

See API reference page.

ToDo

  • Out of Vocabulary handling
  • Easy dictionary file install & management, similar to SudachiPy
  • Registration to crates.io

References

Sudachi

Morphological Analyzers in Rust

Logo
