phonepiece is a library for managing phone inventories. It also includes a few linguistic and phonetic tools.
It is mainly intended for use in the following projects, but it can also be used as a standalone library:
- allosaurus: phone recognition toolkit
- transphone: grapheme-to-phoneme toolkit
- asr2k: speech recognition systems for 2000 languages
phonepiece can be installed with pip:
pip install phonepiece
You can also clone this repository and install it from source:
python setup.py install
The main feature of phonepiece is inventory lookup.
An inventory typically contains the following information:
phoneme
: language-dependent units
phone
: language-independent units
allophone
: the mapping between phone and phoneme
A simple usage is as follows:
In [1]: from phonepiece import read_inventory
In [2]: eng = read_inventory('eng')
In [3]: eng
Out[3]: <Inventory eng phoneme: 40, phone: 46>
In [4]: eng.phoneme
Out[4]: <Unit: 40 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'd': 3, 'd͡ʒ': 4, 'e': 5, 'f': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 's': 16, 't': 17, 't͡ʃ': 18, 'u': 19, 'v': 20, 'w': 21, 'z': 22, 'æ': 23, 'ð': 24, 'ŋ': 25, 'ɑ': 26, 'ɔ': 27, 'ə': 28, 'ɛ': 29, 'ɡ': 30, 'ɪ': 31, 'ɹ': 32, 'ɹ̩': 33, 'ʃ': 34, 'ʊ': 35, 'ʌ': 36, 'ʒ': 37, 'θ': 38, '<eos>': 39}>
In [5]: eng.phone
Out[5]: <Unit: 46 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'b̥': 3, 'd': 4, 'dʒ': 5, 'd̥': 6, 'e': 7, 'f': 8, 'g': 9, 'h': 10, 'i': 11, 'j': 12, 'k': 13, 'kʰ': 14, 'l': 15, 'm': 16, 'n': 17, 'o': 18, 'p': 19, 'pʰ': 20, 's': 21, 't': 22, 'tʃ': 23, 'tʰ': 24, 'u': 25, 'v': 26, 'w': 27, 'z': 28, 'æ': 29, 'ð': 30, 'ŋ': 31, 'ɑ': 32, 'ɔ': 33, 'ə': 34, 'ɛ': 35, 'ɡ̥': 36, 'ɪ': 37, 'ɹ': 38, 'ɹ̩': 39, 'ʃ': 40, 'ʊ': 41, 'ʌ': 42, 'ʒ': 43, 'θ': 44, '<eos>': 45}>
In [6]: eng.phoneme2phone
Out[6]:
defaultdict(list,
{'a': ['a'],
'b': ['b', 'b̥'],
'd': ['d', 'd̥'],
'd͡ʒ': ['dʒ'],
'e': ['e'],
'f': ['f'],
'h': ['h'],
'i': ['i'],
'j': ['j'],
'k': ['kʰ', 'k'],
'l': ['l'],
'm': ['m'],
'n': ['n'],
'o': ['o'],
'p': ['pʰ', 'p'],
's': ['s'],
't': ['tʰ', 't'],
't͡ʃ': ['tʃ'],
'u': ['u'],
'v': ['v'],
'w': ['w'],
'z': ['z'],
'æ': ['æ'],
'ð': ['ð'],
'ŋ': ['ŋ'],
'ɑ': ['ɑ'],
'ɔ': ['ɔ'],
'ə': ['ə'],
'ɛ': ['ɛ'],
'ɡ': ['g', 'ɡ̥'],
'ɪ': ['ɪ'],
'ɹ': ['ɹ'],
'ɹ̩': ['ɹ̩'],
'ʃ': ['ʃ'],
'ʊ': ['ʊ'],
'ʌ': ['ʌ'],
'ʒ': ['ʒ'],
'θ': ['θ'],
'<blk>': ['<blk>'],
'<eos>': ['<eos>']})
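Because phoneme2phone maps each phoneme to its list of phones, the reverse lookup (from a phone to the phonemes it can realize) can be derived by inverting it. The following is a minimal sketch rather than part of the phonepiece API: a plain dict holding a few entries copied from the output above stands in for a real inventory, and the helper name `invert_mapping` is hypothetical. The library may offer its own way to do this.

```python
from collections import defaultdict

# A few entries copied from the eng.phoneme2phone output above.
# A plain dict stands in for the real inventory (assumption for illustration).
phoneme2phone = {
    'b': ['b', 'b̥'],
    'k': ['kʰ', 'k'],
    'p': ['pʰ', 'p'],
    't': ['tʰ', 't'],
    'æ': ['æ'],
}


def invert_mapping(phoneme2phone):
    """Build a phone -> list-of-phonemes mapping (hypothetical helper)."""
    phone2phoneme = defaultdict(list)
    for phoneme, phones in phoneme2phone.items():
        for phone in phones:
            phone2phoneme[phone].append(phoneme)
    return dict(phone2phoneme)


phone2phoneme = invert_mapping(phoneme2phone)
print(phone2phoneme['kʰ'])  # the aspirated phone maps back to the phoneme 'k'
```

A list is used as the value type because, in general, the same phone can be an allophone of more than one phoneme.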
The library also provides a tokenizer that splits a concatenated IPA string into individual IPA symbols:
In [1]: from phonepiece.ipa import read_ipa
In [2]: ipa = read_ipa()
In [3]: ipa.tokenize('kʰæt')
Out[3]: ['kʰ', 'æ', 't']
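One common way to implement this kind of segmentation is greedy longest-match against a set of known units. The sketch below illustrates that idea only; it is an assumption about how such a tokenizer could work, not a description of how phonepiece implements `tokenize`. The function name `tokenize_greedy` and the toy unit set are hypothetical.

```python
def tokenize_greedy(text, units):
    """Split text into known units, preferring the longest match at each position."""
    max_len = max(len(u) for u in units)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in units:
                tokens.append(candidate)
                pos += size
                break
        else:
            # No unit of any length matched at this position.
            raise ValueError(f"no known unit at position {pos}: {text[pos:]!r}")
    return tokens


# Toy unit set (an assumption for illustration, not phonepiece's IPA table).
units = {'k', 'kʰ', 't', 'tʰ', 'æ'}
print(tokenize_greedy('kʰæt', units))  # ['kʰ', 'æ', 't']
```

Trying the longest candidate first is what keeps 'kʰ' together instead of splitting it into 'k' followed by an unknown modifier.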
phonological_distance is an augmented edit distance that also takes the phonological distance between phones into account:
In [1]: from phonepiece.distance import phonological_distance
In [2]: phonological_distance('a', 'b')
Out[2]: 0.5862068965517241
In [3]: phonological_distance('a', 'e')
Out[3]: 0.03448275862068961
In [4]: phonological_distance('a', 'bc')
Out[4]: 1.5862068965517242
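The idea of an augmented edit distance can be sketched as a standard Levenshtein recurrence in which the substitution cost depends on how similar the two symbols are. The code below is an illustration under stated assumptions, not phonepiece's implementation: the unit insertion/deletion cost is inferred from the outputs above (the third result equals the first plus one), the substitution costs are simply the two values copied from the session, and unknown pairs default to 1.0. The names `weighted_edit_distance`, `TOY_COSTS`, and `toy_sub_cost` are hypothetical.

```python
def weighted_edit_distance(src, tgt, sub_cost, indel_cost=1.0):
    """Edit distance where substituting x with y costs sub_cost(x, y)."""
    rows, cols = len(src) + 1, len(tgt) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # delete src[i-1]
                dist[i][j - 1] + indel_cost,  # insert tgt[j-1]
                dist[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),  # substitute
            )
    return dist[-1][-1]


# Toy substitution costs copied from the session above.
# Any pair not listed costs 1.0 (assumption for illustration).
TOY_COSTS = {
    ('a', 'b'): 0.5862068965517241,
    ('a', 'e'): 0.03448275862068961,
}


def toy_sub_cost(x, y):
    if x == y:
        return 0.0
    return TOY_COSTS.get((x, y), TOY_COSTS.get((y, x), 1.0))


print(weighted_edit_distance(['a'], ['e'], toy_sub_cost))       # about 0.034
print(weighted_edit_distance(['a'], ['b', 'c'], toy_sub_cost))  # about 1.586
```

In the second call the cheapest alignment substitutes 'a' with 'b' and inserts 'c', which is why similar sounds yield smaller distances than a plain edit distance would.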
It also includes many lexicon dictionaries, which let you look up the pronunciation of a particular word (if it exists). The output phonemes are consistent with the phoneme space of that language's inventory:
In [1]: from phonepiece.lexicon import read_lexicon
In [2]: eng = read_lexicon('eng')
In [3]: eng['hello']
Out[3]: ['h', 'ʌ', 'l', 'o', 'w']
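The consistency claim can be checked by testing that every phoneme in a pronunciation belongs to the language's phoneme inventory. This is a minimal sketch using literal data copied from the sessions above (a subset of the English phonemes and the pronunciation of "hello") instead of calling the library; the helper name `out_of_inventory` is hypothetical.

```python
# Subset of the English phoneme inventory shown earlier (copied by hand).
eng_phonemes = {'h', 'l', 'o', 'w', 'ʌ', 'ə', 'ɛ'}

# Pronunciation of "hello" as returned by the lexicon lookup above.
pronunciation = ['h', 'ʌ', 'l', 'o', 'w']


def out_of_inventory(pronunciation, phonemes):
    """Return the phonemes in a pronunciation that the inventory does not contain."""
    return [p for p in pronunciation if p not in phonemes]


print(out_of_inventory(pronunciation, eng_phonemes))  # [] when everything is in the inventory
```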
| model | # supported languages | description |
|---|---|---|
| phoible | ~2k | phone/phoneme databases extracted from Phoible [1]. Allophone information is from AlloVera [3] |
| latest | ~8k | Phoible database + estimated inventories based on our LREC work [2] |
This repository uses code and data from the following sources:
- [1] Moran, Steven, Daniel McCloy, and Richard Wright. "PHOIBLE online." (2014).
- [2] Li, Xinjian, et al. "Phone Inventories and Recognition for Every Language." LREC 2022.
- [3] Mortensen, David R., et al. "AlloVera: A Multilingual Allophone Database." Proceedings of the 12th Language Resources and Evaluation Conference. 2020.