phonepiece

phonepiece is a library for managing phone inventories. It also includes a few linguistic and phonetic tools.

It is mainly intended for use in the following projects, but it can also be used as a standalone library:

allosaurus: phone recognition toolkit
transphone: grapheme-to-phoneme toolkit
asr2k: speech recognition systems for 2000 languages

Install

phonepiece is available from pip:

pip install phonepiece

You can also clone this repository and install it from source:

python setup.py install

Usage

Inventory Lookup

The main feature of phonepiece is inventory lookup.

An inventory typically contains the following information:

phoneme: language-dependent units
phone: language-independent units
allophone: the mapping between phones and phonemes

Basic usage looks like this:

In [1]: from phonepiece import read_inventory                                                                                                   

In [2]: eng = read_inventory('eng')                                                                                                             

In [3]: eng                                                                                                                                     
Out[3]: <Inventory eng phoneme: 40, phone: 46>

In [4]: eng.phoneme                                                                                                                             
Out[4]: <Unit: 40 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'd': 3, 'd͡ʒ': 4, 'e': 5, 'f': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 's': 16, 't': 17, 't͡ʃ': 18, 'u': 19, 'v': 20, 'w': 21, 'z': 22, 'æ': 23, 'ð': 24, 'ŋ': 25, 'ɑ': 26, 'ɔ': 27, 'ə': 28, 'ɛ': 29, 'ɡ': 30, 'ɪ': 31, 'ɹ': 32, 'ɹ̩': 33, 'ʃ': 34, 'ʊ': 35, 'ʌ': 36, 'ʒ': 37, 'θ': 38, '<eos>': 39}>

In [5]: eng.phone                                                                                                                               
Out[5]: <Unit: 46 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'b̥': 3, 'd': 4, 'dʒ': 5, 'd̥': 6, 'e': 7, 'f': 8, 'g': 9, 'h': 10, 'i': 11, 'j': 12, 'k': 13, 'kʰ': 14, 'l': 15, 'm': 16, 'n': 17, 'o': 18, 'p': 19, 'pʰ': 20, 's': 21, 't': 22, 'tʃ': 23, 'tʰ': 24, 'u': 25, 'v': 26, 'w': 27, 'z': 28, 'æ': 29, 'ð': 30, 'ŋ': 31, 'ɑ': 32, 'ɔ': 33, 'ə': 34, 'ɛ': 35, 'ɡ̥': 36, 'ɪ': 37, 'ɹ': 38, 'ɹ̩': 39, 'ʃ': 40, 'ʊ': 41, 'ʌ': 42, 'ʒ': 43, 'θ': 44, '<eos>': 45}>

In [6]: eng.phoneme2phone                                                                                                                       
Out[6]: 
defaultdict(list,
            {'a': ['a'],
             'b': ['b', 'b̥'],
             'd': ['d', 'd̥'],
             'd͡ʒ': ['dʒ'],
             'e': ['e'],
             'f': ['f'],
             'h': ['h'],
             'i': ['i'],
             'j': ['j'],
             'k': ['kʰ', 'k'],
             'l': ['l'],
             'm': ['m'],
             'n': ['n'],
             'o': ['o'],
             'p': ['pʰ', 'p'],
             's': ['s'],
             't': ['tʰ', 't'],
             't͡ʃ': ['tʃ'],
             'u': ['u'],
             'v': ['v'],
             'w': ['w'],
             'z': ['z'],
             'æ': ['æ'],
             'ð': ['ð'],
             'ŋ': ['ŋ'],
             'ɑ': ['ɑ'],
             'ɔ': ['ɔ'],
             'ə': ['ə'],
             'ɛ': ['ɛ'],
             'ɡ': ['g', 'ɡ̥'],
             'ɪ': ['ɪ'],
             'ɹ': ['ɹ'],
             'ɹ̩': ['ɹ̩'],
             'ʃ': ['ʃ'],
             'ʊ': ['ʊ'],
             'ʌ': ['ʌ'],
             'ʒ': ['ʒ'],
             'θ': ['θ'],
             '<blk>': ['<blk>'],
             '<eos>': ['<eos>']})
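
The phoneme2phone mapping above goes from a phoneme to its allophones. When working with recognizer output it is often the reverse lookup that is needed. The following is a minimal sketch of how such a mapping could be inverted. It uses a hand-copied excerpt of the output above rather than the library itself, and the helper name invert_allophones is hypothetical, not part of phonepiece.

```python
from collections import defaultdict

# Excerpt of the phoneme -> phones mapping shown above, copied by hand
# so that the example runs without phonepiece installed.
phoneme2phone = {
    'p': ['pʰ', 'p'],
    't': ['tʰ', 't'],
    'k': ['kʰ', 'k'],
    'æ': ['æ'],
}


def invert_allophones(mapping):
    """Build a phone -> phonemes lookup from a phoneme -> phones mapping.

    Hypothetical helper for illustration; not a phonepiece function.
    """
    phone2phoneme = defaultdict(list)
    for phoneme, phones in mapping.items():
        for phone in phones:
            phone2phoneme[phone].append(phoneme)
    return dict(phone2phoneme)


phone2phoneme = invert_allophones(phoneme2phone)
print(phone2phoneme['kʰ'])  # ['k']
```

The values are lists because, in general, one phone may be an allophone of more than one phoneme.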

Phone Tokenization

phonepiece also provides a tokenizer that splits a concatenated IPA string into individual IPA segments:

In [1]: from phonepiece.ipa import read_ipa                                                         

In [2]: ipa = read_ipa()                                                                            

In [3]: ipa.tokenize('kʰæt')                                                                        
Out[3]: ['kʰ', 'æ', 't']
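
To give an intuition for why 'kʰ' comes out as one token rather than 'k' followed by 'ʰ', here is a minimal sketch of greedy longest-match tokenization over a small set of units. This is an assumption about one common way to do it, not a description of how phonepiece tokenizes internally. The function name tokenize_longest_match and the toy unit set are invented for illustration.

```python
def tokenize_longest_match(text, units):
    """Split text into units, always taking the longest unit that matches.

    Illustrative sketch only; phonepiece may use a different strategy.
    """
    max_len = max(len(u) for u in units)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in units:
                tokens.append(piece)
                i += size
                break
        else:
            # No unit of any length matched at this position.
            raise ValueError(f"no unit matches at position {i}: {text[i]!r}")
    return tokens


# Toy unit set (hypothetical); a real IPA table is far larger.
toy_units = {'k', 'kʰ', 'æ', 't', 'tʰ', 't͡ʃ'}
print(tokenize_longest_match('kʰæt', toy_units))  # ['kʰ', 'æ', 't']
```

Because both 'k' and 'kʰ' are in the unit set, the longest-match rule is what keeps the aspiration mark attached to the preceding consonant.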

Phonological Distance

phonological_distance is an augmented edit distance that also takes the phonological distance between segments into account.

In [1]: from phonepiece.distance import phonological_distance

In [2]: phonological_distance('a', 'b')
Out[2]: 0.5862068965517241

In [3]: phonological_distance('a', 'e')
Out[3]: 0.03448275862068961

In [4]: phonological_distance('a', 'bc')
Out[4]: 1.5862068965517242
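
The idea of an edit distance with phonologically weighted substitutions can be sketched as follows. This is not the phonepiece implementation: the feature vectors are invented toy values, and the names TOY_FEATURES, substitution_cost and weighted_edit_distance are hypothetical. The insertion/deletion cost of 1.0 is an assumption, chosen because the results above for 'b' and 'bc' differ by exactly 1.

```python
# Hypothetical binary feature vectors, invented for illustration only.
TOY_FEATURES = {
    'a': (1, 0, 0, 1),
    'e': (1, 0, 0, 0),
    'b': (0, 1, 1, 1),
    'c': (0, 1, 0, 0),
}


def substitution_cost(x, y):
    """Fraction of features on which two segments differ (0.0 to 1.0)."""
    fx, fy = TOY_FEATURES[x], TOY_FEATURES[y]
    return sum(p != q for p, q in zip(fx, fy)) / len(fx)


def weighted_edit_distance(src, tgt, indel_cost=1.0):
    """Levenshtein distance whose substitution cost depends on feature overlap."""
    rows, cols = len(src) + 1, len(tgt) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost  # delete everything in src
    for j in range(1, cols):
        dist[0][j] = j * indel_cost  # insert everything in tgt
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,      # deletion
                dist[i][j - 1] + indel_cost,      # insertion
                dist[i - 1][j - 1]
                + substitution_cost(src[i - 1], tgt[j - 1]),  # substitution
            )
    return dist[rows - 1][cols - 1]


print(weighted_edit_distance('a', 'e'))   # 0.25, similar segments
print(weighted_edit_distance('a', 'b'))   # 0.75, dissimilar segments
print(weighted_edit_distance('a', 'bc'))  # 1.75, one substitution plus one insertion
```

The sketch iterates over its inputs element by element, so phones written with more than one character (such as 'kʰ') would need to be passed as a list of tokens rather than a plain string.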

Lexicon Lookup

phonepiece also includes lexicon dictionaries for many languages, so you can look up the pronunciation of a particular word (if it exists). The output phonemes are consistent with the phoneme inventory of that language.

In [1]: from phonepiece.lexicon import read_lexicon

In [2]: eng = read_lexicon('eng')

In [3]: eng['hello']
Out[3]: ['h', 'ʌ', 'l', 'o', 'w']
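
What consistency with the inventory means can be illustrated with a small check. The sketch below uses a hand-copied excerpt of the English phoneme set from the Inventory Lookup section instead of calling the library, and the helper name out_of_inventory is hypothetical.

```python
# Excerpt of the English phoneme set, copied by hand from the
# eng.phoneme output in the Inventory Lookup section.
ENG_PHONEME_EXCERPT = {'h', 'ʌ', 'l', 'o', 'w', 'k', 'æ', 't'}


def out_of_inventory(pronunciation, phonemes):
    """Return the symbols in a pronunciation that are not in the phoneme set.

    Hypothetical helper for illustration; not a phonepiece function.
    """
    return [p for p in pronunciation if p not in phonemes]


hello = ['h', 'ʌ', 'l', 'o', 'w']
print(out_of_inventory(hello, ENG_PHONEME_EXCERPT))             # []
print(out_of_inventory(['kʰ', 'æ', 't'], ENG_PHONEME_EXCERPT))  # ['kʰ']
```

The second call flags 'kʰ' because it is a phone (an allophone of 'k'), not a phoneme, so it would not appear in a lexicon entry that stays within the phoneme space.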

Models

model	# supported languages	description
phoible	~2k	phone/phoneme databases extracted from Phoible [1]; allophone information is from AlloVera [3]
latest	~8k	Phoible database plus inventories estimated in our LREC work [2]

Acknowledgement

This repository uses code/data from the following repositories

Reference

[1] Moran, Steven, Daniel McCloy, and Richard Wright. "PHOIBLE online." (2014).
[2] Li, Xinjian, et al. "Phone Inventories and Recognition for Every Language." LREC 2022.
[3] Mortensen, David R., et al. "AlloVera: A Multilingual Allophone Database." Proceedings of the 12th Language Resources and Evaluation Conference. 2020.
