These files take in a sequence of lip images and predict the phonemes being spoken.
Reimplemented in Nim with some improvements.
- FFmpeg
- Python 3.3+ or Python 2.7
- face_recognition
- textgrid-parser on the executable path
- Montreal Forced Aligner executable, callable from the command line
- A CBLAS and LAPACK library on your system. Recommendation
videoToVoice.nim lists all requirements and their versions. Also, install Arraymancer_vision with my patch.
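
Before running anything, it can help to confirm that the external tools are actually reachable from the command line. The following is a minimal sketch in Python 3.3+ (where `shutil.which` is available). The executable names in `REQUIRED_TOOLS` are assumptions, not names taken from this project; in particular, the Montreal Forced Aligner command differs between releases, so adjust the list to your installation.

```python
import shutil

# Assumed executable names; change them to match what is installed on your system.
REQUIRED_TOOLS = ["ffmpeg", "textgrid-parser", "mfa_align"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the executable path."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All external tools were found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use it defaults to `shutil.which`.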
- Test and train the CNNs
- Depend less on external software
- Replace face_recognition with Dlib
- Port textgrid-parser to Nim
- Optimize: identifying and cropping the lips is very slow (4 seconds per image)
- Use a Markov chain or an RNN for better results
- Instead of using convolutions, pass the lip points from face_recognition directly to a simple neural network
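
As an illustration of the Markov chain idea in the list above, here is a minimal sketch of Viterbi decoding over per-frame phoneme scores, written in plain Python. It is not part of this project: the function name, the data layout, and the choice to treat the classifier's per-frame scores as emission scores are all assumptions made for the example. The transition probabilities would have to be estimated from aligned training data.

```python
import math


def viterbi(frame_probs, transition, initial):
    """Return the most likely phoneme sequence under a first-order Markov chain.

    frame_probs: list of dicts, one per frame, mapping phoneme -> classifier score
    transition:  dict mapping (previous, current) -> transition probability
    initial:     dict mapping phoneme -> probability of starting in that phoneme
    (All three structures are hypothetical, chosen for this sketch.)
    """
    def log(p):
        # Work in log space so long sequences do not underflow.
        return math.log(p) if p > 0 else float("-inf")

    if not frame_probs:
        return []

    states = list(initial)
    # best[s] is the log-score of the best path that ends in state s.
    best = {s: log(initial[s]) + log(frame_probs[0].get(s, 0.0)) for s in states}
    backpointers = []

    for probs in frame_probs[1:]:
        new_best, pointers = {}, {}
        for cur in states:
            # Pick the predecessor that gives the highest score for `cur`.
            prev = max(states, key=lambda p: best[p] + log(transition.get((p, cur), 0.0)))
            new_best[cur] = (best[prev]
                             + log(transition.get((prev, cur), 0.0))
                             + log(probs.get(cur, 0.0)))
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With "sticky" transitions (a high probability of staying in the same phoneme), a single frame where the classifier briefly prefers another phoneme tends to be smoothed away, which is the kind of improvement a Markov chain could bring over picking the top phoneme per frame.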