Vietnamese Voice2Text: A Web Application for Whisper Implementation in Vietnamese Automatic Speech Recognition Tasks

Authors:

Quangphuoc Nguyen,

Ngocminh Nguyen,

Thanhluan Dang,

Vanha Tran

CSAI '23: Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence

Pages 312 - 318

https://doi.org/10.1145/3638584.3638634

Published: 14 March 2024

Abstract

The release of OpenAI's Whisper model inspired us to build a web platform that provides voice-to-text conversion services for Vietnamese users. Building on Whisper's strong generalization capabilities, we developed a web application with three main features: record-to-text, file-to-text, and a subtitle generator for YouTube. We first fine-tuned Whisper on a dataset in our target language, then deployed the model as a REST API using the Python Flask framework, with three paths for the three tasks. The web application is built with ReactJS, a popular JavaScript library for building user interfaces. Its architecture follows component-based design principles: the application is structured into reusable, modular components, which improves code maintainability and scalability. The record-to-text feature lets users record audio on the web page; the audio is then processed and converted to text. The file-to-text feature accepts audio files uploaded by users and returns the transcript of each file. Finally, the subtitle generator for YouTube takes a YouTube link as input, processes the video, and displays it with the transcript attached according to the timestamps of each transcript segment. This project may inspire and encourage the testing and application of new automatic speech recognition (ASR) models in specific applications.
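The subtitle step described in the abstract, attaching transcript text to a video by timestamp, can be illustrated with a minimal sketch. It assumes each transcript segment is a mapping with `start` and `end` times in seconds and a `text` field, and that the player accepts WebVTT captions. Neither the segment shape nor the caption format is stated in the abstract, and the function names `format_timestamp` and `segments_to_webvtt` are hypothetical.

```python
from typing import Iterable, Mapping


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_webvtt(segments: Iterable[Mapping]) -> str:
    """Build a WebVTT document from timestamped transcript segments.

    Assumption: every segment has 'start', 'end' (seconds) and 'text' keys.
    """
    lines = ["WEBVTT", ""]
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments so no blank cues are emitted
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative segments only; not data from the paper.
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Xin chào "},
        {"start": 2.5, "end": 5.0, "text": "Đây là một ví dụ."},
    ]
    print(segments_to_webvtt(demo))
```

A front end could load the resulting text as a caption track next to the embedded video. How the authors actually render the transcript is not specified in the abstract.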
