DOI: 10.1145/3638584.3638634
research-article

Vietnamese Voice2Text: A Web Application for Whisper Implementation in Vietnamese Automatic Speech Recognition Tasks

Published: 14 March 2024

Abstract

The release of OpenAI's Whisper model inspired us to build a web platform that provides voice-to-text conversion for Vietnamese users. Building on Whisper's strong generalization capabilities, we developed a web application with three main features: record-to-text, file-to-text, and a subtitle generator for YouTube. We first fine-tuned Whisper on a dataset in our target language, then deployed the model as a REST API using the Python Flask framework, with three paths for the three tasks. The web application is built with ReactJS, a popular JavaScript library for building user interfaces. Its architecture follows component-based design principles: the application is structured into reusable, modular components, which improves code maintainability and scalability. The record-to-text feature lets users record audio on the web page, which is then processed and converted to text. The file-to-text feature accepts audio files uploaded by users and returns the transcript of each file. Finally, the subtitle generator for YouTube takes a YouTube link as input, processes the video, and displays it with the transcript attached according to the timestamp of each transcript segment. This project can inspire and encourage the testing and application of new automatic speech recognition (ASR) models in specific applications.
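The abstract does not say how the timestamped transcript is attached to the video. As a minimal sketch, assuming the speech model returns segments with `start` and `end` times in seconds plus a `text` field, the segments could be turned into WebVTT cues that a browser video player can display. The `Segment` class and the function names below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One transcript segment (hypothetical shape, not the paper's data model)."""
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds
    text: str


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def segments_to_webvtt(segments: List[Segment]) -> str:
    """Build a WebVTT document with one cue per transcript segment."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}")
        lines.append(seg.text.strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


def active_segment(segments: List[Segment], t: float) -> Optional[Segment]:
    """Return the segment that covers playback time t, or None if there is none."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg
    return None


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.5, " Xin chào"),
        Segment(2.5, 5.0, " Cảm ơn bạn"),
    ]
    print(segments_to_webvtt(demo))
```

In such a design, the front end could either load the generated WebVTT text as a subtitle track or call something like `active_segment` with the current playback time to highlight the matching line. Which approach the application actually uses is not stated in the abstract.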

Cited By

  • (2024) Implementation of Real-Time Audio-to-Text Conversion and Processing for Seamless Transformation in Classroom Environment. International Journal of Innovative Science and Research Technology (IJISRT), 2328-2332. DOI: 10.38124/ijisrt/IJISRT24SEP1335. Online publication date: 8-Oct-2024
  • (2024) Fully Open-Source Meeting Minutes Generation Tool. Future Internet 16:11 (429). DOI: 10.3390/fi16110429. Online publication date: 20-Nov-2024

      Information & Contributors

      Information

      Published In

      CSAI '23: Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence
      December 2023
      563 pages
      ISBN:9798400708688
      DOI:10.1145/3638584
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Audio-to-text
      2. Automatic speech recognition
      3. Subtitles generator
      4. Voice-to-text
      5. Whisper model

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      CSAI 2023

      Article Metrics

      • Downloads (Last 12 months)32
      • Downloads (Last 6 weeks)1
      Reflects downloads up to 31 Dec 2024
