
Greek2MathTex: A Greek Speech-to-Text Framework for LaTeX Equations Generation

Published: 27 December 2024

Abstract

In the vast majority of academic and scientific domains, LaTeX has established itself as the de facto standard for typesetting complex mathematical equations and formulae. However, LaTeX's complex syntax and code-like appearance present accessibility barriers for individuals with disabilities, as well as those unfamiliar with coding conventions. In this paper, we present a novel solution to this challenge: a speech-to-LaTeX equations system specifically designed for the Greek language. We propose an end-to-end system that harnesses the power of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) techniques to enable users to verbally dictate mathematical expressions and equations in natural language, which are subsequently converted into LaTeX format. We present the architecture and design principles of our system, highlighting key components such as the ASR engine, the LLM-based prompt-driven equations generation mechanism, as well as the application of a custom evaluation metric employed throughout the development process. We have made our system open source and available at https://github.com/magcil/greek-speech-to-math.

1 Introduction

LaTeX is a widely used typesetting system in academia, especially in fields like mathematics, physics, and computer science, due to its ability to produce well-structured documents with precise formatting and complex mathematical notations. However, its intricate syntax often requires significant time and effort to master, posing a barrier to entry for many users. Unlike standard word processors, LaTeX uses a markup language to define document structure and formatting, which can be daunting for beginners and inaccessible to those with disabilities. Visually impaired users, for example, may struggle to read and interpret LaTeX code using screen readers or other assistive technologies. Similarly, users with motor impairments may find it challenging to input LaTeX commands accurately, particularly when dealing with complex mathematical equations. These accessibility barriers limit the inclusivity of LaTeX, as well as its adoption among a diverse range of users. Additionally, members of the visually impaired community worldwide have expressed that accessibility challenges present a significant obstacle to pursuing higher education and engaging in research and academia. In Greece, specifically, visually impaired individuals also encounter numerous challenges within educational institutions [7].
In light of these challenges, we focus our work on developing alternative approaches to interacting with LaTeX that are more intuitive, accessible, and user-friendly. In this paper, we open-source an end-to-end speech-to-text system specifically designed to generate LaTeX equations (in source-code format) based solely on audio input. In this way, users can verbally dictate mathematical expressions and equations, from which our proposed system generates the respective LaTeX equation. Our goal is to democratize access to LaTeX and enhance the efficiency of mathematical communication for visually impaired people in Greece.

2 Related Work

In the past, various approaches have been proposed to address the problem of transforming speech into structured mathematical equations, e.g., in MathML or LaTeX form. Initial approaches relied on rule-based conditions to generate mathematical equations from speech transcriptions [3, 11]. Notably, the ASR modules used in these systems were often based on Hidden Markov Models (HMMs), whose capabilities are limited compared to current state-of-the-art ASR systems.
An early approach, TalkMaths [11], used Dragon NaturallySpeaking (DNS) as the ASR system to transcribe speech. The transcriptions were mapped to mathematical notation using a customized Context-Free Grammar (CFG), which generated a parse tree that was then converted into the desired markup format. The authors also conducted statistical analyses on the frequency of unigrams, bigrams, and trigrams in the speech samples to develop a domain-specific Statistical Language Model (SLM). Additionally, they examined the impact of speech timing and prosody, aiming to establish a set of rules for reading mathematical expressions aloud.
Hanakovič and Nagy [6] created a system that categorizes serialized speech input into discrete mathematical categories, such as function, symbol, and fraction. This approach converts speech into labeled text elements, for example, transforming “cosine of x plus three” into “function: cosine; parameter: x; symbol: plus; parameter: 3”. These labeled inputs are used to build an expression tree, which is then converted into concise mathematical notation, including MathML. The system also includes navigation and edit modes, allowing users to select the category of the speech input or correct output errors directly. Moreover, Batlouni et al. [3] developed Mathifier, a system designed to provide real-time speech recognition to help users retain long or complex equations. The system dynamically rendered the most probable output as speech was processed. They used Sphinx IV, developed by CMU, as the speech recognition module. To ensure mathematically sound outputs, they implemented a Grammar Language Model, which restricted the system’s output to domain-specific allowable sequences. This approach limited possible transitions during decoding to those that maintained mathematical coherence, such as preventing “minus” from being followed by “divided by”.
Recent advancements in deep learning have led to systems that can convert equations from images into structured textual form, like LaTeX, benefiting visually or motor-impaired individuals. Wang et al. [10] introduced an encoder-decoder architecture for this purpose. They used a Convolutional Neural Network (CNN) for encoding and a stacked bidirectional Long Short-Term Memory (LSTM) module with soft attention for decoding and token generation. The neural network undergoes two-step training: first, token-level training with Maximum-Likelihood Estimation (MLE), followed by sequence-level training using a reinforcement learning-based policy gradient algorithm, optimizing the model by considering the entire sequence of tokens.

3 Dataset

Given the scarcity of publicly available domain-specific datasets and the limited knowledge of open-source ASR systems regarding both the Greek language and math-related speech nuances, we opted to develop our own task-specific dataset (henceforth denoted as Gr2TeX). It consists of 500 pairs of equations in natural text alongside their corresponding mathematical notation in LaTeX form.
Dataset Collection Process: The equations were selected from a variety of sources, including Greek mathematics textbooks, online educational platforms, and custom-generated examples crafted by computer science professionals. Our selection aimed to comprehensively cover mathematical concepts, from basic arithmetic to advanced algebra. For audio recordings, we engaged 10 native Greek speakers, both male and female, to read the equations aloud, ensuring minimal background noise and clear articulation of mathematical terms.
Data Preprocessing: Prior to finalizing the dataset, we performed some preprocessing steps. The audio recordings were processed to reduce background noise and normalized to a consistent volume level. The textual data was reviewed for consistency in notation, particularly ensuring uniform representation of symbols and expressions in LaTeX. We also standardized the text format, such as using consistent punctuation and spacing, to facilitate uniform processing during the model training phase. We split the dataset into training, validation, and test sets following a 70%-15%-15% split.
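As a minimal sketch, the 70%-15%-15% partitioning could be implemented as below; the tuple structure and placeholder data are illustrative assumptions, not the actual storage format of Gr2TeX.

```python
from sklearn.model_selection import train_test_split

# Illustrative placeholder for the 500 (natural-text, LaTeX) pairs of Gr2TeX;
# the actual storage format of the dataset may differ.
pairs = [(f"text_{i}", f"latex_{i}") for i in range(500)]

# First hold out 70% for training, then split the remaining 30% evenly
# into validation and test sets (15% each).
train, rest = train_test_split(pairs, test_size=0.30, random_state=42)
val, test = train_test_split(rest, test_size=0.50, random_state=42)
print(len(train), len(val), len(test))  # 350 75 75
```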

4 System Architecture

In this section, we provide further information about the architecture of our proposed system. Illustrated in Figure 1, it consists of three primary components: a speech recognition module, a retrieval mechanism, and a text generation model. The speech recognition module transcribes speech input into free-form text, while the retrieval component searches a database to return the k most similar equations in natural text from a held-out set of our dataset, along with their respective mathematical forms. Subsequently, the text generation model, in our case GPT-3.5-turbo, is prompted with the k retrieved equation pairs, along with a brief instruction that guides the LLM’s behaviour.
Figure 1: An overview of the proposed system’s architecture.
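As a rough illustration of how the three components fit together, the following sketch wires them up as interchangeable callables. All function names and the demonstration format are hypothetical placeholders, not the API of the released system.

```python
from typing import Callable, List, Tuple

def speech_to_latex(
    audio_path: str,
    transcribe: Callable[[str], str],                       # ASR (Section 4.1)
    retrieve: Callable[[str, int], List[Tuple[str, str]]],  # k-NN (Section 4.3)
    generate: Callable[[str], str],                         # LLM (Section 4.2)
    k: int = 5,
) -> str:
    """Glue code for the three-stage pipeline of Figure 1 (hypothetical names)."""
    transcript = transcribe(audio_path)   # speech -> free-form Greek text
    examples = retrieve(transcript, k)    # k most similar (text, LaTeX) pairs
    demos = "\n".join(f"Text: {t}\nLaTeX: {l}" for t, l in examples)
    # The instruction prompt (see Table 1) is assumed to be prepended by `generate`.
    prompt = f"{demos}\nText: {transcript}\nLaTeX:"
    return generate(prompt)
```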

4.1 Speech Recognition Component

Our speech recognition component consists of an instance of the XLS-R model developed by Meta AI [2]. While it is trained on thousands of hours of audio in multiple languages (including Greek), its performance on the Greek language was insufficient for integration into our proposed end-to-end pipeline. Thus, we first fine-tuned the model to refine its ability to transcribe Greek audio. For this initial fine-tuning, we collected approximately 26 hours of publicly available Greek speech audio, covering diverse topics such as political debates, sportscasting events, and stand-up comedy shows. These recordings were transcribed and segmented into smaller chunks of approximately 3 seconds each. Additionally, we utilized a subset of the Common Voice [1] dataset that contained Greek audio. Finally, we further fine-tuned the model on our custom domain- and language-specific dataset (see Section 3) in order to enhance its ability to accurately transcribe math-related audio input.
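For illustration, a minimal inference sketch with a fine-tuned XLS-R checkpoint is given below. The checkpoint name is a hypothetical placeholder; the released model may be loaded differently.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# "magcil/xlsr-greek-math" is a hypothetical checkpoint name standing in
# for the fine-tuned XLS-R model described in this section.
MODEL_ID = "magcil/xlsr-greek-math"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

def transcribe(path: str) -> str:
    """Transcribe a Greek speech recording into free-form text."""
    waveform, sr = torchaudio.load(path)
    # XLS-R expects mono audio sampled at 16 kHz.
    mono = waveform.mean(dim=0)
    mono = torchaudio.functional.resample(mono, sr, 16_000)
    inputs = processor(mono, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```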

4.2 Equation Generation Component

Given the rapid advancements in the NLP domain, we chose to use a Large Language Model (LLM) for the equation generation process. However, a notable challenge emerges due to the limited availability of LLMs pretrained on a substantial Greek language corpus, with even fewer trained on math-related data. Our decision to develop a speech-to-LaTeX tool for the Greek language meant that models specifically tailored for code generation, such as StarCoder [5], could not be employed, as their capabilities in Greek are limited. While such models would otherwise be well-suited, given the code-like structure of LaTeX equations, they lack sufficient training data for Greek-language tasks. Thus, we decided to employ OpenAI’s GPT-3.5-turbo.
A common strategy when leveraging LLMs is to enhance their capabilities through in-context learning (ICL). In the NLP domain, ICL is a paradigm where an LLM is prompted with a set of instructions alongside a few task-specific demonstrative examples. This allows the model to adapt its output based solely on the provided examples, without the need to update its parameters [4]. Given the lack of a task-specific dataset for Greek, aside from our own Gr2TeX with only 500 examples, ICL was essential for guiding the LLM’s responses. We selected demonstrative examples by retrieving the k most semantically similar samples to the query, with k being a hyperparameter optimized during a tuning process. Balancing performance with the cost of using a closed-source LLM like GPT-3.5, we experimented with values of k from two to six, as detailed in Section 5.2.
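A minimal sketch of this ICL prompt assembly is shown below, using prompt p2 from Table 1 as the system instruction. The demonstration layout is an assumption, since the paper does not specify the exact prompt template.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Instruction p2 from Table 1.
INSTRUCTION = (
    "You are a LaTeX equation generator. You are provided with an equation "
    "described in natural text and you are asked to generate the respective "
    "LaTeX equation. Follow the examples and generate the LaTeX equation "
    "for the last query."
)

def generate_latex(query: str, examples: list[tuple[str, str]]) -> str:
    """Prompt GPT-3.5-turbo with k retrieved (text, LaTeX) demonstrations."""
    demos = "\n".join(f"Text: {t}\nLaTeX: {l}" for t, l in examples)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": f"{demos}\nText: {query}\nLaTeX:"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```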

4.3 Retrieval Mechanism

Our retrieval mechanism operates on the Gr2TeX dataset (see Section 3), composed of a diverse range of mathematical equations represented in natural language and their corresponding LaTeX forms. We employ the k-NN algorithm in order to identify the k most similar examples to the current query. We experimented with various similarity and distance measures, namely cosine similarity and the Euclidean and Manhattan distances. The results of our exploration, detailing which similarity or distance function yielded better performance, are presented in Section 5.2.
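The retrieval step can be sketched as follows. The sentence encoder is an assumption (the paper does not name the embedding model used to vectorize the natural-text equations), and the example pairs are illustrative stand-ins for the held-out set.

```python
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# The encoder choice is an assumption; the paper does not name the
# embedding model used to vectorize the natural-text equations.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Illustrative stand-ins for the held-out (natural text, LaTeX) pairs.
texts = ["χι στο τετράγωνο συν ένα", "ρίζα του άλφα", "ένα δια δύο"]
latex = [r"x^2 + 1", r"\sqrt{\alpha}", r"\frac{1}{2}"]

# metric can be "cosine", "euclidean", or "manhattan", as in Section 5.2.
index = NearestNeighbors(metric="cosine").fit(encoder.encode(texts))

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k most similar (text, LaTeX) pairs to the query."""
    k = min(k, len(texts))
    _, idx = index.kneighbors(encoder.encode([query]), n_neighbors=k)
    return [(texts[i], latex[i]) for i in idx[0]]
```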

5 Experimental Setup

5.1 Evaluation Metrics

We evaluated our system’s performance by measuring the Levenshtein distance between the generated equations and the ground truth, following preprocessing steps to standardize the equations. This preprocessing involved removing LaTeX-specific formatting and commands, such as dollar signs and equation delimiters, and standardizing mathematical symbols and variables, including replacing Greek letters with Latin equivalents and removing unnecessary punctuation. After preprocessing, we compute the Levenshtein distance, which quantifies the minimum number of single-character edits required to transform one string into another. We assessed the effectiveness of this evaluation method, which we denote as EL, by comparing its results to a set of annotated assessments conducted by a group of five individuals. Each pair of generated and ground-truth equations was categorized as either “Not Match”, “Almost Match”, or “Match”, denoted respectively as −1, 0, and 1. In addition to this evaluation method, we report typical Natural Language Generation (NLG) metrics such as BLEU [8] and chrF [9].
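As an illustration, a sketch of the EL computation is given below. The preprocessing rules are a partial reconstruction, and the length normalization is an assumption inferred from the EL < 0.1 and EL > 0.4 thresholds of Table 2, which imply a distance scaled to [0, 1].

```python
import re

# Illustrative subset; the authors' full Greek-to-Latin mapping is not given.
GREEK_TO_LATIN = {r"\alpha": "a", r"\beta": "b", r"\gamma": "g"}

def preprocess(eq: str) -> str:
    """Strip LaTeX formatting per Section 5.1 (exact rules may differ)."""
    eq = eq.replace("$", "")                              # dollar signs
    eq = re.sub(r"\\(begin|end)\{equation\*?\}", "", eq)  # equation delimiters
    for greek, latin in GREEK_TO_LATIN.items():
        eq = eq.replace(greek, latin)                     # Greek -> Latin letters
    return re.sub(r"[\s.,;]", "", eq)                     # spacing/punctuation

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming single-character edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def el(pred: str, gold: str) -> float:
    p, g = preprocess(pred), preprocess(gold)
    # Normalizing by the longer string is an assumption (see lead-in above).
    return levenshtein(p, g) / max(len(p), len(g), 1)
```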

5.2 Experimental Results

In this section, we present the outcomes of our experiments and discuss our observations. Table 2 showcases the performance of our proposed system across different values of k and various similarity or distance functions. The first row of the table corresponds to our baseline approach, which did not utilize any in-context learning (ICL) functionality. In this setup, the generative model was prompted solely to produce the LaTeX equation for the provided transcribed text sequence. The third column indicates the selected prompt (see Table 1). The fourth and fifth columns indicate the percentage of equations with an EL distance lower than 0.1 and greater than 0.4, respectively, from their ground truths. The final two columns present the BLEU [8] and chrF [9] scores, respectively.
Table 1: The instruction prompts used throughout our experiments.

p1: “You are a LaTeX equation generator. You are provided with an equation described in natural text and you are asked to generate the respective LaTeX equation.”

p2: “You are a LaTeX equation generator. You are provided with an equation described in natural text and you are asked to generate the respective LaTeX equation. Follow the examples and generate the LaTeX equation for the last query.”

p3: “Είσαι ένας βοηθός προγραμματιστή. Σου παρέχεται μία εξίσωση σε φυσική γλώσσα και σου ζητείται να παράξεις την αντίστοιχη εξίσωση σε κώδικα LaTeX. Συμπλήρωσε την εξίσωση σε κώδικα LaTeX για το τελευταίο αίτημα.” (English: “You are a programmer’s assistant. You are provided with an equation in natural language and you are asked to produce the corresponding equation in LaTeX code. Complete the LaTeX-code equation for the last query.”)
Table 2: The results of our experiments on Gr2TeX using GPT-3.5.

k  Sim/Dist   Prompt  EL < 0.1  EL > 0.4  BLEU   chrF
-  -          -       27.45     24.84     39.54  60.58
3  Cosine     p1      32.06     29.85     39.77  61.86
3  Cosine     p3      34.66     23.04     42.37  63.92
4  Euclidean  p2      34.55     24.27     44.88  63.26
5  Manhattan  p1      36.03     21.01     47.95  65.77
5  Manhattan  p2      36.15     20.84     53.42  66.03
5  Cosine     p2      37.67     17.03     52.33  66.17
5  Cosine     p3      35.98     21.04     48.21  64.38
6  Manhattan  p2      34.86     21.04     46.79  63.69
6  Cosine     p2      37.59     17.51     52.51  66.24
We observed a significant impact of the number of demonstrative examples on the performance of the employed LLM. Specifically, we noticed a discernible difference in performance between scenarios with k = 3 and k = 5. However, increasing the number of examples beyond a certain point (specifically k = 6) did not enhance performance. Due to the costs associated with using closed-source LLMs like GPT-3.5, we did not test with more examples, but this is planned for future research to further validate our findings.
Furthermore, the choice of instruction prompt significantly influences the system’s performance. In our experiments, we tested three different instructions, outlined in Table 1. Notably, instruction p2 demonstrated superior effectiveness compared to instruction p3, which is almost the same as p2 but in Greek. Specifically, p2 improved the EL < 0.1 metric by up to 1.5% and reduced the EL > 0.4 metric by up to 3.8%, which is the desirable outcome. Similarly, instruction p1 generally underperformed compared to the other two prompts. This underscores the importance of exploring different prompting strategies to optimize system performance. In addition, the cosine similarity measure seems to lead to slightly better overall performance based on the results of Table 2.

6 Web Application

To address accessibility barriers in formulating mathematical expressions, we integrated Gr2TeX into a web application, with the backend developed in FastAPI and the frontend in React. The interface includes four main buttons: one for recording the mathematical expression, another for playing back the recorded phrase, a third for downloading the audio file, and a fourth for converting speech to LaTeX (Figure 2). Clicking the LaTeX button transcribes the audio and sends the text to the model that generates the corresponding LaTeX expression. This expression is displayed in a modal window, giving users immediate access to the mathematical representation of their spoken input (Figure 2). The web application is accessible by following the instructions found in our open-source repository.
Figure 2: The UI when entering the application and during recording. The modal window appears when clicking on the LaTeX button.
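A minimal sketch of such a backend endpoint is shown below. The route name is hypothetical, and `transcribe`, `retrieve`, and `generate_latex` refer to the component sketches from Section 4, not the repository’s actual API.

```python
import tempfile
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# `transcribe`, `retrieve`, and `generate_latex` refer to the component
# sketches in Section 4; the repository's actual routes may differ.
@app.post("/speech-to-latex")
async def speech_to_latex_endpoint(audio: UploadFile = File(...)) -> dict:
    # Persist the uploaded recording so the ASR component can read it.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await audio.read())
        tmp.flush()
        transcript = transcribe(tmp.name)
    examples = retrieve(transcript, k=5)
    return {"transcript": transcript,
            "latex": generate_latex(transcript, examples)}
```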

7 Conclusions & Future Work

In this work, we introduced an end-to-end speech-to-text system designed to generate LaTeX equations. The system comprises an ASR module, a text generation component, and an assistive retrieval mechanism. The ASR module is a fine-tuned version of the XLS-R model [2], while the text generation component utilizes GPT-3.5. We leverage in-context learning to dynamically enhance the model’s performance without further training, using a set of demonstrative examples retrieved for each query from a held-out set. This system represents a significant advancement in translating spoken mathematical expressions into LaTeX format, improving accessibility and efficiency in scientific communication. In future work, our objective is to explore more advanced ASR systems and newer LLMs with enhanced capabilities in Greek language processing. Additionally, we plan to investigate more sophisticated retrieval techniques and experiment with elaborate prompting strategies, such as Chain-of-Thought (CoT) prompting.

References

[1]
R. Ardila et al. 2020. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France.
[2]
A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. M. Pino, A. Baevski, A. Conneau, and M. Auli. 2021. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale. In Interspeech.
[3]
S. Batlouni, H. Karaki, F. Zaraket, and F. Karameh. 2011. Mathifier — Speech recognition of math equations. In 2011 18th IEEE International Conference on Electronics, Circuits, and Systems. 301–304.
[4]
Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui. 2023. A Survey on In-context Learning. arXiv:2301.00234.
[5]
R. Li et al. 2023. StarCoder: may the source be with you! Transactions on Machine Learning Research (2023).
[6]
T. Hanakovič and M. Nagy. 2006. Speech Recognition Helps Visually Impaired People Writing Mathematical Formulas. In Computers Helping People with Special Needs. Springer Berlin Heidelberg, Berlin, Heidelberg, 1231–1234.
[7]
P. Katsoulis. 2008. The current educational situation for students with visual impairment in Greece: Trends and prospects. “Recent Approaches & Future Challenges: Programs and Projects regarding the VI & MDVI” (2008).
[8]
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA.
[9]
M. Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Association for Computational Linguistics, Lisbon, Portugal.
[10]
Z. Wang and J. Liu. 2021. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. International Journal on Document Analysis and Recognition (IJDAR) 24 (2021), 1–13.
[11]
A. Wigmore, G. Hunter, E. Pfluegel, J. Denholm-Price, and V. Binelli. 2009. Using Automatic Speech Recognition to Dictate Mathematical Expressions: The Development of the ’TalkMaths’ Application at Kingston University. (2009).
