KR20240098282A

KR20240098282A - System for correcting errors of a speech recognition system and method thereof

Info

Publication number: KR20240098282A
Application number: KR1020220179710A
Authority: KR
Inventors: 구명완; 연희연; 방나모
Original assignee: 서강대학교산학협력단
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2024-06-28

Abstract

The present invention relates to a speech recognizer error correction system. The system preferably comprises: a primary correction model training unit that trains primary correction models, each corresponding to a speech recognizer; a re-correction model training unit that trains re-correction models; an integrated correction model training unit that trains an integrated correction model; a pipeline configuration unit that sequentially connects the primary correction models, the re-correction models, and the integrated correction model to configure and update a hierarchical pipeline for correcting speech recognizer errors; and a speech recognizer error correction device that, when speech recognition result data is input, corrects errors in that data using the hierarchical pipeline made up of the primary correction model corresponding to the recognizer, a re-correction model, and the integrated correction model, and outputs the corrected result.

Description

System for correcting errors of a speech recognition system and method thereof

The present invention relates to a speech recognizer error correction system and method. More specifically, it relates to a system and method that use hierarchical language models to build a hierarchical pipeline consisting of primary correction models, re-correction models, and an integrated correction model, so that the output of a speech recognizer can be converted into clean text data.

Intelligent agents such as virtual assistants are advancing across many areas of daily life on the basis of speech recognition technology. In a real environment, an intelligent agent must take speech data as input and properly extract from it the user's intent and the named entities needed to act on that intent.

The language understanding stage, which performs intent classification and named entity recognition on human speech data and infers semantic information from the results, has thus become a key element in the development of intelligent agents.

Existing research on spoken language understanding mainly follows either a pipeline approach or an end-to-end approach. The pipeline approach first performs automatic speech recognition (ASR) to convert speech into text and then carries out the language understanding stage on the converted text. The end-to-end approach, by contrast, does not convert the input speech signal into text; it consists of a single process that performs the language understanding task directly on the speech signal.

The pipeline approach has many advantages because it uses text information in the language understanding stage. Working with text means that the knowledge held by existing large-scale pre-trained language models can be exploited through text input, which makes adequate language understanding possible. However, the pre-trained models used in existing published pipeline approaches were trained on data free of speech recognition errors, so language understanding performance degrades when they are applied to data that contains errors introduced in the speech recognition stage.

Meanwhile, much research has also been devoted to reducing speech recognition errors so that the pipeline-style spoken language understanding approach described above can be adopted. Although this research has substantially lowered the word error rate (WER) and improved speech recognition performance, the language understanding model still suffers degraded performance whenever its input contains speech recognition errors.

Previously, many attempts were made to reduce the effect of speech recognition errors on pre-trained models by augmenting the data produced by a speech recognizer and teaching the model speech recognition error patterns through fine-tuning. However, the performance of this approach tends to depend on the amount of augmented data and on the particular speech recognizer. FIG. 1 is a block diagram showing a speech recognition error correction method according to the prior art. The conventional method shown in FIG. 1 augments the error data of the speech recognition results themselves and trains a language understanding model on that error data. This conventional method has the problem that the model can become biased toward the amount of augmented data or toward specific speech error patterns.

Another proposed approach injects speech-related feature values so that acoustic information can also be used. It relies on the phoneme sequences, lattice graphs, and N-best hypotheses obtained in the speech recognition stage. However, such information is difficult to obtain in practice, so the approach is hard to generalize.

Korean Registered Patent Publication No. 10-1394253
Korean Laid-Open Patent Publication No. 10-2021-0016682
Korean Registered Patent Publication No. 10-2324829
Korean Laid-Open Patent Publication No. 10-2021-0052563

To solve the problems described above, an object of the present invention is to provide a speech recognizer error correction system and method that use hierarchical language models to build a hierarchical pipeline consisting of primary correction models, re-correction models, and an integrated correction model, so that the output of a speech recognizer can be converted into clean text data.

To achieve the technical object described above, a speech recognizer error correction system according to a first aspect of the present invention is implemented by a computing system that performs machine learning and comprises: a primary correction model training unit that has primary correction models, each corresponding to one of at least two speech recognizers, and trains each primary correction model on the speech error patterns of its recognizer; a re-correction model training unit that has re-correction models, each set to correspond to one or more primary correction models, and trains the re-correction models; an integrated correction model training unit that has an integrated correction model and trains it; and a pipeline configuration unit that sequentially connects one or more layers, each formed by connecting one or more primary correction models to their corresponding re-correction model, and the integrated correction model, thereby configuring and updating a hierarchical pipeline for correcting speech recognizer errors.

The speech recognizer error correction system according to the first aspect preferably further comprises a speech recognizer error correction device that, when speech recognition result data from any speech recognizer is input, corrects errors in that data and outputs the result using the hierarchical pipeline made up of the primary correction model corresponding to that recognizer, the re-correction model corresponding to that primary correction model, and the integrated correction model.
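The routing performed by the error correction device can be illustrated with a minimal sketch. Everything named here (`build_corrector`, the recognizer ID `asr_a`, and the string-replacement rules standing in for trained models) is a hypothetical illustration and does not come from the specification.

```python
from typing import Callable, Dict

Corrector = Callable[[str], str]  # text in, corrected text out


def build_corrector(
    primary: Dict[str, Corrector],
    recorrection: Dict[str, Corrector],
    integrated: Corrector,
) -> Callable[[str, str], str]:
    """Return a function that routes one hypothesis through the three stages."""

    def correct(recognizer_id: str, hypothesis: str) -> str:
        if recognizer_id not in primary:
            raise KeyError(f"no primary correction model for {recognizer_id!r}")
        text = primary[recognizer_id](hypothesis)   # recognizer-specific stage
        text = recorrection[recognizer_id](text)    # stage shared by a group
        return integrated(text)                     # final shared stage

    return correct


# Toy stand-ins for trained models (hypothetical string rules, not real models).
correct = build_corrector(
    primary={"asr_a": lambda s: s.replace("wreck a nice", "recognize")},
    recorrection={"asr_a": lambda s: s.replace("beach", "speech")},
    integrated=lambda s: s.strip(),
)
result = correct("asr_a", " wreck a nice beach ")
print(result)
```

Keying the first two stages by recognizer ID mirrors the statement that the primary correction model is chosen by recognizer and the re-correction model by primary correction model; a real system would load trained text-to-text models in place of the lambdas.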

In the speech recognizer error correction system according to the first aspect, the pipeline configuration unit preferably determines the order of the layers that make up the pipeline according to a performance metric for the pipeline's speech recognition results, and the number of primary correction models and corresponding re-correction models in the pipeline is determined by that order.

In the speech recognizer error correction system according to the first aspect, the primary correction models, the re-correction models, and the integrated correction model are preferably text-to-text models based on a transformer model with an encoder-decoder structure.

In the speech recognizer error correction system according to the first aspect, the performance metric for speech recognition results used by the pipeline configuration unit is preferably the word error rate (WER).

In the speech recognizer error correction system according to the first aspect, each primary correction model is trained with the speech recognition result data output by its one-to-one corresponding speech recognizer as input and clean text data labeled as the correct answer. Each re-correction model is trained with the output data of its one or more corresponding primary correction models as input and clean text data labeled as the correct answer. The integrated correction model is preferably trained with the output data of the re-correction models in the one or more layers of the pipeline as input and clean text data labeled as the correct answer.

In the speech recognizer error correction system according to the first aspect, the pipeline configuration unit preferably comprises: an initial pipeline setup module that connects one or more primary correction models to the integrated correction model to set the initial state of the hierarchical pipeline; a pipeline performance evaluation module that measures a performance metric for the speech recognition results of the hierarchical pipeline; and a pipeline update module that adds layers to the hierarchical pipeline one at a time, each layer being formed by connecting one or more primary correction models to their corresponding re-correction model, measures the performance metric for the pipeline's speech recognition results after each addition, and updates the pipeline to the extended configuration when the metric has improved.

A speech recognizer error correction method according to a second aspect of the present invention comprises: (a) training primary correction models, each corresponding to one of at least two speech recognizers, on the speech error patterns of the respective recognizer; (b) setting re-correction models to correspond to one or more primary correction models and training the re-correction models; (c) training an integrated correction model; and (d) sequentially connecting one or more layers, each formed by connecting one or more primary correction models to their corresponding re-correction model, and the integrated correction model, thereby configuring a hierarchical pipeline for correcting speech recognizer errors.

The speech recognizer error correction method according to the second aspect preferably further comprises: (e) when speech recognition result data from any speech recognizer is input, correcting errors in that data and outputting the result using the hierarchical pipeline made up of the primary correction model corresponding to that recognizer, the re-correction model corresponding to that primary correction model, and the integrated correction model.

In the speech recognizer error correction method according to the second aspect, step (d) preferably determines the order of the layers that make up the pipeline according to a performance metric for the pipeline's speech recognition results, and the number of primary correction models and corresponding re-correction models in the pipeline is determined by that order.

In the speech recognizer error correction method according to the second aspect, the primary correction models, the re-correction models, and the integrated correction model are preferably text-to-text models based on a transformer model with an encoder-decoder structure.

In the speech recognizer error correction method according to the second aspect, step (d) preferably comprises: (d1) connecting one or more primary correction models to the integrated correction model to set the initial form of the hierarchical pipeline; (d2) measuring a performance metric for the speech recognition results of the hierarchical pipeline; and (d3) adding layers to the hierarchical pipeline one at a time, each layer being formed by connecting one or more primary correction models to their corresponding re-correction model, measuring the performance metric for the pipeline's speech recognition results after each addition, and updating the pipeline to the extended configuration based on the metric.

The speech recognizer error correction system and method according to the present invention learn the error patterns of each speech recognizer through the primary correction model training unit corresponding to that recognizer and, unlike conventional research and methodologies, convert recognition results into clean data through a pipeline built with the re-correction model training unit and the integrated correction model training unit.

Once the system according to the present invention has been trained with the error correction scheme based on hierarchical language models, any speech recognizer result that later arrives as input is mapped to the transformation appropriate to its error pattern, converted into clean data, and passed to a language processing model fine-tuned for a specific domain. By recognizing speech error patterns, mapping appropriate transformations, and supplying clean data as input in this way, it becomes possible to build a spoken language understanding system that is robust to speech errors and optimized for specific downstream tasks, drawing on the rich information of pre-trained models and on fine-tuning.

In addition, because the downstream fine-tuned model operates on input corrected into clean text form, the speech recognizer error correction system and method according to the present invention can accept plain text data as well as speech data as input, which makes them highly versatile.

In addition, the speech recognizer error correction system and method according to the present invention make it possible to provide a spoken language understanding system that is robust to speech errors. As a result, they can be applied to data sets in various areas of natural language processing, such as classification, question answering, and similarity analysis.

In addition, speech recognition results corrected by the error correction method according to the present invention can be used as input to models that are built on various publicly available Korean pre-trained models and fine-tuned for various data sets and specific domains.

FIG. 1 is a block diagram showing a speech recognition error correction method according to the prior art.
FIG. 2 is an overall configuration diagram of a speech recognizer error correction system according to a preferred embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating, by way of example, how layers of language models are added in the speech recognizer error correction system according to a preferred embodiment of the present invention.
FIG. 4 shows the training process of a primary correction model in the speech recognizer error correction system according to a preferred embodiment of the present invention.
FIG. 5 is a flowchart explaining how the pipeline update module updates the pipeline in the speech recognizer error correction system according to a preferred embodiment of the present invention.
FIG. 6 is a flowchart showing, in sequence, the speech recognizer error correction method according to the present invention.
FIGS. 7 and 8 are block diagrams illustrating a speech data error correction system according to another aspect of the present invention.
FIG. 9 is a table showing examples of the various styles of speech data to which the system of FIGS. 7 and 8 can be applied.

Hereinafter, the speech recognizer error correction system and method according to the present invention are described in detail with reference to the accompanying drawings. The system and method can be implemented by a computing system that performs machine learning.

FIG. 2 is an overall configuration diagram of a speech recognizer error correction system according to a preferred embodiment of the present invention, and FIG. 3 is a schematic diagram illustrating, by way of example, how layers of language models are added in that system. Referring to FIGS. 2 and 3, the speech recognizer error correction system (1) according to a preferred embodiment of the present invention comprises a primary correction model training unit (10), a re-correction model training unit (20), an integrated correction model training unit (30), a pipeline configuration unit (40), and a speech recognizer error correction device (50), and is configured to correct errors in the speech recognition results of a speech recognizer and provide them as clean text data. Each element of the system is described in detail below.

The primary correction model training unit (10) has primary correction models, each corresponding to one of at least two speech recognizers, and trains each primary correction model on the speech error patterns of its recognizer.

The primary correction models are preferably text-to-text models based on a transformer model with an encoder-decoder structure. The transformer keeps the encoder-decoder structure of conventional sequence-to-sequence models, but internally it is a machine learning model implemented with attention alone, without RNN layers.
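For readers unfamiliar with the attention operation mentioned above, the following is a minimal sketch of scaled dot-product attention, the standard building block of transformer models. It is written in plain Python for illustration only; the specification does not prescribe any particular implementation, and the function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector],
              values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Two identical keys receive equal weight, so the output is the mean value.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]],
                [[0.0, 2.0], [4.0, 6.0]])
print(out)
```

Because every output position is a weighted average over all input positions, no recurrent state is needed, which is the sense in which the model is built "with attention alone".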

FIG. 4 shows the training process of a primary correction model in the speech recognizer error correction system according to a preferred embodiment of the present invention. As shown in FIG. 3, the primary correction models have an encoder-decoder structure. The transformer-based encoder receives the speech recognition result data output by the speech recognizer that corresponds one-to-one to the given primary correction model. The transformer-based decoder receives the encoder's embedding of the entire sentence as a context vector and, beginning with a start token, is fed the output of the previous step at each subsequent step. Referring to FIG. 3, the output of a particular recognizer, "엄마가 성적 삼 등급 올린다고" (roughly, "Mom says she is raising the grade by three levels"), is given to the encoder, and the decoder is trained to output the clean text data, that is, the original reference text, "엄마가 성적 3등 올랐다고" (roughly, "Mom says the grade went up three places"). The model thus operates on the same principle as a translator. The encoder and decoder of the correction model according to the present invention have the following characteristics. The encoder stacks multiple transformer layers to obtain a deep representation of the input and applies the masking process of a masked language model to the token sequence. Unlike other artificial intelligence models, the model casts every problem in text form, so both its input and its output are text.
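The step-by-step decoding described above, in which the decoder starts from a start token and is fed its own previous outputs, can be sketched as a greedy decoding loop. The `next_token` callback stands in for a trained decoder; `toy_next_token`, `TOY_TABLE`, and the token strings are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BOS, EOS = "<s>", "</s>"  # assumed start and end markers

NextToken = Callable[[Sequence[str], Sequence[str]], str]


def greedy_decode(context: Sequence[str], next_token: NextToken,
                  max_len: int = 32) -> List[str]:
    """Generate tokens one at a time, feeding back everything produced so far."""
    generated = [BOS]
    for _ in range(max_len):
        token = next_token(context, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated[1:]  # drop the start token


# Hypothetical lookup table standing in for a trained correction model.
TOY_TABLE: Dict[Tuple[str, ...], List[str]] = {
    ("i", "scream"): ["ice", "cream"],
}


def toy_next_token(context: Sequence[str], generated: Sequence[str]) -> str:
    target = TOY_TABLE.get(tuple(context), list(context))
    position = len(generated) - 1  # tokens emitted so far, excluding BOS
    return target[position] if position < len(target) else EOS


decoded = greedy_decode(["i", "scream"], toy_next_token)
print(decoded)
```

A real decoder would score the whole vocabulary at each step and could use beam search instead of the greedy choice shown here; the specification does not state which decoding strategy is used.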

The corpus used to train the primary correction models on the output of each speech recognizer can be built as follows. First, a speech corpus for a specific domain is constructed and annotated by pairing each utterance with clean text data transcribed from it. Korean speech corpora from similar domains are additionally collected from public websites. Using this training corpus, the performance of publicly available speech recognizers and of speech recognizers built in-house is evaluated, and the 2ⁿ best-performing recognizers are included in the pipeline stage. The outputs of the selected 2ⁿ recognizers are then used to train the 2ⁿ corresponding primary correction models.
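The recognizer selection step can be sketched as a simple ranking by measured error rate. The function name, the recognizer names, and the WER values below are made up for illustration; lower WER is treated as better performance.

```python
from typing import Dict, List


def select_recognizers(wer_by_recognizer: Dict[str, float], n: int) -> List[str]:
    """Pick the 2**n recognizers with the lowest measured WER."""
    count = 2 ** n
    if count > len(wer_by_recognizer):
        raise ValueError(f"need {count} recognizers, have {len(wer_by_recognizer)}")
    # Sort by WER, breaking ties by name so the result is deterministic.
    ranked = sorted(wer_by_recognizer,
                    key=lambda name: (wer_by_recognizer[name], name))
    return ranked[:count]


# Hypothetical evaluation results on the training corpus.
measured = {"asr_a": 0.12, "asr_b": 0.08, "asr_c": 0.20,
            "asr_d": 0.15, "asr_e": 0.30}
selected = select_recognizers(measured, n=2)
print(selected)
```

One primary correction model would then be trained per selected recognizer.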

The re-correction model training unit (20) has re-correction models, each set to correspond to one or more primary correction models, and trains the re-correction models. As shown in FIG. 3, the present description uses as its example the case in which two primary correction models correspond to each re-correction model. This setting is not fixed, however: the number of primary correction models assigned to one re-correction model can be adjusted according to system performance. Like the primary correction models, the re-correction models are preferably text-to-text models based on a transformer model with an encoder-decoder structure.
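The assignment of primary correction models to re-correction models can be sketched as chunking a list into groups. The group size of two follows the example in the text; the function name and model IDs are hypothetical, and the specification does not say how models are paired.

```python
from typing import List


def group_primary_models(primary_ids: List[str],
                         group_size: int = 2) -> List[List[str]]:
    """Split primary correction models into groups, one group per re-correction model."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [primary_ids[i:i + group_size]
            for i in range(0, len(primary_ids), group_size)]


groups = group_primary_models(["p1", "p2", "p3", "p4"])
print(groups)  # each inner list feeds one re-correction model
```

Changing `group_size` corresponds to adjusting how many primary correction models feed each re-correction model.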

Each re-correction model is trained with the output data of its one or more corresponding primary correction models as input and clean text data labeled as the correct answer.

The integrated correction model training unit (30) has an integrated correction model and trains it. Like the primary correction models, the integrated correction model is preferably a text-to-text model based on a transformer model with an encoder-decoder structure. The integrated correction model is trained with the output data of the re-correction models in the one or more layers of the pipeline as input and clean text data labeled as the correct answer.
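The way training pairs are built at each level, with upstream outputs as input and the clean transcript as label, can be sketched as follows. The helper names and the string rules standing in for trained models are hypothetical; only the input/label relationship is taken from the text.

```python
from typing import Callable, List, Tuple

Corrector = Callable[[str], str]
Pair = Tuple[str, str]  # (model input, clean reference text)


def primary_pairs(hypotheses: List[str], references: List[str]) -> List[Pair]:
    """Primary level: recognizer output paired with the clean transcript."""
    return list(zip(hypotheses, references))


def next_level_pairs(upstream_models: List[Corrector],
                     upstream_inputs: List[List[str]],
                     references: List[str]) -> List[Pair]:
    """Higher level: each upstream model's output paired with the clean transcript."""
    pairs: List[Pair] = []
    for model, inputs in zip(upstream_models, upstream_inputs):
        pairs.extend((model(text), ref) for text, ref in zip(inputs, references))
    return pairs


references = ["turn on the light"]
hyp_a = ["turn on the lite"]   # hypothetical output of recognizer A
hyp_b = ["turn on de light"]   # hypothetical output of recognizer B

# Toy primary correction models: A fixes its error, B leaves one behind.
primary_a = lambda s: s.replace("lite", "light")
primary_b = lambda s: s

recorrection_pairs = next_level_pairs([primary_a, primary_b],
                                      [hyp_a, hyp_b], references)
print(recorrection_pairs)
```

The same `next_level_pairs` call, applied to the re-correction models and their inputs, would produce the training pairs for the integrated correction model.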

The pipeline configuration unit (40) comprises an initial pipeline setup module, a pipeline performance evaluation module, and a pipeline update module. It sequentially connects one or more layers, each formed by connecting one or more primary correction models to their corresponding re-correction model, and the integrated correction model, thereby configuring a hierarchical pipeline for correcting speech recognizer errors, and it updates the pipeline according to performance evaluation.

The initial pipeline setting module configures the initial state of the hierarchical pipeline by connecting one or more primary correction models to the integrated correction model.

The pipeline performance evaluation module measures a performance evaluation metric for the speech recognition results of the hierarchical pipeline. The metric may be the word error rate (hereinafter 'WER') or the character error rate (hereinafter 'CER'); WER is most preferred.
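For reference, WER and CER are conventionally computed as the edit distance between reference and hypothesis divided by the reference length, over words and characters respectively. The sketch below follows that common definition; the function names are hypothetical, and ignoring whitespace in the CER is an assumption of this sketch rather than something the description specifies.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, with whitespace ignored (an assumption here)."""
    ref_chars = "".join(reference.split())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, "".join(hypothesis.split())) / len(ref_chars)


if __name__ == "__main__":
    print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
    print(cer("abcd", "abed"))                # one substitution out of four characters
```

Because insertions are counted, both rates can exceed 1.0 when the hypothesis is much longer than the reference.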

FIG. 5 is a flowchart illustrating how the pipeline update module updates the pipeline in the speech recognizer error correction system according to a preferred embodiment of the present invention. Referring to FIG. 5, the pipeline update module adds layers to the hierarchical pipeline one at a time, each layer being formed by connecting one or more primary correction models to the corresponding recalibration model, measures the performance evaluation metric for the pipeline's speech recognition results after each addition, and updates the hierarchical pipeline to the extended one when the metric has improved. Layers continue to be added until the WER of the pipeline increases.

The order of the layers constituting the pipeline is determined according to the performance evaluation metric for the pipeline's speech recognition results, and the number of primary correction models and corresponding recalibration models constituting the pipeline is preferably determined according to that order.

When speech recognition result data of an arbitrary speech recognizer is input, the speech recognizer error correction device (50) corrects errors in the data and outputs the result using the hierarchical pipeline consisting of the primary correction model corresponding to that speech recognizer, the recalibration model corresponding to that primary correction model, and the integrated correction model.

When training proceeds in the manner of the present invention, the integrated speech recognition error correction model learns the error patterns of various speech recognizers as the number of stages increases, and becomes able to correct them into clean text.

The speech recognizer error correction device of the speech recognizer error correction system according to the present invention is described below.

The speech recognizer error correction device according to the present invention is implemented by a computing system that performs machine learning, and includes a hierarchical pipeline for correcting speech recognizer errors in which one or more layers, each formed by connecting one or more primary correction models to the corresponding recalibration model, and the integrated correction model are sequentially connected. The primary correction models are speech recognition error correction models respectively corresponding to speech recognizers, and each recalibration model is a speech recognition error correction model set to correspond to one or more primary correction models.

When speech recognition result data of an arbitrary speech recognizer is input, the speech recognizer error correction device configured as described above corrects errors in the data and outputs the result using the hierarchical pipeline consisting of the primary correction model corresponding to that speech recognizer, the recalibration model corresponding to that primary correction model, and the integrated correction model.
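The routing performed by the device can be sketched as below. This is an illustrative sketch only: the class `HierarchicalPipeline`, its field names, and the recognizer identifiers are hypothetical, and each trained model is stood in for by a string-to-string function. The two-primary-models-per-recalibration-model mapping mirrors the FIG. 3 example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for a trained text-to-text correction model.
Corrector = Callable[[str], str]


@dataclass
class HierarchicalPipeline:
    primary: Dict[str, Corrector]        # recognizer id -> primary correction model
    recal_for: Dict[str, str]            # recognizer id -> recalibration model id
    recalibration: Dict[str, Corrector]  # recalibration model id -> model
    integrated: Corrector                # single integrated correction model

    def correct(self, recognizer_id: str, text: str) -> str:
        """Pass one transcript through primary, recalibration, then integrated."""
        if recognizer_id not in self.primary:
            raise KeyError(f"no primary correction model for {recognizer_id!r}")
        text = self.primary[recognizer_id](text)
        text = self.recalibration[self.recal_for[recognizer_id]](text)
        return self.integrated(text)


if __name__ == "__main__":
    # Toy models that only tag the text so the route taken is visible.
    pipeline = HierarchicalPipeline(
        primary={"asr_a": lambda s: s + " |primary_a",
                 "asr_b": lambda s: s + " |primary_b"},
        recal_for={"asr_a": "recal_1", "asr_b": "recal_1"},
        recalibration={"recal_1": lambda s: s + " |recal_1"},
        integrated=lambda s: s + " |integrated",
    )
    print(pipeline.correct("asr_b", "hello"))
```

Keeping the recognizer-to-recalibration mapping as data means a layer can be added or regrouped without changing the `correct` method.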

The primary correction model, the recalibration model, and the integrated correction model are each preferably a text-to-text model based on a transformer model with an encoder-decoder structure.

Each primary correction model is trained with the speech recognition result data output from the speech recognizer to which it corresponds one-to-one as input and clean text data labeled as the correct answer. Each recalibration model is trained with the speech recognition result data output from its corresponding one or more primary correction models as input and clean text data labeled as the correct answer. The integrated correction model is trained with the speech recognition result data output from the recalibration models included in the one or more layers constituting the hierarchical pipeline as input and clean text data labeled as the correct answer.

The speech recognizer error correction method according to the present invention is described in detail below. The method can be implemented as a program and executed on a computing system.

FIG. 6 is a flowchart showing the steps of the speech recognizer error correction method according to the present invention in order. Referring to FIG. 6, the method first trains primary correction models, respectively corresponding to at least two speech recognizers, on the speech error patterns of each speech recognizer. Here, each primary correction model is trained with the speech recognition result data output from the speech recognizer to which it corresponds one-to-one as input and clean text data labeled as the correct answer.

Next, recalibration models are set to correspond to one or more primary correction models, and the recalibration models are trained. Here, each recalibration model is trained with the speech recognition result data output from its corresponding one or more primary correction models as input and clean text data labeled as the correct answer.

Next, the integrated correction model is trained. Here, the integrated correction model is trained with the speech recognition result data output from the recalibration models included in the one or more layers constituting the pipeline as input and clean text data labeled as the correct answer.

Next, a hierarchical pipeline for correcting speech recognizer errors is constructed and updated by sequentially connecting one or more layers, each formed by connecting one or more primary correction models to the corresponding recalibration model, and the integrated correction model. The process of constructing and updating the hierarchical pipeline is described in more detail below. First, the initial form of the hierarchical pipeline is set by connecting one or more primary correction models to the integrated correction model, and the WER, the performance evaluation metric for the pipeline's speech recognition results, is measured. Next, layers, each formed by connecting one or more primary correction models to the corresponding recalibration model, are added to the hierarchical pipeline one at a time, and the WER of the pipeline's speech recognition results is measured after each addition. If the WER has decreased, meaning performance has improved, the pipeline is updated to the one with the added layer. If the WER has increased, meaning performance has degraded, the existing pipeline is kept as is, without the added layer. Layers continue to be added until the WER of the pipeline increases.
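The layer-addition procedure described above amounts to a greedy loop with an early stop. The following is a minimal sketch under stated assumptions: `grow_pipeline`, `candidates`, and `evaluate` are hypothetical names, `evaluate` is assumed to return the WER of a given pipeline on some held-out data, and an unchanged WER is treated as "no improvement" (the description only specifies the decrease and increase cases).

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Layer = TypeVar("Layer")


def grow_pipeline(
    initial: List[Layer],
    candidates: Sequence[Layer],
    evaluate: Callable[[List[Layer]], float],
) -> Tuple[List[Layer], float]:
    """Add candidate layers one at a time while the WER keeps decreasing.

    initial: the initial pipeline (primary models plus integrated model).
    candidates: layers to try, in order; each layer groups primary
        correction models with their recalibration model.
    evaluate: returns the WER of a pipeline (lower is better).
    """
    pipeline = list(initial)
    best_wer = evaluate(pipeline)
    for layer in candidates:
        trial = pipeline + [layer]
        trial_wer = evaluate(trial)
        if trial_wer < best_wer:
            # WER decreased: adopt the pipeline with the added layer.
            pipeline, best_wer = trial, trial_wer
        else:
            # WER did not decrease: keep the existing pipeline and stop.
            break
    return pipeline, best_wer


if __name__ == "__main__":
    # Toy WER values indexed by the number of added layers.
    toy_wer = {0: 0.30, 1: 0.25, 2: 0.22, 3: 0.24}
    final, score = grow_pipeline(
        initial=[],
        candidates=["layer_1", "layer_2", "layer_3"],
        evaluate=lambda p: toy_wer[len(p)],
    )
    print(final, score)
```

In the toy run the third layer raises the WER, so the loop stops with two layers, which corresponds to the layer order being fixed by the evaluation metric.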

The above process completes the hierarchical pipeline for speech recognizer error correction. Then, when speech recognition result data of an arbitrary speech recognizer is input, errors in the data are corrected and the result is output using the hierarchical pipeline consisting of the primary correction model corresponding to that speech recognizer, the recalibration model corresponding to that primary correction model, and the integrated correction model.

The primary correction model, the recalibration model, and the integrated correction model are each preferably a text-to-text model based on a transformer model with an encoder-decoder structure.

Meanwhile, the speech recognizer error correction system according to the present invention can be applied not only to correcting errors in the speech recognition results of speech recognizers but also to correcting errors in speech data of various styles.

In general, speech data shows several characteristics that differ from clean text data. They can be listed as follows. 1) Speech Recognition Error Style refers to the speech recognition errors that arise when speech data is converted into text form. 2) Disfluency Style refers to meaningless filler sounds such as "ah... um..." produced when conversing or reading aloud. 3) Paraphrased Style refers to the fact that various homonyms can be used in an utterance. These characteristics do not appear in collected clean text, because they are deleted or corrected during collection; they exist only in speech data. FIG. 9 is a table showing examples of the various styles of speech data to which the system of FIGS. 7 and 8, described below, can be applied.

FIGS. 7 and 8 are block diagrams illustrating a speech data error correction system according to another aspect of the present invention. Referring to FIGS. 7 and 8, speech data of the various styles described above (for example, Speech Recognition Error Style, Disfluency Style, and Paraphrased Style) is input, a primary correction model is trained and provided for the speech data of each style, and errors are corrected by the primary correction models, the recalibration models, and the integrated correction model.

The present invention has been described above with reference to its preferred embodiments, but these are only examples and do not limit the invention. Those of ordinary skill in the art to which the invention pertains will appreciate that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the invention. Differences relating to such modifications and applications should be construed as falling within the scope of the invention as defined in the appended claims.

1: Speech recognizer error correction system
10: Primary correction model learning unit
20: Recalibration model learning unit
30: Integrated correction model learning unit
40: Pipeline configuration unit
50: Speech recognizer error correction device

Claims

A speech recognizer error correction system implemented by a computing system that performs machine learning, the system comprising:
a primary correction model learning unit having primary correction models respectively corresponding to at least two speech recognizers, and training each primary correction model on the speech error patterns of its speech recognizer;
a recalibration model learning unit having recalibration models set to correspond to one or more primary correction models, and training the recalibration models;
an integrated correction model learning unit having an integrated correction model, and training the integrated correction model; and
a pipeline configuration unit that constructs and updates a hierarchical pipeline for correcting errors of a speech recognizer by sequentially connecting one or more layers, each formed by connecting one or more primary correction models to the corresponding recalibration model, and the integrated correction model.

The system of claim 1, further comprising a speech recognizer error correction device that, when speech recognition result data of an arbitrary speech recognizer is input, corrects errors in the speech recognition result data and outputs the result using a hierarchical pipeline consisting of the primary correction model corresponding to that speech recognizer, the recalibration model corresponding to that primary correction model, and the integrated correction model.

The system of claim 1, wherein the pipeline configuration unit determines the order of the layers constituting the pipeline according to a performance evaluation metric for the speech recognition results of the pipeline, and
the number of primary correction models and corresponding recalibration models constituting the pipeline is determined according to the order of the layers.

The system of claim 1, wherein the primary correction model, the recalibration model, and the integrated correction model are each a text-to-text model based on a transformer model with an encoder-decoder structure.

The system of claim 1, wherein the performance evaluation metric for the speech recognition results used by the pipeline configuration unit is the word error rate (WER).

The system of claim 1, wherein each primary correction model is trained with the speech recognition result data output from the speech recognizer to which it corresponds one-to-one as input and clean text data labeled as the correct answer,
each recalibration model is trained with the speech recognition result data output from its corresponding one or more primary correction models as input and clean text data labeled as the correct answer, and
the integrated correction model is trained with the speech recognition result data output from the recalibration models included in the one or more layers constituting the hierarchical pipeline as input and clean text data labeled as the correct answer.

The system of claim 1, wherein the pipeline configuration unit comprises:
an initial pipeline setting module that sets the initial state of the hierarchical pipeline by connecting one or more primary correction models to the integrated correction model;
a pipeline performance evaluation module that measures a performance evaluation metric for the speech recognition results of the hierarchical pipeline; and
a pipeline update module configured to add layers to the hierarchical pipeline one at a time, each layer being formed by connecting one or more primary correction models to the corresponding recalibration model, to measure the performance evaluation metric for the speech recognition results of the pipeline, and to update the pipeline when the performance evaluation metric has improved.

A speech recognizer error correction method comprising:
(a) training primary correction models, respectively corresponding to at least two speech recognizers, on the speech error patterns of each speech recognizer;
(b) setting recalibration models to correspond to one or more primary correction models and training the recalibration models;
(c) training an integrated correction model; and
(d) constructing a hierarchical pipeline for correcting errors of a speech recognizer by sequentially connecting one or more layers, each formed by connecting one or more primary correction models to the corresponding recalibration model, and the integrated correction model.

The method of claim 8, further comprising:
(e) when speech recognition result data of an arbitrary speech recognizer is input, correcting errors in the speech recognition result data and outputting the result using a hierarchical pipeline consisting of the primary correction model corresponding to that speech recognizer, the recalibration model corresponding to that primary correction model, and the integrated correction model.

The method of claim 8, wherein in step (d) the order of the layers constituting the pipeline is determined according to a performance evaluation metric for the speech recognition results of the pipeline, and
the number of primary correction models and corresponding recalibration models constituting the pipeline is determined according to the order of the layers.

The method of claim 8, wherein the primary correction model, the recalibration model, and the integrated correction model are each a text-to-text model based on a transformer model with an encoder-decoder structure.

The method of claim 10, wherein the performance evaluation metric for the speech recognition results of the pipeline is the word error rate (WER).

The method of claim 8, wherein each primary correction model is trained with the speech recognition result data output from the speech recognizer to which it corresponds one-to-one as input and clean text data labeled as the correct answer,
each recalibration model is trained with the speech recognition result data output from its corresponding one or more primary correction models as input and clean text data labeled as the correct answer, and
the integrated correction model is trained with the speech recognition result data output from the recalibration models included in the one or more layers constituting the pipeline as input and clean text data labeled as the correct answer.

The method of claim 8, wherein step (d) comprises:
(d1) setting an initial form of the hierarchical pipeline by connecting one or more primary correction models to the integrated correction model;
(d2) measuring a performance evaluation metric for the speech recognition results of the hierarchical pipeline; and
(d3) adding layers to the hierarchical pipeline one at a time, each layer being formed by connecting one or more primary correction models to the corresponding recalibration model, while measuring the performance evaluation metric for the speech recognition results of the pipeline, and updating the pipeline according to the performance evaluation metric.

A speech recognizer error correction device implemented by a computing system that performs machine learning, the device comprising a hierarchical pipeline for correcting errors of a speech recognizer, in which one or more layers, each formed by connecting one or more primary correction models to the corresponding recalibration model, and an integrated correction model are sequentially connected,
wherein, when speech recognition result data of an arbitrary speech recognizer is input, the device corrects errors in the speech recognition result data and outputs the result using the hierarchical pipeline consisting of the primary correction model corresponding to that speech recognizer, the recalibration model corresponding to that primary correction model, and the integrated correction model, and
wherein the primary correction models are speech recognition error correction models respectively corresponding to speech recognizers, and each recalibration model is a speech recognition error correction model set to correspond to one or more primary correction models.

The device of claim 15, wherein the primary correction model, the recalibration model, and the integrated correction model are each a text-to-text model based on a transformer model with an encoder-decoder structure.

The device of claim 15, wherein each primary correction model is trained with the speech recognition result data output from the speech recognizer to which it corresponds one-to-one as input and clean text data labeled as the correct answer,
each recalibration model is trained with the speech recognition result data output from its corresponding one or more primary correction models as input and clean text data labeled as the correct answer, and
the integrated correction model is trained with the speech recognition result data output from the recalibration models included in the one or more layers constituting the hierarchical pipeline as input and clean text data labeled as the correct answer.