KR102185183B1

KR102185183B1 - a broadcast closed caption generating system

Info

Publication number: KR102185183B1
Application number: KR1020190047796A
Authority: KR
Inventors: 손석연
Original assignee: 주식회사 한국스테노
Priority date: 2019-04-24
Filing date: 2019-04-24
Publication date: 2020-12-01
Also published as: KR20200124456A

Abstract

A system for producing real-time broadcast captions for the hearing impaired is disclosed. A broadcast caption production system according to an embodiment of the present invention includes a speech recognition unit that receives a voice signal from content and recognizes and converts the voice signal into text, a shorthand input unit that obtains caption shorthand input through a shorthand input device, and an integrated caption processing unit that generates a final caption by integrating the speech-to-text conversion data obtained from the speech recognition unit with the shorthand input data obtained from the shorthand input unit.

Description

Broadcast closed caption generating system {a broadcast closed caption generating system}

The present invention relates to a broadcast caption production system. Specifically, the present invention relates to a broadcast caption production system that uses both automatic speech recognition technology and shorthand input.

The dictionary definition of shorthand covers any activity in which intangible spoken language, video, or audio is recorded quickly and accurately and converted into text. Because shorthand turns spoken language into written characters, it is a form of writing and is therefore fundamentally different from audio recording made to preserve spoken language.

Shorthand has evolved from handwritten shorthand through typewriter shorthand to computer-based shorthand, in which content is entered on a computer keyboard, and the development of software and hardware has made it possible to record even material that was previously difficult for a person to take down.

In real-time broadcasting for the hearing impaired, captions for broadcast content were formerly produced by relying on shorthand. More recently, with advances in machine learning, speech-to-text tools have become widespread, making it possible to generate captions in real time without relying on professional stenographers.

However, although a speech-to-text tool makes it easier to generate broadcast captions than the conventional method that relies on stenographer input, there may be a problem in that captions are not generated when speech recognition is inaccurate or fails. The following therefore describes a caption generation system that uses both speech recognition and shorthand input so that each compensates for the other's shortcomings.

An object of the broadcast caption production system according to an embodiment of the present invention is to provide a system that generates real-time captions for the hearing impaired by using shorthand input to resolve problems that may arise when captions are generated by speech recognition alone.

A broadcast caption production system according to an embodiment of the present invention includes a speech recognition unit that receives a voice signal from content and recognizes and converts the voice signal into text, a shorthand input unit that obtains caption shorthand input through a shorthand input device, and an integrated caption processing unit that generates a final caption by integrating the speech-to-text conversion data obtained from the speech recognition unit with the shorthand input data obtained from the shorthand input unit.

The broadcast caption production system according to an embodiment of the present invention can use shorthand input to accurately correct inaccurate captions that may arise when captions are generated through speech recognition.

In addition, because the broadcast caption production system according to an embodiment of the present invention generates captions on the basis of speech recognition, it can produce more broadcast captions for the hearing impaired while keeping the use of stenographers to a minimum.

FIG. 1 is an overall configuration diagram of a broadcast caption production system according to an embodiment of the present invention.
FIG. 2 shows the configuration of a speech recognition unit according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a shorthand input unit according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the operation of a caption generation system according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention are described in detail with reference to the drawings. The spirit of the present invention is not, however, limited to the embodiments below. A person skilled in the art who understands the spirit of the present invention could readily propose other embodiments within the scope of the same spirit by adding, changing, or deleting components, and such embodiments also fall within the scope of the spirit of the present invention.

To make the spirit of the invention easier to understand, the accompanying drawings may omit fine details when illustrating the overall structure, and may not reflect the overall structure in detail when illustrating fine details. Where specifics such as installation position differ but the function is the same, the same name is used for ease of understanding. Where there are several identical components, only one of them is described; the same description applies to the others and is not repeated.

FIG. 1 is an overall configuration diagram of a broadcast caption production system according to an embodiment of the present invention.

The broadcast caption production system disclosed in the present invention is not a system for general broadcast captions but a real-time caption generation system for the hearing impaired.

As shown in FIG. 1, the content provider 20 delivers broadcast content to the broadcast caption production system 10, and the broadcast caption production system 10 generates broadcast captions based on the broadcast content and delivers them to the set-top box 30. The set-top box 30 encodes the received broadcast captions and outputs them together with the broadcast content.

A typical example of the content provider 20 is a broadcasting station, but it may also include an operator other than a broadcasting station that produces and provides content, or an intermediate distributor. The content provider 20 delivers content including audio and video to the broadcast caption production system 10.

The broadcast caption production system 10 generates broadcast captions based on the content received from the content provider 20. Specifically, the broadcast caption production system 10 may include a speech recognition unit 100, a shorthand input unit 200, and an integrated caption processing unit 300.

When the content received from the content provider 20 is played, the speech recognition unit 100 automatically recognizes the speech and converts it into text. The speech recognition unit 100 may be a commonly used speech-to-text tool; specific examples include the Google Cloud Speech API and Amazon Transcribe. The detailed configuration of the speech recognition unit 100 is described separately below.

FIG. 2 shows the configuration of the speech recognition unit 100 according to an embodiment of the present invention.

As shown in FIG. 2, the speech recognition unit 100 according to an embodiment of the present invention may include a voice receiving unit 110, a speech-to-text conversion unit 120, and an accuracy calculation unit 130.

The voice receiving unit 110 obtains a voice signal from the content. For example, the voice receiving unit 110 may be a microphone. The voice receiving unit 110 may collect all voice signals coming from the content, convert the collected voice signals into digital signals, and pass them to the speech-to-text conversion unit 120.

In another embodiment, the voice receiving unit 110 may also receive the stenographer's voice. When it is difficult for the stenographer to enter shorthand through the shorthand keyboard in a particular situation, the voice receiving unit 110 may receive the stenographer's voice so that it can be converted into text. Data obtained by converting the stenographer's voice into text may, however, be processed differently from text converted from the content's voice signal before being passed to the integrated caption processing unit 300. The integrated caption processing unit 300 may treat the text data converted from the stenographer's voice in the same way as supplementary shorthand input and use it to generate the final caption.

The speech-to-text conversion unit 120 converts the voice signal received from the voice receiving unit 110 into text. Specifically, the speech-to-text conversion unit 120 may be a machine learning application that performs automatic speech recognition using deep learning. The speech-to-text conversion unit 120 may transcribe audio files stored in common formats such as WAV and MP3 and add a timestamp to each word.

The accuracy calculation unit 130 may calculate the accuracy of speech recognition during automatic speech recognition. Specifically, the accuracy calculation unit 130 may distinguish human voice (live speech) from noise in the voice signal, and may calculate the accuracy based on the loudness of the human voice, the ratio between the human voice and noise, or the speech recognition result.

In one embodiment, the accuracy calculation unit 130 may regard the accuracy as low when the human voice in the voice signal is quiet and as high when the human voice is loud. In other words, the accuracy calculation unit 130 may calculate the accuracy in proportion to the loudness of the human voice. For example, a speaker in the content may be speaking away from the microphone, or may be speaking relatively quietly. The criterion for judging whether a human voice is loud or quiet may be the loudness of human voices in typical content, and the specific value may also be obtained through machine learning.
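The loudness-based embodiment above can be illustrated with a minimal Python sketch. It is only one possible reading: the function names, the use of RMS level as the loudness measure, and the `reference_rms` value standing in for "typical content" loudness are all assumptions introduced here, not details given in the specification.

```python
import math


def rms(samples):
    """Root-mean-square level of audio samples normalized to [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudness_accuracy(voice_samples, reference_rms=0.1):
    """Score in [0, 1] that grows in proportion to voice loudness.

    reference_rms is a hypothetical "typical content" voice level; the
    specification notes that such a value could also be learned.
    """
    if reference_rms <= 0:
        raise ValueError("reference_rms must be positive")
    # Proportional to loudness, capped at 1.0 once the reference is reached.
    return min(1.0, rms(voice_samples) / reference_rms)
```

A voice at half the reference level would score about 0.5 under this sketch, and anything at or above the reference level would score 1.0.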

In another embodiment, the accuracy calculation unit 130 may regard the accuracy as lower the higher the proportion of noise is in the ratio between human voice and noise in the voice signal. In other words, the accuracy calculation unit 130 may calculate the accuracy in inverse relation to the noise ratio. For example, the venue shown in the content may be noisy, or non-speech sounds may dominate.
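The noise-ratio embodiment can be sketched in the same way. The mapping below (one minus the noise share of total power) is an assumption chosen for simplicity; the specification only requires that accuracy fall as the noise ratio rises, and the function and parameter names are hypothetical.

```python
def noise_ratio_accuracy(voice_power, noise_power):
    """Score in [0, 1] that falls as the noise share of the signal rises.

    voice_power and noise_power are non-negative power estimates for the
    speech and non-speech parts of the signal (how they are separated is
    outside this sketch).
    """
    if voice_power < 0 or noise_power < 0:
        raise ValueError("power values must be non-negative")
    total = voice_power + noise_power
    if total == 0:
        return 0.0  # silence: nothing to recognize
    noise_ratio = noise_power / total
    return 1.0 - noise_ratio
```

With three parts voice to one part noise, the noise ratio is 0.25 and the sketch returns 0.75.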

In another embodiment, the accuracy calculation unit 130 may calculate the accuracy based on the recognition result. The accuracy calculation unit 130 may calculate the accuracy by determining whether the text produced by recognizing the speech conforms to standard-language orthography. For example, the converted text may contain spelling errors, or a speaker in the content may be speaking a dialect.

When the speech recognition accuracy is at or below a specific value, the accuracy calculation unit 130 may separately record the timestamp of the corresponding word or section. The recorded timestamp may be passed to the integrated caption processing unit 300 or to the shorthand input unit 200.
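As a rough illustration of recording timestamps for low-accuracy words, the sketch below assumes each recognized word carries a start time and an accuracy score. The `Word` type, its field names, and the `threshold` default are hypothetical; the specification does not fix a data format or a threshold value.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str        # recognized text
    start: float     # timestamp in seconds from the start of the content
    accuracy: float  # recognition accuracy score in [0, 1]


def flag_low_accuracy(words, threshold=0.6):
    """Return the timestamps of words whose accuracy is at or below threshold."""
    return [w.start for w in words if w.accuracy <= threshold]


words = [
    Word("hello", 0.0, 0.9),
    Word("wrld", 0.5, 0.4),
    Word("again", 1.1, 0.6),
]
flagged = flag_low_accuracy(words)  # timestamps to hand to units 200 / 300
```

The resulting list is what would be forwarded so that the stenographer can be alerted and the caption later patched at those positions.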

The description now returns to FIG. 1.

The shorthand input unit 200 obtains shorthand input. The shorthand input unit 200 may obtain the shorthand input from a stenographer. The detailed configuration of the shorthand input unit 200 is described separately below.

FIG. 3 is a block diagram showing the configuration of the shorthand input unit 200 according to an embodiment of the present invention.

As shown in FIG. 3, the shorthand input unit 200 according to an embodiment of the present invention may include a notification display unit 210 and a shorthand input device 220.

The notification display unit 210 presents a notification to the stenographer. The notification may indicate that the recognition accuracy at the speech recognition unit 100 is at or below a certain value. When the speech recognition accuracy is at or below that value, the speech-to-text result is likely to be inaccurate, so the stenographer can enter the caption directly to correct the automatic speech recognition result. The notification display unit 210 may be a display device or an audio device, and may provide a visual or audible notification to the stenographer.

The shorthand input device 220 obtains shorthand input from the stenographer. The shorthand input device 220 may obtain shorthand input from a shorthand keyboard and convert it into text data. The shorthand input device 220 may be a commonly used shorthand keyboard, and the shorthand keyboard may be a combined English and Korean shorthand keyboard. The shorthand input device 220 may further include a display device, which may display the caption text entered through the shorthand keyboard.

The description now returns to FIG. 1.

The integrated caption processing unit 300 generates the final caption by integrating the text received from the speech recognition unit 100 and the shorthand input unit 200. Specifically, the integrated caption processing unit 300 integrates the speech-to-text conversion data received from the speech recognition unit 100 with the shorthand input data received from the shorthand input unit 200 to generate the final caption.

In one embodiment, the integrated caption processing unit 300 may use the speech-to-text conversion data received from the speech recognition unit 100 as the base and supplement part of that data with the shorthand input data received from the shorthand input unit 200 to generate the final caption. As described above, in certain situations the recognition accuracy of the speech recognition unit 100 may be low and the text conversion result may be inaccurate; in that case, the inaccurate conversion result may be supplemented by the stenographer's direct input to generate the final caption.

The integrated caption processing unit 300 may obtain from the speech recognition unit 100 the timestamp information of a word or section to be supplemented, that is, one whose accuracy is at or below a certain value. The integrated caption processing unit 300 may then compare the timestamp of the word or section to be supplemented with the start time of the shorthand input, synchronize the shorthand input data with the speech-to-text conversion data, and generate the final caption.

In a specific embodiment, the integrated caption processing unit 300 compares the timestamp of each word or section to be supplemented with the time at which a supplementary shorthand input began, and generates the final caption by matching the supplementary shorthand input to the word or section to be supplemented for which that difference is smallest.
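A minimal sketch of this smallest-time-difference matching is shown below. The function name, the tuple layout of the shorthand inputs, and the tie-breaking rule (if two inputs land on the same target, the closer one wins) are assumptions added for illustration; the specification does not say how such collisions are handled.

```python
def match_by_nearest_time(target_starts, shorthand_inputs):
    """Match each shorthand input to the supplement target nearest in time.

    target_starts: timestamps (seconds) of words/sections to be supplemented.
    shorthand_inputs: (input_start_time, text) pairs from the stenographer.
    Returns a dict mapping target index -> replacement text.
    """
    best = {}  # target index -> (time difference, text)
    if not target_starts:
        return {}
    for input_start, text in shorthand_inputs:
        # Find the target whose timestamp is closest to this input's start.
        idx = min(
            range(len(target_starts)),
            key=lambda i: abs(target_starts[i] - input_start),
        )
        diff = abs(target_starts[idx] - input_start)
        # Assumed rule: on a collision, keep the input with the smaller gap.
        if idx not in best or diff < best[idx][0]:
            best[idx] = (diff, text)
    return {idx: text for idx, (_, text) in best.items()}


matches = match_by_nearest_time(
    [10.0, 25.0],
    [(11.5, "fixed A"), (27.0, "fixed B")],
)
```

Here the stenographer's inputs start a second or two after each flagged timestamp, and each is paired with the flagged position it trails most closely.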

In another embodiment, the integrated caption processing unit 300 generates the final caption by comparing and matching only the temporal order of one or more words or sections to be supplemented with the temporal order of one or more supplementary shorthand inputs. Because the number of words or sections to be supplemented is expected to equal the number of supplementary shorthand inputs, the final caption can be generated by comparing order alone and replacing each word or section to be supplemented, in sequence, with the corresponding supplementary shorthand input.
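The order-only matching can be sketched as follows. Representing the recognized caption as a list of words and the targets as list indices is an assumption made for this example, as is raising an error when the counts differ (the specification simply expects them to be equal).

```python
def merge_by_order(asr_words, target_indices, shorthand_texts):
    """Replace flagged words, in time order, with shorthand inputs in input order.

    asr_words: recognized words in time order.
    target_indices: positions in asr_words flagged as needing supplementation.
    shorthand_texts: supplementary shorthand inputs in the order they were entered.
    """
    if len(target_indices) != len(shorthand_texts):
        raise ValueError("number of targets and shorthand inputs must match")
    final = list(asr_words)  # copy so the recognizer output is not modified
    # The n-th flagged word is replaced by the n-th shorthand input.
    for idx, text in zip(sorted(target_indices), shorthand_texts):
        final[idx] = text
    return " ".join(final)


caption = merge_by_order(
    ["the", "wether", "is", "sunni"],
    [1, 3],
    ["weather", "sunny"],
)
```

Compared with the time-difference approach, this needs no clock comparison at all, but it depends on the stenographer answering every notification exactly once and in order.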

FIG. 4 is a flowchart illustrating the operation of a caption generation system according to an embodiment of the present invention.

The caption generation system converts the speech contained in a first section into text using an automatic speech recognition tool (S10). As described above, the automatic speech recognition tool may be any automatic speech recognition tool in current use. The first section is a portion of the full timeline of the content that is subject to speech recognition.

The caption generation system obtains the accuracy of the automatic speech recognition and determines whether the accuracy is at or above a specific value (S20). The accuracy of automatic speech recognition may be determined based on at least one of the loudness of the voice signal, the ratio between human voice and noise, or the result of the automatic speech recognition. The specific reference value used as the threshold may be an arbitrarily set value or a value obtained through machine learning.

When the accuracy of the speech-to-text conversion for the first section is at or above the specific value, the caption generation system automatically recognizes the speech contained in the next section and converts it into text (S30).

When, on the other hand, the accuracy of the speech-to-text conversion for the first section is at or below the specific value, the caption generation system records the start time of the first section (S40). Automatic speech recognition tools generally record a timestamp indicating when each piece of converted speech was captured, and the caption generation system may separately record and manage the timestamps of words or sections whose accuracy is at or below the specific value.

When the conversion accuracy for the first section is at or below the specific value, the caption generation system outputs a notification to the stenographer (S50). The caption generation system may output the notification to the stenographer visually or audibly.

The caption generation system obtains shorthand input for the first voice (S60). The caption generation system may obtain the shorthand input for the first voice through a shorthand keyboard. The caption generation system may also obtain the shorthand input for the first voice through speech recognition, in which case the speech being recognized may be the stenographer's voice.

The caption generation system generates the final caption by integrating the speech recognition result and the shorthand input result based on the recorded start time of the first section and the time at which the shorthand input began (S70).

In one embodiment, the caption generation system compares the start time of each section whose conversion accuracy is at or below the specific value (hereinafter, a supplementary section) with the time at which shorthand input for a supplementary section began, and generates the final caption by matching the supplementary shorthand input to the supplementary section for which that difference is smallest. Supplementary shorthand input refers to the shorthand input data entered by the stenographer in response to a notification.

In another embodiment, the caption generation system generates the final caption by comparing and matching only the temporal order of the start times of one or more supplementary sections with the temporal order of one or more supplementary shorthand inputs. Because the number of words or sections to be supplemented is expected to equal the number of supplementary shorthand inputs, the final caption can be generated by comparing order alone and replacing each word or section to be supplemented, in sequence, with the corresponding supplementary shorthand input.
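Steps S10 to S70 can be tied together for a single section in a short Python sketch. Everything named here is hypothetical: `recognize`, `notify`, and `get_shorthand` stand in for the speech recognition tool, the notification display unit, and the shorthand input device, and the `threshold` default is arbitrary. The flowchart text says "at or above" for one branch and "at or below" for the other, so the case of exact equality is ambiguous; this sketch treats equality as low accuracy.

```python
def caption_section(section, recognize, notify, get_shorthand, threshold=0.6):
    """Run the S10-S70 flow for one section and return its caption text.

    section: dict with "audio" (opaque audio data) and "start" (seconds).
    recognize(audio) -> (text, accuracy)      # S10, hypothetical callable
    notify(start_time) -> None                # S50, hypothetical callable
    get_shorthand() -> (input_start, text)    # S60, hypothetical callable
    """
    text, accuracy = recognize(section["audio"])      # S10: speech to text
    if accuracy > threshold:                          # S20: accuracy check
        return text                                   # S30: go to next section
    flagged_start = section["start"]                  # S40: record start time
    notify(flagged_start)                             # S50: alert stenographer
    _input_start, shorthand_text = get_shorthand()    # S60: shorthand input
    # S70: with a single flagged section the match is trivial, so the
    # shorthand text simply replaces the recognizer output.
    return shorthand_text


alerts = []
good = caption_section(
    {"audio": b"", "start": 0.0},
    lambda audio: ("good text", 0.9),
    alerts.append,
    lambda: (0.0, "unused"),
)
fixed = caption_section(
    {"audio": b"", "start": 12.0},
    lambda audio: ("bad txt", 0.3),
    alerts.append,
    lambda: (13.2, "bad text"),
)
```

In a multi-section run, the recorded start times and shorthand start times would be collected and matched by time difference or by order, as in the two embodiments just described.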

The caption generation system delivers the generated final caption to the set-top box. The set-top box can output a captioned broadcast for the hearing impaired by displaying the received captions together with the video.

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. Computer-readable media include all types of recording devices that store data readable by a computer system. Examples of computer-readable media include an HDD (Hard Disk Drive), SSD (Solid State Disk), SDD (Silicon Disk Drive), ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage device, and also include implementations in the form of a carrier wave (for example, transmission over the Internet).

The above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

Claims

A broadcast caption production system comprising:
a speech recognition unit including a voice receiving unit that receives a voice signal from content, a speech-to-text conversion unit that recognizes the received voice signal and converts it into text, and an accuracy calculation unit that calculates the accuracy of the speech recognition;
a shorthand input unit that obtains caption shorthand input and includes a shorthand input device and a notification display unit that displays a notification to a stenographer when the accuracy of the speech recognition is at or below a specific value; and
an integrated caption processing unit that generates a final caption by integrating the speech-to-text conversion data obtained from the speech recognition unit with the shorthand input data obtained from the shorthand input unit,
wherein the speech-to-text conversion unit is a machine learning application for automatic speech recognition and adds a timestamp to each word converted by speech recognition, and
wherein the integrated caption processing unit generates the final caption by matching shorthand input data to each word or section to be supplemented, being a word or section for which the accuracy of the speech recognition is at or below a predetermined value, the shorthand input data being matched in input order to the words or sections to be supplemented in their time order.

delete