KR102287156B1

KR102287156B1 - Sentence selection device for speech synthesis training based on phoneme string for constructing speech synthesizer and operating method thereof

Info

Publication number: KR102287156B1
Application number: KR1020190116518A
Authority: KR
Inventors: 김회린; 서영주; 정성희; 최연주
Original assignee: 주식회사 한글과컴퓨터; 한국과학기술원
Priority date: 2019-09-23
Filing date: 2019-09-23
Publication date: 2021-08-06
Also published as: KR20210034789A

Abstract

Disclosed are a phoneme-sequence-based apparatus for selecting sentences for speech synthesis training, used to build a speech synthesizer, and an operating method thereof. The present invention presents a technique for selecting, from among a plurality of sentences, those sentences whose phoneme sequence distribution is similar to the phoneme sequence distribution across the entire plurality of sentences as the sentences for speech synthesis training. As a result, when a user wants to create utterance data for building a speech synthesizer, highly efficient utterance data can be created from only a small number of sentences.

Description

SENTENCE SELECTION DEVICE FOR SPEECH SYNTHESIS TRAINING BASED ON PHONEME STRING FOR CONSTRUCTING SPEECH SYNTHESIZER AND OPERATING METHOD THEREOF

The present invention relates to a phoneme-sequence-based apparatus for selecting sentences for speech synthesis training, used to build a speech synthesizer, and to an operating method thereof.

Recently, as text-to-speech (TTS) technology for converting text into speech has advanced, various services using this technology have been released.

In particular, because text-to-speech technology can convert text into speech and output it, it is highly valuable as an assistive tool for the visually impaired.

In general, speech synthesis technology is divided into three kinds: parametric speech synthesis, which models speech as parameters from speech data and then synthesizes speech corresponding to a desired text from the resulting acoustic model; unit selection speech synthesis, which collects a large set of utterable speech waveforms into a corpus, then selects small units (speech waveform fragments) from that corpus and concatenates them so that the most natural speech waveform corresponding to the text to be synthesized is generated; and hybrid techniques that combine the two.

With either approach, for the synthesized waveform to be natural, intelligible, high-quality speech, the speech corpus must be as large as possible so that it adequately covers the allophones produced by the context-dependent phonological phenomena that occur in human speech. However, because building a large speech corpus takes considerable cost and time, an efficient corpus construction technique is needed that keeps the corpus as small as possible while still effectively covering phonemes in their various contexts.

In particular, when an individual user wants to build a speech synthesizer that synthesizes speech in his or her own voice, utterance data for a large number of sentences must be secured to improve the synthesizer's performance. An individual user, however, is inevitably limited in how much utterance data he or she can secure.

In this regard, if only the training sentences that contain frequently used phonemes can be appropriately selected from a large number of sentences, and utterance data can be generated from the selected training sentences, a high-quality speech synthesizer could be built without securing utterance data for all of the sentences.

Therefore, research is needed on a training sentence selection technique that can appropriately select training sentences containing frequently used phonemes.

The present invention aims to present a technique for selecting, from among a plurality of sentences, those sentences whose phoneme sequence distribution is similar to the phoneme sequence distribution across the entire plurality of sentences as the sentences for speech synthesis training, so that a user who wants to create utterance data for building a speech synthesizer can create highly efficient utterance data from only a small number of sentences.

A phoneme-sequence-based sentence selection apparatus for speech synthesis training, used to build a speech synthesizer, according to an embodiment of the present invention includes: a sentence storage unit in which a plurality of predetermined sentences are stored; a phoneme sequence conversion unit that generates a phoneme sequence corresponding to each of the plurality of sentences by converting the characters in each sentence into phoneme units; a unique phoneme generation unit that generates a plurality of unique phonemes by removing duplicate phonemes from among the phonemes included in the phoneme sequences of the plurality of sentences; and a phoneme sequence pool generation unit. The phoneme sequence pool generation unit temporarily stores the phoneme sequence of each of the plurality of sentences in a phoneme sequence pool one at a time and calculates a dissimilarity based on the Kullback-Leibler divergence (KLD) between a first probability distribution, which gives the proportion at which each unique phoneme occurs across the phoneme sequences of all of the plurality of sentences, and a second probability distribution, which gives the proportion at which each unique phoneme occurs in the phoneme sequence of the sentence stored in the pool. It then finally stores in the pool the phoneme sequence of a first sentence, namely the sentence among the plurality of sentences for which the calculated dissimilarity is smallest.

In addition, an operating method of the phoneme-sequence-based sentence selection apparatus for speech synthesis training according to an embodiment of the present invention includes: maintaining a sentence storage unit in which a plurality of predetermined sentences are stored; generating a phoneme sequence corresponding to each of the plurality of sentences by converting the characters in each sentence into phoneme units; generating a plurality of unique phonemes by removing duplicate phonemes from among the phonemes included in the phoneme sequences of the plurality of sentences; and temporarily storing the phoneme sequence of each of the plurality of sentences in a phoneme sequence pool one at a time, calculating a dissimilarity based on the Kullback-Leibler divergence (KLD) between a first probability distribution, which gives the proportion at which each unique phoneme occurs across the phoneme sequences of all of the plurality of sentences, and a second probability distribution, which gives the proportion at which each unique phoneme occurs in the phoneme sequence of the sentence stored in the pool, and then finally storing in the pool the phoneme sequence of a first sentence for which the calculated dissimilarity is smallest among the plurality of sentences.

By presenting a technique for selecting, from among a plurality of sentences, those sentences whose phoneme sequence distribution is similar to the phoneme sequence distribution across the entire plurality of sentences as the sentences for speech synthesis training, the present invention allows a user who wants to create utterance data for building a speech synthesizer to create highly efficient utterance data from only a small number of sentences.

FIG. 1 is a diagram showing the structure of a phoneme-sequence-based sentence selection apparatus for speech synthesis training, used to build a speech synthesizer, according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an operating method of the phoneme-sequence-based sentence selection apparatus for speech synthesis training according to an embodiment of the present invention.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. This description is not intended to limit the present invention to particular embodiments and should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the invention. In describing the drawings, like reference numerals are used for like components. Unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

In this document, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, in the various embodiments of the present invention, each component, functional block, or means may consist of one or more sub-components. The electrical, electronic, and mechanical functions performed by each component may be implemented with various known devices or mechanical elements such as electronic circuits, integrated circuits, and ASICs (Application Specific Integrated Circuits), and the components may be implemented separately or with two or more integrated into one.

Meanwhile, the blocks of the accompanying block diagram and the steps of the flowchart may be interpreted as computer program instructions that are loaded into the processor or memory of equipment capable of data processing, such as a general-purpose computer, a special-purpose computer, a portable notebook computer, or a network computer, and that perform the specified functions. Because these computer program instructions may be stored in a memory provided in a computer device or in a computer-readable memory, the functions described in the blocks of the block diagram or the steps of the flowchart may also be produced as an article of manufacture containing instruction means that perform them. In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks or steps may be executed out of the stated order. For example, two blocks or steps shown in succession may be performed substantially simultaneously or in reverse order, and in some cases some blocks or steps may be omitted.

FIG. 1 is a diagram showing the structure of a phoneme-sequence-based sentence selection apparatus for speech synthesis training, used to build a speech synthesizer, according to an embodiment of the present invention.

Referring to FIG. 1, the sentence selection apparatus 110 for speech synthesis training according to the present invention includes a sentence storage unit 111, a phoneme sequence conversion unit 112, a unique phoneme generation unit 113, and a phoneme sequence pool generation unit 114.

A plurality of predetermined sentences are stored in the sentence storage unit 111. The sentences stored in the sentence storage unit 111 may be ordinary sentences of the kind used to acquire a speaker's utterance data when building a conventional speech synthesizer.

The phoneme sequence conversion unit 112 generates a phoneme sequence corresponding to each of the plurality of sentences by converting the characters in each sentence into phoneme units.

For example, if a particular sentence among the plurality of sentences is '안녕하세요.' ("Hello."), the phoneme sequence conversion unit 112 may generate the phoneme sequence 'ㅏㄴㄴㅕㅇㅎㅏㅅㅔㅛ' for that sentence. This example uses Korean for convenience of explanation, but the same approach also applies to English.
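The example above can be reproduced with a short sketch. It is only an illustration: it splits each Hangul syllable into its letters using Unicode code point arithmetic and drops the silent syllable-initial 'ㅇ', which happens to match the example output. The function name and the letter tables are assumptions made here; the apparatus itself is described as using a pronunciation dictionary rather than this letter-level decomposition.

```python
# Hangul compatibility letters, indexed the way Unicode composes syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ",
          "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ",
          "ㅋ", "ㅌ", "ㅍ", "ㅎ"]


def to_jamo_sequence(sentence: str) -> str:
    """Hypothetical helper: decompose Hangul syllables into a letter sequence."""
    out = []
    for ch in sentence:
        offset = ord(ch) - 0xAC00          # U+AC00 is the first Hangul syllable
        if not 0 <= offset < 11172:        # skip punctuation and non-Hangul text
            continue
        initial, rest = divmod(offset, 21 * 28)
        medial, final = divmod(rest, 28)
        if INITIALS[initial] != "ㅇ":      # syllable-initial 'ㅇ' is silent
            out.append(INITIALS[initial])
        out.append(MEDIALS[medial])
        out.append(FINALS[final])          # empty string when there is no final
    return "".join(out)


print(to_jamo_sequence("안녕하세요."))
```

A letter-level split like this ignores pronunciation changes across syllable boundaries, which is one reason a dictionary-based conversion is the more realistic choice.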

According to an embodiment of the present invention, the phoneme sequence conversion unit 112 may include, as a specific configuration for generating the phoneme sequence corresponding to each of the plurality of sentences, a pronunciation dictionary database 117, a hash value calculation unit 118, and a phoneme sequence conversion processing unit 119.

The pronunciation dictionary database 117 stores hash values for a plurality of predetermined words, each in correspondence with the phoneme sequence obtained by dividing that word into phoneme units.

For example, the pronunciation dictionary database 117 may store the hash value generated by applying the data for the word 'apple' as input to a predetermined hash function, in correspondence with the phoneme sequence for the word 'apple'.

With this in mind, the pronunciation dictionary database 117 may store data as shown in Table 1 below.

Table 1
Hash value for word | Phoneme sequence
Hash value 1 | Phoneme sequence 1
Hash value 2 | Phoneme sequence 2
Hash value 3 | Phoneme sequence 3
... | ...

The hash value calculation unit 118 extracts a plurality of first words that make up the plurality of sentences and applies the data of each first word as input to the predetermined hash function to calculate a hash value for each of the first words.

The phoneme sequence conversion processing unit 119 searches the pronunciation dictionary database 117 using the hash value of each of the first words and converts each first word into the phoneme sequence stored in correspondence with the matching hash value, thereby generating the phoneme sequence corresponding to each of the plurality of sentences.

In this regard, if the word 'apple' appears in a particular sentence among the plurality of sentences, the phoneme sequence conversion processing unit 119 may apply the data for the word 'apple' as input to the hash function to calculate a hash value, and then extract the phoneme sequence stored in correspondence with that hash value in the pronunciation dictionary database 117, thereby converting the word 'apple' into the extracted phoneme sequence. In this manner, the phoneme sequence conversion processing unit 119 may generate a phoneme sequence for each of the plurality of sentences.
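The hash-keyed lookup described above might be sketched as follows. The text does not name the hash function, the tokenization rule, or the phoneme notation, so SHA-256, whitespace splitting, and the ARPAbet-style phoneme symbols below are all assumptions chosen for illustration, as are the function and variable names.

```python
import hashlib


def word_hash(word: str) -> str:
    """Assumed hash function: SHA-256 of the UTF-8 bytes of the word."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


# Illustrative entries only; a real dictionary would hold many more words.
PRONUNCIATIONS = {
    "apple": ["AE", "P", "AH", "L"],
    "pie": ["P", "AY"],
}

# The database maps hash value -> phoneme sequence, as in Table 1.
PRONUNCIATION_DB = {word_hash(w): ph for w, ph in PRONUNCIATIONS.items()}


def sentence_to_phonemes(sentence: str) -> list[str]:
    """Convert a sentence to a phoneme sequence via hash lookups per word."""
    phonemes: list[str] = []
    for token in sentence.lower().split():
        word = token.strip(".,!?")
        # Words missing from the dictionary are skipped in this sketch.
        phonemes.extend(PRONUNCIATION_DB.get(word_hash(word), []))
    return phonemes


print(sentence_to_phonemes("Apple pie."))
```

Keying the table by hash value rather than by the word itself gives fixed-length keys, at the cost of having to recompute the hash for every word at lookup time.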

According to an embodiment of the present invention, once the phoneme sequence for each of the plurality of sentences has been generated from the pronunciation dictionary database 117, the phoneme sequence conversion unit 112 may additionally post-process each phoneme sequence, taking the context of each sentence into account, into a phoneme sequence that reflects context-dependent sounds.

The unique phoneme generation unit 113 generates a plurality of unique phonemes by removing duplicate phonemes from among the phonemes included in the phoneme sequences of the plurality of sentences.

In this regard, if the number of sentences is 10, the phoneme sequence conversion unit 112 may generate 10 phoneme sequences, one for each of the 10 sentences. The unique phoneme generation unit 113 may then generate the plurality of unique phonemes by removing duplicate phonemes from among the phonemes included in those 10 phoneme sequences.

For example, if the plurality of phonemes consists of the 10 phonemes 'ㅏㄴㄴㅕㅇㅎㅏㅅㅔㅛ', the unique phoneme generation unit 113 may remove the duplicated phonemes 'ㄴ' and 'ㅏ' from those 10 phonemes to generate the 8 unique phonemes 'ㅏㄴㅕㅇㅎㅅㅔㅛ'.
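The deduplication in this example can be sketched in a few lines. The function name is hypothetical, and keeping the first occurrence of each phoneme in order is a choice made here so that the output matches the example; the text only requires that duplicates be removed.

```python
def unique_phonemes(sequences):
    """Return each phoneme once, in order of first appearance across sequences."""
    # dict.fromkeys keeps insertion order and discards repeated keys.
    return list(dict.fromkeys(p for seq in sequences for p in seq))


inventory = unique_phonemes(["ㅏㄴㄴㅕㅇㅎㅏㅅㅔㅛ"])
print("".join(inventory), len(inventory))
```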

The phoneme sequence pool generation unit 114 temporarily stores the phoneme sequence of each of the plurality of sentences in a phoneme sequence pool one at a time and calculates a dissimilarity based on the Kullback-Leibler divergence (KLD) between a first probability distribution, which gives the proportion at which each unique phoneme occurs across the phoneme sequences of all of the plurality of sentences, and a second probability distribution, which gives the proportion at which each unique phoneme occurs in the phoneme sequence of the sentence stored in the pool. It then finally stores in the pool the phoneme sequence of a first sentence, the sentence among the plurality of sentences for which the calculated dissimilarity is smallest.

For example, if the number of sentences is 10, the phoneme sequence pool generation unit 114 may temporarily store the first of the 10 sentences in the phoneme sequence pool and calculate the KLD-based dissimilarity between the first probability distribution, which gives the proportion at which each unique phoneme occurs across all 10 phoneme sequences, and the second probability distribution, which gives the proportion at which each unique phoneme occurs in the phoneme sequence of that first sentence stored in the pool. In this manner, the phoneme sequence pool generation unit 114 may temporarily store each of the 10 sentences in the pool one at a time and calculate the dissimilarity for each sentence. The phoneme sequence pool generation unit 114 may then select the phoneme sequence of the sentence with the smallest dissimilarity among the 10 sentences and finally store it in the pool.

According to an embodiment of the present invention, the dissimilarity may be calculated according to Equation 1 below.

Here, S is the dissimilarity, I is the total number of unique phonemes, and i is an index denoting the i-th unique phoneme among the plurality of unique phonemes. P_X(i) is the probability mass function of the first probability distribution and denotes the proportion at which the i-th unique phoneme occurs across the phoneme sequences of all of the plurality of sentences. P_Y(i) is the probability mass function of the second probability distribution and denotes the proportion at which the i-th unique phoneme occurs in the phoneme sequence of the sentence stored in the phoneme sequence pool.
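Equation 1 itself is not reproduced in this text. The sketch below assumes the standard Kullback-Leibler divergence of the first distribution from the second, S = sum over i = 1..I of P_X(i) * log(P_X(i) / P_Y(i)), which is consistent with the symbol definitions above. The direction of the divergence, the small floor `eps` applied when a phoneme is absent from the pool, and the function names are assumptions for illustration, not details confirmed by the patent.

```python
import math
from collections import Counter


def distribution(phonemes, inventory):
    """Relative frequency of each inventory phoneme within `phonemes`."""
    counts = Counter(phonemes)
    total = max(len(phonemes), 1)  # avoid division by zero on empty input
    return [counts[p] / total for p in inventory]


def dissimilarity(p_x, p_y, eps=1e-12):
    """Assumed form of S: KL divergence of P_X from P_Y.

    Terms with P_X(i) == 0 contribute nothing; P_Y(i) is floored at `eps`
    (an assumption) so a phoneme missing from the pool gives a large but
    finite penalty instead of a division by zero.
    """
    return sum(
        px * math.log(px / max(py, eps))
        for px, py in zip(p_x, p_y)
        if px > 0
    )


inventory = ["a", "b"]
p_x = distribution("aabb", inventory)   # whole corpus
p_y = distribution("aaab", inventory)   # candidate pool
print(dissimilarity(p_x, p_y))
```

With this form, a pool whose phoneme proportions match the corpus exactly scores 0, and the score grows as the pool drifts away from the corpus distribution.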

According to an embodiment of the present invention, the sentence selection apparatus 110 for speech synthesis training may further include a phoneme sequence pool construction unit 115 and a sentence selection unit 116.

After the first sentence has been finally stored in the phoneme sequence pool, the phoneme sequence pool construction unit 115 completes construction of the pool by repeating the following process a preset number of times. It additionally and temporarily stores in the pool, one at a time, the phoneme sequence of each of the remaining sentences, that is, the plurality of sentences excluding those already stored in the pool. For each, it calculates the KLD-based dissimilarity between the first probability distribution, which gives the proportion at which each unique phoneme occurs across the phoneme sequences of all of the plurality of sentences, and the second probability distribution, which gives the proportion at which each unique phoneme occurs in the phoneme sequences of the sentences stored in the pool. It then additionally and finally stores in the pool the phoneme sequence of the sentence with the smallest dissimilarity among the remaining sentences.

In this regard, suppose that the number of sentences is 10 and that the phoneme sequence pool generation unit 114 has finally stored the third sentence in the phoneme sequence pool. The phoneme sequence pool construction unit 115 may then additionally and temporarily store in the pool the first of the remaining 9 sentences, that is, the 10 sentences excluding the third, and calculate the KLD-based dissimilarity between the first probability distribution, which gives the proportion at which each unique phoneme occurs across all 10 phoneme sequences, and the second probability distribution, which gives the proportion at which each unique phoneme occurs in the phoneme sequences of the first and third sentences stored in the pool. In this manner, the phoneme sequence pool construction unit 115 may additionally and temporarily store each of the 9 sentences in the pool one at a time, calculate the dissimilarity for each, and then select the phoneme sequence of the sentence with the smallest dissimilarity among the 9 sentences and additionally and finally store it in the pool.

The phoneme sequence pool construction unit 115 may complete construction of the phoneme sequence pool by repeating this process of additionally and finally storing the phoneme sequence of the sentence with the smallest dissimilarity a preset number of times.

In this regard, if the preset number of repetitions is 5, then once the phoneme sequence pool generation unit 114 has finally stored the phoneme sequence of the first sentence in the pool, the phoneme sequence pool construction unit 115 may additionally and temporarily store the remaining sentences in the pool one at a time and additionally and finally store the phoneme sequence of the sentence with the smallest dissimilarity, repeating this process 5 times in total so that the pool ends up holding the phoneme sequences of 6 sentences.
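The pool construction described above is a greedy selection, and it can be sketched as below. This is a minimal illustration under the same assumptions as before: the dissimilarity is taken to be the KL divergence of the corpus distribution from the pool distribution with a small floor `eps`, and all function and parameter names are hypothetical. Ties are broken in favor of the earlier sentence, which the text does not specify.

```python
import math
from collections import Counter


def distribution(phonemes, inventory):
    counts = Counter(phonemes)
    total = max(len(phonemes), 1)
    return [counts[p] / total for p in inventory]


def dissimilarity(p_x, p_y, eps=1e-12):
    # Assumed form: KL divergence of P_X from P_Y, with P_Y floored at eps.
    return sum(px * math.log(px / max(py, eps))
               for px, py in zip(p_x, p_y) if px > 0)


def greedy_select(sequences, repetitions):
    """Return indices of selected sentences: 1 initial pick + `repetitions` more."""
    inventory = list(dict.fromkeys(p for seq in sequences for p in seq))
    p_x = distribution([p for seq in sequences for p in seq], inventory)

    pool = []                                  # finally stored sentence indices
    remaining = list(range(len(sequences)))
    for _ in range(min(repetitions + 1, len(sequences))):
        def score(idx):
            # Temporarily add the candidate to the pool and measure the result.
            trial = [p for j in pool + [idx] for p in sequences[j]]
            return dissimilarity(p_x, distribution(trial, inventory))

        best = min(remaining, key=score)       # smallest dissimilarity wins
        pool.append(best)
        remaining.remove(best)
    return pool


corpus = ["aaab", "ab", "bbbb", "aaaa"]        # toy phoneme sequences
print(greedy_select(corpus, repetitions=2))
```

Each round rescans every remaining sentence, so the cost grows with the number of sentences times the number of rounds; for a large corpus, updating the pool counts incrementally would avoid rebuilding the trial sequence each time.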

Once construction of the phoneme sequence pool is complete, the sentence selection unit 116 selects, from among the plurality of sentences, the sentences corresponding to the phoneme sequences finally stored in the pool as the sentences for speech synthesis training.

According to an embodiment of the present invention, the sentence selection apparatus 110 for speech synthesis training may further include an importance score calculation unit 120, a storage processing unit 121, and an information display unit 123.

Once the sentences for speech synthesis training have been selected, the importance score calculation unit 120 assigns each of them a rank number in the reverse of the order in which its phoneme sequence was finally stored in the phoneme sequence pool, and calculates an importance score for each sentence by multiplying its assigned rank number by a predetermined reference score.

For example, suppose that five speech synthesis training sentences have been selected and that the predetermined reference score is 10 points. The importance score calculation unit 120 may then assign a sequence number to each of the five sentences in the reverse of the final storage order in the phoneme sequence pool, and calculate the importance score of each sentence by multiplying its assigned sequence number by the reference score of 10 points.

In this case, among the five speech synthesis training sentences, the sentence whose phoneme sequence was stored first in the phoneme sequence pool may be assigned the sequence number 5, so that an importance score of 50 points is calculated for that sentence.
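The scoring rule in this example can be sketched as follows. The function name `importance_scores` and the sentence labels are hypothetical; only the rule itself (reverse storage order multiplied by the reference score) comes from the description.

```python
def importance_scores(stored_order, reference_score):
    """Map each sentence to (reverse storage rank) * reference_score."""
    n = len(stored_order)
    return {sentence: (n - position) * reference_score
            for position, sentence in enumerate(stored_order)}


# Five sentences listed in the order they were finally stored in the pool.
scores = importance_scores(["s1", "s2", "s3", "s4", "s5"], reference_score=10)
```

Here the first stored sentence `s1` receives sequence number 5 and therefore 50 points, and the last stored sentence `s5` receives sequence number 1 and 10 points.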

The storage processing unit 121 stores the speech synthesis training sentences and the importance score for each sentence in the training sentence storage unit 122 in correspondence with each other.

When the user issues an information confirmation command for a first speech synthesis training sentence among the speech synthesis training sentences, the information display unit 123 extracts the first speech synthesis training sentence and its importance score from the training sentence storage unit 122 and displays them on the screen.

Through this, when building speech data for speech synthesis, the user can visually check the importance of each speech synthesis training sentence and take greater care when uttering sentences of high importance.

FIG. 2 is a flowchart illustrating an operating method of a phoneme-sequence-based sentence selection device for speech synthesis training, for constructing a speech synthesizer, according to an embodiment of the present invention.

In step S210, a sentence storage unit in which a plurality of predetermined sentences are stored is maintained.

In step S220, a phoneme sequence corresponding to each of the plurality of sentences is generated by converting the characters included in each sentence into phoneme units.

In step S230, a plurality of unique phonemes is generated by removing duplicate phonemes from among the phonemes included in the phoneme sequences for the plurality of sentences.

In step S240, the phoneme sequence for each of the plurality of sentences is temporarily stored in a phoneme sequence pool, one at a time. For each, a dissimilarity based on the Kullback-Leibler divergence (KLD) is calculated between a first probability distribution, which describes the proportion in which each of the plurality of unique phonemes occurs across the phoneme sequences of all of the plurality of sentences, and a second probability distribution, which describes the proportion in which each of the unique phonemes occurs in the phoneme sequence of the sentence stored in the phoneme sequence pool. The phoneme sequence for the first sentence, for which the dissimilarity is the minimum among the plurality of sentences, is then finally stored in the phoneme sequence pool.

According to an embodiment of the present invention, the operating method of the sentence selection device for speech synthesis training may further include the following two steps. First, after the phoneme sequence for the first sentence is finally stored in the phoneme sequence pool, the construction of the pool is completed by repeating the following process a preset number of times: the phoneme sequence for each of the remaining sentences, excluding those already stored in the pool, is additionally and temporarily stored in the pool one at a time; the KLD-based dissimilarity is calculated between the first probability distribution (the proportion in which each unique phoneme occurs across the phoneme sequences of all of the plurality of sentences) and the second probability distribution (the proportion in which each unique phoneme occurs in the phoneme sequences of the sentences stored in the pool); and the phoneme sequence for the sentence with the minimum dissimilarity among the remaining sentences is additionally and finally stored in the pool. Second, when the construction of the phoneme sequence pool is completed, the sentences corresponding to the phoneme sequences finally stored in the pool are selected, from among the plurality of sentences, as the sentences for speech synthesis training.

Also, according to an embodiment of the present invention, the dissimilarity may be calculated according to Equation 1 above.
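For reference, the Kullback-Leibler divergence of the first probability distribution relative to the second, written with the symbols this description uses (S for the dissimilarity, I for the number of unique phonemes, P_X and P_Y for the two probability mass functions), takes the following standard form. Whether Equation 1 uses exactly this direction, a particular logarithm base, or any smoothing of zero probabilities is an assumption here and is not confirmed by this passage.

$$
S = \sum_{i=1}^{I} P_X(i)\,\log\frac{P_X(i)}{P_Y(i)}
$$

Under this form, S is zero when the two distributions coincide and grows as the phoneme proportions in the pool diverge from those of the full sentence set.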

Also, according to an embodiment of the present invention, step S220 may include: maintaining a pronunciation dictionary database in which hash values for a plurality of predetermined words are stored in correspondence with phoneme sequences obtained by dividing each of those words into phoneme units; extracting a plurality of first words constituting the plurality of sentences and calculating a hash value for each of the first words by applying the data constituting each word as input to a predetermined hash function; and generating a phoneme sequence corresponding to each of the plurality of sentences by searching the pronunciation dictionary database with the hash value of each first word and converting each first word into the phoneme sequence stored in correspondence with the identical hash value.
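The hash-based dictionary lookup in step S220 can be sketched as follows. The description only requires "a predetermined hash function"; the use of SHA-256, whitespace tokenization, the toy English entries in `PRONUNCIATIONS`, and all function names are assumptions made for illustration.

```python
import hashlib


def word_hash(word):
    """Hash the bytes of a word; SHA-256 is an assumed choice of hash function."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


# Hypothetical pronunciation entries; a real dictionary would be far larger.
PRONUNCIATIONS = {
    "speech": ["S", "P", "IY", "CH"],
    "test": ["T", "EH", "S", "T"],
}

# Pronunciation dictionary database: hash value -> phoneme sequence.
DICTIONARY_DB = {word_hash(w): seq for w, seq in PRONUNCIATIONS.items()}


def sentence_to_phonemes(sentence):
    """Convert a sentence to one phoneme sequence via hash lookups per word."""
    phonemes = []
    for word in sentence.lower().split():  # whitespace tokenization is an assumption
        entry = DICTIONARY_DB.get(word_hash(word))
        if entry is None:
            raise KeyError(f"no pronunciation entry for {word!r}")
        phonemes.extend(entry)
    return phonemes


phoneme_sequence = sentence_to_phonemes("speech test")
```

Keying the database by hash value rather than by the word itself gives fixed-length keys, which is one plausible motivation for the design. The description does not say how words missing from the dictionary are handled, so the sketch simply raises an error.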

According to an embodiment of the present invention, the operating method of the sentence selection device for speech synthesis training may further include: when the selection of the speech synthesis training sentences is completed, assigning each of the speech synthesis training sentences a sequence number in the reverse of the final storage order in the phoneme sequence pool, and calculating an importance score for each sentence by multiplying its assigned sequence number by a predetermined reference score; storing the speech synthesis training sentences and the importance score for each sentence in a training sentence storage unit in correspondence with each other; and, when the user issues an information confirmation command for a first speech synthesis training sentence among the speech synthesis training sentences, extracting the first speech synthesis training sentence and its importance score from the training sentence storage unit and displaying them on the screen.

The operating method of the phoneme-sequence-based sentence selection device for speech synthesis training, for constructing a speech synthesizer, according to an embodiment of the present invention has been described above with reference to FIG. 2. Because this operating method corresponds to the configuration and operation of the sentence selection device 110 described with reference to FIG. 1, a more detailed description is omitted.

The operating method of the phoneme-sequence-based sentence selection device for speech synthesis training according to an embodiment of the present invention may be implemented as a computer program stored in a storage medium for execution in combination with a computer.

In addition, the operating method of the phoneme-sequence-based sentence selection device for speech synthesis training according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, the present invention has been explained with reference to specific matters such as particular components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the described embodiments. Not only the claims set out below, but also everything that is equal or equivalent to those claims, falls within the scope of the spirit of the present invention.

110: phoneme-sequence-based sentence selection device for speech synthesis training, for constructing a speech synthesizer
111: sentence storage unit 112: phoneme sequence conversion unit
113: unique phoneme generation unit 114: phoneme sequence pool generation unit
115: phoneme sequence pool construction unit 116: sentence selection unit
117: pronunciation dictionary database 118: hash value calculation unit
119: phoneme sequence conversion processing unit 120: importance score calculation unit
121: storage processing unit 122: training sentence storage unit
123: information display unit

Claims

A phoneme-sequence-based sentence selection device for speech synthesis training, for constructing a speech synthesizer, comprising:
a sentence storage unit in which a plurality of predetermined sentences are stored;
a phoneme sequence conversion unit that generates a phoneme sequence corresponding to each of the plurality of sentences by converting characters included in each of the plurality of sentences into phoneme units;
a unique phoneme generation unit that generates a plurality of unique phonemes by removing duplicate phonemes from among the phonemes included in the phoneme sequences for the plurality of sentences; and
a phoneme sequence pool generation unit that, for each of the plurality of sentences, temporarily stores the phoneme sequence for that sentence in a phoneme sequence pool and calculates, as the dissimilarity for the sentence stored in the phoneme sequence pool, a dissimilarity based on the Kullback-Leibler divergence (KLD) between a first probability distribution for the proportion in which each of the plurality of unique phonemes occurs across the phoneme sequences of all of the plurality of sentences and a second probability distribution for the proportion in which each of the plurality of unique phonemes occurs in the phoneme sequence of the sentence stored in the phoneme sequence pool, and that then finally stores in the phoneme sequence pool the phoneme sequence for a first sentence for which the dissimilarity is calculated to be the minimum among the plurality of sentences.

The device of claim 1, further comprising:
a phoneme sequence pool construction unit that, after the phoneme sequence for the first sentence is finally stored in the phoneme sequence pool, completes the construction of the phoneme sequence pool by repeating the following process a preset number of times: for each of the remaining sentences among the plurality of sentences, excluding the sentences finally stored in the phoneme sequence pool, additionally and temporarily storing in the phoneme sequence pool the phoneme sequence for a second sentence that is any one of the remaining sentences, and calculating, as the dissimilarity for the second sentence, the KLD-based dissimilarity between the first probability distribution for the proportion in which each of the plurality of unique phonemes occurs across the phoneme sequences of all of the plurality of sentences and the second probability distribution for the proportion in which each of the plurality of unique phonemes occurs in the phoneme sequences of the sentences stored in the phoneme sequence pool; and then additionally and finally storing in the phoneme sequence pool the phoneme sequence for the sentence for which the dissimilarity is calculated to be the minimum among the remaining sentences; and
a sentence selection unit that, when the construction of the phoneme sequence pool is completed, selects, from among the plurality of sentences, the sentences corresponding to the phoneme sequences finally stored in the phoneme sequence pool as sentences for speech synthesis training.

The device of claim 1,
wherein the dissimilarity is calculated according to Equation 1 below.
[Equation 1]

Here, S is the dissimilarity, I is the total number of the plurality of unique phonemes, i is an index indicating the i-th unique phoneme among the plurality of unique phonemes, P_X(i) is the probability mass function of the first probability distribution, and P_Y(i) is the probability mass function of the second probability distribution.

The device of claim 1,
wherein the phoneme sequence conversion unit comprises:
a pronunciation dictionary database in which hash values for a plurality of predetermined words and phoneme sequences obtained by dividing each of the plurality of words into phoneme units are stored in correspondence with each other;
a hash value calculation unit that extracts a plurality of first words constituting the plurality of sentences and calculates a hash value for each of the plurality of first words by applying the data constituting the plurality of first words as input to a predetermined hash function; and
a phoneme sequence conversion processing unit that generates a phoneme sequence corresponding to each of the plurality of sentences by searching the pronunciation dictionary database on the basis of the hash value for each of the plurality of first words and converting each of the plurality of first words into the phoneme sequence stored in correspondence with a hash value identical to the hash value for that word.

The device of claim 2, further comprising:
an importance score calculation unit that, when the selection of the speech synthesis training sentences is completed, assigns each of the speech synthesis training sentences a sequence number in the reverse of the final storage order in the phoneme sequence pool, and calculates an importance score for each of the speech synthesis training sentences by multiplying the sequence number assigned to that sentence by a predetermined reference score;
a storage processing unit that stores the speech synthesis training sentences and the importance score for each speech synthesis training sentence in a training sentence storage unit in correspondence with each other; and
an information display unit that, when an information confirmation command for a first speech synthesis training sentence among the speech synthesis training sentences is received from a user, extracts the first speech synthesis training sentence and the importance score for the first speech synthesis training sentence from the training sentence storage unit and displays them on a screen.

An operating method of a phoneme-sequence-based sentence selection device for speech synthesis training, for constructing a speech synthesizer, the method comprising:
maintaining a sentence storage unit in which a plurality of predetermined sentences are stored;
generating a phoneme sequence corresponding to each of the plurality of sentences by converting characters included in each of the plurality of sentences into phoneme units;
generating a plurality of unique phonemes by removing duplicate phonemes from among the phonemes included in the phoneme sequences for the plurality of sentences; and
for each of the plurality of sentences, temporarily storing the phoneme sequence for that sentence in a phoneme sequence pool and calculating, as the dissimilarity for the sentence stored in the phoneme sequence pool, a dissimilarity based on the Kullback-Leibler divergence (KLD) between a first probability distribution for the proportion in which each of the plurality of unique phonemes occurs across the phoneme sequences of all of the plurality of sentences and a second probability distribution for the proportion in which each of the plurality of unique phonemes occurs in the phoneme sequence of the sentence stored in the phoneme sequence pool, and then finally storing in the phoneme sequence pool the phoneme sequence for a first sentence for which the dissimilarity is calculated to be the minimum among the plurality of sentences.

The method of claim 6, further comprising:
after the phoneme sequence for the first sentence is finally stored in the phoneme sequence pool, completing the construction of the phoneme sequence pool by repeating the following process a preset number of times: for each of the remaining sentences among the plurality of sentences, excluding the sentences finally stored in the phoneme sequence pool, additionally and temporarily storing in the phoneme sequence pool the phoneme sequence for a second sentence that is any one of the remaining sentences, and calculating, as the dissimilarity for the second sentence, the KLD-based dissimilarity between the first probability distribution for the proportion in which each of the plurality of unique phonemes occurs across the phoneme sequences of all of the plurality of sentences and the second probability distribution for the proportion in which each of the plurality of unique phonemes occurs in the phoneme sequences of the sentences stored in the phoneme sequence pool; and then additionally and finally storing in the phoneme sequence pool the phoneme sequence for the sentence for which the dissimilarity is calculated to be the minimum among the remaining sentences; and
when the construction of the phoneme sequence pool is completed, selecting, from among the plurality of sentences, the sentences corresponding to the phoneme sequences finally stored in the phoneme sequence pool as sentences for speech synthesis training.

The method of claim 6,
wherein the dissimilarity is calculated according to Equation 1 below.
[Equation 1]

Here, S is the dissimilarity, I is the total number of the plurality of unique phonemes, i is an index indicating the i-th unique phoneme among the plurality of unique phonemes, P_X(i) is the probability mass function of the first probability distribution, and P_Y(i) is the probability mass function of the second probability distribution.

The method of claim 6,
wherein generating the phoneme sequence comprises:
maintaining a pronunciation dictionary database in which hash values for a plurality of predetermined words and phoneme sequences obtained by dividing each of the plurality of words into phoneme units are stored in correspondence with each other;
extracting a plurality of first words constituting the plurality of sentences and calculating a hash value for each of the plurality of first words by applying the data constituting the plurality of first words as input to a predetermined hash function; and
generating a phoneme sequence corresponding to each of the plurality of sentences by searching the pronunciation dictionary database on the basis of the hash value for each of the plurality of first words and converting each of the plurality of first words into the phoneme sequence stored in correspondence with a hash value identical to the hash value for that word.

The method of claim 7, further comprising:
when the selection of the speech synthesis training sentences is completed, assigning each of the speech synthesis training sentences a sequence number in the reverse of the final storage order in the phoneme sequence pool, and calculating an importance score for each of the speech synthesis training sentences by multiplying the sequence number assigned to that sentence by a predetermined reference score;
storing the speech synthesis training sentences and the importance score for each speech synthesis training sentence in a training sentence storage unit in correspondence with each other; and
when an information confirmation command for a first speech synthesis training sentence among the speech synthesis training sentences is received from a user, extracting the first speech synthesis training sentence and the importance score for the first speech synthesis training sentence from the training sentence storage unit and displaying them on a screen.

A computer-readable recording medium recording a computer program for executing the method of any one of claims 6 to 10 through combination with a computer.

A computer program stored in a storage medium for executing the method of any one of claims 6 to 10 through combination with a computer.