CN118522302A - Voice recognition voice and word synchronization method and device - Google Patents
- Publication number
- CN118522302A (application No. CN202410742486.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- content
- time
- text
- text content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
The present invention relates to a method for audio-text synchronization in speech recognition, comprising: obtaining target speech information; performing speech recognition on the target speech information to obtain speech recognition content, the speech recognition content including speech content and text content; determining, within the speech content, the start time and end time corresponding to each character of the text content, taking the start time of the first character as the initial time for displaying the speech content, and on that basis determining the relative time positions of all characters when the speech content is displayed; taking the start time of the first character of the text content as the initial time for displaying the text content, and on that basis determining the relative time positions of all characters when the text content is displayed; and displaying the speech content and the text content synchronously according to the relative time positions so determined. The present invention ensures audio-text synchronization in speech recognition in an intelligent manner.
Description
Technical Field
The present invention relates to the technical field of speech recognition, and in particular to a method and device for audio-text synchronization in speech recognition.
Background Art
Speech recognition, also known as automatic speech recognition (ASR), is a technology that converts spoken language into text by means of computer programs; it is widely used in many intelligent scenarios.
In some intelligent scenarios, the text produced by speech recognition must be displayed, and during display the pace at which the text appears should match the speech being played as closely as possible, to give users a comfortable sensory experience. Most current solutions arrange this synchronization manually; a few produce a rough arrangement with an intelligent network and then rely on fine-grained manual review and adjustment. These approaches impose a heavy workload on staff, and differences in individual experience can make the final results uneven.
Summary of the Invention
The purpose of the present invention is to remedy at least one of the deficiencies of the prior art by providing a method and device for audio-text synchronization in speech recognition.
To achieve the above object, the present invention adopts the following technical solutions:
Specifically, a method for audio-text synchronization in speech recognition is proposed, comprising the following steps:
obtaining target speech information;
performing speech recognition on the target speech information to obtain speech recognition content, the speech recognition content including speech content and text content;
determining, within the speech content, the start time and end time corresponding to each character of the text content, taking the start time of the first character as the initial time for displaying the speech content, and on that basis determining the relative time positions of all characters when the speech content is displayed;
taking the start time of the first character of the text content as the initial time for displaying the text content, and on that basis determining the relative time positions of all characters when the text content is displayed;
displaying the speech content and the text content synchronously according to the relative time positions determined in the above manner.
Further, specifically, speech recognition is performed on the target speech information by an HMM speech recognition model to obtain the text content; the speech content refers to the speech information obtained after the original target speech information undergoes a preprocessing operation.
Further, specifically, obtaining the speech information from the original target speech information through the preprocessing operation includes:
pre-emphasizing the original target speech information by the following formula to obtain first speech information:
Bₙ = Aₙ − a·Aₙ₋₁,
where Bₙ denotes the first speech information, Aₙ denotes the original target speech information, and a is a preset constant with a value in the range [0.95, 1.00];
windowing the first speech information by the following formula to obtain second speech information:
Cₙ = Bₙ·W(n),
where Cₙ denotes the second speech information and W(n) is a window function;
performing endpoint detection on the second speech information by a short-time-energy criterion to distinguish unvoiced segments from voiced segments and thereby obtain a plurality of speech segments.
Further, specifically, determining within the speech content the start time and end time corresponding to each character of the text content includes:
for each speech segment,
obtaining a feature vector of the speech segment by a feature extraction algorithm;
recognizing the speech segment with a speech recognition model, based on its feature vector, to obtain the character corresponding to the speech segment;
taking the start time and end time of each speech segment as the start time and end time of its corresponding character, thereby determining the start time and end time corresponding to each character of the text content.
Further, specifically, obtaining the feature vector of a speech segment by the feature extraction algorithm includes:
processing each speech segment by Fourier transform to obtain the time-domain signal c(n) of each speech segment;
zero-padding the end of the time-domain signal c(n) to obtain a sequence x(n) of length N, and processing the sequence by the discrete Fourier transform to obtain the linear spectrum X(k), calculated as follows:
X(k) = Σ_{n=0}^{N−1} x(n)·e^{−j2πnk/N}, 0 ≤ k ≤ N−1;
passing the linear spectrum X(k) through a MEL band-pass filterbank to obtain the MEL spectrum, and converting the MEL spectrum to obtain the logarithmic spectrum Q(m):
Q(m) = ln( Σ_{k=0}^{N−1} |X(k)|²·Hₘ(k) ), 0 ≤ m ≤ M,
where the transfer function Hₘ(k) of each band-pass filter in the MEL filterbank is the triangular response
Hₘ(k) = (k − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m); Hₘ(k) = (f(m+1) − k) / (f(m+1) − f(m)) for f(m) < k ≤ f(m+1); and Hₘ(k) = 0 elsewhere,
where 0 ≤ m ≤ M, M is the number of band-pass filters, and the center frequencies f(m) are spaced evenly on the mel scale between fl and fh;
here fh and fl denote the maximum and minimum frequencies of the band-pass filterbank respectively, Fs is the sampling frequency, Y is the window width of the window function used for the discrete Fourier transform, and x denotes the independent variable.
The MFCC coefficients d(n) are then obtained from Q(m) by the discrete cosine transform:
d(n) = Σ_{m=0}^{M−1} Q(m)·cos(πn(2m+1)/(2M)), 0 ≤ n < M.
Further, the method also includes, during synchronous audio-text display of speech recognition:
buffering the speech content and its corresponding text content in advance, i.e. obtaining speech content and corresponding text paragraphs pre-annotated by staff, and, while the current paragraph is displayed, preloading the content of the next paragraph.
Further, the method also includes, during synchronous audio-text display of speech recognition:
after the current paragraph has been displayed, synchronously acquiring real-time speech information for the current paragraph;
performing speech recognition on the real-time speech information to obtain real-time speech recognition content;
determining, within the real-time speech recognition content, the start time and end time of every character during speech content display;
averaging the start time and end time of each character during speech content display and recording the result as the speech timestamp of that character;
obtaining the start time and end time of each character during text content display;
averaging the start time and end time of each character during text content display and recording the result as the text timestamp of that character;
recording the difference between the speech timestamp and the text timestamp of each character as the synchronization tolerance parameter of that character;
traversing the synchronization tolerance parameters of all characters, finding the characters whose synchronization tolerance parameter exceeds a first threshold in absolute value, and counting them to obtain a number P;
if P is less than a second threshold, not adjusting the display speed of the next paragraph's text; if P is neither less than the second threshold nor greater than a third threshold, computing the average of all synchronization tolerance parameters, then raising the display speed of the next paragraph's text by a preset ratio if the average is positive, lowering it by the preset ratio if the average is negative, and leaving it unchanged if the average is exactly 0; and if P is greater than the third threshold, alerting the relevant staff.
The present invention also proposes a device for audio-text synchronization in speech recognition, comprising:
a data acquisition module for obtaining target speech information;
a speech recognition module for performing speech recognition on the target speech information to obtain speech recognition content, the speech recognition content including speech content and text content;
a first time determination module for determining, within the speech content, the start time and end time corresponding to each character of the text content, taking the start time of the first character as the initial time for displaying the speech content, and on that basis determining the relative time positions of all characters when the speech content is displayed;
a second time determination module for taking the start time of the first character of the text content as the initial time for displaying the text content, and on that basis determining the relative time positions of all characters when the text content is displayed;
a synchronous display module for displaying the speech content and the text content synchronously according to the relative time positions determined in the above manner.
The beneficial effects of the present invention are:
The method provided by the present invention obtains speech content and text content by performing speech recognition on target speech information, then synchronously annotates the speech content and the text content according to the relative time positions of the characters at display time, which ensures that the recognized content is displayed with audio and text in sync. In addition, during synchronous display, preloading and a feedback mechanism keep audio and text synchronized as far as possible.
Brief Description of the Drawings
The above and other features of the present disclosure will become more apparent from the detailed description of the embodiments shown in the accompanying drawings, in which the same reference numerals denote the same or similar parts. It will be evident that the drawings described below are only some embodiments of the present disclosure, and that those of ordinary skill in the art can derive other drawings from them without inventive effort. In the drawings:
FIG. 1 is a flow chart of a method for audio-text synchronization in speech recognition according to the present invention.
Detailed Description
The concept, specific structure, and technical effects of the present invention are described below clearly and completely with reference to the embodiments and drawings, so that its purpose, solutions, and effects may be fully understood. It should be noted that, where no conflict arises, the embodiments of this application and the features within them may be combined with one another. The same reference numerals used throughout the drawings indicate the same or similar parts.
Referring to FIG. 1, in Embodiment 1 the present invention proposes a method for audio-text synchronization in speech recognition, comprising the following steps:
Step 110: obtain target speech information;
Step 120: perform speech recognition on the target speech information to obtain speech recognition content, the speech recognition content including speech content and text content;
Step 130: determine, within the speech content, the start time and end time corresponding to each character of the text content, take the start time of the first character as the initial time for displaying the speech content, and on that basis determine the relative time positions of all characters when the speech content is displayed;
Step 140: take the start time of the first character of the text content as the initial time for displaying the text content, and on that basis determine the relative time positions of all characters when the text content is displayed;
Step 150: display the speech content and the text content synchronously according to the relative time positions determined in the above manner.
In Embodiment 1, speech content and text content are obtained by performing speech recognition on the target speech information, and the two are then synchronously annotated according to the relative time positions of the characters at display time, which ensures that the recognized content achieves audio-text synchronization.
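Steps 130 and 140 above amount to re-anchoring absolute timestamps to the first character's start time. A minimal sketch, assuming (character, start_ms, end_ms) tuples; the function name and tuple layout are illustrative, not from the patent:

```python
def relative_positions(char_times):
    """Shift absolute ASR character times so the first character starts at zero.

    char_times: list of (character, start_ms, end_ms) absolute times.
    Returns the same list expressed relative to the first start time.
    """
    if not char_times:
        return []
    t0 = char_times[0][1]  # start of the first character = display time zero
    return [(c, s - t0, e - t0) for c, s, e in char_times]
```

Running both the speech-content and the text-content character lists through the same function gives each its own relative timeline for synchronized display.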
As a preferred embodiment of the present invention, specifically, speech recognition is performed on the target speech information by an HMM speech recognition model to obtain the text content; the speech content refers to the speech information obtained after the original target speech information undergoes a preprocessing operation.
In this preferred embodiment, performing speech recognition with a mature HMM speech recognition model ensures that the recognized text content is accurate.
As a preferred embodiment of the present invention, specifically, obtaining the speech information from the original target speech information through the preprocessing operation includes:
pre-emphasizing the original target speech information by the following formula to obtain first speech information:
Bₙ = Aₙ − a·Aₙ₋₁,
where Bₙ denotes the first speech information, Aₙ denotes the original target speech information, and a is a preset constant with a value in the range [0.95, 1.00];
windowing the first speech information by the following formula to obtain second speech information:
Cₙ = Bₙ·W(n),
where Cₙ denotes the second speech information and W(n) is a window function;
performing endpoint detection on the second speech information by a short-time-energy criterion to distinguish unvoiced segments from voiced segments and thereby obtain a plurality of speech segments.
In this preferred embodiment, locating the speech segments in the above manner avoids omissions when locating each character of the text content.
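The preprocessing chain above (pre-emphasis, windowing, short-time-energy endpoint detection) can be sketched as follows. The patent does not reproduce the explicit form of W(n), so a Hamming window is assumed; the frame length and energy threshold are likewise illustrative:

```python
import numpy as np

def preprocess(signal, a=0.97, frame_len=256, energy_thresh=0.01):
    """Sketch of the preprocessing chain; W(n) is assumed to be a Hamming window."""
    # Pre-emphasis: B(n) = A(n) - a * A(n-1)
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Split into fixed-length frames and apply the window: C(n) = B(n) * W(n)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    windowed = frames * np.hamming(frame_len)
    # Short-time energy per frame; frames above the threshold are treated
    # as voiced for endpoint detection, the rest as unvoiced/silence
    energy = (windowed ** 2).sum(axis=1)
    voiced = energy > energy_thresh
    return windowed, voiced
```

Consecutive voiced frames would then be merged into the speech segments used by the later recognition steps.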
As a preferred embodiment of the present invention, specifically, determining within the speech content the start time and end time corresponding to each character of the text content includes:
for each speech segment,
obtaining a feature vector of the speech segment by a feature extraction algorithm;
recognizing the speech segment with a speech recognition model, based on its feature vector, to obtain the character corresponding to the speech segment;
taking the start time and end time of each speech segment as the start time and end time of its corresponding character, thereby determining the start time and end time corresponding to each character of the text content.
In this preferred embodiment, the process of determining within the speech content the start time and end time corresponding to each character of the text content specifically includes the following:
1. Speech segmentation: the speech recognition system first segments the audio into smaller speech fragments, usually determined from features such as changes in speech characteristics, silence detection, or the energy of the speech signal.
2. Feature extraction: for each speech fragment, a feature extraction algorithm converts it into feature vectors, typically including MFCCs (Mel-frequency cepstral coefficients), power spectral density, and so on. These feature vectors help identify the acoustic characteristics of each character.
3. Speech recognition model: next, a speech recognition model (such as a deep learning model) recognizes each fragment and converts it into text. This step usually produces a probability distribution indicating how likely each character is to appear in the fragment.
4. Timestamp generation: from the duration of each speech fragment and the output of the speech recognition model, the time at which each character appears can be estimated, usually computed from the fragment's start and end times and the relative position of each character.
5. Post-processing: after the timestamp of each character has been obtained, some post-processing may be needed to improve accuracy, possibly involving techniques such as time alignment and smoothing, to ensure that each character's timestamp is accurate.
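Step 4 (timestamp generation) can be illustrated with a minimal sketch. Spreading characters uniformly across a fragment is an assumption for the multi-character case; in the patent's simplest case each segment corresponds to a single character:

```python
def char_timestamps(seg_start, seg_end, chars):
    """Assign each character in `chars` an equal share of the fragment
    [seg_start, seg_end]. Uniform per-character duration is an assumption."""
    dur = (seg_end - seg_start) / len(chars)
    return [(c, seg_start + i * dur, seg_start + (i + 1) * dur)
            for i, c in enumerate(chars)]
```

A real aligner would weight the split by the model's per-character probabilities rather than splitting evenly.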
As a preferred embodiment of the present invention, specifically, obtaining the feature vector of a speech segment by the feature extraction algorithm includes:
processing each speech segment by Fourier transform to obtain the time-domain signal c(n) of each speech segment;
zero-padding the end of the time-domain signal c(n) to obtain a sequence x(n) of length N, and processing the sequence by the discrete Fourier transform to obtain the linear spectrum X(k), calculated as follows:
X(k) = Σ_{n=0}^{N−1} x(n)·e^{−j2πnk/N}, 0 ≤ k ≤ N−1;
passing the linear spectrum X(k) through a MEL band-pass filterbank to obtain the MEL spectrum, and converting the MEL spectrum to obtain the logarithmic spectrum Q(m):
Q(m) = ln( Σ_{k=0}^{N−1} |X(k)|²·Hₘ(k) ), 0 ≤ m ≤ M,
where the transfer function Hₘ(k) of each band-pass filter in the MEL filterbank is the triangular response
Hₘ(k) = (k − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m); Hₘ(k) = (f(m+1) − k) / (f(m+1) − f(m)) for f(m) < k ≤ f(m+1); and Hₘ(k) = 0 elsewhere,
where 0 ≤ m ≤ M, M is the number of band-pass filters, and the center frequencies f(m) are spaced evenly on the mel scale between fl and fh;
here fh and fl denote the maximum and minimum frequencies of the band-pass filterbank respectively, Fs is the sampling frequency, Y is the window width of the window function used for the discrete Fourier transform, and x denotes the independent variable.
The MFCC coefficients d(n) are then obtained from Q(m) by the discrete cosine transform:
d(n) = Σ_{m=0}^{M−1} Q(m)·cos(πn(2m+1)/(2M)), 0 ≤ n < M.
In this preferred embodiment, computing the MFCC coefficients in the above manner as the feature vector of a speech segment yields accurate MFCC coefficients for the subsequent recognition computations.
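A compact numeric sketch of this MFCC pipeline, assuming the standard triangular mel filterbank with fl = 0 and fh = Fs/2; the parameter defaults and function name are illustrative, not taken from the patent:

```python
import numpy as np

def mfcc(frame, fs=16000, n_fft=512, n_filters=26, n_coeffs=13):
    """Zero-padded DFT -> linear spectrum X(k) -> mel filterbank ->
    log spectrum Q(m) -> DCT -> MFCC coefficients d(n)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Zero-pad to N = n_fft points and take the power of the linear spectrum
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2

    # Triangular filters with centers spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, center):
            fbank[j - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[j - 1, k] = (right - k) / max(right - center, 1)

    # Log mel spectrum Q(m), then DCT to obtain the MFCC coefficients d(n)
    q = np.log(fbank @ power + 1e-10)
    n = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.sqrt(2.0 / n_filters) * np.cos(np.pi * n * (2 * m + 1) / (2 * n_filters))
    return dct @ q
```

In practice a library such as librosa or python_speech_features would replace this hand-rolled version.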
As a preferred embodiment of the present invention, the method also includes, during synchronous audio-text display of speech recognition:
buffering the speech content and its corresponding text content in advance, i.e. obtaining speech content and corresponding text paragraphs pre-annotated by staff, and, while the current paragraph is displayed, preloading the content of the next paragraph.
In this preferred embodiment, a certain amount of speech content and corresponding text content is buffered in advance, so that audio and text remain synchronized even under network or processing delays. This is achieved by preloading the next segment of speech and text, ensuring the next segment is ready before the current one finishes playing.
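A minimal sketch of this double-buffering scheme, with an assumed fetch callback standing in for however paragraph audio and text are actually retrieved:

```python
from collections import deque

class ParagraphBuffer:
    """Keep one paragraph ahead: while the current pre-annotated paragraph
    is displayed, the next (audio, text) pair is already fetched."""
    def __init__(self, fetch):
        self.fetch = fetch      # fetch(index) -> (audio, text) pair (assumed)
        self.buffer = deque()
    def start(self):
        self.buffer.append(self.fetch(0))  # current paragraph
        self.buffer.append(self.fetch(1))  # preloaded next paragraph
    def advance(self, next_index):
        self.buffer.popleft()                        # current paragraph finished
        self.buffer.append(self.fetch(next_index))   # stay one paragraph ahead
        return self.buffer[0]                        # new current paragraph
```

In a real player, `fetch` would run asynchronously so the preload never blocks playback.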
As a preferred embodiment of the present invention, the method also includes, during synchronous audio-text display of speech recognition:
after the current paragraph has been displayed, synchronously acquiring real-time speech information for the current paragraph;
performing speech recognition on the real-time speech information to obtain real-time speech recognition content;
determining, within the real-time speech recognition content, the start time and end time of every character during speech content display;
averaging the start time and end time of each character during speech content display and recording the result as the speech timestamp of that character;
obtaining the start time and end time of each character during text content display;
averaging the start time and end time of each character during text content display and recording the result as the text timestamp of that character;
recording the difference between the speech timestamp and the text timestamp of each character as the synchronization tolerance parameter of that character;
traversing the synchronization tolerance parameters of all characters, finding the characters whose synchronization tolerance parameter exceeds a first threshold in absolute value, and counting them to obtain a number P;
if P is less than a second threshold, not adjusting the display speed of the next paragraph's text; if P is neither less than the second threshold nor greater than a third threshold, computing the average of all synchronization tolerance parameters, then raising the display speed of the next paragraph's text by a preset ratio if the average is positive, lowering it by the preset ratio if the average is negative, and leaving it unchanged if the average is exactly 0; and if P is greater than the third threshold, alerting the relevant staff.
In this preferred embodiment, the display speed and timing of the text content are dynamically adjusted according to actual playback to keep it synchronized with the speech content. If the text is displayed too fast or too slow, it is corrected in real time through the above feedback mechanism; the first, second, and third thresholds are all set manually through prior experiments.
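The threshold logic above condenses into a small decision function. The parameter names and the multiplicative speed factor are illustrative assumptions; the patent specifies only the comparisons and the preset ratio:

```python
def adjust_speed(sync_params, t1, p2, p3, ratio):
    """sync_params: per-character (speech timestamp - text timestamp).
    t1/p2/p3 are the first/second/third thresholds, set experimentally.
    Returns a speed factor for the next paragraph's text, or "alert"."""
    p = sum(1 for s in sync_params if abs(s) > t1)   # count of out-of-sync characters
    if p < p2:
        return 1.0                       # few outliers: no adjustment
    if p <= p3:
        mean = sum(sync_params) / len(sync_params)
        if mean > 0:
            return 1.0 + ratio           # positive mean: raise display speed
        if mean < 0:
            return 1.0 - ratio           # negative mean: lower display speed
        return 1.0                       # mean exactly 0: no adjustment
    return "alert"                       # too many outliers: warn staff
```

The returned factor would multiply the next paragraph's text display rate.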
The present invention also provides a device for sound-word synchronization in speech recognition, comprising:
a data acquisition module, configured to acquire target voice information;
a speech recognition module, configured to perform speech recognition on the target voice information to obtain speech recognition content, the speech recognition content comprising voice content and text content;
a first time determination module, configured to determine, within the voice content, the start time and end time corresponding to each character of the text content, take the start time of the first character as the initial time for displaying the voice content, and on that basis determine the relative time positions of all characters when the voice content is displayed;
a second time determination module, configured to take the start time of the first character of the text content as the initial time for displaying the text content, and on that basis determine the relative time positions of all characters when the text content is displayed; and
a synchronous display module, configured to display the voice content and the text content in sound-word synchronization according to the relative time positions, determined as above, of all characters at display time.
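As a minimal sketch of what the two time-determination modules compute, the per-character timings produced by recognition can be shifted so that the first character's start time becomes time zero for display. The millisecond tuple shape is an assumption made for illustration, not a format specified by the patent.

```python
def relative_time_positions(char_timings):
    """char_timings: list of (char, start_ms, end_ms) tuples from recognition.

    Returns (char, rel_start_ms, rel_end_ms) tuples in which the start time
    of the first character is taken as the initial display time (t = 0).
    """
    if not char_timings:
        return []
    t0 = char_timings[0][1]  # start time of the first character
    return [(ch, start - t0, end - t0) for ch, start, end in char_timings]
```

Applying the same shift to both the voice-content timings and the text-content timings gives each character identical relative positions on both tracks, which is what allows the synchronous display module to align sound and text.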
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may be stored in a computer-readable storage medium. On the basis of this understanding, all or part of the processes of the method embodiments above may also be carried out by a computer program instructing the relevant hardware; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the respective method embodiments above. The computer program comprises computer program code, which may be in source-code form, object-code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.
Although the description of the present invention is quite detailed and several embodiments have been described in particular, the invention is not intended to be limited to any of these details or embodiments or to any particular embodiment; rather, the description should be read, by reference to the appended claims and in view of the prior art, as giving those claims the broadest possible interpretation, thereby effectively covering the intended scope of the invention. Furthermore, the invention has been described above in terms of embodiments foreseeable by the inventors in order to provide a useful description, and insubstantial modifications of the invention not presently foreseen may nonetheless represent equivalents thereof.
The foregoing is merely a preferred embodiment of the present invention; the invention is not limited to the implementation above, and any solution that achieves the technical effect of the invention by the same means falls within its scope of protection. Various modifications and variations of the technical solutions and/or implementations are possible within the scope of protection of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410742486.XA CN118522302B (en) | 2024-06-11 | 2024-06-11 | A method and device for speech recognition sound-word synchronization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118522302A true CN118522302A (en) | 2024-08-20 |
CN118522302B CN118522302B (en) | 2024-11-15 |
Family
ID=92281362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410742486.XA Active CN118522302B (en) | 2024-06-11 | 2024-06-11 | A method and device for speech recognition sound-word synchronization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118522302B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012129445A2 (en) * | 2011-03-23 | 2012-09-27 | Audible, Inc. | Managing playback of synchronized content |
CN108366182A (en) * | 2018-02-13 | 2018-08-03 | 京东方科技集团股份有限公司 | Text-to-speech synchronizes the calibration method reported and device, computer storage media |
CN111091810A (en) * | 2019-12-19 | 2020-05-01 | 佛山科学技术学院 | VR game character expression control method and storage medium based on voice information |
CN112616062A (en) * | 2020-12-11 | 2021-04-06 | 北京有竹居网络技术有限公司 | Subtitle display method and device, electronic equipment and storage medium |
CN113179444A (en) * | 2021-04-20 | 2021-07-27 | 浙江工业大学 | Voice recognition-based phonetic character synchronization method |
CN115248841A (en) * | 2021-04-27 | 2022-10-28 | 华为技术有限公司 | Method and device for simultaneous broadcast of text and speech |
CN116017011A (en) * | 2021-10-22 | 2023-04-25 | 成都极米科技股份有限公司 | Subtitle synchronization method, playing device and readable storage medium for audio and video |
CN117939223A (en) * | 2023-12-11 | 2024-04-26 | 海信视像科技股份有限公司 | Multi-device playing progress synchronization method, device and storage medium |
Non-Patent Citations (1)
Title |
---|
Wang Bin: "Design and Implementation of a Short Video Subtitle Generation System Based on Speech Understanding", China Master's Theses Full-text Database, 15 March 2021 (2021-03-15) *
Also Published As
Publication number | Publication date |
---|---|
CN118522302B (en) | 2024-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109741732B (en) | Named entity recognition method, named entity recognition device, equipment and medium | |
CN107610715B (en) | Similarity calculation method based on multiple sound characteristics | |
CN109147796B (en) | Speech recognition method, device, computer equipment and computer readable storage medium | |
CN107393554B (en) | A feature extraction method based on fusion of inter-class standard deviations in acoustic scene classification | |
CN107767881B (en) | Method and device for acquiring satisfaction degree of voice information | |
CN112133277B (en) | Sample generation method and device | |
WO2015139452A1 (en) | Method and apparatus for processing speech signal according to frequency domain energy | |
CN105161093A (en) | Method and system for determining the number of speakers | |
US8489404B2 (en) | Method for detecting audio signal transient and time-scale modification based on same | |
CN104464724A (en) | Speaker recognition method for deliberately pretended voices | |
CN108305639B (en) | Speech emotion recognition method, computer-readable storage medium, and terminal | |
CN108922541A (en) | Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model | |
CN109036437A (en) | Accents recognition method, apparatus, computer installation and computer readable storage medium | |
CN108682432B (en) | Voice emotion recognition device | |
CN109102800A (en) | A kind of method and apparatus that the determining lyrics show data | |
CN108288465A (en) | Intelligent sound cuts the method for axis, information data processing terminal, computer program | |
CN110516102B (en) | Lyric time stamp generation method based on spectrogram recognition | |
CN114303186A (en) | System and method for adapting human speaker embedding in speech synthesis | |
CN105845126A (en) | Method for automatic English subtitle filling of English audio image data | |
CN116095357B (en) | Live broadcasting method, device and system of virtual anchor | |
US10522160B2 (en) | Methods and apparatus to identify a source of speech captured at a wearable electronic device | |
CN119204030B (en) | Voice translation method and device for solving voice ambiguity | |
CN113674723B (en) | Audio processing method, computer equipment and readable storage medium | |
CN113160796B (en) | Language identification method, device and equipment for broadcast audio and storage medium | |
CN110197657B (en) | A dynamic sound feature extraction method based on cosine similarity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||