CN104361895B - Voice quality assessment equipment, method and system - Google Patents

Info

Publication number
CN104361895B
Authority
CN
China
Prior art keywords
voice
rhythm
user
speech
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410734839.8A
Other languages
Chinese (zh)
Other versions
CN104361895A (en
Inventor
林晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Original Assignee
SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI LIULISHUO INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410734839.8A
Publication of CN104361895A
Application granted
Publication of CN104361895B

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention provides a rhythm-based voice quality evaluation device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal, to overcome the problem that existing voice technology does not consider information about speech rhythm when evaluating a user's pronunciation. The voice quality evaluation device includes: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, the predetermined text including one or more sentences and each sentence including one or more words; a user voice receiving unit adapted to receive user voice entered by a user for the predetermined text; a feature acquisition unit adapted to acquire a user rhythm feature of the user voice; and a voice quality calculation unit adapted to calculate the voice quality of the user voice based on the correlation between the reference rhythm feature and the user rhythm feature. The technology of the invention can be applied to the field of voice technology.

Description

Voice quality evaluation device, method and system
Technical Field
The invention relates to the technical field of voice, in particular to rhythm-based voice quality evaluation equipment, method and system, data processing equipment and method, voice processing equipment and method and a mobile terminal.
Background
With the development of the internet, the internet-based language learning application has also been rapidly developed. In some language learning applications, an application provider sends learning materials to a client through the internet, and a user acquires the learning materials through the client and performs operations on the client according to the instructions of the learning materials, such as inputting characters, inputting voice or selecting, and obtains feedback, so as to improve the language ability of the user.
For language learning, in addition to learning grammar and vocabulary, an important aspect is developing listening and speaking abilities, particularly the ability to speak. In every language, speech in different scenarios often follows different rhythms. Generally, when people speak, they pause after certain words in a sentence, and the rhythm indicates after which words the speaker pauses and for how long. In addition, when a word has more than one syllable, there is a certain pause between the syllables when it is pronounced. Therefore, when learning to speak a language, the user needs to learn the speaking rhythm and/or pronunciation rhythm.
In the existing voice technology, a user records voice through a recording device of a client, a system splits the voice recorded by the user according to a text corresponding to the voice, and compares the voice of the user with an existing acoustic model word by word, thereby providing feedback to the user whether the pronunciation of the word is correct. However, the existing voice technology does not consider any information on the rhythm of the voice in evaluating the pronunciation of the user, and thus does not allow the learner to learn the rhythm of the speaking and/or pronunciation.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In view of the above, the present invention provides a rhythm-based speech quality evaluation device, method and system, a data processing device and method, a speech processing device and method, and a mobile terminal, so as to at least solve the problem that the existing speech technology does not consider information about speech rhythm when evaluating the pronunciation situation of a user.
According to an aspect of the present invention, there is provided a rhythm-based speech quality evaluation apparatus including: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; a user voice receiving unit adapted to receive user voice entered by a user for the predetermined text; a feature acquisition unit adapted to acquire a user rhythm feature of the user voice; and a speech quality calculation unit adapted to calculate the speech quality of the user voice based on a correlation between the reference rhythm feature and the user rhythm feature.
According to another aspect of the present invention, there is also provided a data processing apparatus adapted to be executed in a server, comprising: a server storage unit adapted to store a predetermined text and at least one piece of reference voice corresponding to the predetermined text; and a rhythm calculation unit adapted to calculate rhythm information of the reference voice from the at least one piece of reference voice and store the rhythm information in the server storage unit, or to calculate a reference rhythm feature of the at least one piece of reference voice from the rhythm information and store the reference rhythm feature in the server storage unit.
According to another aspect of the present invention, there is also provided a speech processing apparatus adapted to be executed in a computer and comprising: a reference voice receiving unit adapted to receive a voice, which is entered by a specific user for a predetermined text, as a reference voice; and a rhythm calculation unit adapted to calculate rhythm information of the reference speech from the reference speech to transmit the rhythm information to a predetermined server in association with a predetermined text, or calculate a reference rhythm feature of the reference speech from the rhythm information to transmit the reference rhythm feature to the predetermined server in association with the predetermined text.
According to another aspect of the present invention, there is also provided a rhythm-based speech quality evaluation method, including the steps of: receiving user voice entered by a user for a predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; acquiring a user rhythm feature of the user voice; and calculating the speech quality of the user voice based on the correlation between a reference rhythm feature corresponding to the predetermined text and the user rhythm feature.
According to another aspect of the present invention, there is also provided a data processing method adapted to be executed in a server, and including the steps of: storing a predetermined text and at least one piece of reference voice corresponding to the predetermined text; and calculating rhythm information of the reference voice from the at least one piece of reference voice and storing the rhythm information, or calculating a reference rhythm feature of the at least one piece of reference voice from the rhythm information and storing the reference rhythm feature.
According to another aspect of the present invention, there is also provided a speech processing method adapted to be executed in a computer and including the steps of: receiving, as reference voice, voice recorded by a specific user for a predetermined text; and calculating rhythm information of the reference voice from the reference voice to transmit the rhythm information to a predetermined server in association with the predetermined text, or calculating a reference rhythm feature of the reference voice from the rhythm information to transmit the reference rhythm feature to the predetermined server in association with the predetermined text.
According to another aspect of the present invention, there is also provided a mobile terminal including the rhythm-based speech quality evaluation device as described above.
According to still another aspect of the present invention, there is also provided a rhythm-based voice quality evaluation system including the rhythm-based voice quality evaluation apparatus as described above and the data processing apparatus as described above.
The rhythm-based speech quality evaluation scheme according to the embodiment of the present invention calculates the speech quality of the user voice based on the correlation between the acquired user rhythm feature of the user voice and the reference rhythm feature, and can obtain at least one of the following benefits: information related to speech rhythm is taken into account when calculating the speech quality of the user voice, so that the user can learn from the result how accurate the rhythm of the recorded voice is and can judge whether his or her speaking rhythm and/or pronunciation rhythm needs to be corrected; the calculation and evaluation of the user voice are completed on the client computer or client mobile terminal, so that the user can learn offline; the amount of calculation is small; time is saved; operation is simpler and more convenient; and when the representation of the user rhythm feature changes, the reference rhythm feature calculated from the rhythm information of the reference voice can easily be expressed in the same form as the user rhythm feature, which makes the processing of the speech quality evaluation device more flexible, convenient and practical.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
The invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numerals are used throughout the figures to indicate like or similar parts. The accompanying drawings, which are incorporated in and form a part of this specification, illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further explain the principles and advantages of the invention. In the drawings:
Fig. 1 is a block diagram schematically showing the structure of a mobile terminal 100;
Fig. 2 is a block diagram schematically showing an exemplary structure of a rhythm-based speech quality evaluation device 200 according to an embodiment of the present invention;
Fig. 3 is a block diagram schematically showing one possible structure of the feature acquisition unit 230 shown in fig. 2;
Fig. 4 is a block diagram schematically showing an exemplary structure of a rhythm-based speech quality evaluation device 400 according to another embodiment of the present invention;
Fig. 5 is a block diagram schematically showing an exemplary structure of a data processing apparatus 500 according to an embodiment of the present invention;
Fig. 6 is a block diagram schematically showing an exemplary structure of a speech processing apparatus 600 according to an embodiment of the present invention;
Fig. 7 is a flowchart schematically showing an exemplary process of a rhythm-based speech quality evaluation method according to an embodiment of the present invention;
Fig. 8 is a flowchart schematically showing an exemplary process of a data processing method according to an embodiment of the present invention;
Fig. 9 is a flowchart schematically showing an exemplary process of a speech processing method according to an embodiment of the present invention; and
Fig. 10 is a flowchart schematically showing another exemplary process of a speech processing method according to an embodiment of the present invention.
Skilled artisans appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve the understanding of the embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
An embodiment of the present invention provides a rhythm-based speech quality evaluation device, including: a storage unit adapted to store a predetermined text and a reference rhythm feature corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; a user voice receiving unit adapted to receive user voice entered by a user for the predetermined text; a feature acquisition unit adapted to acquire a user rhythm feature of the user voice; and a speech quality calculation unit adapted to calculate the speech quality of the user voice based on a correlation between the reference rhythm feature and the user rhythm feature.
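As a minimal sketch of how a correlation between two rhythm features might be computed, the Python function below uses the Pearson correlation coefficient on two equal-length feature vectors. The choice of Pearson correlation, the function name `rhythm_correlation`, and the convention of returning 0.0 for a constant vector are assumptions made for illustration only; this passage does not state which correlation measure the speech quality calculation unit uses.

```python
import math
from typing import Sequence


def rhythm_correlation(reference: Sequence[float], user: Sequence[float]) -> float:
    """Pearson correlation between two equal-length rhythm feature vectors.

    Assumption: Pearson correlation is chosen only for illustration; the
    text requires "a correlation" without naming a specific measure here.
    """
    if len(reference) != len(user) or len(reference) < 2:
        raise ValueError("feature vectors must have the same length (at least 2)")
    mean_ref = sum(reference) / len(reference)
    mean_usr = sum(user) / len(user)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((r - mean_ref) * (u - mean_usr) for r, u in zip(reference, user))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_usr = sum((u - mean_usr) ** 2 for u in user)
    if var_ref == 0 or var_usr == 0:
        # Correlation is undefined for a constant vector; 0.0 is a
        # hypothetical convention adopted for this sketch.
        return 0.0
    return cov / math.sqrt(var_ref * var_usr)
```

A value near 1 would suggest that the user pauses in the same pattern as the reference, while a value near -1 would suggest the opposite pattern.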
The above-described rhythm-based voice quality evaluation apparatus according to an embodiment of the present invention may be an application that runs on a conventional desktop or laptop computer (not shown) or the like, may be a client application that runs on a mobile terminal (for example, one of the applications 154 in the mobile terminal 100 shown in fig. 1), or may be a web application accessed through a browser on such a desktop computer, laptop computer or mobile terminal.
Fig. 1 is a block diagram of a mobile terminal 100. The multi-touch capable mobile terminal 100 may include a memory interface 102, one or more data processors, image processors and/or central processing units 104, and a peripheral interface 106.
The memory interface 102, the one or more processors 104, and/or the peripherals interface 106 can be discrete components or can be integrated in one or more integrated circuits. In the mobile terminal 100, the various elements may be coupled by one or more communication buses or signal lines. Sensors, devices, and subsystems can be coupled to peripheral interface 106 to facilitate a variety of functions. For example, motion sensors 110, light sensors 112, and distance sensors 114 may be coupled to peripheral interface 106 to facilitate directional, lighting, and ranging functions. Other sensors 116 may also be coupled to the peripheral interface 106, such as a positioning system (e.g., a GPS receiver), a temperature sensor, a biometric sensor, or other sensing device, to facilitate related functions.
The camera subsystem 120 and optical sensor 122, which may be, for example, a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) optical sensor, may be used to facilitate implementation of camera functions such as recording photographs and video clips.
Communication functions may be facilitated by one or more wireless communication subsystems 124, which may include radio frequency receivers and transmitters and/or optical (e.g., infrared) receivers and transmitters. The particular design and implementation of the wireless communication subsystem 124 may depend on the one or more communication networks supported by the mobile terminal 100. For example, the mobile terminal 100 may include a communication subsystem 124 designed to support a GSM network, a GPRS network, an EDGE network, a Wi-Fi or WiMax network, and a Bluetooth network.
The audio subsystem 126 may be coupled to a speaker 128 and a microphone 130 to facilitate voice-enabled functions, such as voice recognition, voice replication, digital recording, and telephony functions.
The I/O subsystem 140 may include a touch screen controller 142 and/or one or more other input controllers 144.
The touch screen controller 142 may be coupled to a touch screen 146. For example, the touch screen 146 and touch screen controller 142 may detect contact and movement or pauses made therewith using any of a variety of touch sensing technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies.
One or more other input controllers 144 may be coupled to other input/control devices 148 such as one or more buttons, rocker switches, thumbwheels, infrared ports, USB ports, and/or pointing devices such as styluses. The one or more buttons (not shown) may include up/down buttons for controlling the volume of the speaker 128 and/or microphone 130.
The memory interface 102 may be coupled with a memory 150. The memory 150 may include high speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g., NAND, NOR).
The memory 150 may store an operating system 152, such as Android, iOS or Windows Phone. The operating system 152 may include instructions for handling basic system services and performing hardware-dependent tasks. The memory 150 may also store applications 154. In operation, these applications are loaded from the memory 150 onto the processor 104, run on top of the operating system already run by the processor 104, and use the interfaces provided by the operating system and the underlying hardware to implement various user-desired functions, such as instant messaging, web browsing, picture management, and the like. An application may be provided independently of the operating system or may be native to the operating system. The applications 154 include the speech quality evaluation device 200 according to the present invention.
One example of a rhythm-based speech quality evaluation device 200 according to an embodiment of the present invention will be described below with reference to fig. 2.
As shown in fig. 2, the voice quality evaluation apparatus 200 includes a storage unit 210, a user voice receiving unit 220, a feature acquisition unit 230, and a voice quality calculation unit 240.
As shown in fig. 2, in the speech quality evaluation apparatus 200, the storage unit 210 is configured to store a predetermined text and a reference rhythm feature corresponding to the predetermined text. The predetermined text includes one or more sentences, and each sentence includes one or more words. Each word in a sentence may typically comprise a plurality of letters or at least one character.
According to one implementation, when the language of the predetermined text is, for example, a language such as english in which words are composed of letters, the predetermined text may selectively include information such as syllables and/or phonemes of each word and correspondence between the information such as syllables and/or phonemes of each word and the letters constituting the word, in addition to text contents such as one or more sentences and one or more words of each sentence.
Although the above example describes the case where the language of the predetermined text is english, the actual language of the predetermined text is not limited to english, and may be any language such as chinese, french, or german.
According to one implementation, the predetermined text and the reference rhythm feature may be downloaded from a predetermined server in advance and stored in the storage unit 210. The predetermined server may be, for example, a server on which the data processing apparatus 500 described below with reference to fig. 5 resides. In this mode the amount of calculation is small and no extra time is spent calculating the reference rhythm feature, which saves time and makes operation simpler and more convenient.
According to another implementation, the predetermined text may be downloaded in advance from the predetermined server without downloading the reference rhythm feature. In this implementation, the rhythm information of the reference voice may be downloaded from the predetermined server, and the reference rhythm feature may then be calculated from that rhythm information. The downloaded predetermined text and the calculated reference rhythm feature can then be saved in the storage unit 210. In this way, when the representation of the user rhythm feature changes, the reference rhythm feature calculated from the rhythm information of the reference voice can easily be expressed in the same form as the user rhythm feature, which makes the processing of the voice quality evaluation apparatus 200 more flexible, convenient and practical.
It should be noted that, the process of calculating the reference rhythm feature according to the rhythm information of the reference speech may refer to the processing procedure described below in conjunction with fig. 5, and will not be described in detail here.
Here, the reference voice may be a voice previously recorded for the predetermined text by a specific user (for example, a native speaker of the language of the predetermined text, or a professional teacher of that language). The rhythm information may relate to one or more pieces of reference voice. When there are multiple pieces of reference voice, the reference rhythm feature may be obtained by averaging the rhythm features of the individual pieces.
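The averaging step can be sketched as follows, assuming each piece of reference voice has already been reduced to a rhythm feature vector of the same length and that an element-wise arithmetic mean is intended. The function name `average_reference_features` and the sample values are hypothetical.

```python
from typing import List, Sequence


def average_reference_features(features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of several rhythm feature vectors.

    Assumption: every reference recording covers the same sentence, so all
    vectors have the same length.
    """
    if not features:
        raise ValueError("at least one feature vector is required")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature vectors must have the same length")
    # zip(*features) groups the i-th element of every vector together.
    # Rounding to milliseconds hides binary floating-point noise.
    return [round(sum(column) / len(features), 3) for column in zip(*features)]


# Two hypothetical reference recordings of the same sentence (seconds).
reference = average_reference_features([[0.3, 0.2, 0.3], [0.5, 0.4, 0.1]])
print(reference)
```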
When the user activates the speech quality evaluation device 200, the above-described predetermined text and the reference rhythm feature corresponding to the predetermined text are already stored in the storage unit 210 as described above. Then, through a display device such as the touch screen 146 of the mobile terminal 100, text content (i.e., the above-mentioned predetermined text) corresponding to the voice to be entered is presented to the user, and the user is prompted to record the corresponding voice. In this way, the user can enter a corresponding voice as a user voice through an input device such as the microphone 130 of the mobile terminal 100, and the user voice is received by the user voice receiving unit 220.
Then, the user voice receiving unit 220 forwards the user voice it receives to the feature obtaining unit 230, and the feature obtaining unit 230 obtains the user rhythm feature of the user voice.
Fig. 3 shows one possible example structure of the feature acquisition unit 230. In this example, the feature acquisition unit 230 may include an alignment subunit 310 and a feature calculation subunit 320.
As shown in fig. 3, the alignment subunit 310 may perform forced alignment between the user speech and the predetermined text by using a predetermined acoustic model, to determine the correspondence between each word in the predetermined text, and/or each syllable of each word, and/or each phoneme of each syllable, and a portion of the user speech.
Generally, an acoustic model is trained on recordings of a large number of native speakers. The acoustic model can be used to calculate the likelihood that an input speech corresponds to a known text, and the input speech can thereby be forcibly aligned with the known text. Here, the "input speech" may be the user speech or the reference speech mentioned later, and the "known text" may be the predetermined text.
For background on acoustic models, see http://mi.eng.cam.ac.uk/~mjfg/ASRU_talk09.pdf; for background on forced alignment, see http://www.isip.piconepress.com/projects/speed/software/requirements/production/functional/v1.0/section_04/s04_04_p01.html and http://www.phon.ox.ac.uk/jcoleman/BAAP_ASR.pdf. Other related techniques may also be used and are not detailed herein.
Furthermore, it should be noted that, by performing forced alignment between the user speech and the predetermined text, the correspondence between each sentence in the predetermined text and a portion of the user speech (such as a certain speech segment) can be determined; that is, the speech segment corresponding to each sentence of the predetermined text can be located in the user speech.
In addition, as described above, any one or more of the following three correspondences can be obtained as necessary by forced alignment: a correspondence between each word in the predetermined text and a portion of the user speech (such as a certain speech block); a correspondence between each syllable of each word in the predetermined text and a portion of the user speech (such as a certain speech block); and a correspondence between each phoneme of each syllable of each word in the predetermined text and a portion of the user speech (such as a certain speech block).
In this way, the feature calculation subunit 320 can calculate the user rhythm feature of the user speech based on the correspondence determined by the alignment subunit 310.
According to one implementation, for each sentence of the predetermined text, the feature calculation subunit 320 may obtain the rhythm feature of the speech segment corresponding to that sentence in the user speech from the time intervals between the two speech blocks that correspond to each pair of adjacent words in the sentence. The rhythm feature of the whole user speech is then formed from the rhythm features of the speech segments corresponding to the sentences of the predetermined text.
In one example, for each sentence in the predetermined text, the feature calculating subunit 320 may use information formed by all time intervals determined in the sentence as the rhythm feature of the speech segment corresponding to the sentence.
For example, for a certain sentence "how are you today" in a predetermined text, by forced alignment, it can be obtained that the sentence "how are you today" corresponds to a speech segment Use in the user's speech, and wherein the words "how", "are", "you", and "today" in turn correspond to speech blocks Ub1, Ub2, Ub3, and Ub4, respectively, in the user's speech. By forced alignment, the pause duration between two speech blocks corresponding to every two adjacent words in the speech of the user in the sentence can be obtained, that is, the following rhythm information:
(0.2-0.5), (0.6-0.8), (1.0-1.3) (assuming units of seconds).
Here, (0.2-0.5) indicates that the pause between Ub1 and Ub2 lasts from time point "0.2" to time point "0.5", i.e., the time interval is 0.3 seconds; (0.6-0.8) indicates that the pause between Ub2 and Ub3 lasts from time point "0.6" to time point "0.8", i.e., the time interval is 0.2 seconds; and (1.0-1.3) indicates that the pause between Ub3 and Ub4 lasts from time point "1.0" to time point "1.3", i.e., the time interval is 0.3 seconds. It should be noted that, in this example, all pauses are taken as time intervals, regardless of their length.
Thus, in this example, the obtained information about the time intervals of the sentence "how are you today" in the user speech can be used as the rhythm feature of the speech segment Use corresponding to the sentence. This information can be expressed in the form of, but is not limited to, a vector, that is, (0.3, 0.2, 0.3).
The rhythm characteristics formed by the time intervals between the words and the voice parts corresponding to the words can directly reflect the pause length between the words when the user reads the sentence.
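The example above can be reproduced with a short Python sketch. It assumes the forced-alignment step has already produced the pause spans as (start, end) pairs in seconds; the function name `pause_feature` and the rounding to milliseconds are illustrative choices, not part of the described device.

```python
from typing import List, Tuple

# (start_s, end_s) of each pause between adjacent words, taken from the
# "how are you today" example.
PauseSpan = Tuple[float, float]


def pause_feature(pauses: List[PauseSpan]) -> List[float]:
    """Turn pause spans into a vector of pause lengths in seconds.

    Rounding to milliseconds is an assumption made to hide binary
    floating-point noise such as 0.8 - 0.6 = 0.20000000000000007.
    """
    return [round(end - start, 3) for start, end in pauses]


pauses = [(0.2, 0.5), (0.6, 0.8), (1.0, 1.3)]
print(pause_feature(pauses))  # [0.3, 0.2, 0.3]
```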
In another example, for each sentence in the predetermined text, the feature calculation subunit 320 may instead determine the duration of the speech block corresponding to each word of the sentence in the user speech, and use the information formed by the durations of all words in the sentence as the rhythm feature of the speech segment corresponding to the sentence.
For example, still taking the sentence "how are you today" as an example, by forcing alignment, the duration of the speech block corresponding to each word in the sentence in the user speech can be obtained, that is, the following rhythm information:
(0-0.2), (0.5-0.6), (0.8-1.0), (1.3-1.5) (assuming units of seconds).
It can be found that the duration of Ub1 is 0.2 seconds, the duration of Ub2 is 0.1 seconds, the duration of Ub3 is 0.2 seconds, and the duration of Ub4 is also 0.2 seconds.
Thus, in this example, the obtained information about the word duration of the sentence "how are you today" can be used as the rhythm feature of the speech segment Use corresponding to the sentence, wherein the information can be expressed in, but not limited to, the form of a vector, that is, (0.2, 0.1, 0.2, 0.2).
The rhythm characteristic formed by the duration of the voice part corresponding to each word can directly reflect the pronunciation duration of each word when the user reads the sentence and indirectly reflect the pause length between the words.
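As a hedged illustration of why word durations also reflect pauses indirectly, the sketch below derives both quantities from the same word block boundaries. The names `word_durations` and `implied_pauses` are hypothetical, and the rounding is an assumption; the block boundaries are the example values above.

```python
from typing import List, Tuple

Block = Tuple[float, float]  # (start, end) of one word's speech block, in seconds


def word_durations(blocks: List[Block]) -> List[float]:
    """Duration of the speech block of each word."""
    return [round(end - start, 3) for start, end in blocks]


def implied_pauses(blocks: List[Block]) -> List[float]:
    """Gap between the end of one block and the start of the next one."""
    return [round(nxt[0] - cur[1], 3) for cur, nxt in zip(blocks, blocks[1:])]


# Blocks Ub1..Ub4 for "how are you today" from the example.
blocks = [(0.0, 0.2), (0.5, 0.6), (0.8, 1.0), (1.3, 1.5)]
print(word_durations(blocks))  # [0.2, 0.1, 0.2, 0.2]
print(implied_pauses(blocks))  # [0.3, 0.2, 0.3]
```

The second list matches the time-interval vector of the earlier example, which is the sense in which the duration-based feature carries the pause information as well.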
Further, according to another implementation, the information may be determined based on a size comparison of the time interval and a predetermined interval threshold. In other words, in such implementations, time intervals greater than or equal to the predetermined interval threshold are set to a first value (e.g., 1) and time intervals less than the predetermined interval threshold are set to a second value (e.g., 0). The predetermined interval threshold value may be set, for example, based on empirical values, or may be determined by experimental methods, which are not described in detail herein.
For example, still taking the sentence "how are you today" as an example, by forced alignment, the pause duration between the two speech blocks corresponding to every two adjacent words of the sentence in the speech of the user can be obtained, that is, the following rhythm information:
(0.2-0.5), (0.6-0.8), (1.0-1.3) (assuming units of seconds).
Assuming that in this example, the predetermined interval threshold is 0.25 seconds, the pause duration between Ub1 and Ub2 is 0.3 seconds, which is greater than the predetermined interval threshold, so the attribute value of the time interval is set to 1; the pause duration between Ub2 and Ub3 is 0.2 seconds, less than the predetermined interval threshold, so the attribute value for the time interval is set to 0; and the time interval between Ub3 and Ub4 is 0.3 seconds, which is greater than the predetermined interval threshold, so the attribute value of the time interval is set to 1.
The attribute values of the time interval are represented by "0" and "1", that is, "0" represents that the time interval is short, and "1" represents that the time interval is long. Thus, in this example, the information formed by the obtained attribute values of the time interval of the sentence "how are you today" can be used as the rhythm feature of the speech segment Use corresponding to the sentence, wherein the information can be expressed in the form of, but not limited to, a vector, that is, (1, 0, 1).
Assuming that the predetermined text includes two sentences which correspond to the speech segments Use1 and Use2 in the user speech, respectively, and the rhythm characteristic of the speech segment Use1 is (1, 0, 1) and the rhythm characteristic of the speech segment Use2 is (0, 1, 1), the rhythm characteristic of the user speech is { (1, 0, 1), (0, 1, 1) }.
It can be seen that, in the above manner, the formed rhythm features (such as the above-mentioned vector composed of 0 and 1 values) can more intuitively represent word-to-word intervals. In this case, by setting the predetermined interval threshold, a relatively short pause (e.g., smaller than the predetermined interval threshold) between words (and/or between syllables) and a relatively long pause (e.g., greater than or equal to the predetermined interval threshold) can be distinguished, so that the influence of the short pause on the formed rhythm characteristic is avoided, and the speaking and/or pronunciation habit of people is better met.
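The thresholding described above can be sketched in a few lines of Python. The function name `binarize_intervals` is hypothetical, and the default threshold of 0.25 seconds is simply the value assumed in the example; in practice it would be set empirically or experimentally as stated above.

```python
from typing import List


def binarize_intervals(intervals: List[float], threshold: float = 0.25) -> List[int]:
    """Map each pause length to 1 (long, >= threshold) or 0 (short)."""
    return [1 if t >= threshold else 0 for t in intervals]


# Pause lengths of "how are you today" from the example, in seconds.
print(binarize_intervals([0.3, 0.2, 0.3]))  # [1, 0, 1]
```

Note that an interval exactly equal to the threshold is mapped to the first value, in line with the "greater than or equal to" rule above.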
Several examples of obtaining the rhythm characteristics of the user are given above, and in the subsequent processing, the same form of reference rhythm characteristics are used to compare with the rhythm characteristics of the user (e.g., calculating similarity or distance between the two characteristics, etc.), and the comparison result is provided to the user, so that the user can quickly know whether the speaking rhythm and/or pronunciation rhythm is insufficient, and can immediately know how to improve (e.g., the pause between a certain two words should be longer or shorter, or the pronunciation duration of a certain word should be longer, how long a certain two syllables of a certain word should pause, etc.). It should be noted that, although the rhythm characteristics of the sentence are calculated based on the intervals between the words, in other examples, the rhythm characteristics of each word may also be calculated based on the intervals between the syllables in each word, which is similar to the above-described process, and therefore, the description is omitted here.
It should be noted that, in the embodiment of the present invention, the speaking rhythm refers to word-to-word pause, and the pronunciation rhythm refers to syllable-to-syllable pause.
In this way, after the user rhythm feature of the user voice is obtained by the feature acquisition unit 230, the voice quality calculation unit 240 can calculate the voice quality of the user voice based on the correlation between the user rhythm feature and the reference rhythm feature.
According to one implementation, the voice quality calculation unit 240 may obtain a score describing the voice quality of the user voice based on a correlation between the user rhythm feature and the reference rhythm feature.
In an example, assuming that the user rhythm feature obtained by the feature obtaining unit 230 is { (1, 0, 1), (0, 1, 1) }, and assuming that the reference rhythm feature is { (1, 0, 0), (1, 1, 1) }, a similarity between { (1, 0, 1), (0, 1, 1) } and { (1, 0, 0), (1, 1, 1) } may be obtained through calculation, and the similarity is used as a score describing the speech quality of the user speech. That is, the higher the calculated similarity between the user rhythm feature and the reference rhythm feature, the higher the speech quality of the user speech.
Wherein, the similarity between the user rhythm feature { (1, 0, 1), (0, 1, 1) } and the reference rhythm feature { (1, 0, 0), (1, 1, 1) } can be obtained according to the similarity between the vectors of the corresponding positions therebetween, for example, the vector similarity between (1, 0, 1) and (1, 0, 0) and the vector similarity between (0, 1, 1) and (1, 1, 1) are first calculated, and then a weighted average or weighted sum of all the calculated vector similarities is taken as the similarity between the user rhythm feature and the reference rhythm feature. When calculating the weighted average or the weighted sum of the vector similarity, the weight of the sentence in the predetermined text corresponding to each vector may be used as the weight of the vector similarity corresponding to the vector, and the weight of each sentence in the predetermined text may be set empirically in advance, or may be set by using the reference rhythm feature (for example, the higher the weight of the sentence corresponding to the vector containing more elements "1" in the reference rhythm feature is set), or each weight may be set to 1, and so on.
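A minimal sketch of this weighted similarity is given below. The description does not fix a particular vector similarity, so the fraction of matching positions used here is an assumption, as are the function names `vector_similarity` and `rhythm_similarity`.

```python
from typing import List, Optional, Sequence

Vector = Sequence[int]


def vector_similarity(a: Vector, b: Vector) -> float:
    """Fraction of positions where the two vectors agree (assumed measure)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rhythm_similarity(user: List[Vector], reference: List[Vector],
                      weights: Optional[List[float]] = None) -> float:
    """Weighted average of the per-sentence vector similarities."""
    if len(user) != len(reference):
        raise ValueError("both features must cover the same sentences")
    if weights is None:
        weights = [1.0] * len(user)  # every sentence weighted equally
    sims = [vector_similarity(u, r) for u, r in zip(user, reference)]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


user = [(1, 0, 1), (0, 1, 1)]
reference = [(1, 0, 0), (1, 1, 1)]
print(rhythm_similarity(user, reference))
# Weights taken from the number of "1" elements in each reference vector.
print(rhythm_similarity(user, reference, weights=[sum(r) for r in reference]))
```

With this measure each sentence agrees with its reference in two of three positions, so both calls yield two thirds. Weights derived from the count of "1" elements would need a fallback if every reference vector were all zeros, since the weight sum would then be zero.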
In addition, in another example, a distance between the user rhythm feature and the reference rhythm feature may also be calculated based on the correlation between the two features, and a score for describing the voice quality of the user voice may be obtained according to the distance. For example, the reciprocal of the distance may be taken as a score describing the speech quality of the user's speech. That is, the greater the calculated distance between the user rhythm feature and the reference rhythm feature, the worse the speech quality of the user's speech.
For example, the distance between the user rhythm feature { (1, 0, 1), (0, 1, 1) } and the reference rhythm feature { (1, 0, 0), (1, 1, 1) } may be obtained from the distance between vectors of corresponding positions therebetween. For example, the inter-vector distance between (1, 0, 1) and (1, 0, 0) and the inter-vector distance between (0, 1, 1) and (1, 1, 1) may be calculated, and then a weighted average or a weighted sum of all the calculated inter-vector distances may be taken as the distance between the user rhythm feature and the reference rhythm feature. The weight when calculating the weighted average or the weighted sum of the distances between the vectors may be set in the same manner as the weight when calculating the weighted average or the weighted sum of the similarity between the vectors, and will not be described herein again.
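The distance-based variant can be sketched in the same way. The Hamming distance used here is an assumption (the description does not name a specific inter-vector distance), the function names are hypothetical, and returning infinity for a zero distance is a choice made for this sketch because the reciprocal is otherwise undefined.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[int]


def hamming_distance(a: Vector, b: Vector) -> int:
    """Number of positions where the two vectors differ (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must be of equal length")
    return sum(x != y for x, y in zip(a, b))


def rhythm_distance(user: List[Vector], reference: List[Vector],
                    weights: Optional[List[float]] = None) -> float:
    """Weighted average of the per-sentence inter-vector distances."""
    if len(user) != len(reference):
        raise ValueError("both features must cover the same sentences")
    if weights is None:
        weights = [1.0] * len(user)
    dists = [hamming_distance(u, r) for u, r in zip(user, reference)]
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)


def distance_to_score(distance: float) -> float:
    """Reciprocal of the distance; a perfect match is mapped to infinity."""
    return math.inf if distance == 0 else 1.0 / distance


d = rhythm_distance([(1, 0, 1), (0, 1, 1)], [(1, 0, 0), (1, 1, 1)])
print(d, distance_to_score(d))  # 1.0 1.0
```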
Further, it should be noted that, if the reference rhythm feature stored in the storage unit 210 is not represented in the same form as that of the user rhythm feature (for example, in the form of a vector), it may be first represented in the same form, and then the similarity or distance between the two may be calculated.
In addition, it should be noted that the speech quality calculating unit 240 may calculate the correlation (i.e., the similarity or the distance) between the user rhythm feature and the reference rhythm feature sentence by sentence, and thereby obtain the quality score of the user speech sentence by sentence (i.e., obtain, one by one, the quality score of the speech segment of the user speech that corresponds to each sentence of the predetermined text). Alternatively, the voice quality calculating unit 240 may calculate the correlation (i.e., the similarity or the distance) between the user rhythm feature and the reference rhythm feature for the entire user voice, and then obtain a quality score describing the entire user voice.
Another example of the tempo-based speech quality evaluation apparatus according to the embodiment of the present invention is described below with reference to fig. 4.
In the example shown in fig. 4, the voice quality evaluation apparatus 400 includes an output unit 450 in addition to the storage unit 410, the user voice receiving unit 420, the feature acquisition unit 430, and the voice quality calculation unit 440. The storage unit 410, the user speech receiving unit 420, the feature obtaining unit 430, and the speech quality calculating unit 440 in the speech quality evaluating apparatus 400 shown in fig. 4 may have the same structures and functions as the corresponding units in the speech quality evaluating apparatus 200 described above with reference to fig. 2, and can achieve similar technical effects, which are not described again here.
The output unit 450 may visually output the calculation result of the voice quality, which may be presented to the user through a display device such as the touch screen 146 of the mobile terminal 100, for example.
According to one implementation, the output unit 450 may output a score reflecting the voice quality as a calculation result of the voice quality.
For example, the output unit 450 may visually output (such as sentence-by-sentence output) a score reflecting the voice quality of each voice segment corresponding to each sentence of the predetermined text in the user voice. Therefore, the user can know the speaking rhythm and/or the accuracy of the pronunciation rhythm of each sentence spoken by the user, and particularly when the score of a certain sentence is low, the user can immediately realize that the rhythm of the sentence needs to be corrected, so that the learning is more targeted.
As another example, the output unit 450 may visually output a score reflecting the voice quality of the entire user voice. Therefore, the user can integrally sense whether the rhythm of the section of speech spoken by the user is accurate or not.
In addition, in other examples, the output unit 450 may also visually output a score reflecting the voice quality of each speech section corresponding to each sentence of the predetermined text in the user voice and a score reflecting the voice quality of the entire user voice at the same time.
According to another implementation, the output unit 450 may visually output a difference between the user rhythm feature and the reference rhythm feature as a calculation result of the voice quality.
For example, the output unit 450 may represent the standard speech and the user speech in two parallel lines, where a "'" sign indicates that there is a pause between two words, and if the pauses are the same, they may be displayed in a general manner, such as a green "'" sign; if not, the pause is highlighted, such as a bold red "'".
Thus, through the output display of the output unit 450, the user can conveniently know the difference between the speaking rhythm and/or pronunciation rhythm of the user and the speaking rhythm and/or pronunciation rhythm of the standard voice (i.e. the reference voice), how much the difference is, and the like, so that the speaking rhythm and/or pronunciation rhythm of the user can be corrected more specifically and more accurately.
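A plain-text approximation of such a two-line display is sketched below. Since color and bold type are not available in plain text, a differing position is wrapped in square brackets instead of being shown in bold red; this marking, the `_` placeholder for a missing pause, and the function name `render_line` are all assumptions of the sketch.

```python
from typing import List, Optional, Sequence


def render_line(words: List[str], pauses: Sequence[int],
                reference: Optional[Sequence[int]] = None) -> str:
    """Join words, showing "'" for a pause and "[...]" where it differs."""
    if len(pauses) != len(words) - 1:
        raise ValueError("need exactly one pause flag between adjacent words")
    parts = [words[0]]
    for i, flag in enumerate(pauses):
        mark = "'" if flag else "_"
        if reference is not None and flag != reference[i]:
            parts.append("[" + mark + "]")  # highlighted: differs from reference
        elif flag:
            parts.append(mark)  # ordinary pause, same as reference
        parts.append(words[i + 1])
    return " ".join(parts)


words = ["how", "are", "you", "today"]
reference_pauses = (0, 0, 1)
user_pauses = (1, 0, 1)
print(render_line(words, reference_pauses))               # how are you ' today
print(render_line(words, user_pauses, reference_pauses))  # how ['] are you ' today
```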
According to other implementations, the output unit 450 may also visually output, as the calculation result of the speech quality, both the score reflecting the speech quality and the difference between the user rhythm feature and the reference rhythm feature at the same time. For the specific details of this implementation, reference may be made to the descriptions of the above two implementations, which are not repeated here.
As is apparent from the above description, the rhythm-based speech quality evaluation apparatus according to the embodiment of the present invention described above calculates the speech quality of the user speech based on the correlation between the acquired user rhythm feature of the user speech and the reference rhythm feature. Because the equipment considers the information related to the voice rhythm in the process of calculating the voice quality of the voice of the user, the user can know the accuracy of the recorded voice in the rhythm according to the calculation result, and the user can judge whether the speaking rhythm and/or the pronunciation rhythm of the user needs to be corrected.
Further, the above-described rhythm-based voice quality evaluation apparatus according to the embodiment of the present invention corresponds to a user client, and the calculation and evaluation of the user voice are performed on a client computer or a client mobile terminal. In conventional voice technology, by contrast, the calculation and evaluation of the user voice are generally performed on the server side. The apparatus therefore allows the user to learn offline (provided that the stored learning material has already been downloaded), without having to learn online as in the conventional art.
Furthermore, an embodiment of the present invention also provides a data processing apparatus, which is adapted to be executed in a server, and includes: a server storage unit adapted to store a predetermined text and to store at least one piece of reference voice corresponding to the predetermined text or to receive and store at least one piece of reference voice from the outside; and the rhythm calculating unit is suitable for calculating the rhythm information of at least one section of reference voice and storing the rhythm information in the server storage unit, or calculating the reference rhythm characteristics of at least one section of reference voice according to the rhythm information and storing the reference rhythm characteristics in the server storage unit.
Fig. 5 shows an example of a data processing device 500 according to an embodiment of the invention. As shown in fig. 5, the data processing apparatus 500 includes a server storage unit 510 and a tempo calculation unit 520.
The data processing device 500 may be implemented, for example, as an application residing on a server. The server may comprise, for example, a web server that may communicate with a user client (e.g., voice quality assessment device 200 or 400 described above) using http protocol, but is not so limited.
The server storage unit 510 may store text materials, i.e., predetermined texts, of various language learning materials. Here, the server storage unit 510 may store at least one piece of reference voice corresponding to a predetermined text in addition to the predetermined text for each language, or may receive and store at least one piece of reference voice from an external device such as the voice processing device 600 to be described later.
It should be understood that the predetermined text mentioned here is similar to the predetermined text mentioned above, and may optionally include information such as syllables and/or phonemes of each word (for example, when the language of the predetermined text is a language such as english in which words are composed of letters) and correspondence between information such as syllables and/or phonemes of each word and letters constituting the word, in addition to the text contents including one or more sentences and one or more words of each sentence.
The tempo calculation unit 520 may obtain tempo information or a reference tempo characteristic of at least one piece of reference speech by calculation, and store the obtained tempo information or the reference tempo characteristic in the server storage unit. The process of obtaining the reference rhythm feature may be similar to the process of obtaining the user rhythm feature described above, and will be illustrated below, and a description of part of the same contents is omitted.
According to one implementation, the tempo calculation unit 520 may store the obtained tempo information of at least one piece of reference speech in the server storage unit 510. In such an implementation, in subsequent processing, the data processing device 500 may provide its stored predetermined text and tempo information of at least one piece of reference speech to the user client (e.g., the speech quality assessment device 200 or 400 described above).
In addition, according to another implementation manner, the tempo calculation unit 520 may also obtain a reference tempo feature of at least one piece of reference speech according to the obtained tempo information of the at least one piece of reference speech, and store the obtained reference tempo feature in the server storage unit 510. In such an implementation, in subsequent processing, the data processing device 500 may provide its stored predetermined text and the reference tempo feature of at least one piece of reference speech to the user client (e.g., the speech quality assessment device 200 or 400 described above).
In one example, it is assumed that the "at least one piece of reference speech" includes the reference speech pieces R1 and R2. Taking a certain sentence "how are you today" in a predetermined text and the reference speech R1 as an example, by forced alignment, it can be obtained that the sentence "how are you today" corresponds to the speech segment R1se1 in the reference speech R1, and that the words "how", "are", "you", and "today" correspond in sequence to the speech blocks Rb1, Rb2, Rb3, and Rb4, respectively, in the reference speech. By forced alignment, tempo information of the reference speech R1 can be obtained, namely:
(0.2-0.4), (0.5-0.7), (0.9-1.2) (assuming units of seconds).
Here, (0.2-0.4) indicates that the pause period between Rb1 and Rb2 is from time point "0.2" to time point "0.4"; (0.5-0.7) indicates that the pause period between Rb2 and Rb3 is from time point "0.5" to time point "0.7"; and (0.9-1.2) indicates that the pause period between Rb3 and Rb4 is from time point "0.9" to time point "1.2". Thus, in this example, the obtained information about the time intervals of the sentence "how are you today" in the reference speech R1 can be used as the rhythm feature of the speech segment R1se1 corresponding to the sentence, wherein the information may be represented, for example, but not limited to, in the form of a vector, i.e., (0.2, 0.2, 0.3).
In another example, the information may be determined based on a comparison of the time interval with a predetermined interval threshold (assuming an interval threshold of 0.25 in this example). Thus, based on the above rhythm information and the predetermined interval threshold, the rhythm features of the speech segments R1se1 and R1se2 can be obtained as (0, 0, 1) and (0, 1, 1), respectively, and the rhythm feature of the reference speech R1 can thus be obtained as { (0, 0, 1), (0, 1, 1) }, which is stored in the server storage unit 510 as a reference rhythm feature.
In addition, in other examples, the rhythm information of the reference speech may also be formed by the duration of the speech block corresponding to each word of each sentence in the predetermined text in the reference speech, and the processing in this case is similar to the corresponding processing in the user speech, and therefore is not described again.
It should be noted that the processing in the data processing device 500 according to the embodiment of the present invention that is the same as the processing in the rhythm-based speech quality evaluation device 200 or 400 described above with reference to fig. 2 or 4 can achieve similar technical effects, and is not described in detail here.
Furthermore, an embodiment of the present invention also provides a speech processing apparatus adapted to be executed in a computer and including: the reference voice receiving unit is suitable for receiving the voice recorded by a specific user aiming at the preset text as reference voice and sending the reference voice to a preset server; and/or a rhythm calculation unit adapted to calculate rhythm information of the reference speech from the reference speech to transmit the rhythm information to the predetermined server in association with the predetermined text, or calculate a reference rhythm feature of the reference speech from the rhythm information to transmit the reference rhythm feature to the predetermined server in association with the predetermined text.
FIG. 6 shows an example of a speech processing device 600 according to an embodiment of the invention. As shown in fig. 6, the speech processing apparatus 600 includes a reference speech receiving unit 610, and may further include a tempo calculating unit 620.
As shown in fig. 6, according to one implementation, when the voice processing apparatus 600 includes only the reference voice receiving unit 610, it is possible to receive, as reference voice, voice entered by a specific user (such as a user who is native to a predetermined text language or a professional language teacher related to the language) for a predetermined text through the reference voice receiving unit 610, and transmit the reference voice to a predetermined server (such as a server where the data processing apparatus 500 described above in connection with fig. 5 resides).
Further, according to another implementation, when the speech processing apparatus 600 further includes the tempo calculation unit 620, the tempo information of the reference speech may be calculated from the reference speech received by the reference speech reception unit 610 so as to transmit the tempo information to the predetermined server in association with the predetermined text, or the reference tempo feature of the reference speech may be calculated from the tempo information (for this process, refer to the relevant description above) so as to transmit the reference tempo feature to the predetermined server in association with the predetermined text.
In practical applications, the speech processing device 600 may correspond to a teacher client provided on a computer or other terminal, for example implemented in software.
The user of the teacher client can record standard voice for each sentence in the predetermined text and send it to the corresponding server as reference voice, and the server then executes the subsequent processing. In this case, the server can conveniently collect the reference voice through the Internet without taking part in the recording itself, which saves time and effort.
In addition, the teacher client can also directly and locally process and analyze the recorded standard voice (i.e. reference voice), generate parameters (such as reference voice characteristics) corresponding to the standard voice, and transmit the parameters and the predetermined text to the server for storage, so that the processing load of the server can be reduced.
In addition, the embodiment of the invention also provides a mobile terminal which comprises the rhythm-based voice quality evaluation device. The mobile terminal may have the functions of the rhythm-based voice quality evaluation device 200 or 400 described above, and may achieve similar technical effects, which will not be described in detail herein.
Further, an embodiment of the present invention also provides a rhythm-based voice quality evaluation system including the rhythm-based voice quality evaluation device 200 or 400 as described above and the data processing device 500 as described above.
According to one implementation, the voice quality evaluation system may optionally include the voice processing apparatus 600 as described above in addition to the voice quality evaluation apparatus 200 or 400 and the data processing apparatus 500 described above. In this implementation, the voice quality evaluation apparatus 200 or 400 in the voice quality evaluation system may correspond to a user client provided in a computer or a mobile terminal, the data processing apparatus 500 may correspond to a server provided, and the voice processing apparatus 600 may correspond to a teacher client. In actual processing, the teacher client may provide reference voice (optionally, rhythm information or reference rhythm characteristics of the reference voice) to the server, the server is used for storing the information and the predetermined text, and the user client may download the information from the server to analyze the user voice input by the user so as to complete voice quality evaluation. The details of the processing can refer to the descriptions given above in conjunction with fig. 2 or 4, fig. 5, and fig. 6, respectively, and are not described here again.
In addition, the embodiment of the invention also provides a rhythm-based voice quality evaluation method, which comprises the following steps: receiving user voice input by a user aiming at predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; acquiring a user rhythm characteristic of user voice; and calculating the voice quality of the voice of the user based on the correlation between the reference rhythm characteristics corresponding to the preset text and the rhythm characteristics of the user.
An exemplary process of the above-described rhythm-based speech quality evaluation method is described below with reference to fig. 7. As shown in fig. 7, an exemplary process flow 700 of the tempo-based speech quality assessment method according to one embodiment of the present invention starts at step S710 and then proceeds to step S720.
In step S720, a user voice entered by a user for a predetermined text is received, the predetermined text including one or more sentences, each sentence including one or more words. Then, step S730 is performed. The processing in step S720 may be the same as the processing of the user speech receiving unit 220 described above with reference to fig. 2, and similar technical effects can be achieved, which is not described herein again.
According to one implementation, the predetermined text and reference tempo features may be previously downloaded from a predetermined server.
According to another implementation, the predetermined text may be downloaded from a predetermined server in advance, and the reference rhythm feature may be calculated based on rhythm information of at least one piece of reference voice downloaded from the predetermined server in advance.
In step S730, a user rhythm feature of the user speech is acquired. Then, step S740 is performed. The processing in step S730 may be the same as the processing of the feature obtaining unit 230 described above with reference to fig. 2, and similar technical effects can be achieved, which is not described herein again.
According to one implementation, in step S730, the user speech may be forcibly aligned with the predetermined text using a predetermined acoustic model to determine a correspondence between each word and/or each syllable in each word and/or each phoneme of each syllable in the predetermined text and the portion of the user speech, and a user rhythm characteristic of the user speech is obtained based on the correspondence.
The step of obtaining the user rhythm characteristic of the user voice based on the corresponding relationship may be implemented as follows: aiming at each sentence of a preset text, obtaining the rhythm characteristics of a voice section corresponding to each sentence in the voice of a user according to the time interval between two voice blocks corresponding to every two adjacent words in the voice of the user in each sentence; and forming the rhythm characteristic of the user voice based on the rhythm characteristic of each voice section corresponding to each sentence of the obtained preset text in the user voice.
In one example, for each sentence in the predetermined text, information constituted by all time intervals determined in the sentence can be used as the rhythm feature of the speech segment corresponding to the sentence. Here, when the time interval between two adjacent words is greater than or equal to a predetermined interval threshold, a first value is set for that time interval; when the time interval is less than the predetermined interval threshold, a second value is set for that time interval; and the rhythm feature of the speech segment corresponding to the sentence is determined according to the first and second values thus obtained.
In another example, for each sentence in the predetermined text, the duration of the speech block corresponding to each word in the sentence in the speech of the user may be determined, and information constituted by the determined durations corresponding to all words in the sentence may be used as the rhythm characteristic of the speech segment corresponding to the sentence.
In step S740, the speech quality of the user speech is calculated based on the correlation between the reference rhythm feature corresponding to the predetermined text and the user rhythm feature. The processing in step S740 may be the same as the processing of the speech quality calculating unit 240 described above with reference to fig. 2, and can achieve similar technical effects, which is not described herein again. Then, the process flow 700 ends in step S750.
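Steps S730 and S740 can be tied together in a short end-to-end sketch. It assumes that forced alignment has already produced the pause boundaries of each sentence, that the threshold-based feature is used, and that the score is the fraction of matching positions; all function names are hypothetical and none of these choices is mandated by the description.

```python
from typing import List, Sequence, Tuple

Pause = Tuple[float, float]  # (start, end) of a pause between two words, seconds


def user_rhythm_feature(pauses_per_sentence: List[List[Pause]],
                        threshold: float = 0.25) -> List[List[int]]:
    """S730: one 0/1 vector per sentence, 1 meaning a pause >= threshold."""
    return [[1 if round(end - start, 3) >= threshold else 0
             for start, end in sentence]
            for sentence in pauses_per_sentence]


def sentence_scores(user: List[List[int]],
                    reference: List[Sequence[int]]) -> List[float]:
    """S740, per sentence: fraction of positions matching the reference."""
    return [sum(u == r for u, r in zip(uf, rf)) / len(rf)
            for uf, rf in zip(user, reference)]


def evaluate(pauses_per_sentence: List[List[Pause]],
             reference: List[Sequence[int]], threshold: float = 0.25):
    """Return the user feature, per-sentence scores and an overall score."""
    feature = user_rhythm_feature(pauses_per_sentence, threshold)
    scores = sentence_scores(feature, reference)
    overall = sum(scores) / len(scores)  # unweighted mean over sentences
    return feature, scores, overall


# One sentence: user pauses from the earlier example, reference (0, 0, 1).
feature, scores, overall = evaluate([[(0.2, 0.5), (0.6, 0.8), (1.0, 1.3)]],
                                    [(0, 0, 1)])
print(feature)  # [[1, 0, 1]]
print(scores, overall)
```

Returning both the per-sentence scores and the overall score mirrors the two output options described for the output unit, so a caller can show either or both.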
Furthermore, according to another implementation, the following step may optionally be performed after step S740: visually outputting the calculation result of the speech quality.
Here, the calculation result of the speech quality may include a score reflecting the speech quality and/or the difference between the user rhythm feature and the reference rhythm feature.
As is apparent from the above description, the rhythm-based speech quality evaluation method according to the embodiment of the present invention calculates the speech quality of the user speech based on the correlation between the acquired user rhythm feature of the user speech and the reference rhythm feature. Because the method takes rhythm information into account when calculating the speech quality of the user speech, the user can learn from the calculation result how accurate the rhythm of the recorded speech is, which in turn helps the user judge whether his or her speaking rhythm and/or pronunciation rhythm needs to be corrected.
In addition, the rhythm-based speech quality evaluation method according to the embodiment of the present invention described above runs on the user client side: the calculation and evaluation of the user speech are performed on a client computer or a client mobile terminal, whereas conventional speech technology typically performs them on the server side.
In addition, an embodiment of the present invention further provides a data processing method, which is suitable for being executed in a server and includes the following steps: storing a predetermined text; storing at least one piece of reference voice corresponding to a predetermined text, or receiving and storing at least one piece of reference voice from the outside; and acquiring rhythm information of at least one section of reference voice and storing the rhythm information, or acquiring reference rhythm characteristics of at least one section of reference voice according to the rhythm information and storing the reference rhythm characteristics.
An exemplary process of the above data processing method is described below with reference to fig. 8. As shown in fig. 8, an exemplary process flow 800 of the data processing method according to one embodiment of the present invention starts at step S810 and then, step S820 is performed.
In step S820, a predetermined text and at least one piece of reference voice corresponding to the predetermined text are stored, or the predetermined text is stored and at least one piece of reference voice is received and stored from the outside. Then, step S830 is performed. The processing in step S820 may be the same as the processing of the server storage unit 510 described above with reference to fig. 5, for example, and similar technical effects can be achieved, and are not described again here.
In step S830, rhythm information of at least one piece of reference speech is obtained and stored, or a reference rhythm feature of at least one piece of reference speech is obtained according to the rhythm information and stored. The processing in step S830 may be the same as the processing of the obtaining unit 520 described above with reference to fig. 5, and similar technical effects can be achieved, and are not described herein again. Then, the process flow 800 ends in step S840.
The data processing method of the embodiment of the present invention described above can achieve technical effects similar to those of the data processing apparatus 500 described above, which are not described in detail again here.
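The server-side flow of steps S820 and S830 can be sketched with a small in-memory store. Everything here is illustrative: the class and method names are hypothetical, the rhythm information is assumed to be a list of inter-word intervals that has already been computed, and averaging over several reference recordings is just one assumed way to derive a single reference rhythm feature.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReferenceEntry:
    text: str
    reference_audio: List[bytes] = field(default_factory=list)
    rhythm_info: List[List[float]] = field(default_factory=list)


class ReferenceStore:
    """In-memory stand-in for the server storage (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[str, ReferenceEntry] = {}

    def store_text(self, text_id: str, text: str) -> None:
        self._entries[text_id] = ReferenceEntry(text=text)

    def add_reference(self, text_id: str, audio: bytes, gaps: List[float]) -> None:
        entry = self._entries[text_id]
        entry.reference_audio.append(audio)
        entry.rhythm_info.append(list(gaps))

    def reference_feature(self, text_id: str) -> List[float]:
        """Element-wise mean of the stored rhythm information (assumption)."""
        infos = self._entries[text_id].rhythm_info
        if not infos:
            raise ValueError("no reference speech stored for this text")
        if len({len(i) for i in infos}) != 1:
            raise ValueError("rhythm information entries differ in length")
        return [sum(col) / len(infos) for col in zip(*infos)]


store = ReferenceStore()
store.store_text("t1", "how are you")
store.add_reference("t1", b"<audio 1>", [0.1, 0.3])
store.add_reference("t1", b"<audio 2>", [0.3, 0.5])
feat = store.reference_feature("t1")
print([round(v, 6) for v in feat])
```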
In addition, an embodiment of the present invention further provides a speech processing method, which is suitable for being executed in a computer and includes the following steps: receiving, as reference speech, speech entered by a specific user for a predetermined text, and sending the reference speech to a predetermined server. Alternatively, rhythm information of the reference speech may be calculated from the reference speech and sent to the predetermined server in association with the predetermined text, or a reference rhythm feature of the reference speech may be obtained from the rhythm information and sent to the predetermined server in association with the predetermined text.
An exemplary process of the above-described speech processing method is described below in conjunction with fig. 9. As shown in fig. 9, an exemplary process flow 900 of the speech processing method according to one embodiment of the present invention starts at step S910 and then proceeds to step S920.
In step S920, a voice entered by a specific user for a predetermined text is received as a reference voice. Then, step S930 is performed.
In step S930, the reference voice is transmitted to a predetermined server. The process flow 900 then ends in step S940.
The processing of the processing flow 900 may be the same as the processing of the reference speech receiving unit 610 described above with reference to fig. 6, and similar technical effects can be achieved, which is not described herein again.
Further, fig. 10 shows another exemplary process of the above-described voice processing method. As shown in fig. 10, an exemplary process flow 1000 of a speech processing method according to one embodiment of the present invention begins at step S1010 and then proceeds to step S1020.
In step S1020, a voice entered by a specific user for a predetermined text is received as a reference voice. Then, step S1030 is performed.
According to one implementation, rhythm information of the reference speech may be obtained in step S1030 to be transmitted to a predetermined server in association with a predetermined text. The process flow 1000 then ends in step S1040.
According to another implementation, a reference rhythm feature of the reference speech may be obtained from the rhythm information to transmit the reference rhythm feature to a predetermined server in association with a predetermined text in step S1030. The process flow 1000 then ends in step S1040.
The processing of the processing flow 1000 may be the same as the processing of the receiving and obtaining unit 620 described above with reference to fig. 6, and similar technical effects can be achieved, which is not described herein again.
The voice processing method according to the embodiment of the present invention can achieve similar technical effects as the voice processing apparatus 600 described above, and will not be described in detail here.
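The two implementations of step S1030 differ only in what is sent to the server in association with the predetermined text. A minimal sketch of building such a message is given below; the JSON layout, the field names, and the thresholded form of the reference rhythm feature are assumptions for illustration, and no actual network transfer is shown.

```python
import json
from typing import List


def build_reference_payload(text_id: str, gaps: List[float],
                            as_feature: bool = False,
                            threshold: float = 0.2) -> str:
    """Serialize either the raw rhythm information or a derived feature."""
    body = {"text_id": text_id}
    if as_feature:
        # Derived reference rhythm feature: 1 for a long pause, 0 otherwise (assumption).
        body["reference_rhythm_feature"] = [1 if g >= threshold else 0 for g in gaps]
    else:
        body["rhythm_info"] = list(gaps)
    return json.dumps(body, sort_keys=True)


info_message = build_reference_payload("t1", [0.05, 0.4])
feature_message = build_reference_payload("t1", [0.05, 0.4], as_feature=True)
print(info_message)
print(feature_message)
```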
A11: A rhythm-based speech quality evaluation method, comprising the following steps: receiving user speech entered by a user for a predetermined text, wherein the predetermined text comprises one or more sentences and each sentence comprises one or more words; acquiring a user rhythm feature of the user speech; and calculating the speech quality of the user speech based on the correlation between the reference rhythm feature corresponding to the predetermined text and the user rhythm feature.
A12: The speech quality evaluation method according to A11, wherein the step of acquiring the user rhythm feature of the user speech comprises: forcibly aligning the user speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word in the predetermined text and a portion of the user speech, and obtaining the user rhythm feature of the user speech based on the correspondence.
A13: The speech quality evaluation method according to A12, wherein the step of obtaining the user rhythm feature of the user speech based on the correspondence comprises: for each sentence of the predetermined text, obtaining the rhythm feature of the speech segment corresponding to that sentence in the user speech according to the time interval between the two speech blocks that correspond, in the user speech, to every two adjacent words in the sentence; and forming the rhythm feature of the user speech based on the rhythm features of the speech segments corresponding to the sentences of the predetermined text in the user speech.
A14: The speech quality evaluation method according to A13, wherein, for each sentence in the predetermined text: the information formed by all time intervals determined in the sentence is taken as the rhythm feature of the speech segment corresponding to the sentence; or the duration of the speech block corresponding to each word of the sentence in the user speech is determined, and the information formed by the determined durations of all words in the sentence is taken as the rhythm feature of the speech segment corresponding to the sentence.
A15: The speech quality evaluation method according to A14, wherein, when a time interval between every two words is greater than or equal to a predetermined interval threshold, a first value corresponding to the time interval is set; when the time interval is less than the predetermined interval threshold, a second value corresponding to the time interval is set; and the rhythm feature of the speech segment corresponding to the sentence is determined according to the obtained first and second values.
A16: The speech quality evaluation method according to A11, further comprising: visually outputting the calculation result of the speech quality.
A17: The speech quality evaluation method according to A16, wherein the calculation result of the speech quality includes: a score reflecting the speech quality; and/or a difference between the user rhythm feature and the reference rhythm feature.
A18: The speech quality evaluation method according to A11, wherein: the predetermined text and the reference rhythm feature are downloaded from a predetermined server in advance; or the predetermined text is downloaded from a predetermined server in advance, and the reference rhythm feature is calculated according to the rhythm information of at least one piece of reference speech downloaded from the predetermined server in advance.
A19: A data processing method, the method being adapted to be executed in a server and comprising the steps of: storing a predetermined text and at least one piece of reference speech corresponding to the predetermined text; and calculating rhythm information of the reference speech from the at least one piece of reference speech and storing the rhythm information, or calculating a reference rhythm feature of the at least one piece of reference speech from the rhythm information and storing the reference rhythm feature.
A20: A speech processing method, the method being adapted to be executed in a computer and comprising the steps of: receiving, as reference speech, speech entered by a specific user for a predetermined text; and calculating rhythm information of the reference speech from the reference speech so as to send it to a predetermined server in association with the predetermined text, or calculating a reference rhythm feature of the reference speech from the rhythm information so as to send it to the predetermined server in association with the predetermined text.
A21: A mobile terminal comprising a rhythm-based speech quality evaluation apparatus according to the present invention.
A22: A rhythm-based speech quality evaluation system comprising a rhythm-based speech quality evaluation apparatus and a data processing apparatus according to the present invention.
A23: A rhythm-based speech quality evaluation system, comprising: a rhythm-based speech quality evaluation apparatus according to the present invention; a server storing the predetermined text and the reference rhythm information and/or the reference rhythm feature; and a speech processing apparatus according to the present invention.
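The step in A13 of forming the rhythm feature of the whole user speech from the per-sentence features can be sketched as a simple concatenation in sentence order. Concatenation is an assumption here (the text only says the overall feature is formed "based on" the per-sentence features), as is the `(word, start, end)` layout of the alignment results.

```python
from typing import List, Tuple

# Assumed shape of one forced-alignment result: (word, start_seconds, end_seconds).
AlignedWord = Tuple[str, float, float]


def sentence_feature(words: List[AlignedWord]) -> List[float]:
    """Intervals between adjacent words of one sentence (empty for a single word)."""
    return [max(0.0, b[1] - a[2]) for a, b in zip(words, words[1:])]


def utterance_feature(sentences: List[List[AlignedWord]]) -> List[float]:
    """Concatenate the per-sentence features in sentence order."""
    feature: List[float] = []
    for words in sentences:
        feature.extend(sentence_feature(words))
    return feature


# Hypothetical alignment of a two-sentence predetermined text.
sentences = [
    [("hi", 0.0, 0.5), ("there", 1.0, 1.5)],
    [("see", 2.0, 2.25), ("you", 2.5, 3.0)],
]
user_feature = utterance_feature(sentences)
print(user_feature)
```

Because the pause between the last word of one sentence and the first word of the next is not included, the feature only reflects rhythm within sentences, which matches the per-sentence wording above.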
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (19)

1. A rhythm-based speech quality evaluation apparatus comprising:
a storage unit adapted to store a predetermined text and a reference rhythm characteristic corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words;
a user speech receiving unit adapted to receive user speech entered by a user for the predetermined text;
a feature acquisition unit adapted to acquire a user rhythm characteristic of the user speech; and
a speech quality calculation unit adapted to calculate a speech quality of the user speech based on a correlation between the reference rhythm characteristic and the user rhythm characteristic, comprising:
calculating the similarity between the user rhythm characteristic and the reference rhythm characteristic, and taking the similarity as a score describing the speech quality of the user speech; or
calculating the distance between the user rhythm characteristic and the reference rhythm characteristic based on the correlation between the user rhythm characteristic and the reference rhythm characteristic, and obtaining a score describing the speech quality of the user speech according to the distance; wherein the reference rhythm characteristic and the user rhythm characteristic have the same acquisition process and the same representation form, and the rhythm characteristic can reflect the pause length between words;
wherein the feature acquisition unit includes:
an alignment subunit adapted to perform forced alignment of the user speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word in the predetermined text and a portion of the user speech; and
a feature calculating subunit, adapted to calculate a user rhythm feature of the user speech based on the correspondence, including:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to that sentence in the user speech according to the time interval between the two speech blocks that correspond, in the user speech, to every two adjacent words in the sentence; and
forming the rhythm characteristic of the user speech based on the rhythm characteristics of the speech segments corresponding to the sentences of the predetermined text in the user speech.
2. The speech quality evaluation device according to claim 1, wherein the feature calculation subunit is adapted to, for each sentence in the predetermined text:
taking the information formed by all time intervals determined in the sentence as the rhythm characteristic of the speech segment corresponding to the sentence; or
determining the duration of the speech block corresponding to each word of the sentence in the user speech, and taking the information formed by the determined durations of all words in the sentence as the rhythm characteristic of the speech segment corresponding to the sentence.
3. The speech quality evaluation apparatus according to claim 2, wherein when a time interval between each two words is greater than or equal to a predetermined interval threshold, a first value corresponding to the time interval is set; setting a second value corresponding to the time interval when the time interval is less than a predetermined interval threshold; and determining the rhythm characteristic of the speech segment corresponding to the sentence according to the obtained first value and the second value.
4. The voice quality evaluation apparatus according to claim 1, further comprising:
an output unit adapted to visually output the calculation result of the speech quality.
5. The speech quality evaluation apparatus according to claim 4, wherein the output unit is adapted to output, as the calculation result of the speech quality, a result of:
a score reflecting the speech quality; and/or
a difference between the user rhythm characteristic and the reference rhythm characteristic.
6. The voice quality evaluation apparatus according to claim 1, wherein:
the storage unit is adapted to download the predetermined text and the reference rhythm characteristic from a predetermined server in advance; or
the storage unit is adapted to download the predetermined text and the rhythm information of at least one piece of reference speech from a predetermined server in advance, and to calculate the reference rhythm characteristic from the rhythm information of the at least one piece of reference speech.
7. A data processing apparatus, the apparatus being adapted to reside in a server and comprising:
a server storage unit adapted to store a predetermined text and at least one piece of reference voice corresponding to the predetermined text, the predetermined text including one or more sentences, each sentence including one or more words; and
a rhythm calculation unit adapted to calculate rhythm information of the reference speech from the at least one piece of reference speech and store the rhythm information in the server storage unit, or to calculate a reference rhythm characteristic of the at least one piece of reference speech from the rhythm information and store the reference rhythm characteristic in the server storage unit; wherein
the rhythm calculation unit is adapted to forcibly align the reference speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word in the predetermined text and a portion of the reference speech; and
to calculate the reference rhythm characteristic of the reference speech based on the correspondence, including:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to that sentence in the reference speech according to the time interval between the two speech blocks that correspond, in the reference speech, to every two adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference speech based on the obtained rhythm characteristics of the speech segments corresponding to the sentences of the predetermined text in the reference speech.
8. A speech processing apparatus, the apparatus being adapted to be executed in a computer and comprising:
a reference speech receiving unit adapted to receive, as reference speech, speech entered by a specific user for a predetermined text, the predetermined text including one or more sentences, each sentence including one or more words; and
a rhythm calculation unit adapted to calculate rhythm information of the reference speech from the reference speech so as to transmit the rhythm information to a predetermined server in association with the predetermined text, or to calculate a reference rhythm characteristic of the reference speech from the rhythm information so as to transmit the reference rhythm characteristic to the predetermined server in association with the predetermined text; wherein
the rhythm calculation unit is adapted to forcibly align the reference speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word in the predetermined text and a portion of the reference speech; and
to calculate the reference rhythm characteristic of the reference speech based on the correspondence, including:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to that sentence in the reference speech according to the time interval between the two speech blocks that correspond, in the reference speech, to every two adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference speech based on the obtained rhythm characteristics of the speech segments corresponding to the sentences of the predetermined text in the reference speech.
9. A rhythm-based speech quality evaluation method comprises the following steps:
receiving user voice input by a user aiming at predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words;
acquiring the user rhythm characteristics of the user voice, including:
forcibly aligning the user speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word in the predetermined text and a portion of the user speech;
obtaining the user rhythm characteristics of the user voice based on the corresponding relation, including:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to that sentence in the user speech according to the time interval between the two speech blocks that correspond, in the user speech, to every two adjacent words in the sentence; and
forming the rhythm characteristic of the user speech based on the rhythm characteristics of the speech segments corresponding to the sentences of the predetermined text in the user speech; and
calculating the speech quality of the user speech based on the correlation between the reference rhythm characteristic corresponding to the predetermined text and the user rhythm characteristic, comprising:
calculating the similarity between the user rhythm characteristic and the reference rhythm characteristic, and taking the similarity as a score describing the speech quality of the user speech; or
calculating the distance between the user rhythm characteristic and the reference rhythm characteristic based on the correlation between the user rhythm characteristic and the reference rhythm characteristic, and obtaining a score describing the speech quality of the user speech according to the distance; wherein the reference rhythm characteristic and the user rhythm characteristic have the same acquisition process and the same representation form, and the rhythm characteristic can reflect the pause length between words.
10. The speech quality evaluation method according to claim 9, wherein, for each sentence in the predetermined text:
the information formed by all time intervals determined in the sentence is taken as the rhythm characteristic of the speech segment corresponding to the sentence; or
the duration of the speech block corresponding to each word of the sentence in the user speech is determined, and the information formed by the determined durations of all words in the sentence is taken as the rhythm characteristic of the speech segment corresponding to the sentence.
11. The speech quality evaluation method according to claim 10, wherein when a time interval between every two words is greater than or equal to a predetermined interval threshold, a first value corresponding to the time interval is set; setting a second value corresponding to the time interval when the time interval is less than a predetermined interval threshold; and determining the rhythm characteristic of the speech segment corresponding to the sentence according to the obtained first value and the second value.
12. The speech quality evaluation method according to claim 9, further comprising: visually outputting the calculation result of the speech quality.
13. The voice quality evaluation method according to claim 12, wherein the calculation result of the voice quality includes:
a score reflecting the speech quality; and/or
a difference between the user rhythm characteristic and the reference rhythm characteristic.
14. The voice quality evaluation method according to claim 9, wherein:
the predetermined text and the reference rhythm characteristic are downloaded from a predetermined server in advance; or
the predetermined text is downloaded from a predetermined server in advance, and the reference rhythm characteristic is calculated according to rhythm information of at least one piece of reference speech downloaded from the predetermined server in advance.
15. A data processing method, the method being adapted to be executed in a server and comprising the steps of:
storing a predetermined text and at least one piece of reference voice corresponding to the predetermined text, wherein the predetermined text comprises one or more sentences, and each sentence comprises one or more words; and
calculating rhythm information of the reference speech from the at least one piece of reference speech and storing the rhythm information, or calculating a reference rhythm characteristic of the at least one piece of reference speech from the rhythm information and storing the reference rhythm characteristic; wherein
the step of calculating the reference rhythm characteristic of the reference speech comprises:
forcibly aligning the reference speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word in the predetermined text and a portion of the reference speech;
obtaining a reference rhythm characteristic of the reference voice based on the corresponding relation, including:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to that sentence in the reference speech according to the time interval between the two speech blocks that correspond, in the reference speech, to every two adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference speech based on the obtained rhythm characteristics of the speech segments corresponding to the sentences of the predetermined text in the reference speech.
16. A method of speech processing, the method being adapted to be executed in a computer and comprising the steps of:
receiving, as reference speech, speech that a specific user has entered for a predetermined text, the predetermined text including one or more sentences, each sentence including one or more words; and
calculating rhythm information of the reference speech from the reference speech so as to send it to a predetermined server in association with the predetermined text, or calculating a reference rhythm characteristic of the reference speech from the rhythm information so as to send it to the predetermined server in association with the predetermined text; wherein
the step of calculating the reference rhythm characteristic of the reference speech comprises:
forcibly aligning the reference speech with the predetermined text using a predetermined acoustic model to determine a correspondence between each word in the predetermined text and a portion of the reference speech;
obtaining a reference rhythm characteristic of the reference voice based on the corresponding relation, including:
for each sentence of the predetermined text, obtaining the rhythm characteristic of the speech segment corresponding to that sentence in the reference speech according to the time interval between the two speech blocks that correspond, in the reference speech, to every two adjacent words in the sentence; and
forming the reference rhythm characteristic of the reference speech based on the obtained rhythm characteristics of the speech segments corresponding to the sentences of the predetermined text in the reference speech.
17. A mobile terminal comprising a rhythm-based speech quality evaluation apparatus according to any one of claims 1-6.
18. A rhythm-based speech quality evaluation system comprising a rhythm-based speech quality evaluation apparatus according to any one of claims 1-6 and a data processing apparatus according to claim 7.
19. A rhythm-based speech quality evaluation system comprising:
a rhythm-based speech quality evaluation apparatus according to any one of claims 1-6;
a server storing the predetermined text and the reference rhythm information and/or the reference rhythm characteristic; and
a speech processing apparatus according to claim 8.
CN201410734839.8A 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system Active CN104361895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410734839.8A CN104361895B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Publications (2)

Publication Number Publication Date
CN104361895A CN104361895A (en) 2015-02-18
CN104361895B true CN104361895B (en) 2018-12-18

Family

ID=52529151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410734839.8A Active CN104361895B (en) 2014-12-04 2014-12-04 Voice quality assessment equipment, method and system

Country Status (1)

Country Link
CN (1) CN104361895B (en)

Also Published As

Publication number Publication date
CN104361895A (en) 2015-02-18

Similar Documents

Publication Publication Date Title
CN109872727B (en) Voice quality evaluation device, method and system
US11900939B2 (en) Display apparatus and method for registration of user command
US10643036B2 (en) Language translation device and language translation method
US9548048B1 (en) On-the-fly speech learning and computer model generation using audio-visual synchronization
CN104361895B (en) Voice quality assessment equipment, method and system
CN104485115B (en) Pronounce valuator device, method and system
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
US9837068B2 (en) Sound sample verification for generating sound detection model
CN104361896B (en) Voice quality assessment equipment, method and system
JP6172417B1 (en) Language learning system and language learning program
US20180211650A1 (en) Automatic language identification for speech
US20210065582A1 (en) Method and System of Providing Speech Rehearsal Assistance
CN104505103B (en) Voice quality assessment equipment, method and system
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
US11437046B2 (en) Electronic apparatus, controlling method of electronic apparatus and computer readable medium
KR101854369B1 (en) Apparatus for providing speech recognition service, and speech recognition method for improving pronunciation error detection thereof
CN103594086B (en) Speech processing system, device and method
CN111723606A (en) Data processing method and device and data processing device
CN112151072B (en) Voice processing method, device and medium
CN112309389A (en) Information interaction method and device
KR20200056754A (en) Apparatus and method for generating personalization lip reading model
CN110890095A (en) Voice detection method, recommendation method, device, storage medium and electronic equipment
KR102622350B1 (en) Electronic apparatus and control method thereof
CN109102810B (en) Voiceprint recognition method and device
US12112742B2 (en) Electronic device for correcting speech input of user and operating method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant