
WO2005010868A1 - Speech recognition system, and terminal and server therefor - Google Patents

Speech recognition system, and terminal and server therefor

Info

Publication number
WO2005010868A1
WO2005010868A1, PCT/JP2003/009598, JP0309598W
Authority
WO
WIPO (PCT)
Prior art keywords
server
acoustic model
voice
acoustic
speech recognition
Prior art date
Application number
PCT/JP2003/009598
Other languages
English (en)
Japanese (ja)
Inventor
Tomohiro Narita
Takashi Sudou
Toshiyuki Hanazawa
Original Assignee
Mitsubishi Denki Kabushiki Kaisha
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Denki Kabushiki Kaisha filed Critical Mitsubishi Denki Kabushiki Kaisha
Priority to JP2005504586A priority Critical patent/JPWO2005010868A1/ja
Priority to PCT/JP2003/009598 priority patent/WO2005010868A1/fr
Publication of WO2005010868A1 publication Critical patent/WO2005010868A1/fr

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Definitions

  • The present invention relates to a speech recognition system and to a terminal and a server therefor, and in particular to the art of selecting an appropriate acoustic model, according to the conditions of use, from a plurality of acoustic models prepared for various use environments, and performing speech recognition with it. Background Art
  • Speech recognition is performed by extracting a time series of speech feature quantities from the input speech and comparing this time series of speech feature quantities with an acoustic model prepared in advance.
  • In one known method, the values output from various in-vehicle sensors such as a speed sensor are used (here, "sensor information" refers to the data obtained by A/D-converting the analog signal from a sensor):
  • a noise spectrum is calculated from the background noise,
  • this noise spectrum and the sensor information from the various in-vehicle sensors are stored in association with each other, and the next time speech recognition is performed,
  • the noise spectrum associated with the current sensor information is retrieved and removed from the time series of speech feature quantities.
  • However, this method has the problem that the accuracy of speech recognition cannot be improved under conditions in which the system has not been used before.
  • In another approach, some predetermined values are selected in advance from the output values of the various sensors, and an acoustic model is learned for each condition under which the sensors output those values. The sensor information obtained in the actual use environment can then be compared with the learning conditions of the acoustic models and an appropriate acoustic model selected.
  • The data size of one acoustic model varies depending on how the speech recognition system is designed and implemented, but may reach several hundred kilobytes.
  • In small devices, the size and weight of the housing severely limit the storage capacity of the storage device that can be mounted. It is therefore not realistic to adopt a configuration in which the terminal itself holds a plurality of acoustic models of such a large size.
  • The present invention has been made in order to solve the above-mentioned problem.
  • Its purpose is, by transmitting sensor information over a network from a speech recognition terminal to a speech recognition server that stores a plurality of acoustic models,
  • to select an acoustic model suited to the use environment of the terminal and thereby achieve high-accuracy speech recognition. Disclosure of the Invention
  • The speech recognition system according to the present invention is
  • a speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, each speech recognition terminal including:
  • client-side acoustic analysis means for calculating a speech feature quantity from the speech signal input from an input terminal, and client-side transmission means for transmitting sensor information to the speech recognition server,
  • client-side receiving means for receiving an acoustic model from said speech recognition server, and client-side matching means for matching said acoustic model against said speech feature quantity;
  • and the speech recognition server including: server-side receiving means for receiving the sensor information transmitted by said client-side transmission means,
  • server-side acoustic model storage means for storing a plurality of acoustic models,
  • server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model that matches said sensor information, and
  • server-side transmission means for transmitting the acoustic model selected by said server-side acoustic model selection means to said speech recognition terminal.
  • With this configuration, a plurality of acoustic models corresponding to various sound-collection environments are stored in the speech recognition server, whose storage capacity is not restricted, and based on the information from the sensor provided in each speech recognition terminal an acoustic model suited to the sound-collection environment of that terminal is selected and sent to it. As a result, even though the speech recognition terminal's own storage capacity is limited by constraints such as the size and weight of its housing, it can acquire an appropriate acoustic model and perform speech recognition with it, which improves the accuracy of speech recognition.
  • FIG. 1 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 1 of the present invention.
  • FIG. 3 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 2 of the present invention.
  • FIG. 4 is a flowchart illustrating a clustering process of an acoustic model according to Embodiment 2 of the present invention.
  • FIG. 5 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 2 of the present invention.
  • FIG. 6 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 3 of the present invention.
  • FIG. 7 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 3 of the present invention.
  • FIG. 8 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 4 of the present invention.
  • FIG. 9 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 4 of the present invention.
  • FIG. 10 is a configuration diagram of a data format of sensor information and voice data transmitted from the voice recognition terminal to the voice recognition server according to Embodiment 4 of the present invention.
  • FIG. 11 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 5 of the present invention.
  • FIG. 12 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 5 of the present invention.
  • FIG. 1 is a block diagram showing the configuration of a speech recognition terminal and a server according to Embodiment 1 of the present invention.
  • a microphone 1 is a device or component that collects voice
  • A speech recognition terminal 2 is a device that performs speech recognition on the speech collected by the microphone 1 and input via an input terminal 3, and outputs a recognition result 4.
  • The input terminal 3 is an audio terminal or a microphone connector.
  • the speech recognition terminal 2 is connected to a speech recognition server 6 via a network 5.
  • The network 5 is a network that carries digital information, such as the Internet, a LAN (Local Area Network), a public line network, a mobile telephone network, or a communication network using artificial satellites.
  • The network 5 only needs to be able to transmit and receive digital data between the devices connected to it; the format of the information carried on the network 5 is not restricted. It may therefore also be, for example, a bus designed to connect a plurality of devices, such as USB (Universal Serial Bus) or SCSI (Small Computer System Interface).
  • When the network 5 uses a data communication service of a mobile communication network,
  • the data to be transmitted and received is divided into units called packets, which are sent and received one by one.
  • Each packet carries added control information, such as position information indicating which part of the whole data it constitutes and error-correction information.
  • The speech recognition server 6 is a server computer connected to the speech recognition terminal 2 via the network 5.
  • The speech recognition server 6 has a storage device, such as a hard disk device or a memory, with a larger storage capacity than the speech recognition terminal 2, and stores the standard patterns required for speech recognition. A plurality of speech recognition terminals 2 may be connected to the speech recognition server 6 via the network 5.
  • The speech recognition terminal 2 includes a terminal-side acoustic analysis unit 11, a sensor 12, a terminal-side transmission unit 13, a terminal-side reception unit 14, a terminal-side acoustic model storage unit 15, a terminal-side acoustic model selection unit 16, and a terminal-side matching unit 17.
  • the terminal-side acoustic analysis unit 11 performs acoustic analysis based on the audio signal input from the input terminal 3 and calculates an audio feature amount.
  • The sensor 12 is a sensor for detecting environmental conditions, with a view to obtaining information on the noise superimposed on the audio signal obtained by the microphone 1, and is provided near the microphone 1.
  • Any element or device that detects or acquires a physical quantity of the environment, or a change in that physical quantity, may serve as the sensor 12.
  • The physical quantities referred to here include temperature, pressure, flow rate, light, magnetism, time, electromagnetic waves, and the like.
  • For example, a GPS antenna is a sensor for GPS signals. It is not always necessary to detect a physical quantity by acquiring a signal from the outside world: for example, a circuit that obtains the time at the place where the microphone 1 is located from a built-in clock is also included among the sensors referred to here.
  • When the sensor outputs an analog signal, the usual configuration is to sample the output analog signal into a digital signal by means of an A/D conversion circuit or element. The sensor 12 may therefore include such an A/D conversion circuit or element.
  • When the speech recognition terminal 2 is a terminal of a car navigation system, a plurality of sensors may be combined, such as a speed sensor, a sensor that monitors the engine rotation, a sensor that monitors the operation of the wipers, a sensor that monitors the open or closed state of the door glass, and a sensor that monitors the state of the car audio system.
  • The terminal-side transmission unit 13 is a unit that transmits the sensor information about the surroundings of the microphone 1, obtained by the sensor 12, to the speech recognition server 6.
  • the terminal-side receiving unit 14 is configured to receive information from the speech recognition server 6 and output the received information to the terminal-side acoustic model selecting unit 16.
  • The terminal-side transmission unit 13 and the terminal-side reception unit 14 are composed of circuits or elements that send signals to and receive signals from the network cable; a computer program used to control these circuits or elements may also be included in the terminal-side transmission unit 13 and the terminal-side reception unit 14.
  • When the network 5 is a wireless network, the terminal-side transmission unit 13 and the terminal-side reception unit 14 have antennas for transmitting and receiving radio waves.
  • The terminal-side transmission unit 13 and the terminal-side reception unit 14 may be configured as separate parts, or may be configured as the same network input/output device.
  • The terminal-side acoustic model storage unit 15 is a storage device or circuit for storing acoustic models.
  • A plurality of acoustic models can be provided according to the conditions under which they were learned, and only a subset of them is stored in the terminal-side acoustic model storage unit 15.
  • each acoustic model is associated with sensor information indicating an environmental condition in which the acoustic model has been learned, and an acoustic model suitable for the environmental condition can be specified from the numerical value of the sensor information.
  • Even when the speech recognition terminal 2 is a device with limited resources, since the speech recognition server 6 also stores acoustic models, the storage capacity of the storage device that must be mounted on the speech recognition terminal 2 can be kept extremely small.
  • The terminal-side acoustic model selection unit 16 is a unit that calculates the likelihood between an acoustic model (one acquired by the terminal-side reception unit 14 or one stored in the terminal-side acoustic model storage unit 15) and the speech feature quantity output by the terminal-side acoustic analysis unit 11.
  • the terminal-side matching unit 17 is a unit that selects a vocabulary based on the likelihood calculated by the terminal-side acoustic model selecting unit 16 and outputs it as a recognition result 4.
  • The terminal-side acoustic analysis unit 11, the terminal-side transmission unit 13, the terminal-side reception unit 14, the terminal-side acoustic model storage unit 15, the terminal-side acoustic model selection unit 16, and the terminal-side matching unit 17 may each be configured as dedicated circuits. Alternatively, using a central processing unit (CPU), a network I/O device (such as a network adapter), and a storage device, they may be implemented as a computer program that performs processing equivalent to each of these functions.
  • The speech recognition server 6 includes a server-side receiving unit 21, a server-side acoustic model storage unit 22, a server-side acoustic model selection unit 23, and a server-side transmitting unit 24.
  • the server-side receiving unit 21 is a unit that receives the sensor information transmitted from the terminal-side transmitting unit 13 of the voice recognition terminal 2 via the network 5.
  • the server-side acoustic model storage unit 22 is a storage device for storing a plurality of acoustic models.
  • This server-side acoustic model storage unit 22 is configured as a large-capacity storage device using, for example, a hard disk drive or a combination of a CD-ROM medium and a CD-ROM drive.
  • The server-side acoustic model storage unit 22 stores all of the acoustic models that may be used by this speech recognition system, and has a correspondingly large storage capacity.
  • The server-side acoustic model selection unit 23 is a unit for selecting, from among the acoustic models stored in the server-side acoustic model storage unit 22, an acoustic model suited to the sensor information received by the server-side receiving unit 21.
  • the server-side transmitting unit 24 is a unit that transmits the acoustic model selected by the server-side acoustic model selecting unit 23 to the speech recognition terminal 2 via the network 5.
  • The server-side receiving unit 21, the server-side acoustic model storage unit 22, the server-side acoustic model selection unit 23, and the server-side transmitting unit 24 may each be configured as dedicated circuits, or, using a central processing unit (CPU), a network I/O device (such as a network adapter), and a storage device, they may be implemented as a computer program that executes equivalent processing.
  • FIG. 2 is a flowchart illustrating processing performed by the voice recognition terminal 2 and the voice recognition server 6 according to the first embodiment.
  • a voice signal is input to the terminal-side acoustic analysis unit 11 via the input terminal 3.
  • The terminal-side acoustic analysis unit 11 converts the audio signal into a digital signal using an A/D converter, and calculates a time series of speech feature quantities such as the LPC cepstrum (Linear Predictive Coding cepstrum) (step S102).
  • the sensor 12 acquires a physical quantity around the microphone 1 (step S103).
  • For example, when the speech recognition terminal 2 is a car navigation system
  • and the sensor 12 is a speed sensor that detects the speed of the vehicle (car) in which the car navigation system is mounted, the vehicle speed corresponds to such a physical quantity.
  • In the flowchart, the acquisition of sensor information in step S103 is shown as being performed after the acoustic analysis in step S102.
  • However, the processing of step S103 may be performed before the processing of steps S101 to S102, or simultaneously or in parallel with it; the order does not matter.
  • Next, the terminal-side acoustic model selection unit 16 selects the acoustic model learned under the condition closest to the sensor information obtained by the sensor 12, that is, closest to the sound-collection environment of the microphone 1.
  • Many learning conditions can be assumed for the acoustic models, and the terminal-side acoustic model storage unit 15 does not necessarily store acoustic models for all of them. Therefore, if none of the acoustic models stored in the terminal-side acoustic model storage unit 15 was learned under environmental conditions close to the sound-collection conditions of the microphone 1, an acoustic model is obtained from the speech recognition server 6.
  • Let the sensor information of sensor k under the environmental conditions in which acoustic model m was learned be denoted S_{m,k},
  • and let the current sensor information of sensor k be denoted x_k.
  • In step S104, the terminal-side acoustic model selection unit 16 calculates a distance value D(m) between the sensor information S_{m,k} of each acoustic model m and the sensor information x_k obtained by the sensor 12.
  • Let the distance value between the sensor information x_k of a given sensor k and the sensor information S_{m,k} of acoustic model m be D_k(x_k, S_{m,k}).
  • For D_k(x_k, S_{m,k}), the absolute value of the difference between the two values may be adopted,
  • that is, D_k(x_k, S_{m,k}) = |x_k − S_{m,k}|.
  • The distance value D(m) is obtained by combining the per-sensor distance values D_k(x_k, S_{m,k}) with weighting factors w_k, as in equation (1): D(m) = Σ_k w_k · D_k(x_k, S_{m,k}). (1)
  • Here, the relationship between the sensor information as a physical quantity and the distance value D(m) is described. If the sensor information is, for example, a position (which may be expressed by longitude and latitude, or by the distance from a specific place taken as the origin), different pieces of sensor information have different dimensions as physical quantities. However, by adjusting the weighting factors w_k, the contribution of each w_k · D_k(x_k, S_{m,k}) to the distance value can be set appropriately, so this poses no problem. The same applies even if the unit system differs: for example, if km/h is used as the unit of speed in one case and mph in another, different numerical values appear as sensor information even though the speed is physically the same, but this too can be absorbed by the weighting factors.
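  • As an illustration only, the weighted per-sensor distance of equation (1) and the selection of the closest acoustic model might be sketched as below; the model list, sensor names, and weight values are hypothetical and not taken from the original disclosure.

```python
# Hypothetical sketch of the weighted per-sensor distance of equation (1) and the
# selection of the closest acoustic model; the models, sensor names and weights
# below are illustrative assumptions, not values from the patent.

def distance(model_sensors, current_sensors, weights):
    """D(m) = sum over k of w_k * |x_k - S_{m,k}|, using the absolute difference as D_k."""
    return sum(weights[k] * abs(current_sensors[k] - model_sensors[k])
               for k in current_sensors)

def select_model(models, current_sensors, weights):
    """Return the stored acoustic model with the smallest distance value D(m), and that value."""
    best = min(models, key=lambda m: distance(m["sensors"], current_sensors, weights))
    return best, distance(best["sensors"], current_sensors, weights)

# Two locally stored acoustic models learned under different conditions (hypothetical).
models = [
    {"name": "idle_quiet", "sensors": {"speed_kmh": 0, "noise_db": 45}},
    {"name": "highway",    "sensors": {"speed_kmh": 100, "noise_db": 70}},
]
weights = {"speed_kmh": 0.01, "noise_db": 0.05}
current = {"speed_kmh": 60, "noise_db": 62}   # x_k: the sensor information just measured

best_model, d_min = select_model(models, current, weights)   # min{D(m)} over the stored models
```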
  • Next, the terminal-side acoustic model selection unit 16 obtains the minimum value min{D(m)} of the distance values D(m) over all m, calculated by equation (1), and evaluates whether this min{D(m)} is smaller than a predetermined value T (step S105). In other words, it determines whether, among the learning conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15, there is one sufficiently close to the current environmental conditions under which the microphone 1 is picking up sound.
  • the predetermined value T is a value set in advance to determine whether or not such a condition is satisfied.
  • If min{D(m)} is smaller than the predetermined value T (step S105: Yes), the process proceeds to step S106.
  • If min{D(m)} is equal to or larger than the predetermined value T (step S105: No), the process proceeds to step S107. In this case, none of the learning conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15 is sufficiently close to the current environmental conditions under which the microphone 1 is collecting sound. The terminal-side transmission unit 13 therefore transmits the sensor information to the speech recognition server 6 (step S107).
  • As the value of T is made larger, min{D(m)} is judged to be smaller than T more often and step S107 is executed less often. That is, increasing T reduces the number of transmissions and receptions over the network 5 and so suppresses the traffic on the network 5.
  • Conversely, as T is made smaller, speech recognition is performed with an acoustic model whose learning conditions lie at a smaller distance from the sensor information obtained by the sensor 12, so the accuracy of speech recognition can be improved. The value of T should therefore be determined in consideration of both the transmission load on the network 5 and the target speech recognition accuracy.
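  • Continuing the sketch above, the threshold decision of steps S105 to S107 might look as follows; the value of T and the helper functions are hypothetical names introduced here, not parts of the original disclosure.

```python
# Hypothetical continuation of the previous sketch: use a locally stored acoustic model
# when one was learned under conditions close enough to the current environment
# (step S106), otherwise obtain a model from the speech recognition server (steps S107, S111).

def load_local_model(name):
    # Stub: in a real terminal this would read the model from the terminal-side storage unit 15.
    return {"source": "terminal", "name": name}

def request_model_from_server(sensor_info):
    # Stub: in a real system this would send the sensor information over the network (step S107)
    # and receive the acoustic model selected by the server (step S111).
    return {"source": "server", "sensors": sensor_info}

T = 1.0  # assumed threshold; a larger T means fewer transfers over the network but looser matching

if d_min < T:
    acoustic_model = load_local_model(best_model["name"])    # step S106: use the local model
else:
    acoustic_model = request_model_from_server(current)      # steps S107/S111: fetch from the server
```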
  • The server-side receiving unit 21 of the speech recognition server 6 receives the sensor information via the network 5 (step S108).
  • the server-side acoustic model selection unit 23 calculates a distance value between the environmental condition in which the acoustic model stored in the server-side acoustic model storage unit 22 is learned and the sensor information received by the server-side reception unit 21. The calculation is performed in the same manner as in step S104, and the acoustic model with the smallest distance value is selected (step S109). Subsequently, the server-side transmitting unit 24 transmits the acoustic model selected by the server-side acoustic model selecting unit 23 to the speech recognition terminal 2 (Step S110).
  • the terminal-side receiving unit 14 of the voice recognition terminal 2 receives the acoustic model transmitted by the server-side transmitting unit 24 via the network 5 (Step S111).
  • The terminal-side matching unit 17 then performs the matching process between the speech feature quantity output by the terminal-side acoustic analysis unit 11 and the acoustic model (step S112).
  • The vocabulary item whose standard pattern, stored as the acoustic model, gives the highest likelihood against the time series of speech feature quantities is set as the recognition result 4.
  • Alternatively, pattern matching by DP (Dynamic Programming) matching may be performed, and the candidate with the smallest distance value set as the recognition result 4.
  • As described above, with the speech recognition terminal 2 and the server 6 according to the first embodiment, even when only a small number of acoustic models can be stored in the speech recognition terminal 2, the sound-collection environment of the microphone 1 can be detected by the sensor 12,
  • and speech recognition can be performed with an acoustic model, selected from among the many acoustic models stored in the speech recognition server 6, that was learned under environmental conditions close to this sound-collection environment.
  • The data size of one acoustic model can reach several hundred kilobytes, depending on how the system is implemented, so the effect of reducing the number of acoustic models that the speech recognition terminal needs to store is significant.
  • The sensor information can in principle take continuous values. Usually, however, several representative values are selected from the possible input values, and an acoustic model is learned with each of these values as its sensor information.
  • Suppose the sensor 12 is composed of two types of sensors (a first sensor and a second sensor). If the number of values selected as sensor information for the first sensor is N1 and the number of values selected as sensor information for the second sensor is N2, the total number of acoustic models stored by the speech recognition terminal 2 and the speech recognition server 6 is N1 × N2.
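  • For example, if five values were selected as sensor information for the first sensor and four for the second, 5 × 4 = 20 acoustic models would be held in total (an illustrative figure, not one given in the original).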
  • That is, when the number of values selected as sensor information for the first sensor is greater than the number selected for the second sensor,
  • the weighting factor for the first sensor's sensor information is made smaller than that for the second sensor's sensor information; in this way an acoustic model matching the sound-collection environment of the microphone 1 can still be selected.
  • In the configuration described above, the speech recognition terminal 2 includes the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16, so that speech recognition is performed
  • by appropriately selecting between the acoustic models stored in the speech recognition terminal 2 and those stored in the speech recognition server 6.
  • However, it is not essential that the speech recognition terminal 2 include the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16. That is, it goes without saying that a configuration is also possible in which an acoustic model stored in the speech recognition server 6 is always obtained, unconditionally, based on the sensor information acquired by the sensor 12.
  • It is also possible to store the acoustic model received from the speech recognition server 6 anew in the terminal-side acoustic model storage unit 15, or to store it in place of one of the acoustic models already held on the speech recognition terminal 2 side. In this way, the next time speech recognition is performed with the same acoustic model, there is no need to transfer the acoustic model from the speech recognition server 6 again, so the transmission load on the network 5 can be reduced and the time required for transmission and reception can be shortened. Embodiment 2
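  • As a minimal illustration of this caching behaviour, the terminal-side store might replace one of its existing acoustic models when a newly received one does not fit; the sketch below assumes a fixed capacity and least-recently-used replacement, neither of which is specified in the original.

```python
from collections import OrderedDict

class TerminalModelCache:
    """Hypothetical terminal-side acoustic model store with least-recently-used replacement."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.models = OrderedDict()   # key: sensor-condition identifier -> acoustic model data

    def get(self, key):
        if key not in self.models:
            return None               # not stored locally: must be requested from the server
        self.models.move_to_end(key)  # mark as most recently used
        return self.models[key]

    def put(self, key, model):
        # Store a model received from the speech recognition server, evicting the least
        # recently used model if the terminal's limited storage is already full.
        self.models[key] = model
        self.models.move_to_end(key)
        if len(self.models) > self.capacity:
            self.models.popitem(last=False)
```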
  • In the first embodiment, an acoustic model suited to the sensor information is obtained from the speech recognition server.
  • However, transferring the acoustic model from the speech recognition server must not place a heavy load on the network, nor must the time required to transfer the acoustic model data be allowed to degrade the overall processing performance.
  • One way to avoid such problems is to design the speech recognition algorithm so that the data size of the acoustic model is as small as possible. If the size of the acoustic model is small, transferring it from the speech recognition server to the speech recognition terminal does not add much load to the network.
  • Alternatively, a method is also conceivable in which acoustic models that are similar to one another are clustered in advance, the differences between the acoustic models within the same cluster are determined in advance and stored in the speech recognition server,
  • only the difference from an acoustic model stored in the speech recognition terminal is transmitted,
  • and the speech recognition terminal synthesizes the acoustic model of the speech recognition server from the acoustic model it stores and the received difference.
  • The speech recognition terminal and the server according to the second embodiment operate based on such a principle.
  • FIG. 3 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the second embodiment.
  • The acoustic model conversion unit 18 is a unit that synthesizes the acoustic model stored in the speech recognition server 6 from the contents received by the terminal-side receiving unit 14 and an acoustic model stored in the terminal-side acoustic model storage unit 15.
  • The acoustic model difference calculation unit 25 is a unit that calculates the difference between an acoustic model stored in the terminal-side acoustic model storage unit 15 and an acoustic model stored in the server-side acoustic model storage unit 22.
  • The other parts are the same as in the first embodiment, and their description is omitted.
  • The speech recognition terminal 2 and the server 6 assume that the acoustic models have been clustered in advance, so the clustering method for the acoustic models is described first. Note that the clustering of the acoustic models is completed before the speech recognition processing is performed by the speech recognition terminal 2 and the server 6.
  • In the following, each acoustic model is represented by the statistics (means and variances) of the distributions of its phonemes.
  • FIG. 4 is a flowchart showing the clustering process of the acoustic model.
  • an initial cluster is created (step S201).
  • One initial cluster is created from all the acoustic models that may be used by this speech recognition system. Equations (2) and (3) are used to calculate the statistics of the initial cluster r.
  • Here, N represents the number of distributions belonging to the cluster,
  • and K represents the number of dimensions of the speech feature quantity.
  • step S202 it is determined whether or not the required number of clusters has already been obtained by the clustering process executed so far.
  • the required number of clusters is determined when designing the speech recognition processing system. Generally speaking, the greater the number of clusters, the smaller the distance between acoustic models in the same cluster. As a result, the amount of information of the difference data is reduced, and the amount of difference data transmitted and received via the network 5 can be suppressed.
  • In particular, when the number of acoustic models stored in the speech recognition terminal 2 and the server 6 is large, the number of clusters should be increased.
  • The aim is to synthesize an acoustic model stored in the speech recognition server 6, or an acoustic model equivalent to it, by combining an acoustic model stored in the speech recognition terminal 2 (hereinafter referred to as the "local acoustic model") with a difference.
  • The difference used here is combined with the local acoustic model, and must therefore be a difference determined between this local acoustic model and an acoustic model belonging to the same cluster. Since the acoustic model synthesized from the difference is the one corresponding to the sensor information, the most efficient situation is that the acoustic model corresponding to the sensor information and the local acoustic model are classified into the same cluster.
  • Next, the cluster with the largest VQ distortion is divided (step S203).
  • The cluster r_max with the largest VQ distortion (the initial cluster in the first loop) is divided into two clusters, r1 and r2, thereby increasing the number of clusters by one.
  • The statistics of the two clusters after the division are calculated from those of the original cluster using ε, a small value predetermined for each dimension of the speech feature quantity (for example, by offsetting the representative value by +ε and −ε, respectively).
  • In step S204, the distance between the statistics of each acoustic model and the statistics of each cluster (including the clusters created by the division in step S203) is calculated.
  • The distance is calculated for every pair consisting of one acoustic model and one of the clusters obtained so far; however, the distance is not calculated again for combinations of an acoustic model and a cluster for which it has already been calculated.
  • The Bhattacharyya distance defined by equation (8) is used as the distance value between the statistics of an acoustic model and the statistics of a cluster.
  • In equation (8), the parameters with suffix 1 are the statistics of the acoustic model, and the parameters with suffix 2 are the statistics of the cluster.
  • Each acoustic model is then assigned to the cluster with the smallest distance value.
  • The distance value between the acoustic model statistics and the cluster statistics may also be calculated by a method other than equation (8). Even in such a case, it is desirable to adopt a measure under which acoustic models whose distance values calculated by equation (1) are close to each other tend to belong to the same cluster; however, this is not essential.
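  • Equation (8) itself is not reproduced in this text; for two Gaussian distributions with means μ1, μ2 and covariance matrices Σ1, Σ2, the standard Bhattacharyya distance, which the passage appears to refer to, has the following form (stated here as an assumption about which definition is meant):

$$
D_B = \frac{1}{8}(\mu_1-\mu_2)^{\mathsf T}\left(\frac{\Sigma_1+\Sigma_2}{2}\right)^{-1}(\mu_1-\mu_2)
 + \frac{1}{2}\ln\frac{\left|\tfrac{1}{2}(\Sigma_1+\Sigma_2)\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}
$$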
  • Next, the codebook of each cluster is updated (step S205).
  • That is, the representative value of each cluster is recalculated from the acoustic models belonging to it using equations (2) and (3).
  • The distances between the statistics of the acoustic models belonging to the cluster and this representative value are then accumulated using equation (8), and the result is defined as the VQ distortion of that cluster.
  • In step S206, an evaluation value of the clustering is calculated.
  • The VQ distortions of all the clusters are summed, and this sum is taken as the evaluation value of the clustering.
  • Steps S204 to S207 constitute a loop that is executed a plurality of times.
  • The evaluation value calculated in step S206 is retained until the next execution of the loop.
  • In step S207, the difference between the current evaluation value and the evaluation value calculated in the previous execution of the loop is obtained, and it is determined whether this difference is less than a predetermined threshold value. If the difference is less than the threshold, each acoustic model is regarded as belonging to an appropriate one of the clusters obtained so far, and the process returns to step S202 (step S207: Yes).
  • Otherwise, the process returns to step S204 and the loop is repeated (step S207: No).
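  • A compressed, hypothetical sketch of the splitting-and-reassignment loop of FIG. 4 (steps S201 to S207) follows, using diagonal-covariance Gaussian statistics; the exact forms of equations (2), (3) and (8), the splitting offset, and the stopping constants are assumptions, and all names are illustrative.

```python
import numpy as np

def bhattacharyya(mu1, var1, mu2, var2):
    # Bhattacharyya distance between two diagonal-covariance Gaussians
    # (assumed here to be the form of equation (8)).
    var = (var1 + var2) / 2.0
    return (0.125 * np.sum((mu1 - mu2) ** 2 / var)
            + 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2))))

def centroid(members):
    # Representative mean and variance of a cluster computed from its member Gaussians
    # (assumed here to be the role of equations (2) and (3)).
    mus = np.array([mu for mu, _ in members])
    vrs = np.array([var for _, var in members])
    mu_c = mus.mean(axis=0)
    var_c = (vrs + mus ** 2).mean(axis=0) - mu_c ** 2
    return mu_c, var_c

def assign(models, clusters):
    # Step S204: attach each acoustic-model distribution to the nearest cluster.
    return [min(range(len(clusters)),
                key=lambda c: bhattacharyya(mu, var, *clusters[c]))
            for mu, var in models]

def cluster_models(models, n_clusters, eps=1e-2, tol=1e-4):
    clusters = [centroid(models)]                              # step S201: one initial cluster
    while len(clusters) < n_clusters:                          # step S202
        labels = assign(models, clusters)
        distortion = [sum(bhattacharyya(mu, var, *clusters[c])
                          for (mu, var), a in zip(models, labels) if a == c)
                      for c in range(len(clusters))]
        mu, var = clusters.pop(int(np.argmax(distortion)))     # step S203: split the worst cluster
        clusters += [(mu + eps, var), (mu - eps, var)]
        prev_total = None
        for _ in range(100):                                   # steps S204 to S207
            labels = assign(models, clusters)
            clusters = [centroid([m for m, a in zip(models, labels) if a == c])
                        if c in labels else clusters[c]
                        for c in range(len(clusters))]
            total = sum(bhattacharyya(mu, var, *clusters[a])   # step S206: total VQ distortion
                        for (mu, var), a in zip(models, labels))
            if prev_total is not None and abs(prev_total - total) < tol:
                break                                          # step S207: change small enough
            prev_total = total
    return clusters, assign(models, clusters)

# Example usage with synthetic diagonal-Gaussian acoustic-model statistics.
rng = np.random.default_rng(0)
models = [(rng.normal(size=4), np.ones(4) + rng.uniform(0, 0.1, size=4)) for _ in range(12)]
clusters, labels = cluster_models(models, n_clusters=3)
```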
  • FIG. 5 is a flowchart of the operation of the voice recognition device 2 and the server 6.
  • In steps S101 to S105, as in the first embodiment, speech is input from the microphone 1, acoustic analysis and acquisition of sensor information are performed, and it is determined whether an acoustic model suited to the sensor information is stored on the terminal side.
  • If no suitable acoustic model is stored, the process proceeds to step S208 (step S105: No).
  • The terminal-side transmission unit 13 transmits to the speech recognition server 6 the sensor information together with information identifying the local acoustic model m (step S208).
  • The server-side receiving unit 21 receives the sensor information and the identifier of the local acoustic model m (step S209), and the server-side acoustic model selection unit 23 selects the acoustic model best suited to the received sensor information (step S109). It is then determined whether this acoustic model and the local acoustic model m belong to the same cluster (step S210). If they belong to the same cluster, the process proceeds to step S211 (step S210: Yes): the acoustic model difference calculation unit 25 calculates the difference between the selected acoustic model and the local acoustic model m (step S211), and the server-side transmitting unit 24 transmits this difference to the speech recognition terminal 2 (step S212).
  • The difference may be calculated, for example, from the differences between the values of the components of the speech feature statistics and from their offsets (the differences between the storage positions of the respective elements). Techniques are also known for finding a difference between two pieces of binary data (such as between binary files), and such a technique may be used. Furthermore, since the second embodiment places no special requirements on the data structure of the acoustic model, a data structure designed so that the difference can easily be obtained is also conceivable.
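  • A minimal illustration of the difference transfer of steps S211 to S214 is sketched below, treating each acoustic model as a flat parameter vector; real acoustic models have richer structure, and the serialization is an assumption made only for the example.

```python
import numpy as np

# A model in this sketch is simply a flat vector of its parameter values.
rng = np.random.default_rng(0)
local_model = rng.normal(size=1000)                              # hypothetical local acoustic model m
server_model = local_model + rng.normal(scale=0.05, size=1000)   # a nearby model in the same cluster

def model_difference(server, local):
    # Step S211 (server side): component-wise difference between two acoustic models of the
    # same cluster; this is what is transmitted instead of the full model.
    return server - local

def synthesize_model(local, difference):
    # Step S214 (terminal side): reconstruct the server's acoustic model from the locally
    # stored model and the received difference.
    return local + difference

diff = model_difference(server_model, local_model)
restored = synthesize_model(local_model, diff)
assert np.allclose(restored, server_model)   # the terminal recovers the server's model exactly
```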
  • If, in step S210, they do not belong to the same cluster, the process goes directly to step S212 and the acoustic model itself is transmitted (step S210: No).
  • On the speech recognition terminal 2 side, the local acoustic model judged most suitable for the sensor information (in step S105, the acoustic model judged to have the smallest distance value to the sensor information) is used as the basis for generating the difference. For this reason, information about this local acoustic model m was transmitted in advance in step S208.
  • Alternatively, the speech recognition server 6 may keep track of (manage) which local acoustic models the speech recognition terminal 2 stores; after selecting the acoustic model closest to the sensor information, the server itself may then choose, from among the managed local acoustic models, the one against which to calculate the difference.
  • In this case, it is necessary to notify the speech recognition terminal 2 of which local acoustic model the difference calculated by the speech recognition server 6 is based on, so the speech recognition server 6 also sends information identifying the acoustic model used as the basis of the difference calculation.
  • The terminal-side receiving unit 14 of the speech recognition terminal 2 receives the difference data or the acoustic model (step S213). If the received data is a difference, the acoustic model conversion unit 18 synthesizes an acoustic model from the local acoustic model m on which the difference is based and the difference itself (step S214). The terminal-side matching unit 17 then performs pattern matching between the standard patterns of the acoustic model and the speech feature quantity, and outputs the vocabulary item with the highest likelihood as the recognition result 4.
  • In the first and second embodiments, even when the speech recognition terminal does not store the acoustic model required for speech recognition, the acoustic model stored in the speech recognition server 6 is received via the network 5 so that speech recognition can be performed in accordance with the sound-collection environment of the microphone 1. Instead of transmitting and receiving the acoustic model, however, the speech feature quantity may be transmitted and received.
  • the speech recognition terminal and server according to the third embodiment operate based on such a principle.
  • FIG. 6 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the third embodiment.
  • the parts denoted by the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and thus the description is omitted.
  • The speech recognition terminal 2 and the speech recognition server 6 are connected via the network 5.
  • The speech feature quantity and the sensor information are transmitted from the speech recognition terminal 2 to the speech recognition server 6, and the recognition result 7 is output from the speech recognition server 6.
  • The server-side matching unit 27 is a unit that performs matching between the speech feature quantity and the acoustic model, similarly to the terminal-side matching unit 17 of the first embodiment.
  • FIG. 7 is a flowchart showing the processing between the speech recognition terminal 2 and the speech recognition server 6 according to the third embodiment.
  • The processes denoted by the same reference numerals as in FIG. 2 are the same as in the first embodiment. In the following, the description therefore focuses on the processing steps with reference numerals unique to this flowchart.
  • A speech signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101), the terminal-side acoustic analysis unit 11 analyzes the speech signal
  • and calculates the time series of speech feature quantities (step S102), and sensor information is acquired by the sensor 12 (step S103).
  • The sensor information and the speech feature quantity are then transferred to the speech recognition server 6 via the network 5 by the terminal-side transmission unit 13 (step S301), and the server-side receiving unit 21 receives them,
  • taking the sensor information and the speech feature quantity into the speech recognition server 6 (step S302).
  • The server-side acoustic model storage unit 22 of the speech recognition server 6 stores in advance acoustic models corresponding to a plurality of pieces of sensor information. The server-side acoustic model selection unit 23
  • calculates, using equation (1), the distance value between the sensor information acquired by the server-side receiving unit 21 and the sensor information of each acoustic model, and selects the acoustic model with the smallest distance value (step S109).
  • Matching is then performed in step S303. This process is the same as the matching process (step S112) in the first embodiment, and a detailed description is therefore omitted.
  • As described above, with the speech recognition terminal 2 and the server 6 according to the third embodiment, only the calculation of the speech feature quantity and the acquisition of the sensor information are performed by the speech recognition terminal 2; based on the sensor information, an appropriate acoustic model is selected from the acoustic models stored in the speech recognition server 6, and the speech is recognized there. This eliminates the need for a memory or circuit for storing acoustic models in the speech recognition terminal 2, and the configuration of the speech recognition terminal 2 can be simplified.
  • voice recognition can be performed without imposing a transmission load on the network 5.
  • As noted above, the data size of an acoustic model can reach several hundred kilobytes. If the bandwidth of the network is limited, the transmission capacity may therefore reach its limit when the acoustic model itself is transmitted. The speech feature quantity, on the other hand, can be transferred in real time as long as a bandwidth of roughly 20 kbps can be maintained. A client-server speech recognition system with an extremely low network load can therefore be constructed, and highly accurate speech recognition can be performed in accordance with the sound-collection environment of the microphone 1.
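  • As a rough, assumed illustration of this bandwidth figure: a 13-dimensional LPC-cepstrum vector produced every 10 ms and coded at 2 bytes per coefficient amounts to 13 × 100 × 2 × 8 ≈ 21 kbit/s, i.e. on the order of 20 kbps, whereas a single acoustic model of several hundred kilobytes corresponds to several megabits that would have to be transferred before recognition could even start (the feature dimension, frame rate and coding precision here are assumptions, not figures from the original).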
  • the third embodiment has a configuration in which the recognition result 7 is output from the voice recognition server 6 instead of being output from the voice recognition terminal 2.
  • For example, when the speech recognition terminal 2 is used for browsing the Internet, the user may utter a URL (Uniform Resource Locator), the speech recognition server 6 may retrieve the Web page identified by the recognized URL,
  • and the page may then be transmitted to the speech recognition terminal 2; such a configuration is sufficient.
  • Alternatively, the speech recognition terminal 2 may output the recognition result.
  • In that case, the speech recognition terminal 2 is provided with a terminal-side receiving unit and the speech recognition server 6 with a server-side transmitting unit,
  • and the system may be configured so that the output of the matching unit 27 is transmitted from the transmitting unit of the speech recognition server 6 over the network 5 to the receiving unit of the speech recognition terminal 2, which then outputs it to the desired output destination.
  • Furthermore, instead of the speech feature quantity, a method of transmitting and receiving the audio data itself may also be considered.
  • the voice recognition terminal and the server according to the fourth embodiment operate based on such a principle.
  • FIG. 8 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the fourth embodiment.
  • the parts denoted by the same symbols as those in FIG. 1 are the same as those in the first embodiment, and therefore the description is omitted.
  • The speech recognition terminal 2 and the speech recognition server 6 are connected via the network 5.
  • voice data and sensor information are transmitted from the voice recognition terminal 2 to the voice recognition server 6, and the recognition result 7 is output from the voice recognition server 6.
  • this is different from the first embodiment.
  • The audio digital processing unit 19 is a unit that converts the audio input from the input terminal 3 into digital data, and includes an A/D conversion device or circuit.
  • The server-side acoustic analysis unit 28 is a unit that calculates the speech feature quantity from the input speech on the speech recognition server 6 side, and has the same function as the terminal-side acoustic analysis unit 11 in the first and second embodiments.
  • FIG. 9 is a flowchart illustrating the processing performed by the speech recognition terminal 2 and the speech recognition server 6 according to the fourth embodiment.
  • The processes denoted by the same reference numerals as in FIG. 2 are the same as in the first embodiment. In the following, the description therefore focuses on the processing steps with reference numerals unique to this flowchart.
  • When the user speaks into the microphone 1, a speech signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101),
  • and the audio digital processing unit 19 samples the speech signal input in step S101 into digital data by A/D conversion (step S401).
  • Voice compression methods include the 64 kbps µ-law PCM (Pulse Code Modulation, ITU-T G.711) used in the public telephone network (ISDN and the like), the adaptive differential PCM used in PHS (Adaptive Differential PCM, ADPCM, ITU-T G.726), the VSELP method (Vector Sum Excited Linear Prediction) used in mobile phones,
  • and CELP (Code Excited Linear Prediction).
  • One of these methods should be selected according to the available bandwidth and traffic of the communication network:
  • µ-law PCM where ample bandwidth is available, ADPCM for roughly 16 to 40 kbps, VSELP for around 11.2 kbps,
  • and CELP where the bandwidth is still more limited.
  • the characteristics of the present invention are not lost even if other encoding methods are applied.
  • The sensor information is acquired by the sensor 12 (step S103), and the sensor information and the voice data are combined and arranged into, for example, a format as shown in FIG. 10. The data is then transferred to the speech recognition server 6 via the network 5 by the terminal-side transmission unit 13 (step S402).
  • a frame number indicating the processing time of the audio data is stored in the area 701.
  • This frame number is uniquely determined based on, for example, the sampling time of the audio data.
  • Here, "uniquely determined" includes the case of being determined based on a relative time agreed between the speech recognition terminal 2 and the speech recognition server 6, and means that different relative times are given different frame numbers.
  • Alternatively, a specific time may be supplied from a clock external to both the speech recognition terminal 2 and the speech recognition server 6, and the frame number may be uniquely determined based on this time.
  • the data size occupied by the sensor information is stored in the data format area 702 of FIG.
  • For example, if the sensor information is a 32-bit value,
  • the size of the area required to store the sensor information (4 bytes) is expressed in bytes, so 4 is stored.
  • When the sensor 12 is composed of a plurality of sensors, the data size of the array area needed to store the sensor information of all of them is stored.
  • an area 703 is an area in which the sensor information acquired by the sensor 12 in step S103 is stored.
  • When the sensor 12 is composed of several sensors,
  • an array of their sensor information is stored in the area 703.
  • The data size of the area 703 matches the data size held in the area 702.
  • The size of the audio data is stored in the area 704.
  • When the terminal-side transmission unit 13 divides the voice data into a plurality of packets (each assumed to have the same structure as the data format shown in FIG. 10) and transmits them,
  • what is stored in the area 704 is the data size of the audio data included in each packet. The case where the data is divided into a plurality of packets is described later. Finally, the audio data itself is stored in the area 705.
  • the terminal-side transmission unit 13 divides the voice data input via the input terminal 3 into a plurality of packets.
  • In this case, the frame number stored in the area 701 is information indicating the processing time of the audio data, and is determined based on the sampling time of the audio data included in each packet.
  • The data size of the audio data included in each packet is stored in the area 704.
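  • A hypothetical packing of the layout of FIG. 10 (areas 701 to 705) is sketched below; the field widths (32-bit frame number and size fields), the byte order, and the example payload are assumptions, since the original does not specify them.

```python
import struct

def pack_packet(frame_number, sensor_values, voice_data):
    # Area 701: frame number derived from the sampling time of the voice data.
    # Area 702: size in bytes of the sensor information that follows.
    # Area 703: the sensor information itself (here one 32-bit integer per sensor).
    # Area 704: size in bytes of the voice data carried in this packet.
    # Area 705: the (compressed) voice data.
    sensor_bytes = struct.pack(f"!{len(sensor_values)}i", *sensor_values)
    header = struct.pack("!II", frame_number, len(sensor_bytes))
    voice_header = struct.pack("!I", len(voice_data))
    return header + sensor_bytes + voice_header + voice_data

packet = pack_packet(frame_number=42,
                     sensor_values=[60, 62],      # e.g. vehicle speed and background-noise level
                     voice_data=b"\x00" * 160)    # 20 ms of 8 kHz u-law PCM as an example payload
```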
  • When the outputs of the sensors constituting the sensor 12 change from moment to moment over a short time, the sensor information stored in the area 703 also differs between packets.
  • For example, suppose the speech recognition terminal 2 is an in-vehicle speech recognition device and the sensor 12 measures the loudness of the background noise (with, for instance, a microphone different from the microphone 1).
  • In such a case, the loudness of the background noise
  • may vary significantly even during a single utterance.
  • When the sensor information changes significantly during the utterance, it is therefore desirable, regardless of the characteristics of the network 5, for the terminal-side transmission unit 13 to split the voice data at the point where the sensor information changes and to send packets containing the respective different sensor information.
  • the server side receiving unit 21 takes in the sensor information and the voice data to the voice recognition server 6 (step S403).
  • The server-side acoustic analysis unit 28 performs acoustic analysis of the acquired audio data, and calculates a time series of speech feature quantities (step S404).
  • The server-side acoustic model selection unit 23 selects the most appropriate acoustic model based on the acquired sensor information (step S109), and the server-side matching unit 26
  • matches the standard patterns of this acoustic model against the speech feature quantity (step S405).
  • As described above, the speech recognition terminal 2 transfers the sensor information and the voice data to the speech recognition server 6, so that highly accurate speech recognition can be performed based on an acoustic model suited to the sound-collection environment, without the speech recognition terminal 2 performing the acoustic analysis.
  • The speech recognition function can thus be realized even on a terminal that has no means of performing acoustic analysis or matching itself.
  • Furthermore, the sensor information is transmitted for each frame, so even when the environmental conditions under which the microphone 1 collects sound change rapidly during the utterance, an appropriate acoustic model can be selected and speech recognition performed.
  • The method of dividing the transmission from the speech recognition terminal 2 into a plurality of frames can also be applied to the transmission of the speech feature quantity in the third embodiment. Since the speech feature quantity has a time-series component, it is preferable, when dividing it into frames, to divide it in time-series order.
  • If the sensor information at the corresponding time is stored in each frame in the same manner as in the fourth embodiment, and the speech recognition server 6 selects a suitable acoustic model based on the latest sensor information contained in each frame, the accuracy of speech recognition can be improved further.
  • Embodiment 5. In the speech recognition systems of the first to fourth embodiments, an acoustic model stored in the speech recognition terminal 2 or the server 6 is selected based on the environmental conditions acquired by the sensor 12 of the speech recognition terminal 2, and speech recognition is performed accordingly. However, instead of using only the sensor information obtained from the sensor 12, it is also conceivable to select an acoustic model by additionally combining supplementary information obtained from the Internet or the like.
  • the voice recognition system according to the fifth embodiment has such features.
  • The feature of the fifth embodiment is that the acoustic model is selected by combining additional information obtained from the Internet with the sensor information; this feature can be combined with the speech recognition system of any of the first to fourth embodiments, with the same effect. Here, as an example, the case where the speech recognition system of the first embodiment is combined with additional information obtained from the Internet is described.
  • FIG. 11 is a block diagram illustrating the configuration of the speech recognition system according to the fifth embodiment.
  • the speech recognition system of the fifth embodiment is the same as the speech recognition system of the first embodiment, except that an internet information acquisition unit 29 is added.
  • The components marked with the same reference numerals are the same as those in the first embodiment, and their description is not repeated.
  • The Internet information acquisition unit 29 is a unit that acquires additional information via the Internet. More specifically, it acquires Web pages by HTTP (HyperText Transfer Protocol), and has a function equivalent to an Internet browser.
  • The additional information is, for example, weather information or traffic information.
  • There are Web sites on the Internet that provide weather information and traffic information, and from these sites it is possible to obtain the weather conditions, traffic congestion information, road construction status and so on at various places. Therefore, in order to perform speech recognition with higher accuracy using such additional information, acoustic models matched to the available additional information are prepared.
  • For example, when weather information is used as the additional information,
  • an acoustic model is learned that takes into account the effect of background noise caused by rain and the like;
  • when traffic information is used, an acoustic model is learned that takes into account the effect of background noise generated by road construction and the like.
  • FIG. 12 is a flowchart showing the operation of the speech recognition terminal 2 and the server 6 according to the fifth embodiment.
  • the only difference between the flowchart of FIG. 12 and the flowchart of FIG. 2 is the presence or absence of step S501. Therefore, hereinafter, the processing of step S501 will be mainly described.
  • After the sensor information is received at the speech recognition server 6 (step S108), the Internet information acquisition unit 29 obtains from the Internet information that affects the sound collection of the microphone 1 connected to the speech recognition terminal 2 (step S501). For example, when the sensor 12 includes a GPS antenna, the sensor information includes position information indicating where the speech recognition terminal 2 and the microphone 1 are located; in step S501, based on this position information, additional information such as the weather information and traffic information for the location of the speech recognition terminal 2 and the microphone 1 is acquired from the Internet.
  • Next, the server-side acoustic model selection unit 23 selects an acoustic model based on both the sensor information and the additional information. Specifically, it is first determined whether the additional information for the current location of the speech recognition terminal 2 and the microphone 1 matches the additional information associated with each acoustic model. Then, from among the acoustic models whose additional information matches, the acoustic model with the smallest distance value to the sensor information, calculated by equation (1) of the first embodiment, is selected.
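  • An illustrative sketch of this two-stage selection follows, assuming that each stored acoustic model is tagged with the weather and traffic conditions under which it was learned; the tag names, the fallback behaviour, and the distance helper are assumptions made for the example.

```python
def distance(model_sensors, current_sensors, weights):
    # Same weighted absolute-difference distance as equation (1) of the first embodiment.
    return sum(weights[k] * abs(current_sensors[k] - model_sensors[k])
               for k in current_sensors)

def select_model_with_additional_info(models, current_sensors, weights, weather, traffic):
    # First keep only the acoustic models whose learning conditions match the additional
    # information obtained from the Internet (weather, traffic/road-construction state).
    candidates = [m for m in models
                  if m.get("weather") == weather and m.get("traffic") == traffic]
    if not candidates:
        candidates = models   # assumed fallback: rely on the sensor information alone
    # Then pick, among the matching candidates, the model with the smallest
    # equation-(1) distance to the current sensor information.
    return min(candidates,
               key=lambda m: distance(m["sensors"], current_sensors, weights))
```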
  • In this way, even if the conditions under which an acoustic model was trained cannot be completely expressed by the sensor information alone, they can be expressed by also using the additional information, so an acoustic model more appropriate for the sound collection environment of the microphone 1 can be selected. As a result, the speech recognition accuracy can be improved.
  • The significance of using the additional information lies in selecting an acoustic model based on environmental factors that degrade the accuracy of speech recognition but cannot be represented by the sensor information alone. Therefore, the means for obtaining such additional information is not limited to the Internet; a dedicated system or a dedicated computer for providing the additional information may be prepared.
  • As described above, the voice recognition system, terminal, and server according to the present invention are useful for providing high-accuracy voice recognition regardless of the environment in which they are used, and are particularly suitable for providing a voice recognition function to devices such as car navigation systems and mobile phones, in which the capacity of the computing device that can be mounted is limited by constraints on housing size, weight, and price.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a speech recognition system that recognizes speech data with high accuracy in a variety of operating environments. It is a client-server speech recognition system in which a speech recognition terminal (2) and a speech recognition server (6) connected to a network share the speech recognition processing: a speech feature quantity is calculated from a speech signal collected by an external microphone (1), a plurality of acoustic models are stored, an acoustic model suited to the sound collection environment of the external microphone (1) is selected from among the plurality of acoustic models, and recognition results are output by pattern matching between a standard pattern of the acoustic model and the speech features. The speech recognition terminal (2) is provided with a sensor (12) for detecting the sound collection environment of the external microphone (1) and a transmission unit (13) for transmitting the output from the sensor (12) to the speech recognition server (6).
PCT/JP2003/009598 2003-07-29 2003-07-29 Systeme de reconnaissance vocale, son terminal et son serveur WO2005010868A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2005504586A JPWO2005010868A1 (ja) 2003-07-29 2003-07-29 音声認識システム及びその端末とサーバ
PCT/JP2003/009598 WO2005010868A1 (fr) 2003-07-29 2003-07-29 Systeme de reconnaissance vocale, son terminal et son serveur

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2003/009598 WO2005010868A1 (fr) 2003-07-29 2003-07-29 Systeme de reconnaissance vocale, son terminal et son serveur

Publications (1)

Publication Number Publication Date
WO2005010868A1 true WO2005010868A1 (fr) 2005-02-03

Family

ID=34090568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2003/009598 WO2005010868A1 (fr) 2003-07-29 2003-07-29 Systeme de reconnaissance vocale, son terminal et son serveur

Country Status (2)

Country Link
JP (1) JPWO2005010868A1 (fr)
WO (1) WO2005010868A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008108232A1 (fr) * 2007-02-28 2008-09-12 Nec Corporation Dispositif de reconnaissance audio, procédé de reconnaissance audio et programme de reconnaissance audio
JP2011118124A (ja) * 2009-12-02 2011-06-16 Murata Machinery Ltd 音声認識システムと認識方法
CN109213970A (zh) * 2017-06-30 2019-01-15 北京国双科技有限公司 笔录生成方法及装置
WO2019031870A1 (fr) * 2017-08-09 2019-02-14 엘지전자 주식회사 Procédé et appareil pour appeler un service de reconnaissance vocale à l'aide d'une technologie bluetooth à basse énergie
CN110556097A (zh) * 2018-06-01 2019-12-10 声音猎手公司 定制声学模型
WO2020096172A1 (fr) * 2018-11-07 2020-05-14 Samsung Electronics Co., Ltd. Dispositif électronique de traitement d'énoncé d'utilisateur et son procédé de commande

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002091477A (ja) * 2000-09-14 2002-03-27 Mitsubishi Electric Corp 音声認識システム、音声認識装置、音響モデル管理サーバ、言語モデル管理サーバ、音声認識方法及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体
JP2003122395A (ja) * 2001-10-19 2003-04-25 Asahi Kasei Corp 音声認識システム、端末およびプログラム、並びに音声認識方法
JP2003140691A (ja) * 2001-11-07 2003-05-16 Hitachi Ltd 音声認識装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002091477A (ja) * 2000-09-14 2002-03-27 Mitsubishi Electric Corp 音声認識システム、音声認識装置、音響モデル管理サーバ、言語モデル管理サーバ、音声認識方法及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体
JP2003122395A (ja) * 2001-10-19 2003-04-25 Asahi Kasei Corp 音声認識システム、端末およびプログラム、並びに音声認識方法
JP2003140691A (ja) * 2001-11-07 2003-05-16 Hitachi Ltd 音声認識装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KOSAKA ET AL.: "Scalar Ryushi-ka o Riyo shita Client . Server-gata Onsei Ninshiki no Jitsugen to Server-bu no Kosoku-ka no Kento", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS GIJUTSU KENKYU HOKOKU (ONSEI), 21 December 1999 (1999-12-21), pages 31 - 36, XP002984744 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008108232A1 (fr) * 2007-02-28 2008-09-12 Nec Corporation Dispositif de reconnaissance audio, procédé de reconnaissance audio et programme de reconnaissance audio
JP5229216B2 (ja) * 2007-02-28 2013-07-03 日本電気株式会社 音声認識装置、音声認識方法及び音声認識プログラム
US8612225B2 (en) 2007-02-28 2013-12-17 Nec Corporation Voice recognition device, voice recognition method, and voice recognition program
JP2011118124A (ja) * 2009-12-02 2011-06-16 Murata Machinery Ltd 音声認識システムと認識方法
CN109213970A (zh) * 2017-06-30 2019-01-15 北京国双科技有限公司 笔录生成方法及装置
CN109213970B (zh) * 2017-06-30 2022-07-29 北京国双科技有限公司 笔录生成方法及装置
US11367449B2 (en) 2017-08-09 2022-06-21 Lg Electronics Inc. Method and apparatus for calling voice recognition service by using Bluetooth low energy technology
WO2019031870A1 (fr) * 2017-08-09 2019-02-14 엘지전자 주식회사 Procédé et appareil pour appeler un service de reconnaissance vocale à l'aide d'une technologie bluetooth à basse énergie
JP2019211752A (ja) * 2018-06-01 2019-12-12 サウンドハウンド,インコーポレイテッド カスタム音響モデル
US11011162B2 (en) 2018-06-01 2021-05-18 Soundhound, Inc. Custom acoustic models
US11367448B2 (en) 2018-06-01 2022-06-21 Soundhound, Inc. Providing a platform for configuring device-specific speech recognition and using a platform for configuring device-specific speech recognition
CN110556097A (zh) * 2018-06-01 2019-12-10 声音猎手公司 定制声学模型
CN110556097B (zh) * 2018-06-01 2023-10-13 声音猎手公司 定制声学模型
US11830472B2 (en) 2018-06-01 2023-11-28 Soundhound Ai Ip, Llc Training a device specific acoustic model
WO2020096172A1 (fr) * 2018-11-07 2020-05-14 Samsung Electronics Co., Ltd. Dispositif électronique de traitement d'énoncé d'utilisateur et son procédé de commande
US10699704B2 (en) 2018-11-07 2020-06-30 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof
US11538470B2 (en) 2018-11-07 2022-12-27 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof

Also Published As

Publication number Publication date
JPWO2005010868A1 (ja) 2006-09-14

Similar Documents

Publication Publication Date Title
EP2538404B1 (fr) Dispositif de transfert de données vocales, dispositif terminal, procédé de transfert de données vocales et système de reconnaissance vocale
US7996220B2 (en) System and method for providing a compensated speech recognition model for speech recognition
TWI508057B (zh) 語音辨識系統以及方法
EP2226793A2 (fr) Système de reconnaissance vocale et procédé d'actualisation des données
EP2956939B1 (fr) Extension de bande passante personnalisée
CN104347067A (zh) 一种音频信号分类方法和装置
KR20040084759A (ko) 이동 통신 장치를 위한 분산 음성 인식
JP2002091477A (ja) 音声認識システム、音声認識装置、音響モデル管理サーバ、言語モデル管理サーバ、音声認識方法及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体
JP6466334B2 (ja) リアルタイム交通検出
CN105486325A (zh) 具有语音处理机制的导航系统及其操作方法
CN104040626A (zh) 多译码模式信号分类
CN101345819A (zh) 一种用于机顶盒的语音控制系统
CN1171201C (zh) 语音识别系统及其方法
JP2015230384A (ja) 意図推定装置、及び、モデルの学習方法
WO2005010868A1 (fr) Systeme de reconnaissance vocale, son terminal et son serveur
CN112382266B (zh) 一种语音合成方法、装置、电子设备及存储介质
JP3477432B2 (ja) 音声認識方法およびサーバならびに音声認識システム
JP2008026489A (ja) 音声信号変換装置
JP2003241788A (ja) 音声認識装置及び音声認識システム
CN103474063B (zh) 语音辨识系统以及方法
CN113724698A (zh) 语音识别模型的训练方法、装置、设备及存储介质
Tan et al. Network, distributed and embedded speech recognition: An overview
JPH10254473A (ja) 音声変換方法及び音声変換装置
JP2003122395A (ja) 音声認識システム、端末およびプログラム、並びに音声認識方法
KR100383391B1 (ko) 음성인식서비스 시스템 및 방법

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2005504586

Country of ref document: JP

122 Ep: pct application non-entry in european phase