
WO2024023946A1 - Speech processing device, speech processing method, and speech processing program - Google Patents

Speech processing device, speech processing method, and speech processing program Download PDF

Info

Publication number
WO2024023946A1
Authority
WO
WIPO (PCT)
Prior art keywords
loss function
model
learning
context
becomes smaller
Prior art date
Application number
PCT/JP2022/028843
Other languages
French (fr)
Japanese (ja)
Inventor
Tomohiro TANAKA
Ryo MASUMURA
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2022/028843
Publication of WO2024023946A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • the present invention relates to an audio processing device, an audio processing method, and an audio processing program.
  • the subsequent task refers to a task that takes speech as input, such as speech recognition.
  • in self-supervised learning, parameters are learned so that a context representation that takes preceding and following input into account can be obtained from speech.
  • Transformer is known as a neural network that can acquire a context representation (see, for example, Non-Patent Document 2).
  • the conventional technology has a problem in that the accuracy of tasks subsequent to self-supervised learning may decrease.
  • the self-supervised learning model for speech may overfit the learning data of the self-supervised learning.
  • a mismatch occurs between the self-supervised learning model and the data used in the subsequent task, and an effective representation for the subsequent task cannot be obtained.
  • in order to solve the above problems, a speech processing device includes a loss function calculation unit that calculates a first loss function that becomes smaller as a vector obtained by a model quantizing features of speech approaches the context representation acquired by the model from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and an updating unit that updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  • FIG. 1 is a diagram showing an example of the configuration of a first learning device.
  • FIG. 2 is a diagram showing an example of the configuration of the second learning device.
  • FIG. 3 is a diagram showing a configuration example of an estimation device.
  • FIG. 4 is a flowchart showing the overall flow of the learning process.
  • FIG. 5 is a flowchart showing the flow of self-supervised learning processing.
  • FIG. 6 is a flowchart showing the flow of relearning processing.
  • FIG. 7 is a flowchart showing the flow of inference processing.
  • FIG. 8 is a diagram showing an example of a computer that executes a learning program.
  • the model is, for example, a neural network and includes a speech encoder, a context network, a quantization network, a classification network and an additional network. Details of each network will be described later.
  • the additional network is a neural network for calculating the final output in the latter task described above.
  • Tasks include classification tasks, generation tasks, prediction tasks, and the like.
  • tasks targeting speech include speech recognition, which obtains text from speech, speech classification, which classifies speech into predetermined types (e.g. speaker attributes, emotions), and speaker identification, which identifies the speaker of the speech.
  • the process of optimizing model parameters to improve task accuracy is called learning process.
  • the process of actually executing a task using one or more models including additional networks that have been trained through the learning process is called inference process.
  • the learning process of this embodiment is comprised of two steps: self-supervised learning process and relearning process.
  • the speech encoder and context network are called a self-supervised learning model.
  • Self-supervised learning models can be used for several different tasks targeting speech.
  • additional networks are models specialized for specific tasks.
  • a self-supervised learning model is trained.
  • additional network learning is performed using the self-supervised learning model that has been trained in the self-supervised learning process.
  • the first learning device 10 performs the self-supervised learning process, the second learning device 20 performs the relearning process, and the inference device 50 performs the inference process.
  • the first learning device 10, the second learning device 20, and the inference device 50 may be realized by different computers, or may be realized by one computer.
  • the first learning device 10 is an example of a speech processing device. Moreover, any one or more of the first learning device 10, the second learning device 20, and the inference device 50 can function as a speech processing device.
  • FIG. 1 is a diagram showing an example of the configuration of a first learning device.
  • the first learning device 10 receives, as learning data, a set of pairs of an acoustic feature sequence X and a meta-information classification label l: {(X_1, l_1), ..., (X_M, l_M)}, where each label belongs to {l_1, ..., l_L}.
  • the classification label l is the correct label.
  • M is the number of pairs of audio feature series and classification labels included in the learning data, and is an integer of 1 or more.
  • l_l is the l-th classification label type.
  • L is the number of types of classification labels prepared, and is an integer of 2 or more.
  • meta information is information representing the domain of audio (call center conversation audio, online conference audio, reading audio, etc.), language, gender, etc.
  • the acoustic features are, for example, log Mel filter bank coefficients (FBANK).
  • the acoustic features are not limited to log Mel filter banks; they may also be MFCCs (Mel frequency cepstral coefficients), ΔMFCC (the first derivative of MFCC), ΔΔMFCC (the second derivative of MFCC), logarithmic power, Δ logarithmic power (the first derivative of logarithmic power), and the like.
  • the acoustic feature amount may be a sample of raw speech.
  • classification label may be represented by an L-dimensional 1-hot vector.
  • the first learning device 10 includes a speech encoder unit 11, a context network unit 12, a quantization network unit 13, a classification network unit 14, a classification learning loss calculation unit 15, a context representation learning loss calculation unit 16, a learning parameter updating unit 17, and model information 10a.
  • the model information 10a is the set of parameters of the models used by the first learning device 10. The parameters include neural network weights and biases. In the learning process, the model information 10a is updated as appropriate.
  • I is the sequence length of the acoustic feature, and is an integer of 1 or more.
  • T is the sequence length of the voice intermediate feature vector sequence, and is an integer of 1 or more.
  • the audio encoder unit 11 calculates the audio intermediate feature vector sequence Z as shown in equation (1).
  • SpeechEncoder() is a function that has the function of a neural network, for example, a convolutional neural network.
  • ⁇ se1 is a parameter of the audio encoder and can be learned. ⁇ se1 is read from the model information 10a.
  • the context network unit 12 applies masking to the intermediate feature vector sequence Z, which is the output of the audio encoder unit 11, as shown in equation (2).
  • Masking() is a function that performs masking in the time direction.
  • ContextNetwork() (context network) is a function that has the function of a neural network, and is, for example, the Transformer described in Non-Patent Document 2.
  • ⁇ se2 is a parameter of the context network and can be learned. ⁇ se2 is read from the model information 10a.
  • QuantizationNetwork() is a function that has the function of a neural network, and is composed of, for example, a fully connected neural network and a Gumbel softmax function.
  • the Gumbel softmax function is a differentiable function for propagating the output of a classifier (for example, a fully connected neural network) to a subsequent network.
  • the Gumbel softmax function is described in Reference 1, for example.
  • ⁇ qn is a parameter of the quantization network and can be learned. ⁇ qn is read from the model information 10a.
  • the number of dimensions of the probability sequence O is L.
  • Each element of the probability series O corresponds to each element ⁇ l 1 ,...,l L ⁇ of the classification label l.
  • GRL( ) is a function representing a gradient reversal layer (for example, see reference document 1), and is a function that inverts the sign of the gradient during Backward in the error backpropagation method.
  • ClassNetwork() (classification network) is a function that has the function of a neural network, and is composed of, for example, a fully connected neural network and a softmax function.
  • ⁇ qn is a parameter of the quantization network and can be learned. ⁇ qn is read from the model information 10a.
  • the classification learning loss calculation unit 15 calculates the classification learning loss L class for the classification label l as shown in equation (7).
  • ClassLoss( ) is a function that calculates the loss for identifying the classification label l, for example, cross entropy loss.
  • the context expression learning loss calculation unit 16 calculates a loss L context (context expression learning loss) for learning a context expression as shown in equation (8).
  • ContextLoss() is a function that calculates loss for learning context expressions, for example, Contrastive loss.
  • Sim() in equation (8) is a function that calculates the similarity between two vectors, for example the cosine similarity.
  • Q̂ (Q with a hat above it) represents the set of negative examples of the quantized vectors.
  • τ is a temperature parameter set in advance.
  • a pair (positive example) of an element of the quantized expression vector sequence Q and an element of the corresponding context expression vector sequence is used.
  • q t and c t are a pair of corresponding elements.
  • the learning parameter updating unit 17 updates the parameters of the model based on the classification learning loss L class and the context expression learning loss L context .
  • the learning parameter updating unit 17 updates the parameters for each mini-batch.
  • letting the learning parameters be Θ_se = {θ_se1, θ_se2}, Θ_qn = {θ_qn}, and Θ_cn = {θ_cn}, the learning parameter updating unit 17 updates the parameters according to equations (9), (10), and (11).
  • ε represents the learning rate, β represents the weight for the context representation learning loss, and γ represents the weight for the classification learning loss.
  • α also represents a weight; its value is changed substantially as learning progresses (as updates are repeated in mini-batch units) to adjust the influence of the loss.
  • as shown in equation (5), the function GRL() is introduced in the calculation of the classification network unit 14, so the sign of the term with α in equation (11) is inverted.
  • with equation (11), the learning parameters are updated based on a loss that becomes smaller as the vector the model quantizes from the input approaches the acquired context representation and as the accuracy of identifying meta information based on the acquired context representation increases.
  • in this way, the learning parameter updating unit 17 calculates a first loss function (the context representation learning loss L_context) that becomes smaller as the vector obtained by the model quantizing the speech features approaches the context representation the model acquires from the features, and a second loss function (the classification learning loss L_class) that becomes smaller as the accuracy with which the model identifies the speech meta information based on the context representation increases. The learning parameter updating unit 17 then updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger (equation (11)).
  • the learning parameter update section 17 corresponds to a loss function calculation section and an update section.
  • more specifically, the expression in the parentheses on the right-hand side of equation (11) is a loss function (a third loss function) obtained by subtracting, from the first term (the term with β), which is the first loss function, a second term (the term with α) to which a weight α that increases as the parameter updates by the updating unit are repeated is applied.
  • in this embodiment, an adversarial neural network (ANN) is realized by equation (11) in order to prevent the model from overfitting the learning data in the self-supervised learning process.
  • with the ANN, training is performed so that the context network does not identify meta information about the speech. This allows the context network to acquire a universal representation without overfitting the learning data.
  • for example, if the classification label l indicates the domain of the speech, the context network will operate robustly even for speech in an unknown domain.
  • the meta information includes the domain of the voice, the characteristics of the voice (language, etc.), the attributes of the speaker of the voice (gender, age), etc., and is information different from the content of the utterance expressed in text or the like. Moreover, the content of the utterance may be rephrased as the content of information transmitted by voice.
  • after the learning parameter updating unit 17 updates the parameters θ_se1, θ_se2, θ_qn, and θ_cn, the process is repeated using the updated parameters. When a predetermined condition (for example, a number of repetitions) is satisfied, the iterative process ends.
  • the context network unit 12 is an example of a context expression calculation unit that inputs voice features into a model and calculates a context expression.
  • the classification network unit 14 is an example of a meta-information label calculation unit that inputs a context expression into a model and calculates a label that specifies audio meta-information.
  • the quantization network unit 13 is an example of a quantization vector calculation unit that inputs a voice feature amount into a model and calculates a quantized vector.
  • accordingly, it can be said that the learning parameter updating unit 17 calculates the first loss function so that it becomes smaller as the vector calculated by the quantization vector calculation unit approaches the context representation calculated by the context representation calculation unit, and calculates the second loss function so that it becomes smaller as the label calculation accuracy of the meta-information label calculation unit increases.
  • FIG. 2 is a diagram showing an example of the configuration of the second learning device.
  • the second learning device 20 uses the parameters updated by the first learning device 10 to train a relearning model for performing a task related to speech (a code sketch of this relearning stage and the subsequent inference appears after this list).
  • the model in the first learning device 10 is a model that combines a speech encoder, a context network, a quantization network, and a classification network.
  • the relearning model is a model that combines a speech encoder, a context network, and an additional network.
  • a set of an acoustic feature sequence X and a subsequent task label l' is input to the second learning device 20 as learning data.
  • the subsequent task label l' is the correct label.
  • the subsequent task label l' corresponds to information according to the task, and does not need to indicate meta information.
  • for example, in the case of speech recognition, the subsequent task label l' is the text corresponding to the speech. The subsequent task label l' may also indicate meta information like the classification label l. Note that the processing unit of the text corresponding to the speech in the subsequent task label l' may be a phoneme, a character, or a word.
  • the second learning device 20 includes a speech encoder section 21, a context network section 22, an additional network section 23, a subsequent task learning loss calculation section 24, a learning parameter update section 25, and model information 20a.
  • the model information 20a is parameters of a model trained by the first learning device 10.
  • the model information 20a includes at least parameters ⁇ se1 and ⁇ se2 . Furthermore, the model information 20a includes a parameter ⁇ add of an additional network depending on the task.
  • the audio encoder unit 21 calculates the audio intermediate feature vector sequence Z as shown in equation (1).
  • ⁇ se1 is a parameter of the audio encoder that has been updated by the first learning device 10, and is read from the model information 20a.
  • the context network unit 22 converts the intermediate feature vector sequence Z, which is the output of the audio encoder unit 21, into a context expression C as shown in equation (12). However, unlike the context network unit 12, the context network unit 22 does not perform masking.
  • ⁇ se2 is a parameter of the context network that has been updated by the first learning device 10, and is read from the model information 10a.
  • the additional network unit 23 calculates a probability sequence P (sequence of predicted probabilities) for the subsequent task label from the context expression vector sequence C that is the output of the context network unit 22, as shown in equation (13).
  • the ClassNetwork() (classification network) in equation (13) is different from the classification network of the first learning device 10, and learning is performed in the second learning device 20.
  • the classification network of the second learning device 20 is a function having the function of a neural network, and is composed of, for example, a bidirectional LSTM and a softmax function.
  • ⁇ add is a parameter of the classification network of the subsequent task and can be learned.
  • ⁇ addn is read from the model information 20a.
  • the subsequent task learning loss calculation unit 24 calculates the subsequent task learning loss L down for the subsequent task label l' as shown in equation (14).
  • Loss( ) is a function that calculates the loss of the subsequent task (for example, classification loss), for example, cross-entropy loss. Note that Loss( ) is changed as appropriate depending on the type of subsequent task (classification task, generation task, prediction task, etc.).
  • the learning parameter updating unit 25 updates the parameters of the model based on the loss L down of the subsequent task.
  • the learning parameter update unit 25 may fix some parameters and update other parameters. For example, the learning parameter updating unit 25 updates the parameter ⁇ add without updating the parameters ⁇ se1 and ⁇ se2 .
  • the calculation of the loss L down of the subsequent task is performed for each mini-batch. Therefore, the learning parameter updating unit 25 updates the parameters for each mini-batch.
  • the process is further repeated using the updated parameters. Furthermore, when a predetermined condition (for example, the number of repetitions) is satisfied, the iterative process ends.
  • FIG. 3 is a diagram showing a configuration example of an estimation device.
  • the inference device 50 uses the relearning model to execute a task.
  • the acoustic feature sequence X is input to the inference device 50.
  • the inference device 50 estimates a label corresponding to the acoustic feature sequence X.
  • the inference device 50 includes a speech encoder section 51, a context network section 52, an additional network section 53, and model information 50a.
  • the model information 50a is the parameters of each model learned by the first learning device 10 and the second learning device 20.
  • the model information 50a includes a learned speech encoder parameter ⁇ se1 and a learned context network parameter ⁇ se2 . Furthermore, the model information 50a includes the learned additional network parameter ⁇ add .
  • the context network unit 52 converts the intermediate feature vector sequence Z, which is the output of the audio encoder unit 51, into a context representation C.
  • the additional network unit 53 calculates a probability sequence P (a sequence of predicted probabilities) for the subsequent task label from the context representation vector sequence C that is the output of the context network unit 52.
  • the additional network unit 53 outputs classification results based on the probability sequence P.
  • the additional network unit 53 may output the probability sequence P, or may output information specifying the subsequent task label corresponding to the element with the largest value among the elements of the probability sequence P.
  • FIG. 4 is a flowchart showing the overall flow of the learning process. As shown in FIG. 4, first, the first learning device 10 performs preliminary learning of a speech encoder, a context network, a quantization network, and a classification network (step S1).
  • the second learning device 20 uses the learned speech encoder and context network to learn an additional network (step S2). At this time, it is also possible to relearn the audio encoder and context network.
  • FIG. 5 is a flowchart showing the flow of self-supervised learning processing.
  • the self-supervised learning process corresponds to the process of step S1 in FIG.
  • the first learning device 10 inputs the acoustic feature sequence to the audio encoder and calculates the intermediate expression vector sequence (step S101).
  • the first learning device 10 applies masking to the intermediate expression vector sequence, inputs it to the context network, and calculates a context expression vector sequence (step S102).
  • the first learning device 10 inputs the intermediate representation vector sequence to the quantization network and calculates a quantized representation vector sequence (step S103).
  • the first learning device 10 applies GRL to the context expression vector sequence, inputs it to the classification network, and calculates a probability sequence for the classification label of the meta information (step S104).
  • the first learning device 10 calculates the classification learning loss based on the calculated probability sequence and the correct classification label of the meta information (step S105).
  • the first learning device 10 calculates a context expression learning loss based on the context expression vector sequence and the quantized expression vector sequence (step S106).
  • the first learning device 10 updates the parameters of the audio encoder, context network, quantization network, and classification network based on the classification learning loss and the context representation learning loss (step S107).
  • if the end condition is satisfied (step S108, Yes), the first learning device 10 terminates the process.
  • if the end condition is not satisfied (step S108, No), the first learning device 10 returns to step S101 and repeats the process using the model with updated parameters.
  • the termination conditions include, for example, that the process has been repeated a certain number of times, that the amount of parameter updates has converged, etc.
  • FIG. 6 is a flowchart showing the flow of the relearning process.
  • the relearning process corresponds to the process of step S2 in FIG.
  • the second learning device 20 first inputs the acoustic feature sequence to the audio encoder and calculates the intermediate expression vector sequence (step S201).
  • the second learning device 20 inputs the intermediate expression vector sequence to the context network and calculates the context expression vector sequence (step S202).
  • the second learning device 20 inputs the context expression vector sequence to the additional network and calculates a probability sequence for the classification label according to the task (step S203).
  • the second learning device 20 calculates additional learning loss based on the calculated probability sequence and the correct classification label according to the task (step S204).
  • the second learning device 20 updates the parameters of the additional network based on the additional learning loss (step S205). At this time, it is also possible to relearn the audio encoder and context network.
  • if the end condition is satisfied (step S206, Yes), the second learning device 20 terminates the process.
  • if the end condition is not satisfied (step S206, No), the second learning device 20 returns to step S201 and repeats the process using the model with updated parameters.
  • the termination conditions include, for example, that the process has been repeated a certain number of times, that the amount of parameter updates has converged, etc.
  • FIG. 7 is a flowchart showing the flow of inference processing.
  • the inference device 50 inputs the acoustic feature sequence to the audio encoder and calculates the intermediate representation vector sequence (step S501).
  • the inference device 50 inputs the intermediate representation vector sequence to the context network and calculates the context expression vector sequence (step S502).
  • the inference device 50 inputs the context expression vector sequence to the additional network and calculates a probability sequence for the classification label according to the task (step S503).
  • the inference device 50 outputs a classification result based on the calculated probability series (step S504).
  • as described above, the first learning device 10 calculates a first loss function that becomes smaller as the vector obtained by the model quantizing the speech features approaches the context representation acquired by the model from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies the speech meta information based on the context representation increases.
  • the first learning device 10 updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger. This prevents the context network from overfitting to the learning data, and prevents the accuracy of tasks subsequent to self-supervised learning from decreasing.
  • the first learning device 10 also calculates a third loss function by subtracting, from the first term, which is the first loss function, a second term to which a weight that increases as parameter updates are repeated is applied, and updates the model parameters so that the third loss function becomes smaller.
  • the first learning device 10 uses the updated parameters to learn a relearning model that executes tasks related to speech. Thereby, subsequent tasks by the additional network can be executed with high accuracy.
  • the first learning device 10 inputs the speech features into the model and calculates a context representation, inputs the context representation into the model and calculates a label that specifies the meta information of the speech, and inputs the speech features into the model and calculates a quantized vector.
  • the first learning device 10 calculates the first loss function so that it becomes smaller as the calculated vector approaches the calculated context representation, and calculates the second loss function so that it becomes smaller as the calculation accuracy of the label increases. Thereby, the first learning device 10 can consistently perform calculations using the model and update the parameters.
  • the speech processing device thus provides a specific improvement over conventional machine learning methods such as the one described in Non-Patent Document 1, and represents an improvement in the technical field of speech tasks using machine learning models.
  • each component of each device shown in the drawings is functionally conceptual, and does not necessarily need to be physically configured as shown in the drawings.
  • the specific form of distribution and integration of the devices is not limited to what is shown in the drawings; all or part of the devices may be functionally or physically distributed or integrated in arbitrary units depending on various loads and usage conditions.
  • each processing function performed by each device may be realized, in whole or in part, by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU, or as hardware using wired logic. Note that the program may be executed not only by a CPU but also by another processor such as a GPU.
  • the speech processing device can be implemented by installing, on a desired computer, a program that executes the above processing as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can be made to function as the speech processing device.
  • the information processing device referred to here includes a desktop or notebook personal computer.
  • information processing devices include mobile communication terminals such as smartphones, mobile phones, and PHSs (Personal Handyphone Systems), as well as slate terminals such as PDAs (Personal Digital Assistants).
  • the audio processing device can also be implemented as a learning server device that uses a terminal device used by a user as a client and provides services related to the above-mentioned learning processing to the client.
  • a learning server device is implemented as a server device that provides a learning service that takes learning data as input and outputs parameters of a trained model.
  • the learning server device may be implemented as a Web server, or may be implemented as a cloud that provides services related to the above-mentioned learning processing by outsourcing.
  • FIG. 8 is a diagram showing an example of a computer that executes a learning program.
  • Computer 1000 includes, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These parts are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
  • the ROM 1011 stores, for example, a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1090.
  • Disk drive interface 1040 is connected to disk drive 1100.
  • Serial port interface 1050 is connected to, for example, mouse 1110 and keyboard 1120.
  • Video adapter 1060 is connected to display 1130, for example.
  • the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, a program that defines each process of the learning device 5 is implemented as a program module 1093 in which computer-executable code is written.
  • Program module 1093 is stored in hard disk drive 1090, for example.
  • a program module 1093 for executing processing similar to the functional configuration of the learning device 5 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the embodiment described above is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. Then, the CPU 1020 reads out the program module 1093 and program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary, and executes the processing of the embodiment described above.
  • program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like.
  • the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). The program module 1093 and program data 1094 may then be read by the CPU 1020 from another computer via the network interface 1070.
  • a speech processing device in which a processor calculates a first loss function that becomes smaller as a vector obtained by a model quantizing features of speech approaches a context representation acquired by the model from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and updates parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  • a speech processing device in which the processor calculates a third loss function by subtracting, from a first term that is the first loss function, a second term to which a weight that increases as parameter updates are repeated is applied, and updates the parameters of the model so that the third loss function becomes smaller.
  • a speech processing device that trains, using the updated parameters, a relearning model for performing a task related to speech.
  • a speech processing device that inputs the features of the speech into the model and calculates the context representation, inputs the context representation into the model and calculates a label specifying the meta information of the speech, inputs the features of the speech into the model and calculates the quantized vector, calculates the first loss function so that it becomes smaller as the calculated vector approaches the calculated context representation, and calculates the second loss function so that it becomes smaller as the calculation accuracy of the label increases.
  • a speech processing device that executes the task using the relearning model.
  • a non-transitory storage medium storing a program executable by a computer to perform speech processing,
  • wherein the speech processing calculates a first loss function that becomes smaller as a vector obtained by a model quantizing features of speech approaches a context representation acquired by the model from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  • an inference device comprising an inference unit that performs inference processing on speech using a relearning model trained using the parameters of a model trained through the pre-training processing.
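As referenced above in the item on the relearning model, the sketch below illustrates how the relearning stage (second learning device 20) and the inference stage (inference device 50) described in the items above could look in code. It assumes PyTorch, a linear additional network with mean pooling over time, and a classification-style subsequent task; none of these specifics (nor the names AdditionalNetwork, relearn_step, infer) come from the patent, which mentions, for example, a bidirectional LSTM plus softmax as one possible classification network for the subsequent task.

```python
# Hedged sketch of the relearning (second learning device 20) and inference (inference device 50)
# stages; SpeechEncoder and ContextNetwork are assumed to be the pre-trained modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditionalNetwork(nn.Module):
    """Task-specific head (theta_add); a linear layer over mean-pooled context is an assumption."""
    def __init__(self, dim: int = 512, num_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # c: (batch, T, dim) -> probability sequence P: (batch, num_classes)
        return torch.softmax(self.fc(c.mean(dim=1)), dim=-1)

def relearn_step(x, labels, encoder, context_net, additional_net, optimizer):
    with torch.no_grad():                    # example of fixing theta_se1, theta_se2 and updating only theta_add
        c = context_net(encoder(x))          # no masking in the relearning stage
    p = additional_net(c)
    loss = F.nll_loss(torch.log(p + 1e-10), labels)   # subsequent-task loss L_down (classification case)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def infer(x, encoder, context_net, additional_net):
    p = additional_net(context_net(encoder(x)))
    return p.argmax(dim=-1)                  # label with the largest predicted probability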

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A first training device (10) calculates a first loss function that becomes smaller as a vector into which a model has quantized the feature amount of speech becomes closer to a context representation which the model has acquired from the feature amount, and a second loss function that becomes smaller as an accuracy with which the model identifies meta information of the speech on the basis of the context representation becomes higher. The first training device (10) updates the parameter of the model such that the first loss function becomes smaller and the second loss function becomes larger.

Description

Speech processing device, speech processing method, and speech processing program
 The present invention relates to a speech processing device, a speech processing method, and a speech processing program.
 Conventionally, it is known that the accuracy of a subsequent task can be improved by transferring the parameters of a neural network trained through self-supervised learning using speech data to that subsequent task (see, for example, Non-Patent Document 1).
 Here, a subsequent task refers to a task that takes speech as input, such as speech recognition. In self-supervised learning, parameters are learned so that a context representation that takes preceding and following input into account can be obtained from speech. The Transformer is known as a neural network that can acquire such a context representation (see, for example, Non-Patent Document 2).
 However, the conventional technology has the problem that the accuracy of the task subsequent to self-supervised learning may decrease.
 For example, with the technique described in Non-Patent Document 1, the self-supervised learning model for speech may overfit the learning data used for self-supervised learning. In this case, a mismatch arises between the self-supervised learning model and the data used in the subsequent task, and a representation that is effective for the subsequent task cannot be obtained.
 In order to solve the above problems and achieve the objective, a speech processing device includes a loss function calculation unit that calculates a first loss function that becomes smaller as a vector obtained by a model quantizing features of speech approaches a context representation acquired by the model from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and an updating unit that updates parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
 According to the present invention, it is possible to prevent the accuracy of the task subsequent to self-supervised learning from decreasing.
 FIG. 1 is a diagram showing a configuration example of the first learning device. FIG. 2 is a diagram showing a configuration example of the second learning device. FIG. 3 is a diagram showing a configuration example of the estimation device. FIG. 4 is a flowchart showing the overall flow of the learning process. FIG. 5 is a flowchart showing the flow of the self-supervised learning process. FIG. 6 is a flowchart showing the flow of the relearning process. FIG. 7 is a flowchart showing the flow of the inference process. FIG. 8 is a diagram showing an example of a computer that executes a learning program.
 Embodiments of the speech processing device, speech processing method, and speech processing program according to the present application will be described in detail below with reference to the drawings. Note that the present invention is not limited to the embodiments described below.
[First embodiment]
 In the first embodiment, learning (training) of multiple models is performed. Each model is, for example, a neural network; the models include a speech encoder, a context network, a quantization network, a classification network, and an additional network. Details of each network will be described later.
 The additional network is a neural network for calculating the final output of the subsequent task described above. Tasks include classification tasks, generation tasks, prediction tasks, and the like.
 This embodiment particularly targets tasks aimed at speech. Tasks targeting speech include speech recognition, which obtains text from speech, speech classification, which classifies speech into predetermined types (for example, speaker attributes or emotions), and speaker identification, which identifies the speaker of the speech.
 The process of optimizing the parameters of the models to improve task accuracy is called the learning process. The process of actually executing a task using one or more models, including the additional network, trained through the learning process is called the inference process.
 The learning process of this embodiment consists of two stages: a self-supervised learning process and a relearning process.
 Here, the speech encoder and the context network are called the self-supervised learning model. The self-supervised learning model can be used for several different tasks targeting speech. The additional network, on the other hand, is a model specialized for a specific task.
 In the self-supervised learning process, the self-supervised learning model is trained. In the relearning process, the additional network is trained using the self-supervised learning model trained in the self-supervised learning process.
 In this embodiment, the first learning device 10 performs the self-supervised learning process, the second learning device 20 performs the relearning process, and the inference device 50 performs the inference process. The first learning device 10, the second learning device 20, and the inference device 50 may each be realized by a different computer, or they may be realized by a single computer.
 The first learning device 10 is an example of a speech processing device. Any one or more of the first learning device 10, the second learning device 20, and the inference device 50 can function as a speech processing device.
 The configuration of the first learning device 10 will be explained using FIG. 1. FIG. 1 is a diagram showing a configuration example of the first learning device.
 As shown in FIG. 1, the first learning device 10 receives, as learning data, a set of pairs of an acoustic feature sequence X and a meta-information classification label l: {(X_1, l_1), ..., (X_M, l_M)}, where each label belongs to {l_1, ..., l_L}. The classification label l is the correct label.
 Here, M is the number of pairs of acoustic feature sequences and classification labels included in the learning data, and is an integer of 1 or more. l_l is the l-th classification label type. L is the number of classification label types prepared, and is an integer of 2 or more.
 The meta information is information representing, for example, the domain of the speech (call center conversation speech, online meeting speech, read speech, etc.), the language, or the gender of the speaker.
 The acoustic features (the elements of the acoustic feature sequence X) are, for example, log Mel filter bank coefficients (FBANK). The acoustic features are not limited to log Mel filter banks; they may also be MFCCs (Mel-frequency cepstral coefficients), ΔMFCC (the first derivative of MFCC), ΔΔMFCC (the second derivative of MFCC), logarithmic power, Δ logarithmic power (the first derivative of logarithmic power), and the like. The acoustic features may also be raw speech samples.
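As a concrete illustration of the acoustic features just mentioned, the following sketch extracts log Mel filter bank (FBANK) features and MFCCs with librosa. The file path, sampling rate, frame settings, and feature dimensions are placeholder assumptions chosen for illustration, not values taken from the patent.

```python
# Illustrative sketch (not from the patent): extracting FBANK / MFCC features with librosa.
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000, n_mels: int = 80, n_mfcc: int = 13):
    """Return a log Mel filter bank (FBANK) sequence and an MFCC sequence for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)                      # raw speech samples
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = np.log(mel + 1e-10).T                              # shape: (frames, n_mels)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # shape: (frames, n_mfcc)
    return fbank, mfcc
```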
 The classification label may be represented by an L-dimensional one-hot vector.
 As shown in FIG. 1, the first learning device 10 includes a speech encoder unit 11, a context network unit 12, a quantization network unit 13, a classification network unit 14, a classification learning loss calculation unit 15, a context representation learning loss calculation unit 16, a learning parameter update unit 17, and model information 10a.
 The model information 10a is the set of parameters of the models used by the first learning device 10. The parameters include neural network weights and biases. In the learning process, the model information 10a is updated as appropriate.
 Given the acoustic feature sequence X = {x_1, ..., x_I}, the speech encoder unit 11 calculates an intermediate representation vector sequence Z = {z_1, ..., z_T} of the speech. Here, I is the sequence length of the acoustic features, and is an integer of 1 or more. T is the sequence length of the intermediate feature vector sequence of the speech, and is an integer of 1 or more.
 The speech encoder unit 11 calculates the intermediate feature vector sequence Z of the speech as in equation (1).
[Math. 1]
 Here, SpeechEncoder() (the speech encoder) is a function with the capability of a neural network, for example a convolutional neural network.
 θ_se1 is a parameter of the speech encoder and is learnable. θ_se1 is read from the model information 10a.
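The patent characterizes SpeechEncoder() only functionally (a neural network, for example a convolutional one). The following PyTorch sketch is one assumed instantiation that maps an acoustic feature sequence X of length I to an intermediate representation sequence Z of length T; the layer sizes, kernel sizes, and strides are illustrative choices, not values from the patent.

```python
# Minimal sketch of SpeechEncoder() as a 1-D convolutional network (assumed architecture).
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    def __init__(self, in_dim: int = 80, hidden_dim: int = 512):
        super().__init__()
        # Two strided 1-D convolutions: the stride reduces the sequence length from I to T.
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, I, in_dim) -> z: (batch, T, hidden_dim)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return z
```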
 The context network unit 12 applies masking to the intermediate feature vector sequence Z, which is the output of the speech encoder unit 11, as in equation (2).
[Math. 2]
 The context network unit 12 converts the masked intermediate feature vector sequence (C with a bar above it) into a context representation C = {c_1, ..., c_I} as in equation (3).
[Math. 3]
 Here, Masking() is a function that performs masking in the time direction.
 ContextNetwork() (the context network) is a function with the capability of a neural network, for example the Transformer described in Non-Patent Document 2.
 θ_se2 is a parameter of the context network and is learnable. θ_se2 is read from the model information 10a.
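Masking() and ContextNetwork() are likewise described only functionally (time-direction masking followed by, for example, a Transformer). The sketch below assumes a simple random time-span mask and a standard Transformer encoder; the mask probability, span length, and layer counts are illustrative assumptions.

```python
# Sketch of Masking() and ContextNetwork() (equations (2) and (3)), under assumed settings.
import torch
import torch.nn as nn

def masking(z: torch.Tensor, mask_prob: float = 0.065, span: int = 10, mask_emb=None) -> torch.Tensor:
    """Randomly replace time spans of the intermediate sequence with a mask vector."""
    b, t, d = z.shape
    masked = z.clone()
    mask_emb = mask_emb if mask_emb is not None else z.new_zeros(d)
    starts = torch.rand(b, t) < mask_prob               # candidate span start positions
    for i in range(b):
        for s in torch.nonzero(starts[i]).flatten().tolist():
            masked[i, s:s + span] = mask_emb             # mask the span in the time direction
    return masked

class ContextNetwork(nn.Module):
    def __init__(self, dim: int = 512, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, z_masked: torch.Tensor) -> torch.Tensor:
        # z_masked: (batch, T, dim) -> context representation C: (batch, T, dim)
        return self.encoder(z_masked)
```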
 The quantization network unit 13 calculates a quantized representation vector sequence Q = {q_1, ..., q_I} of the speech from the intermediate feature vector sequence Z, which is the output of the speech encoder unit 11, as in equation (4).
[Math. 4]
 Here, QuantizationNetwork() (the quantization network) is a function with the capability of a neural network, and is composed of, for example, a fully connected neural network and the Gumbel softmax function.
 The Gumbel softmax function is a differentiable function for propagating the output of a classifier (for example, a fully connected neural network) to a subsequent network. The Gumbel softmax function is described, for example, in Reference 1.
 θ_qn is a parameter of the quantization network and is learnable. θ_qn is read from the model information 10a.
 Reference 1: E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-softmax," ICLR, 2017.
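QuantizationNetwork() is described as, for example, a fully connected neural network combined with the Gumbel softmax function. The sketch below is one assumed realization with a single learned codebook, using torch.nn.functional.gumbel_softmax for the reparameterization of Reference 1; the codebook size and temperature are illustrative.

```python
# Sketch of QuantizationNetwork() (equation (4)): fully connected layer + Gumbel softmax (assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizationNetwork(nn.Module):
    def __init__(self, dim: int = 512, codebook_size: int = 320, tau: float = 1.0):
        super().__init__()
        self.logits = nn.Linear(dim, codebook_size)           # classifier over codebook entries
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.tau = tau

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, T, dim) -> quantized representation Q: (batch, T, dim)
        probs = F.gumbel_softmax(self.logits(z), tau=self.tau, hard=True, dim=-1)
        return probs @ self.codebook                          # differentiable codebook lookup
```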
 The classification network unit 14 calculates a probability sequence O = {o_l1, ..., o_lL} for the classification labels from the context representation vector sequence C, which is the output of the context network unit 12, as in equations (5) and (6). The number of dimensions of the probability sequence O is L. Each element of the probability sequence O corresponds to an element of the classification label set {l_1, ..., l_L}.
[Math. 5]
[Math. 6]
 Here, GRL() is a function representing a gradient reversal layer (see, for example, Reference 1), and is a function that inverts the sign of the gradient during the backward pass of error backpropagation.
 ClassNetwork() (the classification network) is a function with the capability of a neural network, and is composed of, for example, a fully connected neural network and a softmax function.
 θ_cn is a parameter of the classification network and is learnable. θ_cn is read from the model information 10a.
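GRL() is defined only by its behavior: identity in the forward pass and sign inversion of the gradient in the backward pass. The sketch below shows a standard way to realize that behavior in PyTorch, together with a simple ClassNetwork(); mean pooling over time before the softmax is an assumption, since the patent does not state how the context representation sequence is reduced to a single label probability vector.

```python
# Sketch of GRL() and ClassNetwork() (equations (5) and (6)), under the assumptions noted above.
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha: float):
        ctx.alpha = alpha
        return x.clone()                        # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None   # invert (and optionally scale) the gradient

def grl(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    return GradientReversal.apply(x, alpha)

class ClassNetwork(nn.Module):
    def __init__(self, dim: int = 512, num_labels: int = 4):
        super().__init__()
        self.fc = nn.Linear(dim, num_labels)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # c: (batch, T, dim) -> probability sequence O: (batch, num_labels)
        pooled = c.mean(dim=1)                  # assumed time pooling (not specified in the patent)
        return torch.softmax(self.fc(pooled), dim=-1)
```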
 The classification learning loss calculation unit 15 calculates the classification learning loss L_class for the classification label l as in equation (7).
[Math. 7]
 ClassLoss() is a function that calculates a loss for identifying the classification label l, for example the cross-entropy loss.
 The context representation learning loss calculation unit 16 calculates the loss L_context for learning the context representation (the context representation learning loss) as in equation (8).
[Math. 8]
 ContextLoss() is a function that calculates a loss for learning the context representation, for example a contrastive loss.
 The contrastive loss is explained here. Sim() in equation (8) is a function that calculates the similarity between two vectors, for example the cosine similarity. Q̂ (Q with a hat above it) represents the set of negative examples of the quantized vectors. τ is a temperature parameter set in advance.
 In the numerator inside the log on the third side of equation (8), a pair (a positive example) consisting of an element of the quantized representation vector sequence Q and the corresponding element of the context representation vector sequence is used. For example, q_t and c_t are a pair of corresponding elements.
 On the other hand, in the denominator inside the log on the third side of equation (8), pairs (negative examples) consisting of an element of the quantized representation vector sequence Q and non-corresponding elements of the context representation vector sequence are used. For example, q_t and c_t' (where t ≠ t') are a pair of non-corresponding elements.
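ContextLoss() is described as, for example, a contrastive loss built from the cosine similarity Sim(), a temperature τ, the positive pair (q_t, c_t), and negative examples. The sketch below implements that description with negatives taken from other time steps of the same utterance; how the negative set Q̂ is actually sampled is not specified in the patent, so that choice is an assumption.

```python
# Sketch of a contrastive ContextLoss() in the spirit of equation (8) (negative sampling assumed).
import torch
import torch.nn.functional as F

def context_loss(c: torch.Tensor, q: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """c, q: (T, dim) context and quantized sequences for one utterance (masked positions)."""
    sim = F.cosine_similarity(c.unsqueeze(1), q.unsqueeze(0), dim=-1) / tau  # (T, T) similarity matrix
    targets = torch.arange(c.size(0))        # positive pair: q_t matches c_t (the diagonal)
    # Row t of the cross entropy is -log( exp(Sim(c_t, q_t)/tau) / sum_t' exp(Sim(c_t, q_t')/tau) ).
    return F.cross_entropy(sim, targets)
```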
 The learning parameter update unit 17 updates the parameters of the models based on the classification learning loss L_class and the context representation learning loss L_context.
 The classification learning loss L_class and the context representation learning loss L_context are calculated for each mini-batch. Therefore, the learning parameter update unit 17 updates the parameters for each mini-batch.
 Letting the learning parameters be Θ_se = {θ_se1, θ_se2}, Θ_qn = {θ_qn}, and Θ_cn = {θ_cn}, the learning parameter update unit 17 updates the parameters according to equations (9), (10), and (11).
[Math. 9]
[Math. 10]
[Math. 11]
 Here, ε represents the learning rate, β represents the weight for the context representation learning loss, and γ represents the weight for the classification learning loss. α also represents a weight; its value is changed substantially as learning progresses (as updates are repeated in mini-batch units) to adjust the influence of the loss.
 As shown in equation (5), the function GRL() is introduced in the computation of the classification network unit 14, so the sign of the term with α in equation (11) is inverted.
 With equation (11), the learning parameters are updated based on a loss that becomes smaller as the vector the model quantizes from the input approaches the acquired context representation and as the accuracy of identifying meta information based on the acquired context representation increases.
 このように、学習パラメータ更新部17は、モデルが音声の特徴量を量子化したベクトルが、モデルが特徴量から獲得したコンテキスト表現に近いほど小さくなる第1の損失関数(コンテキスト表現学習ロスLcontext)と、コンテキスト表現を基に音声のメタ情報をモデルが識別する精度が高いほど小さくなる第2の損失関数(分類学習ロスLclass)と、を計算する。そして、学習パラメータ更新部17は、第1の損失関数が小さくなり、かつ第2の損失関数が大きくなるように、モデルのパラメータを更新する((11)式)。この場合、学習パラメータ更新部17は、損失関数計算部及び更新部に相当する。 In this way, the learning parameter updating unit 17 uses the first loss function (context representation learning loss L context ) and a second loss function (classification learning loss L class ) that becomes smaller as the accuracy with which the model identifies speech meta information based on the context expression increases. Then, the learning parameter updating unit 17 updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger (Equation (11)). In this case, the learning parameter update section 17 corresponds to a loss function calculation section and an update section.
 さらに具体的には、(11)式の右辺のかっこの中は、第1の損失関数である第1の項(βが付された項)から、第2の損失関数に、更新部によるパラメータの更新が繰り返された回数が多いほど大きくなる重みαが付された第2の項(αが付された項)を引いた損失関数(第3の損失関数)である。 More specifically, in the parentheses on the right side of equation (11), the parameter by the update unit is transferred from the first term (the term with β) which is the first loss function to the second loss function. This is a loss function (third loss function) obtained by subtracting a second term (term with α) attached with a weight α that increases as the number of times the update of is repeated.
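 For illustration, one plausible reading of this third loss function and of the corresponding gradient step, consistent with the description above (equations (9) to (11) themselves are given as images in the original publication and are not reproduced here), is:

$$ L_{3} \;=\; \beta\,L_{\mathrm{context}} \;-\; \alpha\,L_{\mathrm{class}}, \qquad \Theta \;\leftarrow\; \Theta \;-\; \varepsilon\,\frac{\partial L_{3}}{\partial \Theta}. $$

 Which parameter group Θ each of equations (9) to (11) applies to, and where the weight γ enters, follow the original equations and are not reproduced in this sketch.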
 As described above, in this embodiment an adversarial neural network (ANN) is realized by equation (11) in the self-supervised learning process in order to prevent the model from overfitting to the training data. ANNs are described in Reference 1.
 Through this adversarial training, the context network is trained so as not to identify meta information about the speech. As a result, the context network can acquire a universal representation without overfitting to the training data.
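 For illustration only, the following is a minimal PyTorch sketch of a gradient reversal layer of the kind used in such adversarial training, together with one assumed schedule for the weight α that grows as mini-batch updates are repeated; the class name, the helper functions, and the schedule are assumptions for illustration and are not part of the disclosed configuration.

```python
import torch


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -alpha in the backward pass."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The classification network is trained normally on the meta-information
        # labels, while the network upstream of the GRL receives the reversed
        # gradient, which discourages it from encoding the meta information.
        return -ctx.alpha * grad_output, None


def grl(x, alpha):
    return GradientReversal.apply(x, alpha)


def alpha_schedule(step, total_steps, alpha_max=1.0):
    # Assumed schedule: alpha grows as updates are repeated, as described above.
    return alpha_max * min(1.0, step / total_steps)
```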
 For example, when the classification label l indicates the domain of the speech, the context network according to this embodiment operates robustly even on speech from unknown domains.
 The meta information includes the domain of the speech, characteristics of the speech (such as the language), and attributes of the speaker (such as gender and age); it is information distinct from the content of the utterance expressed as text or the like. The content of the utterance can also be rephrased as the content of the information conveyed by the speech.
 After the learning parameter updating unit 17 has updated the parameters θse1, θse2, θqn, and θcn, the processing is repeated using the updated parameters. The iteration ends when a predetermined condition (for example, a number of repetitions) is satisfied.
 Here, the context network unit 12 is an example of a context representation calculation unit that inputs the speech features into the model and calculates a context representation. The classification network unit 14 is an example of a meta information label calculation unit that inputs the context representation into the model and calculates a label specifying the meta information of the speech. The quantization network unit 13 is an example of a quantized vector calculation unit that inputs the speech features into the model and calculates a quantized vector.
 Accordingly, the learning parameter updating unit 17 can be said to calculate the first loss function so that it becomes smaller as the vector calculated by the quantized vector calculation unit approaches the context representation calculated by the context representation calculation unit, and to calculate the second loss function so that it becomes smaller as the label calculation accuracy of the meta information label calculation unit increases.
 The configuration of the second learning device 20 will be described with reference to FIG. 2, which shows a configuration example of the second learning device. The second learning device 20 uses the parameters updated by the first learning device 10 to train a relearning model that executes a task related to speech. The model in the first learning device 10 combines a speech encoder, a context network, a quantization network, and a classification network, whereas the relearning model combines a speech encoder, a context network, and an additional network.
 As shown in FIG. 2, pairs of an acoustic feature sequence X and a downstream task label l′ are input to the second learning device 20 as training data. The downstream task label l′ is the correct label.
 The downstream task label l′ corresponds to information appropriate to the task and does not have to indicate meta information.
 For example, when the task is speech recognition, which obtains text from speech, the downstream task label l′ is the text corresponding to the speech. Like the classification label l, the downstream task label l′ may also indicate meta information. The processing unit of the text corresponding to the speech in the downstream task label l′ may be phonemes, characters, or words.
 As shown in FIG. 2, the second learning device 20 has a speech encoder unit 21, a context network unit 22, an additional network unit 23, a downstream task learning loss calculation unit 24, a learning parameter updating unit 25, and model information 20a.
 The model information 20a consists of the parameters of the model trained by the first learning device 10 and includes at least the parameters θse1 and θse2. The model information 20a further includes a parameter θadd of an additional network that depends on the task.
 Like the speech encoder unit 11, the speech encoder unit 21 calculates an intermediate representation vector sequence Z of the speech when given an acoustic feature sequence X = {x1, ..., xI}.
 The speech encoder unit 21 calculates the intermediate feature vector sequence Z of the speech as in equation (1).
 Here, θse1 is the speech encoder parameter already updated by the first learning device 10 and is read from the model information 20a.
 Like the context network unit 12, the context network unit 22 converts the intermediate feature vector sequence Z output by the speech encoder unit 21 into a context representation C as in equation (12). Unlike the context network unit 12, however, the context network unit 22 does not perform masking.
[Equation (12)] — given as an image (JPOXMLDOC01-appb-M000012) in the original publication and not reproduced here.
 θse2 is the context network parameter already updated by the first learning device 10 and is read from the model information 20a.
 The additional network unit 23 calculates a probability sequence P (a sequence of predicted probabilities) for the downstream task labels from the context representation vector sequence C output by the context network unit 22, as in equation (13).
[Equation (13)] — given as an image (JPOXMLDOC01-appb-M000013) in the original publication and not reproduced here.
 ClassNetwork( ) in equation (13) is a classification network different from that of the first learning device 10, and it is trained in the second learning device 20.
 For example, the classification network of the second learning device 20 is a function with the capability of a neural network and is composed of, for example, a bidirectional LSTM and a softmax function.
 θadd is the parameter of the downstream-task classification network and is learnable; θadd is read from the model information 20a.
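 As a concrete illustration of such an additional network, the following is a minimal PyTorch sketch of a bidirectional LSTM followed by a linear layer and a softmax; the module name, the layer sizes, and the single-layer configuration are assumptions for illustration.

```python
import torch
import torch.nn as nn


class AdditionalNetwork(nn.Module):
    """Bidirectional LSTM + softmax over the downstream task labels (illustrative sizes)."""

    def __init__(self, context_dim, hidden_dim, num_labels):
        super().__init__()
        self.blstm = nn.LSTM(context_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, context):
        # context: (batch, frames, context_dim) -> probability sequence P
        h, _ = self.blstm(context)
        return torch.softmax(self.proj(h), dim=-1)
```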
 The downstream task learning loss calculation unit 24 calculates the downstream task learning loss Ldown for the downstream task label l′ as in equation (14).
[Equation (14)] — given as an image (JPOXMLDOC01-appb-M000014) in the original publication and not reproduced here.
 Loss( ) is a function that calculates the loss of the downstream task (for example, a classification loss), such as the cross-entropy loss. Loss( ) is changed as appropriate according to the type of downstream task (classification, generation, prediction, and so on).
 The learning parameter updating unit 25 updates the parameters of the model based on the downstream task loss Ldown.
 The learning parameter updating unit 25 may keep some parameters fixed and update the others. For example, the learning parameter updating unit 25 updates the parameter θadd without updating the parameters θse1 and θse2.
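 A minimal sketch of this choice, assuming module objects named encoder, context_network, and additional_network for the components holding θse1, θse2, and θadd; the optimizer and the learning rate are assumptions for illustration.

```python
import torch

# Fix the pre-trained parameters (theta_se1, theta_se2)...
for module in (encoder, context_network):
    for p in module.parameters():
        p.requires_grad = False

# ...and update only the additional network's parameters (theta_add).
optimizer = torch.optim.Adam(additional_network.parameters(), lr=1e-4)
```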
 The downstream task loss Ldown is calculated for each mini-batch, so the learning parameter updating unit 25 updates the parameters for each mini-batch.
 After the learning parameter updating unit 25 has updated the parameters, the processing is repeated using the updated parameters. The iteration ends when a predetermined condition (for example, a number of repetitions) is satisfied.
 Inference processing using the trained models will be described with reference to FIG. 3, which shows a configuration example of the inference device. The inference device 50 executes the task using the relearning model.
 As shown in FIG. 3, an acoustic feature sequence X is input to the inference device 50. For example, the inference device 50 estimates the label corresponding to the acoustic feature sequence X.
 As shown in FIG. 3, the inference device 50 has a speech encoder unit 51, a context network unit 52, an additional network unit 53, and model information 50a.
 The model information 50a consists of the parameters of the models trained by the first learning device 10 and the second learning device 20. It includes the trained speech encoder parameter θse1, the trained context network parameter θse2, and the trained additional network parameter θadd.
 Like the speech encoder unit 21, the speech encoder unit 51 calculates an intermediate representation vector sequence Z of the speech when given an acoustic feature sequence X = {x1, ..., xI}.
 Like the context network unit 22, the context network unit 52 converts the intermediate feature vector sequence Z output by the speech encoder unit 51 into a context representation C.
 The additional network unit 53 calculates a probability sequence P (a sequence of predicted probabilities) for the downstream task labels from the context representation vector sequence C output by the context network unit 52.
 The additional network unit 53 outputs a classification result based on the probability sequence P. It may output the probability sequence P itself, or it may output information specifying the downstream task label corresponding to the element with the largest value in the probability sequence P.
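 A minimal sketch of this inference path, with assumed module and variable names; selecting the element with the largest value corresponds to an argmax over the label dimension of the probability sequence P.

```python
import torch

with torch.no_grad():
    Z = encoder(X)                    # intermediate representation vector sequence
    C = context_network(Z)            # context representation (no masking at inference)
    P = additional_network(C)         # probability sequence for the downstream task labels
    predicted_labels = P.argmax(dim=-1)   # label with the largest predicted probability
```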
[Processing of the first embodiment]
 The flow of the learning processing and the inference processing of the first embodiment will be described with reference to FIGS. 4, 5, 6, and 7.
 FIG. 4 is a flowchart showing the overall flow of the learning processing. As shown in FIG. 4, the first learning device 10 first performs pre-training of the speech encoder, the context network, the quantization network, and the classification network (step S1).
 Next, the second learning device 20 trains the additional network using the trained speech encoder and context network (step S2). At this time, the speech encoder and the context network may also be retrained.
 FIG. 5 is a flowchart showing the flow of the self-supervised learning processing, which corresponds to step S1 in FIG. 4.
 As shown in FIG. 5, the first learning device 10 first inputs the acoustic feature sequence to the speech encoder and calculates the intermediate representation vector sequence (step S101).
 Next, the first learning device 10 applies masking to the intermediate representation vector sequence, inputs it to the context network, and calculates the context representation vector sequence (step S102).
 The first learning device 10 also inputs the intermediate representation vector sequence to the quantization network and calculates the quantized representation vector sequence (step S103).
 Subsequently, the first learning device 10 applies the GRL to the context representation vector sequence, inputs it to the classification network, and calculates the probability sequence for the classification labels of the meta information (step S104).
 The first learning device 10 then calculates the classification learning loss based on the calculated probability sequence and the correct classification label of the meta information (step S105).
 The first learning device 10 also calculates the context representation learning loss based on the context representation vector sequence and the quantized representation vector sequence (step S106).
 Furthermore, the first learning device 10 updates the parameters of the speech encoder, the context network, the quantization network, and the classification network based on the classification learning loss and the context representation learning loss (step S107).
 If the termination condition is satisfied (step S108, Yes), the first learning device 10 ends the processing. If the termination condition is not satisfied (step S108, No), the first learning device 10 returns to step S101 and repeats the processing using the model with the updated parameters.
 The termination condition is, for example, that the processing has been repeated a certain number of times or that the amount of parameter updates has converged.
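 For illustration, the following sketch gathers steps S101 to S107 into one mini-batch update, written over assumed components (encoder, context_net, quant_net, class_net, the loss functions, and mask_fn) and reusing the grl helper sketched earlier; the combination of the weights β, γ, and α is one plausible reading, with the GRL handling the sign reversal described for equation (11).

```python
def train_step(batch, encoder, context_net, quant_net, class_net,
               context_loss_fn, class_loss_fn, optimizer,
               alpha, beta, gamma, mask_fn):
    Z = encoder(batch["features"])                  # S101: intermediate representations
    C = context_net(mask_fn(Z))                     # S102: masked context representations
    Q = quant_net(Z)                                # S103: quantized representations
    logits = class_net(grl(C, alpha))               # S104: GRL + classification network
    loss_class = class_loss_fn(logits, batch["meta_labels"])  # S105: classification loss
    loss_context = context_loss_fn(C, Q)            # S106: context representation loss
    loss = beta * loss_context + gamma * loss_class  # S107: combined objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```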
 FIG. 6 is a flowchart showing the flow of the relearning processing, which corresponds to step S2 in FIG. 4.
 As shown in FIG. 6, the second learning device 20 first inputs the acoustic feature sequence to the speech encoder and calculates the intermediate representation vector sequence (step S201).
 Next, the second learning device 20 inputs the intermediate representation vector sequence to the context network and calculates the context representation vector sequence (step S202).
 Subsequently, the second learning device 20 inputs the context representation vector sequence to the additional network and calculates the probability sequence for the classification labels corresponding to the task (step S203).
 The second learning device 20 then calculates the additional learning loss based on the calculated probability sequence and the correct classification label corresponding to the task (step S204).
 Furthermore, the second learning device 20 updates the parameters of the additional network based on the additional learning loss (step S205). At this time, the speech encoder and the context network may also be retrained.
 If the termination condition is satisfied (step S206, Yes), the second learning device 20 ends the processing. If the termination condition is not satisfied (step S206, No), the second learning device 20 returns to step S201 and repeats the processing using the model with the updated parameters.
 The termination condition is, for example, that the processing has been repeated a certain number of times or that the amount of parameter updates has converged.
 FIG. 7 is a flowchart showing the flow of the inference processing.
 As shown in FIG. 7, the inference device 50 first inputs the acoustic feature sequence to the speech encoder and calculates the intermediate representation vector sequence (step S501).
 Next, the inference device 50 inputs the intermediate representation vector sequence to the context network and calculates the context representation vector sequence (step S502).
 Subsequently, the inference device 50 inputs the context representation vector sequence to the additional network and calculates the probability sequence for the classification labels corresponding to the task (step S503).
 The inference device 50 then outputs a classification result based on the calculated probability sequence (step S504).
[Effects of the first embodiment]
 As described above, the first learning device 10 calculates a first loss function that becomes smaller as the vector the model obtains by quantizing the speech features approaches the context representation the model acquires from those features, and a second loss function that becomes smaller as the accuracy with which the model identifies the meta information of the speech from the context representation increases. The first learning device 10 updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger. This prevents the context network from overfitting to the training data and suppresses the degradation in accuracy of tasks downstream of the self-supervised learning.
 The first learning device 10 calculates a third loss function by subtracting, from a first term that is the first loss function, a second term in which the second loss function is given a weight that increases as the parameter updates are repeated, and updates the parameters of the model so that the third loss function becomes smaller. As a result, even when the accuracy of the classification network improves as learning progresses, its influence can be reduced.
 The first learning device 10 uses the updated parameters to train a relearning model that executes a task related to speech. This allows the downstream task performed by the additional network to be executed with high accuracy.
 The first learning device 10 inputs the speech features into the model and calculates a context representation, inputs the context representation into the model and calculates a label specifying the meta information of the speech, and inputs the speech features into the model and calculates a quantized vector. The first learning device 10 calculates the first loss function so that it becomes smaller as the calculated vector approaches the calculated context representation, and calculates the second loss function so that it becomes smaller as the calculation accuracy of the label increases. This allows the first learning device 10 to consistently perform the calculations using the model and the parameter updates.
 The speech processing device according to the first embodiment provides a specific improvement over conventional machine learning methods such as that described in Non-Patent Document 1, and represents an advance in the technical field of speech tasks using machine learning models.
[System configuration, etc.]
 The components of the illustrated devices are functional and conceptual, and do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one; all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU (Central Processing Unit) and a program analyzed and executed by that CPU, or as hardware based on wired logic. The program may be executed not only by a CPU but also by another processor such as a GPU.
 Of the processes described in this embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the documents and drawings above may be changed arbitrarily unless otherwise specified.
[Program]
 In one embodiment, the speech processing device (the first learning device 10, the second learning device 20, or the inference device 50) can be implemented by installing, on a desired computer, a program that executes the above processing as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can be made to function as the speech processing device. The information processing device here includes desktop and notebook personal computers, as well as mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, and slate terminals such as PDAs (Personal Digital Assistants).
 The speech processing device can also be implemented as a learning server device that takes a terminal device used by a user as a client and provides the client with services related to the learning processing described above. For example, the learning server device is implemented as a server device that provides a learning service taking training data as input and outputting the parameters of the trained model. In this case, the learning server device may be implemented as a Web server, or as a cloud that provides services related to the learning processing by outsourcing.
 FIG. 8 is a diagram showing an example of a computer that executes the learning program. The computer 1000 has, for example, a memory 1010 and a CPU 1020, as well as a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.
 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
 The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, a program that defines each process of the devices described above is implemented as a program module 1093 in which computer-executable code is written. The program module 1093 is stored, for example, in the hard disk drive 1090; for example, a program module 1093 for executing processing similar to the functional configurations of the devices described above is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
 The setting data used in the processing of the embodiment described above is stored as program data 1094, for example, in the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes the processing of the embodiment described above.
 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read by the CPU 1020 via the network interface 1070.
 The following supplementary notes are further disclosed with respect to the above embodiments.
 (Supplementary Note 1)
 A speech processing device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor:
 calculates a first loss function that becomes smaller as a vector obtained by a model quantizing speech features approaches a context representation that the model acquires from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases; and
 updates parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
 (Supplementary Note 2)
 The speech processing device according to Supplementary Note 1, wherein the processor:
 calculates a third loss function by subtracting, from a first term that is the first loss function, a second term in which the second loss function is given a weight that increases as the number of repeated parameter updates increases; and
 updates the parameters of the model so that the third loss function becomes smaller.
 (Supplementary Note 3)
 The speech processing device according to Supplementary Note 1, wherein the processor uses the updated parameters to train a relearning model that executes a task related to speech.
 (Supplementary Note 4)
 The speech processing device according to Supplementary Note 1, wherein the processor:
 inputs the speech features into the model and calculates a context representation;
 inputs the context representation into the model and calculates a label specifying the meta information of the speech;
 inputs the speech features into the model and calculates a quantized vector; and
 calculates the first loss function so that it becomes smaller as the calculated vector approaches the calculated context representation, and calculates the second loss function so that it becomes smaller as the calculation accuracy of the label increases.
 (Supplementary Note 5)
 The speech processing device according to Supplementary Note 3, wherein the processor executes the task using the relearning model.
 (Supplementary Note 6)
 A non-transitory storage medium storing a program executable by a computer to perform speech processing, the speech processing comprising:
 calculating a first loss function that becomes smaller as a vector obtained by a model quantizing speech features approaches a context representation that the model acquires from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases; and
 updating parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
 (Supplementary Note 7)
 An inference device comprising an inference unit that performs inference processing on speech using a relearning model trained using the parameters of a model trained by pre-training processing that calculates a first loss function that becomes smaller as a vector obtained by the model quantizing speech features approaches a context representation that the model acquires from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases, and that updates the parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
 10 First learning device
 10a, 20a, 50a Model information
 11, 21, 51 Speech encoder unit
 12, 22, 52 Context network unit
 13 Quantization network unit
 14 Classification network unit
 15 Classification learning loss calculation unit
 16 Context representation learning loss calculation unit
 17, 25 Learning parameter updating unit
 24 Downstream task learning loss calculation unit
 23, 53 Additional network unit

Claims (7)

  1. A speech processing device comprising:
     a loss function calculation unit that calculates a first loss function that becomes smaller as a vector obtained by a model quantizing speech features approaches a context representation that the model acquires from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases; and
     an updating unit that updates parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  2. The speech processing device according to claim 1, wherein the loss function calculation unit calculates a third loss function by subtracting, from a first term that is the first loss function, a second term in which the second loss function is given a weight that increases as the number of times the updating unit has repeated the parameter update increases, and
     the updating unit updates the parameters of the model so that the third loss function becomes smaller.
  3. The speech processing device according to claim 1, further comprising an additional learning unit that uses the parameters updated by the updating unit to train a relearning model that executes a task related to speech.
  4. The speech processing device according to claim 1, further comprising:
     a context representation calculation unit that inputs the speech features into the model and calculates a context representation;
     a meta information label calculation unit that inputs the context representation into the model and calculates a label specifying the meta information of the speech; and
     a quantized vector calculation unit that inputs the speech features into the model and calculates a quantized vector,
     wherein the loss function calculation unit calculates the first loss function so that it becomes smaller as the vector calculated by the quantized vector calculation unit approaches the context representation calculated by the context representation calculation unit, and calculates the second loss function so that it becomes smaller as the label calculation accuracy of the meta information label calculation unit increases.
  5. The speech processing device according to claim 3, further comprising an inference unit that executes the task using the relearning model.
  6. A speech processing method executed by a speech processing device, the method comprising:
     a loss function calculation step of calculating a first loss function that becomes smaller as a vector obtained by a model quantizing speech features approaches a context representation that the model acquires from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases; and
     an updating step of updating parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
  7. A speech processing program for causing a computer to execute:
     a loss function calculation step of calculating a first loss function that becomes smaller as a vector obtained by a model quantizing speech features approaches a context representation that the model acquires from the features, and a second loss function that becomes smaller as the accuracy with which the model identifies meta information of the speech based on the context representation increases; and
     an updating step of updating parameters of the model so that the first loss function becomes smaller and the second loss function becomes larger.
PCT/JP2022/028843 2022-07-26 2022-07-26 Speech processing device, speech processing method, and speech processing program WO2024023946A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/028843 WO2024023946A1 (en) 2022-07-26 2022-07-26 Speech processing device, speech processing method, and speech processing program


Publications (1)

Publication Number Publication Date
WO2024023946A1 true WO2024023946A1 (en) 2024-02-01

Family

ID=89705831


Country Status (1)

Country Link
WO (1) WO2024023946A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951213A (en) * 2021-02-09 2021-06-11 中国科学院自动化研究所 End-to-end online voice detection and recognition method, system and equipment
CN113327595A (en) * 2021-06-16 2021-08-31 北京语言大学 Pronunciation deviation detection method and device and storage medium
WO2022044243A1 (en) * 2020-08-28 2022-03-03 日本電信電話株式会社 Training device, inference device, methods therefor, and program



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22953041

Country of ref document: EP

Kind code of ref document: A1