WO2020043123A1 - Named entity recognition method, named entity recognition apparatus, device and medium - Google Patents
- Publication number
- WO2020043123A1 (PCT/CN2019/103027)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature vector
- speech
- word
- speech signal
- text
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Definitions
- the present disclosure relates to the field of artificial intelligence, and more particularly, to a named entity recognition method, a named entity recognition apparatus, a named entity recognition device, and a medium.
- the commonly used entity recognition methods mostly target speech recognition in broad application scenarios.
- in special scenarios, such as those involving complex special names of artworks, books, or foreign people,
- existing named entity recognition methods struggle to recognize the entities well.
- a named entity recognition method, which includes: collecting a speech signal; extracting a speech feature vector from the speech signal; extracting a text feature vector from the text result obtained by performing speech recognition on the speech signal; splicing the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal; and processing the composite feature vector of each word in the speech signal through a deep learning model to obtain a recognition result of the named entity.
- extracting a speech feature vector in a speech signal includes extracting a speech sentence feature vector and extracting a speech word feature vector.
- extracting a speech sentence feature vector in a speech signal includes: converting the sentence features of the speech in the speech signal into the corresponding speech sentence feature vector according to a speech parameter comparison table in a preset speech sentence library.
- extracting the speech word feature vector in the speech signal includes: performing speech analysis on the speech signal to obtain the speech word feature vector in the speech signal.
- performing speech analysis on the speech signal includes: performing discrete sampling on the speech signal in the time and frequency domains to obtain a digital speech signal; processing each word in the digital speech signal separately in the time domain and the frequency domain to obtain its time-domain feature vector and frequency-domain feature vector; and, for each word in the speech signal, splicing its time-domain feature vector and frequency-domain feature vector to obtain the speech word feature vector corresponding to that word.
- extracting the text feature vector from the text result obtained by performing speech recognition on the speech signal includes extracting a character feature vector and extracting a word segmentation embedding feature vector.
- extracting the character feature vector in the text result includes: converting each character into its corresponding character feature vector according to a character-to-vector-value comparison table in a preset character library.
- extracting the word segmentation embedding feature vector in the text result includes: dividing the text result into phrases and single words according to a phrase comparison table in a preset lexicon; and, according to a preset transformation rule, converting each word in a phrase and each single word into its corresponding word segmentation embedding feature vector.
- splicing the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal includes: normalizing the extracted speech feature vector and text feature vector separately; and performing vector splicing on the normalized dense text feature vector and dense speech feature vector of each word in the speech signal to obtain a composite feature vector for each word in the speech signal.
- alternatively, splicing the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal includes: performing vector splicing on the obtained dense text feature vector and dense speech feature vector of each word in the speech signal to obtain a composite feature vector for each word in the speech signal; and normalizing the speech feature vector and the text feature vector within the obtained composite feature vector separately.
- normalizing the extracted speech feature vector and text feature vector separately includes: performing linear function normalization on the speech feature vector and the text feature vector, respectively.
- normalizing the extracted speech feature vector and text feature vector separately includes: performing zero-mean normalization on the speech feature vector and the text feature vector, respectively.
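The two normalization options named above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that "linear function normalization" means min-max scaling to [0, 1] and that zero-mean normalization means the usual z-score using the population standard deviation. The function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Linear (min-max) normalization: rescale values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zero_mean_normalize(values):
    """Zero-mean normalization (z-score): subtract the mean, divide by the std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant input: every value equals the mean (assumed convention).
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    print(min_max_normalize([2.0, 4.0, 6.0]))
    print(zero_mean_normalize([1.0, 2.0, 3.0]))
```

Applying either function to the speech features and to the text features separately keeps the two modalities on comparable scales before or after splicing.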
- processing the composite feature vector of each word in the speech signal through a deep learning model to obtain the recognition result of the named entity includes: sending the composite feature vector to the input of the selected deep learning model; processing the composite feature vector through each layer of the selected deep learning model; and obtaining the recognition result of the named entity at the output of the deep learning model.
- when the speech signal includes multiple sentences, before the composite feature vector of each word in the speech signal is processed through a deep learning model to obtain the recognition result of the named entity, the method further includes: truncating all the obtained composite feature vectors of the speech signal according to the sentence length feature value of the current sentence in the speech signal to obtain multiple composite feature vector sequences, where the number of composite feature vector sequences equals the number of sentences contained in the speech signal, and the number of composite feature vectors in each sequence equals the sentence length feature value of the corresponding sentence in the speech signal.
- the sentence length feature value of the current sentence in the voice signal is obtained from a voice feature vector in the voice signal.
- the feature value of the sentence length of the current sentence in the speech signal is obtained from the text result after speech recognition of the speech signal.
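The truncation step described above can be sketched as a simple split of the per-word composite vectors by sentence length. This is an illustrative sketch only; the function name and the strict length check are assumptions, and the sentence lengths are taken as given (from the speech features or from the text result).

```python
def split_by_sentence_lengths(vectors, sentence_lengths):
    """Cut a flat list of per-word composite vectors into one sequence per sentence.

    sentence_lengths[i] is the sentence length feature value (word count)
    of the i-th sentence in the speech signal.
    """
    if sum(sentence_lengths) != len(vectors):
        # Assumed convention: lengths must account for every word exactly once.
        raise ValueError("sentence lengths do not match the number of vectors")
    sequences = []
    start = 0
    for length in sentence_lengths:
        sequences.append(vectors[start:start + length])
        start += length
    return sequences


if __name__ == "__main__":
    composite = [[0.1], [0.2], [0.3], [0.4], [0.5]]
    for sequence in split_by_sentence_lengths(composite, [2, 3]):
        print(sequence)
```

The result has as many sequences as sentences, and each sequence has as many vectors as its sentence length feature value, matching the two conditions stated in the text.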
- a named entity recognition apparatus including: a speech signal collector configured to collect a speech signal; a speech feature vector extractor configured to extract a speech feature vector from the speech signal; a text feature vector extractor configured to extract a text feature vector from the text result obtained by performing speech recognition on the speech signal; a composite vector generator configured to splice the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal; and a named entity recognizer configured to process the composite feature vector of each word in the speech signal through a deep learning model to obtain a recognition result of the named entity.
- a named entity recognition device, wherein the device includes a voice collection device, a processor, and a memory, the memory includes a set of instructions, and the set of instructions, when executed by the processor, causes the named entity recognition device to execute the method described above.
- a computer-readable storage medium characterized in that computer-readable instructions are stored thereon, and the method as described above is performed when the instructions are executed by a computer.
- FIG. 1 illustrates an exemplary block diagram of a named entity recognition device according to an embodiment of the present disclosure
- FIG. 2 illustrates an exemplary flowchart of a named entity recognition method according to an embodiment of the present disclosure
- FIG. 3A shows a schematic diagram of extracting feature vectors of a speech sentence in a speech signal in a special scene according to an embodiment of the present disclosure
- FIG. 3B shows a schematic diagram of extracting a speech word feature vector in a speech signal in a special scene according to an embodiment of the present disclosure
- FIG. 3C illustrates an exemplary flowchart of extracting a speech word feature vector in a speech signal according to an embodiment of the present disclosure
- FIG. 4 shows an exemplary flowchart of extracting text feature vectors in a speech signal according to an embodiment of the present disclosure
- FIG. 5 illustrates an exemplary flowchart of splicing a speech feature vector and a text feature vector according to an embodiment of the present disclosure
- FIG. 6 shows a schematic diagram of truncating all composite feature vectors of the speech signal to obtain multiple composite feature vector sequences according to an embodiment of the present disclosure
- FIG. 7 shows a schematic block diagram of a named entity recognition apparatus according to an embodiment of the present disclosure.
- any number of different modules in a system according to embodiments of the present disclosure may be used and run on a user terminal and/or server.
- the modules are merely illustrative, and different aspects of the systems and methods may use different modules.
- flowcharts are used in the present disclosure to explain operations performed by a system according to an embodiment of the present disclosure. It should be understood that the preceding or following operations are not necessarily performed precisely in sequence. Instead, the various steps can be processed in reverse order or simultaneously, as needed. Other operations can also be added to these processes, or one or more steps can be removed from them.
- the present disclosure provides a named entity recognition method, apparatus, device, and medium. Voice information that is not contained in the text, such as accent, pauses, and intonation, is normalized and integrated into the deep learning model to jointly guide the named entity recognition process.
- this addresses the adverse effect of complex special names on judging sentence structure and identifying entities in special scenarios, improves the accuracy of entity recognition, and broadens the range of applications of entity recognition.
- FIG. 1 illustrates an exemplary block diagram of a named entity recognition device according to an embodiment of the present disclosure.
- the named entity recognition device 100 shown in FIG. 1 may be implemented to include a voice collection device 130 and one or more dedicated or general-purpose computer processing system modules or components.
- the voice collection device, such as a microphone component, may include a microphone, a microphone sleeve, a mounting rod, a connecting line, and so on; it can also be a wireless microphone or a microphone circuit.
- the one or more dedicated or general-purpose computer processing system modules or components, such as personal computers, laptops, tablets, mobile phones, personal digital assistants (PDAs), smart glasses, smart watches, smart rings, smart helmets, and any other smart portable device, may include at least one processor 110 and a memory 120.
- the at least one processor is configured to execute a program instruction.
- the memory 120 may exist in different forms of program storage units and data storage units in the named entity recognition device 100, such as a hard disk, a read-only memory (ROM), and a random access memory (RAM), which can be used to store the various data files used when the processor processes and/or executes named entity recognition, as well as the program instructions executed by the processor.
- the named entity recognition device 100 may further include an input/output component to support the input/output data flow between the named entity recognition device 100 and other components (such as a screen display device).
- the named entity recognition device 100 may also send information and data to, and receive them from, a network through a communication port.
- the named entity recognition device 100 may collect a voice signal generated in a specific surrounding environment and perform the named entity recognition method described below on the received voice signal to implement the functions of the named entity recognition device described above.
- the voice signal in the specific scene may be a human voice signal, for example a commentary in a museum or an art exhibition, a discussion on the appreciation of paintings and calligraphy, a lecture on a historical figure or in a history course, and the like.
- although the processor 110, the memory 120, and the voice collection device 130 are presented as separate modules, those skilled in the art will understand that these device modules may be implemented as separate hardware devices or integrated into one or more hardware devices, for example integrated in a smart watch or another smart device. As long as the principles described in this disclosure can be implemented, the specific implementation of different hardware devices should not be taken as a factor limiting the scope of protection of this disclosure.
- the voice signal collected by the voice collection device 130 may include a large number of complex special names.
- for example, the voice collected in a specific scene of painting and calligraphy appreciation may include complicated painting and book titles such as "The Dawns Here Are Quiet", "Walking in the Rain", and "What I Talk About When I Talk About Running".
- FIG. 2 illustrates an exemplary flowchart of a named entity recognition method according to an embodiment of the present disclosure.
- a voice signal in a specific scene is collected.
- the voice signal in this specific scene may be collected by a separate voice acquisition device, or may be acquired by a voice acquisition module integrated with a computer processing system.
- the embodiments of the present disclosure are not limited by the source and acquisition method of the voice signal. For example, it can be collected by a separate microphone, or it can also be collected by a microphone circuit integrated with a computer processing system.
- a speech feature vector in the speech signal is extracted.
- the extracted speech feature vector in the speech signal can be obtained, for example, by performing time and frequency domain feature extraction on the speech signal, or can be obtained by filtering and windowing the speech signal.
- speech recognition is further performed on the speech signal to obtain a text result, and a text feature vector is extracted from the text result.
- the voice recognition may be implemented by, for example, a deep learning algorithm or other voice signal recognition methods, and the embodiments of the present disclosure are not limited by the voice recognition method and process.
- the text feature vector extracted from the text result can be obtained, for example, by comparing the text with a character library or lexicon to identify named entities, or by judging the sentence structure.
- steps S202 and S203 may be performed in parallel, or performed sequentially, and no limitation is imposed here. Further, as required, steps S202 and S203 may be performed based on different voice signals obtained after preprocessing, as long as these voice signals are derived from the same original voice signal.
- in step S204, the speech feature vector and the text feature vector are spliced to obtain a composite feature vector of each word in the speech signal.
- for example, the two feature vectors can be directly concatenated to form the composite feature vector, or the speech feature vector and the text feature vector can first be normalized and the normalized vectors then spliced.
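The splicing in step S204 can be sketched as a per-word concatenation. This is a minimal sketch under stated assumptions: both inputs hold one vector per word in the same order, and the function name `splice_per_word` is hypothetical.

```python
def splice_per_word(speech_vectors, text_vectors):
    """Concatenate the speech and text feature vectors of each word.

    Both arguments are lists with one feature vector per word, aligned by
    position; the result is one composite feature vector per word.
    """
    if len(speech_vectors) != len(text_vectors):
        raise ValueError("speech and text features must cover the same words")
    return [list(s) + list(t) for s, t in zip(speech_vectors, text_vectors)]


if __name__ == "__main__":
    speech = [[0.2, 0.9], [0.4, 0.1]]   # illustrative speech features for two words
    text = [[1.0], [0.0]]               # illustrative text features for the same words
    print(splice_per_word(speech, text))
```

If normalization is applied first, the same function can be called on the normalized vectors without change.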
- the composite feature vector of each word in the speech signal is processed by a deep learning model to obtain a recognition result of the named entity.
- the deep learning model can be a model based on statistical methods, such as a Hidden Markov Model (HMM), Maximum Entropy (ME), or a Support Vector Machine (SVM), or a model based on the relations between samples in a time series, such as a long short-term memory network (LSTM) or a recurrent neural network (RNN).
- extracting a speech feature vector in a speech signal may further include extracting a speech sentence vector and extracting a speech word vector.
- extracting a speech sentence vector may, for example, mean extracting prosodic features such as the fundamental frequency, speech rate, and formants from the speech signal, or extracting features related to the speech signal spectrum, such as Mel-frequency cepstral coefficients (MFCC).
- extracting a speech word vector may, for example, mean segmenting the speech signal by word and extracting the pronunciation duration, start time, and end time of each word; the maximum pronunciation frequency, the maximum sound intensity, and the average of the sound intensity integral of each word in the speech signal can also be extracted. This is further described below with reference to FIGS. 3A and 3B.
- FIG. 3A shows a schematic diagram of extracting a speech sentence feature vector in a speech signal in a special scene according to an embodiment of the present disclosure.
- for example, the special scene is an art exhibition, and the voice signal is the commentary given at the exhibition.
- extracting the speech sentence feature vector from the voice signal can further be set to extracting frame-level low-level feature vectors. More specifically, for example, for each sentence in the speech signal of the current art exhibition, the fundamental frequency features, sound quality features, and Mel-frequency cepstral coefficients (MFCC) are extracted, yielding a sentence feature vector for each sentence.
- the fundamental frequency features reflect the overall characteristics of the voice.
- for example, the fundamental frequency of women's voices is generally higher than that of men's, so it can be used to distinguish the speaker's gender and, after further processing, to remove mixed-in noise from voices of the other gender, which yields a more accurate sentence length feature value. This feature can help correct the sentence length in the text result of speech recognition and avoid recognition errors caused by ambient sounds or other human voices in the collected speech.
- the Mel frequency cepstrum coefficient can be further used to identify sound attributes and distinguish different human voices existing in speech.
- when the speech sentence feature vector is extracted from the speech signal, the extracted speech features can be converted into the corresponding speech sentence feature vector by comparing them with the speech parameter comparison table in a preset speech sentence library. For example, if the preset comparison table sets the frequency range of male voices to 100 to 480 Hz and the frequency range of female voices to 160 to 1000 Hz, the fundamental frequency features extracted from a sentence can be classified and identified.
- for example, if the average frequency of the current sentence is 780 Hz, it can be judged to be a female voice, and the feature vector value can be obtained from the preset rules of the table; for instance, the corresponding speech sentence feature vector value is assigned 1.
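The table lookup in this example can be sketched as follows. The two frequency ranges and the value 1 for a female voice come from the text; everything else is an assumption for illustration, in particular the value 0 for a male voice and the choice to return `None` when the frequency falls in the overlap of the two ranges (160 to 480 Hz) or outside both.

```python
# Frequency ranges from the example in the text (Hz).
VOICE_RANGES_HZ = {"male": (100.0, 480.0), "female": (160.0, 1000.0)}

# Feature values: 1 for female is from the text; 0 for male is an assumption.
FEATURE_VALUE = {"female": 1, "male": 0}


def candidate_voices(mean_f0_hz):
    """Return every label whose range contains the mean fundamental frequency."""
    return [label for label, (lo, hi) in VOICE_RANGES_HZ.items()
            if lo <= mean_f0_hz <= hi]


def sentence_feature_value(mean_f0_hz):
    """Map a sentence's mean fundamental frequency to a feature value.

    Returns None when the table alone cannot decide (overlap or out of range);
    how such cases are resolved is not specified in the text.
    """
    candidates = candidate_voices(mean_f0_hz)
    if len(candidates) != 1:
        return None
    return FEATURE_VALUE[candidates[0]]


if __name__ == "__main__":
    print(sentence_feature_value(780.0))  # the 780 Hz example from the text
```

Because the two ranges overlap, a real comparison table would need an additional rule for frequencies between 160 and 480 Hz; the sketch only makes that gap explicit.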
- FIG. 3B shows a schematic diagram of extracting a speech word feature vector in a speech signal in a special scene according to an embodiment of the present disclosure.
- the special scene is literary appreciation, in which the voice signal is the content of a review of literature.
- the speech word feature vector in the extracted speech signal can be further set to include its time domain feature vector and frequency domain feature vector.
- the speech word feature vector may include the global sequence number of the word, the start time of the word, the pronunciation duration, the pause time since the previous word, the maximum and minimum sound intensity of the word, the maximum and minimum pronunciation frequency, the short-term average amplitude, and the short-term average zero-crossing rate.
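The timing-related entries of this feature vector (sequence number, start time, pronunciation duration, pause since the previous word) can be derived from word boundaries. The sketch below assumes the boundaries are already available as (start, end) pairs in seconds, numbers words from 0, and uses a pause of 0 for the first word; these conventions and the function name are not from the source.

```python
def timing_features(word_times):
    """Build [index, start, duration, pause] for each word.

    word_times is a list of (start_s, end_s) pairs, one per word, in spoken order.
    """
    features = []
    previous_end = None
    for index, (start, end) in enumerate(word_times):
        if end < start:
            raise ValueError("a word cannot end before it starts")
        # Pause since the previous word; 0.0 for the first word (assumed).
        pause = 0.0 if previous_end is None else start - previous_end
        features.append([index, start, end - start, pause])
        previous_end = end
    return features


if __name__ == "__main__":
    for row in timing_features([(0.0, 0.25), (0.5, 1.0)]):
        print(row)
```

The intensity, frequency, amplitude, and zero-crossing entries would be appended to each row from the signal analysis described next.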
- the average zero-crossing rate can distinguish whether the current word is voiced or unvoiced. In particular, it still discriminates well even when voiced and unvoiced sounds overlap, which makes it useful for correcting, in the text result of speech recognition, the run-together or ambiguous unvoiced and voiced sounds caused by fast speech, such as the recognition error caused by running together the syllables "te" and "de" in the title "The Sorrows of Young Werther" when it is spoken quickly.
- the short-term average energy can be used as a basis for distinguishing initials from finals and voiced from unvoiced sounds.
- based on it, the positions of sentence breaks and the continuity of words in the text result of speech recognition can be checked and corrected; combined with the pause duration data, the sentence length of the current sentence can further be obtained from the sentence breaks.
- the maximum and minimum sound intensity of a word can be used to characterize the audio features of that word. When the background is noisy or the speech is slow, they can be used to correct cases in the text result of speech recognition where the head and tail of a single word are recognized as independent words, for example when the "may" in "smile and thirst drinking Xiongnu blood" in a poem recitation is recognized as "evil".
- first, in step S301, discrete sampling is performed on the speech signal in the time and frequency domains to obtain a digital speech signal.
- a unit pulse sequence may be used to sample a speech signal at a preset sampling frequency.
- the sampling frequency may be selected according to the Nyquist sampling theorem.
- the voice signal may be a voice signal directly collected by a microphone, or a voice signal that has been preprocessed or noise-reduced by a computer.
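As a reminder of what the Nyquist sampling theorem requires: if $f_{\max}$ is the highest frequency that must be preserved in the signal, the sampling frequency $f_s$ has to satisfy

$$
f_s > 2 f_{\max}.
$$

Purely as an arithmetic illustration, taking the 1000 Hz upper bound of the female voice range quoted earlier as $f_{\max}$ would give $f_s > 2000$ Hz; this choice of $f_{\max}$ is an assumption, and a practical system that also needs harmonics and consonant detail would use a higher bound.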
- in step S302, each word in the digital speech signal is further processed separately in the time domain and the frequency domain to obtain its time-domain feature vector and frequency-domain feature vector.
- specifically, during time-domain processing of the speech signal, windowing can be used, for example, to obtain the short-term average energy and the short-term level-crossing rate of the speech signal on a linear scale. During frequency-domain processing of the speech signal, the maximum pronunciation frequency and the cepstrum parameters of each word can be extracted, for example, through signal analysis, to obtain a frequency-domain word feature vector that includes the maximum pronunciation frequency feature and the cepstrum parameter feature.
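The time-domain part of this step can be sketched with a rectangular window. This is a minimal sketch, not the patent's procedure: the frame length, hop size, and function names are assumptions, and it computes the short-term average energy together with the zero-crossing rate named among the speech word features earlier.

```python
def split_frames(samples, frame_len, hop):
    """Cut the samples into overlapping frames (rectangular window)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Average of the squared sample values in one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (frame needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


if __name__ == "__main__":
    signal = [1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
    for frame in split_frames(signal, frame_len=4, hop=4):
        print(short_time_energy(frame), zero_crossing_rate(frame))
```

A rapidly alternating frame gives a high zero-crossing rate, which is typical of unvoiced sounds, while a frame with no sign changes gives zero; both frames in the demo have the same energy.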
- after the time-domain feature vector and the frequency-domain feature vector of each word in the speech signal are obtained, in step S303 they are spliced, for each word in the speech signal, to obtain the speech word feature vector corresponding to that word.
- the splicing can be done, for example, by directly concatenating the time-domain feature vector and the frequency-domain feature vector, or by classifying their components according to subsequent discrimination requirements and grouping together the time-domain and frequency-domain components that belong to the same category.
- for example, if the obtained time-domain feature vector is $T = (t_1, t_2, t_3)$ and the obtained frequency-domain feature vector is $F = (f_1, f_2, f_3)$, then splicing them yields the speech word feature vector $M_V$ corresponding to the current word, where $M_V = (t_1, t_2, t_3, f_1, f_2, f_3)$.
- FIG. 4 shows a schematic diagram of extracting text feature vectors in a speech signal according to an embodiment of the present disclosure.
- a method 400 for extracting the text feature vector in a speech signal includes extracting a character feature vector and extracting a word segmentation embedding feature vector of the text.
- the character feature vector identifies each character recognized by speech recognition; for example, after conversion, different values represent different characters.
- the word segmentation embedding vector identifies the phrases that appear in the current sentence and their constituent structure. For example, different values can distinguish phrases from single words, and positive and negative values can indicate the first, middle, and last words of a phrase.
- in step S401, the character feature vector in the text result is extracted: through step S4011, each character is converted into its corresponding character feature vector according to the character-to-vector-value comparison table in a preset character library.
- the preset character library can be an existing corpus dataset, such as the publicly available 100-dimensional Chinese character vectors from Wikipedia, or a self-designed corpus dataset of high-frequency vocabulary for a specific scene, such as word vectors related to Renaissance painting.
- each value in the resulting vector is the feature value of the corresponding character in the sentence.
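The character lookup of step S4011 amounts to a table lookup per character. The sketch below uses a tiny made-up table with two-dimensional vectors and Latin placeholder characters; the table contents, the all-zero fallback for unknown characters, and the names are hypothetical, whereas a real character library would hold, for example, 100-dimensional vectors.

```python
# Hypothetical character-to-vector comparison table (toy values, 2 dimensions).
CHAR_TABLE = {
    "a": [0.1, 0.3],
    "b": [0.7, 0.2],
}

# Assumed fallback for characters missing from the table.
UNKNOWN_VECTOR = [0.0, 0.0]


def char_feature_vectors(text):
    """Convert each character of the recognized text into its feature vector."""
    return [CHAR_TABLE.get(ch, UNKNOWN_VECTOR) for ch in text]


if __name__ == "__main__":
    print(char_feature_vectors("abz"))
```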
- after the character feature vector is obtained, the segmentation embedding feature vector in the text result is further extracted in step S402.
- in step S4021, the text result is divided into phrases and single characters according to a phrase lookup table in a preset lexicon.
- the preset lexicon can be an existing phrase corpus, or a self-designed phrase corpus dataset of high-frequency vocabulary for a specific scene. Phrases and single characters can be distinguished by different numerical values or by sign.
- in step S4022, each character in a phrase and each single character are converted into the corresponding segmentation embedding feature vector according to a preset transformation rule.
- the transformation rule may assign preset values to the start character, middle characters, and end character of a phrase, or may number each character in a phrase by its position, based on the number of characters in the phrase.
- for example, the transformation rule used is: a single character maps to 0, the start character of a phrase maps to 1, a middle character (any character other than the start and end characters) maps to 2, and the end character maps to 3.
- for the text result "你怎么看村上春树的当我在谈跑步时我在谈什么", the resulting segmentation embedding feature vector is $P_V = (0,1,1,0,1,2,2,3,0,1,2,2,2,2,2,2,2,2,2,2,3)$.
- for the sentence "我想看莫奈的撑阳伞的女人" ("I want to see Monet's Woman with a Parasol"), the corresponding segmentation embedding feature vector is $P_V = (0,0,0,1,3,0,1,2,2,2,2,3)$.
- steps S401 and S402 may be performed in parallel, or performed sequentially, and no limitation is imposed on them here. Further, as required, steps S401 and S402 may be performed based on different voice signals obtained after preprocessing, as long as these voice signals are derived from the same original voice signal.
- after the speech feature vector and the text feature vector of the speech signal are obtained, they are further spliced to obtain a composite feature vector of each word in the speech signal.
- the speech feature vector and the text feature vector can be spliced, for example, by directly connecting them into a new vector, or by classifying their internal components according to performance or role and splicing them by category.
- in some embodiments, the step of splicing the speech feature vector and the text feature vector to obtain the composite feature vector of each word in the speech signal includes: normalizing the extracted speech feature vector and text feature vector separately; and splicing the normalized dense text feature vector and dense speech feature vector for each word in the speech signal to obtain a composite feature vector for each word in the speech signal.
- FIG. 5 illustrates an exemplary flowchart of splicing a speech feature vector and a text feature vector according to an embodiment of the present disclosure.
- steps of a method 500 for splicing a speech feature vector and a text feature vector will be further described below with reference to FIG. 5.
- in step S501, normalization is performed on the extracted speech feature vector and text feature vector separately.
- in some embodiments, the normalization includes linear function (min-max) normalization of the speech feature vector and the text feature vector, using the formula $X_{norm} = (X - X_{min}) / (X_{max} - X_{min})$,
- where $X_{norm}$ is the normalized dense data,
- $X$ is the original data,
- and $X_{max}$ and $X_{min}$ are the maximum and minimum values in the original data set.
- for example, for the segmentation embedding vector $P_V = (0,0,0,1,3,0,1,2,2,2,2,3)$, whose maximum is 3 and minimum is 0,
- the resulting dense segmentation embedding vector is $P_N = (0,0,0,0.3,1,0,0.3,0.6,0.6,0.6,0.6,1)$ (with values truncated to one decimal place).
- in other embodiments, the normalization includes zero-mean standardization of the speech feature vector and the text feature vector, using the formula $z = (x - \mu) / \sigma$,
- where $z$ is the normalized dense data,
- $x$ is the original data,
- and $\mu$ and $\sigma$ are the mean and standard deviation of the original data set.
- for example, if $\mu = 57$ and $\sigma = 12.5$, then for the values (49, 54, 60, 62, 47, 66) in the speech word vector $M_V$, the dense word vector obtained by zero-mean standardization is $M_N = (-0.64, -0.24, 0.24, 0.4, -0.8, 0.72)$.
- the normalized dense text feature vector and the dense speech feature vector are stitched to obtain a composite feature vector for each word in the speech signal. For example, they can be directly stitched, or the sub-vectors are sequentially stitched in a predetermined order. As an example, the process of stitching the normalized text feature vector and the speech feature vector in a predetermined order is described in detail below.
- suppose that, among the dense vectors obtained after normalization, the dense speech word vector corresponding to the i-th character $W_i$ is $M_{Vi} = (t_i, f_i)$,
- the dense speech sentence vector corresponding to $W_i$ is $S_{Vi} = (s_{i1}, s_{i2}, \ldots, s_{i20})$,
- the dense character feature vector of $W_i$ is $D_{Ni} = (d_i)$,
- and the dense segmentation embedding feature vector is $P_{Ni} = (p_{i1}, p_{i2}, \ldots, p_{i98})$.
- when splicing in the order speech feature vector (speech sentence vector, speech word vector) then text feature vector (character vector, segmentation embedding vector), the splicing can be achieved, for example, by presetting the length of each feature vector.
- for example, the lengths of the obtained dense character feature vectors may be compared, the maximum length selected as a reference value, and the preset length of the dense character feature vector set to be greater than or equal to that reference value.
- the dense character feature vectors of all characters in the sentence are then expanded to the preset length, for example by zero padding.
- in the same way, a preset length can be set for each of the above vectors, and each vector expanded to its preset length.
- for example, the preset length of the dense speech word vector is 5, of which the preset length of the time-domain speech word vector is 2 and that of the frequency-domain speech word vector is 3;
- the preset length of the dense speech sentence vector is 20, the preset length of the dense character feature vector is 5,
- and the preset length of the dense segmentation embedding feature vector is 100.
- after zero padding, the dense speech word vector of $W_i$ is $M_{Vi} = (t_{i1}, 0, f_i, 0, 0)$,
- the dense speech sentence vector corresponding to $W_i$ is $S_{Vi} = (s_{i1}, s_{i2}, \ldots, s_{i20})$,
- the dense character feature vector of $W_i$ is $D_{Ni} = (d_i, 0, 0, 0, 0)$,
- and the dense segmentation embedding feature vector is $P_{Ni} = (p_{i1}, p_{i2}, \ldots, p_{i98}, 0, 0)$.
- the composite feature vector for $W_i$ after sequential splicing is then the row vector $(s_{i1}, s_{i2}, \ldots, s_{i20}, t_{i1}, 0, f_i, 0, 0, d_i, 0, 0, 0, 0, p_{i1}, p_{i2}, \ldots, p_{i98}, 0, 0)$.
- multiple feature vectors belonging to each word can also be classified into different rows to form a feature vector matrix.
- in this case, all current feature vectors can first be inspected to find the one with the most components, that is, the vector containing the most sub-vector elements.
- for the current character $W_i$, the dense segmentation embedding feature vector $P_{Ni} = (p_{i1}, p_{i2}, \ldots, p_{i98})$ has the most components, with 98 sub-vectors.
- the remaining feature vectors can be expanded to have the same number of sub-vectors as the segmentation embedding feature vector $P_{Ni}$,
- for example by zero padding.
- after zero padding, the dense speech word vector $M_{Vi}$, the dense speech sentence vector $S_{Vi}$, and the dense character feature vector $D_{Ni}$ of $W_i$ are all feature vectors containing 98 sub-vectors.
- they can then be combined in the same order, speech feature vector (speech sentence vector, speech word vector) then text feature vector (character vector, segmentation embedding vector), to form a feature vector matrix of 4 rows and 98 columns, which is the feature matrix representing the character $W_i$.
- the normalization and splicing process of the feature vectors is not limited to the order described in the above embodiments.
- in other embodiments, the above splicing process may be performed first, for example by setting preset lengths to obtain the spliced feature row vector, or by sorting the multiple feature vectors belonging to each word into different rows to form a feature vector matrix. Thereafter, normalization is performed separately on the different components of the spliced feature vector.
- in step S205, the composite feature vector of each word in the speech signal is processed by a deep learning model to obtain a recognition result of the named entity.
- first, the composite feature vector is sent to the input of the selected deep learning model, where the composite vectors can be fed in sequentially by word or phrase, for example, or truncated by a preset sentence length or paragraph length before being fed in.
- the composite feature vector is then processed through each layer of the selected deep learning model, where the selected model may be, for example, a Markov model or a conditional random field model.
- the model may also be a composite deep learning model, such as one formed by combining a bidirectional long short-term memory recurrent neural network with a conditional random field algorithm (BiLSTM + CRF).
- when such a composite model is selected,
- the input vector data is first computed by the forward layer and the backward layer of the bidirectional long short-term memory recurrent neural network,
- and is then processed by the conditional random field layer to give the final deep learning result. The recognition result of the named entity can then be obtained at the output of the deep learning model.
- in some embodiments, when the speech signal includes multiple sentences, before the composite feature vector of each word in the speech signal is processed by the deep learning model,
- the method further includes a step of truncating all the composite feature vectors of the speech signal.
- FIG. 6 is a schematic diagram of truncating all composite feature vectors of the speech signal to obtain multiple composite feature vector sequences according to an embodiment of the present disclosure.
- in step S601, the sentence length feature value of the current sentence of the speech signal is obtained, where the sentence length feature value identifies the length of the current sentence in the speech signal.
- in some embodiments, it may be obtained from the speech feature vectors extracted from the speech signal, for example from the sentence length feature in the aforementioned speech sentence feature vector, or by splitting sentences based on the pause between each word and the previous word recorded in the speech word feature vector.
- in some embodiments, it may also be obtained statistically from the sentence-break features in the text result of speech recognition.
- in step S602, all the obtained composite feature vectors of the speech signal are truncated; that is, based on the sentence length feature value of the current sentence, the composite feature vectors of the speech signal are cut off in order,
- and the multiple composite vector sequences obtained by truncation respectively represent the multiple sentences in the speech signal.
- multiple composite feature vector sequences are thus obtained in S603.
- the number of composite feature vector sequences is equal to the number of sentences contained in the speech signal.
- the number of composite feature vectors in each sequence is equal to the sentence length feature value of the corresponding sentence in the speech signal.
- in some implementations, to ease recognition by the neural network model or to suit subsequent processing, a preset sentence length may further be set according to the maximum sentence length feature value in the speech signal.
- each sequence is expanded to the preset sentence length, with missing vector data filled by a preset value, for example 0, so that every sequence has length equal to the preset sentence length, where the preset sentence length is greater than or equal to the maximum sentence length feature value.
- FIG. 7 shows a schematic block diagram of a named entity recognition apparatus according to an embodiment of the present disclosure.
- the named entity recognition device 700 may include a speech signal collector 710, a speech feature vector extractor 720, a text feature vector extractor 730, a composite vector generator 740, and a named entity recognizer 750.
- the voice signal collector 710 is configured to collect a voice signal.
- the voice signal collector 710 may be, for example, a microphone component, which may include a microphone, a microphone sleeve, a mounting rod, a connection line, and the like; it may also be a wireless microphone or a microphone circuit.
- the speech feature vector extractor 720 is configured to extract a speech feature vector in a speech signal. Specifically, for example, it can perform the process shown in FIG. 3C to implement the feature extraction shown in FIG. 3A and FIG. 3B.
- the text feature vector extractor 730 is configured to extract a text feature vector from the text result obtained by performing speech recognition on the speech signal. Specifically, for example, it may execute the process shown in FIG. 4 to implement text feature vector extraction.
- the composite vector generator 740 is configured to stitch the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal. Specifically, for example, it can execute the process shown in FIG. 5 to achieve the splicing of the speech feature vector and the text feature vector.
- the named entity recognizer 750 is configured to process the composite feature vector of each word in the speech signal through a deep learning model to obtain a recognition result of the named entity. Specifically, for example, it can execute the process shown in FIG. 6 to obtain the corresponding entity recognition result through the processing result of the deep learning model.
- the speech feature vector extractor 720, text feature vector extractor 730, composite vector generator 740, and named entity recognizer 750 may be implemented by one or more dedicated or general-purpose computer processing system modules or components, such as a personal computer, notebook computer, tablet computer, mobile phone, personal digital assistant (PDA), smart glasses, smart watch, smart ring, smart helmet, or any smart portable device.
- for example, they may be implemented via at least one processor and a memory, wherein the at least one processor is configured to execute program instructions.
- the memory may exist as different forms of program storage unit and data storage unit, such as a hard disk, read-only memory (ROM), or random access memory (RAM), and can be used to store the various data files used by the processor in processing and/or executing named entity recognition, as well as the program instructions executed by the processor.
- although the speech signal collector 710, the speech feature vector extractor 720, the text feature vector extractor 730, the composite vector generator 740, and the named entity recognizer 750 are presented as separate modules, those skilled in the art will understand that the above modules may be implemented as separate hardware devices, or may be integrated into one or more hardware devices. As long as the principles described in this disclosure can be implemented, the specific implementation of different hardware devices should not be taken as a factor limiting the scope of protection of this disclosure.
- a computer-readable storage medium is also provided, on which computer-readable instructions are stored; when the instructions are executed by a computer, the method described above can be performed.
- in the method, apparatus, and device for named entity recognition provided by the present disclosure, speech signal analysis is used to assist named entity recognition, extending the traditional text-only approach and improving the accuracy and scope of application of named entity recognition.
- in particular, the method of the present disclosure can overcome the difficulty of recognizing named entities when the speech collected in a special scene contains multiple complex special names, further improving the robustness and accuracy of named entity recognition.
- the program part in the technology may be considered as a “product” or “article of manufacture” existing in the form of executable code and / or related data, which is participated or realized through a computer-readable medium.
- Tangible, permanent storage media may include memory or storage used by any computer, processor, or similar device or related module.
- various semiconductor memories, magnetic tape drives, magnetic disk drives or similar devices capable of providing storage functions for software.
- All software or parts of it may sometimes communicate over a network, such as the Internet or other communication networks.
- This type of communication can load software from one computer device or processor to another.
- for example, software may be loaded from a server or host computer of a fingertip detection device onto a hardware platform in a computer environment, or onto another computer environment that implements the system, or onto a system with similar functions related to providing the information required for named entity recognition. Therefore, another medium capable of carrying software elements, such as light waves, radio waves, or electromagnetic waves propagated through cables, optical cables, or air, can also be used as a physical connection between local devices.
- the physical medium used for carrier waves, such as electrical cables, wireless connections, or optical cables, can also be considered as the medium that carries the software.
- unless restricted to tangible "storage" media, other terms referring to computer or machine "readable media" refer to media that participate in the execution of any instruction by a processor.
- aspects of the present disclosure can be illustrated and described through several patentable categories or situations, including any new and useful process, machine, product, or combination of materials, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be performed entirely by hardware, may be performed entirely by software (including firmware, resident software, microcode, etc.), or may be performed by a combination of hardware and software.
- the above hardware or software can be called “data block”, “module”, “engine”, “unit”, “component” or “system”.
- aspects of the present disclosure may manifest as a computer product located in one or more computer-readable media, the product including computer-readable program code.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
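Provided are a named entity recognition method (200), a named entity recognition apparatus (700), a named entity recognition device (100), and a medium. The method includes: collecting a speech signal in a specific scene (S201); extracting a speech feature vector from the speech signal (S202); extracting a text feature vector from the text result obtained by performing speech recognition on the speech signal (S203); splicing the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal (S204); and processing the composite feature vector of each word in the speech signal with a deep learning model to obtain a named entity recognition result (S205).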
Description
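Cross-Reference to Related Applications
This disclosure claims priority to Chinese Patent Application No. 201811005796.4, filed on August 30, 2018, the entire contents of which are incorporated herein by reference as part of this disclosure.
This disclosure relates to the field of artificial intelligence, and more particularly to a named entity recognition method, a named entity recognition apparatus, a named entity recognition device, and a medium.
With the development of artificial intelligence and big data technology, the technical demands on speech recognition and natural language processing keep rising. Named entity recognition, as a necessary preliminary operation for tasks such as semantic understanding and speech synthesis, plays an important role in natural language understanding.
Commonly used entity recognition methods mostly target speech recognition in broad application scenarios. In special scenarios, however, such as artworks, books, or foreign personal names with complex and unusual names, existing named entity recognition methods struggle to recognize entities well.
Therefore, a named entity recognition method with good recognition precision and accuracy in special scenarios is needed.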
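Summary
According to one aspect of the present disclosure, a named entity recognition method is provided, including: collecting a speech signal; extracting a speech feature vector from the speech signal; extracting a text feature vector from the text result obtained by performing speech recognition on the speech signal; splicing the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal; and processing the composite feature vector of each word in the speech signal with a deep learning model to obtain a named entity recognition result.
In some embodiments, extracting the speech feature vector from the speech signal includes extracting a speech sentence feature vector and extracting a speech word feature vector.
In some embodiments, extracting the speech sentence feature vector from the speech signal includes: converting the sentence features of the speech in the speech signal into the corresponding speech sentence feature vector according to a speech parameter lookup table in a preset speech sentence library.
In some embodiments, extracting the speech word feature vector from the speech signal includes: obtaining the speech word feature vector by performing speech analysis on the speech signal.
In some embodiments, performing speech analysis on the speech signal includes: performing discretized sampling of the speech signal in the time and frequency domains to obtain a digital speech signal; processing each word in the digital speech signal separately in the time domain and in the frequency domain to obtain its time-domain feature vector and frequency-domain feature vector; and, for each word in the speech signal, splicing its time-domain feature vector and frequency-domain feature vector to obtain the speech word feature vector corresponding to that word.
In some embodiments, extracting the text feature vector from the text result of speech recognition includes extracting a character feature vector and extracting a segmentation embedding feature vector.
In some embodiments, extracting the segmentation embedding feature vector from the text result includes: dividing the text result into phrases and single characters according to a phrase lookup table in a preset lexicon; and converting each character in the phrases and each single character into its corresponding segmentation embedding feature vector according to a preset transformation rule.
In some embodiments, extracting the character feature vector from the text result includes: converting each character into its corresponding character feature vector according to a character-to-vector-value lookup table in a preset character library.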
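In some embodiments, splicing the speech feature vector and the text feature vector to obtain the composite feature vector of each word in the speech signal includes: normalizing the extracted speech feature vector and text feature vector separately; and splicing the normalized dense text feature vector and dense speech feature vector for each word in the speech signal to obtain the composite feature vector for each word.
In some embodiments, splicing the speech feature vector and the text feature vector to obtain the composite feature vector of each word in the speech signal includes: splicing the dense text feature vector and dense speech feature vector obtained for each word in the speech signal to obtain the composite feature vector for each word; and normalizing the speech feature vector and the text feature vector within the resulting composite feature vector separately.
In some embodiments, normalizing the extracted speech feature vector and text feature vector separately includes: performing linear function normalization on the speech feature vector and on the text feature vector.
In some embodiments, normalizing the extracted speech feature vector and text feature vector separately includes: performing zero-mean standardization on the speech feature vector and on the text feature vector.
In some embodiments, processing the composite feature vector of each word in the speech signal with a deep learning model to obtain the named entity recognition result includes: sending the composite feature vector to the input of the selected deep learning model; processing the composite feature vector through each layer of the selected deep learning model; and obtaining the named entity recognition result at the output of the deep learning model.
In some embodiments, when the speech signal includes multiple sentences, before processing the composite feature vector of each word with the deep learning model, the method further includes: truncating all the obtained composite feature vectors of the speech signal according to the sentence length feature value of the current sentence in the speech signal, to obtain multiple composite feature vector sequences. The number of composite feature vector sequences equals the number of sentences contained in the speech signal, and the number of composite feature vectors in each sequence equals the sentence length feature value of the corresponding sentence.
In some embodiments, the sentence length feature value of the current sentence in the speech signal is obtained from the speech feature vector of the speech signal.
In some embodiments, the sentence length feature value of the current sentence in the speech signal is obtained from the text result of speech recognition on the speech signal.
According to another aspect of the present disclosure, a named entity recognition apparatus is provided, including: a speech signal collector for collecting a speech signal; a speech feature vector extractor configured to extract a speech feature vector from the speech signal; a text feature vector extractor configured to extract a text feature vector from the text result obtained by performing speech recognition on the speech signal; a composite vector generator configured to splice the speech feature vector and the text feature vector to obtain a composite feature vector of each word in the speech signal; and a named entity recognizer configured to process the composite feature vector of each word with a deep learning model to obtain a named entity recognition result.
According to another aspect of the present disclosure, a named entity recognition device is provided, including a speech collection apparatus, a processor, and a memory, the memory containing a set of instructions that, when executed by the processor, cause the named entity recognition device to perform the method described above.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer-readable instructions are stored; when the instructions are executed by a computer, the method described above is performed.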
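To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort. The drawings are not deliberately drawn to scale; the emphasis is on showing the gist of the present disclosure.
FIG. 1 shows an exemplary block diagram of a named entity recognition device according to an embodiment of the present disclosure;
FIG. 2 shows an exemplary flowchart of a named entity recognition method according to an embodiment of the present disclosure;
FIG. 3A shows a schematic diagram of extracting a speech sentence feature vector from a speech signal in a special scene according to an embodiment of the present disclosure;
FIG. 3B shows a schematic diagram of extracting a speech word feature vector from a speech signal in a special scene according to an embodiment of the present disclosure;
FIG. 3C shows an exemplary flowchart of extracting a speech word feature vector from a speech signal according to an embodiment of the present disclosure;
FIG. 4 shows an exemplary flowchart of extracting a text feature vector from a speech signal according to an embodiment of the present disclosure;
FIG. 5 shows an exemplary flowchart of splicing a speech feature vector and a text feature vector according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of truncating all composite feature vectors of the speech signal to obtain multiple composite feature vector sequences according to an embodiment of the present disclosure;
FIG. 7 shows a schematic block diagram of a named entity recognition apparatus according to an embodiment of the present disclosure.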
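The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort also fall within the scope of protection of the present disclosure.
As used in this disclosure and the claims, unless the context clearly indicates otherwise, words such as "a", "an", "one", and/or "the" do not specifically denote the singular and may also include the plural. In general, the terms "include" and "comprise" only indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or device may also contain other steps or elements.
Although this disclosure makes various references to certain modules in the system according to its embodiments, any number of different modules may be used and run on a user terminal and/or a server. The modules are illustrative only, and different aspects of the system and method may use different modules.
Flowcharts are used in this disclosure to explain the operations performed by the system according to its embodiments. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Instead, the steps may be processed in reverse order or simultaneously as needed. Other operations may also be added to these processes, or one or more steps may be removed from them.
In view of the above problems, the present disclosure provides a named entity recognition method, apparatus, device, and medium. Speech information not contained in the text, such as stress, pauses, and intonation, is normalized and fused with the text features and fed into a deep learning model, so that both jointly guide named entity recognition. This addresses the effect that complex special names in special scenarios have on judging sentence structure and recognizing entities, improves the precision and accuracy of entity recognition, and further broadens its range of application.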
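FIG. 1 shows an exemplary block diagram of a named entity recognition device according to an embodiment of the present disclosure. The named entity recognition device 100 shown in FIG. 1 may be implemented as including a speech collection apparatus 130 and one or more dedicated or general-purpose computer processing system modules or components. The speech collection apparatus is, for example, a microphone assembly, which may include a microphone, a microphone sleeve, a mounting rod, a connection line, and the like; it may also be a wireless microphone or a microphone circuit. The one or more dedicated or general-purpose computer processing system modules or components are, for example, a personal computer, notebook computer, tablet computer, mobile phone, personal digital assistant (PDA), smart glasses, smart watch, smart ring, smart helmet, or any smart portable device, and may include at least one processor 110 and a memory 120.
The at least one processor is configured to execute program instructions. The memory 120 may exist in the named entity recognition device 100 as different forms of program storage unit and data storage unit, such as a hard disk, read-only memory (ROM), or random access memory (RAM), and can be used to store the various data files used by the processor in processing and/or executing named entity recognition, as well as the program instructions executed by the processor. Although not shown in the figure, the named entity recognition device 100 may also include an input/output component that supports input/output data flow between the device and other components (such as a screen display apparatus). The named entity recognition device 100 may also send and receive information and data over a network through a communication port.
In some embodiments, the named entity recognition device 100 may collect a speech signal produced in a specific surrounding scene and perform on the received speech signal the named entity recognition method described below, implementing the functions of the sentence completion apparatus described above. The speech signal in the specific scene may be a human voice signal, for example a narration in a museum or art exhibition, a discussion appreciating and commenting on paintings and calligraphy, or lecture content for a course on people or history.
Although the processor 110, the memory 120, and the speech collection apparatus 130 are presented as separate modules in FIG. 1, those skilled in the art will understand that the above modules may be implemented as separate hardware devices or integrated into one or more hardware devices, for example into a smart watch or another smart device. As long as the principles described in this disclosure can be implemented, the specific implementation of different hardware devices should not be taken as a factor limiting the scope of protection of this disclosure.
Depending on the specific scene, the speech signal collected by the speech collection apparatus 130 may include a large number of complex special names. For example, speech collected in a scene of painting and book appreciation may include complex titles of paintings and books such as "这里的黎明静悄悄" ("The Dawns Here Are Quiet"), "在雨中漫步" ("Strolling in the Rain"), and "当我在谈跑步时我谈些什么" ("What I Talk About When I Talk About Running").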
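Therefore, to judge sentence structure and recognize named entities well when multiple complex special names appear in a specific scene, the present disclosure proposes a named entity recognition method. FIG. 2 shows an exemplary flowchart of a named entity recognition method according to an embodiment of the present disclosure.
As shown in FIG. 2, according to the named entity recognition method 200, a speech signal in a specific scene is first collected in step S201. As described above, the speech signal may be collected by a discrete speech collection apparatus, for example a standalone microphone, or by a speech collection module integrated with the computer processing system, for example an integrated microphone circuit. The embodiments of the present disclosure are not limited by the source of the speech signal or the way it is obtained.
Based on the collected speech signal, a speech feature vector is extracted from the speech signal in step S202. The speech feature vector can be obtained, for example, by time-domain and frequency-domain feature extraction on the speech signal, or by filtering and windowed analysis of the speech signal.
Based on the collected speech signal, in step S203 a text feature vector is extracted from the text result obtained by performing speech recognition on the speech signal. The speech recognition may be implemented by, for example, a deep learning algorithm or another speech signal recognition method; the embodiments of the present disclosure are not limited by the speech recognition method or process. The text feature vector may be obtained from the text result, for example, by comparison against a character library or lexicon from which named entities are recognized, or by judging the sentence structure.
It should be understood that steps S202 and S203 may be performed in parallel or sequentially, and no limitation is imposed here. Further, as required, steps S202 and S203 may operate on different speech signals obtained after preprocessing, as long as these speech signals are derived from the same original speech signal.
After the speech feature vector and the text feature vector are extracted, they are spliced in step S204 to obtain a composite feature vector of each word in the speech signal. The splicing may be done, for example, by connecting the two feature vectors to form a composite feature vector, or by normalizing the speech feature vector and the text feature vector to obtain normalized vectors.
Based on the obtained composite feature vector, in step S205 the composite feature vector of each word in the speech signal is processed by a deep learning model to obtain a named entity recognition result. The deep learning model may be a model based on statistical methods, such as a Hidden Markov Model (HMM), Maximum Entropy (ME), or a Support Vector Machine (SVM), or a model based on sample relationships over a time series, such as a long short-term memory network (LSTM) or a recurrent neural network (RNN).
In some embodiments, extracting the speech feature vector from the speech signal may further include extracting a speech sentence vector and extracting a speech word vector. Extracting the speech sentence vector may mean, for example, extracting prosodic features of the speech signal such as fundamental frequency, speech rate, and formants, or extracting features related to the spectrum of the speech signal, such as Mel-frequency cepstral coefficients (MFCC). Extracting the speech word vector may mean, for example, segmenting the speech signal by word and extracting for each word its pronunciation duration, start time, and end time, or extracting for each word its maximum pronunciation frequency, maximum intensity, mean of the intensity integral, and so on. This is further described below with reference to FIG. 3A and FIG. 3B.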
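FIG. 3A shows a schematic diagram of extracting a speech sentence feature vector from a speech signal in a special scene according to an embodiment of the present disclosure. For example, the special scene is an art exhibition, and the speech signal is the narration in the exhibition. In this scene, the speech sentence feature vector extracted from the speech signal may further be set to its frame-level low-level feature vectors. More specifically, for each sentence in the speech signal of the exhibition narration, its fundamental frequency features, voice quality features, and Mel-frequency cepstral coefficients (MFCC) are extracted, giving a sentence feature vector for each sentence.
The fundamental frequency feature reflects the overall character of the speech. For example, the fundamental frequency of female voices is generally higher than that of male voices, so it can be used to distinguish gender; after further processing, stray voices of the other gender mixed into the signal can be removed to obtain a more accurate sentence length feature value. This feature can help correct the sentence length in the text result of speech recognition and avoid recognition errors caused by ambient sound or other voices mixed into the collected speech. The Mel-frequency cepstral coefficients can further be used to identify voice attributes and distinguish the different voices present in the speech.
When extracting the speech sentence feature vector from the speech signal according to an embodiment of the present disclosure, for example when extracting the MFCC feature vector, a preliminary signal processing result is first obtained through preprocessing, framing, windowing, and similar steps. A Fourier transform is then applied to this preliminary result, and the transformed signal is further filtered and passed through a discrete cosine transform (DCT), finally giving the dynamic feature vector (Delta MFCC) for the sentence, that is, the sentence feature vector. For vectors such as fundamental frequency features and voice quality features, different statistical functions can be used to convert the basic acoustic features of sentences of varying duration into fixed-length static features. These statistical functions may include maximum and minimum, mean, duration, variance, and so on.
In some embodiments, the speech sentence feature vector can be extracted by comparing the extracted speech characteristics against the speech parameter lookup table in a preset speech sentence library and converting them into the speech sentence feature vector corresponding to those characteristics. For example, if the preset lookup table sets the frequency range of male voices to 100 to 480 Hz and that of female voices to 160 to 1000 Hz, the fundamental frequency feature extracted from a sentence can be classified and labeled accordingly: if the average frequency of the current sentence is 780 Hz, it can be judged to be a female voice and a feature vector value obtained from the preset rule of the table, for example by assigning the value 1 to the corresponding speech sentence feature vector.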
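FIG. 3B shows a schematic diagram of extracting a speech word feature vector from a speech signal in a special scene according to an embodiment of the present disclosure. For example, the special scene is literary appreciation, and the speech signal is commentary on literature. In this scene, the speech word feature vector extracted from the speech signal may further be set to include its time-domain feature vector and frequency-domain feature vector. More specifically, the speech word feature vector may include the global index of the word, its start time, its pronunciation duration, the pause between it and the previous word, the maximum and minimum intensity of its pronunciation, the maximum and minimum frequency of its pronunciation, the short-time average amplitude, and the short-time average zero-crossing rate.
The average zero-crossing rate can tell whether the current word is voiced or unvoiced, and it discriminates well even when voiced and unvoiced sounds overlap in the speech signal. It can therefore be used to correct liaison or blurring of voiced and unvoiced sounds caused by fast speech in the text result of speech recognition, for example recognition errors caused by running together "特" and "的" in a rapidly spoken "少年维特的烦恼" ("The Sorrows of Young Werther"). The short-time average energy can serve as a basis for distinguishing initials from finals and voiced from silent segments. Based on it, sentence-break positions and word continuity in the text result can be checked and corrected, and, combined with the pause duration data, the sentence length of the current sentence can be derived from those breaks. The maximum and minimum intensity of a word's pronunciation characterize its audio features. When the background is noisy or the speech is slow, they can be used to correct cases in which the head and tail of a single word are recognized as separate words because of slow speech, such as recognizing "可" in the recited line "笑谈渴饮匈奴血" as "可恶".
The speech word features described above can be extracted by performing speech analysis on the speech signal. Referring to FIG. 3C, in speech analysis 300, the speech signal is first sampled discretely in the time and frequency domains in step S301 to obtain a digital speech signal. For example, a unit impulse train can be used to sample the speech signal at a preset sampling frequency, which can be chosen according to the Nyquist sampling theorem. The speech signal may be collected directly by a microphone, or preprocessed or denoised by a computer.
After the discretized sampling, each word in the digital speech signal is processed separately in the time domain and in the frequency domain in step S302 to obtain its time-domain feature vector and frequency-domain feature vector. Specifically, in time-domain processing, windowing can be applied, for example, to obtain the short-time average energy and short-time level-crossing rate of the speech signal on a linear scale. In frequency-domain processing, signal analysis can be used, for example, to extract the maximum pronunciation frequency and cepstral parameters of each word, giving a frequency-domain word feature vector containing the maximum pronunciation frequency feature and the cepstral parameter feature.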
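As a rough illustration of the windowed time-domain processing mentioned above, the following minimal sketch computes a short-time average energy and a zero-crossing rate per frame. The function names, the Hamming window, and the frame parameters are assumptions made for this example; the source does not specify a window type, frame length, or hop size.

```python
import math

def split_frames(samples, frame_len, hop):
    """Split a sample sequence into fixed-length, possibly overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def hamming_window(n):
    """Hamming window of length n (n >= 2); the window choice is an assumption."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def short_time_average_energy(frame, window):
    """Mean of the squared windowed samples in one frame."""
    return sum((x * w) ** 2 for x, w in zip(frame, window)) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy signal: an alternating segment followed by a constant segment.
signal = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5]
window = hamming_window(4)
for frame in split_frames(signal, frame_len=4, hop=4):
    print(short_time_average_energy(frame, window), zero_crossing_rate(frame))
```

In this toy input the alternating frame has the highest possible zero-crossing rate and the constant frame has none, which is the kind of contrast the text relies on when separating voiced from unvoiced segments.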
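After the speech sentence feature vector and speech word feature vector of each word in the speech signal are obtained, in step S303 the time-domain feature vector and frequency-domain feature vector of each word are spliced to obtain the speech word feature vector corresponding to that word. The splicing can be done, for example, by directly connecting the time-domain and frequency-domain feature vectors, or by classifying them according to the needs of subsequent discrimination and splicing the time-domain and frequency-domain vectors of the same category in groups.
The above step can be described more concretely. For example, if the obtained time-domain feature vector is $T = (t_1, t_2, t_3)$ and the obtained frequency-domain feature vector is $F = (f_1, f_2, f_3)$, splicing them yields the word feature vector $M_V$ corresponding to the current word, $M_V = (t_1, t_2, t_3, f_1, f_2, f_3)$.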
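FIG. 4 shows a schematic diagram of extracting a text feature vector from a speech signal according to an embodiment of the present disclosure.
As shown in FIG. 4, the method 400 for extracting the text feature vector from a speech signal includes extracting a character feature vector and a segmentation embedding feature vector of the text. The character feature vector identifies each character recognized by speech recognition, for example by using different values to represent different converted characters. The segmentation embedding vector identifies the phrases appearing in the current sentence and their constituent structure; for example, it can use different values to represent phrases and values, or use the sign of a value to indicate the first, middle, and last characters of a phrase.
Based on the above, in step S401, which extracts the character feature vector from the text result, step S4011 converts each character into its corresponding character feature vector according to the character-to-vector-value lookup table in a preset character library. The preset character library can be an existing corpus dataset, such as the 100-dimensional Chinese character vectors published by Wikipedia, or a self-designed corpus dataset of high-frequency vocabulary for a specific scene, such as character vectors related to Renaissance painting.
The above step can be described more concretely. For example, for the sentence "我想看莫奈的撑阳伞的女人" ("I want to see Monet's Woman with a Parasol") in the text result obtained by speech recognition, looking up the 100-dimensional Chinese character vectors published by Wikipedia gives the discrete vector $D_V$ corresponding to the sentence, $D_V = (28, 36, 108, 196, 300, 3, 314, 180, 204, 3, 91, 29)$. Each value in the vector is the character feature value of one character in the sentence.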
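After the character feature vector corresponding to the text is obtained, the segmentation embedding feature vector is further extracted from the text result in step S402. As shown in FIG. 4, step S4021 first divides the text result into phrases and single characters according to a phrase lookup table in a preset lexicon. The preset lexicon can be an existing phrase corpus, or a self-designed phrase corpus dataset of high-frequency vocabulary for a specific scene. Phrases and single characters can be distinguished by different numerical values or by sign. After the division, step S4022 converts each character in the phrases and each single character into its corresponding segmentation embedding feature vector according to a preset transformation rule. The transformation rule may assign preset values to the start character, middle characters, and end character of a phrase, or may number each character in a phrase by its position, based on the number of characters in the phrase.
The above steps can be described more concretely. For example, take the text result "你怎么看村上春树的当我在谈跑步时我在谈什么" obtained by speech recognition, and suppose the transformation rule is: a single character maps to 0, the start character of a phrase maps to 1, a middle character (any character other than the start and end characters) maps to 2, and the end character maps to 3. Splitting the text result and converting it by this rule gives the segmentation embedding feature vector $P_V = (0,1,1,0,1,2,2,3,0,1,2,2,2,2,2,2,2,2,2,2,3)$. For the sentence "我想看莫奈的撑阳伞的女人", the corresponding segmentation embedding feature vector is $P_V = (0,0,0,1,3,0,1,2,2,2,2,3)$.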
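The 0/1/2/3 transformation rule above can be sketched as follows. The function name and the pre-segmented token list are assumptions for illustration; in the source, the split into phrases and single characters comes from the preset lexicon, which is not reproduced here.

```python
def segmentation_embedding(tokens):
    """Tag every character: 0 = single character, 1 = phrase start,
    2 = phrase middle, 3 = phrase end."""
    tags = []
    for token in tokens:
        n = len(token)
        if n == 0:
            continue  # ignore empty tokens
        if n == 1:
            tags.append(0)
        else:
            tags.extend([1] + [2] * (n - 2) + [3])
    return tags

# Hypothetical segmentation of the example sentence; the painting title
# "撑阳伞的女人" is treated as one phrase.
tokens = ["我", "想", "看", "莫奈", "的", "撑阳伞的女人"]
print(segmentation_embedding(tokens))
# [0, 0, 0, 1, 3, 0, 1, 2, 2, 2, 2, 3]
```

With this segmentation the output matches the vector $P_V$ given in the text for the same sentence.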
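It should be understood that steps S401 and S402 may be performed in parallel or sequentially, and no limitation is imposed here. Further, as required, steps S401 and S402 may operate on different speech signals obtained after preprocessing, as long as these speech signals are derived from the same original speech signal.
As described above, after the speech feature vector and the text feature vector of the speech signal are obtained, they are further spliced to obtain a composite feature vector of each word in the speech signal. The splicing can be done, for example, by directly connecting them into a new vector, or by classifying their internal components according to performance or role and splicing them by category.
In some embodiments, the step of splicing the speech feature vector and the text feature vector to obtain the composite feature vector of each word in the speech signal includes: normalizing the extracted speech feature vector and text feature vector separately; and splicing the normalized dense text feature vector and dense speech feature vector for each word in the speech signal to obtain the composite feature vector for each word.
FIG. 5 shows an exemplary flowchart of splicing a speech feature vector and a text feature vector according to an embodiment of the present disclosure. As an example, the steps of the method 500 for splicing the speech feature vector and the text feature vector are further described below with reference to FIG. 5.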
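As shown in FIG. 5, step S501 first normalizes the extracted speech feature vector and text feature vector separately. In some embodiments, the normalization includes linear function normalization of the speech feature vector and of the text feature vector, using the following formula:
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
where $X_{norm}$ is the normalized dense data, $X$ is the original data, and $X_{max}$ and $X_{min}$ are the maximum and minimum values in the original data set.
For example, for the segmentation embedding vector of the text, if the maximum value in the vector is 3 and the minimum is 0, then for $P_V = (0,0,0,1,3,0,1,2,2,2,2,3)$, linear function normalization gives the dense segmentation embedding vector $P_N = (0,0,0,0.3,1,0,0.3,0.6,0.6,0.6,0.6,1)$ (with values truncated to one decimal place).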
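A minimal sketch of the linear function (min-max) normalization above, applied to the example vector $P_V$. The function name and the handling of a constant vector are assumptions for illustration.

```python
def min_max_normalize(values):
    """Scale values linearly into [0, 1] using the minimum and maximum of the data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant vector maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

p_v = [0, 0, 0, 1, 3, 0, 1, 2, 2, 2, 2, 3]
p_n = min_max_normalize(p_v)
print([round(x, 3) for x in p_n])
```

The exact results for the entries 1 and 2 are 1/3 and 2/3; the example in the text lists them as 0.3 and 0.6.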
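In other embodiments, the normalization includes zero-mean standardization of the speech feature vector and of the text feature vector, using the following formula:
$$z = \frac{x - \mu}{\sigma}$$
where $z$ is the normalized dense data, $x$ is the original data, and $\mu$ and $\sigma$ are the mean and standard deviation of the original data set.
For example, for the speech word vector of the text, if the mean of the data is $\mu = 57$ and $\sigma = 12.5$, then for the values (49, 54, 60, 62, 47, 66) in the speech word vector $M_V$, zero-mean standardization gives the dense word vector $M_N = (-0.64, -0.24, 0.24, 0.4, -0.8, 0.72)$.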
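The zero-mean standardization example can be checked with a short sketch. Here $\mu$ and $\sigma$ are passed in as given in the text (57 and 12.5) rather than computed from the six values shown, since the text treats them as statistics of the whole data set; the function name is an assumption.

```python
def z_score(values, mean, std):
    """Standardize each value as (x - mean) / std."""
    return [(v - mean) / std for v in values]

m_v = [49, 54, 60, 62, 47, 66]
m_n = z_score(m_v, mean=57, std=12.5)
print([round(z, 2) for z in m_n])
# [-0.64, -0.24, 0.24, 0.4, -0.8, 0.72]
```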
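After the text feature vector and the speech feature vector are normalized, step S502 splices the normalized dense text feature vector and dense speech feature vector to obtain the composite feature vector for each word in the speech signal. They can be spliced directly, for example, or their sub-vectors can be spliced one after another in a predetermined order. As an example, the process of splicing the normalized text feature vector and speech feature vector in a predetermined order is described in detail below.
Specifically, suppose that among the dense vectors obtained after separate normalization, the dense speech word vector corresponding to the i-th character $W_i$ is $M_{Vi} = (t_i, f_i)$, the dense speech sentence vector corresponding to $W_i$ is $S_{Vi} = (s_{i1}, s_{i2}, \ldots, s_{i20})$, the dense character feature vector of $W_i$ is $D_{Ni} = (d_i)$, and the dense segmentation embedding feature vector is $P_{Ni} = (p_{i1}, p_{i2}, \ldots, p_{i98})$. When splicing in the order speech feature vector (speech sentence vector, speech word vector) then text feature vector (character vector, segmentation embedding vector), the splicing can be achieved, for example, by presetting the length of each feature vector. For example, the lengths of the obtained dense character feature vectors can be compared, the maximum length selected as a reference value, and the preset length of the dense character feature vector set to be greater than or equal to that reference value. The dense character feature vectors of all characters in the sentence are then expanded to the preset length, for example by zero padding.
In the same way, a preset length can be set for each of the above vectors and each vector expanded to its preset length. For example, let the preset length of the dense speech word vector be 5, of which the time-domain speech word vector has preset length 2 and the frequency-domain speech word vector has preset length 3; let the preset length of the dense speech sentence vector be 20, that of the dense character feature vector be 5, and that of the dense segmentation embedding feature vector be 100. After zero padding, the dense speech word vector of $W_i$ is $M_{Vi} = (t_{i1}, 0, f_i, 0, 0)$, the dense speech sentence vector corresponding to $W_i$ is $S_{Vi} = (s_{i1}, s_{i2}, \ldots, s_{i20})$, the dense character feature vector of $W_i$ is $D_{Ni} = (d_i, 0, 0, 0, 0)$, and the dense segmentation embedding feature vector is $P_{Ni} = (p_{i1}, p_{i2}, \ldots, p_{i98}, 0, 0)$. The composite feature vector for $W_i$ after sequential splicing is then the row vector $(s_{i1}, s_{i2}, \ldots, s_{i20}, t_{i1}, 0, f_i, 0, 0, d_i, 0, 0, 0, 0, p_{i1}, p_{i2}, \ldots, p_{i98}, 0, 0)$.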
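The zero-padding and sequential splicing above can be sketched as follows. The preset lengths (20, 2, 3, 5, 100) are the ones given in the example; the helper names and the placeholder feature values are assumptions for illustration.

```python
def pad(vec, length, fill=0.0):
    """Zero-pad a vector to a preset length; reject vectors that are too long."""
    if len(vec) > length:
        raise ValueError("vector is longer than its preset length")
    return list(vec) + [fill] * (length - len(vec))

def composite_row(sentence_vec, time_vec, freq_vec, char_vec, seg_vec):
    """Splice in the order: speech sentence vector, speech word vector
    (time-domain part, then frequency-domain part), character vector,
    segmentation embedding vector."""
    return (pad(sentence_vec, 20) + pad(time_vec, 2) + pad(freq_vec, 3)
            + pad(char_vec, 5) + pad(seg_vec, 100))

# Placeholder values standing in for the normalized features of one character.
row = composite_row([0.1] * 20, [0.5], [0.7], [0.2], [0.3] * 98)
print(len(row))      # 130
print(row[20:25])    # [0.5, 0.0, 0.7, 0.0, 0.0]
```

The resulting row has 20 + 5 + 5 + 100 = 130 entries, and the speech word slice reproduces the padded pattern $(t_{i1}, 0, f_i, 0, 0)$ from the text.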
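When splicing, the multiple feature vectors belonging to each word can also be placed in different rows to form a feature vector matrix. In this case, all current feature vectors can first be inspected to find the one with the most components, that is, the vector containing the most sub-vector elements. For the current character $W_i$, the dense segmentation embedding feature vector $P_{Ni} = (p_{i1}, p_{i2}, \ldots, p_{i98})$ has the most components, with 98 sub-vectors. Based on this maximum, the remaining feature vectors can be expanded, for example by zero padding, to have the same number of sub-vectors as $P_{Ni}$. After zero padding, the dense speech word vector $M_{Vi}$, the dense speech sentence vector $S_{Vi}$, and the dense character feature vector $D_{Ni}$ of $W_i$ are all feature vectors containing 98 sub-vectors. They can then be combined in the same order, speech feature vector (speech sentence vector, speech word vector) then text feature vector (character vector, segmentation embedding vector), to form a feature vector matrix of 4 rows and 98 columns, which is the feature matrix representing the character $W_i$.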
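The matrix form can be sketched in the same spirit: every vector is zero-padded to the length of the longest one and the rows are stacked in the stated order. The function name and the placeholder values are assumptions for illustration.

```python
def feature_matrix(sentence_vec, word_vec, char_vec, seg_vec):
    """Stack the four vectors as rows, zero-padded to the longest length."""
    rows = [sentence_vec, word_vec, char_vec, seg_vec]
    width = max(len(r) for r in rows)
    return [list(r) + [0.0] * (width - len(r)) for r in rows]

matrix = feature_matrix([0.1] * 20, [0.5, 0.7], [0.2], [0.3] * 98)
print(len(matrix), len(matrix[0]))  # 4 98
```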
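It should be understood that the normalization and concatenation of the feature vectors are not limited to the order described in the above embodiment. In other embodiments, the concatenation may be performed first, for example by setting preset lengths to obtain a concatenated feature row vector, or by placing the multiple feature vectors belonging to each word in separate rows to form a feature vector matrix. The different components of the concatenated feature vector are then normalized separately.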
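Based on the obtained composite feature vectors, in step S205 the composite feature vector of each word in the speech signal is processed by a deep learning model to obtain the named entity recognition result. Specifically, the composite feature vectors are first sent to the input of the selected deep learning model; they may, for example, be fed into the model sequentially, divided by word or phrase, or be truncated according to a preset sentence length or paragraph length before being input. The composite feature vectors are then processed by the layers of the selected deep learning model. The selected deep learning model may be, for example, a Markov model or a conditional random field model. The deep learning model may also be a composite deep learning model, for example one formed by combining a bidirectional long short-term memory recurrent neural network with a conditional random field algorithm (BiLSTM+CRF). Specifically, when the BiLSTM+CRF composite deep learning model is selected, the input vector data is computed by the forward layer and the backward layer of the bidirectional long short-term memory recurrent neural network and is then processed by the conditional random field layer, which yields the deep learning processing result. The named entity recognition result can then be obtained at the output of the deep learning model.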
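To make the role of the conditional random field layer more concrete, the sketch below shows Viterbi decoding, a standard way to pick the best tag sequence from per-word tag scores and tag-transition scores. It is an illustration only: the disclosure does not specify the decoding procedure, the tag set (`O`, `B`, `I`) and all scores are hypothetical, and the per-word scores are assumed to have been produced already by the bidirectional LSTM layers.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j for the word at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at this position.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(range(num_tags), key=lambda i: scores[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


tags = ["O", "B", "I"]  # hypothetical tag set: outside, begin, inside
transitions = [
    [0.0, 0.0, -100.0],  # O -> I is heavily penalized
    [0.0, 0.0, 1.0],     # B -> I is encouraged
    [0.0, 0.0, 0.0],
]
emissions = [            # hypothetical per-word scores for a 4-word sentence
    [2.0, 1.0, 0.0],
    [0.5, 2.0, 0.0],
    [0.0, 0.5, 2.0],
    [2.0, 0.0, 0.5],
]
best = [tags[i] for i in viterbi_decode(emissions, transitions)]
print(best)
```

The transition scores let the decoder reject tag sequences that are locally plausible but globally invalid, such as an inside tag directly after an outside tag.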
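In some embodiments, where the speech signal includes multiple sentences, the method further includes, before the composite feature vector of each word in the speech signal is processed by the deep learning model to obtain the named entity recognition result, a step of truncating all of the obtained composite feature vectors of the speech signal.
Fig. 6 shows a schematic diagram of truncating all composite feature vectors of the speech signal to obtain multiple composite feature vector sequences according to an embodiment of the present disclosure. Referring to Fig. 6, in the truncation method 600 for the composite vectors of a speech signal, step S601 first obtains the sentence length feature value of the current sentence of the speech signal, where the sentence length feature value identifies the length of the current sentence in the speech signal. In some embodiments, it can be obtained from the data of the speech feature vectors extracted from the speech signal, for example from the sentence length feature in the aforementioned speech sentence feature vector, or by splitting sentences based on the pause duration between each word and the preceding word in the speech word feature vector. In some embodiments, it can also be obtained statistically from the sentence-break features in the text result of speech recognition performed on the speech signal.
Based on the obtained sentence length feature value of the current sentence in the speech signal, step S602 truncates all of the obtained composite feature vectors of the speech signal. That is, the composite feature vectors of the speech signal are cut in order according to the sentence length feature value of the current sentence, and the resulting composite vector sequences respectively represent the sentences in the speech signal.
With this truncation, step S603 yields multiple composite feature vector sequences. The number of composite feature vector sequences equals the number of sentences contained in the speech signal, and the number of composite feature vectors in each sequence equals the sentence length feature value of the corresponding sentence in the speech signal.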
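The truncation of steps S601 to S603 can be sketched as a split of the flat list of per-word composite vectors by sentence length. The function name `split_by_sentence_lengths` and the tiny two-dimensional vectors are hypothetical; real composite vectors would be much longer.

```python
def split_by_sentence_lengths(composite_vectors, sentence_lengths):
    """Cut the per-word composite vectors into one sequence per sentence."""
    if sum(sentence_lengths) != len(composite_vectors):
        raise ValueError("sentence lengths must add up to the number of words")
    sequences, start = [], 0
    for length in sentence_lengths:
        sequences.append(composite_vectors[start:start + length])
        start += length
    return sequences


# Seven words in total: a 3-word sentence followed by a 4-word sentence.
vectors = [[float(i), float(i) + 0.5] for i in range(7)]
sequences = split_by_sentence_lengths(vectors, [3, 4])
print([len(seq) for seq in sequences])
```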
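In some implementations, to facilitate recognition by the neural network model or to meet the needs of subsequent processing, a preset sentence length can further be set for the sequences obtained after truncation according to the maximum sentence length feature value among the sentences in the speech signal. Each sequence is expanded to the preset sentence length by filling the missing vector data with a preset value, for example 0, so that the length of every sequence equals the preset sentence length, where the preset sentence length is greater than or equal to the maximum sentence length feature value.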
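Padding the truncated sequences to a common preset sentence length might look like the sketch below. It assumes each sequence is a non-empty list of equally sized composite vectors; the function name `pad_sequences` and the choice of all-zero vectors as the filler are illustrative assumptions consistent with the preset value 0 mentioned above.

```python
def pad_sequences(sequences, preset_len=None, fill=0.0):
    """Extend every sequence with all-`fill` vectors up to preset_len."""
    max_len = max(len(seq) for seq in sequences)
    if preset_len is None:
        preset_len = max_len  # smallest admissible preset sentence length
    if preset_len < max_len:
        raise ValueError("preset length must be >= the longest sentence")
    dim = len(sequences[0][0])  # dimension of one composite vector
    return [
        list(seq) + [[fill] * dim for _ in range(preset_len - len(seq))]
        for seq in sequences
    ]


short = [[1.0, 2.0], [3.0, 4.0]]               # 2-word sentence
long_ = [[5.0, 6.0], [7.0, 8.0], [9.0, 1.0]]   # 3-word sentence
padded = pad_sequences([short, long_], preset_len=4)
print([len(seq) for seq in padded])
```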
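Fig. 7 shows a schematic block diagram of a named entity recognition apparatus according to an embodiment of the present disclosure.
As shown in Fig. 7, the named entity recognition apparatus 700 may include a speech signal collector 710, a speech feature vector extractor 720, a text feature vector extractor 730, a composite vector generator 740, and a named entity recognizer 750.
The speech signal collector 710 is used to collect a speech signal. The speech signal collector 710 may be, for example, a microphone assembly, which may include a microphone, a microphone sleeve, a mounting rod, a connecting cable, and the like; it may also be a wireless microphone or a microphone circuit.
The speech feature vector extractor 720 is configured to extract the speech feature vector from the speech signal. Specifically, it may, for example, execute the flow shown in Fig. 3C to implement the feature extraction shown in Fig. 3A and Fig. 3B.
The text feature vector extractor 730 is configured to extract, based on the text result obtained by performing speech recognition on the speech signal, the text feature vector from the text result. Specifically, it may, for example, execute the flow shown in Fig. 4 to extract the text feature vector.
The composite vector generator 740 is configured to concatenate the speech feature vector with the text feature vector to obtain the composite feature vector of each word in the speech signal. Specifically, it may, for example, execute the flow shown in Fig. 5 to concatenate the speech feature vector with the text feature vector.
The named entity recognizer 750 is configured to process the composite feature vector of each word in the speech signal by a deep learning model to obtain the named entity recognition result. Specifically, it may, for example, execute the flow shown in Fig. 6 and obtain the corresponding entity recognition result from the processing result of the deep learning model.
The speech feature vector extractor 720, the text feature vector extractor 730, the composite vector generator 740, and the named entity recognizer 750 may be implemented, for example, by one or more dedicated or general-purpose computer processing system modules or components, such as a personal computer, a notebook computer, a tablet computer, a mobile phone, a personal digital assistant (PDA), smart glasses, a smart watch, a smart ring, a smart helmet, or any smart portable device. For example, they may be implemented by at least one processor and a memory, where the at least one processor is used to execute program instructions and the memory may exist as program storage units and data storage units of different forms, such as a hard disk, read-only memory (ROM), or random access memory (RAM), which can store the various data files used by the processor in processing and/or executing the named entity recognition process, as well as the program instructions that may be executed by the processor.
Although the speech signal collector 710, the speech feature vector extractor 720, the text feature vector extractor 730, the composite vector generator 740, and the named entity recognizer 750 are presented as separate modules in Fig. 7, those skilled in the art will understand that these apparatus modules may be implemented as separate hardware devices or be integrated into one or more hardware devices. As long as the principles described in the present disclosure can be realized, the specific implementation of the different hardware devices should not be taken as a factor limiting the scope of protection of the present disclosure.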
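According to another aspect of the present disclosure, a computer-readable storage medium is also provided, on which computer-readable instructions are stored; when the instructions are executed by a computer, the method described above can be performed.
With the named entity recognition method, apparatus, and device provided by the present disclosure, speech signal analysis is used to assist named entity recognition, which extends forward the traditional process of performing named entity recognition on text alone and improves the accuracy and range of application of named entity recognition. In particular, the method of the present disclosure can effectively overcome the difficulty of recognizing named entities when, in special scenarios, the collected speech signal contains multiple complex special names, further improving the robustness and accuracy of named entity recognition.
The program portions of the technology may be regarded as "products" or "articles of manufacture" in the form of executable code and/or associated data, which are embodied in or carried out through a computer-readable medium. Tangible, permanent storage media may include the memory or storage used by any computer, processor, or similar device or associated module, for example various semiconductor memories, tape drives, disk drives, or any similar device capable of providing storage functions for software.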
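All or part of the software may at times be communicated over a network, such as the Internet or another communication network. Such communication can load the software from one computer device or processor to another, for example from a server or host computer of the named entity recognition device to the hardware platform of a computer environment, to another computer environment implementing the system, or to a system with similar functions related to providing the information required for named entity recognition. Accordingly, another kind of medium capable of carrying software elements can also serve as a physical connection between local devices, such as light waves, radio waves, or electromagnetic waves propagating through cables, optical cables, or the air. The physical media that carry such waves, such as cables, wireless links, or optical cables, can also be regarded as media bearing the software. As used here, unless restricted to tangible "storage" media, terms referring to a computer or machine "readable medium" denote any medium that participates in the execution of any instruction by a processor.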
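The present disclosure uses specific terms to describe its embodiments. Terms such as "one embodiment", "an embodiment", and/or "some embodiments" refer to a certain feature, structure, or characteristic related to at least one embodiment of the present disclosure. It should therefore be emphasized and noted that "an embodiment", "one embodiment", or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics of one or more embodiments of the present disclosure may be combined as appropriate.
Furthermore, those skilled in the art will understand that aspects of the present disclosure may be illustrated and described in terms of several patentable categories or situations, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, and the like), or by a combination of hardware and software. Any of the above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component", or "system". In addition, aspects of the present disclosure may take the form of a computer product located in one or more computer-readable media, the product including computer-readable program code.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present disclosure belongs. It should also be understood that terms such as those defined in ordinary dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and should not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above is a description of the present disclosure and should not be regarded as limiting it. Although several exemplary embodiments of the present disclosure have been described, those skilled in the art will readily understand that many modifications can be made to the exemplary embodiments without departing from the novel teachings and advantages of the present disclosure. All such modifications are therefore intended to be included within the scope of the present disclosure as defined by the claims. It should be understood that the above is a description of the present disclosure and is not to be regarded as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments as well as other embodiments are intended to be included within the scope of the appended claims. The present disclosure is defined by the claims and their equivalents.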
Claims (19)
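1. A named entity recognition method, comprising: collecting a speech signal; extracting a speech feature vector from the speech signal; extracting, based on a text result obtained by performing speech recognition on the speech signal, a text feature vector from the text result; concatenating the speech feature vector with the text feature vector to obtain a composite feature vector of each word in the speech signal; and processing the composite feature vector of each word in the speech signal by a deep learning model to obtain a named entity recognition result.
2. The named entity recognition method according to claim 1, wherein extracting the speech feature vector from the speech signal comprises extracting a speech sentence feature vector and extracting a speech word feature vector.
3. The named entity recognition method according to claim 2, wherein extracting the speech sentence feature vector from the speech signal comprises: converting, according to a speech parameter lookup table in a preset speech sentence library, the sentence features of the speech in the speech signal into the corresponding speech sentence feature vector.
4. The named entity recognition method according to claim 2, wherein extracting the speech word feature vector from the speech signal comprises: obtaining the speech word feature vector in the speech signal by performing speech analysis on the speech signal.
5. The named entity recognition method according to claim 4, wherein performing speech analysis on the speech signal comprises: performing discretized sampling of the speech signal in the time and frequency domains to obtain a digital speech signal; processing each word in the digital speech signal in the time domain and in the frequency domain respectively to obtain its time-domain feature vector and frequency-domain feature vector; and, for each word in the speech signal, concatenating its time-domain feature vector and frequency-domain feature vector to obtain the speech word feature vector corresponding to that word.
6. The named entity recognition method according to claim 1, wherein extracting, based on the text result obtained by performing speech recognition on the speech signal, the text feature vector from the text result comprises extracting a word feature vector and extracting a word segmentation embedding feature vector.
7. The named entity recognition method according to claim 6, wherein extracting the word feature vector from the text result comprises: converting, according to a lookup table of words and vector values in a preset word library, the text into the corresponding word feature vector.
8. The named entity recognition method according to claim 6, wherein extracting the word segmentation embedding feature vector from the text result comprises: dividing the text result into phrases and single words according to a phrase lookup table in a preset phrase library; and converting, according to a preset transformation rule, each word in the phrases and each single word into the corresponding word segmentation embedding feature vector.
9. The named entity recognition method according to claim 1, wherein concatenating the speech feature vector with the text feature vector to obtain the composite feature vector of each word in the speech signal comprises: normalizing the extracted speech feature vector and text feature vector separately; and concatenating the dense text feature vector and the dense speech feature vector obtained after normalization for each word in the speech signal to obtain the composite feature vector for each word in the speech signal.
10. The named entity recognition method according to claim 1, wherein concatenating the speech feature vector with the text feature vector to obtain the composite feature vector of each word in the speech signal comprises: concatenating the obtained dense text feature vector and dense speech feature vector for each word in the speech signal to obtain the composite feature vector for each word in the speech signal; and normalizing the speech feature vector and the text feature vector in the obtained composite feature vector separately.
11. The named entity recognition method according to claim 9 or 10, wherein performing the normalization comprises: applying linear-function normalization to the speech feature vector and to the text feature vector separately.
12. The named entity recognition method according to claim 9 or 10, wherein performing the normalization comprises: applying zero-mean standardization to the speech feature vector and to the text feature vector separately.
13. The named entity recognition method according to claim 1, wherein processing the composite feature vector of each word in the speech signal by the deep learning model to obtain the named entity recognition result comprises: sending the composite feature vector to an input of the selected deep learning model; processing the composite feature vector by the layers of the selected deep learning model; and obtaining the named entity recognition result at an output of the deep learning model.
14. The named entity recognition method according to claim 1, wherein, where the speech signal includes multiple sentences, the method further comprises, before processing the composite feature vector of each word in the speech signal by the deep learning model to obtain the named entity recognition result: truncating all of the obtained composite feature vectors of the speech signal according to the sentence length feature value corresponding to the current sentence in the speech signal to obtain multiple composite feature vector sequences, wherein the number of composite feature vector sequences equals the number of sentences contained in the speech signal, and the number of composite feature vectors in each of the composite feature vector sequences equals the sentence length feature value corresponding to the current sentence in the speech signal.
15. The named entity recognition method according to claim 14, wherein the sentence length feature value of the current sentence in the speech signal is obtained from the speech feature vector in the speech signal.
16. The named entity recognition method according to claim 14, wherein the sentence length feature value of the current sentence in the speech signal is obtained from the text result of speech recognition performed on the speech signal.
17. A named entity recognition apparatus, comprising: a speech signal collector for collecting a speech signal; a speech feature vector extractor configured to extract a speech feature vector from the speech signal; a text feature vector extractor configured to extract, based on a text result obtained by performing speech recognition on the speech signal, a text feature vector from the text result; a composite vector generator configured to concatenate the speech feature vector with the text feature vector to obtain a composite feature vector of each word in the speech signal; and a named entity recognizer configured to process the composite feature vector of each word in the speech signal by a deep learning model to obtain a named entity recognition result.
18. A named entity recognition device, wherein the device comprises a speech collection apparatus, a processor, and a memory, the memory containing a set of instructions which, when executed by the processor, cause the named entity recognition device to perform the method according to any one of claims 1-16.
19. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the method according to any one of claims 1-16 is performed when the instructions are executed by a computer.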
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/652,233 US11514891B2 (en) | 2018-08-30 | 2019-08-28 | Named entity recognition method, named entity recognition equipment and medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811005796.4 | 2018-08-30 | ||
CN201811005796.4A CN109741732B (zh) | 2018-08-30 | 2018-08-30 | Named entity recognition method, named entity recognition apparatus, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020043123A1 true WO2020043123A1 (zh) | 2020-03-05 |
Family
ID=66354356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/103027 WO2020043123A1 (zh) | Named entity recognition method, named entity recognition apparatus, device and medium | 2018-08-30 | 2019-08-28 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11514891B2 (zh) |
CN (1) | CN109741732B (zh) |
WO (1) | WO2020043123A1 (zh) |
-
2018
- 2018-08-30 CN CN201811005796.4A patent/CN109741732B/zh active Active
-
2019
- 2019-08-28 US US16/652,233 patent/US11514891B2/en active Active
- 2019-08-28 WO PCT/CN2019/103027 patent/WO2020043123A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN109741732A (zh) | 2019-05-10 |
CN109741732B (zh) | 2022-06-21 |
US20200251097A1 (en) | 2020-08-06 |
US11514891B2 (en) | 2022-11-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19854483 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.06.2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19854483 Country of ref document: EP Kind code of ref document: A1 |