CN102136001A

CN102136001A - Multi-media information fuzzy search method

Info

Publication number: CN102136001A
Application number: CN2011100730481A
Authority: CN
Inventors: 伍昕; 吴鹏; 刘赵杰
Original assignee: TVMining Beijing Media Technology Co Ltd
Current assignee: TVMining Beijing Media Technology Co Ltd
Priority date: 2011-03-25
Filing date: 2011-03-25
Publication date: 2011-07-27
Anticipated expiration: 2031-03-25
Also published as: CN102136001B

Abstract

The invention discloses a fuzzy retrieval method for multimedia information, comprising the following steps: collecting audio and video data; obtaining the Lattice result of the audio data; obtaining confidence scores from the time-point information and the matching likelihood scores; re-ranking the multiple candidates with a stronger speech model and outputting the best recognition result; building word-level and phoneme-level indexes and generating a raw information store; inputting the text to be retrieved and time-point information; converting the text to be retrieved into a phoneme sequence; obtaining similar phoneme sequences using a phoneme confusion matrix; splitting the similar phoneme sequences into multiple phoneme combinations, which are looked up in the inverted index and then matched exactly against the raw information store; and returning candidate positions. With this technical scheme, the number of retrieved results can be maximized and the retrieval speed greatly improved while system performance is maintained.

Description

A fuzzy retrieval method for multimedia information

Technical field

The present invention relates to the field of multimedia technology, and in particular to a fuzzy retrieval method for multimedia information.

Background technology

With the development of the information age, multimedia documents and news broadcast programs are growing to a massive scale. Compared with traditional text such as newspapers, magazines, and books, and with the rich text of the modern Internet, multimedia documents such as audio and video offer richer and more vivid forms of presentation and are easier for people to take in. However, because multimedia documents are numerous and varied, obtaining the content of interest conveniently has become an urgent problem. The usual approach is to extract information from the data manually, which is time-consuming and laborious. Many technologies based on artificial intelligence have therefore been applied to this field in recent years, the most popular being speech recognition. Speech recognition converts speech to text, and once text is available, search technology can be used for comprehensive indexing and retrieval.

However, speech recognition is not a completely reliable technology, so a retrieval technique that compensates for and corrects its recognition errors is necessary. As automatic speech recognition has become practical and open-source, many companies have begun to buy or build automatic speech recognition systems suited to their own fields and needs. Using speech recognition to recognize the speech in audio and video material yields the text information in that material; once this text is entered into a database, it can be retrieved conveniently.

Conventional speech recognition provides only the final recognized Chinese characters. On the one hand, locating a specific index term precisely requires manual judgment, which is time-consuming and laborious; on the other hand, because it is limited by speech recognition performance, the accuracy of indexing and search is hard to control. For example, if "Beijing" at some position has been recognized as a similar-sounding word meaning "after all", a user searching for "Beijing" will not find that position. Sometimes "Beijing" may be pronounced as "Bei Jin" or as a homophone (literally "north is frightened"), and again cannot be found. The performance of traditional text-based search is therefore constrained by speech recognition.

Summary of the invention

The object of the invention is to propose a fuzzy retrieval method for multimedia information that maximizes the number of results retrieved and greatly improves retrieval speed while maintaining system performance.

To this end, the invention adopts the following technical scheme:

A fuzzy retrieval method for multimedia information comprises the following steps:

A. collecting audio and video data;

B. obtaining the Lattice result of the audio data, which includes time-point information and matching likelihood scores, and converting it into multiple candidates;

C. obtaining confidence scores from the time-point information and the matching likelihood scores;

D. re-ranking the multiple candidates with a stronger speech model and outputting the best recognition result;

E. building word-level and phoneme-level indexes from the multiple candidates, the time-point information, and the confidence scores, which together constitute the inverted index, and encoding the raw information to generate the raw information store;

F. inputting the text to be retrieved and time-point information, converting the text to be retrieved into a phoneme sequence, using the phoneme confusion matrix to obtain similar phoneme sequences, and splitting them into one or more phoneme combinations;

G. querying the inverted index with the words and with the phoneme sequences respectively, obtaining a set of entry positions in the raw information store together with their confidence scores, and returning them in descending order of confidence score;

H. entering the raw information store at each entry position to perform exact matching, selecting a confidence threshold according to the number of entries and the confidence scores, and returning the candidate positions whose confidence exceeds the threshold.

Step A further comprises the following step:

converting the audio data to Windows WAV format with a sampling rate of 16 kHz.

In step A, the audio data in television programs is captured using a computer with a TV card, and the audio data in broadcast signals is captured using a radio with a sound card.

In step F, the text to be retrieved is converted into a phoneme sequence by letter-to-phoneme conversion.

The technical scheme of the invention addresses the kinds of speech recognition errors that may occur by exploiting their similarity at the phoneme level; the fuzziness introduced through the phoneme confusion matrix maximizes the number of results retrieved. In addition, to address the high repetition rate at the phoneme level, indexes are built jointly over combinations of several phonemes, which greatly improves retrieval speed while maintaining system performance.

Description of drawings

Fig. 1 is a flowchart of multimedia information fuzzy retrieval in a specific embodiment of the invention.

Embodiment

The technical scheme of the invention is further described below with reference to the accompanying drawing and a specific embodiment.

Fig. 1 is a flowchart of multimedia information fuzzy retrieval in a specific embodiment of the invention. As shown in Fig. 1, the multimedia information retrieval flow comprises the following steps:

Step 101: collect audio and video data. The audio data in television programs is captured using a computer with a TV card, and the audio data in broadcast signals is captured using a radio with a sound card. The audio data is then converted to Windows WAV format (uncompressed PCM) with a sampling rate of 16 kHz.

Because the recording formats of the TV card and the sound card are fixed, transcoding only needs to be implemented for those specific formats.
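The target format of step 101 can be illustrated with a minimal sketch using Python's standard `wave` module. The text specifies only uncompressed PCM WAV at 16 kHz; the mono channel layout, the 16-bit sample width, and the function names below are assumptions made for illustration, and resampling from a capture card's native format is not shown.

```python
import io
import math
import struct
import wave

TARGET_RATE_HZ = 16000  # sampling rate named in step 101


def write_pcm16_wav(fileobj, samples, rate_hz=TARGET_RATE_HZ):
    """Write float samples in [-1, 1] as uncompressed PCM WAV.

    Mono and 16-bit are assumptions; the text only fixes PCM and 16 kHz.
    """
    with wave.open(fileobj, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes = 16-bit samples
        writer.setframerate(rate_hz)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        writer.writeframes(frames)


def is_target_format(fileobj):
    """Check that a WAV stream is uncompressed and sampled at 16 kHz."""
    with wave.open(fileobj, "rb") as reader:
        return (
            reader.getframerate() == TARGET_RATE_HZ
            and reader.getcomptype() == "NONE"
        )


# 10 ms of a 440 Hz tone, written to memory instead of a capture file.
buffer = io.BytesIO()
tone = [math.sin(2 * math.pi * 440 * t / TARGET_RATE_HZ) for t in range(160)]
write_pcm16_wav(buffer, tone)
buffer.seek(0)
ok = is_target_format(buffer)
```

A check like `is_target_format` could be run before recognition so that files in an unexpected format are transcoded first.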

Step 102: obtain the Lattice result of the audio data, which includes time-point information, silence information, and matching likelihood scores, and convert it into multiple candidates.

Unlike an ordinary recognition result, the recognition result in this embodiment is not the single best result in the conventional sense (also called 1-Best) but the richer set of decoding paths retained during speech recognition, known as a Lattice-format result. The main characteristics of this format are that it contains rich time-point and silence information together with matching likelihood scores, and that it can be converted into word-level multiple candidates, also called a confusion network, as well as into the best result. Working on the confusion network can give better performance than the single best recognition result.
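As a rough illustration of how a lattice retains several decoding paths, the sketch below models a toy lattice as a small acyclic graph and lists its best-scoring paths. The `Arc` fields, the example words, and the scores are hypothetical; real recognizers use more compact algorithms than exhaustive enumeration, and building a confusion network is not shown.

```python
import heapq
from collections import defaultdict, namedtuple

# One lattice arc: a word hypothesis with time points and a likelihood score.
Arc = namedtuple("Arc", "src dst word t_start t_end loglik")


def n_best(arcs, start, final, n):
    """Enumerate start->final paths of a small acyclic lattice; keep the n best."""
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc.src].append(arc)

    paths = []

    def walk(node, words, score):
        if node == final:
            paths.append((score, words))
            return
        for arc in outgoing[node]:
            walk(arc.dst, words + [arc.word], score + arc.loglik)

    walk(start, [], 0.0)
    # Highest total log-likelihood first.
    return heapq.nlargest(n, paths, key=lambda p: p[0])


# Toy lattice: two competing words in each of two time slots.
LATTICE = [
    Arc(0, 1, "Beijing", 0.0, 0.6, -1.0),
    Arc(0, 1, "after all", 0.0, 0.6, -0.8),
    Arc(1, 2, "news", 0.6, 1.1, -0.5),
    Arc(1, 2, "views", 0.6, 1.1, -1.5),
]
candidates = n_best(LATTICE, start=0, final=2, n=2)
```

In this toy lattice the 1-Best path contains the misrecognized "after all", while the correct "Beijing" survives as the second candidate, which is why keeping multiple candidates helps later retrieval.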

Step 103: from the time-point information and the matching likelihood scores, compute a score that assesses recognition quality, also called the confidence score.
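The text does not give the confidence formula. One common way to combine time points and likelihoods, shown here purely as an assumption, is to normalize each hypothesis's likelihood against all hypotheses that overlap it in time, giving a posterior-like value between 0 and 1. The dictionary keys and function names are hypothetical.

```python
import math


def overlaps(a, b):
    """True when two hypotheses share some stretch of time."""
    return a["t_start"] < b["t_end"] and b["t_start"] < a["t_end"]


def confidence(arc, arcs):
    """Posterior-like confidence: the arc's likelihood relative to its rivals.

    Assumed formula (not taken from the text): a softmax over the
    log-likelihoods of all arcs overlapping `arc` in time, including itself.
    """
    rivals = [x for x in arcs if overlaps(arc, x)]
    top = max(x["loglik"] for x in rivals)  # subtract max for numerical stability
    total = sum(math.exp(x["loglik"] - top) for x in rivals)
    return math.exp(arc["loglik"] - top) / total


arcs = [
    {"word": "Beijing", "t_start": 0.0, "t_end": 0.6, "loglik": -1.0},
    {"word": "after all", "t_start": 0.0, "t_end": 0.6, "loglik": -1.0},
    {"word": "news", "t_start": 0.6, "t_end": 1.1, "loglik": -0.5},
]
scores = {a["word"]: confidence(a, arcs) for a in arcs}
```

Under this assumption, a word with no competitor in its time span gets confidence 1.0, and two equally likely competitors each get 0.5.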

Step 104: re-rank the multiple candidates with a stronger speech model and output the best recognition result.
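Re-ranking of this kind can be sketched as adding a second model's score to each candidate's first-pass score and sorting again. The text does not say what the stronger model is; the toy bigram table, the weight, and all names below are assumptions for illustration only.

```python
def rescore(candidates, model_logprob, weight=1.0):
    """Re-rank (words, first_pass_score) pairs with a second model's score."""
    rescored = [
        (score + weight * model_logprob(words), words)
        for words, score in candidates
    ]
    rescored.sort(key=lambda item: item[0], reverse=True)
    return rescored


# Hypothetical second-pass model: a tiny bigram table with a default penalty.
BIGRAM_LOGPROB = {("Beijing", "news"): -0.1}
DEFAULT_LOGPROB = -2.0


def toy_model_logprob(words):
    pairs = zip(words, words[1:])
    return sum(BIGRAM_LOGPROB.get(pair, DEFAULT_LOGPROB) for pair in pairs)


first_pass = [
    (["after all", "news"], -1.3),
    (["Beijing", "news"], -1.5),
]
reranked = rescore(first_pass, toy_model_logprob)
best_words = reranked[0][1]
```

Here the second model prefers the word pair it has seen before, so the candidate that ranked second in the first pass becomes the best recognition result.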

Step 105: build word-level and phoneme-level indexes from the multiple candidates, the time-point information, and the confidence scores, which together constitute the inverted index, and encode the raw information to generate the raw information store.

In this step, following the principles of a search engine, the information obtained in the preceding steps is indexed at the basic index levels. Two index levels are used here, the word level and the phoneme level, where a phoneme can be understood simply as the initial or the final of a syllable. This approach is rarely used in search engines. The phoneme-level index is added mainly because speech recognition may produce recognition errors, and these errors are correlated with the correct text; for example, their phonemes remain fairly similar. A phoneme confusion matrix is trained from common recognition errors, and the phoneme-level index makes it possible to use that matrix. At the same time, phonemes occur far more frequently than individual characters, which would produce a large number of candidate results and reduce search efficiency. Indexing is therefore done on combinations of several phonemes, which greatly improves search efficiency while preserving search quality. The two index levels constitute the inverted index, which contains time-point and confidence information; the raw information is also efficiently encoded and compressed to generate the raw information store.
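The two-level index described above can be sketched with two dictionaries, one keyed by word and one keyed by a combination of consecutive phonemes. The posting layout `(document, start time, confidence)`, the use of fixed-length phoneme n-grams as the "combinations", and the example phoneme strings are assumptions made for illustration; encoding of the raw information store is not shown.

```python
from collections import defaultdict


def phoneme_ngrams(phonemes, n=2):
    """Split a phoneme sequence into overlapping combinations of n phonemes."""
    if len(phonemes) <= n:
        return [tuple(phonemes)]
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


def build_indexes(hypotheses, n=2):
    """Build word-level and phoneme-level inverted indexes.

    Each posting is (document id, start time, confidence), an assumed layout.
    """
    word_index = defaultdict(list)
    phone_index = defaultdict(list)
    for hyp in hypotheses:
        posting = (hyp["doc"], hyp["t_start"], hyp["conf"])
        word_index[hyp["word"]].append(posting)
        for gram in phoneme_ngrams(hyp["phonemes"], n):
            phone_index[gram].append(posting)
    return word_index, phone_index


hypotheses = [
    {"doc": "clip1", "t_start": 12.4, "conf": 0.9,
     "word": "Beijing", "phonemes": ["b", "ei", "j", "ing"]},
    {"doc": "clip2", "t_start": 3.0, "conf": 0.6,
     "word": "after all", "phonemes": ["b", "i", "j", "ing"]},
]
word_index, phone_index = build_indexes(hypotheses, n=2)
shared = phone_index[("j", "ing")]
```

Keying on pairs such as `("j", "ing")` rather than on single phonemes keeps posting lists short, which is the efficiency argument made in the text, while still letting a misrecognized word share keys with the correct one.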

Step 106: input the text to be retrieved and time-point information, convert the text to be retrieved into a phoneme sequence by letter-to-phoneme conversion (Grapheme-to-Phoneme, G2P), use the phoneme confusion matrix to obtain similar phoneme sequences, and split them into multiple phoneme combinations.
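Query expansion in step 106 can be sketched as follows. The lookup-table G2P, the confusable-phoneme table (standing in for a trained confusion matrix with its probabilities thresholded away), and the pairwise splitting are all hypothetical simplifications.

```python
from itertools import product

# Hypothetical pronunciation lexicon standing in for a real G2P module.
LEXICON = {"Beijing": ["b", "ei", "j", "ing"]}

# Hypothetical confusable phonemes, e.g. derived from a confusion matrix
# by keeping only substitutions above some probability.
CONFUSABLE = {"ei": ["i"], "ing": ["in"]}


def g2p(text):
    """Convert query text to a phoneme sequence (toy dictionary lookup)."""
    return LEXICON[text]


def similar_sequences(phonemes, confusable):
    """Generate the original sequence plus every confusable substitution."""
    options = [[p] + confusable.get(p, []) for p in phonemes]
    return [list(seq) for seq in product(*options)]


def phoneme_ngrams(phonemes, n=2):
    """Split a phoneme sequence into overlapping combinations of n phonemes."""
    if len(phonemes) <= n:
        return [tuple(phonemes)]
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]


variants = similar_sequences(g2p("Beijing"), CONFUSABLE)
query_keys = {gram for seq in variants for gram in phoneme_ngrams(seq)}
```

The full cross product grows quickly with longer queries, so a practical system would likely cap the number of substitutions per query; that cap is a design choice outside what the text specifies.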

Step 107: query the inverted index with the words and with the phoneme sequences respectively, obtaining a set of entry positions in the raw information store together with their confidence scores, and return them in descending order of confidence score.
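The lookup in step 107 can be sketched as collecting postings for every query key, merging duplicates, and sorting by confidence. The index layout (key to list of `(document, start time, confidence)` postings) and the rule of keeping the highest confidence per position are assumptions for illustration.

```python
def lookup(index, keys):
    """Collect entry positions for the given keys, highest confidence first."""
    best = {}
    for key in keys:
        for doc, t_start, conf in index.get(key, []):
            position = (doc, t_start)
            # Keep the highest confidence seen for each position.
            best[position] = max(conf, best.get(position, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy phoneme-level index with an assumed posting layout.
phone_index = {
    ("b", "ei"): [("clip1", 12.4, 0.9)],
    ("j", "ing"): [("clip1", 12.4, 0.9), ("clip2", 3.0, 0.6)],
}
entries = lookup(phone_index, [("b", "ei"), ("j", "ing"), ("x", "y")])
```

The same function can be applied to a word-level index by passing words as keys, so both index levels return entries in one common form.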

Step 108: enter the raw information store at each entry position to perform exact matching, select a confidence threshold according to the number of entries and the confidence scores, and return the candidate positions whose confidence exceeds the threshold for the user to browse, completing one retrieval.
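The text does not state how the threshold depends on the number of entries. The sketch below assumes one simple rule: use a fixed floor when there are few entries, and raise the threshold to cut off the list when there are many. The raw-store layout (position to stored phoneme tuple), the parameter values, and the function names are hypothetical.

```python
def choose_threshold(confidences, max_results=3, floor=0.3):
    """Pick a threshold from the entry count and scores (assumed rule)."""
    ranked = sorted(confidences, reverse=True)
    if len(ranked) <= max_results:
        return floor
    # With many entries, require beating the (max_results + 1)-th score.
    return max(floor, ranked[max_results])


def verify_and_filter(entries, raw_store, variants, max_results=3, floor=0.3):
    """Exact-match each entry against the raw store, then apply the threshold."""
    matched = [
        (pos, conf) for pos, conf in entries if raw_store.get(pos) in variants
    ]
    threshold = choose_threshold([c for _, c in matched], max_results, floor)
    return [(pos, conf) for pos, conf in matched if conf > threshold]


# Toy raw information store: position -> phoneme sequence stored there.
raw_store = {
    ("a", 1.0): ("b", "ei", "j", "ing"),
    ("a", 5.0): ("b", "i", "j", "ing"),
    ("b", 2.0): ("b", "ei", "j", "ing"),
    ("c", 3.0): ("n", "an", "j", "ing"),
}
variants = {("b", "ei", "j", "ing"), ("b", "i", "j", "ing")}
entries = [
    (("a", 1.0), 0.9),
    (("c", 3.0), 0.8),
    (("a", 5.0), 0.6),
    (("b", 2.0), 0.2),
]
results = verify_and_filter(entries, raw_store, variants)
```

In this example the entry at `("c", 3.0)` shares an index key with the query but fails exact matching, and the entry at `("b", 2.0)` matches but falls below the confidence floor, so neither is returned.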

With this embodiment, multimedia information can be annotated and stored more completely, so that later queries can be finer-grained and can index and locate the position of interest more quickly. The phoneme-level index greatly increases the amount of multimedia information that can be found, and the confidence information filters out poorly recognized multimedia information. Together, these two techniques effectively avoid retrieval errors caused by speech recognition errors.

The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A fuzzy retrieval method for multimedia information, characterized in that it comprises the following steps:

A. collecting audio and video data;

E. building word-level and phoneme-level indexes from the multiple candidates, the time-point information, and the confidence scores, which together constitute the inverted index, and encoding the multimedia data to generate a multimedia database;

2. The fuzzy retrieval method for multimedia information according to claim 1, characterized in that step A further comprises the following step:

3. The fuzzy retrieval method for multimedia information according to claim 1, characterized in that, in step A, the audio data in television programs is captured using a computer with a TV card, and the audio data in broadcast signals is captured using a radio with a sound card.

4. The fuzzy retrieval method for multimedia information according to claim 1, characterized in that, in step F, the text to be retrieved is converted into a phoneme sequence by letter-to-phoneme conversion.