CN114547373A - Method for intelligently identifying and searching programs based on audio - Google Patents
Method for intelligently identifying and searching programs based on audio
- Publication number
- CN114547373A (application CN202210155600.XA)
- Authority
- CN
- China
- Prior art keywords
- video
- audio
- information
- label
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Library & Information Science (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The invention provides a method for intelligently identifying and searching programs based on audio, comprising the following steps: acquiring program video resources; collecting and extracting the different types of basic data in a video; outputting multi-dimensional video labels through model training to form a label library; and performing recognition and retrieval from instruction words or audio data and matching a result. Compared with existing search modes, the method adds instruction forms such as the original audio of a song, humming, classic lines and scenes showing a particular action, and combines them with a video label system to retrieve media assets quickly and accurately, improving the efficiency of video retrieval.
Description
Technical Field
The invention relates to a method for intelligently identifying and searching programs based on audio, and belongs to the technical field of set-top box video.
Background
With the rapid development of information technology, terminal devices have become intelligent, and most applications and terminal devices now support speech recognition. In video applications and websites, programs are traditionally searched by keyboard or by voice (title, star name, keyword). If nothing is known about a program, it cannot be found directly from a single detail. For example, a user may have heard a piece of music or a classic line, or seen a highlight clip, and want to watch the corresponding program video, without knowing which program, which episode or which time period it comes from. In that case the traditional search mode cannot meet the need: the user usually has to search on a website to find the program, episode and time period, and only then go and watch it, so the whole process is cumbersome. In a fast-paced world, users find anything "slow" increasingly hard to accept and increasingly expect immediate feedback.
Disclosure of Invention
The invention aims to provide a method for intelligently identifying and searching programs based on audio, with which media assets can be retrieved quickly and accurately from a detail such as a song, a line or a scene.
To achieve this aim, the invention is realized by the following technical solution:
A method for intelligently identifying and searching programs based on audio, characterized by comprising the following steps:
1) collecting in depth the video data information, including images, audio and text, according to the data features of several different dimensions in a program video file;
2) performing structured preprocessing on the type and format of the basic data: cleaning, screening, converting, sorting, etc.;
3) intelligently labelling the video file according to the result of multi-modal feature fusion and understanding, and outputting multi-dimensional video label information (audio fingerprints, video segment keywords and subtitles with their corresponding timestamps); automatically labelling and classifying the corresponding videos according to the text labels to form a media asset label system library;
4) training a model based on speech recognition and semantic understanding technologies, constructing a grammar and specifying the grammar to be used;
5) waking up the device and speaking an instruction word, or providing audio data, to the terminal device; parsing the instruction, extracting keywords and recognizing their intent; performing recognition and retrieval, wherein the recognition result is matched only within the instruction information list based on the label system;
6) feeding back the recognition result, which either contains all programs matching the instruction information or is positioned directly at the time position of the corresponding line or picture, for the user to select or watch directly.
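The label library of step 3) and the lookup of steps 5) and 6) can be pictured with a minimal Python sketch. Everything named here (`Tag`, `TagLibrary`, the field names, and the substring match) is a hypothetical illustration and is not taken from the patent, which does not specify a storage format or a matching rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One label attached to a time position in a program (hypothetical schema)."""
    program: str    # program title
    episode: int    # episode number within the program
    start_s: float  # time position of the labelled segment, in seconds
    kind: str       # e.g. "subtitle", "keyword", "audio_fingerprint"
    value: str      # label text or fingerprint identifier


class TagLibrary:
    """In-memory stand-in for the media asset label system library."""

    def __init__(self):
        # Maps a lower-cased label value to every tag carrying that value.
        self._index = defaultdict(list)

    def add(self, tag):
        self._index[tag.value.lower()].append(tag)

    def search(self, query):
        """Return tags whose value contains the query, ordered by position."""
        q = query.lower()
        hits = [t for value, tags in self._index.items() if q in value for t in tags]
        return sorted(hits, key=lambda t: (t.program, t.episode, t.start_s))
```

Each hit carries the program, episode and time position, so a caller could either list all matching programs or jump straight to `start_s`, mirroring the two feedback modes of step 6).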
Preferably, the structured preprocessing of the type and format of the basic data in step 2) comprises the following specific steps:
2-1) for video images, implementing the VLAD algorithm with a CNN network structure, wherein NetVLAD converts video sequence features into the features of several video shots through cluster centers; an LSTM models the temporal relationship of the image feature sequence and outputs a sequence prediction; Activity Recognition outputs the label corresponding to the video; Image Description outputs a description of the image; Video Description outputs a description of the video; the video shots are then weighted and summed with learnable weights to obtain a global feature vector;
2-2) for audio information, separating the audio signal from the video, extracting an audio feature sequence with VGGish, extracting the audio features corresponding to different shots with NetVLAD, and outputting audio classification, speech-to-text results, song audio fingerprints and the like; performing punctuation prediction and intelligent sentence segmentation, extracting speech information from multilingual and dialect material, and then fusing through learnable weights to generate a global feature vector of the audio modality;
2-3) for text information, using a video text extraction algorithm that takes OCR (optical character recognition) as the base model and performs preprocessing and post-processing with a natural language processing (NLP) model, intelligently identifying the core text content in the video, filtering out other interfering text, and outputting three types of text content (titles, characters and subtitles) together with their text categories.
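The final operation of steps 2-1) and 2-2), a weighted sum of per-shot features with learnable weights, can be sketched in plain Python as follows. This is an illustration under assumptions: the patent does not say how the weights are normalized, so the softmax used here, and the names `softmax` and `fuse_shots`, are choices made for this sketch only. In a trained system the logits would be parameters updated by gradient descent.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_shots(shot_features, weight_logits):
    """Weighted sum of per-shot feature vectors into one global vector.

    shot_features: list of equal-length feature vectors, one per shot.
    weight_logits: one learnable score per shot (assumed, see lead-in).
    """
    if len(shot_features) != len(weight_logits):
        raise ValueError("need exactly one weight per shot")
    weights = softmax(weight_logits)
    dim = len(shot_features[0])
    return [
        sum(w * feature[i] for w, feature in zip(weights, shot_features))
        for i in range(dim)
    ]
```

With equal logits the result is the plain average of the shot features; raising one logit shifts the global vector toward that shot.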
The invention has the following advantages: compared with existing search modes, the method adds instruction forms such as the original audio of a song, humming, classic lines and scenes showing a particular action, and combines them with a video label system to retrieve media assets quickly and accurately, improving the efficiency of video retrieval.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a schematic flow diagram of the present invention.
FIG. 2 is a diagram of the data processing architecture of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
1) the program video file has data features of several different dimensions; video data information is collected in depth, including but not limited to images (characters, scenes and actions), audio (opening and closing themes, interlude music, song melodies, song names and dubbed lines) and text (subtitles);
2) performing structured preprocessing on the type and format of the basic data: cleaning, screening, converting, sorting, etc.;
2-1) video images: the VLAD algorithm is implemented with a CNN network structure, wherein NetVLAD converts video sequence features into the features of several video shots through cluster centers; an LSTM models the temporal relationship of the image feature sequence and outputs a sequence prediction; Activity Recognition outputs the label corresponding to the video segment; Image Description outputs a description of the image; Video Description outputs a description of the video; the video shots are then weighted and summed with learnable weights to obtain a global feature vector;
2-2) audio information: the audio signal is separated from the video, an audio feature sequence is extracted with VGGish, the audio features corresponding to different shots are extracted with NetVLAD, and audio classification, speech-to-text results, song audio fingerprints and the like are output; punctuation prediction and intelligent sentence segmentation are performed, speech information is extracted from multilingual and dialect material, and the results are then fused through learnable weights to generate a global feature vector of the audio modality;
2-3) text information: the video text extraction algorithm takes OCR as the base model, performs preprocessing and post-processing with a natural language processing (NLP) model, intelligently identifies the core text content in the video, filters out other interfering text, and outputs three types of text content (titles, characters and subtitles) together with their text categories;
3) intelligently labelling the video file according to the result of multi-modal feature fusion and understanding, and outputting multi-dimensional video label information (audio fingerprints, video segment keywords and subtitles with their corresponding timestamps); automatically labelling and classifying the corresponding videos according to the text labels to form a media asset label system library;
4) training a model based on technologies such as speech recognition and semantic understanding, constructing a grammar and specifying the grammar to be used;
5) waking up the device and speaking an instruction word, or providing audio data, to the terminal device; parsing the instruction, extracting keywords and recognizing their intent;
for example, humming a passage of a melody; saying "open the position of" followed by a classic line; or saying that one wants to see the "Changjin Lake ice sculpture company" scene, and so on;
6) performing recognition and retrieval, wherein the recognition result is matched only within the instruction information list based on the label system;
7) feeding back the recognition result, which either contains all programs matching the instruction information or is positioned directly at the time position of the corresponding line or picture, for the user to select or watch directly.
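The instruction parsing of step 5) can be illustrated with a small rule-based Python sketch. The patent relies on a trained speech recognition and semantic understanding model; the regular expressions, the intent names (`locate_line`, `find_scene`, `search_title`) and the function `parse_instruction` below are hypothetical stand-ins that operate on text already produced by speech recognition. Hummed audio would be matched against audio fingerprints instead and is not covered here.

```python
import re
from typing import NamedTuple


class Instruction(NamedTuple):
    intent: str   # what the user wants to do (hypothetical intent names)
    keyword: str  # the extracted line, scene name or title


# Character class accepting straight or curly quotes around the keyword.
_Q = "[\"'\u201c\u201d]"

# Illustrative patterns only; a real system would use a trained model.
_PATTERNS = [
    ("locate_line",
     re.compile(rf"^open the position of (?:the line )?{_Q}(.+?){_Q}$", re.I)),
    ("find_scene",
     re.compile(rf"^i want to see (?:the )?{_Q}(.+?){_Q} scene$", re.I)),
]


def parse_instruction(text):
    """Extract an intent and a keyword from a recognized instruction."""
    text = text.strip()
    for intent, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return Instruction(intent, match.group(1))
    # Fall back to treating the whole utterance as a title or keyword query.
    return Instruction("search_title", text)
```

The extracted keyword would then be looked up in the label library, with `locate_line` and `find_scene` results positioned at the matching timestamp and `search_title` results returned as a list of programs.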
Claims (2)
1. A method for intelligently identifying and searching programs based on audio, characterized by comprising the following steps:
1) collecting in depth the video data information, including images, audio and text, according to the data features of several different dimensions in a program video file;
2) performing structured preprocessing on the type and format of the basic data: cleaning, screening, converting, sorting, etc.;
3) intelligently labelling the video file according to the result of multi-modal feature fusion and understanding, and outputting multi-dimensional video label information (audio fingerprints, video segment keywords and subtitles with their corresponding timestamps); automatically labelling and classifying the corresponding videos according to the text labels to form a media asset label system library;
4) training a model based on speech recognition and semantic understanding technologies, constructing a grammar and specifying the grammar to be used;
5) waking up the device and speaking an instruction word, or providing audio data, to the terminal device; parsing the instruction, extracting keywords and recognizing their intent; performing recognition and retrieval, wherein the recognition result is matched only within the instruction information list based on the label system;
6) feeding back the recognition result, which either contains all programs matching the instruction information or is positioned directly at the time position of the corresponding line or picture, for the user to select or watch directly.
2. The method for intelligently identifying and searching programs based on audio according to claim 1, wherein the structured preprocessing of the type and format of the basic data in step 2) comprises the following specific steps:
2-1) for video images, implementing the VLAD algorithm with a CNN network structure, wherein NetVLAD converts video sequence features into the features of several video shots through cluster centers; an LSTM models the temporal relationship of the image feature sequence and outputs a sequence prediction; Activity Recognition outputs the label corresponding to the video; Image Description outputs a description of the image; Video Description outputs a description of the video; the video shots are then weighted and summed with learnable weights to obtain a global feature vector;
2-2) for audio information, separating the audio signal from the video, extracting an audio feature sequence with VGGish, extracting the audio features corresponding to different shots with NetVLAD, and outputting audio classification, speech-to-text results, song audio fingerprints and the like; performing punctuation prediction and intelligent sentence segmentation, extracting speech information from multilingual and dialect material, and then fusing through learnable weights to generate a global feature vector of the audio modality;
2-3) for text information, using a video text extraction algorithm that takes OCR (optical character recognition) as the base model and performs preprocessing and post-processing with a natural language processing (NLP) model, intelligently identifying the core text content in the video, filtering out other interfering text, and outputting three types of text content (titles, characters and subtitles) together with their text categories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210155600.XA CN114547373A (en) | 2022-02-21 | 2022-02-21 | Method for intelligently identifying and searching programs based on audio |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210155600.XA CN114547373A (en) | 2022-02-21 | 2022-02-21 | Method for intelligently identifying and searching programs based on audio |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114547373A (en) | 2022-05-27 |
Family
ID=81675111
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210155600.XA (Pending, CN114547373A) | Method for intelligently identifying and searching programs based on audio | 2022-02-21 | 2022-02-21 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114547373A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115309941A (en) * | 2022-08-19 | 2022-11-08 | 联通沃音乐文化有限公司 | AI-based intelligent tag retrieval method and system |
CN116402062A (en) * | 2023-06-08 | 2023-07-07 | 之江实验室 | Text generation method and device based on multi-mode perception data |
CN118585671A (en) * | 2024-08-02 | 2024-09-03 | 北京小米移动软件有限公司 | Video retrieval method, device, electronic equipment and storage medium |
CN118626672A (en) * | 2024-08-12 | 2024-09-10 | 山东浪潮科学研究院有限公司 | Video retrieval method and system based on multi-mode information fusion |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115309941A (en) * | 2022-08-19 | 2022-11-08 | 联通沃音乐文化有限公司 | AI-based intelligent tag retrieval method and system |
CN115309941B (en) * | 2022-08-19 | 2023-03-10 | 联通沃音乐文化有限公司 | AI-based intelligent tag retrieval method and system |
CN116402062A (en) * | 2023-06-08 | 2023-07-07 | 之江实验室 | Text generation method and device based on multi-mode perception data |
CN116402062B (en) * | 2023-06-08 | 2023-09-15 | 之江实验室 | Text generation method and device based on multi-mode perception data |
CN118585671A (en) * | 2024-08-02 | 2024-09-03 | 北京小米移动软件有限公司 | Video retrieval method, device, electronic equipment and storage medium |
CN118626672A (en) * | 2024-08-12 | 2024-09-10 | 山东浪潮科学研究院有限公司 | Video retrieval method and system based on multi-mode information fusion |
CN118626672B (en) * | 2024-08-12 | 2024-11-05 | 山东浪潮科学研究院有限公司 | Video retrieval method and system based on multi-mode information fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||