
CN114996506B - Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium - Google Patents

Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium

Info

Publication number
CN114996506B
CN114996506B (application CN202210572357.1A)
Authority
CN
China
Prior art keywords
video
text
content
audio
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210572357.1A
Other languages
Chinese (zh)
Other versions
CN114996506A (en)
Inventor
王书培
刘攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210572357.1A priority Critical patent/CN114996506B/en
Publication of CN114996506A publication Critical patent/CN114996506A/en
Application granted granted Critical
Publication of CN114996506B publication Critical patent/CN114996506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 - Querying
    • G06F 16/735 - Filtering based on additional data, e.g. user or group profiles
    • G06F 16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/7834 - Retrieval characterised by using metadata automatically derived from the content using audio features
    • G06F 16/7844 - Retrieval characterised by using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the invention discloses a corpus generation method, a corpus generation device, electronic equipment and a computer readable storage medium. After at least one candidate video is obtained and text recognition is performed on the video frames of the candidate video to obtain its subtitle content, audio content is extracted from the candidate video and converted into text content. The similarity between the subtitle content and the text content is then calculated to obtain the text similarity of the candidate video, at least one target video of a target language is screened from the candidate videos according to the text similarity, and a corpus corresponding to the target language is generated based on the audio content and the subtitle content of the target video. The scheme can greatly improve the accuracy of corpus generation for speech recognition.

Description

Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a corpus generating method, apparatus, and computer readable storage medium.
Background
In recent years, with the rapid development of internet technology, corpora have become increasingly important in the field of speech recognition, and the accuracy of a corpus often determines the accuracy of recognition. Accurate corpus generation is therefore required. Existing corpus generation methods usually rely on manual annotation as an auxiliary step after speech recognition.
In the course of research and practice on the prior art, the inventors of the present invention found that the manual approach often requires a great deal of human effort and is prone to errors. In addition, for special languages spoken only within a small region, the accuracy of speech recognition is often low, so the accuracy of corpus generation is correspondingly low.
Disclosure of Invention
The embodiment of the invention provides a corpus generation method, a corpus generation device, electronic equipment and a computer readable storage medium, which can improve the accuracy of corpus generation.
A corpus generation method, comprising:
acquiring at least one candidate video, and carrying out text recognition on video frames of the candidate video to obtain caption content of the candidate video;
extracting audio content from the candidate video, and converting the audio content into text content;
calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video;
screening at least one target video of a target language from the candidate videos according to the text similarity;
and generating corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
Correspondingly, an embodiment of the present invention provides a corpus generating device, including:
the acquisition unit is used for acquiring at least one candidate video, and carrying out text recognition on video frames of the candidate video to obtain subtitle content of the candidate video;
the conversion unit is used for extracting audio content from the candidate videos and converting the audio content into text content;
the calculating unit is used for calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video;
the screening unit is used for screening at least one target video of a target language from the candidate videos according to the text similarity;
and the generation unit is used for generating corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
Optionally, in some embodiments, the computing unit may be specifically configured to identify a caption string in the caption content and identify a text string in the text content; calculating the conversion operation times between the caption character string and the text character string to obtain the class editing distance between the caption character string and the text character string; and determining the text similarity of the candidate video based on the caption character string, the text character string and the class editing distance.
Optionally, in some embodiments, the calculating unit may be specifically configured to fuse the subtitle string with a text string to obtain a string distance; calculating a distance difference between the class editing distance and the character string distance; and calculating the ratio between the distance difference value and the character string distance to obtain the text similarity of the candidate video.
Optionally, in some embodiments, the obtaining unit may specifically be configured to frame the candidate video, and screen out a key video frame from the video frames after the frame division; positioning a target position area in the key video frame to obtain a subtitle area of the candidate video; and identifying the text corresponding to the caption area in the video frame to obtain the caption content of the candidate video.
Optionally, in some embodiments, the obtaining unit may specifically be configured to perform text recognition on the video frame after framing to obtain a video frame text of the video frame; classifying the video frames based on the video frame texts to obtain a video frame set corresponding to each video frame text; and sequencing the video frames in the video frame set according to the playing time corresponding to the video frames, and screening the key video frames from the video frame set based on the sequencing result.
Optionally, in some embodiments, the obtaining unit may be specifically configured to screen at least one key video frame text of the key video frames from the video frame texts, and identify text location information of each key video frame text in the key video frames; screening target position information from the text position information based on the key video frame text; and positioning a position area corresponding to the target position information in the key video frame to obtain a subtitle area of the candidate video.
Optionally, in some embodiments, the acquiring unit may be specifically configured to acquire the basic video set of the target language according to a preset keyword; identifying a video type for each video and a confidence level for the video type in the base video set; and screening at least one candidate video from the basic video set based on the video type and the confidence.
Optionally, in some embodiments, the obtaining unit may be specifically configured to perform audio detection on an audio frame of each video in the base video set to obtain an audio type of the audio frame; performing silence detection on the video, and performing audio cutting on the video based on a detection result to obtain at least one audio fragment; and extracting the characteristics of the audio fragment, and determining the video type of the video and the confidence of the video type based on the extracted audio characteristics and the audio type.
Optionally, in some embodiments, the obtaining unit may be specifically configured to determine, according to the audio type and the audio feature, a voice type of the audio segment and classification information of the voice type; acquiring the audio time length of the audio fragment, and determining the classification weight of the voice type based on the audio time length; and according to the classification weight and the classification information, fusing the voice types corresponding to the audio fragments of the video to obtain the video types of the video and the confidence of the video types.
Optionally, in some embodiments, the generating unit may be specifically configured to screen the target subtitle content of the target video from the subtitle content; extracting a time axis corresponding to the target subtitle content from the target video; and taking the audio content, the target subtitle content and the time axis of the target video as initial corpus, and sending the initial corpus to a verification server for verification to obtain the corpus of the target language.
In addition, the embodiment of the invention also provides electronic equipment, which comprises a processor and a memory, wherein the memory stores application programs, and the processor is used for running the application programs in the memory to realize the corpus generation method provided by the embodiment of the invention.
In addition, the embodiment of the invention also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any corpus generating method provided by the embodiment of the invention.
According to the embodiment of the invention, after at least one candidate video is obtained and text recognition is performed on the video frames of the candidate video to obtain its subtitle content, audio content is extracted from the candidate video and converted into text content. The similarity between the subtitle content and the text content is then calculated to obtain the text similarity of the candidate video, at least one target video of a target language is screened from the candidate videos according to the text similarity, and a corpus corresponding to the target language is generated based on the audio content and the subtitle content of the target video. In this scheme, the subtitle content can be recognized in the candidate videos and the audio content of the candidate videos can be converted into text content; the target videos of the target language can then be accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target videos can be used as a reference for manual annotation, so the accuracy of corpus generation can be greatly improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a scenario of a corpus generating method provided by an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a corpus generating method according to an embodiment of the present invention;
FIG. 3 is a search schematic of a dialect video provided by an embodiment of the present invention;
FIG. 4 is a schematic illustration of the voice type of an audio clip provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of screening key video frames according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a dialect video recognition flow provided in an embodiment of the present invention;
FIG. 7 is a schematic diagram of dialect corpus recognition provided by an embodiment of the present invention;
FIG. 8 is a schematic overall flow chart of corpus generation according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart of dialect corpus generation provided by an embodiment of the present invention;
FIG. 10 is another schematic flow chart of corpus generation according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a corpus generating device according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a corpus generation method, a corpus generation device, electronic equipment and a computer readable storage medium. The corpus generating device can be integrated in an electronic device, and the electronic device can be a server or a terminal and other devices.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data and artificial intelligence platforms. Terminals include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart appliances, vehicle-mounted terminals, aircraft, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
For example, referring to fig. 1, taking the example that the corpus generating device is integrated in an electronic device, the electronic device obtains at least one candidate video, performs text recognition on video frames of the candidate video, extracts audio content from the candidate video after obtaining subtitle content of the candidate video, converts the audio content into text content, calculates similarity between the subtitle content and the text content to obtain text similarity of the candidate video, screens at least one target video of a target language from the candidate video according to the text similarity, generates corpus corresponding to the target language based on the audio content and the subtitle content of the target video, and further improves accuracy of corpus generation.
The corpus can be marked audio content and mainly comprises audio files and marked texts corresponding to the audio files, wherein the marked texts are in one-to-one correspondence with the audio content in the audio files in a time axis and other forms. Corpus is a basic unit constituting a corpus. By corpus is meant a large-scale electronic text library that has been scientifically sampled and processed, in which language materials are stored that have actually appeared in the actual use of the language. The corpus can be used for training an acoustic model or an audio recognition model and the like, and can also be used for scenes such as question-answer searching and the like.
The corpus generation method provided by the embodiment of the application relates to the speech technology and natural language processing (NLP) directions in the field of artificial intelligence. The embodiment of the application can perform text recognition on the video frames of the candidate videos, extract the audio content from the candidate videos, convert the audio content into text content, and so on.
Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and speech is becoming one of the most promising modes of human-computer interaction.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph technologies, and the like.
It should be noted that the specific embodiments of the present application involve data related to objects, such as candidate videos. When the following embodiments of the present application are applied to specific products or technologies, permission or consent must be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The following will describe in detail. The following description of the embodiments is not intended to limit the preferred embodiments.
This embodiment will be described from the perspective of a corpus generating device, which may be integrated in an electronic device; the electronic device may be a server, a terminal, or another device. The terminal may include a tablet computer, a notebook computer, a personal computer (PC), a wearable device, a virtual reality device, or other devices that can generate a corpus.
A corpus generation method, comprising:
Obtaining at least one candidate video, carrying out text recognition on video frames of the candidate video to obtain subtitle content of the candidate video, extracting audio content from the candidate video, converting the audio content into text content, calculating similarity between the subtitle content and the text content to obtain text similarity of the candidate video, screening at least one target video of a target language from the candidate video according to the text similarity, and generating corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
As shown in fig. 2, the specific flow of the corpus generation method is as follows:
101. and acquiring at least one candidate video, and carrying out text recognition on video frames of the candidate video to obtain subtitle content of the candidate video.
The subtitle content is the content information of the subtitles in the video frames. Subtitles may be non-video content displayed in text form, such as dialogue in a television programme, a movie, or a stage work, or text added to a film or television work in post-production. The explanatory text and the various characters appearing below the screen of a film or television programme, such as film titles, credits, lyrics, dialogue, commentary, character introductions, place names and dates, are all called subtitles. The subtitles of a film or television work generally appear below the screen, while the subtitles of a stage work may be displayed on either side of, or above, the stage.
The method for obtaining at least one candidate video may be various, and specifically may be as follows:
for example, a basic video set of the target language may be obtained according to a preset keyword, a video type and a confidence level of the video type of each video are identified in the basic video set, and at least one candidate video is selected from the basic video set based on the video type and the confidence level.
The method for obtaining the basic video set of the target language may be various according to the preset keywords, for example, the preset keywords may be obtained, the target keywords of the target language are selected from the preset keywords, and the original video is obtained on the network or the video platform based on the target keywords, so as to obtain the basic video set.
The target keyword may be determined by the target language. Taking a dialect as the target language as an example, the target keyword may be the Sichuan, Chongqing, Northeast, or Shanghai dialect. Based on the target keywords, videos that are likely to be in the dialect can be searched out, so as to obtain the basic video set. Taking Sichuan dialect as the target keyword as an example, original videos containing Sichuan dialect can be searched for on the video platform; the search process may be as shown in fig. 3.
After the basic video set is acquired, the video type of each video and the confidence of the video type can be identified in the basic video set. The video type can be understood as a scene tag of the audio data in the video and is mainly used for judging the audio scene in which the audio data of the video is located; there can be various audio scenes, for example speech, songs, crowds, and the like. There may be various ways of identifying the video type of each video in the basic video set: for example, audio detection is performed on the audio frames of each video in the basic video set to obtain the audio type of each audio frame, silence detection is performed on the video and the video is audio-cut based on the detection result to obtain at least one audio segment, feature extraction is performed on the audio segments, and the video type and the confidence of the video type are determined based on the extracted audio features and the audio types.
The audio type is used to indicate whether an audio frame is speech; the audio type may include a speech tag and a non-speech tag. There are various ways to perform audio detection on the audio frames: for example, the audio information may be extracted from the video and framed to obtain at least one audio frame, and voice activity detection (VAD) may then be applied to each audio frame to obtain its audio type.
The audio cutting method for the video may be various, for example, a silence interval in which a silence audio frame exists may be identified in audio information of the video based on a detection result, and audio corresponding to the silence interval may be deleted in the audio information of the video, so as to obtain at least one audio clip.
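For example, this frame-level labelling and silence-based cutting may be sketched as follows. The sketch assumes the audio has already been demuxed to 16 kHz mono 16-bit PCM and uses the open-source webrtcvad package as the VAD; the frame size, silence threshold and helper names are illustrative assumptions rather than the patented implementation.

```python
# Minimal sketch: label fixed-size PCM frames as speech/non-speech with VAD,
# then drop long silent stretches to obtain speech segments.
import webrtcvad

def label_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Split raw 16-bit mono PCM into frames and tag each as speech/non-speech."""
    vad = webrtcvad.Vad(2)                                   # aggressiveness 0-3
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2     # 2 bytes per sample
    frames = []
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        chunk = pcm[start:start + frame_bytes]
        frames.append((start / 2 / sample_rate, vad.is_speech(chunk, sample_rate)))
    return frames                                            # list of (time_sec, is_speech)

def cut_on_silence(frames, frame_ms=30, min_silence_ms=300):
    """Return (start, end) times of speech segments, dropping silent intervals."""
    segments, seg_start, silence = [], None, 0
    for t, is_speech in frames:
        if is_speech:
            if seg_start is None:
                seg_start = t
            silence = 0
        elif seg_start is not None:
            silence += frame_ms
            if silence >= min_silence_ms:
                # close the segment at the end of the last speech frame
                segments.append((seg_start, t + frame_ms / 1000 - silence / 1000))
                seg_start, silence = None, 0
    if seg_start is not None:
        segments.append((seg_start, frames[-1][0] + frame_ms / 1000))
    return segments
```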
After the video has been audio-cut, feature extraction can be performed on the audio segments in various ways. For example, an x-vector embedding model (an audio feature extraction model) may be used as the main system: feature extraction is performed on each audio segment, and the audio features (embeddings) representing the audio content information are obtained through a TDNN network and a statistics pooling layer.
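The x-vector style extractor mentioned above may be sketched roughly as follows in PyTorch: a few TDNN (dilated 1-D convolution) layers followed by a statistics pooling layer that concatenates the mean and standard deviation over time into one fixed-length embedding per audio segment. The layer sizes and class names are illustrative assumptions, not the actual model used.

```python
# Simplified x-vector-style extractor: TDNN stack + statistics pooling.
import torch
import torch.nn as nn

class StatisticsPooling(nn.Module):
    def forward(self, x):                       # x: (batch, channels, time)
        mean = x.mean(dim=2)
        std = x.std(dim=2)
        return torch.cat([mean, std], dim=1)    # (batch, 2 * channels)

class TinyXVector(nn.Module):
    def __init__(self, n_mels=40, emb_dim=256):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.pool = StatisticsPooling()
        self.embedding = nn.Linear(2 * 512, emb_dim)

    def forward(self, feats):                   # feats: (batch, n_mels, time)
        return self.embedding(self.pool(self.tdnn(feats)))

# embedding = TinyXVector()(log_mel_features)   # one vector per audio segment
```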
After the audio features are extracted, the video type and the confidence of the video type can be determined based on the extracted audio features and the audio types, and this can be done in various ways. For example, the voice type of each audio segment and the classification information of the voice type are determined according to the audio type and the audio features, the audio duration of the audio segment is obtained, the classification weight of the voice type is determined based on the audio duration, and the voice types corresponding to the audio segments of the video are fused according to the classification weights and the classification information, so as to obtain the video type of the video and the confidence of the video type.
The voice type is used to indicate the sub-scene information of the audio segment within a speech or non-speech scene. For example, when the scene is speech, the voice type may include Chinese or other languages; when the scene is a song, the voice type may include the type of song, for example singing or pure music, as shown in fig. 4. There are various ways of determining the voice type of an audio segment and the classification information of the voice type according to the audio type and the audio features: for example, the audio types of the audio frames in the audio segment are fused to obtain the basic voice type of the audio segment, and a back-end classifier is used to classify the audio features under the basic voice type, so as to obtain the voice type of each audio segment and the classification score of the voice type, and the classification score is used as the classification information.
After the classification information and the classification weight of the voice type are determined, the voice types corresponding to the audio clips of the video can be fused, for example, the classification information can be weighted based on the classification weight to obtain weighted classification information, the target voice type is screened out from the voice types according to the weighted classification information, the target voice type is used as the video type, and the confidence corresponding to the target voice type is used as the confidence of the video type.
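A minimal sketch of this duration-weighted fusion, assuming each audio segment carries a duration and a score per voice type (the field names are illustrative), may look like the following:

```python
# Weight each segment's voice-type scores by its duration, sum them, and take
# the top-scoring type as the video type with a normalised confidence.
from collections import defaultdict

def fuse_segments(segments):
    """segments: list of dicts {'duration': seconds, 'scores': {voice_type: score}}."""
    total = sum(seg["duration"] for seg in segments) or 1.0
    fused = defaultdict(float)
    for seg in segments:
        weight = seg["duration"] / total             # classification weight
        for voice_type, score in seg["scores"].items():
            fused[voice_type] += weight * score      # weighted classification info
    video_type = max(fused, key=fused.get)
    confidence = fused[video_type] / (sum(fused.values()) or 1.0)
    return video_type, confidence

# fuse_segments([{"duration": 12.0, "scores": {"Chinese speech": 0.9, "song": 0.1}},
#                {"duration": 3.0,  "scores": {"song": 0.8, "Chinese speech": 0.2}}])
```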
After determining the video type and the confidence coefficient of the video type, at least one candidate video can be screened out from the basic video set based on the video type and the confidence coefficient, and the candidate video can be screened out in various modes, for example, the video with the video type being the target video type can be screened out from the basic video set to obtain a candidate video set, and the video with the confidence coefficient exceeding a preset confidence coefficient threshold value can be screened out from the candidate video set to obtain at least one candidate video.
After at least one candidate video is obtained, text recognition can be performed on the video frames of the candidate video to obtain the subtitle content of the candidate video, and there are various ways of doing so. For example, the candidate video may be divided into frames, key video frames may be screened out of the divided video frames, a target position area may be located in the key video frames to obtain the subtitle area of the candidate video, and the text corresponding to the subtitle area may be recognized in the video frames to obtain the subtitle content of the candidate video.
The method for screening the key video frames from the video frames after framing may be various, for example, text recognition is performed on the video frames after framing to obtain video frame texts of the video frames, the video frames are classified based on the video frame texts to obtain video frame sets corresponding to each video frame text, the video frames in the video frame sets are ordered according to playing time corresponding to the video frames, and the key video frames are screened from the video frame sets based on the ordering result.
There may be various ways of screening the key video frames from the video frame set based on the sorting result. For example, the video frame with the earliest playing time may be screened out of each video frame set based on the sorting result to obtain the key video frames. A key video frame can therefore be understood as a video frame whose video frame text has changed relative to the previous frame, as shown in fig. 5.
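A minimal sketch of this key-frame screening, assuming each frame record carries its play time and OCR text (the record layout is an assumption), may look like the following:

```python
# Group frames by OCR text, sort each group by play time, and keep the earliest
# frame of each group, i.e. the frame where the on-screen text first changes.
from collections import defaultdict

def select_key_frames(frames):
    """frames: list of dicts {'time': seconds, 'text': ocr_text, 'frame': image}."""
    groups = defaultdict(list)
    for f in frames:
        groups[f["text"]].append(f)                  # one set per video frame text
    key_frames = []
    for text, group in groups.items():
        group.sort(key=lambda f: f["time"])          # order by play time
        key_frames.append(group[0])                  # earliest occurrence
    return sorted(key_frames, key=lambda f: f["time"])
```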
After screening the key video frames, a target position area can be located in the key video frames to obtain a subtitle area of the candidate video, wherein the subtitle area can be understood as a position area of the subtitle in the video frames, and the subtitle area can be located in various manners, for example, at least one key video frame text of the key video frames can be screened out from the video frame text, text position information of each key video frame text can be identified in the key video frames, the target position information is screened out from the text position information based on the key video frame text, and a position area corresponding to the target position information is located in the key video frames to obtain the subtitle area of the candidate video.
There may be various ways of screening the target position information from the text position information based on the key video frame text. For example, text position information whose key video frame text changes is screened from the text position information to obtain candidate position information, and position information whose ordinate remains unchanged is then screened from the candidate position information to obtain the target position information. In a key video frame there may be, besides the subtitle, other information such as a station logo or an advertisement; the ordinate of the subtitle does not change, and the abscissa of the other content does not change, so the position information of the subtitle can be screened out.
After the target position information is screened out, the position area corresponding to the target position information can be positioned in the key video frame, and various positioning modes can be adopted, for example, an initial position area corresponding to each target position information can be positioned in the key video frame, the initial position areas are fused to obtain the subtitle area of the candidate video, or an initial position area corresponding to each target position information can be positioned in the key video frame, and the position area with the largest abscissa or the longest length can be screened out from the initial position areas to serve as the subtitle area.
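A minimal sketch of the subtitle-area location described above, assuming OCR returns (text, (x, y, w, h)) boxes per key frame and using an illustrative ordinate tolerance, may look like the following:

```python
# Keep boxes whose text changes across key frames while their vertical position
# stays (almost) constant, then fuse the surviving boxes into one subtitle area.
def locate_subtitle_area(key_frame_boxes, y_tolerance=5):
    """key_frame_boxes: list per key frame of (text, (x, y, w, h)) OCR results."""
    bands = {}                                       # ordinate band -> boxes
    for boxes in key_frame_boxes:
        for text, (x, y, w, h) in boxes:
            bands.setdefault(round(y / y_tolerance), []).append((text, (x, y, w, h)))
    best = None
    for entries in bands.values():
        texts = {text for text, _ in entries}
        if len(texts) > 1:                           # text changes, ordinate constant
            if best is None or len(entries) > len(best):
                best = entries
    if best is None:
        return None
    xs  = [x for _, (x, y, w, h) in best]
    ys  = [y for _, (x, y, w, h) in best]
    x2s = [x + w for _, (x, y, w, h) in best]
    y2s = [y + h for _, (x, y, w, h) in best]
    return (min(xs), min(ys), max(x2s) - min(xs), max(y2s) - min(ys))   # fused area
```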
102. Extracting audio content from the candidate video and converting the audio content into text content.
The method for extracting the audio content from the candidate video may be various, and specifically may be as follows:
For example, the audio data may be separated directly from the candidate video to obtain the audio content. Alternatively, the audio data may be extracted from the candidate video to obtain initial audio content, silence detection may be performed on the initial audio content, and the silent content may be filtered out of the initial audio content based on the detection result, so as to obtain the audio content of the candidate video.
After the audio content is extracted, the audio content may be converted to text content in a variety of ways, for example, speech recognition (Automatic Speech Recognition, ASR) services may be used to convert audio content of video to text content, or other speech conversion techniques may be used to convert audio content to text content.
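For example, the extraction and conversion may be sketched as follows; the ffmpeg command demuxes the audio track, while asr_client and its transcribe method are hypothetical placeholders for whichever ASR service is actually used:

```python
# Demux the audio track with ffmpeg, then hand it to a speech-recognition service.
import subprocess

def extract_audio(video_path: str, wav_path: str = "audio.wav") -> str:
    # -vn drops the video stream; mono 16 kHz is a common ASR input format.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    return wav_path

def audio_to_text(wav_path: str, asr_client) -> str:
    # asr_client.transcribe(...) stands in for the real ASR API call.
    return asr_client.transcribe(wav_path)
```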
103. And calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video.
Wherein the text similarity is used for indicating similarity information of texts between the subtitle content and the text content.
The manner of calculating the similarity between the subtitle content and the text content may be various, and specifically may be as follows:
For example, a caption string may be identified in the caption content, a text string may be identified in the text content, the number of conversion operations between the caption string and the text string may be calculated, a class edit distance between the caption string and the text string may be obtained, and a text similarity of the candidate video may be determined based on the caption string, the text string, and the class edit distance.
The number of conversion operations between the subtitle string and the text string may be calculated in various ways. For example, the subtitle string may be converted into the text string (or the text string into the subtitle string) by insertion, deletion, replacement, and similar operations, where each insertion or deletion adds 1 to the operation count and each replacement adds 2. The number of conversion operations can thus be calculated, and the minimum number of operations is selected as the class editing distance between the subtitle string and the text string.
After calculating the class editing distance, the text similarity of the candidate video can be determined based on the subtitle character string, the text character string and the class editing distance, and various manners of determining the text similarity can be adopted, for example, the subtitle character string and the text character string can be fused to obtain a character string distance, a distance difference between the class editing distance and the character string distance is calculated, and a ratio between the distance difference and the character string distance is calculated to obtain the text similarity of the candidate video, which can be specifically shown as a formula (1):
r=(sum-ldist)/sum (1)
Where r is the text similarity, which may also be referred to as the Levenshtein ratio, sum is the string distance, and ldist is the class editing distance.
The string distance may be understood as the combined length of the subtitle string and the text string; for example, with str1 = 'abc' and str2 = 'cde', sum = 3 + 3 = 6. Code for calculating the text similarity may, for example, take the following form (a minimal Python sketch of formula (1), with insertions and deletions counted as 1 and replacements as 2, as described above):
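```python
# Sketch of formula (1): ldist is the class editing distance (insert/delete 1,
# replace 2), sum is the combined length, and r = (sum - ldist) / sum.
def class_edit_distance(caption: str, text: str) -> int:
    m, n = len(caption), len(text)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if caption[i - 1] == text[j - 1] else 2   # replacement counts twice
            dist[i][j] = min(dist[i - 1][j] + 1,               # deletion
                             dist[i][j - 1] + 1,               # insertion
                             dist[i - 1][j - 1] + cost)        # replacement / match
    return dist[m][n]

def text_similarity(caption: str, text: str) -> float:
    total = len(caption) + len(text)             # sum, e.g. 'abc' + 'cde' -> 6
    if total == 0:
        return 1.0
    ldist = class_edit_distance(caption, text)
    return (total - ldist) / total               # formula (1)

# text_similarity('abc', 'cde')  # -> (6 - 4) / 6 = 0.33...
```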
104. And screening at least one target video of the target language from the candidate videos according to the text similarity.
For example, a preset text similarity threshold set may be obtained, and a target text similarity threshold corresponding to the target language may be selected from the preset text similarity threshold set. And comparing the target text similarity threshold with the text similarity of the candidate videos, and screening the videos of which the text similarity does not exceed the target text similarity threshold from the candidate videos based on the comparison result, so as to obtain target videos corresponding to the target language.
The text similarity threshold may be set according to the practical application. Taking a dialect as the target language as an example, the text similarity threshold may be 50%: videos whose text similarity does not exceed 50% may be screened out of the candidate videos as dialect videos, while videos whose text similarity exceeds the target text similarity threshold may be regarded as Mandarin videos. The recognition of dialect videos may therefore proceed as shown in fig. 6: the text content of the video data is recognized by an ASR technique, the subtitle content is recognized by an OCR technique, the text similarity between the text content and the subtitle content is calculated, and the text similarity is then compared with the text similarity threshold; videos below the threshold are retained as dialect videos, and videos above the threshold are stripped out as Mandarin videos, so as to obtain the target videos.
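A minimal sketch of this threshold-based screening (the record fields and the 50% default are illustrative) may look like the following:

```python
# Keep candidates at or below the target-language threshold, strip the rest.
def screen_target_videos(candidates, threshold=0.5):
    """candidates: list of dicts {'video': ..., 'similarity': float}."""
    target, stripped = [], []
    for c in candidates:
        if c["similarity"] <= threshold:
            target.append(c)         # low similarity: likely dialect video
        else:
            stripped.append(c)       # high similarity: likely Mandarin video
    return target, stripped
```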
105. And generating corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, the target caption content of the target video may be selected from the caption content, the time axis corresponding to the target caption content is extracted from the target video, the audio content of the target video, the target caption content and the time axis are used as initial corpus information, and the initial corpus is sent to the verification server for verification, so as to obtain the corpus of the target language.
There may be various ways of sending the initial corpus to the verification server for verification. For example, the initial corpus is sent to the verification server so that errors can be manually proofread and corrected and part of the time axes adjusted, after which the ASR corpus of the target language is obtained.
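A minimal sketch of assembling the initial corpus from a target video, assuming the subtitle entries already carry their start and end times on the time axis (the JSON layout and field names are assumptions), may look like the following:

```python
# Pair each subtitle entry with its time span and the audio file, producing the
# (audio, timestamp, text) records that are then sent for manual verification.
import json

def build_initial_corpus(audio_path, subtitles, out_path="initial_corpus.json"):
    """subtitles: list of dicts {'start': sec, 'end': sec, 'text': caption}."""
    records = [
        {
            "audio": audio_path,
            "start": sub["start"],           # time axis of the caption
            "end": sub["end"],
            "annotation": sub["text"],       # subtitle text as reference annotation
        }
        for sub in subtitles
    ]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return out_path
```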
Meanwhile, this scheme addresses the difficulty of annotating the collected dialect training data (a large number of ordinary annotators can understand the dialects of only one or two regions): the video is presented visually in a multi-modal annotation mode, and manual annotation is assisted by the OCR recognition result, which effectively alleviates the difficulty of manual annotation, as shown in fig. 7.
Optionally, after generating the corpus of the target language, the language recognition model may be trained based on the corpus to obtain a trained language recognition model, and the speech to be recognized is recognized based on the trained language recognition model to obtain text content corresponding to the speech to be recognized.
In the whole process of corpus generation, ASR and OCR technologies are respectively adopted for screening out target videos and assisting in manual labeling, so that ASR corpus is obtained, and a specific flow can be shown in FIG. 8.
Taking a dialect corpus as an example, the whole process of generating the dialect corpus may be as shown in fig. 9. A screened video is input, and audio type detection is performed on the video to obtain audio with a scene tag and a score, where the scene tag may include speech, song, crowd, interference sound, and the like. Videos whose score exceeds 80 and whose scene is speech are screened out, so as to obtain the candidate videos. The subtitle information of the candidate videos is obtained through a subtitle extraction service, the subtitle content of the candidate videos is recognized in the video frames through OCR, the audio content is converted into text content through an audio ASR service, and the text similarity between the subtitle content and the text content is calculated. When the text similarity exceeds 50%, the candidate video is judged to be a Mandarin video and stripped out; when it does not exceed 50%, the video is judged to be a dialect video and retained. The target subtitle content of the dialect video is then extracted, the audio, the timestamps (time axis) and the corresponding subtitle content are used as the initial corpus, and the initial corpus is manually checked and modified to obtain the ASR corpus.
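For reference, a highly simplified end-to-end sketch of the fig. 9 pipeline is given below; every helper it calls stands in for the corresponding step described above, and the names, thresholds and record fields are illustrative assumptions rather than the actual implementation:

```python
# Orchestrate the fig. 9 flow with caller-supplied step functions.
def generate_dialect_corpus(videos, detect_audio_type, extract_subtitles,
                            audio_to_text, text_similarity, build_initial_corpus):
    corpus_files = []
    for video in videos:
        scene, score = detect_audio_type(video)          # scene tag + score
        if scene != "speech" or score <= 80:
            continue                                     # keep speech videos scoring above 80
        subtitles = extract_subtitles(video)             # OCR subtitle service
        asr_text = audio_to_text(video)                  # audio ASR service
        caption_text = " ".join(s["text"] for s in subtitles)
        if text_similarity(caption_text, asr_text) > 0.5:
            continue                                     # > 50%: Mandarin video, stripped
        corpus_files.append(build_initial_corpus(video, subtitles))
    return corpus_files                                  # passed on for manual verification
```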
As can be seen from the above, in the embodiment of the present application, after at least one candidate video is obtained and text recognition is performed on the video frames of the candidate video to obtain the subtitle content of the candidate video, audio content is extracted from the candidate video and converted into text content. The similarity between the subtitle content and the text content is then calculated to obtain the text similarity of the candidate video, at least one target video of a target language is screened from the candidate videos according to the text similarity, and a corpus corresponding to the target language is generated based on the audio content and the subtitle content of the target video. In this scheme, the subtitle content can be recognized in the candidate videos and the audio content of the candidate videos can be converted into text content; the target videos of the target language can then be accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target videos can be used as a reference for manual annotation, so the accuracy of corpus generation can be greatly improved.
According to the method described in the above embodiments, examples are described in further detail below.
In this embodiment, the corpus generating device is specifically integrated in an electronic device, the electronic device is a server, and the target language is a dialect.
As shown in fig. 10, a corpus generating method specifically includes the following steps:
201. The server obtains at least one candidate dialect video.
For example, the server acquires preset keywords, screens out target keywords of the dialect from the preset keywords, and acquires original videos on a network or a video platform based on the target keywords, thereby obtaining a basic dialect video set. Audio information is extracted from each video in the basic dialect video set and framed to obtain at least one audio frame, and voice activity detection (VAD) is applied to the audio frames to obtain the audio type of each audio frame. Silence detection is performed on the video, silence intervals containing silent audio frames are identified in the audio information of the video based on the detection result, and the audio corresponding to the silence intervals is deleted from the audio information of the video, so as to obtain at least one audio segment.
The server uses an x-vector embedding model as the main system, performs feature extraction on each audio segment, and obtains the audio features (embeddings) representing the audio content information through a TDNN network and a statistics pooling layer. The audio types of the audio frames in each audio segment are fused to obtain the basic voice type of the audio segment, and a back-end classifier is used to classify the audio features under the basic voice type, so as to obtain the voice type of each audio segment and the classification score of the voice type, and the classification score is used as the classification information.
The server acquires the audio duration of each audio segment, determines the classification weight of the voice type based on the audio duration, weights the classification information based on the classification weight to obtain weighted classification information, screens out the target voice type from the voice types according to the weighted classification information, takes the target voice type as the video type, and takes the confidence corresponding to the target voice type as the confidence of the video type. Videos whose video type is speech are screened from the basic video set to obtain a candidate dialect video set, and videos whose confidence exceeds a preset confidence threshold are screened from the candidate video set to obtain at least one candidate dialect video.
202. And the server carries out text recognition on the video frames of the candidate dialect videos to obtain subtitle contents of the candidate dialect videos.
For example, the server divides the candidate video into frames and performs text recognition on the divided video frames to obtain the video frame text of each video frame. The video frames are classified based on the video frame text to obtain a video frame set corresponding to each video frame text, the video frames in each set are sorted according to their playing time, and the video frame with the earliest playing time is screened out of each set based on the sorting result to obtain the key video frames.
The server screens out at least one key video frame text of the key video frames from the video frame text, and identifies the text position information of each key video frame text in the key video frames. Text position information whose key video frame text changes is screened from the text position information to obtain candidate position information, and position information whose ordinate remains unchanged is screened from the candidate position information to obtain the target position information. In a key video frame there may be, besides the subtitle, other information such as a station logo or an advertisement; the ordinate of the subtitle does not change, and the abscissa of the other content does not change, so the position information of the subtitle can be screened out. The initial position area corresponding to each piece of target position information is located in the key video frames, and the initial position areas are fused to obtain the subtitle area of the candidate video; alternatively, the initial position area corresponding to each piece of target position information is located in the key video frames, and the position area with the largest abscissa or the longest length is screened out of the initial position areas as the subtitle area.
203. The server extracts audio content from the candidate dialect video.
For example, the server may separate the audio data directly from the candidate dialect video to obtain the audio content, or may extract the audio data from the candidate dialect video to obtain initial audio content, perform silence detection on the initial audio content, and filter the silent content out of the initial audio content based on the detection result, so as to obtain the audio content of the candidate dialect video.
204. The server converts the audio content into text content.
For example, the server may employ an ASR service to convert the audio content of the candidate dialect video to text content, or may also employ other speech conversion techniques to convert the audio content to text content.
205. And the server calculates the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video.
For example, the server may identify a subtitle string in the subtitle content and a text string in the text content. The subtitle string is converted into the text string (or the text string into the subtitle string) by insertion, deletion, replacement, and similar operations, where each insertion or deletion adds 1 to the operation count and each replacement adds 2; the number of conversion operations can thus be calculated, and the minimum number of operations is selected as the class editing distance between the subtitle string and the text string. The subtitle string and the text string are fused to obtain the string distance, the distance difference between the class editing distance and the string distance is calculated, and the ratio between the distance difference and the string distance is calculated to obtain the text similarity of the candidate video, as shown in formula (1).
206. And the server screens out at least one target dialect video from the candidate dialect videos according to the text similarity.
For example, the server acquires a preset text similarity threshold set, and screens out a target text similarity threshold (50%) corresponding to the target language from the preset text similarity threshold set. And comparing the target text similarity threshold with the text similarity of the candidate dialect videos, and screening out videos of which the text similarity does not exceed the target text similarity threshold from the candidate dialect videos based on a comparison result, so as to obtain the target dialect videos.
207. The server generates corpus corresponding to the dialect based on the audio content and the subtitle content of the target dialect video.
For example, the server may screen the target subtitle content of the target dialect video from the subtitle content, extract the time axis corresponding to the target subtitle content from the target dialect video, use the audio content of the target dialect video, the target subtitle content and the time axis as the initial corpus, and send the initial corpus to the verification server so that errors can be manually proofread and corrected and part of the time axes adjusted, thereby obtaining the ASR corpus of the dialect.
Optionally, after generating the corpus of the target language, the server may further train the dialect recognition model based on the corpus to obtain a trained dialect recognition model, and recognize the speech to be recognized based on the trained dialect recognition model to obtain text content corresponding to the speech to be recognized.
As can be seen from the above, in this embodiment, after at least one candidate dialect video is obtained and text recognition is performed on its video frames to obtain the subtitle content of the candidate dialect video, audio content is extracted from the candidate dialect video and converted into text content. The similarity between the subtitle content and the text content is then calculated to obtain the text similarity of the candidate dialect video, at least one target dialect video is screened from the candidate dialect videos according to the text similarity, and a corpus corresponding to the dialect is generated based on the audio content and the subtitle content of the target dialect video. In this method, the subtitle content can be recognized in the candidate dialect videos and the audio content of the candidate dialect videos can be converted into text content; the target dialect videos can then be accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target dialect videos can be used as a reference for manual annotation, so the accuracy of dialect corpus generation can be greatly improved.
In order to better implement the method, the embodiment of the invention also provides a corpus generating device, which can be integrated in electronic equipment, such as a server or a terminal, and the terminal can comprise a tablet computer, a notebook computer, a personal computer and the like.
For example, as shown in fig. 11, the corpus generating apparatus may include an acquisition unit 301, a conversion unit 302, a calculation unit 303, a screening unit 304, and a generation unit 305, as follows:
(1) An acquisition unit 301;
the obtaining unit 301 is configured to obtain at least one candidate video, and perform text recognition on video frames of the candidate video to obtain subtitle content of the candidate video.
For example, the obtaining unit 301 may specifically be configured to obtain a basic video set of the target language according to a preset keyword, identify the video type and the confidence level of the video type of each video in the basic video set, and screen out at least one candidate video from the basic video set based on the video type and the confidence level. The candidate video is then framed, key video frames are screened out from the framed video frames, a target position area is located in the key video frames to obtain the caption area of the candidate video, and the text corresponding to the caption area in the video frames is recognized to obtain the caption content of the candidate video.
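The following sketch illustrates one way the subtitle-extraction part of this unit could look; ocr_text() and locate_subtitle_box() stand in for whichever OCR service and caption-area locator are actually used and are hypothetical, as is the fixed sampling interval.

    import cv2  # OpenCV is assumed to be available for frame extraction

    def extract_caption_frames(video_path, sample_every_n=25):
        """Sample frames, keep the first frame of each distinct caption as a key frame,
        and read the text from the located subtitle region."""
        cap = cv2.VideoCapture(video_path)
        seen_texts, captions = set(), []
        frame_index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_index % sample_every_n == 0:
                text = ocr_text(frame)                    # hypothetical OCR call
                if text and text not in seen_texts:       # new caption -> key video frame
                    seen_texts.add(text)
                    region = locate_subtitle_box(frame)   # hypothetical caption-area locator
                    captions.append((frame_index, region, text))
            frame_index += 1
        cap.release()
        return captions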
(2) A conversion unit 302;
The conversion unit 302 is configured to extract audio content from the candidate video, and convert the audio content into text content.
For example, the conversion unit 302 may specifically be configured to separate the audio data from the candidate video to obtain the audio content, or to extract the audio data from the candidate video to obtain initial audio content, perform silence detection on the initial audio content, and filter the silent content out of the initial audio content based on the detection result, thereby obtaining the audio content of the candidate video. An ASR service is then employed to convert the audio content of the video into text content, or other speech-to-text techniques may be used for the conversion.
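A minimal sketch of this unit is given below, assuming ffmpeg is installed and the pydub library is used for the silence detection; both are assumptions rather than what the original implementation prescribes.

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    def extract_speech_audio(video_path, out_path="audio.wav"):
        """Separate the audio track from the video and drop silent spans before ASR."""
        audio = AudioSegment.from_file(video_path)            # ffmpeg demuxes the audio track
        voiced_chunks = split_on_silence(audio,
                                         min_silence_len=500,              # ms, assumed
                                         silence_thresh=audio.dBFS - 16)   # dB, assumed
        speech = sum(voiced_chunks, AudioSegment.empty())     # concatenate the voiced spans
        speech.export(out_path, format="wav")
        return out_path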
(3) A calculation unit 303;
And a calculating unit 303, configured to calculate a similarity between the subtitle content and the text content, so as to obtain a text similarity of the candidate video.
For example, the calculating unit 303 may specifically be configured to identify a caption string in caption content, identify a text string in text content, calculate the number of conversion operations between the caption string and the text string, obtain a class editing distance between the caption string and the text string, fuse the caption string and the text string to obtain a string distance, calculate a distance difference between the class editing distance and the string distance, calculate a ratio between the distance difference and the string distance, and obtain the text similarity of the candidate video.
(4) A screening unit 304;
And the screening unit 304 is configured to screen at least one target video in the target language from the candidate videos according to the text similarity.
For example, the filtering unit 304 may specifically be configured to obtain a preset set of text similarity thresholds and screen out from it the target text similarity threshold corresponding to the target language. The target text similarity threshold is then compared with the text similarity of each candidate video, and the videos whose text similarity does not exceed the target text similarity threshold are screened out from the candidate videos based on the comparison result, so as to obtain the target videos corresponding to the target language.
(5) A generating unit 305;
the generating unit 305 is configured to generate a corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, the generating unit 305 may be specifically configured to screen out target subtitle content of the target video from the subtitle content, extract a time axis corresponding to the target subtitle content from the target video, use the audio content of the target video, the target subtitle content and the time axis as initial corpus information, and send the initial corpus to the verification server for verification, so as to obtain the corpus of the target language.
In specific implementation, each of the above units may be implemented as an independent entity, or any combination of them may be implemented as the same entity or several entities; for the specific implementation of each unit, reference may be made to the foregoing method embodiments, which are not described herein again.
As can be seen from the foregoing, in this embodiment the obtaining unit 301 obtains at least one candidate video and performs text recognition on its video frames to obtain the subtitle content of the candidate video; the conversion unit 302 extracts audio content from the candidate video and converts it into text content; the calculating unit 303 then calculates the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video; the screening unit 304 screens out at least one target video in the target language from the candidate videos according to the text similarity; and the generating unit 305 generates the corpus corresponding to the target language based on the audio content and the subtitle content of the target video. With this scheme, the subtitle content can be recognized in the candidate videos and their audio content converted into text content; the target videos of the target language are then accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target videos can serve as a reference for manual annotation, so the accuracy of corpus generation can be greatly improved.
The embodiment of the invention also provides an electronic device, as shown in fig. 12, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:
The electronic device may include a processor 401 having one or more processing cores, a memory 402 including one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 12 does not constitute a limitation of the electronic device, and the device may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components. Wherein:
The processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
Obtaining at least one candidate video, carrying out text recognition on video frames of the candidate video to obtain subtitle content of the candidate video, extracting audio content from the candidate video, converting the audio content into text content, calculating similarity between the subtitle content and the text content to obtain text similarity of the candidate video, screening at least one target video of a target language from the candidate video according to the text similarity, and generating corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, the electronic device obtains a basic video set of the target language according to the preset keyword, identifies a video type and a confidence level of the video type of each video in the basic video set, and screens at least one candidate video in the basic video set based on the video type and the confidence level. And framing the candidate video, screening out key video frames from the video frames after framing, positioning a target position area in the key video frames to obtain a caption area of the candidate video, and identifying texts corresponding to the caption area in the video frames to obtain caption contents of the candidate video. And separating audio data from the candidate video to obtain audio content, or extracting the audio data from the candidate video to obtain initial audio content, performing silence detection on the initial audio content, and screening the silence content from the initial audio content based on the detection result to obtain the audio content of the candidate video. The ASR service is employed to convert audio content of video to text content, or other speech conversion techniques may also be employed to convert audio content to text content. Identifying a caption character string in caption content, identifying a text character string in text content, calculating the conversion operation times between the caption character string and the text character string, obtaining the class editing distance between the caption character string and the text character string, fusing the caption character string and the text character string, obtaining the character string distance, calculating the distance difference between the class editing distance and the character string distance, calculating the ratio between the distance difference and the character string distance, and obtaining the text similarity of the candidate video. And acquiring a preset text similarity threshold set, and screening out a target text similarity threshold corresponding to the target language from the preset text similarity threshold set. And comparing the target text similarity threshold with the text similarity of the candidate videos, and screening the videos of which the text similarity does not exceed the target text similarity threshold from the candidate videos based on the comparison result, so as to obtain target videos corresponding to the target language. And screening out target subtitle contents of the target video from the subtitle contents, extracting a time axis corresponding to the target subtitle contents from the target video, taking the audio contents, the target subtitle contents and the time axis of the target video as initial corpus information, and sending the initial corpus to a verification server for verification to obtain the corpus of the target language.
The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, in the embodiment of the present invention at least one candidate video is obtained and text recognition is performed on its video frames to obtain the subtitle content of the candidate video; audio content is extracted from the candidate video and converted into text content; the similarity between the subtitle content and the text content is then calculated to obtain the text similarity of the candidate video; at least one target video in the target language is screened out from the candidate videos according to the text similarity, and the corpus corresponding to the target language is generated based on the audio content and the subtitle content of the target video. With this scheme, the subtitle content can be recognized in the candidate videos and their audio content converted into text content; the target videos of the target language are then accurately screened out according to the similarity between the subtitle content and the text content, and the subtitle content of the target videos can serve as a reference for manual annotation, so the accuracy of corpus generation can be greatly improved.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention provides a computer-readable storage medium storing a plurality of instructions that can be loaded by a processor to perform the steps in any of the corpus generation methods provided by the embodiments of the present invention. For example, the instructions may perform the following steps:
Obtaining at least one candidate video, carrying out text recognition on video frames of the candidate video to obtain subtitle content of the candidate video, extracting audio content from the candidate video, converting the audio content into text content, calculating similarity between the subtitle content and the text content to obtain text similarity of the candidate video, screening at least one target video of a target language from the candidate video according to the text similarity, and generating corpus corresponding to the target language based on the audio content and the subtitle content of the target video.
For example, a basic video set of a target language is obtained according to a preset keyword, a video type and a confidence coefficient of the video type of each video are identified in the basic video set, and at least one candidate video is selected from the basic video set based on the video type and the confidence coefficient. And framing the candidate video, screening out key video frames from the video frames after framing, positioning a target position area in the key video frames to obtain a caption area of the candidate video, and identifying texts corresponding to the caption area in the video frames to obtain caption contents of the candidate video. And separating audio data from the candidate video to obtain audio content, or extracting the audio data from the candidate video to obtain initial audio content, performing silence detection on the initial audio content, and screening the silence content from the initial audio content based on the detection result to obtain the audio content of the candidate video. The ASR service is employed to convert audio content of video to text content, or other speech conversion techniques may also be employed to convert audio content to text content. Identifying a caption character string in caption content, identifying a text character string in text content, calculating the conversion operation times between the caption character string and the text character string, obtaining the class editing distance between the caption character string and the text character string, fusing the caption character string and the text character string, obtaining the character string distance, calculating the distance difference between the class editing distance and the character string distance, calculating the ratio between the distance difference and the character string distance, and obtaining the text similarity of the candidate video. And acquiring a preset text similarity threshold set, and screening out a target text similarity threshold corresponding to the target language from the preset text similarity threshold set. And comparing the target text similarity threshold with the text similarity of the candidate videos, and screening the videos of which the text similarity does not exceed the target text similarity threshold from the candidate videos based on the comparison result, so as to obtain target videos corresponding to the target language. And screening out target subtitle contents of the target video from the subtitle contents, extracting a time axis corresponding to the target subtitle contents from the target video, taking the audio contents, the target subtitle contents and the time axis of the target video as initial corpus information, and sending the initial corpus to a verification server for verification to obtain the corpus of the target language.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic or optical disk, and the like.
Because the instructions stored in the computer readable storage medium may execute the steps in any corpus generation method provided by the embodiments of the present invention, the beneficial effects that any corpus generation method provided by the embodiments of the present invention can achieve are detailed in the previous embodiments, and are not described herein.
Wherein according to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in various alternative implementations of the corpus generation aspect or speech recognition aspect described above.
The foregoing has described in detail a corpus generation method, apparatus, electronic device and computer readable storage medium provided by embodiments of the present invention, and specific examples have been applied to illustrate the principles and embodiments of the present invention, where the foregoing description of the embodiments is only for helping to understand the method and core ideas of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present invention, the present description should not be construed as limiting the present invention.

Claims (13)

1. A corpus generation method, characterized by comprising the following steps:
acquiring at least one candidate video, and carrying out text recognition on video frames of the candidate video to obtain caption content of the candidate video;
Extracting audio content from the candidate video, and converting the audio content into text content;
Calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video; screening out videos of which the text similarity does not exceed a target text similarity threshold value from the candidate videos, and obtaining target videos corresponding to target languages;
screening out target subtitle contents of the target video from the subtitle contents;
Extracting a time axis corresponding to the target subtitle content from the target video;
And taking the audio content, the target subtitle content and the time axis of the target video as initial corpus, and sending the initial corpus to a verification server for verification to obtain the corpus of the target language.
2. The corpus generation method according to claim 1, wherein the calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video includes:
Identifying a caption character string in the caption content, and identifying a text character string in the text content;
Calculating the conversion operation times between the caption character string and the text character string to obtain the class editing distance between the caption character string and the text character string;
And determining the text similarity of the candidate video based on the caption character string, the text character string and the class editing distance.
3. The corpus generation method according to claim 2, wherein the determining the text similarity of the candidate video based on the caption string, text string, and class editing distance comprises:
Fusing the caption character string and the text character string to obtain a character string distance;
Calculating a distance difference between the class editing distance and the character string distance;
and calculating the ratio between the distance difference value and the character string distance to obtain the text similarity of the candidate video.
4. A corpus generation method according to any of claims 1 to 3, wherein the text recognition of the video frames of the candidate video to obtain subtitle content of the candidate video includes:
Framing the candidate video, and screening out key video frames from the video frames after framing;
Positioning a target position area in the key video frame to obtain a subtitle area of the candidate video;
And identifying the text corresponding to the caption area in the video frame to obtain the caption content of the candidate video.
5. The corpus generation method according to claim 4, wherein the screening out key video frames from the video frames after framing includes:
Performing text recognition on the video frames after framing to obtain video frame texts of the video frames;
classifying the video frames based on the video frame texts to obtain a video frame set corresponding to each video frame text;
And sequencing the video frames in the video frame set according to the playing time corresponding to the video frames, and screening the key video frames from the video frame set based on the sequencing result.
6. The corpus generation method according to claim 5, wherein locating a target location area in the key video frame to obtain a subtitle area of the candidate video includes:
Screening at least one key video frame text of the key video frames from the video frame text, and identifying text position information of each key video frame text from the key video frames;
screening target position information from the text position information based on the key video frame text;
and positioning a position area corresponding to the target position information in the key video frame to obtain a subtitle area of the candidate video.
7. A corpus generation method according to any of claims 1 to 3, characterized in that the obtaining at least one candidate video comprises:
acquiring a basic video set of a target language according to a preset keyword;
identifying a video type for each video and a confidence level for the video type in the base video set;
And screening at least one candidate video from the basic video set based on the video type and the confidence.
8. The corpus generation method according to claim 7, wherein the identifying of the video type and the confidence level of the video type for each video in the base video set comprises:
Performing audio detection on the audio frame of each video in the basic video set to obtain the audio type of the audio frame;
Performing silence detection on the video, and performing audio cutting on the video based on a detection result to obtain at least one audio fragment;
And extracting the characteristics of the audio fragment, and determining the video type of the video and the confidence of the video type based on the extracted audio characteristics and the audio type.
9. The corpus generation method according to claim 8, wherein the determining the video type of the video and the confidence level of the video type based on the extracted audio features and audio types comprises:
Determining the voice type of the audio fragment and the classification information of the voice type according to the audio type and the audio characteristics;
acquiring the audio time length of the audio fragment, and determining the classification weight of the voice type based on the audio time length;
And according to the classification weight and the classification information, fusing the voice types corresponding to the audio fragments of the video to obtain the video types of the video and the confidence of the video types.
10. A corpus generating apparatus, comprising:
the acquisition unit is used for acquiring at least one candidate video, and carrying out text recognition on video frames of the candidate video to obtain subtitle content of the candidate video;
the conversion unit is used for extracting audio content from the candidate videos and converting the audio content into text content;
The calculating unit is used for calculating the similarity between the subtitle content and the text content to obtain the text similarity of the candidate video;
the screening unit is used for screening videos of which the text similarity does not exceed a target text similarity threshold value from the candidate videos to obtain target videos corresponding to target languages;
the generating unit is used for screening out target subtitle contents of the target video from the subtitle contents; extracting a time axis corresponding to the target subtitle content from the target video; and taking the audio content, the target subtitle content and the time axis of the target video as initial corpus, and sending the initial corpus to a verification server for verification to obtain the corpus of the target language.
11. An electronic device comprising a processor and a memory, the memory storing an application, the processor being configured to run the application in the memory to perform the steps in the corpus generation method of any of claims 1 to 9.
12. A computer program product comprising computer programs/instructions which when executed by a processor implement the steps of the corpus generation method of any of claims 1 to 9.
13. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the corpus generation method of any of claims 1 to 9.
CN202210572357.1A 2022-05-24 2022-05-24 Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium Active CN114996506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572357.1A CN114996506B (en) 2022-05-24 2022-05-24 Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572357.1A CN114996506B (en) 2022-05-24 2022-05-24 Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114996506A CN114996506A (en) 2022-09-02
CN114996506B (en) 2024-07-23

Family

ID=83028828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572357.1A Active CN114996506B (en) 2022-05-24 2022-05-24 Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114996506B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468054B (en) * 2023-04-26 2023-11-07 中央民族大学 Method and system for aided construction of Tibetan transliteration data set based on OCR technology
CN116229943B (en) * 2023-05-08 2023-08-15 北京爱数智慧科技有限公司 Conversational data set generation method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818212A (en) * 2020-04-23 2021-05-18 腾讯科技(深圳)有限公司 Corpus data acquisition method and device, computer equipment and storage medium
CN113591530A (en) * 2021-02-24 2021-11-02 腾讯科技(深圳)有限公司 Video detection method and device, electronic equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6637332B2 (en) * 2015-08-24 2020-01-29 日本放送協会 Spoken language corpus generation device and program thereof
US10880614B2 (en) * 2017-10-20 2020-12-29 Fmr Llc Integrated intelligent overlay for media content streams
CN110008378B (en) * 2019-01-28 2024-03-19 平安科技(深圳)有限公司 Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN110427930A (en) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 Multimedia data processing method and device, electronic equipment and readable storage medium storing program for executing
KR102084372B1 (en) * 2019-10-28 2020-03-03 이광선 speech to text translation method for generating subtitle of moving picture in server using dialect database
CN111460117B (en) * 2020-03-20 2024-03-08 平安科技(深圳)有限公司 Method and device for generating intent corpus of conversation robot, medium and electronic equipment
CN111709253B (en) * 2020-05-26 2023-10-24 珠海九松科技有限公司 AI translation method and system for automatically converting dialect into subtitle
CN112037792B (en) * 2020-08-20 2022-06-17 北京字节跳动网络技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112163433B (en) * 2020-09-29 2022-04-05 北京字跳网络技术有限公司 Key vocabulary matching method and device, electronic equipment and storage medium
CN112200078A (en) * 2020-10-10 2021-01-08 济南浪潮高新科技投资发展有限公司 Corpus data set generation method and system based on video intelligent analysis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818212A (en) * 2020-04-23 2021-05-18 腾讯科技(深圳)有限公司 Corpus data acquisition method and device, computer equipment and storage medium
CN113591530A (en) * 2021-02-24 2021-11-02 腾讯科技(深圳)有限公司 Video detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114996506A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN107315737B (en) Semantic logic processing method and system
US8775174B2 (en) Method for indexing multimedia information
KR101990023B1 (en) Method for chunk-unit separation rule and display automated key word to develop foreign language studying, and system thereof
CN114465737B (en) Data processing method and device, computer equipment and storage medium
CN114996506B (en) Corpus generation method, corpus generation device, electronic equipment and computer readable storage medium
CN112153397B (en) Video processing method, device, server and storage medium
CN108710653B (en) On-demand method, device and system for reading book
CN113642536B (en) Data processing method, computer device and readable storage medium
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN114547373A (en) Method for intelligently identifying and searching programs based on audio
EP4322029A1 (en) Method and apparatus for generating video corpus, and related device
CN113450774A (en) Training data acquisition method and device
Sharma et al. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
CN111680493B (en) English text analysis method and device, readable storage medium and computer equipment
KR20170048736A (en) Evnet information extraciton method for extracing the event information for text relay data, and user apparatus for perfromign the method
CN112201225B (en) Corpus acquisition method and device, readable storage medium and electronic equipment
CN111554300B (en) Audio data processing method, device, storage medium and equipment
Hukkeri et al. Erratic navigation in lecture videos using hybrid text based index point generation
CN114078470A (en) Model processing method and device, and voice recognition method and device
CN118520075B (en) Method for analyzing drama text and extracting drama abstract
CN116704392B (en) Video processing method, device, equipment, storage medium and product
CN118170919B (en) Method and system for classifying literary works
CN117725153B (en) Text matching method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant