CN113490058A - Intelligent subtitle matching system applied to later stage of movie and television - Google Patents
- Publication number: CN113490058A
- Application number: CN202110960220.9A
- Authority: CN (China)
- Prior art keywords: subsystem, matching, movie, subtitle, unit
- Prior art date: 2021-08-20
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Studio Circuits (AREA)
Abstract
The invention discloses an intelligent subtitle matching system applied to the later stage of movie and television, comprising an input subsystem, a recognition subsystem, a subtitle matching subsystem and an output subsystem. The input subsystem selects and inputs the film and television video to be processed; the recognition subsystem performs speech recognition and lip-reading recognition on the video; the subtitle matching subsystem automatically generates and matches subtitles for the video according to the data recognized by the recognition subsystem; and the output subsystem finally outputs the subtitled video. The design is reasonable: by pairing speech recognition with lip-reading recognition of the video, scene subtitles are matched automatically and accurately between the text generated by speech recognition and the specific scenes of the video, which greatly reduces the workload of editors and improves their working efficiency.
Description
Technical Field
The invention relates to the field of film and television editing, and in particular to an intelligent subtitle matching system applied to the later stage of movie and television.
Background
Subtitles are non-video content, such as the dialogue of television, film and stage works, displayed as text; the term also covers the text added to film and television works in post-production. Commentary and other text appearing below the movie or television screen, such as the film title, credits, lyrics, dialogue and captions, as well as explanatory text such as character introductions, place names and dates, are all referred to as subtitles. Subtitles for film and television works generally appear at the bottom of the screen, whereas subtitles for stage works may appear at either side of, or above, the stage. Good subtitles have five characteristics: accuracy, consistency, clarity, readability and equivalence. Accuracy means the finished product is free of elementary mistakes such as typographical errors. Consistency means the subtitles are uniform in form and presentation, which is critical to the viewer's understanding. Clarity means the audio is rendered completely, including speaker identification and non-dialogue content, in a clearly legible form. Readability means the subtitles are displayed long enough for the audience to read, are synchronized with the audio, and do not cover the significant content of the picture. Equivalence means the subtitles convey the full content and intent of the video material.
At present, subtitles require a large amount of manual input when a video is shot, and later-stage speech recognition produces only a matching sequence for the speech, making it difficult to match specific frames. This greatly increases the workload of editors and hurts their editing efficiency, so certain shortcomings exist.
Disclosure of Invention
The invention aims to provide an intelligent subtitle matching system applied to the later stage of movie and television that accurately matches subtitles to the video and thereby improves the working efficiency of editors.
The invention is realized as follows:
the intelligent subtitle matching system applied to the later stage of movies and televisions comprises an input subsystem, an identification subsystem, a subtitle matching subsystem and an output subsystem, wherein the input subsystem is used for selecting and inputting movie videos to be processed, the identification subsystem is used for carrying out voice identification and lip language identification on the movie videos, the subtitle matching subsystem is used for carrying out automatic matching generation on the subtitles of the movie videos according to data identified by the identification subsystem, and the output subsystem is used for carrying out final output on the movie videos with the subtitles.
The recognition subsystem comprises a speech recognition unit and a lip-reading unit. The speech recognition unit recognizes the speech in the video and converts it into text in real time; the lip-reading unit recognizes lip movement in the video frame by frame.
The recognition subsystem further comprises a text conversion unit, which converts or matches the text generated by speech recognition into different languages.
The subtitle matching subsystem comprises a calibration matching unit and a subtitle insertion unit. The calibration matching unit calibrates and matches the subtitle timing of each scene according to the times at which the lip-reading unit detects lip movement; the subtitle insertion unit inserts into each scene the segment of the speech-recognized text that corresponds to that time.
The subtitle matching subsystem further comprises a subtitle editing unit, which edits and modifies the position, size and other characteristics of the subtitles generated in the video.
By pairing speech recognition with lip-reading recognition of the video, the invention matches the text generated by speech recognition automatically and accurately to the specific scenes of the video, greatly reducing the workload of editors and improving their working efficiency.
Drawings
Fig. 1 is a structural block diagram of the intelligent subtitle matching system applied to the later stage of movie and television.
In the figure: 1. input subsystem; 2. recognition subsystem; 3. subtitle matching subsystem; 4. output subsystem; 5. speech recognition unit; 6. lip-reading unit; 7. calibration matching unit; 8. subtitle insertion unit; 9. subtitle editing unit; 10. text conversion unit.
Detailed Description
The invention is further described below with reference to the figure and a specific embodiment.
Referring to Fig. 1, the intelligent subtitle matching system applied to the later stage of movie and television comprises an input subsystem 1, a recognition subsystem 2, a subtitle matching subsystem 3 and an output subsystem 4. The input subsystem 1 selects and inputs the film and television video to be processed; the recognition subsystem 2 performs speech recognition and lip-reading recognition on the video; the subtitle matching subsystem 3 automatically generates and matches subtitles for the video according to the data recognized by the recognition subsystem 2; and the output subsystem 4 finally outputs the subtitled video.
The recognition subsystem 2 comprises a speech recognition unit 5 and a lip-reading unit 6. The speech recognition unit 5 recognizes the speech in the video and converts it into text in real time; the lip-reading unit 6 recognizes lip movement in the video frame by frame. In this embodiment, speech recognition mainly uses a pattern-matching method: in the training stage, the user speaks each word in the vocabulary in turn, and the feature vector of each word is stored as a template in a template library; in the recognition stage, the feature vector of the input speech is compared with each template in the library in turn, and the word with the highest similarity is output as the recognition result. Lip-reading uses machine vision to continuously detect faces in the images, determine which person in the picture is speaking, and extract that person's continuous mouth-shape features.
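The template-matching stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the embodiment's code: it assumes each vocabulary word has already been reduced to a fixed-length feature vector (real systems compare variable-length feature sequences, for example with dynamic time warping or hidden Markov models), and all names are ours.

```python
import numpy as np

class TemplateMatcher:
    """Toy pattern-matching recognizer: one stored feature vector per word."""

    def __init__(self):
        self.templates: dict[str, np.ndarray] = {}

    def train(self, word: str, features: np.ndarray) -> None:
        # Training stage: store each vocabulary word's normalized feature
        # vector as a template in the template library.
        self.templates[word] = features / np.linalg.norm(features)

    def recognize(self, features: np.ndarray) -> str:
        # Recognition stage: compare the input feature vector with every
        # template in turn (cosine similarity) and output the best match.
        query = features / np.linalg.norm(features)
        return max(self.templates, key=lambda w: float(self.templates[w] @ query))
```

Cosine similarity is used here purely to keep the sketch short; the patent only specifies "similarity comparison" without naming a measure.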
The recognition subsystem 2 further comprises a text conversion unit 10, which converts or matches the text generated by speech recognition into different languages. In this embodiment, converting the text between languages lets the editor choose the language of the inserted subtitles as required; a sketch of such a unit is given below.
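The patent does not specify a translation backend, so this sketch injects one as a callable; `Translator`, `convert_subtitle_text` and the segment fields are hypothetical names of ours.

```python
from typing import Callable

# (text, target_language) -> translated text. The backend is an assumption:
# it could wrap any machine-translation service or an offline model.
Translator = Callable[[str, str], str]

def convert_subtitle_text(segments: list[dict], target_lang: str,
                          translate: Translator) -> list[dict]:
    # Translate only the text of each recognized segment and keep its timing,
    # so the editor can freely choose the language of the inserted subtitles.
    return [{**seg, "text": translate(seg["text"], target_lang)}
            for seg in segments]
```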
The subtitle matching subsystem 3 comprises a calibration matching unit 7 and a subtitle insertion unit 8. The calibration matching unit 7 calibrates and matches the subtitle timing of each scene according to the times at which the lip-reading unit 6 detects lip movement; the subtitle insertion unit 8 inserts into each scene the segment of the speech-recognized text that corresponds to that time. In this embodiment, the calibration matching unit 7 allocates the subtitle timing according to the times at which lip movement is recognized in the scene and, together with the correspondingly timed speech-recognized text extracted by the subtitle insertion unit 8, ensures that the subtitles match the scene, as sketched below.
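The alignment itself reduces to an interval-overlap test. The sketch below is our reading of the calibration step, with hypothetical names: each continuous lip-movement interval becomes the timing of one subtitle, and its text is the speech-recognized text whose timestamps overlap that interval.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float       # seconds from the start of the video
    end: float
    text: str = ""

def align_subtitles(lip_intervals: list[Segment],
                    asr_segments: list[Segment]) -> list[Segment]:
    """Assign each lip-movement interval the ASR text overlapping it in time."""
    subtitles = []
    for lip in lip_intervals:
        # Two intervals [a, b) and [c, d) overlap iff a < d and c < b.
        words = [seg.text for seg in asr_segments
                 if seg.start < lip.end and seg.end > lip.start]
        subtitles.append(Segment(lip.start, lip.end, " ".join(words)))
    return subtitles
```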
The subtitle matching subsystem 3 further comprises a subtitle editing unit 9, which edits and modifies the position, size and other characteristics of the subtitles generated in the video. In this embodiment, the subtitle editing unit 9 provides a feature-modification function: the editor can adjust characteristics of the subtitles such as size and position according to the actual situation, to guarantee their display effect. A minimal data model for such edits follows.
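The fields and defaults here are illustrative assumptions of ours, not values from the patent.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SubtitleStyle:
    x: float = 0.5        # horizontal anchor as a fraction of frame width
    y: float = 0.9        # vertical anchor, near the bottom of the frame
    font_size: int = 36   # arbitrary default for the sketch

def edit_style(style: SubtitleStyle, **changes) -> SubtitleStyle:
    # Return an edited copy so earlier versions of the styling survive undo.
    return replace(style, **changes)

# e.g. move the subtitles up and enlarge them:
# style = edit_style(SubtitleStyle(), y=0.85, font_size=42)
```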
A preferred embodiment of the intelligent subtitle matching system applied to the later stage of movie and television is listed below to illustrate the content of the invention clearly. It should be understood that the invention is not limited to this embodiment; other modifications achievable by those skilled in the art through conventional technical means fall within the scope of the inventive concept.
An embodiment of the invention provides the following operating procedure for the intelligent subtitle matching system applied to the later stage of movie and television:
S1, the original film and television video, containing audio and frontal views of the speakers' faces, is selected and fed into the system through the input subsystem 1;
S2, the speech recognition unit 5 of the recognition subsystem 2 recognizes the audio in the video and converts it into text in real time, while the lip-reading unit 6 of the recognition subsystem 2 recognizes the lip movements in the video frame by frame;
S3, the calibration matching unit 7 calibrates and matches the corresponding subtitle timing according to the times at which lip movement appears in each scene, and the subtitle insertion unit 8 then retrieves the recognized text of the corresponding time point from the speech-recognition transcript as the subtitle for that scene;
S4, the output subsystem 4 outputs the video once subtitle matching is complete.
To make the above process easier to follow, an example is given below:
Take one scene in a film and television video as an example. The speech recognition unit 5 recognizes the dialogue in the scene and converts it into text in real time. The lip-reading unit 6 recognizes the lip movements in the scene and identifies several runs of continuous lip movement. The calibration matching unit 7 takes the duration of each run of continuous lip movement as the display interval of a subtitle, and the subtitle insertion unit 8 then retrieves the correspondingly timed speech-recognized text as that subtitle. For example, if a run of continuous lip movement starts at the third minute of the video and ends at three minutes twenty seconds, the subtitle appears at 3:00 and disappears at 3:20, and the subtitle insertion unit 8 inserts the text recognized from the speech between 3:00 and 3:20 into the video. This continues until every picture with continuous lip movement has had matching text inserted at the matching time, after which the output subsystem 4 outputs the video. The example plays out as follows using the helpers sketched earlier.
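This hypothetical end-to-end run reuses `Segment` and `align_subtitles` from the sketch above; the dialogue strings are placeholders, and the times come from the example in the text.

```python
# One run of continuous lip movement from 3:00 to 3:20, in seconds.
lip_intervals = [Segment(start=180.0, end=200.0)]

# Speech recognition output for the same stretch of audio (placeholder text).
asr_segments = [
    Segment(180.0, 186.5, "First line of dialogue."),
    Segment(187.0, 199.5, "Second line of dialogue."),
]

for sub in align_subtitles(lip_intervals, asr_segments):
    print(f"{sub.start:.0f}s-{sub.end:.0f}s: {sub.text}")
# prints: 180s-200s: First line of dialogue. Second line of dialogue.
```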
The invention is not limited to the above embodiment; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the invention shall be included in its scope of protection.
Claims (5)
1. An intelligent subtitle matching system for film and television post-production, characterized in that: the system comprises an input subsystem (1), a recognition subsystem (2), a subtitle matching subsystem (3) and an output subsystem (4), wherein the input subsystem (1) is used for selecting and inputting the film and television video to be processed, the recognition subsystem (2) is used for performing speech recognition and lip-reading recognition on the video, the subtitle matching subsystem (3) is used for automatically generating and matching subtitles for the video according to the data recognized by the recognition subsystem, and the output subsystem (4) is used for finally outputting the subtitled video.
2. The intelligent subtitle matching system for film and television post-production according to claim 1, characterized in that: the recognition subsystem (2) comprises a speech recognition unit (5) and a lip-reading unit (6), wherein the speech recognition unit (5) is used for recognizing the speech in the video and converting it into text in real time, and the lip-reading unit (6) is used for recognizing lip movement in the video frame by frame.
3. The intelligent subtitle matching system for film and television post-production according to claim 2, characterized in that: the recognition subsystem (2) further comprises a text conversion unit (10), and the text conversion unit (10) is used for converting or matching the text generated by speech recognition into different languages.
4. The intelligent subtitle matching system for film and television post-production according to claim 2, characterized in that: the subtitle matching subsystem (3) comprises a calibration matching unit (7) and a subtitle insertion unit (8), wherein the calibration matching unit (7) is used for calibrating and matching the subtitle timing of a specific scene according to the lip-reading times of the lip-reading unit, and the subtitle insertion unit (8) is used for inserting into each scene the segment of the speech-recognized text that corresponds to that time.
5. The intelligent subtitle matching system for film and television post-production according to claim 4, characterized in that: the subtitle matching subsystem (3) further comprises a subtitle editing unit (9), and the subtitle editing unit (9) is used for editing and modifying the position and size characteristics of the subtitles generated in the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110960220.9A | 2021-08-20 | 2021-08-20 | Intelligent subtitle matching system applied to the later stage of movie and television |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110960220.9A | 2021-08-20 | 2021-08-20 | Intelligent subtitle matching system applied to the later stage of movie and television |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113490058A (en) | 2021-10-08 |
Family
ID=77946937
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110960220.9A (pending) | Intelligent subtitle matching system applied to the later stage of movie and television | 2021-08-20 | 2021-08-20 |
Country Status (1)
Country | Link |
---|---|
CN | CN113490058A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000322077A (en) * | 1999-05-12 | 2000-11-24 | Sony Corp | Television device |
CN105100647A (en) * | 2015-07-31 | 2015-11-25 | 深圳市金立通信设备有限公司 | Subtitle correction method and terminal |
CN105512348A (en) * | 2016-01-28 | 2016-04-20 | 北京旷视科技有限公司 | Method and device for processing videos and related audios and retrieving method and device |
CN105704538A (en) * | 2016-03-17 | 2016-06-22 | 广东小天才科技有限公司 | Audio and video subtitle generation method and system |
CN107770598A (en) * | 2017-10-12 | 2018-03-06 | 维沃移动通信有限公司 | A kind of detection method synchronously played, mobile terminal |
CN110035326A (en) * | 2019-04-04 | 2019-07-19 | 北京字节跳动网络技术有限公司 | Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment |
CN110691204A (en) * | 2019-09-09 | 2020-01-14 | 苏州臻迪智能科技有限公司 | Audio and video processing method and device, electronic equipment and storage medium |
CN111401101A (en) * | 2018-12-29 | 2020-07-10 | 上海智臻智能网络科技股份有限公司 | Video generation system based on portrait |
CN111813998A (en) * | 2020-09-10 | 2020-10-23 | 北京易真学思教育科技有限公司 | Video data processing method, device, equipment and storage medium |
US20200404386A1 (en) * | 2018-02-26 | 2020-12-24 | Google Llc | Automated voice translation dubbing for prerecorded video |
CN112714348A (en) * | 2020-12-28 | 2021-04-27 | 深圳市亿联智能有限公司 | Intelligent audio and video synchronization method |
CN113033357A (en) * | 2021-03-11 | 2021-06-25 | 深圳市鹰硕技术有限公司 | Subtitle adjusting method and device based on mouth shape features |
- 2021-08-20: application CN202110960220.9A filed in China; published as CN113490058A; status: pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |