CN115499683A - Voice recognition method and device and electronic equipment - Google Patents
- Publication number: CN115499683A
- Application number: CN202210938209.7A
- Authority
- CN
- China
- Prior art keywords
- information
- current
- recognition
- voice recognition
- area
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- H04N21/234336 — Reformatting of video elementary streams by media transcoding, e.g. video transformed into a slideshow of still pictures or audio converted into text
- G10L15/26 — Speech to text systems
- G10L15/30 — Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- H04N21/4312 — Generation of visual interfaces for content selection or interaction involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
- H04N21/439 — Processing of audio elementary streams
- H04N21/472 — End-user interface for requesting content, additional data or services, or for interacting with content, e.g. content reservation, setting reminders, requesting event notification, manipulating displayed content
- H04N21/4884 — Data services for displaying subtitles
Abstract
The disclosure relates to a voice recognition method and apparatus and an electronic device. The method includes: displaying a multimedia information recognition page that includes a voice recognition control; in response to a trigger operation on the voice recognition control, sending a voice recognition request carrying multimedia information to be recognized to a server, where the server, in response to the request, divides the multimedia information to be recognized into at least two information segments, sequentially determines any unrecognized segment among the at least two segments as the current information segment based on a preset sequence, and performs voice recognition on the current information segment to obtain a current voice recognition result corresponding to it; and displaying the current voice recognition result returned by the server in the multimedia information recognition page. With this technical solution, the display efficiency of voice recognition results can be improved, and interaction between the terminal account and the multimedia information to be recognized becomes more convenient.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a speech recognition method and apparatus, and an electronic device.
Background
In a typical video editing scene in the related art, the terminal sends a complete video to a server through video editing software, the server performs speech-to-text recognition on the complete video to obtain a full speech-to-text result, and the server then sends that full result back to the terminal for display.
Because the server must wait until the full speech-to-text result of the video has been generated before sending anything to the terminal, the display of results is delayed, the waiting time of the terminal account increases, and interaction between the terminal account and the video editing software becomes less convenient.
Disclosure of Invention
The disclosure provides a voice recognition method and apparatus and an electronic device, aiming at least to solve the problem in the related art that the server must wait for the full speech-to-text result of a video to be generated before sending it to the terminal, which reduces the display efficiency of the result, prolongs the waiting time of the terminal account, and reduces the convenience of interaction between the terminal account and the video editing software. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a speech recognition method, including:
displaying a multimedia information recognition page, the multimedia information recognition page comprising a voice recognition control;
in response to a trigger operation on the voice recognition control, sending a voice recognition request carrying multimedia information to be recognized to a server; wherein the server is configured to divide the multimedia information to be recognized into at least two information segments in response to the voice recognition request, to sequentially determine any unrecognized information segment among the at least two information segments as a current information segment based on a preset sequence, and to perform voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment;
receiving the current voice recognition result corresponding to the current information segment returned by the server;
and displaying a current voice recognition result corresponding to the current information segment in a target area in the multimedia information recognition page.
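As a rough illustration of the terminal-side flow described in this first aspect, the following Python sketch appends each per-segment result to the target area as soon as it arrives, so the user sees text before the whole file has been recognized. The class and method names are invented for illustration and are not part of the disclosure.

```python
# Illustrative terminal-side handler (names are assumptions): each
# per-segment result returned by the server is displayed immediately
# instead of waiting for the full speech-to-text result.
class RecognitionPage:
    def __init__(self):
        self.target_area = []            # recognition results shown so far

    def on_segment_result(self, result: str):
        self.target_area.append(result)  # display each result on arrival

page = RecognitionPage()
for partial in ["Hello everyone,", "welcome to the show."]:
    page.on_segment_result(partial)
print(page.target_area)  # ['Hello everyone,', 'welcome to the show.']
```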
According to a second aspect of the embodiments of the present disclosure, there is provided a speech recognition method including:
displaying a multimedia information recognition page, the multimedia information recognition page comprising a voice recognition control, the voice recognition control being used for triggering voice recognition of multimedia information to be recognized;
in response to a trigger operation on the voice recognition control, displaying recognition progress information of the multimedia information to be recognized in a scrolling manner in the multimedia information recognition page;
displaying a current voice recognition result corresponding to the current information segment in a target area in the multimedia information recognition page;
wherein the target area is the area to which the recognition progress information has moved at the current time, and the current time is the recognition time of the current information segment; the multimedia information to be recognized comprises at least two information segments, and the current information segment is any unrecognized information segment determined from the at least two information segments based on a preset sequence.
In an optional embodiment, a mask layer is disposed in the multimedia information recognition page, and displaying the recognition progress information of the multimedia information to be recognized in a scrolling manner in the multimedia information recognition page in response to the trigger operation on the voice recognition control includes:
in response to the trigger operation on the voice recognition control, displaying the recognition progress information in the mask layer and scrolling the recognition progress information along a preset direction in the mask layer.
In an optional embodiment, the multimedia information recognition page includes a recognition display area, and displaying, in the target area in the multimedia information recognition page, the current voice recognition result corresponding to the current information segment includes:
displaying the current voice recognition result corresponding to the current information segment in the target area in the multimedia information recognition page, and also displaying the current voice recognition result in the recognition display area.
In an optional embodiment, the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset duration and the playing time sequence of the multimedia information to be recognized; the multimedia information recognition page includes a subtitle display area, a time display area, and a recognition display area, and displaying the multimedia information recognition page includes:
displaying a time axis in the time display area, displaying a mask layer over the subtitle display area and the time display area, and displaying the voice recognition control in the recognition display area;
wherein the time axis comprises at least two playing time intervals obtained by dividing the total playing duration of the multimedia information to be recognized based on the preset duration; each playing time interval corresponds to one information segment, and the preset sequence is determined based on the playing time sequence of the information segments; the mask layer is used for displaying recognition progress information of the multimedia information to be recognized, and the recognition progress information represents the recognized duration within the total playing duration.
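The division of the time axis into playing time intervals of a preset duration can be sketched as follows. The function name and the 5-second preset duration are illustrative assumptions, not values fixed by the disclosure:

```python
def split_into_intervals(total_duration_s: float, preset_duration_s: float = 5.0):
    """Divide the total playing duration into consecutive playing time
    intervals of at most `preset_duration_s` seconds each; each interval
    corresponds to one information segment (illustrative sketch)."""
    intervals = []
    start = 0.0
    while start < total_duration_s:
        end = min(start + preset_duration_s, total_duration_s)
        intervals.append((start, end))
        start = end
    return intervals

# e.g. a 12-second clip with a 5-second preset duration:
print(split_into_intervals(12.0))  # [(0.0, 5.0), (5.0, 10.0), (10.0, 12.0)]
```

The intervals are emitted in playback order, which is the "preset sequence" in which the segments are then recognized.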
In an optional embodiment, the subtitle display area includes a recognized subtitle display sub-area, and before displaying the current voice recognition result corresponding to the current information segment, the method further includes:
displaying a recognized voice recognition result corresponding to a recognized information segment in the recognized subtitle display sub-area, and displaying recognized progress information within a second preset range of the position of the recognized voice recognition result in the mask layer;
wherein the recognized subtitle display sub-area is located within a first preset range of the position of the playing time interval corresponding to the recognized information segment, and the playing time sequence of the recognized information segment is earlier than that of the current information segment; the recognized progress information represents the recognized duration within the total playing duration, is arranged perpendicular to the subtitle display area and the time display area, and its intersection point with the time axis represents the recognized duration.
In an optional embodiment, the subtitle display area further includes a current subtitle display sub-area representing the target area, the current subtitle display sub-area is located within a third preset range of the position of the recognized subtitle display sub-area, and displaying the current voice recognition result corresponding to the current information segment includes:
displaying the current voice recognition result in the current subtitle display sub-area;
displaying current recognition progress information within a fourth preset range of the position of the current voice recognition result in the mask layer; wherein the current recognition progress information represents the currently recognized duration within the total playing duration of the multimedia information to be recognized, is arranged perpendicular to the subtitle display area and the time display area, and its intersection point with the time axis represents the current duration.
In an optional embodiment, the current voice recognition result is a current text recognition result, and displaying the current voice recognition result in the current subtitle display sub-area includes:
when the number of characters in the current text recognition result is greater than the target number of characters that the current subtitle display sub-area can accommodate, displaying omission information and a portion of the characters of the current text recognition result in the current subtitle display sub-area;
wherein the number of characters in the displayed portion equals the target number of characters, and the omission information represents the characters of the current text recognition result other than the displayed portion.
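A minimal sketch of the truncation rule just described, assuming a plain Python string and a character capacity. Both the function name and the "..." form of the omission information are illustrative assumptions:

```python
def fit_subtitle(text: str, capacity: int) -> str:
    """Show at most `capacity` characters of the current text recognition
    result in the subtitle sub-area; any excess is represented by
    omission information (here assumed to be '...')."""
    if len(text) <= capacity:
        return text          # result fits, display it whole
    return text[:capacity] + "..."  # partial text + omission information

print(fit_subtitle("hello world", 5))  # hello...
print(fit_subtitle("hi", 5))           # hi
```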
In an optional embodiment, displaying the current recognition progress information within the fourth preset range of the position of the current voice recognition result in the mask layer includes:
moving the recognized progress information from the second preset range of the position of the recognized voice recognition result to the fourth preset range of the position of the current text recognition result for display;
wherein the moved recognized progress information is updated to become the current recognition progress information.
In an optional embodiment, the multimedia information recognition page includes a recognition display area, and the method further includes:
in response to a recognition end instruction for the multimedia information to be recognized, displaying the voice recognition result corresponding to each information segment in the recognition display area based on the preset sequence.
In an optional embodiment, the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset duration and the playing time sequence of the multimedia information to be recognized, the recognition display area includes a recognition display sub-area corresponding to each information segment, each recognition display sub-area includes a text area and a non-text area, and displaying, in the recognition display area, the voice recognition result corresponding to each information segment based on the preset sequence includes:
displaying a voice recognition result corresponding to each information segment in a text area of the recognition display sub-area corresponding to each information segment based on the preset sequence, and displaying playing time corresponding to each information segment and a time adjusting control corresponding to each information segment in a non-text area of the recognition display sub-area corresponding to each information segment;
and the time adjusting control corresponding to each information segment is used for adjusting the playing time corresponding to each information segment.
In an optional embodiment, the method further comprises:
in response to a time adjustment operation triggered through the time adjustment control corresponding to a target information segment, updating the voice recognition result displayed in the text area corresponding to the target information segment to the voice recognition result of the information segment corresponding to the adjusted playing time;
wherein the target information segment is any one of the at least two information segments.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech recognition method including:
receiving a voice recognition request carrying multimedia information to be recognized sent by a terminal; wherein the voice recognition request is sent by the terminal in response to a trigger operation on a voice recognition control in a displayed multimedia information recognition page;
responding to the voice recognition request, and dividing the multimedia information to be recognized into at least two information segments;
sequentially determining any unidentified information segment in the at least two information segments as a current information segment based on a preset sequence;
performing voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment;
sending the current voice recognition result to the terminal; the terminal is used for displaying the current voice recognition result in a target area in the multimedia information recognition page.
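The server-side loop of this third aspect, namely divide, recognize segment by segment in the preset sequence, and send each result to the terminal without waiting for the full transcript, might be sketched as follows. The recognizer and the transport are stand-in callables, not part of the disclosure:

```python
# Hypothetical sketch of the server-side flow described above: each
# segment's result is sent as soon as it is ready, instead of waiting
# for the full speech-to-text result. All names are illustrative.
def recognize_and_stream(segments, recognize, send_to_terminal):
    for index, segment in enumerate(segments):  # preset sequence = play order
        result = recognize(segment)             # per-segment speech recognition
        send_to_terminal(index, result)         # return result without waiting

# Toy stand-ins for the recognizer and the network send:
received = []
recognize_and_stream(
    ["seg-a", "seg-b"],
    recognize=lambda seg: seg.upper(),
    send_to_terminal=lambda i, text: received.append((i, text)),
)
print(received)  # [(0, 'SEG-A'), (1, 'SEG-B')]
```

Because results are emitted per segment, the terminal can begin displaying text after the first segment is recognized, which is the latency improvement the disclosure claims over full-file recognition.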
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a first recognition page display module configured to display a multimedia information recognition page, the multimedia information recognition page comprising a voice recognition control;
a first trigger operation response module configured to, in response to a trigger operation on the voice recognition control, send a voice recognition request carrying multimedia information to be recognized to a server; wherein the server is configured to divide the multimedia information to be recognized into at least two information segments in response to the voice recognition request, to sequentially determine any unrecognized information segment among the at least two information segments as a current information segment based on a preset sequence, and to perform voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment;
a recognition result receiving module configured to receive the current voice recognition result corresponding to the current information segment returned by the server;
and a first recognition result display module configured to display, in a target area in the multimedia information recognition page, the current voice recognition result corresponding to the current information segment.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a second recognition page display module configured to display a multimedia information recognition page, the multimedia information recognition page comprising a voice recognition control, the voice recognition control being used for triggering voice recognition of multimedia information to be recognized;
a second trigger operation response module configured to, in response to a trigger operation on the voice recognition control, display recognition progress information of the multimedia information to be recognized in a scrolling manner in the multimedia information recognition page;
and a second recognition result display module configured to display, in a target area in the multimedia information recognition page, the current voice recognition result corresponding to the current information segment;
wherein the target area is the area to which the recognition progress information has moved at the current time, and the current time is the recognition time of the current information segment; the multimedia information to be recognized comprises at least two information segments, and the current information segment is any unrecognized information segment determined from the at least two information segments based on a preset sequence.
In an optional embodiment, a mask layer is disposed in the multimedia information recognition page, and the second trigger operation response module is configured to, in response to the trigger operation on the voice recognition control, display the recognition progress information in the mask layer and scroll the recognition progress information along a preset direction in the mask layer.
In an optional embodiment, the multimedia information recognition page includes a recognition display area, and the second recognition result display module is configured to display the current voice recognition result corresponding to the current information segment in the target area in the multimedia information recognition page and also display the current voice recognition result in the recognition display area.
In an optional embodiment, the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset duration and the playing time sequence of the multimedia information to be recognized; the multimedia information recognition page includes a subtitle display area, a time display area, and a recognition display area, and the first recognition page display module or the second recognition page display module is configured to display a time axis in the time display area, display a mask layer over the subtitle display area and the time display area, and display the voice recognition control in the recognition display area;
wherein the time axis comprises at least two playing time intervals obtained by dividing the total playing duration of the multimedia information to be recognized based on the preset duration; each playing time interval corresponds to one information segment, and the preset sequence is determined based on the playing time sequence of the information segments; the mask layer is used for displaying recognition progress information of the multimedia information to be recognized, and the recognition progress information represents the recognized duration within the total playing duration.
In an optional embodiment, the subtitle display area includes a recognized subtitle display sub-area, and the apparatus further includes:
a recognized progress display module configured to display the recognized voice recognition result corresponding to the recognized information segment in the recognized subtitle display sub-area, and to display the recognized progress information within a second preset range of the position of the recognized voice recognition result in the mask layer;
wherein the recognized subtitle display sub-area is located within a first preset range of the position of the playing time interval corresponding to the recognized information segment, and the playing time sequence of the recognized information segment is earlier than that of the current information segment; the recognized progress information represents the recognized duration within the total playing duration, is arranged perpendicular to the subtitle display area and the time display area, and its intersection point with the time axis represents the recognized duration.
In an optional embodiment, the subtitle display area further includes a current subtitle display sub-area representing the target area, the current subtitle display sub-area is located within a third preset range of the position of the recognized subtitle display sub-area, and the first recognition result display module or the second recognition result display module includes:
a current voice recognition result display unit configured to perform displaying the current voice recognition result in the current subtitle display sub-area;
a current recognition progress information display unit configured to display current recognition progress information within a fourth preset range of the position of the current voice recognition result in the mask layer; wherein the current recognition progress information represents the currently recognized duration within the total playing duration of the multimedia information to be recognized, is arranged perpendicular to the subtitle display area and the time display area, and its intersection point with the time axis represents the current duration.
In an optional embodiment, the current voice recognition result is a current text recognition result, and the current voice recognition result display unit is configured to display omission information and a portion of the characters of the current text recognition result in the current subtitle display sub-area when the number of characters in the current text recognition result is greater than the target number of characters that the current subtitle display sub-area can accommodate;
wherein the number of characters in the displayed portion equals the target number of characters, and the omission information represents the characters of the current text recognition result other than the displayed portion.
In an optional embodiment, the current recognition progress information display unit is configured to move the recognized progress information from the second preset range of the position of the recognized voice recognition result to the fourth preset range of the position of the current text recognition result for display;
wherein the moved recognized progress information is updated to become the current recognition progress information.
In an optional embodiment, the multimedia information recognition page includes a recognition display area, and the apparatus further includes:
a recognition end instruction response module configured to, in response to a recognition end instruction for the multimedia information to be recognized, display the voice recognition result corresponding to each information segment in the recognition display area based on the preset sequence.
In an optional embodiment, the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset duration and the playing time sequence of the multimedia information to be recognized, the recognition display area includes a recognition display sub-area corresponding to each information segment, each recognition display sub-area includes a text area and a non-text area, and the recognition end instruction response module includes:
a result time display unit configured to display a voice recognition result corresponding to each information segment in a text area of the recognition display sub-area corresponding to each information segment based on the preset sequence, and display a playing time corresponding to each information segment and a time adjustment control corresponding to each information segment in a non-text area of the recognition display sub-area corresponding to each information segment;
and the time adjusting control corresponding to each information segment is used for adjusting the playing time corresponding to each information segment.
In an optional embodiment, the apparatus further comprises:
a time adjustment operation response module configured to, in response to a time adjustment operation triggered through the time adjustment control corresponding to a target information fragment, update the voice recognition result displayed in the text area corresponding to the target information fragment to the voice recognition result of the information fragment corresponding to the adjusted playing time;
wherein the target information segment is any one of the at least two information segments.
According to a sixth aspect of an embodiment of the present disclosure, there is provided a speech recognition apparatus including:
a voice recognition request receiving module configured to receive a voice recognition request, sent by a terminal, carrying multimedia information to be recognized; the voice recognition request is sent by the terminal in response to a triggering operation on the voice recognition control in the displayed multimedia information recognition page;
a dividing module configured to perform dividing the multimedia information to be recognized into at least two information segments in response to the voice recognition request;
a current information segment determining module configured to sequentially determine any unidentified information segment in the at least two information segments as a current information segment based on a preset sequence;
the recognition module is configured to perform voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment;
a recognition result sending module configured to execute sending the current speech recognition result to the terminal; the terminal is used for displaying the current voice recognition result in a target area in the multimedia information recognition page.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a speech recognition system, including a terminal and a server:
the terminal is used for displaying a multimedia information recognition page, the multimedia information recognition page comprising a voice recognition control; for responding to a triggering operation based on the voice recognition control by sending a voice recognition request carrying multimedia information to be recognized to the server; and for displaying, in a target area of the multimedia information recognition page, a current voice recognition result returned by the server;
the server is used for responding to the voice recognition request and dividing the multimedia information to be recognized into at least two information segments; for sequentially determining any unidentified information segment in the at least two information segments as a current information segment based on a preset sequence; and for carrying out voice recognition on the current information segment to obtain the current voice recognition result corresponding to the current information segment.
According to an eighth aspect of embodiments of the present disclosure, there is provided an electronic device for speech recognition, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition method according to any of the above embodiments.
According to a ninth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, cause the electronic device to perform the speech recognition method according to any one of the above embodiments.
According to a tenth aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program is configured to implement the speech recognition method according to any of the above embodiments when executed by a processor.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps that a multimedia information recognition page comprising a voice recognition control is displayed, and a voice recognition request carrying multimedia information to be recognized is sent to a server in response to the triggering operation based on the voice recognition control; the server is used for responding to the voice recognition request and dividing the multimedia information to be recognized into at least two information segments; the system is used for sequentially determining any unidentified information segment in at least two information segments as a current information segment based on a preset sequence; the voice recognition module is used for carrying out voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment; in the multimedia information identification page, the current voice identification result returned by the server is displayed, so that the terminal can display a voice identification result in the multimedia information identification page at intervals, the display efficiency of the voice identification result of the multimedia information to be identified is improved, the waiting time of a terminal account is reduced, and the interaction convenience between the terminal account and the multimedia information to be identified and the multimedia information editing software is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic illustration of an application environment for speech recognition, according to an example embodiment.
FIG. 2 is a first flowchart illustrating a speech recognition method according to an example embodiment.
FIG. 3 is a second flowchart illustrating a speech recognition method according to an example embodiment.
Fig. 4 is a first diagram illustrating a multimedia information identification page according to an exemplary embodiment.
Fig. 5 is a second schematic diagram of a multimedia information identification page according to an exemplary embodiment.
Fig. 6 is a flowchart illustrating displaying a current voice recognition result and current progress information of multimedia information to be recognized in a multimedia information recognition page according to an exemplary embodiment.
Fig. 7 is a third schematic diagram illustrating a multimedia information identification page, according to an example embodiment.
Fig. 8 is a fourth schematic diagram illustrating a multimedia information recognition page, according to an example embodiment.
Fig. 9 is a fifth diagram illustrating a multimedia information identification page, according to an example embodiment.
Fig. 10 is a first block diagram illustrating a speech recognition apparatus according to an example embodiment.
Fig. 11 is a second block diagram illustrating a speech recognition apparatus according to an example embodiment.
FIG. 12 is a block diagram illustrating an electronic device for speech recognition according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment for speech recognition according to an exemplary embodiment, and as shown in fig. 1, the application environment may include a terminal 01 and a server 02.
In an alternative embodiment, the terminal 01 may be configured to display a multimedia information recognition page including a voice recognition control; to respond to a triggering operation based on the voice recognition control by sending a voice recognition request carrying the multimedia information to be recognized to the server; and to display, in a target area in the multimedia information recognition page, the current voice recognition result returned by the server. Specifically, the terminal 01 may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, a smart wearable device, and other types of electronic devices. Optionally, the operating system running on the electronic device may include, but is not limited to, an Android system, an iOS system, Linux, Windows, and the like.
In an alternative embodiment, the server 02 may be used to provide background services for the terminal 01. The server 02 may be configured to divide the multimedia information to be recognized into at least two information segments in response to the voice recognition request; to sequentially determine any unidentified information segment in the at least two information segments as a current information segment based on a preset sequence; and to perform voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment. For example, the server 02 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms.
In addition, it should be noted that fig. 1 shows only one application environment of the speech recognition method provided by the present disclosure. In other scenarios, there may be other application environments, and the present disclosure is not limited thereto.
Fig. 2 is a flow chart illustrating a method of speech recognition according to an exemplary embodiment, which may include the following steps, as shown in fig. 2.
In step S11, a multimedia information identification page is displayed; the multimedia information identification page comprises a voice identification control.
Optionally, in a multimedia information editing scene, the multimedia information to be recognized may be subjected to speech recognition by multimedia information editing software. The multimedia information identification page may be a page used for performing voice recognition on multimedia information to be recognized in the multimedia information editing software. Illustratively, in the process of identifying the multimedia information to be identified, the terminal can open the multimedia information editing software and display the multimedia information identification page. The multimedia information recognition page can comprise an information loading control and a voice recognition control, the information loading control is used for uploading the multimedia information to be recognized to the multimedia information editing software, and the voice recognition control is used for triggering voice recognition on the multimedia information to be recognized.
It should be noted that the multimedia information to be recognized may be multimedia information including voice information, which may include but is not limited to: video, audio, etc.
In step S13, in response to a triggering operation based on the voice recognition control, a voice recognition request carrying the multimedia information to be recognized is sent to a server; the server is used for responding to the voice recognition request and dividing the multimedia information to be recognized into at least two information segments; for sequentially determining any unidentified information segment in the at least two information segments as a current information segment based on a preset sequence; and for carrying out voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment.
In step S15, a current speech recognition result corresponding to the current information fragment returned by the server is received.
In step S17, a current speech recognition result corresponding to the current information segment is displayed in the target area of the multimedia information recognition page.
Optionally, in step S13, the terminal account may operate the voice recognition control (for example, by clicking, sliding, or dragging), thereby triggering the voice recognition, and the terminal, in response to the triggering operation, sends the voice recognition request carrying the multimedia information to be recognized to the server. The server is used for responding to the voice recognition request and dividing the multimedia information to be recognized into at least two information segments. Optionally, the server may divide the multimedia information to be recognized into the at least two information segments in various ways, which is not limited in this disclosure.
In an embodiment, the server may divide the multimedia information to be recognized based on a preset duration and the playing time sequence of the multimedia information to be recognized, so as to obtain the at least two information segments. For example, if the total playing duration of the multimedia information to be recognized is 20s and the preset duration is 5s, the multimedia information to be recognized may be divided into: information segment 1 corresponding to 0-5s, information segment 2 corresponding to 5s-10s, information segment 3 corresponding to 10s-15s, and information segment 4 corresponding to 15s-20s. The format of the information fragments may be: { data: "information fragment 1", startTime: "0", endTime: "5" }, { data: "information fragment 2", startTime: "5", endTime: "10" }, { data: "information fragment 3", startTime: "10", endTime: "15" }, { data: "information fragment 4", startTime: "15", endTime: "20" }.
In another embodiment, the server may divide the multimedia information to be recognized based on the preset speech volume and the playing time sequence of the multimedia information to be recognized, so as to obtain the at least two information segments. For example, if the total speech volume included in the multimedia information to be recognized is 2KB and the preset speech volume is 0.5KB, the multimedia information to be recognized may be divided into: the information fragment corresponding to 0 KB to 0.5KB, the information fragment corresponding to 0.5KB to 1KB, the information fragment corresponding to 1KB to 1.5KB and the information fragment corresponding to 1.5KB to 2 KB.
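The duration-based division above can be sketched as follows; this is an illustrative helper (the name `split_by_duration` is hypothetical) that reproduces the fragment format of the 20s/5s example, not the disclosed server implementation.

```python
def split_by_duration(total_duration: int, preset_duration: int) -> list:
    """Divide the multimedia information to be recognized into information
    segments in playing-time order, each covering at most `preset_duration`
    seconds, in the { data, startTime, endTime } format of the example."""
    segments = []
    start, index = 0, 1
    while start < total_duration:
        end = min(start + preset_duration, total_duration)
        segments.append({"data": f"information fragment {index}",
                         "startTime": str(start), "endTime": str(end)})
        start, index = end, index + 1
    return segments

# A 20s total playing duration with a 5s preset duration yields the four
# segments of the example above.
for seg in split_by_duration(20, 5):
    print(seg)
```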
Alternatively, in step S13, the preset sequence may be determined based on a playing time sequence of the at least two information segments. After the server obtains the plurality of information segments, any unidentified information segment in at least two information segments can be sequentially determined to be the current information segment based on a preset sequence; and performing voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment. In the above step S15, the server sends the current speech recognition result to the terminal. In the above step S17, the terminal may display the current speech recognition result in the target area in the multimedia information recognition page.
In an alternative embodiment, the speech recognition result may be a text recognition result, that is, the speech in the multimedia information to be recognized is converted to text. In another alternative embodiment, the speech recognition result may be a non-text recognition result. For example, the non-text recognition result may be an emotion recognition result, that is, the emotion of the terminal account corresponding to the voice in the multimedia information to be recognized is recognized.
For example, the server divides the multimedia information to be recognized based on a preset duration and a playing time sequence of the multimedia information to be recognized, and the obtained at least two information segments may be: information segment 1 corresponding to 0-5s, information segment 2 corresponding to 5s-10s, information segment 3 corresponding to 10s-15s, and information segment 4 corresponding to 15s-20 s. The playing time sequence of each information fragment is as follows: information segment 1-information segment 2-information segment 3-information segment 4. The preset sequence may be: information segment 1-information segment 2-information segment 3-information segment 4. The server can firstly take the information segment 1 as the current information segment to obtain the current voice recognition result 1 of the information segment 1, the server sends the current voice recognition result 1 to the terminal, and the terminal displays the current voice recognition result 1 in the multimedia information recognition page. The server then takes the information segment 2 as a current information segment to obtain a current voice recognition result 2 of the information segment 2, the server sends the current voice recognition result 2 to the terminal, and the terminal displays the current voice recognition result 2 in a multimedia information recognition page. The server then takes the information fragment 3 as a current information fragment to obtain a current voice recognition result 3 of the information fragment 3, the server sends the current voice recognition result 3 to the terminal, and the terminal displays the current voice recognition result 3 in the multimedia information recognition page. 
The server then takes the information segment 4 as a current information segment to obtain a current voice recognition result 4 of the information segment 4, the server sends the current voice recognition result 4 to the terminal, and the terminal displays the current voice recognition result 4 in a multimedia information recognition page. According to the scheme, the terminal can display the voice recognition result in the multimedia information recognition page at intervals, so that the display efficiency of the voice recognition result of the multimedia information to be recognized is improved, the waiting time of the terminal account is reduced, and the interaction convenience between the terminal account and the multimedia information to be recognized and the multimedia information editing software is improved.
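The segment-by-segment flow walked through above (take the next unrecognized segment as the current segment, recognize it, push its result to the terminal, repeat) can be sketched as follows; the recognizer and the send callback are stand-ins, since the disclosure does not specify a concrete recognition engine or transport.

```python
def recognize_sequentially(segments, recognize, send_to_terminal):
    """Sequentially take each unrecognized segment, in the preset
    (playing-time) order, as the current segment; recognize it; and push
    the current result to the terminal immediately, so results are shown
    at intervals instead of only after the whole file is processed."""
    for segment in segments:
        current_result = recognize(segment)
        send_to_terminal(current_result)

# Stand-in recognizer, and a list collecting what the terminal would display.
displayed = []
recognize_sequentially(
    [{"data": "information fragment 1"}, {"data": "information fragment 2"}],
    lambda seg: f"result for {seg['data']}",
    displayed.append,
)
print(displayed)
```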
Fig. 3 is a flowchart illustrating a speech recognition method according to an exemplary embodiment, where as shown in fig. 3, the speech recognition method may include:
in step S21, a multimedia information identification page is displayed; the multimedia information recognition page comprises a voice recognition control, and the voice recognition control is used for triggering voice recognition of multimedia information to be recognized;
in step S23, in response to a triggering operation based on the voice recognition control, displaying the recognition progress information of the multimedia information to be recognized in a rolling manner in the multimedia information recognition page;
in step S25, displaying a current speech recognition result corresponding to the current information segment in a target area in the multimedia recognition page;
the target area is an area where the identification progress information moves at the current time, and the current time is the identification time of the current information fragment; the multimedia information to be identified comprises at least two information segments, and the current information segment is any unidentified information segment determined from the at least two information segments based on a preset sequence.
Optionally, in step S11 or step S21, the multimedia information identification page may be displayed in various ways, which is not limited in this respect.
In an embodiment, the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset duration and a playing time sequence of the multimedia information to be recognized, where the multimedia information recognition page includes a subtitle display area, a time display area, and a recognition display area, and in step S11 or step S21, the displaying the multimedia information recognition page may include:
displaying a time axis in the time display area, displaying a mask layer in the subtitle display area and the time display area, and displaying a voice recognition control in the recognition display area;
the time axis comprises at least two playing time intervals, the at least two playing time intervals are obtained by dividing the total playing time of the multimedia information to be identified based on preset time, each playing time interval corresponds to each information clip, and the preset sequence is determined based on the playing time sequence of each information clip; the mask layer is used for displaying identification progress information of the multimedia information to be identified, and the identification progress information is used for representing the identified duration in the total playing duration.
It should be noted that the subtitle display area, the time display area, and the identification display area may be disposed at any position in the multimedia information identification page, which is not specifically limited in this disclosure. Fig. 4 is a schematic diagram illustrating a multimedia information identification page according to an exemplary embodiment, where as shown in fig. 4, the subtitle display area may be located in a lower area of the time display area, the identification display area may be located in an upper area of the subtitle display area, and the like.
In this embodiment, an identification display area may be set on the multimedia information identification page, and the speech recognition control may be displayed in the identification display area. The speech recognition control may be disposed at any position in the recognition display area, which is not specifically limited by the present disclosure. Continuing with FIG. 4, the speech recognition control can be displayed in a central location in the recognition display area.
In this embodiment, the total playing duration of the multimedia information to be recognized may be divided into at least two playing time intervals based on the preset duration, the time axis is generated based on the at least two playing time intervals, and the time axis is displayed in the time display area. Continuing with fig. 4, if the total playing duration of the multimedia information to be recognized is 20s and the preset duration is 5s, the at least two playing time intervals may be: a playing time interval 1 of 0-5s, a playing time interval 2 of 5s-10s, a playing time interval 3 of 10s-15s, and a playing time interval 4 of 15s-20s. It can be seen that the number of the at least two playing time intervals is equal to the number of the at least two information clips, and each playing time interval corresponds to one information clip, that is, each playing time interval corresponds to the playing duration of one information clip.
In this embodiment, the recognition progress information of the multimedia information to be recognized may be injected into the mask layer, the mask layer is disposed over the subtitle display area and the time display area, and the recognition progress information of the multimedia information to be recognized may be displayed to the terminal account through the transparent mask layer, where the recognition progress information is used to represent the recognized duration within the total playing duration of the multimedia information to be recognized. As an example, the recognition progress information (progress) can be calculated when the total playing duration (duration) of the multimedia information to be recognized and the end time (endTime) of the playing time interval corresponding to the current information segment are known. For example, if the total playing duration is 20s and the end time of the playing time interval corresponding to the current information segment is 10s, the current recognition progress information is 10s. In addition, the mask layer covers the subtitle display area and the time display area and locks the content below it, so that the content below the mask layer is in a non-operable state. This prevents the terminal account from dragging track content whose recognition is unfinished (including the content in the subtitle display area corresponding to the subtitle track and the content in the time display area corresponding to the time track), which would otherwise reduce the voice recognition precision.
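The progress computation just described (known total playing duration and end time of the current segment's playing interval) can be sketched as follows; the function names are illustrative, not part of the disclosed apparatus.

```python
def recognized_duration(total_duration: float, segment_end_time: float) -> float:
    """Recognized duration shown on the mask layer: the end time of the
    playing time interval of the current information segment, capped at
    the total playing duration."""
    return min(segment_end_time, total_duration)

def progress_ratio(total_duration: float, segment_end_time: float) -> float:
    """Fraction of the time axis covered, e.g. for positioning the
    progress marker over the subtitle and time display areas."""
    return recognized_duration(total_duration, segment_end_time) / total_duration

print(recognized_duration(20, 10))  # 10
print(progress_ratio(20, 10))       # 0.5
```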
In the embodiment of the disclosure, the voice recognition control is displayed in the recognition display area, so that the interaction convenience between the terminal account and the multimedia information to be recognized can be improved; the time shaft comprising at least two playing time intervals is displayed in the time display area, so that the identification progress information of the multimedia information to be identified can be accurately displayed for the terminal account, and the interaction convenience between the terminal account and the multimedia information to be identified and the multimedia information editing software is further improved; in addition, the mask layer is displayed on the caption display area and the time display area, and the identification progress information is injected into the mask layer, so that the convenience of displaying the identification progress information to a terminal account is improved, the content below the mask layer is in a non-operation state, and the voice identification precision of the multimedia information to be identified is improved.
In another embodiment, if the multimedia information identification page includes a subtitle display area, a voice amount display area, and an identification display area, in step S11 or step S21, the displaying the multimedia information identification page may include:
displaying a voice volume axis in the voice volume display area, displaying a mask layer on the subtitle display area and the voice volume display area, and displaying the voice recognition control in the recognition display area; the voice volume axis comprises at least two voice intervals, the mask layer is used for displaying recognition progress information of the multimedia information to be recognized, and the recognition progress information is used for representing the recognized voice volume within the total voice volume.
In a possible embodiment, the subtitle display area includes a recognized subtitle display sub-area, and before the displaying the current information segment corresponding to the current speech recognition result, that is, before the step S17 or the step S25, the method may further include:
displaying a recognized voice recognition result corresponding to the recognized information fragment in the recognized subtitle display sub-area, and displaying recognized progress information in a second preset range of the position of the recognized voice recognition result in the mask layer;
the identified subtitle display subarea is positioned in a first preset range of the position of a playing time interval corresponding to an identified information fragment, and the playing time sequence of the identified information fragment is earlier than that of the current information fragment; the identified progress information is used for representing the identified duration which is identified in the total playing duration, the identified progress information is arranged perpendicular to the subtitle display area and the time display area, and the intersection point of the identified progress information and the time axis represents the identified duration.
Optionally, an information segment, of the at least two information segments, whose playing time sequence is earlier than that of the current information segment may be used as a recognized information segment. Since the terminal displays the recognized speech recognition result of the recognized information segment before the server returns the current speech recognition result corresponding to the current information segment, a recognized subtitle display sub-region may be set in the subtitle display area, and the recognized speech recognition result corresponding to the recognized information segment may be displayed in the recognized subtitle display sub-region. Since each information segment corresponds to a playing time interval, as an example, the recognized subtitle display sub-region may be located within a first preset range of the position of the playing time interval corresponding to the recognized information segment; specifically, it may be located in the area below the playing time interval corresponding to the recognized information segment. Fig. 5 is a second schematic diagram of a multimedia information recognition page according to an exemplary embodiment. As shown in fig. 5, if the recognized information segment is information segment 1 and the playing time interval corresponding to information segment 1 is 0-5s, the recognized subtitle display sub-area may be located in the area below the 0-5s playing time interval.
Optionally, in order to display the recognized progress information, the terminal may further display the recognized progress information within a second preset range of the position of the recognized speech recognition result in the mask layer. As an example, the second preset range may be a region behind the position where the recognized voice recognition result is located, and as shown in fig. 5, the recognized progress information may be displayed in a region behind the position where the recognized voice recognition result is located. The identified progress information is used to represent the identified duration that has been identified in the total playing duration, for example, if the total playing duration is 20s, the ending time of the playing time interval corresponding to the identified information segment is 5s, and the identified progress information is 5s.
Since the recognized progress information is displayed in the mask layer, it may be displayed in the mask layer in various ways, which is not specifically limited in the embodiment of the present disclosure. In one embodiment, the identified progress information may be an axis perpendicular to the subtitle display area and the time display area, and an intersection of the identified progress information and the time axis represents the identified duration. In another embodiment, the recognition progress information may be recognition progress prompt information disposed behind the recognized subtitle display sub-region, and the recognition progress prompt information may be displayed in the form of text and/or images. In a text form, for example, the recognition progress prompt information may be "the identified progress information is xx".
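The relationship between the recognized information segments and the recognized progress information described above can be sketched as follows. This is a minimal illustration only; the function and variable names are hypothetical and not part of the disclosed embodiment.

```python
def recognized_duration(segment_intervals, recognized_count):
    """Recognized duration = end time of the playing time interval of the
    last recognized information segment (0 if nothing is recognized yet)."""
    if recognized_count == 0:
        return 0
    # segment_intervals: list of (start_s, end_s) in playing time order.
    return segment_intervals[recognized_count - 1][1]

# Four information segments of 5 s each, total playing duration 20 s.
intervals = [(0, 5), (5, 10), (10, 15), (15, 20)]
total_duration = intervals[-1][1]           # 20 s
done = recognized_duration(intervals, 1)    # information segment 1 recognized -> 5 s
prompt = f"the identified progress information is {done}s"
```

This matches the example in the text: with a total playing duration of 20 s and the recognized segment's playing time interval ending at 5 s, the recognized progress information is 5 s.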
In the embodiment of the disclosure, the recognized voice recognition result corresponding to the recognized information segment is displayed in the recognized subtitle display sub-area, so that the terminal can display a voice recognition result in the multimedia information recognition page at intervals, the display efficiency of the voice recognition result of the multimedia information to be recognized is improved, the waiting time of a terminal account is reduced, and the interaction convenience between the terminal account and video editing software is improved; in addition, the recognized progress information is displayed within a second preset range of the position of the recognized voice recognition result in the mask layer, the recognized progress information can be further conveniently and accurately displayed to the terminal account, and interaction convenience between the terminal account and the multimedia information to be recognized and the multimedia information editing software is further improved.
In an operable embodiment, in a case that the plurality of information segments are obtained by dividing based on the voice volume, the recognized subtitle display sub-region may be located within a first preset range of a position where a voice interval corresponding to the recognized information segment is located, the recognized progress information is used for representing a recognized voice volume that has been recognized in the total voice volume, the recognized progress information is arranged perpendicular to the subtitle display region and the time display region, and an intersection of the recognized progress information and the voice volume axis represents the recognized voice volume.
Fig. 6 is a flowchart illustrating displaying a current speech recognition result and current progress information of multimedia information to be recognized in a multimedia information recognition page according to an exemplary embodiment. As shown in fig. 6, in an alternative embodiment, the subtitle display area may further include a current subtitle display sub-area representing the target area, the current subtitle display sub-area being located within a third preset range of a location of the recognized subtitle display sub-area, and in step S17 or step S25, the displaying the current speech recognition result corresponding to the current information segment may include:
In step S31, the current speech recognition result is displayed in the current subtitle display sub-area.
In step S33, the current recognition progress information is displayed within a fourth preset range of the position of the current speech recognition result in the mask layer; the current recognition progress information is used for representing the recognized current duration in the total playing duration of the multimedia information to be recognized, the current recognition progress information is arranged perpendicular to the subtitle display area and the time display area, and the intersection point of the current recognition progress information and the time axis represents the current duration.
Alternatively, in step S31 above, since the terminal already displays the recognized speech recognition result of the recognized information segment before the server returns the current speech recognition result corresponding to the current information segment, a current subtitle display sub-region may be set in the subtitle display region, and the current speech recognition result corresponding to the current information segment may be displayed in the current subtitle display sub-region. As each information segment corresponds to a playing time interval, as an example, the current subtitle display sub-region may be located within a preset range of a position of the playing time interval corresponding to the current information segment, that is, within a third preset range of a position of the identified subtitle display sub-region. Specifically, the third preset range may be located in a rear area of the position of the identified subtitle display sub-area. Fig. 7 is a third schematic diagram of a multimedia information identification page according to an exemplary embodiment, as shown in fig. 7, a current information segment is the information segment 2, and a playing time interval corresponding to the information segment 2 is 5s to 10s, so that the current subtitle display sub-region may be located in a lower region of the playing time interval of 5s to 10s and in a rear region of a position where the identified subtitle display sub-region is located.
Optionally, in step S33, in order to display the current recognition progress information, the terminal may further display the current recognition progress information within a fourth preset range of the position of the current voice recognition result in the mask layer, where the current recognition progress information may be used to represent a current time length that has been recognized in the total playing time length of the multimedia information to be recognized, for example, the total playing time length is 20S, an ending time of a playing time interval corresponding to the current information clip is 10S, and the current recognition progress information is 10S. As an example, the fourth preset range may be a rear area of a position where a current voice recognition result is located. The current recognition progress information is used for representing the recognized current time length in the total playing time length. As shown in fig. 7, the current recognition progress information may be displayed in a region behind the position of the current speech recognition result.
Since the current recognition progress information is displayed in the mask layer, it may be displayed in the mask layer in various ways, which is not specifically limited by the present disclosure. In one embodiment, the current recognition progress information may be an axis perpendicular to the subtitle display area and the time display area, and an intersection of the current recognition progress information and the time axis represents the current time duration. In another embodiment, the current recognition progress information may be recognition progress prompt information arranged behind the current subtitle display sub-region, and the recognition progress prompt information may be displayed in the form of text and/or images. In a text form, for example, the current recognition progress prompt information may be "the current recognition progress information is xx".
In the embodiment of the disclosure, the current voice recognition result corresponding to the current information segment is displayed in the current subtitle display sub-area, so that the terminal can display a voice recognition result in the multimedia information recognition page at intervals, the display efficiency of the voice recognition result of the multimedia information to be recognized is improved, the waiting time of a terminal account is reduced, and the interaction convenience between the terminal account and the multimedia information to be recognized and the multimedia information editing software is improved; in addition, the current recognition progress information is displayed within a fourth preset range of the position of the current voice recognition result in the mask layer, so that the current recognition progress information can be conveniently and accurately displayed for the terminal account, and the interaction convenience between the terminal account and the multimedia information to be recognized and the multimedia information editing software is further improved.
In an operable embodiment, in a case where the plurality of information pieces are divided based on the voice volume, the current recognition progress information is used for representing the current voice volume which has been recognized in the total voice volume, the current recognition progress information is arranged perpendicular to the subtitle display area and the time display area, and an intersection point of the current recognition progress information and the voice volume axis represents the current voice volume.
Alternatively, in step S31, the terminal may display the current speech recognition result in the current subtitle display sub-area in a variety of ways, which is not limited herein.
In one embodiment, if the current speech recognition result is a current character recognition result, the displaying the current speech recognition result in the current subtitle display sub-area in step S31 may include:
when the number of the characters corresponding to the current character recognition result is larger than the number of the target characters which can be contained in the current subtitle display sub-area, displaying omission information and partial characters in the current character recognition result in the current subtitle display sub-area;
the number of the characters corresponding to the partial characters is equal to the number of the target characters, and the omission information is used for representing the characters except the partial characters in the current character recognition result.
Optionally, when the number of the characters corresponding to the current character recognition result is greater than the number of the target characters that can be accommodated by the current subtitle display sub-region, it indicates that the current subtitle display sub-region cannot display all the characters in the current character recognition result, and then part of the characters in the current character recognition result can be displayed, and the number of the characters corresponding to the part of the characters is equal to the number of the target characters that can be accommodated by the current subtitle display sub-region.
In an embodiment, when the number of characters corresponding to the current character recognition result is less than or equal to the number of target characters that can be accommodated in the current subtitle display sub-region, the terminal may directly display the current character recognition result in the current subtitle display sub-region.
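The truncation behavior described above can be sketched as follows. The names are hypothetical, and `omission` stands in for the omission information (here a plain ellipsis):

```python
def render_subtitle(result, target_chars, omission="..."):
    """Return the text shown in the current subtitle display sub-region.

    If the character recognition result exceeds the number of target
    characters the sub-region can accommodate, show the partial characters
    (exactly target_chars of them) plus the omission information;
    otherwise show the result directly."""
    if len(result) <= target_chars:
        return result
    return result[:target_chars] + omission

render_subtitle("hello world", 20)   # fits: displayed as-is
render_subtitle("hello world", 5)    # too long: partial characters + omission info
```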
Optionally, in the step S33, the terminal may display the current recognition progress information in the masking layer within a fourth preset range of the position of the current character recognition result in a plurality of ways, which is not specifically limited herein.
In one embodiment, the step S33 may include: moving the recognized progress information from a second preset range of the position of the recognized voice recognition result to a fourth preset range of the position of the current character recognition result for displaying; and updating the moved identified progress information into the current identified progress information.
In this embodiment, since the recognized progress information is already displayed before the current recognition progress is displayed, when the current recognition progress information is displayed, the recognized progress information may be moved from the second preset range of the position where the recognized speech recognition result is located to the fourth preset range of the position where the current character recognition result is located for displaying, and the moved recognized progress information becomes the current recognition progress information. Taking the second preset range as the rear area of the position of the recognized voice recognition result and the fourth preset range as the rear area of the position of the current character recognition result as an example, the terminal can move the recognized progress information from the rear area of the position of the recognized voice recognition result to the rear area of the position of the current character recognition result for displaying. In this way, the change of the recognition progress information of the multimedia information to be recognized is displayed to the terminal account through the dynamic movement of the recognition progress information, which further improves the interaction convenience and the interaction experience between the terminal account, the multimedia information to be recognized and the multimedia information editing software.
In an optional embodiment, in the step S23, the scrolling and displaying the recognition progress information of the multimedia information to be recognized in the multimedia information recognition page in response to the trigger operation based on the voice recognition control may include:
and in response to the triggering operation based on the voice recognition control, displaying the voice recognition progress information in the mask layer, and enabling the recognition progress information to scroll along a preset direction in the mask layer.
In this embodiment, after the terminal account triggers the voice recognition control, the terminal displays the voice recognition progress information in the mask layer, and causes the recognition progress information to scroll along a preset direction in the mask layer. The preset direction may be any direction, for example, a direction from left to right in the mask layer. Continuing with fig. 4, after the terminal account triggers the voice recognition control, the voice recognition progress information may be displayed on the leftmost side of the mask layer, and during the recognition of the current information segment, the recognition progress information is caused to scroll in the left-to-right direction within the mask layer, for example, to the position shown in fig. 5, and then from the position shown in fig. 5 to the position shown in fig. 7. The terminal can display the current voice recognition result corresponding to the current information segment in real time in the target area over which the recognition progress information has scrolled; for example, in the process of scrolling from fig. 4 to fig. 5, the target area is the recognized subtitle display sub-area, and in the process of scrolling from fig. 5 to fig. 7, the target area is the current subtitle display sub-area, and then the current voice recognition result is displayed in the current subtitle display sub-area.
Therefore, the recognition progress is scrolled in real time in the recognition process of the current information segment, the current voice recognition result corresponding to the current information segment is displayed in real time, namely the recognition progress information is displayed while recognition is carried out, the recognition progress information can be further, conveniently and accurately displayed for the terminal account, and the interaction convenience between the terminal account and the multimedia information to be recognized and the multimedia information editing software is further improved.
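The left-to-right scrolling of the recognition progress information within the mask layer can be sketched as a proportional mapping from the recognized duration to a horizontal position. This is a simplified model under assumed names; the pixel width and the linear mapping are illustrative assumptions, not part of the disclosed embodiment.

```python
def progress_position(recognized_s, total_s, mask_width_px):
    """Horizontal position of the recognition progress axis inside the
    mask layer: the leftmost edge corresponds to 0 s and the rightmost
    edge to the total playing duration."""
    return mask_width_px * recognized_s / total_s

# Assume a mask layer 400 px wide and a total playing duration of 20 s.
progress_position(0, 20, 400)    # 0.0   -> leftmost side (as in fig. 4)
progress_position(5, 20, 400)    # 100.0 -> after information segment 1 (fig. 5)
progress_position(10, 20, 400)   # 200.0 -> after information segment 2 (fig. 7)
```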
In an optional embodiment, the method may further include:
and hiding the current identification progress information in response to the identification finishing instruction of the multimedia information to be identified.
In this embodiment, the terminal may establish a connection with the server in a polling manner, that is, the terminal periodically sends an inquiry to the server to ask whether the server needs a terminal service, and if so, the terminal establishes a connection with the server; alternatively, the terminal may establish a long connection with the server. After the server finishes the voice recognition operation on the last information segment and sends the voice recognition result corresponding to the last information segment to the terminal through this connection, the server can send a recognition end instruction indicating that the recognition of the multimedia information is ended to the terminal through this connection, and the terminal hides the current recognition progress information in response to the recognition end instruction. Thus, the hiding of the current recognition progress information prompts the terminal account that the voice recognition of the multimedia information to be recognized has ended, which further improves the interaction convenience and interaction experience between the terminal account and the multimedia information to be recognized.
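The polling connection described in this embodiment can be sketched as follows. This is a simplified illustration with a stubbed server; all names and the response shape are hypothetical.

```python
import time

def poll_recognition(query_server, poll_interval_s=0.0):
    """Periodically query the server until it sends the recognition end
    instruction; collect the speech recognition result returned for each
    information segment along the way."""
    results = []
    while True:
        new_results, recognition_ended = query_server()
        results.extend(new_results)
        if recognition_ended:
            return results  # the terminal then hides the progress information
        time.sleep(poll_interval_s)

# Stubbed server: returns one result per poll, ends after the second segment.
responses = iter([(["result 1"], False), (["result 2"], True)])
collected = poll_recognition(lambda: next(responses))
```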
Fig. 8 is a fourth schematic view illustrating a multimedia information recognition page according to an exemplary embodiment, as shown in fig. 8, when the server sends a recognition end command indicating that the multimedia information recognition is ended to the terminal, the terminal hides the current recognition progress information and the mask layer in response to the recognition end command.
The last information segment is an information segment with the latest playing time sequence among the plurality of information segments, and for example, the last information segment is the information segment 4.
In a possible embodiment, the multimedia information identification page includes an identification display area, and the method may further include:
and responding to a recognition ending instruction of the multimedia information to be recognized, and displaying a voice recognition result corresponding to each information segment in the recognition display area based on a preset sequence.
In this embodiment, after the server completes the voice recognition operation on the last information segment and sends the speech recognition result corresponding to the last information segment to the terminal, the server may send a recognition end instruction indicating that the recognition of the multimedia information is ended to the terminal, and the terminal displays the speech recognition result corresponding to each information segment in the recognition display area based on the preset sequence in response to the recognition end instruction. Continuing as shown in fig. 8, the preset sequence is information segment 1, information segment 2, information segment 3 and information segment 4, and in the recognition display area, the voice recognition result 1 corresponding to the information segment 1, the voice recognition result 2 corresponding to the information segment 2, the voice recognition result 3 corresponding to the information segment 3 and the voice recognition result 4 corresponding to the information segment 4 are sequentially displayed from top to bottom. Thus, after the recognition of the multimedia information to be recognized is finished, the voice recognition result corresponding to each information segment is displayed in one recognition display area, which improves the convenience of the terminal account in viewing the voice recognition result corresponding to each information segment, and further improves the interaction convenience and the interaction experience of the terminal account and the multimedia information to be recognized.
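Displaying the voice recognition result corresponding to each information segment in the preset sequence can be sketched as sorting the segments by their playing time intervals. The data layout and names are hypothetical:

```python
# Segments may arrive in any order; the preset sequence follows playing time.
segments = [
    {"name": "information segment 2", "interval": (5, 10),  "result": "voice recognition result 2"},
    {"name": "information segment 1", "interval": (0, 5),   "result": "voice recognition result 1"},
    {"name": "information segment 4", "interval": (15, 20), "result": "voice recognition result 4"},
    {"name": "information segment 3", "interval": (10, 15), "result": "voice recognition result 3"},
]

# Preset sequence: information segment 1 -> 2 -> 3 -> 4, by interval start time.
ordered = sorted(segments, key=lambda seg: seg["interval"][0])
top_to_bottom = [seg["result"] for seg in ordered]
```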
Optionally, the embodiment of the present disclosure may display, in the recognition display area, the voice recognition result corresponding to each of the information segments based on the preset sequence in a plurality of ways, which is not specifically limited.
In an embodiment, in a case that the plurality of information segments are obtained by the server dividing the multimedia information to be recognized based on a preset duration and a playing time sequence of the multimedia information to be recognized, the recognition display area includes a recognition display sub-area corresponding to each information segment, the recognition display sub-area corresponding to each information segment includes a text area and a non-text area, and the displaying, in the recognition display area, the voice recognition result corresponding to each information segment based on the preset sequence may include:
displaying a voice recognition result corresponding to each information segment in a text area of the recognition display sub-area corresponding to each information segment based on the preset sequence, and displaying playing time corresponding to each information segment and a time adjusting control corresponding to each information segment in a non-text area of the recognition display sub-area corresponding to each information segment; and the time adjusting control corresponding to each information segment is used for adjusting the playing time corresponding to each information segment.
In this embodiment, the identification display area may be divided into a plurality of identification display sub-areas, the number of the plurality of identification display sub-areas is equal to the number of the at least two information segments, the identification display sub-area corresponding to each information segment includes a text area and a non-text area, and the text area and the non-text area may be disposed at any position in the identification display sub-areas. The terminal may display a voice recognition result corresponding to each information segment in a text region of the recognition display sub-region corresponding to each information segment based on a preset sequence, and display a play time corresponding to each information segment in a non-text region of the recognition display sub-region corresponding to each information segment. The playing time may be a starting time and an ending time of a playing time interval corresponding to each information segment. In addition, a time adjustment control corresponding to each information fragment can be displayed; and the time adjusting control corresponding to each information segment is used for adjusting the playing time corresponding to each information segment.
Fig. 9 is a schematic diagram of a multimedia information recognition page according to an exemplary embodiment. As shown in fig. 9, the preset sequence is information segment 1 - information segment 2 - information segment 3 - information segment 4. In the recognition display area, in a top-to-bottom order, the voice recognition result 1 corresponding to the information segment 1 is displayed in the text area of the recognition display sub-area corresponding to the information segment 1, and the playing time (the left being the start time and the right being the end time) and the time adjustment control corresponding to the information segment 1 are displayed in the non-text area of the recognition display sub-area corresponding to the information segment 1; the voice recognition result 2 corresponding to the information segment 2 is displayed in the text area of the recognition display sub-area corresponding to the information segment 2, and the playing time and the time adjustment control corresponding to the information segment 2 are displayed in the non-text area of the recognition display sub-area corresponding to the information segment 2; and so on. Thus, after the recognition of the multimedia information to be recognized is finished, the voice recognition result, the playing time and the time adjustment control corresponding to each information segment are displayed, which improves the convenience of the terminal account in viewing and adjusting the voice recognition results, and further improves the interaction convenience and interaction experience between the terminal account and the multimedia information to be recognized.
In an exemplary embodiment, in a case that the plurality of information segments are obtained by dividing the multimedia information to be recognized based on the voice volume, the recognition display area may include a recognition display sub-area corresponding to each of the information segments, and the recognition display sub-area corresponding to each of the information segments includes a text area and a non-text area, where the non-text area is used for displaying the voice volume corresponding to each of the information segments, for example, a start voice volume and an end voice volume corresponding to each of the information segments. In addition, a voice volume adjustment control corresponding to each information segment can be displayed; the voice volume adjustment control corresponding to each information segment is used for adjusting the voice volume corresponding to that information segment.
In an optional embodiment, in step S17 or step S25, displaying a current speech recognition result corresponding to the current information segment in the target area in the multimedia recognition page, may further include:
and displaying a current voice recognition result corresponding to the current information fragment in a target area in the multimedia recognition page, and displaying the current voice recognition result in the recognition display area.
In the embodiment, the current voice recognition result corresponding to the current information fragment can be displayed in the target area, and simultaneously the current voice recognition result is displayed in the recognition display area in real time, so that the current voice recognition result is displayed in multiple areas in real time, the flexibility and convenience of displaying the current voice recognition result are improved, and the interaction convenience between the terminal account and the multimedia information to be recognized and the multimedia information editing software is further improved.
It should be noted that, the manner of displaying the current speech recognition result in the recognition display area is the same as that in fig. 8 and 9, and is not repeated herein.
In an optional embodiment, the method may further include:
in response to the time adjustment operation triggered by the time adjustment control corresponding to the target information segment, updating the voice recognition result displayed in the text area corresponding to the target information segment into the voice recognition result of the information segment corresponding to the adjusted playing time;
the target information fragment is any one of at least two information fragments.
In this embodiment, the playing time corresponding to each information segment and the time adjustment control corresponding to each information segment are displayed in the non-text region of the recognition display sub-region corresponding to that information segment, and the time adjustment control corresponding to each information segment is used to adjust the playing time corresponding to that information segment. When the sequence of the voice recognition results corresponding to a certain target information segment needs to be adjusted, the terminal account can operate the time adjustment control corresponding to the target information segment, so as to trigger a time adjustment operation; in response to the time adjustment operation, the terminal records the adjusted playing time, obtains the voice recognition result of the information segment corresponding to the adjusted playing time, and updates the text region corresponding to the target information segment with the voice recognition result of the information segment corresponding to the adjusted playing time. This improves the convenience of adjusting the display sequence of the voice recognition results corresponding to the information segments, and further improves the interaction convenience and interactive experience between the terminal account and the multimedia information to be recognized.
For example, if the target information segment is the information segment 1 and the corresponding playing time interval is 0-5s, the terminal account adjusts the time adjustment control corresponding to the information segment 1, adjusts 0-5s to 5-10s and 5-10s to the playing time interval corresponding to the information segment 2, the terminal obtains the speech recognition result of the information segment 2 and updates the speech recognition result displayed in the text region corresponding to the information segment 1 to the speech recognition result corresponding to the information segment 2.
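The update triggered by the time adjustment operation can be sketched as follows, using the example above where information segment 1's playing time interval is adjusted from 0-5 s to 5-10 s. The function and field names are hypothetical:

```python
def apply_time_adjustment(segments, target_name, adjusted_interval):
    """Return the voice recognition result that the text region of the
    target information segment should display after the adjustment: the
    result of the information segment whose playing time interval matches
    the adjusted playing time."""
    for seg in segments:
        if seg["interval"] == adjusted_interval:
            return seg["result"]
    # No segment matches the adjusted time: keep the current display.
    for seg in segments:
        if seg["name"] == target_name:
            return seg["result"]

segments = [
    {"name": "information segment 1", "interval": (0, 5),  "result": "voice recognition result 1"},
    {"name": "information segment 2", "interval": (5, 10), "result": "voice recognition result 2"},
]
# Adjust information segment 1 from 0-5 s to 5-10 s -> display result 2.
updated = apply_time_adjustment(segments, "information segment 1", (5, 10))
```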
In an optional embodiment, a text adjustment control corresponding to each information segment may be further displayed in a non-text region of the identification display sub-region corresponding to each information segment, where the text adjustment control is used to adjust a speech recognition result displayed in the text region of the identification display sub-region corresponding to each information segment. Illustratively, the adjustments include, but are not limited to: deleting, editing, downloading, sharing, storing and skipping to the corresponding information segment in the multimedia information to be identified.
In a case that the text adjustment control is used for deleting the voice recognition result, the terminal deletes the corresponding voice recognition result in response to a trigger operation of the terminal account based on the text adjustment control.
In a case that the text adjustment control is used for editing the voice recognition result, the terminal edits the corresponding voice recognition result in response to a trigger operation of the terminal account based on the text adjustment control.
In a case that the text adjustment control is used for downloading the voice recognition result, the terminal downloads the corresponding voice recognition result in response to a trigger operation of the terminal account based on the text adjustment control.
In a case that the text adjustment control is used for sharing the voice recognition result, the terminal shares the corresponding voice recognition result with other terminal accounts in response to a trigger operation of the terminal account based on the text adjustment control.
In a case that the text adjustment control is used for saving the voice recognition result, the terminal saves the corresponding voice recognition result in response to a trigger operation of the terminal account based on the text adjustment control.
In a case that the text adjustment control is used for jumping to the corresponding position in the multimedia information to be recognized, the terminal, in response to a trigger operation of the terminal account based on the text adjustment control, jumps to the information segment corresponding to the voice recognition result in the multimedia information to be recognized and displays the information segment.
In an optional embodiment, in a case that the number of the at least two information segments satisfies a preset condition, the displaying, in the recognition display area, the speech recognition result corresponding to each information segment based on a preset order includes:
and based on the preset sequence, displaying the voice recognition result corresponding to each information segment in a scrolling manner in the recognition display area.
In this embodiment, if the number of the at least two information segments meets the preset condition, that is, it exceeds the maximum number of information segments that the recognition display area can accommodate, the terminal can scroll-display the speech recognition result corresponding to each information segment in the recognition display area based on the preset sequence. This not only ensures that every speech recognition result can be displayed in the recognition display area, but also improves the tidiness and visual appeal of the display, thereby further improving the convenience and experience of interaction between the terminal account and the multimedia information to be recognized.
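The scroll-versus-static decision described here is a simple capacity check; a hedged sketch (the threshold semantics are inferred from the paragraph above):

```python
def display_mode(num_segments, max_visible):
    """Scroll the per-segment results only when they exceed the maximum
    number of segments the recognition display area can accommodate."""
    return "scroll" if num_segments > max_visible else "static"
```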
In an optional embodiment, the recognition display area includes a delete control, and the method further includes:
in response to a trigger operation based on the delete control, deleting the voice recognition result corresponding to each information segment displayed in the recognition display area.
In this embodiment, the recognition display area may further include a delete control for deleting the speech recognition results. The delete control may be arranged at any position in the recognition display area, which is not specifically limited in the present disclosure. As further shown in fig. 7 and 8, the delete control is disposed in the area below the recognition display sub-area corresponding to each information segment. In response to an operation triggered by the terminal account on the delete control, the terminal can delete the voice recognition result corresponding to each information segment displayed in the recognition display area, making it convenient for the terminal account to edit the voice recognition results and further improving the convenience and experience of interaction between the terminal account and the multimedia information to be recognized.
In an optional embodiment, the recognition display area includes an add control, and the method further includes:
in response to a trigger operation based on the add control, displaying a blank display sub-area in the recognition display area.
In this embodiment, the recognition display area may further include an add control, where the add control is used to add a blank display sub-area in the recognition display area. The add control may be set at any position of the recognition display area, which is not specifically limited in this disclosure. As further shown in fig. 7 and fig. 8, the add control may be disposed in the area below the recognition display sub-area corresponding to each information segment. In response to an operation triggered by the terminal account on the add control, the terminal can add a blank display sub-area in the recognition display area, making it convenient for the terminal account to edit the voice recognition results and manage the recognition display area, and further improving the convenience and experience of interaction between the terminal account and the multimedia information to be recognized.
In an optional embodiment, in a case that the multimedia information to be recognized is a video, the multimedia information recognition page may further include a multimedia information playing area and a multimedia information display area. The multimedia information playing area is used to play the multimedia information to be recognized during its recognition. The multimedia information display area is used to display the video frames of the multimedia information to be recognized in real time during recognition. An information preview axis perpendicular to the subtitle display area, the time display area, and the multimedia information display area may be set; the information preview axis remains static while the video frames of the multimedia information to be recognized move through the multimedia information display area in a preset direction, for example from right to left. In the multimedia information display area, the video frame intersecting the information preview axis is the frame being played in real time in the multimedia information playing area.
It should be noted that the multimedia information playing area and the multimedia information displaying area may be disposed at any position in the multimedia information identification page, which is not specifically limited in this disclosure. As an example, the multimedia information playing area may be disposed in a right area of the above-described recognition display area, the multimedia information display area may be disposed in a lower area of the subtitle display area, and the like.
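Because the preview axis stays fixed while the frame strip moves, the frame being played is simply the one whose horizontal span currently contains the axis. A minimal sketch, assuming each frame is described by a pixel span in the display area (this data layout is an assumption, not specified by the disclosure):

```python
def frame_at_axis(frame_spans, axis_x):
    """Return the index of the video frame whose [left, right) span in the
    multimedia information display area intersects the preview axis."""
    for i, (left, right) in enumerate(frame_spans):
        if left <= axis_x < right:
            return i
    return None  # axis currently between frames or out of range
```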
The speech recognition method provided by the embodiment of the present disclosure is described below with a server as an execution subject. The speech recognition method may include:
receiving a voice recognition request carrying multimedia information to be recognized and sent by a terminal; the voice recognition request is sent by the terminal in response to the triggering operation of the voice recognition control in the multimedia information recognition page.
And responding to the voice recognition request, and dividing the multimedia information to be recognized into at least two information segments.
And sequentially determining any unidentified information fragment in the at least two information fragments as the current information fragment based on a preset sequence.
And carrying out voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment.
Sending the current voice recognition result to a terminal; the terminal is used for displaying the current voice recognition result in the target area in the multimedia information recognition page.
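The server-side flow above — divide by a preset duration, recognize segments in playback order, push each result back to the terminal — can be sketched as follows. The `recognize_fn` and `send_fn` callables stand in for the real recognizer and transport, which the disclosure does not specify:

```python
import math

def split_into_segments(total_duration_s, segment_len_s):
    """Divide the total playing duration into consecutive segments of a
    preset length; the last segment may be shorter."""
    count = math.ceil(total_duration_s / segment_len_s)
    return [(i * segment_len_s, min((i + 1) * segment_len_s, total_duration_s))
            for i in range(count)]

def recognize_sequentially(segments, recognize_fn, send_fn):
    """Take each unrecognized segment as the current segment in preset
    (playback) order, recognize it, and send the result to the terminal."""
    for index, (start, end) in enumerate(segments):
        result = recognize_fn(start, end)   # speech recognition on one segment
        send_fn(index, result)              # return current result to the terminal
```

Sending each result as it is produced is what lets the terminal display the current speech recognition result while later segments are still being recognized.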
FIG. 10 is a block diagram of a speech recognition apparatus according to an exemplary embodiment. Referring to fig. 10, the apparatus includes:
a first identification page display module 41 configured to perform displaying a multimedia information identification page; the multimedia information identification page comprises a voice identification control;
a first trigger operation response module 43 configured to, in response to a trigger operation based on the voice recognition control, send a voice recognition request carrying multimedia information to be recognized to the server; the server is used for responding to the voice recognition request and dividing the multimedia information to be recognized into at least two information segments; sequentially determining any unrecognized information segment of the at least two information segments as the current information segment based on a preset sequence; and performing voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment;
a recognition result receiving module 45 configured to receive a current speech recognition result corresponding to the current information segment returned by the server;
and the first recognition result display module 47 is configured to execute displaying a current speech recognition result corresponding to the current information segment in the target area in the multimedia information recognition page.
Fig. 11 is a block diagram of a speech recognition apparatus shown in accordance with an exemplary embodiment. Referring to fig. 11, the apparatus includes:
a second identification page display module 51 configured to perform displaying the multimedia information identification page; the multimedia information recognition page comprises a voice recognition control, and the voice recognition control is used for triggering voice recognition of multimedia information to be recognized;
a second trigger operation response module 53 configured to, in response to a trigger operation based on the voice recognition control, scroll-display the recognition progress information of the multimedia information to be recognized in the multimedia information recognition page;
a second recognition result display module 55 configured to display a current speech recognition result corresponding to the current information segment in the target area of the multimedia recognition page;
the target area is an area in which the identification progress information moves at the current time, and the current time is the identification time of the current information fragment; the multimedia information to be identified comprises at least two information segments, and the current information segment is any unidentified information segment determined from the at least two information segments based on a preset sequence.
In an optional embodiment, a mask layer is disposed in the multimedia information recognition page, and the second trigger operation response module is configured to, in response to a trigger operation based on the voice recognition control, display the recognition progress information in the mask layer and scroll the recognition progress information in the mask layer along a preset direction.
In an optional embodiment, the multimedia information recognition page includes a recognition display area, and the second recognition result display module is configured to display a current speech recognition result corresponding to a current information segment in a target area of the multimedia recognition page, and display the current speech recognition result in the recognition display area.
In an optional embodiment, the at least two information clips are obtained by dividing the multimedia information to be recognized based on a preset duration and a playing time sequence of the multimedia information to be recognized, the multimedia information recognition page includes a subtitle display area, a time display area and a recognition display area, the first recognition page display module or the second recognition page display module is configured to display a time axis in the time display area, display mask layers on the subtitle display area and the time display area, and display the voice recognition control in the recognition display area;
the time axis comprises at least two playing time intervals, the at least two playing time intervals are obtained by dividing the total playing time of the multimedia information to be identified based on the preset time, each playing time interval corresponds to each information clip, and the preset sequence is determined based on the playing time sequence of each information clip; the mask layer is used for displaying identification progress information of the multimedia information to be identified, and the identification progress information is used for representing the identified duration in the total playing duration.
In an alternative embodiment, the subtitle display area includes a recognized subtitle display sub-area, and the apparatus further includes:
a recognized progress display module configured to display a recognized speech recognition result corresponding to the recognized information fragment in the recognized subtitle display sub-area, and to display recognized progress information within a second preset range of the position of the recognized speech recognition result within the mask layer;
the identified subtitle display sub-region is located in a first preset range of the position of the playing time interval corresponding to the identified information segment, and the playing time sequence of the identified information segment is earlier than that of the current information segment; the identified progress information is used for representing the identified duration which is identified in the total playing duration, the identified progress information is arranged perpendicular to the subtitle display area and the time display area, and an intersection point of the identified progress information and the time axis represents the identified duration.
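Since the intersection of the vertical progress line with the time axis encodes the recognized duration, mapping a duration to an axis x-coordinate is a linear interpolation. A sketch under assumed pixel parameters (the coordinate names are illustrative):

```python
def progress_x(recognized_s, total_s, axis_left_px, axis_width_px):
    """x-coordinate at which the vertical recognized-progress line should
    intersect the time axis for a given recognized duration."""
    frac = min(max(recognized_s / total_s, 0.0), 1.0)  # clamp to [0, 1]
    return axis_left_px + frac * axis_width_px
```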
In an optional embodiment, the subtitle display area further includes a current subtitle display sub-area representing the target area, the current subtitle display sub-area is located within a third preset range of a position of the identified subtitle display sub-area, and the first identification result display module or the second identification result display module includes:
a current voice recognition result display unit configured to perform displaying of the current voice recognition result in the current subtitle display sub-area;
a current recognition progress information display unit configured to display current recognition progress information within a fourth preset range of the position of the current speech recognition result within the mask layer; the current recognition progress information is used for representing the current recognized duration in the total playing duration of the multimedia information to be recognized, the current recognition progress information is arranged perpendicular to the subtitle display area and the time display area, and the intersection point of the current recognition progress information and the time axis represents the current duration.
In an optional embodiment, the current speech recognition result is a current text recognition result, and the current speech recognition result display unit is configured to display omission information and partial text of the current text recognition result in the current subtitle display sub-area when the number of characters corresponding to the current text recognition result is greater than the target number of characters that the current subtitle display sub-area can accommodate;
the number of characters corresponding to the partial characters is equal to the target number of characters, and the omission information is used for representing characters except for the partial characters in the current character recognition result.
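The truncation rule in this embodiment keeps exactly the target number of characters and appends omission information for the rest; a minimal sketch (the ellipsis marker is an assumption — the disclosure only says "omission information"):

```python
def fit_subtitle(text, max_chars, omission="..."):
    """Show the full recognition result if it fits the current subtitle
    display sub-area; otherwise show max_chars characters plus omission info."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + omission
```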
In an optional embodiment, the current recognition progress information display unit is configured to move the recognized progress information from the second preset range of the position of the recognized speech recognition result to the fourth preset range of the position of the current text recognition result for display;
wherein the moved recognized progress information is updated to the current recognized progress information.
In an optional embodiment, the multimedia information identification page includes an identification display area, and the apparatus further includes:
and the recognition ending instruction response module is configured to execute a recognition ending instruction responding to the multimedia information to be recognized, and display a voice recognition result corresponding to each information segment in the recognition display area based on the preset sequence.
In an optional embodiment, the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset time length and a playing time sequence of the multimedia information to be recognized, the recognition display area includes a recognition display sub-area corresponding to each information segment, the recognition display sub-area corresponding to each information segment includes a text area and a non-text area, and the recognition end instruction response module includes:
a result time display unit configured to display a voice recognition result corresponding to each of the information segments in a text area of the recognition display sub-area corresponding to each of the information segments based on the preset sequence, and display a playing time corresponding to each of the information segments and a time adjustment control corresponding to each of the information segments in a non-text area of the recognition display sub-area corresponding to each of the information segments;
and the time adjusting control corresponding to each information segment is used for adjusting the playing time corresponding to each information segment.
In an optional embodiment, the apparatus further comprises:
the time adjustment operation response module is configured to execute time adjustment operation triggered by a time adjustment control corresponding to a target information fragment, and update a voice recognition result displayed in a text area corresponding to the target information fragment into a voice recognition result of an information fragment corresponding to the adjusted playing time;
wherein, the target information segment is any one of the at least two information segments.
The embodiment of the present disclosure further provides a speech recognition apparatus, which includes:
a voice recognition request receiving module configured to receive a voice recognition request carrying multimedia information to be recognized sent by a terminal; the voice recognition request is sent by the terminal in response to a trigger operation based on a voice recognition control in a displayed multimedia information recognition page;
the dividing module is configured to execute the step of dividing the multimedia information to be recognized into at least two information segments in response to the voice recognition request;
a current information segment determining module configured to perform a sequential determination of any one unrecognized information segment of the at least two information segments as a current information segment based on a preset order;
the recognition module is configured to perform voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment;
a recognition result sending module configured to execute sending the current speech recognition result to a terminal; the terminal is used for displaying the current voice recognition result in the target area of the multimedia information recognition page.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
In an exemplary embodiment, there is also provided an electronic device for speech recognition, comprising a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the steps of any of the speech recognition methods described in the embodiments above when executing the instructions stored in the memory.
The electronic device may be a terminal, a server, or a similar computing device. Taking the electronic device as a server as an example, fig. 12 is a block diagram of an electronic device for speech recognition according to an exemplary embodiment. The electronic device 60 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 61 (the CPU 61 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 63 for storing data, and one or more storage media 62 (e.g., one or more mass storage devices) for storing an application program 623 or data 622. The memory 63 and the storage medium 62 may be transient or persistent storage. The program stored on the storage medium 62 may include one or more modules, each of which may include a series of instructions operating on the electronic device. Further, the central processor 61 may be arranged to communicate with the storage medium 62 and execute a series of instruction operations from the storage medium 62 on the electronic device 60. The electronic device 60 may also include one or more power supplies 66, one or more wired or wireless network interfaces 65, one or more input/output interfaces 64, and/or one or more operating systems 621, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
The input/output interface 64 may be used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the electronic device 60. In one example, the input/output interface 64 includes a network adapter (NIC) that can connect to other network devices through a base station so as to communicate with the internet. In an exemplary embodiment, the input/output interface 64 may be a radio frequency (RF) module used to communicate with the internet wirelessly.
It will be understood by those skilled in the art that the structure shown in fig. 12 is only an illustration, and does not limit the structure of the electronic device. For example, electronic device 60 may also include more or fewer components than shown in FIG. 12, or have a different configuration than shown in FIG. 12.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the steps of any of the speech recognition methods of the above embodiments.
In an exemplary embodiment, a computer program product is also provided, which comprises a computer program that, when being executed by a processor, implements the speech recognition method provided in any one of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided by the present disclosure may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (15)
1. A speech recognition method, comprising:
displaying a multimedia information identification page; the multimedia information recognition page comprises a voice recognition control;
responding to a trigger operation based on the voice recognition control, and sending a voice recognition request carrying multimedia information to be recognized to a server; the server is used for responding to the voice recognition request and dividing the multimedia information to be recognized into at least two information segments; sequentially determining any unrecognized information segment of the at least two information segments as a current information segment based on a preset sequence; and performing voice recognition on the current information segment to obtain a current voice recognition result corresponding to the current information segment;
receiving a current voice recognition result corresponding to the current information fragment returned by the server;
and displaying a current voice recognition result corresponding to the current information segment in a target area in the multimedia information recognition page.
2. A speech recognition method, comprising:
displaying a multimedia information identification page; the multimedia information recognition page comprises a voice recognition control, and the voice recognition control is used for triggering voice recognition of multimedia information to be recognized;
responding to the triggering operation based on the voice recognition control, and displaying the recognition progress information of the multimedia information to be recognized in a scrolling manner in the multimedia information recognition page;
displaying a current voice recognition result corresponding to a current information fragment in a target area in the multimedia recognition page;
the target area is an area where the identification progress information moves at the current time, and the current time is the identification time of the current information fragment; the multimedia information to be identified comprises at least two information segments, and the current information segment is any unidentified information segment determined from the at least two information segments based on a preset sequence.
3. The voice recognition method of claim 2, wherein a mask layer is disposed in the multimedia information recognition page, and the scrolling displaying the recognition progress information of the multimedia information to be recognized in the multimedia information recognition page in response to the triggering operation based on the voice recognition control comprises:
and responding to the triggering operation based on the voice recognition control, displaying the recognition progress information in the mask layer, and scrolling the recognition progress information in the mask layer along a preset direction.
4. The speech recognition method according to claim 2, wherein the multimedia information recognition page comprises a recognition display area, and the displaying of the current speech recognition result corresponding to the current information segment in the target area of the multimedia recognition page comprises:
and displaying a current voice recognition result corresponding to the current information fragment in a target area in the multimedia recognition page, and displaying the current voice recognition result in the recognition display area.
5. The speech recognition method according to claim 1 or 2, wherein the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset duration and a playing time sequence of the multimedia information to be recognized, the multimedia information recognition page includes a subtitle display area, a time display area, and a recognition display area, and the displaying the multimedia information recognition page includes:
displaying a time axis in the time display area, displaying a mask layer on the subtitle display area and the time display area, and displaying the voice recognition control in the recognition display area;
the time axis comprises at least two playing time intervals, the at least two playing time intervals are obtained by dividing the total playing time of the multimedia information to be identified based on the preset time, each playing time interval corresponds to each information clip, and the preset sequence is determined based on the playing time sequence of each information clip; the mask layer is used for displaying identification progress information of the multimedia information to be identified, and the identification progress information is used for representing the identified duration in the total playing duration.
6. The speech recognition method of claim 5, wherein the subtitle display area comprises a recognized subtitle display sub-area, and before the displaying of the current speech recognition result corresponding to the current information segment, the method further comprises:
displaying a recognized voice recognition result corresponding to the recognized information fragment in the recognized subtitle display sub-area, and displaying recognized progress information in a second preset range of the position of the recognized voice recognition result in the mask layer;
the identified subtitle display sub-region is located in a first preset range of the position of a playing time interval corresponding to the identified information segment, and the playing time sequence of the identified information segment is earlier than that of the current information segment; the identified progress information is used for representing the identified duration in the total playing duration, the identified progress information is vertically arranged with the subtitle display area and the time display area, and the intersection point of the identified progress information and the time axis represents the identified duration.
7. The speech recognition method according to claim 6, wherein the subtitle display area further includes a current subtitle display sub-area representing the target area, the current subtitle display sub-area is located within a third preset range of the position of the recognized subtitle display sub-area, and the displaying of the current speech recognition result corresponding to the current information segment includes:
displaying the current voice recognition result in the current subtitle display sub-area;
displaying current recognition progress information in a fourth preset range of the position of the current voice recognition result in the mask layer; the current identification progress information is used for representing the identified current time length in the total playing time length of the multimedia information to be identified, the current identification progress information is vertically arranged with the subtitle display area and the time display area, and the intersection point of the current identification progress information and the time axis represents the current time length.
8. The speech recognition method of claim 7, wherein the current speech recognition result is a current text recognition result, and the displaying the current speech recognition result in the current subtitle display sub-area comprises:
when the number of characters in the current text recognition result is greater than the target number of characters that the current subtitle display sub-area can accommodate, displaying omission information and partial characters of the current text recognition result in the current subtitle display sub-area;
wherein the number of the partial characters equals the target number of characters, and the omission information stands in for the characters of the current text recognition result other than the partial characters.
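The truncation behavior of claim 8 can be sketched as below. This is a simplified assumption-laden illustration: the function name is hypothetical, the omission information is rendered as a single ellipsis character, and it is appended after the partial characters (the claim does not fix the ordering):

```python
def fit_subtitle(text: str, capacity: int, ellipsis: str = "…") -> str:
    """Fit a text recognition result into a fixed-capacity subtitle sub-area.

    If the text exceeds the target number of characters the sub-area can
    accommodate, display only `capacity` characters plus omission
    information representing the characters that were cut off.
    """
    if len(text) <= capacity:
        return text
    return text[:capacity] + ellipsis
```

For example, a ten-character result shown in a five-character sub-area keeps its first five characters and an ellipsis standing in for the rest.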
9. The speech recognition method of claim 7, wherein the displaying the current recognition progress information within a fourth preset range of the position of the current speech recognition result in the mask layer comprises:
moving the recognized progress information from a second preset range of the position of the recognized speech recognition result to a fourth preset range of the position of the current text recognition result for display;
wherein the moved recognized progress information is updated to become the current recognition progress information.
10. The speech recognition method according to claim 1 or 2, wherein the multimedia information recognition page includes a recognition display area, and the method further comprises:
in response to a recognition-end instruction for the multimedia information to be recognized, displaying the speech recognition result corresponding to each information segment in the recognition display area in the preset order.
11. The speech recognition method according to claim 10, wherein the at least two information segments are obtained by dividing the multimedia information to be recognized based on a preset duration and the playing order of the multimedia information to be recognized, the recognition display area includes a recognition display sub-area for each information segment, each recognition display sub-area includes a text area and a non-text area, and the displaying the speech recognition result corresponding to each information segment in the recognition display area in the preset order comprises:
displaying, in the preset order, the speech recognition result for each information segment in the text area of its recognition display sub-area, and displaying the playing time of each information segment together with a time adjustment control in the non-text area of its recognition display sub-area;
wherein the time adjustment control for each information segment is used to adjust the playing time of that information segment.
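The segmentation premise of claim 11 — dividing the multimedia into consecutive segments of a preset duration, in playing order — can be sketched as follows. A minimal sketch under stated assumptions: the function name is hypothetical, durations are in seconds, and segments are represented as (start, end) pairs:

```python
def split_into_segments(total_seconds: float, preset_seconds: float):
    """Divide multimedia of length total_seconds into consecutive
    (start, end) segments of at most preset_seconds, in playing order.

    The final segment may be shorter when the total duration is not an
    exact multiple of the preset duration.
    """
    if preset_seconds <= 0:
        raise ValueError("preset duration must be positive")
    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + preset_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments
```

Each resulting segment then maps naturally to one recognition display sub-area with its own playing time.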
12. The speech recognition method of claim 11, wherein the method further comprises:
in response to a time adjustment operation triggered via the time adjustment control of a target information segment, updating the speech recognition result displayed in the text area of the target information segment to the speech recognition result of the information segment corresponding to the adjusted playing time;
wherein the target information segment is any one of the at least two information segments.
13. A speech recognition apparatus, comprising:
a first recognition page display module configured to display a multimedia information recognition page, the multimedia information recognition page comprising a speech recognition control;
a first trigger operation response module configured to, in response to a trigger operation on the speech recognition control, send a speech recognition request carrying multimedia information to be recognized to a server; wherein the server is configured to: in response to the speech recognition request, divide the multimedia information to be recognized into at least two information segments; sequentially determine, in a preset order, any unrecognized one of the at least two information segments as the current information segment; and perform speech recognition on the current information segment to obtain a current speech recognition result corresponding to the current information segment;
a recognition result receiving module configured to receive the current speech recognition result corresponding to the current information segment returned by the server; and
a first recognition result display module configured to display the current speech recognition result corresponding to the current information segment in a target area of the multimedia information recognition page.
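The server-side flow implied by claim 13 — segments processed one at a time in a preset order, each unrecognized segment becoming the current segment in turn — can be sketched with a placeholder recognizer. All names here are hypothetical illustrations, not the claimed apparatus:

```python
def recognize_sequentially(segments, recognize, on_result):
    """Process information segments in the preset (playing) order.

    Each not-yet-recognized segment is treated as the current segment,
    passed to the speech recognizer, and its result is returned to the
    client via the on_result callback as soon as it is available.
    """
    for current in segments:           # preset order = playing order
        result = recognize(current)    # speech recognition on the current segment
        on_result(result)              # deliver the current recognition result
```

Delivering results per segment, rather than after the whole file, is what lets the client display partial subtitles while recognition is still running.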
14. A speech recognition apparatus, comprising:
a second recognition page display module configured to display a multimedia information recognition page, the multimedia information recognition page comprising a speech recognition control for triggering speech recognition of multimedia information to be recognized;
a second trigger operation response module configured to, in response to a trigger operation on the speech recognition control, scroll-display recognition progress information of the multimedia information to be recognized in the multimedia information recognition page; and
a second recognition result display module configured to display a current speech recognition result corresponding to a current information segment in a target area of the multimedia information recognition page;
wherein the target area is the area to which the recognition progress information has moved at the current time, the current time being the recognition time of the current information segment; the multimedia information to be recognized includes at least two information segments, and the current information segment is any unrecognized information segment determined from the at least two information segments in a preset order.
15. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the speech recognition method of any of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210938209.7A CN115499683B (en) | 2022-08-05 | 2022-08-05 | Voice recognition method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115499683A true CN115499683A (en) | 2022-12-20 |
CN115499683B CN115499683B (en) | 2024-10-29 |
Family
ID=84465832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210938209.7A Active CN115499683B (en) | 2022-08-05 | 2022-08-05 | Voice recognition method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115499683B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110808031A (en) * | 2019-11-22 | 2020-02-18 | 大众问问(北京)信息科技有限公司 | Voice recognition method and device and computer equipment |
CN114360545A (en) * | 2020-09-27 | 2022-04-15 | 阿里巴巴集团控股有限公司 | Voice recognition and audio/video processing method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN115499683B (en) | 2024-10-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108874539B (en) | Resource allocation method, device, terminal and storage medium | |
CN110765379B (en) | Method, device, computer equipment and storage medium for loading resource file | |
CN113268226A (en) | Page data generation method and device, storage medium and equipment | |
CN108282683B (en) | Video interface display method and device | |
CN110780939B (en) | Method, device, computer equipment and storage medium for loading resource file | |
CN111651617A (en) | Multimedia information sharing method, device, equipment and storage medium | |
EP4080507A1 (en) | Method and apparatus for editing object, electronic device and storage medium | |
CN112016023A (en) | Service processing method, device, terminal and storage medium | |
CN112035195A (en) | Application interface display method and device, electronic equipment and storage medium | |
CN113467907A (en) | Task processing method and device, electronic equipment and storage medium | |
US7844901B1 (en) | Circular timeline for video trimming | |
CN115220613A (en) | Event prompt processing method, device, equipment and medium | |
WO2022257698A1 (en) | Electronic map-based interaction method and apparatus, computer device, and storage medium | |
CN113422996B (en) | Subtitle information editing method, device and storage medium | |
CN114339444A (en) | Method, device and equipment for adjusting playing time of video frame and storage medium | |
CN115499683A (en) | Voice recognition method and device and electronic equipment | |
CN110941464B (en) | Light exposure method, device, system and storage medium | |
CN109032747B (en) | Data updating method and device | |
CN113791721A (en) | Picture processing method and device, electronic equipment and storage medium | |
CN112861041B (en) | Media content display method and device, electronic equipment and storage medium | |
CN110134884A (en) | A kind of control method and device showing guide to visitors information | |
CN115981539A (en) | Multimedia resource interaction method and device, electronic equipment and storage medium | |
CN115474098A (en) | Resource processing method and device, electronic equipment and storage medium | |
CN115658196A (en) | Page display method and device, electronic equipment and storage medium | |
CN115529497A (en) | Bullet screen playing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |