CN109857842B - Method and device for recognizing fault-reporting text

Info

Publication number: CN109857842B
Application number: CN201811574619.8A
Authority: CN
Inventors: 罗晓天
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2018-12-21
Filing date: 2018-12-21
Publication date: 2021-06-15
Anticipated expiration: 2038-12-21
Also published as: CN109857842A

Abstract

The application discloses a method and a device for identifying fault-reporting texts, wherein a plurality of text data are spliced into a spliced character string, preset keywords are searched in the spliced character string, the hit positions of the preset keywords in the spliced character string are determined, so that the text data comprising the preset keywords are determined according to the hit positions and the position intervals of each text data in the spliced character string, and at the moment, the text data comprising the preset keywords are texts comprising fault-reporting information. Therefore, when fault analysis is carried out on the spliced character strings, fault analysis can be carried out on the plurality of text data simultaneously, batch analysis of the text data is realized, the analysis efficiency of a large amount of text data is improved, and then faults existing in the currently played video resources can be rapidly analyzed from the large amount of text data, so that processing measures can be timely taken according to the fault problems existing in the currently played video resources, and the time that the currently played video resources are in a fault state is shortened.

Description

Method and device for recognizing fault-reporting text

Technical Field

The application relates to the technical field of internet, in particular to a method and a device for recognizing fault-reporting texts.

Background

With the development of internet technology, acquiring video resources through a network becomes an important means for acquiring video resources. After the user acquires the movie and television resources through the network, the user can not only watch the movie and television resources, but also publish comments on the playing page of the movie and television resources through the network in the watching process, and feed back the faults of the currently played movie and television resources, so that the video fault monitoring system can know the faults of the currently played movie and television resources according to the comments.

In the prior art, a video fault monitoring system can analyze only one piece of text data at a time to judge whether that text data is a fault-reporting text; if it is, corresponding processing measures are taken according to the fault-reporting text. A fault-reporting text is a comment text that feeds back a fault existing in the currently played video resource.

If a large number of users watch the currently played video resources at the same time, a large number of comment texts appear on the currently played page, so that the video failure monitoring system needs to analyze a large number of text data in the time period. However, since the video failure monitoring system can only analyze one comment text at a time, the video failure monitoring system cannot process a large amount of text data in time, and thus cannot find a failure in the currently played video resource in time, and further cannot take a processing measure in time according to the failure in the video resource, which results in a long time for the currently played video resource to be in a failure state.

Disclosure of Invention

In order to solve the technical problems in the prior art, the application provides a method and a device for identifying fault-reporting texts, which can rapidly process a large amount of text data so as to take processing measures in time according to the fault problem of the currently played video resources, thereby shortening the time of the currently played video resources in a fault state.

In order to achieve the above purpose, the technical solution provided by the present application is as follows:

the application provides a method for recognizing fault-reporting texts, which comprises the following steps:

acquiring a plurality of text data;

splicing the plurality of text data to obtain spliced character strings and position intervals of each text data in the spliced character strings;

searching a preset keyword in the spliced character string, and determining the hit position of the preset keyword in the spliced character string; the preset keywords are determined according to a preset fault reporting text;

and determining the text data comprising the preset keywords based on the hit positions and the position intervals of each text data in the spliced character strings.

Optionally, the splicing the text data to obtain a spliced character string and a position interval of each text data in the spliced character string specifically includes:

splicing the text data, and connecting two adjacent text data through separators to obtain spliced character strings;

obtaining the position interval of each text data in the spliced character string according to the position of the separator in the spliced character string;

the determining, based on the hit position and the position interval of each text data in the concatenated character string, the text data including the preset keyword specifically includes:

comparing the hit position with the position of the separator, and determining that the preset keyword is positioned between the adjacent first separator and the second separator;

and determining text data comprising the preset keywords according to the position of the first separator and the position of the second separator.

Optionally, the comparing the hit position with the position of the separator determines that the preset keyword is located between the adjacent first separator and the second separator, specifically:

and comparing the hit position with the positions of the separators by using a bisection (binary search) method, and determining that the preset keyword is located between the adjacent first separator and second separator.

Optionally, the acquiring the plurality of text data specifically includes:

acquiring a plurality of original text data;

and discarding the text data of which the information value is lower than a preset value threshold value in the plurality of original text data to obtain a plurality of text data.

Optionally, the acquiring the plurality of text data specifically includes:

acquiring a plurality of original text data;

and carrying out duplication removal on the plurality of original text data to obtain a plurality of text data.

Optionally, after determining the text data including the preset keyword, the method further includes:

inputting the text data including the preset keywords into a fastText model for fault verification to obtain verified text data;

and carrying out fault identification on the verified text data, and determining video fault content.

Optionally, the searching for the preset keyword in the concatenated string specifically includes:

and matching the preset keywords with the splicing character string by using a regular expression.

The application also provides a device for recognizing the fault-reporting text, which comprises:

an acquisition unit configured to acquire a plurality of text data;

the splicing unit is used for splicing the plurality of text data to obtain a spliced character string and a position interval of each text data in the spliced character string;

the search unit is used for searching preset keywords in the spliced character string and determining the hit positions of the preset keywords in the spliced character string; the preset keywords are determined according to a preset fault reporting text;

and the first determining unit is used for determining the text data comprising the preset keywords based on the hit positions and the position intervals of each text data in the spliced character string.

Optionally, the splicing unit specifically includes:

the splicing subunit is used for splicing the text data, and two adjacent text data are connected through a separator to obtain a spliced character string;

the first obtaining subunit is configured to obtain a position interval of each text data in the concatenated character string according to the position of the delimiter in the concatenated character string;

the first determining unit specifically includes:

the first determining subunit is used for comparing the hit position with the position of the separator and determining that the preset keyword is positioned between the adjacent first separator and the second separator;

and the second determining subunit is used for determining the text data comprising the preset keywords according to the position of the first separator and the position of the second separator.

Optionally, the first determining subunit is specifically configured to:

compare the hit position with the positions of the separators by using a bisection (binary search) method, and determine that the preset keyword is located between the adjacent first separator and second separator.

Optionally, the obtaining unit specifically includes:

a second acquiring subunit, configured to acquire a plurality of original text data;

and the filtering subunit is used for discarding the text data of which the information value is lower than a preset value threshold value in the plurality of original text data to obtain a plurality of text data.

Optionally, the obtaining unit specifically includes:

a third acquiring subunit, configured to acquire a plurality of original text data;

and the duplication removing subunit is used for carrying out duplication removal on the plurality of original text data to obtain a plurality of text data.

Optionally, the device further includes:

the verification unit is used for inputting the text data comprising the preset keywords into a fastText model for fault verification to obtain verified text data;

and the second determining unit is used for carrying out fault identification on the verified text data and determining video fault content.

Compared with the prior art, the method has the advantages that:

according to the fault-reporting text identification method, a plurality of text data are spliced into a spliced character string, preset keywords are searched in the spliced character string, the hit positions of the preset keywords in the spliced character string are determined, so that the text data including the preset keywords are determined according to the hit positions and the position intervals of each text data in the spliced character string, and at the moment, the text data including the preset keywords are texts including fault-reporting information. Therefore, in the method, when the spliced character string is subjected to fault analysis, the fault analysis can be simultaneously performed on the plurality of text data, so that the batch analysis of the text data is realized, the analysis efficiency of a large amount of text data is improved, the fault of the currently played video resource can be quickly analyzed from the large amount of text data, the processing measures can be timely taken according to the fault problem of the currently played video resource, and the time that the currently played video resource is in the fault state is shortened.

In addition, the method also determines the text data belonging to the fault reporting text in the plurality of text data according to the hit position of the preset keyword in the spliced character string and the position interval of each text data in the spliced character string, so that the text data comprising the fault reporting information can be quickly screened out from the plurality of text data, and fault identification can be carried out according to the text data comprising the fault reporting information in the following process, thereby further realizing that the fault existing in the currently played video resource is quickly analyzed out from the plurality of text data, so that processing measures can be timely taken according to the fault problem existing in the currently played video resource, and further shortening the time of the currently played video resource in a fault state.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of an implementation of a method for recognizing fault-reporting text according to an embodiment of the present application;

Fig. 2 is a flowchart of another implementation of a method for recognizing fault-reporting text according to an embodiment of the present application;

Fig. 3 is a flowchart of an implementation of S206 according to an embodiment of the present application;

Fig. 4 is a flowchart of another implementation of S206 according to an embodiment of the present application;

Fig. 5 is a flowchart of an implementation of S206a according to an embodiment of the present application;

Fig. 6 is a flowchart of yet another implementation of a method for recognizing fault-reporting text according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a device for recognizing fault-reporting text according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Method embodiment one

Refer to Fig. 1, which is a flowchart of an implementation of a method for recognizing fault-reporting text according to an embodiment of the present application.

The method for recognizing the fault-reporting text provided by the embodiment of the application comprises the following steps:

S101: Acquire a plurality of text data.

The text data may be text in which a user evaluates a target object. As an example, the text data may be text in which a user evaluates a video, text in which a user evaluates clothing, or text in which a user evaluates a restaurant. This is not specifically limited in the present application.

For ease of understanding and explanation, the embodiments of the present application take text data in which a user evaluates a video as an example.

In this case, the text data may be text data in which the user evaluates the content of the currently played video resource, text data in which the user evaluates the playing state of the currently played video resource, or other text data.

For example, if the text data is "this movie is true!", the text data evaluates the content of the currently played video resource; if the text data is "this set does not open.", the text data evaluates the playing state of the currently played video resource.

In addition, the text data may be acquired in a variety of ways: from barrages (bullet-screen comments) sent by users during video playing, from forum posts published by users, or in other ways.

As an embodiment, when a video resource is playing and different users send different barrages for it, S101 may specifically be: acquiring a plurality of text data from the barrages sent by the plurality of users.

S102: Splice the plurality of text data to obtain a spliced character string and the position interval of each text data in the spliced character string.

As an embodiment, S102 may specifically be: splicing the plurality of text data in a preset order to obtain a spliced character string and the position interval of each text data in the spliced character string.

The preset order may be set in advance, or may be determined according to the actual application scenario.

For example, when the plurality of text data includes first text data, second text data, and third text data, S102 may splice the first text data, the second text data, and the third text data in sequence to obtain a first spliced character string; S102 may instead splice the second text data, the first text data, and the third text data in sequence to obtain a second spliced character string; or S102 may splice the second text data, the third text data, and the first text data in sequence to obtain a third spliced character string. S102 may also adopt other splicing orders, which are not specifically limited in this embodiment.

As another embodiment, S102 may specifically be: splicing the plurality of text data according to a preset splicing rule to obtain a spliced character string and the position interval of each text data in the spliced character string.

The preset splicing rule may be set in advance, or may be determined according to the actual application scenario. For example, the preset splicing rule may be that a separator is added between different text data during splicing, so that the different text data are distinguished by the separators. In this case, the position interval of each text data in the spliced character string can be determined according to the positions of the separators in the spliced character string.

As an example, when the plurality of text data includes first text data, second text data, and third text data, where the first text data includes 6 characters, the second text data includes 8 characters, and the third text data includes 4 characters, S102 may specifically be: splicing the first text data, a first separator, the second text data, a second separator, and the third text data in sequence to obtain a fourth spliced character string. The fourth spliced character string includes 20 characters (6 + 1 + 8 + 1 + 4, with each separator occupying one character), and the position of each text data in it can be determined from the positions of the separators.

Specifically, the position of each text data in the spliced character string is determined from the separator positions as follows: according to the position of the first separator (the 7th character), the position interval of the first text data in the fourth spliced character string is the 1st to the 6th character; according to the positions of the first separator and the second separator (the 16th character), the position interval of the second text data is the 8th to the 15th character; and according to the position of the second separator, the position interval of the third text data is the 17th to the 20th character.
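The 6/8/4-character example above can be reproduced with a short sketch. This is only an illustration under stated assumptions, not the patented implementation: the function name `splice`, the default separator `"\x1f"`, and the placeholder texts are hypothetical, and positions are reported as 1-based inclusive intervals to match the description.

```python
from typing import List, Tuple

def splice(texts: List[str], sep: str = "\x1f") -> Tuple[str, List[Tuple[int, int]]]:
    """Join texts with a separator and record each text's position interval.

    Intervals are 1-based and inclusive, as in the description above.
    The default separator is an assumed rarely used control character.
    """
    intervals = []
    pos = 1  # 1-based position of the next character to be written
    for text in texts:
        intervals.append((pos, pos + len(text) - 1))
        pos += len(text) + len(sep)  # skip over the text and the separator
    return sep.join(texts), intervals

# Placeholder texts with 6, 8, and 4 characters and a one-character separator.
spliced, intervals = splice(["aaaaaa", "bbbbbbbb", "cccc"], sep="|")
print(len(spliced), intervals)  # 20 [(1, 6), (8, 15), (17, 20)]
```

Recording the intervals while splicing avoids a second pass over the spliced character string to locate the separators.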

For example, S102 may be: splicing the text data "front high energy", "cannot find the third set", and "the fifteenth set subtitle asymmetry" into the spliced character string "cannot find the third set|||the fifteenth set subtitle asymmetry|||front high energy", where "|||" is the separator in the spliced character string.

It should be noted that, when a plurality of text data are spliced, different splicing rules may be adopted according to different splicing orders, and no specific limitation is made in the embodiment of the present application.

In addition, if the separator itself appeared inside the text data, the position intervals could no longer be determined accurately from the separators. To avoid this, in the present application the separator may be a specific symbol that is rarely used in text data. In this way, the position interval of each text data can be determined accurately from the separators.

S103: Search for a preset keyword in the spliced character string, and determine the hit position of the preset keyword in the spliced character string.

The preset keywords are determined according to the preset fault reporting text. The preset keyword may be a partial word in the preset failure text, or a word having similar semantics to the partial word.

For example, when the preset failure text is "set 7 cannot be found", the preset keyword may be "not found", or may be "missing".

Since the preset keyword may be a single character or a word, the hit position of the preset keyword in the spliced character string may be a single position or a position interval.

As an example, when the spliced character string is "cannot find the third set|||the fifteenth set subtitle asymmetry|||front high energy" and the preset keyword is "cannot find", S103 may specifically be: searching for "cannot find" in the spliced character string, and determining that its hit position in the spliced character string is the position interval from the 1st character to the 3rd character.

In addition, the hit position may include one piece of position information or a plurality of pieces of position information. If the preset keyword appears only once in the spliced character string, the hit position includes one piece of position information; if the preset keyword appears two or more times in the spliced character string, the hit position includes two or more pieces of position information.
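The keyword search, including the case of several hits, can be sketched with regular-expression matching, which this application mentions as one optional way to search the spliced character string. The function name `find_hits` and the English sample strings are hypothetical, and the sketch assumes a non-empty keyword list.

```python
import re
from typing import List, Tuple

def find_hits(spliced: str, keywords: List[str]) -> List[Tuple[str, int, int]]:
    """Return (keyword, start, end) for every hit; positions are 1-based and inclusive."""
    # Longer keywords first so the alternation prefers the longest match at a position.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    # m.start() is 0-based and m.end() is exclusive, hence start + 1 and end as-is.
    return [(m.group(), m.start() + 1, m.end()) for m in pattern.finditer(spliced)]

spliced = "cannot find ep3|subtitles off|cannot find ep7"
hits = find_hits(spliced, ["cannot find", "missing"])
print(hits)  # [('cannot find', 1, 11), ('cannot find', 31, 41)]
```

Compiling all keywords into one pattern means the spliced character string is scanned once for the whole keyword set rather than once per keyword.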

Because a keyword is short and captures the core content of the preset fault-reporting text, searching for keywords instead of searching for the whole preset fault-reporting text improves search efficiency, simplifies the search process, and improves the efficiency of obtaining the fault problem.

It should be noted that, the fault problem may also be obtained by directly querying a preset fault report text, which is not specifically limited in the present application.

S104: Determine the text data including the preset keyword based on the hit position and the position interval of each text data in the spliced character string.

Since the preset keyword is determined according to the preset fault-reporting text, it can be determined that the text data including the preset keyword belongs to the fault-reporting text.

As an example, when the spliced character string is "cannot find the third set|||the fifteenth set subtitle asymmetry|||front high energy" and the preset keyword is "cannot find", the position interval of "cannot find the third set" is the 1st to the 6th character, the position interval of "the fifteenth set subtitle asymmetry" is the 8th to the 15th character, the position interval of "front high energy" is the 17th to the 20th character, and the hit position of "cannot find" is the 1st to the 3rd character. In this case, S104 may specifically be: since the interval from the 1st to the 3rd character lies within the interval from the 1st to the 6th character, it can be determined that the text data "cannot find the third set" includes the keyword "cannot find", and therefore that the text data "cannot find the third set" belongs to the fault-reporting text.

The example above describes the case in which the hit position includes one piece of position information. However, in this embodiment of the present application the hit position may also include two or more pieces of position information; in that case, the text data corresponding to each piece of position information can be determined by the method provided in this embodiment, which for brevity is not described again here.
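Mapping a hit position back to its text data can be done with a binary search over the sorted separator positions, which is the bisection option mentioned earlier in this application. The following is a minimal sketch under assumptions: `locate_text` is a hypothetical name, positions are 1-based, and keywords are assumed never to contain the separator, so a hit cannot straddle two texts.

```python
from bisect import bisect_left
from typing import List

def locate_text(hit_start: int, separator_positions: List[int]) -> int:
    """Return the 0-based index of the text data that contains a hit.

    separator_positions must be sorted in ascending order. The number of
    separators located before the hit equals the index of the containing text.
    """
    return bisect_left(separator_positions, hit_start)

# Separators at the 7th and 16th characters, as in the 20-character example.
separators = [7, 16]
print(locate_text(1, separators))   # 0
print(locate_text(8, separators))   # 1
print(locate_text(17, separators))  # 2
```

With n text data, each lookup takes O(log n) comparisons instead of checking every position interval in turn.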

In addition, since the fault-reporting text can be used for multiple purposes, the embodiments of the present application do not specifically limit how the determined fault-reporting text is used.

For convenience of explanation and understanding, the embodiment of the present application is described by taking one application of the fault-reporting text as an example.

After the failure text is determined, that is, after the text data including the preset keyword is determined, the failure recognition may be performed using the text data including the preset keyword.

As an implementation manner, the fault identification by using text data including preset keywords may specifically be:

and carrying out fault identification on the text data comprising the preset keywords, and determining the content of the video fault.

The keyword can only determine whether the text data belongs to the preset fault-reporting text; it cannot determine which preset fault-reporting text the text data corresponds to. Therefore, fault identification needs to be performed on the text data including the preset keyword.

As an implementation manner, the fault identification by using text data including preset keywords may specifically be: and matching the text data comprising the preset keywords with a preset fault text one by one, and determining video fault content according to the first preset fault text when the text data comprising the preset keywords is successfully matched with the first preset fault text.

As an example, when the text data including the preset keyword is "the third set cannot be found", the first preset fault text is "the third set cannot be found", the second preset fault text is "the subtitle cannot be found", and the third preset fault text is "cannot be played", the fault identification using the text data including the preset keyword may specifically be: matching the text data "the third set cannot be found" with the first preset fault text, the second preset fault text, and the third preset fault text one by one, and determining that the text data "the third set cannot be found" is successfully matched with the first preset fault text, so that the video fault content is determined as: the third set cannot be found.
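One-by-one matching against the preset fault texts can be sketched as follows. The sketch assumes the simplest possible matching criterion, exact string equality after trimming whitespace; the application leaves the criterion open, and `identify_fault` is a hypothetical name.

```python
from typing import List, Optional

def identify_fault(text: str, preset_fault_texts: List[str]) -> Optional[str]:
    """Return the first preset fault text that the text matches, or None.

    Matching here is exact equality after trimming whitespace, which is an
    assumption made for this sketch only.
    """
    candidate = text.strip()
    for preset in preset_fault_texts:  # compare one by one, in order
        if candidate == preset:
            return preset
    return None

presets = ["the third set cannot be found", "the subtitle cannot be found"]
print(identify_fault("the third set cannot be found", presets))
print(identify_fault("front high energy", presets))  # None
```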

It should be noted that, when performing fault identification, not only text data that is the same as the preset fault text content may be identified, but also text data that is different from the preset fault text content in expression but has the same semantic meaning may be identified, which is not specifically limited in the present application.

In order to further improve the efficiency of processing a large amount of text data, the present application further provides another embodiment of the method for recognizing fault-reporting text, which is explained below with reference to the drawings.

Method embodiment two

The second method embodiment is an improvement on the first method embodiment; for the sake of brevity, content that is the same in the two embodiments is not repeated here.

Refer to Fig. 2, which is a flowchart of another implementation of a method for recognizing fault-reporting text according to an embodiment of the present application.

S201: Acquire a plurality of original text data.

S202: De-duplicate the plurality of original text data to obtain a plurality of de-duplicated text data.

The original text data may include a plurality of pieces of the same text data. Because the video evaluation contents expressed by the same text data are the same, in order to further improve the processing efficiency of the plurality of text data and further improve the efficiency of acquiring the video fault problem, the deduplication processing can be performed on the plurality of original text data so as to ensure that no text data with the same content exists in the subsequently processed text data, thereby being beneficial to improving the processing efficiency of the plurality of text data.

As an embodiment, S202 may specifically be: comparing each original text data with the other original text data, and judging whether it has the same content as any of them. If the contents of at least two original text data are the same, only one of them is retained.

As an example, assume that 5 original text data include first text data, second text data, third text data, fourth text data, and fifth text data, and that the contents of the first text data, the third text data, and the fourth text data are all the same. Then S202 may specifically be: comparing the first text data with the second text data, the third text data, the fourth text data, and the fifth text data in sequence, determining that the contents of the first text data, the third text data, and the fourth text data are the same, and retaining only the first text data among these three.
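Exact-content de-duplication that keeps the first occurrence can be sketched as below. Note that this swaps the pairwise comparison described above for a hash-based lookup, which gives the same result; the name `dedup_exact` and the sample texts are hypothetical.

```python
from typing import List

def dedup_exact(texts: List[str]) -> List[str]:
    """Drop texts whose content repeats an earlier one, keeping first occurrences.

    dict preserves insertion order, so dict.fromkeys keeps the first copy
    of each distinct text in its original position.
    """
    return list(dict.fromkeys(texts))

# The first, third, and fourth texts are identical; only the first is kept.
originals = ["cannot play", "great movie", "cannot play", "cannot play", "no subtitles"]
print(dedup_exact(originals))  # ['cannot play', 'great movie', 'no subtitles']
```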

It should be noted that, multiple deduplication methods may be adopted for deduplication of multiple original text data, and this is not specifically limited in this application.

In addition, in order to further improve the processing efficiency of the plurality of text data and further improve the efficiency of acquiring the video failure problem, the deduplication processing provided by the application can not only deduplicate the text data with the same content, but also deduplicate the text data with different content and the same semantic meaning.

Two text data that differ in content but are identical in semantics express the same meaning. For example, "video playing is not smooth" expresses the same meaning as "video playing is rather laggy", so these are two text data with different contents but the same semantics.

As another embodiment, S202 may specifically be: first, processing the plurality of original text data by using a semantic extraction method to obtain the semantic text data corresponding to each original text data; then, comparing each semantic text data with the other semantic text data, and judging whether it has the same content as any of them. If the contents of at least two semantic text data are the same, only the original text data corresponding to one of them is retained.
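The semantic variant can be sketched by de-duplicating on an extracted semantic key instead of on the raw content. The application does not specify the semantic extraction method, so the sketch takes it as a caller-supplied function; `dedup_by_semantics`, `toy_extract`, the lookup table, and the sample phrases are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

def dedup_by_semantics(texts: List[str], extract: Callable[[str], str]) -> List[str]:
    """Keep one original text per distinct semantic key, preferring the first seen."""
    kept: Dict[str, str] = {}
    for text in texts:
        # setdefault stores the text only if its semantic key is new.
        kept.setdefault(extract(text), text)
    return list(kept.values())

# Toy stand-in for a real semantic extraction method (assumption).
TOY_SEMANTICS = {
    "video playing is not smooth": "PLAYBACK_STUTTER",
    "video keeps stuttering": "PLAYBACK_STUTTER",
}

def toy_extract(text: str) -> str:
    return TOY_SEMANTICS.get(text, text)

texts = ["video playing is not smooth", "no subtitles", "video keeps stuttering"]
print(dedup_by_semantics(texts, toy_extract))
# ['video playing is not smooth', 'no subtitles']
```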

S203: Discard, from the plurality of de-duplicated text data, the text data whose information value is lower than a preset value threshold, to obtain a plurality of text data.

The information value of text data indicates whether the information it contains is worth analyzing. For example, for video evaluation text data, text such as advertisement promotion text, job recruitment text, and text consisting of a single emoji has no analyzable value.

The evaluation criterion of the information value of the text data may be preset or determined according to an actual application scenario, which is not specifically limited in the present application.

Because the information value of the text data is indefinite, some text data has a high information value, and some text data has a low information value, in order to further improve the processing efficiency of a plurality of text data and further improve the efficiency of acquiring a video failure problem, the text data can be screened according to the information value of the text data, the text data with high information value is retained, and the text data with low information value is discarded.

The preset value threshold is the basis for judging whether text data is valuable for acquiring the video fault. If the information value of the text data is higher than or equal to the preset value threshold, the text data is valuable for acquiring the video fault; if the information value of the text data is lower than the preset value threshold, the text data has no value for acquiring the video fault.

Moreover, the preset value threshold may be preset, or may be set according to an actual application scenario, which is not specifically limited in this application.

In addition, the preset value threshold may be determined according to different characteristics of the text data, for example, length characteristics of the text data or semantic characteristics of the text data.

As an embodiment, the information value of the text data may be determined according to a length characteristic of the text data, and the preset value threshold is determined according to the length interval. At this time, if the length of the text data does not belong to the length interval, the information value of the text data is lower than a preset value threshold value; and if the length of the text data belongs to the length interval, the information value of the text data is higher than or equal to a preset value threshold value.

Thus, S203 may specifically be: acquiring the lengths of a plurality of deduplicated text data, and judging whether the length of each deduplicated text data belongs to the length interval; if so, determining that the information value of the text data after the duplication removal is higher than or equal to a preset value threshold, and keeping the text data after the duplication removal; if not, determining that the information value of the text data after the duplication removal is lower than a preset value threshold, and discarding the text data after the duplication removal.
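The length-based screening of S203 can be sketched as follows. The function name `filter_by_length` and the parameters `min_len` and `max_len` are hypothetical, and the sketch assumes that length is counted in characters and that the length interval is closed at both ends; the application does not fix either choice.

```python
def filter_by_length(texts, min_len, max_len):
    """Keep only text data whose length lies in [min_len, max_len].

    Assumption: a length inside the interval stands for "information value
    higher than or equal to the preset value threshold"; anything outside
    the interval is discarded.
    """
    return [text for text in texts if min_len <= len(text) <= max_len]
```

For instance, an interval of 4 to 200 characters would discard a comment consisting of a single emoji while keeping a short fault report.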

As another embodiment, the information value of the text data may be determined according to the semantics of the text data, and the preset value threshold is determined according to a semantic set, where the semantic set is a set of semantics that are valuable for acquiring the fault-reporting text. In this case, if the semantics of the text data belong to the semantic set, the information value of the text data is higher than or equal to the preset value threshold; if the semantics of the text data do not belong to the semantic set, the information value of the text data is lower than the preset value threshold.

Thus, S203 may specifically be: obtaining semantic texts corresponding to the text data after the duplication removal, and judging whether each semantic text belongs to the semantic set; if so, determining that the information value of the text data after the duplication removal is higher than or equal to a preset value threshold, and keeping the text data after the duplication removal; if not, determining that the information value of the text data after the duplication removal is lower than a preset value threshold, and discarding the text data after the duplication removal.

S204: splicing the plurality of text data to obtain a spliced character string and the position interval of each text data in the spliced character string.

The content of S204 is the same as that of S102, and is not described herein again.

S205: searching a preset keyword in the spliced character string, and determining the hit position of the preset keyword in the spliced character string; the preset keywords are determined according to a preset fault reporting text.

As an embodiment, S205 may specifically be: matching the preset keywords against the spliced character string by using a regular expression, and determining the hit positions of the preset keywords in the spliced character string.
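The regular-expression matching of S205 can be sketched with Python's standard `re` module. The function name `find_hits` is hypothetical, and the sketch assumes that hit positions are zero-based character offsets and that keywords are matched literally (they are escaped before being combined into one alternation pattern).

```python
import re

def find_hits(spliced, keywords):
    """Return (position, keyword) for every keyword hit in the spliced string.

    Assumption: positions are zero-based offsets of the first character of
    each hit; matches are non-overlapping and scanned left to right.
    """
    if not keywords:
        return []
    # Escape each keyword so that it is matched literally, then combine
    # all keywords into a single alternation pattern.
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return [(m.start(), m.group()) for m in pattern.finditer(spliced)]
```

A single pass over the spliced character string thus yields the hit positions for all of the text data at once, which is the batch effect the method relies on.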

S206: determining the text data including the preset keywords based on the hit positions and the position interval of each text data in the spliced character string.

The hit position may include more than one piece of position information, and the process of determining the text data including a preset keyword is the same for each piece of position information. For convenience of explanation, the following description takes as an example the case in which the hit position includes one piece of position information.

The S206 can adopt various different embodiments, and three embodiments are explained and illustrated below as examples.

Fig. 3 is a flowchart of an implementation of S206 provided in this application.

As an embodiment, when the spliced character string includes only N text data, S206 may specifically be:

S2061: judging whether the hit position belongs to the position interval of the 1st text data in the spliced character string. If yes, go to S2062; if not, go to S2063.

S2062: determining that the 1st text data includes the preset keyword.

S2063: judging whether the hit position belongs to the position interval of the 2nd text data in the spliced character string. If yes, go to S2064; if not, go to S2065.

S2064: determining that the 2nd text data includes the preset keyword.

S2065: judging whether the hit position belongs to the position interval of the 3rd text data in the spliced character string.

By analogy, when it is determined that the ith text data does not include a preset keyword according to the hit position and the position interval of the ith text data in the spliced character string, it is necessary to further determine whether the (i + 1) th text data includes the preset keyword according to the hit position and the position interval of the (i + 1) th text data in the spliced character string, so as to find out the text data including the preset keyword. Wherein i is a positive integer and i is not more than N-1.

S2066: judging whether the hit position belongs to the position interval of the (N-1)th text data in the spliced character string. If yes, go to S2067; if not, go to S2068.

S2067: determining that the (N-1)th text data includes the preset keyword.

S2068: determining that the Nth text data includes the preset keyword.

In this embodiment, whether the mth text data includes a preset keyword is determined according to the hit position and the position interval of the mth text data in the concatenated string, where m is a positive integer and is not greater than N. Therefore, the text data including the preset keywords can be accurately determined, the text data belonging to the preset fault text can be further determined, and the playing fault of the video can be further acquired according to the text data including the preset keywords.

It should be noted that there is no fixed determination order when determining the affiliation between the hit position and the position interval of the text data in the concatenated string. For example, the determination may be made in accordance with the order of arrangement of the plurality of text data, or in accordance with another order, which is not specifically limited in the present application.
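The sequential interval check of S2061 to S2068 can be sketched as follows. The function name `locate_linear` is hypothetical, and the sketch assumes that each position interval is a half-open pair `(start, end)` of zero-based offsets. Unlike the flow above, which concludes by elimination that the Nth text data contains the keyword, this sketch checks every interval explicitly and returns `None` if no interval contains the hit.

```python
def locate_linear(hit, intervals):
    """Return the zero-based index of the text data whose interval contains hit.

    Assumption: intervals[i] == (start, end) is the half-open position
    interval of the (i+1)-th text data in the spliced character string.
    """
    for index, (start, end) in enumerate(intervals):
        if start <= hit < end:
            return index
    return None  # the hit lies outside every text data interval
```

The intervals are visited in their arrangement order here, but any other order gives the same result, in line with the note above.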

The embodiment provided above describes in detail the process of determining the text data including a preset keyword when the spliced character string includes only a plurality of text data. In addition, adjacent text data in the spliced character string can be distinguished by separators. The separators can be added only between different text data, or can be added both between different text data and at the head and tail positions of the spliced character string.

For example, in the first spliced character string "cannot find the third episode | | | | the subtitles of the fifteenth episode are missing | | | | high energy ahead", the separator "| | | |" is added only between different text data. In the second spliced character string "| | | | cannot find the third episode | | | | the subtitles of the fifteenth episode are missing | | | | high energy ahead | | | |", the separator "| | | |" is added not only between different text data but also at the head and tail positions of the spliced character string.
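Splicing with separators at the head, at the tail, and between adjacent text data can be sketched as follows. The function name `splice` and the default separator are hypothetical; the sketch assumes zero-based offsets, half-open intervals, and that no text data itself contains the separator.

```python
def splice(texts, sep="||||"):
    """Join texts with sep between them and at both ends.

    Returns the spliced string and, for each text, its half-open position
    interval (start, end) inside the spliced string.
    Assumption: no text contains sep, otherwise intervals become ambiguous.
    """
    parts = [sep]          # leading separator
    intervals = []
    pos = len(sep)         # offset where the next text starts
    for text in texts:
        intervals.append((pos, pos + len(text)))
        parts.append(text)
        parts.append(sep)  # separator after every text, including the last
        pos += len(text) + len(sep)
    return "".join(parts), intervals
```

Recording the intervals while splicing avoids a second scan of the spliced character string to locate the separators.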

Thus, S206 may adopt not only the above embodiment but also another embodiment. For convenience of explanation and understanding, the following description will be given taking as an example a separator added between different text data and the head and tail positions of the concatenation character string.

Fig. 4 is a flowchart of another implementation of S206 provided in this application.

As another embodiment, when adjacent text data in the spliced character string are distinguished by separators, S206 may specifically be:

S206a: comparing the hit position with the positions of the separators to determine that the preset keyword is located between an adjacent first separator and second separator.

As an embodiment, S206a may specifically be: comparing the hit position with the positions of the separators in sequence to determine the first separator and the second separator.

As another embodiment, in order to further improve the efficiency of determining the first separator and the second separator, and thereby improve the processing efficiency of the plurality of text data and the efficiency of acquiring the video fault problem, S206a may specifically be: comparing the hit position with the positions of the separators by dichotomy (binary search) to determine that the preset keyword is located between the adjacent first separator and second separator.

It should be noted that the positions in the concatenated string may be represented by various kinds of position information, which is not specifically limited in this embodiment of the present application.

As an embodiment, a position in the spliced character string may be represented by a position number. For example, in the spliced character string of the example above, the position number of the character rendered as "find" is 2, the position number of the character rendered as "five" is 10, and the position number of the character rendered as "can" is 21.

Therefore, when positions in the spliced character string are represented by position numbers, S206a may specifically be:

comparing the magnitude of the hit position with the magnitudes of the separator positions to determine that the preset keyword is located between the adjacent first separator and second separator.

For ease of explanation and illustration, the following explanation and illustration will be made in conjunction with fig. 5.

Referring to fig. 5, a flowchart of an implementation manner of S206a according to an embodiment of the present application is provided.

As shown in fig. 5, when the concatenated string includes 11 separators, and the position of the ith separator is lower than the position of the (i + 1) th separator, where i is a positive integer and i is smaller than 11, then S206a may specifically be:

S206a1: judging whether the hit position is less than the position of the 6th separator. If yes, go to S206a2; if not, go to S206a11.

It should be noted that the hit position cannot be equal to the position of the 6th separator. Therefore, when the hit position is determined to be not less than the position of the 6th separator, the hit position is greater than the position of the 6th separator, and S206a11 is executed.

S206a2: judging whether the hit position is greater than the position of the 4th separator. If yes, go to S206a3; if not, go to S206a6.

S206a3: judging whether the hit position is greater than the position of the 5th separator. If yes, go to S206a4; if not, go to S206a5.

S206a4: determining that the position of the first separator is the position of the 5th separator and the position of the second separator is the position of the 6th separator.

S206a5: determining that the position of the first separator is the position of the 4th separator and the position of the second separator is the position of the 5th separator.

S206a6: judging whether the hit position is greater than the position of the 3rd separator. If yes, go to S206a7; if not, go to S206a8.

S206a7: determining that the position of the first separator is the position of the 3rd separator and the position of the second separator is the position of the 4th separator.

S206a8: judging whether the hit position is greater than the position of the 2nd separator. If yes, go to S206a9; if not, go to S206a10.

S206a9: determining that the position of the first separator is the position of the 2nd separator and the position of the second separator is the position of the 3rd separator.

S206a10: determining that the position of the first separator is the position of the 1st separator and the position of the second separator is the position of the 2nd separator.

S206a11: judging whether the hit position is less than the position of the 8th separator. If yes, go to S206a12; if not, go to S206a15.

S206a12: judging whether the hit position is less than the position of the 7th separator. If yes, go to S206a13; if not, go to S206a14.

S206a13: determining that the position of the first separator is the position of the 6th separator and the position of the second separator is the position of the 7th separator.

S206a14: determining that the position of the first separator is the position of the 7th separator and the position of the second separator is the position of the 8th separator.

S206a15: judging whether the hit position is less than the position of the 9th separator. If yes, go to S206a16; if not, go to S206a17.

S206a16: determining that the position of the first separator is the position of the 8th separator and the position of the second separator is the position of the 9th separator.

S206a17: judging whether the hit position is less than the position of the 10th separator. If yes, go to S206a18; if not, go to S206a19.

S206a18: determining that the position of the first separator is the position of the 9th separator and the position of the second separator is the position of the 10th separator.

S206a19: determining that the position of the first separator is the position of the 10th separator and the position of the second separator is the position of the 11th separator.

It should be noted that the hit position cannot be equal to the position of the ith separator. Therefore, when the hit position is determined to be not less than the position of the ith separator, the hit position is greater than the position of the ith separator; and when the hit position is determined to be not greater than the position of the ith separator, the hit position is less than the position of the ith separator, where i is a positive integer and i is less than or equal to 11.

It should be further noted that the above example provides only one implementation of the dichotomy, and other implementations of the dichotomy may also be adopted, which is not specifically limited in this application.

S206b: determining the text data including the preset keyword according to the position of the first separator and the position of the second separator.

Since the position interval of each text data in the concatenated string is determined according to the position of the delimiter, when the position of the first delimiter and the position of the second delimiter are obtained, text data including the preset keyword can be determined.
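S206a by dichotomy followed by S206b can be sketched with Python's standard `bisect` module. The function names `enclosing_separators` and `text_for_hit` are hypothetical. The sketch assumes that separators are present at the head and tail of the spliced character string, that `sep_positions` holds their zero-based start offsets in ascending order, and that a hit never coincides with a separator. It is one possible realization of the dichotomy, not the specific decision tree of fig. 5.

```python
from bisect import bisect_right

def enclosing_separators(hit, sep_positions):
    """Return zero-based indices (first, second) of the adjacent separators
    that enclose the hit position, found by binary search."""
    i = bisect_right(sep_positions, hit)  # number of separators at or before hit
    if i == 0 or i == len(sep_positions):
        raise ValueError("hit position is not enclosed by two separators")
    return i - 1, i

def text_for_hit(hit, sep_positions, texts):
    """Map a hit position to the text data that contains it.

    Assumption: texts[k] lies between separator k and separator k + 1.
    """
    first, _second = enclosing_separators(hit, sep_positions)
    return texts[first]
```

With 11 separators, the binary search needs at most four comparisons per hit, whereas the sequential comparison may need up to ten.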

S207: inputting the text data including the preset keywords into a fastText model for fault verification to obtain verified text data.

Text data that includes a preset keyword does not necessarily belong to the preset fault-reporting text. If such text data is not discarded at this point, fault identification continues to be performed on it, and it is discarded only once no corresponding preset fault-reporting text can be found.

Therefore, in order to improve the efficiency of acquiring the fault, the text data including the preset keywords may be input into the fastText model for fault verification, so that text data not belonging to the preset fault-reporting text is discarded and only text data belonging to the preset fault-reporting text is retained.

S208: carrying out fault identification on the verified text data, and determining the video fault content.

According to the fault-reporting text recognition method provided by this embodiment of the application, deduplicating the original text data eliminates repeated text data, which reduces the text data to be processed subsequently and improves the efficiency of acquiring the fault problem. Discarding the text data whose information value is lower than the preset value threshold eliminates text data with low information value, which likewise reduces the text data to be processed subsequently and improves the efficiency of acquiring the fault problem. Inputting the text data including the preset keywords into a fastText model for fault verification eliminates text data that does not belong to the preset fault-reporting text, which again reduces the text data to be processed subsequently and improves the efficiency of acquiring the fault problem. Determining the text data including the preset keywords by dichotomy further improves the efficiency of acquiring the fault problem, so that processing measures can be taken in time according to the fault problem of the currently played video resource, and the time that the currently played video resource is in a fault state is shortened.

The above embodiments provide specific implementations of obtaining text data belonging to the preset fault-reporting text from a large amount of text data. In addition, the present application provides still another embodiment of the method for recognizing fault-reporting text, which is explained below with reference to the accompanying drawings.

The third method embodiment:

Method embodiment three is an improvement made on the basis of method embodiment one or method embodiment two. The following description takes an improvement made on the basis of method embodiment two as an example. For brevity, the parts of method embodiment three that are the same as method embodiment two are not described again.

Fig. 6 is a flowchart of yet another implementation of the method for recognizing fault-reporting text according to an embodiment of the present application.

S601: a plurality of original text data is acquired.

S601 is the same as S201, and is not described herein again.

S602: acquiring identifiers of the plurality of original text data, wherein the original text data correspond to the identifiers one to one.

The identifier is used to distinguish different original text data, and may be a number, a letter, or another symbol having an identifying function, which is not specifically limited in this application.

S603: deduplicating the plurality of original text data to obtain a plurality of deduplicated text data.

The content of S603 is the same as that of S202, and is not described herein again.

S604: acquiring identification numbers of the plurality of deduplicated text data, wherein the deduplicated text data correspond to the identification numbers one to one.

The identification number is used to distinguish different deduplicated text data, and may be a number, a letter, or another symbol having an identifying function, which is not specifically limited in this application.

As an example, when 6 pieces of original text data include: the first text data, the second text data, the third text data, the fourth text data, the fifth text data, and the sixth text data, where the first text data, the third text data, the fourth text data, and the fifth text data have the same content, and the first text data is retained during deduplication, then S604 may specifically be: and acquiring identification numbers of the first text data, the second text data and the sixth text data.

S605: determining the mapping relation between the identifiers and the identification numbers according to the identifiers of the plurality of original text data and the identification numbers of the plurality of deduplicated text data.

For convenience of explanation, 6 original text data are taken as an example.

The 6 original text data include the first text data, the second text data, the third text data, the fourth text data, the fifth text data, and the sixth text data. The contents of the first text data, the third text data, the fourth text data, and the fifth text data are the same, and the first text data is retained during deduplication.

Further, the first text data is denoted by a, the second text data by b, the third text data by c, the fourth text data by d, the fifth text data by e, and the sixth text data by f.

The identification number of the first text data is 1, the identification number of the second text data is 2, and the identification number of the sixth text data is 3.

In this case, the mapping relation between the identifiers and the identification numbers is specifically as follows: the identifiers a, c, d, and e all correspond to the identification number 1; the identifier b corresponds to the identification number 2; and the identifier f corresponds to the identification number 3.
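The mapping relation of S605 can be sketched as follows. The function name `build_mapping` is hypothetical, and the sketch assumes that identification numbers are consecutive integers starting from 1, assigned in the order in which distinct contents first appear, as in the example above.

```python
def build_mapping(originals):
    """Map each original identifier to the identification number of the
    deduplicated text data that represents it.

    originals: list of (identifier, text) pairs in arrival order.
    Assumption: duplicates are texts with exactly equal contents.
    """
    number_of_text = {}  # text content -> identification number
    mapping = {}         # identifier -> identification number
    for identifier, text in originals:
        if text not in number_of_text:
            number_of_text[text] = len(number_of_text) + 1
        mapping[identifier] = number_of_text[text]
    return mapping
```

Applied to the six text data of the example, with identifiers a to f, this yields 1 for a, c, d, and e, 2 for b, and 3 for f.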

S606: discarding, from the plurality of deduplicated text data, the text data whose information value is lower than a preset value threshold, to obtain a plurality of text data, and obtaining the type mark of the discarded text data based on the identification number.

As an example, when the information value of the second text data is lower than the preset value threshold and the identification number of the second text data is 2, the type corresponding to the identification number 2 is marked as worthless text.

S607: splicing the plurality of text data to obtain a spliced character string and the position interval of each text data in the spliced character string.

S607 is the same as S204, and is not described herein again.

S608: searching a preset keyword in the spliced character string, and determining the hit position of the preset keyword in the spliced character string; the preset keywords are determined according to a preset fault reporting text.

S608 is the same as S205, and is not described herein again.

S609: determining the text data including the preset keywords based on the hit positions and the position interval of each text data in the spliced character string.

S609 is the same as S206, and is not described herein again.

S610: inputting the text data including the preset keywords into a fastText model for fault verification to obtain verified text data, and obtaining the type mark of the rejected text data based on the identification number.

As an example, when the sixth text data does not belong to the preset fault-reporting text and the identification number of the sixth text data is 3, the sixth text data is discarded by the fastText model, and the type corresponding to the identification number 3 is marked as useless text.

S611: carrying out fault identification on the verified text data, and determining the video fault content.

S611 is the same as S208, and is not described herein again.

S612: determining, according to the video fault content, the type mark of the text data including the preset keywords based on the identification number.

As an example, when the video fault content of the first text data is "the thirteenth episode cannot be found" and the identification number of the first text data is 1, the type corresponding to the identification number 1 is marked as "the thirteenth episode cannot be found".

S613: determining the types of the plurality of original text data according to the mapping relation between the identifiers and the identification numbers and the type marks of the text data.

As an example, assume that the mapping relation between the identifiers and the identification numbers is as follows: the identifiers a, c, d, and e correspond to the identification number 1, the identifier b corresponds to the identification number 2, and the identifier f corresponds to the identification number 3. Assume further that the type mark corresponding to the identification number 1 is "the thirteenth episode cannot be found", the type mark corresponding to the identification number 2 is worthless text, and the type mark corresponding to the identification number 3 is useless text. Then the type mark corresponding to each of the identifiers a, c, d, and e is "the thirteenth episode cannot be found", the type mark corresponding to the identifier b is worthless text, and the type mark corresponding to the identifier f is useless text.
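The propagation of type marks in S613 can be sketched as a dictionary lookup. The function name `label_originals` is hypothetical, and the sketch assumes that every identification number appearing in the mapping already has a type mark from S606, S610, or S612.

```python
def label_originals(mapping, type_by_number):
    """Give every original identifier the type mark of its identification number.

    mapping: identifier -> identification number (from S605).
    type_by_number: identification number -> type mark.
    """
    return {
        identifier: type_by_number[number]
        for identifier, number in mapping.items()
    }
```

Because the marks are stored once per identification number, duplicated original text data receive their type without being processed again.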

According to the method for recognizing fault-reporting text provided by this embodiment, the type marks of all original text data can be obtained quickly through the mapping relation between the identifiers and the identification numbers and the type marks based on the identification numbers. Therefore, whether each original text data includes a video fault can be determined according to the type mark corresponding to its identifier, and the text type of the text data that does not belong to the preset fault-reporting text can also be determined, which further improves the accuracy of the acquired fault problem and the accuracy of the analysis of a large amount of text data.

It should be noted that any method for recognizing the failure-reporting text provided in the above embodiment may be applied not only to the technical field of video failure monitoring, but also to other fields related to comment monitoring, for example, to the field of comment monitoring of at least one of a commodity, a movie, a restaurant, and a scenic spot.

Based on the method for recognizing fault-reporting text provided by the above embodiments, the application also provides a device for recognizing fault-reporting text, which is explained below with reference to the accompanying drawings.

The device embodiment:

Fig. 7 is a schematic structural diagram of a device for recognizing fault-reporting text according to an embodiment of the present application.

The device for recognizing fault-reporting text provided by this embodiment of the application includes:

an acquisition unit 701 configured to acquire a plurality of text data;

a splicing unit 702, configured to splice the plurality of text data to obtain a spliced character string and a position interval of each text data in the spliced character string;

a searching unit 703, configured to search a preset keyword in the concatenated string, and determine a hit position of the preset keyword in the concatenated string; the preset keywords are determined according to a preset fault reporting text;

a first determining unit 704, configured to determine text data including the preset keyword based on the hit position and a position interval of each text data in the concatenated string.

In order to further improve the processing efficiency of the text data and further improve the efficiency of acquiring the failure problem, the splicing unit 702 specifically includes:

the first determining unit 704 specifically includes:

In order to further improve the processing efficiency of the text data and further improve the efficiency of acquiring the failure problem, the first determining subunit specifically includes:

In order to further improve the processing efficiency of the text data and further improve the efficiency of acquiring the failure problem, the acquiring unit 701 specifically includes:

In order to further improve the processing efficiency of the text data and further improve the efficiency of acquiring the fault problem, the method further comprises the following steps:

In order to further improve the processing efficiency of the text data and further improve the efficiency of acquiring the failure problem, the searching unit 703 specifically includes:

The device for recognizing fault-reporting text provided by this embodiment of the application includes an acquisition unit 701, a splicing unit 702, a searching unit 703, and a first determining unit 704. In the device, a plurality of text data are spliced into a spliced character string, a preset keyword is searched for in the spliced character string, and the hit position of the preset keyword in the spliced character string is determined, so that the text data including the preset keyword is determined according to the hit position and the position interval of each text data in the spliced character string. The text data including the preset keyword is then text that includes fault-reporting information. Thus, when fault analysis is carried out on the spliced character string, fault analysis is carried out on the plurality of text data simultaneously. This realizes batch analysis of the text data and improves the analysis efficiency for a large amount of text data, so that faults existing in the currently played video resource can be quickly identified from a large amount of text data, processing measures can be taken in time according to those faults, and the time that the currently played video resource is in a fault state is shortened.

In addition, the device determines the text data belonging to the fault-reporting text among the plurality of text data according to the hit position of the preset keyword in the spliced character string and the position interval of each text data in the spliced character string. Text data including fault-reporting information can therefore be quickly screened out from the plurality of text data, which facilitates subsequent fault identification based on that text data. As a result, faults existing in the currently played video resource can be quickly identified from the plurality of text data, processing measures can be taken in time, and the time that the currently played video resource is in a fault state is further shortened.

It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention in any manner. Although the present invention has been described with reference to the preferred embodiments, it is not limited thereto. Using the methods and techniques disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, without departing from the scope of the technical solution. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A method of fault-reporting text recognition, comprising:

acquiring a plurality of text data;

splicing the plurality of text data to obtain a spliced character string and a position interval of each text data in the spliced character string; wherein the splicing of the plurality of text data to obtain the spliced character string and the position interval of each text data in the spliced character string specifically includes: splicing the text data, with every two adjacent text data connected through a separator, to obtain the spliced character string; and obtaining the position interval of each text data in the spliced character string according to the position of the separator in the spliced character string;

2. The method according to claim 1, wherein the determining text data including the preset keyword based on the hit position and a position interval of each text data in the concatenated string specifically includes:

3. The method according to claim 2, wherein the comparing the hit location and the separator location determines that the predetermined keyword is located between adjacent first and second separators, specifically:

4. The method according to claim 1, wherein the obtaining of the plurality of text data specifically includes:

acquiring a plurality of original text data;

5. The method according to claim 1, wherein the obtaining of the plurality of text data specifically includes:

acquiring a plurality of original text data;

6. The method according to claim 1, wherein after determining the text data including the preset keyword, the method further comprises:

7. The method according to claim 1, wherein the searching for the preset keyword in the concatenated string specifically comprises:

8. An apparatus for fault-reporting text recognition, comprising:

an acquisition unit configured to acquire a plurality of text data;

a splicing unit configured to splice the plurality of text data to obtain a spliced character string and a position interval of each text data in the spliced character string; the splicing unit specifically comprises: a splicing subunit configured to splice the text data, with every two adjacent text data connected through a separator, to obtain the spliced character string; and a first obtaining subunit configured to obtain the position interval of each text data in the spliced character string according to the position of the separator in the spliced character string;

9. The apparatus according to claim 8, wherein the first determining unit specifically includes:

10. The apparatus according to claim 9, wherein the first determining subunit is specifically:

11. The apparatus according to claim 8, wherein the obtaining unit specifically includes:

12. The apparatus according to claim 8, wherein the obtaining unit specifically includes:

13. The apparatus of claim 8, further comprising: