
WO2022017245A1 - Text recognition network, neural network training method, and related device - Google Patents

Text recognition network, neural network training method, and related device

Info

Publication number
WO2022017245A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
feature
recognition
image
text
Prior art date
Application number
PCT/CN2021/106397
Other languages
English (en)
French (fr)
Inventor
刘志广
王靓伟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022017245A1

Classifications

    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 30/10: Character recognition

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a text recognition network, a method for training a neural network, and related equipment.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a manner similar to human intelligence.
  • artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • the recognition of characters in images by neural networks based on deep learning is a common application of artificial intelligence.
  • the embodiments of the present application provide a text recognition network, a method for training a neural network, and related equipment. A recognition result is generated from the semantic feature of the predicted character together with the image feature of the image to be recognized, so the recognition operation draws on features of more dimensions; moreover, because the accuracy of the predicted character is not affected by image problems such as a blurred image or occluded characters in the image to be recognized, this is beneficial to improving the accuracy of the text recognition result.
  • the embodiments of the present application provide a text recognition network, which can be used in the field of text recognition in the field of artificial intelligence.
  • the text recognition network is a neural network used to recognize characters in an image
  • the text recognition network includes an image feature extraction module, a text feature acquisition module and a recognition module.
  • the image feature extraction module is used for acquiring the to-be-recognized image, and performing feature extraction on the to-be-recognized image to generate a first feature corresponding to the first character in the to-be-recognized image.
  • the first character is a character to be recognized in the image to be recognized. The image feature extraction module in the text recognition network may specifically be a convolutional neural network, a histogram of oriented gradients, or a local binary pattern, and the image to be recognized may be an entire image, or a segmented image that includes one line or one column of characters after an image segmentation operation has been performed.
  • the text feature acquisition module is configured to acquire preset characters corresponding to the first characters in the image to be recognized, and perform text prediction according to the preset characters to generate semantic features of the first predicted characters.
  • the preset character can be a start flag character, which can be represented as a <BOS> character in a computer program, and is used to instruct the text feature acquisition module to start text prediction.
  • the recognition module is configured to combine the first feature and the semantic feature of the first predicted character, and perform a recognition operation according to the combined feature to generate a recognition result corresponding to the first character in the image to be recognized.
  • the recognition module may specifically be a classification network, and the classification network may specifically be a classifier; the classifier may be a multi-layer perceptron, or may be composed of a linear transformation matrix and a classification function.
  • in this solution, the semantic features of the predicted characters are generated according to the second characters, which correspond to the already-recognized characters among the first characters, so the recognition operation is performed according to features of more dimensions, which is beneficial to improving the accuracy of the text recognition result. Moreover, when the image to be recognized is blurred or some characters in it are occluded, the accuracy of the features of those blurred or occluded characters included in the first feature is greatly reduced; since the semantic features of the predicted characters are generated from the semantic information of the recognized characters, their accuracy is not affected by such image problems, and generating the recognition result from the semantic features of the predicted characters together with the image features is beneficial to improving the accuracy of the text recognition result.
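The three-module pipeline described above can be sketched in a few lines of Python. This is an illustrative toy, not the patented implementation: the dimensions (`T`, `d`, `V`), the random stand-in features, and the choice of element-wise addition as the combination mode are all assumptions for illustration.

```python
import numpy as np

# Assumed toy dimensions: T character positions, d-dim features, V-way vocabulary.
T, d, V = 4, 8, 10
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recognize(image_feat, semantic_feat, W, b):
    """Combine the image feature with the semantic feature of the predicted
    character (element-wise addition here), then classify each position with
    a linear transformation matrix plus a classification function (softmax)."""
    combined = image_feat + semantic_feat      # (T, d)
    probs = softmax(combined @ W + b)          # (T, V)
    return probs.argmax(axis=-1), probs

image_feat = rng.standard_normal((T, d))     # stand-in for the image feature extraction module
semantic_feat = rng.standard_normal((T, d))  # stand-in for the text feature acquisition module
W, b = rng.standard_normal((d, V)), np.zeros(V)
chars, probs = recognize(image_feat, semantic_feat, W, b)
```

Each of the `T` positions gets a distribution over the `V` character classes; the argmax per position plays the role of the recognition result.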
  • the text feature acquisition module is specifically configured to, when the recognition operation is performed on the image to be recognized for the first time, acquire a preset character corresponding to the first character in the image to be recognized, and perform text prediction according to the preset character to generate the semantic feature of the first predicted character.
  • in one case, performing the first recognition operation on the first character in the image to be recognized refers to performing the first recognition operation on an initial segment of the image to be recognized (that is, one text area of the image to be recognized).
  • in another case, performing the first recognition operation on the first character in the image to be recognized refers to the first recognition operation on the entire image to be recognized.
  • the text feature acquisition module is specifically configured to, in the case that the recognition operation has been performed on at least one of the first characters, determine the at least one recognition result corresponding to the at least one recognized character, together with the preset character, as the second characters, and to perform text prediction according to the second characters to generate the semantic features of the second predicted characters corresponding to the second characters.
  • in this solution, when performing the recognition operation on the first character in the image to be recognized for the first time, the execution device generates the semantic feature of the first predicted character according to the preset character; when the recognition operation has already been performed on at least one of the first characters, the execution device determines the at least one recognition result corresponding to the recognized characters, together with the preset character, as the second characters. This ensures the integrity of the whole recognition process and no longer requires manual intervention, which improves the user stickiness of this solution.
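The iterative process above can be sketched as a toy autoregressive loop. `predict_next` is a hypothetical stand-in for the whole predict-then-recognize step (it just emits a fixed string), and the `<BOS>`/`<EOS>` markers follow the start-flag convention mentioned earlier:

```python
# Toy sketch of the iterative recognition loop: on the first pass the
# "second characters" are only the preset <BOS> character; afterwards they
# are <BOS> followed by every recognition result obtained so far.
BOS, EOS = "<BOS>", "<EOS>"

def predict_next(second_chars):
    """Hypothetical stand-in for text prediction + recognition: emits a
    fixed target string one character per recognition operation."""
    target = "cat"
    idx = len(second_chars) - 1   # characters recognized so far (excluding <BOS>)
    return target[idx] if idx < len(target) else EOS

second_chars = [BOS]              # first recognition operation: preset character only
recognized = []
while True:
    c = predict_next(second_chars)
    if c == EOS:
        break
    recognized.append(c)
    second_chars = [BOS] + recognized   # recognized results + preset character
```

The loop terminates on its own once the stand-in predictor signals the end of the string, mirroring how the solution needs no manual intervention.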
  • the recognition module is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
  • since the text recognition network in this solution may obtain recognition results for only some of the first characters in one recognition operation, at least one of the first characters may already have been recognized before.
  • the execution device performs text prediction according to the at least one recognition result corresponding to the at least one recognized character to generate the semantic feature of the second predicted character, and performs the recognition operation according to the first feature and the semantic feature of the second predicted character, which further improves the completeness of the solution.
  • the text feature acquisition module includes a first generation submodule, configured to perform vectorization processing on each preset character in the at least one preset character to generate the character encoding of each preset character, and to generate the position encoding of each preset character according to the position, in the image to be recognized, of the first character corresponding to that preset character.
  • the combination sub-module is used to combine the character encoding of the preset character and the position encoding of the preset character to obtain the initial feature of the preset character, and to perform a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character, so as to generate the semantic feature of the first predicted character.
  • the combination mode of the character code of the preset character and the position code of the preset character is any one of the following: splicing, adding, fusing and multiplying.
  • text prediction is performed by performing a self-attention encoding operation and a self-attention decoding operation on the initial features of the preset characters, so as to generate the semantic features of the first predicted characters, and the calculation speed is fast and the complexity is low.
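A minimal numpy sketch of this submodule follows. It is a hedged illustration: the sinusoidal position code is one common choice (the text does not fix one), "adding" is used as the combination mode, and a single-head self-attention encoding step stands in for the full encode/decode stack; all dimensions and character ids are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, T = 10, 8, 3   # assumed vocabulary size, feature dim, number of preset characters

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_encoding(T, d):
    # Sinusoidal position code, one common choice for encoding character positions.
    pos, i = np.arange(T)[:, None], np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

char_ids = np.array([0, 3, 7])        # vectorized preset characters (made-up ids)
embed = rng.standard_normal((V, d))   # character encoding table

# Initial feature: character encoding combined with position encoding by addition.
initial = embed[char_ids] + position_encoding(T, d)

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
semantic = self_attention(initial, Wq, Wk, Wv)   # one self-attention encoding step
```

The attention step is a handful of matrix multiplies and one softmax per sequence, which is why the text can claim fast computation and low complexity.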
  • the identification module includes: a calculation sub-module, configured to calculate the similarity between the first feature and the semantic feature of the first predicted character.
  • the similarity can be obtained by calculating the cosine similarity, the Euclidean distance or the Mahalanobis distance between the first feature and the semantic feature of the first predicted character, or by a dot product operation between the two.
  • the similarity may include one similarity value, or two similarity values that are transposes of each other.
  • the second generating sub-module is configured to generate the second feature and the third feature according to the first feature, the semantic feature of the first predicted character and the similarity, wherein the second feature combines the semantic feature of the first predicted character into the first feature, and the third feature combines the first feature into the semantic feature of the first predicted character.
  • the second generating submodule is further configured to combine the second feature and the third feature, and perform a recognition operation according to the combined feature to generate a recognition result.
  • in this solution, the similarity between the first feature and the semantic feature of the first predicted character is calculated, and the second feature and the third feature are then generated according to that similarity. The second feature combines the semantic feature of the predicted character on the basis of the first feature, and the third feature combines the first feature on the basis of the semantic feature of the predicted character; that is, the semantic features of the predicted characters are not only used to enhance the image features of the characters to be recognized, but the image features of the characters to be recognized are also integrated into the semantic features of the predicted characters. This is conducive to the full fusion of image features and predicted-character features, and to improving the accuracy of the text recognition result.
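One way to read this bidirectional fusion is as a pair of attention-style passes driven by the same similarity matrix and its transpose. The sketch below is a hedged interpretation rather than the patented formula: the dot product is used as the similarity (one of the options listed), splicing as the final combination mode, and all dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 3, 6   # assumed number of character positions and feature dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

F = rng.standard_normal((T, d))   # first feature (image features of the characters)
S = rng.standard_normal((T, d))   # semantic features of the predicted characters

sim = F @ S.T                               # dot-product similarity; sim.T is its transpose
second = F + softmax(sim, axis=1) @ S       # second feature: image enhanced with semantics
third = S + softmax(sim.T, axis=1) @ F      # third feature: semantics enhanced with image
combined = np.concatenate([second, third], axis=-1)  # splicing as the combination mode
```

Because `sim` and `sim.T` weight the two passes, each direction of the fusion uses the same measured affinity between image and semantic features.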
  • the text recognition network further includes a feature update module, the feature update module is configured to: combine the feature of the preset character with the first feature to generate the updated first feature;
  • the feature of the preset character may be an initial feature of the preset character or an updated feature of the preset character.
  • the first feature includes the image features of a plurality of first characters, at least one of the plurality of first characters is a character on which the recognition operation has already been performed, and the preset characters include the recognition results corresponding to the plurality of recognized characters.
  • correspondingly, the features of the preset characters include the features of the recognition results corresponding to the recognized characters.
  • the updated first feature therefore enhances, relative to the original first feature, the features of the recognized characters.
  • the recognition module is specifically configured to perform a recognition operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the to-be-recognized image.
  • in this solution, the semantic features of the recognized characters are integrated into the image features, so that the features of the recognized characters are more prominent in the image features. The recognition module can then concentrate on the characters that have not yet been recognized, which reduces the difficulty of a single recognition pass and is conducive to improving the accuracy of text recognition.
  • the feature update module is specifically configured to: perform a self-attention encoding operation according to the initial features of the preset characters to obtain the updated features of the preset characters, and perform a self-attention encoding operation according to the first feature and the updated features of the preset characters to generate the updated first feature.
  • the self-attention encoding method is adopted to combine the features of the preset characters with the first features, which is beneficial to realize the sufficient combination of the features of the preset characters and the first features, and has low complexity and is easy to implement.
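A minimal sketch of this two-stage self-attention update follows. It is an assumption-laden toy: single-head attention, one shared set of projection matrices for both stages (a simplification), a single residual connection, and made-up dimensions.

```python
import numpy as np

rng = np.random.default_rng(3)
T_img, T_chr, d = 4, 2, 6   # assumed: image positions, preset characters, feature dim

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    # Queries from q_in, keys/values from kv_in; with q_in == kv_in this is self-attention.
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

preset = rng.standard_normal((T_chr, d))                # initial features of preset characters
updated_preset = attention(preset, preset, Wq, Wk, Wv)  # stage 1: self-attention encoding

first = rng.standard_normal((T_img, d))                 # first feature (image feature)
updated_first = first + attention(first, updated_preset, Wq, Wk, Wv)  # stage 2: combine
```

The residual addition in the last line keeps the original image feature while folding in the preset-character features, matching the "combine" wording above.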
  • when the granularity of the recognition operation performed by the text recognition network is the character, the first characters include at least one character, and one recognition result output by one recognition operation of the text recognition network includes one character.
  • the granularity of the recognition operation performed by the text recognition network can be characters or words, which expands the application scenarios of this solution and improves the implementation flexibility of this solution.
  • the embodiments of the present application provide a method for training a text recognition network, which can be used in the field of text recognition in the field of artificial intelligence.
  • the text recognition network is a neural network used to recognize characters in an image, and the text recognition network includes an image feature extraction module, a text feature acquisition module and a recognition module.
  • the method includes: a training device inputs the image to be recognized into the image feature extraction module, and performs feature extraction on the image to be recognized to generate the first feature corresponding to the first character in the image to be recognized, wherein the first character is a character to be recognized in the image to be recognized; and inputs the preset character corresponding to the first character in the image to be recognized into the text feature acquisition module, and performs text prediction according to the preset character to generate the semantic feature of the first predicted character.
  • the training device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
  • the training device trains the text recognition network according to the correct result corresponding to the first character in the image to be recognized, the recognition result, and a loss function, where the loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
  • the loss function may specifically be a cross-entropy loss function, a focal loss function or a center loss function.
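For illustration, here is one cross-entropy training step repeated on a toy linear recognition head. The features, labels, dimensions and learning rate are all invented, and the manually derived softmax/cross-entropy gradient stands in for the automatic differentiation a real training device would use.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, V = 3, 5, 7   # assumed: character positions, feature dim, vocabulary size

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

combined = rng.standard_normal((T, d))   # stand-in for the combined features
W = 0.1 * rng.standard_normal((d, V))    # recognition head (linear transformation matrix)
labels = np.array([2, 5, 1])             # correct results for the first characters

lr = 0.5
for _ in range(200):
    probs = softmax(combined @ W)
    loss = -np.log(probs[np.arange(T), labels]).mean()   # cross-entropy loss
    grad = probs.copy()
    grad[np.arange(T), labels] -= 1.0                    # d(loss)/d(logits)
    W -= lr * combined.T @ grad / T                      # gradient-descent update
```

After a few hundred updates the head's argmax matches the correct results on this toy data, which is all the loss function asks: pull the recognition result toward the correct result.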
  • the second aspect of the embodiments of the present application may also perform the steps in the various possible implementations of the first aspect.
  • for the specific implementation steps of the second aspect and its various possible implementations, and for the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the various possible implementations of the first aspect; details are not repeated here.
  • the embodiments of the present application provide a text recognition method, which can be used in the field of text recognition in the field of artificial intelligence.
  • the method includes: the execution device inputs the image to be recognized into the image feature extraction module, and performs feature extraction on the image to be recognized to generate the first feature corresponding to the first character in the image to be recognized, wherein the first character is a character to be recognized in the image to be recognized; and inputs the preset character corresponding to the first character in the image to be recognized into the text feature acquisition module, and performs text prediction according to the preset character to generate the semantic feature of the first predicted character.
  • the execution device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
  • the image feature extraction module, the text feature acquisition module and the recognition module belong to the same text recognition network.
  • the third aspect of the embodiments of the present application may also perform the steps in the various possible implementations of the first aspect.
  • for the specific implementation steps of the third aspect and its various possible implementations, and for the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the various possible implementations of the first aspect; details are not repeated here.
  • an embodiment of the present application provides a training device for a text recognition network.
  • the text recognition network is a neural network for recognizing characters in an image.
  • the text recognition network includes an image feature extraction module, a text feature acquisition module, and a recognition module.
  • the training device of the text recognition network includes: an input unit for inputting the image to be recognized into the image feature extraction module and performing feature extraction on the image to be recognized to generate the first feature corresponding to the first character in the image to be recognized, wherein the first character is a character to be recognized in the image to be recognized; the input unit is further configured to input the preset character corresponding to the first character in the image to be recognized into the text feature acquisition module and perform text prediction according to the preset character, to generate the semantic feature of the first predicted character; a recognition unit for performing the recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character, to generate the recognition result corresponding to the first character in the image to be recognized; and a training unit for training the text recognition network according to the correct result corresponding to the first character in the image to be recognized, the recognition result, and a loss function, where the loss function indicates the similarity between the correct result and the recognition result corresponding to the first character in the image to be recognized.
  • the fourth aspect of the embodiments of the present application may also perform the steps in the various possible implementations of the second aspect.
  • an embodiment of the present application provides an execution device, which may include a processor coupled with a memory that stores program instructions; when the program instructions stored in the memory are executed by the processor, the steps performed by the text recognition network described in the first aspect above are implemented.
  • an embodiment of the present application provides a training device, which may include a processor coupled to a memory that stores program instructions; when the program instructions stored in the memory are executed by the processor, the training method of the text recognition network described in the second aspect above is implemented.
  • an embodiment of the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program runs on a computer, it causes the computer to perform the steps performed by the text recognition network described in the first aspect above, or to perform the training method of the text recognition network described in the second aspect above.
  • an embodiment of the present application provides a circuit system, the circuit system includes a processing circuit, and the processing circuit is configured to perform the steps performed by the text recognition network described in the first aspect above, or to perform the training method of the text recognition network described in the second aspect above.
  • an embodiment of the present application provides a computer program that, when running on a computer, causes the computer to perform the steps performed by the text recognition network described in the first aspect above, or to perform the training method of the text recognition network described in the second aspect above.
  • an embodiment of the present application provides a chip system, where the chip system includes a processor for implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory for storing necessary program instructions and data of the server or the communication device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic structural diagram of an artificial intelligence main frame provided by an embodiment of the present application.
  • FIG. 2 is a system architecture diagram of a text recognition system provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a workflow of a text recognition network provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of generating a fourth feature in a workflow of a text recognition network provided by an embodiment of the present application
  • FIG. 5 is a schematic flowchart of generating a fifth feature and a sixth feature in a workflow of a text recognition network provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a network architecture of a text recognition network provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a method for training a text recognition network provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a beneficial effect of a text recognition network provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a text recognition network provided by an embodiment of the present application.
  • FIG. 10 is another schematic structural diagram of a text recognition network provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a training device for a text recognition network provided by an embodiment of the application.
  • FIG. 12 is another schematic structural diagram of the apparatus for training a text recognition network provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a training device provided by an embodiment of the application.
  • FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the embodiments of the present application provide a text recognition network, a method for training a neural network, and related equipment. A recognition result is generated from the semantic feature of the predicted character together with the image feature of the image to be recognized, so the recognition operation draws on features of more dimensions; moreover, because the accuracy of the predicted character is not affected by image problems such as a blurred image or occluded characters in the image to be recognized, this is beneficial to improving the accuracy of the text recognition result.
  • Figure 1 shows a schematic structural diagram of the main frame of artificial intelligence.
  • the above-mentioned artificial intelligence theme framework is explained from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data goes through the progression of "data - information - knowledge - wisdom".
  • the "IT value chain" reflects the value brought by artificial intelligence to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) up to the industrial ecology of the system.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. The infrastructure communicates with the outside through sensors; computing power is provided by smart chips, including but not limited to hardware acceleration chips such as the central processing unit (CPU), the embedded neural-network processing unit (NPU), the graphics processing unit (GPU), the application-specific integrated circuit (ASIC) and the field programmable gate array (FPGA); the basic platform includes the distributed computing framework, the network, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
• Based on the results of data processing, some general capabilities can be formed, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, etc.
• Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, the productization of intelligent information decision-making, and the realization of landing applications. Their application areas mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, smart healthcare, smart security, autonomous driving, smart city, etc.
  • the embodiments of the present application can be applied to various fields of artificial intelligence, and can be specifically applied to various scenarios where characters in images need to be recognized.
  • the aforementioned images are images collected by devices such as cameras, printers, and scanners.
  • an enterprise needs to scan documents such as receipts or invoices to obtain image files, and recognize characters in the image files to extract text information.
  • functions such as file digital filing, file fast indexing or file analysis can be realized.
• For another example, if a user needs to input the information on an ID card, driver's license, vehicle license or passport, the user can use a camera to capture an image of the aforementioned certificate and recognize the characters in the image to extract the key information.
  • the examples here are only to facilitate the understanding of the application scenarios of the embodiments of the present application, and the application scenarios of the embodiments of the present application are not exhaustive. In the aforementioned scenarios, there may be a possibility that the image quality is low. Therefore, the image needs to be recognized through the text recognition network provided in the embodiment of the present application, so as to improve the accuracy of the recognition result.
  • the text recognition system provided by the embodiment of the present application is first introduced in the embodiment of the present application with reference to FIG. 2 .
• FIG. 2 is a schematic diagram of the system architecture of the text recognition system provided by the embodiment of the present application.
  • the text recognition system 200 includes an execution device 210 , a training device 220 , a database 230 and a data storage system 240 , and the execution device 210 includes a computing module 211 .
  • a training data set is stored in the database 230, and the training data set may include a plurality of images to be recognized and a correct result corresponding to the first character in each image to be recognized.
  • the training device 220 generates a target model/rule 201 for processing sequence data, and performs iterative training on the target model/rule 201 by using the training data set in the database to obtain a mature target model/rule 201 .
  • the execution device 210 may call data, codes, etc. in the data storage system 240 , and may also store data, instructions, etc. in the data storage system 240 .
  • the data storage system 240 may be configured in the execution device 210 , or the data storage system 240 may be an external memory relative to the execution device 210 .
• The computing module 211 can perform, through the mature target model/rule 201, a recognition operation on the image to be recognized that is input into the execution device 210, to obtain the recognition result of the first character in the image to be recognized.
• FIG. 2 is only a schematic structural diagram of the text recognition system provided by an embodiment of the present application, and the positional relationship among the devices, modules, etc. shown in the figure does not constitute any limitation.
• The execution device 210 and the client device may be separate devices: the execution device 210 is configured with an input/output interface to perform data interaction with the client device; the client device inputs the captured image through the input/output interface, and the execution device 210 returns the processing result to the client device through the input/output interface.
  • an embodiment of the present application provides a text recognition network.
  • the text recognition network includes an image feature extraction module, a text feature acquisition module, and a recognition module.
  • the image feature extraction module is used to extract the image features of the first character in the image to be recognized.
• The text feature acquisition module is used to perform text prediction according to the semantic information of the preset character corresponding to the first character in the image to be recognized, to obtain the semantic feature of the predicted character; the recognition module then performs the recognition operation according to the image feature of the first character in the image to be recognized and the semantic feature of the predicted character, to generate the recognition result.
• The accuracy of the predicted character is therefore not affected by image problems such as a blurred image or some characters in the image to be recognized being occluded.
  • the embodiment of the present application includes an inference phase and a training phase, and the processes of the inference phase and the training phase are different. The following describes the inference phase and the training phase respectively.
  • FIG. 3 is a schematic flowchart of a workflow of a text recognition network provided by an embodiment of the present application. The method may include:
  • the execution device inputs the to-be-recognized image to an image feature extraction module, and performs feature extraction on the to-be-recognized image to generate a first feature corresponding to a first character in the to-be-recognized image.
• Specifically, after acquiring the to-be-recognized image, the execution device inputs the to-be-recognized image into the image feature extraction module of the text recognition network, so as to perform feature extraction on the to-be-recognized image through the image feature extraction module and generate a first feature corresponding to the first character in the to-be-recognized image, where the first character is a character to be recognized in the image to be recognized.
• The image feature extraction module in the text recognition network can specifically be expressed as a convolutional neural network, a histogram of oriented gradients (HOG), a local binary pattern (LBP), or another method for image processing.
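Where the image feature extraction module is a convolutional neural network, its core operations are convolution and pooling. The following is a minimal numpy sketch of one convolution-plus-max-pooling stage; the image size, kernel size, and random values are illustrative assumptions, not the network described in the embodiments:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid cross-correlation of a 2-D image with a 2-D kernel.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature, size=2):
    # Non-overlapping max pooling, halving each spatial dimension.
    h, w = feature.shape
    h, w = h - h % size, w - w % size
    f = feature[:h, :w].reshape(h // size, size, w // size, size)
    return f.max(axis=(1, 3))

# An 8x16 "text line" image processed by one conv + max-pool stage.
image = np.random.rand(8, 16)
kernel = np.random.rand(3, 3)
feature_map = max_pool(conv2d(image, kernel))
```

A real extraction module stacks several such stages (with learned kernels) to produce the first feature of the first character.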
• An image to be recognized may include one or more rows of first characters, or one or more columns of first characters. If the granularity of the recognition operation performed by the text recognition network is characters, that is, each time the execution device performs a recognition operation through the text recognition network it obtains the recognition result of one character in the image to be recognized, then a first character includes one or more characters. As an example, if a first character included in the image to be recognized is "cat", the text recognition network performs a recognition operation to obtain the recognition result of the character "c".
• As another example, if a first character included in the image to be recognized is "the weather is awesome today", the text recognition network performs one recognition operation to generate the feature of the character "jin" (the first character of the sentence in Chinese) and outputs the recognition result corresponding to the character "jin".
  • a first character includes one or more words.
  • a first character included in the image to be recognized is "how are you”
  • the text recognition network performs a recognition operation to obtain a recognition result of the word "how”.
  • a first character included in the image to be recognized is "the weather is great today”
  • the text recognition network obtains a recognition result of the word "today” every time a recognition operation is performed, etc.
• In one implementation, the execution device will perform image segmentation on the to-be-recognized image to generate at least one segmented to-be-recognized image (that is, to divide the to-be-recognized image into at least one text area). If an image to be recognized includes one or more rows of first characters, each segmented image to be recognized (that is, each text area) includes one row of first characters; if an image to be recognized includes one or more columns of first characters, each segmented image to be recognized includes one column of first characters.
  • the text recognition network is further configured with an image segmentation module, and the execution device performs image segmentation on the image to be recognized through the image segmentation module of the text recognition network to obtain at least one segmented image to be recognized.
  • a first neural network for image segmentation may also be configured, then the execution device performs image segmentation on the image to be recognized through the first neural network to obtain at least A segmented image to be recognized.
  • the image segmentation module in the text recognition network or the first neural network for image segmentation can be specifically expressed as a shape robust text detection network based on a progressive scale expansion network (shape Robust Text Detection with Progressive Scale Expansion Network, PSENet), rCTPN, ASTER or other neural networks for image segmentation, etc., which are not limited here.
• In this case, step 301 may include: the execution device inputs the segmented image to be recognized into the image feature extraction module, and performs feature extraction on the segmented to-be-recognized image to generate the first feature of the first character corresponding to the segmented to-be-recognized image, where the segmented image is a text area in the image to be recognized.
• Here, a first feature refers to the feature of one segmented image to be recognized, and includes the image features of one row of first characters (that is, of one text area in the image to be recognized) or the image features of one column of first characters.
• In another implementation, when an image to be recognized includes multiple rows of first characters or multiple columns of first characters, after acquiring the image to be recognized, the execution device inputs the entire image to be recognized into the image feature extraction module of the text recognition network, and feature extraction is performed on the entire to-be-recognized image to generate the first feature corresponding to the first characters in the to-be-recognized image.
• In this case, a first feature refers to the feature of the entire image to be recognized. If an image to be recognized includes one row or one column of first characters, a first feature is the image feature of that row or column of first characters; if an image to be recognized includes multiple rows or columns of first characters, a first feature is the image feature of those multiple rows or columns of first characters in the image to be recognized.
  • the execution device inputs a preset character corresponding to the first character in the image to be recognized into a text feature acquisition module, and performs text prediction according to the preset character to generate a semantic feature of the first predicted character.
• Specifically, when the execution device performs the first recognition operation on the first character in the image to be recognized, it acquires the preset character corresponding to the first character in the image to be recognized and inputs the preset character into the text feature acquisition module of the text recognition network, and the text feature acquisition module performs text prediction according to the preset character to generate the semantic feature of the first predicted character.
• If the executing device has performed image segmentation on the to-be-recognized image, performing the first recognition operation on the first character in the to-be-recognized image refers to performing the first recognition operation on a segmented to-be-recognized image (that is, on a text area of the to-be-recognized image). If the executing device has not performed image segmentation on the entire to-be-recognized image, performing the first recognition operation on the first character in the to-be-recognized image refers to the first recognition operation on the entire to-be-recognized image.
• The preset character can be a start flag character, which can be represented as a <BOS> character in a computer program, and is used to instruct the text feature acquisition module to start text prediction.
  • the expression form of the preset character is predefined, and specifically, it can be expressed as a vector including N elements, and each element of the N elements is a certain value. Further, N is an integer greater than or equal to 1.
  • the preset character may specifically be a vector including 32 1s, or the preset character may specifically be a vector including 64 2s, etc., which will not be exhaustive here.
  • the text feature acquisition module of the text recognition network may include an encoding module and a decoding module.
  • the encoding module is used to extract the text features of the input characters
  • the decoding module is used to generate the text features of the predicted characters according to the text features of the input characters.
• The encoding module may be the encoder in a recurrent neural network (RNN), and the decoding module may be the decoder in a recurrent neural network; as an example, the encoding module and the decoding module may be the encoding module and decoding module in a long short-term memory (LSTM) network.
• Alternatively, the encoding module may be a self-attention encoding module, and the decoding module may be a self-attention decoding module; as an example, the encoding module and the decoding module may be the self-attention encoding module and self-attention decoding module of a bidirectional encoder representations from transformers (BERT) neural network. The encoding module and the decoding module may also be the encoding module and decoding module in other neural networks used for text prediction, which will not be exhaustively listed here.
  • the encoding module and the decoding module in the text feature acquisition module are a self-attention encoding module and a self-attention decoding module, respectively.
• Step 302 may include: the execution device converts the preset character from character form into tensor form through the text feature acquisition module to generate the character code of the preset character, and generates the position code of the preset character according to the position of the preset character in the first character of the image to be recognized; the character code of the preset character and the position code of the preset character are combined to obtain the initial feature of the preset character. Further, the execution device performs a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character through the text feature acquisition module, so as to generate the semantic feature of the first predicted character.
  • text prediction is performed by performing a self-attention encoding operation and a self-attention decoding operation on the initial features of the preset characters, so as to generate the semantic features of the first predicted characters, the calculation speed is fast, and the complexity Low.
  • the execution device may perform vectorization (embedding) processing on the preset characters through the text feature acquisition module, so as to generate character codes of the preset characters.
• Alternatively, the execution device may obtain the one-hot encoding of the preset character and determine the one-hot encoding of the preset character as the character encoding of the preset character, etc.
  • the process of generating the character encoding of the preset character is not limited here.
  • the character code of the preset character may be a vector including M elements, and the value of M is related to what kind of neural network is used by the text feature acquisition module of the text recognition network, which is not limited here.
• The position of the preset character in the first character of the image to be recognized is the first position, so the position code of the preset character indicates that the position of the preset character is the first position; the position code of the preset character may also be a vector including M elements.
• As an example, if the value of M is 512, the position code of the preset character can be a vector including one 1 and 511 0s, where the 1 is located at the first position, indicating that the position of the preset character in the first character of the image to be recognized is the first position.
  • the execution device can also perform secondary conversion on the aforementioned 512 elements through a cosine function.
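The character code, one-hot position code, and cosine secondary conversion described above can be sketched as follows. The feature width M = 8 (instead of 512), the embedding table, and combining the two codes by addition are illustrative assumptions:

```python
import numpy as np

M = 8  # feature width; the document uses 512, shrunk here for illustration

def character_code(char_id, vocab_size=4):
    # Hypothetical embedding-table lookup standing in for the
    # vectorization (embedding) processing of the preset character.
    rng = np.random.default_rng(0)
    table = rng.standard_normal((vocab_size, M))
    return table[char_id]

def position_code(position):
    # One-hot position code: a single 1 at the character's position,
    # followed by a secondary conversion of the M elements through
    # a cosine function, as the document describes.
    one_hot = np.zeros(M)
    one_hot[position] = 1.0
    return np.cos(one_hot)

BOS = 0  # index of the start-flag character <BOS> in the hypothetical table
# Combining character code and position code (here by addition, an assumption)
# yields the initial feature of the preset character.
initial_feature = character_code(BOS) + position_code(0)
```

The initial feature is then fed to the self-attention encoding and decoding operations of the text feature acquisition module.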
• A process of generating the semantic feature of the first predicted character: after the execution device obtains the initial feature of the preset character, it performs text prediction through the text feature acquisition module of the text recognition network, that is, it performs a self-attention encoding operation on the initial feature of the preset character to generate the updated feature of the preset character, and performs a self-attention decoding operation on the updated feature of the preset character to generate the semantic feature of the first predicted character.
  • Step 302 may include: the execution device converts the preset character from the character form to the tensor form through the text feature acquisition module, so as to generate the character code of the preset character, and determine the character code of the preset character as the initial feature of the preset character . Further, the execution device performs an encoding operation and a decoding operation according to the initial feature of the preset character through the text feature acquisition module, so as to generate the semantic feature of the first predicted character.
  • step 302 can be modified correspondingly, which is not exhaustive here.
  • step 301 may be executed first, and then step 302 may be executed, or step 302 may be executed first, and then step 301 may be executed, or step 301 and step 302 may be executed simultaneously.
  • the execution device combines the feature of the preset character with the first feature through the feature update module to generate a fourth feature.
• Specifically, the execution device may further combine the feature of the preset character with the first feature to generate a fourth feature, where the fourth feature is the updated first feature.
  • the feature of the preset character may be an updated feature of the preset character, or may be an initial feature of the preset character.
• In one implementation, the execution device performs a self-attention encoding operation according to the initial feature of the preset character through the feature update module of the text recognition network to obtain the updated feature of the preset character, and then performs a self-attention encoding operation according to the first feature and the updated feature of the preset character to generate the fourth feature.
• Formula (1) can be expressed as Q′_char = softmax(Q_char·K_char)·V_char, where Q_char is obtained by multiplying the initial feature of the preset character with the first transformation matrix, K_char is obtained by multiplying the initial feature of the preset character with the second transformation matrix, and V_char is obtained by multiplying the initial feature of the preset character with the third transformation matrix; Q_char·K_char represents the dot product of Q_char and K_char, softmax(Q_char·K_char)·V_char represents the dot product of softmax(Q_char·K_char) and V_char, and Q′_char represents the updated feature of the preset character. The first transformation matrix, the second transformation matrix and the third transformation matrix may be the same or different. It should be understood that the example in formula (1) is only for easier understanding of the solution, and is not used to limit the solution.
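A minimal numpy sketch of the self-attention encoding operation of formula (1). Treating the dot product of Q_char and K_char as the matrix product Q·Kᵀ, and the matrix sizes and random values, are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max-subtraction for stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W1, W2, W3):
    # Formula (1): Q'_char = softmax(Q_char · K_char) · V_char, where
    # Q, K, V come from multiplying the initial features of the preset
    # characters with the first, second and third transformation matrices.
    Q = X @ W1
    K = X @ W2
    V = X @ W3
    return softmax(Q @ K.T) @ V

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 8))          # initial features of 3 characters
W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))
updated = self_attention(X, W1, W2, W3)  # updated character features Q'_char
```

Each output row is a convex combination of the value rows, weighted by the normalized character-to-character similarities.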
• Formula (2) can be expressed as Q′_img = softmax(Q′_char·K_img)·V_img, where Q′_img represents the fourth feature, Q′_char represents the updated feature of the preset character, K_img is obtained by multiplying the first feature with the fourth transformation matrix, and V_img is obtained by multiplying the first feature with the fifth transformation matrix. The fourth transformation matrix and the fifth transformation matrix may be the same or different. It should be understood that the example in formula (2) is only for easier understanding of the solution, and is not used to limit the solution.
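Formula (2) can be sketched in the same way; the number of image-feature vectors, the feature width, and the use of Kᵀ for the dot product are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(Q_char_updated, F, W4, W5):
    # Formula (2): Q'_img = softmax(Q'_char · K_img) · V_img, where K_img
    # and V_img come from multiplying the first feature (image feature)
    # with the fourth and fifth transformation matrices.
    K_img = F @ W4
    V_img = F @ W5
    return softmax(Q_char_updated @ K_img.T) @ V_img

rng = np.random.default_rng(2)
Qc = rng.standard_normal((3, 8))   # updated features of the preset characters
F = rng.standard_normal((10, 8))   # first feature: 10 image-feature vectors
W4, W5 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
fourth_feature = cross_attention(Qc, F, W4, W5)  # the updated first feature
```

Here the character features act as queries over the image features, which is how the feature update module mixes semantic and visual information.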
  • FIG. 4 is a schematic flowchart of generating the fourth feature in the workflow of the text recognition network provided by the embodiment of the application.
• In FIG. 4, the case where the text recognition network performs one recognition operation to obtain the recognition result of one character, that is, where a second character includes one character, is taken as an example.
  • the execution device inputs the image to be recognized into the image feature extraction module of the text recognition network, and obtains the image feature of the first character in the image to be recognized (that is, the first feature of the first character in the image to be recognized),
• In FIG. 4, the image feature extraction module including multiple convolution layers and multiple pooling layers is taken as an example, and max pool refers to maximum pooling. As shown in FIG. 4, the execution device generates the character code and position code of the preset character to obtain the initial feature of the preset character, and generates the updated feature Q′_char of the preset character through the above formula (1).
  • the executing device executes self-attention encoding through the above formula (2).
• It should be noted that more neural networks can also be set in the feature update module of the text recognition network; for example, the feature update module can also be provided with a feedforward neural network, a regularization module, etc. FIG. 4 is only an example to facilitate understanding of this scheme, and is not intended to limit this scheme.
  • the execution device performs a self-attention encoding operation according to the first feature and the initial feature of the preset character through the feature update module of the text recognition network to generate the fourth feature.
• In another implementation, the execution device performs an encoding operation according to the initial feature of the preset character through the feature update module of the text recognition network to obtain the updated feature of the preset character, and performs an encoding operation according to the first feature and the updated feature of the preset character to generate the fourth feature. Further, the feature update module of the text recognition network performs the encoding operation through an encoder, which is an encoder in a recurrent neural network.
  • the execution device performs an encoding operation according to the first feature and the initial feature of the preset character through the feature update module of the text recognition network to generate the fourth feature.
  • the execution device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a first recognition result.
  • the execution device combines the first feature and the semantic feature of the first predicted character through the recognition module, and performs a recognition operation according to the combined feature to generate a first recognition result. If the granularity of the recognition operation performed by the text recognition network is characters, a first recognition result is a character recognized by the text recognition network. If the granularity of the recognition operation performed by the text recognition network is words, a first recognition result is a word recognized by the text recognition network.
  • step 303 is an optional step. If step 303 is executed, step 304 includes: the executing device combines the fourth feature (that is, the updated first feature) and the fifth feature through the identification module, and according to the combination The subsequent features perform a recognition operation to generate a first recognition result.
• In one case, the execution device directly combines the fourth feature (that is, the updated first feature) and the fifth feature through the recognition module by means of splicing, matrix multiplication or another combination method.
  • the execution device performs a combination operation of the fourth feature and the semantic feature of the first predicted character according to the similarity between the fourth feature and the semantic feature of the first predicted character through the identification module.
• In another case, the execution device calculates a first similarity between the fourth feature and the semantic feature of the first predicted character through the recognition module; generates a fifth feature according to the fourth feature, the semantic feature of the first predicted character and the first similarity; and generates a sixth feature according to the fourth feature, the semantic feature of the first predicted character and the first similarity. Further, the fifth feature and the sixth feature are combined by the recognition module.
• The first similarity can be obtained by calculating the cosine similarity, Euclidean distance, Mahalanobis distance, etc. between the fourth feature and the semantic feature of the first predicted character, or the first similarity can be obtained by performing a dot product operation on the fourth feature and the semantic feature of the first predicted character.
  • the first similarity may include one similarity value, or may be two transposed similarity values.
• The fifth feature combines the semantic feature of the first predicted character on the basis of the fourth feature, and the sixth feature combines the fourth feature on the basis of the semantic feature of the first predicted character.
  • the manner in which the fifth feature and the sixth feature are combined includes, but is not limited to, splicing, addition, multiplication, or other combination manners, which are not exhaustive here.
• FIG. 5 is a schematic flowchart of combining the fourth feature and the semantic feature of the first predicted character in the workflow of the text recognition network; in FIG. 5, generating the first similarity by means of dot product is taken as an example. K_vis represents the fourth feature (that is, the updated first feature), and Q_lin represents the semantic feature of the first predicted character. P_lin is obtained by point-multiplying Q_lin with a first weight, and P_vis is obtained by point-multiplying K_vis with a second weight; the first weight and the second weight are determined in the training stage of the text recognition network. S_vis represents the similarity of the fourth feature to the first predicted character and is calculated as S_vis = softmax(P_vis·Q_lin^T/√d); S_lin represents the similarity of the first predicted character to the fourth feature and is calculated as S_lin = softmax(P_lin·K_vis^T/√d), where d represents the number of dimensions of the feature, that is, the number of elements included in the fourth feature or the fifth feature. The fifth feature combines the semantic feature of the first predicted character on the basis of the fourth feature; in FIG. 5, the fifth feature is obtained by splicing K_vis with the result of point-multiplying S_lin and the semantic feature of the first predicted character. The sixth feature combines the fourth feature on the basis of the semantic feature of the first predicted character; in FIG. 5, the sixth feature is obtained by splicing Q_lin with the result of point-multiplying S_vis and K_vis. It should be understood that FIG. 5 is only an example to facilitate understanding of the solution, and is not used to limit the solution.
• The process of performing a recognition operation on the combined feature: after the execution device combines the fifth feature and the sixth feature through the recognition module, the combined feature is input into the classification network in the recognition module, so as to perform the recognition operation through the classification network and obtain the first recognition result output by the entire recognition module.
• The classification network can be specifically expressed as a classifier; the classifier can be a multi-layer perceptron (MLP), or can be composed of a linear transformation matrix and a softmax classification function, etc. The specific form of the classification network is not limited here.
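A minimal sketch of a classification network composed of a linear transformation matrix and a softmax classification function; the character set, matrix sizes, and random parameters are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classify(combined_feature, W, b, charset):
    # Linear transformation followed by a softmax classification function,
    # mapping the combined feature to a probability over the character set;
    # the recognition result is the most probable character.
    probs = softmax(combined_feature @ W + b)
    return charset[int(np.argmax(probs))], probs

charset = ["a", "b", "c", "<EOS>"]  # hypothetical character set
rng = np.random.default_rng(4)
feat = rng.standard_normal(32)      # combined fifth + sixth feature
W = rng.standard_normal((32, len(charset)))
b = np.zeros(len(charset))
char, probs = classify(feat, W, b, charset)
```

In training, W and b would be learned jointly with the rest of the text recognition network; an MLP classifier simply adds hidden layers before this final projection.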
• If step 303 is not executed, step 304 includes: the executing device combines the first feature obtained in step 301 with the semantic feature of the first predicted character through the recognition module, and performs a recognition operation according to the combined feature to generate the first recognition result.
• The execution device inputs the second character, corresponding to the recognized characters in the first character, into the text feature acquisition module, and performs text prediction according to the second character to generate the semantic feature of the second predicted character.
  • the specific implementation of step 305 is similar to that of step 302.
  • when the execution device has performed a recognition operation on at least one character in the first character, it will obtain at least one second character corresponding to the recognized character. Specifically, the execution device determines the at least one recognition result corresponding to all the recognized characters in the first character as the at least one second character corresponding to the recognized characters in the first character. In the embodiment of the present application, in the case that a recognition operation has been performed on at least one character in the first character, the execution device determines the preset character and the at least one recognition result corresponding to the recognized characters in the first character as the at least one second character corresponding to the recognized characters in the first character.
  • when the first character in the to-be-recognized image is recognized for the first time, the execution device generates the semantic feature of the first predicted character according to the preset character, which ensures the integrity of the solution; the entire recognition process no longer requires manual intervention, which improves the user stickiness of the solution.
  • step 305 includes: the execution device determines the first recognition result and the preset character as the second characters corresponding to the recognized characters in the first character. If the execution device enters step 305 through step 307, step 305 includes: the execution device determines the preset character, the first recognition result and the at least one second recognition result as a plurality of second characters corresponding to the recognized characters in the first character.
  • when the granularity of the recognition operation performed by the text recognition network is characters, the first character is a word including at least one character, one recognition result includes one character, and each second character includes one character; when the granularity of the recognition operation performed by the text recognition network is words, the first character includes at least one word, one recognition result is a word including one or more characters, and each second character is a word including one or more characters.
  • the granularity of the recognition operation performed by the text recognition network may be characters or words, which expands the application scenarios of the solution and improves the implementation flexibility of the solution.
  • the execution device inputs all the second characters corresponding to all the recognized characters in the first character into the text feature acquisition module of the text recognition network, so as to perform text prediction according to all the second characters through the encoding module and the decoding module in the text feature acquisition module, to generate the semantic feature of the first predicted character.
  • step 305 may include: the execution device converts any second character of the at least one second character from a character form to a tensor form through the text feature acquisition module, so as to generate the character code of the second character, and generates the position code of the second character according to the position of that second character in the first character in the image to be recognized. The execution device combines the character code and the position code of any second character in the at least one second character through the text feature acquisition module of the text recognition network to obtain the initial feature of the second character.
  • the execution device performs the foregoing operations on each of the at least one second character through the text feature acquisition module of the text recognition network, thereby generating the initial feature of each of the at least one second character. Further, the execution device performs a self-attention encoding operation and a self-attention decoding operation according to the initial features of the second characters through the text feature acquisition module, so as to generate the semantic feature of the first predicted character.
  • step 305 may alternatively include: the execution device converts each second character in the at least one second character from a character form to a tensor form through the text feature acquisition module, so as to generate the character code of each second character, and determines the character code of each second character as the initial feature of that second character.
  • the execution device performs an encoding operation and a decoding operation according to the initial features of all the second characters in the at least one second character through the text feature acquisition module, so as to generate the semantic feature of the first predicted character.
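The combination of character code and position code described above can be sketched as follows. This is an illustrative assumption only: the embedding table is random rather than learned, the position code uses the common sinusoidal scheme, and the two codes are combined by addition; the embodiment does not mandate these specific choices:

```python
import numpy as np

def char_embedding(char_ids, vocab_size, dim, seed=0):
    # Hypothetical lookup table standing in for the learned character encoding
    # (the conversion of a character from character form to tensor form).
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(vocab_size, dim))
    return table[char_ids]

def positional_encoding(num_positions, dim):
    # Sinusoidal position code; one common choice, assumed for illustration.
    pos = np.arange(num_positions)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def initial_features(char_ids, vocab_size, dim):
    """Combine the character code and the position code (here by addition)
    to obtain the initial feature of each second character."""
    emb = char_embedding(np.asarray(char_ids), vocab_size, dim)
    return emb + positional_encoding(len(char_ids), dim)

# Three recognized characters, toy vocabulary of 10, feature dimension 8.
feats = initial_features([2, 5, 7], vocab_size=10, dim=8)
```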
  • the execution device combines the feature of the second character with the first feature through the feature update module to generate a seventh feature.
  • the specific implementation of step 306 is similar to that of step 303.
  • the feature of the second character may also be combined with the first feature to generate a seventh feature, where the seventh feature is the updated first feature.
  • the feature of the second character may be an updated feature of the preset character, or may be an initial feature of the preset character.
  • if the first feature includes the image features of a plurality of characters in the first character, at least one of those characters has already been recognized, and the second characters include the recognition results corresponding to the recognized characters, then the feature of the second characters includes the features of those recognition results, so that in the seventh feature the features of the recognized characters are enhanced relative to the first feature.
  • the semantic features of the recognized characters are integrated into the image features, so that the features of the recognized characters in the image features are more obvious, and the recognition module can focus more on recognizing the characters that have not yet been recognized; this reduces the difficulty of a single recognition pass of the recognition module and is beneficial to improving the accuracy of text recognition.
  • the execution device performs a self-attention encoding operation according to the initial feature of the second character through the feature update module of the text recognition network to obtain the updated feature of the second character, and then performs a self-attention encoding operation according to the first feature and the updated feature of the second character to generate the seventh feature (i.e., the updated first feature).
  • the self-attention encoding method is adopted to combine the features of the second characters with the first features, which is conducive to realizing a sufficient combination of the second character features and the first features, and has low complexity and is easy to implement.
  • the executing device performs a self-attention encoding operation according to the first feature and the initial feature of the second character through the feature updating module of the text recognition network to generate the seventh feature.
  • the execution device performs an encoding operation according to the initial feature of the second character through the feature update module of the text recognition network to obtain the updated feature of the second character, and then performs an encoding operation according to the first feature and the updated feature of the second character to generate the seventh feature.
  • the feature updating module of the text recognition network performs an encoding operation through an encoder, which is an encoder in a recurrent neural network.
  • the execution device performs an encoding operation according to the first feature and the initial feature of the second character through the feature updating module of the text recognition network to generate the seventh feature.
  • for specific implementation manners of the various forms in step 306, reference may be made to the description in step 303 above, which will not be repeated here.
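A minimal sketch of the feature update in step 306, under the assumption of single-head scaled dot-product self-attention with identity projections (a real implementation would use learned query/key/value projections and multiple heads):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention, identity projections.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def update_first_feature(first_feature, char_features):
    """Concatenate the image features (first feature) with the (updated)
    second-character features, run self-attention over the joint sequence,
    and keep the image positions as the seventh feature (the updated first
    feature), so recognized-character semantics flow into the image features."""
    joint = np.concatenate([first_feature, char_features], axis=0)
    attended = self_attention(joint)
    return attended[: first_feature.shape[0]]

rng = np.random.default_rng(1)
first = rng.normal(size=(6, 8))   # image features of 6 character positions
chars = rng.normal(size=(2, 8))   # features of 2 recognized characters
seventh = update_first_feature(first, chars)
```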
  • the execution device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a second recognition result.
  • the specific implementation of step 307 is similar to that of step 304.
  • the execution device combines the first feature and the semantic feature of the first predicted character through the recognition module, and performs a recognition operation according to the combined feature to generate the second recognition result.
  • step 306 is an optional step. If step 306 is executed, step 307 includes: the execution device combines the seventh feature (that is, the updated first feature) with the semantic feature of the first predicted character through the recognition module, and performs a recognition operation according to the combined feature to generate the second recognition result.
  • the execution device directly combines the seventh feature (that is, the updated first feature) with the semantic feature of the first predicted character by means of splicing, matrix multiplication, combination, etc. through the recognition module.
  • the execution device performs a combination operation of the seventh feature and the semantic feature of the first predicted character according to the similarity between the seventh feature and the semantic feature of the first predicted character through the identification module.
  • the execution device calculates the similarity between the seventh feature (that is, the updated first feature) and the semantic feature of the first predicted character through the recognition module; generates a second feature and a third feature according to the seventh feature, the semantic feature of the first predicted character and the similarity, where the second feature is the semantic feature of the first predicted character combined on the basis of the seventh feature, and the third feature is the seventh feature combined on the basis of the semantic feature of the first predicted character; and performs a recognition operation according to the second feature and the third feature to generate the second recognition result.
  • the similarity between the first feature and the semantic feature of the first predicted character is calculated, and the second feature and the third feature are then generated according to that similarity; the second feature is the semantic feature of the first predicted character combined on the basis of the first feature, and the third feature is the first feature combined on the basis of the semantic feature of the first predicted character. That is, not only are the image features of the characters to be recognized enhanced according to the semantic features of the predicted characters, but the image features of the characters to be recognized are also integrated into the semantic features of the predicted characters, which is conducive to a full fusion of the image features and the predicted-character features and to improving the accuracy of the text recognition result.
  • the process of performing a recognition operation on the combined features: after the execution device combines the second feature and the third feature through the recognition module, the combined feature is input to the classification network in the recognition module, so that the recognition operation is performed through the classification network, and the second recognition result output by the entire recognition module is obtained.
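The similarity-based combination in this step can be sketched as a pair of cross-attention operations, one in each direction. Using the scaled dot product as the similarity measure is an assumption for illustration; the embodiment only requires that both the second and third features be produced from the two inputs and their similarity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse(seventh, semantic):
    """Compute the similarity between the seventh feature and the semantic
    features of the predicted characters, then combine each side with the
    other, weighted by that similarity."""
    sim = seventh @ semantic.T / np.sqrt(seventh.shape[-1])
    # Second feature: semantic features folded into the seventh feature.
    second = seventh + softmax(sim) @ semantic
    # Third feature: seventh feature folded into the semantic features.
    third = semantic + softmax(sim.T) @ seventh
    return second, third

rng = np.random.default_rng(2)
seventh = rng.normal(size=(6, 8))    # updated image features, 6 positions
semantic = rng.normal(size=(2, 8))   # semantic features of predicted characters
second, third = fuse(seventh, semantic)
```

The recognition operation would then be performed on the combination of `second` and `third`, e.g. by splicing them and feeding the result to the classification network.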
  • if step 306 is not executed, step 307 includes: the execution device combines the first feature obtained in step 301 with the semantic feature of the first predicted character through the recognition module, and performs a recognition operation according to the combined feature to generate the second recognition result.
  • after steps 301 to 304 are executed once, steps 305 to 307 can be repeatedly executed to obtain multiple second recognition results.
  • each time the execution device executes steps 305 to 307, a recognition result of one character in the first character is obtained; the execution device repeatedly executes steps 305 to 307 to obtain the recognition results of all characters in the first character.
  • when the granularity of the recognition operation performed by the text recognition network is words, each time the execution device performs steps 305 to 307, a recognition result of one word in the first character is obtained; the execution device repeatedly performs steps 305 to 307 to obtain the recognition results of all words in the first character. Further, the recognition result of the entire first character can be output.
  • the execution device can directly output the recognition result of the entire first character after performing steps 301 to 304.
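The iterative flow of steps 305 to 307 amounts to an autoregressive loop: start from the preset character, repeatedly predict the next character from the image features plus the characters recognized so far, and stop when the first character is fully recognized. In this sketch, `predict_next` is a hypothetical stand-in for one pass through the text feature acquisition, feature update and recognition modules, and the start/end markers are assumptions:

```python
def recognize(image_features, predict_next, max_len=20, start="<s>", end="</s>"):
    """Autoregressive recognition loop over steps 305-307."""
    recognized = [start]          # the preset character seeds the loop
    for _ in range(max_len):
        nxt = predict_next(image_features, recognized)
        if nxt == end:            # all characters in the first character done
            break
        recognized.append(nxt)
    return "".join(recognized[1:])

# Toy predictor that spells "sheet" regardless of the image features,
# purely to exercise the loop structure.
target = "sheet"
def toy_predict(_, chars):
    i = len(chars) - 1
    return target[i] if i < len(target) else "</s>"

result = recognize(None, toy_predict)
```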
  • FIG. 6 is a schematic diagram of a network architecture of a text recognition network provided by an embodiment of the present application.
  • the text recognition network includes an image feature extraction module, A1, A2 and a recognition module.
  • A1 represents a text feature acquisition module
  • A2 represents a feature update module.
  • the execution device inputs the image to be recognized into the image feature extraction module, so as to obtain the image feature (i.e., the first feature) of the first character in the image to be recognized, and inputs the character corresponding to the first character in the image to be recognized into A1 (that is, the text feature acquisition module); the character corresponding to the first character in the image to be recognized may be a preset character, or a preset character and a second character. Through the text feature acquisition module, the initial features of the characters are generated, and a self-attention encoding operation and a self-attention decoding operation are performed on the initial features of the characters to obtain the semantic features of the predicted characters.
  • after the execution device obtains the first feature, it will also perform self-attention encoding on the initial feature of the character to obtain the updated feature of the character, and then perform a self-attention encoding operation according to the first feature and the updated feature of the character to generate the updated first feature.
  • the execution device inputs the updated first feature and the semantic feature of the predicted character into the recognition module, so that the recognition module performs the recognition operation and outputs the recognition result.
  • for the specific implementation of each step in FIG. 6, reference may be made to the above description, which will not be repeated here. It should be understood that in actual situations, more or fewer neural network layers may be set in the text recognition network; FIG. 6 is only an example for the convenience of understanding this solution, and is not used to limit this solution.
  • the semantic features of the predicted characters are generated according to the second characters corresponding to the recognized characters in the first character, and the recognition operation is performed according to features of more dimensions, which is beneficial to improving the accuracy of the text recognition result. Moreover, when the image to be recognized is blurred or some characters in it are occluded, the accuracy of the features of the blurred or occluded characters included in the first feature is greatly reduced; since the semantic features of the predicted characters are generated based on the semantic information of the recognized characters, the accuracy of the predicted characters is not affected by image problems such as blurring or occlusion, and generating the recognition result jointly from the semantic features of the predicted characters and the image features is beneficial to improving the accuracy of the text recognition result.
  • FIG. 7 is a schematic flowchart of a training method for a text recognition network provided by an embodiment of the present application. The method may include:
  • the training device acquires the image to be recognized from the training data set.
  • the training device is preconfigured with a training data set; the training data set includes a plurality of images to be recognized and the correct result corresponding to the first character in each image to be recognized. An image to be recognized is randomly obtained from the training data set.
  • the training device inputs the to-be-recognized image to the image feature extraction module, and performs feature extraction on the to-be-recognized image to generate a first feature corresponding to the first character in the to-be-recognized image.
  • the training device inputs a preset character corresponding to the first character in the image to be recognized into the text feature acquisition module, and performs text prediction according to the preset character to generate a semantic feature of the first predicted character.
  • the training device combines the feature of the preset character with the first feature through the feature update module to generate a fourth feature.
  • the training device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a first recognition result.
  • the training device inputs the second character corresponding to the recognized character in the first character to the text feature acquisition module, and performs text prediction according to the second character to generate the semantic feature of the first predicted character.
  • the training device combines the feature of the second character with the first feature through the feature update module to generate a seventh feature.
  • the training device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a second recognition result.
  • the specific implementation of steps 702 to 708 performed by the training device is similar to that of steps 301 to 307 in the embodiment corresponding to FIG. 3. Please refer to the description of steps 301 to 307 in the embodiment corresponding to FIG. 3, which is not repeated here.
  • the training device trains the text recognition network according to the correct result corresponding to the first character in the image to be recognized, the recognition result, and the loss function.
  • after obtaining the recognition result of the first character in the image to be recognized, the training device will calculate the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized and the recognition result of the first character in the image to be recognized, and perform gradient derivation on the function value of the loss function to reversely update the weight parameters of the text recognition network, so as to complete one training of the text recognition network.
  • the training device repeatedly performs the aforementioned steps to realize iterative training of the text recognition network.
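The training iteration described above (forward pass, loss computation, gradient derivation, reverse weight update) can be sketched for the classification head alone, using a cross-entropy loss, which is one of the loss functions named in this embodiment. The dimensions, learning rate and plain gradient descent are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target_idx):
    # Loss of one recognized character against the labelled correct result.
    return -np.log(probs[target_idx] + 1e-12)

def train_step(W, b, feature, target_idx, lr=0.1):
    """One training iteration on the classification head: forward pass,
    cross-entropy loss, analytic gradient for softmax + linear layer,
    and a reverse weight update."""
    logits = feature @ W + b
    probs = softmax(logits)
    loss = cross_entropy(probs, target_idx)
    grad_logits = probs.copy()
    grad_logits[target_idx] -= 1.0            # d(loss)/d(logits)
    W -= lr * np.outer(feature, grad_logits)  # reversely update weights
    b -= lr * grad_logits
    return loss

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 5))
b = np.zeros(5)
feat = rng.normal(size=(8,))
losses = [train_step(W, b, feat, target_idx=2) for _ in range(50)]
```

Repeating such iterations until the loss function converges, or until a preset number of iterations is reached, corresponds to the preset condition mentioned below.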
  • the training device may directly output the recognition result of the entire first character after performing steps 701 to 705; in this case, the training device calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized and the first recognition result output in step 705.
  • if a first character includes a plurality of characters to be recognized, or a first character includes a plurality of words to be recognized, the training device executes steps 701 to 705 once and executes steps 706 to 708 at least once, after which the recognition result of the entire first character can be output; the training device then calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized, the first recognition result output in step 705, and the at least one second recognition result obtained in step 708.
  • the loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result of the first character in the image to be recognized, and the training target is to increase the similarity between the recognition result and the correct result corresponding to the first character in the image to be recognized.
  • the loss function may specifically be expressed as a cross-entropy loss function, a focal loss function, a center (center) loss function, or other types of loss functions, etc., which are not limited here.
  • the preset condition may be that the loss function satisfies the convergence condition, or that the number of iterations reaches a preset number of times.
  • a training method for a text recognition network is provided, which improves the integrity of the solution. When the image to be recognized is blurred or some characters in it are occluded, the accuracy of the features of the blurred or occluded characters included in the first feature is greatly reduced; since the semantic features of the predicted characters are generated from the recognized characters and the recognition result is generated according to the semantic features of the predicted characters together with the image features, image problems such as blurring or occlusion do not affect the accuracy of the predicted characters, which is beneficial to improving the accuracy of the text recognition results output by the trained text recognition network.
  • svt, SVTP and CT80 are three public data sets. The first row of data in Table 1 indicates the accuracy of the recognition results obtained by using an existing optical character recognition (OCR) technology to perform text recognition on the images in the data set svt, the data set SVTP and the data set CT80 respectively.
  • the second row of data in Table 1 indicates the accuracy of the recognition results obtained by using the text recognition network provided by the embodiments of the present application to perform text recognition on images in the dataset svt, dataset SVTP, and dataset CT80 respectively. Obviously, the accuracy of the recognition result obtained by using the text recognition network provided by the embodiment of the present application is higher.
  • FIG. 8 is a schematic diagram of a beneficial effect of the text recognition network provided by the embodiment of the present application.
  • using an existing OCR technology, the obtained recognition result is "shcct"; using the text recognition network provided by the embodiment of the present application, the obtained recognition result is "sheet".
  • the data in the second row and the data in the third row in FIG. 8 can be understood by analogy.
  • the recognition result obtained by using the text recognition network provided by the embodiment of the present application has higher accuracy.
  • FIG. 9 is a schematic structural diagram of a text recognition network provided by an embodiment of the present application.
  • the text recognition network 900 may include an image feature extraction module 901 , a text feature acquisition module 902 and a recognition module 903 .
  • the image feature extraction module 901 is used to obtain the image to be recognized, and perform feature extraction on the image to be recognized, so as to generate a first feature corresponding to the first character in the image to be recognized, wherein the first character is the character in the image to be recognized that needs to be recognized;
  • the text feature acquisition module 902 is used to acquire the preset character corresponding to the first character in the image to be recognized, and perform text prediction according to the preset character to generate the semantic feature of the first predicted character;
  • the recognition module 903 is used to perform a recognition operation according to the first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
  • the text feature acquisition module 902 is specifically configured to acquire the preset character corresponding to the first character in the image to be recognized when the recognition operation is performed on the image to be recognized for the first time, and perform text prediction according to the preset character to generate the semantic feature of the second predicted character; the text feature acquisition module 902 is further configured to, in the case that a recognition operation has been performed on at least one character in the first character, determine the recognition result corresponding to the recognized character as the second character, and generate a semantic feature of the second predicted character corresponding to the second character.
  • the recognition module 903 is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
  • the text feature acquisition module 902 includes: a first generation sub-module 9021, configured to perform vectorization processing on the preset character to generate the character code of the preset character, and to generate the position code of the preset character according to the position of the preset character in the first character in the image to be recognized; and a combination sub-module 9022, configured to combine the character code of the preset character and the position code of the preset character to obtain the initial feature of the preset character, and to perform a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character to generate the semantic feature of the first predicted character.
  • the recognition module 903 includes: a calculation sub-module 9031, configured to calculate the similarity between the first feature and the semantic feature of the first predicted character; and a second generation sub-module 9032, configured to generate a second feature and a third feature according to the first feature, the semantic feature of the first predicted character and the similarity, wherein the second feature is the semantic feature of the first predicted character combined on the basis of the first feature, and the third feature is the first feature combined on the basis of the semantic feature of the first predicted character; the second generation sub-module 9032 is further configured to perform a recognition operation according to the second feature and the third feature to generate the recognition result.
  • the text recognition network further includes a feature update module 904, and the feature update module 904 is used for combining the feature of the preset character with the first feature to generate the updated first feature; the recognition module 903 is specifically configured to perform a recognition operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate the recognition result corresponding to the first character in the to-be-recognized image.
  • the feature update module 904 is specifically configured to: perform a self-attention encoding operation according to the initial feature of the preset character, obtain the updated feature of the preset character, and according to the first feature and the preset character The updated features perform a self-attention encoding operation to generate the updated first features.
  • a first character when the granularity of the recognition operation performed by the text recognition network is character, a first character includes at least one character, and a recognition result output by the text recognition network performing a recognition operation includes one character;
  • when the granularity of the recognition operation performed by the text recognition network is words, a first character includes at least one word, and a recognition result output by the text recognition network performing one recognition operation is a word including one or more characters.
  • FIG. 11 is a schematic structural diagram of the training apparatus for a text recognition network provided by an embodiment of the present application.
  • the text recognition network is a neural network used to recognize characters in an image, and the text recognition network includes an image feature extraction module, a text feature acquisition module and a recognition module.
  • the training device 1100 of the text recognition network includes: an input unit 1101 , a recognition unit 1102 and a training unit 1103 .
  • the input unit 1101 is used to input the to-be-recognized image to the image feature extraction module, and perform feature extraction on the to-be-recognized image to generate a first feature corresponding to the first character in the to-be-recognized image, where the first character is the character in the to-be-recognized image that needs to be recognized;
  • the input unit 1101 is further configured to input the preset character corresponding to the first character in the image to be recognized into the text feature acquisition module, and perform text prediction according to the preset character to generate the semantic feature of the first predicted character;
  • the recognition unit 1102 is used to perform a recognition operation by the recognition module according to the first feature and the semantic feature of the first predicted character, to generate a recognition result corresponding to the first character in the image to be recognized;
  • the training unit 1103 is used to train the text recognition network according to the correct result corresponding to the first character in the to-be-recognized image, the recognition result, and a loss function, where the loss function indicates the similarity between the correct result corresponding to the first character in the image and the recognition result;
  • FIG. 12 is a schematic structural diagram of the apparatus for training a text recognition network provided by an embodiment of the present application.
  • the input unit 1101 is specifically configured to input a preset character corresponding to the first character in the to-be-recognized image to the text feature acquisition module when the recognition operation is performed on the to-be-recognized image for the first time;
  • the training device 1100 of the text recognition network further includes a generating unit 1104, configured to, when the recognition operation has been performed on at least one of the first characters, determine the recognition result corresponding to the recognized characters among the first characters together with the preset character as a second character through the text feature acquisition module, and generate a semantic feature of the second predicted character corresponding to the second character.
  • the recognition unit 1102 is further configured to perform a recognition operation through the recognition module according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized .
  • the input unit 1101 is specifically configured to perform vectorization processing on the preset character through the text feature acquisition module to generate a character encoding of the preset character, and to generate a position encoding of the preset character according to the position of the preset character relative to the first character in the to-be-recognized image; the text feature acquisition module combines the character encoding and the position encoding of the preset character to obtain an initial feature of the preset character, and performs a self-attention encoding operation and a self-attention decoding operation according to the initial feature to generate the semantic feature of the first predicted character.
  • the identifying unit 1102 is specifically configured to: calculate, through the recognition module, the similarity between the first feature and the semantic feature of the first predicted character; generate, through the recognition module, a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, where the second feature is the first feature combined with the semantic feature of the first predicted character, and the third feature is the semantic feature of the first predicted character combined with the first feature; and perform, through the recognition module, the recognition operation according to the second feature and the third feature to generate the recognition result.
  • the text recognition network also includes a feature update module.
  • the training device 1100 of the text recognition network further includes a combining unit 1105, configured to combine the features of the preset characters with the first features through the feature update module to generate the updated first features; the identifying unit 1102 is specifically configured to perform, through the recognition module, the recognition operation according to the updated first feature and the semantic feature of the first predicted character, to generate the recognition result corresponding to the first character in the to-be-recognized image.
  • the combining unit 1105 is specifically configured to perform, through the feature update module, a self-attention encoding operation according to the initial features of the preset characters to obtain the updated features of the preset characters, and to perform, through the feature update module, a self-attention encoding operation according to the first feature and the updated features of the preset characters to generate the updated first feature.
  • when the granularity of the recognition operation performed by the text recognition network is character, one first character includes at least one character, and one recognition result output by one recognition operation of the text recognition network includes one character;
  • when the granularity of the recognition operation performed by the text recognition network is word, one first character includes at least one word, and one recognition result output by one recognition operation of the text recognition network is a word including one or more characters.
  • FIG. 13 is a schematic structural diagram of the execution device provided by the embodiment of the present application.
  • the text recognition network 900 described in the embodiments corresponding to FIG. 9 or FIG. 10 may be deployed on the execution device 1300, and is used to implement the functions of the execution device in the embodiments corresponding to FIG. 3 to FIG. 6 .
  • the execution device 1300 includes: a receiver 1301, a transmitter 1302, a processor 1303, and a memory 1304 (wherein the number of processors 1303 in the execution device 1300 may be one or more, and one processor is taken as an example in FIG. 13 ) , wherein the processor 1303 may include an application processor 13031 and a communication processor 13032 .
  • the receiver 1301, the transmitter 1302, the processor 1303, and the memory 1304 may be connected by a bus or otherwise.
  • Memory 1304 may include read-only memory and random access memory, and provides instructions and data to processor 1303 .
  • a portion of memory 1304 may also include non-volatile random access memory (NVRAM).
  • the memory 1304 stores operation instructions executable by the processor, executable modules or data structures, or a subset or an extended set thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
  • the processor 1303 controls the operation of the execution device.
  • various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 1303 or implemented by the processor 1303 .
  • the processor 1303 may be an integrated circuit chip, which has signal processing capability. In the implementation process, each step of the above method can be completed by the hardware integrated logic circuit in the processor 1303 or the instructions in the form of software.
  • the above-mentioned processor 1303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 1303 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1304, and the processor 1303 reads the information in the memory 1304, and completes the steps of the above method in combination with its hardware.
  • the receiver 1301 can be used to receive input numerical or character information, and to generate signal inputs related to the relevant settings and function control of the execution device.
  • the transmitter 1302 can be used to output numerical or character information through the first interface; the transmitter 1302 can also be used to send instructions to a disk group through the first interface to modify the data in the disk group; the transmitter 1302 can also include a display device such as a display screen.
  • the application processor 13031 is configured to execute the functions of the execution device in the embodiments corresponding to FIG. 3 to FIG. 6 . It should be noted that, for the specific implementation manner in which the application processor 13031 executes the functions of the execution device in the embodiments corresponding to FIG. 3 to FIG. 6 and the beneficial effects brought thereby, reference may be made to the descriptions in the method embodiments corresponding to FIG. 3 to FIG. 6 , and details are not repeated here.
  • FIG. 14 is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • the training apparatus 1100 of the text recognition network described in the embodiments corresponding to FIG. 11 or FIG. 12 may be deployed on the training device 1400, and is used to implement the functions of the training device in the embodiment corresponding to FIG. 7 .
  • the training device 1400 is implemented by one or more servers, and the training device 1400 may vary greatly due to different configurations or performances, and may include one or more central processing units (CPU) 1422 (for example, one or more processors) and memory 1432, one or more storage media 1430 (eg, one or more mass storage devices) that store applications 1442 or data 1444.
  • the memory 1432 and the storage medium 1430 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1430 may include one or more modules (not shown in the figure), and each module may include a series of instructions to operate on the training device. Further, the central processing unit 1422 may be configured to communicate with the storage medium 1430 to execute a series of instruction operations in the storage medium 1430 on the training device 1400 .
  • the training device 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
  • the central processing unit 1422 is configured to implement the functions of the training device in the embodiment corresponding to FIG. 7 . It should be noted that, for the specific implementation manner in which the central processing unit 1422 performs the functions of the training device in the embodiment corresponding to FIG. 7 and the beneficial effects brought thereby, reference may be made to the descriptions in the method embodiment corresponding to FIG. 7 , and details are not repeated here.
  • embodiments of the present application further provide a computer-readable storage medium storing a program that, when run on a computer, causes the computer to execute the steps executed by the execution device in the embodiments corresponding to FIG. 3 to FIG. 6 , or the steps executed by the training device in the embodiment corresponding to FIG. 7 .
  • the embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to execute the steps executed by the execution device in the embodiments corresponding to FIG. 3 to FIG. 6 , or the steps executed by the training device in the embodiment corresponding to FIG. 7 .
  • an embodiment of the present application further provides a circuit system, where the circuit system includes a processing circuit configured to execute the steps executed by the execution device in the embodiments corresponding to FIG. 3 to FIG. 6 , or the steps executed by the training device in the embodiment corresponding to FIG. 7 .
  • the execution device or training device provided in this embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, where the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
  • the processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip executes the steps executed by the execution device in the embodiments corresponding to FIG. 3 to FIG. 6 , or executes the steps executed by the training device in the embodiment corresponding to FIG. 7 .
  • the storage unit is a storage unit in the chip, such as a register or a cache;
  • the storage unit may also be a storage unit located outside the chip in the radio access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip may be represented as a neural network processing unit (NPU) 150. The NPU 150 is mounted to a host CPU as a co-processor, and tasks are allocated by the host CPU.
  • the core part of the NPU is the arithmetic circuit 1503, which is controlled by the controller 1504 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 1503 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 1503 is a two-dimensional systolic array. The arithmetic circuit 1503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1503 is a general-purpose matrix processor.
  • the arithmetic circuit 1503 fetches the data corresponding to the matrix B from the weight memory 1502 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit 1503 fetches the data of matrix A from the input memory 1501, performs a matrix operation with matrix B, and stores the partial or final results of the matrix in the accumulator 1508 .
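The data flow just described — matrix B cached in the processing elements, matrix A streamed in, and partial results summed in the accumulator 1508 — can be illustrated with a minimal numeric sketch. The tiling scheme and the NumPy implementation below are illustrative assumptions, not the NPU's actual microarchitecture:

```python
import numpy as np

def npu_style_matmul(a, b, tile=4):
    # Hypothetical sketch of the accumulate-as-you-go data flow described
    # above: weights (matrix B) stay resident, activations (matrix A) are
    # streamed in tiles, and partial products are summed in an accumulator.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    acc = np.zeros((m, n))            # plays the role of accumulator 1508
    for start in range(0, k, tile):   # stream matrix A tile by tile
        end = min(start + tile, k)
        acc += a[:, start:end] @ b[start:end, :]  # partial result
    return acc

a = np.random.rand(3, 8)
b = np.random.rand(8, 5)
assert np.allclose(npu_style_matmul(a, b), a @ b)
```

The tiled accumulation yields exactly the same result as a full matrix multiply; only the order of the partial sums differs.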
  • Unified memory 1506 is used to store input data and output data.
  • weight data is transferred to the weight memory 1502 through a direct memory access controller (DMAC) 1505 .
  • Input data is also moved into unified memory 1506 via the DMAC.
  • the bus interface unit (BIU) 1510 is used for the interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1509 .
  • the bus interface unit 1510 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1509 to obtain instructions from the external memory, and also for the storage unit access controller 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1506 or the weight data to the weight memory 1502 or the input data to the input memory 1501 .
  • the vector calculation unit 1507 includes a plurality of operation processing units, and further processes the output of the arithmetic circuit 1503 when necessary, for example vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for non-convolutional/fully-connected-layer computation in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1507 can store the vector of processed outputs to the unified memory 1506 .
  • the vector calculation unit 1507 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1503, such as linear interpolation of the feature plane extracted by the convolutional layer, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 1507 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1503, such as for use in subsequent layers in a neural network.
  • the instruction fetch buffer 1509 connected to the controller 1504 is used to store instructions used by the controller 1504;
  • the unified memory 1506, the input memory 1501, the weight memory 1502 and the instruction fetch memory 1509 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the operation of each layer in the recurrent neural network may be performed by the arithmetic circuit 1503 or the vector calculation unit 1507 .
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the program of the method in the first aspect.
  • the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • based on such an understanding, the technical solutions may be embodied in the form of a software product stored in a readable storage medium, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, and include several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center that integrates one or more available media.
  • the available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks (SSDs)), and the like.


Abstract

A text recognition network, a neural network training method, and a related device. The text recognition network is a neural network used to recognize characters in an image, and includes: an image feature extraction module, configured to acquire a to-be-recognized image and perform feature extraction on it to generate a first feature corresponding to a first character in the to-be-recognized image (301); a text feature acquisition module, configured to acquire a preset character corresponding to the first character in the to-be-recognized image and perform text prediction according to the preset character to generate a semantic feature of a first predicted character (302); and a recognition module, configured to perform a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the to-be-recognized image, so that the recognition operation is performed according to features of more dimensions. Moreover, image-quality problems do not affect the accuracy of the predicted character, which helps improve the accuracy of the text recognition result.

Description

Text recognition network, neural network training method, and related device
This application claims priority to Chinese Patent Application No. 202010723541.2, filed with the China National Intellectual Property Administration on July 24, 2020 and entitled "Text Recognition Network, Neural Network Training Method, and Related Device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular, to a text recognition network, a neural network training method, and a related device.
Background
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. At present, recognizing characters in images with neural networks based on deep learning is a common application of artificial intelligence.
In practice, however, when the quality of the to-be-recognized image is low, for example when the image is blurred or some characters in it are occluded, the neural network may output wrong recognition results, reducing the accuracy of the text recognition result. A solution that improves the accuracy of text recognition results is therefore urgently needed.
Summary
Embodiments of this application provide a text recognition network, a neural network training method, and a related device. A recognition result is generated from the semantic feature of a predicted character together with the image feature of the to-be-recognized image, so that the recognition operation is performed according to features of more dimensions. Because image problems such as blurring or partial occlusion of characters do not affect the accuracy of the predicted character, this helps improve the accuracy of the text recognition result.
To solve the foregoing technical problem, the embodiments of this application provide the following technical solutions:
According to a first aspect, an embodiment of this application provides a text recognition network, which can be used in the text recognition field of the artificial intelligence field. The text recognition network is a neural network used to recognize characters in an image, and includes an image feature extraction module, a text feature acquisition module, and a recognition module. The image feature extraction module is configured to acquire a to-be-recognized image and perform feature extraction on it to generate a first feature corresponding to a first character in the to-be-recognized image, where the first character is a character in the to-be-recognized image that needs to be recognized. The image feature extraction module of the text recognition network may specifically be a convolutional neural network, a histogram of oriented gradients, or a local binary pattern. The to-be-recognized image may be a whole image, or a segmented image that includes one row or one column of characters after an image segmentation operation. The text feature acquisition module is configured to acquire a preset character corresponding to the first character in the to-be-recognized image, and perform text prediction according to the preset character to generate a semantic feature of a first predicted character. The preset character may be a start flag character, represented as the <BOS> character in a computer program, which instructs the text feature acquisition module to start text prediction. The recognition module is configured to combine the first feature and the semantic feature of the first predicted character, and perform a recognition operation according to the combined feature to generate a recognition result corresponding to the first character in the to-be-recognized image. The recognition module may specifically be a classification network, which may be represented as a classifier; the classifier may be a multilayer perceptron, or consist of a linear transformation matrix and a classification function.
In this implementation, not only is the image feature of the to-be-recognized image acquired, but the semantic feature of the predicted character is also generated according to the second character corresponding to the recognized characters among the first characters, so that the recognition operation is performed according to features of more dimensions, which helps improve the accuracy of the text recognition result. Moreover, when the to-be-recognized image is blurred or some characters in it are occluded, the accuracy of the features of the blurred or occluded characters included in the first feature is greatly reduced, whereas the semantic feature of the predicted character, generated based on the semantic information of the recognized characters, is not affected by such image problems. Generating the recognition result from the semantic feature of the predicted character together with the image feature therefore helps improve the accuracy of the text recognition result.
In a possible implementation of the first aspect, the text feature acquisition module is specifically configured to: when the recognition operation is performed on the to-be-recognized image for the first time, acquire the preset character corresponding to the first character in the to-be-recognized image, and perform text prediction according to the preset character to generate the semantic feature of the first predicted character. If the execution device has performed image segmentation on the whole to-be-recognized image, performing the recognition operation on the first character for the first time refers to performing the recognition operation on a segmented to-be-recognized image (that is, one text region of the to-be-recognized image) for the first time; if the execution device has not performed image segmentation on the whole to-be-recognized image, it refers to performing the recognition operation on the whole to-be-recognized image for the first time. The text feature acquisition module is further specifically configured to: when the recognition operation has been performed on at least one of the first characters, determine at least one recognition result corresponding to the at least one recognized character among the first characters together with the preset character as a second character, and perform text prediction according to the second character to generate a semantic feature of a second predicted character corresponding to the second character.
In this implementation, when the recognition operation is performed on the first character in the to-be-recognized image for the first time, the execution device generates the semantic feature of the first predicted character according to the preset character; when the recognition operation has been performed on at least one of the first characters, the execution device determines the at least one recognition result corresponding to the recognized characters among the first characters together with the preset character as the at least one second character corresponding to the recognized characters. This ensures the completeness of the solution, removes the need for manual intervention in the recognition process, and improves the user stickiness of the solution.
In a possible implementation of the first aspect, the recognition module is further configured to perform the recognition operation according to the first feature and the semantic feature of the second predicted character, to generate the recognition result corresponding to the first character in the to-be-recognized image.
In this implementation, because each recognition operation performed by the text recognition network may obtain the recognition results of only some of the first characters, when the recognition operation has been performed on at least one of the first characters, the execution device performs text prediction according to the at least one recognition result corresponding to the at least one recognized character to generate the semantic feature of the second predicted character, and then performs the recognition operation according to the first feature and the semantic feature of the second predicted character, further improving the completeness of this solution.
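The iterative flow described above — the first pass conditioned only on the preset start character, and each later pass conditioned on the recognition results obtained so far — can be sketched as a simple loop. The function names, the <BOS>/<EOS> markers, and the toy recognizer below are illustrative assumptions, not the patent's actual implementation:

```python
def greedy_decode(recognize_step, max_len=20, bos="<BOS>", eos="<EOS>"):
    # Hypothetical sketch of the iterative recognition flow: the first pass
    # conditions only on the preset start character, and each later pass
    # conditions on the characters recognized so far.
    context = [bos]                  # the "preset character"
    result = []
    for _ in range(max_len):
        ch = recognize_step(context) # one recognition operation
        if ch == eos:
            break
        result.append(ch)
        context.append(ch)           # recognized results become the context
    return "".join(result)

# toy recognizer that spells out "cat" and then stops
def toy_step(context):
    target = "cat"
    i = len(context) - 1
    return target[i] if i < len(target) else "<EOS>"

print(greedy_decode(toy_step))  # cat
```

Real recognition steps would also consume the image feature; only the conditioning loop is shown here.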
In a possible implementation of the first aspect, the text feature acquisition module includes: a first generation submodule, configured to perform vectorization processing on each of at least one preset character to generate a character encoding of each preset character, and to generate a position encoding of each preset character according to the position of the preset character among the first characters in the to-be-recognized image; and a combination submodule, configured to combine the character encoding and the position encoding of the preset character to obtain an initial feature of the preset character, and to perform a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character to generate the semantic feature of the first predicted character. The character encoding and the position encoding of the preset character may be combined in any of the following ways: concatenation, addition, fusion, or multiplication.
In this implementation, text prediction is performed by performing a self-attention encoding operation and a self-attention decoding operation on the initial feature of the preset character to generate the semantic feature of the first predicted character, which is fast in computation and low in complexity.
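As a rough illustration of combining a character encoding with a position encoding (here by addition, one of the options above) to form the initial feature, consider the following sketch. The vocabulary, the random embedding table, the feature dimension, and the sinusoidal position code are all illustrative assumptions; the patent does not fix the encoding scheme:

```python
import numpy as np

def char_features(tokens, vocab, d=16):
    # Hypothetical sketch of the "initial feature": a character embedding
    # (character encoding) added to a sinusoidal position encoding.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(len(vocab), d))  # character encoding table
    feats = []
    for pos, tok in enumerate(tokens):
        char_code = emb[vocab[tok]]
        pos_code = np.array([np.sin(pos / 10000 ** (2 * (i // 2) / d))
                             if i % 2 == 0 else
                             np.cos(pos / 10000 ** (2 * (i // 2) / d))
                             for i in range(d)])
        feats.append(char_code + pos_code)  # "addition" combination
    return np.stack(feats)

vocab = {"<BOS>": 0, "c": 1, "a": 2, "t": 3}
f = char_features(["<BOS>", "c", "a"], vocab)
print(f.shape)  # (3, 16)
```

In a trained network the embedding table would be learned rather than random; the structure of the combination is the point here.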
In a possible implementation of the first aspect, the recognition module includes: a computation submodule, configured to compute the similarity between the first feature and the semantic feature of the first predicted character. The similarity may be obtained by computing the cosine similarity, the Euclidean distance, the Mahalanobis distance, or the like between the first feature and the semantic feature of the first predicted character, or by performing a dot-product operation between them; it may include one similarity value, or two similarity values that are transposes of each other. A second generation submodule is configured to generate a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, where the second feature is the first feature combined with the semantic feature of the first predicted character, and the third feature is the semantic feature of the first predicted character combined with the first feature. The second generation submodule is further configured to combine the second feature and the third feature, and perform the recognition operation according to the combined feature to generate the recognition result.
In this implementation, the similarity between the first feature and the semantic feature of the first predicted character is computed, and the second feature and the third feature are generated according to that similarity: the second feature combines the semantic feature of the first predicted character on the basis of the first feature, and the third feature combines the first feature on the basis of the semantic feature of the first predicted character. That is, not only is the image feature of the to-be-recognized character enhanced by the semantic feature of the predicted character, but the image feature is also fused into the semantic feature of the predicted character, which facilitates a full fusion of the image feature and the predicted-character feature and helps improve the accuracy of the text recognition result.
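A minimal sketch of this bidirectional combination might look as follows. Using dot-product similarity with softmax-weighted sums is one of the options listed above; the feature shapes and the residual-style addition are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(image_feat, sem_feat):
    # Hypothetical sketch of the bidirectional fusion: a dot-product
    # similarity between the first feature (image) and the semantic feature
    # of the predicted character, used in both directions.
    sim = image_feat @ sem_feat.T                        # similarity matrix
    second = image_feat + softmax(sim, -1) @ sem_feat    # image + semantics
    third = sem_feat + softmax(sim.T, -1) @ image_feat   # semantics + image
    return second, third

img = np.random.rand(5, 8)   # 5 image positions, dimension 8
sem = np.random.rand(3, 8)   # 3 predicted-character features
s2, s3 = fuse(img, sem)
print(s2.shape, s3.shape)    # (5, 8) (3, 8)
```

The two outputs correspond to the second and third features: each modality is enhanced by a similarity-weighted summary of the other.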
In a possible implementation of the first aspect, the text recognition network further includes a feature update module, configured to combine the feature of the preset character with the first feature to generate an updated first feature. The feature of the preset character may be the initial feature of the preset character or the updated feature of the preset character. The first feature includes the image features of multiple first characters, at least one of which has already undergone the recognition operation; when the preset character includes the recognition results corresponding to multiple recognized characters, the feature of the preset character includes the features of the recognition results corresponding to the recognized characters. Compared with the first feature, the features of the recognized characters are enhanced in the updated first feature. The recognition module is specifically configured to perform the recognition operation according to the updated first feature and the semantic feature of the first predicted character, to generate the recognition result corresponding to the first character in the to-be-recognized image.
In this implementation, the semantic features of the recognized characters are fused into the image feature, making the features of the recognized characters more prominent in it, so that the recognition module can focus on the characters that have not yet been recognized. This reduces the difficulty of a single recognition pass of the recognition module and helps improve the accuracy of text recognition.
In a possible implementation of the first aspect, the feature update module is specifically configured to: perform a self-attention encoding operation according to the initial feature of the preset character to obtain an updated feature of the preset character, and perform a self-attention encoding operation according to the first feature and the updated feature of the preset character to generate the updated first feature. In this implementation, combining the feature of the preset character with the first feature by self-attention encoding facilitates a full combination of the preset-character feature and the first feature, and is low in complexity and easy to implement.
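A toy version of this two-stage update can be written with plain scaled dot-product attention. Omitting the learned query/key/value projections of a real attention layer is a simplifying assumption; only the data flow of the two encoding operations is illustrated:

```python
import numpy as np

def attend(q, kv):
    # single-head scaled dot-product attention without learned projections,
    # a minimal stand-in for the "self-attention encoding operation"
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ kv

def update_first_feature(first_feat, preset_init):
    # Hypothetical sketch: the preset-character features first attend to
    # themselves, then the image feature attends to the updated character
    # features, so already-recognized characters stand out in it.
    preset_updated = attend(preset_init, preset_init)
    return first_feat + attend(first_feat, preset_updated)

first = np.random.rand(6, 8)   # image feature: 6 positions, dimension 8
preset = np.random.rand(2, 8)  # features of 2 recognized characters
print(update_first_feature(first, preset).shape)  # (6, 8)
```

The output keeps the shape of the first feature; the residual addition models "enhancing" the recognized characters within it.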
In a possible implementation of the first aspect, when the granularity of the recognition operation performed by the text recognition network is character, one first character includes at least one character, and one recognition result output by one recognition operation of the text recognition network includes one character. When the granularity of the recognition operation performed by the text recognition network is word, one first character includes at least one word, and one recognition result output by one recognition operation of the text recognition network is a word including one or more characters.
In this implementation, the granularity of the recognition operation performed by the text recognition network may be character or word, which expands the application scenarios of this solution and improves its implementation flexibility.
According to a second aspect, an embodiment of this application provides a training method for a text recognition network, which can be used in the text recognition field of the artificial intelligence field. The text recognition network is a neural network used to recognize characters in an image, and includes an image feature extraction module, a text feature acquisition module, and a recognition module. The method includes: a training device inputs a to-be-recognized image into the image feature extraction module, and performs feature extraction on the to-be-recognized image to generate a first feature corresponding to a first character in the to-be-recognized image, where the first character is a character in the to-be-recognized image that needs to be recognized; the training device inputs a preset character corresponding to the first character in the to-be-recognized image into the text feature acquisition module, and performs text prediction according to the preset character to generate a semantic feature of a first predicted character; the training device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character, to generate a recognition result corresponding to the first character in the to-be-recognized image; and the training device trains the text recognition network according to the correct result corresponding to the first character in the to-be-recognized image, the recognition result, and a loss function. The loss function indicates the similarity between the correct result and the recognition result corresponding to the first character in the to-be-recognized image, and the training objective of the loss function is to increase that similarity. The loss function may specifically be a cross-entropy loss function, a focal loss function, or a center loss function.
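The cross-entropy option of the training objective above can be made concrete with a few lines of NumPy. The vocabulary size and logit values are illustrative; a real training loop would back-propagate this loss through the whole network:

```python
import numpy as np

def cross_entropy(logits, target_idx):
    # Hypothetical sketch of the training objective: a cross-entropy loss
    # pulling the recognition result toward the correct character label.
    z = logits - logits.max()                 # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

logits = np.array([2.0, 0.5, -1.0])  # scores over a 3-character vocabulary
loss_correct = cross_entropy(logits, 0)  # network favors label 0
loss_wrong = cross_entropy(logits, 2)
assert loss_correct < loss_wrong         # right label yields a smaller loss
```

Minimizing this loss is one way of "increasing the similarity" between the recognition result and the correct result.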
The second aspect of the embodiments of this application may further perform the steps in the possible implementations of the first aspect. For the specific implementation steps of the second aspect and its possible implementations, and the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the possible implementations of the first aspect; details are not repeated here.
According to a third aspect, an embodiment of this application provides a text recognition method, which can be used in the text recognition field of the artificial intelligence field. The method includes: an execution device inputs a to-be-recognized image into the image feature extraction module, and performs feature extraction on the to-be-recognized image to generate a first feature corresponding to a first character in the to-be-recognized image, where the first character is a character in the to-be-recognized image that needs to be recognized; the execution device inputs a preset character corresponding to the first character in the to-be-recognized image into the text feature acquisition module, and performs text prediction according to the preset character to generate a semantic feature of a first predicted character; and the execution device performs a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character, to generate a recognition result corresponding to the first character in the to-be-recognized image. The image feature extraction module, the text feature acquisition module, and the recognition module belong to the same text recognition network.
The third aspect of the embodiments of this application may further perform the steps in the possible implementations of the first aspect. For the specific implementation steps of the third aspect and its possible implementations, and the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the possible implementations of the first aspect; details are not repeated here.
According to a fourth aspect, an embodiment of this application provides a training apparatus for a text recognition network. The text recognition network is a neural network used to recognize characters in an image, and includes an image feature extraction module, a text feature acquisition module, and a recognition module. The training apparatus includes: an input unit, configured to input a to-be-recognized image into the image feature extraction module, and perform feature extraction on the to-be-recognized image to generate a first feature corresponding to a first character in the to-be-recognized image, where the first character is a character in the to-be-recognized image that needs to be recognized; the input unit is further configured to input a preset character corresponding to the first character in the to-be-recognized image into the text feature acquisition module, and perform text prediction according to the preset character to generate a semantic feature of a first predicted character; a recognition unit, configured to perform a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character, to generate a recognition result corresponding to the first character in the to-be-recognized image; and a training unit, configured to train the text recognition network according to the correct result corresponding to the first character in the to-be-recognized image, the recognition result, and a loss function, where the loss function indicates the similarity between the correct result and the recognition result corresponding to the first character in the to-be-recognized image.
The fourth aspect of the embodiments of this application may further perform the steps in the possible implementations of the second aspect. For the specific implementation steps of the fourth aspect and its possible implementations, and the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the possible implementations of the second aspect; details are not repeated here.
According to a fifth aspect, an embodiment of this application provides an execution device, which may include a processor coupled to a memory. The memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the steps performed by the text recognition network described in the first aspect are implemented.
According to a sixth aspect, an embodiment of this application provides a training device, which may include a processor coupled to a memory. The memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the training method for the text recognition network described in the second aspect is implemented.
According to a seventh aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to execute the steps performed by the text recognition network described in the first aspect, or causes the computer to execute the training method for the text recognition network described in the second aspect.
According to an eighth aspect, an embodiment of this application provides a circuit system including a processing circuit, where the processing circuit is configured to execute the steps performed by the text recognition network described in the first aspect, or to execute the training method for the text recognition network described in the second aspect.
According to a ninth aspect, an embodiment of this application provides a computer program that, when run on a computer, causes the computer to execute the steps performed by the text recognition network described in the first aspect, or to execute the training method for the text recognition network described in the second aspect.
According to a tenth aspect, an embodiment of this application provides a chip system including a processor, configured to implement the functions involved in the foregoing aspects, for example, sending or processing the data and/or information involved in the foregoing methods. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for a server or a communication device. The chip system may consist of a chip, or may include a chip and other discrete components.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the main framework of artificial intelligence according to an embodiment of this application;
FIG. 2 is a system architecture diagram of a text recognition system according to an embodiment of this application;
FIG. 3 is a schematic flowchart of the working process of a text recognition network according to an embodiment of this application;
FIG. 4 is a schematic flowchart of generating a fourth feature in the working process of the text recognition network according to an embodiment of this application;
FIG. 5 is a schematic flowchart of generating a fifth feature and a sixth feature in the working process of the text recognition network according to an embodiment of this application;
FIG. 6 is a schematic diagram of a network architecture of the text recognition network according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a training method for the text recognition network according to an embodiment of this application;
FIG. 8 is a schematic diagram of a beneficial effect of the text recognition network according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of the text recognition network according to an embodiment of this application;
FIG. 10 is another schematic structural diagram of the text recognition network according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of a training apparatus for the text recognition network according to an embodiment of this application;
FIG. 12 is another schematic structural diagram of the training apparatus for the text recognition network according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of an execution device according to an embodiment of this application;
FIG. 14 is a schematic structural diagram of a training device according to an embodiment of this application;
FIG. 15 is a schematic structural diagram of a chip according to an embodiment of this application.
Detailed Description
Embodiments of this application provide a text recognition network, a neural network training method, and a related device. A recognition result is generated from the semantic feature of a predicted character together with the image feature of the to-be-recognized image, so that the recognition operation is performed according to features of more dimensions; and because image problems such as blurring or partial occlusion of characters in the to-be-recognized image do not affect the accuracy of the predicted character, this helps improve the accuracy of the text recognition result.
The embodiments of this application are described below with reference to the accompanying drawings. A person of ordinary skill in the art may know that, with the development of technologies and the emergence of new scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
The terms "first", "second", and the like in the specification, the claims, and the foregoing accompanying drawings of this application are used to distinguish between similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way are interchangeable in appropriate circumstances, and are merely a way of distinguishing objects with the same attribute when the embodiments of this application are described. In addition, the terms "include" and "have" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.
First, the overall workflow of an artificial intelligence system is described. FIG. 1 shows a schematic structural diagram of the main framework of artificial intelligence. The framework is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (provision and processing technology implementation) of artificial intelligence to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing-capability support for the artificial intelligence system, enables communication with the external world, and provides support through a basic platform. The infrastructure communicates with the outside through sensors; computing capability is provided by intelligent chips, including but not limited to hardware acceleration chips such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), and a field-programmable gate array (FPGA); the basic platform includes platform assurance and support related to distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
(2) Data
Data at the upper layer of the infrastructure indicates the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet-of-Things data of conventional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data Processing
Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, performing machine thinking and problem solving with formalized information according to a reasoning control strategy; typical functions are searching and matching.
Decision-making refers to the process of making decisions on intelligent information after reasoning, and usually provides functions such as classification, ranking, and prediction.
(4) General Capabilities
After the data undergoes the data processing mentioned above, some general capabilities can be further formed based on the results of the data processing, for example, an algorithm or a general system, such as translation, text analysis, computer-vision processing, and speech recognition.
(5) Intelligent Products and Industry Applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, smart homes, intelligent healthcare, intelligent security, autonomous driving, and smart cities.
The embodiments of this application can be applied to various fields of artificial intelligence, and specifically to various scenarios in which characters in an image need to be recognized, the image being collected by a device such as a camera, a printer, or a scanner. As an example, in one application scenario, in fields such as finance, accounting, and taxation, an enterprise needs to scan documents such as receipts or invoices to obtain image files, and recognize the characters in the image files to extract text information, thereby enabling functions such as digital archiving, fast indexing, or analysis of documents. In another application scenario, when a user needs to enter information from a document such as an identity card, a driver's license, a vehicle license, or a passport, the user can use a camera to capture an image of the document and recognize the characters in the image to extract key information. It should be understood that the examples here are merely intended to facilitate understanding of the application scenarios of the embodiments of this application, and are not an exhaustive list of those scenarios. In all of the foregoing scenarios, the image quality may be low; therefore, the image needs to be recognized by the text recognition network provided in the embodiments of this application to improve the accuracy of the recognition result.
To facilitate understanding of this solution, the text recognition system provided in the embodiments of this application is first described with reference to FIG. 2. FIG. 2 is a system architecture diagram of the text recognition system according to an embodiment of this application. In FIG. 2, the text recognition system 200 includes an execution device 210, a training device 220, a database 230, and a data storage system 240, and the execution device 210 includes a computing module 211.
In the training phase, the database 230 stores a training data set, which may include multiple to-be-recognized images and the correct result corresponding to the first character in each to-be-recognized image. The training device 220 generates a target model/rule 201 for processing sequence data, and iteratively trains the target model/rule 201 with the training data set in the database to obtain a mature target model/rule 201.
In the inference phase, the execution device 210 can call data, code, and the like in the data storage system 240, and can also store data, instructions, and the like into the data storage system 240. The data storage system 240 may be configured in the execution device 210, or may be an external memory relative to the execution device 210. The computing module 211 can perform the recognition operation on the to-be-recognized image input to the execution device 210 through the mature target model/rule 201, to obtain the recognition result of the first character in the to-be-recognized image.
In some embodiments of this application, for example in FIG. 2, the "user" can interact directly with the execution device 210, that is, the execution device 210 and the client device are integrated into the same device. However, FIG. 2 is merely a schematic architecture diagram of two image processing systems provided in the embodiments of the present invention, and the positional relationships among the devices, components, and modules shown in the figure do not constitute any limitation. In other embodiments of this application, the execution device 210 and the client device may be separate devices: the execution device 210 is configured with an input/output interface to exchange data with the client device, the "user" can input a collected image to the input/output interface through the client device, and the execution device 210 returns the processing result to the client device through the input/output interface.
Based on the foregoing description, an embodiment of this application provides a text recognition network including an image feature extraction module, a text feature acquisition module, and a recognition module. The image feature extraction module is used to extract the image feature of the first character in the to-be-recognized image; the text feature acquisition module performs text prediction using the semantic information of the preset character corresponding to the first character in the to-be-recognized image, to obtain the semantic feature of the predicted character; and the recognition module performs a recognition operation according to the image feature of the first character in the to-be-recognized image and the semantic feature of the predicted character, to generate a recognition result. Because image problems such as blurring or partial occlusion of characters in the to-be-recognized image do not affect the accuracy of the predicted character, generating the recognition result from the semantic feature of the predicted character together with the image feature helps improve the accuracy of the text recognition result. As can be seen from the description of FIG. 2, the embodiments of this application include an inference phase and a training phase, whose procedures differ; the two phases are described separately below.
I. Inference phase
In the embodiments of this application, the inference phase describes the process in which the execution device 210 performs character recognition on the to-be-recognized image using the mature text recognition network. Referring to FIG. 3, FIG. 3 is a schematic flowchart of the working process of the text recognition network provided by an embodiment of this application. The method may include:
301. The execution device inputs the to-be-recognized image into the image feature extraction module, and performs feature extraction on the to-be-recognized image to generate a first feature corresponding to the first character in the to-be-recognized image.
In this embodiment of this application, after acquiring the to-be-recognized image, the execution device inputs it into the image feature extraction module of the text recognition network to perform feature extraction on the to-be-recognized image and generate the first feature corresponding to the first character in the to-be-recognized image, where the first character is a character in the to-be-recognized image that needs to be recognized.
The image feature extraction module of the text recognition network may specifically be a convolutional neural network, a histogram of oriented gradients (HOG), a local binary pattern (LBP), or another neural network used to extract features from images.
一个待识别图像中可以包括一行或多行第一字符,或者,一个待识别图像中包括一列或多列第一字符。若文本识别网络执行识别操作的粒度为字符,也即执行设备每通过文本识别网络执行一次识别操作,能够得到对待识别图像中一个字符的识别结果,则一个第一字符中包括一个或多个字符。作为示例,例如待识别图像中包括的一个第一字符为“cat”,文本识别网络执行一次识别操作即可得到一个字符“c”的识别结果。作为另一示例,例如待识别图像中包括的一个第一字符为“今天天气真棒”,文本识别网络执行一次识别操作生成的为一个字符“今”的特征,输出的为与字符“今”对应的识别结果。
若文本识别网络执行识别操作的粒度为词语,也即执行设备每通过文本识别网络执行一次识别操作,能够得到对待识别图像中一个词语的识别结果,则一个第一字符中包括一个或多个词语。作为示例,例如待识别图像中包括的一个第一字符为“how are you”,文本识别网络执行一次识别操作即可得到一个词语“how”的识别结果。作为另一示例,例如待识别图像中包括的一个第一字符为“今天天气真棒”,文本识别网络每执行一次识别操作即可得到一个词语“今天”的识别结果,应理解,上述举例均仅为方便理解本方案,不用于限定本方案。
具体的,在一种实现方式中,执行设备在获取到待识别图像之后,会对待识别图像进行图像分割,以生成至少一个分割后的待识别图像(也即将待识别图像分割为至少一个文本区域)。若一个待识别图像中包括一行或多行第一字符,则每个分割后的待识别图像(也即每个文本区域)均包括一行第一字符;若一个待识别图像中包括一列或多列第一字符,则每个分割后的待识别图像中均包括一列第一字符。
更具体的,在一种情况下,文本识别网络中还配置有图像分割模块,执行设备通过文本识别网络的图像分割模块,对待识别图像进行图像分割,以得到至少一个分割后的待识别图像。在另一种情况下,执行设备上除了配置文本识别网络之外,还可以配置有用于进行图像分割的第一神经网络,则执行设备通过第一神经网络对待识别图像进行图像分割,以得到至少一个分割后的待识别图像。进一步地,文本识别网络中的图像分割模块或者用于进行图像分割的第一神经网络具体均可以表现为基于渐进尺度扩展网络的形状鲁棒文本检测网络(shape Robust Text Detection with Progressive Scale Expansion Network,PSENet)、rCTPN、ASTER或其他用于图像分割的神经网络等,此处不做限定。
对应的,步骤301可以包括:执行设备将分割后的待识别图像输入至图像特征提取模块,对分割后的待识别图像进行特征提取,以生成与分割后的待识别图像中的第一字符对应的第一特征,分割后的图像为待识别图像中的一个文本区域。一个第一特征指的是一个分割后的待识别图像的特征,其中包括一行第一字符(也即待识别图像中的一个文本区域)的图像特征,或者,包括一列第一字符的图像特征。
在另一种实现方式中,在一个待识别图像中包括多行第一字符或者包括多列第一字符的情况下,执行设备在获取到待识别图像之后,将整个待识别图像输入至文本识别网络的图像特征提取模块中,对整个待识别图像进行特征提取,以生成与待识别图像中的第一字符对应的第一特征。其中,一个第一特征指的是整个待识别图像的特征,若一个待识别图像中包括一行第一字符或一列第一字符,则一个第一特征为待识别图像中的一行或一列第一字符的图像特征;若一个待识别图像中包括多行或多列第一字符,则一个第一特征为待识别图像中的多行或多列第一字符的图像特征。
302、执行设备将与待识别图像中的第一字符对应的预设字符输入至文本特征获取模块,根据预设字符进行文本预测,以生成第一预测字符的语义特征。
本申请实施例中,执行设备在初次对待识别图像中的第一字符进行识别操作的情况下,会获取与待识别图像中的第一字符对应的预设字符,将与待识别图像中的第一字符对应的预设字符输入至文本识别网络的文本特征获取模块中,以根据预设字符,通过文本特征获取模块进行文本预测,生成第一预测字符的语义特征。
其中,若执行设备对整个待识别图像进行过图像分割,则初次对待识别图像中的第一字符进行识别操作指的是初次对分割后的待识别图像(也即待识别图像的一个文本区域)执行识别操作时。若执行设备未对整个待识别图像进行过图像分割,则初次对待识别图像中的第一字符进行识别操作指的是初次对整个待识别图像进行识别操作。
预设字符可以为起始标志字符,在计算机程序中可以表现为<BOS>字符,用于指示文本特征获取模块开始进行文本预测。预设字符的表现形式为预先定义好的,具体可以表现为包括N个元素的向量,N个元素中每个元素都是一个确定的数值。进一步地,N为大于或等于1的整数。作为示例,例如预设字符具体可以为包括32个1的向量,或者,预设字符具体可以为包括64个2的向量等等,此处不做穷举。
文本识别网络的文本特征获取模块可以包括编码模块和解码模块,编码模块用于提取输入字符的文本特征,解码模块用于根据输入字符的文本特征生成预测字符的文本特征。进一步地,编码模块可以为循环神经网络(recurrent neural networks,RNNs)中的编码器,解码模块为循环神经网络中的解码器;作为示例,例如编码模块和解码模块可以为长短时记忆网络(long short term memory network,LSTM)中的编码模块和解码模块。编码模块也可以为自注意力(self-attention)编码模块,解码模块为自注意力解码模块;作为示例,例如编码模块和解码模块可以为基于转换器的双向编码表征(bidirectional encoder representations from transformers,BERT)的神经网络的自注意力编码模块和自注意力解码模块等,编码模块和解码模块还可以表现为其他用于进行文本预测的神经网络中的编码模块和解码模块等,此处不做穷举。
具体的,在一种实现方式中,文本特征获取模块中的编码模块和解码模块分别为自注意力编码模块和自注意力解码模块。步骤302可以包括:执行设备通过文本特征获取模块将预设字符由字符形式转换为张量形式,以生成预设字符的字符编码,并根据预设字符在待识别图像中的第一字符的位置,生成预设字符的位置编码;将预设字符的字符编码和预设字符的位置编码进行组合,得到预设字符的初始特征。进而执行设备通过文本特征获取模块,根据预设字符的初始特征执行自注意力编码操作和自注意力解码操作,以生成第一预测字符的语义特征。
本申请实施例中,通过对预设字符的初始特征执行自注意力编码操作和自注意力解码操作的方式来进行文本预测,以生成第一预测字符的语义特征,计算速度快,且复杂度低。
更具体的,针对字符编码的生成过程。执行设备可以通过文本特征获取模块对预设字符进行向量化(embedding)处理,以生成预设字符的字符编码。执行设备还可以获取预设字符的独热(one-hot)编码,并将预设字符的独热编码确定为预设字符的字符编码等,此处不限定生成预设字符的字符编码的过程。其中,预设字符的字符编码可以为包括M个元素的向量,M的取值与文本识别网络的文本特征获取模块采用什么样的神经网络有关,此处不做限定。
针对位置编码的生成过程。预设字符在待识别图像中的第一字符的位置为首位,预设字符的位置编码指示预设字符的位置为首位。可选地,预设字符的位置编码也可以为包括M个元素的向量。作为示例,例如M的取值为512,则预设字符的位置编码可以为包括1个1和511个0的向量,预设字符的位置编码中的1位于首位,指示预设字符在待识别图像中的第一字符的位置为首位,可选地,执行设备还可以通过余弦函数,对前述512个元素进行二次转换等,应理解,此处对于M的取值、位置编码的表现形式的举例均仅为方便理解本方案,不用于限定本方案。字符编码和位置编码进行组合的方式包括但不限于拼接(concat)、相加(add)、融合(fusion)和相乘等。
针对生成第一预测字符的语义特征的过程。执行设备在得到预设字符的初始特征之后,需要通过文本识别网络的文本特征获取模块进行文本预测,也即对预设字符的初始特征执行自注意力编码操作,以生成预设字符的更新后特征,对预设字符的更新后特征进行自注意力解码操作,以生成第一预测字符的语义特征。
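上述“字符编码与位置编码组合得到预设字符初始特征”的过程,可以用如下Python代码示意(其中编码维度M取8、<BOS>的编号取0、以相加方式组合,均为便于演示的假设,并非本申请限定的实现):

```python
import numpy as np

M = 8  # 假设的编码维度,实际取值由文本特征获取模块所采用的网络决定

def one_hot(char_id: int, dim: int = M) -> np.ndarray:
    """以独热(one-hot)编码作为字符编码的一种示意实现。"""
    vec = np.zeros(dim)
    vec[char_id] = 1.0
    return vec

def position_encoding(pos: int, dim: int = M) -> np.ndarray:
    """以正弦/余弦函数对位置进行二次转换的一种示意实现。"""
    i = np.arange(dim)
    angle = pos / (10000.0 ** (i / dim))
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

BOS_ID = 0  # 预设字符<BOS>对应的假设编号
# 预设字符位于第一字符的首位,位置取0;字符编码与位置编码以相加(add)方式组合
init_feature = one_hot(BOS_ID) + position_encoding(0)
```

得到的init_feature即对应文中所述“预设字符的初始特征”,后续再对其执行自注意力编码操作和自注意力解码操作。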
在另一种实现方式中,文本特征获取模块中的编码模块和解码模块选自于循环神经网络。步骤302可以包括:执行设备通过文本特征获取模块将预设字符由字符形式转换为张量形式,以生成预设字符的字符编码,并将预设字符的字符编码确定为预设字符的初始特征。进而执行设备通过文本特征获取模块,根据预设字符的初始特征执行编码操作和解码操作,以生成第一预测字符的语义特征。
需要说明的是,当文本特征获取模块中的编码模块和解码模块选自于其他类型的用于进行文本预测的神经网络时,步骤302可以对应进行修改,此处不做穷举。
此外,本申请实施例不限定步骤301和步骤302的执行顺序,可以先执行步骤301,再执行步骤302,也可以先执行步骤302,再执行步骤301,也可以同时执行步骤301和步骤302。
303、执行设备通过特征更新模块,将预设字符的特征与第一特征进行组合,以生成第四特征。
本申请的一些实施例中,执行设备在通过文本识别网络的图像特征提取模块生成与待识别图像中的第一字符对应的第一特征之后,还可以将预设字符的特征与第一特征进行组合,以生成第四特征,第四特征为更新后的第一特征。其中,预设字符的特征可以为预设字符的更新后特征,也可以为预设字符的初始特征。
具体的,在一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据预设字符的初始特征执行自注意力编码操作,得到预设字符的更新后特征,根据第一特征和预设字符的更新后特征执行自注意力编码操作,以生成第四特征。
针对预设字符的更新后特征的生成过程,为更直观地理解自注意力编码的过程,如下公开了对预设字符进行自注意力编码操作的公式:
Q′_char=Norm(softmax(Q_char·K_char)V_char+Q_char);(1)
其中,Q_char为预设字符的初始特征与第一转换矩阵相乘得到,K_char为预设字符的初始特征与第二转换矩阵相乘得到,Q_char·K_char代表Q_char与K_char点乘,softmax(Q_char·K_char)V_char代表softmax(Q_char·K_char)与V_char点乘,V_char为预设字符的初始特征与第三转换矩阵相乘得到,Q′_char代表预设字符的更新后特征,第一转换矩阵、第二转换矩阵和第三转换矩阵可以相同或不同,应理解,式(1)中的举例仅为更方便理解本方案,不用于限定本方案。
针对生成第四特征的过程,为更直观地理解根据第一特征和预设字符的更新后特征执行自注意力编码的过程,如下公开了相应的自注意力编码操作的公式:
Q′_img=Norm(softmax(Q′_char·K_img)V_img+Q′_char);(2)
其中,Q′_img代表第四特征,Q′_char代表预设字符的更新后特征,K_img为将第一特征与第四转换矩阵相乘得到,V_img为将第一特征与第五转换矩阵相乘得到,第四转换矩阵和第五转换矩阵可以相同或不同,应理解,式(2)中的举例仅为更方便理解本方案,不用于限定本方案。
为了更为直观的理解本方案,请参阅图4,图4为本申请实施例提供的文本识别网络的工作流程中生成第四特征的一种流程示意图,图4中以文本识别网络执行一次识别操作得到一个字符的识别结果,也即一个第二字符中包括一个字符为例。如图4所示,执行设备将待识别图像输入至文本识别网络的图像特征提取模块,得到待识别图像中第一字符的图像特征(也即待识别图像中第一字符的第一特征),图4中以图像特征提取模块包括多个卷积层和多个池化层为例,max pool指的是最大池化。如图4所示,执行设备生成预设字符的字符编码和位置编码,以得到预设字符的初始特征,并通过上述式(1)生成预设字符的更新后特征Q′_char。执行设备在得到待识别图像中第一字符的图像特征(也即待识别图像中第一字符的第一特征)和预设字符的更新后特征后,通过上述式(2)执行自注意力编码操作,以生成第四特征。需要说明的是,在实际情况中,除了根据第一特征和预设字符的更新后特征执行自注意力编码操作之外,文本识别网络中的特征更新模块还可以设置有更多的神经网络,例如文本识别网络中的特征更新模块中还可以设置有前馈神经网络、正则化模块等,图4仅为方便理解本方案的一种示例,不用于限定本方案。
在另一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据第一特征和预设字符的初始特征执行自注意力编码操作,以生成第四特征。
在另一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据预设字符的初始特征执行编码操作,得到预设字符的更新后特征,根据第一特征和预设字符的更新后特征执行编码操作,以生成第四特征。进一步地,文本识别网络的特征更新模块通过编码器执行编码操作,该编码器为循环神经网络中的编码器。
在另一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据第一特征和预设字符的初始特征执行编码操作,以生成第四特征。
304、执行设备根据第一特征和第一预测字符的语义特征,通过识别模块执行识别操作,以生成一个第一识别结果。
本申请实施例中,执行设备通过识别模块,将第一特征和第一预测字符的语义特征进行组合,并根据组合后的特征执行识别操作,以生成一个第一识别结果。若文本识别网络执行识别操作的粒度为字符,则一个第一识别结果为文本识别网络识别出的一个字符。若文本识别网络执行识别操作的粒度为词语,则一个第一识别结果为文本识别网络识别出的一个词语。
具体的,步骤303为可选步骤,若执行步骤303,则步骤304包括:执行设备通过识别模块,将第四特征(也即更新后的第一特征)和第一预测字符的语义特征进行组合,并根据组合后的特征执行识别操作,以生成一个第一识别结果。
针对第四特征和第一预测字符的语义特征进行组合的过程。在一种实现方式中,执行设备通过识别模块,直接以拼接、矩阵乘法等方式将第四特征(也即更新后的第一特征)和第一预测字符的语义特征进行组合。
在另一种实现方式中,执行设备通过识别模块,根据第四特征和第一预测字符的语义特征之间的相似度,执行第四特征和第一预测字符的语义特征的组合操作。执行设备通过识别模块,计算第四特征和第一预测字符的语义特征之间的第一相似度;根据第四特征、第一预测字符的语义特征和第一相似度,生成第五特征;并根据第四特征、第一预测字符的语义特征和第一相似度,生成第六特征。进而通过识别模块将第五特征和第六特征进行组合。
其中,第一相似度可以通过计算第四特征和第一预测字符的语义特征之间的余弦相似度、欧式距离、马氏距离等获得,或者,第一相似度可以通过将第四特征和第一预测字符的语义特征进行点乘操作获得。进一步地,第一相似度中可以包括一个相似度值,也可以为互为转置的两个相似度值。第五特征为在第四特征的基础上组合了第一预测字符的语义特征,第六特征为在第一预测字符的语义特征的基础上组合了第四特征。第五特征和第六特征进行组合的方式包括但不限于拼接、相加、相乘或其他组合方式等,此处不做穷举。
更具体的,为了更直观的理解生成第五特征和第六特征的过程,请参阅图5,图5为本申请实施例提供的文本识别网络的工作流程中生成第五特征和第六特征的一种流程示意图,图5中以通过点乘的方式生成第一相似度为例。其中,K_vis代表第四特征(也即更新后的第一特征),Q_lin代表第一预测字符的语义特征,W_lin和W_vis分别为第一权重和第二权重,P_lin通过Q_lin和第一权重点乘得到,P_vis通过K_vis和第二权重点乘得到,第一权重和第二权重为文本识别网络训练阶段确定的。S_vis代表第四特征与第一预测字符的相似度,由P_lin与K_vis点乘并结合d计算得到;S_lin代表第一预测字符与第四特征的相似度,由P_vis与Q_lin点乘并结合d计算得到,d代表特征的维度数量,也即d代表第四特征或第五特征中包括的元素数量。第五特征为在第四特征的基础上组合了第一预测字符的语义特征,在图5中,第五特征由K_vis拼接一个通过S_lin与K_lin点乘得到的特征而得到;第六特征为在第一预测字符的语义特征的基础上组合了第四特征,在图5中,第六特征由Q_lin拼接一个通过S_vis与K_vis点乘得到的特征而得到。应理解,图5仅为方便理解本方案的一种示例,不用于限定本方案。
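上述基于相似度组合第四特征与第一预测字符的语义特征、生成第五特征和第六特征的过程,可以用如下代码示意。需要强调的是,图5中以公式图片给出的具体算式此处无法完全复原,以下代码中softmax归一化、除以√d的缩放以及K_lin的构造方式,均为便于演示而标注的假设:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 5  # 假设:特征维度为4,第四特征包含5个位置

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

K_vis = rng.standard_normal((T, d))   # 第四特征(更新后的第一特征)
Q_lin = rng.standard_normal((1, d))   # 第一预测字符的语义特征
# 各权重矩阵在训练阶段确定,此处随机代替
W_lin, W_vis, W_k = (rng.standard_normal((d, d)) for _ in range(3))

P_lin = Q_lin @ W_lin                 # Q_lin与第一权重点乘
P_vis = K_vis @ W_vis                 # K_vis与第二权重点乘
K_lin = Q_lin @ W_k                   # 假设:由语义特征经变换得到K_lin

S_vis = softmax(P_lin @ K_vis.T / np.sqrt(d))   # 第四特征与预测字符的相似度, 形状(1, T)
S_lin = softmax(P_vis @ Q_lin.T / np.sqrt(d))   # 预测字符与第四特征的相似度, 形状(T, 1)

fifth = np.concatenate([K_vis, S_lin @ K_lin], axis=-1)  # 第五特征:(T, 2d)
sixth = np.concatenate([Q_lin, S_vis @ K_vis], axis=-1)  # 第六特征:(1, 2d)
```

其中fifth在图像特征的基础上拼接了按相似度加权的语义特征,sixth在语义特征的基础上拼接了按相似度加权的图像特征,与文中对第五特征、第六特征的描述相对应。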
针对根据组合后的特征执行识别操作的过程。执行设备在通过识别模块,将第五特征和第六特征进行组合之后,将组合后的特征输入至识别模块中的分类网络,以通过识别模块中的分类网络执行识别操作,得到整个识别模块输出的一个第一识别结果。
其中,分类网络具体可以表现为分类器,分类器可以选择多层感知机(multi-layer perceptron,MLP),分类器也可以由一个线性变换矩阵和softmax分类函数组成,此处不限定分类网络的具体形式。
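由一个线性变换矩阵和softmax分类函数组成的分类器,可以用如下最简代码示意(类别数num_classes与组合后特征的维度feat_dim均为假设值):

```python
import numpy as np

rng = np.random.default_rng(2)
num_classes, feat_dim = 10, 8  # 假设的字符类别数与组合后特征的维度

W = rng.standard_normal((feat_dim, num_classes))  # 线性变换矩阵
b = np.zeros(num_classes)

def classify(combined_feature: np.ndarray) -> int:
    """线性变换 + softmax 的最简分类网络示意。"""
    logits = combined_feature @ W + b
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return int(np.argmax(probs))  # 取概率最大的类别作为一个识别结果

pred = classify(rng.standard_normal(feat_dim))
```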
若不执行步骤303,则步骤304包括:执行设备通过识别模块,将通过步骤301得到的第一特征和第一预测字符的语义特征进行组合,并根据组合后的特征执行识别操作,以生成一个第一识别结果。具体实现过程可参见执行步骤303情况中的描述,此处不再一一赘述。
305、执行设备将与第一字符中的已识别字符对应的第二字符输入至文本特征获取模块,根据第二字符进行文本预测,以生成第一预测字符的语义特征。
本申请实施例中,步骤305的具体实现方式与步骤302的具体实现方式类似,执行设备在已对第一字符中至少一个字符执行过识别操作的情况下,会获取与第一字符中的所有已识别字符对应的至少一个第二字符。具体的,执行设备将预设字符以及与第一字符中的所有已识别字符对应的至少一个识别结果,确定为与第一字符中的已识别字符对应的至少一个第二字符;而在初次对待识别图像中的第一字符执行识别操作的情况下,执行设备根据预设字符生成第一预测字符的语义特征。如此,保证了本方案的完整性,整个识别过程不再需要人工干预,提高了本方案的用户粘性。
更具体的,若执行设备为通过步骤304进入步骤305,则步骤305包括:执行设备将第一识别结果和预设字符确定为与第一字符中的已识别字符对应的一个第二字符。若执行设备为通过步骤307进入步骤305,则步骤305包括:执行设备将预设字符、第一识别结果和至少一个第二识别结果确定为与第一字符中的已识别字符对应的多个第二字符。
其中,在文本识别网络执行识别操作的粒度为字符的情况下,第一字符为包括至少一个字符的词语,一个识别结果中包括一个字符,每个第二字符中包括一个字符。在文本识别网络执行识别操作的粒度为词语的情况下,第一字符中包括至少一个词语,一个识别结果为包括一个或多个字符的词语,每个第二字符为包括一个或多个字符的词语。本申请实施例中,文本识别网络执行识别操作的粒度可以为字符或词语,扩展了本方案的应用场景,提高了本方案的实现灵活性。
执行设备将与第一字符中的所有已识别字符对应的所有第二字符输入至文本识别网络的文本特征获取模块中,以根据所有第二字符,通过文本特征获取模块中的编码模块和解码模块进行文本预测,生成第一预测字符的语义特征。
具体的,在一种实现方式中,文本特征获取模块中的编码模块和解码模块分别为自注意力编码模块和自注意力解码模块。步骤305可以包括:执行设备通过文本特征获取模块将至少一个第二字符中任一个第二字符由字符形式转换为张量形式,以生成一个第二字符的字符编码,并根据至少一个第二字符中任一个第二字符在待识别图像中的第一字符的位置,生成一个第二字符的位置编码。执行设备通过文本识别网络的文本特征获取模块,将至少一个第二字符中任一个第二字符的字符编码和位置编码进行组合,得到一个第二字符的初始特征。执行设备通过文本识别网络的文本特征获取模块,对至少一个第二字符中每个第二字符均执行前述操作,从而生成至少一个第二字符中每个第二字符的初始特征。进而执行设备通过文本特征获取模块,根据第二字符的初始特征执行自注意力编码操作和自注意力解码操作,以生成第一预测字符的语义特征。
在另一种实现方式中,文本特征获取模块中的编码模块和解码模块选自于循环神经网络。步骤305可以包括:执行设备通过文本特征获取模块将至少一个第二字符中每个第二字符由字符形式转换为张量形式,以生成每个第二字符的字符编码,并将每个第二字符的字符编码确定为每个第二字符的初始特征。进而执行设备通过文本特征获取模块,根据至少一个第二字符中所有第二字符的初始特征执行编码操作和解码操作,以生成第一预测字符的语义特征。
上述两种实现方式的具体实现方式均可参阅上述步骤302中的描述,此处不做赘述。
306、执行设备通过特征更新模块,将第二字符的特征与第一特征进行组合,以生成第七特征。
本申请实施例中,步骤306的具体实现方式与步骤303的具体实现方式类似,执行设备在通过文本识别网络的图像特征提取模块生成与待识别图像中的第一字符对应的第一特征之后,还可以将第二字符的特征与第一特征进行组合,以生成第七特征,第七特征为更新后的第一特征。其中,第二字符的特征可以为第二字符的更新后特征,也可以为第二字符的初始特征。第一特征中包括多个第一字符的图像特征,多个第一字符中有至少一个第一字符是已经被执行过识别操作的字符,在第二字符包括与多个已识别字符对应的识别结果的情况下,第二字符的特征包括已识别字符对应的识别结果的特征。因此,第七特征相对于第一特征,已识别字符的特征会得到增强。
本申请实施例中,在图像特征中融入已经识别字符的语义特征,使图像特征中已经识别的字符的特征更为明显,从而识别模块可以更为集中识别尚未识别的字符,以降低识别模块单次识别过程的难度,有利于提高文本识别的准确率。
具体的,在一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据第二字符的初始特征执行自注意力编码操作,得到第二字符的更新后特征,根据第一特征和第二字符的更新后特征执行自注意力编码操作,以生成第七特征(也即更新后的第一特征)。本申请实施例中,采用自注意力编码的方式,将第二字符的特征与第一特征组合,有利于实现第二字符特征与第一特征的充分组合,且复杂度较低,易于实现。
在另一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据第一特征和第二字符的初始特征执行自注意力编码操作,以生成第七特征。
在另一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据第二字符的初始特征执行编码操作,得到第二字符的更新后特征,根据第一特征和第二字符的更新后特征执行编码操作,以生成第七特征。进一步地,文本识别网络的特征更新模块通过编码器执行编码操作,该编码器为循环神经网络中的编码器。
在另一种实现方式中,执行设备通过文本识别网络的特征更新模块,根据第一特征和第二字符的初始特征执行编码操作,以生成第七特征。
步骤306中各种形式的具体实现方式可参考上述步骤303中的描述,此处不再一一赘述。
307、执行设备根据第一特征和第一预测字符的语义特征,通过识别模块执行识别操作,以生成一个第二识别结果。
本申请实施例中,步骤307的具体实现方式与步骤304的具体实现方式类似,执行设备通过识别模块,将第一特征和第一预测字符的语义特征进行组合,并根据组合后的特征执行识别操作,以生成一个第二识别结果。
具体的,步骤306为可选步骤,若执行步骤306,则步骤307包括:执行设备通过识别模块,将第七特征(也即更新后第一特征)和第一预测字符的语义特征进行组合,并根据组合后的特征执行识别操作,以生成一个第二识别结果。
针对第七特征和第一预测字符的语义特征进行组合的过程。在一种实现方式中,执行设备通过识别模块,直接以拼接、矩阵乘法等方式将第七特征(也即更新后的第一特征)和第一预测字符的语义特征进行组合。
在另一种实现方式中,执行设备通过识别模块,根据第七特征和第一预测字符的语义特征之间的相似度,执行第七特征和第一预测字符的语义特征的组合操作。执行设备通过识别模块,计算第七特征(也即更新后的第一特征)和第一预测字符的语义特征之间的相似度;根据第七特征、第一预测字符的语义特征和相似度,生成第二特征和第三特征。其中,第二特征为在第七特征的基础上组合了第一预测字符的语义特征,第三特征为在第一预测字符的语义特征的基础上组合了第七特征;根据第二特征和第三特征执行识别操作,以生成第二识别结果。
本申请实施例中,计算第一特征和第一预测字符的语义特征之间的相似度,进而根据第一特征和第一预测字符的语义特征之间的相似度,生成第二特征和第三特征,第二特征为在第一特征的基础上组合了第一预测字符的语义特征,第三特征为在第一预测字符的语义特征的基础上组合了第一特征,也即不仅根据预测字符的语义特征来增强待识别字符的图像特征,而且在预测字符的语义特征中融入了待识别字符的图像特征,有利于图像特征和预测字符特征的充分融合,有利于提高文本识别结果的准确度。
针对根据组合后的特征执行识别操作的过程。执行设备在通过识别模块,将第二特征和第三特征进行组合之后,将组合后的特征输入至识别模块中的分类网络,以通过识别模块中的分类网络执行识别操作,得到整个识别模块输出的一个第二识别结果。
若不执行步骤306,则步骤307包括:执行设备通过识别模块,将通过步骤301得到的第一特征和第一预测字符的语义特征进行组合,并根据组合后的特征执行识别操作,以生成一个第二识别结果。步骤307的具体实现方式均可以参阅上述步骤304中的描述,此处不做赘述。
需要说明的是,本申请实施例不限定步骤301至304与步骤305至307的执行次数,可以在执行一次步骤301至304后,反复执行步骤305至307,得到多个第二识别结果。
具体的,若文本识别网络执行一次识别操作的粒度为字符,则执行设备每执行一次步骤305至307,能够得到一个第一字符中一个字符的识别结果,执行设备反复执行步骤305至307多次,以得到一个第一字符中所有字符的识别结果。若文本识别网络执行一次识别操作的粒度为词语,则执行设备每执行一次步骤305至307,能够得到一个第一字符中一个词语的识别结果,执行设备反复执行步骤305至307多次,以得到一个第一字符中所有词语的识别结果。进而可以输出整个第一字符的输出结果。
进一步地,若一个第一字符中只包括一个待识别的字符,或者,一个第一字符中只包括一个待识别的词语,则执行设备在执行完步骤301至304之后,可以直接输出整个第一字符的识别结果。
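步骤301至304执行一次、之后反复执行步骤305至307直至得到全部识别结果的整体流程,可以用如下代码勾勒。其中recognize_step为代替真实网络的桩函数,目标词“sheet”仅为演示用的假设,真实场景中该步骤由文本特征获取模块、特征更新模块与识别模块共同完成:

```python
def recognize_step(image_feature, chars):
    """假设的单步识别:根据图像特征与已识别字符序列(含<BOS>)返回下一个识别结果。
    此处用桩函数模拟,固定返回目标词“sheet”中的下一个字符。"""
    target = "sheet"
    pos = len(chars) - 1          # 去掉<BOS>后,已识别的字符数
    return target[pos] if pos < len(target) else "<EOS>"

def recognize(image_feature):
    chars = ["<BOS>"]             # 预设字符,指示开始文本预测(对应步骤302)
    while True:
        nxt = recognize_step(image_feature, chars)   # 对应步骤304/307的一次识别操作
        if nxt == "<EOS>" or len(chars) > 32:
            break
        chars.append(nxt)         # 已识别结果作为下一轮的第二字符输入(对应步骤305)
    return "".join(chars[1:])

result = recognize(None)
```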
为了更为直观的理解本方案,请参阅图6,图6为本申请实施例提供的文本识别网络的一种网络架构示意图。文本识别网络包括图像特征提取模块、A1、A2和识别模块,A1代表文本特征获取模块,A2代表特征更新模块。如图6所示,执行设备将待识别图像输入至图像特征提取模块中,以得到待识别图像中的第一字符的图像特征(也即第一特征),将与待识别图像中的第一字符对应的字符输入至A1(也即文本特征获取模块),待识别图像中的第一字符对应的字符可以为预设字符,也可以为预设字符和第二字符,以通过文本特征获取模块生成字符的初始特征,并对字符的初始特征执行自注意力编码操作和自注意力解码操作,得到预测字符的语义特征。执行设备在得到第一特征之后,还会将字符的初始特征进行自注意力编码,以得到字符的更新后特征,进而根据第一特征与字符的更新后特征执行自注意力编码操作,以生成更新后的第一特征。执行设备将更新后的第一特征和预测字符的语义特征输入至识别模块中,以通过识别模块执行识别操作,输出识别结果。图6中各步骤的具体实现方式可参阅上述描述,此处不再一一赘述,应理解,在实际情况中,文本识别网络中可以设置更多或更少的神经网络层,图6仅为方便理解本方案的一种示例,不用于限定本方案。
本实现方式中,不仅获取待识别图像的图像特征,还根据与第一字符中的已识别字符对应的第二字符,生成预测字符的语义特征,根据更多维度的特征执行识别操作,有利于提高文本识别结果的准确度;且由于当出现待识别图像模糊或待识别图像中部分字符被遮挡等因素时,第一特征中包括的模糊字符或被遮挡字符的特征的准确度会大大降低,基于已识别字符的语义信息,生成预测字符的语义特征,由于图像模糊或待识别图像中部分字符被遮挡等图像质量问题并不会影响预测字符的准确度,根据预测字符的语义特征和图像特征一起生成识别结果,有利于提高文本识别结果的准确度。
二、训练阶段
本申请实施例中,训练阶段描述的是训练设备220如何对文本识别网络进行训练的过程。请参阅图7,图7为本申请实施例提供的文本识别网络的训练方法的一种流程示意图,方法可以包括:
701、训练设备从训练数据集合中获取待识别图像。
本申请实施例中,训练设备上预先配置有训练数据集合,训练数据集合中包括多个待识别图像,以及,与每个待识别图像中的第一字符对应的正确结果,训练设备从训练数据集合中随机获取一个待识别图像。
702、训练设备将待识别图像输入至图像特征提取模块,对待识别图像进行特征提取,以生成与待识别图像中的第一字符对应的第一特征。
703、训练设备将与待识别图像中的第一字符对应的预设字符输入至文本特征获取模块,根据预设字符进行文本预测,以生成第一预测字符的语义特征。
704、训练设备通过特征更新模块,将预设字符的特征与第一特征进行组合,以生成第四特征。
705、训练设备根据第一特征和第一预测字符的语义特征,通过识别模块执行识别操作,以生成一个第一识别结果。
706、训练设备将与第一字符中的已识别字符对应的第二字符输入至文本特征获取模块,根据第二字符进行文本预测,以生成第一预测字符的语义特征。
707、训练设备通过特征更新模块,将第二字符的特征与第一特征进行组合,以生成第七特征。
708、训练设备根据第一特征和第一预测字符的语义特征,通过识别模块执行识别操作,以生成一个第二识别结果。
本申请实施例中,训练设备执行步骤702至708的具体实现方式与图3对应实施例中步骤301至307的具体实现方式类似,可参阅图3对应实施例中对步骤301至307的描述,此处不做赘述。
709、训练设备根据与待识别图像中的第一字符对应的正确结果、识别结果和损失函数,对文本识别网络进行训练。
本申请实施例中,训练设备在得到待识别图像中一个第一字符的识别结果之后,会根据与待识别图像中的第一字符对应的正确结果和待识别图像中一个第一字符的识别结果,计算损失函数的函数值,并对损失函数的函数值进行梯度求导,以反向更新文本识别网络的权重参数,以完成对文本识别网络的一次训练。训练设备反复执行前述步骤,以实现对文本识别网络的迭代训练。
具体的,若一个第一字符中只包括一个待识别的字符,或者,一个第一字符中只包括一个待识别的词语,则训练设备在执行完步骤701至705之后,可以直接输出整个第一字符的识别结果,并根据与待识别图像中的第一字符对应的正确结果和步骤705输出的第一识别结果,计算损失函数的函数值。
若一个第一字符中包括多个待识别的字符,或者,一个第一字符中包括多个待识别的词语,则训练设备在执行一次步骤701至705,并执行至少一次步骤706至708之后,可以直接输出整个第一字符的识别结果,并根据与待识别图像中的第一字符对应的正确结果、步骤705输出的第一识别结果和通过步骤708得到的至少一个第二识别结果,计算损失函数的函数值。
其中,损失函数指示与待识别图像中的第一字符对应的正确结果和待识别图像中一个第一字符的识别结果之间的相似度,训练的目标为拉近与待识别图像中的第一字符对应的正确结果和待识别图像中一个第一字符的识别结果之间的相似度。损失函数具体可以表现为交叉熵损失函数、焦点(focal)损失函数、中心(center)损失函数或其他类型的损失函数等,此处不做限定。
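以交叉熵损失函数为例,根据正确结果与识别结果计算损失函数值的过程可示意如下(其中的类别概率分布为便于演示而假设的网络输出):

```python
import math

def cross_entropy(probs, correct_index):
    """交叉熵损失的最简形式:正确类别的预测概率越高,损失越小。"""
    return -math.log(probs[correct_index] + 1e-12)

# 假设网络对某个字符输出的类别概率分布,正确类别的索引为2
probs = [0.1, 0.2, 0.6, 0.1]
loss = cross_entropy(probs, 2)

# 当正确类别的概率更高时,损失更小,训练目标即拉近识别结果与正确结果
better_loss = cross_entropy([0.01, 0.01, 0.97, 0.01], 2)
```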
训练设备对文本识别网络进行迭代训练,直至满足预设条件,预设条件可以为损失函数满足收敛条件,也可以为迭代次数达到预设次数。
本申请实施例中,提供了文本识别网络的训练方法,提高了本方案的完整性;由于当出现待识别图像模糊或待识别图像中部分字符被遮挡等因素时,第一特征中包括的模糊字符或被遮挡字符的特征的准确度会大大降低。在训练阶段,基于已识别字符的语义信息,生成预测字符的语义特征,根据预测字符的语义特征和图像特征一起生成识别结果,由于图像模糊或待识别图像中部分字符被遮挡等图像质量问题并不会影响预测字符的准确度,有利于提高训练后的文本识别网络输出的文本识别结果的准确度。
为了更直观的了解本申请实施例带来的有益效果,以下表1中通过实验数据对本申请实施例所带来的有益效果进行展示。
表1
方法            svt      SVTP     CT80
OCR             88.2%    77.67%   84.98%
本申请实施例    92.4%    84.2%    89.9%
参阅上述表1,svt、SVTP和CT80分别为三个公开的数据集,表1中的第一行数据指示采用光学字符识别(optical character recognition,OCR)技术对分别对数据集svt、数据集SVTP和数据集CT80中的图像进行文本识别,得到的识别结果的准确率。表1中的第二行数据指示采用本申请实施例提供的文本识别网络,分别对数据集svt、数据集SVTP和数据集CT80中的图像进行文本识别,得到的识别结果的准确率。很明显,采用本申请实施例提供的文本识别网络得到的识别结果的准确率更高。
此外,请继续参阅图8,图8为本申请实施例提供的文本识别网络的一种有益效果示意图。针对图8中的第一行数据,当仅根据待识别图像的图像特征,对待识别图像中的字符进行识别,得到的识别结果为“shcct”,采用本申请实施例提供的文本识别网络,对待识别图像中的字符进行识别,得到的识别结果为“sheet”。针对图8中的第二行数据和第三行数据可类推理解,很明显,采用本申请实施例提供的文本识别网络得到的识别结果的准确度更高。
在图1至图8所对应的实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关设备。具体参阅图9,图9为本申请实施例提供的文本识别网络的一种结构示意图。文本识别网络900可以包括图像特征提取模块901、文本特征获取模块902和识别模块903。图像特征提取模块901,用于获取待识别图像,并对待识别图像进行特征提取,以生成与待识别图像中的第一字符对应的第一特征,其中,第一字符为待识别图像中需要进行识别的字符;文本特征获取模块902,用于获取与待识别图像中的第一字符对应的预设字符,并根据预设字符进行文本预测,以生成第一预测字符的语义特征;识别模块903,用于根据第一特征和第一预测字符的语义特征执行识别操作,以生成与待识别图像中的第一字符对应的识别结果。
在一种可能的设计中,文本特征获取模块902,具体用于在初次对待识别图像执行识别操作的情况下,获取与待识别图像中的第一字符对应的预设字符,并根据预设字符进行文本预测,以生成第一预测字符的语义特征;文本特征获取模块902,还用于在已对第一字符中至少一个字符执行过识别操作的情况下,将与第一字符中的已识别字符对应的识别结果确定为第二字符,并根据第二字符进行文本预测,以生成与第二字符对应的第二预测字符的语义特征。
在一种可能的设计中,识别模块903,还用于根据第一特征和第二预测字符的语义特征执行识别操作,以生成与待识别图像中的第一字符对应的识别结果。
在一种可能的设计中,请参阅图10,图10为本申请实施例提供的文本识别网络的一种结构示意图。文本特征获取模块902包括:第一生成子模块9021,用于对预设字符进行向量化处理,以生成预设字符的字符编码,并根据预设字符在待识别图像中的第一字符的位置,生成预设字符的位置编码;组合子模块9022,用于将预设字符的字符编码和预设字符的位置编码进行组合,得到预设字符的初始特征,并根据预设字符的初始特征执行自注意力编码操作和自注意力解码操作,以生成第一预测字符的语义特征。
在一种可能的设计中,请参阅图10,识别模块903包括:计算子模块9031,用于计算第一特征和第一预测字符的语义特征之间的相似度;第二生成子模块9032,用于根据第一特征、第一预测字符的语义特征和相似度,生成第二特征和第三特征,其中,第二特征为在第一特征的基础上组合了第一预测字符的语义特征,第三特征为在第一预测字符的语义特征的基础上组合了第一特征;第二生成子模块9032,还用于根据第二特征和第三特征执行识别操作,以生成识别结果。
在一种可能的设计中,请参阅图10,文本识别网络还包括特征更新模块904,特征更新模块904,用于:将预设字符的特征与第一特征进行组合,以生成更新后的第一特征;识别模块903,具体用于根据更新后的第一特征和第一预测字符的语义特征执行识别操作,以生成与待识别图像中的第一字符对应的识别结果。
在一种可能的设计中,特征更新模块904,具体用于:根据预设字符的初始特征执行自注意力编码操作,得到预设字符的更新后特征,并根据第一特征和预设字符的更新后特征执行自注意力编码操作,以生成更新后的第一特征。
在一种可能的设计中,在文本识别网络执行识别操作的粒度为字符的情况下,一个第一字符中包括至少一个字符,文本识别网络执行一次识别操作输出的一个识别结果中包括一个字符;在文本识别网络执行识别操作的粒度为词语的情况下,一个第一字符中包括至少一个词语,文本识别网络执行一次识别操作输出的一个识别结果为包括一个或多个字符的词语。
需要说明的是,文本识别网络900中各模块/单元之间的信息交互、执行过程等内容,与本申请中图3至图6对应的各个方法实施例基于同一构思,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。
本申请实施例还提供一种文本识别网络的训练装置,具体参阅图11,图11为本申请实施例提供的文本识别网络的训练装置的一种结构示意图。文本识别网络为用于识别图像中字符的神经网络,文本识别网络包括图像特征提取模块、文本特征获取模块和识别模块。文本识别网络的训练装置1100包括:输入单元1101、识别单元1102和训练单元1103。输入单元1101,用于将待识别图像输入至图像特征提取模块,对待识别图像进行特征提取,以生成与待识别图像中的第一字符对应的第一特征,其中,第一字符为待识别图像中需要进行识别的字符;输入单元1101,还用于将与待识别图像中的第一字符对应的预设字符输入至文本特征获取模块,并根据预设字符进行文本预测,以生成第一预测字符的语义特征;识别单元1102,用于根据第一特征和第一预测字符的语义特征通过识别模块执行识别操作,以生成与待识别图像中的第一字符对应的识别结果;训练单元1103,用于根据与待识别图像中的第一字符对应的正确结果、识别结果和损失函数,对文本识别网络进行训练,损失函数指示与待识别图像中的第一字符对应的正确结果和与待识别图像中的第一字符对应的识别结果之间的相似度。
在一种可能的设计中,请参阅图12,图12为本申请实施例提供的文本识别网络的训练装置的一种结构示意图。输入单元1101,具体用于在初次对待识别图像执行识别操作的情况下,将与待识别图像中的第一字符对应的预设字符输入至文本特征获取模块;文本识别网络的训练装置1100还包括生成单元1104,用于在已对第一字符中至少一个字符执行过识别操作的情况下,通过文本特征获取模块将与第一字符中的已识别字符对应的识别结果确定为第二字符,并生成与第二字符对应的第二预测字符的语义特征。
在一种可能的设计中,识别单元1102,还用于根据第一特征和第二预测字符的语义特征,通过识别模块执行识别操作,以生成与待识别图像中的第一字符对应的识别结果。
在一种可能的设计中,输入单元1101,具体用于通过文本特征获取模块对预设字符进行向量化处理,以生成预设字符的字符编码,并根据预设字符在待识别图像中的第一字符的位置,生成预设字符的位置编码;通过文本特征获取模块将预设字符的字符编码和预设字符的位置编码进行组合,得到预设字符的初始特征,并根据预设字符的初始特征执行自注意力编码操作和自注意力解码操作,以生成第一预测字符的语义特征。
在一种可能的设计中,识别单元1102,具体用于:通过识别模块计算第一特征和第一预测字符的语义特征之间的相似度;根据第一特征、第一预测字符的语义特征和相似度,通过识别模块生成第二特征和第三特征,其中,第二特征为在第一特征的基础上组合了第一预测字符的语义特征,第三特征为在第一预测字符的语义特征的基础上组合了第一特征;通过识别模块根据第二特征和第三特征执行识别操作,以生成识别结果。
在一种可能的设计中,请参阅图12,文本识别网络还包括特征更新模块。文本识别网络的训练装置1100还包括组合单元1105,用于通过特征更新模块,将预设字符的特征与第一特征进行组合,以生成更新后的第一特征;识别单元1102,具体用于通过识别模块,根据更新后的第一特征和第一预测字符的语义特征执行识别操作,以生成与待识别图像中的第一字符对应的识别结果。
在一种可能的设计中,组合单元1105,具体用于通过特征更新模块,根据预设字符的初始特征执行自注意力编码操作,得到预设字符的更新后特征;通过特征更新模块,根据第一特征和预设字符的更新后特征执行自注意力编码操作,以生成更新后的第一特征。
在一种可能的设计中,在文本识别网络执行识别操作的粒度为字符的情况下,一个第一字符中包括至少一个字符,文本识别网络执行一次识别操作输出的一个识别结果中包括一个字符;在文本识别网络执行识别操作的粒度为词语的情况下,一个第一字符中包括至少一个词语,文本识别网络执行一次识别操作输出的一个识别结果为包括一个或多个字符的词语。
需要说明的是,文本识别网络的训练装置1100中各模块/单元之间的信息交互、执行过程等内容,与本申请中图7对应的各个方法实施例基于同一构思,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。
本申请实施例还提供了一种执行设备,请参阅图13,图13为本申请实施例提供的执行设备的一种结构示意图,其中,执行设备1300上可以部署有图9或图10对应实施例中所描述的文本识别网络900,用于实现图3至图6对应实施例中执行设备的功能。具体的,执行设备1300包括:接收器1301、发射器1302、处理器1303和存储器1304(其中执行设备1300中的处理器1303的数量可以为一个或多个,图13中以一个处理器为例),其中,处理器1303可以包括应用处理器13031和通信处理器13032。在本申请的一些实施例中,接收器1301、发射器1302、处理器1303和存储器1304可通过总线或其它方式连接。
存储器1304可以包括只读存储器和随机存取存储器,并向处理器1303提供指令和数据。存储器1304的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1304存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1303控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1303中,或者由处理器1303实现。处理器1303可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1303中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1303可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器1303可以实现或者执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1304,处理器1303读取存储器1304中的信息,结合其硬件完成上述方法的步骤。
接收器1301可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1302可用于通过第一接口输出数字或字符信息;发射器1302还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1302还可以包括显示屏等显示设备。
本申请实施例中,应用处理器13031,用于执行图3至图6对应实施例中执行设备的功能。需要说明的是,对于应用处理器13031执行图3至图6对应实施例中执行设备的功能的具体实现方式以及带来的有益效果,均可以参考图3至图6对应的各个方法实施例中的叙述,此处不再一一赘述。
本申请实施例还提供了一种训练设备,请参阅图14,图14为本申请实施例提供的训练设备的一种结构示意图,训练设备1400上可以部署有图11或12对应实施例中所描述的文本识别网络的训练装置1100,用于实现图7对应的训练设备的功能。具体的,训练设备1400由一个或多个服务器实现,训练设备1400可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1422(例如,一个或一个以上处理器)和存储器1432,一个或一个以上存储应用程序1442或数据1444的存储介质1430(例如一个或一个以上海量存储设备)。其中,存储器1432和存储介质1430可以是短暂存储或持久存储。存储在存储介质1430的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器1422可以设置为与存储介质1430通信,在训练设备1400上执行存储介质1430中的一系列指令操作。
训练设备1400还可以包括一个或一个以上电源1426,一个或一个以上有线或无线网络接口1450,一个或一个以上输入输出接口1458,和/或,一个或一个以上操作系统1441,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
本申请实施例中,中央处理器1422,用于实现图7对应实施例中的训练设备的功能。需要说明的是,对于中央处理器1422执行图7对应实施例中训练设备的功能的具体实现方式以及带来的有益效果,均可以参考图7对应的各个方法实施例中的叙述,此处不再一一赘述。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有程序,当其在计算机上运行时,使得计算机执行如上述图3至图6对应实施例中执行设备所执行的步骤,或者,执行如上述图7对应实施例中训练设备所执行的步骤。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如上述图3至图6对应实施例中执行设备所执行的步骤,或者,执行如上述图7对应实施例中训练设备所执行的步骤。
本申请实施例中还提供一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行如上述图3至图6对应实施例中执行设备所执行的步骤,或者,执行如上述图7对应实施例中训练设备所执行的步骤。
本申请实施例提供的执行设备或训练设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使芯片执行上述图3至图6对应实施例中执行设备所执行的步骤,或者,执行如上述图7对应实施例中训练设备所执行的步骤。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图15,图15为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 150,NPU 150作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1503,通过控制器1504控制运算电路1503提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1503内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1503是二维脉动阵列。运算电路1503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路1503从权重存储器1502中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路1503从输入存储器1501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1508中。
统一存储器1506用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1505被搬运到权重存储器1502中。输入数据也通过DMAC被搬运到统一存储器1506中。
BIU即总线接口单元(Bus Interface Unit)1510,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)1509的交互。
总线接口单元1510(Bus Interface Unit,简称BIU),用于取指存储器1509从外部存储器获取指令,还用于存储单元访问控制器1505从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1506,或将权重数据搬运到权重存储器1502中,或将输入数据搬运到输入存储器1501中。
向量计算单元1507包括多个运算处理单元,在需要的情况下,对运算电路1503的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1507能将经处理的输出的向量存储到统一存储器1506。例如,向量计算单元1507可以将线性函数和/或非线性函数应用到运算电路1503的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1507生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1503的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1504连接的取指存储器(instruction fetch buffer)1509,用于存储控制器1504使用的指令;
统一存储器1506,输入存储器1501,权重存储器1502以及取指存储器1509均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,循环神经网络中各层的运算可以由运算电路1503或向量计算单元1507执行。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述第一方面方法的程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (26)

  1. 一种文本识别网络,其特征在于,所述文本识别网络为用于识别图像中字符的神经网络,所述文本识别网络包括图像特征提取模块、文本特征获取模块和识别模块;
    所述图像特征提取模块,用于获取待识别图像,并对所述待识别图像进行特征提取,以生成与所述待识别图像中的第一字符对应的第一特征,其中,所述第一字符为所述待识别图像中需要进行识别的字符;
    所述文本特征获取模块,用于获取与所述待识别图像中的第一字符对应的预设字符,并根据所述预设字符进行文本预测,以生成第一预测字符的语义特征;
    所述识别模块,用于根据所述第一特征和所述第一预测字符的语义特征执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果。
  2. 根据权利要求1所述的网络,其特征在于,
    所述文本特征获取模块,具体用于在初次对所述待识别图像执行识别操作的情况下,获取与所述待识别图像中的第一字符对应的预设字符,并根据所述预设字符进行文本预测,以生成所述第一预测字符的语义特征;
    所述文本特征获取模块,还用于在已对所述第一字符中至少一个字符执行过识别操作的情况下,将与所述第一字符中的已识别字符对应的识别结果确定为第二字符,并根据所述第二字符进行文本预测,以生成与所述第二字符对应的第二预测字符的语义特征。
  3. 根据权利要求2所述的网络,其特征在于,
    所述识别模块,还用于根据所述第一特征和所述第二预测字符的语义特征执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果。
  4. 根据权利要求1至3任一项所述的网络,其特征在于,所述文本特征获取模块包括:
    第一生成子模块,用于对所述预设字符进行向量化处理,以生成所述预设字符的字符编码,并根据所述预设字符在所述待识别图像中的第一字符的位置,生成所述预设字符的位置编码;
    组合子模块,用于将所述预设字符的字符编码和所述预设字符的位置编码进行组合,得到所述预设字符的初始特征,并根据所述预设字符的初始特征执行自注意力编码操作和自注意力解码操作,以生成所述第一预测字符的语义特征。
  5. 根据权利要求1至3任一项所述的网络,其特征在于,所述识别模块包括:
    计算子模块,用于计算所述第一特征和所述第一预测字符的语义特征之间的相似度;
    第二生成子模块,用于根据所述第一特征、所述第一预测字符的语义特征和所述相似度,生成第二特征和第三特征,其中,所述第二特征为在所述第一特征的基础上组合了所述第一预测字符的语义特征,所述第三特征为在所述第一预测字符的语义特征的基础上组合了所述第一特征;
    所述第二生成子模块,还用于根据所述第二特征和所述第三特征执行识别操作,以生成识别结果。
  6. 根据权利要求1至3任一项所述的网络,其特征在于,所述文本识别网络还包括特征更新模块,所述特征更新模块,用于:
    将所述预设字符的特征与所述第一特征进行组合,以生成更新后的第一特征;
    所述识别模块,具体用于根据所述更新后的第一特征和所述第一预测字符的语义特征执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果。
  7. 根据权利要求6所述的网络,其特征在于,
    所述特征更新模块,具体用于根据所述预设字符的初始特征执行自注意力编码操作,得到所述预设字符的更新后特征,并根据所述第一特征和所述预设字符的更新后特征执行自注意力编码操作,以生成所述更新后的第一特征。
  8. 根据权利要求1至3任一项所述的网络,其特征在于,
    在所述文本识别网络执行识别操作的粒度为字符的情况下,一个第一字符中包括至少一个字符,所述文本识别网络执行一次识别操作输出的一个识别结果中包括一个字符;
    在所述文本识别网络执行识别操作的粒度为词语的情况下,一个第一字符中包括至少一个词语,所述文本识别网络执行一次识别操作输出的一个识别结果为包括一个或多个字符的词语。
  9. 一种文本识别网络的训练方法,其特征在于,所述文本识别网络为用于识别图像中字符的神经网络,所述文本识别网络包括图像特征提取模块、文本特征获取模块和识别模块,所述方法包括:
    将待识别图像输入至所述图像特征提取模块,对所述待识别图像进行特征提取,以生成与所述待识别图像中的第一字符对应的第一特征,其中,所述第一字符为所述待识别图像中需要进行识别的字符;
    将与所述待识别图像中的第一字符对应的预设字符输入至所述文本特征获取模块,并根据所述预设字符进行文本预测,以生成第一预测字符的语义特征;
    根据所述第一特征和所述第一预测字符的语义特征通过所述识别模块执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果;
    根据与所述待识别图像中的第一字符对应的正确结果、识别结果和损失函数,对所述文本识别网络进行训练,所述损失函数指示与所述待识别图像中的第一字符对应的正确结果和与所述待识别图像中的第一字符对应的识别结果之间的相似度。
  10. 根据权利要求9所述的方法,其特征在于,
    所述文本特征获取模块,具体用于在初次对所述待识别图像执行识别操作的情况下,获取与所述待识别图像中的第一字符对应的预设字符,并根据所述预设字符进行文本预测,以生成所述第一预测字符的语义特征;
    所述文本特征获取模块,还用于在已对所述第一字符中至少一个字符执行过识别操作的情况下,将与所述第一字符中的已识别字符对应的识别结果确定为第二字符,并根据所述第二字符进行文本预测,以生成与所述第二字符对应的第二预测字符的语义特征。
  11. 根据权利要求10所述的方法,其特征在于,
    所述识别模块,还用于根据所述第一特征和所述第二预测字符的语义特征执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果。
  12. 一种文本识别方法,其特征在于,所述方法包括:
    将待识别图像输入至图像特征提取模块,对所述待识别图像进行特征提取,以生成与所述待识别图像中的第一字符对应的第一特征,其中,所述第一字符为所述待识别图像中需要进行识别的字符;
    将与所述待识别图像中的第一字符对应的预设字符输入至文本特征获取模块,并根据所述预设字符进行文本预测,以生成第一预测字符的语义特征;
    根据所述第一特征和所述第一预测字符的语义特征,通过识别模块执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果;
    其中,所述图像特征提取模块、所述文本特征获取模块和所述识别模块归属于同一文本识别网络。
  13. 根据权利要求12所述的方法,其特征在于,所述将与所述待识别图像中的第一字符对应的预设字符输入至文本特征获取模块,包括:
    在初次对所述待识别图像执行识别操作的情况下,将与所述待识别图像中的第一字符对应的预设字符输入至文本特征获取模块;
    所述方法还包括:
    在已对所述第一字符中至少一个字符执行过识别操作的情况下,通过所述文本特征获取模块将与所述第一字符中的已识别字符对应的识别结果确定为第二字符,并根据所述第二字符进行文本预测,以生成与所述第二字符对应的第二预测字符的语义特征。
  14. 根据权利要求13所述的方法,其特征在于,所述方法还包括:
    根据所述第一特征和所述第二预测字符的语义特征,通过所述识别模块执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果。
  15. 根据权利要求12至14任一项所述的方法,其特征在于,所述将与所述待识别图像中的第一字符对应的预设字符输入至文本特征获取模块,并根据所述预设字符进行文本预测,以生成第一预测字符的语义特征,包括:
    通过所述文本特征获取模块对所述预设字符进行向量化处理,以生成所述预设字符的字符编码,并根据所述预设字符在所述待识别图像中的第一字符的位置,生成所述预设字符的位置编码;
    通过所述文本特征获取模块将所述预设字符的字符编码和所述预设字符的位置编码进行组合,得到所述预设字符的初始特征,并根据所述预设字符的初始特征执行自注意力编码操作和自注意力解码操作,以生成第一预测字符的语义特征。
  16. 根据权利要求12至14任一项所述的方法,其特征在于,所述通过识别模块,根据所述第一特征和所述第一预测字符的语义特征执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果,包括:
    通过所述识别模块计算所述第一特征和所述第一预测字符的语义特征之间的相似度;
    根据所述第一特征、所述第一预测字符的语义特征和所述相似度,通过所述识别模块生成第二特征和第三特征,其中,所述第二特征为在所述第一特征的基础上组合了所述第一预测字符的语义特征,所述第三特征为在所述第一预测字符的语义特征的基础上组合了所述第一特征;
    通过所述识别模块根据所述第二特征和所述第三特征执行识别操作,以生成识别结果。
  17. 根据权利要求12至14任一项所述的方法,其特征在于,所述文本识别网络还包括特征更新模块,所述方法还包括:
    通过所述特征更新模块,将所述预设字符的特征与所述第一特征进行组合,以生成更新后的第一特征;
    所述通过识别模块,根据所述第一特征和所述第一预测字符的语义特征执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果,包括:
    通过所述识别模块,根据所述更新后的第一特征和所述第一预测字符的语义特征执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果。
  18. 根据权利要求17所述的方法,其特征在于,所述通过所述特征更新模块,将所述预设字符的特征与所述第一特征进行组合,以生成更新后的第一特征,包括:
    通过所述特征更新模块,根据所述预设字符的初始特征执行自注意力编码操作,得到所述预设字符的更新后特征;
    通过所述特征更新模块,根据所述第一特征和所述预设字符的更新后特征执行自注意力编码操作,以生成所述更新后的第一特征。
  19. 根据权利要求12至14任一项所述的方法,其特征在于,
    在所述文本识别网络执行识别操作的粒度为字符的情况下,一个第一字符中包括至少一个字符,所述文本识别网络执行一次识别操作输出的一个识别结果中包括一个字符;
    在所述文本识别网络执行识别操作的粒度为词语的情况下,一个第一字符中包括至少一个词语,所述文本识别网络执行一次识别操作输出的一个识别结果为包括一个或多个字符的词语。
  20. 一种文本识别网络的训练装置,其特征在于,所述文本识别网络为用于识别图像中字符的神经网络,所述文本识别网络包括图像特征提取模块、文本特征获取模块和识别模块,所述装置包括:
    输入单元,用于将待识别图像输入至所述图像特征提取模块,对所述待识别图像进行特征提取,以生成与所述待识别图像中的第一字符对应的第一特征,其中,所述第一字符为所述待识别图像中需要进行识别的字符;
    所述输入单元,还用于将与所述待识别图像中的第一字符对应的预设字符输入至所述文本特征获取模块,并根据所述预设字符进行文本预测,以生成第一预测字符的语义特征;
    识别单元,用于根据所述第一特征和所述第一预测字符的语义特征通过所述识别模块执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果;
    训练单元,用于根据与所述待识别图像中的第一字符对应的正确结果、识别结果和损失函数,对所述文本识别网络进行训练,所述损失函数指示与所述待识别图像中的第一字符对应的正确结果和与所述待识别图像中的第一字符对应的识别结果之间的相似度。
  21. 根据权利要求20所述的装置,其特征在于,
    所述输入单元,具体用于在初次对所述待识别图像执行识别操作的情况下,将与所述待识别图像中的第一字符对应的预设字符输入至文本特征获取模块;
    所述输入单元,还用于在已对所述第一字符中至少一个字符执行过识别操作的情况下,通过所述文本特征获取模块将与所述第一字符中的已识别字符对应的识别结果确定为第二字符,并根据所述第二字符进行文本预测,以生成与所述第二字符对应的第二预测字符的语义特征。
  22. 根据权利要求21所述的装置,其特征在于,
    所述识别单元,还用于根据所述第一特征和所述第二预测字符的语义特征,通过所述识别模块执行识别操作,以生成与所述待识别图像中的第一字符对应的识别结果。
  23. 一种执行设备,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序指令,当所述存储器存储的程序指令被所述处理器执行时实现权利要求1至8中任一项所述文本识别网络执行的步骤。
  24. 一种训练设备,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序指令,当所述存储器存储的程序指令被所述处理器执行时实现权利要求9至11中任一项所述的方法。
  25. 一种计算机可读存储介质,其特征在于,包括程序,当其在计算机上运行时,使得计算机执行如权利要求1至8中任一项所述文本识别网络执行的步骤,或者,使得计算机执行如权利要求9至11中任一项所述的方法。
  26. 一种电路系统,其特征在于,所述电路系统包括处理电路,所述处理电路配置为执行如权利要求1至8中任一项所述文本识别网络执行的步骤,或者,所述处理电路配置为执行如权利要求9至11中任一项所述的方法。
PCT/CN2021/106397 2020-07-24 2021-07-15 一种文本识别网络、神经网络训练的方法以及相关设备 WO2022017245A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010723541.2 2020-07-24
CN202010723541.2A CN112016543B (zh) 2020-07-24 2020-07-24 一种文本识别网络、神经网络训练的方法以及相关设备

Publications (1)

Publication Number Publication Date
WO2022017245A1 true WO2022017245A1 (zh) 2022-01-27

Family

ID=73499014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106397 WO2022017245A1 (zh) 2020-07-24 2021-07-15 一种文本识别网络、神经网络训练的方法以及相关设备

Country Status (2)

Country Link
CN (1) CN112016543B (zh)
WO (1) WO2022017245A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140802A (zh) * 2022-01-29 2022-03-04 北京易真学思教育科技有限公司 一种文本识别方法、装置、电子设备和存储介质
CN114495106A (zh) * 2022-04-18 2022-05-13 电子科技大学 一种应用于dfb激光器芯片的深度学习mocr方法
CN114661904A (zh) * 2022-03-10 2022-06-24 北京百度网讯科技有限公司 文档处理模型的训练方法、装置、设备、存储介质及程序
CN115565186A (zh) * 2022-09-26 2023-01-03 北京百度网讯科技有限公司 文字识别模型的训练方法、装置、电子设备和存储介质
CN116071759A (zh) * 2023-03-06 2023-05-05 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) 一种融合gpt2预训练大模型的光学字符识别方法
CN116311271A (zh) * 2023-03-22 2023-06-23 北京百度网讯科技有限公司 文本图像的处理方法及装置
WO2023165111A1 (zh) * 2022-03-01 2023-09-07 达而观信息科技(上海)有限公司 客服热线中用户意图轨迹识别的方法及系统

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016543B (zh) * 2020-07-24 2024-09-20 华为技术有限公司 一种文本识别网络、神经网络训练的方法以及相关设备
CN113011246A (zh) * 2021-01-29 2021-06-22 招商银行股份有限公司 票据分类方法、装置、设备及存储介质
CN112819684B (zh) * 2021-03-02 2022-07-26 成都视海芯图微电子有限公司 一种面向图像文本识别的加速装置
CN112801228B (zh) * 2021-04-06 2021-08-06 北京世纪好未来教育科技有限公司 一种文本识别方法、电子设备及其存储介质
CN113762050B (zh) * 2021-05-12 2024-05-24 腾讯云计算(北京)有限责任公司 图像数据处理方法、装置、设备以及介质
CN113610081A (zh) * 2021-08-12 2021-11-05 北京有竹居网络技术有限公司 一种字符识别方法及其相关设备
CN113657390B (zh) * 2021-08-13 2022-08-12 北京百度网讯科技有限公司 文本检测模型的训练方法和检测文本方法、装置和设备
CN113837965B (zh) * 2021-09-26 2024-06-18 北京百度网讯科技有限公司 图像清晰度识别方法、装置、电子设备及存储介质
CN115035538B (zh) * 2022-03-22 2023-04-07 北京百度网讯科技有限公司 文本识别模型的训练方法、文本识别方法及装置
CN114743020B (zh) * 2022-04-02 2024-05-14 华南理工大学 一种结合标签语义嵌入和注意力融合的食物识别方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180217976A1 (en) * 2017-01-30 2018-08-02 International Business Machines Corporation Text prediction using captured image from an image capture device
CN108898137A (zh) * 2018-05-25 2018-11-27 黄凯 一种基于深度神经网络的自然图像字符识别方法及系统
CN109117846A (zh) * 2018-08-22 2019-01-01 北京旷视科技有限公司 一种图像处理方法、装置、电子设备和计算机可读介质
CN109389091A (zh) * 2018-10-22 2019-02-26 重庆邮电大学 基于神经网络和注意力机制结合的文字识别系统及方法
CN111126410A (zh) * 2019-12-31 2020-05-08 讯飞智元信息科技有限公司 字符识别方法、装置、设备及可读存储介质
CN112016543A (zh) * 2020-07-24 2020-12-01 华为技术有限公司 一种文本识别网络、神经网络训练的方法以及相关设备

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI685695B (zh) * 2018-09-20 2020-02-21 友達光電股份有限公司 顯示面板


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140802A (zh) * 2022-01-29 2022-03-04 北京易真学思教育科技有限公司 Text recognition method, apparatus, electronic device, and storage medium
WO2023165111A1 (zh) * 2022-03-01 2023-09-07 达而观信息科技(上海)有限公司 Method and system for recognizing user intent trajectories in a customer service hotline
CN114661904A (zh) * 2022-03-10 2022-06-24 北京百度网讯科技有限公司 Training method, apparatus, device, storage medium, and program for a document processing model
CN114495106A (zh) * 2022-04-18 2022-05-13 电子科技大学 Deep-learning MOCR method applied to DFB laser chips
CN115565186A (zh) * 2022-09-26 2023-01-03 北京百度网讯科技有限公司 Training method and apparatus for character recognition model, electronic device, and storage medium
CN115565186B (zh) * 2022-09-26 2023-09-22 北京百度网讯科技有限公司 Training method and apparatus for character recognition model, electronic device, and storage medium
CN116071759A (zh) * 2023-03-06 2023-05-05 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method incorporating a GPT-2 pretrained large model
CN116071759B (zh) * 2023-03-06 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method incorporating a GPT-2 pretrained large model
CN116311271A (zh) * 2023-03-22 2023-06-23 北京百度网讯科技有限公司 Text image processing method and apparatus
CN116311271B (zh) * 2023-03-22 2023-12-26 北京百度网讯科技有限公司 Text image processing method and apparatus

Also Published As

Publication number Publication date
CN112016543B (zh) 2024-09-20
CN112016543A (zh) 2020-12-01

Similar Documents

Publication Publication Date Title
WO2022017245A1 (zh) Text recognition network, neural network training method, and related device
CN111797893B (zh) Neural network training method, image classification system, and related device
CN110020620B (zh) Face recognition method, apparatus, and device under large pose variation
WO2022116856A1 (zh) Model structure, model training method, image enhancement method, and device
CN111401406B (zh) Neural network training method, video frame processing method, and related device
WO2021238333A1 (zh) Text processing network, neural network training method, and related device
WO2021218471A1 (zh) Neural network for image processing and related device
Zhou et al. Image classification using biomimetic pattern recognition with convolutional neural networks features
CN111275107A (zh) Multi-label scene image classification method and apparatus based on transfer learning
CN110222718B (zh) Image processing method and apparatus
WO2023179482A1 (zh) Image processing method, neural network training method, and related device
US20240232575A1 (en) Neural network obtaining method, data processing method, and related device
CN113011568B (zh) Model training method, data processing method, and device
WO2023231954A1 (zh) Data denoising method and related device
WO2022179606A1 (zh) Image processing method and related apparatus
WO2023231753A1 (zh) Neural network training method, data processing method, and device
WO2022156475A1 (zh) Training method for neural network model, data processing method, and apparatus
JP2010157118A (ja) Pattern identification apparatus, learning method for pattern identification apparatus, and computer program
Bose et al. Light weight structure texture feature analysis for character recognition using progressive stochastic learning algorithm
CN111950700A (zh) Neural network optimization method and related device
Bezak Building recognition system based on deep learning
Alphonse et al. Novel directional patterns and a Generalized Supervised Dimension Reduction System (GSDRS) for facial emotion recognition
CN116863194A (zh) Foot ulcer image classification method, system, device, and medium
WO2021218725A1 (zh) Image data processing method and related apparatus
CN114913339A (zh) Training method and apparatus for feature map extraction model

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21845402

Country of ref document: EP

Kind code of ref document: A1