
CN114998909A - Image character language identification method and system - Google Patents

Image character language identification method and system

Info

Publication number
CN114998909A
CN114998909A (application number CN202210640881.8A)
Authority
CN
China
Prior art keywords
data set
language
image
training data
ocr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210640881.8A
Other languages
Chinese (zh)
Inventor
姜孟源
陈振标
杜晓祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunshang Technology Co., Ltd.
Original Assignee
Beijing Yunshang Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunshang Technology Co., Ltd.
Priority to CN202210640881.8A
Publication of CN114998909A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

An image character language identification method and system. The method simulates image characters as they appear in real scenes, combining background pictures, dictionary libraries and font libraries of various language styles to artificially synthesize image characters with labels, forming a first training data set; collects image characters from real scenes and manually classifies and labels them, forming a second training data set; constructs a neural network CRNN for OCR language identification and performs preliminary training with the first training data set to obtain an OCR character recognition model; fine-tunes the OCR character recognition model with the first training data set and the second training data set to obtain an OCR language identification classification model; and performs model inference with the fine-tuned OCR language identification classification model. The invention achieves a better fitting effect, reduces missed detections and false positives, and improves model performance.

Description

Image character language identification method and system
Technical Field
The invention relates to the technical field of OCR (optical character recognition) text processing, and in particular to a method and system for identifying the language of image characters.
Background
At present, in Internet scenarios, and particularly in scenarios such as overseas cooperation, cross-border trade and international advertising, characters of multiple languages often appear on the same picture at the same time, and a preset language selection cannot cover all scenario requirements, so language identification is necessary before character recognition.
In the prior art, multilingual OCR language identification schemes generally use either a general image recognition model or a multi-label image recognition model: semantic features of the whole-sentence image characters are extracted directly and then classified by a feature classifier. These conventional approaches consider only image-level semantic features and do not model the contextual relations within the character sequence, so they cannot distinguish languages whose glyphs look similar at the image level. Moreover, when more than two languages appear in the whole sentence of picture text and one language accounts for more than 80% of it, the multi-label image recognition method often misses detections or produces false positives. In summary, a new technical solution for identifying the languages of image characters is needed.
Disclosure of Invention
Therefore, the invention provides an image character language identification method and system to address the conventional schemes' susceptibility to missed detections and false positives and their poor performance.
In order to achieve the above purpose, the invention provides the following technical scheme. An image character language identification method comprises the following steps:
simulating image characters in a real scene, combining a background picture, a dictionary library and font libraries of various language styles in the simulation process, and artificially synthesizing image characters with labels to form a first training data set;
collecting image characters in a real scene, and manually classifying and labeling the collected image characters to form a second training data set;
constructing a neural network CRNN for OCR language identification, and performing preliminary training on the constructed neural network CRNN with the first training data set to obtain an OCR character recognition model;
performing fine-tuning on the OCR character recognition model with the first training data set and the second training data set to obtain an OCR language identification classification model;
and performing model inference with the fine-tuned OCR language identification classification model.
As a preferred scheme of the image character language identification method, the image characters in a real scene are simulated as follows:
a picture data set without characters is collected from the Internet and cropped to serve as background pictures for the artificially synthesized image characters;
character dictionaries or word lexicons of various languages are collected from the Internet, and font libraries of the corresponding languages are obtained from the languages' official websites to synthesize image characters of different styles.
As a preferred scheme of the image character language identification method, the character dictionaries and word lexicons are numbered so that every character corresponds to a unique index, and the labels of the artificially synthesized image characters are stored in a specified format;
each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used when fine-tuning the language identification classification model.
As a preferred scheme of the image character language identification method, the neural network CRNN comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of bidirectional LSTM are connected after SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
As a preferred scheme of the image character language identification method, in the fine-tuning process of the OCR character recognition model, the first training data set and the second training data set are mixed into one training data set according to a preset distribution proportion; language-level labeling information is used, the backbone part of SE_ResNeXt50_32x4d and the bidirectional LSTM layers are frozen, and the character recognition classifier is replaced by a language identification classifier for fine-tuning.
The invention also provides an image character language identification system, comprising:
a first data set generation module, used for simulating image characters in a real scene, combining a background picture, a dictionary library and font libraries of various language styles in the simulation process, and artificially synthesizing image characters with labels to form a first training data set;
a second data set generation module, used for collecting image characters in a real scene and manually classifying and labeling the collected image characters to form a second training data set;
a character recognition processing module, used for constructing a neural network CRNN for OCR language identification and performing preliminary training on the constructed neural network CRNN with the first training data set to obtain an OCR character recognition model;
a language identification classification module, used for performing fine-tuning on the OCR character recognition model with the first training data set and the second training data set to obtain an OCR language identification classification model;
and a model inference module, used for performing model inference with the fine-tuned OCR language identification classification model.
As a preferred scheme of the image character language identification system, the first data set generation module simulates image characters in a real scene as follows:
a picture data set without characters is collected from the Internet and cropped to serve as background pictures for the artificially synthesized image characters;
character dictionaries or word lexicons of various languages are collected from the Internet, and font libraries of the corresponding languages are obtained from the languages' official websites to synthesize image characters of different styles.
As a preferred scheme of the image character language identification system, in the first data set generation module, the character dictionaries and word lexicons are numbered so that every character corresponds to a unique index, and the labels of the artificially synthesized image characters are stored in a specified format;
in the second data set generation module, each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used when fine-tuning the language identification classification model.
As a preferred scheme of the image character language identification system, the neural network CRNN in the character recognition processing module comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of bidirectional LSTM are connected after SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
As a preferred scheme of the image character language identification system, in the process of fine-tuning the OCR character recognition model, the language identification classification module mixes the first training data set and the second training data set into one training data set according to a preset distribution proportion, uses language-level labeling information, freezes the backbone part of SE_ResNeXt50_32x4d and the bidirectional LSTM layers, and replaces the character recognition classifier with a language identification classifier for fine-tuning.
The invention has the following advantages: image characters in a real scene are simulated, combining a background picture, a dictionary library and font libraries of various language styles in the simulation process to artificially synthesize image characters with labels, forming a first training data set; image characters in a real scene are collected and manually classified and labeled, forming a second training data set; a neural network CRNN for OCR language identification is constructed and preliminarily trained with the first training data set to obtain an OCR character recognition model; the OCR character recognition model is fine-tuned with the first training data set and the second training data set to obtain an OCR language identification classification model; and model inference is performed with the fine-tuned OCR language identification classification model. The invention achieves a better fitting effect, reduces missed detections and false positives, and improves model performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be apparent that the drawings in the following description are merely exemplary, and that those of ordinary skill in the art can derive other drawings from them without inventive effort.
The structures, ratios, sizes, and the like shown in this specification are used only to match the content disclosed in the specification so that those skilled in the art can understand and read it; they do not limit the conditions under which the invention can be implemented and therefore carry no essential technical significance. Any structural modification, change of ratio, or adjustment of size that does not affect the effects and purposes of the invention shall still fall within the scope of the invention.
Fig. 1 is a schematic flow chart of a method for recognizing language types of image texts according to embodiment 1 of the present invention;
fig. 2 is a flow chart of synthesizing multi-lingual text images of different styles in the method for recognizing image languages according to embodiment 1 of the present invention;
fig. 3 is a storage format of a tag of an artificially synthesized text image in the method for recognizing language types of image texts according to embodiment 1 of the present invention;
fig. 4 is a detailed diagram of the block model of SE_ResNeXt50_32x4d in the image character language identification method according to embodiment 1 of the present invention;
fig. 5 details the CRNN overall framework and the SE_ResNeXt50_32x4d modification in the image character language identification method according to embodiment 1 of the present invention;
fig. 6 is a CRNN model inference configuration in the image and text language identification method provided in embodiment 1 of the present invention;
fig. 7 is a schematic diagram of an image and language identification system according to embodiment 2 of the present invention.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and effects of the invention will become readily apparent to those skilled in the art from this disclosure. Obviously, the described embodiments are merely some rather than all of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Example 1
Referring to fig. 1, fig. 2 and fig. 3, embodiment 1 of the present invention provides a method for recognizing language types of image texts, including:
s1, simulating image characters in a real scene, and artificially synthesizing the image characters with labels to form a first training data set in the simulation process by combining a background picture, a dictionary library and various language style font libraries;
s2, collecting image characters in a real scene, and manually classifying and labeling the collected image characters to form a second training data set;
s3, constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
s4, performing fine-tuning on the OCR character recognition model by adopting the first training data set and the second training data set to obtain an OCR language recognition classification model;
and S5, performing model reasoning by using the OCR language recognition and classification model after fine-tuning.
In this embodiment, step S1 simulates image characters in a real scene as follows:
a picture data set without characters is collected from the Internet and cropped to serve as background pictures for the artificially synthesized image characters;
character dictionaries or word lexicons of various languages are collected from the Internet, and font libraries of the corresponding languages are obtained from the languages' official websites to synthesize image characters of different styles.
Specifically, in the process of simulating image characters in a real scene and artificially synthesizing training data to form the first training data set, a picture data set without characters is first collected from the Internet and cropped (to 32 × 352 in this example) to serve as background pictures for the artificially synthesized image characters. Character dictionaries or word lexicons of various languages are then collected from the Internet; this example covers the Latin family, Simplified Chinese, Traditional Chinese, Japanese, Korean, Devanagari, Tamil, Thai and Arabic. The Chinese, Japanese and Korean dictionaries are generated with the character as the unit, while the Latin-family, Devanagari, Tamil, Thai and Arabic lexicons are generated with the word as the unit; in addition, attention is paid to the right-to-left writing order of Arabic. Next, a font library (font files) for each language is obtained from that language's official website or other sources to synthesize image characters of different styles. Image characters are thus synthesized from the picture backgrounds, the dictionary libraries and the font files of the various languages.
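As an illustration of this synthesis step, the sketch below renders one line of dictionary text onto a randomly cropped background using Pillow. It is a minimal sketch only: the file paths, the sample text and the label fields are assumptions for illustration, and the patent does not disclose its actual synthesis tooling.

```python
# Minimal sketch of synthesizing one labeled 32x352 image-character sample.
# Paths, text and label fields are placeholder assumptions.
import random
from PIL import Image, ImageDraw, ImageFont

def synthesize_sample(background_path, font_path, text, language,
                      size=(352, 32)):
    bg = Image.open(background_path).convert("RGB")
    # Randomly crop the background to the target size (W=352, H=32).
    x = random.randint(0, max(0, bg.width - size[0]))
    y = random.randint(0, max(0, bg.height - size[1]))
    img = bg.crop((x, y, x + size[0], y + size[1]))
    # Render the dictionary text with the language's font file.
    font = ImageFont.truetype(font_path, size=24)
    ImageDraw.Draw(img).text((4, 2), text, font=font, fill=(0, 0, 0))
    # Each sample carries both a language-level and a character-level label.
    return img, {"language": language, "text": text}

img, label = synthesize_sample("bg.jpg", "NotoSansSC-Regular.otf",
                               "云上科技", "zh-Hans")
```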
In this embodiment, the character dictionaries and word lexicons are numbered so that every character corresponds to a unique index, and the labels of the artificially synthesized image characters are stored in a specified format;
each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used when fine-tuning the language identification classification model.
With reference to fig. 3, specifically, the character dictionaries (after deduplication) and word lexicons are numbered starting from 0, ensuring that every character corresponds to a unique index, and the labels of all artificially synthesized image characters are stored in json format. Each training sample therefore contains not only language-level labeling information but also character-level labeling information for every character in the picture; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model of step S3, and the language-level labeling information is used for fine-tuning the OCR language identification classification model of step S4.
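Since fig. 3 itself is not reproduced here, the snippet below is only a plausible reconstruction of such a per-sample json label carrying both annotation levels; every field name is an assumption, not the patent's actual format.

```python
# Plausible reconstruction of one sample's json label; all field names
# are assumptions, since the patent only shows the format in fig. 3.
import json

label = {
    "image": "synth_000001.jpg",
    "language": "ja",                           # language-level annotation
    "text": "こんにちは",                        # transcription
    "char_indices": [102, 931, 2044, 77, 530],  # unique dictionary indexes
}
print(json.dumps(label, ensure_ascii=False, indent=2))
```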
In this embodiment, the neural network CRNN comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of bidirectional LSTM are connected after SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
Specifically, the CRNN neural network constructed in step S3 comprises a convolutional neural network (CNN) and a recurrent neural network (RNN). The CNN structure uses the backbone part of SE_ResNeXt50_32x4d for image semantic feature extraction, followed by 2 layers of bidirectional LSTM for extracting picture character sequence features; finally, multilingual character recognition model training is performed on the synthetic training data set based on CTC loss.
In the present example, the SE_ResNeXt50_32x4d network of the CNN backbone (cardinality 32, dimensionality reduction 4) is modified so that when layer2, layer3 and layer4 each downsample for the first time, the resolution in the X-axis direction is kept unchanged while the Y-axis keeps 2-fold downsampling, i.e., the stride is (1, 2) in (X, Y) order. The final input is 32 × 352 × channel and the backbone output is 1 × 88 × channel: the extracted feature map is 1/4 of the original in the X-axis direction and 1/32 of the original in the Y-axis direction; see fig. 4 and fig. 5. In this example, CTC loss is used to solve the mismatch between the model output length and the label length, implemented by the following formulas.
Given an input x, the probability that the LSTM outputs the label sequence l is:

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$$

where \pi \in B^{-1}(l) ranges over all paths \pi that collapse to l under the B transformation. For any single path \pi:

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}$$

where y_{\pi_t}^{t} is the probability the network assigns to label \pi_t at time step t.
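A compact PyTorch sketch of this architecture and objective follows. It is illustrative only: a tiny stand-in CNN replaces the full SE_ResNeXt50_32x4d backbone, with stride-(2, 1) convolutions (height halved, width preserved, in PyTorch's (H, W) order) mimicking the modification described above, and the 1000-class vocabulary is an arbitrary placeholder.

```python
# Illustrative sketch of the CRNN shape flow and the CTC objective.
# A tiny stand-in CNN is used instead of a full SE_ResNeXt50_32x4d.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes, channels=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),    # 32x352 -> 16x176
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 16x176 -> 8x88
            nn.Conv2d(128, channels, 3, stride=(2, 1), padding=1), nn.ReLU(),       # 8x88 -> 4x88
            nn.Conv2d(channels, channels, 3, stride=(2, 1), padding=1), nn.ReLU(),  # 4x88 -> 2x88
            nn.Conv2d(channels, channels, 3, stride=(2, 1), padding=1), nn.ReLU(),  # 2x88 -> 1x88
        )
        # 2 layers of bidirectional LSTM over the width-wise sequence.
        self.rnn = nn.LSTM(channels, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_classes)  # per-timestep character classifier

    def forward(self, x):                  # x: (B, 3, 32, 352)
        f = self.cnn(x)                    # (B, C, 1, 88): Y reduced 32x, X reduced 4x
        f = f.squeeze(2).permute(0, 2, 1)  # (B, 88, C) sequence of width steps
        out, _ = self.rnn(f)               # (B, 88, 512)
        return self.head(out)              # (B, 88, num_classes)

model = TinyCRNN(num_classes=1000)
logits = model(torch.randn(2, 3, 32, 352)).log_softmax(-1)
ctc = nn.CTCLoss(blank=0)                  # index 0 reserved for the CTC blank
targets = torch.randint(1, 1000, (2, 10))
loss = ctc(logits.permute(1, 0, 2),        # CTCLoss expects (T, B, C)
           targets,
           input_lengths=torch.full((2,), 88, dtype=torch.long),
           target_lengths=torch.full((2,), 10, dtype=torch.long))
```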
in this embodiment, in the fine-tuning process of the OCR character recognition model, the first training data set and the second training data set are mixed into one training data set according to a preset distribution ratio, the language level labeling information is adopted, the backbone part and the Bi-directional LSTM layer of SE _ resenxt 50_32x4d are frozen, and the character recognition classifier is changed into a language recognition classifier for fine tuning.
Specifically, the artificially synthesized image character training data and the real-scene image character training data are mixed at a 1:1 ratio into one training data set, using labels that are all language-level labels; the backbone part of SE_ResNeXt50_32x4d and the LSTM layers of the model are frozen, and the multilingual character recognition classifier is replaced by a language identification classifier for fine-tuning.
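A sketch of this freeze-and-swap step, reusing the TinyCRNN stand-in from the previous sketch (so the attribute names cnn, rnn and head are assumptions of that sketch, not identifiers from the patent):

```python
# Sketch of the fine-tuning step: freeze the CNN backbone and the LSTM,
# then replace the per-character classifier with a language classifier.
NUM_LANGUAGES = 9  # the nine language families listed in this embodiment

for p in model.cnn.parameters():   # freeze the backbone
    p.requires_grad = False
for p in model.rnn.parameters():   # freeze the bidirectional LSTM
    p.requires_grad = False

model.head = nn.Linear(512, NUM_LANGUAGES)  # new trainable language head

optimizer = torch.optim.Adam(                # learning rate is an arbitrary choice
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```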
Referring to fig. 6, to allow batch processing during inference, the input picture size is fixed at 32 × 352. If the width of the input picture is less than 352, the original image is copied repeatedly until the width of 352 is filled; if the width is greater than 352, subgraphs of width 352 are cut out toward the two sides, taking the center of the picture as the origin, and used for prediction.
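This width normalization can be sketched as below. The tiling-by-copying for narrow images follows the description above; for wide images the sketch takes a single 352-wide crop centered on the picture, a simplification of the center-outward subgraph cutting of fig. 6.

```python
# Sketch of inference-time width normalization to a fixed 32x352 input:
# narrow images are tiled with copies of themselves; wide images are
# cropped around the center (a simplification of the patent's scheme).
from PIL import Image

def normalize_width(img: Image.Image, target_w=352, target_h=32):
    # Scale so the height is exactly 32, preserving aspect ratio.
    new_w = max(1, round(img.width * target_h / img.height))
    img = img.resize((new_w, target_h))
    if img.width < target_w:
        canvas = Image.new("RGB", (target_w, target_h))
        x = 0
        while x < target_w:          # tile copies until width 352 is filled
            canvas.paste(img, (x, 0))
            x += img.width
        return canvas
    if img.width > target_w:
        left = (img.width - target_w) // 2
        return img.crop((left, 0, left + target_w, target_h))
    return img
```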
In summary, image characters in a real scene are simulated, combining a background picture, a dictionary library and font libraries of various languages in the simulation process to artificially synthesize image characters with labels, forming a first training data set; image characters in a real scene are collected and manually classified and labeled, forming a second training data set; a neural network CRNN for OCR language identification is constructed and preliminarily trained with the first training data set to obtain an OCR character recognition model; the OCR character recognition model is fine-tuned with the first training data set and the second training data set to obtain an OCR language identification classification model; and model inference is performed with the fine-tuned OCR language identification classification model. The invention achieves a better fitting effect, reduces missed detections and false positives, and improves model performance.
It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may only perform one or more steps of the method of the embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above describes some embodiments of the disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or advantageous.
Example 2
Referring to fig. 7, embodiment 2 of the present invention provides an image language identification system, including:
the first data set generating module 1 is used for simulating image characters in a real scene, and artificially synthesizing the image characters with labels to form a first training data set in the simulation process by combining a background picture, a dictionary library and font libraries of various languages;
the second data set generation module 2 is used for collecting image characters in a real scene, and carrying out manual classification and labeling on the collected image characters to form a second training data set;
the character recognition processing module 3 is used for constructing a neural network CRNN for OCR language recognition, and performing preliminary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
a language identification and classification module 4, configured to perform fine-tuning on the OCR character recognition model by using the first training data set and the second training data set, so as to obtain an OCR language identification and classification model;
and the model inference module 5 is used for performing model inference with the OCR language identification classification model after fine-tuning.
In this embodiment, the first data set generation module 1 simulates image characters in a real scene as follows:
a picture data set without characters is collected from the Internet and cropped to serve as background pictures for the artificially synthesized image characters;
character dictionaries or word lexicons of various languages are collected from the Internet, and font libraries of the corresponding languages are obtained from the languages' official websites to synthesize image characters of different styles.
In this embodiment, in the first data set generation module 1, the character dictionaries and word lexicons are numbered so that every character corresponds to a unique index, and the labels of the artificially synthesized image characters are stored in a specified format;
in the second data set generation module 2, each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used when fine-tuning the language identification classification model.
In this embodiment, the neural network CRNN in the character recognition processing module 3 comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of bidirectional LSTM are connected after SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
In this embodiment, in the process of fine-tuning the OCR character recognition model, the language identification classification module 4 mixes the first training data set and the second training data set into one training data set according to a preset distribution proportion, uses language-level labeling information, freezes the backbone part of SE_ResNeXt50_32x4d and the bidirectional LSTM layers, and replaces the character recognition classifier with a language identification classifier for fine-tuning.
It should be noted that, for the information interaction and execution processes between the modules/units of the above system, because they are based on the same concept as the method embodiment in Embodiment 1 of the present application, their technical effects are the same as those of the method embodiment; for details, reference may be made to the description of the foregoing method embodiment, which is not repeated here.
Example 3
Embodiment 3 of the present invention provides a non-transitory computer-readable storage medium storing program code of the image character language identification method, the program code comprising instructions for executing the image character language identification method of embodiment 1 or any possible implementation thereof.
The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or a data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid-state disks (SSD)), among others.
Example 4
An embodiment 4 of the present invention provides an electronic device, including: a memory and a processor;
the processor and the memory are communicated with each other through a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the image and language identification method of embodiment 1 or any possible implementation manner thereof.
Specifically, the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated in the processor, located external to the processor, or stand-alone.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to be performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.).
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described here. They may also be fabricated separately into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (10)

1. An image character language identification method is characterized by comprising the following steps:
simulating image characters in a real scene, wherein the simulation process combines a background picture, a dictionary library and various language style font libraries to artificially synthesize image characters with labels to form a first training data set;
collecting image characters in a real scene, and carrying out manual classification and labeling on the collected image characters to form a second training data set;
constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
performing fine-tuning on the OCR character recognition model by adopting the first training data set and the second training data set to obtain an OCR language recognition classification model;
and performing model reasoning by using the OCR language identification and classification model after fine-tuning.
2. The method according to claim 1, wherein the image characters in a real scene are simulated by:
collecting a picture data set without characters from the Internet, and cropping it to serve as background pictures for the artificially synthesized image characters;
collecting character dictionaries or word lexicons of various languages from the Internet, and obtaining font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
3. The method according to claim 2, wherein the character dictionaries and word lexicons are numbered so that all characters correspond to unique indexes and are stored in a specified format as labels of the artificially synthesized image characters;
each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used in the OCR character recognition model, and the language-level labeling information is used when fine-tuning the language identification classification model.
4. The method as claimed in claim 3, wherein the neural network CRNN comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of bidirectional LSTM are connected after SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
5. The method as claimed in claim 4, wherein in the fine-tuning process of the OCR character recognition model, the first training data set and the second training data set are mixed into one training data set according to a preset distribution proportion; language-level labeling information is used, the backbone part of SE_ResNeXt50_32x4d and the bidirectional LSTM layers are frozen, and the character recognition classifier is replaced by a language identification classifier for fine-tuning.
6. An image and text language identification system, comprising:
the first data set generation module is used for simulating image characters in a real scene, and the simulation process combines a background picture, a dictionary library and various language style font libraries to artificially synthesize the image characters with labels to form a first training data set;
the second data set generation module is used for collecting the image characters in the real scene and carrying out manual classification and labeling on the collected image characters to form a second training data set;
the character recognition processing module is used for constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
the language identification and classification module is used for performing fine-tuning on the OCR character identification model by adopting the first training data set and the second training data set to obtain an OCR language identification and classification model;
and the model reasoning module is used for carrying out model reasoning by utilizing the OCR language identification classification model after the fine-tuning.
7. The system according to claim 6, wherein the first data set generation module simulates image characters in a real scene by:
collecting a picture data set without characters from the Internet, and cropping it to serve as background pictures for the artificially synthesized image characters;
collecting character dictionaries or word lexicons of various languages from the Internet, and obtaining font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
8. The system according to claim 7, wherein in the first data set generation module, the character dictionaries and word lexicons are numbered so that all characters correspond to unique indexes, and the labels of the artificially synthesized image characters are stored in a specified format;
in the second data set generation module, each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used when fine-tuning the language identification classification model.
9. The system according to claim 8, wherein said neural network CRNN of said character recognition processing module comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of bidirectional LSTM are connected after SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
10. The system of claim 9, wherein during the fine-tuning process of the OCR character recognition model, the language identification classification module mixes the first training data set and the second training data set into one training data set according to a preset distribution proportion, uses language-level labeling information, freezes the backbone part of SE_ResNeXt50_32x4d and the bidirectional LSTM layers, and replaces the character recognition classifier with a language identification classifier for fine-tuning.
CN202210640881.8A, filed 2022-06-08 (priority date 2022-06-08): Image character language identification method and system. Published as CN114998909A; status: Pending.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210640881.8A CN114998909A (en) 2022-06-08 2022-06-08 Image character language identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210640881.8A CN114998909A (en) 2022-06-08 2022-06-08 Image character language identification method and system

Publications (1)

Publication Number Publication Date
CN114998909A true CN114998909A (en) 2022-09-02

Family

ID=83033131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210640881.8A Pending CN114998909A (en) 2022-06-08 2022-06-08 Image character language identification method and system

Country Status (1)

Country Link
CN (1) CN114998909A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480144A (en) * 2017-08-03 2017-12-15 中国人民大学 Image natural language description generation method and device with cross-language learning capability
CN109214386A (en) * 2018-09-14 2019-01-15 北京京东金融科技控股有限公司 Method and apparatus for generating image recognition model
CN109685100A (en) * 2018-11-12 2019-04-26 平安科技(深圳)有限公司 Character recognition method, server and computer-readable storage medium
CN109670502A (en) * 2018-12-18 2019-04-23 成都三零凯天通信实业有限公司 Training data generation system and method for Uyghur character recognition
CN113392299A (en) * 2020-12-02 2021-09-14 腾讯科技(深圳)有限公司 Picture resource obtaining method and device, readable storage medium and equipment
CN113239967A (en) * 2021-04-14 2021-08-10 北京达佳互联信息技术有限公司 Character recognition model training method, recognition method, related equipment and storage medium

Similar Documents

Publication Publication Date Title
Stefanini et al. From show to tell: A survey on deep learning-based image captioning
Yang et al. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks
EP3926531B1 (en) Method and system for visio-linguistic understanding using contextual language model reasoners
Jain et al. Unconstrained scene text and video text recognition for arabic script
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN112257421A (en) Nested entity data identification method and device and electronic equipment
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN111459977A (en) Conversion of natural language queries
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN114970553B (en) Information analysis method and device based on large-scale unmarked corpus and electronic equipment
CN117765132A (en) Image generation method, device, equipment and storage medium
CN112101031A (en) Entity identification method, terminal equipment and storage medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
US20210350090A1 (en) Text to visualization
CN117453949A (en) Video positioning method and device
CN117891930B (en) Book knowledge question-answering method based on knowledge graph enhanced large language model
CN114860905A (en) Intention identification method, device and equipment
CN113961669A (en) Training method of pre-training language model, storage medium and server
CN115345168A (en) Cascade pooling of natural language processing
CN114020907A (en) Information extraction method and device, storage medium and electronic equipment
CN114998909A (en) Image character language identification method and system
CN114332476B (en) Method, device, electronic equipment, storage medium and product for recognizing wiki
CN115130437B (en) Intelligent document filling method and device and storage medium
CN114565751A (en) OCR recognition model training method, OCR recognition method and related device
CN114117055A (en) Method, device, equipment and readable medium for extracting text entity relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination