CN114998909A - Image character language identification method and system - Google Patents
- Publication number
- CN114998909A (application CN202210640881.8A)
- Authority
- CN
- China
- Prior art keywords
- data set
- language
- image
- training data
- ocr
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/263—Language identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
A method and system for identifying the language of image characters. The method simulates image characters in a real scene, combining background pictures, a dictionary library and font libraries of various language styles to artificially synthesize labeled image characters that form a first training data set; collects image characters in a real scene and manually classifies and labels them to form a second training data set; constructs a neural network CRNN for OCR language recognition and performs preliminary training with the first training data set to obtain an OCR character recognition model; performs fine-tuning on the OCR character recognition model with the first and second training data sets to obtain an OCR language recognition classification model; and performs model inference with the fine-tuned OCR language recognition classification model. The invention achieves a better fit, reduces missed detections and false positives, and improves model performance.
Description
Technical Field
The invention relates to the technical field of OCR (optical character recognition) text processing, and in particular to an image character language identification method and system.
Background
At present, in Internet scenarios, particularly overseas cooperation, cross-border trade, international advertising and the like, characters of multiple languages often appear on the same picture at the same time, and a priori language selection is not enough to cover all scenario requirements, so language identification is necessary before character recognition.
In the prior art, multilingual OCR language recognition schemes generally use either a general image classification model or a multi-label image classification model: semantic features of the whole-sentence image text are extracted directly and then classified by a feature classifier. These conventional approaches consider only image-level semantic features and ignore the sequential context between characters, so they cannot distinguish languages whose glyphs look similar at the image level. Moreover, when a whole-sentence picture text contains two or more languages and one language accounts for more than 80% of it, multi-label image classification methods often produce missed detections or false positives (accidental injury). In summary, a new technical solution for recognizing the languages of image text is needed.
Disclosure of Invention
Therefore, the invention provides an image character language identification method and system to solve the problems of the conventional schemes, which are prone to missed detections or false positives and perform poorly.
In order to achieve the above purpose, the invention provides the following technical scheme: an image character language identification method comprises the following steps:
simulating image characters in a real scene, combining a background picture, a dictionary library and various language style font libraries in the simulation process, and artificially synthesizing the image characters with labels to form a first training data set;
collecting image characters in a real scene, and carrying out manual classification and labeling on the collected image characters to form a second training data set;
constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
performing fine-tuning on the OCR character recognition model by adopting the first training data set and the second training data set to obtain an OCR language recognition classification model;
and performing model reasoning by using the OCR language identification classification model after fine-tuning.
As a preferred scheme of the image character language identification method, simulating the image characters in a real scene comprises:
collecting a picture data set without characters from the Internet, and cutting the picture data set to be used as a background picture of artificially synthesized image characters;
collecting lexicon libraries (word-level) or dictionary libraries (character-level) of various languages from the Internet, and acquiring font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
As a preferred scheme of the image character language identification method, the lexicon database and the dictionary database are numbered so that all characters correspond to unique indexes, which are stored in a specified format as the labels of the artificially synthesized image characters;
each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used for the language identification classification model during fine-tuning.
As a preferred scheme of the image character language identification method, the neural network CRNN comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of Bi-directional LSTM are connected behind SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
As a preferred scheme of the image character language identification method, in the fine-tuning process of the OCR character recognition model, the first training data set and the second training data set are mixed into one training data set according to a preset distribution ratio; using the language-level labeling information, the backbone part and the Bi-directional LSTM layers of SE_ResNeXt50_32x4d are frozen, and the character recognition classifier is changed into a language recognition classifier for fine-tuning.
The invention also provides an image character language identification system, which comprises:
the first data set generation module is used for simulating image characters in a real scene, and the simulation process combines a background picture, a dictionary library and various language style font libraries to artificially synthesize the image characters with labels to form a first training data set;
the second data set generation module is used for collecting the image characters in the real scene and carrying out manual classification and labeling on the collected image characters to form a second training data set;
the character recognition processing module is used for constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
the language identification and classification module is used for performing fine-tuning on the OCR character identification model by adopting the first training data set and the second training data set to obtain an OCR language identification and classification model;
and the model reasoning module is used for carrying out model reasoning by utilizing the OCR language identification classification model after the fine-tuning.
As a preferred scheme of the image character language identification system, the first data set generation module simulates image characters in a real scene as follows:
collecting a picture data set without characters from the Internet, and cutting the picture data set to be used as a background picture of artificially synthesized image characters;
collecting lexicon libraries or dictionary libraries of various languages from the Internet, and acquiring font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
As a preferred scheme of the image character language identification system, in the first data set generation module, the lexicon database and the dictionary database are numbered so that all characters correspond to unique indexes, which are stored in a specified format as the labels of the artificially synthesized image characters;
in the second data set generation module, each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used for the language identification classification model during fine-tuning.
As a preferred scheme of the image character language identification system, the neural network CRNN in the character recognition processing module comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of Bi-directional LSTM are connected behind SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
As a preferred scheme of the image character language identification system, in the process of performing fine-tuning on the OCR character recognition model, the language identification classification module mixes the first training data set and the second training data set into one training data set according to a preset distribution ratio; using the language-level labeling information, it freezes the backbone part and the Bi-directional LSTM layers of SE_ResNeXt50_32x4d and changes the character recognition classifier into a language recognition classifier for fine-tuning.
The invention has the following advantages: image characters in a real scene are simulated, with the simulation process combining background pictures, a dictionary library and font libraries of various language styles to artificially synthesize labeled image characters that form a first training data set; image characters in a real scene are collected and manually classified and labeled to form a second training data set; a neural network CRNN for OCR language recognition is constructed and preliminarily trained with the first training data set to obtain an OCR character recognition model; fine-tuning is performed on the OCR character recognition model with the first and second training data sets to obtain an OCR language recognition classification model; and model inference is performed with the fine-tuned OCR language recognition classification model. The invention achieves a better fit, reduces missed detections and false positives, and improves model performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
The structures, ratios, sizes and the like shown in this specification are only used to match the contents disclosed in the specification so that those skilled in the art can understand and read the invention; they do not limit the conditions under which the invention can be implemented and have no essential technical significance. Any structural modification, change of ratio relationship or adjustment of size that does not affect the functions and purposes of the invention still falls within the scope of the invention.
Fig. 1 is a schematic flow chart of a method for recognizing language types of image texts according to embodiment 1 of the present invention;
fig. 2 is a flow chart of synthesizing multi-lingual text images of different styles in the method for recognizing image languages according to embodiment 1 of the present invention;
fig. 3 is a storage format of a tag of an artificially synthesized text image in the method for recognizing language types of image texts according to embodiment 1 of the present invention;
fig. 4 is a detailed diagram of the block model of SE_ResNeXt50_32x4d in the image language identification method according to embodiment 1 of the present invention;
fig. 5 shows details of the overall CRNN framework and the SE_ResNeXt50_32x4d modifications in the image language identification method according to embodiment 1 of the present invention;
fig. 6 is a CRNN model inference configuration in the image and text language identification method provided in embodiment 1 of the present invention;
fig. 7 is a schematic diagram of an image and language identification system according to embodiment 2 of the present invention.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It is to be understood that the described embodiments are merely exemplary and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person skilled in the art without creative effort based on the embodiments of the present invention belong to the protection scope of the present invention.
Example 1
Referring to fig. 1, fig. 2 and fig. 3, embodiment 1 of the present invention provides a method for recognizing language types of image texts, including:
s1, simulating image characters in a real scene, and artificially synthesizing the image characters with labels to form a first training data set in the simulation process by combining a background picture, a dictionary library and various language style font libraries;
s2, collecting image characters in a real scene, and manually classifying and labeling the collected image characters to form a second training data set;
s3, constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
s4, performing fine-tuning on the OCR character recognition model by adopting the first training data set and the second training data set to obtain an OCR language recognition classification model;
and S5, performing model reasoning by using the OCR language recognition and classification model after fine-tuning.
In this embodiment, step S1 simulates the image characters in a real scene as follows:
collecting a picture data set without characters from the Internet, and cutting the picture data set to be used as a background picture of artificially synthesized image characters;
collecting lexicon libraries or dictionary libraries of various languages from the Internet, and acquiring font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
Specifically, in the process of simulating image characters in a real scene and artificially synthesizing training data to form the first training data set, a picture data set without characters is first collected from the Internet and cropped (to 32 × 352 in this example) to serve as background pictures for the artificially synthesized image characters. Lexicon libraries or dictionary libraries of various languages are then collected from the Internet; this example covers the Latin family, simplified Chinese, traditional Chinese, Japanese, Korean, Devanagari, Tamil, Thai and Arabic. Chinese, Japanese and Korean use single characters as units to generate dictionary libraries, while the Latin family, Devanagari, Tamil, Thai and Arabic use words as units to generate lexicon libraries; note also the right-to-left writing order of Arabic. Next, font libraries (font files) of the corresponding languages are obtained from the languages' official websites or through other channels for synthesizing image characters of different styles. Image characters are thus synthesized from the picture backgrounds, the dictionary libraries and the font files of the various languages.
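To make the synthesis pipeline concrete, the following is a minimal illustrative sketch (not part of the original disclosure) using Pillow; the background, dictionary and font file names are hypothetical placeholders, and Arabic right-to-left shaping is omitted.

```python
import json
import random
from PIL import Image, ImageDraw, ImageFont

# Hypothetical assets gathered as described above: 32x352 background crops,
# per-language dictionary/lexicon files, and per-language font files.
BACKGROUNDS = ["bg_0001.png", "bg_0002.png"]
DICTS = {"korean": "dict_ko.txt", "thai": "dict_th.txt"}
FONTS = {"korean": "NotoSansKR.ttf", "thai": "NotoSansThai.ttf"}

def synthesize_sample(language: str, out_path: str) -> dict:
    """Render a random line of text onto a random background crop and
    return its label record (language tag plus per-unit indexes)."""
    with open(DICTS[language], encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f if line.strip()]
    units = random.choices(vocab, k=random.randint(3, 10))

    img = Image.open(random.choice(BACKGROUNDS)).convert("RGB")
    draw = ImageDraw.Draw(img)
    # NB: Arabic would additionally need right-to-left shaping, omitted here.
    draw.text((4, 2), "".join(units),
              font=ImageFont.truetype(FONTS[language], 24), fill=(16, 16, 16))
    img.save(out_path)

    unit_to_idx = {u: i for i, u in enumerate(vocab)}  # unique index per unit
    return {"image": out_path, "language": language,
            "unit_ids": [unit_to_idx[u] for u in units]}

print(json.dumps(synthesize_sample("korean", "synth_000001.png"),
                 ensure_ascii=False))
```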
In this embodiment, the lexicon database and the dictionary database are numbered so that all characters correspond to unique indexes, which are stored in a specified format as the labels of the artificially synthesized image characters;
each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used for the language identification classification model during fine-tuning.
With reference to fig. 3, specifically, the lexicon library and the dictionary library (after deduplication) are numbered from 0, ensuring that all characters correspond to unique indexes, which are stored in JSON format and used as the labels of all artificially synthesized image characters. Each training sample therefore contains not only language-level labeling information but also character-level labeling information for every character in the picture; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model of step S3, and the language-level labeling information is used for fine-tuning the OCR language recognition classification model of step S4.
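A minimal sketch of this indexing step is shown below; the file names and layout are assumptions, and the exact storage format of fig. 3 is not reproduced.

```python
import json

# Merge the per-language dictionary/lexicon files, deduplicate, and number
# every unit from 0 so each character (or word) has a unique index.
dict_files = ["dict_latin.txt", "dict_zh_hans.txt", "dict_ko.txt"]  # hypothetical

units = []
for path in dict_files:
    with open(path, encoding="utf-8") as f:
        units.extend(line.rstrip("\n") for line in f if line.strip())

vocab = sorted(set(units))                          # deduplication
unit_to_idx = {u: i for i, u in enumerate(vocab)}   # unique index from 0

# Stored in JSON format and shared by synthesis (labels) and training.
with open("charset.json", "w", encoding="utf-8") as f:
    json.dump(unit_to_idx, f, ensure_ascii=False)
```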
In this embodiment, the neural network CRNN includes a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of Bi-directional LSTM are connected behind SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
Specifically, the CRNN neural network constructed in step S3 includes a convolutional neural network (CNN) and a recurrent neural network (RNN). The CNN part employs the backbone of SE_ResNeXt50_32x4d for image semantic feature extraction and is followed by 2 layers of Bi-directional LSTM for extracting picture text sequence features; finally, the multilingual text recognition model is trained on the synthesized first training data set based on CTC loss.
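The following is a minimal PyTorch sketch of this architecture, offered as an illustration rather than the patented implementation: it assumes the timm library supplies the SE-ResNeXt50_32x4d backbone and applies the stride modifications of the next paragraph separately (see the sketch there), so the sequence length here is 11 steps rather than the 88 described below.

```python
import torch
import torch.nn as nn
import timm  # assumption: timm's "seresnext50_32x4d" stands in for the backbone

class CRNN(nn.Module):
    """CNN (SE-ResNeXt50_32x4d backbone) + 2-layer Bi-directional LSTM +
    per-time-step classifier, trained with CTC loss as described above."""

    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        self.backbone = timm.create_model("seresnext50_32x4d",
                                          pretrained=False, features_only=True)
        feat_dim = self.backbone.feature_info.channels()[-1]  # 2048
        self.pool = nn.AdaptiveAvgPool2d((1, None))  # collapse height to 1
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.backbone(x)[-1]       # (B, C, H', W'): deepest feature map
        f = self.pool(f).squeeze(2)    # (B, C, W')
        f = f.permute(0, 2, 1)         # (B, W', C): time steps along width
        seq, _ = self.rnn(f)           # (B, W', 2*hidden) sequence features
        return self.classifier(seq)    # per-step logits for CTC decoding

model = CRNN(num_classes=10000)        # illustrative charset size
logits = model(torch.randn(2, 3, 32, 352))
print(logits.shape)                    # torch.Size([2, 11, 10000])
```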
In the present example, the SE_ResNeXt50_32x4d network of the CNN backbone (cardinality 32, group width 4) is modified so that, at the first downsampling of layer2, layer3 and layer4, the resolution in the X-axis direction is unchanged while the Y-axis keeps 2-fold downsampling, i.e. stride = (1, 2). The final input is 32 × 352 × channels and the backbone output is 1 × 88 × channels; the X-axis direction of the extracted feature map is thus 1/4 of the original and the Y-axis direction is 1/32, see fig. 4 and fig. 5. In this example, CTC loss is used to solve the mismatch between the model output length and the label length, implemented by the following formula:
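The stride surgery itself might look like the following under the same timm assumption as above; note that PyTorch orders Conv2d strides as (height, width), i.e. (Y, X), so the patent's stride (1, 2) written in (X, Y) order becomes (2, 1) here.

```python
import torch
import torch.nn as nn
import timm  # same assumption as in the CRNN sketch above

backbone = timm.create_model("seresnext50_32x4d", pretrained=False,
                             features_only=True)

# Change the first-block downsampling convolutions of layer2/3/4 from
# stride (2, 2) to (2, 1): the Y axis (height) is still halved, while the
# X axis (width) keeps its resolution, as described above.
for name, m in backbone.named_modules():
    if (isinstance(m, nn.Conv2d) and m.stride == (2, 2)
            and name.split(".")[0] in ("layer2", "layer3", "layer4")):
        m.stride = (2, 1)

feat = backbone(torch.randn(1, 3, 32, 352))[-1]
print(feat.shape)  # torch.Size([1, 2048, 1, 88]): Y is 1/32, X is 1/4
```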
Given an input $x$, the probability that the LSTM outputs the label sequence $l$ is

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x),$$

where $\pi \in B^{-1}(l)$ denotes all paths $\pi$ that map to $l$ after the $B$ transformation (collapsing repeats and removing blanks).
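For reference, this objective corresponds to PyTorch's built-in nn.CTCLoss; the following usage sketch is consistent with the shapes above (the blank index and the batch contents are assumptions).

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # index 0 reserved for blank

# Per-step logits from the recognizer: (batch, time, classes); with the
# stride changes above, the time axis is 88 for a 32x352 input.
logits = torch.randn(2, 88, 10000, requires_grad=True)
log_probs = logits.log_softmax(2).permute(1, 0, 2)  # CTCLoss expects (T, B, C)

targets = torch.randint(1, 10000, (2, 12))          # padded label indexes
input_lengths = torch.full((2,), 88, dtype=torch.long)
target_lengths = torch.tensor([12, 7], dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back through the per-step logits
```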
In this embodiment, during the fine-tuning of the OCR character recognition model, the first training data set and the second training data set are mixed into one training data set according to a preset distribution ratio; using the language-level labeling information, the backbone part and the Bi-directional LSTM layers of SE_ResNeXt50_32x4d are frozen, and the character recognition classifier is changed into a language recognition classifier for fine-tuning.
Specifically, the artificially synthesized image character training data and the real-scene image character training data are mixed at a ratio of 1:1 into one training data set; the labels used are all language-level labels, the backbone part and the LSTM layers of SE_ResNeXt50_32x4d are frozen, and the multilingual character recognition classifier is changed into a language recognition classifier for fine-tuning.
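Under the same assumptions as the CRNN sketch above, this fine-tuning step could be sketched as follows; how the per-step outputs are pooled into a single sentence-level language prediction is an assumption, as the text does not specify it.

```python
import torch
import torch.nn as nn

NUM_LANGUAGES = 9  # illustrative: the nine language families listed earlier

# model is the CRNN sketch defined earlier.
for p in model.backbone.parameters():  # freeze the SE_ResNeXt50_32x4d backbone
    p.requires_grad = False
for p in model.rnn.parameters():       # freeze the Bi-directional LSTM layers
    p.requires_grad = False

# Swap the character-recognition classifier for a language classifier.
model.classifier = nn.Linear(model.classifier.in_features, NUM_LANGUAGES)

# Only the new head is trainable; per-step language logits can be averaged
# over the width axis for a sentence-level prediction (an assumption).
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```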
Referring to fig. 6, to allow batch processing during inference, the input picture size is fixed at 32 × 352. If the width of an input picture is less than 352, the original image is copied (tiled) until the width of 352 is filled; if the width is greater than 352, subgraphs of width 352 are cut out toward the two sides, taking the center of the picture as the origin, for prediction.
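A sketch of this width normalization, assuming numpy images of height 32; the patent's wording suggests several windows may be cut toward both sides from the center for wide images, but only the single centered window is shown here as a simplification.

```python
import numpy as np

H, TARGET_W = 32, 352

def normalize_width(img: np.ndarray) -> np.ndarray:
    """img: (32, W, 3) uint8. Tile narrow images until 352 wide; for wide
    images, cut a 352-wide subgraph centered on the picture."""
    w = img.shape[1]
    if w < TARGET_W:
        reps = -(-TARGET_W // w)              # ceil(352 / w) copies
        return np.tile(img, (1, reps, 1))[:, :TARGET_W]
    if w > TARGET_W:
        x0 = (w - TARGET_W) // 2              # window centered on the image
        return img[:, x0:x0 + TARGET_W]
    return img

batch = np.stack([normalize_width(np.zeros((H, 120, 3), np.uint8)),
                  normalize_width(np.zeros((H, 500, 3), np.uint8))])
print(batch.shape)  # (2, 32, 352, 3): every image now batches at 32x352
```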
In summary, image characters in a real scene are simulated, with the simulation process combining background pictures, a dictionary library and font libraries of various languages to artificially synthesize labeled image characters that form a first training data set; image characters in a real scene are collected and manually classified and labeled to form a second training data set; a neural network CRNN for OCR language recognition is constructed and preliminarily trained with the first training data set to obtain an OCR character recognition model; fine-tuning is performed on the OCR character recognition model with the first and second training data sets to obtain an OCR language recognition classification model; and model inference is performed with the fine-tuned OCR language recognition classification model. The invention achieves a better fit, reduces missed detections and false positives, and improves model performance.
It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may only perform one or more steps of the method of the embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above describes some embodiments of the disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Example 2
Referring to fig. 7, embodiment 2 of the present invention provides an image character language identification system, including:
the first data set generating module 1 is used for simulating image characters in a real scene, and artificially synthesizing the image characters with labels to form a first training data set in the simulation process by combining a background picture, a dictionary library and font libraries of various languages;
the second data set generation module 2 is used for collecting image characters in a real scene, and carrying out manual classification and labeling on the collected image characters to form a second training data set;
the character recognition processing module 3 is used for constructing a neural network CRNN for OCR language recognition, and performing preliminary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
a language identification and classification module 4, configured to perform fine-tuning on the OCR character recognition model by using the first training data set and the second training data set, so as to obtain an OCR language identification and classification model;
and the model reasoning module 5 is used for performing model reasoning by using the OCR language identification and classification model after fine-tuning.
In this embodiment, the first data set generation module 1 simulates image characters in a real scene as follows:
collecting a picture data set without characters from the Internet, and cropping it to serve as background pictures for artificially synthesized image characters;
collecting lexicon libraries or dictionary libraries of various languages from the Internet, and acquiring font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
In this embodiment, in the first data set generation module 1, the lexicon database and the dictionary database are numbered so that all characters correspond to unique indexes, which are stored in a specified format as the labels of the artificially synthesized image characters;
in the second data set generation module 2, each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used for the language identification classification model during fine-tuning.
In this embodiment, the neural network CRNN in the word recognition processing module 3 includes a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts a backbone part of SE _ ResNeXt50_32x4d to extract image semantic features, and 2 layers of Bi-directional LSTM are connected behind SE _ ResNeXt50_32x4d to extract image character sequence features; and performing OCR character recognition model training by adopting the first training data set based on CTC _ loss.
In this embodiment, in the process of performing fine-tuning on the OCR character recognition model, the language recognition and classification module 4 mixes the first training data set and the second training data set into one training data set according to a preset distribution ratio; using the language-level labeling information, it freezes the backbone part and the Bi-directional LSTM layers of SE_ResNeXt50_32x4d and changes the character recognition classifier into a language recognition classifier for fine-tuning.
It should be noted that, for the information interaction, execution process, and other contents between the modules/units of the system, since the same concept is based on the method embodiment in embodiment 1 of the present application, the technical effect brought by the information interaction, execution process, and other contents are the same as those of the method embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.
Example 3
Embodiment 3 of the present invention provides a computer-readable storage medium storing computer instructions which, when executed, implement the image character language identification method of embodiment 1. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD) or semiconductor media (e.g., a Solid State Disk (SSD)), among others.
Example 4
An embodiment 4 of the present invention provides an electronic device, including: a memory and a processor;
the processor and the memory are communicated with each other through a bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the image and language identification method of embodiment 1 or any possible implementation manner thereof.
Specifically, the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated in the processor, located external to the processor, or stand-alone.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.
Claims (10)
1. An image character language identification method is characterized by comprising the following steps:
simulating image characters in a real scene, wherein the simulation process combines a background picture, a dictionary library and various language style font libraries to artificially synthesize image characters with labels to form a first training data set;
collecting image characters in a real scene, and carrying out manual classification and labeling on the collected image characters to form a second training data set;
constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
performing fine-tuning on the OCR character recognition model by adopting the first training data set and the second training data set to obtain an OCR language recognition classification model;
and performing model reasoning by using the OCR language identification and classification model after fine-tuning.
2. The image character language identification method according to claim 1, wherein the image characters are simulated in a real scene by:
collecting a picture data set without characters from the Internet, and cutting the picture data set to be used as a background picture of artificially synthesized image characters;
collecting lexicon libraries or dictionary libraries of various languages from the Internet, and acquiring font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
3. The image character language identification method according to claim 2, wherein the lexicon database and the dictionary database are numbered so that all characters correspond to unique indexes, which are stored in a specified format as the labels of the artificially synthesized image characters;
each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used for the language identification classification model during fine-tuning.
4. The method as claimed in claim 3, wherein the neural network CRNN comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of Bi-directional LSTM are connected behind SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
5. The image character language identification method according to claim 4, wherein in the fine-tuning process of the OCR character recognition model, the first training data set and the second training data set are mixed into one training data set according to a preset distribution ratio, and, using the language-level labeling information, the backbone part and the Bi-directional LSTM layers of SE_ResNeXt50_32x4d are frozen and the character recognition classifier is changed into a language recognition classifier for fine-tuning.
6. An image character language identification system, comprising:
the first data set generation module is used for simulating image characters in a real scene, and the simulation process combines a background picture, a dictionary library and various language style font libraries to artificially synthesize the image characters with labels to form a first training data set;
the second data set generation module is used for collecting the image characters in the real scene and carrying out manual classification and labeling on the collected image characters to form a second training data set;
the character recognition processing module is used for constructing a neural network CRNN for OCR language recognition, and performing primary training on the constructed neural network CRNN by using the first training data set to obtain an OCR character recognition model;
the language identification and classification module is used for performing fine-tuning on the OCR character identification model by adopting the first training data set and the second training data set to obtain an OCR language identification and classification model;
and the model reasoning module is used for carrying out model reasoning by utilizing the OCR language identification classification model after the fine-tuning.
7. The image character language identification system according to claim 6, wherein the first data set generation module simulates image characters in a real scene by:
collecting a picture data set without characters from the Internet, and cropping it to serve as background pictures for artificially synthesized image characters;
collecting lexicon libraries or dictionary libraries of various languages from the Internet, and acquiring font libraries of the corresponding languages from the languages' official websites to synthesize image characters of different styles.
8. The image character language identification system according to claim 7, wherein in the first data set generation module, the lexicon database and the dictionary database are numbered so that all characters correspond to unique indexes, which are stored in a specified format as the labels of the artificially synthesized image characters;
in the second data set generation module, each training sample comprises language-level labeling information and character-level labeling information; the character-level labeling information formed from the unique indexes is used for the OCR character recognition model, and the language-level labeling information is used for the language identification classification model during fine-tuning.
9. The system according to claim 8, wherein said neural network CRNN of said character recognition processing module comprises a convolutional neural network CNN and a recurrent neural network RNN;
the convolutional neural network CNN adopts the backbone part of SE_ResNeXt50_32x4d to extract image semantic features, and 2 layers of Bi-directional LSTM are connected behind SE_ResNeXt50_32x4d to extract image character sequence features; the OCR character recognition model is trained with the first training data set based on CTC loss.
10. The image character language identification system according to claim 9, wherein during the fine-tuning of the OCR character recognition model, the language identification and classification module mixes the first training data set and the second training data set into one training data set according to a preset distribution ratio, and, using the language-level labeling information, freezes the backbone part and the Bi-directional LSTM layers of SE_ResNeXt50_32x4d and changes the character recognition classifier into a language recognition classifier for fine-tuning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210640881.8A | 2022-06-08 | 2022-06-08 | Image character language identification method and system
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210640881.8A | 2022-06-08 | 2022-06-08 | Image character language identification method and system
Publications (1)
Publication Number | Publication Date |
---|---|
CN114998909A (en) | 2022-09-02
Family
ID=83033131
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210640881.8A | CN114998909A (en) | 2022-06-08 | 2022-06-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114998909A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107480144A (en) * | 2017-08-03 | 2017-12-15 | 中国人民大学 | Possess the image natural language description generation method and device across language learning ability |
CN109214386A (en) * | 2018-09-14 | 2019-01-15 | 北京京东金融科技控股有限公司 | Method and apparatus for generating image recognition model |
CN109685100A (en) * | 2018-11-12 | 2019-04-26 | 平安科技(深圳)有限公司 | Character identifying method, server and computer readable storage medium |
CN109670502A (en) * | 2018-12-18 | 2019-04-23 | 成都三零凯天通信实业有限公司 | Training data generation system and method based on dimension language character recognition |
CN113392299A (en) * | 2020-12-02 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Picture resource obtaining method and device, readable storage medium and equipment |
CN113239967A (en) * | 2021-04-14 | 2021-08-10 | 北京达佳互联信息技术有限公司 | Character recognition model training method, recognition method, related equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Stefanini et al. | From show to tell: A survey on deep learning-based image captioning | |
Yang et al. | Learning to extract semantic structure from documents using multimodal fully convolutional neural networks | |
EP3926531B1 (en) | Method and system for visio-linguistic understanding using contextual language model reasoners | |
Jain et al. | Unconstrained scene text and video text recognition for arabic script | |
Wilkinson et al. | Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections | |
CN112257421A (en) | Nested entity data identification method and device and electronic equipment | |
CN111783471B (en) | Semantic recognition method, device, equipment and storage medium for natural language | |
CN111459977A (en) | Conversion of natural language queries | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
CN114970553B (en) | Information analysis method and device based on large-scale unmarked corpus and electronic equipment | |
CN117765132A (en) | Image generation method, device, equipment and storage medium | |
CN112101031A (en) | Entity identification method, terminal equipment and storage medium | |
CN116955591A (en) | Recommendation language generation method, related device and medium for content recommendation | |
US20210350090A1 (en) | Text to visualization | |
CN117453949A (en) | Video positioning method and device | |
CN117891930B (en) | Book knowledge question-answering method based on knowledge graph enhanced large language model | |
CN114860905A (en) | Intention identification method, device and equipment | |
CN113961669A (en) | Training method of pre-training language model, storage medium and server | |
CN115345168A (en) | Cascade pooling of natural language processing | |
CN114020907A (en) | Information extraction method and device, storage medium and electronic equipment | |
CN114998909A (en) | Image character language identification method and system | |
CN114332476B (en) | Method, device, electronic equipment, storage medium and product for recognizing wiki | |
CN115130437B (en) | Intelligent document filling method and device and storage medium | |
CN114565751A (en) | OCR recognition model training method, OCR recognition method and related device | |
CN114117055A (en) | Method, device, equipment and readable medium for extracting text entity relationship |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||