CN116343190B - Natural scene character recognition method, system, equipment and storage medium - Google Patents
Natural scene character recognition method, system, equipment and storage medium
- Publication number: CN116343190B
- Application number: CN202310623773.4A
- Authority
- CN
- China
- Prior art keywords
- time step
- character
- vector
- time
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/63—Scene text, e.g. street names
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V30/19127—Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
- G06V30/19173—Classification techniques
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a natural scene character recognition method, system, equipment and storage medium, which correspond to one another. In the scheme, an image is encoded into a vector space so that local and global multi-granularity semantics are obtained; a global vector is obtained through aggregation; and channel attention maps for different time steps are generated in parallel so as to decode the character information of the different time steps. Because a vector-to-sequence decoding mode is adopted, the recognition speed is improved. Moreover, although different characters share some feature expressions in the channel space (e.g., channels where the attention maps are strongly activated), the weights of channels carrying discriminative features still differ to a certain extent; this ensures that the global vector can generate robust character feature expressions even under low-quality attention maps (e.g., a lack of attention to shared channel features does not affect the expression of discriminative channel features). The scheme provided by the invention can therefore quickly and accurately recognize characters in natural scenes.
Description
Technical Field
The present invention relates to the field of natural scene text recognition technologies, and in particular, to a natural scene text recognition method, system, device, and storage medium.
Background
Natural scene character recognition is a general character recognition technology that has become a hot research direction in computer vision and document analysis in recent years, and it is widely applied in fields such as automatic driving, license plate recognition, and assistance for visually impaired people. The goal of the task is to convert the text content in an image into editable text.
Because characters in natural scenes have characteristics such as low resolution, complex backgrounds, and susceptibility to noise interference, traditional character recognition technology cannot be applied to natural scenes. Character recognition in natural scenes is therefore of great research significance.
With the development of deep learning in computer vision in recent years, scene character recognition methods have achieved good results. In the text recognition process, as shown in fig. 1, an input image is first encoded into a sequence signal by a CNN (convolutional neural network); the sequence of character information is then decoded through an alignment structure realized by a sequence-to-sequence decoder, which may be an attention-based decoder or a CTC (connectionist temporal classification) based decoder (the characters provided at the top of fig. 1 are examples). However, the sequence-to-sequence alignment structure is complex in design and cannot effectively balance the speed and robustness of the character recognition process, so the speed and accuracy of scene character recognition still need improvement.
Disclosure of Invention
The invention aims to provide a natural scene character recognition method, a system, equipment and a storage medium, which can rapidly and accurately recognize characters of a natural scene.
The invention aims at realizing the following technical scheme:
a natural scene text recognition method comprises the following steps:
step 1, converting a natural scene image to be identified into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
step 2, aggregating the multi-granularity visual feature vectors to obtain global vectors;
step 3, generating a channel attention map of each time step in parallel by using the global vector, obtaining character feature vectors of each time step by combining the global vector, and predicting characters of each time step by using the character feature vectors of each time step.
A natural scene text recognition system, comprising:
the encoder is used for converting the natural scene image to be identified into sequence information and extracting multi-granularity visual feature vectors through the multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
and the vector-to-sequence decoder is used for generating a channel attention map of each time step in parallel by using the global vector, obtaining a character feature vector of each time step by combining the global vector, and predicting the character of each time step by using the character feature vector of each time step.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, images are encoded into a vector space so that local and global multi-granularity semantics are obtained; global vectors are obtained through aggregation; and channel attention maps for different time steps are then generated in parallel so as to decode the character information of the different time steps. Meanwhile, the invention simplifies the sequence-to-sequence decoding mode into a vector-to-sequence decoding mode, and the recognition speed is therefore improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a conventional scene text recognition method provided by the background of the invention;
FIG. 2 is a schematic diagram of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a natural scene text recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a visual channel attention map according to an embodiment of the present invention;
FIG. 6 is a diagram comparing the present invention with the conventional scene text recognition method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram of a natural scene text recognition system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The following describes a natural scene text recognition method, system, equipment and storage medium. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples of the present invention, they are carried out according to conditions conventional in the art or suggested by the manufacturer. The reagents or apparatus used in the examples of the present invention, where no manufacturer is noted, are conventional products available through commercial channels.
Example 1
The embodiment of the present invention provides a natural scene text recognition method; fig. 2 illustrates its main principle. Unlike the existing scene text recognition method illustrated in fig. 1, the encoder of the present invention is an image-to-vector encoder, in which ViT, a visual Transformer network, encodes the input image into multi-granularity visual feature vectors, each carrying local and global multi-granularity semantics (the ViT network here is only an example). A global vector (not shown in fig. 2) is then obtained by aggregation. The global vector is decoded in parallel by a vector-to-sequence decoder to obtain the character information of different time steps (the characters provided at the top of fig. 2 are all examples). In the vector-to-sequence decoder, the present invention uses a channel attention approach to decode the character information of each time step from the global vector. Experimental results show that the method provided by the invention achieves state-of-the-art performance on scene character recognition tasks. The method is described in detail below; fig. 3 shows its main flow and fig. 4 shows its relevant framework. The method mainly includes the following steps:
step 1, converting a natural scene image to be identified into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer transducer module.
In the embodiment of the present invention, step 1 is implemented by an encoder, which may be a ViT network listed above.
As shown in fig. 4, the encoder mainly includes an embedded layer and a multi-layer transducer module. The embedded layer is mainly responsible for converting the natural scene image to be identified into sequence information, for example, the size of the natural scene image to be identified is 128×32, and the natural scene image to be identified is converted into sequence information with the length of 32. The multi-layer transducer module is mainly responsible for extracting multi-granularity visual feature vectors from the sequence information, where N in fig. 4 is the number of layers of the transducer module, for example, n=12 may be set.
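To make the data flow concrete, the following PyTorch sketch shows one way such an image-to-vector encoder could be assembled. It is a minimal sketch under stated assumptions, not the patented implementation: the 16×8 patch size, the channel width of 256, and the head count are choices made here only so that a 128×32 input yields the sequence length of 32 described above.

```python
import torch
import torch.nn as nn

class ImageToVectorEncoder(nn.Module):
    """ViT-style encoder sketch: an embedding layer turns the image into a
    token sequence, then N stacked Transformer layers (N = 12 in the example
    above) extract multi-granularity visual feature vectors."""
    def __init__(self, channels: int = 256, num_layers: int = 12, num_heads: int = 8):
        super().__init__()
        # Patch embedding: (B, 3, 32, 128) -> (B, C, 2, 16), i.e. 32 tokens
        self.patch_embed = nn.Conv2d(3, channels, kernel_size=(16, 8), stride=(16, 8))
        self.pos_embed = nn.Parameter(torch.zeros(1, 32, channels))
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images)        # (B, C, 2, 16)
        x = x.flatten(2).transpose(1, 2)    # (B, 32, C) sequence information
        return self.transformer(x + self.pos_embed)
```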
Step 2, aggregating the multi-granularity visual feature vectors to obtain a global vector.
In the embodiment of the invention, step 2 can be realized by a feature aggregation module; a global average pooling operation can be adopted during aggregation, computing the mean of all visual feature vectors to obtain the global vector. The global vector is then input to a vector-to-sequence decoder for decoding the characters at different time steps.
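As a minimal sketch of this step, assuming the encoder output is a (batch, tokens, channels) tensor, the aggregation reduces to a single mean:

```python
import torch

def aggregate(features: torch.Tensor) -> torch.Tensor:
    # Global average pooling: the mean of all multi-granularity visual
    # feature vectors gives the global vector V, (B, T, C) -> (B, C)
    return features.mean(dim=1)
```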
Step 3, generating a channel attention map of each time step in parallel by using the global vector, obtaining character feature vectors of each time step by combining the global vector, and predicting characters of each time step by using the character feature vectors of each time step.
In the embodiment of the invention, step 3 can be realized by a vector-to-sequence decoder; the main process is as follows:
(1) Generating a channel attention map for each time step in parallel using the global vector: corresponding time embedding information is generated for each time step; the time embedding information of each time step is introduced to the global vector through the first fully-connected layer; and the channel attention map of each time step is then obtained sequentially through the second fully-connected layer, the activation function, and the normalization layer.
The manner in which the channel attention map for a single time step is generated is expressed as:

$$A_t = \mathrm{Softmax}\left(\sigma\left(\phi_2\left(\phi_1(V) + e_t\right)\right)\right)$$

wherein $\phi_1$ represents the first fully-connected layer, $\phi_2$ represents the second fully-connected layer, $e_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, and V represents the global vector; $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
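A hedged PyTorch sketch of this parallel generation follows. The ReLU activation and the exact placement of the time embedding are assumptions reconstructed from the prose description above, since the patent's original drawing of the formula is not reproduced here.

```python
import torch
import torch.nn as nn

class ParallelChannelAttention(nn.Module):
    """Generates the channel attention maps A_t for all M time steps at once
    from the global vector V, following the formula above."""
    def __init__(self, channels: int = 256, max_steps: int = 25):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels)        # first fully-connected layer
        self.fc2 = nn.Linear(channels, channels)        # second fully-connected layer
        self.time_embed = nn.Parameter(torch.zeros(max_steps, channels))  # e_t
        self.act = nn.ReLU()                            # activation (assumed ReLU)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        h = self.fc1(v).unsqueeze(1) + self.time_embed  # inject e_t: (B, M, C)
        a = self.act(self.fc2(h))                       # second FC + activation
        return torch.softmax(a, dim=-1)                 # normalize over channels
```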
(2) The channel attention map of each time step is combined with the global vector to obtain the character feature vector of each time step; specifically, the channel attention map of each time step may be point-wise multiplied with the global vector:

$$F_t = A_t \odot V$$

wherein $F_t$ is the character feature vector of time step t.
(3) The character feature vectors of each time step are classified through a fully-connected classification layer to predict the character of each time step; the prediction is expressed as:

$$y_t = \phi_{cls}\left(F_t\right)$$

wherein $y_t$ is the category to which the predicted character of time step t belongs, and $\phi_{cls}$ is the fully-connected classification layer.
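Continuing the sketch, parts (2) and (3) reduce to a point-wise multiplication and a shared linear classifier applied to every time step; the class count K = 37 (36 case-insensitive alphanumerics plus an end token) is an illustrative assumption, not a value fixed by the patent:

```python
import torch
import torch.nn as nn

classifier = nn.Linear(256, 37)           # fully-connected classification layer

def decode_characters(attn: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    feats = attn * v.unsqueeze(1)         # F_t = A_t * V (point-wise), shape (B, M, C)
    return classifier(feats)              # per-step class scores, shape (B, M, K)
```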
As shown in fig. 4, the parallel channel attention map calculation module in the vector-to-sequence decoder is mainly responsible for executing parts (1) and (2) above, and the fully-connected classification layer is responsible for executing part (3). In fig. 4, 1×C, M×C, and M×K each denote a dimension: for example, 1×C is the dimension of the global vector, where C is the number of channels, M is the maximum number of characters (i.e., the maximum time step), and K is the number of character categories.
Fig. 5 shows an example of a visualized channel attention map. The vector-to-sequence decoder has the property of feature multiplexing, i.e., different character classes share some feature expressions in the channel space (e.g., channels where all attention maps are strongly activated). However, some channel weights carrying discriminative features (i.e., the associated channel features) still differ to a certain extent. Feature reusability thus ensures that the global vector can generate robust character feature vectors even under low-quality attention maps (e.g., a lack of attention to shared channel features does not affect the expression of discriminative channel features).
In the above scheme provided by the embodiment of the present invention, the internal parameters of the encoder and the vector-to-sequence decoder need to be optimized in advance with a loss function, which is expressed as:

$$L = \sum_{t=1}^{M} \mathrm{CE}\left(y_t, g_t\right)$$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $g_t$ is the ground-truth label of the character at time step t (examples of ground-truth labels are provided on the right side of fig. 4), M is the total number of time steps, which is equal to the maximum number of characters (e.g., M = 25 may be set); L is the loss function, and $\mathrm{CE}$ denotes the per-step classification (cross-entropy) loss.
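Assuming the per-step term CE is the standard cross-entropy over character classes and that words shorter than M characters are padded out to M steps, the loss could be written as:

```python
import torch
import torch.nn.functional as F

def recognition_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (B, M, K) per-step predictions y_t; labels: (B, M) ground truth g_t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), reduction='sum') / labels.size(0)
```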
Illustratively, stochastic gradient descent (SGD) may be employed for end-to-end training. At the beginning of training, the learning rate is set to 0.001; it is reduced to 0.0001 after 10 epochs (rounds), and training is completed after 20 epochs in total.
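The stated schedule maps directly onto a standard optimizer plus step decay; the `model` below is only a placeholder for the assembled encoder, aggregation module, and decoder, and the commented-out `train_one_epoch` stands in for an assumed pass over the training batches (the ST and 90K datasets described next):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256))   # placeholder for the full recognizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# learning rate drops from 0.001 to 0.0001 after 10 epochs; 20 epochs in total
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)
for epoch in range(20):
    # train_one_epoch(model, optimizer)      # assumed training routine
    scheduler.step()
```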
For example, existing datasets may be employed for training, such as the ST dataset together with the 90K dataset, wherein: the ST (SynthText) dataset is a synthetic dataset containing 80,000 synthetic images, and the 90K (Synth90K) dataset is another synthetic dataset containing 9 million images; the ST dataset is used for training together with the 90K dataset.
Compared with various existing scene character recognition methods, the method provided by the embodiment of the invention achieves better recognition accuracy. To show the performance of the invention more intuitively, the following description is given in connection with experiments.
1. The performance of the invention on the scene text recognition task, described in connection with the datasets.
The datasets employed in this section include: the IIIT5K, IC13, SVT, IC15, SVTP, and CT datasets.
ICDAR2013 (IC13): the dataset contains 1095 test images; images containing fewer than 3 characters or containing non-alphanumeric characters are discarded in the experiments.
ICDAR2015 (IC15): the dataset provides 500 scene images. After filtering out some extremely distorted images, 1811 cropped text image blocks are retained.
IIIT 5K-Words (IIIT5K): the dataset contains 3000 images collected from websites, all of which are used in the experiments.
Street View Text (SVT): the dataset consists of 647 text image blocks cropped from 250 Google Street View images according to word-level labels.
Street View Text-Perspective (SVTP): the dataset contains 639 images, also cropped from Google Street View images; many of them are severely distorted.
CUTE80 (CT): the dataset is used to evaluate model performance on curved text. It contains 288 cropped text image blocks.
The scheme provided by the invention achieves accuracies of 95.1%, 98.4%, 96.0%, 87.0%, 90.5%, and 89.9% on the IIIT5K, IC13, SVT, IC15, SVTP, and CT datasets, respectively.
2. Comparison with the recognition effect of existing methods.
FIG. 6 shows the comparison between the invention and existing scene text recognition methods. The four columns in fig. 6 are: the first column, 6 natural scene images to be recognized; the second column, the recognition results of the invention; the third column, the recognition results of an attention-based decoder (i.e., the method shown in fig. 1 using an attention-based decoder); and the fourth column, the recognition results of a CTC-based decoder (i.e., the method shown in fig. 1 using a CTC-based decoder). As can be seen from fig. 6, the invention accurately recognizes every character in each image.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
Example two
The invention also provides a natural scene text recognition system, which is mainly realized based on the method provided by the previous embodiment, as shown in fig. 7, and mainly comprises:
the encoder is used for converting the natural scene image to be identified into sequence information and extracting multi-granularity visual feature vectors through the multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
and the vector-to-sequence decoder is used for generating a channel attention map of each time step in parallel by using the global vector, obtaining a character feature vector of each time step by combining the global vector, and predicting the character of each time step by using the character feature vector of each time step.
In the embodiment of the invention, generating the channel attention map of each time step in parallel by using the global vector includes: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step to the global vector through the first fully-connected layer, and sequentially obtaining the channel attention map of each time step through the second fully-connected layer, the activation function, and the normalization layer.
In the embodiment of the present invention, the manner of generating the channel attention map of a single time step is expressed as:

$$A_t = \mathrm{Softmax}\left(\sigma\left(\phi_2\left(\phi_1(V) + e_t\right)\right)\right)$$

wherein $\phi_1$ represents the first fully-connected layer, $\phi_2$ represents the second fully-connected layer, $e_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, and V represents the global vector; $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
In the embodiment of the invention, the internal parameters of the encoder and the vector-to-sequence decoder are optimized in advance by using a loss function, the loss function being expressed as:

$$L = \sum_{t=1}^{M} \mathrm{CE}\left(y_t, g_t\right)$$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $g_t$ is the ground-truth label of the character at time step t, M is the total number of time steps and is equal to the maximum number of characters; L is the loss function, and $\mathrm{CE}$ denotes the per-step classification (cross-entropy) loss.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 8, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (8)
1. A natural scene text recognition method is characterized by comprising the following steps:
step 1, converting a natural scene image to be identified into sequence information, and extracting multi-granularity visual feature vectors through a multi-layer Transformer module;
step 2, aggregating the multi-granularity visual feature vectors to obtain global vectors;
step 3, generating a channel attention map of each time step in parallel by using the global vector, obtaining character feature vectors of each time step by combining the global vector, and predicting characters of each time step by using the character feature vectors of each time step;
generating a channel attention map for each time step in parallel using the global vector includes: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step to the global vector through the first fully-connected layer, and sequentially obtaining the channel attention map of each time step through the second fully-connected layer, the activation function, and the normalization layer.
2. The method of claim 1, wherein the manner of generating the channel attention map of a single time step is expressed as:

$$A_t = \mathrm{Softmax}\left(\sigma\left(\phi_2\left(\phi_1(V) + e_t\right)\right)\right)$$

wherein $\phi_1$ represents the first fully-connected layer, $\phi_2$ represents the second fully-connected layer, $e_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, and V represents the global vector; $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
3. The method of claim 1, wherein step 1 is implemented by an encoder and step 3 is implemented by a vector-to-sequence decoder, and the internal parameters of the encoder and the vector-to-sequence decoder are optimized in advance using a loss function, the loss function being expressed as:

$$L = \sum_{t=1}^{M} \mathrm{CE}\left(y_t, g_t\right)$$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $g_t$ is the ground-truth label of the character at time step t, M is the total number of time steps and is equal to the maximum number of characters; L is the loss function, and $\mathrm{CE}$ denotes the per-step classification (cross-entropy) loss.
4. A natural scene text recognition system, comprising:
the encoder is used for converting the natural scene image to be identified into sequence information and extracting multi-granularity visual feature vectors through the multi-layer Transformer module;
the feature aggregation module is used for aggregating the multi-granularity visual feature vectors to obtain a global vector;
the vector-to-sequence decoder is used for generating channel attention map of each time step in parallel by using the global vector, obtaining character feature vectors of each time step by combining the global vector, and predicting characters of each time step by using the character feature vectors of each time step;
generating a channel attention map for each time step in parallel using the global vector includes: generating corresponding time embedding information for each time step, introducing the time embedding information of each time step to the global vector through the first fully-connected layer, and sequentially obtaining the channel attention map of each time step through the second fully-connected layer, the activation function, and the normalization layer.
5. The natural scene text recognition system of claim 4, wherein the manner of generating the channel attention map of a single time step is expressed as:

$$A_t = \mathrm{Softmax}\left(\sigma\left(\phi_2\left(\phi_1(V) + e_t\right)\right)\right)$$

wherein $\phi_1$ represents the first fully-connected layer, $\phi_2$ represents the second fully-connected layer, $e_t$ represents the time embedding information corresponding to time step t, $A_t$ represents the channel attention map of time step t, and V represents the global vector; $\sigma$ represents the activation function, and $\mathrm{Softmax}$ is the normalized exponential function, i.e., the normalization operation performed by the normalization layer.
6. The natural scene text recognition system of claim 4, wherein the encoder and the vector-to-sequence decoder are each optimized in advance with a loss function, the loss function being expressed as:

$$L = \sum_{t=1}^{M} \mathrm{CE}\left(y_t, g_t\right)$$

wherein $y_t$ is the category to which the predicted character of time step t belongs, $g_t$ is the ground-truth label of the character at time step t, M is the total number of time steps and is equal to the maximum number of characters; L is the loss function, and $\mathrm{CE}$ denotes the per-step classification (cross-entropy) loss.
7. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.
8. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310623773.4A CN116343190B (en) | 2023-05-30 | 2023-05-30 | Natural scene character recognition method, system, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116343190A (en) | 2023-06-27
CN116343190B (en) | 2023-08-29
Family
ID=86879119
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310623773.4A (Active, granted as CN116343190B) | Natural scene character recognition method, system, equipment and storage medium | 2023-05-30 | 2023-05-30
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116343190B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117037136B (en) * | 2023-10-10 | 2024-02-23 | 中国科学技术大学 | Scene text recognition method, system, equipment and storage medium |
CN117912005B (en) * | 2024-03-19 | 2024-07-05 | 中国科学技术大学 | Text recognition method, system, device and medium using single mark decoding |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10565305B2 (en) * | 2016-11-18 | 2020-02-18 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
- 2023-05-30: application CN202310623773.4A filed in China; granted as patent CN116343190B (en), legal status active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112541501A (en) * | 2020-12-18 | 2021-03-23 | 北京中科研究院 | Scene character recognition method based on visual language modeling network |
CN114399757A (en) * | 2022-01-13 | 2022-04-26 | 福州大学 | Natural scene text recognition method and system for multi-path parallel position correlation network |
CN115116066A (en) * | 2022-06-17 | 2022-09-27 | 复旦大学 | Scene text recognition method based on character distance perception |
CN115471851A (en) * | 2022-10-11 | 2022-12-13 | 小语智能信息科技(云南)有限公司 | Burma language image text recognition method and device fused with double attention mechanism |
Non-Patent Citations (1)
Title |
---|
Zhang Chongsheng et al., "Transformer-based character detection algorithm for low-quality scenes," Journal of Beijing University of Posts and Telecommunications, 2022, vol. 45, no. 2, pp. 124-130. *
Also Published As
Publication number | Publication date |
---|---|
CN116343190A (en) | 2023-06-27 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |