CN112819724B - CNN-based scanned document image enhancement method - Google Patents
- Publication number: CN112819724B (application CN202110163992.XA)
- Authority
- CN
- China
- Prior art keywords
- image
- model
- training
- degradation
- ground truth
- Prior art date
- Legal status: Active (the status is an assumption, not a legal conclusion)
Classifications
- G06T5/73 — Image enhancement or restoration: deblurring; sharpening
- G06N3/04 — Neural networks: architecture, e.g. interconnection topology
- G06N3/08 — Neural networks: learning methods
- G06T2207/10024 — Image acquisition modality: color image
- G06T2207/20081 — Special algorithmic details: training; learning
- G06T2207/20084 — Special algorithmic details: artificial neural networks [ANN]
Abstract
A CNN-based scanned document image enhancement method. The invention discloses a method for enhancing document images resampled by scanning, photographing, or similar means, comprising the design of a deep learning model, the generation of training samples, the design of a training method, and data preprocessing. The method addresses the image degradation that arises when traditional paper documents are digitized: it improves the subjective clarity and character legibility of the document and raises the accuracy of optical character recognition (OCR). The method is based on a convolutional neural network trained on a self-built small-sample data set generated with a combined random degradation model; through feature extraction, feature mapping, and image reconstruction, the model improves the subjective visual sharpness of the image as well as the character detection and recognition rates. The proposed model balances enhancement capability, model complexity, training difficulty, and generalization; it can meet industrial real-time processing requirements, and its field of application can be extended by small-scale transfer learning.
Description
Technical Field
The invention provides a CNN-based scanned document image enhancement method, and in particular a deep learning model construction and training method; it belongs to the field of image processing algorithms.
Background
Traditional paper documents are difficult to preserve over the long term because of factors such as the printing process, printing ink, printing substrate, storage method, and storage environment, and long-term preservation consumes considerable resources. Paper documents also have inherent drawbacks for information dissemination, copying, and sharing, and occupy more space than digital storage, so many of them are digitally resampled. Digital media occupy little space, are simple to store, and are cheap to copy and distribute, and are therefore more widely used. Digitizing a traditional paper document requires resampling it by scanning, photographing, or similar means, but the resampling process is often affected by paper damage, wrinkling, fading ink, and other factors, yielding poor resampling results, poor visual quality, and images on which optical character recognition cannot be performed. Enhancing and repairing the resampled result is therefore one way to solve the problem: by constructing a well-performing, highly adaptable model to repair the degraded result, the subjective readability of the document is greatly improved and input with clear features is provided for OCR.
To address these problems, conventional image enhancement methods design a specific algorithm for the document's degradation mode to enhance the expression of the original image information. Because such methods require degradation priors, constructing a degradation model consistent with reality is key to completing the enhancement; in practice, however, document image degradation is influenced by many factors, and it is difficult to build a complete degradation model by hand. Moreover, OCR is a machine-vision technique, and machine vision exploits image features very differently from the human visual system, so improving the subjective human-eye readability of an image does not necessarily improve OCR quality.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a CNN-based scanned document image enhancement method. By reconstructing image details and enhancing character features, the method mainly addresses the low readability and poor OCR performance of digital images obtained by resampling traditional paper documents with a scanner or camera.
The technical scheme of the invention is as follows: a training sample set is built from existing clear resampled scanned documents, a neural network model is constructed with the TensorFlow framework, transfer learning is performed from model parameters pre-trained on a large-scale data set, and the deep learning model with its trained parameters is used to enhance the scanned document image to be processed and output the result.
The method of the invention comprises the following steps:
Step 1, perform color space conversion and normalization on the ground truth, then obtain degraded images with a combined degradation model;
Step 2, construct the deep learning model;
the deep learning model consists of a feature extraction module, a feature nonlinear mapping module, and an image reconstruction module: the feature extraction module consists of several 3×3 convolution layers, the nonlinear mapping module of one 1×1 convolution layer, and the reconstruction module of two 3×3 convolution layers;
Step 3, divide the degraded images and the ground truth into sub-images to form training image pairs;
Step 4, train the deep learning model with the training image pairs;
Step 5, input the image to be processed into the trained deep learning model to obtain the enhanced scanned text image.
Further, color space conversion and normalization of the ground truth in step 1 are implemented as follows.
First, the ground truth is converted from the RGB color space to the YCbCr color space, retaining only the luminance information of the Y channel:
Y = 0.257·x_R + 0.564·x_G + 0.098·x_B + 16 (1)
where x_R, x_G, x_B are the red, green, and blue channel values of the pixel at the corresponding position of the input image and Y is the output pixel value;
the converted grayscale image is then normalized:
x = Y/255 (2)
where Y is the input pixel value and x is the normalized output value.
Further, acquiring the degraded image with the combined degradation model in step 1 is implemented as follows.
A degradation model is generated at random by combined random weighting, and the color-converted, normalized ground truth is passed through the randomly generated degradation model to obtain degraded images for model training, the generation process being:
Z(x) = αB(x) + βG_θ(x) + γx (3)
where Z(x) is the degraded sample generated by the combined random degradation method; B(x) is the bicubic degradation of the image; G_θ(x) is the handwriting-blur degradation obtained with a Gaussian blur of kernel size θ; x is the ground truth; and α, β, γ are weighting coefficients satisfying α+β+γ=1.
Further, θ ∈ {3, 5, 7}.
Further, sub-images are selected from the degraded image as partially overlapping 33×33 patches with a stride of 14, and the central 21×21 pixels are taken at the corresponding position of the ground truth, forming a training image pair.
Further, training uses gradient descent with a mini-batch size of 128; the learning rate is initialized to 5×10⁻⁴ and reduced by 1×10⁻⁴ after every 1×10⁵ mini-batches until the minimum learning rate of 1×10⁻⁴ is reached, at which point training stops and the network parameters are saved; model parameters are initialized with the He method, the optimizer is stochastic gradient descent, and the loss function is the L2 distance.
Furthermore, the feature extraction module of the deep learning model consists of 4 convolution layers, so the model contains 7 convolution layers in total: layers 1 and 2 have 128 convolution kernels each, layers 3 and 4 have 64, layer 5 has 32, and layers 6 and 7 have 1; all seven layers use a stride of 1.
The deep learning model and the training method provided by the invention enhance the definition of the scanned text image and improve the text detection rate and the text recognition rate of OCR.
Compared with the prior art, the invention has the following advantages and beneficial effects: degraded-image enhancement is achieved on a small-sample data set in a self-supervised manner; the required data volume is small, the generated images are of higher quality, adaptation to unknown data is stronger, the model structure is simple, and runtime performance is fast.
Drawings
Fig. 1 is the structure of the deep learning model proposed by the invention.
Fig. 2 is the ground truth used to generate training samples.
Fig. 3 is the result of bicubic degradation of the ground truth.
Fig. 4 is the result of Gaussian-filter degradation of the ground truth.
Fig. 5 is the result of the combined random degradation method applied to the ground truth.
Fig. 6 is a genuinely degraded scanned document image.
Fig. 7 is the result of a degraded text image enhanced by the method of the invention.
Detailed Description
The overall technical scheme of the invention comprises: training sample generation, learning model construction, data preprocessing, model training, and text-enhancement inference. The technical scheme is further described below with reference to the accompanying drawings and examples.
(1) Training sample generation:
First, the ground truth is converted from the RGB color space to the YCbCr color space, retaining only the luminance information of the Y channel:
Y = 0.257·x_R + 0.564·x_G + 0.098·x_B + 16 (1)
where x_R, x_G, x_B are the red, green, and blue channel values of the pixel at the corresponding position of the input image and Y is the output pixel value. Converting the image to a single-channel intensity image with this formula retains sufficient characteristic information of the text document while reducing the complexity of model training; using grayscale images uniformly as model input also facilitates transfer learning of the model.
The converted grayscale image is then normalized:
x = Y/255 (2)
where Y is the input pixel value and x is the normalized output value.
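The conversion and normalization above can be sketched as follows (a minimal NumPy sketch; the divisor 255 in `normalize` is an assumption, since the normalization equation did not survive extraction of the original document):

```python
import numpy as np

def rgb_to_y(img_rgb):
    """Convert an RGB image (H, W, 3, uint8) to the Y channel of YCbCr
    using the coefficients of equation (1)."""
    r = img_rgb[..., 0].astype(np.float64)
    g = img_rgb[..., 1].astype(np.float64)
    b = img_rgb[..., 2].astype(np.float64)
    return 0.257 * r + 0.564 * g + 0.098 * b + 16.0

def normalize(y):
    """Scale Y values toward [0, 1]; the constant 255 is an assumption."""
    return y / 255.0
```

A black pixel maps to Y = 16 and a white pixel to roughly Y ≈ 250.3, consistent with the restricted (studio-swing) luma range implied by the +16 offset in equation (1).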
The color-converted, normalized ground truth is then passed through a degradation model generated at random by combined random weighting, yielding degraded images for model training:
Z(x) = αB(x) + βG_θ(x) + γx (3)
where Z(x) is the degraded sample image generated by the combined random degradation method and serves as the model input during training, as shown in Fig. 5; B(x) is the bicubic degradation of the image, as shown in Fig. 3; G_θ(x) is the handwriting-blur degradation obtained with a Gaussian blur of kernel size θ, as shown in Fig. 4; x is the ground truth, the target output of the model during training, as shown in Fig. 2; and α, β, γ are weighting coefficients satisfying α+β+γ=1. The model is trained in a self-supervised manner on the data set of degraded sample images and ground truth, achieving enhancement of degraded text images. In practice the randomly chosen Gaussian kernel size is restricted to θ ∈ {3, 5, 7}. For each generated sample, α, β, γ, and θ are chosen at random within their ranges, so the combined random degradation model can simulate the different degradation processes encountered in reality and alleviates the shortage of data.
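A sketch of the combined random degradation of equation (3) follows. Two points are assumptions rather than the patent's exact procedure: the bicubic degradation B(x) is approximated by a mild Gaussian blur (no bicubic resampler is implemented here), and the weights α, β, γ are drawn from a Dirichlet distribution merely as one convenient way to get random nonnegative weights summing to 1.

```python
import numpy as np

def gaussian_kernel(size, sigma=None):
    # Normalized 2-D Gaussian kernel; sigma defaults to size/3 (an assumption).
    sigma = sigma or size / 3.0
    ax = np.arange(size) - (size - 1) / 2.0
    k1 = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2 = np.outer(k1, k1)
    return k2 / k2.sum()

def blur(x, size):
    # Same-size convolution with reflected borders.
    k = gaussian_kernel(size)
    pad = size // 2
    xp = np.pad(x, pad, mode="reflect")
    out = np.zeros_like(x, dtype=np.float64)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (xp[i:i + size, j:j + size] * k).sum()
    return out

def degrade(x, rng):
    """Z(x) = a*B(x) + b*G_theta(x) + c*x with a + b + c = 1 and
    theta drawn from {3, 5, 7}, as in equation (3)."""
    theta = rng.choice([3, 5, 7])
    a, b, c = rng.dirichlet(np.ones(3))  # random weights summing to 1
    bx = blur(x, 3)                      # stand-in for bicubic degradation B(x)
    gx = blur(x, theta)                  # Gaussian-blur degradation G_theta(x)
    return a * bx + b * gx + c * x
```

Because all three terms are convex combinations of pixel values, a normalized input in [0, 1] stays in [0, 1] after degradation.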
(2) Building the deep learning model: the model is built on the TensorFlow framework with the following parameters. It consists of a feature extraction module, a feature nonlinear mapping module, and an image reconstruction module: the feature extraction module consists of four 3×3 convolution layers, the nonlinear mapping module of one 1×1 convolution layer, and the reconstruction module of two 3×3 convolution layers. Layers 1 and 2 have 128 convolution kernels each, layers 3 and 4 have 64, layer 5 has 32, and layers 6 and 7 have 1. The input layer is the first convolution layer, the hidden layers are the middle five, and the output layer is the seventh; a leaky ReLU activation function follows the input layer and each hidden layer. No padding is used in any layer; the small image patch fed to the input layer is 33×33 pixels and the output is 21×21 pixels.
TABLE 1 Structure of deep learning model
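The size arithmetic of this architecture can be checked with a short sketch (the layer widths follow the description above; Table 1 itself did not survive extraction, so the tuple list below is a reconstruction from the text): since no padding is used, each k×k convolution with stride 1 shrinks the spatial size by k − 1, which takes a 33×33 input patch to a 21×21 output.

```python
# (kernel size, number of kernels) per layer, reconstructed from the text:
# layers 1-4 = feature extraction (3x3), layer 5 = 1x1 nonlinear mapping,
# layers 6-7 = image reconstruction (3x3); all strides are 1, no padding.
LAYERS = [(3, 128), (3, 128), (3, 64), (3, 64), (1, 32), (3, 1), (3, 1)]

def valid_out_size(n, layers):
    """Spatial size after a stack of 'valid' (no-padding) convolutions."""
    for k, _ in layers:
        n -= k - 1  # each k x k valid convolution loses k - 1 pixels
    return n
```

With these seven layers, `valid_out_size(33, LAYERS)` is 21, matching the 33×33 input / 21×21 output stated in the text.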
(3) Data preprocessing: to reduce the model's memory requirements, the invention uses fixed-size sub-images (mini images) as model input. Because the model uses no upsampling or deconvolution, the image shrinks as it passes through the network, so sub-images are selected from the input images (i.e., the degraded images of step (1)) and the ground truth as follows: from the input image, partially overlapping 33×33 sub-images are cut with a stride of 14, and the central 21×21 pixels are taken from the corresponding position of the ground truth, forming training image pairs.
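The patch-pairing step can be sketched as follows; the 6-pixel offset is the border the no-padding network discards ((33 − 21) / 2):

```python
import numpy as np

def make_pairs(degraded, gt, in_size=33, out_size=21, stride=14):
    """Cut 33x33 inputs from the degraded image with stride 14 and take the
    central 21x21 target from the ground truth at the same position."""
    off = (in_size - out_size) // 2  # 6-pixel border lost by the network
    pairs = []
    h, w = degraded.shape
    for i in range(0, h - in_size + 1, stride):
        for j in range(0, w - in_size + 1, stride):
            x = degraded[i:i + in_size, j:j + in_size]
            y = gt[i + off:i + off + out_size, j + off:j + off + out_size]
            pairs.append((x, y))
    return pairs
```

On a 47×47 image this yields a 2×2 grid of overlapping patches, and each target's top-left pixel sits 6 pixels inside its input patch.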
(4) Model training: the model is trained by gradient descent with a mini-batch size of 128. The learning rate is initialized to 5×10⁻⁴ and reduced by 1×10⁻⁴ after every 1×10⁵ mini-batches until the minimum learning rate of 1×10⁻⁴ is reached, at which point training stops and the network parameters are saved. Model parameters are initialized with the He method, the optimizer is stochastic gradient descent (SGD), and the loss function is the L2 distance. The computing environment is an Intel Core i7-6700HQ CPU @ 2.60 GHz, an NVIDIA 960M GPU, and 16 GB RAM; model building and training are done with TensorFlow 1.3.0.
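The learning-rate schedule described above amounts to a piecewise-constant decay, sketched here as a plain function of the mini-batch counter:

```python
def learning_rate(step, lr0=5e-4, drop=1e-4, every=100_000, lr_min=1e-4):
    """Piecewise-constant schedule: start at 5e-4, subtract 1e-4 after every
    1e5 mini-batches, and floor at 1e-4 (training stops at the floor)."""
    lr = lr0 - drop * (step // every)
    return max(lr, lr_min)
```

The schedule therefore passes through 5×10⁻⁴, 4×10⁻⁴, 3×10⁻⁴, 2×10⁻⁴, and 1×10⁻⁴, i.e. roughly 4×10⁵ mini-batches of decay before training halts.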
(5) Text enhancement: the trainable flag of every parameter in the model is set to False and the stride is changed to 21; the checkpoint file (model parameters) saved during training is loaded, the image to be processed is run through the model as input, and the processed sub-images are obtained. The sub-images are then stitched according to their positions before cropping to produce the complete enhanced output image, i.e., the enhanced scanned text image.
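With an inference stride of 21, the 21×21 output tiles are non-overlapping and the final stitching reduces to placing each tile back on a grid, sketched here (row-major tile order is an assumption):

```python
import numpy as np

def stitch(tiles, grid_hw, tile=21):
    """Reassemble non-overlapping 21x21 output tiles (inference stride 21)
    into the full enhanced image, in row-major order."""
    gh, gw = grid_hw
    out = np.zeros((gh * tile, gw * tile))
    for idx, t in enumerate(tiles):
        i, j = divmod(idx, gw)
        out[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile] = t
    return out
```

For example, four tiles on a 2×2 grid reassemble into a 42×42 image with each tile in its original quadrant.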
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications, additions, or substitutions to the described embodiments without departing from the spirit of the invention or exceeding the scope defined in the accompanying claims.
Claims (6)
1. The CNN-based scanned document image enhancement method is characterized by comprising the following steps of:
Step 1, perform color space conversion and normalization on the ground truth, then obtain degraded images with a combined degradation model;
the acquisition of the degraded image with the combined degradation model in step 1 is implemented as follows:
a degradation model is generated at random by combined random weighting, and the color-converted, normalized ground truth is passed through the randomly generated degradation model to obtain degraded images for model training, the generation process being:
Z(x) = αB(x) + βG_θ(x) + γx (3)
where Z(x) is the degraded sample generated by the combined random degradation method; B(x) is the bicubic degradation of the image; G_θ(x) is the handwriting-blur degradation obtained with a Gaussian blur of kernel size θ; x is the ground truth; and α, β, γ are weighting coefficients satisfying α+β+γ=1;
step 2, construct the deep learning model;
the deep learning model consists of a feature extraction module, a feature nonlinear mapping module, and an image reconstruction module: the feature extraction module consists of several 3×3 convolution layers, the nonlinear mapping module of one 1×1 convolution layer, and the reconstruction module of two 3×3 convolution layers;
step 3, divide the degraded images and the ground truth into sub-images to form training image pairs;
step 4, train the deep learning model with the training image pairs;
step 5, input the image to be processed into the trained deep learning model to obtain the enhanced scanned text image.
2. The CNN-based scanned document image enhancement method according to claim 1, wherein color space conversion and normalization of the ground truth in step 1 are implemented as follows:
first, the ground truth is converted from the RGB color space to the YCbCr color space, retaining only the luminance information of the Y channel:
Y = 0.257·x_R + 0.564·x_G + 0.098·x_B + 16 (1)
where x_R, x_G, x_B are the red, green, and blue channel values of the pixel at the corresponding position of the input image and Y is the output pixel value;
the converted grayscale image is then normalized:
x = Y/255 (2)
where Y is the input pixel value and x is the normalized output value.
3. The CNN-based scanned document image enhancement method according to claim 1, wherein θ ∈ {3, 5, 7}.
4. The CNN-based scanned document image enhancement method according to claim 1, wherein sub-images are selected from the degraded image as partially overlapping 33×33 patches with a stride of 14, and the central 21×21 pixels are taken at the corresponding position of the ground truth, forming a training image pair.
5. The CNN-based scanned document image enhancement method according to claim 1, wherein training uses gradient descent with a mini-batch size of 128; the learning rate is initialized to 5×10⁻⁴ and reduced by 1×10⁻⁴ after every 1×10⁵ mini-batches until the minimum learning rate of 1×10⁻⁴ is reached, at which point training stops and the network parameters are saved; model parameters are initialized with the He method, the optimizer is stochastic gradient descent, and the loss function is the L2 distance.
6. The CNN-based scanned document image enhancement method according to claim 1, wherein the feature extraction module of the deep learning model consists of 4 convolution layers, so the model contains 7 convolution layers in total: layers 1 and 2 have 128 convolution kernels each, layers 3 and 4 have 64, layer 5 has 32, and layers 6 and 7 have 1; all seven layers use a stride of 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110163992.XA CN112819724B (en) | 2021-02-05 | 2021-02-05 | CNN-based scanned document image enhancement method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112819724A CN112819724A (en) | 2021-05-18 |
CN112819724B true CN112819724B (en) | 2024-06-28 |
Family
ID=75861939
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110163992.XA Active CN112819724B (en) | 2021-02-05 | 2021-02-05 | CNN-based scanned document image enhancement method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112819724B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118314048A (en) * | 2024-04-16 | 2024-07-09 | 华南理工大学 | Document image enhancement unification method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416412A (en) * | 2018-01-23 | 2018-08-17 | 浙江瀚镪自动化设备股份有限公司 | A kind of logistics compound key recognition methods based on multitask deep learning |
CN109003253A (en) * | 2017-05-24 | 2018-12-14 | 通用电气公司 | Neural network point cloud generates system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
MX2017017142A (en) * | 2016-12-02 | 2018-11-09 | Avent Inc | System and method for navigation to a target anatomical object in medical imaging-based procedures. |
CN108564030A (en) * | 2018-04-12 | 2018-09-21 | 广州飒特红外股份有限公司 | Classifier training method and apparatus towards vehicle-mounted thermal imaging pedestrian detection |
CN110263610A (en) * | 2019-02-28 | 2019-09-20 | 重庆大学 | A kind of degeneration file and picture binary coding method and system based on deep learning |
CN111862026A (en) * | 2020-07-15 | 2020-10-30 | 南京图格医疗科技有限公司 | Endoscopic medical image deblurring method based on deep learning |
- 2021-02-05: application CN202110163992.XA filed; granted as CN112819724B (active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109003253A (en) * | 2017-05-24 | 2018-12-14 | 通用电气公司 | Neural network point cloud generates system |
CN108416412A (en) * | 2018-01-23 | 2018-08-17 | 浙江瀚镪自动化设备股份有限公司 | A kind of logistics compound key recognition methods based on multitask deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN112819724A (en) | 2021-05-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |