CN111046636B

CN111046636B - Method, device, computer equipment and storage medium for screening PDF file information

Info

Publication number: CN111046636B
Application number: CN201911274586.XA
Authority: CN
Inventors: 邓涛; 肖隆韬; 邓爽
Original assignee: Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Current assignee: Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2024-04-12
Anticipated expiration: 2039-12-12
Also published as: CN111046636A

Abstract

The invention discloses a method for screening PDF file information, comprising the following steps: running a PDF plug-in, wherein the PDF plug-in is used for losslessly converting a PDF file into a text file; obtaining a section of manually written C language code, wherein the C language code is used for supporting a Java virtual machine in calling the PDF plug-in; executing the C language code, and calling the PDF plug-in through the C language code to simulate a user converting a PDF file into a text file; converting the text file into XML data by executing built-in code, and parsing the XML data; performing primary matching on the parsed XML data according to a preset rule to obtain file information after primary matching; storing the file information after primary matching in pages to a processor cache through an optimization algorithm; and matching the character string data input by the user against the file information in the processor cache again to obtain an output result. The user can quickly and accurately match and locate an input character string; the recognition efficiency is high, and the method is mature and stable.

Description

Method, device, computer equipment and storage medium for screening PDF file information

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a computer device, and a storage medium for screening PDF file information.

Background

PDF (Portable Document Format) is a portable document format that is independent of software, hardware, and operating systems, and that retains the original layout of a file across platforms. Owing to advantages such as cross-platform support, multimedia integration, and security, PDF has become one of the most widely used electronic document formats, and a large amount of valuable data is presented in the form of PDF files. The document structure of PDF differs from that of formats such as HTML and XML. A PDF file has no dedicated definition of a table, only a positional combination of lines and text. It contains marks such as keywords, separators, and data, but stores the corresponding information as binary streams. The structure of a PDF file is therefore complex, and extracting data from a PDF file is relatively difficult.

As a structured file format, a PDF document is composed of a number of modules called "objects". The objects are numbered so that they can reference one another and be accessed randomly. Page objects contain the page content (text, pictures and the like) and the information needed to display the page, such as fonts and page size. The whole document is a tree structure in which each object is a node holding a different type of data. PDF parsing is therefore difficult, and at present it is conventionally handled by means of a third-party module.

In daily office work, users often need to input a string of text and check whether it is contained in a contract text stored in PDF format, which requires quick locating and matching. Existing technology offers little to meet this office requirement.

The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

The embodiment of the invention provides a method for screening PDF file information, which aims to solve the problems. The method accurately identifies the data information in the PDF file so as to meet the requirement of rapid screening and positioning of the input character string and the information in the contract text in the PDF format.

In a first aspect, an embodiment of the present invention provides a method for screening PDF file information, including the following steps:

operating a PDF plug-in, wherein the PDF plug-in is used for nondestructively converting a PDF file into a text file;

a section of C language code written manually is obtained, and the C language code is used for supporting a java virtual machine to call the PDF plug-in;

executing the C language code, and calling the PDF plug-in through the C language code to simulate a user to convert a PDF file into a text file;

converting the text file into XML data by executing the built-in code and analyzing the XML data;

primarily matching the parsed XML data through a preset rule to obtain file information after primary matching;

paging and storing the file information after primary matching to a processor cache through an optimization algorithm;

and matching the character string data input by the user with the file information in the processor cache again to obtain an output result.

In a further technical scheme, the step of performing primary matching on the parsed XML data according to a preset rule to obtain file information after primary matching comprises: splitting the XML data into a plurality of data blocks according to preset tags, and connecting the data blocks into tree structure data according to the parent-child relation of the preset tags, with the index relation between adjacent tags also covered.

In a further technical scheme, the step of storing the file information after primary matching in pages to a processor cache through an optimization algorithm comprises: storing the index relation between the data blocks and the tree structure data, and the index relation between adjacent tags, in a Redis cache to facilitate subsequent matching.

In a further technical scheme, the step of matching the character string data input by the user against the file information in the processor cache again to obtain an output result comprises: starting a plurality of threads to perform word segmentation queries on the character string data input by the user, and merging the word segmentation query results.

In a further technical scheme, in the step of running a PDF plug-in used for losslessly converting a PDF file into a text file, the text file comprises a doc file and a docx file.

In a second aspect, the embodiment of the invention also provides a device for screening PDF file information, which comprises the following units:

the plug-in operation unit is used for operating a PDF plug-in, and the PDF plug-in is used for nondestructively converting a PDF file into a text file;

the code acquisition unit is used for acquiring a section of C language code written manually, wherein the C language code is used for supporting a java virtual machine to call the PDF plugin;

the code execution unit is used for executing the C language code, and calling the PDF plug-in through the C language code to simulate a user to convert a PDF file into a text file;

the conversion analysis unit is used for converting the text file into XML data and analyzing the XML data by executing the built-in code;

the primary matching unit is used for carrying out primary matching on the analyzed XML data through a preset rule so as to obtain file information after primary matching;

the optimization storage unit is used for carrying out paging storage on the file information subjected to primary matching through an optimization algorithm and storing the file information into a processor cache;

and the re-matching unit is used for re-matching the character string data input by the user with the file information in the processor cache so as to obtain an output result.

In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method when executing the computer program.

In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the above-described method.

The embodiment of the invention provides a method for screening PDF file information in which the PDF file information is converted, matched and stored in advance through a Java virtual machine of the processor, so that a user can quickly and accurately match and locate an input character string; the recognition efficiency is high, and the method is mature and stable. The screening of PDF file information can be realized with only the PDF plug-in and the C language code, and no additional third-party software or module needs to be installed, so that memory occupancy is small and fewer system threads are occupied. The screened PDF file information is stored only temporarily in a system cache, so that it can be deleted quickly after use, which gives a high degree of confidentiality and security for the content of the PDF file.

The invention is further described below with reference to the drawings and specific embodiments.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for screening PDF file information according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a unit structure of a device for screening PDF file information according to an embodiment of the present invention;

Fig. 3 is a schematic block diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

Referring to fig. 1, as shown in the drawing, a method for screening PDF file information according to the embodiment includes the following steps:

S100, running a PDF plug-in, wherein the PDF plug-in is used for losslessly converting a PDF file into a text file;

S200, obtaining a section of manually written C language code, wherein the C language code is used for supporting a Java virtual machine in calling the PDF plug-in. A virtual machine is an abstract computer implemented by emulating computer functions on an actual computer. The Java virtual machine has its own complete architecture, such as a processor, a stack and registers, together with a corresponding instruction set. It hides the details of the specific operating system platform, so that a Java program only needs to be compiled into object code (bytecode) for the Java virtual machine in order to run unmodified on multiple platforms.

S300, executing the C language code, and calling the PDF plug-in through the C language code to simulate a user converting a PDF file into a text file;

S400, converting the text file into XML data by executing built-in code, and parsing the XML data. The PDF file is thus first converted into an intermediate-format document, which in this embodiment is XML (Extensible Markup Language) data; in other embodiments it may also be converted into HTML (HyperText Markup Language).

S600, performing primary matching on the parsed XML data according to a preset rule to obtain file information after primary matching;

S700, storing the file information after primary matching in pages to a processor cache through an optimization algorithm;

S800, matching the character string data input by the user against the file information in the processor cache again to obtain an output result.

Step S600, performing primary matching on the parsed XML data according to a preset rule to obtain file information after primary matching, comprises splitting the XML data into a plurality of data blocks according to preset tags, and connecting the data blocks into tree structure data according to the parent-child relation of the preset tags, with the index relation between adjacent tags also covered.
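The splitting described in step S600 can be illustrated with a minimal sketch. It rests on assumptions the specification does not state: the tag names in `PRESET_TAGS`, the sample XML, the dictionary layout of a data block, and the reading of "index relation of adjacent tags" as previous and next block in document order are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical preset tags; the specification does not name the actual tags.
PRESET_TAGS = {"section", "clause"}

def build_index(xml_text):
    """Split XML into data blocks and record parent/child and adjacent relations."""
    root = ET.fromstring(xml_text)
    blocks = {}   # block id -> {"tag", "text", "parent", "children"}
    order = []    # block ids in document order

    def visit(elem, parent_id):
        block_id = parent_id
        if elem.tag in PRESET_TAGS:
            # Each element carrying a preset tag becomes one data block.
            block_id = len(order)
            blocks[block_id] = {
                "tag": elem.tag,
                "text": (elem.text or "").strip(),
                "parent": parent_id,
                "children": [],
            }
            if parent_id is not None:
                blocks[parent_id]["children"].append(block_id)
            order.append(block_id)
        for child in elem:
            visit(child, block_id)

    visit(root, None)
    # Adjacent index (assumed meaning): previous and next block in document order.
    adjacent = {
        bid: {
            "prev": order[i - 1] if i > 0 else None,
            "next": order[i + 1] if i + 1 < len(order) else None,
        }
        for i, bid in enumerate(order)
    }
    return blocks, adjacent

SAMPLE = """<contract>
  <section>Payment
    <clause>The buyer pays within 30 days.</clause>
    <clause>Late payment incurs interest.</clause>
  </section>
  <section>Termination
    <clause>Either party may terminate with notice.</clause>
  </section>
</contract>"""

blocks, adjacent = build_index(SAMPLE)
```

In this sketch the parent and children fields carry the tree structure, while the separate adjacent map lets a match in one block be extended to its neighbours without walking the tree.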

Step S700, storing the file information after primary matching in pages to the processor cache through an optimization algorithm, comprises storing the index relation between the data blocks and the tree structure data, and the index relation between adjacent tags, in a Redis cache to facilitate subsequent matching. The Redis cache keeps a copy of the data in memory: as long as the data has not changed, it is fetched from memory rather than looked up in the database. Redis is an open-source, log-type key-value database written in ANSI C; it supports networking, can be memory-based as well as persistent, and provides APIs for multiple languages.
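The paged storage of step S700 might look like the following sketch. A plain Python dictionary stands in for the Redis cache so that the example runs without a server. The key layout (`pdf:<doc>:page:<n>`), the page size, the JSON serialization and the function names are assumptions for illustration; the specification does not describe its optimization algorithm.

```python
import json

def store_paged(cache, doc_id, records, page_size):
    """Write index records to the cache in fixed-size pages; return the page count."""
    pages = [records[i:i + page_size] for i in range(0, len(records), page_size)]
    for number, page in enumerate(pages):
        # One string value per page, under a hypothetical key scheme.
        cache[f"pdf:{doc_id}:page:{number}"] = json.dumps(page)
    cache[f"pdf:{doc_id}:pages"] = str(len(pages))
    return len(pages)

def load_page(cache, doc_id, number):
    """Read one page of index records back from the cache."""
    return json.loads(cache[f"pdf:{doc_id}:page:{number}"])

def drop_document(cache, doc_id):
    """Remove every page of a document, e.g. once screening is finished."""
    count = int(cache.pop(f"pdf:{doc_id}:pages", "0"))
    for number in range(count):
        cache.pop(f"pdf:{doc_id}:page:{number}", None)

# Hypothetical index records: block id, parent block, adjacent blocks, text.
records = [
    {"id": i, "parent": None, "prev": i - 1 if i else None,
     "next": i + 1 if i < 4 else None, "text": f"block {i}"}
    for i in range(5)
]

cache = {}  # stand-in for a Redis connection
page_count = store_paged(cache, "contract-1", records, page_size=2)
last_page = load_page(cache, "contract-1", 2)
```

With a real Redis server, the dictionary reads and writes would map onto string SET and GET commands. Keeping each page under its own key means a query only has to load the pages it needs, and `drop_document` reflects the point made elsewhere in the description that cached file information can be deleted quickly after use.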

Step S800, matching the character string data input by the user against the file information in the processor cache again to obtain an output result, comprises starting a plurality of threads to perform word segmentation queries on the character string data input by the user, and merging the word segmentation query results.
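The multi-threaded word segmentation query of step S800 can be sketched as follows. The whitespace segmenter, the substring test, and the merge rule (rank blocks by the number of distinct query terms they contain) are assumptions chosen for illustration; the specification does not fix any of them, and Chinese text would need a proper word segmenter.

```python
from concurrent.futures import ThreadPoolExecutor

def segment(query):
    """Stand-in segmenter: lowercase, split on whitespace, drop duplicate terms."""
    return list(dict.fromkeys(term.lower() for term in query.split()))

def find_term(term, blocks):
    """Return the ids of all blocks whose text contains the term."""
    return {bid for bid, text in blocks.items() if term in text.lower()}

def search(query, blocks, max_workers=4):
    """Query each term on its own thread, then merge the per-term results."""
    terms = segment(query)
    if not terms:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hit_sets = list(pool.map(lambda term: find_term(term, blocks), terms))
    scores = {}
    for hits in hit_sets:
        for bid in hits:
            scores[bid] = scores.get(bid, 0) + 1
    # Blocks matching more terms come first; ties are broken by block id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical cached block texts, keyed by block id.
BLOCKS = {
    0: "Payment",
    1: "The buyer pays within 30 days.",
    2: "Late payment incurs interest.",
    3: "Either party may terminate with notice.",
}

results = search("late payment", BLOCKS)
```

Here `results` lists block 2 first because it contains both terms, followed by block 0, which contains only one.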

In step S100, running a PDF plug-in used for losslessly converting a PDF file into a text file, the text file comprises a doc file and a docx file. The reason is that, compared with other text file types, doc and docx files preserve the most content without damage when converted from a PDF file.
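The specification does not say how the built-in code of step S400 turns the text file into XML. For the docx case, one possible route is sketched below, relying on the fact that a docx file is a ZIP package whose main part, `word/document.xml`, is already XML. The function names and the stripped-down sample package are hypothetical, and the legacy binary doc format is not covered.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by docx documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(docx_bytes):
    """Read word/document.xml from a docx package and return paragraph texts."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t run elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs

def make_sample_docx(lines):
    """Build a stripped-down stand-in package holding only the main document part.

    The lines must not contain XML special characters such as < or &.
    """
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    document = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()

sample = make_sample_docx(["Clause 1: payment terms", "Clause 2: termination"])
texts = docx_paragraphs(sample)
```

A file produced by a real converter contains further parts (styles, relationships, content types), which this sketch simply ignores.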

Referring to Fig. 2, the present invention further provides a device for screening PDF file information, corresponding to the above method for screening PDF file information. The device for screening PDF file information includes units for performing the above method for screening PDF file information, and may be configured in a desktop computer, a tablet computer, a laptop computer, etc. Specifically, the device for screening PDF file information comprises the following units:

a plug-in operation unit 100, configured to operate a PDF plug-in, where the PDF plug-in is configured to convert a PDF file into a text file in a lossless manner;

the code obtaining unit 200 is configured to obtain a section of manually written C language code, where the C language code is used to support the java virtual machine to call the PDF plug-in;

a code execution unit 300 for executing the C language code, and calling the PDF plug-in through the C language code to simulate the user to convert the PDF file into a text file;

a conversion parsing unit 400 for converting the text file into XML data and parsing by executing the built-in code;

a primary matching unit 600, configured to perform primary matching on the parsed XML data according to a preset rule, so as to obtain file information after primary matching;

the optimization storage unit 700 is configured to perform paging storage on the file information after the initial matching through an optimization algorithm, and store the file information to a processor cache;

and a re-matching unit 800, configured to re-match the character string data input by the user with the file information in the processor cache, so as to obtain an output result.

According to the method for screening PDF file information, the PDF file information is converted, matched and stored in advance through a Java virtual machine of the processor, so that a user can quickly and accurately match and locate an input character string; the recognition efficiency is high, and the method is mature and stable. The screening of PDF file information can be realized with only the PDF plug-in and the C language code, and no additional third-party software or module needs to be installed, so that memory occupancy is small and fewer system threads are occupied. The screened PDF file information is stored only temporarily in a system cache, so that it can be deleted quickly after use, which gives a high degree of confidentiality and security for the content of the PDF file.

The above-described means for screening PDF file information may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 3.

Referring to fig. 3, fig. 3 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a terminal or a server, where the terminal may be an electronic device with a communication function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device. The server may be an independent server or a server cluster formed by a plurality of servers.

With reference to Fig. 3, the computer device 500 includes a processor 520, a memory, and a network interface 550 coupled by a system bus 510, where the memory may include a non-volatile storage medium 530 and an internal memory 540.

The non-volatile storage medium 530 may store an operating system 531 and a computer program 532. The computer program 532 includes program instructions that, when executed, cause the processor 520 to perform the method for screening PDF file information.

The processor 520 is used to provide computing and control capabilities to support the operation of the overall computer device 500.

The internal memory 540 provides an environment for the execution of the computer program 532 in the non-volatile storage medium 530; when executed by the processor 520, the computer program 532 causes the processor 520 to perform the method for screening PDF file information.

The network interface 550 is used for network communication with other devices. Those skilled in the art will appreciate that the structures shown in FIG. 3 are block diagrams only and do not constitute a limitation of the computer device 500 to which the present teachings apply, and that a particular computer device 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

It should be appreciated that in embodiments of the present application, the processor 520 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.

Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program, wherein the computer program includes program instructions.

The storage medium may be a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, or other various computer-readable storage media that can store program codes.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.

The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A method for screening PDF file information, comprising the steps of:

acquiring a C language code, wherein the C language code is used for supporting a java virtual machine to call the PDF plug-in;

paging and storing the file information after primary matching to a processor cache;

matching the character string data input by the user with file information in the processor cache again to obtain an output result;

the step of primarily matching the parsed XML data through a preset rule to obtain file information after primary matching comprises the steps of splitting the XML data into a plurality of data blocks according to preset labels, connecting the data blocks into tree structure data according to the child-parent level relation of the preset labels, and covering the index relation of adjacent labels;

the step of storing the file information after the primary matching to the processor cache in a paging way comprises the step of storing the index relation between the data block and the tree structure data and the index relation between adjacent tags into a Redis cache so as to facilitate the subsequent matching;

the step of re-matching the character string data input by the user with the file information in the processor cache to obtain an output result comprises the steps of starting a plurality of threads to perform word segmentation query according to the character string data input by the user, and combining the word segmentation query results;

the text file includes a doc file or a docx file.

2. An apparatus for screening PDF file information, which performs the method of claim 1 at run-time, comprising the following units:

the code acquisition unit is used for acquiring a C language code, wherein the C language code is used for supporting a java virtual machine to call the PDF plugin;

the optimization storage unit is used for carrying out paging storage on the file information after the primary matching to the processor cache;

3. A computer device comprising a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method of claim 1 when executing the computer program.

4. A storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the method of claim 1.