CN108664471A - Text region error correction method, device, equipment and computer readable storage medium - Google Patents

Text region error correction method, device, equipment and computer readable storage medium

Info

Publication number
CN108664471A
Authority
CN
China
Prior art keywords
file
error correction
phrase
target
editable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810430989.8A
Other languages
Chinese (zh)
Other versions
CN108664471B (en
Inventor
张远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyin Technology Co ltd
Shenzhen Lian Intellectual Property Service Center
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN201810430989.8A priority Critical patent/CN108664471B/en
Publication of CN108664471A publication Critical patent/CN108664471A/en
Application granted granted Critical
Publication of CN108664471B publication Critical patent/CN108664471B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a text recognition error correction method, apparatus, device and computer-readable storage medium. The method comprises: when a file to be corrected is received, reading the extension of the file and determining the attribute of the file according to the extension; judging whether the attribute of the file is read-only and, if so, performing attribute conversion on the file to generate an editable file; reading multiple keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group; and, according to a preset mapping between file types of editable files and error correction libraries, determining the target error correction library corresponding to the target file type and calling it to correct the editable file. Because a different error correction library is set for each file type and correction uses the library that matches the file's type, correction is more accurate and more efficient.

Description

Text region error correction method, device, equipment and computer readable storage medium
Technical field
The invention relates mainly to the field of intelligent recognition technology and, in particular, to a text recognition error correction method, apparatus, device and computer-readable storage medium.
Background
At present, many scenarios require the text in non-editable files (such as PDFs and pictures) to be recognized and converted into editable files. Similar-looking characters can be hard to tell apart during recognition, so the converted file may contain erroneous words. There is currently no mechanism for identifying erroneous words after conversion, and no mechanism for correcting them. Likewise, there is no mechanism for identifying and correcting erroneous words in manually edited files; they can only be found by manual inspection, which is time-consuming and laborious.
Summary of the invention
The main object of the present invention is to provide a text recognition error correction method, apparatus, device and computer-readable storage medium, so as to solve the problem that the prior art has no mechanism for identifying and correcting erroneous words in files.
To achieve the above object, the present invention provides a text recognition error correction method, which includes the following steps:
when a file to be corrected is received, reading the extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
judging whether the attribute of the file to be corrected is read-only and, if so, performing attribute conversion on the file to be corrected to generate an editable file;
reading multiple keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
determining, according to a preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and calling the target error correction library to correct the editable file.
Preferably, the step of calling the target error correction library to correct the editable file includes:
identifying at least one sentence in the editable file, detecting the conjunctions in each identified sentence, and dividing each sentence into multiple phrases to be identified according to the conjunctions;
comparing each phrase to be identified with the preset phrases in the target error correction library one by one, and judging whether the target error correction library contains a preset phrase consistent with the phrase to be identified;
if the target error correction library contains no preset phrase consistent with the phrase to be identified, obtaining the target preset phrase in the library that has the highest similarity to the phrase to be identified, and replacing the phrase to be identified with the target preset phrase.
Preferably, the step of replacing the phrase to be identified with the target preset phrase includes:
obtaining the phrases to be identified that are adjacent to the current phrase to be identified, forming a sentence to be identified from the adjacent phrases and the target preset phrase, and judging from that sentence whether the target preset phrase matches the semantic context of the editable file;
if the target preset phrase matches the editable file, replacing the phrase to be identified with the target preset phrase.
Preferably, the step of determining the target file type of the editable file according to the keyword group includes:
comparing the keyword group with a preset keyword group library and determining the target keyword group in the library, where the target keyword group is the one whose elements have the highest match rate with the keyword group;
determining, according to the mapping between keyword groups and file types in the preset keyword group library, the file type corresponding to the target keyword group, and taking that file type as the target file type of the editable file.
Preferably, the step of performing attribute conversion on the file to be corrected to generate an editable file includes:
scanning the file to be corrected, and determining the titles and paragraphs in the file according to the size relationship and spacing relationship between its characters;
scanning the characters in the titles and paragraphs one by one, recognizing the scanned characters against a preset character library, and adding a title identifier to the recognized title text;
transferring the recognized title text and paragraph text to a preset editor to generate the editable file.
Preferably, the step of reading multiple keywords in the editable file to form a keyword group includes:
reading the phrases in the editable file, counting how often each phrase occurs, and taking the phrases whose frequency exceeds a preset value as keywords;
obtaining the phrases in the title according to the title identifier, and forming the keyword group from the title phrases together with the keywords.
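The keyword-group step above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `build_keyword_group`, the `min_count` threshold and the assumption that phrases arrive already tokenized are all hypothetical.

```python
from collections import Counter


def build_keyword_group(body_phrases, title_phrases, min_count=2):
    """Combine title phrases with body phrases that occur more than min_count times.

    min_count stands in for the "preset value" in the text; its value here is
    an assumption made for illustration.
    """
    counts = Counter(body_phrases)
    # Keep only phrases whose frequency is strictly greater than the threshold.
    frequent = [phrase for phrase, n in counts.items() if n > min_count]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys([*title_phrases, *frequent]))
```

Title phrases are placed first so that they survive even when they are rare in the body, which matches the text's treatment of titles as a separate keyword source.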
Preferably, after the step of calling the target error correction library to correct the editable file, the method includes:
outputting the corrected editable file and, upon receiving a correction operation on the output editable file, transferring the correction word corresponding to the operation to the target error correction library so as to update the library.
In addition, to achieve the above object, the present invention also proposes a text recognition error correction apparatus, which includes:
a reading module, configured to read, when a file to be corrected is received, the extension of the file to be corrected, and to determine the attribute of the file to be corrected according to the extension;
a judging module, configured to judge whether the attribute of the file to be corrected is read-only and, if so, to perform attribute conversion on the file to be corrected to generate an editable file;
a determining module, configured to read multiple keywords in the editable file to form a keyword group, and to determine the target file type of the editable file according to the keyword group;
a correction module, configured to determine, according to a preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and to call the target error correction library to correct the editable file.
In addition, to achieve the above object, the present invention also proposes a text recognition error correction device, which includes: a memory, a processor, a communication bus and a text recognition error correction program stored in the memory;
the communication bus is used to realize the connection and communication between the processor and the memory;
the processor is used to execute the text recognition error correction program so as to implement the following steps:
when a file to be corrected is received, reading the extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
judging whether the attribute of the file to be corrected is read-only and, if so, performing attribute conversion on the file to be corrected to generate an editable file;
reading multiple keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
determining, according to a preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and calling the target error correction library to correct the editable file.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores one or more programs, which can be executed by one or more processors to:
when a file to be corrected is received, read the extension of the file to be corrected, and determine the attribute of the file to be corrected according to the extension;
judge whether the attribute of the file to be corrected is read-only and, if so, perform attribute conversion on the file to be corrected to generate an editable file;
read multiple keywords in the editable file to form a keyword group, and determine the target file type of the editable file according to the keyword group;
determine, according to a preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and call the target error correction library to correct the editable file.
In the text recognition error correction method of this embodiment, when a file to be corrected is received, its extension is read and its attribute is determined from the extension; it is judged whether the attribute is read-only and, if so, attribute conversion is performed to generate an editable file; multiple keywords in the editable file are read to form a keyword group, from which the target file type of the editable file is determined; and, according to the preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type is determined and called to correct the editable file. The scheme can identify and correct errors in both read-only and non-read-only files. A read-only file is first converted into an editable file; its file type is then determined from the keyword group in the editable file, and the error correction library corresponding to that type is called. Because different file types belong to different industries with their own specific phrases, a separate error correction library is set for each file type. Correcting with the library that matches the file type makes correction more accurate, avoids manual correction, and improves efficiency.
Description of the drawings
Fig. 1 is a flow diagram of the first embodiment of the text recognition error correction method of the present invention;
Fig. 2 is a flow diagram of the second embodiment of the text recognition error correction method of the present invention;
Fig. 3 is a functional block diagram of the first embodiment of the text recognition error correction apparatus of the present invention;
Fig. 4 is a schematic diagram of the device structure of the hardware operating environment involved in the method of the embodiments of the present invention.
The realization of the objects, the functional features and the advantages of the present invention are further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The present invention provides a text recognition error correction method.
Referring to Fig. 1, Fig. 1 is a flow diagram of the first embodiment of the text recognition error correction method of the present invention. In this embodiment, the text recognition error correction method includes:
Step S10, when a file to be corrected is received, reading the extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
The text recognition error correction method of the present invention is applied to a system server and is suitable for identifying and correcting erroneous words in electronic documents. An electronic document may be a read-only file such as a PDF or a picture, or an editable file such as a Word or Excel file; any such document that needs correction is treated as a file to be corrected. Because a read-only file cannot be modified while an editable file can, identifying and correcting the erroneous words in them must take this difference into account. Different types of file have different extensions, so when a file to be corrected is received, its extension is read and the attribute of the file is determined from it. The attribute indicates whether the file to be corrected is a read-only file or an editable file. To determine the attribute from the extension, a read-only extension library and an editable extension library are set in advance. The read-only extension library contains the extensions of the various read-only file types, for example {pdf, jpg, png, bmp}; the editable extension library contains the extensions of the various editable file types, for example {doc, txt, xls, ppt}. After the extension of the file to be corrected is read, it is checked against both libraries. If the extension is in the read-only extension library, the file to be corrected is a read-only file; if it is in the editable extension library, the file to be corrected is an editable file.
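The extension lookup in step S10 can be sketched as follows. This is only a minimal illustration: the two sets reuse the example extensions given in the text, while the function name `file_attribute` and the "unknown" fallback are assumptions, since the text does not say how unlisted extensions are handled.

```python
from pathlib import Path

# Example extension libraries taken from the text; a real system would extend them.
READ_ONLY_EXTENSIONS = {"pdf", "jpg", "png", "bmp"}
EDITABLE_EXTENSIONS = {"doc", "txt", "xls", "ppt"}


def file_attribute(filename: str) -> str:
    """Return "read-only", "editable" or "unknown" for a file name."""
    # Path.suffix yields e.g. ".PDF"; drop the dot and normalise the case.
    extension = Path(filename).suffix.lstrip(".").lower()
    if extension in READ_ONLY_EXTENSIONS:
        return "read-only"
    if extension in EDITABLE_EXTENSIONS:
        return "editable"
    # Assumption: extensions in neither library are reported rather than guessed.
    return "unknown"
```

For example, `file_attribute("contract.PDF")` classifies the file as read-only, so it would go through attribute conversion before any correction.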
Step S20, judging whether the attribute of the file to be corrected is read-only and, if so, performing attribute conversion on the file to be corrected to generate an editable file;
Further, because a read-only file cannot be edited, erroneous words identified in it cannot be corrected directly; the file must first be converted into an editable file. After the attribute of the file to be corrected has been determined, it is judged whether the attribute is read-only. If it is, attribute conversion is performed on the file, turning the read-only file to be corrected into an editable one. During conversion, the characters in the file are recognized against a character library, and the recognized characters are transferred to a text editor to generate the editable file. If the attribute is not read-only, that is, the file is already editable, no attribute conversion is needed, and erroneous-word recognition and correction are performed on the editable file directly.
Step S30, reading multiple keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
It will be understood that files in different industry fields have their own characteristic phrases, such as "prosecution", "appeal", "defendant" and "plaintiff" in the legal field, or "bond", "financing", "savings" and "loan" in the financial industry. For erroneous-word recognition, such phrases and sentences are collected into error correction libraries, and recognition is performed against a library. If files of all industries and types were checked against one shared error correction library, that library would contain a very large number of phrases and sentences, which would add noise to the file being checked and reduce recognition efficiency. To recognize files from different industry fields in a targeted way, this embodiment classifies file types by industry field and sets a corresponding error correction library for each file type; a file of a given industry type is corrected with that industry's library, which improves correction efficiency. After the editable file is generated, the file type it belongs to must be determined so that the corresponding error correction library can be used. Because file types are distinguished by industry field, a file belonging to a given field carries keywords related to that field, such as "prosecution", "appeal", "defendant" and "plaintiff" in the legal field above. The multiple keywords carried in the file are therefore read to form a keyword group, the industry field of the file is determined from this keyword group, and the target file type of the editable file is determined in turn.
Step S40, determining, according to a preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and calling the target error correction library to correct the editable file.
Further, because a corresponding error correction library is set for each type of editable file, a mapping between the file types of editable files and the error correction libraries is set in advance, with each file type corresponding to one error correction library. The preset mapping may take the form of key-value pairs, with the file type as the key and the error correction library as the value. After the target file type of the editable file is determined, the target error correction library is found by looking up the value whose key is the target file type, and it is called to correct the editable file. Specifically, the correction includes the following steps:
Step S41, identifying at least one sentence in the editable file, detecting the conjunctions in each identified sentence, and dividing each sentence into multiple phrases to be identified according to the conjunctions;
It will be understood that an editable file contains multiple sentences. When the editable file is checked for errors, at least one sentence in it is identified first: various punctuation marks are set as identification markers, the markers in the editable file are detected, and the content between two markers is taken as one sentence of the editable file. For example, suppose the set of identification markers includes the comma, the full stop, the semicolon, quotation marks and the colon. When a comma is detected in the editable file, detection continues until the next marker of any kind is found; the content between that marker and the comma is a sentence of the editable file. After at least one sentence has been identified, the content of each sentence is divided further: dividing conjunctions are set, the conjunctions in each identified sentence are detected, and the sentence is divided into multiple phrases to be identified according to them. The conjunctions include, but are not limited to: and, with, both, together with, as well as, while, then, thus, as for, in addition, such as, if, although, but, however, only, because, since, by, or, and unless. When a conjunction is detected in an identified sentence, detection continues until the next conjunction in the sentence is found; the phrase between the two conjunctions is a phrase to be identified. If no further conjunction is found, that is, the sentence contains only one conjunction, the sentence is divided into two phrases to be identified, which are then checked in the following steps.
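The two-level division described in step S41 can be sketched as below. This is a rough illustration under stated assumptions: the marker set and the English connective list are stand-ins chosen for the example, and the names `split_sentences` and `split_phrases` are hypothetical. The source text works on Chinese, where word boundaries would need different handling.

```python
import re

# Assumed identification markers: comma, full stop, semicolon, colon, curly quotes.
SENTENCE_MARKS = ",.;:“”"
# Assumed English stand-ins for the dividing conjunctions.
CONNECTIVES = ("and", "but", "because", "of", "by")

_SENTENCE_RE = re.compile("[" + re.escape(SENTENCE_MARKS) + "]+")
_CONNECTIVE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in CONNECTIVES) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Treat the content between two identification markers as one sentence."""
    return [part.strip() for part in _SENTENCE_RE.split(text) if part.strip()]


def split_phrases(sentence):
    """Divide a sentence into phrases to be identified at each connective."""
    return [part.strip() for part in _CONNECTIVE_RE.split(sentence) if part.strip()]
```

A sentence with a single connective yields two phrases, which agrees with the single-conjunction case described in the text. The word-boundary anchors keep a connective such as "by" from matching inside a longer word.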
Step S42, comparing each phrase to be identified with the preset phrases in the target error correction library one by one, and judging whether the target error correction library contains a preset phrase consistent with the phrase to be identified;
Further, the target error correction library holds multiple preset phrases corresponding to the industry field of the editable file's type. After the editable file has been divided into phrases to be identified, each phrase is compared with the preset phrases to judge whether the target error correction library contains a preset phrase that corresponds to it. Because the preset phrases are the accurate vocabulary of the industry field to which this kind of file belongs, a phrase to be identified that has a corresponding preset phrase is correct and needs no correction.
Step S43, if the target error correction library contains no preset phrase consistent with the phrase to be identified, obtaining the target preset phrase in the target error correction library that has the highest similarity to the phrase to be identified, and replacing the phrase to be identified with the target preset phrase.
If the target error correction library contains no preset phrase consistent with the phrase to be identified, the phrase may contain an erroneous word and needs to be corrected. For the correction, the target preset phrase with the highest similarity to the phrase to be identified is obtained from the target error correction library. Similarity covers two aspects, shape (glyph) similarity and semantic similarity: shape similarity indicates the written form that the phrase to be identified most probably has, and semantic similarity indicates the meaning it most probably has in context. When a preset phrase has the highest similarity to the phrase to be identified in both shape and meaning, it is most likely the correct form of that phrase, so the phrase to be identified can be replaced with the target preset phrase. Before the replacement, the semantic match of the target preset phrase must be confirmed; the replacement is made only if the semantics match. The specific steps are:
Step S431, obtaining the phrases to be identified that are adjacent to the current phrase to be identified, forming a sentence to be identified from the adjacent phrases and the target preset phrase, and judging from that sentence whether the target preset phrase matches the semantic context of the editable file;
It will be understood that files of the same type express a consistent semantic context: the sentence formed by the current phrase to be identified and the phrases adjacent to it should be consistent with the semantic context of the editable file. The phrases to be identified immediately before and after the current one are obtained. Because the phrases were divided at conjunctions, the target preset phrase is placed next to the adjacent phrases and the dividing conjunction is added back to form the sentence to be identified. Whether the target preset phrase matches the semantic context of the editable file is then judged from the consistency between this sentence and that context. Take a sentence meaning "the legitimate rights and interests of the company are protected by law" in which the word for "law" has been misrecognized (rendered in this translation as "method Lu"). The dividing conjunctions corresponding to "of" and "by" are identified, and the phrases to be identified are "company", "legitimate rights and interests" and "method Lu protection". No preset phrase corresponds to the current phrase "method Lu protection", so the target preset phrase with the highest shape similarity, "legal protection", is obtained from the target error correction library. To check the semantic match of this target preset phrase, the adjacent phrase "legitimate rights and interests" is obtained, the target preset phrase "legal protection" is joined to it with the dividing conjunction "by", and the sentence to be identified "legitimate rights and interests are protected by law" is formed. Whether the target preset phrase "legal protection" matches the semantic context of the editable file is judged from the consistency between this sentence and that context.
Step S432, if the target preset phrase matches the editable file, replacing the phrase to be identified with the target preset phrase.
If the sentence to be identified is consistent with the semantic context of the editable file, the target preset phrase matches the editable file, and the phrase to be identified is replaced with it, so that a phrase containing an erroneous word is corrected to the target preset phrase. Determining the target preset phrase from both shape and semantics ensures that the replacement phrase is correct.
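Steps S42 to S432 can be sketched together as follows. This is a simplified illustration, not the patent's method: character-level similarity from Python's `difflib.SequenceMatcher` stands in for the shape similarity described above, and the semantic check is an injected callable (`context_ok`, hypothetical) because the text does not specify how semantic consistency is computed. The names `most_similar` and `correct_phrase` are also assumptions.

```python
from difflib import SequenceMatcher


def most_similar(phrase, library):
    """Return the preset phrase with the highest character-level similarity."""
    return max(library, key=lambda preset: SequenceMatcher(None, phrase, preset).ratio())


def correct_phrase(phrase, library, neighbours=(), context_ok=lambda sentence: True):
    """Correct one phrase against a list of preset phrases.

    neighbours: adjacent phrases used to build the trial sentence (step S431).
    context_ok: hypothetical semantic check; the default accepts everything.
    """
    # Step S42: a phrase that already appears in the library needs no correction.
    if not library or phrase in library:
        return phrase
    # Step S43: pick the closest preset phrase as the replacement candidate.
    candidate = most_similar(phrase, library)
    # Steps S431/S432: replace only if the trial sentence fits the context.
    trial_sentence = " ".join([*neighbours, candidate])
    return candidate if context_ok(trial_sentence) else phrase
```

Keeping the semantic check as a parameter makes the gate explicit: a candidate that is close in spelling is still rejected when the surrounding phrases do not support it.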
In the character recognition error correction method of this embodiment, when a file to be corrected is received, its extension is read and the attribute of the file is determined from the extension. Whether the attribute of the file to be corrected is read-only is then judged; if it is, attribute conversion is performed on the file to generate an editable file. Multiple keywords in the editable file are read to form a keyword group, and the target file type of the editable file is determined from the keyword group. According to the preset mapping between file types and error correction libraries, the target error correction library corresponding to the target file type is determined and called to correct the editable file. This scheme can recognize and correct both read-only and non-read-only files: a read-only file to be corrected is first converted into an editable file, its file type is determined from the keyword group in the editable file, and the target error correction library corresponding to that file type is called for correction. Because different file types belong to different industries with their own specific phrases, a different error correction library is set for each file type. Correcting with the target error correction library that corresponds to the file type makes the correction more accurate, avoids manual correction, and improves correction efficiency.
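The first and last steps of this flow (deciding the attribute from the extension, and choosing a library from the file type) can be sketched as below. The extension list and the contents of the libraries are assumptions made for illustration; this section does not list which extensions count as read-only or what the libraries contain.

```python
from pathlib import Path

# Assumption: these extensions are treated as read-only; the source does not list them here.
READ_ONLY_EXTENSIONS = {".pdf", ".jpg", ".png"}

# Hypothetical preset mapping from file type to error correction library
# (each library is modelled as a set of preset phrases).
LIBRARY_BY_FILE_TYPE = {
    "legal": {"legal protection", "labour contract"},
    "patent": {"claims", "specification"},
}


def is_read_only(filename: str) -> bool:
    """Decide the file attribute from its extension, case-insensitively."""
    return Path(filename).suffix.lower() in READ_ONLY_EXTENSIONS


def select_library(file_type: str) -> set:
    """Look up the target error correction library for a target file type."""
    return LIBRARY_BY_FILE_TYPE[file_type]
```

A file such as `scan.PDF` would be routed through attribute conversion before correction, while `notes.docx` would go straight to keyword extraction.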
Further, in another embodiment of the character recognition error correction method of the present invention, the step of determining the target file type of the editable file according to the keyword group includes:
Step S31, comparing the keyword group with a preset keyword group library, and determining a target keyword group in the preset keyword group library, wherein the target keyword group has the highest element match rate with the keyword group;
Understandably, files in different industry fields have different keywords. To determine the target file type of the editable file from the keyword group, this embodiment provides a preset keyword group library, which is set in advance and contains keyword groups for multiple industry fields. For example, the preset keyword group library [A, B, C] contains three keyword groups A, B and C, where keyword group A contains keywords a1, a2, a3, b1 and c1, keyword group B contains keywords a1, b1, b2, b3 and c1, and keyword group C contains keywords a1, b1, c1, c2 and c3. After the keyword group has been formed from the multiple keywords, it is compared with the preset keyword group library to determine the target keyword group, that is, the group in the library with the highest element match rate with the keyword group. In practice, the highest element match rate means the largest number of matching keywords between the two. Each keyword in the keyword group is compared with the keywords of each keyword group in the preset keyword group library, and the group in the library that shares the most keywords is determined. For example, if the keywords forming the keyword group are a1, a2, b1 and d1, keyword group A shares the most keywords with it, so keyword group A is determined as the target keyword group, having the highest element match rate with the keyword group of the editable file.
Step S32, according to the mapping between keyword groups and file types in the preset keyword group library, determining the file type corresponding to the target keyword group, and taking that file type as the target file type of the editable file.
Further, the preset keyword group library records the mapping between each keyword group and a file type. In the library above, which contains the three keyword groups A, B and C, the three groups map to file types a, b and c respectively. Once the target keyword group in the preset keyword group library has been determined, the file type corresponding to it follows from this mapping. In the example above, after keyword group A is determined as the target keyword group, the file type mapped to it in the preset keyword group library is a, so a is determined as the file type corresponding to the target keyword group. This file type is the target file type of the editable file, which completes the determination of the target file type from the keyword group.
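Steps S31 and S32 can be sketched with the A/B/C example from the text. The function names are illustrative, and the tie-breaking behaviour (the first group in insertion order wins) is an assumption, since the text does not say how ties are resolved.

```python
# Preset keyword group library and group-to-type mapping, taken from the example in the text.
PRESET_GROUPS = {
    "A": ["a1", "a2", "a3", "b1", "c1"],
    "B": ["a1", "b1", "b2", "b3", "c1"],
    "C": ["a1", "b1", "c1", "c2", "c3"],
}
GROUP_TO_FILE_TYPE = {"A": "a", "B": "b", "C": "c"}


def pick_target_group(doc_keywords, preset_groups):
    """Step S31: choose the preset group sharing the most keywords with the document."""
    doc = set(doc_keywords)
    # On a tie, max() keeps the first group encountered (an assumption, see lead-in).
    return max(preset_groups, key=lambda name: len(doc & set(preset_groups[name])))


def determine_file_type(doc_keywords):
    """Step S32: map the target keyword group to its file type."""
    return GROUP_TO_FILE_TYPE[pick_target_group(doc_keywords, PRESET_GROUPS)]


file_type = determine_file_type(["a1", "a2", "b1", "d1"])
```

With the keywords a1, a2, b1 and d1, group A shares three keywords while B and C share two each, so the result is file type a, as in the text.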
Further, in another embodiment of the character recognition error correction method of the present invention, the step of performing attribute conversion on the file to be corrected to generate an editable file includes:
Step S21, scanning the file to be corrected, and determining the titles and paragraphs in it according to the size relationship and spacing relationship between the characters in the file;
Further, a file generally contains titles and paragraphs, and character size and spacing differ between them: title characters are larger than paragraph characters, and the spacing between a title and a paragraph is larger than the spacing within a title or within a paragraph. When attribute conversion is performed on a file to be corrected whose attribute is read-only, the file is first scanned, and the titles and paragraphs in it are determined from the size relationship and spacing relationship between the characters obtained by scanning. Specifically, when the scan finds that the spacing between characters becomes smaller or the characters become smaller, the content scanned before that point is judged to be a title. When the characters first become larger and then smaller again, and the spacing between characters goes from small to large and then from large to small, the scan is judged to have passed from a paragraph to a title and back to a paragraph. In this way the titles and paragraphs in the file to be corrected are distinguished by the changes in character size and spacing of the scanned text.
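A much simplified sketch of the size-based part of this rule is given below. It assumes the scan has already produced lines with a measured character height (the `Line` type is hypothetical), treats the height that covers the most characters as the paragraph size, and labels anything larger as a title. The spacing criterion described above is deliberately left out of this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    char_height: float  # measured character height from the scan (hypothetical input)


def label_lines(lines):
    """Label each scanned line as 'title' or 'paragraph' by character height."""
    weight = Counter()
    for line in lines:
        # Weight each height by how many characters use it, so body text dominates.
        weight[line.char_height] += len(line.text)
    body_height = weight.most_common(1)[0][0]
    return [
        ("title" if line.char_height > body_height else "paragraph", line.text)
        for line in lines
    ]


scanned = [
    Line("Labour Contract", 18.0),
    Line("Party A employs Party B under the terms below.", 12.0),
    Line("Party B accepts the terms.", 12.0),
]
labels = [kind for kind, _ in label_lines(scanned)]
```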
Step S22, scanning the characters in the title and the paragraph one by one, recognizing the scanned characters according to a preset character library, and adding a title identifier to the recognized title characters;
To convert the read-only file to be corrected into an editable file, this embodiment provides a preset character library, which is set in advance and contains a wide range of characters. After the titles and paragraphs in the file to be corrected have been determined, the characters in them are scanned one by one, and each scanned character is compared with the characters in the preset character library so as to recognize it. A title identifier is added to the recognized title characters to distinguish title characters from paragraph characters.
Step S23, transferring the recognized title characters and paragraph characters to a preset editor to generate the editable file.
Further, after the scanned title characters and paragraph characters have been recognized according to the preset character library, they are transferred to a preset editor. The preset editor is a text editing tool set in advance, such as a Word document or a WPS document; transferring the recognized characters to the preset editor for editing generates the editable file.
Further, in another embodiment of the character recognition error correction method of the present invention, the step of reading multiple keywords in the editable file to form a keyword group includes:
Step S33, reading the phrases in the editable file, counting the frequency with which each phrase occurs, and taking the phrases whose frequency exceeds a preset value as the keywords;
Further, so that the keyword group formed from multiple keywords reflects the industry field to which the editable file belongs, the keywords read should be phrases that appear many times in the editable file; the industry field of such frequently occurring phrases determines the type of the editable file. The phrases in the editable file are therefore read and the frequency of each phrase is counted; the higher the frequency, the better the phrase reflects the type of the editable file. In addition, conjunctions are universal, being used in every industry field, and cannot reflect the type of the editable file, so they are excluded from the count. To reflect the type of the editable file more accurately, a preset value is set: only when the frequency of a phrase exceeds the preset value is the phrase taken as a keyword, so that the type of the editable file is reflected by the industry fields of phrases that occur more often than this value.
Step S34, obtaining the phrases in the title according to the title identifier, and forming the keyword group from the title phrases together with the keywords.
Understandably, the title content or title type in a file can reflect the file type. For example, the title content "labour contract" indicates a file of the legal industry, while title types such as "claims" and "specification" indicate a file of the patent industry. Therefore, after the phrases in the editable file whose frequency exceeds the preset value have been taken as keywords, the phrases in the title are also obtained. Because a title identifier was added to the title characters, the title characters can be located by the identifier and the phrases in them obtained. The title phrases and the keywords together form the keyword group, which reflects the type of the editable file more accurately.
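Steps S33 and S34 can be sketched as a frequency count followed by a merge with the title phrases. The sketch assumes the text has already been split into phrases and that the title phrases have already been located through the title identifier; the conjunction list and the threshold are illustrative values, not ones given in the source.

```python
from collections import Counter


def build_keyword_group(body_phrases, title_phrases, conjunctions, preset_value):
    """Form a keyword group from frequent body phrases plus the title phrases."""
    # Step S33: count phrase frequency, leaving conjunctions out of the count.
    counts = Counter(p for p in body_phrases if p not in conjunctions)
    keywords = [p for p, n in counts.items() if n > preset_value]
    # Step S34: add the title phrases; dict.fromkeys removes duplicates, keeping order.
    return list(dict.fromkeys(keywords + list(title_phrases)))


body = ["contract", "and", "employer", "contract", "and", "and",
        "salary", "employer", "contract"]
group = build_keyword_group(body, ["labour contract"], {"and"}, preset_value=1)
```

Here "and" is the most frequent token but is excluded as a conjunction, and "salary" occurs only once, so it does not exceed the preset value.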
Further, referring to Fig. 2, a second embodiment of the character recognition error correction method of the present invention is proposed on the basis of the first embodiment. In the second embodiment, after the step of calling the target error correction library to correct the editable file, the method includes:
Step S50, outputting the corrected editable file, and, when a correction operation on the output editable file is received, transferring the correction word corresponding to the correction operation into the target error correction library so as to update the target error correction library.
Further, after the target error correction library has been called to correct the wrong characters in the editable file, the corrected editable file is output to the display interface of a terminal connected to the system server. A monitoring person who supervises the correction result inspects the editable file shown on this display interface and checks whether the correction result is correct. If the inspection finds the correction result correct, the corrections in the target error correction library are suitable for the current editable file. If it finds the correction result incorrect, the corrections in the target error correction library are not suitable for the current editable file, and the library needs to be updated. For the incorrect parts of the correction result, the monitoring person can perform a correction operation, entering the correct correction word to fix the incorrect part of the editable file. When such a correction operation on the output editable file is received, the correction word corresponding to the operation is transferred into the target error correction library to update it. When the target error correction library is later used to correct an editable file, the updated library, which contains the correct correction words, is called. Repeating this correction process improves the correctness of the error correction.
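The update in step S50 amounts to adding the reviewer's correction words to the library used for later runs. A minimal sketch, assuming the library is modelled as a set of preset phrases (the source does not specify its storage format):

```python
def update_library(library, correction_words):
    """Return a new library that also contains the reviewer's correction words."""
    updated = set(library)  # copy, so the caller's original library is left untouched
    updated.update(correction_words)
    return updated


original = {"legal protection"}
updated = update_library(original, ["labour contract"])
```

Returning a copy rather than mutating in place is a design choice of this sketch; it makes it easy to compare the library before and after a review round.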
In addition, referring to Fig. 3, the present invention provides a character recognition error correction device. In a first embodiment of the character recognition error correction device of the present invention, the device includes:
A reading module 10, configured to read the extension of a file to be corrected when the file is received, and to determine the attribute of the file to be corrected according to the extension;
A judgment module 20, configured to judge whether the attribute of the file to be corrected is read-only and, if it is, to perform attribute conversion on the file to be corrected to generate an editable file;
A determining module 30, configured to read multiple keywords in the editable file to form a keyword group, and to determine the target file type of the editable file according to the keyword group;
A correction module 40, configured to determine, according to the preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and to call the target error correction library to correct the editable file.
In the character recognition error correction device of this embodiment, when a file to be corrected is received, the reading module 10 reads its extension and determines the attribute of the file from the extension. The judgment module 20 judges whether the attribute of the file to be corrected is read-only and, if it is, performs attribute conversion on the file to generate an editable file. The determining module 30 reads multiple keywords in the editable file to form a keyword group and determines the target file type of the editable file from the keyword group. The correction module 40 determines the target error correction library corresponding to the target file type according to the preset mapping between file types and error correction libraries, and calls it to correct the editable file. This scheme can recognize and correct both read-only and non-read-only files: a read-only file to be corrected is first converted into an editable file, its file type is determined from the keyword group in the editable file, and the target error correction library corresponding to that file type is called for correction. Because different file types belong to different industries with their own specific phrases, a different error correction library is set for each file type. Correcting with the target error correction library that corresponds to the file type makes the correction more accurate, avoids manual correction, and improves correction efficiency.
Each virtual function module of the above character recognition error correction device is stored in the memory 1005 of the character recognition error correction apparatus shown in Fig. 4; when the processor 1001 executes the character recognition error correction program, the functions of the modules in the embodiment shown in Fig. 3 are realized.
Referring to Fig. 4, Fig. 4 is a schematic structural diagram of the apparatus in the hardware operating environment involved in the method of the embodiments of the present invention.
The character recognition error correction apparatus of the embodiments of the present invention may be a PC (personal computer), or a terminal device such as a smartphone, tablet computer, e-book reader, or portable computer.
As shown in Fig. 4, the character recognition error correction apparatus may include a processor 1001, such as a CPU (Central Processing Unit), a memory 1005, and a communication bus 1002. The communication bus 1002 implements connection and communication between the processor 1001 and the memory 1005. The memory 1005 may be high-speed RAM (random access memory) or non-volatile memory, such as disk storage. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.
Optionally, the character recognition error correction apparatus may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi (Wireless Fidelity) module, and so on. The user interface may include a display and an input unit such as a keyboard, and may optionally also include standard wired and wireless interfaces. The network interface may optionally include standard wired and wireless interfaces (such as a WiFi interface).
Those skilled in the art will understand that the structure shown in Fig. 4 does not limit the character recognition error correction apparatus, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in Fig. 4, the memory 1005, as a computer storage medium, may contain an operating system, a network communication module, and the character recognition error correction program. The operating system is a program that manages and controls the hardware and software resources of the character recognition error correction apparatus and supports the running of the character recognition error correction program and other software and/or programs. The network communication module implements communication between the components inside the memory 1005, and communication with other hardware and software in the character recognition error correction apparatus.
In the character recognition error correction apparatus shown in Fig. 4, the processor 1001 executes the character recognition error correction program stored in the memory 1005 to implement the steps in the above embodiments of the character recognition error correction method.
The present invention provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the steps in the above embodiments of the character recognition error correction method.
It should also be noted that, herein, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. Without further limitation, an element introduced by "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes it.
The embodiments of the present invention are described for illustration only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disc) and including instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structural transformation made using the description and drawings of the present invention under its inventive concept, or any direct or indirect application in other related technical fields, falls within the scope of patent protection of the present invention.

Claims (10)

1. A character recognition error correction method, characterized in that the character recognition error correction method includes the following steps:
When a file to be corrected is received, reading the extension of the file to be corrected, and determining the attribute of the file to be corrected according to the extension;
Judging whether the attribute of the file to be corrected is read-only, and if the attribute of the file to be corrected is read-only, performing attribute conversion on the file to be corrected to generate an editable file;
Reading multiple keywords in the editable file to form a keyword group, and determining the target file type of the editable file according to the keyword group;
Determining, according to the preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and calling the target error correction library to correct the editable file.
2. The character recognition error correction method according to claim 1, characterized in that the step of calling the target error correction library to correct the editable file includes:
Identifying at least one sentence in the editable file, detecting the conjunctions in each identified sentence, and dividing each sentence into multiple phrases to be identified according to the conjunctions;
Comparing the phrases to be identified one by one with each preset phrase in the target error correction library, and judging whether a preset phrase consistent with the phrase to be identified exists in the target error correction library;
If no preset phrase consistent with the phrase to be identified exists in the target error correction library, obtaining the target preset phrase in the target error correction library that has the highest similarity to the phrase to be identified, and replacing the phrase to be identified with the target preset phrase.
3. The character recognition error correction method according to claim 2, characterized in that the step of replacing the phrase to be identified with the target preset phrase includes:
Obtaining the phrases to be identified adjacent to the current phrase to be identified, forming a sentence to be identified from the adjacent phrases to be identified and the target preset phrase, and judging, according to the sentence to be identified, whether the target preset phrase matches the semantic context of the editable file;
If the target preset phrase matches the editable file, replacing the phrase to be identified with the target preset phrase.
4. The character recognition error correction method according to claim 1, characterized in that the step of determining the target file type of the editable file according to the keyword group includes:
Comparing the keyword group with a preset keyword group library, and determining a target keyword group in the preset keyword group library, wherein the target keyword group has the highest element match rate with the keyword group;
Determining, according to the mapping between keyword groups and file types in the preset keyword group library, the file type corresponding to the target keyword group, and taking that file type as the target file type of the editable file.
5. The character recognition error correction method according to claim 1, characterized in that the step of performing attribute conversion on the file to be corrected to generate an editable file includes:
Scanning the file to be corrected, and determining the titles and paragraphs in the file to be corrected according to the size relationship and spacing relationship between the characters in the file;
Scanning the characters in the title and the paragraph one by one, recognizing the scanned characters according to a preset character library, and adding a title identifier to the recognized title characters;
Transferring the recognized title characters and paragraph characters to a preset editor to generate the editable file.
6. The character recognition error correction method according to claim 5, characterized in that the step of reading multiple keywords in the editable file to form a keyword group includes:
Reading the phrases in the editable file, counting the frequency with which each phrase occurs, and taking the phrases whose frequency exceeds a preset value as the keywords;
Obtaining the phrases in the title according to the title identifier, and forming the keyword group from the phrases in the title together with the keywords.
7. The character recognition error correction method according to any one of claims 1 to 6, characterized in that, after the step of calling the target error correction library to correct the editable file, the method includes:
Outputting the corrected editable file, and, when a correction operation on the output editable file is received, transferring the correction word corresponding to the correction operation into the target error correction library so as to update the target error correction library.
8. A character recognition error correction device, characterized in that the character recognition error correction device includes:
A reading module, configured to read the extension of a file to be corrected when the file is received, and to determine the attribute of the file to be corrected according to the extension;
A judgment module, configured to judge whether the attribute of the file to be corrected is read-only and, if it is, to perform attribute conversion on the file to be corrected to generate an editable file;
A determining module, configured to read multiple keywords in the editable file to form a keyword group, and to determine the target file type of the editable file according to the keyword group;
A correction module, configured to determine, according to the preset mapping between file types of editable files and error correction libraries, the target error correction library corresponding to the target file type, and to call the target error correction library to correct the editable file.
9. A character recognition error correction apparatus, characterized in that the character recognition error correction apparatus includes: a memory, a processor, a communication bus, and a character recognition error correction program stored on the memory;
The communication bus implements connection and communication between the processor and the memory;
The processor executes the character recognition error correction program to implement the steps of the character recognition error correction method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a character recognition error correction program is stored on the computer-readable storage medium, and when the character recognition error correction program is executed by a processor, the steps of the character recognition error correction method according to any one of claims 1 to 7 are implemented.
CN201810430989.8A 2018-05-07 2018-05-07 Character recognition error correction method, device, equipment and computer readable storage medium Active CN108664471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810430989.8A CN108664471B (en) 2018-05-07 2018-05-07 Character recognition error correction method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810430989.8A CN108664471B (en) 2018-05-07 2018-05-07 Character recognition error correction method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN108664471A true CN108664471A (en) 2018-10-16
CN108664471B CN108664471B (en) 2024-01-23

Family

ID=63778807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810430989.8A Active CN108664471B (en) 2018-05-07 2018-05-07 Character recognition error correction method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN108664471B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147791A (en) * 2019-05-20 2019-08-20 上海联影医疗科技有限公司 Character recognition method, device, equipment and storage medium
CN111079417A (en) * 2019-12-17 2020-04-28 米哈游科技(上海)有限公司 Wrongly written character checking method, wrongly written character checking device, server and storage medium
CN111310473A (en) * 2020-02-04 2020-06-19 四川无声信息技术有限公司 Text error correction method and model training method and device thereof
CN111626049A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
CN111639566A (en) * 2020-05-19 2020-09-08 浙江大华技术股份有限公司 Method and device for extracting form information
CN112668581A (en) * 2020-12-29 2021-04-16 北京声智科技有限公司 Document title identification method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095778A (en) * 2016-05-26 2016-11-09 达而观信息科技(上海)有限公司 The Chinese search word automatic error correction method of search engine
CN106991416A (en) * 2017-03-14 2017-07-28 浙江大学 It is a kind of based on the laboratory test report recognition methods taken pictures manually
CN107633250A (en) * 2017-09-11 2018-01-26 畅捷通信息技术股份有限公司 A kind of Text region error correction method, error correction system and computer installation
CN107741928A (en) * 2017-10-13 2018-02-27 四川长虹电器股份有限公司 A kind of method to text error correction after speech recognition based on field identification
CN107818289A (en) * 2016-09-13 2018-03-20 北京搜狗科技发展有限公司 A kind of prescription recognition methods and device, a kind of device for prescription identification

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147791A (en) * 2019-05-20 2019-08-20 上海联影医疗科技有限公司 Character recognition method, device, equipment and storage medium
CN111079417A (en) * 2019-12-17 2020-04-28 米哈游科技(上海)有限公司 Wrongly written character checking method, wrongly written character checking device, server and storage medium
CN111310473A (en) * 2020-02-04 2020-06-19 四川无声信息技术有限公司 Text error correction method and model training method and device thereof
CN111639566A (en) * 2020-05-19 2020-09-08 浙江大华技术股份有限公司 Method and device for extracting form information
CN111626049A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
CN111626049B (en) * 2020-05-27 2022-12-16 深圳市雅阅科技有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
CN112668581A (en) * 2020-12-29 2021-04-16 北京声智科技有限公司 Document title identification method and device

Also Published As

Publication number Publication date
CN108664471B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN108664471A (en) Text region error correction method, device, equipment and computer readable storage medium
CN104899016B (en) Call stack relationship acquisition method and device
CN105787366A (en) Android software visualization safety analysis method based on module relations
CN110955608B (en) Test data processing method, device, computer equipment and storage medium
CN110286917A (en) File packing method, device, equipment and storage medium
CN109446753A (en) Method, apparatus, computer equipment, and storage medium for detecting pirated application programs
CN110287104A (en) Test case generation method, device, terminal, and computer readable storage medium
CN111414402A (en) Log threat analysis rule generation method and device
CN112328489A (en) Test case generation method and device, terminal equipment and storage medium
CN111475494A (en) Mass data processing method, system, terminal and storage medium
CN106325596A (en) Automatic handwriting error correction method and system
US20130151519A1 (en) Ranking Programs in a Marketplace System
CN106649210B (en) Data conversion method and device
KR20120012803A (en) Proprietary circuit layout identification
CN110502513B (en) Data acquisition method, device, equipment and computer readable storage medium
CN113434542B (en) Data relationship identification method and device, electronic equipment and storage medium
CN110990834A (en) Static detection method, system and medium for android malicious software
CN111221690B (en) Model determination method and device for integrated circuit design and terminal
CN107272989A (en) Application startup method, device, and terminal device
JP2013077124A (en) Software test case generation device
CN113805861B (en) Code generation method based on machine learning, code editing system and storage medium
CN110851346A (en) Method, device and equipment for detecting boundary problem of query statement and storage medium
CN113158001B (en) Network space IP asset attribution and correlation discrimination method and system
CN113220949B (en) Construction method and device of private data identification system
CN111880776B (en) Hierarchical relationship obtaining method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 2023-12-29

Address after: Room 1104, 11th Floor, Building 16, No. 6 Wenhuayuan West Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing, 100000

Applicant after: Beijing Yiyin Technology Co.,Ltd.

Address before: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen LIAN intellectual property service center

Effective date of registration: 2023-12-29

Address after: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen LIAN intellectual property service center

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: PING AN PUHUI ENTERPRISE MANAGEMENT Co.,Ltd.

GR01 Patent grant