US20210073531A1 - Multi-page document recognition in document capture - Google Patents

Multi-page document recognition in document capture

Info

Publication number
US20210073531A1
Authority
US
United States
Prior art keywords
document
page
values
data
fields
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/953,561
Inventor
Ming Fung Ho
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Open Text Corp
Original Assignee
Open Text Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Open Text Corp
Priority to US16/953,561
Publication of US20210073531A1
Assigned to OPEN TEXT CORPORATION reassignment OPEN TEXT CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: EMC CORPORATION
Assigned to EMC CORPORATION reassignment EMC CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HO, MING FUNG
Priority to US17/940,777 (US11868717B2)

Classifications

    • G06K9/00456
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging
    • G06K9/00469
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G06K2209/01
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • each page is processed independently with its own data entry form, value extraction, and validation.
  • the metadata document generated through document capture for each page has to be mapped to a multi-page structure and data values reconciled across pages through additional processing.
  • the invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor.
  • these implementations, or any other form that the invention may take, may be referred to as techniques.
  • the order of the steps of disclosed processes may be altered within the scope of the invention.
  • a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task.
  • the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
  • Processing a multi-page document as a single entity, with a single corresponding data entry form, in an automated document capture context is disclosed.
  • the pages comprising a multi-page document are identified and associated with a multi-page document type.
  • a corresponding data entry form is used to provide a structured representation of data extracted from the pages of the multi-page document. Structures that may span multiple pages, such as a table or list of values, are associated together in a single array or other structure of the data entry form. Validation of extracted values based on dependency fields that may occur on different pages is facilitated, both in automated processing and in human validation.
  • FIG. 1 is a flow chart illustrating an embodiment of a process to capture data.
  • document content is captured into a digital format ( 102 ), e.g., by scanning the physical sheet(s) to create a scanned image.
  • the document is classified ( 104 ).
  • classification includes detecting a document type corresponding to an associated data entry form.
  • Data is extracted from the digital content ( 106 ), for example through optical character recognition (OCR) and/or optical mark recognition (OMR) techniques.
  • Extracted data is validated ( 108 ).
  • validation may be performed at least in part by an automated process, for example by comparing multiple occurrences of the same value, by performing computations or other manipulations based on extracted data, etc.
  • all or a subset of extracted values may be validated manually, by a human indexer or other operator.
  • output is delivered ( 110 ), e.g., by storing the document image and associated data in an enterprise content management system or other repository.
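The five steps of FIG. 1 can be sketched as a simple pipeline. This is a minimal illustration, not the disclosed implementation: the `CapturedDocument` class, the step functions, and the "key=value" stand-in for OCR output are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CapturedDocument:
    image: str                                   # stand-in for scanned image content (102)
    doc_type: str = ""
    values: Dict[str, str] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)
    delivered: bool = False


def classify(doc: CapturedDocument) -> CapturedDocument:
    # (104) hypothetical rule: choose a document type from the content
    doc.doc_type = "invoice" if "INVOICE" in doc.image else "unknown"
    return doc


def extract(doc: CapturedDocument) -> CapturedDocument:
    # (106) stand-in for OCR/OMR: read "key=value" tokens from the content
    for token in doc.image.split():
        if "=" in token:
            key, value = token.split("=", 1)
            doc.values[key] = value
    return doc


def validate(doc: CapturedDocument) -> CapturedDocument:
    # (108) one automated check; failures would be routed to a human indexer
    if "total" not in doc.values:
        doc.errors.append("total: missing")
    return doc


def deliver(doc: CapturedDocument) -> CapturedDocument:
    # (110) stand-in for storing image and data in a repository
    doc.delivered = True
    return doc


def capture(image: str) -> CapturedDocument:
    doc = CapturedDocument(image=image)
    for step in (classify, extract, validate, deliver):
        doc = step(doc)
    return doc
```

Calling `capture("INVOICE total=42.00")` in this sketch yields a document typed as an invoice with one extracted value and no validation errors.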
  • FIG. 2 is a block diagram illustrating an embodiment of a document capture system and environment.
  • a client system 212 is attached to a scanner 204 .
  • Documents are scanned by scanner 204 and the resulting document image is sent by the client system 212 to document capture system 202 for processing, e.g., using all or part of the process of FIG. 1 .
  • document capture system 202 uses a library of data entry forms 206 to create a structured representation of data extracted from a scanned document. For example, as in FIG. 1 steps 104 and 106 , in some embodiments a document is classified by type and an instance of a corresponding data entry form is created and populated with data values extracted from the document image.
  • data validation may be performed, at least in part, by document capture system 202 by accessing external data 208 via a network 210 .
  • an external third party database that associates street addresses with correct postal zip codes may be used to validate a zip code value extracted from a document.
  • validation may be performed at least in part by a plurality of manual indexers each using an associated client system 212 to communicate via network 210 with document capture system 202 .
  • document capture system 202 may be configured to queue human validation tasks and to serve tasks out to indexers using clients 212 .
  • Each client system 212 may use browser-based and/or installed client software to validate data as described herein.
  • the resulting raw document image and/or form data are delivered as output, for example by storing the document image and associated form data in a repository 214 , such as an enterprise content management (ECM) or other repository.
  • FIG. 3 is a block diagram illustrating an embodiment of a document capture system.
  • the document capture system 202 of FIG. 2 is shown to receive document image data, e.g., via network 210 from a scanning client system 212 .
  • Document image data is received in some embodiments in batches and is stored in an image store 308 .
  • Document image data is provided to a data extraction module 310 which uses a data entry forms library 312 to classify each document by type and create an instance of a type-specific data entry form.
  • Data extraction module 310 uses OCR, OMR, and/or other techniques to extract data values from the document image and uses the extracted values to populate the corresponding data entry form instance.
  • data extraction module 310 may provide a score or other indication of a degree of confidence with which an extracted value has been determined based on a corresponding portion of the document image.
  • for each form field, the corresponding location within the document image from which the data value entered by the extraction module in that field was extracted, for example the portion that shows the text to which OCR or other techniques were applied to determine the text present in the image, is recorded.
  • the data extraction module 310 provides the populated form to a validation module 314 configured to perform validation (automated and/or human as configured and/or required).
  • the validation module 314 applies one or more validation rules to identify fields that may require a human operator to validate.
  • the validation module 314 may communicate via a communications interface 316 , for example a network interface card or other communications interface, to obtain external data to be used in validation and/or to generate and provide to human indexers via associated client systems, such as one or more of clients 212 of FIG. 2 , tasks to perform human/manual validation of all or a subset of form fields.
  • the validated data is provided to a delivery/output module 318 configured to provide output via communication interface 316 , for example by storing the document image and/or extracted data (structured data as captured using the corresponding data entry form) in an enterprise content management system or other repository.
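The extraction output described for FIG. 3 (a value, a confidence indication, and the image location it came from) can be sketched as a small record, together with one possible validation rule that flags low-confidence fields for a human operator. The `ExtractedField` type, the 0.0 to 1.0 confidence scale, and the 0.9 threshold are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float                  # assumed scale: 0.0 (no confidence) to 1.0
    page: int                          # page the value was extracted from
    box: Tuple[int, int, int, int]     # (left, top, right, bottom) in image pixels


def fields_needing_review(form: Dict[str, ExtractedField],
                          threshold: float = 0.9) -> List[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return [name for name, f in form.items() if f.confidence < threshold]
```

Recording the page and bounding box with each value is what later allows a validation interface to show the matching image portion next to the form field.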
  • FIG. 4 is a block diagram illustrating an embodiment of a data validation user interface.
  • validation interface 400 includes a document image display area 402 , a data entry form interface 404 , and a navigation pane 406 .
  • a document image 408 is displayed in document image display area 402 .
  • portions of document image 408 that correspond to data entry form fields in the form shown in data entry form interface 404 are highlighted, as indicated in FIG. 4 by the cross-hatched rectangles in document image 408 as shown.
  • thumbnails are shown in navigation pane 406 , each corresponding for example to an associated document and/or page from which data has been captured.
  • the topmost thumbnail in FIG. 4 is highlighted (thicker outer outline), indicating that document image 408 as displayed in document image display area 402 corresponds to the topmost thumbnail.
  • controls are provided (e.g., on screen controls, keystrokes or combinations, etc.) to enable the operator to pan, scroll, and/or zoom in/out with respect to the document image 408 , for example to focus and zoom in on (magnify) a particular portion of the document image 408 .
  • a cursor advances to the next field and a corresponding portion of the document image 408 is highlighted.
  • FIG. 5 is a screen shot illustrating an embodiment of a technique to minimize eye strain and/or fatigue in manual indexing.
  • partial screen shot 500 includes a portion of a manual data validation user interface that includes a data entry form field 502 , in this example with a current value of “888-555-1348” displayed, and nearby to the form field, as displayed in the data entry form portion of the data validation interface, a snippet 504 taken from a corresponding document image, which shows just the portion of the document image that contains the image of the text (in this case numerical values) extracted from the document to populate the form field 502 .
  • a confirmation or other informational and/or error message 506 similarly is displayed near the form field 502 .
  • the form field 502 , corresponding snippet 504 , and confirmation message 506 are all in the line of sight, or nearly so, at the same time, enabling all information required to validate the value entered in the form field 502 , including entering any correction that may be required, to be viewed at the same time and/or with minimal eye or head movement and without requiring the operator to scan back and forth between the document image frame and the data entry form, and/or to scroll, pan, or zoom in/out in the document image as viewed to locate and scale to a readable size the text to be validated.
  • the snippet 504 is scaled to ensure readability, for example by including in the snippet only (or mostly) the text to be validated and scaling the image to a readable size, for example until the image is of at least a prescribed minimum size and/or the displayed characters are of a prescribed minimum “point” or other size.
  • the system automatically pans to the next data entry form field, then retrieves and displays near the form field a corresponding document image snippet. In this way, the operator can navigate through the form and corresponding portions of the document image without retargeting, i.e., without having to redirect their eyes to a different point or points on the screen.
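The snippet scaling described above (enlarging until a prescribed minimum size is reached) reduces to choosing a scale factor. The sketch below assumes minimums expressed in pixels; the function name and default values are hypothetical.

```python
def snippet_scale(width_px: int, height_px: int,
                  min_width_px: int = 200, min_height_px: int = 24) -> float:
    """Return a scale factor >= 1.0 so the snippet meets both minimum sizes.

    Snippets that are already large enough are left unscaled (factor 1.0).
    """
    if width_px <= 0 or height_px <= 0:
        raise ValueError("snippet dimensions must be positive")
    return max(1.0, min_width_px / width_px, min_height_px / height_px)
```

Using the larger of the two ratios keeps the aspect ratio intact while guaranteeing that neither dimension falls below its minimum.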
  • FIG. 6 is a flow chart illustrating an embodiment of a process to facilitate manual indexing.
  • the process of FIG. 6 is used to provide an interface such as the one shown and described above in connection with FIG. 5 .
  • a snippet containing the text or other document image portion corresponding to a data entry form field to be validated is obtained, and an association between the snippet and/or the associated location in the document image, on the one hand, and the corresponding form field, on the other hand, is stored ( 602 ).
  • the snippet is scaled as/if needed for readability ( 604 ).
  • the scaled (if applicable) snippet is displayed adjacent or otherwise near to the form field where corresponding extracted data to be validated is displayed and/or entered ( 606 ).
  • the metadata object is usually not defined per page. To export per-page forms, effort must be made to map values to their corresponding attributes of a metadata object used to represent the multi-page document in the content management system.
  • automatic detection and processing of a multi-page document as a single document is disclosed.
  • automatic detection of the pages comprising a multiple page document is performed.
  • Data values are extracted from the pages comprising the document and used to populate a single electronic data entry form for the multi-page document.
  • the operator can then go through the electronic data entry form, for example to validate data fields as required, and the document capture and/or validation system shows the location in the captured document of the corresponding data, regardless of which page(s) it occurs in, rather than the operator having to find and/or choose each page, indexing each independently, and then reconcile later data that occurs in and/or spans multiple pages.
  • FIG. 7 is a block diagram illustrating an embodiment of a document capture system and process.
  • scanned pages 702 , 704 , and 706 comprise a multi-page document.
  • First page recognition and/or other techniques are applied in various embodiments to detect automatically the beginning and/or ending of a multi-page document.
  • the pages 702 , 704 , and 706 are identified through a process 708 as comprising a single multi-page document.
  • a corresponding document type is determined and data values are extracted from pages 702 , 704 , and 706 to populate a single data entry form 710 configured to capture data values extracted from the multi-page document.
  • the respective locations within the page images 702 , 704 , and 706 of data extracted to populate form 710 are shown as small cross-hatched rectangles.
  • the rows at the bottom of page 702 and the top of page 704 in this example comprise a single table, list, or other array that spans pages 702 and 704 .
  • the corresponding extracted data values are in some embodiments captured initially in page specific arrays, the rows of which are concatenated in the example shown to populate the single table at the bottom of form 710 .
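Concatenating page-specific arrays into the single form table, as described for FIG. 7, can be sketched as follows. The input shape (a list of page number and rows pairs) and the choice to remember each row's source page are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def concatenate_page_tables(
        page_tables: Sequence[Tuple[int, List[List[str]]]]) -> List[Dict]:
    """Merge per-page row lists into one table, in page order.

    Each combined row keeps its source page and row index so that a
    validation interface can navigate back to the right page image.
    """
    combined = []
    for page_number, rows in sorted(page_tables, key=lambda item: item[0]):
        for row_index, row in enumerate(rows):
            combined.append(
                {"page": page_number, "page_row": row_index, "cells": list(row)})
    return combined
```

With one row extracted at the bottom of a first page and three rows at the top of a second page, the result is a single four-row table whose rows still point back to their pages.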
  • FIG. 8 is a block diagram illustrating an embodiment of an interface to validate a multi-page document.
  • the interface 800 includes a page image display area 802 , in which in the example shown an image of page 702 of FIG. 7 is shown.
  • the interface 800 further includes a data entry form area 804 , in this example corresponding to the form 710 of FIG. 7 .
  • Thumbnails for the pages 702 , 704 , and 706 of FIG. 7 are displayed in navigation pane 806 .
  • the topmost thumbnail as displayed in navigation pane 806 is highlighted as being currently “selected” for display in page image display area 802 .
  • selection by a human operator of a thumbnail in navigation pane 806 results in an image of the corresponding page being displayed in page image display area 802 .
  • a corresponding portion or portions of the multi-page document in one or more pages may be navigated to automatically. For example, navigation to a first row of the three column table at the bottom of the form in this example may in some embodiments cause the first page 702 to be displayed.
  • Selection of a cell in one of the bottom three rows may cause the second page of the multi-page document, from which the corresponding data was extracted in this example, to be displayed in the page image display area 802 .
  • selection of a field in form area 804 results in a snippet of a corresponding portion of the page from which the data associated with that field was extracted being determined, retrieved, and displayed, for example in a location adjacent or nearly adjacent to the field, as described above.
  • FIG. 9A is a flow chart illustrating an embodiment of a process to capture document data.
  • the beginning and/or end of a multi-page document is/are detected ( 902 ).
  • known techniques to detect a first page may be used, and a multi-page document may be determined to have been encountered if one or more subsequent pages are scanned before a next “first” page is detected.
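The boundary rule just described (a document runs from one detected first page up to the next) can be sketched as a grouping pass over the page stream. The `is_first_page` callback stands in for whatever first-page recognition technique is used and is an assumption here.

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def group_pages(pages: Sequence[Page],
                is_first_page: Callable[[Page], bool]) -> List[List[Page]]:
    """Split a stream of pages into documents at each detected first page."""
    documents: List[List[Page]] = []
    for page in pages:
        if is_first_page(page) or not documents:
            documents.append([page])      # start a new document
        else:
            documents[-1].append(page)    # continuation of the current document
    return documents
```

Any group with more than one page corresponds to a multi-page document in this sketch.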
  • a document type is determined and a corresponding data entry form instance is created ( 904 ).
  • Scalar (single value) and array (tables, lists, or other two dimensional sets of data) data values are identified and extracted, for example using OCR, OMR, or other automated extraction techniques ( 906 ).
  • Occurrences of the same and/or dependent values in multiple locations, including across page boundaries, may be used to perform automated and/or manual validation ( 908 ). For example, a name that appears at the beginning of a life insurance application and again in an attached report of a physical examination may be cross-checked to determine the accuracy of data extraction from one or both of the locations. Rows of arrays that span multiple pages are concatenated into a single form table ( 910 ). Array values may be validated using the full table, including across page boundaries ( 912 ). For example, quantity and unit price fields may be multiplied and the result compared to a line item subtotal, subtotals in all rows (including potentially across page boundaries) may be summed and compared to an extracted total, etc.
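The arithmetic checks in the example above can be sketched over a concatenated table. The row keys (`quantity`, `unit_price`, `subtotal`) and the use of decimal strings are assumptions; `Decimal` is used so that currency comparisons are exact.

```python
from decimal import Decimal
from typing import Dict, List


def validate_line_items(rows: List[Dict[str, str]], extracted_total: str) -> List[str]:
    """Check quantity * unit price per row, then the sum of subtotals.

    `rows` is the full table, including rows that came from different pages.
    Returns a list of human-readable error messages (empty if all checks pass).
    """
    errors = []
    running_total = Decimal("0")
    for index, row in enumerate(rows):
        quantity = Decimal(row["quantity"])
        unit_price = Decimal(row["unit_price"])
        subtotal = Decimal(row["subtotal"])
        if quantity * unit_price != subtotal:
            errors.append(f"row {index}: quantity x unit price != subtotal")
        running_total += subtotal
    if running_total != Decimal(extracted_total):
        errors.append("sum of subtotals != extracted total")
    return errors
```

Because the check runs over the whole table, a total printed on the last page can be verified against line items extracted from earlier pages.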
  • FIG. 9B is a flow chart illustrating an embodiment of a process to capture document data.
  • a library of metadata document types is defined, with each document type containing scalar fields and tables of array fields ( 922 ).
  • Automatic page recognition is done as in prior art, with page types determined ( 924 ). Values are extracted into per-page scalar and array fields by name, and each field's location on the page is saved ( 926 ).
  • the multi-page document type is determined from an analysis of the stream of page types ( 928 ). Data from the component pages is automatically combined into the document type ( 930 ). A given named scalar field may occur on any page, or in multiple pages.
  • Data validation is performed ( 932 ).
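Steps 928 and 930 of FIG. 9B can be sketched as a lookup on the sequence of page types followed by a merge of per-page fields. The `DOCUMENT_TYPES` table, the page type names, and the dictionary layout are hypothetical; a real library of document types would be defined elsewhere.

```python
from typing import Dict, List

# Hypothetical library: a sequence of page types identifies a document type.
DOCUMENT_TYPES = {
    ("application_p1", "application_p2"): "life_insurance_application",
}


def combine_pages(pages: List[Dict]) -> Dict:
    """Combine per-page scalar and array fields into one document instance."""
    page_types = tuple(page["page_type"] for page in pages)
    doc_type = DOCUMENT_TYPES.get(page_types)
    if doc_type is None:
        raise LookupError(f"no document type for page types {page_types}")
    document = {"type": doc_type, "scalars": {}, "arrays": {}}
    for page_number, page in enumerate(pages, start=1):
        # A named scalar may occur on several pages; keep every occurrence
        # with its page so the values can be cross-checked later.
        for name, value in page.get("scalars", {}).items():
            document["scalars"].setdefault(name, []).append(
                {"value": value, "page": page_number})
        # Rows of a named array are concatenated in page order.
        for name, rows in page.get("arrays", {}).items():
            document["arrays"].setdefault(name, []).extend(rows)
    return document
```

Keeping every occurrence of a scalar, rather than overwriting it, preserves the redundancy that cross-page validation relies on.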
  • FIG. 10 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.
  • an indication is received that an operator is done validating a currently displayed data value ( 1002 ), e.g., the “Date” field in the example shown in FIG. 8 . If no more fields remain to be validated ( 1004 ), the process ends. Otherwise, if the next field to be validated is on the same page ( 1006 ) the next field in the data entry form is advanced to and displayed, and a corresponding snippet or other portion of the current page, from which the associated data value to be validated was extracted, is displayed adjacent to the form field ( 1008 ).
  • If the next form field requiring validation is associated with data from a different page of the multi-page document ( 1006 ), the system automatically retrieves or otherwise accesses the other page and/or an applicable portion thereof (e.g., a corresponding snippet) ( 1010 ), and the next form field and the corresponding snippet obtained from the other page of the multi-page document are displayed for validation ( 1008 ), transparently to and without requiring any further action by the human operator.
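The decision flow of FIG. 10 can be sketched as a function that finds the next field still to be validated and reports whether a different page must be loaded. The field dictionaries and the `load_page` flag are assumptions for illustration.

```python
from typing import Dict, List, Optional


def next_field_view(fields: List[Dict], current_index: int) -> Optional[Dict]:
    """Return what to display after the operator finishes the current field.

    Returns None when no fields remain (1004). Otherwise returns the next
    unvalidated field and whether its page differs from the current one
    (1006), in which case the other page would be fetched (1010).
    """
    current_page = fields[current_index]["page"]
    for candidate in fields[current_index + 1:]:
        if not candidate["validated"]:
            return {
                "name": candidate["name"],
                "page": candidate["page"],
                "load_page": candidate["page"] != current_page,
            }
    return None
```

The operator only ever advances through the form; the page switch is a detail the interface handles from the recorded field location.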
  • FIG. 11 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.
  • a definition of a library of validation rules is received ( 1102 ). Examples include, without limitation, a rule requiring that a first value extracted from a named field A must match a second value extracted from a named field B. Another example is a rule requiring that a sum or other computation based on a specified set of fields must equal a value extracted from another named field, e.g., subtotals in an array must sum to equal a total.
  • Document type definitions are received ( 1104 ). Each definition identifies validation rules to be applied, and as applicable a mapping to the document type fields to be used to apply each rule.
  • An operator interface is provided that facilitates multi-field validation, including across page boundaries ( 1106 ).
  • the interface enables an operator to iterate through just the dependent fields that require validation.
  • the system advances automatically to display a next one of the dependent fields and its associated document image portion, from whichever page it is located on.
  • the system iterates through the dependent fields until the operator enters data that clears the validation error and/or there are no more dependent fields to be displayed.
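The two example rule kinds described for FIG. 11 (fields that must match, and parts that must sum to a total) can be sketched as small rule objects, with a helper that lists the dependent fields an operator would be walked through. The function names and the dictionary representation of a rule are hypothetical; amounts are integers (for example, cents) to keep the arithmetic exact.

```python
from typing import Callable, Dict, List


def match_rule(field_a: str, field_b: str) -> Dict:
    """Rule: the value of field A must equal the value of field B."""
    def check(values: Dict) -> bool:
        return values[field_a] == values[field_b]
    return {"fields": [field_a, field_b], "check": check}


def sum_rule(part_fields: List[str], total_field: str) -> Dict:
    """Rule: the listed fields must sum to the value of the total field."""
    def check(values: Dict) -> bool:
        return sum(values[name] for name in part_fields) == values[total_field]
    return {"fields": [*part_fields, total_field], "check": check}


def dependent_fields_in_error(rules: List[Dict], values: Dict) -> List[str]:
    """List, in rule order and without repeats, the fields of failing rules."""
    flagged: List[str] = []
    for rule in rules:
        if not rule["check"](values):
            for name in rule["fields"]:
                if name not in flagged:
                    flagged.append(name)
    return flagged
```

An interface built on this would present only the returned fields, each with its image portion, until the list becomes empty.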
  • FIG. 12 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.
  • an instance of a multi-page document type is received ( 1202 ).
  • Applicable validation rules are evaluated, e.g., sequentially, including those requiring the concurrent processing of data values extracted from different pages, and any dependent fields are marked as having an error if a rule fails ( 1204 ). If during validation by a human operator (see, e.g., FIG.
  • a data value as extracted is corrected, e.g., the human operator enters a corrected value in a form field, validation rules affected by the change are re-evaluated, for example to ensure a correction that satisfied a first rule did not introduce an inconsistency that caused a second rule to not be satisfied. If the value for a field is visually confirmed with the page image to be correct, then the field can be flagged so it is henceforth no longer marked as having an error when a rule is re-evaluated ( 1204 ). In this way, operators can be more efficient by navigating only to unconfirmed fields.
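Re-evaluation with confirmed fields, as described above, can be sketched as follows. Rules are represented here as pairs of field names and a predicate, and `confirmed` is the set of fields an operator has visually confirmed; both representations are assumptions.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Rule = Tuple[List[str], Callable[[Dict], bool]]


def fields_still_in_error(rules: Iterable[Rule], values: Dict,
                          confirmed: Set[str]) -> Set[str]:
    """Re-evaluate all rules and return the fields still marked in error.

    Fields in `confirmed` are never re-marked, so the operator is only
    sent to unconfirmed fields of failing rules.
    """
    in_error: Set[str] = set()
    for field_names, predicate in rules:
        if not predicate(values):
            in_error.update(n for n in field_names if n not in confirmed)
    return in_error
```

Calling the function again after each correction covers the case where fixing one rule introduces an inconsistency in another.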
  • human operator validation of errors involving fields that have dependency relationships with other fields is facilitated by displaying the fields together, in a single screen, along with each field's corresponding document image snippet, even if the snippets come from different pages.
  • corresponding snippets are displayed, even if they come from multiple, different pages.
  • the human operator need only navigate through fields in a single data entry form, and the system transparently retrieves and displays for each field its corresponding snippet or other partial image, without regard to page boundaries.
  • multi-page documents can be processed more efficiently in the document capture context. Values in the same document can be reconciled and either auto-corrected or flagged for manual confirmation without switching between documents or data entry forms, copying over data values from one form to the other, etc. This facilitates use of data redundancy found in many document images.
  • the data entry form is abstracted from its page definition. The operator does not have to worry about where a value is, which enables the operator to focus on validating data on the form. If there are variations in page versions, the operator does not have to worry about them. The location logic will find the right place. In addition, the developer and operator do not have to incur the cost and complexity of copying data back and forth between page forms.
  • Array data is shown in one table, rather than in a table per page, thereby improving the user experience. Finally, it is easier to map content management metadata objects to new document types, since all of the extracted data values for and structure of a multi-page document are captured in one form.

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Physics & Mathematics
  • General Physics & Mathematics
  • Artificial Intelligence
  • Computer Vision & Pattern Recognition
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Computational Linguistics
  • General Health & Medical Sciences
  • General Engineering & Computer Science
  • Multimedia
  • Character Input

Abstract

Techniques to capture document data are disclosed. It is determined that a sequence of pages in a stream of document page images comprises a single multi-page document. Data is extracted from two or more different pages included in the sequence. The data extracted from two or more different pages included in the sequence of pages is used to populate a data entry form associated with the multi-page document.

Description

    CROSS REFERENCE TO OTHER APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 16/290,453, filed Mar. 1, 2019, entitled “MULTI-PAGE DOCUMENT RECOGNITION IN DOCUMENT CAPTURE”, which is a continuation of U.S. patent application Ser. No. 15/221,433, filed Jul. 27, 2016, entitled “MULTI-PAGE DOCUMENT RECOGNITION IN DOCUMENT CAPTURE”, issued as U.S. Pat. No. 10,248,858, which is a continuation of U.S. patent application Ser. No. 13/720,671, filed Dec. 19, 2012, entitled “MULTI-PAGE DOCUMENT RECOGNITION IN DOCUMENT CAPTURE” issued as U.S. Pat. No. 9,430,453, which are incorporated herein by reference for all purposes.
  • BACKGROUND OF THE INVENTION
  • In document capture, typically pages are recognized and validated one at a time, in sequence. In the typical approach, each page is processed independently with its own data entry form, value extraction, and validation. In the case of a multi-page document, typically the metadata document generated through document capture for each page has to be mapped to a multi-page structure and data values reconciled across pages through additional processing.
  • In practice, during data validation human operators typically must navigate through multiple pages and associated data entry forms, for example to compare and reconcile values that occur in different pages, etc. This approach depends on the knowledge of human operators of the location of data in different pages of a multiple page document, and in the worst case may require an operator to hunt through multiple independent pages and/or associated page-specific data entry forms to cross-validate data, for example. In addition, treating each page as a separate document results in suboptimal processing of structures such as tables, which may span multiple pages.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
  • FIG. 1 is a flow chart illustrating an embodiment of a process to capture data.
  • FIG. 2 is a block diagram illustrating an embodiment of a document capture system and environment.
  • FIG. 3 is a block diagram illustrating an embodiment of a document capture system.
  • FIG. 4 is a block diagram illustrating an embodiment of a data validation user interface.
  • FIG. 5 is a screen shot illustrating an embodiment of a technique to minimize eye strain and/or fatigue in manual indexing.
  • FIG. 6 is a flow chart illustrating an embodiment of a process to facilitate manual indexing.
  • FIG. 7 is a block diagram illustrating an embodiment of a document capture system and process.
  • FIG. 8 is a block diagram illustrating an embodiment of an interface to validate a multi-page document.
  • FIG. 9A is a flow chart illustrating an embodiment of a process to capture document data.
  • FIG. 9B is a flow chart illustrating an embodiment of a process to capture document data.
  • FIG. 10 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.
  • FIG. 11 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.
  • FIG. 12 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document.
  • DETAILED DESCRIPTION
  • The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
  • A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
  • Processing a multi-page document as a single entity, with a single corresponding data entry form, in an automated document capture context is disclosed. In various embodiments, the pages comprising a multi-page document are identified and associated with a multi-page document type. A corresponding data entry form is used to provide a structured representation of data extracted from the pages of the multi-page document. Structures that may span multiple pages, such as a table or list of values, are associated together in a single array or other structure of the data entry form. Validation of extracted values based on dependency fields that may occur on different pages is facilitated, both in automated processing and in human validation.
  • FIG. 1 is a flow chart illustrating an embodiment of a process to capture data. In the example shown, document content is captured into a digital format (102), e.g., by scanning the physical sheet(s) to create a scanned image. The document is classified (104). In some embodiments, classification includes detecting a document type corresponding to an associated data entry form. Data is extracted from the digital content (106), for example through optical character recognition (OCR) and/or optical mark recognition (OMR) techniques. Extracted data is validated (108). In various embodiments, validation may be performed at least in part by an automated process, for example by comparing multiple occurrences of the same value, by performing computations or other manipulations based on extracted data, etc. In various embodiments, all or a subset of extracted values, e.g., those for which less than a required degree of confidence is achieved through automated extraction and/or validation, may be validated manually, by a human indexer or other operator. Once all data has been validated, output is delivered (110), e.g., by storing the document image and associated data in an enterprise content management system or other repository.
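  • By way of illustration only, the routing of extracted values to automated or manual validation (108) might be sketched as follows. The `Occurrence` structure, the `needs_manual_validation` function, and the 0.9 confidence threshold are hypothetical and are not taken from the disclosure; the sketch assumes each occurrence of a value carries a confidence score between 0.0 and 1.0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Occurrence:
    """One extracted occurrence of a value (hypothetical structure)."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def needs_manual_validation(occurrences: List[Occurrence],
                            min_confidence: float = 0.9) -> bool:
    """Return True if the value should be routed to a human indexer.

    A value is routed when nothing was extracted, when its occurrences
    disagree with one another, or when any occurrence falls below the
    required degree of confidence.
    """
    if not occurrences:
        return True
    distinct = {occ.text.strip() for occ in occurrences}
    if len(distinct) > 1:
        return True
    return any(occ.confidence < min_confidence for occ in occurrences)
```

  • In this sketch, two occurrences of an account number that read "A-1001" and "A-1O01" would be routed to an operator even if both were extracted with high confidence, because the occurrences disagree.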
  • FIG. 2 is a block diagram illustrating an embodiment of a document capture system and environment. In the example shown, a client system 212 is attached to a scanner 204. Documents are scanned by scanner 204 and the resulting document image is sent by the client system 212 to document capture system 202 for processing, e.g., using all or part of the process of FIG. 1. In the example shown, document capture system 202 uses a library of data entry forms 206 to create a structured representation of data extracted from a scanned document. For example, as in FIG. 1 steps 104 and 106, in some embodiments a document is classified by type and an instance of a corresponding data entry form is created and populated with data values extracted from the document image. In some embodiments, data validation may be performed, at least in part, by document capture system 202 by accessing external data 208 via a network 210. For example, an external third party database that associates street addresses with correct postal zip codes may be used to validate a zip code value extracted from a document. In the example shown, validation may be performed at least in part by a plurality of manual indexers each using an associated client system 212 to communicate via network 210 with document capture system 202. For example, document capture system 202 may be configured to queue human validation tasks and to serve tasks out to indexers using clients 212. Each client system 212 may use a browser based and/or installed client software provided functionality to validate data as described herein. In some embodiments, once validation has been completed the resulting raw document image and/or form data are delivered as output, for example by storing the document image and associated form data in a repository 214, such as an enterprise content management (ECM) or other repository.
  • FIG. 3 is a block diagram illustrating an embodiment of a document capture system. In the example shown, the document capture system 202 of FIG. 2 is shown to receive document image data, e.g., via network 204 from a scanning client system 212. Document image data is received in some embodiments in batches and is stored in an image store 308. Document image data is provided to a data extraction module 310 which uses a data entry forms library 312 to classify each document by type and create an instance of a type-specific data entry form. Data extraction module 310 uses OCR, OMR, and/or other techniques to extract data values from the document image and uses the extracted values to populate the corresponding data entry form instance. In some embodiments, data extraction module 310 may provide a score or other indication of a degree of confidence with which an extracted value has been determined based on a corresponding portion of the document image. In some embodiments, for each data entry form field a corresponding location within the document image from which the data value entered by the extraction module in that form field was extracted, for example the portion that shows the text to which OCR or other techniques were applied to determine the text present in the image, is recorded. In the example shown, the data extraction module 310 provides the populated form to a validation module 314 configured to perform validation (automated and/or human as configured and/or required). In some embodiments, the validation module 314 applies one or more validation rules to identify fields that may require a human operator to validate. 
In the example shown, the validation module 314 may communicate via a communications interface 316, for example a network interface card or other communications interface, to obtain external data to be used in validation and/or to generate and provide to human indexers via associated client systems, such as one or more of clients 212 of FIG. 2, tasks to perform human/manual validation of all or a subset of form fields. The validated data is provided to a delivery/output module 318 configured to provide output via communications interface 316, for example by storing the document image and/or extracted data (structured data as captured using the corresponding data entry form) in an enterprise content management system or other repository.
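  • As one possible illustration, the per-field record of an extracted value, its confidence, and the location from which it was extracted might be represented as shown below. The `ExtractedValue` structure, the pixel bounding box convention, and the helper names are assumptions made for the sketch and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExtractedValue:
    """A value produced by extraction, with its provenance (hypothetical)."""
    text: str
    confidence: float
    page_index: int                   # page of the document image
    bbox: Tuple[int, int, int, int]   # assumed (left, top, right, bottom) in pixels


def populate_form(extracted: Dict[str, ExtractedValue]) -> Dict[str, str]:
    """Build the form field values from the extracted records."""
    return {name: value.text for name, value in extracted.items()}


def source_location(extracted: Dict[str, ExtractedValue],
                    field_name: str) -> Tuple[int, Tuple[int, int, int, int]]:
    """Return the page and bounding box from which a field was extracted."""
    value = extracted[field_name]
    return value.page_index, value.bbox
```

  • Keeping the location alongside the value is what would allow a validation interface to highlight or crop the corresponding portion of the document image for any form field.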
  • FIG. 4 is a block diagram illustrating an embodiment of a data validation user interface. In the example shown, validation interface 400 includes a document image display area 402, a data entry form interface 404, and a navigation frame 406. A document image 408 is displayed in document image display area 402. In the example shown, portions of document image 408 that correspond to data entry form fields in the form shown in data entry form interface 404 are highlighted, as indicated in FIG. 4 by the cross-hatched rectangles in document image 408 as shown. In this example, thumbnails are shown in navigation frame 406, each corresponding for example to an associated document and/or page from which data has been captured. In this example, the topmost thumbnail image as shown in navigation frame 406 of FIG. 4 is highlighted (thicker outer outline), indicating that document image 408 as displayed in document image display area 402 corresponds to the topmost thumbnail. In some embodiments, controls are provided (e.g., on screen controls, keystrokes or combinations, etc.) to enable the operator to pan, scroll, and/or zoom in/out with respect to the document image 408, for example to focus and zoom in on (magnify) a particular portion of the document image 408. In some embodiments, as the operator validates each field a cursor advances to the next field and a corresponding portion of the document image 408 is highlighted.
  • FIG. 5 is a screen shot illustrating an embodiment of a technique to minimize eye strain and/or fatigue in manual indexing. In the example shown, partial screen shot 500 includes a portion of a manual data validation user interface that includes a data entry form field 502, in this example with a current value of “888-555-1348” displayed, and nearby to the form field, as displayed in the data entry form portion of the data validation interface, a snippet 504 taken from a corresponding document image, which shows just the portion of the document image that contains the image of the text (in this case numerical values) extracted from the document to populate the form field 502. In this example, a confirmation or other informational and/or error message 506 similarly is displayed near the form field 502. As a result, the form field 502, corresponding snippet 504, and confirmation message 506 are all in the line of sight, or nearly so, at the same time, enabling all information required to validate the value entered in the form field 502, including entering any correction that may be required, to be viewed at the same time and/or with minimal eye or head movement and without requiring the operator to scan back and forth between the document image frame and the data entry form, and/or to scroll, pan, or zoom in/out in the document image as viewed to locate and scale to a readable size the text to be validated. In some embodiments, the snippet 504 is scaled to ensure readability, for example by including in the snippet only (or mostly) the text to be validated and scaling the image to a readable size, for example until the image is of at least a prescribed minimum size and/or the displayed characters are of a prescribed minimum “point” or other size.
  • In some embodiments, as an operator finishes validation of a field, indicated for example by pressing the “enter” key or selecting another key or on screen control, the system automatically pans to the next data entry form field and retrieves and displays near the form field a corresponding document image snippet. In this way, the operator can navigate through the form and corresponding portions of the document image without retargeting, i.e., without having to redirect their eyes to a different point or points on the screen.
  • FIG. 6 is a flow chart illustrating an embodiment of a process to facilitate manual indexing. In various embodiments, the process of FIG. 6 is used to provide an interface such as the one shown and described above in connection with FIG. 5. In the example shown in FIG. 6, a snippet containing the text or other document image portion corresponding to a data entry form field to be validated is obtained, and an association between the snippet and/or the associated location in the document image, on the one hand, and the corresponding form field, on the other hand, is stored (602). The snippet is scaled as/if needed for readability (604). The scaled (if applicable) snippet is displayed adjacent or otherwise near to the form field where corresponding extracted data to be validated is displayed and/or entered (606).
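  • The scaling step (604) might, purely as an example, be computed as shown below. The minimum dimensions of 200 by 40 pixels and the function names are hypothetical; the sketch assumes the snippet is only ever enlarged, never reduced, and that its aspect ratio is preserved.

```python
def snippet_scale(width: int, height: int,
                  min_width: int = 200, min_height: int = 40) -> float:
    """Return the enlargement factor needed for a snippet to reach at
    least the prescribed minimum width and height (never below 1.0)."""
    if width <= 0 or height <= 0:
        raise ValueError("snippet dimensions must be positive")
    return max(1.0, min_width / width, min_height / height)


def scaled_size(width: int, height: int,
                min_width: int = 200, min_height: int = 40) -> tuple:
    """Return the (width, height) of the snippet after scaling."""
    factor = snippet_scale(width, height, min_width, min_height)
    return round(width * factor), round(height * factor)
```

  • Under these assumptions a 100 by 20 pixel snippet would be doubled to 200 by 40 pixels, while a snippet that already meets both minimums would be displayed at its original size.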
  • Typically, as noted above pages comprising a multiple page document have been processed separately, each page having its own corresponding electronic data entry form associated with it. The per-page form approach has a number of shortcomings. For example, a value (e.g., an account number on the footer of each page in an invoice) may occur in several pages. An error on a single page results in work for the operator, because typically there is no framework to reconcile data across pages and to auto-correct data. In addition, in production, the operator will only become aware of the problem when he navigates to the page. If there are large discrepancies between values of many pages, then the operator must manually look at each page and that takes time.
  • In semi-structured and unstructured documents, there can be any number of variations of pages. If the data-entry form is page-based, and a unique form per page is used, this results in an unmanageable number of forms. If a generic form that contains a union of possible fields is used, this results in forms with unused fields. This requires extra work to handle. Furthermore, if a value is copied from another page, its source value and location typically is not shown because only the current page, and not the page from which the copied value was extracted, is shown. If the page is changed, it would result in the data-entry form being changed. Changing the data-entry form then disrupts the sequence of work, resulting in lower operator efficiency.
  • Under the form-per-page approach, when a table spans multiple pages, the technique of copying data between pages results in a large set of duplicate values. Extra effort is then needed to synchronize if the user makes any changes on any page. The navigation problem described above is compounded. For example, suppose the sub-totals on a multi-page invoice line items table do not add up. It is more cumbersome for the operator to go through each page and then each table, and to work with duplicate row values.
  • In content management systems, the metadata object is usually not defined per page. To export per-page forms, effort must be made to map values to their corresponding attributes of a metadata object used to represent the multi-page document in the content management system.
  • In light of all the foregoing shortcomings of the per-page approach to document capture as applied to multi-page documents, automatic detection and processing of a multi-page document as a single document is disclosed. In various embodiments, automatic detection of the pages comprising a multiple page document is performed. Data values are extracted from the pages comprising the document and used to populate a single electronic data entry form for the multi-page document. The operator can then go through the electronic data entry form, for example to validate data fields as required, and the document capture and/or validation system shows the location in the captured document of the corresponding data, regardless of which page(s) it occurs in, rather than the operator having to find and/or choose each page, indexing each independently, and then reconcile later data that occurs in and/or spans multiple pages.
  • FIG. 7 is a block diagram illustrating an embodiment of a document capture system and process. In the example shown, scanned pages 702, 704, and 706 comprise a multi-page document. First page recognition and/or other techniques are applied in various embodiments to detect automatically the beginning and/or ending of a multi-page document such as the document comprising pages 702, 704, and 706. The pages 702, 704, and 706 are identified through a process 708 as comprising a single multi-page document. A corresponding document type is determined and data values are extracted from pages 702, 704, and 706 to populate a single data entry form 710 configured to capture data values extracted from the multi-page document. In the example shown, the respective locations within the page images 702, 704, and 706 of data extracted to populate form 710 are shown as small cross-hatched rectangles. The rows at the bottom of page 702 and the top of page 704 in this example comprise a single table, list, or other array that spans pages 702 and 704. The corresponding extracted data values are in some embodiments captured initially in page specific arrays, the rows of which are concatenated in the example shown to populate the single table at the bottom of form 710.
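  • For illustration, the concatenation of page specific arrays into the single table of form 710 might be sketched as follows. The row representation (a dictionary per row) and the `source_page` key are assumptions made for the sketch; they are one way the originating page of each row could be retained for later navigation.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def concatenate_page_arrays(page_arrays: List[Tuple[int, List[Row]]]) -> List[Row]:
    """Combine per-page row lists, given in page order, into one table.

    Each output row is a copy of the input row tagged with the page it
    came from, so the input rows are left unmodified.
    """
    combined: List[Row] = []
    for page_number, rows in page_arrays:
        for row in rows:
            combined.append({**row, "source_page": page_number})
    return combined
```

  • With one row extracted from page 702 and three rows from page 704, the combined table would hold four rows in page order, each still traceable to its page.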
  • FIG. 8 is a block diagram illustrating an embodiment of an interface to validate a multi-page document. In the example shown, the interface 800 includes a page image display area 802, in which in the example shown an image of page 702 of FIG. 7 is shown. The interface 800 further includes a data entry form area 804, in this example corresponding to the form 710 of FIG. 7. Thumbnails for the pages 702, 704, and 706 of FIG. 7 (not numbered individually in FIG. 8) are displayed in navigation pane 806. In the example shown, the topmost thumbnail as displayed in navigation pane 806 is highlighted as being currently “selected” for display in page image display area 802. In various embodiments, selection by a human operator of a thumbnail in navigation pane 806 results in an image of the corresponding page being displayed in page image display area 802. In some embodiments, as an operator navigates to different form fields in the form area 804, a corresponding portion or portions of the multi-page document, in one or more pages, may be navigated to automatically. For example, navigation to a first row of the three column table at the bottom of the form in this example may in some embodiments cause the first page 702 to be displayed. Selection of a cell in one of the bottom three rows, either manually or automatically as the system advances to a next field to be validated, in various embodiments may cause the second page of the multi-page document, from which the corresponding data was extracted in this example, to be displayed in the page image display area 802. In various embodiments, selection of a field in form area 804 results in a snippet of a corresponding portion of the page from which the data associated with that field was extracted being determined, retrieved, and displayed, for example in a location adjacent or nearly adjacent to the field, as described above.
  • FIG. 9A is a flow chart illustrating an embodiment of a process to capture document data. In the example shown, the beginning and/or end of a multi-page document is/are detected (902). For example, known techniques to detect a first page may be used, and a multi-page document may be determined to have been encountered if one or more subsequent pages are scanned before a next “first” page is detected. A document type is determined and a corresponding data entry form instance is created (904). Scalar (single value) and array (tables, lists, or other two dimensional sets of data) data values are identified and extracted, for example using OCR, OMR, or other automated extraction techniques (906). Occurrences of the same and/or dependent values in multiple locations, including across page boundaries, may be used to perform automated and/or manual validation (908). For example, a name that appears at the beginning of a life insurance application and again in an attached report of a physical examination may be cross-checked to determine the accuracy of data extraction from one or both of the locations. Rows of arrays that span multiple pages are concatenated into a single form table (910). Array values may be validated using the full table, including across page boundaries (912). For example, quantity and unit price fields may be multiplied and the result compared to a line item subtotal, subtotals in all rows (including potentially across page boundaries) may be summed and compared to an extracted total, etc.
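  • As a non-limiting sketch, the array validation of step 912 might be carried out over the concatenated table as shown below. The field names `quantity`, `unit_price`, and `subtotal`, and the use of exact decimal arithmetic on string values, are assumptions made for the example.

```python
from decimal import Decimal
from typing import Dict, List, Tuple


def validate_line_items(rows: List[Dict[str, str]],
                        extracted_total: str) -> Tuple[List[int], bool]:
    """Check each row and the overall total of a full (multi-page) table.

    Returns the indexes of rows whose quantity times unit price differs
    from the extracted subtotal, and whether the extracted subtotals sum
    to the extracted total.
    """
    bad_rows: List[int] = []
    running_sum = Decimal("0")
    for index, row in enumerate(rows):
        expected = Decimal(row["quantity"]) * Decimal(row["unit_price"])
        subtotal = Decimal(row["subtotal"])
        if expected != subtotal:
            bad_rows.append(index)
        running_sum += subtotal
    return bad_rows, running_sum == Decimal(extracted_total)
```

  • Because the rows from all pages are already in one table, the same loop covers line items on either side of a page boundary without any page-specific handling.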
  • FIG. 9B is a flow chart illustrating an embodiment of a process to capture document data. In the example shown, a library of metadata document types is defined, with each document type containing scalar fields and tables of array fields (922). Automatic page recognition is done as in prior art, with page types determined (924). Values are extracted into per-page scalar and array fields by name, and each field's location on the page is saved (926). The multi-page document type is determined from an analysis of the stream of page types (928). Data from the component pages is automatically combined into the document type (930). A given named scalar field may occur on any page, or in multiple pages. Data validation is performed (932).
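  • One simple way the analysis of the stream of page types (928) could be approached is sketched below. The disclosure does not specify the analysis; the sketch assumes, hypothetically, that each multi-page document type is described by a first-page type and a set of permitted continuation page types, and all type names shown are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical library: document type -> (first page type, continuation page types)
DOCUMENT_TYPES: Dict[str, Tuple[str, Set[str]]] = {
    "invoice": ("invoice_first", {"invoice_continuation", "invoice_summary"}),
    "application": ("application_first", {"medical_report"}),
}


def group_pages(page_types: List[str],
                document_types: Dict[str, Tuple[str, Set[str]]] = DOCUMENT_TYPES
                ) -> List[Tuple[str, List[int]]]:
    """Split a stream of page types into (document type, page positions)."""
    by_first_page = {first: name for name, (first, _) in document_types.items()}
    documents: List[Tuple[str, List[int]]] = []
    current = None
    for position, page_type in enumerate(page_types):
        if page_type in by_first_page:
            # A recognized first page starts a new document.
            current = (by_first_page[page_type], [position])
            documents.append(current)
        elif current is not None and page_type in document_types[current[0]][1]:
            # A permitted continuation page joins the open document.
            current[1].append(position)
        else:
            # Anything else is set aside as its own unrecognized entry.
            current = None
            documents.append(("unrecognized", [position]))
    return documents
```

  • Under these assumptions, a stream of an invoice first page followed by two invoice continuation pages would be grouped into a single three-page invoice document.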
  • FIG. 10 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document. In the example shown, an indication is received that an operator is done validating a currently displayed data value (1002), e.g., the “Date” field in the example shown in FIG. 8. If no more fields remain to be validated (1004), the process ends. Otherwise, if the next field to be validated is on the same page (1006) the next field in the data entry form is advanced to and displayed, and a corresponding snippet or other portion of the current page, from which the associated data value to be validated was extracted, is displayed adjacent to the form field (1008). If the next form field requiring validation is associated with data from a different page of the multi-page document (1006), the system automatically retrieves or otherwise accesses the other page and/or an applicable portion thereof (e.g., a corresponding snippet) (1010), transparently to the human operator, and the next form field and the corresponding snippet obtained from the other page of the multi-page document are displayed for validation (1008) transparently to and without requiring any further action by the human operator.
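  • The decision made at steps 1004 and 1006 might be illustrated as follows. The `FormField` structure and the `advance` function are hypothetical names introduced for the sketch; retrieval and display of the snippet itself are outside its scope.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FormField:
    """A form field and the page its value was extracted from (hypothetical)."""
    name: str
    page: int
    validated: bool = False


def advance(fields: List[FormField],
            current_page: int) -> Optional[Tuple[FormField, bool]]:
    """Find the next field to validate.

    Returns None when no fields remain (1004). Otherwise returns the next
    unvalidated field and a flag that is True when its value came from a
    page other than the one currently displayed (1006), in which case the
    other page would be retrieved before display (1010).
    """
    for form_field in fields:
        if not form_field.validated:
            return form_field, form_field.page != current_page
    return None
```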
  • FIG. 11 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document. In the example shown, a definition of a library of validation rules is received (1102). Examples include, without limitation, a rule requiring that a first value extracted from a named field A must match a second value extracted from a named field B. Another example is a rule requiring that a sum or other computation based on a specified set of fields must equal a value extracted from another named field, e.g., subtotals in an array must sum to equal a total. Document type definitions are received (1104). Each definition identifies validation rules to be applied, and as applicable a mapping to the document type fields to be used to apply each rule. An operator interface is provided that facilitates multi-field validation, including across page boundaries (1106). In some embodiments, for example, the interface enables an operator to iterate through just the dependent fields that require validation. As the operator corrects and/or confirms the entered value for a first dependent field, for example, and hits “enter”, the system advances automatically to display a next one of the dependent fields and its associated document image portion, from whichever page in which it may be located. The system iterates through the dependent fields until the operator enters data that clears the validation error and/or there are no more dependent fields to be displayed. In various embodiments, by combining data extracted from multiple page images of a multi-page document into a single document type and associated data entry form, automated and manual validation of dependent data fields that occur on or across different pages is facilitated, without requiring software code and/or human action to navigate between data entry forms used to capture data extracted from individual pages.
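  • By way of example only, a rule library (1102) and a document type definition that maps rules to fields (1104) might take the form sketched below. The rule names, the `INVOICE_TYPE` definition, and the field names are all hypothetical and are not taken from the disclosure.

```python
from decimal import Decimal
from typing import Callable, Dict, List, Set


def must_match(first, second) -> bool:
    """Rule: a value from field A must match a value from field B."""
    return first == second


def must_sum_to(parts: List[str], total: str) -> bool:
    """Rule: the values in an array field must sum to another field."""
    return sum((Decimal(p) for p in parts), Decimal("0")) == Decimal(total)


# Hypothetical library of reusable validation rules, keyed by name.
RULE_LIBRARY: Dict[str, Callable[..., bool]] = {
    "must_match": must_match,
    "must_sum_to": must_sum_to,
}

# Hypothetical document type definition: which rules apply, and the
# document type fields each rule is applied to.
INVOICE_TYPE = {
    "rules": [
        {"rule": "must_match", "fields": ["account_header", "account_footer"]},
        {"rule": "must_sum_to", "fields": ["subtotals", "total"]},
    ]
}


def fields_in_error(doc_type: dict, values: dict,
                    library: Dict[str, Callable[..., bool]] = RULE_LIBRARY) -> Set[str]:
    """Apply each mapped rule and collect the dependent fields of any that fail."""
    errors: Set[str] = set()
    for binding in doc_type["rules"]:
        arguments = [values[name] for name in binding["fields"]]
        if not library[binding["rule"]](*arguments):
            errors.update(binding["fields"])
    return errors
```

  • Because the rules refer to fields of the document type rather than to pages, the same definition applies whether the matched values occur on one page or on several.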
  • FIG. 12 is a flow chart illustrating an embodiment of a process to perform validation of data values extracted from a multi-page document. In the example shown, an instance of a multi-page document type is received (1202). Applicable validation rules are evaluated, e.g., sequentially, including those requiring the concurrent processing of data values extracted from different pages, and any dependent fields are marked as having an error if a rule fails (1204). If during validation by a human operator (see, e.g., FIG. 10) a data value as extracted is corrected, e.g., the human operator enters a corrected value in a form field, validation rules affected by the change are re-evaluated, for example to ensure a correction that satisfied a first rule did not introduce an inconsistency that caused a second rule to not be satisfied. If the value for a field is visually confirmed with the page image to be correct, then the field can be flagged so it is henceforth no longer marked as having an error when a rule is re-evaluated (1204). In this way, operators can be more efficient by navigating only to unconfirmed fields.
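  • The re-evaluation described above might be sketched as follows, under the assumption that each rule is represented as a tuple of field names paired with a predicate and that visually confirmed fields are tracked in a set. These representations and the `reevaluate` name are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Rule = Tuple[Tuple[str, ...], Callable[..., bool]]


def reevaluate(rules: Iterable[Rule], values: Dict[str, str],
               confirmed: Set[str], changed_field: str) -> Set[str]:
    """Re-run only the rules affected by a corrected field.

    Returns the fields to mark as having an error. Fields the operator
    has visually confirmed against the page image are never re-marked.
    """
    marked: Set[str] = set()
    for field_names, predicate in rules:
        if changed_field not in field_names:
            continue  # rule is unaffected by this correction
        if not predicate(*(values[name] for name in field_names)):
            marked.update(name for name in field_names if name not in confirmed)
    return marked
```

  • In this sketch, a correction that satisfies one rule but causes a second rule to fail would leave only the unconfirmed fields of the second rule marked, so the operator is directed to those fields alone.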
  • In various embodiments, human operator validation of errors involving fields that have dependency relationships with other fields, such as a “name” value that occurs in more than one page of a multi-page document, is facilitated by displaying the fields together, in a single screen, along with each field's corresponding document image snippet, even if the snippets come from different pages. Likewise, as an operator iterates through error fields in a table or other two dimensional data structure, corresponding snippets are displayed, even if they come from multiple, different pages. The human operator need only navigate through fields in a single data entry form, and the system transparently retrieves and displays for each field its corresponding snippet or other partial image, without regard to page boundaries.
  • Using techniques described herein, multi-page documents can be processed more efficiently in the document capture context. Values in the same document can be reconciled and either auto-corrected or flagged for manual confirmation without switching between documents or data entry forms, copying over data values from one form to the other, etc. This facilitates use of data redundancy found in many document images. In addition, the data entry form is abstracted from its page definition. The operator does not have to worry about where a value is, enabling the operator to focus on validating data on the form. If there are variations in page versions, the operator does not have to worry about them. The location logic will find the right place. In addition, the developer and operator do not have to incur the cost and complexity of copying data back and forth between page forms. Array data is shown in one table, rather than in a table per page, thereby improving the user experience. Finally, it is easier to map content management metadata objects to new document types, since all of the extracted data values for and structure of a multi-page document are captured in one form.
  • Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims (20)

What is claimed is:
1. A method of capturing document data, comprising:
obtaining a multi-page document;
extracting data from multiple pages of the multi-page document;
identifying two or more values located on at least two different pages of the multi-page document, wherein at least a first one of the two or more values is dependent on at least a second one of the two or more values; and
validating the at least the first one of the two or more values and the at least the second one of the two or more values according to one or more validation rules.
2. The method of claim 1, wherein the validating comprises: accessing a set of validation rules in a library of validation rules; sequentially applying the set of validation rules to the at least the first one of the two or more values and the at least the second one of the two or more values; and marking the at least the first one of the two or more values and the at least the second one of the two or more values as having an error if one of the validation rules fails.
3. The method of claim 2, further comprising receiving a document type definition corresponding to the multi-page document, wherein the document type definition identifies the set of validation rules to be applied to the at least the first one of the two or more values and the at least the second one of the two or more values.
4. The method of claim 3, wherein the document type definition includes a mapping to document type fields to be used to apply each rule.
5. The method of claim 4, further comprising identifying a document type of the multi-page document, and identifying the document type definition based on the identified document type of the multi-page document.
6. The method of claim 5, wherein the document type contains one or more scalar fields and one or more tables of array fields.
7. The method of claim 6, further comprising extracting values from each page into per-page scalar and array fields by name.
8. The method of claim 7, wherein for each extracted value, a corresponding location on the page from which the value was extracted is saved.
9. The method of claim 6, further comprising combining data extracted from the respective pages into a form associated with the document type, wherein combining data extracted from the respective pages into a form associated with the document type includes forming an array that spans multiple pages concatenating a first set of rows of values extracted from a first page with a second set of rows of values extracted from a second page to create a combined set of rows to be included in the document type.
10. The method of claim 1, wherein the validating comprises providing one or more of the at least the first one of the two or more values and the at least the second one of the two or more values to a user for manual validation.
11. The method of claim 10, further comprising presenting an interface to the user, wherein the interface displays to the user a plurality of fields which are marked as having errors and enables the user to iterate through the plurality of fields, wherein the plurality of fields displayed to the user include only dependent fields that require validation.
12. The method of claim 11, further comprising identifying a document type of the multi-page document, and creating an instance of a selected one of a plurality of type-specific data entry forms in a forms library based at least in part on the identified document type of the multi-page document, wherein the plurality of fields displayed to the user are fields contained in the created instance.
13. The method of claim 12, wherein the two or more values presented to the user for manual validation are identified by determining whether the first one of the two or more values matches the second one of the two or more values in the multi-page document.
14. The method of claim 11, wherein the interface is configured to repetitively iterate through each of the plurality of fields until the user enters data that clears the validation error associated with the field.
15. The method of claim 11, further comprising populating, based at least in part on the data extracted from the multi-page document, a plurality of fields of the instance of the selected data entry form including the plurality of fields displayed to the user.
16. The method of claim 11, wherein as each form field is displayed, a corresponding snippet or other partial image from a page from which a current data value associated with the form field was extracted is displayed adjacent to the field.
17. The method of claim 1, wherein the first and second ones of the two or more values are contained in a table or array of the multi-page document.
18. The method of claim 1, further comprising determining that a sequence of pages comprises the multi-page document, wherein determining that a sequence of pages in a stream of document page images comprises the single multi-page document includes processing each page individually to determine a corresponding page type; and processing the stream of page types to identify a sequence associated with a multi-page document type.
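The two-stage approach recited in claim 18 (classify each page individually, then scan the resulting stream of page types for a known sequence) might be sketched as follows. The page-type labels, the `DOCUMENT_PATTERNS` table, and the `segment_stream` function are hypothetical names chosen for illustration; the claims do not specify how sequences are matched, and this sketch simply assumes a greedy longest-match against fixed patterns.

```python
from typing import Dict, List, Tuple

# Hypothetical document types, each defined as an ordered sequence of page types.
DOCUMENT_PATTERNS: Dict[str, List[str]] = {
    "invoice": ["invoice_first", "invoice_items", "invoice_total"],
    "short_invoice": ["invoice_first", "invoice_total"],
}


def segment_stream(page_types: List[str]) -> List[Tuple[str, int, int]]:
    """Group a stream of per-page type labels into multi-page documents.

    Returns (document_type, start_index, end_index_exclusive) tuples.
    A page that begins no known sequence is reported as "unknown".
    """
    documents = []
    i = 0
    while i < len(page_types):
        best = None
        for doc_type, pattern in DOCUMENT_PATTERNS.items():
            if page_types[i:i + len(pattern)] == pattern:
                # Prefer the longest pattern that matches at this position.
                if best is None or len(pattern) > len(DOCUMENT_PATTERNS[best]):
                    best = doc_type
        if best is None:
            documents.append(("unknown", i, i + 1))
            i += 1
        else:
            end = i + len(DOCUMENT_PATTERNS[best])
            documents.append((best, i, end))
            i = end
    return documents
```

In this sketch the per-page classifier is assumed to have already run; only the second stage, which works on the stream of page types, is shown.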
19. A document capture system, comprising:
a communication or other interface configured to receive a multi-page document; and
one or more processors coupled to the interface and configured to:
obtain a multi-page document;
extract data from multiple pages of the multi-page document;
identify two or more values located on at least two different pages of the multi-page document, wherein at least a first one of the two or more values is dependent on at least a second one of the two or more values; and
validate the at least the first one of the two or more values and the at least the second one of the two or more values according to one or more validation rules.
20. A computer program product to capture document data, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for:
obtaining a multi-page document;
extracting data from multiple pages of the multi-page document;
identifying two or more values located on at least two different pages of the multi-page document, wherein at least a first one of the two or more values is dependent on at least a second one of the two or more values; and
validating the at least the first one of the two or more values and the at least the second one of the two or more values according to one or more validation rules.
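As a rough illustration of the claimed steps, the sketch below extracts nothing itself but shows how per-page data (scalar fields and table rows, each value tagged with its source page) could be combined across pages and then checked with a rule that relates values on different pages. All names here (`Value`, `PageData`, `combine_pages`, `validate_total`, and the `total`/`amount` fields) are hypothetical, the rule is an assumed example rather than one taken from the claims, and the values are assumed to parse as decimals.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Dict, List


@dataclass
class Value:
    text: str
    page: int  # page the value was extracted from (illustrative location info)


@dataclass
class PageData:
    scalars: Dict[str, Value] = field(default_factory=dict)
    rows: List[Dict[str, Value]] = field(default_factory=list)


def combine_pages(pages: List[PageData]) -> PageData:
    """Merge per-page data: scalars by name, table rows by concatenation."""
    combined = PageData()
    for page in pages:
        combined.scalars.update(page.scalars)
        combined.rows.extend(page.rows)
    return combined


def validate_total(doc: PageData) -> List[str]:
    """Hypothetical rule: 'total' must equal the sum of all row 'amount' values.

    Returns the names of fields to flag for manual validation (empty if valid).
    """
    if "total" not in doc.scalars:
        return ["total"]
    line_sum = sum((Decimal(r["amount"].text) for r in doc.rows), Decimal("0"))
    if Decimal(doc.scalars["total"].text) != line_sum:
        return ["total"]
    return []


# The total on page 2 depends on line amounts found on pages 1 and 2.
page1 = PageData(rows=[{"amount": Value("10.00", 1)}, {"amount": Value("5.50", 1)}])
page2 = PageData(
    scalars={"total": Value("20.00", 2)},
    rows=[{"amount": Value("4.50", 2)}],
)
document = combine_pages([page1, page2])
flagged = validate_total(document)  # empty list: 10.00 + 5.50 + 4.50 == 20.00
```

Keeping the source page on every value is what would let a review interface show the relevant page snippet next to a flagged field; how that location is recorded in practice is not specified here.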
US16/953,561 2012-12-19 2020-11-20 Multi-page document recognition in document capture Abandoned US20210073531A1 (en)

Priority Applications (2)

Application Number Publication Number Priority Date Filing Date Title
US16/953,561 US20210073531A1 (en) 2012-12-19 2020-11-20 Multi-page document recognition in document capture
US17/940,777 US11868717B2 (en) 2012-12-19 2022-09-08 Multi-page document recognition in document capture

Applications Claiming Priority (4)

Application Number Publication Number Priority Date Filing Date Title
US13/720,671 US9430453B1 (en) 2012-12-19 2012-12-19 Multi-page document recognition in document capture
US15/221,433 US10248858B2 (en) 2012-12-19 2016-07-27 Multi-page document recognition in document capture
US16/290,453 US10860848B2 (en) 2012-12-19 2019-03-01 Multi-page document recognition in document capture
US16/953,561 US20210073531A1 (en) 2012-12-19 2020-11-20 Multi-page document recognition in document capture

Related Parent Applications (1)

Application Number Relation Publication Number Priority Date Filing Date Title
US16/290,453 Continuation US10860848B2 (en) 2012-12-19 2019-03-01 Multi-page document recognition in document capture

Related Child Applications (1)

Application Number Relation Publication Number Priority Date Filing Date Title
US17/940,777 Continuation US11868717B2 (en) 2012-12-19 2022-09-08 Multi-page document recognition in document capture

Publications (1)

Publication Number Publication Date
US20210073531A1 (en) 2021-03-11

Family

ID=56739455

Family Applications (5)

Application Number Status Publication Number Priority Date Filing Date Title
US13/720,671 Active 2033-10-27 US9430453B1 (en) 2012-12-19 2012-12-19 Multi-page document recognition in document capture
US15/221,433 Active 2033-08-26 US10248858B2 (en) 2012-12-19 2016-07-27 Multi-page document recognition in document capture
US16/290,453 Active US10860848B2 (en) 2012-12-19 2019-03-01 Multi-page document recognition in document capture
US16/953,561 Abandoned US20210073531A1 (en) 2012-12-19 2020-11-20 Multi-page document recognition in document capture
US17/940,777 Active US11868717B2 (en) 2012-12-19 2022-09-08 Multi-page document recognition in document capture

Family Applications Before (3)

Application Number Status Publication Number Priority Date Filing Date Title
US13/720,671 Active 2033-10-27 US9430453B1 (en) 2012-12-19 2012-12-19 Multi-page document recognition in document capture
US15/221,433 Active 2033-08-26 US10248858B2 (en) 2012-12-19 2016-07-27 Multi-page document recognition in document capture
US16/290,453 Active US10860848B2 (en) 2012-12-19 2019-03-01 Multi-page document recognition in document capture

Family Applications After (1)

Application Number Status Publication Number Priority Date Filing Date Title
US17/940,777 Active US11868717B2 (en) 2012-12-19 2022-09-08 Multi-page document recognition in document capture

Country Status (1)

Country Link
US (5) US9430453B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20230010577A (en) 2021-07-12 2023-01-19 한국생명공학연구원 Cutibacterium avidum strain, culture medium from thereof and anti-bacteria uses of thereof

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9430453B1 (en) 2012-12-19 2016-08-30 Emc Corporation Multi-page document recognition in document capture
US9288361B2 (en) * 2013-06-06 2016-03-15 Open Text S.A. Systems, methods and computer program products for fax delivery and maintenance
CA2924711A1 (en) * 2013-09-25 2015-04-02 Chartspan Medical Technologies, Inc. User-initiated data recognition and data conversion process
US11049190B2 (en) 2016-07-15 2021-06-29 Intuit Inc. System and method for automatically generating calculations for fields in compliance forms
US10725896B2 (en) 2016-07-15 2020-07-28 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on code coverage
US10579721B2 (en) 2016-07-15 2020-03-03 Intuit Inc. Lean parsing: a natural language processing system and method for parsing domain-specific languages
US10140277B2 (en) 2016-07-15 2018-11-27 Intuit Inc. System and method for selecting data sample groups for machine learning of context of data fields for various document types and/or for test data generation for quality assurance systems
US11222266B2 (en) 2016-07-15 2022-01-11 Intuit Inc. System and method for automatic learning of functions
US10789422B2 (en) * 2017-09-27 2020-09-29 Equifax Inc. Synchronizing data-entry fields with corresponding image regions
KR102462516B1 (en) * 2018-01-09 2022-11-03 삼성전자주식회사 Display apparatus and Method for providing a content thereof
US10970534B2 (en) * 2018-01-29 2021-04-06 Open Text Corporation Document processing system capture flow compiler
US10755039B2 (en) 2018-11-15 2020-08-25 International Business Machines Corporation Extracting structured information from a document containing filled form images
US10402641B1 (en) 2019-03-19 2019-09-03 Capital One Services, Llc Platform for document classification
US11631266B2 (en) * 2019-04-02 2023-04-18 Wilco Source Inc Automated document intake and processing system
US11543943B2 (en) * 2019-04-30 2023-01-03 Open Text Sa Ulc Systems and methods for on-image navigation and direct image-to-data storage table data capture
JP7338230B2 (en) * 2019-05-13 2023-09-05 富士フイルムビジネスイノベーション株式会社 Information processing device and information processing program
US11163956B1 (en) 2019-05-23 2021-11-02 Intuit Inc. System and method for recognizing domain specific named entities using domain specific word embeddings
US11295072B2 (en) * 2019-06-03 2022-04-05 Adp, Llc Autoform filling using text from optical character recognition and metadata for document types
US11048933B2 (en) * 2019-07-31 2021-06-29 Intuit Inc. Generating structured representations of forms using machine learning
US11783128B2 (en) 2020-02-19 2023-10-10 Intuit Inc. Financial document text conversion to computer readable operations
CN112380825B (en) * 2020-11-17 2022-07-15 平安科技(深圳)有限公司 PDF document cross-page table merging method and device, electronic equipment and storage medium
JP2022137608A (en) * 2021-03-09 2022-09-22 キヤノン株式会社 Information processing apparatus, information processing method, and program
US12039261B2 (en) * 2022-05-03 2024-07-16 Bold Limited Systems and methods for improved user-reviewer interaction using enhanced electronic documents linked to online documents
US20240303412A1 (en) * 2023-03-06 2024-09-12 Ricoh Company, Ltd. Intelligent document processing assistant

Family Cites Families (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5191525A (en) 1990-01-16 1993-03-02 Digital Image Systems, Corporation System and method for extraction of data from documents for subsequent processing
US5713020A (en) 1993-09-02 1998-01-27 Microsoft Corporation Method and system for generating database queries containing multiple levels of aggregation
US5666490A (en) * 1994-05-16 1997-09-09 Gillings; Dennis Computer network system and method for managing documents
US6345278B1 (en) * 1998-06-04 2002-02-05 Collegenet, Inc. Universal forms engine
US6400845B1 (en) * 1999-04-23 2002-06-04 Computer Services, Inc. System and method for data extraction from digital images
US6163774A (en) 1999-05-24 2000-12-19 Platinum Technology Ip, Inc. Method and apparatus for simplified and flexible selection of aggregate and cross product levels for a data warehouse
US6405207B1 (en) 1999-10-15 2002-06-11 Microsoft Corporation Reporting aggregate results from database queries
US7203663B1 (en) * 2000-02-15 2007-04-10 Jpmorgan Chase Bank, N.A. System and method for converting information on paper forms to electronic data
US20020029207A1 (en) 2000-02-28 2002-03-07 Hyperroll, Inc. Data aggregation server for managing a multi-dimensional database and database management system having data aggregation server integrated therein
US7072897B2 (en) 2000-04-27 2006-07-04 Hyperion Solutions Corporation Non-additive measures and metric calculation
US6735335B1 (en) * 2000-05-30 2004-05-11 Microsoft Corporation Method and apparatus for discriminating between documents in batch scanned document files
US7165065B1 (en) 2000-07-14 2007-01-16 Oracle Corporation Multidimensional database storage and retrieval system
US7324983B1 (en) 2001-11-08 2008-01-29 I2 Technologies Us, Inc. Reproducible selection of members in a hierarchy
US20030140306A1 (en) * 2002-01-18 2003-07-24 Robinson Robert J. System and method for remotely entering and verifying data capture
US6775682B1 (en) 2002-02-26 2004-08-10 Oracle International Corporation Evaluation of rollups with distinct aggregates by using sequence of sorts and partitioning by measures
US7171615B2 (en) * 2002-03-26 2007-01-30 Aatrix Software, Inc. Method and apparatus for creating and filing forms
US7548935B2 (en) 2002-05-09 2009-06-16 Robert Pecherer Method of recursive objects for representing hierarchies in relational database systems
US20030231344A1 (en) * 2002-05-30 2003-12-18 Fast Bruce Brian Process for validating groups of machine-read data fields
US7020649B2 (en) 2002-12-30 2006-03-28 International Business Machines Corporation System and method for incrementally maintaining non-distributive aggregate functions in a relational database
US7953694B2 (en) 2003-01-13 2011-05-31 International Business Machines Corporation Method, system, and program for specifying multidimensional calculations for a relational OLAP engine
US7089266B2 (en) 2003-06-02 2006-08-08 The Board Of Trustees Of The Leland Stanford Jr. University Computer systems and methods for the query and visualization of multidimensional databases
US7707144B2 (en) 2003-12-23 2010-04-27 Siebel Systems, Inc. Optimization for aggregate navigation for distinct count metrics
US7756739B2 (en) 2004-02-12 2010-07-13 Microsoft Corporation System and method for aggregating a measure over a non-additive account dimension
US20050289182A1 (en) * 2004-06-15 2005-12-29 Sand Hill Systems Inc. Document management system with enhanced intelligent document recognition capabilities
US8402361B2 (en) 2004-11-09 2013-03-19 Oracle International Corporation Methods and systems for implementing a dynamic hierarchical data viewer
US20060161525A1 (en) 2005-01-18 2006-07-20 Ibm Corporation Method and system for supporting structured aggregation operations on semi-structured data
US7529408B2 (en) * 2005-02-23 2009-05-05 Ichannex Corporation System and method for electronically processing document images
US7584205B2 (en) 2005-06-27 2009-09-01 Ab Initio Technology Llc Aggregating data with complex operations
AU2006307452B2 (en) * 2005-10-25 2011-03-03 Charactell Ltd Form data extraction without customization
GB2448275A (en) * 2006-01-03 2008-10-08 Kyos Systems Inc Document analysis system for integration of paper records into a searchable electronic database
US7831617B2 (en) 2006-07-25 2010-11-09 Microsoft Corporation Re-categorization of aggregate data as detail data and automated re-categorization based on data usage context
US9715710B2 (en) 2007-03-30 2017-07-25 International Business Machines Corporation Method and system for forecasting using an online analytical processing database
US8640056B2 (en) 2007-07-05 2014-01-28 Oracle International Corporation Data visualization techniques
US7716233B2 (en) 2007-05-23 2010-05-11 Business Objects Software, Ltd. System and method for processing queries for combined hierarchical dimensions
US8140572B1 (en) 2007-07-19 2012-03-20 Salesforce.Com, Inc. System, method and computer program product for aggregating on-demand database service data
US20090049375A1 (en) * 2007-08-18 2009-02-19 Talario, Llc Selective processing of information from a digital copy of a document for data entry
US8290272B2 (en) * 2007-09-14 2012-10-16 Abbyy Software Ltd. Creating a document template for capturing data from a document image and capturing data from a document image
WO2009039530A1 (en) * 2007-09-20 2009-03-26 Kyos Systems, Inc. Method and apparatus for editing large quantities of data extracted from documents
US8200618B2 (en) 2007-11-02 2012-06-12 International Business Machines Corporation System and method for analyzing data in a report
US20090144307A1 (en) 2007-11-29 2009-06-04 Robert Joseph Bestgen Performing Hierarchical Aggregate Compression
FR2924834B1 (en) * 2007-12-10 2010-12-31 Serensia IMPROVED METHOD AND SYSTEM FOR ASSISTED ENTRY IN PARTICULAR FOR COMPUTER MANAGEMENT TOOLS
US8671112B2 (en) * 2008-06-12 2014-03-11 Athenahealth, Inc. Methods and apparatus for automated image classification
US20100017395A1 (en) 2008-07-16 2010-01-21 Sapphire Information Systems Ltd. Apparatus and methods for transforming relational queries into multi-dimensional queries
US8495007B2 (en) 2008-08-28 2013-07-23 Red Hat, Inc. Systems and methods for hierarchical aggregation of multi-dimensional data sources
US8547589B2 (en) * 2008-09-08 2013-10-01 Abbyy Software Ltd. Data capture from multi-page documents
US8856649B2 (en) 2009-06-08 2014-10-07 Business Objects Software Limited Aggregation level and measure based hinting and selection of cells in a data display
JP5175903B2 (en) 2009-08-31 2013-04-03 アクセンチュア グローバル サービスィズ ゲーエムベーハー Adaptive analysis multidimensional processing system
US8321357B2 (en) * 2009-09-30 2012-11-27 Lapir Gennady Method and system for extraction
US20110258195A1 (en) * 2010-01-15 2011-10-20 Girish Welling Systems and methods for automatically reducing data search space and improving data extraction accuracy using known constraints in a layout of extracted data elements
US20120173519A1 (en) 2010-04-07 2012-07-05 Google Inc. Performing pre-aggregation and re-aggregation using the same query language
US8756617B1 (en) * 2010-05-18 2014-06-17 Google Inc. Schema validation for secure development of browser extensions
US8458206B2 (en) 2010-05-28 2013-06-04 Oracle International Corporation Systems and methods for providing custom or calculated data members in queries of a business intelligence server
US10726200B2 (en) * 2011-02-04 2020-07-28 Benjamin Chou Systems and methods for user interfaces that provide enhanced verification of extracted data
US8577826B2 (en) * 2010-07-14 2013-11-05 Esker, Inc. Automated document separation
US20120166616A1 (en) 2010-12-23 2012-06-28 Enxsuite System and method for energy performance management
WO2012095839A2 (en) 2011-01-10 2012-07-19 Optier Ltd. Systems and methods for performing online analytical processing
US8806656B2 (en) 2011-02-18 2014-08-12 Xerox Corporation Method and system for secure and selective access for editing and aggregation of electronic documents in a distributed environment
US8719295B2 (en) 2011-06-27 2014-05-06 International Business Machines Corporation Multi-granularity hierarchical aggregate selection based on update, storage and response constraints
US10769554B2 (en) * 2011-08-01 2020-09-08 Intuit Inc. Interactive technique for using a user-provided image of a document to collect information
US9715625B2 (en) * 2012-01-27 2017-07-25 Recommind, Inc. Hierarchical information extraction using document segmentation and optical character recognition correction
US11631265B2 (en) * 2012-05-24 2023-04-18 Esker, Inc. Automated learning of document data fields
US10235441B1 (en) 2012-06-29 2019-03-19 Open Text Corporation Methods and systems for multi-dimensional aggregation using composition
US10169442B1 (en) 2012-06-29 2019-01-01 Open Text Corporation Methods and systems for multi-dimensional aggregation using composition
US9317484B1 (en) * 2012-12-19 2016-04-19 Emc Corporation Page-independent multi-field validation in document capture
US9032545B1 (en) 2012-12-19 2015-05-12 Emc Corporation Securing visual information on images for document capture
US9430453B1 (en) 2012-12-19 2016-08-30 Emc Corporation Multi-page document recognition in document capture

Also Published As

Publication number Publication date
US20230005285A1 (en) 2023-01-05
US20190197306A1 (en) 2019-06-27
US20160335494A1 (en) 2016-11-17
US10248858B2 (en) 2019-04-02
US9430453B1 (en) 2016-08-30
US10860848B2 (en) 2020-12-08
US11868717B2 (en) 2024-01-09

Similar Documents

Publication Publication Date Title
US11868717B2 (en) Multi-page document recognition in document capture
US10120537B2 (en) Page-independent multi-field validation in document capture
US20230401379A1 (en) Systems and methods for on-image navigation and direct image-to-data storage table data capture
US11182604B1 (en) Computerized recognition and extraction of tables in digitized documents
US20070171473A1 (en) Information processing apparatus, Information processing method, and computer program product
EP0567834A2 (en) Advanced data capture architecture data processing system and method for scanned images of document forms
US20090049375A1 (en) Selective processing of information from a digital copy of a document for data entry
JP2009520246A (en) Format data extraction without customization
US9471800B2 (en) Securing visual information on images for document capture
JP2019169178A (en) Information processing system and processing method of the same, and program
US7936951B2 (en) System for document digitization
US10614125B1 (en) Modeling and extracting elements in semi-structured documents
EP2884425A1 (en) Method and system of extracting structured data from a document
US20170132462A1 (en) Document checking support apparatus, document checking support system, and non-transitory computer readable medium
CN109726369A (en) A kind of intelligent template questions record Implementation Technology based on normative document
US20150227690A1 (en) System and method to facilitate patient on-boarding
JP5766438B2 (en) Method and system for click-through function in electronic media
US20070168916A1 (en) Specification wizard
JP2001005886A (en) Data processor and storage medium
JP2000003403A (en) Method for supporting slip input
WO2022029874A1 (en) Data processing device, data processing method, and data processing program
JP5445740B2 (en) Image processing apparatus, image processing system, and processing program
US10769357B1 (en) Minimizing eye strain and increasing targeting speed in manual indexing operations
US20240205348A1 (en) Display system, display method, and display program for displaying a content of electronic document
US20140281938A1 (en) Finding multiple field groupings in semi-structured documents

Legal Events

Date Code Title Description
AS Assignment

Owner name: OPEN TEXT CORPORATION, CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:EMC CORPORATION;REEL/FRAME:056153/0143

Effective date: 20170112

Owner name: EMC CORPORATION, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HO, MING FUNG;REEL/FRAME:056153/0132

Effective date: 20121218

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION