
WO2001013279A2 - Searchable database from a high volume of captured newsprint data - Google Patents

Info

Publication number
WO2001013279A2
Authority
WO
WIPO (PCT)
Prior art keywords
ocr
image
data
text
newsprint
Prior art date
Application number
PCT/US2000/022492
Other languages
English (en)
Other versions
WO2001013279A9 (fr
WO2001013279A3 (fr
Inventor
John R. Yokley
Don Nissen
Erik Schwartz
Bryan Kornele
Ed Lee
Kevin Kapel
Original Assignee
Ptfs, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ptfs, Inc. filed Critical Ptfs, Inc.
Priority to AU70605/00A priority Critical patent/AU7060500A/en
Publication of WO2001013279A2 publication Critical patent/WO2001013279A2/fr
Publication of WO2001013279A9 publication Critical patent/WO2001013279A9/fr
Publication of WO2001013279A3 publication Critical patent/WO2001013279A3/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text

Definitions

  • the invention relates generally to the fields of data storage and retrieval, including databases, scanning and digitizing, image processing, and searching techniques, and relates more particularly to the various processes involved in the high volume processing of newspaper material to create a word searchable database with key fields.
  • a typical process first involves cutting newspaper articles from a newspaper, stamping a date stamp which contains the date of the paper and the source newspaper from which the article was clipped, and identifying the main subject of the article by circling, underlining, or writing on the clippings themselves. These and other markings on the clippings are referred to as marks or markings. The various markings on the clippings are invariably on the text of the article due to the small margins on newspapers, although they need not be so restricted. The process then involves collecting these articles according to subject and putting the related articles in a properly marked envelope. The envelopes, perhaps millions of them, are then stored.
  • a paper index or card catalog of the subjects is created and the envelopes are made available for research and other purposes to interested parties.
  • the creation of a fully searchable system is made difficult by the sheer volume of envelopes. Further, the articles are prone to deterioration from excessive handling. These factors can combine to limit the number of fully searchable, public accessible systems.
  • microfilm or microfiche also currently provides a method by which old news information can be stored and manually retrieved.
  • the microfilmed information can only be accessed by date or special indices that may have been developed for special collections of news events or articles. Even when these indices have been created, the process of obtaining the correct microfilm or microfiche, mounting it on the optical reader/printer, and locating the correct article by the frame of the microfilm is manual and time intensive.
  • An embodiment of the present invention is directed to overcoming or reducing the effects of at least one of the problems set forth above.
  • Embodiments of the present invention are directed toward providing, to varying degrees, (i) a technique for removing markings made on printed material, including newspapers, (ii) a vacuum fed, belt driven bitonal, grayscale or color scanner having auto detect sensors for start and stop of the belt, and an exit tray which obviates the need to handle the source material after scanning, (iii) a process and tool for recognizing the flow of articles in newspapers, and jumping to additional pages, and of following the story-line while performing OCR, (iv) a process and tool for performing OCR on newspaper articles which results in a quality which is acceptable for word searching, (v) a process and flow for de-columnization of newspaper articles, (vi) a process and flow for creating clipped articles from an image of the full newspaper scanned from microfilm, microfiche, or original copy, (vii) a process and tool for performing custom spell checks meeting the particular needs of newspaper articles by discerning commonly confused letters, recognizing common terms, and rejecting terms which are known never to appear in newspapers, (
  • a method of processing newsprint data which has been scanned into a digital image includes removing marks in the digital image of scanned newsprint data using a grayscale enhance function.
  • the method further includes performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output.
  • the method further includes storing the OCR output in a digital storage medium and controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
  • a method of retrieving digitally stored newsprint data includes providing a database of newsprint information, the database having been created using the method of the previous paragraph.
  • the method further includes searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
  • a computer program product including computer readable program code for processing newsprint data which has been scanned into a digital image.
  • the computer readable program code includes a first program part, a second program part, a third program part, and a fourth program part.
  • the first program part is for removing marks in the digital image of scanned newsprint data using a grayscale enhance function.
  • the second program part is for performing optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output.
  • the third program part is for storing the OCR output in a digital storage medium.
  • the fourth program part is for controlling the work flow between the processes of removing marks, performing OCR, and storing the OCR output.
  • a computer program product including computer readable program code for retrieving digitally stored newsprint data.
  • the computer readable program code includes a first program part and a second program part.
  • the first program part is for providing a database of newsprint information, the database having been created using the method described above.
  • the second program part is for searching the database using adaptive pattern recognition processing and morphology such that text which does not exactly match a search string can be retrieved.
  • a device for processing newsprint data which has been scanned into a digital image includes an image cleaner, an OCR unit, a digital storage medium, and a coordinator.
  • the image cleaner removes marks in the digital image of scanned newsprint data using a grayscale enhance function.
  • the OCR unit performs optical character recognition ("OCR") on the digital image, after removing marks, to produce an OCR output.
  • the digital storage medium is for storing the OCR output.
  • the coordinator controls the work flow between the image cleaner, the OCR unit, and the digital storage medium.
  • a retrieval system including a database and a searching system.
  • the database includes newsprint information, the database having been created using the method described earlier.
  • the searching system is capable of performing adaptive pattern recognition processing and morphology such that text which does not exactly match the search string can be retrieved.
  • a method of utilizing digitally stored newsprint data includes searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data.
  • the method further includes producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image.
  • the method further includes displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
  • a computer program product including computer readable program code for utilizing digitally stored newsprint data.
  • the computer readable program code includes a first program part, a second program part, and a third program part.
  • the first program part is for searching text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data.
  • the second program part is for producing a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image.
  • the third program part is for displaying the particular scanned image of newsprint data which corresponds to the text which produced the search result.
  • a retrieval system including a search engine and a user interface.
  • the search engine searches text in a word searchable database, the text having been produced with optical character recognition technology from at least one scanned image of newsprint data.
  • the search engine also produces a search result from searching the text in the word searchable database, the search result corresponding to text from a particular scanned image of the at least one scanned image.
  • the user interface displays the particular scanned image of newsprint data which corresponds to the text which produced the search result.
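The search-then-display flow described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify how adaptive pattern recognition processing or morphology is implemented, so Python's standard `difflib` similarity matching is swapped in as a stand-in, and the names `build_index`, `search`, and the image file names are hypothetical.

```python
import difflib
from collections import defaultdict


def build_index(ocr_texts):
    """Map each OCR'd word to the set of scanned images it came from.

    ocr_texts: dict of {image_id: ocr_text}. The image ids are hypothetical.
    """
    index = defaultdict(set)
    for image_id, text in ocr_texts.items():
        for raw in text.lower().split():
            word = raw.strip('.,;:!?"\'()')
            if word:
                index[word].add(image_id)
    return index


def search(index, query, cutoff=0.8):
    """Return image ids whose text contains a word close to the query.

    difflib is a stand-in for inexact matching (an assumption, not the
    patent's technique); it lets OCR misspellings still be retrieved.
    """
    close_words = difflib.get_close_matches(
        query.lower(), list(index), n=5, cutoff=cutoff
    )
    hits = set()
    for word in close_words:
        hits |= index[word]
    return sorted(hits)


if __name__ == "__main__":
    texts = {
        "clip_001.jpg": "Mayor opens new libary downtown",  # OCR error kept on purpose
        "clip_002.jpg": "Flood closes bridge",
    }
    idx = build_index(texts)
    # The returned ids identify which scanned image to display to the user.
    print(search(idx, "library"))
```

The returned image ids are what a user interface would use to display the scanned clipping that produced the hit.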
  • FIG. 1 shows a high level diagram of an embodiment of the present invention.
  • FIG. 2 shows a high level modular diagram of a data capture and conversion system according to an embodiment of the present invention.
  • FIG. 3 shows a flow diagram for a newspaper digitization process according to an embodiment of the present invention.
  • FIG. 4 shows a flow diagram for a newspaper digitization process with remote indexing according to an embodiment of the present invention.
  • FIG. 5 lists process steps in a clipping digitization process according to an embodiment of the present invention.
  • FIG. 6 shows a clipping envelope barcode.
  • FIG. 7 shows part of the preparation of newspaper clippings.
  • FIGS. 8A-8B show part of the process of newspaper clipping scanning.
  • FIG. 8C shows an exit tray for a scanner.
  • FIG. 8D shows a scanning production line.
  • FIG. 9 shows a Zeutchel digital scanner.
  • FIGS. 10A-10B show part of the screen and process for key field indexing.
  • FIG. 11 lists several key fields.
  • FIG. 12A shows a newspaper obituary entered into a digital system.
  • FIG. 12B contains a flow for remote data entry.
  • FIG. 13 depicts the process for three-step Grayscale Enhancement and OCR voting.
  • FIG. 14 shows an example of grayscale image enhancement.
  • FIGS. 15A, 15B, and 16 depict screens used in quality control.
  • FIG. 17 is a high level block diagram showing a cataloging system.
  • FIG. 18 is a high level block diagram showing a retrieval system.
  • FIGS. 19-20 show screens used in an editor for a cataloging system.
  • FIG. 21 shows a high-level block diagram of a device for processing newsprint data which has been scanned into a digital image.
  • the high level block diagram of FIG. 1 shows the three main components 10, 12, 14 of an embodiment of the present invention, as well as the interface 16 between them.
  • the data capture and conversion system 10 contains the functionality for taking newspaper articles and full newspaper pages and creating a word searchable database from them in high volume production.
  • the retrieval system 12 contains the functionality for storing the created database and allowing authorized users to access the database and perform searches.
  • the cataloging system 14 contains the functionality for editing and maintaining the database after it is created and sent to the retrieval system 12.
  • the interface 16 is a generic interface and its implementation can be quite varied, as indicated by the non-limiting examples which follow.
  • the interface 16 can comprise, for example, an electronic realtime connection over the Internet or other data network, including a wide area network ("WAN") or a metropolitan area network ("MAN").
  • the interface 16 could also comprise carrying a hard disk or other storage medium from the data capture and conversion system 10 to the retrieval system 12 or cataloging system 14.
  • the interface 16 can comprise, for example, a local area network ("LAN") or the internal bus of a common computer system.
  • the data capture and conversion system 10 is shown in greater detail in FIG. 2. This system 10 performs the functionality required to take either a newspaper clipping of an article, or create a clipping and associated jump from microfiche or microfilm or hardcopy, and create: an electronic image of the article, searchable text from the image, and a word searchable key field "meta data" database entry for the article.
  • a different "re-keyed" process is also automated when the image data obtained is not of sufficient quality to produce reasonable quality text from an OCR process.
  • newsprint data is defined to include, without limitation, any data from a newspaper, whether stored on paper, microfiche, microfilm, a digital storage medium such as a hard disk, or an analog storage medium such as a tape.
  • Newsprint data can, therefore, include text, pictures, and other types of data.
  • FIG. 2 depicts the modular nature of the data capture and conversion system 10.
  • a variety of work performing AVATAR software modules 15 perform various work functions. Work is served out to the work performing software modules 15 by the AVATAR central coordinator 20, which controls the flow of data between them based upon the variable work flow definition defined for a particular customer's image processing project.
  • the coordinator 20 of the preferred embodiment utilizes each of the work performing modules 15, also referred to as processes, as is necessary for a particular application.
  • the flow of the process can also be changed for different imaging work flow applications.
  • the modules Manual Article Crop and Jump Connector are used to process articles obtained when digitizing microfilm but are not required when processing images obtained by scanning the hard copy clipping articles themselves.
  • One advantage of a computer controlled automated process flow is that the electronic work tasks can be tracked using a database or other computer software and these tasks are properly routed based upon data processing requirements to eliminate human error during repetitive processes.
  • the work performing modules 15 are client modules and log in to the coordinator 20, which is a server based module.
  • the coordinator 20 can be, for example, a processor such as a Pentium processor coupled to appropriate memory, or a complete computer system including PCs, mainframes, general purpose systems, and application specific systems.
  • the client operations for obtaining work from the coordinator 20 are as follows:
  • the coordinator 20, along with the correct work performing software modules 15, comprises software, and is therefore easily configurable to accommodate different applications and conditions.
  • the modular nature of the data capture and conversion system 10 is demonstrated by the preferred mechanism for interfacing each of the modules 15 to the coordinator 20.
  • the modules 15 preferably use standardized function calls to communicate with the coordinator 20.
  • When called by a work performing software module 15, the coordinator 20 provides work either by requested work type or on a next available basis. When requesting work, the work performing software 15 can request work by project or folder.
  • the coordinator 20 also has a security capability which will allow only authorized users or accounts to acquire work when requested or as it is available.
  • the coordinator 20 can also serve out work for different customers' jobs concurrently; for example, projects can be performed for two different newspapers simultaneously.
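The coordinator behaviour described above (work served by requested type or next available, filtered by project or folder, and limited to authorized accounts) can be sketched as a small in-memory queue. This is a minimal sketch, not the AVATAR implementation: the class names, method names, and field names are all hypothetical, and a real system would add persistence and networking.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    item_id: str
    work_type: str  # e.g. "key_field" or "ocr" (illustrative labels)
    project: str    # e.g. one newspaper's job
    folder: str     # e.g. one envelope's electronic folder


class Coordinator:
    """Hypothetical stand-in for the coordinator 20: hands out queued work."""

    def __init__(self, authorized_accounts):
        self._authorized = set(authorized_accounts)
        self._queue = deque()

    def submit(self, item: WorkItem) -> None:
        self._queue.append(item)

    def request_work(self, account, work_type=None, project=None,
                     folder=None) -> Optional[WorkItem]:
        # Only authorized accounts may acquire work.
        if account not in self._authorized:
            raise PermissionError(f"account {account!r} may not acquire work")
        # Iterate over a snapshot so removal does not disturb iteration.
        for item in list(self._queue):
            if work_type is not None and item.work_type != work_type:
                continue
            if project is not None and item.project != project:
                continue
            if folder is not None and item.folder != folder:
                continue
            self._queue.remove(item)
            return item
        return None  # nothing matching is available


if __name__ == "__main__":
    coordinator = Coordinator(["operator1"])
    coordinator.submit(WorkItem("img1", "key_field", "paperA", "env1"))
    coordinator.submit(WorkItem("img2", "ocr", "paperB", "env7"))
    print(coordinator.request_work("operator1", work_type="ocr"))
    print(coordinator.request_work("operator1"))  # next available
```

Calling `request_work` with no filters gives the next available item, while passing `work_type`, `project`, or `folder` narrows the request; items from two projects can sit in the same queue, which mirrors serving two newspapers concurrently.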
  • The process flow for an embodiment of the invention for a newspaper digitization application is shown in FIG. 3, and the process steps are listed in FIG. 5.
  • FIG. 5 lists the physical process steps which occur during newspaper clipping digitization, as opposed to the AVATAR software process flow in FIG. 3. For that reason, FIGS. 3 and 5 provide slightly different information.
  • the discussion of the application will follow the outline of the process flow of FIG. 3, and will also address the process steps of FIG. 5 from within that outline.
  • the portions of the process can be performed in different physical locations. When this is done, a high bandwidth wide area telecommunications connection is preferred. When this is not possible, information can be transferred from one location to another via, for example, high density magnetic tape or other digital storage medium.
  • FIG. 4 shows one alternate flow process in which key field indexing is being performed remotely, image data is being transferred from the primary location via tape, and key field data, as well as any image fixes or rejections performed by the Index Operator, is streamed back in real-time over the telecommunications connection.
  • the work flow is controlled by the coordinator 20.
  • the work flow concerns the tracking, collecting, moving, and otherwise organizing and controlling of files and other data and information among the various stations or processes where it is used, generated, stored, etc.
  • This information includes the images at all stages of the process, including the scanned image, the enhanced image, the OCR'd image, the bitonal image, etc.
  • This information also includes all data which is generated such as indications of jumps, OCR accuracy, errors, etc.
  • This information also includes non-electronic data, such as the original document that is scanned.
  • Controlling the work flow includes such processes as sending the digital image to a process, sending the digital image to a station, automatically rejecting the digital image if quality is insufficient, presenting the digital image to an operator at a station, checking the digital image in after a process has been performed, checking the digital image out to a station or a process, associating a barcode with non-electronic data relating to the digital image, tracking non-electronic data associated with the digital image using a barcode, and collecting and storing data relating to the digital image.
  • the newspaper clippings have already been cut, date stamped and hand marked, and sorted into envelopes by category, as described in the background section. These envelopes are then delivered to the data capture and conversion system 10, of FIG. 1, so that the digitization process can begin.
  • the first step in the digitization process is to capture the category information from the envelope and use it to create a folder for the clippings in each envelope.
  • a large folder that can physically house the unfolded clippings is utilized so that the clippings can be unfolded and laid flat.
  • an electronic folder can also be created for the digitized images of the clippings in each envelope. This is done in the bar-coding step 40 (see FIG. 3) by typing in the category information and creating a barcode label which can be placed on the envelope or folder and used to track it and its contents through the process.
  • FIG. 6 shows a barcode label 60 which has been created for an envelope.
  • Meta data information 62 is typically subject data and can include the subject and the dates of the clippings in the envelope, as well as a biographical proper name meta data when the envelope contains information about a specific individual.
  • the meta data 62 is typed in by an operator utilizing the barcode module 40 of FIG. 3 and appears on the printed barcode label 60. Capturing the identifying information 62 electronically is referred to as envelope indexing.
  • the barcode label 60 can also be used to indicate a variety of other information.
  • the barcode label 60 can indicate in which box or shipment that envelope was received and on what date the envelope was received.
  • the label 60 can also be used to indicate whether or not there are additional envelopes, folders, or other information associated with that envelope. In the present application some of the envelopes will have an oversized folder associated with them to contain clippings which are unusually large, and this information is indicated on the barcode label 60.
  • two barcode labels are printed by the barcode module 40 after the envelope subject meta data 62 has been entered.
  • One barcode is affixed to the original envelope itself and a second is placed inside the envelope to be affixed later to a scanning separator sheet, as discussed in the section on scanning below.
  • the use of a barcode label 60 helps to eliminate human operator errors and, thus, to ensure positive envelope control during scanning and repacking of the clippings. Errors are also prevented by having the data capture and conversion system 10 oversee the flow of electronic folders. This is done by the coordinator 20 of FIG.
  • the scanning step 41 is performed after the bar-coding step 40 is completed. Before the clippings can be scanned, they must be unfolded and flattened, referred to as clipping preparation in FIG. 5.
  • FIG. 7 shows part of the process of unfolding and flattening the clippings from an envelope. A special process is used to flatten the clippings, which can range in size from about one inch square to 15" x 25" or larger. Because the clippings have been folded to fit into small envelopes for a period of up to one hundred years, a physical process is required to prepare and flatten the clippings prior to the scanning step. At this point the clippings are unfolded and taped with acid free tape if torn.
  • any continuations of a newspaper article from one page and column to another are marked. These continuations are termed jumps.
  • Red markers are used to indicate the various pages of an article. One red marker is used for the first page of the article if it contains a jump and a second, third, and fourth marker, etc. is used for the continuations. Markers are placed as near as possible to the upper right hand corner of the clipping to provide key field operators a visual clue as to which articles contain jump continuations (see the discussion of Key Field Indexing below).
  • the articles that have continuations (jumps) are also inserted into a second folded separator sheet inside the large preparation folders. The special separation, in addition to the markers, serves to notify the scanner operator that these pages must be kept together during the scanning process.
  • the complete prepared folder is run through one of several different commercial configuration laminators set to the correct temperature and pressure to flatten the clippings and remove folds without any adverse damage.
  • FIG. 8A shows a portion of the scanner 80.
  • FIG. 8A also shows a portion of a barcode reader 82.
  • the separator sheet is used for several functions: it contains the barcode, it has affixed the original clipping envelope that is used later in the repacking step, and it separates the clippings in one envelope from another in the specially designed removable exit tray.
  • the scanner itself can process the barcode and a separate barcode scanner is not necessary.
  • the scanner 80 used to convert the newspaper clippings is a specially customized version of a commercially available scanner, the Banctec 4530. However, several scanners are used in the process and other types of scanners are used for image conversion projects using different materials.
  • the customized scanner 80 clearly performs a critical task in the process by reliably and efficiently scanning a high volume of clippings of various sizes, without damaging them.
  • the scanner 80 possesses specific features which make it particularly suited to this application. These include a vacuum feed for the clippings, a belt drive with auto-detect sensors for start and stop of the belt, and a custom-built exit tray for collecting the clippings as they exit the scanner 80. Referring to FIG. 8C, it can be seen that the exit tray allows the clippings to be retrieved without being handled another time and thus helps to increase work flow time and motion efficiency in addition to helping preserve the clippings. Scanning processes are run in parallel as shown in FIG. 8D.
  • Grayscale or color scanning would allow a variety of image processing techniques to be used to remove the date stamp and handwriting from the clippings. Therefore, an interface scanning card was built for the scanner 80 to enable grayscale scanning.
  • An IPT/GS Gray Scale Capture Board (the "grayscale board") was utilized to connect to the specially created 4530 interface card in the Banctec 4530 scanner.
  • the grayscale board is a PCI-based scanner interface card that provides grayscale capture capability to a scanner once an appropriate interface has been built.
  • the grayscale board has a maximum data throughput of 132 MB/s, a maximum scanner speed of 25 MHz, a maximum PCI bus speed of 33 MHz, and a maximum image size of 65,536 by 65,536.
  • the grayscale board can perform the processing features of binarization, cropping, inversion, scaling, and smoothing 2 x 1, and supports image formats of 1 bit bitonal, 8 or 16 bit grayscale, and combined grayscale and black-and-white.
  • the scanning interface hardware and software allows the scanning operator to operate the scanner from PC based controls (keyboard and mouse) in addition to foot pedals used to rotate the clipping image if its configuration dictates scanning the image upside down or reversed.
  • the scanning step 41 may also be performed with an oversize scanner if the clipping is large and not suited for the high volume production scanner 80.
  • a German built Zeutchel overhead digital scanner, shown in FIG. 9, is used in the present application to rapidly scan large clipping images in a production environment.
  • Special application programmer interface (“API") software was written to allow the specially designed AVATAR workflow software to operate the Zeutchel scanner.
  • Other scanners of this type and configuration may also be used in the embodiment.
  • the oversize scanner is not intended to be used for high volume clippings smaller than 11" x 17", which is the maximum size which the Banctec 4530 scanner can scan.
  • Other types of special purpose scanners can also be integrated into the process flow.
  • Enhancement is performed in two places in the flow process for newspaper digitization, but depending upon the requirements, specific image enhancement tasks can be performed at any location.
  • the first enhancement routine is used to improve the quality of the scanned image and to reduce the file size of the image.
  • the operations performed preferably include deskew, despeckle, and removing lines and black portions. These functions are performed in the cropping enhancement step 42. These functions reduce the file size of the image in addition to removing unwanted image data and noise. This makes the grayscale image easier to visually process by the operator in the next work flow process, key fielding.
  • the work flow process 42 includes a Deskew function for JPEG images. This function must be performed prior to cropping. Since JPEG images are virtually impossible to deskew with any accuracy and reliability, the embodiment includes the following process for deskew of JPEG images:
  • The bitonal image is deskewed using conventional deskew software, and the angle of deskew is saved.
  • The original JPEG image is rotated by the angle determined in the step above.
  • JPEG cropping is performed on the properly deskewed JPEG image.
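The first two deskew steps listed above (find the skew angle on the bitonal image, then rotate the original grayscale image by that angle) can be sketched in plain Python. This is only an illustration: the source does not name the deskew software or its algorithm, so a simple projection-profile search is assumed here, images are represented as nested lists rather than decoded JPEG data, and the function names and parameters are hypothetical.

```python
import math


def estimate_skew_degrees(bitonal, max_angle=5.0, step=0.5):
    """Estimate skew from a bitonal image (rows of 0/1, 1 = ink).

    Assumed technique: for each candidate angle, shear ink pixels back by
    that slope and keep the angle whose row histogram is most concentrated.
    """
    ink = [(x, y) for y, row in enumerate(bitonal)
           for x, value in enumerate(row) if value]
    best_angle, best_score = 0.0, -1
    for i in range(int(round(2 * max_angle / step)) + 1):
        angle = -max_angle + i * step
        slope = math.tan(math.radians(angle))
        profile = {}
        for x, y in ink:
            row = round(y - x * slope)
            profile[row] = profile.get(row, 0) + 1
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def rotate_gray(gray, angle_degrees, fill=255):
    """Rotate a grayscale image (rows of 0..255) about its centre.

    Nearest-neighbour inverse mapping; uncovered pixels get `fill`.
    Rotating by the estimated skew angle levels the text lines.
    """
    height, width = len(gray), len(gray[0])
    cx, cy = (width - 1) / 2, (height - 1) / 2
    cos_a = math.cos(math.radians(angle_degrees))
    sin_a = math.sin(math.radians(angle_degrees))
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = x - cx, y - cy
            sx = round(cx + dx * cos_a - dy * sin_a)
            sy = round(cy + dx * sin_a + dy * cos_a)
            if 0 <= sx < width and 0 <= sy < height:
                out[y][x] = gray[sy][sx]
    return out


if __name__ == "__main__":
    # Synthetic page: three text-like lines skewed by 2 degrees.
    slope = math.tan(math.radians(2.0))
    page = [[0] * 200 for _ in range(100)]
    for y0 in (20, 40, 60):
        for x in range(200):
            page[round(y0 + x * slope)][x] = 1
    angle = estimate_skew_degrees(page)
    gray = [[0 if v else 255 for v in row] for row in page]
    deskewed = rotate_gray(gray, angle)
    print("estimated skew:", angle)
```

Measuring the angle on the bitonal copy and applying it to the grayscale original keeps the grayscale data intact for the later enhancement steps; cropping would follow on the rotated result.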
  • In step 42, the quality of the grayscale image is automatically detected and assessed; if it is not adequate, the image is failed, as noted by the Fail Grayscale Image box 43, and sent to the rescan module 50.
4. Key Field Entry
  • a variety of information can be entered during the key field entry process 44, which is illustrated in FIGS. 10A-10B.
  • Key fielding can utilize two methods, key-from-image or key off hard copy.
  • the information which is entered in this process 44 includes the publication date and the source of the newspaper, which are taken from the date stamp, and the headline or title, as shown in FIG. 10A via the key-from-image approach.
  • Additional key fields can be utilized for a variety of useful information, such as the envelope category, which is preferably captured during the bar-coding process 40.
  • the process also allows data to be keyed from hard copy as is performed when it is more convenient or cost effective to do so or when the image is not required in the final searchable database.
  • FIG. 11 shows the meta data 100 for a particular clipping.
  • This process involves a computer search for specific text strings. When located, the software copies a defined variable number of characters which make up the byline.
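A string search of this kind can be sketched as follows. This is a minimal sketch under assumptions: the source does not give the text strings searched for or the number of characters copied, so the marker strings, the `max_chars` limit, and the rule of stopping at the end of the line are all hypothetical choices.

```python
def extract_byline(text, markers=("By ", "BY "), max_chars=40):
    """Find a byline marker and copy up to max_chars characters after it.

    markers and max_chars are illustrative assumptions, not values taken
    from the source. Copying stops at the end of the line.
    """
    for marker in markers:
        start = text.find(marker)
        if start != -1:
            begin = start + len(marker)
            candidate = text[begin:begin + max_chars]
            return candidate.split("\n", 1)[0].strip()
    return None  # no marker found; the key field is left for the operator


if __name__ == "__main__":
    article = "CITY COUNCIL VOTES\nBy Jane Doe\nThe council met on Tuesday."
    print(extract_byline(article))
```

Returning `None` when no marker is found leaves room for an operator to key the field by hand.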
  • The work flow software allows the key fields to be entered while the scanned grayscale image is on the screen. This process is efficient in terms of time and obviates any need for handling the actual clippings. The process can be made even more efficient by allowing the operator to drag and highlight the appropriate text with a mouse, or otherwise indicate the relevant text, and then use a simple OCR program to enter the text directly. The operator then only needs to verify that the OCR result is accurate.
  • The key field step 44 also provides another opportunity to review the image.
  • The operator can fail the image, as indicated by the Fail Enhance box 45, and send it to the rescan module 50.
  • There is an "error flag" box which an operator can check if the image has an error of any kind.
  • The operator can also indicate that the image needs to be rescanned by checking the "rescan page" box. Additionally, the operator can indicate that the image has a jump by removing the "first page" check in the check box for the second and further pages. Because most of the clippings are a single page, the "first page" default is currently set to checked. If the image is rejected, the operator can key in the specific reason.
  • The comment will be included. If a rejected image has been through OCR processing, then the % OCR confidence will be shown. The Project and Folder are also indicated so that the operators will know what job they are currently processing.
  • The electronic clipping is ready for the multi-step grayscale enhancement process 46, which includes grayscale enhance functions.
  • This process 46 is further illustrated in FIG. 13. As shown in FIG. 13, four separate processes are performed and the output with the best results is selected for the final OCR voting step 48. More or fewer than four (4) processes can be utilized.
  • The first process involved is threshold enhancement 131. Since it is difficult to develop one computer algorithm that will effectively clean up and remove unwanted handwritten material from the image without removing significant parts of the original image itself, multiple algorithms were written which compete against one another to obtain the best results when processing an image. Four different algorithms are described. A JPEG image is the input; however, other formats can be used in different embodiments.
  • The leftmost processing line of process 131 in FIG. 13 selects a grayscale threshold level and then performs a simple decision based on the threshold level. All values equal to or larger than the grayscale threshold level are classified in one category, either black or white, and all values less than the grayscale threshold are classified in the other category. The end result is a black and white, bitonal, image. Step 132 is effectively performed along with step 131 for the leftmost processing line.
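  • The simple threshold decision of the leftmost processing line can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the list-of-rows image representation, and the choice that values at or above the threshold become white are assumptions made for the example.

```python
from typing import List

# Assumed representation: rows of grayscale intensities, 0 (black) to 255 (white).
Image = List[List[int]]


def threshold_to_bitonal(image: Image, threshold: int) -> Image:
    """Classify every pixel into one of two categories.

    Pixels equal to or larger than the threshold become white (255);
    all other pixels become black (0).
    """
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]


sample = [
    [12, 200, 130],
    [128, 127, 255],
]
print(threshold_to_bitonal(sample, 128))
```

The threshold is passed as a parameter because the text notes that the gray level used for the decision is adjustable.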
  • FIG. 21 shows an image cleaner 181, which comprises a processor or computer system which can run image enhancement routines or grayscale enhance functions.
  • FIG. 21 also shows the coordinator 20, a digital storage medium 183, and an OCR unit 182, which is described further in a section below.
  • The interconnections in FIG. 21 are intended to be illustrative and are not intended to
  • The coordinator 20, image cleaner 181, and OCR unit 182 all comprise a common processor (not shown).
  • All processing columns in FIG. 13 can provide a rejection for an unrecognizable JPEG image, indicated by the Fail Grayscale Enhance box 47. This automatic process rejects to rescan 50.
  • The max value of black (B_max) is set to 10 on a scale of
  • The max value of white (W_max) is set to 236 on a scale of 0-255 colors.
  • A running total, nTotalofIntensity = nTotalofIntensity + Average Intensity value, is kept, along with a count (n) of valid sections. The averaged sampled intensity value (I_avg) is equal to nTotalofIntensity/n.
  • G_avg = Average value of the gray pixel intensities.
  • Step ii: For every pixel in the image, get the number of White Pixels (W_n), intensity values of 255; the number of Black Pixels (B_n), intensity values of 0; and the number of Gray Pixels (G_n), intensity values other than 255 or 0.
  • Step #2 (dilation of gray image) again.
  • ConvertImage_ex3() changes the image to bitonal. This algorithm converts all pixels that have an intensity value greater than 1 to white (255) and those less than or equal to 1 to black (0).
  • Newspaper BK6 Enhancement performs newspaper clipping image enhancement as follows: Gets the average intensity value of the image (I_avg) by averaging all the pixel values.
  • The computed intensity value for () is computed by the following.
  • The max value of black (B_max) is set to 10 on a scale of 0-
  • The image is divided up into 21 separate sections.
  • Section is a valid section
  • A running total, nTotalofIntensity = nTotalofIntensity + Average Intensity value, is kept, along with a count (n) of valid sections.
  • The computed intensity value is equal to nTotalofIntensity/n.
  • Call the ConvertImage_ex5() algorithm, which converts the image to bitonal: only the pixels with values less than 254 are converted to black, and all others to white.
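  • The section-averaging and final bitonal conversion steps can be sketched as follows. The text does not preserve the rule that makes a section "valid," so the sketch assumes a section is valid when its mean intensity lies strictly between B_max and W_max; it also assumes W_max = 236 applies here (the value is stated only for the preceding algorithm) and that sections are equal slices of the pixels in row-major order. The function names are illustrative.

```python
from typing import List

Image = List[List[int]]

B_MAX = 10          # max value of black, from the text
W_MAX = 236         # ASSUMPTION: borrowed from the preceding algorithm
NUM_SECTIONS = 21   # number of sections, from the text


def section_average_intensity(image: Image, num_sections: int = NUM_SECTIONS) -> float:
    """Return nTotalofIntensity / n over the valid sections of the image."""
    flat = [pixel for row in image for pixel in row]
    bounds = [i * len(flat) // num_sections for i in range(num_sections + 1)]
    total = 0.0  # plays the role of nTotalofIntensity
    n = 0        # count of valid sections
    for start, end in zip(bounds, bounds[1:]):
        section = flat[start:end]
        if not section:
            continue
        mean = sum(section) / len(section)
        # ASSUMPTION: a section is valid if it is neither nearly all black
        # nor nearly all white.
        if B_MAX < mean < W_MAX:
            total += mean
            n += 1
    if n == 0:
        raise ValueError("no valid sections in image")
    return total / n


def convert_image_ex5(image: Image) -> Image:
    """Pixels with values less than 254 become black (0); all others white (255)."""
    return [[0 if pixel < 254 else 255 for pixel in row] for row in image]
```

Skipping invalid sections keeps large blank margins or solid black borders from dominating the computed intensity.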
  • Step 132 of FIG. 13 is a bitonal conversion process.
  • The leftmost processing line of step 131 in FIG. 13 has already performed the bitonal conversion, so nothing is done in that processing line at step 132.
  • The end product of those respective enhancement processes is converted into a bitonal image.
  • The gray pixels that are close enough to black are converted to black and everything else is dropped out.
  • The level of gray that is used to become black is a variable that can be adjusted for specific requirements.
  • Sample before and after shots are shown in FIG. 14.
  • The before shot 141 is a scanned clipping which is a grayscale image.
  • The before shot 141 is the input into the threshold enhancement process 131 of FIG. 13.
  • The after shot 142 shows the result after threshold enhancement 131 has been successfully used to remove the handwritten word and circle mark, as well as the date stamp.
  • The threshold enhancement 131 has also lightened up the background, making the entire clipping easier to read.
  • Step 133 is the same for each processing line.
  • The bitonal images, which are typically different for each processing line, are prepared for OCR.
  • OCR can be performed by an OCR unit 182, shown in FIG. 21, which contains a processor or computer system which can execute software and perform the various functions described herein.
  • The images have been treated as one single image up to this point.
  • The text zoning step 133 provides three different consecutive options, one of which is selected and applied to the next step (two-pass OCR processing 134).
  • A custom newspaper decolumnization program is run first. If this program is successful, then the two-pass OCR 134 is initiated. If newspaper decolumnization fails, standard autozoning is implemented next. If that process also fails, then the image is treated as one zone and the total image without zoning is sent for OCR processing 134.
  • The standard auto-zoning process is a commercially available software routine that provides autozoning but is not specifically created or tuned to newspapers.
  • The newspaper decolumnization process is developed to recognize columns. This decolumnization process groups the text of a column so that the OCR module performs its functions within the specified zone only. Without zoning, the OCR software reads the image from left to right and scans across column breaks as if the sentence continues in that direction as opposed to down the column. This creates groupings of words from left to right but does not maintain the original sentence format of the newspaper. This presents a problem if the text is to be imported into another system for re-use or if a word has been hyphenated and continued on the next line. It also may present a problem if the lines of the columns are not aligned well.
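  • The three consecutive zoning options amount to a fallback chain, which can be sketched as follows. The zoner callables are hypothetical stand-ins for the decolumnization program and the commercial autozoning routine; the convention that a zoner returns None (or an empty list) on failure is an assumption of this sketch.

```python
from typing import Callable, List, Optional, Tuple

Zone = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Zoner = Callable[[int, int], Optional[List[Zone]]]


def zone_image(width: int, height: int,
               decolumnize: Zoner, autozone: Zoner) -> Tuple[str, List[Zone]]:
    """Try decolumnization, then autozoning, then fall back to a single zone."""
    for name, zoner in (("decolumnization", decolumnize), ("autozoning", autozone)):
        zones = zoner(width, height)
        if zones:  # None or an empty list is treated as a failure
            return name, zones
    # Last resort: the whole image is sent to OCR as one zone.
    return "single-zone", [(0, 0, width, height)]


def failing_zoner(width: int, height: int) -> Optional[List[Zone]]:
    """Hypothetical zoner that always fails."""
    return None


def two_column_zoner(width: int, height: int) -> Optional[List[Zone]]:
    """Hypothetical zoner that splits the image into two equal columns."""
    half = width // 2
    return [(0, 0, half, height), (half, 0, width, height)]


print(zone_image(100, 200, failing_zoner, two_column_zoner))
```

Returning the name of the method that succeeded makes it possible to record which zoning path each image took.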
  • Steps 134 and 135 of FIG. 13 are only implemented to provide OCR confidence statistics. These steps are also the same for each of the processing lines of FIG. 13.
  • The zoned, bitonal image files are passed through two separate OCR processes to yield two separate result files for each processing column of FIG. 13.
  • The two OCR processes use different OCR applications and therefore can produce different results.
  • Each of the OCR processes attempts to recognize each character in the bitonal image and each also produces a confidence level output indicating how well the OCR process thinks that it has done on each character.
  • The winning OCR engine's output is used for each character in the bitonal image.
  • The total confidence level output for the combination of the two engine results is given as a percentage between 0 and 100.
  • If one of the two OCR engines fails, the character and confidence output from the remaining OCR engine is used. These three confidence level outputs are then compared in step 136, and the highest one is selected. The processing line that produced the winning confidence level output is flagged, and the zoned, bitonal image file from the output of the zoning process 133 of that winning processing column is used in step 137.
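  • The two-pass comparison and the selection of the winning processing line can be sketched as follows. The text does not say how the total confidence is derived from the per-character values, so this sketch assumes a plain average; it also assumes the two engines return character lists that are already aligned and of equal length. All names are illustrative.

```python
from typing import Dict, List, Optional, Tuple

CharResult = Tuple[str, float]          # (character, confidence 0-100)
EngineOutput = Optional[List[CharResult]]  # None means the engine failed


def combine_two_pass(a: EngineOutput, b: EngineOutput) -> Tuple[str, float]:
    """Keep the higher-confidence character at each position."""
    if a is None and b is None:
        raise ValueError("both OCR engines failed")
    if a is None or b is None:
        # One engine failed: use the remaining engine's output as-is.
        winners = a if a is not None else b
    else:
        # ASSUMPTION: outputs are aligned character by character.
        winners = [x if x[1] >= y[1] else y for x, y in zip(a, b)]
    text = "".join(ch for ch, _ in winners)
    # ASSUMPTION: total confidence is the mean per-character confidence.
    confidence = sum(c for _, c in winners) / len(winners) if winners else 0.0
    return text, confidence


def pick_processing_line(
    lines: Dict[str, Tuple[EngineOutput, EngineOutput]]
) -> Tuple[str, str, float]:
    """Return (line name, text, confidence) for the highest-confidence line."""
    scored = {name: combine_two_pass(a, b) for name, (a, b) in lines.items()}
    best = max(scored, key=lambda name: scored[name][1])
    return (best, *scored[best])


engine_a = [("c", 90.0), ("a", 40.0), ("t", 80.0)]
engine_b = [("e", 60.0), ("a", 95.0), ("t", 70.0)]
print(combine_two_pass(engine_a, engine_b))
```

Only the confidence figure matters at this stage; the winning line's zoned bitonal image, not its text, is what moves on to the voting step.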
  • The folder can be automatically flagged for manual re-work by the QCR module 49 (see FIG. 3).
  • In QCR 49, an operator can manually perform the newspaper threshold enhancement process of step 131 in FIG. 13. If the image cannot be adequately corrected, then the image is rejected to the rescan module 50.
  • In step 137, the bitonal image file that gave rise to the winning confidence level is passed through five different OCR applications. More or fewer than five OCR applications can be used. Each OCR application gives a confidence level on each character of output. Voting is performed in step 138 on every character, and the highest confidence character is selected and put into a separate file. The file containing this OCR text, which gave rise to the highest confidence level for each character, is then output from the voting OCR module 48 (see FIG. 3).
  • The character and confidence output from the remaining OCR engines is used to determine the winner and to determine the overall confidence rating for the ASCII text file created from the multi-pass OCR voting process.
  • The voting can be done in a variety of ways, and the threshold can be set at appropriate values to yield acceptable results.
  • The OCR engines may compute a confidence level for each individual character, and the OCR engine with the highest confidence for that character can be used for that character. The overall confidence of the resulting file is based on the confidence of each character.
  • A simple vote can be used between the OCR engines' outputs, and the most common output for a given character can be selected, with confidence levels computed from the level of agreement for a given character, and with a suitable decision process being used to resolve ties.
  • Other voting schemes will be apparent to those of ordinary skill in the art.
  • If the confidence output is below a specified threshold (currently set at 75% confidence), the image is flagged for repair in the QCR module described below.
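  • The two voting schemes and the 75% repair threshold can be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: engine outputs are aligned and non-empty, overall confidence is the average of the per-character values, and ties in the simple vote are broken by the highest single-engine confidence among the tied characters.

```python
from collections import Counter
from typing import List, Tuple

CharResult = Tuple[str, float]  # (character, confidence 0-100)
QCR_THRESHOLD = 75.0            # current threshold, from the text


def vote_by_confidence(engines: List[List[CharResult]]) -> Tuple[str, float]:
    """At each position, keep the character reported with the highest confidence."""
    chosen = [max(position, key=lambda r: r[1]) for position in zip(*engines)]
    text = "".join(ch for ch, _ in chosen)
    return text, sum(c for _, c in chosen) / len(chosen)


def vote_by_majority(engines: List[List[CharResult]]) -> Tuple[str, float]:
    """At each position, keep the most common character; confidence is the agreement level."""
    text, agreement = [], []
    for position in zip(*engines):
        counts = Counter(ch for ch, _ in position)
        top = max(counts.values())
        tied = [ch for ch, n in counts.items() if n == top]
        # ASSUMPTION: break ties with the best single-engine confidence.
        winner = max(tied, key=lambda ch: max(c for k, c in position if k == ch))
        text.append(winner)
        agreement.append(100.0 * top / len(position))
    return "".join(text), sum(agreement) / len(agreement)


def needs_qcr(confidence: float, threshold: float = QCR_THRESHOLD) -> bool:
    """Flag the image for repair when confidence falls below the threshold."""
    return confidence < threshold


engines = [
    [("n", 80.0), ("e", 60.0), ("w", 90.0)],
    [("n", 70.0), ("c", 65.0), ("w", 85.0)],
    [("m", 95.0), ("e", 50.0), ("w", 40.0)],
]
print(vote_by_confidence(engines)[0], vote_by_majority(engines)[0])
```

On this sample the two schemes disagree: a single overconfident engine wins under the confidence rule, while the majority rule recovers the word the engines mostly agree on.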
  • Quality Control and Repair (QCR)
  • The quality control and repair process 49 of FIG. 3 is used to verify all data (imaging, OCR, and meta-data) that has been assembled from the data capture and conversion processes and to fix any data or image that is determined to be inferior unless the image needs to be rescanned. If QCR cannot fix the incorrect data or poorly enhanced image, the particular image and problem are noted and the folder is rejected to re-scan 50.
  • The QCR operator, in step 49, has the capability to view the grayscale and bitonal images, the OCR text file, and the meta data, as indicated in FIGS. 15A-16.
  • The leftmost window has the scanned grayscale image.
  • The middle window has the enhanced bitonal image with the date stamp and other markings removed.
  • The rightmost window has the OCR text.
  • The quality control module allows the operator to view thumbnails of all images in each folder with a quick visual scan of the basic integrity of all the images, as seen in FIG. 15B. This is useful if 100% QC, as opposed to QC sampling, is required.
  • The QCR module also contains a "capsule view" FIG.
  • The QCR module capsule view provides the operator information concerning the number of images in a folder left to QC.
  • A variable setting in the software allows this to be set at 100% or at some level of statistical sampling, such as 5 per folder.
  • The capsule view tells the operator how many images are left in a folder to QC to meet the required QC sampling level.
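  • The count shown by the capsule view can be sketched as a small calculation. The function and parameter names are hypothetical, and the sketch assumes sampling is expressed as a fixed number of images per folder, with None standing for 100% QC.

```python
from typing import Optional


def images_left_to_qc(total_images: int, reviewed: int,
                      sample_per_folder: Optional[int] = None) -> int:
    """Return how many more images must be reviewed to meet the QC level.

    sample_per_folder=None means 100% QC; otherwise it is the number of
    images to sample per folder (capped at the folder size).
    """
    if sample_per_folder is None:
        required = total_images
    else:
        required = min(sample_per_folder, total_images)
    return max(0, required - reviewed)


print(images_left_to_qc(40, 3, sample_per_folder=5))
```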
  • The QCR module 49 can perform a variety of functions to repair images, such as manual cropping or the grayscale enhancement, should the automated process discussed above not be sufficient to create a high quality image.
  • The image is sent back into the work flow to voting OCR 48. If the image is not repairable, then it is sent to the rescan module 50 and the folder with the appropriate clipping is directed to that location. The clipping is then rescanned.
  • The rescan module 50 uses a different scanner from the high volume, high speed scanner 80 described earlier.
  • The inspection is manual, but other embodiments may automate all or part of the process. Because this is designed as the final quality check, the end result, which is the electronic folder associated with that clipping, can be failed for any reason.
  • If the key field data is incorrect for any reason, it is corrected immediately by the QCR operator 49. If the image is incorrect for any reason, then it is repaired as described and the electronic folder is sent back to the OCR process 48.
  • FIG. 15B allows the operator to view and edit the OCRed text.
  • The bundling process 51 bundles the multiple images that exist, when articles include continuation "jumps," into one multipage image file. This process also combines the text of these multiple files into a single text file. As described earlier, any images that are jumps from the first page articles have been so indicated by the indexing operator by removing the "first page" indicator in step 44. This flag triggers the work flow software to join the images and text files into one file when the electronic folder is processed by the bundle jump software in step 51.
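  • The jump bundling logic can be sketched as follows. The Page and Bundle types are hypothetical, and the sketch assumes that the pages of a folder arrive in order, with each jump page immediately following the first page it continues.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    image_file: str
    text: str
    first_page: bool = True  # checked by default, as most clippings are one page


@dataclass
class Bundle:
    image_files: List[str]
    text: str


def bundle_jumps(pages: List[Page]) -> List[Bundle]:
    """Join each jump page (first_page unchecked) onto the preceding article."""
    bundles: List[Bundle] = []
    for page in pages:
        if page.first_page or not bundles:
            bundles.append(Bundle([page.image_file], page.text))
        else:
            last = bundles[-1]
            last.image_files.append(page.image_file)
            last.text += "\n" + page.text
    return bundles


pages = [
    Page("a1.tif", "Story A"),
    Page("a2.tif", "Story A continued", first_page=False),
    Page("b1.tif", "Story B"),
]
for bundle in bundle_jumps(pages):
    print(bundle.image_files)
```

Each Bundle then corresponds to one multipage image file and one combined text file.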
  • The bitonal image and ASCII file are also associated with the meta data record stored in the workflow database.
  • The format of the image file can be any image file format, such as the TIFF format.
  • Composite file formats such as Adobe PDF can be utilized.
  • PDF allows image and text files to be combined into one single file which can contain a single page or make up an entire document.
  • The final product is ready to be exported to the retrieval system 12 (see FIG. 1).
  • The export can be set up to provide data in a variety of formats for many manufacturers' retrieval systems. However, in the current integrated AVATAR Digital Asset Management System (ADAMS), the data is exported using a direct database and file system connection to the ADAMS ArchivalWare retrieval server 53 (see FIG. 3).
  • The final product preferably includes the ASCII text, any associated images or other digital objects (audio, video, etc.), and the meta data describing these objects. Depending on the file format and the needs of the database, this information can be in one file or separated into multiple files.
  • The clippings are no longer needed and are sent back to the customer.
  • The process has been optimized to protect the clippings from excessive handling and from inadvertent misplacement. This helps to ensure that the original source of this valuable data is not lost or destroyed.
  • The software to perform this task includes the following functions: (i) the ability to manually electronically zone/mark each article on the full newspaper page, whether the article is rectangular in shape or is made up of a series of rectangles, (ii) when an article has been manually zoned, it is marked in a colored outline so that the operator can determine when each article on the page has been zoned, and (iii) if an article has a jump, the operator has the ability to select/deselect a jump flag indicating to the work flow software in the next process that the article requires a mate.
  • The jump connector software is required when using the manual article jump software previously described.
  • The software to perform this task includes the following functions: (i) automatic selection of pages that have been flagged with jumps, (ii) a split screen that auto selects a jump first page and allows the operator to move to the correct page of the paper while viewing the jumped page on the other half of the screen, (iii) with the jump displayed on one half of the computer screen, the jump article can be effectively zoned as with the manual article crop software described above, and (iv) the article images viewed on both halves of the computer screen can be electronically connected for processing by the image and text bundle module 51.
  • The manual process described above for zoning articles from the full page can be automated.
  • The process involves invoking computer algorithms that attempt to understand the structure of the page layout and separate articles on the page.
  • The algorithms may have fixed input and tuning parameters which produce more accurate results.
  • Auto connection can be accomplished by defining a fixed methodology a newspaper or other publication uses to provide a "jump" on a different publication page. A methodology such as "continued from page #" followed by the article name allows the correct jump process to become automated.
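  • Detecting a fixed jump marker in OCR text can be sketched with a regular expression. The exact marker format varies by publication, so the pattern below is an assumption chosen for illustration: the words "continued from page", a page identifier such as 1 or A4, an optional separator, and the article name.

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: the publication uses "continued from page <id> <article name>".
JUMP_PATTERN = re.compile(
    r"continued\s+from\s+page\s+(?P<page>[A-Z]?\d+)\s*[:,-]?\s*(?P<title>.*)",
    re.IGNORECASE,
)


def find_jump(text: str) -> Optional[Tuple[str, str]]:
    """Return (source page, article name) if the text carries a jump marker."""
    match = JUMP_PATTERN.search(text)
    if match is None:
        return None
    title = match.group("title").strip().strip('"')
    return match.group("page"), title


print(find_jump('Continued from page 1 "CITY BUDGET"'))
print(find_jump("Sunny skies expected this weekend"))
```

The page and article name together identify the first-page article to which the jump should be connected.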
  • The data capture and conversion system 10 can produce a large variety of reports. Particularly useful are the production reports on the throughput and efficiency of the process.
  • The system 10 can also input digital images directly with the image/object import utility.
  • The system 10 does not need to scan information in order for it to be entered.
  • The system 10 can bundle both images and text into a common file using the PDF format.
  • FIG. 12A shows an example of an obituary that has been entered. It includes the key field information in the meta data, as well as the text in a separate field below.
  • The process flow for microfilm data capture and conversion is different from that of the present newspaper scanning application.
  • The process control software can send the electronic data directly to the QC module 49 (see FIG. 3).
  • The system utilizes a real time spell check system 121 which incorporates a custom dictionary for newspaper applications.
  • The recognized and correctable entries of the preferred embodiment include (i) states in the United States and state abbreviations, (ii) cities, (iii) counties in a particular state, (iv) cemeteries, (v) churches, (vi) first names, and (vii) hard to recognize names.
  • When a misspelling is detected, the operator is prompted in real time that a potential error has occurred. The operator must then check the keyed word against the original (image or hard copy data) and invoke or ignore the suggested correction.
  • The list of entries is expandable and modifiable to meet a particular application.
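  • A custom-dictionary check with a suggested correction can be sketched as follows. The dictionary entries are hypothetical samples, and the use of the standard library's difflib similarity matching with a 0.8 cutoff is a stand-in chosen for the example; the text does not describe how suggestions are generated.

```python
import difflib
from typing import List, Optional, Tuple

# Hypothetical sample entries; a real list would hold states, cities,
# counties, cemeteries, churches, and names.
CUSTOM_DICTIONARY = ["Maryland", "Baltimore", "Frederick", "Montgomery", "Catherine"]


def check_entry(word: str,
                dictionary: List[str] = CUSTOM_DICTIONARY) -> Tuple[bool, Optional[str]]:
    """Return (is_known, suggestion).

    A known word needs no prompt. An unknown word that closely resembles a
    dictionary entry yields that entry as a suggestion for the operator to
    accept or ignore.
    """
    if word in dictionary:
        return True, None
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.8)
    return False, (matches[0] if matches else None)


print(check_entry("Baltimroe"))
```

Because the list is an ordinary collection, it can be expanded or replaced for a particular application.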
  • The data entry 120 is performed remotely with labor which is cost effective, such as in offshore locations or in penal institutions.
  • The remote keying software is designed for download 124 to the final QC 125 and processing work flow software 127.
  • The next remote process is performed by remote QC personnel.
  • The keyed data is run against a specially designed keyed-data-test software program 122.
  • The program implements the following tests, and when a potential problem is found, the test operator is prompted and the problem is inserted into the operator's viewing window.
  • The validity and format of the data are checked against the correct format for the application.
  • The field format delimiters contained in each keyed article (for data input) will preferably be:
  • The program checks to see that all required fields for a given article are present, that correct meta-data delimiters inserted by the Macros of 120 have not been accidentally deleted, and that the data is in the correct format for export/loading software 128.
  • The keyed-data-test software 122 includes functions which help address problems of differentiating between I, J, L, and T. These letters are difficult to distinguish on small font-type, microfilmed, newspaper death notice data. The feature attempts to recognize when these particular letters have been confused and places the potential problem in the viewing window for the test operator to repair.
  • The keyed-data-test software 122 also includes functions that flag when words occur that should never be in the typed text. When data is being keyed by penal institution inmates, this software effectively provides a stop list of words that the test operator can verify are indeed in the newspaper. Any class of words could be put into this list, and it is often used to ensure that offensive language is not entered, whether purposely or accidentally. This gives the ultimate purveyor of the database some assurance that it will not offend others nor be embarrassed.
  • The keyed-data-test software 122 also includes a function that automatically places a period after "Mr", "Mrs", "Ms", and "Dr". Further, the keyed-data-test software 122 also includes functions that report statistics like the number of characters and articles typed. This data is used for invoicing and for tracking productivity.
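  • Three of the keyed-data-test functions (the title periods, the stop list, and the keying statistics) can be sketched as follows. The function names and the sample stop list are hypothetical, and matching stop words without regard to case is an assumption of the sketch.

```python
import re
from typing import Dict, List, Set

# Match the bare titles only when they are not already followed by a period.
TITLE_PATTERN = re.compile(r"\b(Mr|Mrs|Ms|Dr)\b(?!\.)")


def add_title_periods(text: str) -> str:
    """Place a period after Mr, Mrs, Ms, and Dr where one is missing."""
    return TITLE_PATTERN.sub(r"\1.", text)


def flag_stop_words(text: str, stop_list: Set[str]) -> List[str]:
    """Return keyed words that appear on the stop list, for operator review."""
    words = re.findall(r"[A-Za-z']+", text)
    return [word for word in words if word.lower() in stop_list]


def keying_stats(articles: List[str]) -> Dict[str, int]:
    """Report the number of articles and characters typed."""
    return {"articles": len(articles), "characters": sum(len(a) for a in articles)}


print(add_title_periods("Mr Smith and Mrs. Jones met Dr Lee"))
print(flag_stop_words("He was a darn fool", {"darn"}))
```

Flagged words are reported rather than deleted, since the test operator must verify whether the word really appears in the newspaper.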
  • The data is downloaded 124 from the remote keying facilities to the data processing location. This is normally performed via telecommunications methods, but the data can be moved via magnetic disk or tape.
  • Final QC 125 is performed at the data processing facility using the same keyed test program used by the remote facilities. If the data has been "cleaned up" at the remote facilities, the test program will flag few or no errors.
  • A sample of the data is reviewed manually 127 to make sure that text has not been omitted when the re-keying effort has been performed. Because the keying software 120 allows the data to be entered in the same format as the original, the QC technician scans the left and right sides of the textual columns to compare them to the original. With this technique it is easy to spot missing information.
  • When the keyed-data-load software module 122 is executed, the data is loaded into the retrieval system 128.
  • The software un-wraps the columns of text which were keyed "as is" for QC purposes.
  • Hyphenated words are reconnected.
  • The software uses a look up table to ensure that words that are to remain hyphenated do so.
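  • Un-wrapping keyed columns and reconnecting hyphenated words can be sketched as follows. The look up table contents are hypothetical, and comparing against the table in lower case with trailing punctuation removed is an assumption of the sketch.

```python
from typing import Iterable, List, Set

# Hypothetical look up table of words that must stay hyphenated.
KEEP_HYPHENATED = {"mother-in-law", "well-known"}


def unwrap_column(lines: Iterable[str],
                  keep_hyphenated: Set[str] = KEEP_HYPHENATED) -> str:
    """Join column lines into running text, rejoining words split by a hyphen."""
    words: List[str] = []
    carry = ""  # first half of a word broken across a line end, hyphen included
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if carry:
            with_hyphen = carry + tokens[0]
            without_hyphen = carry[:-1] + tokens[0]
            key = with_hyphen.lower().strip(".,;:!?")
            tokens[0] = with_hyphen if key in keep_hyphenated else without_hyphen
            carry = ""
        if tokens[-1].endswith("-") and len(tokens[-1]) > 1:
            carry = tokens.pop()
        words.extend(tokens)
    if carry:
        words.append(carry)
    return " ".join(words)


column = [
    "The funeral will be held Sat-",
    "urday for the well-",
    "known pastor.",
]
print(unwrap_column(column))
```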
  • The text verification system offers a number of benefits over spell checkers as well as other data entry verification systems.
  • The text verification system speeds up data entry by recognizing and entering words before they are completely entered, it provides automatic correction of errors, it assists those for whom English is a second language, and it screens the errant entry from a disgruntled data entry person.
  • The text verification system also can include a variety of other utilities and features which may be customized to the particular application.
  • The cataloging system 14 (see FIG. 1) is delivered to the database maintainer. It is used for a variety of functions, including updating, modifying, repairing, replacing, and editing of the database entries. This can be useful, for example, in cataloging special entries by adding terms to unique enhancement term key fields.
  • The cataloging includes an image/text editor 162 and a database editor 164.
  • The image/text editor 162 allows editing of the database entries themselves and the database editor 164 allows the cataloging technician to clean up dirty OCRed text to a 100% correct format. This is done by comparison of the image to the keyed text and manually editing and re-saving the corrected text. Small image editing functions can preferably be performed, such as image de-skew and cropping.
  • Both the image/text editor 162 and the database editor 164 are preferably contained in an editing program called AVATAR EDIT.
  • The user enters a search string in this editor, using natural language or specific terms for example, and the retrieval system 12 (described below) delivers the search results in this editor as well.
  • The meta data appears in the left-hand side of the screen.
  • The scanned article appears on the top right-hand side of the screen.
  • The search results appear on the bottom right-hand side of the screen.
  • FIG. 20 shows another configuration in which the OCR text appears along with the actual scanned article and the search results. This is a screen which can be used to edit the OCR text.
  • The retrieval system 12 of FIG. 1 includes a number of different features and modules, as shown in FIG. 18.
  • The interconnection between the elements of FIG. 18 is intended to indicate the integrated nature of the system 12, rather than physical or logical connections between the elements.
  • The user interface 171 is the primary user interface to the retrieval system 12.
  • The user interface 171 is a Web interface which allows access to the retrieval system 12 over the world wide web, but other interfaces can also be used, such as an MS-Windows application interface.
  • The user interface 171 can be implemented with any display device controlled, for example, by a processor.
  • Security features 177, such as password protection, are in place to ensure that only registered and authorized users gain access. Users could be given access for limited periods of time or to limited parts of the database. The security features 177 can also be used for billing purposes.
  • The SQL/ODBC database 175 can hold the meta data and the actual word searchable text files and the associated image files containing the digitized clippings; however, the system also allows the text and images to be stored under the operating system's file structure 173.
  • The database 175 and file structure 173 can include any storage medium.
  • A digital storage medium such as a hard disk is used.
  • The search engine 179 allows the user to search both the SQL/ODBC database 175 and the text files 173 using sophisticated searching techniques especially configured for this application.
  • The search engine 179 preferably includes a processor running appropriate software to perform the necessary functions.
  • Searching is preferably performed by entering a search string (not shown).
  • The search engine 179 preferably uses the search string to search various text files in the SQL/ODBC database 175 and/or the text files 173 and produces at least one search result. Sample search results are listed in the bottom right of FIG. 19, and correspond in this embodiment to different newspaper articles which have been scanned.
  • The retrieval system 12 preferably includes a computer system running software which performs one or more of the functions described.
  • The search engine 179 incorporates Adaptive Pattern Recognition Processing (APRP).
  • Fuzzy searching provides techniques which are fault tolerant. It finds patterns within the search string, and within words, and matches those patterns with patterns in the meta data or the text data. This processing technique also allows for user feedback to help refine the search. Fuzzy searching provides the ability to retrieve approximations of search queries and has a natural tolerance for errors in both input data and query terms. It eliminates the need for OCR clean up, which is especially useful in applications that handle large volumes of scanned documents. High precision and recall give end-users a high level of confidence that their queries will return all of the requested information regardless of errors in spelling or in the "dirty data" which they may be searching.
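  • The idea of matching patterns within words, so that OCR errors do not prevent a hit, can be illustrated with character trigram overlap. This is not APRP, whose internals are not described in the text; it is a generic substitute technique (Jaccard similarity over padded character trigrams) used only to show fault-tolerant matching. The function names and the 0.4 cutoff are assumptions.

```python
from typing import List, Set, Tuple


def trigrams(text: str) -> Set[str]:
    """Return the set of three-character patterns in the padded, lower-cased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def fuzzy_search(query: str, words: List[str],
                 min_score: float = 0.4) -> List[Tuple[str, float]]:
    """Return words whose trigram similarity to the query meets the cutoff, best first."""
    scored = [(word, similarity(query, word)) for word in words]
    hits = [item for item in scored if item[1] >= min_score]
    return sorted(hits, key=lambda item: item[1], reverse=True)


ocr_words = ["obituarv", "obltuary", "budget"]
for word, score in fuzzy_search("obituary", ocr_words):
    print(word, round(score, 3))
```

Both misrecognized spellings of "obituary" are retrieved without correcting the OCR text, while the unrelated word is not.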
  • The search engine 179 also provides semantic expansion and semantic search capability.
  • Preferable features of the Semantic Network include:
  • The baseline Semantic Network is preferably created from complete dictionaries, a thesaurus, and other semantic resources, and gives users a built-in knowledgebase of 400,000 word meanings and over 1.6 million word relationships.
  • Natural Language Processing: Users can preferably simply enter straightforward, plain English queries, which are then automatically enhanced by a rich set of related terms and concepts, to find information targeted to their specific context.
  • Morphology: The Network preferably recognizes words at the root level, which is a much more accurate approach than the simple stemming techniques characteristic of other text retrieval software. This minimizes word misses which are caused by irregular or variant spellings.
  • Idioms: The Network preferably recognizes idioms for more accurate searches, and processes phrases like "real estate" and "kangaroo court" as single units of meaning, not as individual words.
  • Semantics: The Network preferably recognizes multiple meanings of words and allows users to simply point and click to choose the meaning appropriate to their queries.
  • Multi-layered dictionary: The baseline Semantic Network preferably supports multi-layered dictionary structures that add even greater depth and flexibility. This enables integration of specialized reference works for legal, medical, finance, engineering, and other disciplines. End users can also preferably add personalized definitions and concepts without affecting the integrity of the baseline knowledgebase.
  • The functionality disclosed in this application can be, at least partially, implemented by hardware, software, or a combination of both. This may be done, for example, with a Pentium-based computer system running database and editing software, or other programs.
  • This functionality may be embodied in computer readable media or computer program products to be used in programming an information-processing apparatus to perform in accordance with the invention. Such media or products may include magnetic, magnetic-optical, optical, and other types of media, including for example 3.5 inch diskettes and other digital storage media.
  • This functionality may also be embodied in computer readable media such as a transmitted waveform to be used in transmitting the information or functionality.
  • Software implementations can be written in any suitable language, including without limitation high-level programming languages such as C++, mid-level and low-level languages, assembly languages, and application-specific or device-specific languages.
  • Such software can run on a general purpose computer such as a 486 or a Pentium, an application specific piece of hardware, or other suitable device.
  • The required logic may also be performed by an application specific integrated circuit ("ASIC") or other device.
  • The technique may use analog circuitry, digital circuitry, or a combination of both.
  • Embodiments may also include various hardware components which are well known in the art, such as connectors, cables, and the like.


Abstract

The invention relates to a method of digitizing paper information taken from a newspaper, which comprises capturing the information in a digital image format and then processing the image to produce searchable text. The processing includes removing data stamps and other signs or marks printed on the paper, enhancing the image using processing functions from an image library, and performing optical character recognition (OCR) in order to select an optimal OCR output. The optimal OCR output yields highly accurate searchable text, through adaptive recognition processing, fuzzy logic, morphology, and other techniques, so as to build a word-searchable database of paper information taken from newspapers. The method is software-driven so that the workflow, electronic or non-electronic, between the various processes or stations can be tracked and sequenced and the appropriate data collected and stored.
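The abstract refers to selecting an optimal OCR output but does not state the selection criterion. As a hedged illustration only, the sketch below assumes that several candidate OCR texts exist for the same image (for example, from different enhancement settings) and picks the one with the highest share of words found in a lexicon. The scoring rule, the function names `lexicon_hit_rate` and `select_best_output`, and the sample data are all assumptions, not the method of the application.

```python
import re

def lexicon_hit_rate(text, lexicon):
    """Fraction of alphabetic tokens in `text` that appear in `lexicon`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)

def select_best_output(candidates, lexicon):
    """Return the candidate OCR text with the highest lexicon hit rate."""
    return max(candidates, key=lambda t: lexicon_hit_rate(t, lexicon))

# Hypothetical usage: two OCR passes over the same newspaper line.
lexicon = {"the", "city", "council", "met", "today"}
candidates = [
    "tbe c1ty councll met today",   # noisy pass
    "the city council met today",   # cleaner pass
]
best = select_best_output(candidates, lexicon)
```

A production system would likely combine several signals (engine confidence, character-level agreement between passes), but a dictionary hit rate is a common, simple proxy for OCR quality.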
PCT/US2000/022492 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data WO2001013279A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU70605/00A AU7060500A (en) 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14922299P 1999-08-17 1999-08-17
US60/149,222 1999-08-17

Publications (3)

Publication Number Publication Date
WO2001013279A2 (fr) 2001-02-22
WO2001013279A9 (fr) 2001-06-14
WO2001013279A3 (fr) 2004-02-19

Family

ID=22529293

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/022492 WO2001013279A2 (fr) 1999-08-17 2000-08-17 Word searchable database from high volume scanning of newspaper data

Country Status (2)

Country Link
AU (1) AU7060500A (fr)
WO (1) WO2001013279A2 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1552466A1 (fr) * 2002-10-18 2005-07-13 Olive Software, Inc. System and method for automatic preparation of data repositories from microfilm-type materials
EP1684199A2 (fr) * 2005-01-19 2006-07-26 Olive Software, Inc. Digitization of microfiche
WO2006125831A1 (fr) * 2005-05-27 2006-11-30 Thomas Henry Devices and methods enabling a user to manage a plurality of objects, in particular paper documents
WO2007024392A1 (fr) * 2005-08-24 2007-03-01 Hewlett-Packard Development Company, L.P. Classifying regions defined within a digital image
US20130300562A1 (en) * 2012-05-11 2013-11-14 Sap Ag Generating delivery notification
US9386877B2 (en) 2007-05-18 2016-07-12 Kraft Foods R & D, Inc. Beverage preparation machines and beverage cartridges
US10380554B2 (en) 2012-06-20 2019-08-13 Hewlett-Packard Development Company, L.P. Extracting data from email attachments
US10445617B2 (en) 2018-03-14 2019-10-15 Drilling Info, Inc. Extracting well log data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0539106A2 (fr) * 1991-10-24 1993-04-28 AT&T Corp. Electronic information delivery system
US5402504A (en) * 1989-12-08 1995-03-28 Xerox Corporation Segmentation of text styles
US5809167A (en) * 1994-04-15 1998-09-15 Canon Kabushiki Kaisha Page segmentation and character recognition system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5402504A (en) * 1989-12-08 1995-03-28 Xerox Corporation Segmentation of text styles
EP0539106A2 (fr) * 1991-10-24 1993-04-28 AT&T Corp. Electronic information delivery system
US5809167A (en) * 1994-04-15 1998-09-15 Canon Kabushiki Kaisha Page segmentation and character recognition system

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1552466A4 (fr) * 2002-10-18 2007-07-25 Olive Software Inc System and method for automatic preparation of data repositories from microfilm-type materials
EP1552466A1 (fr) * 2002-10-18 2005-07-13 Olive Software, Inc. System and method for automatic preparation of data repositories from microfilm-type materials
EP1684199A2 (fr) * 2005-01-19 2006-07-26 Olive Software, Inc. Digitization of microfiche
EP1684199A3 (fr) * 2005-01-19 2008-07-09 Olive Software, Inc. Digitization of microfiche
FR2886429A1 (fr) * 2005-05-27 2006-12-01 Thomas Henry System enabling a user to manage a plurality of paper documents
WO2006125831A1 (fr) * 2005-05-27 2006-11-30 Thomas Henry Devices and methods enabling a user to manage a plurality of objects, in particular paper documents
WO2007024392A1 (fr) * 2005-08-24 2007-03-01 Hewlett-Packard Development Company, L.P. Classifying regions defined within a digital image
US7539343B2 (en) 2005-08-24 2009-05-26 Hewlett-Packard Development Company, L.P. Classifying regions defined within a digital image
US9386877B2 (en) 2007-05-18 2016-07-12 Kraft Foods R & D, Inc. Beverage preparation machines and beverage cartridges
US10952562B2 (en) 2007-05-18 2021-03-23 Koninklijke Douwe Egberts B.V. Beverage preparation machines and beverage cartridges
US20130300562A1 (en) * 2012-05-11 2013-11-14 Sap Ag Generating delivery notification
US10380554B2 (en) 2012-06-20 2019-08-13 Hewlett-Packard Development Company, L.P. Extracting data from email attachments
US10445617B2 (en) 2018-03-14 2019-10-15 Drilling Info, Inc. Extracting well log data
US10565467B2 (en) 2018-03-14 2020-02-18 Drilling Info, Inc. Extracting well log data

Also Published As

Publication number Publication date
WO2001013279A9 (fr) 2001-06-14
WO2001013279A3 (fr) 2004-02-19
AU7060500A (en) 2001-03-13

Similar Documents

Publication Publication Date Title
US6243501B1 (en) Adaptive recognition of documents using layout attributes
US7050630B2 (en) System and method of locating a non-textual region of an electronic document or image that matches a user-defined description of the region
US6768816B2 (en) Method and system for interactive ground-truthing of document images
US7773822B2 (en) Apparatus and methods for management of electronic images
US5628003A (en) Document storage and retrieval system for storing and retrieving document image and full text data
Papadopoulos et al. The IMPACT dataset of historical document images
US7081975B2 (en) Information input device
US5923792A (en) Screen display methods for computer-aided data entry
EP1473641B1 (fr) Data processing apparatus, method, storage medium, and program
US6178417B1 (en) Method and means of matching documents based on text genre
US20050289182A1 (en) Document management system with enhanced intelligent document recognition capabilities
US20010042083A1 (en) User-defined search template for extracting information from documents
US7548916B2 (en) Calculating image similarity using extracted data
US20100128922A1 (en) Automated generation of form definitions from hard-copy forms
US20090116746A1 (en) Systems and methods for parallel processing of document recognition and classification using extracted image and text features
WO2007117334A2 (fr) Document analysis system for integrating paper documents into a searchable electronic database
Kim et al. Automated labeling in document images
WO2011051815A2 (fr) System and method for using dynamic variance networks
US20060167899A1 (en) Meta-data generating apparatus
KR20060001392A (ko) Method for storing document images based on content search using character recognition
WO2001013279A2 (fr) Word searchable database from high volume scanning of newspaper data
US20030101199A1 (en) Electronic document processing system
US20060176521A1 (en) Digitization of microfiche
JPH0934903A (ja) File retrieval device
Yacoub et al. Document digitization lifecycle for complex magazine collection

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: The EPO has been informed by WIPO that EP was designated in this application
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1-43, DESCRIPTION, REPLACED BY NEW PAGES 1-43; PAGES 44-51, CLAIMS, REPLACED BY NEW PAGES 44-51; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 2004-01-01)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP