WO2008077126A2 - Method for categorizing portions of text - Google Patents
Method for categorizing portions of text
- Publication number
- WO2008077126A2 (PCT/US2007/088207)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- image
- database
- criteria
- computer
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
Definitions
- the present application relates to categorizing portions of text concerning an associated image.
- Electronic representations of text may be segmented using a variety of factors. For example, some portions of text may be associated with other portions of text by quantifying, then comparing, word distributions on two sides of a potential segment boundary. In such a case, a large difference would be a positive indicator for a segment boundary, while a small difference would be a negative indicator for a segment boundary. This is but one example among many of methods for segmenting text. However, all these methods share the basic assumption that there are measurable differences in patterns of word usage between segments. Thus, there is a need for a method to categorize text with regard to features of that text that cannot be discovered simply through analysis of word distribution.
- a method for categorizing portions of text concerning an associated image includes clustering the text into clusters based on a first set of one or more criteria, associating the clustered text with one or more images, and assigning one or more labels to the associated text based on a second set of one or more criteria.
- the method further comprises analyzing the text to identify a second set of one or more criteria relevant to the associated text.
- the portions of text are electronically scanned into a database.
- the scanned text is converted into a document encoding format.
- the method further comprises searching a database for an electronic version of said text.
- the method may also further comprise accessing the database through the Internet. Additionally, the method may even further comprise searching the same or another database for an electronic version of the one or more images associated with the text. The method may further comprise accessing that database through the Internet.
- the categorizing portions of text concerning an associated image further comprises determining a text representation for use by a machine learning algorithm.
- the assigning one or more labels to the associated text comprises using a machine learning algorithm.
- the machine learning algorithm comprises a Naive Bayes algorithm.
- the machine learning algorithm comprises a support vector machine algorithm.
- a computer program product for categorizing portions of text concerning an associated image is embodied in a computer readable medium and comprises computer instructions for clustering the text into clusters based on a first set of one or more criteria, associating the clustered text with one or more images, and assigning one or more labels to the associated text based on a second set of one or more criteria.
- the computer readable medium comprises instructions for analyzing the text to identify the second set of one or more criteria relevant to the text.
- the computer readable medium comprises instructions for determining a text representation for use by a machine learning algorithm.
- the system further comprises a database comprising one or more images associated with the text.
- the computer readable medium stores a machine learning algorithm.
- the computer readable medium comprises instructions for using a machine learning algorithm, e.g., a Naive Bayes algorithm or a support vector machine algorithm.
- a computer database product for categorizing portions of text concerning an associated image is embodied in a computer readable medium and is created by clustering the text into clusters based on a first set of one or more criteria, associating the clustered text with one or more images, and assigning one or more labels to the associated text based on a second set of one or more criteria.
- the computer database product is created by analyzing the text to identify the second set of one or more criteria relevant to the text.
- the computer database product is created by determining a text representation for use by a machine learning algorithm.
- the machine learning algorithm is a Naive Bayes algorithm or a support vector machine algorithm.
- FIG. 1 illustrates a flow chart of the method in accordance with an embodiment of the present invention.
- FIG. 2 illustrates a schematic of the system in accordance with an embodiment of the present invention.
- Figure 1 is an exemplary embodiment of a method 100 for categorizing portions of text concerning an associated image. Certain steps may be combined, and certain steps may occur in a different sequence or simultaneously.
- text that is to be categorized is explanatory text having an association with certain images, such as artwork.
- the text may include descriptions of the artwork, such as information on the image, the techniques or artistic style of rendering the image, the subject depicted in the image, or the historical background of the image. Accessing the text is necessary to the categorization process.
- scanning (110) is composed of scanning the text (112) into an electronic format using a scanning apparatus (not shown) performing an optical character recognition algorithm or similar technique to obtain an electronic version of the text.
- the optical character recognition scanning process may scan an entire text automatically and without interruption.
- the optical character recognition scanning process may scan the text in a plurality of portions, and between the scanning of one or more portions a user may adjust the parameters of the optical character recognition algorithm. For example, a user may recognize that the optical character recognition algorithm consistently misrecognizes the character "b"; the user may then adjust the algorithm to compensate, thus improving the efficiency of the scanning of subsequent portions of the text.
- the image to be associated with the scanned text may be similarly scanned into an electronic format (114).
- the image may already exist in a pre-loaded database, in which case the pre-loaded database may be accessed (116) to load the image.
- the image and text are scanned in the same step (118).
- the scanned image is stored (115) in a database for future use.
- the image data is provided in electronic format, e.g., as an image file.
- the text categorization may proceed independently of the image data.
- the text may already exist in an electronic format that may be accessed (116) by a user.
- This electronic format may be present on a database accessible through the Internet in a searchable form. Alternatively, the format may be on some other database accessible through a local network, or in any other suitable manner.
- a user may perform a search (117) of the database for the text and/or image for use in the present inventive method.
- the search (117) may also be conducted across the Internet as a whole or any portion thereof.
- the text found via a search (117) may already exist in a machine readable format, or alternatively, may be in an image-type format, such as a Portable Document Format (PDF) file (i.e., an electronic image of the text).
- the text may then be converted into a document encoding format by use of the methods detailed herein.
- the text may be converted (119) into a machine readable format by use of an optical character recognition process, or by any other appropriate process for conversion of an electronic document into a machine readable format.
- the scanned text is converted (120) to a document encoding format, e.g., Text Encoding Initiative (TEI) Lite with Extensible Markup Language (XML) format.
- the conversion (120) may include marking up (122) the text for any or all structural indicators, e.g., sentences, paragraphs, headers, section changes, tables of contents, title pages, conclusions, image captions, and so on.
- the conversion (120) to a document encoding format may further include locating (124), in the electronic version of the text, the location of an image corresponding to its location in the original version of the text.
- the conversion (120) to a document encoding format may also include tagging (126) the image locations with a tag indicating where in the original text the image appeared.
- the conversion (120) may be performed automatically, i.e., without user interaction. Alternatively, the conversion (120) may be performed in a user interactive manner. Additionally, in the same or another embodiment, the locating (124) the image locations may involve identifying gaps in the electronic version of the text that indicate an image was at that location. In the same or another embodiment, locating (124) the image locations may involve identifying caption text that corresponds to the image that appears in that location in the original version of the text. In the same or yet another embodiment the locating (124) the image locations may involve identifying image labels that correspond to the image that appears in that location in the original version of the text. Furthermore, the tagging (126) of the image locations may involve tagging the image locations with XML tags that are unique to each image.
- Those XML tags may correspond to the image's figure number, if it has one, or some other unique information about that image, e.g., a descriptive caption.
- the tagging (126) may also involve assigning to each image a unique image identifying number.
- a section of text appears as follows after the conversion (120) of the text into a TEI Lite format with XML mark-ups:
- the sculptor probably meant to personify the entire Roman army.
- attendants lead in Valerian, who kneels before Shapur and begs for mercy.
- a <hi rend="i">putto</hi>-like (cherub or childlike) figure borrowed from the repertory of Greco-Roman art hovers above the king and brings him a victory garland.
- the " ⁇ head>” tag indicates that the text following is the heading of this sub-section, while the “ ⁇ /head>” tag indicates the end of the heading text.
- the " ⁇ p>” tag indicates the beginning of a paragraph, while the " ⁇ /p>” tag indicates the closing of a paragraph.
- the "<gap>" tag indicates the location of a gap in the text and the reason for that gap.
- the text is thus converted (120) to a TEI Lite format with XML markups for further processing as detailed below.
- the scanned text is clustered (130) into groupings that relate to one or more images based on a first set of one or more criteria.
- Clustering (130) is accomplished by using a software program to determine what portions of text are related to the associated one or more images, as will be described below.
- the software program useful for clustering text is a program capable of reading text in a TEI Lite format with XML mark-ups.
- clustering the scanned text (130) may include choosing one or more paragraphs based on a set of criteria (132). For example, in one embodiment, the paragraph which includes the XML tag for the image is always included in the clustering algorithm.
- subsequent paragraphs may be included based on one or more criteria.
- a criterion might consist of including each subsequent paragraph until a certain condition is met.
- a condition might be that a subsequent paragraph is tagged with an XML tag indicating a new division, e.g., a tag for a new subsection. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster.
- a second condition might be that a subsequent paragraph contains the XML tag for a different image. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster.
- These two conditions are examples of any number of possible conditions that might operate to terminate the inclusion of subsequent paragraphs. Furthermore, these two conditions may be operated independently, such that satisfaction of either, or both, will operate to terminate the inclusion of subsequent paragraphs in the cluster.
- associating selected portions of text to a related image may be performed. Associating (140) is accomplished by using a software program to determine what portions of text are related to the associated one or more images.
- the software program useful for clustering text is a program capable of reading text in a TEI Lite format with XML mark-ups.
- Associating selected portions of the text to a related image (140) includes identifying the location of an XML tag or the unique identifying number, e.g., the image plate number, marking the location of an image within a cluster of text associated with that image (142).
- the identifying (142) may be followed by creating an electronic association (144) between the XML tag or the unique identifying number marking the location of the image within the cluster of text and an electronic version of the image.
- This electronic association (144) may allow for an electronic link, whereby an electronic version of the image is displayed along with the associated cluster of text on a displaying device.
- an example of text would appear as follows after the clustering (130) and associating (140) of the text by the custom program:
- the corners of the White Temple are oriented to the cardinal points of the compass.
- the building, probably dedicated to Anu, the sky god, is of modest proportions (sixty-one by sixteen feet). By design, it did not accommodate large throngs of worshipers but only a select few, the priests and perhaps the leading community members.
- the temple has several chambers.
- the central hall, or cella, was set aside for the divinity and housed a stepped altar.
- the Sumerians referred to their temples as “waiting rooms,” a reflection of their belief that the deity would descend from the heavens to appear before the priests in the cella. How or if the Uruk temple was roofed is uncertain.
- the "ID” here is the unique identifying number assigned to figure 2-1, the "White Temple and ziggurat, Uruk (modern Warka), Iraq, ca. 3200-3000 B.C.” image, so labeled in the original text.
- Paragraphs 19 and 20 are paragraphs that have been clustered and associated with the image tagged "ID: 0".
- the associated cluster of text is thus formatted and in condition for labeling as detailed below.
- one or more associated clusters of text, or portions thereof are labeled with one or more labels based on a set of one or more criteria (150).
- the label assigned to any given cluster of text, or portion thereof, may be chosen from among seven categories of labels described herein.
- Assigning one or more labels may include analyzing one or more associated clusters of text, or portions thereof, to identify the one or more associated clusters of text, or portions thereof, that could be appropriately labeled with one or more of the categories of labels described herein (152).
- the analyzed portions of text are assigned with the one or more appropriate labels (154).
- the assigning of the labels may involve placing XML tags on the TEI Lite formatted version of the text at appropriate locations.
- the assigning of labels may be accomplished by the use of a custom program capable of reading and marking text in a TEI Lite format with XML mark-ups.
- the process of categorizing portions of text concerning an associated image may involve determining a text representation (160) for use by a machine learning algorithm. Once determined, such representation may be input into a machine learning algorithm to further refine the ability to determine what portions of text should be labeled with a particular label.
- Categories may be assigned to the portions of text.
- the categories may represent one of several aspects of the text.
- the text may refer to features of the image associated with the text, such as the content of the image or the techniques used to create the image.
- Other categories relate to historical data about the subject of the image or of the image itself.
- Each portion of text may be associated with one or more categories.
- the first exemplary category of labels is an "Image Content" label.
- the Image Content label might be assigned to one or more clusters of text, or portions thereof, where that text mentions the image and/or describes it in concrete terms.
- the Image Content label might be assigned where the text has an explicit mention of the unique image identifier, e.g., the image plate number or figure number.
- the Image Content label might be assigned where the text is primarily about one or more specific objects in the image.
- the Image Content label might not be applied where the text describes the general class of objects of which the object depicted is a member, rather than a description of the specific object depicted.
- the second exemplary category of labels is a "Historical Context" label.
- the Historical Context label might be assigned to one or more clusters of text, or portions thereof, where that text describes the historical context of the creation of the object depicted in the image.
- the Historical Context label might be assigned where the text describes when the object depicted was created or why it was created or under what circumstances it was created, e.g., whether it was commissioned or not.
- the Historical Context label might be assigned where the text mentions the broader art history facts about the period in which the object depicted was created.
- the third exemplary category of labels is a "Biographical" label.
- the Biographical label might be assigned to one or more clusters of text, or portions thereof, where the text provides biographical information about the artist whose artwork is depicted in the image.
- the Biographical label might be assigned where the text provides biographical information about the patron of the object, e.g., the person for whom the object depicted was commissioned.
- the Biographical label might be assigned where the text provides biographical information about any other people or personages involved in creating the object depicted or having a direct link to the object depicted after it was created.
- the fourth exemplary category of labels is an "Implementation" label.
- the Implementation label might be assigned to one or more clusters of text, or portions thereof, where the text provides information about conventions, methods, use of particular materials or tools, or techniques that the artist implemented in the creation of the object depicted in the image.
- the Implementation label might be assigned where the text includes a discussion about any other requirements and/or limitations employed by the artist.
- the fifth exemplary category of labels is a "Comparison" label.
- the Comparison label might be assigned to one or more clusters of text, or portions thereof, where the text discusses the object depicted in the image in reference to one or more other works.
- the reference to one or more other works may involve a comparison of one or more features of the work depicted in the image and the work referenced, e.g., a comparison of the imagery of the two works or a comparison of the techniques employed to create the two works.
- the Comparison label may not be applied where there is a cross reference to another work without some kind of comparison of the depicted work with the referenced work.
- the sixth exemplary category of labels is an "Interpretation" label.
- the Interpretation label might be assigned to one or more clusters of text, or portions thereof, where the author of the text provides his or her opinions about the interpretation of the object depicted in the image.
- the Interpretation label may be applied to text identified by emotion-bearing vocabulary, sentiment terms, or opinion terms.
- the seventh exemplary category of labels is a "Significance" label.
- the Significance label might be assigned to one or more clusters of text, or portions thereof, where the text describes the significance of the object depicted in the image in art history terms.
- the Significance label may be applied to text identified by the use of superlative phrases, e.g., "this is the most . . .", "this is quintessential . . .", and so on.
- the Significance label may be applied to text identified by the use of superlative morphology, e.g., the use of "-est" words, such as "best", "finest", and so on.
- FIG. 2 is an exemplary embodiment of a system 200 for categorizing portions of text concerning an associated image.
- a database 210 may contain one or more texts in electronic format. In the same or another embodiment, the database 210 may also contain one or more images in electronic format.
- a processor 220 may be operatively coupled to a computer readable storage medium, such as memory 230. In an exemplary embodiment, the processor 220 is a personal computer operating with memory, such as a standard desktop computer.
- the database 210 is stored on a hard drive.
- the memory 230 may be storing program instructions that when executed by the processor 220, cause the processor 220 to cluster the text into clusters based on a first set of one or more criteria.
- the stored program instructions may be a custom software program used to determine what portions of text are related to the associated one or more images.
- the custom software program useful for clustering text is a program capable of reading text in a TEI Lite format with XML mark-ups.
- the program instructions stored in the memory 230 may be capable of clustering the scanned text by choosing one or more paragraphs for inclusion in the cluster based on a set of one or more criteria, when executed by the processor 220.
- the paragraph which includes the XML tag for the image is always included.
- the immediately preceding paragraph is always included.
- subsequent paragraphs may be included based on one or more criteria. For example, a criterion might consist of including each subsequent paragraph until a certain condition is met. Such a condition might be that a subsequent paragraph is tagged with an XML tag indicating a new division, e.g., a tag for a new subsection. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster.
- a second condition might be that a subsequent paragraph contains the XML tag for a different image. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster.
- the memory 230 may be storing program instructions that when executed by the processor 220, cause the processor 220 to associate the clustered text with one or more images.
- the stored program instructions may be a custom software program used to determine what portions of text are related to the associated one or more images.
- the custom software program useful for associating text is a program capable of reading text in a TEI Lite format with XML mark-ups.
- the program instructions stored in the memory 230 may be capable of associating selected portions of text to a related image by identifying the location of the XML tag or the unique identifying number, e.g., the image plate number, marking the location of an image within a cluster of text associated with that image, when executed by the processor 220.
- the program instructions stored in the memory 230 may further be capable of creating an electronic association between the XML tag or the unique identifying number marking the location of the image within the cluster of text and an electronic version of the image, when executed by the processor 220. This electronic association may allow for an electronic link, whereby an electronic version of the image is displayed along with the associated cluster of text on a displaying device.
- the memory 230 may be storing program instructions that when executed by the processor 220, cause the processor 220 to assign one or more labels to the associated text based on a second set of one or more criteria.
- the stored program instructions may be a custom software program used to determine what portions of associated text are relevant to the one or more categories of labels and then assign those portions with the appropriate label.
- the custom software program useful for labeling text may be a program capable of reading and marking text in a TEI Lite format with XML markups.
- the label assigned to any given cluster of text, or portion thereof may be chosen from among seven categories of labels described herein.
- the program instructions stored in the memory 230 may be capable of assigning one or more labels by analyzing one or more associated clusters of text, or portions thereof, to identify the one or more associated clusters of text, or portions thereof, that could be appropriately labeled with one or more of the categories of labels described herein, when executed by the processor 220.
- the program instructions stored in the memory 230 may further be capable of assigning the analyzed portions of text with the one or more appropriate labels, when executed by the processor 220.
- the program instructions stored in the memory 230 may be capable of analyzing the text and assigning the appropriate label to the text using a machine learning algorithm, when executed by the processor 220.
- the machine learning algorithm may be a Naive Bayes algorithm.
- the machine learning algorithm may also be a support vector machine algorithm.
- the program instructions stored in the memory 230 may be capable of determining a text representation for use by a machine learning algorithm, when executed by the processor 220. Once determined, the program instructions stored in the memory 230 may be capable of inputting such representation into a machine learning algorithm to further refine the ability to determine what portions of text should be labeled with a particular label, when executed by the processor 220.
- the output of the algorithm includes a database created by the processes described herein, including, e.g., the text portions and their assigned categories. Such output may be stored in a text file, output to a monitor or print-out, etc.
- the output may be in an XML mark-up format, such as that shown above.
- the output may be in a format with the XML marking hidden, also as shown above.
Landscapes
- Engineering & Computer Science (AREA)
- Library & Information Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Systems and methods for categorizing portions of text concerning an associated image are disclosed herein. In some embodiments, a method for categorizing portions of text concerning an associated image is disclosed which includes clustering the text into clusters based on a first set of one or more criteria, associating the clustered text with one or more images, and assigning one or more labels to the associated text based on a second set of one or more criteria.
Description
METHOD FOR CATEGORIZING PORTIONS OF TEXT
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Application No.
60/870,726 entitled "Random Access Digital Image Archivists Text Extraction Tool," filed on December 19, 2006, which is incorporated by reference in its entirety herein.
BACKGROUND
FIELD
[0002] The present application relates to categorizing portions of text concerning an associated image.
BACKGROUND ART
[0003] Electronic representations of text may be segmented using a variety of factors. For example, some portions of text may be associated with other portions of text by quantifying, then comparing, word distributions on two sides of a potential segment boundary. In such a case, a large difference would be a positive indicator for a segment boundary, while a small difference would be a negative indicator for a segment boundary. This is but one example among many of methods for segmenting text. However, all these methods share the basic assumption that there are measurable differences in patterns of word usage between segments. Thus, there is a need for a method to categorize text with regard to features of that text that cannot be discovered simply through analysis of word distribution. Furthermore, there exists a need for a method for top-down processing of text, i.e., determining what words or segments of text would be indicative of a particular categorical label. Some text representations may be more useful than others for deciding whether to assign a particular label to one or more portions of text.
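As a concrete illustration of the word-distribution approach described above, the following minimal Python sketch (not part of the original disclosure) scores each candidate boundary between sentences by the cosine similarity of word counts in a window on either side, so that a low score marks a likely segment break. The window size and whitespace tokenization are illustrative assumptions.

```python
# Sketch: segment text by comparing word distributions across a
# candidate boundary (low similarity = likely segment break).
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def boundary_scores(sentences, window=3):
    """Score the gap after each sentence; lower scores are more
    boundary-like (word usage differs on the two sides)."""
    tokens = [s.lower().split() for s in sentences]
    scores = []
    for i in range(1, len(tokens)):
        left = Counter(w for sent in tokens[max(0, i - window):i] for w in sent)
        right = Counter(w for sent in tokens[i:i + window] for w in sent)
        scores.append(cosine(left, right))
    return scores
```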
SUMMARY
[0004] Systems and methods for categorizing portions of text concerning an associated image are disclosed herein.
[0005] In some embodiments, a method for categorizing portions of text concerning an associated image is disclosed which includes clustering the text into clusters based on a first set of one or more criteria, associating the clustered text with one or more images, and assigning one or more labels to the associated text based on a second set of one or more criteria.
[0006] In some embodiments, the method further comprises analyzing the text to identify a second set of one or more criteria relevant to the associated text. In some embodiments, the portions of text are electronically scanned into a database. In some embodiments, the scanned text is converted into a document encoding format. In some embodiments, the method further comprises searching a database for an electronic version of said text. The method may also further comprise accessing the database through the Internet. Additionally, the method may even further comprise searching the same or another database for an electronic version of the one or more images associated with the text. The method may further comprise accessing that database through the Internet.
[0007] In some embodiments, the categorizing portions of text concerning an associated image further comprises determining a text representation for use by a machine learning algorithm. In some embodiments, the assigning one or more labels to the associated text comprises using a machine learning algorithm. In some embodiments, the machine learning algorithm comprises a Naive Bayes algorithm. In some embodiments, the machine learning algorithm comprises a support vector machine algorithm.
[0008] In some embodiments, a computer program product for categorizing portions of text concerning an associated image is embodied in a computer readable medium and comprises computer instructions for clustering the text into clusters based on a first set of one or more criteria, associating the clustered text with one or more images, and assigning one or more labels to the associated text based on a second set of one or more criteria.
[0009] In some embodiments, the computer readable medium comprises instructions for analyzing the text to identify the second set of one or more criteria relevant to the text. In some embodiments, the computer readable medium comprises instructions for determining a text representation for use by a machine learning algorithm. In some embodiments, the system further comprises a database comprising one or more images associated with the text. In some embodiments, the computer readable medium stores a machine learning algorithm. In some embodiments, the computer readable medium comprises instructions for using a machine learning algorithm, e.g., a Naive Bayes algorithm or a support vector machine algorithm.
[0010] In some embodiments, a computer database product for categorizing portions of text concerning an associated image is embodied in a computer readable medium and is created by clustering the text into clusters based on a first set of one or more criteria, associating the clustered text with one or more images, and assigning one or more labels to the associated text based on a second set of one or more criteria. [0011] In some embodiments, the computer database product is created by analyzing the text to identify the second set of one or more criteria relevant to the text. In some embodiments, the computer database product is created by determining a text representation for use by a machine learning algorithm. In some embodiments, the machine learning algorithm is a Naive Bayes algorithm or a support vector machine algorithm.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The accompanying drawings, which are incorporated and constitute part of this disclosure, illustrate some embodiments of the invention.
[0013] FIG. 1 illustrates a flow chart of the method in accordance with an embodiment of the present invention.
[0014] FIG. 2 illustrates a schematic of the system in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
[0015] The following embodiments are all described with reference to the use of texts, and associated images from those texts, on the subject of art or art history.
However, it is envisioned that any type of text with associated images could be used in accordance with the present invention. For example, the text and images could be obtained from electronic sources such as web pages available on the World Wide Web.
[0016] Figure 1 is an exemplary embodiment of a method 100 for categorizing portions of text concerning an associated image. Certain steps may be combined, and certain steps may occur in a different sequence or simultaneously. Typically, text that is to be categorized is explanatory text having an association with certain images, such as artwork. The text may include descriptions of the artwork, such as information on the image, the techniques or artistic style of rendering the image, the subject depicted in the image, or the historical background of the image. Accessing the text is necessary to the categorization process. [0017] In an exemplary embodiment, scanning (110) is composed of scanning the text (112) into an electronic format using a scanning apparatus (not shown) performing an optical character recognition algorithm or similar technique to obtain an electronic version of the text. The optical character recognition scanning process may scan an entire text automatically and without interruption. Alternatively, the optical character recognition scanning process may scan the text in a plurality of portions, and between the scanning of one or more portions a user may adjust the parameters of the optical character recognition algorithm. For example, a user may recognize that the optical character recognition algorithm consistently misrecognizes the character "b"; the user may then adjust the algorithm to compensate, thus improving the efficiency of the scanning of subsequent portions of the text. [0018] The image to be associated with the scanned text may be similarly scanned into an electronic format (114). In the same or yet another embodiment, the image may already exist in a pre-loaded database, in which case the pre-loaded database may be accessed (116) to load the image. In some embodiments, the image and text are scanned in the same step (118). The scanned image is stored (115) in a database for future use. Alternatively, the image data is provided in electronic format, e.g., as an image file. In some embodiments, the text categorization may proceed independently of the image data.
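A hedged sketch of the scanning step (110)-(112) follows. The patent does not name an OCR engine, so the example assumes the pytesseract and Pillow libraries; the `config` parameter stands in for the user-adjustable OCR parameters described above, and the file names and tuning flag are hypothetical.

```python
# Sketch: OCR one scanned portion at a time, letting a user adjust
# engine parameters between portions (per the description above).
from PIL import Image
import pytesseract

def ocr_portion(image_path, config=""):
    """Return the recognized text of one scanned page image."""
    return pytesseract.image_to_string(Image.open(image_path), config=config)

# First pass, then a re-run of later portions with adjusted parameters
# after spotting a systematic misread (hypothetical tuning):
text_1 = ocr_portion("scan_001.png")
text_2 = ocr_portion("scan_002.png", config="--psm 6")
```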
[0019] In an exemplary embodiment, the text may already exist in an electronic format that may be accessed (116) by a user. This electronic format may
be present on a database accessible through the Internet in a searchable form. Alternatively, the format may be on some other database accessible through a local network, or in any other suitable manner. In either case, a user may perform a search (117) of the database for the text and/or image for use in the present inventive method. The search (117) may also be conducted across the Internet as a whole or any portion thereof. The text found via a search (117) may already exist in a machine readable format, or alternatively, may be in an image-type format, such as a Portable Document Format (PDF) file (i.e., an electronic image of the text). In the case where the text is already present in a machine readable format it may then be converted into a document encoding format by use of the methods detailed herein. Alternatively, where the text is not already in a machine readable format it may be converted (119) into a machine readable format by use of an optical character recognition process, or by any other appropriate process for conversion of an electronic document into a machine readable format.
[0020] In an exemplary embodiment, the scanned text is converted (120) to a document encoding format, e.g., Text Encoding Initiative (TEI) Lite with Extensible Markup Language (XML) format. The conversion (120) may include marking up (122) the text for any or all structural indicators, e.g., sentences, paragraphs, headers, section changes, tables of contents, title pages, conclusions, image captions, and so on. The conversion (120) to a document encoding format may further include locating (124), in the electronic version of the text, the location of an image corresponding to its location in the original version of the text. The conversion (120) to a document encoding format may also include tagging (126) the image locations with a tag indicating where in the original text the image appeared.
[0021] The conversion (120) may be performed automatically, i.e., without user interaction. Alternatively, the conversion (120) may be performed in a user interactive manner. Additionally, in the same or another embodiment, the locating (124) the image locations may involve identifying gaps in the electronic version of the text that indicate an image was at that location. In the same or another embodiment, locating (124) the image locations may involve identifying caption text that corresponds to the image that appears in that location in the original version of the text. In the same or yet another embodiment the locating (124) the image
locations may involve identifying image labels that correspond to the image that appears in that location in the original version of the text. Furthermore, the tagging (126) of the image locations may involve tagging the image locations with XML tags that are unique to each image. Those XML tags may correspond to the image's figure number, if it has one, or some other unique information about that image, e.g., a descriptive caption. The tagging (126) may also involve assigning to each image a unique image identifying number.
[0022] In accordance with one exemplary embodiment, a section of text appears as follows after the conversion (120) of the text into a TEI Lite format with XML mark-ups:
<div3 type="subsecl">
<head>A REVERSAL OF FORTUNES</head>
<p>So powerful was the Sasanian army that in <hi rend="sc">a.d.</hi> 260 Shapur I even succeeded in capturing the Roman emperor Valerian near Edessa (in
/<gap reason=" illustration" desc="2-29 Head of a Sasanian king (Shapur II?), ca. A.D. 350 "M
<figure>
<figDesc>Silver with mercury gilding, 1 [special character] 3¾ [special character] high.
Metropolitan Museum of Art, New York.</figDesc>
</figure>
<pb n="417>
<gap reason="illustration" desc="2-30 Triumph of Shapur I over Valerian, rock-cut relief, Bishapur, Iran, ca. A.D. 260. "/>
modern Turkey). His victory over Valerian was so significant an event that Shapur commemorated it in a
series of rock-cut reliefs in the cliffs of Bishapur in Iran, far from the site of his triumph. We illustrate a detail of one of the Bishapur reliefs (<hi rend="sc">fig.</hi> <hi rend="b">2-30</hi>). Shapur appears larger than life, riding in from the left and wearing the same distinctive tall Sasanian crown the king in the silver portrait wears. The crown breaks through the relief's border and draws the viewer's attention to the king. A Roman soldier's crumpled body lies between the legs of the Sasanian's horse—a time-honored motif (compare the <hi rend="i">Standard of Ur,</hi> <hi rend="sc">fig.</hi> 2-8). Here the sculptor probably meant to personify the entire Roman army. At the right, attendants lead in Valerian, who kneels before Shapur and begs for mercy. Above, a <hi rend="i">putto</hi>-like (cherub or childlike) figure borrowed from the repertory of Greco-Roman art hovers above the king and brings him a victory garland. Similar scenes of kneeling enemies before triumphant generals are commonplace in Roman art—but at Bishapur the roles are reversed. This appropriation of Roman compositional patterns and motifs in a relief celebrating the Sasanian defeat of the Romans adds another, ironic, level of meaning to the political message in stone. </p>
The XML tag "<div3 type="subsecl">" indicates that the text that follows is a Type 3 Division, which here has been selected for a sub-section, hence the "subsecl" portion of the tag. The "<head>" tag indicates that the text following is the heading of this sub-section, while the "</head>" tag indicates the end of the heading text. Similarly the "<p>" tag indicates the beginning of a paragraph, while the "</p>" tag indicates the closing of a paragraph. The "<gap reason="illustration" desc="2-30 Triumph of Shapur I over Valerian, rock-cut relief, Bishapur, Iran, ca. A.D. 26O."/>" tag is
indicates the location of a gap in the text and the reason for that gap. In this example, the reason for a gap in the text was the appearance of an image in that text. Therefore, the encoded text indicates that reason by use of the tag "<gap reason="illustration"", and further includes the a description of the illustration that is composed of the caption of that image: "desc="2-30 Triumph of Shapur I over Valerian, rock-cut relief, Bishapur, Iran, ca. A. D. 26O."/>". In this exemplary embodiment, the text is thus converted (120) to a TEI Lite format with XML markups for further processing as detailed below.
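To illustrate the locating (124) and tagging (126) steps, the following sketch finds <gap reason="illustration"> markers in a small, well-formed TEI-like fragment (the excerpt above is abridged and not well-formed as printed) and assigns each one a unique image identifying number. The attribute name `id` is an illustrative choice, not the patent's own.

```python
# Sketch: locate illustration gaps in TEI-like XML and assign each a
# unique image identifying number, as in steps (124) and (126).
import xml.etree.ElementTree as ET

fragment = """<div3 type="subsec1">
  <p>His victory over Valerian was so significant an event that
  <gap reason="illustration"
       desc="2-30 Triumph of Shapur I over Valerian, rock-cut relief"/>
  Shapur commemorated it in a series of rock-cut reliefs.</p>
</div3>"""

root = ET.fromstring(fragment)
for uid, gap in enumerate(root.iter("gap")):
    if gap.get("reason") == "illustration":
        gap.set("id", str(uid))            # unique image identifier
        print(uid, "->", gap.get("desc"))  # e.g. 0 -> 2-30 Triumph ...
```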
[0023] In an exemplary embodiment, the scanned text is clustered (130) into groupings that relate to one or more images based on a first set of one or more criteria. Clustering (130) is accomplished by using a software program to determine what portions of text are related to the associated one or more images, as will be described below. The software program useful for clustering text is a program capable of reading text in a TEI Lite format with XML mark-ups. [0024] In the same or another embodiment, clustering the scanned text (130) may include choosing one or more paragraphs based on a set of criteria (132). For example, in one embodiment, the paragraph which includes the XML tag for the image is always included in the clustering algorithm. In the same or another embodiment, the immediately preceding paragraph is always included. Furthermore, subsequent paragraphs may be included based on one or more criteria. For example, a criterion might consist of including each subsequent paragraph until a certain condition is met. Such a condition might be that a subsequent paragraph is tagged with an XML tag indicating a new division, e.g., a tag for a new subsection. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster. A second condition might be that a subsequent paragraph contains the XML tag for a different image. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster. These two conditions are examples of any number of possible conditions that might operate to terminate the inclusion of subsequent paragraphs. Furthermore, these two conditions may be operated independently, such that satisfaction of either, or both, will operate to terminate the inclusion of subsequent paragraphs in the cluster.
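The clustering criteria just described can be sketched as follows. This is an illustrative reading of the rules (anchor paragraph, preceding paragraph, then extend forward until a new division or a different image tag), with the paragraph record structure an assumption rather than the patent's own data model.

```python
# Sketch of clustering step (130)/(132): gather the paragraphs that
# relate to one image, stopping at a new division or another image tag.
from dataclasses import dataclass, field

@dataclass
class Para:
    text: str
    image_ids: list = field(default_factory=list)  # images tagged here
    starts_division: bool = False                  # e.g. a new <div3>

def cluster_for_image(paras, image_id):
    anchor = next(i for i, p in enumerate(paras) if image_id in p.image_ids)
    cluster = paras[max(0, anchor - 1):anchor + 1]  # preceding + anchor
    for p in paras[anchor + 1:]:
        # Either terminating condition ends the cluster, independently.
        if p.starts_division or any(i != image_id for i in p.image_ids):
            break
        cluster.append(p)
    return cluster
```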
[0025] In an exemplary embodiment, associating selected portions of text to a related image (140) may be performed. Associating (140) is accomplished by using a
software program to determine what portions of text are related to the associated one or more images. The software program useful for clustering text is a program capable of reading text in a TEI Lite format with XML mark-ups. Associating selected portions of the text to a related image (140) includes identifying the location of an XML tag or the unique identifying number, e.g., the image plate number, marking the location of an image within a cluster of text associated with that image (142). In the same or another embodiment, the identifying (142) may be followed by creating an electronic association (144) between the XML tag or the unique identifying number marking the location of the image within the cluster of text and an electronic version of the image. This electronic association (144) may allow for an electronic link, whereby an electronic version of the image is displayed along with the associated cluster of text on a displaying device.
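A minimal sketch of the associating step (140)-(144) follows: the identified image identifier is bound to an electronic version of the image so that a display device can present both together. The record type and file path below are hypothetical.

```python
# Sketch: bind a cluster of text to the electronic version of its image
# (steps 142-144), enabling a display-time link between the two.
from dataclasses import dataclass

@dataclass
class ImageAssociation:
    image_id: int        # unique identifying number from the XML tag
    description: str     # e.g. the figure caption
    image_file: str      # electronic version of the image
    paragraphs: list     # the associated cluster of text

link = ImageAssociation(
    image_id=0,
    description="2-1 White Temple and ziggurat, Uruk (modern Warka), Iraq",
    image_file="images/fig_2_1.jpg",   # hypothetical path
    paragraphs=["Like other Sumerian temples, ..."],
)
```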
[0026] According to one exemplary embodiment, an example of text would appear as follows after the clustering (130) and associating (140) of the text by the custom program:
ID: 0
DESC: 2-1 White Temple and ziggurat, Uruk (modern Warka), Iraq, ca. 3200-3000 B.C.
PARAGRAPHS: 19:
Like other Sumerian temples, the corners of the White Temple are oriented to the cardinal points of the compass. The building, probably dedicated to Anu, the sky god, is of modest proportions (sixty-one by sixteen feet). By design, it did not accommodate large throngs of worshipers but only a select few, the priests and perhaps the leading community members. The temple has several chambers. The central hall, or cella, was set aside for the divinity and housed a stepped altar. The Sumerians referred to their temples as “waiting rooms,” a reflection of their belief that the deity would descend from the heavens to appear before the
priests in the cella. How or if the Uruk temple was roofed is uncertain.
PARAGRAPHS:20:
The Sumerian idea that the gods reside above the world of humans is central to most of the world's religions. Moses ascended Mount Sinai to receive the Ten Commandments from the Hebrew God, and the Greeks placed the home of their gods and goddesses on Mount Olympus. The elevated placement of Mesopotamian temples on giant platforms reaching to the sky is consistent with this widespread religious concept. Eroded ziggurats still dominate most of the ruined cities of–
For convenience's sake, the XML tags are not shown in the above example; however, it is understood that such tags could be viewed. The "ID" here is the unique identifying number assigned to figure 2-1, the "White Temple and ziggurat, Uruk (modern Warka), Iraq, ca. 3200-3000 B.C." image, so labeled in the original text. Paragraphs 19 and 20 are paragraphs that have been clustered and associated with the image tagged "ID: 0". In this exemplary embodiment, the associated cluster of text is thus formatted and in condition for labeling as detailed below.
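For reference, a small sketch that renders an association in the plain-text layout shown above (ID / DESC / PARAGRAPHS), with the XML tags suppressed; the function name is an illustrative choice.

```python
# Sketch: format an associated cluster in the ID/DESC/PARAGRAPHS layout.
def render(image_id, desc, paragraphs):
    lines = [f"ID: {image_id}", f"DESC: {desc}"]
    for number, text in sorted(paragraphs.items()):
        lines.append(f"PARAGRAPHS: {number}:")
        lines.append(text)
    return "\n".join(lines)

print(render(0, "2-1 White Temple and ziggurat, Uruk (modern Warka), Iraq,"
                " ca. 3200-3000 B.C.",
             {19: "Like other Sumerian temples, ...",
              20: "The Sumerian idea that the gods reside above ..."}))
```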
[0027] In an exemplary embodiment, one or more associated clusters of text, or portions thereof, are labeled with one or more labels based on a set of one or more criteria (150). The label assigned to any given cluster of text, or portion thereof, may be chosen from among seven categories of labels described herein. Assigning one or more labels may include analyzing one or more associated clusters of text, or portions thereof, to identify the one or more associated clusters of text, or portions thereof, that could be appropriately labeled with one or more of the categories of labels described herein (152). In the same or another embodiment, the analyzed portions of text are assigned with the one or more appropriate labels (154). The assigning of the labels may involve placing XML tags on the TEI Lite formatted version of the text at appropriate locations. The assigning of labels may be accomplished by the use of a
custom program capable of reading and marking text in a TEI Lite format with XML mark-ups.
[0028] In the same or another embodiment, analyzing the clusters of text
(152) and assigning the analyzed portions of text with the labels (154) is performed by a software program using a machine learning algorithm. The machine learning algorithm may be a Naive Bayes algorithm. In the same or another embodiment, the machine learning algorithm may be a support vector machine algorithm. [0029] In an exemplary embodiment, the process of categorizing portions of text concerning an associated image may involve determining a text representation (160) for use by a machine learning algorithm. Once determined, such representation may be input into a machine learning algorithm to further refine the ability to determine what portions of text should be labeled with a particular label.
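The patent names Naive Bayes and support vector machines but no particular library; the following hedged sketch uses scikit-learn, with a bag-of-words vectorizer standing in for the text representation of step (160). The training examples and label names below are illustrative only.

```python
# Sketch of steps (152)-(160): represent clusters as bag-of-words
# vectors and label them with Naive Bayes or a support vector machine.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "the relief depicts the king kneeling before Shapur",        # Image Content
    "the temple was built around 3200 B.C. for the cult of Anu", # Historical Context
]
train_labels = ["Image Content", "Historical Context"]

nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(train_texts, train_labels)

svm = make_pipeline(CountVectorizer(), LinearSVC())
svm.fit(train_texts, train_labels)

print(nb.predict(["attendants lead in Valerian, who kneels before Shapur"]))
```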
[0030] Label Categories
[0031] Categories may be assigned to the portions of text. The categories may represent one of several aspects of the text. For example, the text may refer to features of the image associated with the text, such as the content of the image or the techniques used to create the image. Other categories relate to historical data about the subject of the image or of the image itself. Each portion of text may be associated with one or more categories. Although several exemplary categories of labels are described herein, it is understood that other, additional categories are within the scope of this disclosure, and may be selected to describe the subject matter of the text being categorized.
[0032] The first exemplary category of labels is an "Image Content" label. In an exemplary embodiment, the Image Content label might be assigned to one or more clusters of text, or portions thereof, where that text mentions the image and/or describes it in concrete terms. In the same or another embodiment, the Image Content label might be assigned where the text has an explicit mention of the unique image identifier, e.g., the image plate number or figure number. In the same or another embodiment, the Image Content label might be assigned where the text is primarily about one or more specific objects in the image. In the same or another embodiment, the Image Content label might not be applied where the text describes the general class of objects of which the object depicted is a member, rather than a description of the specific object depicted.
[0033] The second exemplary category of labels is a "Historical Context" label. In an exemplary embodiment, the Historical Context label might be assigned to one or more clusters of text, or portions thereof, where that text describes the historical context of the creation of the object depicted in the image. In the same or another embodiment, the Historical Context label might be assigned where the text describes when the object depicted was created or why it was created or under what circumstances it was created, e.g., whether it was commissioned or not. In the same or another embodiment, the Historical Context label might be assigned where the text mentions the broader art history facts about the period in which the object depicted was created.
[0034] The third exemplary category of labels is a "Biographical" label. In an exemplary embodiment, the Biographical label might be assigned to one or more clusters of text, or portions thereof, where the text provides biographical information about the artist whose artwork is depicted in the image. In the same or another embodiment, the Biographical label might be assigned where the text provides biographical information about the patron of the object, e.g., the person for whom the object depicted was commissioned. In the same or another embodiment, the Biographical label might be assigned where the text provides biographical information about any other people or personages involved in creating the object depicted or having a direct link to the object depicted after it was created. [0035] The fourth exemplary category of labels is an "Implementation" label.
In an exemplary embodiment, the Implementation label might be assigned to one or more clusters of text, or portions thereof, where the text provides information about conventions, methods, use of particular materials or tools, or techniques that the artist implemented in the creation of the object depicted in the image. In the same or another embodiment, the Implementation label might be assigned where the text includes a discussion about any other requirements and/or limitations employed by the artist.
[0036] The fifth exemplary category of labels is a "Comparison" label. In an exemplary embodiment, the Comparison label might be assigned to one or more clusters of text, or portions thereof, where the text discusses the object depicted in the image in reference to one or more other works. In the same or another embodiment, the reference to one or more other works may involve a comparison of one or more
features of the work depicted in the image and the work referenced, e.g., a comparison of the imagery of the two works or a comparison of the techniques employed to create the two works. In the same or another embodiment, the Comparison label may not be applied where there is a cross reference to another work without some kind of comparison of the depicted work with the referenced work. [0037] The sixth exemplary category of labels is an "Interpretation" label. In an exemplary embodiment, the Interpretation label might be assigned to one or more clusters of text, or portions thereof, where the author of the text provides his or her opinions about the interpretation of the object depicted in the image. In the same or another embodiment, the Interpretation label may be applied to text identified by emotion-bearing vocabulary, sentiment terms, or opinion terms. [0038] The seventh exemplary category of labels is a "Significance" label. In an exemplary embodiment, the Significance label might be assigned to one or more clusters of text, or portions thereof, where the text describes the significance of the object depicted in the image in art history terms. In the same or another embodiment, the Significance label may be applied to text identified by the use of superlative phrases, e.g., "this is the most . . .", "this is quintessential . . .", and so on. In the same or another embodiment, the Significance label may be applied to text identified by the use of superlative morphology, e.g., the use of "-est" words, such as "best", "finest", and so on.
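The surface cues mentioned for the "Interpretation" and "Significance" labels lend themselves to a simple rule-based pass; a sketch follows, with the phrase patterns and sentiment word list as illustrative assumptions (a real system would use fuller lexicons).

```python
# Sketch: detect the superlative and sentiment cues described above.
import re

SUPERLATIVE_PHRASE = re.compile(r"this is (the most|quintessential)", re.I)
EST_MORPHOLOGY = re.compile(r"\b\w+est\b", re.I)  # "-est" words: best,
# finest, greatest (crude; also overmatches e.g. "rest")
SENTIMENT_TERMS = {"striking", "ironic", "powerful", "masterful"}

def cue_labels(text):
    labels = set()
    if SUPERLATIVE_PHRASE.search(text) or EST_MORPHOLOGY.search(text):
        labels.add("Significance")
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & SENTIMENT_TERMS:
        labels.add("Interpretation")
    return labels

print(cue_labels("This is the finest example of the Amarna style."))
# {'Significance'}
```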
[0039] An example of the labeling in an exemplary embodiment in accordance with the present invention follows. This example is used with reference to a portion of text with an associated image about the Egyptian king Akhenaten. The cluster of text associated with the image reads, in part, as follows:
Of the great projects built by Akhenaten hardly anything remains . . . . Through his choice of masters, he fostered a new style. Known as the Amarna style, it can be seen at its best in a sunk relief portrait of Akhenaten and his family (fig. 2-27). The intimate domestic scene suggests that the relief was meant to serve as a shrine in a private household. The life-giving rays of the sun help to unify the composition and . . . .
[0040] The above text would be associated with the picture labeled as "fig. 2-
27", a relief portrait of Akhenaten and his family. A subsequent paragraph (not
shown here) compares the relief to another work by the artist, meaning that the XML tags from both images would point to that paragraph.
[0041] The above quoted portion of text is then further divided into smaller sections, each of which is given an appropriate categorical label from among the seven detailed above. Thus, the first sentence, "Of the great projects built by Akhenaten hardly anything remains . . . . Through his choice of masters, he fostered a new style", would be given the "Historical Context" label. The first portion of the second sentence, "Known as the Amarna style, it can be seen at its best in", would be given the "Implementation" label. The second portion of the second sentence and the first portion of the third sentence, "a sunk relief portrait of Akhenaten and his family (fig. 2-27). The intimate domestic scene", would be given the "Image Content" label. The second portion of the third sentence, "suggests that the relief was meant to serve as a shrine in a private household", would be given the "Historical Context" label. Finally, the last portion of the text, "The life-giving rays of the sun help to unify the composition and . . .", would be given the "Image Content" label.

[0042] Figure 2 is an exemplary embodiment of a system 200 for categorizing portions of text concerning an associated image. A database 210 may contain one or more texts in electronic format. In the same or another embodiment, the database 210 may also contain one or more images in electronic format. A processor 220 may be operatively coupled to a computer readable storage medium, such as memory 230. In an exemplary embodiment, the processor 220 may be the processor of a personal computer operating with memory, such as a standard desktop computer, and the database 210 may be stored on a hard drive.
[0043] In an exemplary embodiment, the memory 230 may be storing program instructions that, when executed by the processor 220, cause the processor 220 to cluster the text into clusters based on a first set of one or more criteria. In the same or another embodiment, the stored program instructions may be a custom software program used to determine what portions of text are related to the associated one or more images. The custom software program useful for clustering text may be a program capable of reading text in a TEI Lite format with XML mark-ups.

[0044] In the same or another embodiment, the program instructions stored in the memory 230 may be capable of clustering the scanned text by choosing one or more paragraphs for inclusion in the cluster based on a set of one or more criteria, when executed by the processor 220. For example, in one embodiment, the paragraph which includes the XML tag for the image is always included. In the same or another embodiment, the immediately preceding paragraph is always included. Furthermore, subsequent paragraphs may be included based on one or more criteria. For example, a criterion might consist of including each subsequent paragraph until a certain condition is met. Such a condition might be that a subsequent paragraph is tagged with an XML tag indicating a new division, e.g., a tag for a new subsection. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster. A second condition might be that a subsequent paragraph contains the XML tag for a different image. In such a case, all paragraphs up to but not including that paragraph would be included in the cluster. These two conditions are examples of any number of possible conditions that might operate to terminate the inclusion of subsequent paragraphs. Furthermore, these two conditions may be applied independently, such that satisfaction of either, or both, will operate to terminate the inclusion of subsequent paragraphs in the cluster.

[0045] In the same or another embodiment, the memory 230 may be storing program instructions that, when executed by the processor 220, cause the processor 220 to associate the clustered text with one or more images. In the same or another embodiment, the stored program instructions may be a custom software program used to determine what portions of text are related to the associated one or more images. The custom software program useful for associating text may be a program capable of reading text in a TEI Lite format with XML mark-ups.
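A minimal sketch of the clustering rules described in paragraph [0044] follows, under stated assumptions: paragraphs are TEI `<p>` elements in reading order, figure tags carry an `n` attribute, and the new-division test is a stand-in. None of these names is prescribed by the invention.

```python
import xml.etree.ElementTree as ET

def cluster_for_image(paragraphs, image_id):
    """Collect the cluster of <p> elements for the image `image_id`."""
    def figure_ids(p):
        # Figure tags marking image locations within a paragraph.
        return {fig.get("n") for fig in p.iter("figure")}

    def starts_new_division(p):
        # Stand-in test for "first paragraph of a new subsection"; a real
        # implementation would track each paragraph's parent <div>.
        return p.get("rend") == "new-div"

    # Paragraph containing the image's XML tag (raises StopIteration if absent).
    anchor = next(i for i, p in enumerate(paragraphs) if image_id in figure_ids(p))

    cluster = []
    if anchor > 0:
        cluster.append(paragraphs[anchor - 1])  # immediately preceding paragraph
    cluster.append(paragraphs[anchor])          # paragraph with the image tag
    for p in paragraphs[anchor + 1:]:           # subsequent paragraphs, until
        if starts_new_division(p):              # either terminating condition
            break
        if figure_ids(p) - {image_id}:          # tag for a different image
            break
        cluster.append(p)
    return cluster

# Hypothetical usage with a scanned, TEI-encoded chapter:
paragraphs = list(ET.parse("chapter.xml").iter("p"))
cluster = cluster_for_image(paragraphs, "fig. 2-27")
```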
[0046] In the same or another embodiment, the program instructions stored in the memory 230 may be capable of associating selected portions of text to a related image by identifying the location of the XML tag or the unique identifying number, e.g., the image plate number, marking the location of an image within a cluster of text associated with that image, when executed by the processor 220. The program instructions stored in the memory 230 may further be capable of creating an electronic association between the XML tag or the unique identifying number marking the location of the image within the cluster of text and an electronic version of the image, when executed by the processor 220. This electronic association may allow for an electronic link, whereby an electronic version of the image is displayed along with the associated cluster of text on a displaying device.
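Continuing the sketch above, the association step might look as follows; the record layout and the catalog mapping plate numbers to image files are hypothetical.

```python
def associate(cluster, image_id, image_catalog):
    """Create an electronic association between a cluster and its image.

    `image_catalog` maps figure/plate identifiers to electronic image
    files, e.g. {"fig. 2-27": "images/akhenaten_relief.jpg"} (invented names).
    """
    return {
        "image_id": image_id,
        "image_file": image_catalog[image_id],  # electronic version of the image
        "paragraphs": ["".join(p.itertext()) for p in cluster],
    }
```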
[0047] In the same or another embodiment, the memory 230 may be storing program instructions that, when executed by the processor 220, cause the processor 220 to assign one or more labels to the associated text based on a second set of one or more criteria. In the same or another embodiment, the stored program instructions may be a custom software program used to determine what portions of associated text are relevant to the one or more categories of labels and then assign the appropriate label to those portions. The custom software program useful for labeling text may be a program capable of reading and marking text in a TEI Lite format with XML mark-ups.
[0048] In one exemplary embodiment, the label assigned to any given cluster of text, or portion thereof, may be chosen from among the seven categories of labels described herein. The program instructions stored in the memory 230 may be capable of assigning one or more labels by analyzing one or more associated clusters of text, or portions thereof, to identify the one or more associated clusters of text, or portions thereof, that could be appropriately labeled with one or more of the categories of labels described herein, when executed by the processor 220. The program instructions stored in the memory 230 may further be capable of assigning the one or more appropriate labels to the analyzed portions of text, when executed by the processor 220.
[0049] In the same or another embodiment, the program instructions stored in the memory 230 may be capable of analyzing the text and assigning the appropriate label to the text using a machine learning algorithm, when executed by the processor 220. Furthermore, the machine learning algorithm may be a Naive Bayes algorithm. The machine learning algorithm may also be a support vector machine algorithm.

[0050] In an exemplary embodiment, the program instructions stored in the memory 230 may be capable of determining a text representation for use by a machine learning algorithm, when executed by the processor 220. Once determined, the program instructions stored in the memory 230 may be capable of inputting such representation into a machine learning algorithm to further refine the ability to determine what portions of text should be labeled with a particular label, when executed by the processor 220.
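One plausible instantiation of this labeling step, assuming a bag-of-words text representation and the scikit-learn library, is sketched below. The training segments and labels are invented for illustration; a real system would train on many hand-labeled segments drawn from the clustered text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data standing in for a large hand-labeled corpus.
segments = [
    "Of the great projects built by Akhenaten hardly anything remains",
    "a sunk relief portrait of Akhenaten and his family",
    "this is the finest example of the Amarna style",
]
labels = ["Historical Context", "Image Content", "Significance"]

# Bag-of-words representation feeding a Naive Bayes classifier; swapping
# MultinomialNB() for sklearn.svm.LinearSVC() gives the SVM variant.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(segments, labels)

print(model.predict(["the life-giving rays of the sun unify the composition"]))
```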
[0051] The output of the algorithm includes a database created by the processes described herein, containing, e.g., the text portions and their assigned categories. Such output may be stored in a text file, output to a monitor or a printout, etc. The output may be in an XML mark-up format, such as that shown above. Alternatively, the output may be in a format with the XML marking hidden, also as shown above.
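As one example of emitting such output, the sketch below writes labeled segments as simple XML; the element names and record layout are assumptions for illustration, not a format required by the invention.

```python
import xml.etree.ElementTree as ET

def write_output(records, path):
    """Serialize labeled text portions to an XML output file.

    Each record is assumed to look like:
    {"label": "Image Content", "image_id": "fig. 2-27", "text": "..."}
    """
    root = ET.Element("categorized")
    for r in records:
        seg = ET.SubElement(root, "seg", label=r["label"], image=r["image_id"])
        seg.text = r["text"]
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```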
[0052] It will be understood that the foregoing is only illustrative of the principles of the invention, and that various modifications can be made by those skilled in the art without departing from the scope and spirit of the invention. For example, the systems and methods described herein have been described in connection with text relating to image data; it is understood that the techniques described herein are useful in connection with any text that relates to any subject capable of categorization. Moreover, features of embodiments described herein may be combined and/or rearranged to create new embodiments.
Claims
1. A method for categorizing portions of text concerning an associated image, comprising:
clustering said text into clusters based on a first set of one or more criteria;
associating said clustered text with one or more images; and
assigning one or more labels to said associated text based on a second set of one or more criteria.
2. The method of claim 1, further comprising analyzing said text to identify a second set of one or more criteria relevant to said associated text.
3. The method of claim 1, further comprising electronically scanning said text into a database.
4. The method of claim 3, further comprising converting said scanned text into a document encoding format.
5. The method of claim 1, further comprising searching a database for an electronic version of said text.
6. The method of claim 5, further comprising accessing said database through the Internet.
7. The method of claim 1, further comprising searching a database for an electronic version of said one or more images associated with said text.
8. The method of claim 7, further comprising accessing said database through the Internet.
9. The method of claim 1, further comprising determining a text representation for use by a machine learning algorithm.
10. The method of claim 1, wherein assigning one or more labels to said associated text comprises using a machine learning algorithm.
11. The method of claim 10, wherein the machine learning algorithm comprises a Naive Bayes algorithm.
12. The method of claim 10, wherein the machine learning algorithm comprises a support vector machine algorithm.
13. A computer program product for categorizing portions of text concerning an associated image, the computer program product being embodied in a computer readable medium and comprising computer instructions for:
clustering said text into clusters based on a first set of one or more criteria;
associating said clustered text with one or more images; and
assigning one or more labels to said associated text based on a second set of one or more criteria.
14. The computer program product of claim 13, wherein the computer readable medium comprises instructions for analyzing said text to identify said second set of one or more criteria relevant to said text.
15. The computer program product of claim 13, wherein the computer readable medium comprises instructions for determining a text representation for use by a machine learning algorithm.
16. The computer program product of claim 13, further comprising a database comprising one or more images associated with said text.
17. The computer program product of claim 13, wherein associating said clustered text comprises using a machine learning algorithm.
18. The computer program product of claim 17, wherein associating said clustered text comprises using a Naive Bayes algorithm.
19. The computer program product of claim 17, wherein associating said clustered text comprises using a support vector machine algorithm.
20. A computer database product for categorizing portions of text concerning an associated image, the computer database product being embodied in a computer readable medium and created by:
clustering said text into clusters based on a first set of one or more criteria;
associating said clustered text with one or more images; and
assigning one or more labels to said associated text based on a second set of one or more criteria.
21. The computer database product of claim 20, wherein the computer database product is created by analyzing said text to identify said second set of one or more criteria relevant to said text.
22. The computer database product of claim 20, wherein the computer database product is created by determining a text representation for use by a machine learning algorithm.
23. The computer database product of claim 20, wherein associating said clustered text comprises using a machine learning algorithm.
24. The computer database product of claim 23, wherein associating said clustered text comprises using a Naive Bayes algorithm.
25. The computer database product of claim 23, wherein associating said clustered text comprises using a support vector machine algorithm.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US87072606P | 2006-12-19 | 2006-12-19 | |
US60/870,726 | 2006-12-19 | | |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2008077126A2 (en) | 2008-06-26 |
WO2008077126A3 (en) | 2008-09-04 |
Family
ID=39537077
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2007/088207 (WO2008077126A2) | Method for categorizing portions of text | 2006-12-19 | 2007-12-19 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008077126A2 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040193596A1 (en) * | 2003-02-21 | 2004-09-30 | Rudy Defelice | Multiparameter indexing and searching for documents |
US20060080314A1 (en) * | 2001-08-13 | 2006-04-13 | Xerox Corporation | System with user directed enrichment and import/export control |
US20060245641A1 (en) * | 2005-04-29 | 2006-11-02 | Microsoft Corporation | Extracting data from semi-structured information utilizing a discriminative context free grammar |
US20060251339A1 (en) * | 2005-05-09 | 2006-11-09 | Gokturk Salih B | System and method for enabling the use of captured images through recognition |
Also Published As
Publication number | Publication date |
---|---|
WO2008077126A3 (en) | 2008-09-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07869558; Country of ref document: EP; Kind code of ref document: A2 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 07869558; Country of ref document: EP; Kind code of ref document: A2 |