
WO1999023584A2 - Information component management system - Google Patents

Information component management system

Info

Publication number
WO1999023584A2
WO1999023584A2 PCT/US1998/023193 US9823193W WO9923584A2 WO 1999023584 A2 WO1999023584 A2 WO 1999023584A2 US 9823193 W US9823193 W US 9823193W WO 9923584 A2 WO9923584 A2 WO 9923584A2
Authority
WO
WIPO (PCT)
Prior art keywords
information
information component
document
xml
component
Prior art date
Application number
PCT/US1998/023193
Other languages
French (fr)
Other versions
WO1999023584A3 (en
Inventor
Yonatan Pesach Stern
Original Assignee
Iota Industries Ltd.
Friedman, Mark, M.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US08/962,117 external-priority patent/US6161107A/en
Application filed by Iota Industries Ltd., Friedman, Mark, M. filed Critical Iota Industries Ltd.
Priority to AU13715/99A priority Critical patent/AU1371599A/en
Publication of WO1999023584A2 publication Critical patent/WO1999023584A2/en
Publication of WO1999023584A3 publication Critical patent/WO1999023584A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data

Definitions

  • the present invention relates to an information component management system. Specifically, the system of the present invention enables documents, images and other types of information to be packaged within an active information component object, which can then be stored, retrieved and manipulated according to content rather than according to form.
  • Documents can be defined as a collection of ideas and information, which are organized within a certain structure
  • the ideas and information may be logically linked according to various relationships, but as a whole should follow a common theme
  • the collection itself is expressed as a combination of text and graphic items.
  • ideas can be expressed with words or graphics
  • Data can be in the form of numbers, symbols, graphics or even sounds
  • the final element, structure, is an important element of a document, yet it is often overlooked as a separate entity
  • the structure of a document is the way in which the data and ideas are organized within the document, thereby providing additional significance to these data and ideas.
  • the first category is a document management system
  • This system was originally designed to enable searches for information according to specific keywords within defined database fields.
  • this underlying system design has many disadvantages. For example, the types of searches are limited by the structure of the database itself.
  • information must be extracted from the document and entered into the database manually, which is time consuming, expensive and prone to human error
  • structured management systems have significant drawbacks for document management.
  • non-structured text retrieval systems solve certain problems but also create new difficulties. These systems enable automatic indexing of information, without the need for human intervention. However, in non-structured retrieval systems, only the free text of the document is automatically indexed. Therefore, only free text from the document can be searched. Although free text is an important component of a document, such a system loses the other types of available information. Furthermore, the context of ideas or concepts within a document is largely lost by the automatic indexing procedure, leaving the user with a collection of disconnected textual segments or documents which are divorced from the general theme expressed by the entire document. Thus, the user must often read an entire document or a collection of search results in order to find the desired information
  • the information component management system of the present invention enables documents, images and other types of information to be packaged within an active information component object, which can then be stored, retrieved and manipulated according to content rather than according to form
  • the information component includes concepts or ideas, data and structure as separate but related entities.
  • an information component system for storing an original document comprising: (a) at least one information component for storing information from the original document, the at least one information component featuring at least one information component primitive; (b) an information component identifier for classifying the at least one information component according to at least one information component class; and (c) at least one property of the at least one information component.
  • a system for displaying a native file format document, the document including text and having a native file format and a native document appearance, the native file format including at least one instruction for displaying the text of the native file format document
  • the system comprising: (a) a Web browser for displaying the native file format document according to the native document appearance; and (b) an HTML rendering engine for obtaining information regarding the native document appearance of the native file format document, for translating the information into a raster file having a raster format displayable by the Web browser, and for giving the translated information to the Web browser, such that the Web browser is able to display the native file format document.
  • a method for managing information comprising the steps of: (a) capturing the information in an electronic format; (b) converting the captured information into an information component, the information component featuring: (i) a pointer to a storage location of the captured information; (ii) at least one method for manipulating the captured information; and (iii) at least one property of the captured information; (c) storing the information component; and (d) displaying the information component such that the captured information appears in substantially the original format.
  • an information component comprising a software object including: (a) a pointer to a storage location of the stored original information; (b) at least one method for manipulating the stored original information; and (c) at least one property of the stored original information.
  • a server for serving stored information to a client Web browser, the server comprising: (a) a database for storing the stored information; and (b) an image processor for accessing the stored information from the database and transforming the stored information into a Searchable Image Format (SIF) file, the SIF file being accessed by the client Web browser, such that the stored information is displayed by the client Web browser.
  • SIF Searchable Image Format
  • computing platform refers to a particular computer hardware system or to a particular software operating system.
  • hardware systems include, but are not limited to, personal computers (PC), Macintosh™ computers, mainframes, minicomputers and workstations.
  • software operating systems include, but are not limited to, UNIX.
  • the term "software object” includes any software application capable of substantially independent execution by an operating system
  • a software application, whether a software object or substantially any other type of software application, could be written in substantially any suitable programming language, which could easily be selected by one of ordinary skill in the art
  • suitable programming languages include, but are not limited to, C, C++ and Java
  • Web browser refers to any software program which can be used to view a document written at least partially with at least one instruction taken from HTML (HyperText Mark-up Language) or VRML (Virtual Reality Modeling Language), or any other equivalent computer document language, hereinafter collectively and generally referred to as "document mark-up language"
  • Web browsers include, but are not limited to, Mosaic™, Netscape Navigator™ and Microsoft™ Internet Explorer™
  • raster format refers to any image format supported by Web browsers including, but not limited to, GIF (Graphics Interchange Format), JPEG (Joint Photographic Experts Group) and PNG (Portable Network Graphics)
  • FIGS. 1A-1C are schematic block diagrams of various exemplary information components and classes
  • FIGS. 2A and 2B are schematic block diagrams of the general architecture of an exemplary system of the present invention
  • FIGS. 3A and 3B are schematic block diagrams of a preferred embodiment of the IC
  • FIG. 4 is a schematic block diagram of an exemplary IC component generator of the present invention
  • FIG. 5 is a schematic block diagram of a preferred embodiment of the IC Server of the present invention
  • FIG. 6 is a schematic block diagram of a preferred embodiment of the IC Client of the present invention
  • FIG. 7 is a schematic block diagram of a preferred embodiment of the system of the present invention as implemented in XML
  • FIG. 8 is a schematic block diagram of dynamic link management according to the present invention.
  • FIG. 9 is a schematic block diagram of DTD normalization according to the present invention.
  • FIG. 10 is a schematic block diagram of an exemplary system for HTML rendering according to the present invention.
  • FIG. 11 shows an exemplary output of the system of Figure 10
  • FIG. 12 shows an exemplary embodiment of the IC Server of the present invention as implemented with Java Bean objects
  • FIG. 13 shows an exemplary embodiment of the IC Client of the present invention as implemented with Java Bean objects.
  • the information component management system of the present invention enables documents, images and other types of structured or non-structured information to be analyzed
  • the underlying structure of the information is determined, and the structure is exposed to a database.
  • the information is then packaged within an active information component object, which can then be stored, retrieved and manipulated according to content rather than according to form.
  • the information component includes concepts or ideas, data and structure as separate but related entities.
  • Information components are linked to each other according to a particular relationship, which may be either parallel or hierarchical.
  • an image of a face of a person is an information component which may in turn be a portion of a larger object, such as a group photo, which may in turn be a portion of an article.
  • the image of the face, the group photo and the article are all individual information components which are linked according to a hierarchical structure.
  • Each information component inherits the features of all associated information components which are higher in the hierarchical structure, and in turn contributes to the pool of features characterizing associated information components which are lower in the hierarchical structure
  • information components have both content related to the actual stored information, and content related to the features of associated, higher level components.
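
The hierarchical inheritance described above (face, group photo, article) can be sketched in a few lines of Python. This is only an illustration: the class name `InformationComponent`, the `effective_features` method, the sample feature values, and the rule that a component's own features override inherited ones are all assumptions, not details taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InformationComponent:
    """Toy information component: a name, its own features, and an optional parent."""
    name: str
    features: Dict[str, str] = field(default_factory=dict)
    parent: Optional["InformationComponent"] = None

    def effective_features(self) -> Dict[str, str]:
        # Collect everything inherited from higher-level components first,
        # then add this component's own features (own values win on a clash;
        # that precedence rule is an assumption of this sketch).
        inherited = self.parent.effective_features() if self.parent else {}
        return {**inherited, **self.features}


# Hypothetical hierarchy: article -> group photo -> face
article = InformationComponent("article", {"publication": "Daily News", "topic": "sports"})
group_photo = InformationComponent("group photo", {"caption": "Team lineup"}, parent=article)
face = InformationComponent("face", {"person": "goalkeeper"}, parent=group_photo)

print(face.effective_features())
```

The face component carries its own content ("person") plus the features accrued from the group photo and the article, while the article itself sees only its own features.
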
  • the actual stored information from an information component is displayed in substantially the same format as the original source format, so as to maintain the original appearance as much as possible
  • the displayed information maintains substantially the same fonts, graphics and structure, so that a newspaper page is displayed as a substantially exact reproduction of the page as it originally appeared in newsprint, for example
  • the system of the present invention has a clear advantage over prior art document management systems, which usually display retrieved information only as text. Even if graphic images are also displayed, the structure of the entire document, and the visual relationship between the text and the images, is not maintained by these prior art systems.
  • the information component management system of the present invention is able to search for, and retrieve, information based upon all characteristics of the information component, including graphic images, text and structural relationships. Results are presented as intuitive, visually explicit objects which are easy to examine, manipulate and navigate through. Furthermore, the search results are presented according to the ranked relevance to the desired search strategy, in which the rank is determined with both the full content and the complete characteristics of the information component.
  • the system of the present invention includes two basic principles: object oriented management and visual information retrieval. Both principles will be explained in greater detail below, in the Description of the Preferred Embodiments. Briefly, the information components are managed as objects which belong to an information class. Different information classes are linked according to the logical relationship between the components in each class. Overall, the classes are placed within a hierarchical structure, in which each child class inherits the properties of the parent class. Each information class defines the properties and operations of a set of information components.
  • each information component is a representation of information, combining structured and non-structured data.
  • the information component also features methods for accessing and manipulating the information, including the data interface and any data operations. Because the methods of the information component are exposed to the general computational environment, the component either can be displayed, or can display itself, on any type of computing platform or operating system. Thus, the information component is both compatible across different computing platforms and has an open, easily accessible interface
  • In order to prepare such an information component, several procedures must be performed. First, the information must be identified. Next, the information must be classified and the actual information component must be created. The relationship between the new information component and other information component(s) must be identified. Finally, the behavior of the completed information component is determined according to the functionality of the attributes or features which accrue to that component after classification and identification of relationships
  • the information component can be searched and retrieved through visual information search and retrieval. Briefly, the search can be performed according to keyword, visual example and graphic attributes. Visual examples include images or graphic objects which are compared to graphic information stored in the database, just as a keyword search involves the comparison of keywords to text stored in the database.
  • Graphic attributes include font size, font attribute and relative positioning of information within a document. These attributes can also be used as search parameters. Thus, the search is not limited to a simple keyword comparison of stored textual information.
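
As a rough illustration of combining a keyword with graphic attributes as search parameters, the following Python sketch filters a tiny in-memory index. The `TextItem` record, its fields, the `search` function and the sample data are hypothetical; the patent does not specify a data model or query interface, and visual-example matching is not attempted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextItem:
    """Hypothetical indexed text fragment with a few graphic attributes."""
    text: str
    font_size: int       # in points
    bold: bool           # a simple font attribute
    page_position: str   # relative position, e.g. "top", "middle", "bottom"


def search(items: List[TextItem], keyword: str,
           min_font_size: int = 0,
           position: Optional[str] = None) -> List[TextItem]:
    """Return items that contain the keyword and satisfy the graphic filters."""
    hits = []
    for item in items:
        if keyword.lower() not in item.text.lower():
            continue  # keyword comparison against the stored text
        if item.font_size < min_font_size:
            continue  # graphic attribute: font size
        if position is not None and item.page_position != position:
            continue  # graphic attribute: relative positioning
        hits.append(item)
    return hits


index = [
    TextItem("Election results announced", 36, True, "top"),
    TextItem("Full election coverage on page 4", 10, False, "bottom"),
    TextItem("Weather: rain expected", 12, False, "middle"),
]

headlines = search(index, "election", min_font_size=24, position="top")
print([item.text for item in headlines])
```

A plain keyword search for "election" matches two items; adding the font-size and position constraints narrows the result to the headline alone.
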
  • the system of the present invention includes a mechanism for learning the preferences and profile of an individual user, which can then also be used to calculate the relevance ranking of the retrieved information
  • the present invention is of an information component management system, in which information is packaged as an information component, including textual data, images and structure. Information components are related to each other according to a hierarchical organization, in which characteristics of components which are higher in the hierarchy accrue to those components which are lower in the hierarchy
  • Information components can be searched and retrieved according to all attributes of the actual information, as well as the characteristics of the component and relationships between components.
  • the information component management system of the present invention is not limited to simple storage, searches and retrieval of textual data only, but instead preserves all aspects of the original source of information.
  • the detailed description of the system of the present invention will be divided into four chapters.
  • the first chapter describes various background art technologies which are the preferred support technologies for the system of the present invention. These technologies are described as "background art" because they are not fulfilling the same functions as the system of the present invention, but instead are merely enabling these functions. These technologies are given as examples only and are not intended to be limiting in any way.
  • the second chapter provides a brief overall view of the entire system according to the present invention
  • the third chapter describes an exemplary implementation of the present invention with objects in an XML environment
  • the fourth chapter describes an exemplary implementation of the present invention with Java Bean objects in a CORBA environment.
  • the background art technologies which are described in this chapter are well known in the art.
  • the description provided herein is not intended to be exhaustive, but rather to describe those aspects of the background art technologies which are optionally implemented to support the management system of the present invention.
  • the preferred background art technologies which are described herein include XML, described in section 1; CORBA and a particular proprietary embodiment of CORBA, described in section 2; and the Java Bean component architecture, described in section 3. Section 1: XML
  • XML with ActiveX™ objects as the front end, such that the information components of the present invention are preferably accessed through XML
  • The acronym "XML" stands for "Extensible Markup Language"
  • XML is a document markup language which was designed to have greater functionality than HTML (hypertext markup language)
  • Documents written in HTML can, however, be converted into XML
  • a document written in XML is a collection of XML elements, which can be images or sections of text, for example
  • the document itself features elements which are indicated with "tags". These elements have logical values
  • Each element can also have "child elements", which are other elements to which a reference is made in that element, known as the "parent element"
  • This element structure is a hierarchical tree which enables complex elements to be composed of multiple simpler elements
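
The parent/child element tree described above can be inspected with Python's standard `xml.etree.ElementTree` module. The sample document and its tag names (`article`, `headline`, `photo`, `face`) are invented for illustration, as is the `outline` helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: a complex element composed of simpler child elements.
doc = """<article>
  <headline>Team wins final</headline>
  <photo>
    <face>goalkeeper</face>
  </photo>
</article>"""

root = ET.fromstring(doc)


def outline(element, depth=0):
    """Return one indented line per element, walking parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an Element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one parent/child step in the element hierarchy.
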
  • Documents written in XML optionally feature a DTD (Document Type Declaration), which is either included within the XML document, or alternatively is a separate but associated document
  • the DTD contains the rules according to which the XML document should be interpreted, such as the declarations for the structures of the elements within the XML document
  • the term "HTML" document will refer to a document written in HTML
  • the term "XML" document will refer to a document written in XML
  • the option of including the DTD both increases the flexibility of XML and enables documents written in XML to be validated, in order to ensure that these XML documents conform to the rules in the DTD
  • An additional useful feature of XML is the more powerful linking structure available
  • the links of XML are compatible with those of HTML
  • XML allows any type of element to act as a link
  • the start or the finish of an XML link does not need to be located within one of the documents which is being linked.
  • XML links could be located in a document which is entirely separate from the two linked documents. This enables two documents to be more easily linked after both documents have been created, without altering either document
  • XML links include simple and extended links.
  • a simple link is similar to the link of HTML, in that it is unidirectional and has only one locator.
  • an extended link can have more than one locator, such that the extended link can "point" to more than one target resource.
  • extended links can also be bidirectional or multidirectional. As noted previously, these extended links may be located in a separate file, external to the XML document, and can therefore be very difficult to manage. For example, when a linked file is deleted or otherwise removed, the extended link list is not amended, potentially leading to a broken link.
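
The broken-link problem with externally stored extended links can be illustrated with a small check that compares a link list against the files that still exist. This is a sketch only: the dictionary layout of `link_list`, the file names and the `find_broken_links` helper are assumptions, and real extended links carry more information than a list of target names.

```python
from typing import Dict, List, Set

# Hypothetical external link list: link name -> locators (targets) it points to.
link_list: Dict[str, List[str]] = {
    "related-coverage": ["story1.xml", "story2.xml"],  # extended link, two locators
    "photo-credits": ["photos.xml"],
}


def find_broken_links(links: Dict[str, List[str]],
                      existing_files: Set[str]) -> Dict[str, List[str]]:
    """Return {link name: [missing targets]} for every link with a missing target."""
    broken = {}
    for name, targets in links.items():
        missing = [target for target in targets if target not in existing_files]
        if missing:
            broken[name] = missing
    return broken


# story2.xml has been deleted, but the external link list was never amended.
files_on_server = {"story1.xml", "photos.xml"}
broken = find_broken_links(link_list, files_on_server)
print(broken)
```

Because the link list lives outside the linked documents, nothing forces it to be updated when a target disappears; a periodic check of this kind is one way such stale locators could be detected.
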
  • XML links can also point to target resources which are fragments of a document.
  • the boundaries of these fragments can be determined through either static "chunking" or dynamic "chunking".
  • In static chunking, the XML file is manually divided into pieces, with a new XML file for each piece which is then linked to the main XML file.
  • In dynamic chunking, the XML file is not divided into new files. Rather, separators are placed within the XML file to indicate boundaries for chunks. These separators can be used to define a portion of a document which is to be retrieved, such that the fragments are separated and served "on the fly".
  • dynamic chunking has the disadvantage of significantly increasing server overhead as the server determines which fragment is to be served, such that the XML server may become overloaded.
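
A minimal sketch of dynamic chunking follows, assuming the separators are `<chunk id="...">` wrapper elements. That separator syntax, the sample document and the `serve_fragment` function are inventions of this example; the text above does not say what form the separators take.

```python
import xml.etree.ElementTree as ET

# Hypothetical single XML file with chunk boundaries marked inside it.
DOCUMENT = """<document>
  <chunk id="intro"><para>Opening paragraph.</para></chunk>
  <chunk id="body"><para>Main text.</para><para>More text.</para></chunk>
</document>"""


def serve_fragment(xml_text: str, chunk_id: str) -> str:
    """Parse the whole file and return only the requested chunk's content."""
    root = ET.fromstring(xml_text)  # the full document is parsed on every request
    for chunk in root.iter("chunk"):
        if chunk.get("id") == chunk_id:
            # Serialize each child of the chunk and join them into one fragment.
            return "".join(
                ET.tostring(child, encoding="unicode").strip() for child in chunk
            )
    raise KeyError(chunk_id)


print(serve_fragment(DOCUMENT, "body"))
```

Note that this naive version re-parses the entire document for each request, which hints at why serving fragments "on the fly" adds server overhead compared with pre-split static chunks.
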
  • XML documents also have style sheets, which feature construction rules describing how each element should be displayed. For example, if the element is a paragraph of text, the construction rule may indicate font size and type, the extent of the indentation of the first line, spacing between lines of the paragraph and so forth
  • both the information components of the present invention and the management system for these components are compliant with the CORBA (Common Object Request Broker Architecture) standard, which is a standard for communication between distributed objects established by OMG (Object Management Group).
  • CORBA Common Object Request Broker Architecture
  • OMG Object Management Group
  • OMG is a consortium of over 700 different software developers.
  • standards developed by OMG are industry-wide and software applications compliant with these standards should be able to successfully interact with other compliant applications, as desc ⁇ bed below.
  • CORBA is a standard which provides a standard method for execution of program modules in a distributed environment, regardless of the computer programming language in which the modules are written, or the computing platform on which they are executed.
  • CORBA enables complex systems to be built, integrating many different types of computing platforms within an entire business, for example.
  • ORB Object Request Broker
  • Each application is an "object” with a particular interface through which communication is enabled.
  • ORB acts as the "middle-man", passing information and requests for service to each object as necessary.
  • ORB permits true distributed computing, since different objects do not need to be operated by the same computer or even reside on the same network.
  • the ORB directs any communication to the appropriate server which contains the object, which might be located on the same host, or a different host, as the client object.
  • the ORB redirects the results back to the client object.
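
The "middle-man" role of the ORB can be pictured with a toy, single-process broker. This is emphatically not a CORBA implementation and uses no CORBA library: `ToyBroker`, `Thesaurus`, the host label and the reply format are all hypothetical, and no network communication takes place.

```python
class ToyBroker:
    """Toy stand-in for an object request broker (not real CORBA)."""

    def __init__(self):
        self._registry = {}  # object name -> (host label, object)

    def register(self, name, host, obj):
        """Record where an object lives so that clients need not know."""
        self._registry[name] = (host, obj)

    def invoke(self, name, method, *args):
        host, obj = self._registry[name]        # locate the object
        result = getattr(obj, method)(*args)    # pass the request to it
        return {"served_by": host, "result": result}  # hand results back to the client


class Thesaurus:
    """Hypothetical server-side object with a single method."""

    def synonyms(self, word):
        return {"big": ["large", "huge"]}.get(word, [])


broker = ToyBroker()
broker.register("thesaurus", "host-b", Thesaurus())

# The client names the object and the method; the broker does the routing.
reply = broker.invoke("thesaurus", "synonyms", "big")
print(reply)
```

The client code never references the `Thesaurus` instance or its host directly, which mirrors the idea that the broker, not the client, decides where a request is delivered.
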
  • CORBA can also be described as an "object bus" because it is a communications interface through which objects are located and accessed.
  • CORBA provides IIOP (Internet Inter-ORB Protocol), which is the CORBA message protocol for communication on the Internet.
  • IIOP links GIOP (CORBA's General Inter-ORB Protocol) to TCP/IP, the general communication protocol of the Internet.
  • GIOP in turn specifies how one ORB communicates with another ORB.
  • one type of proprietary ORB can communicate with another, different type of proprietary ORB on a different host computer according to a combination of IIOP and GIOP protocols. Practically speaking, if IIOP is built into a Web browser such as Netscape™ Navigator™, a Java applet is downloaded into the Web browser when the user accesses a Web page with a CORBA-compatible object. The Java applet invokes the ORB to first pass data to the object, then to execute the object and finally retrieve the results. Further information on both CORBA and IIOP can be obtained from the "Tech Web Technology Encyclopedia" (http://www.techweb.com/encyclopedia as of September 10, 1997)
  • WRB Web Request Broker
  • Oracle Corp. Redwood Shores, California, USA
  • WRB is described in a white paper
  • M. Anand et al "The Web Request Broker a Framework for Distributed Web-based Applications", http://www.olab.com/www6_l/paper.html as of September 10, 1
  • Briefly, the WRB architecture includes the dispatcher, application and system cartridges, and a CORBA compliant ORB.
  • the dispatcher and cartridges use the ORB for communication between components, so that these components can be distributed on separate remote machines.
  • the dispatcher routes requests from the HTTP daemon to the appropriate cartridge.
  • the cartridges are software components which perform a specific function and are thus the "objects" described previously. Cartridges are used within the system of the present invention as an exemplary support for a number of different functions, as described in subsequent sections.
  • Cartridges have a name, composed of the IP address of the server where the cartridge is located, and the virtual path to the location of the cartridge on that server. Cartridges also have a standard interface, which includes a number of methods. Examples of such methods include the authenticate routine, which determines whether the client is entitled to requested services, and the exec routine, which receives the particular service request if the authentication routine is successfully performed
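
The cartridge interface described above (a name built from an IP address and a virtual path, an authenticate routine, and an exec routine that runs only after authentication succeeds) might be mocked up as follows. This does not use or reproduce the actual Web Request Broker API: the `Cartridge` class, the `dispatch` function, the way the name is concatenated, the allow-list check and the response strings are all assumptions of this sketch.

```python
class Cartridge:
    """Toy cartridge with a name plus authenticate and execute routines."""

    def __init__(self, ip_address, virtual_path, allowed_clients):
        # Name composed of server IP address and virtual path
        # (plain concatenation is an assumption of this sketch).
        self.name = ip_address + virtual_path
        self._allowed = set(allowed_clients)

    def authenticate(self, client_id):
        """Decide whether the client is entitled to the requested service."""
        return client_id in self._allowed

    def execute(self, request):
        """Stand-in for the exec routine: handle the service request."""
        return f"{self.name} handled {request!r}"


def dispatch(cartridge, client_id, request):
    """Route a request to the cartridge, calling execute only after authentication."""
    if not cartridge.authenticate(client_id):
        return "403 not authorized"
    return cartridge.execute(request)


search_cartridge = Cartridge("192.0.2.10", "/search", ["alice"])
print(dispatch(search_cartridge, "alice", "find face"))
print(dispatch(search_cartridge, "bob", "find face"))
```

The method is named `execute` rather than `exec` purely to avoid confusion with Python's built-in `exec` function.
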
  • the cartridge technology provides a fully developed basis for the creation of particular software functionality
  • proprietary cartridge technology for software development is that the system architecture provides a framework for interaction between different objects over the Internet by using HTTP Web servers and existing Web browsers
  • the CORBA protocols only define a standard, but do not provide any specific implementation
  • the proprietary cartridge technology enables one of ordinary skill in the art to develop a software application which can communicate with other applications over the World Wide Web
  • Java Bean is a component software architecture which operates in the Java programming environment.
  • Java is an interpreter-driven, object-oriented computer programming language which is substantially platform-independent.
  • Software packages which are written in Java can be operated by any operating system, or platform, which supports the Java interpreter.
  • a Java Bean component can run remotely and independently as a discrete software application object in a distributed computing environment using either the Remote Method Invocation protocol of Sun Computers Inc., or else by using CORBA.
  • information components are preferably packaged and then distributed as independent Java Bean components.
  • the Java Bean component software architecture is a set of API's (Application Programming Interfaces) and rules which enable software developers to define software components to be dynamically combined to create a software application.
  • the Java Bean component model has two major elements: components and containers
  • Components range in size and capability from small GUI (graphic user interface) widgets such as a button, to an applet-sized functionality such as a tabular viewer, and even to a full-sized application such as an HTML (HyperText Mark-up Language) viewer or the information component of the present invention
  • Components can have a visual aspect, such as a button, can actually be visual information or can be non-visual, such as a data-based monitoring component
  • Containers hold an assembly of related components.
  • Containers provide the context for components to be arranged and interact with each other. Containers are occasionally referred to as "forms", "pages", "frames" or "shells"
  • Containers can also be components, so that a container can be used as a component inside another container.
  • the Java Bean component model provides the following major types of services: component interface exposure and discovery; component properties, event handling; persistence; application builder support and component packaging.
  • Component interface exposure and discovery allows components to expose their interface so that they can be driven dynamically by calls and event notifications from other components or application scripts
  • Component properties are the public attributes of a component which either directly reflect or affect the current state of that component
  • properties could include the "foreground color" of a video clip, its zoom factor or its access rights. The state of these properties can be interrogated or modified through standard mechanisms.
  • Event handling is the mechanism for components to "raise" or "broadcast" events and have those events delivered to the appropriate component or components which need to be notified. Typically, notified components then perform a particular function in response. For example, if the user interface shows a document image clip on the monitor screen, the Parent Information Object event will communicate with the Object Server to transmit the full page of the clip, and will send a viewing command to the full-page viewer component.
  • Event handling allows information components to interact with each other
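
The raise-and-notify pattern behind event handling can be sketched without any Java Bean machinery. The names here (`EventSource`, `FullPageViewer`, the `"parent_requested"` event and the page identifier) are illustrative only and loosely echo the clip and full-page viewer example above; the actual Java Bean event API is not shown.

```python
class EventSource:
    """Minimal component that can raise named events to registered listeners."""

    def __init__(self):
        self._listeners = {}  # event name -> list of callbacks

    def add_listener(self, event_name, callback):
        self._listeners.setdefault(event_name, []).append(callback)

    def raise_event(self, event_name, payload):
        # Deliver the event to every component that asked to be notified.
        for callback in self._listeners.get(event_name, []):
            callback(payload)


class FullPageViewer:
    """Hypothetical viewer component that reacts to a viewing command."""

    def __init__(self):
        self.shown = []

    def show(self, page_id):
        self.shown.append(page_id)


clip = EventSource()
viewer = FullPageViewer()

# The viewer registers its interest in the clip's "parent requested" event.
clip.add_listener("parent_requested", viewer.show)

# The clip raises the event; the viewer responds by displaying the full page.
clip.raise_event("parent_requested", "page-7")
print(viewer.shown)
```

The clip component never calls the viewer directly; it only raises an event, and whichever components registered for that event perform their function in response.
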
  • Persistence is the mechanism for storing the state of a component in a non-volatile location. The component state is stored in the context of the container and in relation to other components. For example, if the user wants to save the viewing zoom factor for all of the following documents, the persistence mechanism would support this.
  • Application builder support interfaces enable components to expose their properties and behaviors to application builder development tools. Using these interfaces, the tools can determine the properties and behaviors, or events and methods, of arbitrary components.
  • the tools can provide mechanisms such as tool palettes, inspectors and editors, which the application developer uses to assemble an application. Through these mechanisms, the application developer can modify the state and appearance of components as well as establish relationships between components.
  • This mechanism enables sophisticated information applications such as HyperText links to be created.
  • the user can define a button which appears on the viewed document, and then links the document to a different document
  • the application developer will use property editors to specify the appearance, including size, color and label, of the button, the link type and the link target
  • Java Bean components can be dist ⁇ aded ana independently deployed over a network, there is a need to provide a facility to physically "package" the resources which are included in an information component so that they are accessible to the other Java Bean components.
  • packaging is performed with the JAR (Java Archive) file format
  • JAR file format enables the class file of the information component and other information component resources such as images, OMS (object mapping structure), sounds, and link information, to be packaged as a single physical entity for dist ⁇ bution.
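Because a JAR file is a ZIP archive with a manifest entry, the packaging step can be sketched with Python's standard `zipfile` module. The function name `package_component`, the entry names and the placeholder bytes are hypothetical; a real deployment would package an actual compiled class file and would normally use the Java `jar` tool.

```python
import io
import zipfile
from typing import Dict


def package_component(class_bytes: bytes, resources: Dict[str, bytes]) -> bytes:
    """Bundle a class file and its resources into one JAR-style archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as jar:
        # Minimal manifest, stored at the conventional JAR location.
        jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\r\n\r\n")
        # Placeholder entry name for the component's class file (assumption).
        jar.writestr("InfoComponent.class", class_bytes)
        for path, data in resources.items():
            jar.writestr(path, data)
    return buffer.getvalue()


archive = package_component(
    b"placeholder class bytes",
    {"images/logo.gif": b"GIF89a", "links/links.txt": b"page1 -> page2"},
)
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
```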
  • Section 1 provides a general description of information components.
  • Section 2 describes the system of the present invention.
  • Section 3 describes the information component content capturer in more detail.
  • Section 4 describes the information component identifier.
  • Section 5 describes the information component contributor.
  • Section 6 describes the information component server.
  • Section 7 describes the information component publisher and client.
  • Section 1 Information Component
  • Each information component has a number of different elements and properties.
  • Each information component belongs to an information class.
  • The information class defines the properties and operations of a group of information components.
  • Information classes can describe a newspaper, a general document or a video clip, for example.
  • Figures 1A-1C are illustrations of exemplary information components, each of which can be placed in different classes.
  • Figure 1A is a general description of an exemplary document 10, showing the hierarchical structure.
  • Document 10 is in turn subdivided into a number of page components 12, of which four are shown for the purposes of illustration only. Page component 12 is a member of the page class, which stores properties related to the structure of a page of document 10.
  • These properties include textual information, structural information and any links to other components.
  • The operations, or methods, include retrieving the textual information, for example. Thus, the operations are used to store, retrieve or modify information contained in the properties of components which are members of the page class.
  • Every information component which is a page component 12, and hence which belongs to the page class, may share certain repeated structural features. These features are also examples of information components, and are described as "shared information components". Every page component 12 can include these shared information components in order to maintain a uniform structure between pages, for example, and to decrease storage space for such repetitive features. As shown in Figure 1A, these shared information components for page component 12 include a footer component 14, a header component 16 and a logo component 18. These are intended as examples only and are not meant to be limiting in any way.
  • Header component 16 could be a title, such as "Document Report", or any other desired information.
  • Footer component 14 could feature a page number, a date, or any other desired information.
  • Logo component 18 would be the logo for the particular company which is producing document 10, for example.
  • Each page component 12 also includes one or more information components which are not shared at all, or which are only shared with certain other page components.
  • Page component 12 which is labeled "Page 1" includes a summary section component 20, which is a member of the summary class.
  • The summary class could feature text and/or images which summarize an earlier portion of document 10, for example.
  • Summary section component 20 is only included within page component 12.
  • The "Page 1" page component 12 also features a "Chapter 1" component 22, which is shared by the "Page 2" and "Page 3" page components 12.
  • The "Page 4" page component 12 features a "Chapter 3" component 24. Summary section component 20, "Chapter 1" component 22, and "Chapter 3" component 24 are all further subdivided into a plurality of paragraph components 26, which are members of the paragraph class.
  • Paragraph components 26 contain the information related to each paragraph, which may include text, for example.
  • The text in each paragraph component 26 is contained within a text component 28 as shown, which belongs to the text class.
  • Paragraph component 26 can optionally include an image component 30 and a table component 32 as shown.
  • Table component 32 stores a table and belongs to the table class.
  • Text component 28, image component 30 and table component 32 are examples of information component primitives.
  • An information component primitive is the most basic unit of information components, such that the primitive is no longer divisible into information components which are lower in the hierarchical structure.
  • Information component primitives are preferably potentially able to be shared between information components.
  • A table of data, which is an example of table component 32, could be included both in summary section component 20 and "Chapter 1" component 22.
  • Shared information component primitives also only need to be stored once in order to be available to other information components.
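The hierarchy of Figure 1A, including a primitive that is stored once but referenced from two parents, can be sketched as a simple tree. The `Component` dataclass and the rule that a node without children counts as a primitive are simplifying assumptions for illustration, not the structure used by the described system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    info_class: str                      # e.g. "page", "paragraph", "table"
    label: str
    children: List["Component"] = field(default_factory=list)

    def is_primitive(self) -> bool:
        # Assumption: a primitive is a node that is not subdivided further.
        return not self.children

    def primitives(self) -> List["Component"]:
        """Collect the primitives reachable from this component, in order."""
        if self.is_primitive():
            return [self]
        found: List["Component"] = []
        for child in self.children:
            found.extend(child.primitives())
        return found


# One table primitive, stored once and shared by two parents.
table = Component("table", "table 32")
summary = Component("summary", "summary section 20", [
    Component("paragraph", "p1", [Component("text", "t1"), table]),
])
chapter1 = Component("chapter", "Chapter 1", [
    Component("paragraph", "p2", [Component("text", "t2"), table]),
])
page1 = Component("page", "Page 1", [summary, chapter1])

referenced = page1.primitives()                 # the table is referenced twice
stored = {id(p) for p in referenced}            # but exists as a single object
```

Sharing by reference is what lets a shared primitive be stored once while remaining available to every component that includes it.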
  • Figures 1B and 1C show portions of certain specific examples of information components, shown in terms of an exemplary class structure, it being understood that this is for the purposes of description only and is not meant to be limiting in any way.
  • A newspaper information component belongs to a newspaper class 34, which defines the properties and operations of components which contain newspaper pages.
  • Newspaper class 34 has an article class 36 for an individual newspaper article. Article class 36 inherits the properties of the parent class, newspaper class 34.
  • Article class 36 may have additional properties and methods, such as the coordinates of the location of the article within the newspaper page, or an operation for retrieving the name of the author of the article.
  • A column class 38 is shown for a column, while an image class 40 is also shown for a picture.
  • Image class 40 might have information about pictures which are associated with the article.
  • Column class 38 might contain information about the structure of the column which contains the article.
  • Column class 38 and image class 40 are related to article class 36 according to a defined set of relationships.
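The inheritance relationship between newspaper class 34 and article class 36 can be sketched with ordinary class inheritance. The attribute names (`publication`, `page_number`, `bounds`) and the method `get_author` are hypothetical; the text only states that the article class inherits the parent's properties and may add its own, such as the article coordinates and an author lookup.

```python
from typing import Tuple


class NewspaperClass:
    """Properties shared by components that contain newspaper pages."""

    def __init__(self, publication: str, page_number: int) -> None:
        self.publication = publication
        self.page_number = page_number


class ArticleClass(NewspaperClass):
    """Inherits the newspaper properties and adds article-specific ones."""

    def __init__(self, publication: str, page_number: int,
                 author: str, bounds: Tuple[int, int, int, int]) -> None:
        super().__init__(publication, page_number)
        self.author = author
        self.bounds = bounds      # location of the article within the page

    def get_author(self) -> str:
        """Operation for retrieving the name of the author of the article."""
        return self.author


article = ArticleClass("Daily Example", 3, "A. Writer", (40, 120, 300, 480))
```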
  • Figure 1C shows an exemplary video clip information class 42 which contains information such as data and structure for a segment of recorded video.
  • A video stream information class 44 is the highest level class for the hierarchy.
  • a video clip information class 46 is next in the hierarchy, followed by a frame class 48.
  • Frame class 48 might contain only information regarding a single frame of the video. Thus, even though a video may be considered as a sequential collection of images which give the illusion of movement, it too can be broken down into smaller elements which are then stored in the above-mentioned information classes.
  • Figures 2A and 2B show the general architecture of the system of the present invention.
  • A general system architecture 50 includes IC Contributor 60, IC Server 62, IC Search Engine 63, and IC Publisher 65.
  • IC Contributor 60 further features IC (Information Component) Content Capturer 52, IC Knowledge Base 54, IC Rules Editor 56, and IC Identifier 58.
  • IC Content Capturer 52 is responsible for the acquisition and conversion of information content, and for the transmission of the converted information content to IC Identifier 58.
  • IC Identifier 58 then identifies information components according to certain rules and to class information stored in IC Knowledge Base 54. Both the rules and the class information can be added, removed or otherwise altered with IC Rules Editor 56.
  • The original document and the identified information components are transmitted from IC Contributor 60 to IC Server 62.
  • IC Server 62 then stores and manages the actual or "original" information such as documents, multimedia objects and other types of information entities, as well as managing the information components themselves.
  • Information components are made available from IC Server 62 by a request through IC Search Engine 63, and are then published by IC Publisher 65.
  • The general system of the present invention collects the information from a variety of sources, packages the information into information components, and then stores the components for later retrieval by a client application.
  • Section 3 Information Component Content Capturer
  • This section describes the IC (information component) Content Capturer, which is shown in Figure 2B, as part of IC Contributor 60.
  • IC Content Capturer 52 preferably operates as memory resident software and captures the desired information content from a variety of software systems including, but not limited to, a document editor 64 such as the Word product of Microsoft™, a media application 66 including, but not limited to, the Adobe™ Acrobat™ reader for reading PDF files from Adobe™ Acrobat™, a facsimile machine software application 68 for operating a facsimile machine, and a Web browser software application 70 such as Netscape™ Navigator™. Additional software systems from which information content can be captured include imaging software and spreadsheet software. These software systems are intended as illustrative examples only, since substantially any software system which handles, stores, retrieves or manipulates information could have that information captured by IC Content Capturer 52.
  • IC Content Capturer 52 invokes the appropriate software drivers for handling different information formats from the above software systems.
  • Information could be captured from a document stored in the format of Microsoft™ Word™ word processing software.
  • A number of possible methods could be used to capture the information contained within the document, two illustrative examples of which are given here, it being understood that these are for discussion purposes only and are not meant to be limiting.
  • IC Content Capturer 52 interacts with Microsoft™ Word™ and instructs Microsoft™ Word™ to place the document on the "clipboard".
  • The "clipboard" is a feature of a number of different computer operating systems, in particular those operating systems of Microsoft Inc.
  • The term "clipboard" refers to any feature of a computer operating system which enables information to be exchanged between two software applications.
  • IC Content Capturer 52 captures the necessary information about the document through substantially direct interaction with the software system, such as Microsoft™ Word™. Such interaction can be performed according to a number of different methods. For example, Microsoft™ Word™ enables other software applications to obtain this information through the creation of a "macro". Alternatively, IC Content Capturer 52 could include a printer driver, which would enable Microsoft™ Word™ to "print" the document to IC Content Capturer 52 directly, or alternatively to a file in a format accessible by IC Content Capturer 52. In any case, regardless of the specific method employed, the content of the information is obtained from the captured information by using a particular software driver.
  • Each software driver is relevant to the particular information source format, such as electronically scanned paper document, electronic document such as a word processing document, video clip, document sent by facsimile and other such formats.
  • Each driver is a channel to an information processing unit for a specific type of information, and invokes a process specific to the source of that information.
  • The content information is stored in an internal unified format for data processing and information component recognition, access and retrieval. The information in the unified internal file format is then sent to IC Identifier 58.
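The idea of one driver per source format, each producing the same internal unified format, can be sketched as a small registry. The format names, the dictionary layout of the unified content and the function names are all assumptions for illustration; the text does not define the unified format, and the fax driver here only stores the raw image rather than performing recognition.

```python
from typing import Callable, Dict

UnifiedContent = Dict[str, object]      # hypothetical unified internal format
DRIVERS: Dict[str, Callable[[bytes], UnifiedContent]] = {}


def driver(source_format: str):
    """Register a capture driver for one information source format."""
    def register(func: Callable[[bytes], UnifiedContent]):
        DRIVERS[source_format] = func
        return func
    return register


@driver("word_processing")
def capture_word_processing(raw: bytes) -> UnifiedContent:
    return {"source": "word_processing", "text": raw.decode("utf-8"), "images": []}


@driver("fax")
def capture_fax(raw: bytes) -> UnifiedContent:
    # A real driver would run text recognition; this sketch keeps the image only.
    return {"source": "fax", "text": "", "images": [raw]}


def capture(source_format: str, raw: bytes) -> UnifiedContent:
    """Invoke the driver that matches the source format."""
    try:
        handler = DRIVERS[source_format]
    except KeyError:
        raise ValueError(f"no driver registered for {source_format!r}") from None
    return handler(raw)


unified = capture("word_processing", b"Quarterly report")
```

Adding support for a new source format then means registering one more driver, without touching the code that consumes the unified content.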
  • Section 4 Information Component Identifier and Knowledge Base
  • IC Identifier 58 automatically identifies and creates information components from the information passed from IC Content Capturer 52 according to rules stored in IC Knowledge Base 54.
  • The information is first analyzed to extract the information component primitives, which as described previously form the most basic unit of information. These primitives include text, images, vector graphics and other such basic units of information.
  • The information components themselves are constructed from the information component primitives, and the relationships between components are determined according to rules stored in IC Knowledge Base 54.
  • The information components are classified, again according to rules stored in IC Knowledge Base 54. This classification determines security attributes, indexing rules and other publishing parameters.
  • The information is transferred to IC Server 62.
  • Figure 3A shows a portion of IC Contributor 60 in more detail, focusing on those components which interact with IC Identifier 58.
  • IC Identifier 58 has three layers, including a primitive identifier 64, a component constructor 66 and a component classifier 68.
  • Primitive identifier 64 examines the received information at two levels. First, the textual information is identified and separated into individual elements, according to the structure of the type of information. The second level of examination of the received information is visual identification, which includes determining the visual attributes and structure of the information. At the end of this dual level examination, the information component primitives have been identified according to rules stored in IC Knowledge Base 54.
  • The information component is then constructed from one or more information component primitives and/or from one or more information components which are lower in the hierarchy, by component constructor 66.
  • The information component includes such information as the identity of the primitive(s) or lower information component(s) from which it is constructed.
  • The relationships between components are determined by component constructor 66 according to rules stored in IC Knowledge Base 54. An illustrative example of this process is disclosed in U.S. Application No.
  • The disclosed process includes the following steps. First, the document is converted into a digital raster format, for example by scanning a paper document, which is stored in an electronic file. This step is preferably performed by IC Content Capturer 52. Next, preferably the converted document is enhanced to improve the quality of the image, for example.
  • The enhanced raster format file is converted into two electronic files, collectively called a "binary/raster file".
  • The first file has the enhanced raster format.
  • The second file has pointers to the enhanced raster format file. Every data element in the raster format file, such as textual information or an entire graphic image, could have a corresponding pointer in the second file.
  • The two files are preferably produced, at least in part, by an automatic text recognition process such as OCR, which enables the image of the text to be realized as textual data.
  • The information is then stored as information components composed of information primitives, as previously described.
  • Indices for information retrieval are created.
  • The original document has been subdivided and stored as a collection of information components.
  • These information elements preferably include a raster image of the document, a pointer to the storage location of the original document, any text contained within the document and the coordinates of the words of the text within the document. More specifically, the coordinates preferably include all information which is necessary to geographically locate the word within the document, such as the number of the page on which the word falls, the number of the word on the page and the coordinates of the rectangle which bounds the word on the page, or "bounding rectangle". The bounding rectangle determines the area occupied by the word on a page and is necessary to fully reproduce the visual aspects of that word. Thus, the coordinates of each word numerically describe the visual appearance of the word.
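The word coordinates described above (page number, word number on the page and bounding rectangle) map naturally onto a small record type. The names `BoundingRect`, `WordLocation` and `word_at`, and the choice of a top-left origin, are assumptions for this sketch; the text does not specify units or a coordinate system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BoundingRect:
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class WordLocation:
    text: str
    page: int            # number of the page on which the word falls
    word_index: int      # number of the word on that page
    rect: BoundingRect   # area occupied by the word on the page


def word_at(words: List[WordLocation], page: int, x: float, y: float) -> Optional[WordLocation]:
    """Return the word whose bounding rectangle covers the point, if any."""
    for word in words:
        if word.page == page and word.rect.contains(x, y):
            return word
    return None


words = [
    WordLocation("Annual", 1, 0, BoundingRect(72, 90, 130, 104)),
    WordLocation("Report", 1, 1, BoundingRect(136, 90, 190, 104)),
]
hit = word_at(words, 1, 140, 95)
```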
  • IC Content Capturer 52 performs OCR (Optical Character Recognition) to obtain the textual information from the image stored in the electronic file by converting the image of a letter into the letter itself. Both the text itself and the coordinates of individual words are then available.
  • Other examples of such processes include pattern recognition and PDF conversion. It should be noted that these processes are already well known in the art for the creation and manipulation of information in a particular information source format.
  • The information elements which are produced are then identified according to the type of information component primitive which they represent, which is in turn determined according to rules in IC Knowledge Base 54. For example, every individual image identified in the steps above would be determined to be an image information component primitive. Similarly, the text extracted in the steps above would be determined to be a text information component primitive according to information stored in the textual database. Other information component primitives could also be identified from the collection of information elements. After the information component primitives have been identified, the primitives are used to construct information components, according to rules stored in IC Knowledge Base 54. For example, in document component 10 of Figure 1A, the information primitives include image information component primitive 30 and text information component primitive 28. These primitives are in turn used to build paragraph 26.
  • Paragraph 26 now contains information concerning not only the inclusion of one or more image information component primitives 30, for example, but also such information as the relative geometrical location of the primitive within paragraph 26.
  • The geometrical location of the primitive was determined when the primitive itself was identified, for example as described above.
  • The primitives are first assembled in information components which are relatively lower in the hierarchy, for example paragraph 26, and then these components are in turn assembled into information components which are higher in the hierarchy.
  • Each individual information component is classified according to rules in IC Knowledge Base 54 by component classifier 68.
  • The individual component is compared to components listed within the knowledge base, and is recognized as a unique and individual element belonging to a larger information cluster.
  • Each component is classified first by assignment to a primary information class, and then by placement within the hierarchical structure of information sub-classes belonging to that primary class.
  • FIG. 3B shows a schematic block diagram of an exemplary IC Knowledge Base 54.
  • The document class for the information component is determined.
  • The different document classes are stored within IC Knowledge Base 54 in a document class table 70.
  • The document could be a research report, newspaper, or substantially any other type of document which has been placed within document class table 70.
  • The information component would be classified according to a particular information component class stored in an IC class table 72.
  • These classes could include, but are not limited to, a logo, a main title, publishing information, summary, and so forth.
  • Each class is in turn identified according to rules stored in a rules table 74.
  • These rules are composed of tokens, including constants 76, functions 78 and operations 80.
  • Each rule could optionally be stored in a "flat (text) file" for example, in which case the tokens would preferably be stored as text strings separated by spaces. Of course, many other options are also available for storing these rules.
  • Each rule preferably includes the following tokens in the following order, although of course other rule structures could be used: the name of the IC class, the hierarchy level of the information component, the font type, the size range, the color, the case of the letter, the location of the page on which the information component is found and the text which the information component should contain (if any).
  • The rule does not necessarily need to include all of these tokens; an absent token can be indicated by a place-holding character such as a "slash" ("/"), for example.
  • PageNo Helvetica-Bold 11.0-11.05 / / B /
  • In this example, an information component named PageNo is identified by any text of any color in any letter case, in font Helvetica-Bold, of any size between 11 and 11.05, which is located at the bottom of the page (indicated by the letter "B").
  • As another example, an information component named Section is identified by any text of any color in any letter case, in font TimesNewRoman or TimesNewRoman-Bold, of any size between 9.50 and 10.50, which is located anywhere on the page.
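A flat-file rule such as the PageNo example can be parsed by splitting on spaces and treating "/" as an absent token. The sketch below assumes the seven-token layout seen in that example (class name, font, size range, color, letter case, location, text), which omits the hierarchy level listed in the token order above; the `Rule` dataclass and the `parse_rule` and `matches` functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

PLACEHOLDER = "/"    # marks an absent token


@dataclass
class Rule:
    ic_class: str
    font: Optional[str]
    size_range: Optional[Tuple[float, float]]
    color: Optional[str]
    letter_case: Optional[str]
    location: Optional[str]
    text: Optional[str]


def _optional(token: str) -> Optional[str]:
    return None if token == PLACEHOLDER else token


def parse_rule(line: str) -> Rule:
    """Parse one space-separated rule line (seven-token layout, an assumption)."""
    tokens = line.split()
    if len(tokens) != 7:
        raise ValueError(f"expected 7 tokens, got {len(tokens)}")
    name, font, size, color, case, location, text = tokens
    size_range = None
    if size != PLACEHOLDER:
        low, high = size.split("-")
        size_range = (float(low), float(high))
    return Rule(name, _optional(font), size_range, _optional(color),
                _optional(case), _optional(location), _optional(text))


def matches(rule: Rule, font: str, size: float, location: str,
            color: Optional[str] = None, letter_case: Optional[str] = None,
            text: Optional[str] = None) -> bool:
    """Check a text element against a rule; absent tokens match anything."""
    exact = [(rule.font, font), (rule.color, color),
             (rule.letter_case, letter_case), (rule.location, location)]
    if any(expected is not None and expected != actual for expected, actual in exact):
        return False
    if rule.size_range is not None and not (rule.size_range[0] <= size <= rule.size_range[1]):
        return False
    if rule.text is not None and (text is None or rule.text not in text):
        return False
    return True


rule = parse_rule("PageNo Helvetica-Bold 11.0-11.05 / / B /")
```

Splitting on spaces means a required text token cannot itself contain spaces in this sketch; a rule naming several alternative fonts, as in the Section example, would need an extra convention that the text does not show.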
  • IC Rules Editor 56 is preferably a GUI (graphical user interface), which more preferably allows the user to define new rules, enter new information, delete old rules or information, and amend or alter rules or information.
  • Section 5 Information Component Contributor
  • IC Contributor 60 also prepares the information components for publication and for storage in a database, such that the information components can be served to a client by IC Server 62.
  • IC Contributor 60 features a component generator 82.
  • Component generator 82 transforms the classified information component into a standard format including, but not limited to, an active object format such as a COM object or a Java Bean object, or a flat file format such as DHTML. Generally, component generator 82 packages the classified information component according to the standard format, so that the packaged information component is accessible by IC Server 62.
  • Descriptions of the transformation of the information component into two of these object-oriented formats are given in further detail below.
  • In Chapter III, Section 2, a description is provided of the transformation of the information component into an object in an XML environment.
  • In Chapter IV, Section 1, a description is provided of the transformation of the information component into a Java Bean object in a CORBA environment.
  • IC Contributor 60 is also able to render and to store information components as a DHTML document.
  • IC Contributor 60 converts data for each primitive of each information component into equivalent fragments in DHTML format.
  • The graphic elements are converted to raster images (in GIF format).
  • The text elements are converted to a set of DHTML <DIV> blocks.
  • There are two types of DHTML <DIV> blocks: a style block and a value block.
  • The style block defines the style attributes of the text, such as font size and name, font-weight, font-style and color.
  • The value block defines the position of the text element within the current primitive and its text value. When the text value contains more than one word, the text value is inserted into a <NOBR> block to prevent line breaking for the given text element by the web browser.
  • The DHTML fragment is optimized to ensure that each "style block" with specific characteristics appears only once in the DHTML fragment for the primitive.
  • The original fonts are preferably substituted by the fonts available for the Web browser, with possible modifications of font size.
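The conversion of text elements into style blocks and value blocks can be sketched as follows. The exact markup is an assumption: here a style block is an outer `<DIV>` carrying the font attributes, each value block is a nested, absolutely positioned `<DIV>`, and elements that share a style are grouped so that each style block appears once. The function name `render_primitive` and the element dictionary keys are hypothetical.

```python
from html import escape
from typing import Dict, List, Tuple


def render_primitive(elements: List[dict]) -> str:
    """Render the text elements of one primitive as a DHTML-style fragment."""
    # Group elements by style so every style block is emitted only once.
    groups: Dict[Tuple[str, float, str], List[dict]] = {}
    for element in elements:
        key = (element["font"], element["size"], element["color"])
        groups.setdefault(key, []).append(element)

    parts: List[str] = []
    for (font, size, color), members in groups.items():
        # Style block: font name, size and color.
        parts.append(f'<DIV STYLE="font-family:{font}; font-size:{size}pt; color:{color}">')
        for element in members:
            text = escape(element["text"])
            if len(element["text"].split()) > 1:
                # Multi-word values are wrapped to prevent line breaking.
                text = f"<NOBR>{text}</NOBR>"
            # Value block: position within the primitive plus the text value.
            parts.append(
                f'  <DIV STYLE="position:absolute; left:{element["x"]}px; '
                f'top:{element["y"]}px">{text}</DIV>'
            )
        parts.append("</DIV>")
    return "\n".join(parts)


fragment = render_primitive([
    {"text": "Annual Report", "font": "Arial", "size": 14, "color": "black", "x": 10, "y": 10},
    {"text": "1998", "font": "Arial", "size": 14, "color": "black", "x": 10, "y": 30},
    {"text": "Confidential", "font": "Times", "size": 9, "color": "red", "x": 10, "y": 50},
])
```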
  • IC Server 62 can then serve the information component to IC Search Engine 63 after receiving a request for a particular information component from IC Search Engine 63.
  • IC Search Engine 63 receives a request for an information component, which is then made available to IC Publisher 65, which publishes the information component.
  • IC Publisher 65 publishes the information component to a Web page, for example, or onto paper, as another example.
  • Both IC Search Engine 63 and IC Server 62 must be able to communicate with each other, such that an information component can be requested. This communication is permitted with information components which have a standard format.
  • An object format is particularly preferred, because objects can be accessed through a predefined structure, which is more efficient for interacting with the information contents of the object.
  • Both of the exemplary and preferred embodiments described below in Chapter III, Sections 1 and 2 (XML) and in Chapter IV, Section 1 (Java Bean) have object formats for the information component. Of course, other types of formats could be used, such as DHTML.
  • Section 6 Information Component Server
  • IC Server 62 stores and manages the "original" information, such as documents, video segments, sounds and so forth.
  • IC Server 62 locates the original information entity, isolates the corresponding information component and then returns the information component to the client application in some suitable format, for example as an HTML file.
  • Section 7 Information Component Publisher and Client
  • an IC Client 98 is able to send requests for information components to IC Server 62 through an IC Search Engine 63.
  • IC Client 98 is also able to receive such components from IC Server 62, optionally and preferably through the CORBA ORB.
  • IC Client 98 preferably features some type of GUI (graphical user interface), which enables client applications to interact with the functionality of the information management system of the present invention.
  • IC Publisher 65 is then able to publish the information component onto IC Client 98.
  • The ability to access certain information components and to view these components on GUI interface 100 is controlled by two functions: automatic information component replacement and "white-label". These features provide customized views of the same documents to different user groups, while preventing the display of sensitive information components to specific users or to groups of users.
  • The IC replacement table includes the following information: the class and the name of the information component to be replaced, the class and the name of the information component which is to replace it, and the user groups or individuals for whom the replacement should be performed.
  • the logo on a particular research report which is an information component called "Big Company X Report”
  • IC Server 62 would then replace the logo of "Big Company X" with the logo of "Another Big Company” when those clients of "Another Big Company” request the Report.
  • The white-label function is used to specify one or more information components in the original document which are not to be displayed on GUI interface 100, but which remain incorporated within the original document.
  • The white-label function enables sensitive information to be protected from access through IC Client 98.
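The replacement table and the white-label function can be sketched as a filter applied when a view is built for a client. Components are modeled here as (class, name) pairs, and the names `Replacement` and `view_for`, as well as the representation of the white-label list as a plain set, are assumptions for illustration. The original document is left untouched, in line with the statement that white-labeled components remain incorporated within it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

ComponentKey = Tuple[str, str]          # (class, name)


@dataclass(frozen=True)
class Replacement:
    original: ComponentKey              # component to be replaced
    substitute: ComponentKey            # component which replaces it
    groups: FrozenSet[str]              # user groups the replacement applies to


def view_for(document: List[ComponentKey], user_groups: Set[str],
             replacements: List[Replacement],
             white_label: Set[ComponentKey]) -> List[ComponentKey]:
    """Build the customized view of a document for one client."""
    visible: List[ComponentKey] = []
    for component in document:
        if component in white_label:
            continue                    # hidden from display, kept in the document
        for rule in replacements:
            if rule.original == component and rule.groups & user_groups:
                component = rule.substitute
                break
        visible.append(component)
    return visible


document = [("logo", "Big Company X"), ("text", "body"), ("text", "internal note")]
table = [Replacement(("logo", "Big Company X"), ("logo", "Another Big Company"),
                     frozenset({"another-big-company"}))]
hidden = {("text", "internal note")}

partner_view = view_for(document, {"another-big-company"}, table, hidden)
default_view = view_for(document, {"general"}, table, hidden)
```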
  • The objects are preferably compatible with the ActiveX™ architecture, although other types of objects could also be used, as long as they were compatible with the XML environment.
  • The ActiveX™ objects could be constructed by the client from the information component objects according to the ActiveX™ architecture.
  • Section 1 is an overview of the system when implemented with XML.
  • Section 2 is a description of IC Contributor when implemented with XML.
  • Section 3 describes IC XML server.
  • Section 4 describes the IC Search Engine when implemented with XML.
  • Section 5 describes the IC Publisher and IC Client when implemented with XML.
  • Figure 7 shows an overall view of a portion of the system of the present invention as implemented in XML.
  • IC Contributor 60 is now IC XML Contributor 200 and IC Server 62 is now IC XML Server 202.
  • IC XML Contributor 200 creates objects from the information components as XML-environment compatible objects.
  • IC XML Server 202 provides access to a database 204, which is similar to database 84 of Chapter II. Database 204 stores the information components, which are implemented as XML-environment compatible objects.
  • IC XML Server 202 also communicates with a DOM (document object model) compliant interface, referred to as DOM Interface 208.
  • Software programs which are compliant with the DOM protocol are able to communicate with other software programs for XML-compatible or XML-specific tools, such as Web browsers or software programs for editing XML documents, for example.
  • DOM Interface 208 acts as a gateway, enabling these XML tool software programs to communicate with information components through IC XML Server 202.
  • XML tool software programs can therefore preferably edit and reuse information components directly from database 204, without conversion of the components.
  • IC XML Server 202 provides one or more information components upon receiving a request from IC Universal Search Engine Adapter 214.
  • IC Universal Search Engine Adapter 214 enables many different types of search engines to communicate with IC XML Server 202, such that a search can be made for specific information components within database 204.
  • IC Universal Search Engine Adapter 214 also preferably controls access to IC XML Server 202, preferably including such functions as security and request access.
  • IC Universal Search Engine Adapter 214 passes a request for an information component to IC XML Server 202, which then returns the desired information component.
  • IC XML Publisher 210 optionally and preferably includes an HTML rendering engine 212, and a standard document rendering engine 216.
  • The information component can then be displayed to the user in a number of ways, such as by printing the information on paper or by displaying the information on a Web page.
  • IC XML Publisher 210 passes the information component to standard document rendering engine 216. If the information is to be displayed by a Web browser which can only handle HTML documents, then IC XML Publisher 210 passes the information component to HTML rendering engine 212. Other types of rendering are also possible, of course. The description of each of these parts of the system is given in greater detail in the sections below.
  • This section describes a specific, preferred implementation of the IC Contributor for operation with XML-environment compatible objects such as those compatible with ActiveX™ architecture, IC XML Contributor 200.
  • As XML-environment compatible objects, information components are organized in a hierarchical structure and linked to each other.
  • Each XML-environment compatible object has methods, properties and data.
  • The data itself is the classified information component obtained as described in Chapter II. Methods determine the ways in which the data and properties of the information component can be manipulated. For example, methods include ways to access the data, whether as an image, a video clip, a sound and so forth.
  • Methods also include an application interface, so that another application would be able to interact with the information component and with the stored data, and with a GUI (graphical user interface).
  • Other methods pertain to access control and to event handling. Event handling enables these objects to broadcast events and to have those events delivered to an appropriate component or components for notification. Thus, event handling provides methods for communication between components packaged as XML-environment compatible objects.
  • The properties of the XML-environment compatible object include the internal structure of the object and the location of the data of the information component within the hierarchical structure of information components.
  • Information components are composed of IC primitives, which are in turn used to build more complex structures which describe the relationships between information components.
  • The location of the data of the information component within a hierarchy is important in order to be able to construct virtual documents and to understand the type and significance of the data within the information component.
  • These properties include the correct tags for the type of data within the object, in order for the object to be correctly rendered within the XML environment, and its location within the information component hierarchy. For example, if the type of data is a chapter of a book, then the correct tag might be the "chapter" tag. This tag identifies the type of XML element for the object, which is important for the later assembly of the data within the object as an element of an XML document.
  • IC XML Contributor 200 packages the information component obtained from IC Identifier 58 into the XML-environment compatible object as follows. First, the data of the information component forms the data of the object. Next, the methods which can be used to interact with the object are determined. Certain of these methods are typical for all such objects. Other methods, such as the method for accessing the type of data within the object, are particular for the type of data from the information component. Finally, the properties of the object are determined, for example according to the location of the data of the XML-environment compatible object within the information component hierarchy.
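The three packaging steps above (data, then methods, then properties) can be sketched in Python. This is only an illustrative sketch: the names `ICObject`, `package_component`, the method keys and the `location` path syntax are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ICObject:
    """Hypothetical stand-in for an XML-environment compatible object."""
    data: bytes
    methods: Dict[str, Callable[[], object]] = field(default_factory=dict)
    properties: Dict[str, str] = field(default_factory=dict)


def package_component(data: bytes, data_type: str, tag: str, location: str) -> ICObject:
    # Step 1: the data of the information component forms the data of the object.
    obj = ICObject(data=data)
    # Step 2a: a method assumed to be typical for all such objects.
    obj.methods["raw"] = lambda: obj.data
    # Step 2b: a method particular to the type of data (only text is sketched here).
    if data_type == "text":
        obj.methods["as_text"] = lambda: obj.data.decode("utf-8")
    # Step 3: properties, including the XML tag and the place in the hierarchy.
    obj.properties["tag"] = tag
    obj.properties["location"] = location
    return obj


chapter = package_component(b"Call me Ishmael.", "text", "chapter", "book/chapter[1]")
```

Keeping the tag as a property rather than baking it into the data mirrors the text's point that the tag is needed later, when the object is assembled as an element of an XML document.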
  • This section describes a specific implementation of IC Server 62, described in Chapter II, Section 6, for operation with XML-environment compatible objects.
  • IC XML Server 202 accepts requests for and then serves information components as XML elements assembled into an XML document.
  • IC XML Server 202 manages the extended links of XML and normalizes the structure of various DTDs for the XML documents.
  • IC XML Server 202 enables the XML-environment compatible objects to be accessed by XML tool software programs without requiring conversion of the objects.
  • XML documents are collections of one or more XML elements which are organized according to certain rules, which are held in the DTD of the XML document.
  • IC XML Server 202 is preferably able to assemble XML documents "on the fly" in response to a request from a client application.
  • A client application might request a particular chapter of a book. This chapter could contain a chapter title, text and images, for example. The chapter could also be further subdivided into sections, each of which would also have an organizational structure.
  • The data required to assemble the chapter is contained within one or more IC XML elements.
  • IC XML Server 202 must first locate all of the IC XML element(s) which are required for the chapter.
  • A style sheet is optionally selected for the XML document.
  • The style sheet is optionally determined by the properties of the "chapter" IC XML element, which may indicate a particular style sheet to be used for that element.
  • The style sheet could be determined according to specifications submitted by the client application, such that the preferences of the client application determine the style sheet.
  • The IC XML elements are then assembled in the XML document, optionally according to the style sheet.
  • The DTD for the XML document is then constructed, according to the tags contained within the IC XML element.
  • The links for the XML document are then determined.
  • These links are extended links. More preferably, the extended links are managed as part of a document which is external to the XML document.
  • These links are determined according to the link(s) of the IC XML element, which are included in the properties of the element. For example, one such link might link two sections of the chapter.
  • this other XML element is also assembled into a different XML document, such that the different XML document could also be served if necessary.
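The "on the fly" assembly steps can be illustrated with a short Python sketch using the standard `xml.etree.ElementTree` module. The in-memory store `IC_ELEMENTS`, its identifiers and the functions `assemble` and `build_dtd` are assumptions made for illustration; the DTD derivation is deliberately simplistic (it ignores attributes, ordering and cardinality) and style sheets and links are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical store of IC XML elements: id -> (tag, text, ids of child elements).
IC_ELEMENTS = {
    "ch1": ("chapter", None, ["ch1-title", "ch1-s1", "ch1-s2"]),
    "ch1-title": ("title", "Loomings", []),
    "ch1-s1": ("section", "First section text.", []),
    "ch1-s2": ("section", "Second section text.", []),
}


def assemble(element_id, store=IC_ELEMENTS):
    """Locate the requested element and recursively attach its children."""
    tag, text, child_ids = store[element_id]
    node = ET.Element(tag)
    node.text = text
    for child_id in child_ids:
        node.append(assemble(child_id, store))
    return node


def build_dtd(root):
    """Construct minimal element declarations from the tags found in the tree."""
    children_of = {}
    for node in root.iter():
        seen = children_of.setdefault(node.tag, [])
        for child in node:
            if child.tag not in seen:
                seen.append(child.tag)
    lines = []
    for tag, kids in children_of.items():
        model = "(" + " | ".join(kids) + ")*" if kids else "(#PCDATA)"
        lines.append("<!ELEMENT {} {}>".format(tag, model))
    return "\n".join(lines)


document = assemble("ch1")
dtd = build_dtd(document)
xml_text = ET.tostring(document, encoding="unicode")
```

Because the document is rebuilt from stored elements on each request, the same elements could be reused in a different "virtual" document simply by starting from a different root identifier.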
  • Extended links are also objects which are stored externally, for example in database 204. Extended link objects are exposed as child objects of the IC XML-environment compatible objects, or resource objects, which they link.
  • Each extended link object is preferably stored in a link table 218, which is then stored in database 204.
  • The identity of each IC XML-environment compatible object is also stored in an IC table 220, also stored in database 204.
  • A document table 222 is also stored in database 204.
  • Document table 222 indicates how to assemble complete XML documents.
  • These XML documents can be assembled into a format which closely resembles the format of the original document from which the information was obtained.
  • other "virtual" XML documents could also be assembled according to requests received from the client application.
  • IC XML Server 202 manages the extended link objects through dynamic management, by dynamically generating extended link objects as required. For example, if a document or an information component is removed from database 204, IC XML Server 202 updates link table 218, IC table 220 and document table 222 as necessary.
  • More preferably, IC XML Server 202 sends an alert to the software tool which is attempting to remove the document or information component from database 204, alerting the user to the possible alteration to the link structure.
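As a rough illustration of keeping the link table, IC table and document table consistent when a component is removed, the following Python sketch uses an in-memory SQLite database. The table and column names are invented for this example, and returning a count of affected links is only one assumed way of giving the calling tool something to alert the user about.

```python
import sqlite3


def create_tables(conn):
    # Hypothetical schemas standing in for link table 218, IC table 220
    # and document table 222.
    conn.executescript("""
        CREATE TABLE ic_table (ic_id TEXT PRIMARY KEY, tag TEXT);
        CREATE TABLE link_table (link_id INTEGER PRIMARY KEY,
                                 source_ic TEXT, target_ic TEXT);
        CREATE TABLE document_table (doc_id TEXT, position INTEGER, ic_id TEXT);
    """)


def remove_component(conn, ic_id):
    """Remove one component, update all three tables, and report broken links."""
    cur = conn.execute(
        "DELETE FROM link_table WHERE source_ic = ? OR target_ic = ?",
        (ic_id, ic_id),
    )
    broken_links = cur.rowcount  # number of link rows that referenced the component
    conn.execute("DELETE FROM document_table WHERE ic_id = ?", (ic_id,))
    conn.execute("DELETE FROM ic_table WHERE ic_id = ?", (ic_id,))
    conn.commit()
    return broken_links


conn = sqlite3.connect(":memory:")
create_tables(conn)
conn.executemany("INSERT INTO ic_table VALUES (?, ?)",
                 [("s1", "section"), ("s2", "section")])
conn.execute("INSERT INTO link_table (source_ic, target_ic) VALUES ('s1', 's2')")
conn.executemany("INSERT INTO document_table VALUES (?, ?, ?)",
                 [("doc1", 0, "s1"), ("doc1", 1, "s2")])

broken = remove_component(conn, "s2")
remaining = conn.execute("SELECT COUNT(*) FROM document_table").fetchone()[0]
```

A caller could compare the returned count against zero to decide whether to warn the user that the link structure has changed.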
  • IC XML Publisher 210 is described in greater detail in Section 4 below.
  • IC XML Server 202 is preferably capable of serving many different types of XML documents, which may have different DTD structures. Such different structures can increase the difficulty of searching, retrieving and assembling IC XML elements. Furthermore, if IC XML elements have different names for tags which should indicate the same element, IC XML Server 202 may not be able to assemble IC XML elements correctly. Therefore, IC XML Server 202 optionally and preferably performs DTD normalization for the XML elements and documents.
  • DTD normalizer 232 compares the name of the tag (text string associated with the tag) to the rule or rules of a DTD rules database 234. For example, if the name of the tag is "summary", a rule might state that "synopsis" should be used to replace "summary".
  • If DTD rules database 234 does not have a rule for the name of that particular tag, then DTD normalizer 232 searches any information associated with the XML element having that tag in order to normalize the name of the tag.
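The rule lookup and its fallback can be sketched as follows. Only the "summary" to "synopsis" rule comes from the text; the function name, the `hints` parameter and the way associated information is consulted are assumptions, since the specification does not say how that search is performed.

```python
# Rule table standing in for DTD rules database 234.
DTD_RULES = {"summary": "synopsis"}


def normalize_tag(name, rules=DTD_RULES, hints=None):
    """Return the normalized tag name.

    `hints` is a hypothetical list of strings associated with the element
    (for example alternative labels) consulted only when no rule matches.
    """
    key = name.strip().lower()
    if key in rules:
        return rules[key]
    for hint in hints or []:
        candidate = hint.strip().lower()
        if candidate in rules:
            return rules[candidate]
        if candidate in rules.values():
            return candidate
    # No rule and no usable hint: keep the (case-folded) original name.
    return key


normalized = normalize_tag("Summary")
from_hint = normalize_tag("abstract", hints=["summary"])
unchanged = normalize_tag("chapter")
```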
  • XML tool software programs such as editor programs for XML documents
  • IC XML Server 202 is able to communicate with IC XML Server 202 through this "gateway" software module
  • these editor programs are able to create and manipulate virtual documents from XML-environment compatible objects stored in database 204 in a substantially similar manner to the way in which XML documents are created and manipulated
  • IC Universal Search Engine Adapter 214 passes requests for information components to IC XML Server 202.
  • IC Universal Search Engine Adapter 214 therefore controls access to IC XML Server 202, and hence to the information components.
  • IC Universal Search Engine Adapter 214 preferably operates according to an HTTP-based protocol.
  • The access offered through IC Universal Search Engine Adapter 214 can be determined according to a software module or applet written in JavaScript, Java, ActiveX™ or C++, for example.
  • IC Universal Search Engine Adapter 214 is preferably able to translate substantially any type of search query language into a format which is accessible to IC XML Server 202. More preferably, IC Universal Search Engine Adapter 214 includes a driver (not shown) for each search engine, such that a new type of search engine can be easily accommodated by altering the driver.
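The driver-per-search-engine arrangement can be illustrated with a small Python sketch. Everything here is hypothetical: the common request format (a dictionary of terms), the two toy query syntaxes and the `fake_server` function standing in for IC XML Server 202.

```python
from typing import Callable, Dict, List

Request = Dict[str, List[str]]  # assumed common format understood by the server


def keyword_driver(query: str) -> Request:
    """Driver for a plain keyword syntax: whitespace-separated terms."""
    return {"terms": query.lower().split()}


def boolean_driver(query: str) -> Request:
    """Driver for a toy boolean syntax in which terms are joined by AND."""
    return {"terms": [t.lower() for t in query.split() if t.upper() != "AND"]}


class SearchEngineAdapter:
    def __init__(self, server: Callable[[Request], List[str]]):
        self._server = server
        self._drivers: Dict[str, Callable[[str], Request]] = {}

    def register_driver(self, engine: str, driver: Callable[[str], Request]) -> None:
        # Supporting a new search engine only requires registering a new driver.
        self._drivers[engine] = driver

    def search(self, engine: str, query: str) -> List[str]:
        driver = self._drivers.get(engine)
        if driver is None:
            raise KeyError("no driver registered for search engine {!r}".format(engine))
        return self._server(driver(query))


INDEX = {"ic-1": "keppel shipyard report", "ic-2": "annual report"}


def fake_server(request: Request) -> List[str]:
    """Stand-in for the server: return ids whose text contains every term."""
    return sorted(ic for ic, text in INDEX.items()
                  if all(term in text.split() for term in request["terms"]))


adapter = SearchEngineAdapter(fake_server)
adapter.register_driver("keyword", keyword_driver)
adapter.register_driver("boolean", boolean_driver)
results = adapter.search("boolean", "Keppel AND report")
```

The server side never sees engine-specific syntax, which is the property the text relies on when it says the client application need not be altered for different search engines.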
  • IC Universal Search Engine Adapter 214 is preferably built to be compatible with the particular architecture of IC XML Server 202, such that the client application requesting a particular information component would not need to be altered in order to be compatible with different search engines.
  • Section 5 Specific Implementation of IC Publisher
  • IC Publisher 63 is IC XML Publisher 210.
  • IC XML Publisher 210 makes the information components accessible to the client application. The information component can then be displayed to the user in a number of ways, such as by printing the information on paper or by displaying the information on a Web page.
  • IC XML Publisher 210 passes the information component to standard document rendering engine 216.
  • Standard document rendering engine 216 could output the information component according to the PostScript protocol, for example, in order to allow data exchange and communication with paper printing devices.
  • IC XML Publisher 210 preferably passes the information component to HTML rendering engine 212.
  • Other types of rendering are also possible, of course.
  • HTML rendering engine 212 is able to render the XML document as an HTML or a DHTML document for being served to a Web browser.
  • HTML rendering engine 212 is able to render other document formats, such as PDF, word processing and image formats, as HTML documents as well. PDF could be rendered from a PostScript output, for example.
  • The functions described for HTML rendering engine 212 could be used for rendering substantially any type of file in substantially any format as an HTML or DHTML document, as described below.
  • Figure 10 is a schematic block diagram showing a preferred implementation of HTML rendering engine 212 and associated items according to the present invention.
  • HTML rendering engine 212 interacts between a native file format processor 236 and a Web browser 238. Essentially, HTML rendering engine 212 enables a native file format document 240, which would normally be substantially accessible only to native file format processor 236, to be displayed by Web browser 238. Furthermore, the display of native file format document 240 by Web browser 238 is visually similar or identical to the display of native file format document 240 by native file format processor 236, as enabled by HTML rendering engine 212.
  • Native file format processor 236 can be any software component or application which can access a native file format document 240.
  • Examples of such software components or applications include, but are not limited to, an XML editing software program, word-processing software such as Microsoft™ Word™ and exchange format software such as Adobe Acrobat™ Exchange Reader.
  • Examples of native file formats include, but are not limited to, the XML format, the DOC format for Microsoft™ Word™ and the PDF format for Adobe Acrobat™.
  • The phrase "access a native file format document" is meant to connote that native file format processor 236 can display and manipulate native file format document 240 such that native file format document 240 is viewable with native visual attributes or visual appearance.
  • native file format document 240 is preferably in the file format which is intended to be implemented by native file format processor 236.
  • HTML rendering engine 212 is able to convert native file format document 240 into a raster image in a raster format which is displayable by Web browser 238, according to one of two preferred embodiments of the present invention. In the first embodiment, HTML rendering engine 212 interacts with native file format processor 236 to obtain data regarding native file format document 240. In the second preferred embodiment, HTML rendering engine 212 directly accesses native file format document 240 substantially without any interaction with native file format processor 236.
  • HTML rendering engine 212 interacts with native file format processor 236 and instructs native file format processor 236 to place native file format document 240 on the "clipboard" (not shown).
  • The "clipboard" is a feature of a number of different computer operating systems, in particular those operating systems of Microsoft Inc. (Seattle, Washington, USA), such as "Windows95™" and "Windows NT™", for example.
  • the general function of the "clipboard” is to enable one software application, such as native file format processor 236, to make information available to another software application, such as HTML rendering engine 212.
  • clipboard refers to any feature of a computer operating system which enables information to be exchanged between two software applications.
  • Once native file format document 240 has been copied to, or placed on, the clipboard, native file format document 240 is then pasted to HTML rendering engine 212 as a graphical image.
  • HTML rendering engine 212 imports, or accesses, native file format document 240 as an image, which can then be converted to a raster image in a raster format. Additionally, HTML rendering engine 212 is able to obtain the necessary data about native file format document 240 through such "pasting".
  • HTML rendering engine 212 receives the necessary information about native file format document 240 through substantially direct interaction with native file format processor 236. Such interaction can be performed according to a number of different methods. For example, Adobe Acrobat™ allows other software applications to obtain this information through the creation of a "plug-in". Microsoft™ Word™ enables other software applications to obtain this information through the creation of a "macro".
  • HTML rendering engine 212 could include a printer driver, which would enable native file format processor 236 to "print" native file format document 240 to an image format file. Such "printing” would also give HTML rendering engine 212 the necessary data about native file format document 240. Thus, HTML rendering engine 212 would obtain the necessary data about native file format document 240 through interaction with native file format processor 236.
  • the second preferred embodiment of the present invention has many different possible implementations, one illustrative example of which is given here, it being understood that this example is for discussion purposes only and is not meant to be limiting.
  • The second preferred embodiment involves direct interaction of HTML rendering engine 212 with native file format document 240, substantially without any interaction with native file format processor 236.
  • HTML rendering engine 212 preferably performs such interaction by understanding all or substantially all of the instructions contained within native file format document 240, in a similar or identical manner as native file format processor 236. These instructions are like any other computer software language, and as such can be understood and interpreted by software applications other than native file format processor 236.
  • HTML rendering engine 212 obtains the necessary data about native file format document 240.
  • This data includes substantially all of the words of the text in native file format document 240, or at least of that portion of native file format document 240 which is to be displayed on Web browser 238.
  • the data includes the coordinates of each word within native file format document 240.
  • the data preferably includes all attributes of each word and of the relationships between words, such as the font style and size, character attributes such as bold or italicized text, and spaces between characters and words.
  • the data in combination enable native file format document 240 to be reproduced in a substantially identical or identical document appearance on Web browser 238.
  • the coordinates preferably include all information which is necessary to geographically locate the word within native file format document 240, such as the number of the page on which the word falls, the number of the word on the page and the coordinates of the rectangle which bounds the word on the page, or "bounding rectangle".
  • the bounding rectangle determines the area occupied by the word on a page and is necessary to fully reproduce the visual aspects of that word.
  • the coordinates of each word numerically describe the visual appearance of the word and, preferably in combination with the visual attributes of the word, enable the visual appearance of the word to be reproduced.
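The per-word data listed above (text, page number, word number on the page, bounding rectangle and visual attributes) could be held in a record such as the following Python sketch. The class name, field names, coordinate convention and units are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WordRecord:
    """Hypothetical record for one word of a native file format document."""
    text: str
    page: int                                   # number of the page holding the word
    index_on_page: int                          # number of the word on that page
    bbox: Tuple[float, float, float, float]     # bounding rectangle: left, top, right, bottom
    font_name: str
    font_size: float
    bold: bool = False
    italic: bool = False

    @property
    def width(self) -> float:
        return self.bbox[2] - self.bbox[0]

    @property
    def height(self) -> float:
        return self.bbox[3] - self.bbox[1]


word = WordRecord("Keppel", page=3, index_on_page=42,
                  bbox=(120.0, 310.0, 168.0, 324.0),
                  font_name="Times", font_size=12.0, bold=True)
```

With the bounding rectangle and attributes stored together, a renderer has what it needs both to redraw the word and to draw an emphasis box around it when it is a search match.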
  • HTML rendering engine 212 creates the raster image in a raster format which is displayable by Web browser 238.
  • the raster image is created from the data obtained from native file format processor 236, and preserves substantially all of the visual attributes of native file format document 240, or a portion thereof, when seen in the native document appearance.
  • the raster format is supported by Web browsers.
  • One example of such a format is the GIF raster format.
  • the raster image, containing at least a portion of native file format document 240 is displayable by Web browsers.
  • the raster image is optionally created "on the fly".
  • a raster image could be stored in an additional database 242 containing cached raster images, rather than being created "on the fly”.
  • If the raster image is produced as a result of a search request by the user, then preferably at least one "match" or search result is displayed in the context of at least a portion of at least one native file format document 240 containing the match, as shown in Figure 11.
  • Figure 11 shows an exemplary, illustrative depiction of a portion of the computer monitor screen which is displaying the raster images of two matches.
  • A monitor screen 244 is displaying a portion of the graphic output of Web browser 238, here shown as Netscape Navigator™ although substantially any Web browser could be used.
  • A command area 246 enables the user to enter commands to HTML rendering engine 212 through Web browser 238.
  • a display area 248 shows a portion of the results from the search.
  • Display area 248 shows a portion of two documents 250 with the searched term, "Keppel", emphasized, in this example by a box.
  • both graphic images and text are displayable in display area 248.
  • The results of the search are displayed by Web browser 238, preferably within the context of at least a portion of the original document, as shown.
  • The list of matches includes a plurality of matches within a single native file format document 240, a single match from a plurality of native file format documents 240, or even a plurality of matches from a plurality of native file format documents 240.
  • Web browser 238 can then request the next match in the series of matches, or else the previous match in the series.
  • HTML rendering engine 212 then creates the raster image of the desired match in the series.
  • A raster image could be stored in database 242, rather than being created "on the fly".
  • The raster image of the desired match is transferred to, and then displayed by, Web browser 238.
  • HTML rendering engine 212 is able to render information components as a DHTML document.
  • HTML rendering engine 212 converts data for each primitive of each information component into equivalent fragments in DHTML format.
  • There are two types of data elements in the source information component: graphic elements and text elements.
  • The graphic elements are converted to raster images (in GIF format).
  • The text elements are converted to a set of DHTML <DIV> blocks.
  • There are two types of DHTML <DIV> blocks, a style block and a value block.
  • The style block defines the style attributes of the text, such as font size and name, font-weight, font-style and color.
  • The value block defines the position of the text element within the current primitive and its text value. When the text value contains more than one word, the text value is inserted into a <NOBR> block to prevent line breaking for the given text element by the Web browser.
  • The DHTML fragment is optimized to ensure that each "style block" with specific characteristics appears only once in the DHTML fragment for the primitive.
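One way to picture the style block and value block conversion is the Python sketch below. The exact markup is an assumption: here each distinct style becomes one outer <DIV> and the value blocks that share it are nested inside, which is only one possible way of making each style block appear once. The dictionary keys are hypothetical.

```python
from html import escape


def text_to_dhtml(elements):
    """Convert text elements of one primitive into DHTML <DIV> blocks.

    Each element is a dict with keys: text, left, top, font, size, weight,
    style, color (names chosen for this sketch).
    """
    groups = {}  # style string -> value blocks, in first-seen order
    for el in elements:
        style = ("font-family:{font};font-size:{size}pt;font-weight:{weight};"
                 "font-style:{style};color:{color}").format(**el)
        text = escape(el["text"])
        if len(el["text"].split()) > 1:
            # Multi-word values are wrapped so the browser cannot break the line.
            text = "<NOBR>" + text + "</NOBR>"
        value = '<DIV STYLE="position:absolute;left:{}px;top:{}px">{}</DIV>'.format(
            el["left"], el["top"], text)
        groups.setdefault(style, []).append(value)
    blocks = []
    for style, values in groups.items():
        # One style block per distinct set of style attributes.
        blocks.append('<DIV STYLE="{}">{}</DIV>'.format(style, "".join(values)))
    return "\n".join(blocks)


base = {"font": "Arial", "size": 10, "weight": "normal",
        "style": "normal", "color": "#000000"}
fragment = text_to_dhtml([
    dict(base, text="Chapter", left=10, top=20),
    dict(base, text="two words", left=10, top=40),
    dict(base, text="Bold", left=10, top=60, weight="bold"),
])
```

Three text elements with two distinct styles therefore yield two style blocks and three value blocks.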
  • Exact correspondence between the source document text style and the DHTML style is not always possible.
  • The original fonts are preferably substituted by the fonts available for the Web browser, with possible modifications of font size.
  • DHTML representations can be created for complete pages as well as for parts of any page.
  • The relevant primitives are obtained. These include the DHTML data, the enclosing rectangle for the primitive, and text coordinate mapping (for drawing search "hits" or results, or for otherwise highlighting or emphasizing a portion of text).
  • HTML rendering engine 212 iterates over these primitives and for each one generates the DHTML code that locates it in the proper place on the page.
  • Graphical elements are handled by creating a combination of <DIV> and <IMG> tags which point to a URL for loading the images directly.
  • The search hits are also displayable as part of a DHTML view.
  • Hits are created by adding (prior to the primitive DHTML) <DIV> and <IMG> tags which point to the URL of a pre-defined small image containing the hit color.
  • The hits can be indicated by using one of 3 coloring methods. These include marking the word; marking the beginning of the line; and marking the entire line.
  • The size of the coloring which indicates the hit can be adjusted to the proper size by using the text coordinate mapping of the primitive.
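The three coloring methods can be sketched as a function that sizes a stretched hit-color image from the text coordinate mapping. The function name, the method labels, the image URL `hit.gif` and the marker width used for the "beginning of the line" method are all assumptions for illustration.

```python
def hit_overlay(word_bbox, line_bbox, method="word",
                image_url="hit.gif", marker_width=6):
    """Build the <DIV>/<IMG> pair that colors one search hit.

    Bounding boxes are (left, top, right, bottom) in pixels.
    method is "word", "line_start" or "line".
    """
    left, top, right, bottom = word_bbox
    if method == "line":
        # Mark the entire line containing the hit.
        left, top, right, bottom = line_bbox
    elif method == "line_start":
        # Mark only the beginning of the line with a narrow marker.
        line_left, line_top, _, line_bottom = line_bbox
        left, top, right, bottom = (line_left - marker_width, line_top,
                                    line_left, line_bottom)
    elif method != "word":
        raise ValueError("unknown coloring method: {!r}".format(method))
    width, height = right - left, bottom - top
    return ('<DIV STYLE="position:absolute;left:{}px;top:{}px">'
            '<IMG SRC="{}" WIDTH="{}" HEIGHT="{}"></DIV>').format(
                left, top, image_url, width, height)


word_box = (120, 310, 168, 324)
line_box = (40, 308, 560, 326)
overlay = hit_overlay(word_box, line_box)
line_overlay = hit_overlay(word_box, line_box, method="line")
```

Emitting the overlay before the primitive's own DHTML, as the text describes, lets the text render on top of the colored image.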
  • Section 5 of Chapter II above described the general implementation of IC Contributor.
  • This section describes a specific, preferred implementation of IC Contributor for operation with Java Bean objects.
  • The Java Bean object has two groups of characteristics: properties and methods. Properties are descriptive features of the Java Bean object.
  • Such features preferably include the OMS (Object Mapping Structure), which is the text, structure, graphics and APPS intelligence of the Java Bean object.
  • APPS intelligence is applicable only if the original document was a paper document scanned into an electronic file, since APPS stands for "Adaptive Probability Pattern Search", which enables text to be searched in an image even if not correctly recognized by the OCR process described previously.
  • The OMS contains information related to the overall structure of the Java Bean object, as well as a description of the relationships between different portions of that object.
  • The profile information is also included.
  • The profile information includes any additional desired characteristics of the original document. These characteristics are determined by the user through IC Content Capturer 52.
  • The profile information could include data concerning the type of company which published the original document.
  • The profile information is external to the original document and is added according to the specification of IC Content Capturer 52.
  • Other preferred properties include an optional but preferable object image, which is a visual image of the original document.
  • Another preferred property is hyperlink information, which describes all connections to locations on the World Wide Web.
  • A description of the relationships between this component and other components is also provided.
  • Security and access control data is provided, which determines who is allowed to access the information.
  • Methods determine the ways in which the data and properties of the information component can be manipulated. These methods are standard for the Java Bean component architecture. For example, methods include ways to access the data, whether as an image, a video clip, a sound and so forth. Methods also include an application interface, so that another application would be able to interact with the information component and with the stored data, and with a GUI (graphical user interface). Other methods pertain to access control and to event handling. Event handling, as noted previously in section 1, is the mechanism for Java Bean components to broadcast events and to have those events delivered to an appropriate component or components for notification. Thus, event handling provides methods for communication between components packaged as Java Beans.
  • the information component is preferably packaged as a Java Bean by using the JAR file format.
  • the JAR file format includes such information as the class file, images, sounds and links to other components.
  • The class file is a description of the information class to which the information component belongs.
  • Each such piece of information is stored in the JAR file format as a pointer to the storage location of the relevant data, such as an image for example.
  • the JAR file format wraps additional information and data around the information component, in such a way that all of the information and data is both presented as a single, independent entity, yet is readily accessible to other software objects.
  • This section describes a specific implementation of IC Server 62, described in Chapter II, Section 6, as IC JBC Server 300 for operation with Java Bean objects in a CORBA environment.
  • IC JBC Server 300 locates the original information entity, isolates the corresponding information component according to a pointer stored in the Java Bean component for example, and then creates an "object image clip". The object image clip is then sent back to the client application as an HTML file.
  • IC JBC Server 300 includes a database 302.
  • Database 302 is both accessible to, and is managed by, an IC Manager 304.
  • IC Manager 304 is responsible for supplying the main CORBA services, as described in Section 1.
  • IC Manager 304 provides these services by being adapted to the main proprietary ORB models which are available, such as the "Cartridge" model of Oracle Corp. (California, USA) or the "Blade" model of Informix™.
  • An ORB is an Object Request Broker, preferably a WRB, or ORB for the "Cartridge" model which is able to communicate with individual cartridges.
  • The main CORBA services include database and indexing services for search and retrieval engines, and for push applications; database navigation services; distributed viewing, imaging and printing services for the information components; network control and retrieval services; and distributed storage services for information components.
  • If IC Manager 304 is adapted to the "Cartridge" model, then components are accessed from database 302 through one of a number of cartridges, shown as at least one cartridge 306.
  • Each cartridge 306 is a module of software which performs a specific function.
  • Each of the previously described services performed by IC Manager 304 is provided by a separate cartridge 306.
  • Different cartridges 306 could provide database indexing, database navigation and information component retrieval services for example, without requiring cartridge 306 and database 302 to be on the same server computer.
  • IC JBC Server 300 is not necessarily a single server computer, but rather is an interacting collection of components which together form IC JBC Server 300.
  • Cartridges 306 would communicate with each other and with any databases through an ORB.
  • One advantage of the "Cartridge" model is that communication between different computers could occur through the World Wide Web, via an HTTP daemon as described in section 1.
  • Cartridges 306 are named with a combination of the IP address of the server where cartridge 306 is located and the virtual path to the location of cartridge 306 on that server.
  • IC Manager 304 would preferably be composed of a number of different cartridges 306, on one server computer or a plurality of server computers, which preferably interact with each other and any other necessary components, such as databases, through the World Wide Web.
  • IC JBC Server 300 also includes web application server 308, which enables IC Manager 304 to send requests and receive information through the Internet.
  • IC Manager 304 and web application server 308 enable specific information components to be retrieved by first activating a particular cartridge 306 and then performing some action through database 302.
  • The name of a desired cartridge 306 can be given to IC Manager 304, which then locates and activates the desired cartridge, through the Internet if necessary.
  • Once cartridge 306 has been activated, it performs a specific function, such as retrieving an information component from database 302, for example. The information component can then be distributed through web application server 308.
  • Other ways for IC Manager 304 to interact with web application server 308, and to give the component to web application server 308, could also be used.
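The naming and activation of cartridges can be pictured with the following Python sketch. It is a local, in-process stand-in only: the name format (IP address joined to the virtual path with a slash), the class `ICManager` and the dictionary used as the database are assumptions, and no ORB or network communication is modelled.

```python
from typing import Callable, Dict


def cartridge_name(ip_address: str, virtual_path: str) -> str:
    """Combine server IP address and virtual path into one name (assumed format)."""
    return ip_address + "/" + virtual_path.strip("/")


class ICManager:
    """Hypothetical registry that locates and activates cartridges by name."""

    def __init__(self) -> None:
        self._cartridges: Dict[str, Callable] = {}

    def register(self, ip_address: str, virtual_path: str, function: Callable) -> str:
        name = cartridge_name(ip_address, virtual_path)
        self._cartridges[name] = function
        return name

    def activate(self, name: str, *args):
        if name not in self._cartridges:
            raise LookupError("unknown cartridge: " + name)
        # Each cartridge performs one specific function.
        return self._cartridges[name](*args)


DATABASE = {"ic-7": "quarterly results table"}  # stand-in for database 302

manager = ICManager()
retrieve = manager.register("192.0.2.10", "/cartridges/retrieve", DATABASE.get)
component = manager.activate(retrieve, "ic-7")
```

Because a cartridge is addressed purely by name, the caller does not need to know whether it runs on the same server computer as the database.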
  • The desired original information is sent to one of a plurality of image processors 310. Each image processor 310 transforms the original information, such as a document, into a Searchable Image Format (SIF) file.
  • Each information format preferably has its own image processor 310, so that for example a first image processor 310 could manipulate text editor documents, while a second image processor 310 might handle graphics files such as TIF (Tagged Image Format) or GIF (Graphics Interchange Format) format files, for example.
  • Each image processor 310 is optionally and preferably able to transform the "original" information into the corresponding SIF file "on the fly".
  • The SIF file can be created and recreated as needed, without the requirement of storing both the SIF file and the "original" information.
  • SIF files are preferably actually image files, most preferably fully compatible with the TIF file format, which incorporate both graphic images and information data stored in a separate text file, as well as the structure which relates the graphic and textual information within the original document.
  • SIF files include a header section for general information about the file such as the image resolution, the digital graphic image stored in the conventional raster format, information relating to individual words or elements of the image file, and administrative information which contains the relational structure of the image and textual elements.
  • The information relating to individual words includes not only the text of the words, but also the data generated by the OCR technology regarding unidentified characters and probable errors (APPS), if the original document was an electronically scanned paper document. Thus, any search of the textual information can compensate for these unidentified characters and errors.
  • The actual SIF file is assembled from the basic document elements which were described in Sections 2 and 5.
  • The SIF file is assembled "on the fly" by image processor 310, and can then be distributed to a client through web application server 308.
  • The SIF file would include the text and images from an original document, for example.
  • the client application issues a request for information by sending a polygon to IC JBC Server 300.
  • This polygon would include the geometrical location of the desired information within a document.
  • the polygon could first be obtained as the results of a search through IC Manager 304, for example. Once obtained, the polygon would enable IC JBC Server 300 to determine exactly which information to package into the object image clip. For example, the client application might only want to retrieve a single table from a newsletter.
  • the appropriate polygon would be sent to IC JBC Server 300, which would then pass the request to the appropriate image processor 310.
  • the table would then be sent as an object image clip to the client application.
  • the original document would be stored in its entirety but then retrieved as an individual component or components, if desired.
  • IC JBC Server 300 preferably also includes a view server 312 and a print server 314.
  • view server 312 and print server 314 may provide these services by being adapted to the main proprietary ORB models which are available, such as the "Cartridge" model of Oracle Corp. (California, USA) or the "Blade" model of Informix™.
  • print server 314 preferably allows high quality on-demand printing of the original document in a platform-independent manner. Each separate printing service is provided as a cartridge if the "Cartridge" model is used.
  • View server 312 provides the appropriate image application services to IC Manager 304, such as services related to the display of an image on a computer screen through a GUI, for example. View server 312 could also provide each service as a cartridge if the "Cartridge" model is used.
  • Section 3 Client as a Web Browser
  • the GUI could be an HTML (hypertext mark-up language) interface, such that IC client 98 is a Web browser-type software application, it being understood that this is for the purposes of illustration only and is not meant to be limiting in any way.
  • the same HTML rendering engine could be used as that described in Chapter III, Section 4.
  • an HTML interface 316 displays the Web page.
  • HTML interface 316 could be a Web browser, for example.
  • the Web page which is displayed is customizable for a particular user by an HTML customization module 318.
  • Java components 320 can also be provided to client 98.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A management system (304) for managing information, in which information is first captured from an original source having an original format (1), and then converted into an information component. The information component can be stored in and retrieved from a database (204). Upon being retrieved from a database (204), the original information can be extracted from the information component and displayed (244) in a substantially similar format as the original information.

Description

INFORMATION COMPONENT MANAGEMENT SYSTEM
FIELD AND BACKGROUND OF THE INVENTION
The present invention relates to an information component management system. Specifically, the system of the present invention enables documents, images and other types of information to be packaged within an active information component object, which can then be stored, retrieved and manipulated according to content rather than according to form.
Both the amount of available information and the number of formats in which it appears are increasing at a geometric rate.
Individuals today face a plethora of choices, both of the type of information which can be obtained, and the method by which the information is obtained. For example, in addition to the traditional print media such as newspapers and magazines, a good deal of news information is available electronically, via the World Wide Web (WWW), through electronic computer mail, by dedicated electronic news services, through a facsimile machine or even on television. All of this information can be obtained relatively easily, yet finding particularly useful information is increasingly difficult if not impossible.
The many different information formats are themselves a source of increasing complexity for information management. Such management includes storing, searching and retrieving available information to find that small fraction which is useful to the user. For example, a particular news item might be available in a paper document, as a picture, from a video stream such as television broadcast, through a voice medium such as radio, or electronically on the World Wide Web. As its name implies, a "document" management system is still tied to the underlying characteristics of a "document".
Documents can be defined as a collection of ideas and information, which are organized within a certain structure. The ideas and information may be logically linked according to various relationships, but as a whole should follow a common theme. The collection itself is expressed as a combination of text and graphic items. There are three main types of information in a document: ideas, data and structure. Ideas can be expressed with words or graphics. Data can be in the form of numbers, symbols, graphics or even sounds. The final element, structure, is an important element of a document, yet it is often overlooked as a separate entity. The structure of a document is the way in which the data and ideas are organized within the document, thereby providing additional significance to these data and ideas.
Current document management systems typically fall into one of two categories. The first category is the structured document management system. This system was originally designed to enable searches for information according to specific keywords within defined database fields. Unfortunately, this underlying system design has many disadvantages. For example, the types of searches are limited by the structure of the database itself. Furthermore, information must be extracted from the document and entered into the database manually, which is time consuming, expensive and prone to human error. Thus, structured management systems have significant drawbacks for document management.
The alternative category, non-structured text retrieval systems, solves certain problems but also creates new difficulties. These systems enable automatic indexing of information, without the need for human intervention. However, in non-structured retrieval systems, only the free text of the document is automatically indexed. Therefore, only free text from the document can be searched. Although free text is an important component of a document, such a system loses the other types of available information. Furthermore, the context of ideas or concepts within a document is largely lost by the automatic indexing procedure, leaving the user with a collection of disconnected textual segments or documents which are divorced from the general theme expressed by the entire document. Thus, the user must often read an entire document or a collection of search results in order to find the desired information.
Therefore, there is an unmet need for, and it would be highly useful to have, an information component retrieval system which stores, manages and retrieves concepts and ideas rather than static documents or document portions.
SUMMARY OF THE INVENTION
The information component management system of the present invention enables documents, images and other types of information to be packaged within an active information component object, which can then be stored, retrieved and manipulated according to content rather than according to form. The information component includes concepts or ideas, data and structure as separate but related entities. Information components are linked to each other according to a particular relationship, which may be either parallel or hierarchical.
According to the present invention, there is provided an information component system for storing an original document, comprising: (a) at least one information component for storing information from the original document, the at least one information component featuring at least one information component primitive; (b) an information component identifier for classifying the at least one information component according to at least one information component class; and (c) at least one property of the at least one information component.
According to another embodiment of the present invention, there is provided a system for displaying a native file format document, the document including text and having a native file format and a native document appearance, the native file format including at least one instruction for displaying the text of the native file format document, the system comprising: (a) a Web browser for displaying the native file format document according to the native document appearance; and (b) an HTML rendering engine for obtaining information regarding the native document appearance of the native file format document, for translating the information into a raster file having a raster format displayable by the Web browser, and for giving the translated information to the Web browser, such that the Web browser is able to display the native file format document.
According to still another embodiment of the present invention, there is provided a method for managing information, comprising the steps of: (a) capturing the information in an electronic format; (b) converting the captured information into an information component, the information component featuring: (i) a pointer to a storage location of the captured information; (ii) at least one method for manipulating the captured information; and (iii) at least one property of the captured information; (c) storing the information component; and (d) displaying the information component such that the captured information appears in substantially the original format.
According to yet another embodiment of the present invention, there is provided an information component comprising a software object, the software object including: (a) a pointer to a storage location of the stored original information; (b) at least one method for manipulating the stored original information; and (c) at least one property of the stored original information.
According to still another embodiment of the present invention, there is provided a server for serving stored information to a client Web browser, the server comprising: (a) a database for storing the stored information; and (b) an image processor for accessing the stored information from the database and transforming the stored information into a Searchable Image Format (SIF) file, the SIF file being accessed by the client Web browser, such that the stored information is displayed by the client Web browser.
Hereinafter, the term "computing platform" refers to a particular computer hardware system or to a particular software operating system. Examples of such hardware systems include, but are not limited to, personal computers (PC), Macintosh™ computers, mainframes, minicomputers and workstations. Examples of such software operating systems include, but are not limited to, UNIX, VMS, Linux, MacOS™, DOS, one of the Windows™ operating systems by Microsoft Inc. (Seattle, Washington, USA), including Windows NT™, Windows 3.x™ (in which "x" is a version number, such as "Windows 3.1™") and Windows95™.
Hereinafter, the term "software object" includes any software application capable of substantially independent execution by an operating system.
For the present invention, a software application, whether a software object or substantially any other type of software application, could be written in substantially any suitable programming language, which could easily be selected by one of ordinary skill in the art. The programming language chosen should be compatible with the computing platform according to which the software application is executed. Examples of suitable programming languages include, but are not limited to, C, C++ and Java.
Hereinafter, the term "Web browser" refers to any software program which can be used to view a document written at least partially with at least one instruction taken from HTML (HyperText Mark-up Language) or VRML (Virtual Reality Modeling Language), or any other equivalent computer document language, hereinafter collectively and generally referred to as "document mark-up language". Examples of Web browsers include, but are not limited to, Mosaic™, Netscape Navigator™ and Microsoft™ Internet Explorer™.
Hereinafter, the term "raster format" refers to any image format supported by Web browsers including, but not limited to, GIF (Graphics Interchange Format), JPEG (Joint Photographic Experts Group) and PNG (Portable Network Graphics).
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
FIGS. 1A-1C are schematic block diagrams of various exemplary information components and classes;
FIGS. 2A and 2B are schematic block diagrams of the general architecture of an exemplary system of the present invention;
FIGS. 3A and 3B are schematic block diagrams of a preferred embodiment of the IC Identifier of the present invention;
FIG. 4 is a schematic block diagram of an exemplary IC component generator of the present invention;
FIG. 5 is a schematic block diagram of a preferred embodiment of the IC Server of the present invention;
FIG. 6 is a schematic block diagram of a preferred embodiment of the IC Client of the present invention;
FIG. 7 is a schematic block diagram of a preferred embodiment of the system of the present invention as implemented in XML;
FIG. 8 is a schematic block diagram of dynamic link management according to the present invention;
FIG. 9 is a schematic block diagram of DTD normalization according to the present invention;
FIG. 10 is a schematic block diagram of an exemplary system for HTML rendering according to the present invention;
FIG. 11 shows an exemplary output of the system of Figure 10;
FIG. 12 shows an exemplary embodiment of the IC Server of the present invention as implemented with Java Bean objects;
FIG. 13 shows an exemplary embodiment of the IC Client of the present invention as implemented with Java Bean objects.
GENERAL DESCRIPTION OF THE INVENTION
The information component management system of the present invention enables documents, images and other types of structured or non-structured information to be analyzed. The underlying structure of the information is determined, and the structure is exposed to a database.
The information is then packaged within an active information component object, which can then be stored, retrieved and manipulated according to content rather than according to form. The information component includes concepts or ideas, data and structure as separate but related entities. Information components are linked to each other according to a particular relationship, which may be either parallel or hierarchical.
For example, an image of a face of a person is an information component which may in turn be a portion of a larger object, such as a group photo, which may in turn be a portion of an article. The image of the face, the group photo and the article are all individual information components which are linked according to a hierarchical structure. Each information component inherits the features of all associated information components which are higher in the hierarchical structure, and in turn contributes to the pool of features characterizing associated information components which are lower in the hierarchical structure. Thus, information components have both content related to the actual stored information, and content related to the features of associated, higher level components.
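The inheritance of features down the hierarchy can be illustrated with a minimal Python sketch. The class name `InformationComponent`, the `features` dictionary and the `effective_features` method are hypothetical names chosen for illustration only; the source does not prescribe this data layout, and it is assumed here that a component's own feature overrides an inherited feature of the same name.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InformationComponent:
    """Hypothetical node in an information component hierarchy."""

    name: str
    features: Dict[str, str] = field(default_factory=dict)
    parent: Optional["InformationComponent"] = None

    def effective_features(self) -> Dict[str, str]:
        """Return inherited features overlaid with this component's own."""
        inherited = self.parent.effective_features() if self.parent else {}
        return {**inherited, **self.features}


# Article -> group photo -> face, following the example in the text.
article = InformationComponent("article", {"source": "newspaper", "section": "sports"})
photo = InformationComponent("group photo", {"kind": "image"}, parent=article)
face = InformationComponent("face", {"kind": "face image"}, parent=photo)

print(face.effective_features())
```

In this sketch the face component carries the article's source and section without storing them itself, which is one simple way to model features accruing from higher level components.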
The actual stored information from an information component is displayed in substantially the same format as the original source format, so as to maintain the original appearance as much as possible. The displayed information maintains substantially the same fonts, graphics and structure, so that a newspaper page is displayed as a substantially exact reproduction of the page as it originally appeared in newsprint, for example. Thus, the system of the present invention has a clear advantage over prior art document management systems, which usually display retrieved information only as text. Even if graphic images are also displayed, the structure of the entire document, and the visual relationship between the text and the images, is not maintained by these prior art systems.
The information component management system of the present invention is able to search for, and retrieve, information based upon all characteristics of the information component, including graphic images, text and structural relationships. Results are presented as intuitive, visually explicit objects which are easy to examine, manipulate and navigate through. Furthermore, the search results are presented according to the ranked relevance to the desired search strategy, in which the rank is determined with both the full content and the complete characteristics of the information component. Thus, the system of the present invention includes two basic principles: object oriented management and visual information retrieval. Both principles will be explained in greater detail below, in the Description of the Preferred Embodiments. Briefly, the information components are managed as objects which belong to an information class. Different information classes are linked according to the logical relationship between the components in each class. Overall, the classes are placed within a hierarchical structure, in which each child class inherits the properties of the parent class. Each information class defines the properties and operations of a set of information components.
As noted previously, each information component is a representation of information, combining structured and non-structured data. As an object, the information component also features methods for accessing and manipulating the information, including the data interface and any data operations. Because the methods of the information component are exposed to the general computational environment, the component either can be displayed, or can display itself, on any type of computing platform or operating system. Thus, the information component is both compatible across different computing platforms and has an open, easily accessible interface.
In order to prepare such an information component, several procedures must be performed. First, the information must be identified. Next, the information must be classified and the actual information component must be created. The relationship between the new information component and other information component(s) must be identified. Finally, the behavior of the completed information component is determined according to the functionality of the attributes or features which accrue to that component after classification and identification of relationships. Once prepared, the information component can be searched and retrieved through visual information search and retrieval. Briefly, the search can be performed according to keyword, visual example and graphic attributes. Visual examples include images or graphic objects which are compared to graphic information stored in the database, just as a keyword search involves the comparison of keywords to text stored in the database. Graphic attributes include font size, font attribute and relative positioning of information within a document. These attributes can also be used as search parameters. Thus, the search is not limited to a simple keyword comparison of stored textual information.
Information which is retrieved as a result of the search is then presented in a substantially similar or even identical format as the original source format. Furthermore, the relevance ranking of the retrieved information is preferably determined not only by the number and density of required keywords which appear in the information component, if any, but also according to the desired visual attributes and relationships to other information components. Even more preferably, as described in more detail below, the system of the present invention includes a mechanism for learning the preferences and profile of an individual user, which can then also be used to calculate the relevance ranking of the retrieved information.
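The keyword portion of such a relevance ranking can be sketched as follows. This is an illustrative assumption only: the source states that the number and density of keywords contribute to the rank but gives no formula, so the additive score, the `density_weight` parameter and the function names `keyword_score` and `rank` are all hypothetical. Visual attributes, relationships between components and user profiles are not modeled here.

```python
import re
from typing import Dict, List


def keyword_score(text: str, keywords: List[str], density_weight: float = 10.0) -> float:
    """Score text by keyword count plus weighted keyword density (assumed formula)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in wanted)  # number of keyword occurrences
    density = hits / len(words)                  # fraction of words that are keywords
    return hits + density_weight * density


def rank(components: Dict[str, str], keywords: List[str]) -> List[str]:
    """Return component names ordered from most to least relevant."""
    return sorted(
        components,
        key=lambda name: keyword_score(components[name], keywords),
        reverse=True,
    )


docs = {
    "a": "stock prices rose",
    "b": "stock stock stock",
    "c": "weather report",
}
print(rank(docs, ["stock"]))
```

A fuller ranking would combine this keyword score with scores for the visual attributes and component relationships described in the text.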
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is of an information component management system, in which information is packaged as an information component, including textual data, images and structure. Information components are related to each other according to a hierarchical organization, in which characteristics of components which are higher in the hierarchy accrue to those components which are lower in the hierarchy. The information components can be searched and retrieved according to all attributes of the actual information, as well as the characteristics of the component and relationships between components. Thus, the information component management system of the present invention is not limited to simple storage, searches and retrieval of textual data only, but instead preserves all aspects of the original source of information. The principles and operation of the information component system according to the present invention may be better understood with reference to the drawings and the accompanying description. It should be noted that the following description will make reference to the Java computer programming language and to related software architecture, it being understood that this is for the sake of clarity only and is not meant to be limiting in any way.
The detailed description of the system of the present invention will be divided into four chapters. The first chapter describes various background art technologies which are the preferred support technologies for the system of the present invention. These technologies are described as "background art" because they are not fulfilling the same functions as the system of the present invention, but instead are merely enabling these functions. These technologies are given as examples only and are not intended to be limiting in any way.
The second chapter provides a brief overall view of the entire system according to the present invention. The third chapter describes an exemplary implementation of the present invention with objects in an XML environment, and the fourth chapter describes an exemplary implementation of the present invention with Java Bean objects in a CORBA environment.
Chapter I: Background Art Technologies
The background art technologies which are described in this chapter are well known in the art. The description provided herein is not intended to be exhaustive, but rather to describe those aspects of the background art technologies which are optionally implemented to support the management system of the present invention. Thus, one of ordinary skill in the art could easily use these background technologies in combination with the teachings of the present invention, without requiring undue experimentation. The preferred background art technologies which are described herein include XML, described in section 1; CORBA and a particular proprietary embodiment of CORBA, described in section 2; and the Java Bean component architecture, described in section 3.
Section 1. XML
One exemplary technology for the implementation of the present invention is XML with ActiveX™ objects as the front end, such that the information components of the present invention are preferably accessed through XML. The acronym "XML" stands for "Extensible Markup Language". XML is a document markup language which was designed to have greater functionality than HTML (hypertext markup language). Documents written in HTML can, however, be converted into XML.
A document written in XML is a collection of XML elements, which can be images or sections of text, for example. The document itself features elements which are indicated with "tags". These elements have logical values. Each element can also have "child elements", which are other elements to which a reference is made in that element, known as the "parent element". This element structure is a hierarchical tree which enables complex elements to be composed of multiple simpler elements.
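The parent and child element structure can be seen by parsing a small document with the Python standard library module `xml.etree.ElementTree`. The sample document and its tag names (`article`, `photo`, `face` and so on) are invented for illustration, as is the helper function `outline`.

```python
import xml.etree.ElementTree as ET

# Invented sample document: an article containing a photo, which contains a face.
DOC = """<article>
  <title>Team wins final</title>
  <photo caption="Group photo">
    <face name="player-1"/>
  </photo>
  <paragraph>Match report text.</paragraph>
</article>"""


def outline(element, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
tree_outline = outline(root)
print("\n".join(tree_outline))
```

Each level of indentation in the printed outline corresponds to one parent and child step in the element tree.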
Documents written in XML optionally feature a DTD (Document Type Definition), which is either included within the XML document, or alternatively is a separate but associated document. The DTD contains the rules according to which the XML document should be interpreted, such as the declarations for the structures of the elements within the XML document. Hereinafter, the term "HTML document" will refer to a document written in HTML, and the term "XML document" will refer to a document written in XML. The option of including the DTD both increases the flexibility of XML and enables documents written in XML to be validated, in order to ensure that these XML documents conform to the rules in the DTD.
Unfortunately, one drawback of the very flexibility of the DTD structure is that different XML documents may have different associated DTDs, such that similar elements in different files may be described with different "tags". In addition, meta data fields could have different tag names for the same fields. Attempting to locate information within these different documents according to the tag names could therefore be very difficult.
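One simple way to cope with differing tag names is to map each known variant onto a canonical field name before searching. The sketch below is only an assumption about how such a lookup might work; the synonym table `TAG_SYNONYMS`, the function `find_field` and the sample tag names are hypothetical and are not taken from the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical table mapping DTD-specific tag names to canonical field names.
TAG_SYNONYMS = {
    "headline": "title",
    "heading": "title",
    "byline": "author",
    "writer": "author",
}


def find_field(xml_text, canonical):
    """Return the text of every element whose tag maps to the canonical name."""
    root = ET.fromstring(xml_text)
    values = []
    for element in root.iter():
        # Unknown tags map to themselves, so canonical names also match.
        if TAG_SYNONYMS.get(element.tag, element.tag) == canonical:
            values.append((element.text or "").strip())
    return values


doc_one = "<doc><headline>Budget approved</headline><byline>A. Reporter</byline></doc>"
doc_two = "<story><title>Budget approved</title><writer>A. Reporter</writer></story>"

print(find_field(doc_one, "title"), find_field(doc_two, "author"))
```

With such a table, a single query for "title" or "author" reaches both documents even though their DTDs name the elements differently.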
An additional useful feature of XML is the more powerful linking structure available. The links of XML are compatible with those of HTML. However, in addition, XML allows any type of element to act as a link. Furthermore, the start or the finish of an XML link does not need to be located within one of the documents which is being linked. In other words, XML links could be located in a document which is entirely separate from the two linked documents. This enables two documents to be more easily linked after both documents have been created, without altering either document. There are several different types of XML links, including simple and extended. A simple link is similar to the link of HTML, in that it is unidirectional and has only one locator. An extended link, by contrast, can have more than one locator, such that the extended link can "point" to more than one target resource. Furthermore, extended links can also be bidirectional or multidirectional. As noted previously, these extended links may be located in a separate file, external to the XML document, and can therefore be very difficult to manage. For example, when a linked file is deleted or otherwise removed, the extended link list is not amended, potentially leading to a broken link.
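The broken link problem can be made concrete with a small sketch that models an external list of extended links and checks each locator against the set of resources that still exist. The `ExtendedLink` class, the `broken_links` function and the file names are hypothetical illustrations, not part of any XML linking API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class ExtendedLink:
    """Hypothetical extended link: one label, several target locators."""

    label: str
    locators: Tuple[str, ...]


def broken_links(links: Iterable[ExtendedLink], existing: Set[str]) -> List[ExtendedLink]:
    """Return the links that point to at least one missing resource."""
    return [
        link
        for link in links
        if any(locator not in existing for locator in link.locators)
    ]


link_list = [
    ExtendedLink("report-to-figures", ("report.xml", "figures.xml")),
    ExtendedLink("report-to-appendix", ("report.xml", "appendix.xml")),
]

# appendix.xml has been deleted, but the external link list was not amended.
resources = {"report.xml", "figures.xml"}
print([link.label for link in broken_links(link_list, resources)])
```

Running such a check whenever a file is removed is one way an external link list could be kept consistent with the documents it refers to.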
XML links can also point to target resources which are fragments of a document. The boundaries of these fragments can be determined through either static "chunking" or dynamic "chunking". For static chunking, the XML file is manually divided into pieces, with a new XML file for each piece which is then linked to the main XML file. For dynamic chunking, the XML file is not divided into new files. Rather, separators are placed within the XML file to indicate boundaries for chunks. These separators can be used to define a portion of a document which is to be retrieved, such that the fragments are separated and served "on the fly". Although it is automatic, dynamic chunking has the disadvantage of significantly increasing server overhead as the server determines which fragment is to be served, such that the XML server may become overloaded.
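Dynamic chunking can be sketched by treating a wrapper element as the separator and extracting the requested fragment at request time. The `chunk` element, its `id` attribute and the function `serve_chunk` are assumptions made for this example; the source does not specify what form the separators take.

```python
import xml.etree.ElementTree as ET

# Invented document in which <chunk> elements act as the separators.
DOC = """<newsletter>
  <chunk id="intro"><p>Welcome.</p></chunk>
  <chunk id="prices"><table><row>1</row></table></chunk>
</newsletter>"""


def serve_chunk(xml_text, chunk_id):
    """Parse the whole file and return only the requested fragment, or None."""
    root = ET.fromstring(xml_text)  # the full document is parsed on every request
    for chunk in root.iter("chunk"):
        if chunk.get("id") == chunk_id:
            return ET.tostring(chunk, encoding="unicode").strip()
    return None


print(serve_chunk(DOC, "intro"))
```

Because the whole file is parsed for each request, the sketch also shows where the additional server overhead of dynamic chunking comes from.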
XML documents also have style sheets, which feature construction rules describing how each element should be displayed. For example, if the element is a paragraph of text, the construction rule may indicate font size and type, the extent of the indentation of the first line, spacing between lines of the paragraph and so forth.
Further information on the XML language is widely available, for example in instructional manuals (Part VIII of HTML 4 Unleashed - Professional Reference Edition, Rick Darnell et al, Sams.net, Indianapolis, Indiana, USA, 1998).
Section 2. CORBA
According to an alternative embodiment of the present invention, both the information components of the present invention and the management system for these components are compliant with the CORBA (Common Object Request Broker Architecture) standard, which is a standard for communication between distributed objects established by OMG (Object Management Group). OMG is a consortium of over 700 different software developers. Thus, standards developed by OMG are industry-wide and software applications compliant with these standards should be able to successfully interact with other compliant applications, as described below.
CORBA provides a standard method for execution of program modules in a distributed environment, regardless of the computer programming language in which the modules are written, or the computing platform on which they are executed.
CORBA enables complex systems to be built, integrating many different types of computing platforms within an entire business, for example.
In order to permit different software applications to communicate, regardless of programming language, hardware or operating system, all such applications communicate through a CORBA-compliant ORB (Object Request Broker). Each application is an "object" with a particular interface through which communication is enabled. The ORB acts as the "middle-man", passing information and requests for service to each object as necessary. Thus, one software application does not need to understand or know the interface used by another object, since all communication occurs through the ORB. Furthermore, the use of an ORB permits true distributed computing, since different objects do not need to be operated by the same computer or even reside on the same network. The ORB directs any communication to the appropriate server which contains the object, which might be located on the same host, or a different host, as the client object. The ORB then redirects the results back to the client object. Thus, CORBA can also be described as an "object bus" because it is a communications interface through which objects are located and accessed.
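The "middle-man" role of the broker can be illustrated with a toy, in-process sketch. It is not CORBA: there is no interface definition language, no network transport and no real ORB library involved, and the names `ObjectRequestBroker`, `register`, `invoke` and `PrintService` are hypothetical. The only point shown is that the client names an object and a method, and the broker locates the object and returns the result.

```python
class ObjectRequestBroker:
    """Toy stand-in for an ORB: locates named objects and forwards calls."""

    def __init__(self):
        self._objects = {}

    def register(self, name, obj):
        """Make an object reachable under a name known to clients."""
        self._objects[name] = obj

    def invoke(self, name, method, *args):
        """Locate the named object, call the method, and return the result."""
        target = self._objects[name]
        return getattr(target, method)(*args)


class PrintService:
    """Hypothetical server-side object with a single operation."""

    def render(self, document_name):
        return "printed:" + document_name


orb = ObjectRequestBroker()
orb.register("print-service", PrintService())

# The client holds no reference to PrintService; it only talks to the broker.
print(orb.invoke("print-service", "render", "newsletter"))
```

In a real CORBA system the registered object could live on a different host, and the broker would marshal the request and the reply across the network.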
In addition, CORBA provides HOP (Intemet Inter-ORB Protocol), which is the CORBA message protocol for communication on the Internet. HOP links GIOP (CORBA's General Inter-ORB Protocol) to TCP. IP, the general communication protocol of the Intemet. GIOP in turn specifies how one ORB communicates with another ORB. These two types of protocols were implemented to enable different propπetary ORB implementations to communicate over the Internet. Therefore, one type of propπetary ORB can communicate with another, different type of propπetary ORB on a different host computer according to a combination of IIOP and GIOP protocols Practically speabng, if IIOP is built into a Web browser such as Netscape™ Navigator™, a Java applet is downloaded into the Web browser when the user accesses a Web page with a CORBA-compatible object. The Java applet invokes the ORB to first pass data to the object, then to execute the object and finally retπeve the results. Further information on both CORBA and IIOP can be obtained from the "Tech Web Technology Encyclopedia" (http://www techweb.com/encyclopedia as of September 10, 1997)
One proprietary version of the CORBA technology for enabling distributed web-based applications is called the Web Request Broker (WRB), developed by Oracle Corp. (Redwood Shores, California, USA). WRB is described in a white paper (M. Anand et al., "The Web Request Broker: a Framework for Distributed Web-based Applications", http://www.olab.com/www6_l/paper.html as of September 10, 1). Briefly, the WRB architecture includes the dispatcher, application and system cartridges, and a CORBA-compliant ORB. The dispatcher and cartridges use the ORB for communication between components, so that these components can be distributed on separate remote machines. The dispatcher routes requests from the HTTP daemon to the appropriate cartridge. The cartridges are software components which perform a specific function and are thus the "objects" described previously. Cartridges are used within the system of the present invention as an exemplary support for a number of different functions, as described in subsequent sections. Cartridges have a name, composed of the IP address of the server where the cartridge is located, and the virtual path to the location of the cartridge on that server. Cartridges also have a standard interface, which includes a number of methods. Examples of such methods include the authenticate routine, which determines whether the client is entitled to requested services, and the exec routine, which receives the particular service request if the authentication routine is successfully performed. Thus, the cartridge technology provides a fully developed basis for the creation of particular software functionality.
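The dispatcher and cartridge interaction described above can be illustrated with a minimal Python sketch. It is not the WRB API: the class names `Cartridge` and `Dispatcher`, the `register` and `route` methods, the client names and the example address are all hypothetical, and the ORB transport is omitted entirely. Only the authenticate and exec routines are taken from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class Cartridge:
    """Hypothetical stand-in for a cartridge exposing authenticate and exec."""
    name: str                          # server IP address plus virtual path
    handler: Callable[[str], str]      # the specific function the cartridge performs
    allowed_clients: Set[str] = field(default_factory=set)

    def authenticate(self, client: str) -> bool:
        # Decide whether the client is entitled to the requested service.
        return client in self.allowed_clients

    def exec(self, client: str, request: str) -> str:
        # The service request is handled only if authentication succeeds.
        if not self.authenticate(client):
            raise PermissionError(f"{client} may not use {self.name}")
        return self.handler(request)


class Dispatcher:
    """Routes a request to the cartridge registered under a virtual path."""

    def __init__(self) -> None:
        self._cartridges: Dict[str, Cartridge] = {}

    def register(self, virtual_path: str, cartridge: Cartridge) -> None:
        self._cartridges[virtual_path] = cartridge

    def route(self, virtual_path: str, client: str, request: str) -> str:
        return self._cartridges[virtual_path].exec(client, request)
```

For example, registering `Cartridge("10.0.0.1/upper", str.upper, {"alice"})` under the path `/upper` lets the client `alice` obtain an upper-cased copy of a request string, while any other client is refused.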
One particular advantage of employing the proprietary cartridge technology for software development is that the system architecture provides a framework for interaction between different objects over the Internet by using HTTP Web servers and existing Web browsers. The CORBA protocols only define a standard, but do not provide any specific implementation. Thus, the proprietary cartridge technology enables one of ordinary skill in the art to develop a software application which can communicate with other applications over the World Wide Web.
Section 3: Java Bean
Another type of enabling background art technology is "Java Bean". Java Bean is a component software architecture which operates in the Java programming environment. Java, of course, is an interpreter-driven, object-oriented computer programming language which is substantially platform-independent. Software packages which are written in Java can be operated by any operating system, or platform, which supports the Java interpreter. Similarly, a Java Bean component can run remotely and independently as a discrete software application object in a distributed computing environment using either the Remote Method Invocation protocol of Sun Computers Inc., or else by using CORBA. As described below, information components are preferably packaged and then distributed as independent Java Bean components.
The Java Bean component software architecture is a set of APIs (Application Programming Interfaces) and rules which enable software developers to define software components to be dynamically combined to create a software application. The Java Bean component model has two major elements: components and containers.
Components range in size and capability from small GUI (graphic user interface) widgets such as a button, to an applet-sized functionality such as a tabular viewer, and even to a full-sized application such as an HTML (HyperText Mark-up Language) viewer or the information component of the present invention. Components can have a visual aspect, such as a button, can actually be visual information or can be non-visual, such as a data-based monitoring component.
Containers hold an assembly of related components. Containers provide the context for components to be arranged and interact with each other. Containers are occasionally referred to as "forms", "pages", "frames" or "shells". Containers can also be components, so that a container can be used as a component inside another container.
The Java Bean component model provides the following major types of services: component interface exposure and discovery; component properties; event handling; persistence; application builder support; and component packaging. Component interface exposure and discovery allows components to expose their interface so that they can be driven dynamically by calls and event notifications from other components or application scripts. Component properties are the public attributes of a component which either directly reflect or affect the current state of that component. For example, properties could include the "foreground color" of a video clip, its zoom factor or its access rights. The state of these properties can be interrogated or modified through standard mechanisms.
Event handling is the mechanism for components to "raise" or "broadcast" events and have those events delivered to the appropriate component or components which need to be notified. Typically, notified components then perform a particular function in response. For example, if the user interface shows a document image clip on the monitor screen, the Parent Information Object event will communicate with the Object Server to transmit the full page of the clip, and will send a viewing command to the full-page viewer component. Thus, event handling allows information components to interact with each other. Persistence is the mechanism for storing the state of a component in a non-volatile location. The component state is stored in the context of the container and in relation to other components. For example, if the user wants to save the viewing zoom factor for all of the following documents, the persistence mechanism would support this.
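The three services of properties, event handling and persistence can be illustrated with a short Python analogy. This is a sketch only and is not the Java Bean API: the `Component` class, its method names, and the use of JSON as the stored form are assumptions made for illustration, with the zoom factor taken from the example above.

```python
import json
from typing import Any, Callable, Dict, List

# A listener receives (component name, property name, old value, new value).
Listener = Callable[[str, str, Any, Any], None]


class Component:
    """Toy component with observable, persistable properties."""

    def __init__(self, name: str, **properties: Any) -> None:
        self.name = name
        self._properties: Dict[str, Any] = dict(properties)
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        # Register another component's callback for change notifications.
        self._listeners.append(listener)

    def get(self, key: str) -> Any:
        # Interrogate the current state of a property.
        return self._properties[key]

    def set(self, key: str, value: Any) -> None:
        # Modify a property, then broadcast the change to every listener.
        old = self._properties.get(key)
        self._properties[key] = value
        for listener in self._listeners:
            listener(self.name, key, old, value)

    def save(self) -> str:
        # Persistence: serialize the component state to text.
        return json.dumps({"name": self.name, "properties": self._properties})

    @classmethod
    def load(cls, text: str) -> "Component":
        # Restore a component from previously saved state.
        state = json.loads(text)
        return cls(state["name"], **state["properties"])
```

A viewer component created with `Component("viewer", zoom=1.0)` would notify its listeners when `set("zoom", 2.5)` is called, and the saved zoom factor survives a `save` followed by `load`.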
Application builder support interfaces enable components to expose their properties and behaviors to application builder development tools. Using these interfaces, the tools can determine the properties and behaviors, or events and methods, of arbitrary components. The tools can provide mechanisms such as tool palettes, inspectors and editors, which the application developer uses to assemble an application. Through these mechanisms, the application developer can modify the state and appearance of components as well as establish relationships between components.
This mechanism enables sophisticated information applications such as HyperText links to be created. For example, using an appropriate multimedia tool, the user can define a button which appears on the viewed document, and then links the document to a different document. The application developer will use property editors to specify the appearance, including size, color and label, of the button, the link type and the link target.
Since Java Bean components can be distributed and independently deployed over a network, there is a need to provide a facility to physically "package" the resources which are included in an information component so that they are accessible to the other Java Bean components. Preferably, such packaging is performed with the JAR (Java Archive) file format. The JAR file format enables the class file of the information component and other information component resources such as images, OMS (object mapping structure), sounds, and link information, to be packaged as a single physical entity for distribution.
Chapter II: Information Component Architecture
This chapter provides an overall view of the information component architecture and system of the present invention. Section 1 provides a general description of information components. Section 2 describes the system of the present invention. Section 3 describes the information component content capturer in more detail. Section 4 describes the information component identifier. Section 5 describes the information component contributor. Section 6 describes the information component server. Section 7 describes the information component publisher and client.
Section 1: Information Component
Each information component has a number of different elements and properties.
Each information component belongs to an information class. The information class defines the properties and operations of a group of information components. Information classes can describe a newspaper, a general document or a video clip, for example.
Referring now to the drawings, Figures 1A-1C are illustrations of exemplary information components, each of which can be placed in different classes. Figure 1A is a general description of an exemplary document 10, showing the hierarchical structure. Document 10 is in turn subdivided into a number of page components 12, of which four are shown for the purposes of illustration only. Page component 12 is a member of the page class, which stores properties related to the structure of a page of document 10. These properties include textual information, structural information and any links to other components. The operations, or methods, include retrieving the textual information, for example. Thus, the operations are used to store, retrieve or modify information contained in the properties of components which are members of the page class.
Every information component which is a page component 12, and hence which belongs to the page class, may share certain repeated structural features. These features are also examples of information components, and are described as "shared information components". Every page component 12 can include these shared information components in order to maintain a uniform structure between pages, for example, and to decrease storage space for such repetitive features. As shown in Figure 1A, these shared information components for page component 12 include a footer component 14, a header component 16 and a logo component 18. These are intended as examples only and are not meant to be limiting in any way.
Header component 16 could be a title, such as "Document Report", or any other desired information. Footer component 14 could feature a page number, a date, or any other desired information. Logo component 18 would be the logo for the particular company which is producing document 10, for example. These three information components are shared between all page components 12 which are shown. One advantage of subdividing each page component 12 into various information components is that shared information components, such as footer component 14, header component 16 and logo component 18 for example, only need to be stored once on the storage medium or media which holds the information components. Thus, significant savings can be realized in terms of storage space required by sharing repetitive elements between information components.
Each page component 12 also includes one or more information components which are not shared at all, or which are only shared with certain other page components. For example, page component 12 which is labeled "Page 1" includes a summary section component 20, which is a member of the summary class. The summary class could feature text and/or images which summarize an earlier portion of document 10, for example. As illustrated in Figure 1A, summary section component 20 is only included within page component 12.
The "Page 1" page component 12 also features a "Chapter 1" component 22, which is shared by the "Page 2" and "Page 3" page components 12. By contrast, the "Page 4" page component 12 features a "Chapter 3" component 24. Summary section component 20, "Chapter 1" component 22, and "Chapter 3" component 24 are all further subdivided into a plurality of paragraph components 26, which are members of the paragraph class. As their name suggests, paragraph components 26 contain the information related to each paragraph, which may include text for example. The text in each paragraph component 26 is contained within a text component 28 as shown, which belongs to the text class.
In addition, paragraph component 26 can optionally include an image component 30 and a table component 32 as shown. Image component 30, which belongs to the image class, stores an image, while table component 32 stores a table and belongs to the table class. These three components, text component 28, image component 30 and table component 32, are examples of information component primitives. An information component primitive is the most basic unit of information components, such that the primitive is no longer divisible into information components which are lower in the hierarchical structure. Like other information components, information component primitives are preferably potentially able to be shared between information components. For example, a table of data, which is an example of table component 32, could be included both in summary section component 20 and "Chapter 1" component 22. As for the shared information components described earlier, shared information component primitives also only need to be stored once in order to be available to other information components.
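The hierarchy and the "stored once" property of shared components can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the storage scheme of the invention: the `InformationComponent` and `ComponentStore` names, the string identifiers, and the choice to hold children by identifier are all hypothetical, and the paragraph level of Figure 1A is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InformationComponent:
    component_id: str                  # hypothetical unique identifier
    info_class: str                    # e.g. "page", "summary", "table"
    content: str = ""                  # payload, used only by primitives here
    child_ids: List[str] = field(default_factory=list)


class ComponentStore:
    """Holds each component once; parents refer to children by identifier."""

    def __init__(self) -> None:
        self._by_id: Dict[str, InformationComponent] = {}

    def add(self, component: InformationComponent) -> str:
        # setdefault keeps the first stored copy, so a shared component
        # is never duplicated however many parents include it.
        return self._by_id.setdefault(component.component_id, component).component_id

    def get(self, component_id: str) -> InformationComponent:
        return self._by_id[component_id]

    def primitives_of(self, component_id: str) -> List[str]:
        # A component with no children is treated as a primitive.
        component = self._by_id[component_id]
        if not component.child_ids:
            return [component.component_id]
        found: List[str] = []
        for child_id in component.child_ids:
            found.extend(self.primitives_of(child_id))
        return found

    def __len__(self) -> int:
        return len(self._by_id)
```

With this arrangement, a table primitive referenced from both a summary section and a chapter occupies a single entry in the store, while each parent still resolves it through `primitives_of`.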
Figures 1B and 1C show portions of certain specific examples of information components, shown in terms of an exemplary class structure, it being understood that this is for the purposes of description only and is not meant to be limiting in any way. As shown in Figure 1B, a newspaper information component belongs to a newspaper class 34, which defines the properties and operations of components which contain newspaper pages. Newspaper class 34 has an article class 36 for an individual newspaper article. Article class 36 inherits the properties of the parent class, newspaper class 34. In addition, article class 36 may have additional properties and methods, such as the coordinates of the location of the article within the newspaper page, or an operation for retrieving the name of the author of the article. A column class 38 is shown for a column, while an image class 40 is also shown for a picture. For example, image class 40 might have information about pictures which are associated with the article. Column class 38 might contain information about the structure of the column which contains the article. Column class 38 and image class 40 are related to article class 36 according to a defined set of relationships.
Figure 1C shows an exemplary video clip information class 42 which contains information such as data and structure for a segment of recorded video. A video stream information class 44 is the highest level class for the hierarchy. A video clip information class 46 is next in the hierarchy, followed by a frame class 48. Frame class 48 might contain only information regarding a single frame of the video. Thus, even though a video may be considered as a sequential collection of images which give the illusion of movement, it too can be broken down into smaller elements which are then stored in the above-mentioned information classes.
Section 2: General System Architecture
This section provides an overview of the general architecture of the management system of the present invention, as well as of the interactions between the four main elements of this system. The specific functions of each element will also be described in successive sections below.
Figures 2A and 2B show the general architecture of the system of the present invention. As shown in Figure 2A, a general system architecture 50 includes IC Contributor 60, IC Server 62, IC Search Engine 63, and IC Publisher 65. As shown in Figure 2B, IC Contributor 60 further features IC (Information Component) Content Capturer 52, IC Knowledge Base 54, IC Rules Editor 56, and IC Identifier 58.
Although each element will be described in further detail below, briefly IC Content Capturer 52 is responsible for the acquisition and conversion of information content, and for the transmission of the converted information content to IC Identifier 58. IC Identifier 58 then identifies information components according to certain rules and to class information stored in IC Knowledge Base 54. Both the rules and the class information can be added, removed or otherwise altered with IC Rules Editor 56.
Once the information components have been identified, the original document and the identified information components are transmitted from IC Contributor 60 to IC Server 62. IC Server 62 then stores and manages the actual or "original" information such as documents, multimedia objects and other types of information entities, as well as managing the information components themselves. Information components are made available from IC Server 62 by a request through IC Search Engine 63, and are then published by IC Publisher 65. Thus, the general system of the present invention collects the information from a variety of sources, packages the information into information components, and then stores the components for later retrieval by a client application.
Section 3: Information Component Content Capturer
This section describes the IC (information component) Content Capturer, which is shown in Figure 2B, as part of IC Contributor 60.
IC Content Capturer 52 preferably operates as memory resident software and captures the desired information content from a variety of software systems including, but not limited to, a document editor 64 such as the Word product of Microsoft™, a media application 66 including, but not limited to, the Adobe™ Acrobat™ reader for reading PDF files from Adobe™ Acrobat™, a facsimile machine software application 68 for operating a facsimile machine, and a Web browser software application 70 such as Netscape™ Navigator™. Additional software systems from which information content can be captured include imaging software and spreadsheet software. These software systems are intended as illustrative examples only, since substantially any software system which handles, stores, retrieves or manipulates information could have that information captured by IC Content Capturer 52.
IC Content Capturer 52 invokes the appropriate software drivers for handling different information formats from the above software systems. As an example, information could be captured from a document stored in the format of Microsoft™ Word™ word processing software. A number of possible methods could be used to capture the information contained within the document, two illustrative examples of which are given here, it being understood that these are for discussion purposes only and are not meant to be limiting. In the first exemplary implementation, IC Content Capturer 52 interacts with Microsoft™ Word™ and instructs Microsoft™ Word™ to place the document on the "clipboard". The "clipboard" is a feature of a number of different computer operating systems, in particular those operating systems of Microsoft Inc. (Seattle, Washington, USA), such as "Windows95™" and "Windows NT™", for example. The general function of the "clipboard" is to enable one software application, such as Microsoft™ Word™, to make information available to another software application, such as IC Content Capturer 52. Hereinafter, the term "clipboard" refers to any feature of a computer operating system which enables information to be exchanged between two software applications. Once the document has been copied to, or placed on, the clipboard, the document is then pasted to IC Content Capturer 52.
In the second exemplary implementation, IC Content Capturer 52 captures the necessary information about the document through substantially direct interaction with the software system, such as Microsoft™ Word™. Such interaction can be performed according to a number of different methods. For example, Microsoft™ Word™ enables other software applications to obtain this information through the creation of a "macro". Alternatively, IC Content Capturer 52 could include a printer driver, which would enable Microsoft™ Word™ to "print" the document to IC Content Capturer 52 directly, or alternatively to a file in a format accessible by IC Content Capturer 52. In any case, regardless of the specific method employed, the content of the information is obtained from the captured information by using a particular software driver. Each software driver is relevant to the particular information source format, such as electronically scanned paper document, electronic document such as a word processing document, video clip, document sent by facsimile and other such formats. Each driver is a channel to an information processing unit for a specific type of information, and invokes a process specific to the source of that information. Finally, the content information is stored in an internal unified format for data processing and information component recognition, access and retrieval. The information in the unified internal file format is then sent to IC Identifier 58.
Section 4: Information Component Identifier and Knowledge Base
IC Identifier 58 automatically identifies and creates information components from the information passed from IC Content Capturer 52 according to rules stored in IC Knowledge Base 54. As an example, preferably the information is first analyzed to extract the information component primitives, which as described previously form the most basic unit of information. These primitives include text, images, vector graphics and other such basic units of information. Next, the information components themselves are constructed from the information component primitives, and the relationships between components are determined according to rules stored in IC Knowledge Base 54. Finally, the information components are classified, again according to rules stored in IC Knowledge Base 54. This classification determines security attributes, indexing rules and other publishing parameters. At the end of this stage, the information is transferred to IC Server 62.
Figure 3A shows a portion of IC Contributor 60 in more detail, focusing on those components which interact with IC Identifier 58. IC Identifier 58 has three layers, including a primitive identifier 64, a component constructor 66 and a component classifier 68.
Primitive identifier 64 examines the received information at two levels. First, the textual information is identified and separated into individual elements, according to the structure of the type of information. The second level of examination of the received information is visual identification, which includes determining the visual attributes and structure of the information. At the end of this dual level examination, the information component primitives have been identified according to rules stored in IC Knowledge Base 54.
The information component is then constructed from one or more information component primitives and/or from one or more information components which are lower in the hierarchy, by component constructor 66. The information component includes such information as the identity of the primitive(s) or lower information component(s) from which it is constructed. In addition, the relationships between components are determined by component constructor 66 according to rules stored in IC Knowledge Base 54. An illustrative example of this process is disclosed in U.S. Application No. 08/318,044, herein incorporated by reference. The disclosed process includes the following steps. First, the document is converted into a digital raster format, for example by scanning a paper document, which is stored in an electronic file. This step is preferably performed by IC Content Capturer 52. Next, preferably the converted document is enhanced to improve the quality of the image, for example.
In the third step, the enhanced raster format file is converted into two electronic files, collectively called a "binary/raster file". The first file has the enhanced raster format, while the second file has pointers to the enhanced raster format file. Every data element in the raster format file, such as textual information or an entire graphic image, could have a corresponding pointer in the second file. Thus, the two files are preferably produced, at least in part, by an automatic text recognition process such as OCR, which enables the image of the text to be realized as textual data. The information is then stored as information components composed of information primitives, as previously described. Once all of the data has been placed in a database as information components, indices for information retrieval are created. Thus, the original document has been subdivided and stored as a collection of information components.
These information elements preferably include a raster image of the document, a pointer to the storage location of the original document, any text contained within the document and the coordinates of the words of the text within the document. More specifically, the coordinates preferably include all information which is necessary to geographically locate the word within the document, such as the number of the page on which the word falls, the number of the word on the page and the coordinates of the rectangle which bounds the word on the page, or "bounding rectangle". The bounding rectangle determines the area occupied by the word on a page and is necessary to fully reproduce the visual aspects of that word. Thus, the coordinates of each word numerically describe the visual appearance of the word.
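The word coordinates described above can be represented by a small record, sketched here in Python. The field names, the top-left coordinate origin and the helper `words_in_region` are assumptions made for illustration; the text specifies only that the page number, the word number and the bounding rectangle are recorded.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingRectangle:
    # Assumed convention: origin at the top-left corner of the page.
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top

    def contains(self, other: "BoundingRectangle") -> bool:
        # True when the other rectangle lies entirely inside this one.
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass(frozen=True)
class WordLocation:
    text: str
    page_number: int        # page on which the word falls
    word_number: int        # ordinal position of the word on that page
    box: BoundingRectangle  # area occupied by the word on the page


def words_in_region(words: List[WordLocation], page_number: int,
                    region: BoundingRectangle) -> List[WordLocation]:
    """Return the words on one page whose boxes lie fully inside a region."""
    return [w for w in words
            if w.page_number == page_number and region.contains(w.box)]
```

A lookup of this kind is one way in which stored coordinates could support selecting the words that fall inside a visually defined area of a page, such as a header band.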
As an example of this process of obtaining text and coordinates, if the source of information is a paper document which has been scanned to an electronic file, IC Content Capturer 52 performs OCR (Optical Character Recognition) to obtain the textual information from the image stored in the electronic file by converting the image of a letter into the letter itself. Both the text itself and the coordinates of individual words are then available. Other examples of such processes include pattern recognition and PDF conversion. It should be noted that these processes are already well known in the art for the creation and manipulation of information in a particular information source format.
The information elements which are produced are then identified according to the type of information component primitive which they represent, which is in turn determined according to rules in IC Knowledge Base 54. For example, every individual image identified in the steps above would be determined to be an image information component primitive. Similarly, the text extracted in the steps above would be determined to be a text information component primitive according to information stored in the textual database. Other information component primitives could also be identified from the collection of information elements. After the information component primitives have been identified, the primitives are used to construct information components, according to rules stored in IC Knowledge Base 54. For example, in document component 10 of Figure 1A, the information primitives include image information component primitive 30 and text information component primitive 28. These primitives are in turn used to build paragraph 26. Paragraph 26 now contains information concerning not only the inclusion of one or more image information component primitives 30, for example, but also such information as the relative geometrical location of the primitive within paragraph 26. The geometrical location of the primitive was determined when the primitive itself was identified, for example as described above. Thus, the primitives are first assembled in information components which are relatively lower in the hierarchy, for example paragraph 26, and then these components are in turn assembled into information components which are higher in the hierarchy.
Either after the individual components have been constructed, or substantially simultaneously during the construction process, each individual information component is classified according to rules in IC Knowledge Base 54 by component classifier 68. The individual component is compared to components listed within the knowledge base, and is recognized as a unique and individual element belonging to a larger information cluster. Each component is classified first by assignment to a primary information class, and then by placement within the hierarchical structure of information sub-classes belonging to that primary class.
As shown with reference to Figure 3B, which shows a schematic block diagram of an exemplary IC Knowledge Base 54, first the document class for the information component is determined. The different document classes are stored within IC Knowledge Base 54 in a document class table 70. For example, the document could be a research report, newspaper, or substantially any other type of document which has been placed within document class table 70.
Next, the information component would be classified according to a particular information component class stored in an IC class table 72. These classes could include, but are not limited to, a logo, a main title, publishing information, summary, and so forth. Each class is in turn identified according to rules stored in a rules table 74. These rules are composed of tokens, including constants 76, functions 78 and operations 80. Each rule could optionally be stored in a "flat (text) file" for example, in which case the tokens would preferably be stored as text strings separated by spaces. Of course, many other options are also available for storing these rules. Each rule preferably includes the following tokens in the following order, although of course other rule structures could be used: the name of the IC class, the hierarchy level of the information component, the font type, the size range, the color, the case of the letter, the location of the page on which the information component is found and the text which the information component should contain (if any). The rule does not necessarily need to include all of these tokens; an absent token can be indicated by a place-holding character such as a "slash" ("/"), for example.
One example of such a rule is "PageNo -1 Helvetica-Bold 11.0-11.05 / / B /". According to this rule, an information component, named PageNo, is identified by any text of any color in any letter case, in font Helvetica-Bold, any size between 11 and 11.05, which is located at the bottom of the page (indicated by the letter "B").
As another example, consider the rule "Section 1 TimesNewRoman/TimesNewRoman-Bold 9.50-10.50 / / / /". According to this rule, an information component, named Section, is identified by any text of any color in any letter case, in font TimesNewRoman or TimesNewRoman-Bold, any size between 9.50 and 10.50, which is located anywhere on the page.
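The two example rules can be parsed and applied with a short Python sketch. It follows the token order given above (class name, hierarchy level, font, size range, color, letter case, location, text) and treats "/" as an absent token. The names `Rule`, `parse_rule` and `matches` are hypothetical, as are the assumptions that alternative fonts are separated by a slash inside the font token, that the text token contains no spaces, and that a location code other than "B" (here "T") exists.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ABSENT = "/"  # place-holding character for a token that is not specified


@dataclass(frozen=True)
class Rule:
    ic_class: str
    level: int
    fonts: Optional[Tuple[str, ...]]
    size_range: Optional[Tuple[float, float]]
    color: Optional[str]
    case: Optional[str]
    location: Optional[str]
    text: Optional[str]


def _optional(token: str) -> Optional[str]:
    return None if token == ABSENT else token


def parse_rule(line: str) -> Rule:
    """Parse one space-separated rule line into its eight tokens."""
    tokens = line.split()
    if len(tokens) != 8:
        raise ValueError(f"expected 8 tokens, got {len(tokens)}: {line!r}")
    name, level, font, size, color, case, location, text = tokens
    fonts = None if font == ABSENT else tuple(font.split("/"))
    size_range = None
    if size != ABSENT:
        low, high = size.split("-")
        size_range = (float(low), float(high))
    return Rule(name, int(level), fonts, size_range,
                _optional(color), _optional(case),
                _optional(location), _optional(text))


def matches(rule: Rule, *, font: str, size: float, location: str,
            color: str = "", case: str = "", text: str = "") -> bool:
    """Check observed attributes against every token the rule specifies."""
    if rule.fonts is not None and font not in rule.fonts:
        return False
    if rule.size_range is not None:
        low, high = rule.size_range
        if not low <= size <= high:
            return False
    if rule.color is not None and color != rule.color:
        return False
    if rule.case is not None and case != rule.case:
        return False
    if rule.location is not None and location != rule.location:
        return False
    if rule.text is not None and rule.text not in text:
        return False
    return True
```

Under these assumptions, the PageNo rule accepts Helvetica-Bold text of size 11.0 at location "B" and rejects the same text elsewhere on the page, while the Section rule accepts either of its two fonts at any location.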
The rules and other information stored in IC Knowledge Base 54 are optionally and preferably edited through an IC Rules Editor 56. IC Rules Editor 56 is preferably a GUI (graphical user interface), which more preferably allows the user to define new rules, enter new information, delete old rules or information, and amend or alter rules or information.
Once the information components have been constructed and classified, they are passed to the IC Server for being served.
Section 5: Information Component Contributor
As shown in Figure 4, IC Contributor 60 also prepares the information components for publication and for storage in a database, such that the information components can be served to a client by IC Server 62. IC Contributor 60 features a component generator 82.
Component generator 82 transforms the classified information component into a standard format including, but not limited to, an active object format such as a COM object or a Java Bean object, or a flat file format such as DHTML. Generally, component generator 82 packages the classified information component according to the standard format, so that the packaged information component is accessible by IC Server 62. For the purposes of discussion only and without any desire to be limiting, descriptions of the transformation of the information component into two of these object-oriented formats are given in further detail below. In Chapter III, Section 2, a description is provided of the transformation of the information component into an object in an XML environment. In Chapter IV, Section 1, a description is provided of the transformation of the information component into a Java Bean object in a CORBA environment.
IC Contributor 60 is also able to render and to store information components as a DHTML document. In this embodiment, IC Contributor 60 converts data for each primitive of each information component into equivalent fragments in DHTML format. There can be two types of data elements in the source information component: graphic elements and text elements. The graphic elements are converted to raster images (in GIF format). The text elements are converted to a set of DHTML <DIV> blocks. There are two types of DHTML <DIV> blocks: a style block and a value block. The "style block" defines the style attributes of the text, such as font size and name, font-weight, font-style and color. The "value block" defines the position of the text element within the current primitive and its text value. When the text value contains more than one word, the text value is inserted into a <NOBR> block to prevent line breaking for the given text element by the web browser.
To minimize the size of the resulting DHTML fragment for each primitive, the DHTML fragment is optimized to ensure that each "style block" with specific characteristics appears only once in the DHTML fragment for the primitive.
Exact correspondence between the source document text style and the DHTML style is not always possible. In this case, the original fonts are preferably substituted by the fonts available to the Web browser, with possible modifications of font size.
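The conversion of text elements into style blocks and value blocks can be sketched as follows. This is only an illustration under stated assumptions: the names `Style`, `TextElement` and `render_primitive` are hypothetical, nesting the value blocks inside their style block is one possible layout that the text does not prescribe, and absolute pixel positioning is assumed for the value block.

```python
from dataclasses import dataclass
from html import escape
from typing import Dict, List


@dataclass(frozen=True)
class Style:
    """Style attributes named in the text: font name and size, weight, style, color."""
    font_name: str
    font_size_pt: float
    weight: str = "normal"
    font_style: str = "normal"
    color: str = "#000000"

    def css(self) -> str:
        return (
            f"font-family:{self.font_name};font-size:{self.font_size_pt}pt;"
            f"font-weight:{self.weight};font-style:{self.font_style};"
            f"color:{self.color}"
        )


@dataclass(frozen=True)
class TextElement:
    value: str
    x: int  # position within the current primitive (assumed to be pixels)
    y: int
    style: Style


def render_primitive(elements: List[TextElement]) -> str:
    """Emit one style block per distinct style, each holding its value blocks."""
    groups: Dict[Style, List[TextElement]] = {}
    for element in elements:
        groups.setdefault(element.style, []).append(element)

    parts: List[str] = []
    for style, members in groups.items():
        parts.append(f'<DIV STYLE="{style.css()}">')
        for element in members:
            text = escape(element.value)
            # Multi-word values are wrapped in <NOBR> to prevent line breaking.
            if len(element.value.split()) > 1:
                text = f"<NOBR>{text}</NOBR>"
            parts.append(
                f'<DIV STYLE="position:absolute;left:{element.x}px;'
                f'top:{element.y}px">{text}</DIV>'
            )
        parts.append("</DIV>")
    return "\n".join(parts)
```

Grouping by a hashable `Style` value is what guarantees that a style block with given characteristics appears only once per primitive.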
IC Server 62 can then serve the information component to IC Search Engine 63 after receiving a request for a particular information component from IC Search Engine 63. As described in further detail below, IC Search Engine 63 receives a request for an information component, which is then made available to IC Publisher 65, which publishes the information component. Optionally, IC Publisher 65 publishes the information component to a Web page, for example, or onto paper, as another example.
Both IC Search Engine 63 and IC Server 62 must be able to communicate with each other, such that an information component can be requested. This communication is permitted with information components which have a standard format. An object format is particularly preferred, because objects can be accessed through a predefined structure, which is more efficient for interacting with the information contents of the object. Both of the exemplary and preferred embodiments described below in Chapter III, Sections 1 and 2 (XML) and in Chapter IV, Section 1 (Java Bean) have object formats for the information component. Of course, other types of formats could be used, such as DHTML.
Section 6: Information Component Server
Once the information component has been transformed into an active object or other general format by component generator 82, the active object and the original document are then sent to IC (Information Component) Server 62 and are then stored in a database 84, as shown in Figure 5. IC Server 62 stores and manages the "original" information, such as documents, video segments, sounds and so forth. When a client application issues a request for information, IC Server 62 locates the original information entity, isolates the corresponding information component and then returns the information component to the client application in some suitable format, for example as an HTML file.
Section 7: Information Component Publisher and Client
As shown in Figure 6, an IC Client 98 is able to send requests for information components to IC Server 62 through an IC Search Engine 63. In addition, IC Client 98 is also able to receive such components from IC Server 62, optionally and preferably through the CORBA ORB. IC Client 98 preferably features some type of GUI (graphical user interface), which enables client applications to interact with the functionality of the information management system of the present invention. IC Publisher 65 is then able to publish the information component onto IC Client 98.
In preferred embodiments of the present invention, the ability to access certain information components and to view these components on GUI interface 100 is controlled by two functions: automatic information component replacement and "white-label". These features provide customized views of the same documents to different user groups, while preventing the display of sensitive information components to specific users or to groups of users.
Automatic information component replacement is accomplished through an IC replacement table, which is preferably stored in database 84. The IC replacement table includes the following information: the class and the name of the information component to be replaced, the class and the name of the information component which is to replace it, and the user groups or individuals for whom the replacement should be performed. For example, the logo on a particular research report, which is an information component called "Big Company X Report", could be replaced by the logo of "Another Big Company" for the user group "Another Big Company". IC Server 62 would then replace the logo of "Big Company X" with the logo of "Another Big Company" when those clients of "Another Big Company" request the Report.
The white-label function is used to specify one or more information components in the original document which are not to be displayed on GUI interface 100, but which remain incorporated within the original document. Thus, the white-label function enables sensitive information to be protected from access through IC Client 98.
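The two view-control functions can be illustrated together. The sketch below assumes a hypothetical in-memory form of the IC replacement table (the `Replacement` record) and a hypothetical white-label mapping from a component to the user groups that must not see it; neither data layout is given in the text, and individual users are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# A component is identified by (class, name), as in the IC replacement table.
ComponentKey = Tuple[str, str]


@dataclass(frozen=True)
class Replacement:
    source: ComponentKey        # component to be replaced
    target: ComponentKey        # component which replaces it
    groups: FrozenSet[str]      # user groups for whom the replacement applies


def resolve_view(
    components: List[ComponentKey],
    user_group: str,
    replacements: List[Replacement],
    white_label: Dict[ComponentKey, FrozenSet[str]],
) -> List[ComponentKey]:
    """Return the components a given user group is shown, in document order."""
    view: List[ComponentKey] = []
    for key in components:
        # White-label: the component stays in the document but is not displayed.
        if user_group in white_label.get(key, frozenset()):
            continue
        # Automatic replacement: the first matching table row wins.
        for row in replacements:
            if row.source == key and user_group in row.groups:
                key = row.target
                break
        view.append(key)
    return view
```

The stored document is never modified; only the list returned for display differs between user groups.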
Chapter III: Specific Exemplary Implementation with Objects in an XML Environment
This chapter describes the details of an exemplary preferred embodiment which implements the system of the present invention with objects in an XML environment. The objects are preferably compatible with the ActiveX™ architecture, although other types of objects could also be used, as long as they were compatible with the XML environment. For example, the ActiveX™ objects could be constructed by the client from the information component objects according to the ActiveX™ architecture.
The list of sections in this chapter is as follows. Section 1 is an overview of the system when implemented with XML. Section 2 is a description of IC Contributor when implemented with XML. Section 3 describes IC XML Server. Section 4 describes the IC Search Engine when implemented with XML. Section 5 describes the IC Publisher and IC Client when implemented with XML.
Section 1: Overview of System with XML
Figure 7 shows an overall view of a portion of the system of the present invention as implemented in XML. IC Contributor 60 is now IC XML Contributor 200 and IC Server 62 is now IC XML Server 202. IC XML Contributor 200 creates objects from the information components as XML-environment compatible objects.
IC XML Server 202 provides access to a database 204, which is similar to database 84 of Chapter II. Database 204 stores the information components, which are implemented as XML-environment compatible objects.
IC XML Server 202 also communicates with a DOM (document object model) compliant interface, referred to as DOM Interface 208. Software programs which are compliant with the DOM protocol are able to communicate with other software programs for XML-compatible or XML-specific tools, such as Web browsers or software programs for editing XML documents, for example. Thus, DOM Interface 208 acts as a gateway, enabling these XML tool software programs to communicate with information components through IC XML Server 202. As described in greater detail in Section 3 below, XML tool software programs can therefore preferably edit and reuse information components directly from database 204, without conversion of the components.
IC XML Server 202 provides one or more information components upon receiving a request from IC Universal Search Engine Adapter 214. IC Universal Search Engine Adapter 214 enables many different types of search engines to communicate with IC XML Server 202, such that a search can be made for specific information components within database 204. IC Universal Search Engine Adapter 214 also preferably controls access to IC XML Server 202, preferably including such functions as security and request access. IC Universal Search Engine Adapter 214 passes a request for an information component to IC XML Server 202, which then returns the desired information component. One or more information components can then be served to IC XML Publisher 210. IC XML Publisher 210 optionally and preferably includes an HTML rendering engine 212, and a standard document rendering engine 216. The information component can then be displayed to the user in a number of ways, such as by printing the information on paper or by displaying the information on a Web page.
If the information is to be printed on paper, or otherwise rendered in a "standard" document format, then IC XML Publisher 210 passes the information component to standard document rendering engine 216. If the information is to be displayed by a Web browser which can only handle HTML documents, then IC XML Publisher 210 passes the information component to HTML rendering engine 212. Other types of rendering are also possible of course. The description of each of these parts of the system is given in greater detail in the sections below.
Section 2: Specific Exemplary Implementation of IC Contributor with XML-Environment Compatible Objects
Section 5 of Chapter II above described the general implementation of IC Contributor 60. This section describes a specific, preferred implementation of the IC Contributor for operation with XML-environment compatible objects such as those compatible with ActiveX™ architecture, IC XML Contributor 200. As XML-environment compatible objects, information components are organized in a hierarchical structure and linked to each other. Each XML-environment compatible object has methods, properties and data. The data itself is the classified information component obtained as described in Chapter II. Methods determine the ways in which the data and properties of the information component can be manipulated. For example, methods include ways to access the data, whether as an image, a video clip, a sound and so forth. Methods also include an application interface, so that another application would be able to interact with the information component and with the stored data, and with a GUI (graphical user interface). Other methods pertain to access control and to event handling. Event handling enables these objects to broadcast events and to have those events delivered to an appropriate component or components for notification. Thus, event handling provides methods for communication between components packaged as XML-environment compatible objects.
Although individual methods might be specific to a particular information component, such that a newspaper article component would probably not include a method for manipulating sound, the overall mechanism for describing each method is supported by the object architecture and could be easily determined by one of ordinary skill in the art.
The properties of the XML-environment compatible object include the internal structure of the object and the location of the data of the information component within the hierarchical structure of information components. For example, as described in Section 4, Chapter II, information components are composed of IC primitives, which are in turn used to build more complex structures which describe the relationships between information components. The location of the data of the information component within a hierarchy is important in order to be able to construct virtual documents and to understand the type and significance of the data within the information component.
In addition, these properties include the correct tags for the type of data within the object, in order for the object to be correctly rendered within the XML environment, and its location within the information component hierarchy. For example, if the type of data is a chapter of a book, then the correct tag might be the "chapter" tag. This tag identifies the type of XML element for the object, which is important for the later assembly of the data within the object as an element of an XML document.
IC XML Contributor 200 packages the information component obtained from IC Identifier 58 into the XML-environment compatible object as follows. First, the data of the information component forms the data of the object. Next, the methods which can be used to interact with the object are determined. Certain of these methods are typical for all such objects. Other methods, such as the method for accessing the type of data within the object, are particular for the type of data from the information component. Finally, the properties of the object are determined, for example according to the location of the data of the XML-environment compatible object within the information component hierarchy.
Section 3: Specific Implementation of IC Server
This section describes a specific implementation of IC Server 62, described in Chapter II, Section 6, for operation with XML-environment compatible objects. IC XML Server 202 accepts requests for and then serves information components as XML elements assembled into an XML document. In addition, IC XML Server 202 manages the extended links of XML and normalizes the structure of various DTDs for the XML documents. Also, in conjunction with DOM Interface 208, IC XML Server 202 enables the XML-environment compatible objects to be accessed by XML tool software programs without requiring conversion of the objects.
As noted in Chapter I, Section 1, XML documents are collections of one or more XML elements which are organized according to certain rules, which are held in the DTD of the XML document. IC XML Server 202 is preferably able to assemble XML documents "on the fly" in response to a request from a client application. For example, a client application might request a particular chapter of a book. This chapter could contain a chapter title, text and images, for example. The chapter could also be further subdivided into sections, each of which would also have an organizational structure. The data required to assemble the chapter is contained within one or more IC XML elements. If there are a plurality of IC XML elements, then these elements are related as part of a hierarchy based upon the information component hierarchy of the chapter. Therefore, IC XML Server 202 must first locate all of the IC XML element(s) which are required for the chapter.
Next, a style sheet is optionally selected for the XML document. The style sheet is optionally determined by the properties of the "chapter" IC XML element, which may indicate a particular style sheet to be used for that element. Alternatively and preferably, the style sheet could be determined according to specifications submitted by the client application, such that the preferences of the client application determine the style sheet. The IC XML elements are then assembled in the XML document, optionally according to the style sheet. The DTD for the XML document is then constructed, according to the tags contained within the IC XML element.
The links for the XML document are then determined. Preferably, these links are extended links. More preferably, the extended links are managed as part of a document which is external to the XML document. These links are determined according to the link(s) of the IC XML element, which are included in the properties of the element. For example, one such link might link two sections of the chapter. Preferably, if a link is to another XML element which is not part of the XML document being assembled, such as for a different chapter, then this other XML element is also assembled into a different XML document, such that the different XML document could also be served if necessary.
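The "on the fly" assembly of a chapter from hierarchically related IC XML elements, followed by construction of a DTD from the tags that were used, can be sketched with the Python standard library. This is an illustration only: the `ICElement` record, the `store` dictionary standing in for the database, and the deliberately loose mixed-content DTD are assumptions, and style sheets and extended links are not modelled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ICElement:
    """A stored information component with its tag and child component ids."""
    ic_id: str
    tag: str
    text: str = ""
    children: List[str] = field(default_factory=list)


def assemble(root_id: str, store: Dict[str, ICElement]) -> ET.Element:
    """Recursively locate the required elements and build the XML tree."""
    ic = store[root_id]
    node = ET.Element(ic.tag)
    node.text = ic.text or None
    for child_id in ic.children:
        node.append(assemble(child_id, store))
    return node


def derive_dtd(root: ET.Element) -> str:
    """Build a loose DTD: each tag allows text plus the child tags seen under it."""
    seen: Dict[str, List[str]] = {}
    for element in root.iter():
        kids = seen.setdefault(element.tag, [])
        for child in element:
            if child.tag not in kids:
                kids.append(child.tag)
    lines = []
    for tag, kids in seen.items():
        if kids:
            model = "(#PCDATA" + "".join(f" | {k}" for k in kids) + ")*"
        else:
            model = "(#PCDATA)"
        lines.append(f"<!ELEMENT {tag} {model}>")
    return "\n".join(lines)


# Hypothetical database contents for one chapter.
store = {
    "c1": ICElement("c1", "chapter", children=["t1", "s1"]),
    "t1": ICElement("t1", "title", "Intro"),
    "s1": ICElement("s1", "section", children=["p1"]),
    "p1": ICElement("p1", "para", "Hello"),
}
chapter = assemble("c1", store)
dtd = derive_dtd(chapter)
```

Calling `ET.tostring(chapter, encoding="unicode")` serializes the assembled document for serving.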
A preferred embodiment for link management is described with regard to Figure 8. Preferably, extended links are also objects which are stored externally, for example in database 204. Extended link objects are exposed as child objects of the IC XML-environment compatible objects, or resource objects, which they link.
The identity of each extended link object is preferably stored in a link table 218, which is then stored in database 204. The identity of each IC XML-environment compatible object is also stored in an IC table 220, also stored in database 204. Optionally and more preferably, a document table 222 is also stored in database 204. Document table 222 indicates how to assemble complete XML documents. Preferably, these XML documents can be assembled into a format which closely resembles the format of the original document from which the information was obtained. Also preferably, other "virtual" XML documents could also be assembled according to requests received from the client application.
IC XML Server 202 manages the extended link objects through dynamic management, by dynamically generating extended link objects as required. For example, if a document or an information component is removed from database 204, IC XML Server 202 updates link table 218, IC table 220 and document table 222. Other changes to the link structure may occur manually, through a link editor 224. Alternatively and preferably, changes to the link structure may occur when a document is edited, added or deleted through a document manager 226. An XML editor 228 may also be used to change the structure of the links. IC XML Server 202 updates link table 218, IC table 220 and document table 222 as necessary. More preferably, IC XML Server 202 sends an alert to the software tool which is attempting to remove the document or information component from database 204, alerting the user to the possible alteration to the link structure. Once the XML document has been prepared, it is sent to IC XML Publisher 210, for example for being published as a Web page. IC XML Publisher 210 is described in greater detail in Section 5 below.
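The table updates that follow removal of an information component can be sketched as below. The shapes chosen for the three tables (a set of component ids, a mapping from link id to a source and target pair, and a mapping from document id to an ordered list of component ids) are assumptions made for illustration; the text does not specify them. The returned list of removed link ids stands in for the alert sent to the requesting tool.

```python
from typing import Dict, List, Set, Tuple


def remove_component(
    ic_id: str,
    ic_table: Set[str],
    link_table: Dict[str, Tuple[str, str]],
    document_table: Dict[str, List[str]],
) -> List[str]:
    """Remove a component and keep all three tables consistent.

    Returns the ids of the extended links that were dropped, so that the
    caller can warn the user about the change to the link structure.
    """
    ic_table.discard(ic_id)

    # Any link that starts or ends at the removed component is now dangling.
    broken = [
        link_id
        for link_id, (source, target) in link_table.items()
        if ic_id in (source, target)
    ]
    for link_id in broken:
        del link_table[link_id]

    # Documents are assembled from component lists; drop the removed entry.
    for doc_id in list(document_table):
        document_table[doc_id] = [m for m in document_table[doc_id] if m != ic_id]

    return broken
```

A stricter variant could refuse the removal until the caller confirms, rather than reporting the dropped links afterwards.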
With regard to DTD normalization, IC XML Server 202 is preferably capable of serving many different types of XML documents, which may have different DTD structures. Such different structures can increase the difficulty of searching, retrieving and assembling IC XML elements. Furthermore, if IC XML elements have different names for tags which should indicate the same element, IC XML Server 202 may not be able to assemble IC XML elements correctly. Therefore, IC XML Server 202 optionally and preferably performs DTD normalization for the XML elements and documents.
The process of DTD normalization is shown with regard to Figure 9. First, a DTD 230 is received by IC XML Server 202. IC XML Server 202 then passes each tag within DTD 230 to a DTD normalizer 232. DTD normalizer 232 compares the name of the tag (text string associated with the tag) to the rule or rules of a DTD rules database 234. For example, if the name of the tag is "summary", a rule might state that "synopsis" should be used to replace "summary". Preferably, if DTD rules database 234 does not have a rule for the name of that particular tag, then DTD normalizer 232 searches any information associated with the XML element having that tag in order to normalize the name of the tag.
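The rule lookup performed by the DTD normalizer can be sketched as a simple mapping from tag names to their preferred replacements. The function name `normalize_tags` and the case-insensitive comparison are assumptions; the fallback search through information associated with the element is not specified in the text and is therefore omitted, so a tag without a rule is returned unchanged.

```python
from typing import Dict, List


def normalize_tags(tags: List[str], rules: Dict[str, str]) -> List[str]:
    """Replace each tag name that has a rule; leave all other names unchanged.

    `rules` maps a lower-case tag name to its normalized replacement,
    standing in for the DTD rules database.
    """
    return [rules.get(tag.lower(), tag) for tag in tags]


# The example rule from the text: "synopsis" replaces "summary".
rules = {"summary": "synopsis"}
normalized = normalize_tags(["title", "summary", "Summary"], rules)
```

Normalizing names before assembly means that elements from documents with different DTDs can be searched and combined under one vocabulary.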
With regard to DOM Interface 208, XML tool software programs, such as editor programs for XML documents, are able to communicate with IC XML Server 202 through this "gateway" software module. For example, these editor programs are able to create and manipulate virtual documents from XML-environment compatible objects stored in database 204 in a substantially similar manner to the way in which XML documents are created and manipulated.
Section 4: Implementation of IC XML Search Engine
As previously described, IC Universal Search Engine Adapter 214 passes requests for information components to IC XML Server 202. IC Universal Search Engine Adapter 214 therefore controls access to IC XML Server 202, and hence to the information components. IC Universal Search Engine Adapter 214 preferably operates according to an HTTP-based protocol. Preferably, the access offered through IC Universal Search Engine Adapter 214 can be determined according to a software module or applet written in JavaScript, Java, ActiveX™ or C++, for example. IC Universal Search Engine Adapter 214 is preferably able to translate substantially any type of search query language into a format which is accessible to IC XML Server 202. More preferably, IC Universal Search Engine Adapter 214 includes a driver (not shown) for each search engine, such that a new type of search engine can be easily accommodated by altering the driver.
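The driver-per-search-engine arrangement can be sketched as a registry of translation functions. Everything named here is hypothetical: the class `UniversalSearchAdapter`, the two invented query syntaxes, and the dictionary used as the common format accepted by the server are illustrative assumptions, not formats taken from the text.

```python
from typing import Callable, Dict

# The common query format assumed for the server: field name -> value.
Query = Dict[str, str]
Driver = Callable[[str], Query]


def colon_driver(raw: str) -> Query:
    """Hypothetical engine A syntax: 'class:chapter name:Intro'."""
    return dict(pair.split(":", 1) for pair in raw.split())


def ampersand_driver(raw: str) -> Query:
    """Hypothetical engine B syntax: 'class=chapter&name=Intro'."""
    return dict(pair.split("=", 1) for pair in raw.split("&") if pair)


class UniversalSearchAdapter:
    """Translates engine-specific queries into the server's common format."""

    def __init__(self) -> None:
        self._drivers: Dict[str, Driver] = {}

    def register(self, engine: str, driver: Driver) -> None:
        # Supporting a new engine only requires registering a new driver.
        self._drivers[engine] = driver

    def translate(self, engine: str, raw: str) -> Query:
        try:
            driver = self._drivers[engine]
        except KeyError:
            raise ValueError(f"no driver registered for {engine!r}") from None
        return driver(raw)


adapter = UniversalSearchAdapter()
adapter.register("engine_a", colon_driver)
adapter.register("engine_b", ampersand_driver)
```

Because the client only ever sees the adapter, a change of search engine is confined to the driver.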
IC Universal Search Engine Adapter 214 is preferably built to be compatible with the particular architecture of IC XML Server 202, such that the client application requesting a particular information component would not need to be altered in order to be compatible with different search engines.
Section 5: Specific Implementation of IC Publisher
For this embodiment of the system of the present invention, IC Publisher 65 is IC XML Publisher 210. IC XML Publisher 210 makes the information components accessible to the client application. The information component can then be displayed to the user in a number of ways, such as by printing the information on paper or by displaying the information on a Web page.
If the information is to be printed on paper, or otherwise rendered in a "standard" document format, then IC XML Publisher 210 passes the information component to standard document rendering engine 216. Standard document rendering engine 216 could output the information component according to the PostScript protocol, for example, in order to allow data exchange and communication with paper printing devices.
If the information is to be displayed by a Web browser which can only handle HTML or DHTML documents, then IC XML Publisher 210 preferably passes the information component to HTML rendering engine 212. Other types of rendering are also possible of course.
HTML rendering engine 212 is able to render the XML document as an HTML or a DHTML document for being served to a Web browser. In addition, preferably HTML rendering engine 212 is able to render other document formats, such as PDF, word processing and image formats, as HTML documents as well. PDF could be rendered from a PostScript output, for example. Preferably, the functions described for HTML rendering engine 212 could be used for rendering substantially any type of file in substantially any format as an HTML or DHTML document, as described below.
Figure 10 is a schematic block diagram showing a preferred implementation of HTML rendering engine 212 and associated items according to the present invention. HTML rendering engine 212 interacts between a native file format processor 236 and a Web browser 238. Essentially, HTML rendering engine 212 enables a native file format document 240, which would normally be substantially accessible only to native file format processor 236, to be displayed by Web browser 238. Furthermore, the display of native file format document 240 by Web browser 238 is visually similar or identical to the display of native file format document 240 by native file format processor 236, as enabled by HTML rendering engine 212.
Native file format processor 236 can be any software component or application which can access a native file format document 240. Examples of such software components or applications include, but are not limited to, an XML editing software program, word-processing software such as Microsoft™ Word™ and exchange format software such as Adobe Acrobat™ Exchange Reader. Examples of native file formats include, but are not limited to, the XML format, the DOC format for Microsoft™ Word™ and the PDF format for Adobe Acrobat™. The phrase "access a native file format document" is meant to connote that native file format processor 236 can display and manipulate native file format document 240 such that native file format document 240 is viewable with native visual attributes or visual appearance. Although a number of "exchange" formats are available, such as the Rich Text Format (RTF), which can be accessed by more than one type of word processing software for example, information is often lost through conversion to and from these formats. Furthermore, these formats are not the native file format itself, but only an approximation. Thus, native file format document 240 is preferably in the file format which is intended to be implemented by native file format processor 236.
HTML rendering engine 212 is able to convert native file format document 240 into a raster image in a raster format which is displayable by Web browser 238, according to one of two preferred embodiments of the present invention. In the first embodiment, HTML rendering engine 212 interacts with native file format processor 236 to obtain data regarding native file format document 240. In the second preferred embodiment, HTML rendering engine 212 directly accesses native file format document 240 substantially without any interaction with native file format processor 236.
The first preferred embodiment of the present invention has many different possible implementations, two illustrative examples of which are given here, it being understood that these are for discussion purposes only and are not meant to be limiting. In the first exemplary implementation, HTML rendering engine 212 interacts with native file format processor 236 and instructs native file format processor 236 to place native file format document 240 on the "clipboard" (not shown). The "clipboard" is a feature of a number of different computer operating systems, in particular those operating systems of Microsoft Inc. (Seattle, Washington, USA), such as "Windows95™" and "Windows NT™", for example. The general function of the "clipboard" is to enable one software application, such as native file format processor 236, to make information available to another software application, such as HTML rendering engine 212. Hereinafter, the term "clipboard" refers to any feature of a computer operating system which enables information to be exchanged between two software applications. Once native file format document 240 has been copied to, or placed on, the clipboard, native file format document 240 is then pasted to HTML rendering engine 212 as a graphical image. Thus, HTML rendering engine 212 imports, or accesses, native file format document 240 as an image, which can then be converted to a raster image in a raster format. Additionally, HTML rendering engine 212 is able to obtain the necessary data about native file format document 240 through such "pasting".
In the second exemplary implementation, HTML rendering engine 212 receives the necessary information about native file format document 240 through substantially direct interaction with native file format processor 236. Such interaction can be performed according to a number of different methods. For example, Adobe Acrobat™ allows other software applications to obtain this information through the creation of a "plug-in". Microsoft™ Word™ enables other software applications to obtain this information through the creation of a "macro". Alternatively, HTML rendering engine 212 could include a printer driver, which would enable native file format processor 236 to "print" native file format document 240 to an image format file. Such "printing" would also give HTML rendering engine 212 the necessary data about native file format document 240. Thus, HTML rendering engine 212 would obtain the necessary data about native file format document 240 through interaction with native file format processor 236.
The second preferred embodiment of the present invention has many different possible implementations, one illustrative example of which is given here, it being understood that this example is for discussion purposes only and is not meant to be limiting. As noted previously, the second preferred embodiment involves direct interaction of HTML rendering engine 212 with native file format document 240, substantially without any interaction with native file format processor 236.
Such direct interaction has a number of advantages, in particular greater speed of conversion of native file format document 240 to the raster format. HTML rendering engine 212 preferably performs such interaction by understanding all or substantially all of the instructions contained within native file format document 240, in a similar or identical manner as native file format processor 236. These instructions are like any other computer software language, and as such can be understood and interpreted by software applications other than native file format processor 236.
Regardless of the particular implementation by HTML rendering engine 212, HTML rendering engine 212 obtains the necessary data about native file format document 240. This data includes substantially all of the words of the text in native file format document 240, or at least of that portion of native file format document 240 which is to be displayed on Web browser 238. In addition, the data includes the coordinates of each word within native file format document 240. Finally, the data preferably includes all attributes of each word and of the relationships between words, such as the font style and size, character attributes such as bold or italicized text, and spaces between characters and words. Thus, the data in combination enable native file format document 240 to be reproduced in a substantially identical or identical document appearance on Web browser 238.
More specifically, the coordinates preferably include all information which is necessary to geographically locate the word within native file format document 240, such as the number of the page on which the word falls, the number of the word on the page and the coordinates of the rectangle which bounds the word on the page, or "bounding rectangle". The bounding rectangle determines the area occupied by the word on a page and is necessary to fully reproduce the visual aspects of that word. Thus, the coordinates of each word numerically describe the visual appearance of the word and, preferably in combination with the visual attributes of the word, enable the visual appearance of the word to be reproduced.
Once HTML rendering engine 212 has received the data from native file format document 240, HTML rendering engine 212 creates the raster image in a raster format which is displayable by Web browser 238. The raster image is created from the data obtained from native file format processor 236, and preserves substantially all of the visual attributes of native file format document 240, or a portion thereof, when seen in the native document appearance. The raster format is supported by Web browsers. One example of such a format is the GIF raster format. Thus, the raster image, containing at least a portion of native file format document 240, is displayable by Web browsers.
The raster image is optionally created "on the fly". Alternatively, such a raster image could be stored in an additional database 242 containing cached raster images, rather than being created "on the fly".
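The choice between creating a raster image "on the fly" and serving it from a cache can be sketched as a small wrapper around a rendering function. The class `RasterCache`, the `(document_id, page)` cache key and the in-memory dictionary standing in for database 242 are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


class RasterCache:
    """Serve cached raster images, rendering only on the first request."""

    def __init__(self, render: Callable[[str, int], bytes]) -> None:
        self._render = render
        # Stand-in for the database of cached raster images.
        self._store: Dict[Tuple[str, int], bytes] = {}

    def get(self, document_id: str, page: int) -> bytes:
        key = (document_id, page)
        if key not in self._store:
            # Cache miss: create the image "on the fly" and remember it.
            self._store[key] = self._render(document_id, page)
        return self._store[key]
```

The renderer is passed in as a function so that either of the two embodiments for obtaining document data can sit behind the same cache.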
If the raster image is produced as a result of a search request by the user, then preferably at least one "match" or search result is displayed in the context of at least a portion of at least one native file format document 240 containing the match, as shown in Figure 11.
Figure 11 shows an exemplary, illustrative depiction of a portion of the computer monitor screen which is displaying the raster images of two matches. A monitor screen 244 is displaying a portion of the graphic output of Web browser 238, here shown as Netscape Navigator™ although substantially any Web browser could be used. At the left-hand portion of monitor screen 244, a command area 246 enables the user to enter commands to HTML rendering engine 212 through Web browser 238. At the right-hand portion of monitor screen 244, a display area 248 shows a portion of the results from the search. Display area 248 shows a portion of two documents 250 with the searched term, "Keppel", emphasized, in this example by a box. As can be seen, both graphic images and text are displayable in display area 248. Thus, the results of the search are displayed by Web browser 238, preferably within the context of at least a portion of the original document, as shown.
In a preferred embodiment of the present invention, the list of matches includes a plurality of matches within a single native file format document 240, a single match from a plurality of native file format documents 240, or even a plurality of matches from a plurality of native file format documents 240 Web browser 238 can then request the next match the seπes of matches, or else the previous match in the seπes Again, HTML rendeπng engine 212 then creates the raster image of the desired match in the seπes. Again, alternatively such a raster image could be stored in database 242, rather than being created "on the fly". In any case, the raster image of the desired match is transferred to, and then displayed by, Web browser 238
In a second embodiment of HTML rendering engine 212, HTML rendering engine 212 is able to render information components as a DHTML document. In this embodiment, HTML rendering engine 212 converts data for each primitive of each information component into equivalent fragments in DHTML format. There can be two types of data elements in the source information component: graphic elements and text elements. The graphic elements are converted to raster images (in GIF format). The text elements are converted to a set of DHTML <DIV> blocks. There are two types of DHTML <DIV> blocks, a style block and a value block. The "style block" defines the style attributes of the text, such as font size and name, font-weight, font-style and color. The "value block" defines the position of the text element within the current primitive and its text value. When the text value contains more than one word, the text value is inserted into a <NOBR> block to prevent line breaking for the given text element by the Web browser.
To minimize the size of the resulting DHTML fragment for each primitive, the DHTML fragment is optimized to ensure that each "style block" with specific characteristics appears only once in the DHTML fragment for the primitive. Exact correspondence between the source document text style and the DHTML style is not always possible. In this case, the original fonts are preferably substituted by the fonts available for the Web browser, with possible modifications of font size.
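For the purposes of illustration only, the conversion of text elements into style blocks and value blocks could be sketched as follows. The exact markup is not specified above, so this sketch assumes that a style block is an outer <DIV> carrying the style attributes and that the value blocks sharing that style are nested within it; the names `TextElement`, `style_css`, `value_block` and `render_primitive` are hypothetical.

```python
from dataclasses import dataclass
from html import escape
from typing import Dict, List


@dataclass(frozen=True)
class TextElement:
    text: str
    left: int            # position within the current primitive, in pixels
    top: int
    font_name: str
    font_size: int       # in points
    bold: bool = False
    italic: bool = False
    color: str = "#000000"


def style_css(e: TextElement) -> str:
    """Style attributes of a text element, used as the key of its style block."""
    return (f"font-family:{e.font_name};font-size:{e.font_size}pt;"
            f"font-weight:{'bold' if e.bold else 'normal'};"
            f"font-style:{'italic' if e.italic else 'normal'};color:{e.color}")


def value_block(e: TextElement) -> str:
    """Position and text value; multi-word text is wrapped in <NOBR>."""
    text = escape(e.text)
    if len(e.text.split()) > 1:
        text = f"<NOBR>{text}</NOBR>"
    return (f'<DIV style="position:absolute;left:{e.left}px;top:{e.top}px">'
            f"{text}</DIV>")


def render_primitive(elements: List[TextElement]) -> str:
    """Emit each distinct style block once, grouping its value blocks under it."""
    groups: Dict[str, List[str]] = {}
    for e in elements:
        groups.setdefault(style_css(e), []).append(value_block(e))
    return "\n".join(
        f'<DIV style="{css}">\n' + "\n".join(values) + "\n</DIV>"
        for css, values in groups.items()
    )
```

Grouping by the style string is one simple way to ensure that a given combination of style attributes is written only once per primitive; other encodings of the style block are equally possible.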
DHTML representations can be created for complete pages as well as for parts of any page. In order to generate a DHTML view of a document page, the relevant primitives are obtained. These include the DHTML data, the enclosing rectangle for the primitive, and text coordinate mapping (for drawing search "hits" or results, or for otherwise highlighting or emphasizing a portion of text).
HTML rendering engine 212 iterates over these primitives and for each one generates the DHTML code that locates it in the proper place on the page. Graphical elements are handled by creating a combination of <DIV> and <IMG> tags which point to a URL for loading the images directly.
The search hits are also displayable as part of a DHTML view. Hits are created by adding (prior to the primitive DHTML) <DIV> and <IMG> tags which point to the URL of a pre-defined small image containing the hit color. The hits can be indicated by using one of three coloring methods. These include marking the word; marking the beginning of the line; and marking the entire line. The size of the coloring which indicates the hit can be adjusted to the proper size by using the text coordinate mapping of the primitive.
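As a non-limiting sketch, the three coloring methods could be expressed as a choice of rectangle which is then stretched over by the small hit-color image. The names `Box`, `HitStyle`, `hit_box` and `hit_markup`, the image URL `/images/hit.gif` and the fixed marker width used for the beginning of the line are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    width: int
    height: int


class HitStyle(Enum):
    WORD = 1         # mark the word itself
    LINE_START = 2   # mark the beginning of the line
    WHOLE_LINE = 3   # mark the entire line


def hit_box(word: Box, line: Box, style: HitStyle, marker_width: int = 6) -> Box:
    """Choose the rectangle to color, using the text coordinate mapping."""
    if style is HitStyle.WORD:
        return word
    if style is HitStyle.LINE_START:
        return Box(line.left, line.top, marker_width, line.height)
    return line


def hit_markup(box: Box, image_url: str = "/images/hit.gif") -> str:
    """<DIV> and <IMG> tags that scale the hit-color image to the rectangle."""
    return (f'<DIV style="position:absolute;left:{box.left}px;top:{box.top}px">'
            f'<IMG src="{image_url}" width="{box.width}" height="{box.height}">'
            f"</DIV>")
```

The markup returned by `hit_markup` would be placed before the DHTML of the primitive, so that the text is drawn over the colored area.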
Chapter IV: Specific Implementation with Java Bean and CORBA
The preceding chapter described the details of implementing the system of the present invention with objects in an XML environment. In this chapter, a description is provided for implementation with Java Bean objects in a CORBA environment.
Section 1: Specific Implementation of IC Contributor with Java Bean Objects
Section 5, Chapter II above described the general implementation of IC Contributor. This section describes a specific, preferred implementation of IC Contributor for operation with Java Bean objects. The Java Bean object has two groups of characteristics: properties and methods. Properties are descriptive features of the Java Bean object. Such features preferably include the OMS (Object Mapping Structure), which is the text, structure, graphics and APPS intelligence of the Java Bean object. The latter feature, APPS intelligence, is applicable only if the original document was a paper document scanned into an electronic file, since APPS stands for "Adaptive Probability Pattern Search", which enables text to be searched in an image even if not correctly recognized by the OCR process described previously. The OMS contains information related to the overall structure of the Java Bean object, as well as a description of the relationships between different portions of that object.
Preferably, the profile information is also included. The profile information includes any additional desired characteristics of the original document. These characteristics are determined by the user through IC Content Capturer 52. For example, the profile information could include data concerning the type of company which published the original document. Thus, the profile information is external to the original document and is added according to the specification of IC Content Capturer 52. Other preferred properties include an optional but preferable object image, which is a visual image of the original document. Another preferred property is hyperlink information, which describes all connections to locations on the World Wide Web. Preferably, a description of the relationships between this component and other components is also provided. Finally, security and access control data are preferably provided, which determine who is allowed to access the information.
Methods determine the ways in which the data and properties of the information component can be manipulated. These methods are standard for the Java Bean component architecture. For example, methods include ways to access the data, whether as an image, a video clip, a sound and so forth. Methods also include an application interface, so that another application would be able to interact with the information component and with the stored data, and with a GUI (graphical user interface). Other methods pertain to access control and to event handling. Event handling, as noted previously in section 1, is the mechanism for Java Bean components to broadcast events and to have those events delivered to an appropriate component or components for notification. Thus, event handling provides methods for communication between components packaged as Java Beans.
Although individual methods might be specific to a particular information component, such that a newspaper article component would probably not include a method for manipulating sound, the overall mechanism for describing each method is supported by the Java Bean component architecture and could be easily determined by one of ordinary skill in the art.
The information component is preferably packaged as a Java Bean by using the JAR file format. The JAR file format includes such information as the class file, images, sounds and links to other components. The class file is a description of the information class to which the information component belongs. Each such piece of information is stored in the JAR file format as a pointer to the storage location of the relevant data, such as an image for example. Thus, the JAR file format wraps additional information and data around the information component, in such a way that all of the information and data is presented as a single, independent entity, yet is readily accessible to other software objects.
Section 2: Specific Implementation of the Server with Java Bean Objects in a CORBA Environment
This section describes a specific implementation of IC Server 62, described in Chapter II, Section 6, as IC JBC Server 300 for operation with Java Bean objects in a CORBA environment.
In this particular embodiment of the present invention, when a client application issues a request for information, IC JBC Server 300 locates the original information entity, isolates the corresponding information component according to a pointer stored in the Java Bean component for example, and then creates an "object image clip". The object image clip is then sent back to the client application as an HTML file. These functions are described in greater detail with regard to Figure 12.
In Figure 12, IC JBC Server 300 includes a database 302. Database 302 is both accessible to, and is managed by, an IC Manager 304. IC Manager 304 is responsible for supplying the main CORBA services, as described in Section 1. Preferably, IC Manager 304 provides these services by being adapted to the main proprietary ORB models which are available, such as the "Cartridge" model of Oracle Corp. (California, USA) or the "Blade" model of Informix™. An ORB is an Object Request Broker, preferably a WRB, or ORB for the "Cartridge" model, which is able to communicate with individual cartridges. The main CORBA services include database and indexing services for search and retrieval engines and for push applications; database navigation services; distributed viewing, imaging and printing services for the information components; network control and retrieval services; and distributed storage services for information components. For example, if IC Manager 304 is adapted to the "Cartridge" model, then components are accessed from database 302 through one of a number of cartridges, shown as at least one cartridge 306. Each cartridge 306 is a module of software which performs a specific function. Each of the previously described services performed by IC Manager 304 is provided by a separate cartridge 306. Different cartridges 306 could provide database indexing, database navigation and information component retrieval services for example, without requiring cartridge 306 and database 302 to be on the same server computer. It should be noted that IC JBC Server 300 is not necessarily a single server computer, but rather is an interacting collection of components which together form IC JBC Server 300.
Cartridges 306 would communicate with each other and with any databases through an ORB. One advantage of the "Cartridge" model is that communication between different computers could occur through the World Wide Web, via an HTTP daemon as described in section 1. Cartridges 306 are named with a combination of the IP address of the server where cartridge 306 is located and the virtual path to the location of cartridge 306 on that server. Thus, IC Manager 304 would preferably be composed of a number of different cartridges 306, on one server computer or a plurality of server computers, which preferably interact with each other and any other necessary components, such as databases, through the World Wide Web.
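Merely as an illustration of the naming scheme, a cartridge name could be composed and decomposed as follows. The textual form `<IP address><virtual path>` (for example `192.0.2.10/ic/retrieve`), the restriction to IPv4 addresses and the names `CartridgeName`, `make_name` and `parse_name` are assumptions of this sketch; the description above does not fix a concrete syntax.

```python
import ipaddress
from typing import NamedTuple


class CartridgeName(NamedTuple):
    ip: str             # IP address of the server hosting the cartridge
    virtual_path: str   # virtual path to the cartridge on that server

    def __str__(self) -> str:
        return f"{self.ip}{self.virtual_path}"


def make_name(ip: str, virtual_path: str) -> CartridgeName:
    """Combine a server address and a virtual path into a cartridge name."""
    ipaddress.ip_address(ip)            # raises ValueError for a malformed address
    if not virtual_path.startswith("/"):
        virtual_path = "/" + virtual_path
    return CartridgeName(ip, virtual_path)


def parse_name(name: str) -> CartridgeName:
    """Split an assumed '<IPv4 address>/<virtual path>' string back into its parts."""
    ip, sep, rest = name.partition("/")
    return make_name(ip, sep + rest)
```

A manager component could use such a name as the key by which a desired cartridge is located and activated, whether on the same server computer or on another one.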
IC JBC Server 300 also includes web application server 308, which enables IC Manager 304 to send requests and receive information through the Internet. Together, IC Manager 304 and web application server 308 enable specific information components to be retrieved by first activating a particular cartridge 306 and then performing some action through database 302. Thus, the name of a desired cartridge 306 can be given to IC Manager 304, which then locates and activates the desired cartridge, through the Internet if necessary. Once cartridge 306 has been activated, it performs a specific function, such as retrieving an information component from database 302, for example. The information component can then be distributed through web application server 308. Of course, any other communication method which enables IC Manager 304 to interact with web application server 308, and to give the component to web application server 308, could also be used. Once the desired original information has been retrieved, it is sent to one of a plurality of image processors 310. Each image processor 310 transforms the original information, such as a document, into a Searchable Image Format (SIF) file. Each SIF file can be searched, transformed into HyperText and manipulated with copy and paste functions. Each information format preferably has its own image processor 310, so that for example a first image processor 310 could manipulate text editor documents, while a second image processor 310 might handle graphics files such as TIF (Tagged Image Format) or GIF (Graphics Interchange Format) format files, for example. Furthermore, each image processor 310 is optionally and preferably able to transform the "original" information into the corresponding SIF file "on the fly". Thus, the SIF file can be created and recreated as needed, without the requirement of storing both the SIF file and the "original" information.
One example of how such a SIF file could be created from a paper document is given in U.S. Application No. 08/625,496, herein incorporated by reference in its entirety, it being understood that this is for illustrative purposes only and is not meant to be limiting. SIF files are preferably actually image files, most preferably fully compatible with the TIF file format, which incorporate both graphic images and information data stored in a separate text file, as well as the structure which relates the graphic and textual information within the original document.
SIF files include a header section for general information about the file such as the image resolution, the digital graphic image stored in the conventional raster format, information relating to individual words or elements of the image file, and administrative information which contains the relational structure of the image and textual elements. The information relating to individual words includes not only the text of the words, but also the data generated by the OCR technology regarding unidentified characters and probable errors (APPS), if the original document was an electronically scanned paper document. Thus, any search of the textual information can compensate for these unidentified characters and errors.
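For illustrative purposes only, the parts of a SIF file listed above, and a search which tolerates unidentified characters, could be modeled as follows. This is not the APPS method itself, which is not detailed here; the sketch simply treats each character position flagged as uncertain by the OCR process as matching any character. The names `SifWord`, `SifFile`, `word_matches` and `search` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class SifWord:
    text: str                                  # text as recognized by the OCR process
    box: Tuple[int, int, int, int]             # left, top, right, bottom in the raster
    uncertain: FrozenSet[int] = frozenset()    # indices of unidentified characters


@dataclass
class SifFile:
    resolution_dpi: int                        # header information
    raster: bytes                              # the graphic image in raster format
    words: List[SifWord] = field(default_factory=list)


def word_matches(word: SifWord, term: str) -> bool:
    """Case-insensitive match in which uncertain positions match any character."""
    if len(word.text) != len(term):
        return False
    return all(
        i in word.uncertain or a.lower() == b.lower()
        for i, (a, b) in enumerate(zip(word.text, term))
    )


def search(sif: SifFile, term: str) -> List[SifWord]:
    """Return the words matching the term; each carries its location in the raster."""
    return [w for w in sif.words if word_matches(w, term)]
```

Because each matching word carries its bounding box, the result of such a search can be related directly to a location within the raster image.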
The actual SIF file is assembled from the basic document elements which were described in Sections 2 and 5. The SIF file is assembled "on the fly" by image processor 310, and can then be distributed to a client through web application server 308. Thus, the SIF file would include the text and images from an original document, for example.
According to another preferred embodiment, the client application issues a request for information by sending a polygon to IC JBC Server 300. This polygon would include the geometrical location of the desired information within a document. The polygon could first be obtained as the results of a search through IC Manager 304, for example. Once obtained, the polygon would enable IC JBC Server 300 to determine exactly which information to package into the object image clip. For example, the client application might only want to retrieve a single table from a newsletter. The appropriate polygon would be sent to IC JBC Server 300, which would then pass the request to the appropriate image processor 310. The table would then be sent as an object image clip to the client application. Thus, rather than storing the original document as a collection of smaller components, the original document would be stored in its entirety but then retrieved as an individual component or components, if desired.
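As one possible sketch of how a polygon could select the information for an object image clip, the elements of a document whose centers fall inside the polygon could be collected as follows. The use of element centers, the ray-casting test and the names `point_in_polygon` and `clip_elements` are assumptions of this example; an image processor could equally crop the raster image to the polygon.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Element = Tuple[str, Tuple[float, float, float, float]]  # (label, (left, top, right, bottom))


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: count the polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                     # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def clip_elements(elements: List[Element], polygon: Sequence[Point]) -> List[str]:
    """Labels of the elements whose bounding-box center lies inside the polygon."""
    selected = []
    for label, (left, top, right, bottom) in elements:
        cx, cy = (left + right) / 2, (top + bottom) / 2
        if point_in_polygon(cx, cy, polygon):
            selected.append(label)
    return selected
```

For a newsletter page, a polygon drawn around a single table would in this way select only the elements of that table for packaging into the clip.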
In preferred embodiments of the present invention, IC JBC Server 300 preferably also includes a view server 312 and a print server 314. Similarly to IC Manager 304, both view server 312 and print server 314 may provide these services by being adapted to the main proprietary ORB models which are available, such as the "Cartridge" model of Oracle Corp. (California, USA) or the "Blade" model of Informix™. For example, print server 314 preferably allows high quality on-demand printing of the original document in a platform-independent manner. Each separate printing service is provided as a cartridge if the "Cartridge" model is used.
View server 312 provides the appropriate image application services to IC Manager 304, such as services related to the display of an image on a computer screen through a GUI, for example. View server 312 could also provide each service as a cartridge if the "Cartridge" model is used.
Section 3: Client as a Web Browser
In the example shown in Figure 13, the GUI could be an HTML (hypertext mark-up language) interface, such that IC client 98 is a Web browser-type software application, it being understood that this is for the purposes of illustration only and is not meant to be limiting in any way. An HTML rendering engine similar to that described in Chapter III, Section 4 could be used. Preferably, an HTML interface 316 displays the Web page. HTML interface 316 could be a Web browser, for example. In addition, preferably the Web page which is displayed is customizable for a particular user by an HTML customization module 318. Also preferably, Java components 320 can be provided to client 98.
While the invention has been described with respect to a limited number of embodiments, it will be appreciated that many variations, modifications and other applications of the invention may be made.

Claims

WHAT IS CLAIMED IS:
1. An information component system for storing an original document, comprising:
(a) at least one information component for storing information from the original document;
(b) an information component identifier for classifying said at least one information component according to at least one information component class; and
(c) at least one property of said at least one information component.
2. The information component system of claim 1, further comprising:
(d) a hierarchy of related classes, such that at least one property of said at least one information component is determined according to a location of said at least one information class in said hierarchy.
3. The information component system of claim 2, wherein said information class is selected from the group consisting of newspaper class and video stream class.
4. The information component system of claim 3, wherein said newspaper class includes at least one information class selected from the group consisting of article, page and picture.
5. The information component system of claim 3, wherein said video stream class includes at least one information class selected from the group consisting of video clip and video frame.
6. The information component system of claim 2, further comprising:
(e) a software system for displaying the original document in an original format; and
(f) a content capturer for capturing said information from the original document by interacting with said software system.
7. The information component system of claim 6, wherein the original document is selected from the group consisting of an XML document, a word processing document, a PDF file, a video stream, an audio stream and a Web page.
8. The information component system of claim 7, wherein said software system is selected from the group consisting of word processor, facsimile machine software, Web browser and Adobe™ Acrobat™.
9. The information component system of claim 8, wherein said software system is said word processor, said word processor featuring a printer driver for printing, and said content capturer captures said information through said printer driver of said word processor.
10. The information component system of claim 8, wherein said content capturer captures said information through a clipboard.
11. The information component system of claim 6, further comprising:
(g) an information component identifier for identifying said at least one information component class of said at least one information component.
12. The information component system of claim 11, further comprising:
(h) a rules table for containing at least one rule according to which said at least one information component class is determined; and
(i) a knowledge base for storing said at least one information component class and said rules table.
12. The information component system of claim 11, wherein said at least one rule features a plurality of tokens, each of said plurality of tokens determining a property of said information component.
13. The information component system of claim 12, wherein said property of said information component is a geometrical location of said information component within the original document.
14. The information component system of claim 13, wherein said property of said at least one information component is a textual attribute.
15. The information component system of claim 11, further comprising:
(h) a contributor for transforming said at least one information component into a software object.
16. The information component system of claim 15, wherein said software object is an XML-environment compatible object.
17. The information component system of claim 16, wherein said XML-environment compatible object features:
(i) data obtained from said at least one information component;
(ii) at least one method for accessing said data; and
(iii) at least one property of said XML-environment compatible object.
18. The information component system of claim 17, wherein said at least one property is a location of said at least one information component in said information component hierarchy.
19. The information component system of claim 18, wherein a link between said XML-environment compatible object and at least one other XML-environment compatible object is determined according to said location.
20. The information component system of claim 19, wherein said link is stored as a link object, said link object being a child of said XML-environment compatible object.
21. The information component system of claim 15, wherein said software object is a CORBA-compatible object.
22. The information component system of claim 15, wherein said software object is selected from the group consisting of a COM object, a Java Bean component having data stored according to the JAR format and a flat file.
23. The information component system of claim 15, further comprising:
(i) a database for storing said at least one information component; and
(j) a server for placing said at least one information component in said database and for retrieving said at least one information component upon receiving a request.
24. The information component system of claim 23, wherein said server is an XML server and said at least one information component is an XML-environment compatible object.
25. The information component system of claim 24, wherein said XML server assembles an XML document upon receiving said request by retrieving said XML-environment compatible object and by placing said XML-environment compatible object into said XML document.
26. The information component system of claim 25, wherein said XML-environment compatible object is linked to at least one other XML-environment compatible object through a link object, said link object being a child of said XML-environment compatible object.
27. The information component system of claim 26, wherein said link object is identified in a link table, and said XML-environment compatible object is identified in an IC table, said link table and said IC table being stored in said database, and said XML server manages said link table and said IC table, such that said link table is altered when said IC table is altered.
28. The information component system of claim 25, wherein said XML document is assembled according to a style sheet, said style sheet featuring at least one rule for displaying said XML-environment compatible object as an XML element.
29. The information component system of claim 25, further comprising:
(k) an XML editor for editing an XML document; and
(l) a DOM interface for interfacing said XML editor and said XML server.
30. The information component system of claim 25, further comprising:
(k) a normalizer for normalizing a tag of said XML element according to at least one rule.
31. The information component system of claim 23, further comprising:
(k) a manager for placing said at least one information component into said database, and for retrieving said at least one information component from said database.
32. The information component system of claim 31, wherein said manager is an object request broker (ORB), such that said request is performed as an object request.
33. The information component system of claim 32, wherein said server further comprises:
(i) at least one Cartridge for performing a service with said at least one information component through said manager.
34. The information component system of claim 23, further comprising:
(k) a publisher for publishing said at least one information component, said publisher receiving said at least one information component from said server; and
(l) a client for receiving said at least one information component from said publisher and for displaying said at least one information component.
35. The information component system of claim 34, wherein said client is a Web browser software application.
36. The information component system of claim 35, wherein said publisher further comprises an HTML rendering engine for rendering said at least one information component as an HTML document.
37. The information component system of claim 1, wherein said at least one information component is a CORBA-compatible object.
38. The information component system of claim 1, wherein said at least one information component is selected from the group consisting of a COM object, a Java Bean component having data stored according to the JAR format and a flat file.
39. The information component system of claim 1, wherein said at least one information component is an XML-environment compatible object.
40. The information component system of claim 1, wherein said information from the original document includes textual data, image data and visual attributes.
41. The information component system of claim 40, wherein said visual attributes are selected from the group consisting of font type, font style and location of said textual data.
42. The information component system of claim 1, wherein said at least one information component is a higher information component, and said at least one information component is further divided into a plurality of lower information components, such that each of said plurality of lower information components is capable of being shared between a plurality of said higher information components.
43. A system for displaying a native file format document, the document including text and having a native file format and a native document appearance, the native file format including at least one instruction for displaying the text of the native file format document, the system comprising:
(a) a Web browser for displaying the native file format document according to the native document appearance; and
(b) an HTML rendering engine for obtaining information regarding the native document appearance of the native file format document, for translating said information into a raster file having a raster format displayable by said Web browser, and for giving said translated information to said Web browser, such that said Web browser is able to display the native file format document.
44. The system of claim 43, further comprising an indexing database, said indexing database containing a plurality of records, each record having a word of the text from the native file format document, such that the text is searchable according to a request from said Web browser.
45. The system of claim 44, wherein each of said records further includes coordinates of said word.
46. The system of claim 44, wherein said coordinates include at least one feature selected from the group consisting of a number designating a page from the native file format document, a number of said word on said page and coordinates of a bounding rectangle of said word.
47. The system of claim 46, wherein each of said records further includes a visual attribute of said word.
48. The system of claim 6, wherein said visual attributes include at least one feature selected from the group consisting of font size, font style, character attribute and space between characters of said word.
49. The system of claim 45, wherein said HTML rendering engine is able to search said indexing database according to said request from said Web browser and to give a result to said Web browser, such that said result is displayable by said Web browser according to the native document appearance.
50. The system of claim 49, wherein said result includes a plurality of matches between said request and the native file format document, such that said HTML rendering engine is able to select one of said plurality of matches according to an instruction from said Web browser.
51. The system of claim 43, further comprising:
(c) a native file format processor for accessing the native file format document through the native file format, such that the at least one instruction of the native file format enables said native file format processor to display the native file format document according to the native document appearance, and such that said HTML rendering engine receives said information about the native file format document from said native file format processor.
52. The system of claim 51, wherein said HTML rendering engine receives said information from said native file format processor through a printer driver.
53. The system of claim 43, wherein said HTML rendering engine obtains said information by substantially directly accessing the native file format document.
54 A method for managing information, compπsing the steps of:
(a) captuπng the information in an electronic format;
(b) converting said captured information into an information component, said information component featuπng
(0 a pointer to a storage location of said captured information;
(n) at least one method for manipulating said captured information; and
(in) at least one property of said captured information;
(c) stonng said information component; and
(d) displaying said information component such that said captured information appears in substantially the onginal format.
55 A information component compnsing a software object, said software object including
(a) a pointer to a storage location of said stored oπginal information; at least one method for manipulating said stored oπgmal information; and
(c) at least one property of said stored onginal information;
56 The information component of claim 55, wherein said software object belongs to an information class, said information class belonging to a hierarchy of related classes, such that said information class has a pool of attnbutes according to a location in said hierarchy
57 The information component of claim 56, wherein said hierarchy includes said information class and at least one information sub-class, such that said pool of attnbutes of said information class is mheπted by said at least one information sub-class. 58 The information component of claim 57, wherein said information class is selected from the group consisting of newspaper class and video stream class
59 The information component of claim 58, wherein said newspaper class includes at least one information subclass selected from the group consisting of article, page and picture
60 The information component of claim 58, wherein said video stream class includes at least one information subclass selected from the group consisting of video clip and video frame
61 A server for serving stored information to a client Web browser, said server compπsing
(a) a database for stoπng the stored information; and
(b) an image processor for accessing the stored information from said database and transforming the stored information into a Searchable Image Format (SIF) file, said SIF file being accessed by the client Web browser, such that the stored information is displayed by the client Web browser
62 The server of claim 61, wherein said SIF file includes
(i) a raster image of the stored information,
(ii) a text of the stored information, and
(iii) a relationship between said text and said raster image, such that a location of said text within said raster image is specified
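The three parts listed in claim 62 can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the claim does not define a byte layout for a SIF file, so the `SifFile` and `TextRegion` types, their field names, and the pixel-rectangle representation of the text location are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRegion:
    """One piece of text and its assumed location in the raster, in pixels."""
    text: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class SifFile:
    """Hypothetical model: raster image, text, and the text-to-raster relationship."""
    raster: list    # (i) raster image, here a list of pixel rows
    regions: list   # (ii) text, with (iii) its location inside the raster

    def full_text(self):
        """Return the text of the stored information in region order."""
        return " ".join(r.text for r in self.regions)

    def locate(self, query):
        """Return the rectangles of all regions whose text contains the query."""
        q = query.lower()
        return [(r.x, r.y, r.width, r.height)
                for r in self.regions if q in r.text.lower()]


if __name__ == "__main__":
    sif = SifFile(
        raster=[[0] * 200 for _ in range(100)],
        regions=[TextRegion("Information", 10, 10, 90, 12),
                 TextRegion("component", 105, 10, 80, 12)],
    )
    print(sif.full_text())
    print(sif.locate("component"))
```

Keeping the location alongside each piece of text is what would let a text search be answered with a position in the displayed image.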
63 The server of claim 61, further comprising a polygon sent from the client Web browser to the server, said polygon specifying a portion of the stored information to be sent to the client Web browser, such that said SIF file includes a raster image of said portion of the stored information and such that substantially only said portion of the stored information is displayed by the client Web browser
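One way a server could restrict a raster image to the portion specified by a polygon, as in claim 63, is sketched below. The claim does not say how the portion is extracted, so the ray-casting test, the bounding-box crop, the function names, and the assumption that polygon vertices are integer pixel coordinates are all illustrative choices rather than the method of the specification.

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (px, py) cross this edge?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) of the polygon vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def crop_to_polygon(raster, polygon, background=0):
    """Crop to the polygon's bounding box, blanking pixels outside the polygon.

    Assumes integer vertex coordinates that lie within the raster.
    """
    x0, y0, x1, y1 = bounding_box(polygon)
    return [
        [raster[y][x] if point_in_polygon(x + 0.5, y + 0.5, polygon) else background
         for x in range(x0, x1)]
        for y in range(y0, y1)
    ]


if __name__ == "__main__":
    raster = [[y * 4 + x for x in range(4)] for y in range(4)]
    square = [(1, 1), (3, 1), (3, 3), (1, 3)]
    print(crop_to_polygon(raster, square, background=-1))
```

Each pixel is tested at its centre, so pixels that lie only partly inside the polygon are kept or blanked according to where their centre falls.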
PCT/US1998/023193 1997-10-31 1998-11-02 Information component management system WO1999023584A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU13715/99A AU1371599A (en) 1997-10-31 1998-11-02 Information component management system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US96171497A 1997-10-31 1997-10-31
US08/961,714 1997-10-31
US08/962,117 1997-10-31
US08/962,117 US6161107A (en) 1997-10-31 1997-10-31 Server for serving stored information to client web browser using text and raster images

Publications (2)

Publication Number Publication Date
WO1999023584A2 true WO1999023584A2 (en) 1999-05-14
WO1999023584A3 WO1999023584A3 (en) 1999-09-10

Family

ID=27130428

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/023193 WO1999023584A2 (en) 1997-10-31 1998-11-02 Information component management system

Country Status (2)

Country Link
AU (1) AU1371599A (en)
WO (1) WO1999023584A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5818446A (en) * 1996-11-18 1998-10-06 International Business Machines Corporation System for changing user interfaces based on display data content

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1074927A3 (en) * 1999-08-05 2003-08-27 The Boeing Company Intelligent wiring diagram system
EP1074927A2 (en) * 1999-08-05 2001-02-07 The Boeing Company Intelligent wiring diagram system
WO2001024053A2 (en) * 1999-09-28 2001-04-05 Xmlexpress, Inc. System and method for automatic context creation for electronic documents
WO2001024053A3 (en) * 1999-09-28 2004-03-25 Xmlexpress Inc System and method for automatic context creation for electronic documents
WO2001035617A2 (en) * 1999-10-29 2001-05-17 Telera, Inc. Distributed call center with local points of presence
WO2001035617A3 (en) * 1999-10-29 2002-09-12 Telera Inc Distributed call center with local points of presence
EP1122652A1 (en) * 2000-02-03 2001-08-08 Mitsubishi Denki Kabushiki Kaisha Data Integration system
US6810429B1 (en) 2000-02-03 2004-10-26 Mitsubishi Electric Research Laboratories, Inc. Enterprise integration system
US7900130B1 (en) 2000-05-26 2011-03-01 Libredigital, Inc. Method, system and computer program product for embedding a hyperlink within a version of a paper
US7181679B1 (en) 2000-05-26 2007-02-20 Newsstand, Inc. Method and system for translating a digital version of a paper
US9122661B2 (en) 2000-05-26 2015-09-01 Libredigital, Inc. Method, system and computer program product for providing digital content
US9087026B2 (en) 2000-05-26 2015-07-21 Libredigital, Inc. Method, system and computer program product for providing digital content
WO2001092986A3 (en) * 2000-05-26 2002-03-28 Newsstand Inc Providing a digital version of a mass-produced printed paper
US9087027B2 (en) 2000-05-26 2015-07-21 Libredigital, Inc. Method, system and computer program product for providing digital content
GB2381350B (en) * 2000-05-26 2005-01-12 Newsstand Inc Method, system, and computer program product for providing a digital version of a mass-produced printed paper
US6845273B1 (en) 2000-05-26 2005-01-18 Newsstand, Inc. Method and system for replacing content in a digital version of a mass-produced printed paper
US6850260B1 (en) 2000-05-26 2005-02-01 Newsstand, Inc. Method and system for identifying a selectable portion of a digital version of a mass-produced printed paper
US8438466B2 (en) 2000-05-26 2013-05-07 Libredigital, Inc. Method, system and computer program product for searching an electronic version of a paper
US8352849B2 (en) 2000-05-26 2013-01-08 Libredigital, Inc. Method, system and computer program product for providing digital content
GB2381350A (en) * 2000-05-26 2003-04-30 Newsstand Inc Method system and computer program product for providing a digital version of a mass-produced printed paper
US7447771B1 (en) 2000-05-26 2008-11-04 Newsstand, Inc. Method and system for forming a hyperlink reference and embedding the hyperlink reference within an electronic version of a paper
WO2001092986A2 (en) * 2000-05-26 2001-12-06 Newsstand, Inc. Providing a digital version of a mass-produced printed paper
US8332742B2 (en) 2000-05-26 2012-12-11 Libredigital, Inc. Method, system and computer program product for providing digital content
US8055994B1 (en) 2000-05-26 2011-11-08 Libredigital, Inc. Method, system and computer program product for displaying a version of a paper
US6839714B2 (en) * 2000-08-04 2005-01-04 Infoglide Corporation System and method for comparing heterogeneous data sources
US6795819B2 (en) * 2000-08-04 2004-09-21 Infoglide Corporation System and method for building and maintaining a database
US7039643B2 (en) * 2001-04-10 2006-05-02 Adobe Systems Incorporated System, method and apparatus for converting and integrating media files
WO2002084638A1 (en) * 2001-04-10 2002-10-24 Presedia, Inc. System, method and apparatus for converting and integrating media files
US7953712B2 (en) 2004-02-24 2011-05-31 Sap Ag Computer system, a database for storing electronic data and a method to operate a database system
EP1569134A1 (en) * 2004-02-24 2005-08-31 Sap Ag A computer system, a database for storing electronic data and a method to operate a database system for converting and displaying archived data
EP4064075A1 (en) * 2021-03-26 2022-09-28 FUJIFILM Business Innovation Corp. Information processing apparatus, program, and information processing method

Also Published As

Publication number Publication date
WO1999023584A3 (en) 1999-09-10
AU1371599A (en) 1999-05-24

Similar Documents

Publication Publication Date Title
US6161107A (en) Server for serving stored information to client web browser using text and raster images
US6249794B1 (en) Providing descriptions of documents through document description files
AU764320B2 (en) Information storage and retrieval system for storing and retrieving the visual form of information from an application in a database
US6401097B1 (en) System and method for integrated document management and related transmission and access
US7168034B2 (en) Method for promoting contextual information to display pages containing hyperlinks
US6832351B1 (en) Method and system for previewing and printing customized business forms
US6169547B1 (en) Method for displaying an icon of media data
US7177949B2 (en) Template architecture and rendering engine for web browser access to databases
US6721921B1 (en) Method and system for annotating documents using an independent annotation repository
US7058944B1 (en) Event driven system and method for retrieving and displaying information
US20050182755A1 (en) Systems and methods for analyzing documents over a network
US7406664B1 (en) System for integrating HTML Web site views into application file dialogs
US7240294B2 (en) Method of constructing a composite image
US7106469B2 (en) Variable data printing with web based imaging
US7007231B2 (en) Document management system employing multi-zone parsing process
JP2011138533A (en) System and method for content delivery over wireless communication medium to portable computing device
US7213202B1 (en) Simplified design for HTML
WO1999023584A2 (en) Information component management system
Merz Web publishing with Acrobat/PDF
KR20060101803A (en) Creating and active viewing method for an electronic document
US6665090B1 (en) System and method for creating and printing a creative expression
JPH09231121A (en) Document storage device
EP0843266A2 (en) Dynamic incremental updating of electronic documents
Hoppe Integrated management of technical documentation: the system SPRITE
Kapidakis Issues in the Development and Operation of a Digital Library

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
NENP Non-entry into the national phase

Ref country code: KR

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA