US9772991B2 - Text extraction - Google Patents
- Publication number
- US9772991B2 (application US15/350,866)
- Authority
- US
- United States
- Prior art keywords
- code
- computer
- computer program
- target
- phone
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24578—Query processing with adaptation to user needs using ranking
-
- G06F17/2765—
-
- G06F17/2705—
-
- G06F17/271—
-
- G06F17/274—
-
- G06F17/277—
-
- G06F17/2775—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Definitions
- FIG. 1 illustrates an example of a flowchart that is usable with the embodiments described herein;
- FIG. 2 depicts a block diagram of a computer system which is adapted to use the present invention.
- FIG. 1 shows a flowchart that is suitable for use with the embodiments herein. It starts by the system receiving a text set 101 that may represent a message, a document, an email, a file, or may represent a set of such text.
- The original input may comprise non-text that has been converted to text for use with the embodiments.
- The delivery of the text set may be initiated by a human or a requesting function, or it may be triggered when the system or repository level has been notified of a new email or an updated document.
- The input may be delivered using any communication means, such as over a wireline network or a wireless network, or it may be the result of a change in memory for an embedded device, etc.
- This process may be completely automated or may require that an input is used, such as a requirements document. This serves as the control point for building the search terms that are used to compare against another set of documents, files, or messages. Any number of text sets can be automated using this process.
- Test 106 determines if there are any TUs remaining after the filter has been run for the text set input. If no TUs remain, get the next text set 107 . Otherwise, the optional determination of the target variations 108 that meets the target requirements is performed next if required by an implementation. For instance, if there are similarities that are allowed, such as the use of synonyms, then these similarities can be calculated and established. For instance, if the sample extraction target is equal to the price of a garment, such as “$100”, variations may be found by running functions that locate expressions similar to it, such as “one-hundred dollars”, “100 dollars”, “USD100”, and other such terms. This may also be used in a semantical context, such as removing terms that do not represent a semantical meaning, such as “lousy product” and “not a good product” would be considered equivalent to “bad product”.
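The search for target variations can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the function names `dollar_amount` and `variations`, the regular expressions, and the tiny `WORD_AMOUNTS` table are all hypothetical. It normalizes each candidate expression to a dollar amount and keeps the candidates that denote the same amount as the target “$100”.

```python
import re

# Hypothetical, deliberately tiny table of spelled-out amounts.
WORD_AMOUNTS = {"one-hundred": 100, "one hundred": 100}

def dollar_amount(expr):
    """Return the dollar amount an expression denotes, or None if unrecognized."""
    s = expr.strip().lower()
    # Forms such as "$100" and "USD100", then forms such as "100 dollars".
    m = (re.fullmatch(r"(?:\$|usd)\s*(\d+)", s)
         or re.fullmatch(r"(\d+)\s*dollars?", s))
    if m:
        return int(m.group(1))
    # Spelled-out forms such as "one-hundred dollars".
    m = re.fullmatch(r"(.+?)\s+dollars?", s)
    if m:
        return WORD_AMOUNTS.get(m.group(1))
    return None

def variations(candidates, target):
    """Keep the candidates that denote the same amount as the target."""
    wanted = dollar_amount(target)
    if wanted is None:
        return []
    return [c for c in candidates if dollar_amount(c) == wanted]

found = variations(
    ["one-hundred dollars", "100 dollars", "USD100", "$10", "blue shirt"],
    "$100",
)
print(found)  # ['one-hundred dollars', '100 dollars', 'USD100']
```

A production system would replace the lookup table with a proper number parser and would handle other currencies; the point here is only that variations are found by mapping different surface forms to one comparable value.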
- A quantification represents a way of establishing the context for a specific section or sections of a text input.
- A topic is used to associate specific terms with a context.
- A topic represents the main point of the text input, and has a starting and an ending point. In some cases, the start and end point match the length of the text; in most cases, they do not. They can be determined in any number of ways; the topic is normally an object (object length is variable) and requires a measurement that determines its importance before it can be used by the system as a topic. Then, the endpoint of the topic needs to be considered. In an exemplary method, the endpoint is found by looking for an extant use of the same start point as an endpoint and extending it to the end of the sentence or paragraph it occurs in.
- the expansion interval assignment 111 may be equivalent to a sentence or a paragraph as well as a quantification interval that may represent any section of the document to be extracted. For instance, an implementation requires that the text extraction contain the extraction target “Siberian”. It may also specify that the topic be used to expand the extraction target. Then, the topic interval for “dog” is used to perform the extraction. In another example, it may be that any object that is related to the topic “dog” is used, causing the entire topical interval to be used, containing any object including “Siberian”.
- the text extraction is completed and can be returned to the user or calling function 112 .
- This may be in the form of a list of words, as would be required for an automated search function, in a variety of orders, such as grouping based on like ranges so that all the variations within a group are visible to the user or calling function. It may also have an expansion interval and this would return a sentence, paragraph, or multiple paragraphs or sections of a document.
- An implementation may include an extraction target “Siberian husky” but would like to see the sentences that occurred within the input that contained the extraction target, which is equal to a sentence expansion interval.
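- A sample document might be “There are several different northern breeds. They include many sled-dog breeds related to the Samoyed, including the Alaskan and Siberian Husky. The most popular northern breed in the United States, the Siberian Husky is a striking breed coming in a variety of fur colors and eye colors including a beautiful blue.”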
- the extraction then would include the two sentences: “They include many sled-dog breeds related to the Samoyed, including the Alaskan and Siberian Husky.” and “The most popular northern breed in the United States, the Siberian Husky is a striking breed coming in a variety of fur colors and eye colors including a beautiful blue”.
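A sentence expansion interval of this kind can be sketched as below. This is a minimal illustration, assuming a naive punctuation-based sentence splitter and a case-insensitive substring match; the function name `expand_to_sentences` is hypothetical, and the sample document is assembled from the sentences quoted in the text.

```python
import re

def expand_to_sentences(document, target):
    """Return every sentence of the document that contains the target."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [s for s in sentences if target.lower() in s.lower()]

doc = (
    "There are several different northern breeds. "
    "They include many sled-dog breeds related to the Samoyed, "
    "including the Alaskan and Siberian Husky. "
    "The most popular northern breed in the United States, the Siberian Husky "
    "is a striking breed coming in a variety of fur colors and eye colors "
    "including a beautiful blue."
)

hits = expand_to_sentences(doc, "Siberian husky")
for sentence in hits:
    print(sentence)
```

The first sentence of the document is dropped because it does not contain the extraction target. A paragraph or topic interval would follow the same pattern with a different splitting step.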
- the returned data may be presented to a user, via a display, or other man-machine interface, or the returned data may be provided to another program application that uses the returned data as input for further processing.
- Extracted text has a variety of uses and can be outputted to any number of end user and interfaces that require a segment of a document that is relevant to a specific task.
- A marketing organization may use an extraction target that modifies a grammatical function so that it includes a specific variable value.
- The extraction target is equal to the use of the object grammatical function, and the specific variable is any negative sentiment.
- The system may return the following extraction targets: “bad toothpaste”, “poor quality toothpaste”, and “inferior mint-flavored toothpaste”. Depending on the needs of the calling function, an expansion variable may or may not be used to include the sentences where these occurred.
- expansions may be further enhanced by the use of quantification measures, such as time intervals, and the use of the time intervals may be refined by a restriction on time, within the last 3 days.
- the time would be indicated in the return, and may be grouped together based on the time interval.
- the expansion may also include location interval information, such as “central city” in the extraction return.
- the toothpaste is good quality and cleans well.
- Inferior mint-flavored toothpaste The mint flavor is obnoxious.
- Output 1 “Dock Montana. This is a poor-quality toothpaste”.
- Output 2 “Ranch Texas. Inferior mint-flavored toothpaste”.
- Output 3 “Fruit Georgia. This is a bad toothpaste that does not really leave your breath fresh.”.
- FIG. 2 illustrates computer system 200 adapted to use the present invention.
- Central processing unit (CPU) 201 is coupled to system bus 202 .
- the CPU 201 may be any general purpose CPU, such as an Intel Pentium processor. However, the present invention is not restricted by the architecture of CPU 201 as long as CPU 201 supports the operations as described herein.
- Bus 202 is coupled to random access memory (RAM) 203 , which may be SRAM, DRAM, or SDRAM.
- ROM 204 is also coupled to bus 202 , which may be PROM, EPROM, or EEPROM.
- RAM 203 and ROM 204 hold user and system data and programs as is well known in the art.
- Bus 202 is also coupled to input/output (I/O) controller 205 , communications adapter 211 , user interface 208 , and display 209 .
- The I/O adapter card 205 connects storage devices 206, such as one or more of flash memory, a hard drive, a CD drive, a floppy disk drive, or a tape drive, to the computer system.
- Communications adapter 211 is adapted to couple the computer system 200 to a network 212, which may be one or more of a telephone network, a local-area network (LAN) and/or a wide-area network (WAN), an Ethernet network, and/or the Internet.
- User interface 208 couples user input devices, such as keyboard 213 , pointing device 207 , to the computer system 200 .
- the display card 209 is driven by CPU 201 to control the display on display device 210 .
- any of the functions described herein may be implemented in hardware, software, and/or firmware, and/or any combination thereof.
- the elements of the present invention are essentially the code segments to perform the necessary tasks.
- the program or code segments can be stored in a processor readable medium.
- the “processor readable medium” may include any physical medium that can store or transfer information. Examples of the processor readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette, a compact disk CD-ROM, an optical disk, a hard disk, a fiber optic medium, etc.
- the code segments may be downloaded via computer networks such as the Internet, Intranet, etc.
- Embodiments described herein operate on or with any network attached storage (NAS), storage array network (SAN), blade server storage, rack server storage, jukebox storage, cloud, storage mechanism, flash storage, solid-state drive, magnetic disk, read only memory (ROM), random access memory (RAM), or any conceivable computing device including scanners, embedded devices, mobile, desktop, server, etc.
- Such devices may comprise one or more of: a computer, a laptop computer, a personal computer, a personal data assistant, a camera, a phone, a cell phone, mobile phone, a computer server, a media server, music player, a game box, a smart phone, a data storage device, measuring device, handheld scanner, a scanning device, a barcode reader, a POS device, digital assistant, desk phone, IP phone, solid-state memory device, tablet, and a memory card.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments are used to extract terms from any text set that are used on other text, such as in a repository, that then can be used in a variety of applications, from providing search results, to analyzing data sets, to building a variety of text generation tools, such as messaging and emails.
Description
This application is a continuation application of U.S. patent application Ser. No. 14/268,865, entitled “TEXT EXTRACTION,” filed May 2, 2014, which claims priority from U.S. Provisional Application No. 61/818,908, “DOCUMENT RECONSTRUCTION, TEXT EXTRACTION, AND AUTOMATED SEARCH”, filed May 2, 2013, which applications are hereby incorporated herein by reference in their entirety.
Currently, a myriad of communication devices are being rapidly introduced that need to interact with natural language in an unstructured manner. Communication systems are finding it difficult to keep pace with the introduction of devices as well as the growth of information.
The accompanying drawings are incorporated in and are a part of this specification. Understanding that these drawings illustrate only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained more fully through the use of these accompanying drawings in which:
The embodiments described herein are used to extract terms from any text set that are used on other text, such as in a repository, that then can be used in a variety of applications, from providing search results, to analyzing data sets, to building a variety of text generation tools, such as messaging and emails. This process is also called text extraction, where the terms of a document are extracted, including single words and multiple words that are linked via grammar, such as “good quality widget” and “food stylist for south magazine”. In some cases, a set of terms that forms a unit of understanding, such as a group of modifiers with an object, is an improvement over other types of extractions based on statistics or prior knowledge about the subject. Using a grammatical filter, these problems can be eliminated or reduced and all (or most) such multi-word terms can be located and added to a search. In addition, combinatorial analysis can be done so that different multi-word terms, such as “Siberian husky” and “Siberian husky sled dog team”, can be related in a variety of ways by recognizing that “Siberian husky” is a subset of “Siberian husky sled dog team”.
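The subset relationship between multi-word terms can be illustrated with a short sketch. This assumes, as one plausible reading, that "subset" means the words of the shorter term occur as a contiguous, case-insensitive run inside the longer term; the function name `contains_term` is hypothetical.

```python
def contains_term(shorter, longer):
    """Return True if the words of `shorter` occur as a contiguous run in `longer`."""
    a = shorter.lower().split()
    b = longer.lower().split()
    if not a or len(a) > len(b):
        return False
    # Slide a window of len(a) words across the longer term.
    return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

print(contains_term("Siberian husky", "Siberian husky sled dog team"))  # True
print(contains_term("husky team", "Siberian husky sled dog team"))      # False
```

Requiring contiguity is a design choice of this sketch; an implementation could instead accept any ordered or unordered word overlap, depending on how strictly related terms should be matched.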
The extraction target method 102 is then determined and serves as the input for the processing of the target, and may be sent using any messaging or call-based system. For most implementations, the determination of an extraction target refers to a grammatical function, such as objects or verbs. A grammatical function is calculated using any number of processing methods; the type of output will generally determine the required grammatical function. Such processing may optionally include semantical functions as well. For instance, a grammatical function is equal to an object; this means that all objects are the extraction target for that document. This is normally used for information analysis tasks, where any object is generally being used, such as for comparing a focus document against a repository, as in a requirements list. The grammatical function for objects should be completed so that the satisfactory output of the system can be achieved, which is to locate those objects in the repository that compare favorably to the focus document, meeting the requirements of such a document.
Another example of this requires that a text input that contains objects is found first; then the term “timeline” can be found within the range of objects. In a sample document, the objects are already known and do not have to be calculated for this process, so that the object “timeline” can then be found by looking at the document object set. The document object set is equal to “project management”, “project timeline”, and “Gantt chart”. In this case, the modifiers of all nouns are part of the object set; this may vary depending on implementation requirements. Then, the term “timeline” is found and is equal to “project timeline”. This illustrates the restriction of the object set to those terms related to “timeline”. Another form of semantical processing is the use of similarity measures, such as synonyms, stemming, and other such approximations, whereby the use of the term “timeline” would also include like terms such as “timelines”.
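The restriction of the object set to terms related to “timeline” can be sketched as follows. This is a minimal illustration, assuming the object set is already known, as in the sample document; `naive_stem` is a crude, hypothetical stand-in for a real stemmer, and `restrict_objects` is likewise an illustrative name.

```python
def naive_stem(word):
    """Crude plural stripping; a stand-in for a real stemmer."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def restrict_objects(object_set, target):
    """Keep the objects containing a word whose stem matches the target's stem."""
    t = naive_stem(target)
    return [obj for obj in object_set
            if any(naive_stem(w) == t for w in obj.split())]

objects = ["project management", "project timeline", "Gantt chart"]
print(restrict_objects(objects, "timeline"))               # ['project timeline']
print(restrict_objects(["project timelines"], "timeline"))  # ['project timelines']
```

The second call shows the similarity measure at work: “timelines” is accepted for the target “timeline” because both reduce to the same stem.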
In some cases, a variable may be assigned that is equal to the extraction target. The variable may be derived from the task the system is performing, or may be the result of the processing of the input by another system. For instance, a company is looking for all the customer responses to its best-selling product. The best-selling product varies over time, so it is not the same product each time the system is run. Therefore, the input to the system may be equal to the variable best-selling product, which may be in one use of the system equal to “short-handled squeegee” and in another use of the system at a different time may equal “long-handled squeegee”. Any number of such variables may be used by this system as well as any length of individual terms, such as a single-word object or a multiple-word object.
The variations of this are based on the follow-on application, such as an automated search, a content feeder based on the focus, or sentiment analysis. In addition, the extraction target may be set by the user directly, meaning that any grammatical functions used as part of the extraction process would have to match the grammatical function of the user-based target input. The target may be comprised of any number of grammatical functions or user inputs that are used to meet a specific implementation requirement. The output of 102 is therefore the parameters of the correct output of the system, which establishes the extraction target that needs to be found within the requested input.
The input that contains the text set then may be parsed to locate term units (TUs) 103 as shown in “SYSTEMS AND METHODS FOR INDEXING INFORMATION FOR A SEARCH ENGINE”, U.S. application Ser. No. 12/192,794, filed 15 Aug. 2008, the disclosure of which is hereby incorporated herein by reference in its entirety, which initially takes the text set and determines the set of TU delimiters that exist for the underlying language or languages of the text set.
The TU delimiter is similar to, but not necessarily, a word delimiter for a given language. TUs are based on characters that map to a specific set of functions within a language, such as words, symbols, punctuation, etc. For instance, in one embodiment, English uses the space as a delimiter for words, but the space alone is insufficient to determine the entire functional usage of each character within the input, such as a sentence stop like a period or question mark. Therefore, it is recommended that a TU splitter be used so that the derived search terms can include symbols and other such characters that have a specific meaning within the language or languages being used in the inputs. In most implementations, the duplicates from the TU list can be removed, unless frequency or other statistical analysis is to be performed at this point.
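A TU splitter along these lines can be sketched with a regular expression. This is an assumption-laden illustration rather than the method of the referenced application: the pattern `TU_PATTERN` and the function `split_tus` are hypothetical, and the pattern simply treats each run of word characters and each individual symbol as a separate TU.

```python
import re

# Assumed pattern: a run of word characters, or any single non-space symbol.
TU_PATTERN = re.compile(r"\w+|[^\w\s]")

def split_tus(text, dedupe=True):
    """Split text into term units, optionally removing duplicates in order."""
    tus = TU_PATTERN.findall(text)
    if not dedupe:
        return tus  # keep repeats when frequency analysis is needed
    # dict.fromkeys preserves first-seen order while dropping repeats.
    return list(dict.fromkeys(tus))

print(split_tus("Is it good? It is good."))
# ['Is', 'it', 'good', '?', 'It', 'is', '.']
```

Unlike a plain split on spaces, the question mark and the period survive as their own TUs. Note that this pattern also breaks hyphenated words such as “short-handled” into three TUs, and that duplicates are compared case-sensitively; both choices would need revisiting for a given language.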
The optional grammatical function for each TU 104 is established at this time, once the TUs in the document have been found for those extract target determinations that require it. Some implementations may already contain the grammatical function embedded in their data structure, while some other implementations may not have performed grammatical analysis on the input. At this point, this grammatical analysis should be done so that the grammatical function of each TU can be known. Any number of methods for determining the grammatical functions can be used. For instance, an exemplary system may determine the parts-of-speech for each term, and use that value as the grammatical function. Another exemplary system may use a set of functions that describe the role of each TU in the document, such as that found within the application entitled “Natural Language Determiner”, filed on 24 Sep. 2012, Ser. No. 13/625,784, which is incorporated herein by reference in its entirety. Regardless of the method, the output of this process comprises a set of terms that meets the grammatical function requirements of 102.
Once the set of TUs that have a specific grammatical function is found for the text set, the filter process 105 can be used to remove any number of TUs by using the extraction target as the filter. A focus TU may be defined as part of the filter using any kind of criterion that is a distinguishing characteristic required for a specific implementation. If there is a criterion that can be expressed as a single TU and can be distinguished from other criteria, then it can be used as a focus TU. The focus TU may be described as a verb, whereby any TU not equal to a verb is filtered out. If a functional descriptor is used, it may be used when it is supported by the underlying data format, such as requiring that only modifiers be used. This is common when sentiment and other such measurements are used since they generally modify an object of interest, such as “this product is good” (good=modifier) or “it is a bad product” (bad=modifier). A less grammatical focus TU can be set, such as a term like “US dollar”. In addition, the use of the focus TU is constrained by the underlying system and the amount of grammar analysis that is available to the system at the time of determining the focus object, and may also be constrained by the requirements for a particular request. An implementation may support any number of extraction target(s).
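The filter step can be sketched as below. This is a minimal illustration with hand-assigned grammatical labels; the `TU` record, the label names, and the `filter_tus` function are all hypothetical, and a real system would obtain the labels from its own grammatical analysis.

```python
from typing import NamedTuple

class TU(NamedTuple):
    text: str
    function: str  # hypothetical labels: "subject", "verb", "modifier", "object"

def filter_tus(tus, target_function=None, focus_text=None):
    """Drop TUs that fail the grammatical-function or focus-term criterion."""
    kept = []
    for tu in tus:
        if target_function is not None and tu.function != target_function:
            continue
        if focus_text is not None and tu.text.lower() != focus_text.lower():
            continue
        kept.append(tu)
    return kept

# "it is a bad product", labeled by hand for the purpose of the example.
tagged = [
    TU("it", "subject"),
    TU("is", "verb"),
    TU("a", "determiner"),
    TU("bad", "modifier"),
    TU("product", "object"),
]

print([tu.text for tu in filter_tus(tagged, target_function="modifier")])  # ['bad']
print([tu.text for tu in filter_tus(tagged, target_function="verb")])      # ['is']
```

Passing `focus_text` instead of, or in addition to, `target_function` corresponds to the less grammatical kind of focus TU, where a specific term is matched directly.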
Optionally, the variations that are found to be part of the same entity are grouped into like ranges 109. This is partially dependent on the system implementation, since in some cases each term could be grouped as soon as it is found to be related. For instance, if a system scans the entire document for a single extraction target, then the grouping would be done as soon as a variation has been found. If there are multiple members of an extraction target, then a system could be implemented whereby the range is built by grouping the members on the fly into the appropriate group after the group membership has been established. Alternatively, the system may look for all the members of a specific extraction target that form the range to be grouped. Regardless of how the grouping has been established, the output of 109 contains the full range of acceptable expressions, grouped together.
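One way to picture the on-the-fly grouping into like ranges is the following sketch. The normalization rule in `canonical_key` (lowercasing and treating hyphens as spaces) is purely an assumption made for illustration; how variations are recognized as belonging to the same entity is left to the implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def canonical_key(term: str) -> str:
    """Hypothetical normalization: lowercase, hyphens as spaces, collapse whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def group_like_ranges(variations: Iterable[str]) -> Dict[str, List[str]]:
    """Place each variation into its group as soon as it is encountered."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for term in variations:
        bucket = groups[canonical_key(term)]
        if term not in bucket:  # keep each distinct surface form once
            bucket.append(term)
    return dict(groups)


found = ["Siberian Husky", "siberian husky", "Siberian-Husky", "Samoyed", "Siberian Husky"]
ranges = group_like_ranges(found)
print(ranges)
# {'siberian husky': ['Siberian Husky', 'siberian husky', 'Siberian-Husky'], 'samoyed': ['Samoyed']}
```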
At this point, the text extraction can return to the calling function or user all the terms that meet the target, including all similarities allowed by the implementation. However, a test for further extraction expansion can be done 110. If positive, then the system can assign expansion intervals 111. This further expansion addresses the case in which a grammatical function is an object and the use of that object in a sentence may be needed to show the originating context of the extraction. For some implementations, such as automated search, this is generally not required; for others, such as feeding information about events related to a specific topic, this is usually required. An example of this is feeding devices for specific information that may require further analysis before the extraction is completed. This includes, but is not limited to, topical analysis, location analysis, or other quantification methods, an example of which is shown in U.S. patent application Ser. No. 13/027,256, filed 14 Feb. 2011, entitled “GRAMMAR TOOLS”, the disclosure of which is hereby incorporated herein by reference.
A quantification represents a way of establishing the context for a specific section or sections of a text input. For example, a topic is used to associate specific terms with a context. A topic represents the main point of the text input, and has a starting point and an ending point. In some cases, the start and end points match the length of the text; in most cases, they do not. They can be determined in any number of ways; the topic is normally an object (object length is variable) and requires a measurement that determines its importance before it can be used by the system as a topic. Then, the endpoint of the topic needs to be considered. In an exemplary method, the endpoint is found by looking for an extant use of the same start point as an endpoint and extending it to the end of the sentence or paragraph it occurs in. Once the further expansion has been determined, the expansion interval assignment 111 may be equivalent to a sentence or a paragraph, or to a quantification interval that may represent any section of the document to be extracted. For instance, an implementation may require that the text extraction contain the extraction target “Siberian”. It may also specify that the topic be used to expand the extraction target. Then, the topic interval for “dog” is used to perform the extraction. In another example, it may be that any object that is related to the topic “dog” is used, causing the entire topical interval to be used, containing any object including “Siberian”.
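A rough sketch of the exemplary endpoint method follows, under several assumptions: the topic is already known, the interval runs from the first sentence that uses the topic to the end of the last sentence that uses it, and sentences are split naively on terminal punctuation. The helper names and the sample text are invented for illustration and do not come from the description.

```python
import re
from typing import List, Optional, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def topic_interval(sentences: List[str], topic: str) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of the sentences that use the topic, inclusive."""
    pattern = re.compile(rf"\b{re.escape(topic)}\b", re.IGNORECASE)
    hits = [i for i, s in enumerate(sentences) if pattern.search(s)]
    if not hits:
        return None
    return hits[0], hits[-1]


def expand_by_topic(text: str, target: str, topic: str) -> List[str]:
    """Return the whole topic interval if the extraction target occurs inside it."""
    sentences = split_sentences(text)
    interval = topic_interval(sentences, topic)
    if interval is None:
        return []
    start, end = interval
    span = sentences[start:end + 1]
    if any(target.lower() in s.lower() for s in span):
        return span
    return []


# Made-up sample text, used only to exercise the sketch.
text = (
    "Cats sleep most of the day. A dog needs daily exercise. "
    "Siberian breeds tolerate cold weather. Every dog benefits from training. "
    "Birds migrate in autumn."
)
for sentence in expand_by_topic(text, "Siberian", "dog"):
    print(sentence)
```

Here the interval for "dog" spans the second through fourth sentences, so the sentence containing "Siberian" is returned together with its topical context even though it does not mention the topic itself.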
Once the full extent of the extraction has been assigned, then the text extraction is completed and can be returned to the user or calling function 112. The return may take the form of a list of words, as would be required for an automated search function, and may be presented in a variety of orders, such as grouped by like ranges so that all the variations within a group are visible to the user or calling function. If an expansion interval is assigned, the return would be a sentence, a paragraph, or multiple paragraphs or sections of a document. For example, an implementation may specify the extraction target “Siberian husky” and request the sentences of the input that contain that target, which is equal to a sentence expansion interval. A sample document might be “There are several different northern breeds. They include many sled-dog breeds related to the Samoyed, including the Alaskan and Siberian Husky. The most popular northern breed in the United States, the Siberian Husky is a striking breed coming in a variety of fur colors and eye colors including a beautiful blue.” The extraction then would include the two sentences: “They include many sled-dog breeds related to the Samoyed, including the Alaskan and Siberian Husky.” and “The most popular northern breed in the United States, the Siberian Husky is a striking breed coming in a variety of fur colors and eye colors including a beautiful blue.” Note that the returned data may be presented to a user, via a display or other man-machine interface, or the returned data may be provided to another program application that uses the returned data as input for further processing.
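The sentence expansion interval in this example can be sketched as follows. The sentence splitter and the case-insensitive substring match are simplifying assumptions, and `extract_sentences` is a hypothetical helper name rather than part of the described system.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_sentences(text: str, target: str) -> List[str]:
    """Return every sentence that contains the extraction target, ignoring case."""
    needle = target.lower()
    return [s for s in split_sentences(text) if needle in s.lower()]


document = (
    "There are several different northern breeds. "
    "They include many sled-dog breeds related to the Samoyed, "
    "including the Alaskan and Siberian Husky. "
    "The most popular northern breed in the United States, the Siberian Husky "
    "is a striking breed coming in a variety of fur colors and eye colors "
    "including a beautiful blue."
)

for sentence in extract_sentences(document, "Siberian husky"):
    print(sentence)
```

Run on the sample document, the sketch returns the second and third sentences and omits the first, which does not contain the target.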
Extracted text has a variety of uses and can be output to any number of end users and interfaces that require a segment of a document that is relevant to a specific task. For instance, a marketing organization may use an extraction target that modifies a grammatical function so that it includes a specific variable value. The extraction target is equal to the use of the object grammatical function, and the specific variable is any negative sentiment. The system may return the following extraction targets: “bad toothpaste”, “poor quality toothpaste”, and “inferior mint-flavored toothpaste”. Depending on the needs of the calling function, the organization may also use an expansion variable to include the sentences in which these targets occurred. These expansions may be further enhanced by the use of quantification measures, such as time intervals, and the use of the time intervals may be refined by a restriction on time, such as within the last 3 days. The time would be indicated in the return, and the results may be grouped together based on the time interval. The expansion may also include location interval information, such as “central city”, in the extraction return. An example input is as follows.
Toothpaste Reviews
Nov. 20, 2013
Reviewer 1: Palm Florida
The toothpaste is good quality and cleans well.
Reviewer 2: Dock Montana
This is a poor-quality toothpaste. It does not remove stains.
Reviewer 3: Ranch Texas
Inferior mint-flavored toothpaste. The mint flavor is obnoxious.
Reviewer 4: Fruit Georgia
This is a bad toothpaste that does not really leave your breath fresh. It also does not remove plaque.
The output of the system would look like this. Output 1=“Dock Montana. This is a poor-quality toothpaste”. Output 2=“Ranch Texas. Inferior mint-flavored toothpaste”. Output 3=“Fruit Georgia. This is a bad toothpaste that does not really leave your breath fresh.”.
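The toothpaste example can be approximated with the sketch below. The negative-sentiment lexicon `NEGATIVE`, the header pattern "Reviewer N: location", and the rule that a sentence qualifies when it contains both the object "toothpaste" and a negative word are all assumptions made for illustration; a real implementation would rely on the grammatical functions of the TUs rather than word lists. The time restriction mentioned in the description is left out for brevity.

```python
import re
from typing import List, Tuple

NEGATIVE = {"bad", "poor", "inferior"}  # hypothetical negative-sentiment lexicon
OBJECT = "toothpaste"                   # the object of interest

REVIEWS = """Reviewer 1: Palm Florida
The toothpaste is good quality and cleans well.
Reviewer 2: Dock Montana
This is a poor-quality toothpaste. It does not remove stains.
Reviewer 3: Ranch Texas
Inferior mint-flavored toothpaste. The mint flavor is obnoxious.
Reviewer 4: Fruit Georgia
This is a bad toothpaste that does not really leave your breath fresh. It also does not remove plaque."""


def parse_reviews(raw: str) -> List[Tuple[str, str]]:
    """Return (location, review text) pairs from alternating header and body lines."""
    pairs = []
    location = None
    for line in raw.splitlines():
        header = re.match(r"Reviewer \d+:\s*(.+)", line)
        if header:
            location = header.group(1).strip()
        elif location is not None and line.strip():
            pairs.append((location, line.strip()))
    return pairs


def negative_extractions(raw: str) -> List[str]:
    """Return 'location. sentence' for sentences pairing the object with a negative word."""
    results = []
    for location, body in parse_reviews(raw):
        for sentence in re.split(r"(?<=[.!?])\s+", body):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            if OBJECT in words and words & NEGATIVE:
                results.append(f"{location}. {sentence}")
    return results


for i, out in enumerate(negative_extractions(REVIEWS), start=1):
    print(f"Output {i}={out!r}")
```

Under these assumptions the sketch yields one result each for Dock Montana, Ranch Texas, and Fruit Georgia; the positive review from Palm Florida and the follow-up sentences that do not mention the toothpaste are filtered out.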
Note that any of the functions described herein may be implemented in hardware, software, and/or firmware, and/or any combination thereof. When implemented in software, the elements of the present invention are essentially the code segments that perform the necessary tasks. The program or code segments can be stored in a processor readable medium. The “processor readable medium” may include any physical medium that can store or transfer information. Examples of the processor readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette, a compact disk (CD-ROM), an optical disk, a hard disk, a fiber optic medium, etc. The code segments may be downloaded via computer networks such as the Internet, an intranet, etc.
Embodiments described herein operate on or with any network attached storage (NAS), storage area network (SAN), blade server storage, rack server storage, jukebox storage, cloud, storage mechanism, flash storage, solid-state drive, magnetic disk, read only memory (ROM), random access memory (RAM), or any conceivable computing device, including scanners, embedded devices, and mobile, desktop, and server devices. Such devices may comprise one or more of: a computer, a laptop computer, a personal computer, a personal data assistant, a camera, a phone, a cell phone, a mobile phone, a computer server, a media server, a music player, a game box, a smart phone, a data storage device, a measuring device, a handheld scanner, a scanning device, a barcode reader, a POS device, a digital assistant, a desk phone, an IP phone, a solid-state memory device, a tablet, and a memory card.
Claims (3)
1. A computer program product having a non-transitory computer-readable medium, wherein the computer-readable medium has a computer program logic recorded thereon, the computer program logic is operative on an input data set involving grammar, the computer program product comprising:
code for receiving the input data set;
code for determining a target of the input data set, wherein the target is a variable that is a grammatical function;
code for expanding the target, thereby forming an expanded target;
code for parsing the input data set into a plurality of term units (TUs), wherein each TU is separated from another TU by a delimiter;
code for retrieving a respective grammatical function for each TU;
code for grouping the TUs according to their respective grammatical function based on the expanded target; and
code for presenting the grouped TUs to an interface.
2. The computer program product of claim 1 , wherein the code for presenting comprises code for displaying.
3. The computer program product of claim 1 , wherein the computer program product resides on a device selected from the group of devices comprising:
a computer, a laptop computer, a personal computer, a personal data assistant, a camera, a phone, a cell phone, a mobile phone, a computer server, a media server, a music player, a game box, a smart phone, a data storage device, a measuring device, a handheld scanner, a scanning device, a barcode reader, a POS device, a digital assistant, a desk phone, an IP phone, solid-state memory device, a tablet, and a memory card.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/350,866 US9772991B2 (en) | 2013-05-02 | 2016-11-14 | Text extraction |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361818908P | 2013-05-02 | 2013-05-02 | |
US14/268,865 US9495357B1 (en) | 2013-05-02 | 2014-05-02 | Text extraction |
US15/350,866 US9772991B2 (en) | 2013-05-02 | 2016-11-14 | Text extraction |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/268,865 Continuation US9495357B1 (en) | 2013-05-02 | 2014-05-02 | Text extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170060841A1 (en) | 2017-03-02 |
US9772991B2 (en) | 2017-09-26 |
Family
ID=57235035
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/268,983 Expired - Fee Related US9727619B1 (en) | 2013-05-02 | 2014-05-02 | Automated search |
US14/268,865 Expired - Fee Related US9495357B1 (en) | 2013-05-02 | 2014-05-02 | Text extraction |
US15/350,866 Expired - Fee Related US9772991B2 (en) | 2013-05-02 | 2016-11-14 | Text extraction |
US15/670,914 Abandoned US20180032527A1 (en) | 2013-05-02 | 2017-08-07 | Automated Search Matching |
Country Status (1)
Country | Link |
---|---|
US (4) | US9727619B1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10698978B1 (en) * | 2017-03-27 | 2020-06-30 | Charles Malcolm Hatton | System of english language sentences and words stored in spreadsheet cells that read those cells and use selected sentences that analyze columns of text and compare cell values to read other cells in one or more spreadsheets |
US11334608B2 (en) * | 2017-11-23 | 2022-05-17 | Infosys Limited | Method and system for key phrase extraction and generation from text |
US11013340B2 (en) | 2018-05-23 | 2021-05-25 | L&P Property Management Company | Pocketed spring assembly having dimensionally stabilizing substrate |
US11762864B2 (en) * | 2018-10-31 | 2023-09-19 | Kyndryl, Inc. | Chat session external content recommender |
US11790170B2 (en) * | 2019-01-10 | 2023-10-17 | Chevron U.S.A. Inc. | Converting unstructured technical reports to structured technical reports using machine learning |
US20220245377A1 (en) * | 2021-01-29 | 2022-08-04 | Intuit Inc. | Automated text information extraction from electronic documents |
CN115238670B (en) * | 2022-08-09 | 2023-07-04 | 平安科技(深圳)有限公司 | Information text extraction method, device, equipment and storage medium |
Citations (80)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US2003007A (en) | 1923-02-17 | 1935-05-28 | American Morgan Company | Material handling system for mines |
US5161245A (en) | 1991-05-01 | 1992-11-03 | Apple Computer, Inc. | Pattern recognition system having inter-pattern spacing correction |
US5251129A (en) * | 1990-08-21 | 1993-10-05 | General Electric Company | Method for automated morphological analysis of word structure |
US5317507A (en) | 1990-11-07 | 1994-05-31 | Gallant Stephen I | Method for document retrieval and for word sense disambiguation using neural networks |
US5495413A (en) | 1992-09-25 | 1996-02-27 | Sharp Kabushiki Kaisha | Translation machine having a function of deriving two or more syntaxes from one original sentence and giving precedence to a selected one of the syntaxes |
US5528491A (en) | 1992-08-31 | 1996-06-18 | Language Engineering Corporation | Apparatus and method for automated natural language translation |
US5598557A (en) | 1992-09-22 | 1997-01-28 | Caere Corporation | Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files |
US5761631A (en) | 1994-11-17 | 1998-06-02 | International Business Machines Corporation | Parsing method and system for natural language processing |
US5924105A (en) | 1997-01-27 | 1999-07-13 | Michigan State University | Method and product for determining salient features for use in information searching |
US5930746A (en) | 1996-03-20 | 1999-07-27 | The Government Of Singapore | Parsing and translating natural language sentences automatically |
US5963969A (en) * | 1997-05-08 | 1999-10-05 | William A. Tidwell | Document abstraction system and method thereof |
US5995922A (en) | 1996-05-02 | 1999-11-30 | Microsoft Corporation | Identifying information related to an input word in an electronic dictionary |
US6112168A (en) | 1997-10-20 | 2000-08-29 | Microsoft Corporation | Automatically recognizing the discourse structure of a body of text |
US6119077A (en) | 1996-03-21 | 2000-09-12 | Sharp Kasbushiki Kaisha | Translation machine with format control |
US6167368A (en) | 1998-08-14 | 2000-12-26 | The Trustees Of Columbia University In The City Of New York | Method and system for indentifying significant topics of a document |
US6311182B1 (en) | 1997-11-17 | 2001-10-30 | Genuity Inc. | Voice activated web browser |
US6317707B1 (en) | 1998-12-07 | 2001-11-13 | At&T Corp. | Automatic clustering of tokens from a corpus for grammar acquisition |
US6327589B1 (en) | 1998-06-24 | 2001-12-04 | Microsoft Corporation | Method for searching a file having a format unsupported by a search engine |
US20020046019A1 (en) | 2000-08-18 | 2002-04-18 | Lingomotors, Inc. | Method and system for acquiring and maintaining natural language information |
US20020052901A1 (en) | 2000-09-07 | 2002-05-02 | Guo Zhi Li | Automatic correlation method for generating summaries for text documents |
US20020078044A1 (en) * | 2000-12-19 | 2002-06-20 | Jong-Cheol Song | System for automatically classifying documents by category learning using a genetic algorithm and a term cluster and method thereof |
US20020111792A1 (en) | 2001-01-02 | 2002-08-15 | Julius Cherny | Document storage, retrieval and search systems and methods |
US20020143524A1 (en) | 2000-09-29 | 2002-10-03 | Lingomotors, Inc. | Method and resulting system for integrating a query reformation module onto an information retrieval system |
US6473730B1 (en) | 1999-04-12 | 2002-10-29 | The Trustees Of Columbia University In The City Of New York | Method and system for topical segmentation, segment significance and segment function |
US20030007889A1 (en) | 2001-05-24 | 2003-01-09 | Po Chien | Multi-burner flame ionization combustion chamber |
US20030023423A1 (en) | 2001-07-03 | 2003-01-30 | Kenji Yamada | Syntax-based statistical translation model |
US20030074184A1 (en) | 2001-10-15 | 2003-04-17 | Hayosh Thomas E. | Chart parsing using compacted grammar representations |
US6553347B1 (en) | 1999-01-25 | 2003-04-22 | Active Point Ltd. | Automatic virtual negotiations |
US20030167266A1 (en) | 2001-01-08 | 2003-09-04 | Alexander Saldanha | Creation of structured data from plain text |
US20030200077A1 (en) | 2002-04-19 | 2003-10-23 | Claudia Leacock | System for rating constructed responses based on concepts and a model answer |
US20030216904A1 (en) | 2002-05-16 | 2003-11-20 | Knoll Sonja S. | Method and apparatus for reattaching nodes in a parse structure |
US20030236659A1 (en) * | 2002-06-20 | 2003-12-25 | Malu Castellanos | Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging |
US6675159B1 (en) | 2000-07-27 | 2004-01-06 | Science Applic Int Corp | Concept-based search and retrieval system |
US6684183B1 (en) | 1999-12-06 | 2004-01-27 | Comverse Ltd. | Generic natural language service creation environment |
US6731802B1 (en) | 2000-01-14 | 2004-05-04 | Microsoft Corporation | Lattice and method for identifying and normalizing orthographic variations in Japanese text |
US20040143808A1 (en) * | 2003-01-21 | 2004-07-22 | Infineon Technologies North America Corp. | Method of resolving mismatched parameters in computer-aided integrated circuit design |
US20040148170A1 (en) | 2003-01-23 | 2004-07-29 | Alejandro Acero | Statistical classifiers for spoken language understanding and command/control scenarios |
US20040243409A1 (en) * | 2003-05-30 | 2004-12-02 | Oki Electric Industry Co., Ltd. | Morphological analyzer, morphological analysis method, and morphological analysis program |
US20050049852A1 (en) | 2003-09-03 | 2005-03-03 | Chao Gerald Cheshun | Adaptive and scalable method for resolving natural language ambiguities |
US20050065776A1 (en) | 2003-09-24 | 2005-03-24 | International Business Machines Corporation | System and method for the recognition of organic chemical names in text documents |
US20050081146A1 (en) * | 2003-10-14 | 2005-04-14 | Fujitsu Limited | Relation chart-creating program, relation chart-creating method, and relation chart-creating apparatus |
US20050108001A1 (en) * | 2001-11-15 | 2005-05-19 | Aarskog Brit H. | Method and apparatus for textual exploration discovery |
US20050171783A1 (en) | 1999-07-17 | 2005-08-04 | Suominen Edwin A. | Message recognition using shared language model |
US20050220351A1 (en) | 2004-03-02 | 2005-10-06 | Microsoft Corporation | Method and system for ranking words and concepts in a text using graph-based ranking |
US20050256700A1 (en) | 2004-05-11 | 2005-11-17 | Moldovan Dan I | Natural language question answering system and method utilizing a logic prover |
US20060089928A1 (en) | 2004-10-20 | 2006-04-27 | Oracle International Corporation | Computer-implemented methods and systems for entering and searching for non-Roman-alphabet characters and related search systems |
US20060117307A1 (en) | 2004-11-24 | 2006-06-01 | Ramot At Tel-Aviv University Ltd. | XML parser |
US20060129380A1 (en) * | 2004-12-10 | 2006-06-15 | Hisham El-Shishiny | System and method for disambiguating non diacritized arabic words in a text |
US7072794B2 (en) | 2001-08-28 | 2006-07-04 | Rockefeller University | Statistical methods for multivariate ordinal data which are used for data base driven decision support |
US20060224570A1 (en) | 2005-03-31 | 2006-10-05 | Quiroga Martin A | Natural language based search engine for handling pronouns and methods of use therefor |
US20070078832A1 (en) | 2005-09-30 | 2007-04-05 | Yahoo! Inc. | Method and system for using smart tags and a recommendation engine using smart tags |
US20070078814A1 (en) | 2005-10-04 | 2007-04-05 | Kozoru, Inc. | Novel information retrieval systems and methods |
US20070100618A1 (en) | 2005-11-02 | 2007-05-03 | Samsung Electronics Co., Ltd. | Apparatus, method, and medium for dialogue speech recognition using topic domain detection |
US20070239433A1 (en) * | 2006-04-06 | 2007-10-11 | Chaski Carole E | Variables and method for authorship attribution |
US20070265829A1 (en) | 2006-05-10 | 2007-11-15 | Cisco Technology, Inc. | Techniques for passing data across the human-machine interface |
US20070282811A1 (en) * | 2006-01-03 | 2007-12-06 | Musgrove Timothy A | Search system with query refinement and search method |
US20080109475A1 (en) | 2006-10-25 | 2008-05-08 | Sven Burmester | Method Of Creating A Requirement Description For An Embedded System |
US20080133444A1 (en) * | 2006-12-05 | 2008-06-05 | Microsoft Corporation | Web-based collocation error proofing |
US7389233B1 (en) | 2003-09-02 | 2008-06-17 | Verizon Corporate Services Group Inc. | Self-organizing speech recognition for information extraction |
US20080154828A1 (en) | 2006-12-21 | 2008-06-26 | Support Machines Ltd. | Method and a Computer Program Product for Providing a Response to A Statement of a User |
US20080313180A1 (en) | 2007-06-14 | 2008-12-18 | Microsoft Corporation | Identification of topics for online discussions based on language patterns |
US20090037458A1 (en) | 2006-01-03 | 2009-02-05 | France Telecom | Assistance Method and Device for Building The Aborescence of an Electronic Document Group |
US20090058860A1 (en) | 2005-04-04 | 2009-03-05 | Mor (F) Dynamics Pty Ltd. | Method for Transforming Language Into a Visual Form |
US20090094185A1 (en) | 2007-10-09 | 2009-04-09 | Lawson Software, Inc. | User-definable run-time grouping of data records |
US20090177617A1 (en) * | 2008-01-03 | 2009-07-09 | Apple Inc. | Systems, methods and apparatus for providing unread message alerts |
US20090216522A1 (en) * | 2008-02-27 | 2009-08-27 | Kabushiki Kaisha Toshiba | Apparatus, method, and computer program product for determing parts-of-speech in chinese |
US20100241963A1 (en) | 2009-03-17 | 2010-09-23 | Kulis Zachary R | System, method, and apparatus for generating, customizing, distributing, and presenting an interactive audio publication |
US20100332502A1 (en) | 2009-06-30 | 2010-12-30 | International Business Machines Corporation | Method and system for searching numerical terms |
US20110106523A1 (en) * | 2005-06-24 | 2011-05-05 | Rie Maeda | Method and Apparatus for Creating a Language Model and Kana-Kanji Conversion |
US20110246183A1 (en) | 2008-12-15 | 2011-10-06 | Kentaro Nagatomo | Topic transition analysis system, method, and program |
US20110252027A1 (en) * | 2010-04-09 | 2011-10-13 | Palo Alto Research Center Incorporated | System And Method For Recommending Interesting Content In An Information Stream |
US8180713B1 (en) | 2007-04-13 | 2012-05-15 | Standard & Poor's Financial Services Llc | System and method for searching and identifying potential financial risks disclosed within a document |
US20120245923A1 (en) * | 2011-03-21 | 2012-09-27 | Xerox Corporation | Corpus-based system and method for acquiring polar adjectives |
US20130021346A1 (en) | 2011-07-22 | 2013-01-24 | Terman David S | Knowledge Acquisition Mulitplex Facilitates Concept Capture and Promotes Time on Task |
US20130091139A1 (en) * | 2011-10-06 | 2013-04-11 | GM Global Technology Operations LLC | Method and system to augment vehicle domain ontologies for vehicle diagnosis |
US8423350B1 (en) | 2009-05-21 | 2013-04-16 | Google Inc. | Segmenting text for searching |
US20130262091A1 (en) | 2012-03-30 | 2013-10-03 | The Florida State University Research Foundation, Inc. | Automated extraction of bio-entity relationships from literature |
US8719244B1 (en) * | 2005-03-23 | 2014-05-06 | Google Inc. | Methods and systems for retrieval of information items and associated sentence fragments |
US20140229159A1 (en) * | 2013-02-11 | 2014-08-14 | Appsense Limited | Document summarization using noun and sentence ranking |
US8856006B1 (en) | 2012-01-06 | 2014-10-07 | Google Inc. | Assisted speech input |
Family Cites Families (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6112201A (en) * | 1995-08-29 | 2000-08-29 | Oracle Corporation | Virtual bookshelf |
CA2302264C (en) * | 1997-09-04 | 2009-09-15 | British Telecommunications Public Limited Company | Methods and/or systems for selecting data sets |
US7124129B2 (en) * | 1998-03-03 | 2006-10-17 | A9.Com, Inc. | Identifying the items most relevant to a current query based on items selected in connection with similar queries |
US6243670B1 (en) * | 1998-09-02 | 2001-06-05 | Nippon Telegraph And Telephone Corporation | Method, apparatus, and computer readable medium for performing semantic analysis and generating a semantic structure having linked frames |
JP3915267B2 (en) * | 1998-09-07 | 2007-05-16 | 富士ゼロックス株式会社 | Document search apparatus and document search method |
GB9821969D0 (en) * | 1998-10-08 | 1998-12-02 | Canon Kk | Apparatus and method for processing natural language |
US6549897B1 (en) * | 1998-10-09 | 2003-04-15 | Microsoft Corporation | Method and system for calculating phrase-document importance |
US6510406B1 (en) * | 1999-03-23 | 2003-01-21 | Mathsoft, Inc. | Inverse inference engine for high performance web search |
US6519585B1 (en) * | 1999-04-27 | 2003-02-11 | Infospace, Inc. | System and method for facilitating presentation of subject categorizations for use in an on-line search query engine |
US6757646B2 (en) * | 2000-03-22 | 2004-06-29 | Insightful Corporation | Extended functionality for an inverse inference engine based web search |
US7398201B2 (en) * | 2001-08-14 | 2008-07-08 | Evri Inc. | Method and system for enhanced data searching |
US7526425B2 (en) * | 2001-08-14 | 2009-04-28 | Evri Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
US7593932B2 (en) * | 2002-01-16 | 2009-09-22 | Elucidon Group Limited | Information data retrieval, where the data is organized in terms, documents and document corpora |
US20030200192A1 (en) * | 2002-04-18 | 2003-10-23 | Bell Brian L. | Method of organizing information into topical, temporal, and location associations for organizing, selecting, and distributing information |
US6886010B2 (en) * | 2002-09-30 | 2005-04-26 | The United States Of America As Represented By The Secretary Of The Navy | Method for data and text mining and literature-based discovery |
US7885963B2 (en) * | 2003-03-24 | 2011-02-08 | Microsoft Corporation | Free text and attribute searching of electronic program guide (EPG) data |
KR100515641B1 (en) * | 2003-04-24 | 2005-09-22 | 우순조 | Method for sentence structure analysis based on mobile configuration concept and method for natural language search using of it |
US20040267731A1 (en) * | 2003-04-25 | 2004-12-30 | Gino Monier Louis Marcel | Method and system to facilitate building and using a search database |
US7496567B1 (en) * | 2004-10-01 | 2009-02-24 | Terril John Steichen | System and method for document categorization |
JP2006252047A (en) * | 2005-03-09 | 2006-09-21 | Fuji Xerox Co Ltd | Language processor, and language processing program |
US20060259475A1 (en) * | 2005-05-10 | 2006-11-16 | Dehlinger Peter J | Database system and method for retrieving records from a record library |
US8312034B2 (en) * | 2005-06-24 | 2012-11-13 | Purediscovery Corporation | Concept bridge and method of operating the same |
NZ569107A (en) * | 2005-11-16 | 2011-09-30 | Evri Inc | Extending keyword searching to syntactically and semantically annotated data |
US7630992B2 (en) * | 2005-11-30 | 2009-12-08 | Selective, Inc. | Selective latent semantic indexing method for information retrieval applications |
US7624103B2 (en) * | 2006-07-21 | 2009-11-24 | Aol Llc | Culturally relevant search results |
JP5017666B2 (en) * | 2006-08-08 | 2012-09-05 | 国立大学法人京都大学 | Eigenvalue decomposition apparatus and eigenvalue decomposition method |
US7571158B2 (en) * | 2006-08-25 | 2009-08-04 | Oracle International Corporation | Updating content index for content searches on networks |
US7676457B2 (en) * | 2006-11-29 | 2010-03-09 | Red Hat, Inc. | Automatic index based query optimization |
US7672935B2 (en) * | 2006-11-29 | 2010-03-02 | Red Hat, Inc. | Automatic index creation based on unindexed search evaluation |
US7636715B2 (en) * | 2007-03-23 | 2009-12-22 | Microsoft Corporation | Method for fast large scale data mining using logistic regression |
US7890486B2 (en) * | 2007-08-06 | 2011-02-15 | Ronald Claghorn | Document creation, linking, and maintenance system |
US8706474B2 (en) * | 2008-02-23 | 2014-04-22 | Fair Isaac Corporation | Translation of entity names based on source document publication date, and frequency and co-occurrence of the entity names |
US20100287162A1 (en) * | 2008-03-28 | 2010-11-11 | Sanika Shirwadkar | method and system for text summarization and summary based query answering |
US20100042589A1 (en) * | 2008-08-15 | 2010-02-18 | Smyros Athena A | Systems and methods for topical searching |
US7882143B2 (en) * | 2008-08-15 | 2011-02-01 | Athena Ann Smyros | Systems and methods for indexing information for a search engine |
US7996383B2 (en) * | 2008-08-15 | 2011-08-09 | Athena A. Smyros | Systems and methods for a search engine having runtime components |
US9424339B2 (en) * | 2008-08-15 | 2016-08-23 | Athena A. Smyros | Systems and methods utilizing a search engine |
US8965881B2 (en) * | 2008-08-15 | 2015-02-24 | Athena A. Smyros | Systems and methods for searching an index |
US8156120B2 (en) * | 2008-10-22 | 2012-04-10 | James Brady | Information retrieval using user-generated metadata |
US8219579B2 (en) * | 2008-12-04 | 2012-07-10 | Michael Ratiner | Expansion of search queries using information categorization |
US9223850B2 (en) * | 2009-04-16 | 2015-12-29 | Kabushiki Kaisha Toshiba | Data retrieval and indexing method and apparatus |
US8375033B2 (en) * | 2009-10-19 | 2013-02-12 | Avraham Shpigel | Information retrieval through identification of prominent notions |
US8255401B2 (en) * | 2010-04-28 | 2012-08-28 | International Business Machines Corporation | Computer information retrieval using latent semantic structure via sketches |
US9454962B2 (en) * | 2011-05-12 | 2016-09-27 | Microsoft Technology Licensing, Llc | Sentence simplification for spoken language understanding |
US8983963B2 (en) * | 2011-07-07 | 2015-03-17 | Software Ag | Techniques for comparing and clustering documents |
JP2013235507A (en) * | 2012-05-10 | 2013-11-21 | Mynd Inc | Information processing method and device, computer program and recording medium |
US8954465B2 (en) | 2012-05-22 | 2015-02-10 | Google Inc. | Creating query suggestions based on processing of descriptive term in a partial query |
US9201876B1 (en) * | 2012-05-29 | 2015-12-01 | Google Inc. | Contextual weighting of words in a word grouping |
US9323767B2 (en) * | 2012-10-01 | 2016-04-26 | Longsand Limited | Performance and scalability in an intelligent data operating layer system |
US9542934B2 (en) * | 2014-02-27 | 2017-01-10 | Fuji Xerox Co., Ltd. | Systems and methods for using latent variable modeling for multi-modal video indexing |
- 2014-05-02: US application US14/268,983, granted as US9727619B1; not active (Expired - Fee Related)
- 2014-05-02: US application US14/268,865, granted as US9495357B1; not active (Expired - Fee Related)
- 2016-11-14: US application US15/350,866, granted as US9772991B2; not active (Expired - Fee Related)
- 2017-08-07: US application US15/670,914, published as US20180032527A1; not active (Abandoned)
Patent Citations (80)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US2003007A (en) | 1923-02-17 | 1935-05-28 | American Morgan Company | Material handling system for mines |
US5251129A (en) * | 1990-08-21 | 1993-10-05 | General Electric Company | Method for automated morphological analysis of word structure |
US5317507A (en) | 1990-11-07 | 1994-05-31 | Gallant Stephen I | Method for document retrieval and for word sense disambiguation using neural networks |
US5161245A (en) | 1991-05-01 | 1992-11-03 | Apple Computer, Inc. | Pattern recognition system having inter-pattern spacing correction |
US5528491A (en) | 1992-08-31 | 1996-06-18 | Language Engineering Corporation | Apparatus and method for automated natural language translation |
US5598557A (en) | 1992-09-22 | 1997-01-28 | Caere Corporation | Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files |
US5495413A (en) | 1992-09-25 | 1996-02-27 | Sharp Kabushiki Kaisha | Translation machine having a function of deriving two or more syntaxes from one original sentence and giving precedence to a selected one of the syntaxes |
US5761631A (en) | 1994-11-17 | 1998-06-02 | International Business Machines Corporation | Parsing method and system for natural language processing |
US5930746A (en) | 1996-03-20 | 1999-07-27 | The Government Of Singapore | Parsing and translating natural language sentences automatically |
US6119077A (en) | 1996-03-21 | 2000-09-12 | Sharp Kabushiki Kaisha | Translation machine with format control |
US5995922A (en) | 1996-05-02 | 1999-11-30 | Microsoft Corporation | Identifying information related to an input word in an electronic dictionary |
US5924105A (en) | 1997-01-27 | 1999-07-13 | Michigan State University | Method and product for determining salient features for use in information searching |
US5963969A (en) * | 1997-05-08 | 1999-10-05 | William A. Tidwell | Document abstraction system and method thereof |
US6112168A (en) | 1997-10-20 | 2000-08-29 | Microsoft Corporation | Automatically recognizing the discourse structure of a body of text |
US6311182B1 (en) | 1997-11-17 | 2001-10-30 | Genuity Inc. | Voice activated web browser |
US6327589B1 (en) | 1998-06-24 | 2001-12-04 | Microsoft Corporation | Method for searching a file having a format unsupported by a search engine |
US6167368A (en) | 1998-08-14 | 2000-12-26 | The Trustees Of Columbia University In The City Of New York | Method and system for indentifying significant topics of a document |
US6317707B1 (en) | 1998-12-07 | 2001-11-13 | At&T Corp. | Automatic clustering of tokens from a corpus for grammar acquisition |
US6553347B1 (en) | 1999-01-25 | 2003-04-22 | Active Point Ltd. | Automatic virtual negotiations |
US6473730B1 (en) | 1999-04-12 | 2002-10-29 | The Trustees Of Columbia University In The City Of New York | Method and system for topical segmentation, segment significance and segment function |
US20050171783A1 (en) | 1999-07-17 | 2005-08-04 | Suominen Edwin A. | Message recognition using shared language model |
US6684183B1 (en) | 1999-12-06 | 2004-01-27 | Comverse Ltd. | Generic natural language service creation environment |
US6731802B1 (en) | 2000-01-14 | 2004-05-04 | Microsoft Corporation | Lattice and method for identifying and normalizing orthographic variations in Japanese text |
US6675159B1 (en) | 2000-07-27 | 2004-01-06 | Science Applic Int Corp | Concept-based search and retrieval system |
US20020046019A1 (en) | 2000-08-18 | 2002-04-18 | Lingomotors, Inc. | Method and system for acquiring and maintaining natural language information |
US20020052901A1 (en) | 2000-09-07 | 2002-05-02 | Guo Zhi Li | Automatic correlation method for generating summaries for text documents |
US20020143524A1 (en) | 2000-09-29 | 2002-10-03 | Lingomotors, Inc. | Method and resulting system for integrating a query reformation module onto an information retrieval system |
US20020078044A1 (en) * | 2000-12-19 | 2002-06-20 | Jong-Cheol Song | System for automatically classifying documents by category learning using a genetic algorithm and a term cluster and method thereof |
US20020111792A1 (en) | 2001-01-02 | 2002-08-15 | Julius Cherny | Document storage, retrieval and search systems and methods |
US20030167266A1 (en) | 2001-01-08 | 2003-09-04 | Alexander Saldanha | Creation of structured data from plain text |
US20030007889A1 (en) | 2001-05-24 | 2003-01-09 | Po Chien | Multi-burner flame ionization combustion chamber |
US20030023423A1 (en) | 2001-07-03 | 2003-01-30 | Kenji Yamada | Syntax-based statistical translation model |
US7072794B2 (en) | 2001-08-28 | 2006-07-04 | Rockefeller University | Statistical methods for multivariate ordinal data which are used for data base driven decision support |
US20030074184A1 (en) | 2001-10-15 | 2003-04-17 | Hayosh Thomas E. | Chart parsing using compacted grammar representations |
US20050108001A1 (en) * | 2001-11-15 | 2005-05-19 | Aarskog Brit H. | Method and apparatus for textual exploration discovery |
US20030200077A1 (en) | 2002-04-19 | 2003-10-23 | Claudia Leacock | System for rating constructed responses based on concepts and a model answer |
US20030216904A1 (en) | 2002-05-16 | 2003-11-20 | Knoll Sonja S. | Method and apparatus for reattaching nodes in a parse structure |
US20030236659A1 (en) * | 2002-06-20 | 2003-12-25 | Malu Castellanos | Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging |
US20040143808A1 (en) * | 2003-01-21 | 2004-07-22 | Infineon Technologies North America Corp. | Method of resolving mismatched parameters in computer-aided integrated circuit design |
US20040148170A1 (en) | 2003-01-23 | 2004-07-29 | Alejandro Acero | Statistical classifiers for spoken language understanding and command/control scenarios |
US20040243409A1 (en) * | 2003-05-30 | 2004-12-02 | Oki Electric Industry Co., Ltd. | Morphological analyzer, morphological analysis method, and morphological analysis program |
US7389233B1 (en) | 2003-09-02 | 2008-06-17 | Verizon Corporate Services Group Inc. | Self-organizing speech recognition for information extraction |
US20050049852A1 (en) | 2003-09-03 | 2005-03-03 | Chao Gerald Cheshun | Adaptive and scalable method for resolving natural language ambiguities |
US20050065776A1 (en) | 2003-09-24 | 2005-03-24 | International Business Machines Corporation | System and method for the recognition of organic chemical names in text documents |
US20050081146A1 (en) * | 2003-10-14 | 2005-04-14 | Fujitsu Limited | Relation chart-creating program, relation chart-creating method, and relation chart-creating apparatus |
US20050220351A1 (en) | 2004-03-02 | 2005-10-06 | Microsoft Corporation | Method and system for ranking words and concepts in a text using graph-based ranking |
US20050256700A1 (en) | 2004-05-11 | 2005-11-17 | Moldovan Dan I | Natural language question answering system and method utilizing a logic prover |
US20060089928A1 (en) | 2004-10-20 | 2006-04-27 | Oracle International Corporation | Computer-implemented methods and systems for entering and searching for non-Roman-alphabet characters and related search systems |
US20060117307A1 (en) | 2004-11-24 | 2006-06-01 | Ramot At Tel-Aviv University Ltd. | XML parser |
US20060129380A1 (en) * | 2004-12-10 | 2006-06-15 | Hisham El-Shishiny | System and method for disambiguating non diacritized arabic words in a text |
US8719244B1 (en) * | 2005-03-23 | 2014-05-06 | Google Inc. | Methods and systems for retrieval of information items and associated sentence fragments |
US20060224570A1 (en) | 2005-03-31 | 2006-10-05 | Quiroga Martin A | Natural language based search engine for handling pronouns and methods of use therefor |
US20090058860A1 (en) | 2005-04-04 | 2009-03-05 | Mor (F) Dynamics Pty Ltd. | Method for Transforming Language Into a Visual Form |
US20110106523A1 (en) * | 2005-06-24 | 2011-05-05 | Rie Maeda | Method and Apparatus for Creating a Language Model and Kana-Kanji Conversion |
US20070078832A1 (en) | 2005-09-30 | 2007-04-05 | Yahoo! Inc. | Method and system for using smart tags and a recommendation engine using smart tags |
US20070078814A1 (en) | 2005-10-04 | 2007-04-05 | Kozoru, Inc. | Novel information retrieval systems and methods |
US20070100618A1 (en) | 2005-11-02 | 2007-05-03 | Samsung Electronics Co., Ltd. | Apparatus, method, and medium for dialogue speech recognition using topic domain detection |
US20070282811A1 (en) * | 2006-01-03 | 2007-12-06 | Musgrove Timothy A | Search system with query refinement and search method |
US20090037458A1 (en) | 2006-01-03 | 2009-02-05 | France Telecom | Assistance Method and Device for Building The Aborescence of an Electronic Document Group |
US20070239433A1 (en) * | 2006-04-06 | 2007-10-11 | Chaski Carole E | Variables and method for authorship attribution |
US20070265829A1 (en) | 2006-05-10 | 2007-11-15 | Cisco Technology, Inc. | Techniques for passing data across the human-machine interface |
US20080109475A1 (en) | 2006-10-25 | 2008-05-08 | Sven Burmester | Method Of Creating A Requirement Description For An Embedded System |
US20080133444A1 (en) * | 2006-12-05 | 2008-06-05 | Microsoft Corporation | Web-based collocation error proofing |
US20080154828A1 (en) | 2006-12-21 | 2008-06-26 | Support Machines Ltd. | Method and a Computer Program Product for Providing a Response to A Statement of a User |
US8180713B1 (en) | 2007-04-13 | 2012-05-15 | Standard & Poor's Financial Services Llc | System and method for searching and identifying potential financial risks disclosed within a document |
US20080313180A1 (en) | 2007-06-14 | 2008-12-18 | Microsoft Corporation | Identification of topics for online discussions based on language patterns |
US20090094185A1 (en) | 2007-10-09 | 2009-04-09 | Lawson Software, Inc. | User-definable run-time grouping of data records |
US20090177617A1 (en) * | 2008-01-03 | 2009-07-09 | Apple Inc. | Systems, methods and apparatus for providing unread message alerts |
US20090216522A1 (en) * | 2008-02-27 | 2009-08-27 | Kabushiki Kaisha Toshiba | Apparatus, method, and computer program product for determing parts-of-speech in chinese |
US20110246183A1 (en) | 2008-12-15 | 2011-10-06 | Kentaro Nagatomo | Topic transition analysis system, method, and program |
US20100241963A1 (en) | 2009-03-17 | 2010-09-23 | Kulis Zachary R | System, method, and apparatus for generating, customizing, distributing, and presenting an interactive audio publication |
US8423350B1 (en) | 2009-05-21 | 2013-04-16 | Google Inc. | Segmenting text for searching |
US20100332502A1 (en) | 2009-06-30 | 2010-12-30 | International Business Machines Corporation | Method and system for searching numerical terms |
US20110252027A1 (en) * | 2010-04-09 | 2011-10-13 | Palo Alto Research Center Incorporated | System And Method For Recommending Interesting Content In An Information Stream |
US20120245923A1 (en) * | 2011-03-21 | 2012-09-27 | Xerox Corporation | Corpus-based system and method for acquiring polar adjectives |
US20130021346A1 (en) | 2011-07-22 | 2013-01-24 | Terman David S | Knowledge Acquisition Mulitplex Facilitates Concept Capture and Promotes Time on Task |
US20130091139A1 (en) * | 2011-10-06 | 2013-04-11 | GM Global Technology Operations LLC | Method and system to augment vehicle domain ontologies for vehicle diagnosis |
US8856006B1 (en) | 2012-01-06 | 2014-10-07 | Google Inc. | Assisted speech input |
US20130262091A1 (en) | 2012-03-30 | 2013-10-03 | The Florida State University Research Foundation, Inc. | Automated extraction of bio-entity relationships from literature |
US20140229159A1 (en) * | 2013-02-11 | 2014-08-14 | Appsense Limited | Document summarization using noun and sentence ranking |
Also Published As
Publication number | Publication date |
---|---|
US20170060841A1 (en) | 2017-03-02 |
US20180032527A1 (en) | 2018-02-01 |
US9727619B1 (en) | 2017-08-08 |
US9495357B1 (en) | 2016-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9772991B2 (en) | | Text extraction |
CN106649818B (en) | | Application search intention identification method and device, application search method and server |
CN108932294B (en) | | Resume data processing method, device, equipment and storage medium based on index |
CN107704512B (en) | | Financial product recommendation method based on social data, electronic device and medium |
CN110263248B (en) | | Information pushing method, device, storage medium and server |
US20170300565A1 (en) | | System and method for entity extraction from semi-structured text documents |
US20110153595A1 (en) | | System And Method For Identifying Topics For Short Text Communications |
JP6056610B2 (en) | | Text information processing apparatus, text information processing method, and text information processing program |
US8793120B1 (en) | | Behavior-driven multilingual stemming |
US20160140389A1 (en) | | Information extraction supporting apparatus and method |
CN112732893B (en) | | Text information extraction method and device, storage medium and electronic equipment |
US20180075020A1 (en) | | Date and Time Processing |
US8290925B1 (en) | | Locating product references in content pages |
US9330075B2 (en) | | Method and apparatus for identifying garbage template article |
CN114595686A (en) | | Knowledge extraction method, and training method and device of knowledge extraction model |
KR102280490B1 (en) | | Training data construction method for automatically generating training data for artificial intelligence model for counseling intention classification |
CN110705285B (en) | | Government affair text subject word library construction method, device, server and readable storage medium |
CN110245357B (en) | | Main entity identification method and device |
CN110413996B (en) | | Method and device for constructing zero-index digestion corpus |
CN112199958A (en) | | Concept word sequence generation method and device, computer equipment and storage medium |
CN110362656A (en) | | A kind of semantic feature extracting method and device |
CN114255067A (en) | | Data pricing method and device, electronic equipment and storage medium |
JP7216627B2 (en) | | INPUT SUPPORT METHOD, INPUT SUPPORT SYSTEM, AND PROGRAM |
US20210117448A1 (en) | | Iterative sampling based dataset clustering |
WO2019192122A1 (en) | | Document topic parameter extraction method, product recommendation method and device, and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20210926 |