US20110093417A1 - Topical sentiments in electronically stored communications - Google Patents

Topical sentiments in electronically stored communications

Info

Publication number
US20110093417A1
Authority
US
United States
Prior art keywords
expression
polar
topical
polarity
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/969,356
Other versions
US8041669B2 (en)
Inventor
Kamal P. Nigam
Matthew F. Hurst
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Buzzmetrics Ltd
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to US12/969,356 (patent US8041669B2)
Publication of US20110093417A1
Assigned to BUZZMETRICS, LTD. Assignment of assignors interest (see document for details). Assignors: HURST, MATTHEW F., NIGAM, KAMAL P.
Application granted
Publication of US8041669B2
Assigned to CITIBANK, N.A., AS COLLATERAL AGENT FOR THE FIRST LIEN SECURED PARTIES. Supplemental IP security agreement. Assignors: THE NIELSEN COMPANY (US), LLC
Assigned to THE NIELSEN COMPANY (US), LLC. Release (Reel 037172 / Frame 0415). Assignors: CITIBANK, N.A.
Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/253: Grammatical analysis; Style critique

Definitions

  • One way to address BrightScreen's business need would be a text mining toolkit that automatically identifies just those email fragments that are topical to LCD screens and also express positive or negative sentiment. These fragments will contain the most salient representation of the consumers' likes and dislikes specifically with regard to the product at hand.
  • the goal of the present invention is to reliably extract polar sentences about a specific topic from a corpus of data containing both relevant and irrelevant text.
  • the present invention, therefore, provides a lightweight but robust approach to combining topic and polarity, thus enabling text mining systems to select content based on a certain opinion about a certain topic.
  • a first aspect of the present invention can be characterized as providing a computer implemented method (in which a computer can be any type of computer or computer system, network or combination thereof programmed and configured to perform the steps described herein) for obtaining topical sentiments from an electronically stored communication (which can be, for example and without limitation, an electronic document, message, email, blog post, and the like—and it is not important to the invention exactly where or how the communication is electronically stored and/or accessed) that includes the steps of (in no specific order): (a) determining a topic of a segment of the communication; and (b) locating a polar expression in the communication.
  • the method also includes the step of (c) determining a polarity of the polar expression, where the polarity may be positive, negative, mixed and/or neutral, for example. It is also within the scope of the invention that the method include the step of (d) associating the determined polarity with the determined topic.
  • the steps (b) locating a polar expression in the electronically stored communication and (c) determining the polarity of the polar expression may include the steps of: (1) establishing a domain-general polarity lexicon of sentimental/polar phrases (i.e., words and phrases); (2) establishing a topical domain being explored; (3) generating a polarity lexicon of sentimental phrases associated with the topical domain; (4) utilizing the polarity lexicon against phrases found in the polar expression; and (5) assigning at least one polar phrase in the polar expression a polarity associated with a matching phrase in the polarity lexicon.
  • the step (c) of determining the polarity of the polar expression may also include the step of assigning at least one polar phrase in the polar expression a polarity.
  • the method may further include the step of (e) analyzing the polar expression with syntactic and/or semantic rules to determine a topic of the polar expression and to link the determined topic to the polarity of the polar phrase.
  • the step (a) of determining a topic of the segment of the communication containing or associated with the polar expression includes the step of processing the segment with a communication (i.e., text) classifier.
  • Such communication classifier may utilize an algorithm, such as a Winnow algorithm, a Support Vector Machine algorithm, a k-Nearest Neighbor algorithm, other machine learning algorithms, or a hand-built rules-based classifier.
  • step (a) of determining a topic of the segment of the communication and the step (c) of determining the polarity of the polar expression are independent tasks.
  • the segments of the communication discussed above may be an entire communication or a portion of the communication, such as a sentence, for example. Further, the segment discussed above may be the polar expression.
  • a second aspect of the present invention can be characterized as providing a computer implemented method for obtaining topical sentiments from a body of communications (text, electronic, etc.) comprising the steps of: (a) isolating a subset of the communications relevant to a particular topic; and (b) locating a polar expression in at least one of the subset of communications.
  • the method may also include the steps of (c) determining the polarity of the polar expression and (d) associating the polarity with the particular topic.
  • a third aspect of the present invention can be characterized as providing a computer implemented method for obtaining topical sentiments from a body of communications (text, electronic, etc.) comprising the steps of: (a) isolating a first subset of the communications relevant to a particular topic; and (b) isolating a second subset of communications from the first subset of communications where the second subset of communications includes polar segments (i.e., negative or positive) located in the first subset of communications.
  • the second subset can be broken into further subsets depending upon the particular polarity of the polar segments (i.e., there can be subsets for positive segments, negative segments, neutral segments and/or others).
  • the method may also include the step of (c) associating the polar segments with the particular topic.
  • the segments can be a sentence, a phrase, a paragraph or an entire communication for example.
  • a fourth aspect of the present invention can be characterized as providing a computer implemented method for obtaining topical sentiments from a plurality of electronically stored communications that includes the steps of: (a) determining with the assistance of a computer whether each communication in a plurality of communications is topical to a first predefined topic; (b) for each communication determined to be topical to the predefined topic, separating with the assistance of a computer the communication into one or more expressions (a word or a group of words that form a constituent of a sentence and are considered as a single unit); (c) for each expression, determining with the assistance of a computer if the expression is topical to a second predefined topic; and (d) for each expression that is determined to be topical to the second predefined topic, determining with the assistance of a computer a polarity of the expression.
  • the polarity may be positive, negative, and/or neutral.
  • the step of determining the polarity of the expression may include the steps of: establishing a topical domain being explored; generating a polarity lexicon of sentimental words and/or phrases associated with the topical domain; utilizing with the assistance of a computer the polarity lexicon against words and/or phrases found in the expression; and assigning at least one polar phrase in the expression a polarity associated with a matching word and/or phrase in the polarity lexicon.
  • the step of determining the polarity of the expression may further include the step of analyzing with the assistance of a computer the expression with syntactic and/or semantic rules.
  • the step of determining with the assistance of a computer whether each communication in a plurality of communications is topical to a first predefined topic includes the step of processing each communication with a text classifier.
  • This text classifier may utilize an algorithm such as a Winnow algorithm, a Support Vector Machine algorithm, a k-Nearest Neighbor algorithm or a rules-based classifier.
  • the method may further include the step of (e) calculating with the assistance of a computer an aggregate metric from the plurality of expressions which estimates the frequency of positive and/or negative polar expressions.
  • This step may include the generation of statistically-valid confidence bounds on the aggregate metric.
  • This step (e) may also include the steps of: for each of the plurality of expressions, estimating an opinion based upon the presence, absence or strength of polarity associated with the predefined topic; and aggregating the overall opinion for the plurality of expressions.
  • the step of aggregating the overall opinion for the plurality of expressions may include a step of normalizing the ratio of empirical or estimated frequency of positive and negative polarity associated with the predefined topic.
  • the step (e) of calculating an aggregate metric from the plurality of expressions may utilize Bayesian statistics to derive estimates for positive and negative frequencies of polar expressions.
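The aggregate metric with confidence bounds described in the preceding steps can be illustrated with a minimal sketch. The source does not specify the statistical model, so the Beta prior, the normal approximation to the posterior, and the name `aggregate_polarity` are all assumptions made for illustration only.

```python
import math


def aggregate_polarity(n_pos, n_neg, prior_pos=1.0, prior_neg=1.0, z=1.96):
    """Estimate the fraction of positive polar expressions, with rough bounds.

    Assumptions (not from the source): a Beta(prior_pos, prior_neg) prior on
    the positive rate, and a normal approximation to the Beta posterior.
    """
    a = n_pos + prior_pos  # posterior pseudo-count of positive expressions
    b = n_neg + prior_neg  # posterior pseudo-count of negative expressions
    mean = a / (a + b)     # posterior mean of the positive rate
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))  # variance of a Beta(a, b)
    half_width = z * math.sqrt(var)
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)
```

Under these assumptions the bounds tighten as the number of polar expressions grows, which matches the intuition that brands with more message volume support more confident estimates.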
  • the first predefined topic is a general topic and the second predefined topic is a specific topic associated with the general topic.
  • the general topic is a product or service and the specific topic is a feature of the product or service.
  • the general topic is a commercial brand and the specific topic is a feature of the commercial brand. It is also within the scope of the invention that the first predefined topic and the second predefined topic are the same topic.
  • a fifth aspect of the present invention can be characterized as a computer implemented method for calculating, from a plurality of electronically stored expressions, an aggregate metric which estimates a frequency of positive and/or negative polar expressions contained in the expressions.
  • the method includes the steps of: for each of a plurality of electronically stored expressions, determining with the assistance of a computer an opinion contained in the expressions based upon at least one of the presence, absence and strength of polarity associated with a predefined topic; and calculating an aggregate metric from the determined opinions of the plurality of expressions.
  • the step of calculating an aggregate metric from the determined opinions of the plurality of expressions includes the generation of statistically-valid confidence bounds on the aggregate metric.
  • the step of calculating an aggregate metric from the determined opinions of the plurality of expressions includes a step of normalizing the ratio of empirical or estimated frequency of positive and negative polarity associated with the predefined topic.
  • the step of calculating an aggregate metric from the determined opinions of the plurality of expressions further includes utilizing Bayesian statistics to derive estimates for positive and negative frequencies of polar expressions.
  • at least a portion of the plurality of expressions are taken from larger electronically stored communications.
  • the step of determining an opinion contained in the expressions includes the steps of, for each expression: determining with the assistance of a computer that the expression is topical to the predefined topic; and determining with the assistance of a computer a polarity of the expression.
  • a sixth aspect of the present invention can be characterized as a computer implemented method for finding one or more polar expressions in an electronically stored communication, which includes the step of analyzing with the assistance of a computer the electronically stored communication for one or more polar expressions within the electronically stored communication.
  • This analyzing step includes the steps of: providing a polarity lexicon of sentimental words and/or phrases associated with a topical domain; utilizing with the assistance of a computer the polarity lexicon against words and/or phrases found in the expression; and assigning with the assistance of a computer at least one word/phrase in the expression a polarity associated with a matching word/phrase in the polarity lexicon.
  • the step of assigning with the assistance of a computer at least one word/phrase in the expression a polarity associated with a matching word/phrase in the polarity lexicon includes the steps of: separating the expression into word/phrase chunks; tagging the separated word/phrase chunks with part-of-speech tags; applying the polarity lexicon against the word/phrase chunks to tag one or more of the word/phrase chunks with a polarity tag; and applying syntactic and semantic rules against the tagged word/phrase chunks to elevate the polarity of the word/phrase chunk to the entire expression.
  • the step of applying syntactic and semantic rules against the word/phrase chunks to elevate the polarity of the word/phrase chunk to the entire expression includes the step of identifying a word/phrase chunk in the expression that toggles the polarity of the word/phrase chunk tagged with the polarity tag.
  • the step of applying syntactic and semantic rules against the word/phrase chunks to elevate the polarity of the word/phrase chunk to the entire expression includes the step of performing with the assistance of a computer grammatical analysis on the expression.
  • a seventh aspect of the present invention can be characterized as a computer implemented method for tuning a polarity lexicon for use in classifying polar expressions, which includes the steps of: (a) providing a polarity lexicon; (b) with the assistance of a computer implemented graphical user interface providing a user with candidate words for addition, subtraction or exclusion to the polarity lexicon; and (c) adding, subtracting or excluding each candidate word from the polarity lexicon according to input received by the graphical user interface.
  • the step of (b) providing a user with candidates for addition, subtraction or exclusion to the polarity lexicon includes a step of scanning a plurality of electronic messages collected for the topical domain for words that have the potential to be added to the lexicon.
  • the scanning step may include a pattern based method that locates adjectives and adverbs that have a substantial chance of being polar; or the scanning step may locate candidate words by filtering the communication for words that appear at least a predefined number of times; or the scanning step may include a pattern based method that locates adjectives and adverbs that have a substantial chance of being polar and locates candidate words by filtering the communication for words that appear at least a predefined number of times.
  • the step of (b) providing with a graphical user interface a user with candidate words for addition, subtraction or exclusion to the polarity lexicon includes the step of presenting each candidate word to the user with the word's part of speech label and an example of that candidate word appearing in at least one electronic message collected for the topical domain.
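The candidate-scanning step can be sketched as follows. This is a simplified illustration, assuming the messages are already tokenized and POS-tagged as `(word, tag)` pairs; the function name `candidate_words`, the tag prefixes `JJ` and `RB`, and the default threshold are hypothetical choices, not details from the source.

```python
from collections import Counter


def candidate_words(tagged_messages, lexicon, min_count=2):
    """Return (word, tag, example message) candidates for a human reviewer.

    tagged_messages: list of messages, each a list of (word, POS tag) pairs.
    lexicon: set of lowercase words already present in the polarity lexicon.
    """
    counts = Counter()
    examples = {}
    for message in tagged_messages:
        for word, tag in message:
            # Pattern-based filter: keep adjectives (JJ*) and adverbs (RB*).
            if not tag.startswith(("JJ", "RB")):
                continue
            key = (word.lower(), tag)
            if key[0] in lexicon:
                continue  # already in the lexicon, nothing to review
            counts[key] += 1
            # Remember the first message in which the candidate appeared.
            examples.setdefault(key, " ".join(w for w, _ in message))
    # Frequency filter: keep words seen at least min_count times.
    return [(w, t, examples[(w, t)])
            for (w, t), c in counts.most_common() if c >= min_count]
```

Each returned tuple carries the word, its part-of-speech label, and one example message, which is the information the interface is described as presenting to the user.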
  • An eighth aspect of the present invention can be characterized as a computer implemented method for obtaining topical sentiments from an electronically stored communication, which includes the steps of: determining with the assistance of a computer one or more topical expressions in the communication; locating with the assistance of a computer one or more polar expressions in the communication; and identifying an expression that is both a topical expression and a polar expression as containing a topical sentiment.
  • the steps of determining one or more topical expressions and locating one or more polar expressions are isolated steps performed on the same communication.
  • the step of determining one or more topical expressions includes a step of applying an automated text classifier on the communication and the step of locating one or more polar expressions includes the step of utilizing a domain-specific lexicon and shallow NLP techniques.
  • FIG. 1 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing results of a text mining algorithm;
  • FIG. 2 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how more specific sub-topics may be selected;
  • FIG. 3 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how both the positive and negative expressions found in the communications may be displayed to the user;
  • FIG. 4 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing complete text of a message containing an expression selected from the interface of FIG. 3 ;
  • FIG. 5 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how rule-based classifiers may be built;
  • FIG. 6 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how a user may use the system to tune a polarity lexicon;
  • FIG. 7 is a scatter-plot graph showing an analysis of confidence bounds by the amount of message volume about a brand.
  • the present invention strikes out a middle ground between the Wiebe et al. and Pang et al. approaches and presents a fusion of polarity and topicality.
  • One approach to performing this task is to do a full NLP-style analysis of each sentence and understand at a deep level the semantics of the sentence, how it relates to the topic, and whether the sentiment of the expression is positive or negative.
  • we approximate the topicality judgment with either a statistical machine learning classifier or a hand-built rules-based classifier and the polarity judgment with shallow NLP techniques.
  • An exemplary embodiment assumes that any sentence that is both polar and topical is polar about the topic in question.
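The fusion assumption stated above (a sentence that is both polar and topical is treated as polar about the topic) can be sketched in a few lines. The callables `is_topical` and `polarity_of` are hypothetical stand-ins for the topic classifier and the polarity detector; they are not APIs from the source.

```python
def topical_sentiments(sentences, is_topical, polarity_of):
    """Keep sentences that are both topical and polar.

    is_topical: callable returning True when a sentence is on topic.
    polarity_of: callable returning "positive", "negative", or None.
    Both are placeholders for independently built components.
    """
    results = []
    for sentence in sentences:
        polarity = polarity_of(sentence)
        # The two judgments are made independently and then intersected.
        if polarity in ("positive", "negative") and is_topical(sentence):
            results.append((sentence, polarity))
    return results
```

Because the two judgments are independent, either component can be replaced (for example, a rules-based topic classifier instead of a learned one) without changing this combining step.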
  • the present application presents exemplary methods for performing topical sentiment analysis employing fusion of polarity and topicality.
  • One approach to performing this task is to perform a full NLP-style analysis of each sentence and understand at a deep level the semantics of the sentence, how it relates to the topic, and whether the sentiment of the expression is positive or negative (or any other sentiment capable of being expressed in a message).
  • the present invention, therefore, provides a lightweight but robust approach to combining topic and polarity, thus enabling content access systems to select content based on a certain opinion about a certain topic.
  • Texts can be broadly categorized as subjective or objective. Those that are subjective often carry some indication of the author's opinion or evaluation of a topic, as well as some indication of the author's emotional state with respect to that topic. For example, the expression ‘this is an excellent car’ indicates the author's evaluation of the car in question; ‘I hate it!’ reflects the author's emotional state with respect to the topic.
  • An additional type of expression informative in this context is that which indicates a desirable or undesirable condition. These expressions may be deemed objective. For example, ‘It is broken’ may well be objective but is still describing an undesirable state.
  • An idealized view of polarity detection would be able to accept any document, or subsection thereof, and provide an indication of the polarity: the segment is either positive, negative, mixed or neutral.
  • expressions can be analyzed and rated according to the strength of any sentiment that is capable of being expressed in words. Such sentiments need not be analyzed in opposite pairs (as in the case of positivity and negativity in the exemplary embodiment); a message can be analyzed for the expression of any individual qualitative sentiment, and the relative strength of that sentiment in the message can be expressed on a numerical scale (as is described herein with reference to the sentiments of positivity and negativity). Examples of such additional qualitative sentiments that can be expressed in a message and analyzed according to the present invention include, but are not limited to: anger, hate, fear, loyalty, happiness, respect, confidence, pride, hope, doubt, and disappointment.
  • knowing that a piece of text is positive is only as useful as our ability to determine the topic of the segment. If a brand manager is told that this set of documents is positive and this set is negative, they cannot directly use this information without knowing, for example, which are positive about their product and which are positive about the competition.
  • a domain-general lexicon is developed.
  • a lexicon is a list of words or phrases with their associated parts-of-speech, and a semantic orientation tag (e.g. positive or negative). For example, this may contain the words ‘good’ and ‘bad’ as positive and negative adjectives, respectively.
  • this domain-general lexicon is tuned to the domain being explored. For example, if we are looking at digital cameras, phrases like ‘blurry’ may be negative and ‘crisp’ may be positive. Care is taken not to add ambiguous terms where possible, as we rely on assumptions about the distribution of the phrases that we can detect with high precision and its relationship to the distribution of all polar phrases. Note that our lexicon contains possibly ‘incorrect’ terms which reflect modern language usage as found in online messages. For example, there is an increasing lack of distinction between certain classes of adverbs and adjectives, and so many adjectives are replicated as adverbs.
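One possible representation of the lexicon and its domain tuning is sketched below. The dictionary layout keyed by `(phrase, POS)` and the helper `tune_lexicon` are illustrative assumptions; the source describes the lexicon contents but not a data structure.

```python
# Domain-general lexicon: (phrase, part of speech) -> semantic orientation.
GENERAL_LEXICON = {
    ("good", "JJ"): "positive",
    ("bad", "JJ"): "negative",
}


def tune_lexicon(general, additions, exclusions=()):
    """Return a domain-tuned copy of a general lexicon.

    additions: domain-specific entries to add or override.
    exclusions: keys to drop, e.g. terms that are ambiguous in this domain.
    """
    tuned = dict(general)   # copy, so the general lexicon is left untouched
    tuned.update(additions)
    for key in exclusions:
        tuned.pop(key, None)
    return tuned


# Digital-camera domain, using the examples from the text.
CAMERA_LEXICON = tune_lexicon(
    GENERAL_LEXICON,
    {("blurry", "JJ"): "negative", ("crisp", "JJ"): "positive"},
)
```

Keeping the general lexicon immutable and deriving one tuned copy per domain makes it straightforward to maintain several domains side by side.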
  • the input is tokenized.
  • the tokenized input is then segmented into discrete chunks.
  • the chunking phase includes the following steps. Part of speech tagging is carried out using a statistical tagger trained on Penn Treebank data. (We note that taggers trained on clean data, when applied to the noisy data found in our domain, are less accurate than with their native data.) Semantic tagging adds polar orientation information to each token (positive or negative) where appropriate using the prepared polarity lexicon. Simple linear POS tag patterns are then applied to form the chunks.
  • the chunk types that are derived are basic groups (noun, adjective, adverb and verb) as well as determiner groups and an ‘other’ type.
  • the chunked input is then further processed to form higher-order groupings of a limited set of syntactic patterns. These patterns are designed to cover expressions that associate polarity with some topic, and those expressions that toggle the logical orientation of polar phrases (I have never liked it.). This last step conflates simple syntactic rules with semantic rules for propagating the polarity information according to any logical toggles that may occur.
  • the chunking phase would bracket the token sequence as follows: {(this_DT)_DET, (car_NN)_BNP, (is_VB)_BVP, (really_RR, great_JJ)_BADJP}.
  • the basic chunk categories are {DET, BNP, BADVP, BADJP, BVP, OTHER}.
  • the interpretation phase then carries out two tasks: the elevation of semantic information from lower constituents to higher, applying negation logic where appropriate, and assembling larger constituents from smaller. Rules are applied in a certain order. In this example, a rule combining DET and BNP chunks would work first over the sequence, followed by a rule that forms verb phrases from BNP BVP BADJP sequences whenever polar information is found in a BADJP.
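The bracketing in the example above can be reproduced with a small sketch of linear POS-pattern chunking. The tag-to-chunk table and the single merge rule (an adverb group directly before an adjective joins the adjective group) are simplifying assumptions chosen to match the example, not the patent's actual rule set; the tags follow the example's own notation (`RR` for the adverb).

```python
# Assumed mapping from POS tags (as written in the example) to chunk types.
TAG_TO_CHUNK = {"DT": "DET", "NN": "BNP", "VB": "BVP", "JJ": "BADJP", "RR": "BADVP"}


def chunk(tagged_tokens):
    """Group (token, tag) pairs into (chunk_type, tokens) chunks."""
    chunks = []
    for token, tag in tagged_tokens:
        chunk_type = TAG_TO_CHUNK.get(tag, "OTHER")
        if chunks:
            prev_type, prev_tokens = chunks[-1]
            # An adverb group directly before an adjective is absorbed
            # into the adjective group: (really_RR, great_JJ)_BADJP.
            if prev_type == "BADVP" and chunk_type == "BADJP":
                chunks[-1] = ("BADJP", prev_tokens + [(token, tag)])
                continue
            # Adjacent tokens of the same basic type share one chunk.
            if prev_type == chunk_type and chunk_type != "OTHER":
                prev_tokens.append((token, tag))
                continue
        chunks.append((chunk_type, [(token, tag)]))
    return chunks
```

For the sequence this_DT car_NN is_VB really_RR great_JJ, the function yields DET, BNP, BVP and BADJP chunks in that order, mirroring the bracketed form in the text.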
  • the simple syntactic patterns are: Predicative modification (it is good), Attributive modification (a good car), Equality (it is a good car), Polar clause (it broke my car).
  • a lexicon is consulted to annotate the terminals with lexical information.
  • the lexicon contains information describing the role that the word has in the context of interpreting polarity. This role is described as either:
  • an atomic feature representing a grounded interpretation for the lexical item, e.g., positive or negative.
  • Such a set of lexical types may include:
  • INVERT-NEG invert the polarity of a negative argument
  • INVERT-POS invert the polarity of a positive argument
  • Composition is the process by which an interpretation is built up (via the application of functions, or via transmitting a child's interpretation to its parent) from the terminals in a syntactic tree to the root of the tree.
  • Each sentence is tokenized to produce a series of word-like elements.
  • Each token is given a part of speech (POS) which is encoded using a two or three character tag, e.g., NN for singular noun, NNS for plural noun.
  • Each token is looked up in a lexicon.
  • the lexicon uses the POS and morphological analysis of the word.
  • the morphological analysis takes as input a word and a POS and produces a reduced form of the word and the appropriate derived POS. For example, looking up ‘breaking’ would produce the token ‘break’ with the associated POS VB.
  • a grammatical analysis is performed on the entire sentence.
  • the goal of the grammatical analysis is to diagram as much of the sentence as possible. In certain situations, the sentence will be fully described as a single structure. Otherwise, the structure will be fragmented.
  • the notation below the bracketed form shows how the sentence is built up.
  • the sentence consists of an NP, the subject Noun Phrase (I), and a VP, the main Verb Phrase (heard this was a great movie).
  • the VP is then further split into a verb (heard) and a relative clause (S-REL) which itself has a simple Subject Verb Object sentence structure.
  • Semantic Analysis Now that we have the structural and lexical description of the sentence, we can carry out a semantic analysis.
  • the semantic analysis works in a simple compositional manner working from the nodes at the leaves of the tree structure (starting with the words themselves and moving through the tree to the very top node).
  • ‘good’ finds a hit in the lexicon and gets the ‘positive’ feature. ‘not’ also finds a hit in the lexicon and gets assigned the *function* ‘INVERT( )’.
  • the ‘positive’ feature associated with ‘good’ is an *atomic* feature.
  • ‘a good movie’, which is a noun phrase, gets associated with the ‘positive’ feature.
  • the INVERT( ) function that the word ‘not’ hit in the lexicon makes its way up to the verbal group (‘was not’).
  • the higher level node, a Verb Phrase that spans all of ‘was not a good movie’, has two children: ‘was not’ and ‘a good movie’. If we reduce these children to their semantic information, we have two expressions: ‘INVERT( )’ and ‘positive’.
  • the combinatorial process applies the function to the atomic argument and evaluates the result.
  • ‘INVERT( )’ and ‘positive’ become ‘INVERT(positive)’ which then becomes ‘negative’.
  • this ‘negative’ feature then makes its way up the tree structure to the top S node, resulting in a sentence with negative polarity.
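The bottom-up composition walked through above can be sketched as a short recursive function. The tree encoding (nested tuples of words), the tiny lexicon, and the function names are assumptions for illustration; the sketch also ignores mixed polarity by taking the first grounded value it finds.

```python
POSITIVE, NEGATIVE = "positive", "negative"


def invert(polarity):
    """The INVERT( ) function: flip a grounded polarity."""
    return {POSITIVE: NEGATIVE, NEGATIVE: POSITIVE}.get(polarity, polarity)


# Toy lexicon: atomic features for polar words, functions for toggles.
LEXICON = {"good": POSITIVE, "great": POSITIVE, "bad": NEGATIVE,
           "not": invert, "never": invert}


def interpret(node):
    """Compose an interpretation from the leaves of a tree up to its root.

    A node is either a word (str) or a tuple of child nodes.
    """
    if isinstance(node, str):
        return LEXICON.get(node.lower())
    values = [v for v in (interpret(child) for child in node) if v is not None]
    functions = [v for v in values if callable(v)]
    atoms = [v for v in values if not callable(v)]
    if not atoms:
        # No grounded polarity here: pass a pending function upward, if any.
        return functions[0] if functions else None
    result = atoms[0]
    for function in functions:
        result = function(result)  # e.g. INVERT(positive) -> negative
    return result


# S -> NP (this) VP; VP -> verbal group (was not) + NP (a good movie)
SENTENCE = ("this", (("was", "not"), ("a", "good", "movie")))
```

Calling `interpret(SENTENCE)` passes INVERT up from the verbal group, applies it to ‘positive’ at the Verb Phrase, and returns ‘negative’ at the root.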
  • POS allows the system to distinguish between the grammatical functions of a word (e.g. ‘pretty’ in ‘it was pretty’ and ‘pretty’ in ‘it was pretty evil’).
  • the context, when appropriate, can be used to distinguish other cases, such as the difference between ‘well’ in ‘it works well’ and ‘well’ in ‘oh, well’.
  • when word (ordinal) w is looked up in a sentence of n>w words, the lexicon has access to all the words in the sentence and can address them relative to the position w.
  • Our evaluation experiment proceeded as follows. Using our message harvesting and text mining toolkit, we acquired 20,000 messages from online resources (usenet, online message boards, etc.). Our message harvesting system harvests messages in a particular domain (a vertical industry, such as ‘automotive’, or a specific set of products). Messages are then automatically tagged according to some set of topics of interest to the analyst.
  • topic: topical, off-topic (a binary labeling).
  • topic and polarity: positive-correlated, negative-correlated, positive-uncorrelated, negative-uncorrelated, topical, off-topic.
  • the positive-correlated label indicates that the sentences contained a positive polar segment that referred to the topic, positive-uncorrelated indicates that there was some positive polarity but that it was not associated with the topic in question.
  • a machine learning text classifier is trained to assess topicality on whole messages and thus expects to predict whether or not a whole message is relevant to the given topic.
  • the provided classifier is trained with machine learning techniques from a collection of documents that have been hand-labeled with the binary relation of topicality.
  • the underlying classifier is a variant of the Winnow classifier (N. Littlestone, “Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm,” Machine Learning 2, pp. 285-318, 1988; A. Blum, “Empirical support for winnow and weighted-majority based algorithms: results on a calendar scheduling domain,” Machine Learning 26, pp. 5-23, 1997; and I. Dagan, Y. Karov, and D.
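  • Winnow is theoretically guaranteed to quickly converge to a correct hypothesis when the classes are linearly separable and the labels are free of noise, even if many features are irrelevant.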
  • Winnow is a very effective document classification algorithm, rivaling the performance of Support Vector Machines (T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Machine Learning: ECML 98, Tenth European Conference on Machine Learning, pp. 137-142, 1998, the disclosure of which is incorporated herein by reference) and k-Nearest Neighbor (Y. Yang, “An evaluation of statistical approaches to text categorization,” Information Retrieval 1(1/2), pp.
  • Winnow because it is more computationally efficient than SVMs and easier to apply than kNN. It is to be understood, of course, that it is within the scope of the invention to use classifiers other than the Winnow algorithm.
  • We then use a straightforward and ad-hoc technique of adapting a given document classifier into a high precision/low recall sentence classifier. If a document is judged by the classifier to be irrelevant, we predict that all sentences in that document are also irrelevant. If a document is judged to be topical, then we further examine each sentence in that document. Given each sentence and our text classifier, we simply form a bag-of-words representation of the sentence as if an entire document consisted of that single sentence. We then run the classifier on the derived pseudo-document. If the classifier predicts topical, then we label the sentence as topical and proceed with the sentiment analysis for that sentence. If the classifier predicts irrelevant, we skip the sentiment analysis and proceed on to the next sentence.
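The document-to-sentence adaptation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the described system: the `Classifier` type, the tokenizer, the naive sentence splitter, and all function names are assumptions introduced here.

```python
import re
from typing import Callable, List, Set

# Hypothetical interface: a trained document classifier that maps a
# bag of words to True (topical) or False (irrelevant).
Classifier = Callable[[Set[str]], bool]


def bag_of_words(text: str) -> Set[str]:
    """Binary bag of words: records only whether a word occurs."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def split_sentences(text: str) -> List[str]:
    """Naive splitter on sentence-final punctuation (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def topical_sentences(message: str, is_topical: Classifier) -> List[str]:
    """Return the sentences that would proceed to sentiment analysis."""
    # An irrelevant document implies that all of its sentences are irrelevant.
    if not is_topical(bag_of_words(message)):
        return []
    # Otherwise each sentence is classified as a one-sentence pseudo-document.
    return [s for s in split_sentences(message) if is_topical(bag_of_words(s))]
```

Because a sentence must pass the classifier twice (once as part of the message and once on its own), the sketch trades recall for precision, in line with the high precision/low recall behavior described above.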
  • This section describes how we use the polar sentence detector and identify which messages contain positive or negative expressions about a particular brand.
  • the approach we take is to use a brand text classifier, a feature text classifier, and a set of resolution heuristics to combine these with the polar language detector.
  • In a marketing intelligence application of data mining, there are typically topics of discussion in the data that warrant explicit tracking and identification.
  • the most prevalent type of topics are brand-related, i.e. one topic for each product or brand being tracked, such as the Dell Axim.
  • analysts compose well-written hand-built rules to identify these types of topics. These rules are based on words and phrases, and allow for stemming, synonymy, windowing, and context-sensitivity based on document analysis.
  • In contrast to brand-like topics defined through rules, it's often the case that other topics are more accurately recognized from a complex language expression that is not easily captured by a rule. For example, topics such as Customer Service are not so simply captured by sets of words, phrases and rules. Thus, we often approach topic classification with machine learning techniques.
  • the provided classifier is trained with machine learning techniques from a collection of documents that have been hand-labeled with the binary relation of topicality. The hand-labeling by the analysts is performed using an active learning framework.
  • the underlying classifier is a variant of the Winnow classifier (Littlestone 1988), an online learning algorithm that finds a linear separator between the class of documents that are topical and the class of documents that are irrelevant. Documents are modeled with the standard bag-of-words representation that discards the ordering of words and notices only whether or not a word occurs in a document.
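For orientation, a textbook form of Winnow over binary bag-of-words features might look like the sketch below. The text only says that a variant of Winnow is used, so this is not that variant: the promotion factor `alpha`, the threshold choice (vocabulary size), and the class and method names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


class Winnow:
    """Textbook Winnow with multiplicative promotion and demotion."""

    def __init__(self, vocabulary: Iterable[str], alpha: float = 2.0):
        # Every word starts with weight 1; the threshold is the vocabulary size.
        self.weights: Dict[str, float] = {w: 1.0 for w in vocabulary}
        self.alpha = alpha
        self.threshold = float(len(self.weights))

    def predict(self, words: Set[str]) -> bool:
        # Linear separator over binary features: sum the weights of present words.
        score = sum(self.weights.get(w, 0.0) for w in words)
        return score >= self.threshold

    def update(self, words: Set[str], label: bool) -> None:
        # Mistake-driven online update: weights change only on an error.
        if self.predict(words) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for w in words:
            if w in self.weights:
                self.weights[w] *= factor

    def fit(self, examples: List[Tuple[Set[str], bool]], epochs: int = 10) -> None:
        for _ in range(epochs):
            for words, label in examples:
                self.update(words, label)
```

Only the weights of words present in a misclassified document are touched, which is why irrelevant vocabulary has little effect on how quickly the separator is found.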
  • feature classifiers are used to extend the recall of identifying polar messages through the following process. If a message contains brand mentions, the feature classifiers are also run on each sentence in a message. If a sentence is both polar and passes a feature classifier, there is likely a polar expression about one of the brands mentioned in the message. A process of fact extraction is layered on top of these classifiers and the sentiment analysis to understand which brand is being referenced in the message. We use simple resolution techniques to associate brand-like topics (e.g. Dell Axim) with topics describing features of brands (e.g. Customer Service or Peripherals).
  • a brand can be referenced in the Subject line of a blog, and feature-like topics mentioned in the body of the blog resolve back to the brand topics in the subject line when other brands are not mentioned in the body.
  • For purposes of measuring aggregate sentiment for a brand, a message is considered positive about the brand if it contains a fact with the brand's class and a positive polarity. A message is considered negative about the brand if it contains a fact with the brand's class and a negative polarity. While generally correct, the automated nature of the system results in a not insignificant amount of error in claiming these facts. Aggregating these counts into a single overall score for a brand requires a mindfulness of the error rates, to avoid making incorrect claims about a brand. Below we describe how the counts of each of these groups of messages are used to generate a score with confidence bounds that achieves this goal.
  • An alternate embodiment for identifying topical sentences is to use a hand-built set of rules to identify sentences containing a topic. For example, to identify the display of a PDA, an analyst might write the rule “the word ‘screen’ within five words of the word ‘PDA’, the word ‘resolution’, the phrase ‘trans reflective’ but not the phrase ‘monitor resolution’”. These rules can be run over every sentence in the document collection, and any sentence that matches the hand-written rule is considered topical.
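The example display rule can be approximated in a few lines. The sketch below reads the rule as a disjunction of three positive clauses with one excluding phrase; that reading, the tokenizer, and every function name are assumptions, not the rule language of the described tool.

```python
import re
from typing import List


def tokens(sentence: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", sentence.lower())


def within(toks: List[str], a: str, b: str, window: int = 5) -> bool:
    """True if word a occurs within `window` tokens of word b."""
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def has_phrase(toks: List[str], phrase: str) -> bool:
    """True if the phrase occurs as a contiguous run of tokens."""
    p = phrase.lower().split()
    return any(toks[i:i + len(p)] == p for i in range(len(toks) - len(p) + 1))


def is_display_sentence(sentence: str) -> bool:
    toks = tokens(sentence)
    # The excluded phrase disqualifies the sentence outright.
    if has_phrase(toks, "monitor resolution"):
        return False
    return (
        within(toks, "screen", "pda", window=5)
        or has_phrase(toks, "resolution")
        or has_phrase(toks, "trans reflective")
    )
```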
  • FIG. 5 shows a screen shot of an exemplary computerized tool for writing topical rules of this kind.
  • One goal of the present invention is to reliably extract polar sentiments about a topic.
  • An embodiment of our system assumes that a sentence judged to be polar and also judged to be topical is indeed expressing polarity about the topic. This relationship is asserted without any NLP-style evidence for a connection between the topic and the sentiment, other than their apparent locality in the same sentence. This section tests the assumption that at the locality of a sentence, a message that is both topical and polar actually expresses polar sentiment about the topic.
  • the system identifies sentences that are judged to be topical and have either positive or negative sentiment. These sentences are predicted by the system to be saying either positive or negative things about the topic in question. Out of the 1262 sentences predicted to be topical, 316 sentences were predicted to have positive polarity and 81 were predicted to have negative polarity. The precision for the intersection (testing the assumption that a topical sentence with polar content is polar about that topic) is shown in Table 2, above. The results show the overall precision was 72%. Since the precision of the polarity module was 82% and the topic module 79%, an overall precision of 72% demonstrates that the locality assumption holds in most instances.
  • the B&W display is great in the sun.
  • the screen is at 70 setting (255 max) which is for me the lowest comfortable setting.
  • the screen is the same (both COMPANY-A & COMPANY-B decided to follow COMPANY-C), but multimedia is better and more stable on the PRODUCT.
  • the model we choose is a statistical generative model. That is, we assume the facts are extracted by an error-prone process that we model with explicit parameterization. Specifically for the sentiment metric, the fundamental parameter we hope to derive is the frequency of positive messages about a brand, and the frequency of negative messages about a brand. These two processes are modeled analogously; for brevity we discuss here only the derivation of the frequency of positive messages, but one of ordinary skill will readily appreciate how to derive the frequency of negative messages using this model.
  • the goal of the parameter estimation process is to use the observed values N (total messages) and m (positive messages detected) and estimate θ, the underlying frequency of true positive messages.
  • Z is a normalization function of fp, fn, tp and tn.
  • α and β are parameters given by the Beta prior for the Binomial distribution.
  • the E-step calculates the expectation of the missing data using the estimated parameterization:
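  • $$E[tp] = m \left( \frac{\hat{\theta}\,(1 - \epsilon_{fn})}{\hat{\theta}\,(1 - \epsilon_{fn}) + (1 - \hat{\theta})\,\epsilon_{fp}} \right)$$
  • $$E[fp] = m \left( \frac{(1 - \hat{\theta})\,\epsilon_{fp}}{\hat{\theta}\,(1 - \epsilon_{fn}) + (1 - \hat{\theta})\,\epsilon_{fp}} \right)$$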
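The two E-step expectations can be computed directly. The sketch below assumes that the error parameters subscripted fp and fn are the false positive and false negative rates of the detector, and that the remaining symbol is the current estimate of the true positive frequency; the function and argument names are introduced here for illustration only.

```python
from typing import Tuple


def e_step(m: float, theta_hat: float, eps_fp: float, eps_fn: float) -> Tuple[float, float]:
    """Split m detected positives into expected true and false positives.

    theta_hat: current estimate of the frequency of truly positive messages.
    eps_fp:    assumed probability that a non-positive message is detected as positive.
    eps_fn:    assumed probability that a truly positive message is missed.
    """
    detected_true = theta_hat * (1.0 - eps_fn)    # positive and detected
    detected_false = (1.0 - theta_hat) * eps_fp   # not positive but detected
    total = detected_true + detected_false
    if total == 0.0:
        return 0.0, 0.0
    expected_tp = m * detected_true / total
    expected_fp = m * detected_false / total
    return expected_tp, expected_fp
```

The two expectations share a denominator, so they always sum to m, the number of messages detected as positive.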
  • the process described is a method for deriving estimates for the positive and negative frequencies of a brand.
  • customer needs require that only a single summary statistic be produced, and that the form of this is a 1-10 metric.
  • a 5.0 value of the metric needs to correspond to the case where the estimated frequencies of positive and negative are equal, and generally, very few brands should score at the most extreme ends of the score.
  • the frequencies are converted to a 1-10 score through a log linear normalization of the ratio of positive to negative. Thus, if a 7.0 corresponds to a ratio of 2.0, then 9.0 corresponds to a ratio of 4.0 and a 3.0 score to a ratio of 0.5. Extreme ratios are very rare, and anything beyond a 1 or a 10 is simply truncated at the extrema.
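A small sketch of this normalization follows. It adopts the illustrative calibration from the text (5.0 at equal frequencies, 7.0 at a ratio of 2.0), which amounts to two score points per doubling of the ratio; the actual slope used in practice is not stated, so `points_per_doubling` is an assumption, as is the function name.

```python
import math


def sentiment_score(pos_freq: float, neg_freq: float,
                    points_per_doubling: float = 2.0) -> float:
    """Map the positive/negative frequency ratio to a 1-10 score.

    Assumes both frequencies are strictly positive.
    """
    ratio = pos_freq / neg_freq
    # Log-linear: equal frequencies give 5.0; each doubling adds a fixed amount.
    raw = 5.0 + points_per_doubling * math.log2(ratio)
    # Scores beyond the extrema are truncated.
    return min(10.0, max(1.0, raw))
```

With the default slope, ratios of 0.5, 1.0, 2.0 and 4.0 map to 3.0, 5.0, 7.0 and 9.0 respectively, matching the worked figures in the text.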
  • Table 3 shows the results of the sentiment metric applied in the auto domain. Note that in general, models with larger message counts have smaller confidence bounds. Using these scores to drive analysis yields insights that explain the relative rankings of the different models.
  • the Mazda6 MPS achieves a superior balance between high performance and daily needs such as comfort and economy.
  • the Ford Taurus, a lower rated model, received a number of complaints about quality issues and being generally out of date:
  • the standard spoiler is too small.
  • the power steering always whined, even with enough fluid.
  • the Taurus should have been put out of its misery 5 years ago.
  • Table 4 shows the results of measuring polarity for location topics in a small data set of messages about Caribbean destinations. By further drilling down on these scores, an analyst can quickly determine that:
  • Cuba has a lower score due to poor snorkeling and beach activities.
  • Grand Bahama's medium score comes from above average opinion of snorkeling, moderate opinion of dining out and a slightly lower opinion of beach activities.
  • FIG. 7 provides a scatterplot showing how the size of the confidence bounds is influenced by the number of messages. Each point is an automotive model. Once there are about 1000 messages for a topic, the 95% confidence bounds tend to be within 1.0 on a ten point scale.
  • FIG. 7 shows an analysis of the confidence bounds by the amount of message volume about a brand.
  • the x-axis shows the number of messages about a brand, and the y-axis shows the estimated size of the 95% confidence bounds.
  • the confidence bounds on each brand tend to be rather large. This generally will prevent conclusive statements from being made by comparing sentiment scores with these large confidence bounds.
  • the bounds get smaller, and thus it becomes easier to make statistically valid conclusions based on these scores.
  • FIGS. 1-4 provide screen shots of an exemplary computerized tool for implementing certain embodiments of the present invention.
  • the screen shot of FIG. 1 illustrates a function of the exemplary computerized tool establishing a topic for the text mining algorithm contained therein.
  • Three main features visible in this screen view are the Topic Select window 20 , the Viewer window 22 , and the Current Slice box 24 .
  • the Topic Select window 20 lists the available topics from which the user may select a topic for analysis.
  • the Viewer window 22 displays the text of a particular message.
  • the Current Slice box 24 provides status information regarding the user's selections that define the research project that is being performed. In the example shown, the Current Slice box 24 indicates that the Topic selected by the user is “Hardware::Display”.
  • the exemplary computerized tool will concentrate on certain characteristics regarding a manufacturer's electronic device (in this case, a PDA) or the competing electronic devices of the manufacturer's competitors.
  • the tool has access to a repository of thousands or millions of internet message-board entries preselected as likely having content of interest (e.g., taken from internet message boards dedicated to electronic devices and/or PDAs).
  • the Viewer window 22 provides an example message found by the above-described text mining tool in the repository of messages that the text mining tool considered relevant to the selected topic.
  • the right-side block 25 displays data pertaining to the analysis of the currently selected message contents.
  • the Relevance, Aux Relevance, Positive and Negative Polarity displays show a score between zero and one for the currently selected message for each of these different types of scoring. Specifically, scores greater than zero for Positive and Negative Polarity mean that at least one sentence in the message has been identified as positive or negative, with a higher score indicating a higher degree of positivity or negativity identified in the message.
  • the Relevance and Aux Relevance scores indicate a confidence that the message is about the selected topic (PDA's and Pocket PCs in this example). Messages that are below a specified threshold of relevance can be excluded.
  • the screen shot of FIG. 2 illustrates a function of the exemplary computerized tool in which the user may establish a more specific aspect of the general topic selected in the screen of FIG. 1 .
  • the Viewer window 22 and Current Slice box 24 appear again and serve the same purpose as described above with reference to FIG. 1 .
  • a Phrase-select window 26 which allows the user to enter a word or group of words to specify the content to be analyzed in the messages.
  • the Current Slice box 24 indicates that the user has entered the phrase “resolution,” thus indicating that the messages will be searched for comments relating to the resolution of the hardware displays.
  • the Viewer window 22 provides an example message found by the above-described text mining tool in the repository of messages that the text mining tool considered relevant to the selected topic and phrase, with the selected phrase “resolution” appearing in highlighted text 26 .
  • the screen shot of FIG. 3 illustrates a function of the exemplary computerized tool in which the user has requested the tool to illustrate the positive sentences and negative sentences located in the messages considered to be topical to the resolution of the customer's electronic device screen.
  • the positive sentences found by the sentence classifier are listed under the “Positive Quotes” header 28 in the Quotes window 30 and the negative sentences found by the sentence classifier are listed under the “Negative Quotes” header 32 in the Quotes window 30 .
  • the user has the ability to select one of the sentences, such as sentence 34 to view the entire message from which it was extracted as shown in the Viewer window of FIG. 4 .
  • the screen shot of FIG. 4 shows the Viewer window 22 displaying the text of the message from which the comment 34 shown in FIG. 3 selected by the user originated.
  • the screen shot of FIG. 5 illustrates a demonstration of how rule-based classifiers may be built.
  • This tool allows the user to define a topic (such as a particular brand or product) by creating a “rule” built from words to be associated with that topic.
  • a rule can be used to search feedback or comment messages, with those messages that conform to the defined rule being identified as pertaining to the selected topic.
  • a list 36 containing the different topics for which the topical sentiment analysis of the present invention may be performed.
  • This list can be generated by a database and include topics for which feedback or comment is available.
  • “Kia Optima” is the currently selected topic 38 .
  • the middle of the screen contains a window 40 for implementing the tool for writing rules to define the currently selected topic.
  • the window 40 is further divided into an “OR Rules” block 42 and a “BUT-NOT Rules” block 44 . Words appearing in the “OR Rules” block will be associated with the topic, such that feedback or comment messages containing any of these words will be identified as pertaining to the selected topic.
  • Words appearing in the “BUT-NOT Rules” block will be given preclusive effect in the topic definition, such that the appearance of one of these words in a feedback or comment message will disqualify that message from pertinence to the selected topic.
  • the rule defined by the words shown in the “OR Rules” block 42 and “BUT-NOT Rules” block 44 of FIG. 5 can be stated as “A message is about the Kia Optima if the word ‘Optima’ appears in the message, but not if any of the phrases ‘Optima batteries’, ‘Optima battery’, ‘Optima Yellow’, ‘Optima Red’, or ‘Optima Yell’ appear in the message”.
  • the user can type words to be added to the “OR Rules” block 42 and “BUT-NOT Rules” block 44 , or the user can select words or phrase sets from the list 46 on the right side of the FIG. 5 screen.
  • the list 46 is a collection of previously-entered or customized words and phrases, which can be used as shortcuts when writing a rule.
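The OR / BUT-NOT rule for the Kia Optima can be represented as plain data plus a matcher. The sketch below is one possible encoding: the `TopicRule` class, its field names, and the whole-word matching strategy are assumptions introduced for illustration, not the rule engine behind the FIG. 5 tool.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicRule:
    topic: str
    or_terms: List[str] = field(default_factory=list)       # any match qualifies
    but_not_terms: List[str] = field(default_factory=list)  # any match disqualifies

    def matches(self, message: str) -> bool:
        # Normalize to lowercase words separated by single spaces, padded so
        # that terms are matched on whole-word boundaries.
        padded = " " + " ".join(re.findall(r"[a-z0-9]+", message.lower())) + " "

        def contains(term: str) -> bool:
            return f" {term.lower()} " in padded

        # BUT-NOT terms have preclusive effect and are checked first.
        if any(contains(t) for t in self.but_not_terms):
            return False
        return any(contains(t) for t in self.or_terms)


kia_optima = TopicRule(
    topic="Kia Optima",
    or_terms=["Optima"],
    but_not_terms=["Optima batteries", "Optima battery",
                   "Optima Yellow", "Optima Red", "Optima Yell"],
)
```

Keeping the rule as data mirrors the two blocks of the rule-writing window: adding a word to either list changes the topic definition without touching the matching logic.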
  • a standard lexicon may be applied to any data set. However, the results will be improved if the lexicon is tuned to work with the particular language of a domain. A system has been implemented to assist a user in carrying out this process.
  • Step 2 above uses a number of methods to determine which words are to be used as candidates: (a) pattern-based methods (Gregory Grefenstette, Yan Qu, David A. Evans and James G. Shanahan, Validating the Coverage of Lexical Resources for Affect Analysis and Automatically Classifying New Words Along Semantic Axes, AAAI Symposium on Exploring Attitude and Affect in Text: Theories and Applications, 2004, the disclosure of which is incorporated herein by reference); and (b) commonly occurring adjectives and adverbs not found in the lexicon and not included in a default lexicon of known non-polar terms.
  • the patterns involve both tokens (words) and parts of speech.
  • the patterns consist of a prefix of tokens and a target set of POS tags.
  • the patterns are created from a pair of word pools. Pool one contains, for example, ‘appears’, ‘looks’, ‘seems’; pool two contains, for example, ‘very’, ‘extremely’.
  • the product of these pools e.g. ‘appears very’, ‘looks extremely’ and so on
  • the messages in the corpus collected for the project being customized are scanned using the patterns described above. All words which match any of the patterns are collected.
  • Each of the above pattern driven and parameter driven approaches can be tuned using a filter which accepts only candidates which appear a certain number of times in the corpus.
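The pattern-driven candidate collection can be sketched as below. The sketch assumes that the corpus has already been tokenized and part-of-speech tagged, that the token prefixes are the product of the two word pools, and that the target tags are the Penn Treebank tags for adjectives and adverbs; the tag set, the `min_count` parameter, and all names are assumptions introduced here.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List, Set, Tuple

POOL_ONE = ["appears", "looks", "seems"]
POOL_TWO = ["very", "extremely"]
TARGET_TAGS = {"JJ", "RB"}  # assumed tags for adjectives and adverbs

TaggedSentence = List[Tuple[str, str]]  # (token, POS tag) pairs


def build_prefixes(pool_one: Iterable[str], pool_two: Iterable[str]) -> List[Tuple[str, str]]:
    """Every pairing of a pool-one word with a pool-two word."""
    return list(product(pool_one, pool_two))


def collect_candidates(tagged_sentences: Iterable[TaggedSentence],
                       prefixes: Iterable[Tuple[str, str]],
                       target_tags: Set[str],
                       min_count: int = 1) -> Set[str]:
    """Collect words that follow a prefix and carry a target POS tag."""
    prefix_set = set(prefixes)
    counts: Counter = Counter()
    for sent in tagged_sentences:
        for i in range(len(sent) - 2):
            (w1, _), (w2, _), (w3, tag3) = sent[i], sent[i + 1], sent[i + 2]
            if (w1.lower(), w2.lower()) in prefix_set and tag3 in target_tags:
                counts[w3.lower()] += 1
    # Frequency filter: keep candidates seen at least min_count times.
    return {w for w, c in counts.items() if c >= min_count}
```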
  • the user then steps through the four sets of candidates, accepting or rejecting words as appropriate.
  • the interface within which this is carried out presents the user with the word, its POS and a list of examples of that word appearing in contexts mined from the corpus of messages.
  • an example screen shot shows such a system in use.
  • the system is presenting the user with the word ‘underwhelming’ which has been generated in the first candidate generation step.
  • the word is illustrated by two examples that have been pulled from the corpus of messages.
  • the user labels the word either by keyboard or shortcuts, or by clicking on the appropriate label found in the bottom right hand corner of the display.
  • Determining the sentiment of an author by text analysis requires the ability to determine the polarity of the text as well as the topic.
  • topic detection is generally solved by a trainable classification algorithm and polarity detection is generally solved by a grammatical model.
  • the approach described in some of these embodiments takes independent topic and polarity systems and combines them, under the assumption that a topical sentence with polarity contains polar content on that topic. We tested this assumption and determined it to be viable for the domain of online messages.
  • This system provides the ability to retrieve messages (in fact, parts of messages) that indicate the author's sentiment to some particular topic, a valuable capability.
  • the detection of polarity is a semantic or meta-semantic interpretive problem.
  • a complete linguistic solution to the problem would deal with word sense issues and some form of discourse modeling (which would ultimately require a reference resolution component) in order to determine the topic of the polar expressions.
  • Our approach restricts these open problems by constraining the data set, and specializing the detection of polarity. These steps by no means address directly these complex linguistic issues, but taken in conjunction (and with some additional aspects pertaining to the type of expressions found in the domain of online messages) the problem is constrained enough to produce reliable results.

Abstract

The present application presents methods for performing topical sentiment analysis on electronically stored communications employing fusion of polarity and topicality. The present application also provides methods for utilizing shallow NLP techniques to determine the polarity of an expression. The present application also provides a method for tuning a domain-specific polarity lexicon for use in the polarity determination. The present application also provides methods for computing a numeric metric of the aggregate opinion about some topic expressed in a set of expressions.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • The present application claims the benefit of U.S. Provisional Patent Application Ser. No. 60/614,941, filed Sep. 30, 2004, U.S. patent application Ser. No. 11/245,542, filed Sep. 30, 2005, and U.S. patent application Ser. No. 12/395,239, filed Feb. 27, 2009, each of which is incorporated by reference in its entirety.
  • BACKGROUND
  • One of the most important and most difficult tasks in marketing is to ascertain, as accurately as possible, how consumers view various products. A simple example illustrates the problem to be solved. As the new marketing manager for BrightScreen, a supplier of LCD screens for personal digital assistants (PDAs), you would like to understand what positive and negative impressions the public holds about your product. Your predecessor left you 300,000 customer service emails sent to BrightScreen last year that address not only screens for PDAs, but the entire BrightScreen product line. Instead of trying to manually sift through these emails to understand the public sentiment, can text analysis techniques help you quickly determine what aspects of your product line are viewed favorably or unfavorably?
  • One way to address BrightScreen's business need would be a text mining toolkit that automatically identifies just those email fragments that are topical to LCD screens and also express positive or negative sentiment. These fragments will contain the most salient representation of the consumers' likes and dislikes specifically with regard to the product at hand. The goal of the present invention is to reliably extract polar sentences about a specific topic from a corpus of data containing both relevant and irrelevant text.
  • Recent advances in the fields of text mining, information extraction, and information retrieval have been motivated by a similar goal: to exploit the hidden value locked in huge volumes of unstructured data. Much of this work has focused on categorizing documents into a predefined topic hierarchy, finding named entities (entity extraction), clustering similar documents, and inferring relationships between extracted entities and metadata.
  • An emerging field of research with much perceived benefit, particularly to certain corporate functions such as brand management and marketing, is that of sentiment or polarity detection. For example, sentences such as “I hate its resolution” or “The BrightScreen LCD is excellent” indicate authorial opinions about the BrightScreen LCD. Sentences such as “The BrightScreen LCD has a resolution of 320×200” indicate factual objectivity. To effectively evaluate the public's impression of a product, it is much more efficient to focus on the small minority of sentences containing subjective language.
  • Recently, several researchers have addressed techniques for analyzing a document and discovering the presence or location of sentiment or polarity within the document. J. Wiebe, T. Wilson, and M. Bell, “Identifying collocations for recognizing opinions,” in Proceedings of ACL/EACL '01 Workshop on Collocation, (Toulouse, France), July 2001, discovers subjective language by doing a fine-grained NLP-based textual analysis. B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? sentiment classification using machine learning techniques,” in Proceedings of EMNLP 2002, 2002, use a machine learning classification-based approach to determine if a movie review as a whole is generally positive or negative about the movie.
  • This prior art makes significant advances into this novel area. However, these works do not consider the relationship between polar language and topicality. In taking a whole-document approach, Pang et al. sidestep any issues of topicality by assuming that each document addresses a single topic (a movie), and that the preponderance of the expressed sentiment is about the topic. In the domain of movie reviews this may be a good assumption (though it is not tested), but this assumption does not generalize to less constrained domains. (It is noted that the data used in that paper contained a number of reviews about more than one movie. In addition, the domain of movie reviews is one of the more challenging for sentiment detection as the topic matter is often of an emotional character; e.g., there are bad characters that make a movie enjoyable.) Wiebe et al.'s approach does a good job of capturing the local context of a single expression, but with such a small context, the subject of the polar expression is typically captured by just the several base noun words, which are often too vague to identify the topic in question.
  • SUMMARY
  • In summary, in an industrial application setting, the value of polarity detection is very much increased when married with an ability to determine the topic of a document or part of a document. In this application, we outline exemplary methods for recognizing polar expressions and for determining the topic of a document segment.
  • The present invention, therefore, provides a lightweight but robust approach to combining topic and polarity, thus enabling text mining systems to select content based on a certain opinion about a certain topic.
  • More specifically, a first aspect of the present invention can be characterized as providing a computer implemented method (in which a computer can be any type of computer or computer system, network or combination thereof programmed and configured to perform the steps described herein) for obtaining topical sentiments from an electronically stored communication (which can be, for example and without limitation, an electronic document, message, email, blog post, and the like—and it is not important to the invention exactly where or how the communication is electronically stored and/or accessed) that includes the steps of (in no specific order): (a) determining a topic of a segment of the communication; and (b) locating a polar expression in the communication. In a more detailed embodiment, the method also includes the step of (c) determining a polarity of the polar expression, where the polarity may be positive, negative, mixed and/or neutral, for example. It is also within the scope of the invention that the method include the step of (d) associating the determined polarity with the determined topic.
  • The steps (b) locating a polar expression in the electronically stored communication and (c) determining the polarity of the polar expression may include the steps of: (1) establishing a domain-general polarity lexicon of sentimental/polar phrases (i.e., words and phrases); (2) establishing a topical domain being explored; (3) generating a polarity lexicon of sentimental phrases associated with the topical domain; (4) utilizing the polarity lexicon against phrases found in the polar expression; and (5) assigning at least one polar phrase in the polar expression a polarity associated with a matching phrase in the polarity lexicon. The step (c) of determining the polarity of the polar expression may also include the step of assigning at least one polar phrase in the polar expression a polarity. In a more detailed embodiment the method may further include the step of (e) analyzing the polar expression with syntactic and/or semantic rules to determine a topic of the polar expression and to link the determined topic to the polarity of the polar phrase.
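Steps (1) through (5) can be illustrated with a small lexicon lookup. In the sketch below the lexicon entries are invented examples, the domain lexicon simply overrides the general one, and matching is a greedy longest-phrase lookup; none of these choices is stated in the text, and the syntactic and semantic analysis of step (e) is deliberately left out.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical entries, for illustration only.
GENERAL_LEXICON = {"excellent": "positive", "great": "positive",
                   "hate": "negative", "poor": "negative"}
DOMAIN_LEXICON = {"washed out": "negative", "crisp": "positive"}


def build_lexicon(general: Dict[str, str], domain: Dict[str, str]) -> Dict[str, str]:
    """Combine the lexicons; domain-specific entries take precedence."""
    merged = dict(general)
    merged.update(domain)
    return merged


def tag_polar_phrases(expression: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (phrase, polarity) pairs for lexicon phrases found in the expression."""
    toks = re.findall(r"[a-z0-9']+", expression.lower())
    max_len = max((len(p.split()) for p in lexicon), default=0)
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(toks):
        # Prefer the longest phrase starting at position i.
        for n in range(min(max_len, len(toks) - i), 0, -1):
            phrase = " ".join(toks[i:i + n])
            if phrase in lexicon:
                found.append((phrase, lexicon[phrase]))
                i += n
                break
        else:
            i += 1
    return found
```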
  • It is further within the scope of the invention that the step (a) of determining a topic of the segment of the communication containing or associated with the polar expression includes the step of processing the segment with a communication (i.e., text) classifier. Such communication classifier may utilize an algorithm, such as a Winnow algorithm, a Support Vector Machine algorithm, a k-Nearest Neighbor algorithm, other machine learning algorithms, or a hand-built rules-based classifier.
  • It is also within the scope of the invention that the step (a) of determining a topic of the segment of the communication and the step (c) of determining the polarity of the polar expression are independent tasks.
  • The segments of the communication discussed above may be an entire communication or a portion of the communication, such as a sentence for example. Further, the segment discussed above may be the polar expression.
  • A second aspect of the present invention can be characterized as providing a computer implemented method for obtaining topical sentiments from a body of communications (text, electronic, etc.) comprising the steps of: (a) isolating a subset of the communications relevant to a particular topic; and (b) locating a polar expression in at least one of the subset of communications. The method may also include the steps of (c) determining the polarity of the polar expression and (d) associating the polarity with the particular topic.
  • A third aspect of the present invention can be characterized as providing a computer implemented method for obtaining topical sentiments from a body of communications (text, electronic, etc.) comprising the steps of: (a) isolating a first subset of the communications relevant to a particular topic; and (b) isolating a second subset of communications from the first subset of communications where the second subset of communications includes polar segments (i.e., negative or positive) located in the first subset of communications. The second subset can be broken into further subsets depending upon the particular polarity of the polar segments (i.e., there can be subsets for positive segments, negative segments, neutral segments and/or others). The method may also include the step of (c) associating the polar segments with the particular topic. The segments can be a sentence, a phrase, a paragraph or an entire communication for example.
  • A fourth aspect of the present invention can be characterized as providing a computer implemented method for obtaining topical sentiments from a plurality of electronically stored communications that includes the steps of: (a) determining with the assistance of a computer whether each communication in a plurality of communications is topical to a first predefined topic; (b) for each communication determined to be topical to the predefined topic, separating with the assistance of a computer the communication into one or more expressions (a word or a group of words that form a constituent of a sentence and are considered as a single unit); (c) for each expression, determining with the assistance of a computer if the expression is topical to a second predefined topic; and (d) for each expression that is determined to be topical to the second predefined topic, determining with the assistance of a computer a polarity of the expression. In a more detailed embodiment the polarity may be positive, negative, and/or neutral. In another detailed embodiment, the step of determining the polarity of the expression may include the steps of: establishing a topical domain being explored; generating a polarity lexicon of sentimental words and/or phrases associated with the topical domain; utilizing with the assistance of a computer the polarity lexicon against words and/or phrases found in the expression; and assigning at least one polar phrase in the expression a polarity associated with a matching word and/or phrase in the polarity lexicon.
  • In yet another detailed embodiment of the fourth aspect of the present invention the step of determining the polarity of the expression may further include the step of analyzing with the assistance of a computer the expression with syntactic and/or semantic rules. In yet another detailed embodiment, the step of determining with the assistance of a computer whether each communication in a plurality of communications is topical to a first predefined topic includes the step of processing each communication with a text classifier. This text classifier may utilize an algorithm such as a Winnow algorithm, a Support Vector Machine algorithm, a k-Nearest Neighbor algorithm or a rules-based classifier.
  • In yet another detailed embodiment of the fourth aspect of the present invention the method may further include the step of (e) calculating with the assistance of a computer an aggregate metric from the plurality of expressions which estimates the frequency of positive and/or negative polar expressions. This step may include the generation of statistically-valid confidence bounds on the aggregate metric. This step (e) may also include the steps of: for each of the plurality of expressions, estimating an opinion based upon the presence, absence or strength of polarity associated with the predefined topic; and aggregating the overall opinion for the plurality of expressions. The step of aggregating the overall opinion for the plurality of expressions may include a step of normalizing the ratio of empirical or estimated frequency of positive and negative polarity associated with the predefined topic. Alternatively, the step (e) of calculating an aggregate metric from the plurality of expressions may utilize Bayesian statistics to derive estimates for positive and negative frequencies of polar expressions.
  • In yet another detailed embodiment of the fourth aspect of the present invention, the first predefined topic is a general topic and the second predefined topic is a specific topic associated with the general topic. In a further detailed embodiment the general topic is a product or service and the specific topic is a feature of the product or service. Alternatively, the general topic is a commercial brand and the specific topic is a feature of the commercial brand. It is also within the scope of the invention that the first predefined topic and the second predefined topic are the same topic.
  • A fifth aspect of the present invention can be characterized as a computer implemented method for calculating, from a plurality of electronically stored expressions, an aggregate metric which estimates a frequency of positive and/or negative polar expressions contained in the expressions. The method includes the steps of: for each of a plurality of electronically stored expressions, determining with the assistance of a computer an opinion contained in the expressions based upon at least one of the presence, absence and strength of polarity associated with a predefined topic; and calculating an aggregate metric from the determined opinions of the plurality of expressions. In a detailed embodiment of this fifth aspect of the present invention the step of calculating an aggregate metric from the determined opinions of the plurality of expressions includes the generation of statistically-valid confidence bounds on the aggregate metric. Alternatively, or in addition, the step of calculating an aggregate metric from the determined opinions of the plurality of expressions includes a step of normalizing the ratio of empirical or estimated frequency of positive and negative polarity associated with the predefined topic. Alternatively, or in addition, the step of calculating an aggregate metric from the determined opinions of the plurality of expressions further includes utilizing Bayesian statistics to derive estimates for positive and negative frequencies of polar expressions. Alternatively, or in addition, at least a portion of the plurality of expressions are taken from larger electronically stored communications. Alternatively, or in addition, the step of determining an opinion contained in the expressions includes the steps of, for each expression: determining with the assistance of a computer that the expression is topical to the predefined topic; and determining with the assistance of a computer a polarity of the expression.
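  • One way the aggregate metric with Bayesian estimates and confidence bounds might be computed is sketched below. The Beta-Binomial model, the uniform prior, the normal approximation used for the interval, and all names are assumptions chosen for illustration; the specification does not fix a particular model.

```python
import math


def positive_rate_estimate(n_pos, n_neg, prior_pos=1.0, prior_neg=1.0, z=1.96):
    """Estimate the share of positive polar expressions with rough bounds.

    Returns (posterior_mean, lower_bound, upper_bound) under a Beta posterior.
    """
    a = n_pos + prior_pos
    b = n_neg + prior_neg
    mean = a / (a + b)
    # Variance of a Beta(a, b) distribution.
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    half_width = z * math.sqrt(variance)
    return mean, max(0.0, mean - half_width), min(1.0, mean + half_width)
```

With few polar expressions the prior pulls the estimate toward 0.5 and the bounds are wide; as message volume grows the bounds tighten around the empirical ratio.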
  • A sixth aspect of the present invention can be characterized as a computer implemented method for finding one or more polar expressions in an electronically stored communication, which includes the step of analyzing with the assistance of a computer the electronically stored communication for one or more polar expressions within the electronically stored communication. This analyzing step includes the steps of: providing a polarity lexicon of sentimental words and/or phrases associated with a topical domain; utilizing with the assistance of a computer the polarity lexicon against words and/or phrases found in the expression; and assigning with the assistance of a computer at least one word/phrase in the expression a polarity associated with a matching word/phrase in the polarity lexicon. In a more detailed embodiment of this sixth aspect of the present invention, the step of assigning with the assistance of a computer at least one word/phrase in the expression a polarity associated with a matching word/phrase in the polarity lexicon, includes the steps of: separating the expression into word/phrase chunks; tagging the separated word/phrase chunks with part-of-speech tags; applying the polarity lexicon against the word/phrase chunks to tag one or more of the word/phrase chunks with a polarity tag; and applying syntactic and semantic rules against the tagged word/phrase chunks to elevate the polarity of the word/phrase chunk to the entire expression. The step of applying syntactic and semantic rules against the word/phrase chunks to elevate the polarity of the word/phrase chunk to the entire expression includes the step of identifying a word/phrase chunk in the expression that toggles the polarity of the word/phrase chunk tagged with the polarity tag. 
Alternatively, the step of applying syntactic and semantic rules against the word/phrase chunks to elevate the polarity of the word/phrase chunk to the entire expression includes the step of performing with the assistance of a computer grammatical analysis on the expression.
  • A seventh aspect of the present invention can be characterized as a computer implemented method for tuning a polarity lexicon for use in classifying polar expressions, which includes the steps of: (a) providing a polarity lexicon; (b) with the assistance of a computer implemented graphical user interface providing a user with candidate words for addition, subtraction or exclusion to the polarity lexicon; and (c) adding, subtracting or excluding each candidate word from the polarity lexicon according to input received by the graphical user interface. In a more detailed embodiment, the step of (b) providing a user with candidates for addition, subtraction or exclusion to the polarity lexicon includes a step of scanning a plurality of electronic messages collected for the topical domain for words that have the potential to be added to the lexicon. The scanning step may include a pattern based method that locates adjectives and adverbs that have a substantial chance of being polar; or the scanning step may locate candidate words by filtering the communication for words that appear at least a predefined number of times; or the scanning step may include a pattern based method that locates adjectives and adverbs that have a substantial chance of being polar and locates candidate words by filtering the communication for words that appear at least a predefined number of times.
  • In yet another detailed embodiment of the seventh aspect of the present invention, the step of (b) providing with a graphical user interface a user with candidate words for addition, subtraction or exclusion to the polarity lexicon includes the step of presenting each candidate word to the user with the word's part of speech label and an example of that candidate word appearing in at least one electronic message collected for the topical domain.
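  • The candidate scanning step of the seventh aspect could look roughly like the following sketch, which keeps adjectives and adverbs that appear at least a minimum number of times, skips words already in the lexicon, and records one example message per candidate. The tag names, the threshold, and the data layout are assumptions for illustration.

```python
from collections import Counter


def candidate_words(tagged_messages, lexicon, min_count=2, polar_tags=("JJ", "RB")):
    """Collect lexicon candidates from messages given as lists of (word, tag)."""
    counts = Counter()
    examples = {}
    for message in tagged_messages:
        for word, tag in message:
            key = (word.lower(), tag)
            if tag in polar_tags and key[0] not in lexicon:
                counts[key] += 1
                # Keep the first message in which the candidate was seen.
                examples.setdefault(key, " ".join(w for w, _ in message))
    return [
        {"word": w, "pos": t, "count": n, "example": examples[(w, t)]}
        for (w, t), n in sorted(counts.items())
        if n >= min_count
    ]


MESSAGES = [
    [("The", "DT"), ("photos", "NNS"), ("are", "VBP"), ("blurry", "JJ")],
    [("blurry", "JJ"), ("and", "CC"), ("good", "JJ")],
    [("crisp", "JJ")],
]
CANDIDATES = candidate_words(MESSAGES, lexicon={"good"})
```

Each returned record carries what the user interface would need to show: the word, its part of speech label, and an example of use.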
  • An eighth aspect of the present invention can be characterized as a computer implemented method for obtaining topical sentiments from an electronically stored communication, which includes the steps of: determining with the assistance of a computer one or more topical expressions in the communication; locating with the assistance of a computer one or more polar expressions in the communication; and identifying an expression that is both a topical expression and a polar expression as containing a topical sentiment. In a more detailed embodiment, the steps of determining one or more topical expressions and locating one or more polar expressions are isolated steps performed on the same communication. In a further detailed embodiment, the step of determining one or more topical expressions includes a step of applying an automated text classifier on the communication and the step of locating one or more polar expressions includes the step of utilizing a domain-specific lexicon and shallow NLP techniques.
  • Upon reviewing the following detailed description and associated drawings, it will be appreciated by those of ordinary skill in the art, of course, that many other aspects of the invention exist, which may not be summarized above.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing results of a text mining algorithm;
  • FIG. 2 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how more specific sub-topics may be selected;
  • FIG. 3 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how both the positive and negative expressions found in the communications may be displayed to the user;
  • FIG. 4 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing complete text of a message containing an expression selected from the interface of FIG. 3;
  • FIG. 5 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how rule-based classifiers may be built;
  • FIG. 6 is a graphical user interface screen view from an exemplary computerized tool for implementing certain embodiments of the present invention, showing how a user may use the system to tune a polarity lexicon; and
  • FIG. 7 is a scatter-plot graph showing an analysis of confidence bounds by the amount of message volume about a brand.
  • DETAILED DESCRIPTION
  • 1. Introduction
  • The present invention stakes out a middle ground between the Wiebe et al. and Pang et al. approaches and presents a fusion of polarity and topicality. One approach to performing this task is to do a full NLP-style analysis of each sentence and understand at a deep level the semantics of the sentence, how it relates to the topic, and whether the sentiment of the expression is positive or negative. In the absence of comprehensive NLP techniques, we approximate the topicality judgment with either a statistical machine learning classifier or a hand-built rules-based classifier and the polarity judgment with shallow NLP techniques. An exemplary embodiment assumes that any sentence that is both polar and topical is polar about the topic in question. However, when these modules are run separately there are no guarantees that a sentence that is judged to be both topical and polar is expressing anything polar about the topic. For example, the sentence “It has a BrightScreen LCD screen and awesome battery life” does not say anything positive about the screen. Nevertheless, the present invention described herein demonstrates that the underlying combination assumption made by this system is sound, resulting in high-precision identification of these sentences.
  • The present application presents exemplary methods for performing topical sentiment analysis employing fusion of polarity and topicality. One approach to performing this task is to perform a full NLP-style analysis of each sentence and understand at a deep level the semantics of the sentence, how it relates to the topic, and whether the sentiment of the expression is positive or negative (or any other sentiment capable of being expressed in a message).
  • In the absence of comprehensive NLP techniques, alternate embodiments approximate the topicality judgment with a statistical machine learning classifier or a hand-built rules-based classifier and the polarity judgment with shallow NLP techniques. One embodiment of the system we describe assumes that any sentence that is both polar and topical is polar about the topic in question. However, when these modules are run separately there are no guarantees that a sentence that is judged to be both topical and polar is expressing anything polar about the topic. For example, the sentence “It has a BrightScreen LCD screen and awesome battery life” does not say anything positive about the screen. Nevertheless, one embodiment described herein demonstrates that the underlying combination assumption made by this system is sound, resulting in high-precision identification of these sentences.
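  • The combination assumption, and the way it can misfire on the BrightScreen sentence, can be sketched as follows. The keyword-based topicality check and the two-word polarity lexicon are toy stand-ins assumed for this sketch; they are not the classifier or lexicon of the described system.

```python
POLAR_WORDS = {"awesome": "positive", "terrible": "negative"}  # toy lexicon (assumed)


def tokens(sentence):
    """Lowercase, whitespace-split tokens with trailing punctuation removed."""
    return [w.strip(".,!?") for w in sentence.lower().split()]


def is_topical(sentence, topic_terms):
    """Toy topicality judgment: any topic term appears in the sentence."""
    return any(word in topic_terms for word in tokens(sentence))


def polarity_of(sentence):
    """Toy polarity judgment: polarity of the first lexicon hit, else None."""
    for word in tokens(sentence):
        if word in POLAR_WORDS:
            return POLAR_WORDS[word]
    return None


def topical_sentiment(sentence, topic_terms):
    """Treat a sentence that is both polar and topical as polar about the topic."""
    polarity = polarity_of(sentence)
    if polarity is not None and is_topical(sentence, topic_terms):
        return polarity
    return None
```

For the topic term "screen", the sentence about the BrightScreen LCD screen and awesome battery life is labeled positive even though the praise is for the battery, because the two judgments are made independently.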
  • In summary, in an industrial application setting, the value of polarity detection is very much increased when married with an ability to determine the topic of a document or part of a document. In this application, we outline methods for recognizing polar expressions and for determining the topic of a document segment.
  • The present invention, therefore, provides a lightweight but robust approach to combining topic and polarity thus enabling content access systems to select content based on a certain opinion about a certain topic.
  • 2. Polarity
  • 2.1 Polarity Detection
  • Texts can be broadly categorized as subjective or objective. Those that are subjective often carry some indication of the author's opinion, or evaluation of a topic as well as some indication of the author's emotional state with respect to that topic. For example, the expression “this is an excellent car” indicates the author's evaluation of the car in question; “I hate it!” reflects the author's emotional state with respect to the topic. An additional type of expression informative in this context is that which indicates a desirable or undesirable condition. These expressions may be deemed objective. For example, “It is broken” may well be objective but is still describing an undesirable state.
  • An idealized view of polarity detection would be able to accept any document, or subsection thereof, and provide an indication of the polarity: the segment is either positive, negative, mixed or neutral. Additionally, in an alternative embodiment of the present invention, expressions can be analyzed and rated according to the strength of any sentiment that is capable of being expressed in words. Such sentiments need not be analyzed in opposite pairs (as in the case of positivity and negativity in the exemplary embodiment); a message can be analyzed for the expression of any individual qualitative sentiment, and the relative strength of that sentiment in the message can be expressed on a numerical scale (as is described herein with reference to the sentiments of positivity and negativity). Examples of such additional qualitative sentiments that can be expressed in a message and analyzed according to the present invention include, but are not limited to: anger, hate, fear, loyalty, happiness, respect, confidence, pride, hope, doubt, and disappointment.
  • However, this is only half of the story. Firstly, the classification of polar segments has a dependency on many other aspects of the text. For example, the adjective ‘huge’ is negative in the context “there was a huge stain on my trousers” and positive in the context “this washing machine can deal with huge loads”. There is no guarantee that the information required to resolve such ambiguities will be present in the observable segment of the document.
  • Secondly, knowing that a piece of text is positive is only as useful as our ability to determine the topic of the segment. If a brand manager is told that this set of documents is positive and this set is negative, they cannot directly use this information without knowing, for example, which are positive about their product and which are positive about the competition.
  • An exemplary embodiment of the polar phrase extraction system according to the present invention was implemented with the following steps.
  • In the set up phase, a domain-general lexicon is developed. A lexicon is a list of words or phrases with their associated parts-of-speech, and a semantic orientation tag (e.g. positive or negative). For example, this may contain the words ‘good’ and ‘bad’ as positive and negative adjectives, respectively. Then, this domain-general lexicon is tuned to the domain being explored. For example, if we are looking at digital cameras, phrases like ‘blurry’ may be negative and ‘crisp’ may be positive. Care is taken not to add ambiguous terms where possible as we rely on assumptions about the distribution of the phrases that we can detect with high precision and its relationship to the distribution of all polar phrases. Note that our lexicon contains possibly ‘incorrect’ terms which reflect modern language usage as found in online messages. For example, there is an increasing lack of distinction between certain classes of adverbs and adjectives and so many adjectives are replicated as adverbs.
  • At run time, the input is tokenized. The tokenized input is then segmented into discrete chunks. The chunking phase includes the following steps. Part of speech tagging is carried out using a statistical tagger trained on Penn Treebank data. (We note that taggers trained on clean data, when applied to the noisy data found in our domain, are less accurate than with their native data.) Semantic tagging adds polar orientation information to each token (positive or negative) where appropriate using the prepared polarity lexicon. Simple linear POS tag patterns are then applied to form the chunks. The chunk types that are derived are basic groups (noun, adjective, adverb and verb) as well as determiner groups and an ‘other’ type.
  • The chunked input is then further processed to form higher-order groupings of a limited set of syntactic patterns. These patterns are designed to cover expressions that associate polarity with some topic, and those expressions that toggle the logical orientation of polar phrases (I have never liked it.). This last step conflates simple syntactic rules with semantic rules for propagating the polarity information according to any logical toggles that may occur.
  • If the text “This car is really great” were to be processed, firstly the tokenization step would result in the sequence {this, car, is, really, great}. Part of speech tagging would provide {this_DT, car_NN, is_VB, really_RR, great_JJ}. Assuming the appropriate polarity lexicon, additional information would be added thus: {this_DT, car_NN, is_VB, really_RR, great_JJ;+} where ‘+’ indicates a positive lexical item. Note that features are encoded in a simplified frame structure which is a tree. The standard operations of unification (merging), test for unifiability and subsumption are available on these structures.
  • The chunking phase would bracket the token sequence as follows: {(this_DT)_DET, (car_NN)_BNP, (is_VB)_BVP, (really_RR, great_JJ)_BADJP}. Note that the basic chunk categories are {DET, BNP, BADVP, BADJP, BVP, OTHER}.
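  • The tagging and chunking steps of this example can be mimicked with the short sketch below. The dictionary-based tagger, the one-entry polarity lexicon, and the particular linear patterns are simplifications assumed for illustration; the described system uses a statistical tagger and its own pattern set.

```python
TOY_TAGS = {"this": "DT", "car": "NN", "is": "VB", "really": "RR", "great": "JJ"}
TOY_LEXICON = {("great", "JJ"): "+"}  # (word, tag) -> polarity mark


def tag(tokens):
    """Attach a POS tag and, where the lexicon matches, a polarity mark."""
    tagged = []
    for token in tokens:
        pos = TOY_TAGS.get(token, "OTHER")
        tagged.append((token, pos, TOY_LEXICON.get((token, pos))))
    return tagged


def chunk(tagged):
    """Group tagged tokens into basic chunks using simple linear POS patterns."""
    chunks, i = [], 0
    while i < len(tagged):
        pos, j = tagged[i][1], i + 1
        if pos == "DT":
            label = "DET"
        elif pos in ("NN", "VB"):
            while j < len(tagged) and tagged[j][1] == pos:
                j += 1
            label = "BNP" if pos == "NN" else "BVP"
        elif pos in ("RR", "JJ"):
            while j < len(tagged) and tagged[j][1] in ("RR", "JJ"):
                j += 1
            # A run containing an adjective is an adjective group, else adverbial.
            has_adjective = any(p == "JJ" for _, p, _ in tagged[i:j])
            label = "BADJP" if has_adjective else "BADVP"
        else:
            label = "OTHER"
        chunks.append((label, tagged[i:j]))
        i = j
    return chunks


CHUNKS = chunk(tag("This car is really great".lower().split()))
```

The resulting chunk labels follow the bracketing given in the text: DET, BNP, BVP, BADJP, with the polarity mark carried on ‘great’ inside the adjective group.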
  • The interpretation phase then carries out two tasks: the elevation of semantic information from lower constituents to higher, applying negation logic where appropriate, and assembling larger constituents from smaller. Rules are applied in a certain order. In this example, a rule combining DET and BNP chunks would work first over the sequence, followed by a rule that forms verb phrases from BNP BVP BADJP sequences whenever polar information is found in a BADJP.
  • Note that there is a restriction of the applicability of rules related to the presence of polar features in the frames of at least one constituent (be it a BNP, BADJP, BADVP or BVP).
  • The simple syntactic patterns are: Predicative modification (it is good), Attributive modification (a good car), Equality (it is a good car), Polar clause (it broke my car).
  • Negation of the following types is captured by the system: Verbal attachment (it is not good, it isn't good), Adverbial negatives (I never really liked it, it is never any good), Determiners (it is no good), Superordinate scope (I don't think they made their best offer).
  • 2.2 Advanced Polarity Detection—Semantic Interpretation of Syntactic Fragments Containing Polarity Terms
  • In an advanced polarity detection process, once a syntactic structure has been built a lexicon is consulted to annotate the terminals with lexical information. The lexicon contains information describing the role that the word has in the context of interpreting polarity. This role is described as either:
  • a) an atomic feature representing a grounded interpretation for the lexical item, e.g., positive or negative.
  • b) a function which is to be applied to any sub-interpretation during composition resulting in a new interpretation, e.g., invert, which takes as an argument an atomic interpretation and produces a resultant interpretation, e.g., invert(positive)→negative.
  • Such a set of lexical types may include:
  • Functions:
  • INVERT
  • INVERT-NEG: invert the polarity of a negative argument
  • INVERT-POS: invert the polarity of a positive argument
  • INTENSIFY-IF-INVERTED: intensify an inverted argument
  • NEGATIVE-IF-INVERTED: negate an inverted argument
  • POSITIVE-IF-INVERTED: make positive an inverted argument
  • NON-TRANSMITTING: block the application of inversion for this verb
  • INTENSIFY: intensify the argument
  • NEGATIVE-NO-INVERSION: negate if no inversions have yet been applied
  • FILTER: remove the interpretation from the composition
  • Atoms:
  • POSITIVE
  • NEGATIVE
  • Composition is the process by which an interpretation is built up (via the application of functions, or via transmitting a child's interpretation to its parent) from the terminals in a syntactic tree to the root of the tree.
  • Illustrative examples of semantic interpretation follow.
  • Syntactic Analysis
  • 1. The input is segmented into a series of sentences.
  • 2. Each sentence is tokenized to produce a series of word-like elements.
  • 3. Each token is given a part of speech (POS) which is encoded using a two or three character tag, e.g., NN for singular noun, NNS for plural noun.
  • 4. Each token is looked up in a lexicon. The lexicon uses the POS and morphological analysis of the word. The morphological analysis takes as input a word and a POS and produces a reduced form of the word and the appropriate derived POS. For example, looking up ‘breaking’ would produce the token ‘break’ with the associated POS VB.
  • 5. A grammatical analysis is performed on the entire sentence. The goal of the grammatical analysis is to diagram as much of the sentence as possible. In certain situations, the sentence will be fully described as a single structure. Otherwise, the structure will be fragmented.
  • Example—Given the following communication, “I heard this was a great movie. Did you like it?” The above steps are applied as follows:
  • 1. ‘I heard this was a great movie.’ and ‘Did you like it?’
  • 2. Taking the first sentence—‘I’, ‘heard’, ‘this’, ‘was’, ‘a’, ‘great’, ‘movie’, ‘.’
  • 3. I\PRP heard\VBD it\PRP was\VBD a\DT great\JJ movie\NN where PRP is personal pronoun, VBD is a past tense verb, DT is a determiner, JJ is adjective and NN is noun.
  • 4. The only word that matches with the lexicon is ‘great’.
  • 5. Using a bracketing notation to indicate structure, the sentence can be represented as follows:
  • Figure US20110093417A1-20110421-C00001
  • The notation below the bracketed form shows how the sentence is built up. The sentence consists of an NP, the subject Noun Phrase (I), and a VP, the main Verb Phrase (heard this was a great movie). The VP is then further split into a verb (heard) and a relative clause (S-REL) which itself has a simple Subject Verb Object sentence structure.
  • Semantic Analysis. Now that we have the structural and lexical description of the sentence, we can carry out a semantic analysis. The semantic analysis works in a simple compositional manner working from the nodes at the leaves of the tree structure (starting with the words themselves and moving through the tree to the very top node).
  • In the above example, nothing too interesting happens. The node ‘great’ has found a hit with a positive term in the lexicon. It is, therefore, associated with the ‘positive’ feature. This feature is, via the compositional analysis mechanism, propagated all the way up the tree to the top S node. The result is that the ‘positive’ feature is the only polarity feature present and thus the sentence is marked as being positive.
  • A more interesting case concerns the interaction of different lexical items. Consider the fragment:
  • ‘it was not a good movie’
  • As before, ‘good’ finds a hit in the lexicon and gets the ‘positive’ feature. ‘not’ also finds a hit in the lexicon and gets assigned the *function* ‘INVERT( )’. The ‘positive’ feature associated with ‘good’ is an *atomic* feature.
  • The structural analysis for this fragment is something like
  • ((it) ((was not) (a good movie)))
  • As before, ‘a good movie’, which is a noun phrase, gets associated with the ‘positive’ feature. The INVERT( ) function that the word ‘not’ hit in the lexicon makes its way up to the verbal group (‘was not’). The higher level node, a Verb Phrase that spans all of ‘was not a good movie’, has two children: ‘was not’ and ‘a good movie’. If we reduce these children to their semantic information, we have two expressions: ‘INVERT( )’ and ‘positive’. The combinatorial process applies the function to the atomic argument and evaluates the result. Thus ‘INVERT( )’ and ‘positive’ become ‘INVERT(positive)’ which then becomes ‘negative’. Just like the ‘positive’ feature in the earlier example, this ‘negative’ feature then makes its way up the tree structure to the top S node, resulting in a sentence with negative polarity.
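  • The compositional walk just described can be sketched over a nested-tuple rendering of the bracketed structure. Only the INVERT function and the two atoms are modeled; the tuple representation, the tiny lexicon, and the rule of keeping the first atom found among siblings are assumptions of this sketch, not details taken from the described system.

```python
ATOMS = {"good": "positive", "great": "positive", "bad": "negative"}
FUNCTIONS = {"not": "INVERT", "never": "INVERT"}


def invert(atom):
    """INVERT: flip a grounded polarity."""
    return "negative" if atom == "positive" else "positive"


APPLY = {"INVERT": invert}


def interpret(node):
    """Return (pending_functions, atom) for a word or a tuple of child nodes."""
    if isinstance(node, str):
        word = node.lower()
        pending = [FUNCTIONS[word]] if word in FUNCTIONS else []
        return pending, ATOMS.get(word)
    pending, atom = [], None
    for child in node:
        child_pending, child_atom = interpret(child)
        pending.extend(child_pending)
        if atom is None:
            atom = child_atom
    if atom is None:
        # No argument yet: pass the functions up to the parent node.
        return pending, None
    for name in pending:
        atom = APPLY[name](atom)
    return [], atom


def sentence_polarity(tree):
    """Polarity that reaches the root of the tree, or None if there is none."""
    return interpret(tree)[1]


NEGATED = (("it",), (("was", "not"), ("a", "good", "movie")))
PLAIN = (("it",), (("was",), ("a", "great", "movie")))
```

Here ‘not’ contributes a pending INVERT at the verbal group, ‘good’ contributes the atom at the noun phrase, and the two meet at the verb phrase node, which is where the function is applied.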
  • More information about novelty of lexicon/semantics. When a word or phrase is looked up in the lexicon, the POS and the context surrounding it may be consulted. The POS allows the system to distinguish between the grammatical function of the word (e.g. ‘pretty’ in ‘it was pretty’ and ‘pretty’ in ‘it was pretty horrible’). The context, when appropriate, can be used to distinguish other cases, such as the difference between ‘well’ in ‘it works well’ and ‘well’ in ‘oh, well’. These contextual distinctions are made using a simple set of per entry rules which require the presence or absence of certain words either preceding or following the lexical entry.
  • Specifically, when word (ordinal) w is looked up in a sentence of n>w words, the lexicon has access to all the words in the sentence and can address them relative to the position w.
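  • A per-entry context rule of this kind might be represented as in the sketch below, where an entry matches on word and POS and may require that certain neighboring words be absent. The entry fields, the rule names, and the tags given to the example tokens are hypothetical and chosen only to reproduce the ‘well’ and ‘pretty’ cases mentioned above.

```python
LEXICON = [
    {"word": "well", "pos": "RB", "polarity": "positive", "not_preceded_by": {"oh"}},
    {"word": "pretty", "pos": "JJ", "polarity": "positive"},
]


def lookup(tagged, w):
    """Look up the token at position w in a sentence of (word, pos) pairs."""
    word, pos = tagged[w]
    before = tagged[w - 1][0].lower() if w > 0 else None
    after = tagged[w + 1][0].lower() if w + 1 < len(tagged) else None
    for entry in LEXICON:
        if entry["word"] != word.lower() or entry["pos"] != pos:
            continue
        # Context rules: reject the entry if a forbidden neighbor is present.
        if before in entry.get("not_preceded_by", ()):
            continue
        if after in entry.get("not_followed_by", ()):
            continue
        return entry["polarity"]
    return None
```

The POS check separates ‘pretty’ as an adjective from ‘pretty’ as an adverb, while the neighbor check separates ‘it works well’ from ‘oh, well’.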
  • 3. Polarity Evaluation
  • We wish to evaluate three aspects of our approach: the performance of the topic classifier on sentences, the performance of the polarity recognition system and the assumption that polar sentences that are on topic contain polar language about that topic.
  • Our evaluation experiment proceeded as follows. Using our message harvesting and text mining toolkit, we acquired 20,000 messages from online resources (usenet, online message boards, etc.). Our message harvesting system harvests messages in a particular domain (a vertical industry, such as ‘automotive’, or a specific set of products). Messages are then automatically tagged according to some set of topics of interest to the analyst.
  • We selected those messages which were tagged as being on topic for a particular topic in the domain being studied (982 messages). These messages were then segmented into sentences (using a naive sentence boundary detection algorithm), resulting in 16,616 sentences. The sentences were then tagged individually by the topic classifier (1,262 sentences on topic) and the polarity recognition system described above in Section 2.2.
  • We then selected at random 250 sentences for each of the evaluation tasks (topic, polarity, topic & polarity) and hand labeled them as follows.
  • polarity: positive, negative (in a multi-label environment this results in four possible combinations).
  • topic: topical, off-topic (a binary labeling).
  • topic and polarity: positive-correlated, negative-correlated, positive-uncorrelated, negative-uncorrelated, topical, off-topic. The positive-correlated label indicates that the sentences contained a positive polar segment that referred to the topic, positive-uncorrelated indicates that there was some positive polarity but that it was not associated with the topic in question.
  • As our system is designed to detect relative degrees of opinion we are more interested in precision than recall. A greater issue than recall is the potential bias that our set of classifiers might impose on the data. This aspect is not measured here due to the labor intensive nature of the task.
  • The results for the polarity task from this hand labeling are shown in Table 1. Sentences judged to have positive polarity were detected with a precision of 82%. Negative sentences were judged to be detected with a precision of 80%.
  • TABLE 1
    Precision of polarity for hand-labeled sentences. Positive: 82%; negative: 80%.
                        predicted pos   predicted neg
    truth   pos         139
            notPos      30
    truth   neg                         70
            notNeg                      17
  • 4. Identifying Topical Sentences with a Document Classifier
  • In the previous section we approached the task of assessing the sentiment of a sentence through a shallow NLP approach. In this section, we take a different approach for determining the topicality of a sentence. We treat the topicality judgment as a text classification problem and solve it with machine learning techniques.
  • In the standard (prior art) text classification approach, representative training examples are provided along with human judgments of topicality. From these, a learning algorithm forms a generalization hypothesis that can be used to determine topicality of previously unseen examples. Typically, the types of text that form the training examples are the same type as those seen during the evaluation and application phases for the classifier. That is, the classifier assumes the example distribution remains constant before and after training.
  • 4.1. Classifying Topical Messages
  • In an exemplary embodiment of our text mining system for a specific marketing domain, a machine learning text classifier is trained to assess topicality on whole messages and thus expects to predict whether or not a whole message is relevant to the given topic. In this section we explore how to use such a text classifier trained on whole messages to accurately predict sentence-level topicality.
  • The provided classifier is trained with machine learning techniques from a collection of documents that have been hand-labeled with the binary relation of topicality. The underlying classifier is a variant of the Winnow classifier (N. Littlestone, “Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm,” Machine Learning 2, pp. 285-318, 1988; A. Blum, “Empirical support for winnow and weighted-majority based algorithms: results on a calendar scheduling domain,” Machine Learning 26, pp. 5-23, 1997; and I. Dagan, Y. Karov, and D. Roth, “Mistake-driven learning in text categorization,” in EMNLP '97, 2nd Conference on Empirical Methods in Natural Language Processing, 1997), the disclosures of which are incorporated herein by reference, an online learning algorithm that finds a linear separator between the class of documents that are topical and the class of documents that are irrelevant. Documents are modeled with the standard bag-of-words representation that simply counts how many times each word occurs in a document. Winnow learns a linear classifier of the form:
  • $H(x) = \sum_{w \in V} f_w \, c_w(x)$   Equ. 1
  • where cw(x) is 1 if word w occurs in document x and 0 otherwise, and fw is the weight for feature w. If H(x)>V then the classifier predicts topical, and otherwise predicts irrelevant. The basic Winnow algorithm proceeds as:
      • 1. Initialize all fw to 1.
      • 2. For each labeled document x in the training set:
      • 2a. calculate H(x).
      • 2b. If the document is topical, but Winnow predicts irrelevant, update each weight fw where cw(x) is 1 by:

  • fw*=2   Equ. 2
      • 2c. If the document is irrelevant, but Winnow predicts topical, update each weight fw where cw(x) is 1 by:

  • fw/=2   Equ. 3
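  • The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the classifier variant used in the exemplary embodiment: the function names, the `epochs` parameter, and the way the threshold is passed in are hypothetical, and documents are represented as sets of words so that cw(x) is simply set membership.

```python
from typing import Dict, Sequence, Set, Tuple


def h(weights: Dict[str, float], words: Set[str]) -> float:
    """H(x): sum of the weights f_w of the distinct words present in document x."""
    # A word that has never been updated still carries its initial weight 1.0 (step 1).
    return sum(weights.get(w, 1.0) for w in words)


def train_winnow(
    docs: Sequence[Tuple[Set[str], bool]],
    threshold: float,
    epochs: int = 1,  # hypothetical: number of passes over the training set
) -> Dict[str, float]:
    """Mistake-driven Winnow training on (words, is_topical) pairs."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for words, is_topical in docs:
            predicted_topical = h(weights, words) > threshold  # step 2a
            if is_topical and not predicted_topical:
                # Step 2b: promote every active feature (Equ. 2).
                for w in words:
                    weights[w] = weights.get(w, 1.0) * 2.0
            elif not is_topical and predicted_topical:
                # Step 2c: demote every active feature (Equ. 3).
                for w in words:
                    weights[w] = weights.get(w, 1.0) / 2.0
    return weights
```

  • Weights change only on mistakes, so a document that is already classified correctly leaves the model untouched.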
  • In a setting with many irrelevant features, no label noise and a linear separation of the classes, Winnow is theoretically guaranteed to quickly converge to a correct hypothesis. Empirically, we have found Winnow to be a very effective document classification algorithm, rivaling the performance of Support Vector Machines (T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Machine Learning: ECML98, Tenth European Conference on Machine Learning, pp. 137-142, 1998, the disclosure of which is incorporated herein by reference) and k-Nearest Neighbor (Y. Yang, “An evaluation of statistical approaches to text categorization,” Information Retrieval 1(1/2), pp. 67-88, 1999, the disclosure of which is incorporated herein by reference), two other state-of-the-art text classification algorithms. In the exemplary embodiment, we use Winnow because it is more computationally efficient than SVMs and easier to apply than kNN. It is to be understood, of course, that it is within the scope of the invention to use classifiers other than the Winnow algorithm.
  • 4.2. Classifying Topical Sentences
  • In the exemplary embodiment, after determining whether the whole message is considered relevant or irrelevant, we then use a straightforward and ad-hoc technique of adapting a given document classifier into a high precision/low recall sentence classifier. If a document is judged by the classifier to be irrelevant, we predict that all sentences in that document are also irrelevant. If a document is judged to be topical, then we further examine each sentence in that document. Given each sentence and our text classifier, we simply form a bag-of-words representation of the sentence as if an entire document consisted of that single sentence. We then run the classifier on the derived pseudo-document. If the classifier predicts topical, then we label the sentence as topical and proceed with the sentiment analysis for that sentence. If the classifier predicts irrelevant, we skip the sentiment analysis and proceed on to the next sentence.
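  • The adaptation described above might look roughly like the following sketch. The tokenizer, the regular-expression sentence splitter, and the `is_topical` callable (standing in for the trained message-level classifier) are all assumptions made for illustration; the text specifies only that each sentence is treated as a pseudo-document and passed to the same classifier.

```python
import re
from typing import Callable, List, Set


def bag_of_words(text: str) -> Set[str]:
    """Lower-cased set of word tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def split_sentences(message: str) -> List[str]:
    """Naive sentence boundary detection: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", message) if s.strip()]


def topical_sentences(
    message: str, is_topical: Callable[[Set[str]], bool]
) -> List[str]:
    """Return the sentences that the message-level classifier accepts.

    If the whole message is judged irrelevant, every sentence is treated as
    irrelevant. Otherwise each sentence is classified as a one-sentence
    pseudo-document.
    """
    if not is_topical(bag_of_words(message)):
        return []
    return [s for s in split_sentences(message) if is_topical(bag_of_words(s))]
```

  • Only the sentences returned here would go on to sentiment analysis, which is what makes the combined procedure high precision but low recall.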
  • 4.3. Experiment Results and Discussion
  • To evaluate this exemplary embodiment, we use the same experimental setup as described in the previous section. We trained a Winnow classifier by hand-labeling 731 training messages, 246 of which were topical. Then, on our test collection, 982 messages were predicted to be topical by the classifier. Precision was measured at 85.4% (117/137 on a randomly selected test set) on the message level. The 982 messages contained 16,616 sentences, 1262 of which were judged to be topical by the classifier. These sentences came from 685 different documents, indicating that 70% of documents judged to be topical also had at least one sentence predicted to be topical. A random sample of 224 of the 1262 topical sentences was hand labeled. Precision on this set was estimated at 79% (176/224). These results show that applying a message-level classifier in a straightforward fashion on the sentence level still maintains about the same precision that was seen on the document level. However, this approach clearly results in a loss of recall, as a significant number of messages predicted to be topical did not have any sentences predicted as topical.
  • 4.4 Brand Specific Topical Polar Messages
  • This section describes how we use the polar sentence detector and identify which messages contain positive or negative expressions about a particular brand. The approach we take is to use a brand text classifier, a feature text classifier, and a set of resolution heuristics to combine these with the polar language detector.
  • In a marketing intelligence application of data mining, there are typically topics of discussion in the data that warrant explicit tracking and identification. The most prevalent type of topics are brand-related, i.e. one topic for each product or brand being tracked, such as the Dell Axim. To facilitate this taxonomic requirement, analysts compose well-written hand-built rules to identify these types of topics. These rules are based on words and phrases, and allow for stemming, synonymy, windowing, and context-sensitivity based on document analysis.
  • From one point of view, these brands are entities occurring in the text, and it might be considered that entity extraction would be the most appropriate technology to apply. However, to facilitate tracking and identification, extracted entities must be normalized to a set of topics. For example, Axim, Dell Axim, and the Dell PDA should all fall into the Dell Axim topic. An approach following that of Cohen, W. W., “Data Integration Using Similarity Joins and a Word-Based Information Representation Language,” ACM Transactions on Information Systems 18(3):288-321 (2000), the disclosure of which is incorporated herein by reference, could be established to automatically normalize entities. However, since our customers typically know exactly which brands they want to monitor, pre-building the rules in this case is more accurate, and its performance is more predictable and more easily measured.
  • As discussed above, we showed that in the domain of online message discussion, intersecting sentiment with topic classifiers at the sentence level provides reasonable precision. That is, if a sentence in a message is both about a brand (according to its classifier) and also contains positive language (as detected by our sentiment analysis) our system asserts that the message is positive about that brand. Other NLP approaches to sentiment do a finer-grained grammatical analysis to associate sentiment with a topic. We have found that in the domain of online discussion, using a sentence intersection approach has reasonably high precision, and also better recall than a grammatical association approach. However, the recall is still relatively low, and thus we extend the recall through a second layer of classification and resolution. A second set of “feature classifiers” is defined to recognize discussion about features of a brand within the given industry. For example, in the automotive domain, there might be classifiers for acceleration, interior styling, and dealership service.
  • In contrast to brand-like topics defined through rules, it's often the case that other topics are more accurately recognized from a complex language expression that is not easily captured by a rule. For example, topics such as Customer Service are not so simply captured by sets of words, phrases and rules. Thus, we often approach topic classification with machine learning techniques. The provided classifier is trained with machine learning techniques from a collection of documents that have been hand-labeled with the binary relation of topicality. The hand-labeling by the analysts is performed using an active learning framework. The underlying classifier is a variant of the Winnow classifier (Littlestone 1988), an online learning algorithm that finds a linear separator between the class of documents that are topical and the class of documents that are irrelevant. Documents are modeled with the standard bag-of-words representation that discards the ordering of words and notices only whether or not a word occurs in a document.
  • These “feature classifiers” are used to extend the recall of identifying polar messages through the following process. If a message contains brand mentions, the feature classifiers are also run on each sentence in a message. If a sentence is both polar and passes a feature classifier, there is likely a polar expression about one of the brands mentioned in the message. A process of fact extraction is layered on top of these classifiers and the sentiment analysis to understand which brand is being referenced in the message. We use simple resolution techniques to associate brand-like topics (e.g. Dell Axim) with topics describing features of brands (e.g. Customer Service or Peripherals). For example, a brand can be referenced in the Subject line of a blog, and feature-like topics mentioned in the body of the blog resolve back to the brand topics in the subject line when other brands are not mentioned in the body. In this way, we identify facts that can be thought of as triples of brands, their (optional) features, and the (optional) polarity of the authorial expression.
  • For purposes of measuring aggregate sentiment for a brand, a message is considered positive about the brand if it contains a fact with the brand's class and a positive polarity. A message is considered negative about the brand if it contains a fact with the brand's class and a negative polarity. While generally correct, the automated nature of the system results in a not insignificant amount of error in claiming these facts. Aggregating these counts into a single overall score for a brand requires a mindfulness of the error rates, to avoid making incorrect claims about a brand. Below we describe how the counts of each of these groups of messages are used to generate a score with confidence bounds that achieves this goal.
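  • As a rough illustration of the counting rule just stated, facts can be modeled as (brand, optional feature, optional polarity) triples and tallied per message. The `Fact` class, its field names, and the string polarity labels below are hypothetical choices for this sketch; the text does not prescribe a data representation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """One extracted fact: a brand, an optional feature, an optional polarity."""
    brand: str
    feature: Optional[str] = None
    polarity: Optional[str] = None  # "positive", "negative", or None (assumed labels)


def count_polar_messages(
    messages: Iterable[List[Fact]], brand: str
) -> Tuple[int, int, int]:
    """Return (total, positive, negative) message counts for one brand.

    A message counts as positive (negative) if it contains at least one fact
    for the brand with positive (negative) polarity. A message with both
    kinds of fact is counted in both tallies.
    """
    total = positive = negative = 0
    for facts in messages:
        about_brand = [f for f in facts if f.brand == brand]
        if not about_brand:
            continue
        total += 1
        if any(f.polarity == "positive" for f in about_brand):
            positive += 1
        if any(f.polarity == "negative" for f in about_brand):
            negative += 1
    return total, positive, negative
```

  • These raw counts are error-prone observations, so they serve as inputs to a scoring step rather than as the final measure.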
  • 4.5. Other Embodiments of Identifying Topical Sentences
  • An alternate embodiment for identifying topical sentences is to use a hand-built set of rules to identify sentences containing a topic. For example, to identify the display of a PDA, an analyst might write the rule “the word ‘screen’ within five words of the word ‘PDA’, the word ‘resolution’, the phrase ‘trans reflective’ but not the phrase ‘monitor resolution’”. These rules can be run over every sentence in the document collection, and any sentence that matches the hand-written rule is considered topical.
  • FIG. 5 shows a screen shot of an exemplary computerized tool for writing topical rules of this kind.
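  • The PDA display rule quoted above could be evaluated along the following lines. This sketch assumes the listed conditions are alternatives (any one suffices) and that the "but not" phrase vetoes a match; the tokenizer and all function names are hypothetical, and the rule is hard-coded rather than read from the rule-writing tool.

```python
import re
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Lower-cased alphanumeric tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def within(tokens: List[str], a: str, b: str, window: int) -> bool:
    """True if some occurrence of a lies within `window` tokens of some b."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def has_phrase(tokens: List[str], phrase: str) -> bool:
    """True if the phrase's tokens appear contiguously in the sentence."""
    p = tokenize(phrase)
    return any(tokens[i:i + len(p)] == p for i in range(len(tokens) - len(p) + 1))


def matches_display_rule(sentence: str) -> bool:
    """Evaluate the example PDA display rule on one sentence."""
    tokens = tokenize(sentence)
    if has_phrase(tokens, "monitor resolution"):  # exclusion clause
        return False
    return (
        within(tokens, "screen", "pda", 5)
        or "resolution" in tokens
        or has_phrase(tokens, "trans reflective")
    )
```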
  • 5. Polarity and Topic
  • One goal of the present invention is to reliably extract polar sentiments about a topic. An embodiment of our system assumes that a sentence judged to be polar and also judged to be topical is indeed expressing polarity about the topic. This relationship is asserted without any NLP-style evidence for a connection between the topic and the sentiment, other than their apparent locality in the same sentence. This section tests the assumption that at the locality of a sentence, a message that is both topical and polar actually expresses polar sentiment about the topic.
  • TABLE 2
    Topic/Polarity combinations: 72% precision
    (72% for positive, 71% for negative)
                                predicted          predicted
                                topic & positive   topic & negative
    truth   topic & positive    137
            other               52
    truth   topic & negative                       37
            other                                  15
  • Using the polarity and topic modules described and tested in the previous sections, the system identifies sentences that are judged to be topical and have either positive or negative sentiment. These sentences are predicted by the system to be saying either positive or negative things about the topic in question. Out of the 1262 sentences predicted to be topical, 316 sentences were predicted to have positive polarity and 81 were predicted to have negative polarity. The precision for the intersection (testing the assumption that a topical sentence with polar content is polar about that topic) is shown in Table 2, above. The results show the overall precision was 72%. Since the precision of the polarity module was 82% and the topic module 79%, an overall precision of 72% demonstrates that the locality assumption holds in most instances.
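  • The precision figures reported for Table 2 can be reproduced from its counts with a few lines of arithmetic. The final line computes the product of the two module precisions as one rough reference point; that comparison is an assumption of this sketch (it treats the two modules' errors as independent) and is not a calculation stated in the text.

```python
def precision(correct: int, incorrect: int) -> float:
    """Fraction of predictions that were correct."""
    return correct / (correct + incorrect)


# Counts taken from Table 2.
positive_precision = precision(137, 52)           # topic & positive
negative_precision = precision(37, 15)            # topic & negative
overall_precision = precision(137 + 37, 52 + 15)  # both classes pooled

# Hypothetical reference point: polarity precision times topic precision,
# i.e. what independent module errors alone would give.
independent_product = 0.82 * 0.79
```

  • Under that assumption the observed overall precision sits somewhat above the product of the module precisions, which is in line with the conclusion that sentence-level locality rarely adds further error.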
  • Below are five randomly selected sentences predicted to be negative topical and five randomly selected sentences predicted to be positive topical. These show typical examples of the sentences discovered by our system.
  • Negative Sentences:
  • Compared to the PRODUCT's screen this thing is very very poor.
  • In multimedia I think the winner is not that clear when you consider that PRODUCT-A has a higher resolution screen than PRODUCT-B and built in camera.
  • I never had a problem with the PRODUCT-A, but did encounter the “Dust/Glass Under The Screen Problem” associated with PRODUCT-B.
  • broken PRODUCT screen
  • It is very difficult to take a picture of a screen.
  • Positive Sentences:
  • The B&W display is great in the sun.
  • The screen is at 70 setting (255 max) which is for me the lowest comfortable setting.
  • At that time, superior screen.
  • Although I really don't care for a cover, I like what COMPANY-A has done with the rotating screen, or even better yet, the concept from COMPANY-B with the horizontally rotating screen and large foldable keyboard.
  • The screen is the same (both COMPANY-A & COMPANY-B decided to follow COMPANY-C), but multimedia is better and more stable on the PRODUCT.
  • 6. A Generative Model for Observing Polar Expressions
  • 6.1 Confidence Scoring
  • Given a set of messages about brand X, previous sections describe how we determine (with some error) whether each message is positive, negative, mixed or neutral about brand X. The end sentiment metric is a function of the estimated frequency of positive messages, and the estimated frequency of negative messages. The simplest measure of positive frequency would be to just divide the number of positive messages about brand X by the total number of messages about brand X. This approach may be undesirable in two important ways. First, the analysis determining positive is error-prone, and the error rates of this are not accounted for. Second, with small amounts of data, the true underlying frequency may be quite far from the measured frequency. In this section we describe how we use Bayesian statistics to model these properties to derive valid estimates for the positive and negative frequencies.
  • The model we choose is a statistical generative model. That is, we assume the facts are extracted by an error-prone process that we model with explicit parameterization. Specifically for the sentiment metric, the fundamental parameter we hope to derive is the frequency of positive messages about a brand, and the frequency of negative messages about a brand. These two processes are modeled analogously; for brevity we discuss here only the derivation of the frequency of positive messages, but one of ordinary skill will readily appreciate how to derive the frequency of negative messages using this model.
  • We model a generative process for facts about brand X by assuming that the positive frequency over all brands is modeled by a Beta distribution, and brand X's positive frequency, Θ is determined by a draw from this Beta distribution. Given the data D consisting of N messages about the brand, n of these are truly positive, determined by a draw from a Binomial distribution, Binomial(N, Θ).
  • The observation process of fact extraction makes two types of errors: (1) false positives, observing a true neutral as a positive, and (2) false negatives, observing a true positive as a neutral. Let these error rates be εfp and εfn respectively. By observing N messages through the error-prone lens of fact extraction, we see m positive messages instead of the correct number n. Let fp, fn, tp and tn be the number of false positive, false negative, true positive and true negative messages observed. Note that these are unknown from the observations, though we do know that:

  • tp+fp=m   Equ.4

  • tn+fn=N−m   Equ.5
  • The goal of the parameter estimation process is to use the observed values N (total messages) and m (positive messages detected) and estimate Θ, the underlying frequency of true positive messages. As we are calculating this from a Bayesian perspective, we derive not only a maximum a posteriori estimate Θ̂, but also a posterior distribution over Θ, which will be important in estimating the size of the confidence bounds.
  • Given the data, we estimate Θ through an application of Bayes' rule and Expectation-Maximization. The posterior probability of Θ is:

  • P(Θ|D)∝P(Θ)P(D|Θ)   Equ. 6
  • $\propto \mathrm{Beta}(\Theta)\,\frac{1}{Z}\,\Theta^{n}\,\varepsilon_{fn}^{fn}\,(1-\varepsilon_{fn})^{tp}\,(1-\Theta)^{N-n}\,\varepsilon_{fp}^{fp}\,(1-\varepsilon_{fp})^{tn}$   Equ. 7
  • where Z is a normalization function of fp, fn, tp and tn.
  • This likelihood equation can be maximized through a straightforward application of Expectation-Maximization (Dempster, A. P.; Laird, N. M.; and Rubin, D. B., “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B 39(1): 1-38 (1977), the disclosure of which is incorporated herein by reference). In the general case, the EM iterative process will solve for a local maximum of a likelihood equation with missing data. In this application, each datapoint's true sentiment is unknown, and only the observed sentiments are known.
  • The M-step estimates Θ using the expectations of the missing values of the data:
  • $\hat{\Theta} = \frac{E[tp] + E[fn] + \alpha}{N + \alpha + \beta}$   Equ. 8
  • where α and β are parameters given by the Beta prior for the Binomial distribution.
  • The E-step calculates the expectation of the missing data using the estimated parameterization:
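  • $E[tp] = m \left( \frac{\hat{\Theta}(1-\varepsilon_{fn})}{\hat{\Theta}(1-\varepsilon_{fn}) + (1-\hat{\Theta})\,\varepsilon_{fp}} \right)$   Equ. 9
  • $E[fp] = m \left( \frac{(1-\hat{\Theta})\,\varepsilon_{fp}}{\hat{\Theta}(1-\varepsilon_{fn}) + (1-\hat{\Theta})\,\varepsilon_{fp}} \right)$   Equ. 10
  • $E[tn] = (N-m) \left( \frac{(1-\hat{\Theta})(1-\varepsilon_{fp})}{\hat{\Theta}\,\varepsilon_{fn} + (1-\hat{\Theta})(1-\varepsilon_{fp})} \right)$   Equ. 11
  • $E[fn] = (N-m) \left( \frac{\hat{\Theta}\,\varepsilon_{fn}}{\hat{\Theta}\,\varepsilon_{fn} + (1-\hat{\Theta})(1-\varepsilon_{fp})} \right)$   Equ. 12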
  • By iterating the E-steps and M-steps until convergence, we arrive at a local maximum in likelihood space, giving us an estimate for Θ. Additionally, at this fixed point, we have also arrived at a posterior distribution:

  • P(Θ|D)=Beta(E[tp]+E[fn]+α, E[tn]+E[fp]+β)   Equ. 13
  • This is not mathematically the true posterior distribution, as it does not account for the uncertainty in the estimation of which messages were erroneously or correctly observed. We have empirically observed much success in using this approximation.
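  • The E-step/M-step iteration and the resulting Beta approximation can be sketched as below. This is an illustrative implementation under assumptions: the function name, starting value, tolerance and iteration cap are hypothetical, and the expectations follow the definitions given with the error rates (a false negative is a truly positive message observed as neutral, so E[fn] is the expected number of truly positive messages among the N−m observed as non-positive). The sketch also assumes the denominators are nonzero, which holds when α and β are positive and the error rates are below 1.

```python
from typing import Tuple


def estimate_positive_frequency(
    N: int, m: int, eps_fp: float, eps_fn: float,
    alpha: float, beta: float,
    tol: float = 1e-9, max_iter: int = 1000,  # hypothetical stopping settings
) -> Tuple[float, float, float]:
    """Return (theta_hat, a, b), where Beta(a, b) is the approximate posterior."""
    theta = (m + alpha) / (N + alpha + beta)  # assumed starting point
    e_tp = e_fn = 0.0
    for _ in range(max_iter):
        # E-step: expected counts of truly positive messages, given theta.
        p_obs_pos = theta * (1 - eps_fn) + (1 - theta) * eps_fp
        p_obs_neu = theta * eps_fn + (1 - theta) * (1 - eps_fp)
        e_tp = m * theta * (1 - eps_fn) / p_obs_pos
        e_fn = (N - m) * theta * eps_fn / p_obs_neu
        # M-step (Equ. 8): re-estimate theta from the expected counts.
        new_theta = (e_tp + e_fn + alpha) / (N + alpha + beta)
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    # Equ. 13: E[tn] + E[fp] = N - E[tp] - E[fn], since the four counts sum to N.
    a = e_tp + e_fn + alpha
    b = (N - e_tp - e_fn) + beta
    return theta, a, b
```

  • With both error rates set to zero the loop stops after one pass and the estimate reduces to (m + α) / (N + α + β), the familiar smoothed frequency.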
  • Four parameters of this model are set through empirical methods: εfp, εfn, α, and β. Both εfp and εfn are set by simply measuring these over a set of labeled data. Both α and β are estimated through a process of setting empirical priors using large sets of unlabeled data.
  • The process described is a method for deriving estimates for the positive and negative frequencies of a brand. However, customer needs require that only a single summary statistic be produced, and that the form of this is a 1-10 metric. Additionally, a 5.0 value of the metric needs to correspond to the case where the estimated frequencies of positive and negative are equal, and generally, very few brands should score at the most extreme ends of the score. The frequencies are converted to a 1-10 score through a log linear normalization of the ratio of positive to negative. Thus, if a 7.0 corresponds to a ratio of 2.0, then 9.0 corresponds to a ratio of 4.0 and a 3.0 score to a ratio of 0.5. Extreme ratios are very rare, and any score beyond 1 or 10 is simply truncated at the extremum.
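  • One way to realize the log linear conversion is shown below, taking the illustrative mapping in the text (7.0 at a ratio of 2.0) as the scale, which works out to two points per doubling of the ratio. The function name and the `points_per_doubling` parameter are hypothetical, and the sketch assumes both frequencies are strictly positive.

```python
import math


def sentiment_score(
    pos_freq: float, neg_freq: float, points_per_doubling: float = 2.0
) -> float:
    """Map the positive/negative frequency ratio to a 1-10 score.

    A ratio of 1.0 gives 5.0; each doubling of the ratio adds
    `points_per_doubling` points; results are truncated to [1, 10].
    """
    raw = 5.0 + points_per_doubling * math.log2(pos_freq / neg_freq)
    return min(10.0, max(1.0, raw))
```

  • Because the mapping is logarithmic, halving the ratio moves the score down by exactly as much as doubling it moves the score up.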
  • To measure the confidence bounds of a sentiment score estimated by this process, we use the posterior distribution of the positive and negative frequencies. We estimate 95% confidence bounds by repeatedly sampling from these posterior distributions, and then plugging this into the 1-10 conversion metric. It's extremely fast to sample this 1000 times, and select the 2.5% and 97.5% lower and upper bounds to set a 95% confidence interval. This process implicitly makes the assumption that the distribution of positive frequency and negative frequencies are independent. While somewhat of a simplification, we have found this process to hold up well empirically.
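  • The sampling procedure above might be sketched as follows, using only the Python standard library. The posterior parameters are passed in as (a, b) pairs for the positive and negative Beta distributions; the function names, the two-points-per-doubling scale (taken from the illustrative mapping of 7.0 to a ratio of 2.0), the small clamp guarding against a zero draw, and the `seed` parameter are assumptions of this sketch.

```python
import math
import random
from typing import Optional, Tuple


def score(pos_freq: float, neg_freq: float, points_per_doubling: float = 2.0) -> float:
    """1-10 score from a frequency ratio (assumed scale: two points per doubling)."""
    tiny = 1e-12  # guard against a zero draw; an assumption of this sketch
    ratio = max(pos_freq, tiny) / max(neg_freq, tiny)
    return min(10.0, max(1.0, 5.0 + points_per_doubling * math.log2(ratio)))


def score_bounds(
    pos_posterior: Tuple[float, float],
    neg_posterior: Tuple[float, float],
    samples: int = 1000,
    seed: Optional[int] = None,
) -> Tuple[float, float]:
    """Approximate 95% bounds on the score by sampling both Beta posteriors.

    The positive and negative frequencies are drawn independently, which
    mirrors the independence simplification described in the text.
    """
    rng = random.Random(seed)
    scores = sorted(
        score(rng.betavariate(*pos_posterior), rng.betavariate(*neg_posterior))
        for _ in range(samples)
    )
    lower = scores[int(0.025 * samples)]
    upper = scores[int(0.975 * samples) - 1]
    return lower, upper
```

  • Posteriors built from more messages are more concentrated, so the sampled scores spread less and the bounds tighten.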
  • 6.2 Empirical Validation
  • This section presents empirical results of the polarity metric with confidence bounds in two different domains. We also demonstrate that the confidence bounds are well-behaved, and necessary for good interpretation of comparisons between brands.
  • One important industry for Intelliseek is the automotive industry. To this end, we have configured a system to recognize all currently-available auto makes and models. In addition, we have defined a number of classifiers for automotive features, from physical characteristics such as interior styling, to leasing and dealerships, to more intangible items like customer service. Table 3 displays message counts, sentiment scores, and sentiment confidence bounds for a sampling of auto brands, as determined by the algorithms described in the previous section. The table shows numbers for a time-constrained set of messages. By analyzing just a small timeframe, the message counts can be somewhat small, which highlights the need for the confidence bounds on the metric.
  • TABLE 3
    Model               # Messages   Sentiment   Bounds
    Mazda Mazda6               568         8.0      1.2
    Infiniti G35               292         7.9      1.7
    Hyundai Sonata             212         7.7      2.2
    Audi A4                    431         7.3      1.2
    BMW M3                     504         7.0      1.0
    Toyota Corolla             684         6.6      0.8
    Honda Odyssey              317         6.6      1.3
    Toyota Celica              276         6.4      1.3
    Ford F150                  412         6.2      0.9
    Honda S2000                543         6.2      0.8
    Honda Accord              1951         5.8      0.5
    Nissan Altima              444         5.2      1.1
    Honda Civic               1212         5.0      0.6
    Honda CR-V                 274         4.5      1.2
    Dodge Ram                  248         4.5      1.5
    Volkswagen Jetta           505         4.3      0.9
    Ford Taurus                469         3.7      1.1
  • The above Table 3 shows the results of the sentiment metric applied in the auto domain. Note that in general, models with larger message counts have smaller confidence bounds. Using these scores to drive analysis yields insights that explain the relative rankings of the different models.
  • By drilling down on some of the backing data for sentiment scores, it is possible to understand why specific models were rated highly or lowly. By investigating further, we find that the Mazda 6 (a highly rated model) had a number of positive comments surrounding its performance and styling in the sports sedan market:
  • I think the Mazda 6 is the best value for a sports sedan
  • The Mazda 6 is one of the best handling FWD autos
  • The Mazda6 MPS achieves a superior balance between high performance and daily needs such as comfort and economy.
  • That car is soo good lookin!
  • Power and torque are faithfully and thoroughly transferred to the road surface for maximum efficiency.
  • The Ford Taurus, a lower rated model, received a number of complaints about quality issues and being generally out of date:
  • I had three separate Tauruses with leaky rear main seals.
  • The Taurus in a failure.
  • The standard spoiler is too small.
  • The power steering always whined, even with enough fluid.
  • The Taurus should have been put out of its misery 5 years ago.
  • TABLE 4
    Destination     # Messages   Sentiment   Bounds
    Aruba                  539         9.7      1.4
    Antigua                944         8.8      1.2
    St. Lucia              687         8.3      1.2
    St. Bart's             116         7.7      2.6
    Barbados              1440         6.8      0.9
    Grand Bahama          3384         6.4      0.5
    Jamaica               5479         5.9      0.4
    Cuba                  2435         5.3      0.8
    Grand Cayman           492         5.1      1.7
  • The above Table 4 illustrates results of the sentiment metric in measuring aggregate opinion about Caribbean vacation destinations.
  • Table 4 shows the results of measuring polarity for location topics in a small data set of messages about Caribbean destinations. By further drilling down on these scores, an analyst can quickly determine that:
  • Aruba scores well due to a good general opinion of dining out, snorkeling and beach activities.
  • Cuba has a lower score due to poor snorkeling and beach activities.
  • Grand Bahama's medium score comes from above average opinion of snorkeling, moderate opinion of dining out and a slightly lower opinion of beach activities.
  • FIG. 7 provides a scatterplot showing how the size of the confidence bounds is influenced by the number of messages. Each point is an automotive model. Once there are about 1000 messages for a topic, the 95% confidence bounds tend to be within 1.0 on a ten point scale.
  • FIG. 7 shows an analysis of the confidence bounds by the amount of message volume about a brand. The x-axis shows the number of messages about a brand, and the y-axis shows the estimated size of the 95% confidence bounds. With a very small amount of data for a brand, the confidence bounds on each brand tend to be rather large. This generally prevents conclusive statements from being made when comparing sentiment scores that have such large confidence bounds. As the message volume gets larger, the bounds get smaller, and thus it becomes easier to make statistically valid conclusions based on these scores.
  • 7. Demonstration of User Interface
  • FIGS. 1-4 provide screen shots of an exemplary computerized tool for implementing certain embodiments of the present invention.
  • The screen shot of FIG. 1 illustrates a function of the exemplary computerized tool establishing a topic for the text mining algorithm contained therein. Three main features visible in this screen view are the Topic Select window 20, the Viewer window 22, and the Current Slice box 24. The Topic Select window 20 lists the available topics from which the user may select a topic for analysis. The Viewer window 22 displays the text of a particular message. The Current Slice box 24 provides status information regarding the user's selections that define the research project that is being performed. In the example shown, the Current Slice box 24 indicates that the Topic selected by the user is “Hardware::Display”. With this selection, the exemplary computerized tool will concentrate on certain characteristics regarding a manufacturer's electronic device (in this case, a PDA) or the competing electronic devices of the manufacturer's competitors. The tool has access to a repository of thousands or millions of internet message-board entries preselected as likely having content of interest (e.g., taken from internet message boards dedicated to electronic devices and/or PDAs). The Viewer window 22 provides an example message found by the above-described text mining tool in the repository of messages that the text mining tool considered relevant to the selected topic.
  • In the FIG. 1 screen view, the right-side block 25 displays data pertaining to the analysis of the currently selected message contents. The Relevance, Aux Relevance, Positive and Negative Polarity displays show a score between zero and one for the currently selected message for each of these different types of scoring. Specifically, scores greater than zero for Positive and Negative Polarity mean that at least one sentence in the message has been identified as positive or negative, with a higher score indicating a higher degree of positivity or negativity identified in the message. The Relevance and Aux Relevance scores indicate a confidence that the message is about the selected topic (PDA's and Pocket PCs in this example). Messages that are below a specified threshold of relevance can be excluded.
  • The screen shot of FIG. 2 illustrates a function of the exemplary computerized tool in which the user may establish a more specific aspect of the general topic selected in the screen of FIG. 1. The Viewer window 22 and Current Slice box 24 appear again and serve the same purpose as described above with reference to FIG. 1. However, there is now a Phrase-select window 26, which allows the user to enter a word or group of words to specify the content to be analyzed in the messages. In the example shown, the Current Slice box 24 indicates that the user has entered the phrase “resolution,” thus indicating that the messages will be searched for comments relating to the resolution of the hardware displays. The Viewer window 22 provides an example message found by the above-described text mining tool in the repository of messages that the text mining tool considered relevant to the selected topic and phrase, with the selected phrase “resolution” appearing in highlighted text 26.
  • The screen shot of FIG. 3 illustrates a function of the exemplary computerized tool in which the user has requested the tool to illustrate the positive sentences and negative sentences located in the messages considered to be topical to the resolution of the customer's electronic device screen. The positive sentences found by the sentence classifier are listed under the “Positive Quotes” header 28 in the Quotes window 30 and the negative sentences found by the sentence classifier are listed under the “Negative Quotes” header 32 in the Quotes window 30. As can be seen by this example, not every sentence is directly on point, but there is certainly a substantial ratio of sentences that are on point versus those that are not. Additionally, the user has the ability to select one of the sentences, such as sentence 34, to view the entire message from which it was extracted, as shown in the Viewer window of FIG. 4.
  • The screen shot of FIG. 4 shows the Viewer window 22 displaying the text of the message from which the comment 34 shown in FIG. 3 selected by the user originated.
  • The screen shot of FIG. 5 illustrates a demonstration of how rule-based classifiers may be built. This tool allows the user to define a topic (such as a particular brand or product) by creating a “rule” built from words to be associated with that topic. Such a rule can be used to search feedback or comment messages, with those messages that conform to the defined rule being identified as pertaining to the selected topic.
  • On the left-hand part of the FIG. 5 screen is a list 36 containing the different topics for which the topical sentiment analysis of the present invention may be performed. This list can be generated by a database and include topics for which feedback or comment is available. In the example shown, “Kia Optima” is the currently selected topic 38. The middle of the screen contains a window 40 for implementing the tool for writing rules to define the currently selected topic. The window 40 is further divided into an “OR Rules” block 42 and a “BUT-NOT Rules” block 44. Words appearing in the “OR Rules” block will be associated with the topic, such that feedback or comment messages containing any of these words will be identified as pertaining to the selected topic. Words appearing in the “BUT-NOT Rules” block will be given preclusive effect in the topic definition, such that the appearance of one of these words in a feedback or comment message will disqualify that message from pertinence to the selected topic. For example, the rule defined by the words shown in the “OR Rules” block 42 and “BUT-NOT Rules” block 44 of FIG. 5 can be stated as “A message is about the Kia Optima if the word ‘Optima’ appears in the message, but not if any of the phrases ‘Optima batteries’, ‘Optima battery’, ‘Optima Yellow’, ‘Optima Red’, or ‘Optima Yell’ appear in the message”. When building rules, the user can type words to be added to the “OR Rules” block 42 and “BUT-NOT Rules” block 44, or the user can select words or phrase sets from the list 46 on the right side of the FIG. 5 screen. The list 46 is a collection of previously-entered or customized words and phrases, which can be used as shortcuts when writing a rule.
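  • The “OR Rules” and “BUT-NOT Rules” logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `TopicRule` class is hypothetical, and case-insensitive substring matching is assumed because the description does not specify how the tool matches words against a message.

```python
from dataclasses import dataclass, field


@dataclass
class TopicRule:
    """Hypothetical container for one topic's OR and BUT-NOT phrase lists."""
    topic: str
    or_phrases: list = field(default_factory=list)
    but_not_phrases: list = field(default_factory=list)

    def matches(self, message: str) -> bool:
        text = message.lower()
        # Any BUT-NOT phrase disqualifies the message outright.
        if any(p.lower() in text for p in self.but_not_phrases):
            return False
        # Otherwise a single OR phrase is enough to assign the topic.
        return any(p.lower() in text for p in self.or_phrases)


# The rule shown in FIG. 5, restated with the phrases listed in the text.
kia_optima = TopicRule(
    topic="Kia Optima",
    or_phrases=["Optima"],
    but_not_phrases=[
        "Optima batteries",
        "Optima battery",
        "Optima Yellow",
        "Optima Red",
        "Optima Yell",
    ],
)
```

  • With this rule, a message such as “I test drove the new Optima yesterday” is identified as pertaining to the Kia Optima, while “My Optima battery died” is excluded by the BUT-NOT phrases.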
  • 8. Process for Specializing a Lexicon for a Data Set
  • A standard lexicon may be applied to any data set. However, the results will be improved if the lexicon is tuned to work with the particular language of a domain. A system has been implemented to assist a user in carrying out this process.
  • 1. Messages for the domain are collected (as part of the configuration process within the application housing the polarity system).
  • 2. Messages are scanned to determine which words have the potential to be added to the lexicon.
  • 3. The user is stepped through these candidate words and required to indicate if they accept or reject the word for the custom lexicon.
  • Step 2 above uses a number of methods to determine which words are to be used as candidates: (a) pattern-based methods (Gregory Grefenstette, Yan Qu, David A. Evans and James G. Shanahan, Validating the Coverage of Lexical Resources for Affect Analysis and Automatically Classifying New Words Along Semantic Axes, AAAI Symposium on Exploring Attitude and Affect in Text: Theories and Applications, 2004, the disclosure of which is incorporated herein by reference); and (b) commonly occurring adjectives and adverbs not found in the lexicon and not included in a default lexicon of known non-polar terms.
  • In the pattern driven approach, a number of patterns are used to locate adjectives and adverbs which have a good chance of being polar. The patterns involve both tokens (words) and parts of speech. The patterns consist of a prefix of tokens and a target set of POS tags. The patterns are created from a pair of word pools. Pool one contains, for example, ‘appears’, ‘looks’, ‘seems’; pool two contains, for example, ‘very’, ‘extremely’. The product of these pools (e.g. ‘appears very’, ‘looks extremely’ and so on) is then appended with one of the target POS tags (which select for adjectives and adverbs), giving a complete set of patterns (e.g. ‘looks extremely JJ’, meaning a sequence of two tokens followed by a POS tag).
  • To populate the candidate list, the messages in the corpus collected for the project being customized are scanned using the patterns described above. All words which match any of the patterns are collected.
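  • The pattern construction and corpus scan described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, messages are assumed to arrive already tokenized and POS-tagged as (token, tag) pairs, and the Penn Treebank tags JJ (adjective) and RB (adverb) are assumed as the target tag set.

```python
from itertools import product

POOL_ONE = ["appears", "looks", "seems"]
POOL_TWO = ["very", "extremely"]
TARGET_TAGS = ["JJ", "RB"]  # assumed tags selecting adjectives and adverbs


def build_patterns(pool_one, pool_two, target_tags):
    """Cross the two word pools, then append each target POS tag."""
    return [
        (first, second, tag)
        for first, second in product(pool_one, pool_two)
        for tag in target_tags
    ]


def find_candidates(tagged_message, patterns):
    """Collect each word whose two preceding tokens and own tag match a pattern."""
    pattern_set = set(patterns)
    candidates = []
    for i in range(len(tagged_message) - 2):
        (w1, _), (w2, _), (w3, tag3) = tagged_message[i:i + 3]
        if (w1.lower(), w2.lower(), tag3) in pattern_set:
            candidates.append(w3.lower())
    return candidates


patterns = build_patterns(POOL_ONE, POOL_TWO, TARGET_TAGS)
tagged = [
    ("The", "DT"), ("screen", "NN"), ("looks", "VBZ"),
    ("extremely", "RB"), ("underwhelming", "JJ"), (".", "."),
]
print(find_candidates(tagged, patterns))  # ['underwhelming']
```

  • With three words in pool one, two in pool two, and two target tags, the sketch generates twelve patterns; the example message matches the pattern (‘looks’, ‘extremely’, JJ).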
  • In a parameter driven approach, all adjectives and adverbs in messages which have already been marked as polar, and which have counts above a certain threshold, are added to the list of candidates.
  • Each of the above pattern driven and parameter driven approaches can be tuned using a filter which accepts only candidates which appear a certain number of times in the corpus. By using these parameters, we create four candidate creation methods, two for each approach. The user then steps through the four sets of candidates, accepting or rejecting words as appropriate. The interface within which this is carried out presents the user with the word, its POS and a list of examples of that word appearing in contexts mined from the corpus of messages.
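  • The parameter driven approach and its frequency filter can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and its parameters are hypothetical, messages are assumed to be pre-tagged (token, tag) pairs using the Penn Treebank tags JJ and RB, and the threshold is treated as an inclusive minimum count, which the description does not specify.

```python
from collections import Counter


def frequent_candidates(polar_messages, lexicon, non_polar, min_count,
                        target_tags=("JJ", "RB")):
    """Return adjectives and adverbs from polar messages that are absent from
    both the lexicon and the known non-polar list and occur often enough."""
    counts = Counter(
        token.lower()
        for message in polar_messages
        for token, tag in message
        if tag in target_tags
        and token.lower() not in lexicon
        and token.lower() not in non_polar
    )
    # Assumption: a word qualifies when its count is at least min_count.
    return sorted(word for word, n in counts.items() if n >= min_count)


polar_messages = [
    [("sluggish", "JJ"), ("screen", "NN")],
    [("sluggish", "JJ"), ("great", "JJ")],
    [("Sluggish", "JJ"), ("other", "JJ")],
]
candidates = frequent_candidates(
    polar_messages, lexicon={"great"}, non_polar={"other"}, min_count=2
)
```

  • Here ‘great’ is skipped because it is already in the lexicon, ‘other’ is skipped as a known non-polar term, and only ‘sluggish’ remains as a candidate for the user to accept or reject.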
  • As shown in FIG. 6, an example screen shot shows such a system in use. The system is presenting the user with the word ‘underwhelming’ which has been generated in the first candidate generation step. The word is illustrated by two examples that have been pulled from the corpus of messages. The user labels the word either by keyboard shortcuts or by clicking on the appropriate label found in the bottom right hand corner of the display.
  • 9. Conclusions
  • Determining the sentiment of an author by text analysis requires the ability to determine the polarity of the text as well as the topic. In these exemplary embodiments, topic detection is generally solved by a trainable classification algorithm and polarity detection is generally solved by a grammatical model. The approach described in some of these embodiments takes independent topic and polarity systems and combines them, under the assumption that a topical sentence with polarity contains polar content on that topic. We tested this assumption and determined it to be viable for the domain of online messages. This system provides the ability to retrieve messages (in fact, parts of messages) that indicate the author's sentiment to some particular topic, a valuable capability.
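  • The combination of independent topic and polarity systems can be sketched as follows. This is a minimal illustration only: the keyword test and the tiny word lists are hypothetical stand-ins for the trainable topic classifier and the grammatical polarity model, which are not reproduced here, and the function names are assumptions.

```python
POSITIVE = {"great", "sharp"}     # toy stand-in for a polarity lexicon
NEGATIVE = {"terrible", "blurry"}


def toy_is_topical(sentence):
    """Placeholder for the trained topic classifier."""
    return "resolution" in sentence.lower()


def toy_polarity(sentence):
    """Placeholder for the grammatical polarity model."""
    words = set(sentence.lower().strip(".!?").split())
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"


def topical_sentiments(sentences, is_topical, polarity):
    """Keep sentences that are both topical and polar, assuming that the
    polar content of a topical sentence is about that topic."""
    results = []
    for sentence in sentences:
        label = polarity(sentence)
        if label != "neutral" and is_topical(sentence):
            results.append((sentence, label))
    return results


sentences = [
    "The resolution is great.",
    "The battery is terrible.",
    "I checked the resolution.",
]
found = topical_sentiments(sentences, toy_is_topical, toy_polarity)
```

  • Only the first sentence is returned: the second is polar but off topic, and the third is on topic but carries no polarity.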
  • The detection of polarity is a semantic or meta-semantic interpretive problem. A complete linguistic solution to the problem would deal with word sense issues and some form of discourse modeling (which would ultimately require a reference resolution component) in order to determine the topic of the polar expressions. Our approach restricts these open problems by constraining the data set, and specializing the detection of polarity. These steps by no means address directly these complex linguistic issues, but taken in conjunction (and with some additional aspects pertaining to the type of expressions found in the domain of online messages) the problem is constrained enough to produce reliable results.
  • Following from the above description and invention summaries, it should be apparent to those of ordinary skill in the art that, while the systems and processes herein described constitute exemplary embodiments of the present invention, it is understood that the invention is not limited to these precise systems and processes and that changes may be made therein without departing from the scope of the invention as defined by the following proposed claims. Additionally, it is to be understood that the invention is defined by the proposed claims and it is not intended that any limitations or elements describing the exemplary embodiments set forth herein are to be incorporated into the meanings of the claims unless such limitations or elements are explicitly listed in the proposed claims. Likewise, it is to be understood that it is not necessary to meet any or all of the identified advantages or objects of the invention disclosed herein in order to fall within the scope of any proposed claims, since the invention is defined by the claims and since inherent and/or unforeseen advantages of the present invention may exist even though they may not have been explicitly discussed herein.

Claims (18)

1. A system comprising:
a text mining tool to process text in an electronic document to identify a topical expression and a polar expression in the electronic document;
a text classifier to determine a topic of the topical expression and a polarity of the polar expression, to identify a relevance of the polar expression to the topical expression, and to generate a confidence score associated with the relevance of the polar expression to the topical expression; and
a user interface to display the topical expression with the polar expression and the confidence score to a user.
2. The system of claim 1, wherein the user interface is to display an aggregated plurality of topical expressions and corresponding polar expressions identified within the electronic document.
3. The system of claim 1, wherein the user interface is to display an aggregated plurality of topical expression and corresponding polar expressions identified from a plurality of electronic documents.
4. The system of claim 1, wherein the text mining tool is to perform a natural language processing analysis of the electronic document to identify a topical expression and a polar expression, and the text classifier is to perform a natural language processing analysis of the topical expression and the polar expression to identify a relevance of the polar expression to the topical expression and to generate a confidence score associated with the relevance of the polar expression to the topical expression.
5. The system of claim 1, wherein the text classifier comprises a statistical machine learning classifier.
6. The system of claim 5, wherein the statistical machine learning classifier is to execute a Winnow analysis with respect to the electronic document.
7. The system of claim 1, wherein the text classifier comprises a rules-based classifier and is to analyze the polar expression using a shallow natural language processing technique.
8. A method comprising:
processing text in electronic content to identify a topical expression and a polar expression in the electronic content;
determining a topic of the topical expression and a polarity of the polar expression;
identifying a relevance of the polar expression to the topical expression;
generating a confidence score associated with the relevance of the polar expression to the topical expression; and
displaying the topical expression with the polar expression and the confidence score to a user.
9. The method of claim 8, further comprising displaying an aggregated plurality of topical expressions and corresponding polar expressions identified within the electronic content.
10. The method of claim 8, further comprising displaying an aggregated plurality of topical expression and corresponding polar expressions identified from a plurality of electronic content.
11. The method of claim 8, wherein processing text further comprises performing a natural language processing analysis of the electronic content to identify the topical expression and the polar expression, and wherein identifying further comprises performing a natural language processing analysis of the topical expression and the polar expression to identify a relevance of the polar expression to the topical expression.
12. The method of claim 8, wherein identifying further comprises applying a statistical machine learning classification to identify a relevance of the polar expression to the topical expression.
13. The method of claim 8, wherein the identifying further comprises applying a rules-based classification with a shallow natural language processing technique to analyze the polar expression.
14. A tangible computer readable storage medium including executable program instructions which, when executed by a computer processor, cause the computer to implement a system comprising:
a text mining tool to process text in an electronic document to identify a topical expression and a polar expression in the electronic document;
a text classifier to determine a topic of the topical expression and a polarity of the polar expression, to identify a relevance of the polar expression to the topical expression; and
a user interface to display the topical expression with the polar expression to a user.
15. The computer readable storage medium of claim 14, wherein the text classifier is to generate a confidence score associated with the relevance of the polar expression to the topical expression.
16. The computer readable storage medium of claim 14, wherein the user interface is to display an aggregated plurality of topical expression and corresponding polar expressions identified from a plurality of electronic documents.
17. The computer readable storage medium of claim 14, wherein the text mining tool is to perform a natural language processing analysis of the electronic document to identify a topical expression and a polar expression and the text classifier is to perform a natural language processing analysis of the topical expression and the polar expression to identify a relevance of the polar expression to the topical expression, and to generate a confidence score associated with the relevance of the polar expression to the topical expression.
18. The computer readable storage medium of claim 14, wherein the text classifier comprises a rules-based classifier to apply a statistical generator model to analyze the polar expression in relation to the topical expression.
US12/969,356 2004-09-30 2010-12-15 Topical sentiments in electronically stored communications Expired - Fee Related US8041669B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/969,356 US8041669B2 (en) 2004-09-30 2010-12-15 Topical sentiments in electronically stored communications

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US61494104P 2004-09-30 2004-09-30
US11/245,542 US7523085B2 (en) 2004-09-30 2005-09-30 Topical sentiments in electronically stored communications
US12/395,239 US7877345B2 (en) 2004-09-30 2009-02-27 Topical sentiments in electronically stored communications
US12/969,356 US8041669B2 (en) 2004-09-30 2010-12-15 Topical sentiments in electronically stored communications

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US12/395,239 Continuation US7877345B2 (en) 2004-09-30 2009-02-27 Topical sentiments in electronically stored communications

Publications (2)

Publication Number Publication Date
US20110093417A1 true US20110093417A1 (en) 2011-04-21
US8041669B2 US8041669B2 (en) 2011-10-18

Family

ID=36143117

Family Applications (3)

Application Number Title Priority Date Filing Date
US11/245,542 Active 2026-05-10 US7523085B2 (en) 2004-09-30 2005-09-30 Topical sentiments in electronically stored communications
US12/395,239 Active 2026-01-20 US7877345B2 (en) 2004-09-30 2009-02-27 Topical sentiments in electronically stored communications
US12/969,356 Expired - Fee Related US8041669B2 (en) 2004-09-30 2010-12-15 Topical sentiments in electronically stored communications

Country Status (2)

Country Link
US (3) US7523085B2 (en)
WO (1) WO2006039566A2 (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100079464A1 (en) * 2008-09-26 2010-04-01 Nec Biglobe, Ltd. Information processing apparatus capable of easily generating graph for comparing of a plurality of commercial products
US20110295591A1 (en) * 2010-05-28 2011-12-01 Palo Alto Research Center Incorporated System and method to acquire paraphrases
US8352405B2 (en) * 2011-04-21 2013-01-08 Palo Alto Research Center Incorporated Incorporating lexicon knowledge into SVM learning to improve sentiment classification
WO2013019791A1 (en) * 2011-08-02 2013-02-07 Anderson Tom H C Natural language test analytics
US20130103385A1 (en) * 2011-10-24 2013-04-25 Riddhiman Ghosh Performing sentiment analysis
US20140278363A1 (en) * 2013-03-15 2014-09-18 International Business Machines Corporation Enhanced Answers in DeepQA System According to User Preferences
US20150058080A1 (en) * 2013-08-23 2015-02-26 International Business Machines Corporation Contract erosion and renewal prediction through sentiment analysis
WO2015137998A1 (en) * 2013-03-10 2015-09-17 Squerb, Inc. System for graphically displaying user-provided information
WO2016035072A3 (en) * 2014-09-02 2016-04-21 Feelter Sales Tools Ltd Sentiment rating system and method
WO2016066228A1 (en) * 2014-10-31 2016-05-06 Longsand Limited Focused sentiment classification
US9535899B2 (en) 2013-02-20 2017-01-03 International Business Machines Corporation Automatic semantic rating and abstraction of literature
US20170039183A1 (en) * 2015-08-07 2017-02-09 Nec Laboratories America, Inc. Metric Labeling for Natural Language Processing
US10198432B2 (en) 2016-07-28 2019-02-05 Abbyy Production Llc Aspect-based sentiment analysis and report generation using machine learning methods
US20190297035A1 (en) * 2018-03-26 2019-09-26 International Business Machines Corporation Chat thread correction
US10956482B2 (en) * 2008-11-10 2021-03-23 Google Llc Sentiment-based classification of media content
US11100557B2 (en) 2014-11-04 2021-08-24 International Business Machines Corporation Travel itinerary recommendation engine using inferred interests and sentiments
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US11354507B2 (en) * 2018-09-13 2022-06-07 International Business Machines Corporation Compared sentiment queues

Families Citing this family (202)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7268700B1 (en) 1998-01-27 2007-09-11 Hoffberg Steven M Mobile communication device
US8271316B2 (en) * 1999-12-17 2012-09-18 Buzzmetrics Ltd Consumer to business data capturing system
US7197470B1 (en) 2000-10-11 2007-03-27 Buzzmetrics, Ltd. System and method for collection analysis of electronic discussion methods
US7428496B1 (en) * 2001-04-24 2008-09-23 Amazon.Com, Inc. Creating an incentive to author useful item reviews
US9818136B1 (en) 2003-02-05 2017-11-14 Steven M. Hoffberg System and method for determining contingent relevance
US7725414B2 (en) 2004-03-16 2010-05-25 Buzzmetrics, Ltd An Israel Corporation Method for developing a classifier for classifying communications
US7590589B2 (en) 2004-09-10 2009-09-15 Hoffberg Steven M Game theoretic prioritization scheme for mobile ad hoc networks permitting hierarchal deference
US7523085B2 (en) * 2004-09-30 2009-04-21 Buzzmetrics, Ltd An Israel Corporation Topical sentiments in electronically stored communications
JP4148522B2 (en) * 2004-11-19 2008-09-10 インターナショナル・ビジネス・マシーンズ・コーポレーション Expression detection system, expression detection method, and program
US20060242040A1 (en) * 2005-04-20 2006-10-26 Aim Holdings Llc Method and system for conducting sentiment analysis for securities research
US9158855B2 (en) 2005-06-16 2015-10-13 Buzzmetrics, Ltd Extracting structured data from weblogs
US20070100779A1 (en) * 2005-08-05 2007-05-03 Ori Levy Method and system for extracting web data
US8874477B2 (en) 2005-10-04 2014-10-28 Steven Mark Hoffberg Multifactorial optimization system and method
US20070143122A1 (en) * 2005-12-06 2007-06-21 Holloway Lane T Business method for correlating product reviews published on the world wide Web to provide an overall value assessment of the product being reviewed
US20070174255A1 (en) * 2005-12-22 2007-07-26 Entrieva, Inc. Analyzing content to determine context and serving relevant content based on the context
US9269068B2 (en) * 2006-05-05 2016-02-23 Visible Technologies Llc Systems and methods for consumer-generated media reputation management
US7831928B1 (en) 2006-06-22 2010-11-09 Digg, Inc. Content visualization
US8869037B2 (en) * 2006-06-22 2014-10-21 Linkedin Corporation Event visualization
US8862591B2 (en) * 2006-08-22 2014-10-14 Twitter, Inc. System and method for evaluating sentiment
US8296168B2 (en) * 2006-09-13 2012-10-23 University Of Maryland System and method for analysis of an opinion expressed in documents with regard to a particular topic
US7660783B2 (en) 2006-09-27 2010-02-09 Buzzmetrics, Inc. System and method of ad-hoc analysis of data
US7761287B2 (en) * 2006-10-23 2010-07-20 Microsoft Corporation Inferring opinions based on learned probabilities
TWI337712B (en) * 2006-10-30 2011-02-21 Inst Information Industry Systems and methods for measuring behavior characteristics, and machine readable medium thereof
US7930302B2 (en) * 2006-11-22 2011-04-19 Intuit Inc. Method and system for analyzing user-generated content
US8099429B2 (en) * 2006-12-11 2012-01-17 Microsoft Corporation Relational linking among resoures
US20080147579A1 (en) * 2006-12-14 2008-06-19 Microsoft Corporation Discriminative training using boosted lasso
US8631005B2 (en) 2006-12-28 2014-01-14 Ebay Inc. Header-token driven automatic text segmentation
US20080249764A1 (en) * 2007-03-01 2008-10-09 Microsoft Corporation Smart Sentiment Classifier for Product Reviews
US7996210B2 (en) 2007-04-24 2011-08-09 The Research Foundation Of The State University Of New York Large-scale sentiment analysis
US20160217488A1 (en) * 2007-05-07 2016-07-28 Miles Ward Systems and methods for consumer-generated media reputation management
US7827128B1 (en) 2007-05-11 2010-11-02 Aol Advertising Inc. System identification, estimation, and prediction of advertising-related data
US20090157668A1 (en) * 2007-12-12 2009-06-18 Christopher Daniel Newton Method and system for measuring an impact of various categories of media owners on a corporate brand
US7987188B2 (en) * 2007-08-23 2011-07-26 Google Inc. Domain-specific sentiment classification
US20110077952A1 (en) * 2007-09-07 2011-03-31 Ryan Steelberg Apparatus, System and Method for a Brand Affinity Engine Using Normalized Media Ratings
US8108255B1 (en) 2007-09-27 2012-01-31 Amazon Technologies, Inc. Methods and systems for obtaining reviews for items lacking reviews
US8001003B1 (en) * 2007-09-28 2011-08-16 Amazon Technologies, Inc. Methods and systems for searching for and identifying data repository deficits
US8280885B2 (en) * 2007-10-29 2012-10-02 Cornell University System and method for automatically summarizing fine-grained opinions in digital text
US20090119156A1 (en) * 2007-11-02 2009-05-07 Wise Window Inc. Systems and methods of providing market analytics for a brand
US20090119157A1 (en) * 2007-11-02 2009-05-07 Wise Window Inc. Systems and method of deriving a sentiment relating to a brand
US20090125381A1 (en) * 2007-11-07 2009-05-14 Wise Window Inc. Methods for identifying documents relating to a market
US20090125382A1 (en) * 2007-11-07 2009-05-14 Wise Window Inc. Quantifying a Data Source's Reputation
US8417713B1 (en) 2007-12-05 2013-04-09 Google Inc. Sentiment detection as a ranking signal for reviewable entities
US8347326B2 (en) 2007-12-18 2013-01-01 The Nielsen Company (US) Identifying key media events and modeling causal relationships between key events and reported feelings
US7814108B2 (en) * 2007-12-21 2010-10-12 Microsoft Corporation Search engine platform
CA2650319C (en) 2008-01-24 2016-10-18 Radian6 Technologies Inc. Method and system for targeted advertising based on topical memes
US8010539B2 (en) * 2008-01-25 2011-08-30 Google Inc. Phrase based snippet generation
US8799773B2 (en) * 2008-01-25 2014-08-05 Google Inc. Aspect-based sentiment summarization
US8239189B2 (en) * 2008-02-26 2012-08-07 Siemens Enterprise Communications Gmbh & Co. Kg Method and system for estimating a sentiment for an entity
US20090248484A1 (en) * 2008-03-28 2009-10-01 Microsoft Corporation Automatic customization and rendering of ads based on detected features in a web page
US9245252B2 (en) 2008-05-07 2016-01-26 Salesforce.Com, Inc. Method and system for determining on-line influence in social media
US8731995B2 (en) * 2008-05-12 2014-05-20 Microsoft Corporation Ranking products by mining comparison sentiment
US9646078B2 (en) * 2008-05-12 2017-05-09 Groupon, Inc. Sentiment extraction from consumer reviews for providing product recommendations
US20090307003A1 (en) * 2008-05-16 2009-12-10 Daniel Benyamin Social advertisement network
WO2009152154A1 (en) * 2008-06-09 2009-12-17 J.D. Power And Associates Automatic sentiment analysis of surveys
US20100030865A1 (en) * 2008-07-31 2010-02-04 International Business Machines Corporation Method for Prioritizing E-mail Messages Based on the Status of Existing E-mail Messages
US9092517B2 (en) * 2008-09-23 2015-07-28 Microsoft Technology Licensing, Llc Generating synonyms based on query log data
US8606815B2 (en) * 2008-12-09 2013-12-10 International Business Machines Corporation Systems and methods for analyzing electronic text
KR101163010B1 (en) * 2008-12-15 2012-07-09 한국전자통신연구원 Apparatus for online advertisement selecting based on content affective and intend analysis and method thereof
US9521013B2 (en) 2008-12-31 2016-12-13 Facebook, Inc. Tracking significant topics of discourse in forums
US8462160B2 (en) * 2008-12-31 2013-06-11 Facebook, Inc. Displaying demographic information of members discussing topics in a forum
US20100169317A1 (en) * 2008-12-31 2010-07-01 Microsoft Corporation Product or Service Review Summarization Using Attributes
US20100217647A1 (en) * 2009-02-20 2010-08-26 Philip Clifford Jacobs Determining share of voice
US9213687B2 (en) * 2009-03-23 2015-12-15 Lawrence Au Compassion, variety and cohesion for methods of text analytics, writing, search, user interfaces
US20100293179A1 (en) * 2009-05-14 2010-11-18 Microsoft Corporation Identifying synonyms of entities using web search
US8504550B2 (en) * 2009-05-15 2013-08-06 Citizennet Inc. Social network message categorization systems and methods
CN101901230A (en) * 2009-05-31 2010-12-01 国际商业机器公司 Information retrieval method, user comment processing method and system thereof
US8533203B2 (en) * 2009-06-04 2013-09-10 Microsoft Corporation Identifying synonyms of entities using a document collection
TW201118589A (en) * 2009-06-09 2011-06-01 Ebh Entpr Inc Methods, apparatus and software for analyzing the content of micro-blog messages
US20110029926A1 (en) * 2009-07-30 2011-02-03 Hao Ming C Generating a visualization of reviews according to distance associations between attributes and opinion words in the reviews
US8533208B2 (en) 2009-09-28 2013-09-10 Ebay Inc. System and method for topic extraction and opinion mining
US8380697B2 (en) * 2009-10-21 2013-02-19 Citizennet Inc. Search and retrieval methods and systems of short messages utilizing messaging context and keyword frequency
US11023675B1 (en) 2009-11-03 2021-06-01 Alphasense OY User interface for use with a search engine for searching financial related documents
US8356025B2 (en) * 2009-12-09 2013-01-15 International Business Machines Corporation Systems and methods for detecting sentiment-based topics
US8554854B2 (en) 2009-12-11 2013-10-08 Citizennet Inc. Systems and methods for identifying terms relevant to web pages using social network messages
US8725717B2 (en) * 2009-12-23 2014-05-13 Palo Alto Research Center Incorporated System and method for identifying topics for short text communications
US9201863B2 (en) * 2009-12-24 2015-12-01 Woodwire, Inc. Sentiment analysis from social media content
CN102812475A (en) * 2009-12-24 2012-12-05 梅塔瓦纳股份有限公司 System And Method For Determining Sentiment Expressed In Documents
US9047283B1 (en) * 2010-01-29 2015-06-02 Guangsheng Zhang Automated topic discovery in documents and content categorization
US8983989B2 (en) * 2010-02-05 2015-03-17 Microsoft Technology Licensing, Llc Contextual queries
US8260664B2 (en) * 2010-02-05 2012-09-04 Microsoft Corporation Semantic advertising selection from lateral concepts and topics
US8903794B2 (en) * 2010-02-05 2014-12-02 Microsoft Corporation Generating and presenting lateral concepts
US8150859B2 (en) * 2010-02-05 2012-04-03 Microsoft Corporation Semantic table of contents for search results
CN102163187B (en) * 2010-02-21 2014-11-26 国际商业机器公司 Document marking method and device
US20110231395A1 (en) * 2010-03-19 2011-09-22 Microsoft Corporation Presenting answers
US8725494B2 (en) * 2010-03-31 2014-05-13 Attivio, Inc. Signal processing approach to sentiment analysis for entities in documents
JP5390463B2 (en) 2010-04-27 2014-01-15 インターナショナル・ビジネス・マシーンズ・コーポレーション Defect predicate expression extraction device, defect predicate expression extraction method, and defect predicate expression extraction program for extracting predicate expressions indicating defects
US9600566B2 (en) 2010-05-14 2017-03-21 Microsoft Technology Licensing, Llc Identifying entity synonyms
US8874727B2 (en) 2010-05-31 2014-10-28 The Nielsen Company (Us), Llc Methods, apparatus, and articles of manufacture to rank users in an online social network
US9251248B2 (en) * 2010-06-07 2016-02-02 Microsoft Licensing Technology, LLC Using context to extract entities from a document collection
US8230062B2 (en) 2010-06-21 2012-07-24 Salesforce.Com, Inc. Referred internet traffic analysis system and method
US9262517B2 (en) * 2010-08-18 2016-02-16 At&T Intellectual Property I, L.P. Systems and methods for social media data mining
US9311619B2 (en) * 2010-09-10 2016-04-12 Visible Technologies Llc Systems and methods for consumer-generated media reputation management
US8615434B2 (en) 2010-10-19 2013-12-24 Citizennet Inc. Systems and methods for automatically generating campaigns using advertising targeting information based upon affinity information obtained from an online social network
US8612293B2 (en) 2010-10-19 2013-12-17 Citizennet Inc. Generation of advertising targeting information based upon affinity information obtained from an online social network
US9015033B2 (en) 2010-10-26 2015-04-21 At&T Intellectual Property I, L.P. Method and apparatus for detecting a sentiment of short messages
US9264764B2 (en) 2011-07-06 2016-02-16 Manish Bhatia Media content based advertising survey platform methods
US10142687B2 (en) 2010-11-07 2018-11-27 Symphony Advanced Media, Inc. Audience content exposure monitoring apparatuses, methods and systems
US9292602B2 (en) 2010-12-14 2016-03-22 Microsoft Technology Licensing, Llc Interactive search results page
US8949211B2 (en) * 2011-01-31 2015-02-03 Hewlett-Packard Development Company, L.P. Objective-function based sentiment
US8856056B2 (en) 2011-03-22 2014-10-07 Isentium, Llc Sentiment calculus for a method and system using social media for event-driven trading
US9063927B2 (en) 2011-04-06 2015-06-23 Citizennet Inc. Short message age classification
CN102760264A (en) * 2011-04-29 2012-10-31 国际商业机器公司 Computer-implemented method and system for generating extracts of internet comments
JP2013003663A (en) * 2011-06-13 2013-01-07 Sony Corp Information processing apparatus, information processing method, and program
CN102831119B (en) * 2011-06-15 2016-08-17 日电(中国)有限公司 Short text clustering Apparatus and method for
US20130018892A1 (en) * 2011-07-12 2013-01-17 Castellanos Maria G Visually Representing How a Sentiment Score is Computed
US9002892B2 (en) 2011-08-07 2015-04-07 CitizenNet, Inc. Systems and methods for trend detection using frequency analysis
US8688499B1 (en) * 2011-08-11 2014-04-01 Google Inc. System and method for generating business process models from mapped time sequenced operational and transaction data
WO2013033385A1 (en) * 2011-08-30 2013-03-07 E-Rewards, Inc. System and method for generating a knowledge metric using qualitative internet data
US8798995B1 (en) * 2011-09-23 2014-08-05 Amazon Technologies, Inc. Key word determinations from voice data
US8738363B2 (en) * 2011-10-13 2014-05-27 Xerox Corporation System and method for suggestion mining
US9009024B2 (en) * 2011-10-24 2015-04-14 Hewlett-Packard Development Company, L.P. Performing sentiment analysis
US20130173254A1 (en) * 2011-12-31 2013-07-04 Farrokh Alemi Sentiment Analyzer
US9208139B2 (en) * 2012-01-05 2015-12-08 Educational Testing Service System and method for identifying organizational elements in argumentative or persuasive discourse
US9135350B2 (en) * 2012-01-05 2015-09-15 Sri International Computer-generated sentiment-based knowledge base
US8838435B2 (en) 2012-01-11 2014-09-16 Motorola Mobility Llc Communication processing
GB2502037A (en) 2012-02-10 2013-11-20 Qatar Foundation Topic analytics
US9477749B2 (en) 2012-03-02 2016-10-25 Clarabridge, Inc. Apparatus for identifying root cause using unstructured data
US8745019B2 (en) 2012-03-05 2014-06-03 Microsoft Corporation Robust discovery of entity synonyms using query logs
US8620718B2 (en) 2012-04-06 2013-12-31 Unmetric Inc. Industry specific brand benchmarking system based on social media strength of a brand
US9053497B2 (en) 2012-04-27 2015-06-09 CitizenNet, Inc. Systems and methods for targeting advertising to groups with strong ties within an online social network
US9418389B2 (en) 2012-05-07 2016-08-16 Nasdaq, Inc. Social intelligence architecture using social media message queues
US20140019264A1 (en) * 2012-05-07 2014-01-16 Ditto Labs, Inc. Framework for product promotion and advertising using social networking services
US10304036B2 (en) 2012-05-07 2019-05-28 Nasdaq, Inc. Social media profiling for one or more authors using one or more social media platforms
US9069798B2 (en) * 2012-05-24 2015-06-30 Mitsubishi Electric Research Laboratories, Inc. Method of text classification using discriminative topic transformation
US8515828B1 (en) * 2012-05-29 2013-08-20 Google Inc. Providing product recommendations through keyword extraction from negative reviews
US9009027B2 (en) * 2012-05-30 2015-04-14 Sas Institute Inc. Computer-implemented systems and methods for mood state determination
US8738628B2 (en) * 2012-05-31 2014-05-27 International Business Machines Corporation Community profiling for social media
US10032131B2 (en) 2012-06-20 2018-07-24 Microsoft Technology Licensing, Llc Data services for enterprises leveraging search system data assets
US9594831B2 (en) 2012-06-22 2017-03-14 Microsoft Technology Licensing, Llc Targeted disambiguation of named entities
US9678948B2 (en) * 2012-06-26 2017-06-13 International Business Machines Corporation Real-time message sentiment awareness
US9460473B2 (en) * 2012-06-26 2016-10-04 International Business Machines Corporation Content-sensitive notification icons
US20140013223A1 (en) * 2012-07-06 2014-01-09 Mohammad AAMIR System and method for contextual visualization of content
US9141600B2 (en) * 2012-07-12 2015-09-22 Insite Innovations And Properties B.V. Computer arrangement for and computer implemented method of detecting polarity in a message
US9229924B2 (en) 2012-08-24 2016-01-05 Microsoft Technology Licensing, Llc Word detection and domain dictionary recommendation
US9396179B2 (en) * 2012-08-30 2016-07-19 Xerox Corporation Methods and systems for acquiring user related information using natural language processing techniques
US20140067370A1 (en) * 2012-08-31 2014-03-06 Xerox Corporation Learning opinion-related patterns for contextual and domain-dependent opinion detection
US9558273B2 (en) * 2012-09-21 2017-01-31 Appinions Inc. System and method for generating influencer scores
US9715493B2 (en) * 2012-09-28 2017-07-25 Semeon Analytics Inc. Method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model
US10402745B2 (en) * 2012-09-28 2019-09-03 Semeon Analytics Inc. Method and system for analysing sentiments
US9817810B2 (en) * 2012-11-07 2017-11-14 International Business Machines Corporation SVO-based taxonomy-driven text analytics
US9134215B1 (en) 2012-11-09 2015-09-15 Jive Software, Inc. Sentiment analysis of content items
US9336192B1 (en) 2012-11-28 2016-05-10 Lexalytics, Inc. Methods for analyzing text
US9278255B2 (en) 2012-12-09 2016-03-08 Arris Enterprises, Inc. System and method for activity recognition
US10212986B2 (en) 2012-12-09 2019-02-26 Arris Enterprises Llc System, apparel, and method for identifying performance of workout routines
US9690775B2 (en) 2012-12-27 2017-06-27 International Business Machines Corporation Real-time sentiment analysis for synchronous communication
US9460083B2 (en) 2012-12-27 2016-10-04 International Business Machines Corporation Interactive dashboard based on real-time sentiment analysis for synchronous communication
US9477704B1 (en) * 2012-12-31 2016-10-25 Teradata Us, Inc. Sentiment expression analysis based on keyword hierarchy
US11928606B2 (en) 2013-03-15 2024-03-12 TSG Technologies, LLC Systems and methods for classifying electronic documents
US9298814B2 (en) 2013-03-15 2016-03-29 Maritz Holdings Inc. Systems and methods for classifying electronic documents
US20140317118A1 (en) * 2013-04-18 2014-10-23 International Business Machines Corporation Context aware dynamic sentiment analysis
US10515153B2 (en) * 2013-05-16 2019-12-24 Educational Testing Service Systems and methods for automatically assessing constructed recommendations based on sentiment and specificity measures
US10331786B2 (en) 2013-08-19 2019-06-25 Google Llc Device compatibility management
US10521807B2 (en) 2013-09-05 2019-12-31 TSG Technologies, LLC Methods and systems for determining a risk of an emotional response of an audience
US10706367B2 (en) * 2013-09-10 2020-07-07 Facebook, Inc. Sentiment polarity for users of a social networking system
US10453079B2 (en) * 2013-11-20 2019-10-22 At&T Intellectual Property I, L.P. Method, computer-readable storage device, and apparatus for analyzing text messages
US9542455B2 (en) * 2013-12-11 2017-01-10 Avaya Inc. Anti-trending
US9373075B2 (en) 2013-12-12 2016-06-21 International Business Machines Corporation Applying a genetic algorithm to compositional semantics sentiment analysis to improve performance and accelerate domain adaptation
US10325274B2 (en) * 2014-01-31 2019-06-18 Walmart Apollo, Llc Trend data counter
US10332127B2 (en) 2014-01-31 2019-06-25 Walmart Apollo, Llc Trend data aggregation
US9711058B2 (en) * 2014-03-06 2017-07-18 International Business Machines Corporation Providing targeted feedback
US9317816B2 (en) 2014-05-27 2016-04-19 InsideSales.com, Inc. Email optimization for predicted recipient behavior: suggesting changes that are more likely to cause a target behavior to occur
US9088533B1 (en) 2014-05-27 2015-07-21 Insidesales.com Email optimization for predicted recipient behavior: suggesting a time at which a user should send an email
US9092742B1 (en) * 2014-05-27 2015-07-28 Insidesales.com Email optimization for predicted recipient behavior: suggesting changes in an email to increase the likelihood of an outcome
US11250450B1 (en) 2014-06-27 2022-02-15 Groupon, Inc. Method and system for programmatic generation of survey queries
US9317566B1 (en) 2014-06-27 2016-04-19 Groupon, Inc. Method and system for programmatic analysis of consumer reviews
KR20170030570A (en) * 2014-07-07 2017-03-17 머신 존, 인크. System and method for identifying and suggesting emoticons
US10878017B1 (en) 2014-07-29 2020-12-29 Groupon, Inc. System and method for programmatic generation of attribute descriptors
US11263401B2 (en) 2014-07-31 2022-03-01 Oracle International Corporation Method and system for securely storing private data in a semantic analysis system
US10977667B1 (en) 2014-10-22 2021-04-13 Groupon, Inc. Method and system for programmatic analysis of consumer sentiment with regard to attribute descriptors
US20160117737A1 (en) * 2014-10-28 2016-04-28 Adobe Systems Incorporated Preference Mapping for Automated Attribute-Selection in Campaign Design
US9690772B2 (en) 2014-12-15 2017-06-27 Xerox Corporation Category and term polarity mutual annotation for aspect-based sentiment analysis
CN104572616B (en) * 2014-12-23 2018-04-24 北京锐安科技有限公司 Method and apparatus for determining text orientation
US10078843B2 (en) * 2015-01-05 2018-09-18 Saama Technologies, Inc. Systems and methods for analyzing consumer sentiment with social perspective insight
CN104915443B (en) * 2015-06-29 2018-11-23 北京信息科技大学 Method for extracting evaluation objects from Chinese microblogs
US20170169008A1 (en) * 2015-12-15 2017-06-15 Le Holdings (Beijing) Co., Ltd. Method and electronic device for sentiment classification
CN105893444A (en) * 2015-12-15 2016-08-24 乐视网信息技术(北京)股份有限公司 Sentiment classification method and apparatus
US9904669B2 (en) * 2016-01-13 2018-02-27 International Business Machines Corporation Adaptive learning of actionable statements in natural language conversation
US10755195B2 (en) 2016-01-13 2020-08-25 International Business Machines Corporation Adaptive, personalized action-aware communication and conversation prioritization
US10324968B2 (en) * 2016-01-29 2019-06-18 International Business Machines Corporation Topic generation for a publication
US10489509B2 (en) 2016-03-14 2019-11-26 International Business Machines Corporation Personality based sentiment analysis of textual information written in natural language
US20170293620A1 (en) * 2016-04-06 2017-10-12 International Business Machines Corporation Natural language processing based on textual polarity
US20170293621A1 (en) * 2016-04-06 2017-10-12 International Business Machines Corporation Natural language processing based on textual polarity
US10706044B2 (en) 2016-04-06 2020-07-07 International Business Machines Corporation Natural language processing based on textual polarity
US9875230B2 (en) 2016-04-08 2018-01-23 International Business Machines Corporation Text analysis on unstructured text to identify a high level of intensity of negative thoughts or beliefs
US10275444B2 (en) 2016-07-15 2019-04-30 At&T Intellectual Property I, L.P. Data analytics system and methods for text data
US10942971B2 (en) * 2016-10-14 2021-03-09 NewsRx, LLC Inserting elements into artificial intelligence content
US10552468B2 (en) * 2016-11-01 2020-02-04 Quid, Inc. Topic predictions based on natural language processing of large corpora
US10614164B2 (en) * 2017-02-27 2020-04-07 International Business Machines Corporation Message sentiment based alert
US10628528B2 (en) 2017-06-29 2020-04-21 Robert Bosch Gmbh System and method for domain-independent aspect level sentiment detection
CN108710611B (en) * 2018-05-17 2021-08-03 南京大学 Short text topic model generation method based on word network and word vector
US11062094B2 (en) 2018-06-28 2021-07-13 Language Logic, Llc Systems and methods for automatically detecting sentiments and assigning and analyzing quantitate values to the sentiments expressed in text
US11593433B2 (en) 2018-08-07 2023-02-28 Marlabs Incorporated System and method to analyse and predict impact of textual data
US11651016B2 (en) 2018-08-09 2023-05-16 Walmart Apollo, Llc System and method for electronic text classification
US10482116B1 (en) * 2018-12-05 2019-11-19 Trasers, Inc. Methods and systems for interactive research report viewing
US10831998B2 (en) * 2018-12-19 2020-11-10 International Business Machines Corporation Visualizing sentiment on an input device
US11107092B2 (en) * 2019-01-18 2021-08-31 Sprinklr, Inc. Content insight system
US11132511B2 (en) 2019-02-05 2021-09-28 International Business Machines Corporation System for fine-grained affective states understanding and prediction
US11715134B2 (en) 2019-06-04 2023-08-01 Sprinklr, Inc. Content compliance system
US11144730B2 (en) 2019-08-08 2021-10-12 Sprinklr, Inc. Modeling end to end dialogues using intent oriented decoding
US11256874B2 (en) * 2020-06-16 2022-02-22 Optum Technology, Inc. Sentiment progression analysis
US11664010B2 (en) 2020-11-03 2023-05-30 Florida Power & Light Company Natural language domain corpus data set creation based on enhanced root utterances
US20220277403A1 (en) * 2021-02-26 2022-09-01 Deliberati LLC Informed consensus determination among multiple divergent user opinions

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7523085B2 (en) * 2004-09-30 2009-04-21 Buzzmetrics, Ltd An Israel Corporation Topical sentiments in electronically stored communications

Family Cites Families (152)

Publication number Priority date Publication date Assignee Title
US3950618A (en) * 1971-03-25 1976-04-13 Bloisi Albertoni De Lemos System for public opinion research
US4930077A (en) * 1987-04-06 1990-05-29 Fan David P Information processing expert system for text analysis and predicting public opinion based information available to the public
US5124911A (en) * 1988-04-15 1992-06-23 Image Engineering, Inc. Method of evaluating consumer choice through concept testing for the marketing and development of consumer products
US5041972A (en) * 1988-04-15 1991-08-20 Frost W Alan Method of measuring and evaluating consumer response for the development of consumer products
US5301109A (en) 1990-06-11 1994-04-05 Bell Communications Research, Inc. Computerized cross-language document retrieval using latent semantic indexing
US5077785A (en) * 1990-07-13 1991-12-31 Monson Gerald D System for recording comments by patrons of establishments
US5321833A (en) 1990-08-29 1994-06-14 Gte Laboratories Incorporated Adaptive ranking system for information retrieval
US5317507A (en) 1990-11-07 1994-05-31 Gallant Stephen I Method for document retrieval and for word sense disambiguation using neural networks
JPH0756933A (en) * 1993-06-24 1995-03-03 Xerox Corp Method for retrieval of document
US5519608A (en) * 1993-06-24 1996-05-21 Xerox Corporation Method for extracting from a text corpus answers to questions stated in natural language by using linguistic analysis and hypothesis generation
AU1554795A (en) * 1993-12-23 1995-07-10 Diacom Technologies, Inc. Method and apparatus for implementing user feedback
US5671333A (en) * 1994-04-07 1997-09-23 Lucent Technologies Inc. Training apparatus and method
CA2128122A1 (en) * 1994-07-15 1996-01-16 Ernest M. Thiessen Computer-based method and apparatus for interactive computer-assisted negotiations
US6029195A (en) * 1994-11-29 2000-02-22 Herz; Frederick S. M. System for customized electronic identification of desirable objects
US5668953A (en) * 1995-02-22 1997-09-16 Sloo; Marshall Allan Method and apparatus for handling a complaint
US5895450A (en) * 1995-02-22 1999-04-20 Sloo; Marshall A. Method and apparatus for handling complaints
US5659732A (en) * 1995-05-17 1997-08-19 Infoseek Corporation Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents
US5675710A (en) * 1995-06-07 1997-10-07 Lucent Technologies, Inc. Method and apparatus for training a text classifier
WO1997008604A2 (en) 1995-08-16 1997-03-06 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US6026388A (en) 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
US5659742A (en) * 1995-09-15 1997-08-19 Infonautics Corporation Method for storing multi-media information in an information retrieval system
US5819285A (en) * 1995-09-20 1998-10-06 Infonautics Corporation Apparatus for capturing, storing and processing co-marketing information associated with a user of an on-line computer service using the world-wide-web.
IT1285619B1 (en) * 1996-03-15 1998-06-18 Co Me Sca Costruzioni Meccanic BENDING METHOD OF A DIELECTRIC SHEET IN THE SHEET
US6314420B1 (en) * 1996-04-04 2001-11-06 Lycos, Inc. Collaborative/adaptive search engine
US5867799A (en) 1996-04-04 1999-02-02 Lang; Andrew K. Information system and method for filtering a massive flow of information entities to meet user information classification needs
US5950172A (en) * 1996-06-07 1999-09-07 Klingman; Edwin E. Secured electronic rating system
US6026387A (en) * 1996-07-15 2000-02-15 Kesel; Brad Consumer comment reporting apparatus and method
US5822744A (en) * 1996-07-15 1998-10-13 Kesel; Brad Consumer comment reporting apparatus and method
US6038610A (en) * 1996-07-17 2000-03-14 Microsoft Corporation Storage of sitemaps at server sites for holding information regarding content
US5864863A (en) 1996-08-09 1999-01-26 Digital Equipment Corporation Method for parsing, indexing and searching world-wide-web pages
US5920854A (en) 1996-08-14 1999-07-06 Infoseek Corporation Real-time document collection search engine with phrase indexing
US5857179A (en) 1996-09-09 1999-01-05 Digital Equipment Corporation Computer method and apparatus for clustering documents and automatic generation of cluster keywords
US5911043A (en) * 1996-10-01 1999-06-08 Baker & Botts, L.L.P. System and method for computer-based rating of information retrieved from a computer network
US5924094A (en) * 1996-11-01 1999-07-13 Current Network Technologies Corporation Independent distributed database system
US6571227B1 (en) * 1996-11-04 2003-05-27 3-Dimensional Pharmaceuticals, Inc. Method, system and computer program product for non-linear mapping of multi-dimensional data
US5836771A (en) * 1996-12-02 1998-11-17 Ho; Chi Fai Learning method and system based on questioning
US5778363A (en) * 1996-12-30 1998-07-07 Intel Corporation Method for measuring thresholded relevance of a document to a specified topic
US5950189A (en) 1997-01-02 1999-09-07 At&T Corp Retrieval system and method
US7437351B2 (en) * 1997-01-10 2008-10-14 Google Inc. Method for searching media
US6138128A (en) 1997-04-02 2000-10-24 Microsoft Corp. Sharing and organizing world wide web references using distinctive characters
US6362837B1 (en) * 1997-05-06 2002-03-26 Michael Ginn Method and apparatus for simultaneously indicating rating value for the first document and display of second document in response to the selection
US6098066A (en) 1997-06-13 2000-08-01 Sun Microsystems, Inc. Method and apparatus for searching for documents stored within a document directory hierarchy
US6012053A (en) 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
US6233575B1 (en) * 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
US6119933A (en) * 1997-07-17 2000-09-19 Wong; Earl Chang Method and apparatus for customer loyalty and marketing analysis
US6278990B1 (en) 1997-07-25 2001-08-21 Claritech Corporation Sort system for text retrieval
US5845278A (en) 1997-09-12 1998-12-01 Infoseek Corporation Method for automatically selecting collections to search in full text searches
US5983216A (en) 1997-09-12 1999-11-09 Infoseek Corporation Performing automated document collection and selection by providing a meta-index with meta-index values indentifying corresponding document collections
US7214298B2 (en) * 1997-09-23 2007-05-08 California Institute Of Technology Microfabricated cell sorter
US5974412A (en) * 1997-09-24 1999-10-26 Sapient Health Network Intelligent query system for automatically indexing information in a database and automatically categorizing users
US6266664B1 (en) * 1997-10-01 2001-07-24 Rulespace, Inc. Method for scanning, analyzing and rating digital information content
US6094657A (en) 1997-10-01 2000-07-25 International Business Machines Corporation Apparatus and method for dynamic meta-tagging of compound documents
US5953718A (en) * 1997-11-12 1999-09-14 Oracle Corporation Research mode for a knowledge base search and retrieval system
US6236991B1 (en) 1997-11-26 2001-05-22 International Business Machines Corp. Method and system for providing access for categorized information from online internet and intranet sources
US6269362B1 (en) 1997-12-19 2001-07-31 Alta Vista Company System and method for monitoring web pages by comparing generated abstracts
US6289342B1 (en) 1998-01-05 2001-09-11 Nec Research Institute, Inc. Autonomous citation indexing and literature browsing using citation context
US6256620B1 (en) 1998-01-16 2001-07-03 Aspect Communications Method and apparatus for monitoring information access
JP3692764B2 (en) 1998-02-25 2005-09-07 株式会社日立製作所 Structured document registration method, search method, and portable medium used therefor
US6067539A (en) * 1998-03-02 2000-05-23 Vigil, Inc. Intelligent information retrieval system
US6185558B1 (en) 1998-03-03 2001-02-06 Amazon.Com, Inc. Identifying the items most relevant to a current query based on items selected in connection with similar queries
US6421675B1 (en) 1998-03-16 2002-07-16 S. L. I. Systems, Inc. Search engine
US6064980A (en) * 1998-03-17 2000-05-16 Amazon.Com, Inc. System and methods for collaborative recommendations
US6236987B1 (en) 1998-04-03 2001-05-22 Damon Horowitz Dynamic content organization in information retrieval systems
US6078892A (en) * 1998-04-09 2000-06-20 International Business Machines Corporation Method for customer lead selection and optimization
US6112203A (en) 1998-04-09 2000-08-29 Altavista Company Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis
US6032145A (en) 1998-04-10 2000-02-29 Requisite Technology, Inc. Method and system for database manipulation
US7051277B2 (en) * 1998-04-17 2006-05-23 International Business Machines Corporation Automated assistant for organizing electronic documents
GB2336698A (en) 1998-04-24 1999-10-27 Dialog Corp Plc The Automatic content categorisation of text data files using subdivision to reduce false classification
US6006225A (en) 1998-06-15 1999-12-21 Amazon.Com Refining search queries by the suggestion of correlated terms from prior searches
US6192360B1 (en) 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
US6401118B1 (en) 1998-06-30 2002-06-04 Online Monitoring Services Method and computer program product for an online monitoring search engine
US6202068B1 (en) 1998-07-02 2001-03-13 Thomas A. Kraay Database display and search method
US6035294A (en) * 1998-08-03 2000-03-07 Big Fat Fish, Inc. Wide access databases and database systems
WO2000008573A1 (en) * 1998-08-04 2000-02-17 Rulespace, Inc. Method and system for deriving computer users' personal interests
US6138113A (en) 1998-08-10 2000-10-24 Altavista Company Method for identifying near duplicate pages in a hyperlinked database
US6654813B1 (en) 1998-08-17 2003-11-25 Alta Vista Company Dynamically categorizing entity information
US6393460B1 (en) * 1998-08-28 2002-05-21 International Business Machines Corporation Method and system for informing users of subjects of discussion in on-line chats
US6334131B2 (en) 1998-08-29 2001-12-25 International Business Machines Corporation Method for cataloging, filtering, and relevance ranking frame-based hierarchical information structures
US6513032B1 (en) 1998-10-29 2003-01-28 Alta Vista Company Search and navigation system and method using category intersection pre-computation
US6360215B1 (en) 1998-11-03 2002-03-19 Inktomi Corporation Method and apparatus for retrieving documents based on information other than document content
US6751606B1 (en) * 1998-12-23 2004-06-15 Microsoft Corporation System for enhancing a query interface
US6236977B1 (en) * 1999-01-04 2001-05-22 Realty One, Inc. Computer implemented marketing system
US20020059258A1 (en) * 1999-01-21 2002-05-16 James F. Kirkpatrick Method and apparatus for retrieving and displaying consumer interest data from the internet
US6385586B1 (en) * 1999-01-28 2002-05-07 International Business Machines Corporation Speech recognition text-based language conversion and text-to-speech in a client-server configuration to enable language translation devices
US6418433B1 (en) 1999-01-28 2002-07-09 International Business Machines Corporation System and method for focussed web crawling
US6411936B1 (en) * 1999-02-05 2002-06-25 Nval Solutions, Inc. Enterprise value enhancement system and method
US6304864B1 (en) * 1999-04-20 2001-10-16 Textwise Llc System for retrieving multimedia information from the internet using multiple evolving intelligent agents
US6571234B1 (en) * 1999-05-11 2003-05-27 Prophet Financial Systems, Inc. System and method for managing online message board
US6493703B1 (en) * 1999-05-11 2002-12-10 Prophet Financial Systems System and method for implementing intelligent online community message board
US6536037B1 (en) * 1999-05-27 2003-03-18 Accenture Llp Identification of redundancies and omissions among components of a web based architecture
US7165041B1 (en) * 1999-05-27 2007-01-16 Accenture, Llp Web-based architecture sales tool
US6957186B1 (en) * 1999-05-27 2005-10-18 Accenture Llp System method and article of manufacture for building, managing, and supporting various components of a system
US6473794B1 (en) * 1999-05-27 2002-10-29 Accenture Llp System for establishing plan to test components of web based framework by displaying pictorial representation and conveying indicia coded components of existing network framework
US6519571B1 (en) * 1999-05-27 2003-02-11 Accenture Llp Dynamic customer profile management
US6721713B1 (en) * 1999-05-27 2004-04-13 Andersen Consulting Llp Business alliance identification in a web architecture framework
US7315826B1 (en) * 1999-05-27 2008-01-01 Accenture, Llp Comparatively analyzing vendors of components required for a web-based architecture
US6615166B1 (en) * 1999-05-27 2003-09-02 Accenture Llp Prioritizing components of a network framework required for implementation of technology
US6546390B1 (en) * 1999-06-11 2003-04-08 Abuzz Technologies, Inc. Method and apparatus for evaluating relevancy of messages to users
US6571238B1 (en) * 1999-06-11 2003-05-27 Abuzz Technologies, Inc. System for regulating flow of information to user by using time dependent function to adjust relevancy threshold
KR20010004404A (en) 1999-06-28 2001-01-15 정선종 Keyfact-based text retrieval system, keyfact-based text index method, and retrieval method using this system
US6507866B1 (en) * 1999-07-19 2003-01-14 At&T Wireless Services, Inc. E-mail usage pattern detection
US6341306B1 (en) * 1999-08-13 2002-01-22 Atomica Corporation Web-based information retrieval responsive to displayed word identified by a text-grabbing algorithm
EP1221118A2 (en) * 1999-08-31 2002-07-10 Comsort, Inc. System for influence network marketing
US6260041B1 (en) * 1999-09-30 2001-07-10 Netcurrents, Inc. Apparatus and method of implementing fast internet real-time search technology (first)
JP4279427B2 (en) * 1999-11-22 2009-06-17 富士通株式会社 Communication support method and system
US6434549B1 (en) * 1999-12-13 2002-08-13 Ultris, Inc. Network-based, human-mediated exchange of information
US6772141B1 (en) * 1999-12-14 2004-08-03 Novell, Inc. Method and apparatus for organizing and using indexes utilizing a search decision table
US6651086B1 (en) * 2000-02-22 2003-11-18 Yahoo! Inc. Systems and methods for matching participants to a conversation
US6606644B1 (en) 2000-02-24 2003-08-12 International Business Machines Corporation System and technique for dynamic information gathering and targeted advertising in a web based model using a live information selection and analysis tool
US6757646B2 (en) * 2000-03-22 2004-06-29 Insightful Corporation Extended functionality for an inverse inference engine based web search
WO2001071624A1 (en) * 2000-03-22 2001-09-27 3-Dimensional Pharmaceuticals, Inc. System, method, and computer program product for representing object relationships in a multidimensional space
US6658389B1 (en) * 2000-03-24 2003-12-02 Ahmet Alpdemir System, method, and business model for speech-interactive information system having business self-promotion, audio coupon and rating features
WO2001075790A2 (en) * 2000-04-03 2001-10-11 3-Dimensional Pharmaceuticals, Inc. Method, system, and computer program product for representing object relationships in a multidimensional space
US6721734B1 (en) * 2000-04-18 2004-04-13 Claritech Corporation Method and apparatus for information management using fuzzy typing
US6983320B1 (en) * 2000-05-23 2006-01-03 Cyveillance, Inc. System, method and computer program product for analyzing e-commerce competition of an entity by utilizing predetermined entity-specific metrics and analyzed statistics from web pages
AU2001243610A1 (en) 2000-05-25 2001-12-11 Manyworlds Consulting, Inc. Fuzzy content network management and access
US6782393B1 (en) * 2000-05-31 2004-08-24 Ricoh Co., Ltd. Method and system for electronic message composition with relevant documents
US6640218B1 (en) 2000-06-02 2003-10-28 Lycos, Inc. Estimating the usefulness of an item in a collection of information
US7351376B1 (en) * 2000-06-05 2008-04-01 California Institute Of Technology Integrated active flux microfluidic devices and methods
US7136854B2 (en) 2000-07-06 2006-11-14 Google, Inc. Methods and apparatus for providing search results in response to an ambiguous search query
US6807566B1 (en) * 2000-08-16 2004-10-19 International Business Machines Corporation Method, article of manufacture and apparatus for processing an electronic message on an electronic message board
NO313399B1 (en) * 2000-09-14 2002-09-23 Fast Search & Transfer Asa Procedure for searching and analyzing information in computer networks
US6999914B1 (en) * 2000-09-28 2006-02-14 Manning And Napier Information Services Llc Device and method of determining emotive index corresponding to a message
US6751683B1 (en) * 2000-09-29 2004-06-15 International Business Machines Corporation Method, system and program products for projecting the impact of configuration changes on controllers
US7197470B1 (en) * 2000-10-11 2007-03-27 Buzzmetrics, Ltd. System and method for collection analysis of electronic discussion methods
US6560600B1 (en) * 2000-10-25 2003-05-06 Alta Vista Company Method and apparatus for ranking Web page search results
GB2368670A (en) * 2000-11-03 2002-05-08 Envisional Software Solutions Data acquisition system
US6622140B1 (en) * 2000-11-15 2003-09-16 Justsystem Corporation Method and apparatus for analyzing affect and emotion in text
US6526440B1 (en) 2001-01-30 2003-02-25 Google, Inc. Ranking search results by reranking the results based on local inter-connectivity
US6584470B2 (en) 2001-03-01 2003-06-24 Intelliseek, Inc. Multi-layered semiotic mechanism for answering natural language questions using document retrieval combined with information extraction
US8001118B2 (en) * 2001-03-02 2011-08-16 Google Inc. Methods and apparatus for employing usage statistics in document retrieval
US6778975B1 (en) * 2001-03-05 2004-08-17 Overture Services, Inc. Search engine for selecting targeted messages
US20020159642A1 (en) * 2001-03-14 2002-10-31 Whitney Paul D. Feature selection and feature set construction
US7055273B2 (en) * 2001-10-12 2006-06-06 Attitude Measurement Corporation Removable label and incentive item to facilitate collecting consumer data
US20040205482A1 (en) * 2002-01-24 2004-10-14 International Business Machines Corporation Method and apparatus for active annotation of multimedia content
US6751611B2 (en) * 2002-03-01 2004-06-15 Paul Jeffrey Krupin Method and system for creating improved search queries
JP3726263B2 (en) * 2002-03-01 2005-12-14 ヒューレット・パッカード・カンパニー Document classification method and apparatus
JP2003281446A (en) 2002-03-13 2003-10-03 Culture Com Technology (Macau) Ltd Media management method and system
US7716161B2 (en) * 2002-09-24 2010-05-11 Google, Inc. Methods and apparatus for serving relevant advertisements
US7599911B2 (en) * 2002-08-05 2009-10-06 Yahoo! Inc. Method and apparatus for search ranking using human input and automated ranking
US6928526B1 (en) * 2002-12-20 2005-08-09 Datadomain, Inc. Efficient data storage system
US7292723B2 (en) * 2003-02-26 2007-11-06 Walker Digital, Llc System for image analysis in a network that is structured with multiple layers and differentially weighted neurons
US7051023B2 (en) 2003-04-04 2006-05-23 Yahoo! Inc. Systems and methods for generating concept units from search queries
US7130777B2 (en) * 2003-11-26 2006-10-31 International Business Machines Corporation Method to hierarchical pooling of opinions from multiple sources
US7865354B2 (en) * 2003-12-05 2011-01-04 International Business Machines Corporation Extracting and grouping opinions from text documents
US20060041605A1 (en) * 2004-04-01 2006-02-23 King Martin T Determining actions involving captured information and electronic content associated with rendered documents
US7596571B2 (en) * 2004-06-30 2009-09-29 Technorati, Inc. Ecosystem method of aggregation and search and related techniques
US7433866B2 (en) * 2005-01-11 2008-10-07 International Business Machines Corporation Systems, methods, and media for awarding credits based on provided usage information
US7624102B2 (en) * 2005-01-28 2009-11-24 Microsoft Corporation System and method for grouping by attribute
US7680855B2 (en) 2005-03-11 2010-03-16 Yahoo! Inc. System and method for managing listings
US20070027840A1 (en) * 2005-07-27 2007-02-01 Jobserve Limited Searching method and system
US7464003B2 (en) * 2006-08-24 2008-12-09 Skygrid, Inc. System and method for change detection of information or type of data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7523085B2 (en) * 2004-09-30 2009-04-21 Buzzmetrics, Ltd An Israel Corporation Topical sentiments in electronically stored communications
US7877345B2 (en) * 2004-09-30 2011-01-25 Buzzmetrics, Ltd. Topical sentiments in electronically stored communications

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100079464A1 (en) * 2008-09-26 2010-04-01 Nec Biglobe, Ltd. Information processing apparatus capable of easily generating graph for comparing of a plurality of commercial products
US8325189B2 (en) * 2008-09-26 2012-12-04 Nec Biglobe, Ltd. Information processing apparatus capable of easily generating graph for comparing of a plurality of commercial products
US11379512B2 (en) 2008-11-10 2022-07-05 Google Llc Sentiment-based classification of media content
US10956482B2 (en) * 2008-11-10 2021-03-23 Google Llc Sentiment-based classification of media content
US20110295591A1 (en) * 2010-05-28 2011-12-01 Palo Alto Research Center Incorporated System and method to acquire paraphrases
US9672204B2 (en) * 2010-05-28 2017-06-06 Palo Alto Research Center Incorporated System and method to acquire paraphrases
US8352405B2 (en) * 2011-04-21 2013-01-08 Palo Alto Research Center Incorporated Incorporating lexicon knowledge into SVM learning to improve sentiment classification
WO2013019791A1 (en) * 2011-08-02 2013-02-07 Anderson Tom H C Natural language test analytics
US20130103385A1 (en) * 2011-10-24 2013-04-25 Riddhiman Ghosh Performing sentiment analysis
US9275041B2 (en) * 2011-10-24 2016-03-01 Hewlett Packard Enterprise Development Lp Performing sentiment analysis on microblogging data, including identifying a new opinion term therein
US9535899B2 (en) 2013-02-20 2017-01-03 International Business Machines Corporation Automatic semantic rating and abstraction of literature
WO2015137998A1 (en) * 2013-03-10 2015-09-17 Squerb, Inc. System for graphically displaying user-provided information
US9311294B2 (en) * 2013-03-15 2016-04-12 International Business Machines Corporation Enhanced answers in DeepQA system according to user preferences
US20140278363A1 (en) * 2013-03-15 2014-09-18 International Business Machines Corporation Enhanced Answers in DeepQA System According to User Preferences
US9244911B2 (en) * 2013-03-15 2016-01-26 International Business Machines Corporation Enhanced answers in DeepQA system according to user preferences
US20150006158A1 (en) * 2013-03-15 2015-01-01 International Business Machines Corporation Enhanced Answers in DeepQA System According to User Preferences
US20150058080A1 (en) * 2013-08-23 2015-02-26 International Business Machines Corporation Contract erosion and renewal prediction through sentiment analysis
CN107077486A (en) * 2014-09-02 2017-08-18 菲特尔销售工具有限公司 Affective Evaluation system and method
WO2016035072A3 (en) * 2014-09-02 2016-04-21 Feelter Sales Tools Ltd Sentiment rating system and method
CN107077470A (en) * 2014-10-31 2017-08-18 隆沙有限公司 The semantic classification of focusing
WO2016066228A1 (en) * 2014-10-31 2016-05-06 Longsand Limited Focused sentiment classification
US11100557B2 (en) 2014-11-04 2021-08-24 International Business Machines Corporation Travel itinerary recommendation engine using inferred interests and sentiments
US20170039183A1 (en) * 2015-08-07 2017-02-09 Nec Laboratories America, Inc. Metric Labeling for Natural Language Processing
US10198432B2 (en) 2016-07-28 2019-02-05 Abbyy Production Llc Aspect-based sentiment analysis and report generation using machine learning methods
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US20190297035A1 (en) * 2018-03-26 2019-09-26 International Business Machines Corporation Chat thread correction
US11354507B2 (en) * 2018-09-13 2022-06-07 International Business Machines Corporation Compared sentiment queues

Also Published As

Publication number Publication date
US20060069589A1 (en) 2006-03-30
WO2006039566A3 (en) 2007-07-05
US7523085B2 (en) 2009-04-21
US8041669B2 (en) 2011-10-18
US7877345B2 (en) 2011-01-25
WO2006039566A2 (en) 2006-04-13
US20090164417A1 (en) 2009-06-25

Similar Documents

Publication Publication Date Title
US8041669B2 (en) Topical sentiments in electronically stored communications
Schouten et al. Survey on aspect-level sentiment analysis
Messaoudi et al. Opinion mining in online social media: a survey
Li et al. Improving aspect extraction by augmenting a frequency-based method with web-based similarity measures
Bollegala et al. Cross-domain sentiment classification using a sentiment sensitive thesaurus
Yi et al. Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques
Chen et al. Identifying intention posts in discussion forums
US8676730B2 (en) Sentiment classifiers based on feature extraction
US8200477B2 (en) Method and system for extracting opinions from text documents
Hurst et al. Retrieving topical sentiments from online document collections
Ghag et al. Comparative analysis of the techniques for sentiment analysis
Wang et al. Adapting topic map and social influence to the personalized hybrid recommender system
Wang et al. Customer-driven product design selection using web based user-generated content
Nigam et al. Towards a robust metric of polarity
Humphreys Automated text analysis
Hailu Opinion mining from Amharic blog
Moghaddam et al. Opinion mining in online reviews: Recent trends
Sahu et al. An Emotion based Sentiment Analysis on Twitter Dataset
Tonkin A day at work (with text): A brief introduction
Dragoni Extracting Linguistic Features From Opinion Data Streams For Multi-Domain Sentiment Analysis.
Ferilli et al. Sentiment analysis as a text categorization task: a study on feature and algorithm selection for Italian language
Ibitoye et al. Improved customer churn prediction model using word order contextualized semantics on customers’ social opinion
Asgarian et al. Designing an integrated semantic framework for structured opinion summarization
Vo et al. An efficient hybrid model for vietnamese sentiment analysis
Shrestha Enabling deeper customer-centricity-creating a search engine for project proposal documents at a tech consultancy

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

AS Assignment

Owner name: BUZZMETRICS, LTD., ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NIGAM, KAMAL P.;HURST, MATTHEW F.;REEL/FRAME:026258/0421

Effective date: 20051205

ZAAA Notice of allowance and fees due

Free format text: ORIGINAL CODE: NOA

ZAAB Notice of allowance mailed

Free format text: ORIGINAL CODE: MN/=.

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: CITIBANK, N.A., AS COLLATERAL AGENT FOR THE FIRST LIEN SECURED PARTIES, DELAWARE

Free format text: SUPPLEMENTAL IP SECURITY AGREEMENT;ASSIGNOR:THE NIELSEN COMPANY (US), LLC;REEL/FRAME:037172/0415

Effective date: 20151023

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8

AS Assignment

Owner name: THE NIELSEN COMPANY (US), LLC, NEW YORK

Free format text: RELEASE (REEL 037172 / FRAME 0415);ASSIGNOR:CITIBANK, N.A.;REEL/FRAME:061750/0221

Effective date: 20221011

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20231018