CN107368506B - Unstructured data analysis system and method - Google Patents

Unstructured data analysis system and method

Info

Publication number
CN107368506B
Authority
CN
China
Prior art keywords
data
unstructured data
topic
users
unstructured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610496280.9A
Other languages
Chinese (zh)
Other versions
CN107368506A (en
Inventor
汪晓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Stratifyd Inc
Original Assignee
Stratifyd Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/151,572 (US10452698B2)
Application filed by Stratifyd Inc
Priority to CN202011265115.5A (CN112732878A)
Publication of CN107368506A
Application granted
Publication of CN107368506B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An unstructured data analysis system comprising: an unstructured data analysis algorithm residing on a server and accessible via a browser, the unstructured data analysis algorithm operable to receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in a presentation layer, an exploration layer, and an annotation layer. The unstructured data analysis algorithm is also operable to receive external data from one or more remote sources. The presentation layer displays one or more of: unstructured data, summaries of unstructured data, and summary information. The exploration layer allows one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. One or more users may simultaneously interact with the unstructured data analysis system via the annotation layer.

Description

Unstructured data analysis system and method
Cross Reference to Related Applications
This patent application/patent claims the priority of co-pending U.S. Provisional Patent Application No. 62/159,662, entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS for entering a vision interaction INTERFACE", filed on May 11, 2015, and U.S. Provisional Patent Application No. 62/159,683, entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS for entering a membrane process for entering a vision interaction FUNCTIONS", filed on May 11, 2015, both of which are incorporated herein by reference in their entirety.
Technical Field
The present invention relates generally to methods and systems for analyzing large text corpora and unstructured data. More particularly, the present invention relates to a method and system for analyzing large text corpora and unstructured data using visual analysis and topic modeling, a visual interface, and natural language processing and statistical functions.
Background
Management of large and growing collections of textual information and unstructured data is a challenging problem. Repositories of knowledge-rich textual information have become widespread, driving efforts to consolidate, mine, and analyze large volumes of data. As the number of documents increases, understanding the meaning of a text corpus becomes cognitively costly and time-consuming.
The challenge of automatic summarization of large text corpora has become a major concern for researchers in the field of Natural Language Processing (NLP). To summarize text corpora, researchers have developed techniques such as latent semantic analysis (LSA) for extracting and representing the contextual-usage meaning of words. LSA produces a concept space that can be used for document classification and clustering. Recently, probabilistic topic models have emerged as a powerful new technique for finding semantically meaningful topics in unstructured text collections. To further provide a visual summary of text corpora, researchers in the knowledge discovery and visualization communities have developed tools and techniques to support visualization and exploration of large text corpora based on both LSA and probabilistic topic models.
Although probabilistic topic models have demonstrated their advantages in interpretability and semantic association, few interactive visualization systems have utilized such models to support exploration and analysis of text corpora. Exemplar-based visualization and probabilistic latent semantic visualization methods have projected documents onto semantic two-dimensional (2D) plots while estimating the topics of a text corpus. Although the resulting document clusters correspond well to the selected labels, there is little opportunity for interactive exploration and analysis of the document clusters. One exception is the time-based visualization system TIARA, which applies the ThemeRiver metaphor to visually summarize a text collection based on topic content. Using the TIARA system, the user is able to answer questions such as: what are the main topics in the document corpus? And how do the topics evolve over time?
However, when analyzing large text corpora, there are many other real-world questions that current text analysis visualization systems have difficulty answering. In particular, questions about the relationship between topics and documents are difficult to address with existing tools. Such questions include: what are the characteristics of the documents based on their topic distributions? And which documents involve multiple topics at once (and what are those topics)? In the field of science policy, for example, a document with multiple topics may indicate a publication that is interdisciplinary (i.e., encompasses more than one body of knowledge). Similarly, in the context of social media analysis, a document with multiple topics may represent a unique news article related to several different hot topics.
To overcome the disadvantages associated with existing methods and systems, and to help users understand large text corpora more effectively, the present invention provides a novel visual analysis system that integrates a state-of-the-art probabilistic topic model, latent Dirichlet allocation (LDA), with interactive visualizations. To describe the document corpus, the method and system of the present invention first extract a set of semantically meaningful topics using LDA. Unlike most traditional clustering techniques, which assign each document to one particular cluster, the LDA model takes into account the different topical aspects of each individual document. This permits efficient full-text analysis of larger documents that may contain multiple topics. To highlight this property of the model, the method and system of the present invention utilize a parallel coordinate metaphor to present the probability distribution of documents across topics. This presentation allows users to discover single-topic and multi-topic documents, as well as the relative importance of each topic to a document of interest. In addition, since most text corpora are time-based in nature, the system and method of the present invention also show topic evolution over time.
Further, the present invention enables companies, including analysts, marketers, business unit leaders, information technologists, and C-level executives, to obtain actionable insights from any type of textual data. This technology allows people to enhance their decision-making process on a data-driven basis. The technology takes textual data and, through deep computational and statistical algorithms, identifies themes, topics, and emerging issues within each data set. The results are displayed in an interactive visual format so that anyone in the company can analyze the data as a whole or in a fine-grained manner. All types of text data can be analyzed: internal data (e.g., email, chat, surveys, call centers, and focus groups) or external data (e.g., social media, review websites, forums, and news websites). The technology can handle a large number of languages, ensuring that feedback loops from all over the world can be analyzed. Highly customizable features may also be selected to tailor the analysis and improve its effectiveness. Most companies are sitting on a treasure trove of unstructured text data, but have little ability to mine it for intelligence.
Disclosure of Invention
Again, in example embodiments, the method and system of the present invention tightly integrate interactive visualization with a state-of-the-art probabilistic topic model. In particular, to address the questions set forth above, the method and system of the present invention utilize a Parallel Coordinate (PC) metaphor to present the probability distribution of documents across topics. This carefully chosen presentation shows not only how many topics a document is related to, but also the importance of each topic to the document. In addition, the method and system of the present invention provide a rich set of interactions that can help users automatically partition a collection of documents based on the number of topics in each document. In addition to showing relationships between topics and documents, the method and system of the present invention also support other tasks necessary to understand a document collection, such as summarizing the main topics of the collection and showing how topics evolve over time.
The questions that the method and system of the present invention can effectively address when analyzing a large text corpus include: what are the main topics that capture the document collection? What are the characteristics of the documents based on their topic distributions? Which documents relate to multiple topics at once? And how do topics of interest evolve over time? To help the user answer these questions, the method and system of the present invention first extract a set of semantically meaningful topics using the LDA model. To support topic model-based visual exploration of a collection of documents, the method and system of the present invention employ multiple coordinated views to highlight both the topical and temporal features of the document corpus. One novel contribution of the method and system of the present invention is the depiction of the probability distribution of documents across topics, which supports interactive identification and more detailed examination of single-topic and multi-topic documents.
In one example embodiment, the present invention provides a computerized method for textual data analysis, comprising: receiving, at one or more processors, text data to be analyzed from one or more memories; formatting, using the one or more processors, the text data for subsequent analysis; applying, using the one or more processors, a probabilistic topic model to the text data to extract a set of semantically meaningful topics, the set of semantically meaningful topics collectively describing all or a portion of the text data; generating, using a keyword weighting module executing on the one or more processors, a topic cloud view representing the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; generating, using a topic ranking module executing on the one or more processors, a document distribution view representing a distribution of all or a portion of the text data over a plurality of topics; generating, using a document entropy calculation module executing on the one or more processors, a document scatter plot view representing how many topics may be attributed to all or a portion of the text data; generating, using a temporal topic trend calculation module executing on the one or more processors, a temporal view representing changes over time in occurrences of topics with respect to all or a portion of the text data; and displaying one or more of the topic cloud view, the document distribution view, the document scatter plot view, and the temporal view to a user in an analysis of all or a portion of the text data. The text data includes one or more of: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the internet. The probabilistic topic model generates a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of topics. Optionally, the keywords are ordered to indicate their importance and relationship to each other for a given topic. Optionally, keywords are highlighted to indicate their importance to multiple topics. The topics are ordered to represent the relationships between them. Various other example functions are also provided herein.
In another example embodiment, the present invention provides a computerized system for textual data analysis, comprising: one or more memories operable to store text data to be analyzed and one or more processors operable to receive the text data to be analyzed; an algorithm, executing on the one or more processors, operable to format the text data for subsequent analysis; an algorithm, executing on the one or more processors, operable to apply a probabilistic topic model to the text data to extract a set of semantically meaningful topics, the set of semantically meaningful topics collectively describing all or a portion of the text data; a keyword weighting module, executing on the one or more processors, operable to generate a topic cloud view representing the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; a topic ranking module, executing on the one or more processors, operable to generate a document distribution view representing a distribution of all or a portion of the text data over a plurality of topics; a document entropy calculation module, executing on the one or more processors, operable to generate a document scatter plot view representing how many topics may be attributed to all or a portion of the text data; a temporal topic trend calculation module, executing on the one or more processors, operable to generate a temporal view representing changes over time in occurrences of topics relating to all or a portion of the text data; and a display operable to display one or more of the topic cloud view, the document distribution view, the document scatter plot view, and the temporal view to a user in an analysis of all or a portion of the text data. The text data includes one or more of: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the internet. The probabilistic topic model generates a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of topics. Optionally, the keywords are ordered to indicate their importance and relationship to each other for a given topic. Optionally, keywords are highlighted to indicate their importance to multiple topics. The topics are ordered to represent the relationships between them. Various other example functions are also provided herein.
Again, the present invention enables companies, including analysts, marketers, business unit leaders, information technologists, and C-level executives, to obtain actionable insights from any type of textual data. This technology allows people to enhance their decision-making process on a data-driven basis. The technology takes textual data and, through deep computational and statistical algorithms, identifies themes, topics, and emerging issues within each data set. The results are displayed in an interactive visual format so that anyone in the company can analyze the data as a whole or in a fine-grained manner. All types of text data can be analyzed: internal data (e.g., email, chat, surveys, call centers, and focus groups) or external data (e.g., social media, review websites, forums, and news websites). The technology can handle a large number of languages, ensuring that feedback loops from all over the world can be analyzed. Highly customizable features may also be selected to tailor the analysis and improve its effectiveness. Most companies are sitting on a treasure trove of unstructured text data, but have little ability to mine it for intelligence.
In an additional example embodiment, the present invention provides an unstructured data analysis system comprising: an unstructured data analysis algorithm residing on a server and accessible via a browser, the unstructured data analysis algorithm operable to receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data includes one or more of the following: customer experience data, telecommunications data, email data, and social media data. The unstructured data analysis algorithm is further operable to receive external data from one or more remote sources. The external data includes one or more of: internet data, government data, and business data. The one or more analysis tools applied to the unstructured data include one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of: unstructured data, summaries of unstructured data, and summary information. The exploration layer allows one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. One or more users may simultaneously interact with the unstructured data analysis system via the annotation layer. Summary information is also displayed to one or more users in the combined layer.
In another additional example embodiment, the present invention provides an unstructured data analysis method, comprising: providing an unstructured data analysis algorithm residing on a server and accessible via a browser, the unstructured data analysis algorithm operable to receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data includes one or more of the following: customer experience data, telecommunications data, email data, and social media data. The unstructured data analysis algorithm is further operable to receive external data from one or more remote sources. The external data includes one or more of: internet data, government data, and business data. The one or more analysis tools applied to the unstructured data include one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of: unstructured data, summaries of unstructured data, and summary information. The exploration layer allows one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. One or more users may simultaneously interact with the unstructured data analysis system via the annotation layer. Summary information is also displayed to one or more users in the combined layer.
Drawings
The present invention is illustrated and described herein with reference to the accompanying drawings, in which like reference numerals are used to refer to like method steps/system components as appropriate, and in which:
FIG. 1 is a schematic diagram illustrating one exemplary embodiment of a visual text corpus analysis tool of the present invention;
FIG. 2 is an exemplary display showing a topic cloud view of the visual text corpus analysis tool of the present invention;
FIG. 3 is an exemplary display showing a document distribution view of the visual text corpus analysis tool of the present invention;
FIG. 4 is a series of graphs showing document distribution on one topic, two topics, and more than two topics in accordance with the method and system of the present invention;
FIG. 5 is an exemplary display showing a topic cloud view of the visual text corpus analysis tool of the present invention;
FIG. 6 is an exemplary display showing a temporal view of the visual text corpus analysis tool of the present invention;
FIG. 7 is a schematic diagram illustrating one exemplary embodiment of an unstructured data analysis system in accordance with the present invention;
FIG. 8 is a schematic diagram illustrating another exemplary embodiment of an unstructured data analysis system of the present invention;
FIG. 9 is a schematic diagram illustrating an additional exemplary embodiment of an unstructured data analysis system of the present invention;
FIG. 10 is a schematic diagram illustrating another exemplary embodiment of an unstructured data analysis system of the present invention;
FIG. 11 is a schematic diagram illustrating one example embodiment of a presentation layer of the unstructured data analysis system of the present invention;
FIG. 12 is a schematic diagram illustrating one example embodiment of an exploration layer of the unstructured data analysis system of the present invention; and
FIG. 13 is a schematic diagram illustrating one example embodiment of an annotation layer of the unstructured data analysis system of the present invention.
Detailed Description
Two lines of work, namely text analysis models and text visualization techniques, are the main inspiration for the initial design of the present invention. These concepts are then refined and expanded upon as described in more detail below.
The first significant advance in text processing is the Vector Space Model (VSM). In this model, text is represented as vectors in a high-dimensional space, where each dimension is associated with a unique term within a document. One well-known example of a VSM is TF-IDF, which evaluates how important a word is to a document in a corpus. While VSM has shown its effectiveness in practical experience, it has a number of inherent drawbacks in capturing statistical structures between and within documents.
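As a minimal sketch of the TF-IDF weighting mentioned above, the following Python computes a weight for every term in a small tokenized corpus. The term-frequency and inverse-document-frequency variants used here (relative frequency and an unsmoothed natural logarithm) are one common choice and are an assumption, as are the function name `tf_idf` and the toy documents; documents are assumed to be non-empty.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (illustrative only).
docs = [
    "the battery drains fast".split(),
    "the screen is bright".split(),
    "the battery life is short".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "the" here, receives a weight of zero, while a term confined to one document is weighted most heavily.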
To overcome the shortcomings of VSM, researchers have introduced LSA, a factor analysis technique that reduces the term-document matrix to a much lower-dimensional subspace that captures most of the variance in the corpus. While LSA overcomes some of the disadvantages of VSM, it also has its limitations. The new feature space is difficult to interpret because each dimension is a linear combination of a set of words from the original space.
Recognizing the limitations of LSA, researchers have proposed generative probabilistic models for document modeling. For example, researchers have introduced generative models that represent the content of words and documents with probabilistic topics rather than with a purely spatial representation. One distinct advantage of this representation is that each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster of correlated terms. The LDA model assumes a latent structure consisting of a set of topics; each document is generated by choosing a distribution over topics and then generating each word at random from a topic selected using that distribution. As shown, for example, by analyses of scientific abstracts and newspaper archives, the extracted topics capture meaningful structure in otherwise unstructured data. On a cognitive level, the LDA model performs well in predicting word association and the effects of semantic association and ambiguity on a variety of language processing and memory tasks.
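The generative story described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used by the invention: the two toy topics, the Dirichlet parameter `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # Step 1: choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(n_words):
        # Step 2: select a topic for this word using theta ...
        z = rng.choices(topic_ids, weights=theta)[0]
        # ... then draw the word from that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics, each a probability distribution over keywords.
topics = [
    {"battery": 0.5, "charge": 0.3, "power": 0.2},
    {"screen": 0.6, "pixel": 0.3, "bright": 0.1},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Fitting an LDA model runs this story in reverse: given only the words, it infers the topics and each document's `theta`.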
Because of the various advantages of the LDA model, the method and system of the present invention first uses the model to extract a set of semantically meaningful topics for a given text corpus. The method and system of the present invention then presents the probabilistic results in an intuitive manner so that users can easily consume complex models when analyzing large text corpora.
Beyond the advantages of automatic text processing techniques, human intelligence still plays a key role in analyzing text corpora. Therefore, a number of visualization systems and techniques based on text processing methods have been developed to keep users in the loop.
For example, tools based on the VSM have been introduced to visualize email content for the purpose of portraying relationships from conversation histories. Keywords within the visualization are generated based on the TF-IDF algorithm.
Other tools enable users to visually explore text corpora through social network metaphors based on latent semantic analysis results. Other visualization systems have used multidimensional projection methods, such as Principal Component Analysis (PCA) and/or multidimensional scaling (MDS), to visualize text corpora. These projection techniques are similar to LSA in spirit in that they represent text as vectors with term frequencies as features, and then identify a lower-dimensional projection space. Visualization systems based on these projection techniques include IN-SPIRE. More recently, to visualize large collections of classified documents, others have proposed a two-stage framework for topology-based projection and visualization tools. However, unlike most conventional clustering techniques, which assign each document to one particular cluster, the methods and systems of the present invention consider the different topical aspects of each individual document.
Since topic models first emerged, visualization systems have adopted them because of their advantages over earlier text processing techniques. Exemplar-based visualization and probabilistic latent semantic visualization tools have projected documents onto static 2D plots while estimating the topics of a text corpus. Although the visual clustering results are better than those obtained from multidimensional projection methods, these tools have several limitations. First, as the number of extracted topics grows, the clusters of documents in the 2D projection are no longer separable on a topic basis. Furthermore, these visualization tools leave little room for interactive exploration and analysis of document clusters. More recently, TIARA, a time-based interactive visualization system, has been introduced that presents topics extracted from a given text corpus in a time-sensitive manner. TIARA provides a good overview of topics as they evolve over time. However, the relationship between documents and topics is less clear.
Thus, the method and system of the present invention presents a probability distribution of a document across extracted topics in addition to describing the evolution of topics over time. Thus, the method and system of the present invention provide an overview of document features based on their topic distributions and enable users to identify documents that relate to multiple topics at a time.
The method and system of the present invention support exploration of a document collection at multiple levels. At the overview level, the system assists the user in answering the following questions: what are the main topics of the document collection? And what are the characteristics of the documents in the collection? At the facet level, the system supports activities such as identifying temporal trends of particular topics and identifying documents that are relevant to a plurality of topics of interest. At the detail level, the system allows access to the details of each individual document as needed. Based on a state-of-the-art topic model, the system employs multiple coordinated views, each view addressing one of the questions described above.
Referring now specifically to FIG. 1, in one exemplary embodiment, the overall structure of the visual text corpus analysis tool 10 of the present invention comprises an offline text preprocessing module 12 and a topic modeling module 14. The text preprocessing module 12 is operable to put the text of the relevant documents 16 into a condition suitable for subsequent processing, exploration, and analysis. Such text preprocessing may include, but is not limited to, preprocessing of text from social media (e.g., Twitter posts and Facebook profiles), books (e.g., documents from the Gutenberg online book collection), and other documents (e.g., emails, Word documents, etc.).
As described above, topic models have several advantages over traditional text processing techniques. Thus, the visual text corpus analysis tool 10 of the present invention utilizes a probabilistic topic model in the topic modeling module 14 to summarize the relevant documents 16. More specifically, LDA is used to first extract a set of semantically meaningful topics. LDA produces a set of latent topics, each represented as a multinomial distribution over keywords, and assumes that each document can be described as a probabilistic mixture of these topics. P(z) denotes the distribution over topics z in a particular document. Assume that the text collection 16 includes D documents and T topics. Determining the topics is an iterative process using the visual text corpus analysis tool 10. The tool 10 enables a user to interactively specify the number of topics deemed essential to their analysis domain. Users are allowed to modify the topic modeling module 14 based on findings from their visual interactions and investigations, so that they can modify the number of topics and/or define the number of iterations of the process. The visual text corpus analysis tool 10 also enables users to add, remove, and merge topics in the topic modeling module 14.
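The mixture assumption can be written out explicitly. The text above defines only P(z), D, and T, so the equation below is the standard topic-model formulation supplied here as an assumption: the symbols $w_i$ (the i-th word of a document) and $z_i$ (the topic from which that word was drawn) are introduced for illustration.

```latex
P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)
```

Here $P(w_i \mid z_i = j)$ is the probability of word $w_i$ under topic $j$ (the topic's multinomial distribution over keywords), and $P(z_i = j)$ is the weight of topic $j$ in the current document, so a document whose weight is concentrated on one value of $j$ is a single-topic document.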
Thus, the document collection 16 is first preprocessed to remove stop words, etc. Then, the Stanford Topic Modeling Toolbox (STMT) or the like is used to extract a set of topics from the set of documents 16. The extracted topics and the per-document probability distributions over those topics serve as input for further visualization.
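The stop-word removal step mentioned above can be sketched as follows. This is a minimal illustration only: the `STOP_WORDS` set, the `preprocess` function, and the tokenization rule are assumptions made for the example and are not taken from the specification or from any particular toolkit.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [
    "The robot is learning to grasp objects.",
    "Analysis of network traffic in the cloud.",
]
# One token list per document, ready to be handed to a topic model.
tokenized = [preprocess(d) for d in docs]
```

The resulting token lists would then be passed to whichever topic modeling toolkit is used to extract the topics.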
The visual design of the tool 10 of the present invention includes four coordinated overviews, which may be displayed and operated on a suitable graphical user interface (GUI) either individually or in combination: (1) a document distribution view 18 that displays a probability distribution of documents across topics; (2) a topic cloud 20 that presents the content of the extracted topics; (3) a time view 22 highlighting the temporal evolution of the topics; and (4) a document scatter plot view 24 that facilitates interactive selection of single-topic versus multi-topic documents. Each of the four overviews serves a different purpose, and they are coordinated through a rich set of user interactions. Further, when an arbitrary document is selected, a detailed view presents the textual content of that document as needed.
To help the user quickly grasp the gist of a document collection, the main topics are presented as tag clouds in a topic cloud view 20. In the topic cloud view 20, each row displays a topic that includes, for example, a plurality of keywords related to that topic. Since each topic is modeled as a multinomial distribution over keywords, the weight of each keyword indicates its importance to the topic. To encapsulate such information in the tag cloud, the keywords are aligned from left to right, with the most important keywords placed at the beginning. In addition, since one keyword may appear in a plurality of topics, the display size or weight of each keyword reflects its occurrence across all topics. However, it will be apparent to those skilled in the art that other configurations may be used. An example of a topic cloud view 20 is provided in FIG. 2. To assist the user in understanding the main topics in the document collection 16, topics are presented in a sequence in which semantically similar topics are close together, so that there is continuity when topics are viewed in turn. Because the LDA model does not model relationships between topics, the topics are reordered by defining a similarity measure. The visual text corpus analysis tool 10 utilizes the Hellinger distance function as a similarity measure representing the closeness of topics. The visual text corpus analysis tool 10 visualizes the similarity measure to give the user an understanding of the semantic layer of the topic distribution and helps reduce cognitive overload by clustering the topic space.
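For reference, the Hellinger distance between two discrete distributions can be computed as in the sketch below. The function name `hellinger` and the assumption that both topics are given as probability lists over the same keyword vocabulary are illustrative choices, not details from the specification.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same keywords.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Two hypothetical topics over a three-keyword vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
closeness = hellinger(topic_a, topic_b)
```

A smaller value indicates a more similar pair of topics, so the value can be used directly when deciding which topics to place next to each other.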
The topic cloud view 20 also provides a set of interactions for the user to help the user quickly understand the topic. For example, hovering over a particular keyword will cause all other occurrences of that keyword in the tag cloud to be highlighted. The user may also search for specific keywords of interest. Further, the topic cloud view 20 works closely with all other views to quickly provide information about a particular topic as needed.
The topic cloud view 20 is generated, in part, by an online keyword weighting module 26, which is operable to aggregate the results of the topic modeling module 14. It ranks the words in a given topic by their probability, so that more probable words are placed at the front of the ranking. The probability values are those calculated by the topic modeling module 14. For example, the size of a word in the topic cloud view is determined by the frequency of occurrence of the word in the entire text corpus, normalized by the maximum word frequency: the higher the frequency, the larger the word. By default, for example, the tool 10 displays the 50 most probable words of each topic. The user may modify the number of words through interaction.
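The ranking and sizing rule described above can be sketched as follows. The function name `topic_cloud_row` and the dictionary-based inputs are assumptions made for illustration; only the ordering by topic probability, the sizing by corpus frequency normalized by the maximum frequency, and the default of 50 words come from the description.

```python
def topic_cloud_row(topic_word_probs, corpus_freq, top_n=50):
    """Return (word, display_size) pairs for one topic row, most probable word first.

    topic_word_probs: {word: probability of the word within this topic}
    corpus_freq:      {word: number of occurrences in the whole corpus}
    display_size is the corpus frequency divided by the maximum corpus frequency.
    """
    ranked = sorted(topic_word_probs, key=topic_word_probs.get, reverse=True)[:top_n]
    max_freq = max(corpus_freq.values())
    return [(word, corpus_freq[word] / max_freq) for word in ranked]

# Hypothetical topic and corpus statistics.
topic = {"robot": 0.5, "vision": 0.3, "grasp": 0.2}
freq = {"robot": 40, "vision": 80, "grasp": 10}
row = topic_cloud_row(topic, freq, top_n=2)
```

Note that position and size are driven by different quantities: a word can lead its row because it is probable within the topic while still being drawn small because it is rare in the corpus as a whole.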
To provide an overview of the documents as mixtures of topics, the tool 10 of the present invention highlights the distribution of each document across all extracted topics. The selected representation converts the probability distribution of each document into a signal-like pattern. More specifically, a parallel coordinates metaphor is employed in which each axis represents a topic and each line represents a document in the collection 16. This is illustrated in FIG. 3. In this selected configuration, all variables (i.e., topics) are evenly spaced, and each variable shares the same range of values from 0 to 1. Thus, when viewing the document distribution view 18, a document need not be understood based on its value on each individual axis; rather, it may be understood based on its overall pattern across all axes. However, it will be apparent to those skilled in the art that other configurations may be used.
One limitation of LDA is that it does not directly model the correlation between topic occurrences, although in most text corpora it is natural to expect such correlation. The tool 10 of the present invention overcomes this limitation through visualization, by making the correlation between topics more prominent. Conveniently, one feature of parallel coordinates visualization is that associations between adjacent axes are easier to see. Thus, topics can be ordered in such a way that semantically similar topics are adjacent to each other, so that associations between similar topics become visually prominent. The topic similarity is defined in terms of the Euclidean distance between two topics over the entire document collection 16:
$$\mathrm{dist}(i, j) = \sqrt{\sum_{k=1}^{D} \bigl( P(d_k, i) - P(d_k, j) \bigr)^{2}}$$
wherein d_k is one of the D documents in the entire collection 16, and P(d_k) is the probability distribution of the kth document over all topics. Thus, P(d_k, i) represents the probability of topic i when generating document k. When plotting topics as axes in the selected interface, the ordering starts with the topic with the most concentrated probability and then repeatedly finds the topic most similar to the current topic based on the distance between topics. FIG. 3 illustrates a document visualization across topics after topic reordering. The relationship between any two most similar topics (i.e., on adjacent axes) becomes visually identifiable.
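The reordering procedure can be sketched as a greedy nearest-neighbor walk over the topics. Two points are assumptions made for this example: the document-topic matrix is a list of rows (one row of topic probabilities per document), and "the topic with the most concentrated probability" is interpreted as the topic with the largest total probability mass across all documents. The function names are hypothetical.

```python
import math

def topic_distance(doc_topic, i, j):
    """Euclidean distance between topics i and j over all documents."""
    return math.sqrt(sum((row[i] - row[j]) ** 2 for row in doc_topic))

def order_topics(doc_topic):
    """Greedy ordering: start from the heaviest topic, then keep appending
    the not-yet-placed topic that is closest to the last placed topic."""
    n_topics = len(doc_topic[0])
    mass = [sum(row[t] for row in doc_topic) for t in range(n_topics)]
    order = [max(range(n_topics), key=mass.__getitem__)]  # assumed starting rule
    remaining = set(range(n_topics)) - set(order)
    while remaining:
        last = order[-1]
        # sorted() makes tie-breaking deterministic (lowest topic index wins).
        nearest = min(sorted(remaining), key=lambda t: topic_distance(doc_topic, last, t))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Four hypothetical documents over three topics.
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.8, 0.1],
    [0.6, 0.1, 0.3],
]
axis_order = order_topics(doc_topic)
```

In this toy input, topic 2 co-occurs with topic 0 in several documents, so it is placed on the axis next to topic 0, while topic 1 ends up last.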
The document distribution view 18 is generated, in part, by an online topic ranking module 28, which is operable to perform the functions described above as well as to produce the signal representation of each individual document. Such signals illustrate different properties of the documents. View 18 shows that documents whose distribution is concentrated on a single topic are strongly focused on that topic, whereas documents whose distribution is spread over two or three topics have a more varied focus.
When exploring the distribution of documents over topics, it is easy to see that documents exhibit different characteristics depending on the number of topics they cover. FIG. 4 shows a document 30 that focuses on only one topic, a document 32 that focuses on two topics, and a document 34 that focuses on more than two topics. Different numbers of topics within a document may be interpreted as different characteristics within the context of a given set of documents 16. For example, in a collection of scientific publications, a document with a single topic represents a publication that is related to a particular field of scientific research. A document with two or more topics is more likely to represent an interdisciplinary research article, which typically integrates two or more bodies of expertise.
In addition, the document distribution view 18 provides a rich set of interactions, such as brushing, highlighting, and the like. Brushing a range of values on a topic axis allows the user to select documents with a particular probability for that topic. By integrating information about the main topics and the document features from both the topic cloud view 20 and the document distribution view 18, a user is able to effectively develop a summary of the document collection 16.
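As a rough sketch, brushing a range on one topic axis amounts to a filter over the document-topic matrix. The function name `brush` and the row-per-document layout are assumptions made for illustration.

```python
def brush(doc_topic, topic, low, high):
    """Return indices of documents whose probability on `topic` lies in [low, high]."""
    return [k for k, row in enumerate(doc_topic) if low <= row[topic] <= high]

# Three hypothetical documents over two topics.
doc_topic = [
    [0.90, 0.10],
    [0.40, 0.60],
    [0.75, 0.25],
]
# Brushing the upper range of the axis for topic 0 selects documents focused on it.
focused_on_topic_0 = brush(doc_topic, topic=0, low=0.7, high=1.0)
```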
The document distribution view 18 enables a user to identify documents that are focused on a particular topic by brushing the upper range of that topic's axis. However, identifying documents related to two or more topics in a corpus is not as easy, because they are masked by single-topic documents with high probability values. To alleviate this problem, the entire document collection is laid out in such a way that single-topic and multi-topic documents can be easily separated. This is the document scatter plot view 24.
As can be seen in the document distribution view 18, each document is converted into a signal-like probability distribution pattern. In this representation, documents with multiple topics appear noisier than those that focus clearly on one topic. In information theory, Shannon entropy is a measure of the amount of uncertainty associated with a random variable. In the present context, treating the topic as a random variable for each document, Shannon entropy can be used to distinguish clean signals from noisy signals. Thus, the tool 10 of the present invention applies Shannon entropy to distinguish documents based on the number of topics they have. The entropy of each document, based on its probability distribution across topics, is calculated as:
$$H(d_k) = -\sum_{i=1}^{T} P(d_k, i) \, \log P(d_k, i)$$
wherein P(d_k) is the probability distribution of the kth document over all topics. Each document may then be rendered in the document scatter plot view 24 based on its entropy and its maximum probability value on a topic, each normalized to [0, 1] (see FIG. 5). In this presentation, for example, single-topic documents (with a higher maximum and lower entropy) are in the upper left corner of the scatter plot, while the lower right corner captures documents with a higher number of topics (with a lower maximum and higher entropy). Upon selection, a pie chart is shown to describe the topic distribution of a particular document. In FIG. 5, each pie chart represents a selected document, and each color represents a topic. As shown, a document with a smaller entropy value appears as a pie chart of a single solid color, while a document with a larger entropy value appears with multiple colors, indicating that the entropy value corresponds to the number of topics in the document.
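The scatter plot coordinates of a document can be sketched as below. The logarithm base (2) and the normalization of entropy by its maximum possible value log2(T) are assumptions made for this example, as are the function names.

```python
import math

def shannon_entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution; zero terms are skipped."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def scatter_point(dist):
    """Return (normalized entropy, maximum topic probability) for one document.

    Entropy is divided by log2(T), its maximum for T topics, so that both
    coordinates fall in [0, 1] (assumed normalization).
    """
    n_topics = len(dist)
    entropy = shannon_entropy(dist)
    normalized = entropy / math.log2(n_topics) if n_topics > 1 else 0.0
    return normalized, max(dist)

single_topic = scatter_point([1.0, 0.0, 0.0, 0.0])     # low entropy, high maximum
multi_topic = scatter_point([0.25, 0.25, 0.25, 0.25])  # high entropy, low maximum
```

With entropy on the horizontal axis and the maximum probability on the vertical axis, the first document lands in the upper left corner and the second in the lower right, matching the layout described above.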
In summary, the document scatter plot view 24 enables a user to interactively identify a subset of documents having a desired number of topics by selecting documents within different regions. The document scatter plot view 24 is produced, in part, by a document entropy calculation module 36, which is operable to perform the functions described above and to group the documents in any given text corpus. The document scatter plot view 24 intentionally groups documents based on their entropy and visually illustrates the focus of the given corpus, suggesting whether the corpus is focused on a single topic or on varied topics.
Since most document collections 16 accumulate over time, presenting such temporal information helps users understand how the topics of the corpus evolve. Referring now specifically to FIG. 6, the time view 22 is created as an interactive ThemeRiver, with each band representing a topic. In the text corpus, each document is associated with a timestamp, so the height of each band over time can be determined by summing the distributions of the documents over the topic within each time frame. The unit of the time frame depends on the corpus; for example, a year may be a suitable unit of time for scientific publications, while a month or even a day would be more suitable for a news corpus. After the time unit has been selected, the documents are divided into their respective time frames based on their timestamps. Then, for each time frame, the height of each topic is calculated by summing the probabilities of that topic over the documents within that time frame.
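The band heights can be sketched as a per-time-frame sum of document-topic probabilities. The input layout (a list of `(time_frame, topic_distribution)` pairs) and the function name `river_heights` are assumptions made for illustration.

```python
def river_heights(docs):
    """Sum topic probabilities of all documents falling in each time frame.

    docs: iterable of (time_frame, topic_distribution) pairs.
    Returns {time_frame: [band height for each topic]}.
    """
    heights = {}
    for frame, dist in docs:
        band = heights.setdefault(frame, [0.0] * len(dist))
        for topic, probability in enumerate(dist):
            band[topic] += probability
    return heights

# Three hypothetical documents binned by publication year, two topics each.
docs = [
    (2006, [0.50, 0.50]),
    (2006, [0.25, 0.75]),
    (2007, [1.00, 0.00]),
]
heights = river_heights(docs)
```

Because each document's distribution sums to 1, the total height of all bands in a time frame equals the number of documents published in that frame.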
For example, the order of topics (top to bottom) is the same as in the topic cloud view 20 and the document distribution view 18. A color or pattern is assigned to each topic by interpolating a color or pattern spectrum using the normalized distances between all adjacent topics. As a result, a more similar pair of topics is assigned more similar colors or patterns.
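One way to realize this interpolation is to place each topic along a spectrum at its cumulative, normalized distance from the first topic. The sketch below assumes the topics are already ordered and that the distances between adjacent topics are given; the function name, the hue range, and the saturation and value settings are illustrative choices only.

```python
import colorsys

def spectrum_positions(adjacent_distances):
    """Map ordered topics to positions in [0, 1] along a spectrum.

    adjacent_distances[i] is the distance between topic i and topic i + 1
    in display order; the result has one more entry than the input.
    """
    total = sum(adjacent_distances)
    positions, travelled = [0.0], 0.0
    for distance in adjacent_distances:
        travelled += distance
        positions.append(travelled / total if total else 0.0)
    return positions

# Topics 0 and 1 are close; topics 1 and 2 are far apart.
positions = spectrum_positions([1.0, 3.0])
# Assumed mapping of positions to hues; any continuous color ramp would do.
colors = [colorsys.hsv_to_rgb(0.8 * p, 0.6, 0.9) for p in positions]
```

Close topics receive nearby positions and therefore similar hues, while a large distance between two adjacent topics produces a visible jump in color.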
In summary, the time view 22 provides a visual summary of how the topics of the document collection 16 evolve over time. In addition to this representation, various interactions are supported within the time view 22. The selection of a time frame (one vertical unit of time) results in the filtering of all documents published within the selected time frame. Similarly, for example, clicking on the intersection of a topic band and a time frame in the time view 22 results in the selection of the documents published during that time frame that have a probability greater than 30% on the selected topic. Thus, it can be determined which documents contributed to a topic over a particular period of time. The time view 22 adds richness by revealing temporal information hidden in the document collection 16 and allowing the user to perform filtering based on time and topics.
The time view 22 is generated, in part, by a temporal topic trend calculation module 38, which is operable to perform the above-described functions and to support review of the detailed documents. The time view 22 enables a user to directly select, for example, documents within a specific time range and retrieve the corresponding data. The time view 22 plays a key role in showing the user the basis of the identified visual patterns and trends by revealing the document details associated with such depictions.
Upon selection of an arbitrary document, the tool 10 of the present invention provides the actual textual content of the document of interest. Since any topic model is far from perfect, the function of the detailed view is twofold: first, it provides context for the user to develop a deep understanding of the topics and the keywords associated with them; second, it helps the user verify the patterns shown in the visualizations.
Since understanding the large text corpus 16 would involve the use of all four views, careful tailoring of the coordination between all views is required. At the topic level, hovering over a topic in any view that relates to a topic representation will highlight the same topic in other views. For example, if the user hovers over one axis in the document distribution view 18, the same topic is highlighted in both the topic cloud view 20 and the time view 22. Thus, the user can quickly integrate information about keywords, document distribution, and temporal trends of a specific topic. In addition, views are also coordinated by color or style, with each topic having the same color or style in all views.
At the document level, selecting any set of documents in a view that shows individual documents highlights the same set of documents in the other views. For example, a brushing operation in the document scatter plot view 24 is immediately reflected in the document distribution view 18, and vice versa. When a user selects several documents with two prominent topics (i.e., in the middle range) in the document scatter plot view 24, viewing the distribution of these documents helps the user understand the topic combinations of the documents.
With respect to temporal aspects, filtering of documents written/published within a particular time period is supported. For example, a click on a time frame (i.e., one vertical unit of time) in the time view 22 results in filtering of all documents published within the selected time span. Similarly, a click on the intersection of the topic band and the time frame in the time view 22 results in the selection of the following documents published during that time period: those documents have topics that make a major contribution to those documents (e.g., a probability greater than 30%). This selection is shown in both the document distribution view 18 and the document scatter plot view 24. This functionality allows a user to filter documents based on time of interest and topics, and then review documents published within a selected timeframe.
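The two click interactions described above reduce to simple filters. In the sketch below, the record layout (dictionaries with hypothetical `id`, `frame`, and `dist` keys) and the function names are assumptions; the 30% threshold is the example value given in the text.

```python
def select_time_frame(docs, frame):
    """Click on a time frame: all documents published within that frame."""
    return [d["id"] for d in docs if d["frame"] == frame]

def select_intersection(docs, frame, topic, threshold=0.3):
    """Click on a topic band within a time frame: documents from that frame
    whose probability on the topic exceeds the threshold."""
    return [
        d["id"]
        for d in docs
        if d["frame"] == frame and d["dist"][topic] > threshold
    ]

# Hypothetical documents with a publication year and a two-topic distribution.
docs = [
    {"id": "a", "frame": 2008, "dist": [0.9, 0.1]},
    {"id": "b", "frame": 2008, "dist": [0.2, 0.8]},
    {"id": "c", "frame": 2009, "dist": [0.6, 0.4]},
]
in_2008 = select_time_frame(docs, 2008)
in_2008_on_topic_0 = select_intersection(docs, 2008, topic=0)
```

The resulting identifiers would then be highlighted in the document distribution view and the document scatter plot view.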
The tool 10 of the present invention allows a user to explore and query the large document corpus 16 from multiple viewpoints. Starting from the topic cloud view 20, the user can view a summary of the corpus 16 and identify topics or even keywords of interest. From the document distribution view 18, the user can locate a topic of interest and select the documents that focus on that topic by performing a brushing operation on its vertical axis. The user may then visually identify which other topics the selected set of documents is relevant to by viewing the distributions in the document distribution view 18 and the document scatter plot view 24. Furthermore, the user can always check the details of the documents based on the selection. For example, if the user wants to identify interdisciplinary/multidisciplinary publications in the corpus 16, he/she can do so by selecting the documents in the middle to lower right region of the document scatter plot view 24. Further, if the user is interested in querying the corpus 16 by time, he/she may perform the selection in the time view 22 by clicking on one time frame or on the intersection of a particular time frame and topic. In summary, the tool 10 of the present invention employs multiple coordinated views to support interactive exploration of the text corpus 16. Each of the views is designed to address one of four important questions.
To evaluate the effectiveness of the tool 10 of the present invention in answering the four target questions, the tool 10 was applied to explore and analyze two text corpora: scientific proposals awarded by the National Science Foundation (NSF) from 2006 to 2010, and publications in the IEEE VAST proceedings.
Case study 1: Analyzing scientific proposals. In this case study, we first describe the data we collected. We then characterize the target domain and present a set of tasks derived from our conversations with NSF project managers. Finally, we show how the tool can assist expert users in solving these tasks.
Data collection and preparation. To verify whether the tool can assist project managers in making funding decisions and managing their award portfolios, we first collected the proposals awarded from 2000 to 2010 by the Information and Intelligent Systems (IIS) division, which is part of the Directorate for Computer and Information Science and Engineering (CISE). The set consists of approximately 4,000 awards, with structured data on award number, directorate, division, program, project manager, principal investigator (PI), and award date, and a proposal abstract in the form of unstructured text. We processed all the collected abstracts, each of which constitutes a single document in the corpus. We removed a standard list of stop words. This gave us a vocabulary of 334,447 words. We then extracted 30 topics from the corpus using the LDA model.
Domain characterization. A core part of the NSF's mission is to keep the United States at the frontier of discovery by funding research in traditional academic areas, including identifying broader impacts, as well as by funding transformative and interdisciplinary research. To accomplish the former, NSF project managers need to identify appropriate reviewers and panelists to ensure the best possible peer review. To effectively perform the latter, project managers need to identify emerging areas and research topics in order to fund interdisciplinary and transformative research. In addition to making funding decisions, project managers also need to manage their award portfolios. While project managers have performed these tasks well in the past, they need new ways to help them because of the rapidly changing nature of science and the significant growth in the number of proposals submitted. Mapping these high-level duties to actionable items, we designed three tasks related to decision making and award portfolio management. Task 1 focuses on grouping new proposal submissions based on their topics. This task entails understanding the main topics of the text corpus and filtering a subset of documents based on their characteristics relative to the topics. Task 2 is to identify appropriate reviewers for the proposal submissions, which also involves knowing whether a submission is related to multiple topics so that the right experts can be gathered. Finally, task 3 focuses on the temporal aspect of the award portfolio, which involves discovering trends in how topics develop over time.
Expert evaluation. Since current NSF project managers are particularly busy, we invited a former NSF project manager to perform our expert evaluation. The participant had two years of working experience as an NSF project manager. At the beginning of the evaluation, we took 30 minutes to demonstrate the system design and the function of each visualization. Then, we asked the participant to use the tool to perform the following three tasks.
Task 1: Group 200 newly submitted proposals based on topic. Starting from the topic cloud view, the participant quickly browsed through the extracted topics to get an overview of the newly submitted proposals. Since the participant had been responsible for proposals in the fields of robotics and computer vision, she quickly focused her attention on these two topics. Upon selecting the proposals that focus on the topic about robotics, the participant quickly glanced at the titles in the detailed view to verify their relevance. Although the participant confirmed that each selected proposal was relevant, she also noted that the positions of the proposals in the document scatter plot view were scattered. Since the proposals in the lower right region are more likely to include two or more topics, the participant was interested in knowing which other topics these proposals also relate to. By further filtering those proposals that appear to be more interdisciplinary in the document scatter plot view, the participant found that they involve other areas such as neuroscience and social communication. When relevant documents are selected in the document distribution view, the detailed view is invoked so that the project manager can view the PIs of previously awarded proposals.
Task 2: Identify appropriate reviewers. To identify reviewers, the participant first wanted to roughly group the proposals. Based on the initial exploration, the participant concluded that there are roughly two groups of proposals: one group focuses on the core of the robotics field, while the other group draws on bodies of knowledge from other fields such as neuroscience and social communication. To identify reviewers for both groups of proposals, the participant wanted to find PIs of previously awarded proposals. Examining the historical data, the participant located the topic about robotics in the document distribution view. She then performed a brushing operation on the top range of the axis to select the proposals related to the topic. Finally, the participant turned to the detailed view to see the PIs previously awarded in the field of robotics. For the interdisciplinary proposals in the second group, the participant went through a similar process to identify experts from other related areas (e.g., neuroscience) to serve on the review panel, ensuring the best possible peer review.
Task 3: Analyze the temporal trends of the award portfolio. At the portfolio level, the former project manager was interested in looking at the recent temporal trends in the field she had been responsible for. By exploring the time view, the participant found the trend of the proposals awarded in the field of robotics to be stable, although the overall number of proposals awarded between 2006 and 2009 was increasing. Unlike the steady trend of robotics, the number of proposals awarded on the topic of "using technology to help people with disabilities" increased year by year. The former project manager commented that this view is valuable to her because it enables her to view funding trends on different topics that would otherwise be difficult to find.
In summary, the participant considered each view in the tool to be well designed with a clear purpose. She commented that the tool could play a helpful role in the workflow of a project manager. In particular, she liked the fact that the tool can automatically suggest the more interdisciplinary proposals, because this is difficult to judge in the traditional way. She also appreciated the coordination between views, which helped her quickly integrate information from different aspects of the same corpus.
Case study 2: Analyzing the VAST conference proceedings. As the field of visual analytics matures, it is helpful to review how the field has evolved. One way to address this question is to analyze the publications that have been accepted by the most important venue in visual analytics. In this case study, we recruited four researchers to explore the papers published in the VAST conference/symposium since the beginning of the field in 2006. Since all the users were familiar with the field of visual analytics, we wished to encourage free exploration, in contrast to the well-structured tasks above. After the evaluation, we grouped the findings of the participants into two groups: discovering causal relationships between the temporal evolution of topics and funding sources, and learning about interesting sub-areas in the field of visual analytics.
Data collection and preparation. We first collected all the papers published in the VAST conference/symposium from 2006 to 2010. A total of 123 publications were collected. We then parsed each publication into fields that include title, author, year of publication, abstract, subject, and references. We performed topic modeling on the entire body of each article (from introduction to conclusion), where each article constitutes a document in the corpus. Removing standard stop words left us with a vocabulary of 317,315 words. Based on our records of the different tracks of each VAST conference, we extracted 19 topics from the corpus.
User evaluation. Of the four researchers we recruited, two were senior researchers in the field of visual analytics, and the other two were doctoral students with visual analytics as their primary research interest. In this evaluation, we provided all participants with high-level tasks and encouraged free exploration. After introducing the tool, we asked each participant to identify the core topics within the field and how the field had evolved over the preceding five years. We roughly grouped the usage patterns into two groups: identifying rising/fading topics, and using the system as an educational tool.
Identifying rising/fading topics. After scanning through all the topics in the topic cloud view, one senior researcher commented that the topics fit well with the paper tracks of the VAST conferences. When looking at the temporal trend of each topic, the participant noticed several clear patterns of rise and decay. For example, topics related to video news analysis initially attracted much attention, but the attention decreased rapidly year by year. He also noted similar trends for topics related to network traffic monitoring and analysis. This pattern agreed with his own knowledge, and the participant explained the trend by noting that, when the field started, its areas of interest were guided by the Department of Homeland Security (DHS), which was the main source of funding at that time. Next, the participant turned to the rising patterns, which indicate topics that have attracted attention in recent years. In particular, both the topic on trend and uncertainty analysis and the topic on dimension analysis and reduction have attracted more attention since 2008. This pattern also agreed with his own knowledge, and the participant commented that it is likely the result of the Foundations of Data and Visual Analytics (FODAVA) program introduced jointly by the NSF and DHS.
Learning about the field of visual analytics. Another senior researcher (who was teaching a visual analytics course at the time) commented that he could see the tool being useful for his course. Students could explore all VAST publications and identify papers related to topics of interest for course presentations. Similarly, another participant wanted to see what had been done in the field of visual analytics in terms of text analysis. He first located the topic and then selected the publications that rank high on that topic in the document distribution view. He quickly glanced at the paper titles in the detailed view and verified that all the selected papers matched his interests. He also noted that some of the papers in this selection appear to be related to other topics such as entity extraction and database queries. After the study, he asked for a screen capture of the detailed view so that he could find the papers he had identified during the study.
In summary, participants believe that this tool helps them explore the evolution of the visual analytics domain and identify publications based on their own interests for further investigation.
Those skilled in the art will appreciate that the various modules and processes of the invention are implemented using a processing device such as a computer. Such computers and other processing devices may include one or more general or special purpose processors, such as microprocessors, digital signal processors, custom processors, and Field Programmable Gate Arrays (FPGAs), and unique stored program instructions, including both software and firmware, that control the one or more processors, which in conjunction with certain non-processor circuits, implement some, most, or all of the functions of the method and system of the present invention. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more Application Specific Integrated Circuits (ASICs), in which each function or some combinations of functions are implemented as custom logic. Of course, a combination of the above methods may be used. Further, some example embodiments may be implemented via a non-transitory computer-readable storage medium having computer-readable code stored thereon for programming a computer, server, appliance, device, etc., each of which may include a processor to perform the functions described and claimed herein. Examples of such computer-readable storage media include, but are not limited to: hard disks, optical storage devices, magnetic storage devices, read-only memories (ROMs), programmable read-only memories (PROMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), flash memories, and the like. When stored in a non-transitory computer readable medium, the software may include instructions executable by a processor, which in response to such execution, causes the processor and/or any other circuitry to perform a set of operations, steps, methods, procedures, algorithms, and so forth.
Again, the present invention enables companies, including analysts, marketers, business unit leaders, information technologists, and C-level executives, to obtain actionable insights from any type of textual data. This technology allows people to enhance their decision-making process on a data-driven basis. The technique takes textual data and, through deep computation and statistical algorithms, identifies topics, themes, and emerging issues within each data set. The results are displayed in an interactive visual format so that anyone in the company can analyze the data as a whole or in a fine-grained manner. All types of text data can be analyzed: internal data (e.g., email, chat, surveys, call centers, and focus groups) or external data (e.g., social media, review websites, forums, and news websites). The technique can handle a large number of languages, ensuring that feedback from all over the world can be analyzed. In addition, highly customizable features are available for tailoring the analysis. Most companies are sitting on a treasure trove of unstructured text data but have little ability to mine it for intelligence.
Generally, the software of the present invention delivers deep-learning-based data analysis in a sophisticated visualization platform that reveals, analyzes, and infers actionable strategies across a wide range of business decision areas. It advantageously links call center audio, email, news, social media, chat, transaction data, customer feedback, and analytics to discover connections within the data that affect sales, customer service, operations, and risk analysis stakeholders. Structured data is also utilized, including retail transactions, survey data, personal profiles, and the like, as well as national and international industry, government, and product-specific data sources. The software is accessible from any browser-equipped device and integrates predictive modeling, artificial intelligence, and statistical natural language processing (NLP) to analyze any type of unstructured data. Visualizations are provided as an overview and/or in detail. The entire system 40 is shown schematically in FIG. 7. The system 40 uses a high-throughput multilingual API for information tagging, using complex term extraction, entity indicator extraction, geospatial indicator extraction, temporal indicator extraction, and opinion sentiment analysis. The system 40 also uses data-driven semantic machine learning and clustering, using automatic term association, statistical topic summarization, influencer inference, context-aware content ordering, content network association, and product-centric analysis.
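By way of illustration only, the information tagging described above can be sketched as follows. This is a minimal sketch, not the tagging API of system 40: the function name `tag_document`, the small word lists, and the ISO date pattern are all assumptions introduced for this example, and a lexicon lookup stands in for the statistical models the description refers to.

```python
import re
from collections import Counter

# Hypothetical, tiny lexicons; a production system would use trained models.
POSITIVE = {"great", "good", "love", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "rude", "late"}
STOPWORDS = {"the", "a", "an", "is", "was", "and", "but", "to", "of", "on", "my", "it", "in"}

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumption: ISO dates only
WORD_PATTERN = re.compile(r"[a-z']+")


def tag_document(text):
    """Return frequent terms, temporal indicators, and a sentiment score."""
    words = WORD_PATTERN.findall(text.lower())
    # Term extraction: count non-stopword tokens longer than two characters.
    terms = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Opinion sentiment: positive hits minus negative hits.
    sentiment = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {
        "terms": terms.most_common(3),
        "dates": DATE_PATTERN.findall(text),
        "sentiment": sentiment,
    }


tags = tag_document(
    "Delivery on 2016-05-11 was late and support was rude, but the app is great."
)
```

Here `tags["dates"]` holds the single temporal indicator found in the text, and `tags["sentiment"]` is negative because two negative words outweigh one positive word.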
Referring now specifically to FIGS. 8 and 9, in one example embodiment, the present invention provides an enhanced intelligence platform 45 that helps companies find the shortest path from data to revenue. It brings together siloed data segments, creates a top-level unified visual analytics layer, and enables users from multiple business functions to efficiently and collaboratively extract valuable insights. Platform 45 sits securely on top of the organization's data lake and is compatible with multiple levels of data infrastructure. It automatically ingests unstructured data (e.g., email, call logs) as well as structured data (e.g., sales, budgeting, finance) and processes them through deep computing and statistical algorithms. It handles tens of millions of feedback points and data points in real time and identifies themes, topics, and emerging issues within an organization. It helps to dynamically associate customer experience trends with overall corporate data. The platform 45 is fully interactive and easy to use. Anyone in the organization, from front-line employees, analysts, and marketers to business unit leaders and C-level executives, can interact with the data as a whole or in fine detail, customize their own dashboard, and share findings with others. In addition to the back-end data analysis engine, the platform 45 is supported by a fully enhanced user interface (UI) experience. The invention provides the user with a pixel-perfect dashboard with customizable visualizations. This makes presenting the user's analysis work much easier and more controllable. Rich interactions in the exploration layer allow the user to quickly begin analyzing details while keeping the surrounding context information. The flexible data analysis environment ensures that users never lose the overall view of the data while diving into details.
This goes beyond just a few visualizations; the user experience extends to a variety of useful data analysis and visualization tasks. Annotating and collaborating on analytical results were previously not easy. The invention completely changes the way that people find, share, and collaborate on analysis tasks. Users can annotate and share their findings with colleagues, supporting collaboration both inside and outside of each data analysis group. In summary, the present invention enhances decision making by providing a realistic environment for data analysis.
FIG. 10 is a schematic diagram illustrating another exemplary embodiment of an unstructured data analysis system 50 of the present invention. Typically, data closely related to a business enterprise, for example customer experience data 52, telecommunications data 54, email data 56, social media data 58, and other data 60, are aggregated in a data repository 62, and external data sources 64, such as internet data and government data, are pulled into an unstructured data analysis algorithm 66, which resides, for example, on a web server and is accessible via a browser. As described in detail herein above, the unstructured data analysis algorithm 66 applies predictive modeling, artificial intelligence, and statistical NLP to the data to reveal, analyze, infer, and visualize actionable information. Advantageously, the actionable information may be viewed by various business units 68, stakeholders, or other users, all of whom may add to or otherwise modify the visualization and share the results via the common interactive user interface 70.
FIG. 11 is a schematic diagram illustrating one exemplary embodiment of a presentation layer 80 of the unstructured data analysis system 50 (FIG. 10) of the present invention. In general, the presentation layer 80 allows for the display of various summary information regarding the unstructured data and/or results. For example, presentation layer 80 is shown displaying customer experience data 82, telecommunications data 84, and sales data 86.
FIG. 12 is a schematic diagram illustrating one example embodiment of an exploration layer 90 of the unstructured data analysis system 50 (FIG. 10) of the present invention. In general, the exploration layer 90 allows for the display of various summary information regarding the unstructured data and/or results. The exploration layer 90 also allows a time granularity to be selected so that the data are displayed in further detail. This "drill-down" also updates the other visualizations, including presentation layer 80, accordingly. For example, snapshot 94 is shown as being selected from customer experience data 92.
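The granularity selection described for the exploration layer can be pictured as re-aggregating the same records at a finer time bucket. The following is a minimal sketch under stated assumptions: the record layout, the `GRANULARITIES` table, and the `summarize` function are hypothetical names introduced here, not elements of the disclosed system.

```python
from collections import Counter
from datetime import date

# Hypothetical feedback records: (date received, topic label).
RECORDS = [
    (date(2016, 5, 3), "billing"),
    (date(2016, 5, 3), "shipping"),
    (date(2016, 5, 17), "billing"),
    (date(2016, 6, 2), "billing"),
]

# Each granularity maps a date to the key of the bucket it falls into.
GRANULARITIES = {
    "year": lambda d: d.strftime("%Y"),
    "month": lambda d: d.strftime("%Y-%m"),
    "day": lambda d: d.isoformat(),
}


def summarize(records, granularity, topic=None):
    """Count records per time bucket, optionally restricted to one topic."""
    bucket_of = GRANULARITIES[granularity]
    return Counter(
        bucket_of(day) for day, label in records if topic is None or label == topic
    )


# Overview at monthly granularity, then a drill-down to daily counts for one topic.
overview = summarize(RECORDS, "month")
detail = summarize(RECORDS, "day", topic="billing")
```

Because both views are computed from the same records, a change of granularity in one view can be propagated to the others simply by calling `summarize` again with the new setting.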
FIG. 13 is a schematic diagram illustrating one example embodiment of an annotation layer 100 of the unstructured data analysis system 50 (FIG. 10) of the present invention. The annotation layer 100 is configured to display various results, as well as customer experience data 102, telecommunications data 104, emails 106, social media data 108, other data 110, and the like, and to receive user annotations 112, which may be accessed by all or selected users via a shared user interface 114.
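One possible way to model annotations that are accessible to all or to selected users is sketched below. The class names `Annotation` and `AnnotationLayer`, the `shared_with` field, and the convention that an empty sharing set means "visible to everyone" are assumptions made for this illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    author: str
    target: str  # hypothetical identifier of the annotated item
    note: str
    shared_with: frozenset = frozenset()  # assumption: empty means all users


@dataclass
class AnnotationLayer:
    annotations: list = field(default_factory=list)

    def annotate(self, author, target, note, shared_with=()):
        """Record a note on a target, optionally limited to selected users."""
        annotation = Annotation(author, target, note, frozenset(shared_with))
        self.annotations.append(annotation)
        return annotation

    def visible_to(self, user, target):
        """Return the notes on `target` that `user` is allowed to see."""
        return [
            a.note
            for a in self.annotations
            if a.target == target
            and (a.author == user or not a.shared_with or user in a.shared_with)
        ]


layer = AnnotationLayer()
layer.annotate("ana", "topic:billing", "Spike follows the May price change.")
layer.annotate("ben", "topic:billing", "Draft: check refunds.", shared_with={"ana"})
```

In this sketch, a third user sees only the first note on `topic:billing`, while `ana` sees both, which mirrors the "all or selected users" access described above.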
Although the invention has been illustrated and described herein with reference to preferred embodiments and specific examples thereof, those skilled in the art will readily appreciate that other embodiments and examples may perform similar functions and/or achieve similar results. It is therefore to be understood that all such equivalent embodiments and examples are within the spirit and scope of the present invention and are intended to be covered by the appended claims.

Claims (12)

1. An unstructured data analysis system comprising:
an unstructured data analysis algorithm residing on a server and accessible via a browser, the unstructured data analysis algorithm operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to a plurality of users;
wherein the summary information is displayed to the plurality of users in a presentation layer, an exploration layer, and an annotation layer,
the presentation layer displays one or more of: unstructured data, a summary of unstructured data, and the summary information,
the exploration layer allows the plurality of users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer,
the plurality of users are capable of simultaneously interacting with the unstructured data analysis system via an annotation layer, the annotation layer operable to annotate the unstructured data, a summary of the unstructured data, and the summary information, wherein annotations made by one of the plurality of users are accessible to others of the plurality of users via the annotation layer.
2. The system of claim 1, wherein the unstructured data comprises one or more of: customer experience data, telecommunications data, email data, social media data, and transaction data.
3. The system of claim 1, wherein the unstructured data analysis algorithm is further operable to: receive external data from one or more remote sources.
4. The system of claim 3, wherein the external data comprises one or more of: internet data, government data, and business data.
5. The system of claim 1, wherein the one or more analysis tools applied to the unstructured data comprise one or more of: statistical algorithms, machine learning, natural language processing, and text mining.
6. The system of claim 1, wherein the summary information is also displayed to the plurality of users in a combined layer.
7. An unstructured data analysis method, comprising:
providing an unstructured data analysis algorithm residing on a server and accessible via a browser, the unstructured data analysis algorithm operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to a plurality of users;
wherein the summary information is displayed to the plurality of users in a presentation layer, an exploration layer, and an annotation layer,
the presentation layer displays one or more of: unstructured data, a summary of unstructured data, and the summary information,
the exploration layer allows the plurality of users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer,
the plurality of users are capable of simultaneously interacting with the unstructured data analysis system via an annotation layer, the annotation layer operable to annotate the unstructured data, a summary of the unstructured data, and the summary information, wherein annotations made by one of the plurality of users are accessible to others of the plurality of users via the annotation layer.
8. The method of claim 7, wherein the unstructured data comprises one or more of: customer experience data, telecommunications data, email data, social media data, and transaction data.
9. The method of claim 7, wherein the unstructured data analysis algorithm is further operable to: receive external data from one or more remote sources.
10. The method of claim 9, wherein the external data comprises one or more of: internet data, government data, and business data.
11. The method of claim 7, wherein the one or more analysis tools applied to the unstructured data comprise one or more of: statistical algorithms, machine learning, natural language processing, and text mining.
12. The method of claim 7, wherein the summary information is also displayed to the plurality of users in a combined layer.
CN201610496280.9A 2015-05-11 2016-06-28 Unstructured data analysis system and method Active CN107368506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011265115.5A CN112732878A (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201562159662P 2015-05-11 2015-05-11
US15/151,572 US10452698B2 (en) 2015-05-11 2016-05-11 Unstructured data analytics systems and methods
US15/151,572 2016-05-11

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202011265115.5A Division CN112732878A (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method

Publications (2)

Publication Number Publication Date
CN107368506A CN107368506A (en) 2017-11-21
CN107368506B true CN107368506B (en) 2020-11-06

Family

ID=60312579

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201610496280.9A Active CN107368506B (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method
CN202011265115.5A Pending CN112732878A (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202011265115.5A Pending CN112732878A (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method

Country Status (1)

Country Link
CN (2) CN107368506B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170657A (en) * 2018-01-04 2018-06-15 陆丽娜 A kind of natural language long text generation method
CN109299286A (en) * 2018-09-28 2019-02-01 北京赛博贝斯数据科技有限责任公司 The Knowledge Discovery Method and system of unstructured data
CN110413782B (en) * 2019-07-23 2022-08-26 杭州城市大数据运营有限公司 Automatic table theme classification method and device, computer equipment and storage medium
CN112883186B (en) * 2019-11-29 2024-04-12 智慧芽信息科技(苏州)有限公司 Method, system, equipment and storage medium for generating information map

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308498A (en) * 2008-07-03 2008-11-19 上海交通大学 Text collection visualized system
CN102750355A (en) * 2012-06-11 2012-10-24 清华大学 Visual management method for non-structured data management system
CN102929894A (en) * 2011-08-12 2013-02-13 中国人民解放军总参谋部第五十七研究所 Online clustering visualization method of text
US9135242B1 (en) * 2011-10-10 2015-09-15 The University Of North Carolina At Charlotte Methods and systems for the analysis of large text corpora

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1402408A1 (en) * 2001-07-04 2004-03-31 Cogisum Intermedia AG Category based, extensible and interactive system for document retrieval
US7849048B2 (en) * 2005-07-05 2010-12-07 Clarabridge, Inc. System and method of making unstructured data available to structured data analysis tools
US9684683B2 (en) * 2010-02-09 2017-06-20 Siemens Aktiengesellschaft Semantic search tool for document tagging, indexing and search
KR101481253B1 (en) * 2013-03-14 2015-01-13 한국과학기술원 Method and system for providing summery of text document using word cloud
CN103473369A (en) * 2013-09-27 2013-12-25 清华大学 Semantic-based information acquisition method and semantic-based information acquisition system
US20160071212A1 (en) * 2014-09-09 2016-03-10 Perry H. Beaumont Structured and unstructured data processing method to create and implement investment strategies

Also Published As

Publication number Publication date
CN112732878A (en) 2021-04-30
CN107368506A (en) 2017-11-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant