
KR102025813B1 - Device and method for chronological big data curation system - Google Patents


Info

Publication number
KR102025813B1
Authority
KR
South Korea
Prior art keywords
topic
data
node set
association
topic data
Prior art date
Application number
KR1020180036540A
Other languages
Korean (ko)
Other versions
KR20180111646A (en)
Inventor
한상용
최승진
서지완
유가람
Original Assignee
중앙대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 중앙대학교 산학협력단
Publication of KR20180111646A
Application granted
Publication of KR102025813B1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a chronological information-based curation apparatus and a control method thereof, and more particularly to a chronological information-based curation apparatus and method for providing event flow information. The present invention organizes various types of information scattered on the Internet through topics and associations, presents the organized data to users in an easily accessible form, and makes the organized data reusable. It can thereby generate valuable information through the maintenance, management, and provision of information.

Figure R1020180036540

Description

Chronological information-based curation apparatus and its control method for providing event flow information {DEVICE AND METHOD FOR CHRONOLOGICAL BIG DATA CURATION SYSTEM}

The present invention relates to a chronological information-based curation apparatus and method, and more particularly to a chronological information-based curation apparatus and control method for providing event flow information.

With the advent and development of web technologies, vast amounts of different kinds of data are being produced rapidly and the amount of information is increasing significantly. In this big data era, consumers can get more data and information than ever before, but it's not easy to sift through valuable information and knowledge. Finding valuable information from large amounts of data is becoming increasingly important, and many countries and companies are spending a lot of time and money on data acquisition and analysis.

Digital curation is the work of organizing various kinds of information scattered on the Internet through themes and associations, presenting the organized data to users in an easy and accessible way, and making the data reusable. Such digital curation can generate valuable information through maintenance, management, and provision of information, and can improve accessibility and reusability of information.

As it takes a lot of effort and cost to search, understand, and identify important topics in this big data age, one of the important issues is to create a profitable and reliable digital curation device that can satisfy the needs of users.

In general, existing information systems provide search services based on queries.

Semantic search is a search that can improve search accuracy by understanding the searcher's intention and the contextual meaning of words in order to show better performance and get more accurate results. However, the current search system including such a semantic system has a great difficulty in the representation and ranking of search data considering the time decay effect, since only the time-independent parameters are considered.

Some events occur over time, and their importance and interests change over time. When various events and / or accidents become entangled with others, it becomes difficult to understand the inherent and fundamental meanings associated with the search data.

Accordingly, there is a need for research related to digital curation devices and systems that further consider the time of occurrence of an event.

It is an object of the present invention to solve the above and other problems. Another object is to provide a curation apparatus that can classify data more efficiently in consideration of the occurrence time between events.

The present invention provides a chronological information-based curation apparatus and method that provides the relevant core information chronologically in order to help a comprehensive understanding of a specific event or accident that a user tries to search in a big data environment.

The present invention collects event or accident data from a variety of sources, analyzes the related information in chronological order, models specific events or knowledge over time, and presents the result through visualization tools to aid understanding of those events or that knowledge. Provided are a chronological information-based curation apparatus and method that generate reusable information and reduce repetitive searching.

The technical problems to be achieved in the present invention are not limited to the technical problems mentioned above, and other technical problems not mentioned above will be clearly understood by those skilled in the art from the following description.

According to an aspect of the present invention to achieve the above or another object, there is provided a curation apparatus including: a data collector for collecting a plurality of topic data from a data source on the web; a graph modeling unit configured to extract an association between the collected plurality of topic data; and a chronological order analysis unit that classifies, based on the extracted association, an internal node set directly associated with a predetermined topic among the plurality of topic data and an external node set indirectly associated with the predetermined topic.

The data source may include a post of a social network service (SNS) and a news article.

The predetermined topic may be a topic about a keyword of a search word received from a user.

In addition, an association relationship between two topic data may be quantified by the number of documents in which the two topic data are included together in a single document.

The chronological order analysis unit may classify topic data having an association relationship with the predetermined topic into the internal node set, and may classify topic data that is not associated with the predetermined topic but is related to it in creation time into the external node set.

The chronological analysis unit may sort the classified plurality of topics in order of creation time.

According to another aspect of the present invention for achieving the above or another object, there is provided a control method including: collecting a plurality of topic data from a data source on the web; extracting an association between the collected plurality of topic data; and classifying, based on the extracted association, an internal node set directly associated with a predetermined topic among the plurality of topic data and an external node set indirectly associated with the predetermined topic.

The effects of the curation apparatus and its control method according to the present invention will be described below.

According to at least one embodiment of the present invention, various types of information scattered on the Internet can be organized through topics and associations, and the organized data can be easily viewed by users and reused. In this way, valuable information can be generated by maintaining, managing, and providing the information.

Further scope of the applicability of the present invention will become apparent from the following detailed description. However, various changes and modifications within the spirit and scope of the present invention can be clearly understood by those skilled in the art, and therefore, specific embodiments, such as the detailed description and the preferred embodiments of the present invention, should be understood as given by way of example only.

FIG. 1 is a block diagram of a curation apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the differing characteristics of SNS, blogs, and news articles according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a topic alignment structure for aligning main topics and subtopics according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram for explaining a time relation graph according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram for describing a similarity relation graph according to an embodiment of the present invention.
FIGS. 6 and 7 are views illustrating an example in which the reuse unit 400 stores the tree structure according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a control method of a curation apparatus according to an embodiment of the present invention.
FIG. 9 is a graph of the result of searching for Ahn Cheol-soo, a famous Korean politician, by a curation apparatus according to an embodiment of the present invention.
FIG. 10 is a graph of the search result for the "Nonghyup" hacking incident at a Korean bank by the curation apparatus according to an embodiment of the present invention.
FIG. 11 is a graph showing a result that differs from the above experimental results by the curation apparatus according to an embodiment of the present invention.

Hereinafter, exemplary embodiments disclosed herein will be described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numerals regardless of the drawing, and redundant description thereof is omitted. The suffixes "module" and "unit" for components used in the following description are given or used interchangeably only for ease of drafting the specification, and do not have distinct meanings or roles from each other. In addition, in describing the embodiments disclosed herein, when it is determined that a detailed description of related known technology may obscure the gist of the embodiments, that detailed description is omitted. The accompanying drawings are intended only to facilitate understanding of the embodiments disclosed herein; the technical spirit disclosed herein is not limited by the accompanying drawings and should be understood to include all changes, equivalents, and substitutes included in the spirit and scope of the present invention.

Terms including ordinal numbers such as first and second may be used to describe various components, but the components are not limited by the terms. The terms are used only for the purpose of distinguishing one component from another.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between. On the other hand, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that there is no other component in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "comprises" or "having" are intended to indicate the presence of a feature, number, step, operation, component, part, or combination thereof described in the specification, and should be understood not to exclude in advance the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present invention relates to a curation apparatus and a control method thereof, that is, a method of determining how to display various data. For example, in the case of news articles, when a specific keyword is entered into an existing search system such as 'Naver News', only nearly identical articles (with slightly different titles) from the same time appear. As a result, the search result screen of such a system makes it somewhat difficult to check the variety of results related to the search keyword. To overcome this problem, the present invention composes a plurality of topics (multiple keywords of articles over time) related to the specific search keyword and appropriately divides them by various parameters, with the aim of providing effective search results to the user.

In particular, the present invention proposes to consider indirect relations as well as direct relations in constructing a plurality of topics. When searching, the indirectivity is inferred by considering the time when the topic occurred closest to the topic occurrence time of the entered topic (keyword). Indirectness may refer to a criterion that a person does not judge.

Take the experimental example of a forest fire in Pohang. A search for the keyword "forest fire" would ordinarily show all areas associated with forest fires. According to an embodiment of the present invention, because indirectness by time is also considered, a search for the keyword "Pohang" may additionally show fires that occurred in other regions. "Pohang" and "forest fire" have no direct relationship (apart from this particular case), but the present invention can recognize an indirect relationship because the two topics were generated at the same time. Because of this indirect relevance, when the search term "Pohang" is entered, the topic "forest fire" is included as well (such topics are hereinafter referred to as external topics), and other search results related to forest fires are shown to the user through that topic.

The present invention considers both a set of topics that are directly related to each other (hereinafter, the Internal Node Group) and a set of topics that are indirectly related by time (hereinafter, the External Node Group). In addition, a threshold (using Maximum Cut) determines which search results to provide, so that only the results of interest are shown. An appropriate threshold that keeps the time difference small (topics that occurred close in time to the searched topic) and the similarity at its maximum (topics similar to the searched topic) may provide only the most significant results. That is, the internal node group may hold the topics a user would search for by intuition, and the external node group may hold topics provided to find an included region.

FIG. 1 is a block diagram of a curation apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the chronological information-based curation apparatus according to an embodiment of the present invention may include a data collector 100, a graph modeling unit 200, a chronological analysis unit 300, and a reuse unit 400.

The data collector 100 collects a plurality of topic data from a data source on the web. Data sources here may include posts from social network services (SNS), news articles, and any other source that can provide data through the WWW (World Wide Web) protocol or other communication protocols.

The topic data may mean keyword data about an event or a subject that may receive attention or attention of people. For example, it could be words like 'presidential election' or 'fire'. A specific method of collecting topic data will be described later in detail.

The data collection unit 100 does not simply collect only topic data, but may also collect the date and time at which the topic data was generated (hereinafter, the generation time).

The generation time, the topic data, and the association data collected from each data source by the data collector 100 undergo a preprocessing process (performed by the graph modeling unit), described later, to improve accuracy and to allow the data to be used efficiently in later steps.

In particular, in one embodiment of the present invention, it is proposed to collect data by dividing the characteristics of the social network service (SNS) of the data source into at least two as follows.

For example, it may be divided into a first SNS that allows users to easily share their daily life on a mobile device or a PC, and a second SNS provided for sharing detailed information. Examples of the first SNS include KakaoTalk, Twitter, Instagram, and Facebook. Examples of the second SNS include blogs and Brunch.

The reason for distinguishing the characteristics of SNS is that the type of data collected may vary according to the characteristics.

FIG. 2 is a diagram illustrating the differing characteristics of SNS, blogs, and news articles according to an embodiment of the present invention. In the diagram shown, SNS means the above-mentioned first SNS, and blog refers to the above-mentioned second SNS.

In the case of the first SNS for daily conversations, the length of the document is very short and the propagation power is very high because the users simply write their opinions, but the reliability is somewhat insufficient.

Even in the case of the second SNS for sharing information, the document length is on the short side; however, in contrast to the first SNS, its propagation power is somewhat lacking, while its reliability is at a medium level, higher than that of the first SNS.

Unlike the two examples above, a news article is more reliable than either SNS described above, since it is written by a professional journalist, and its propagation power is also very high. In addition, since a news article has at least a certain length, its document length may exceed that of the two SNS types described above.

In one embodiment of the present invention, when the propagation power is high as in the first SNS, but somewhat lower in terms of document length or reliability, it is proposed to collect only topic data and generation time of the topic data.

In addition, in one embodiment of the present invention, since the second SNS and the news article are data sources of somewhat high reliability, it is proposed to collect the correlations between the data.

The associations thus collected are used to classify topic data below.

Extraction (collection) of such an association is performed by the graph modeling unit 200.

The graph modeling unit 200 performs a preprocessing process to obtain meaningful data from the data collected by the data collecting unit 100. Preprocessing is a process that encompasses all the various processes necessary to select only the important data needed for analysis.

The graph modeling unit 200 extracts an association relationship between the plurality of topic data. In addition, the correlation may be extracted (collected) from a data source whose reliability is recognized to some extent, such as a second SNS (blog or the like) and a news article in the above-described example. The data on the association may be data using at least one of similarity and co-occurrence between the plurality of topic data. That is, the graph modeling unit 200 uses at least one of similarity and co-occurrence frequency in determining an association relationship between any one topic data and another topic data.

The co-occurrence frequency may refer to the number of documents in which the two topic data appear together in a single document, that is, the number of cases where the words representing each of the two topic data are included in one document at the same time. For example, to determine the association between the first topic data and the second topic data, the number of single documents that simultaneously include both may be taken as the association between the first and second topic data. In this case, the higher the count, the stronger the association between the two topic data.

If there is a strong association between two topic data (that is, the association value is high), the two are divided into main topic and subtopic in order of creation time. The relationship between topic data can be represented as a graph structure that can be extended with MapReduce for parallel processing.
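The co-occurrence count described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cooccurrence_counts`, the sample documents, and the use of plain substring matching to decide whether a topic appears in a document are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, topics):
    """Count, for every pair of topics, the documents that contain both.

    Assumption: a topic "appears" in a document if its keyword occurs
    as a substring of the document text.
    """
    counts = Counter()
    for text in documents:
        # Sort so that each pair is always stored in the same order.
        present = sorted(t for t in topics if t in text)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample documents.
docs = [
    "election candidate debate",
    "election candidate resigns",
    "forest fire spreads",
]
counts = cooccurrence_counts(docs, ["election", "candidate", "fire"])
```

Here `counts[("candidate", "election")]` is the association value for that pair; pairs that never share a document simply have a count of zero.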

As will be described in detail below, in the graphs of the collected data in this manner, each topic data is represented by a node on a network, and an association relationship is represented by a link structure that connects the node and the node to each other.

FIG. 3 is a diagram illustrating a topic alignment structure for aligning main topics and subtopics according to an embodiment of the present invention. The illustrated topic alignment structure is a hierarchical structure of topic data associated with a given topic (specific event).

Each topic data may include subtopic data according to a generation time (event creation time), and may recursively have a subtopic (subtopic data as a subtopic of subtopic data). For each main topic, the subtopics are represented hierarchically using a directed graph structure.

As shown in FIG. 3, the graph is composed of topic nodes and directional links, each of which connects two nodes. The direction of a link may indicate the order of generation time, and its value may indicate the difference in generation time. Although values are not shown in FIG. 3, these values will be described later with reference to FIGS. 4 and 5.

In FIG. 3, node A 301-1 represents a main topic (e.g., a topic about a search keyword input by a user), and the other nodes 301-2 to 301-6 represent lower topics. This topic alignment structure is useful for understanding the hierarchical structure of the topics. Analyzing the graph shows which events occurred as a series of processes. In FIG. 3, node C 301-3 has nodes E and F (301-5, 301-6) for the two subtopics E and F, and it can be confirmed that main topic A has an implicit relationship with E and F. Through this, it can be seen that subtopic C occurs after main topic A, and that topics E and F follow topic C.
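The implicit relationship between main topic A and subtopics E and F can be illustrated by following the directed links of the hierarchy. The sketch below is only an illustration: the adjacency dictionary `children` contains just the links the text states explicitly (A to C, C to E, C to F), and the function name `descendants` is hypothetical.

```python
def descendants(children, root):
    """Return every topic reachable from root by following directed links."""
    found = []
    stack = list(children.get(root, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.append(node)
            stack.extend(children.get(node, []))
    return found


# Only the links described in the text for FIG. 3; other nodes are omitted.
children = {"A": ["C"], "C": ["E", "F"]}
reachable_from_a = descendants(children, "A")
```

Because E and F are reachable from A only through C, the traversal makes the indirect (implicit) relationship explicit without storing a direct A-to-E or A-to-F link.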

Two types of links for graphically representing an association between two topic data will be described with reference to FIGS. 4 and 5.

FIG. 4 is an exemplary diagram for explaining a time relation graph according to an embodiment of the present invention. FIG. 5 is an exemplary diagram for describing a similarity relation graph according to an embodiment of the present invention.

Referring to FIG. 4, the value of a link connecting two topic data means the difference in generation time between the two topic data. That is, the difference between the generation time of the first topic data and the generation time of the second topic data is quantified. The temporal link is directional depending on the generation time.

FIG. 4 shows time-related links among five topic data (401-1 to 401-5). It indicates that topic data B 401-2 was generated 25 hours after topic data A 401-1, and that topic data D 401-4 was generated 1 hour after topic data A 401-1.

FIG. 5 shows links that represent only similarity, regardless of directionality. Such a similarity link may have a value obtained by quantifying the co-occurrence frequency described above using TF-IDF (Term Frequency-Inverse Document Frequency).

After the time-related links shown in FIG. 4 are completed, the similarity between the topic data may be calculated as in FIG. 5. As described above, similarity is recognized between two topic data that occur together in a document. To quantify the relationship between each pair of topic data (keywords), the co-occurrence frequency in blogs and news is used.

Referring again to FIG. 1, the chronological order analysis unit 300 will be described.
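A time-related link of this kind can be sketched as a directed edge whose value is the difference in generation time. In the sketch below, the function name `time_link` and the concrete timestamps are hypothetical; the timestamps were chosen only so that the differences match the 25-hour and 1-hour values described for FIG. 4.

```python
from datetime import datetime


def time_link(created, a, b):
    """Return (source, target, hours), directed from the earlier topic to the later one."""
    first, second = (a, b) if created[a] <= created[b] else (b, a)
    hours = (created[second] - created[first]).total_seconds() / 3600
    return first, second, hours


# Hypothetical generation times for topics A, B, and D.
created = {
    "A": datetime(2018, 3, 1, 9, 0),
    "B": datetime(2018, 3, 2, 10, 0),
    "D": datetime(2018, 3, 1, 10, 0),
}
link_ab = time_link(created, "B", "A")
link_ad = time_link(created, "A", "D")
```

The argument order does not matter: the link always points from the earlier topic to the later one, which mirrors the statement that the temporal link is directional depending on generation time.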

The chronological analysis unit 300 classifies topic data into an Internal Chronological Nodes Group, which is a set of events occurring in chronological order on a specific subject (main topic), and an External Chronological Nodes Group, which is a set of potential events that have no direct correlation with the specific topic (main topic). That is, based on the association between the topic data extracted by the graph modeling unit 200, the topic data is divided into the internal node set and the external node set.

An internal node set represents a set of topic data arranged in a series of time sequences and related to the main topic. The external node set is a set of topic data that is indirectly related to the main topic.

The relationship between the parent node (parent node) and the child node (child node) is determined in chronological order.

Although the external node set has a low association with the main topic, it represents similar events that occurred at similar times. That is, as shown in FIG. 3, the external topic 304-1 at the top of the external node set may have little or no association with the main topic 301-1, but it is connected because its occurrence time is close.

The number of nodes finally shown in the topic alignment structure of FIG. 3 may be limited to an appropriate number according to the creation time and the similarity value of the links. In particular, in one embodiment of the present invention, it can be limited to an appropriate number based on the maximum cut method. That is, the number of nodes to be displayed can be controlled through a maximum cut threshold value that is preset or input by the user.

On the other hand, if a different topic appears long after the first topic appears, the relationship between the two topics may not be important. Or, even though the similarity between the two topics is high, relevant topics may reappear after a long time. In both cases, the nodes to be shown to the user are selected with a maximum cut threshold that favors a high similarity weight and a minimum temporal weight.
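The split into internal and external node sets can be sketched with two thresholds. This is an assumption-laden illustration rather than the method of the patent: the function name `classify_topics`, the parameter names `sim_threshold` and `max_hours`, their default values, and the sample data are all hypothetical. It treats a topic as internal when its similarity to the main topic reaches the threshold, and as external when its similarity is lower but its generation time is close to that of the main topic.

```python
def classify_topics(similarity, hours_from_main, sim_threshold=0.5, max_hours=24.0):
    """Split topics into (internal, external) sets, each sorted by generation time.

    similarity:      topic -> similarity to the main topic (0..1)
    hours_from_main: topic -> hours between the main topic and this topic
    """
    internal = [t for t, s in similarity.items() if s >= sim_threshold]
    external = [
        t for t, s in similarity.items()
        if s < sim_threshold and abs(hours_from_main[t]) <= max_hours
    ]

    def by_time(topic):
        return hours_from_main[topic]

    return sorted(internal, key=by_time), sorted(external, key=by_time)


# Hypothetical values relative to a main topic.
similarity = {"candidate": 0.8, "resignation": 0.6, "forest fire": 0.1, "candy": 0.05}
hours = {"candidate": 30, "resignation": 5, "forest fire": 2, "candy": 400}
internal, external = classify_topics(similarity, hours)
```

In this example "forest fire" lands in the external set because it is weakly associated but close in time, while "candy" is dropped entirely because it is both weakly associated and far away in time.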

The reuse unit 400 may visualize a graph according to the above-described method and store the generated result. That is, the results of the chronological analysis according to the above-described method are visually shown and the analyzed results are stored in a tree structure for later use.

FIGS. 6 and 7 are views illustrating an example in which the reuse unit 400 stores the tree structure according to an embodiment of the present invention.

Referring to FIGS. 6 and 7, when the graph according to the above-described method is configured, the reuse unit 400 stores all the information of the graph in a standardized format. As a result, the availability and reusability of the data increase, which is useful for future maintenance.

The analysis results are stored in a tree structure so that they can be used later in the chronological analysis. For example, the analysis results are expressed in a tree structure using an XML (eXtensible Markup Language) format as shown in FIG. 7. That is, the graph data of FIG. 6 may be stored in an XML format as shown in FIG. 7. These analysis results can be provided as an API or used again when a user enters a keyword for a key event by entering a query.
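Storing a topic tree as XML can be sketched with Python's standard `xml.etree.ElementTree` module. The element name `topic`, the attribute `name`, and the function name `to_xml` are hypothetical; the actual schema of FIG. 7 is not given in the text, so this only illustrates the general idea of nesting subtopics inside their parent topic.

```python
import xml.etree.ElementTree as ET


def to_xml(topic, children):
    """Recursively convert a topic and its subtopics into nested XML elements."""
    element = ET.Element("topic", name=topic)
    for child in children.get(topic, []):
        element.append(to_xml(child, children))
    return element


# Hypothetical tree: main topic A, subtopic C, and C's subtopics E and F.
children = {"A": ["C"], "C": ["E", "F"]}
root = to_xml("A", children)
xml_text = ET.tostring(root, encoding="unicode")

# The stored text can be parsed back for later reuse.
restored = ET.fromstring(xml_text)
```

Because the serialized text parses back into the same hierarchy, a stored result could be reloaded for a later query instead of repeating the analysis.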

FIG. 8 is a flowchart illustrating a control method of a curation apparatus according to an embodiment of the present invention.

According to the drawing, in step S810 it is possible to collect a plurality of topic data from a data source on the web. As an example, as described above, the topic data and the creation time may be collected from the first SNS data source, and the correlation between the data may be collected from a second SNS data source such as a blog or a news article.

In step S830, the chronological order analysis unit 300 may classify the topic data into an internal node set and an external node set.

In operation S840, the reuse unit 400 stores the analysis results in a tree structure for visualization and reuse through graphs for the results of the chronological analysis.

In the following drawings and related descriptions, data analyzed through actual experiments will be described in detail.

We collected data from Twitter, Naver Blog, and Naver News over one month. The data collected from Twitter totals 1.5 million tweets, about 30 GB in size. After filtering out meaningless words, a total of 600 topic data were extracted from Twitter, and a graph was constructed using this topic data. We also collected 22,692 Naver blog posts and 16,288 Naver News articles. Blog data averaged 328 words per document, and news data averaged 254 words per document. The links in the graph were constructed using the Naver blog and Naver News data, and 123,894 links were created in the topic alignment structure. In this experiment, we limited the scope of the data set to SNS, blogs, and news, but it can be easily extended by adding preprocessing modules to collect data from more diverse sources.

To extract the topic data from the data source, we used 'Komoran', a noun extractor. Creation time and keywords were used to extract topic data. The relevant blog and news data was collected using keywords extracted from the Twitter data source. The co-occurrence frequency of the words in each document was extracted from the blog and news data, topic and TF-IDF pairs were calculated, and the similarity links corresponding to the relationships were normalized between 0 and 1.

The time-relational link was calculated based on the creation time of the post containing the topic extracted from Twitter data. The time-relational link value is calculated from the difference between the document generation time and the normalized (or averaged) value, and is likewise normalized from 0 to 1. Graphs are constructed using topics, similarity, and time relationships (association relationships). The sets of internal and external nodes are organized according to chronological analysis. If the similarity link representing an association has a value higher than a certain threshold, the linked nodes are included as parent and child nodes. This set is called the Internal Chronological Group. Other nodes that did not exceed the threshold were classified as the External Chronological Group.
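The text states that link values are normalized to the range 0 to 1 but does not say how. One common choice is min-max scaling, sketched below purely as an assumption; the function name `min_max_normalize` and the sample hour values are hypothetical.

```python
def min_max_normalize(values):
    """Scale a mapping of raw link values into the range [0, 1].

    Assumption: min-max scaling; the source does not specify the method.
    """
    low, high = min(values.values()), max(values.values())
    if high == low:
        # All values are equal, so there is no spread to scale.
        return {key: 0.0 for key in values}
    return {key: (value - low) / (high - low) for key, value in values.items()}


# Hypothetical raw time differences in hours for three links.
raw_hours = {"A-D": 1, "A-C": 13, "A-B": 25}
normalized = min_max_normalize(raw_hours)
```

With both similarity and time links on the same 0-to-1 scale, a single threshold can be compared against either kind of link value.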

9 is a result graph of a result of searching for Ahn Cheol-soo, a famous Korean politician, by a curation apparatus according to an embodiment of the present invention. "Ahn Cheol-soo" is the main topic, and "Kim Jong-hoon" is composed of subtopics. Ahn Cheol-soo was an opposition presidential candidate, Kim Jong-hoon was considered a candidate for the ruling party, and Ahn later resigned from the presidential candidate to support the opposition candidate. This series of events is visualized as a graph divided into the inner node set and the outer node set related to Ahn.

FIG. 10 is a graph of a search result for the hacking incident at "Nonghyup", a Korean bank, by the curation apparatus according to an embodiment of the present invention. One of the main suspicions raised after the incident was that "North Korea" was behind it, and people thought that this hacking incident was related to "Cheonan". The external link to "North Korea" is Kim Kwan-jin, the South Korean Defense Minister at the time of the incident. This series of events is visualized as a graph divided into the internal node set and the external node set related to "Nonghyup". Some keywords appear to be unrelated, but in most cases the results were related to "hacking".

FIG. 11 is a graph showing a result, produced by the curation apparatus according to an embodiment of the present invention, that differs from the above experimental results. White Day, March 14, is a day on which men give candy to women. The graph shows that "White Day" has a subtopic "Candy". Although these two keywords are semantically similar, it is difficult to confirm a time relationship between them. The proposed Chronological Big Data Curation method works robustly in most cases for a variety of search keywords and helps the user understand a series of topics chronologically. However, it was confirmed that the method has limits for general themes and for one-time events that do not last long.

The evaluation was performed by comparing and analyzing the chronological information-based curation method according to an embodiment of the present invention against an existing method. The assessment criterion was the question, "Can a user obtain a comprehensive understanding of the topic from a search term?" The inventors compared the method with Naver News, the most widely used news portal site in Korea.

FIG. 12 shows Naver News results for the search keywords "Ahn Cheol-soo" and "hacking". Referring to FIG. 12(a), the result for "Ahn Cheol-soo", the first page of search results displays only articles from March 31, the most recent date, even when the accuracy sort option is selected. An article from March 29, two days earlier, first appears at the 130th position. As this figure shows, it is difficult to understand the flow of the whole event from the news results alone. In contrast, in FIG. 12(b), the result for the search keyword "hacking", the results are focused on the event, so the entire event can be easily understood.

For a more accurate assessment, we performed a quantitative evaluation of the existing method and 'Chronological Big Data Curation'. For the quantitative evaluation, we measured the unique words that appeared in the articles.

FIG. 13 shows a comparison between the results of the chronological information-based curation method according to an embodiment of the present invention and the Naver News search results.

The average value of each of the following items was calculated per search term. 1) Unique word rate: the ratio of proper nouns that can identify the topic of the news. 2) Time distribution: the average time span of the creation times of the articles shown in the final result. Both are calculated over the top N results.
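One possible reading of these two measures is sketched below. The text does not give exact formulas, so this sketch assumes that the unique word rate is the share of distinct words in the top N results that are topic-identifying proper nouns, and that the time span is the number of days between the earliest and latest creation dates among those results. All names (`unique_word_rate`, `time_span_days`, `evaluate`) and the sample data are hypothetical.

```python
from datetime import date


def unique_word_rate(result_words, proper_nouns):
    """Share of distinct words that are topic-identifying proper nouns."""
    distinct = set(result_words)
    if not distinct:
        return 0.0
    return len(distinct & set(proper_nouns)) / len(distinct)


def time_span_days(creation_dates):
    """Days between the earliest and latest creation date."""
    if not creation_dates:
        return 0
    return (max(creation_dates) - min(creation_dates)).days


def evaluate(results, proper_nouns, n):
    """Compute both measures over the top n search results."""
    top = results[:n]
    words = [word for result in top for word in result["words"]]
    dates = [result["created"] for result in top]
    return unique_word_rate(words, proper_nouns), time_span_days(dates)


# Hypothetical search results, for illustration only.
results = [
    {"words": ["alpha", "beta"], "created": date(2020, 1, 1)},
    {"words": ["alpha", "gamma", "delta"], "created": date(2020, 1, 4)},
    {"words": ["epsilon"], "created": date(2020, 2, 1)},
]
rate, span = evaluate(results, {"alpha", "gamma"}, n=2)
```

Averaging `rate` and `span` over several search terms would then give per-method values comparable to those described for the figure.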

When the top 5 to top 15 search results are compared, the ratio of unique words for the curation method according to an embodiment of the present invention is higher than that of the existing Naver News search results in every case. The time distribution of the search results likewise confirms that the curation method according to an embodiment of the present invention has a higher value.

In other words, it can be appreciated that the results according to the present invention can provide more comprehensive and accurate search results than Naver news search results.

As described above, an embodiment of the chronological information-based curation apparatus and a control method using the same has been described, but it is described as at least one embodiment, and the technical idea of the present invention and its configuration and operation are not limited thereby. The scope of the technical idea of the present invention is not limited by the drawings or by the description given with reference to the drawings. In addition, the concepts and embodiments presented in the present invention may be used by those skilled in the art as a basis for modifying or designing other structures that carry out the same purposes of the present invention. Equivalent structures modified or changed by those skilled in the art to which the present invention pertains fall within the technical scope of the present invention described in the claims, and various changes, substitutions, and alterations are possible without departing from the spirit or scope of the invention described in the claims.

Claims (7)

1. A curation device comprising:
a data collector configured to collect a plurality of topic data and generation time information of the plurality of topic data from data sources on the web;
a graph modeling unit configured to extract associations between the collected plurality of topic data; and
a chronological analysis unit configured to classify, based on the extracted associations and the generation time information, the plurality of topic data into an internal node set directly related to a main topic and an external node set indirectly related to the main topic,
wherein the chronological analysis unit
divides the data sources on the web into a first group and a second group based on at least one of propagation power, document length, and reliability,
classifies, based on the associations extracted from the first group, topic data having an association with the main topic into the internal node set, and
classifies, based on the generation time information extracted from the second group, topic data related in generation time into the external node set.

2. The curation device of claim 1, wherein the data sources include social networking service (SNS) posts and news articles.

3. The curation device of claim 1, wherein the main topic is a topic corresponding to a search keyword received from a user.

4. The curation device of claim 1, wherein the association between two topic data is quantified as the number of documents in which the two topic data appear together in a single document.

5. (Deleted)

6. The curation device of claim 1, wherein the chronological analysis unit sorts the plurality of topics in order of creation time.

7. A control method of a curation device, the method comprising:
collecting a plurality of topic data and generation time information of the plurality of topic data from data sources on the web;
extracting associations between the collected plurality of topic data; and
classifying, based on the extracted associations and the generation time information, the plurality of topic data into an internal node set directly associated with a predetermined topic and an external node set indirectly associated with the predetermined topic,
wherein the classifying comprises:
dividing the data sources on the web into a first group and a second group based on at least one of propagation power, document length, and reliability;
classifying, based on the associations extracted from the first group, topic data having an association with a main topic into the internal node set; and
classifying, based on the generation time information extracted from the second group, topic data related in generation time into the external node set.

KR1020180036540A 2017-03-31 2018-03-29 Device and method for chronological big data curation system KR102025813B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR20170041412 2017-03-31
KR1020170041412 2017-03-31

Publications (2)

Publication Number Publication Date
KR20180111646A (en) 2018-10-11
KR102025813B1 (en) 2019-11-04

Family

ID=63864890

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020180036540A KR102025813B1 (en) 2017-03-31 2018-03-29 Device and method for chronological big data curation system

Country Status (1)

Country Link
KR (1) KR102025813B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210097443A (en) 2020-01-30 2021-08-09 울산과학기술원 Exhibition Curation Service System and Method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102541414B1 (en) * 2020-12-04 2023-06-12 주식회사 신한디에스 Apparatus for analyzing documents and method therefor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000020538A (en) * 1998-07-02 2000-01-21 Mitsubishi Electric Corp Method and device for retrieving information, and storage medium for information retrieving program
KR101680701B1 (en) * 2013-10-02 2016-11-29 (주)에이엔티홀딩스 System and method for providing contents curation service based on context

Also Published As

Publication number Publication date
KR20180111646A (en) 2018-10-11

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right