JP2009271799A

JP2009271799A - Company correlative information extracting system

Info

Publication number: JP2009271799A
Application number: JP2008122785A
Authority: JP
Inventors: Osamu Oshima; 修大島
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2008-05-08
Filing date: 2008-05-08
Publication date: 2009-11-19

Abstract

PROBLEM TO BE SOLVED: To realize a system that can automatically collect correlative information, such as cooperation and conflict between a pair of companies, from a large body of computerized document data.

SOLUTION: A company correlative information extracting system 10 comprises: a document DB 15 in which a plurality of document data, each associated with a release date, are accumulated; a company name DB 16 in which a plurality of company names are stored; an associated document extracting section 17 which extracts from the document DB 15 the document data in which a plurality of the company names appear, compares the release date and the appearing company names of each document data, and extracts a plurality of associated document data related to the same event; a keyword extracting section 18 for extracting keywords from each associated document data; and a company correlative information extracting section 21 for checking the presence or absence of each keyword in every associated document, identifying a keyword present in two or more associated documents as a correlative keyword, and storing company correlative information comprising the pair of company names and the correlative keywords in a company correlative information DB 22.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a company correlation information extraction system, and more particularly to a technique for automatically extracting correlations between companies from a large volume of document data such as newspaper articles.

For investors who put money into stocks and investment trusts, it is important to understand each company's relationships with other companies in addition to its financial data and human capital data. For example, which companies are listed among a company's stable shareholders is useful material for gauging that company's creditworthiness. Even a company rumored to be in financial difficulty may see its share price surge the moment a tie-up with a leading company is reported. Conversely, the share price of a company that had been performing well may plunge immediately after news spreads of a rescue merger with a struggling company or of the outbreak of a patent dispute with another company.

Thus, when an investor selects a company as an investment target, its correlations with other companies are an extremely important basis for judgment, so news of tie-ups and mergers between companies is frequently reported in newspapers, on television, and in magazines (see Non-Patent Document 1). Investment information magazines such as Kaisha Shikiho (registered trademark) also carry this kind of information (see Non-Patent Document 2).
Non-Patent Document 1: YOMIURI ONLINE, "Matsushita and Sanyo plan tie-up, with integration also in view", Internet URL: http://osaka.yomiuri.co.jp/eco_news/20080428ke01.htm, retrieved April 28, 2008
Non-Patent Document 2: Toyo Keizai Web, "Guide to Kaisha Shikiho", Internet URL: http://www.toyokeizai.co.jp/data/shikiho/shikiho/, retrieved April 28, 2008

Investors can therefore, in principle, grasp the relationships between companies by carefully following this public information.
However, now that the spread of the Internet puts an enormous amount of text into circulation online every day, comprehensively collecting information on inter-company relationships takes a great deal of time and effort, and the risk of overlooking important information is growing. Information on companies listed on the First Section is relatively easy to obtain because it appears in general newspapers and leading business magazines. For other public companies and for companies that have not yet gone public, however, miscellaneous news articles on the Web must be gathered by making full use of search engines, which is a laborious task.

The present invention was devised to solve the above problem. Its object is to realize a system that can automatically collect, from a large number of digitized document data, correlation information such as tie-ups and disputes between multiple companies.

To achieve the above object, the company correlation information extraction system according to claim 1 comprises: document storage means in which a plurality of document data, each associated with predetermined time information, are accumulated; company name storage means in which a plurality of company names are stored; means for referring to the company name storage means and extracting from the document storage means the document data in which a plurality of company names appear; related document extraction means for comparing the time information and the appearing company names of the extracted document data and extracting, for each combination of a pair of companies, a plurality of document data relating to the same event as related document data; keyword extraction means for extracting keywords from each related document data; correlation keyword recognition means for checking the presence or absence of each keyword in every related document and recognizing a keyword present in at least two related documents as a correlation keyword; and means for storing company correlation information, consisting of the pair of company names and the correlation keywords, in company correlation information storage means.

The company correlation information extraction system according to claim 2 is the system according to claim 1, further comprising: means for searching the company correlation information storage means, when an input specifying a company name is made, using that company name as a key and extracting the company correlation information that includes the company name; means for generating a company correlation diagram describing the pair of company names included in the company correlation information and the correlation keywords between the two companies; and means for outputting the company correlation diagram.
Here, "output" covers, for example, displaying on a display, printing via a printer, or sending a screen generated by a server to a client terminal over a network (the same applies hereinafter).

The company correlation information extraction system according to claim 3 is the system according to claim 1 or 2, further comprising: means for searching the company correlation information storage means, when an input specifying a correlation keyword is made, using that correlation keyword as a key and extracting the company correlation information that includes the correlation keyword; means for generating a company correlation diagram describing the pair of company names included in the company correlation information and the correlation keywords between the two companies; and means for outputting the company correlation diagram.

The company correlation information extraction system according to claim 4 is the system according to claim 2 or 3, wherein the company correlation information includes the ID of each related document data and the IDs of the related document data are linked to the company correlation diagram, the system further comprising: means for extracting the corresponding related document data from the document storage means when the ID of a related document data is input; and means for outputting the related document data.

The company correlation information extraction system according to claim 5 is the system according to any of claims 1 to 4, wherein an abbreviation or common name of each company is registered in the company name storage means in association with its official name, and the related document extraction means, when an abbreviation or common name of a company appears in a document data, treats it as the company's official name.

The company correlation information extraction system according to claim 6 is the system according to any of claims 1 to 5, further comprising a near-synonym dictionary that defines the relationships between keywords and their near-synonyms, wherein the correlation keyword recognition means refers to the near-synonym dictionary when checking the presence or absence of each keyword in every related document and, for a related document containing a near-synonym of a keyword, recognizes that keyword as present.

The company correlation information extraction system according to claim 7 is the system according to any of claims 1 to 6, further comprising a synonym dictionary that defines the relationships between keywords and their synonyms, wherein the correlation keyword recognition means refers to the synonym dictionary when checking the presence or absence of each keyword in every related document and, for a related document containing a synonym of a keyword, recognizes that keyword as present.

The company correlation information extraction system according to claim 8 is the system according to any of claims 1 to 7, wherein the keyword extraction means comprises a plurality of filters, each extracting keyword candidates based on its own extraction criteria, matches the keyword candidates extracted by the filters against one another, and recognizes the keyword candidates extracted by two or more filters as keywords.

The company correlation information extraction system according to claim 9 is the system according to claim 8, wherein one of the filters (1) extracts the nouns contained in each document data as focus words, (2) calculates the frequency with which each focus word appears in all the document data, (3) extends the range to the morpheme immediately before and/or after each focus word and calculates the frequency with which the focus word including the extended range appears in all the documents, (4) when the frequency calculated in (3) is at or above a predetermined number, extends the range to the next morpheme before or after and calculates the frequency of the focus word including the extended range in all the document data, repeating this until the frequency falls below the predetermined number, and (5) selects as keyword candidates those of the original focus words and the extended focus words whose frequency lies within a predetermined range.

According to the company correlation information extraction system of claim 1, pairs of mutually related company names and the correlation keywords between the two companies can be extracted automatically from a large volume of document data. This greatly reduces the time and effort needed to search for and collect, one by one, document data describing inter-company relationships, and it helps prevent important information from being missed through human error.

According to the company correlation information extraction system of claim 2, a company correlation diagram showing the company correlation information for the input company name is output, so a user can obtain comprehensive company correlation information on a company of interest simply by specifying its name.

According to the company correlation information extraction system of claim 3, a company correlation diagram showing the company correlation information for the input correlation keyword is output, so a user can obtain comprehensive company correlation information on an event of interest (for example, a capital tie-up, a business transfer, or patent infringement) by specifying that event.

According to the company correlation information extraction system of claim 4, selecting the ID of a related document data linked to the company correlation diagram presents that related document data, so the user can immediately check the related documents on which the company correlation information is based.

According to the company correlation information extraction system of claim 5, even if the notation of a company name differs slightly from one document data to another, the documents are correctly processed as referring to the same company, which raises the reliability of the company correlation information.

According to the company correlation information extraction system of claim 6, even if the choice of terms differs from one document data to another, a term that falls within the range of near-synonyms is counted as an occurrence of the same keyword, which raises the reliability of the company correlation information.

According to the company correlation information extraction system of claim 7, even if the choice of terms differs from one document data to another, a term that falls within the range of synonyms is counted as an occurrence of the same keyword, which raises the reliability of the company correlation information.

In the company correlation information extraction systems of claims 8 and 9, a plurality of filters each extract keyword candidates from the document data independently, and only the candidates extracted by two or more filters are recognized as official keywords. This prevents important keywords from being missed while also preventing unimportant noise from entering the keyword set.
In the system of claim 9 in particular, whether a focus word is selected as a keyword candidate is decided from its frequency of appearance across a plurality of document data, which gives an objective basis to the importance of the selected keyword candidates.

FIG. 1 is a block diagram showing the functional configuration of the company correlation information extraction system 10. The system comprises a crawl target DB 12, a Web file collection unit 13, a text generation unit 14, a document DB 15, a company name DB 16, a related document extraction unit 17, a keyword extraction unit 18, a near-synonym dictionary 19, a synonym dictionary 20, a company correlation information extraction unit 21, a company correlation information DB 22, and a company correlation diagram generation unit 23.
A Web server 24 is connected to the system 10. Via a network such as the Internet 25 or an intranet, the Web server 24 is connected to a plurality of clients 26 and to a plurality of other Web servers 27 that host news sites and the like.

The Web file collection unit 13, text generation unit 14, related document extraction unit 17, keyword extraction unit 18, company correlation information extraction unit 21, and company correlation diagram generation unit 23 are realized by the CPU of a server computer executing the necessary processing in accordance with an OS and a dedicated application program.

The crawl target DB 12, document DB 15, company name DB 16, near-synonym dictionary 19, synonym dictionary 20, and company correlation information DB 22 are stored on the hard disk of the same server computer.

The URLs of a plurality of news sites, together with information identifying where the documents to be extracted are located, are registered in the crawl target DB 12 in advance.
The company name DB 16 stores in advance the common names, abbreviations, and nicknames of a large number of companies in association with their official names.
The near-synonym dictionary 19 stores in advance the correspondences between a large number of keywords and their respective near-synonyms.
The synonym dictionary 20 stores in advance the correspondences between a large number of keywords and their respective synonyms.

As shown in FIG. 2, the keyword extraction unit 18 comprises a dependency expression extraction filter 18a, a delimiter extraction filter 18b, a string frequency statistics filter 18c, a TermExtract filter 18d, and a keyword recognition filter 18e.

Next, the process of extracting company correlation information is described with reference to the flowchart of FIG. 3.
First, the Web file collection unit 13 periodically crawls a plurality of Web servers 27, such as news sites that publish document data on the Internet 25, and collects a large number of Web files according to preset rules (S10). In doing so, the Web file collection unit 13 refers to the crawl target DB 12 to obtain the URLs of the Web sites to access and the location information of the documents to collect.

The text generation unit 14 removes unnecessary HTML tags from these Web files and shapes them into plain text data (S12), after which they are stored in the document DB 15 (S14). At this point, each document data is associated with a unique document ID, a URL indicating the location of the Web file from which the document data originated, time information (publication date, collection date and time, storage date and time, and so on), and a code identifying the information source.
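Steps S12 and S14 can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` for tag removal; the class and field names (`DocumentRecord`, `make_record`, `source_code`) and the in-memory ID counter are hypothetical stand-ins, not the structures of the text generation unit 14 or the document DB 15.

```python
from dataclasses import dataclass
from datetime import date
from html.parser import HTMLParser
from itertools import count


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join(" ".join(self._chunks).split())


@dataclass
class DocumentRecord:
    doc_id: int        # unique document ID
    url: str           # location of the originating Web file
    published: date    # time information (publication date only, for brevity)
    source_code: str   # code identifying the information source
    text: str          # plain text after tag removal


_next_id = count(1)  # hypothetical ID generator; a real store would assign IDs


def make_record(html, url, published, source_code):
    """Strip tags from one Web file and attach the metadata described above."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return DocumentRecord(next(_next_id), url, published, source_code, parser.text())
```

Joining text nodes with a space suits space-delimited languages; for Japanese text the chunks would more likely be concatenated directly.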

Instead of having the Web file collection unit 13 and the text generation unit 14 work together to accumulate document data in the document DB 15 automatically as described above, document data collected, selected, and processed by hand may also be stored in the document DB 15.

Next, the related document extraction unit 17 starts up and extracts from the document DB 15, for each combination of a pair of companies, a plurality of document data that relate to the same event and in which two or more company names appear (S16).
To do this, the related document extraction unit 17 first runs full-text searches over the document data in turn, using each company name registered in the company name DB 16 and its abbreviations and the like as keys, counts the number of company names appearing in each document data, and selects the document data in which two or more appear.
Next, the related document extraction unit 17 matches the date information and company names of the selected document data against one another, and recognizes as related document data for the same event those documents that share the same pair of company names and have dates close to each other.
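The selection and matching in step S16 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: plain substring search stands in for the full-text search, the `Doc` type and function names are hypothetical, and "dates close to each other" is modelled as a configurable gap between consecutive documents (`max_gap_days`).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Doc:
    doc_id: int
    published: date
    text: str


def companies_in(doc, company_names):
    """Return the sorted registered company names that occur in the text."""
    return sorted(name for name in company_names if name in doc.text)


def related_documents(docs, company_names, max_gap_days=2):
    """Group documents by company pair, then by closeness of publication date."""
    by_pair = defaultdict(list)
    for doc in docs:
        found = companies_in(doc, company_names)
        if len(found) < 2:
            continue  # only documents naming two or more companies qualify
        # A document naming three companies contributes to all three pairs.
        for pair in combinations(found, 2):
            by_pair[pair].append(doc)

    result = {}
    for pair, pair_docs in by_pair.items():
        pair_docs.sort(key=lambda d: d.published)
        clusters, current = [], [pair_docs[0]]
        for doc in pair_docs[1:]:
            if (doc.published - current[-1].published).days <= max_gap_days:
                current.append(doc)
            else:
                clusters.append(current)
                current = [doc]
        clusters.append(current)
        # A lone document is not a set of related documents.
        clusters = [c for c in clusters if len(c) >= 2]
        if clusters:
            result[pair] = clusters
    return result
```

Chaining consecutive documents means a cluster can span more than `max_gap_days` in total; a stricter variant would compare every document against the earliest one in its cluster.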

FIG. 4(a) shows an example of related document data for the same event. Document data 30 (source: A Shimpo), document data 31 (source: B News), and document data 32 (source: C Shimbun) all mention the same two company names, "D Communications (DDI)" and "E Search", and share the same publication date, so the related document extraction unit 17 extracted them as related documents for the pair "D Communications-E Search".

"DDI" in document data 31 differs on the surface from "D Communications" in the other document data, but because the company name DB 16 defines it as an abbreviation of D Communications, the related document extraction unit 17 extracted document data 31 as a related document. In other words, by referring to the company name DB 16, the related document extraction unit 17 can absorb variations in how a company name is written.
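A minimal sketch of this alias handling follows. The dictionary layout and function names are hypothetical; the company name DB 16 is only known to associate abbreviations, common names, and nicknames with official names.

```python
def build_alias_table(company_db):
    """Map every alias, and the official name itself, to the official name.

    company_db: dict of official name -> list of aliases (assumed layout).
    """
    table = {}
    for official, aliases in company_db.items():
        table[official] = official
        for alias in aliases:
            table[alias] = official
    return table


def official_names_in(text, alias_table):
    """Return the official names whose name or alias occurs in the text."""
    return {official for surface, official in alias_table.items() if surface in text}


company_db = {"D Communications": ["DDI"], "E Search": []}
alias_table = build_alias_table(company_db)
names = official_names_in("DDI and E Search announce a tie-up", alias_table)
```

Plain substring matching can produce false hits when a short alias happens to occur inside an unrelated word, so a production system would likely match on token boundaries instead.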

Document data 30 and document data 32 mention a third company, F Trading, in addition to D Communications and E Search. Because the system 10 is designed to extract correlations between a pair of companies, they are handled here as related documents for "D Communications-E Search". However, the related document extraction unit 17 also extracts document data 30 and document data 32 as related documents for "D Communications-F Trading" and for "E Search-F Trading" (details are given later).

The dates of the documents above matched exactly, but the invention is not limited to this case: documents whose dates fall within a short span of one another (for example, within two days) can also be recognized as related documents for the same event.

Next, the keyword extraction unit 18 applies the dependency expression extraction filter 18a to each related document extracted by the related document extraction unit 17 and extracts from each document data the strings that appear in predetermined dependency expressions (S18).
Specifically, the dependency expression extraction filter 18a holds a large number of dependency expression patterns prepared in advance, such as "○○メーカー" ("○○ maker"), "○○が主力" ("○○ is the mainstay"), and "○○を生産" ("produces ○○"). After detecting an expression that fits one of these patterns, the keyword extraction unit 18 extracts the string corresponding to "○○" as a keyword candidate.
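The pattern matching in step S18 can be approximated with regular expressions, as in the sketch below. This is an assumption made for illustration: filter 18a may well rely on a dependency parser, and the character class used for the "○○" slot (kanji, katakana, ASCII alphanumerics) is a guess that, among other things, does not cover full-width Latin letters.

```python
import re

# Characters allowed in the slot: kanji, katakana (including the long-vowel
# mark), and ASCII alphanumerics. Hiragana is excluded so that particles end
# the slot.
SLOT = r"([一-龥ァ-ヶーA-Za-z0-9]+)"

# The three example patterns named in the text; a real filter would hold many more.
PATTERNS = [
    re.compile(SLOT + "メーカー"),  # "○○ maker"
    re.compile(SLOT + "が主力"),    # "○○ is the mainstay"
    re.compile(SLOT + "を生産"),    # "produces ○○"
]


def dependency_candidates(text):
    """Return the strings that fill the slot of any registered pattern."""
    found = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.add(match.group(1))
    return found


sample = "同社は半導体メーカーで、液晶パネルが主力。国内でDVDレコーダーを生産している。"
candidates = dependency_candidates(sample)
```

Because the slot is greedy and includes katakana, the regex engine backtracks so that "半導体メーカー" yields "半導体" rather than swallowing the pattern suffix.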

Next, the keyword extraction unit 18 applies the delimiter extraction filter 18b to each related document data and extracts as keyword candidates the "○○" portions enclosed by delimiters such as commas, brackets, spaces, and tabs, as in 「○○」, "○○", （○○）, ［○○］, and ,○○, (S20).
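Step S20 might be sketched as below. The sketch covers only paired brackets and quotation marks; splitting on commas, spaces, and tabs is left out, and the `max_len` cut-off is an added assumption to keep whole sentences from being picked up.

```python
import re

# Opening and closing delimiters handled by this sketch (full-width and ASCII).
BRACKET_PAIRS = [
    ("「", "」"),
    ('"', '"'),
    ("（", "）"),
    ("(", ")"),
    ("［", "］"),
    ("[", "]"),
]


def delimiter_candidates(text, max_len=20):
    """Return the strings enclosed by any of the delimiter pairs above."""
    found = set()
    for open_ch, close_ch in BRACKET_PAIRS:
        o, c = re.escape(open_ch), re.escape(close_ch)
        # Match the shortest enclosed run that contains neither delimiter.
        pattern = o + "([^" + o + c + "]+)" + c
        for match in re.finditer(pattern, text):
            candidate = match.group(1).strip()
            if candidate and len(candidate) <= max_len:
                found.add(candidate)
    return found


sample = 'D Communications launched 「Z-mode」, a "search service" (beta) for [mobile] users'
candidates = delimiter_candidates(sample)
```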

Next, the keyword extraction unit 18 applies the string frequency statistics filter 18c to each related document data, counts how many times each string contained in the related document data appears across all documents stored in the document DB 15, including the other documents, and extracts as keyword candidates the strings whose frequency falls within a certain range (S22).
As shown in FIG. 5, the string frequency statistics filter 18c first focuses on a noun in a related document (here, "DVD") and counts the number of times this focus word appears in all the document data stored in the document DB 15. The filter then extends the range to the morphemes before and after the focus word, counts the frequency with which each extended string appears in all the documents, and stops extending the character range once the frequency falls to or below a certain value (for example, 20 or less).

For example, "したDVD", which includes the morpheme immediately before "DVD", has a low frequency of 2, so the range is not extended to any earlier morpheme. In contrast, "DVDレコーダー" ("DVD recorder"), which includes the morpheme immediately after "DVD", has a high frequency of 862, so the filter goes on to count the frequency of "DVDレコーダーでは", which includes the next morpheme. That frequency is low, at 5, so extension to later morphemes stops there.

A "morpheme" here means the smallest unit of language that carries meaning. For example, "私の名前は鈴木です" ("My name is Suzuki") breaks down into the morphemes 私 (pronoun), の (particle), 名前 (common noun), は (binding particle), 鈴木 (proper noun), and です (auxiliary verb).

Next, the string frequency statistics filter 18c extracts "DVD" and "DVDレコーダー" as keyword candidates because their frequencies fall within a predetermined range (for example, 20 to 5,000). In contrast, "したDVD" and "DVDレコーダーでは" fall outside this range and are excluded from the keyword candidates.
The reasoning is that a string appearing fewer than 20 times in all the documents cannot be called an important term in the first place, while one appearing more than 5,000 times is likely a generic or everyday word with no distinguishing character. This range is adjusted as appropriate to the volume of document data and the intended use of the search system.
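The expansion procedure of filter 18c can be sketched as follows, assuming the documents have already been split into morphemes by some tokenizer. The function names are hypothetical, frequency is taken to mean the total number of occurrences, and expansion stops when the count drops below `low` (the text and claim 9 word the stopping condition slightly differently). Backward and forward expansions are explored separately, not in combination.

```python
def count_occurrences(sequence, corpus):
    """Count how often a morpheme sequence occurs contiguously in the corpus."""
    n = len(sequence)
    return sum(
        1
        for tokens in corpus
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == sequence
    )


def frequency_candidates(tokens, focus_index, corpus, low=20, high=5000):
    """Expand a focus noun by neighbouring morphemes; keep spans in [low, high]."""
    counted = {}  # (start, end) -> frequency, in the order the spans were examined

    def freq(a, b):
        if (a, b) not in counted:
            counted[(a, b)] = count_occurrences(tokens[a:b], corpus)
        return counted[(a, b)]

    start, end = focus_index, focus_index + 1
    freq(start, end)  # the focus word itself

    a = start
    while a > 0 and freq(a - 1, end) >= low:  # extend backwards while frequent
        a -= 1
    b = end
    while b < len(tokens) and freq(start, b + 1) >= low:  # extend forwards
        b += 1

    return ["".join(tokens[x:y]) for (x, y), n in counted.items() if low <= n <= high]


# Toy corpus with scaled-down thresholds (the text's example uses 20 and 5,000).
corpus = [
    ["した", "DVD", "レコーダー", "では"],
    ["DVD", "レコーダー", "を"],
    ["DVD", "レコーダー", "の"],
    ["DVD", "レコーダー"],
    ["DVD"],
    ["DVD", "ソフト"],
]
result = frequency_candidates(corpus[0], 1, corpus, low=2, high=10)
```

Scanning the whole corpus for every span is slow; the index described in the next paragraph of the text is what makes the counting practical.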

Counting the frequency of every string contained in the large volume of document data stored in the document DB 15 would take an enormous amount of time. As shown in FIG. 6, an index (a so-called inverted index) is therefore generated in the document DB 15 in advance, tabulating for each morpheme that appears in any document data whether or not it is present in each individual document data. By referring to this index, the keyword extraction unit 18 can obtain the frequencies in a relatively short time.
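An index of this kind might look like the sketch below. It assumes a mapping from each morpheme to the set of IDs of the documents that contain it, and it counts documents, not total occurrences, since the index is described as recording presence per document. All names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """corpus: dict of doc_id -> list of morphemes. Returns morpheme -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, tokens in corpus.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def candidate_documents(sequence, index):
    """Documents containing every morpheme of the sequence (a superset of true matches)."""
    postings = [index.get(token, set()) for token in sequence]
    return set.intersection(*postings) if postings else set()


def document_frequency(sequence, corpus, index):
    """Count documents in which the morphemes occur contiguously, checking only candidates."""
    n = len(sequence)
    hits = 0
    for doc_id in candidate_documents(sequence, index):
        tokens = corpus[doc_id]
        if any(tokens[i:i + n] == sequence for i in range(len(tokens) - n + 1)):
            hits += 1
    return hits


corpus = {
    1: ["DVD", "レコーダー"],
    2: ["レコーダー", "DVD"],
    3: ["DVD"],
}
index = build_inverted_index(corpus)
```

The intersection of posting sets narrows the search to a few candidate documents, so only those need to be scanned to confirm that the morphemes are adjacent and in order.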

Next, the keyword extraction unit 18 applies the TermExtract filter 18d to each related document data and extracts as keyword candidates the strings whose score is at or above a predetermined value (S24).
The TermExtract filter 18d is a string extraction algorithm devised for automatically extracting technical terms from a specialized corpus (a very large body of text data consisting of digitized natural-language writing, collected mainly for research purposes). It extracts simple nouns and compound nouns from the document data as candidate terms and calculates the importance of each from its frequency of appearance and its connection frequency. Since the TermExtract filter 18d is itself a known technique, no further description is given.

Next, the keyword extraction unit 18 inputs the keyword candidates extracted by the dependency expression extraction filter 18a, the delimiter extraction filter 18b, the character string frequency statistics filter 18c, and the TermExtract filter 18d to the keyword certification filter 18e in order to narrow down the keywords.
The keyword certification filter 18e matches the keyword candidates listed by the individual filters against one another, certifies those listed as candidates by two or more filters as final keywords (S26), and outputs them to the company correlation information extraction unit 21.

FIG. 4(b) shows that the keyword extraction unit 18 has extracted the keywords "partnership", "mobile phone terminal", "net business", and "search service" from the related document 30; "collaboration", "Z-mode", "net business", and "listing advertisement" from the related document 31; and "search-linked advertisement", "search service", "subscriber", and "cancellation" from the related document 32.

As described above, using the four filters (the dependency expression extraction filter 18a, the delimiter extraction filter 18b, the character string frequency statistics filter 18c, and the TermExtract filter 18d) prevents important words from being missed when keywords are extracted from document data, while narrowing down the candidates with the keyword certification filter 18e prevents unnecessary keywords (noise) from being mixed in.

Certifying as formal keywords those candidates selected by two or more of the four filters is only one example; selection by three or more filters may instead be made the requirement for certification.
The number of filters is also not limited to the above. Other effective keyword candidate extraction filters may be added to the keyword extraction unit 18, and candidates selected by two or more of five or more filters may be certified as formal keywords.
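The certification rule of the keyword certification filter 18e amounts to counting how many filters proposed each candidate. The sketch below assumes each filter's output is available as a set of strings; the function name `certify_keywords` and the sample filter outputs are hypothetical.

```python
from collections import Counter

def certify_keywords(candidate_sets, min_filters=2):
    """Return the candidates proposed by at least `min_filters` filters."""
    tally = Counter()
    for candidates in candidate_sets:
        tally.update(set(candidates))  # each filter contributes at most one count
    return {kw for kw, n in tally.items() if n >= min_filters}

# Hypothetical outputs of filters 18a to 18d for one document.
dependency = {"partnership", "net business"}
delimiter = {"Z-mode", "net business"}
frequency = {"partnership", "search service", "Z-mode"}
term_extract = {"search service", "subscriber"}

keywords = certify_keywords([dependency, delimiter, frequency, term_extract])
print(sorted(keywords))
```

Raising `min_filters` to 3, or passing five or more candidate sets, covers the variations mentioned in the description without changing the function.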

Next, the company correlation information extraction unit 21 starts up, matches the keywords extracted from the related documents against one another, and calculates the number of votes for each keyword.
That is, as shown in FIG. 4(c), the company correlation information extraction unit 21 determines whether each keyword is contained in each of the related documents, which come from different information sources, and records in a table a vote of 1 if the keyword is contained and a vote of 0 if it is not. A keyword that wins two or more votes is adopted as a correlation keyword between D Communication and E Search.

For example, "net business" appears in both A Newsletter and B News and therefore has two votes, so it is adopted as a correlation keyword.
By contrast, "subscriber" appears only in C Newspaper and not in the A Newsletter or B News documents, so it has only one vote and is not adopted as a correlation keyword.

The borderline of "two or more votes" for adoption as a correlation keyword is merely an example. The presence of the keyword in all related documents may instead be made the condition for adoption (unanimity).
Alternatively, a keyword may be certified as a correlation keyword only when it obtains votes from a majority of the related documents (majority rule).

To be clear, a vote of "1" is given for the fact that a keyword is present in a related document. Even if the keyword appears in five places in a single related document, it does not receive "5" votes.
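The voting scheme and its three adoption rules can be sketched as follows, using the keywords of FIG. 4(b) without any synonym handling. The function names and the rule labels are hypothetical, and the duplicated "net business" entry is added here only to show that repeated mentions within one document still count as a single vote.

```python
def count_votes(doc_keywords):
    """Give each keyword one vote per related document that contains it."""
    votes = {}
    for keywords in doc_keywords.values():
        for kw in set(keywords):  # presence only, not the number of mentions
            votes[kw] = votes.get(kw, 0) + 1
    return votes

def adopt(votes, n_docs, rule="at_least_two"):
    """Select correlation keywords under one of the adoption rules."""
    thresholds = {
        "at_least_two": 2,
        "majority": n_docs // 2 + 1,
        "unanimous": n_docs,
    }
    return {kw for kw, v in votes.items() if v >= thresholds[rule]}

related = {
    "A Newsletter": ["partnership", "mobile phone terminal",
                     "net business", "net business", "search service"],
    "B News": ["collaboration", "Z-mode", "net business", "listing advertisement"],
    "C Newspaper": ["search-linked advertisement", "search service",
                    "subscriber", "cancellation"],
}
votes = count_votes(related)
adopted = adopt(votes, len(related))
print(sorted(adopted))
```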

When matching keywords, the company correlation information extraction unit 21 refers to the near-synonym dictionary 19 and the synonym dictionary 20, which allows it to pool the votes of keywords that do not match exactly.

For example, "partnership", which appears in the A Newsletter document, and "collaboration", which appears in the B News document, are defined as near-synonyms in the near-synonym dictionary 19. The company correlation information extraction unit 21 therefore treats the two as keywords of the same kind and adds the votes from A Newsletter and B News to record "2 votes". In other words, the company correlation information extraction unit 21 recognizes that "partnership (collaboration)" is present in two related documents.

Similarly, "listing advertisement" in B News and "search-linked advertisement" in C Newspaper are defined as synonyms in the synonym dictionary 20. The company correlation information extraction unit 21 therefore treats the two as synonymous keywords and adds the votes from B News and C Newspaper to record "2 votes". In other words, the company correlation information extraction unit 21 recognizes that "listing advertisement (search-linked advertisement)" is present in two related documents.
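Pooling votes across near-synonyms and synonyms can be sketched by mapping every keyword to a representative form before counting. For brevity, this illustration merges the two dictionaries into one hypothetical list, `SYNONYM_GROUPS`, and treats the first entry of each group as the representative; neither choice is stated in the description.

```python
# Hypothetical merged view of the near-synonym dictionary 19 and synonym dictionary 20.
SYNONYM_GROUPS = [
    ("partnership", "collaboration"),
    ("listing advertisement", "search-linked advertisement"),
]

def canonical(keyword, groups=SYNONYM_GROUPS):
    """Return the representative of the keyword's group, or the keyword itself."""
    for group in groups:
        if keyword in group:
            return group[0]
    return keyword

def pooled_votes(doc_keywords):
    """Count one vote per document for each canonical keyword."""
    votes = {}
    for keywords in doc_keywords.values():
        for kw in {canonical(k) for k in keywords}:
            votes[kw] = votes.get(kw, 0) + 1
    return votes

related = {
    "A Newsletter": ["partnership", "mobile phone terminal", "net business", "search service"],
    "B News": ["collaboration", "Z-mode", "net business", "listing advertisement"],
    "C Newspaper": ["search-linked advertisement", "search service", "subscriber", "cancellation"],
}
votes = pooled_votes(related)
adopted = {kw for kw, v in votes.items() if v >= 2}
print(sorted(adopted))
```

With the pooling in place, the four adopted keywords match the correlation keywords listed for D Communication and E Search.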

Having determined the correlation keywords in this way, the company correlation information extraction unit 21 associates the correlation keywords with the names of the two correlated companies (D Communication and E Search) and stores them in the company correlation information DB 22 (S32). At this time, the ID of each related document is also stored in the company correlation information DB 22 as one component of the company correlation information.

As described above, the document data 30 and the document data 32 are also extracted by the related document extraction unit 17 as related documents for the pairs "D Communication-F Trading Company" and "E Search-F Trading Company". After the keyword extraction process by the keyword extraction unit 18 and the company correlation information extraction process by the company correlation information extraction unit 21, the company correlation information for each of these pairs is stored in the company correlation information DB 22.

Next, the process of generating and publishing a company correlation diagram is described with reference to the flowchart of FIG. 7. First, when the Web server 24 receives from the client 26 a display request for a company correlation diagram that specifies a company name (S40), the company correlation diagram generation unit 23 starts up and extracts the company correlation information relating to that company from the company correlation information DB 22 (S42).
For example, when the client 26 sends a request specifying "D Communication", the company correlation diagram generation unit 23 retrieves all company correlation information relating to D Communication and generates company correlation diagrams (S44), and a company correlation diagram display screen is transmitted to the client 26 via the Web server 24 (S46).

As a result, as shown in FIG. 8, a screen containing two company correlation diagrams relating to D Communication is displayed on the Web browser of the client 26.
The company correlation diagram 35 in FIG. 8(a) shows the correlation between two companies (correlated companies), D Communication and E Search, with partnership (collaboration), net business, search service, and search-linked advertisement (listing advertisement) listed as correlation keywords. By referring to this diagram, which shows the two company names and the correlation keywords, the user can recognize that a partnership concerning net business (in particular, search-linked advertising) is under discussion between D Communication and E Search.
The company correlation diagram 36 in FIG. 8(b) shows the correlation between D Communication and G Electric, with patent right, infringement, litigation (suit), and license (non-exclusive license) listed as correlation keywords. By referring to this diagram, the user can recognize that some issue concerning patent infringement litigation exists between D Communication and G Electric.

The "Display supporting documents" button 37 is linked to the IDs of the related document data from which the company correlation diagram was generated. When the user clicks the button, these document IDs are transmitted from the client 26 to the Web server 24. On receiving them, the Web server 24 extracts the corresponding document data from the document DB 15 and transmits it to the client 26.
As a result, although not illustrated, a screen containing the related documents from the different information sources is displayed on the Web browser of the client 26.
By viewing this screen, the user can check in detail the correlation that exists between the companies. The screen also shows the URL of each related document, so the user can easily access the original Web page by clicking the URL.

When the user specifies both D Communication and E Search joined by an AND condition in the display request, the company correlation diagram generation unit 23 generates a screen showing only the company correlation diagram 35 of FIG. 8(a).
The user can also make a display request that specifies correlation keywords. For example, when "net business AND partnership" is specified as the display condition, the company correlation diagram generation unit 23 extracts, from all the company correlation information stored in the company correlation information DB 22, the entries that have both "net business" and "partnership" as correlation keywords, and transmits to the client 26 a screen containing the corresponding company correlation diagrams.
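The lookups described above (by one company, by two companies joined with AND, or by correlation keywords joined with AND) can be sketched as subset tests over stored records. The record layout, the `search` function, and the document IDs are hypothetical; the description only states that a record holds a pair of company names, correlation keywords, and related document IDs.

```python
# Hypothetical in-memory stand-in for the company correlation information DB 22.
records = [
    {"companies": {"D Communication", "E Search"},
     "keywords": {"partnership", "net business", "search service",
                  "search-linked advertisement"},
     "doc_ids": ["doc-30", "doc-31", "doc-32"]},   # hypothetical IDs
    {"companies": {"D Communication", "G Electric"},
     "keywords": {"patent right", "infringement", "litigation", "license"},
     "doc_ids": ["doc-40", "doc-41"]},             # hypothetical IDs
]

def search(records, companies=(), keywords=()):
    """Return records that contain ALL given companies and ALL given keywords."""
    return [r for r in records
            if set(companies) <= r["companies"] and set(keywords) <= r["keywords"]]

print(len(search(records, companies=["D Communication"])))
print(len(search(records, companies=["D Communication", "E Search"])))
print(len(search(records, keywords=["net business", "partnership"])))
```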

FIG. 9 is a block diagram showing the functional configuration of the industry map generation system 50, which comprises the document DB 15, the keyword extraction unit 18, a keyword DB 51, a relevance calculation unit 52, a keyword co-occurrence frequency table DB 53, a keyword combination frequency sum table DB 54, a keyword frequency sum table DB 55, a keyword relevance table DB 56, an industry map generation unit 57, the company correlation information DB 22, and the company name DB 16.
The Web server 24 is connected to the system 50, and the Web server 24 is in turn connected to a plurality of clients 26 via a network such as the Internet 25 or an intranet.

The keyword extraction unit 18, the relevance calculation unit 52, and the industry map generation unit 57 are realized by the CPU of a server computer executing the necessary processing in accordance with the OS and a dedicated application program.

The document DB 15, the keyword DB 51, the keyword co-occurrence frequency table DB 53, the keyword combination frequency sum table DB 54, the keyword frequency sum table DB 55, the keyword relevance table DB 56, the company correlation information DB 22, and the company name DB 16 are stored on the hard disk of the same server computer.

The document DB 15 stores a large volume of document data that was collected from the Web servers 27 on the Internet by the Web file collection unit 13 of the company correlation information extraction system 10 described above and converted into plain text by the text generation unit 14.
As in the company correlation information extraction system 10, the company name DB 16 stores in advance the common names, abbreviations, and nicknames of a large number of companies in association with their formal names.
As in the company correlation information extraction system 10, the keyword extraction unit 18 comprises the dependency expression extraction filter 18a, the delimiter extraction filter 18b, the character string frequency statistics filter 18c, the TermExtract filter 18d, and the keyword certification filter 18e (see FIG. 2).
The company correlation information DB 22 stores the company correlation information (a pair of company names, the correlation keywords between the two companies, and the related document IDs) extracted by the company correlation information extraction unit 21 of the company correlation information extraction system 10.

The process of extracting keywords from document data and the process of calculating the relevance between keywords are described below with reference to the flowchart of FIG. 10.
First, the keyword extraction unit 18 applies the dependency expression extraction filter 18a to each piece of document data stored in the document DB 15 and, in the same manner as described above, extracts character strings having a predetermined dependency expression from each piece of document data (S50).

Next, the keyword extraction unit 18 applies the delimiter extraction filter 18b to each piece of document data and, in the same manner as described above, extracts character strings enclosed by predetermined delimiters as keyword candidates (S52).

Next, the keyword extraction unit 18 applies the character string frequency statistics filter 18c to each piece of document data. In the same manner as described above, it counts how many times each character string contained in the document data appears across all documents stored in the document DB 15, including the other documents, and extracts character strings whose appearance frequency falls within a certain range as keyword candidates (S54).

Next, the keyword extraction unit 18 applies the TermExtract filter 18d to each piece of document data and, in the same manner as described above, extracts character strings whose scores are equal to or higher than a predetermined value as keyword candidates (S56).

Next, the keyword extraction unit 18 inputs the keyword candidates extracted by the dependency expression extraction filter 18a, the delimiter extraction filter 18b, the character string frequency statistics filter 18c, and the TermExtract filter 18d to the keyword certification filter 18e. As above, the keyword certification filter 18e certifies those listed as keyword candidates by two or more filters as final keywords and stores them in the keyword DB 51 (S58).

Next, the relevance calculation unit 52 starts up, counts the appearance frequency of each keyword in each piece of document data to generate a keyword co-occurrence frequency table, and stores the table in the keyword co-occurrence frequency table DB 53 (S60).
FIG. 11 shows a specific example of the keyword co-occurrence frequency table stored in the keyword co-occurrence frequency table DB 53. For each of the documents D1 to Dn stored in the document DB 15, the table records the appearance frequency of each of the keywords KW-1 to KW-n.

Here, the relevance between two keywords X and Y can in theory be calculated by substituting, for each document i in Equation 1, the appearance frequencies of X and Y recorded in the keyword co-occurrence frequency table DB 53.

The numerator of Equation 1 is the sum, over all documents, of the product of the per-document appearance frequencies of keywords X and Y, so its value grows as X and Y appear more often in the same document. However, if the absolute appearance frequencies of X and Y in a particular document are large, the numerator grows accordingly, so it does not necessarily represent a high degree of co-occurrence between X and Y. The denominator, by contrast, is the sum of two square roots: the square root of the sum over all documents of the squared per-document appearance frequency of X, and the corresponding square root for Y. Its value grows as the appearance frequencies of X and Y in particular documents increase. Dividing the numerator by the denominator therefore removes the effect of large absolute appearance frequencies in particular documents and yields a relevance based on how strongly X and Y co-occur.
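Equation 1 itself is not reproduced in this text. Read literally, the verbal description of its numerator and denominator corresponds to the form below. The symbols are assumptions introduced here for illustration: x_i and y_i denote the appearance frequencies of keywords X and Y in document i, n is the number of documents, and R(X, Y) is the relevance.

```latex
R(X, Y) = \frac{\sum_{i=1}^{n} x_i \, y_i}
               {\sqrt{\sum_{i=1}^{n} x_i^{2}} + \sqrt{\sum_{i=1}^{n} y_i^{2}}}
```

The widely used cosine similarity multiplies the two square roots instead of adding them. The form above follows the wording of the description, which states that the square roots are added.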

However, simply evaluating Equation 1 directly generates an enormous amount of computation and requires a long processing time when the volume of document data and the total number of keywords are large.
In this embodiment, therefore, the calculation is simplified by generating a keyword combination frequency sum table and a keyword frequency sum table from the keyword co-occurrence frequency table.

FIG. 12 illustrates this procedure. In this example, the keyword co-occurrence frequency table lists the appearance frequencies of keywords KW-1 to KW-5 in document D1. Since the appearance frequencies of KW-3 and KW-4 are 0, relevance actually needs to be calculated for only the following three keyword combinations.
(KW-1, KW-2), (KW-1, KW-5), (KW-2, KW-5)
The relevance calculation unit 52 generates a keyword combination frequency sum table, which records for each combination the product of the appearance frequencies, and a keyword frequency sum table, which records the square of each keyword's appearance frequency, and stores them in the keyword combination frequency sum table DB 54 and the keyword frequency sum table DB 55, respectively (S62, S64).

The keyword combination frequency sum table in FIG. 12 shows only the values for document D1, but the relevance calculation unit 52 performs the same processing for every document and accumulates the resulting values.
Likewise, the keyword frequency sum table in FIG. 12 shows only the values for document D1, but the relevance calculation unit 52 performs the same processing for every document and accumulates the squared appearance frequency of each keyword in each document.

Finally, as shown in FIG. 13, the relevance calculation unit 52 reads the sum of the combination frequencies of keywords X and Y from the keyword combination frequency sum table DB 54, and reads the sum of the squared values of keyword X and the sum of the squared values of keyword Y from the keyword frequency sum table DB 55. After taking the square root of each sum of squares, it substitutes these values into Equation 1 to calculate the relevance between keywords X and Y and stores the result in the keyword relevance table DB 56 (S66). The relevance calculation unit 52 repeats this processing until all keyword combinations have been handled, thereby generating the keyword relevance table.

As described above, the combination patterns between keywords are extracted for each piece of document data, the product for each combination and the square for each keyword are computed, and the values are then accumulated across the document data. This makes it possible to skip all calculations involving keywords whose appearance frequency is 0.
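The two sum tables and the final relevance lookup can be sketched as follows. The function names and the frequencies for document D1 are hypothetical (the description states only that KW-3 and KW-4 have frequency 0), and the denominator adds the two square roots, following the wording of the description rather than the product used in cosine similarity.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def add_document(doc_freqs, pair_sums, square_sums):
    """Accumulate one document, skipping keywords whose frequency is 0."""
    present = {k: f for k, f in doc_freqs.items() if f > 0}
    for k, f in present.items():
        square_sums[k] += f * f                       # keyword frequency sum table
    for a, b in combinations(sorted(present), 2):
        pair_sums[(a, b)] += present[a] * present[b]  # combination frequency sum table

def relevance(x, y, pair_sums, square_sums):
    """Relevance from the stored sums; the sum-of-roots denominator is an assumption."""
    key = tuple(sorted((x, y)))
    denom = sqrt(square_sums.get(x, 0)) + sqrt(square_sums.get(y, 0))
    return pair_sums.get(key, 0) / denom if denom else 0.0

pair_sums, square_sums = defaultdict(int), defaultdict(int)
d1 = {"KW-1": 2, "KW-2": 1, "KW-3": 0, "KW-4": 0, "KW-5": 3}  # hypothetical values
add_document(d1, pair_sums, square_sums)
print(sorted(pair_sums))
```

Only three pairs are stored for D1, so the work per document scales with the keywords that actually occur in it rather than with the full keyword list.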

In addition, when new document data is added to the document DB 15, the relevance between keywords can easily be recalculated by adding the values for each keyword in the new document data to the existing totals stored in the keyword combination frequency sum table DB 54 and the keyword frequency sum table DB 55.
Similarly, to remove the influence of outdated document data, the values for each keyword in that document data are subtracted from the existing totals stored in the keyword combination frequency sum table DB 54 and the keyword frequency sum table DB 55, which makes it easy to keep the relevance between keywords up to date.
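Because both tables hold plain sums, adding a new document and retiring an old one are the same operation with opposite signs. The sketch below assumes in-memory `Counter` tables and a hypothetical `apply_document` helper; the document contents are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def apply_document(doc_freqs, pair_sums, square_sums, sign=+1):
    """Add (sign=+1) or retire (sign=-1) one document's contribution to the sums."""
    present = sorted(k for k, f in doc_freqs.items() if f > 0)
    for k in present:
        square_sums[k] += sign * doc_freqs[k] ** 2
    for a, b in combinations(present, 2):
        pair_sums[(a, b)] += sign * doc_freqs[a] * doc_freqs[b]

pair_sums, square_sums = Counter(), Counter()
old_doc = {"KW-1": 2, "KW-2": 1}              # hypothetical stale document
new_doc = {"KW-1": 1, "KW-2": 4, "KW-5": 2}   # hypothetical newly collected document

apply_document(old_doc, pair_sums, square_sums)        # initial load
apply_document(new_doc, pair_sums, square_sums)        # add the new document
apply_document(old_doc, pair_sums, square_sums, -1)    # retire the stale document
print(dict(square_sums))
```

After these three calls the tables hold exactly the contribution of `new_doc`, without any other document having been rescanned.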

Next, the process of generating and publishing an industry map is described with reference to the flowchart of FIG. 14. First, when the Web server 24 receives from the client 26 a display request for an industry map that specifies a keyword indicating a particular industry (S70), the industry map generation unit 57 starts up and extracts from the keyword relevance table DB 56 a predetermined number of associative keywords that have a high relevance to that keyword (S72).

For example, when the client 26 sends "biotechnology" as the keyword specifying an industry, the industry map generation unit 57 sorts all keywords stored in the keyword relevance table DB 56 by their relevance to "biotechnology" and extracts a predetermined number of top-ranked keywords (for example, the top 100) as associative keywords for "biotechnology". FIG. 15 shows an example of associative keywords, in which keywords such as plant, methane gas, and ethanol are listed in descending order of relevance to "biotechnology".
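Extracting associative keywords is a sort over the relevance scores that involve the requested keyword. In the sketch below, the table layout (a dictionary keyed by keyword pairs), the function name, and every score are hypothetical.

```python
def associative_keywords(relevance_table, seed, top_n=100):
    """Return the top_n keywords most relevant to `seed`, highest score first."""
    scored = []
    for (a, b), score in relevance_table.items():
        if a == seed:
            scored.append((b, score))
        elif b == seed:
            scored.append((a, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [kw for kw, _ in scored[:top_n]]

# Hypothetical relevance scores, for illustration only.
table = {
    ("biotechnology", "plant"): 0.8,
    ("methane gas", "biotechnology"): 0.6,
    ("biotechnology", "ethanol"): 0.5,
    ("plant", "ethanol"): 0.9,   # does not involve the seed, so it is ignored
}
top = associative_keywords(table, "biotechnology", top_n=2)
print(top)
```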

Next, the industry map generation unit 57 refers to the company name DB 16 and extracts, from the associative keywords above, those that correspond to company names as associative companies for "biotechnology" (S74). FIG. 16 shows an example of associative companies, in which company names such as Kyodo Fermentation Industry, Japan Chemical Research Laboratories, and Giraffe Holdings are listed in descending order of relevance to "biotechnology".
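Selecting associative companies can be sketched as a lookup of each ranked associative keyword in the company name DB. The alias-to-formal-name dictionary and the alias "Giraffe HD" are hypothetical; the description states only that the company name DB 16 stores common names, abbreviations, and nicknames in association with formal names.

```python
def associative_companies(ranked_keywords, company_names):
    """Keep keywords that are company names, mapped to formal names, in rank order."""
    result = []
    for kw in ranked_keywords:
        formal = company_names.get(kw)
        if formal is not None and formal not in result:
            result.append(formal)
    return result

# Hypothetical excerpt of the company name DB 16 (alias or formal name -> formal name).
company_names = {
    "Japan Chemical Research Laboratories": "Japan Chemical Research Laboratories",
    "Giraffe Holdings": "Giraffe Holdings",
    "Giraffe HD": "Giraffe Holdings",
}
ranked = ["plant", "Japan Chemical Research Laboratories", "methane gas",
          "Giraffe HD", "Giraffe Holdings"]
companies = associative_companies(ranked, company_names)
print(companies)
```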

Next, as shown in FIG. 17, the industry map generation unit 57 places the company name 61 of each associative company, together with an icon 62 indicating that company, on a map plane 60 of a predetermined area (S76).

Next, the industry map generation unit 57 refers to the company correlation information DB 22 to obtain the company correlation information between pairs of associative companies (S78) and adds this correlation information to the map plane 60 (S80).
For example, when company correlation information concerning Kyodo Fermentation Industry and Giraffe Holdings exists in the company correlation information DB 22, the industry map generation unit 57 connects the icons 62, 62 of the two companies with a line 63 and places the correlation keyword 64, "acquisition", on the line 63.
For an associative company that has no company correlation information registered with any other associative company, only the company name 61 and the icon 62 remain displayed.

Next, the industry map generation unit 57 refers to the company correlation information DB 22 and extracts the company correlation information that includes an associative keyword for "biotechnology" (plant, methane gas, and so on) as a correlation keyword and that has an associative company placed on the map plane 60 as one of its two correlated companies (S82).

Next, the industry map generation unit 57 reflects the extracted company correlation information on the map plane 60 to complete the industry map (S84). Specifically, as shown in FIG. 18, a correlated company in the company correlation information that is not an associative company is recognized as an industry peripheral company. Its company name 65 (for example, "Company Y") and an icon 66 are placed on the map plane 60, the icon 66 is connected to the icon 62 of the associative company by a line 67, and the associative keyword 68 (for example, "biofuel") is placed on the line 67.
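Steps S76 to S84 can be sketched as building a small graph: associative companies become nodes, records between two associative companies become labeled edges, and records that link an associative company to an outside company through an associative keyword add a peripheral node. The data layout, the function name, "Z Company", the keywords "supply" and "lawsuit", and the choice of Giraffe Holdings as the partner of Y Company are all hypothetical.

```python
def build_industry_map(associative, associative_keywords, records):
    """Return (nodes, edges) for an industry map as plain Python structures."""
    nodes = {name: "associative" for name in associative}
    edges = []
    for (a, b), keywords in records:
        inside = [c for c in (a, b) if c in associative]
        if len(inside) == 2:
            # Both companies are associative: label the edge with all correlation keywords.
            edges.append((a, b, sorted(keywords)))
        elif len(inside) == 1:
            # One outside company: keep it only if an associative keyword links the pair.
            shared = sorted(set(keywords) & set(associative_keywords))
            if shared:
                outsider = b if inside[0] == a else a
                nodes.setdefault(outsider, "peripheral")
                edges.append((inside[0], outsider, shared))
    return nodes, edges

associative = ["Kyodo Fermentation Industry", "Japan Chemical Research Laboratories",
               "Giraffe Holdings"]
associative_keywords = ["plant", "methane gas", "ethanol", "biofuel"]
records = [
    (("Kyodo Fermentation Industry", "Giraffe Holdings"), {"acquisition"}),
    (("Y Company", "Giraffe Holdings"), {"biofuel", "supply"}),
    (("Giraffe Holdings", "Z Company"), {"lawsuit"}),  # no associative keyword: ignored
]
nodes, edges = build_industry_map(associative, associative_keywords, records)
print(nodes)
print(edges)
```

An associative company with no matching record, such as Japan Chemical Research Laboratories here, stays on the map as an unconnected node.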

Next, the industry map generation unit 57 outputs the completed industry map to the Web server 24, and an industry map display screen containing this industry map is transmitted from the Web server 24 to the client 26 (S86).

Associative companies co-occur frequently with one another in the same documents and are therefore likely to be in the same industry. A non-associative company, on the other hand, has a correlation with an associative company that involves an associative keyword, and can therefore be defined as a peripheral company.
By surveying this industry map, the user can grasp the overall picture of the main constituent companies of a particular industry and the peripheral companies involved with them.
Moreover, since the latest news content can be reflected immediately as correlation information, the user can be provided with information that has not become outdated.

A block diagram showing the functional configuration of the company correlation information extraction system.
A block diagram showing the functional configuration of the keyword extraction unit.
A flowchart showing the process of extracting company correlation information.
An explanatory diagram showing how keywords are extracted from each related document and how correlation keywords are extracted from those keywords.
An explanatory diagram showing the operation of the character string frequency statistics filter.
An explanatory diagram showing how a morpheme index is formed in the document DB.
A flowchart showing the procedure for generating and publishing a company correlation diagram.
A diagram showing an example of a company correlation diagram.
A block diagram showing the functional configuration of the industry map generation system.
A flowchart explaining the keyword extraction process and the process of calculating the relevance between keywords.
A diagram showing an example of a keyword co-occurrence frequency table.
An explanatory diagram showing a method of simplifying the relevance calculation process.
An explanatory diagram showing how a keyword relevance table is generated from a keyword combination frequency total table and a keyword frequency total table.
A flowchart showing the procedure for generating and publishing an industry map.
A diagram showing an example of associative keyword extraction.
A diagram showing an example of associative company extraction.
A diagram showing an industry map in the course of generation.
A diagram showing the completed form of an industry map.

Explanation of symbols

10 Company correlation information extraction system
12 Crawl destination DB
13 Web file collection unit
14 Text generation unit
15 Document DB
16 Company name DB
17 Related document extraction unit
18 Keyword extraction unit
18a Expression extraction filter
18b Character extraction filter
18c Character string frequency statistics filter
18d Keyword recognition filter
19 Thesaurus
20 Synonym dictionary
21 Company correlation information extraction unit
22 Company correlation information DB
23 Company correlation diagram generation unit
24 Web server
25 Internet
26 Client
27 Web server
28 Keyword co-occurrence frequency table DB
30 Document data
31 Document data
32 Document data
35 Company correlation diagram
36 Company correlation diagram
37 "Show supporting document" button
50 Industry map generation system
51 Keyword DB
52 Relevance calculation unit
53 Keyword co-occurrence frequency table DB
54 Frequency total table DB
55 Keyword frequency total table DB
56 Keyword relevance table DB
57 Industry map generation unit
60 Map plane
61 Company name of associative company
62 Icon of associative company
63 Line
64 Correlation keyword
65 Company name of peripheral company
66 Icon of peripheral company
67 Line
68 Associative keyword

Claims

A company correlation information extraction system comprising:
document storage means for storing a plurality of document data, each associated with predetermined time information;
company name storage means for storing a plurality of company names;
means for extracting, from the document storage means and with reference to the company name storage means, document data in which a plurality of company names appear;
related document extraction means for comparing the time information and the appearing company names of the extracted document data and extracting, for each pair of companies, a plurality of document data relating to the same event as related document data;
keyword extraction means for extracting keywords from each related document data;
correlation keyword recognition means for checking the presence or absence of each keyword in each related document and recognizing a keyword present in at least two related documents as a correlation keyword; and
means for storing, in company correlation information storage means, company correlation information comprising the pair of company names and the correlation keyword.
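
The correlation keyword recognition step of this claim can be illustrated with a minimal Python sketch. The function names and the record layout are hypothetical, and the sketch assumes the related documents for one company pair have already been extracted and their keywords listed.

```python
from collections import Counter


def correlation_keywords(keywords_per_document, min_documents=2):
    """Return keywords that are present in at least `min_documents` documents."""
    doc_counts = Counter()
    for keywords in keywords_per_document:
        # set() counts a keyword once per document, however often it repeats.
        doc_counts.update(set(keywords))
    return {kw for kw, n in doc_counts.items() if n >= min_documents}


def build_correlation_record(company_pair, keywords_per_document):
    """Combine a company pair with its correlation keywords (hypothetical layout)."""
    first, second = sorted(company_pair)
    return {
        "companies": (first, second),
        "keywords": sorted(correlation_keywords(keywords_per_document)),
    }
```

A keyword that appears several times in a single document still counts for only one document, which matches the "at least two related documents" condition.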

The company correlation information extraction system according to claim 1, further comprising:
means for searching the company correlation information storage means using a company name as a key, when an input specifying the company name is made, and extracting the company correlation information that includes the company name;
means for generating a company correlation diagram that shows the pair of company names included in the company correlation information and the correlation keyword between the two companies; and
means for outputting the company correlation diagram.

The company correlation information extraction system according to claim 1 or 2, further comprising:
means for searching the company correlation information storage means using a correlation keyword as a key, when an input specifying the correlation keyword is made, and extracting the company correlation information that includes the correlation keyword;
means for generating a company correlation diagram that shows the pair of company names included in the company correlation information and the correlation keyword between the two companies; and
means for outputting the company correlation diagram.

The company correlation information extraction system according to claim 2 or 3, wherein
the company correlation information includes the ID of each related document data, and
the IDs of the related document data are linked to the company correlation diagram,
the system further comprising:
means for extracting the corresponding related document data from the document storage means when the ID of related document data is input; and
means for outputting the related document data.

The company correlation information extraction system, wherein
an abbreviation or common name of each company is registered in the company name storage means in association with its official name, and
the related document extraction means, when an abbreviation or common name of a company appears in document data, treats it as the official company name.
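
The name normalization in this claim can be illustrated with a minimal Python sketch. The function name `companies_in_document` and the alias table are hypothetical, the short names in the usage example are invented for illustration, and plain substring matching is an assumption standing in for whatever matching the system actually uses.

```python
def companies_in_document(text, alias_to_official):
    """Return the official names of all companies mentioned in `text`.

    `alias_to_official` maps every registered spelling (official name,
    abbreviation, or common name) to the official company name.
    """
    found = set()
    for alias, official in alias_to_official.items():
        if alias in text:  # assumption: simple substring matching
            found.add(official)
    return found
```

For example, with "Giraffe" registered as a common name of Giraffe Holdings, a document that mentions only "Giraffe" is still counted as a document in which Giraffe Holdings appears.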

The company correlation information extraction system according to any one of claims 1 to 5, further comprising a synonym dictionary that defines the relationship between keywords and their synonyms, wherein
the correlation keyword recognition means refers to the synonym dictionary when checking the presence or absence of each keyword in each related document, and recognizes the keyword as present in a related document that contains a synonym of the keyword.

The company correlation information extraction system according to any one of claims 1 to 6, further comprising a synonym dictionary that defines the relationship between keywords and their synonyms, wherein
the correlation keyword recognition means refers to the synonym dictionary when checking the presence or absence of each keyword in each related document, and recognizes the keyword as present in a related document that contains a synonym of the keyword.
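
The dictionary-based presence check can be illustrated with a minimal Python sketch. The function names and the dictionary layout (keyword mapped to a list of equivalent terms) are hypothetical, and the example terms are invented for illustration.

```python
def keyword_present(keyword, document_keywords, synonyms):
    """True if the keyword or any of its registered synonyms is in the document."""
    accepted = {keyword} | set(synonyms.get(keyword, ()))
    return not accepted.isdisjoint(document_keywords)


def count_documents_with(keyword, keywords_per_document, synonyms):
    """Number of related documents in which the keyword counts as present."""
    return sum(keyword_present(keyword, kws, synonyms)
               for kws in keywords_per_document)
```

With "takeover" registered as a synonym of "acquisition", a document that contains only "takeover" still contributes to the document count for "acquisition".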

The company correlation information extraction system according to claim 1, wherein
the keyword extraction means includes a plurality of filters that each extract keyword candidates based on their own extraction criteria, and
the keyword candidates extracted by the filters are compared with one another, and keyword candidates extracted by two or more filters are recognized as keywords.
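
The agreement rule between filters can be illustrated with a minimal Python sketch. The three filter functions are toy stand-ins invented for illustration, not the filters 18a to 18c of this system. Only `consensus_keywords` reflects the rule stated in the claim.

```python
import re
from collections import Counter


def quoted_terms(text):
    """Toy filter: terms enclosed in double quotes."""
    return re.findall(r'"([^"]+)"', text)


def capitalized_terms(text):
    """Toy filter: capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def long_terms(text, min_len=7):
    """Toy filter: words of at least `min_len` letters."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_len]


def consensus_keywords(text, filters, min_votes=2):
    """Keep candidates that at least `min_votes` filters extracted."""
    votes = Counter()
    for extract in filters:
        # Each filter votes at most once per candidate.
        votes.update(set(extract(text)))
    return {cand for cand, n in votes.items() if n >= min_votes}


SAMPLE = 'The "Biofuel" venture builds a methane plant.'
FILTERS = [quoted_terms, capitalized_terms, long_terms]
```

Calling `consensus_keywords(SAMPLE, FILTERS)` keeps only "Biofuel", the one candidate that more than one toy filter extracts.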

One of the filters performs the following processing:
(1) extracts the nouns contained in each document data as attention words;
(2) calculates the appearance frequency of each attention word in all document data;
(3) expands the range of each attention word to the morpheme immediately before and the morpheme immediately after it, and calculates the appearance frequency, in all document data, of the attention word including this expanded range;
(4) if the appearance frequency calculated in (3) is equal to or greater than a predetermined number, expands the range by a further preceding or following morpheme and calculates the appearance frequency, in all document data, of the attention word including this expanded range, repeating this until the appearance frequency falls below the predetermined number; and
(5) selects, from among the original attention words and the attention words with expanded ranges, those whose appearance frequency lies within a predetermined range as keyword candidates.
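
Steps (1) to (5) can be illustrated with a minimal Python sketch. It assumes that morphological analysis has already been done, so each document is a list of morphemes and the nouns are given. The names `ngram_counts` and `frequency_filter` and the `max_len` bound are hypothetical, and expanding by one morpheme on either side independently is one possible reading of steps (3) and (4).

```python
from collections import Counter


def ngram_counts(documents, n):
    """Count every run of n consecutive morphemes across all documents."""
    counts = Counter()
    for tokens in documents:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequency_filter(documents, nouns, min_freq, freq_range, max_len=6):
    """Return {phrase tuple: frequency} for the selected keyword candidates."""
    low, high = freq_range
    unigrams = ngram_counts(documents, 1)
    # Steps (1)-(2): attention words and their frequencies.
    freqs = {(w,): unigrams[(w,)] for w in nouns if unigrams[(w,)] > 0}
    frontier = set(freqs)  # phrases that are still being expanded
    n = 1
    while frontier and n < max_len:
        n += 1
        next_frontier = set()
        for phrase, freq in ngram_counts(documents, n).items():
            # Steps (3)-(4): phrase extends a frontier phrase by one
            # morpheme on the left or on the right.
            if phrase[1:] in frontier or phrase[:-1] in frontier:
                freqs[phrase] = freq
                if freq >= min_freq:  # keep expanding only while frequent
                    next_frontier.add(phrase)
        frontier = next_frontier
    # Step (5): keep phrases whose frequency lies within the given range.
    return {p: f for p, f in freqs.items() if low <= f <= high}
```

Expansion of a phrase stops as soon as its frequency drops below `min_freq`, so rare continuations such as a one-off trailing word are recorded but never grown further.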