JP2002007431A

JP2002007431A - Information searching device, information searching method and recording medium

Info

Publication number: JP2002007431A
Application number: JP2000193529A
Authority: JP
Inventors: Satoshi Inoue; 聡井上; Shinji Abe; 伸治安部; Yoshinobu Tonomura; 佳伸外村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-06-27
Filing date: 2000-06-27
Publication date: 2002-01-11

Abstract

PROBLEM TO BE SOLVED: To provide an information searching device, an information searching method and a recording medium by which only necessary contents among contents following from a prescribed start page can be acquired by giving a query such as 'I wish to acquire a homepage including one or more images, using 'KYOTO' and 'SIGHTSEEING' as a keyword and described in 'Japanese' to a searching robot. SOLUTION: Only the necessary contents are acquired by checking whether or not the contents is compatible to the inputted query.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、ＷＷＷに代表され
るように、多種多様のコンテンツが、あちこちのコンテ
ンツサーバに大量に格納されている場合に、必要なコン
テンツのみを自動的に取得することができ、さらには、
インターネット等のコンピュータネットワーク上に設け
られている複数のコンテンツサーバに存在する膨大なコ
ンテンツの中から、必要なコンテンツだけを用いて行な
う種々のサービスに適用できる情報探索装置に関する。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention is to automatically acquire only necessary contents when a large variety of contents are stored in a large number of contents servers as represented by WWW. Can be
The present invention relates to an information search apparatus that can be applied to various services provided by using only necessary contents from a vast amount of contents existing in a plurality of content servers provided on a computer network such as the Internet.

【０００２】[0002]

【従来の技術】一般的に、ＷＷＷ等の探索エンジンの探
索ロボットは、ＨＴＭＬ等のコンテンツ上に張られたハ
イパーリンクを辿り、最初に指定した場所から繋がる、
ありとあらゆるコンテンツを取得するものである。2. Description of the Related Art Generally, a search robot of a search engine such as WWW follows a hyperlink provided on content such as HTML and is connected from a first designated place.
Get all kinds of content.

【０００３】ところで、「２０００年電子情報通信学会
総合大会講演論文集基礎・境界Ａ−１６−４６Ｐ．３
５０関連発見型ブラウザ『ＡｓｓｏｃｉａＶｉｅｗ』」
に記載されているように、探索対象とするコンテンツ
を、所定のドメイン（たとえば京都観光や、博物館、美
術館等）に特化することができる場合には、全てのコン
テンツを予め取得し、コンテンツ中に、どのような単語
が含まれるかという情報を、上記取得されたコンテンツ
から抽出する方法を考えることができるが、この方法
は、却って、計算コスト、外部記憶装置容量ともに、無
駄であるという問題がある。By the way, "Basic and Boundary A-16-46 P.3, Proceedings of the 2000 IEICE General Conference"
50 related discovery browser "AssociaView"
If the content to be searched can be specialized in a predetermined domain (for example, Kyoto sightseeing, museums, art museums, etc.), all contents are acquired in advance, and In this case, a method of extracting information about what words are included from the acquired content can be considered. However, this method is disadvantageous in that both the calculation cost and the external storage device capacity are useless. There is.

【０００４】[0004]

【発明が解決しようとする課題】本発明は、たとえば、
「一枚以上の画像を含み、キーワード『京都』、『観
光』が使用され、「日本語」で記述されているホームペ
ージを取得して欲しい」というようなＱｕｅｒｙ（探索
要求）を、探索ロボットに与えることによって、所定の
スタートページからつながるコンテンツの中で、必要な
コンテンツのみを取得することができる情報探索装置、
情報探索方法および記録媒体を提供することを目的とす
るものである。SUMMARY OF THE INVENTION The present invention provides, for example,
A query (search request) such as "I want to obtain a home page that contains one or more images and uses the keywords" Kyoto "and" sightseeing "and is described in" Japanese "" is sent to the search robot. By giving, an information search device capable of acquiring only necessary contents among contents connected from a predetermined start page,
It is an object to provide an information search method and a recording medium.

【０００５】また、本発明は、所定のスタートページか
らつながるコンテンツの中で、可能性の高そうなリンク
を先に探索することによって、コンテンツを効率的に取
得することができる情報探索装置、情報探索方法および
記録媒体を提供することを目的とするものである。[0005] Further, the present invention provides an information search apparatus and an information search apparatus capable of efficiently obtaining content by searching for a link that is likely to be high in content connected from a predetermined start page. It is an object to provide a search method and a recording medium.

【０００６】つまり、上記ＡｓｓｏｃｉａＶｉｅｗのよ
うに、予めドメインを特化し、ある程度限定された（フ
ィルタリングされた）コンテンツを用いるサービスにお
いて、必要なコンテンツを自動的に取得する。上記Ａｓ
ｓｏｃｉａＶｉｅｗで言えば、たとえば京都観光ＡＶ
や、江戸時代美術品ＡＶのようなものに使われるコンテ
ンツを自動的に取得できる。That is, as in the above-mentioned AssociateView, in a service that uses a domain that is specialized in a domain in advance and uses content that has been limited to some extent (filtered), necessary content is automatically acquired. As above
In the case of socialview, for example, Kyoto sightseeing AV
Also, content used for things such as Edo period art AV can be automatically acquired.

【０００７】さらに、本発明は、インターネット上の膨
大なコンテンツの中から必要なコンテンツだけを用いて
提供する種々のサービスに適用できる情報探索装置、情
報探索方法および記録媒体を提供することを目的とす
る。A further object of the present invention is to provide an information search device, an information search method, and a recording medium applicable to various services provided by using only necessary contents from a huge amount of contents on the Internet. I do.

【０００８】[0008]

【課題を解決するための手段】本発明は、入力されたＱ
ｕｅｒｙに、コンテンツが適合するか否かをチェックす
ることによって、必要なコンテンツだけを取得するもの
である。SUMMARY OF THE INVENTION According to the present invention, an input Q
By checking whether or not the content conforms to the query, only the necessary content is acquired.

【０００９】また、本発明は、コンテンツ中に含まれる
リンクを辿る際に、リンク自体の情報を用いることによ
って、必要なコンテンツが存在すると思われるリンクか
ら優先的に探索をするものである。Further, according to the present invention, when tracing a link included in a content, information of the link itself is used to preferentially search for a link from which a necessary content is considered to exist.

【００１０】[0010]

【発明の実施の形態および実施例】図１は、本発明の一
実施例である情報探索システム１００の構成図である。FIG. 1 is a block diagram of an information search system 100 according to an embodiment of the present invention.

【００１１】情報探索システム１００は、コンテンツＤ
Ｂ１と、サーバ２と、クライアント３と、コンピュータ
ネットワーク４とを有する。[0011] The information search system 100 includes a content D
B1, a server 2, a client 3, and a computer network 4.

【００１２】コンテンツＤＢ１は、所定のスタートペー
ジから探索を始め、そのページがＱｕｅｒｙ（探索要
求）に適合した場合（コンテンツスコアが所定の閾値を
超えている場合）に、そのコンテンツを、登録、蓄積す
るデータベースである。The content DB 1 starts searching from a predetermined start page, and registers and stores the content when the page conforms to the query (search request) (when the content score exceeds a predetermined threshold). Database.

【００１３】サーバ２は、クライアント３の要望によっ
て、コンピュータネットワーク４上に存在している多数
のコンテンツサーバから、所望のコンテンツを探索する
機能を有する。The server 2 has a function of searching for a desired content from a number of content servers existing on the computer network 4 according to a request from the client 3.

【００１４】インターネット等のコンピュータネットワ
ーク４には、コンテンツＤＢ１と、サーバ２と、クライ
アント３とが接続されている。A content DB 1, a server 2, and a client 3 are connected to a computer network 4 such as the Internet.

【００１５】情報探索システム１００は、一般的にＡＳ
Ｐ（ＡｐｐｌｉｃａｔｉｏｎＳｅｒｖｉｃｅＰｒｏ
ｖｉｄｅｒ）と言われる事業者によって使われることが
多く、この場合、一般のインターネット等のサービスを
利用するユーザから直接見えるものではない。したがっ
て、ＷＷＷブラウザ等を直接用いてコンテンツ情報を入
手するとは限らない。The information search system 100 generally has an AS
P (Application Service Pro)
vider), and is not directly visible to users who use services such as the general Internet. Therefore, the content information is not always obtained by directly using a WWW browser or the like.

【００１６】上記Ｑｕｅｒｙは、探索要求であり、所望
の情報を得るために、サーバ２に渡す「こういう情報が
欲しい」という情報であり、探索スタートページのＵＲ
Ｌも、上記Ｑｕｅｒｙに含まれる。The above-mentioned Query is a search request, and is information that "I want such information" passed to the server 2 in order to obtain desired information.
L is also included in the above Query.

【００１７】図２は、上記実施例において、サーバ２を
具体的に示すブロック図である。FIG. 2 is a block diagram specifically showing the server 2 in the above embodiment.

【００１８】図３は、上記実施例において、探索ロボッ
ト３０とリンクスコアテーブル２２との関係を説明する
図である。FIG. 3 is a diagram for explaining the relationship between the search robot 30 and the link score table 22 in the above embodiment.

【００１９】サーバ２は、Ｑｕｅｒｙ解釈エンジン１０
と、探索ロボット管理エンジン２０と、探索ロボット３
０とを有する。The server 2 has a query interpretation engine 10
Search robot management engine 20, search robot 3
0.

【００２０】Ｑｕｅｒｙ解釈エンジン１０は、クライア
ント３から入力されたＱｕｅｒｙを解釈するものであ
り、つまり、後述のコンテンツ評価部３２、リンク評価
部３５で使うことができる形に、上記Ｑｕｅｒｙを変換
するものである。たとえば、入力された自然言語文を形
態素解析にかけることによって、名詞を抽出する処理
が、上記解釈の例である。The Query interpretation engine 10 interprets the Query input from the client 3, that is, converts the Query into a form that can be used by the content evaluation unit 32 and the link evaluation unit 35 described later. It is. For example, a process of extracting a noun by subjecting an input natural language sentence to morphological analysis is an example of the above interpretation.

【００２１】また、たとえば、「キーワード：京都（７
０）、観光（３０）」のように、各Ｑｕｅｒｙを構成す
るキーワードに対する重要度を数値で付属させ、この付
属させた重要度を、Ｑｕｅｒｙを解釈する場合に反映さ
せ、これによって、コンテンツとリンクとを評価する際
に、各Ｑｕｅｒｙに重み付けをするようにしてもよい。Further, for example, "keyword: Kyoto (7
0), sightseeing (30) ", the importance of the keywords constituting each query is attached numerically, and the attached importance is reflected in the interpretation of the query, whereby the content and the link are linked. When evaluating, each Query may be weighted.

【００２２】Ｑｕｅｒｙ解釈エンジン１０によって解釈
されたＱｕｅｒｙは、コンテンツ評価部３２、リンク評
価部３５で使うことができる形に変換されたものであ
り、コンテンツスコアとリンクスコアとを計算する場合
に使用される。The Query interpreted by the Query interpretation engine 10 is converted into a form that can be used by the content evaluation unit 32 and the link evaluation unit 35, and is used when calculating a content score and a link score. You.

【００２３】上記Ｑｕｅｒｙを構成する項目は、コンテ
ンツを特徴づける属性であり、具体的には、「キーワー
ド」、「ファイルサイズ」、「含まれる文字数」、「名
詞の数」、「言語」、「リンクの有無／数」、「表の有
無／数」、「画像の有無／枚数／サイズ」、「音楽の有
無／サイズ」、「更新（作成）時間」、「ＵＲＬ」等で
ある。Items constituting the above-mentioned Query are attributes that characterize the content. Specifically, “keyword”, “file size”, “number of characters included”, “number of nouns”, “language”, “language”, "Link presence / number / number", "Table presence / number / number", "Image presence / number / number / size", "Music presence / size / size", "Update (creation) time", "URL" and the like.

【００２４】また、上記リンクスコアは、リンク先のコ
ンテンツスコアがわからない状態で、リンク先のコンテ
ンツを取得する前に、効率的に辿るべきリンクを探索す
るためのスコアである。The link score is a score for efficiently searching for a link to be followed before acquiring the content of the link destination without knowing the content score of the link destination.

【００２５】探索ロボット管理エンジン２０は、探索ロ
ボット生成部２１と、リンクスコアテーブル２２とを有
する。The search robot management engine 20 has a search robot generation unit 21 and a link score table 22.

【００２６】探索ロボット管理エンジン２０は、サーバ
２内でのネットワークと計算量の負荷とに応じて、並列
に動かす探索ロボット３０の数Ｎを動的に計算する機能
を持ち、また、探索ロボット３０が現在何台動いている
かを調べることによって、探索ロボット３０を管理する
機能を持つ。さらに、探索ロボット管理エンジン２０
は、探索ロボット３０の数が、Ｎ個よりも少なくなった
場合に、リンクスコアテーブル２２を参照し、その時点
でＱｕｅｒｙに最も適合すると思われるＵＲＬに対し
て、探索ロボット３０を生成する機能を持つ。The search robot management engine 20 has a function of dynamically calculating the number N of the search robots 30 to be moved in parallel according to the network in the server 2 and the load of the calculation amount. Has a function of managing the search robot 30 by checking how many vehicles are currently moving. Further, the search robot management engine 20
When the number of search robots 30 becomes smaller than N, a function of generating the search robot 30 with reference to the link score table 22 and generating a search robot 30 for a URL that is considered to be most suitable for Query at that time. Have.

【００２７】なお、上記「その時点でＱｕｅｒｙに最も
適合すると思われるＵＲＬ」は、その時点におけるリン
クスコアテーブル２２を参照し、図３に示すリンクスコ
アリストのリンクスコアの平均値が最も高い値を持つＵ
ＲＬである。ここで、複数のリンク元ＵＲＬが存在し、
それぞれリンクスコアが異なる場合に、上記「リンクス
コアの平均値」で、リンクスコアを比較する。The above “URL that seems to be the most suitable for the query at that time” refers to the link score table 22 at that time and determines the value having the highest average link score in the link score list shown in FIG. U to have
RL. Here, there are a plurality of link source URLs,
When the link scores are different from each other, the link scores are compared using the “average link score” described above.

【００２８】また、所定のＵＲＬに対応する探索ロボッ
ト３０の役割が終了した際には、リンクスコアテーブル
の、「既に調査済みか？」を示すフラグ（調査済フラ
グ）を「Ｙｅｓ」にする。この調査済フラグは、同じＵ
ＲＬを複数回チェックすることによる無駄を排除するた
めのフラグであり、次回、探索ロボット３０を生成する
際に、上記調査済フラグが「Ｙｅｓ」になっているＵＲ
Ｌについて、探索ロボット３０を生成することはない。When the role of the search robot 30 corresponding to the predetermined URL has been completed, the flag (investigated flag) indicating "already investigated?" In the link score table is set to "Yes". This investigated flag is the same U
This is a flag for eliminating waste caused by checking the RL a plurality of times. When the search robot 30 is generated next time, the UR for which the investigated flag is “Yes” is set.
For L, the search robot 30 is not generated.

【００２９】探索ロボット３０は、コンテンツ取得部３
１と、コンテンツ評価部３２と、Ｑｕｅｒｙ適合判別部
３３と、コンテンツ登録制御部３４と、リンク評価部３
５とを有する。The search robot 30 includes the content acquisition unit 3
1, a content evaluation unit 32, a query matching determination unit 33, a content registration control unit 34, and a link evaluation unit 3.
And 5.

【００３０】探索ロボット３０は、探索ロボット管理エ
ンジン２０における探索ロボット生成部２１によって、
１つのＵＲＬを引数として生成され、上記ＵＲＬが示す
コンテンツを、コンテンツＤＢ１に登録する必要がある
か否かをチェックし、登録する必要があれば、コンテン
ツＤＢ１に登録する。また、そのコンテンツ内の全ての
リンクに対して、リンクスコアを計算し、リンクスコア
テーブル２２に登録すると、当該探索ロボット３０が消
滅する。The search robot 30 is controlled by the search robot generation unit 21 in the search robot management engine 20.
It is generated using one URL as an argument, and it is checked whether or not it is necessary to register the content indicated by the URL in the content DB1, and if it is necessary, the content is registered in the content DB1. When the link score is calculated for all the links in the content and registered in the link score table 22, the search robot 30 disappears.

【００３１】次に、探索ロボット３０について、詳細に
説明する。Next, the search robot 30 will be described in detail.

【００３２】探索ロボット３０において、与えられたＵ
ＲＬに基づいて、コンテンツ取得部３１がコンテンツを
取得し、コンテンツ評価部３２がコンテンツスコアを計
算することによって、コンテンツを評価し、Ｑｕｅｒｙ
適合判別部３３が、Ｑｕｅｒｙに適合するか否かを判別
する。In the search robot 30, the given U
Based on the RL, the content acquisition unit 31 acquires the content, and the content evaluation unit 32 calculates the content score, thereby evaluating the content.
The matching determination unit 33 determines whether or not the query is compatible.

【００３３】次に、コンテンツ評価部３２が行なうコン
テンツスコア計算について、詳細に説明する。Next, the content score calculation performed by the content evaluation section 32 will be described in detail.

【００３４】コンテンツ評価部３２は、Ｑｕｅｒｙ解釈
エンジン１０が解釈したＱｕｅｒｙに基づいて、コンテ
ンツスコアを計算する。なお、上記コンテンツスコア
は、所定のコンテンツがＱｕｅｒｙに適合している度合
を示すスコアである。The content evaluation unit 32 calculates a content score based on the query interpreted by the query interpretation engine 10. The content score is a score indicating the degree to which the predetermined content conforms to the query.

【００３５】また、コンテンツスコアを計算する場合、
Ｑｕｅｒｙにマッチする度合いが高ければ、コンテンツ
スコアが高くなるような計算式を用いて計算する。たと
えば、キーワードとして「京都」、「観光」を入力した
場合に、所定のコンテンツに「京都」と「観光」とが含
まれていれば、上記所定のコンテンツのコンテンツスコ
アが高くなり、それらのキーワードが含まれていなけれ
ば、上記所定のコンテンツスコアが低くなるような計算
式を用いる。When calculating the content score,
If the degree of matching with the query is high, the calculation is performed using a calculation formula that increases the content score. For example, when "Kyoto" and "sightseeing" are input as keywords, and if the predetermined content includes "Kyoto" and "sightseeing", the content score of the predetermined content increases, and the keyword Is not included, a calculation formula that lowers the predetermined content score is used.

【００３６】キーワード以外の属性（たとえば、画像を
有するコンテンツであることを示す属性、企業が提供す
るコンテンツであることを示す属性）についても、上記
と同様に、Ｑｕｅｒｙにマッチする度合いが高ければ、
コンテンツスコアが高くなるような計算式を用いて計算
する。For attributes other than keywords (for example, an attribute indicating that the content has an image, an attribute indicating that the content is provided by a company), as described above, if the degree of matching with Query is high,
It is calculated using a calculation formula that increases the content score.

【００３７】また、シソーラス等を用いて、入力された
キーワードと同じキーワードが所定のコンテンツに含ま
れていないが、上記入力されたキーワードに近いニュア
ンスのキーワードが所定のコンテンツに含まれている場
合には、上記所定のコンテンツのコンテンツスコアがあ
る程度高くなる計算式を用いるようにしてもよい。Using a thesaurus or the like, if the same keyword as the input keyword is not included in the predetermined content, but a keyword having a nuance close to the input keyword is included in the predetermined content, May use a calculation formula that increases the content score of the predetermined content to some extent.

【００３８】また、たとえば、キーワード「京都」の重
要度を「７０」とし、キーワード「観光」の重要度を
「３０」とするように、Ｑｕｅｒｙに対する重要度を、
数値として各Ｑｕｅｒｙに付属させるようにしてもよ
い。この場合、各Ｑｕｅｒｙによるスコアに対して、重
要度によって重み付けをした値で加重平均をとること等
によって、上記重要度を、コンテンツスコアに反映させ
るようにしてもよい。Further, for example, the importance of the keyword “Kyoto” is set to “70”, and the importance of the keyword “sightseeing” is set to “30”.
You may make it attach to each Query as a numerical value. In this case, the importance may be reflected in the content score by, for example, taking a weighted average with a value weighted by importance for the score of each query.

【００３９】ここで、所定のコンテンツがＱｕｅｒｙに
適合していれば（コンテンツスコアが所定の閾値を超え
ていれば）、コンテンツ登録制御部３４が、上記所定の
コンテンツを、ＤＢ１に登録する。Here, if the predetermined content conforms to the Query (if the content score exceeds a predetermined threshold), the content registration control unit 34 registers the predetermined content in the DB1.

【００４０】また、そのコンテンツにリンクが含まれて
いる場合、この含まれている各リンクに対して、リンク
評価部３５が、リンクスコアを計算する。このリンクス
コア計算によって得られたリンクスコアは、探索ロボッ
ト管理エンジン２０のリンクスコアテーブル２２に送ら
れる。１つのリンク先に対して、複数のリンク元ＵＲＬ
が存在する場合があるので、リンクスコアは、リスト構
造に応じて複数保存される。なお、上記リスト構造は、
事前的には個数のわからない状況において、複数の値を
入れるための構造であり、データと、次のデータヘのポ
インタとによって構成されている。When a link is included in the content, the link evaluation unit 35 calculates a link score for each of the included links. The link score obtained by this link score calculation is sent to the link score table 22 of the search robot management engine 20. Multiple link source URLs for one link destination
May exist, and a plurality of link scores are stored according to the list structure. Note that the above list structure is
This is a structure for inserting a plurality of values in a situation where the number is not known in advance, and is composed of data and a pointer to the next data.

【００４１】上記のように、上記実施例におけるリンク
スコアは、コンテンツとコンテンツに付属する情報とに
よって決まる。なお、上記「コンテンツに付属する情
報」は、ＵＲＬ等、コンテンツそのものではないが、コ
ンテンツを特徴づける情報である。As described above, the link score in the above embodiment is determined by the content and the information attached to the content. The “information attached to the content” is not the content itself, such as a URL, but is information that characterizes the content.

【００４２】次に、リンク評価部３５が行なうリンクス
コア計算について詳細に説明する。Next, the link score calculation performed by the link evaluation unit 35 will be described in detail.

【００４３】リンク評価部３５は、コンテンツスコアと
同様にして、リンクそのもののスコアを求め、また、コ
ンテンツ評価部３２が計算したコンテンツスコアを用い
て、リンクスコアを計算する。The link evaluation unit 35 obtains the score of the link itself in the same manner as the content score, and calculates the link score using the content score calculated by the content evaluation unit 32.

【００４４】リンク元のコンテンツの内容がＱｕｅｒｙ
にマッチする度合いに応じて、リンク先のコンテンツの
内容がマッチする度合いが影響される可能性が強いの
で、コンテンツ評価部３２が計算したコンテンツスコア
を用いてリンクスコアを計算する。たとえば、キーワー
ドが「観光or京都」であるＱｕｅｒｙについて探索して
いる場合、「地方自治体紹介ホームページ」内に設けら
れているリンク「京都」よりも、「観光に関するホーム
ページ」内に設けられているリンク「京都」が、Ｑｕｅ
ｒｙにマッチする可能性が高い。したがて、リンクその
ものスコアと、コンテンツ評価部３２が計算したコンテ
ンツスコアとを乗じたものを、リンクスコアとして、リ
ンク評価部３５が計算する。When the content of the link source content is Query
Since the degree of matching of the contents of the link destination is likely to be affected according to the degree of matching, the link score is calculated using the content score calculated by the content evaluation unit 32. For example, when searching for a query whose keyword is “sightseeing or Kyoto”, the link provided in the “homepage on tourism” is better than the link “Kyoto” provided in the “local government introduction homepage”. "Kyoto" is Que
ry is likely to match. Therefore, the link evaluation unit 35 calculates the product of the link itself score and the content score calculated by the content evaluation unit 32 as a link score.

【００４５】次に、上記実施例において、所定のＱｕｅ
ｒｙに適合するコンテンツを探索する動作について説明
する。Next, in the above embodiment, a predetermined Que
The operation of searching for a content that matches ry will be described.

【００４６】（ステップ１）まず、ユーザが、Ｑｕｅｒ
ｙをクライアント３に入力する。この入力されたＱｕｅ
ｒｙは、インターネット４を介して接続されているサー
バ２にアクセスされる。このコンテンツの探索に先立っ
て、通常、ユーザ名、パスワード等の入力が必要である
が、ここでは、この入力動作の説明を省略する。(Step 1) First, when the user enters
y is input to the client 3. This input Que
ry is accessed by the server 2 connected via the Internet 4. Prior to the search for the content, it is usually necessary to input a user name, a password, and the like, but the description of the input operation is omitted here.

【００４７】そして、コンテンツ評価部３２とリンク評
価部３５とで使うことができる形（解釈されたＱｕｅｒ
ｙ）に、クライアント３に入力されたＱｕｅｒｙを、Ｑ
ｕｅｒｙ解釈エンジン１０が変換する。The content evaluation unit 32 and the link evaluation unit 35 can use the format (interpreted Quer).
y), the query input to the client 3 is replaced by Q
The query interpretation engine 10 performs the conversion.

【００４８】（ステップ２）解釈されたＱｕｅｒｙは、
コンテンツ評価部３２とリンク評価部３５とに送られ
る。Ｑｕｅｒｙの項目は、コンテンツを特徴づける属性
であり、具体的には、「キーワード」、「ファイルサイ
ズ」、「含まれる文字数」、「名詞の数」、「言語」、
「リンクの有無／数」、「表の有無／数」、「画像の有
無／枚数／サイズ」、「音楽の有無／サイズ」、「更新
（作成）時間」、「ＵＲＬ」等である。(Step 2) The interpreted Query is
It is sent to the content evaluation unit 32 and the link evaluation unit 35. The Query item is an attribute that characterizes the content, and specifically includes “keyword”, “file size”, “number of included characters”, “number of nouns”, “language”,
“Presence / presence / number of links”, “Presence / presence / number of tables”, “Presence / presence / number / size of images”, “Presence / presence / size of music”, “Update (creation) time”, “URL”, and the like.

【００４９】（ステップ３）探索ロボット管理エンジン
２０に送られた探索スタートページ情報は、探索ロボッ
ト生成部２１を経て、探索ロボット３０に送られる。コ
ンテンツ取得部３１が、上記探索スタートページのコン
テンツを取得し、この取得されたコンテンツが、コンテ
ンツ評価部３２に送られる。コンテンツ評価部３２は、
コンテンツがステップ２で解釈されたＱｕｅｒｙに適合
する程度に高い値をとるような計算式を用いる。(Step 3) The search start page information sent to the search robot management engine 20 is sent to the search robot 30 via the search robot generation unit 21. The content acquisition unit 31 acquires the content of the search start page, and the acquired content is sent to the content evaluation unit 32. The content evaluation unit 32
A calculation formula is used such that the content takes a value high enough to conform to the Query interpreted in step 2.

【００５０】たとえば、「キーワード：京都、観光」と
いうＱｕｅｒｙを入力した場合、「京都」と「観光」と
の両方のキーワードを含むコンテンツには、コンテンツ
スコアとして、０．９を与え、片方のキーワードのみが
含まれるコンテンツには、コンテンツスコアとして、
０．５を与え、１つも含まれないコンテンツには、コン
テンツスコアとして、０．１を与えるような式を、上記
計算式として用いる。For example, when the query “Keyword: Kyoto, sightseeing” is input, 0.9 is given as a content score to the content including both the keywords “Kyoto” and “sightseeing”, and one of the keywords is given. Content that contains only
A formula that gives 0.5 and gives 0.1 as a content score is used for the content that does not include any content as the above-mentioned calculation formula.

【００５１】また、たとえば「キーワード：京都、観
光、画像：あり、ＵＲＬ：^*．co.jp」というＱｕｅｒｙ
を入力した場合、「キーワード：京都」、「キーワー
ド：観光」、「画像：あり」、「ＵＲＬ：^*．co.jp（企
業のページ）」の４つの条件のうちで、４つの条件を満
たすコンテンツには、コンテンツスコアとして０．９を
与え、３つの条件を満たすコンテンツには、コンテンツ
スコアとして０．７を与え、２つの条件を満たすコンテ
ンツには、コンテンツスコアとして０．５を与え、１つ
の条件を満たすコンテンツには、コンテンツスコアとし
て０．３を与え、１つの条件も満たさないコンテンツに
は、コンテンツスコアとして０．１を与えるような式
を、上記計算式として用いる。Also, for example, the query “Keyword: Kyoto, sightseeing, image: yes, URL: ^* .co.jp”
, "Keyword: Kyoto", "keyword: sightseeing", "image: available", and "URL: ^* .co.jp (company page)" satisfy four of the four conditions. A content is given a content score of 0.9, a content that satisfies the three conditions is given a content score of 0.7, and a content that satisfies the two conditions is given a content score of 0.5. An expression that gives a content score of 0.3 to content that satisfies one condition and gives a content score of 0.1 to content that does not satisfy one condition is used as the above calculation formula.

【００５２】（ステップ４）ここで、所定のコンテンツ
が、入力されたＱｕｅｒｙに適合した場合（スコアがあ
る閾値を超えている場合）に、コンテンツ登録制御部３
４が、そのコンテンツをＤＢ１に登録する。(Step 4) Here, when the predetermined content matches the input Query (when the score exceeds a certain threshold), the content registration control unit 3
4 registers the content in DB1.

【００５３】（ステップ５）探索の対象となるコンテン
ツにリンクが存在する場合には、各リンクについて、リ
ンク評価部３５がリンクスコア計算を行う。リンク評価
部３５は、コンテンツスコアを求める場合と同様にし
て、リンクそのもののスコアを求めるが、コンテンツ評
価部３２が計算したコンテンツスコアをも用いて、リン
クスコアを計算する。(Step 5) If links exist in the content to be searched, the link evaluation unit 35 calculates a link score for each link. The link evaluation unit 35 calculates the score of the link itself in the same manner as in the case of calculating the content score, but also calculates the link score using the content score calculated by the content evaluation unit 32.

【００５４】たとえば、リンクそのもののスコアと、コ
ンテンツスコアとを乗じたスコアを、リンクスコアとす
る。この場合、「今も都のつもり京都」というリンクに
ついての「リンクそのもののスコア」は、コンテンツス
コアの計算と同様に行われるので、０．５になる（キー
ワード：京都、観光のうちの片方のキーワードのみが含
まれるコンテンツには、コンテンツスコアとして、０．
５を与える計算式を使用）が、そのリンクが含まれるコ
ンテンツのコンテンツスコアが０．９であれば、リンク
スコアは、０．５＊０．９＝０．４５になる。For example, a score obtained by multiplying the score of the link itself by the content score is set as a link score. In this case, the “link itself score” for the link “Kyoto is still going to be Kyoto” is calculated in the same way as the content score calculation, so it becomes 0.5 (Keyword: Kyoto, one of sightseeing) For content including only keywords, a content score of 0.
5 is used), but if the content score of the content including the link is 0.9, the link score is 0.5 * 0.9 = 0.45.

【００５５】これと同様に、「昔は都だった奈良」とい
うリンクの場合、リンクそのもののスコアが０．１であ
る（キーワード：京都、観光のうちの１つも含まれない
コンテンツには、コンテンツスコアとして、０．１を与
える計算式を使用）ので、リンクスコアは０．１であ
り、そのリンクが含まれるコンテンツのコンテンツスコ
アが０．９であれば、リンクスコアは、０．１＊０．９
＝０．０９になる。得られたリンクスコアは、探索ロボ
ット管理エンジン２０のリンクスコアテーブル２２に送
られる。Similarly, in the case of the link “Nara was once a city,” the link itself has a score of 0.1 (keywords: Kyoto, contents that do not include any of sightseeing include contents Since the calculation formula that gives 0.1 is used as the score), the link score is 0.1. If the content score of the content including the link is 0.9, the link score is 0.1 * 0. .9
= 0.09. The obtained link score is sent to the link score table 22 of the search robot management engine 20.

【００５６】リンクそのもののスコアは、Ｑｕｅｒｙに
基づいて、リンク先ＵＲＬやリンクの説明文に応じて、
次のように、計算される。たとえば、「キーワード：京
都、観光」というＱｕｅｒｙを入力した場合、Ｑｕｅｒ
ｙにＵＲＬが含まれないので、リンクの説明文が対象に
なる。たとえば、キーワード「京都」、「観光」の両方
含む場合には、０．９、片方のみを含む場合には０．
５、１つも含まれない場合には０．１を返すような計算
式を用いる。The score of the link itself is determined based on the Query, in accordance with the link destination URL and the description of the link.
It is calculated as follows: For example, if you enter the query "Keyword: Kyoto, sightseeing"
Since y does not include the URL, the description of the link is targeted. For example, 0.9 if both keywords "Kyoto" and "sightseeing" are included, and 0 if only one is included.
A calculation formula that returns 0.1 when none is included is used.

【００５７】また、「キーワード：京都、観光、画像：
あり、ＵＲＬ：^*．co.jp」というＱｕｅｒｙを入力した
場合には、「キーワード」に関しては同様であるが、
「画像の有無」に関しては、リンク先ＵＲＬやリンクの
説明文だけでは判別できない（リンク先コンテンツを取
得してからでないと判別できない）ので、このＱｕｅｒ
ｙを無視する。Also, "keyword: Kyoto, sightseeing, image:
Yes, URL: ^* . If you enter the query "co.jp", the same applies to "keywords",
As for “presence / absence of image”, it cannot be determined only by the link destination URL or the description of the link (it cannot be determined until the link destination content is acquired).
Ignore y.

【００５８】また、「ＵＲＬ：^*．co.jp」については、
リンク先ＵＲＬを見て判別する。つまり、たとえば、
「京都」、「観光」の両方のキーワードを含み、しかも
企業のページ（co.jp）であるリンクには、０．９、
「キーワード：京都」、「キーワード：観光」、「ＵＲ
Ｌ：^*．co.jp」の４つの条件のうち、２つを満たすもの
には０．６、１つを満たすものには０．３、１つも満た
さないものには０．１となるような式を用いる。For “URL: ^* .co.jp”,
It is determined by looking at the link destination URL. So, for example,
Links that include both the keywords “Kyoto” and “sightseeing” and that are corporate pages (co.jp) include 0.9,
"Keyword: Kyoto", "Keyword: sightseeing", "UR
L: ^* . Of the four conditions of "co.jp", use an equation such that 0.6 is satisfied if it satisfies two, 0.3 if it satisfies one, and 0.1 if it does not satisfy any one. .

【００５９】（ステップ６）探索ロボット管理エンジン
２０は、サーバ２内でのネットワークと、計算量の負荷
とに応じて、並列に動かす探索ロボット３０の数Ｎを動
的に計算し、また、探索ロボット３０の数がＮ個よりも
少なくなった場合に、そのときのリンクスコアテーブル
２２を参照し、その時点で最もＱｕｅｒｙに適合すると
思われるＵＲＬに対して、探索ロボット３０を生成す
る。(Step 6) The search robot management engine 20 dynamically calculates the number N of the search robots 30 to be moved in parallel according to the network in the server 2 and the load of the calculation amount. When the number of robots 30 is smaller than N, the search robot 30 is generated for the URL that is most likely to match the query at that time by referring to the link score table 22 at that time.

【００６０】（ステップ７）ステップ６において生成さ
れた探索ロボット３０が、それぞれステップ３〜ステッ
プ５を実行することによって、コンテンツＤＢ１にコン
テンツ情報が、次第に蓄積される。(Step 7) The search robot 30 generated in step 6 executes steps 3 to 5, whereby the content information is gradually accumulated in the content DB1.

【００６１】つまり、上記実施例は、リンクが張られて
いるコンテンツを無作為に回収するものではなく、探索
の時点で、必要なコンテンツであるか否かを判別し、必
要なコンテンツのみを回収する例である。つまり、上記
実施例は、ドメインを特化したい等、探索対象コンテン
ツを、所定の決まりに従って、必要なコンテンツだけに
絞りたい場合に、有効な探索ロボットである。すなわ
ち、上記実施例では、従来の探索ロボットでは実現でき
なかった、必要なコンテンツのみを効率良く取得するこ
とができる。In other words, the above embodiment does not randomly collect the linked content, but determines at the time of search whether or not the content is necessary, and collects only the necessary content. Here is an example. In other words, the above-described embodiment is an effective search robot when it is desired to narrow down the search target content to only necessary content according to a predetermined rule, for example, to specialize a domain. That is, in the above-described embodiment, only necessary contents, which cannot be realized by the conventional search robot, can be efficiently acquired.

【００６２】すなわち、上記実施例によれば、所定のス
タートページから探索を始め、そのページがＱｕｅｒｙ
に適合した場合（スコアがある閾値を超えている場合）
に、そのコンテンツをＤＢに登録する。その後に、「リ
ンク」を評価したスコアをテーブルに残し、その中で、
スコアが高いものから順に、優先的にコンテンツを探索
する。したがって、「幅優先」や「深さ優先」コンテン
ツだけを先に収集することができる。That is, according to the above embodiment, the search is started from a predetermined start page, and the page is queried.
(If the score exceeds a certain threshold)
Then, the content is registered in the DB. After that, the score that evaluated "link" is left in the table,
Content is searched for preferentially in descending order of score. Therefore, only “width-first” and “depth-first” contents can be collected first.

【００６３】サーバ２において、リンク評価部３５とリ
ンクスコアテーブル２２とを削除してもよく、このよう
にしても、必要なコンテンツのみを回収することができ
る。一方、リンク評価部３５とリンクスコアテーブル２
２とを、サーバ２が有していれば、上記必要なコンテン
ツのみを回収する作業を短時間で終了させることができ
る。In the server 2, the link evaluation section 35 and the link score table 22 may be deleted. Even in this case, only necessary contents can be collected. On the other hand, the link evaluation unit 35 and the link score table 2
2, the server 2 can end the operation of collecting only the necessary contents in a short time.

【００６４】なお、上記実施例を記録媒体の発明として
把握することができる。つまり、上記実施例は、所定の
クライアントから入力されたＱｕｅｒｙを解釈するＱｕ
ｅｒｙ解釈手順と、探索ロボットを生成する探索ロボッ
ト生成手順と、上記コンピュータネットワークからコン
テンツを取得するコンテンツ取得手順と、上記Ｑｕｅｒ
ｙと上記コンテンツとに基づいて、上記Ｑｕｅｒｙに対
する上記コンテンツの適合度であるコンテンツスコアを
計算するコンテンツ評価手順と、上記コンテンツが、上
記入力されたＱｕｅｒｙに適合するか否かを判別する適
合性判別手順と、Ｑｕｅｒｙに適合しているコンテンツ
を、コンテンツＤＢに登録するコンテンツ登録手順とを
コンピュータに実行させるプログラムを記録したコンピ
ュータ読み取り可能な記録媒体の例である。The above embodiment can be understood as the invention of a recording medium. That is, in the above embodiment, the Qu interprets the Query input from the predetermined client.
ery interpretation procedure, a search robot creation procedure for creating a search robot, a content acquisition procedure for acquiring content from the computer network, and the Quer
a content evaluation procedure for calculating a content score, which is a degree of conformity of the content to the Query, based on the y and the content; and a compatibility determination for determining whether the content conforms to the input Query. It is an example of a computer-readable recording medium on which a program for causing a computer to execute a procedure and a content registration procedure for registering content conforming to the Query in a content DB is recorded.

【００６５】この場合、上記コンテンツに含まれている
リンクが指し示すコンテンツの上記Ｑｕｅｒｙに対する
適合度であるリンクスコアを計算するリンク評価手順
と、所定のコンテンツに含まれているリンクのリンクス
コアを蓄積するリンクスコア蓄積手順とをコンピュータ
に実行させるプログラムを設けるようにしてもよい。In this case, a link evaluation procedure for calculating a link score, which is a degree of conformity of the content indicated by the link included in the content to the query, and a link score of the link included in the predetermined content are accumulated. A program that causes a computer to execute the link score accumulation procedure may be provided.

【００６６】また、この場合、上記コンテンツを特徴づ
ける属性が、キーワード、ファイルサイズ、含まれる文
字数、名詞の数、言語、リンクの有無／サイズ、更新
（作成）時間、ＵＲＬの少なくとも１つであり、上記属
性の中から選択された属性を用い、上記コンテンツスコ
アと上記リンクそのもののスコアとを計算する。In this case, the attributes characterizing the content are at least one of a keyword, a file size, the number of included characters, the number of nouns, a language, presence / absence / size of a link, an update (creation) time, and a URL. Using the attribute selected from the attributes, the content score and the score of the link itself are calculated.

【００６７】さらに、上記実施例は、コンテンツを特徴
づける属性に基づいて、コンテンツスコアとリンクその
もののスコアとを計算するスコア計算手順と、上記計算
されたコンテンツスコアとリンクそのもののスコアとに
基づいて、リンクスコアを計算するリンクスコア計算手
順と、上記計算されたリンクスコアに基づいて、探索の
優先順位を決定する探索優先順位決定手順とをコンピュ
ータに実行させるプログラムを記録したコンピュータ読
み取り可能な記録媒体の例である。Further, in the above embodiment, a score calculation procedure for calculating the content score and the score of the link itself based on the attribute characterizing the content, and the score calculation procedure based on the calculated content score and the score of the link itself. A computer-readable recording medium storing a program for causing a computer to execute a link score calculation procedure for calculating a link score, and a search priority determination procedure for determining a search priority based on the calculated link score. This is an example.

【００６８】なお、上記記録媒体は、ＦＤ、ＣＤ、ＤＶ
Ｄ、ＨＤ、半導体メモリ等が考えられる。The recording medium is FD, CD, DV
D, HD, semiconductor memory and the like can be considered.

【００６９】[0069]

【発明の効果】本発明によれば、コンテンツを探索する
時点で、必要なコンテンツであるか否かを判別するの
で、必要なコンテンツのみを回収することができるとい
う効果を奏する。According to the present invention, it is determined whether or not the content is necessary at the time of searching for the content, so that only the necessary content can be collected.

[Brief description of the drawings]

【図１】本発明の一実施例である情報探索システム１０
０の構成図である。FIG. 1 is an information search system 10 according to an embodiment of the present invention.
FIG.

【図２】上記実施例において、サーバ２を具体的に示す
ブロック図である。FIG. 2 is a block diagram specifically showing a server 2 in the embodiment.

【図３】上記実施例において、探索ロボット３０とリン
クスコアテーブル２２との関係を説明する図である。FIG. 3 is a diagram illustrating a relationship between a search robot 30 and a link score table 22 in the embodiment.

[Explanation of symbols]

１…コンテンツＤＢ、２…サーバ、３…クライアント、４…インターネット等のコンピュータネットワーク、１０…Ｑｕｅｒｙ解釈エンジン、２０…探索ロボット管理エンジン、２１…探索ロボット生成部、２２…リンクスコアテーブル、３０…探索ロボット、３１…コンテンツ取得部、３２…コンテンツ評価部、３３…Ｑｕｅｒｙ適合判別部、３４…コンテンツ登録制御部、３５…リンク評価部。 DESCRIPTION OF SYMBOLS 1 ... Content DB, 2 ... Server, 3 ... Client, 4 ... Computer network, such as the internet, 10 ... Query interpretation engine, 20 ... Search robot management engine, 21 ... Search robot generation part, 22 ... Link score table, 30 ... Search Robot 31: Content acquisition unit 32: Content evaluation unit 33: Query matching determination unit 34: Content registration control unit 35: Link evaluation unit

───────────────────────────────────────────────────── フロントページの続き (72)発明者外村佳伸東京都千代田区大手町二丁目３番１号日本電信電話株式会社内Ｆターム(参考） 5B075 ND36 NK46 PP12 PP23 PR06 PR08 QM08 ────────────────────────────────────────────────── ─── Continuing on the front page (72) Inventor Yoshinobu Tomura 2-3-1 Otemachi, Chiyoda-ku, Tokyo F-term in Nippon Telegraph and Telephone Corporation (reference) 5B075 ND36 NK46 PP12 PP23 PR06 PR08 PRM QM08

Claims

[Claims]

1. A client connected via a predetermined computer network, a server, and a content D
B, a query interpretation engine for interpreting the query input from the client; content acquisition means for acquiring content from the computer network;
Based on the Query and the content, the Q
content evaluation means for calculating a content score that is the degree of conformity of the content to the query; suitability determination means for determining whether the content conforms to the input Query; and content conforming to the Query. An information search device, comprising: a search robot having content registration control means for registering the search robot in a content DB; and a search robot management engine having search robot generation means for generating the search robot.

2. The search robot according to claim 1, further comprising: link evaluation means for calculating a link score, which is a degree of conformity of the content indicated by the link included in the content to the query. An information search device, characterized in that the management engine has a link score table for accumulating link scores, which are scores of links included in predetermined content.

3. A client, a server, and a content D connected via a predetermined computer network.
B. an information search method of the information search system in the information search system comprising: B; a Query interpretation step of interpreting the Query input from the client; a search robot generation step of generating a search robot; Acquiring the content; and the above Qu
query based on the query and the content.
A content evaluation step of calculating a content score that is a degree of conformity of the content with respect to; a compatibility determination step of determining whether the content conforms to the input Query; and a content conforming to the Query. And a content registration step of registering the content in a content DB.

4. A link evaluation step according to claim 3, wherein a link score, which is a degree of relevance of the content pointed to by the link included in the content to the query, is calculated; and a link of the link included in the predetermined content. A link score accumulating step of accumulating scores.

5. The attribute according to claim 4, wherein the attribute characterizing the content is at least one of a keyword, a file size, the number of included characters, the number of nouns, a language, presence / absence / size of a link, an update (creation) time, and a URL. An information search method, wherein the content score and the score of the link itself are calculated using an attribute selected from the attributes.

6. A score calculating step of calculating a content score and a score of the link itself based on an attribute characterizing the content; and calculating a link score based on the calculated content score and the score of the link itself. A link score calculating step of performing a search; and a search priority determining step of determining a search priority based on the calculated link score.

7. Qu input from a predetermined client
a query interpretation procedure for interpreting the query; a search robot creation procedure for creating a search robot; a content acquisition procedure for acquiring content from the computer network; and adaptation of the content to the query based on the query and the content. A content evaluation procedure for calculating a content score that is a degree; a compatibility determination procedure for determining whether or not the content conforms to the input Query;
And a computer-readable recording medium storing a program for causing a computer to execute a content registration procedure for registering content conforming to the above in a content DB.

8. The link evaluation procedure according to claim 7, wherein a link score, which is a degree of relevance of the content pointed to by the link included in the content to the query, is calculated; and a link of the link included in the predetermined content. And a computer-readable recording medium recording a program for causing a computer to execute a link score accumulation procedure for accumulating scores.

9. The method according to claim 8, wherein the attribute characterizing the content is at least one of a keyword, a file size, the number of included characters, the number of nouns, a language, presence / absence / size of a link, an update (creation) time, and a URL. Wherein the content score and the score of the link itself are calculated using an attribute selected from the attributes.

10. A score calculation procedure for calculating a content score and a score of a link itself based on an attribute characterizing the content; and calculating a link score based on the calculated content score and a score of the link itself. A computer-readable recording medium storing a program for causing a computer to execute a link score calculation procedure to perform; and a search priority determination procedure to determine a search priority based on the calculated link score.