JP2005031867A

JP2005031867A - Web information collecting device and web information collecting method

Info

Publication number: JP2005031867A
Application number: JP2003194662A
Authority: JP
Inventors: Shigehiko Suzuki; 茂彦鈴木; Masaki Uchida; 雅規内田; Taisuke Ushio; 泰典牛尾
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-07-09
Filing date: 2003-07-09
Publication date: 2005-02-03

Abstract

PROBLEM TO BE SOLVED: To provide a device that efficiently collects information from home pages containing a keyword when collecting Web page information over the Internet.
SOLUTION: Character data is downloaded across multiple hierarchy levels of a designated home page, and the downloaded character data files are searched using a preset keyword. When the date of a character data file is compared with the already registered data and does not match, the entire home page hit by the keyword search is downloaded, and the changed contents of the downloaded home page are notified to the relevant users by e-mail. In addition, a headline for the home page hit by the keyword is edited, and the home page information is distributed as news.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、キーワードが含まれたホームページを情報収集し、キーワード検索の結果、新規、あるいは変更のあったホームページを関係者にメール配信する技術に関する。
【０００２】
【従来の技術】
インターネットの普及に伴い、インターネット上にあるＷＷＷ（ＷｏｒｌｄＷｉｄｅＷｅｂ：ワールドワイドウエブ、以下ウエブと表現）サイトから、Ｇｏｏｇｌｅに代表される検索エンジンを利用することで所望のウエブ情報を収集することが一般に行なわれている（例えば、非特許文献１参照）。
【０００３】
しかしながら、従来、検索エンジンにおいてキーワード検索する場合、参照しきれない程の膨大なホームページがヒットしてしまう。また、条件を追加していくと、参照したいホームページがヒットしないことが多いなどの問題を抱えていた。
さらに、キーワード検索でヒットしたホームページをダウンロードすると、参照したくないホームページもダウンロードされるため、処理時間がかかり、所望の情報の特定が直ぐにできない。また、更新していないホームページも再確認せざるを得ず、欲しい情報がなかなか参照できない。さらには、グループに属し共同作業を行う複数の利用者が、同一のホームページを参照して、同じ内容を確認しているなど大変不効率なことを行っていた。
【０００４】
【非特許文献１】
日経ＢＰ社「日経パソコン」２００１年新春特集号、Ｎｏ．３７６（９２〜１１５頁）
【０００５】
【発明が解決しようとする課題】
そこで、上記した問題を解決するため、本発明では、キーワード指定されたホームページの複数階層にわたってキーワードが含まれているページを参照し、キーワード検索でヒットしたホームページが改版されていればユーザに通知する（キーワードが含まれていても以前にホームページを参照していれば参照範囲外にする）。また、キーワード検索で、改版、新規のホームページが判っても、本当に参照したいホームページは少ない。そのため、一人がホームページを確認して、内容（ヘッドライン）をまとめ、その内容を関係者にメール配信し、ホームページの参照時間を減らすことを目的とする。
【０００６】
【課題を解決するための手段】
第一の発明は、インターネットを介したウエブ情報収集装置において、指定されたホームページを複数階層にわたって巡回し、前記ホームページ上の文字データを自動的にダウンロードするウエブ巡回手段と、前記文字データをホームページ毎にファイルとして格納する文字データファイル格納手段と、前記ウエブ巡回手段によってダウンロードした文字データファイルを予め設定されたキーワードによって検索するキーワード検索手段と、前記文字データファイルの日付が既登録データを参照して不一致の場合に、前記キーワード検索の結果、キーワードがヒットしたホームページ全体をダウンロードするホームページダウンロード手段と、ダウンロードした前記ホームページの更新または新規内容の情報を関係ユーザに自動通知する情報通知手段と、を有することを特徴とするウエブ情報収集装置に関する。
【０００７】
すなわち、第一の発明によれば、ウエブ巡回手段によって、予め指定されたホームページの複数層を巡回してテキストデータをダウンロードし、そのダウンロードしたテキスト内をキーワード検索し、検索の結果、キーワードが存在したホームページ全体をダウンロードし、かつ更新日付をチェックすることで、改版、あるいは新規のホームページに対し、ユーザに自動通知するような構成とした。
【０００８】
これによって、指定範囲（ダウンロードファイル）内でキーワード検索が出来るため、必要なホームページを簡単に探すことができ、また、指定したホームページの複数階層で、指定キーワードが含まれたホームページの表示を行い、ホームページが改版された場合、新規登録されたホームページのみ通知するため、変更が無いホームページを参照しなくてもよく、検索の処理時間が大幅に短縮できる。
【０００９】
第二の発明は、前記ウエブ巡回手段では、前記キーワード検索手段においてキーワードがヒットしたホームページだけを対象に、次回以降、巡回させることを特徴とする上記第一の発明に記載のウエブ情報収集装置に関する。
すなわち、第二の発明によれば、初回の巡回によるダウンロードファイルでキーワード検索にヒットしたホームページだけを対象に、次回以降、巡回させることになるため、巡回処理時間が大きく短縮され、トータルなウエブ情報収集の管理工数の削減となる。
【００１０】
【発明の実施の形態】
以下、図面にもとづいて本発明の実施形態を説明する。
図１は、本発明の基本システム構成を示す。本発明のシステムは、インターネット３を介して、これに接続する複数の情報提供サーバ（図示していない）からホームページの情報を収集するウエブ情報収集装置１と、ホームページの場所を表すＵＲＬ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒｓ）の設定、キーワード設定、検索結果確認、およびニュース投稿等、前記ウエブ情報端末１とＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）で接続され、ウエブ情報巡回及び検索にあたっての環境設定の入力を行う複数のユーザ端末２とで構成され、グループ内で収集したウエブ情報を共有して管理するシステムとなっている。
【００１１】
また、ウエブ情報収集装置１は、指定したホームページ巡回のためのＵＲＬ管理データベース１０と、ユーザが入力した巡回条件、キーワード条件等ウエブ情報の巡回、検索のための環境条件を保持しておく環境設定ファイル１１と、巡回条件取得手段１２と、指定したＵＲＬにしたがってインターネットに接続する情報提供サーバにアクセスし、ホームページの複数層にわたって巡回して文字データ（拡張子指定）を自動的にダウンロードするウエブ巡回手段１３と、前記文字データをホームページ毎に格納する文字データファイル格納手段１４と、そのダウンロードした文字データファイルを格納するダウンロードフォルダ１５と、格納したダウンロードファイルを予め設定されたキーワードによって検索するキーワード検索手段１６と、キーワード検索の結果、そのヒットした情報を格納するヒット情報データベース１７と、ヒットしたホームページ全体をダウンロードするホームページダウンロード手段１８と、そのダウンロードしたファイルの内、変更、あるいは新規情報があれば、その更新情報を関係ユーザに通知する情報通知手段１９とで構成される。
【００１２】
ここで、ウエブ情報収集装置１は、コンピュータであり、予め内蔵されたプログラムがコンピュータ上で実行され、巡回条件取得手段１２、ウエブ巡回手段１３、文字データファイル格納手段１４、キーワード検索手段１６、ホームページダウンロード手段１８、および情報通知手段１９の各手段が実現される。そして、当該プログラムは、フロッピーディスク、コンパクトディスク、ＣＤ−ＲＯＭ等のコンピュータ読取可能な記録媒体に記録され、とくに図には示していないが、内蔵あるいは、外部接続された媒体読取装置にセットしインストールすることによって実行可能な状態としてもよい。
【００１３】
以下の実施例では、ウエブ情報収集の一例として、「不具合情報の収集」を例に取り上げて説明する。
図２は、本発明の実施の形態になるＵＲＬ管理データベースのデータ構成例を示す図である。ＵＲＬ管理データベース１０のデータ構成は、インターネットのウエブページの場所を表すＵＲＬ、会社名、ＵＲＬの登録日、巡回の収集停止（巡回してヒットしなければ停止）、ＵＲＬの収集日時、セキュリティ付きサイトアクセス時の認証ＩＤ、認証パスワード、ＵＲＬの巡回すべき階層数、初回収集（１又は０で表現）、および更新日時等の項目からなる。
【００１４】
図３は、本発明の実施の形態になるヒット情報データベースのデータ構成例を示す図である。本ヒット情報データベース１７には、ダウンロードしたホームページの内、文字データをキーワード検索してヒットした指定拡張子のファイルから、後述する関係ユーザに内容を通知する（ニュース投稿）データとしての加工情報を保持しておく。
【００１５】
ヒット情報データベース１７のデータ構成は、キーワードヒット有無、キーワードヒット数、総数、習得数、不可数、タイトル、およびヘッドラインの項目からなる。キーワードヒット有無は、１キーワードに対し、ヒットの有無は１又は０で表現する。例えば、キーワード列数が８個であれば、各キーワードに対応して、００１０００１０で表され、合計ヒット数は２件とカウントされる。総数、習得数、不可数は、検索対象としたＵＲＬ数を示す。タイトルは、＜ＴＩＴＬＥ＞の表記であり、表記がなければタイトルなしとなる。また、ヘッドラインは、キーワードがヒットしたテキストの頭からの文字数（例えば、１００字等）を抽出して表現したものである。
【００１６】
つぎに、本発明になるウエブ情報の収集を実施するにあたり、予めのウエブ巡回の設定項目について、図４〜図７を使って説明する。
図４は、本発明の実施の形態になる初期メニュー画面の例を示す図である。初期メニュー画面は、三つの画面領域１０１、１０２、および１０３で構成されている。
【００１７】
画面領域１０１は、これまで検索されたＵＲＬリストについて、会社名、ＵＲＬ、更新日、およびヒット数の項目に対するデータが表示される。□はチェックボックスであり、ブランクは巡回を必要とし、×印は巡回の対象外として処理されたことを表す。これは、画面領域１０３の各設定ニューでユーザによって選択された結果が反映される。
【００１８】
ＵＲＬは、（プロトコル名）：／／（ドメイン名）／（ファイル名）で表される。図中、ｈｔｔｐ：／／ｗｗｗ．ａａａ．ｃｏｍ／ｉｎｄｅｘ．ｈｔｍは、会社ＡＡＡのトップホームページのＵＲＬであり、ｈｔｔｐ：／／は、ウエブのクライアントが情報提供サーバと通信するＨＴＴＰ（ＨｙｐｅｒＴｅｘｔＴｒａｎｓｆｅｒＰｒｏｔｏｃｏｌ）というプロトコルを使った送信命令を表し、続くｗｗｗ．ａａａ．ｃｏｍは、ホームページが保存してあるインターネット上のＷＷＷサーバの名前を表し、ｉｎｄｅｘ．ｈｔｍは、ホームページのトップページを表している。また、（．ｈｔｍ）は、ホームページを記述する言語ＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）によるファイルの拡張子を表している。
【００１９】
ｈｔｔｐ：／／ｗｗｗ．ａａａ．ｃｏｍ／ｘｘｘ１／ｉｎｄｅｘ．ｈｔｍやｈｔｔｐ：／／ｗｗｗ．ａａａ．ｃｏｍ／ｘｘｘ１／ｐｒ．ｈｔｍは、２階層目や３階層目のウエブページを表している。
画面領域１０１のＵＲＬリストにおいて、例えば、ＵＲＬ［ｈｔｔｐ：／／ｗｗｗ．ａａａ．ｃｏｍ／ｘｘｘ１／ｉｎｄｅｘ．ｈｔｍ］のウエブページは、キーワード検索の結果、ｙｙｙｙ年ｍｍ月ｄｄ日に、５件のヒットがあったことを示している。
【００２０】
また、画面領域１０２には、不具合情報、新製品情報、技術情報等のアイテムが表示される。本実施例では、不具合情報が表示されている。
さらに、画面領域１０３には、ＵＲＬ追加（ＵＲＬ、階層、識別子の設定）、ＵＲＬ削除（ＵＲＬリスト、フォルダからのデータ削除）、収集停止・再開（ＵＲＬ毎に収集停止・再開を指示）、収集スケジュール（実行日／曜日／時刻設定）、キーワード設定（ダウンロードデータに対する検索キーワードの設定）、およびニュース投稿等、ユーザが入力すべき検索環境の設定メニューが表示される。各メニューボタンをマウスによってクリックすることで、各設定画面が表示され、該表示画面においてユーザによる入力が行われる。
【００２１】
設定メニューの内、まず、ＵＲＬ追加およびＵＲＬ削除は、ＡＡＡ、ＢＢＢなどの会社のトップホームページ単位でページを追加したり、削除したりする設定機能である。
以下に、他のメニューの機能について説明する。
図５は、本発明の実施の形態になる環境設定の画面例（キーワード設定）を示す。例えば、キーワード列１は、「不具合ａｎｄコンデンサａｎｄＬＳＩ」の検索式を検索キーワードとして検索することを表している。
【００２２】
図６は、本発明の実施の形態になる環境設定の画面例（収集停止・再開設定）を示す。ＵＲＬリストから、収集停止／再開を指示、収集停止したＵＲＬのファイルは、一旦検索対象フォルダの外に移動させ、再開時、検索対象に戻す。前回ヒットの実績が０のものに対し、チェックボックス□に×を入れることで、そのファイルは、巡回収集の対象からはずされる。チェックを全て選択して、最後に収集停止／再開画面の選択画面を表示して、ヒット件数０件のＵＲＬを一括して収集停止することができる。
【００２３】
図７は、本発明の実施の形態になる環境設定の画面例（収集スケジュール設定）を示す。設定項目としては、全巡回周期、差分巡回周期、巡回時刻、収集対象ファイル、および検索階層が表示される。全巡回周期では、毎回、月、週に収集する回数を設定し、差分巡回では、前回検索との比較において、変更分だけを抽出して巡回する場合の設定を行う。本画面例では、全巡回を選択し、１回／週，曜日指定は月曜日、巡回時刻は、１時００分，収集対象ファイルは、ｈｔｍ／ｈｔｍｌ、ｔｘｔ、ｄｏｃ、ｘｌｓの拡張子を指定し、および巡回検索の深さである階層は３階層となるように収集スケジュールを設定している。
【００２４】
ここで、ｈｔｍ／ｈｔｍｌはホームページの記述言語ＨＴＭＬの拡張子、ｔｘｔはテキスト形式の拡張子である。また、ｄｏｃはマイクロソフト社のワードの登録商標、ｘｌｓはマイクロソフト社のエクセルの商標登録、およびｐｄｆはアドビー社の登録商標のそれぞれの拡張子である。
図８は、本発明の実施の形態になる指定条件に基づくウエブ巡回のフローチャートを示す。まず、ステップＳ１１において、ユーザが入力し、保持されている環境設定ファイルから指定されたＵＲＬを読み込む。ステップＳ１２において、ウエブ巡回手段１３が、インターネットに接続する情報提供サーバのホームページを複数階層にわたって巡回し、ステップＳ１３で、指定拡張子の文字データのファイルをダウンロードし、ステップＳ１４において、ダウンロードフォルダ１５に保存する。そして、ステップＳ１５において、環境設定ファイル１１から読み込んだ指定ＵＲＬが全て終了するまで以上の処理を繰り返す。
【００２５】
全てのＵＲＬが終了した時点で次の処理フロー（▲１▼）に移る。
図９は、本発明の実施の形態になるダウンロードファイルのキーワード検索のフローチャートを示す。図８の処理を受けて、ステップＳ２１において、ダウンロードフォルダ１５を参照し、ステップＳ２２で、ダウンロードした文字データファイルが終わるまでダウンロードファイルの参照を続行する。ステップＳ２３において、収集した文字データファイルのキーワード検索を行う。ステップＳ２４で、検索条件が一致すれば、ステップＳ２５に進み、ダウンロードした個々の文字データファイルについて、前回ダウンロ−ドしたファイルとの日付をチェックする。
【００２６】
ステップＳ２６において、日付が不一致であれば、キーワード検索においてヒットしたホームページの情報について、ヒット情報データベース１７に図３のデータ構成に則って必要情報を保存する。
ステップＳ２４において、キーワ−ド条件が一致しなければ、ヒット情報データベース１７の当該ファイルのフラグをＯＦＦとする。また、ステップＳ２６で日付条件が一致しなければ、ステップＳ２５に戻ってダウンロードファイルのチェックを繰り返す。
【００２７】
以上の処理フロ−を終了したら、次の▲２▼の処理フローへと移行する。
図１０は、本発明の実施の形態になる更新ホームページ内容の自動通知のフローチャートを示す。図９の処理フロ−を受けて、ステップＳ３１において、前記ヒット情報データベース１７を参照し、ステップＳ３２で、ヒット情報データベース１７のデータがなくなるまで処理を行う。ステップＳ３３において、ヒット情報データベース１７においてダウンロードファイルのフラグがＯＮのものについて選定し、ステップＳ３４でフラグのついた指定ＵＲＬのホームページ全体をダウンロードする。そしてステップＳ３５で、ダウンロードしたＵＲＬを保存する。
【００２８】
つぎに、ステップＳ３２で、全てのデータ処理が終わったら、ステップＳ３６において、ダウンロードしたＵＲＬについて抽出し、ステップＳ３７において、ダウンロードしたＵＲＬのホームページの変更内容の情報を編集（ニュース投稿）し、関係ユーザにメールで自動通知する。
図１１は、本発明の実施の形態になる自動投稿の作成例を示す。キーワード検索および更新日付をチェックすることで、キーワードが存在したホームページ全体をダウンロードし、更新あるいは新規となったホームページについての情報を関係ユーザに自動通知する。本画面例では、自動投稿と手動投稿が選択できる画面としている。自動投稿では、例えば、キーワード列１に対しヒットしたＵＲＬ１（ｙｙｙｙ／ｍｍ／ｄｄ）、ＵＲＬ２（ｙｙｙｙ／ｍｍ／ｄｄ）・・また、キーワード列２に対しヒットしたＵＲＬａ（ｙｙｙｙ／ｍｍ／ｄｄ）、ＵＲＬｂ（ｙｙｙｙ／ｍｍ／ｄｄ）・・のＵＲＬ群が自動的にリストアップされ、関係ユーザに通知される。また、手動投稿では、フリーなスタイルでのニュース投稿画面が用意される。
【００２９】
図１２は、本発明の実施の形態になる変更内容のメール通知例を示す。変更のあったウエブページのＵＲＬ、更新日、ヒットしたキーワード、および内容についてのヘッドラインを抽出して自動的に関係ユーザにメール通知される。
以上の実施例では、主に「不具合情報」という事例を想定して記述してきたが、本発明は、もちろん、これに限定を受けるものではなく、「新製品情報」、「技術情報」、「特許情報」等、広い範囲のジャンルのウエブ情報収集に同様な手法が適用されることは言うまでもない。
【００３０】
（付記１）インターネットを介したウエブ情報収集装置において、
指定されたホームページを複数階層にわたって巡回し、前記ホームページ上の文字データを自動的にダウンロードするウエブ巡回手段と、
前記文字データをホームページ毎にファイルとして格納する文字データファイル格納手段と、
前記ウエブ巡回手段によってダウンロードした文字データファイルを予め設定されたキーワードによって検索するキーワード検索手段と、
前記文字データファイルの日付が既登録データを参照して不一致の場合に、前記キーワード検索の結果、キーワードがヒットしたホームページ全体をダウンロードするホームページダウンロード手段と、
ダウンロードした前記ホームページの更新または新規内容の情報を関係ユーザに自動通知する情報通知手段と
を有することを特徴とするウエブ情報収集装置。
【００３１】
（付記２）前記情報通知手段における通知情報は、ヒットしたキーワードを含む前後の文字列を抽出して自動生成されたヘッドライン情報であることを特徴とする付記１に記載のウエブ情報収集装置。
（付記３）前記ウエブ巡回装置では、前記キーワード検索ステップにおいてキーワードがヒットしたホームページだけを対象に、次回以降、巡回させることを特徴とする付記１記載のウエブ情報収集装置。
【００３２】
（付記４）インターネットを介したウエブ情報収集方法において、
指定されたホームページを複数階層にわたって巡回し、前記ホームページ上の文字データを自動的にダウンロードするウエブ巡回ステップと、
前記文字データをホームページ毎にファイルとして格納する文字データファイル格納ステップと、
前記ウエブ巡回ステップによってダウンロードした文字データファイルを予め設定されたキーワードによって検索するキーワード検索ステップと、
前記文字データファイルの日付が既登録データを参照して不一致の場合に、前記キーワード検索の結果、キーワードがヒットしたホームページ全体をダウンロードするホームページダウンロードステップと、
ダウンロードした前記ホームページの更新または新規内容の情報を関係ユーザに自動通知する情報通知ステップと、
を有することを特徴とするウエブ情報収集方法。
【００３３】
（付記５）インターネットを介したウエブ情報収集プログラムにおいて、
コンピュータに、
指定されたホームページを複数階層にわたって巡回し、前記ホームページ上の文字データを自動的にダウンロードするウエブ巡回ステップと、
前記文字データをホームページ毎にファイルとして格納する文字データファイル格納ステップと、
前記ウエブ巡回ステップによってダウンロードした文字データファイルを予め設定されたキーワードによって検索するキーワード検索ステップと、
前記文字データファイルの日付が既登録データを参照して不一致の場合に、前記キーワード検索の結果、キーワードがヒットしたホームページ全体をダウンロードするホームページダウンロードステップと、
ダウンロードした前記ホームページの更新または新規内容の情報を関係ユーザに自動通知する情報通知ステップと、
を実行させるウエブ情報収集プログラム。
【００３４】
（付記６）インターネットを介したウエブ情報収集プログラムを記録した記録媒体であって、
コンピュータに、
指定されたホームページを複数階層にわたって巡回し、前記ホームページ上の文字データを自動的にダウンロードするウエブ巡回ステップと、
前記文字データをホームページ毎にファイルとして格納する文字データファイル格納ステップと、
前記ウエブ巡回ステップによってダウンロードした文字データファイルを予め設定されたキーワードによって検索するキーワード検索ステップと、
前記文字データファイルの日付が既登録データを参照して不一致の場合に、前記キーワード検索の結果、キーワードがヒットしたホームページ全体をダウンロードするホームページダウンロードステップと、
ダウンロードした前記ホームページの更新または新規内容の情報を関係ユーザに自動通知する情報通知ステップと、
を実行させるウエブ情報収集プログラム記録したコンピュータ読取可能な記録媒体。
【００３５】
【発明の効果】
以上、説明してきたように、本発明によれば、指定したホームページの複数階層で、指定キーワードが含まれたホームページの表示を行い、ホームページが改版された場合、新規登録されたホームページのみ通知するため、変更が無いホームページを参照しなくてもよく、検索の処理時間が大幅に短縮できる。
【００３６】
また、指定範囲（ダウンロードファイル）内でキーワード検索が出来るため、必要なホームページを簡単に探すことができる。
さらに、本発明によれば、担当者がホームページの内容を確認の上、ヘッドラインを変更して必要者にメールにて配信することになるため、担当者以外は、その内容（ヘッドライン）を確認するだけで、必要がなければホームページを参照しなくて済むため、検索に要する工数の削減が図れる。
【図面の簡単な説明】
【図１】本発明になる基本システム構成を示す図である。
【図２】本発明の実施の形態になるＵＲＬ管理データベースのデータ構成例を示す図である。
【図３】本発明の実施の形態になるヒット情報データベースのデータ構成例を示す図である。
【図４】本発明の実施の形態になる初期メニュー画面の例を示す図である。
【図５】本発明の実施の形態になる環境設定の画面例（キーワード設定）を示す図である。
【図６】本発明の実施の形態になる環境設定の画面例（収集停止・再開設定）を示す図である。
【図７】本発明の実施の形態になる環境設定の画面例（収集スケジュール設定）を示す図である。
【図８】本発明の実施の形態になる指定条件に基づくウエブ巡回のフローチャートを示す図である。
【図９】本発明の実施の形態になるダウンロードファイルのキーワード検索のフローチャートを示す図である。
【図１０】本発明の実施の形態になる更新ホームページ内容の自動通知のフローチャートを示す図である。
【図１１】本発明の実施の形態になる自動投稿の作成例を示す図である。
【図１２】本発明の実施の形態になる変更内容のメール通知例を示す図である。
【符号の説明】
１ウエブ情報収集装置
２ユーザ端末
３インターネット
１０ＵＲＬ管理データベース
１１環境設定ファイル
１２巡回条件取得手段
１３ウエブ巡回手段
１４文字データファイル格納手段
１５ダウンロードフォルダ
１６キーワード検索手段
１７ヒット情報データベース
１８ホームページダウンロード手段
１９情報通知手段
[0001]
[Technical field of the invention]
The present invention relates to a technique for collecting information from home pages that contain a keyword and, based on the keyword search results, delivering new or changed home pages to the relevant parties by e-mail.
[0002]
[Prior art]
With the spread of the Internet, it has become common to collect desired web information from WWW (World Wide Web, hereinafter "the web") sites on the Internet by using a search engine such as Google (see, for example, Non-Patent Document 1).
[0003]
However, when a keyword search is performed with a conventional search engine, so many home pages are hit that they cannot all be reviewed. Moreover, as search conditions are added, the home page the user wants to see is often no longer hit.
Furthermore, when the home pages hit by a keyword search are downloaded, home pages the user does not want to see are downloaded as well, so processing takes time and the desired information cannot be identified quickly. Home pages that have not been updated must also be rechecked, so the desired information is hard to reach. In addition, multiple users who belong to a group and work collaboratively each refer to the same home page and check the same content, which is very inefficient.
[0004]
[Non-Patent Document 1]
Nikkei BP, “Nikkei PC”, New Year 2001 special issue, No. 376 (pp. 92-115)
[0005]
[Problems to be solved by the invention]
Therefore, to solve the above problems, the present invention refers to the pages that contain a keyword across multiple hierarchy levels of a home page designated for keyword search, and notifies the user if a home page hit by the keyword search has been revised (even if a page contains the keyword, it is excluded from review if it has already been referred to). Moreover, even when the keyword search identifies revised or new home pages, few of them are pages the user really wants to see. The invention therefore aims to reduce the time spent referring to home pages by having one person check the home pages, summarize their contents (headlines), and distribute the summary to the relevant parties by e-mail.
[0006]
[Means for Solving the Problems]
A first invention relates to a web information collecting device that operates via the Internet, comprising: web patrol means for patrolling a designated home page across multiple hierarchy levels and automatically downloading character data on the home page; character data file storage means for storing the character data as a file for each home page; keyword search means for searching the character data files downloaded by the web patrol means with a preset keyword; home page download means for downloading, when the date of a character data file does not match the already registered data, the entire home page in which the keyword was hit by the keyword search; and information notification means for automatically notifying related users of information on the updated or new content of the downloaded home page.
[0007]
That is, according to the first invention, the web patrol means patrols multiple levels of a pre-designated home page and downloads text data; the downloaded text is searched for a keyword; the entire home page in which the keyword was found is downloaded; and the update date is checked, so that the user is automatically notified of revised or new home pages.
[0008]
As a result, keyword searches can be performed within the specified range (the downloaded files), so the required home pages are easy to find. In addition, home pages containing the specified keyword are displayed across multiple levels of the designated home page, and only home pages that have been revised or newly registered are notified, so unchanged home pages need not be reviewed and search processing time is greatly reduced.
[0009]
A second invention relates to the web information collecting device according to the first invention, wherein, from the next patrol onward, the web patrol means patrols only the home pages in which the keyword was hit by the keyword search means.
That is, according to the second invention, from the next patrol onward only the home pages that hit the keyword search in the files downloaded by the first patrol are patrolled, so patrol processing time is greatly reduced and the overall management effort for web information collection is reduced.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 shows the basic system configuration of the present invention. The system comprises a web information collecting device 1, which collects home page information via the Internet 3 from a plurality of information providing servers (not shown) connected to it, and a plurality of user terminals 2, which are connected to the web information collecting device 1 via a LAN (Local Area Network) and are used to enter the environment settings for web patrol and search, such as setting the URLs (Uniform Resource Locators) that indicate home page locations, setting keywords, checking search results, and posting news. The collected web information is shared and managed within a group.
[0011]
The web information collecting device 1 comprises: a URL management database 10 for patrolling designated home pages; an environment setting file 11 that holds the environment conditions for web patrol and search, such as the patrol conditions and keyword conditions entered by the user; patrol condition acquisition means 12; web patrol means 13, which accesses the information providing servers connected to the Internet according to the designated URLs and patrols multiple levels of each home page to automatically download character data (selected by file extension); character data file storage means 14, which stores the character data for each home page; a download folder 15, which stores the downloaded character data files; keyword search means 16, which searches the stored download files with preset keywords; a hit information database 17, which stores the information hit by the keyword search; home page download means 18, which downloads the entire home page that was hit; and information notification means 19, which notifies related users of the update information when the downloaded files contain changed or new information.
[0012]
Here, the web information collecting device 1 is a computer. A pre-installed program is executed on the computer to implement the patrol condition acquisition means 12, web patrol means 13, character data file storage means 14, keyword search means 16, home page download means 18, and information notification means 19. The program may be recorded on a computer-readable recording medium such as a floppy disk, compact disc, or CD-ROM and, although not shown in the figures, made executable by loading the medium into a built-in or externally connected media reader and installing it.
[0013]
In the following embodiment, “collection of defect information” is used as an example of web information collection.
FIG. 2 is a diagram showing a data configuration example of the URL management database according to the embodiment of the present invention. The data structure of the URL management database 10 consists of the following items: the URL indicating the location of a web page on the Internet, the company name, the URL registration date, a collection stop flag (patrolling stops if a patrol produces no hits), the URL collection date and time, the authentication ID and authentication password used when accessing a secured site, the number of hierarchy levels to patrol for the URL, an initial collection flag (1 or 0), and the update date and time.
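As an illustration only, one row of the URL management database 10 could be modelled as follows. The field names, types, and default values are assumptions chosen for this sketch; the description lists the items but does not define a schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class UrlRecord:
    """One row of the URL management database (hypothetical field names)."""
    url: str                                 # location of the web page
    company_name: str
    registered_on: datetime                  # URL registration date
    collection_stopped: bool = False         # patrolling stops when set
    collected_at: Optional[datetime] = None  # last collection date and time
    auth_id: Optional[str] = None            # for sites that require login
    auth_password: Optional[str] = None
    depth: int = 3                           # levels to patrol (assumed default)
    first_collection: bool = True            # the "initial collection" 1/0 item
    updated_at: Optional[datetime] = None


record = UrlRecord(
    url="http://www.aaa.com/index.htm",
    company_name="AAA",
    registered_on=datetime(2003, 7, 9),
)
```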
[0014]
FIG. 3 is a diagram showing a data configuration example of the hit information database according to the embodiment of the present invention. The hit information database 17 holds processed information derived from the files with the designated extensions that were hit when the character data of the downloaded home pages was searched by keyword. This information serves as the data for notifying related users of the content (news posting), as described later.
[0015]
The data structure of the hit information database 17 consists of the following items: keyword hit flags, keyword hit count, total count, acquired count, failed count, title, and headline. The keyword hit flags record, for each keyword, whether it was hit, as 1 or 0. For example, with eight keyword strings, the flags may read 00100010, one digit per keyword, and the total hit count is then 2. The total, acquired, and failed counts indicate the numbers of URLs subject to the search. The title is the content of the <TITLE> tag; if the tag is absent, the page is recorded as having no title. The headline is a fixed number of characters (for example, 100) extracted from the beginning of the text in which the keyword was hit.
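The hit flags, title, and headline items described above can be sketched as small helper functions. This is a simplified illustration: the function names are hypothetical, each keyword string is treated as a plain substring rather than a search expression, and the regular expression is only a rough stand-in for real HTML parsing.

```python
import re


def hit_flags(text, keyword_strings):
    """Return one '1' or '0' per keyword string, in the order given."""
    return "".join("1" if kw in text else "0" for kw in keyword_strings)


def hit_count(flags):
    """Total number of keyword strings that were hit."""
    return flags.count("1")


def extract_title(html):
    """Return the <TITLE> content, or a placeholder when the tag is absent."""
    m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else "(no title)"


def headline(text, length=100):
    """Take the first `length` characters of the text as the headline."""
    return text[:length]
```

With the eight-digit example from the description, `hit_count("00100010")` evaluates to 2.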
[0016]
Next, the web patrol settings that must be made before collecting web information according to the present invention are described with reference to FIGS. 4 to 7.
FIG. 4 is a diagram showing an example of an initial menu screen according to the embodiment of the present invention. The initial menu screen includes three screen areas 101, 102, and 103.
[0017]
The screen area 101 displays, for the list of URLs searched so far, the company name, URL, update date, and number of hits. □ is a check box: a blank box indicates that the URL is to be patrolled, and an × indicates that it has been excluded from patrol. The display reflects the selections the user made in the setting menus of the screen area 103.
[0018]
A URL has the form (protocol name)://(domain name)/(file name). In the figure, http://www.aaa.com/index.htm is the URL of the top home page of company AAA. Here http:// indicates a transmission command using HTTP (HyperText Transfer Protocol), the protocol by which a web client communicates with an information providing server; www.aaa.com is the name of the WWW server on the Internet where the home page is stored; and index.htm is the top page of the home page. The extension .htm denotes a file written in HTML (HyperText Markup Language), the language used to describe home pages.
[0019]
http://www.aaa.com/xxx1/index.htm and http://www.aaa.com/xxx1/pr.htm are web pages on the second and third levels.
In the URL list of the screen area 101, for example, the entry for the URL [http://www.aaa.com/xxx1/index.htm] shows that the keyword search produced five hits on the date yyyy/mm/dd.
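The URL structure described above (protocol, server name, file name, extension) can be taken apart with Python's standard `urllib.parse` module. The sketch below is illustrative only; `split_url` and the dictionary keys are hypothetical names, not part of the described system.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the parts named in the description."""
    parts = urlsplit(url)
    file_name = parts.path.rsplit("/", 1)[-1]
    extension = file_name.rsplit(".", 1)[-1] if "." in file_name else ""
    return {
        "protocol": parts.scheme,   # e.g. "http"
        "server": parts.netloc,     # e.g. "www.aaa.com"
        "path": parts.path,         # e.g. "/xxx1/pr.htm"
        "file": file_name,          # e.g. "pr.htm"
        "extension": extension,     # e.g. "htm"
    }


info = split_url("http://www.aaa.com/xxx1/pr.htm")
```

Note that the hierarchy level of a page is determined by how it is reached while patrolling from the top page, so it is deliberately not derived from the path here.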
[0020]
In the screen area 102, items such as defect information, new product information, and technical information are displayed. In this embodiment, defect information is displayed.
Further, the screen area 103 displays the search environment setting menu for the user: URL addition (setting the URL, hierarchy, and identifier), URL deletion (deleting data from the URL list and folders), collection stop/restart (specified for each URL), collection schedule (execution date, day of the week, and time), keyword setting (search keywords applied to the downloaded data), and news posting. Clicking a menu button with the mouse displays the corresponding setting screen, on which the user enters the settings.
[0021]
In the setting menu, first, URL addition and URL deletion are setting functions for adding and deleting pages in units of top homepages of companies such as AAA and BBB.
Hereinafter, functions of other menus will be described.
FIG. 5 shows an example of an environment setting screen (keyword setting) according to the embodiment of the present invention. For example, keyword string 1 specifies that the search expression “defect and capacitor and LSI” is used as the search keyword.
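A search expression of the form “defect and capacitor and LSI” can be evaluated as a conjunction of terms. The following is a minimal sketch under stated assumptions: only the `and` operator is handled, matching is case-insensitive substring matching, and the function name `matches` is hypothetical. The description does not say how the actual keyword search means 16 evaluates expressions.

```python
import re


def matches(expression, text):
    """Return True when every term of an 'a and b and c' expression is in text."""
    terms = [t.strip().lower()
             for t in re.split(r"\s+and\s+", expression, flags=re.IGNORECASE)]
    haystack = text.lower()
    # An empty term (for example from an empty expression) never matches.
    return all(bool(term) and term in haystack for term in terms)
```

Because this uses substring matching, the term "defect" also matches "defective"; a stricter implementation might match on word boundaries instead.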
[0022]
FIG. 6 shows an example of the environment setting screen (collection stop/restart setting) according to the embodiment of the present invention. Collection stop or restart is specified from the URL list. The files of a URL whose collection has been stopped are moved out of the search target folder and are returned to it when collection restarts. For a URL whose previous patrol produced 0 hits, putting an × in the check box □ removes its files from patrol collection. By selecting all the relevant check boxes and then displaying the collection stop/restart selection screen, collection can be stopped for all URLs with 0 hits at once.
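The bulk stop of URLs with 0 hits can be sketched as a set operation. The data shapes here are assumptions for illustration: `last_hits` maps each URL to its hit count from the previous patrol, and `stopped` is the set of URLs already excluded. Moving files out of the search target folder is not modelled.

```python
def stop_zero_hit_urls(last_hits, stopped):
    """Return the stopped set extended with every URL that had 0 hits."""
    return stopped | {url for url, hits in last_hits.items() if hits == 0}


def patrol_targets(last_hits, stopped):
    """Return the URLs that remain subject to patrol, in their original order."""
    return [url for url in last_hits if url not in stopped]


last_hits = {
    "http://www.aaa.com/index.htm": 5,
    "http://www.bbb.com/index.htm": 0,
}
stopped = stop_zero_hit_urls(last_hits, set())
targets = patrol_targets(last_hits, stopped)
```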
[0023]
FIG. 7 shows an example of an environment setting screen (collection schedule setting) according to the embodiment of the present invention. The setting items displayed are the full patrol cycle, differential patrol cycle, patrol time, collection target files, and search hierarchy. For the full patrol cycle, the collection frequency is set (every time, or a number of times per month or per week). For the differential patrol, settings are made for patrolling only the portions that have changed since the previous search. In this example, full patrol is selected, once per week on Mondays at 1:00; the collection target files are those with the extensions htm/html, txt, doc, and xls; and the hierarchy depth of the patrol search is set to three levels.
[0024]
Here, htm/html are the extensions for HTML, the home page description language, and txt is the extension for plain text. In addition, doc is the extension of Word, a registered trademark of Microsoft Corporation; xls is the extension of Excel, a registered trademark of Microsoft Corporation; and pdf is the extension of a registered trademark of Adobe.
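The collection schedule of the FIG. 7 example (full patrol, once per week on Monday at 1:00, extensions htm/html, txt, doc, xls, three levels) could be held in a small configuration object. The class and field names below are hypothetical; only the values come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionSchedule:
    """Settings from the FIG. 7 example (field names are assumptions)."""
    mode: str = "full"             # "full" or "differential" patrol
    times_per_week: int = 1
    weekday: str = "Monday"
    patrol_time: str = "01:00"
    extensions: tuple = ("htm", "html", "txt", "doc", "xls")
    depth: int = 3                 # hierarchy levels to patrol


def is_collection_target(file_name, schedule):
    """Return True when the file's extension is one of the configured ones."""
    ext = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    return ext in schedule.extensions


schedule = CollectionSchedule()
```

The pdf extension is mentioned in the text but is not among the extensions selected in the example, so it is left out of the default here.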
FIG. 8 shows a flowchart of the web patrol based on the designated conditions according to the embodiment of the present invention. First, in step S11, the designated URLs are read from the environment setting file that holds the user's input. In step S12, the web patrol means 13 patrols the home pages of the information providing servers connected to the Internet across multiple levels; in step S13, it downloads the character data files with the designated extensions; and in step S14, it saves them in the download folder 15. In step S15, the above processing is repeated until all the designated URLs read from the environment setting file 11 have been processed.
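The patrol of steps S12 to S14 can be sketched as a breadth-first traversal with a depth limit. This is an illustration, not the described implementation: the traversal order is an assumption, network access is replaced by an injected `fetch_links` function, and the in-memory `site` dictionary is made-up test data.

```python
from collections import deque


def patrol(start_url, max_depth, fetch_links, wanted):
    """Visit pages breadth-first down to max_depth and collect wanted URLs.

    fetch_links(url) returns the URLs linked from a page; wanted(url) decides
    whether a file has one of the designated extensions.
    """
    seen = {start_url}
    queue = deque([(start_url, 1)])   # the top page counts as level 1
    collected = []
    while queue:
        url, level = queue.popleft()
        if wanted(url):
            collected.append(url)     # stands in for download and save
        if level == max_depth:
            continue                  # do not follow links any deeper
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return collected


# Made-up link structure used in place of real HTTP requests.
site = {
    "http://www.aaa.com/index.htm": [
        "http://www.aaa.com/xxx1/index.htm",
        "http://www.aaa.com/logo.gif",
    ],
    "http://www.aaa.com/xxx1/index.htm": [
        "http://www.aaa.com/xxx1/pr.htm",
        "http://www.aaa.com/index.htm",
    ],
    "http://www.aaa.com/xxx1/pr.htm": ["http://www.aaa.com/deep.htm"],
}
pages = patrol(
    "http://www.aaa.com/index.htm",
    3,
    lambda url: site.get(url, []),
    lambda url: url.endswith((".htm", ".html", ".txt")),
)
```

Step S15 would correspond to calling `patrol` once for each designated URL read from the environment setting file.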
[0025]
When all the URLs have been processed, the flow proceeds to the next processing flow (1).
FIG. 9 shows a flowchart of the keyword search of the download files according to the embodiment of the present invention. Following the processing of FIG. 8, the download folder 15 is referred to in step S21, and in step S22 the download files continue to be examined until all the downloaded character data files have been processed. In step S23, a keyword search is performed on the collected character data file. If the search condition is satisfied in step S24, the flow proceeds to step S25, where the date of each downloaded character data file is checked against that of the previously downloaded file.
[0026]
If the dates do not match in step S26, the necessary information about the home page hit by the keyword search is stored in the hit information database 17 in accordance with the data structure of FIG. 3.
If the keyword condition is not satisfied in step S24, the flag of that file in the hit information database 17 is set to OFF. If the date condition is not satisfied in step S26, the process returns to step S25 and the check of the download files is repeated.
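The keyword and date checks of steps S23 to S26 can be sketched as a single pass over the downloaded files. The data shapes are assumptions for illustration: `downloaded` maps a URL to its text and date, `registered_dates` holds the date recorded at the previous download, and a page with no recorded date is treated as new.

```python
def check_files(downloaded, registered_dates, keyword_match):
    """Return a flag per URL: True when the keyword hits and the date changed."""
    flags = {}
    for url, (text, date) in downloaded.items():
        if not keyword_match(text):
            flags[url] = False   # S24: keyword condition not met, flag OFF
            continue
        # S25/S26: flag ON only when the date differs from the registered one;
        # an unregistered (new) page has no stored date and is also flagged.
        flags[url] = registered_dates.get(url) != date
    return flags


downloaded = {
    "u1": ("capacitor defect", "2003-07-01"),   # changed since last download
    "u2": ("capacitor defect", "2003-06-01"),   # unchanged
    "u3": ("new product", "2003-07-01"),        # keyword not present
    "u4": ("defect report", "2003-07-02"),      # new page
}
registered = {"u1": "2003-06-01", "u2": "2003-06-01", "u3": "2003-05-01"}
flags = check_files(downloaded, registered, lambda text: "defect" in text)
```

Only the URLs flagged True would then be passed on to the full home page download of the next flow.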
[0027]
When the above process flow is completed, the process proceeds to the next process flow (2).
FIG. 10 shows a flowchart of the automatic notification of updated home page contents according to the embodiment of the present invention. Following the processing flow of FIG. 9, the hit information database 17 is referred to in step S31, and in step S32 processing continues until no data remains in the hit information database 17. In step S33, the entries in the hit information database 17 whose download file flag is ON are selected, and in step S34 the entire home page of each flagged designated URL is downloaded. In step S35, the downloaded URL is saved.
[0028]
Next, when all the data has been processed in step S32, the downloaded URLs are extracted in step S36, and in step S37 the information on the changes to the home pages of the downloaded URLs is edited (news posting) and the related users are automatically notified by e-mail.
FIG. 11 shows an example of creating an automatic post according to the embodiment of the present invention. By performing the keyword search and checking the update date, the entire home page in which the keyword was found is downloaded, and information about updated or new home pages is automatically notified to related users. In this screen example, either automatic posting or manual posting can be selected. In automatic posting, for example, the URLs hit for keyword string 1, URL1 (yyyy/mm/dd), URL2 (yyyy/mm/dd), and so on, and the URLs hit for keyword string 2, URLa (yyyy/mm/dd), URLb (yyyy/mm/dd), and so on, are automatically listed and notified to related users. For manual posting, a free-form news posting screen is provided.
[0029]
FIG. 12 shows a mail notification example of the change contents according to the embodiment of the present invention. The URL of the web page that has been changed, the update date, the hit keyword, and the headline about the content are extracted and the relevant users are automatically notified by email.
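A notification of this kind could be assembled with Python's standard `email.message.EmailMessage` class. The sketch below only builds the message and does not send it; the function name, the subject wording, the dictionary keys, and the example addresses are all hypothetical.

```python
from email.message import EmailMessage


def build_notification(sender, recipients, changes):
    """Build one e-mail listing URL, update date, keywords, and headline."""
    lines = []
    for change in changes:
        lines.append(f"URL: {change['url']}")
        lines.append(f"Updated: {change['updated']}")
        lines.append(f"Keywords: {', '.join(change['keywords'])}")
        lines.append(f"Headline: {change['headline']}")
        lines.append("")               # blank line between entries
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Updated home pages: {len(changes)}"
    msg.set_content("\n".join(lines))
    return msg


msg = build_notification(
    "collector@example.com",
    ["a@example.com", "b@example.com"],
    [{
        "url": "http://www.aaa.com/xxx1/index.htm",
        "updated": "2003/07/09",
        "keywords": ["defect", "capacitor"],
        "headline": "Notice regarding a capacitor defect",
    }],
)
```

Actually delivering the message (for example through `smtplib`) would depend on the mail server available on the LAN, which the description does not specify.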
The above embodiment has been described mainly for the case of “defect information”, but the present invention is of course not limited to this. It goes without saying that the same technique can be applied to collecting web information in a wide range of genres, such as “new product information”, “technical information”, and “patent information”.
[0030]
(Supplementary note 1) A web information collecting device that operates via the Internet, comprising:
web patrol means for patrolling a designated home page across multiple hierarchy levels and automatically downloading character data on the home page;
character data file storage means for storing the character data as a file for each home page;
keyword search means for searching the character data files downloaded by the web patrol means with a preset keyword;
home page download means for downloading, when the date of a character data file does not match the already registered data, the entire home page in which the keyword was hit by the keyword search; and
information notification means for automatically notifying related users of information on the updated or new content of the downloaded home page.
[0031]
(Supplementary note 2) The web information collecting device according to supplementary note 1, wherein the notification information in the information notification means is headline information automatically generated by extracting the character strings before and after the hit keyword, including the keyword itself.
(Supplementary note 3) The web information collecting device according to supplementary note 1, wherein, from the next patrol onward, the web patrol means patrols only the home pages in which the keyword was hit by the keyword search.
[0032]
(Supplementary note 4) A web information collecting method via the Internet, comprising:
a web patrol step of patrolling a designated home page across multiple hierarchy levels and automatically downloading character data on the home page;
a character data file storing step of storing the character data as a file for each home page;
a keyword search step of searching the character data files downloaded in the web patrol step with a preset keyword;
a home page download step of downloading, when the date of a character data file does not match the already registered data, the entire home page in which the keyword was hit by the keyword search; and
an information notification step of automatically notifying related users of information on the updated or new content of the downloaded home page.
[0033]
(Supplementary note 5) In a web information collection program via the Internet,
On a computer,
A web patrol step of patrolling a designated homepage over a plurality of layers and automatically downloading character data on the homepage;
A character data file storing step of storing the character data as a file for each homepage;
A keyword search step of searching the character data files downloaded in the web patrol step using a preset keyword;
A homepage download step of downloading the entire homepage hit by the keyword search when the date of the character data file does not match the already registered data;
An information notification step of automatically notifying the users concerned of updated or new content information of the downloaded homepage;
A web information collection program that causes the computer to execute the above steps.
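The keyword search step and the date collation that triggers the homepage download step can be sketched as follows. The function name find_changed_hits and the two dictionaries are assumptions for illustration; in the embodiment the registered dates are held in a database, which this sketch replaces with a plain dictionary.

```python
from datetime import date


def find_changed_hits(files, keywords, registered):
    """Return URLs that contain a keyword and whose date differs from the
    registered date, or that are not registered yet.

    files:      {url: (text, date)} character data files from the patrol
    registered: {url: date} dates recorded on earlier runs (updated in place)
    """
    to_download = []
    for url, (text, file_date) in files.items():
        if not any(k in text for k in keywords):
            continue  # no keyword hit
        if registered.get(url) == file_date:
            continue  # hit, but unchanged since it was registered
        registered[url] = file_date
        to_download.append(url)
    return to_download
```

Only the URLs returned by this function would be downloaded in full, so unchanged pages and pages without a keyword hit are skipped.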
[0034]
(Supplementary note 6) In a recording medium on which a web information collection program via the Internet is recorded,
On a computer,
A web patrol step of patrolling a designated homepage over a plurality of layers and automatically downloading character data on the homepage;
A character data file storing step of storing the character data as a file for each homepage;
A keyword search step of searching the character data files downloaded in the web patrol step using a preset keyword;
A homepage download step of downloading the entire homepage hit by the keyword search when the date of the character data file does not match the already registered data;
An information notification step of automatically notifying the users concerned of updated or new content information of the downloaded homepage;
A computer-readable recording medium on which is recorded a web information collection program that causes the computer to execute the above steps.
[0035]
[Effect of the invention]
As described above, according to the present invention, homepages containing the specified keyword are referenced across a plurality of layers of the designated homepage, and the user is notified only of homepages that have been revised or newly registered. It is therefore unnecessary to refer to homepages that have not changed, and the search processing time can be greatly reduced.
[0036]
In addition, because it is possible to search for keywords within the specified range (download file), it is possible to easily find the necessary homepage.
Furthermore, according to the present invention, the person in charge confirms the contents of the homepage, edits the headline, and delivers it by e-mail to those who need it. People other than the person in charge therefore need only check the contents (headline), and when that is sufficient they do not have to refer to the homepage itself, which reduces the man-hours required for searching.
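The e-mail delivery of edited headlines can be sketched with the Python standard library. The function name build_notification, the subject wording, and the body layout are assumptions for illustration, and the message is only built here, not sent.

```python
from email.message import EmailMessage


def build_notification(sender, recipients, headlines):
    """Build (but do not send) a mail listing a headline and URL per changed page.

    headlines: list of (url, headline) pairs edited by the person in charge
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Homepage updates: {len(headlines)} item(s)"
    body = "\n".join(f"- {headline}\n  {url}" for url, headline in headlines)
    msg.set_content(body)
    return msg
```

The resulting message could then be handed to a mail server, for example with smtplib.SMTP.send_message; the server address and any authentication depend on the environment and are left out of this sketch.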
[Brief description of the drawings]
FIG. 1 is a diagram showing a basic system configuration according to the present invention.
FIG. 2 is a diagram showing a data configuration example of a URL management database according to the embodiment of the present invention.
FIG. 3 is a diagram showing a data configuration example of a hit information database according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of an initial menu screen according to the embodiment of the present invention.
FIG. 5 is a diagram showing an example of an environment setting screen (keyword setting) according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of an environment setting screen (collection stop / restart setting) according to the embodiment of the present invention.
FIG. 7 is a diagram showing an example of an environment setting screen (collection schedule setting) according to the embodiment of the present invention.
FIG. 8 is a diagram showing a flowchart of web patrol based on designated conditions according to the embodiment of the present invention.
FIG. 9 is a flowchart showing a keyword search for a download file according to the embodiment of the present invention.
FIG. 10 is a diagram showing a flowchart of automatic notification of updated homepage contents according to the embodiment of the present invention.
FIG. 11 is a diagram showing an example of creating an automatic posting according to the embodiment of the present invention.
FIG. 12 is a diagram showing a mail notification example of change contents according to the embodiment of the present invention.
[Explanation of symbols]
1 Web information collecting device
2 User terminal
3 Internet
10 URL management database
11 Environment setting file
12 Patrol condition acquisition means
13 Web patrol means
14 Character data file storage means
15 Download folder
16 Keyword search means
17 Hit information database
18 Homepage download means
19 Information notification means

Claims

1. In a web information collecting device via the Internet,
Web patrol means for patrolling a designated homepage over a plurality of layers and automatically downloading character data on the homepage;
Character data file storage means for storing the character data as a file for each homepage;
Keyword search means for searching the character data files downloaded by the web patrol means using a preset keyword;
Homepage download means for downloading the entire homepage hit by the keyword search when the date of the character data file does not match the already registered data;
Information notification means for automatically notifying the users concerned of updated or new content information of the downloaded homepage.

2. The web information collecting device according to claim 1, wherein, from the next time onward, the web patrol means patrols only the homepages hit by the keyword in the keyword search means.

3. In a web information collecting method via the Internet,
A web patrol step of patrolling a designated homepage over a plurality of layers and automatically downloading character data on the homepage;
A character data file storing step of storing the character data as a file for each homepage;
A keyword search step of searching the character data files downloaded in the web patrol step using a preset keyword;
A homepage download step of downloading the entire homepage hit by the keyword search when the date of the character data file does not match the already registered data;
An information notification step of automatically notifying the users concerned of updated or new content information of the downloaded homepage;
A web information collecting method characterized by comprising the above steps.