CN101866342B

CN101866342B - Method and device for generating or displaying webpage label and information sharing system

Info

Publication number: CN101866342B
Application number: CN 200910133976
Authority: CN
Inventors: 郝宇; 孟遥; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-04-16
Filing date: 2009-04-16
Publication date: 2013-09-11
Anticipated expiration: 2029-04-16
Also published as: CN101866342A

Abstract

Disclosed are a method and device for generating or displaying webpage annotations, and an information sharing system based on webpage annotations. The method for generating webpage annotation information includes: in response to a user selecting a target webpage element as the annotated object on the current webpage loaded in the client's web browser, extracting the XPath path of the annotated object in the document object model (DOM) tree of the current webpage; generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after it in the current webpage; and generating webpage annotation information based on the XPath path of the annotated object, the feature code CF, and the annotation input by the user. The webpage annotation information is stored in an annotation database of a remote annotation server, and the feature code CF of the annotated object is composed of the content-based feature (CBF) of the annotated object and the CBFs of its context webpage elements.

Description

Method and device for generating or displaying webpage annotations, and information sharing system

Technical Field

The present invention relates generally to webpage annotation technology, and in particular to techniques for generating or displaying webpage annotations that take into account the content of the target webpage element serving as the annotated object, and to techniques for sharing information based on such webpage annotations.

Background Art

Annotation is a technique for adding information to a document. The concept originated in print media, where it includes highlighting keywords, adding marginal notes, and the like. With the rapid development and spread of computer and network technology, online media have become one of the main channels through which people obtain information. Webpage annotation technology has accordingly received attention and developed, and has become a popular topic in fields including digital libraries, computer-supported cooperative work, and information sharing and management.

Traditional Web systems give content or information providers a convenient publishing platform, such as a webpage authoring platform. This form of information exchange, however, is essentially one-way: the interactions available to a reader are limited to clicking links, adding bookmarks, and the like. The currently popular Web 2.0 concept emphasizes participation and information sharing by Web users at large, so that information flows in two or even many directions. Commonly used information sharing technologies include:

- RSS (Really Simple Syndication): a server aggregates the content to be published, and the user selects which content to receive. The user can only passively receive what the RSS source publishes, so the information flow remains asymmetric;

- Interactive Web publishing platforms (for example, wikis and blogs): users publish their own articles and opinions on such platforms in order to share information. This kind of sharing, however, must take place within specifically structured webpages; users cannot share opinions at any time on any webpage they happen to view.

A webpage annotation system differs from both of these approaches. It provides an annotation device that helps users annotate the webpages they browse. The annotation device may be a standalone software tool that includes a browser, a standalone software tool independent of the browser, or an extension module integrated into the browser. Annotea, the standard webpage annotation tool provided by the World Wide Web Consortium (W3C), uses RDF (Resource Description Framework) and XPointer to describe annotated webpages. As a project recommended by the W3C, Annotea provides a standard framework and implementation method for representing and storing webpage annotations. The Annotea system uses an RDF database server to store all webpage annotation information, and users annotate webpages with a dedicated software client. A number of webpage annotation systems with their own distinguishing features have been built on Annotea, such as Annoty, Crit, e-Marked, and YAWAS.

In general, the basic architecture of existing webpage annotation systems is as shown in Fig. 1. A prior-art webpage annotation system mainly includes a user command processing unit 110, an annotation query unit 120, a webpage obtaining unit 130, and a webpage annotation synthesis unit 140. The user command processing unit 110 receives the user's input (including the webpage URL, display options, user information, and so on) and sends it to the annotation query unit 120 and the webpage obtaining unit 130. Based on the webpage URL entered by the user, the annotation query unit 120 queries a remote annotation server over a network such as the Internet to obtain the annotation information for the webpage. The webpage obtaining unit 130 retrieves the desired webpage over the Internet based on the URL provided by the user. The webpage annotation synthesis unit 140 combines the retrieved webpage with the related annotation information and presents the result to the user, so that the user sees the relevant annotation information together with the desired webpage.

Although existing webpage annotation systems can add annotations to webpages, they still suffer from problems such as the following:

- They cannot handle the case where the annotated object moves to another page. On many websites, a particular element of a page is automatically moved onto other pages as the content rolls over, and traditional webpage annotation methods cannot display the annotation in that case;

- When the format of the annotated object undergoes a tolerable change (for example, its font becomes italic or bold), the annotation cannot be displayed correctly;

- In many cases the content of the annotated object is modified to some extent. A traditional webpage annotation system treats an annotated object whose content has been modified as no longer being the originally annotated content, and therefore no longer displays its annotation.

Therefore, there remains a need for a method and device that can generate or display webpage annotations while taking the content of the annotated object into account, and for a system that can share information among users more effectively based on webpage annotations, so as to overcome one or more of the above defects of the prior art.

Summary of the Invention

A brief summary of the invention is given below to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to delimit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.

To solve the above problems of the prior art, one object of the present invention is to provide a method and device for generating or displaying webpage annotations that take into account the content of the annotated object on the webpage, in which the webpage annotation information is tied to the content of the annotated object and of the context webpage elements immediately before and after it on the webpage, so that changes in the annotated object can be tracked dynamically.

Another object of the present invention is to provide a webpage annotation method and device with which the client browser can display the webpage that the user wishes to load, together with the existing annotations previously made on that webpage and stored on a remote annotation server, and with which new annotations can be added to and displayed on the webpage.

A further object of the present invention is to provide an information sharing system that uses the above webpage annotation method and device to share information based on webpage annotations.

To achieve the above objects, according to one aspect of the present invention, there is provided a method for generating webpage annotation information, the method comprising: in response to a user selecting a target webpage element as the annotated object on the current webpage loaded in a client web browser, extracting the XPath path of the annotated object in the document object model (DOM) tree of the current webpage; generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after the annotated object in the current webpage; and generating webpage annotation information based on the XPath path of the annotated object, the feature code CF, and the annotation input by the user, wherein the webpage annotation information is stored in an annotation database of a remote annotation server, the feature code CF of the annotated object is composed of the content-based feature (CBF) of the annotated object and the CBFs of its context webpage elements, and the CBF of a webpage element is composed of the letter projection vector and the letter order vector of that webpage element, wherein the letter projection vector consists of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, d, ..., z}, and the letter order vector consists of the inversion counts of all letters in the webpage element over the alphabet Λ.
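The composition described in this aspect can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `cbf`, `AnnotationInfo`, and `make_annotation` are hypothetical, the CF is laid out as a (preceding, target, following) triple, text is lower-cased before counting, and an inversion is taken to be any greater letter occurring anywhere earlier in the text. None of these choices is fixed by the patent text.

```python
from dataclasses import dataclass
import string

ALPHABET = string.ascii_lowercase  # the alphabet {a, b, ..., z}

def cbf(text: str) -> list[int]:
    """CBF of one element: 26 letter counts followed by 26 inversion counts."""
    letters = [c for c in text.lower() if c in ALPHABET]
    cpf = [letters.count(a) for a in ALPHABET]
    csf = [0] * len(ALPHABET)
    for i, c in enumerate(letters):
        # Assumption: count every greater letter that occurs earlier in the text.
        csf[ALPHABET.index(c)] += sum(1 for p in letters[:i] if p > c)
    return cpf + csf

@dataclass
class AnnotationInfo:
    xpath: str                                   # path of the annotated object
    cf: tuple[list[int], list[int], list[int]]   # (preceding, target, following)
    note: str                                    # annotation entered by the user

def make_annotation(xpath, target_text, prev_text, next_text, note):
    """Assemble a record from the XPath path, the feature code CF and the note."""
    cf = (cbf(prev_text), cbf(target_text), cbf(next_text))
    return AnnotationInfo(xpath, cf, note)

info = make_annotation("/html/body[1]/p[2]", "bad", "", "dab", "check this")
```

In a full system the record would then be serialized and sent to the remote annotation server; the serialization format is outside the scope of this sketch.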

According to another aspect of the present invention, there is provided a device for generating webpage annotation information, the device comprising: an XPath generator for extracting, in response to a user selecting a target webpage element as the annotated object on the current webpage loaded in a client web browser, the XPath path of the annotated object in the document object model (DOM) tree of the current webpage; a feature code (CF) generator for generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after the annotated object in the current webpage; and an annotation generator for generating webpage annotation information based on the XPath path of the annotated object, the feature code CF of the annotated object, and the annotation input by the user, wherein the feature code CF of the annotated object is composed of the content-based feature CBF of the annotated object and the CBFs of its context webpage elements, the webpage annotation information is stored in an annotation database of a remote annotation server, and the CBF of a webpage element is composed of the letter projection vector and the letter order vector of that webpage element, wherein the letter projection vector consists of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, d, ..., z}, and the letter order vector consists of the inversion counts of all letters in the webpage element over the alphabet Λ.

According to another aspect of the present invention, there is provided a method for displaying a webpage and the annotations on it in a client web browser, the method comprising: a) in response to a user entering the uniform resource locator (URL) of a webpage to be loaded and displayed in the browser, analyzing the entered URL to obtain a valid URL; b) querying a remote annotation server, according to the valid URL, for all annotations related to the valid URL, thereby obtaining a candidate set of annotations and the webpage annotation information of those annotations; c) for each annotation in the candidate set, determining from its webpage annotation information whether the annotation annotates a webpage element in the webpage to be loaded, that is, whether the annotation should be present in the webpage to be loaded, and if so, further determining the position in the webpage to be loaded of the webpage element it annotates, that is, the annotation position; and d) combining the annotations determined to be present in the webpage to be loaded with that webpage, according to their webpage annotation information and annotation positions, and displaying the combined webpage to the user via the browser, wherein the webpage annotation information of an annotation includes the XPath path of the annotated object to which the annotation corresponds, the feature code CF of the annotated object, the content and format of the annotation, the URL of the webpage on which the annotation is located, and the content feature code of that webpage; the feature code CF of the annotated object is composed of the content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements immediately before and after the annotated object; and the CBF of a webpage element is composed of the letter projection vector and the letter order vector of that webpage element, wherein the letter projection vector consists of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, d, ..., z}, and the letter order vector consists of the inversion counts of all letters in the webpage element over the alphabet Λ.
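The control flow of steps a) to d) can be outlined with a short Python sketch. It is a skeleton under loud assumptions rather than the patented procedure: the function name `display_with_annotations` and the callables `fetch_page`, `query_annotations`, and `locate` are hypothetical, step a) is reduced to trimming whitespace and dropping the URL fragment as a placeholder for the URL analysis, and the matching logic of step c) is left entirely to the caller-supplied `locate`.

```python
from urllib.parse import urldefrag

def display_with_annotations(url, fetch_page, query_annotations, locate):
    # (a) Placeholder URL analysis: trim whitespace and drop any #fragment.
    valid_url = urldefrag(url.strip()).url
    # (b) Candidate annotations recorded for this URL on the annotation server.
    candidates = query_annotations(valid_url)
    page = fetch_page(valid_url)
    # (c) Keep only annotations whose annotated element is found in the page.
    placed = []
    for annotation in candidates:
        position = locate(page, annotation)
        if position is not None:
            placed.append((position, annotation))
    # (d) Return the page and the placed annotations; a real client would
    # merge them into the rendered page before display.
    return page, placed

# Toy stand-ins: a "page" is a list of element texts, the "server" is a dict.
store = {
    "http://example.com/a": [
        {"text": "note 1", "target": "second"},
        {"text": "note 2", "target": "missing"},
    ]
}
page, placed = display_with_annotations(
    " http://example.com/a#top ",
    fetch_page=lambda u: ["first", "second", "third"],
    query_annotations=lambda u: store.get(u, []),
    locate=lambda p, a: p.index(a["target"]) if a["target"] in p else None,
)
```

Here the second candidate is dropped because its target cannot be located, which mirrors the decision in step c) that an annotation should not be present in the loaded webpage.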

According to another aspect of the present invention, there is provided a device for displaying a webpage and the annotations on it via a client web browser, the device comprising: a URL analyzer for analyzing, in response to a user entering the uniform resource locator (URL) of a webpage to be loaded and displayed in the browser, the entered URL to obtain a valid URL; an annotation querier for querying a remote annotation server, according to the valid URL, for all annotations related to the valid URL, thereby obtaining a candidate set of annotations and the webpage annotation information of those annotations; an annotation position determination unit for determining, for each annotation in the candidate set and from its webpage annotation information, whether the annotation annotates a webpage element in the webpage to be loaded, that is, whether the annotation should be present in the webpage to be loaded, and if so, further determining the position in the webpage to be loaded of the webpage element it annotates, that is, the annotation position; and a synthesis unit for combining the annotations determined to be present in the webpage to be loaded with that webpage, according to their webpage annotation information and annotation positions, wherein the combined webpage is displayed to the user via the browser; the webpage annotation information of an annotation includes the XPath path of the annotated object to which the annotation corresponds, the feature code CF of the annotated object, the content and format of the annotation, the URL of the webpage on which the annotation is located, and the content feature code of that webpage; the feature code CF of the annotated object is composed of the content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements immediately before and after the annotated object; and the CBF of a webpage element is composed of the letter projection vector and the letter order vector of that webpage element, wherein the letter projection vector consists of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, d, ..., z}, and the letter order vector consists of the inversion counts of all letters in the webpage element over the alphabet Λ.

In addition, according to yet another aspect of the present invention, there is provided a webpage annotation method comprising: in response to a user entering the URL of a webpage to be loaded and displayed in a client web browser, displaying in the browser, by performing the above method for displaying a webpage and the annotations on it in a client web browser, the webpage together with the existing annotations previously made on that webpage and stored on a remote annotation server; adding a new annotation to the webpage by performing the above method for generating webpage annotation information, the webpage annotation information of the new annotation being stored on the remote annotation server; and displaying the added new annotation on the webpage via the browser.

According to yet another aspect of the present invention, there is provided a webpage annotation device comprising: the above device for generating webpage annotation information; and the above device for displaying a webpage and the annotations on it via a client web browser.

According to yet another aspect of the present invention, there is provided an information sharing system based on webpage annotations, comprising a client and a remote annotation server, wherein the client includes the above webpage annotation device, and the remote annotation server includes an annotation database for storing webpage annotation information and an annotation information accessor for controlling access to the annotation database.

According to other aspects of the present invention, corresponding computer-readable storage media and computer program products are also provided.

An advantage of the present invention is that, in the method, device, and system described above, the XPath path of the annotated object and the content of the annotated object and its context webpage elements are taken into account when the webpage annotation information is generated. This allows an annotation to track its annotated object dynamically, so that the annotation information follows the annotated object when it moves. Moreover, the annotation can still be displayed correctly when the format of the annotated object changes. Even when the content of the annotated object itself changes, the change can be evaluated to decide whether the corresponding annotation may be displayed.

These and other advantages of the present invention will become more apparent from the following detailed description of preferred embodiments of the invention taken in conjunction with the accompanying drawings.

Brief Description of the Drawings

The present invention may be better understood by reference to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote the same or similar parts. The drawings, together with the detailed description below, are incorporated in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:

Fig. 1 is a schematic diagram showing the general architecture of a prior-art webpage annotation system;

Fig. 2 is a schematic diagram showing the structure of a system that shares information by means of webpage annotations according to an embodiment of the present invention;

Fig. 3 is an exemplary flowchart showing the process performed when a new annotation is added to a webpage using the system shown in Fig. 2, according to an embodiment of the present invention;

Fig. 4 is a schematic diagram showing in detail an exemplary structure and processing of the CBF generator shown in Fig. 2;

Fig. 5 is a block diagram showing in detail an exemplary structure of the annotation analyzer shown in Fig. 2;

Fig. 6 is a flowchart showing the process, according to an embodiment of the present invention, performed when a user uses the system shown in Fig. 2 to enter in the client browser the URL (uniform resource locator) of a webpage to be loaded, so as to display the webpage and the existing annotations on it;

Fig. 7 is a flowchart showing, in one embodiment of the present invention, the process of obtaining candidate URLs from the URL entered by the user and determining whether the webpages corresponding to the candidate URLs are the same as or similar to the webpage currently loaded by the browser, so as to obtain valid URLs (that is, the detailed processing of step S610 shown in Fig. 6);

Fig. 8 is a flowchart showing, in one embodiment of the present invention, the process of determining whether each candidate annotation is present in the currently loaded webpage and, if so, its annotation position (that is, the detailed processing of step S630 in Fig. 6); and

Fig. 9 is a schematic diagram showing the structure of the feature code CF of an annotation used in the process shown in Fig. 8 (shown in (a) of Fig. 9) and of the DOM tree of the corresponding current webpage (shown in (b) of Fig. 9).

Those skilled in the art will appreciate that elements in the drawings are shown only for simplicity and clarity and are not necessarily drawn to scale. For example, the dimensions of some elements in the drawings may be exaggerated relative to other elements to help improve understanding of the embodiments of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual embodiment, many implementation-specific decisions must be made to achieve the developer's specific goals, such as complying with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be understood that, although such development work may be complex and time-consuming, it is merely a routine undertaking for those skilled in the art having the benefit of this disclosure.

It should be noted here that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the device structures and/or processing steps closely related to the solution according to the invention, and omit other details of little relevance to the invention.

Fig. 2 is a schematic diagram showing the structure of a system that shares information by means of webpage annotations according to an embodiment of the present invention. The system can be divided into two main parts connected through a network (not shown): a client and a server side (that is, the annotation server).

As shown in Fig. 2, on the client side the webpage annotation device 200 mainly includes a user interface 210, an XPath generator 220, a content-based feature (CBF) generator 230, an annotation generator 240, an annotation analyzer 250, and an XML converter 260, while the server side mainly includes an annotation information accessor 270 and an annotation database 280.

In one specific implementation of the system shown in Fig. 2, the client-side webpage annotation device 200 may be implemented as a browser plug-in, and the annotation server may be implemented with Java Servlets. Specifically, the server-side annotation information accessor 270 may be implemented as a Java Servlet, and the annotation database 280 may be implemented with an existing database management system. Those skilled in the art will understand, however, that the principles of the present invention are not limited to this, and that these devices or components may be implemented in other ways as needed.

On the client side, the user can use the webpage annotation device 200 to add and display new annotations on the webpage loaded by the browser, and to display correctly the existing annotations previously added to that webpage. Within the webpage annotation device 200, the user interface 210 receives the input to the whole device. It can receive any one or more of the following: (1) input relating to the system's configuration parameters; (2) input relating to the annotated object selected by the user on the webpage; (3) input relating to the annotation content; (4) input relating to how annotations are displayed; and so on.

The XPath generator 220 extracts the XPath path of the annotated object in the DOM (Document Object Model) tree of the webpage. XPath is the W3C-recommended way of addressing any element within a webpage: every element in a webpage corresponds to an XPath path, and any element in the webpage can be located through its XPath path. Each node in the DOM tree of a webpage corresponds to an element contained in the webpage. That is, the annotated object and the webpage elements immediately before and after it can all be represented as nodes of the DOM tree. For ease of description, the webpage elements immediately before and after the annotated object are called the preceding and following elements; they correspond to the immediately adjacent sibling nodes of the annotated object's node in the DOM tree, and are therefore also called context nodes or context webpage elements.
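As a rough illustration of what the XPath generator 220 and the context nodes amount to, the following Python sketch derives a positional path for a node and looks up its adjacent siblings. It assumes a well-formed XML document parsed with the standard-library `xml.etree.ElementTree` as a stand-in for a browser DOM; the function names `xpath_of` and `context_siblings` are hypothetical and do not come from the patent.

```python
import xml.etree.ElementTree as ET

def _parent_map(root):
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    return {child: parent for parent in root.iter() for child in parent}

def xpath_of(root, target):
    """Return a positional path such as /html/body[1]/p[2] for target."""
    parents = _parent_map(root)
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        position = next(i for i, c in enumerate(same_tag, start=1) if c is node)
        steps.append(f"{node.tag}[{position}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))

def context_siblings(root, target):
    """Return the (preceding, following) sibling elements; None where absent."""
    parent = _parent_map(root)[target]
    siblings = list(parent)
    i = next(k for k, c in enumerate(siblings) if c is target)
    before = siblings[i - 1] if i > 0 else None
    after = siblings[i + 1] if i + 1 < len(siblings) else None
    return before, after

page = ET.fromstring(
    "<html><body><p>first</p><p>second</p><p>third</p></body></html>"
)
target = page.find("body")[1]           # the second <p>, the annotated object
path = xpath_of(page, target)           # "/html/body[1]/p[2]"
before, after = context_siblings(page, target)
```

Real webpages are often not well-formed XML, so a browser plug-in would walk the live DOM instead; the traversal logic would be the same in spirit.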

The CBF (content-based feature) generator 230 generates the CBF of the annotated object from the content of the annotated object. The CBF of the annotated object consists of its letter projection vector (CPF) and its letter order vector (CSF), that is, CBF = CPF + CSF, where "+" denotes concatenation of the two vectors.

The letter projection vector (CPF) consists of the counts of all letters of the annotated object over the alphabet Λ = {a, b, c, d, ..., z}; the length of the vector equals the length of the alphabet Λ. For example, if the annotated object is a passage of English text on a webpage, the numbers of occurrences Num(a), Num(b), ..., Num(z) of each letter a, b, ..., z in that passage can be counted, giving the letter projection vector CPF: [Num(a), Num(b), ..., Num(z)]. Changes in the CPF reflect, to some extent, deletion, insertion and replacement operations on the content of the annotated object.
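The CPF definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `cpf` and the choice to lowercase the input and ignore characters outside a to z are assumptions, not details given in the text.

```python
import string

ALPHABET = string.ascii_lowercase  # the alphabet Λ = {a, b, ..., z}


def cpf(letters: str) -> list[int]:
    """Return [Num(a), Num(b), ..., Num(z)] for the given text."""
    counts = [0] * len(ALPHABET)
    for ch in letters.lower():
        idx = ALPHABET.find(ch)
        if idx >= 0:  # assumption: characters outside a..z are ignored
            counts[idx] += 1
    return counts


# "bad" and "dab" contain the same letters, so their CPFs are identical.
print(cpf("bad")[:4])
print(cpf("bad") == cpf("dab"))
```

Because the vector only counts letters, it is insensitive to letter order; that gap is what the letter order vector is meant to cover.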

The letter order vector (CSF) consists of the inversion counts of all letters of the annotated object over the alphabet Λ; the length of the vector equals the length of the alphabet. Assume the alphabet Λ is ordered as a < b < c < ... < z. Then the inversion count of the annotated object x on the letter a is the number of letters that are greater than a (that is, b, c, ..., z) and closely precede the letter a; the inversion count of x on the letter b is the number of letters that are greater than b (that is, c, d, ..., z) and closely precede the letter b; and so on, which gives the inversion counts of x over the whole alphabet. Changes in the CSF reflect, to some extent, transpositions within the annotated object. For example, "bad" and "dab" have the same CPF but different CSFs, which reflects the difference in their letter order.
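The CSF definition can be read in more than one way. The sketch below assumes that, for each occurrence of a letter, every greater letter appearing anywhere earlier in the string is counted. This reading is an assumption, chosen because it makes the "bad" versus "dab" example come out with different CSFs; counting only the directly adjacent preceding letter would give both strings the same vector. The function name `csf` is likewise hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase  # the alphabet, ordered a < b < ... < z


def csf(letters: str) -> list[int]:
    """Per-letter inversion counts (assumed reading, see the text above)."""
    result = [0] * len(ALPHABET)
    seen = [0] * len(ALPHABET)  # how often each letter has appeared so far
    for ch in letters:
        idx = ALPHABET.find(ch)
        if idx < 0:
            continue  # assumption: characters outside a..z are ignored
        # every greater letter seen earlier forms an inversion with ch
        result[idx] += sum(seen[idx + 1:])
        seen[idx] += 1
    return result


print(csf("bad")[:4])
print(csf("dab")[:4])
```

Under this reading, "bad" has one inversion on a (the b before it), while "dab" has one inversion on a and one on b (the d before each), so the two vectors differ even though the CPFs agree.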

To track effectively whether the context of the annotated object has changed, the CBF generator 230 generates not only the CBF of the annotated object but also the CBFs of its context nodes. The context nodes of the annotated object can be determined from the XPath path of the annotated object generated by the XPath generator 220. The CBF of the annotated object (represented by DOM tree node x) and the CBFs of its context nodes (represented by nodes x_left and x_right) together form the feature code CF of the annotated object, that is, CF(x) = CBF(x_left) + CBF(x) + CBF(x_right).
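As a rough sketch of how the feature code could be assembled, the Python below treats "+" as vector concatenation and takes the already alphabetized text of the three nodes as input. The helper names `cbf` and `feature_code`, and the inversion-count reading used for the CSF half, are assumptions made for illustration.

```python
import string

ALPHABET = string.ascii_lowercase
N = len(ALPHABET)


def cbf(letters: str) -> list[int]:
    """CBF = CPF followed by CSF (CSF uses an assumed inversion-count reading)."""
    cpf = [0] * N
    csf = [0] * N
    for ch in letters:
        idx = ALPHABET.find(ch)
        if idx < 0:
            continue  # assumption: only a..z contribute
        csf[idx] += sum(cpf[idx + 1:])  # greater letters seen so far
        cpf[idx] += 1
    return cpf + csf


def feature_code(left: str, target: str, right: str) -> list[int]:
    """CF(x) = CBF(x_left) + CBF(x) + CBF(x_right), as one flat vector."""
    return cbf(left) + cbf(target) + cbf(right)


cf = feature_code("ab", "bad", "c")
print(len(cf))  # 3 nodes * (26 + 26) entries
```

Every feature code built this way has the same length regardless of how long the underlying text is, which matches the fixed-length property the description later attributes to the CBF.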

The detailed structure and processing of the CBF generator 230, and the process of adding a new annotation to a webpage with the webpage annotation apparatus, are described below with reference to FIG. 3 and FIG. 4.

The annotation generator 240 generates webpage annotation information from information about the annotated object (for example, its feature code) and from the content and format of the entered annotation. The XML converter 260 converts the generated webpage annotation information into an XML message format suitable for communicating with the server over the network, so that the information can be transmitted to the server and stored in the annotation database 280 via the annotation information accessor 270. The webpage annotation information contains the URL of the annotation (that is, the URL of the webpage where the annotation is located), the position of the annotation on the webpage (that is, the XPath path of the corresponding annotated object), relevant features of the corresponding annotated object (for example, the feature code CF), the content feature code of the webpage where the annotation is located, the content and format of the annotation, and so on. Here, the content feature code of a webpage is a code that identifies the content of the webpage: if two webpages have the same content feature code, their content is the same. The content feature code of a webpage can be obtained with a conventional encoding, for example a hash code such as MD5.
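For illustration, the annotation record and its XML form might look like the following Python sketch. The description does not specify the message schema, so the class name `AnnotationInfo`, the field names, and the XML element names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class AnnotationInfo:
    """Hypothetical record of the fields listed in the description."""
    url: str                 # URL of the page carrying the annotation
    xpath: str               # XPath path of the annotated object
    feature_code: list[int]  # feature code CF of the annotated object
    page_code: str           # content feature code of the page (e.g. an MD5 hex digest)
    content: str             # annotation text
    style: str               # annotation format / display style


def to_xml(info: AnnotationInfo) -> str:
    """Serialize the record to an XML string (element names are assumptions)."""
    root = ET.Element("annotation")
    ET.SubElement(root, "url").text = info.url
    ET.SubElement(root, "xpath").text = info.xpath
    ET.SubElement(root, "featureCode").text = ",".join(map(str, info.feature_code))
    ET.SubElement(root, "pageCode").text = info.page_code
    ET.SubElement(root, "content").text = info.content
    ET.SubElement(root, "style").text = info.style
    return ET.tostring(root, encoding="unicode")


example = AnnotationInfo(
    url="http://example.com/news/1.html",
    xpath="/html/body/div[2]/p[1]",
    feature_code=[1, 1, 0, 1],
    page_code="0" * 32,
    content="Worth reading",
    style="highlight",
)
print(to_xml(example))
```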

Based on the URL of the current webpage, the annotation analyzer 250 identifies as valid URLs those URLs stored in the annotation database 280 that belong to the same website as the current webpage and whose corresponding webpages are the same as or similar to the current webpage. It queries the annotation database for all annotations related to the valid URLs and matches every annotation returned by the query against the current webpage, in order to determine which of them annotate elements of the currently loaded webpage (that is, which of them should exist in the current webpage) and at which positions in the current webpage they should be displayed. The annotation analyzer 250 can thus support the case in which the content of an annotated object has been moved from one page to another. The detailed processing and structure of the annotation analyzer 250 are described below with reference to FIG. 5 to FIG. 9.

The XML converter 260 converts the information exchanged between the client and the server into an XML message format, so that the webpage annotation apparatus 200 on the client can communicate with the server. Those skilled in the art will understand, however, that XML messages are used here only to make it convenient for the client to communicate with a server implemented with Java Servlets. The principle of the present invention is not limited to conversion into XML; other message formats may be chosen for client-server communication depending on how the server-side part shown in FIG. 2 is implemented.

As shown in FIG. 2, on the server side the annotation information accessor 270 accesses the annotation database 280 in response to requests from the client. The annotation database 280 stores the webpage annotation information for every annotation collected by the information sharing system. As described above, this information may include the URL of the annotation (that is, the URL of the webpage where the annotation is located), the position of the annotation on the webpage, the feature code of the corresponding annotated object, the content and format of the annotation, and so on.

The following description refers to FIG. 3 and FIG. 4. FIG. 3 is an exemplary flowchart of the process 300 executed when a new annotation is added to a webpage using the system shown in FIG. 2 according to an embodiment of the present invention, and FIG. 4 is a schematic diagram showing in detail an exemplary structure and the processing of the CBF generator shown in FIG. 2.

As shown in FIG. 3, in step S310 the XPath path of the annotated object in the DOM tree of the current webpage is extracted according to the annotated object that the user selected on the current webpage. In step S320, the CBFs of the annotated object and of its context nodes (which can be determined from the XPath path generated in step S310) are generated from their content as described above, yielding the feature code CF of the annotated object. Next, in step S330, webpage annotation information is generated from information about the annotated object, the entered annotation content, and so on. In step S340, the webpage annotation information generated in step S330 is converted into an XML message suitable for communicating with the server. Then, in step S350, the webpage annotation information generated by the client is stored on the server side in the annotation database 280 via the annotation information accessor 270.

FIG. 4 shows the CBF generator 230 of FIG. 2 in detail. As shown in FIG. 4, the CBF generator 230 may include an HTML (Hypertext Markup Language) cleaning unit 410, an HTML alphabetization unit 420, a letter projection vector (CPF) generation unit 430, and a letter order vector (CSF) generation unit 440. The following description takes as an example the generation of the CBF of the annotated object by the CBF generator 230.

The HTML cleaning unit 410 removes HTML tags that serve no purpose (for example, formatting tags) from the annotated object selected by the user, according to pre-stored HTML cleaning rules (which, as shown in FIG. 4, may be pre-stored in the HTML dictionary 450). This reduces HTML noise and lessens the effect of webpage formatting changes on the annotated object.

The HTML alphabetization unit 420 alphabetizes the annotated object after HTML cleaning, converting the annotated object, based on its content, into a string consisting of the letters a to z. For an annotated object that contains Chinese text, the HTML alphabetization unit 420 first consults the Chinese character dictionary 460 (which can be omitted when the annotated object contains no Chinese text) to convert the Chinese text into Hanyu Pinyin, and then produces the letter string. For a character with multiple pronunciations, the HTML alphabetization unit may take the first Pinyin reading of that character, although the principle of the present invention is clearly not limited to this.
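The cleaning and alphabetization steps might be sketched as follows. This is a simplification under stated assumptions: the sketch strips every tag with a regular expression instead of applying rules from an HTML dictionary, and the tiny `PINYIN` mapping is a hypothetical stand-in for the Chinese character dictionary 460.

```python
import re

# Hypothetical stand-in for the Chinese character dictionary 460;
# a real dictionary would cover the full character set.
PINYIN = {"网": "wang", "页": "ye"}

TAG_RE = re.compile(r"<[^>]+>")


def clean_html(fragment: str) -> str:
    """Simplified cleaning: drop every tag (the text describes rule-based removal)."""
    return TAG_RE.sub("", fragment)


def alphabetize(text: str) -> str:
    """Convert text to a string of the letters a..z only."""
    out = []
    for ch in text:
        replacement = PINYIN.get(ch, ch).lower()  # Chinese character -> Pinyin
        out.extend(c for c in replacement if "a" <= c <= "z")
    return "".join(out)


print(alphabetize(clean_html("<b>Web</b> 网页 2.0")))
```

Digits, punctuation and whitespace are dropped in this sketch because the target alphabet contains only a to z; whether the original system treats them differently is not stated.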

The letter projection vector (CPF) generation unit 430 and the letter order vector (CSF) generation unit 440 generate the letter projection vector and the letter order vector of the annotated object, respectively, from the letter string produced by HTML alphabetization, following the definitions of CPF and CSF given above. Concatenating the letter projection vector (CPF) and the letter order vector (CSF) then gives the content-based feature CBF of the annotated object.

Refer again to FIG. 2. When the user enters the URL of a webpage in the client browser in order to browse that webpage and the annotations on it, the client browser loads the requested webpage and passes the URL and the DOM tree structure of the webpage to the annotation analyzer 250.

FIG. 5 shows an exemplary structure of the annotation analyzer 250 according to an embodiment of the present invention. As shown in FIG. 5, the annotation analyzer 250 includes a URL analyzer 510, an annotation querier 520, and a webpage annotation synthesizer 530.

The URL analyzer 510 analyzes the URL entered by the user and (via the XML converter 260 and the annotation information accessor 270) retrieves from the annotation database 280 all URLs that belong to the same website as the webpage currently being loaded (that is, the webpage corresponding to the entered URL, referred to simply as the current webpage), forming a candidate URL set. For every URL in the candidate URL set (hereinafter, a candidate URL), it performs a same-page determination and a similar-page determination between the corresponding webpage and the current webpage, and identifies as valid URLs those candidate URLs whose corresponding webpages are the same as or similar to the current webpage.

Based on the valid URLs determined by the URL analyzer 510, the annotation querier 520 (via the XML converter 260 and the annotation information accessor 270) queries the annotation database 280 for all annotations related to the valid URLs (that is, all annotations on the webpages corresponding to the valid URLs). In other words, it retrieves from the annotation database 280 every annotation that may be related to the current webpage, producing an annotation candidate set, and obtains the webpage annotation information of all these possible annotations from the annotation database 280.

The webpage annotation synthesizer 530 matches all possible annotations against the current webpage to determine which annotations most likely annotate which elements or objects of the currently loaded webpage, that is, to determine for each possible annotation whether it exists in the current webpage and where. It then combines the annotations with the webpage for display to the user through the browser. As shown in FIG. 5, the webpage annotation synthesizer 530 may further include an annotation position determination unit 532 and a synthesis unit 534.

For each possible annotation in the annotation candidate set, the annotation position determination unit 532 determines, from the webpage annotation information of that annotation (for example, the XPath path and feature code CF of the corresponding annotated object), whether the possible annotation annotates a webpage element of the current webpage (that is, whether the possible annotation exists in the current webpage). If it does, the unit further determines the position in the current webpage of the annotated webpage element (that is, the annotation position).

The synthesis unit 534 combines with the current webpage those possible annotations determined to exist in it, using their webpage annotation information and their determined annotation positions in the current webpage, and displays the combined webpage to the user through the browser.

FIG. 6 is a flowchart of the process 600, according to an embodiment of the present invention, by which the above information sharing system displays a webpage and its existing annotations when the user enters the URL of the webpage to be loaded in the client browser.

As shown in FIG. 6, in step S610 the URL entered by the user is analyzed as described above to obtain the candidate URL set, and the webpages corresponding to all candidate URLs are subjected to the same-page and similar-page determinations against the webpage to be loaded (the current webpage), thereby determining the valid URLs. The detailed processing of step S610 is described below with reference to FIG. 7.

In step S620, the annotation database is queried, based on the determined valid URLs, for all annotations that may be related to the current webpage, producing the annotation candidate set. Then, in step S630, it is determined which of the possible annotations exist in the current webpage, and the annotation positions of those annotations in the current webpage are determined. The detailed processing of step S630 is described below with reference to FIG. 8 and FIG. 9.

Then, in step S640, the annotations are combined with the current webpage based on the webpage annotation information of the annotations determined in step S630 to exist and on their determined annotation positions, and in step S650 the combined webpage is displayed to the user through the browser. This can be done by dynamically modifying the DOM of the current webpage: the annotation is first converted into HTML, and the resulting HTML fragment is then inserted into the webpage code and rendered in the browser.

FIG. 7 is an exemplary flowchart of the process, in one embodiment of the present invention, of obtaining candidate URLs based on the URL entered by the user and performing the same-page and similar-page determinations between their corresponding webpages and the webpage currently loaded by the browser (the current webpage), that is, the detailed processing of step S610 shown in FIG. 6.

As shown in FIG. 7, in step S710 the set of all candidate URLs in the same website as the entered URL, that is, the candidate URL set, is obtained based on the URL entered by the user, as described above. Then, in step S720, it is determined whether the webpage corresponding to a given candidate URL is the same page as the current webpage. If the content feature code of the webpage corresponding to the candidate URL equals the content feature code of the current webpage, the two webpages are determined to be the same page; otherwise they are different. Because the content feature code of a webpage is used to decide whether the annotated webpage and the current webpage are the same page, an existing encoding such as MD5 can be used to obtain it, as mentioned above. This check mainly addresses the case in which the URL of a webpage differs but its content has not changed.
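The same-page check of step S720 can be illustrated with Python's standard `hashlib` module. The function names are hypothetical, and hashing the raw page text as UTF-8 is an assumption; the description does not say which representation of the page is hashed.

```python
import hashlib


def content_feature_code(page_content: str) -> str:
    """MD5 hex digest of the page content, used as its content feature code."""
    return hashlib.md5(page_content.encode("utf-8")).hexdigest()


def is_same_page(stored_code: str, current_content: str) -> bool:
    """Step S720: pages are the same if their content feature codes match."""
    return stored_code == content_feature_code(current_content)


stored = content_feature_code("<html><body>Hello</body></html>")
print(is_same_page(stored, "<html><body>Hello</body></html>"))
print(is_same_page(stored, "<html><body>Hello!</body></html>"))
```

A single changed character yields an unrelated digest, which is fine for an exact same-page test but is the reason the description uses the CBF, not a hash, for the annotated object itself.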

If step S720 determines that the two webpages are not the same, step S730 determines whether they are similar pages. The two webpages are determined to be similar when the following conditions are satisfied, and not similar otherwise:

(1) the titles of the two webpages are the same; and

(2) one of the following holds:

(a) parameters are passed between the two webpages, a numeric parameter is missing from the URL, and everything else is the same;

(b) parameters are passed between the two webpages, the numeric parameters in the URLs differ, the numeric parameter of the webpage corresponding to the candidate URL is smaller than that of the webpage corresponding to the current URL, and everything else is the same; or

(c) no parameters are passed between the two webpages, the last address part of the URL differs, and everything else is the same.

Clearly, the principle of the present invention is not limited to the similar-page conditions above; those skilled in the art can set other similar-page conditions as needed.
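One possible reading of the similar-page conditions above is sketched below with Python's `urllib.parse`. Several points are interpretations, not statements from the text: "parameters are passed" is taken to mean that at least one URL has a query string, a "numeric parameter" is a query value made only of digits, a parameter may be "missing" from either URL, and "the last address part" is the last path segment. The function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def _split_params(query: str):
    """Separate query parameters into numeric and non-numeric ones."""
    numeric, other = {}, {}
    for key, value in parse_qsl(query):
        if value.isdecimal():
            numeric[key] = int(value)
        else:
            other[key] = value
    return numeric, other


def is_similar_page(cand_url: str, cand_title: str,
                    cur_url: str, cur_title: str) -> bool:
    if cand_title != cur_title:  # condition (1): same title
        return False
    cand, cur = urlsplit(cand_url), urlsplit(cur_url)
    if (cand.scheme, cand.netloc) != (cur.scheme, cur.netloc):
        return False
    if cand.query or cur.query:  # parameters are passed
        cand_num, cand_other = _split_params(cand.query)
        cur_num, cur_other = _split_params(cur.query)
        if cand.path != cur.path or cand_other != cur_other:
            return False
        shared = cand_num.keys() & cur_num.keys()
        if cand_num.keys() != cur_num.keys():
            # a numeric parameter is missing from one URL, the rest is equal
            one_is_subset = shared == cand_num.keys() or shared == cur_num.keys()
            return one_is_subset and all(cand_num[k] == cur_num[k] for k in shared)
        # numeric parameters differ and the candidate's values are not larger
        return cand_num != cur_num and all(cand_num[k] <= cur_num[k] for k in shared)
    # no parameters: only the last path segment may differ
    cand_head, _, cand_last = cand.path.rpartition("/")
    cur_head, _, cur_last = cur.path.rpartition("/")
    return cand_head == cur_head and cand_last != cur_last


print(is_similar_page("http://example.com/t?id=5&page=1", "Thread",
                      "http://example.com/t?id=5&page=2", "Thread"))
```

Identical URLs are deliberately not reported as similar here, since that case is expected to be caught earlier by the same-page check.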

When the result of the determination in step S720 or step S730 is affirmative, processing proceeds to step S740, where the current candidate URL is determined to be a valid URL.

If the determinations in steps S720 and S730 establish that the two webpages are neither the same nor similar, processing proceeds to step S750, which checks whether the candidate URL set still contains URLs that have not undergone the same-page and similar-page determinations. If so, in step S760 the next candidate URL is taken from the candidate URL set, and processing returns to step S720 so that the webpage corresponding to that candidate URL can be compared with the current webpage. Steps S720 to S760 are repeated until step S750 finds that every candidate URL in the candidate URL set has undergone the same-page and similar-page determinations, at which point all valid URLs in the candidate URL set have been determined.
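The loop over steps S720 to S760 amounts to filtering the candidate URL set with two predicates. The sketch below passes those predicates in as callables so that it stays independent of how the same-page and similar-page checks are implemented; the function name and signature are assumptions.

```python
from typing import Callable, Iterable


def select_valid_urls(candidates: Iterable[str],
                      is_same: Callable[[str], bool],
                      is_similar: Callable[[str], bool]) -> list[str]:
    """Return the candidate URLs whose pages are the same as or similar to the current page."""
    valid = []
    for url in candidates:                   # S750 / S760: walk the candidate set
        if is_same(url) or is_similar(url):  # S720, then S730 if needed
            valid.append(url)                # S740: mark as valid
    return valid


candidates = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
valid = select_valid_urls(
    candidates,
    is_same=lambda u: u.endswith("/a"),
    is_similar=lambda u: u.endswith("/c"),
)
print(valid)
```

Because `or` short-circuits, the similar-page check runs only when the same-page check fails, which mirrors the order of steps S720 and S730.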

FIG. 8 is a flowchart showing in detail the processing of step S630 in FIG. 6 (that is, determining whether each possible annotation exists in the current webpage and, if so, its annotation position), and FIG. 9 is a schematic diagram showing the structure of the feature code CF of an annotation used in the processing of FIG. 8 (shown in (a) of FIG. 9) and of the corresponding DOM tree of the current webpage (shown in (b) of FIG. 9).

As shown in FIG. 8, in step S810 the webpage annotation information of the possible annotation currently under consideration is used, for example the feature code CF and XPath path of the corresponding annotated object. Starting from the node that the XPath path identifies in the DOM tree of the current webpage, the nodes of the DOM tree are examined in turn, upward and downward, to find the nodes that are the same as or most similar to the annotated object and its context nodes (here, similar means that the differences in node content and context are within an allowable range). These nodes are taken as the DOM tree nodes in the current webpage that correspond to the annotation.
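The description does not spell out the search order or the comparison used in step S810. As one hedged illustration, the sketch below assumes the DOM nodes are available as a list of CBF vectors in document order, compares vectors by the sum of absolute differences, and expands outward from the XPath-identified index up to a hypothetical `max_radius`. All names and the tie-breaking rule (nearer nodes win) are assumptions.

```python
from typing import Optional, Sequence


def l1(u: Sequence[int], v: Sequence[int]) -> int:
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def find_closest(nodes: Sequence[Sequence[int]], start: int,
                 target: Sequence[int],
                 max_radius: int = 5) -> tuple[Optional[int], Optional[int]]:
    """Search upward and downward from `start` for the node closest to `target`."""
    best_idx, best_dist = None, None
    for radius in range(max_radius + 1):
        offsets = (start,) if radius == 0 else (start - radius, start + radius)
        for idx in offsets:
            if 0 <= idx < len(nodes):
                dist = l1(nodes[idx], target)
                if best_dist is None or dist < best_dist:
                    best_idx, best_dist = idx, dist
        if best_dist == 0:  # exact match found, stop early
            break
    return best_idx, best_dist


nodes = [[5, 0], [1, 1], [0, 3], [1, 2]]
print(find_closest(nodes, start=1, target=[1, 2]))
```

In this toy input the node at the XPath position no longer matches exactly, and the search settles on the node two positions further down, which is the kind of drift the step is meant to tolerate.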

For example, take the feature code CF of a possible annotation shown in (a) of FIG. 9, where A, B and C denote the annotated object corresponding to the annotation, its preceding context node and its following context node, respectively. Starting from the node identified by the XPath path of A, the nodes of the DOM tree are examined in turn, and the nodes in the current DOM tree closest to A, B and C are found to be A', B' and C', respectively, as shown in (b) of FIG. 9. These are referred to here as the DOM tree nodes corresponding to the annotation under consideration.

Then, in step S820, based on the DOM tree nodes determined to correspond to the possible annotation under consideration, the distance D(A, A') between the annotation and the DOM tree is calculated as follows:

D(A, A') = d(A, A') + α (d(B, B') + d(C, C')) + β d_s

where

d(A, A') = |CBF(A) - CBF(A')|,

d(B, B') = |CBF(B) - CBF(B')|,

d(C, C') = |CBF(C) - CBF(C')|,

d_s is the tree structure distance, and α and β are constants. α expresses how strongly a difference in the context of the annotated object affects the difference of the annotated object, β expresses how strongly a difference in the DOM tree structure affects the similarity difference of the annotation, and d_s expresses the difference between the context node structure in the current DOM tree and the CF structure of the annotation (that is, the original context node structure).
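The distance D(A, A') can be sketched as below. The description writes |CBF(A) - CBF(A')| without naming a norm, so the sum of absolute component differences used here is an assumption, as are the function names and the default values of α and β.

```python
from typing import Sequence


def l1(u: Sequence[int], v: Sequence[int]) -> int:
    """Assumed norm: sum of absolute differences of the vector components."""
    return sum(abs(a - b) for a, b in zip(u, v))


def annotation_distance(cbf_a, cbf_a2, cbf_b, cbf_b2, cbf_c, cbf_c2,
                        d_s: float, alpha: float = 0.5, beta: float = 1.0) -> float:
    """D(A, A') = d(A, A') + alpha * (d(B, B') + d(C, C')) + beta * d_s."""
    d_a = l1(cbf_a, cbf_a2)  # change in the annotated object itself
    d_b = l1(cbf_b, cbf_b2)  # change in the preceding context node
    d_c = l1(cbf_c, cbf_c2)  # change in the following context node
    return d_a + alpha * (d_b + d_c) + beta * d_s


# Toy vectors (real CBFs have 52 components each).
print(annotation_distance([1, 0], [1, 1], [2], [2], [0], [3],
                          d_s=1, alpha=0.5, beta=2.0))
```

In the toy call the object differs by 1, the following context by 3, and the structure by 1, so the result is 1 + 0.5 * 3 + 2.0 * 1.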

Assume that the lowest common node P of the nodes A', B' and C' can be found in the DOM tree, and that l_A', l_B' and l_C' denote the numbers of nodes passed on the way from A', B' and C', respectively, to node P. Then d_s can be calculated as follows:

d_s = l_A' + l_B' + l_C'

In the case shown in (b) of FIG. 9, d_s = 1.
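A possible computation of d_s is sketched below. It assumes that "the number of nodes passed" counts only the nodes strictly between a node and the common ancestor P, so that three siblings directly under P give d_s = 0. That reading, the parent-map representation of the tree, and the example layout (which is hypothetical and not claimed to match FIG. 9) are all assumptions.

```python
def ancestors(node: str, parent: dict[str, str]) -> list[str]:
    """Chain of ancestors of `node`, nearest first (the node itself is excluded)."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def structure_distance(nodes: list[str], parent: dict[str, str]) -> int:
    """d_s = sum over nodes of the number of nodes strictly between it and P.

    Assumes every node in `nodes` is a non-root node of the same tree.
    """
    chains = [ancestors(n, parent) for n in nodes]
    common = set(chains[0]).intersection(*chains[1:])
    # lowest common ancestor P: first common node on the way up from nodes[0]
    p = next(a for a in chains[0] if a in common)
    return sum(chain.index(p) for chain in chains)


# Hypothetical layout: B' sits one level deeper than A' and C'.
parent = {"A'": "P", "B'": "W", "W": "P", "C'": "P", "P": "root"}
print(structure_distance(["A'", "B'", "C'"], parent))
```

Here only the path from B' to P passes through an intermediate node (W), so the three terms are 0, 1 and 0.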

Refer again to FIG. 8. In step S830, it is determined whether the distance D calculated in step S820 for the annotation under consideration is smaller than a predetermined threshold. If so, step S840 determines that the annotation should exist on the current webpage and determines its position there. For example, if the calculated D(A, A') is smaller than the predetermined threshold, the annotation under consideration is determined still to annotate an element or object of the current webpage and should therefore be displayed on the current webpage; the position of node A' in the DOM tree determines where on the current webpage the annotation should be displayed.

If step S830 determines that the distance D of the annotation under consideration is not smaller than the predetermined threshold, then in step S840 the annotation is discarded, that is, it is determined that the annotation should not be displayed on the current webpage.
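The threshold decision of step S830 reduces to a simple filter over the computed distances. In the sketch below, the mapping from annotation identifiers to distances, the identifiers themselves and the threshold value are all made up for illustration.

```python
def annotations_to_display(distances: dict[str, float], threshold: float) -> list[str]:
    """Keep the annotations whose distance D is strictly below the threshold."""
    return [name for name, d in distances.items() if d < threshold]


# Hypothetical distances D for three candidate annotations.
distances = {"note-1": 0.0, "note-2": 4.5, "note-3": 12.0}
print(annotations_to_display(distances, threshold=5.0))
```

An annotation whose distance equals the threshold is discarded, since the text requires D to be smaller than the threshold for display.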

From the above definitions of the content-based feature CBF and the feature code CF of the annotated object, it can be seen that the CBF is in general unique to the annotated object (especially when the annotated object is webpage content expressed as English text) and has a uniform length, which is convenient for data transmission and storage; that changes in the CBF faithfully reflect changes in the content of the annotated object; and that the distance between the CFs of annotated objects is a measure of how much the object has changed.

In the information sharing system according to the embodiment of the present invention described above, the annotated object is identified by its XPath path and, in addition, by its feature code CF. Annotations in dynamic webpages can therefore track their annotated objects dynamically, which is not possible in conventional webpage annotation systems. The reason is that conventional webpage annotation systems generally construct the feature of the annotated object with a hash function (such as MD5). Although such a feature is generally unique and of uniform length, which is convenient for data transmission and storage, it cannot reflect how much the annotated content has changed. With hash coding, a small change in the annotated object causes a large change in the feature, so the distance between features cannot be used to measure the degree of change of the annotated object.

In the webpage-annotation-based information sharing method and system according to the embodiments of the present invention described above with reference to the drawings, the feature code of the annotated object is generated from the content of the annotated object and of its context. When all possible annotations are matched against the currently loaded webpage, the change in each annotation can therefore be measured, and whether to display the annotation can be decided according to the degree of change, which achieves dynamic tracking. Moreover, annotation matching uses a lightweight DOM tree search based on features of the context content to measure changes in the content of the annotated object and in its context.

As is clear from the above description, the method and system according to the embodiments of the present invention use a dynamic tracking technique, so that even if the marked object in a web page has changed to some extent, the corresponding annotation can still be displayed correctly at the changed position on the web page, while annotations whose content has disappeared from the web page are not displayed. Moreover, when a marked object in a web page has been transferred from another web page, its corresponding annotation can also be displayed at the correct position on the web page. In addition, when the current web page may have been annotated under different URLs, all of these annotations are displayed correctly. Furthermore, when the format of the marked object changes, for example to bold, italics, or a quotation, its annotation can still be displayed correctly. Format changes are very common when web pages are updated or forum content is reposted. Therefore, web page annotation can be used as a means of sharing information between users.

In addition, each operation of the above method according to the present invention can obviously also be implemented as a computer-executable program stored in any of various machine-readable storage media.

Moreover, the object of the present invention can also be achieved in the following manner: a storage medium storing the above executable program code is provided directly or indirectly to a system or device, and a computer or central processing unit (CPU) in the system or device reads out and executes the program code.

In this case, as long as the system or device has the function of executing a program, the embodiments of the present invention are not limited to a program, and the program may take any form, for example, an object program, a program executed by an interpreter, or a script program supplied to an operating system.

The above machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, disk units such as optical, magnetic, and magneto-optical disks, and other media suitable for storing information.

In addition, the present invention can also be implemented by a computer connecting to a corresponding website on the Internet, downloading and installing the computer program code according to the present invention, and then executing the program.

It should also be pointed out that the steps of the above series of processes can naturally be performed chronologically in the order described, but they need not be performed chronologically. Some steps may be performed in parallel or independently of one another.

Finally, it should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.

Although the embodiments of the present invention have been described in detail above with reference to the accompanying drawings, it should be understood that the embodiments described above merely illustrate the present invention and do not limit it. Various changes, substitutions, and modifications can be made without departing from the spirit and scope of the present invention as defined by the appended claims. Moreover, the scope of the present application is not limited to the specific embodiments of the processes, devices, manufactures, compositions of matter, means, methods, and steps described in the specification. Those of ordinary skill in the art will readily appreciate from the disclosure of the present invention that processes, devices, manufactures, compositions of matter, means, methods, or steps, whether existing now or developed in the future, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be used in accordance with the present invention. Accordingly, the appended claims are intended to include within their scope such processes, devices, manufactures, compositions of matter, means, methods, or steps.

Claims

1. A method for generating web page annotation information, comprising the steps of:

In response to the user selecting the target web page element as the marked object on the current web page loaded on the client web browser, extracting the XPath path of the marked object in the DOM tree of the current web page;

Generate the feature code CF of the marked object based on the marked object and the contents of the contextual web page elements immediately before and after the marked object in the current web page; and

Based on the XPath path of the marked object, the feature code CF and the markup entered by the user, generate webpage markup information,

Wherein, the webpage annotation information is stored in the annotation database of the remote annotation server,

The feature code CF of the tagged object is composed of the content-based feature CBF of the tagged object and the CBF of its context web page element, and

The CBF of a webpage element is composed of a letter projection vector and an alphabet order vector of the webpage element, wherein the letter projection vector is composed of the counts of all letters in the webpage element over the alphabet Λ={a, b, c, d, ..., z}, and the alphabet order vector is composed of the reverse-order counts of all letters in the webpage element over the alphabet Λ.

2. The method according to claim 1, wherein the step of generating the feature code CF of the marked object further comprises:

Generate the CBF of the marked object and its context web page element in the following way:

Remove meaningless HTML tags from web page elements by referring to pre-stored HTML cleaning principles;

Perform HTML alphabetization on the web page element after HTML cleaning, so as to convert the web page element into an alphabet string composed of letters from a to z based on the content of the web page element;

Count the occurrences of all letters in the letter string over the alphabet Λ={a, b, c, d, ..., z}, and their reverse-order counts, so as to generate the letter projection vector and the alphabet order vector of the webpage element;

Concatenate the letter projection vector and the alphabet order vector of the webpage element to obtain the CBF of the webpage element, and

The feature code CF of the tagged object is obtained in the following manner: CBF of the upper web page element of the tagged object+CBF of the tagged object+CBF of the lower web page element of the tagged object.
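The steps recited in claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the tag-stripping regular expression stands in for the pre-stored HTML cleaning principles, the reverse-order count of a letter is assumed to be the number of later, alphabetically smaller letters that follow its occurrences (the claims do not define it), and "+" in the CF formula is read as concatenation. All function names are hypothetical.

```python
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def clean_html(fragment):
    """Strip tags; a stand-in for the pre-stored HTML cleaning principles."""
    return re.sub(r"<[^>]*>", " ", fragment)

def alphabetize(text):
    """Keep only the letters a..z, lower-cased, in their original order."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

def projection_vector(letters):
    """Count how often each letter of the alphabet occurs."""
    return [letters.count(ch) for ch in ALPHABET]

def order_vector(letters):
    """Assumed definition: for each letter, count the pairs in which an
    occurrence of it is followed later by an alphabetically smaller letter."""
    result = [0] * len(ALPHABET)
    seen_to_right = [0] * len(ALPHABET)
    for ch in reversed(letters):
        idx = ALPHABET.index(ch)
        result[idx] += sum(seen_to_right[:idx])
        seen_to_right[idx] += 1
    return result

def cbf(fragment):
    """Content-based feature: projection vector followed by order vector."""
    letters = alphabetize(clean_html(fragment))
    return projection_vector(letters) + order_vector(letters)

def feature_code(previous_element, marked_object, next_element):
    """CF: CBF of the preceding element, the marked object, and the following element."""
    return cbf(previous_element) + cbf(marked_object) + cbf(next_element)

cf = feature_code("<h2>Title</h2>", "<p>Some <b>marked</b> text</p>", "<p>Next</p>")
print(len(cf))
```

Under these assumptions each CBF has 52 entries and the CF has 156, so two feature codes can be compared entry by entry.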

3. The method according to claim 2, wherein, in the case where the marked object and its context webpage elements contain Chinese text, the Chinese text is converted to Hanyu Pinyin by referring to a Chinese character dictionary before HTML alphabetization is performed on the webpage elements after HTML cleaning.
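The Pinyin conversion of claim 3 can be sketched with a lookup table. The four-entry dictionary below is a hypothetical stand-in for the Chinese character dictionary the claim refers to, and the name `to_pinyin` is an assumption; a real dictionary would cover the full character set and deal with characters that have several readings.

```python
# Hypothetical miniature character dictionary (tone marks omitted).
PINYIN = {"网": "wang", "页": "ye", "标": "biao", "注": "zhu"}

def to_pinyin(text, dictionary=PINYIN):
    """Replace each character found in the dictionary with its Pinyin;
    characters not in the dictionary are passed through unchanged."""
    return "".join(dictionary.get(ch, ch) for ch in text)

print(to_pinyin("网页标注 demo"))
```

After this step the text contains only Latin letters for the covered characters, so the same alphabetization and counting steps apply to Chinese content.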

4. The method according to any one of claims 1 to 3, wherein, in addition to the XPath path of the marked object, the feature code CF, and the content and format of the annotation, the webpage annotation information further comprises the URL of the webpage where the annotation is located and the content feature code of that webpage.

5. The method according to any one of claims 1 to 3, wherein the remote annotation server is implemented by a Java Servlet, and

The method further includes the step of: converting the generated web page annotation information into an XML format suitable for communicating with the remote annotation server, so as to transmit it to the remote annotation server.
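The XML conversion step of claim 5 might look like the following sketch. The element names (`annotation`, `xpath`, `featureCode`, `note`, `pageUrl`) and the comma-separated encoding of the feature code are assumptions made for illustration; the claims do not specify the XML schema exchanged with the remote annotation server.

```python
import xml.etree.ElementTree as ET

def annotation_to_xml(xpath, feature_code, note, page_url):
    """Serialize one annotation record to an XML string (hypothetical schema)."""
    root = ET.Element("annotation")
    ET.SubElement(root, "xpath").text = xpath
    ET.SubElement(root, "featureCode").text = ",".join(str(v) for v in feature_code)
    ET.SubElement(root, "note").text = note
    ET.SubElement(root, "pageUrl").text = page_url
    return ET.tostring(root, encoding="unicode")

xml_text = annotation_to_xml(
    "/html/body/div[2]/p[1]",      # XPath of the marked object
    [1, 0, 2],                     # shortened feature code for the example
    "Check this figure",           # annotation entered by the user
    "http://example.com/page",     # placeholder page URL
)
print(xml_text)
```

The server side can parse the same string with any XML parser, so the client and server stay decoupled from each other's internal data structures.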

6. A device for generating web page annotation information, comprising:

The XPath generator is used to respond to the user selecting the target web page element as the marked object on the current web page loaded on the client web browser, extracting the XPath path of the marked object in the document object model DOM tree of the current web page;

A feature code CF generator, configured to generate the feature code CF of the marked object based on the marked object and the contents of the context web page elements immediately before and after the marked object in the current web page; and

Annotation generator, used to generate webpage annotation information based on the XPath path of the object to be annotated, the feature code CF of the object to be annotated, and the annotation input by the user,

Among them, the feature code CF of the marked object is composed of the content-based feature CBF of the marked object and the CBF of the context web page element,

7. The device according to claim 6, wherein the feature code CF generator comprises a content-based feature CBF generator for generating a content-based feature CBF of a webpage element based on the content of the webpage element; and

The CBF generator further includes:

an HTML cleaning unit, configured to remove meaningless HTML tags from web page elements by referring to pre-stored HTML cleaning principles;

The HTML alphabetization unit is used to perform HTML alphabetization on the webpage elements after HTML cleaning, so as to convert the webpage elements into alphabet strings composed of letters from a to z based on the content of the webpage elements;

A letter projection vector generation unit, used to count the occurrences of all letters in the letter string over the alphabet Λ={a, b, c, d, ..., z}, so as to generate the letter projection vector of the webpage element;

An alphabet order vector generation unit, used to count the reverse-order counts of all letters in the letter string over the alphabet Λ={a, b, c, d, ..., z}, so as to generate the alphabet order vector of the webpage element; and

A unit for concatenating the letter projection vector and the alphabet order vector of the webpage element to obtain the CBF of the webpage element, and

Wherein, the CF generator generates the feature code CF of the tagged object in the following manner: CBF of the upper web page element of the tagged object+CBF of the tagged object+CBF of the lower web page element of the tagged object.

8. The device according to claim 7, wherein, when the marked object and its context webpage elements contain Chinese text, the HTML alphabetization unit refers to a Chinese character dictionary to convert the Chinese text of the webpage elements after HTML cleaning to Hanyu Pinyin, and then performs HTML alphabetization.

9. The device according to any one of claims 6 to 8, wherein, in addition to the XPath path of the marked object, the feature code CF, and the content and format of the annotation, the webpage annotation information further includes the URL of the webpage where the annotation is located and the content feature code of that webpage.

10. The device according to any one of claims 6 to 8, wherein the device is realized by a browser plug-in, and the remote annotation server is realized by Java Servlet,

The device further includes an XML converter for converting the generated web page annotation information into an XML format suitable for communicating with a remote annotation server.

11. A method for displaying webpages and annotations on webpages on a client web browser, comprising the following steps:

a) in response to the user inputting the Uniform Resource Locator URL of the webpage to be loaded and displayed on the browser, analyzing the input URL to obtain a valid URL;

b) According to the effective URL, query all the annotations related to the effective URL from the remote annotation server, so as to obtain the annotation candidate set and the webpage annotation information of these annotations;

c) For each annotation in the annotation candidate set, according to the annotation information of the webpage, determine whether the annotation marks the webpage element in the webpage to be loaded, that is, determine whether the annotation should exist in the webpage to be loaded , and if so, further determining the position of the marked webpage element in the webpage to be loaded, that is, the marked position; and

d) Synthesizing the annotations determined to exist in the webpage to be loaded with that webpage, according to the webpage annotation information and the determined annotation positions, and displaying the synthesized webpage to the user via the browser,

Wherein, the annotation information of the marked web page includes the XPath path of the marked object corresponding to the mark, the feature code CF of the marked object, the content and format of the mark, the URL of the web page where the mark is located, the content feature code of the web page where the mark is located,

The feature code CF of the marked object is composed of the content-based feature CBF of the marked object and the CBF of the context web page element immediately before and after the marked object,

12. The method according to claim 11, wherein said step a) further comprises:

Based on the input URL, retrieving from the remote annotation server all URLs in the same website as the webpage to be loaded as candidate URLs, determining whether the webpage corresponding to each candidate URL is the same as or similar to the webpage to be loaded, and determining as a valid URL each candidate URL whose corresponding webpage is the same as or similar to the webpage to be loaded.

13. The method according to claim 11 or 12, wherein said step c) further comprises:

For each annotation in the annotation candidate set:

Based on the feature code CF and the XPath path of the marked object corresponding to the annotation, and starting from the node determined according to the XPath path in the document object model DOM tree of the webpage to be loaded, examine the nodes in the DOM tree of the webpage in sequence upward and downward respectively, so as to determine the nodes in the DOM tree that are the same as or closest to the marked object corresponding to the annotation and its context webpage elements, as the DOM tree nodes corresponding to the annotation;

Calculate the distance D between the annotation and the DOM tree based on the feature code of the annotation and its corresponding DOM tree nodes;

Determine whether the calculated distance D is less than a predetermined threshold; and

When the distance D of the annotation is less than the predetermined threshold, determine that the annotation should exist in the webpage to be loaded, and determine the position at which the annotation should be placed in the webpage to be loaded based on the determined DOM tree node that is the same as or closest to the marked object corresponding to the annotation.
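The search and threshold test of claim 13 can be sketched in simplified form. The sketch assumes the DOM tree has been flattened into a list of per-node feature vectors in document order, that `start` is the index of the node the stored XPath path resolves to, and that closeness is an L1 distance; the `window` parameter and all function names are hypothetical, and the real method also matches the context elements, which is omitted here.

```python
def l1(u, v):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))

def find_closest(node_features, start, target, window=5):
    """Probe outward from `start`, one step up and one step down at a time,
    and return (index, distance) of the node whose features are nearest."""
    best_index, best_distance = None, None
    for offset in range(window + 1):
        candidates = (start,) if offset == 0 else (start - offset, start + offset)
        for index in candidates:
            if 0 <= index < len(node_features):
                distance = l1(node_features[index], target)
                if best_distance is None or distance < best_distance:
                    best_index, best_distance = index, distance
        if best_distance == 0:
            break  # an identical node was found, no need to look further
    return best_index, best_distance

def locate(node_features, start, target, threshold, window=5):
    """Return the node index at which to show the annotation, or None when
    the closest node is not within the threshold."""
    index, distance = find_closest(node_features, start, target, window)
    if index is not None and distance < threshold:
        return index
    return None

features = [[1, 0], [5, 5], [2, 2], [9, 9]]
print(locate(features, start=1, target=[2, 2], threshold=1))
print(locate(features, start=1, target=[100, 100], threshold=1))
```

Starting at the XPath-resolved node keeps the search cheap when the page has changed only slightly, and the threshold is what suppresses annotations whose content has disappeared.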

14. The method according to claim 13, wherein the distance D between the label and the DOM tree is calculated in the following manner:

Assuming that the tagged objects and their context webpage elements corresponding to the tag are A, B, and C, and the tree nodes that are the same as or closest to them in the DOM tree are A’, B’, and C’ respectively, then:

D = d(A, A') + α(d(B, B') + d(C, C')) + βd_s,

where

d(A, A') = |CBF(A) - CBF(A')|,

d(B, B') = |CBF(B) - CBF(B')|,

d(C, C') = |CBF(C) - CBF(C')|,

Here, d(A, A') represents the distance between webpage element A and the tree node A' in the DOM tree that is the same as or most similar to A; d(B, B') represents the distance between webpage element B and the tree node B' in the DOM tree that is the same as or most similar to B; d(C, C') represents the distance between webpage element C and the tree node C' in the DOM tree that is the same as or most similar to C; CBF(A), CBF(B) and CBF(C) represent the CBFs of webpage elements A, B and C respectively; CBF(A'), CBF(B') and CBF(C') represent the CBFs of tree nodes A', B' and C' respectively; α and β are constants, where α represents the degree to which a difference in the context of the marked object influences the difference of the marked object, and β represents the degree to which a difference in the DOM tree structure influences the annotation similarity difference; and d_s represents the difference between the structure of the DOM tree and the feature code CF of the annotation.
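The distance D of claim 14 can be evaluated as in the sketch below. The claim does not say which norm |CBF(X) - CBF(X')| denotes, so an L1 norm is assumed here; the values of α, β and d_s and the two-entry vectors are made up for the example, and the function names are hypothetical.

```python
def l1(u, v):
    """Assumed reading of |CBF(X) - CBF(X')|: L1 distance between vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))

def annotation_distance(a, a_match, b, b_match, c, c_match, d_s, alpha, beta):
    """D = d(A, A') + alpha * (d(B, B') + d(C, C')) + beta * d_s."""
    return (
        l1(a, a_match)
        + alpha * (l1(b, b_match) + l1(c, c_match))
        + beta * d_s
    )

distance = annotation_distance(
    a=[1, 2], a_match=[1, 3],   # marked object and its closest node
    b=[0, 0], b_match=[2, 0],   # preceding context element and its closest node
    c=[1, 1], c_match=[1, 1],   # following context element and its closest node
    d_s=4, alpha=0.5, beta=0.25,
)
print(distance)  # 1 + 0.5 * (2 + 0) + 0.25 * 4 = 3.0
```

Choosing α below 1 makes changes in the surrounding context count less than changes in the marked object itself, which matches the role the claim assigns to α.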

15. The method according to claim 11 or 12, wherein the CBF of the web page element is generated in the following manner:

The CBF of the web page element is obtained by concatenating the alphabetic projection vector and the alphabetical order vector of the web page element.

16. The method according to claim 11 or 12, wherein said remote annotation server is realized by Java Servlet, and

The method further includes the step of converting the information transmitted between the client and the remote marking server into XML format before sending or receiving.

17. A device for displaying webpages and annotations on webpages via a client web browser, comprising:

The URL analyzer is used to analyze the input URL in response to the Uniform Resource Locator URL of the web page to be loaded and displayed on the browser input by the user, so as to obtain a valid URL;

An annotation queryer, configured to query all annotations related to the effective URL from the remote annotation server according to the effective URL, so as to obtain the annotation candidate set and the web annotation information of these annotations;

An annotation position determining unit, used to determine, for each annotation in the annotation candidate set and according to its webpage annotation information, whether the annotation marks a webpage element in the webpage to be loaded, that is, whether the annotation should exist in the webpage to be loaded, and if so, to further determine the position of the marked webpage element in the webpage to be loaded, that is, the annotation position; and

a synthesizing unit, configured to synthesize these annotations with the webpage to be loaded according to the webpage annotation information and the annotation positions determined as the annotations that should exist in the webpage to be loaded,

Wherein, the synthesized webpage is displayed to the user via the browser,

The annotation information of the marked web page includes the XPath path of the marked object corresponding to the mark, the feature code CF of the marked object, the content and format of the mark, the URL of the marked web page, the content feature code of the marked web page,

The tagged object's feature code CF consists of the tagged object's content-based feature CBF and the CBF of the contextual web page element immediately preceding and following the tagged object, and

18. The device according to claim 17, wherein, based on the input URL, the URL analyzer takes out all URLs in the same website as the webpage to be loaded from the remote marking server as alternative URLs, and The webpage corresponding to the alternative URL is determined to be the same or similar to the webpage to be loaded, and the alternative URL whose corresponding webpage is the same or similar to the webpage to be loaded is determined as a valid URL.

19. The device according to claim 17 or 18, wherein the annotation position determining unit performs the following processing for each annotation in the annotation candidate set:

20. The device according to claim 19, wherein the label position determination unit calculates the distance D between the label and the DOM tree in the following manner:

D = d(A, A') + α(d(B, B') + d(C, C')) + βd_s,

where

d(A, A') = |CBF(A) - CBF(A')|,

d(B, B') = |CBF(B) - CBF(B')|,

d(C, C') = |CBF(C) - CBF(C')|,

Here, d(A, A') represents the distance between webpage element A and the tree node A' in the DOM tree that is the same as or most similar to A; d(B, B') represents the distance between webpage element B and the tree node B' in the DOM tree that is the same as or most similar to B; d(C, C') represents the distance between webpage element C and the tree node C' in the DOM tree that is the same as or most similar to C; CBF(A), CBF(B) and CBF(C) represent the CBFs of webpage elements A, B and C respectively; CBF(A'), CBF(B') and CBF(C') represent the CBFs of tree nodes A', B' and C' respectively; α and β are constants, where α represents the degree to which a difference in the context of the marked object influences the difference of the marked object, and β represents the degree to which a difference in the DOM tree structure influences the annotation similarity difference; and d_s represents the difference between the structure of the DOM tree and the feature code CF of the annotation.

21. The device according to claim 17 or 18, further comprising a content-based feature CBF generator for generating a content-based feature CBF of a web page element,

The CBF generator further includes:

A unit for concatenating the alphabetic projection vector and the alphabetical order vector of the webpage element to obtain the CBF of the webpage element.

22. The device according to claim 17 or 18, wherein the device is realized by a browser plug-in, and the remote annotation server is realized by Java Servlet,

The device further includes an XML converter for converting the information transmitted between the client and the remote annotation server into XML format before sending or receiving.

23. A web page labeling method, comprising:

In response to the user inputting the URL of a webpage to be loaded and displayed on the client web browser, displaying on the browser, by performing the method according to any one of claims 11 to 16, the webpage together with the existing annotations previously made on the webpage and stored on the remote annotation server;

By performing the method according to any one of claims 1 to 5, adding a new annotation on the webpage, the webpage annotation information of the new annotation is stored on the remote annotation server; and

The added new annotation is displayed on the web page via a browser.

24. A web page tagging device, comprising:

The device for generating web page annotation information according to claim 9; and

The device for displaying webpages and annotations on the webpages via a client web browser according to any one of claims 17 to 22.

25. An information sharing system based on webpage annotation, including a client and a remote annotation server, wherein,

The client includes the web page tagging device according to claim 24, and

The remote annotation server includes an annotation database for storing webpage annotation information, and an annotation information accessor for controlling access to the annotation database.