
WO2019047849A1 - News processing method, apparatus, storage medium, and computer device - Google Patents

News processing method, apparatus, storage medium, and computer device

Info

Publication number
WO2019047849A1
WO2019047849A1 PCT/CN2018/104156 CN2018104156W WO2019047849A1 WO 2019047849 A1 WO2019047849 A1 WO 2019047849A1 CN 2018104156 W CN2018104156 W CN 2018104156W WO 2019047849 A1 WO2019047849 A1 WO 2019047849A1
Authority
WO
WIPO (PCT)
Prior art keywords
news
event
identified
time
time node
Prior art date
Application number
PCT/CN2018/104156
Other languages
English (en)
French (fr)
Inventor
殷乐
花贵春
王丹丹
郎兵
赵林
胡博
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2019047849A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • The present application relates to the field of Internet application technologies, and in particular to a news processing method, apparatus, computer-readable storage medium, and computer device.
  • The recommended news can be recent trending news, or news in a particular field that is recommended to different users in a targeted manner.
  • An expiration time needs to be set for news, and expired news must be handled promptly, so that expired news is not recommended to users and the recommended news keeps pace with the development of the news event, thereby meeting users' reading needs.
  • In the related art, there is no effective solution to the above problem.
  • The embodiments of the present application provide a news processing method, apparatus, computer-readable storage medium, and computer device that can improve the timeliness of recommended news.
  • A news processing method executed by a server, comprising:
  • A news processing apparatus comprising: a first acquiring module, configured to acquire a word vector of the news to be identified; a second acquiring module, configured to acquire a word vector of an event and a time node of the event; and a determining module, configured to determine an associated event of the news to be identified according to the similarity between the word vector of the news to be identified and the word vector of the event, determine the time node corresponding to the news to be identified in the associated event, and determine whether the news is valid according to the time node.
  • A computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements a news processing method.
  • The news processing method includes: acquiring a word vector of the news to be identified; acquiring a word vector of an event and a time node of the event; determining an associated event of the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event; determining the time node corresponding to the news to be identified in the associated event; and determining whether the news is valid according to the time node.
  • A computer device comprising a memory, a processor, and a computer program stored in the memory, the processor implementing a news processing method when executing the program.
  • The news processing method includes: acquiring a word vector of the news to be identified; acquiring a word vector of an event and a time node of the event; determining an associated event of the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event; determining the time node corresponding to the news to be identified in the associated event; and determining whether the news is valid according to the time node.
  • FIG. 1 is an application environment diagram of a news processing method according to an embodiment of the present application.
  • FIG. 2 is a flow chart of a news processing method in an embodiment of the present application.
  • FIG. 3 is a flowchart of a news processing method in another embodiment of the present application.
  • FIG. 4 is a flow chart of a news processing method in still another embodiment of the present application.
  • FIG. 5 is a flowchart of a news processing method in still another embodiment of the present application.
  • FIG. 6 is a flow chart of a news processing method in still another embodiment of the present application.
  • FIG. 7 is a schematic diagram of an application scenario in which a news reading application provides news processing on a server during a news push service according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an application scenario displayed by a news reading application on a terminal during a news push service according to an embodiment of the present disclosure.
  • FIG. 9 is a flowchart of main steps of a news processing method in which the game event A and the news B to be identified are taken as an example.
  • FIG. 10 is a schematic structural diagram of a news processing apparatus according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a news processing apparatus according to another embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a news processing apparatus according to still another embodiment of the present application.
  • FIG. 13 is a schematic diagram showing the internal structure of a computer device according to an embodiment of the present application.
  • Browsing news through the Internet has become the habit of more and more users, and many news websites or news applications also have the function of actively recommending news to users.
  • At present, the expiration time of news is determined in two ways:
  • an expiration time is preset for news that contains a given keyword;
  • an expiration time is preset for news of a given category.
  • However, this approach can only set an expiration duration for news that contains specific keywords or belongs to the same category. For news fields that contain many well-defined events whose periodicity is unclear, such as sports news and movie news, setting the expiration time by keyword or category is not applicable. For example, it is unreasonable to recommend pre-match or in-match news after a sports match has ended, and it is inappropriate to recommend preview news from before a movie's release once the movie is showing. News recommended to users after it has lost its relevance is of little value, so the timeliness of the recommendations is poor.
  • FIG. 1 is a diagram showing an application environment of a news processing method according to an embodiment of the present application, including a terminal 100 and a server 200.
  • The server 200 is connected to the terminal 100 through a network.
  • The user downloads a news application through the terminal 100, or logs in to a news website, for browsing.
  • A news application is application software dedicated to letting the user read news information, or application software that includes a function module dedicated to that purpose, such as the various commonly used news reading apps (applications) that include a news recommendation function.
  • The terminal 100 may be a smartphone, a tablet, a personal digital assistant (PDA), a personal computer, or the like.
  • The server 200 transmits the recommended news to the corresponding terminal 100 through the network, so that the terminal can display it for the user to view.
  • The server 200 can be a standalone physical server or a cluster of physical servers.
  • FIG. 2 shows a news processing method according to an embodiment of the present application.
  • The method may be performed by the server 200 and includes the following steps.
  • Step 101: Acquire a word vector of the news to be identified.
  • News usually refers to timely reports of significant and valuable events, delivered through narrative text, images, video, and other means so that a certain audience learns of them.
  • News in a broad sense refers to any message, including text, images, video, and audio data, that records events and disseminates information through media or network channels.
  • For example, news in a broad sense includes not only the text, images, video, and audio data served by news websites and news applications in the usual sense, but also event-related messages published as articles in social applications.
  • Here, news refers to news in a broad sense.
  • The news to be identified is the object processed by the news processing method provided in the embodiments of the present application.
  • In step 101, acquiring a word vector of the news to be identified includes: extracting keywords from the news to be identified; and mapping the extracted keywords into a word vector space to obtain the word vector corresponding to each keyword.
  • A keyword generally refers to information that describes a feature necessarily mentioned in the course of an event and that reflects what is unique about the event.
  • The description of an event usually includes information related to the four elements of time, place, person, and event, so keywords can be determined and extracted at least from the perspective of the information related to these four elements.
  • Keywords may be extracted from the news to be identified by crawling structured information from a vertical website for the news or from other related news web pages. The structured information may be crawled with techniques known in current Internet technology, such as web crawlers.
  • A vertical website is a website that focuses on a certain field or specific need and provides comprehensive, in-depth information and related services for that field or need.
  • Structured information is information that, once analyzed, can be decomposed into multiple interrelated components, each with a clear hierarchical structure. Its use and maintenance are managed through a database and follow defined operating specifications.
  • Keywords may be extracted from the title of the news, the content of the report, and the comments on the news.
  • Extracting keywords from the news to be identified includes extracting the keywords corresponding to the news from at least one of the following: first, the information contained in the content of the news to be identified; second, specific associated information of the news to be identified.
  • The information contained in the content of the news to be identified is the information in the news report itself, such as the news headline and the news body. For news in the form of video or audio data, keywords can be extracted not only from the news headline but also by converting the speech into text through speech recognition.
  • The specific associated information of the news to be identified mainly refers to the information contained in content related to the news report, such as the comments on the news. For news in the form of video or audio data, keywords can be extracted from the comments in addition to the news headline.
  • Extracting keywords from both the news report itself and its comments makes the extraction more comprehensive, so the keywords can be identified more accurately.
  • Taking the keywords of the news fully into account also helps in judging the timeliness of news with rich content.
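  • As a minimal sketch of keyword extraction from both the news content and its associated information, the example below scores tokens from a title, a body, and comments by weighted frequency. The function name `extract_keywords`, the stopword list, and the title weight are hypothetical choices for illustration; the source does not specify an extraction algorithm.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a proper tokenizer and lexicon.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "is", "was", "for", "with"}


def extract_keywords(title, body="", comments=(), top_k=5):
    """Return the top_k highest-scoring non-stopword tokens from all sources."""
    counts = Counter()
    # Content of the news itself (title, body) plus associated information (comments).
    # The title weight of 3 is an arbitrary illustrative choice.
    sources = [(title, 3), (body, 1)] + [(comment, 1) for comment in comments]
    for text, weight in sources:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in STOPWORDS:
                counts[token] += weight
    return [word for word, _ in counts.most_common(top_k)]
```

  • For instance, `extract_keywords("Beijing derby final", "The final was played in Beijing", ["great final"], top_k=2)` ranks "final" first and "beijing" second, because both the body and the comment reinforce the title terms.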
  • A word vector is a representation that converts the characters, words, phrases, and so on of a language into numbers.
  • One form of word vector represents a word as a vector of a specific length equal to the size of the dictionary: exactly one component of the vector is 1 and all the others are 0, and the position of the 1 corresponds to the position of the word in the dictionary.
  • Another form maps each word in the language to a short vector of fixed length, shorter than that specific length. Putting all these vectors together forms a word vector space in which each vector is a point. By introducing a distance measure in this space, the lexical and semantic similarity between words can be judged from the distance between their short vectors.
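  • The difference between the two representations can be sketched as follows. The three-word vocabulary and the two-dimensional dense vectors are made-up values for illustration, and cosine similarity is used here as one common choice of distance measure; the source does not name a specific one.

```python
import math


def one_hot(word, vocabulary):
    """Dictionary-sized vector with a single 1 at the word's position."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and dense vectors (not taken from any trained model).
vocabulary = ["match", "game", "movie"]
dense = {"match": [0.9, 0.1], "game": [0.8, 0.2], "movie": [0.1, 0.9]}
```

  • With one-hot vectors, any two different words have similarity 0, so "match" is no closer to "game" than to "movie". With the dense vectors above, "match" is much closer to "game" than to "movie", which is the property the short-vector form is meant to provide.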
  • Word vectors can be trained by means of a language model, and the extracted keywords are mapped into the word vector space to obtain the corresponding word vectors.
  • For example, a word vector model (e.g., word2vec) is trained on samples, e.g., words and their corresponding word vectors, to obtain the parameters of the model. The extracted keywords are then mapped into the word vector space by inputting them into the word vector model, which outputs the word vector corresponding to each keyword.
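  • Once a model has been trained, mapping keywords to word vectors amounts to a lookup. The sketch below uses a plain dictionary as a stand-in for a trained word vector model; the table `EMBEDDINGS`, its values, and the decision to skip unknown keywords are assumptions made for illustration.

```python
# Hypothetical 3-dimensional embeddings standing in for a trained word vector model.
EMBEDDINGS = {
    "beijing": [0.2, 0.7, 0.1],
    "final": [0.9, 0.1, 0.3],
}


def keywords_to_vectors(keywords, embedding_table):
    """Map each keyword to its vector; keywords missing from the table are skipped."""
    return {word: embedding_table[word] for word in keywords if word in embedding_table}
```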
  • Step 103: Acquire a word vector of the event and a time node of the event.
  • There may be one or more events; when there are multiple events, the word vector of each event is acquired.
  • An event has multiple time nodes, which can be acquired by obtaining the event's sequence of time nodes.
  • Events are occurrences that are significant and can affect a certain group of people.
  • The description of an event usually includes information related to the four elements of time, place, person, and event, and covers the event's development from its beginning to its end.
  • A time node of an event is a specific point in time that divides the event into multiple development stages according to the common characteristics of different time periods. Taking a sports event as an example, a match can be divided into three stages (pre-match, mid-match, and post-match), with the match start time and the match end time serving as the time nodes.
  • Taking a movie as an example, time points such as the premiere time, the release start time, and the release end time are used as time nodes to distinguish the three stages before, during, and after the release.
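  • How a sorted sequence of time nodes partitions an event into stages can be sketched with a binary search. The function name `stage_index` and the example timestamps are hypothetical; the convention that a moment exactly on a node belongs to the later stage is also an assumption.

```python
from bisect import bisect_right
from datetime import datetime


def stage_index(time_nodes, moment):
    """Return the index of the development stage that `moment` falls in.

    `time_nodes` must be sorted chronologically. With nodes [start, end],
    0 means pre-match, 1 means mid-match, and 2 means post-match.
    A moment exactly on a node is assigned to the stage that begins there.
    """
    return bisect_right(time_nodes, moment)


# Hypothetical match: starts at 19:00 and ends at 21:00 on 2018-09-01.
match_nodes = [datetime(2018, 9, 1, 19, 0), datetime(2018, 9, 1, 21, 0)]
```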
  • Acquiring the word vector corresponding to the event includes: extracting keywords from the event; and mapping the extracted keywords into the word vector space to obtain the word vector corresponding to each keyword.
  • The extracted keywords of the event are input into the word vector model, and the vectors output by the model are used as the word vectors corresponding to the keywords.
  • A keyword generally refers to information that describes a feature necessarily mentioned in the course of an event and that reflects what is unique about the event.
  • The description of an event usually includes information related to the four elements of time, place, person, and event.
  • The event itself also has attribute information indicating the industry or field category it belongs to. This category is information related to a further element of the event, so the keywords of the event can be determined or extracted from at least the information related to these five elements.
  • For example, the keywords of an event may be the date "XX" extracted from the time element, "Beijing" extracted from the place element, the lead actor "XX" extracted from the person element, and the "entertainment" category extracted from the event-category element.
  • News is a specific form in which events are presented, so the keywords of an event may also be extracted from multiple related news items that are already known to belong to the event. Specifically, one or more news items associated with the event are acquired, and the keywords of the event are determined from the information contained in the content of those news items and from their specific associated information.
  • Step 105: Determine an associated event of the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event, and determine the time node corresponding to the news to be identified in the associated event.
  • The similarity between the news to be identified and each event is determined, and one event is selected as the associated event of the news according to these similarities.
  • The time node corresponding to the news to be identified is determined in the sequence of time nodes of the associated event.
  • Similarity refers to the degree of association between two things.
  • The similarity between the news to be identified and an event is determined from their word vectors mainly in one of two ways: matching the word vector of the news against the word vector of the event and deciding according to the matching result; or calculating a similarity value between the two word vectors and deciding according to its magnitude.
  • Through the similarity between the news to be identified and the events, the associated event of the news is identified automatically; that is, it is determined whether the news to be identified is related news of a specific event.
  • The time node of the associated event that corresponds to the news is likewise identified automatically; that is, the development stage of the associated event to which the news belongs is identified.
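  • Selecting the associated event from per-event similarity scores can be sketched as below. The function name `select_associated_event` and the `threshold` parameter are assumptions: the source says one event is selected according to similarity but does not state a cutoff, so the threshold here only illustrates how news unrelated to any known event might be left unassociated.

```python
def select_associated_event(similarities, threshold=0.5):
    """Pick the event with the highest similarity to the news.

    `similarities` maps an event identifier to its similarity score.
    Returns None when there are no events or when the best score is below
    the (hypothetical) threshold.
    """
    if not similarities:
        return None
    best_event = max(similarities, key=similarities.get)
    if similarities[best_event] < threshold:
        return None
    return best_event
```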
  • In this news processing method, by setting the time nodes of an event and extracting the relevant information of the news to be identified, news related to the event is identified automatically, and the time node of the event corresponding to the news is determined according to the time information of the news.
  • Based on the event corresponding to the news to be identified and the time node corresponding to the news within that event, it can be judged whether the news to be identified belongs to the current development stage of the event.
  • Because the current development stage of the event can be identified accurately, the timeliness of the news to be identified can be assessed more reliably.
  • Step 106: Determine whether the news is valid according to the time node.
  • The news processing method further includes:
  • Step 107: When the corresponding time node is a specific time node associated with expiration, determine that the news to be identified is invalid.
  • The time nodes of an event typically form a sequence arranged in chronological order. Each time node represents the start time of one development stage of the event or the end time of another, and any two adjacent time nodes delimit one development stage. Therefore, once the time node corresponding to the news to be identified is determined, the development stage of the event to which the news belongs is also determined, and it can be judged from the corresponding time node whether it is a specific time node associated with expiration.
  • The time node following the one that corresponds to the news to be identified, that is, the end time of the development stage to which the news belongs or the start time of the next development stage, may be taken as the specific time node associated with expiration, and this specific time node can be determined as the expiration time of the news to be identified.
  • Alternatively, a subsequent time node at a preset interval from the corresponding time node may be determined as the specific time node associated with expiration, and that specific time node is determined as the expiration time of the news to be identified.
  • Alternatively, the corresponding time node plus a preset duration may be taken as the specific time node associated with expiration and determined as the expiration time of the news to be identified.
  • The specific time node associated with expiration may be a moment or a time period. When it is expressed as a time period, the time period may be set according to actual application requirements, and a moment within it is determined as the expiration time of the news to be identified.
  • In this embodiment, the start time of the next development stage of the event is set as the expiration time of the news to be identified; that is, the specific time node associated with expiration is the start time of the development stage that follows the one to which the news belongs.
  • The development of the event is divided into multiple stages by the time nodes. After the development stage to which the news belongs is identified, the expiration time of the news is set to the start time of the next development stage or of a specific later stage; which stage is selected depends on actual application needs.
  • The specific time node associated with expiration is determined from the time node corresponding to the news, so that only news belonging to the current development stage of the event is recommended to users, and news that does not belong to the current stage is taken offline promptly, ensuring the timeliness of the news recommended to users.
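  • The three options described above (the next time node, a later node at a preset interval, or the corresponding node plus a preset duration) can be sketched in one helper. The function name `expiration_time` and its parameters `step` and `extra` are hypothetical names introduced for illustration.

```python
from datetime import datetime, timedelta


def expiration_time(time_nodes, node_index, step=1, extra=None):
    """Expiration time for news whose stage starts at time_nodes[node_index].

    time_nodes: chronologically sorted list of datetimes.
    step:  how many nodes ahead to look (1 means the next time node).
    extra: if given, ignore `step` and return the corresponding node plus
           this preset duration.
    Returns None when no such later node exists (the end-time-node case).
    """
    if extra is not None:
        return time_nodes[node_index] + extra
    target = node_index + step
    return time_nodes[target] if target < len(time_nodes) else None


# Hypothetical event with three time nodes.
nodes = [datetime(2018, 9, 1, 19), datetime(2018, 9, 1, 21), datetime(2018, 9, 2, 9)]
```

  • Returning `None` for the last node keeps this helper separate from the preset-duration handling of the end time node, which is a distinct case in the text.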
  • In determining whether the news is valid according to the time node, the news processing method further includes:
  • Step 108: When the corresponding time node is of the end-time-node type and the preset expiration duration of the end time node has elapsed, determine that the news to be identified is invalid.
  • Each time node can indicate the start time of one development stage of the event or the end time of another.
  • The time node at the front of the node sequence is the start time node, the time node at the end of the sequence is the end time node, and the time nodes between them are intermediate time nodes.
  • When each time node indicates the start time of a development stage, the last stage has no subsequent time node to define its end time, and the time node of the event that corresponds to the time information in the news to be identified may be the end time node. Therefore, when the time node corresponding to the news is the start time node or an intermediate time node, the expiration time of the news may be set to the next time node, to a subsequent time node at a preset interval, or to the corresponding time node plus a preset duration. When the time node corresponding to the news is the end time node, the expiration time of related news belonging to the last development stage of the event is determined by a preset expiration duration.
  • The preset expiration duration is a preset time range within which the news remains valid; once the time elapsed since the news was released exceeds this range, the news is treated as expired.
  • When the end time node indicates the start time of the last development stage of the event and the time node corresponding to the news to be identified is determined to be the end time node, the news belongs to the last development stage of the event.
  • Its expiration time can then be determined as the corresponding time node plus the preset expiration duration. With time nodes set in this way, an event is divided into multiple development stages, only the start time of each stage needs to be considered, and a single preset expiration duration can be set uniformly for the final stage of events in different fields, which reduces the difficulty of setting the time nodes of events.
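  • A validity check that treats the end time node differently from earlier nodes might look like the sketch below. The function name `is_news_invalid` and the 24-hour default for `preset_expiry` are assumptions; the source leaves the preset duration to the application.

```python
from datetime import datetime, timedelta


def is_news_invalid(time_nodes, node_index, now, preset_expiry=timedelta(hours=24)):
    """Decide whether news tied to time_nodes[node_index] has expired at `now`.

    For the end time node, the news expires once the preset expiration
    duration has elapsed after that node. For any earlier node, the news
    expires when the next time node is reached.
    """
    is_end_node = node_index == len(time_nodes) - 1
    if is_end_node:
        return now >= time_nodes[node_index] + preset_expiry
    return now >= time_nodes[node_index + 1]


# Hypothetical match: starts at 19:00 and ends at 21:00 on 2018-09-01.
match_nodes = [datetime(2018, 9, 1, 19), datetime(2018, 9, 1, 21)]
```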
  • In the above news processing method, the time nodes of an event divide it into multiple stages according to the common characteristics of its different development stages. Determining the time node of the event that corresponds to the news to be identified reveals the development stage to which the news belongs, so it can be judged whether the news belongs to the current development stage of the event, and news that does not is determined to be invalid.
  • In this way, a reasonable life cycle is set for news based on the event's time nodes, and news that is not in the current development stage of the event is promptly determined to be invalid. This avoids recommending news that is out of step with the event's current stage and improves the timeliness of the news recommended to users.
  • Acquiring the time node of the event includes:
  • The time nodes of an event can be set in a predefined manner. For example, by analyzing the common development characteristics of events in each field category, events are divided into several development stages, the dividing time points between the stages are determined, and these dividing time points are used as the predefined time nodes for events of the corresponding category. As another example, by analyzing the common development characteristics of events with different levels of discussion heat, events are divided into several stages of heated discussion, the dividing time points between those stages are determined, and these time points are used as the predefined time nodes for events of the corresponding heat level.
  • A dividing time point may be a moment or a time period, and accordingly a time node may also be a moment or a time period.
  • When a dividing time point is a time period, any moment within the period may, according to actual needs, be assigned to both of the adjacent development stages or to only one of them.
  • In step 103, acquiring the time node of the event includes:
  • Related news of the event is acquired and clustered, and the time nodes of the event are determined according to the time information included in the related news of each category.
  • The time nodes of an event can thus be determined by clustering the related news of the event.
  • Clustering is the process of grouping data into different classes or clusters. Objects in the same class or cluster are highly similar, while objects in different classes or clusters are highly dissimilar.
  • The time information included in related news includes the release time of the news, the time at which the event covered by the news occurred, and so on. In this embodiment, the time information refers to the release time of the news.
  • During clustering, the related news is divided into categories according to its keywords. Specifically, the word vectors corresponding to the keywords of the related news may be input into a pre-trained classification model, which divides the related news into different categories.
  • The dividing time point corresponding to a category is determined from the release times of the related news items in that category.
  • For example, the dividing time point of a category may be determined from the earliest and latest release times among the related news items in that category in the clustering result.
  • The dividing time points corresponding to the different categories are used as the time nodes of the event.
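  • Assuming the related news has already been assigned to categories, deriving candidate time nodes from release times can be sketched as follows. The function names and the choice to use both the earliest and the latest release time of every category as candidate nodes are one possible reading of the text, not a confirmed detail of the method.

```python
from datetime import datetime


def category_time_spans(labelled_news):
    """Map each category to the (earliest, latest) release time of its news.

    `labelled_news` is an iterable of (category, release_time) pairs.
    """
    spans = {}
    for category, released in labelled_news:
        if category in spans:
            earliest, latest = spans[category]
            spans[category] = (min(earliest, released), max(latest, released))
        else:
            spans[category] = (released, released)
    return spans


def initial_time_nodes(spans):
    """Collect every category boundary as a sorted list of candidate time nodes."""
    return sorted({moment for span in spans.values() for moment in span})


# Hypothetical clustered news: two pre-match items and two post-match items.
news = [
    ("pre", datetime(2018, 8, 30, 10)),
    ("pre", datetime(2018, 9, 1, 18)),
    ("post", datetime(2018, 9, 1, 21, 30)),
    ("post", datetime(2018, 9, 2, 8)),
]
```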
  • Acquiring and clustering the related news of the event, and determining the time nodes of the event according to the time information included in the related news of each category, includes:
  • The time nodes of the event are determined based on initial time nodes.
  • The time information included in related news includes the release time of the news, the time at which the event covered by the news occurred, and so on. Taking the release time as the time information, for example, the earliest and latest release times among the related news of each category obtained by clustering are first used as the dividing time points of that category, and these dividing time points serve as the initial time nodes of the corresponding event.
  • Adjustment rules may then be formulated according to particular requirements and applied to the initial time nodes to obtain the time nodes of the event. Alternatively, starting from the initial time nodes, a user may adjust them in a custom manner, based on experience or other conditions, to obtain the time nodes of the event.
  • In step 105, determining the associated event of the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event, and determining the time node corresponding to the news to be identified in the associated event, includes:
  • Step 1051 Construct a first feature corresponding to the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event.
  • The similarity between the word vector of the news to be identified and the word vector of the event may be determined either by determining a matching probability value between the word vector of the news and the word vector of the event, or by calculating a similarity value between the two.
  • the first feature refers to the similarity represented by the matching probability value or the similarity value of the word vector of the news to be recognized and the word vector of the event.
  • The similarity value between the word vector of the news and the word vector of the event is calculated by Equation 1, in which:
  • $f_e$ represents the keywords of the event, and $a_i$ represents the word vector of the $i$-th keyword in $f_e$;
  • $f_n$ represents the keywords of the news to be identified, and $b_j$ represents the word vector of the $j$-th keyword in $f_n$;
  • $n$ represents the number of keywords of the news, and $K$ represents the number of keywords of the event.
  • The word vectors of the event keywords and of the news keywords both express the corresponding information numerically. They can be obtained by a known method, for example the word2vec language model.
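  • A similarity between the keyword word vectors of an event and of a news item could be computed along the following lines. The exact form of Equation 1 is not reproduced in this text, so the sketch assumes cosine similarity between every pair of vectors and, as one possible aggregate, their mean; both choices and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarity(event_vecs, news_vecs):
    """K x n table of similarities between each a_i and each b_j."""
    return [[cosine(a, b) for b in news_vecs] for a in event_vecs]


def mean_similarity(event_vecs, news_vecs):
    """Average of all pairwise similarities (an assumed aggregate)."""
    flat = [s for row in pairwise_similarity(event_vecs, news_vecs) for s in row]
    return sum(flat) / len(flat) if flat else 0.0


# Toy 2-dimensional word vectors: K = 2 event keywords, n = 1 news keyword.
event_vecs = [[1.0, 0.0], [0.0, 1.0]]
news_vecs = [[1.0, 0.0]]
table = pairwise_similarity(event_vecs, news_vecs)
score = mean_similarity(event_vecs, news_vecs)
```

  • In practice the vectors would come from a trained embedding model such as word2vec rather than from hand-written lists.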
  • the specific representation of the first feature corresponding to the news to be identified is constructed as follows:
  • In Equation 2, fea represents the first feature corresponding to the news to be identified, that is, the feature of the news to be identified relative to one event; when there are N events, there are N such fea.
  • Step 1052: Input the first feature as the sample feature into the first classification model to obtain the confidence that each event is the associated event of the news to be identified.
  • the first classification model may be a softmax regression model or a support vector machine (SVM) model.
  • With the sample feature denoted by $x$, inputting the first feature as the sample feature into the first classification model yields the confidence that each event is the associated event of the news to be identified, expressed as follows:
  • In Equation 3, $h_\theta(x)$ represents the confidence, $\theta$ represents the model parameters obtained by training, and $x$ represents the sample feature.
  • Step 1053 determining that the event that the confidence meets the condition is an associated event of the news to be identified.
  • In Equation 4, $J(\theta)$ represents the cost function, $x^{(i)}$ represents the input, $y^{(i)}$ represents the output, and $m$ represents the number of sample features.
  • An iterative optimization algorithm such as gradient descent is used to minimize the cost function, which determines the model parameters of the classification model and thus yields a usable classification model together with the condition that the confidence needs to satisfy.
  • The first feature corresponding to the news to be identified is input into the first classification model, which determines the probability (confidence) that the news to be identified is related news of a given event, that is, the probability that the event is the associated event of the news to be identified. The associated event of the news to be identified is determined according to this confidence, and the time node corresponding to the news to be identified in the associated event is then determined.
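  • The classification step can be illustrated with a small softmax regression trained by gradient descent. Equations 3 and 4 are not reproduced in this text, so the sketch assumes the textbook softmax hypothesis and the mean negative log-likelihood cost; the learning rate, step count, toy samples, and function names are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidences(theta, x):
    """Assumed form of h_theta(x): one probability per class."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in theta])


def cost(theta, samples):
    """Assumed form of J(theta): mean negative log-likelihood."""
    return -sum(math.log(confidences(theta, x)[y]) for x, y in samples) / len(samples)


def train(samples, n_classes, n_features, lr=0.5, steps=500):
    """Batch gradient descent on the cost above, starting from zeros."""
    theta = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(steps):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, y in samples:
            p = confidences(theta, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)
                for d in range(n_features):
                    grad[k][d] += err * x[d] / len(samples)
        for k in range(n_classes):
            for d in range(n_features):
                theta[k][d] -= lr * grad[k][d]
    return theta


# Toy samples: x = [similarity, bias term]; y = 1 if the event is associated.
samples = [([0.9, 1.0], 1), ([0.8, 1.0], 1), ([0.1, 1.0], 0), ([0.2, 1.0], 0)]
theta = train(samples, n_classes=2, n_features=2)
```

  • After training, `confidences(theta, x)` returns one confidence per class for a new sample feature; how the "condition" on the confidence is chosen is not specified here.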
  • step 1051 based on the similarity between the word vector of the news to be identified and the word vector of the event, constructing the first feature corresponding to the news to be identified, including:
  • the following feature components are combined to obtain a first feature corresponding to the news to be identified: a similarity between the word vector of the news to be recognized and the word vector of the event; a relationship between the time of the news to be recognized and the time node of the event.
  • the time of the news to be identified refers to the release time of the news to be recognized, and the relationship between the time of the news to be identified and the time node of the event refers to the relationship between the release time of the news to be recognized and each time node of the event.
  • The time of the news to be identified includes the release time of the news to be identified, the time at which the event content covered by the news occurs, and the like. Taking the case where the time of the news to be identified is its release time as an example, the relationship between the time of the news to be identified and the time node of the event may be the difference between the release time of the news and the time at which the event occurs. Combining this with the similarity between the word vector of the news to be identified and the word vector of the event, the first feature corresponding to the news to be identified is constructed as follows:
  • In Equation 5, fea represents the first feature corresponding to the news to be identified, similar represents the similarity between the keywords of the news and the keywords of the event, newtime represents the release time of the news to be identified, and eventtime represents a time node of the event. In this example, eventtime may be the time at which the event occurs, for instance the time corresponding to the first time node of the event.
  • An additional one-dimensional feature component may also be used: the mean of the word vectors of the news to be identified.
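  • Combining the two feature components described above might look like the following sketch. Equation 5 is not reproduced in this text, so the layout of the feature (similarity first, then the time difference) and the use of hours as the unit are assumptions; `first_feature` is a hypothetical helper name.

```python
from datetime import datetime


def first_feature(similarity, news_time, event_time):
    """Concatenate the similarity with (newtime - eventtime) in hours."""
    hours = (news_time - event_time).total_seconds() / 3600.0
    return [similarity, hours]


# Hypothetical values: news released one hour after the event's first node.
fea = first_feature(0.8, datetime(2018, 9, 1, 20), datetime(2018, 9, 1, 19))
```

  • With N candidate events, one such feature would be built per event, matching the statement that there are N fea for N events.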
  • determining a time node corresponding to the news to be identified in the associated event includes:
  • Step 1054 Construct a second feature corresponding to the news to be identified based on the relationship between the time of the news to be identified and the time node of the event.
  • The time of the news to be identified refers to the release time of the news to be identified, and the relationship between the time of the news to be identified and the time node of the event refers to the relationship between that release time and each time node of the event.
  • The time of the news to be identified mainly includes the release time of the news to be identified, the time at which the event content covered by the news occurs, and the like.
  • the relationship between the time of the news to be recognized and the time node of the event may be a difference between the time of the news to be recognized and the time node of the event, or a value given according to the magnitude of the difference, or the like.
  • When the time of the news to be identified is its release time and the relationship between that time and the time nodes of the event is a difference, the time vector of the news to be identified is constructed as follows:
  • $\mathit{timefea} = [\mathit{newtime} - \mathit{e\_time}_0,\ \ldots,\ \mathit{newtime} - \mathit{e\_time}_i,\ \ldots,\ \mathit{newtime} - \mathit{e\_time}_n]$ (Equation 6)
  • where timefea represents the time vector of the news to be identified, $\mathit{e\_time}_i$ represents the $i$-th time node of the event, and newtime represents the release time of the news to be identified.
  • timefea can be used directly as the second feature; alternatively, $[\frac{1}{M}\sum_{i=1}^{M} W_i,\ \mathit{timefea}]$ can be used as the second feature, where $W_i$ is the word vector of the $i$-th keyword of the news to be identified and $M$ is the number of keywords in the news to be identified.
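  • The time vector of Equation 6 and the second feature built from it can be sketched as below. Times are represented as plain numbers (for example hours on a common clock), which is an assumption of the sketch, as are the function names.

```python
def time_vector(news_time, node_times):
    """timefea: difference between newtime and every time node e_time_i."""
    return [news_time - t for t in node_times]


def mean_vector(word_vecs):
    """Component-wise mean of the keyword word vectors W_1..W_M."""
    m = len(word_vecs)
    return [sum(col) / m for col in zip(*word_vecs)]


def second_feature(word_vecs, news_time, node_times):
    """Concatenate the mean word vector with timefea."""
    return mean_vector(word_vecs) + time_vector(news_time, node_times)


# Toy data: two 2-dimensional keyword vectors, news at t = 10, nodes at 8 and 12.
timefea = time_vector(10.0, [8.0, 9.0, 12.0])
fea = second_feature([[1.0, 3.0], [3.0, 5.0]], 10.0, [8.0, 12.0])
```

  • A positive component means the news was released after that time node, a negative one means it was released before it.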
  • Step 1055: Input the second feature into the second classification model to obtain the confidence that the news to be identified corresponds to each time node of the associated event.
  • the second classification model can be a softmax regression model or an SVM model.
  • Inputting the second feature into the second classification model means inputting the second feature as the second sample feature. With the sample feature denoted by $x$, the confidence that the news to be identified corresponds to each time node of the associated event is expressed as follows:
  • In Equation 7, $h_\theta(x)$ represents the confidence, $\theta$ represents the trained model parameters, and $x$ represents the sample feature.
  • Step 1056 Determine that the time node whose confidence meets the condition is a time node corresponding to the news to be identified.
  • In Equation 8, $J(\theta)$ represents the cost function, $x^{(i)}$ represents the input, $y^{(i)}$ represents the output, and $m$ represents the number of sample features.
  • An iterative optimization algorithm such as gradient descent is used to minimize the cost function, which determines the model parameters of the second classification model and thus yields a usable classification model.
  • The second feature is input into the second classification model, and the probability that the news to be identified corresponds to each time node of the event is calculated; the time node corresponding to the news to be identified is determined from these probabilities. The cost function is used to determine the parameters of the model, which are obtained through training.
  • During training, some samples are input into Equation 7 to obtain the confidence of each sample, expressed in terms of the model parameters. These confidences are then substituted into Equation 8, and the cost function is solved to determine the parameters of the model.
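  • Once the second classification model outputs one confidence per time node, the node whose confidence "meets the condition" has to be picked. The text does not spell out the condition, so the sketch below assumes one plausible rule: take the highest confidence, and accept it only if it reaches a threshold. The threshold value and the function name are hypothetical.

```python
def pick_time_node(node_confidences, threshold=0.5):
    """Return the index of the most confident node, or None if below threshold."""
    best = max(range(len(node_confidences)), key=lambda i: node_confidences[i])
    return best if node_confidences[best] >= threshold else None


# Hypothetical model outputs for three time nodes of the associated event.
chosen = pick_time_node([0.1, 0.7, 0.2])
undecided = pick_time_node([0.4, 0.35, 0.25])
```

  • Returning `None` leaves room for a fallback when no time node is confident enough; whether such a fallback exists is not stated in the text.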
  • step 1054 based on the relationship between the time of the news to be identified and the time node of the event, the second feature corresponding to the news to be identified is constructed, including:
  • the following feature components are combined to obtain the second feature corresponding to the news to be identified: the mean value of the word vector of the news to be recognized; the relationship between the time of the news to be recognized and the different time nodes of the associated event.
  • the mean value of the word vector of the news to be identified refers to the mean value of the word vector corresponding to the time node of the event to be identified by the news.
  • the relationship between the time of the news to be recognized and the time node of the event may be a difference between the time of the news to be recognized and the time node of the event, or a value given according to the magnitude of the difference, or the like.
  • When the relationship between the time of the news to be identified and the time nodes of the event is a difference, the second feature of the news to be identified is constructed as follows:
  • $\mathit{fea} = [\frac{1}{M}\sum_{i=1}^{M} W_i,\ \mathit{timefea}]$ (Equation 9)
  • In Equation 9, fea represents the second feature, $M$ represents the number of keywords of the news to be identified, $W_i$ represents the word vector of the $i$-th keyword of the news to be identified, and timefea represents the relationship between the time of the news to be identified and the time nodes of the event, for example the time vector of Equation 6, whose components are the differences between the time of the news to be identified and the time nodes of the event.
  • Step 105, in which the associated event of the news to be identified is determined based on the similarity between the word vector of the news to be identified and the word vector of the event and the time node corresponding to the news in the associated event is determined, may also be implemented in another manner, in which the time node corresponding to the news to be identified is determined directly by a third classification model.
  • A plurality of events correspond to a plurality of time nodes. The third classification model is used to determine which of these time nodes the news to be identified corresponds to, and the event to which the determined time node belongs is then taken as the associated event of the news to be identified. This specifically includes the following steps:
  • Step 1057 Based on the similarity between the word vector of the news to be recognized and the word vector of the event, and the relationship between the time of the news to be recognized and the time node of the event, construct a third feature corresponding to the news to be identified.
  • In step 1057, constructing the third feature corresponding to the news to be identified, based on the similarity between the word vector of the news to be identified and the word vector of the event and on the relationship between the time of the news to be identified and the time node of the event, includes combining the following feature components to obtain the third feature: the similarity between the word vector of the news to be identified and the word vector of the event; the relationship between the time of the news to be identified and the occurrence time node of the event; the mean of the word vectors of the news to be identified; and the relationship between the time of the news to be identified and the different time nodes of the associated event.
  • Each feature component may be the same as the corresponding feature component in the foregoing embodiments: the similarity between the word vector of the news to be identified and the word vector of the event is as shown in Equation 2; the combination of that similarity with the relationship between the time of the news to be identified and the occurrence time of the event is as shown in Equation 5; and the relationship between the time of the news to be identified and the time nodes of the event is as shown in Equation 6.
  • Step 1058 Input a third feature to a third classification model, and obtain a confidence that the time of the news to be identified corresponds to different time nodes of different events.
  • the third classification model may be a softmax regression model or a SVM (Support Vector Machine) model.
  • Inputting the third feature into the third classification model means inputting the third feature as the third sample feature. With the sample feature denoted by $x$, the confidence that the time of the news to be identified corresponds to each time node of each event is expressed as follows:
  • In Equation 10, $h_\theta(x)$ represents the confidence, $\theta$ represents the trained model parameters, and $x$ represents the sample feature formed by the third feature.
  • The third model is trained with the third feature of each news sample as input and the time node corresponding to that news as output.
  • The time nodes of the plurality of events constitute a set of time nodes, each carrying an identifier of its event. The third classification model is used to determine which time node in this set the news to be identified corresponds to, and the event of the determined time node is taken as the associated event of the news to be identified.
  • Step 1059 Determine that the time node whose confidence meets the condition is a time node corresponding to the news to be identified, and the event corresponding to the determined time node is used as an associated event of the news to be identified.
  • In Equation 11, $J(\theta)$ represents the cost function, $x^{(i)}$ represents the input, $y^{(i)}$ represents the output, and $m$ represents the number of sample features.
  • An iterative optimization algorithm such as gradient descent is used to minimize the cost function, which determines the model parameters of the third classification model and yields a usable classification model. The third feature of the news to be identified is input into the third classification model to obtain the probability that the news to be identified corresponds to each time node; the time node whose confidence satisfies the condition is determined to be the time node corresponding to the news to be identified, and the event of that time node is determined to be the associated event of the news to be identified.
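  • The label space of the third classification model, in which every class is one time node tagged with its event, can be sketched as follows. The data layout (a dictionary from event identifier to its list of time nodes), the argmax selection rule, and the function names are assumptions made for illustration.

```python
def build_label_set(event_nodes):
    """Flatten {event_id: [node_time, ...]} into [(event_id, node_index), ...]."""
    return [(event, i)
            for event, nodes in sorted(event_nodes.items())
            for i in range(len(nodes))]


def resolve(label_set, class_confidences):
    """Map the most confident class back to its (event_id, node_index) pair."""
    best = max(range(len(class_confidences)), key=lambda k: class_confidences[k])
    return label_set[best]


# Hypothetical events with 3 and 2 time nodes, and hypothetical model outputs.
labels = build_label_set({"A": [1, 2, 3], "B": [5, 6]})
event_id, node_index = resolve(labels, [0.05, 0.1, 0.6, 0.15, 0.1])
```

  • Because each class already carries its event identifier, the associated event falls out of the time-node decision without a separate event classifier.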
  • The development stages of the event are delimited by the time nodes of the event, and the life cycle of news related to the event is matched to those development stages. As a result, judging whether a piece of news is related to the event, and which development stage of the event the time of the news corresponds to, becomes more scientific and precise; the expiration time of the news can in turn be calculated with better results.
  • the news processing method can be applied to any news reading application software for users to read news information, such as Daily Express, Tencent News, and the like.
  • For example, a Daily Express client is installed on the terminal 100.
  • the news reading application provided by the embodiment of the present application is in the news push service.
  • FIG. 8 is a schematic diagram of an application scenario displayed by a news reading application on a terminal during a news push service according to an embodiment of the present application.
  • By installing a news reading application client on the terminal, the user can read news pushed by the server: after the server determines, through the news processing method, the associated event of the news to be identified and the corresponding time node in that event, it pushes the news that matches the current development stage of the event, and the user views it through the software interface of the news reading application on the terminal 100.
  • a specific application manner of determining the expiration time of the news by the news processing method provided by the embodiment of the present application is as follows, taking the sports event A and the news B to be identified as an example, including:
  • Obtaining the time nodes of event A by clustering the related news of the event. Specifically, the related news of sports event A is clustered to obtain four time nodes A1, A2, A3, and A4, which divide the event into three stages: before the game (time nodes A1 to A2), during the game (time nodes A2 to A3), and after the game (time nodes A3 to A4).
  • the keyword of the news B to be identified and the keyword of the event A are obtained.
  • Determining whether the news B to be identified is related news of event A. Specifically, structured information is extracted from the title, report content, and comments of news B as the keywords of news B; the similarity between the keywords of news B and the predefined or pre-extracted keywords of event A is calculated; a sample feature is constructed from the similarity; and the classification model classifies it to determine whether news B is related news of sports event A.
  • The keyword extraction for news B may take into account the full text of the news and even the content of its comments, and the similarity covers the similarities computed between each of the several keywords of the news and the keywords of the event, so a more accurate judgment can be obtained.
  • In this way, news that relates to a recorded sports event but might otherwise not be effectively identified can be recalled, which makes the determination of the correlation between news and events more accurate.
  • The recall rate for matching news to sports events can reach 85%, and the accuracy can reach 98%.
  • The determination includes constructing a sample feature from the release time of the news to be identified and the time nodes of the event, and classifying it with the classification model to determine which time node of sports event A the news B to be identified corresponds to. For example, if news B corresponds to the pre-game stage, the corresponding time node in the associated event is A1; if news B corresponds to the in-game stage, the corresponding time node is A2; and if news B corresponds to the post-game stage, the corresponding time node is A3.
  • The terminal 100 recalls the news B to be identified when the expiration time node corresponding to it arrives.
  • The expiration time node corresponding to the news B to be identified is the time node $A_{n+1}$ that follows its corresponding time node $A_n$.
  • That is, given the corresponding time node $A_n$, the next time node $A_{n+1}$ is determined as the expiration time of the news B to be identified.
  • Any two adjacent time nodes ($A_n$, $A_{n+1}$) represent the start and end times of one development stage of event A. By determining the development stage of the event in which the news to be identified falls, related news belonging to the previous development stage can be invalidated when the current development stage starts, which ensures the timeliness of the news.
  • the related news belonging to the pre-match is pushed to the user before the mid-stage of the match event A, and is recalled when the time node A2 of the match event A arrives; the related news belonging to the match is in the post-competition phase.
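  • The expiration rule described above, in which news assigned to time node A_n expires when A_{n+1} arrives, can be sketched as follows. Times are plain numbers and node indices start at 0 for A1; how news assigned to the final time node expires is left out, and the function names are hypothetical.

```python
def expiration_node(node_times, n):
    """Return the time of the node after index n, or None if n is the last node."""
    return node_times[n + 1] if n + 1 < len(node_times) else None


def is_expired(node_times, n, now):
    """True once the next time node has arrived."""
    expiry = expiration_node(node_times, n)
    return expiry is not None and now >= expiry


# Hypothetical time nodes A1..A4 of a sports event, in arbitrary time units.
nodes = [0, 10, 20, 30]
pre_game_expiry = expiration_node(nodes, 0)  # news assigned to A1 expires at A2
```

  • A pre-game news item (index 0) is therefore still valid at time 9 and is recalled from time 10 onward.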
  • the correct rate of news recognition before the game can reach 95%
  • the correct rate of news recognition in the game can reach 90%
  • the correct rate of news recognition after the game can reach 97%.
  • the above news processing method can improve the competitiveness of news reading application software by setting a reasonable life cycle for news and improving the timeliness of news recommendation.
  • a news processing apparatus including a first obtaining module 11, a second obtaining module 13, and a determining module 15.
  • the first obtaining module 11 is configured to acquire a word vector of the news to be identified.
  • the second obtaining module 13 is configured to acquire a word vector corresponding to the event and a time node of the event.
  • the determining module 15 is configured to determine an association event of the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event, and determine a time node corresponding to the news to be identified in the associated event, according to the time node Determine if the news is valid.
  • the first obtaining module 11 includes a keyword extracting unit 111 and a word vector unit 113.
  • the keyword extraction unit is configured to extract keywords based on the news to be identified.
  • the word vector unit is used to map the extracted keywords into the word vector space to obtain a word vector corresponding to the keyword.
  • the keyword extracting unit is specifically configured to extract keywords corresponding to the to-be-identified news from at least one of the following: the to-be-identified news; the specific associated information of the to-be-identified news.
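  • The two units of the first obtaining module can be pictured with the following sketch. The extraction technique is not specified in the text, so keeping whitespace-separated tokens that appear in a known vocabulary is only a stand-in, and the lookup table stands in for a trained embedding model; every name and value here is hypothetical.

```python
def extract_keywords(texts, vocabulary):
    """Keep tokens found in the vocabulary, in first-seen order, without repeats."""
    seen, keywords = set(), []
    for text in texts:
        for token in text.lower().split():
            if token in vocabulary and token not in seen:
                seen.add(token)
                keywords.append(token)
    return keywords


def to_word_vectors(keywords, embeddings):
    """Map each keyword to its vector, skipping keywords without an embedding."""
    return [embeddings[k] for k in keywords if k in embeddings]


# Hypothetical embedding table and inputs: the news title plus one comment.
embeddings = {"final": [0.1, 0.9], "goal": [0.8, 0.2]}
keywords = extract_keywords(["Late goal decides the final", "What a goal"], embeddings)
vectors = to_word_vectors(keywords, embeddings)
```

  • Passing several texts mirrors the option of extracting keywords from both the news itself and its associated information.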
  • the second acquisition module 13 includes a predefined unit 131 or a clustering unit 133.
  • the predefined unit 131 is configured to acquire a predefined time node of the event.
  • the clustering unit 133 is configured to acquire related news of the event and perform clustering processing, and determine a time node of the event according to time information included in the related news of different categories.
  • the failure determination module 17 is further configured to determine that the news to be identified is invalid when the type of the corresponding time node is an end time node and the preset failure time of the end time node arrives.
  • the failure determination module 17 is configured to determine that the news to be identified is invalid when the corresponding time node is a specific time node associated with the failure.
  • the determination module 15 includes a first feature unit 151, a first classification unit 152, and an event determination unit 153.
  • the first feature unit 151 is configured to construct a first feature corresponding to the news to be identified based on the similarity between the word vector of the news to be recognized and the word vector of the event.
  • the first classification unit 152 is configured to input the first feature as a sample feature into the first classification model, and obtain a confidence that the different events are associated events of the news to be identified.
  • the event determining unit 153 is configured to determine that the event whose confidence meets the condition is an associated event of the news to be identified.
  • the first feature unit 151 is specifically configured to combine the following feature components to obtain a first feature corresponding to the news to be identified: a similarity between the word vector of the news to be recognized and the word vector of the event; a time of the news to be identified and a time node of the event Relationship.
  • the determining module further includes a second feature unit 154, a second classifying unit 155, and a time determining unit 156.
  • the second feature unit 154 is configured to construct a second feature corresponding to the news to be identified based on the relationship between the time of the news to be identified and the time node of the event.
  • the second classification unit 155 is configured to input the second feature to the second classification model, and obtain the confidence of the different time nodes of the related event corresponding to the news to be identified.
  • the time determining unit 156 is configured to determine that the time node whose confidence meets the condition is a time node corresponding to the news to be identified.
  • the second feature unit 154 is specifically configured to combine the following feature components to obtain a second feature corresponding to the news to be identified: an average value of the word vector of the news to be recognized; a relationship between the time of the news to be identified and different time nodes of the associated event.
  • the determining module 15 includes a third feature unit 157, a third classifying unit 158, and a determining unit 159.
  • the third feature unit 157 is configured to construct a third feature corresponding to the news to be identified based on the similarity between the word vector of the news to be recognized and the word vector of the event, and the relationship between the time of the news to be recognized and the time node of the event.
  • the third classification unit 158 is configured to input the third feature to the third classification model, and obtain the confidence that the time of the news to be identified corresponds to different time nodes of different events.
  • the determining unit 159 is configured to determine that the time node whose confidence satisfies the condition is the time node corresponding to the news to be identified, and to take the event corresponding to the determined time node as the associated event of the news to be identified.
  • The news processing apparatus uses the time nodes of the event to divide the event into a plurality of development stages according to the common characteristics of different time periods. By determining the time node of the event that corresponds to the news to be identified, it can learn the development stage of the event in which the news falls, determine whether the news belongs to the current development stage of the event, and determine news that is not in the current development stage as invalid.
  • In this way, a reasonable life cycle is set for the news based on the time nodes of the event, and news that is not in the current development stage of the event is promptly determined to be invalid, which avoids recommending such news to the user and improves the timeliness of the news recommended to the user.
  • the embodiment of the present application further provides a computer device, including a processor and a memory for storing a computer program capable of running on a processor, wherein when the processor is configured to run the computer program, executing:
  • a news processing method comprising: acquiring a word vector of news to be identified; acquiring a word vector of an event and a time node of the event; determining, based on the similarity between the word vector of the news to be identified and the word vector of the event, an associated event of the news to be identified; determining a time node corresponding to the news to be identified in the associated event; and determining, according to the time node, whether the news is valid.
  • the processor is further configured to execute, when running the computer program: the acquiring a word vector of the news to be identified includes: extracting keywords based on the news to be identified; and mapping the extracted keywords into a word vector space to obtain the word vectors corresponding to the keywords.
  • the processor is further configured to: when the computer program is executed, the extracting a keyword based on the to-be-identified news, comprising: extracting a keyword corresponding to the news to be identified from at least one of the following: the news to be identified; The specific associated information of the news to be identified.
  • the processor is further configured to: when the computer program is executed, execute: the acquiring a time node of the event, comprising: acquiring a predefined time node of the event; or acquiring relevant news of the event and performing The clustering process determines a time node of the event according to time information included in related news of different categories.
  • the processor is further configured to execute, when running the computer program: the determining an associated event of the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event includes: constructing a first feature corresponding to the news to be identified based on that similarity; inputting the first feature as a sample feature into a first classification model to obtain the confidence that each event is the associated event of the news to be identified; and determining that an event whose confidence meets the condition is the associated event of the news to be identified.
  • the processor is further configured to: when the computer program is executed, perform: the first feature corresponding to the to-be-identified news based on the similarity between the word vector of the to-be-identified news and the word vector of the event,
  • the method includes: combining the following feature components to obtain the first feature corresponding to the news to be identified: a similarity between a word vector of the news to be recognized and a word vector of the event; and a time of the news to be identified The relationship of the time nodes of the event.
  • the processor is further configured to perform, when running the computer program: determining the time node corresponding to the news to be identified in the associated event, which includes: constructing a second feature corresponding to the news to be identified based on the relationship between the time of the news to be identified and the time node of the event; inputting the second feature into a second classification model to obtain the confidence that the news to be identified corresponds to each time node of the associated event; and determining the time node whose confidence satisfies a condition to be the time node corresponding to the news to be identified.
  • the processor is further configured to perform, when running the computer program: constructing the second feature corresponding to the news to be identified, which includes combining the following feature components to obtain the second feature: the mean of the word vectors of the news to be identified; and the relationship between the time of the news to be identified and the different time nodes of the associated event.
  • the processor is further configured to perform, when running the computer program: determining the associated event of the news to be identified and the time node corresponding to the news to be identified in the associated event, which includes: constructing a third feature corresponding to the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event and on the relationship between the time of the news to be identified and the time node of the event; inputting the third feature into a third classification model to obtain the confidence that the time of the news to be identified corresponds to each time node of each event; and determining the time node whose confidence satisfies a condition to be the time node corresponding to the news to be identified, and taking the event of the determined time node as the associated event of the news to be identified.
  • the processor is further configured to perform, when running the computer program: the news processing method further comprises: when the type of the corresponding time node is an end time node and a preset expiry duration measured from the end time node has elapsed, determining that the news to be identified has expired.
  • the processor is further configured to perform, when running the computer program: the news processing method further comprises: when the corresponding time node is a specific time node associated with expiry, determining that the news to be identified has expired.
  • FIG. 12 is a schematic diagram of the internal structure of a computer device, which may be the server 200 shown in FIG. 1 and which includes a processor, an internal memory, a network interface, and a non-volatile storage medium connected through a system bus.
  • the processor is configured to implement a computing function and a function of controlling a server.
  • the processor is configured to execute the news processing method provided by the embodiment of the present application.
  • the non-volatile storage medium stores an operating system, a database, and a news processing apparatus for implementing the news processing method provided by the embodiments of the present application.
  • the network interface is used to connect to the terminal.
  • the memory can be implemented by any type of volatile or non-volatile storage device, or a combination thereof.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk storage or a tape storage.
  • the volatile memory may be a random access memory (RAM) that acts as an external cache. By way of example and not limitation, many forms of RAM are available: static RAM (SRAM), synchronous static RAM (SSRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM), double data rate synchronous dynamic RAM (DDRSDRAM), enhanced synchronous dynamic RAM (ESDRAM), SyncLink dynamic RAM (SLDRAM), and Direct Rambus RAM (DRRAM).
  • the memory is used to store various types of data to support the operation of the news processing device.
  • Examples of such data include any computer program for operating on a news processing device, such as an operating system and applications; news to be identified, word vectors for news to be identified, time nodes for events, word vectors for time, and the like.
  • the operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks.
  • the application can include various applications, such as a news application, a Media Player, a browser, etc., for implementing various application services.
  • a program implementing the method of the embodiment of the present application may be included in an application.
  • the network interface is used for wired or wireless communication between the news processing device and other devices.
  • the news processing device can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
  • the network interface receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel.
  • the network interface further includes a Near Field Communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
  • the news processing method disclosed in the above embodiment of the present application may be applied to a processor or implemented by a processor.
  • the number of processors may be one or more to complete all or part of the steps of the above method.
  • the processor may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method may be completed by an integrated logic circuit of hardware in a processor or an instruction in a form of software.
  • the above processor may be a general purpose processor, a digital signal processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like.
  • the processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application.
  • a general purpose processor can be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium, the storage medium being located in the memory, the processor reading the information in the memory, and completing the steps of the foregoing methods in combination with the hardware thereof.
  • the news processing device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components, for performing the aforementioned method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

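A news processing method, apparatus, storage medium and computer device, wherein the method comprises: acquiring a word vector of news to be identified (101); acquiring a word vector of an event and time nodes of the event (103); determining, based on the similarity between the word vector of the news to be identified and the word vector of the event, an associated event of the news to be identified, and determining the time node corresponding to the news to be identified in the associated event (105); and determining, according to the time node, whether the news is valid (106).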

Description

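News processing method, apparatus, storage medium and computer device
This application claims priority to Chinese patent application No. 201710791715.7, entitled "News processing method, apparatus, storage medium and computer device", filed with the Chinese Patent Office on September 5, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of Internet application technologies, and in particular to a news processing method and apparatus, a computer-readable storage medium, and a computer device.
Background
With the development of Internet technology, browsing news online has become a habit for more and more users, and many news websites and news applications actively recommend news to users. The recommended news may be recent trending news, or news in particular fields recommended in a targeted manner for different users.
Usually, an expiry time needs to be set for news so that expired news is taken down promptly. This ensures that expired news is not recommended to users and that the recommended news keeps pace with how the news event is developing, thereby meeting users' reading needs. The related art offers no effective solution to this problem.
Summary
Embodiments of this application provide a news processing method and apparatus, a computer-readable storage medium, and a computer device that can improve the timeliness of recommended news.
The technical solutions of the embodiments of this application are implemented as follows: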
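A news processing method, performed by a server, comprising:
acquiring a word vector of news to be identified; acquiring a word vector of an event and time nodes of the event;
determining, based on the similarity between the word vector of the news to be identified and the word vector of the event, an associated event of the news to be identified, and determining the time node corresponding to the news to be identified in the associated event;
determining, according to the time node, whether the news is valid.
A news processing apparatus, comprising: a first acquisition module, configured to acquire a word vector of news to be identified; a second acquisition module, configured to acquire a word vector corresponding to an event and time nodes of the event; and a determining module, configured to determine, based on the similarity between the word vector of the news to be identified and the word vector of the event, an associated event of the news to be identified, to determine the time node corresponding to the news to be identified in the associated event, and to determine, according to the time node, whether the news is valid.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements a news processing method. The news processing method comprises: acquiring a word vector of news to be identified; acquiring a word vector of an event and time nodes of the event; determining, based on the similarity between the word vector of the news to be identified and the word vector of the event, an associated event of the news to be identified, and determining the time node corresponding to the news to be identified in the associated event; and determining, according to the time node, whether the news is valid.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and run on the processor, wherein the processor implements a news processing method when executing the program. The news processing method comprises: acquiring a word vector of news to be identified; acquiring a word vector of an event and time nodes of the event; determining, based on the similarity between the word vector of the news to be identified and the word vector of the event, an associated event of the news to be identified, and determining the time node corresponding to the news to be identified in the associated event; and determining, according to the time node, whether the news is valid.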
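Brief Description of the Drawings
FIG. 1 is a diagram of the application environment of a news processing method in one embodiment of this application.
FIG. 2 is a flowchart of a news processing method in one embodiment of this application.
FIG. 3 is a flowchart of a news processing method in another embodiment of this application.
FIG. 4 is a flowchart of a news processing method in yet another embodiment of this application.
FIG. 5 is a flowchart of a news processing method in still another embodiment of this application.
FIG. 6 is a flowchart of a news processing method in a further embodiment of this application.
FIG. 7 is a schematic diagram of an application scenario in which the server performs news processing for the news push service of a news reading application, according to one embodiment of this application.
FIG. 8 is a schematic diagram of an application scenario showing what the terminal displays for the news push service of a news reading application, according to one embodiment of this application.
FIG. 9 is a flowchart of the main steps of the news processing method of this application, taking match event A and news to be identified B as an example.
FIG. 10 is a schematic structural diagram of a news processing apparatus in one embodiment of this application.
FIG. 11 is a schematic structural diagram of a news processing apparatus in another embodiment of this application.
FIG. 12 is a schematic structural diagram of a news processing apparatus in yet another embodiment of this application.
FIG. 13 is a schematic diagram of the internal structure of a computer device in one embodiment of this application.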
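Detailed Description
The technical solutions of this application are described in further detail below with reference to the accompanying drawings and specific embodiments.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in this specification are only for describing specific embodiments and are not intended to limit this application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Browsing news over the network has become a habit for more and more users, and many news websites and news applications actively recommend news to users. To ensure that the news users receive both tracks the development of events and meets their reading needs, a news processing method is needed that can effectively identify the relationship between news and events and set a reasonable expiry time.
In a specific embodiment, the expiry duration of news is determined in two ways:
First, based on keywords contained in the news title, a corresponding expiry duration is preset for news containing those keywords;
Second, based on the category of the news, a corresponding expiry duration is preset for news of that category.
The expiry duration is determined in one of these two ways, and the expiry time is then set to the publication time of the news plus the expiry duration. However, this approach can only set an expiry duration for news that contains specific keywords or belongs to the same category. For news fields that involve a large number of well-defined events whose periodicity is not clear, such as sports match news and film news, setting the expiry duration from keywords or category is unsuitable. For example, it is unreasonable to recommend pre-match or in-match news after a match has finished, or to recommend pre-release trailer news after a film has been released: by the time the news is recommended, reading it is no longer meaningful to the user, so the recommended news has poor timeliness.
To solve the above technical problem, this application proposes a news processing method, apparatus, storage medium and computer device. FIG. 1 shows the application environment of the news processing method provided by an embodiment of this application, which includes a terminal 100 and a server 200; the server 200 is connected to the terminal 100 through a network. A user downloads a news application or logs in to a news website through the terminal 100 to browse. A news application is application software dedicated to letting users obtain and read news, or application software containing a functional module dedicated to that purpose, such as the various commonly used apps that include a news reading section with a news recommendation function. The terminal 100 may be a smartphone, a tablet computer, a personal digital assistant (PDA), a personal computer, or the like. The server 200 sends recommended news to the corresponding terminal 100 through the network for the user to view on the terminal. The server 200 may be an independent physical server or a cluster of physical servers.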
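Referring to FIG. 2, a news processing method provided by an embodiment of this application may be performed by the server 200 and includes the following steps.
Step 101: acquire a word vector of the news to be identified.
News usually refers to the timely reporting of relatively significant, valuable events in a summarized narrative form by means of text, images, video and the like, so that a certain group of people learns of them. News in the broad sense means messages: it covers all text, image, video and audio data that records events and disseminates information through media or network channels. For example, news in the broad sense includes not only text, image, video and audio data delivered through news websites and news applications in the usual sense, but also event-related messages posted as articles in social applications. In this embodiment, news means news in the broad sense. The news to be identified is the object to be processed by the news processing method provided in the embodiments of this application.
In a specific embodiment, step 101, acquiring a word vector of the news to be identified, includes: extracting keywords based on the news to be identified; and mapping the extracted keywords to a word vector space to obtain the word vectors corresponding to the keywords. Extracting keywords based on the news to be identified means extracting the keywords corresponding to that news. The mapping is implemented by inputting the extracted keywords into a word vector model and taking the word vectors that the model outputs for those keywords.
Here, keywords usually refer to information that is bound to be mentioned when describing an event and that reflects the distinctive characteristics of the event. For example, the description of an event usually includes information on four elements: time, place, people, and the course of the event; keywords can therefore be determined and extracted at least from information related to these four elements. Keywords may be extracted from the news to be identified by crawling structured information from vertical news websites or other related news web pages, using crawling methods known in current Internet technology, such as web crawlers. A vertical website is a website that focuses on a specific field or a specific need and provides in-depth information and related services for that field or need. Structured information is information that, after analysis, can be decomposed into multiple interrelated components with a clear hierarchy among them, and whose use and maintenance are managed through a database under defined operating rules. Keywords may be extracted from the news title, the body of the report, the comments on the news, and so on.
In a specific embodiment, extracting keywords based on the news to be identified includes extracting keywords corresponding to the news from at least one of the following: first, information contained in the content of the news itself; second, specific associated information of the news. The former refers to information contained in the content of the news report itself, such as the news title and body; for news in the form of video or audio data, keywords can be extracted not only from the title but also from text obtained by speech recognition. The specific associated information of the news mainly refers to information contained in content related to the news report, such as the comments on the news; for video or audio news, keywords can be extracted from the corresponding comments as well as from the title. In this embodiment, keyword extraction is not confined to the framework in which the news was originally published: keywords can be extracted comprehensively from the content of the report itself and from related information such as comments. The keywords of the news can therefore be identified more correctly and precisely, and the rich content of the report is fully exploited to improve timeliness.
A word vector is a numerical representation of characters, words, phrases and the like in a language. One form of word vector represents a word by a vector of a specific length equal to the size of the dictionary, in which exactly one component is 1 and all others are 0; the position of the 1 corresponds to the position of the word in the dictionary. Alternatively, each word in the language is mapped through training to a fixed-length vector that is short relative to that specific length. All these vectors together form a word vector space in which each vector is a point; a distance measure is introduced in the space, and the lexical and semantic similarity between words is judged from the distance between their short vectors. Word vectors can be trained by means of a language model, through which the extracted keywords are mapped to the word vector space to obtain the corresponding word vectors. In a specific embodiment, a word vector model such as word2vec is trained on samples, for example words and their corresponding word vectors, to obtain the parameters of the model. The extracted keywords are then mapped to the word vector space by inputting them into the word vector model, which yields the word vectors corresponding to the keywords.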
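The two word-vector forms described above can be illustrated with a minimal sketch. The three-word vocabulary and the two-dimensional embeddings are made-up values for illustration only; in practice a trained model such as word2vec would supply the dense vectors, and cosine similarity is just one common choice of distance measure, assumed here rather than taken from the text.

```python
import math

def one_hot(word, vocabulary):
    """Return a one-hot vector whose length equals the vocabulary size."""
    return [1 if w == word else 0 for w in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vocabulary = ["match", "goal", "premiere"]

# Hypothetical dense embeddings; a trained word vector model would provide real ones.
embeddings = {
    "match": [0.9, 0.1],
    "goal": [0.8, 0.3],
    "premiere": [0.1, 0.9],
}
```

One-hot vectors of different words are always orthogonal, so they carry no similarity information, whereas the short dense vectors place related words closer together.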
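Step 103: acquire a word vector of an event and the time nodes of the event. There may be one event or several; when there are several, the word vectors of all of them are acquired. An event has multiple time nodes, which can be acquired by obtaining the event's time node sequence.
An event is a relatively significant matter that can affect a certain group of people. The description of an event usually includes information on four elements: time, place, people, and the course of the event, where the course of the event covers the description of its development from beginning to end. The time nodes of an event are the specific points in time that divide the development of the event into multiple stages according to some characteristic shared within each period. Taking a sports match as an example, the two time nodes of match start time and match end time divide the match into three stages: pre-match, in-match and post-match. Taking a film release as another example, the development from promotion to screening can be divided into three stages (before release, during release and after release), with the preview screening time, premiere time, general release start time and general release end time as time nodes.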
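As a small illustration of how time nodes split an event into stages, the sketch below locates the stage that a given moment falls into for the sports match example. The function name, the dates, and the convention that a moment exactly on a node belongs to the later stage are assumptions made for this sketch.

```python
from bisect import bisect_right
from datetime import datetime

def stage_index(time_nodes, moment):
    """Return how many time nodes lie at or before `moment`.

    With the two nodes [match start, match end]:
    0 = pre-match, 1 = in-match, 2 = post-match.
    `time_nodes` must be sorted in ascending order.
    """
    return bisect_right(time_nodes, moment)

# Hypothetical match: kicks off at 19:00 and ends at 21:00.
nodes = [datetime(2018, 9, 5, 19, 0), datetime(2018, 9, 5, 21, 0)]
stage_names = ["pre-match", "in-match", "post-match"]
```

For instance, `stage_names[stage_index(nodes, datetime(2018, 9, 5, 20, 0))]` selects the in-match stage.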
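In a specific embodiment, acquiring the word vector corresponding to an event includes: extracting keywords based on the event; and mapping the extracted keywords to the word vector space to obtain the word vectors corresponding to the keywords. Specifically, the extracted event keywords are input into the word vector model, and the vectors output by the model are taken as the word vectors of the keywords. Here, keywords usually refer to information that is bound to be mentioned when describing an event and that reflects the distinctive characteristics of the event; for example, the description of an event usually includes information on four elements: time, place, people, and the course of the event. In addition, an event has attribute information about the industry or field category it belongs to; the category of the event is information related to a further element, so the keywords of the event can be determined or extracted at least from information related to these five elements. Take the event "Film XX premiered in Beijing on day XX, and the leading actor XX attended the premiere" as an example: "day XX" can be extracted from the time element, "Beijing" from the place element, the leading actor "XX" from the people element, and "entertainment" from the event category element, each serving as a keyword of the event. News is a concrete form of expression of an event, so extracting keywords based on an event may also mean extracting keywords from multiple known news items associated with the event. Specifically, one or more news items associated with the event are acquired, and the keywords of the event are determined from the information contained in the content of those news items and from their specific associated information.
Step 105: based on the similarity between the word vector of the news to be identified and the word vector of the event, determine the associated event of the news to be identified, and determine the time node corresponding to the news in the associated event. When there are multiple events, the similarity between the news and each event is determined, and one of the events is selected as the associated event of the news according to these similarities. The time node corresponding to the news is then determined within the time node sequence of the associated event.
Similarity expresses the degree of association between two things. The similarity between the news to be identified and an event is determined from their word vectors mainly in one of two ways: by matching the word vectors of the news against those of the event and using the matching result; or by computing a similarity value between them and using its magnitude. The similarity between the news and events is used to automatically identify the associated event of the news, that is, whether the news is associated news of a particular event. It is also used to automatically identify the time node of the associated event to which the news corresponds, that is, the development stage of the associated event to which the news belongs.
In the news processing method provided by the above embodiment, time nodes are set for events, relevant information is extracted from the news to be identified, news associated with an event is identified automatically, and the time node of the event to which the news corresponds is determined from the time information of the news. Introducing event time nodes gives news a reasonable life cycle: the development stage of the event to which a news item belongs can be judged from the corresponding time node, so both the event a news item corresponds to and whether the news matches the event's current development stage can be identified accurately, which helps improve the timeliness of the news.
Step 106: determine, according to the time node, whether the news is valid.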
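Once the time node corresponding to the news to be identified in the associated event has been determined, the expiry time of the news can be set based on that node. Referring to FIG. 3, in one embodiment, for determining whether the news is valid according to the time node, the news processing method further includes:
Step 107: when the corresponding time node is a specific time node associated with expiry, determine that the news to be identified has expired.
The time nodes of an event usually form a sequence of multiple time nodes arranged in chronological order. Each time node represents the start time of one development stage of the event or the end time of another, and any two adjacent time nodes delimit one development stage. Therefore, once the time node corresponding to the news has been determined, the development stage to which the news belongs is known, and it can be determined from the corresponding time node whether it is a specific time node associated with expiry. The specific time node associated with expiry may be the time node following the one corresponding to the news, that is, the end time of the development stage to which the news belongs, or equivalently the start time of the next stage; that specific time node can then be taken as the expiry time of the news. In another embodiment, a later time node at a preset interval from the corresponding time node, that is, a time node in a subsequent development stage, may be used as the specific time node associated with expiry and taken as the expiry time of the news. In yet another embodiment, the corresponding time node plus a preset duration may be used as the specific time node associated with expiry and taken as the expiry time of the news.
The specific time node associated with expiry may be an instant or a time period. When it is expressed as a time period, any instant within that period may be chosen as the expiry time of the news according to practical needs. In one specific embodiment, the start time of the next development stage of the event is set as the expiry time of the news; the specific time node associated with expiry is then the start time of the stage following the one to which the news belongs. The time nodes divide the development of the event into multiple stages; after the stage to which the news belongs has been identified, the expiry time of the news is set to the start time of the next stage or of a specific later stage, the choice depending on practical needs. Determining the specific time node associated with expiry from the time node corresponding to the news means that only news belonging to the current development stage of the event is recommended to users, while news that does not belong to the current stage is taken down promptly, which ensures the timeliness of the recommended news. Referring to FIG. 4, in another embodiment, for determining whether the news is valid according to the time node, the news processing method further includes:
Step 108: when the type of the corresponding time node is an end time node and a preset expiry duration measured from the end time node has elapsed, determine that the news to be identified has expired.
When setting the time nodes of an event, each time node may represent the start time of one development stage or the end time of another. The time node at the front of the node sequence is the start time node, the one at the end of the sequence is the end time node, and all nodes in between are intermediate time nodes. When the end time node is set to represent the start time of the last development stage of the event, no time node is set to bound the end of that last stage, and the time node determined for a news item from its time information may be this end time node. Therefore, where every time node represents the start time of a development stage: when the time node corresponding to the news is the start time node or an intermediate time node, the expiry time of the news can be set to the next time node, to a later time node at a preset interval, or to the corresponding time node plus a preset duration; when the time node corresponding to the news is the end time node, the expiry time of news belonging to the last development stage is determined by a preset expiry duration.
The preset expiry duration is a preset period during which news remains valid; news that has stayed valid after publication for longer than this period is treated as expired. When the end time node represents the start time of the last development stage and the time node determined for the news from its time information is the end time node, the expiry time of news belonging to the last stage can be determined as the corresponding time node plus the preset expiry duration. With time nodes set this way, an event is divided into different development stages by multiple time nodes; only the start time of each stage needs to be considered, and a preset expiry duration is set uniformly for the last stage of events in each field, which makes the time nodes of events easier to set.
In the news processing method provided by the embodiments of this application, the time nodes of an event divide its development into multiple stages according to some characteristic shared within each stage. By determining the time node of the event to which the news corresponds, the development stage to which the news belongs is known, and it can be determined whether the news belongs to the current development stage of the event; news that does not is determined to be expired. Setting a reasonable life cycle for news based on event time nodes, and promptly marking news outside the current stage as expired, avoids recommending news with low news value that no longer matches the event's current stage, and so improves the timeliness of the news recommended to users.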
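The two expiry rules of steps 107 and 108 can be sketched together, under the convention that each time node marks the start of a stage. The function names, the sample dates and the 24-hour default for the preset expiry duration are illustrative assumptions, not values from the application.

```python
from datetime import datetime, timedelta

def expiry_time(time_nodes, node_index, preset_duration):
    """Expiry time of a news item matched to time_nodes[node_index].

    Start or intermediate node: the news expires when the next node arrives.
    End node (last in the sequence): the news expires preset_duration after it.
    """
    if node_index < len(time_nodes) - 1:
        return time_nodes[node_index + 1]
    return time_nodes[node_index] + preset_duration

def is_expired(time_nodes, node_index, now, preset_duration=timedelta(hours=24)):
    """True once the expiry time has been reached (24 hours is an assumed default)."""
    return now >= expiry_time(time_nodes, node_index, preset_duration)

# Hypothetical event in which each node marks the start of a stage.
nodes = [
    datetime(2018, 9, 5, 12, 0),  # pre-match coverage starts
    datetime(2018, 9, 5, 19, 0),  # match starts
    datetime(2018, 9, 5, 21, 0),  # post-match stage starts (end time node)
]
```

Pre-match news matched to the first node therefore expires at kick-off, while post-match news matched to the end time node expires one preset duration after the match ends. The variants that expire at a later node with a preset interval would only change the index arithmetic.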
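Further, in one embodiment, acquiring the time nodes of the event in step 103 includes:
acquiring predefined time nodes of the event.
The time nodes of an event can be set by predefinition. For example, the common development characteristics of events in a given field category are analysed to divide such events into several development stages and to determine the split points between the stages; these split points serve as the predefined time nodes for events of that category. As another example, the common development characteristics of events with a given level of public attention are analysed to divide them into several attention stages and to determine the split points between those stages; these split points serve as the predefined time nodes for events with that level of attention. A split point may be an instant or a time period, and so, correspondingly, may a time node. When a split point is a time period, any instant within the period may, according to practical needs, be treated as belonging to both adjacent development stages or to only one of them.
In another embodiment, acquiring the time nodes of the event in step 103 includes:
acquiring related news of the event and clustering it, and determining the time nodes of the event from the time information contained in the related news of the different categories.
The time nodes of an event can be determined by cluster analysis of the event's related news. Clustering is the process of grouping data into classes or clusters such that objects in the same class or cluster are highly similar and objects in different classes or clusters are highly dissimilar. The time information contained in related news includes the publication time of the news, the time of occurrence of events mentioned in the news, and so on. In this embodiment, the time information is the publication time. During clustering, the related news is divided into categories according to its keywords; specifically, the vectors corresponding to the keywords of the related news can be input into a pre-trained classification model, which assigns the related news to different categories. For related news in the same category, the split points for that category are determined from the publication times of the news items in it; for example, the earliest and latest publication times of the related news in each category of the clustering result can be used as the split points of that category. The split points of the different categories are taken as the time nodes of the event. Clustering the related news removes the need to analyse the development characteristics of the event manually in advance in order to divide it into stages, and the clustering result usually also reflects random characteristics such as the volume of news in each development stage, so the approach is highly practicable.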
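The last step of this procedure, turning categorised related news into time nodes, can be sketched as follows. The sketch assumes the category labels have already been produced by the clustering or classification step; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def time_nodes_from_clusters(labelled_news):
    """Derive candidate time nodes from (category, publish_time) pairs.

    For every category, the earliest and latest publication times are taken
    as split points; the sorted, de-duplicated split points are returned.
    """
    times_by_category = defaultdict(list)
    for category, publish_time in labelled_news:
        times_by_category[category].append(publish_time)
    split_points = set()
    for times in times_by_category.values():
        split_points.add(min(times))
        split_points.add(max(times))
    return sorted(split_points)
```

Given preview articles published between 10:00 and 18:00 and live reports published between 19:00 and 20:30, the function returns those four instants in order as the event's candidate time nodes.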
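Further, in another embodiment, acquiring related news of the event and clustering it, and determining the time nodes of the event from the time information contained in the related news of the different categories, includes:
acquiring related news of the event and clustering it, and determining initial time nodes of the event from the time information of the related news of the different categories;
determining the time nodes of the event from the initial time nodes.
The time information contained in related news includes the publication time of the news, the time of occurrence of events mentioned in the news, and so on. Taking the publication time as the time information, the earliest and latest publication times of the related news in each category obtained by clustering are first taken as the split points of that category, and these split points serve as the initial time nodes of the event. The time nodes of the event can then be obtained from the initial time nodes either by adjusting them according to adjustment rules drawn up for particular needs, or by letting a user adjust them in a custom manner based on experience or other circumstances.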
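In one embodiment, referring to FIG. 5, step 105, determining the associated event of the news to be identified based on the similarity between the word vector of the news and the word vector of the event, and determining the time node corresponding to the news in the associated event, includes:
Step 1051: construct a first feature corresponding to the news to be identified based on the similarity between the word vector of the news and the word vector of the event.
The similarity between the word vector of the news and the word vector of the event can be determined from the matching probability between the word vectors of the news and those of the event, or by computing a similarity value between them. Correspondingly, the first feature is the similarity expressed by that matching probability or similarity value. As an illustrative embodiment, the similarity value between the word vectors of the news and those of the event is computed as follows:
[Formula 1: image PCTCN2018104156-appb-000001 in the original publication; not reproduced in this text]
In Formula 1, $f_e$ denotes the keywords of the event and $a_i$ the word vector of the i-th event keyword in $f_e$; $f_n$ denotes the keywords of the news to be identified and $b_j$ the word vector of the j-th news keyword in $f_n$; n is the number of news keywords and K is the number of event keywords. The word vectors of both the event keywords and the news keywords express the corresponding information numerically; they can be obtained in known ways, for example with the word2vec language model.
Based on the similarity between the word vector of the news and the word vector of the event, the first feature corresponding to the news to be identified is constructed as:
fea = [Similar]  (Formula 2)
In Formula 2, fea denotes the first feature corresponding to the news to be identified. fea characterizes the news relative to one particular event; when there are N events, there are N such fea.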
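Because Formula 1 survives only as an image, its exact form cannot be confirmed from this text. The sketch below assumes one plausible reading that is consistent with the symbols defined for it: the mean cosine similarity over all K × n pairs of event-keyword and news-keyword vectors. It then builds one single-component feature fea = [Similar] per event, as in Formula 2. All names and sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar(event_vectors, news_vectors):
    """Assumed form of Similar: mean cosine over all (a_i, b_j) pairs.

    Both lists must be non-empty.
    """
    total = sum(cosine(a, b) for a in event_vectors for b in news_vectors)
    return total / (len(event_vectors) * len(news_vectors))

def first_features(events, news_vectors):
    """One feature fea = [Similar] per event, keyed by event name."""
    return {name: [similar(vectors, news_vectors)] for name, vectors in events.items()}
```

With N events this yields N features, one per event, which can then be passed to the first classification model.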
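Step 1052: input the first feature as a sample feature into a first classification model to obtain the confidence that each event is the associated event of the news to be identified.
The first classification model may be a softmax regression model or a support vector machine (SVM) model. Denoting the sample feature by x, the confidence that each event is the associated event of the news, obtained by inputting the first feature as a sample feature into the first classification model, is expressed as:
[Formula 3: image PCTCN2018104156-appb-000002 in the original publication; not reproduced in this text]
In Formula 3, $h_\theta(x)$ denotes the confidence, $\theta$ the model parameters obtained by training, and x the sample feature.
Step 1053: determine the event whose confidence satisfies a condition to be the associated event of the news to be identified.
The condition on the confidence is expressed as:
[Formula 4: image PCTCN2018104156-appb-000003 in the original publication; not reproduced in this text]
In Formula 4, $J(\theta)$ denotes the cost function, $x^{(i)}$ the input, $y^{(i)}$ the output, and m the number of sample features. The cost function is minimized with an iterative optimization algorithm such as gradient descent, which determines the condition the confidence must satisfy and yields a usable classification model, that is, the model parameters. The first feature of the news to be identified is then input into the first classification model to obtain the probability (confidence) that the news is associated news of an event, which is also the probability that the event is the associated event of the news. The associated event of the news is determined from the confidence, and the time node corresponding to the news in the associated event is then determined.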
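Formulas 3 and 4 are available only as images, so the following is a sketch under an explicit assumption: the symbols $h_\theta(x)$, $J(\theta)$, $x^{(i)}$, $y^{(i)}$ and m match the usual logistic-regression hypothesis and cross-entropy cost (the two-class case of softmax regression), trained by batch gradient descent. The training data, learning rate and iteration count are made up for illustration.

```python
import math

def hypothesis(theta, x):
    """Assumed hypothesis h_theta(x) = 1 / (1 + exp(-theta . x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, xs, ys):
    """Assumed cost J(theta): cross-entropy averaged over the m samples."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = hypothesis(theta, x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

def train(xs, ys, learning_rate=0.5, iterations=2000):
    """Batch gradient descent on J(theta), starting from theta = 0."""
    theta = [0.0] * len(xs[0])
    m = len(xs)
    for _ in range(iterations):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            error = hypothesis(theta, x) - y
            for k, xk in enumerate(x):
                grad[k] += error * xk / m
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical samples: x = [1 (bias term), Similar]; y = 1 if the news belongs to the event.
xs = [[1.0, 0.9], [1.0, 0.8], [1.0, 0.2], [1.0, 0.1]]
ys = [1, 1, 0, 0]
theta = train(xs, ys)
```

After training, `hypothesis(theta, [1.0, s])` gives the confidence that a news item with similarity `s` is associated with the event; the "condition" on the confidence could then be, for example, a threshold or taking the event with the highest confidence.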
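In one embodiment, step 1051, constructing the first feature corresponding to the news to be identified based on the similarity between the word vector of the news and the word vector of the event, includes:
combining the following feature components to obtain the first feature corresponding to the news to be identified: the similarity between the word vector of the news and the word vector of the event; and the relationship between the time of the news and the time nodes of the event. Here the time of the news is its publication time, and the relationship between the time of the news and the time nodes of the event is the relationship between the publication time and each time node of the event.
The time of the news to be identified includes its publication time, the time of occurrence of the event content mentioned in it, and so on. Taking the publication time as the time of the news, the relationship between the time of the news and the time nodes of the event may be the difference between the publication time of the news and the time of occurrence of the event. The first feature corresponding to the news is then constructed as:
fea = [Similar, |newtime - eventime|]    (Formula 5)
In Formula 5, fea denotes the first feature corresponding to the news to be identified, Similar the similarity between the keywords of the news and the keywords of the event, newtime the publication time of the news, and eventime the time node of the event. In this example eventime may be the time of occurrence of the event, which may be the time of the event's first time node. A further feature component may be added when constructing the first feature: the mean of the word vectors of the news. The first feature is then built from the mean of the word vectors of the news, the similarity between the word vector of the news and the word vector of the event, and the relationship between the time of the news and the time nodes of the event. In another embodiment, determining the time node corresponding to the news in the associated event in step 105 includes:
Step 1054: construct a second feature corresponding to the news to be identified based on the relationship between the time of the news and the time nodes of the event. Here the time of the news is its publication time, and the relationship between the time of the news and the time nodes of the event is the relationship between the publication time and each time node of the event.
The time of the news to be identified mainly includes its publication time, the time of occurrence of the matters mentioned in it, and so on. The relationship between the time of the news and the time nodes of the event may be the difference between the time of the news and a time node of the event, or a value assigned according to the size of that difference. In this embodiment the time of the news is its publication time and the relationship is the difference; the time vector of the news is constructed as:
timefea = [newtime - e_time_0, ..., newtime - e_time_i, ..., newtime - e_time_n]    (Formula 6)
In Formula 6, timefea denotes the time vector of the news to be identified, e_time_i the i-th time node of the event, and newtime the publication time of the news.
timefea may be used as the second feature. Alternatively, $[\frac{1}{M}\sum_{i=1}^{M} W_i,\ \mathrm{timefea}]$ may be used as the second feature, where $W_i$ is the word vector of the i-th keyword of the news to be identified and M is the number of keywords in the news.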
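The time vector of Formula 6 and the optional mean-word-vector component can be sketched as below. Expressing the differences in hours is an assumption made here for readability (the text only says "difference"), and the function names are hypothetical.

```python
from datetime import datetime

def time_feature(news_time, event_time_nodes):
    """timefea: difference (in hours, an assumed unit) between the news time and each node."""
    return [(news_time - node).total_seconds() / 3600.0 for node in event_time_nodes]

def mean_vector(word_vectors):
    """Component-wise mean of the news keyword word vectors W_1..W_M."""
    count = len(word_vectors)
    return [sum(column) / count for column in zip(*word_vectors)]

def second_feature(news_time, event_time_nodes, word_vectors=None):
    """Either timefea alone, or [mean word vector, timefea] when vectors are supplied."""
    timefea = time_feature(news_time, event_time_nodes)
    if word_vectors is None:
        return timefea
    return mean_vector(word_vectors) + timefea
```

A news item published at 20:00 for a match with nodes at 19:00 and 21:00 gets the time vector `[1.0, -1.0]`: one hour after the first node and one hour before the second, which is the signal the second classification model can use to place the news in the in-match stage.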
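Step 1055: input the second feature into a second classification model to obtain the confidence that the news to be identified corresponds to each time node of the associated event.
The second classification model may be a softmax regression model or an SVM model. Inputting the second feature into the second classification model means inputting it as a second sample feature. Denoting the sample feature by x, the confidence that the news corresponds to each time node of the associated event is expressed as:
[Formula 7: image PCTCN2018104156-appb-000005 in the original publication; not reproduced in this text]
In Formula 7, $h_\theta(x)$ denotes the confidence, $\theta$ the trained model parameters, and x the sample feature.
Step 1056: determine the time node whose confidence satisfies the condition to be the time node corresponding to the news to be identified.
[Formula 8: image PCTCN2018104156-appb-000006 in the original publication; not reproduced in this text]
In Formula 8, $J(\theta)$ denotes the cost function, $x^{(i)}$ the input, $y^{(i)}$ the output, and m the number of sample features. The cost function is minimized with an iterative optimization algorithm such as gradient descent to obtain a usable classification model, that is, the parameters of the second classification model. The second feature is then input into the second classification model to compute the probability that the news corresponds to each time node of the event, and the time node corresponding to the news is determined from these probabilities. The cost function serves to determine the model parameters, which are obtained by training: samples are fed into Formula 7 to obtain their confidences, expressed in terms of the model parameters; these confidences are substituted into Formula 8, and the cost function is solved to determine the parameters.
In one embodiment, step 1054, constructing the second feature corresponding to the news to be identified based on the relationship between the time of the news and the time nodes of the event, includes:
combining the following feature components to obtain the second feature corresponding to the news to be identified: the mean of the word vectors of the news; and the relationship between the time of the news and the different time nodes of the associated event.
The mean of the word vectors of the news to be identified refers to the mean of the word vectors corresponding to the time nodes of the event associated with the news. The relationship between the time of the news and the time nodes of the event may be the difference between the time of the news and a time node of the event, or a value assigned according to the size of that difference. In this embodiment the relationship is the difference, and the second feature of the news is constructed as:
$\mathrm{fea} = [\frac{1}{M}\sum_{i=1}^{M} W_i,\ \mathrm{timefea}]$    (Formula 9)
In Formula 9, fea denotes the second feature, M the number of time nodes of the associated event, $W_i$ the word vector of the i-th word of the news to be identified, and timefea the time vector of the news that characterizes the relationship between the time of the news and the time nodes of the event, such as the difference-based time vector shown in Formula 6.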
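In yet another embodiment, as shown in FIG. 6, determining the associated event of the news to be identified based on the similarity between the word vector of the news and the word vector of the event, and determining the time node corresponding to the news in the associated event, in step 105 may be implemented in another way, in which the time node corresponding to the news is determined directly by a third classification model. Multiple events correspond to multiple time nodes; the third classification model determines which of these time nodes the news corresponds to, and the event of the determined time node is taken as the associated event of the news. This includes the following steps:
Step 1057: construct a third feature corresponding to the news to be identified based on the similarity between the word vector of the news and the word vector of the event and on the relationship between the time of the news and the time nodes of the event.
In a specific embodiment, step 1057 includes combining the following feature components to obtain the third feature: the similarity between the word vector of the news and the word vector of the event; the relationship between the time of the news and the occurrence time node of the event; the mean of the word vectors of the news; and the relationship between the time of the news and the different time nodes of the associated event. These feature components may be represented as in the preceding embodiments: the similarity as in Formula 2; the combination of the similarity with the relationship between the time of the news and the occurrence time node of the event as in Formula 5; the relationship between the time of the news and the time nodes of the event as in Formula 6; and the combination of that relationship with the mean of the word vectors of the news as in Formula 9. The third feature can therefore be formed by combining the feature component given by either Formula 2 or Formula 5 with the feature component given by either Formula 6 or Formula 9.
Step 1058: input the third feature into the third classification model to obtain the confidence that the time of the news to be identified corresponds to each time node of each event.
The third classification model may be a softmax regression model or an SVM model. Inputting the third feature into the third classification model means inputting it as a third sample feature. Denoting the sample feature by x, the confidence that the time of the news corresponds to each time node of each event is expressed as:
[Formula 10: image PCTCN2018104156-appb-000008 in the original publication; not reproduced in this text]
In Formula 10, $h_\theta(x)$ denotes the confidence, $\theta$ the trained model parameters, and x the sample feature formed from the third feature. To train the third model, the third features of news samples and the time nodes corresponding to those news items are used as the inputs and outputs of the model respectively. The time nodes of the multiple news items form a time node set, and each time node carries the identifier of its event; the third classification model determines which time node in the set the news to be identified corresponds to, and the event of the determined time node is taken as the associated event of the news.
Step 1059: determine the time node whose confidence satisfies the condition to be the time node corresponding to the news to be identified, and take the event of the determined time node as the associated event of the news.
[Formula 11: image PCTCN2018104156-appb-000009 in the original publication; not reproduced in this text]
In Formula 11, $J(\theta)$ denotes the cost function, $x^{(i)}$ the input, $y^{(i)}$ the output, and m the number of sample features. The cost function is minimized with an iterative optimization algorithm such as gradient descent to determine the parameters of the third classification model and obtain a usable classifier. The third feature of the news to be identified is input into the third classification model, which outputs the probability that the news corresponds to each time node; the time node whose confidence satisfies the condition is determined to be the time node corresponding to the news, and the event of that time node is taken as the associated event of the news.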
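The final selection in step 1059 can be sketched as follows, assuming the third classification model has already produced one raw score per time node in the pooled time node set. The softmax normalisation, the 0.5 threshold used as the "condition", and the `(event_id, node_name)` labelling are illustrative assumptions.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_time_node(labelled_nodes, scores, threshold=0.5):
    """Return (event_id, node_name, confidence) for the most probable time node,
    or None if no node reaches the threshold."""
    confidences = softmax(scores)
    best = max(range(len(confidences)), key=confidences.__getitem__)
    if confidences[best] < threshold:
        return None
    event_id, node_name = labelled_nodes[best]
    return event_id, node_name, confidences[best]
```

Because every time node carries the identifier of its event, choosing the node also fixes the associated event: with nodes `[("A", "A1"), ("A", "A2"), ("B", "B1")]` and a clearly highest score for the second entry, the news is assigned to event A at node A2.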
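In the news processing method provided by the embodiments of this application, the time nodes of an event divide it into development stages, and the life cycle of news associated with the event is aligned with those stages. Judging whether news is associated with an event, and whether the time of the news corresponds to the event's current development stage, thus becomes more principled and precise, and the expiry time of news computed this way gives good results.
The news processing method can be applied to any news reading application through which users obtain and read news, such as Tiantian Kuaibao (天天快报) and Tencent News. Take as an example the application scenario of the news processing system in FIG. 1 in which the terminal 100 has the Tiantian Kuaibao client installed. FIG. 7 is a schematic diagram of an application scenario in which the server performs news processing for the news push service of a news reading application: the server 200 runs the news processing method provided by the embodiments of this application, identifies news belonging to the associated event, and pushes news in the event's current development stage to the terminal 100. FIG. 8 is a schematic diagram of the corresponding display on the terminal: with the news reading client installed, the user can read the news matching the event's current development stage that the server pushes after determining, through the news processing method, the associated event of the news and the corresponding time node; the user views it in the interface of the news reading application on the terminal 100. Referring to FIG. 9, taking sports match event A and news to be identified B as an example, one specific way of applying the news processing method to determine the expiry time of news is as follows:
S1: obtain the time nodes of event A by clustering the related news of the event. Specifically, the related news of sports match event A is clustered to obtain four time nodes A1, A2, A3 and A4, which divide the event into pre-match (time nodes A1 to A2), in-match (time nodes A2 to A3) and post-match (time nodes A3 to A4).
S2: acquire the keywords of news B and the keywords of event A, and determine from the similarity between them whether news B is associated news of event A. Specifically, structured information is extracted from the title, body and comments of news B as its keywords; the similarity between these keywords and the predefined or pre-extracted keywords of event A is computed; sample features are constructed from the similarity; and a classification model classifies them to judge whether news B is associated news of match event A. Because keyword extraction can take into account the full text of the news and even its comments, and because the similarity covers each of the multiple news keywords against the event keywords, a more accurate judgment is obtained. For example, news that mentions some match content but is not actually a report on the sports match event can be identified effectively and withdrawn. The relevance judgment between news and events is therefore more accurate: in the timeliness computation for news related to sports match events, the relevance judgment between news and matches can reach a recall rate of 85% and an accuracy of 98%.
S3: when news B is associated news of event A, that is, the associated event of news B is event A, determine from the publication time of news B the time node $A_n$ of event A to which news B corresponds. Specifically, sample features are constructed from the publication time of the news and the time nodes of the event, and a classification model classifies them to judge which time node of match event A news B corresponds to: if news B corresponds to the pre-match stage, the corresponding time node in the associated event is A1; if it corresponds to the in-match stage, the corresponding time node is A2; if it corresponds to the post-match stage, the corresponding time node is A3.
S4: according to the corresponding time node $A_n$, determine the specific time node associated with expiry for news B as its expiry time node; push news B to the terminal 100 before its expiry time node arrives, and recall it when that node arrives. In a specific embodiment, the expiry time node of news B is the time node $A_{n+1}$ following the corresponding time node $A_n$; that is, $A_{n+1}$ is determined as the expiry time of news B. Any two adjacent time nodes ($A_n$, $A_{n+1}$) mark the start and end of one development stage of event A. By determining the stage to which the news belongs, associated news from the previous stage can be expired as soon as the current stage begins, which ensures the timeliness of the news. Specifically, pre-match news is pushed to users until the in-match stage of match event A begins and is recalled when time node A2 arrives; in-match news continues to be pushed until the post-match stage begins and is recalled when time node A3 arrives; post-match news is recalled at time node A4. With the news processing method of this embodiment, the identification accuracy can reach 95% for pre-match news, 90% for in-match news and 97% for post-match news.
By setting a reasonable life cycle for news, the above news processing method improves the timeliness of news recommendation and thereby the competitiveness of news reading applications.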
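Referring to FIG. 10, in one embodiment a news processing apparatus is provided, comprising a first acquisition module 11, a second acquisition module 13 and a determining module 15. The first acquisition module 11 is configured to acquire a word vector of news to be identified. The second acquisition module 13 is configured to acquire the word vector corresponding to an event and the time nodes of the event. The determining module 15 is configured to determine, based on the similarity between the word vector of the news to be identified and the word vector of the event, the associated event of the news, to determine the time node corresponding to the news in the associated event, and to determine, according to the time node, whether the news is valid.
Referring to FIG. 10, the first acquisition module 11 includes a keyword extraction unit 111 and a word vector unit 113. The keyword extraction unit is configured to extract keywords based on the news to be identified. The word vector unit is configured to map the extracted keywords to the word vector space to obtain the word vectors corresponding to the keywords. The keyword extraction unit is specifically configured to extract keywords corresponding to the news to be identified from at least one of the following: the news to be identified; specific associated information of the news to be identified.
The second acquisition module 13 includes a predefinition unit 131 or a clustering unit 133. The predefinition unit 131 is configured to acquire predefined time nodes of the event. The clustering unit 133 is configured to acquire related news of the event, cluster it, and determine the time nodes of the event from the time information contained in the related news of the different categories.
The apparatus further includes an expiry determining module 17, configured to determine that the news to be identified has expired when the type of the corresponding time node is an end time node and the preset expiry duration measured from the end time node has elapsed.
In another embodiment, the expiry determining module 17 is configured to determine that the news to be identified has expired when the corresponding time node is a specific time node associated with expiry.
The determining module 15 includes a first feature unit 151, a first classification unit 152 and an event determining unit 153. The first feature unit 151 is configured to construct a first feature corresponding to the news to be identified based on the similarity between the word vector of the news and the word vector of the event. The first classification unit 152 is configured to input the first feature as a sample feature into a first classification model to obtain the confidence that each event is the associated event of the news. The event determining unit 153 is configured to determine the event whose confidence satisfies a condition to be the associated event of the news. The first feature unit 151 is specifically configured to combine the following feature components to obtain the first feature: the similarity between the word vector of the news and the word vector of the event; and the relationship between the time of the news and the time nodes of the event.
Further, the determining module also includes a second feature unit 154, a second classification unit 155 and a time determining unit 156. The second feature unit 154 is configured to construct a second feature corresponding to the news to be identified based on the relationship between the time of the news and the time nodes of the event. The second classification unit 155 is configured to input the second feature into a second classification model to obtain the confidence that the news corresponds to each time node of the associated event. The time determining unit 156 is configured to determine the time node whose confidence satisfies a condition to be the time node corresponding to the news. The second feature unit 154 is specifically configured to combine the following feature components to obtain the second feature: the mean of the word vectors of the news; and the relationship between the time of the news and the different time nodes of the associated event.
In another embodiment, referring to FIG. 11, the determining module 15 includes a third feature unit 157, a third classification unit 158 and a determining unit 159. The third feature unit 157 is configured to construct a third feature corresponding to the news to be identified based on the similarity between the word vector of the news and the word vector of the event and on the relationship between the time of the news and the time nodes of the event. The third classification unit 158 is configured to input the third feature into a third classification model to obtain the confidence that the time of the news corresponds to each time node of each event. The determining unit 159 is configured to determine the time node whose confidence satisfies a condition to be the time node corresponding to the news, and to take the event of the determined time node as the associated event of the news.
In the news processing apparatus provided by the embodiments of this application, the time nodes of an event divide its development into multiple stages according to some characteristic shared within each period. By determining the time node of the event to which the news corresponds, the development stage to which the news belongs is known, and it can be determined whether the news belongs to the current development stage of the event; news that does not is determined to be expired. Setting a reasonable life cycle for news based on event time nodes, and promptly marking news outside the current stage as expired, avoids recommending news with low news value that no longer matches the event's current stage, and so improves the timeliness of the news recommended to users.
It should be noted that, when the news processing apparatus provided by the above embodiments processes news, the division into the program modules described above is only an example. In practice, the processing may be assigned to different program modules as required; that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the news processing apparatus and the news processing method of the above embodiments belong to the same concept; for the specific implementation of the apparatus, see the method embodiments, which are not repeated here.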
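An embodiment of this application further provides a computer device comprising a processor and a memory for storing a computer program runnable on the processor, wherein the processor, when running the computer program, performs a news processing method comprising: acquiring a word vector of news to be identified; acquiring a word vector of an event and time nodes of the event; determining, based on the similarity between the word vector of the news to be identified and the word vector of the event, an associated event of the news to be identified, and determining the time node corresponding to the news to be identified in the associated event; and determining, according to the time node, whether the news is valid.
The processor, when running the computer program, further performs: acquiring the word vector of the news to be identified, which includes: extracting keywords based on the news to be identified; and mapping the extracted keywords to a word vector space to obtain the word vectors corresponding to the keywords.
The processor, when running the computer program, further performs: extracting keywords based on the news to be identified, which includes extracting keywords corresponding to the news to be identified from at least one of the following: the news to be identified; specific associated information of the news to be identified.
The processor, when running the computer program, further performs: acquiring the time nodes of the event, which includes: acquiring predefined time nodes of the event; or acquiring related news of the event, clustering it, and determining the time nodes of the event from the time information contained in the related news of the different categories.
The processor, when running the computer program, further performs: determining the associated event of the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event, which includes: constructing a first feature corresponding to the news to be identified based on that similarity; inputting the first feature as a sample feature into a first classification model to obtain the confidence that each event is the associated event of the news to be identified; and determining the event whose confidence satisfies a condition to be the associated event of the news to be identified.
The processor, when running the computer program, further performs: constructing the first feature corresponding to the news to be identified, which includes combining the following feature components to obtain the first feature: the similarity between the word vector of the news to be identified and the word vector of the event; and the relationship between the time of the news to be identified and the time nodes of the event.
The processor, when running the computer program, further performs: determining the time node corresponding to the news to be identified in the associated event, which includes: constructing a second feature corresponding to the news to be identified based on the relationship between the time of the news to be identified and the time nodes of the event; inputting the second feature into a second classification model to obtain the confidence that the news to be identified corresponds to each time node of the associated event; and determining the time node whose confidence satisfies a condition to be the time node corresponding to the news to be identified.
The processor, when running the computer program, further performs: constructing the second feature corresponding to the news to be identified, which includes combining the following feature components to obtain the second feature: the mean of the word vectors of the news to be identified; and the relationship between the time of the news to be identified and the different time nodes of the associated event.
The processor, when running the computer program, further performs: determining the associated event of the news to be identified and the time node corresponding to the news to be identified in the associated event, which includes: constructing a third feature corresponding to the news to be identified based on the similarity between the word vector of the news to be identified and the word vector of the event and on the relationship between the time of the news to be identified and the time nodes of the event; inputting the third feature into a third classification model to obtain the confidence that the time of the news to be identified corresponds to each time node of each event; and determining the time node whose confidence satisfies a condition to be the time node corresponding to the news to be identified, and taking the event of the determined time node as the associated event of the news to be identified.
The processor, when running the computer program, further performs: the news processing method further includes: when the type of the corresponding time node is an end time node and the preset expiry duration measured from the end time node has elapsed, determining that the news to be identified has expired.
The processor, when running the computer program, further performs: the news processing method further includes: when the corresponding time node is a specific time node associated with expiry, determining that the news to be identified has expired.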
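FIG. 12 is a schematic diagram of the internal structure of a computer device. The computer device may be the server 200 shown in FIG. 1 and includes a processor, an internal memory, a network interface and a non-volatile storage medium connected through a system bus. The processor implements the computing function and controls the operation of the server, and is configured to execute the news processing method provided by the embodiments of this application. The non-volatile storage medium stores an operating system, a database, and a news processing apparatus for implementing the news processing method provided by the embodiments of this application. The network interface is used to connect to the terminal.
The memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk storage or a tape storage. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and Direct Rambus random access memory (DRRAM). The memory described in the embodiments of this application is intended to include, without being limited to, these and any other suitable types of memory.
The memory is used to store various types of data to support the operation of the news processing apparatus. Examples of such data include: any computer program for operating on the news processing apparatus, such as an operating system and application programs; news to be identified, word vectors of news to be identified, time nodes of events, word vectors of time, and so on. The operating system contains various system programs, such as a framework layer, a core library layer and a driver layer, for implementing various basic services and handling hardware-based tasks. The application programs may include various applications, such as a news application, a media player and a browser, for implementing various application services. A program implementing the method of the embodiments of this application may be contained in an application program.
The network interface is used for wired or wireless communication between the news processing apparatus and other devices. The news processing apparatus can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the network interface receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the network interface further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology or other technologies.
The news processing method disclosed in the above embodiments of this application may be applied to a processor or implemented by a processor. There may be one or more processors, which complete all or part of the steps of the above method. The processor may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a digital signal processor (DSP), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like. The processor may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of this application. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium, and the storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.
In an exemplary embodiment, the news processing apparatus may be implemented by one or more application specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), general purpose processors, controllers, micro controller units (MCUs), microprocessors, or other electronic components, for performing the foregoing method.
The above are only specific implementations of this application, but the scope of protection of this application is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the scope of protection of this application. The scope of protection of this application shall therefore be that of the claims.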

Claims (16)

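  1. A news processing method, performed by a server, wherein the method comprises:
    obtaining word vectors of news to be identified;
    obtaining word vectors of an event and time nodes of the event;
    determining an associated event of the news to be identified based on a similarity between the word vectors of the news to be identified and the word vectors of the event; and
    determining a time node, in the associated event, to which the news to be identified corresponds;
    determining, according to the time node, whether the news to be identified is valid.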
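  2. The news processing method according to claim 1, wherein the obtaining word vectors of news to be identified comprises:
    extracting keywords based on the news to be identified;
    mapping the extracted keywords to a word vector space to obtain word vectors corresponding to the keywords.
  3. The news processing method according to claim 2, wherein the extracting keywords based on the news to be identified comprises:
    extracting keywords corresponding to the news to be identified from at least one of the following:
    the news to be identified; specific associated information of the news to be identified.
  4. The news processing method according to claim 1, wherein the obtaining time nodes of the event comprises:
    obtaining predefined time nodes of the event.
  5. The news processing method according to claim 1, wherein the obtaining time nodes of the event comprises: obtaining related news of the event and performing clustering on the related news, and determining the time nodes of the event according to time information contained in the related news of the different categories.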
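  6. The news processing method according to claim 1, wherein the determining an associated event of the news to be identified based on a similarity between the word vectors of the news to be identified and the word vectors of the event comprises:
    constructing a first feature corresponding to the news to be identified based on the similarity between the word vectors of the news to be identified and the word vectors of the event;
    inputting the first feature as a sample feature into a first classification model to obtain confidence levels that different ones of the events are the associated event of the news to be identified;
    determining an event whose confidence level satisfies a condition as the associated event of the news to be identified.
  7. The news processing method according to claim 6, wherein the constructing a first feature corresponding to the news to be identified based on the similarity between the word vectors of the news to be identified and the word vectors of the event comprises:
    combining the following feature components to obtain the first feature corresponding to the news to be identified:
    the similarity between the word vectors of the news to be identified and the word vectors of the event;
    a relationship between the time of the news to be identified and the time nodes of the event.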
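  8. The news processing method according to claim 1, wherein the determining a time node, in the associated event, to which the news to be identified corresponds comprises:
    constructing a second feature corresponding to the news to be identified based on a relationship between the time of the news to be identified and the time nodes of the event; and
    inputting the second feature into a second classification model to obtain confidence levels that the news to be identified corresponds to different time nodes of the associated event;
    determining a time node whose confidence level satisfies a condition as the time node to which the news to be identified corresponds.
  9. The news processing method according to claim 8, wherein the constructing a second feature corresponding to the news to be identified based on a relationship between the time of the news to be identified and the time nodes of the event comprises:
    combining the following feature components to obtain the second feature corresponding to the news to be identified:
    a mean of the word vectors of the news to be identified;
    a relationship between the time of the news to be identified and the different time nodes of the associated event.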
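  10. The news processing method according to claim 1, wherein the determining an associated event of the news to be identified based on a similarity between the word vectors of the news to be identified and the word vectors of the event, and determining a time node, in the associated event, to which the news to be identified corresponds, comprises:
    constructing a third feature corresponding to the news to be identified based on the similarity between the word vectors of the news to be identified and the word vectors of the event, and on a relationship between the time of the news to be identified and the time nodes of the event;
    inputting the third feature into a third classification model to obtain confidence levels that the time of the news to be identified corresponds to different time nodes of different events;
    determining a time node whose confidence level satisfies a condition as the time node to which the news to be identified corresponds, and taking the event corresponding to the determined time node as the associated event of the news to be identified.
  11. The news processing method according to claim 1, wherein the determining, according to the time node, whether the news to be identified is valid comprises:
    when the type of the corresponding time node is an end time node and a preset expiry duration relative to the end time node has elapsed, determining that the news to be identified has expired.
  12. The news processing method according to claim 1, wherein the determining, according to the time node, whether the news to be identified is valid comprises:
    when the corresponding time node is a specific time node associated with expiry, determining that the news to be identified has expired.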
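  13. A news processing apparatus, wherein the apparatus comprises:
    a first obtaining module, configured to obtain word vectors of news to be identified;
    a second obtaining module, configured to obtain word vectors corresponding to an event and time nodes of the event;
    a determining module, configured to determine an associated event of the news to be identified based on a similarity between the word vectors of the news to be identified and the word vectors of the event, and
    to determine a time node, in the associated event, to which the news to be identified corresponds, and determine, according to the time node, whether the news to be identified is valid.
  14. The news processing apparatus according to claim 13, wherein the determining module comprises:
    a first feature unit, configured to construct a first feature corresponding to the news to be identified based on the similarity between the word vectors of the news to be identified and the word vectors of the event;
    a first classification unit, configured to input the first feature as a sample feature into a first classification model to obtain confidence levels that different ones of the events are the associated event of the news to be identified;
    an event determining unit, configured to determine an event whose confidence level satisfies a condition as the associated event of the news to be identified;
    a second feature unit, configured to construct a second feature corresponding to the news to be identified based on a relationship between the time of the news to be identified and the time nodes of the event; and
    a second classification unit, configured to input the second feature into a second classification model to obtain confidence levels that the news to be identified corresponds to different time nodes of the associated event;
    a time determining unit, configured to determine a time node whose confidence level satisfies a condition as the time node to which the news to be identified corresponds.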
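  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the news processing method according to any one of claims 1 to 12.
  16. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the processor, when executing the program, implements the news processing method according to any one of claims 1 to 12.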
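
The flow recited in claims 1, 2 and 11 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: a cosine-similarity threshold stands in for the first classification model, a "latest node already reached" rule stands in for the second classification model, each event is represented by a single aggregated vector, and every name (`TimeNode`, `Event`, `mean_vector`, `associated_event`, `matching_node`, `is_valid`), the 0.5 threshold, and the one-day expiry duration are hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Sequence


@dataclass
class TimeNode:
    label: str              # illustrative label such as "start" or "end"
    time: datetime
    is_end: bool = False    # marks an end time node (claim 11)


@dataclass
class Event:
    name: str
    vector: List[float]     # one aggregated word vector per event (assumption)
    nodes: List[TimeNode]


def mean_vector(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the keyword word vectors of one news item into a single vector."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    return [sum(column) / len(word_vectors) for column in zip(*word_vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def associated_event(news_vec: Sequence[float], events: Sequence[Event],
                     threshold: float = 0.5) -> Optional[Event]:
    """Pick the most similar event, or None if no similarity reaches the threshold.

    Stand-in for the first classification model: similarity is used directly
    as the confidence level.
    """
    scored = [(cosine(news_vec, event.vector), event) for event in events]
    if not scored:
        return None
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= threshold else None


def matching_node(news_time: datetime, event: Event) -> Optional[TimeNode]:
    """Return the latest time node not later than the news time, if any.

    Stand-in for the second classification model.
    """
    reached = [node for node in event.nodes if node.time <= news_time]
    return max(reached, key=lambda node: node.time) if reached else None


def is_valid(node: Optional[TimeNode], now: datetime,
             expiry: timedelta = timedelta(days=1)) -> bool:
    """News expires once the expiry duration after an end time node has elapsed.

    News without a matched node is treated as valid (assumption).
    """
    if node is not None and node.is_end and now >= node.time + expiry:
        return False
    return True
```

A caller would average the keyword vectors of a news item with `mean_vector`, pass the result to `associated_event`, map the news time onto a node with `matching_node`, and finally call `is_valid` with the current time. Replacing the two stand-in rules with trained classifiers would follow the structure of claims 6 and 8 more closely.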
PCT/CN2018/104156 2017-09-05 2018-09-05 News processing method, apparatus, storage medium and computer device WO2019047849A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710791715.7 2017-09-05
CN201710791715.7A CN110020104B (zh) 2017-09-05 2017-09-05 News processing method, apparatus, storage medium and computer device

Publications (1)

Publication Number Publication Date
WO2019047849A1 (zh) 2019-03-14

Family

ID=65634737

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/104156 WO2019047849A1 (zh) 2017-09-05 2018-09-05 News processing method, apparatus, storage medium and computer device

Country Status (2)

Country Link
CN (1) CN110020104B (zh)
WO (1) WO2019047849A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
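CN110990705A (zh) * 2019-12-06 2020-04-10 腾讯科技(深圳)有限公司 News processing method, apparatus, device, and medium
CN111125429A (zh) * 2019-12-20 2020-05-08 腾讯科技(深圳)有限公司 Video push method, apparatus, and computer-readable storage medium
CN111125520A (zh) * 2019-12-11 2020-05-08 东南大学 Event line extraction method for news text based on a deep clustering model
CN112948528A (zh) * 2021-03-02 2021-06-11 北京秒针人工智能科技有限公司 Keyword-based data classification method and system
CN113407714A (zh) * 2020-11-04 2021-09-17 腾讯科技(深圳)有限公司 Timeliness-based data processing method, apparatus, electronic device, and storage medium
CN115048486A (zh) * 2022-05-24 2022-09-13 支付宝(杭州)信息技术有限公司 Event extraction method, apparatus, computer program product, storage medium, and device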

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
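CN110704603B (zh) * 2019-09-12 2022-09-09 武汉灯塔之光科技有限公司 Method and apparatus for discovering current hot events through news information
CN110889024A (zh) * 2019-10-25 2020-03-17 武汉灯塔之光科技有限公司 Method and apparatus for computing stocks associated with news information
CN110888877A (zh) * 2019-11-13 2020-03-17 深圳市超视智慧科技有限公司 Event information display method, apparatus, computing device, and storage medium
CN112257734B (zh) * 2019-11-15 2024-08-20 北京沃东天骏信息技术有限公司 Information processing method and apparatus, and storage medium
CN110929018B (zh) * 2019-12-04 2023-03-21 Oppo(重庆)智能科技有限公司 Text processing method, apparatus, storage medium, and electronic device
CN111324748B (zh) * 2020-02-28 2023-08-04 北京百度网讯科技有限公司 Method, apparatus, electronic device, and storage medium for generating sports match reports
CN113722593B (zh) * 2021-08-31 2024-01-16 北京百度网讯科技有限公司 Event data processing method, apparatus, electronic device, and medium
CN114185922A (zh) * 2021-12-01 2022-03-15 维沃移动通信有限公司 Information detection method, information detection apparatus, electronic device, and readable storage medium
CN116340639B (zh) * 2023-03-31 2023-12-12 北京百度网讯科技有限公司 News recall method, apparatus, device, and storage medium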

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012092150A2 (en) * 2010-12-30 2012-07-05 Pelco Inc. Inference engine for video analytics metadata-based event detection and forensic search
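CN104915446A (zh) * 2015-06-29 2015-09-16 华南理工大学 Method and system for automatically extracting event evolution relationships based on news
CN105468669A (zh) * 2015-10-13 2016-04-06 中国科学院信息工程研究所 Adaptive microblog topic tracking method incorporating user relationships
CN106886567A (zh) * 2017-01-12 2017-06-23 北京航空航天大学 Method and apparatus for detecting microblog emergency events based on semantic expansion
CN107122423A (zh) * 2017-04-06 2017-09-01 深圳Tcl数字技术有限公司 Film and television recommendation method and apparatus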

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8661025B2 (en) * 2008-11-21 2014-02-25 Stubhub, Inc. System and methods for third-party access to a network-based system for providing location-based upcoming event information
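CN103324718B (zh) * 2013-06-25 2016-08-10 百度在线网络技术(北京)有限公司 Method and system for mining topic threads from massive search logs
CN103473263B (zh) * 2013-07-18 2017-02-08 大连理工大学 Visual presentation method for the evolution process of news events
CN104768131B (zh) * 2015-03-12 2018-10-19 中国科学技术大学苏州研究院 Relay node warning message forwarding method based on vehicle-to-vehicle communication
CN107016556B (zh) * 2016-01-27 2021-02-05 创新先进技术有限公司 Data processing method and apparatus
CN105787095B (zh) * 2016-03-16 2019-09-27 广州索答信息科技有限公司 Method and apparatus for automatically generating Internet news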

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
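CN110990705A (zh) * 2019-12-06 2020-04-10 腾讯科技(深圳)有限公司 News processing method, apparatus, device, and medium
CN110990705B (zh) * 2019-12-06 2024-04-12 深圳市雅阅科技有限公司 News processing method, apparatus, device, and medium
CN111125520A (zh) * 2019-12-11 2020-05-08 东南大学 Event line extraction method for news text based on a deep clustering model
CN111125520B (zh) * 2019-12-11 2023-04-21 东南大学 Event line extraction method for news text based on a deep clustering model
CN111125429A (zh) * 2019-12-20 2020-05-08 腾讯科技(深圳)有限公司 Video push method, apparatus, and computer-readable storage medium
CN111125429B (zh) * 2019-12-20 2023-05-30 腾讯科技(深圳)有限公司 Video push method, apparatus, and computer-readable storage medium
CN113407714A (zh) * 2020-11-04 2021-09-17 腾讯科技(深圳)有限公司 Timeliness-based data processing method, apparatus, electronic device, and storage medium
CN113407714B (zh) * 2020-11-04 2024-03-12 腾讯科技(深圳)有限公司 Timeliness-based data processing method, apparatus, electronic device, and storage medium
CN112948528A (zh) * 2021-03-02 2021-06-11 北京秒针人工智能科技有限公司 Keyword-based data classification method and system
CN115048486A (zh) * 2022-05-24 2022-09-13 支付宝(杭州)信息技术有限公司 Event extraction method, apparatus, computer program product, storage medium, and device
CN115048486B (zh) * 2022-05-24 2024-05-31 支付宝(杭州)信息技术有限公司 Event extraction method, apparatus, computer program product, storage medium, and device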

Also Published As

Publication number Publication date
CN110020104A (zh) 2019-07-16
CN110020104B (zh) 2023-04-07

Similar Documents

Publication Publication Date Title
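WO2019047849A1 (zh) News processing method, apparatus, storage medium and computer device
CN108009228B (zh) Method, apparatus, and storage medium for setting content tags
US11315546B2 (en) Computerized system and method for formatted transcription of multimedia content
CN110430476B (zh) Live-streaming room search method, system, computer device, and storage medium
WO2020207074A1 (zh) Information push method and device
WO2018072071A1 (zh) Knowledge graph construction system and method
US8990065B2 (en) Automatic story summarization from clustered messages
CN105701254B (zh) Information processing method and apparatus, and apparatus for information processing
CN111008321B (zh) Logistic-regression-based recommendation method, apparatus, computing device, and readable storage medium
KR102027471B1 (ko) Method and system for expansion to everyday language using word vectorization techniques based on social network content
CN110929125B (zh) Search recall method, apparatus, device, and storage medium thereof
CN103106287B (zh) Method and system for processing user search queries
CN110297880B (zh) Corpus product recommendation method, apparatus, device, and storage medium
CN110019675B (zh) Keyword extraction method and apparatus
US20170011114A1 (en) Common data repository for improving transactional efficiencies of user interactions with a computing device
CN109582847B (zh) Information processing method and apparatus, and storage medium
US20240348846A1 (en) Video generating method and apparatus, electronic device, and readable storage medium
CN113806588A (zh) Method and apparatus for searching videos
CN113343108B (zh) Recommendation information processing method, apparatus, device, and storage medium
CN113688310A (zh) Content recommendation method, apparatus, device, and storage medium
US11314793B2 (en) Query processing
US9607031B2 (en) Social data filtering system, method and non-transitory computer readable storage medium of the same
CN113407775B (zh) Video search method, apparatus, and electronic device
JP2020521246A (ja) Automated classification of network-accessible content
CN105512270B (zh) Method and apparatus for determining related objects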

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18854812

Country of ref document: EP

Kind code of ref document: A1