
CN113761227B - Text data searching method and device - Google Patents

Info

Publication number
CN113761227B
Authority
CN
China
Prior art keywords
corpus
text
space
time
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010806630.3A
Other languages
Chinese (zh)
Other versions
CN113761227A (en)
Inventor
兰亚伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010806630.3A
Publication of CN113761227A
Application granted
Publication of CN113761227B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text data searching method and device, and relates to the technical field of computers. The method comprises the following steps: extracting at least one of time features or space features of the search text data as space-time features by using a machine learning model; and determining the corpus text matched with the search text data according to the matching degree of the space-time characteristics and the space-time labels of the corpus texts, wherein the space-time labels are used for labeling at least one item of time information or space information of the corpus text.

Description

Text data searching method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text data searching method, a text data searching device, and a non-volatile computer readable storage medium.
Background
Because of the development of computer and network technologies, vast amounts of text are stored on today's networks and are constantly growing. Therefore, it is important how to accurately search for desired contents from a huge amount of text.
In the related art, a search engine as an entry for a user to acquire information is mostly implemented based on keyword content matching.
Disclosure of Invention
The inventors of the present disclosure found that the above-described related art has the following problems: the method has no function of deeply mining the internal connection of the information, so that the accuracy of the search result is low.
In view of this, the present disclosure proposes a technical solution for searching text data, which can improve accuracy of search results.
According to some embodiments of the present disclosure, there is provided a text data searching method including: extracting at least one of time features or space features of the search text data as space-time features by using a machine learning model; and determining the corpus text matched with the search text data according to the matching degree of the space-time characteristics and the space-time labels of the corpus texts, wherein the space-time labels are used for labeling at least one item of time information or space information of the corpus text.
In some embodiments, the spatiotemporal label is generated by: extracting at least one item of time characteristics or space characteristics of each sentence in the text to be processed as space-time characteristics by using a machine learning model; dividing the text to be processed into each corpus text according to the space-time characteristics, and generating space-time labels of each corpus text.
In some embodiments, determining corpus text that matches the search text data based on the degree of matching of the spatiotemporal features to the spatiotemporal labels of the respective corpus text comprises: determining a first corpus text according to the matching degree of the search features and the space-time labels of the corpus texts; determining a second corpus text which belongs to the same kind of event with the first corpus text according to the event label of the first corpus text; and determining the first corpus text and the second corpus text as corpus texts matched with the search text data.
In some embodiments, the event tag is generated by: according to the context information of each corpus text in the text to be processed, extracting event characteristics of each corpus text by using a machine learning model; the corpus texts with similar event characteristics are labeled with the same event labels.
In some embodiments, the corpus text matching the search text data is a plurality of; the method further comprises the steps of: determining related events for searching text data according to event labels of a plurality of matched corpus texts; and generating at least one item of space track information or time axis information of the related event according to the space-time labels of the plurality of matched corpus texts.
In some embodiments, the method further comprises at least one of the following steps: labeling and displaying the related events at corresponding positions on the map according to the space track information of the related events; or determining a relevant area on the map according to the space track information of the relevant event, and displaying the time text information or the time axis graphic information determined according to the time axis information on the relevant area.
According to other embodiments of the present disclosure, there is provided a text data searching apparatus including: an extraction unit for extracting at least one of a temporal feature or a spatial feature of the search text data as a spatiotemporal feature using the machine learning model; the determining unit is used for determining the corpus text matched with the search text data according to the matching degree of the space-time characteristics and the space-time labels of the corpus text, wherein the space-time labels are used for labeling at least one item of time information or space information of the corpus text.
In some embodiments, the spatiotemporal label is generated by: extracting at least one item of time characteristics or space characteristics of each sentence in the text to be processed as space-time characteristics by using a machine learning model; dividing the text to be processed into each corpus text according to the space-time characteristics, and generating space-time labels of each corpus text.
In some embodiments, the determining unit determines the first corpus text according to the matching degree of the search feature and the space-time label of each corpus text; determining a second corpus text which belongs to the same kind of event with the first corpus text according to the event label of the first corpus text; and determining the first corpus text and the second corpus text as corpus texts matched with the search text data.
In some embodiments, the event tag is generated by: according to the context information of each corpus text in the text to be processed, extracting event characteristics of each corpus text by using a machine learning model; the corpus texts with similar event characteristics are labeled with the same event labels.
In some embodiments, the corpus text matched with the search text data is a plurality of, and the determining unit determines related events of the search text data according to event tags of the plurality of matched corpus texts; the apparatus further comprises: and the generating unit is used for generating at least one item of space track information or time axis information of the related event according to the space-time labels of the plurality of matched corpus texts.
In some embodiments, the apparatus further comprises a display unit for performing at least one of the following steps: labeling and displaying the related events at corresponding positions on the map according to the space track information of the related events; or determining a relevant area on the map according to the space track information of the relevant event, and displaying the time text information or the time axis graphic information determined according to the time axis information on the relevant area.
According to still further embodiments of the present disclosure, there is provided a text data searching apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the method of searching text data in any of the embodiments described above based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of searching text data in any of the above embodiments.
In the embodiment, the space-time characteristics of the text data are taken as the search basis, so that the association relation in the text data can be deeply mined, and the accuracy of the search result is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a flow chart of some embodiments of a method of searching text data of the present disclosure;
FIG. 2 illustrates a flow chart of some embodiments of step 120 of FIG. 1;
FIG. 3 illustrates a schematic diagram of some embodiments of a method of searching text data of the present disclosure;
FIG. 4 illustrates a schematic diagram of further embodiments of a method of searching text data of the present disclosure;
FIG. 5 illustrates a block diagram of some embodiments of a search apparatus for text data of the present disclosure;
FIG. 6 illustrates a block diagram of further embodiments of a search apparatus for text data of the present disclosure;
Fig. 7 illustrates a block diagram of still further embodiments of a search apparatus for text data of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
As previously mentioned, the vast amount of text stored on networks contains a large amount of temporal and spatial information, so the text contents are often spatiotemporally associated with one another. When a search method does not extract, organize, associate, retrieve, and analyze this spatiotemporal information, users of a search engine often face inaccurate search results or have to screen the results manually.
In order to solve the technical problems, the method and the device are based on natural language processing technology, and intelligently extract, calculate and infer time and space information in text content. The text content is cut into a plurality of spatiotemporal events based on the spatiotemporal scene determined by the spatiotemporal information. The spatiotemporal event may have attributes such as time, place, person, event type, etc.
The spatiotemporal event is used as the smallest processing unit for searching and analysis, which can improve the accuracy of the search results. Combined with different application analysis models, the spatiotemporal knowledge and value in the text content can be further mined. For example, the technical solution of the present disclosure may be implemented by the following embodiments.
Fig. 1 illustrates a flow chart of some embodiments of a method of searching text data of the present disclosure.
As shown in fig. 1, the method includes: step 110, extracting space-time characteristics; and step 120, determining the matched corpus text.
In step 110, at least one of temporal or spatial features of the search text data is extracted as spatiotemporal features using a machine learning model.
In step 120, the corpus text matching the search text data is determined based on the degree of matching of the spatiotemporal features with the spatiotemporal labels of the respective corpus text. The space-time labels are used for labeling at least one item of time information or space information of the corpus text.
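Steps 110 and 120 can be illustrated with a minimal Python sketch. It assumes the machine learning model has already turned the search text into a time and/or a place; the `SpaceTime` container, the `matching_degree` scoring rule (the fraction of query features that agree with the label), and the threshold are hypothetical choices for illustration, not the scoring defined by the disclosure.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SpaceTime:
    """Hypothetical container for a time and/or a place; either may be missing."""
    day: Optional[date] = None
    place: Optional[str] = None


def matching_degree(query: SpaceTime, label: SpaceTime) -> float:
    """Fraction of the query's available features that the label agrees with."""
    checks = []
    if query.day is not None:
        checks.append(query.day == label.day)
    if query.place is not None:
        checks.append(query.place == label.place)
    return sum(checks) / len(checks) if checks else 0.0


def search(query: SpaceTime, corpus: dict[str, SpaceTime],
           threshold: float = 1.0) -> list[str]:
    """Return the ids of corpus texts whose label matches the query well enough."""
    return [cid for cid, label in corpus.items()
            if matching_degree(query, label) >= threshold]


# Toy corpus: each corpus text id maps to its spatiotemporal label.
corpus = {
    "c1": SpaceTime(date(2020, 8, 1), "Beijing"),
    "c2": SpaceTime(date(2020, 8, 1), "Shanghai"),
    "c3": SpaceTime(date(2020, 8, 2), "Beijing"),
}
hits = search(SpaceTime(place="Beijing"), corpus)
print(hits)
```

A query that carries only a place is scored on the place alone, which mirrors the "at least one of time features or space features" wording.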
In some embodiments, a corpus may be created for storing a set of annotated corpus texts. For example, each corpus text has a spatiotemporal label and is treated as a spatiotemporal event.
In some embodiments, at least one of temporal features or spatial features of each sentence in the text to be processed is extracted as a spatiotemporal feature using a machine learning model; dividing the text to be processed into each corpus text according to the space-time characteristics, and generating space-time labels of each corpus text.
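One way to picture this division is the following Python sketch. It assumes each sentence already comes with a model-extracted time and place (either may be `None`), and it adds one assumption that the text does not state: a sentence with no explicit time or place inherits the scene of the preceding sentence. The function name and tuple layout are hypothetical.

```python
from itertools import groupby


def split_into_corpus_texts(sentences):
    """Group consecutive sentences that share the same (time, place) scene.

    sentences: list of (text, time, place); time and place may be None.
    Returns a list of dicts, each with a spatiotemporal 'label' and the 'text'.
    """
    filled = []
    current = (None, None)
    for text, time, place in sentences:
        # Assumption: a missing time or place is carried over from the previous sentence.
        current = (time or current[0], place or current[1])
        filled.append((current, text))
    return [
        {"label": {"time": key[0], "place": key[1]},
         "text": " ".join(t for _, t in group)}
        for key, group in groupby(filled, key=lambda item: item[0])
    ]


sentences = [
    ("A met B.", "2020-08-01", "Beijing"),
    ("They talked.", None, None),
    ("Next day they flew.", "2020-08-02", None),
    ("They landed.", None, "Shanghai"),
]
segments = split_into_corpus_texts(sentences)
for seg in segments:
    print(seg["label"], seg["text"])
```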
In some embodiments, word segmentation and part-of-speech determination may be performed on each sentence, and the spatiotemporal features of each sentence are extracted from the processing result by using a machine learning model. For example, a Labeled-LDA (Latent Dirichlet Allocation) model may be utilized to extract spatiotemporal features and assign spatiotemporal labels.
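In some embodiments, each sentence may be segmented using an n-gram model. For example, the probability $P(\omega_i \mid \omega_{i-(n-1)}, \ldots, \omega_{i-1})$ that a word $\omega_i$ occurs in a sentence given the words preceding it can be calculated by the following formula:
$$P(\omega_i \mid \omega_{i-(n-1)}, \ldots, \omega_{i-1}) = \frac{\mathrm{count}(\omega_{i-(n-1)}, \ldots, \omega_i)}{\mathrm{count}(\omega_{i-(n-1)}, \ldots, \omega_{i-1})}$$
Here count() is the number of times a word combination appears. That is, $P(\omega_i \mid \omega_{i-(n-1)}, \ldots, \omega_{i-1})$ is the ratio of the frequency of the word combination $(\omega_{i-(n-1)}, \ldots, \omega_i)$ in the document to the frequency of the word combination $(\omega_{i-(n-1)}, \ldots, \omega_{i-1})$ in the document.
From the $P(\omega_i \mid \omega_{i-(n-1)}, \ldots, \omega_{i-1})$ of each $\omega_i$, the probability $P(\omega_{i-(n-1)}, \ldots, \omega_i)$ of the word combination $(\omega_{i-(n-1)}, \ldots, \omega_i)$ is calculated. For example, $P(\omega_{i-(n-1)}, \ldots, \omega_i)$ may be calculated as the product of the conditional probabilities $P(\omega_i \mid \omega_{i-(n-1)}, \ldots, \omega_{i-1})$. In the case where $P(\omega_{i-(n-1)}, \ldots, \omega_i)$ is greater than a threshold value, the word combination $(\omega_{i-(n-1)}, \ldots, \omega_i)$ is segmented as a single word.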
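The count ratio and the product rule can be checked on a toy string with the Python sketch below. The function names, the toy text, and the way the first factor is handled (a plain unigram frequency) are assumptions made for illustration; a real segmenter would be estimated on a large corpus and would need smoothing.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_probability(tokens, ngram):
    """P(last token | preceding tokens) = count(ngram) / count(ngram without last token)."""
    ngram = tuple(ngram)
    full = ngram_counts(tokens, len(ngram))[ngram]
    if len(ngram) == 1:
        # Assumption: with no history, fall back to the unigram relative frequency.
        return full / len(tokens)
    prefix = ngram_counts(tokens, len(ngram) - 1)[ngram[:-1]]
    return full / prefix if prefix else 0.0


def joint_probability(tokens, ngram):
    """Chain-rule product of the conditional probabilities of each token in the ngram."""
    prob = 1.0
    for k in range(1, len(ngram) + 1):
        prob *= conditional_probability(tokens, ngram[:k])
    return prob


tokens = list("abcabdabc")
p_c_given_ab = conditional_probability(tokens, ("a", "b", "c"))  # count(abc) / count(ab)
p_ab = joint_probability(tokens, ("a", "b"))                     # P(a) * P(b | a)

THRESHOLD = 0.3  # hypothetical merge threshold
merge_ab = p_ab > THRESHOLD
print(p_c_given_ab, p_ab, merge_ab)
```

In the toy string, "ab" occurs three times and "abc" twice, so the conditional probability of "c" after "ab" is 2/3, and the combination "ab" is merged because its probability of 1/3 exceeds the chosen threshold.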
In some embodiments, after word segmentation processing is performed on each sentence, part-of-speech tagging can be modeled as a sequence tagging problem, and part-of-speech tagging is performed by using a machine learning model. For example, the machine learning model may be a hidden Markov model, a conditional random field model, or the like.
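As an illustration of part-of-speech tagging as sequence labeling, the following Python sketch decodes a toy hidden Markov model with the Viterbi algorithm. All probabilities, the two-tag set (`N` for noun, `V` for verb), and the small constant for unseen words are invented for the example; a practical tagger would estimate them from annotated data and work in log space.

```python
def viterbi(words, states, start, trans, emit, unseen=1e-6):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start[s] * emit[s].get(words[0], unseen), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] * trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score * emit[s].get(word, unseen), prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model parameters (hypothetical values).
states = ["N", "V"]
start = {"N": 0.8, "V": 0.2}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {
    "N": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "V": {"dogs": 0.05, "bark": 0.6, "chase": 0.35},
}

tags = viterbi(["dogs", "chase", "cats"], states, start, trans, emit)
print(tags)
```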
Thus, words that do not appear in the dictionary can still be segmented, and the word segmentation accuracy can be improved by taking the context into account.
After word segmentation and part-of-speech tagging, the spatio-temporal features may be further extracted. Therefore, the space-time correlation in the text data can be deeply mined as the basis of the following search, so that the search accuracy is improved.
In some embodiments, step 120 may be implemented by the embodiment of fig. 2.
Fig. 2 shows a flow chart of some embodiments of step 120 in fig. 1.
As shown in fig. 2, step 120 includes: step 1210, determining a first corpus text; step 1220, determining a second corpus text; and step 1230, determining the matched corpus text.
In step 1210, a first corpus text is determined based on the degree of matching of the search features with the spatiotemporal labels of the respective corpus text.
In step 1220, a second corpus text is determined that belongs to the same class of event as the first corpus text based on the event tags of the first corpus text.
In some embodiments, according to the context information of each corpus text in the text to be processed, extracting event features of each corpus text by using a machine learning model; the corpus texts with similar event characteristics are labeled with the same event labels.
In some embodiments, spatiotemporal events belonging to the same event may be categorized into the same class of spatiotemporal events and the same class of spatiotemporal events may be built into an event collection. Each spatiotemporal event in an event collection has the same event label.
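A simple way to give corpus texts with similar event features the same event label is sketched below in Python. It assumes the event features are already available as numeric vectors, and it uses greedy grouping by cosine similarity with a hypothetical threshold; the disclosure does not specify the similarity measure or the grouping procedure, so both are illustrative choices.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_event_tags(features, threshold=0.9):
    """Greedily give each corpus text the tag of the first similar group, else a new tag.

    features: dict mapping corpus text id -> event feature vector.
    Returns a dict mapping corpus text id -> integer event tag.
    """
    representatives = []  # one feature vector per event tag
    tags = {}
    for cid, vec in features.items():
        for tag, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                tags[cid] = tag
                break
        else:
            representatives.append(vec)
            tags[cid] = len(representatives) - 1
    return tags


features = {"c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.0, 1.0]}
event_tags = assign_event_tags(features)
print(event_tags)
```

All corpus texts that end up with the same integer tag form one event set in the sense described above.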
Therefore, multi-spatiotemporal association analysis of the corpus texts can be realized, and the different times and places under the same event can be associated with one another. For example, the spatiotemporal events belonging to the same event set can be sorted by time and classified geographically according to their spatiotemporal labels, so that the course of an event can be deduced. Through this spatiotemporal association, the coverage of the search results can be broadened, which further improves their accuracy.
In step 1230, the first corpus text and the second corpus text are determined as corpus text matching the search text data.
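Steps 1210 to 1230 amount to a two-stage lookup, which the following Python sketch illustrates. It assumes the first corpus texts have already been found by spatiotemporal matching and that every corpus text has an event tag; the function name and the sample tags are hypothetical.

```python
def expand_by_event(first_ids, event_tags):
    """Add corpus texts that share an event tag with any first corpus text.

    first_ids: ids found by spatiotemporal matching (step 1210).
    event_tags: dict mapping every corpus text id -> event tag.
    Returns the first corpus texts followed by the second corpus texts (step 1230).
    """
    wanted = {event_tags[cid] for cid in first_ids}
    second_ids = [cid for cid, tag in event_tags.items()
                  if tag in wanted and cid not in first_ids]
    return list(first_ids) + second_ids


event_tags = {"c1": "flood", "c2": "flood", "c3": "concert", "c4": "flood"}
matched = expand_by_event(["c1"], event_tags)
print(matched)
```

Here "c2" and "c4" are returned even though only "c1" matched the query's time or place, which is how the event tag widens the coverage of the result.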
In some embodiments, a plurality of corpus texts match the search text data. In this case, the related event of the search text data may be determined according to the event labels of the plurality of matched corpus texts, and at least one of spatial track information or time axis information of the related event is generated according to the spatiotemporal labels of the plurality of matched corpus texts.
In some embodiments, the related events are marked and displayed at corresponding positions on the map according to the spatial track information of the related events.
In some embodiments, a relevant area is determined on a map according to the spatial trajectory information of the relevant event, and the time text information determined according to the time axis information, or the time axis graphic information is displayed on the relevant area.
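The time axis and spatial track can be derived from the spatiotemporal labels roughly as in the Python sketch below. The record layout (`time`, `place`, `coords`) and the coordinates are hypothetical, and the sketch stops at producing the data; drawing it on a map is left to whatever map service is used.

```python
from datetime import date


def build_timeline_and_track(matched):
    """Sort matched corpus texts by time and derive a time axis and a spatial track.

    matched: list of dicts with 'time' (date), 'place' (str) and 'coords' (lat, lon).
    Returns (timeline, track): timeline is a list of (ISO date, place) pairs,
    track is the ordered list of coordinates with consecutive repeats removed.
    """
    ordered = sorted(matched, key=lambda e: e["time"])
    timeline = [(e["time"].isoformat(), e["place"]) for e in ordered]
    track = []
    for e in ordered:
        if not track or track[-1] != e["coords"]:
            track.append(e["coords"])
    return timeline, track


# Hypothetical spatiotemporal labels of three matched corpus texts.
matched = [
    {"time": date(2020, 8, 3), "place": "Shanghai", "coords": (31.2, 121.5)},
    {"time": date(2020, 8, 1), "place": "Beijing", "coords": (39.9, 116.4)},
    {"time": date(2020, 8, 2), "place": "Beijing", "coords": (39.9, 116.4)},
]
timeline, track = build_timeline_and_track(matched)
print(timeline)
print(track)
```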
In some embodiments, the server of the technical solution of the present disclosure may be configured by the embodiment in fig. 3.
Fig. 3 illustrates a schematic diagram of some embodiments of a method of searching text data of the present disclosure.
As shown in fig. 3, the service end (platform) of the method may include an application presentation layer, a first service layer, a second service layer, and a base component layer.
In some embodiments, the application presentation layer may include a React + Redux framework, a Terria map framework, ECharts (a visualization tool), and the like.
In some embodiments, the first service layer may include a Shiro + JWT authorization framework, a base service module, a data acquisition module, and the like. For example, it may also include an algorithm analysis pool, a temporal and spatial information extraction module, a news situation analysis model, a multi-spatiotemporal association analysis model, and the like.
In some embodiments, the second service layer may include WndShaft, car.
In some embodiments, the base component layer may include Citus, PostgreSQL, ZomboDB, ES (Elasticsearch), Redis (cache), Mapnik, and the like.
In some embodiments, because the data volume on the server side is large and a single database can hardly support it, a cluster can be built with PostgreSQL, and the data can be split across multiple databases and tables (sharding) to relieve the read-write pressure on any single database or single table.
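The idea of splitting data across databases and tables can be sketched in Python as a stable hash routing function. This is only an illustration of the principle: the shard counts, the key format, and the use of CRC32 are assumptions, and in a deployment built on Citus the middleware performs the distribution itself.

```python
import zlib


def route(key: str, n_databases: int = 2, n_tables: int = 4) -> tuple[int, int]:
    """Map a record key to a (database index, table index) pair.

    CRC32 is used because it is stable across processes, unlike Python's built-in hash().
    """
    h = zlib.crc32(key.encode("utf-8"))
    return h % n_databases, (h // n_databases) % n_tables


db, table = route("doc-42")
print(f"doc-42 -> database {db}, table {table}")
```

Because the mapping depends only on the key, every read and write for the same record lands on the same database and table, while different keys spread over all of them.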
In some embodiments, the search of the method may include full-text retrieval. Full-text retrieval may be implemented with ES services on top of PostgreSQL and its dedicated Citus database middleware. For example, the ZomboDB plug-in may be used to access the ES services, so that the PostgreSQL database supports ES full-text indexing internally without the data having to be synchronized to the ES service separately.
In some embodiments, the data caching service is implemented based on Redis, the spatial information is rendered into a map using Mapnik, and the message queue is implemented using Kafka.
In some embodiments, the first service layer obtains data from the platform database in response to requests from the upper-layer application and the second service layer, performs business logic processing, and feeds the data back to the application layer. In this way it provides service support for implementing the overall functionality.
Fig. 4 shows a schematic diagram of further embodiments of a method of searching text data of the present disclosure.
As shown in fig. 4, the document library is used to store document contents as search ranges. The document library may include document content and folders (e.g., subfolder nesting may be supported).
For example, deleting a folder will delete both the contained document content and subfolders. Renaming and copy movement may be supported. The document library may be set to public or private.
For example, by default a user may create at most 10 document libraries, and this limit may be configured as desired. By default, a user's maximum document storage space may be 200 MB, which may also be configured as required.
In some embodiments, document content may be created, browsed, and managed. Before the user's storage space threshold is reached, document content may be added to a specified folder either by uploading a file or by providing a link.
In some embodiments, the document content in the document library may be a personal document uploaded by the user or linked text provided by the user; alternatively, a web document crawler may be used to crawl documents from internet websites and upload them. For example, links may support web page text extraction.
In some embodiments, it is desirable to provide metadata of a document when uploading the document for more accurate full text parsing. The public or private setting of a document depends on the publicity of the document library in which it resides.
In some embodiments, the corpus is used to store an annotated corpus text set for training. The annotated corpus text comprises labels and segmented document contents.
For example, by default a user may create at most 10 corpora, and this limit may be configured as desired. The corpus may not store files, but only extract the text content for storage. By default, a user's corpus text may contain at most 20,000 words, which may also be configured as needed.
For example, corpus text may support browsing and editing. After updating the corpus, the machine learning model may be retrained.
In some embodiments, the corpus text in the corpus may be forwarded from the document library and then edited, or may be uploaded directly to the corpus by the user. For example, spatiotemporal feature extraction and multi-spatiotemporal association analysis can be performed on the documents in the document library to generate corpus texts, which are then forwarded to the corpus for storage.
In some embodiments, each corpus may correspond to a Labeled-LDA model for labeling spatiotemporal labels. For example, after Labeled-LDA model updates, the task of updating the labels may be performed. At this point, the spatio-temporal labels may be regenerated using Labeled-LDA model. The corpus may be set public or private.
In some embodiments, the event sets may be user-created logical groupings for collecting the same class of spatiotemporal events together. The event collection may be used for subsequent analysis and map visualization. For example, a user may create a collection of events for an activity, for collecting all spatiotemporal events for the activity together.
In some embodiments, the event set may be set to public or private. A public event set does not distinguish between spatiotemporal events originating from private documents and those originating from public documents. Once the event set is made public, spatiotemporal events derived from private documents are also made public, but their source documents are not.
In some embodiments, all spatiotemporal events corresponding to the documents in a document library may be added to the event set in batches. The event set may provide a data basis for map visualization and analysis. A default event set may be created for each document library, and the user may modify the default event set.
In some embodiments, a user may search text through a search engine, with queries processed at the granularity of spatiotemporal events. For example, both basic keyword queries and advanced queries based on time and space may be supported.
In some embodiments, tag-based retrieval and specification of a scope of retrieval (e.g., all public events and own private data, specified event sets) may be supported.
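The retrieval scope "all public events and own private data" together with a tag filter can be expressed as a small Python filter. The record fields (`owner`, `public`, `tags`) and the sample data are hypothetical and only serve to show the visibility rule.

```python
def events_in_scope(events, user, tag=None):
    """Return events the user may search: public ones plus the user's own private ones.

    If a tag is given, only events carrying that tag are kept.
    """
    return [
        e for e in events
        if (e["public"] or e["owner"] == user) and (tag is None or tag in e["tags"])
    ]


events = [
    {"id": "e1", "owner": "alice", "public": True, "tags": {"flood"}},
    {"id": "e2", "owner": "alice", "public": False, "tags": {"flood"}},
    {"id": "e3", "owner": "bob", "public": False, "tags": {"flood"}},
    {"id": "e4", "owner": "bob", "public": True, "tags": {"concert"}},
]
visible_to_bob = [e["id"] for e in events_in_scope(events, "bob", tag="flood")]
print(visible_to_bob)
```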
In some embodiments, after word segmentation, training, and labeling, the spatiotemporal events can be stored in a corpus and an event set. When the user searches by keyword, the results can be displayed using the event labels or the spatiotemporal labels of the event set as indexes.
In some embodiments, time ordering may be performed according to the time information in the search text data corresponding to a keyword query or an event query, or geographic classification may be performed according to the spatial information in the search text data corresponding to a spatial query. The event set is then queried according to the sorting result and the classification result.
In some embodiments, the search results may be returned to the user. And the map service can be used for displaying according to the search result so as to carry out visual analysis of the map.
In some embodiments, the data source for the map visualization analysis is a specified event set. The event set and the mapping scheme may be in a one-to-many relationship. For example, different map visualization schemes (location tracks, time axis, time information, etc.) may be created for the same event set. The mapping scheme may be saved, and whether it is public or private depends on the event set used for mapping. The mapping scheme may be retrieved by scheme name, author user name, and data set name.
Fig. 5 illustrates a block diagram of some embodiments of a search apparatus for text data of the present disclosure.
As shown in fig. 5, the text data searching apparatus 5 includes an extracting unit 51, a determining unit 52.
The extraction unit 51 extracts at least one of temporal features or spatial features of the search text data as spatiotemporal features using a machine learning model.
The determining unit 52 determines a corpus text matching the search text data based on the degree of matching of the spatiotemporal features with the spatiotemporal labels of the respective corpus texts. The space-time labels are used for labeling at least one item of time information or space information of the corpus text.
In some embodiments, the spatiotemporal label is generated by: extracting at least one item of time characteristics or space characteristics of each sentence in the text to be processed as space-time characteristics by using a machine learning model; dividing the text to be processed into each corpus text according to the space-time characteristics, and generating space-time labels of each corpus text.
In some embodiments, the determining unit 52 determines the first corpus text according to a degree of matching of the search feature with the spatiotemporal label of each corpus text; determining a second corpus text which belongs to the same kind of event with the first corpus text according to the event label of the first corpus text; and determining the first corpus text and the second corpus text as corpus texts matched with the search text data.
In some embodiments, the event tag is generated by: according to the context information of each corpus text in the text to be processed, extracting event characteristics of each corpus text by using a machine learning model; the corpus texts with similar event characteristics are labeled with the same event labels.
In some embodiments, a plurality of corpus texts match the search text data. The determining unit 52 determines related events of the search text data based on the event labels of the plurality of matched corpus texts.
In some embodiments, the search apparatus 5 further includes a generating unit 53, configured to generate at least one of spatial track information or time axis information of the related event according to the spatiotemporal labels of the plurality of matched corpus texts.
In some embodiments, the search apparatus 5 further comprises a display unit 54 for performing at least one of the following steps: labeling and displaying the related events at corresponding positions on the map according to the spatial track information of the related events; or determining a relevant area on the map according to the spatial track information of the related event, and displaying, on the relevant area, the time text information or the time axis graphic information determined according to the time axis information.
Fig. 6 shows a block diagram of further embodiments of a search apparatus for text data of the present disclosure.
As shown in fig. 6, the text data searching apparatus 6 of this embodiment includes: a memory 61 and a processor 62 coupled to the memory 61, the processor 62 being configured to perform the method of searching text data in any of the embodiments of the present disclosure based on instructions stored in the memory 61.
The memory 61 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
Fig. 7 illustrates a block diagram of still further embodiments of a search apparatus for text data of the present disclosure.
As shown in fig. 7, the text data searching apparatus 7 of this embodiment includes: a memory 710 and a processor 720 coupled to the memory 710, the processor 720 being configured to perform the text data searching method of any of the foregoing embodiments based on instructions stored in the memory 710.
The memory 710 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
The text data searching apparatus 7 may further include an input/output interface 730, a network interface 740, a storage interface 750, and the like. The interfaces 730, 740, and 750, the memory 710, and the processor 720 may be connected by, for example, a bus 760. The input/output interface 730 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, a microphone, and a speaker. The network interface 740 provides a connection interface for various networking devices. The storage interface 750 provides a connection interface for external storage devices such as SD cards and USB flash drives.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media including, but not limited to, disk storage, CD-ROM, optical storage, and the like, having computer-usable program code embodied therein.
Heretofore, a search method of text data, a search apparatus of text data, and a non-volatile computer-readable storage medium according to the present disclosure have been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.
The methods and systems of the present disclosure may be implemented in a number of ways. For example, they may be implemented by software, hardware, firmware, or any combination thereof. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (10)

1. A text data searching method, comprising:
extracting at least one of time features or space features of search text data as space-time features by using a machine learning model;
determining corpus text matched with the search text data according to the matching degree between the space-time features and the space-time labels of the corpus texts, wherein the space-time labels are used for annotating at least one of time information or space information of the corpus texts;
wherein determining the corpus text matched with the search text data according to the matching degree between the space-time features and the space-time labels of the corpus texts comprises:
determining a first corpus text according to the matching degree between the space-time features and the space-time labels of the corpus texts;
determining a second corpus text belonging to the same kind of event as the first corpus text according to the event label of the first corpus text; and
determining the first corpus text and the second corpus text as the corpus texts matched with the search text data.
2. The search method of claim 1, wherein the space-time label is generated by:
extracting at least one of time features or space features of each sentence in the text to be processed as space-time features by using a machine learning model; and
dividing the text to be processed into the corpus texts according to the space-time features, and generating the space-time labels of the corpus texts.
3. The search method of claim 1, wherein the event label is generated by:
extracting event features of each corpus text by using a machine learning model according to the context information of each corpus text in the text to be processed; and
labeling the corpus texts having similar event features with the same event label.
4. The search method according to any one of claims 1 to 3, wherein
a plurality of corpus texts are matched with the search text data;
the method further comprising:
determining related events of the search text data according to the event labels of the plurality of matched corpus texts; and
generating at least one of space track information or time axis information of the related events according to the space-time labels of the plurality of matched corpus texts.
5. The search method of claim 4, further comprising at least one of:
labeling and displaying the related events at corresponding positions on a map according to the space track information of the related events; or
determining a relevant area on a map according to the space track information of the related events, and displaying, on the relevant area, time text information or time axis graphic information determined according to the time axis information.
6. A text data search apparatus comprising:
an extraction unit for extracting at least one of time features or space features of search text data as space-time features by using a machine learning model; and
a determining unit for determining corpus text matched with the search text data according to the matching degree between the space-time features and the space-time labels of the corpus texts, wherein the space-time labels are used for annotating at least one of time information or space information of the corpus texts,
wherein the determining unit determines a first corpus text according to the matching degree between the space-time features and the space-time labels of the corpus texts, determines a second corpus text belonging to the same kind of event as the first corpus text according to the event label of the first corpus text, and determines the first corpus text and the second corpus text as the corpus texts matched with the search text data.
7. The search apparatus according to claim 6, wherein
a plurality of corpus texts are matched with the search text data, and the determining unit determines related events of the search text data according to the event labels of the plurality of matched corpus texts;
the apparatus further comprising:
a generating unit for generating at least one of space track information or time axis information of the related events according to the space-time labels of the plurality of matched corpus texts.
8. The search apparatus according to claim 7, further comprising a display unit for performing at least one of:
labeling and displaying the related events at corresponding positions on a map according to the space track information of the related events; or
determining a relevant area on a map according to the space track information of the related events, and displaying, on the relevant area, time text information or time axis graphic information determined according to the time axis information.
9. A text data search apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the text data searching method of any of claims 1-5 based on instructions stored in the memory.
10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text data searching method of any of claims 1-5.
CN202010806630.3A 2020-08-12 2020-08-12 Text data searching method and device Active CN113761227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806630.3A CN113761227B (en) 2020-08-12 2020-08-12 Text data searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010806630.3A CN113761227B (en) 2020-08-12 2020-08-12 Text data searching method and device

Publications (2)

Publication Number Publication Date
CN113761227A CN113761227A (en) 2021-12-07
CN113761227B true CN113761227B (en) 2024-10-18

Family

ID=78785654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806630.3A Active CN113761227B (en) 2020-08-12 2020-08-12 Text data searching method and device

Country Status (1)

Country Link
CN (1) CN113761227B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104584010A (en) * 2012-09-19 2015-04-29 苹果公司 Voice-based media searching

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6535873B1 (en) * 2000-04-24 2003-03-18 The Board Of Trustees Of The Leland Stanford Junior University System and method for indexing electronic text
US7598977B2 (en) * 2005-04-28 2009-10-06 Mitsubishi Electric Research Laboratories, Inc. Spatio-temporal graphical user interface for querying videos
JP5133294B2 (en) * 2009-04-14 2013-01-30 日本電信電話株式会社 Spatio-temporal search device, method and program
CN102393900B (en) * 2011-07-02 2013-05-29 山东大学 Video copying detection method based on robust hash
TWI480751B (en) * 2012-12-27 2015-04-11 Ind Tech Res Inst Interactive object retrieval method and system based on association information
CN103927310B (en) * 2013-01-14 2019-03-08 百度在线网络技术(北京)有限公司 Generation method and device are suggested in a kind of search of map datum
CN103336957B (en) * 2013-07-18 2016-12-28 中国科学院自动化研究所 A kind of network homology video detecting method based on space-time characteristic
KR102370044B1 (en) * 2015-03-20 2022-03-02 아이피루씨 주식회사 A system and a method for searching prior art information and measuring similarity thereof
KR20150111336A (en) * 2015-09-09 2015-10-05 삼성전자주식회사 Method and Apparatus for searching contents
TW201804342A (en) * 2016-07-21 2018-02-01 國立成功大學 Search method of spatial-temporal based on multi-rule
EP3499384A1 (en) * 2017-12-18 2019-06-19 Fortia Financial Solutions Word and sentence embeddings for sentence classification
CN110472158B (en) * 2018-05-11 2024-01-30 北京搜狗科技发展有限公司 Method and device for ordering search entries
CN109101597B (en) * 2018-07-31 2019-08-06 中电传媒股份有限公司 A kind of electric power news data acquisition system
CN110795932B (en) * 2019-09-30 2021-03-30 中国地质大学(武汉) Geological report text information extraction method based on geological ontology
CN111414487B (en) * 2020-03-20 2023-06-23 北京百度网讯科技有限公司 Method, device, equipment and medium for associated expansion of event theme

Also Published As

Publication number Publication date
CN113761227A (en) 2021-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant