Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides an Elasticsearch-based data retrieval method. As shown in fig. 1, the method includes steps S1-S8.
Step S1: acquire internet information, where the internet information comprises a plurality of pieces of internet data.
In this embodiment, the internet information may be provided by an acquisition system, which is responsible for information collection. Specifically, the acquisition system obtains the internet information through collaborative crawlers. This is described only schematically in the present embodiment and is not limiting. The acquired internet information comprises the internet data together with the information classification and acquisition time corresponding to each piece of internet data.
Step S2: obtain, through the acquisition system, the information classification and acquisition time corresponding to each piece of internet data in the internet information, where the information classification characterizes the source location of the internet data.
In this embodiment, the information classification characterizes the source location of the internet data, and the acquisition time is the time at which the acquisition system acquires the information. Specifically, the source location may be one of various websites, such as WeChat, microblog, Baidu, headline, and surf sites, and different websites may correspond to different information classifications. In this embodiment, the information classifications mainly include: online media (web), microblog (weibo), WeChat (weixin), forum (forum), Baidu Tieba (baidu), headlines (toutiao), print media (printmedia), video (video), and so on. These are described only schematically in this embodiment and are not limiting; in other embodiments, the information classification may include other classifications, set as needed.
Specifically, the acquisition system obtains the information classification by configuring a specific classification identifier for each group of collaborative crawlers. For example, if one group of crawlers is responsible for acquiring microblog data, an info_flag identifying the microblog classification is added to the information it collects; if another group is responsible for acquiring Baidu Tieba data, an info_flag identifying that classification is added; and so on, so that different data sources carry different identifiers. The collection time is the time at which the crawler collects the information, that is, ctime = currentTime.
Step S3: perform adaptation through an information index adaptation module according to the information classification and acquisition time, and determine the index name corresponding to each piece of internet data in the Elasticsearch cluster.
In this embodiment, the information index adaptation module is mainly responsible for adapting the acquired information to the indexes in the Elasticsearch cluster. The module maps the information classification and acquisition time to indexes in the Elasticsearch cluster. Its main purpose is to manage index names uniformly, so that acquired data is stored in the corresponding index in the Elasticsearch cluster; the module also plays a decoupling role in the retrieval process.
Specifically, indexes are generated in the Elasticsearch cluster in advance. The name of each index is determined by the index information classification and the index acquisition time, and the specific format may be: index acquisition time_index information classification. In this embodiment, the index acquisition time may be accurate to the day; in other embodiments, it may be set to another granularity, for example one week or half a month, as actual needs require. This is described only schematically in the present embodiment and is not limiting. A specific example of the indexes established in the Elasticsearch cluster is shown in fig. 2.
For example, if the index information classification is microblog and the index acquisition time is the day 20210315, then the data acquired from microblogs throughout 20210315 is stored in the index folder named "20210315_microblog".
The acquisition system pushes the information to a message queue, and the information index adaptation module reads the information from the message queue. The information exists in the message queue in JSON format and is referred to as data for short. In the data, info_flag is the information classification identifier, gtime is the information acquisition time, and ctime is the information publishing time. The module maps the information classification and acquisition time to an index name as follows: data.get("gtime") + "_" + data.get("info_flag"). For example, for information classified as microblog with an acquisition time of 20201219, the index name is 20201219_weibo.
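The mapping from a queue message to an index name can be illustrated with a minimal Python sketch. The function name index_name_for and the sample message are hypothetical; the field names gtime and info_flag are taken from the description, and gtime is assumed to already be a day-precision string.

```python
import json


def index_name_for(message: str) -> str:
    """Map one JSON message from the queue to its index name: gtime + "_" + info_flag."""
    data = json.loads(message)
    # Assumption: gtime is already a day-precision string such as "20201219".
    return f'{data["gtime"]}_{data["info_flag"]}'


# Hypothetical sample message; only gtime and info_flag matter for the mapping.
sample = json.dumps({"gtime": "20201219", "info_flag": "weibo", "title": "example"})
print(index_name_for(sample))  # prints 20201219_weibo
```

Keeping this rule in a single function is what lets storage, retrieval, and deletion agree on index names without each one re-implementing the format.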
Step S4: store the internet data into the index corresponding to the index name in the Elasticsearch cluster according to the index name.
In this embodiment, an index storage space is established for each index name in the Elasticsearch cluster, so that the collected data is stored according to the index name, and in the subsequent retrieval process, the retrieval range can also be determined according to the retrieval statement. After the index name corresponding to each internet data is obtained, the internet information can be respectively stored in the Elasticsearch cluster according to the index name.
Step S5: acquire information to be retrieved, where the information to be retrieved comprises retrieval keywords, the information classification of the retrieval information, and a retrieval time range.
In this embodiment, the information to be retrieved is determined according to the retrieval requirement, and may specifically include the retrieval keyword, the information classification of the retrieval information, and the retrieval time range.
Step S6: generate a retrieval statement for the information to be retrieved according to the query syntax of Elasticsearch.
In this embodiment, according to the specific search keyword to be searched, the information classification of the search information, and the search time range, the information and index adapter module is called, a specific search statement is generated according to the query syntax of the Elasticsearch, and then the Elasticsearch cluster is used for searching.
The retrieval keyword is the keyword that the user wants to search for. For example, if the user wants to retrieve information containing the keyword "two parties" on the "microblog" platform within the time range 20210321 to 20210322, the names of the indexes to be searched are 20210321_weibo and 20210322_weibo, and the retrieval statement is:
{"query": {"bool": {"filter": {"bool": {"must_not": {"term": {"data_type": 3}}, "must": [{"range": {"public_time": {"gte": "1615824000000", "lte": "1616428740000"}}}, {"bool": {"should": {"shell": …}}}]}}, "must": {"query": "two parties", "fields": ["title", "content"]}}}}.
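A retrieval statement of this general shape can be assembled programmatically. The following Python sketch is illustrative only and is not the exact statement of this embodiment: the function name build_retrieval_statement, the use of a multi_match clause for the keyword, and the time field name publish_time are assumptions; the data_type exclusion and the title and content fields follow the example above.

```python
import json


def build_retrieval_statement(keyword: str, start_ms: int, end_ms: int) -> dict:
    """Build a bool query: keyword match in `must`, exclusions and time range in `filter`."""
    return {
        "query": {
            "bool": {
                "must": {
                    # Assumption: the keyword is matched with multi_match over title and content.
                    "multi_match": {"query": keyword, "fields": ["title", "content"]}
                },
                "filter": {
                    "bool": {
                        "must_not": {"term": {"data_type": 3}},
                        "must": {
                            # Assumption: the time field is named publish_time (epoch milliseconds).
                            "range": {"publish_time": {"gte": start_ms, "lte": end_ms}}
                        },
                    }
                },
            }
        }
    }


statement = build_retrieval_statement("two parties", 1615824000000, 1616428740000)
print(json.dumps(statement, ensure_ascii=False))
```

Placing the time range and the exclusion inside filter means they restrict the result set without affecting relevance scoring, while the keyword clause in must determines the score.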
Step S7: search in the information index adaptation module according to the retrieval statement to obtain the index retrieval range corresponding to the retrieval statement.
In this embodiment, the retrieval statement includes the information classification and the time range of the information to be retrieved, so the index retrieval range in the Elasticsearch cluster corresponding to the information to be retrieved can be determined.
Step S8: perform retrieval in the Elasticsearch cluster according to the index retrieval range to obtain a retrieval result.
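Determining the index retrieval range amounts to enumerating one index name per day and per information classification. A minimal Python sketch follows; the function name index_retrieval_range is hypothetical, and day-precision index names of the form acquisition time_information classification are assumed.

```python
from datetime import date, timedelta


def index_retrieval_range(start: date, end: date, info_flags: list[str]) -> list[str]:
    """List the index names covering every day in [start, end] for each classification."""
    names = []
    day = start
    while day <= end:
        for flag in info_flags:
            names.append(f"{day:%Y%m%d}_{flag}")
        day += timedelta(days=1)
    return names


print(",".join(index_retrieval_range(date(2021, 3, 21), date(2021, 3, 22), ["weibo"])))
# prints 20210321_weibo,20210322_weibo
```

Because only the listed indexes are searched, the cost of a query grows with the requested time range and classifications rather than with the total amount of stored data.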
In this embodiment, the index name may be determined according to the index retrieval range, and then the acquired data stored in the index corresponding to the index name is found by searching in the Elasticsearch cluster according to the index name, and the acquired data is retrieved to obtain the retrieval result.
Through the above steps, the internet information is stored in the corresponding index of the Elasticsearch cluster according to its information classification and acquisition time, and subsequent retrieval can target indexes by information classification and acquisition time. This realizes multi-dimensional full-text retrieval over information classification and acquisition time and improves retrieval efficiency.
As an exemplary embodiment, after step S8 of performing retrieval in the Elasticsearch cluster according to the index retrieval range to obtain the retrieval result, the method further includes step S9.
Step S9: display the retrieval result.
In the present embodiment, step S9 includes steps S91-S93.
Step S91: acquire display requirement information.
In this embodiment, the display requirement information is determined according to the user retrieval requirement. Specifically, the display requirement information comprises keyword colors and preset attribute extraction information; this is only schematically illustrated in the present embodiment, which is not limited to this, and the present embodiment may be reasonably configured as required in practical application.
Wherein, the keywords are retrieval keywords input by the user; the preset attribute is a key attribute, the key attribute belongs to the service characteristics of the service system, for example, in the public opinion industry, information publishing time, author figure images, information forwarding chains and the like all belong to the key attribute, and the service system processes information according to the service characteristics of the service system.
Step S92: identify (mark) the retrieval result according to the display requirement information to obtain the identified retrieval result.
Specifically, the search result is identified according to the display requirement information, for example, if the color of the keyword in the display requirement information is set to be red, the keyword in the search result is marked with red.
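As an illustration of marking keywords in red, the following Python sketch wraps each occurrence of the keyword in a colored HTML span. Rendering through HTML is an assumption, since the embodiment does not specify how the color is applied, and the function name mark_keyword is hypothetical.

```python
import html
import re


def mark_keyword(text: str, keyword: str, color: str = "red") -> str:
    """Escape the text for HTML, then wrap each occurrence of keyword in a colored span."""
    escaped_text = html.escape(text)
    if not keyword:
        return escaped_text  # nothing to mark
    # The keyword is escaped the same way so it still matches inside the escaped text.
    pattern = re.compile(re.escape(html.escape(keyword)), re.IGNORECASE)
    return pattern.sub(
        lambda m: f'<span style="color:{color}">{m.group(0)}</span>', escaped_text
    )


print(mark_keyword("Report on the two parties", "two parties"))
# prints Report on the <span style="color:red">two parties</span>
```

Escaping before marking keeps any markup contained in the collected text from being interpreted by the page that displays the result.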
Step S93: display the identified retrieval result.
Specifically, the identified retrieval result is displayed to the user, so that the user can more visually see the retrieval result.
According to the steps, the retrieval result is identified according to the display requirement information, and the identified retrieval result is displayed, so that the retrieval result is more visual.
As an exemplary embodiment, after step S4 of storing the internet data into the index corresponding to the index name in the Elasticsearch cluster according to the index name, the method further includes steps S10-S11.
Step S10: determine the index deletion time according to the service requirement.
In this embodiment, the service requirement includes a requirement on the retrieval time span, and the index deletion time may be determined from it. For example, if retrieval needs to cover only about the last 5 years or the last 10 years, data older than about five or ten years can be deleted to reduce the storage space used.
Specifically, the index deletion time may be one day, one week, one month, or the like, and may be determined reasonably according to the service requirement.
Step S11: according to the index deletion time, delete the indexes with earlier index times at a preset deletion period.
In this embodiment, the preset deletion period may be reasonably set according to actual needs, specifically, the preset deletion period may be one day, one week, one month, and the like, which is only schematically described in this embodiment and is not limited thereto.
For example, if the index deletion time is one week and the preset deletion period is one week, the acquired data of one week with the earliest index time is deleted every week.
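Selecting the indexes to delete can be sketched as follows. This Python example interprets the service requirement as a retention window expressed in days, which is an assumption; the function name indexes_to_delete is hypothetical, and index names are assumed to start with a day-precision date followed by an underscore.

```python
from datetime import date, datetime, timedelta


def indexes_to_delete(index_names: list[str], today: date, keep_days: int) -> list[str]:
    """Return the index names whose date prefix is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        day_part, _, _ = name.partition("_")
        index_day = datetime.strptime(day_part, "%Y%m%d").date()
        if index_day < cutoff:
            expired.append(name)
    return expired


names = ["20210301_weibo", "20210314_weibo", "20210315_weixin"]
print(indexes_to_delete(names, date(2021, 3, 15), keep_days=7))
# prints ['20210301_weibo']
```

A script run once per preset deletion period could then remove each returned index as a whole, rather than deleting individual documents by condition.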
In this embodiment, the information classification is effectively a fixed dimension, while a new index is generated every day as time passes. A scheduled script creates the indexes to be used on the following day at 1:00 a.m. every day. Earlier indexes can also be deleted as a whole according to actual business requirements, which avoids the performance degradation that an Elasticsearch cluster suffers when data is deleted by condition.
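The names of the indexes that such a scheduled script would create for the following day can be computed as in the Python sketch below. The function name next_day_index_names is hypothetical, the classification list is taken from the identifiers named in this embodiment, and the scheduling itself and the calls that create the indexes in the cluster are left outside the sketch.

```python
from datetime import date, timedelta

# Classification identifiers listed in this embodiment.
INFO_FLAGS = ["web", "weibo", "weixin", "forum", "baidu", "toutiao", "printmedia", "video"]


def next_day_index_names(today: date, info_flags: list[str] = INFO_FLAGS) -> list[str]:
    """Return one index name per classification for the day after `today`."""
    tomorrow = today + timedelta(days=1)
    return [f"{tomorrow:%Y%m%d}_{flag}" for flag in info_flags]


print(next_day_index_names(date(2020, 12, 18), ["weibo", "weixin"]))
# prints ['20201219_weibo', '20201219_weixin']
```

Creating the indexes ahead of time means that data arriving just after midnight already has a target index.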
According to the steps, the indexes with earlier time are deleted regularly according to actual service requirements, so that the aims of managing and storing mass data are fulfilled.
As an exemplary embodiment, before step S3 of performing adaptation through the information index adaptation module according to the information classification and acquisition time, the method further includes steps S12-S13.
Step S12: establish indexes in the Elasticsearch cluster in advance.
In this embodiment, an index storage space is established for each index name in the Elasticsearch cluster, so that the collected data is stored according to the index name.
Step S13: map the indexes one-to-one to the information classification and acquisition time.
In the present embodiment, a specific example of the mapping process is as follows.
For example, if the information classification is weibo and the acquisition time is 20201219, the index name is 20201219_weibo; as another example, if the information classification is weixin, the index name is 20201219_weixin.
Through the above steps, indexes are established in the Elasticsearch cluster in advance and mapped to the information classification and acquisition time, so that the acquired data can be stored in the Elasticsearch cluster.
In this embodiment, the Elasticsearch index name is generated from the information classification and acquisition time of the internet information, and data is stored in the corresponding index. During retrieval, the designated indexes can be searched according to information classification and acquisition time. During deletion, the index for a specified classification and date can be deleted as a whole in one operation. This realizes multi-dimensional full-text retrieval over information classification, acquisition time, and the like, and allows massive data to be managed efficiently and quickly.
A detailed description is given below with a specific example.
a. Information acquisition system
This system mainly provides the basic internet information for this embodiment and tags the information with a classification identifier, namely the information classification. It interacts with the processing system through a message queue: the acquisition system pushes information to the message queue, and the processing system reads the information from the message queue.
b. An information processing system (processing system) mainly comprises the following sub-modules
1) Information and index adapter module
This module is mainly responsible for adapting information to the indexes in the Elasticsearch cluster. The module maps the information classification and acquisition time to indexes in the Elasticsearch cluster; its main purpose is to manage index names uniformly and to play a decoupling role.
2) Elasticsearch cluster index management module
This module is mainly responsible for managing the indexes of the Elasticsearch cluster. The module can call the information and index adapter module to generate indexes in the Elasticsearch cluster in advance, with index names of the form: acquisition time_information classification (the acquisition time is accurate to the day). For example, if the information classification is weibo and the acquisition time is 20201219, the index name is 20201219_weibo; if the information classification is weixin, the index name is 20201219_weixin; and so on. The module can also periodically delete older indexes according to actual service requirements, so as to manage massive data.
3) Information warehousing management module
This module is mainly responsible for storing information into the Elasticsearch cluster. When the information processing system receives data pushed by the acquisition system, it calls the information and index adapter module to adapt the information to an index, and then stores the data in the corresponding index.
c. Information retrieval system (retrieval system for short)
After the above systems and modules have classified and stored the information, the retrieval system provides a standard external interface to serve each business system. According to the specific keywords to be retrieved, the information classification, and the time range, the retrieval system calls the information and index adapter module, generates a specific retrieval statement according to the query syntax of Elasticsearch, and then performs retrieval through a client of the Elasticsearch cluster.
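The warehousing flow can be sketched in Python as routing each queue message to the index named by the adapter rule. In this sketch an in-memory dictionary stands in for the Elasticsearch cluster, which is an assumption made to keep the example self-contained; the function name store_messages and the sample messages are hypothetical.

```python
import json
from collections import defaultdict


def store_messages(messages: list[str]) -> dict[str, list[dict]]:
    """Group queue messages by adapted index name (in-memory stand-in for the cluster)."""
    cluster = defaultdict(list)
    for raw in messages:
        data = json.loads(raw)
        # Adapter rule from the description: gtime + "_" + info_flag.
        index_name = f'{data["gtime"]}_{data["info_flag"]}'
        cluster[index_name].append(data)
    return dict(cluster)


queue = [
    json.dumps({"gtime": "20201219", "info_flag": "weibo", "title": "a"}),
    json.dumps({"gtime": "20201219", "info_flag": "weixin", "title": "b"}),
    json.dumps({"gtime": "20201219", "info_flag": "weibo", "title": "c"}),
]
stored = store_messages(queue)
print(sorted(stored))  # prints ['20201219_weibo', '20201219_weixin']
```

In a real deployment the append would be replaced by an indexing request to the cluster; the routing logic stays the same.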
d. Information display system (business system for short)
The service system is a user-oriented system, which mainly provides some convenient interactive operations for users, the users can input search keywords, select information classification, time range or other search conditions, the service system sends a search request to the search system for information search, and finally the information is displayed to the users after keyword red marking and key attribute extraction are carried out in the service system.
This embodiment also provides an Elasticsearch-based data retrieval system, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware are also possible and contemplated.
The embodiment also provides an Elasticsearch-based data retrieval system, as shown in fig. 3, including:
the first acquisition module 1 is configured to acquire internet information, where the internet information comprises a plurality of pieces of internet data;
the first processing module 2 is configured to obtain, through an acquisition system, the information classification and acquisition time corresponding to each piece of internet data in the internet information, where the information classification characterizes the source location of the internet data;
the second processing module 3 is configured to perform adaptation through the information index adaptation module according to the information classification and acquisition time, and determine the index name corresponding to each piece of internet data in the Elasticsearch cluster;
the third processing module 4 is configured to store the internet data into an index corresponding to the index name in the Elasticsearch cluster according to the index name;
the second obtaining module 5 is used for obtaining information to be retrieved, wherein the information to be retrieved comprises retrieval keywords, information classification of the retrieval information and a retrieval time range;
the fourth processing module 6 is configured to generate a retrieval statement for the information to be retrieved according to the query syntax of the Elasticsearch;
the fifth processing module 7 is configured to search in the information index adaptation module according to the search statement to obtain an index search range corresponding to the search statement;
and the sixth processing module 8 is configured to perform retrieval in the Elasticsearch cluster according to the index retrieval range to obtain a retrieval result.
Optionally, the system further comprises: a seventh processing module, configured to display the retrieval result.
Optionally, the seventh processing module includes: the first acquisition unit is used for acquiring the display requirement information; the first processing unit is used for identifying the retrieval result according to the display requirement information to obtain the identified retrieval result; and displaying the identified retrieval result.
Optionally, the display requirement information includes a keyword color and preset attribute extraction information.
Optionally, the system further comprises: an eighth processing module, configured to determine the index deletion time according to the service requirement; and a ninth processing module, configured to delete the indexes with earlier index times at a preset deletion period according to the index deletion time.
Optionally, the system further comprises: a tenth processing module, configured to establish indexes in the Elasticsearch cluster in advance; and an eleventh processing module, configured to map the indexes one-to-one to the information classification and acquisition time.
The Elasticsearch based data retrieval system in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and a memory executing one or more software or fixed programs, and/or other devices that can provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, the electronic device includes one or more processors 71 and a memory 72, where one processor 71 is taken as an example in fig. 4.
The electronic device may further include: an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or other means, as exemplified by the bus connection in fig. 4.
The processor 71 may be a Central Processing Unit (CPU). The Processor 71 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 72 is a non-transitory computer readable storage medium, and can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the Elasticsearch-based data retrieval method in the embodiment of the present application. The processor 71 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 72, namely, implements the Elasticsearch-based data retrieval method of the above-described method embodiment.
The memory 72 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a processing device operated by the server, and the like. Further, the memory 72 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 72 may optionally include memory located remotely from the processor 71, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72, which when executed by the one or more processors 71 perform the method shown in FIG. 1.
It will be understood by those skilled in the art that all or part of the processes in the method according to the above embodiments may be implemented by instructing relevant hardware through a computer program, and the executed program may be stored in a computer-readable storage medium, and when executed, may include the processes according to the embodiments of the data retrieval method based on the Elasticsearch. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.