Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
An embodiment of the invention provides a method for generating data deduplication mark codes which, as shown in FIG. 1, comprises steps S1 to S10.
Step S1: a bid data set is acquired, the bid data set including a plurality of collected bid data.
In this embodiment, a web crawler collects bidding data, and the collected bidding data form the bidding data set. This acquisition mode is described only by way of example and is not limiting; in other embodiments, the bidding data may be obtained by other existing techniques, for example through a business interface, and the mode may be set as actually needed.
Step S2: obtain the bid title, bid content, bid number, bid unit name, and bid stage type of each bidding data from the bidding data set.
In this embodiment, title extraction is performed on each bidding data in the bidding data set to obtain the bid title corresponding to each bidding data. Specifically, after back-end collection, the algorithm processing may take the content and title corresponding to the bidding data directly from a queue; in other embodiments, other existing title extraction methods may be used. This is described only by way of example and is not limiting.
In this embodiment, content extraction is performed on each bidding data in the bidding data set to obtain the bid content corresponding to each bidding data. Specifically, after back-end collection, the algorithm processing may take the content and title corresponding to the bidding data directly from a queue; in other embodiments, other existing content extraction methods may be used. This is described only by way of example and is not limiting.
In this embodiment, the bid number refers to the project number of the bidding document. Each bidding project has a unique project number, and different targets are distinguished by it. Project number extraction is performed on each bidding data in the bidding data set to obtain the bid number corresponding to each bidding data. Specifically, the bid number may be extracted with a regular expression: bidding data are semi-structured, each target corresponds to a unique bid number, and the text that introduces the bid number is relatively fixed, such as 'sign number: 2019-10-09-broad words MNQ' or 'project number: 2019-10-09-broad-word MNQ'. The fixed phrases that introduce the bid number are therefore matched in the bid content with a regular expression. In other embodiments, other existing project number extraction methods may be used to obtain the bid number; this is described only by way of example and is not limiting.
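The regular-expression extraction described above might look like the following minimal sketch. The label phrases (`bid number`, `project number`), the pattern, and the function name `extract_bid_number` are assumptions made for illustration; a real deployment would use the fixed phrases actually observed in the collected notices.

```python
import re

# Hypothetical label phrases; the real phrases come from the collected notices.
BID_NUMBER_PATTERN = re.compile(
    r"(?:bid number|project number)\s*[:：]\s*([\w-]+)",
    re.IGNORECASE,
)


def extract_bid_number(content):
    """Return the first bid number found in the bid content, or None."""
    match = BID_NUMBER_PATTERN.search(content)
    return match.group(1) if match else None
```

The pattern accepts both the ASCII and the full-width colon, since either may follow the label in semi-structured notice text.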
In this embodiment, bid unit extraction is performed on each bidding data in the bidding data set to obtain the bid unit name corresponding to each bidding data. Specifically, the bid unit name may be extracted with a regular expression: the text in the bid content that is introduced by a label such as 'bid unit' or 'bid person' is extracted and used as the bid unit name. In other embodiments, other existing unit name extraction methods may be used; this is described only by way of example and is not limiting.
In this embodiment, the bid stage type refers to a stage in the bidding process. Bid stage types are divided into two major types, bid announcements and bid results, and each major type is further divided into about 20 subtypes such as negotiation, competitive bidding, price inquiry, correction, transaction, and failed bid. The keywords corresponding to each subtype are obtained from statistics over a large amount of historical bidding data; the keywords under each category are then mapped to the corresponding bid stage type, and a preset stage category dictionary is generated in advance. This classification of bid stage types is described only by way of example and is not limiting.
Specifically, when the bid stage type is determined, the stage type word that is removed from the title is retained as the stage type keyword of the bidding data. This keyword is looked up in the preset stage category dictionary, and the stage type mapped to it is taken as the bid stage type.
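As an illustration of the lookup just described, the sketch below strips a stage keyword from a title and returns the mapped stage type. The dictionary entries and the names `STAGE_DICTIONARY` and `split_stage` are hypothetical; the actual dictionary is built from statistics over historical bidding data.

```python
# Hypothetical keyword-to-stage mapping; the real one is compiled from historical data.
STAGE_DICTIONARY = {
    "bid announcement": "bid_announcement",
    "price inquiry announcement": "bid_announcement",
    "bid-winning announcement": "bid_result",
    "transaction announcement": "bid_result",
}


def split_stage(title):
    """Return (title without its stage keyword, stage type or None)."""
    # Try longer keywords first so a short keyword cannot shadow a longer one.
    for keyword in sorted(STAGE_DICTIONARY, key=len, reverse=True):
        if keyword in title:
            stripped = " ".join(title.replace(keyword, " ").split())
            return stripped, STAGE_DICTIONARY[keyword]
    return title, None
```

Two notices for the same project then yield the same stripped title but different stage types, which is what later allows them to share a group code while keeping distinct data codes.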
Step S3: obtain the title feature corresponding to each bidding data according to its bid title.
In this embodiment, analysis of bidding data shows that repetition falls into two cases: in one the data are identical, and in the other the data are different stages of the same target, for example "bid announcement of the strong and weak electricity transformation project of a certain company's simulation training building" and "bid-winning announcement of the strong and weak electricity transformation project of a certain company's simulation training building". The two pieces of data concern the same target, with the same bid number and other related content, yet they must not be counted as duplicates during data processing. The stage type of the target is therefore introduced to represent the different life cycle stages of the same target, which adds value in subsequent data display.
The title of bidding data generally contains words related to the stage type. For example, "bid announcement" in the first title above indicates the stage type of the target, as does "bid-winning announcement" in the second.
In this embodiment, since the bid stage type is a feature that represents the different stages of a target, it need not be considered in the title, and the stage type words in the title are removed before title keywords are extracted. Keywords are then extracted from the bid title with the stage type words removed, and the title keywords, arranged in a fixed order, serve as the title feature corresponding to each bidding data.
Step S4: obtain the content feature corresponding to each bidding data according to its bid content.
In this embodiment, the bid content of each bidding data is segmented into words and word frequencies are counted. The M words with the highest word frequency are selected; the words are also sorted by word length, and the N longest words are selected. The words that appear in both the top-M-by-frequency list and the top-N-by-length list are output as the content keywords of the target, and the content keywords, arranged in a fixed order, serve as the content feature corresponding to each bidding data.
Step S5: obtain the number feature corresponding to each bidding data according to its bid number.
In this embodiment, the bid number of each acquired bidding data is used as the number feature corresponding to that bidding data.
Step S6: obtain the unit name feature corresponding to each bidding data according to its bid unit name.
In this embodiment, the same target is generally issued by the same unit, so the bid unit name is used as a basis for judgment in order to improve the accuracy of duplicate detection. The bid unit name of each acquired bidding data is used as the unit name feature corresponding to that bidding data.
Step S7: obtain the stage type feature corresponding to each bidding data according to its bid stage type.
In this embodiment, the bid stage type of each acquired bidding data is used as the stage type feature corresponding to that bidding data.
Step S8: obtain the data code corresponding to each bidding data according to its title feature, content feature, number feature, unit name feature, and stage type feature.
In this embodiment, the title feature, content feature, number feature, unit name feature, and stage type feature corresponding to each bidding data are spliced in a fixed order, and the spliced multidimensional feature is encoded, for example with MD5; the result is the data code corresponding to each bidding data.
Step S9: obtain the group code corresponding to each bidding data according to its title feature, content feature, number feature, and unit name feature.
In this embodiment, the title feature, content feature, number feature, and unit name feature corresponding to each bidding data are spliced in a fixed order, and the spliced multidimensional feature is encoded, for example with MD5; the result is the group code corresponding to each bidding data.
Step S10: obtain the deduplication mark code corresponding to each bidding data according to its data code and group code.
In this embodiment, each bidding data corresponds to one data code and one group code, which together serve as the deduplication mark code of that bidding data; the data in the bidding database can then be deduplicated according to the deduplication mark code. Specifically, every record in the bidding database carries two codes, unique_code and group_code. The bidding database can be grouped and deduplicated by unique_code alone; it can also be grouped and deduplicated by both group_code and unique_code, which shows the same target at its different stages.
In the above method, the bid title, bid content, bid number, bid unit name, and bid stage type corresponding to each bidding data are first determined from the acquired bidding data set. Next, the title feature is determined from the bid title, the content feature from the bid content, the number feature from the bid number, the unit name feature from the bid unit name, and the stage type feature from the bid stage type. The data code corresponding to each bidding data is then obtained from the title feature, content feature, number feature, unit name feature, and stage type feature, and the group code is obtained from the title feature, content feature, number feature, and unit name feature. Finally, the data code and the group code are used as the deduplication mark code corresponding to each bidding data, and the bidding data are deduplicated by means of the deduplication mark code.
As an exemplary embodiment, step S3, obtaining the title feature corresponding to each bidding data according to its bid title, includes steps S31 to S36.
S31: acquire the preset stage category dictionary.
In this embodiment, the stage types refer to the different life cycle stages of the same target, that is, the different stages of the same target in the bidding process. Bid stage types are divided into two major types, bid announcements and bid-winning results, and each major type is divided into more than 20 subtypes such as negotiation, competitive bidding, price inquiry, correction, transaction, and failed bid. Each subtype in turn includes a plurality of keywords representing that subtype, obtained from statistics over a large amount of historical bidding data. The keywords under each subtype are mapped to the corresponding major type, and each major type corresponds to one stage type; this forms the mapping between stage type keywords and stage types, and the resulting pre-built mapping dictionary is the preset stage category dictionary.
S32: remove the stage type words from the bid title of each bidding data according to the preset stage category dictionary.
In this embodiment, the words in the bid title of each bidding data are compared with the stage type keywords in the preset stage category dictionary; when a stage type keyword from the dictionary appears in the bid title, it is removed from the title.
In this embodiment, after the stage type words are removed, the title still cannot be transcoded directly as a whole. Bid titles collected from different websites may differ by one or more added or omitted words even though they describe the same target. In the actual database such duplicates account for a relatively large proportion, so keywords must also be extracted from the title in the subsequent steps.
S33: segment the bid titles from which the stage type words have been removed to obtain the title word segments corresponding to each bidding data.
In this embodiment, after the stage type words are removed from the bid title, word segmentation is performed to obtain the title word segments of each bid title. The word segmentation method may be jieba word segmentation; in other embodiments, other existing word segmentation methods may be used. This is described only by way of example and is not limiting.
S34: calculate the TFIDF value of each word in the title word segments corresponding to each bidding data.
In this embodiment, word frequency statistics are performed on the title word segments, and the term frequency and inverse document frequency of each word are calculated to obtain the TFIDF value of each word.
S35: take the first preset number of word segments with the highest TFIDF values as the title extraction keywords.
In this embodiment, the first preset number may be 3, that is, the 3 word segments with the highest TFIDF values are selected as the title extraction keywords. In other embodiments, the first preset number may also be 2 or 4; it is described here only by way of example, is not limiting, and may be set as needed in practical applications.
Specifically, the words in the title word segments are sorted by TFIDF value, and the 3 words with the highest TFIDF values are taken as the title extraction keywords corresponding to the bid title.
S36: sort the title extraction keywords in a first preset order to obtain the title ordering keywords, and use the title ordering keywords as the title feature corresponding to each bidding data.
In this embodiment, the first preset order may be the order of the pinyin initials of the Chinese characters; in other embodiments, the first preset order may be the order of the stroke counts of the Chinese characters. This is described only by way of example and is not limiting.
In this embodiment, the title extraction keywords are arranged in the first preset order; the arranged keywords are the title ordering keywords corresponding to the bid title and are used as the title feature of the bid title.
In the above steps, words related to the stage type are removed from the title according to the preset stage category dictionary, which is a stage category dictionary table compiled from earlier data statistics. After the stage type words are removed, a first preset number of keywords are extracted from the title by TFIDF as the title extraction keywords; these are sorted, and the sorted title extraction keywords serve as the bid title feature item to be transcoded. These steps shield the title feature from the word noise introduced by different websites.
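Steps S34 to S36 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the embodiment's implementation: titles are assumed to be already segmented into token lists with stage words removed, a smoothed IDF formula is chosen for convenience, and plain string ordering stands in for the pinyin-initial order. The function name `title_feature` is hypothetical.

```python
import math
from collections import Counter


def title_feature(title_tokens, corpus_tokens, top_k=3):
    """Pick the top_k tokens of a title by TF-IDF and return them in sorted order.

    title_tokens: tokens of one title (already segmented, stage words removed).
    corpus_tokens: list of token lists, one per title, used for document frequency.
    """
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each token once per title

    term_freq = Counter(title_tokens)
    total = len(title_tokens)
    scores = {
        word: (count / total) * math.log((1 + n_docs) / (1 + doc_freq[word]))
        for word, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically so the result is deterministic.
    top = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    # Assumption: plain string order stands in for the pinyin-initial order.
    return sorted(top)
```

Because the selected keywords are re-sorted before use, two titles that contain the same informative words in a different order produce the same title feature.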
As an exemplary embodiment, step S4, obtaining the content feature corresponding to each bidding data according to its bid content, includes steps S41 to S46.
Step S41: segment the bid content of each bidding data to obtain content word segments.
In this embodiment, texts that describe the same target are not necessarily identical as whole articles; for collection reasons, for example, the typesetting format or the header and footer may differ. Since bidding content is typically around 500 words long, content keywords are used to represent the target content.
In this embodiment, the bid content is segmented to obtain the content word segments corresponding to the bid content of each bidding data. The word segmentation method may be jieba word segmentation; in other embodiments, other existing word segmentation methods may be used. This is described only by way of example and is not limiting.
Step S42: remove the stop words from the content word segments according to a preset stop word dictionary.
In this embodiment, the preset stop word dictionary is obtained from statistics over a large amount of historical bidding content. Specifically, the preset stop word dictionary may be the Harbin Institute of Technology stop word list, the stop word list of the Machine Intelligence Laboratory of Sichuan University, or the Baidu stop word list. This is described only by way of example and is not limiting; in other embodiments, other stop word lists may be used and set as needed.
Step S43: count the word frequency of the content word segments after stop word removal, and take the second preset number of content word segments with the highest word frequency as the first content keywords.
In this embodiment, the second preset number may be 5, that is, the 5 word segments with the highest word frequency are selected as the first content keywords. In other embodiments, the second preset number may also be 4 or 6; it is described here only by way of example, is not limiting, and may be set as needed in practical applications.
In this embodiment, word frequency statistics are performed on the content word segments from which stop words have been removed, the word frequency of each word segment is calculated and compared, and the 5 word segments with the highest word frequency are selected as the first content keywords corresponding to the bid content.
Step S44: sort the content word segments after stop word removal by word length, and take the third preset number of longest content word segments as the second content keywords.
In this embodiment, the third preset number may be 10; in other embodiments, it may also be 8 or 12. It is described here only by way of example, is not limiting, and may be set as needed in practical applications.
In this embodiment, the word length of each content word segment remaining after stop word removal is counted, and the 10 longest content word segments are used as the second content keywords.
Step S45: take the keywords that appear in both the first content keywords and the second content keywords as the content extraction keywords.
In this embodiment, the first content keywords and the second content keywords are compared, and the keywords that appear in both are used as the content extraction keywords corresponding to the bid content.
Step S46: sort the content extraction keywords in a second preset order to obtain the content ordering keywords, and use the content ordering keywords as the content feature corresponding to each bidding data.
In this embodiment, the second preset order may be the order of the pinyin initials of the Chinese characters; in other embodiments, the second preset order may be the order of the stroke counts of the Chinese characters. This is described only by way of example and is not limiting.
In this embodiment, the content extraction keywords are arranged in the second preset order; the arranged keywords are the content ordering keywords corresponding to the bid content and are used as the content feature of the bid content.
In the above steps, the target content is first segmented into words, and stop words are removed so that the remaining words carry strong bidding features. The word frequency of each word segment is calculated, and the second preset number of word segments with the highest word frequency are taken as the first content keywords; the word segments are also sorted by word length, and the third preset number of longest word segments are taken as the second content keywords. The words that appear in both the first and the second content keywords are taken as the content extraction keywords of the target. These are sorted, and the sorted content extraction keywords serve as the bid content feature item to be transcoded. These steps shield the content feature from the interference introduced by different websites.
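The content feature pipeline of steps S42 to S46 can be illustrated with the following sketch. It assumes the content has already been segmented into tokens; the function name `content_feature`, the alphabetical tie-breaking, and the use of plain string order in place of the pinyin-initial order are assumptions for the example.

```python
from collections import Counter


def content_feature(tokens, stop_words, top_freq=5, top_len=10):
    """Return the sorted tokens that are both among the most frequent and the longest."""
    kept = [t for t in tokens if t not in stop_words]  # S42: drop stop words
    freq = Counter(kept)

    # S43: top_freq tokens by word frequency (ties broken alphabetically).
    by_freq = sorted(freq, key=lambda w: (-freq[w], w))[:top_freq]
    # S44: top_len distinct tokens by word length (ties broken alphabetically).
    by_len = sorted(freq, key=lambda w: (-len(w), w))[:top_len]

    # S45: keep only tokens present in both lists.
    common = set(by_freq) & set(by_len)
    # S46: assumption: plain string order stands in for the preset order.
    return sorted(common)
```

Intersecting the two lists filters out words that are frequent but short and generic, as well as words that are long but appear only incidentally.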
As an exemplary embodiment, step S8, obtaining the data code corresponding to each bidding data according to its title feature, content feature, number feature, unit name feature, and stage type feature, includes steps S81 and S82.
Step S81: splice the title feature, content feature, number feature, unit name feature, and stage type feature corresponding to each bidding data in a first preset splicing order to obtain a first splicing feature.
In this embodiment, the first preset splicing order may be title feature W1, content feature W2, number feature W3, unit name feature W4, and stage type feature W5; in other embodiments, the first preset splicing order may be another order, such as number feature W3, unit name feature W4, title feature W1, content feature W2, and stage type feature W5. This is described only by way of example, is not limiting, and may be set as needed in practical applications.
In this embodiment, all the features are spliced as character strings, without vector conversion, and the first splicing feature is result1 = W1 + W2 + W3 + W4 + W5.
Step S82: encode the first splicing feature to obtain the data code of each bidding data.
In this embodiment, the encoding mode is MD5; in other embodiments, other existing methods may be used, such as SHA-256 or HMAC. This is described only by way of example, is not limiting, and may be set as needed in practical applications.
In this embodiment, the first splicing feature is result1 = W1 + W2 + W3 + W4 + W5, and the data code obtained by encoding it is unique_code = md5(result1).
In the above steps, the features of each dimension corresponding to each bidding data are spliced directly, without vector conversion, to obtain the first splicing feature, which is encoded to obtain the data code. This coding scheme is designed for the specific form of bidding data. The data code is stored in the database as a label, so duplicates can subsequently be identified by comparing data codes without further computation, which improves deduplication efficiency.
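A minimal sketch of steps S81 and S82, assuming each feature has already been reduced to a string and that the text is encoded as UTF-8 before hashing (the encoding is an assumption; the function name `data_code` is hypothetical):

```python
import hashlib


def data_code(w1, w2, w3, w4, w5):
    """Splice the five feature strings in a fixed order and return the MD5 hex digest."""
    result1 = w1 + w2 + w3 + w4 + w5  # plain string concatenation, no vector conversion
    return hashlib.md5(result1.encode("utf-8")).hexdigest()
```

The digest is a fixed-length 32-character label, so equality of two data codes can be checked with an ordinary indexed string comparison in the database.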
As an exemplary embodiment, step S9, obtaining the group code corresponding to each bidding data according to its title feature, content feature, number feature, and unit name feature, includes steps S91 and S92.
Step S91: splice the title feature, content feature, number feature, and unit name feature corresponding to each bidding data in a second preset splicing order to obtain a second splicing feature.
In this embodiment, the second preset splicing order may be title feature W1, content feature W2, number feature W3, and unit name feature W4; in other embodiments, the second preset splicing order may be another order, such as number feature W3, unit name feature W4, title feature W1, and content feature W2. This is described only by way of example, is not limiting, and may be set as needed in practical applications.
In this embodiment, these features are spliced as character strings, without vector conversion, and the second splicing feature is result2 = W1 + W2 + W3 + W4.
Step S92: encode the second splicing feature to obtain the group code of each bidding data.
In this embodiment, the encoding mode is MD5; in other embodiments, other existing methods may be used, such as SHA-256 or HMAC. This is described only by way of example, is not limiting, and may be set as needed in practical applications.
In this embodiment, the second splicing feature is result2 = W1 + W2 + W3 + W4, and the group code obtained by encoding it is group_code = md5(result2). The group code does not take the target stage type feature W5 into account, so all records of the same target fall into one group; using the group code group_code together with the data code unique_code reveals the life cycle of the same target.
In the above steps, several features corresponding to each bidding data are spliced directly, without vector conversion, to obtain the second splicing feature, which is encoded to obtain the group code. The group code does not consider the stage type feature of the target and is added to the database as a label, so the same target can subsequently be viewed at its different stages according to the group code.
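The relationship between the two codes can be seen in the short sketch below, which computes both for one record. The helper names `md5_hex` and `mark_codes` and the UTF-8 encoding are assumptions for illustration.

```python
import hashlib


def md5_hex(text):
    """MD5 hex digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def mark_codes(w1, w2, w3, w4, w5):
    """Return (unique_code, group_code) for one bidding record."""
    unique_code = md5_hex(w1 + w2 + w3 + w4 + w5)  # includes stage type W5
    group_code = md5_hex(w1 + w2 + w3 + w4)        # ignores stage type W5
    return unique_code, group_code
```

A bid announcement and a bid-winning announcement for the same target differ only in W5, so they receive the same group_code and different unique_code values.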
As an exemplary embodiment, after step S10, obtaining the deduplication mark code corresponding to each bidding data according to its data code and group code, the method further includes steps S11 and S12.
Step S11: obtain the deduplication requirement.
In this embodiment, the deduplication requirement is determined by the customer's needs. Specifically, the requirement may be to deduplicate according to the data code in the deduplication mark code, or to deduplicate according to both the group code and the data code in the deduplication mark code.
Step S12: deduplicate the bidding data according to the deduplication requirement and the deduplication mark code corresponding to each bidding data to obtain the deduplicated bidding data.
In this embodiment, every bidding data in the bidding database carries both the data code unique_code and the group code group_code.
When the requirement is to deduplicate according to the data code in the deduplication mark code, only the data code is used to deduplicate the bidding data.
When the requirement is to deduplicate according to the group code and the data code in the deduplication mark code, both are used. The group code group_code is used first to group the data, so that data belonging to the same target are found and the different stages of the same target fall into one group; the data code unique_code is then used to group the data within each group, which reveals the life cycle of the target.
In the above steps, the bidding data are deduplicated according to the deduplication requirement and the deduplication mark code corresponding to each bidding data, which improves the flexibility and diversity of data processing.
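The two-level grouping by group_code and then unique_code might be sketched as follows. Records are modelled as dictionaries with `group_code` and `unique_code` keys, which is an assumption for the example, as is the function name `lifecycle_view`; within each unique_code the first record encountered is kept.

```python
from collections import defaultdict


def lifecycle_view(records):
    """Group records by group_code, keeping one record per unique_code in each group."""
    groups = defaultdict(dict)
    for rec in records:
        # setdefault keeps the first record seen for each unique_code.
        groups[rec["group_code"]].setdefault(rec["unique_code"], rec)
    return {code: list(by_unique.values()) for code, by_unique in groups.items()}
```

Each resulting group then lists the distinct stages of one target, for example its bid announcement followed by its bid-winning announcement.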
As an exemplary embodiment, when the de-duplication requirement is to de-duplicate according to the data code in the de-duplication mark code, step S12, in which the bidding data is de-duplicated according to the de-duplication requirement and the de-duplication mark code corresponding to each bidding data to obtain de-duplicated bidding data, includes steps S121 to S124.
Step S121: the bidding data is sorted according to the data code.
In this embodiment, bidding data with the same data code is repeated data. The data codes of the bidding data are sorted so that bidding data with the same data code can be found and de-duplicated.
Step S122: the collection and storage time of each bidding data is acquired.
In this embodiment, the collection and storage time of each bidding data is recorded when the bidding data is collected and stored, and identical bidding data is later removed according to this storage time.
Step S123: bidding data with the same data code is sorted by collection and storage time.
In this embodiment, bidding data with the same data code is sorted by collection and storage time. The sort order may be from early to late or from late to early, and may be set reasonably according to actual needs.
Step S124: among bidding data with the same data code, the bidding data with the earliest collection and storage time is retained and taken as the de-duplicated bidding data.
In this embodiment, among a plurality of bidding data with the same data code, the bidding data with the earliest collection and storage time is retained and the other repeated data is removed; the retained bidding data is the de-duplicated bidding data.
According to the above steps, bidding data with the same data code is de-duplicated according to the collection and storage time, and the bidding data with the earliest collection and storage time is taken as the de-duplicated bidding data, so that repeated data is removed.
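Steps S121 to S124 can be illustrated with a minimal Python sketch. It assumes each bidding data is a dictionary; the field names `id` and `stored_at` are hypothetical and chosen only for illustration, while `unique_code` is the data code described in this embodiment.

```python
from datetime import datetime


def deduplicate_by_data_code(records):
    """Keep, for each data code, the record with the earliest storage time."""
    # S121 and S123: sort by data code, then by storage time (early to late).
    ordered = sorted(records, key=lambda r: (r["unique_code"], r["stored_at"]))
    kept = {}
    for record in ordered:
        # S124: the first record seen for a data code is the earliest one,
        # so later records with the same code are treated as duplicates.
        kept.setdefault(record["unique_code"], record)
    return list(kept.values())


# Hypothetical sample data: records 1 and 2 share the same data code.
records = [
    {"id": 1, "unique_code": "a1", "stored_at": datetime(2023, 5, 2, 9, 0)},
    {"id": 2, "unique_code": "a1", "stored_at": datetime(2023, 5, 1, 8, 0)},
    {"id": 3, "unique_code": "b2", "stored_at": datetime(2023, 5, 3, 7, 0)},
]
deduplicated = deduplicate_by_data_code(records)
```

In this sample, record 2 is retained for data code `a1` because it was stored earlier than record 1.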
A specific example is described in detail below. FIG. 2 is a flowchart of the operation decomposition, transcoding, and storage performed after text input.
First, the dimensions used to judge whether targets are repeated are determined. Several characteristics of the target content can indicate whether two targets are the same content. Considering the characteristics of targets, and because each target number is unique, the target number is selected as the judgment condition for the first dimension.
The second dimension is selected from the text content, starting with the title. Analysis of bidding data shows that repetition falls into two cases: in one, the data is completely identical; in the other, the data belongs to different stages of the same target. For example, two pieces of data may share the same target content, bid number and other related content and yet not count as repetition in business processing. This involves another mark code, group_code, which is used to link the different life-cycle stages of the same target so as to provide more value in subsequent data display.
Still within the second dimension, and considering the stage characteristics of targets, the stage-related keywords are removed when the title is processed. The removal uses a stage category dictionary table compiled from earlier data statistics (divided at the top level into announcements and bid-winning results, 20 types in all, with each category subdivided into negotiation, competition, price inquiry, correction, transaction, failed bid and the like), and keywords are then extracted from the title with the stage keywords removed. The title cannot be transcoded directly as a whole: when the same target information is collected from different websites, one or more words may be added or deleted, although the content is in fact the same target. In the actual database, such repeated information accounts for a relatively large proportion. The second processing step for the title is therefore keyword extraction: three keywords are extracted from the title by TFIDF and transcoded as the title feature items, which shields the word noise introduced by different websites.
The third dimension is the target content. Also because of how the data is collected, texts that describe the same target are not exactly identical as whole articles; for example, the typesetting format or the header and footer may differ. Since bidding data content is generally about 500 words long, the target content is represented by keywords.
The fourth dimension is the item type: whether the item types are the same assists in determining whether two items are the same target.
The fifth dimension is the bidding unit. The extracted bidding unit is used as a basis for judgment, since the bidding unit of the same target is generally the same entity.
The above five dimensions are used to determine whether a target is repeated.
Specifically, stage type words are removed from the title according to the stage type dictionary, three keywords are then extracted by TFIDF, and the three keywords are output as a list, denoted W1. The target content is segmented and stop words are removed; a bidding-related dictionary (obtained by data statistics) is added during word segmentation so that the segmented words carry strong bidding characteristics. Word frequencies are then calculated and the 5 keywords with the highest word frequency are taken; the words are also sorted by word length and the first 10 are taken; finally, the words that appear in both the word-frequency top 5 and the word-length top 10 are placed in a list and output as the target content keywords, denoted W2. The bid number is taken as W3, and the bidding unit as W4. Finally, the item type is output as W5; the item type is obtained by mapping, through the corresponding type dictionary table, the item type keyword that was retained when it was removed from the title.
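The extraction of the target content keywords W2 can be sketched as follows. This is a minimal illustration that assumes the content has already been segmented into a list of words; the function name, the alphabetical tie-breaking rule and the alphabetical output order are assumptions, not details given in this embodiment.

```python
from collections import Counter


def content_keywords(tokens, stop_words, top_freq=5, top_len=10):
    """Return words that are both among the most frequent and the longest."""
    words = [t for t in tokens if t not in stop_words]
    counts = Counter(words)
    # Top words by frequency; ties are broken alphabetically (assumption).
    by_freq = sorted(counts, key=lambda w: (-counts[w], w))[:top_freq]
    # Top distinct words by length; ties are broken alphabetically (assumption).
    by_len = sorted(counts, key=lambda w: (-len(w), w))[:top_len]
    # Keep only words present in both lists, in a fixed order so that the
    # later string splicing is reproducible.
    return sorted(set(by_freq) & set(by_len))


# Hypothetical pre-segmented content; smaller limits are used so that the
# intersection is visible on a short sample.
tokens = ["procurement", "of", "server", "equipment", "procurement",
          "server", "for", "the", "data", "center", "procurement"]
w2 = content_keywords(tokens, {"of", "for", "the"}, top_freq=3, top_len=3)
```

With the defaults `top_freq=5` and `top_len=10`, the function follows the limits stated above; a fixed output order matters because W2 is later spliced into a string and hashed.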
All the above features are spliced as character strings, without being converted to vectors: result1 = W1 + W2 + W3 + W4 + W5, and unique_code = md5(result1). Similarly, result2 = W1 + W2 + W3 + W4, and group_code = md5(result2). That is, the group code places all repeated content in one group regardless of the target stage type W5.
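The splicing and MD5 transcoding can be sketched with Python's standard `hashlib` module. How the keyword lists W1 and W2 are turned into strings is not specified here, so joining them without a separator is an assumption, as are the function name and the sample values.

```python
import hashlib


def make_codes(w1, w2, w3, w4, w5):
    """Return (unique_code, group_code) for one bidding data."""
    # Assumption: the keyword lists W1 and W2 are joined without a separator.
    result2 = "".join(w1) + "".join(w2) + w3 + w4
    result1 = result2 + w5  # the stage type W5 only enters the data code
    unique_code = hashlib.md5(result1.encode("utf-8")).hexdigest()
    group_code = hashlib.md5(result2.encode("utf-8")).hexdigest()
    return unique_code, group_code


# Hypothetical features of two stages of the same target.
u1, g1 = make_codes(["ct", "hospital", "scanner"], ["center", "procurement"],
                    "ZB-2023-001", "City Hospital", "announcement")
u2, g2 = make_codes(["ct", "hospital", "scanner"], ["center", "procurement"],
                    "ZB-2023-001", "City Hospital", "result")
```

The two records receive the same `group_code` but different `unique_code` values, because they differ only in W5.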
Both codes, unique_code and group_code, are marked on all data in the database. If grouping is performed by unique_code alone, data with the same unique_code in the database is placed in one group, and one record is selected from the group according to time, which completes the de-duplication.
If group_code is also used, the life cycle of the same target can be displayed: the data is first grouped by group_code, which places the different stages of the current target in one group, and is then grouped by unique_code within that group, so that the life cycle of the target is displayed.
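The two-level grouping can be sketched as follows, assuming each record is a dictionary that already carries both codes; the `id` field and the function name are hypothetical.

```python
from collections import defaultdict


def group_lifecycle(records):
    """Group record ids by group_code, then by unique_code within each group."""
    groups = defaultdict(lambda: defaultdict(list))
    for record in records:
        # Outer level: all stages of one target share a group_code.
        # Inner level: exact duplicates of one stage share a unique_code.
        groups[record["group_code"]][record["unique_code"]].append(record["id"])
    return {code: dict(stages) for code, stages in groups.items()}


# Hypothetical sample: records 1 and 3 duplicate one stage of target g1,
# record 2 is another stage of g1, and record 4 belongs to another target.
records = [
    {"id": 1, "group_code": "g1", "unique_code": "u1"},
    {"id": 2, "group_code": "g1", "unique_code": "u2"},
    {"id": 3, "group_code": "g1", "unique_code": "u1"},
    {"id": 4, "group_code": "g2", "unique_code": "u3"},
]
lifecycle = group_lifecycle(records)
```

Each outer key then corresponds to one target, and its inner keys correspond to the stages of that target's life cycle.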
This embodiment also provides a system for generating a data de-duplication mark code. The system is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
As shown in FIG. 3, the system for generating a data de-duplication mark code includes:
a first acquisition module 1 for acquiring a bidding data set including a plurality of acquired bidding data;
a first processing module 2 for obtaining a bid title, a bid content, a bid number, a bid unit name, and a bid stage type of each bid data from the bid data set;
a second processing module 3, configured to obtain a title feature corresponding to each bidding data according to the bidding title of each bidding data;
a third processing module 4, configured to obtain content features corresponding to each bidding data according to bidding content of each bidding data;
a fourth processing module 5, configured to obtain a number feature corresponding to each bidding data according to the bidding number of each bidding data;
a fifth processing module 6, configured to obtain a unit name feature corresponding to each bidding data according to the bidding unit name of each bidding data;
a sixth processing module 7, configured to obtain a stage type feature corresponding to each bidding data according to the bidding stage type of each bidding data;
a seventh processing module 8, configured to obtain a data code corresponding to each bidding data according to the title feature, the content feature, the number feature, the unit name feature, and the stage type feature of each bidding data;
an eighth processing module 9, configured to obtain a group code corresponding to each bidding data according to the title feature, the content feature, the number feature and the unit name feature of each bidding data;
and a ninth processing module 10, configured to obtain a de-duplication mark code corresponding to each bidding data according to the data code and the group code of each bidding data.
As an exemplary embodiment, the second processing module includes: a first acquisition unit, configured to acquire a preset stage type dictionary; a first processing unit, configured to remove the stage type words from the bidding title of each bidding data according to the preset stage type dictionary; a second processing unit, configured to segment the bidding titles from which the stage type words have been removed, so as to obtain title word segments corresponding to each bidding data; a third processing unit, configured to calculate a TFIDF value of each word in the title word segments corresponding to each bidding data; a fourth processing unit, configured to take a first preset number of word segments with the highest TFIDF values as title extraction keywords; and a fifth processing unit, configured to sort the title extraction keywords according to a first preset order to obtain title sorting keywords, and use the title sorting keywords as the title feature corresponding to each bidding data.
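The units of the second processing module can be sketched as a single function. The sketch assumes that titles are already segmented, that the stage type words appear as separate word segments, and that TFIDF uses a smoothed inverse document frequency computed over a small reference corpus; the exact TFIDF variant, the alphabetical "first preset order" and all names and sample data are assumptions.

```python
import math
from collections import Counter


def title_keywords(title_tokens, corpus, stage_words, k=3):
    """Return the k title words with the highest TF-IDF, in a fixed order."""
    # Remove stage type words (for example "announcement" or "result").
    tokens = [t for t in title_tokens if t not in stage_words]
    tf = Counter(tokens)
    n_docs = len(corpus)

    def idf(word):
        # Smoothed inverse document frequency (assumed variant).
        df = sum(1 for doc in corpus if word in doc)
        return math.log((1 + n_docs) / (1 + df)) + 1

    scores = {w: (tf[w] / len(tokens)) * idf(w) for w in tf}
    # Highest score first; ties are broken alphabetically (assumption).
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Assumed "first preset order": alphabetical.
    return sorted(ranked[:k])


# Hypothetical reference corpus of segmented titles, stored as word sets.
corpus = [
    {"city", "hospital", "ct", "scanner", "procurement", "announcement"},
    {"city", "school", "desk", "procurement", "result"},
    {"county", "road", "repair", "procurement", "announcement"},
]
w1 = title_keywords(
    ["city", "hospital", "ct", "scanner", "procurement", "announcement"],
    corpus,
    stage_words={"announcement", "result"},
)
```

Words that occur in many titles, such as "procurement", score low and are dropped, which is how the keyword step masks wording differences between websites.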
As an exemplary embodiment, the third processing module includes: a sixth processing unit, configured to segment the bidding content of each bidding data, so as to obtain content word segments; a seventh processing unit, configured to remove stop words from the content word segments according to a preset stop word dictionary; an eighth processing unit, configured to perform word frequency statistics on the content word segments from which the stop words have been removed, and take a second preset number of content word segments with the highest word frequency as first content keywords; a ninth processing unit, configured to sort the content word segments from which the stop words have been removed by word length, and take a third preset number of content word segments with the greatest word length as second content keywords; a tenth processing unit, configured to extract, as content extraction keywords, the keywords that appear in both the first content keywords and the second content keywords; and an eleventh processing unit, configured to sort the content extraction keywords according to a second preset order to obtain content sorting keywords, and use the content sorting keywords as the content feature corresponding to each bidding data.
As an exemplary embodiment, the seventh processing module includes: a twelfth processing unit, configured to splice the title feature, the content feature, the number feature, the unit name feature and the stage type feature corresponding to each bidding data according to a first preset splicing order, so as to obtain a first splicing feature; and a thirteenth processing unit, configured to encode and encrypt the first splicing feature to obtain the data code of each bidding data.
As an exemplary embodiment, the eighth processing module includes: a fourteenth processing unit, configured to splice the title feature, the content feature, the number feature and the unit name feature corresponding to each bidding data according to a second preset splicing order, so as to obtain a second splicing feature; and a fifteenth processing unit, configured to encode and encrypt the second splicing feature to obtain the group code of each bidding data.
As an exemplary embodiment, the system further includes: a second acquisition module, configured to acquire the de-duplication requirement; and a tenth processing module, configured to de-duplicate the bidding data according to the de-duplication requirement and the de-duplication mark code corresponding to each bidding data, so as to obtain de-duplicated bidding data.
As an exemplary embodiment, when the de-duplication requirement is to de-duplicate according to the data code in the de-duplication mark code, the tenth processing module includes: a sixteenth processing unit, configured to sort the bidding data according to the data code; a second acquisition unit, configured to acquire the collection and storage time of each bidding data; a seventeenth processing unit, configured to sort bidding data with the same data code by collection and storage time; and an eighteenth processing unit, configured to retain, among bidding data with the same data code, the bidding data with the earliest collection and storage time, and take that bidding data as the de-duplicated bidding data.
The data deduplication marker code generation system in this embodiment is presented in the form of functional units, where units refer to ASIC circuits, processors and memories that execute one or more software or firmware programs, and/or other devices that may provide the functionality described above.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the invention also provides an electronic device, as shown in fig. 4, which includes one or more processors 71 and a memory 72, and in fig. 4, one processor 71 is taken as an example.
The electronic device may further include: an input device 73 and an output device 74.
The processor 71, memory 72, input device 73 and output device 74 may be connected by a bus or in another manner; connection by a bus is taken as an example in FIG. 4.
The processor 71 may be a central processing unit (Central Processing Unit, CPU). The processor 71 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations of the above. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 72 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the data deduplication marker code generation method in the embodiments of the present application. The processor 71 executes various functional applications of the server and data processing, i.e., implements the data deduplication marker code generation method of the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 72.
Memory 72 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of a processing device operated by the server, or the like. In addition, memory 72 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 72 may optionally include memory located remotely from processor 71, such remote memory being connectable to the network connection device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72 that, when executed by the one or more processors 71, perform the method shown in fig. 1.
It will be appreciated by those skilled in the art that all or part of the above-described embodiment method may be implemented by a computer program instructing related hardware. The program may be stored in a computer readable storage medium and, when executed, may include the flow of the above-described embodiment of the data de-duplication mark code generation method. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kinds described above.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.