CN109191167A - Method and device for mining target users - Google Patents

Method and device for mining target users

Info

Publication number
CN109191167A
Authority
CN
China
Prior art keywords
sample
text
log
user
text log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810784513.4A
Other languages
Chinese (zh)
Inventor
陈明星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810784513.4A
Publication of CN109191167A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0203 Market surveys; Market polls

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification provide a method and device for mining target users. The method includes: for a user to be identified, collecting the text media data that the user published within a preset time period, the text media data including a plurality of text logs; for each text log, calculating the feature vector corresponding to that text log; inputting the feature vector into a pre-trained log identification model, which outputs the probability value corresponding to the text log; and determining the probability value corresponding to the user from the probability values of the plurality of text logs, so as to determine from that probability value whether the user is a target user.

Description

Method and device for mining target users
Technical Field
This disclosure relates to the field of payment technology, and in particular to a method and device for mining target users.
Background
In many business scenarios, a subset of users needs to be mined so that specific business measures can be applied to that subset. For example, in an application scenario that sets a payment password policy, different password policies can be used depending on how strongly a user demands a sense of security. If a user attaches particular importance to payment security, that user can be called a high-security-demand user, and two-factor identity verification can be used for such users; conversely, if a user's demand for a sense of security is lower, single-factor identity verification can be used. To identify the degree of each user's security demand, business rules can be set according to business experience. For example, if a user checks the security center frequently within a period of time and often reports security-related problems, the user can be identified as a high-security-demand user.
Summary of the Invention
In view of this, one or more embodiments of this specification provide a method and device for mining target users, so as to improve the accuracy of target user mining.
Specifically, one or more embodiments of this specification are implemented through the following technical solutions:
In a first aspect, a method for mining target users is provided, the method comprising:
for a user to be identified, collecting the text media data that the user published within a preset time period, the text media data including a plurality of text logs;
for each text log, calculating the feature vector corresponding to the text log;
inputting the feature vector into a pre-trained log identification model and outputting the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
determining the probability value corresponding to the user from the probability values of the plurality of text logs, so as to determine from that probability value whether the user is a target user.
In a second aspect, a training method for a log identification model is provided, the method comprising:
obtaining black/white sample log data;
for each text log in the black/white sample log data, calculating the feature vector corresponding to the text log: performing word segmentation on the text log to obtain a plurality of words; calculating the word vector of each word with a word embedding algorithm; and obtaining the feature vector of the text log from the word vectors of its words;
using the feature vectors of the black/white sample log data, training a supervised binary classification model as the log identification model.
In a third aspect, a device for mining target users is provided, the device comprising:
a data acquisition module, configured to collect, for a user to be identified, the text media data that the user published within a preset time period, the text media data including a plurality of text logs;
a vector calculation module, configured to calculate, for each text log, the feature vector corresponding to the text log;
a model prediction module, configured to input the feature vector into a pre-trained log identification model and output the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
a user data module, configured to determine the probability value corresponding to the user from the probability values of the plurality of text logs, so as to determine from that probability value whether the user is a target user.
In a fourth aspect, a training device for a log identification model is provided, the device comprising:
a sample acquisition module, configured to obtain black/white sample log data;
a vector processing module, configured to calculate, for each text log in the black/white sample log data, the feature vector corresponding to the text log: performing word segmentation on the text log to obtain a plurality of words; calculating the word vector of each word with a word embedding algorithm; and obtaining the feature vector of the text log from the word vectors of its words;
a model training module, configured to train, using the feature vectors of the black/white sample log data, a supervised binary classification model as the log identification model.
In a fifth aspect, equipment for mining target users is provided. The equipment includes a memory, a processor, and computer instructions stored in the memory and executable on the processor; when executing the instructions, the processor performs the following steps:
for a user to be identified, collecting the text media data that the user published within a preset time period, the text media data including a plurality of text logs;
for each text log, calculating the feature vector corresponding to the text log;
inputting the feature vector into a pre-trained log identification model and outputting the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
determining the probability value corresponding to the user from the probability values of the plurality of text logs, so as to determine from that probability value whether the user is a target user.
The method and device for mining target users of one or more embodiments of this specification train the model and identify logs through text feature extraction based on word vectors. Unlike traditional machine learning approaches, they do not require a large amount of time to construct manual features and perform feature selection: the user's features are obtained directly from the text media data that the user published. Features obtained this way effectively reduce the problem of poor model performance caused by an insufficient understanding of the business, and improve the performance and accuracy of the model.
Brief Description of the Drawings
To describe the technical solutions in one or more embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in one or more embodiments of this specification; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow of sample acquisition provided by one or more embodiments of this specification;
Fig. 2 is a schematic diagram of sampling provided by one or more embodiments of this specification;
Fig. 3 is a flow of text log vectorization provided by one or more embodiments of this specification;
Fig. 4 is a flow of the method for mining target users provided by one or more embodiments of this specification;
Fig. 5 is a device for mining target users provided by one or more embodiments of this specification;
Fig. 6 is a training device for a log identification model provided by one or more embodiments of this specification.
Detailed Description
To help those skilled in the art better understand the technical solutions in one or more embodiments of this specification, these technical solutions are described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of this specification, not all of them. All other embodiments obtained by those of ordinary skill in the art from one or more embodiments of this specification without creative effort shall fall within the protection scope of this disclosure.
The method for mining target users provided by at least one embodiment of this specification is based on a word embedding algorithm: a log published by a user is converted into a feature vector, the feature vector is input into a pre-trained log identification model, and the model's analysis of the log determines whether the user who published it belongs to the target users to be mined, for example, high-security-demand users.
The target users to be mined include, but are not limited to, high-security-demand users, and can be determined according to business needs. For example, in an application scenario that sets a payment password policy, different password policies can be used for users with different security demands: two-factor identity verification for high-security-demand users, and single-factor identity verification for low-security-demand users. For the business needs of this scenario, it is therefore necessary to mine which users are high-security-demand users and which are low-security-demand users, and either group can be taken as the target users to be mined. In other application scenarios, other types of users can of course be set as target users according to business needs.
The mining method is described below, taking high-security-demand users as an example:
The method for mining target users of at least one embodiment of this specification includes two parts: "model training" and "model application". The model in "model training" identifies the type of the text media data that a user published, and that type in turn reflects what kind of user published the data. "Model application" uses the trained model to identify text media data and, through that identification, determines whether the user who published the data is a target user.
Model training
First, acquisition of training samples.
Users publish text media data. When collecting it, the data collected can be data related to the business associated with the target users being mined. For example, the high-security-demand users to be mined have a high demand for a sense of security with respect to "Alipay security". Therefore, all text media data about Alipay on social media can be filtered out, including but not limited to the following user data: Alipay feedback data, Alipay incoming-call text data, Alipay-related microblog posts and comments on social platforms, Zhihu article content, official-account content, articles read, and so on.
The text media data may include a plurality of text logs; for example, there may be three logs from microblog posts, five logs from official accounts, six logs from Zhihu articles, and so on. For model training, a certain amount of black/white sample log data can be obtained. For example, black sample log data can be logs showing that the user has a high security demand, and white sample log data can be logs showing that the user has a low security demand.
For example, the black/white sample log data can be obtained through active learning. Fig. 1 illustrates the sample acquisition flow, which may include:
In step 100, a small portion of samples is randomly selected and labeled with black/white attributes to obtain first labeled samples.
For example, this step can randomly select a very small number of samples and manually label their black/white attributes; the resulting labeled samples can be called first labeled samples. The black/white attribute can be determined by whether the sample shows that the user who published the data has a high security demand: if the published log shows that the user's demand for a sense of security is very high, the sample can be labeled black; otherwise, it can be labeled white. This step can be called the Mark Instances part.
In step 102, a sample classification model is trained on the first labeled samples.
This step trains the sample classification model on the first labeled samples obtained in step 100. The sample classification model outputs a sample score, which serves as the basis for assessing the black/white attribute of a text log. For example, a higher sample score means a higher probability that the user corresponding to the sample has a high security demand; conversely, a lower sample score means that this probability is lower. In addition, the model structure of the sample classification model can differ from that of the log identification model described later. This step can be called the Update Model part.
In step 104, the sample classification model is used to identify the other, unlabeled samples and obtain their sample scores.
This step uses the trained sample classification model to score the samples that have not yet been labeled. This step can be called the Predict unlabel Instances part.
In step 106, target samples are extracted according to the sample scores and labeled with black/white attributes to obtain second labeled samples.
This step is the Select unlabel instances part.
Referring to Fig. 2, the text logs in the text media data published by users may belong to different media types. The unlabeled text media data can be clustered with a clustering algorithm and divided into text logs of different media types, for example, class C1 logs, class C2 logs, and class C3 logs. These media types may include, for example, a microblog class, an official-account content class, an incoming-call text class, and so on. Clustering increases the diversity of the samples and thereby improves the performance of model training. Alternatively, target samples can be extracted from the mixed samples according to the sample scores without clustering.
As shown in Fig. 2, target samples can be extracted from the text logs of each media type according to the sample scores. The target samples include the text logs ranked in the top preset number of positions (for example, Top N) and a portion of text logs randomly selected from the remaining text logs (Random Select K). These N logs plus K logs are the extracted target samples, which can be handed to experts for black/white attribute labeling to obtain the second labeled samples.
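The per-type extraction described above (Top N plus Random Select K) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `select_target_samples`, the `(log_id, media_type, score)` tuple layout, and the parameters `top_n` and `random_k` are hypothetical, and each log's media type (cluster label) is assumed to have been assigned already.

```python
import random
from collections import defaultdict


def select_target_samples(scored_logs, top_n, random_k, rng=None):
    """Pick Top N plus Random K logs within each media type.

    scored_logs: iterable of (log_id, media_type, score) tuples (assumed layout).
    Returns the list of selected log ids.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable

    # Group the scored logs by media type (cluster).
    by_type = defaultdict(list)
    for log_id, media_type, score in scored_logs:
        by_type[media_type].append((score, log_id))

    selected = []
    for media_type in sorted(by_type):
        # Rank the logs of this type by sample score, highest first.
        ranked = sorted(by_type[media_type], key=lambda pair: pair[0], reverse=True)
        top = ranked[:top_n]
        rest = ranked[top_n:]
        # Randomly draw up to K logs from everything outside the top N.
        extra = rng.sample(rest, min(random_k, len(rest)))
        selected.extend(log_id for _, log_id in top + extra)
    return selected
```

Mixing a random draw into the selection keeps some lower-scoring logs in the labeling queue, which is one way to get the sample diversity the text mentions.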
In step 108, the second labeled samples and the first labeled samples are combined, and the flow returns to step 102 to continue training and updating the sample classification model.
The updated sample classification model then continues to identify the remaining unlabeled samples; that is, steps 102 to 106 are executed again. This is a continuously iterating loop that stops when the number of second labeled samples reaches a preset number of black/white samples.
Through the active-learning sample acquisition flow above, a certain number of logs indicating a lack of sense of security and logs indicating no such lack can be obtained. The former can serve as black sample log data and the latter as white sample log data. Moreover, with active learning, more targeted samples can be picked out for manual labeling, which effectively reduces manual labeling time, and clustering helps increase the diversity of the labeled samples.
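The loop of steps 100 to 108 can be sketched as follows. This is a simplified illustration, not the patent's implementation: `collect_labeled_samples`, `ask_expert`, `train_scorer`, `seed_size`, `batch_size`, and `target_second_labeled` are hypothetical names, the expert labeling and the sample classification model are passed in as plain callables, and the selection step simply takes the highest-scoring batch rather than the per-type Top N plus Random K extraction.

```python
import random


def collect_labeled_samples(pool, ask_expert, train_scorer,
                            seed_size, batch_size, target_second_labeled,
                            rng=None):
    """Active-learning loop that returns a list of (sample, label) pairs.

    ask_expert(sample) -> label         stands in for manual black/white labeling.
    train_scorer(labeled) -> score_fn   stands in for training the sample
                                        classification model; score_fn(sample)
                                        returns the sample score.
    """
    if rng is None:
        rng = random.Random(0)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Step 100: label a small random portion (first labeled samples).
    labeled = [(sample, ask_expert(sample)) for sample in unlabeled[:seed_size]]
    unlabeled = unlabeled[seed_size:]

    second_labeled = 0
    while second_labeled < target_second_labeled and unlabeled:
        # Step 102: train or update the sample classification model.
        score = train_scorer(labeled)
        # Step 104: score the samples that are still unlabeled.
        unlabeled.sort(key=score, reverse=True)
        # Step 106: extract target samples and have them labeled.
        batch = unlabeled[:batch_size]
        unlabeled = unlabeled[batch_size:]
        # Step 108: merge the second labeled samples with the existing ones.
        labeled.extend((sample, ask_expert(sample)) for sample in batch)
        second_labeled += len(batch)
    return labeled
```

The loop also stops when the pool runs out, a guard added here so the sketch always terminates.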
Second, feature representation of training samples.
For each text log in the black/white sample log data, the corresponding feature vector can be calculated. Fig. 3 illustrates the text log vectorization flow, which may include:
In step 300, word segmentation is performed on the text log to obtain a plurality of words.
For example, word segmentation of a text log TXT yields a plurality of words: word1, word2, ..., wordk.
In step 302, the word vector of each word is calculated with a word embedding algorithm. For example, algorithms such as word2vec or fastText can be used to calculate the word vector <f1, f2, f3, ..., fk> of each word.
In step 304, the feature vector of the text log is obtained from the word vectors of its words.
For example, the word vectors of the words in a text log can be averaged to obtain the feature vector of the text log.
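The averaging in step 304 can be sketched as follows. This is a minimal illustration under stated assumptions: `log_feature_vector` is a hypothetical name, `word_vectors` is a plain dictionary standing in for a trained word2vec or fastText model, words missing from that dictionary are skipped, and a log with no known words maps to a zero vector. Neither fallback is specified in the source.

```python
def log_feature_vector(words, word_vectors, dim):
    """Average the word vectors of a segmented text log.

    words:        list of words produced by word segmentation (step 300).
    word_vectors: dict mapping word -> list of floats of length dim (step 302).
    """
    # Keep only the words that have a vector (assumed handling of unknown words).
    vectors = [word_vectors[word] for word in words if word in word_vectors]
    if not vectors:
        return [0.0] * dim  # assumed fallback for logs with no known words
    # Step 304: element-wise mean over the word vectors.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings, purely illustrative.
toy_vectors = {"payment": [1.0, 3.0], "password": [3.0, 5.0]}
feature = log_feature_vector(["payment", "password", "unknown"], toy_vectors, dim=2)
```

Averaging produces a fixed-length vector regardless of how many words the log contains, which is what lets one classifier handle logs of any length.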
Third, model training.
Each text log in the training samples is converted into its feature vector through the flow of Fig. 3, and a supervised binary classification model is trained on the feature vectors of the black/white sample log data as the log identification model. The binary classification model may be, for example, logistic regression, a decision tree, or a random forest.
The black/white sample log data obtained through active learning can be randomly divided into three parts: 60% as the training set, 20% as the validation set, and the remaining 20% as the test set. The supervised binary classification model is fitted on the training set, the validation set is used to select the model parameters that perform best on it, and the model is finally tested on the test set.
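A random 60/20/20 split of the labeled data can be sketched as follows. The function name `split_samples` and the `seed` parameter are assumptions for illustration; integer arithmetic is used so that the three parts always cover every sample exactly once.

```python
import random


def split_samples(samples, seed=0):
    """Randomly split samples into training (60%), validation (20%), and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, roughly 20%
    return train, val, test
```

Keeping the test set untouched until the end gives an estimate of model quality that is not biased by the parameter selection done on the validation set.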
The input of the log identification model can be the feature vector of one text log, and the output can be the probability value of that text log, which indicates the probability that the user who published the text log is a target user (e.g., a high-security-demand user).
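As one possible instance of such a model, the sketch below trains a logistic regression with plain stochastic gradient descent, taking a feature vector as input and returning a probability. It is an illustration only: the names `train_logistic_regression` and `predict_proba` and the hyperparameters `epochs` and `lr` are assumptions, black samples are encoded as label 1 and white samples as label 0, and a production system would more likely use an established machine learning library.

```python
import math


def predict_proba(weights, bias, x):
    """Return the probability that feature vector x belongs to a target user."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well inside float range
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, epochs=200, lr=0.5):
    """Fit a logistic regression with stochastic gradient descent.

    features: list of feature vectors (lists of floats).
    labels:   list of 0/1 labels (1 = black sample, 0 = white sample; assumed encoding).
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            # Gradient of the log loss with respect to z is (p - y).
            error = predict_proba(weights, bias, x) - y
            for i in range(dim):
                weights[i] -= lr * error * x[i]
            bias -= lr * error
    return weights, bias


# Toy one-dimensional data, purely illustrative.
toy_weights, toy_bias = train_logistic_regression(
    [[0.0], [0.2], [0.8], [1.0]], [0, 0, 1, 1])
```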
In addition, the log identification model can be updated regularly with continuously refreshed text media data, so that the model becomes more accurate and performs better.
Model application
The trained log identification model can be used to identify the probability that the user corresponding to a log lacks a sense of security, and high-security-demand users are mined accordingly. Fig. 4 illustrates the method for mining target users of at least one embodiment, which may include:
In step 400, for a user to be identified, the text media data that the user published within a preset time period is collected, the text media data including a plurality of text logs.
For example, a plurality of text logs published by a user within a period of time can be obtained, including blog content, content published on official accounts, and so on.
In step 402, for each text log, the feature vector corresponding to the text log is calculated. For example, the feature vector of a text log can be obtained through the flow of Fig. 3.
In step 404, the feature vector is input into the pre-trained log identification model, which outputs the probability value corresponding to the text log.
In step 406, the probability value corresponding to the user is determined from the probability values of the plurality of text logs, so as to determine from that probability value whether the user is a target user.
For example, the maximum of the probability values of the text logs collected in step 400 can be taken as the probability value of the user, and whether the user is a target user is determined from that value. For example, if the probability value is higher than a certain threshold, the user can be determined to be a high-security-demand user.
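The max-based aggregation of step 406 can be sketched as follows. The function names `user_probability` and `is_target_user` and the default threshold of 0.8 are assumptions for illustration; the source does not state a threshold value, nor how a user with no logs should be handled.

```python
def user_probability(log_probabilities):
    """Take the maximum per-log probability as the user's probability value."""
    if not log_probabilities:
        return 0.0  # assumed handling for a user with no text logs
    return max(log_probabilities)


def is_target_user(log_probabilities, threshold=0.8):
    """Decide whether the user is a target user; the threshold is a placeholder."""
    return user_probability(log_probabilities) > threshold
```

Taking the maximum means a single strongly indicative log is enough to flag a user; other aggregations such as the mean would be more conservative, but the example in the text uses the maximum.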
The target user mining method above trains the model and identifies logs through text feature extraction based on word vectors. Unlike traditional machine learning approaches, it does not require a large amount of time to construct manual features and perform feature selection: the user's features are obtained directly from the text media data that the user published. Features obtained this way effectively reduce the problem of poor model performance caused by an insufficient understanding of the business, and improve the performance and accuracy of the model.
To implement the method for mining target users above, at least one embodiment of this specification provides a device for mining target users. As shown in Fig. 5, the device may include a data acquisition module 51, a vector calculation module 52, a model prediction module 53, and a user data module 54.
The data acquisition module 51 is configured to collect, for a user to be identified, the text media data that the user published within a preset time period, the text media data including a plurality of text logs;
the vector calculation module 52 is configured to calculate, for each text log, the feature vector corresponding to the text log;
the model prediction module 53 is configured to input the feature vector into a pre-trained log identification model and output the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
the user data module 54 is configured to determine the probability value corresponding to the user from the probability values of the plurality of text logs, so as to determine from that probability value whether the user is a target user.
In one example, the vector calculation module 52 is specifically configured to: perform word segmentation on the text log to obtain a plurality of words; calculate the word vector of each word with a word embedding algorithm; and average the word vectors of the words to obtain the feature vector of the text log.
To implement the training method for the log identification model above, at least one embodiment of this specification provides a training device for a log identification model. As shown in Fig. 6, the device may include a sample acquisition module 61, a vector processing module 62, and a model training module 63.
The sample acquisition module 61 is configured to obtain black/white sample log data;
the vector processing module 62 is configured to calculate, for each text log in the black/white sample log data, the feature vector corresponding to the text log: performing word segmentation on the text log to obtain a plurality of words; calculating the word vector of each word with a word embedding algorithm; and obtaining the feature vector of the text log from the word vectors of its words;
the model training module 63 is configured to train, using the feature vectors of the black/white sample log data, a supervised binary classification model as the log identification model.
In one example, when obtaining the black/white sample log data, the sample acquisition module 61 is configured to:
label randomly selected samples with black/white attributes to obtain first labeled samples;
train a sample classification model on the first labeled samples;
identify the other, unlabeled samples with the sample classification model to obtain sample scores, the sample scores serving as the basis for assessing the black/white attribute of a text log;
extract target samples according to the sample scores, and label the target samples with black/white attributes to obtain second labeled samples;
combine the second labeled samples with the first labeled samples, and continue training and updating the sample classification model;
continue identifying the remaining unlabeled samples with the updated sample classification model, until the number of second labeled samples reaches a preset number of black/white samples.
In one example, when extracting target samples according to the sample scores, the sample acquisition module 61 is configured to: sort the text logs of each media type by sample score; and, within the text logs of each media type, extract target samples according to the sample scores, where the target samples include the text logs ranked in the top preset number of positions and a portion of text logs extracted from the remaining text logs.
The device or module that above-described embodiment illustrates can specifically realize by computer chip or entity, or by having The product of certain function is realized.A kind of typically to realize that equipment is computer, the concrete form of computer can be personal meter Calculation machine, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation are set It is any several in standby, E-mail receiver/send equipment, game console, tablet computer, wearable device or these equipment The combination of equipment.
For convenience of description, it is divided into various modules when description apparatus above with function to describe respectively.Certainly, implementing this The function of each module can be realized in the same or multiple software and or hardware when specification one or more embodiment.
Each step in above-mentioned process as shown in the figure, execution sequence are not limited to the sequence in flow chart.In addition, each The description of a step can be implemented as software, hardware or its form combined, for example, those skilled in the art can be by it It is embodied as the form of software code, can is the computer executable instructions that can be realized the corresponding logic function of the step. When it is realized in the form of software, the executable instruction be can store in memory, and by the processor in equipment It executes.
For example, corresponding to the above method, one or more embodiments of this specification also provide a target user mining device. The device may include a processor, a memory, and computer instructions stored in the memory and executable on the processor; by executing the instructions, the processor implements the following steps:
For a user to be identified, collecting the text media data published by the user within a preset time period, the text media data including a plurality of text logs;
For each of the text logs, calculating the feature vector corresponding to the text log;
Inputting the feature vector into a pre-trained log identification model and outputting the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
Determining the probability value corresponding to the user according to the probability values of the plurality of text logs, and determining according to that probability value whether the user is a target user.
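The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: the toy `EMBEDDINGS` table, the whitespace `segment` function, the logistic-regression weights and bias standing in for the pre-trained log identification model, and the 0.5 threshold are all hypothetical. Averaging word vectors and taking the highest per-log probability follow the variants recited in claims 3 and 4; other aggregation choices would fit the same steps.

```python
import math

# Hypothetical toy embedding table; a real system would load trained word embeddings.
EMBEDDINGS = {
    "lost": [0.9, 0.1],
    "card": [0.8, 0.3],
    "nice": [0.1, 0.9],
    "day": [0.2, 0.8],
}


def segment(text):
    # Assumption: whitespace splitting stands in for a real word segmenter.
    return text.lower().split()


def feature_vector(text, dim=2):
    # Average the word vectors of the known words in one text log.
    vectors = [EMBEDDINGS[word] for word in segment(text) if word in EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def log_probability(vector, weights=(4.0, -4.0), bias=-1.0):
    # Stand-in for the pre-trained model: logistic regression with made-up parameters.
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def user_probability(text_logs):
    # Take the highest per-log probability as the user's probability value.
    return max((log_probability(feature_vector(t)) for t in text_logs), default=0.0)


def is_target_user(text_logs, threshold=0.5):
    # Assumption: a fixed threshold decides whether the user is a target user.
    return user_probability(text_logs) >= threshold
```

With the maximum as the aggregate, a single high-probability text log is enough to flag the user, which favors recall over precision; the threshold would need to be tuned against labeled data.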
It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.
Those skilled in the art will understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
One or more embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of this specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the data processing apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit them. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.

Claims (14)

1. A target user mining method, the method comprising:
For a user to be identified, collecting the text media data published by the user within a preset time period, the text media data including a plurality of text logs;
For each of the text logs, calculating the feature vector corresponding to the text log;
Inputting the feature vector into a pre-trained log identification model and outputting the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
Determining the probability value corresponding to the user according to the probability values of the plurality of text logs, so as to determine according to that probability value whether the user is a target user.
2. The method according to claim 1, wherein the plurality of text logs in the text media data include logs belonging to different media types.
3. The method according to claim 1, wherein calculating the feature vector corresponding to the text log comprises:
Performing word segmentation on the text log to obtain a plurality of words;
Calculating the word vector corresponding to each word using a word embedding algorithm;
Averaging the word vectors of the words to obtain the feature vector corresponding to the text log.
4. The method according to claim 1, wherein determining the probability value of the user according to the probability values of the plurality of text logs comprises:
Taking the highest of the probability values of the plurality of text logs as the probability value of the user.
5. The method according to claim 1, wherein
the target user is a user with a high demand for a sense of security.
6. A training method for a log identification model, the method comprising:
Obtaining black-and-white sample log data;
For each text log in the black-and-white sample log data, calculating the feature vector corresponding to the text log: performing word segmentation on the text log to obtain a plurality of words; calculating the word vector corresponding to each word using a word embedding algorithm; and obtaining the feature vector corresponding to the text log from the word vectors of the words;
Using the feature vectors of the black-and-white sample log data, training a supervised binary classification model as the log identification model.
7. The method according to claim 6, wherein obtaining the black-and-white sample log data comprises:
Labeling randomly selected samples with black-and-white attributes to obtain first labeled samples;
Training a sample classification model based on the first labeled samples;
Identifying other unlabeled samples with the sample classification model to obtain sample scores, the sample scores serving as a basis for assessing the black-and-white attribute of a text log;
Extracting target samples according to the sample scores, and labeling the target samples with black-and-white attributes to obtain second labeled samples;
Combining the second labeled samples with the first labeled samples, and continuing training to update the sample classification model;
Continuing to identify other unlabeled samples with the updated sample classification model, until the number of second labeled samples reaches a preset black-and-white sample quantity level.
8. The method according to claim 7, wherein extracting target samples according to the sample scores comprises:
Sorting the text logs of each media type separately according to the sample scores;
Within the text logs of each media type, extracting target samples according to the sample scores, wherein the target samples include the text logs ranked within a preset number of top positions and a portion of text logs extracted from the remaining text logs outside those top positions.
9. A target user mining apparatus, the apparatus comprising:
A data acquisition module, configured to collect, for a user to be identified, the text media data published by the user within a preset time period, the text media data including a plurality of text logs;
A vector calculation module, configured to calculate, for each of the text logs, the feature vector corresponding to the text log;
A model prediction module, configured to input the feature vector into a pre-trained log identification model and output the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
A user data module, configured to determine the probability value corresponding to the user according to the probability values of the plurality of text logs, so as to determine according to that probability value whether the user is a target user.
10. The apparatus according to claim 9, wherein
the vector calculation module is specifically configured to: perform word segmentation on the text log to obtain a plurality of words; calculate the word vector corresponding to each word using a word embedding algorithm; and average the word vectors of the words to obtain the feature vector corresponding to the text log.
11. A training apparatus for a log identification model, the apparatus comprising:
A sample acquisition module, configured to obtain black-and-white sample log data;
A vector processing module, configured to calculate, for each text log in the black-and-white sample log data, the feature vector corresponding to the text log: performing word segmentation on the text log to obtain a plurality of words; calculating the word vector corresponding to each word using a word embedding algorithm; and obtaining the feature vector corresponding to the text log from the word vectors of the words;
A model training module, configured to train, using the feature vectors of the black-and-white sample log data, a supervised binary classification model as the log identification model.
12. The apparatus according to claim 11, wherein the sample acquisition module, when obtaining the black-and-white sample log data, is configured to perform:
Labeling randomly selected samples with black-and-white attributes to obtain first labeled samples;
Training a sample classification model based on the first labeled samples;
Identifying other unlabeled samples with the sample classification model to obtain sample scores, the sample scores serving as a basis for assessing the black-and-white attribute of a text log;
Extracting target samples according to the sample scores, and labeling the target samples with black-and-white attributes to obtain second labeled samples;
Combining the second labeled samples with the first labeled samples, and continuing training to update the sample classification model;
Continuing to identify other unlabeled samples with the updated sample classification model, until the number of second labeled samples reaches a preset black-and-white sample quantity level.
13. The apparatus according to claim 12, wherein
the sample acquisition module, when extracting target samples according to the sample scores, is configured to: sort the text logs of each media type separately according to the sample scores; and, within the text logs of each media type, extract target samples according to the sample scores, wherein the target samples include the text logs ranked within a preset number of top positions and a portion of text logs extracted from the remaining text logs outside those top positions.
14. A target user mining device, the device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the instructions:
For a user to be identified, collecting the text media data published by the user within a preset time period, the text media data including a plurality of text logs;
For each of the text logs, calculating the feature vector corresponding to the text log;
Inputting the feature vector into a pre-trained log identification model and outputting the probability value corresponding to the text log, the probability value indicating the probability that the user who published the text log is a target user;
Determining the probability value corresponding to the user according to the probability values of the plurality of text logs, so as to determine according to that probability value whether the user is a target user.
CN201810784513.4A 2018-07-17 2018-07-17 A kind of method for digging and device of target user Pending CN109191167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810784513.4A CN109191167A (en) 2018-07-17 2018-07-17 A kind of method for digging and device of target user

Publications (1)

Publication Number Publication Date
CN109191167A true CN109191167A (en) 2019-01-11

Family

ID=64936784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810784513.4A Pending CN109191167A (en) 2018-07-17 2018-07-17 A kind of method for digging and device of target user

Country Status (1)

Country Link
CN (1) CN109191167A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324628A (en) * 2012-03-21 2013-09-25 腾讯科技(深圳)有限公司 Industry classification method and system for text publishing
CN104809236A (en) * 2015-05-11 2015-07-29 苏州大学 Microblog-based user age classification method and Microblog-based user age classification system
CN105205043A (en) * 2015-08-26 2015-12-30 苏州大学张家港工业技术研究院 Classification method and system of emotions of news readers
US20170278015A1 (en) * 2016-03-24 2017-09-28 Accenture Global Solutions Limited Self-learning log classification system
CN106022834A (en) * 2016-05-24 2016-10-12 腾讯科技(深圳)有限公司 Advertisement against cheating method and device
CN106202181A (en) * 2016-06-27 2016-12-07 苏州大学 A kind of sensibility classification method, Apparatus and system
CN106295701A (en) * 2016-08-11 2017-01-04 五八同城信息技术有限公司 user identification method and device
CN106126751A (en) * 2016-08-18 2016-11-16 苏州大学 A kind of sorting technique with time availability and device
CN106960154A (en) * 2017-03-30 2017-07-18 兴华永恒(北京)科技有限责任公司 A kind of rogue program dynamic identifying method based on decision-tree model
CN107273416A (en) * 2017-05-05 2017-10-20 深信服科技股份有限公司 The dark chain detection method of webpage, device and computer-readable recording medium
CN107193974A (en) * 2017-05-25 2017-09-22 北京百度网讯科技有限公司 Localized information based on artificial intelligence determines method and apparatus
CN107391545A (en) * 2017-05-25 2017-11-24 阿里巴巴集团控股有限公司 A kind of method classified to user, input method and device
CN107273454A (en) * 2017-05-31 2017-10-20 北京京东尚科信息技术有限公司 User data sorting technique, device, server and computer-readable recording medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633466A (en) * 2019-08-26 2019-12-31 深圳安巽科技有限公司 Short message crime identification method and system based on semantic analysis and readable storage medium
CN110633466B (en) * 2019-08-26 2021-01-19 深圳安巽科技有限公司 Short message crime identification method and system based on semantic analysis and readable storage medium
CN110555007A (en) * 2019-09-09 2019-12-10 成都西山居互动娱乐科技有限公司 Method and device for judging number stealing behavior, computing equipment and storage medium
CN110555007B (en) * 2019-09-09 2023-09-05 成都西山居互动娱乐科技有限公司 Method and device for discriminating theft behavior, computing equipment and storage medium
CN110704648A (en) * 2019-09-27 2020-01-17 北京达佳互联信息技术有限公司 Method, device, server and storage medium for determining user behavior attribute
CN112685374A (en) * 2019-10-17 2021-04-20 中国移动通信集团浙江有限公司 Log classification method and device and electronic equipment
CN112182193A (en) * 2020-10-19 2021-01-05 山东旗帜信息有限公司 Log obtaining method, device and medium in traffic industry
CN112182193B (en) * 2020-10-19 2023-01-13 山东旗帜信息有限公司 Log obtaining method, device and medium in traffic industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190111
