US20150032729A1 - Matching snippets of search results to clusters of objects - Google Patents

Info

Publication number
US20150032729A1
Authority
US
United States
Prior art keywords
objects
data
cluster
matches
snippet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/337,352
Inventor
Pawan Nachnani
Arun Kumar Jagota
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Salesforce Inc
Original Assignee
Salesforce com Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Salesforce com Inc
Priority to US14/337,352
Assigned to SALESFORCE.COM, INC. (assignment of assignors interest; see document for details). Assignors: JAGOTA, ARUN KUMAR; NACHNANI, PAWAN
Publication of US20150032729A1
Legal status: Abandoned

Classifications

    • G06F17/30554
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F17/30598

Definitions

  • Some customer data providers attempt to address this challenge by using a crowd-sourced platform to build a contact database which is sourced and updated by sales and marketing professionals.
  • the customer data provided by customer data providers often has a variety of problems, such as invalid email addresses or invalid phone numbers, a contact record with incorrect information from a name spelled wrong to a bad address, incomplete or inaccurate records for company names, job titles, and phone numbers, non-current data, wrong company information or wrong contact data, duplicate contacts with inconsistent information, fields that are empty due to poor data capture techniques or contain other inaccurate information, completed fields that contain nonsense data such as “TBA” or “TBD,” and outdated information, such as a contact that no longer works at the contact's former company.
  • systems and methods are provided for matching and confidently adding snippets of search results to clusters of objects. Information is searched based on objects in a cluster of objects.
  • a data snippet is extracted from the search results.
  • the data snippet is added to the cluster of objects if the data snippet includes data that matches at least one of the objects in the cluster of objects.
  • a confidence score may be calculated for adding the data snippet to the cluster of objects based on the recency, a job title, an email address, and/or a phone number associated with the data snippet.
  • the data snippet may be added to the cluster of objects in a customer accessible database if the confidence score is sufficiently high, and a notice for review may be generated if the confidence score is not sufficiently high.
  • a database system searches a business database for information about a business contact stored in a contact database, wherein the contact database includes objects stored in a cluster of objects that correspond to a given name “Gregory,” a family name “Jones,” a company “International Business Machines,” a title “V.P for sales,” a location “New York City,” and an email address for a specific business contact.
  • the database system extracts data that includes a given name “Greg,” a family name “Jones,” a company “IBM,” and a mobile phone number from the information in one of the search results.
  • the database system determines whether the data snippet extracted from the information in the search results includes data that matches any of the objects stored in the cluster of objects in the contact database corresponding to the business contact named Gregory Jones.
  • the database system adds the extracted data snippet, including the mobile phone number, to the objects stored in the cluster of objects in the customer accessible database that correspond to the business contact named Gregory Jones because the calculated confidence score is sufficiently high since both the data snippet and the objects in the cluster of objects include the uncommon family name “Jones.”
  • a sales person planning on contacting Greg Jones at IBM now has Jones's mobile phone number that the sales person did not have previously.
  • the database system builds, manages and sustains a high-quality person data object by bringing in data from multiple sources, normalizing, enriching, matching, and merging data to provide a “golden record,” or a best version of the data, for a person and the person's various business profile attributes.
  • the database system leverages free web data sources such as news feeds, blogs and search results to mine attributes such as titles, social handles, etc., to further improve the quality of contact, company, and location data objects, and uses this additional data to build, validate and enrich person profiles.
  • While one or more implementations and techniques are described with reference to an embodiment in which matching and confidently adding snippets of search results to clusters of objects is implemented in a system having an application server providing a front end for an on-demand database service capable of supporting multiple tenants, the one or more implementations and techniques are not limited to multi-tenant databases nor deployment on application servers. Embodiments may be practiced using other database architectures, e.g., ORACLE®, DB2® by IBM and the like, without departing from the scope of the embodiments claimed.
  • any of the above embodiments may be used alone or together with one another in any combination.
  • the one or more implementations encompassed within this specification may also include embodiments that are only partially mentioned or alluded to or are not mentioned or alluded to at all in this brief summary or in the abstract.
  • FIG. 1 is an operational flow diagram illustrating a high level overview of a method for matching and confidently adding snippets of search results to clusters of objects in an embodiment.
  • FIG. 2 is a block diagram of a system for matching and confidently adding snippets of search results to clusters of objects in an embodiment.
  • FIG. 3 illustrates a block diagram of an example of an environment wherein an on-demand database service might be used
  • FIG. 4 illustrates a block diagram of an embodiment of elements of FIG. 3 and various possible interconnections between these elements.
  • the term multi-tenant database system refers to those systems in which various elements of hardware and software of the database system may be shared by one or more customers. For example, a given application server may simultaneously process requests for a great number of customers, and a given database table may store rows for a potentially much greater number of customers.
  • the term query plan refers to a set of steps used to access information in a database system.
  • a lot of customer data makes up a database of contact records.
  • the primary source for this customer data could be a website where users add and update business card information by adding or updating contact information one record at a time through a web form or by uploading comma separated value files that contain contact information. Users may also occasionally submit bounce email reports that contain error codes for invalid emails that they receive from their mail providers as part of their email marketing campaigns.
  • a database system can receive and process millions of data records to provide new or updated data to customers in a timely manner.
  • the database system cleans the data from the record, normalizes the data into a standard set of values that might be used for matching, enriches the data, and attempts to match the data with previously stored data to create a “golden record” for a person identified by the incoming data and/or previously stored data. False matches can result in the loss of good data and missed matches may reduce the value of previously stored data.
  • the matching process also helps in identifying duplicates and decreases the possibility that duplicate records are created for the same person. After the matching process returns a suitable list of matching person candidates, the database system adds the incoming data to a cluster of data objects that contains data values that match data objects for the person identified by the incoming data.
  • the database system creates a new cluster of data objects for the person if the database system does not already include any cluster of data objects that match data objects for the person identified by the incoming data. Then the database system determines whether to store the added data in a customer accessible database.
  • a significant majority of bad and erroneous operations may be prevented, thereby resulting in much higher quality of customer data, if a database system treats every add or update contribution as a claim and takes into account the reputation of the user/partner who makes the claim.
  • bad and erroneous operations may be prevented if a database system takes into account the type of the claim, the date and time of the claim, and further validates the claim with data from the free web and other sources with additional levels of data stewardship. Claims from trusted users can be treated as sources of truth and valuable enough to overwrite almost all existing information. Claims from average members and the free web may be treated as being as good as any other information. The evaluation supported by more consistent data points prevails: for example, if three people evaluate some data as good and one person evaluates the same data as bad, the evaluations as good prevail.
  • the database system weighs a claim on a graded scale and calculates various scores to generate a confidence score that is then used to determine the type of actions that are needed before the claim is fully processed and applied to generate a “golden record” for a person.
  • the database system determines the quality of each and every individual attribute such as names, titles, emails, phones, social handles etc. Each and every attribute of data in the claim is scored and weighed against similar attributes from other claims and golden records in case they already exist. If the data in an attribute of a new claim is of better quality than an existing attribute and the confidence score of the new claim is above a certain threshold, the database system uses the incoming attribute in a data snippet to replace the existing data for that attribute in the golden record.
  • the attribute in the data snippet is linked to a person record, where data from multiple contacts is combined to create/update work profiles for the person record, allowing tracking of the lifecycle and work profile of contacts. If the attribute in the data snippet is conflicting or additional details are needed, the database system generates an additional task/alert to data stewards for additional review based on the importance of the data record and the attribute in question. If the attribute in the data snippet is of poor quality, then the database system rejects the claim and the state of the attribute and golden record remains unaffected. If there is not enough information to make a decision, there is not enough authority to change the state, or no new information is detected, then no decision is made, re-affirming the current state of the data.
  • FIG. 1 is an operational flow diagram illustrating a high level overview of a method 100 for matching and confidently adding snippets of search results to clusters of objects.
  • a database system may match and confidently add snippets of search results to clusters of objects.
  • a database system searches information based on objects in a cluster of objects, block 102.
  • this can include the database system searching a business database for information about a business contact stored in a contact database, wherein the contact database includes objects stored in a cluster of objects that correspond to a given name “Gregory,” a family name “Jones,” a company “International Business Machines,” a title “V.P for sales,” a location “New York City,” and an email address for a specific business contact.
  • the database system extracts a data snippet from the search results, block 104.
  • this can include the database system extracting data that includes a given name “Greg,” a family name “Jones,” a company “IBM,” and a mobile phone number from the information in one of the search results.
  • the database system determines whether the data snippet includes data that matches at least one of the objects in the cluster of objects, block 106.
  • this can include the database system determining whether the data snippet extracted from the information in the search result includes data that matches any of the objects stored in the cluster of objects in the contact database corresponding to the business contact named Gregory Jones. Whether the data snippet includes data that matches at least one of the objects in the cluster of objects may include matching based on first name aliases and/or acronym expansion.
  • “Greg” is a given name alias that matches the given name “Gregory” and “IBM” is an acronym that can be expanded to match “International Business Machines.”
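  • The alias and acronym matching described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the alias table, the function names, and the first-letter acronym rule are all assumptions introduced for the example.

```python
# Hypothetical alias table; a real system would load a much larger one.
GIVEN_NAME_ALIASES = {"greg": "gregory", "bill": "william", "bob": "robert"}


def canonical_given_name(name: str) -> str:
    """Map a given name or one of its known aliases to a canonical form."""
    key = name.strip().lower()
    return GIVEN_NAME_ALIASES.get(key, key)


def acronym_of(company: str) -> str:
    """Build an acronym from the first letter of each word (assumed rule)."""
    return "".join(word[0] for word in company.split()).upper()


def given_names_match(a: str, b: str) -> bool:
    """Two given names match if they share a canonical form."""
    return canonical_given_name(a) == canonical_given_name(b)


def companies_match(a: str, b: str) -> bool:
    """Company names match if equal, or if one is the acronym of the other."""
    a, b = a.strip(), b.strip()
    if a.lower() == b.lower():
        return True
    return a.upper() == acronym_of(b) or b.upper() == acronym_of(a)
```

  • With these helpers, given_names_match("Greg", "Gregory") and companies_match("IBM", "International Business Machines") both hold, mirroring the example in the text.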
  • If the data snippet includes data that matches at least one of the objects in the cluster of objects, the method continues to block 108; otherwise, the method proceeds to block 110. At block 108, the database system adds the data snippet to the cluster of objects.
  • this can include the database system adding the extracted data snippet, including the mobile phone number, to the objects stored in the cluster of objects in the contact database that correspond to the business contact named Gregory Jones because both the data snippet and the objects in the cluster of objects include the uncommon family name “Jones.”
  • the method 100 then proceeds to block 112 .
  • the database system optionally stores the data snippet for matching with subsequent clusters of objects, block 110.
  • this can include the database system storing the data snippet for matching with subsequent clusters of objects if the data snippet does not include data that matches at least one of the objects in the cluster of objects, as the contact database may be later supplemented with an additional contact that includes an object which matches some of the data in the extracted data snippet.
  • the method 100 either terminates or begins again at block 102 .
  • the database system can also determine whether the data snippet includes data that matches objects in another cluster of objects, block 112 . In embodiments, this can include the database system determining that the extracted data snippet that includes “Greg,” “Jones,” and the mobile phone number also matches an object in another cluster of objects that includes “Greg,” “Jones,” and a company “Microsoft.” If the data snippet includes data that matches at least one of the objects in another cluster of objects, the method continues to block 114 . If the data snippet does not include data that matches at least one of the objects in another cluster of objects, the method proceeds to block 116 .
  • the database system may combine the cluster of objects with the other cluster of objects, block 114.
  • this can include the database system combining the clusters of objects for the two business contacts named “Jones” whose objects include “International Business Machines” and “Microsoft.”
  • Such a combination of clusters of objects for business contact objects could be useful for a sales person planning on contacting Greg Jones at IBM if the sales person knows some business contacts who worked at Microsoft at the time when Greg Jones worked at Microsoft.
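  • One possible way to combine two clusters (block 114) is sketched below. It assumes, purely for illustration, that a cluster is a mapping from attribute names to sets of values; the disclosure does not specify a data structure, and merge_clusters is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Set

Cluster = Dict[str, Set[str]]  # assumed layout: attribute name -> set of values


def merge_clusters(first: Cluster, second: Cluster) -> Cluster:
    """Combine two clusters by taking the union of values per attribute."""
    merged = defaultdict(set)
    for cluster in (first, second):
        for attribute, values in cluster.items():
            merged[attribute] |= set(values)
    return dict(merged)


ibm_cluster = {
    "given_name": {"Gregory"},
    "family_name": {"Jones"},
    "company": {"International Business Machines"},
}
microsoft_cluster = {
    "given_name": {"Greg"},
    "family_name": {"Jones"},
    "company": {"Microsoft"},
}
combined = merge_clusters(ibm_cluster, microsoft_cluster)
```

  • The combined cluster keeps both company values, which is what lets a sales person see the contact's history at both companies.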
  • After adding a data snippet to a cluster of objects, the database system optionally calculates a confidence score for adding the data snippet to the cluster of objects based on the recency, a job title, an email address, and/or a phone number associated with the data snippet, block 116.
  • this can include the database system calculating a confidence score based on how recently the data objects from the search result were stored in the business database, with today's date of storage equated with the highest recency score.
  • the database system calculates a confidence score based on a job title from the search result, with hierarchically higher job titles equated with a higher title rank score, and with job titles known to be used by the business contact's claimed company equated with a higher title quality score.
  • the database system calculates a confidence score based on an email address from the search result, with the email score based on how well the email address matches the pattern of other email addresses for business contacts for the business contact's claimed company and how well the email address matches the first name and the last name of the business contact.
  • the database system calculates a confidence score based on a phone number from the search result, where the phone number score is based on the consistency between the claimed phone number and the area code associated with the claimed geographic location for the business contact.
  • the confidence score may be based on any weighted combination of the recency, the job title, the email address, and the phone number from the data snippet.
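  • A weighted combination of the component scores might look like the following sketch. The equal weights, the linear recency decay, the 365-day horizon, and the 0-to-1 score range are assumptions chosen for illustration; the text only states that any weighted combination may be used.

```python
from datetime import date
from typing import Dict, Mapping

# Hypothetical equal weights; the source only says "any weighted combination".
DEFAULT_WEIGHTS: Dict[str, float] = {
    "recency": 0.25,
    "title": 0.25,
    "email": 0.25,
    "phone": 0.25,
}


def recency_score(stored_on: date, today: date, horizon_days: int = 365) -> float:
    """1.0 for data stored today, decaying linearly to 0.0 at the assumed horizon."""
    age_days = max((today - stored_on).days, 0)
    return max(0.0, 1.0 - age_days / horizon_days)


def confidence_score(
    components: Mapping[str, float],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average of whichever component scores are available."""
    total_weight = sum(weights[name] for name in components)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * score for name, score in components.items())
    return weighted_sum / total_weight
```

  • Averaging over only the components that are present lets a snippet with, say, just a phone number and a storage date still receive a comparable score.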
  • the database system optionally determines whether a confidence score is sufficiently high for adding the data snippet to the cluster of objects stored in a customer accessible database, block 118 . In embodiments, this can include the database system determining that a confidence score is sufficiently high for a new mobile phone number to be added to the cluster of data objects for Greg Jones in a customer accessible database. If a confidence score is sufficiently high for adding the data snippet to the cluster of objects stored in a customer accessible database, the method 100 continues to block 120 . If a confidence score is not sufficiently high for adding the data snippet to the cluster of objects stored in a customer accessible database, the method 100 proceeds to block 122 . If the confidence score is sufficiently high for adding the data snippet to the cluster of objects stored in the customer accessible database, the database system optionally adds the data snippet to the cluster of objects stored in the customer accessible database, block 120 .
  • this can include the database system storing the new mobile phone number in the contact database that is accessible by a sales person planning on contacting Greg Jones, who now has Jones' mobile phone number that the salesman did not have previously. Then the method 100 either terminates or begins again at block 102 .
  • Although this example describes the database system using a confidence score to determine whether to add a data snippet to a cluster of objects in a customer accessible database, the database system may also use a confidence score to determine whether to combine the cluster of objects with the other cluster of objects.
  • the database system may also use a confidence score to determine whether to combine the cluster of objects with the other cluster of objects in a customer accessible database. If the confidence score is not sufficiently high for adding the data snippet to the cluster of objects stored in the customer accessible database, the database system optionally generates a notice for review, block 122 .
  • the notice for review can include the database system generating a notice for reviewing the adding of the data snippet to the cluster of objects because the mobile phone number in the search results is not associated with New York City, the claimed office location for Jones in the search results, and the title “VP” in the search results is too generic and does not match any titles known to be used by IBM, the claimed company for Jones in the search results. Then the method 100 either terminates or begins again at block 102 . Accordingly, systems and methods are provided which enable a database system for matching and confidently adding snippets of search results to clusters of objects.
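  • The decision at blocks 118 through 122 can be sketched as a simple threshold test. The threshold value of 0.7, the Decision type, and the action names are hypothetical; the disclosure only requires that the score be "sufficiently high".

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Decision:
    action: str
    reasons: List[str] = field(default_factory=list)


def decide(score: float, threshold: float = 0.7, reasons: Sequence[str] = ()) -> Decision:
    """Add the snippet if the score clears the threshold, else request review."""
    if score >= threshold:
        return Decision("add_to_customer_database")
    return Decision("notice_for_review", list(reasons))


review = decide(
    0.4,
    reasons=[
        "phone number is not associated with the claimed location",
        "title is too generic for the claimed company",
    ],
)
```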
  • the method 100 may be repeated as desired.
  • Although this disclosure describes the blocks 102-122 executing in a particular order, the blocks 102-122 may be executed in a different order. In other implementations, each of the blocks 102-122 may also be executed in combination with other blocks and/or some blocks may be divided into a different set of blocks.
  • FIG. 2 illustrates a block diagram of an example system for matching and confidently adding snippets of search results to clusters of objects, under an embodiment.
  • the system 200 may illustrate a cloud computing environment in which data, applications, services, and other resources are stored and delivered through shared data-centers and appear as a single point of access for the users.
  • the system 200 may also represent any other type of distributed computer network environment in which servers control the storage and distribution of resources and services for different client users.
  • Storm is a real time, open source data streaming framework that functions entirely in memory.
  • Storm constructs a processing graph, called a “topology,” that feeds data from input sources through processing nodes.
  • the input data sources are called “spouts,” and the processing nodes are called “bolts.”
  • the data model consists of tuples, which flow from spouts to the bolts, which execute user code. Besides simply being locations where data is transformed or accumulated, bolts may also join streams of data and branch streams of data.
  • Storm is designed to be run on several machines to provide parallelism. Storm processes streams of tuples.
  • a stream is defined to be an unlimited ordered sequence of tuples, and each tuple is a one dimensional array of objects.
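  • The spout, bolt, and tuple concepts can be illustrated with plain Python generators. This sketch does not use the Storm API and does not model parallelism or stream groupings; the function names and the (claim type, payload) tuple layout are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Claim = Tuple[str, Dict[str, str]]  # assumed tuple layout: (claim type, payload)


def directory_spout(records: Iterable[Dict[str, str]]) -> Iterator[Claim]:
    """Spout: read records from a source and emit them into the stream as claims."""
    for record in records:
        yield ("contact added", record)


def normalize_bolt(stream: Iterable[Claim]) -> Iterator[Claim]:
    """Bolt: transform each tuple (here, trim whitespace from every value)."""
    for claim_type, record in stream:
        yield (claim_type, {key: value.strip() for key, value in record.items()})


def run_topology(records: Iterable[Dict[str, str]]) -> List[Claim]:
    """Wire spout -> bolt and drain the resulting stream."""
    return list(normalize_bolt(directory_spout(records)))
```

  • Because each stage is a generator, tuples flow through one at a time, which loosely mirrors how a stream is processed without first being materialized.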
  • the system 200 acts as a central data processing hub, or clearing house, that brings in multiple data sources and free web data together to generate the “golden” record for core data assets around accounts and persons.
  • the following describes the key components of the system 200 as part of a storm topology to implement the data processing pipeline.
  • the system 200 generates a set of specialized keys for each person record and claims that enable fast lookups for the purpose of matching and retrieval. Indices are created for each company, person and location object in the cache, and these indices are used for person matching, company matching and location matching.
  • a spout is a source of data streams in a Storm topology. Generally spouts will read tuples from an external source and emit them into the topology.
  • the business directory spout 202 reads data from the business directory database and emits tuples, which are treated as claims, into the topology, such as contact added, contact updated, contact invalid phone, contact invalid email, and contact not at company. Each of these claim types has an associated contributor identifier that is the identity of the user who performed the action.
  • the business directory spout 202 is an unbounded stream and keeps emitting data till there is no more data to be read.
  • the tuples that are emitted out of the business directory spout 202 may be distributed randomly (shuffle grouping) to a normalize bolt 204, which is the first bolt in the pipeline. This data can also be sent to a search engine bolt 206, which executes free web search queries and tries to find additional data around this contact, such as titles and social handles.
  • the partial records spout 208 provides partial records from disparate sources.
  • the partial records spout 208 reads contact data from a partial records database where files that are uploaded by users on the website are stored in raw format before partial records processing.
  • the key difference here is that, unlike the business directory spout 202, the partial records spout 208 emits tuples built from the partial data in the uploaded files. Also, the tuples that come out of the partial records spout 208 will often contain very poorly normalized data.
  • the tuples that are emitted out of the partial records spout 208 may be distributed randomly (shuffle grouping) to the normalize bolt 204, which is the first bolt in the pipeline. This data can also be sent to the search engine bolt 206, which executes free web search queries and tries to find additional data around this contact, such as titles and social handles. Examples of claims emitted by the partial records spout 208 include contact added and contact added for new company.
  • the bounce email spout 210 reads bounce email error codes, which may be from comma separated value files that are uploaded by website administrators and website users. Examples of claims emitted by the bounce email spout include contact email and contact message.
  • the bounce file message that the bounce email spout 210 receives for an email is typically unstructured text, such as records that are comma-separated with the email in the first column and the second column containing the bounce message as unstructured text.
  • an automatic column mapping algorithm may initially process the first few lines of the file. The algorithm does not need to rely on the names of the column headers, but rather the algorithm can tokenize the bounce file.
  • the field separator may be determined from the file by tokenizing on each kind of separator and computing how consistent the resulting number of tokens per record is across the entire file. After determining the field separator, the algorithm can determine which column contains the email and which column contains the message. The algorithm may split out the record, remove the email, and concatenate the rest of the record to create the contact message claim.
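  • The separator detection and column mapping described above can be sketched as follows. The candidate separator list, the email pattern, and the consistency measure (the share of records that have the most common token count) are assumptions made for this example.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

CANDIDATE_SEPARATORS = [",", "\t", ";", "|"]  # assumed candidates
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose pattern


def detect_separator(lines: List[str]) -> Optional[str]:
    """Pick the separator whose token count per record is most consistent."""
    best, best_score = None, -1.0
    for separator in CANDIDATE_SEPARATORS:
        counts = [len(line.split(separator)) for line in lines if line.strip()]
        if not counts:
            continue
        token_count, frequency = Counter(counts).most_common(1)[0]
        if token_count < 2:  # this separator does not split most records
            continue
        score = frequency / len(counts)
        if score > best_score:
            best, best_score = separator, score
    return best


def to_claim(line: str, separator: str) -> Tuple[Optional[str], str]:
    """Split a record, pull out the email, and join the rest as the message."""
    tokens = [token.strip() for token in line.split(separator)]
    email = next((token for token in tokens if EMAIL_RE.match(token)), None)
    message = " ".join(token for token in tokens if token and token != email)
    return email, message


sample = ["a@x.com;550 mailbox unavailable", "b@y.org;552 mailbox full, try later"]
```

  • Locating the email by pattern rather than by header name matches the point in the text that the algorithm need not rely on column headers.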
  • the emitted contact message claim is typically an unstructured snippet of text.
  • the social handle spout 212 reads contact data and social handles from a social handle repository and submits claims such as contact social handle.
  • the crawler spout 214 emits contacts found on the web from crawling websites for their management pages.
  • the crawler spout 214 may start with a number of seed companies that the system 200 currently has and use it as the starting point for crawling. Examples of claims emitted by the crawler spout 214 include contact added and contact updated.
  • the normalize bolt 204 processes all the tuples that come to it through a series of data normalization routines.
  • the normalize bolt 204 may standardize addresses, titles, and phone numbers, and properly classify contact records by department and level. The following are some of the key normalizations.
  • An address normalizer can include a list of abbreviations, such as E to East, W to West, and Blvd to Boulevard; allow only letters, numbers, and special characters; and remove any spaces around the special characters.
  • a title normalizer may include a list of misspellings and abbreviations.
  • a name normalizer can allow letters and special characters, not allow special characters at the beginning and the end of a name, capitalize the first letter and add a space after each name, capitalize the next letter if a name starts with “Mc,” and capitalize all Roman numerals.
  • a city normalizer may only allow letters and special characters, and only keep the last non-space special character if there is a sequence of special characters.
  • a base normalizer can return the correct country normalizer based on the country abbreviation.
  • a phone normalizer may normalize phone patterns based on each country having its own phone pattern.
  • a zip normalizer can normalize zip code patterns based on each country having its own zip code pattern.
  • a state normalizer may normalize states based on each country having its own state requirements, if there are any.
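  • Two of the normalizers listed above, for names and addresses, can be sketched as follows. The abbreviation table is limited to the three examples in the text, and the Roman numeral list and the set of stripped special characters are assumptions; country-specific rules are not modeled.

```python
import re

# Only the abbreviations named in the text; a real table would be larger.
ADDRESS_ABBREVIATIONS = {"e": "East", "w": "West", "blvd": "Boulevard"}
# Assumed suffix list; a lone "i" is left out so it is not confused with an initial.
ROMAN_NUMERAL_RE = re.compile(r"^(ii|iii|iv|v|vi|vii|viii|ix|x)$", re.IGNORECASE)


def normalize_name(raw: str) -> str:
    """Capitalize each name part, handle the "Mc" prefix and Roman numerals."""
    parts = []
    for token in raw.split():
        token = token.strip(".,-'")  # no special characters at the start or end
        if not token:
            continue
        if ROMAN_NUMERAL_RE.match(token):
            parts.append(token.upper())
        elif token.lower().startswith("mc") and len(token) > 2:
            parts.append("Mc" + token[2:].capitalize())
        else:
            parts.append(token.capitalize())
    return " ".join(parts)


def normalize_address(raw: str) -> str:
    """Expand known abbreviations word by word."""
    words = []
    for word in raw.split():
        key = word.rstrip(".").lower()
        words.append(ADDRESS_ABBREVIATIONS.get(key, word))
    return " ".join(words)
```

  • For example, normalize_name("gregory mcdonald iii") yields "Gregory McDonald III", and normalize_address("100 E Main Blvd.") yields "100 East Main Boulevard".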
  • the enrich bolt 216 uses external data services for email verification, for phone verification and social append services for social handles, and appends a set of meta-attributes to all the new contact claims that enter the pipeline. After enrichment, the tuple may contain additional metadata around emails, phones and social handles that is useful for matching and merging purposes. The enrich bolt 216 passes this data to the match bolt 218 that tries to match the incoming contact claims with other existing claims and facts in the system 200 .
  • the working data model of person object attributes may include: first name, last name, linkedin handle, twitter handle, other social handles, links to contact objects, work history, photos, education, and snippets, which are unstructured short pieces of text such as search result snippets, tweets, etc., and others containing person-identifying content.
  • p denotes a person object, p.work_history.company_names denotes the names of the companies p has worked at, p.work_history.cities denotes the set of all cities p has worked in, p.work_history.titles denotes the set of job titles that p has held, and similar notations exist for work emails, work phones, work states, work countries, and social handles.
  • the formats of the objects of different types of social handles (linkedin, twitter, etc.) are quite different, so it may not be necessary to have a different index type for each type of social handle because there is no risk of a collision.
  • a final check models the probability that a match is a chance event.
  • Let M denote this match.
  • the system 200 can estimate the upper bound on the expected number E(M) of objects in the universe that have the properties of the match M, under the universe probability model. If this upper bound estimate is below a certain threshold (1 may be a sensible choice) the system 200 accepts this match, otherwise the system 200 rejects the match.
  • One way to estimate a suitable upper bound on E(M) is to model the probabilities of various attribute:value pairs under the universe probability model, then assume the independence of attributes in the match and multiply out these probabilities, then finally multiply this by n.
  • $E(M) \le n \cdot \prod_{a:v \in M} P(a:v)$ (EUB 1). Modeling the probabilities of all attribute:value pairs in the universe is probably too complex, so the database system may begin by modeling the probabilities of certain key attributes and their values, drop all attributes other than these from M, and still use (EUB 1). The result is still an estimate of the upper bound on E(M).
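  • The chance-match check based on (EUB 1) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary-based probability model, and the example probabilities are hypothetical; only the formula (n times the product of the modeled attribute:value probabilities) and the threshold of 1 come from the description.

```python
from math import prod


def expected_matches_upper_bound(match, probabilities, universe_size):
    """Estimate an upper bound on E(M) as n * product of P(a:v).

    `match` maps attribute -> value. `probabilities` maps (attribute, value)
    pairs to their modeled probability. Attributes without a modeled
    probability are dropped from the match, as the description suggests.
    """
    key_pairs = [(a, v) for a, v in match.items() if (a, v) in probabilities]
    return universe_size * prod(probabilities[pair] for pair in key_pairs)


def accept_match(match, probabilities, universe_size, threshold=1.0):
    """Accept the match only if the bound falls below the threshold."""
    bound = expected_matches_upper_bound(match, probabilities, universe_size)
    return bound < threshold


# Hypothetical probability model and universe size, for illustration only.
model = {
    ("last_name", "nachnani"): 1e-7,
    ("last_name", "smith"): 1e-2,
    ("city", "san francisco"): 1e-2,
}
rare = {"last_name": "nachnani", "city": "san francisco", "title": "vp"}
common = {"last_name": "smith", "city": "san francisco"}
```

  • With a universe of 100,000,000 objects, the rare-name match has a bound of about 0.1 and is accepted, while the common-name match has a bound of about 10,000 and is rejected. A match with no modeled attributes gets a bound of n and is therefore rejected as well.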
  • the result-set size based estimate may not generalize as well as explicit modeling.
  • the P(person_name) explicit model which assumes independence of first and last names does not generalize well.
  • An alternative to an explicit estimate is a result-set size based estimate.
  • the system 200 runs the matcher to find all true positive matches.
  • ‘true positive’ may not include ‘modeling chance matches’. If there are at least two distinct objects in the result set, the system 200 deems that the probe being matched is not matched uniquely. This approach has the benefit that the P(a:v) probabilities are not explicitly modeled.
  • the result set will carry the information to judge whether a match is unique or not, even in complex cases.
  • This approach has the limitation that it does not model the real world; only the current, actual universe of (golden) data objects. Another issue is that to implement this approach, the system 200 may need to do this computation after all the true positives have been generated. Furthermore, the system 200 can match within the result set to check whether there are indeed at least two different objects or not.
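  • The result-set size based check can be sketched as follows. This is an illustrative sketch only: `is_unique_match` and the `same_object` predicate (standing in for matching within the result set) are assumed names, and the objects are hypothetical dictionaries.

```python
def is_unique_match(result_set, same_object):
    """Return True if the result set contains exactly one distinct object.

    `same_object(a, b)` is an assumed predicate that decides whether two
    results refer to the same underlying object.
    """
    distinct = []
    for obj in result_set:
        if not any(same_object(obj, seen) for seen in distinct):
            distinct.append(obj)
            if len(distinct) >= 2:
                # At least two different objects: the probe is not unique.
                return False
    return len(distinct) == 1


def same_id(a, b):
    """Hypothetical equality test for the illustration below."""
    return a["id"] == b["id"]
```

  • Here `is_unique_match([{"id": 1}, {"id": 1}], same_id)` is true because both results are the same object, whereas a result set holding ids 1 and 2 is not matched uniquely.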
  • the search engine bolt 206 takes partial data (aka seed) and tries to find more publicly available information via a search engine 220, such as Yahoo® Boss, because finding titles and social handles is useful.
  • the data thus obtained is passed through a search results bolt 222 to extract vital information and enrich a data record to build a full person profile, such as by passing the data to a handle extractor bolt 224 .
  • the search results bolt 222 uses search result snippets having attractive properties that suggest they be made first-class “objects” in a person database 226 and/or contact data model and matching engines. Snippets are consumed without running afoul of terms of use restrictions. For the most part, snippets contain information about a single entity: a person, company, or contact. Snippets might be matched to a different type of suitable object, such as person, company, or contact. Some snippets contain information about multiple companies at which a person has worked, so snippets could be used to connect together multiple contacts of the same person. Such a matching is of mostly unstructured text (the snippet) to structured data (a particular contact object). This matching does not require entity extraction from the snippet.
  • This matching could be algorithmically relatively easy to do.
  • certain “nuggets” might be extracted from the snippet and the matching object enriched. For example, if the snippet contains a LinkedIn handle and the snippet matches a particular contact sufficiently well, this handle can then be attached to that contact.
  • a snippet may tie together multiple contacts of the same person because the snippet contains the names of multiple companies at which the person has worked.
  • Contact initiated snippets generation and matching may work as follows. Start with a contact J. Let C denote the cluster of the person database 226 containing J. Generate a suitable query Q to the search engine 220 from J. For each snippet S in the top search results on Q, if S matches C with a sufficiently high confidence, add S to C, otherwise add S to a collection of unmatched snippets. If the person name in J is sufficiently uncommon, set Q to person-name(J), else set Q to person-name(J)+company-name(J). Two examples are Pawan Nachnani and John Smith ibm. Note that there is no data quality risk by setting a query too broad, such as a common person name, because the resulting snippets will be deeply matched with C.
  • An overly broad query may not yield good recall because none of the snippets in its result set may deeply match C. Recall may be less important than precision because the system 200 can make up for low recall by pounding away at the search engine 220, so long as the system 200 is not overly constrained by search volume limits. Also, if the system 200 uses a mechanism to consume unmatched snippets, this substantially mitigates the recall limitation.
  • C denotes the data of a single person. A snippet may contain data of this person spread across multiple contacts, which is why the database system matches S to C and not merely to J.
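  • The contact-initiated snippet generation and matching loop described above can be sketched as follows. This is a sketch under stated assumptions: the dictionary layout of the contact and cluster, the callables `search`, `match_confidence` and `is_uncommon_name`, and the `min_confidence` and `top_k` defaults are all hypothetical stand-ins for the search engine 220 and the match bolt 218.

```python
def build_query(contact, is_uncommon_name):
    """Build query Q from contact J: the name alone if it is uncommon,
    otherwise the name plus the company name."""
    name = contact["person_name"]
    if is_uncommon_name(name):
        return name
    return f'{name} {contact["company_name"]}'


def match_snippets_to_cluster(contact, cluster, search, match_confidence,
                              is_uncommon_name, min_confidence=0.9, top_k=10):
    """Add sufficiently well matching snippets to the cluster C.

    Returns the collection of unmatched snippets so that another
    mechanism can consume them later.
    """
    unmatched = []
    query = build_query(contact, is_uncommon_name)
    for snippet in search(query)[:top_k]:
        # Each snippet S is matched against the whole cluster C, not just J.
        if match_confidence(snippet, cluster) >= min_confidence:
            cluster["snippets"].append(snippet)
        else:
            unmatched.append(snippet)
    return unmatched
```

  • Matching against the cluster rather than the single contact is what allows a broad query to be used without a data quality risk: a snippet that does not deeply match C simply ends up in the unmatched collection.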
  • the match bolt 218 includes bolts such as a handle bolt 228 , an email bolt 230 , a name@company bolt 232 , a name@phone bolt 234 , and a name@location bolt 236 to match snippets to clusters of objects in the person database 226 .
  • a cluster bolt 238 clusters all matching claims together into a common cluster.
  • a merge bolt 240 merges all claims and existing contact records (partial and/or complete) from a cluster into a single composite record (the merged record) and computes a confidence score for the merged record. If the merged record is incomplete, the merge bolt 240 enriches the record when possible with information available in the cache. If the record is complete, the merge bolt 240 marks the record as canonicalized. At this point, the record is ready to be persisted in the person database 226 , provided its confidence score is sufficiently high. The merge bolt 240 also updates the merge time of the incoming claim.
  • If r.day is today, then this score may have the value 1, and the score can decay to 0 for days far in the past.
  • Score(r,rank) based on r.title. c-level titles may get a score of 1 and the rank score can monotonically decay for lower rank titles.
  • Score(r,title_quality) High rank titles, e.g. Vice President, do not necessarily have high quality.
  • Title_quality may score this separate dimension. A title might be deemed to have high quality if it has a known rank and has a known department and is not in an explicit list of poor titles. The quality may decrease depending on which (and how many) of the tests in the above sentence are violated.
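  • The title quality dimension can be sketched as follows. This is a minimal sketch: the description only says that quality decreases depending on which and how many of the three tests are violated, so the equal weighting of the tests, the function signature, and the example values are assumptions.

```python
def title_quality(title, rank, department, poor_titles):
    """Score title quality in [0, 1].

    `rank` and `department` are None when unknown. Each violated test
    (unknown rank, unknown department, explicitly poor title) removes
    one third of the score; equal weighting is an assumption.
    """
    violations = sum([
        rank is None,
        department is None,
        title.lower() in poor_titles,
    ])
    return 1.0 - violations / 3
```

  • For instance, a title with a known rank and department that is not in the poor-titles list scores 1.0, and a title that fails all three tests scores 0.0.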
  • Score(r,domain) might only be defined when r's company has been matched to company jc.
  • Score(r,d) = #emails in domain d / #contacts in company jc.
  • updates algorithmically deemed risky may be logged for review by a data steward or community. Feedback from the review can be used to assess the accuracy of this scoring/detection mechanism, and tuning of it if it is deemed useful enough.
  • An update is risky if a contact's last name is changed.
  • a title change with more than one level increase in rank, such as software engineer to ceo, is also risky.
  • a score version of this may make the risk score depend on the number of skipped levels.
  • a title change which moves a contact to another, incompatible department, such as vp sales to vp engineering, is also risky.
  • Updating or adding a C-level contact in a large company is risky, and is easy to generalize in a scoring setting: the higher the rank of the contact and the larger the company size, the higher the risk score may be. Also, different update actions might have differing risks; for example, a title change is generally more risky than a last name change for a female. A Fortune 1000 headquarters address change is also risky, but scoring may generalize this by combining the importance of the company with an attribute-specific change score to produce an overall risk score.
  • the join bolt 242 takes all the merged claims from the merge bolt 240 and constructs person objects.
  • a person object may be a collection of major profiles, such as a person profile, a work profile, and a social profile.
  • the data from each merged claim can update one or many attributes across all the three profiles of a person.
  • a merge claim may end up creating new profile objects as new claims become available.
  • Each attribute in a profile ends up with a confidence score that may ultimately determine the level of “gold” for that particular profile object. While most of the attributes might be permanent, some of the attributes could be transient and need to be recomputed over time due to privacy and legal reasons.
  • a persist bolt 244 may save all the resultant person records and the underlying claims to the person database 226 once all the processing is completed by the join bolt 242.
  • the bounce email processing bolt is a reaper bolt 246 that aggregates multiple facts with a current claim and comes up with a score and a disposition about that score.
  • the reaper bolt 246 may determine if a fact is a duplicate.
  • the fact disposition can determine if the computed score warrants a graveyard or ungraveyard of the underlying contact.
  • the score of the current claim could be computed as follows: Take all claims and scored facts for the same email. For each fact, get the base score determined by the response category of the email. From the description from the bounce email spout 210 , the contact message is typically unstructured data.
  • the reaper bolt 246 may address this by using a trie-based approach to find tokens specified in a list of vendor dictionaries.
  • Each vendor dictionary can specify the token with a classified response category.
  • Response categories for email may be hard_error, heavy_error, soft_error, email_received, and unknown.
  • the crawler spout 214 looks at the free web (sites approved by a legal department for acceptable terms of service) and finds publicly available information/claims. Since most of the open web sources of data are unstructured, the publicly available information typically requires sophisticated natural language processing techniques to extract meaningful information from it. Therefore, the crawler spout 214 feeds snippets of information to a natural language processing bolt 248, which applies natural language processing and machine learning techniques to extract relevant data/facts to emit the following types of claims: contact added, contact updated, contact graveyarded, and social handles.
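  • The trie-based lookup of vendor dictionary tokens in an unstructured bounce message can be sketched as follows. This is an illustration under assumptions: the word-level trie built from nested dictionaries, the `$category` marker key, the tokenization on letters and digits, and the example dictionary entries are hypothetical; only the response category names come from the description.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_trie(vendor_dictionary):
    """Build a word-level trie from phrase -> response category."""
    root = {}
    for phrase, category in vendor_dictionary.items():
        node = root
        for word in tokenize(phrase):
            node = node.setdefault(word, {})
        # "$category" cannot collide with a token: tokens are alphanumeric.
        node["$category"] = category
    return root


def classify_bounce(message, trie):
    """Return the category of the first dictionary phrase in the message."""
    words = tokenize(message)
    for start in range(len(words)):
        node = trie
        for word in words[start:]:
            if word not in node:
                break
            node = node[word]
            if "$category" in node:
                return node["$category"]
    return "unknown"


# Hypothetical vendor dictionary for illustration.
trie = build_trie({"user unknown": "hard_error", "mailbox full": "soft_error"})
```

  • With this dictionary, `classify_bounce("550 5.1.1 User unknown.", trie)` returns `hard_error`, and a message containing no dictionary phrase falls back to `unknown`.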
  • a natural, human person may be represented as a graph of p:Person entities (nodes, or vertices) interconnected by links (edges). Each node can represent a different facet of the user (person). Each of these facets may be held in a separate (graph) container called a context.
  • Each person entity node can be a set of attributes and objects. These attributes might be simple literals (such as the user's first name) or they could be other entities (called complex attributes). These latter attributes might be links to other entity nodes.
  • each node in the person graph is located in its own context.
  • the root node may lie in a special context (for each user) called the root context.
  • the system 200 delivers this data to the person database 226 that is customer accessible. This golden data may also be propagated back to the original source systems and other partner systems and help keep the data clean in their respective source databases.
  • the system 200 provides a complete 360 degrees feedback loop and reduces the chances that bad or fraudulent data may ever make it into customer's customer relationship management systems or any other system where a consolidated view of an account and person data is required.
  • the core person and account repository also continues to grow over time as new pieces of data are found on the free web and other sources. Additional sources of data may also be on-boarded quickly into the system 200 by adding and configuring new spouts and corresponding bolts into the Storm topology.
  • a de-duplication bolt detects duplicates and automatically merges the duplicates or floats suspected duplicates to a community for task resolution.
  • a pinger bolt pings hypertext transfer protocol and simple mail transfer protocol domains for validity, automatically graveyarding when a domain is deemed invalid.
  • the system 200 may create indices for each company, person, and location object for matching purposes.
  • person indices include record identifier, social handle, email, direct phone number, company, city, zip, state, and country.
  • location indices include record identifier, zip, city, and country.
  • company indices include record identifier, domain, corporate phone, company prefix, stock ticker, company name and city, domain and city.
  • the system may build an inverted index from a snippet, and use the index to map words in the snippet to their positions.
  • the positions for a given word could be in increasing order.
  • An inverted index is illustrated in an example below.
  • the system 200 detects acronyms (if any) in the snippet, expands out these acronyms, tokenizes the expansion and incorporates these expansions into the inverted index, as illustrated in the example below.
  • the inverted index contains the entry ibm → <0,i,j> where i and j denote the word positions of the 2nd and 3rd occurrences of IBM in the snippet.
  • After recognizing the acronym ibm → “international business machines”, the database system adds the entries international [i,0], business [i,1], and machines [i,2] to the inverted index.
  • Acronym-expansion entries in a snippet's inverted index could be useful for matching titles or company names to the snippet.
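  • Building a snippet's inverted index with acronym expansions can be sketched as follows. This is a sketch, not the system's implementation: the `ACRONYMS` table, whitespace tokenization, and the choice to add expansion entries for every occurrence of an acronym, stored as (acronym position, offset within the expansion) tuples, are assumptions.

```python
from collections import defaultdict

# Hypothetical acronym table for illustration.
ACRONYMS = {"ibm": "international business machines"}


def build_inverted_index(snippet, acronyms=ACRONYMS):
    """Map each word of the snippet to its positions, in increasing order.

    Words of an acronym expansion are indexed as (position, offset) pairs,
    where position is the acronym's word position in the snippet and
    offset is the word's place within the expansion.
    """
    index = defaultdict(list)
    for position, word in enumerate(snippet.lower().split()):
        index[word].append(position)
        for offset, expanded in enumerate(acronyms.get(word, "").split()):
            index[expanded].append((position, offset))
    return dict(index)


index = build_inverted_index("ibm hires shabd vaid at ibm")
```

  • In this hypothetical snippet, `index["ibm"]` is `[0, 5]` and `index["machines"]` is `[(0, 2), (5, 2)]`, so a company name written out in full can still be matched against a snippet that only contains the acronym.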
  • the system 200 may represent an attribute:value pair as an ordered tree.
  • the order can capture the order of the words in the value, and also in acronym expansions.
  • the ordered tree may capture choices, which include aliases, and acronym expansions.
  • Table 1 below shows various examples. Ordered trees can be depicted as nested arrays, and constructed via attribute-specific constructors. For example, person_name objects are expanded to include first name aliases, and acronyms in company names and titles are detected and expanded, such as depicted in table 1.
  • Ordered trees may have alternating levels of ordered ANDs and unordered ORs. For visual convenience, an AND-node is encapsulated in [ . . . ] and an OR-node in ( . . . ).
  • [chairman, and, (ceo, [chief, executive, officer])] is read as “chairman AND (ceo OR (chief AND executive AND officer)).”
  • Chairman AND ceo OR (chief AND executive AND officer)
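  • The ordered tree with alternating AND and OR levels can be sketched with nested Python containers. This is an illustrative encoding only: using a list for an ordered AND-node and a tuple for an OR-node mirrors the [ . . . ] and ( . . . ) notation, while the function names, the acronym table, and the example title "acting ceo" are assumptions.

```python
def title_tree(title, acronyms):
    """Build an ordered tree for a title.

    An AND-node is a list (word order preserved); an acronym becomes an
    OR-node, a tuple of the acronym itself and the AND of its expansion.
    """
    tree = []
    for token in title.lower().split():
        if token in acronyms:
            tree.append((token, acronyms[token].split()))
        else:
            tree.append(token)
    return tree


def to_expression(node, top=True):
    """Render a tree as a boolean expression, for readability."""
    if isinstance(node, str):
        return node
    joiner = " AND " if isinstance(node, list) else " OR "
    body = joiner.join(to_expression(child, top=False) for child in node)
    return body if top else f"({body})"


tree = title_tree("acting ceo", {"ceo": "chief executive officer"})
```

  • Here `tree` is `["acting", ("ceo", ["chief", "executive", "officer"])]`, and `to_expression(tree)` reads as `acting AND (ceo OR (chief AND executive AND officer))`.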
  • Representing the snippet as an inverted index combined with representing attribute:value pairs as ordered trees may lead to a very fast matching algorithm, as described below.
  • the system 200 has attribute-specific matchers to match a value of a field to a snippet, which is unstructured text.
  • the attribute-specific matchers could be instances of the following generic matcher.
  • Table 2:
      Row | attribute_value_ordered_tree | snippet_inverted_index | hits
      1 | [shabd, vaid] | {shabd → <0, 5>, vaid → <1, 6>, vice → <9>, president → <10>, . . .} | [<0, 5>, <1, 6>]
      2 | [vice, president] | {shabd → <0, 5>, vaid → <1, 6>, vice → <9>, president → <10>, . . .} | [<9>, <10>]
  • Enumerating individual hits may be described based on the hits data structure in the last column of Table 2. Individual hits can reveal exactly what tokens in the query matched what positions in the snippet. Each hit could be individually scored. The overall score for the match of the attribute:value pair in the snippet might be defined as the aggregation of these individual scores.
  • a hit could be a pair (tokens,positions), where tokens might be an array of tokens in attribute_value_ordered_tree and positions could be an array of positions in the snippet at which these tokens match, such as the examples below.
  • a one-level hits tree is simply an array of post-lists.
  • Table 2 hits of rows 1 and 2 form one-level trees.
  • the system 200 may use a k-merge like algorithm to enumerate all the hits of such a tree to a snippet. This algorithm can “merge” k post-lists, as illustrated below. Below is an illustration on the hits [<0,5>, <1,6>].
  • the underlined entries depict the locations of the pointers in the various post-lists.
  • the pointers are at the start positions. Since 1 minus 0 equals 1, the system 200 generates a hit, 0 . . . 1, and advances both pointers.
  • In step 2, since 6 minus 5 equals 1, the system 200 enumerates a hit, 5 . . . 6, and advances both pointers.
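  • The k-merge over a one-level hits tree can be sketched as follows. This is a minimal sketch under assumptions: the function name and the rule for advancing the lagging pointer when the current positions are not consecutive are choices made for the illustration; the description only shows the case in which both pointers advance after a hit.

```python
def k_merge_hits(post_lists):
    """Enumerate hits across k post-lists of increasing positions.

    A hit is one position from each post-list such that the positions
    are consecutive in the snippet (p, p+1, ..., p+k-1).
    """
    if not post_lists:
        return []
    pointers = [0] * len(post_lists)
    hits = []
    while all(ptr < len(pl) for ptr, pl in zip(pointers, post_lists)):
        current = [pl[ptr] for ptr, pl in zip(pointers, post_lists)]
        # Subtracting each list's index gives the implied start position;
        # the positions are consecutive exactly when all starts agree.
        starts = [pos - k for k, pos in enumerate(current)]
        if min(starts) == max(starts):
            hits.append(current)
            pointers = [ptr + 1 for ptr in pointers]
        else:
            # Advance the pointer that lags furthest behind.
            pointers[starts.index(min(starts))] += 1
    return hits
```

  • On the post-lists of row 1 of Table 2, `k_merge_hits([[0, 5], [1, 6]])` returns `[[0, 1], [5, 6]]`, matching the two hits 0 . . . 1 and 5 . . . 6 enumerated in the steps above.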
  • Enumerating hits of a multi-level tree may be done by suitably generalizing the k-merge operation.
  • the generalization can be a little complex, and may be well described by building up inductively from different types of multi-level tree examples.
  • Example 1 is based on the hits of row 3 in Table 2: [(nil,[<9>,<10>])] and corresponds to a 3-level tree.
  • the system 200 processes this example as follows.
  • the system 200 goes down one level since the top level is a singleton-AND. Next, the system 200 skips the nil. Finally, the system 200 produces the hit 9 . . . 10 from [<9>,<10>] and annotates it with [vice, president].
  • Example 2 is based on the hits of row 4 in Table 2: [(nil,<8>),<9>]
  • In step 1, the system 200 tries to 2-merge (nil,<8>) with <9>. Recognizing that the first argument is an OR, the system 200 goes down one level into the OR and effectively does the 2-merge of [<8>,<9>] in step 2.
  • Example 3 is based on the hits in row 5 of Table 2: [<0>,<1>,(<8>,[<2>,<3>,<4>])]
  • In step 1, the system 200 recognizes the need for a 3-merge at the top level.
  • the system 200 places the pointers at the correct locations of the first two entries.
  • the third entry is an OR, so the system 200 descends into the third entry and then places the pointer on the first entry in the first post-list in the OR choices. (This entry is 8.)
  • the system 200 then outputs the hit (0 . . . 1,8) off to the scorer.
  • In step 3, the system 200 moves over to the second choice in this OR. This is itself an AND of three choices, so the system 200 needs a 3-merge of [<2>,<3>,<4>]. This 3-merge produces the hit 2 . . . 4, which gets appended to 0 . . . 1 to yield 0 . . . 4.
  • Example 4 is based on the hits of row 6 of Table 2: [(<0>,[nil,nil,nil]),nil]
  • In step 1, the system 200 recognizes the need for a 2-merge at the top level.
  • the system 200 notices that the first entry is an OR, so the system 200 descends into the first entry and then places the pointer on the first entry in the first post-list in the OR choices.
  • the system 200 notes that the second entry of the top-level AND is nil, so the system 200 outputs [0,nil] as one hit.
  • the system 200 advances the first pointer to the second choice in the OR (<0>,[nil,nil,nil]) and notices that it is [nil,nil,nil]. So the system 200 stops, and no new hits are generated.
  • the hit scorer may take two arguments: argument_name and hit.
  • Table 3 shows a number of examples explaining the scoring. Table 3, Scoring individual hits:
  • the system 200 brings together various algorithms, processes and techniques that are particularly suited for finding inaccurate data and piecing together rapidly changing pieces of data and claims to generate golden records at a massive scale.
  • the system 200 provides a complete framework to efficiently evaluate data and to improve the completeness and accuracy of data.
  • the system 200 provides a solid foundation for linking external data sources to core data assets in a reliable and scalable way that will enable customers to gain additional insights into their customers.
  • FIG. 3 illustrates a block diagram of an environment 310 wherein an on-demand database service might be used.
  • the environment 310 may include user systems 312 , a network 314 , a system 316 , a processor system 317 , an application platform 318 , a network interface 320 , a tenant data storage 322 , a system data storage 324 , program code 326 , and a process space 328 .
  • the environment 310 may not have all of the components listed and/or may have other elements instead of, or in addition to, those listed above.
  • the environment 310 is an environment in which an on-demand database service exists.
  • a user system 312 may be any machine or system that is used by a user to access a database user system.
  • any of the user systems 312 may be a handheld computing device, a mobile phone, a laptop computer, a work station, and/or a network of computing devices.
  • the user systems 312 might interact via the network 314 with an on-demand database service, which is the system 316 .
  • Some on-demand database services may store information from one or more tenants stored into tables of a common database image to form a multi-tenant database system (MTS).
  • the “on-demand database service 316 ” and the “system 316 ” will be used interchangeably herein.
  • a database image may include one or more database objects.
  • a relational database management system (RDBMS) or the equivalent may execute storage and retrieval of information against the database object(s).
  • the application platform 318 may be a framework that allows the applications of the system 316 to run, such as the hardware and/or software, e.g., the operating system.
  • the on-demand database service 316 may include the application platform 318 which enables creation, managing and executing one or more applications developed by the provider of the on-demand database service, users accessing the on-demand database service via user systems 312 , or third party application developers accessing the on-demand database service via the user systems 312 .
  • the users of the user systems 312 may differ in their respective capacities, and the capacity of a particular user system 312 might be entirely determined by permissions (permission levels) for the current user. For example, where a salesperson is using a particular user system 312 to interact with the system 316 , that user system 312 has the capacities allotted to that salesperson. However, while an administrator is using that user system 312 to interact with the system 316 , that user system 312 has the capacities allotted to that administrator. In systems with a hierarchical role model, users at one permission level may have access to applications, data, and database information accessible by a lower permission level user, but may not have access to certain applications, database information, and data accessible by a user at a higher permission level. Thus, different users will have different capabilities with regard to accessing and modifying application and database information, depending on a user's security or permission level.
  • the network 314 is any network or combination of networks of devices that communicate with one another.
  • the network 314 may be any one or any combination of a LAN (local area network), WAN (wide area network), telephone network, wireless network, point-to-point network, star network, token ring network, hub network, or other appropriate configuration.
  • TCP/IP stands for Transmission Control Protocol and Internet Protocol.
  • the user systems 312 might communicate with the system 316 using TCP/IP and, at a higher network level, use other common Internet protocols to communicate, such as HTTP, FTP, AFS, WAP, etc.
  • HTTP stands for HyperText Transfer Protocol.
  • the user systems 312 might include an HTTP client commonly referred to as a “browser” for sending and receiving HTTP messages to and from an HTTP server at the system 316 .
  • Such an HTTP server might be implemented as the sole network interface between the system 316 and the network 314, but other techniques might be used as well or instead.
  • the interface between the system 316 and the network 314 includes load sharing functionality, such as round-robin HTTP request distributors to balance loads and distribute incoming HTTP requests evenly over a plurality of servers. At least as for the users that are accessing that server, each of the plurality of servers has access to the MTS' data; however, other alternative configurations may be used instead.
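  • A round-robin HTTP request distributor of the kind mentioned above can be sketched as follows. This is only an illustration of the round-robin idea: the class name, the `route` method, and the server labels are hypothetical, and a real load balancer would also handle server health and failures.

```python
from itertools import cycle


class RoundRobinDistributor:
    """Assign incoming requests to servers in a fixed rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = cycle(servers)

    def route(self, request):
        """Return the (server, request) pair for the next server in turn."""
        return next(self._servers), request


distributor = RoundRobinDistributor(["app1", "app2"])
assigned = [distributor.route(f"GET /page/{i}")[0] for i in range(4)]
```

  • With two servers, four consecutive requests are assigned to app1, app2, app1 and app2, which spreads the requests evenly over the servers.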
  • the system 316 implements a web-based customer relationship management (CRM) system.
  • the system 316 includes application servers configured to implement and execute CRM software applications as well as provide related data, code, forms, webpages and other information to and from the user systems 312 and to store to, and retrieve from, a database system related data, objects, and Webpage content.
  • data for multiple tenants may be stored in the same physical database object; however, tenant data typically is arranged so that data of one tenant is kept logically separate from that of other tenants so that one tenant does not have access to another tenant's data, unless such data is expressly shared.
  • the system 316 implements applications other than, or in addition to, a CRM application.
  • the system 316 may provide tenant access to multiple hosted (standard and custom) applications, including a CRM application.
  • User (or third party developer) applications which may or may not include CRM, may be supported by the application platform 318 , which manages creation, storage of the applications into one or more database objects and executing of the applications in a virtual machine in the process space of the system 316 .
  • One arrangement for elements of the system 316 is shown in FIG. 3, including the network interface 320, the application platform 318, the tenant data storage 322 for tenant data 323, the system data storage 324 for system data 325 accessible to the system 316 and possibly multiple tenants, the program code 326 for implementing various functions of the system 316, and the process space 328 for executing MTS system processes and tenant-specific processes, such as running applications as part of an application hosting service. Additional processes that may execute on the system 316 include database indexing processes.
  • each of the user systems 312 could include a desktop personal computer, workstation, laptop, PDA, cell phone, or any wireless access protocol (WAP) enabled device or any other computing device capable of interfacing directly or indirectly to the Internet or other network connection.
  • Each of the user systems 312 typically runs an HTTP client, e.g., a browsing program, such as Microsoft's Internet Explorer browser, Netscape's Navigator browser, Opera's browser, or a WAP-enabled browser in the case of a cell phone, PDA or other wireless device, or the like, allowing a user (e.g., subscriber of the multi-tenant database system) of the user systems 312 to access, process and view information, pages and applications available to it from the system 316 over the network 314 .
  • Each of the user systems 312 also typically includes one or more user interface devices, such as a keyboard, a mouse, trackball, touch pad, touch screen, pen or the like, for interacting with a graphical user interface (GUI) provided by the browser on a display (e.g., a monitor screen, LCD display, etc.) in conjunction with pages, forms, applications and other information provided by the system 316 or other systems or servers.
  • the user interface device may be used to access data and applications hosted by the system 316 , and to perform searches on stored data, and otherwise allow a user to interact with various GUI pages that may be presented to a user.
  • embodiments are suitable for use with the Internet, which refers to a specific global internetwork of networks. However, it should be understood that other networks can be used instead of the Internet, such as an intranet, an extranet, a virtual private network (VPN), a non-TCP/IP based network, any LAN or WAN or the like.
  • each of the user systems 312 and all of its components are operator configurable using applications, such as a browser, including computer code run using a central processing unit such as an Intel Pentium® processor or the like.
  • the system 316 (and additional instances of an MTS, where more than one is present) and all of their components might be operator configurable using application(s) including computer code to run using a central processing unit such as the processor system 317 , which may include an Intel Pentium® processor or the like, and/or multiple processor units.
  • a computer program product embodiment includes a machine-readable storage medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the embodiments described herein.
  • Computer code for operating and configuring the system 316 to intercommunicate and to process webpages, applications and other data and media content as described herein is preferably downloaded and stored on a hard disk, but the entire program code, or portions thereof, may also be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as any type of rotating media including floppy disks, optical discs, digital versatile disk (DVD), compact disk (CD), microdrive, and magneto-optical disks, and magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
  • the entire program code, or portions thereof may be transmitted and downloaded from a software source over a transmission medium, e.g., over the Internet, or from another server, as is well known, or transmitted over any other conventional network connection as is well known (e.g., extranet, VPN, LAN, etc.) using any communication medium and protocols (e.g., TCP/IP, HTTP, HTTPS, Ethernet, etc.) as are well known.
  • computer code for implementing embodiments can be written in any programming language that can be executed on a client system and/or server or server system, such as, for example, C, C++, HTML, any other markup language, Java™, JavaScript, ActiveX, any other scripting language, such as VBScript, and many other programming languages as are well known.
  • Java™ is a trademark of Sun Microsystems, Inc.
  • the system 316 is configured to provide webpages, forms, applications, data and media content to the user (client) systems 312 to support the access by the user systems 312 as tenants of the system 316 .
  • the system 316 provides security mechanisms to keep each tenant's data separate unless the data is shared.
  • they may be located in close proximity to one another (e.g., in a server farm located in a single building or campus), or they may be distributed at locations remote from one another (e.g., one or more servers located in city A and one or more servers located in city B).
  • each MTS could include one or more logically and/or physically connected servers distributed locally or across one or more geographic locations.
  • server is meant to include a computer system, including processing hardware and process space(s), and an associated storage system and database application (e.g., OODBMS or RDBMS) as is well known in the art. It should also be understood that “server system” and “server” are often used interchangeably herein.
  • database object described herein can be implemented as single databases, a distributed database, a collection of distributed databases, a database with redundant online or offline backups or other redundancies, etc., and might include a distributed database or storage network and associated processing intelligence.
  • FIG. 4 also illustrates the environment 310 . However, in FIG. 4 elements of the system 316 and various interconnections in an embodiment are further illustrated.
  • FIG. 4 shows that each of the user systems 312 may include a processor system 312 A, a memory system 312 B, an input system 312 C, and an output system 312 D.
  • FIG. 4 shows the network 314 and the system 316 .
  • system 316 may include the tenant data storage 322 , the tenant data 323 , the system data storage 324 , the system data 325 , a User Interface (UI) 430 , an Application Program Interface (API) 432 , a PL/SOQL 434 , save routines 436 , an application setup mechanism 438 , application servers 400 1 - 400 N , a system process space 402 , tenant process spaces 404 , a tenant management process space 410 , a tenant storage area 412 , a user storage 414 , and application metadata 416 .
  • the environment 310 may not have the same elements as those listed above and/or may have other elements instead of, or in addition to, those listed above.
  • the processor system 312 A may be any combination of one or more processors.
  • the memory system 312 B may be any combination of one or more memory devices, short term, and/or long term memory.
  • the input system 312 C may be any combination of input devices, such as one or more keyboards, mice, trackballs, scanners, cameras, and/or interfaces to networks.
  • the output system 312 D may be any combination of output devices, such as one or more monitors, printers, and/or interfaces to networks. As shown by FIG.
  • the system 316 may include the network interface 320 (of FIG. 3 ) implemented as a set of HTTP application servers 400 , the application platform 318 , the tenant data storage 322 , and the system data storage 324 . Also shown is the system process space 402 , including individual tenant process spaces 404 and the tenant management process space 410 .
  • Each application server 400 may be configured to access tenant data storage 322 and the tenant data 323 therein, and the system data storage 324 and the system data 325 therein to serve requests of the user systems 312 .
  • the tenant data 323 might be divided into individual tenant storage areas 412 , which can be either a physical arrangement and/or a logical arrangement of data.
  • Within each tenant storage area 412 , the user storage 414 and the application metadata 416 might be similarly allocated for each user. For example, a copy of a user's most recently used (MRU) items might be stored to the user storage 414 . Similarly, a copy of MRU items for an entire organization that is a tenant might be stored to the tenant storage area 412 .
  • the UI 430 provides a user interface and the API 432 provides an application programmer interface to the system 316 resident processes to users and/or developers at the user systems 312 .
  • the tenant data and the system data may be stored in various databases, such as one or more Oracle™ databases.
  • the application platform 318 includes the application setup mechanism 438 that supports application developers' creation and management of applications, which may be saved as metadata into the tenant data storage 322 by the save routines 436 for execution by subscribers as one or more tenant process spaces 404 managed by the tenant management process 410 , for example. Invocations to such applications may be coded using the PL/SOQL 434 that provides a programming language style interface extension to the API 432 . A detailed description of some PL/SOQL language embodiments is discussed in commonly owned U.S. Pat. No. 7,730,478 entitled, METHOD AND SYSTEM FOR ALLOWING ACCESS TO DEVELOPED APPLICATIONS VIA A MULTI-TENANT ON-DEMAND DATABASE SERVICE, by Craig Weissman, filed Sep. 21, 2007, which is incorporated in its entirety herein for all purposes. Invocations to applications may be detected by one or more system processes, which manage retrieving the application metadata 416 for the subscriber making the invocation and executing the metadata as an application in a virtual machine.
  • Each application server 400 may be communicably coupled to database systems, e.g., having access to the system data 325 and the tenant data 323 , via a different network connection.
  • one application server 400 1 might be coupled via the network 314 (e.g., the Internet)
  • another application server 400 N-1 might be coupled via a direct network link
  • another application server 400 N might be coupled by yet a different network connection.
  • Transmission Control Protocol and Internet Protocol (TCP/IP)
  • each application server 400 is configured to handle requests for any user associated with any organization that is a tenant. Because it is desirable to be able to add and remove application servers from the server pool at any time for any reason, there is preferably no server affinity for a user and/or organization to a specific application server 400 .
  • an interface system implementing a load balancing function (e.g., an F5 Big-IP load balancer)
  • the load balancer uses a least connections algorithm to route user requests to the application servers 400 .
  • Other load balancing algorithms, such as round robin and observed response time, can also be used.
  • the system 316 is multi-tenant, wherein the system 316 handles storage of, and access to, different objects, data and applications across disparate users and organizations.
  • one tenant might be a company that employs a sales force where each salesperson uses the system 316 to manage their sales process.
  • a user might maintain contact data, leads data, customer follow-up data, performance data, goals and progress data, etc., all applicable to that user's personal sales process (e.g., in the tenant data storage 322 ).
  • Since all of the data and the applications to access, view, modify, report, transmit, calculate, etc., can be maintained and accessed by a user system having nothing more than network access, the user can manage his or her sales efforts and cycles from any of many different user systems. For example, if a salesperson is visiting a customer and the customer has Internet access in their lobby, the salesperson can obtain critical updates as to that customer while waiting for the customer to arrive in the lobby.
  • the user systems 312 (which may be client systems) communicate with the application servers 400 to request and update system-level and tenant-level data from the system 316 that may require sending one or more queries to the tenant data storage 322 and/or the system data storage 324 .
  • the system 316 (e.g., an application server 400 in the system 316 ) automatically generates one or more SQL statements (e.g., one or more SQL queries) that are designed to access the desired information.
  • the system data storage 324 may generate query plans to access the requested data from the database.
  • Each database can generally be viewed as a collection of objects, such as a set of logical tables, containing data fitted into predefined categories.
  • a “table” is one representation of a data object, and may be used herein to simplify the conceptual description of objects and custom objects. It should be understood that “table” and “object” may be used interchangeably herein.
  • Each table generally contains one or more data categories logically arranged as columns or fields in a viewable schema. Each row or record of a table contains an instance of data for each category defined by the fields.
  • a CRM database may include a table that describes a customer with fields for basic contact information such as name, address, phone number, fax number, etc.
  • Another table might describe a purchase order, including fields for information such as customer, product, sale price, date, etc.
  • standard entity tables might be provided for use by all tenants.
  • such standard entities might include tables for Account, Contact, Lead, and Opportunity data, each containing pre-defined fields. It should be understood that the word “entity” may also be used interchangeably herein with “object” and “table”.
  • tenants may be allowed to create and store custom objects, or they may be allowed to customize standard entities or objects, for example by creating custom fields for standard objects, including custom index fields.
  • all custom entity data rows are stored in a single multi-tenant physical table, which may contain multiple logical tables per organization. It is transparent to customers that their multiple “tables” are in fact stored in one large table or that their data may be stored in the same table as the data of other customers.
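  • The single shared physical table described above can be illustrated with a minimal sketch. The class name `SharedTable`, the row layout, and the identifiers `org_a`, `org_b`, and `Invoice__c` are hypothetical and are not taken from the specification; the sketch only shows how rows of several logical tables from several organizations might sit in one store while each read is restricted to one organization.

```python
class SharedTable:
    """Hypothetical in-memory stand-in for one multi-tenant physical table."""

    def __init__(self):
        self._rows = []  # every tenant's rows live in the same list

    def insert(self, org_id, logical_table, record):
        # Each row is tagged with its organization and its logical table name.
        self._rows.append(
            {"org_id": org_id, "table": logical_table, "data": dict(record)}
        )

    def select(self, org_id, logical_table):
        # Reads are always filtered by org_id, so a tenant sees only its own rows.
        return [
            row["data"]
            for row in self._rows
            if row["org_id"] == org_id and row["table"] == logical_table
        ]


store = SharedTable()
store.insert("org_a", "Invoice__c", {"amount": 100})
store.insert("org_b", "Invoice__c", {"amount": 250})
store.insert("org_a", "Project__c", {"name": "Apollo"})
print(store.select("org_a", "Invoice__c"))
```

  • Under these assumptions, both organizations use a logical table with the same name, yet neither can read the other's rows because the organization identifier is part of every lookup.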

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Systems and methods are provided for matching snippets of search results to clusters of objects. A system searches information based on objects in a cluster of objects. The system extracts a data snippet from the search results. The system determines whether the data snippet includes data that matches at least one of the objects in the cluster of objects. The system adds the data snippet to the cluster of objects if the data snippet includes data that matches at least one of the objects in the cluster of objects.

Description

    CLAIM OF PRIORITY
  • This application claims the benefit of U.S. Provisional Patent Application No. 61/857,325 entitled, SYSTEM AND METHOD FOR MATCHING SNIPPETS OF SEARCH RESULTS TO CLUSTERS OF OBJECTS, by Nachnani, et al., filed Jul. 23, 2013, and U.S. Provisional Patent Application No. 61/862,873 entitled SYSTEM AND METHOD FOR CONFIDENTLY MERGING SNIPPETS OF SEARCH RESULTS WITH CLUSTERS OF OBJECTS, by Nachnani, et al., filed Aug. 6, 2013, the entire contents of which are incorporated herein by reference.
  • COPYRIGHT NOTICE
  • A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
  • BACKGROUND
  • The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
  • Companies are often overwhelmed with customer data. Names, titles, billing addresses, shipping addresses, email addresses, phone numbers, household data, affiliated companies, and associated parties are examples of customer data fields. Managing customer data can become extremely complex and dynamic due to the many changes individual customers go through over time. Multiply all of these customer data fields by the millions of customer data records which a company may have in its data sources, and factor in how quickly and how often this customer data changes, and the result is that many companies have a significant data management challenge.
  • Some customer data providers attempt to address this challenge by using a crowd-sourced platform to build a contact database which is sourced and updated by sales and marketing professionals. However, the customer data provided by customer data providers often has a variety of problems, such as invalid email addresses or invalid phone numbers, a contact record with incorrect information from a name spelled wrong to a bad address, incomplete or inaccurate records for company names, job titles, and phone numbers, non-current data, wrong company information or wrong contact data, duplicate contacts with inconsistent information, fields that are empty due to poor data capture techniques or contain other inaccurate information, completed fields that contain nonsense data such as “TBA” or “TBD,” and outdated information, such as a contact that no longer works at the contact's former company. Customer data providers may have these problems because community update models treat every add request or update request as an absolute fact, which can potentially lead to bad updates, such as incorrectly inactivating high-profile executives or fraudulently adding bogus contacts. While some issues may be alleviated by adding carrot-and-stick safeguards such as penalties for bad updates, rewards for good updates, and reputation-based updates, only a few ill-intentioned users can undermine the quality of customer data. Furthermore, the potential for bad data still exists when millions of records enter a customer data provider system from other sources, such that users or partners may end up adding bad data unknowingly from outdated lists and databases.
  • BRIEF SUMMARY
  • In accordance with embodiments, there are provided systems and methods for matching and confidently adding snippets of search results to clusters of objects. Information is searched based on objects in a cluster of objects. A data snippet is extracted from the search results. The data snippet is added to the cluster of objects if the data snippet includes data that matches at least one of the objects in the cluster of objects. A confidence score may be calculated for adding the data snippet to the cluster of objects based on the recency, a job title, an email address, and/or a phone number associated with the data snippet. The data snippet may be added to the cluster of objects in a customer accessible database if the confidence score is sufficiently high, and a notice for review may be generated if the confidence score is not sufficiently high.
  • For example, a database system searches a business database for information about a business contact stored in a contact database, wherein the contact database includes objects stored in a cluster of objects that correspond to a given name “Gregory,” a family name “Jones,” a company “International Business Machines,” a title “V.P for sales,” a location “New York City,” and an email address for a specific business contact. The database system extracts data that includes a given name “Greg,” a family name “Jones,” a company “IBM,” and a mobile phone number from the information in one of the search results. The database system determines whether the data snippet extracted from the information in the search results includes data that matches any of the objects stored in the cluster of objects in the contact database corresponding to the business contact named Gregory Jones. The database system adds the extracted data snippet, including the mobile phone number, to the objects stored in the cluster of objects in the customer accessible database that correspond to the business contact named Gregory Jones because the calculated confidence score is sufficiently high since both the data snippet and the objects in the cluster of objects include the uncommon family name “Jones.” In this example, a sales person planning on contacting Greg Jones at IBM now has Jones's mobile phone number that the sales person did not have previously.
  • The database system builds, manages and sustains a high-quality person data object by bringing in data from multiple sources, normalizing, enriching, matching, and merging data to provide a “golden record,” or a best version of the data, for a person and the person's various business profile attributes. The database system leverages free web data sources such as news feeds, blogs and search results to mine attributes such as titles, social handles, etc., to further improve the quality of contact, company, and location data objects, and uses this additional data to build, validate and enrich person profiles.
  • While one or more implementations and techniques are described with reference to an embodiment in which matching and confidently adding snippets of search results to clusters of objects is implemented in a system having an application server providing a front end for an on-demand database service capable of supporting multiple tenants, the one or more implementations and techniques are not limited to multi-tenant databases nor deployment on application servers. Embodiments may be practiced using other database architectures, e.g., ORACLE®, DB2® by IBM and the like without departing from the scope of the embodiments claimed.
  • Any of the above embodiments may be used alone or together with one another in any combination. The one or more implementations encompassed within this specification may also include embodiments that are only partially mentioned or alluded to or are not mentioned or alluded to at all in this brief summary or in the abstract. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • In the following drawings like reference numbers are used to refer to like elements. Although the following figures depict various examples, the one or more implementations are not limited to the examples depicted in the figures.
  • FIG. 1 is an operational flow diagram illustrating a high level overview of a method for matching and confidently adding snippets of search results to clusters of objects in an embodiment;
  • FIG. 2 is a block diagram of a system for matching and confidently adding snippets of search results to clusters of objects in an embodiment;
  • FIG. 3 illustrates a block diagram of an example of an environment wherein an on-demand database service might be used; and
  • FIG. 4 illustrates a block diagram of an embodiment of elements of FIG. 3 and various possible interconnections between these elements.
  • DETAILED DESCRIPTION
  • General Overview
  • Systems and methods are provided for matching and confidently adding snippets of search results to clusters of objects. As used herein, the term multi-tenant database system refers to those systems in which various elements of hardware and software of the database system may be shared by one or more customers. For example, a given application server may simultaneously process requests for a great number of customers, and a given database table may store rows for a potentially much greater number of customers. As used herein, the term query plan refers to a set of steps used to access information in a database system. Next, mechanisms and methods for matching and confidently adding snippets of search results to clusters of objects will be described with reference to example embodiments. The following detailed description will first describe a method for matching and confidently adding snippets of search results to clusters of objects. Next, a block diagram of an example system for matching and confidently adding snippets of search results to clusters of objects is described.
  • A lot of customer data makes up a database of contact records. The primary source for this customer data could be a website where users add and update business card information by adding or updating contact information one record at a time through a web form or by uploading comma separated value files that contain contact information. Users may also occasionally submit bounce email reports that contain error codes for invalid emails that they receive from their mail providers as part of their email marketing campaigns. A database system can receive and process millions of data records to provide new or updated data to customers in a timely manner. The database system cleans the data from the record, normalizes the data into a standard set of values that might be used for matching, enriches the data, and attempts to match the data with previously stored data to create a “golden record” for a person identified by the incoming data and/or previously stored data. False matches can result in the loss of good data and missed matches may reduce the value of previously stored data. The matching process also helps in identifying duplicates and reduces the likelihood that duplicate records are created for the same person. After the matching process returns a suitable list of matching person candidates, the database system adds the incoming data to a cluster of data objects that contains data values that match data objects for the person identified by the incoming data. Alternatively, the database system creates a new cluster of data objects for the person if the database system does not already include any cluster of data objects that match data objects for the person identified by the incoming data. Then the database system determines whether to store the added data in a customer accessible database.
  • A significant majority of bad and erroneous operations may be prevented, thereby resulting in much higher quality of customer data, if a database system treats every add or update contribution as a claim and takes into account the reputation of the user/partner who makes the claim. In addition, bad and erroneous operations may be prevented if a database system takes into account the type of the claim, the date and time of the claim, and further validates the claim with data from the free web and other sources with additional levels of data stewardship. Claims from trusted users can be treated as sources of truth and valuable enough to overwrite almost all existing information. Claims from average members and the free web may be treated as being as good as any other information. The evaluations with the most agreement prevail; for example, if three people evaluate some data as good and one person evaluates the same data as bad, the evaluations as good prevail.
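  • The rule that the most consistent evaluations prevail can be sketched as a simple vote count. The function name `prevailing_evaluation` and the choice to return no result on a tie are assumptions made for illustration; the specification does not say how ties or reputation weights are handled.

```python
from collections import Counter


def prevailing_evaluation(evaluations):
    """Return the evaluation with the most votes, or None for a tie or no votes."""
    top_two = Counter(evaluations).most_common(2)
    if not top_two:
        return None  # nobody evaluated the data
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return None  # tie: assumed to leave the current state unchanged
    return top_two[0][0]


# Three "good" evaluations against one "bad" evaluation.
print(prevailing_evaluation(["good", "good", "bad", "good"]))
```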
  • The database system weighs a claim on a graded scale and calculates various scores to generate a confidence score that is then used to determine the type of actions that are needed before the claim is fully processed and applied to generate a “golden record” for a person. The database system determines the quality of each and every individual attribute such as names, titles, emails, phones, social handles, etc. Each and every attribute of data in the claim is scored and weighed against similar attributes from other claims and golden records if they already exist. If the data in an attribute of a new claim is of better quality than an existing attribute and the confidence score of the new claim is above a certain threshold, the database system uses the incoming attribute in a data snippet to replace the existing data for that attribute in the golden record. The attribute in the data snippet is linked to a person record, where data from multiple contacts is combined to create/update work profiles for the person record, allowing tracking of the lifecycle and work profile of contacts. If the attribute in the data snippet is conflicting or additional details are needed, the database system generates an additional task/alert to data stewards for additional review based on the importance of the data record and the attribute in question. If the attribute in the data snippet is of poor quality, then the database system rejects the claim and the state of the attribute and golden record remains unaffected. If there is not enough information to make a decision, there is not enough authority to change the state, or no new information is detected, then no decision is made, re-affirming the current state of the data.
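  • The four outcomes described above (replace, review, reject, no decision) can be sketched as a small decision function. The names `Action` and `decide`, the numeric quality scale, the default threshold values, and the order in which the conditions are checked are all assumptions; the specification names the outcomes but does not fix these details.

```python
from enum import Enum


class Action(Enum):
    REPLACE = "replace existing attribute in the golden record"
    REVIEW = "create a task for data stewards"
    REJECT = "reject the claim"
    NO_DECISION = "keep the current state"


def decide(new_quality, existing_quality, confidence, conflicting=False,
           threshold=0.7, poor_quality=0.3):
    """Pick an action for one attribute of a claim (all scores assumed in [0, 1])."""
    if new_quality is None:
        return Action.NO_DECISION  # not enough information to decide
    if new_quality < poor_quality:
        return Action.REJECT  # poor-quality attribute: golden record untouched
    if conflicting:
        return Action.REVIEW  # conflicting data goes to a data steward
    if new_quality > existing_quality and confidence > threshold:
        return Action.REPLACE  # better quality and confidence above threshold
    return Action.NO_DECISION


print(decide(new_quality=0.9, existing_quality=0.5, confidence=0.8))
```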
  • FIG. 1 is an operational flow diagram illustrating a high level overview of a method 100 for matching and confidently adding snippets of search results to clusters of objects. As shown in FIG. 1, a database system may match and confidently add snippets of search results to clusters of objects.
  • A database system searches information based on objects in a cluster of objects, block 102. For example and without limitation, this can include the database system searching a business database for information about a business contact stored in a contact database, wherein the contact database includes objects stored in a cluster of objects that correspond to a given name “Gregory,” a family name “Jones,” a company “International Business Machines,” a title “V.P for sales,” a location “New York City,” and an email address for a specific business contact. After receiving search results based on objects in a cluster of objects, the database system extracts a data snippet from the search results, block 104. By way of example and without limitation, this can include the database system extracting data that includes a given name “Greg,” a family name “Jones,” a company “IBM,” and a mobile phone number from the information in one of the search results.
  • Having extracted the data snippet from the search results, the database system determines whether the data snippet includes data that matches at least one of the objects in the cluster of objects, block 106. In embodiments, this can include the database system determining whether the data snippet extracted from the information in the search result includes data that matches any of the objects stored in the cluster of objects in the contact database corresponding to the business contact named Gregory Jones. Whether the data snippet includes data that matches at least one of the objects in the cluster of objects may include matching based on first name aliases and/or acronym expansion.
  • For example, “Greg” is a given name alias that matches the given name “Gregory” and “IBM” is an acronym that can be expanded to match “International Business Machines.” If the data snippet includes data that matches at least one of the objects in the cluster of objects, the method continues to block 108. If the data snippet does not include data that matches at least one of the objects in the cluster of objects, the method proceeds to block 110. If the data snippet includes data that matches at least one of the objects in the cluster of objects, the database system adds the data snippet to the cluster of objects, block 108. For example and without limitation, this can include the database system adding the extracted data snippet, including the mobile phone number, to the objects stored in the cluster of objects in the contact database that correspond to the business contact named Gregory Jones because both the data snippet and the objects in the cluster of objects include the uncommon family name “Jones.”
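  • The alias and acronym matching of block 106 can be sketched as follows. Representing a snippet and a cluster as flat dictionaries, the lookup tables `GIVEN_NAME_ALIASES` and `ACRONYMS`, and the function names are simplifying assumptions for illustration; a real system would presumably use much larger reference data.

```python
# Hypothetical lookup tables mapping a short form to its canonical form.
GIVEN_NAME_ALIASES = {"greg": "gregory", "bill": "william"}
ACRONYMS = {"ibm": "international business machines"}


def normalize(field, value):
    """Lower-case a value and expand given-name aliases or company acronyms."""
    text = value.strip().lower()
    if field == "given_name":
        return GIVEN_NAME_ALIASES.get(text, text)
    if field == "company":
        return ACRONYMS.get(text, text)
    return text


def matching_fields(snippet, cluster):
    """Return the sorted field names whose normalized values agree."""
    return sorted(
        field
        for field in snippet
        if field in cluster
        and normalize(field, snippet[field]) == normalize(field, cluster[field])
    )


snippet = {"given_name": "Greg", "family_name": "Jones",
           "company": "IBM", "mobile": "555-0100"}
cluster = {"given_name": "Gregory", "family_name": "Jones",
           "company": "International Business Machines",
           "title": "V.P for sales"}
print(matching_fields(snippet, cluster))
```

  • In this sketch the snippet would proceed to block 108 whenever the returned list is non-empty, and to block 110 otherwise.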
  • The method 100 then proceeds to block 112. If the data snippet does not include data that matches at least one of the objects in the cluster of objects, the database system optionally stores the data snippet for matching with subsequent clusters of objects, block 110. By way of example and without limitation, this can include the database system storing the data snippet for matching with subsequent clusters of objects if the data snippet does not include data that matches at least one of the objects in the cluster of objects, as the contact database may be later supplemented with an additional contact that includes an object which matches some of the data in the extracted data snippet. Then the method 100 either terminates or begins again at block 102.
  • Having determined that the data snippet includes data which matches objects in a cluster of objects, the database system can also determine whether the data snippet includes data that matches objects in another cluster of objects, block 112. In embodiments, this can include the database system determining that the extracted data snippet that includes “Greg,” “Jones,” and the mobile phone number also matches an object in another cluster of objects that includes “Greg,” “Jones,” and a company “Microsoft.” If the data snippet includes data that matches at least one of the objects in another cluster of objects, the method continues to block 114. If the data snippet does not include data that matches at least one of the objects in another cluster of objects, the method proceeds to block 116. If the data snippet includes data that matches at least one object in the other cluster of objects, the database system may combine the cluster of objects with the other cluster of objects, block 114. For example and without limitation, this can include the database system combining the clusters of objects for the two business contacts named “Jones” whose objects include “International Business Machines” and “Microsoft.” Such a combination of clusters of objects for business contact objects could be useful for a sales person planning on contacting Greg Jones at IBM if the sales person knows some business contacts who worked at Microsoft at the time when Greg Jones worked at Microsoft.
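  • Combining two clusters in block 114 can be sketched as a field-by-field union. Modeling a cluster as a dictionary from field name to a set of values, and the function name `merge_clusters`, are assumptions for illustration only.

```python
def merge_clusters(first, second):
    """Union two clusters, each a dict mapping a field name to a set of values."""
    merged = {}
    for field in first.keys() | second.keys():
        merged[field] = set(first.get(field, ())) | set(second.get(field, ()))
    return merged


ibm_cluster = {"given_name": {"Gregory"}, "family_name": {"Jones"},
               "company": {"International Business Machines"}}
microsoft_cluster = {"given_name": {"Greg"}, "family_name": {"Jones"},
                     "company": {"Microsoft"}}
combined = merge_clusters(ibm_cluster, microsoft_cluster)
print(sorted(combined["company"]))
```

  • Keeping every value, rather than choosing one per field, preserves the employment history that the example in the text relies on.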
  • After adding a data snippet to a cluster of objects, the database system optionally calculates a confidence score for adding the data snippet to the cluster of objects based on the recency, a job title, an email address, and/or a phone number associated with the data snippet, block 116. By way of example and without limitation, this can include the database system calculating a confidence score based on how recently the data objects from the search result were stored in the business database, with today's date of storage equated with the highest recency score.
  • In another example, the database system calculates a confidence score based on a job title from the search result, with hierarchically higher job titles equated with a higher title rank score, and with job titles known to be used by the business contact's claimed company equated with a higher title quality score. In yet another example, the database system calculates a confidence score based on an email address from the search result, with the email score based on how well the email address matches the pattern of other email addresses for business contacts for the business contact's claimed company and how well the email address matches the first name and the last name of the business contact. In a further example, the database system calculates a confidence score based on a phone number from the search result, where the phone number score is based on the consistency between the claimed phone number and the area code associated with the claimed geographic location for the business contact. The confidence score may be based on any weighted combination of the recency, the job title, the email address, and the phone number from the data snippet.
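  • The weighted combination described above can be sketched as follows. This is a minimal illustration only: the component names, the equal default weights, and the assumption that every component score lies in the range 0 to 1 are hypothetical and are not specified in this disclosure.

```python
from dataclasses import dataclass


@dataclass
class ComponentScores:
    """Per-attribute scores, each assumed (hypothetically) to lie in [0, 1]."""
    recency: float
    job_title: float
    email: float
    phone: float


# Hypothetical equal weights; the text only requires "any weighted combination".
DEFAULT_WEIGHTS = {"recency": 0.25, "job_title": 0.25, "email": 0.25, "phone": 0.25}


def confidence_score(scores, weights=None):
    """Return the weighted average of the component scores."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weight * getattr(scores, name) for name, weight in weights.items())
    return weighted / total
```

Dividing by the sum of the weights keeps the result in the same range as the component scores, so a single threshold can be applied regardless of how the weights are tuned.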
  • The database system optionally determines whether a confidence score is sufficiently high for adding the data snippet to the cluster of objects stored in a customer accessible database, block 118. In embodiments, this can include the database system determining that a confidence score is sufficiently high for a new mobile phone number to be added to the cluster of data objects for Greg Jones in a customer accessible database. If a confidence score is sufficiently high for adding the data snippet to the cluster of objects stored in a customer accessible database, the method 100 continues to block 120. If a confidence score is not sufficiently high for adding the data snippet to the cluster of objects stored in a customer accessible database, the method 100 proceeds to block 122. If the confidence score is sufficiently high for adding the data snippet to the cluster of objects stored in the customer accessible database, the database system optionally adds the data snippet to the cluster of objects stored in the customer accessible database, block 120.
  • For example and without limitation, this can include the database system storing the new mobile phone number in the contact database that is accessible by a sales person planning on contacting Greg Jones, who now has Jones' mobile phone number that the salesman did not have previously. Then the method 100 either terminates or begins again at block 102. Although this example describes the database system using a confidence score to determine whether to add a data snippet to a cluster of objects in a customer accessible database, the database system may also use a confidence score to determine whether to combine the cluster of objects with the other cluster of objects. The database system may also use a confidence score to determine whether to combine the cluster of objects with the other cluster of objects in a customer accessible database. If the confidence score is not sufficiently high for adding the data snippet to the cluster of objects stored in the customer accessible database, the database system optionally generates a notice for review, block 122.
  • By way of example and without limitation, the notice for review can include the database system generating a notice for reviewing the adding of the data snippet to the cluster of objects because the mobile phone number in the search results is not associated with New York City, the claimed office location for Jones in the search results, and the title “VP” in the search results is too generic and does not match any titles known to be used by IBM, the claimed company for Jones in the search results. Then the method 100 either terminates or begins again at block 102. Accordingly, systems and methods are provided which enable a database system for matching and confidently adding snippets of search results to clusters of objects.
  • The method 100 may be repeated as desired. Although this disclosure describes the blocks 102-122 executing in a particular order, the blocks 102-122 may be executed in a different order. In other implementations, each of the blocks 102-122 may also be executed in combination with other blocks and/or some blocks may be divided into a different set of blocks.
  • FIG. 2 illustrates a block diagram of an example system for matching and confidently adding snippets of search results to clusters of objects, under an embodiment. As shown in FIG. 2, the system 200 may illustrate a cloud computing environment in which data, applications, services, and other resources are stored and delivered through shared data-centers and appear as a single point of access for the users. The system 200 may also represent any other type of distributed computer network environment in which servers control the storage and distribution of resources and services for different client users.
  • One example of a system that can implement matching and confidently adding data snippets to clusters of objects is Storm, a popular open-source, real-time data streaming framework from Twitter® that functions entirely in memory. Storm constructs a processing graph, called a “topology,” that feeds data from input sources through processing nodes. The input data sources are called “spouts,” and the processing nodes are called “bolts.” The data model consists of tuples, which flow from spouts to the bolts, which execute user code. Besides simply being locations where data is transformed or accumulated, bolts may also join streams of data and branch streams of data. Storm is designed to be run on several machines to provide parallelism. Storm processes streams of tuples. A stream is defined to be an unlimited ordered sequence of tuples, and each tuple is a one dimensional array of objects.
  • The system 200 acts as a central data processing hub, or clearing house, that brings multiple data sources and free web data together to generate the “golden” record for core data assets around accounts and persons. The following describes the key components of the system 200 as part of a Storm topology to implement the data processing pipeline. As part of the system initialization, before the Storm topology is activated to process any incoming claims, all the existing claims, reference data and golden person records are loaded into an in-memory key-value data store. The system 200 generates a set of specialized keys for each person record and claim that enable fast lookups for the purpose of matching and retrieval. Indices are created for each company, person and location object in the cache, and these indices are used for person matching, company matching and location matching. A spout is a source of data streams in a Storm topology. Generally spouts will read tuples from an external source and emit them into the topology.
  • The business directory spout 202 reads data from the business directory database and emits tuples, which are treated as claims, into the topology, such as contact added, contact updated, contact invalid phone, contact invalid email, and contact not at company. Each of these claim types has an associated contributor identifier that is the identity of the user who performed the action. The business directory spout 202 is an unbounded stream and keeps emitting data until there is no more data to be read. The tuples that are emitted out of the business directory spout 202 may be distributed randomly (shuffle grouping) to a normalize bolt 204 which is the first bolt in the pipeline. This data can also be sent to a search engine bolt 206 which executes free web search queries and tries to find additional data around this contact, such as titles and social handles.
  • The partial records spout 208 provides partial records from disparate sources. The partial records spout 208 reads contact data from a partial records database where files that are uploaded by users on the website are stored in raw format before partial records processing. The key difference here is that unlike the business directory spout 202, the partial records spout 208 emits tuples based on the partial data in the uploaded files. Also, the tuples that come out of the partial records spout 208 will often contain very poorly normalized data. Similar to the business directory spout 202, the tuples that are emitted out of the partial records spout 208 may be distributed randomly (shuffle grouping) to the normalize bolt 204 which is the first bolt in the pipeline. This data can also be sent to the search engine bolt 206 which executes free web search queries and tries to find additional data around this contact, such as titles and social handles. Examples of claims emitted by the partial records spout 208 include contact added and contact added for new company.
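  • The spout, bolt, and tuple concepts described above can be illustrated with plain Python generators. This sketch deliberately does not use the Storm API; the function names and the two-field tuple layout are hypothetical and serve only to show tuples flowing from a source through chained processing nodes.

```python
def contact_spout(rows):
    """Source node: emit one (claim_type, name) tuple per input row."""
    for row in rows:
        yield (row["claim_type"], row["name"])


def normalize_bolt(stream):
    """Processing node: transform each tuple as it flows through."""
    for claim_type, name in stream:
        yield (claim_type, name.strip().title())


def filter_bolt(stream, keep):
    """Processing node: drop tuples that fail the predicate."""
    for tup in stream:
        if keep(tup):
            yield tup


def run_topology(rows):
    """Wire spout -> normalize -> filter and collect the emitted tuples."""
    stream = contact_spout(rows)
    stream = normalize_bolt(stream)
    stream = filter_bolt(stream, lambda tup: tup[1] != "")
    return list(stream)
```

Because each node is a generator, tuples are processed one at a time rather than in batches, which loosely mirrors the streaming behavior described for the topology. Parallelism across machines is not modeled here.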
  • The bounce email spout 210 reads bounce email error codes, which may be from comma separated value files that are uploaded by website administrators and website users. Examples of claims emitted by the bounce email spout 210 include contact email and contact message. The bounce file message that the bounce email spout 210 receives for an email is typically unstructured text, such as records that are comma-separated with the email in the first column and the second column containing the bounce message as unstructured text. In order for the bounce email spout 210 to emit the objects properly, an automatic column mapping algorithm may initially process the first few lines of the file. The algorithm does not need to rely on the names of the column headers, but rather the algorithm can tokenize the bounce file. The field separator may be determined from the file by tokenizing on each kind of separator and computing how consistent the resulting number of tokens per record is across the entire file. After determining the field separator, the algorithm can determine which column contains the email and which column contains the message. The algorithm may split out the record, remove the email, and concatenate the rest of the record to create the contact message claim. The emitted contact message claim is typically an unstructured snippet of text.
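  • The column mapping steps described above might be sketched as follows. The candidate separator list, the consistency measure (the share of lines that have the most common token count), and the use of “@” to locate the email column are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical list of separators to try; the text does not enumerate them.
CANDIDATE_SEPARATORS = [",", "\t", ";", "|"]


def detect_separator(lines):
    """Pick the separator whose token count per line is most consistent."""
    best, best_score = None, -1.0
    for sep in CANDIDATE_SEPARATORS:
        counts = [len(line.split(sep)) for line in lines if line.strip()]
        if not counts:
            continue
        token_count, frequency = Counter(counts).most_common(1)[0]
        if token_count < 2:
            continue  # this separator never actually splits a line
        score = frequency / len(counts)
        if score > best_score:
            best, best_score = sep, score
    return best


def find_email_column(lines, sep):
    """Return the column index that most often contains an '@'."""
    rows = [line.split(sep) for line in lines if line.strip()]
    if not rows:
        raise ValueError("no non-empty lines to inspect")
    width = min(len(row) for row in rows)
    hits = [sum(1 for row in rows if "@" in row[col]) for col in range(width)]
    return max(range(width), key=hits.__getitem__)


def to_contact_message_claim(line, sep, email_col):
    """Remove the email and concatenate the rest into the message text."""
    fields = line.split(sep)
    email = fields.pop(email_col).strip()
    message = " ".join(field.strip() for field in fields)
    return email, message
```

A bounce message that itself contains the separator lowers the consistency score rather than breaking detection, and the extra fields are simply concatenated back into the message.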
  • The social handle spout 212 reads contact data and social handles from a social handle repository and submits claims such as contact social handle.
  • The crawler spout 214 emits contacts found on the web from crawling websites for their management pages. The crawler spout 214 may start with a number of seed companies that the system 200 currently has and use it as the starting point for crawling. Examples of claims emitted by the crawler spout 214 include contact added and contact updated.
  • Processing in a Storm topology is generally done in bolts. Bolts may do anything from filtering, functions, aggregations, joins, talking to databases, and more. The normalize bolt 204 processes all the tuples that come to it through a series of data normalization routines. The normalize bolt 204 may standardize addresses, titles, phone numbers, and properly classify contact records by department and level. The following are some of the key normalizations. An address normalizer can include a list of abbreviations, such as E to East, W to West, and Blvd to Boulevard; allow only letters, numbers, and special characters; and remove any spaces around the special characters. A title normalizer may include a list of misspellings and abbreviations. A name normalizer can allow letters and special characters, not allow special characters at the beginning and the end of a name, capitalize the first letter and add a space after each name, capitalize the next letter if a name starts with “Mc,” and capitalize all Roman numerals. A city normalizer may allow only letters and special characters, and keep only the last non-space special character if there is a sequence of special characters. A base normalizer can return the correct country normalizer based on the country abbreviation. A phone normalizer may normalize phone patterns based on each country having its own phone pattern. A zip normalizer can normalize zip code patterns based on each country having its own zip code pattern. A state normalizer may normalize states based on each country having its own state requirements, if there are any. Once the data is normalized, the normalize bolt 204 can pass the data to the next stage in the pipeline, which is an enrich bolt 216.
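  • Two of the normalizers described above, the address normalizer and the name normalizer, might look like the following sketch. The abbreviation table holds only the three examples given in the text, and the set of Roman numeral suffixes and the set of stripped special characters are assumptions.

```python
# Only the abbreviations named in the text; a real list would be longer.
ADDRESS_ABBREVIATIONS = {"e": "East", "w": "West", "blvd": "Boulevard"}

# Hypothetical set of generational suffixes treated as Roman numerals.
SUFFIX_NUMERALS = {"ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}


def normalize_address(address):
    """Expand known abbreviations word by word."""
    words = []
    for word in address.split():
        key = word.rstrip(".").lower()
        words.append(ADDRESS_ABBREVIATIONS.get(key, word))
    return " ".join(words)


def normalize_name(name):
    """Capitalize name parts, handle the 'Mc' prefix and Roman numerals."""
    parts = []
    for index, raw in enumerate(name.split()):
        part = raw.strip(".,-'")  # no special characters at either end
        if not part:
            continue
        lower = part.lower()
        if index > 0 and lower in SUFFIX_NUMERALS:
            parts.append(lower.upper())
        elif lower.startswith("mc") and len(lower) > 2:
            parts.append("Mc" + lower[2].upper() + lower[3:])
        else:
            parts.append(lower[0].upper() + lower[1:])
    return " ".join(parts)
```

Roman numerals are only recognized after the first name part so that a short first name such as “Vi” is not upper-cased by mistake; this restriction is a design choice of the sketch, not something stated in the text.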
  • The enrich bolt 216 uses external data services for email verification, for phone verification and social append services for social handles, and appends a set of meta-attributes to all the new contact claims that enter the pipeline. After enrichment, the tuple may contain additional metadata around emails, phones and social handles that is useful for matching and merging purposes. The enrich bolt 216 passes this data to the match bolt 218 that tries to match the incoming contact claims with other existing claims and facts in the system 200.
  • The match bolt 218 is based on the system 200 modeling a specific data model of a person object. For example, to allow matching on the probe (title=CEO, company=Google), the system 200 creates a suitable index (title@company or title_rank@company). A probe is a (partial) person record, such as some attribute:value pairs of a person with at least the person name present. For example (first_name=shabd, last_name=vaid, company=Responsys) should match the person Shabd Vaid because this name is uncommon and in the past he has worked at Responsys. The working data model of person object attributes may include: first name, last name, linkedin handle, twitter handle, other social handles, links to contact objects, work history, photos, education, and snippets, which are unstructured short pieces of text such as search result snippets, tweets, etc., and others containing person-identifying content.
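  • The index types mentioned above, such as title@company, can be sketched as composite string keys stored in an in-memory dictionary. The key format, the `PersonIndex` class, and the rule that a probe matches any person sharing at least one key are hypothetical simplifications; the deeper checks described elsewhere in this disclosure are not modeled.

```python
from collections import defaultdict


def make_keys(record):
    """Build composite lookup keys from whichever attributes are present."""
    keys = []
    first, last = record.get("first_name"), record.get("last_name")
    company, title = record.get("company"), record.get("title")
    if first and last and company:
        keys.append(f"name@company:{first} {last}@{company}".lower())
    if title and company:
        keys.append(f"title@company:{title}@{company}".lower())
    return keys


class PersonIndex:
    """In-memory map from composite key to the set of matching person ids."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, person_id, record):
        for key in make_keys(record):
            self._index[key].add(person_id)

    def lookup(self, probe):
        """Return candidate person ids sharing at least one key with the probe."""
        candidates = set()
        for key in make_keys(probe):
            candidates |= self._index.get(key, set())
        return candidates
```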
  • A person object is composed of contact objects in a one-to-many relationship. That is, a person may have many contact objects, but a contact object belongs to only one person object. So if the probe matches a contact object, the system 200 can infer that the probe matches the associated person object. If a probe does not match any contact object, yet it does match a person object, then either the probe contains some person-level attributes (such as social handles) which match a person object, or the probe contains some attributes of a person which cross contact boundaries. For example, the probe may be {person_name: “shabd vaid”, company: “iStorez”, company: “Responsys”}. This probe should match the Shabd Vaid person because his name is uncommon and he worked at both companies.
  • Here, p denotes a person object, p.work_history.company_names denotes the names of companies p has worked at, p.work_history.cities denotes the set of all cities p has worked in, p.work_history.titles denotes the set of job titles that the person has held, and similar notations exist for work emails, work phones, work states, work countries, and social handles. The formats of the objects of different types of social handles (linkedin, twitter, etc.) are quite different, so it may not be necessary to have a different index type for each type of social handle because there is no risk of a collision.
  • A final check models the probability that a match is a chance event. Let M denote this match. Specifically, assume a universe of objects (here, persons) that has size n, and assume a uniform probability model on this universe, that is, all objects are equally likely. The system 200 can estimate the upper bound on the expected number E(M) of objects in the universe that have the properties of the match M, under the universe probability model. If this upper bound estimate is below a certain threshold (1 may be a sensible choice) the system 200 accepts this match, otherwise the system 200 rejects the match. One way to estimate a suitable upper bound on E(M) is to model the probabilities of various attribute:value pairs under the universe probability model, then assume the independence of attributes in the match and multiply out these probabilities, then finally multiply this by n. To formally describe this, let M={a:v|a is an attribute and v is its value}. For example, M={person_name: “john smith”, company_name: “ibm”}. This means that the person name matched in M is John Smith, and the company_name matched in M is ibm. Now E(M) = n * product_{a:v in M} P(a:v) (EUB 1). Modeling the probabilities of all attribute:value pairs in the universe is probably too complex, so the database system may begin by modeling the probabilities of certain key attributes and their values, drop all attributes other than these from M, and still use (EUB 1). The result is still an estimate of the upper bound on E(M). For concreteness, suppose the system 200 has modeled the probability of person names in the universe, and of company names. For example, M={person_name: “john smith”, company_name: “ibm”}. The estimated upper bound on E(M) is P(person_name: “john smith”) * P(company_name: “ibm”) * n ≈ P(person_name: “john smith”) * #contacts_in_company(company_name: “ibm”).
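  • The estimate (EUB 1) and the threshold test can be sketched as follows. The probability table is passed in by the caller, and any numbers supplied to it in practice would come from modeling that this disclosure does not detail; pairs without a modeled probability are dropped from M, which can only increase the product and therefore keeps the result an upper bound.

```python
def expected_chance_matches(match, probabilities, universe_size):
    """Estimate E(M) = n * product of P(a:v) over the modeled pairs in M."""
    product = 1.0
    for attribute, value in match.items():
        p = probabilities.get((attribute, value))
        if p is None:
            continue  # unmodeled pair: dropped from M, bound stays valid
        product *= p
    return universe_size * product


def accept_match(match, probabilities, universe_size, threshold=1.0):
    """Accept the match when fewer than `threshold` chance matches are expected."""
    return expected_chance_matches(match, probabilities, universe_size) < threshold
```

With the default threshold of 1, a match on a common name alone is typically rejected, while the same name combined with a company can be accepted because the expected number of coincidental matches falls below one.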
  • The result-set size based estimate may not generalize as well as explicit modeling. For example, the P(person_name) explicit model which assumes independence of first and last names does not generalize well. An alternative to an explicit estimate is a result-set size based estimate. In this version, the system 200 runs the matcher to find all true positive matches. Here, ‘true positive’ may not include ‘modeling chance matches’. If there are at least two distinct objects in the result set, the system 200 deems that the probe being matched is not matched uniquely. This approach has the benefit that the P(a:v) probabilities are not explicitly modeled. The result set will carry the information to judge whether a match is unique or not, even in complex cases. This approach has the limitation that it does not model the real world; only the current, actual universe of (golden) data objects. Another issue is that to implement this approach, the system 200 may need to do this computation after all the true positives have been generated. Furthermore, the system 200 can match within the result set to check whether there are indeed at least two different objects or not.
  • The search engine bolt 206 takes partial data (also known as a seed) and tries to find more publicly available information via a search engine 220, such as Yahoo® Boss, because finding titles and social handles is useful. The data thus obtained is passed through a search results bolt 222 to extract vital information and enrich a data record to build a full person profile, such as by passing the data to a handle extractor bolt 224.
  • The search results bolt 222 uses search result snippets, which have attractive properties that suggest they be made first-class “objects” in a person database 226 and/or contact data model and matching engines. Snippets are consumed without running afoul of terms of use restrictions. For the most part, snippets contain information about a single entity, such as a person, company, or contact. Snippets might be matched to a different type of suitable object, such as person, company, or contact. Some snippets contain information about multiple companies at which a person has worked, so snippets could be used to connect together multiple contacts of the same person. Such a matching is of mostly unstructured text (the snippet) to structured data (a particular contact object). This matching does not require entity extraction from the snippet, and could be algorithmically relatively easy to do. Once a snippet has been matched to a suitable object with a sufficiently high confidence score, certain “nuggets” might be extracted from the snippet and the matching object enriched. For example, if the snippet contains a LinkedIn handle and the snippet matches a particular contact sufficiently well, this handle is then attached to that contact. A snippet may tie together multiple contacts of the same person because the snippet contains the names of multiple companies at which the person has worked.
  • Contact initiated snippets generation and matching may work as follows. Start with a contact J. Let C denote the cluster of the person database 226 containing J. Generate a suitable query Q to the search engine 220 from J. For each snippet S in the top search results on Q, if S matches C with a sufficiently high confidence, add S to C, otherwise add S to a collection of unmatched snippets. If the person name in J is sufficiently uncommon, set Q to person-name(J), else set Q to person-name(J)+company-name(J). Two example queries are “Pawan Nachnani” and “John Smith ibm.” Note that there is no data quality risk in setting a query too broad, such as a common person name, because the resulting snippets will be deeply matched with C. An overly broad query does not yield good recall because none of the snippets in its result set deeply match C. Recall may be less important than precision because the system 200 can make up for low recall by issuing more queries to the search engine 220, so long as the system 200 is not constrained overly by search volume limits. Also, if the system 200 uses a mechanism to consume unmatched snippets, this mitigates the recall limitation a lot. C denotes the data of a single person. A snippet may contain data of this person spread across multiple contacts, which is why the database system matches S to C and not merely to J.
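  • The contact initiated procedure above can be sketched as two small functions. The `is_uncommon_name` and `match_confidence` callables are placeholders supplied by the caller, since this passage does not define how name rarity or snippet-to-cluster confidence is computed, and representing the cluster C as a Python list is likewise an assumption.

```python
def build_query(contact, is_uncommon_name):
    """Set Q to the person name, adding the company when the name is common."""
    name = f'{contact["first_name"]} {contact["last_name"]}'
    if is_uncommon_name(name):
        return name
    return f'{name} {contact["company"]}'


def assign_snippets(snippets, cluster, match_confidence, threshold):
    """Add well-matched snippets to the cluster; return the unmatched rest."""
    unmatched = []
    for snippet in snippets:
        if match_confidence(snippet, cluster) >= threshold:
            cluster.append(snippet)
        else:
            unmatched.append(snippet)
    return unmatched
```

Returning the unmatched snippets rather than discarding them leaves room for the later mechanism that consumes unmatched snippets.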
  • The process described in the previous section can produce a lot of snippets that remain unmatched. Accumulating these even over a short period of time may yield millions of snippets. Many of these snippets could contain useful information about contacts or persons that are not even yet in the database. In short, these snippets collectively have a lot of value. These snippets might be matched to contact or person objects and placed in the suitable cluster, then be available for merge. One major challenge in this regard is that of indexing a snippet for efficient matching. A person name may be a good index for snippets from person queries. The person name can be found from a snippet by light-weight entity recognition. Therefore, the match bolt 218 includes bolts such as a handle bolt 228, an email bolt 230, a name@company bolt 232, a name@phone bolt 234, and a name@location bolt 236 to match snippets to clusters of objects in the person database 226.
  • A cluster bolt 238 clusters all matching claims together into a common cluster. A merge bolt 240 merges all claims and existing contact records (partial and/or complete) from a cluster into a single composite record (the merged record) and computes a confidence score for the merged record. If the merged record is incomplete, the merge bolt 240 enriches the record when possible with information available in the cache. If the record is complete, the merge bolt 240 marks the record as canonicalized. At this point, the record is ready to be persisted in the person database 226, provided its confidence score is sufficiently high. The merge bolt 240 also updates the merge time of the incoming claim.
  • If r.day is today, then this score may have the value 1, and the score can decrease toward 0 for dates many days in the past. Score(r,rank): based on r.title. C-level titles may get a score of 1, and the rank score can monotonically decay for lower rank titles. Score(r,title_quality): High rank titles, e.g. Vice President, do not necessarily have high quality. Title_quality may score this separate dimension. A title might be deemed to have high quality if it has a known rank, has a known department, and is not in an explicit list of poor titles. The quality may decrease depending on which (and how many) of these tests are violated. Score(r,domain): might only be defined when r's company has been matched to company jc. Score(r,d)=#emails in domain d/#contacts in company jc. Score(r,pattern_domain): How well (r.first_name,r.last_name,r.email) fits the email pattern of the domain of r.email. Let p(r)=(r.first_name,r.last_name,r.email) be the pattern in r. For example, p=first.last for (john,doe,john.doe@xyz.com), and p=flast for (john,doe,jdoe@xyz.com). Score(r,pattern_domain)=#emails in domain of r.email having pattern p(r) divided by #emails in domain of r.email.
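  • The pattern_domain score above might be computed as in this sketch. Only the first.last and flast patterns come from the text; the additional “first” pattern, the “other” fallback label, and the tuple layout (first_name, last_name, email) for known records are assumptions.

```python
def email_pattern(first_name, last_name, email):
    """Classify how the local part of an email is built from the name."""
    local_part = email.split("@", 1)[0].lower()
    first, last = first_name.lower(), last_name.lower()
    candidates = {
        "first.last": f"{first}.{last}",
        "flast": f"{first[:1]}{last}",
        "first": first,  # hypothetical extra pattern
    }
    for label, expected in candidates.items():
        if local_part == expected:
            return label
    return "other"


def email_domain(email):
    """Return the lower-cased domain part of an email address."""
    return email.rsplit("@", 1)[-1].lower()


def pattern_domain_score(record, known_records):
    """Share of known emails in the record's domain that use its pattern."""
    first, last, email = record
    domain = email_domain(email)
    same_domain = [r for r in known_records if email_domain(r[2]) == domain]
    if not same_domain:
        return 0.0
    pattern = email_pattern(first, last, email)
    hits = sum(1 for r in same_domain if email_pattern(*r) == pattern)
    return hits / len(same_domain)
```

Note that two emails both classified as “other” count as sharing a pattern in this sketch, which a fuller implementation would likely want to refine.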
  • The intent is that updates algorithmically deemed risky may be logged for review by a data steward or community. Feedback from the review can be used to assess the accuracy of this scoring/detection mechanism, and to tune it if it is deemed useful enough. An update is risky if a contact's last name is changed. A title change with more than one level increase in rank, such as software engineer to ceo, is also risky. A score version of this may make the risk score depend on the number of skipped levels. A title change which changes departments to another incompatible department, such as vp sales to vp engineering, is also risky. Updating or adding a C-level contact in a large company is risky, but easy to generalize in a scoring setting: the higher the rank of the contact and the larger the company size, the higher the risk score may be. Also, different update actions might have differing risks; for example, a title change is generally more risky than a last name change for a female. A fortune 1000 headquarters address change is also risky, but scoring may generalize this to (important company combined with attribute-specific change score → overall risk score).
  • The join bolt 242 takes all the merged claims from the merge bolt 240 and constructs person objects. A person object may be a collection of major profiles, such as a person profile, a work profile, and a social profile. The data from each merged claim can update one or many attributes across all three profiles of a person. In some cases, a merged claim may end up creating new profile objects as new claims become available. Each attribute in a profile ends up with a confidence score that may ultimately determine the level of “gold” for that particular profile object. While most of the attributes might be permanent, some of the attributes could be transient and need to be re-computed over time due to privacy and legal reasons.
  • A persist bolt 244 may save all the resultant person records and the underlying claims to the person database 226 once all the processing is completed by the join bolt 242.
  • The bounce email processing bolt is a reaper bolt 246 that aggregates multiple facts with a current claim and comes up with a score and a disposition about that score. The reaper bolt 246 may determine if a fact is a duplicate. The fact disposition can determine if the computed score warrants a graveyard or ungraveyard of the underlying contact. The score of the current claim could be computed as follows: Take all claims and scored facts for the same email. For each fact, get the base score determined by the response category of the email. From the description from the bounce email spout 210, the contact message is typically unstructured data. The reaper bolt 246 may address this by using a trie-based approach to find tokens specified in a list of vendor dictionaries. Each vendor dictionary can specify the token with a classified response category. Response categories for email may be hard_error, heavy_error, soft_error, emailreceived, unknown. Once the score is computed, depending on the live contact and graveyard thresholds, the reaper bolt 246 may determine if the contact is to be made live or graveyarded. The reaper bolt 246 can automatically graveyard records from bounce reports and phone campaigns, or float these records to a community for task resolution.
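  • The trie-based lookup of vendor dictionary tokens might be sketched as a word-level trie that maps phrases to response categories. The example phrases, the regular-expression tokenizer, and the `PhraseTrie` class are hypothetical; the response category names are taken from the list above.

```python
import re

_END = object()  # sentinel key marking the end of a stored phrase


def _tokens(text):
    """Lower-case alphanumeric tokens; punctuation is ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PhraseTrie:
    """Word-level trie mapping vendor phrases to response categories."""

    def __init__(self):
        self.root = {}

    def add(self, phrase, category):
        node = self.root
        for word in _tokens(phrase):
            node = node.setdefault(word, {})
        node[_END] = category

    def find_categories(self, message):
        """Return the category of every stored phrase found in the message."""
        words = _tokens(message)
        found = []
        for start in range(len(words)):
            node = self.root
            for word in words[start:]:
                node = node.get(word)
                if node is None:
                    break
                if _END in node:
                    found.append(node[_END])
        return found
```

A scoring step could then map each category found to a base score and aggregate across all facts for the same email; those base scores and thresholds are not given in the text and are left out here.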
  • The crawler spout 214 looks at the free web (sites approved by a legal department for acceptable terms of service) and finds publicly available information/claims. Since most of the open web sources of data are unstructured, the publicly available information typically requires sophisticated natural language processing techniques to extract meaningful information from it. Therefore, the crawler spout 214 feeds snippets of information to a natural language processing bolt 248, which applies natural language processing and machine learning techniques to extract relevant data/facts to emit the following types of claims: contact added, contact updated, contact graveyarded, and social handles.
  • A natural, human person may be represented as a graph of p:Person entities (nodes, or vertices) interconnected by links (edges). Each node can represent a different facet of the user (person). Each of these facets may be held in a separate (graph) container called a context. Each person entity node can be a set of attributes and objects. These attributes might be simple literals (such as the user's first name) or they could be other entities (called complex attributes). These latter attributes might be links to other entity nodes. Typically each node in the person graph is located in its own context. The root node may lie in a special context (for each user) called the root context.
  • Once the golden records are curated, the system 200 delivers this data to the person database 226 that is customer accessible. This golden data may also be propagated back to the original source systems and other partner systems and help keep the data clean in their respective source databases.
  • The system 200 provides a complete 360-degree feedback loop and reduces the chances that bad or fraudulent data may ever make it into a customer's customer relationship management systems or any other system where a consolidated view of account and person data is required. The core person and account repository also continues to grow over time as new pieces of data are found on the free web and other sources. Additional sources of data may also be on-boarded quickly into the system 200 by adding and configuring new spouts and corresponding bolts into the Storm topology. For example, a de-duplication bolt detects duplicates and automatically merges the duplicates or floats suspected duplicates to a community for task resolution. In another example, a pinger bolt pings hypertext transfer protocol and simple mail transfer protocol domains for validity, automatically graveyarding when a domain is deemed invalid.
  • The system 200 may create indices for each company, person, and location object for matching purposes. Examples of person indices include record identifier, social handle, email, direct phone number, company, city, zip, state, and country. Examples of location indices include record identifier, zip, city, and country. Examples of company indices include record identifier, domain, corporate phone, company prefix, stock ticker, company name and city, and domain and city.
  • The system may build an inverted index from a snippet, and use the index to map words in the snippet to their positions. The positions for a given word could be in increasing order. An inverted index is illustrated in an example below.
  • Snippet= Shabd Vaid|LinkedIn
  • www.linkedin.com/in/shabdvaid Cached
    Shabd Vaid. Experience: Co-founder, Vice President Engineering & Operations, iStorez Inc.; Director of Engineering, Responsys; Senior Software Engineer, Newgen . . . .
    Inverted Index: (only some key-value pairs shown).
    shabd→<0,5>, vaid→<1,6>, vice→<9>, president→<10>,
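  • Building such an inverted index might look like the following sketch. The tokenization rule (split on whitespace and “|”, strip punctuation from the ends of each token, lower-case, drop empty tokens) is an assumption chosen because it reproduces the positions shown in the example above; the disclosure does not specify the tokenizer.

```python
import re
import string
from collections import defaultdict


def tokenize(snippet):
    """Split on whitespace and '|', trim punctuation, lower-case."""
    raw = re.split(r"[\s|]+", snippet)
    tokens = [token.strip(string.punctuation).lower() for token in raw]
    return [token for token in tokens if token]


def build_inverted_index(snippet):
    """Map each word to the increasing list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokenize(snippet)):
        index[word].append(position)
    return dict(index)
```

Positions are appended in scan order, so each list is already in increasing order without a separate sort.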
  • The system 200 detects acronyms (if any) in the snippet, expands out these acronyms, tokenizes the expansion and incorporates these expansions into the inverted index, as illustrated in the example below.
  • IBM News room—Virginia M. Rometty—Chairman, President and . . . .
    www-03.ibm.com/press/us/en/biography/10069.wss Cached
    IBM Press Room—Ginni Rometti Biography . . . Full biography. Ginni Rometty is Chairman, President and Chief Executive Officer of IBM.
  • Before acronymization, the inverted index contains the entry ibm→<0,i,j>, where i and j denote the word positions of the 2nd and 3rd occurrences of IBM in the snippet. After recognizing the acronym ibm→“international business machines”, the database system adds the entries international→[i,0], business→[i,1], and machines→[i,2] to the inverted index. Acronym-expansion entries in a snippet's inverted index could be useful for matching titles or company names to the snippet.
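  • A hedged Python sketch of the acronym-expansion step follows. The function name `add_acronym_expansions` and the acronym dictionary are hypothetical. The text shows expansion entries only for occurrence i; as an assumption, this sketch records a (position, offset) pair for every occurrence of the acronym, and it keeps the expansion entries in a separate dictionary so that plain post-lists and expansion entries are not mixed.

```python
def add_acronym_expansions(index, acronyms):
    """Return expansion entries: word -> list of (acronym position, offset in expansion)."""
    expansions = {}
    for acronym, expansion in acronyms.items():
        for position in index.get(acronym, []):
            # Tokenize the expansion and record each word's offset within it.
            for offset, word in enumerate(expansion.split()):
                expansions.setdefault(word, []).append((position, offset))
    return expansions


# Hypothetical positions for the three occurrences of "ibm" in a snippet.
snippet_index = {"ibm": [0, 9, 24]}
known_acronyms = {"ibm": "international business machines"}
print(add_acronym_expansions(snippet_index, known_acronyms)["business"])
# prints: [(0, 1), (9, 1), (24, 1)]
```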
  • The system 200 may represent an attribute:value pair as an ordered tree. The order can capture the order of the words in the value, and also in acronym expansions. The ordered tree may capture choices, which include aliases, and acronym expansions. Table 1 below shows various examples. Ordered trees can be depicted as nested arrays, and constructed via attribute-specific constructors. For example, person_name objects are expanded to include first name aliases, and acronyms in company names and titles are detected and expanded, such as depicted in table 1. Ordered trees may have alternating levels of ordered ANDs and unordered ORs. For visual convenience, an AND-node is encapsulated in [ . . . ] and an OR-node in ( . . . ).
  • Table 1, Ordered Trees of Attribute:Value Pairs:
  • attribute | value | ordered tree
    person_name | (first_name = bob, last_name = smith) | [(bob, robert), smith]
    title | chairman and ceo | [chairman, and, (ceo, [chief, executive, officer])]
    company | ibm corp | [(ibm, [international, business, machines]), corp]
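  • The ordered trees of Table 1 can be sketched as nested Python values. In this sketch, which is an illustration under stated assumptions rather than the system's constructors, an AND-node is a Python list, an OR-node is a Python tuple, and a leaf is a string. The constructor names `text_tree` and `person_name_tree` and the alias and acronym dictionaries are hypothetical.

```python
# Hypothetical lookup tables; a real system would load much larger ones.
FIRST_NAME_ALIASES = {"bob": ["robert"]}
ACRONYMS = {
    "ceo": "chief executive officer",
    "ibm": "international business machines",
}


def expand_token(token, acronyms):
    """Return an OR-node (acronym, [expansion words]) for a known acronym, else the token."""
    if token in acronyms:
        return (token, acronyms[token].split())
    return token


def text_tree(value, acronyms):
    """Constructor for title and company values: an ordered AND of (possibly expanded) words."""
    return [expand_token(token, acronyms) for token in value.lower().split()]


def person_name_tree(first_name, last_name, aliases):
    """Constructor for person names: first-name aliases become an OR-node."""
    first = first_name.lower()
    choices = (first, *aliases.get(first, []))
    head = choices if len(choices) > 1 else first
    return [head, last_name.lower()]


print(person_name_tree("bob", "smith", FIRST_NAME_ALIASES))
# prints: [('bob', 'robert'), 'smith']
print(text_tree("chairman and ceo", ACRONYMS))
# prints: ['chairman', 'and', ('ceo', ['chief', 'executive', 'officer'])]
```

    Using lists for AND and tuples for OR mirrors the [ . . . ] and ( . . . ) notation, so a printed tree reads almost exactly like the last column of Table 1.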
  • As an example, [chairman, and, (ceo, [chief, executive, officer])] is read as “chairman AND (ceo OR (chief AND executive AND officer)).” Representing the snippet as an inverted index combined with representing attribute:value pairs as ordered trees may lead to a very fast matching algorithm, as described below. The system 200 has attribute-specific matchers to match a value of a field to a snippet, which is unstructured text. The attribute-specific matchers could be instances of the following generic matcher.
    • match(attribute,value,snippet_inverted_index)
    • Build ordered tree, attribute_value_ordered_tree, from attribute:value pair.
    • Build hits, which populates a copy of the ordered tree with positions of words in the snippet that match (these replace the words in the original ordered tree). hits uses snippet_inverted_index and attribute_value_ordered_tree as arguments.
    • Analyze hits to score the match.
    • end match
  • Building hits could be attribute-independent. Analyzing hits might be done “on-the-fly” while building hits; however, the algorithm is easier to understand when the two steps are separated out. Table 2 below shows some examples. A post-list in hits is represented by < . . . >.
  • Table 2, Hits from attribute_value_ordered_tree and snippet_inverted_index:
  • Row | attribute_value_ordered_tree | snippet_inverted_index | hits
    1 | [shabd, vaid] | {shabd → <0, 5>, vaid → <1, 6>, vice → <9>, president → <10>, . . . } | [<0, 5>, <1, 6>]
    2 | [vice, president] | {shabd → <0, 5>, vaid → <1, 6>, vice → <9>, president → <10>, . . . } | [<9>, <10>]
    3 | [(vp, [vice, president])] | {shabd → <0, 5>, vaid → <1, 6>, vice → <9>, president → <10>, . . . } | [(nil, [<9>, <10>])]
    4 | [(bob, robert), smith] | {robert → <8>, smith → <9>, . . . } | [(nil, <8>), <9>]
    5 | [chairman, and, (ceo, [chief, executive, officer])] | {chairman → <0>, and → <1>, chief → <2>, executive → <3>, officer → <4>, . . . , ceo → <8>} | [<0>, <1>, (<8>, [<2>, <3>, <4>])]
    6 | [(ibm, [international, business, machines]), corp] | {ibm → <0>} | [(<0>, [nil, nil, nil]), nil]
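  • The hits column of Table 2 can be reproduced with a short recursive function. The sketch below assumes the same hypothetical encoding as elsewhere in these illustrations (list = AND-node, tuple = OR-node, string = word) and introduces a hypothetical `PostList` wrapper so that a post-list < . . . > cannot be confused with an AND-node; `None` stands in for nil.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostList:
    """Positions at which one word occurs in the snippet, in increasing order."""
    positions: tuple


def build_hits(tree, index):
    """Copy the ordered tree, replacing each word with its post-list (or None for nil)."""
    if isinstance(tree, str):
        positions = index.get(tree)
        return PostList(tuple(positions)) if positions else None
    if isinstance(tree, list):  # AND-node: keep the order of the children
        return [build_hits(child, index) for child in tree]
    if isinstance(tree, tuple):  # OR-node: keep every choice
        return tuple(build_hits(child, index) for child in tree)
    raise TypeError(f"unexpected node: {tree!r}")


index = {"shabd": [0, 5], "vaid": [1, 6], "vice": [9], "president": [10]}
print(build_hits([("vp", ["vice", "president"])], index))
# prints: [(None, [PostList(positions=(9,)), PostList(positions=(10,))])]
```

    The function never inspects the attribute, which is consistent with the observation that building hits could be attribute-independent.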
  • Enumerating individual hits may be described based on the hits data structure in the last column of Table 2. Individual hits can reveal exactly what tokens in the query matched what positions in the snippet. Each hit could be individually scored. The overall score for the match of the attribute:value pair in the snippet might be defined as the aggregation of these individual scores. A hit could be a pair (tokens,positions), where tokens might be an array of tokens in attribute_value_ordered_tree and positions could be an array of positions in the snippet at which these tokens match, such as the examples below.
  • A one-level hits tree is simply an array of post-lists. In Table 2, hits of rows 1 and 2 form one-level trees. The system 200 may use a k-merge like algorithm to enumerate all the hits of such a tree to a snippet. This algorithm can “merge” k post-lists, as illustrated below. Below is an illustration on the hits [<0,5>, <1,6>]
  • Step 1: [<_0_,5>, <_1_,6>]→([shabd,vaid],0 . . . 1)
    Step 2: [<0,_5_>, <1,_6_>]→([shabd,vaid],5 . . . 6)
  • The underlined entries depict the locations of the pointers in the various post-lists. In step 1, the pointers are at the start positions. Since 1 minus 0 equals 1, the system 200 generates a hit, 0 . . . 1, and advances both pointers. In step 2, since 6 minus 5 equals 1, the system 200 enumerates a hit, 5 . . . 6, and advances both pointers.
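  • A minimal Python sketch of such a k-merge over a one-level tree follows. The text only walks through the case in which the pointed-to positions are consecutive; the rule used here when they are not (advance the pointer that lags furthest behind) is an assumption, as is the function name `k_merge_adjacent`. The sketch emits only exactly adjacent runs, which is stricter than a matcher that also reports distant or partial matches.

```python
def k_merge_adjacent(post_lists):
    """Enumerate (start, end) hits where word i of the query sits at position start + i."""
    if not post_lists:
        return []
    pointers = [0] * len(post_lists)
    hits = []
    while all(p < len(pl) for p, pl in zip(pointers, post_lists)):
        current = [pl[p] for p, pl in zip(pointers, post_lists)]
        # Where the phrase would have to start for each word to be in place.
        starts = [position - i for i, position in enumerate(current)]
        if len(set(starts)) == 1:
            hits.append((current[0], current[-1]))
            pointers = [p + 1 for p in pointers]  # advance every pointer
        else:
            lagging = starts.index(min(starts))
            pointers[lagging] += 1  # assumption: advance the furthest-behind pointer
    return hits


print(k_merge_adjacent([[0, 5], [1, 6]]))
# prints: [(0, 1), (5, 6)]
```

    Each iteration advances at least one pointer, so the loop runs at most as many times as there are entries across all post-lists.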
  • Enumerating hits of a multi-level tree may be done by suitably generalizing the k-merge operation. The generalization can be a little complex, and may be well described by building up inductively from different types of multi-level tree examples.
  • Example 1 is based on the hits of row 3 in Table 2: [(nil,[<9>, <10>])] and corresponds to a 3-level tree. The system 200 processes this example as follows.
  • [(nil, [<9>, <10>])]
    (nil, [<9>, <10>])
    [<9>, <10>]→([vice,president],9 . . . 10)
  • First, the system 200 goes down one level since the top level is a singleton-AND. Next, the system 200 skips the nil. Finally, the system 200 produces the hit 9 . . . 10 from [<9>, <10>] and annotates it with [vice, president].
  • Example 2 is based on the hits of row 4 in Table 2: [(nil,<8>),<9>]
  • [(nil,<8>),<9>]
    [(nil,<8>),<9>]→([robert, smith],8 . . . 9)
  • In step 1, the system 200 tries to 2-merge (nil,<8>) with <9>. Recognizing that the first argument is an OR, the system 200 goes down one level into the OR and effectively does the 2-merge of [<8>,<9>] in step 2.
  • Example 3 is based on the hits in row 5 of Table 2: [<0>, <1>, (<8>, [<2>, <3>, <4>])]
  • [<0>, <1>, (<8>, [<2>, <3>, <4>])]
    [<0>, <1>, (<8>, [<2>, <3>, <4>])]→([chairman,and,ceo],(0 . . . 1,8))
    [<0>, <1>, (<8>, [<2>, <3>, <4>])]→([chairman,and,chief,executive,officer],0 . . . 4)
  • In step 1, the system 200 recognizes the need for a 3-merge at the top level. The system 200 places the pointers at the correct locations of the first two entries. The third entry is an OR, so the system 200 descends into the third entry and then places the pointer on the first entry in the first post-list in the OR choices. (This entry is 8.) The system 200 then sends the hit (0 . . . 1,8) off to the scorer. Next, in step 3, the system 200 moves over to the second choice in this OR. This is itself an AND of three entries, so the system 200 needs a 3-merge of [<2>, <3>, <4>]. This 3-merge produces the hit 2 . . . 4, which gets appended to 0 . . . 1 to yield 0 . . . 4.
  • Example 4 is based on the hits in row 6 of Table 2: [(<0>,[nil,nil,nil]),nil]
  • [(<0>, [nil,nil,nil]),nil]
    [(<0>,[nil,nil,nil]),nil]→([ibm,corp],[0,nil])
  • In step 1, the system 200 recognizes the need for a 2-merge at the top level. The system 200 notices that the first entry is an OR, so the system 200 descends into the first entry and then places the pointer on the first entry in the first post-list in the OR choices. The system 200 notes that the second entry of the top-level AND is nil, so the system 200 outputs [0,nil] as one hit. Next, the system 200 advances the first pointer to the second choice in the OR (<0>,[nil,nil,nil]) and notices that it is [nil,nil,nil]. The system 200 therefore stops, and no new hits are generated.
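  • The hits produced in Examples 1 through 4 can be reproduced with a brute-force enumeration. To be clear, the sketch below is not the pointer-based generalized k-merge described above: it simply takes every combination of child hits at an AND-node and every matched choice at an OR-node, so for words that occur several times it yields a superset of candidates and leaves the ranking to the scorer. The function name `enumerate_hits`, the encoding (list = AND, tuple = OR, string = word, `None` = nil), and the rule that an OR with no matched choice falls back to its first choice are all assumptions.

```python
from itertools import product


def enumerate_hits(tree, index):
    """Return a list of (tokens, positions) candidate hits for an ordered tree."""
    if isinstance(tree, str):  # leaf word
        positions = index.get(tree) or [None]
        return [([tree], [position]) for position in positions]
    if isinstance(tree, tuple):  # OR-node: keep only choices that matched something
        alternatives = [
            (tokens, positions)
            for choice in tree
            for tokens, positions in enumerate_hits(choice, index)
            if any(position is not None for position in positions)
        ]
        # Assumption: if nothing matched, report the first choice as unmatched.
        return alternatives or enumerate_hits(tree[0], index)[:1]
    hits = []  # AND-node: concatenate one alternative from each child, in order
    for combo in product(*(enumerate_hits(child, index) for child in tree)):
        tokens = [token for part, _ in combo for token in part]
        positions = [position for _, part in combo for position in part]
        hits.append((tokens, positions))
    return hits


title_index = {"chairman": [0], "and": [1], "chief": [2],
               "executive": [3], "officer": [4], "ceo": [8]}
title_tree = ["chairman", "and", ("ceo", ["chief", "executive", "officer"])]
for hit in enumerate_hits(title_tree, title_index):
    print(hit)
# prints: (['chairman', 'and', 'ceo'], [0, 1, 8])
# prints: (['chairman', 'and', 'chief', 'executive', 'officer'], [0, 1, 2, 3, 4])
```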
  • The hit scorer may take two arguments: argument_name and hit. Table 3 shows a number of examples explaining the scoring.
  • Table 3, Scoring individual hits:
  • Attribute | Hit | Scoring Explanation
    person name | ([shabd, vaid], 0 . . . 1) | Very high score since 1-0 = 1
    title | ([vice, president], 9 . . . 10) | Very high score since 10-9 = 1
    title | ([chairman, and, ceo], (0 . . . 1, 8)) | Moderate score since 8 is far from 0 . . . 1
    title | ([chairman, and, chief, executive, officer], 0 . . . 4) | Very high score because of 0 . . . 4
    company name | ([ibm, corp], [0, nil]) | High score because the unmatched corp is a company stop word
    company name | ([jigsaw, data, corp], [5, nil, nil]) | Moderately high because the unmatched corp is a company stop word and the unmatched data is not the first word
    company name | ([data, corp], [nil, 3]) | Low score because the unmatched data is first word in company name
    person name | ([john, smith], [3, 9]) | Low score because the distance between the two matches, i.e. 9-3, is too high.
    person name | ([john, smith], [3, 5]) | Moderately high score because the distance 5-3 is small (2) though not ideal (1).
    title | ([director, of, engineering], [6, nil, 5]) | Moderately high score because |5-6| = 1 and title matches should be looser on word order.
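  • The qualitative rules of Table 3 can be turned into a small numeric scorer. The sketch below is only an illustration: the text gives no formula, so every constant (the penalty factors, the per-attribute gap weights, the stop-word list) is a hypothetical value chosen so that the resulting scores fall in the same relative order as the table's "very high" through "low" labels. A hit is passed as parallel lists of tokens and positions, with `None` for nil.

```python
COMPANY_STOP_WORDS = {"corp", "inc", "llc", "ltd"}  # hypothetical list
GAP_WEIGHT = {"person_name": 0.5, "company_name": 0.5, "title": 0.15}  # hypothetical


def score_hit(attribute, tokens, positions):
    """Score one hit in [0, 1]; 1.0 means every token matched in adjacent positions."""
    matched = [position for position in positions if position is not None]
    if not matched:
        return 0.0
    score = 1.0
    for i, (token, position) in enumerate(zip(tokens, positions)):
        if position is not None:
            continue
        if attribute == "company_name" and token in COMPANY_STOP_WORDS:
            score *= 0.9  # an unmatched company stop word costs little
        elif i == 0:
            score *= 0.2  # an unmatched first word costs a lot
        else:
            score *= 0.7
    weight = GAP_WEIGHT.get(attribute, 0.5)
    for left, right in zip(matched, matched[1:]):
        gap = right - left
        if attribute == "title":
            gap = abs(gap)  # titles are looser on word order
        if gap < 1:
            score *= 0.2  # out of order for an order-sensitive attribute
        else:
            score *= 1.0 / (1.0 + weight * (gap - 1))  # a gap of 1 is not penalized
    return score


print(score_hit("person_name", ["shabd", "vaid"], [0, 1]))
# prints: 1.0
print(score_hit("company_name", ["data", "corp"], [None, 3]))
# prints: 0.2
```

    The overall score for an attribute:value pair could then be an aggregation, for example the maximum, of the scores of its individual hits; the choice of aggregation is likewise an assumption.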
  • The system 200 brings together various algorithms, processes and techniques that are particularly suited for finding inaccurate data and piecing together rapidly changing pieces of data and claims to generate golden records at a massive scale. The system 200 provides a complete framework to efficiently evaluate data and to improve the completeness and accuracy of data. The system 200 provides a solid foundation for linking external data sources to core data assets in a reliable and scalable way that will enable customers to gain additional insights into their customers.
  • System Overview
  • FIG. 3 illustrates a block diagram of an environment 310 wherein an on-demand database service might be used. The environment 310 may include user systems 312, a network 314, a system 316, a processor system 317, an application platform 318, a network interface 320, a tenant data storage 322, a system data storage 324, program code 326, and a process space 328. In other embodiments, the environment 310 may not have all of the components listed and/or may have other elements instead of, or in addition to, those listed above.
  • The environment 310 is an environment in which an on-demand database service exists. A user system 312 may be any machine or system that is used by a user to access a database user system. For example, any of the user systems 312 may be a handheld computing device, a mobile phone, a laptop computer, a work station, and/or a network of computing devices. As illustrated in FIG. 3 (and in more detail in FIG. 4) the user systems 312 might interact via the network 314 with an on-demand database service, which is the system 316.
  • An on-demand database service, such as the system 316, is a database system that is made available to outside users that do not need to necessarily be concerned with building and/or maintaining the database system, but instead may be available for their use when the users need the database system (e.g., on the demand of the users). Some on-demand database services may store information from one or more tenants stored into tables of a common database image to form a multi-tenant database system (MTS). Accordingly, the “on-demand database service 316” and the “system 316” will be used interchangeably herein. A database image may include one or more database objects. A relational database management system (RDBMS) or the equivalent may execute storage and retrieval of information against the database object(s). The application platform 318 may be a framework that allows the applications of the system 316 to run, such as the hardware and/or software, e.g., the operating system. In an embodiment, the on-demand database service 316 may include the application platform 318 which enables creating, managing, and executing one or more applications developed by the provider of the on-demand database service, users accessing the on-demand database service via user systems 312, or third party application developers accessing the on-demand database service via the user systems 312.
  • The users of the user systems 312 may differ in their respective capacities, and the capacity of a particular user system 312 might be entirely determined by permissions (permission levels) for the current user. For example, where a salesperson is using a particular user system 312 to interact with the system 316, that user system 312 has the capacities allotted to that salesperson. However, while an administrator is using that user system 312 to interact with the system 316, that user system 312 has the capacities allotted to that administrator. In systems with a hierarchical role model, users at one permission level may have access to applications, data, and database information accessible by a lower permission level user, but may not have access to certain applications, database information, and data accessible by a user at a higher permission level. Thus, different users will have different capabilities with regard to accessing and modifying application and database information, depending on a user's security or permission level.
  • The network 314 is any network or combination of networks of devices that communicate with one another. For example, the network 314 may be any one or any combination of a LAN (local area network), WAN (wide area network), telephone network, wireless network, point-to-point network, star network, token ring network, hub network, or other appropriate configuration. As the most common type of computer network in current use is a TCP/IP (Transmission Control Protocol and Internet Protocol) network, such as the global internetwork of networks often referred to as the “Internet” with a capital “I,” that network will be used in many of the examples herein. However, it should be understood that the networks that the one or more implementations might use are not so limited, although TCP/IP is a frequently implemented protocol.
  • The user systems 312 might communicate with the system 316 using TCP/IP and, at a higher network level, use other common Internet protocols to communicate, such as HTTP, FTP, AFS, WAP, etc. In an example where HTTP is used, the user systems 312 might include an HTTP client commonly referred to as a “browser” for sending and receiving HTTP messages to and from an HTTP server at the system 316. Such an HTTP server might be implemented as the sole network interface between the system 316 and the network 314, but other techniques might be used as well or instead. In some implementations, the interface between the system 316 and the network 314 includes load sharing functionality, such as round-robin HTTP request distributors to balance loads and distribute incoming HTTP requests evenly over a plurality of servers. At least as for the users that are accessing that server, each of the plurality of servers has access to the MTS' data; however, other alternative configurations may be used instead.
  • In one embodiment, the system 316, shown in FIG. 3, implements a web-based customer relationship management (CRM) system. For example, in one embodiment, the system 316 includes application servers configured to implement and execute CRM software applications as well as provide related data, code, forms, webpages and other information to and from the user systems 312 and to store to, and retrieve from, a database system related data, objects, and Webpage content. With a multi-tenant system, data for multiple tenants may be stored in the same physical database object, however, tenant data typically is arranged so that data of one tenant is kept logically separate from that of other tenants so that one tenant does not have access to another tenant's data, unless such data is expressly shared. In certain embodiments, the system 316 implements applications other than, or in addition to, a CRM application. For example, the system 316 may provide tenant access to multiple hosted (standard and custom) applications, including a CRM application. User (or third party developer) applications, which may or may not include CRM, may be supported by the application platform 318, which manages creation, storage of the applications into one or more database objects and executing of the applications in a virtual machine in the process space of the system 316.
  • One arrangement for elements of the system 316 is shown in FIG. 3, including the network interface 320, the application platform 318, the tenant data storage 322 for tenant data 323, the system data storage 324 for system data 325 accessible to the system 316 and possibly multiple tenants, the program code 326 for implementing various functions of the system 316, and the process space 328 for executing MTS system processes and tenant-specific processes, such as running applications as part of an application hosting service. Additional processes that may execute on the system 316 include database indexing processes.
  • Several elements in the system shown in FIG. 3 include conventional, well-known elements that are explained only briefly here. For example, each of the user systems 312 could include a desktop personal computer, workstation, laptop, PDA, cell phone, or any wireless access protocol (WAP) enabled device or any other computing device capable of interfacing directly or indirectly to the Internet or other network connection. Each of the user systems 312 typically runs an HTTP client, e.g., a browsing program, such as Microsoft's Internet Explorer browser, Netscape's Navigator browser, Opera's browser, or a WAP-enabled browser in the case of a cell phone, PDA or other wireless device, or the like, allowing a user (e.g., subscriber of the multi-tenant database system) of the user systems 312 to access, process and view information, pages and applications available to it from the system 316 over the network 314. Each of the user systems 312 also typically includes one or more user interface devices, such as a keyboard, a mouse, trackball, touch pad, touch screen, pen or the like, for interacting with a graphical user interface (GUI) provided by the browser on a display (e.g., a monitor screen, LCD display, etc.) in conjunction with pages, forms, applications and other information provided by the system 316 or other systems or servers. For example, the user interface device may be used to access data and applications hosted by the system 316, and to perform searches on stored data, and otherwise allow a user to interact with various GUI pages that may be presented to a user. As discussed above, embodiments are suitable for use with the Internet, which refers to a specific global internetwork of networks. However, it should be understood that other networks can be used instead of the Internet, such as an intranet, an extranet, a virtual private network (VPN), a non-TCP/IP based network, any LAN or WAN or the like.
  • According to one embodiment, each of the user systems 312 and all of its components are operator configurable using applications, such as a browser, including computer code run using a central processing unit such as an Intel Pentium® processor or the like. Similarly, the system 316 (and additional instances of an MTS, where more than one is present) and all of their components might be operator configurable using application(s) including computer code to run using a central processing unit such as the processor system 317, which may include an Intel Pentium® processor or the like, and/or multiple processor units. A computer program product embodiment includes a machine-readable storage medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the embodiments described herein. Computer code for operating and configuring the system 316 to intercommunicate and to process webpages, applications and other data and media content as described herein are preferably downloaded and stored on a hard disk, but the entire program code, or portions thereof, may also be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as any type of rotating media including floppy disks, optical discs, digital versatile disk (DVD), compact disk (CD), microdrive, and magneto-optical disks, and magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data. Additionally, the entire program code, or portions thereof, may be transmitted and downloaded from a software source over a transmission medium, e.g., over the Internet, or from another server, as is well known, or transmitted over any other conventional network connection as is well known (e.g., extranet, VPN, LAN, etc.) 
using any communication medium and protocols (e.g., TCP/IP, HTTP, HTTPS, Ethernet, etc.) as are well known. It will also be appreciated that computer code for implementing embodiments can be implemented in any programming language that can be executed on a client system and/or server or server system, such as, for example, C, C++, HTML, any other markup language, Java™, JavaScript, ActiveX, any other scripting language, such as VBScript, and many other programming languages as are well known. (Java™ is a trademark of Sun Microsystems, Inc.)
  • According to one embodiment, the system 316 is configured to provide webpages, forms, applications, data and media content to the user (client) systems 312 to support the access by the user systems 312 as tenants of the system 316. As such, the system 316 provides security mechanisms to keep each tenant's data separate unless the data is shared. If more than one MTS is used, they may be located in close proximity to one another (e.g., in a server farm located in a single building or campus), or they may be distributed at locations remote from one another (e.g., one or more servers located in city A and one or more servers located in city B). As used herein, each MTS could include one or more logically and/or physically connected servers distributed locally or across one or more geographic locations. Additionally, the term “server” is meant to include a computer system, including processing hardware and process space(s), and an associated storage system and database application (e.g., OODBMS or RDBMS) as is well known in the art. It should also be understood that “server system” and “server” are often used interchangeably herein. Similarly, the database object described herein can be implemented as single databases, a distributed database, a collection of distributed databases, a database with redundant online or offline backups or other redundancies, etc., and might include a distributed database or storage network and associated processing intelligence.
  • FIG. 4 also illustrates the environment 310. However, in FIG. 4 elements of the system 316 and various interconnections in an embodiment are further illustrated. FIG. 4 shows that each of the user systems 312 may include a processor system 312A, a memory system 312B, an input system 312C, and an output system 312D. FIG. 4 shows the network 314 and the system 316. FIG. 4 also shows that the system 316 may include the tenant data storage 322, the tenant data 323, the system data storage 324, the system data 325, a User Interface (UI) 430, an Application Program Interface (API) 432, a PL/SOQL 434, save routines 436, an application setup mechanism 438, application servers 400 1-400 N, a system process space 402, tenant process spaces 404, a tenant management process space 410, a tenant storage area 412, a user storage 414, and application metadata 416. In other embodiments, the environment 310 may not have the same elements as those listed above and/or may have other elements instead of, or in addition to, those listed above.
  • The user systems 312, the network 314, the system 316, the tenant data storage 322, and the system data storage 324 were discussed above in FIG. 3. Regarding the user systems 312, the processor system 312A may be any combination of one or more processors. The memory system 312B may be any combination of one or more memory devices, short term, and/or long term memory. The input system 312C may be any combination of input devices, such as one or more keyboards, mice, trackballs, scanners, cameras, and/or interfaces to networks. The output system 312D may be any combination of output devices, such as one or more monitors, printers, and/or interfaces to networks. As shown by FIG. 4, the system 316 may include the network interface 320 (of FIG. 3) implemented as a set of HTTP application servers 400, the application platform 318, the tenant data storage 322, and the system data storage 324. Also shown is the system process space 402, including individual tenant process spaces 404 and the tenant management process space 410. Each application server 400 may be configured to access tenant data storage 322 and the tenant data 323 therein, and the system data storage 324 and the system data 325 therein to serve requests of the user systems 312. The tenant data 323 might be divided into individual tenant storage areas 412, which can be either a physical arrangement and/or a logical arrangement of data. Within each tenant storage area 412, the user storage 414 and the application metadata 416 might be similarly allocated for each user. For example, a copy of a user's most recently used (MRU) items might be stored to the user storage 414. Similarly, a copy of MRU items for an entire organization that is a tenant might be stored to the tenant storage area 412. The UI 430 provides a user interface and the API 432 provides an application programmer interface to the system 316 resident processes to users and/or developers at the user systems 312. 
The tenant data and the system data may be stored in various databases, such as one or more Oracle™ databases.
  • The application platform 318 includes the application setup mechanism 438 that supports application developers' creation and management of applications, which may be saved as metadata into the tenant data storage 322 by the save routines 436 for execution by subscribers as one or more tenant process spaces 404 managed by the tenant management process space 410, for example. Invocations to such applications may be coded using the PL/SOQL 434 that provides a programming language style interface extension to the API 432. A detailed description of some PL/SOQL language embodiments is discussed in commonly owned U.S. Pat. No. 7,730,478 entitled, METHOD AND SYSTEM FOR ALLOWING ACCESS TO DEVELOPED APPLICATIONS VIA A MULTI-TENANT ON-DEMAND DATABASE SERVICE, by Craig Weissman, filed Sep. 21, 2007, which is incorporated in its entirety herein for all purposes. Invocations to applications may be detected by one or more system processes, which manage retrieving the application metadata 416 for the subscriber making the invocation and executing the metadata as an application in a virtual machine.
  • Each application server 400 may be communicably coupled to database systems, e.g., having access to the system data 325 and the tenant data 323, via a different network connection. For example, one application server 400 1 might be coupled via the network 314 (e.g., the Internet), another application server 400 N-1 might be coupled via a direct network link, and another application server 400 N might be coupled by yet a different network connection. Transmission Control Protocol and Internet Protocol (TCP/IP) are typical protocols for communicating between application servers 400 and the database system. However, it will be apparent to one skilled in the art that other transport protocols may be used to optimize the system depending on the network interconnect used.
  • In certain embodiments, each application server 400 is configured to handle requests for any user associated with any organization that is a tenant. Because it is desirable to be able to add and remove application servers from the server pool at any time for any reason, there is preferably no server affinity for a user and/or organization to a specific application server 400. In one embodiment, therefore, an interface system implementing a load balancing function (e.g., an F5 Big-IP load balancer) is communicably coupled between the application servers 400 and the user systems 312 to distribute requests to the application servers 400. In one embodiment, the load balancer uses a least connections algorithm to route user requests to the application servers 400. Other examples of load balancing algorithms, such as round robin and observed response time, also can be used. For example, in certain embodiments, three consecutive requests from the same user could hit three different application servers 400, and three requests from different users could hit the same application server 400. In this manner, the system 316 is multi-tenant, wherein the system 316 handles storage of, and access to, different objects, data and applications across disparate users and organizations.
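  • The least connections routing mentioned above can be illustrated with a short Python sketch. This is a generic illustration of the algorithm, not the behavior of any particular load balancer product; the class name `LeastConnectionsBalancer`, its methods, and the server labels are hypothetical.

```python
class LeastConnectionsBalancer:
    """Route each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        """Pick a server for a new request and count the connection as active."""
        server = min(self.active, key=self.active.get)  # ties go to the earliest server
        self.active[server] += 1
        return server

    def release(self, server):
        """Mark one connection on the server as finished."""
        self.active[server] -= 1


balancer = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
print([balancer.acquire() for _ in range(3)])
# prints: ['app-1', 'app-2', 'app-3']
```

    Because no user-to-server mapping is kept, three consecutive requests from one user can land on three different servers, which matches the absence of server affinity described above.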
  • As an example of storage, one tenant might be a company that employs a sales force where each salesperson uses the system 316 to manage their sales process. Thus, a user might maintain contact data, leads data, customer follow-up data, performance data, goals and progress data, etc., all applicable to that user's personal sales process (e.g., in the tenant data storage 322). In an example of an MTS arrangement, since all of the data and the applications to access, view, modify, report, transmit, calculate, etc., can be maintained and accessed by a user system having nothing more than network access, the user can manage his or her sales efforts and cycles from any of many different user systems. For example, if a salesperson is visiting a customer and the customer has Internet access in their lobby, the salesperson can obtain critical updates as to that customer while waiting for the customer to arrive in the lobby.
  • While each user's data might be separate from other users' data regardless of the employers of each user, some data might be organization-wide data shared or accessible by a plurality of users or all of the users for a given organization that is a tenant. Thus, there might be some data structures managed by the system 316 that are allocated at the tenant level while other data structures might be managed at the user level. Because an MTS might support multiple tenants including possible competitors, the MTS should have security protocols that keep data, applications, and application use separate. Also, because many tenants may opt for access to an MTS rather than maintain their own system, redundancy, up-time, and backup are additional functions that may be implemented in the MTS. In addition to user-specific data and tenant specific data, the system 316 might also maintain system level data usable by multiple tenants or other data. Such system level data might include industry reports, news, postings, and the like that are sharable among tenants.
  • In certain embodiments, the user systems 312 (which may be client systems) communicate with the application servers 400 to request and update system-level and tenant-level data from the system 316 that may require sending one or more queries to the tenant data storage 322 and/or the system data storage 324. The system 316 (e.g., an application server 400 in the system 316) automatically generates one or more SQL statements (e.g., one or more SQL queries) that are designed to access the desired information. The system data storage 324 may generate query plans to access the requested data from the database.
  • Each database can generally be viewed as a collection of objects, such as a set of logical tables, containing data fitted into predefined categories. A “table” is one representation of a data object, and may be used herein to simplify the conceptual description of objects and custom objects. It should be understood that “table” and “object” may be used interchangeably herein. Each table generally contains one or more data categories logically arranged as columns or fields in a viewable schema. Each row or record of a table contains an instance of data for each category defined by the fields. For example, a CRM database may include a table that describes a customer with fields for basic contact information such as name, address, phone number, fax number, etc. Another table might describe a purchase order, including fields for information such as customer, product, sale price, date, etc. In some multi-tenant database systems, standard entity tables might be provided for use by all tenants. For CRM database applications, such standard entities might include tables for Account, Contact, Lead, and Opportunity data, each containing pre-defined fields. It should be understood that the word “entity” may also be used interchangeably herein with “object” and “table”.
  • In some multi-tenant database systems, tenants may be allowed to create and store custom objects, or they may be allowed to customize standard entities or objects, for example by creating custom fields for standard objects, including custom index fields. U.S. Pat. No. 7,779,039, filed Apr. 2, 2004, entitled “Custom Entities and Fields in a Multi-Tenant Database System”, which is hereby incorporated herein by reference, teaches systems and methods for creating custom objects as well as customizing standard objects in a multi-tenant database system. In certain embodiments, for example, all custom entity data rows are stored in a single multi-tenant physical table, which may contain multiple logical tables per organization. It is transparent to customers that their multiple “tables” are in fact stored in one large table or that their data may be stored in the same table as the data of other customers.
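As a non-limiting illustration of storing several logical tables in one physical table, the following Python sketch keeps every organization's custom rows in a single table and uses a small metadata table to map each custom field to a generic value column. The layout (the `custom_field` and `custom_data` tables, the three value slots `val0` to `val2`) and all function names are assumptions chosen for this example; the disclosure and the incorporated patent may use a different arrangement.

```python
import sqlite3


def open_store():
    """Create the metadata table and the single shared physical data table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE custom_field (
            org_id TEXT, object_name TEXT, field_name TEXT, slot INTEGER);
        CREATE TABLE custom_data (
            org_id TEXT, object_name TEXT, val0 TEXT, val1 TEXT, val2 TEXT);
        """
    )
    return conn


def define_object(conn, org_id, object_name, field_names):
    """Register a logical table by assigning each field to a value slot."""
    if len(field_names) > 3:
        raise ValueError("this sketch supports at most 3 fields per object")
    conn.executemany(
        "INSERT INTO custom_field VALUES (?, ?, ?, ?)",
        [(org_id, object_name, name, slot) for slot, name in enumerate(field_names)],
    )


def _slots(conn, org_id, object_name):
    """Look up the (field_name, slot) pairs of one organization's logical table."""
    rows = conn.execute(
        "SELECT field_name, slot FROM custom_field "
        "WHERE org_id = ? AND object_name = ? ORDER BY slot",
        (org_id, object_name),
    ).fetchall()
    if not rows:
        raise KeyError(f"{org_id} has no object named {object_name}")
    return rows


def insert_row(conn, org_id, object_name, record):
    """Store one logical row in the shared physical table."""
    values = [None, None, None]
    for field_name, slot in _slots(conn, org_id, object_name):
        values[slot] = record.get(field_name)
    conn.execute(
        "INSERT INTO custom_data VALUES (?, ?, ?, ?, ?)",
        (org_id, object_name, *values),
    )


def select_rows(conn, org_id, object_name):
    """Return the logical rows of one organization's object as dictionaries."""
    slots = _slots(conn, org_id, object_name)
    cursor = conn.execute(
        "SELECT val0, val1, val2 FROM custom_data "
        "WHERE org_id = ? AND object_name = ? ORDER BY rowid",
        (org_id, object_name),
    )
    return [{name: row[slot] for name, slot in slots} for row in cursor]
```

A caller works only with `define_object`, `insert_row`, and `select_rows`, so each organization sees its own "tables" even though all rows share `custom_data`.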
  • While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.

Claims (20)

1. A system for matching snippets of search results to clusters of objects, the system comprising:
one or more processors; and
a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
search for information based on objects in a cluster of objects;
extract a data snippet from the search results;
determine whether the data snippet includes data that matches at least one of the objects in the cluster of objects; and
add the data snippet to the cluster of objects in response to a determination that the data snippet includes data that matches at least one of the objects in the cluster of objects.
2. The system of claim 1, further comprising the step of storing the data snippet for matching with subsequent clusters of objects in response to a determination that the data snippet does not include data that matches at least one of the objects in the cluster of objects.
3. The system of claim 1, wherein the objects comprise at least one of a personal name, a company name, a geolocation, a job title, and contact information.
4. The system of claim 1, wherein determining whether the data snippet includes data that matches at least one of the objects in the cluster of objects comprises matching based on at least one of a first name alias and acronym expansion.
5. The system of claim 1, further comprising the steps of:
determining whether the data snippet includes data that matches at least one object in a second cluster of objects; and
combining the cluster of objects with the second cluster of objects in response to a determination that the data snippet includes data that matches at least one object in the second cluster of objects.
6. A computer program product comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to:
search for information based on objects in a cluster of objects;
extract a data snippet from the search results;
determine whether the data snippet includes data that matches at least one of the objects in the cluster of objects; and
add the data snippet to the cluster of objects in response to a determination that the data snippet includes data that matches at least one of the objects in the cluster of objects.
7. The computer program product of claim 6, the program code further including instructions to store the data snippet for matching with subsequent clusters of objects in response to a determination that the data snippet does not include data that matches at least one of the objects in the cluster of objects.
8. The computer program product of claim 6, wherein the objects comprise at least one of a personal name, a company name, a geolocation, a job title, and contact information.
9. The computer program product of claim 6, wherein determining whether the data snippet includes data that matches at least one of the objects in the cluster of objects comprises matching based on at least one of a first name alias and acronym expansion.
10. The computer program product of claim 6, the program code further including instructions to:
determine whether the data snippet includes data that matches at least one object in a second cluster of objects; and
combine the cluster of objects with the second cluster of objects in response to a determination that the data snippet includes data that matches at least one object in the second cluster of objects.
11. A method for matching snippets of search results to clusters of objects, the method comprising:
searching for information based on objects in a cluster of objects;
extracting a data snippet from the search results;
determining whether the data snippet includes data that matches at least one of the objects in the cluster of objects; and
adding the data snippet to the cluster of objects in response to a determination that the data snippet includes data that matches at least one of the objects in the cluster of objects.
12. The method of claim 11, the method further comprising storing the data snippet for matching with subsequent clusters of objects in response to a determination that the data snippet does not include data that matches at least one of the objects in the cluster of objects.
13. The method of claim 11, wherein the objects comprise at least one of a personal name, a company name, a geolocation, a job title, and contact information.
14. The method of claim 11, wherein determining whether the data snippet includes data that matches at least one of the objects in the cluster of objects comprises matching based on at least one of a first name alias and acronym expansion.
15. The method of claim 11, the method further comprising:
determining whether the data snippet includes data that matches at least one object in a second cluster of objects; and
combining the cluster of objects with the second cluster of objects in response to a determination that the data snippet includes data that matches at least one object in the second cluster of objects.
16. A method for transmitting code for matching snippets of search results to clusters of objects, the method comprising:
transmitting code to search for information based on objects in a cluster of objects;
transmitting code to extract a data snippet from the search results;
transmitting code to determine whether the data snippet includes data that matches at least one of the objects in the cluster of objects; and
transmitting code to add the data snippet to the cluster of objects in response to a determination that the data snippet includes data that matches at least one of the objects in the cluster of objects.
17. The method for transmitting code of claim 16, the method further comprising storing the data snippet for matching with subsequent clusters of objects in response to a determination that the data snippet does not include data that matches at least one of the objects in the cluster of objects.
18. The method for transmitting code of claim 16, wherein the objects comprise at least one of a personal name, a company name, a geolocation, a job title, and contact information.
19. The method for transmitting code of claim 16, wherein determining whether the data snippet includes data that matches at least one of the objects in the cluster of objects comprises matching based on at least one of a first name alias and acronym expansion.
20. The method for transmitting code of claim 16, the method further comprising:
determining whether the data snippet includes data that matches at least one object in a second cluster of objects; and
combining the cluster of objects with the second cluster of objects in response to a determination that the data snippet includes data that matches at least one object in the second cluster of objects.
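As a non-limiting illustration of the operations recited in the claims above, the following Python sketch searches on the objects of a cluster, tests each returned snippet against the clusters, adds a matching snippet, stores a non-matching snippet for later, and combines two clusters that the same snippet matches. The alias and acronym tables, the whole-phrase matching rule, the `search_fn` callback, and all class and function names are assumptions made for this example; they are not the claimed implementation.

```python
import re
from dataclasses import dataclass, field

# Hypothetical lookup tables, for illustration only.
FIRST_NAME_ALIASES = {"bob": "robert", "bill": "william"}
ACRONYMS = {"ibm": "international business machines"}


def normalize(text):
    """Lower-case the text, then apply first-name aliases and acronym expansion."""
    tokens = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        token = FIRST_NAME_ALIASES.get(token, token)
        tokens.extend(ACRONYMS.get(token, token).split())
    return " ".join(tokens)


@dataclass(eq=False)  # identity comparison, so list removal targets this exact cluster
class Cluster:
    objects: set  # e.g. personal names, company names, job titles
    snippets: list = field(default_factory=list)

    def matches(self, snippet):
        """True if any object appears in the snippet as a whole normalized phrase."""
        text = f" {normalize(snippet)} "
        return any(f" {normalize(obj)} " in text for obj in self.objects)


def process_snippet(snippet, clusters, pending):
    """Add the snippet to a matching cluster, merging clusters it links; else store it."""
    matched = [cluster for cluster in clusters if cluster.matches(snippet)]
    if not matched:
        pending.append(snippet)  # kept for matching with subsequent clusters
        return None
    target = matched[0]
    target.snippets.append(snippet)
    for other in matched[1:]:  # the snippet also matches a second cluster: combine
        target.objects |= other.objects
        target.snippets.extend(other.snippets)
        clusters.remove(other)
    return target


def enrich(clusters, search_fn, pending):
    """Search on each cluster's objects and process every returned snippet.

    search_fn is a caller-supplied function (an assumption of this sketch) that
    takes a query string and returns an iterable of snippet strings.
    """
    for cluster in list(clusters):
        if cluster not in clusters:  # already merged into another cluster
            continue
        query = " ".join(sorted(cluster.objects))
        for snippet in search_fn(query):
            process_snippet(snippet, clusters, pending)
```

For example, with a cluster holding "Robert Smith" and "IBM" and a second cluster holding "Acme Corp", a snippet mentioning "International Business Machines" and "Acme Corp" matches both clusters in this sketch, so the two clusters are combined into one.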
US14/337,352 2013-07-23 2014-07-22 Matching snippets of search results to clusters of objects Abandoned US20150032729A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/337,352 US20150032729A1 (en) 2013-07-23 2014-07-22 Matching snippets of search results to clusters of objects

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201361857325P 2013-07-23 2013-07-23
US201361862873P 2013-08-06 2013-08-06
US14/337,352 US20150032729A1 (en) 2013-07-23 2014-07-22 Matching snippets of search results to clusters of objects

Publications (1)

Publication Number Publication Date
US20150032729A1 (en) 2015-01-29

Family

ID=52391371

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/337,352 Abandoned US20150032729A1 (en) 2013-07-23 2014-07-22 Matching snippets of search results to clusters of objects
US14/337,505 Active 2035-10-26 US9760620B2 (en) 2013-07-23 2014-07-22 Confidently adding snippets of search results to clusters of objects

Family Applications After (1)

Application Number Title Priority Date Filing Date
US14/337,505 Active 2035-10-26 US9760620B2 (en) 2013-07-23 2014-07-22 Confidently adding snippets of search results to clusters of objects

Country Status (1)

Country Link
US (2) US20150032729A1 (en)

US10204154B2 (en) 2016-06-10 2019-02-12 OneTrust, LLC Data processing systems for generating and populating a data inventory
US11222309B2 (en) 2016-06-10 2022-01-11 OneTrust, LLC Data processing systems for generating and populating a data inventory
US10235534B2 (en) 2016-06-10 2019-03-19 OneTrust, LLC Data processing systems for prioritizing data subject access requests for fulfillment and related methods
CN107544979A (en) * 2016-06-24 2018-01-05 上海壹账通金融科技有限公司 The credibility Analysis method and system of user data
CN106909600A (en) * 2016-07-07 2017-06-30 阿里巴巴集团控股有限公司 The collection method and device of user context information
US10585930B2 (en) 2016-07-29 2020-03-10 International Business Machines Corporation Determining a relevancy of a content summary
US10803070B2 (en) * 2016-07-29 2020-10-13 International Business Machines Corporation Selecting a content summary based on relevancy
US10372816B2 (en) 2016-12-13 2019-08-06 International Business Machines Corporation Preprocessing of string inputs in natural language processing
US10546063B2 (en) 2016-12-13 2020-01-28 International Business Machines Corporation Processing of string inputs utilizing machine learning
US20180203916A1 (en) * 2017-01-19 2018-07-19 Acquire Media Ventures Inc. Data clustering with reduced partial signature matching using key-value storage and retrieval
US10645138B2 (en) * 2017-05-02 2020-05-05 Salesforce.Com, Inc Event stream processing system using a coordinating spout instance
US11005864B2 (en) 2017-05-19 2021-05-11 Salesforce.Com, Inc. Feature-agnostic behavior profile based anomaly detection
US10013577B1 (en) 2017-06-16 2018-07-03 OneTrust, LLC Data processing systems for identifying whether cookies contain personally identifying information
US20190042932A1 (en) * 2017-08-01 2019-02-07 Salesforce Com, Inc. Techniques and Architectures for Deep Learning to Support Security Threat Detection
GB2572541A (en) * 2018-03-27 2019-10-09 Innoplexus Ag System and method for identifying at least one association of entity
US10956402B2 (en) 2018-04-13 2021-03-23 Visa International Service Association Method and system for automatically detecting errors in at least one date entry using image maps
US11924297B2 (en) 2018-05-24 2024-03-05 People.ai, Inc. Systems and methods for generating a filtered data set
US11463441B2 (en) 2018-05-24 2022-10-04 People.ai, Inc. Systems and methods for managing the generation or deletion of record objects based on electronic activities and communication policies
US10803202B2 (en) 2018-09-07 2020-10-13 OneTrust, LLC Data processing systems for orphaned data identification and deletion and related methods
US11144675B2 (en) 2018-09-07 2021-10-12 OneTrust, LLC Data processing systems and methods for automatically protecting sensitive data within privacy management systems
US11544409B2 (en) 2018-09-07 2023-01-03 OneTrust, LLC Data processing systems and methods for automatically protecting sensitive data within privacy management systems
US11126673B2 (en) 2019-01-29 2021-09-21 Salesforce.Com, Inc. Method and system for automatically enriching collected seeds with information extracted from one or more websites
US10866996B2 (en) 2019-01-29 2020-12-15 Salesforce.Com, Inc. Automated method and system for clustering enriched company seeds into a cluster and selecting best values for each attribute within the cluster to generate a company profile
US11755914B2 (en) 2019-01-31 2023-09-12 Salesforce, Inc. Machine learning from data steward feedback for merging records
US11176108B2 (en) 2019-02-04 2021-11-16 International Business Machines Corporation Data resolution among disparate data sources
WO2020191355A1 (en) * 2019-03-21 2020-09-24 Salesforce.Com, Inc. Machine learning from data steward feedback for merging records
US11157508B2 (en) 2019-06-21 2021-10-26 Salesforce.Com, Inc. Estimating the number of distinct entities from a set of records of a database system
US12039538B2 (en) 2020-04-01 2024-07-16 Visa International Service Association System, method, and computer program product for breach detection using convolutional neural networks
US11966372B1 (en) * 2020-05-01 2024-04-23 Bottomline Technologies, Inc. Database record combination
US11487936B2 (en) * 2020-05-27 2022-11-01 Capital One Services, Llc System and method for electronic text analysis and contextual feedback
EP4179435B1 (en) 2020-07-08 2024-09-04 OneTrust LLC Systems and methods for targeted data discovery
WO2022026564A1 (en) 2020-07-28 2022-02-03 OneTrust, LLC Systems and methods for automatically blocking the use of tracking tools
US11475165B2 (en) 2020-08-06 2022-10-18 OneTrust, LLC Data processing systems and methods for automatically redacting unstructured data from a data subject access request
US11500853B1 (en) * 2020-09-04 2022-11-15 Live Data Technologies, Inc. Virtual data store systems and methods
WO2022060860A1 (en) 2020-09-15 2022-03-24 OneTrust, LLC Data processing systems and methods for detecting tools for the automatic blocking of consent requests
US11526624B2 (en) 2020-09-21 2022-12-13 OneTrust, LLC Data processing systems and methods for automatically detecting target data transfers and target data processing
US11397819B2 (en) 2020-11-06 2022-07-26 OneTrust, LLC Systems and methods for identifying data processing activities based on data discovery results
WO2022159901A1 (en) 2021-01-25 2022-07-28 OneTrust, LLC Systems and methods for discovery, classification, and indexing of data in a native computing system
WO2022170047A1 (en) 2021-02-04 2022-08-11 OneTrust, LLC Managing custom attributes for domain objects defined within microservices
US11494515B2 (en) 2021-02-08 2022-11-08 OneTrust, LLC Data processing systems and methods for anonymizing data samples in classification analysis
WO2022173912A1 (en) 2021-02-10 2022-08-18 OneTrust, LLC Systems and methods for mitigating risks of third-party computing system functionality integration into a first-party computing system
US11775348B2 (en) 2021-02-17 2023-10-03 OneTrust, LLC Managing custom workflows for domain objects defined within microservices
US11546661B2 (en) 2021-02-18 2023-01-03 OneTrust, LLC Selective redaction of media content
EP4305539A1 (en) 2021-03-08 2024-01-17 OneTrust, LLC Data transfer discovery and analysis systems and related methods
US11562078B2 (en) 2021-04-16 2023-01-24 OneTrust, LLC Assessing and managing computational risk involved with integrating third party computing functionality within a computing system
US20220374700A1 (en) * 2021-05-21 2022-11-24 Adp, Llc Time-Series Anomaly Detection Via Deep Learning
US11934402B2 (en) 2021-08-06 2024-03-19 Bank Of America Corporation System and method for generating optimized data queries to improve hardware efficiency and utilization
US11748346B2 (en) 2021-09-30 2023-09-05 Amazon Technologies, Inc. Multi-tenant hosting of inverted indexes for text searches
US11620142B1 (en) 2022-06-03 2023-04-04 OneTrust, LLC Generating and customizing user interfaces for demonstrating functions of interactive user environments

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5504890A (en) * 1994-03-17 1996-04-02 Sanford; Michael D. System for data sharing among independently-operating information-gathering entities with individualized conflict resolution rules
US20030037051A1 (en) * 1999-07-20 2003-02-20 Gruenwald Bjorn J. System and method for organizing data
US6658423B1 (en) * 2001-01-24 2003-12-02 Google, Inc. Detecting duplicate and near-duplicate files
US6947930B2 (en) * 2003-03-21 2005-09-20 Overture Services, Inc. Systems and methods for interactive search query refinement
US20050234952A1 (en) * 2004-04-15 2005-10-20 Microsoft Corporation Content propagation for enhanced document retrieval
US20060026152A1 (en) * 2004-07-13 2006-02-02 Microsoft Corporation Query-based snippet clustering for search result grouping
US20060117002A1 (en) * 2004-11-26 2006-06-01 Bing Swen Method for search result clustering
US20070027921A1 (en) * 2005-08-01 2007-02-01 Billy Alvarado Context based action
US20070192293A1 (en) * 2006-02-13 2007-08-16 Bing Swen Method for presenting search results
US20080222140A1 (en) * 2007-02-20 2008-09-11 Wright State University Comparative web search system and method
US20090240672A1 (en) * 2008-03-18 2009-09-24 Cuill, Inc. Apparatus and method for displaying search results with a variety of display paradigms
US20100023515A1 (en) * 2008-07-28 2010-01-28 Andreas Marx Data clustering engine
US20100070460A1 (en) * 2005-05-02 2010-03-18 Fuerst Karl System and method for rule-based data object matching
US20120023107A1 (en) * 2010-01-15 2012-01-26 Salesforce.Com, Inc. System and method of matching and merging records
US8782016B2 (en) * 2011-08-26 2014-07-15 Qatar Foundation Database record repair

Family Cites Families (118)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5608872A (en) 1993-03-19 1997-03-04 Ncr Corporation System for allowing all remote computers to perform annotation on an image and replicating the annotated image on the respective displays of other computers
US5649104A (en) 1993-03-19 1997-07-15 Ncr Corporation System for allowing user of any computer to draw image over that generated by the host computer and replicating the drawn image to other computers
US5577188A (en) 1994-05-31 1996-11-19 Future Labs, Inc. Method to provide for virtual screen overlay
GB2300991B (en) 1995-05-15 1997-11-05 Andrew Macgregor Ritchie Serving signals to browsing clients
US5715450A (en) 1995-09-27 1998-02-03 Siebel Systems, Inc. Method of selecting and presenting data from a database using a query language to a user of a computer system
US5821937A (en) 1996-02-23 1998-10-13 Netsuite Development, L.P. Computer method for updating a network design
US5831610A (en) 1996-02-23 1998-11-03 Netsuite Development L.P. Designing networks
US6604117B2 (en) 1996-03-19 2003-08-05 Siebel Systems, Inc. Method of maintaining a network of partially replicated database system
US5873096A (en) 1997-10-08 1999-02-16 Siebel Systems, Inc. Method of maintaining a network of partially replicated database system
EP1021775A4 (en) 1997-02-26 2005-05-11 Siebel Systems Inc Method of determining the visibility to a remote database client of a plurality of database transactions using simplified visibility rules
AU6654798A (en) 1997-02-26 1998-09-18 Siebel Systems, Inc. Method of determining visibility to a remote database client of a plurality of database transactions using a networked proxy server
AU6183698A (en) 1997-02-26 1998-09-18 Siebel Systems, Inc. Method of determining visibility to a remote database client of a plurality of database transactions having variable visibility strengths
WO1998040804A2 (en) 1997-02-26 1998-09-17 Siebel Systems, Inc. Distributed relational database
AU6440398A (en) 1997-02-26 1998-09-18 Siebel Systems, Inc. Method of using a cache to determine the visibility to a remote database client of a plurality of database transactions
AU6336798A (en) 1997-02-27 1998-09-29 Siebel Systems, Inc. Method of synchronizing independently distributed software and database schema
WO1998040807A2 (en) 1997-02-27 1998-09-17 Siebel Systems, Inc. Migrating to a successive software distribution level
WO1998038564A2 (en) 1997-02-28 1998-09-03 Siebel Systems, Inc. Partially replicated distributed database with multiple levels of remote clients
US6169534B1 (en) 1997-06-26 2001-01-02 Upshot.Com Graphical user interface for customer information management
US5918159A (en) 1997-08-04 1999-06-29 Fomukong; Mundi Location reporting satellite paging system with optional blocking of location reporting
US6560461B1 (en) 1997-08-04 2003-05-06 Mundi Fomukong Authorized location reporting paging system
US20020059095A1 (en) 1998-02-26 2002-05-16 Cook Rachael Linette System and method for generating, capturing, and managing customer lead information over a computer network
US6732111B2 (en) 1998-03-03 2004-05-04 Siebel Systems, Inc. Method, apparatus, system, and program product for attaching files and other objects to a partially replicated database
US6772229B1 (en) 2000-11-13 2004-08-03 Groupserve, Inc. Centrifugal communication and collaboration method
US6161149A (en) 1998-03-13 2000-12-12 Groupserve, Inc. Centrifugal communication and collaboration method
US5963953A (en) 1998-03-30 1999-10-05 Siebel Systems, Inc. Method, and system for product configuration
AU5791899A (en) 1998-08-27 2000-03-21 Upshot Corporation A method and apparatus for network-based sales force management
US6393605B1 (en) 1998-11-18 2002-05-21 Siebel Systems, Inc. Apparatus and system for efficient delivery and deployment of an application
US6728960B1 (en) 1998-11-18 2004-04-27 Siebel Systems, Inc. Techniques for managing multiple threads in a browser environment
US6601087B1 (en) 1998-11-18 2003-07-29 Webex Communications, Inc. Instant document sharing
EP1163604A4 (en) 1998-11-30 2002-01-09 Siebel Systems Inc Assignment manager
JP2002531890A (en) 1998-11-30 2002-09-24 シーベル システムズ,インコーポレイティド Development tools, methods and systems for client-server applications
JP2002531896A (en) 1998-11-30 2002-09-24 シーベル システムズ,インコーポレイティド Call center using smart script
JP2002531899A (en) 1998-11-30 2002-09-24 シーベル システムズ,インコーポレイティド State model for process monitoring
US7356482B2 (en) 1998-12-18 2008-04-08 Alternative Systems, Inc. Integrated change management unit
US20020072951A1 (en) 1999-03-03 2002-06-13 Michael Lee Marketing support database management method, system and program product
US6574635B2 (en) 1999-03-03 2003-06-03 Siebel Systems, Inc. Application instantiation based upon attributes and values stored in a meta data repository, including tiering of application layers objects and components
US8095413B1 (en) 1999-05-07 2012-01-10 VirtualAgility, Inc. Processing management information
US7698160B2 (en) 1999-05-07 2010-04-13 Virtualagility, Inc System for performing collaborative tasks
US6621834B1 (en) 1999-11-05 2003-09-16 Raindance Communications, Inc. System and method for voice transmission over network protocols
US6535909B1 (en) 1999-11-18 2003-03-18 Contigo Software, Inc. System and method for record and playback of collaborative Web browsing session
US6324568B1 (en) 1999-11-30 2001-11-27 Siebel Systems, Inc. Method and system for distributing objects over a network
US6654032B1 (en) 1999-12-23 2003-11-25 Webex Communications, Inc. Instant sharing of documents on a remote server
US6577726B1 (en) 2000-03-31 2003-06-10 Siebel Systems, Inc. Computer telephony integration hotelling method and system
US7266502B2 (en) 2000-03-31 2007-09-04 Siebel Systems, Inc. Feature centric release manager method and system
US6336137B1 (en) 2000-03-31 2002-01-01 Siebel Systems, Inc. Web client-server system and method for incompatible page markup and presentation languages
US6732100B1 (en) 2000-03-31 2004-05-04 Siebel Systems, Inc. Database access method and system for user role defined access
US6665655B1 (en) 2000-04-14 2003-12-16 Rightnow Technologies, Inc. Implicit rating of retrieved information in an information search system
US6434550B1 (en) 2000-04-14 2002-08-13 Rightnow Technologies, Inc. Temporal updates of relevancy rating of retrieved information in an information search system
US7730072B2 (en) 2000-04-14 2010-06-01 Rightnow Technologies, Inc. Automated adaptive classification system for knowledge networks
US6842748B1 (en) 2000-04-14 2005-01-11 Rightnow Technologies, Inc. Usage based strength between related information in an information retrieval system
US6763501B1 (en) 2000-06-09 2004-07-13 Webex Communications, Inc. Remote document serving
KR100365357B1 (en) 2000-10-11 2002-12-18 엘지전자 주식회사 Method for data communication of mobile terminal
US7581230B2 (en) 2001-02-06 2009-08-25 Siebel Systems, Inc. Adaptive communication application programming interface
USD454139S1 (en) 2001-02-20 2002-03-05 Rightnow Technologies Display screen for a computer
US7363388B2 (en) 2001-03-28 2008-04-22 Siebel Systems, Inc. Method and system for direct server synchronization with a computing device
US6829655B1 (en) 2001-03-28 2004-12-07 Siebel Systems, Inc. Method and system for server synchronization with a computing device via a companion device
US7174514B2 (en) 2001-03-28 2007-02-06 Siebel Systems, Inc. Engine to present a user interface based on a logical structure, such as one for a customer relationship management system, across a web site
US20030018705A1 (en) 2001-03-31 2003-01-23 Mingte Chen Media-independent communication server
US20030206192A1 (en) 2001-03-31 2003-11-06 Mingte Chen Asynchronous message push to web browser
US6732095B1 (en) 2001-04-13 2004-05-04 Siebel Systems, Inc. Method and apparatus for mapping between XML and relational representations
US7761288B2 (en) 2001-04-30 2010-07-20 Siebel Systems, Inc. Polylingual simultaneous shipping of software
US6782383B2 (en) 2001-06-18 2004-08-24 Siebel Systems, Inc. System and method to implement a persistent and dismissible search center frame
US6728702B1 (en) 2001-06-18 2004-04-27 Siebel Systems, Inc. System and method to implement an integrated search center supporting a full-text search and query on a database
US6763351B1 (en) 2001-06-18 2004-07-13 Siebel Systems, Inc. Method, apparatus, and system for attaching search results
US6711565B1 (en) 2001-06-18 2004-03-23 Siebel Systems, Inc. Method, apparatus, and system for previewing search results
US20030004971A1 (en) 2001-06-29 2003-01-02 Gong Wen G. Automatic generation of data models and accompanying user interfaces
WO2003007734A1 (en) 2001-07-19 2003-01-30 San-Ei Gen F.F.I., Inc. Flavor-improving compositions and application thereof
US6826582B1 (en) 2001-09-28 2004-11-30 Emc Corporation Method and system for using file systems for content management
US7761535B2 (en) 2001-09-28 2010-07-20 Siebel Systems, Inc. Method and system for server synchronization with a computing device
US6978445B2 (en) 2001-09-28 2005-12-20 Siebel Systems, Inc. Method and system for supporting user navigation in a browser environment
US6993712B2 (en) 2001-09-28 2006-01-31 Siebel Systems, Inc. System and method for facilitating user interaction in a browser environment
US6724399B1 (en) 2001-09-28 2004-04-20 Siebel Systems, Inc. Methods and apparatus for enabling keyboard accelerators in applications implemented via a browser
US8359335B2 (en) 2001-09-29 2013-01-22 Siebel Systems, Inc. Computing system and method to implicitly commit unsaved data for a world wide web application
US7146617B2 (en) 2001-09-29 2006-12-05 Siebel Systems, Inc. Method, apparatus, and system for implementing view caching in a framework to support web-based applications
US6901595B2 (en) 2001-09-29 2005-05-31 Siebel Systems, Inc. Method, apparatus, and system for implementing a framework to support a web-based application
US7962565B2 (en) 2001-09-29 2011-06-14 Siebel Systems, Inc. Method, apparatus and system for a mobile web client
US7289949B2 (en) 2001-10-09 2007-10-30 Right Now Technologies, Inc. Method for routing electronic correspondence based on the level and type of emotion contained therein
US7062502B1 (en) 2001-12-28 2006-06-13 Kesler John N Automated generation of dynamic data entry user interface for relational database management systems
US6804330B1 (en) 2002-01-04 2004-10-12 Siebel Systems, Inc. Method and system for accessing CRM data via voice
US7058890B2 (en) 2002-02-13 2006-06-06 Siebel Systems, Inc. Method and system for enabling connectivity to a data system
US7672853B2 (en) 2002-03-29 2010-03-02 Siebel Systems, Inc. User interface for processing requests for approval
US7131071B2 (en) 2002-03-29 2006-10-31 Siebel Systems, Inc. Defining an approval process for requests for approval
US6968348B1 (en) * 2002-05-28 2005-11-22 Providian Financial Corporation Method and system for creating and maintaining an index for tracking files relating to people
US6850949B2 (en) 2002-06-03 2005-02-01 Right Now Technologies, Inc. System and method for generating a dynamic interface via a communications network
US7437720B2 (en) 2002-06-27 2008-10-14 Siebel Systems, Inc. Efficient high-interactivity user interface for client-server applications
US8639542B2 (en) 2002-06-27 2014-01-28 Siebel Systems, Inc. Method and apparatus to facilitate development of a customer-specific business process model
US7594181B2 (en) 2002-06-27 2009-09-22 Siebel Systems, Inc. Prototyping graphical user interfaces
US7251787B2 (en) 2002-08-28 2007-07-31 Siebel Systems, Inc. Method and apparatus for an integrated process modeller
US9448860B2 (en) 2003-03-21 2016-09-20 Oracle America, Inc. Method and architecture for providing data-change alerts to external applications via a push service
JP2006521641A (en) 2003-03-24 2006-09-21 シーベル システムズ,インコーポレイティド Custom common objects
WO2004086198A2 (en) 2003-03-24 2004-10-07 Siebel Systems, Inc. Common common object
US7904340B2 (en) 2003-03-24 2011-03-08 Siebel Systems, Inc. Methods and computer-readable medium for defining a product model
US8762415B2 (en) 2003-03-25 2014-06-24 Siebel Systems, Inc. Modeling of order data
US7620655B2 (en) 2003-05-07 2009-11-17 Enecto Ab Method, device and computer program product for identifying visitors of websites
US7409336B2 (en) 2003-06-19 2008-08-05 Siebel Systems, Inc. Method and system for searching data based on identified subset of categories and relevance-scored text representation-category combinations
US20040260659A1 (en) 2003-06-23 2004-12-23 Len Chan Function space reservation system
US7237227B2 (en) 2003-06-30 2007-06-26 Siebel Systems, Inc. Application user interface template with free-form layout
US7694314B2 (en) 2003-08-28 2010-04-06 Siebel Systems, Inc. Universal application network architecture
US8209308B2 (en) 2006-05-01 2012-06-26 Rueben Steven L Method for presentation of revisions of an electronic document
US9135228B2 (en) 2006-05-01 2015-09-15 Domo, Inc. Presentation of document history in a web browsing application
US8566301B2 (en) 2006-05-01 2013-10-22 Steven L. Rueben Document revisions in a collaborative computing environment
US7779475B2 (en) 2006-07-31 2010-08-17 Petnote Llc Software-based method for gaining privacy by affecting the screen of a computing device
US8082301B2 (en) 2006-11-10 2011-12-20 Virtual Agility, Inc. System for supporting collaborative activity
US8954500B2 (en) 2008-01-04 2015-02-10 Yahoo! Inc. Identifying and employing social network relationships
US8719287B2 (en) 2007-08-31 2014-05-06 Business Objects Software Limited Apparatus and method for dynamically selecting componentized executable instructions at run time
US20090100342A1 (en) 2007-10-12 2009-04-16 Gabriel Jakobson Method and system for presenting address and mapping information
US8504945B2 (en) 2008-02-01 2013-08-06 Gabriel Jakobson Method and system for associating content with map zoom function
US8490025B2 (en) 2008-02-01 2013-07-16 Gabriel Jakobson Displaying content associated with electronic mapping systems
US8014943B2 (en) 2008-05-08 2011-09-06 Gabriel Jakobson Method and system for displaying social networking navigation information
US8032297B2 (en) 2008-05-08 2011-10-04 Gabriel Jakobson Method and system for displaying navigation information on an electronic map
US8646103B2 (en) 2008-06-30 2014-02-04 Gabriel Jakobson Method and system for securing online identities
US8510664B2 (en) 2008-09-06 2013-08-13 Steven L. Rueben Method and system for displaying email thread information
US8010663B2 (en) 2008-11-21 2011-08-30 The Invention Science Fund I, Llc Correlating data indicating subjective user states associated with multiple users with data indicating objective occurrences
US8495384B1 (en) * 2009-03-10 2013-07-23 James DeLuccia Data comparison system
US8577849B2 (en) * 2011-05-18 2013-11-05 Qatar Foundation Guided data repair
US8769004B2 (en) 2012-02-17 2014-07-01 Zebedo Collaborative web browsing system integrated with social networks
US8756275B2 (en) 2012-02-17 2014-06-17 Zebedo Variable speed collaborative web browsing system
US8769017B2 (en) 2012-02-17 2014-07-01 Zebedo Collaborative web browsing system having document object model element interaction detection

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5504890A (en) * 1994-03-17 1996-04-02 Sanford; Michael D. System for data sharing among independently-operating information-gathering entities with individualized conflict resolution rules
US20030037051A1 (en) * 1999-07-20 2003-02-20 Gruenwald Bjorn J. System and method for organizing data
US6658423B1 (en) * 2001-01-24 2003-12-02 Google, Inc. Detecting duplicate and near-duplicate files
US6947930B2 (en) * 2003-03-21 2005-09-20 Overture Services, Inc. Systems and methods for interactive search query refinement
US20050234952A1 (en) * 2004-04-15 2005-10-20 Microsoft Corporation Content propagation for enhanced document retrieval
US7305389B2 (en) * 2004-04-15 2007-12-04 Microsoft Corporation Content propagation for enhanced document retrieval
US20060026152A1 (en) * 2004-07-13 2006-02-02 Microsoft Corporation Query-based snippet clustering for search result grouping
US20060117002A1 (en) * 2004-11-26 2006-06-01 Bing Swen Method for search result clustering
US20100070460A1 (en) * 2005-05-02 2010-03-18 Fuerst Karl System and method for rule-based data object matching
US20070027921A1 (en) * 2005-08-01 2007-02-01 Billy Alvarado Context based action
US20070192293A1 (en) * 2006-02-13 2007-08-16 Bing Swen Method for presenting search results
US20080222140A1 (en) * 2007-02-20 2008-09-11 Wright State University Comparative web search system and method
US20090240672A1 (en) * 2008-03-18 2009-09-24 Cuill, Inc. Apparatus and method for displaying search results with a variety of display paradigms
US20100023515A1 (en) * 2008-07-28 2010-01-28 Andreas Marx Data clustering engine
US20120023107A1 (en) * 2010-01-15 2012-01-26 Salesforce.Com, Inc. System and method of matching and merging records
US8782016B2 (en) * 2011-08-26 2014-07-15 Qatar Foundation Database record repair

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366247B2 (en) 2015-06-02 2019-07-30 ALTR Solutions, Inc. Replacing distinct data in a relational database with a distinct reference to that data and distinct de-referencing of database data
WO2017016130A1 (en) * 2015-07-30 2017-02-02 中兴通讯股份有限公司 Message processing method and device
US11360990B2 (en) 2019-06-21 2022-06-14 Salesforce.Com, Inc. Method and a system for fuzzy matching of entities in a database system based on machine learning

Also Published As

Publication number Publication date
US20150032738A1 (en) 2015-01-29
US9760620B2 (en) 2017-09-12

Similar Documents

Publication Publication Date Title
US9760620B2 (en) Confidently adding snippets of search results to clusters of objects
US10579691B2 (en) Application programming interface representation of multi-tenant non-relational platform objects
US8521758B2 (en) System and method of matching and merging records
US10733212B2 (en) Entity identifier clustering based on context scores
US9465828B2 (en) Computer implemented methods and apparatus for identifying similar labels using collaborative filtering
US11016959B2 (en) Trie-based normalization of field values for matching
US20190114342A1 (en) Entity identifier clustering
US9646246B2 (en) System and method for using a statistical classifier to score contact entities
US9223852B2 (en) Methods and systems for analyzing search terms in a multi-tenant database system environment
US10579692B2 (en) Composite keys for multi-tenant non-relational platform objects
US11714811B2 (en) Run-time querying of multi-tenant non-relational platform objects
US9268822B2 (en) System and method for determining organizational hierarchy from business card data
US11216435B2 (en) Techniques and architectures for managing privacy information and permissions queries across disparate database tables
US20170060919A1 (en) Transforming columns from source files to target files
US10599654B2 (en) Method and system for determining unique events from a stream of events
US20150106390A1 (en) Processing user-submitted updates based on user reliability scores
US20160379265A1 (en) Account recommendations for user account sets
US20160378759A1 (en) Account routing to user account sets
US10817465B2 (en) Match index creation
US10671626B2 (en) Identity consolidation in heterogeneous data environment
US10628384B2 (en) Optimized match keys for fields with prefix structure
US9619458B2 (en) System and method for phrase matching with arbitrary text
US10852926B2 (en) Filter of data presentations via user-generated links
US11436233B2 (en) Generating adaptive match keys
US9659059B2 (en) Matching large sets of words

Legal Events

Date Code Title Description
AS Assignment Owner name: SALESFORCE.COM, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NACHNANI, PAWAN;JAGOTA, ARUN KUMAR;SIGNING DATES FROM 20140714 TO 20140718;REEL/FRAME:033360/0816
STPP Information on status: patent application and granting procedure in general Free format text: FINAL REJECTION MAILED
STPP Information on status: patent application and granting procedure in general Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
STPP Information on status: patent application and granting procedure in general Free format text: NON FINAL ACTION MAILED
STPP Information on status: patent application and granting procedure in general Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
STPP Information on status: patent application and granting procedure in general Free format text: FINAL REJECTION MAILED
STCB Information on status: application discontinuation Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION