[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

US20120130940A1 - Real-time analytics of streaming data - Google Patents

Real-time analytics of streaming data Download PDF

Info

Publication number
US20120130940A1
US20120130940A1 US13/300,523 US201113300523A US2012130940A1 US 20120130940 A1 US20120130940 A1 US 20120130940A1 US 201113300523 A US201113300523 A US 201113300523A US 2012130940 A1 US2012130940 A1 US 2012130940A1
Authority
US
United States
Prior art keywords
attributes
value
tuple
attribute
aggregates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/300,523
Inventor
Abhishek Gattani
Anand Rajaraman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Walmart Apollo LLC
Original Assignee
Wal Mart Stores Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wal Mart Stores Inc filed Critical Wal Mart Stores Inc
Priority to US13/300,523 priority Critical patent/US20120130940A1/en
Priority to US13/300,519 priority patent/US9183270B2/en
Publication of US20120130940A1 publication Critical patent/US20120130940A1/en
Assigned to WAL-MART STORES, INC. reassignment WAL-MART STORES, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GATTANI, ABHISHEK, RAJARAMAN, ANAND
Assigned to WALMART APOLLO, LLC reassignment WALMART APOLLO, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: WAL-MART STORES, INC.
Abandoned legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2379Updates performed during online database operations; commit processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Definitions

  • social media services such as TwitterTM, DiggTM, MyspaceTM and FacebookTM have seen a meteoric rise in popularity resulting in an ever evolving universe of streaming content/data which is often user/consumer generated.
  • social media is able to capture, better than many other sources, a raw and unfiltered pulse of society.
  • a company may gather and analyze information relevant to the company's markets to promote accurate and confident decision-making in determining market opportunity, market penetration strategy, market development metrics, etc.
  • the present disclosure relates to real-time analytics of data streams. More particularly, the present disclosure relates to storage media, systems and methods for processing data streams and analyzing data extracted therefrom.
  • Storage media, systems and methods for performing real time analytics on streaming data are disclosed herein.
  • a method for performing real time analytics on streaming data may include: processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
  • a tuple frequency may be tracked over an interval for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval.
  • aggregates over an interval for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes may be tracked, and a top-N values determined for the one or more leadership dimension attributes over the interval, for example, wherein the Top-N values are characterized as resulting in the highest aggregates, the lowest aggregates or the aggregates closest to a selected value, over the interval.
  • the one or more dimension attributes may include a K-Gram for identifying topics of interest.
  • a tuple frequency for each of the implicated tuples including a K-Gram may be tracked over an interval, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
  • a method for implementing a real time analytics platform may include establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
  • a method for performing real-time analytics on a data stream may include: processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
  • a system for performing real time analytics on streaming data may include: a processor for processing an event in a data stream to extract a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; a mapper for identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and one or more updaters for updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
  • a system for performing real-time analytics on a data stream may include: a processor for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and memory for storing the plurality of stored aggregates.
  • a multi-dimensional data structure for implementing a real-time analytics platform characterized by one or more time windows, one or more dimension attributes, and one or more value attributes, may include: a plurality of tuples associated with the one or more dimension attributes; and a slate associated with each tuple for maintaining an aggregate for each of the one or more value attributes over each of the one or more time windows.
  • a multi-dimensional data structure for implementing a real-time analytics platform may include: a plurality of stored tuples each representing a set of search query parameters for prospective queries extrapolated from a pre-established framework of possible query parameters and one or more stored aggregates associated with each of the stored tuples, wherein each aggregate represents a result for a prospective query characterized by the set of search query parameters represented in the tuple associated with that aggregate.
  • a non-transitory computer readable medium may store processor executable instructions for performing methods described herein.
  • the computer readable medium may store processor executable instructions for processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
  • the computer readable medium may store processor executable instructions for establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
  • the computer readable medium may store processor executable instructions for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
  • FIG. 1 depicts an exemplary data stream, according to the present disclosure.
  • FIG. 2 depicts an exemplary query, according to the present disclosure.
  • FIG. 3 depicts an exemplary data cube, according to the present disclosure.
  • FIG. 4 depicts an exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure.
  • FIG. 5 depicts another exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure.
  • FIGS. 6 a - g depict a sequence of events for a worked example using the distributed architecture of FIG. 5 , according to the present disclosure.
  • FIGS. 7 a - c depict flowcharts for exemplary methods for performing real time analytics on streaming data, according to the present disclosure.
  • FIGS. 8 a - b depict overlaying real-time social statistics on a geographic map.
  • FIG. 9 depicts an exemplary computing device for implementing embodiments of the present disclosure.
  • FIG. 10 depicts an exemplary network environment for implementing a distributed architecture, according to the present disclosure.
  • Storage media, systems and methods are disclosed herein for analyzing data streams in real time and/or pre-computing statistics in real time with no query time computation. More particularly, storage media, systems and methods are presented for processing data streams to calculate results for prospective queries.
  • the results may be advantageously computed prior to the formulation of the specific query, for example, based on a pre-established framework of potential query parameters. More particularly, a universe of potential queries may be extrapolated from the pre-established framework of potential query parameters. Results for each of the potential queries may them be tracked in real time. For example results for each of the potential queries may be continuously updated based on real-time processing of events in a data stream.
  • an event generally refers to an atomic unit in a data stream, for example, a single tweetTM in a TwitterTM feed or a single purchasing transaction in a transaction stream.
  • a data stream may include a continuous flow of data that is not pre-divided into discrete events.
  • an event may be inferred, for example, by identifying a set of one or more related attributes in the data stream.
  • related attributes may be identified based on temporal and/or source commonalities.
  • a contextual analysis for example, a semantic analysis
  • a semantic analysis for example, a semantic analysis
  • the storage media, systems and methods of the present disclosure may be used for real-time analysis of any type of streaming data, structured or unstructured.
  • the storage media, systems and methods of the present disclosure may be used for real-time analysis of purchase transactions, customer reviews/feedback, customer wish lists/shopping carts, etc.
  • prospective queries of data streams may include queries related to social statistics. For example:
  • the result for each of the above queries may be calculated as an aggregate of a value attribute (number of events, percentage of events, sentiment, and average age, respectively) over a specified time window (between 10 a.m. and 11 a.m. today, yesterday, during the limited time offer, and last year, respectively) for a specified set of dimension attributes (related to product P, positive sentiment related to product P, related to the limited time offer from women in Arizona, and related to women in Arizona who purchased product P, respectively).
  • a value attribute number of events, percentage of events, sentiment, and average age, respectively
  • a specified time window between 10 a.m. and 11 a.m. today, yesterday, during the limited time offer, and last year, respectively
  • dimension attributes related to product P, positive sentiment related to product P, related to the limited time offer from women in Arizona, and related to women in Arizona who purchased product P, respectively.
  • a framework of potential query parameters may be pre-established by selecting, for example, via a user input, attributes of interest including one or more time windows, one or more dimension attributes and one or more value attributes.
  • the framework may further include, for each of the one or more value attributes, an aggregate function defining how to aggregate instances of the value attribute.
  • the term dimension attribute refers to an identifiable attribute in a data stream which is of interest as pertaining to a query search parameter.
  • the term value attribute refers to an identifiable attribute in a data stream of interest which is of interest as pertaining to a query result parameter.
  • a same attribute may be both a dimension attribute and a value attribute.
  • the attribute “sentiment” may be used as both a search parameter (such as, in the query “how many women have a positive opinion about product P?”) and a result parameter (such as in the query “what is the sentiment of women regarding product P?”).
  • a pre-established framework of query parameters may be used to generate a multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes; an aggregate of each of the one or more value attributes over each of the one or more time windows.
  • the aggregates stored in the multi-dimensional data structure may be updated for each new event processed from a data stream.
  • the event may be analyzed to extract a set of related attribute-value pairs including for one or more dimension attributes and one or more value attributes.
  • the extracted set of related attributes-value pairs may then be used to identify or more implicated tuples in the data structure for updating.
  • aggregates associated with each of the implicated tuples for each of the one or more value attributes may be updated by applying an appropriate aggregation function to each identified value attribute-value pair. In this way the multi-dimensional data structure may maintain real-time analytics of the data stream.
  • a distributed architecture such as Muppet (map, update) may be used to implement the storage media, systems and methods of the present disclosure. Exemplary implementations of Muppet are further described herein as well as in U.S. non-provisional patent application entitled “Processing Data Feeds,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50302).
  • a distributed architecture may be used to map an event to one or more implicated tuples in one or more multi-dimensional data structures and update, for each of the implicated tuples, one or more slates, for example based on one or more value attribute-value pairs in the event.
  • slates for a plurality of implicated tuples may be updated in parallel for example, using different processing nodes.
  • map and “mapper,” as used herein, relate to a stream operation performed in exemplary embodiments in which events in a data stream are processed in a real-time manner to generate one or more new events which are then published to a same or different data stream.
  • a mapper may be used to publish events to one or more updaters for updating an aggregate value contained in a slates.
  • update and “updater” refer to a stream operation performed in exemplary embodiments in which events in one or more real-time data streams are processed in a real-time manner to create or update one or more persistent static “slate” data structures that are stored in a persistent manner, for example, in a durable disk storage (note that, as used herein the terms “store,” “stored” “storage” etc., imply persistence a non-transitory storage medium).
  • an updater may generate zero, one or more new stream events. The generated stream events may be published to one or more real-time data streams.
  • an updater may publish stream events to a data stream from which it accepts stream events as input.
  • slate refers to a static data structure that may be used to record aggregates as described herein.
  • a slate may have any suitable data structure or format.
  • a slate may include a collection of one or more attribute-value pairs.
  • a slate may be stored corresponding to a unique slatekey and updater that updates the slate.
  • time-based aggregates of a given value attribute over a given time-window may be maintained in a multi-dimensional data structure (sometimes referred to herein as a data cube).
  • the dimensions of the data structure generally reflect one or more dimension attributes selected in a pre-established framework of potential query parameters.
  • fan-out may exponentially relate to the number of dimensions in the data structure.
  • the term “fan-out” for a distributed architecture may refer to the ratio of internal events generated by the mapping function relative to the number of external events (e.g., tweetsTM) processed.
  • Event 100 may be processed to identify a plurality of attribute-value pairs 110 contained therin.
  • attributes may include:
  • timestamp One usefull attribute, accordingly to the present disclosure, is the timestamp.
  • two assumptions may be made regarding the timestamp: first, that the timestamp represents actual wall-clock time in some appropriate timezone; and second, that timestamps are monontonically increasing. These assumptions are generally reasonable for streaming data (for example, in TwitterTM each tweetTM contains a timestamp that satisfies these conditions).
  • One reason that timestamps are useful is that query results are represented as aggregates over a time window. Using timestamps, aggregates may be unambiguously interpreted to include a set of events whose timestamps fall within the specified time window.
  • Query 200 may specify, for example, a time window 210 , a set of dimension attribute-value pairs 220 , one or more value attributes of interest 230 , and an aggregate function 240 related to each value attribute 230 .
  • exemplary instances of query 200 are described below:
  • Example 1 may be rewritten as the following SQL query:
  • AggregateSentiment represents a custom-defined aggregation function for combining sentiment values.
  • the function may maintain a 3-tuple of count, one each for positive, neutral, and negative sentiments.
  • the function may be configured to calculate an average sentiment.
  • aggregates may be calculated for each of a plurality of time windows.
  • arbitrary time windows may be supplanted by a standard set of time windows, for example:
  • the standard set of time windows may reflect an assumption that the further back time the coarser the time granularity of interest.
  • the standard set of time windows may include time windows of varying time granularity. In exemplary embodiments, it may be sufficient to maintain aggregates for a finite number of progressively older and courser sets of time windows, such reflected above.
  • the systems and method of the subject disclosure typically utilize algebraic aggregation functions, which may be computed incrementally.
  • an average may be represented as a 2-tuple a sum and a count, wherein the average may be calculated by dividing the sum by the count.
  • algebraic aggregations functions advantageously simplifies the update process, thereby facilitating the real-time data processing and analytics as described herein.
  • aggregates are assumed to be commutative and associative, which makes manipulations thereof simpler.
  • the storage media, systems and methods of the present disclosure may advantageously utilize a multi-dimensional data structure for maintaining a universe of aggregates for a pre-established framework of potential query parameters, wherein the pre-established framework of potential query parameters is characterized, by one or more time windows, one or more dimension attributes, one or more value attributes and one or more aggregation functions.
  • the multi-dimensional data structure for storing aggregates f(Se), W for all potential combinations and values of t, s, and g may be referred to as the data cube for aggregate f(Se) and time window W.
  • the term data cube refers to the fact the aggregates may be arranged as the vertices of a hypercube. In general, given K dimension attributes, there are 2 K aggregates for each set of values of the dimension attributes, which is the number of vertices of a hypercube in K dimensions.
  • the data cube for the aggregate f(Se) allows rapid lookup of the aggregate of value attribute Se for every tuple over timewindow W.
  • a data cube may store aggregates for a plurality of different time windows, for example for a standard set of time windows such as described herein.
  • Topic (T) 310 may include as instances, names for various topics, for example, as grouped via a semantic hierarchy.
  • Location (L) 320 may include as instances, names of locations, for example names of States in the united States.
  • Sentiment (S) 330 may include instances selected from positive negative or neutral.
  • the dimension attributes Topic (T) 310 , Location (L) 320 and Sentiment (S) 33 may be reflected along the vertices of the data cube 300 .
  • the data cube 300 may maintain for each tuple (t, 1 , s) of T, L, S an aggregate for a value attribute (in this case: event count, i.e., the number of events processed for the tuple (t, l, s)).
  • event count i.e., the number of events processed for the tuple (t, l, s)
  • each tuple (t, l, s) may be associated with a slate for storing the event count.
  • the slate may further be associated with an updater for updating the slate for a new event and a mapper for mapping new events to the updater.
  • data cube 300 is updated based on a new event.
  • a new event may be processed to identify one or more dimension attributes therein.
  • the event count associated with each of the four implicated tuples may be updated (for example, by incrementing the count by one).
  • Data cube 300 advantageously maintains, in real time, results for any prospective query using the framework (T, L, S), of an event count over one or more pre-established time windows.
  • FIG. 3 illustrates three examples of queries 340 a - c , the answers to which are maintained and therefore pre-computed in data cube 300 .
  • query 340 a asks “how many people are posting about Barack Obama in New York?”
  • the result to query 340 a may be obtained by returning the event count for the tuple (Barack Obama, New York, All), for example the event count stored in the slate associated with the tuple (Barack Obama, New York, All).
  • query 340 b asks “How many people in Arizona feel positive of the new Medicare plan?”
  • the result to query 340 b be obtained by returning the event count for the tuple (Medicare, Arizona, Positive).
  • query 340 c asks “How many people feel negative of Barack Obama across the US?”
  • the result to query 340 c may obtained by returning the event count for the tuple (Barrack Obama, United States (e.g., all states), Negative).
  • the number of the dimension of a data cube may be automatically determined based on the types of attributes reflected in the data stream. For example, event types with the greatest frequencies, such as above a selected threshold, may be used as the dimensions for the data cube. Thus, for example the data stream may be analyzed to determine the best candidate attributes for cube dimensions.
  • an exemplary implementation of a distributed architecture 400 for maintaining a data cube may include a mapper, for example, CubeMapper 410 , and one or more updaters, for example, CubeTupleUpdaters 420 .
  • the mapper and/or updaters may be distributed to one or more processing nodes in the distributed architecture, e.g., via a network architecture such as described herein, for example with reference to FIG. 10 .
  • the CubeMapper 410 may advantageously determine, for example based on a set of attribute-value pairs extracted for an event, which data cubes to maintain.
  • a data cube may be defined by a set of dimension attributes (for example, Topic, Gender, State), and an aggregation function for a value attribute.
  • a configuration file may list the data cubes of interest, and give each data cube a name.
  • the aggregation function may be specified, for example, in javascript, as a function that takes two parameters (the current value of the aggregate and a new event) and returns a single value (the new value of the aggregate). Note that as a special case, the current value may be null, in which case the function may return the aggregate corresponding to just the one event.
  • the value of the aggregate may be in any data structure/format, for example a JSON object.
  • a generated data cube may apply only to specific kinds of topics.
  • the query framework may call for a data cube specific for persons of interest, such as customers, celebrities, etc., or for occasions such as holidays, the Oscars, etc.
  • the mapper may implement a selection function, based on selection criterion, to filter out only a subset of events from a data stream.
  • the selection function may, for example, accept a single parameter (the event) and return either True (this event's data should be part of this data cube) or False.
  • the CubeConfig file may advantageously be replicated for reference at each processing node in the distributed architecture.
  • the CubeMapper 410 may processes each event in a data stream and determine which data cubes it is eligible for.
  • the CubeMapper 410 constructs the 2 K tuples from the event E (for example, the 2 2 tuples: (a), (b), (a,b) and (All) depicted in FIG. 3 ) and generates an event E 1 for each tuple.
  • the key for each event E 1 is the pair (CubeName, Tuple) and the value is the content event E, for example, including a value attribute-value pair.
  • the generated events E 1 are sent on to the CubeTupleUpdaters 420 .
  • Each CubeTupleUpdater 420 maintains a slate for every (CubeName, Tuple) pair it receives.
  • the CubeTupleUpdater receives an event E 1 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store.
  • the updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein.
  • a query for any tuple and any Time Window may be answered via a quick slate lookup.
  • the CubeMapper 410 may generate 2K events for each incoming event. This leads to very high fan-out for cubes with K>3.
  • tuples may be advantageously filtered based on frequency. Frequency filtering may be implemented, for example, by selecting a small time window (for example 1 minute) referred to as the delta window D. Let S be the set of all tuples corresponding to all social updates during a delta window D.
  • a threshold ⁇ is applied to filter out all tuples in S with frequency less than ⁇ from being sent to updaters (for example, CubeTupleUpdaters 320 of FIG. 3 ).
  • may be selected to be 1 or 2.
  • a simple experiment with actual data suggests that setting ⁇ to 2 may eliminate over 90% of tuples in a delta window of 1 minute.
  • a qualitative examination of such tuples indicated that the eliminated tuples generally came from events that are not really of interest (for example, spam, outliers of some sort, or just semantic analysis errors).
  • filtering also a second effect of reducing noise, e.g., from semantic analysis errors.
  • tuples that do occur frequently in a 1-minute window often correlate well with the global interests. Frequency filtering of tuples thus both improves performance and improves the quality of the data cube.
  • FIG. 5 depicts an exemplary distributed architecture 500 for maintaining a data cube while implementing frequency filtering.
  • the distributed architecture 500 may include a mapper (CubeS elector 510 ) and 3 types of updaters (CubeTupleGenerators 520 , the CubeTupleCollectors 530 , and the CubeTupleUpdaters 540 ).
  • the CubeTupleGenerators 520 buffer all tuples with the same Interval, and then dispatch them to the CubeTupleCollectors 530 .
  • an assumption may be made that the stream has a large number of events in each Delta Window—that is, the Delta Window is very large compared to the average gap between events (for example, TwitterTM, processes approximately 100,000 tweetsTM in a Delta Window of 1 minute resulting in an average inter-event gap of less than a millisecond).
  • TwitterTM processes approximately 100,000 tweetsTM in a Delta Window of 1 minute resulting in an average inter-event gap of less than a millisecond.
  • one may detect when an interval has ended and the next one has begun based on a processing of the first event whose Interval is higher than the current Interval.
  • intervals may be tracked independent of events received.
  • intervals may be based on e.g., a requisite number of events received rather than a time frame (for example, provided that the integration of such batches over the one or more time windows being tracked results in an acceptable margin of error).
  • the threshold ⁇ may be variable, e.g., based on a changing frequency of events in the data stream.
  • a different threshold may be applied depending on the number of dimension attributes-value pairs in the tuple.
  • frequency filtering may account for variations in event traffic when filtering tuples.
  • the CubeSelector 510 is similar to the CubeMapper 210 in the na ⁇ ve implementation. It uses the config file to determine if the current Event E is eligible for a data cube. If it is, the CubeSelector 510 uses a hash function to map the timestamp of the event to one of P values, and emits to stream X an event E 1 whose key is (CubeName, P) and value is the content of the event E.
  • Each CubeTupleGenerator 520 subscribes to stream X.
  • a CTG generates all possible tuples from an event E 1 and maintains a table with 3 columns: Tuple, AggregateValue, and Count.
  • Count is the number of events that have contributed to this tuple.
  • the CTG also stores the Interval I of the first event it received during this Delta Window.
  • the CTG For each tuple, the CTG uses a hash function to map the tuple into one of P buckets 0,1 . . . , P ⁇ 1. For each bucket in 1, . . . P ⁇ 1, the CTG creates and emits to stream Y an event E 2 whose key is the bucket number, and whose value is the set of rows in its table whose tuple maps to that bucket number. The value also indicates the Interval I. Once the CTG has sent out all P events, it empties its tables and stores J as the new Interval value. The CTG then proceeds to process a newly received event E 1 .
  • Each CubeTupleCollector 530 subscribes to stream Y.
  • a CTC slate receives P events , one from each CTG slate.
  • the CTC maintains a table with 3 columns: Tuple, AggregateValue, Count.
  • the AggregateValue for a tuple is the aggregate of the partial AggregateValues for that tuple from each CTG, and the Count is the sum of the corresponding partial counts.
  • the CTC also keeps track of the current interval I and a message counter M, both initialized to 0. When the CTC receives an event E 2 from the CTG with interval J, it first compares I and J:
  • Each CubeTupleUpdater 540 maintains a slate for every (CubeName, Tuple) pair it receives.
  • the CubeTupleUpdater receives an event E 3 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store.
  • the updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein.
  • a query for any tuple and any Time Window may be answered via a quick slate lookup.
  • FIGS. 6 a - g an exemplary worked example of the distributed architecture 500 is depicted, demonstrating exemplary contents of CTG slates 620 and CTC slates 630 of a CTG and CTC such as described above with reference to FIG. 5 .
  • a data cube C with two dimensions (A and X) is assumed.
  • 4 tuples arrive.
  • Cube Selecter 610 uses the config file to determine if the Event E is eligible for a data cube, for example data cube C.
  • the CubeSelector 610 uses a hash function to map the timestamp of the event to one of P Values and emits an event E 1 whose key is (CubeName, P) and value is the content of the event E.
  • CTG slates 620 buffer all tuples with the same interval. More particularly the CTG slates 620 maintains a table with 3 columns: Tuple, AggregateValue, and Count.
  • a hash function to map each from the CTG slates 620 tuple into one of P buckets 0,1 . . . , P ⁇ 1, for example buckets 625 of FIG.
  • CTC slates 630 receive P events, one from each CTG slate 620 .
  • the CTC slates 630 maintains a table with 3 columns: Tuple, AgregateValue, Count. As depicted, once the CTG slate has sent out its events E 2 , it empties its tables and begins processing events E 1 for a new interval.
  • Exemplary implementations presented above primarily related to point aggregates—wherein a tuple (a point in the hypercube) is specified and the desired result of the query is the corresponding aggregate value.
  • the storage media, systems and methods herein are not limited to such implementations. Indeed in some embodiments, the data cube is used tracks results for different type of queries for, example, leaderboard queries.
  • a leaderboard query specifies a tuple and a leaderboard attribute, and asks for the Top-N values of the leader board attribute among events that satisfy the tuple.
  • a leaderboard query might pose the following question:
  • leaderboard queries might pose the following question: What were the Top 10 least popular topics posted about by people in Arizona between 10 am and 11 am today? What were the Top 10 locations closest to New York where users posted about a sales event.
  • the data cube may easily be extended to support leaderboard queries.
  • the leaderboard attribute has a small and known set of values (e.g., state or gender)
  • the leaderboard attribute can take on a large number of values (or example, topic, product, URL, domain). In this case we cannot possibly examine all slates for all values of the attribute.
  • leaderboard queries are indicated by adding a line to the data cube config file to indicate the leaderboard dimensions for each cube and the value for “N” (for Top-N queries; e.g., 10). For example:
  • Non-leaderboard dimensions of the data cube may be referred to as ordinary dimensions.
  • the CTC can now compute the Top-N leaderboard values for each tuple, and pass them on to the CTU in the event it constructs for the tuple.
  • the Top-N values may be characterized as resulting in the highest aggregates, the lowest aggregates, and the aggregates closest to a selected value, etc.
  • the CTU may maintain the leaderboard for each time window. Where a time window is used for filtering by the CTC, the CTU may get the leaderboard directly from the CTC. Leaderboards for larger time windows are approximated using those for the smaller time windows. For example, an hourly leaderboard may be computed by looking at the Top 10 for each minute of the hour, and summing the minute-by-minute values. While it is possible to make some errors using this method (for example, where the most frequent item during the hour was not in the top 10 for any minute of the hour) accurate results are generally achieved.
  • a semantic analysis engine can only tag instances it recognizes.
  • An interesting advantage the storage media, systems and methods described herein is that the data cube may be used to maintain statistics on topics products, locations etc. (collectively, referred to as interesting topics or topics of interest) without the help of semantic analysis.
  • a “K-Gram” is added as one of the data cube dimensions, with a relative high threshold 6 (e.g., 50).
  • a slate is automatically created and maintained for any k-gram that was mentioned very often in a delta time window.
  • a TF.IDF threshold may be instead of a pure frequency threshold to achieve better results.
  • the data cube may be used to identify trending k-grams and automatically maintains stats for them without help from semantic analysis.
  • identified topics of interest can be communicated to the semantic analysis engine and “candidate topics” and the maintained statistics relating thereto can be used by the engine to whether to incorporate each candidate topic into the taxonomy.
  • FIGS. 7 a - c illustrate exemplary methods according to the present disclosure.
  • Method 710 is depicted for performing real time analytics on streaming data.
  • Method 710 generally includes steps of: ( 712 ) processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; ( 714 ) identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and ( 716 ) for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
  • Method 720 is depicted for performing real time analytics on streaming data.
  • Method 720 generally includes steps of ( 722 ) establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and ( 724 ) generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
  • Method 730 generally includes steps of: ( 732 ) processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and ( 734 ) returning one of the stored aggregates in response to a query.
  • the storage media, systems and methods presented herein may include one or more programmable processing units having associated therewith executable instructions held on one or more computer readable medium, RAM, ROM, hard drive, and/or hardware.
  • the hardware, firmware and/or executable code may be provided, for example, as upgrade module(s) for use in conjunction with existing infrastructure (for example, existing devices/processing units).
  • Hardware may, for example, include components and/or logic circuitry for executing the embodiments taught herein as a computing process.
  • Displays and/or other feedback means may also be included to convey detected/processed data, for example adjusted output representative of a particle characteristic.
  • the display and/or other feedback means may be stand-alone or may be included as one or more components/modules of the processing unit(s).
  • the display and/or other feedback means may be used to facilitate querying a data cube.
  • the display may be used to visualize, in real-time, various social statistics maintained by the data cube. For example, as depicted in FIGS. 8 a and 8 b , real-time social statistics may be overlaid on a geographic map.
  • a “processor,” “processing unit,” “computer” or “computer system” may be, for example, a wireless or wire line variety of a microcomputer, minicomputer, server, mainframe, laptop, personal data assistant (PDA), wireless e-mail device (for example, “BlackBerry,” “Android” or “Apple,” trade-designated devices), cellular phone, pager, processor, fax machine, scanner, or any other programmable device configured to transmit and receive data over a network.
  • Computer systems disclosed herein may include memory for storing certain software applications used in obtaining, processing and communicating data. It can be appreciated that such memory may be internal or external to the disclosed embodiments.
  • the memory may also include non-transitory storage medium for storing software, including a hard disk, an optical disk, floppy disk, ROM (read only memory), RAM (random access memory), PROM (programmable ROM), EEPROM (electrically erasable PROM), flash memory storage devices, or the like.
  • non-transitory storage medium for storing software, including a hard disk, an optical disk, floppy disk, ROM (read only memory), RAM (random access memory), PROM (programmable ROM), EEPROM (electrically erasable PROM), flash memory storage devices, or the like.
  • FIG. 9 depicts a block diagram representing an exemplary computing device 900 that may be used as a processing node (also referred to as a worker node) for aggregating and/or storing data as described herein, for example a processing node in a distributed architecture as described herein.
  • the computing device 900 may be any computer system, such as a workstation, desktop computer, server, laptop, handheld computer, tablet computer (e.g., the iPadTM tablet computer), mobile computing or communication device (e.g., the iPhoneTM mobile communication device, the AndroidTM mobile communication device, and the like), or other form of computing or telecommunications device that is capable of communication and that has sufficient processor power and memory capacity to perform the operations described herein.
  • a distributed computational system may be provided comprising a plurality of such computing devices.
  • the computing device 900 includes one or more non-transitory computer-readable media having encoded thereon one or more computer-executable instructions or software for implementing exemplary methods described herein.
  • the non-transitory computer-readable media may include, but are not limited to, one or more types of hardware memory, non-transitory tangible media (for example, one or more magnetic storage disks, one or more optical disks, one or more USB flash drives), and the like.
  • memory 906 included in the computing device 900 may store computer-readable and computer-executable instructions or software for implementing exemplary embodiments.
  • the computing device 900 also includes processor 902 and associated core 904 , and in some embodiments, one or more additional processor(s) 902 ′ and associated core(s) 904 ′ (for example, in the case of computer systems having multiple processors/cores), for executing computer-readable and computer-executable instructions or software stored in the memory 906 and other programs for controlling system hardware.
  • processor 902 and processor(s) 902 ′ may each be a single core processor or multiple core ( 904 and 904 ′) processor.
  • Virtualization may be employed in the computing device 900 so that infrastructure and resources in the computing device may be shared dynamically.
  • a virtual machine 914 may be provided to handle a process running on multiple processors so that the process appears to be using only one computing resource rather than multiple computing resources. Multiple virtual machines may also be used with one processor.
  • Memory 906 may include a computer system memory or random access memory, such as DRAM, SRAM, EDO RAM, and the like. Memory 906 may include other types of memory as well, or combinations thereof. Memory 906 may be used to store one or more slates on a temporary basis, for example, in cache.
  • a user may interact with the computing device 900 through a visual display device 918 , such as a screen or monitor, that may display one or more interfaces 920 that may be provided in accordance with exemplary embodiments.
  • the visual display device 918 may also display other aspects, elements and/or information or data associated with exemplary embodiments.
  • the computing device 900 may include other I/O devices for receiving input from a user, for example, a keyboard or any suitable multi-point touch interface 908 , a pointing device 910 (e.g., a mouse, a user's finger interfacing directly with a display device, etc.).
  • the keyboard 908 and the pointing device 910 may be coupled to the visual display device 918 .
  • the computing device 900 may include other suitable conventional I/O peripherals.
  • the one or more of the interfaces 920 includes an application program interface (API).
  • API application program interface
  • the computing device 900 may include one or more audio input devices 924 , such as one or more microphones, that may be used by a user to provide one or more audio input streams.
  • one or more audio input devices 924 such as one or more microphones, that may be used by a user to provide one or more audio input streams.
  • the computing device 900 may include one or more non-transitory storage devices 924 , such as a durable disk storage (which may include any suitable optical or magnetic durable storage device, e.g., RAM, ROM, Flash, USB drive, or other semiconductor-based storage medium), a hard-drive, CD-ROM, or other non-transitory computer readable media, for storing data and computer-readable instructions and/or software that implement exemplary embodiments as taught herein.
  • the storage device 924 may provide a slate storage 926 for storing computer-executable instructions for implementing the social genome data structure as described herein, for example for storing an updating (via one or more updaters) one or more slates, as described herein.
  • the storage device 924 may store one or more map modules 932 and one or more update modules 934 , as described herein.
  • the storage device 924 may be provided on the computing device 900 or provided separately or remotely from the computing device 900 .
  • the storage device 924 may be used to store one or more slates in a durable manner.
  • Exemplary mappers and updaters may be programmatically implemented by a computer process in any suitable programming language, for example, a scripting programming language, an object-oriented programming language (e.g., Java), and the like.
  • a general Mapper class or interface and Updater class or interface may be defined by the system to generally specify attributes and functionality of a generic update operation.
  • a sub-class may be created based on the Updater class.
  • One or more object instances may be created from each sub-class at a processor node, for example, a CubeTupleGenerator object may be instantiated from a CubeTupleGenerator sub-class.
  • the computing device 900 may include a network interface 912 configured to interface via one or more network devices 922 with one or more networks, for example, Local Area Network (LAN), Wide Area Network (WAN) or the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (for example, 802.11, T1, T3, 56 kb, X.25), broadband connections (for example, ISDN, Frame Relay, ATM), wireless connections, controller area network (CAN), or some combination of any or all of the above.
  • LAN Local Area Network
  • WAN Wide Area Network
  • the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (for example, 802.11, T1, T3, 56 kb, X.25), broadband connections (for example, ISDN, Frame Relay, ATM), wireless connections, controller area network (CAN), or some combination of any or all of the above.
  • LAN Local Area Network
  • WAN Wide Area Network
  • CAN controller area network
  • the network interface 912 may include a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem or any other device suitable for interfacing the computing device 900 to any type of network capable of communication and performing the operations described herein.
  • the network device 922 may include one or more suitable devices for receiving and transmitting communications over the network including, but not limited to, one or more receivers, one or more transmitters, one or more transceivers, one or more antennae, and the like.
  • the computing device 900 may run any operating system 916 , such as any of the versions of the Microsoft® Windows® operating systems, the different releases of the Unix and Linux operating systems, any version of the MacOS® for Macintosh computers, any embedded operating system, any real-time operating system, any open source operating system, any proprietary operating system, any operating systems for mobile computing devices, or any other operating system capable of running on the computing device and performing the operations described herein.
  • the operating system 916 may be run in native mode or emulated mode.
  • the operating system 916 may be run on one or more cloud machine instances.
  • FIG. 10 depicts an exemplary network environment 1000 suitable for a distributed implementation of exemplary embodiments.
  • the network environment 1000 may include one or more servers 1002 and 1004 coupled to one or more clients 1006 and 1008 via a communication network 1010 .
  • the network interface 912 and the network device 922 of the computing device 900 enable the servers 1002 and 1004 to communicate with the clients 1006 and 1008 via the communication network 1010 .
  • the communication network 1010 may include, but is not limited to, the Internet, an intranet, a LAN (Local Area Network), a WAN (Wide Area Network), a MAN (Metropolitan Area Network), a wireless network, an optical network, and the like.
  • the communication facilities provided by the communication network 1010 are capable of supporting distributed implementations of exemplary embodiments.
  • the servers 1002 and 1004 may provide the clients 1006 and 1008 with computer-readable and/or computer-executable components or products under a particular condition, such as a license agreement.
  • the computer-readable and/or computer-executable components or products provided by the servers may include those for providing one or more real-time data streams to worker processes at worker nodes.
  • the clients 1006 and 1008 may process the data streams using the computer-readable and/or computer-executable components and products provided by the servers 1002 and 1004 .
  • the computer-readable and/or computer-executable components or products provided by the servers may include those for providing and executing one or more map and/or update operations, for example using one or more mappers or updaters.
  • the clients 1006 and 1008 may execute the map and update operations using the computer-readable and/or computer-executable components and products provided by the servers 1002 and 1004 .
  • the clients 1006 and 1008 may transmit events generated by update operations to the servers 1002 and 1004 for publication in one or more data streams.
  • the clients 1006 and 1008 may transmit one or more slates created or updated by update operations to the servers 1002 and 1004 for persistent storage on a disk storage or for storage in memory, e.g., in cache.
  • the clients 1006 and 1008 may provide the servers 1002 and 1004 with computer-readable and computer-executable components or products under a particular condition, such as a license agreement.
  • the computer-readable and/or computer-executable components or products provided by the clients may include those for providing one or more real-time data streams to worker processes.
  • the servers 1002 and 1006 may process the data streams using the computer-readable and/or computer-executable components and products provided by the clients 1006 and 1008 .
  • the computer-readable and/or computer-executable components or products provided by the clients may include those for providing and executing one or more map and/or update operations.
  • the servers 1002 and 1004 may execute the map and update operations using the computer-readable and/or computer-executable components and products provided by the clients 1006 and 1008 .
  • the servers 1002 and 1004 may transmit events generated by update operations to the clients 1006 and 1008 for publication in one or more data streams.
  • the servers 1002 and 1004 may transmit one or more slates created or updated by update operations to the clients 1006 and 1008 for persistent storage on a disk storage or for storage in memory, e.g., in cache.
  • one or more mappers and one or more updaters may be distributed to throughout various processing nodes of the network environment 1000 , for example nodes 1012 a - d.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Storage media, systems and methods are disclosed herein for analyzing data streams in real time. More particularly, storage media, systems and methods are presented for processing data streams to calculate results for prospective queries. The results may be advantageously computed prior to the formulation of the specific query, for example, based on a pre-established framework of potential query parameters. More particularly, a universe of potential queries may be extrapolated from the pre-established framework of potential query parameters. Results for each of the potential queries may them be tracked in real time. For example, results for each of the potential queries may be continuously updated based on real-time processing of events in a data stream.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • The present application claims priority to U.S. Provisional Patent Application No. 61/415,279, filed Nov. 18, 2010 (entitled “Social Genome”), and U.S. Provisional Patent Application No. 61/415,282, filed Nov. 18, 2010 (entitled “Managing Real-Time Data Streams”). This application also relates to U.S. Provisional Patent Application No. 61/345,252 entitled “Content Feed,” filed May 17, 2010, U.S. patent application Ser. No. 13/106,706 entitled “Processing Data Feeds,” filed May 12, 2011, a U.S. non-provisional patent application titled “Processing Data Feeds,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50302), a U.S. non-provisional patent application entitled “Methods Systems and Devices for Recommending Products and Services” filed Nov. 18, 2011 (Attorney Docket No. 114826-50602), and a U.S. non-provisional patent application entitled “Social Genome,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50202). The entire contents of each of the above-referenced applications are incorporated herein in their entirety by reference.
  • BACKGROUND
  • In recent years, social media services such as Twitter™, Digg™, Myspace™ and Facebook™ have seen a meteoric rise in popularity resulting in an ever evolving universe of streaming content/data which is often user/consumer generated. Thus, social media is able to capture, better than many other sources, a raw and unfiltered pulse of society.
  • Potential applications for data harvested from social media are vast. For example, from a marketing intelligence standpoint, a company may gather and analyze information relevant to the company's markets to promote accurate and confident decision-making in determining market opportunity, market penetration strategy, market development metrics, etc.
  • TECHNICAL FIELD
  • The present disclosure relates to real-time analytics of data streams. More particularly, the present disclosure relates to storage media, systems and methods for processing data streams and analyzing data extracted therefrom.
  • SUMMARY
  • Storage media, systems and methods for performing real time analytics on streaming data are disclosed herein.
  • In exemplary embodiments a method for performing real time analytics on streaming data may include: processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes. In some embodiments, a tuple frequency may be tracked over an interval for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval. In other embodiments, for each value attribute, aggregates over an interval for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes may be tracked, and a top-N values determined for the one or more leadership dimension attributes over the interval, for example, wherein the Top-N values are characterized as resulting in the highest aggregates, the lowest aggregates or the aggregates closest to a selected value, over the interval. In yet other embodiments, the one or more dimension attributes may include a K-Gram for identifying topics of interest. Thus, for example a tuple frequency for each of the implicated tuples including a K-Gram may be tracked over an interval, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
  • In other exemplary embodiments, a method for implementing a real time analytics platform may include establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
  • In other exemplary embodiments a method for performing real-time analytics on a data stream may include: processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
  • In exemplary embodiments, a system for performing real time analytics on streaming data, may include: a processor for processing an event in a data stream to extract a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; a mapper for identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and one or more updaters for updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
  • In other exemplary embodiments, a system for performing real-time analytics on a data stream may include: a processor for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and memory for storing the plurality of stored aggregates.
  • In exemplary embodiments, a multi-dimensional data structure, for implementing a real-time analytics platform characterized by one or more time windows, one or more dimension attributes, and one or more value attributes, may include: a plurality of tuples associated with the one or more dimension attributes; and a slate associated with each tuple for maintaining an aggregate for each of the one or more value attributes over each of the one or more time windows.
  • In other exemplary embodiments, a multi-dimensional data structure for implementing a real-time analytics platform may include: a plurality of stored tuples each representing a set of search query parameters for prospective queries extrapolated from a pre-established framework of possible query parameters and one or more stored aggregates associated with each of the stored tuples, wherein each aggregate represents a result for a prospective query characterized by the set of search query parameters represented in the tuple associated with that aggregate.
  • In exemplary embodiments, a non-transitory computer readable medium may store processor executable instructions for performing methods described herein. For example, the computer readable medium may store processor executable instructions for processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes. In other embodiments, the computer readable medium may store processor executable instructions for establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows. In yet other embodiments, the computer readable medium may store processor executable instructions for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
  • The foregoing and other objects, aspects, features and advantages of exemplary embodiments will be more fully understood from the following description when read together with the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 depicts an exemplary data stream, according to the present disclosure.
  • FIG. 2 depicts an exemplary query, according to the present disclosure.
  • FIG. 3 depicts an exemplary data cube, according to the present disclosure.
  • FIG. 4 depicts an exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure.
  • FIG. 5 depicts another exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure.
  • FIGS. 6 a-g depict a sequence of events for a worked example using the distributed architecture of FIG. 5, according to the present disclosure.
  • FIGS. 7 a-c depict flowcharts for exemplary methods for performing real time analytics on streaming data, according to the present disclosure.
  • FIGS. 8 a-b depict overlaying real-time social statistics on a geographic map.
  • FIG. 9 depicts an exemplary computing device for implementing embodiments of the present disclosure.
  • FIG. 10 depicts an exemplary network environment for implementing a distributed architecture, according to the present disclosure.
  • DETAILED DESCRIPTION
  • Storage media, systems and methods are disclosed herein for analyzing data streams in real time and/or pre-computing statistics in real time with no query time computation. More particularly, storage media, systems and methods are presented for processing data streams to calculate results for prospective queries. The results may be advantageously computed prior to the formulation of the specific query, for example, based on a pre-established framework of potential query parameters. More particularly, a universe of potential queries may be extrapolated from the pre-established framework of potential query parameters. Results for each of the potential queries may them be tracked in real time. For example results for each of the potential queries may be continuously updated based on real-time processing of events in a data stream.
  • Note that, as used herein the term event generally refers to an atomic unit in a data stream, for example, a single tweet™ in a Twitter™ feed or a single purchasing transaction in a transaction stream. In exemplary embodiments, a data stream may include a continuous flow of data that is not pre-divided into discrete events. Thus, in some embodiments, an event may be inferred, for example, by identifying a set of one or more related attributes in the data stream. For example, related attributes may be identified based on temporal and/or source commonalities. In some embodiments, a contextual analysis (for example, a semantic analysis) of attributes in a data steam may be used to identify a set of one or more related attributes. Exemplary embodiments of semantic analysis, for example using a doctagger to identify and/or group topics, are is described herein
  • It is appreciated that, although exemplary embodiments presented herein relate to social analytics, the storage media, systems and methods of the present disclosure may be used for real-time analysis of any type of streaming data, structured or unstructured. For example, the storage media, systems and methods of the present disclosure may be used for real-time analysis of purchase transactions, customer reviews/feedback, customer wish lists/shopping carts, etc.
  • In exemplary embodiments, prospective queries of data streams may include queries related to social statistics. For example:
      • How many events relate to product P between 10 a.m. and 11 a.m. today?
      • What percentage of events had a positive opinion of product P yesterday?
      • How do women in Arizona feel about a limited time offer, based on events during the offer?
      • What is the average age of women in Arizona who purchased product P last year?
  • The result for each of the above queries may be calculated as an aggregate of a value attribute (number of events, percentage of events, sentiment, and average age, respectively) over a specified time window (between 10 a.m. and 11 a.m. today, yesterday, during the limited time offer, and last year, respectively) for a specified set of dimension attributes (related to product P, positive sentiment related to product P, related to the limited time offer from women in Arizona, and related to women in Arizona who purchased product P, respectively). The storage media, systems and methods of the present disclosure advantageously facilitate identifying and maintaining time-based aggregates, such as described above, prior to the formulation of a queries.
  • In exemplary embodiments, a framework of potential query parameters may be pre-established by selecting, for example, via a user input, attributes of interest including one or more time windows, one or more dimension attributes and one or more value attributes. The framework may further include, for each of the one or more value attributes, an aggregate function defining how to aggregate instances of the value attribute. As used herein, the term dimension attribute refers to an identifiable attribute in a data stream which is of interest as pertaining to a query search parameter. By comparison, the term value attribute refers to an identifiable attribute in a data stream of interest which is of interest as pertaining to a query result parameter. Notably, depending on the particular framework of potential query parameters, a same attribute may be both a dimension attribute and a value attribute. For example, the attribute “sentiment” may be used as both a search parameter (such as, in the query “how many women have a positive opinion about product P?”) and a result parameter (such as in the query “what is the sentiment of women regarding product P?”).
  • In exemplary embodiments a pre-established framework of query parameters may be used to generate a multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes; an aggregate of each of the one or more value attributes over each of the one or more time windows. As used herein the term tuple may refer to a set of dimension attribute-value pairs. For example, for exemplary dimension attributes Person (P) Location (L) and Thing (T), a tuple may take the form P=p, L =1, and T=t, (also expressed as the tuple (p, l, t) for dimensions (P, L, T)).
  • In exemplary embodiments, the aggregates stored in the multi-dimensional data structure may be updated for each new event processed from a data stream. In particular, the event may be analyzed to extract a set of related attribute-value pairs including for one or more dimension attributes and one or more value attributes. The extracted set of related attributes-value pairs may then be used to identify or more implicated tuples in the data structure for updating. Thus, aggregates associated with each of the implicated tuples for each of the one or more value attributes, may be updated by applying an appropriate aggregation function to each identified value attribute-value pair. In this way the multi-dimensional data structure may maintain real-time analytics of the data stream.
  • In exemplary embodiments, a distributed architecture, such as Muppet (map, update), may be used to implement the storage media, systems and methods of the present disclosure. Exemplary implementations of Muppet are further described herein as well as in U.S. non-provisional patent application entitled “Processing Data Feeds,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50302). In general, a distributed architecture may be used to map an event to one or more implicated tuples in one or more multi-dimensional data structures and update, for each of the implicated tuples, one or more slates, for example based on one or more value attribute-value pairs in the event. Advantageously, slates for a plurality of implicated tuples may be updated in parallel for example, using different processing nodes.
  • The terms “map” and “mapper,” as used herein, relate to a stream operation performed in exemplary embodiments in which events in a data stream are processed in a real-time manner to generate one or more new events which are then published to a same or different data stream. In exemplary embodiments, a mapper may be used to publish events to one or more updaters for updating an aggregate value contained in a slates.
  • The terms “update” and “updater” refer to a stream operation performed in exemplary embodiments in which events in one or more real-time data streams are processed in a real-time manner to create or update one or more persistent static “slate” data structures that are stored in a persistent manner, for example, in a durable disk storage (note that, as used herein the terms “store,” “stored” “storage” etc., imply persistence a non-transitory storage medium). In some exemplary embodiments, an updater may generate zero, one or more new stream events. The generated stream events may be published to one or more real-time data streams. In an exemplary embodiments, an updater may publish stream events to a data stream from which it accepts stream events as input.
  • As used herein, the term “slate” refers to a static data structure that may be used to record aggregates as described herein. A slate may have any suitable data structure or format. In an exemplary format, a slate may include a collection of one or more attribute-value pairs. A slate may be stored corresponding to a unique slatekey and updater that updates the slate.
  • In exemplary embodiments, time-based aggregates of a given value attribute over a given time-window may be maintained in a multi-dimensional data structure (sometimes referred to herein as a data cube). The dimensions of the data structure generally reflect one or more dimension attributes selected in a pre-established framework of potential query parameters.
  • In a naïve implementation of a distributed architecture for maintaining the data structure, fan-out may exponentially relate to the number of dimensions in the data structure. The term “fan-out” for a distributed architecture may refer to the ratio of internal events generated by the mapping function relative to the number of external events (e.g., tweets™) processed.
  • Since, handling and storing such a volume of data may prove impractical, alternative implementations of a distributed architecture are also presented herein that take advantage of various properties of data streams to considerably reduce fan-out to a manageable number.
  • With initial reference to FIG. 1, an exemplary event 100 in a data stream 10 is depicted. Event 100 may be processed to identify a plurality of attribute-value pairs 110 contained therin. Examples of attributes may include:
      • Event ID, for example, a unique per-event identifier.
      • Sentiment, for example, with potential values of +1, 0, or −1, indicating positive, neutral, or negative sentiment.
      • Gender, for example, with potential values of Male, Female, and Unknown.
      • Country, for example, with potential values drawn from an enumerated set of country codes including unknown.
      • Topic, for example, with potential values detected via semantic analysis.
      • Product, for example, with potential values drawn from a product database.
      • Price, for example, with potential values in different currencies.
      • Timestamp, for example, based on time published or time received.
  • One usefull attribute, accordingly to the present disclosure, is the timestamp. In exemplary embodiments, two assumptions may be made regarding the timestamp: first, that the timestamp represents actual wall-clock time in some appropriate timezone; and second, that timestamps are monontonically increasing. These assumptions are generally reasonable for streaming data (for example, in Twitter™ each tweet™ contains a timestamp that satisfies these conditions). One reason that timestamps are useful is that query results are represented as aggregates over a time window. Using timestamps, aggregates may be unambiguously interpreted to include a set of events whose timestamps fall within the specified time window.
  • With reference to FIG. 2, an exemplary query 200 is depicted. Query 200 may specify, for example, a time window 210, a set of dimension attribute-value pairs 220, one or more value attributes of interest 230, and an aggregate function 240 related to each value attribute 230. Exemplary instances of query 200 are described below:
  • Example 1
  • Query: How many people posted about product P between 10 am and 11 am today?
    Time Window: 10 am to 11 am today
  • Dimension Attribute-Value Pair: Product=P
  • Value attribute: Event Id
  • Aggregate Function: Count
  • Example 1 may be rewritten as the following SQL query:
  • SELECT COUNT(EventId)
  • FROM event E
    WHERE t.product=“P”
  • AND t.timestamp>=10 am AND t.timestamp<=11 am
  • Example 2
  • How many women in Arizona posted about product P between 10 am and 11 am?
    Same as last example except for additional Dimension Attribute-Value Pairs:
  • Gender=“F”, State=“AZ” Example 3
  • What was the sentiment about product P among women in Arizona in December 2010?
  • Time Window: December 2010 Dimension Attribute-Value Pairs: Product=“P”, Gender=“F”, State=“AZ” Value Attribute: Sentiment Aggregate Function: AggregateSentiment
  • In the third query example, AggregateSentiment represents a custom-defined aggregation function for combining sentiment values. For example, the function may maintain a 3-tuple of count, one each for positive, neutral, and negative sentiments. Alternatively, the function may be configured to calculate an average sentiment.
  • Time Windows:
  • In exemplary embodiments, aggregates may be calculated for each of a plurality of time windows. In some embodiments arbitrary time windows may be supplanted by a standard set of time windows, for example:
      • By the minute for the past 60 minutes
      • By the hour for the past 24 hours
      • By the week for the past 4 weeks
      • By the month for the past 24 months (for y-o-y same month comparisons)
      • By the year for the past 10 years
  • In general, the standard set of time windows may reflect an assumption that the further back time the coarser the time granularity of interest. Thus, the standard set of time windows may include time windows of varying time granularity. In exemplary embodiments, it may be sufficient to maintain aggregates for a finite number of progressively older and courser sets of time windows, such reflected above.
  • Aggregate Functions:
  • In general there are two kinds of aggregate functions: Algebraic and Holistic. Roughly speaking, algebraic aggregates are those, like SUM, that can be computed incrementally; in other words, by aggregating subsets of the data, and computing the final result using those aggregates without going back to the base data. In contrast, Holistic aggregates typically require the base data when recalculating the aggregate. One example of a holistic aggregate is the median. Suppose you divide a data set arbitrarily into two parts, and compute the median of the two parts; there is no way to compute the median of the entire data set from the medians of the two parts.
  • The systems and method of the subject disclosure typically utilize algebraic aggregation functions, which may be computed incrementally. Thus, for example, an average may be represented as a 2-tuple a sum and a count, wherein the average may be calculated by dividing the sum by the count. The use of algebraic aggregations functions, advantageously simplifies the update process, thereby facilitating the real-time data processing and analytics as described herein. In exemplary embodiments aggregates are assumed to be commutative and associative, which makes manipulations thereof simpler.
  • Data Cubes:
  • As noted above, the storage media, systems and methods of the present disclosure may advantageously utilize a multi-dimensional data structure for maintaining a universe of aggregates for a pre-established framework of potential query parameters, wherein the pre-established framework of potential query parameters is characterized, by one or more time windows, one or more dimension attributes, one or more value attributes and one or more aggregation functions.
  • Suppose, for example, a framework characterized by time window W, dimension attributes Topic (T), State (S), Gender (G), a single value attribute Sentiment (Se), and an aggregate function ƒ. The following are the aggregates may be of interest for instances t, s and g of T, S and G:
  • (1) T=t, S=s, G=g, f(Se), W
    (2) T=t, S=s, f(Se), W
    (3) T=t, G=g, f(Se), W
    (4) S=s, G=g, f(Se), W
    (5) T=t, f(Se), W
    (6) S=s, f(Se), W
    (7) G=g, f(Se), W
    (8) All, f(Se), W (computed across all events in time window W)
  • The multi-dimensional data structure for storing aggregates f(Se), W for all potential combinations and values of t, s, and g, may be referred to as the data cube for aggregate f(Se) and time window W. The term data cube refers to the fact the aggregates may be arranged as the vertices of a hypercube. In general, given K dimension attributes, there are 2K aggregates for each set of values of the dimension attributes, which is the number of vertices of a hypercube in K dimensions.
  • As describe herein, a set of dimension attribute-value pairs, such as T=t, G=g, may be referred to as a tuple. Referring to the above example, the data cube for the aggregate f(Se) allows rapid lookup of the aggregate of value attribute Se for every tuple over timewindow W. In exemplary embodiments, a data cube may store aggregates for a plurality of different time windows, for example for a standard set of time windows such as described herein.
  • With reference to FIG. 3, an exemplary data cube 300 is depicted for three dimension attributes: Topic (T) 310, Location (L) 320 and Sentiment (S) 330. Topic (T) 310 may include as instances, names for various topics, for example, as grouped via a semantic hierarchy. Location (L) 320 may include as instances, names of locations, for example names of States in the united States. Sentiment (S) 330 may include instances selected from positive negative or neutral. The dimension attributes Topic (T) 310, Location (L) 320 and Sentiment (S) 33 may be reflected along the vertices of the data cube 300.
  • The data cube 300 may maintain for each tuple (t, 1, s) of T, L, S an aggregate for a value attribute (in this case: event count, i.e., the number of events processed for the tuple (t, l, s)). In exemplary embodiments, each tuple (t, l, s) may be associated with a slate for storing the event count. In some embodiments the slate may further be associated with an updater for updating the slate for a new event and a mapper for mapping new events to the updater.
  • In exemplary embodiments, data cube 300 is updated based on a new event. Thus, a new event may be processed to identify one or more dimension attributes therein. For example, a new event may state “I love living in NYC,” from which dimension attributes Person (P), Location (L) and Sentiment (S) may be extracted (for example P=user L=NYC (New York City), and S=positive may be extracted. The tuples of (P, L, S) implicated are as follows:
  • (user, NYC, positive)
  • (user, NYC,)
  • (user, , positive)
  • (, NYC, positive)
  • (user,)
  • (, NYC,)
  • (, positive)
  • (, ,)
  • Thus, overlapping tuples of (T, L, S) implicated by the new event are:
  • (, NYC, positive)
  • (, NYC,)
  • (, , positive) and
  • (, ,)
  • Thus, the event count associated with each of the four implicated tuples may be updated (for example, by incrementing the count by one).
  • Note that for an event containing L dimension attributes, M of which overlap with K dimension attributes of a data cube, there are 2M tuples of the data cube which are implicated (i.e. 2M tuples which overlap between the data cube and the event). Thus, a mapper may be used generate the 2L tuples for the event and map the 2M subset thereof to the data cube. An update may then be used update a slate associated with each of the 2M tuples received from the mapper. Distributed architecture implementations for maintaining a data cube are described in greater below
  • Data cube 300 advantageously maintains, in real time, results for any prospective query using the framework (T, L, S), of an event count over one or more pre-established time windows. FIG. 3, illustrates three examples of queries 340 a-c, the answers to which are maintained and therefore pre-computed in data cube 300. For example, query 340 a asks “how many people are posting about Barack Obama in New York?” The result to query 340 a may be obtained by returning the event count for the tuple (Barack Obama, New York, All), for example the event count stored in the slate associated with the tuple (Barack Obama, New York, All). As another example, query 340 b asks “How many people in Arizona feel positive of the new Medicare plan?” The result to query 340 b be obtained by returning the event count for the tuple (Medicare, Arizona, Positive). As another example, query 340 c asks “How many people feel negative of Barack Obama across the US?” The result to query 340 c may obtained by returning the event count for the tuple (Barrack Obama, United States (e.g., all states), Negative).
  • In exemplary embodiments, it is contemplated that the number of the dimension of a data cube may be automatically determined based on the types of attributes reflected in the data stream. For example, event types with the greatest frequencies, such as above a selected threshold, may be used as the dimensions for the data cube. Thus, for example the data stream may be analyzed to determine the best candidate attributes for cube dimensions.
  • Naïve Distributed Architecture Implementation:
  • Referring to FIG. 4, an exemplary implementation of a distributed architecture 400 for maintaining a data cube may include a mapper, for example, CubeMapper 410, and one or more updaters, for example, CubeTupleUpdaters 420. The mapper and/or updaters may be distributed to one or more processing nodes in the distributed architecture, e.g., via a network architecture such as described herein, for example with reference to FIG. 10.
  • The CubeMapper 410 may advantageously determine, for example based on a set of attribute-value pairs extracted for an event, which data cubes to maintain. In exemplary embodiments, a data cube may be defined by a set of dimension attributes (for example, Topic, Gender, State), and an aggregation function for a value attribute. In exemplary embodiments, a configuration file may list the data cubes of interest, and give each data cube a name. The aggregation function may be specified, for example, in javascript, as a function that takes two parameters (the current value of the aggregate and a new event) and returns a single value (the new value of the aggregate). Note that as a special case, the current value may be null, in which case the function may return the aggregate corresponding to just the one event. The value of the aggregate may be in any data structure/format, for example a JSON object.
  • In some embodiments, a generated data cube may apply only to specific kinds of topics. For example, the query framework may call for a data cube specific for persons of interest, such as customers, celebrities, etc., or for occasions such as holidays, the Oscars, etc. Thus, in exemplary embodiments the mapper may implement a selection function, based on selection criterion, to filter out only a subset of events from a data stream. The selection function may, for example, accept a single parameter (the event) and return either True (this event's data should be part of this data cube) or False.
  • An example of a configuration file listing data cube's of interest is provided below:
  • Cubes.congfig:
  • CubeName: SentimentCube
  • Select Function: True ##all events
  • Dimensions: Topic, State, Gender
  • AggregateFunction: lambda(sentiment, event) { . . . ; return sentiment}
  • CubeName: OscarsVotes
  • SelectFunction: lambda(event) {return event.event_id=oscars;}
  • Dimensions: Topic, Gender, Age
  • AggregateFunction: lambda(sentiment, event) { . . . ; return sentiment}
  • The CubeConfig file may advantageously be replicated for reference at each processing node in the distributed architecture.
  • The CubeMapper 410 may processes each event in a data stream and determine which data cubes it is eligible for. Suppose an event E with K dimension elements (for example, K=2 dimension elements a and b depicted in FIG. 3) is eligible for a data cube. The CubeMapper 410 constructs the 2K tuples from the event E (for example, the 22 tuples: (a), (b), (a,b) and (All) depicted in FIG. 3) and generates an event E1 for each tuple. The key for each event E1 is the pair (CubeName, Tuple) and the value is the content event E, for example, including a value attribute-value pair. The generated events E1 are sent on to the CubeTupleUpdaters 420.
  • Each CubeTupleUpdater 420 maintains a slate for every (CubeName, Tuple) pair it receives. Thus, the CubeTupleUpdater receives an event E1 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store. In exemplary embodiments, The updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein. Thus, a query for any tuple and any Time Window may be answered via a quick slate lookup.
  • Alternative Distributed Architecture Implementations:
  • There are two potential problems with the naïve implementation described above. First, for a data cube of K dimensions, the CubeMapper 410 may generate 2K events for each incoming event. This leads to very high fan-out for cubes with K>3. Second since, statistics are stored for every possible tuple that occurs in the data, The number of cube slates required is proportional to the number of possible combinations of dimension values that actually occurs in the data. For example, suppose there are 50 states, 2 genders, and 1 million topics. The number of slates needed may approach 50×2×1 million or 100 million. This may be impractical from a storage perspective. Exemplary alternate implementations of a distributed architecture present herein may help mitigate/prevent such potential problems.
  • A key property of data stream analytics is that events don't exist in a vacuum but rather often reflect and are influenced by a collective pulse. Thus, events often exhibit a great deal of clustering, for example, of topics, products, people, etc. Moreover, it is expected that queries to the data cube involve instances of dimension elements that are of interest to a large number of people. Taking advantage of the forgoing assumptions, tuples may be advantageously filtered based on frequency. Frequency filtering may be implemented, for example, by selecting a small time window (for example 1 minute) referred to as the delta window D. Let S be the set of all tuples corresponding to all social updates during a delta window D. A threshold δ is applied to filter out all tuples in S with frequency less than δ from being sent to updaters (for example, CubeTupleUpdaters 320 of FIG. 3). For example, δ may be selected to be 1 or 2. A simple experiment with actual data suggests that setting δ to 2 may eliminate over 90% of tuples in a delta window of 1 minute. Moreover, a qualitative examination of such tuples indicated that the eliminated tuples generally came from events that are not really of interest (for example, spam, outliers of some sort, or just semantic analysis errors). Thus, filtering also a second effect of reducing noise, e.g., from semantic analysis errors. On the other hand, tuples that do occur frequently in a 1-minute window often correlate well with the global interests. Frequency filtering of tuples thus both improves performance and improves the quality of the data cube.
  • FIG. 5 depicts an exemplary distributed architecture 500 for maintaining a data cube while implementing frequency filtering. Thus, the distributed architecture 500 may include a mapper (CubeS elector 510) and 3 types of updaters (CubeTupleGenerators 520, the CubeTupleCollectors 530, and the CubeTupleUpdaters 540).
  • Suppose T is a current timestamp. Interval I may then be defined as follows: I=floor(T/D), where D is the delta window (e.g., 1 minute). That is, the interval I counts time in units of the Delta Window. The CubeTupleGenerators 520 buffer all tuples with the same Interval, and then dispatch them to the CubeTupleCollectors 530.
  • In exemplary embodiments, an assumption may be made that the stream has a large number of events in each Delta Window—that is, the Delta Window is very large compared to the average gap between events (for example, Twitter™, processes approximately 100,000 tweets™ in a Delta Window of 1 minute resulting in an average inter-event gap of less than a millisecond). Thus, in exemplary embodiments, one may detect when an interval has ended and the next one has begun based on a processing of the first event whose Interval is higher than the current Interval. Alternatively, intervals may be tracked independent of events received.
  • In some embodiments, intervals may be based on e.g., a requisite number of events received rather than a time frame (for example, provided that the integration of such batches over the one or more time windows being tracked results in an acceptable margin of error). In other embodiments, the threshold δ may be variable, e.g., based on a changing frequency of events in the data stream. In some embodiments, a different threshold may be applied depending on the number of dimension attributes-value pairs in the tuple. Thus, frequency filtering may account for variations in event traffic when filtering tuples.
  • With reference again to FIG. 5, for each data cube, there may be P CubeTupleGenerator slates and P CubeTupleCollector slates, with keys 1, . . . , P. So as to ensure uniform distribution across the cluster, P may be selected to be a small multiple of the number of processing nodes in the distributed architecture 500. For example, if the distributed architecture 500 includes 8 processing nodes, one might select P=32 or 64.
  • The CubeSelector 510 is similar to the CubeMapper 210 in the naïve implementation. It uses the config file to determine if the current Event E is eligible for a data cube. If it is, the CubeSelector 510 uses a hash function to map the timestamp of the event to one of P values, and emits to stream X an event E1 whose key is (CubeName, P) and value is the content of the event E.
  • Each CubeTupleGenerator 520 (CTG) subscribes to stream X. A CTG generates all possible tuples from an event E1 and maintains a table with 3 columns: Tuple, AggregateValue, and Count. Here Count is the number of events that have contributed to this tuple. The CTG also stores the Interval I of the first event it received during this Delta Window. When the CTG receives a new event E1, it computes its Interval J. If J=I, the CTG enumerates the tuples of E1 and updates its table accordingly. On the other hand, if J>I, the CTG starts distributing its data as follows:
  • For each tuple, the CTG uses a hash function to map the tuple into one of P buckets 0,1 . . . , P−1. For each bucket in 1, . . . P−1, the CTG creates and emits to stream Y an event E2 whose key is the bucket number, and whose value is the set of rows in its table whose tuple maps to that bucket number. The value also indicates the Interval I. Once the CTG has sent out all P events, it empties its tables and stores J as the new Interval value. The CTG then proceeds to process a newly received event E1.
  • Each CubeTupleCollector 530 (CTC) subscribes to stream Y. For each interval I, a CTC slate receives P events , one from each CTG slate. The CTC maintains a table with 3 columns: Tuple, AggregateValue, Count. The AggregateValue for a tuple is the aggregate of the partial AggregateValues for that tuple from each CTG, and the Count is the sum of the corresponding partial counts. The CTC also keeps track of the current interval I and a message counter M, both initialized to 0. When the CTC receives an event E2 from the CTG with interval J, it first compares I and J:
  • If I=J, the CTC table is updated using the tuples in the event, and M is incremented by 1. If M=P (i.e., the CTC has received events from all the P CTCs), the CTC knows it has received all the tuples for the interval I. At this point, the CTG discards all infrequent tuples with count less than the threshold δ. For every frequent tuple, it creates a new event E3 whose key is the pair (CubeName, Tuple) and whose value contains the Aggregate and the Count. The event E3 is published to stream Z, to which the CubeTupleUpdaters 540 subscribe. The CTC then empties its table, sets its current interval to I+1 and resets M to 0.
  • If J>I, this means is that the CTC has received tuples for the next interval before it receives all the tuples for the current interval. Assuming that the aforementioned assumption about the Delta Interval being much larger than the average interval between events holds true, the only real reason for this is a node failure, which should be relatively infrequent. In this case, the interval I is declared closed, the tuples accumulated thus far are filtered. I is set to be equal to J, M to 1, and process the newly arrived event.
  • Finally, If J<I, a delayed event for an interval that is past was received. This event is ignored.
  • Each CubeTupleUpdater 540 maintains a slate for every (CubeName, Tuple) pair it receives. Thus, the CubeTupleUpdater receives an event E3 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store. In exemplary embodiments, The updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein. Thus, a query for any tuple and any Time Window may be answered via a quick slate lookup.
  • With reference to FIGS. 6 a-g an exemplary worked example of the distributed architecture 500 is depicted, demonstrating exemplary contents of CTG slates 620 and CTC slates 630 of a CTG and CTC such as described above with reference to FIG. 5. For the worked example, depicted , a data cube C with two dimensions (A and X) is assumed. The aggregate is SUM and the filtering threshold δ=1. During the Delta Window, 4 tuples arrive. In FIG. 6 a, an event E including attribute-value pairs A=a X=x and value attribute V=5 is received by the Cube Selecter 610. In FIG. 6 b, an event E including attribute-value pairs B=b Y=y and value attribute V=3 is received by the Cube Selecter 610. In FIG. 6 c, an event E including attribute-value pairs A=a Y=y and value attribute V=2 is received by the Cube Selecter 610. In FIG. 6 d an event E including attribute-value pairs A=a Y=y and value attribute V=4 is received by the Cube Selector 610. In each case, Cube Selecter 610 uses the config file to determine if the Event E is eligible for a data cube, for example data cube C. If it is, the CubeSelector 610 uses a hash function to map the timestamp of the event to one of P Values and emits an event E1 whose key is (CubeName, P) and value is the content of the event E. As depicted in FIGS. 6 a -d CTG slates 620 buffer all tuples with the same interval. More particularly the CTG slates 620 maintains a table with 3 columns: Tuple, AggregateValue, and Count. With reference to FIG. 6 e, once the interval is closed, a hash function to map each from the CTG slates 620 tuple into one of P buckets 0,1 . . . , P−1, for example buckets 625 of FIG. 6 e. For each bucket in 1,. . . P−1, the creates and emits an event E2 whose key is the bucket number, and whose value is the set of rows in its table whose tuple maps to that bucket number. With reference to FIGS. 6 f and 6 g, for each interval I, CTC slates 630 receive P events, one from each CTG slate 620. The CTC slates 630 maintains a table with 3 columns: Tuple, AgregateValue, Count. As depicted, once the CTG slate has sent out its events E2, it empties its tables and begins processing events E1 for a new interval. As depicted in Figure g6, once the CTC slates 630 have received all the tuples for a given interval all infrequent tuples, for example with count less than the threshold δ are discarded. Then for every frequent tuple, a new event E3 is created whose key is the pair (CubeName, Tuple) and whose value contains the Aggregate and the Count. The event E3 is published to stream Z subscribed to by CubeTupleUpdaters. Data cube C could then be updated for eligible events E3.
  • Leaderboard Queries:
  • Exemplary implementations presented above, primarily related to point aggregates—wherein a tuple (a point in the hypercube) is specified and the desired result of the query is the corresponding aggregate value. The storage media, systems and methods herein, however, are not limited to such implementations. Indeed in some embodiments, the data cube is used tracks results for different type of queries for, example, leaderboard queries.
  • A leaderboard query specifies a tuple and a leaderboard attribute, and asks for the Top-N values of the leader board attribute among events that satisfy the tuple. For example, a leaderboard query might pose the following question:
  • What were the Top 10 most popular topics posted about by people in Arizona between 10 am and 11 am today?
  • In SQL:
  • SELECT topic, count(eventid) as freq
    FROM events e
    WHERE t.state=“AZ” and t.timestamp >=“10 am” AND t.timestamp<“11 am”
    GROUP BY topic
    ORDER BY freq
  • LIMIT 10
  • Other exemplary leaderboard queries might pose the following question:
    What were the Top 10 least popular topics posted about by people in Arizona between 10 am and 11 am today?
    What were the Top 10 locations closest to New York where users posted about a sales event.
  • The data cube may easily be extended to support leaderboard queries. In the simple case, if the leaderboard attribute has a small and known set of values (e.g., state or gender), one can simply look up the slates corresponding to all possible values and pick the Top-N. The harder case occurs when the leaderboard attribute can take on a large number of values (or example, topic, product, URL, domain). In this case we cannot possibly examine all slates for all values of the attribute.
  • In exemplary embodiments, leaderboard queries are indicated by adding a line to the data cube config file to indicate the leaderboard dimensions for each cube and the value for “N” (for Top-N queries; e.g., 10). For example:
  • CubeName: SentimentCube
  • Select Function: True ##all events
  • Dimensions: Topic, State, Gender Leaderboards: Topic(10)
  • AggregateFunction: lambda(sentiment, event) { . . . ; return sentiment}
  • To simplify the present description, assume there is at most one leaderboard dimension in each data cube (ss would be appreciated by one of ordinary skill in the art, it is straightforward to support more given the description for one). Non-leaderboard dimensions of the data cube may be referred to as ordinary dimensions.
  • Recall that in the distribution step of the CTG a tuple is hashed into one of P buckets. For the purposes of a leaderboard query the hash function is applied only to the ordinary dimensions of each tuple. This ensures all tuples of the form: A=a, B=b, C=*, where C is the leaderboard attribute and A and B are ordinary attributes, end up at the same slate of the CTC. The CTC can now compute the Top-N leaderboard values for each tuple, and pass them on to the CTU in the event it constructs for the tuple. For example, the Top-N values may be characterized as resulting in the highest aggregates, the lowest aggregates, and the aggregates closest to a selected value, etc.
  • Thus, the CTU may maintain the leaderboard for each time window. Where a time window is used for filtering by the CTC, the CTU may get the leaderboard directly from the CTC. Leaderboards for larger time windows are approximated using those for the smaller time windows. For example, an hourly leaderboard may be computed by looking at the Top 10 for each minute of the hour, and summing the minute-by-minute values. While it is possible to make some errors using this method (for example, where the most frequent item during the hour was not in the top 10 for any minute of the hour) accurate results are generally achieved.
  • K-grams
  • One potential disadvantage in querying using dimensions requiring semantic analysis, is that the semantic analysis is limited by the available hierarchy. For example, a semantic analysis engine can only tag instances it recognizes. An interesting advantage the storage media, systems and methods described herein is that the data cube may be used to maintain statistics on topics products, locations etc. (collectively, referred to as interesting topics or topics of interest) without the help of semantic analysis. Thus in exemplary embodiments a “K-Gram” is added as one of the data cube dimensions, with a relative high threshold 6 (e.g., 50). Thus, a slate is automatically created and maintained for any k-gram that was mentioned very often in a delta time window. In exemplary embodiments, a TF.IDF threshold may be instead of a pure frequency threshold to achieve better results. Thus, the data cube may be used to identify trending k-grams and automatically maintains stats for them without help from semantic analysis. In exemplary embodiments identified topics of interest can be communicated to the semantic analysis engine and “candidate topics” and the maintained statistics relating thereto can be used by the engine to whether to incorporate each candidate topic into the taxonomy.
  • Exemplary Methods
  • FIGS. 7 a-c, illustrate exemplary methods according to the present disclosure.
  • With reference to FIG. 7 a, an exemplary method 710 is depicted for performing real time analytics on streaming data. Method 710 generally includes steps of: (712) processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; (714) identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and (716) for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
  • With reference to FIG. 7 b, an exemplary method 720 is depicted for performing real time analytics on streaming data. Method 720 generally includes steps of (722) establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and (724) generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
  • With reference to FIG. 7 c an exemplary method 730 is depicted for performing real time analytics on streaming data. Method 730 generally includes steps of: (732) processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and (734) returning one of the stored aggregates in response to a query.
  • It is explicitly contemplated that the storage media, systems and methods presented herein may include one or more programmable processing units having associated therewith executable instructions held on one or more computer readable medium, RAM, ROM, hard drive, and/or hardware. In exemplary embodiments, the hardware, firmware and/or executable code may be provided, for example, as upgrade module(s) for use in conjunction with existing infrastructure (for example, existing devices/processing units). Hardware may, for example, include components and/or logic circuitry for executing the embodiments taught herein as a computing process.
  • Displays and/or other feedback means may also be included to convey detected/processed data, for example adjusted output representative of a particle characteristic. The display and/or other feedback means may be stand-alone or may be included as one or more components/modules of the processing unit(s). In exemplary embodiments, the display and/or other feedback means may be used to facilitate querying a data cube. In other embodiments, the display may be used to visualize, in real-time, various social statistics maintained by the data cube. For example, as depicted in FIGS. 8 a and 8 b, real-time social statistics may be overlaid on a geographic map.
  • The actual software code or control hardware which may be used to implement some of the present embodiments is not intended to limit the scope of such embodiments. For example, certain aspects of the embodiments described herein may be implemented in code using any suitable programming language type such as, for example, assembly code, C, C# or C++ using, for example, conventional or object-oriented programming techniques. Such code is stored or held on any type of suitable non-transitory computer-readable medium or media such as, for example, a magnetic or optical storage medium.
  • As used herein, a “processor,” “processing unit,” “computer” or “computer system” may be, for example, a wireless or wire line variety of a microcomputer, minicomputer, server, mainframe, laptop, personal data assistant (PDA), wireless e-mail device (for example, “BlackBerry,” “Android” or “Apple,” trade-designated devices), cellular phone, pager, processor, fax machine, scanner, or any other programmable device configured to transmit and receive data over a network. Computer systems disclosed herein may include memory for storing certain software applications used in obtaining, processing and communicating data. It can be appreciated that such memory may be internal or external to the disclosed embodiments. The memory may also include non-transitory storage medium for storing software, including a hard disk, an optical disk, floppy disk, ROM (read only memory), RAM (random access memory), PROM (programmable ROM), EEPROM (electrically erasable PROM), flash memory storage devices, or the like.
  • FIG. 9 depicts a block diagram representing an exemplary computing device 900 that may be used as a processing node (also referred to as a worker node) for aggregating and/or storing data as described herein, for example a processing node in a distributed architecture as described herein. The computing device 900 may be any computer system, such as a workstation, desktop computer, server, laptop, handheld computer, tablet computer (e.g., the iPad™ tablet computer), mobile computing or communication device (e.g., the iPhone™ mobile communication device, the Android™ mobile communication device, and the like), or other form of computing or telecommunications device that is capable of communication and that has sufficient processor power and memory capacity to perform the operations described herein. A distributed computational system may be provided comprising a plurality of such computing devices.
  • The computing device 900 includes one or more non-transitory computer-readable media having encoded thereon one or more computer-executable instructions or software for implementing exemplary methods described herein. The non-transitory computer-readable media may include, but are not limited to, one or more types of hardware memory, non-transitory tangible media (for example, one or more magnetic storage disks, one or more optical disks, one or more USB flash drives), and the like. For example, memory 906 included in the computing device 900 may store computer-readable and computer-executable instructions or software for implementing exemplary embodiments. The computing device 900 also includes processor 902 and associated core 904, and in some embodiments, one or more additional processor(s) 902′ and associated core(s) 904′ (for example, in the case of computer systems having multiple processors/cores), for executing computer-readable and computer-executable instructions or software stored in the memory 906 and other programs for controlling system hardware. Processor 902 and processor(s) 902′ may each be a single core processor or multiple core (904 and 904′) processor.
  • Virtualization may be employed in the computing device 900 so that infrastructure and resources in the computing device may be shared dynamically. A virtual machine 914 may be provided to handle a process running on multiple processors so that the process appears to be using only one computing resource rather than multiple computing resources. Multiple virtual machines may also be used with one processor.
  • Memory 906 may include a computer system memory or random access memory, such as DRAM, SRAM, EDO RAM, and the like. Memory 906 may include other types of memory as well, or combinations thereof. Memory 906 may be used to store one or more slates on a temporary basis, for example, in cache.
  • A user may interact with the computing device 900 through a visual display device 918, such as a screen or monitor, that may display one or more interfaces 920 that may be provided in accordance with exemplary embodiments. The visual display device 918 may also display other aspects, elements and/or information or data associated with exemplary embodiments. The computing device 900 may include other I/O devices for receiving input from a user, for example, a keyboard or any suitable multi-point touch interface 908, a pointing device 910 (e.g., a mouse, a user's finger interfacing directly with a display device, etc.). The keyboard 908 and the pointing device 910 may be coupled to the visual display device 918. The computing device 900 may include other suitable conventional I/O peripherals. In exemplary embodiments, the one or more of the interfaces 920 includes an application program interface (API).
  • The computing device 900 may include one or more audio input devices 924, such as one or more microphones, that may be used by a user to provide one or more audio input streams.
  • The computing device 900 may include one or more non-transitory storage devices 924, such as a durable disk storage (which may include any suitable optical or magnetic durable storage device, e.g., RAM, ROM, Flash, USB drive, or other semiconductor-based storage medium), a hard-drive, CD-ROM, or other non-transitory computer readable media, for storing data and computer-readable instructions and/or software that implement exemplary embodiments as taught herein. For example, the storage device 924 may provide a slate storage 926 for storing computer-executable instructions for implementing the social genome data structure as described herein, for example for storing an updating (via one or more updaters) one or more slates, as described herein. The storage device 924 may store one or more map modules 932 and one or more update modules 934, as described herein. The storage device 924 may be provided on the computing device 900 or provided separately or remotely from the computing device 900. The storage device 924 may be used to store one or more slates in a durable manner.
  • Exemplary mappers and updaters may be programmatically implemented by a computer process in any suitable programming language, for example, a scripting programming language, an object-oriented programming language (e.g., Java), and the like. In an exemplary object-oriented implementation, a general Mapper class or interface and Updater class or interface may be defined by the system to generally specify attributes and functionality of a generic update operation. For each desired update operation, a sub-class may be created based on the Updater class. One or more object instances may be created from each sub-class at a processor node, for example, a CubeTupleGenerator object may be instantiated from a CubeTupleGenerator sub-class.
  • The computing device 900 may include a network interface 912 configured to interface via one or more network devices 922 with one or more networks, for example, Local Area Network (LAN), Wide Area Network (WAN) or the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (for example, 802.11, T1, T3, 56 kb, X.25), broadband connections (for example, ISDN, Frame Relay, ATM), wireless connections, controller area network (CAN), or some combination of any or all of the above. The network interface 912 may include a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem or any other device suitable for interfacing the computing device 900 to any type of network capable of communication and performing the operations described herein. The network device 922 may include one or more suitable devices for receiving and transmitting communications over the network including, but not limited to, one or more receivers, one or more transmitters, one or more transceivers, one or more antennae, and the like.
  • The computing device 900 may run any operating system 916, such as any of the versions of the Microsoft® Windows® operating systems, the different releases of the Unix and Linux operating systems, any version of the MacOS® for Macintosh computers, any embedded operating system, any real-time operating system, any open source operating system, any proprietary operating system, any operating systems for mobile computing devices, or any other operating system capable of running on the computing device and performing the operations described herein. In exemplary embodiments, the operating system 916 may be run in native mode or emulated mode. In an exemplary embodiment, the operating system 916 may be run on one or more cloud machine instances.
  • FIG. 10 depicts an exemplary network environment 1000 suitable for a distributed implementation of exemplary embodiments. The network environment 1000 may include one or more servers 1002 and 1004 coupled to one or more clients 1006 and 1008 via a communication network 1010. The network interface 912 and the network device 922 of the computing device 900 enable the servers 1002 and 1004 to communicate with the clients 1006 and 1008 via the communication network 1010. The communication network 1010 may include, but is not limited to, the Internet, an intranet, a LAN (Local Area Network), a WAN (Wide Area Network), a MAN (Metropolitan Area Network), a wireless network, an optical network, and the like. The communication facilities provided by the communication network 1010 are capable of supporting distributed implementations of exemplary embodiments.
  • In an exemplary embodiment, the servers 1002 and 1004 may provide the clients 1006 and 1008 with computer-readable and/or computer-executable components or products under a particular condition, such as a license agreement. In some exemplary embodiments, the computer-readable and/or computer-executable components or products provided by the servers may include those for providing one or more real-time data streams to worker processes at worker nodes. The clients 1006 and 1008 may process the data streams using the computer-readable and/or computer-executable components and products provided by the servers 1002 and 1004. In some exemplary embodiments, the computer-readable and/or computer-executable components or products provided by the servers may include those for providing and executing one or more map and/or update operations, for example using one or more mappers or updaters. The clients 1006 and 1008 may execute the map and update operations using the computer-readable and/or computer-executable components and products provided by the servers 1002 and 1004. In some exemplary embodiments, the clients 1006 and 1008 may transmit events generated by update operations to the servers 1002 and 1004 for publication in one or more data streams. In some exemplary embodiments, the clients 1006 and 1008 may transmit one or more slates created or updated by update operations to the servers 1002 and 1004 for persistent storage on a disk storage or for storage in memory, e.g., in cache.
  • Alternatively, in another exemplary embodiment, the clients 1006 and 1008 may provide the servers 1002 and 1004 with computer-readable and computer-executable components or products under a particular condition, such as a license agreement. In some exemplary embodiments, the computer-readable and/or computer-executable components or products provided by the clients may include those for providing one or more real-time data streams to worker processes. The servers 1002 and 1006 may process the data streams using the computer-readable and/or computer-executable components and products provided by the clients 1006 and 1008. In some exemplary embodiments, the computer-readable and/or computer-executable components or products provided by the clients may include those for providing and executing one or more map and/or update operations. The servers 1002 and 1004 may execute the map and update operations using the computer-readable and/or computer-executable components and products provided by the clients 1006 and 1008. In some exemplary embodiments, the servers 1002 and 1004 may transmit events generated by update operations to the clients 1006 and 1008 for publication in one or more data streams. In some exemplary embodiments, the servers 1002 and 1004 may transmit one or more slates created or updated by update operations to the clients 1006 and 1008 for persistent storage on a disk storage or for storage in memory, e.g., in cache.
  • In exemplary embodiments one or more mappers and one or more updaters for example map module 932 and update module 934 of FIG. 9, may be distributed to throughout various processing nodes of the network environment 1000, for example nodes 1012 a-d.
  • Although the teachings herein have been described with reference to exemplary embodiments and implementations thereof, the disclosed systems, methods and non-transitory storage medium are not limited to such exemplary embodiments/implementations. Rather, as will be readily apparent to persons skilled in the art from the description taught herein, the disclosed storage media, systems and methods are susceptible to modifications, alterations and enhancements without departing from the spirit or scope hereof. Accordingly, all such modifications, alterations and enhancements within the scope hereof are encompassed herein.

Claims (20)

1. A method for performing real time analytics on streaming data, the method comprising:
processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes;
identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes;
for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
2. The method of claim 1 further comprising tracking over an interval a tuple frequency for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval.
3. The method of claim 1 further comprising tracking over an interval, for each value attribute, aggregates for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes, and determining a top-N values for the one or more leadership dimension attributes over the interval.
4. The method of claim 3, wherein the Top-N values are characterized as resulting in one of (i) the highest aggregates (ii) the lowest aggregates, and (iii) the aggregates closest to a selected value, over the interval.
5. The method of claim 3 further comprising determining a set of top-N values for each of a plurality of intervals in a time window and determining a top-N values for the one or more leadership dimensions attributes over the time window based on the plurality of sets of top-N values.
6. The method of claim 1, wherein the one or more dimension attributes include a K-Gram for identifying topics of interest.
7. The method of claim 6, further comprising tracking over an interval a tuple frequency for each of the implicated tuples including a K-Gram, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
8. A method for implementing a real time analytics platform, the method comprising:
establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute;
generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
9. A system for performing real time analytics on streaming data, the system comprising:
a processor for processing an event in a data stream to extract a set of attribute-value pairs for one or more dimension attributes and one or more value attributes;
a mapper for identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and
one or more updaters for updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
10. The system of claim 9, wherein the system is configured to track over an interval a tuple frequency for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval.
11. The system of claim 9 wherein the system is configured to: (i) update, over an interval, for each value attribute, aggregates for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes, and (ii) determine a top-N values for the one or more leadership dimension attributes over the interval.
12. The system of claim 9, wherein the one or more dimension attributes include a K-Gram for identifying topics of interest, wherein the system is configured to track over an interval a tuple frequency for each of the implicated tuples including a K-Gram, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
13. A multi-dimensional data structure for implementing a real-time analytics platform characterized by one or more time windows, one or more dimension attributes, and one or more value attributes, the data structure comprising:
a plurality of tuples associated with the one or more dimension attributes; and
a slate associated with each tuple for maintaining an aggregate for each of the one or more value attributes over each of the one or more time windows.
14. A method for performing real-time analytics on a data stream, the methods comprising:
processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters;
returning one of the stored aggregates in response to a query.
15. A system for performing real-time analytics on a data stream the system comprising:
a processor for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and
memory for storing the plurality of stored aggregates.
16. The system of claim 15, further comprising an interface for providing a query, wherein the processor is configured to return one of the stored aggregates in response to the query.
17. A multi-dimensional data structure for implementing a real-time analytics platform, the data structure comprising:
a plurality of stored tuples each representing a set of search query parameters for prospective queries extrapolated from a pre-established framework of possible query parameters; and
one or more stored aggregates associated with each of the stored tuples, wherein each aggregate represents a result for a prospective query characterized by the set of search query parameters represented in the tuple associated with that aggregate.
18. A non-transitory computer readable medium storing processor executable instructions for performing real time analytics on streaming data, including instructions for:
processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes;
identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes;
for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
19. A non-transitory computer readable medium storing processor executable instructions for performing real time analytics on streaming data, including instructions for:
establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute;
generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
20. A non-transitory computer readable medium storing processor executable instructions for performing real time analytics on streaming data, including instructions for:
processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and
returning one of the stored aggregates in response to a query.
US13/300,523 2010-05-17 2011-11-18 Real-time analytics of streaming data Abandoned US20120130940A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/300,523 US20120130940A1 (en) 2010-11-18 2011-11-18 Real-time analytics of streaming data
US13/300,519 US9183270B2 (en) 2010-05-17 2011-11-18 Social genome

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US41527910P 2010-11-18 2010-11-18
US41528210P 2010-11-18 2010-11-18
US13/300,523 US20120130940A1 (en) 2010-11-18 2011-11-18 Real-time analytics of streaming data

Publications (1)

Publication Number Publication Date
US20120130940A1 true US20120130940A1 (en) 2012-05-24

Family

ID=45316067

Family Applications (4)

Application Number Title Priority Date Filing Date
US13/300,523 Abandoned US20120130940A1 (en) 2010-05-17 2011-11-18 Real-time analytics of streaming data
US13/300,473 Active 2032-08-17 US8725592B2 (en) 2010-05-17 2011-11-18 Method, system, and medium for recommending gift products based on textual information of a selected user
US13/300,519 Active 2032-02-18 US9183270B2 (en) 2010-05-17 2011-11-18 Social genome
US14/936,426 Active US9679074B2 (en) 2010-11-18 2015-11-09 Social genome

Family Applications After (3)

Application Number Title Priority Date Filing Date
US13/300,473 Active 2032-08-17 US8725592B2 (en) 2010-05-17 2011-11-18 Method, system, and medium for recommending gift products based on textual information of a selected user
US13/300,519 Active 2032-02-18 US9183270B2 (en) 2010-05-17 2011-11-18 Social genome
US14/936,426 Active US9679074B2 (en) 2010-11-18 2015-11-09 Social genome

Country Status (2)

Country Link
US (4) US20120130940A1 (en)
WO (1) WO2012068557A1 (en)

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130254374A1 (en) * 2012-03-21 2013-09-26 International Business Machines Corporation Resource allocation based on social networking trends in a networked computing environment
US20130339473A1 (en) * 2012-06-15 2013-12-19 Zynga Inc. Real time analytics via stream processing
CN103916478A (en) * 2014-04-11 2014-07-09 华为技术有限公司 Streaming data cube establishing method and device based on distributed system
US20140365524A1 (en) * 2013-06-10 2014-12-11 International Business Machines Corporation Incremental aggregation-based event pattern matching
US20150026207A1 (en) * 2013-07-22 2015-01-22 International Business Machines Corporation Managing sparsity in an multidimensional data structure
US20150339752A1 (en) * 2011-09-14 2015-11-26 International Business Machines Corporation Deriving Dynamic Consumer Defined Product Attributes from Input Queries
US9367424B2 (en) 2014-03-13 2016-06-14 International Business Machines Corporation Method for performance monitoring and optimization via trend detection and forecasting
US20170177573A1 (en) * 2015-12-18 2017-06-22 International Business Machines Corporation Method and system for hybrid sort and hash-based query execution
US20170177446A1 (en) * 2015-12-21 2017-06-22 Ab Initio Technology Llc Search and retrieval data processing system for computing near real-time data aggregations
US20170185583A1 (en) * 2015-12-28 2017-06-29 Facebook, Inc. Language model personalization
US9740687B2 (en) 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities
US9749430B2 (en) 2013-05-06 2017-08-29 Microsoft Technology Licensing, Llc Scalable data enrichment for cloud streaming analytics
US9805029B2 (en) 2015-12-28 2017-10-31 Facebook, Inc. Predicting future translations
US9830404B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Analyzing language dependency structures
US9830386B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Determining trending topics in social media
US9864744B2 (en) 2014-12-03 2018-01-09 Facebook, Inc. Mining multi-lingual data
US9899020B2 (en) 2015-02-13 2018-02-20 Facebook, Inc. Machine learning dialect identification
US9904718B2 (en) 2013-03-13 2018-02-27 Wal-Mart Stores, Inc. System and method for streaming events in a transaction-based system
US20180075516A1 (en) * 2016-09-14 2018-03-15 Microsoft Technology Licensing, Llc System for producing recommendations and predicting purchases of products based on usage patterns
WO2018049183A1 (en) * 2016-09-09 2018-03-15 BloomReach, Inc. Attribute extraction
US9946522B1 (en) 2016-12-16 2018-04-17 International Business Machines Corporation Generating code for real-time stream processing
US10067936B2 (en) 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking
US10089299B2 (en) 2015-12-17 2018-10-02 Facebook, Inc. Multi-media context language processing
US10133738B2 (en) 2015-12-14 2018-11-20 Facebook, Inc. Translation confidence scores
US10147046B2 (en) 2014-10-30 2018-12-04 International Business Machines Corporation System and methodology to handle misdirected input data during multi partitioned real time analytics
US10289681B2 (en) 2015-12-28 2019-05-14 Facebook, Inc. Predicting future translations
US10324943B2 (en) 2015-08-10 2019-06-18 Business Objects Software, Ltd. Auto-monitoring and adjustment of dynamic data visualizations
US10346537B2 (en) 2015-09-22 2019-07-09 Facebook, Inc. Universal translation
US10380249B2 (en) 2017-10-02 2019-08-13 Facebook, Inc. Predicting future trending topics
US10459922B2 (en) 2016-11-08 2019-10-29 At&T Intellectual Property I, L.P. Unique identification generation for records in a data streaming processing system
US10540386B2 (en) * 2013-08-09 2020-01-21 Shaofeng YANG Method for processing and displaying real-time social data on map
EP3678033A1 (en) * 2019-01-07 2020-07-08 QlikTech International AB A computer implemented method for indexlet based aggregation
US10812322B2 (en) * 2017-04-13 2020-10-20 Walmart Apollo, Llc Systems and methods for real time streaming
US10902221B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10902215B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
CN112817965A (en) * 2019-11-18 2021-05-18 百度在线网络技术(北京)有限公司 Data splicing method and device, electronic equipment and storage medium
CN113127512A (en) * 2020-01-15 2021-07-16 百度在线网络技术(北京)有限公司 Data splicing triggering method and device for multiple data streams, electronic equipment and medium
US11100191B2 (en) 2016-08-26 2021-08-24 1Qb Information Technologies Inc. Method and system for performing real-time analytics on a plurality of data streams
US11294876B2 (en) * 2017-06-01 2022-04-05 Oracle International Corporation System and method for generating a multi dimensional data cube for analytics using a map-reduce program
EP4033373A1 (en) * 2021-01-25 2022-07-27 QlikTech International AB Methods and systems for undetermined query analytics
CN116756215A (en) * 2023-06-27 2023-09-15 上海蚂蚁创将信息技术有限公司 Transaction in-transit state query method and system

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120130940A1 (en) * 2010-11-18 2012-05-24 Wal-Mart Stores, Inc. Real-time analytics of streaming data
IL207176A0 (en) * 2010-07-25 2011-04-28 Verint Systems Ltd System and method for video - assisted identification of mobile phone users
US8990241B2 (en) * 2010-12-23 2015-03-24 Yahoo! Inc. System and method for recommending queries related to trending topics based on a received query
US9530167B2 (en) * 2011-08-12 2016-12-27 Facebook, Inc. Coefficients attribution for different objects based on natural language processing
US9536015B1 (en) * 2011-09-06 2017-01-03 Google Inc. Using social networking information
US20130204701A1 (en) * 2012-02-06 2013-08-08 Gianmauro Calafiore Apparatus, system and methods for marketing targeted products to users of social media
US20140074650A1 (en) * 2012-03-01 2014-03-13 Qloo, Inc. Personalized cross-domain recommender system
US20130297380A1 (en) * 2012-05-02 2013-11-07 Ebay Inc. Inventory adjustment based on social network data
JP5962213B2 (en) * 2012-05-28 2016-08-03 ソニー株式会社 Information processing apparatus, information processing method, and program
US9117237B2 (en) * 2012-06-12 2015-08-25 Gyft, Inc. System, method, and medium for digital gift card selection
US9424612B1 (en) * 2012-08-02 2016-08-23 Facebook, Inc. Systems and methods for managing user reputations in social networking systems
US10560057B1 (en) * 2012-08-06 2020-02-11 Google Llc Measuring media attention over time based on long term heterogeneous archive data
US20140067595A1 (en) * 2012-08-31 2014-03-06 Wal-Mart Stores, Inc. Determining giftability of a product
US9299100B2 (en) * 2012-08-31 2016-03-29 Wal-Mart Stores, Inc. Determining giftability of a product based on recipient interests
US9104667B2 (en) 2012-09-24 2015-08-11 International Business Machines Corporation Social media event detection and content-based retrieval
US9218392B1 (en) * 2012-11-30 2015-12-22 Amazon Technologies, Inc. Interest related search results
US9934283B2 (en) * 2013-03-08 2018-04-03 Google Llc Social annotations for enhanced search results
US20150149539A1 (en) * 2013-11-22 2015-05-28 Adobe Systems Incorporated Trending Data Demographics
US10607255B1 (en) 2013-12-17 2020-03-31 Amazon Technologies, Inc. Product detail page advertising
US9846904B2 (en) * 2013-12-26 2017-12-19 Target Brands, Inc. Retail website user interface, systems and methods
US9875561B2 (en) 2014-05-20 2018-01-23 Jeffrey C. Mohr Method and system for dynamically creating and exploring graph structures
US9858610B2 (en) * 2014-08-29 2018-01-02 Wal-Mart Stores, Inc. Product recommendation based on geographic location and user activities
US11042946B2 (en) 2014-09-30 2021-06-22 Walmart Apollo, Llc Identity mapping between commerce customers and social media users
US10621231B2 (en) * 2015-08-24 2020-04-14 Google Llc Generation of a topic index with natural language processing
US9454584B1 (en) 2015-09-21 2016-09-27 Pearson Education, Inc. Assessment item generation and scoring
US9460162B1 (en) 2015-09-21 2016-10-04 Pearson Education, Inc. Assessment item generator
US10832304B2 (en) * 2016-01-15 2020-11-10 Target Brands, Inc. Resorting product suggestions for a user interface
US10600062B2 (en) 2016-03-15 2020-03-24 Target Brands Inc. Retail website user interface, systems, and methods for displaying trending looks by location
US10776860B2 (en) 2016-03-15 2020-09-15 Target Brands, Inc. Retail website user interface, systems, and methods for displaying trending looks
US11100415B2 (en) * 2016-10-04 2021-08-24 University Of Louisiana At Lafayette Architecture and method for providing insights in networks domain
US10969933B1 (en) * 2017-02-07 2021-04-06 The Mathworks, Inc. Graphical representation of ordered model items based on solver information
US20190043065A1 (en) * 2017-08-04 2019-02-07 John Hall Method and system of facilitating recommendation of digital content based on user responses
CN109389451B (en) * 2017-08-08 2021-10-19 阿里巴巴集团控股有限公司 Method and system for determining recommendation information
US20190066186A1 (en) * 2017-08-24 2019-02-28 Artivatic Data Labs Private Limited Cross domain recommendation system and method
US11475014B2 (en) * 2018-12-20 2022-10-18 AVAST Software s.r.o. Updating a toplist for a continuous data stream
US11706167B2 (en) 2020-06-29 2023-07-18 Snap Inc. Generating and accessing video content for products
US11748453B2 (en) * 2021-02-17 2023-09-05 International Business Machines Corporation Converting unstructured computer text to domain-specific groups using graph datastructures

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6965886B2 (en) * 2001-11-01 2005-11-15 Actimize Ltd. System and method for analyzing and utilizing data, by executing complex analytical models in real time
US7146416B1 (en) * 2000-09-01 2006-12-05 Yahoo! Inc. Web site activity monitoring system with tracking by categories and terms
US7523462B1 (en) * 2003-05-27 2009-04-21 International Business Machines Corporation Method for providing a real time view of heterogeneous enterprise data
US20090138446A1 (en) * 2007-11-27 2009-05-28 Umber Systems Method and apparatus for real-time multi-dimensional reporting and analyzing of data on application level activity and other user information on a mobile data network

Family Cites Families (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090234712A1 (en) * 1999-06-28 2009-09-17 Musicip Corporation Method and apparatus for automated selection, organization, and recommendation of items based on user preference topography
US7421660B2 (en) * 2003-02-04 2008-09-02 Cataphora, Inc. Method and apparatus to visually present discussions for data mining purposes
WO2004027649A1 (en) 2002-09-18 2004-04-01 Netezza Corporation Asymmetric streaming record data processor method and apparatus
US20040167887A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Integration of structured data with relational facts from free text for data mining
US7606168B2 (en) * 2005-01-28 2009-10-20 Attenex Corporation Apparatus and method for message-centric analysis and multi-aspect viewing using social networks
US7920584B2 (en) 2005-05-04 2011-04-05 Arm Limited Data processing system
US8655801B2 (en) 2005-10-26 2014-02-18 Cortica, Ltd. Computing device, a system and a method for parallel processing of data streams
US7668821B1 (en) * 2005-11-17 2010-02-23 Amazon Technologies, Inc. Recommendations based on item tagging activities of users
US7603619B2 (en) 2005-11-29 2009-10-13 Google Inc. Formatting a user network site based on user preferences and format performance data
EP1989639A4 (en) * 2006-02-28 2012-05-02 Buzzlogic Inc Social analytics system and method for analyzing conversations in social media
US8745661B2 (en) 2006-07-31 2014-06-03 Rovi Guides, Inc. Systems and methods for providing enhanced sports watching media guidance
US8930204B1 (en) * 2006-08-16 2015-01-06 Resource Consortium Limited Determining lifestyle recommendations using aggregated personal information
US8234623B2 (en) 2006-09-11 2012-07-31 The Mathworks, Inc. System and method for using stream objects to perform stream processing in a text-based computing environment
US7765259B2 (en) * 2006-12-05 2010-07-27 Avaya Inc. System and method for aggregation of user conversations and visualizing personal communications map
US7970891B1 (en) 2007-01-17 2011-06-28 Google Inc. Tracking links in web browsers
US20080215607A1 (en) * 2007-03-02 2008-09-04 Umbria, Inc. Tribe or group-based analysis of social media including generating intelligence from a tribe's weblogs or blogs
US8055664B2 (en) * 2007-05-01 2011-11-08 Google Inc. Inferring user interests
US8122006B2 (en) 2007-05-29 2012-02-21 Oracle International Corporation Event processing query language including retain clause
US7676461B2 (en) 2007-07-18 2010-03-09 Microsoft Corporation Implementation of stream algebra over class instances
US7853622B1 (en) * 2007-11-01 2010-12-14 Google Inc. Video-related recommendations using link structure
US20120203831A1 (en) * 2011-02-03 2012-08-09 Kent Schoen Sponsored Stories Unit Creation from Organic Activity Stream
US7882087B2 (en) 2008-01-15 2011-02-01 At&T Intellectual Property I, L.P. Complex dependencies for efficient data warehouse updates
US20110161827A1 (en) * 2008-03-05 2011-06-30 Anastasia Dedis Social media communication and contact organization
US8606721B1 (en) * 2008-03-11 2013-12-10 Amazon Technologies, Inc. Implicit social graph edge strengths
TWI418993B (en) * 2008-06-27 2013-12-11 Ind Tech Res Inst System and method for establishing personal social network, trusted network and social networking system
WO2010048172A1 (en) * 2008-10-20 2010-04-29 Cascaad Srl Social graph based recommender
US20100121707A1 (en) * 2008-11-13 2010-05-13 Buzzient, Inc. Displaying analytic measurement of online social media content in a graphical user interface
US20100128638A1 (en) 2008-11-20 2010-05-27 Sap Ag Hierarchical shortest path first network routing protocol
US20100169160A1 (en) * 2008-12-30 2010-07-01 Ebay Inc. Gift recommendation method and system
US8160996B2 (en) 2009-02-02 2012-04-17 The Hong Kong Polytechnic University Sequence online analytical processing system
US9330395B2 (en) 2009-05-05 2016-05-03 Suboti, Llc System, method and computer readable medium for determining attention areas of a web page
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
US8996556B2 (en) 2009-06-05 2015-03-31 Microsoft Technology Licensing, Llc Parallel processing of an ordered data stream
US20110004692A1 (en) * 2009-07-01 2011-01-06 Tom Occhino Gathering Information about Connections in a Social Networking Service
US8239364B2 (en) * 2009-12-08 2012-08-07 Facebook, Inc. Search and retrieval of objects in a social networking system
US8880600B2 (en) * 2010-03-31 2014-11-04 Facebook, Inc. Creating groups of users in a social networking system
US8326880B2 (en) * 2010-04-05 2012-12-04 Microsoft Corporation Summarizing streams of information
US8666979B2 (en) 2010-04-09 2014-03-04 Palo Alto Research Center Incorporated Recommending interesting content using messages containing URLs
US8185558B1 (en) * 2010-04-19 2012-05-22 Facebook, Inc. Automatically generating nodes and edges in an integrated social graph
US8918418B2 (en) * 2010-04-19 2014-12-23 Facebook, Inc. Default structured search queries on online social networks
US8782080B2 (en) * 2010-04-19 2014-07-15 Facebook, Inc. Detecting social graph elements for structured search queries
US9530166B2 (en) * 2010-04-21 2016-12-27 Facebook, Inc. Social graph that includes web pages outside of a social networking system
JP5810452B2 (en) * 2010-05-16 2015-11-11 アクセス ビジネス グループ インターナショナル リミテッド ライアビリティ カンパニー Data collection, tracking and analysis methods for multimedia including impact analysis and impact tracking
US20120130940A1 (en) * 2010-11-18 2012-05-24 Wal-Mart Stores, Inc. Real-time analytics of streaming data
US8595234B2 (en) 2010-05-17 2013-11-26 Wal-Mart Stores, Inc. Processing data feeds
US20120016661A1 (en) * 2010-07-19 2012-01-19 Eyal Pinkas System, method and device for intelligent textual conversation system
US9262517B2 (en) * 2010-08-18 2016-02-16 At&T Intellectual Property I, L.P. Systems and methods for social media data mining
US9240020B2 (en) * 2010-08-24 2016-01-19 Yahoo! Inc. Method of recommending content via social signals
US9852176B2 (en) * 2010-09-03 2017-12-26 Vocus, Inc. Dynamic gathering of social media content
US8892605B2 (en) * 2010-12-03 2014-11-18 Relationship Capital Technologies, Inc. Systems and methods for managing social networks based upon predetermined objectives
US8943154B1 (en) * 2012-05-11 2015-01-27 Amazon Technologies, Inc. Systems and methods for modeling relationships between users, network elements, and events
US8949250B1 (en) * 2013-12-19 2015-02-03 Facebook, Inc. Generating recommended search queries on online social networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7146416B1 (en) * 2000-09-01 2006-12-05 Yahoo! Inc. Web site activity monitoring system with tracking by categories and terms
US6965886B2 (en) * 2001-11-01 2005-11-15 Actimize Ltd. System and method for analyzing and utilizing data, by executing complex analytical models in real time
US7523462B1 (en) * 2003-05-27 2009-04-21 International Business Machines Corporation Method for providing a real time view of heterogeneous enterprise data
US20090138446A1 (en) * 2007-11-27 2009-05-28 Umber Systems Method and apparatus for real-time multi-dimensional reporting and analyzing of data on application level activity and other user information on a mobile data network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gupta et al., "CHAOS: A Data Stream Analysis Architecture for Enterprise Applications", IEEE Conference on Commerce and Enterprise Computing '09, Pages 33-40, IEEE, 2009 *
Krishnan et al., "Scalable Visual Analytics of Massive Textual Datasets", IEEE International Parallel and Distributed Processing Symposium 2007, Pages 1-10, IEEE, 2007 *

Cited By (68)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150339752A1 (en) * 2011-09-14 2015-11-26 International Business Machines Corporation Deriving Dynamic Consumer Defined Product Attributes from Input Queries
US9830633B2 (en) * 2011-09-14 2017-11-28 International Business Machines Corporation Deriving dynamic consumer defined product attributes from input queries
US20130254374A1 (en) * 2012-03-21 2013-09-26 International Business Machines Corporation Resource allocation based on social networking trends in a networked computing environment
US10353738B2 (en) * 2012-03-21 2019-07-16 International Business Machines Corporation Resource allocation based on social networking trends in a networked computing environment
US20130339473A1 (en) * 2012-06-15 2013-12-19 Zynga Inc. Real time analytics via stream processing
US9904718B2 (en) 2013-03-13 2018-02-27 Wal-Mart Stores, Inc. System and method for streaming events in a transaction-based system
US11010406B2 (en) 2013-03-13 2021-05-18 Walmart Apollo, Llc System and method for streaming events in a transaction-based system
US10318548B2 (en) 2013-03-13 2019-06-11 Walmart Apollo, Llc System and method for streaming events in a transaction-based system
US10735536B2 (en) 2013-05-06 2020-08-04 Microsoft Technology Licensing, Llc Scalable data enrichment for cloud streaming analytics
US9749430B2 (en) 2013-05-06 2017-08-29 Microsoft Technology Licensing, Llc Scalable data enrichment for cloud streaming analytics
US10306001B2 (en) 2013-05-06 2019-05-28 Microsoft Technology Licensing, Llc Scalable data enrichment for cloud streaming analytics
US9158824B2 (en) * 2013-06-10 2015-10-13 International Business Machines Corporation Incremental aggregation-based event pattern matching
US20140365524A1 (en) * 2013-06-10 2014-12-11 International Business Machines Corporation Incremental aggregation-based event pattern matching
US10275484B2 (en) * 2013-07-22 2019-04-30 International Business Machines Corporation Managing sparsity in a multidimensional data structure
US10169406B2 (en) 2013-07-22 2019-01-01 International Business Machines Corporation Managing sparsity in an multidimensional data structure
US20150026207A1 (en) * 2013-07-22 2015-01-22 International Business Machines Corporation Managing sparsity in an multidimensional data structure
US10540386B2 (en) * 2013-08-09 2020-01-21 Shaofeng YANG Method for processing and displaying real-time social data on map
US9367424B2 (en) 2014-03-13 2016-06-14 International Business Machines Corporation Method for performance monitoring and optimization via trend detection and forecasting
US10019505B2 (en) 2014-04-11 2018-07-10 Huawei Technologies Co., Ltd. Method and apparatus for creating data cube in streaming manner based on distributed system
TWI552543B (en) * 2014-04-11 2016-10-01 華為技術有限公司 Method and apparatus for creating data cube in streaming manner based on distributed system
CN103916478A (en) * 2014-04-11 2014-07-09 华为技术有限公司 Streaming data cube establishing method and device based on distributed system
US10002131B2 (en) 2014-06-11 2018-06-19 Facebook, Inc. Classifying languages for objects and entities
US10013417B2 (en) 2014-06-11 2018-07-03 Facebook, Inc. Classifying languages for objects and entities
US9740687B2 (en) 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities
US10152677B2 (en) 2014-10-30 2018-12-11 International Business Machines Corporation System and methodology to handle misdirected input data during multi partitioned real time analytics
US10147046B2 (en) 2014-10-30 2018-12-04 International Business Machines Corporation System and methodology to handle misdirected input data during multi partitioned real time analytics
US9864744B2 (en) 2014-12-03 2018-01-09 Facebook, Inc. Mining multi-lingual data
US9830404B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Analyzing language dependency structures
US10067936B2 (en) 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking
US9830386B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Determining trending topics in social media
US9899020B2 (en) 2015-02-13 2018-02-20 Facebook, Inc. Machine learning dialect identification
US10324943B2 (en) 2015-08-10 2019-06-18 Business Objects Software, Ltd. Auto-monitoring and adjustment of dynamic data visualizations
US10346537B2 (en) 2015-09-22 2019-07-09 Facebook, Inc. Universal translation
US10133738B2 (en) 2015-12-14 2018-11-20 Facebook, Inc. Translation confidence scores
US10089299B2 (en) 2015-12-17 2018-10-02 Facebook, Inc. Multi-media context language processing
US11194778B2 (en) * 2015-12-18 2021-12-07 International Business Machines Corporation Method and system for hybrid sort and hash-based query execution
US20170177573A1 (en) * 2015-12-18 2017-06-22 International Business Machines Corporation Method and system for hybrid sort and hash-based query execution
US11989096B2 (en) * 2015-12-21 2024-05-21 Ab Initio Technology Llc Search and retrieval data processing system for computing near real-time data aggregations
US20170177446A1 (en) * 2015-12-21 2017-06-22 Ab Initio Technology Llc Search and retrieval data processing system for computing near real-time data aggregations
US20170185583A1 (en) * 2015-12-28 2017-06-29 Facebook, Inc. Language model personalization
US10002125B2 (en) * 2015-12-28 2018-06-19 Facebook, Inc. Language model personalization
US9805029B2 (en) 2015-12-28 2017-10-31 Facebook, Inc. Predicting future translations
US10289681B2 (en) 2015-12-28 2019-05-14 Facebook, Inc. Predicting future translations
US10540450B2 (en) 2015-12-28 2020-01-21 Facebook, Inc. Predicting future translations
US10902215B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10902221B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US11100191B2 (en) 2016-08-26 2021-08-24 1Qb Information Technologies Inc. Method and system for performing real-time analytics on a plurality of data streams
WO2018049183A1 (en) * 2016-09-09 2018-03-15 BloomReach, Inc. Attribute extraction
US10445812B2 (en) 2016-09-09 2019-10-15 BloomReach, Inc. Attribute extraction
US20180075516A1 (en) * 2016-09-14 2018-03-15 Microsoft Technology Licensing, Llc System for producing recommendations and predicting purchases of products based on usage patterns
US10825072B2 (en) * 2016-09-14 2020-11-03 Microsoft Technology Licensing, Llc System for producing recommendations and predicting purchases of products based on usage patterns
US11341140B2 (en) 2016-11-08 2022-05-24 At&T Intellectual Property I, L.P. Unique identification generation for records in a data streaming processing system
US10459922B2 (en) 2016-11-08 2019-10-29 At&T Intellectual Property I, L.P. Unique identification generation for records in a data streaming processing system
US9946522B1 (en) 2016-12-16 2018-04-17 International Business Machines Corporation Generating code for real-time stream processing
US10241762B2 (en) 2016-12-16 2019-03-26 International Business Machines Corporation Generating code for real-time stream processing
US9983858B1 (en) 2016-12-16 2018-05-29 International Business Machines Corporation Generating code for real-time stream processing
US10812322B2 (en) * 2017-04-13 2020-10-20 Walmart Apollo, Llc Systems and methods for real time streaming
US11294876B2 (en) * 2017-06-01 2022-04-05 Oracle International Corporation System and method for generating a multi dimensional data cube for analytics using a map-reduce program
US12111809B2 (en) 2017-06-01 2024-10-08 Oracle International Corporation System and method for generating a multi dimensional data cube for analytics using a map-reduce program
US10380249B2 (en) 2017-10-02 2019-08-13 Facebook, Inc. Predicting future trending topics
EP3678033A1 (en) * 2019-01-07 2020-07-08 QlikTech International AB A computer implemented method for indexlet based aggregation
US11768857B2 (en) * 2019-01-07 2023-09-26 Qliktech International Ab Methods and systems for indexlet based aggregation
US12105738B2 (en) 2019-01-07 2024-10-01 Qliktech International Ab Methods and systems for indexlet based aggregation
CN112817965A (en) * 2019-11-18 2021-05-18 百度在线网络技术(北京)有限公司 Data splicing method and device, electronic equipment and storage medium
CN113127512A (en) * 2020-01-15 2021-07-16 百度在线网络技术(北京)有限公司 Data splicing triggering method and device for multiple data streams, electronic equipment and medium
EP4033373A1 (en) * 2021-01-25 2022-07-27 QlikTech International AB Methods and systems for undetermined query analytics
US11625395B2 (en) 2021-01-25 2023-04-11 Qliktech International Ab Methods and systems for undetermined query analytics
CN116756215A (en) * 2023-06-27 2023-09-15 上海蚂蚁创将信息技术有限公司 Transaction in-transit state query method and system

Also Published As

Publication number Publication date
US20120197750A1 (en) 2012-08-02
US20120131047A1 (en) 2012-05-24
US20160132515A1 (en) 2016-05-12
US9679074B2 (en) 2017-06-13
WO2012068557A1 (en) 2012-05-24
US8725592B2 (en) 2014-05-13
US9183270B2 (en) 2015-11-10

Similar Documents

Publication Publication Date Title
US20120130940A1 (en) Real-time analytics of streaming data
US11985037B2 (en) Systems and methods for conducting more reliable assessments with connectivity statistics
US11665072B2 (en) Parallel computational framework and application server for determining path connectivity
US10311106B2 (en) Social graph visualization and user interface
US20220283883A1 (en) Distributed processing in a messaging platform
US9922134B2 (en) Assessing and scoring people, businesses, places, things, and brands
US9483580B2 (en) Estimation of closeness of topics based on graph analytics
EP2937797A1 (en) System and method for searching a distributed node-sharded graph
US20110307474A1 (en) Party reputation aggregation system and method
US20180039688A1 (en) Data flow based feature vector clustering
KR20120126093A (en) Method, system and server for managing dynamic information of friends in network
EP3070661A1 (en) System and method for providing context driven hyper-personalized recommendation
US10803133B2 (en) System for decomposing events from managed infrastructures that includes a reference tool signalizer
CN111046237B (en) User behavior data processing method and device, electronic equipment and readable medium
US20220138191A1 (en) Systems and methods for matching electronic activities with whitespace domains to record objects in a multi-tenant system
US11336596B2 (en) Personalized low latency communication
US9846746B2 (en) Querying groups of users based on user attributes for social analytics
CN113592293A (en) Risk identification processing method, electronic device and computer-readable storage medium
US20180137198A1 (en) Data retrieval system
US12056132B2 (en) Systems and methods for selection of a first record object for association with second record objects based on connection profiles
US20150100515A1 (en) Customer data unification
CN114090873A (en) Method, device, equipment and computer readable medium for matching data
US20190392498A1 (en) Recommendation engine and system
CN113360765B (en) Event information processing method and device, electronic equipment and medium
Yuanjing et al. Research on the Application of Data Analysis in Agricultural Products E-Commerce

Legal Events

Date Code Title Description
AS Assignment

Owner name: WAL-MART STORES, INC., ARIZONA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GATTANI, ABHISHEK;RAJARAMAN, ANAND;REEL/FRAME:029248/0065

Effective date: 20120203

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: WALMART APOLLO, LLC, ARKANSAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WAL-MART STORES, INC.;REEL/FRAME:045817/0115

Effective date: 20180131