US20120130940A1 - Real-time analytics of streaming data - Google Patents
Real-time analytics of streaming data Download PDFInfo
- Publication number
- US20120130940A1 US20120130940A1 US13/300,523 US201113300523A US2012130940A1 US 20120130940 A1 US20120130940 A1 US 20120130940A1 US 201113300523 A US201113300523 A US 201113300523A US 2012130940 A1 US2012130940 A1 US 2012130940A1
- Authority
- US
- United States
- Prior art keywords
- attributes
- value
- tuple
- attribute
- aggregates
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2379—Updates performed during online database operations; commit processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9538—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- social media services such as TwitterTM, DiggTM, MyspaceTM and FacebookTM have seen a meteoric rise in popularity resulting in an ever evolving universe of streaming content/data which is often user/consumer generated.
- social media is able to capture, better than many other sources, a raw and unfiltered pulse of society.
- a company may gather and analyze information relevant to the company's markets to promote accurate and confident decision-making in determining market opportunity, market penetration strategy, market development metrics, etc.
- the present disclosure relates to real-time analytics of data streams. More particularly, the present disclosure relates to storage media, systems and methods for processing data streams and analyzing data extracted therefrom.
- Storage media, systems and methods for performing real time analytics on streaming data are disclosed herein.
- a method for performing real time analytics on streaming data may include: processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
- a tuple frequency may be tracked over an interval for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval.
- aggregates over an interval for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes may be tracked, and a top-N values determined for the one or more leadership dimension attributes over the interval, for example, wherein the Top-N values are characterized as resulting in the highest aggregates, the lowest aggregates or the aggregates closest to a selected value, over the interval.
- the one or more dimension attributes may include a K-Gram for identifying topics of interest.
- a tuple frequency for each of the implicated tuples including a K-Gram may be tracked over an interval, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
- a method for implementing a real time analytics platform may include establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
- a method for performing real-time analytics on a data stream may include: processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
- a system for performing real time analytics on streaming data may include: a processor for processing an event in a data stream to extract a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; a mapper for identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and one or more updaters for updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
- a system for performing real-time analytics on a data stream may include: a processor for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and memory for storing the plurality of stored aggregates.
- a multi-dimensional data structure for implementing a real-time analytics platform characterized by one or more time windows, one or more dimension attributes, and one or more value attributes, may include: a plurality of tuples associated with the one or more dimension attributes; and a slate associated with each tuple for maintaining an aggregate for each of the one or more value attributes over each of the one or more time windows.
- a multi-dimensional data structure for implementing a real-time analytics platform may include: a plurality of stored tuples each representing a set of search query parameters for prospective queries extrapolated from a pre-established framework of possible query parameters and one or more stored aggregates associated with each of the stored tuples, wherein each aggregate represents a result for a prospective query characterized by the set of search query parameters represented in the tuple associated with that aggregate.
- a non-transitory computer readable medium may store processor executable instructions for performing methods described herein.
- the computer readable medium may store processor executable instructions for processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
- the computer readable medium may store processor executable instructions for establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
- the computer readable medium may store processor executable instructions for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
- FIG. 1 depicts an exemplary data stream, according to the present disclosure.
- FIG. 2 depicts an exemplary query, according to the present disclosure.
- FIG. 3 depicts an exemplary data cube, according to the present disclosure.
- FIG. 4 depicts an exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure.
- FIG. 5 depicts another exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure.
- FIGS. 6 a - g depict a sequence of events for a worked example using the distributed architecture of FIG. 5 , according to the present disclosure.
- FIGS. 7 a - c depict flowcharts for exemplary methods for performing real time analytics on streaming data, according to the present disclosure.
- FIGS. 8 a - b depict overlaying real-time social statistics on a geographic map.
- FIG. 9 depicts an exemplary computing device for implementing embodiments of the present disclosure.
- FIG. 10 depicts an exemplary network environment for implementing a distributed architecture, according to the present disclosure.
- Storage media, systems and methods are disclosed herein for analyzing data streams in real time and/or pre-computing statistics in real time with no query time computation. More particularly, storage media, systems and methods are presented for processing data streams to calculate results for prospective queries.
- the results may be advantageously computed prior to the formulation of the specific query, for example, based on a pre-established framework of potential query parameters. More particularly, a universe of potential queries may be extrapolated from the pre-established framework of potential query parameters. Results for each of the potential queries may them be tracked in real time. For example results for each of the potential queries may be continuously updated based on real-time processing of events in a data stream.
- an event generally refers to an atomic unit in a data stream, for example, a single tweetTM in a TwitterTM feed or a single purchasing transaction in a transaction stream.
- a data stream may include a continuous flow of data that is not pre-divided into discrete events.
- an event may be inferred, for example, by identifying a set of one or more related attributes in the data stream.
- related attributes may be identified based on temporal and/or source commonalities.
- a contextual analysis for example, a semantic analysis
- a semantic analysis for example, a semantic analysis
- the storage media, systems and methods of the present disclosure may be used for real-time analysis of any type of streaming data, structured or unstructured.
- the storage media, systems and methods of the present disclosure may be used for real-time analysis of purchase transactions, customer reviews/feedback, customer wish lists/shopping carts, etc.
- prospective queries of data streams may include queries related to social statistics. For example:
- the result for each of the above queries may be calculated as an aggregate of a value attribute (number of events, percentage of events, sentiment, and average age, respectively) over a specified time window (between 10 a.m. and 11 a.m. today, yesterday, during the limited time offer, and last year, respectively) for a specified set of dimension attributes (related to product P, positive sentiment related to product P, related to the limited time offer from women in Arizona, and related to women in Arizona who purchased product P, respectively).
- a value attribute number of events, percentage of events, sentiment, and average age, respectively
- a specified time window between 10 a.m. and 11 a.m. today, yesterday, during the limited time offer, and last year, respectively
- dimension attributes related to product P, positive sentiment related to product P, related to the limited time offer from women in Arizona, and related to women in Arizona who purchased product P, respectively.
- a framework of potential query parameters may be pre-established by selecting, for example, via a user input, attributes of interest including one or more time windows, one or more dimension attributes and one or more value attributes.
- the framework may further include, for each of the one or more value attributes, an aggregate function defining how to aggregate instances of the value attribute.
- the term dimension attribute refers to an identifiable attribute in a data stream which is of interest as pertaining to a query search parameter.
- the term value attribute refers to an identifiable attribute in a data stream of interest which is of interest as pertaining to a query result parameter.
- a same attribute may be both a dimension attribute and a value attribute.
- the attribute “sentiment” may be used as both a search parameter (such as, in the query “how many women have a positive opinion about product P?”) and a result parameter (such as in the query “what is the sentiment of women regarding product P?”).
- a pre-established framework of query parameters may be used to generate a multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes; an aggregate of each of the one or more value attributes over each of the one or more time windows.
- the aggregates stored in the multi-dimensional data structure may be updated for each new event processed from a data stream.
- the event may be analyzed to extract a set of related attribute-value pairs including for one or more dimension attributes and one or more value attributes.
- the extracted set of related attributes-value pairs may then be used to identify or more implicated tuples in the data structure for updating.
- aggregates associated with each of the implicated tuples for each of the one or more value attributes may be updated by applying an appropriate aggregation function to each identified value attribute-value pair. In this way the multi-dimensional data structure may maintain real-time analytics of the data stream.
- a distributed architecture such as Muppet (map, update) may be used to implement the storage media, systems and methods of the present disclosure. Exemplary implementations of Muppet are further described herein as well as in U.S. non-provisional patent application entitled “Processing Data Feeds,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50302).
- a distributed architecture may be used to map an event to one or more implicated tuples in one or more multi-dimensional data structures and update, for each of the implicated tuples, one or more slates, for example based on one or more value attribute-value pairs in the event.
- slates for a plurality of implicated tuples may be updated in parallel for example, using different processing nodes.
- map and “mapper,” as used herein, relate to a stream operation performed in exemplary embodiments in which events in a data stream are processed in a real-time manner to generate one or more new events which are then published to a same or different data stream.
- a mapper may be used to publish events to one or more updaters for updating an aggregate value contained in a slates.
- update and “updater” refer to a stream operation performed in exemplary embodiments in which events in one or more real-time data streams are processed in a real-time manner to create or update one or more persistent static “slate” data structures that are stored in a persistent manner, for example, in a durable disk storage (note that, as used herein the terms “store,” “stored” “storage” etc., imply persistence a non-transitory storage medium).
- an updater may generate zero, one or more new stream events. The generated stream events may be published to one or more real-time data streams.
- an updater may publish stream events to a data stream from which it accepts stream events as input.
- slate refers to a static data structure that may be used to record aggregates as described herein.
- a slate may have any suitable data structure or format.
- a slate may include a collection of one or more attribute-value pairs.
- a slate may be stored corresponding to a unique slatekey and updater that updates the slate.
- time-based aggregates of a given value attribute over a given time-window may be maintained in a multi-dimensional data structure (sometimes referred to herein as a data cube).
- the dimensions of the data structure generally reflect one or more dimension attributes selected in a pre-established framework of potential query parameters.
- fan-out may exponentially relate to the number of dimensions in the data structure.
- the term “fan-out” for a distributed architecture may refer to the ratio of internal events generated by the mapping function relative to the number of external events (e.g., tweetsTM) processed.
- Event 100 may be processed to identify a plurality of attribute-value pairs 110 contained therin.
- attributes may include:
- timestamp One usefull attribute, accordingly to the present disclosure, is the timestamp.
- two assumptions may be made regarding the timestamp: first, that the timestamp represents actual wall-clock time in some appropriate timezone; and second, that timestamps are monontonically increasing. These assumptions are generally reasonable for streaming data (for example, in TwitterTM each tweetTM contains a timestamp that satisfies these conditions).
- One reason that timestamps are useful is that query results are represented as aggregates over a time window. Using timestamps, aggregates may be unambiguously interpreted to include a set of events whose timestamps fall within the specified time window.
- Query 200 may specify, for example, a time window 210 , a set of dimension attribute-value pairs 220 , one or more value attributes of interest 230 , and an aggregate function 240 related to each value attribute 230 .
- exemplary instances of query 200 are described below:
- Example 1 may be rewritten as the following SQL query:
- AggregateSentiment represents a custom-defined aggregation function for combining sentiment values.
- the function may maintain a 3-tuple of count, one each for positive, neutral, and negative sentiments.
- the function may be configured to calculate an average sentiment.
- aggregates may be calculated for each of a plurality of time windows.
- arbitrary time windows may be supplanted by a standard set of time windows, for example:
- the standard set of time windows may reflect an assumption that the further back time the coarser the time granularity of interest.
- the standard set of time windows may include time windows of varying time granularity. In exemplary embodiments, it may be sufficient to maintain aggregates for a finite number of progressively older and courser sets of time windows, such reflected above.
- the systems and method of the subject disclosure typically utilize algebraic aggregation functions, which may be computed incrementally.
- an average may be represented as a 2-tuple a sum and a count, wherein the average may be calculated by dividing the sum by the count.
- algebraic aggregations functions advantageously simplifies the update process, thereby facilitating the real-time data processing and analytics as described herein.
- aggregates are assumed to be commutative and associative, which makes manipulations thereof simpler.
- the storage media, systems and methods of the present disclosure may advantageously utilize a multi-dimensional data structure for maintaining a universe of aggregates for a pre-established framework of potential query parameters, wherein the pre-established framework of potential query parameters is characterized, by one or more time windows, one or more dimension attributes, one or more value attributes and one or more aggregation functions.
- the multi-dimensional data structure for storing aggregates f(Se), W for all potential combinations and values of t, s, and g may be referred to as the data cube for aggregate f(Se) and time window W.
- the term data cube refers to the fact the aggregates may be arranged as the vertices of a hypercube. In general, given K dimension attributes, there are 2 K aggregates for each set of values of the dimension attributes, which is the number of vertices of a hypercube in K dimensions.
- the data cube for the aggregate f(Se) allows rapid lookup of the aggregate of value attribute Se for every tuple over timewindow W.
- a data cube may store aggregates for a plurality of different time windows, for example for a standard set of time windows such as described herein.
- Topic (T) 310 may include as instances, names for various topics, for example, as grouped via a semantic hierarchy.
- Location (L) 320 may include as instances, names of locations, for example names of States in the united States.
- Sentiment (S) 330 may include instances selected from positive negative or neutral.
- the dimension attributes Topic (T) 310 , Location (L) 320 and Sentiment (S) 33 may be reflected along the vertices of the data cube 300 .
- the data cube 300 may maintain for each tuple (t, 1 , s) of T, L, S an aggregate for a value attribute (in this case: event count, i.e., the number of events processed for the tuple (t, l, s)).
- event count i.e., the number of events processed for the tuple (t, l, s)
- each tuple (t, l, s) may be associated with a slate for storing the event count.
- the slate may further be associated with an updater for updating the slate for a new event and a mapper for mapping new events to the updater.
- data cube 300 is updated based on a new event.
- a new event may be processed to identify one or more dimension attributes therein.
- the event count associated with each of the four implicated tuples may be updated (for example, by incrementing the count by one).
- Data cube 300 advantageously maintains, in real time, results for any prospective query using the framework (T, L, S), of an event count over one or more pre-established time windows.
- FIG. 3 illustrates three examples of queries 340 a - c , the answers to which are maintained and therefore pre-computed in data cube 300 .
- query 340 a asks “how many people are posting about Barack Obama in New York?”
- the result to query 340 a may be obtained by returning the event count for the tuple (Barack Obama, New York, All), for example the event count stored in the slate associated with the tuple (Barack Obama, New York, All).
- query 340 b asks “How many people in Arizona feel positive of the new Medicare plan?”
- the result to query 340 b be obtained by returning the event count for the tuple (Medicare, Arizona, Positive).
- query 340 c asks “How many people feel negative of Barack Obama across the US?”
- the result to query 340 c may obtained by returning the event count for the tuple (Barrack Obama, United States (e.g., all states), Negative).
- the number of the dimension of a data cube may be automatically determined based on the types of attributes reflected in the data stream. For example, event types with the greatest frequencies, such as above a selected threshold, may be used as the dimensions for the data cube. Thus, for example the data stream may be analyzed to determine the best candidate attributes for cube dimensions.
- an exemplary implementation of a distributed architecture 400 for maintaining a data cube may include a mapper, for example, CubeMapper 410 , and one or more updaters, for example, CubeTupleUpdaters 420 .
- the mapper and/or updaters may be distributed to one or more processing nodes in the distributed architecture, e.g., via a network architecture such as described herein, for example with reference to FIG. 10 .
- the CubeMapper 410 may advantageously determine, for example based on a set of attribute-value pairs extracted for an event, which data cubes to maintain.
- a data cube may be defined by a set of dimension attributes (for example, Topic, Gender, State), and an aggregation function for a value attribute.
- a configuration file may list the data cubes of interest, and give each data cube a name.
- the aggregation function may be specified, for example, in javascript, as a function that takes two parameters (the current value of the aggregate and a new event) and returns a single value (the new value of the aggregate). Note that as a special case, the current value may be null, in which case the function may return the aggregate corresponding to just the one event.
- the value of the aggregate may be in any data structure/format, for example a JSON object.
- a generated data cube may apply only to specific kinds of topics.
- the query framework may call for a data cube specific for persons of interest, such as customers, celebrities, etc., or for occasions such as holidays, the Oscars, etc.
- the mapper may implement a selection function, based on selection criterion, to filter out only a subset of events from a data stream.
- the selection function may, for example, accept a single parameter (the event) and return either True (this event's data should be part of this data cube) or False.
- the CubeConfig file may advantageously be replicated for reference at each processing node in the distributed architecture.
- the CubeMapper 410 may processes each event in a data stream and determine which data cubes it is eligible for.
- the CubeMapper 410 constructs the 2 K tuples from the event E (for example, the 2 2 tuples: (a), (b), (a,b) and (All) depicted in FIG. 3 ) and generates an event E 1 for each tuple.
- the key for each event E 1 is the pair (CubeName, Tuple) and the value is the content event E, for example, including a value attribute-value pair.
- the generated events E 1 are sent on to the CubeTupleUpdaters 420 .
- Each CubeTupleUpdater 420 maintains a slate for every (CubeName, Tuple) pair it receives.
- the CubeTupleUpdater receives an event E 1 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store.
- the updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein.
- a query for any tuple and any Time Window may be answered via a quick slate lookup.
- the CubeMapper 410 may generate 2K events for each incoming event. This leads to very high fan-out for cubes with K>3.
- tuples may be advantageously filtered based on frequency. Frequency filtering may be implemented, for example, by selecting a small time window (for example 1 minute) referred to as the delta window D. Let S be the set of all tuples corresponding to all social updates during a delta window D.
- a threshold ⁇ is applied to filter out all tuples in S with frequency less than ⁇ from being sent to updaters (for example, CubeTupleUpdaters 320 of FIG. 3 ).
- ⁇ may be selected to be 1 or 2.
- a simple experiment with actual data suggests that setting ⁇ to 2 may eliminate over 90% of tuples in a delta window of 1 minute.
- a qualitative examination of such tuples indicated that the eliminated tuples generally came from events that are not really of interest (for example, spam, outliers of some sort, or just semantic analysis errors).
- filtering also a second effect of reducing noise, e.g., from semantic analysis errors.
- tuples that do occur frequently in a 1-minute window often correlate well with the global interests. Frequency filtering of tuples thus both improves performance and improves the quality of the data cube.
- FIG. 5 depicts an exemplary distributed architecture 500 for maintaining a data cube while implementing frequency filtering.
- the distributed architecture 500 may include a mapper (CubeS elector 510 ) and 3 types of updaters (CubeTupleGenerators 520 , the CubeTupleCollectors 530 , and the CubeTupleUpdaters 540 ).
- the CubeTupleGenerators 520 buffer all tuples with the same Interval, and then dispatch them to the CubeTupleCollectors 530 .
- an assumption may be made that the stream has a large number of events in each Delta Window—that is, the Delta Window is very large compared to the average gap between events (for example, TwitterTM, processes approximately 100,000 tweetsTM in a Delta Window of 1 minute resulting in an average inter-event gap of less than a millisecond).
- TwitterTM processes approximately 100,000 tweetsTM in a Delta Window of 1 minute resulting in an average inter-event gap of less than a millisecond.
- one may detect when an interval has ended and the next one has begun based on a processing of the first event whose Interval is higher than the current Interval.
- intervals may be tracked independent of events received.
- intervals may be based on e.g., a requisite number of events received rather than a time frame (for example, provided that the integration of such batches over the one or more time windows being tracked results in an acceptable margin of error).
- the threshold ⁇ may be variable, e.g., based on a changing frequency of events in the data stream.
- a different threshold may be applied depending on the number of dimension attributes-value pairs in the tuple.
- frequency filtering may account for variations in event traffic when filtering tuples.
- the CubeSelector 510 is similar to the CubeMapper 210 in the na ⁇ ve implementation. It uses the config file to determine if the current Event E is eligible for a data cube. If it is, the CubeSelector 510 uses a hash function to map the timestamp of the event to one of P values, and emits to stream X an event E 1 whose key is (CubeName, P) and value is the content of the event E.
- Each CubeTupleGenerator 520 subscribes to stream X.
- a CTG generates all possible tuples from an event E 1 and maintains a table with 3 columns: Tuple, AggregateValue, and Count.
- Count is the number of events that have contributed to this tuple.
- the CTG also stores the Interval I of the first event it received during this Delta Window.
- the CTG For each tuple, the CTG uses a hash function to map the tuple into one of P buckets 0,1 . . . , P ⁇ 1. For each bucket in 1, . . . P ⁇ 1, the CTG creates and emits to stream Y an event E 2 whose key is the bucket number, and whose value is the set of rows in its table whose tuple maps to that bucket number. The value also indicates the Interval I. Once the CTG has sent out all P events, it empties its tables and stores J as the new Interval value. The CTG then proceeds to process a newly received event E 1 .
- Each CubeTupleCollector 530 subscribes to stream Y.
- a CTC slate receives P events , one from each CTG slate.
- the CTC maintains a table with 3 columns: Tuple, AggregateValue, Count.
- the AggregateValue for a tuple is the aggregate of the partial AggregateValues for that tuple from each CTG, and the Count is the sum of the corresponding partial counts.
- the CTC also keeps track of the current interval I and a message counter M, both initialized to 0. When the CTC receives an event E 2 from the CTG with interval J, it first compares I and J:
- Each CubeTupleUpdater 540 maintains a slate for every (CubeName, Tuple) pair it receives.
- the CubeTupleUpdater receives an event E 3 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store.
- the updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein.
- a query for any tuple and any Time Window may be answered via a quick slate lookup.
- FIGS. 6 a - g an exemplary worked example of the distributed architecture 500 is depicted, demonstrating exemplary contents of CTG slates 620 and CTC slates 630 of a CTG and CTC such as described above with reference to FIG. 5 .
- a data cube C with two dimensions (A and X) is assumed.
- 4 tuples arrive.
- Cube Selecter 610 uses the config file to determine if the Event E is eligible for a data cube, for example data cube C.
- the CubeSelector 610 uses a hash function to map the timestamp of the event to one of P Values and emits an event E 1 whose key is (CubeName, P) and value is the content of the event E.
- CTG slates 620 buffer all tuples with the same interval. More particularly the CTG slates 620 maintains a table with 3 columns: Tuple, AggregateValue, and Count.
- a hash function to map each from the CTG slates 620 tuple into one of P buckets 0,1 . . . , P ⁇ 1, for example buckets 625 of FIG.
- CTC slates 630 receive P events, one from each CTG slate 620 .
- the CTC slates 630 maintains a table with 3 columns: Tuple, AgregateValue, Count. As depicted, once the CTG slate has sent out its events E 2 , it empties its tables and begins processing events E 1 for a new interval.
- Exemplary implementations presented above primarily related to point aggregates—wherein a tuple (a point in the hypercube) is specified and the desired result of the query is the corresponding aggregate value.
- the storage media, systems and methods herein are not limited to such implementations. Indeed in some embodiments, the data cube is used tracks results for different type of queries for, example, leaderboard queries.
- a leaderboard query specifies a tuple and a leaderboard attribute, and asks for the Top-N values of the leader board attribute among events that satisfy the tuple.
- a leaderboard query might pose the following question:
- leaderboard queries might pose the following question: What were the Top 10 least popular topics posted about by people in Arizona between 10 am and 11 am today? What were the Top 10 locations closest to New York where users posted about a sales event.
- the data cube may easily be extended to support leaderboard queries.
- the leaderboard attribute has a small and known set of values (e.g., state or gender)
- the leaderboard attribute can take on a large number of values (or example, topic, product, URL, domain). In this case we cannot possibly examine all slates for all values of the attribute.
- leaderboard queries are indicated by adding a line to the data cube config file to indicate the leaderboard dimensions for each cube and the value for “N” (for Top-N queries; e.g., 10). For example:
- Non-leaderboard dimensions of the data cube may be referred to as ordinary dimensions.
- the CTC can now compute the Top-N leaderboard values for each tuple, and pass them on to the CTU in the event it constructs for the tuple.
- the Top-N values may be characterized as resulting in the highest aggregates, the lowest aggregates, and the aggregates closest to a selected value, etc.
- the CTU may maintain the leaderboard for each time window. Where a time window is used for filtering by the CTC, the CTU may get the leaderboard directly from the CTC. Leaderboards for larger time windows are approximated using those for the smaller time windows. For example, an hourly leaderboard may be computed by looking at the Top 10 for each minute of the hour, and summing the minute-by-minute values. While it is possible to make some errors using this method (for example, where the most frequent item during the hour was not in the top 10 for any minute of the hour) accurate results are generally achieved.
- a semantic analysis engine can only tag instances it recognizes.
- An interesting advantage the storage media, systems and methods described herein is that the data cube may be used to maintain statistics on topics products, locations etc. (collectively, referred to as interesting topics or topics of interest) without the help of semantic analysis.
- a “K-Gram” is added as one of the data cube dimensions, with a relative high threshold 6 (e.g., 50).
- a slate is automatically created and maintained for any k-gram that was mentioned very often in a delta time window.
- a TF.IDF threshold may be instead of a pure frequency threshold to achieve better results.
- the data cube may be used to identify trending k-grams and automatically maintains stats for them without help from semantic analysis.
- identified topics of interest can be communicated to the semantic analysis engine and “candidate topics” and the maintained statistics relating thereto can be used by the engine to whether to incorporate each candidate topic into the taxonomy.
- FIGS. 7 a - c illustrate exemplary methods according to the present disclosure.
- Method 710 is depicted for performing real time analytics on streaming data.
- Method 710 generally includes steps of: ( 712 ) processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; ( 714 ) identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and ( 716 ) for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
- Method 720 is depicted for performing real time analytics on streaming data.
- Method 720 generally includes steps of ( 722 ) establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and ( 724 ) generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
- Method 730 generally includes steps of: ( 732 ) processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and ( 734 ) returning one of the stored aggregates in response to a query.
- the storage media, systems and methods presented herein may include one or more programmable processing units having associated therewith executable instructions held on one or more computer readable medium, RAM, ROM, hard drive, and/or hardware.
- the hardware, firmware and/or executable code may be provided, for example, as upgrade module(s) for use in conjunction with existing infrastructure (for example, existing devices/processing units).
- Hardware may, for example, include components and/or logic circuitry for executing the embodiments taught herein as a computing process.
- Displays and/or other feedback means may also be included to convey detected/processed data, for example adjusted output representative of a particle characteristic.
- the display and/or other feedback means may be stand-alone or may be included as one or more components/modules of the processing unit(s).
- the display and/or other feedback means may be used to facilitate querying a data cube.
- the display may be used to visualize, in real-time, various social statistics maintained by the data cube. For example, as depicted in FIGS. 8 a and 8 b , real-time social statistics may be overlaid on a geographic map.
- a “processor,” “processing unit,” “computer” or “computer system” may be, for example, a wireless or wire line variety of a microcomputer, minicomputer, server, mainframe, laptop, personal data assistant (PDA), wireless e-mail device (for example, “BlackBerry,” “Android” or “Apple,” trade-designated devices), cellular phone, pager, processor, fax machine, scanner, or any other programmable device configured to transmit and receive data over a network.
- Computer systems disclosed herein may include memory for storing certain software applications used in obtaining, processing and communicating data. It can be appreciated that such memory may be internal or external to the disclosed embodiments.
- the memory may also include non-transitory storage medium for storing software, including a hard disk, an optical disk, floppy disk, ROM (read only memory), RAM (random access memory), PROM (programmable ROM), EEPROM (electrically erasable PROM), flash memory storage devices, or the like.
- non-transitory storage medium for storing software, including a hard disk, an optical disk, floppy disk, ROM (read only memory), RAM (random access memory), PROM (programmable ROM), EEPROM (electrically erasable PROM), flash memory storage devices, or the like.
- FIG. 9 depicts a block diagram representing an exemplary computing device 900 that may be used as a processing node (also referred to as a worker node) for aggregating and/or storing data as described herein, for example a processing node in a distributed architecture as described herein.
- the computing device 900 may be any computer system, such as a workstation, desktop computer, server, laptop, handheld computer, tablet computer (e.g., the iPadTM tablet computer), mobile computing or communication device (e.g., the iPhoneTM mobile communication device, the AndroidTM mobile communication device, and the like), or other form of computing or telecommunications device that is capable of communication and that has sufficient processor power and memory capacity to perform the operations described herein.
- a distributed computational system may be provided comprising a plurality of such computing devices.
- the computing device 900 includes one or more non-transitory computer-readable media having encoded thereon one or more computer-executable instructions or software for implementing exemplary methods described herein.
- the non-transitory computer-readable media may include, but are not limited to, one or more types of hardware memory, non-transitory tangible media (for example, one or more magnetic storage disks, one or more optical disks, one or more USB flash drives), and the like.
- memory 906 included in the computing device 900 may store computer-readable and computer-executable instructions or software for implementing exemplary embodiments.
- the computing device 900 also includes processor 902 and associated core 904 , and in some embodiments, one or more additional processor(s) 902 ′ and associated core(s) 904 ′ (for example, in the case of computer systems having multiple processors/cores), for executing computer-readable and computer-executable instructions or software stored in the memory 906 and other programs for controlling system hardware.
- processor 902 and processor(s) 902 ′ may each be a single core processor or multiple core ( 904 and 904 ′) processor.
- Virtualization may be employed in the computing device 900 so that infrastructure and resources in the computing device may be shared dynamically.
- a virtual machine 914 may be provided to handle a process running on multiple processors so that the process appears to be using only one computing resource rather than multiple computing resources. Multiple virtual machines may also be used with one processor.
- Memory 906 may include a computer system memory or random access memory, such as DRAM, SRAM, EDO RAM, and the like. Memory 906 may include other types of memory as well, or combinations thereof. Memory 906 may be used to store one or more slates on a temporary basis, for example, in cache.
- a user may interact with the computing device 900 through a visual display device 918 , such as a screen or monitor, that may display one or more interfaces 920 that may be provided in accordance with exemplary embodiments.
- the visual display device 918 may also display other aspects, elements and/or information or data associated with exemplary embodiments.
- the computing device 900 may include other I/O devices for receiving input from a user, for example, a keyboard or any suitable multi-point touch interface 908 , a pointing device 910 (e.g., a mouse, a user's finger interfacing directly with a display device, etc.).
- the keyboard 908 and the pointing device 910 may be coupled to the visual display device 918 .
- the computing device 900 may include other suitable conventional I/O peripherals.
- the one or more of the interfaces 920 includes an application program interface (API).
- API application program interface
- the computing device 900 may include one or more audio input devices 924 , such as one or more microphones, that may be used by a user to provide one or more audio input streams.
- one or more audio input devices 924 such as one or more microphones, that may be used by a user to provide one or more audio input streams.
- the computing device 900 may include one or more non-transitory storage devices 924 , such as a durable disk storage (which may include any suitable optical or magnetic durable storage device, e.g., RAM, ROM, Flash, USB drive, or other semiconductor-based storage medium), a hard-drive, CD-ROM, or other non-transitory computer readable media, for storing data and computer-readable instructions and/or software that implement exemplary embodiments as taught herein.
- the storage device 924 may provide a slate storage 926 for storing computer-executable instructions for implementing the social genome data structure as described herein, for example for storing an updating (via one or more updaters) one or more slates, as described herein.
- the storage device 924 may store one or more map modules 932 and one or more update modules 934 , as described herein.
- the storage device 924 may be provided on the computing device 900 or provided separately or remotely from the computing device 900 .
- the storage device 924 may be used to store one or more slates in a durable manner.
- Exemplary mappers and updaters may be programmatically implemented by a computer process in any suitable programming language, for example, a scripting programming language, an object-oriented programming language (e.g., Java), and the like.
- a general Mapper class or interface and Updater class or interface may be defined by the system to generally specify attributes and functionality of a generic update operation.
- a sub-class may be created based on the Updater class.
- One or more object instances may be created from each sub-class at a processor node, for example, a CubeTupleGenerator object may be instantiated from a CubeTupleGenerator sub-class.
- the computing device 900 may include a network interface 912 configured to interface via one or more network devices 922 with one or more networks, for example, Local Area Network (LAN), Wide Area Network (WAN) or the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (for example, 802.11, T1, T3, 56 kb, X.25), broadband connections (for example, ISDN, Frame Relay, ATM), wireless connections, controller area network (CAN), or some combination of any or all of the above.
- LAN Local Area Network
- WAN Wide Area Network
- the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (for example, 802.11, T1, T3, 56 kb, X.25), broadband connections (for example, ISDN, Frame Relay, ATM), wireless connections, controller area network (CAN), or some combination of any or all of the above.
- LAN Local Area Network
- WAN Wide Area Network
- CAN controller area network
- the network interface 912 may include a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem or any other device suitable for interfacing the computing device 900 to any type of network capable of communication and performing the operations described herein.
- the network device 922 may include one or more suitable devices for receiving and transmitting communications over the network including, but not limited to, one or more receivers, one or more transmitters, one or more transceivers, one or more antennae, and the like.
- the computing device 900 may run any operating system 916 , such as any of the versions of the Microsoft® Windows® operating systems, the different releases of the Unix and Linux operating systems, any version of the MacOS® for Macintosh computers, any embedded operating system, any real-time operating system, any open source operating system, any proprietary operating system, any operating systems for mobile computing devices, or any other operating system capable of running on the computing device and performing the operations described herein.
- the operating system 916 may be run in native mode or emulated mode.
- the operating system 916 may be run on one or more cloud machine instances.
- FIG. 10 depicts an exemplary network environment 1000 suitable for a distributed implementation of exemplary embodiments.
- the network environment 1000 may include one or more servers 1002 and 1004 coupled to one or more clients 1006 and 1008 via a communication network 1010 .
- the network interface 912 and the network device 922 of the computing device 900 enable the servers 1002 and 1004 to communicate with the clients 1006 and 1008 via the communication network 1010 .
- the communication network 1010 may include, but is not limited to, the Internet, an intranet, a LAN (Local Area Network), a WAN (Wide Area Network), a MAN (Metropolitan Area Network), a wireless network, an optical network, and the like.
- the communication facilities provided by the communication network 1010 are capable of supporting distributed implementations of exemplary embodiments.
- the servers 1002 and 1004 may provide the clients 1006 and 1008 with computer-readable and/or computer-executable components or products under a particular condition, such as a license agreement.
- the computer-readable and/or computer-executable components or products provided by the servers may include those for providing one or more real-time data streams to worker processes at worker nodes.
- the clients 1006 and 1008 may process the data streams using the computer-readable and/or computer-executable components and products provided by the servers 1002 and 1004 .
- the computer-readable and/or computer-executable components or products provided by the servers may include those for providing and executing one or more map and/or update operations, for example using one or more mappers or updaters.
- the clients 1006 and 1008 may execute the map and update operations using the computer-readable and/or computer-executable components and products provided by the servers 1002 and 1004 .
- the clients 1006 and 1008 may transmit events generated by update operations to the servers 1002 and 1004 for publication in one or more data streams.
- the clients 1006 and 1008 may transmit one or more slates created or updated by update operations to the servers 1002 and 1004 for persistent storage on a disk storage or for storage in memory, e.g., in cache.
- the clients 1006 and 1008 may provide the servers 1002 and 1004 with computer-readable and computer-executable components or products under a particular condition, such as a license agreement.
- the computer-readable and/or computer-executable components or products provided by the clients may include those for providing one or more real-time data streams to worker processes.
- the servers 1002 and 1006 may process the data streams using the computer-readable and/or computer-executable components and products provided by the clients 1006 and 1008 .
- the computer-readable and/or computer-executable components or products provided by the clients may include those for providing and executing one or more map and/or update operations.
- the servers 1002 and 1004 may execute the map and update operations using the computer-readable and/or computer-executable components and products provided by the clients 1006 and 1008 .
- the servers 1002 and 1004 may transmit events generated by update operations to the clients 1006 and 1008 for publication in one or more data streams.
- the servers 1002 and 1004 may transmit one or more slates created or updated by update operations to the clients 1006 and 1008 for persistent storage on a disk storage or for storage in memory, e.g., in cache.
- one or more mappers and one or more updaters may be distributed to throughout various processing nodes of the network environment 1000 , for example nodes 1012 a - d.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Economics (AREA)
- General Business, Economics & Management (AREA)
- Strategic Management (AREA)
- Marketing (AREA)
- Development Economics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Storage media, systems and methods are disclosed herein for analyzing data streams in real time. More particularly, storage media, systems and methods are presented for processing data streams to calculate results for prospective queries. The results may be advantageously computed prior to the formulation of the specific query, for example, based on a pre-established framework of potential query parameters. More particularly, a universe of potential queries may be extrapolated from the pre-established framework of potential query parameters. Results for each of the potential queries may them be tracked in real time. For example, results for each of the potential queries may be continuously updated based on real-time processing of events in a data stream.
Description
- The present application claims priority to U.S. Provisional Patent Application No. 61/415,279, filed Nov. 18, 2010 (entitled “Social Genome”), and U.S. Provisional Patent Application No. 61/415,282, filed Nov. 18, 2010 (entitled “Managing Real-Time Data Streams”). This application also relates to U.S. Provisional Patent Application No. 61/345,252 entitled “Content Feed,” filed May 17, 2010, U.S. patent application Ser. No. 13/106,706 entitled “Processing Data Feeds,” filed May 12, 2011, a U.S. non-provisional patent application titled “Processing Data Feeds,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50302), a U.S. non-provisional patent application entitled “Methods Systems and Devices for Recommending Products and Services” filed Nov. 18, 2011 (Attorney Docket No. 114826-50602), and a U.S. non-provisional patent application entitled “Social Genome,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50202). The entire contents of each of the above-referenced applications are incorporated herein in their entirety by reference.
- In recent years, social media services such as Twitter™, Digg™, Myspace™ and Facebook™ have seen a meteoric rise in popularity resulting in an ever evolving universe of streaming content/data which is often user/consumer generated. Thus, social media is able to capture, better than many other sources, a raw and unfiltered pulse of society.
- Potential applications for data harvested from social media are vast. For example, from a marketing intelligence standpoint, a company may gather and analyze information relevant to the company's markets to promote accurate and confident decision-making in determining market opportunity, market penetration strategy, market development metrics, etc.
- The present disclosure relates to real-time analytics of data streams. More particularly, the present disclosure relates to storage media, systems and methods for processing data streams and analyzing data extracted therefrom.
- Storage media, systems and methods for performing real time analytics on streaming data are disclosed herein.
- In exemplary embodiments a method for performing real time analytics on streaming data may include: processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes. In some embodiments, a tuple frequency may be tracked over an interval for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval. In other embodiments, for each value attribute, aggregates over an interval for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes may be tracked, and a top-N values determined for the one or more leadership dimension attributes over the interval, for example, wherein the Top-N values are characterized as resulting in the highest aggregates, the lowest aggregates or the aggregates closest to a selected value, over the interval. In yet other embodiments, the one or more dimension attributes may include a K-Gram for identifying topics of interest. Thus, for example a tuple frequency for each of the implicated tuples including a K-Gram may be tracked over an interval, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
- In other exemplary embodiments, a method for implementing a real time analytics platform may include establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
- In other exemplary embodiments a method for performing real-time analytics on a data stream may include: processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
- In exemplary embodiments, a system for performing real time analytics on streaming data, may include: a processor for processing an event in a data stream to extract a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; a mapper for identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and one or more updaters for updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
- In other exemplary embodiments, a system for performing real-time analytics on a data stream may include: a processor for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and memory for storing the plurality of stored aggregates.
- In exemplary embodiments, a multi-dimensional data structure, for implementing a real-time analytics platform characterized by one or more time windows, one or more dimension attributes, and one or more value attributes, may include: a plurality of tuples associated with the one or more dimension attributes; and a slate associated with each tuple for maintaining an aggregate for each of the one or more value attributes over each of the one or more time windows.
- In other exemplary embodiments, a multi-dimensional data structure for implementing a real-time analytics platform may include: a plurality of stored tuples each representing a set of search query parameters for prospective queries extrapolated from a pre-established framework of possible query parameters and one or more stored aggregates associated with each of the stored tuples, wherein each aggregate represents a result for a prospective query characterized by the set of search query parameters represented in the tuple associated with that aggregate.
- In exemplary embodiments, a non-transitory computer readable medium may store processor executable instructions for performing methods described herein. For example, the computer readable medium may store processor executable instructions for processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes. In other embodiments, the computer readable medium may store processor executable instructions for establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows. In yet other embodiments, the computer readable medium may store processor executable instructions for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and returning one of the stored aggregates in response to a query.
- The foregoing and other objects, aspects, features and advantages of exemplary embodiments will be more fully understood from the following description when read together with the accompanying drawings.
-
FIG. 1 depicts an exemplary data stream, according to the present disclosure. -
FIG. 2 depicts an exemplary query, according to the present disclosure. -
FIG. 3 depicts an exemplary data cube, according to the present disclosure. -
FIG. 4 depicts an exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure. -
FIG. 5 depicts another exemplary implementation of a distributed architecture for maintaining a data cube, according to the present disclosure. -
FIGS. 6 a-g depict a sequence of events for a worked example using the distributed architecture ofFIG. 5 , according to the present disclosure. -
FIGS. 7 a-c depict flowcharts for exemplary methods for performing real time analytics on streaming data, according to the present disclosure. -
FIGS. 8 a-b depict overlaying real-time social statistics on a geographic map. -
FIG. 9 depicts an exemplary computing device for implementing embodiments of the present disclosure. -
FIG. 10 depicts an exemplary network environment for implementing a distributed architecture, according to the present disclosure. - Storage media, systems and methods are disclosed herein for analyzing data streams in real time and/or pre-computing statistics in real time with no query time computation. More particularly, storage media, systems and methods are presented for processing data streams to calculate results for prospective queries. The results may be advantageously computed prior to the formulation of the specific query, for example, based on a pre-established framework of potential query parameters. More particularly, a universe of potential queries may be extrapolated from the pre-established framework of potential query parameters. Results for each of the potential queries may them be tracked in real time. For example results for each of the potential queries may be continuously updated based on real-time processing of events in a data stream.
- Note that, as used herein the term event generally refers to an atomic unit in a data stream, for example, a single tweet™ in a Twitter™ feed or a single purchasing transaction in a transaction stream. In exemplary embodiments, a data stream may include a continuous flow of data that is not pre-divided into discrete events. Thus, in some embodiments, an event may be inferred, for example, by identifying a set of one or more related attributes in the data stream. For example, related attributes may be identified based on temporal and/or source commonalities. In some embodiments, a contextual analysis (for example, a semantic analysis) of attributes in a data steam may be used to identify a set of one or more related attributes. Exemplary embodiments of semantic analysis, for example using a doctagger to identify and/or group topics, are is described herein
- It is appreciated that, although exemplary embodiments presented herein relate to social analytics, the storage media, systems and methods of the present disclosure may be used for real-time analysis of any type of streaming data, structured or unstructured. For example, the storage media, systems and methods of the present disclosure may be used for real-time analysis of purchase transactions, customer reviews/feedback, customer wish lists/shopping carts, etc.
- In exemplary embodiments, prospective queries of data streams may include queries related to social statistics. For example:
-
- How many events relate to product P between 10 a.m. and 11 a.m. today?
- What percentage of events had a positive opinion of product P yesterday?
- How do women in Arizona feel about a limited time offer, based on events during the offer?
- What is the average age of women in Arizona who purchased product P last year?
- The result for each of the above queries may be calculated as an aggregate of a value attribute (number of events, percentage of events, sentiment, and average age, respectively) over a specified time window (between 10 a.m. and 11 a.m. today, yesterday, during the limited time offer, and last year, respectively) for a specified set of dimension attributes (related to product P, positive sentiment related to product P, related to the limited time offer from women in Arizona, and related to women in Arizona who purchased product P, respectively). The storage media, systems and methods of the present disclosure advantageously facilitate identifying and maintaining time-based aggregates, such as described above, prior to the formulation of a queries.
- In exemplary embodiments, a framework of potential query parameters may be pre-established by selecting, for example, via a user input, attributes of interest including one or more time windows, one or more dimension attributes and one or more value attributes. The framework may further include, for each of the one or more value attributes, an aggregate function defining how to aggregate instances of the value attribute. As used herein, the term dimension attribute refers to an identifiable attribute in a data stream which is of interest as pertaining to a query search parameter. By comparison, the term value attribute refers to an identifiable attribute in a data stream of interest which is of interest as pertaining to a query result parameter. Notably, depending on the particular framework of potential query parameters, a same attribute may be both a dimension attribute and a value attribute. For example, the attribute “sentiment” may be used as both a search parameter (such as, in the query “how many women have a positive opinion about product P?”) and a result parameter (such as in the query “what is the sentiment of women regarding product P?”).
- In exemplary embodiments a pre-established framework of query parameters may be used to generate a multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes; an aggregate of each of the one or more value attributes over each of the one or more time windows. As used herein the term tuple may refer to a set of dimension attribute-value pairs. For example, for exemplary dimension attributes Person (P) Location (L) and Thing (T), a tuple may take the form P=p, L =1, and T=t, (also expressed as the tuple (p, l, t) for dimensions (P, L, T)).
- In exemplary embodiments, the aggregates stored in the multi-dimensional data structure may be updated for each new event processed from a data stream. In particular, the event may be analyzed to extract a set of related attribute-value pairs including for one or more dimension attributes and one or more value attributes. The extracted set of related attributes-value pairs may then be used to identify or more implicated tuples in the data structure for updating. Thus, aggregates associated with each of the implicated tuples for each of the one or more value attributes, may be updated by applying an appropriate aggregation function to each identified value attribute-value pair. In this way the multi-dimensional data structure may maintain real-time analytics of the data stream.
- In exemplary embodiments, a distributed architecture, such as Muppet (map, update), may be used to implement the storage media, systems and methods of the present disclosure. Exemplary implementations of Muppet are further described herein as well as in U.S. non-provisional patent application entitled “Processing Data Feeds,” filed Nov. 18, 2011 (Attorney Docket No. 114826-50302). In general, a distributed architecture may be used to map an event to one or more implicated tuples in one or more multi-dimensional data structures and update, for each of the implicated tuples, one or more slates, for example based on one or more value attribute-value pairs in the event. Advantageously, slates for a plurality of implicated tuples may be updated in parallel for example, using different processing nodes.
- The terms “map” and “mapper,” as used herein, relate to a stream operation performed in exemplary embodiments in which events in a data stream are processed in a real-time manner to generate one or more new events which are then published to a same or different data stream. In exemplary embodiments, a mapper may be used to publish events to one or more updaters for updating an aggregate value contained in a slates.
- The terms “update” and “updater” refer to a stream operation performed in exemplary embodiments in which events in one or more real-time data streams are processed in a real-time manner to create or update one or more persistent static “slate” data structures that are stored in a persistent manner, for example, in a durable disk storage (note that, as used herein the terms “store,” “stored” “storage” etc., imply persistence a non-transitory storage medium). In some exemplary embodiments, an updater may generate zero, one or more new stream events. The generated stream events may be published to one or more real-time data streams. In an exemplary embodiments, an updater may publish stream events to a data stream from which it accepts stream events as input.
- As used herein, the term “slate” refers to a static data structure that may be used to record aggregates as described herein. A slate may have any suitable data structure or format. In an exemplary format, a slate may include a collection of one or more attribute-value pairs. A slate may be stored corresponding to a unique slatekey and updater that updates the slate.
- In exemplary embodiments, time-based aggregates of a given value attribute over a given time-window may be maintained in a multi-dimensional data structure (sometimes referred to herein as a data cube). The dimensions of the data structure generally reflect one or more dimension attributes selected in a pre-established framework of potential query parameters.
- In a naïve implementation of a distributed architecture for maintaining the data structure, fan-out may exponentially relate to the number of dimensions in the data structure. The term “fan-out” for a distributed architecture may refer to the ratio of internal events generated by the mapping function relative to the number of external events (e.g., tweets™) processed.
- Since, handling and storing such a volume of data may prove impractical, alternative implementations of a distributed architecture are also presented herein that take advantage of various properties of data streams to considerably reduce fan-out to a manageable number.
- With initial reference to
FIG. 1 , anexemplary event 100 in adata stream 10 is depicted.Event 100 may be processed to identify a plurality of attribute-value pairs 110 contained therin. Examples of attributes may include: -
- Event ID, for example, a unique per-event identifier.
- Sentiment, for example, with potential values of +1, 0, or −1, indicating positive, neutral, or negative sentiment.
- Gender, for example, with potential values of Male, Female, and Unknown.
- Country, for example, with potential values drawn from an enumerated set of country codes including unknown.
- Topic, for example, with potential values detected via semantic analysis.
- Product, for example, with potential values drawn from a product database.
- Price, for example, with potential values in different currencies.
- Timestamp, for example, based on time published or time received.
- One usefull attribute, accordingly to the present disclosure, is the timestamp. In exemplary embodiments, two assumptions may be made regarding the timestamp: first, that the timestamp represents actual wall-clock time in some appropriate timezone; and second, that timestamps are monontonically increasing. These assumptions are generally reasonable for streaming data (for example, in Twitter™ each tweet™ contains a timestamp that satisfies these conditions). One reason that timestamps are useful is that query results are represented as aggregates over a time window. Using timestamps, aggregates may be unambiguously interpreted to include a set of events whose timestamps fall within the specified time window.
- With reference to
FIG. 2 , anexemplary query 200 is depicted. Query 200 may specify, for example, atime window 210, a set of dimension attribute-value pairs 220, one or more value attributes ofinterest 230, and anaggregate function 240 related to eachvalue attribute 230. Exemplary instances ofquery 200 are described below: - Query: How many people posted about product P between 10 am and 11 am today?
Time Window: 10 am to 11 am today - Value attribute: Event Id
- Example 1 may be rewritten as the following SQL query:
- FROM event E
WHERE t.product=“P” - AND t.timestamp>=10 am AND t.timestamp<=11 am
- How many women in Arizona posted about product P between 10 am and 11 am?
Same as last example except for additional Dimension Attribute-Value Pairs: - What was the sentiment about product P among women in Arizona in December 2010?
- In the third query example, AggregateSentiment represents a custom-defined aggregation function for combining sentiment values. For example, the function may maintain a 3-tuple of count, one each for positive, neutral, and negative sentiments. Alternatively, the function may be configured to calculate an average sentiment.
- Time Windows:
- In exemplary embodiments, aggregates may be calculated for each of a plurality of time windows. In some embodiments arbitrary time windows may be supplanted by a standard set of time windows, for example:
-
- By the minute for the past 60 minutes
- By the hour for the past 24 hours
- By the week for the past 4 weeks
- By the month for the past 24 months (for y-o-y same month comparisons)
- By the year for the past 10 years
- In general, the standard set of time windows may reflect an assumption that the further back time the coarser the time granularity of interest. Thus, the standard set of time windows may include time windows of varying time granularity. In exemplary embodiments, it may be sufficient to maintain aggregates for a finite number of progressively older and courser sets of time windows, such reflected above.
- Aggregate Functions:
- In general there are two kinds of aggregate functions: Algebraic and Holistic. Roughly speaking, algebraic aggregates are those, like SUM, that can be computed incrementally; in other words, by aggregating subsets of the data, and computing the final result using those aggregates without going back to the base data. In contrast, Holistic aggregates typically require the base data when recalculating the aggregate. One example of a holistic aggregate is the median. Suppose you divide a data set arbitrarily into two parts, and compute the median of the two parts; there is no way to compute the median of the entire data set from the medians of the two parts.
- The systems and method of the subject disclosure typically utilize algebraic aggregation functions, which may be computed incrementally. Thus, for example, an average may be represented as a 2-tuple a sum and a count, wherein the average may be calculated by dividing the sum by the count. The use of algebraic aggregations functions, advantageously simplifies the update process, thereby facilitating the real-time data processing and analytics as described herein. In exemplary embodiments aggregates are assumed to be commutative and associative, which makes manipulations thereof simpler.
- Data Cubes:
- As noted above, the storage media, systems and methods of the present disclosure may advantageously utilize a multi-dimensional data structure for maintaining a universe of aggregates for a pre-established framework of potential query parameters, wherein the pre-established framework of potential query parameters is characterized, by one or more time windows, one or more dimension attributes, one or more value attributes and one or more aggregation functions.
- Suppose, for example, a framework characterized by time window W, dimension attributes Topic (T), State (S), Gender (G), a single value attribute Sentiment (Se), and an aggregate function ƒ. The following are the aggregates may be of interest for instances t, s and g of T, S and G:
- (1) T=t, S=s, G=g, f(Se), W
(2) T=t, S=s, f(Se), W
(3) T=t, G=g, f(Se), W
(4) S=s, G=g, f(Se), W
(5) T=t, f(Se), W
(6) S=s, f(Se), W
(7) G=g, f(Se), W
(8) All, f(Se), W (computed across all events in time window W) - The multi-dimensional data structure for storing aggregates f(Se), W for all potential combinations and values of t, s, and g, may be referred to as the data cube for aggregate f(Se) and time window W. The term data cube refers to the fact the aggregates may be arranged as the vertices of a hypercube. In general, given K dimension attributes, there are 2K aggregates for each set of values of the dimension attributes, which is the number of vertices of a hypercube in K dimensions.
- As describe herein, a set of dimension attribute-value pairs, such as T=t, G=g, may be referred to as a tuple. Referring to the above example, the data cube for the aggregate f(Se) allows rapid lookup of the aggregate of value attribute Se for every tuple over timewindow W. In exemplary embodiments, a data cube may store aggregates for a plurality of different time windows, for example for a standard set of time windows such as described herein.
- With reference to
FIG. 3 , anexemplary data cube 300 is depicted for three dimension attributes: Topic (T) 310, Location (L) 320 and Sentiment (S) 330. Topic (T) 310 may include as instances, names for various topics, for example, as grouped via a semantic hierarchy. Location (L) 320 may include as instances, names of locations, for example names of States in the united States. Sentiment (S) 330 may include instances selected from positive negative or neutral. The dimension attributes Topic (T) 310, Location (L) 320 and Sentiment (S) 33 may be reflected along the vertices of thedata cube 300. - The
data cube 300 may maintain for each tuple (t, 1, s) of T, L, S an aggregate for a value attribute (in this case: event count, i.e., the number of events processed for the tuple (t, l, s)). In exemplary embodiments, each tuple (t, l, s) may be associated with a slate for storing the event count. In some embodiments the slate may further be associated with an updater for updating the slate for a new event and a mapper for mapping new events to the updater. - In exemplary embodiments,
data cube 300 is updated based on a new event. Thus, a new event may be processed to identify one or more dimension attributes therein. For example, a new event may state “I love living in NYC,” from which dimension attributes Person (P), Location (L) and Sentiment (S) may be extracted (for example P=user L=NYC (New York City), and S=positive may be extracted. The tuples of (P, L, S) implicated are as follows: - (user, NYC, positive)
- (user, NYC,)
- (user, , positive)
- (, NYC, positive)
- (user,)
- (, NYC,)
- (, positive)
- (, ,)
- Thus, overlapping tuples of (T, L, S) implicated by the new event are:
- (, NYC, positive)
- (, NYC,)
- (, , positive) and
- (, ,)
- Thus, the event count associated with each of the four implicated tuples may be updated (for example, by incrementing the count by one).
- Note that for an event containing L dimension attributes, M of which overlap with K dimension attributes of a data cube, there are 2M tuples of the data cube which are implicated (i.e. 2M tuples which overlap between the data cube and the event). Thus, a mapper may be used generate the 2L tuples for the event and map the 2M subset thereof to the data cube. An update may then be used update a slate associated with each of the 2M tuples received from the mapper. Distributed architecture implementations for maintaining a data cube are described in greater below
-
Data cube 300 advantageously maintains, in real time, results for any prospective query using the framework (T, L, S), of an event count over one or more pre-established time windows.FIG. 3 , illustrates three examples of queries 340 a-c, the answers to which are maintained and therefore pre-computed indata cube 300. For example, query 340 a asks “how many people are posting about Barack Obama in New York?” The result to query 340 a may be obtained by returning the event count for the tuple (Barack Obama, New York, All), for example the event count stored in the slate associated with the tuple (Barack Obama, New York, All). As another example, query 340 b asks “How many people in Arizona feel positive of the new Medicare plan?” The result to query 340 b be obtained by returning the event count for the tuple (Medicare, Arizona, Positive). As another example, query 340 c asks “How many people feel negative of Barack Obama across the US?” The result to query 340 c may obtained by returning the event count for the tuple (Barrack Obama, United States (e.g., all states), Negative). - In exemplary embodiments, it is contemplated that the number of the dimension of a data cube may be automatically determined based on the types of attributes reflected in the data stream. For example, event types with the greatest frequencies, such as above a selected threshold, may be used as the dimensions for the data cube. Thus, for example the data stream may be analyzed to determine the best candidate attributes for cube dimensions.
- Naïve Distributed Architecture Implementation:
- Referring to
FIG. 4 , an exemplary implementation of a distributedarchitecture 400 for maintaining a data cube may include a mapper, for example,CubeMapper 410, and one or more updaters, for example,CubeTupleUpdaters 420. The mapper and/or updaters may be distributed to one or more processing nodes in the distributed architecture, e.g., via a network architecture such as described herein, for example with reference toFIG. 10 . - The
CubeMapper 410 may advantageously determine, for example based on a set of attribute-value pairs extracted for an event, which data cubes to maintain. In exemplary embodiments, a data cube may be defined by a set of dimension attributes (for example, Topic, Gender, State), and an aggregation function for a value attribute. In exemplary embodiments, a configuration file may list the data cubes of interest, and give each data cube a name. The aggregation function may be specified, for example, in javascript, as a function that takes two parameters (the current value of the aggregate and a new event) and returns a single value (the new value of the aggregate). Note that as a special case, the current value may be null, in which case the function may return the aggregate corresponding to just the one event. The value of the aggregate may be in any data structure/format, for example a JSON object. - In some embodiments, a generated data cube may apply only to specific kinds of topics. For example, the query framework may call for a data cube specific for persons of interest, such as customers, celebrities, etc., or for occasions such as holidays, the Oscars, etc. Thus, in exemplary embodiments the mapper may implement a selection function, based on selection criterion, to filter out only a subset of events from a data stream. The selection function may, for example, accept a single parameter (the event) and return either True (this event's data should be part of this data cube) or False.
- An example of a configuration file listing data cube's of interest is provided below:
- Cubes.congfig:
- Select Function: True ##all events
- AggregateFunction: lambda(sentiment, event) { . . . ; return sentiment}
- SelectFunction: lambda(event) {return event.event_id=oscars;}
- AggregateFunction: lambda(sentiment, event) { . . . ; return sentiment}
- The CubeConfig file may advantageously be replicated for reference at each processing node in the distributed architecture.
- The
CubeMapper 410 may processes each event in a data stream and determine which data cubes it is eligible for. Suppose an event E with K dimension elements (for example, K=2 dimension elements a and b depicted inFIG. 3 ) is eligible for a data cube. TheCubeMapper 410 constructs the 2K tuples from the event E (for example, the 22 tuples: (a), (b), (a,b) and (All) depicted inFIG. 3 ) and generates an event E1 for each tuple. The key for each event E1 is the pair (CubeName, Tuple) and the value is the content event E, for example, including a value attribute-value pair. The generated events E1 are sent on to theCubeTupleUpdaters 420. - Each
CubeTupleUpdater 420 maintains a slate for every (CubeName, Tuple) pair it receives. Thus, the CubeTupleUpdater receives an event E1 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store. In exemplary embodiments, The updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein. Thus, a query for any tuple and any Time Window may be answered via a quick slate lookup. - There are two potential problems with the naïve implementation described above. First, for a data cube of K dimensions, the
CubeMapper 410 may generate 2K events for each incoming event. This leads to very high fan-out for cubes with K>3. Second since, statistics are stored for every possible tuple that occurs in the data, The number of cube slates required is proportional to the number of possible combinations of dimension values that actually occurs in the data. For example, suppose there are 50 states, 2 genders, and 1 million topics. The number of slates needed may approach 50×2×1 million or 100 million. This may be impractical from a storage perspective. Exemplary alternate implementations of a distributed architecture present herein may help mitigate/prevent such potential problems. - A key property of data stream analytics is that events don't exist in a vacuum but rather often reflect and are influenced by a collective pulse. Thus, events often exhibit a great deal of clustering, for example, of topics, products, people, etc. Moreover, it is expected that queries to the data cube involve instances of dimension elements that are of interest to a large number of people. Taking advantage of the forgoing assumptions, tuples may be advantageously filtered based on frequency. Frequency filtering may be implemented, for example, by selecting a small time window (for example 1 minute) referred to as the delta window D. Let S be the set of all tuples corresponding to all social updates during a delta window D. A threshold δ is applied to filter out all tuples in S with frequency less than δ from being sent to updaters (for example,
CubeTupleUpdaters 320 ofFIG. 3 ). For example, δ may be selected to be 1 or 2. A simple experiment with actual data suggests that setting δ to 2 may eliminate over 90% of tuples in a delta window of 1 minute. Moreover, a qualitative examination of such tuples indicated that the eliminated tuples generally came from events that are not really of interest (for example, spam, outliers of some sort, or just semantic analysis errors). Thus, filtering also a second effect of reducing noise, e.g., from semantic analysis errors. On the other hand, tuples that do occur frequently in a 1-minute window often correlate well with the global interests. Frequency filtering of tuples thus both improves performance and improves the quality of the data cube. -
FIG. 5 depicts an exemplary distributedarchitecture 500 for maintaining a data cube while implementing frequency filtering. Thus, the distributedarchitecture 500 may include a mapper (CubeS elector 510) and 3 types of updaters (CubeTupleGenerators 520, theCubeTupleCollectors 530, and the CubeTupleUpdaters 540). - Suppose T is a current timestamp. Interval I may then be defined as follows: I=floor(T/D), where D is the delta window (e.g., 1 minute). That is, the interval I counts time in units of the Delta Window. The
CubeTupleGenerators 520 buffer all tuples with the same Interval, and then dispatch them to theCubeTupleCollectors 530. - In exemplary embodiments, an assumption may be made that the stream has a large number of events in each Delta Window—that is, the Delta Window is very large compared to the average gap between events (for example, Twitter™, processes approximately 100,000 tweets™ in a Delta Window of 1 minute resulting in an average inter-event gap of less than a millisecond). Thus, in exemplary embodiments, one may detect when an interval has ended and the next one has begun based on a processing of the first event whose Interval is higher than the current Interval. Alternatively, intervals may be tracked independent of events received.
- In some embodiments, intervals may be based on e.g., a requisite number of events received rather than a time frame (for example, provided that the integration of such batches over the one or more time windows being tracked results in an acceptable margin of error). In other embodiments, the threshold δ may be variable, e.g., based on a changing frequency of events in the data stream. In some embodiments, a different threshold may be applied depending on the number of dimension attributes-value pairs in the tuple. Thus, frequency filtering may account for variations in event traffic when filtering tuples.
- With reference again to
FIG. 5 , for each data cube, there may be P CubeTupleGenerator slates and P CubeTupleCollector slates, withkeys 1, . . . , P. So as to ensure uniform distribution across the cluster, P may be selected to be a small multiple of the number of processing nodes in the distributedarchitecture 500. For example, if the distributedarchitecture 500 includes 8 processing nodes, one might select P=32 or 64. - The CubeSelector 510 is similar to the
CubeMapper 210 in the naïve implementation. It uses the config file to determine if the current Event E is eligible for a data cube. If it is, the CubeSelector 510 uses a hash function to map the timestamp of the event to one of P values, and emits to stream X an event E1 whose key is (CubeName, P) and value is the content of the event E. - Each CubeTupleGenerator 520 (CTG) subscribes to stream X. A CTG generates all possible tuples from an event E1 and maintains a table with 3 columns: Tuple, AggregateValue, and Count. Here Count is the number of events that have contributed to this tuple. The CTG also stores the Interval I of the first event it received during this Delta Window. When the CTG receives a new event E1, it computes its Interval J. If J=I, the CTG enumerates the tuples of E1 and updates its table accordingly. On the other hand, if J>I, the CTG starts distributing its data as follows:
- For each tuple, the CTG uses a hash function to map the tuple into one of
P buckets - Each CubeTupleCollector 530 (CTC) subscribes to stream Y. For each interval I, a CTC slate receives P events , one from each CTG slate. The CTC maintains a table with 3 columns: Tuple, AggregateValue, Count. The AggregateValue for a tuple is the aggregate of the partial AggregateValues for that tuple from each CTG, and the Count is the sum of the corresponding partial counts. The CTC also keeps track of the current interval I and a message counter M, both initialized to 0. When the CTC receives an event E2 from the CTG with interval J, it first compares I and J:
- If I=J, the CTC table is updated using the tuples in the event, and M is incremented by 1. If M=P (i.e., the CTC has received events from all the P CTCs), the CTC knows it has received all the tuples for the interval I. At this point, the CTG discards all infrequent tuples with count less than the threshold δ. For every frequent tuple, it creates a new event E3 whose key is the pair (CubeName, Tuple) and whose value contains the Aggregate and the Count. The event E3 is published to stream Z, to which the CubeTupleUpdaters 540 subscribe. The CTC then empties its table, sets its current interval to I+1 and resets M to 0.
- If J>I, this means is that the CTC has received tuples for the next interval before it receives all the tuples for the current interval. Assuming that the aforementioned assumption about the Delta Interval being much larger than the average interval between events holds true, the only real reason for this is a node failure, which should be relatively infrequent. In this case, the interval I is declared closed, the tuples accumulated thus far are filtered. I is set to be equal to J, M to 1, and process the newly arrived event.
- Finally, If J<I, a delayed event for an interval that is past was received. This event is ignored.
- Each CubeTupleUpdater 540 maintains a slate for every (CubeName, Tuple) pair it receives. Thus, the CubeTupleUpdater receives an event E3 for a single tuple, extracts an instance of a value attribute and applies the aggregation function to add the instance to its store. In exemplary embodiments, The updater keeps track of the aggregate value for a plurality of time windows for example a standard set of time windows such as described herein. Thus, a query for any tuple and any Time Window may be answered via a quick slate lookup.
- With reference to
FIGS. 6 a-g an exemplary worked example of the distributedarchitecture 500 is depicted, demonstrating exemplary contents ofCTG slates 620 andCTC slates 630 of a CTG and CTC such as described above with reference toFIG. 5 . For the worked example, depicted , a data cube C with two dimensions (A and X) is assumed. The aggregate is SUM and the filtering threshold δ=1. During the Delta Window, 4 tuples arrive. InFIG. 6 a, an event E including attribute-value pairs A=a X=x and value attribute V=5 is received by theCube Selecter 610. InFIG. 6 b, an event E including attribute-value pairs B=b Y=y and value attribute V=3 is received by theCube Selecter 610. InFIG. 6 c, an event E including attribute-value pairs A=a Y=y and value attribute V=2 is received by theCube Selecter 610. InFIG. 6 d an event E including attribute-value pairs A=a Y=y and value attribute V=4 is received by theCube Selector 610. In each case,Cube Selecter 610 uses the config file to determine if the Event E is eligible for a data cube, for example data cube C. If it is, theCubeSelector 610 uses a hash function to map the timestamp of the event to one of P Values and emits an event E1 whose key is (CubeName, P) and value is the content of the event E. As depicted inFIGS. 6 a -d CTG slates 620 buffer all tuples with the same interval. More particularly theCTG slates 620 maintains a table with 3 columns: Tuple, AggregateValue, and Count. With reference toFIG. 6 e, once the interval is closed, a hash function to map each from theCTG slates 620 tuple into one ofP buckets example buckets 625 ofFIG. 6 e. For each bucket in 1,. . . P−1, the creates and emits an event E2 whose key is the bucket number, and whose value is the set of rows in its table whose tuple maps to that bucket number. With reference toFIGS. 6 f and 6 g, for each interval I,CTC slates 630 receive P events, one from eachCTG slate 620. TheCTC slates 630 maintains a table with 3 columns: Tuple, AgregateValue, Count. As depicted, once the CTG slate has sent out its events E2, it empties its tables and begins processing events E1 for a new interval. As depicted in Figure g6, once theCTC slates 630 have received all the tuples for a given interval all infrequent tuples, for example with count less than the threshold δ are discarded. Then for every frequent tuple, a new event E3 is created whose key is the pair (CubeName, Tuple) and whose value contains the Aggregate and the Count. The event E3 is published to stream Z subscribed to by CubeTupleUpdaters. Data cube C could then be updated for eligible events E3. - Exemplary implementations presented above, primarily related to point aggregates—wherein a tuple (a point in the hypercube) is specified and the desired result of the query is the corresponding aggregate value. The storage media, systems and methods herein, however, are not limited to such implementations. Indeed in some embodiments, the data cube is used tracks results for different type of queries for, example, leaderboard queries.
- A leaderboard query specifies a tuple and a leaderboard attribute, and asks for the Top-N values of the leader board attribute among events that satisfy the tuple. For example, a leaderboard query might pose the following question:
- What were the
Top 10 most popular topics posted about by people in Arizona between 10 am and 11 am today? - SELECT topic, count(eventid) as freq
FROM events e
WHERE t.state=“AZ” and t.timestamp >=“10 am” AND t.timestamp<“11 am”
GROUP BY topic
ORDER BY freq - Other exemplary leaderboard queries might pose the following question:
What were theTop 10 least popular topics posted about by people in Arizona between 10 am and 11 am today?
What were theTop 10 locations closest to New York where users posted about a sales event. - The data cube may easily be extended to support leaderboard queries. In the simple case, if the leaderboard attribute has a small and known set of values (e.g., state or gender), one can simply look up the slates corresponding to all possible values and pick the Top-N. The harder case occurs when the leaderboard attribute can take on a large number of values (or example, topic, product, URL, domain). In this case we cannot possibly examine all slates for all values of the attribute.
- In exemplary embodiments, leaderboard queries are indicated by adding a line to the data cube config file to indicate the leaderboard dimensions for each cube and the value for “N” (for Top-N queries; e.g., 10). For example:
- Select Function: True ##all events
- AggregateFunction: lambda(sentiment, event) { . . . ; return sentiment}
- To simplify the present description, assume there is at most one leaderboard dimension in each data cube (ss would be appreciated by one of ordinary skill in the art, it is straightforward to support more given the description for one). Non-leaderboard dimensions of the data cube may be referred to as ordinary dimensions.
- Recall that in the distribution step of the CTG a tuple is hashed into one of P buckets. For the purposes of a leaderboard query the hash function is applied only to the ordinary dimensions of each tuple. This ensures all tuples of the form: A=a, B=b, C=*, where C is the leaderboard attribute and A and B are ordinary attributes, end up at the same slate of the CTC. The CTC can now compute the Top-N leaderboard values for each tuple, and pass them on to the CTU in the event it constructs for the tuple. For example, the Top-N values may be characterized as resulting in the highest aggregates, the lowest aggregates, and the aggregates closest to a selected value, etc.
- Thus, the CTU may maintain the leaderboard for each time window. Where a time window is used for filtering by the CTC, the CTU may get the leaderboard directly from the CTC. Leaderboards for larger time windows are approximated using those for the smaller time windows. For example, an hourly leaderboard may be computed by looking at the
Top 10 for each minute of the hour, and summing the minute-by-minute values. While it is possible to make some errors using this method (for example, where the most frequent item during the hour was not in the top 10 for any minute of the hour) accurate results are generally achieved. - One potential disadvantage in querying using dimensions requiring semantic analysis, is that the semantic analysis is limited by the available hierarchy. For example, a semantic analysis engine can only tag instances it recognizes. An interesting advantage the storage media, systems and methods described herein is that the data cube may be used to maintain statistics on topics products, locations etc. (collectively, referred to as interesting topics or topics of interest) without the help of semantic analysis. Thus in exemplary embodiments a “K-Gram” is added as one of the data cube dimensions, with a relative high threshold 6 (e.g., 50). Thus, a slate is automatically created and maintained for any k-gram that was mentioned very often in a delta time window. In exemplary embodiments, a TF.IDF threshold may be instead of a pure frequency threshold to achieve better results. Thus, the data cube may be used to identify trending k-grams and automatically maintains stats for them without help from semantic analysis. In exemplary embodiments identified topics of interest can be communicated to the semantic analysis engine and “candidate topics” and the maintained statistics relating thereto can be used by the engine to whether to incorporate each candidate topic into the taxonomy.
-
FIGS. 7 a-c, illustrate exemplary methods according to the present disclosure. - With reference to
FIG. 7 a, anexemplary method 710 is depicted for performing real time analytics on streaming data.Method 710 generally includes steps of: (712) processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes; (714) identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and (716) for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes. - With reference to
FIG. 7 b, anexemplary method 720 is depicted for performing real time analytics on streaming data.Method 720 generally includes steps of (722) establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute; and (724) generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows. - With reference to
FIG. 7 c anexemplary method 730 is depicted for performing real time analytics on streaming data.Method 730 generally includes steps of: (732) processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and (734) returning one of the stored aggregates in response to a query. - It is explicitly contemplated that the storage media, systems and methods presented herein may include one or more programmable processing units having associated therewith executable instructions held on one or more computer readable medium, RAM, ROM, hard drive, and/or hardware. In exemplary embodiments, the hardware, firmware and/or executable code may be provided, for example, as upgrade module(s) for use in conjunction with existing infrastructure (for example, existing devices/processing units). Hardware may, for example, include components and/or logic circuitry for executing the embodiments taught herein as a computing process.
- Displays and/or other feedback means may also be included to convey detected/processed data, for example adjusted output representative of a particle characteristic. The display and/or other feedback means may be stand-alone or may be included as one or more components/modules of the processing unit(s). In exemplary embodiments, the display and/or other feedback means may be used to facilitate querying a data cube. In other embodiments, the display may be used to visualize, in real-time, various social statistics maintained by the data cube. For example, as depicted in
FIGS. 8 a and 8 b, real-time social statistics may be overlaid on a geographic map. - The actual software code or control hardware which may be used to implement some of the present embodiments is not intended to limit the scope of such embodiments. For example, certain aspects of the embodiments described herein may be implemented in code using any suitable programming language type such as, for example, assembly code, C, C# or C++ using, for example, conventional or object-oriented programming techniques. Such code is stored or held on any type of suitable non-transitory computer-readable medium or media such as, for example, a magnetic or optical storage medium.
- As used herein, a “processor,” “processing unit,” “computer” or “computer system” may be, for example, a wireless or wire line variety of a microcomputer, minicomputer, server, mainframe, laptop, personal data assistant (PDA), wireless e-mail device (for example, “BlackBerry,” “Android” or “Apple,” trade-designated devices), cellular phone, pager, processor, fax machine, scanner, or any other programmable device configured to transmit and receive data over a network. Computer systems disclosed herein may include memory for storing certain software applications used in obtaining, processing and communicating data. It can be appreciated that such memory may be internal or external to the disclosed embodiments. The memory may also include non-transitory storage medium for storing software, including a hard disk, an optical disk, floppy disk, ROM (read only memory), RAM (random access memory), PROM (programmable ROM), EEPROM (electrically erasable PROM), flash memory storage devices, or the like.
-
FIG. 9 depicts a block diagram representing anexemplary computing device 900 that may be used as a processing node (also referred to as a worker node) for aggregating and/or storing data as described herein, for example a processing node in a distributed architecture as described herein. Thecomputing device 900 may be any computer system, such as a workstation, desktop computer, server, laptop, handheld computer, tablet computer (e.g., the iPad™ tablet computer), mobile computing or communication device (e.g., the iPhone™ mobile communication device, the Android™ mobile communication device, and the like), or other form of computing or telecommunications device that is capable of communication and that has sufficient processor power and memory capacity to perform the operations described herein. A distributed computational system may be provided comprising a plurality of such computing devices. - The
computing device 900 includes one or more non-transitory computer-readable media having encoded thereon one or more computer-executable instructions or software for implementing exemplary methods described herein. The non-transitory computer-readable media may include, but are not limited to, one or more types of hardware memory, non-transitory tangible media (for example, one or more magnetic storage disks, one or more optical disks, one or more USB flash drives), and the like. For example,memory 906 included in thecomputing device 900 may store computer-readable and computer-executable instructions or software for implementing exemplary embodiments. Thecomputing device 900 also includesprocessor 902 and associatedcore 904, and in some embodiments, one or more additional processor(s) 902′ and associated core(s) 904′ (for example, in the case of computer systems having multiple processors/cores), for executing computer-readable and computer-executable instructions or software stored in thememory 906 and other programs for controlling system hardware.Processor 902 and processor(s) 902′ may each be a single core processor or multiple core (904 and 904′) processor. - Virtualization may be employed in the
computing device 900 so that infrastructure and resources in the computing device may be shared dynamically. Avirtual machine 914 may be provided to handle a process running on multiple processors so that the process appears to be using only one computing resource rather than multiple computing resources. Multiple virtual machines may also be used with one processor. -
Memory 906 may include a computer system memory or random access memory, such as DRAM, SRAM, EDO RAM, and the like.Memory 906 may include other types of memory as well, or combinations thereof.Memory 906 may be used to store one or more slates on a temporary basis, for example, in cache. - A user may interact with the
computing device 900 through avisual display device 918, such as a screen or monitor, that may display one or more interfaces 920 that may be provided in accordance with exemplary embodiments. Thevisual display device 918 may also display other aspects, elements and/or information or data associated with exemplary embodiments. Thecomputing device 900 may include other I/O devices for receiving input from a user, for example, a keyboard or any suitablemulti-point touch interface 908, a pointing device 910 (e.g., a mouse, a user's finger interfacing directly with a display device, etc.). Thekeyboard 908 and thepointing device 910 may be coupled to thevisual display device 918. Thecomputing device 900 may include other suitable conventional I/O peripherals. In exemplary embodiments, the one or more of the interfaces 920 includes an application program interface (API). - The
computing device 900 may include one or moreaudio input devices 924, such as one or more microphones, that may be used by a user to provide one or more audio input streams. - The
computing device 900 may include one or morenon-transitory storage devices 924, such as a durable disk storage (which may include any suitable optical or magnetic durable storage device, e.g., RAM, ROM, Flash, USB drive, or other semiconductor-based storage medium), a hard-drive, CD-ROM, or other non-transitory computer readable media, for storing data and computer-readable instructions and/or software that implement exemplary embodiments as taught herein. For example, thestorage device 924 may provide aslate storage 926 for storing computer-executable instructions for implementing the social genome data structure as described herein, for example for storing an updating (via one or more updaters) one or more slates, as described herein. Thestorage device 924 may store one ormore map modules 932 and one ormore update modules 934, as described herein. Thestorage device 924 may be provided on thecomputing device 900 or provided separately or remotely from thecomputing device 900. Thestorage device 924 may be used to store one or more slates in a durable manner. - Exemplary mappers and updaters may be programmatically implemented by a computer process in any suitable programming language, for example, a scripting programming language, an object-oriented programming language (e.g., Java), and the like. In an exemplary object-oriented implementation, a general Mapper class or interface and Updater class or interface may be defined by the system to generally specify attributes and functionality of a generic update operation. For each desired update operation, a sub-class may be created based on the Updater class. One or more object instances may be created from each sub-class at a processor node, for example, a CubeTupleGenerator object may be instantiated from a CubeTupleGenerator sub-class.
- The
computing device 900 may include anetwork interface 912 configured to interface via one ormore network devices 922 with one or more networks, for example, Local Area Network (LAN), Wide Area Network (WAN) or the Internet through a variety of connections including, but not limited to, standard telephone lines, LAN or WAN links (for example, 802.11, T1, T3, 56 kb, X.25), broadband connections (for example, ISDN, Frame Relay, ATM), wireless connections, controller area network (CAN), or some combination of any or all of the above. Thenetwork interface 912 may include a built-in network adapter, network interface card, PCMCIA network card, card bus network adapter, wireless network adapter, USB network adapter, modem or any other device suitable for interfacing thecomputing device 900 to any type of network capable of communication and performing the operations described herein. Thenetwork device 922 may include one or more suitable devices for receiving and transmitting communications over the network including, but not limited to, one or more receivers, one or more transmitters, one or more transceivers, one or more antennae, and the like. - The
computing device 900 may run anyoperating system 916, such as any of the versions of the Microsoft® Windows® operating systems, the different releases of the Unix and Linux operating systems, any version of the MacOS® for Macintosh computers, any embedded operating system, any real-time operating system, any open source operating system, any proprietary operating system, any operating systems for mobile computing devices, or any other operating system capable of running on the computing device and performing the operations described herein. In exemplary embodiments, theoperating system 916 may be run in native mode or emulated mode. In an exemplary embodiment, theoperating system 916 may be run on one or more cloud machine instances. -
FIG. 10 depicts anexemplary network environment 1000 suitable for a distributed implementation of exemplary embodiments. Thenetwork environment 1000 may include one ormore servers more clients communication network 1010. Thenetwork interface 912 and thenetwork device 922 of thecomputing device 900 enable theservers clients communication network 1010. Thecommunication network 1010 may include, but is not limited to, the Internet, an intranet, a LAN (Local Area Network), a WAN (Wide Area Network), a MAN (Metropolitan Area Network), a wireless network, an optical network, and the like. The communication facilities provided by thecommunication network 1010 are capable of supporting distributed implementations of exemplary embodiments. - In an exemplary embodiment, the
servers clients clients servers clients servers clients servers clients servers - Alternatively, in another exemplary embodiment, the
clients servers servers clients servers clients servers clients servers clients - In exemplary embodiments one or more mappers and one or more updaters for
example map module 932 andupdate module 934 ofFIG. 9 , may be distributed to throughout various processing nodes of thenetwork environment 1000, for example nodes 1012 a-d. - Although the teachings herein have been described with reference to exemplary embodiments and implementations thereof, the disclosed systems, methods and non-transitory storage medium are not limited to such exemplary embodiments/implementations. Rather, as will be readily apparent to persons skilled in the art from the description taught herein, the disclosed storage media, systems and methods are susceptible to modifications, alterations and enhancements without departing from the spirit or scope hereof. Accordingly, all such modifications, alterations and enhancements within the scope hereof are encompassed herein.
Claims (20)
1. A method for performing real time analytics on streaming data, the method comprising:
processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes;
identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes;
for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
2. The method of claim 1 further comprising tracking over an interval a tuple frequency for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval.
3. The method of claim 1 further comprising tracking over an interval, for each value attribute, aggregates for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes, and determining a top-N values for the one or more leadership dimension attributes over the interval.
4. The method of claim 3 , wherein the Top-N values are characterized as resulting in one of (i) the highest aggregates (ii) the lowest aggregates, and (iii) the aggregates closest to a selected value, over the interval.
5. The method of claim 3 further comprising determining a set of top-N values for each of a plurality of intervals in a time window and determining a top-N values for the one or more leadership dimensions attributes over the time window based on the plurality of sets of top-N values.
6. The method of claim 1 , wherein the one or more dimension attributes include a K-Gram for identifying topics of interest.
7. The method of claim 6 , further comprising tracking over an interval a tuple frequency for each of the implicated tuples including a K-Gram, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
8. A method for implementing a real time analytics platform, the method comprising:
establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute;
generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
9. A system for performing real time analytics on streaming data, the system comprising:
a processor for processing an event in a data stream to extract a set of attribute-value pairs for one or more dimension attributes and one or more value attributes;
a mapper for identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes; and
one or more updaters for updating, for each implicated tuple, one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
10. The system of claim 9 , wherein the system is configured to track over an interval a tuple frequency for each of the implicated tuples, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval.
11. The system of claim 9 wherein the system is configured to: (i) update, over an interval, for each value attribute, aggregates for a first plurality of implicated tuples having same attribute-value pairs for zero or more ordinary dimension attributes and different attribute-value pairs for one or more leaderboard dimension attributes, and (ii) determine a top-N values for the one or more leadership dimension attributes over the interval.
12. The system of claim 9 , wherein the one or more dimension attributes include a K-Gram for identifying topics of interest, wherein the system is configured to track over an interval a tuple frequency for each of the implicated tuples including a K-Gram, wherein the updating the one or more stored aggregates includes discarding each implicated tuple with a low tuple frequency over the interval, whereby statistics for trending K-Gram-value pairs are tracked.
13. A multi-dimensional data structure for implementing a real-time analytics platform characterized by one or more time windows, one or more dimension attributes, and one or more value attributes, the data structure comprising:
a plurality of tuples associated with the one or more dimension attributes; and
a slate associated with each tuple for maintaining an aggregate for each of the one or more value attributes over each of the one or more time windows.
14. A method for performing real-time analytics on a data stream, the methods comprising:
processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters;
returning one of the stored aggregates in response to a query.
15. A system for performing real-time analytics on a data stream the system comprising:
a processor for processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and
memory for storing the plurality of stored aggregates.
16. The system of claim 15 , further comprising an interface for providing a query, wherein the processor is configured to return one of the stored aggregates in response to the query.
17. A multi-dimensional data structure for implementing a real-time analytics platform, the data structure comprising:
a plurality of stored tuples each representing a set of search query parameters for prospective queries extrapolated from a pre-established framework of possible query parameters; and
one or more stored aggregates associated with each of the stored tuples, wherein each aggregate represents a result for a prospective query characterized by the set of search query parameters represented in the tuple associated with that aggregate.
18. A non-transitory computer readable medium storing processor executable instructions for performing real time analytics on streaming data, including instructions for:
processing events in a data stream to extract from each event a set of attribute-value pairs for one or more dimension attributes and one or more value attributes;
identifying one or more tuples in a multidimensional data structure implicated by the extracted attribute-value pairs for the one or more dimension attributes;
for each implicated tuple, updating one or more stored aggregates associated therewith, based on the extracted attribute-value pairs for the one or more value attributes.
19. A non-transitory computer readable medium storing processor executable instructions for performing real time analytics on streaming data, including instructions for:
establishing an analytics platform framework characterized by one or more time windows, one or more dimension attributes, and one or more value attribute;
generating a first multi-dimensional data structure for maintaining, for each tuple of the one or more dimension attributes, an aggregate of each of the one or more value attributes over each of the one or more time windows.
20. A non-transitory computer readable medium storing processor executable instructions for performing real time analytics on streaming data, including instructions for:
processing a data stream to maintain a plurality of stored aggregates for a universe of prospective queries extrapolated from a pre-established framework of possible query parameters; and
returning one of the stored aggregates in response to a query.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/300,523 US20120130940A1 (en) | 2010-11-18 | 2011-11-18 | Real-time analytics of streaming data |
US13/300,519 US9183270B2 (en) | 2010-05-17 | 2011-11-18 | Social genome |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US41527910P | 2010-11-18 | 2010-11-18 | |
US41528210P | 2010-11-18 | 2010-11-18 | |
US13/300,523 US20120130940A1 (en) | 2010-11-18 | 2011-11-18 | Real-time analytics of streaming data |
Publications (1)
Publication Number | Publication Date |
---|---|
US20120130940A1 true US20120130940A1 (en) | 2012-05-24 |
Family
ID=45316067
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/300,523 Abandoned US20120130940A1 (en) | 2010-05-17 | 2011-11-18 | Real-time analytics of streaming data |
US13/300,473 Active 2032-08-17 US8725592B2 (en) | 2010-05-17 | 2011-11-18 | Method, system, and medium for recommending gift products based on textual information of a selected user |
US13/300,519 Active 2032-02-18 US9183270B2 (en) | 2010-05-17 | 2011-11-18 | Social genome |
US14/936,426 Active US9679074B2 (en) | 2010-11-18 | 2015-11-09 | Social genome |
Family Applications After (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/300,473 Active 2032-08-17 US8725592B2 (en) | 2010-05-17 | 2011-11-18 | Method, system, and medium for recommending gift products based on textual information of a selected user |
US13/300,519 Active 2032-02-18 US9183270B2 (en) | 2010-05-17 | 2011-11-18 | Social genome |
US14/936,426 Active US9679074B2 (en) | 2010-11-18 | 2015-11-09 | Social genome |
Country Status (2)
Country | Link |
---|---|
US (4) | US20120130940A1 (en) |
WO (1) | WO2012068557A1 (en) |
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130254374A1 (en) * | 2012-03-21 | 2013-09-26 | International Business Machines Corporation | Resource allocation based on social networking trends in a networked computing environment |
US20130339473A1 (en) * | 2012-06-15 | 2013-12-19 | Zynga Inc. | Real time analytics via stream processing |
CN103916478A (en) * | 2014-04-11 | 2014-07-09 | 华为技术有限公司 | Streaming data cube establishing method and device based on distributed system |
US20140365524A1 (en) * | 2013-06-10 | 2014-12-11 | International Business Machines Corporation | Incremental aggregation-based event pattern matching |
US20150026207A1 (en) * | 2013-07-22 | 2015-01-22 | International Business Machines Corporation | Managing sparsity in an multidimensional data structure |
US20150339752A1 (en) * | 2011-09-14 | 2015-11-26 | International Business Machines Corporation | Deriving Dynamic Consumer Defined Product Attributes from Input Queries |
US9367424B2 (en) | 2014-03-13 | 2016-06-14 | International Business Machines Corporation | Method for performance monitoring and optimization via trend detection and forecasting |
US20170177573A1 (en) * | 2015-12-18 | 2017-06-22 | International Business Machines Corporation | Method and system for hybrid sort and hash-based query execution |
US20170177446A1 (en) * | 2015-12-21 | 2017-06-22 | Ab Initio Technology Llc | Search and retrieval data processing system for computing near real-time data aggregations |
US20170185583A1 (en) * | 2015-12-28 | 2017-06-29 | Facebook, Inc. | Language model personalization |
US9740687B2 (en) | 2014-06-11 | 2017-08-22 | Facebook, Inc. | Classifying languages for objects and entities |
US9749430B2 (en) | 2013-05-06 | 2017-08-29 | Microsoft Technology Licensing, Llc | Scalable data enrichment for cloud streaming analytics |
US9805029B2 (en) | 2015-12-28 | 2017-10-31 | Facebook, Inc. | Predicting future translations |
US9830404B2 (en) | 2014-12-30 | 2017-11-28 | Facebook, Inc. | Analyzing language dependency structures |
US9830386B2 (en) | 2014-12-30 | 2017-11-28 | Facebook, Inc. | Determining trending topics in social media |
US9864744B2 (en) | 2014-12-03 | 2018-01-09 | Facebook, Inc. | Mining multi-lingual data |
US9899020B2 (en) | 2015-02-13 | 2018-02-20 | Facebook, Inc. | Machine learning dialect identification |
US9904718B2 (en) | 2013-03-13 | 2018-02-27 | Wal-Mart Stores, Inc. | System and method for streaming events in a transaction-based system |
US20180075516A1 (en) * | 2016-09-14 | 2018-03-15 | Microsoft Technology Licensing, Llc | System for producing recommendations and predicting purchases of products based on usage patterns |
WO2018049183A1 (en) * | 2016-09-09 | 2018-03-15 | BloomReach, Inc. | Attribute extraction |
US9946522B1 (en) | 2016-12-16 | 2018-04-17 | International Business Machines Corporation | Generating code for real-time stream processing |
US10067936B2 (en) | 2014-12-30 | 2018-09-04 | Facebook, Inc. | Machine translation output reranking |
US10089299B2 (en) | 2015-12-17 | 2018-10-02 | Facebook, Inc. | Multi-media context language processing |
US10133738B2 (en) | 2015-12-14 | 2018-11-20 | Facebook, Inc. | Translation confidence scores |
US10147046B2 (en) | 2014-10-30 | 2018-12-04 | International Business Machines Corporation | System and methodology to handle misdirected input data during multi partitioned real time analytics |
US10289681B2 (en) | 2015-12-28 | 2019-05-14 | Facebook, Inc. | Predicting future translations |
US10324943B2 (en) | 2015-08-10 | 2019-06-18 | Business Objects Software, Ltd. | Auto-monitoring and adjustment of dynamic data visualizations |
US10346537B2 (en) | 2015-09-22 | 2019-07-09 | Facebook, Inc. | Universal translation |
US10380249B2 (en) | 2017-10-02 | 2019-08-13 | Facebook, Inc. | Predicting future trending topics |
US10459922B2 (en) | 2016-11-08 | 2019-10-29 | At&T Intellectual Property I, L.P. | Unique identification generation for records in a data streaming processing system |
US10540386B2 (en) * | 2013-08-09 | 2020-01-21 | Shaofeng YANG | Method for processing and displaying real-time social data on map |
EP3678033A1 (en) * | 2019-01-07 | 2020-07-08 | QlikTech International AB | A computer implemented method for indexlet based aggregation |
US10812322B2 (en) * | 2017-04-13 | 2020-10-20 | Walmart Apollo, Llc | Systems and methods for real time streaming |
US10902221B1 (en) | 2016-06-30 | 2021-01-26 | Facebook, Inc. | Social hash for language models |
US10902215B1 (en) | 2016-06-30 | 2021-01-26 | Facebook, Inc. | Social hash for language models |
CN112817965A (en) * | 2019-11-18 | 2021-05-18 | 百度在线网络技术(北京)有限公司 | Data splicing method and device, electronic equipment and storage medium |
CN113127512A (en) * | 2020-01-15 | 2021-07-16 | 百度在线网络技术(北京)有限公司 | Data splicing triggering method and device for multiple data streams, electronic equipment and medium |
US11100191B2 (en) | 2016-08-26 | 2021-08-24 | 1Qb Information Technologies Inc. | Method and system for performing real-time analytics on a plurality of data streams |
US11294876B2 (en) * | 2017-06-01 | 2022-04-05 | Oracle International Corporation | System and method for generating a multi dimensional data cube for analytics using a map-reduce program |
EP4033373A1 (en) * | 2021-01-25 | 2022-07-27 | QlikTech International AB | Methods and systems for undetermined query analytics |
CN116756215A (en) * | 2023-06-27 | 2023-09-15 | 上海蚂蚁创将信息技术有限公司 | Transaction in-transit state query method and system |
Families Citing this family (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120130940A1 (en) * | 2010-11-18 | 2012-05-24 | Wal-Mart Stores, Inc. | Real-time analytics of streaming data |
IL207176A0 (en) * | 2010-07-25 | 2011-04-28 | Verint Systems Ltd | System and method for video - assisted identification of mobile phone users |
US8990241B2 (en) * | 2010-12-23 | 2015-03-24 | Yahoo! Inc. | System and method for recommending queries related to trending topics based on a received query |
US9530167B2 (en) * | 2011-08-12 | 2016-12-27 | Facebook, Inc. | Coefficients attribution for different objects based on natural language processing |
US9536015B1 (en) * | 2011-09-06 | 2017-01-03 | Google Inc. | Using social networking information |
US20130204701A1 (en) * | 2012-02-06 | 2013-08-08 | Gianmauro Calafiore | Apparatus, system and methods for marketing targeted products to users of social media |
US20140074650A1 (en) * | 2012-03-01 | 2014-03-13 | Qloo, Inc. | Personalized cross-domain recommender system |
US20130297380A1 (en) * | 2012-05-02 | 2013-11-07 | Ebay Inc. | Inventory adjustment based on social network data |
JP5962213B2 (en) * | 2012-05-28 | 2016-08-03 | ソニー株式会社 | Information processing apparatus, information processing method, and program |
US9117237B2 (en) * | 2012-06-12 | 2015-08-25 | Gyft, Inc. | System, method, and medium for digital gift card selection |
US9424612B1 (en) * | 2012-08-02 | 2016-08-23 | Facebook, Inc. | Systems and methods for managing user reputations in social networking systems |
US10560057B1 (en) * | 2012-08-06 | 2020-02-11 | Google Llc | Measuring media attention over time based on long term heterogeneous archive data |
US20140067595A1 (en) * | 2012-08-31 | 2014-03-06 | Wal-Mart Stores, Inc. | Determining giftability of a product |
US9299100B2 (en) * | 2012-08-31 | 2016-03-29 | Wal-Mart Stores, Inc. | Determining giftability of a product based on recipient interests |
US9104667B2 (en) | 2012-09-24 | 2015-08-11 | International Business Machines Corporation | Social media event detection and content-based retrieval |
US9218392B1 (en) * | 2012-11-30 | 2015-12-22 | Amazon Technologies, Inc. | Interest related search results |
US9934283B2 (en) * | 2013-03-08 | 2018-04-03 | Google Llc | Social annotations for enhanced search results |
US20150149539A1 (en) * | 2013-11-22 | 2015-05-28 | Adobe Systems Incorporated | Trending Data Demographics |
US10607255B1 (en) | 2013-12-17 | 2020-03-31 | Amazon Technologies, Inc. | Product detail page advertising |
US9846904B2 (en) * | 2013-12-26 | 2017-12-19 | Target Brands, Inc. | Retail website user interface, systems and methods |
US9875561B2 (en) | 2014-05-20 | 2018-01-23 | Jeffrey C. Mohr | Method and system for dynamically creating and exploring graph structures |
US9858610B2 (en) * | 2014-08-29 | 2018-01-02 | Wal-Mart Stores, Inc. | Product recommendation based on geographic location and user activities |
US11042946B2 (en) | 2014-09-30 | 2021-06-22 | Walmart Apollo, Llc | Identity mapping between commerce customers and social media users |
US10621231B2 (en) * | 2015-08-24 | 2020-04-14 | Google Llc | Generation of a topic index with natural language processing |
US9454584B1 (en) | 2015-09-21 | 2016-09-27 | Pearson Education, Inc. | Assessment item generation and scoring |
US9460162B1 (en) | 2015-09-21 | 2016-10-04 | Pearson Education, Inc. | Assessment item generator |
US10832304B2 (en) * | 2016-01-15 | 2020-11-10 | Target Brands, Inc. | Resorting product suggestions for a user interface |
US10600062B2 (en) | 2016-03-15 | 2020-03-24 | Target Brands Inc. | Retail website user interface, systems, and methods for displaying trending looks by location |
US10776860B2 (en) | 2016-03-15 | 2020-09-15 | Target Brands, Inc. | Retail website user interface, systems, and methods for displaying trending looks |
US11100415B2 (en) * | 2016-10-04 | 2021-08-24 | University Of Louisiana At Lafayette | Architecture and method for providing insights in networks domain |
US10969933B1 (en) * | 2017-02-07 | 2021-04-06 | The Mathworks, Inc. | Graphical representation of ordered model items based on solver information |
US20190043065A1 (en) * | 2017-08-04 | 2019-02-07 | John Hall | Method and system of facilitating recommendation of digital content based on user responses |
CN109389451B (en) * | 2017-08-08 | 2021-10-19 | 阿里巴巴集团控股有限公司 | Method and system for determining recommendation information |
US20190066186A1 (en) * | 2017-08-24 | 2019-02-28 | Artivatic Data Labs Private Limited | Cross domain recommendation system and method |
US11475014B2 (en) * | 2018-12-20 | 2022-10-18 | AVAST Software s.r.o. | Updating a toplist for a continuous data stream |
US11706167B2 (en) | 2020-06-29 | 2023-07-18 | Snap Inc. | Generating and accessing video content for products |
US11748453B2 (en) * | 2021-02-17 | 2023-09-05 | International Business Machines Corporation | Converting unstructured computer text to domain-specific groups using graph datastructures |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6965886B2 (en) * | 2001-11-01 | 2005-11-15 | Actimize Ltd. | System and method for analyzing and utilizing data, by executing complex analytical models in real time |
US7146416B1 (en) * | 2000-09-01 | 2006-12-05 | Yahoo! Inc. | Web site activity monitoring system with tracking by categories and terms |
US7523462B1 (en) * | 2003-05-27 | 2009-04-21 | International Business Machines Corporation | Method for providing a real time view of heterogeneous enterprise data |
US20090138446A1 (en) * | 2007-11-27 | 2009-05-28 | Umber Systems | Method and apparatus for real-time multi-dimensional reporting and analyzing of data on application level activity and other user information on a mobile data network |
Family Cites Families (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090234712A1 (en) * | 1999-06-28 | 2009-09-17 | Musicip Corporation | Method and apparatus for automated selection, organization, and recommendation of items based on user preference topography |
US7421660B2 (en) * | 2003-02-04 | 2008-09-02 | Cataphora, Inc. | Method and apparatus to visually present discussions for data mining purposes |
WO2004027649A1 (en) | 2002-09-18 | 2004-04-01 | Netezza Corporation | Asymmetric streaming record data processor method and apparatus |
US20040167887A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Integration of structured data with relational facts from free text for data mining |
US7606168B2 (en) * | 2005-01-28 | 2009-10-20 | Attenex Corporation | Apparatus and method for message-centric analysis and multi-aspect viewing using social networks |
US7920584B2 (en) | 2005-05-04 | 2011-04-05 | Arm Limited | Data processing system |
US8655801B2 (en) | 2005-10-26 | 2014-02-18 | Cortica, Ltd. | Computing device, a system and a method for parallel processing of data streams |
US7668821B1 (en) * | 2005-11-17 | 2010-02-23 | Amazon Technologies, Inc. | Recommendations based on item tagging activities of users |
US7603619B2 (en) | 2005-11-29 | 2009-10-13 | Google Inc. | Formatting a user network site based on user preferences and format performance data |
EP1989639A4 (en) * | 2006-02-28 | 2012-05-02 | Buzzlogic Inc | Social analytics system and method for analyzing conversations in social media |
US8745661B2 (en) | 2006-07-31 | 2014-06-03 | Rovi Guides, Inc. | Systems and methods for providing enhanced sports watching media guidance |
US8930204B1 (en) * | 2006-08-16 | 2015-01-06 | Resource Consortium Limited | Determining lifestyle recommendations using aggregated personal information |
US8234623B2 (en) | 2006-09-11 | 2012-07-31 | The Mathworks, Inc. | System and method for using stream objects to perform stream processing in a text-based computing environment |
US7765259B2 (en) * | 2006-12-05 | 2010-07-27 | Avaya Inc. | System and method for aggregation of user conversations and visualizing personal communications map |
US7970891B1 (en) | 2007-01-17 | 2011-06-28 | Google Inc. | Tracking links in web browsers |
US20080215607A1 (en) * | 2007-03-02 | 2008-09-04 | Umbria, Inc. | Tribe or group-based analysis of social media including generating intelligence from a tribe's weblogs or blogs |
US8055664B2 (en) * | 2007-05-01 | 2011-11-08 | Google Inc. | Inferring user interests |
US8122006B2 (en) | 2007-05-29 | 2012-02-21 | Oracle International Corporation | Event processing query language including retain clause |
US7676461B2 (en) | 2007-07-18 | 2010-03-09 | Microsoft Corporation | Implementation of stream algebra over class instances |
US7853622B1 (en) * | 2007-11-01 | 2010-12-14 | Google Inc. | Video-related recommendations using link structure |
US20120203831A1 (en) * | 2011-02-03 | 2012-08-09 | Kent Schoen | Sponsored Stories Unit Creation from Organic Activity Stream |
US7882087B2 (en) | 2008-01-15 | 2011-02-01 | At&T Intellectual Property I, L.P. | Complex dependencies for efficient data warehouse updates |
US20110161827A1 (en) * | 2008-03-05 | 2011-06-30 | Anastasia Dedis | Social media communication and contact organization |
US8606721B1 (en) * | 2008-03-11 | 2013-12-10 | Amazon Technologies, Inc. | Implicit social graph edge strengths |
TWI418993B (en) * | 2008-06-27 | 2013-12-11 | Ind Tech Res Inst | System and method for establishing personal social network, trusted network and social networking system |
WO2010048172A1 (en) * | 2008-10-20 | 2010-04-29 | Cascaad Srl | Social graph based recommender |
US20100121707A1 (en) * | 2008-11-13 | 2010-05-13 | Buzzient, Inc. | Displaying analytic measurement of online social media content in a graphical user interface |
US20100128638A1 (en) | 2008-11-20 | 2010-05-27 | Sap Ag | Hierarchical shortest path first network routing protocol |
US20100169160A1 (en) * | 2008-12-30 | 2010-07-01 | Ebay Inc. | Gift recommendation method and system |
US8160996B2 (en) | 2009-02-02 | 2012-04-17 | The Hong Kong Polytechnic University | Sequence online analytical processing system |
US9330395B2 (en) | 2009-05-05 | 2016-05-03 | Suboti, Llc | System, method and computer readable medium for determining attention areas of a web page |
US20100306249A1 (en) * | 2009-05-27 | 2010-12-02 | James Hill | Social network systems and methods |
US8996556B2 (en) | 2009-06-05 | 2015-03-31 | Microsoft Technology Licensing, Llc | Parallel processing of an ordered data stream |
US20110004692A1 (en) * | 2009-07-01 | 2011-01-06 | Tom Occhino | Gathering Information about Connections in a Social Networking Service |
US8239364B2 (en) * | 2009-12-08 | 2012-08-07 | Facebook, Inc. | Search and retrieval of objects in a social networking system |
US8880600B2 (en) * | 2010-03-31 | 2014-11-04 | Facebook, Inc. | Creating groups of users in a social networking system |
US8326880B2 (en) * | 2010-04-05 | 2012-12-04 | Microsoft Corporation | Summarizing streams of information |
US8666979B2 (en) | 2010-04-09 | 2014-03-04 | Palo Alto Research Center Incorporated | Recommending interesting content using messages containing URLs |
US8185558B1 (en) * | 2010-04-19 | 2012-05-22 | Facebook, Inc. | Automatically generating nodes and edges in an integrated social graph |
US8918418B2 (en) * | 2010-04-19 | 2014-12-23 | Facebook, Inc. | Default structured search queries on online social networks |
US8782080B2 (en) * | 2010-04-19 | 2014-07-15 | Facebook, Inc. | Detecting social graph elements for structured search queries |
US9530166B2 (en) * | 2010-04-21 | 2016-12-27 | Facebook, Inc. | Social graph that includes web pages outside of a social networking system |
JP5810452B2 (en) * | 2010-05-16 | 2015-11-11 | アクセス ビジネス グループ インターナショナル リミテッド ライアビリティ カンパニー | Data collection, tracking and analysis methods for multimedia including impact analysis and impact tracking |
US20120130940A1 (en) * | 2010-11-18 | 2012-05-24 | Wal-Mart Stores, Inc. | Real-time analytics of streaming data |
US8595234B2 (en) | 2010-05-17 | 2013-11-26 | Wal-Mart Stores, Inc. | Processing data feeds |
US20120016661A1 (en) * | 2010-07-19 | 2012-01-19 | Eyal Pinkas | System, method and device for intelligent textual conversation system |
US9262517B2 (en) * | 2010-08-18 | 2016-02-16 | At&T Intellectual Property I, L.P. | Systems and methods for social media data mining |
US9240020B2 (en) * | 2010-08-24 | 2016-01-19 | Yahoo! Inc. | Method of recommending content via social signals |
US9852176B2 (en) * | 2010-09-03 | 2017-12-26 | Vocus, Inc. | Dynamic gathering of social media content |
US8892605B2 (en) * | 2010-12-03 | 2014-11-18 | Relationship Capital Technologies, Inc. | Systems and methods for managing social networks based upon predetermined objectives |
US8943154B1 (en) * | 2012-05-11 | 2015-01-27 | Amazon Technologies, Inc. | Systems and methods for modeling relationships between users, network elements, and events |
US8949250B1 (en) * | 2013-12-19 | 2015-02-03 | Facebook, Inc. | Generating recommended search queries on online social networks |
-
2011
- 2011-11-18 US US13/300,523 patent/US20120130940A1/en not_active Abandoned
- 2011-11-18 WO PCT/US2011/061547 patent/WO2012068557A1/en active Application Filing
- 2011-11-18 US US13/300,473 patent/US8725592B2/en active Active
- 2011-11-18 US US13/300,519 patent/US9183270B2/en active Active
-
2015
- 2015-11-09 US US14/936,426 patent/US9679074B2/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7146416B1 (en) * | 2000-09-01 | 2006-12-05 | Yahoo! Inc. | Web site activity monitoring system with tracking by categories and terms |
US6965886B2 (en) * | 2001-11-01 | 2005-11-15 | Actimize Ltd. | System and method for analyzing and utilizing data, by executing complex analytical models in real time |
US7523462B1 (en) * | 2003-05-27 | 2009-04-21 | International Business Machines Corporation | Method for providing a real time view of heterogeneous enterprise data |
US20090138446A1 (en) * | 2007-11-27 | 2009-05-28 | Umber Systems | Method and apparatus for real-time multi-dimensional reporting and analyzing of data on application level activity and other user information on a mobile data network |
Non-Patent Citations (2)
Title |
---|
Gupta et al., "CHAOS: A Data Stream Analysis Architecture for Enterprise Applications", IEEE Conference on Commerce and Enterprise Computing '09, Pages 33-40, IEEE, 2009 * |
Krishnan et al., "Scalable Visual Analytics of Massive Textual Datasets", IEEE International Parallel and Distributed Processing Symposium 2007, Pages 1-10, IEEE, 2007 * |
Cited By (68)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150339752A1 (en) * | 2011-09-14 | 2015-11-26 | International Business Machines Corporation | Deriving Dynamic Consumer Defined Product Attributes from Input Queries |
US9830633B2 (en) * | 2011-09-14 | 2017-11-28 | International Business Machines Corporation | Deriving dynamic consumer defined product attributes from input queries |
US20130254374A1 (en) * | 2012-03-21 | 2013-09-26 | International Business Machines Corporation | Resource allocation based on social networking trends in a networked computing environment |
US10353738B2 (en) * | 2012-03-21 | 2019-07-16 | International Business Machines Corporation | Resource allocation based on social networking trends in a networked computing environment |
US20130339473A1 (en) * | 2012-06-15 | 2013-12-19 | Zynga Inc. | Real time analytics via stream processing |
US9904718B2 (en) | 2013-03-13 | 2018-02-27 | Wal-Mart Stores, Inc. | System and method for streaming events in a transaction-based system |
US11010406B2 (en) | 2013-03-13 | 2021-05-18 | Walmart Apollo, Llc | System and method for streaming events in a transaction-based system |
US10318548B2 (en) | 2013-03-13 | 2019-06-11 | Walmart Apollo, Llc | System and method for streaming events in a transaction-based system |
US10735536B2 (en) | 2013-05-06 | 2020-08-04 | Microsoft Technology Licensing, Llc | Scalable data enrichment for cloud streaming analytics |
US9749430B2 (en) | 2013-05-06 | 2017-08-29 | Microsoft Technology Licensing, Llc | Scalable data enrichment for cloud streaming analytics |
US10306001B2 (en) | 2013-05-06 | 2019-05-28 | Microsoft Technology Licensing, Llc | Scalable data enrichment for cloud streaming analytics |
US9158824B2 (en) * | 2013-06-10 | 2015-10-13 | International Business Machines Corporation | Incremental aggregation-based event pattern matching |
US20140365524A1 (en) * | 2013-06-10 | 2014-12-11 | International Business Machines Corporation | Incremental aggregation-based event pattern matching |
US10275484B2 (en) * | 2013-07-22 | 2019-04-30 | International Business Machines Corporation | Managing sparsity in a multidimensional data structure |
US10169406B2 (en) | 2013-07-22 | 2019-01-01 | International Business Machines Corporation | Managing sparsity in an multidimensional data structure |
US20150026207A1 (en) * | 2013-07-22 | 2015-01-22 | International Business Machines Corporation | Managing sparsity in an multidimensional data structure |
US10540386B2 (en) * | 2013-08-09 | 2020-01-21 | Shaofeng YANG | Method for processing and displaying real-time social data on map |
US9367424B2 (en) | 2014-03-13 | 2016-06-14 | International Business Machines Corporation | Method for performance monitoring and optimization via trend detection and forecasting |
US10019505B2 (en) | 2014-04-11 | 2018-07-10 | Huawei Technologies Co., Ltd. | Method and apparatus for creating data cube in streaming manner based on distributed system |
TWI552543B (en) * | 2014-04-11 | 2016-10-01 | 華為技術有限公司 | Method and apparatus for creating data cube in streaming manner based on distributed system |
CN103916478A (en) * | 2014-04-11 | 2014-07-09 | 华为技术有限公司 | Streaming data cube establishing method and device based on distributed system |
US10002131B2 (en) | 2014-06-11 | 2018-06-19 | Facebook, Inc. | Classifying languages for objects and entities |
US10013417B2 (en) | 2014-06-11 | 2018-07-03 | Facebook, Inc. | Classifying languages for objects and entities |
US9740687B2 (en) | 2014-06-11 | 2017-08-22 | Facebook, Inc. | Classifying languages for objects and entities |
US10152677B2 (en) | 2014-10-30 | 2018-12-11 | International Business Machines Corporation | System and methodology to handle misdirected input data during multi partitioned real time analytics |
US10147046B2 (en) | 2014-10-30 | 2018-12-04 | International Business Machines Corporation | System and methodology to handle misdirected input data during multi partitioned real time analytics |
US9864744B2 (en) | 2014-12-03 | 2018-01-09 | Facebook, Inc. | Mining multi-lingual data |
US9830404B2 (en) | 2014-12-30 | 2017-11-28 | Facebook, Inc. | Analyzing language dependency structures |
US10067936B2 (en) | 2014-12-30 | 2018-09-04 | Facebook, Inc. | Machine translation output reranking |
US9830386B2 (en) | 2014-12-30 | 2017-11-28 | Facebook, Inc. | Determining trending topics in social media |
US9899020B2 (en) | 2015-02-13 | 2018-02-20 | Facebook, Inc. | Machine learning dialect identification |
US10324943B2 (en) | 2015-08-10 | 2019-06-18 | Business Objects Software, Ltd. | Auto-monitoring and adjustment of dynamic data visualizations |
US10346537B2 (en) | 2015-09-22 | 2019-07-09 | Facebook, Inc. | Universal translation |
US10133738B2 (en) | 2015-12-14 | 2018-11-20 | Facebook, Inc. | Translation confidence scores |
US10089299B2 (en) | 2015-12-17 | 2018-10-02 | Facebook, Inc. | Multi-media context language processing |
US11194778B2 (en) * | 2015-12-18 | 2021-12-07 | International Business Machines Corporation | Method and system for hybrid sort and hash-based query execution |
US20170177573A1 (en) * | 2015-12-18 | 2017-06-22 | International Business Machines Corporation | Method and system for hybrid sort and hash-based query execution |
US11989096B2 (en) * | 2015-12-21 | 2024-05-21 | Ab Initio Technology Llc | Search and retrieval data processing system for computing near real-time data aggregations |
US20170177446A1 (en) * | 2015-12-21 | 2017-06-22 | Ab Initio Technology Llc | Search and retrieval data processing system for computing near real-time data aggregations |
US20170185583A1 (en) * | 2015-12-28 | 2017-06-29 | Facebook, Inc. | Language model personalization |
US10002125B2 (en) * | 2015-12-28 | 2018-06-19 | Facebook, Inc. | Language model personalization |
US9805029B2 (en) | 2015-12-28 | 2017-10-31 | Facebook, Inc. | Predicting future translations |
US10289681B2 (en) | 2015-12-28 | 2019-05-14 | Facebook, Inc. | Predicting future translations |
US10540450B2 (en) | 2015-12-28 | 2020-01-21 | Facebook, Inc. | Predicting future translations |
US10902215B1 (en) | 2016-06-30 | 2021-01-26 | Facebook, Inc. | Social hash for language models |
US10902221B1 (en) | 2016-06-30 | 2021-01-26 | Facebook, Inc. | Social hash for language models |
US11100191B2 (en) | 2016-08-26 | 2021-08-24 | 1Qb Information Technologies Inc. | Method and system for performing real-time analytics on a plurality of data streams |
WO2018049183A1 (en) * | 2016-09-09 | 2018-03-15 | BloomReach, Inc. | Attribute extraction |
US10445812B2 (en) | 2016-09-09 | 2019-10-15 | BloomReach, Inc. | Attribute extraction |
US20180075516A1 (en) * | 2016-09-14 | 2018-03-15 | Microsoft Technology Licensing, Llc | System for producing recommendations and predicting purchases of products based on usage patterns |
US10825072B2 (en) * | 2016-09-14 | 2020-11-03 | Microsoft Technology Licensing, Llc | System for producing recommendations and predicting purchases of products based on usage patterns |
US11341140B2 (en) | 2016-11-08 | 2022-05-24 | At&T Intellectual Property I, L.P. | Unique identification generation for records in a data streaming processing system |
US10459922B2 (en) | 2016-11-08 | 2019-10-29 | At&T Intellectual Property I, L.P. | Unique identification generation for records in a data streaming processing system |
US9946522B1 (en) | 2016-12-16 | 2018-04-17 | International Business Machines Corporation | Generating code for real-time stream processing |
US10241762B2 (en) | 2016-12-16 | 2019-03-26 | International Business Machines Corporation | Generating code for real-time stream processing |
US9983858B1 (en) | 2016-12-16 | 2018-05-29 | International Business Machines Corporation | Generating code for real-time stream processing |
US10812322B2 (en) * | 2017-04-13 | 2020-10-20 | Walmart Apollo, Llc | Systems and methods for real time streaming |
US11294876B2 (en) * | 2017-06-01 | 2022-04-05 | Oracle International Corporation | System and method for generating a multi dimensional data cube for analytics using a map-reduce program |
US12111809B2 (en) | 2017-06-01 | 2024-10-08 | Oracle International Corporation | System and method for generating a multi dimensional data cube for analytics using a map-reduce program |
US10380249B2 (en) | 2017-10-02 | 2019-08-13 | Facebook, Inc. | Predicting future trending topics |
EP3678033A1 (en) * | 2019-01-07 | 2020-07-08 | QlikTech International AB | A computer implemented method for indexlet based aggregation |
US11768857B2 (en) * | 2019-01-07 | 2023-09-26 | Qliktech International Ab | Methods and systems for indexlet based aggregation |
US12105738B2 (en) | 2019-01-07 | 2024-10-01 | Qliktech International Ab | Methods and systems for indexlet based aggregation |
CN112817965A (en) * | 2019-11-18 | 2021-05-18 | 百度在线网络技术(北京)有限公司 | Data splicing method and device, electronic equipment and storage medium |
CN113127512A (en) * | 2020-01-15 | 2021-07-16 | 百度在线网络技术(北京)有限公司 | Data splicing triggering method and device for multiple data streams, electronic equipment and medium |
EP4033373A1 (en) * | 2021-01-25 | 2022-07-27 | QlikTech International AB | Methods and systems for undetermined query analytics |
US11625395B2 (en) | 2021-01-25 | 2023-04-11 | Qliktech International Ab | Methods and systems for undetermined query analytics |
CN116756215A (en) * | 2023-06-27 | 2023-09-15 | 上海蚂蚁创将信息技术有限公司 | Transaction in-transit state query method and system |
Also Published As
Publication number | Publication date |
---|---|
US20120197750A1 (en) | 2012-08-02 |
US20120131047A1 (en) | 2012-05-24 |
US20160132515A1 (en) | 2016-05-12 |
US9679074B2 (en) | 2017-06-13 |
WO2012068557A1 (en) | 2012-05-24 |
US8725592B2 (en) | 2014-05-13 |
US9183270B2 (en) | 2015-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20120130940A1 (en) | Real-time analytics of streaming data | |
US11985037B2 (en) | Systems and methods for conducting more reliable assessments with connectivity statistics | |
US11665072B2 (en) | Parallel computational framework and application server for determining path connectivity | |
US10311106B2 (en) | Social graph visualization and user interface | |
US20220283883A1 (en) | Distributed processing in a messaging platform | |
US9922134B2 (en) | Assessing and scoring people, businesses, places, things, and brands | |
US9483580B2 (en) | Estimation of closeness of topics based on graph analytics | |
EP2937797A1 (en) | System and method for searching a distributed node-sharded graph | |
US20110307474A1 (en) | Party reputation aggregation system and method | |
US20180039688A1 (en) | Data flow based feature vector clustering | |
KR20120126093A (en) | Method, system and server for managing dynamic information of friends in network | |
EP3070661A1 (en) | System and method for providing context driven hyper-personalized recommendation | |
US10803133B2 (en) | System for decomposing events from managed infrastructures that includes a reference tool signalizer | |
CN111046237B (en) | User behavior data processing method and device, electronic equipment and readable medium | |
US20220138191A1 (en) | Systems and methods for matching electronic activities with whitespace domains to record objects in a multi-tenant system | |
US11336596B2 (en) | Personalized low latency communication | |
US9846746B2 (en) | Querying groups of users based on user attributes for social analytics | |
CN113592293A (en) | Risk identification processing method, electronic device and computer-readable storage medium | |
US20180137198A1 (en) | Data retrieval system | |
US12056132B2 (en) | Systems and methods for selection of a first record object for association with second record objects based on connection profiles | |
US20150100515A1 (en) | Customer data unification | |
CN114090873A (en) | Method, device, equipment and computer readable medium for matching data | |
US20190392498A1 (en) | Recommendation engine and system | |
CN113360765B (en) | Event information processing method and device, electronic equipment and medium | |
Yuanjing et al. | Research on the Application of Data Analysis in Agricultural Products E-Commerce |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: WAL-MART STORES, INC., ARIZONA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GATTANI, ABHISHEK;RAJARAMAN, ANAND;REEL/FRAME:029248/0065 Effective date: 20120203 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
|
AS | Assignment |
Owner name: WALMART APOLLO, LLC, ARKANSAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WAL-MART STORES, INC.;REEL/FRAME:045817/0115 Effective date: 20180131 |