GB2488147A - A method for generating an OLAP cube from a database of user activity on a network - Google Patents
A method for generating an OLAP cube from a database of user activity on a network Download PDFInfo
- Publication number
- GB2488147A GB2488147A GB1102777.8A GB201102777A GB2488147A GB 2488147 A GB2488147 A GB 2488147A GB 201102777 A GB201102777 A GB 201102777A GB 2488147 A GB2488147 A GB 2488147A
- Authority
- GB
- United Kingdom
- Prior art keywords
- user
- activity data
- relational database
- advert
- activity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 230000000694 effects Effects 0.000 title claims abstract description 117
- 238000000034 method Methods 0.000 title claims abstract description 39
- 230000004931 aggregating effect Effects 0.000 claims abstract description 11
- 230000009466 transformation Effects 0.000 claims abstract description 7
- 230000001131 transforming effect Effects 0.000 claims abstract description 6
- 235000014510 cooky Nutrition 0.000 claims description 13
- 230000003993 interaction Effects 0.000 claims description 11
- 230000004044 response Effects 0.000 claims description 11
- 238000004458 analytical method Methods 0.000 claims description 8
- 230000009471 action Effects 0.000 claims description 5
- 238000005192 partition Methods 0.000 description 10
- 238000007405 data analysis Methods 0.000 description 8
- 230000002776 aggregation Effects 0.000 description 4
- 238000004220 aggregation Methods 0.000 description 4
- 238000010276 construction Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000002730 additional effect Effects 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 241000533950 Leucojum Species 0.000 description 1
- 230000001154 acute effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G06F17/30592—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Databases & Information Systems (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- Tourism & Hospitality (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- General Business, Economics & Management (AREA)
- Quality & Reliability (AREA)
- Development Economics (AREA)
- Operations Research (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Educational Administration (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The method comprises the steps of: storing the activity data at a first relational database 21, user identifiers and timestamps being preserved in the first relational database; generating from the first relational database a second relational database 22 formatted for transformation into an OLAP cube 23, the second relational database 22 being generated by: processing tracking elements of a subset of the activity data so as to, for each activity datum of the subset, extract at least some of the variables; defining one or more fact tables that include facts corresponding to at least some of the extracted variables, the extracted variables not being dimensions of the second relational database; and aggregating factual activity data over a predetermined length of time with respect to the dimensions of the second relational database 22 and transforming the second relational database 22 into an OLAP cube 23.
Description
I
Efficient Database Construction
BACKGROUND OF THE INVENTION
This invention relates to a method for efficiently generating an OLAP cube that allows the rapid analysis of data.
With the advent of digital technologies and the ever-increasing ability of computer systems to generate and store terabytes of data, there has been a steady pressure to keep pace with tools that allow such enormous amounts of data to be rapidly analysed. Such tools can be in the form of improved algorithms for processing data extracted from a database but often come in the form of improved database structures. Many different database structures can efficiently store very large amounts of data but such structures do not necessarily allow data stored in the database to be analysed in an efficient manner.
Online analytical processing (OLAP) is a particular approach to data analysis that allows complex analytical queries to be rapidly handled. The foundations of OLAP were set out in Edgar F. Todd's paper "Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate", published by Codd & Date, Inc. in 1993. OLAP is now a well-known data analysis technique and has been described in several textbooks, including "OLAP Solutions: Building Multidimensional lnformation Systems" by Erik Thomsen (John Wiley & Sons).
OLAP queries are performed on an OLAP cube (or hypercube), which is a multidimensional database whose structure is the key to the analytical power of OLAP. The content of an OLAP cube is categorized according to a set of dimensions that define the axes' of the cube. This structure can overcome many of the limitations of conventional database structures, such as relational databases, and permit the rapid analysis and extraction of large amounts of data.
However, transforming a data set into a database optimised for the efficient analysis of data (such as an OLAP cube) can require a lot of processing time. In situations in which a database is being regularly updated with large volumes of data (for example, on a real-time basis or according to a daily schedule) the processing time required to transform a data set into a database optimised for the efficient anaiysis of data can leave very little time for performing analysis on the database, and in extreme examples can exceed the time between data updates.
This problem is especially acute when generating OLAP cubes from data sets in which some of the data are free variables that are not well defined in the data set schema.
There is therefore a need for an efficient method for the construction of an OLAF cube.
SUMMARY OF THE INVENTION
Accord ing to a first aspect of the present invention there is provided a method for generating an OLAP cube based on a data set comprising activity data that represents logged activities of a plurality of users on a network, each activity datum including a timestamp and a user identifier of the user that performed the respective logged activity, and a subset of the activity data including a tracking element comprising a plurality of variables, and the method comprising the steps of: storing the activity data at a first relational database, the user identifiers and timestamps being preserved in the first relational database; generating from the first relational database a second relational database formatted for transformation into an OLAP cube, the second relational database being generated by: processing the tracking elements of the subset of the activity data so as to, for each activity datum of the subset, extract at least some of the variables; defining one or more fact tables that include facts corresponding to at least some of the extracted variables, the extracted variables not being dimensions of the second relational database; and aggregating factual activity data over a predetermined length of time with respect to the dimensions of the second relational database, the predetermined length of time being substantially greater than the resolution of the timestamps of the activity data; and, transforming the second relational database into an OLAF cube such that at least some of the facts corresponding to at least some of the extracted variables become dimensions of the OLAF cube.
Preferably the step of generating the second relational database further comprises, prior to aggregating the activity data over a predetermined length of time, discarding activity data that is not a fact or dimension of the second relational database. Preferably, on aggregating factual activity data over a predetermined length of time, the user identifiers and timestamps are discarded.
Preferably activity data relating to a particular fact of the second relational database is aggregated only over those dimensions of the second relational database that are linked to said particular fact.
The step of aggregating factual activity data over a predetermined length of time could comprise one or more of: summing numerical information stored in the activity data; calculating a mean, median or mode of numerical information stored in the activity data; calculating a minimum or maximum of numerical information stored in the activity data; calculating a variance or standard deviation of numerical information stored in the activity data; and grouping information stored in the activity data so as to generate a count of the number of times a particular identifier occurs.
Suitably, wherein the network is the internet and each user's activities are logged by means of one or more cookies accessible to the respective user's internet browser, at least one of the one or more cookies including said user identifier.
Suitably said one or more cookies are HTTP cookies or Flash cookies.
Suitably each logged activity relates to the interaction of a user with an advert served by an ad server. The logged activities could include one or more of: a user viewing an advert; a user interacting with an advert by holding a mouse cursor over the advert; a user clicking an advert. Preferably each activity datum additionally includes an identifier of the advert served by the ad server.
Suitably the ad server comprises a plurality of internet servers arranged to serve adverts to users and, in response to serving an advert to a user, Jog an activity datum for that activity. Preferably each tracking element is generated by the website whose content is served with the advert with respect to which that tracking element is generated.
Preferably at least some of the tracking elements are generated at a webpage served by the website in response to a user action, the user action being one of: purchasing goods and/or services through the website; filling in a web form; requesting a brochure; and registering interest in a product.
The logged activities to which the subset of the activity data relate could include a user being served an advert by the ad server on receiving a webpage from a website confirming the purchase of goods and/or services through that website, the website generating a tracking element that includes one or more of the following variables: a total purchase cost; identifiers of the goods and/or services purchased; an individual cost of one or more of the goods and/or services purchased; a reference number for the purchase; a customer number of the user.
The logged activities to which the subset of the activity data relate could include a user being served an advert by the ad server on receiving a webpage from a website confirming the registering of interest by the user in goods and/or services offered through that website, the website generating a tracking element that includes one or more of the following variables: a cost of the goods and/or services in which the user is interested; identifiers of the goods and/or services in which the user is interested; a reference number for the registering of interest; a customer number of the user.
The logged activities to which the subset of the activity data relate could include a user being served an advert by the ad server on receiving a webpage of search results from a search engine, the website generating a tracking element that includes one or more of the following variables: the query performed by the user at the search engine; identifiers of one or more search results provided to the user; information identifying the user.
Preferably the predetermined length of time is one day. Preferably the resolution of the timestamps is one second or one minute.
Suitably the step of transforming the second relational database into an OLAF cube is performed by Microsoft Analysis Services.
According to a second aspect of the present invention there is provided a method for performing a logical query on an OLAP cube generated in accordance with the first aspect of the present invention, the method comprising: if the logical query relates to a particular user or to a length of time that is shorter than the predetermined length of time, performing the logical query on the first relational database; and otherwise, performing the logical query on the OLAF cube.
DESCRIPTION OF THE DRAWINGS
The present invention will now be described by way of example with reference to the accompanying drawings, in which: Figure 1 is a schematic drawing of a data processing system operable to perform the steps of the present invention.
Figure 2 is a flow diagram illustrating the processing performed in accordance with the present invention of a data set into an OLAP cube.
Figure 3 is a schematic drawing of a schema for a relational database configured in accordance with the present invention for transformation into an OLAP cube.
DETAILED DESCRIPTION OF THE DRAWINGS
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art.
The general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention.
Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The present invention concerns methods for generating an online analytical processing (OLAF) database, commonly referred to as an OLAF cube or hypercube. OLAF cubes are generally created from a more conventional database structure, such as a relational database. The invention will be described below in relation to the construction of an OLAF cube from a data set generated by an online ad server system, but the method of the present invention could be used with any data set having the characteristics described herein and, in particular, as set out in the claims. An ad server as described herein is to be understood to typically refer to the collection of servers and online infrastructure operated by a commercial entity in the business of providing adverts online to internet users and is not to be understood to necessarily refer to a single computer system configured as a server.
Several database management products offer OLAF tools and can generate OLAF cubes on which analytical processing can be formed, including Oracle Database OLAF Option and Microsoft Analysis Services, which is part of Microsoft SQL Server 2008. These products differ in the features they provide -for example, the query languages supported -but at their core all of the products are based around a multi-dimensional cube configured in accordance with the principles of 0 LAP.
A data processing system at which the methods of the present invention could be performed is shown in figure 1. Data analysis system 10 comprises data store 12 for storing a data set received over network 11 and a processor 13 operable to transform such a data set into an OLAP cube for storage at data store 12. Data analysis system 10 may be a single server comprising data storage 12 and one or more data processors 13, or the data analysis system could be a collection of one or more data processing units. For example, data store 12 could be a fileserver or storage area network (SAN) and processor 13 could be a data processing unit (such as a server) or a cluster of computers. The present invention is not limited to operation at any particular kind of data analysis system.
In a preferred embodiment, with respect to which the present invention will now be described, data analysis system 10 is configured to receive data from an online ad serving system. Data analysis system 10 could itself form part of an online ad serving system. An online ad server is an entity or collection of entities configured to provide adverts to internet users as they consume online content such as websites, RSS feeds or podcasts. Online adverts can be provided as text, sound, images or video and may be provided in response to various activities, such as visiting a particular webpage, entering a search string into a search engine, or clicking a hyperlink.
The adverts provided to a user can be contextual in that an advert provided to a user is selected in dependence on, for example, a search string entered by the user, content on the webpage in which the advert is served, and social networking content on a webpage. Online adverts can also be selected on the basis of tracking information generated for a user. For example, the activities of internet users can often by tracked by means of a cookie stored by their web-browser that carries a unique identifier for that user. By tracking the unique identifier exposed by the cookie each time the user's browser requests an advert from the ad server, the webpages visited by a user can be logged by the ad server. Armed with this information, an ad server can serve adverts to the user in dependence on, for exam pie, the websites visited by that user, the terms searched for by that user, or the adverts previously seen by the user.
The unique identifier of a user can be captured each time the user interacts with an advert served by the ad server. These interactions can be, for example: the user merely browsing to a webpage that includes a banner advert (a digital exposure); the user being served a text advert with the results of a search engine query (a search exposure), the advert being selected at least in part in dependence on the search query; the user clicking a banner advert served to them in order to be taken to a landing page of the website to which the advert links (a digital response); the user clicking a text advert served to them alongside the results of a search engine query in order to be taken to a landing page of the website to which the advert links (a search response). Such interactions are referred to herein as user activities and the data generated in response to such interactions constitutes activity data.
Each activity datum comprises a set of fields that hold parameters describing a particular interaction of an internet user with an advert, such as an exposure or click event. An activity datum comprises fields holding, for example, the unique identifier of the user, a timestamp indicating when the interaction occurred (e.g. when the advert was served or an advert was clicked), an identifier of the particular advert involved in the interaction, an identifier of the ad client to which the advert relates, an identifier of the ad campaign to which the advert relates, and identifier of the type of advert served. Activity data relating to search activities will comprise a different set of fields to activity data relating to exposures -for example, a search activity datum might comprise a field holding the search query and an exposure activity datum might comprise a field holding an identifier of the webpage on which the advert was served. g
As an example, consider an online supermarket selling a variety of goods. The supermarket might pay for banner adverts to be served by an ad server on the internet in order to direct internet users to the website of the online supermarket.
The ad server arranges for websites to serve webpages to users that include code (e.g. HTML, Flash or Javascript) directing the user's browser to request an advert from the ad server, and sometimes the ad served will be one of the online supermarket's banner ads. Each time an advert is served to an internet user, the unique identifier of that user will be captured by the ad server and activity data logged by the ad server. This is because, on requesting an advert, the user's browser provides to the ad server the cookie associated with the ad server's internet domain that comprises the unique identifier of the user. This allows the ad server to build up data identifying which users have been served (i.e. exposed to) the advert for that online supermarket.
As well as activity data relating to the interaction of a user with the adverts served by an ad server, particular websites can be arranged to generate activity data relating to the interactions of a user with the respective website and pass that information to the ad server as the variables of a tag expressed in the code associated with an advert, or other element of the webpage (such as a logo or hidden image). For instance, continuing the above online supermarket example, the online supermarket may wish to determine whether its online advertising is driving internet users to buy goods from the supermarket, and what sort of goods those users are buying. This could be achieved by the supermarket including an advert from the ad server in a dynamically generated webpage that is served to a user when a user has just completed a purchase. The advert allows the ad server to pick up the unique identifier of the user and the online supermarket website can further include in the code defining the advert a tag or tracking element comprising one or more free variables generated in response to the sale that indicate, for example: the amount of money spent by the user; details of the goods bought by the user; an order number (which could later be used to link the internet user to a user account of the online supermarket). The tag represents additional activity data and could be a series of parameters included in an HTML link to the ad server that is included in the webpage generated by the supermarket website when a purchase is made, for example: <iframe src="https://www.adzserver.com/adverfframe/?tag[varl];[var2];[var3]" width="1" height="1" frameborder="O" I> In this example the ad server dynamically serves a banner advert as the "advertframe" requested in the hyperlink. The request made to the ad server includes the three variables of the tag passed in the parameters of the link.
The additional activity data provided by the variables generated by the online supermarket allows the ad server to associate the browsing history of a particular user with the purchases made on the supermarket website. This allows the ad server (or another entity that accesses the activity data collected by the ad server) to establish whether the user was served with any adverts to the online supermarket prior to making their purchase and whether the user clicked on any adverts to take them to the online supermarket. This can help the online supermarket to determine the performance of an online marketing campaign.
The activity data generated by tracking users online can be a rich source of information about a user's browsing habits, as well as revealing the sequence of activities that led a user to, for example, make a purchase of a particular product, or choose to buy a product from a particular website. Such information is very valuable to ad servers and other commercial entities operating in the marketing industry. More generally, activity data can be generated in response to any user action that provides additional information about the user to the website: for example, filling in a web form, requesting a brochure, or registering interest in a product.
One way in which the activity data generated by an ad server can be analysed is by means of an OLAP cube whose dimensions are selected so as to permit the rapid execution of a desired set of queries on the activity data. However, with potentially millions of users being served millions of adverts each day, the rate at which activity data is generated by an ad server can be enormous, making the task of creating an OLAP cube very demanding. The task becomes even more onerous when one takes into account that a significant proportion of the activity data logged by an ad server can be in the form of free variables generated by websites in response to the interactions of internet users at those websites.
Because each website is free to generate a set of variables for provision to the ad server, it can be difficult to efficiently incorporate the variables into a database having a fixed schema in such a way that allows the rapid analysis of the data represented by the variables.
The general steps of processing activity data 20 into an OLAP cube 23 in accordance with the present invention are shown in figure 2. The present invention will continue to be described in relation to activity data generated by an ad server, but the activity data could in general be any kind of data having the characteristics described herein and, in particular, set out in the claims. Activity data 20 from an ad server, which includes free variables, generated by entities other than the ad server, is stored at a first database 21. Data 20 may be processed prior to storage at database 21. For example, activity data 20 might be provided as a set of flat files and database 21 might be a relational database, in which case the activity data must be organised in accordance with the schema of database 21 prior to storage. Such processing is generally referred to as extract, transform and load (ETL).
It is advantageous if database 21 comprises a plurality of partitions. This allows data to be written to the database by swapping out a partition and writing data to that partition without suspending indexing and access to the entire database. The partitions could be logically ordered by the age of the data they contain, with the partition that includes the oldest data being swapped out and newly-received activity data being written to that partition over the old data. For example, there could be thirteen partitions with each partition comprising up to one month's worth of activity data and, when data relating to a new month is received, the oldest partition could be swapped out and overwritten with the new data so as to maintain at least a year of activity data at all times.
Database 21 would typically be very large and not optimised for transformation into an OLAF cube. For instance, the activity data held in database 21 includes the unique identifiers of the users in respect of which each activity datum was collected and the timestamp indicating the time at which activity datum was collected. Trying to transform database 21 directly into an OLAF cube would generally be very slow -for example, it could take longer than 24 hours to process one day's worth of data.
First database 21 is therefore processed into second database 22, which is a relational database configured so as to allow efficient processing of the second database into an OLAF cube. Three operations are performed on database 21 in order to generate database 22: i. the tags or tracking elements in database 21 are processed so as to extract at least some of the variables from the tags; ii. fact tables are defined that include facts corresponding to at least some of the variables extracted from the tags but variables do not become dimensions of the database 22; and iii. factual activity data stored at database 21 and the extracted variables that correspond to facts held in the fact tables are aggregated over a predetermined length of time that is substantially greater than the resolution of the timestamps of the activity data.
The schema of database 22 may define one or more fact tables and one or more dimension tables. Aggregation of factual data is performed with respect to the dimensions of the second database: i.e. data that represents a given fact of the database are aggregated by the dimensions of the second database that relate to that fact. The dimension tables of the second database do not include any dimensions defined in respect of the variables extracted from the tags. Preferably data that cannot be represented as a fact or dimension of database 22 is not migrated from database 21 into database 22.
The variabJes extracted from the tags stored in database 21 are free to hold any data, as determined by the entity that generated the tags. Extracted variables can be stored as numerical or textual data in the fact table(s) of the second data base.
Synthetic keys are not useful for a collection of unique text variables since there is no restriction to the type or meaning of such data. Furthermore, creating links in data that has an unpredictable or unknown distribution provides little if any benefit and can be resource intensive.
As for database 21, database 22 may be partitioned so as to allow just the relevant partition of the database to be swapped out for writing to when data is mig rated from database 21 to database 22.
Typically activity data is collected with timestamp data having a resolution of one second (although this might not reflect the true accuracy of the timestamps) or one minute, but the resolution of the timestamps could be any resolution. In a preferred embodiment the activity data of database 21 is migrated into database 22 and aggregated over a time period of one day, but this period could be any length of time: for example, one or more hours or a few days. This preferably causes the timestamp and user identifiers associated with the activity data to be lost, which substantially reduces the size of the database. Aggregation can be performed in two ways, depending on the type of data involved.
Numeric data can be mathematically aggregated so as to give, for example, a sum total, an average (e.g. mean, median or mode), a variance or standard deviation, or a maximum or minimum value for the predetermined time period.
For example, each activity datum might include an "AdservCost" field that holds the cost of serving the respective advert. On aggregation, that field would be processed over the dimensions of database 22 defined in its dimension tables.
Thus aggregating the AdservCost field over an "Ad" dimension corresponding to the identifier of an advert would cause the AdservCost fields of each activity datum relating to the serving of that advert to be summed and represented as an aggregated value along that dimension, e.g. the total cost of a given advert for a particular day.
Textual data and some numeric data (such as identifying numbers or codes) can be grouped on aggregation to show, for instance, the number of times that a particular text string or identifier occurred within the predetermined time period. In this process a new fact is created indicating the number of occurrences of the text string or identifier. For example, the "Ad" field holding the identifier of an advert could be grouped so as to form a count value for each ad identifier that indicates the number of times each ad has been served; in other words, a new "Count" fact is created in database 22 on aggregating the ad identifiers over an "Ad" dimension of the database (e.g. the number of times the advert is served on a particular day). It is generally not practical to aggregate the textual fields of activity data where each item is unique. Such data is preferably not aggregated over the dimensions of database 22. Most preferably such data is not migrated into database 22.
Variables extracted from tag data are also aggregated over the dimensions of database 22 on its creation. As is noted above, variables extracted from tag data are not dimensions of database 22 but, as will be described, do become dimensions of the OLAP cube. This ensures that the OLAP cube can rapidly service any queries that require information held by variables whilst minimising the number of rows in the fact table(s) of database 22: this significantly reduces the total time required to generate an OLAP cube from activity data 20.Preferably database 22 is chosen to have a star or snowflake schema, which are particularly suitable for transformation into an OLAF cube. A simple star schema for database 22 is shown in figure 3. In the figure, fact table 31 defines three facts 32 for the three variables passed to the ad server by the online supermarket in the example described above. More generally some of the facts in table 31 are defined to hold the free variables stored in the activity data logged by the ad server. Database 22 also defines in its schema one or more dimension tables 33 linked to the one or more fact tables.
Relational database 22 is transformed into OLAF cube 23 in accordance with the techniques known in the art. Importantly, this transformation is performed such that OLAP cube 23 includes dimensions that correspond to the facts of database 22 defined for the free variables extracted at step (i) above. These variables are not however dimensions of database 22, which allows database 22 to be rapidly generated and optimises the overall process shown in figure 2 for generating OLAP cube 23. Furthermore, the OLAF cube is configured to not include any data that is not defined as a fact or dimension of the cube. Variables that are numerical data typically remain as factual data of the OLAP cube, whilst variables that are textual data typically become a dimension of the OLAP cube.
The particular set of variables that are to become dimensions of the OLAP cube, and the facts that the dimensions are linked to, depends on the application.
Continuing the example of the online supermarket above, the particular variables that become dimensions of the OLAP cube would depend on the information that the online supermarket wants the ad server to provide to it. For example, if the supermarket that generates the tag and its variables stores the type of product sold on the supermarket website in a first variable of a tag and the cost of the product in a second variable of the tag, with the product type being selected to become a dimension of the OLAP cube, then the OLAF cube would be configured to respond to queries relating to the revenue generated by the supermarket for each product type.
To give a second example, the supermarket could alternatively or additionally generate variables storing the delivery cost and delivery city of an order placed on its website, with the delivery destination being selected to become a dimension of the OLAF cube. The OLAF cube would then be configured to respond to queries relating to the delivery cost charged by the supermarket for each delivery destination.
The resulting structure of the OLAF cube allows queries of the activity data aggregated along the dimensions of the OLAP cube to be rapidly performed. The dimensions of the cube are therefore selected so as to optimise the performance of queries that are likely to be commonly made. In particular, the dimensions of the OLAP cube include at least some of the free variables that can be generated external to the ad server and provided to it for storage as activity data. This allows queries on these variables to be rapidly performed.
Once the OLAP cube has been generated, it is advantageous if at least the first relational database 21 and the OLAP cube 23 are provided for satisfying queries performed on the activity data set gathered by the ad server. Since the OLAP cube does not include any user identifiers or timestamps it cannot satisfy queries that require such information. However, database 21 does include this data, It is therefore advantageous that any queries relating to the aggregated activity data available at OLAP cube 23 are performed using an OLAP query language on the OLAF cube, and that any queries relating to a particular user or a group of users is performed by means of a relational query to database 21. This can be achieved by providing a unified interface to the databases which translates the queries entered by means of the interface into suitable queries directed to the OLAP cube or database 21.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims, The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
Claims (21)
- CLAIMS1. A method for generating an OLAF cube based on a data set comprising activity data that represents logged activities of a plurality of users on a network, each activity datum including a timestamp and a user identifier of the user that performed the respective logged activity, and a subset of the activity data including a tracking element comprising a plurality of variables, and the method comprising the steps of: storing the activity data at a first relational database, the user identifiers and timestamps being preserved in the first relational database; generating from the first relational database a second relational database formatted for transformation into an OLAF cube, the second relational database being generated by: processing the tracking elements of the subset of the activity data so as to, for each activity datum of the subset, extract at least some of the variables; defining one or more fact tables that include facts corresponding to at least some of the extracted variables, the extracted variables not being dimensions of the second relational database; and aggregating factual activity data over a predetermined length of time with respect to the dimensions of the second relational database, the predetermined length of time being substantially greater than the resolution of the timestamps of the activity data; and, transforming the second relational database into an OLAF cube such that at least some of the facts corresponding to at least some of the extracted variables become dimensions of the OLAF cube.
- 2. A method as claimed in claim 1, wherein the step of generating the second relational database further comprises, prior to aggregating the activity data over a predetermined length of time, discarding activity data that is not a fact or dimension of the second relational database.
- 3. A method as claimed in claim I or 2, wherein, on aggregating factual activity data over a predetermined length of time, the user identifiers and timestamps are discarded.
- 4. A method as claimed in any preceding claim, wherein activity data relating to a particular fact of the second relational database is aggregated only over those dimensions of the second relational database that are linked to said particular fact.
- 5. A method as claimed in any preceding claim, wherein the step of aggregating factual activity data over a predetermined length of time comprises one or more of: summing numerical information stored in the activity data; calculating a mean, median or mode of numerical information stored in the activity data; calculating a minimum or maximum of numerical information stored in the activity data; calculating a variance or standard deviation of numerical information stored in the activity data; and grouping information stored in the activity data so as to generate a count of the number of times a particular identifier occurs.
- 6. A method as claimed in any preceding claim, wherein the network is the internet and each user's activities are logged by means of one or more cookies accessible to the respective user's internet browser, at least one of the one or more cookies including said user identifier.
- 7. A method as claimed in claim 5 or 6, wherein said one or more cookies are HTTP cookies or Flash cookies.
- 8. A method as claimed in any preceding claim, wherein each logged activity relates to the interaction of a user with an advert served by an ad server.
- 9. A method as claimed in claim 8, wherein the logged activities include one or more of: a user viewing an advert; a user interacting with an advert by holding a mouse cursor over the advert; a user clicking an advert.
- 10. A method as claimed in claim 8 or 9, wherein each activity datum additionally includes an identifier of the advert served by the ad server.
- 11. A method as claimed in any of claims 8 to 10, wherein the ad server comprises a plurality of internet servers arranged to serve adverts to users and, in response to serving an advert to a user, log an activity datum for that activity.
- 12. A method as claimed in any of claims 8 to 11, wherein each tracking element is generated by the website whose content is served with the advert with respect to which that tracking element is generated.
- 13. A method as claimed in any of claims 8 to 12, wherein at least some of the tracking elements are generated at a webpage served by the website in response to a user action, the user action being one of: purchasing goods and/or services through the website; filling in a web form; requesting a brochure; and registering interest in a product.
- 14. A method as claimed in any of claims 8 to 13, wherein the logged activities to which the subset of the activity data relate include a user being served an advert by the ad server on receiving a webpage from a website confirming the purchase of goods and/or services through that website, the website generating a tracking element that includes one or more of the following variables: a total purchase cost; identifiers of the goods and/or services purchased; an individual cost of one or more of the goods and/or services purchased; a reference number for the purchase; a customer number of the user.
- 15. A method as claimed in any of claims 8 to 13, wherein the logged activities to which the subset of the activity data relate include a user being served an advert by the ad server on receiving a webpage from a website confirming the registering of interest by the user in goods and/or services offered through that website, the website generating a tracking element that includes one or more of the following variables: a cost of the goods and/or services in which the user is interested; identifiers of the goods and/or services in which the user is interested; a reference number for the registering of interest; a customer number of the user.
- 16. A method as claimed in any of claims 8 to 15, wherein the logged activities to which the subset of the activity data relate include a user being served an advert by the ad server on receiving a webpage of search results from a search engine, the website generating a tracking element that includes one or more of the following variables: the query performed by the user at the search engine; identifiers of one or more search results provided to the user; information identifying the user.
- 17. A method as claimed in any preceding claim, wherein the predetermined length of time is one day.
- 18. A method as claimed in any preceding claim, wherein the resolution of the timestamps is one second or one minute.
- 19. A method as claimed in any preceding claim, wherein the step of transforming the second relational database into an OLAF cube is performed by Microsoft Analysis Services.
- 20. A method for performing a logical query on an OLAP cube generated in accord ance with any preceding claim, the method comprising: if the logical query relates to a particular user or to a length of time that is shorter than the predetermined length of time, performing the logical query on the first relational database; and otherwise, performing the logical query on the OLAP cube.
- 21. A method for generating an OLAF database substantially as described herein with reference to any of figures 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1102777.8A GB2488147A (en) | 2011-02-17 | 2011-02-17 | A method for generating an OLAP cube from a database of user activity on a network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1102777.8A GB2488147A (en) | 2011-02-17 | 2011-02-17 | A method for generating an OLAP cube from a database of user activity on a network |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201102777D0 GB201102777D0 (en) | 2011-03-30 |
GB2488147A true GB2488147A (en) | 2012-08-22 |
Family
ID=43859568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1102777.8A Withdrawn GB2488147A (en) | 2011-02-17 | 2011-02-17 | A method for generating an OLAP cube from a database of user activity on a network |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2488147A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114595294A (en) * | 2022-03-11 | 2022-06-07 | 北京梦诚科技有限公司 | Data warehouse modeling and extracting method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009120329A2 (en) * | 2008-03-22 | 2009-10-01 | Thomson Reuters ( Tax & Accounting ) Inc. | Online analytic processing cube with time stamping |
US20100250412A1 (en) * | 2008-03-22 | 2010-09-30 | Steven Wagner | Online analytic processing cube with time stamping |
-
2011
- 2011-02-17 GB GB1102777.8A patent/GB2488147A/en not_active Withdrawn
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009120329A2 (en) * | 2008-03-22 | 2009-10-01 | Thomson Reuters ( Tax & Accounting ) Inc. | Online analytic processing cube with time stamping |
US20100250412A1 (en) * | 2008-03-22 | 2010-09-30 | Steven Wagner | Online analytic processing cube with time stamping |
Non-Patent Citations (2)
Title |
---|
http://blog.developers.ie/rrawat/archive/2005/10/14/1754.aspx * |
http://stackoverflow.com/questions/7821706/best-way-to-transform-data-to-intermediate-db-for-olap * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114595294A (en) * | 2022-03-11 | 2022-06-07 | 北京梦诚科技有限公司 | Data warehouse modeling and extracting method and system |
Also Published As
Publication number | Publication date |
---|---|
GB201102777D0 (en) | 2011-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11507551B2 (en) | Analytics based on scalable hierarchical categorization of web content | |
CN108416620B (en) | Portrait data intelligent social advertisement putting platform based on big data | |
US11625415B2 (en) | Data engine integration and data refinement | |
US11822545B2 (en) | Search integration | |
US8843610B2 (en) | Referred internet traffic analysis system and method | |
CN102236851B (en) | The method and system that the multidimensional credit system composing power based on user calculates in real time | |
US10042883B2 (en) | System and method for asynchronous consumer item searching requests with synchronous parallel searching | |
JP2021511582A (en) | Dimensional context propagation technology for optimizing SQL query plans | |
US20150213109A1 (en) | System and method for providing big data analytics on dynamically-changing data models | |
US20140101201A1 (en) | Distributed data warehouse | |
Verma et al. | An intelligent approach to Big Data analytics for sustainable retail environment using Apriori-MapReduce framework | |
WO2019242343A1 (en) | Marketing information release platform construction method and apparatus | |
US10366079B1 (en) | Enterprise connectivity | |
US20140200958A1 (en) | System and method for categorization of factors to predict demand | |
Salem et al. | Active XML-based Web data integration | |
EP4226296A1 (en) | Unified machine learning feature data pipeline | |
US10523546B1 (en) | Triggering supplemental channel communications based on data from non-transactional communication sessions | |
US20220036477A1 (en) | System and method for determining revenue generated by any zone in a webpage | |
JP7011552B2 (en) | Ad management system, ad management method, and ad management program | |
US11593451B2 (en) | System and method for comparing zones for different versions of a website based on performance metrics | |
GB2488147A (en) | A method for generating an OLAP cube from a database of user activity on a network | |
Mohanty et al. | Extracting value from big data: in-memory solutions, real time analytics, and recommendation systems | |
Cheng | Product recommendation system design | |
CA2708415C (en) | Referred internet traffic analysis system and method | |
Zumstein et al. | A fuzzy Web analytics model for Web mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |