Hi all,

Please tell me if this makes any sense. Any pointers to relevant projects will be much appreciated.

Philip Stoev
http://www.stoev.org/pivot/manifest.htm


OLAP PROPOSAL FOR MYSQL



The goal is to create an OLAP engine coupled with a presentation layer that will be easy enough for normal people to use, with no MDX experience required. While it is probably a fact that Wal-Mart has 70 GB of data, this does not mean that all people have such data sets, so the goal is reasonable performance for reasonably-sized datasets. Most people do not join 30 tables together either. Also, while it is presupposed that Wal-Mart engages in extra-complex calculations to determine business strategies, most people are often content to know “How much did I sell yesterday?”.



I. OLAP ENGINE AND CACHING



The OLAP “engine” takes a standard SQL query with a GROUP BY clause and aggregate functions, executes it, and saves the entire resulting dataset in the cache. A cache index entry is then created, noting the source tables, the GROUP BY columns, the aggregate functions and the WHERE conditions that were used.



Upon execution of further queries, the OLAP engine checks whether the cache holds a dataset that can be used to answer the query immediately. A cached dataset qualifies when the new query differs from the cached query only in one or more of the following ways:



1. The query’s GROUP BY columns are equal to, or a subset of, those of the cached query. So, a query like:

SELECT salesman, state, SUM(sales) FROM company.sales GROUP BY salesman, state

provides the answer for

SELECT salesman, SUM(sales) FROM company.sales GROUP BY salesman



2. The query’s WHERE clause is equal to or more restrictive than the WHERE clause of the cached query, and any additional conditions refer only to columns that were GROUP BY-ed.

A query like:

SELECT date, salesman, SUM(sales) FROM company.sales WHERE date > '2003-01-01' GROUP BY date, salesman

provides the answer for:

SELECT date, salesman, SUM(sales) FROM company.sales WHERE date > '2003-01-01' AND date > '2003-06-01' GROUP BY date, salesman

Obviously, a human will not write a query with such a WHERE clause. However, a graphical Pivot tool may be explicitly designed to create such a query when drilling down, so that a cache hit is scored.



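3. The query’s source tables are equal to or a subset of the cached query’s source tables (this holds as long as the extra joins neither drop nor duplicate rows of the tables the query uses).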

So, the query:

SELECT salesman, gender, SUM(sales) FROM company.sales INNER JOIN salesman USING (salesman_id) GROUP BY salesman, gender

or even something very complex with 10 joined tables, can be used to answer:

SELECT salesman, SUM(sales) FROM company.sales GROUP BY salesman

or even something fairly complex that joins 5 of those tables.



4. The query’s aggregate functions are equal to or a subset of the cached query’s. Certain aggregate functions, such as COUNT(DISTINCT), cannot be served from the cache in this way, and others require special care (AVG(value) must be translated to SUM(value)/COUNT(value)).
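The four rules above can be sketched as a single test. This is only a minimal illustration, not a worked-out design: the QueryShape record, its field names and the can_answer function are hypothetical, WHERE conditions are assumed to be already normalized into AND-ed (column, condition text) pairs, and COUNT(DISTINCT) is simply rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryShape:
    """Hypothetical cache index entry: what a GROUP BY query reads and computes."""
    tables: frozenset      # source tables, e.g. {"company.sales"}
    group_by: frozenset    # GROUP BY columns
    aggregates: frozenset  # aggregate expressions, e.g. {"SUM(sales)"}
    where: frozenset = frozenset()  # AND-ed conditions as (column, text) pairs


def can_answer(cached: QueryShape, query: QueryShape) -> bool:
    """Return True if `query` could be computed from the dataset cached for `cached`."""
    # Rule 1: GROUP BY columns are equal to or a subset of the cached ones.
    if not query.group_by <= cached.group_by:
        return False
    # Rule 2: the query repeats every cached condition, and any extra
    # condition only touches a column that the cached query grouped on.
    if not cached.where <= query.where:
        return False
    extra = query.where - cached.where
    if any(column not in cached.group_by for column, _ in extra):
        return False
    # Rule 3: source tables are equal to or a subset of the cached ones
    # (assumes the extra joins neither drop nor duplicate rows).
    if not query.tables <= cached.tables:
        return False
    # Rule 4: aggregates are a subset; COUNT(DISTINCT) is never re-aggregated.
    if not query.aggregates <= cached.aggregates:
        return False
    if any(a.upper().startswith("COUNT(DISTINCT") for a in query.aggregates):
        return False
    return True


cached = QueryShape(
    tables=frozenset({"company.sales"}),
    group_by=frozenset({"date", "salesman"}),
    aggregates=frozenset({"SUM(sales)"}),
    where=frozenset({("date", "date > '2003-01-01'")}),
)
drill_down = QueryShape(
    tables=cached.tables,
    group_by=frozenset({"salesman"}),
    aggregates=cached.aggregates,
    where=cached.where | {("date", "date > '2003-06-01'")},
)
by_state = QueryShape(
    tables=cached.tables,
    group_by=frozenset({"state"}),
    aggregates=cached.aggregates,
    where=cached.where,
)

print(can_answer(cached, drill_down))
print(can_answer(cached, by_state))
```

Comparing conditions as plain strings is the crudest possible choice; it works only because the Pivot tool is assumed to generate its WHERE clauses in a fixed form.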



The benefit of such a cache implementation is that it is data-independent. You do not have to describe your data prior to executing your queries. It also does not rely on creating your own cache structure and your own cache index: a few tables can hold the cache index, and they can themselves be queried with SQL to determine a hit.
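As a rough sketch of a cache index held in ordinary tables, the following looks up cached datasets whose GROUP BY columns cover the requested ones. The table names cache_entry and cache_group_by, their columns and the find_cached_tables helper are all made up for illustration, and SQLite (through Python's sqlite3 module) stands in for MySQL so that the example runs on its own. Only the GROUP BY rule is checked here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cache_entry (
    id INTEGER PRIMARY KEY, cache_table TEXT, source_table TEXT);
CREATE TABLE cache_group_by (
    entry_id INTEGER, col TEXT, PRIMARY KEY (entry_id, col));
INSERT INTO cache_entry VALUES (1, 'c1234567', 'sales');
INSERT INTO cache_group_by VALUES (1, 'salesman'), (1, 'state');
INSERT INTO cache_entry VALUES (2, 'c7654321', 'sales');
INSERT INTO cache_group_by VALUES (2, 'date');
""")


def find_cached_tables(conn, source_table, group_by_cols):
    """Return cache tables whose GROUP BY columns include all requested ones.

    `group_by_cols` must be a non-empty list of distinct column names.
    """
    placeholders = ", ".join("?" for _ in group_by_cols)
    sql = f"""
        SELECT e.cache_table
        FROM cache_entry e
        JOIN cache_group_by g ON g.entry_id = e.id
        WHERE e.source_table = ? AND g.col IN ({placeholders})
        GROUP BY e.id
        HAVING COUNT(*) = ?   -- every requested column was found
        ORDER BY e.id
    """
    params = [source_table, *group_by_cols, len(group_by_cols)]
    return [row[0] for row in conn.execute(sql, params)]


print(find_cached_tables(conn, "sales", ["state"]))
print(find_cached_tables(conn, "sales", ["state", "date"]))
```

The HAVING COUNT(*) test relies on the primary key of cache_group_by, which guarantees that each column is listed at most once per entry.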



If an interactive Pivoting tool is executing those queries, the cache should (hopefully) soon fill with entries that allow most, if not all, of the queries resulting from interactive browsing to be served from the cache. Additionally, the tool can pre-fetch relevant data by drilling down a bit more than the user has requested, resulting in a cache hit when the user indeed drills deeper. Also, the tool does not have to cache data in order to sort it on its own, since queries that differ only in their ORDER BY can be served from the same cached dataset. An additional enhancement would be the ability to serve a hit from the cache using more than one cached table.



Example:



A. No cache hit, so we just populate the cache

Initial query:

SELECT salesman, state, COUNT(*) FROM sales GROUP BY salesman, state

The server does:

CREATE TABLE `1234567` SELECT salesman, state, COUNT(*) FROM sales GROUP BY salesman, state

SELECT * FROM `1234567`



B. A cache hit

Initial query:

SELECT state, COUNT(*) FROM sales GROUP BY state

The server does:

SELECT state, SUM(`COUNT(*)`) AS `COUNT(*)` FROM `1234567` GROUP BY state

[`COUNT(*)` being a valid column name for table `1234567`, as long as both names are quoted with backticks]
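The two steps can be tried end to end with a small script. This is only a sketch under stated assumptions: SQLite (via Python's sqlite3) stands in for MySQL, the sample rows and the amount column are invented, the cache table is named cache_1234567, and the cached columns are aliased as cnt and total rather than being called `COUNT(*)`. It also shows the AVG case, rebuilt from the cached SUM and COUNT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (salesman TEXT, state TEXT, amount REAL);
INSERT INTO sales VALUES
    ('ann', 'NY', 100), ('ann', 'NY', 300), ('ann', 'CA', 50),
    ('bob', 'CA', 200), ('bob', 'NY', 150);

-- Step A: no cache hit, so the grouped result is stored as a table.
CREATE TABLE cache_1234567 AS
    SELECT salesman, state, COUNT(*) AS cnt, SUM(amount) AS total
    FROM sales
    GROUP BY salesman, state;
""")

# What the user asked for, computed directly from the base table.
direct = conn.execute(
    "SELECT state, COUNT(*), AVG(amount) FROM sales "
    "GROUP BY state ORDER BY state"
).fetchall()

# Step B: the same answer, computed from the cached table only.
# COUNT(*) becomes SUM(cnt); AVG becomes SUM(total) / SUM(cnt).
from_cache = conn.execute(
    "SELECT state, SUM(cnt), SUM(total) / SUM(cnt) FROM cache_1234567 "
    "GROUP BY state ORDER BY state"
).fetchall()

print(direct)
print(from_cache)
```

Both queries should report the same per-state counts and averages, while the second one never touches the sales table.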



II. DATA DESCRIPTION AND MANIPULATION



1. In my humble opinion, people do not think in MDX. Instead, they think in terms of GROUP BY. So, for most uses, it should be sufficient to allow the user to construct his own GROUP BY clause and specify the aggregate functions that he is interested in, rather than asking him to create a cube, an axis, a view, a measure, etc.



2. People also think in terms of everyday phrases, like “last 7 days” or “all Mondays”. A pre-compiled dictionary of such phrases will be immensely useful, as will the ability to define new ones. People also like to be able to do “call duration in 5-minute intervals”, which is not available in Microsoft Excel when working with columns of type “time”.



3. Normal people do not expect all of their columns to be available for analysis, and they do not want their report to have either 2 or 2000 rows.



For example, if you have a date column and you do a Microsoft Excel PivotTable, you will first have to select that column from a list that contains a bunch of other fields, then wait for the table to be generated with a row for each date, and then group or sort the dates somehow to arrive at the numbers that interest you. Other tools (at least in their example scenarios) facing a date column will start with the data grouped by year, and you then have to expand to month (the months often being shown as numbers), and from there on to weeks and days, and the table has to refresh and recalculate a dozen times for your convenience.



Instead, a person should have a list of phrases that she can use as rows and columns, like “last 7 days per day”, “all months since January by week”, etc. She will then be able to arrive precisely at the data that she wants to see. Only one SQL query will be required.
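A phrase dictionary of this kind could be as simple as a mapping from phrases to WHERE and GROUP BY fragments. The sketch below is hypothetical throughout: the PHRASES table, the build_query helper and the reading of “last 7 days” as “today and the six days before it” are assumptions, the column is assumed to be a plain DATE, and the generated SQL uses MySQL's DAYOFWEEK(), where 2 means Monday.

```python
from datetime import date, timedelta


def last_n_days_per_day(column, n, today):
    """WHERE/GROUP BY fragments for 'last n days per day', today included."""
    start = today - timedelta(days=n - 1)
    where = (f"{column} >= '{start.isoformat()}' "
             f"AND {column} <= '{today.isoformat()}'")
    return {"where": where, "group_by": column}


def all_mondays(column, today):
    """WHERE/GROUP BY fragments for 'all Mondays' (DAYOFWEEK: 1 = Sunday)."""
    return {"where": f"DAYOFWEEK({column}) = 2", "group_by": column}


# Hypothetical phrase dictionary: phrase -> function(column, today).
PHRASES = {
    "last 7 days per day": lambda col, today: last_n_days_per_day(col, 7, today),
    "all Mondays": all_mondays,
}


def build_query(phrase, table, column, aggregate, today):
    """Turn one phrase into the single SQL query that answers it."""
    parts = PHRASES[phrase](column, today)
    return (f"SELECT {parts['group_by']}, {aggregate} FROM {table} "
            f"WHERE {parts['where']} GROUP BY {parts['group_by']}")


print(build_query("last 7 days per day", "sales", "date",
                  "SUM(sales)", date(2003, 6, 7)))
```

Because the phrase fixes both the filter and the grouping up front, the report needs exactly one query, and that query has the regular shape that the cache in Section I can match.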



4. Data is not always perfect



If you store your data as 1 and 0, and your boss wants to see “yes” and “no”, this should be possible. If sales > $5000 means a pro salesman, then the user should not have to display the raw sales number in a column, then group on figures below $5000 and figures above $5000, and then separately calculate the salesmen that are too recently hired to be able to score. Months and days of the week have names. Times of the day may be morning, afternoon and evening, not (0..23:0..59:0..59). Times that are messed up due to time zones can be adjusted on the fly without jeopardizing the work of company software that relies on the data being messed up.
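One way to get such labels without touching the stored data is to generate CASE expressions on the fly. The following is a sketch only: the helper names, the calls table and the hour boundaries (before 12 is morning, before 18 is afternoon) are assumptions, and SQLite is used so the example can run; in MySQL the hour would come from HOUR(called_at) instead of strftime.

```python
import sqlite3


def label_case(column, labels, default="unknown"):
    """Build a CASE expression mapping stored integer codes to display text."""
    whens = " ".join(f"WHEN {int(code)} THEN '{text}'"
                     for code, text in labels.items())
    return f"CASE {column} {whens} ELSE '{default}' END"


def time_of_day_case(hour_expr):
    """Build a CASE expression turning an hour (0..23) into a part of the day."""
    return (f"CASE WHEN {hour_expr} < 12 THEN 'morning' "
            f"WHEN {hour_expr} < 18 THEN 'afternoon' ELSE 'evening' END")


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls (called_at TEXT, answered INTEGER);
INSERT INTO calls VALUES
    ('2003-06-01 09:15:00', 1), ('2003-06-01 13:40:00', 0),
    ('2003-06-01 14:05:00', 1), ('2003-06-01 20:30:00', 1);
""")

hour = "CAST(strftime('%H', called_at) AS INTEGER)"  # SQLite-specific
sql = (f"SELECT {time_of_day_case(hour)} AS part_of_day, "
       f"{label_case('answered', {1: 'yes', 0: 'no'})} AS answered_label, "
       "COUNT(*) FROM calls "
       "GROUP BY part_of_day, answered_label "
       "ORDER BY part_of_day, answered_label")
rows = conn.execute(sql).fetchall()
print(rows)
```

The stored 1/0 values and timestamps stay exactly as they are, so other software that depends on them is not affected.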



III. PRESENTATION



A mod_perl GUI is envisioned that will allow you to view and rotate your data as you see fit. In particular, the following goals have been set:

1. Fully bookmarkable URLs that people can mail around to others so that they too can see the same report;

2. Usage of phrases described in Section II to make access to the most relevant portions of the report easier;

3. Sorting, drilling up and down, expanding, contracting, hiding, showing, axis-swapping, grouping and ungrouping, coloring, etc.;

4. Tabs instead of drop-down lists, e.g. a tab for January, a tab for February, etc.;

5. Access control, full logging, etc.;

6. Speed, speed, speed. Anything that is slower than Microsoft Excel for comparable datasets should be optimized. Data may be queried (and retrieved) in portions to provide concurrency and instant feedback to the user. For example, if we have a table keyed by date, we can always retrieve January, show it to the user, and then proceed to retrieve the other months and keep displaying them as they arrive (which, as a side effect, may cause other queries to slip in between, providing faster performance for everyone at least perceptually). Any queries that are known to run long (based on timing of previous invocations) should have a progress bar.
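Retrieval in portions, as described in the last goal, might look like the generator below. It is a sketch only and is written in Python rather than mod_perl so that it matches the other examples: the sales table, its sale_date column (ISO date strings) and the helper names are invented, and SQLite stands in for MySQL.

```python
import sqlite3
from datetime import date


def month_bounds(year, month):
    """Return (first day of month, first day of next month) as ISO strings."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


def totals_by_month(conn, year, months):
    """Yield (month, rows) one month at a time, so each can be shown at once."""
    for month in months:
        start, end = month_bounds(year, month)
        rows = conn.execute(
            "SELECT salesman, SUM(amount) FROM sales "
            "WHERE sale_date >= ? AND sale_date < ? "
            "GROUP BY salesman ORDER BY salesman",
            (start, end),
        ).fetchall()
        yield month, rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (salesman TEXT, sale_date TEXT, amount INTEGER);
INSERT INTO sales VALUES
    ('ann', '2003-01-05', 100), ('bob', '2003-01-20', 50),
    ('ann', '2003-01-31', 25), ('ann', '2003-02-03', 70);
""")

for month, rows in totals_by_month(conn, 2003, [1, 2]):
    print(month, rows)  # a GUI would render this portion immediately
```

Each portion is a separate short query, which is what lets other users' queries slip in between and lets the first month appear before the rest have been fetched.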