Hi there.
Working as an analyst I often find myself doing adhoc queries on quite large amounts of data. The goal is often to find out some pattern to movements in our customer base or to track other changes over time.
I'm not looking for some definitive answers (not sure there are any) but if you guys can point me in some direction or offer some insights about other ways to go about this it would be greatly appreciated.
I work in Telecommunications and my main areas of interests are calls, products/services and revenues and how they relate to our customers and how they change over time.
Example
For a given month I have aggregated call data (millions of rows) and customer ids. A hypothetical table definition could look like:
Code:
CREATE TABLE calls (First_day_of_month (Date)
,CustomerID (Int)
,Call_Seconds (Int))
The combination of First_day_of_month and CustomerID is unique.
A typical question I would like to answer is, for each month how many customers called more than they did last month? For the last two years.
What I end up doing is something like (pseudocodish):
Code:
SELECT calls1.First_day_of_month
,COUNT(calls1.CustomerID) AS Number_of_customers_calling_more
FROM calls calls1
INNER JOIN calls calls2
ON calls1.First_day_of_month = DATEADD(calls2.First_day_of_month, MONTH, 1)
AND calls1.CustomerID = calls2.CustomerID
WHERE calls1.Call_Seconds > calls2.Call_Seconds
AND calls1.First_day_of_month >= CAST('2008-01-01' AS DATETIME)
What I'm doing is joining the calls table on itself on customer ids and month but changing the specific month on one of the tables to reflect the time period I'm interested in tracking changes over.
This, as you can imagine, puts the database through it's paces and depending on the size of the datasets can be quite slow to work with. Slow in this case means hours for query completion and I find that puts a defenitive damper on my enthusiasm for digging through the data. It works, but it is a serious pain to work with.
I've tried a lot of different indexing options and it definitely helps up to a point but I'm thinking that this problem can't be unique to me and somewhere there is a more elegant solution.
Any thoughts? Pointers? Links?
I'm willing to rip the data apart in any way that will help me get better performance.
Thanks