Working as an analyst I often find myself doing adhoc queries on quite large amounts of data. The goal is often to find out some pattern to movements in our customer base or to track other changes over time.
I'm not looking for some definitive answers (not sure there are any) but if you guys can point me in some direction or offer some insights about other ways to go about this it would be greatly appreciated.
I work in Telecommunications and my main areas of interests are calls, products/services and revenues and how they relate to our customers and how they change over time.
For a given month I have aggregated call data (millions of rows) and customer ids. A hypothetical table definition could look like:
The combination of First_day_of_month and CustomerID is unique.
A typical question I would like to answer is, for each month how many customers called more than they did last month? For the last two years.
What I end up doing is something like (pseudocodish):
,COUNT(calls1.CustomerID) AS Number_of_customers_calling_more
FROM calls calls1
INNER JOIN calls calls2
ON calls1.First_day_of_month = DATEADD(calls2.First_day_of_month, MONTH, 1)
AND calls1.CustomerID = calls2.CustomerID
WHERE calls1.Call_Seconds > calls2.Call_Seconds
AND calls1.First_day_of_month >= CAST('2008-01-01' AS DATETIME)
What I'm doing is joining the calls table on itself on customer ids and month but changing the specific month on one of the tables to reflect the time period I'm interested in tracking changes over.
This, as you can imagine, puts the database through it's paces and depending on the size of the datasets can be quite slow to work with. Slow in this case means hours for query completion and I find that puts a defenitive damper on my enthusiasm for digging through the data. It works, but it is a serious pain to work with.
I've tried a lot of different indexing options and it definitely helps up to a point but I'm thinking that this problem can't be unique to me and somewhere there is a more elegant solution.
Any thoughts? Pointers? Links?
I'm willing to rip the data apart in any way that will help me get better performance.
Is this SQL Server? If so, what version?
You say the first day of month and customer id are unique - please can you post the exact definition of that constraint i.e. script it out?
Are you aware that using functions on indexed columns renders them unsargable and therefore unusable (at least in seeks)?
if all your queries are ad-hoc then it is really quite difficult to plan for them. On the other hand, if this is a regular requirement then a column to store the previous month's calls would be a simple and very performant solution.
It's all kinds of servers I do most of my work on Sql Server 2008 but I'm not a MSSQL fanatic and play around with other systems as well depending on the situation.
Regarding the uniqueness of CustomerID vs Month that's not something I have actually specified constraints for. The integrity of the data is ensured by other means and is not a priority in this environment. I've been told that constraints are not really helping with performance but I could be wrong. That said, each Customer ID should only show up once every month. And so far it does.
I dig the data out of our enterprise data store and dump it into Sql Server 2008 and I'm not working in the production environment. The output is used for reports or if I find something interesting it's used as a prototype for requirements to go into production at a later stage.
I was not aware that funcions on indexed columns had such an impact, thanks for the information.
Unique and Primary Key constraints are enforced using B-Tree indexes in most enterprise RDBMSs, so whoever told you that constraints don't help with performance was quite wrong. Certain other constraints can be used by some RDBMS engines to optimise their querying, for example NOT NULL and FOREIGN KEY constraints.
Indexes pretty well always either help or make no difference to select queries - they have no cost in this regard. Their cost is upkeep during INSERT, DELETE and UPDATE operations. If this is offline and you don't care how long it takes to load, just how long it takes to query, then I would index liberally and defragment the indexes after every load.
You would do well to study execution plan analysis in order to identify what indexes are being used and where indexes are required. The DTA in SQL Server can help you with this.
I'll do some further testing and see what I can come up with based on your input. I was aware that primary keys helped with performance but not that unique constraints could help further. Have a nice weekend
(Anyone else with further input is very welcome to contribute)
... I was aware that primary keys helped with performance but not that unique constraints could help further...
Well, a primary key is just a unique key with a further constraint the data be non-null.
Last edited by loquin; 05-20-10 at 15:27.
"Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
"I have my standards. They may be low, but I have them!" - Bette Middler
"It's a book about a Spanish guy named Manual. You should read it." - Dilbert