I'm currently performance tuning a table with 100 million rows in it
(about 18 GB of data) and would like to know:
1. Is the table too large to be performance tuned? Would it be better to just
redesign the schema?
2. Can techniques such as indexed views really help with tuning such a table?
3. How long would it take to create a clustered or non-clustered index on
a varchar column (for instance) on a table with 100 million rows?
(I know this is a function of hardware as well - let's assume I'm using
a fairly maxed out DL 360 - i.e. dual processor with 4 GB of memory)
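For reference, these are the kinds of statements I mean (table and column names here are hypothetical):

```sql
-- Clustered index: physically orders the table, so all ~18 GB must be
-- sorted and rewritten. This is the expensive case.
CREATE CLUSTERED INDEX IX_ActivityLog_LogDate
    ON dbo.ActivityLog (LogDate);

-- Non-clustered index on a varchar column: only the key values and row
-- locators are copied and sorted, so it is usually considerably faster.
CREATE NONCLUSTERED INDEX IX_ActivityLog_UserName
    ON dbo.ActivityLog (UserName);
```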
My question wasn't about your typing skills.
It's about how long it takes to create a clustered / non-clustered
index on a varchar column of a table with 100 million rows.
In this case I'm not joining other tables (regarding the indexed view question).
I'm only selecting from this one table with 100 million rows.
Thank you Kaiowas. That's good to know.
How long does it typically take to perform a query on your indexed,
150 million row table ?
In this case, the fragmentation is low. The table serves as a user activity log.
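(For what it's worth, this is roughly how I checked fragmentation - the table name is hypothetical. Scan density near 100% and low logical fragmentation are what indicate the table is in good shape:)

```sql
-- WITH FAST skips the page-density details and just reports
-- fragmentation, which is enough for a quick health check.
DBCC SHOWCONTIG ('dbo.ActivityLog') WITH FAST;
```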
This is currently a 20 GB table with 100 million rows and I'd like to
decide between re-modelling the database design (normalize the table into
a few more tables) or focusing on tuning the table.
Sorry, I answered your question as asked instead of what you meant. It has been one of those weeks.
Building indexes isn't really CPU bound, and is rarely RAM bound (although for your table the RAM could be a constraint). Disk performance is usually the key component for building indexes. The DL 360 can be configured a lot of different ways, and disk performance is wildly variable depending on how the system is configured.
Normalizing the table is usually a good idea. When dealing with 20 GB of data, normalization is really important. If you normalize the data, you might be able to turn one table with 20 GB of data into six tables with 8 GB (total) of data. That will almost always improve your performance all by itself, just because the machine needs to schlepp a lot less data from point A to point B in order to resolve any query.
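As a rough sketch of what that split can look like for an activity log (all names here are hypothetical), the wide, repeated strings get factored out into small lookup tables:

```sql
-- Repeated wide values move into small lookup tables...
CREATE TABLE dbo.Users (
    UserID   int IDENTITY PRIMARY KEY,
    UserName varchar(100) NOT NULL
);

CREATE TABLE dbo.Actions (
    ActionID   int IDENTITY PRIMARY KEY,
    ActionName varchar(200) NOT NULL
);

-- ...and the 100-million-row log keeps only narrow integer keys, so
-- each row (and therefore each page of disk I/O) carries far fewer bytes.
CREATE TABLE dbo.ActivityLog (
    LogID    bigint IDENTITY PRIMARY KEY,
    LogDate  datetime NOT NULL,
    UserID   int NOT NULL REFERENCES dbo.Users,
    ActionID int NOT NULL REFERENCES dbo.Actions
);
```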
Just out of curiosity, how wide is this 20 GB table? For me, that would be one of the places to look. Since creating an index on this table could take up to half an hour, this is something that you will not be able to do by trial and error. Are you certain there is table scanning going on? And, more importantly, that the queries you are tuning can use indexes?
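One quick, non-destructive way to check (the query here is illustrative only): turn on showplan output and look for "Table Scan" operators in the plan.

```sql
-- With SHOWPLAN_TEXT on, the query is NOT executed; SQL Server just
-- returns the estimated plan so you can look for Table Scan operators.
SET SHOWPLAN_TEXT ON;
GO
SELECT COUNT(*) FROM dbo.ActivityLog WHERE UserName = 'someuser';
GO
SET SHOWPLAN_TEXT OFF;
GO
```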
As PatP points out, normalization will give you significant performance benefits. I'd say it would trump disk speed and will improve performance for caching and index seeking and searching. Then, index those tables to benefit your application.
"Normalizing the table is usually a good idea. When dealing with 20 Gb of data, normalization is really important. If you normalize the data, you might be able to turn one table with 20 Gb of data into six tables with 8 Gb (total) of data. That will almost always improve your performance all by itself, just because the machine needs to schlepp a lot less data from point A to point B in order to resolve any query."
Ok, I'm gonna stick my neck out on this one and disagree with Pat.
Normalizing data does NOT necessarily result in an increase in performance, because many of your queries will now require joins which would not be required in a denormalized schema. The primary purpose of denormalizing data is to increase the performance of output queries. Witness the common Star Schema used (all too frequently) in Data Warehouses.
What you lose by denormalizing data is control over relational integrity. When you have hierarchical relationships between datasets greater than two levels deep, and then try to represent this in a flat table, enforcing business rules for input becomes cumbersome, and you will find yourself doing a lot of coding to make up for it. That is why Star Schemas are best used for applications that focus on reporting, and which receive their data from external sources which can be trusted to verify the integrity of the data they submit.
That said, I never use Star Schemas (the idiot's "One Size Fits All" solution to data warehousing), and the databases I design are almost always fully normalized.
If it's not practically useful, then it's practically useless.
As a general rule, I find that normalized databases will outperform non-normalized databases almost every time.
If you think about what work goes on at a primitive level, schlepping about 8 GB of data which you join at the last minute to produce 20 GB of results is going to be faster than hauling the entire 20 GB from start to finish. The overhead needed to support joins when indices are available is trivial, well less than 1% of total query cost for each joined table. When you compare the cost of the join to the cost of the raw I/O, the normalized design will run roughshod over the non-normalized design.
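To make that concrete (using the same hypothetical names as earlier in the thread), the normalized query joins the narrow tables back together only at output time; the join keys are small integers resolved through indexes, and the I/O saved is the wide varchar data that no longer travels with every log row:

```sql
-- Narrow log rows join to small lookup tables; the wide strings are
-- read once per distinct value instead of once per log row.
SELECT l.LogDate, u.UserName, a.ActionName
FROM dbo.ActivityLog AS l
JOIN dbo.Users   AS u ON u.UserID   = l.UserID
JOIN dbo.Actions AS a ON a.ActionID = l.ActionID
WHERE l.LogDate >= '20050101';
```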
The reason that data warehouses store non-normalized data doesn't really have anything to do with machine performance... The intent of a data warehouse is to make data available to the end user. End users don't do joins naturally, so a star schema suits them well. When you look at the actual performance issues, a normalized design usually outperforms the warehouse design from the machine's perspective.
In a warehouse situation, people time is a lot more expensive than machine time... If the subject matter experts using a data warehouse have to spend an extra minute composing a query, that delay can cost more than the query can save. If the delay impacts a decision (for example in a board meeting), it will definitely cost the business money.
Disk is usually the weakest link in a database chain between a web/application server and the data it needs. Reducing the total amount of disk I/O is the highest payback optimization that I've found. If you can normalize data, it almost always reduces the amount of disk I/O needed, and that speeds up the whole system.