Quote:
Originally Posted by verellen
Hey all,
I have the following question:
When designing a database, qua performance, which is the best senario:
Having a table with lots of columns, so you do not have to use many joins.
Or storing your data in seperated tables, which is more clarifying and flexible, but when you need to gather the data, you'll have to use more joins.
When does having a lot of joins have affect on the performance?
|
I'll run over the pros and cons very quickly:
Pros
- For some queries, performance is better.
- The schema is simpler.
Cons
- Some queries may perform worse.
- Updating is more expensive, and concurrency of updates may suffer.
- Integrity constraints may not be expressible as simple foreign key constraints.
- Your business logic will move out of the schema into the application.
And the "simpler schema" often lures people in: taking all those messy tables and glomming them into one is very appealing to, shall we say, the PowerPoint mentality. So when you're proposing an initial design, you've probably only done slides on the schema, and so some jackalope will say, "Keep it simple, stupid" because he thinks that deleting complexity from the schema is going to make it magically disappear. The trouble is that when the slideshow is over and you start working on the application, that complexity comes right back. Since the schema is settled on, the complexity has to be managed by the application! You just got promoted from DBA / app designer to DBMS-functionality-reinventer.
Now, there are some specialized cases where join performance is critical over everything else. But it's hard to know ahead of time whether that's your case and whether it can only be achieved by denormalizing. To know for certain, you're going to want to have profiling data and query plans to prove that joins are the actual chokepoint for your performance. You're going to want to prove to yourself that you can't mitigate the performance problem with optimizer hints, clustered tables or materialized views. All of these techniques can potentially solve the problem, but only if you have an actual system with real data to test them on.
Which brings me to the standard warning about premature optimization, which applies doubly when you're designing a schema. It's generally much harder to go from a denormalized schema to a normalized schema than the reverse. It sounds easy enough in theory, but in practice, denormalization allows redundancy to creep in. (When you denormalize, you're not using the DBMS's native ability to prevent redundancy.) Redundancy, in turn, allows inconsistent data to creep in. Real data, unlike toy examples you see in books, represents actual people and actual things and actual money. If you have to fix that data, you're going to have to be the bearer of bad news to a whole lot of people.
Redundancy will also appear in the application. If you have a check in the application and another coder does the classic "copy and paste", that check has now been replicated. If you have to change it, you have to somehow verify that you've changed it in every place in your application. You could, in theory, put all those checks neatly in an application library. But rather than rely on programmers to always do the right thing, it makes more sense to centralize your logic and manage it through the DBMS. That's what it's there for, and the way to make that work is by normalizing.