Unanswered: Tuning Very Large SQL-Server Databases (Physical Storage Optimization)
Tuning Very Large SQL-Server Databases
[COLOR="Blue"]by Mike Eberhart (5/09/2007)[/COLOR]
SQL-Server Performance Optimization and SQL-Server Speed Optimization should always be a primary consideration during the database design phase, regardless of the size of the relational database — with the exception perhaps of "tiny" or "trivial" databases where future growth is guaranteed not to be an issue. And, very large databases (in excess of a Gigabyte of data, or millions of rows of data in any particular table) definitely require the SQL-Server database designer / administrator to consider the ramifications of proper (or improper) database design. Reduce Physical Storage Requirements
The first thing I always make sure of when designing databases, and when tuning large SQL-Server databases for performance, is that physical storage space is not being wasting with data-types that are overly precise or excessively large for their intended function, and that data is properly normalized. If you don't understand the concepts of normalization, please study up on it before reading further (or pretending you can do your job as a SQL-Server developer or DBA).
Keep in mind that this discussion is all to be considered in the context of very large databases, where there will be millions of rows of data in one or more tables.
I will now present some very simple rules for reducing physical storage requirements and some (storage space) savings-opportunity examples to consider:
Do not use NVarchar or NChar (vs. Varchar and Char respectively) unless you need Unicode character support. Using NVarchar (or NCHAR) where a standard Varchar/Char will suffice will double the storage requirements for a column, since Unicode characters map to double-byte pairs (to accommodate all those foreign language characters and extended symbols, etc).
Savings Opportunity: I have encountered NVarChar use on columns where there is an Alphanumeric "TransactionID" on millions of rows of data, but where that AlphaNumeric value is most certainly only containing the 10 digits and the standard 26 letters of the English alphabet. Instant space savings!
Use SmallDateTime instead of DateTime, unless you really need the precision. SmallDateTime fields are accurate to the nearest minute, whereas full DateTime types are accurate to 3.33 millisecond intervals. And, that extra precision comes with the price tag of double the storage — SmallDateTime requires 4 Bytes instead of 8 Bytes for DateTime.
Savings Opportunities: if you see fields like "AccountingPeriodBeginDate", it should be obvious that an accounting period is going to start on a particular day boundary, and SmallDateTime will more than suffice. I also regularly see fields like "LastChangeDate" on tables in large databases where, upon inquiring, nobody can provide any reason why any accuracy beyond a particular minute is needed, but yet the fields are defined as full DateTime fields.
Use the smallest Integer data-types that will safely and comfortably handle all possible values in an integer-data column. SQL-Server provides a few datatypes for this, including bigint, int, smallint, and tinyint. Each has their storage limitations and storage-space requirements. TinyInt takes just one byte, but can only store values ranging from 0-255 (i.e., 2^8 values), whereas SmallInt is two bytes in length, and can store 2^16 values, Int is 4-bytes and holds 2^32 values, and BigInt is 8-bytes and holds 2^64 values (see the pattern?)
Savings Opportunity: look for columns that have an obviously small number of values, but whose column-definitions are grossly oversized. E.g., if you see a field like (accounting) "Quarter" defined as an Int, TinyInt can easily hold the four possible values (Quarter 1, 2, 3, or 4) and saves 3bytes/row. Or, perhaps you'll see UserID defined as an Int where there's no way in the next 100 years the application will have over 10 or 20,000 users — this sure looks like a good situation to downsize to SmallInt and save half the space!
As with Integer types, the money data-types come in a large and small size — Money and SmallMoney respectively; the first of which takes 8-bytes to store, and the latter which takes only 4 bytes. SmallMoney values are limited to roughly +/-214,000.0000 compared to hundreds of trillions for Money.
Savings Opportunity: if you have per-unit Costs, Charges, or the like stored in your database columns where the price is never going to even begin to approach that $214,000 value, use SmallMoney and cut your storage by half. E.g., if you are storing the price of items sold in your convenient-store locations, or even the price-per-gallon of gasoline you sell, this limitation should suffice for the foreseeable future.
Decimal and Numeric data-types provide a bit more complexity when analyzing the proper "best" size for the fields.
Variable-length fields where fixed-length should be used. Varchar fields require 2 extra bytes in addition to the the defined maximum field-width in order to store a value which holds the characters (used) count. Varchars are great for when there really is heavy variation in the actual length of data being stored. But, they do not make sense if you'll always be filling all bytes in the field (or even all bytes minus two).
Savings Opportunity: perhaps one of the most ridiculous things you will encounter is a one or two character field that is defined as Varchar(1) or (2). This is one sure way to make a one-character field take 3-bytes of storage per row. I have seen this repeatedly with things like "Flag" fields or "Code" fields people design into table structures. I personally think the RDBMS should not allow a user to even define Variable-width columns of just one or two characters.
Note: SQL-Server has one oddity to watch out for to ensure that Char fields are truly treated as fixed-length fields "If SET ANSI_PADDING is OFF when either CREATE TABLE or ALTER TABLE is executed, a char column that is defined as NULL is handled as Varchar"...
Savings Opportunity: start by looking for just plainly blatant ignorant denormalization. I have seen so many massive tables where things like "LastUpdateUser" is a 10 or 20 character (if not Unicode character) field in databases where those last-update "Users" are a finite group of individuals that can be counted on both hands. Talk about a waste! If you encounter this type of situation, it's time to create a "UpdateUsers" table (or whatever) with an Integer primary-key (in this example, a TinyInt) and a field called "UserName" or whatever, and replace the "LastUpdateUser" on the massive-rows table with "LastUpdateUserID" that is a TinyInt column (foreign-key to the new UpdateUsers Table). This will save a ton of space! And, there is nearly zero cost in joining to such a small table later if you really need to get the name associated with the ID.
Another Savings Opportunity: It involves a bit of data-analysis, and thinking "out of the box". Basically, I will look for data-patterns that repeat, and whose unique (pattern) values are finite though not instantly apparent — especially when the patterns span multiple columns on the same table. I've routinely discovered that the unique values from a combination of few fields may only number in the thousands, even though the total table rows exceeds 50 or 100 million. When I find these, and establish the reason for the pattern (and ensure my logic will accommodate future conditions too)vings guidelines, chances are you'll find a way to make some significant storage-space reductions in your SQL-Server databases. And, you will have made the first steps towards performing some serious speed-optimizations in your database as well.
One note of caution when making any database changes: be sure you understand the ramifications beyond the scope of the database. Many of the changes I've outlined will be more or less transparent to other applications, depending on what (languages) the applications are developed in, and how the applications access the data (note: I always recommend that applications do not have direct access to the data-tables, but rather access data through stored procedures — for many reasons, one of which is to allow simplified maintenance to the underlying database tables without impacting other applications).
If you don't know what side-effects your changes could cause for Graphical User Interfaces (GUI applications) or reporting applications, and so on are, then be sure to involve someone that does. E.g., When changing Int fields to SmallInt, a Borland (aka CodeGear) Delphi based program most likely won't care, but C# data-bindings may have issues on an ASP.NET page or such. So, be careful. From what I have seen, Delphi as well as C# all seem fine with SmallDateTime in place of full DateTime fields in many cases, but it can also depend on how customized an application is.
Computing / Estimating Storage Requirements
In theory, you should be able to quickly estimate a Table's (physical storage) size just by knowing the number of rows, and multiplying that by the total of all columns' storage requirements. This can be put into a simple formula something like this (I didn't put ALL data types in here, but you should get the idea), where you just multiple the number of rows in a table by: