Background
Often while remodeling legacy application, one of the important tasks for the architects is to have an optimum usage of storage capabilities of database. Most of the legacy applications are constrained by the technology available at the time of their development and hence aren’t optimum as per current scenario. One of such cases is the extensive usage of CHAR fields, which aren’t optimum solution for space storage now. This paper will highlight the systematic approach which needs to be taken while converting CHAR data fields to VARCHAR in DB2 for iSeries.
Problem Statement
The basic constraint of the CHAR data type is its inflexibility to get optimally required space for storage. A CHAR(50) field will reserve 50 bytes in database, irrespective of the data content of the field. If most of the values of this field are less than length 35, then remaining 15 bytes are sheer wastage of space as it cannot be utilized by database otherwise. This will result in extra disk space.
Solution
In such a case, one of the most important and immediate option to save storage space is to let database store only relevant data without unnecessary trailing spaces. For this, architects simply convert CHAR fields to VARCHAR.
Is it really a solution?
Is simply converting all CHAR columns to VARCHAR is appropriate solution and if not then what is the optimum way to do this. This is the point where the real catch is, ignoring which can led to performance degradation, which will be hard to track at later stages.
Ideally it’s a two step process:
(a) Identifying the relevant columns for conversion: Changing all the columns blindly from CHAR to VARCHAR may result in huge effort as not only the table definition but related programs which are inserting/updating/deleting these columns needs to be changed accordingly. Therefore only those columns should be chosen where space saving is significant. Please refer section “Recommendation for architects” to select appropriate columns.
(b) Converting data definition of identified columns and related programs from CHAR to VARCHAR: This will require a little knowledge of how DB2 on iSeries implements storage of VARCHAR. Please refer section “Implementation of VARCHAR storage in DB2 (iSeries)” below.
Implementation of VARCHAR storage in DB2 (iSeries)
Variable-length column (VARCHAR) support allows you to define any number of columns in a table as variable length. If you use VARCHAR support, the size of a table can usually be reduced. Data in a variable-length column is stored internally in two areas: a fixed-length or ALLOCATE area and an overflow area. Behavior of ALLOCATE area is similar to the behavior of CHAR data type i.e. irrespective of the actual data length database reserves as much bytes for storage as specified in allocate area. Any data, over and above that length is stored in variable or overflow area. If a default value is specified, the allocated length is at least as large as the value.
Recommendation for architects
The following points help you determine the best way to use your storage area. If the primary goal is:
Space saving: use simply VARCHAR data type (instead of CHAR) while defining table definition.
Performance: Use VARCHAR with ALLOCATE clause. Create the table using the ALLOCATE keyword as
CREATE TABLE Tab (col1 VARCHAR(40) ALLOCATE(10),
Col2 VARCHAR(40) ALLOCATE(10),
COL3 VARCHAR(40) ALLOCATE(7))
In many applications, performance must be considered. If you use the default VARCHAR datatype i.e. ALLOCATE(0), it will double the disk unit traffic. ALLOCATE(0) requires two reads; one to read the fixed-length portion of the row and one to read the overflow space. The variable-length implementations, with the carefully chosen ALLOCATE, minimize overflow and space and maximizes performance.
It is possible to balance space savings and performance by using ALLOCATE clause properly. When you define a table with variable-length data, you must decide the width of the ALLOCATE area i.e. ALLOCATE area should be wide enough to incorporate at least 90% to 95% of the values for the column.
Now the question is how to get columns which needs to be converted from CHAR to VARCHAR and how to get the magical figure for the columns’ ALLOCATE clause? Because choosing irrelevant columns for conversion may lead to high efforts but fewer gains. Normally, columns which have very high variation in data length are most suitable for conversion. Columns like remarks, comments etc fall into this category and usually columns in this category have length 35 or above. Also, the corresponding table should have enough number of rows to justify the effort involved as space saving is directly proportional to the number of rows. Thus, to get the list of relevant columns and their corresponding tables, run the SQL script listed below:
SELECT a.table_name, a.column_name, a.data_type, a.length, b.number_rows
FROM qsys2.syscolumns a , qsys2.systablestat b
WHERE a.table_schema = ‘lib'
AND a.length >= 35
AND b.table_schema = a.table_schema
AND b.table_name = a.table_name
AND b.number_rows > 5000
Here, lib is the name of library where database table resides. This SQL will provide the list of all the relevant columns. Architects are recommended to review suitable value for a.length and b.number_rows according to the nature of their application. Now, we need to analyze the existing data and find the ALLOCATE figure, which should be appropriate to hold 90-95% of the data. For this, run this SQL in the concerned database, where col1 and tab1 is the column and its respective database table under consideration.
SELECT LENGTH(RTRIM(col1)) Column_Length, count(*) Count
FROM tab1
GROUP BY LENGTH(RTRIM(col1))
ORDER BYLENGTH(RTRIM(col1))
This SQL will provide column’s classification by their data length. Paste this information into an excel spread sheet and derive cumulative sum and related percentage. Attached below is the similar data for a column where number of rows in the parent table was 1626986 and the length of the column was defined as CHAR(50).