Unanswered: CHAR to VARCHAR conversion in DB2(iSeries)
When remodeling a legacy application, one of the architect's important tasks is to make optimum use of the database's storage capabilities. Most legacy applications are constrained by the technology available at the time of their development and are therefore no longer optimal by current standards. One such case is the extensive use of CHAR fields, which are not a space-efficient choice today. This paper outlines a systematic approach to converting CHAR data fields to VARCHAR in DB2 for iSeries.
The basic constraint of the CHAR data type is its inflexibility: it cannot allocate only the space the data actually requires. A CHAR(50) field reserves 50 bytes in the database regardless of the field's content. If most values in this field are shorter than 35 characters, the remaining 15 bytes are pure waste, since the database cannot use them for anything else. The result is extra disk space consumption.
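The scale of this waste is easy to estimate. A quick sketch (the row count is hypothetical, chosen only to illustrate the arithmetic):

```python
# Illustrative arithmetic: bytes wasted by a fixed CHAR(50) column whose
# typical value is only 35 characters long.
char_width = 50          # declared CHAR length
typical_length = 35      # typical actual data length
rows = 1_000_000         # hypothetical row count

wasted_per_row = char_width - typical_length           # 15 padding blanks
wasted_total_mb = rows * wasted_per_row / (1024 ** 2)
print(f"{wasted_per_row} bytes/row, ~{wasted_total_mb:.1f} MB total")
```

At a million rows, those 15 trailing blanks per row already amount to roughly 14 MB of dead space.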
In such a case, one of the most immediate options for saving storage is to let the database store only the relevant data, without unnecessary trailing spaces. To do this, architects simply convert CHAR fields to VARCHAR.
Is it really a solution?
Is simply converting all CHAR columns to VARCHAR an appropriate solution, and if not, what is the optimum way to do it? This is where the real catch lies: ignoring it can lead to performance degradation that is hard to track down at later stages.
Ideally, it is a two-step process:
(a) Identifying the relevant columns for conversion: Changing all columns blindly from CHAR to VARCHAR may require huge effort, since not only the table definitions but also the programs that insert, update, or delete these columns need to be changed accordingly. Therefore, only columns where the space saving is significant should be chosen. Please refer to the section “Recommendation for architects” to select appropriate columns.
(b) Converting the data definition of the identified columns, and the related programs, from CHAR to VARCHAR: This requires some knowledge of how DB2 on iSeries implements VARCHAR storage. Please refer to the section “Implementation of VARCHAR storage in DB2 (iSeries)” below.
Implementation of VARCHAR storage in DB2 (iSeries)
Variable-length column (VARCHAR) support allows you to define any number of columns in a table as variable length, and using it can usually reduce the size of a table. Data in a variable-length column is stored internally in two areas: a fixed-length (ALLOCATE) area and an overflow area. The ALLOCATE area behaves like a CHAR column: regardless of the actual data length, the database reserves as many bytes as the ALLOCATE clause specifies. Any data over and above that length is stored in the variable, or overflow, area. If a default value is specified, the allocated length is at least as large as that value.
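The two-area layout can be modeled with a small sketch. This is my own simplification of the behavior described above, not IBM's actual on-disk format:

```python
# Simplified model of DB2 for i VARCHAR storage: the ALLOCATE area is always
# reserved (like a CHAR of that width), and only the excess bytes spill into
# the overflow area.
def varchar_storage(data_length: int, allocate: int) -> tuple:
    """Return (bytes reserved in the fixed area, bytes in the overflow area)."""
    fixed = allocate
    overflow = max(0, data_length - allocate)
    return fixed, overflow

print(varchar_storage(30, 25))  # 30-byte value, ALLOCATE(25): 5 bytes overflow
print(varchar_storage(10, 25))  # 10-byte value fits entirely in the fixed area
```

The key point the model captures is that the fixed cost is always paid; the tuning question is how often the second component is non-zero.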
Recommendation for architects
The following points help you determine the best way to use your storage area. If the primary goal is:
Space saving: simply use the VARCHAR data type (instead of CHAR) in the table definition.
Performance: use VARCHAR with the ALLOCATE clause. Create the table using the ALLOCATE keyword:
CREATE TABLE tab (col1 VARCHAR(40) ALLOCATE(10),
                  col2 VARCHAR(40) ALLOCATE(10),
                  col3 VARCHAR(40) ALLOCATE(7))
In many applications, performance must be considered. If you use the default VARCHAR definition, i.e. ALLOCATE(0), disk unit traffic doubles: ALLOCATE(0) requires two reads, one for the fixed-length portion of the row and one for the overflow space. A variable-length implementation with a carefully chosen ALLOCATE minimizes overflow and space while maximizing performance.
You can balance space savings and performance by using the ALLOCATE clause properly. When you define a table with variable-length data, you must decide the width of the ALLOCATE area: it should be wide enough to hold at least 90% to 95% of the column's values.
Now the question is: which columns should be converted from CHAR to VARCHAR, and how do we arrive at the magic figure for each column's ALLOCATE clause? Choosing irrelevant columns for conversion leads to high effort but little gain. Normally, columns with very high variation in data length are the most suitable candidates; columns like remarks and comments fall into this category, and such columns usually have a length of 35 or more. The corresponding table should also contain enough rows to justify the effort involved, since the space saving is directly proportional to the number of rows. To get the list of relevant columns and their corresponding tables, run the SQL script below:
SELECT a.table_name, a.column_name, a.data_type, a.length, b.number_rows
FROM qsys2.syscolumns a, qsys2.systablestat b
WHERE a.table_schema = 'lib'
AND a.data_type = 'CHAR'
AND a.length >= 35
AND b.table_schema = a.table_schema
AND b.table_name = a.table_name
AND b.number_rows > 5000
Here, lib is the name of the library where the database tables reside. This SQL provides the list of all relevant columns. Architects should adjust the values for a.length and b.number_rows to suit the nature of their application. Next, we need to analyze the existing data and find the ALLOCATE figure that will hold 90-95% of the data. For this, run the following SQL against the database concerned, where col1 and tab1 are the column and its database table under consideration.
SELECT LENGTH(RTRIM(col1)) AS column_length, COUNT(*) AS row_count
FROM tab1
GROUP BY LENGTH(RTRIM(col1))
This SQL classifies the column's values by their data length. Paste this information into a spreadsheet and derive the cumulative sum and the related percentages. Attached below is such data for a column that was defined as CHAR(50) in a table of 1,626,986 rows.
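The spreadsheet step can equally be sketched in code. The histogram below is invented for illustration, and pick_allocate is my own helper, not part of any DB2 tooling:

```python
# From the (length, count) pairs produced by the GROUP BY query, derive the
# cumulative coverage and pick the smallest length covering ~95% of the rows.
def pick_allocate(length_counts, coverage=0.95):
    total = sum(count for _, count in length_counts)
    cumulative = 0
    for length, count in sorted(length_counts):
        cumulative += count
        if cumulative / total >= coverage:
            return length
    return max(length for length, _ in length_counts)

# Invented distribution: 950 of 1000 values are 25 characters or shorter.
histogram = [(5, 400), (15, 300), (25, 250), (40, 30), (50, 20)]
print(pick_allocate(histogram))  # 25
```

Raising the coverage target pushes the chosen ALLOCATE toward the column's maximum length, trading space for fewer overflow reads.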
Almost 95% of the data has a length of 25 or less, so choosing ALLOCATE(25) will save around 35 MB of space with virtually no impact on performance. This can be roughly estimated as:
((Total_rows * percentile * saved_space) – (Total_rows * remaining percentile * average variable space per overflowed row)).
Since the second term is not expected to be significant, an approximate value of the first term alone provides a rough estimate. Moreover, we need not be very precise; any close estimate can aid the decision. For tables with millions of rows, this saving may reach 300 MB from just one column.
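Using the figures above, the first term of the estimate works out roughly as follows. The 2-byte length prefix per VARCHAR value is a standard overhead I am assuming in this sketch:

```python
# Rough space-saving estimate for CHAR(50) -> VARCHAR(50) ALLOCATE(25)
# on the 1,626,986-row table discussed above.
rows = 1_626_986
char_width, allocate = 50, 25
fit_fraction = 0.95          # share of values with length <= 25
prefix_bytes = 2             # assumed per-value VARCHAR length prefix

# Rows that fit save the trimmed fixed-area bytes; every row pays the prefix.
saved = rows * fit_fraction * (char_width - allocate) - rows * prefix_bytes
print(f"~{saved / 1_000_000:.0f} MB saved")
```

This lands in the neighborhood of 35 MB, consistent with the estimate quoted above.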
The result of this analysis also helps decide whether the column is appropriate for conversion at all: a small storage saving indicates the effort is not worthwhile for that column.
This type of analysis is easy for existing columns, but when designing new tables, what should the ALLOCATE size be? Here, the data architect initially has to choose an appropriate value based on the functionality of the field (75% of the total length is a reasonable starting point if there is no other input to aid the decision). Once a considerable amount of production data exists, the data architect should check how much of the data is overflowing with the SQL listed below:
SELECT number_rows, overflow,
(overflow * 100.0) / number_rows AS overflow_percent
FROM qsys2.systablestat
WHERE table_name = 'tab1'
AND table_schema = 'lib'
Any overflow value below 5% is fine; anything above 5% means the ALLOCATE length needs to be increased.
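The 5% rule can be expressed as a tiny check. This is a hypothetical helper of my own, and the row counts below are invented:

```python
# Flag whether the current ALLOCATE length is adequate: rows that spill into
# the overflow area should stay at or below 5% of total rows.
def allocate_ok(overflow_rows: int, total_rows: int,
                threshold: float = 0.05) -> bool:
    return overflow_rows / total_rows <= threshold

print(allocate_ok(30_000, 1_626_986))   # True: ~1.8% overflow is acceptable
print(allocate_ok(200_000, 1_626_986))  # False: ~12% means widen ALLOCATE
```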
Possible impact on code
Apart from the table definition, programs that depend on these fields may also need to be changed if host variables are used. If you use host variables to insert or update variable-length columns, the host variables should themselves be variable length. Because blanks are not truncated from fixed-length host variables, using them causes more rows to spill into the overflow space, which increases the size of the table. In this example, fixed-length host variables are used to insert a row into a table:
01 COL1 PIC X(50).
...
MOVE "SMITH" TO COL1.
EXEC SQL
INSERT INTO tab1 VALUES (:COL1, :COL2, :COL3)
END-EXEC.
The host variable COL1 is fixed length, not variable length. The string "SMITH", followed by 45 trailing blanks, is inserted into the VARCHAR column COL1. The value is longer than the allocate size of 25, so 25 of the 45 trailing blanks end up in the overflow area. In the next example, variable-length host variables are used to insert a row into the table:
01 VCOL1.
   49 VCOL1-LEN PIC S9(4) BINARY.
   49 VCOL1-DATA PIC X(50).
...
MOVE "SMITH" TO VCOL1-DATA.
MOVE 5 TO VCOL1-LEN.
EXEC SQL
INSERT INTO tab1 VALUES (:VCOL1, :VCOL2, :VCOL3)
END-EXEC.
The host variable VCOL1 is variable length, and the actual length of the data is set to 5. Because the value is shorter than the allocated length, it fits entirely in the fixed portion of the column and incurs no overflow access.