CHAR to VARCHAR conversion in DB2 (iSeries)

    Background
    When remodeling a legacy application, one of the architect's important tasks is to make optimal use of the database's storage capabilities. Most legacy applications are constrained by the technology available at the time of their development and are therefore no longer optimal by current standards. One such case is the extensive use of CHAR fields, which are no longer an efficient use of storage. This post describes a systematic approach to converting CHAR data fields to VARCHAR in DB2 for iSeries.

    Problem Statement
    The basic drawback of the CHAR data type is its inflexibility: it always reserves the full declared length. A CHAR(50) field reserves 50 bytes in the database regardless of the actual content. If most values in the field are shorter than 35 characters, the remaining 15 bytes are pure waste, since the database cannot use them for anything else. This translates directly into extra disk space.

    Solution
    In such a case, the most immediate option for saving storage is to let the database store only the relevant data, without unnecessary trailing spaces. For this, architects simply convert CHAR fields to VARCHAR.

    Is it really a solution?
    Is simply converting all CHAR columns to VARCHAR an appropriate solution, and if not, what is the optimal way to do it? This is where the real catch lies; ignoring it can lead to performance degradation that is hard to track down at later stages.
    Ideally it is a two-step process:
    (a) Identify the relevant columns for conversion: blindly changing every column from CHAR to VARCHAR can mean huge effort, because not only the table definitions but also the programs that insert/update/delete these columns need to be changed accordingly. Therefore only columns where the space saving is significant should be chosen. See the section “Recommendation for architects” on how to select appropriate columns.
    (b) Convert the data definition of the identified columns, and the related programs, from CHAR to VARCHAR: this requires some knowledge of how DB2 on iSeries implements VARCHAR storage. See the section “Implementation of VARCHAR storage in DB2 (iSeries)” below.

    Implementation of VARCHAR storage in DB2 (iSeries)
    Variable-length column (VARCHAR) support allows you to define any number of columns in a table as variable length, and using it usually reduces the size of a table. Data in a variable-length column is stored internally in two areas: a fixed-length (ALLOCATE) area and an overflow area. The ALLOCATE area behaves like a CHAR column: the database reserves as many bytes as the ALLOCATE length specifies, regardless of the actual data length. Any data beyond that length is stored in the variable (overflow) area. If a default value is specified, the allocated length is at least as large as the default value.
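    A rough way to see the effect of the ALLOCATE area is to model the bytes one value occupies. The sketch below is a simplified model of the behavior described above; it ignores the length bytes and per-row overhead of the real implementation:

    ```python
    def varchar_bytes(value_len, allocate):
        """Approximate storage for one VARCHAR value under a simplified
        model of the ALLOCATE behavior: the fixed area always occupies
        `allocate` bytes; anything longer spills the excess into the
        overflow area (length bytes and row overhead are ignored)."""
        fixed = allocate                          # always reserved, like CHAR(allocate)
        overflow = max(0, value_len - allocate)   # only the excess goes to overflow
        return fixed + overflow

    # A CHAR(50) always costs 50 bytes; VARCHAR(50) ALLOCATE(25) costs
    # 25 bytes for short values and only spills for longer ones.
    print(varchar_bytes(10, 25))  # fits entirely in the fixed area
    print(varchar_bytes(40, 25))  # 25 fixed + 15 bytes of overflow
    ```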
    Recommendation for architects
    The following points help you determine the best way to use your storage. If the primary goal is:
    • Space saving: simply use the VARCHAR data type (instead of CHAR) in the table definition.
    • Performance: use VARCHAR with the ALLOCATE clause. Create the table with the ALLOCATE keyword, for example:

    CREATE TABLE tab (col1 VARCHAR(40) ALLOCATE(10),
                      col2 VARCHAR(40) ALLOCATE(10),
                      col3 VARCHAR(40) ALLOCATE(7))

    In many applications, performance must be considered. If you use the default VARCHAR definition, i.e. ALLOCATE(0), it can double the disk unit traffic: ALLOCATE(0) requires two reads, one for the fixed-length portion of the row and one for the overflow space. A variable-length implementation with a carefully chosen ALLOCATE minimizes both overflow and space, and maximizes performance.

    It is possible to balance space savings and performance by using the ALLOCATE clause properly. When you define a table with variable-length data, you must decide the width of the ALLOCATE area; it should be wide enough to hold at least 90% to 95% of the column's values.

    Now the question is: which columns should be converted from CHAR to VARCHAR, and how do we find the magic figure for each column's ALLOCATE clause? Choosing the wrong columns for conversion means high effort for little gain. Normally, columns with very high variation in data length are the best candidates; columns such as remarks and comments fall into this category, and such columns usually have a length of 35 or more. The table should also have enough rows to justify the effort involved, since the space saving is directly proportional to the number of rows. To get the list of candidate columns and their tables, run the SQL script below:

    SELECT a.table_name, a.column_name, a.data_type, a.length, b.number_rows
    FROM qsys2.syscolumns a , qsys2.systablestat b
    WHERE a.table_schema = 'lib'
    AND a.data_type = 'CHAR'
    AND a.length >= 35
    AND b.table_schema = a.table_schema
    AND b.table_name = a.table_name
    AND b.number_rows > 5000

    Here, lib is the name of the library where the database tables reside. This SQL provides the list of all candidate columns. Architects should adjust the values for a.length and b.number_rows to the nature of their application. Next, we need to analyze the existing data and find the ALLOCATE figure that will hold 90-95% of the data. For this, run the following SQL against the database in question, where col1 and tab1 are the column and table under consideration:

    SELECT LENGTH(RTRIM(col1)) AS column_length, COUNT(*) AS row_count
    FROM tab1
    GROUP BY LENGTH(RTRIM(col1))
    ORDER BY LENGTH(RTRIM(col1))

    This SQL classifies the column's values by their data length. Paste this information into an Excel spreadsheet and derive the cumulative sum and the related percentage. Below is such data for a column defined as CHAR(50) in a table with 1,626,986 rows.
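    The cumulative-sum step can also be scripted instead of done in a spreadsheet. A minimal sketch, using illustrative (length, count) pairs standing in for the real query output:

    ```python
    def cumulative_distribution(length_counts):
        """Given (length, row_count) pairs from the GROUP BY query above,
        return rows of (length, count, cumulative count, cumulative fraction)."""
        total = sum(count for _, count in length_counts)
        rows, cumulative = [], 0
        for length, count in sorted(length_counts):
            cumulative += count
            rows.append((length, count, cumulative, cumulative / total))
        return rows

    def candidate_allocate(length_counts, coverage=0.95):
        """Smallest length whose cumulative fraction reaches `coverage`."""
        for length, _, _, fraction in cumulative_distribution(length_counts):
            if fraction >= coverage:
                return length
        return None

    # Illustrative counts, not the real column's data:
    sample = [(10, 2786), (20, 65324), (25, 30856), (30, 5674), (50, 20)]
    for row in cumulative_distribution(sample):
        print("{:>3} {:>7} {:>8} {:.4f}".format(*row))
    print("candidate ALLOCATE:", candidate_allocate(sample))
    ```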

    Column Length   Count   Cumm. Sum   Cumulative Fraction

    null 997280 997280 0.612962
    0 581 997861 0.613318738
    1 116 997977 0.613390035
    2 188 998165 0.613505586
    3 603 998768 0.61387621
    4 961 999729 0.614466873
    5 14 999743 0.614475478
    6 2 999745 0.614476707
    7 23 999768 0.614490844
    8 14 999782 0.614499449
    9 26 999808 0.614515429
    10 2786 1002594 0.616227798
    11 500 1003094 0.616535115
    12 2454 1005548 0.618043425
    13 4438 1009986 0.620771168
    14 12623 1022609 0.628529686
    15 21423 1044032 0.641696978
    16 33377 1077409 0.662211599
    17 43236 1120645 0.68878589
    18 57116 1177761 0.723891293
    19 59734 1237495 0.760605807
    20 65324 1302819 0.800756122
    21 64123 1366942 0.840168262
    22 59600 1426542 0.876800415
    23 49011 1475553 0.906924214
    24 40671 1516224 0.931921971
    25 30856 1547080 0.9508871
    26 23382 1570462 0.96525846
    27 17017 1587479 0.975717677
    28 12170 1599649 0.983197766
    29 8537 1608186 0.988444891
    30 5674 1613860 0.991932321
    31 4108 1617968 0.994457236
    32 2778 1620746 0.996164687
    33 1922 1622668 0.997346013
    34 1315 1623983 0.998154256
    35 1030 1625013 0.998787328
    36 686 1625699 0.999208967
    37 523 1626222 0.99953042
    38 280 1626502 0.999702517
    39 169 1626671 0.99980639
    40 145 1626816 0.999895512
    41 56 1626872 0.999929932
    42 37 1626909 0.999952673
    43 20 1626929 0.999964966
    44 10 1626939 0.999971112
    45 19 1626958 0.99998279
    46 5 1626963 0.999985863
    47 3 1626966 0.999987707
    50 20 1626986 1

    Almost 95% of the data has length 25 or less, so choosing ALLOCATE(25) will save around 35MB of space with virtually no impact on performance. This can be roughly estimated as:
    (Total_rows × percentile × saved_space) − (Total_rows × remaining percentile × average variable space per overflowed row).
    Since the second term is usually insignificant, the first term alone gives a rough estimate. We need not be very precise; any close estimate can aid the decision. For tables with millions of rows, the saving from a single column can reach 300MB.
    The result of this analysis also helps decide whether the column is worth converting at all: if the storage saving is small, there is no point in spending the effort on that column.
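    The estimate formula above can be sketched directly. The row count and fitting fraction below come from the CHAR(50) analysis; the 16-byte overflow overhead per spilled row is an assumption for illustration, not a measured value:

    ```python
    def estimated_saving_bytes(total_rows, fitting_fraction,
                               saved_per_row, overflow_overhead):
        """Rough saving per the formula above: rows that fit save
        `saved_per_row` bytes each; overflowed rows cost an assumed
        `overflow_overhead` extra bytes each."""
        fit = total_rows * fitting_fraction * saved_per_row
        spill = total_rows * (1 - fitting_fraction) * overflow_overhead
        return fit - spill

    # CHAR(50) column, ALLOCATE(25), ~95.09% of rows fit; the 16-byte
    # overflow overhead is an illustrative assumption.
    saving = estimated_saving_bytes(1_626_986, 0.9509, 50 - 25, 16)
    print(f"estimated saving: {saving / 1024 / 1024:.1f} MB")
    ```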

    This type of analysis is easy for existing columns, but what should the “ALLOCATE” size be when designing new tables? Here the data architect initially has to choose an appropriate value based on the functionality of the field (75% of the total length is a reasonable starting point if there is no other input to aid the decision). Once there is a reasonable amount of production data, the data architect should check the performance of the field with the SQL below:

    SELECT DOUBLE_PRECISION(overflow) / number_rows
    FROM qsys2.systablestat
    WHERE table_name = 'tab1'
    AND table_schema = 'lib'

    Any value below 5% is fine; anything above 5% means the ALLOCATE length needs to be increased.
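    The 5% threshold check can be scripted against the result of that query. A minimal sketch; the function name and sample figures are illustrative:

    ```python
    def allocate_too_small(overflow_rows, number_rows, threshold=0.05):
        """True when more than `threshold` of the table's rows have
        spilled into the overflow area, i.e. the ALLOCATE length
        should be increased."""
        return overflow_rows / number_rows > threshold

    print(allocate_too_small(3_000, 100_000))   # 3% overflow: fine
    print(allocate_too_small(8_000, 100_000))   # 8% overflow: increase ALLOCATE
    ```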

    Possible impact on code
    Apart from the table definition, programs that use these fields may also need to be changed if host variables are involved. If you use host variables to insert or update variable-length columns, the host variables should themselves be variable length. Because blanks are not truncated from fixed-length host variables, using fixed-length host variables causes more rows to spill into the overflow space, increasing the size of the table. In this example, fixed-length host variables are used to insert a row into a table:
    01 COL1 PIC X(40).
    ...
    MOVE "SMITH" TO COL1.
    EXEC SQL INSERT INTO tab1 VALUES
        (:COL1, :COL2, :COL3)
    END-EXEC.
    The host variable COL1 is not variable length. The string “SMITH”, followed by 35 trailing blanks, is inserted into the VARCHAR column col1. The value (40 bytes) is longer than the allocate size of 25, so 15 of the 35 trailing blanks land in the overflow area. In the next example, variable-length host variables are used to insert a row into the same table:
    01 VCol1.
    49 Col1-LEN PIC S9(4) BINARY.
    49 Col1-DATA PIC X(40).
    ...
    MOVE "SMITH" TO Col1-DATA.
    MOVE 5 TO Col1-LEN.
    EXEC SQL INSERT INTO tab1 VALUES
    (:VCol1, :VCol2, :VCol3)
    END-EXEC.
    The host variable VCol1 is variable length, and its actual data length is set to 5. The value is shorter than the allocated length, so it fits entirely in the fixed portion of the column and incurs no overflow I/O.
