  #1 (permalink)  
10-28-09, 07:47
Marcus_A
Registered User
 
Join Date: May 2003
Location: USA
Posts: 5,196
Generating Identity Columns during load on large tables in DPF

DB2 for Linux 9.7. Since DPF (Database Partitioning Feature) tries to do things
in parallel, how much extra overhead is there in having DB2 assign values to
an identity column that is used as the hash partitioning key during a load
of a very large table (compared to having the value already populated in the
input file)? Obviously, this has to be done sequentially, and not in
parallel.
__________________
M. A. Feldman
IBM Certified DBA on DB2 for Linux, UNIX, and Windows
IBM Certified DBA on DB2 for z/OS and OS/390
  #2 (permalink)  
10-28-09, 08:20
stolze
Registered User
 
Join Date: Jan 2007
Location: Jena, Germany
Posts: 2,662
I don't know the answer, but there doesn't have to be a lot of sequential processing. Identity columns don't give any guarantee that there are no gaps, so each node can get a large batch of ids from the common pool and work with this batch, independent of all other nodes in the cluster. Depending on how the load happens, i.e. a single source file, that file has to be split across all nodes, which would allow DB2 to also determine how many identity values are needed on each node. In this particular case, I would expect that there is pretty much no overhead at all. But again, I don't know exactly what DB2 is doing internally so I'm just guessing.
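The batch idea described above can be sketched in a few lines of Python. This is an illustration only, not how DB2 is implemented: `IdPool`, `reserve`, `load_on_node`, and the batch size of 1000 are hypothetical names and values chosen for the example.

```python
import threading


class IdPool:
    """Hypothetical shared pool that hands out contiguous blocks of ids."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, count):
        # Only this short step is serialized; the caller then works alone.
        with self._lock:
            first = self._next
            self._next += count
        return range(first, first + count)


def load_on_node(pool, rows, batch_size=1000):
    """Assign ids to the rows of one node, drawing batches from the pool."""
    ids = iter(())  # start with an empty batch
    assigned = []
    for row in rows:
        value = next(ids, None)
        if value is None:
            # Current batch is used up: fetch a new one from the pool.
            ids = iter(pool.reserve(batch_size))
            value = next(ids)
        assigned.append((value, row))
    return assigned


pool = IdPool()
node_a = load_on_node(pool, ["a1", "a2", "a3"])
node_b = load_on_node(pool, ["b1", "b2"])
print(node_a)  # ids 1, 2, 3
print(node_b)  # ids 1001, 1002
```

In this sketch the values 4 through 1000 are never used, which is the kind of gap that identity columns are allowed to have.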
__________________
Knut Stolze
IBM DB2 Analytics Accelerator
IBM Germany Research & Development
  #3 (permalink)  
10-28-09, 10:12
Marcus_A
Registered User
 
Join Date: May 2003
Location: USA
Posts: 5,196
Quote:
Originally Posted by stolze
I don't know the answer, but there doesn't have to be a lot of sequential processing. Identity columns don't give any guarantee that there are no gaps, so each node can get a large batch of ids from the common pool and work with this batch, independent of all other nodes in the cluster. Depending on how the load happens, i.e. a single source file, that file has to be split across all nodes, which would allow DB2 to also determine how many identity values are needed on each node. In this particular case, I would expect that there is pretty much no overhead at all. But again, I don't know exactly what DB2 is doing internally so I'm just guessing.
When an identity column is defined, the default is to start with 1, increment by 1, etc. Even if the default is not used, you still have to specify an increment amount. It is true that if you have caching and the database is deactivated, the unused identity values still left in the cache are lost (and will cause a gap), but I don't think there is any way for each node to use its own range of identity column values in DPF.

Besides, the identity value assignment is done on the admin node where the data is loaded, and I would think that would slow down the load process somewhat if the identity column is used as the partitioning key. But I am not sure whether the difference is enough to worry about.
  #4 (permalink)  
10-28-09, 10:35
n_i
:-)
 
Join Date: Jun 2003
Location: Toronto, Canada
Posts: 4,449
Don't know about 9.7, but in version 8, if I remember correctly, the input data is split into partitions on the coordinator node before those sets are passed on to other nodes to perform the actual load. Since DB2 will need to generate all identity values in order to compute hash values, which in turn are used to split incoming data, I would guess that there isn't much difference compared to a non-DPF configuration with respect to the identity generation, as it's only the coordinator node that will be doing that.
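A rough Python sketch of that ordering, under stated assumptions: the identity value has to exist before the row can be routed, because the target partition is a function of the key. `partition_for`, `split_on_coordinator`, and the CRC32-modulo hash are stand-ins made up for this example; DB2 uses its own hashing function and partition map.

```python
import zlib
from itertools import count


def partition_for(key, num_partitions):
    # Stand-in hash (assumption): CRC32 of the key text, modulo partition count.
    return zlib.crc32(str(key).encode("ascii")) % num_partitions


def split_on_coordinator(rows, num_partitions, first_id=1):
    """Generate the identity value first, then route each row by hashing it."""
    next_id = count(first_id)
    buckets = {p: [] for p in range(num_partitions)}
    for row in rows:
        key = next(next_id)                          # step 1: generate the identity value
        target = partition_for(key, num_partitions)  # step 2: hash it to pick a partition
        buckets[target].append((key, row))
    return buckets


buckets = split_on_coordinator(["r1", "r2", "r3", "r4", "r5"], num_partitions=3)
for partition, contents in buckets.items():
    print(partition, contents)
```

Both steps run in one place in this sketch, which mirrors the guess above that only the coordinator is involved in generating the values.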
  #5 (permalink)  
10-29-09, 11:07
db2girl
∞∞∞∞∞∞
 
Join Date: Aug 2008
Location: Toronto, Canada
Posts: 1,816
I forwarded your question to the load team. Will let you know when I hear back from them.
  #6 (permalink)  
10-29-09, 12:04
db2girl
∞∞∞∞∞∞
 
Join Date: Aug 2008
Location: Toronto, Canada
Posts: 1,816
Here is the response:

1) The identity column values are generated in the partitioning agent in this case (db2lpart) since they are needed for hashing.

2) If there is only a single partitioning agent, the identity values should be assigned in order. If you explicitly specify multiple partitioning agents, or you specify ANYORDER (which can cause the load utility to automatically employ more than a single partitioning agent), then the values can be assigned in parallel and in a non-deterministic order.

3) Each partitioning agent will reserve a set of identity column values from the catalog at every round. It will try to reduce the number of times it needs to communicate with the catalogs to reserve a range of identity values. The values it generates flow from the partitioning sub-agent to the server (and are eventually written to disk).

NOTE: Having load generate identity column values is always costly. See the performance white paper -> ftp://ftp.software.ibm.com/software/...loaderperf.pdf . Load performance is about twice as slow because of the cost of coordinating with the catalogs and reserving values.
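To make point 3 concrete, here is a toy Python model of range reservation. It is a sketch only: `Catalog`, `PartitioningAgent`, `round_trips_for`, and the range sizes are hypothetical, and none of this reflects the real db2lpart code. It simply counts how many reservations are needed for a given number of rows.

```python
class Catalog:
    """Hypothetical stand-in for the catalog that owns the identity counter."""

    def __init__(self):
        self.next_value = 1
        self.round_trips = 0

    def reserve(self, size):
        # Every reservation counts as one round trip to the catalog.
        self.round_trips += 1
        first = self.next_value
        self.next_value += size
        return first, first + size  # half-open range [first, end)


class PartitioningAgent:
    """Hypothetical agent that hands out values from its reserved range."""

    def __init__(self, catalog, range_size):
        self.catalog = catalog
        self.range_size = range_size
        self.current = self.end = 0  # nothing reserved yet

    def next_identity(self):
        if self.current >= self.end:
            # Range exhausted: go back to the catalog for another one.
            self.current, self.end = self.catalog.reserve(self.range_size)
        value = self.current
        self.current += 1
        return value


def round_trips_for(rows, range_size):
    """Count catalog reservations needed to number the given rows."""
    catalog = Catalog()
    agent = PartitioningAgent(catalog, range_size)
    for _ in range(rows):
        agent.next_identity()
    return catalog.round_trips


print(round_trips_for(10_000, 1))      # 10000
print(round_trips_for(10_000, 1_000))  # 10
```

The larger the reserved range, the fewer trips to the catalog, at the price of larger potential gaps if a range is not used up.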
  #7 (permalink)  
10-29-09, 18:06
Marcus_A
Registered User
 
Join Date: May 2003
Location: USA
Posts: 5,196
Thanks for the info. From page 35 of the document, it looks as if generating an identity column that is used as the hash partitioning key for DPF causes a degradation of 198% (plus or minus 2%). That means a load takes about three times as long as it would if the key were already available on the input record.

A load that generates an identity column that is not part of a partitioning key for DPF is twice as slow.
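The arithmetic behind those two statements, as a small Python sketch. It assumes that "degradation" means extra elapsed time relative to a load whose key values are already in the input, and that "twice as slow" corresponds to a 100% degradation; `elapsed_multiplier` is a name made up for this example.

```python
def elapsed_multiplier(degradation_pct):
    """Convert a percentage degradation into an elapsed-time multiplier."""
    return 1 + degradation_pct / 100


cases = [
    ("identity column used as partitioning key", 198),
    ("identity column not in partitioning key", 100),
]
for label, pct in cases:
    print(f"{label}: about {elapsed_multiplier(pct):.2f}x the baseline load time")
```

With the plus or minus 2% from the paper, the first case falls between 2.96x and 3.00x, hence "about three times as long".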