Results 1 to 6 of 6
  1. #1
    Join Date
    Aug 2011
    Posts
    2

    Unanswered: Questions on Update statement on a table in DPF.

I have a table created in a DPF environment; it is spread across multiple partitions, hashed on the PK
    (column A).

I have an update statement that runs on this table; the update is done through an ETL tool (DataStage v8.5). The average daily
    input volume for the update is 50 million records. The update predicate uses only non-indexed,
    non-PK columns (columns B, C, D and E). The DataStage job performing this update runs very slowly, and the bottleneck appears to be the DB update.
    My questions are:

1. There is an option to run the DB2 connector stage in parallel, and the ETL nodes can also communicate directly with
    the DB2 nodes where the data resides. If the partitioning done in ETL is wrong and the data for an update ends up on a
    different DB2 node than the one holding the actual row, the updates won't happen properly, right?

2. The database is on a different server than the DataStage server; both run AIX.
    If I log in to DB2 from the DataStage application server using the command line and execute a query on a partitioned table, how does it execute?
    Will it run in parallel across all the partitions the table spans? (If more than one table is referenced
    and the tables are on different partitions, is there data movement between partitions, or is the data moved to some common space
    for processing?)

3. Suppose I join two tables on column C, where table A is partitioned by hashing column A and table B by hashing
    column B. How does the join happen across data stored on different database partitions?
    Does the data come to some spool space for joining across partitions, and will this degrade performance?
    (If you could point me to any books on this, it would be helpful.)
    Having worked on Teradata earlier, I am trying to compare Teradata with DB2 in this respect.

4. How do I find out which DB2 coordinator node my DataStage connects to? (Basically, in a DPF environment all queries
    executed through any application go to the coordinator node and are then distributed to the partitions, right? For
    data retrieval queries, are the result sets from the different partitions collated again on the coordinator node?)

5. There is a DataStage property that lets the different DataStage nodes connect directly to the DB2 nodes where
    the data resides instead of communicating through the coordinator node. If I set this property, how do I find out which DB2 nodes
    the query executes on? Is there a system table where this is stored?

6. The table is currently empty but is being updated with 50 million rows, and the job still runs for more than 4 hours in production.
    Do stats need to be collected on an empty table as well? (The table was just created, not truncated.)
    What else could make the update query slow, what can I do to fix it, and what steps should I follow to find the root cause of this performance issue?

7. How do I check the explain plan, timerons, or performance of this query in this case? If it were a table-to-table update
    I could have taken an explain plan, but here the data is read from a file and applied to the table, so how can I get an explain plan or timerons, or
    measure performance while the query is running? What could cause the slow performance here?
    And how should I go about analysing a slow query executed from DataStage?

8. Would creating an index on the columns used in the WHERE clause of the update help? I am just wondering why the query runs so slowly
    on a table that has just been created. Should I do any reorgs or runstats on this table?

  2. #2
    Join Date
    Apr 2012
    Posts
    1,035
    Provided Answers: 18
It is more manageable if you keep a *single question* per post, that is: one thread for one specific question.
    Otherwise a shared thread with differing answers to unrelated questions becomes a confusing mess.

  3. #3
    Join Date
    Jul 2013
    Location
    Moscow, Russia
    Posts
    666
    Provided Answers: 55
    Hi,

    Quote Originally Posted by Rocky007 View Post
    1. There is an option to run the DB2 connector stage in parallel and also the ETL nodes can communicate directly with
    the DB2 nodes where the data is present. If in case, a wrong partitioning was done in ETL and data for update ends up in a
    different Db2 node than where the actual data is, there wont be updates happening properly right?
DB2 updates rows properly in any case.
    If your application knows the actual values of all distribution key columns, it can determine the exact partition where a row resides / will reside; this can be done with the db2GetRowPartNum API function. The only difference when running the statement against a row that does not belong to the current node is performance (extra inter-node communication).
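If you just want to verify from SQL where the rows of a table actually land, the built-in DBPARTITIONNUM scalar function can be used. A sketch, where mytable and column A are placeholders for your table and distribution key:

```sql
-- Show how the rows of mytable are spread across database partitions.
-- DBPARTITIONNUM takes any column reference of the table and returns
-- the number of the partition holding that row.
SELECT DBPARTITIONNUM(A) AS part_num,
       COUNT(*)          AS rows_on_part
FROM   mytable
GROUP  BY DBPARTITIONNUM(A)
ORDER  BY part_num;
```

A heavily skewed row count per partition would itself be a performance red flag.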

    Quote Originally Posted by Rocky007 View Post
    2. The DB is in a different server than the actual Datastage server. Both are running AIX unix.
    Now if login to DB2 from DS application server using Command line and execute a query on partitioned table, how does that execute?
    Will it also be executed in parallel and run on all the partitions that the table is in? (if there are more than one table referenced
    and the tables are in different partitions, there would be data movement within partitions or they are moved to some common space
    for processing?)

    3. Suppose if I am joining two tables that are stored in different database partitions based on column C and say that
    table A is partitioned using Hashing of column A and table B using hashing of column B. How will the join happen across
    data stored in different database partitions? Would it come to some spool space for joining across partitions?
    Will this result in performance degradation?
    (if you could point me any books related to this it would be helpful)
    Since I have worked on Teradata earlier, I am just trying to compare Teradata with DB2 in this aspect.
Your statement will be executed in parallel on all partitions.
    See the article "Join methods for partitioned databases".

    Quote Originally Posted by Rocky007 View Post
    4. How do I find out what is the Db2 co-ordinator node my datastage is connecting to? (basically in a DPF environment, all queries
    executed through any application would go to co-ordinator node and then distributed to the paritions right? in the case of
    data retrieval queries, the result set from different paritions would be collated again in co-ordinator node?)

    5. There is a property in Datastage which can help the different nodes in datastage to connect directly with the DB2 nodes where
    the data is in instead of communicating through Co-ordinator node. So if I set this property, how do I what are the db2 nodes where
    this query is executed? Is there any system table where this is stored?
If you don't specify exactly which node to connect to, your application connects to the node with the lowest logical port number (normally 0) on the server whose IP/hostname is specified in the connection.
    You can get the properties of the database partitions with the following query:
    SELECT * FROM TABLE(DB_PARTITIONS())
    or from ~/sqllib/db2nodes.cfg on the server.

You can check the coordinator node from your application with:
    VALUES CURRENT NODE

    For any other application you can get it like this:
    SELECT DISTINCT APPLICATION_HANDLE, COORD_MEMBER FROM TABLE(MON_GET_CONNECTION(NULL, -2))
    or, if you know the application handle:
    SELECT DISTINCT COORD_MEMBER FROM TABLE(MON_GET_CONNECTION(APPL_HANDLE, -2))

    Quote Originally Posted by Rocky007 View Post
    6. The table is currently empty but updated with 50 million rows. But still running for more than 4 hours in production.
    Do the stats need to be collected on empty table as well. (The table is just created new not truncated).
    What else could be the reason the update query is running slowly? What can I do to fix it ? What steps should I follow to find out the root cause of this performance issue?

    7. How do I check the explain plan or timerons or performance of this query in this case? If it was table to table update
    I could have taken explain plan. But here, the data is read from the file and loaded to table, so how can I take explain plan or timerons or
    measure the performance while the query is running? What could be the reason for slow performance in this case?
    And please let me know what do I go about analysing slow performance of a query executed from datastage?
Are you updating an empty table? Maybe you are inserting these rows rather than updating?
    You should check the state of your application; it might be in lock-wait state, for example.
    Look at WLM_GET_WORKLOAD_OCCURRENCE_ACTIVITIES_V97.
    Once you have identified the executable_id from the MON_GET_PKG_CACHE_STMT function, you can get the explain plan from the package cache using the EXPLAIN_FROM_SECTION procedure.
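A possible sequence for that (a sketch: the LIKE filter and the x'...' executable id are placeholders you would fill in from the first query's output):

```sql
-- 1. Locate the statement and its executable_id in the package cache
SELECT EXECUTABLE_ID, STMT_TEXT
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2))
WHERE STMT_TEXT LIKE 'UPDATE%';

-- 2. Explain the actual runtime section out of the package cache
--    ('M' = in-memory package cache; the five ?s are OUT parameters)
CALL EXPLAIN_FROM_SECTION(x'...', 'M', NULL, 0, NULL, ?, ?, ?, ?, ?);
```

The populated explain tables can then be formatted with db2exfmt.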

    Quote Originally Posted by Rocky007 View Post
    8. Would creating an index on the columns used in where clause of update would help? I am just wondering why the query would run so slowly
    on a table that has been just created. Should I do any re-orgs or runstats on this table?
If you update a few rows per statement, you should definitely create indexes, and run RUNSTATS after significant data changes in the table. DB2 can run RUNSTATS automatically (there are internal algorithms that trigger it on tables), but this is an asynchronous process and might not happen immediately after your significant data changes.
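For the UPDATE described above, that could look like this (hypothetical names; index only the columns actually used in the WHERE clause):

```sql
-- Index the non-key predicate columns used by the UPDATE
CREATE INDEX ix_mytable_bcde ON mytable (B, C, D, E);

-- Refresh statistics after significant data changes
CALL SYSPROC.ADMIN_CMD(
  'RUNSTATS ON TABLE myschema.mytable WITH DISTRIBUTION AND INDEXES ALL');
```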
    Regards,
    Mark.

  4. #4
    Join Date
    Nov 2011
    Posts
    334
Hi Rocky007,
    3. Suppose if I am joining two tables that are stored in different database partitions based on column C and say that
    table A is partitioned using Hashing of column A and table B using hashing of column B. How will the join happen across
    data stored in different database partitions? Would it come to some spool space for joining across partitions?
    Will this result in performance degradation?
    (if you could point me any books related to this it would be helpful)
    Since I have worked on Teradata earlier, I am just trying to compare Teradata with DB2 in this aspect.
If you are familiar with TD, you can easily understand the join strategies of DB2 in a DPF environment. Both are shared-nothing architectures, and the distribution key in DPF is very similar to the primary index in TD.
    In brief, there are four join strategies in a DPF environment:
    1. Collocated join: no data redistribution happens; it is the most efficient join method.
    2. Broadcast join: one of the tables (usually the smaller one) is broadcast to all partitions.
    3. Directed join: the table whose distribution key columns are not a subset of the join columns is redistributed.
    4. Repartitioned join: both tables are redistributed, because neither has the join columns in its distribution key columns.
    Of course there are some differences at the implementation level of data transfer between nodes. Sometimes, to resolve a deadlock between table queues (TQs), DPF has to spill data into a temporary tablespace (a "TQ spill"), but it tries to avoid spilling as much as possible rather than always putting the data to be redistributed into a spool space.
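The four strategies above can be sketched with a toy model. This is pure illustration, not DB2 internals: the hash function, partition count, and tables are all invented. It shows why a join on a column that is the distribution key of neither table forces both sides to be redistributed (strategy 4):

```python
# Toy model of DPF-style hash distribution. partition() stands in for
# DB2's distribution hash; NPARTS "partitions" stand in for db2 nodes.

NPARTS = 4

def partition(key, nparts=NPARTS):
    """Toy stand-in for the distribution hash on a key value."""
    return hash(key) % nparts

# Table A is distributed on column a, table B on column b.
table_a = [{"a": i, "c": i % 10} for i in range(100)]
table_b = [{"b": i, "c": i % 10} for i in range(100)]

# Join on column c: c is the distribution key of neither table, so both
# sides must be re-hashed on the join column and shipped to the matching
# partition before any local joining can happen.
parts_a_on_c = [[] for _ in range(NPARTS)]
parts_b_on_c = [[] for _ in range(NPARTS)]
for row in table_a:
    parts_a_on_c[partition(row["c"])].append(row)
for row in table_b:
    parts_b_on_c[partition(row["c"])].append(row)

# After redistribution the join is purely local to each partition.
matches = 0
for p in range(NPARTS):
    c_values_here = {r["c"] for r in parts_b_on_c[p]}
    matches += sum(1 for r in parts_a_on_c[p] if r["c"] in c_values_here)

print(matches)  # prints 100: every A row finds a matching c value locally
```

In a collocated join (strategy 1) the rows would already sit on the right partitions and the redistribution step would simply disappear.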
    5. There is a property in Datastage which can help the different nodes in datastage to connect directly with the DB2 nodes where
    the data is in instead of communicating through Co-ordinator node. So if I set this property, how do I what are the db2 nodes where
    this query is executed? Is there any system table where this is stored?
If an application connects to DPF, there must be a coordinator node serving it (we can use the local-bypass method to improve system TPS, but that is not suited to OLAP scenarios). The coordinator is something like the PE in TD: it accepts the client request, compiles the SQL, dispatches the execution sections, and returns the final result to the client. Just as we cannot fetch data directly from an AMP without a PE, we cannot work around the coordinator node to get data directly from a data node.

If possible, please provide more information:
    db2level, OS platform
    how many nodes you have, and how many cores per node
    the SQL statement, table DDL, the execution plan, etc.

An activity event monitor will also be very helpful for digging into the performance problem:
    CREATE EVENT MONITOR (activities) statement
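A minimal sketch of such a monitor (act_mon is an invented name; activity collection also has to be enabled on the workload):

```sql
-- Collect activity data for the default user workload
ALTER WORKLOAD SYSDEFAULTUSERWORKLOAD COLLECT ACTIVITY DATA WITH DETAILS;

CREATE EVENT MONITOR act_mon FOR ACTIVITIES WRITE TO TABLE;
SET EVENT MONITOR act_mon STATE 1;

-- ... run the slow DataStage job here ...

SET EVENT MONITOR act_mon STATE 0;

-- Captured statements end up in tables named after the monitor
SELECT * FROM ACTIVITYSTMT_ACT_MON;
```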

  5. #5
    Join Date
    Aug 2011
    Posts
    2
    Hi Mark/Fensun2,

Thanks for taking the time to go through my long list of questions. I have a few more doubts.

Regarding the first answer:

My update is not based on the key column on which the table is partitioned across the different nodes; it is based on other, non-key columns.
    So I set the partitioning method in DataStage to DB2 connector partitioning (which, according to the DataStage documentation, should partition the data the same way DB2 does).
    I get records rejected with the message "Row not found", and I also get SQLCODE -911 (deadlock); the deadlock does not occur often, but "Row not found" occurs every time.

Now, looking at the deadlock error, should I understand that DB2 updates the table data even when DataStage sends the input data to a different node? Or, looking at the "Row not found" error, should I assume DB2 does not always update the data across nodes? You said it would always update, right? If so, why are these records getting rejected? Can you please help me understand?

Although this is an empty table, the process that inserts data into it has not been implemented yet; it will be implemented in a few days, and then data will flow into this table and the update will run on top of it. For now the updates run against an empty table until that process is in place.

I don't even have access to view the explain plan. I need to work with my DBA to get it, and in my shop this won't be treated as a priority. Anyway, I am trying to get the explain plan and will post it when I do.

DataStage and DB2 are hosted on different AIX 7 servers. I will get more details using the queries you posted and will provide further information.

    Thanks.

  6. #6
    Join Date
    Jul 2013
    Location
    Moscow, Russia
    Posts
    666
    Provided Answers: 55
    Quote Originally Posted by Rocky007 View Post
    Now looking at deadlock issue error, Should I understand Db2 updates the table data even when the datstage sends the input data to a different node or looking at the "Row not found error" should I assume Db2 does not always update the data across nodes? you said it would always update right? if so, then why is this getting rejected? Can you please help me understand?
Once again: DB2 always updates a row regardless of its physical data placement or the DB2 node your application connected to.
    "Row not found" is not an error, it's a warning. If you don't anticipate it, it means you have an application logic error somewhere.
    As for deadlocks: locking behavior may depend on the access plan. For example, if two applications do a mass update on the same DB2 node at the same time and use a table scan, you may see deadlocks / lock timeouts, whereas the same applications may succeed using index access.
    Regards,
    Mark.
