  1. #1
    Join Date
    Nov 2004
    Posts
    27

    Unanswered: performance question

    Greetings,

    I have two tables, fop_process and fop. The fop_process table contains about 10 of the fields from the fop table; the remaining fields in fop are just text fields. The two queries below retrieve all of the duplicate records in each table. Afterwards, I will run delete statements to remove the duplicate records, including the original record.

    SELECT EliminateDuplicates_RowID,
    CC,
    SSN,
    STATUS_CODE,
    NAME,
    EXP,
    USB_CODE5,
    NEED_ACTIVE
    INTO fop_process
    FROM fop_process
    GROUP BY CC,
    SSN,
    STATUS_CODE,
    NAME,
    EXP,
    USB_CODE5,
    NEED_ACTIVE
    HAVING COUNT(*) >= 2


    AND


    SELECT EliminateDuplicates_RowID,
    USB_COMPANY_CODE,
    USB_AGENT_CODE,
    USB_COMPANY_CODE_NAME,
    USB_AGENT_CODE_NAME,
    CC,
    SSN,
    STATUS_CODE,
    ADDRESS1,
    ADDRESS2,
    CITY,
    STATE,
    POSTAL,
    USB_CODE1,
    USB_CODE2,
    USB_CODE3,
    USB_CODE4,
    NAME,
    USB_DIVISION_NAME,
    HOME_PHONE,
    EXP,
    ISSUE,
    USB_CODE5,
    Emp_ID,
    USB_CODE6,
    NEED_ACTIVE
    INTO fop2
    FROM fop
    GROUP BY USB_COMPANY_CODE,
    USB_AGENT_CODE,
    USB_COMPANY_CODE_NAME,
    USB_AGENT_CODE_NAME,
    CC,
    SSN,
    STATUS_CODE,
    ADDRESS1,
    ADDRESS2,
    CITY,
    STATE,
    POSTAL,
    USB_CODE1,
    USB_CODE2,
    USB_CODE3,
    USB_CODE4,
    NAME,
    USB_DIVISION_NAME,
    HOME_PHONE,
    EXP,
    ISSUE,
    USB_CODE5,
    Emp_ID,
    USB_CODE6,
    NEED_ACTIVE
    HAVING COUNT(*) >= 2


    The first query takes 4.5 hours to finish completely. The second, bigger query takes only 35 minutes to complete. I would have thought that, with more record fields to compare, the second query would take much longer than the first; however, it is just the opposite. Does anyone have any insight into why this is the case? Also, fewer records come back from the first query, which I do not understand either. Any help would be appreciated.

    Patrick Quinn
    Navigant Sybase Database Administrator

  2. #2
    Join Date
    Aug 2004
    Posts
    42
    What do showplan and statistics io tell you?

  3. #3
    Join Date
    Nov 2004
    Posts
    27
    showplan did not show me anything. However, 'set statistics io' showed a lot more processing going on for the fop_process table, even though it only has a subset of the fields of the fop table. This still does not make sense to me: how can a table with far fewer fields take six times as long to process as a table with three times the number of fields? Any further insight would be appreciated.

    Total writes for this command: 0
    Table: fop scan count 2, logical reads: (regular=57454 apf=0 total=57454), physical reads: (regular=57454 apf=0 total=57454), apf IOs used=0
    Table: fop2 scan count 0, logical reads: (regular=256087 apf=0 total=256087), physical reads: (regular=2 apf=0 total=2), apf IOs used=0
    Table: Worktable1 scan count 173087, logical reads: (regular=4478420 apf=0 total=4478420), physical reads: (regular=209493 apf=0 total=209493), apf IOs used=0
    Total writes for this command: 38286
    (170108 rows affected)

    Total writes for this command: 0
    Table: fop_process scan count 85370, logical reads: (regular=605187930 apf=0 total=605187930), physical reads: (regular=7089 apf=0 total=7089), apf IOs used=0
    Table: fop_process2 scan count 0, logical reads: (regular=184752 apf=0 total=184752), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
    Table: Worktable1 scan count 1, logical reads: (regular=680372 apf=0 total=680372), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
    Total writes for this command: 7500
    (170738 rows affected)
    Total writes for this command: 0

    Patrick Quinn
    Navigant Sybase Database Administrator

  4. #4
    Join Date
    Aug 2004
    Posts
    42
    So the key comparison is the statistics io line generated for the main table in each query (as opposed to the output table or the worktable).

    For fop:
    Table: fop scan count 2, logical reads: (regular=57454 apf=0 total=57454), physical reads: (regular=57454 apf=0 total=57454), apf IOs used=0

    For fop_process:
    Table: fop_process scan count 85370, logical reads: (regular=605187930 apf=0 total=605187930), physical reads: (regular=7089 apf=0 total=7089), apf IOs used=0

    The scan count of 2 for fop seems reasonable (though I'm not sure why it isn't just 1), but the scan count of 85,370 for fop_process is off the charts. Not to mention that each scan is reading on average 605,187,930 / 85,370 = 7,089 pages -- which looks to be the entire table.

    So -- there must be something different going on and I would think that the showplan output would tell something.

    I would guess that perhaps the optimizer is doing some sort of self-join on the GROUP BY columns and that for fop_process there is no useful index. Still very weird, though.

    The showplan output plus the sp_helpindex output would most likely provide some hints.

  5. #5
    Join Date
    Nov 2004
    Posts
    27
    Would a clustered or nonclustered index help on the fop_process table?

    Patrick

  6. #6
    Join Date
    Aug 2004
    Posts
    42
    I'm sure they would. What do you have on the table now? From what is happening it seems that perhaps you have no index at all. What do you have on fop? -- since that seems to work well. I am still curious to see your showplan output.

  7. #7
    Join Date
    Nov 2004
    Posts
    27
    I have no indexes on either table.

  8. #8
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    No Indexes !!!

    1. Well then you do not have tables in the relational sense (relational = keys; keys = indexes); you have heaps of data which include duplicates, and therefore you cannot reasonably expect to perform relational operations on these heaps. Indexes prevent duplicates from being inserted in the first place. The first thing to do (before looking at performance issues, showplans, or statistics) is to create an index on each table. You will have to choose the keys carefully (based on whatever criteria were used to design the tables). If you do not know those criteria, then using the columns in the respective GROUP BY clauses is a good start.

    The problem is that you have duplicates already, so you will have to create non-clustered indexes first (which allow duplicates; no PRIMARY KEY), delete the duplicates, then create the correct clustered indexes (which disallow duplicates; PRIMARY KEY), and drop the non-clustered indexes.

    A better method (but you will have to do some reading) would be to:
    - bcp out the tables
    - truncate them
    - create the correct clustered index (PRIMARY KEY) and any secondary indexes (non-clustered) once and for all
    - sort (unix utility) the bcp files by key
    - bcp in the data (the duplicates will fail; use the appropriate parameters; you will only get the "first" occurrence of each key)
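
    As a rough sketch of the index step in that sequence (the index name and key columns here are my assumptions, taken from your GROUP BY list; substitute whatever your real keys are):

    ```sql
    -- Sketch only: key columns assumed from the GROUP BY clause.
    -- With ignore_dup_key, any bcp'd row that duplicates an existing key
    -- is discarded rather than aborting the load, so only the "first"
    -- occurrence of each key survives the bcp in.
    CREATE UNIQUE CLUSTERED INDEX fop_process_ck
        ON fop_process (CC, SSN, STATUS_CODE, NAME, EXP, USB_CODE5, NEED_ACTIVE)
        WITH ignore_dup_key
    ```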

    2. It appears that in the first SELECT, you should be selecting into a target table that is not the source table (fop_process2 ?), unless you want nightmares.

    3. On another topic, your use of ...Row_ID, as well as other hints I get from your question, gives me the horrible sensation that you are trying to perform procedural (one record/row at a time) processing on a relational (set-oriented) database. This is completely unnecessary; you just have to approach the problem with a relational mindset, not a procedural one.

    4. If you post again, provide the sp_help <table> output first (I do not need the showplan at this stage).
    Last edited by DerekA; 02-03-05 at 00:02.
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  9. #9
    Join Date
    Nov 2004
    Posts
    27

    It might work except I need the duplicates

    Good day all,

    Below is the sp_help output for the fop_process and fop tables.

    fop_process
    -----------
    Object does not have any indexes.
    No defined keys for this object.
    Object is not partitioned.
    Lock scheme Allpages
    The attribute 'exp_row_size' is not applicable to tables with allpages lock scheme.
    The attribute 'concurrency_opt_threshold' is not applicable to tables with allpages lock scheme.

    fop_process dbo user table

    default Jan 28 2005 11:34AM

    CC varchar 15 NULL NULL 0 NULL NULL 0
    SSN varchar 15 NULL NULL 0 NULL NULL 0
    STATUS_CODE varchar 10 NULL NULL 0 NULL NULL 0
    NAME varchar 25 NULL NULL 0 NULL NULL 0
    EXP varchar 15 NULL NULL 0 NULL NULL 0
    USB_CODE5 varchar 10 NULL NULL 0 NULL NULL 0
    NEED_ACTIVE varchar 10 NULL NULL 0 NULL NULL 0
    EliminateDuplicates_RowID numeric 9 18 0 0 NULL NULL 1


    fop
    ---
    Object does not have any indexes.
    No defined keys for this object.
    Object is not partitioned.
    Lock scheme Allpages
    The attribute 'exp_row_size' is not applicable to tables with allpages lock scheme.
    The attribute 'concurrency_opt_threshold' is not applicable to tables with allpages lock scheme.

    fop dbo user table

    default Jan 19 2005 2:23PM

    USB_COMPANY_CODE varchar 20 NULL NULL 1 NULL NULL 0
    USB_AGENT_CODE varchar 20 NULL NULL 1 NULL NULL 0
    USB_COMPANY_CODE_NAME varchar 50 NULL NULL 1 NULL NULL 0
    USB_AGENT_CODE_NAME varchar 25 NULL NULL 1 NULL NULL 0
    CC varchar 15 NULL NULL 1 NULL NULL 0
    SSN varchar 15 NULL NULL 1 NULL NULL 0
    STATUS_CODE varchar 10 NULL NULL 1 NULL NULL 0
    ADDRESS1 varchar 50 NULL NULL 1 NULL NULL 0
    ADDRESS2 varchar 50 NULL NULL 1 NULL NULL 0
    CITY varchar 25 NULL NULL 1 NULL NULL 0
    STATE varchar 10 NULL NULL 1 NULL NULL 0
    POSTAL varchar 10 NULL NULL 1 NULL NULL 0
    USB_CODE1 varchar 10 NULL NULL 1 NULL NULL 0
    USB_CODE2 varchar 10 NULL NULL 1 NULL NULL 0
    USB_CODE3 varchar 10 NULL NULL 1 NULL NULL 0
    USB_CODE4 varchar 10 NULL NULL 1 NULL NULL 0
    NAME varchar 25 NULL NULL 1 NULL NULL 0
    USB_DIVISION_NAME varchar 25 NULL NULL 1 NULL NULL 0
    HOME_PHONE varchar 15 NULL NULL 1 NULL NULL 0
    EXP varchar 15 NULL NULL 1 NULL NULL 0
    ISSUE varchar 15 NULL NULL 1 NULL NULL 0
    USB_CODE5 varchar 10 NULL NULL 1 NULL NULL 0
    Emp_ID varchar 15 NULL NULL 1 NULL NULL 0
    USB_CODE6 varchar 10 NULL NULL 1 NULL NULL 0
    NEED_ACTIVE varchar 10 NULL NULL 1 NULL NULL 0
    EliminateDuplicates_RowID numeric 9 18 0 0 NULL NULL 1


    First off, the data in the fop table comes from a bcp in. The data in the fop_process table comes from a SELECT of selected fields from the fop table. I agree that your approach of creating the non-clustered index, then the clustered index, would work. However, I need the duplicates in both tables, because following the above SELECT statement, I execute the following DELETE statements to keep only those unique records that do not have duplicates.

    For the fop_process table:

    DELETE
    FROM fop_process
    FROM fop_process t1
    JOIN fop_process2 t2
    ON( t1.EliminateDuplicates_RowID <> t2.EliminateDuplicates_RowID
    AND t1.CC = t2.CC
    AND t1.SSN = t2.SSN
    AND t1.STATUS_CODE = t2.STATUS_CODE
    AND t1.NAME = t2.NAME
    AND t1.EXP = t2.EXP
    AND t1.USB_CODE5 = t2.USB_CODE5
    AND t1.NEED_ACTIVE = t2.NEED_ACTIVE)

    For the fop table:

    something similar as the above delete statement using all of the fields of the fop table.

  10. #10
    Join Date
    Nov 2004
    Posts
    27

    Showplan results

    For those of you who requested it, here are the showplan results involving the fop_process table mentioned above. Let me know if this helps.

    QUERY PLAN FOR STATEMENT 1 (at line 1).

    STEP 1
    The type of query is CREATE TABLE.

    STEP 2
    The type of query is SELECT (into Worktable1).
    GROUP BY
    Evaluate Grouped COUNT AGGREGATE.

    FROM TABLE
    fop_process
    Nested iteration.
    Table Scan.
    Forward scan.
    Positioning at start of table.
    Using I/O Size 2 Kbytes for data pages.
    With LRU Buffer Replacement Strategy for data pages.
    TO TABLE
    Worktable1.

    STEP 3
    The type of query is INSERT.
    The update mode is direct.

    FROM TABLE
    Worktable1.
    Nested iteration.
    Table Scan.
    Forward scan.
    Positioning at start of table.
    Using I/O Size 2 Kbytes for data pages.
    With MRU Buffer Replacement Strategy for data pages.

    FROM TABLE
    fop_process
    Nested iteration.
    Table Scan.
    Forward scan.
    Positioning at start of table.
    Using I/O Size 2 Kbytes for data pages.
    With LRU Buffer Replacement Strategy for data pages.
    TO TABLE
    fop_process2

  11. #11
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    Creating, then deleting duplicates ?

    I would really like to address just your question, but to do so I would have to ignore the approach you have used, and try as I might, I am unable to do that. So I will provide my responses under two headings. Please take what you like and leave the rest.

    Re The Problem

    1. Ok, forget the create non-clustered, then create clustered. But it would appear that the rest of my initial suggestion is worth considering (until you tell me otherwise). Try it; do not simply evaluate it (I am happy to provide assistance if you have an actual problem, but I am not getting into educative discussions about whether something might work or not, sorry):

    if you have [the correct] indexes set up with ignore_dup_key before the bcp-in, then during the bcp-in only the "first" non-duplicate keys will be inserted. If this does not address the problem, then ...

    2. Again, the first thing is to get an index on every table. Without this a "Table Scan" has to be performed (read every row in the table regardless of whether it is required for the query or not). With decent indexes (choosing the correct columns for the index and type of index) the query may be able to do an "Index Scan", which is much [much] faster. Table Scans are to be avoided absolutely, as are unindexed tables.

    2.1. Permit me to short-circuit any discussion about candidate keys, clustered or non-clustered indexes, etc. and suggest the following:
    - create an unique clustered index on EliminateDuplicates_RowID (on all 4 tables)
    - if that does not work, create a non-unique clustered index on EliminateDuplicates_RowID (on all 4 tables)
    - if that does not work, create a non-unique clustered index on the table using the columns from the respective GROUP BY clauses.
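
    Sketches of those three options (the index names are illustrative only, and the column lists are taken from the fop_process queries; adapt for the other three tables):

    ```sql
    -- 2.1 option 1: unique clustered index on the row id (assumes it is unique)
    CREATE UNIQUE CLUSTERED INDEX fop_process_rowid_idx
        ON fop_process (EliminateDuplicates_RowID)

    -- 2.1 option 2: same index, non-unique, if the row id turns out to repeat
    -- CREATE CLUSTERED INDEX fop_process_rowid_idx
    --     ON fop_process (EliminateDuplicates_RowID)

    -- 2.1 option 3: non-unique clustered index on the GROUP BY columns
    -- CREATE CLUSTERED INDEX fop_process_grp_idx
    --     ON fop_process (CC, SSN, STATUS_CODE, NAME, EXP, USB_CODE5, NEED_ACTIVE)
    ```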

    3. Worktables (eg. caused by SELECT...INTO) are also to be avoided. Based on what I can ascertain so far, the two SELECT...INTOs and the two "...2" tables can be replaced with a DELETE...(SELECT...GROUP BY ... HAVING COUNT(*) > 1 ). (See below.) This is where I suggest you focus your energy (after adding the correct indexes).
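
    One possible shape for that DELETE, as a sketch only (it relies on the grouping columns being NOT NULL, which the sp_help output for fop_process shows, and it mirrors the DELETE...FROM syntax used earlier in the thread): remove every row for which another row with a different row id has identical grouping columns, in a single statement with no intermediate tables.

    ```sql
    -- Sketch only: deletes all members of every duplicate group
    -- (including the "original"), leaving just the unique rows.
    DELETE FROM fop_process
    FROM fop_process t1
    WHERE EXISTS (
        SELECT 1
        FROM fop_process t2
        WHERE t2.EliminateDuplicates_RowID <> t1.EliminateDuplicates_RowID
          AND t2.CC          = t1.CC
          AND t2.SSN         = t1.SSN
          AND t2.STATUS_CODE = t1.STATUS_CODE
          AND t2.NAME        = t1.NAME
          AND t2.EXP         = t1.EXP
          AND t2.USB_CODE5   = t1.USB_CODE5
          AND t2.NEED_ACTIVE = t1.NEED_ACTIVE
    )
    ```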

    Re Your Approach

    1. This appears to be creating duplicates in one table, then creating more duplicates in another table (along with duplicate columns!), then on the basis of the existence of duplicates, the duplicates are deleted. Either there is something wrong with the approach or there is something that I am not getting.
    2. There is another table which you have not told us about, fop_process2. The showplan refers to it but does not match the SQL you provided at the beginning (which does not identify it). I can only presume that you create this with a SELECT...INTO...FROM fop_process GROUP BY... as you do with fop/fop2. Otherwise the first SELECT...INTO actually creates fop_process2 (ok, you may not suffer nightmares but you definitely suffer typos).
    3. There also appears to be an error in the DELETE SQL you provided (fop_process, not fop_process2; or the first FROM is wrong).
    4. Therefore, if we continue in this vein, please provide:
    - sp_help of the tables concerned (4?)
    - Actual, correct SQL of all populating and deleting processes
    - showplan of the SQL
    5. I would be very interested in the process that computes and sets EliminateDuplicates_RowID, particularly the process that creates that column value in the bcp file that is used to populate fop.
    6. There is NEVER a need to create data columns in a table that already exist in another table (eg. fop_process contains 10 columns from fop). (Note: key columns from a parent table exist in the child table, thereby identifying the row in the parent that the child is 'referred' [related] to, which gives you the 'relational' in 'relational database'; but these are key referential columns and are not to be considered "data" column duplication.)
    6.1. Data column duplication means you do not have a "database", you have a bunch of unnormalised files (in your case, unnormalised unindexed heaps); one of the fundamental goals of the normalisation process is to eliminate duplicate columns.
    6.2. You can "relate" or join fop_process to fop by using the [apparent] key EliminateDuplicates_RowID [or the compound key made up of the 10 columns] (but do index the key columns in both tables first) and then refer to the fop columns via a join using the referential key (without having to duplicate the said columns in fop_process).
    --- Note that duplicate columns have to be maintained and this is onerous.
    --- Note this will not remove the SQL coding possibilities that you are attempting; the logic remains the same but will be much faster than your procedural approach. It would appear that fop2 and fop_process2 are not required. Use a subquery (query within a query): the inner (nested) query is the same as your SELECT; the outer query would be your DELETE statement.
    --- Note you can refer to one table more than once in your FROM clause and use an alias (t1, t2) to distinguish them. Similarly, in the outer query, you can refer to a table in the inner query using such an alias.
    Last edited by DerekA; 02-05-05 at 10:42.
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it

  12. #12
    Join Date
    Nov 2004
    Posts
    27

    Found my answer

    I want to thank everybody who replied to my initial message. I managed to retrieve the unique records in a short period of time. It only took seconds to execute. Below are the queries I used.

    Thanks,

    Patrick Quinn
    Navigant Sybase Database Administrator

    insert into fop_process2
    SELECT Count(*),
    CC,
    SSN,
    STATUS_CODE,
    NAME,
    EXP,
    USB_CODE5,
    NEED_ACTIVE
    FROM fop_process
    GROUP BY CC,
    SSN,
    STATUS_CODE,
    NAME,
    EXP,
    USB_CODE5,
    NEED_ACTIVE

    select CC, SSN, STATUS_CODE, NAME, EXP, USB_CODE5, NEED_ACTIVE into fop_process_final
    from fop_process2
    where Count_Num = 1

  13. #13
    Join Date
    Sep 2002
    Location
    Sydney, Australia
    Posts
    255

    Aha !

    First it appeared you were seeking duplicates, then it appeared you were seeking the first-row-of-duplicates and non-duplicates, now I understand you are seeking non-duplicates only.

    In that case, you can eliminate the intermediate tables (fop_process2 and fop_process_final) and the three steps, by doing the entire process using a single SELECT on fop_process with a subquery (inner query with GROUP BY as per your last post, aliasing COUNT(*) to the outer query). And of course, DELETE using a subquery with COUNT(*) > 1.
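
    A minimal sketch of that single-statement form (column list assumed from the queries posted earlier); groups of size 1 are exactly the non-duplicates, so no intermediate table is needed:

    ```sql
    -- Returns only the rows whose grouping appears exactly once.
    SELECT CC, SSN, STATUS_CODE, NAME, EXP, USB_CODE5, NEED_ACTIVE
    FROM fop_process
    GROUP BY CC, SSN, STATUS_CODE, NAME, EXP, USB_CODE5, NEED_ACTIVE
    HAVING COUNT(*) = 1
    ```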

    The first-row-of-duplicates will be lost.
    Derek Asirvadem
    Senior Sybase DBA/Information Architect derekATsoftwaregemsDOTcomDOTau
    Anything worth doing is worth doing Right The First Time
    Spend your money on standards-compliant development or spend 10 times more fixing it
