  1. #1
    Join Date
    Jan 2004
    Posts
    66

Unanswered: Performance issue using SQL*Loader

    Hi,

Using SQL*Loader (direct path, parallel load), I'm inserting data into a table from a flat file. The flat file contains a large amount of data (millions of records). The problem is that the flat file may contain duplicate records, and the records are not in sorted order, but the table has a primary key. So how can I prevent the duplicate records from loading and still improve performance?

I've tried this in a couple of ways, but I'm not getting the best performance:

1. Reading the file as an external table and deleting the duplicate records there, then using SQL*Loader (a sketch of this approach is below).

2. Using sort -u to sort the flat file and drop the duplicates, then using SQL*Loader.
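
The external-table variant looks roughly like this (a sketch only - the directory, file, and column names here are made up for illustration):

PHP Code:
-- hypothetical directory, file, and column names
create directory load_dir as '/data/incoming';

create table flat_ext (
  id   number,
  col1 varchar2(100)
)
organization external (
  type oracle_loader
  default directory load_dir
  access parameters (
    records delimited by newline
    fields terminated by ','
  )
  location ('data.dat')
);

-- DISTINCT eliminates the duplicates during the insert
insert /*+ APPEND */ into target_table
select distinct id, col1 from flat_ext;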

    Thanks ........
    himridul

  2. #2
    Join Date
    Jul 2003
    Posts
    2,296
Do DIRECT and APPEND work together for you?
That way all your duplicates go to a BAD file.
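
Something like this in the control file, I mean (just a sketch - the file and column names are placeholders, and with PARALLEL=TRUE you run one sqlldr session per input file):

PHP Code:
-- load.ctl (hypothetical)
OPTIONS (DIRECT=TRUE, PARALLEL=TRUE, ERRORS=999999)
LOAD DATA
INFILE 'data.dat'
APPEND
INTO TABLE target_table
FIELDS TERMINATED BY ','
(id, col1)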



Another option:
Create a temp table that matches your target db-table. Make sure you create the table with the NOLOGGING option (see the sketch below).

SQLLOAD your data into the temp table.
From there you have MANY options to eliminate the duplicate rows.
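
For example, the temp table could be created like this (a sketch; the names are placeholders):

PHP Code:
-- empty copy of the target, created without redo logging
create table temp_table nologging
as select * from target_table where 1 = 0;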


To load into your target db-table, use this (here I am eliminating the duplicates at this point, but you could do it any way you want):
PHP Code:
insert /*+ APPEND */ into target_table
select distinct col1, col2, col3  -- etc.
from temp_table;
    - The_Duck
    you can lead someone to something but they will never learn anything ...

  3. #3
    Join Date
    Jan 2004
    Posts
    66
    Thanks Grand........
Your second option is not useful for me, because the DISTINCT clause will take a very long time on 100 million rows. I don't know yet whether your first option will work; I'm trying that one now.

Actually, I have to complete the whole process within a certain time, so the time factor is vital for me. Also, this is not a one-time job; it is an ongoing process. Every time, before starting the whole process, I have to purge the table into which the data from the flat file is loaded.

Now what I'm trying is this:
Using a shell script, I split the input flat file into 4 pieces, because we have 4 processors. The pieces are then sorted by the shell script and loaded into 4 temporary tables using SQL*Loader. Finally, the 4 temp tables are merged to build the original table; that's how I'm eliminating duplicate rows (a sketch of the merge step is below). If anyone has a better idea, please help me.
    Any help will be appreciated.
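
The merge step is roughly this (a sketch with made-up staging-table names; UNION, unlike UNION ALL, removes the duplicates across the pieces):

PHP Code:
insert /*+ APPEND */ into target_table
select * from stage1
union
select * from stage2
union
select * from stage3
union
select * from stage4;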

    Thanks..........
    himridul

  4. #4
    Join Date
    Jul 2003
    Posts
    2,296
    I guess I am a Poobah ... interesting.

Anyway, if you have constraints on the table (a PK constraint), then sqlloader will just not load that duplicate line twice. It shouldn't crash or stop. Just specify your ERRORS parameter to be large enough to handle all the duplicates.

Too bad the people who give you this file every day do not take enough pride in their work to give you only unique data.

If you are truncating this table every day, then I see no reason why you cannot change the table to NOLOGGING.

ALSO, before you load the data I suggest dropping all indexes and disabling all FK constraints (see the sketch below).

Just keep the unique constraint/index.

With direct load and those other changes, tell me if it improves your load time.
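
The preparation might look something like this (a sketch - the FK and index names are made up):

PHP Code:
alter table target_table nologging;

-- hypothetical FK and index names
alter table child_table disable constraint fk_child_target;
drop index target_col2_idx;

-- keep only the unique constraint/index, then run sqlldr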
    - The_Duck
    you can lead someone to something but they will never learn anything ...

  5. #5
    Join Date
    Jan 2004
    Posts
    66

    Thumbs up

    Sorry Poobah ..................

I can't use a PK constraint with a direct path load. It's true that I can use a unique constraint. I took your proposal and am implementing it in a slightly different way.

On the target table, I'm dropping all the indexes, even the unique index.
After loading the data through sqlloader, I impose the unique index on the particular column. If duplicate rows are found while imposing the unique index, it will throw an error, and I can trap that error. If no duplicate rows are present, everything is fine.

Why am I not imposing the unique key before loading the data? That's what's on your mind, isn't it?
Because if I impose the unique key before loading, then every inserted row is checked for duplicates before the new row goes in. For example, when inserting the 101st row into the table, the first 100 rows are checked for duplicates. For 100 million rows, that will take a lot of time. What do you think? Am I right or not? (A sketch of the load-then-index sequence is below.)
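
Roughly this sequence (a sketch - the index and column names are placeholders; ORA-01452 is the error raised if duplicates exist):

PHP Code:
-- before the load: drop all indexes, including the unique one
drop index target_pk_idx;

-- ... run sqlldr (direct path) ...

-- after the load: re-impose uniqueness and trap ORA-01452 if it fails
create unique index target_pk_idx on target_table (id);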

Need your suggestion..............
    Thanks,
    himridul

  6. #6
    Join Date
    Jul 2003
    Posts
    2,296
    What do I think?
    WHAT DO I THINK?!

You are a genius!

    Try it both ways and tell me the time difference between the two.
    - The_Duck
    you can lead someone to something but they will never learn anything ...

  7. #7
    Join Date
    Sep 2003
    Location
    Milan, Italy
    Posts
    130
    Originally posted by himridul

After loading the data through sqlloader, I impose the unique index on the particular column. If duplicate rows are found while imposing the unique index, it will throw an error, and I can trap that error. If no duplicate rows are present, everything is fine.

If you want to know which rows are duplicates, you can use the EXCEPTIONS INTO clause.

    First create the exceptions table:

    SQL> @?/rdbms/admin/utlexcpt.sql

    Table created.

Let's create a table with two duplicate rows:

    SQL> create table t (x int);

    Table created.

    SQL> insert into t (x) values (1);

    1 row created.

    SQL> insert into t (x) values (1);

    1 row created.

    SQL> insert into t (x) values (2);

    1 row created.

    SQL> select * from t;

    X
    ----------
    1
    1
    2

    Now let's create the pk constraint:

    SQL> alter table t add constraint t_pk primary key (x) EXCEPTIONS into exceptions;
    alter table t add constraint t_pk primary key (x) EXCEPTIONS into exceptions
    *
    ERROR at line 1:
    ORA-02437: cannot validate (DELLERA.T_PK) - primary key violated

    But now we know the rowids of the offending rows:

    SQL> select * from t where rowid in (select row_id from exceptions);

    X
    ----------
    1
    1

    You could delete them (or keep one of them) with a simple delete statement. Say you want to delete them:

    SQL> delete t where rowid in (select row_id from exceptions);

    2 rows deleted.

    SQL> select * from t;

    X
    ----------
    2

Another thing - I would create the index on the PK column just BEFORE creating the constraint, as a NON-unique index, if you expect the number of duplicate rows to be small. That way the index stays there even if the PK creation fails, and you won't waste the time to sort the rows (and re-create the index) again.
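
In the running example, that would be (the index name is arbitrary):

SQL> create index t_pk_idx on t (x);

SQL> alter table t add constraint t_pk primary key (x) EXCEPTIONS into exceptions;

Oracle polices the PK with the existing non-unique index, so even if the ALTER fails with ORA-02437, the index survives and the eventual re-enable only has to validate the rows, not rebuild the index.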

Also, look at the direct path option of sqlloader - really, really fast if you can use it.

    HTH
    Alberto

  8. #8
    Join Date
    Dec 2003
    Location
    Noida, India.
    Posts
    171
Where can we find the script for the EXCEPTIONS table, utlexcpt.sql?

When using direct path, SQL*Loader internally disables the PK and enables it again after the load completes. This is what makes the load faster. But most of the time, we avoid using direct path when there is no guarantee of data integrity ...

If duplicate records can be found this way, I guess the best approach would be (sketched below):

- disable PK
- load data
- enable PK with exceptions
- delete duplicates by rowid
- enable PK without exceptions
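
As a sketch in SQL (the constraint and table names are placeholders; the sqlldr run happens between the first two steps):

PHP Code:
alter table target_table disable constraint target_pk;

-- ... run sqlldr ...

truncate table exceptions;

-- fails with ORA-02437 if duplicates exist, recording their rowids
alter table target_table enable constraint target_pk
exceptions into exceptions;

-- this removes every copy; use a more selective delete to keep one
delete from target_table
where rowid in (select row_id from exceptions);

alter table target_table enable constraint target_pk;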

    alberto, this is really wonderful !!! hats off ...




Originally posted by alberto.dellera
If you want to know which rows are duplicates, you can use the EXCEPTIONS INTO clause. [...]
    Oracle can do wonders !

  9. #9
    Join Date
    Sep 2003
    Location
    Milan, Italy
    Posts
    130
Originally posted by cmasharma
Where can we find the script for the EXCEPTIONS table, utlexcpt.sql?

When using direct path, SQL*Loader internally disables the PK and enables it again after the load completes. This is what makes the load faster. But most of the time, we avoid using direct path when there is no guarantee of data integrity ...

If duplicate records can be found this way, I guess the best approach would be:

- disable PK
- load data
- enable PK with exceptions
- delete duplicates by rowid
- enable PK without exceptions

alberto, this is really wonderful !!! hats off ...
The script is in $ORACLE_HOME/rdbms/admin; if you are on the Oracle server you can invoke it simply by using

SQL> @?/rdbms/admin/utlexcpt.sql

(The "?" means "ORACLE_HOME".)

[Remember to truncate the EXCEPTIONS table before the PK enable.]

May I suggest, since the number of duplicates is probably very low, a refinement of your approach:

- disable PK
- DROP PK_IDX
- load data
- CREATE PK_IDX
- enable PK with exceptions
- delete duplicates by rowid
- enable PK without exceptions

That is, decouple the PK-supporting INDEX from the PK CONSTRAINT, either by creating the index before the PK or by creating the PK as a DEFERRABLE constraint (a sketch of the deferrable variant is below).
This is always a good idea in general, and in this case you will save the re-creation of the index in the "enable PK without exceptions" phase, which is probably 90% of the work. This assumes, of course, that the duplicates are very few, so that updating the index is much faster than rebuilding it completely ... but that's normally the case.
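
Continuing the small example above, the DEFERRABLE variant could be sketched as follows (a deferrable PK is policed by a NON-unique index, so the index survives a failed validation):

SQL> alter table t add constraint t_pk primary key (x) deferrable enable novalidate;

SQL> alter table t enable validate constraint t_pk EXCEPTIONS into exceptions;

If the validation fails, the duplicates' rowids are in EXCEPTIONS as before, and the index does not have to be rebuilt.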

About direct path: I've never used it to load non-empty tables, so I can't comment on the PK issue. Anyway, it is much faster than the conventional path - orders of magnitude - so its use is worth investigating, IMO.

    Thanks for the hats ;-)

    HTH
    Al
    Last edited by alberto.dellera; 02-02-04 at 08:11.

  10. #10
    Join Date
    Jan 2004
    Posts
    66
Thanks, Alberto.
I got the solution. This is really a wonderful idea when the duplicate rows are few.
Thanks again.
    himridul
