Results 1 to 15 of 15
  1. #1
    Join Date
    Mar 2009
    Posts
    15

    Indexes effects on insert commands

    Suppose I have to create a table from scratch and then populate it, reading data from a file.
    I think there are only 2 options:
    1) create table - create indexes - load data
    2) create table - load data - create indexes

    Which is the best command sequence? And why?

    Is different if there is a primary key in the table? And if so, when it must be created for better performance of the loading operation?

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    The best sequence depends on what you want. What are your criteria for "best" in this case?

    -PatP

  3. #3
    Join Date
    Mar 2009
    Posts
    15
    Yes, the question is very general, so I expect that there's no single correct answer.

    But I'm interested in undestanding the correct behaviour of the two options.

    1) create table - create indexes - load data
    What happens when I create indexes in an empty table? I think that the DBMS just create the root node for each B-tree index. Right?
    So there are no pages allocated for data.
    Then, during the data loading, for each row the DBMS must allocates page memory space, store phisically the row and update all the indexes.

    2) create table - load data - create indexes
    Here, loading data in an empty-not-indexed table, each row is stored sequentially, in the reading order. Again, for each row must be allocated some page space. But each insertion does not require indexes updates. But at the next step, when creating indexes on the populated table, each index is created reading the entire set of rows in sequential order, so each index must be ordered, as in the previous case.

    So in both cases, there are no pages pre-allocated for data, and in both cases there's the cost of updating all the indexes for each row.
    Seems there's no difference!!!

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by mciasco
    Seems there's no difference!!!
    test it for yourself and see

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  5. #5
    Join Date
    Mar 2009
    Posts
    15
    Quote Originally Posted by r937
    test it for yourself and see

    well, this is what I'm going to do (I started a batch file with some test file just 5 minutes ago...)

    But apart from the results, I'd like to understand the correct approach in this kind of context, even if this is just a theoric solution or a best practice. I want to understand the mechanism behind the scene!

  6. #6
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    A few extra things to keep in mind:
    • Inserting the data before creating the indexes will usually be faster - certainly much faster to insert the data and usually faster overall but not having the indexes in place will mean you could insert duplicate rows etc.
    • When adding a clustered index to a table you will need around 2.5 times the space used by the table currently to reorganise it - with a very large table this can present problems if you're adding the index after inserting the data.
    • If you're setting up a feed then it's usually best to load the raw data into a transfer table that contains minimal restrictions (ie no FK and indexes) and then transfer the data to the main table using a stored proc.
    • If you're just loading small sets of data (<100k records) then things should be fast enough no matter how you approach things.
    Just my 2c.

  7. #7
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Testimonial:
    pootle flump
    ur codings are working excelent.

  8. #8
    Join Date
    Mar 2009
    Posts
    15
    Quote Originally Posted by pootle flump
    Yes, this is an equivalent post of mine in another forum... can't do this?

    However I made some tests.

    I tested 3 DBMS: MySQL, PostgreSQL and SQLServer with the same data bulk file.
    1) Each DBMS is installed locally (no network delays)
    2) The bulk file is 50MB and contains 120.000 rows
    3) each opeation (create table, load bulk and create indexes) is done separately and require a new connection, so for each time value is included the connection time
    4) the values showed are result of 3 runs of the same test, so they are average values.

    Sequence: CREATE TABLE - LOAD BULK - CREATE INDEXES
    MYSQL
    - create table (s): 7
    - load bulk (s): 26
    - create indexes(s): 3667
    - total (s): 3700

    POSTGRES
    - create table (s): 5
    - load bulk (s): 50
    - create indexes(s): 83
    - total (s): 509

    SQLSERVER
    - create table (s): 12
    - load bulk (s): 21
    - create indexes(s): 24
    - total (s): 426

    ----------------
    Sequence: CREATE TABLE - CREATE INDEXES - LOAD BULK
    MYSQL
    - create table (s): 1
    - load bulk (s): 0
    - create indexes(s): 565
    - total (s): 893

    POSTGRES
    - create table (s): 1
    - load bulk (s): 1
    - create indexes(s): 499
    - total (s): 883

    SQLSERVER
    - create table (s): 11
    - load bulk (s): 2
    - create indexes(s): 100
    - total (s): 462


    So, it seems that creating indexes after the bulk load is better for Postgres and SQLServer (Postgres takes more advantages from this solution, while SQLServer performances are almost the same in both cases).
    But this approach in MySQL is totally wrong! The time needed for create indexes after the bulk load grows enormously! For MySQL is better to create indexes first, and then load data. And the difference between the two approaches is huge!
    This seems strange if compared to the behaviour of Postgres and SqlServer.

    Overall, SqlServer is the fastest DBMS in both cases, while MySQL is the slowest one.

    What do you think about?
    Last edited by mciasco; 03-11-09 at 07:57.

  9. #9
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    What do you think about?
    Your figures for creating a table in MySQL are 1 second and 7 seconds yet effectively you are doing exactly the same thing in both cases. If your figures vary so much when doing exactly the same operation then I'm guessing it's difficult to base any conclusions on them.

    Are the RDBMS's shared by other users or are they dedicated to you? I'm surprised any of the RDBMS's take any time at all to create a table - the operation should be effectively instant - are you sure about your figures? I just created 10 tables in MySQL and the total time was a fraction of a second. Do your figures also include the time to login into the database?

    Perhaps you need to repeat these operations 100's of times to get real figures.

  10. #10
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Quote Originally Posted by mciasco
    Yes, this is an equivalent post of mine in another forum... can't do this?
    Of course you can. However, good web etiquette is to link any threads you post on different forums. The reason for this is that people give up their time to help, and it is not fair if they spend time covering or considering things that have already been covered at another forum.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  11. #11
    Join Date
    Mar 2009
    Posts
    15
    Quote Originally Posted by pootle flump
    Of course you can. However, good web etiquette is to link any threads you post on different forums. The reason for this is that people give up their time to help, and it is not fair if they spend time covering or considering things that have already been covered at another forum.
    Sure, I agree, sorry.
    I'm just trying ti find a solution and any help from any forum is welcome for me.
    So, to be fair here another post on the same arguments: MySQL forum - Sql command sequence


    Quote Originally Posted by mike_bike_kite
    Your figures for creating a table in MySQL are 1 second and 7 seconds yet effectively you are doing exactly the same thing in both cases. If your figures vary so much when doing exactly the same operation then I'm guessing it's difficult to base any conclusions on them.

    Are the RDBMS's shared by other users or are they dedicated to you? I'm surprised any of the RDBMS's take any time at all to create a table - the operation should be effectively instant - are you sure about your figures? I just created 10 tables in MySQL and the total time was a fraction of a second. Do your figures also include the time to login into the database?

    Perhaps you need to repeat these operations 100's of times to get real figures.
    Sorry, yes, each operation is done separately from the others so for each one I need to connect to the db. I think that the creation of the table takes some seconds for these reason.
    However the results are an average of 3 runs of the same test. 3 times is not 100 ok, but I think the values are real.
    Moreover, the DBMSs are installed locally so there's no network delay at all.
    (I'll modify the post for better clarity)
    Last edited by mciasco; 03-11-09 at 08:02.

  12. #12
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    The creation of 10 (small) tables in MySQL was done on a shared server situated in New Zealand - this is the other side of the world to me but it still took under a second.

  13. #13
    Join Date
    Mar 2009
    Posts
    15
    Quote Originally Posted by mike_bike_kite
    The creation of 10 (small) tables in MySQL was done on a shared server situated in New Zealand - this is the other side of the world to me but it still took under a second.
    I agree... I'm surprised too, but looking at the results, all the values for table creation is relatively big. For example SQLServer tooks more than 10 seconds!!!

    However, based on the fact that this strange "delay" in table creation is present for all the 3 DBMS, I think it's a negligible problem... maybe due to slowness of my test machine. Maybe all values are bigger than normal if my machine is slow, but I think the proportions is valid, and this is the key point.

  14. #14
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by mciasco
    Maybe all values are bigger than normal if my machine is slow, but I think the proportions is valid, and this is the key point.
    If your timing figures can be out by a factor of 7 in a single statement then any differences in time you generate below a factor of 7 are simply noise. Personally I'd work on doing more scientifically correct tests that aren't influenced by any other factors before publishing your results

  15. #15
    Join Date
    Mar 2009
    Posts
    15
    Quote Originally Posted by mike_bike_kite
    If your timing figures can be out by a factor of 7 in a single statement then any differences in time you generate below a factor of 7 are simply noise. Personally I'd work on doing more scientifically correct tests that aren't influenced by any other factors before publishing your results
    Well, my results are anything but scientific! Obviously noone will choose a DBMS basing the choise on average values from a 50MB data file test!

    However, even if the test is simple, it's clear that there's no "noise factor" as you supposed. Ok, there are strange differences between the same statement, but they are more similar to a kind of "noise added". Otherwise, in the presence of "noise factor", certainly the values would be spread over a wider range, and the differences will be more significant. So I think that testing a larger bulk file this "noise added" will become negligible.

    At this point I'm not interested in producing a real scentific comparison test. I'm trying to undestand why MySQL shows bad performances during indexes creation, when they are created after bulk load, while this approach seems the correct one in theory and in practice too (based on the results for Postgres and Sqlserver)

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •