If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > General > Database Concepts & Design > Indexes effects on insert commands

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 03-10-09, 10:46
mciasco mciasco is offline
Registered User
 
Join Date: Mar 2009
Posts: 15
Indexes effects on insert commands

Suppose I have to create a table from scratch and then populate it, reading data from a file.
I think there are only 2 options:
1) create table - create indexes - load data
2) create table - load data - create indexes

Which is the best command sequence? And why?

Is different if there is a primary key in the table? And if so, when it must be created for better performance of the loading operation?
Reply With Quote
  #2 (permalink)  
Old 03-10-09, 11:09
Pat Phelan Pat Phelan is offline
Resident Curmudgeon
 
Join Date: Feb 2004
Location: In front of the computer
Posts: 12,605
The best sequence depends on what you want. What are your criteria for "best" in this case?

-PatP
Reply With Quote
  #3 (permalink)  
Old 03-10-09, 12:50
mciasco mciasco is offline
Registered User
 
Join Date: Mar 2009
Posts: 15
Yes, the question is very general, so I expect that there's no single correct answer.

But I'm interested in undestanding the correct behaviour of the two options.

1) create table - create indexes - load data
What happens when I create indexes in an empty table? I think that the DBMS just create the root node for each B-tree index. Right?
So there are no pages allocated for data.
Then, during the data loading, for each row the DBMS must allocates page memory space, store phisically the row and update all the indexes.

2) create table - load data - create indexes
Here, loading data in an empty-not-indexed table, each row is stored sequentially, in the reading order. Again, for each row must be allocated some page space. But each insertion does not require indexes updates. But at the next step, when creating indexes on the populated table, each index is created reading the entire set of rows in sequential order, so each index must be ordered, as in the previous case.

So in both cases, there are no pages pre-allocated for data, and in both cases there's the cost of updating all the indexes for each row.
Seems there's no difference!!!
Reply With Quote
  #4 (permalink)  
Old 03-10-09, 13:21
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,524
Quote:
Originally Posted by mciasco
Seems there's no difference!!!
test it for yourself and see

__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #5 (permalink)  
Old 03-10-09, 13:28
mciasco mciasco is offline
Registered User
 
Join Date: Mar 2009
Posts: 15
Quote:
Originally Posted by r937
test it for yourself and see

well, this is what I'm going to do (I started a batch file with some test file just 5 minutes ago...)

But apart from the results, I'd like to understand the correct approach in this kind of context, even if this is just a theoric solution or a best practice. I want to understand the mechanism behind the scene!
Reply With Quote
  #6 (permalink)  
Old 03-10-09, 13:29
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
A few extra things to keep in mind:
  • Inserting the data before creating the indexes will usually be faster - certainly much faster to insert the data and usually faster overall but not having the indexes in place will mean you could insert duplicate rows etc.
  • When adding a clustered index to a table you will need around 2.5 times the space used by the table currently to reorganise it - with a very large table this can present problems if you're adding the index after inserting the data.
  • If you're setting up a feed then it's usually best to load the raw data into a transfer table that contains minimal restrictions (ie no FK and indexes) and then transfer the data to the main table using a stored proc.
  • If you're just loading small sets of data (<100k records) then things should be fast enough no matter how you approach things.
Just my 2c.
Reply With Quote
  #7 (permalink)  
Old 03-11-09, 04:38
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #8 (permalink)  
Old 03-11-09, 06:17
mciasco mciasco is offline
Registered User
 
Join Date: Mar 2009
Posts: 15
Quote:
Originally Posted by pootle flump
Yes, this is an equivalent post of mine in another forum... can't do this?

However I made some tests.

I tested 3 DBMS: MySQL, PostgreSQL and SQLServer with the same data bulk file.
1) Each DBMS is installed locally (no network delays)
2) The bulk file is 50MB and contains 120.000 rows
3) each opeation (create table, load bulk and create indexes) is done separately and require a new connection, so for each time value is included the connection time
4) the values showed are result of 3 runs of the same test, so they are average values.

Sequence: CREATE TABLE - LOAD BULK - CREATE INDEXES
MYSQL
- create table (s): 7
- load bulk (s): 26
- create indexes(s): 3667
- total (s): 3700

POSTGRES
- create table (s): 5
- load bulk (s): 50
- create indexes(s): 83
- total (s): 509

SQLSERVER
- create table (s): 12
- load bulk (s): 21
- create indexes(s): 24
- total (s): 426

----------------
Sequence: CREATE TABLE - CREATE INDEXES - LOAD BULK
MYSQL
- create table (s): 1
- load bulk (s): 0
- create indexes(s): 565
- total (s): 893

POSTGRES
- create table (s): 1
- load bulk (s): 1
- create indexes(s): 499
- total (s): 883

SQLSERVER
- create table (s): 11
- load bulk (s): 2
- create indexes(s): 100
- total (s): 462


So, it seems that creating indexes after the bulk load is better for Postgres and SQLServer (Postgres takes more advantages from this solution, while SQLServer performances are almost the same in both cases).
But this approach in MySQL is totally wrong! The time needed for create indexes after the bulk load grows enormously! For MySQL is better to create indexes first, and then load data. And the difference between the two approaches is huge!
This seems strange if compared to the behaviour of Postgres and SqlServer.

Overall, SqlServer is the fastest DBMS in both cases, while MySQL is the slowest one.

What do you think about?

Last edited by mciasco; 03-11-09 at 06:57.
Reply With Quote
  #9 (permalink)  
Old 03-11-09, 06:38
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
What do you think about?
Your figures for creating a table in MySQL are 1 second and 7 seconds yet effectively you are doing exactly the same thing in both cases. If your figures vary so much when doing exactly the same operation then I'm guessing it's difficult to base any conclusions on them.

Are the RDBMS's shared by other users or are they dedicated to you? I'm surprised any of the RDBMS's take any time at all to create a table - the operation should be effectively instant - are you sure about your figures? I just created 10 tables in MySQL and the total time was a fraction of a second. Do your figures also include the time to login into the database?

Perhaps you need to repeat these operations 100's of times to get real figures.
Reply With Quote
  #10 (permalink)  
Old 03-11-09, 06:44
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Quote:
Originally Posted by mciasco
Yes, this is an equivalent post of mine in another forum... can't do this?
Of course you can. However, good web etiquette is to link any threads you post on different forums. The reason for this is that people give up their time to help, and it is not fair if they spend time covering or considering things that have already been covered at another forum.
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #11 (permalink)  
Old 03-11-09, 06:52
mciasco mciasco is offline
Registered User
 
Join Date: Mar 2009
Posts: 15
Quote:
Originally Posted by pootle flump
Of course you can. However, good web etiquette is to link any threads you post on different forums. The reason for this is that people give up their time to help, and it is not fair if they spend time covering or considering things that have already been covered at another forum.
Sure, I agree, sorry.
I'm just trying ti find a solution and any help from any forum is welcome for me.
So, to be fair here another post on the same arguments: MySQL forum - Sql command sequence


Quote:
Originally Posted by mike_bike_kite
Your figures for creating a table in MySQL are 1 second and 7 seconds yet effectively you are doing exactly the same thing in both cases. If your figures vary so much when doing exactly the same operation then I'm guessing it's difficult to base any conclusions on them.

Are the RDBMS's shared by other users or are they dedicated to you? I'm surprised any of the RDBMS's take any time at all to create a table - the operation should be effectively instant - are you sure about your figures? I just created 10 tables in MySQL and the total time was a fraction of a second. Do your figures also include the time to login into the database?

Perhaps you need to repeat these operations 100's of times to get real figures.
Sorry, yes, each operation is done separately from the others so for each one I need to connect to the db. I think that the creation of the table takes some seconds for these reason.
However the results are an average of 3 runs of the same test. 3 times is not 100 ok, but I think the values are real.
Moreover, the DBMSs are installed locally so there's no network delay at all.
(I'll modify the post for better clarity)

Last edited by mciasco; 03-11-09 at 07:02.
Reply With Quote
  #12 (permalink)  
Old 03-11-09, 06:58
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
The creation of 10 (small) tables in MySQL was done on a shared server situated in New Zealand - this is the other side of the world to me but it still took under a second.
Reply With Quote
  #13 (permalink)  
Old 03-11-09, 07:17
mciasco mciasco is offline
Registered User
 
Join Date: Mar 2009
Posts: 15
Quote:
Originally Posted by mike_bike_kite
The creation of 10 (small) tables in MySQL was done on a shared server situated in New Zealand - this is the other side of the world to me but it still took under a second.
I agree... I'm surprised too, but looking at the results, all the values for table creation is relatively big. For example SQLServer tooks more than 10 seconds!!!

However, based on the fact that this strange "delay" in table creation is present for all the 3 DBMS, I think it's a negligible problem... maybe due to slowness of my test machine. Maybe all values are bigger than normal if my machine is slow, but I think the proportions is valid, and this is the key point.
Reply With Quote
  #14 (permalink)  
Old 03-11-09, 11:52
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by mciasco
Maybe all values are bigger than normal if my machine is slow, but I think the proportions is valid, and this is the key point.
If your timing figures can be out by a factor of 7 in a single statement then any differences in time you generate below a factor of 7 are simply noise. Personally I'd work on doing more scientifically correct tests that aren't influenced by any other factors before publishing your results
Reply With Quote
  #15 (permalink)  
Old 03-11-09, 12:46
mciasco mciasco is offline
Registered User
 
Join Date: Mar 2009
Posts: 15
Quote:
Originally Posted by mike_bike_kite
If your timing figures can be out by a factor of 7 in a single statement then any differences in time you generate below a factor of 7 are simply noise. Personally I'd work on doing more scientifically correct tests that aren't influenced by any other factors before publishing your results
Well, my results are anything but scientific! Obviously noone will choose a DBMS basing the choise on average values from a 50MB data file test!

However, even if the test is simple, it's clear that there's no "noise factor" as you supposed. Ok, there are strange differences between the same statement, but they are more similar to a kind of "noise added". Otherwise, in the presence of "noise factor", certainly the values would be spread over a wider range, and the differences will be more significant. So I think that testing a larger bulk file this "noise added" will become negligible.

At this point I'm not interested in producing a real scentific comparison test. I'm trying to undestand why MySQL shows bad performances during indexes creation, when they are created after bulk load, while this approach seems the correct one in theory and in practice too (based on the results for Postgres and Sqlserver)
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On