  1. #1
    Join Date
    Nov 2006
    Posts
    34

    Question Unanswered: MySQL performance

    Hi All,
    I have a MySQL database (version 4.0.20) with MyISAM tables, one of which has 32 million rows.
    The table with the 32 million rows is the minmax table, where we store info for all attributes. The schema is given below.

    "ARCHIVEDTIME" BIGINT,
    "RESID" integer not null,
    "DURATION" integer not null,
    "ATTRIBUTEID" integer not null,
    "MINVALUE" BIGINT,
    "MAXVALUE" BIGINT,
    "TOTAL" BIGINT,
    "TOTALCOUNT" BIGINT,
    PRIMARY KEY ("ARCHIVEDTIME","RESID","DURATION","ATTRIBUTEID")

    Now we have decided to split these 32 million rows into new tables according to the attributeid column. For that I use:

    insert into newTable select * from minmax where attributeid =708

    But this query takes too much time. For this particular attributeid (708) there are 5,002,047 rows. Also, the attributeid column in the minmax table is indexed.
    How do I improve the performance of splitting the table?

    Thanks in advance,
    Arun

  2. #2
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    You don't have an index on attributeid so it will scan the whole table (32m rows) to do each insert. The primary key won't help here because it starts with ARCHIVEDTIME etc and it doesn't know these values so it won't use this index.
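    A quick way to see this for yourself (just a sketch - the exact figures on your system will differ) is to ask MySQL for its query plan:
    Code:
    EXPLAIN SELECT * FROM minmax WHERE attributeid = 708;
    -- with no usable index you'll see something like:
    --   type: ALL    key: NULL    rows: ~32000000   (a full table scan)
    If the key column comes back NULL then no index is being used for that select.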

    Out of curiosity - how is splitting your one large table into many smaller tables going to improve your system?

    Mike

  3. #3
    Join Date
    Nov 2006
    Posts
    34
    Hi Mike,
    Thanks for your input.
    Is it OK to remove the index while inserting? Is that going to improve the performance?

    We have all the attribute information in this minmax table, so we thought of splitting the data for each attribute into different tables. The 30 million rows would then split into roughly 30 small tables of about 1 million rows each, so I think it will improve the performance when I do a select *. Am I right?

    Thanx,
    Arun

  4. #4
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    You want the new index on the original table so your select can quickly find the data with the correct attributeid.

    But adding the index to the new, smaller table after doing the insert is probably faster - though if you're only doing this once to set up the table, does it really matter?
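    For what it's worth, a rough sketch of the whole exercise (the new table name is made up and I'm assuming the schema from your first post):
    Code:
    -- index the original table so the SELECT part no longer scans 32m rows
    ALTER TABLE minmax ADD INDEX idx_attributeid (ATTRIBUTEID);

    -- create the new table without its key, load it, then add the key
    CREATE TABLE minmax_708 (
        ARCHIVEDTIME BIGINT,
        RESID        INTEGER NOT NULL,
        DURATION     INTEGER NOT NULL,
        ATTRIBUTEID  INTEGER NOT NULL,
        MINVALUE     BIGINT,
        MAXVALUE     BIGINT,
        TOTAL        BIGINT,
        TOTALCOUNT   BIGINT
    );

    INSERT INTO minmax_708 SELECT * FROM minmax WHERE ATTRIBUTEID = 708;

    ALTER TABLE minmax_708
        ADD PRIMARY KEY (ARCHIVEDTIME, RESID, DURATION, ATTRIBUTEID);
    Adding the primary key in one ALTER at the end lets MyISAM build the index in a single pass instead of updating it for every inserted row.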

  5. #5
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    This begs the question - why are you performing SELECT *'s?
    George
    Home | Blog

  6. #6
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Originally posted by georgev
    why are you performing SELECT *'s
    I guess it's quicker than entering the following:
    Code:
    insert into newTable values ( row 1 );
    insert into newTable values ( row 2 );
    ...
    insert into newTable values ( row 32,000,000 );
    If the issue is the * part then, as he's only creating these tables on a one-off basis, I guess it doesn't really matter that he's not specifying the fields individually.

    I think the biggest question is why you'd want to split the table into smaller tables in the first place. Is it because accessing the original table is very slow? If so, can you describe how the table is usually accessed, so we could simply try adding alternative indexes? If the issue is actually the time it takes to insert a row, due to reordering large indexes, then maybe there is a case for splitting the table.

    32m rows is a lot of rows - can you say what it is you're actually storing?

    Mike

  7. #7
    Join Date
    Nov 2006
    Posts
    34
    Hi Mike,
    Thanks for the reply.

    Actually, we store info for all attributes, like CPU utilization and memory utilization of servers, in this minmax table, so we both insert into and select from it. The table will grow dynamically, so in order to improve performance when retrieving the data we thought of splitting this large table into smaller ones; we would then insert into and select from the smaller tables as well.
    Hope this clarifies your doubt. I'm still looking for an answer on improving the performance of migrating the data from the large table to the smaller ones.

    Thanks,
    Arun

  8. #8
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quantity of data: I still think 32m rows is a lot of low-level data for what you're trying to monitor, and this is probably leading to all your current issues. Could you make the way the data is stored more flexible? Would it be better if you just stored the time a reading was taken, with the stored min/max values then applying between that time and the time the previous reading was taken?

    Aggregation: I'll assume the readings are taken every 5 minutes - you could then have a background process that takes data over a week old and aggregates up all the 5 minute readings for a server and stores them as just one reading for that server for that day. Similarly, after a month, you could aggregate up all these day readings into a single value for each server for the month. These values would be stored in the same table as before but would replace all the 5 minute readings with a reading that applies to a longer time period.
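    As a sketch of that sort of roll-up (I'm assuming ARCHIVEDTIME is a Unix timestamp in seconds and that 5 minute rows have DURATION = 300 - neither is stated in the thread, so treat the literals as placeholders):
    Code:
    -- build daily figures for everything over a week old in a scratch table
    CREATE TABLE minmax_rollup
    SELECT FLOOR(ARCHIVEDTIME / 86400) * 86400 AS ARCHIVEDTIME,
           RESID,
           86400 AS DURATION,
           ATTRIBUTEID,
           MIN(MINVALUE)   AS MINVALUE,
           MAX(MAXVALUE)   AS MAXVALUE,
           SUM(TOTAL)      AS TOTAL,
           SUM(TOTALCOUNT) AS TOTALCOUNT
    FROM   minmax
    WHERE  DURATION = 300
      AND  ARCHIVEDTIME < UNIX_TIMESTAMP() - 7 * 86400
    GROUP  BY 1, 2, 4;

    -- swap the detailed rows for the daily ones
    DELETE FROM minmax
    WHERE  DURATION = 300
      AND  ARCHIVEDTIME < UNIX_TIMESTAMP() - 7 * 86400;

    INSERT INTO minmax SELECT * FROM minmax_rollup;
    DROP TABLE minmax_rollup;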

    You could now allow for a longer history of data yet still get detailed breakdowns of data for readings taken over the past week. The whole system would run faster as well. Obviously these week and month boundaries would be set by you.

    Smaller points: do you really need to store TOTAL and TOTALCOUNT - couldn't these be calculated from the data? Storing less data might also allow you to have more meaningful data in the table, e.g. a varchar attribute_type that might store CPU, MEMORY etc, or a RESID storing the IP address of the server. Is the DURATION field necessary?

    Separate tables: Moving your data into lots of separate tables is going to require a rewrite of all your code but not really provide a big improvement in the system. I suspect you'll end up with a more complex system that has the same issues as before.

    Mike

  9. #9
    Join Date
    Nov 2006
    Posts
    34
    Originally posted by georgev
    This begs the question - why are you performing SELECT *'s?
    Mike answered it right: as this is a one-time activity, we are using the select * query.

    Originally posted by mike_bike_kite
    I think the biggest question is why you'd want to split the table into smaller tables in the first place. Is it because accessing the original table is very slow?
    The objective of splitting the data is to scale further by data segmentation - here, segmenting by attributes. Currently we store over 100 attributes in one table, which allows us to store data for only 3 months. Instead we have decided to move groups of attributes into dedicated tables, each with a similar schema of its own. We found this to require the least code change, as we also maintain a meta table to map attribute ids to table names.
    Quantity of data: already done.
    Aggregation: already done. In this minmax table we store only archived data, i.e. one-hour aggregates and one-day aggregates; the DURATION column is used to differentiate them. We are currently storing one-hour aggregates for 90 days and one-day aggregates for 365 days.
    Smaller points: as mentioned above, the DURATION column is used to differentiate the one-hour and one-day values.
    Separate tables: we have done this in a generic manner and have a metadata table (sketched below), so we identified that the current approach has minimal code change for us.
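    Roughly, the idea of the meta table is something like this (the names here are only illustrative):
    Code:
    CREATE TABLE attribute_table_map (
        ATTRIBUTEID INTEGER NOT NULL PRIMARY KEY,   -- e.g. 708
        TABLENAME   VARCHAR(64) NOT NULL            -- e.g. 'minmax_708'
    );
    -- the application looks the table name up by attribute id
    -- and points its inserts and selects at that table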

    Originally posted by mike_bike_kite
    You don't have an index on attributeid so it will scan the whole table (32m rows) to do each insert. The primary key won't help here because it starts with ARCHIVEDTIME etc and it doesn't know these values so it won't use this index.
    I need clarification on this comment. There is already a composite primary key defined with four columns - will it not automatically be indexed?

    Mike, your comments have been great and very helpful!

    Thanks,
    Arun

  11. #11
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    Quote Originally Posted by mike_bike_kite
    I guess it's quicker than entering the following
    Code:
    insert into newTable values ( row 1 );
    insert into newTable values ( row 2 );
    ...
    insert into newTable values ( row 32,000,000 );
    Actually, what I meant was
    Code:
    INSERT INTO newTable
    SELECT col1, col2, col3
    FROM theOtherTable
    I assumed that you weren't copying the whole schema into another table, but if you are, it begs the question of "why?"
    EDIT: On catching up with the rest of the thread I understand - disregard
    Last edited by gvee; 08-10-07 at 05:16.
    George
    Home | Blog

  12. #12
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    Actually, the real question is...
    If this is a one off process, why are you so bothered about performance?
    George
    Home | Blog

  13. #13
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Originally posted by georgev
    Actually, what I meant was ...
    Sorry George - I was just having a bit of fun

    Originally posted by adorearun
    Aggregation: already done. In this minmax table we store only archived data, i.e. one-hour aggregates and one-day aggregates; the DURATION column is used to differentiate them. We are currently storing one-hour aggregates for 90 days and one-day aggregates for 365 days.
    Aggregates: I still think 32m rows is a lot for this - do you really need to know the figures for 10am on the 3rd of June, or wouldn't a daily aggregate be sufficient? Why not keep hourly data for the week, then aggregate up to daily for the next 90 days and then weekly for the rest of the year - this would get rid of all your problems in a flash.

    The DURATION field is unusual - normally such data would be stored using from_date and to_date fields. That's usually easier to code and uses indexes better.
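    For example, a layout along those lines might look like this (table name, column names and key order are only illustrative):
    Code:
    CREATE TABLE minmax_periods (
        FROM_DATE   DATETIME NOT NULL,   -- start of the period this row covers
        TO_DATE     DATETIME NOT NULL,   -- end of the period
        RESID       INTEGER  NOT NULL,
        ATTRIBUTEID INTEGER  NOT NULL,
        MINVALUE    BIGINT,
        MAXVALUE    BIGINT,
        TOTAL       BIGINT,
        TOTALCOUNT  BIGINT,
        PRIMARY KEY (RESID, ATTRIBUTEID, FROM_DATE)
    );
    A query for one server's readings over a date range can then use the leading columns of that key directly.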

    You don't have an index on attributeid so it will scan the whole table (32m rows) to do each insert ...

    insert into newTable select * from minmax where attributeid =708

    PRIMARY KEY ("ARCHIVEDTIME","RESID","DURATION","ATTRIBUTEID")
    Indexes: Each time you run this query it will try to find an index to help select the data where ATTRIBUTEID is 708. It cannot use your primary key because it would need to know the "ARCHIVEDTIME","RESID","DURATION" before it could use the ATTRIBUTEID field. Of course, if it suspects it will pull back more than 30% of the data, then it will probably do a table scan anyway - but I don't think this is the case.

    Suggestions: If you set up an index just on ATTRIBUTEID then things will be quicker. If you do this along with aggregating more of the data then you'll find things will be a lot quicker still. You shouldn't have to alter your code at all to implement any of this. If it's still not quick enough then you could look at splitting your tables up into sub-tables, but I personally don't think that will help much.

    Mike

  14. #14
    Join Date
    Mar 2007
    Location
    636f6d7075746572
    Posts
    770
    1) Do NOT chunk your table into multiple tables!
    2) Consider further normalisation, because it sounds like you might be storing a lot more fields in there than you are suggesting in your original post.
    3) Specify an index on ATTRIBUTEID, as Mike suggested.

    I'm curious to dissect this key: PRIMARY KEY ("ARCHIVEDTIME","RESID","DURATION","ATTRIBUTEID").
    Assuming each attributeid can have multiple archived times (which I'm going to assume it does), each attributeid is going to have many entries in the index file. So if there are 1000 archived times for each attributeid, then an index lookup on ATTRIBUTEID is going to look up those thousand?
    I was under the impression that a partial component of a key can be looked up via the index (from the where clause), although Mike seems to say this is not the case.

    Either way, when it comes down to it you're STILL doing an INSERT ... SELECT * FROM the table, so you are inserting 30 million records, which, let's face it, is going to take some time...

  15. #15
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Originally posted by aschk
    1) Do NOT chunk your table into multiple tables!
    Totally agree - you won't get any real benefits, but you'll have a far more complex system to look after as you're now reporting on 30 tables rather than one, etc.

    Originally posted by aschk
    I was under the impression that a partial component of a key can be looked up via the index (from the where clause), although Mike seems to say this is not the case
    If you have a compound index on field1, field2 and field3, then the index will be partially used if you know values for field1, and it will be used better if field2 is known as well. The index will only be fully utilised when all the columns of the compound key are known.

    It's a bit like having a dictionary. If you know that the word you're looking for starts with Q then you can jump to the right place in the dictionary immediately and start searching from there. If however you only know that the second letter is a U then you have to scan the whole dictionary looking at the second letter of every word.

    This is true for all the database systems I've worked on and I'm pretty sure MySQL doesn't do anything different.
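    To put that in terms of the key in this thread (the literal values are made up):
    Code:
    -- can use the compound primary key: the left-most column is constrained
    SELECT * FROM minmax
    WHERE  ARCHIVEDTIME = 1186900000 AND RESID = 42;

    -- cannot use it: ATTRIBUTEID is the last column of the key, so without
    -- the leading columns MySQL falls back to a table scan (unless a
    -- separate index on ATTRIBUTEID exists)
    SELECT * FROM minmax WHERE ATTRIBUTEID = 708;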

    Mike
