  1. #1
    Join Date
    Jun 2005
    Posts
    319

    Unanswered: is Table partitioning the answer?

    We have at least one table that is 1.5 TB in size. There's a job running constantly that deletes any data that is > 30 days old (at least I think it deletes it; it might be deleting + archiving). Additionally we have reports (SELECT w/ NOLOCK) running off of this table, and we have import jobs inserting new records into it. I think I need to know more about the types of queries the reports run, but the current problem is that inserts are somewhat slow: 25,000 records take ~1 minute. The clustered index is a simple identity, so new rows are added at the end of the table without any reshuffling of the clustered index. There are other indexes on the table to help out the reports, which I am sure are slowing down the insert operations as well.

    What I am thinking of proposing is partitioning the table into 30 (or 31) partitions, one per day, so that the average partition size is more manageable, say ~50GB. If the indexes are partitioned as well, this should help insert performance since we are in essence dealing with a 50GB table instead of a 1500GB table. A partition-aligned indexed view would return comparable SELECT performance, correct? My only concern is whether creating a partition-aligned indexed view would slow down inserts (i.e. would the insert commands need to wait for a recompile or rebuild of the view before the next insert could be processed)?
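    Roughly what I have in mind, as a sketch (the table, column, and boundary names here are made-up examples, not our real schema):

        -- one boundary value per day; RANGE RIGHT puts each date at the start of its partition
        CREATE PARTITION FUNCTION pf_DailyRange (datetime)
            AS RANGE RIGHT FOR VALUES ('2010-09-01', '2010-09-02', '2010-09-03' /* ...one per day... */);

        -- everything on PRIMARY for now, since that is our only filegroup
        CREATE PARTITION SCHEME ps_DailyRange
            AS PARTITION pf_DailyRange ALL TO ([PRIMARY]);

        -- table and index created on the scheme so they are partition-aligned
        CREATE TABLE dbo.BigTable
        (
            ID       bigint IDENTITY(1,1) NOT NULL,
            LoadDate datetime NOT NULL,
            Payload  varchar(100) NULL,
            CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (ID, LoadDate)
        ) ON ps_DailyRange (LoadDate);

        CREATE NONCLUSTERED INDEX IX_BigTable_LoadDate
            ON dbo.BigTable (LoadDate)
            ON ps_DailyRange (LoadDate);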

  2. #2
    Join Date
    Jun 2005
    Posts
    319
    Just a quick addendum: I have read up more on table partitioning. It sounds like we could insert the data into an empty table of the same structure and then switch that table into a partition. My question is, if a table is set up for, let's say, 30 daily partitions and we want to insert every hour, can the table be partitioned into 30 days plus 24 hours, and then once those 24 hourly partitions are filled can we merge them into a single daily partition?

    Or would it make sense to have two tables: one spanning 30 days and partitioned by day, and a second partitioned into 24 hourly partitions, with a view covering the data in both of them? As hourly data comes in, we would use the switch method mentioned above to load it into the "24 hour" table, and once all 24 hours have been imported we would merge the 24 partitions into one and switch that in as a new daily partition of the "30 day" table. If this is the correct approach, could a single partition-aligned indexed view cover both tables? I am guessing no, since the partitions are not the same "size" (one uses days, the other hours). Perhaps the answer is to have two separate partition-aligned indexed views, run our SELECT report-type queries against each, and then just union the results.
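    In rough T-SQL terms, this is the kind of switch/merge I am picturing (all of the names and the partition number are invented for illustration):

        -- load the new hour into an empty staging table that has the same structure,
        -- indexes, and a check constraint matching the target partition, then switch it in
        ALTER TABLE dbo.Stage_CurrentHour
            SWITCH TO dbo.HourlyTable PARTITION 5;           -- whatever partition covers that hour

        -- once a full day is in, collapse the hourly boundaries into a single daily partition
        ALTER PARTITION FUNCTION pf_HourlyRange ()
            MERGE RANGE ('2010-09-30 01:00');                -- repeat for each hourly boundary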
    Last edited by Gagnon; 10-01-10 at 09:49.

  3. #3
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    I'd go with 32 tables: TableA, TableB, Table01, Table02, Table03... TableA and TableB would both be for today; the rest would be "history". Each table would have appropriate date-based constraints for a partitioned view (note that A and B would be identical).

    Once per hour, I'd run a job. The job would truncate the "passive" one of A and B, then load the current hour of new data plus all of the data from the "active" one into it. The job would then recreate the partitioned view from the component tables. This would keep all of the "writing" out of the view being queried, and would mean that the "30 day old" data could just be truncated from the oldest table once it went "out of scope".

    You might seriously consider the idea of having multiple views, one for each hour of the day. This would allow the views to be set up once per day, then used as appropriate throughout the day.
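    Something along these lines, just as a sketch of the hourly job (all of the object names are only placeholders):

        -- this hour TableB is the "passive" table and TableA is "active"; next hour they trade places
        -- (the component tables are plain copies with no IDENTITY property, so IDs carry over as-is)
        TRUNCATE TABLE dbo.TableB;

        INSERT INTO dbo.TableB (ID, LoadDate, Payload)
        SELECT ID, LoadDate, Payload FROM dbo.NewRows        -- the current hour of new data
        UNION ALL
        SELECT ID, LoadDate, Payload FROM dbo.TableA;        -- everything already in the active table

        -- rebuild the partitioned view so readers now see TableB plus the history tables
        EXEC ('ALTER VIEW dbo.CurrentData AS
               SELECT * FROM dbo.TableB
               UNION ALL SELECT * FROM dbo.Table01
               UNION ALL SELECT * FROM dbo.Table02');        -- ...and so on through the history tables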

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  4. #4
    Join Date
    Jun 2005
    Posts
    319
    Hey Pat -

    I appreciate the assistance you have given me the last few days; you have been a big help. I did a POC on a table with 776 million rows, a 300GB table! It's running over the weekend, so I will be able to see how much the inserts slow down once the table is filled. For the test I broke it up into 60 partitions, one per minute, and did inserts in 100k chunks. Early results were great: about a 100% improvement in load speed compared to what they have going in production. I have a few follow-up questions if you have the time (or anyone else for that matter).

    Q1) Our offline DW only has 1 logical disk, but it's big, something like 50-60 terabytes (I forget the exact size). How much of an improvement would we see if we were to break that up into, say, 5 or 10 logical disks so that the filegroups could be spread out? I am guessing that in an ideal world, for a table with 60 partitions, 60 disks each on its own disk controller would give the best performance, right? I am not going to suggest that, but I would like to get the best bang for the buck here. I am not well versed when it comes to hardware, but I am quickly learning. Just to clarify, there is currently only 1 filegroup, PRIMARY, which the partitions are being created on.
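    For my own reference, here is roughly how I think the filegroups would come into play if the storage does get split up (the database, filegroup, and path names are invented):

        -- one filegroup (with one file) per logical disk
        ALTER DATABASE OfflineDW ADD FILEGROUP FG1;
        ALTER DATABASE OfflineDW ADD FILE
            (NAME = 'OfflineDW_FG1', FILENAME = 'E:\Data\OfflineDW_FG1.ndf', SIZE = 100GB)
            TO FILEGROUP FG1;
        -- ...repeat for FG2 through FG5 on the other disks...

        -- then spread the partitions round-robin across the filegroups instead of ALL TO ([PRIMARY])
        CREATE PARTITION SCHEME ps_DailySpread
            AS PARTITION pf_DailyRange
            TO (FG1, FG2, FG3, FG4, FG5,
                FG1, FG2, FG3, FG4, FG5 /* ...one entry per partition... */);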

    Q2) Is there a way to figure out the optimal batch size for inserts? In production we are doing 25k; I did 100k just to see what kind of performance I would get. The destination table has indexes on it, so that is what is limiting performance. We need the indexes in order to provide "real time" reporting on our data.

    Q3) There is not a lot of differentiation between minutes within a 100k batch, so only 2-3 partitions are being filled at the same time. Should I choose a different (possibly arbitrary) key, like seconds, to utilize more partitions for the insert?

    Q4) Probably the most important question: since I am partitioning on minute, is that going to hurt report performance? The PK is made up of an identity and the minute column (ID, minute) - should that order be reversed? I was under the impression the partitioning column has to be part of the PK. I partitioned the indexes using my partition scheme, which uses my partition function broken down by minute. I have not done any testing from the reporting side, but I am wondering whether queries need to filter on this minute column in order for the indexes to be useful. The minute column is not part of the index key, but like I said it is used to partition the index.
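    For context, this is the sort of report filter I am wondering about (PocTable and LoadMinute are stand-ins for our real table and minute column):

        -- does partition elimination only kick in when the report filters on the
        -- partitioning column, like this?
        SELECT COUNT(*)
        FROM dbo.PocTable WITH (NOLOCK)
        WHERE LoadMinute >= '2010-10-02 13:00'
          AND LoadMinute <  '2010-10-02 13:05';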

    I am still doing as much research on partitioning as I can at this point. I am not really finding a lot of good documentation out there; the MSDN/BOL documentation seems to just cover the basics.
    Last edited by Gagnon; 10-02-10 at 13:35.

  5. #5
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Partitioning is still more of an art than a science, because it is a tool, not a solution. As a tool, it can be used in many different ways to produce many different results. Depending on which of those results you want, you may have to use the tool in a specific way to get them.

    Due to the complexity of SAN/NAS storage and the myriad ways to connect the pieces, there isn't a "one size fits all" way to arrange data. You'll need to work with your storage administrator to figure out what your choices are, which of those choices might improve performance, and which might be cost-effective. There are lots of people who know exactly how to arrange your storage... Unless they work for you, I'd bet that every one of them is wrong, and even if they do work for you I'd reserve judgement!

    The optimal size for batches is the one that works best. Finally, an easy question!

    Indexing usually helps data retrieval and hurts insert/update/delete. There are exceptions, but the general rule still holds. I would create two duplicate copies of a moderately complex production table. Time the results as you process one copy "as is". Then time the results as you drop the indexes, process the other copy, and rebuild the indexes. Compare and report those times.
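    As a rough sketch of that second timing run (using disable/rebuild rather than a literal drop and recreate, same idea; the table and index names are placeholders):

        -- disable the nonclustered indexes, load the batch, then rebuild and time the whole thing
        ALTER INDEX IX_BigTableCopy_LoadDate ON dbo.BigTableCopy DISABLE;

        INSERT INTO dbo.BigTableCopy (LoadDate, Payload)
        SELECT LoadDate, Payload FROM dbo.StagingBatch;

        ALTER INDEX ALL ON dbo.BigTableCopy REBUILD;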

    The partitioning scheme that I recommended was meant to isolate changes outside of the queried data, the idea being that the changes were probably playing havoc with performance.

    The partitioning scheme that you've described is 180 degrees different: it distributes the changes relatively evenly across the entire platform. This makes a lot of operations less intrusive, but at the cost of making the impact felt everywhere.

    This is a great example of two polar-opposite ways to use partitioning, both of which can be effective. Only hands-on testing with your system and data can show you which of the two choices performs better for you.

    The partitioning column does not need to be part of the primary key. Based on the use you've described, I would not make the minute column part of the primary key. I would, however, materialize the column, index it (to give partitioning a helping hand; there is no practical advantage within a single table for what you've described), and put the constraint on it.
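    As a sketch of what I mean, assuming the minute can be derived from an existing datetime column (all of the names here are placeholders):

        -- materialize the minute as a persisted computed column, index it,
        -- and put a check constraint on it instead of widening the primary key
        ALTER TABLE dbo.PocTable
            ADD LoadMinute AS DATEADD(minute, DATEDIFF(minute, 0, LoadDate), 0) PERSISTED;

        ALTER TABLE dbo.PocTable
            ADD CONSTRAINT CK_PocTable_LoadMinute
            CHECK (LoadMinute >= '2010-10-01' AND LoadMinute < '2010-11-01');

        CREATE NONCLUSTERED INDEX IX_PocTable_LoadMinute
            ON dbo.PocTable (LoadMinute);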

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  6. #6
    Join Date
    Jun 2005
    Posts
    319
    I kind of shifted gears and did not stay true to the original thread. For the POC I was just loading up a large table and disregarded the archive portion, as you have noted. We're in a transition between two providers and need to solve 2 different problems (1 specific to each provider). The first problem was originally described in this thread; the second problem is moving these large tables from production to our offline DW while still allowing reports to run against the data as we load the table.

    As for disabling the indexes, loading the data, and then rebuilding the indexes: that is the first thing I suggested, but it was shot down for 2 reasons: 1) "real time" reporting is a requirement, and 2) an index rebuild takes a long time on these large tables.

    Thanks again. I think my next step is learning more about our storage and what our options are with logical disks / filegroups / disk controllers.

  7. #7
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Just an observation, but my idea in post #3 used two "current" tables, one being queried and one being built. The partitioned view only used one of the two tables, so that was all the queries ever saw. The table in "build mode" could have indexes or not, as convenient, so I could have my cake and eat it too!

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.
