Results 1 to 4 of 4
  1. #1
    Join Date
    Jun 2005

    Unanswered: There should be a simple table partitioning solution to this problem

    Company I just joined used to work this way:

    Insert a bunch of data from Server A to Server B
    When all inserts are complete, begin an index job of data on Server B and finally transfer it to Server C

    The new way we are implementing this process is to utilize multiple servers to pull and index the data, additionally for each server we are pulling and indexing in up to 10 batches in parallel. Also the indexing does not wait for all of the data to be pulled, it can start indexing as soon as the first batch is inserted.

    Originally there was a PK on an Identity column on Server B (Server A also had a PK on an identity column but these did not sync).

    With the new method I had a clustered PK on: (groupId, recordId) - there's only about 10 groups and pulls and inserts are keyed off of this value as well as a range of the recordId. Below are the record per second statistics broken down into 5 minute intervals with this setup (no table partitions):

    Interval	pullRPS		indexRPS
    10/15/10 16:45	3,839		917 
    10/15/10 16:50	1,877		1,181 
    10/15/10 16:55	724		905 
    10/15/10 17:00	418		1,236 
    10/15/10 17:05	228		1,118
    10/15/10 17:10	33		1,679 
    10/15/10 17:15	44		1,868
    For the new method, I preserved the identity from Server A on Server B (to allow indexing to run in parallel off of the same key) so now Server B has:

    PK on its own Identity column (non-clustered)
    unique clustered index on groupID, recordId
    as can be seen below, at the end of this post are the new metrics in RPS (records per second)

    as RANGE RIGHT for values(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

    CREATE PARTITION SCHEME psGroup as partition pfGroup all
    to ([primary])
    -- create table on the partition scheme
    ON psGroup( [groupId] )
    -- clustered index is managed by each partition
    [groupId] ASC,
    [recordId] ASC
    ON psGroup( [groupId] )
    -- PK spans all partitions
    [newRecordId] ASC
    ) ON [PRIMARY]

    Interval		pullRPS		indexRPS
    2010-10-16 14:05	3,677		587
    2010-10-16 14:10	1,811		470
    2010-10-16 14:15	1,193		802
    2010-10-16 14:20	451		443
    2010-10-16 14:25	228		672
    2010-10-16 14:30	225		622
    2010-10-16 14:35	0		714
    2010-10-16 14:40	44		660
    2010-10-16 14:45	4		991
    2010-10-16 14:50	16		1,094
    2010-10-16 15:00	0		1,335
    2010-10-16 15:10	0		1,098
    Pulling is 100% done (hence the 0 RPS stats at the end
    Indexing is 32.3% done

    As you can imagine I am underwhelmed with the results. Is there a simple mistake I made in creating the clustered index / partition scheme assignments? It does look like our server is under a heavy load right now so that probably is contributing to the results, but I still expected better pull numbers w/ no loss in performance from the index jobs.
    Last edited by Gagnon; 10-16-10 at 15:41.

  2. #2
    Join Date
    Jun 2005
    The more I think about this, the more I think I just need to create separate tables for each groupId, possibly partition by hour if the data gets too large. Our test server seems to be lacking as well, a lot of PF usage, only 14GB of memory installed on it. We're running 10 jobs each pulling up to 10k records a batch, waitfor delay 1 second then repeating. We also have 10 index jobs running indexing up to 10k a batch.

  3. #3
    Join Date
    Feb 2004
    In front of the computer
    Provided Answers: 54
    Just to humor me, try using separate tables and avoid any indexing at all until the data within the table is static (all batches have been pulled). Inserting into tables without indices (or a table with a simple monotonic identity index) can actually be better than trying to fight a dozen indicies. It isn't intuitive but if you think about what happens at the level where the data is written to disk, this becomes graphically obvious. Reducing the data paging and log writing by a lot will improve performance by a lot!

    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  4. #4
    Join Date
    Jun 2005
    The only problem with that is we fall into the same trap - insert all of the data first (which could span 3 different hour tables) and THEN start to index it. When I say index it, I mean update certain columns and then move it to another server. We want to insert it as fast as we can, and "index" it (or update it) as fast as we can so that we can move it to the offline dw and eventually send the values back to the servers that are doing the initial inserts at the beginning of the cycle.

    I am pretty confident just splitting the tables by groupId should solve the problem.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts