Page 1 of 2 12 LastLast
Results 1 to 15 of 20
  1. #1
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912

    Unanswered: The price of a theta join

    Hello chaps

    I have a theta join on two tables, a tiddler and a bigun. The tiddler is there to recode the values of a field in my main table. It consists of mutually exclusive ranges corrollated to a new code. I want to keep the data in a table for flexibility however I would like the performance of a case statement. I suspect that SQL Server would be a lot better at this if it knew that the ranges are mutually exclusive. Although I know how to set up a check constraint to enforce this rule SQL Server still will not be able to use this information. The optimiser estimates three rows returned for the clustered index scan - actual rows is 699609 - which equals (number of rows in little table) * (number of rows in big table).
    BTW - I've been chucking unique constraints at the tiddly table just to see if I can affect execution - I can't.

    Question:
    Can I make this more efficient while maintining flexibility or is this just the price of a theta join? I would prefer to avoid dynamic sql please (in reality the full query is pretty complex).

    Ta!

    Code:
    USE tempdb
    go
    
    SET NOCOUNT ON 
    
    ------------------------------    SET STUFF UP    ---------------------------------
    IF NOT EXISTS (SELECT NULL FROM sys.schemas WHERE name = N'welovepoots') BEGIN
        EXEC ('CREATE SCHEMA welovepoots')
    END
    GO
    
    IF EXISTS (SELECT NULL FROM sys.tables WHERE name = N'little_table' AND SCHEMA_NAME(schema_id) = 'welovepoots') BEGIN
        DROP TABLE welovepoots.little_table
    END    
    
    IF EXISTS (SELECT NULL FROM sys.tables WHERE name = N'big_table' AND SCHEMA_NAME(schema_id) = 'welovepoots') BEGIN
        DROP TABLE welovepoots.big_table
    END
    
    CREATE TABLE welovepoots.little_table
        (
            recode_this_lower        DECIMAL(9, 3)        NOT NULL
            , recode_this_upper        DECIMAL(9, 3)        NOT NULL
            , recode_this_recode    TINYINT                NOT NULL
            , CONSTRAINT pk_little_table        PRIMARY KEY NONCLUSTERED (recode_this_lower, recode_this_upper) WITH (FILLFACTOR = 100)
            , CONSTRAINT ix_little_table_u_nc    UNIQUE NONCLUSTERED (recode_this_recode) WITH (FILLFACTOR = 100)
            , CONSTRAINT ix_little_table_u_c    UNIQUE CLUSTERED    (recode_this_lower, recode_this_upper, recode_this_recode) WITH (FILLFACTOR = 100)
        ) 
    GO
    
    INSERT INTO welovepoots.little_table (recode_this_lower, recode_this_upper, recode_this_recode)
    SELECT    30, 999999.999, 0
    UNION ALL
    SELECT    16, 29.999, 2
    UNION ALL
    SELECT    1, 15.999, 1
    
    CREATE TABLE welovepoots.big_table
        (
            my_id            INT            NOT NULL            
            , recode_this        DECIMAL(9, 3)
            , CONSTRAINT pk_big_table    PRIMARY KEY CLUSTERED (my_id) WITH (FILLFACTOR = 100)
        )
    GO
    
    INSERT INTO welovepoots.big_table (my_id, recode_this)
    SELECT    a.number * b.number
            , CASE WHEN MIN(a.number % 50.5) = 40.5 THEN NULL ELSE MIN(a.number % 50.5) END
    FROM    dbo.numbers AS a
    CROSS JOIN
            dbo.numbers AS b
    WHERE    a.number <= 650
            AND b.number BETWEEN 651 AND 1300
    GROUP BY a.number * b.number
    ------------------------------    END SET STUFF UP    ---------------------------------
    SET NOCOUNT OFF
    
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    SET STATISTICS PROFILE ON
    
    --Inefficient but flexible
        SELECT    recode_this
                , recode_this_recode
        FROM    welovepoots.big_table AS bt
        LEFT OUTER JOIN 
                welovepoots.little_table AS lt
        ON    bt.recode_this BETWEEN lt.recode_this_lower AND lt.recode_this_upper
    
    --Efficient but inflexible
        SELECT    recode_this
                , recode_this_recode    =    CASE 
                                                WHEN recode_this BETWEEN 30 AND 999999.999    THEN 0
                                                WHEN recode_this BETWEEN 16 AND 29.999        THEN 2
                                                WHEN recode_this BETWEEN 1  AND 15.999        THEN 1    
                                            END
        FROM    welovepoots.big_table AS bt
    
    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF
    SET STATISTICS PROFILE OFF
    
    SET NOCOUNT ON
    ------------------------------    CLEAN UP    ---------------------------------
    IF NOT EXISTS (SELECT NULL FROM sys.schemas WHERE name = N'welovepoots') BEGIN
        EXEC ('CREATE SCHEMA welovepoots')
    END
    GO
    
    IF EXISTS (SELECT NULL FROM sys.tables WHERE name = N'little_table' AND SCHEMA_NAME(schema_id) = 'welovepoots') BEGIN
        DROP TABLE welovepoots.little_table
    END    
    
    IF EXISTS (SELECT NULL FROM sys.tables WHERE name = N'big_table' AND SCHEMA_NAME(schema_id) = 'welovepoots') BEGIN
        DROP TABLE welovepoots.big_table
    END
    ------------------------------    END CLEAN UP    ---------------------------------
    SET NOCOUNT OFF

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    I don't have time to cook this script, but at first glance I'd change the cross join into a left join based on the theta criteria. If you don't respond, I'll try to cook that up tonight when I get home (about twelve hours or so).

    -PatP

  3. #3
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Ta Pat

    The cross join is just to populate a demo table with a reasonable number of rows from a tally table. The meaty stuff is between SET STATISTICS IO ON and SET STATISTICS PROFILE OFF

  4. #4
    Join Date
    Aug 2004
    Location
    Dallas, Texas
    Posts
    831
    Just curious to know if there isn't a missing table in this code frag:

    --Efficient but inflexible
    SELECT recode_this
    , recode_this_recode = CASE
    WHEN recode_this BETWEEN 30 AND 999999.999 THEN 0
    WHEN recode_this BETWEEN 16 AND 29.999 THEN 2
    WHEN recode_this BETWEEN 1 AND 15.999 THEN 1
    END
    FROM welovepoots.big_table AS bt

  5. #5
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Quote Originally Posted by corncrowe
    Just curious to know if there isn't a missing table in this code frag:
    Nope - the idea is I can either pop the ranges in a table and look up the corresponding recode based on the range (inefficient but flexible in so far as changes can be made easily) or I can hard code these ranges in code (dispensing with the smaller table) which is very efficient but requires a code change if ever the ranges or recodes change.

    Make sense? Try running the code to see what I mean (you'll need a tally table - let me know if you want me to post a link).

    I've been thinking about it and I suspect I am stuffed. A theta join is an inequality join so I bet there are no optimisations. Happy to be proved wrong though I mean there is a huge descrepancy between the estimated number of rows and actual rows retrieved from cache for the tiddly table if you check the plans.

  6. #6
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Code:
    SELECT adj.descr, d0.n + d1.n + d2.n + d3.n + d4.n
       FROM (SELECT       0 AS n UNION SELECT 1 UNION SELECT 2 UNION SELECT 3) AS d0
       CROSS JOIN (SELECT 0 AS n UNION SELECT   4 UNION SELECT   8 UNION SELECT  12) AS d1
       CROSS JOIN (SELECT 0 AS n UNION SELECT  16 UNION SELECT  32 UNION SELECT  48) AS d2
       CROSS JOIN (SELECT 0 AS n UNION SELECT  64 UNION SELECT 128 UNION SELECT 192) AS d3
       CROSS JOIN (SELECT 0 AS n UNION SELECT 256 UNION SELECT 512 UNION SELECT 768) AS d4
       LEFT OUTER JOIN (SELECT 1 AS lv, 23 AS hv, 'Small' AS descr
       UNION SELECT 26, 40, 'medium'
       UNION SELECT 50, 150, 'large'
       UNION SELECT 160, 99999, 'humungous') AS adj
          ON (d0.n + d1.n + d2.n + d3.n + d4.n BETWEEN adj.lv and adj.hv)
       ORDER BY d0.n + d1.n + d2.n + d3.n + d4.n
    -PatP

  7. #7
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Thanks Pat. Let me know where I have missed something here - the syntax is the same as my inefficient query except you have produced two derived tables in memory via code. If I were to replace your two derived tables with permanent tables then I am back to square one. Is this correct?

  8. #8
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by pootle flump
    Is this correct?
    It is exactly correct.

    This code does exactly what I thought you requested in your first post. The timings are identical for this code versus a CASE statement on my machine. Since performance is what you were originally objecting to, I assumed that this was what you desired (a join that behaves as quickly as a CASE statement).

    Coach me, what have I missed?

    -PatP

  9. #9
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    The key was the tradeoff between flexibility (data in table - does not require code changes if the tolerences change) versus the efficiency of hard coded values (requires a developer to respond to any changes with a change to the code).

    I'm hoping for the best of both worlds though flexibility is the non-negotiable bit I suppose. Sorry if this was not clear.

  10. #10
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Code:
    DECLARE
       @d1			DATETIME
    ,  @d2			DATETIME
    ,  @d3			DATETIME
    
    SET @d1 = GetDate()
    
    SELECT adj.descr, d0.n + d1.n + d2.n + d3.n + d4.n
       FROM (SELECT       0 AS n UNION SELECT 1 UNION SELECT 2 UNION SELECT 3) AS d0
       CROSS JOIN (SELECT 0 AS n UNION SELECT   4 UNION SELECT   8 UNION SELECT  12) AS d1
       CROSS JOIN (SELECT 0 AS n UNION SELECT  16 UNION SELECT  32 UNION SELECT  48) AS d2
       CROSS JOIN (SELECT 0 AS n UNION SELECT  64 UNION SELECT 128 UNION SELECT 192) AS d3
       CROSS JOIN (SELECT 0 AS n UNION SELECT 256 UNION SELECT 512 UNION SELECT 768) AS d4
       LEFT OUTER JOIN (SELECT 1 AS lv, 23 AS hv, 'Small' AS descr
       UNION SELECT 26, 40, 'medium'
       UNION SELECT 50, 150, 'large'
       UNION SELECT 160, 99999, 'humungous') AS adj
          ON (d0.n + d1.n + d2.n + d3.n + d4.n BETWEEN adj.lv and adj.hv)
       ORDER BY d0.n + d1.n + d2.n + d3.n + d4.n
    
    SET @d2 = GetDate()
    
    SELECT
        CASE
          WHEN d0.n + d1.n + d2.n + d3.n + d4.n BETWEEN   1 AND    23 THEN 'Small'
          WHEN d0.n + d1.n + d2.n + d3.n + d4.n BETWEEN  26 AND    40 THEN 'medium'
          WHEN d0.n + d1.n + d2.n + d3.n + d4.n BETWEEN  50 AND   150 THEN 'large'
          WHEN d0.n + d1.n + d2.n + d3.n + d4.n BETWEEN 160 AND 99999 THEN 'humungous'
          ELSE NULL
       END AS descr
    , d0.n + d1.n + d2.n + d3.n + d4.n
       FROM (SELECT       0 AS n UNION SELECT 1 UNION SELECT 2 UNION SELECT 3) AS d0
       CROSS JOIN (SELECT 0 AS n UNION SELECT   4 UNION SELECT   8 UNION SELECT  12) AS d1
       CROSS JOIN (SELECT 0 AS n UNION SELECT  16 UNION SELECT  32 UNION SELECT  48) AS d2
       CROSS JOIN (SELECT 0 AS n UNION SELECT  64 UNION SELECT 128 UNION SELECT 192) AS d3
       CROSS JOIN (SELECT 0 AS n UNION SELECT 256 UNION SELECT 512 UNION SELECT 768) AS d4
       ORDER BY d0.n + d1.n + d2.n + d3.n + d4.n
    
    SET @d3 = GetDate()
    
    SELECT DateDiff(ms, @d1, @d2), DateDiff(ms, @d2, @d3)
    Give that a whirl, what are your numbers in the final result set? Mine are both 16

    -PatP

  11. #11
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Thanks Pat.

    You are obviously putting in a lot of effort. However, what I am trying to avoid is hard coding anything. I didn't literally mean theta join Vs case statement, it is simply that if I had the values in a table then I have to use a theta join and if I hard code the values then I could use a CASE statement.

    I suppose ideally I would have used something similar to your first query to demonstrate the hard coded values but I didn't think of that.

    To reword the question again - is there anything I can do to make the below query more efficient but still retain the two tables? The logical reads on the little table, as well as the huge difference in expected rows Vs actual rows are what made me think there must be something better.
    Code:
        SELECT    recode_this
                , recode_this_recode
        FROM    welovepoots.big_table AS bt
        LEFT OUTER JOIN 
                welovepoots.little_table AS lt 
        ON    bt.recode_this BETWEEN lt.recode_this_lower AND lt.recode_this_upper

  12. #12
    Join Date
    May 2005
    Location
    South Africa
    Posts
    1,365
    Provided Answers: 1
    I could not recreate your problem
    In fact for me the case statement is slower than the join
    I generated test data using Pat's code
    Code:
    SELECT my_id=identity(int,1,1)
    ,d0.n + d1.n + d2.n + d3.n + d4.n as recode_this
    into #big_table
       FROM (SELECT       0 AS n UNION SELECT 1 UNION SELECT 2 UNION SELECT 3) AS d0
       CROSS JOIN (SELECT 0 AS n UNION SELECT   4 UNION SELECT   8 UNION SELECT  12) AS d1
       CROSS JOIN (SELECT 0 AS n UNION SELECT  16 UNION SELECT  32 UNION SELECT  48) AS d2
       CROSS JOIN (SELECT 0 AS n UNION SELECT  64 UNION SELECT 128 UNION SELECT 192) AS d3
       CROSS JOIN (SELECT 0 AS n UNION SELECT 256 UNION SELECT 512 UNION SELECT 768) AS d4
    insert into #big_table select recode_this+1024 from #big_table 
    insert into #big_table select recode_this+2048 from #big_table 
    insert into #big_table select recode_this+4096 from #big_table 
    insert into #big_table select recode_this+8192 from #big_table 
    insert into #big_table select recode_this+16384 from #big_table 
    create unique index ix1 on #big_table (my_id)
    create index ix2 on #big_table (recode_this)
    
    SELECT 1 AS recode_this_lower, 23 AS recode_this_upper, 'Small' AS recode_this_recode into #little_table
       UNION SELECT 26, 40, 'medium'
       UNION SELECT 50, 150, 'large'
       UNION SELECT 160, 99999, 'humungous'
    create unique clustered index ix1 on #little_table (recode_this_lower, recode_this_upper, recode_this_recode)
    go
    /*
    name          rows        reserved           data               index_size         unused             
    ------------- ----------- ------------------ ------------------ ------------------ ------------------ 
    #little_table 4           16 KB              8 KB               8 KB               0 KB
    #big_table    32768       1880 KB            552 KB             1112 KB            216 KB
    */
    --- select using join ---
    declare @dt datetime, @i int
    create table ##t2 (runtime datetime) 
    set @i=1
    while @i<10
    begin
      set @dt=getdate()
      
      SELECT    recode_this
              , recode_this_recode
      into ##t1    
      FROM #big_table AS bt
      JOIN  #little_table lt
        ON    bt.recode_this BETWEEN lt.recode_this_lower AND lt.recode_this_upper
    
      insert into ##t2 select dateadd(ms,datediff(ms,@dt,getdate()),0)
      set @i=@i+1
      drop table ##t1
    end
    select right(convert(varchar(26), runtime,121),12) as runtime from ##t2
    drop table ##t2
    go
    /*
    runtime      
    ------------ 
    00:00:00.017
    00:00:00.017
    00:00:00.030
    00:00:00.030
    00:00:00.030
    00:00:00.033
    00:00:00.017
    00:00:00.017
    00:00:00.030
    */
    
    --- and the case statement ---
    
    declare @dt datetime, @i int
    create table ##t2 (runtime datetime)
    set @i=1
    while @i<10
    begin
      set @dt=getdate()
    
      SELECT    recode_this
              , recode_this_recode    
                = CASE 
                  WHEN recode_this BETWEEN 1 AND 23             THEN 'Small'
                  WHEN recode_this BETWEEN 26 AND 40            THEN 'medium'
                  WHEN recode_this BETWEEN 50 AND 150           THEN 'large'
                  WHEN recode_this BETWEEN 160  AND 99999       THEN 'humungous'    
                  END
      into ##t1        
      FROM    #big_table AS bt 
      
      insert into ##t2 select dateadd(ms,datediff(ms,@dt,getdate()),0)
      set @i=@i+1
      drop table ##t1
    end
    select right(convert(varchar(26), runtime,121),12) as runtime from ##t2
    drop table ##t2
    go
    /*
    runtime      
    ------------ 
    00:00:00.050
    00:00:00.060
    00:00:00.047
    00:00:00.047
    00:00:00.047
    00:00:00.050
    00:00:00.060
    00:00:00.047
    00:00:00.047
    */
    
    --------
    drop table #little_table
    drop table #big_table
    PS.
    I repeated the test more than once with similar results
    I used select into to eliminate the time of the data transfer to the screen
    Last edited by pdreyer; 12-07-07 at 04:34. Reason: pasted wrong code

  13. #13
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Thanks pd. I'm not getting a massive time difference either. The main things I have beeen focussing on are the logical reads & rows affected.

  14. #14
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    To butcher a quote from Mark Twain: There are three kinds of lies in the world of MS-SQL performance, ordinary lies, damning lies, and SQL statistics. The statistics are good information, and are useful for many things, but predicting how long a process will run is not one of those things!

    As long as the recoding table is under about three pages of data (figure 20 kb for a rough rule of thumb), it should actually perform better than the CASE statement. The statistics might well be all over the map, but the actual result set will appear as soon or sooner using the lookup table.

    -PatP

  15. #15
    Join Date
    Nov 2002
    Location
    Jersey
    Posts
    10,322
    Just like DB2 Timerons

    Poots, what the hell are you doing?

    Is this just some exercise?
    Brett
    8-)

    It's a Great Day for America everybody!

    dbforums Yak CorralRadio 'Rita
    dbForums Member List
    I'm Good Once as I ever was

    The physical order of data in a database has no meaning.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •