  1. #1
    Join Date
    Jan 2004
    Posts
    2

    Unanswered: Useful index for count distinct with group bys?

    I am working with a reporting system in Oracle 8i that uses a star/snowflake schema. It's struggling with a class of queries that have the general form (paraphrased and simplified) of:

    SELECT
    calendar.WEEK,
    item.description,
    status_1.status_description,
    count(distinct location.LOCATION_ID)
    FROM
    calendar,
    item,
    status_1,
    facttable
    WHERE
    facttable.week = calendar.week
    AND facttable.item_id = item.item_id
    AND facttable.status_1_id = status_1.status_1_id
    AND status_1.status_1_id = 1
    AND calendar.week between to_date('01-JAN-02')
    AND to_date('01-JAN-03')
    GROUP BY
    calendar.WEEK,
    item.description,
    status_1.status_description;

    So basically, it's counting the number of locations, for each item and each week, that meet the status condition.

    The fact table has about 500M rows, the dimension tables each have between 1,000 and 10,000 rows, and there are several different status tables, each with only 2 values. Pretty much everything is NOT NULL. The time range and status flag typically match about 1/3 of the rows in the fact table. The fact table and all its indexes are partitioned on the date.

    We have bitmap indexes on each of these columns in the fact table; however, it doesn't seem like the optimizer is using them (it uses them for other queries). My thinking is that a bitmap wouldn't be so useful here, since Oracle would still have to sort all the records once it had found them.
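
    For reference, one thing I could try is forcing the bitmap access path with a hint. This is only a sketch, reduced to the fact-table columns; INDEX_COMBINE is the standard hint for asking the optimizer to combine bitmap indexes:

    SELECT /*+ INDEX_COMBINE(facttable) */
        week, item_id, status_1_id, count(distinct location_id)
    FROM facttable
    WHERE status_1_id = 1
    AND week between to_date('01-JAN-02') AND to_date('01-JAN-03')
    GROUP BY week, item_id, status_1_id;

    But even if that gets the bitmaps used, I suspect the sort for the distinct count is still the expensive part.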

    Is there a way we can have an index that totally avoids the huge sort? What I have tried is this index:
    CREATE INDEX ci_ix1 ON facttable
    (
    WEEK, ITEM_ID, LOCATION_ID, STATUS_1_ID,
    STATUS_2_ID, STATUS_3_ID, STATUS_4_ID
    );


    but based on the query plan it's showing me, I think it's still sorting everything at the end. I believe this is the case even when the query only touches the fact table.

    What I hoped it would do is something like this (at least with the /*+STAR*/ hint):

    Loop through each WEEK (filtered by conditions)
    Loop through each ITEM (unfiltered)
    Loop through ci_ix1 and count off the number of
    locations for each item that match the status flag.

    I _think_ what it is actually doing is retrieving all the rows and then doing a massive sort so that it can get the distinct location count. I think it's the sort that is killing us.

    Is there a better index I could use? Would (week, item, status_1, location) be a better ordering? (If so, we'd need at least 4 of them.) Or am I just stuck with a big sort no matter what?

    Am I better off chasing a totally different approach (bitmaps & star transformation)? I'm also considering materialized views. We're already using them to flatten the snowflake into a star. My concern is that the distinct count doesn't lend itself to pre-aggregation because there are so many different possible conditions.
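
    To make the materialized view idea concrete, here is a sketch of what I have in mind (the MV name and the simplified column names are just illustrative, and it only covers status_1; we haven't built this): pre-aggregate at the (week, item, status, location) grain so the distinct count collapses to a plain COUNT(*).

    CREATE MATERIALIZED VIEW mv_week_item_status1_loc
    BUILD IMMEDIATE AS
    SELECT week, item_id, status_1_id, location_id, count(*) AS row_cnt
    FROM facttable
    GROUP BY week, item_id, status_1_id, location_id;

    -- each (week, item, status_1, location) combination is now one row,
    -- so the report-level distinct count becomes a plain COUNT(*):
    SELECT week, item_id, status_1_id, count(*) AS distinct_locations
    FROM mv_week_item_status1_loc
    WHERE status_1_id = 1
    AND week between to_date('01-JAN-02') AND to_date('01-JAN-03')
    GROUP BY week, item_id, status_1_id;

    The problem is exactly the one above: with several status columns and ad hoc conditions, we would need either one MV per combination or one MV keyed on all of them.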



    Thanks,
    -- Jeff

  2. #2
    Join Date
    Jul 2003
    Posts
    2,296
    Consider not joining to facttable, but using EXISTS instead.
    Also, you could just select the status_1 description in the SELECT list with a scalar subquery instead of the join, since you already know it is status_1_id = '1'.

    Where is the location table, BTW?

    like this:
    PHP Code:
    SELECT
        calendar.WEEK,
        item.description,
        (select status_description from status_1 where status_1_id = '1'),
        count(distinct location.LOCATION_ID)
    FROM
        calendar,
        item
    WHERE
        calendar.week between to_date('01-JAN-02') AND to_date('01-JAN-03')
        AND EXISTS (select null from facttable
                    where calendar.week = facttable.week
                    AND item.item_id = facttable.item_id
                    AND facttable.status_1_id = '1')
    group by
        calendar.WEEK,
        item.description
    I think this all means nothing since you probably meant FACTTABLE instead of LOCATION table ...
    - The_Duck
    you can lead someone to something but they will never learn anything ...

  3. #3
    Join Date
    Sep 2003
    Posts
    6
    I may not be absolutely right, but I'll still try to answer.
    The GROUP BY will not work well because of its complexity.
    1. There will be more than one row returned by your query because you do not have a DISTINCT clause.
    2. Even if it fetches one row, the output is wrong because the GROUP BY skips essential groups when trying to group on everything.
    3. You should have, for example, acct_code, MIN(address), MIN(phone) in the SELECT and GROUP BY acct_code at the end.










    Originally posted by The_Duck
    Consider not joining to facttable, but using EXISTS instead. [...]

  4. #4
    Join Date
    Sep 2003
    Location
    Milan, Italy
    Posts
    130

    Re: Useful index for count distinct with group bys?

    Originally posted by schoenborn

    SELECT
    calendar.WEEK,
    item.description,
    status_1.status_description,
    count(distinct location.LOCATION_ID)
    FROM
    calendar,
    item,
    status_1,
    facttable
    WHERE
    facttable.week = calendar.week
    AND facttable.item_id = item.item_id
    AND facttable.status_1_id = status_1.status_1_id
    AND status_1.status_1_id = 1
    AND calendar.week between to_date('01-JAN-02')
    AND to_date('01-JAN-03')
    GROUP BY
    calendar.WEEK,
    item.description,
    status_1.status_description;

    Using an index to avoid the sort may not be the best solution: yes, you avoid the sort, but then Oracle has to access all the index leaf blocks one block at a time (so-called 'sequential reads'), instead of doing multiblock reads from the table followed by an in-memory sort.

    It is probably better to go for a parallel sort, which is very common in DWHs. Do your tables have a suitable parallel degree? It is often (but not always) set to the number of CPUs in the system.

    What's the value of SORT_AREA_SIZE?
    Is your temp tablespace a locally managed one?
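
    If you want to experiment before touching the init.ora, you can raise it for your own session; just a sketch, the 80M figure is only an example (values are in bytes):

    ALTER SESSION SET sort_area_size = 83886080;
    ALTER SESSION SET sort_area_retained_size = 83886080;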

    You could try adding an index anyway and see what happens. I don't know if the optimizer will be able to use it (by the way, do you have foreign keys from the fact table to all of the dimension tables? The CBO will use them to compute optimal plans), but a suitable index may be:

    create index sort_avoider on facttable (
        week,
        item_id,
        status_1_id,
        location_id)
    LOCAL
    COMPUTE STATISTICS;

    LOCAL is very important if partition pruning kicks in, which is likely since you partition by date and select a date range. By the way, are the partitions by week or by day?

    Try it and see if the CBO picks the index, but it's probably better to go for the parallel in-memory sort.

    just my 2cents
    hth
    alberto

  5. #5
    Join Date
    Jan 2004
    Posts
    2
    Thanks for the suggestions. Right now I'm focusing on the even simpler case without the joins:

    SELECT
    week,
    item_id,
    status_1_id,
    count(distinct LOCATION_ID)
    FROM
    facttable
    WHERE
    status_1_id = 1
    AND week between to_date('01-JAN-02') AND to_date('01-JAN-03')
    GROUP BY
    WEEK,
    item_id,
    status_1_id;


    About 150M rows will match the where clause, and after grouping, this becomes about 250K rows.

    Execution Plan
    ----------------------------------------------------------
    0      SELECT STATEMENT Optimizer=CHOOSE (Cost=373803 Card=1)
    1    0   SORT (AGGREGATE)
    2    1     VIEW (Cost=373803 Card=150831)
    3    2       SORT (GROUP BY) (Cost=373803 Card=150831 Bytes=2865789)
    4    3         PARTITION RANGE (ITERATOR)
    5    4           INDEX (FAST FULL SCAN) OF 'CI_IX3' (NON-UNIQUE)
                       (Cost=2 Card=62749878 Bytes=1192247682)

    This took 1 hr 30 min with the index Alberto suggested and a 4 MB sort_area_size. A previous attempt finished a bit faster, at 1:15, with a similar index (day, item, location, status1, ...):


    Execution Plan
    ----------------------------------------------------------
    0      SELECT STATEMENT Optimizer=CHOOSE (Cost=13 Card=1)
    1    0   SORT (AGGREGATE)
    2    1     VIEW (Cost=13 Card=150831)
    3    2       SORT (GROUP BY) (Cost=13 Card=150831 Bytes=2865789)
    4    3         PARTITION RANGE (ITERATOR)
    5    4           INDEX (RANGE SCAN) OF 'CI_IX2' (NON-UNIQUE)
                       (Cost=13 Card=62749878 Bytes=1192247682)


    Another test I did was to just force a scan of the rows:

    SELECT COUNT(*) FROM (SELECT
    location_id/item_id
    FROM facttable
    WHERE status_1_id = 1
    AND week between to_date('01-JAN-02') AND to_date('01-JAN-03'))

    This took about 35 minutes.

    How high would you set the sort_area_size? I've played with it a little without much result; the default here is 4 MB. Would you suggest going as high as 1.2 GB? I'm running a test right now with 80 MB and a parallel_index(8) hint.
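
    For reference, the hinted test looks roughly like this (same simplified query as above; the index name is the one from the plan, and degree 8 is the figure I'm trying):

    SELECT /*+ PARALLEL_INDEX(facttable, ci_ix3, 8) */
        week, item_id, status_1_id, count(distinct location_id)
    FROM facttable
    WHERE status_1_id = 1
    AND week between to_date('01-JAN-02') AND to_date('01-JAN-03')
    GROUP BY week, item_id, status_1_id;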

    The other challenge here is that this is an ad hoc reporting environment in BO (BusinessObjects), so I am somewhat limited in what I can do with the SQL. I'm not sure BO can generate the code that The_Duck suggested earlier.


    Thanks,
    -- Jeff


    P.S. Sorry for the typos in the code (The_Duck caught one); I've been paraphrasing to get away from the huge identifiers in our data model.

  6. #6
    Join Date
    Sep 2003
    Location
    Milan, Italy
    Posts
    130
    For a DWH, 80M seems quite reasonable: you usually have many gigabytes of RAM and (relatively) few users, so it is normally appropriate to allocate a lot of RAM to the processes' PGAs. As always, YMMV.

    The memory allocated to the processes will be between 1 and 2 times PARALLEL_DEGREE * SORT_AREA_SIZE, since every parallel process will get its own sort_area_size and never return it to the OS. But remember that it may use less RAM; that's just an upper limit.

    Play around and see. Just make sure your TEMP tablespace is a locally managed one (LMT) with a uniform extent size of, say, 1-10M; that may be crucial when the processes need to spill sort data to disk.
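
    Something like this sketch (the tempfile path and the sizes are only placeholders, adjust to your system):

    CREATE TEMPORARY TABLESPACE temp_lmt
    TEMPFILE '/u01/oradata/mydb/temp_lmt01.dbf' SIZE 4000M
    EXTENT MANAGEMENT LOCAL UNIFORM SIZE 4M;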

    Notice how the first query is performing a fast full scan (FFS): it is using the index as a smaller copy of the table and full-scanning it, not using it to avoid the sort.

    Why don't you set the tables and indexes to PARALLEL? That way the ad hoc queries will use parallel execution.
    Remember to configure a large pool if you use parallel execution.
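
    For example (degree 8 is just a placeholder, pick something that matches your CPU count; ci_ix3 is just one of the index names from your plans):

    ALTER TABLE facttable PARALLEL (DEGREE 8);
    ALTER INDEX ci_ix3 PARALLEL (DEGREE 8);

    and in the init.ora something like large_pool_size = 100M; the right size depends on your degree of parallelism and the number of concurrent users.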

    Is your table partitioned by DAY or WEEK ?

    hth
    al
