  1. #1
    Join Date
    May 2008
    Posts
    277

    Unanswered: Querying a history table

    Feeling kinda dumb, but I'm having a hard time figuring out how to properly query a history table.

    Consider:
    Code:
     item_id | status_id | status_dt
    ---------+-----------+------------
     1       | 1         | 2010-10-01
     1       | 2         | 2011-05-03
     1       | 3         | 2011-09-15
     2       | 3         | 2011-04-05
     2       | 1         | 2012-02-10
    I guess determining the current status is relatively easy, and with proper indexes should be quite efficient:
    Code:
    select *
    from item_status
    where (item_id, status_dt) = (
            select item_id, max(status_dt)
            from item_status
            where item_id = 1
            group by item_id)
    The part I'm getting stuck on is when it comes to answering questions like "what was the status of item 1 on 15 Feb 2011?" or "when did item 2 have status 3?" or anything that requires turning this into sequential date ranges, such as:
    Code:
     item_id | status_id |  start_dt  |   end_dt   
    ---------+-----------+------------+------------
           1 |         1 | 10/01/2010 | 05/02/2011
           1 |         2 | 05/03/2011 | 09/14/2011
           1 |         3 | 09/15/2011 | 
           2 |         3 | 04/05/2011 | 02/09/2012
           2 |         1 | 02/10/2012 |
    Any help would be greatly appreciated.

  2. #2
    Join Date
    Jan 2007
    Location
    UK
    Posts
    11,434
    Provided Answers: 10
    What database engine are you using?
    George
    Home | Blog

  3. #3
    Join Date
    May 2008
    Posts
    277
    PostgreSQL

  4. #4
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    The following (ANSI SQL; works in PostgreSQL 8.4 and later) will give you the start and end date of each status for each item:
    Code:
    SELECT id,
           status_id,
           status_dt as start_date,
           lag(status_dt) over(partition by id, status_id ORDER by status_dt DESC) - 1 AS end_date
    FROM item_status
    ORDER BY 1,2 DESC
    You can, for example, put that into a view and then query the view with WHERE ... BETWEEN start_date AND end_date
    Last edited by shammat; 03-12-12 at 19:55.

  5. #5
    Join Date
    Dec 2008
    Location
    At work...
    Posts
    92
    Example of an old-school SQL-92 correlated sub-query view:
    Code:
    create view v as
    select item_id, status_id, status_dt as start_dt,
           (select min(status_dt) - interval '1' day from t
            where item_id = t_main.item_id
              and status_dt > t_main.status_dt) as end_dt
    from t t_main;
    Code:
    SQL>select * from v where date'2011-05-01' between start_dt and end_dt;
    
        item_id   status_id start_dt   end_dt
        =======   ========= ========   ======
              1           1 2010-10-01 2011-05-02
              2           3 2011-04-05 2012-02-09
    
    2 rows found
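
    One caveat: the row for an item's current status has a NULL end_dt, so the BETWEEN predicate above will never match it. If you want a lookup for today's date to hit the open-ended row as well, coalesce the end date to a sentinel (the dates below are just examples):
    Code:
    select *
    from v
    where date '2012-03-01' between start_dt
                                and coalesce(end_dt, date '9999-12-31');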

  6. #6
    Join Date
    May 2008
    Posts
    277
    Thanks for those solutions. Both will be helpful as some of my work is still on PostgreSQL 8.4, which I don't believe has window functions.

    A small change to shammat's code:
    Code:
    SELECT
        item_id,
        status_id,
        status_dt as start_date,
        lag(status_dt) over(partition by item_id ORDER by status_dt DESC) - 1 AS end_date
    FROM item_status
    ORDER BY 1, 3 DESC
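
    Putting that into a view, as shammat suggested, makes point-in-time questions easy. A sketch (the view name is my own, and I coalesce the NULL end date so the open-ended current status still matches):
    Code:
    create view item_status_ranges as
    select item_id,
           status_id,
           status_dt as start_date,
           lag(status_dt) over (partition by item_id order by status_dt desc) - 1 as end_date
    from item_status;

    select *
    from item_status_ranges
    where item_id = 1
      and date '2011-02-15' between start_date
                                and coalesce(end_date, date '9999-12-31');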
    Thanks again!

  7. #7
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Quote Originally Posted by futurity View Post
    some of my work is still on PostgreSQL 8.4, which I don't believe has window functions.
    8.4 does have window functions, so go ahead and use them

    I'm pretty sure the solution with window functions will be faster than the correlated sub-select, as only a single scan over the table is required.

  8. #8
    Join Date
    May 2008
    Posts
    277
    Quote Originally Posted by shammat View Post
    8.4 does have window functions, so go ahead and use them
    So it does. Window functions rock. Guess it's about time to learn how to use them.
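
    For instance, the "current status" query from my first post can be written with a window function in a single pass (just a sketch, using row_number(), which 8.4 also supports):
    Code:
    select item_id, status_id, status_dt
    from (select item_id, status_id, status_dt,
                 row_number() over (partition by item_id order by status_dt desc) as rn
          from item_status) ranked
    where rn = 1;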

  9. #9
    Join Date
    Dec 2008
    Location
    At work...
    Posts
    92
    Quote Originally Posted by shammat View Post
    I'm pretty sure the solution with window functions will be faster than the correlated sub-select, as only a single scan over the table is required.
    Excuse me if I'm asking a silly question here, but how can the window function query execute in one single table scan?

    (I have no idea at all how window functions are executed. Any suggested reading?)

  10. #10
    Join Date
    Nov 2003
    Posts
    2,935
    Provided Answers: 12
    Quote Originally Posted by JarlH View Post
    Excuse me if I'm asking a silly question here, but how can the window function query execute in one single table scan?
    Because it is calculated while the table is being scanned. Usually the work of collecting this result during the scan is less than the work imposed by a correlated sub-select. I intentionally say "usually" because sometimes this is not the case: it depends on the nature of the query, the tables involved, and very much on the DBMS.

    (I have no idea at all how window functions are executed. Any suggested reading?)
    I don't have any references, but if you look at the execution plan you will see that there is only a single table scan involved.

    Look at the following two execution plans which were produced with similar queries on a table with 700k rows.

    First the plan with the sub-select solution:
    Code:
    QUERY PLAN
    Index Scan using pk_orders on public.orders o1  (cost=0.00..24943026.54 rows=700000 width=20) (actual time=76.045..26090.823 rows=700000 loops=1)
      Output: o1.id, o1.amount, o1.customer_id, o1.order_date, (SubPlan 1)
      Buffers: shared hit=7770087 read=1916
      SubPlan 1
        ->  Aggregate  (cost=35.57..35.58 rows=1 width=4) (actual time=0.034..0.034 rows=1 loops=700000)
              Output: max(o2.order_date)
              Buffers: shared hit=7714632
              ->  Bitmap Heap Scan on public.orders o2  (cost=4.39..35.56 rows=3 width=4) (actual time=0.016..0.022 rows=4 loops=700000)
                    Output: o2.id, o2.customer_id, o2.order_date, o2.amount, o2.sales_person_id
                    Recheck Cond: (o2.customer_id = o1.customer_id)
                    Filter: (o2.order_date < o1.order_date)
                    Buffers: shared hit=7714632
                    ->  Bitmap Index Scan on idx_cust_id  (cost=0.00..4.39 rows=8 width=0) (actual time=0.009..0.009 rows=8 loops=700000)
                          Index Cond: (o2.customer_id = o1.customer_id)
                          Buffers: shared hit=2117321
    Total runtime: 26194.317 ms
    And this is the plan using windowing functions:
    Code:
    QUERY PLAN
    Sort  (cost=190779.97..192529.97 rows=700000 width=20) (actual time=4105.999..4406.080 rows=700000 loops=1)
      Output: id, amount, customer_id, order_date, (lag(order_date) OVER (?))
      Sort Key: orders.id
      Sort Method: external merge  Disk: 25520kB
      Buffers: shared hit=5151, temp read=6093 written=6093
      ->  WindowAgg  (cost=94463.48..108463.48 rows=700000 width=20) (actual time=1516.327..2809.566 rows=700000 loops=1)
            Output: id, amount, customer_id, order_date, lag(order_date) OVER (?)
            Buffers: shared hit=5149, temp read=2900 written=2900
            ->  Sort  (cost=94463.48..96213.48 rows=700000 width=20) (actual time=1516.306..2069.764 rows=700000 loops=1)
                  Output: id, amount, customer_id, order_date
                  Sort Key: orders.customer_id, orders.order_date
                  Sort Method: external merge  Disk: 23176kB
                  Buffers: shared hit=5149, temp read=2900 written=2900
                  ->  Seq Scan on public.orders  (cost=0.00..12147.00 rows=700000 width=20) (actual time=0.013..182.325 rows=700000 loops=1)
                        Output: id, amount, customer_id, order_date
                        Buffers: shared hit=5147
    Total runtime: 4451.611 ms
    Even though sorting the overall result used temporary files on disk, the second plan is substantially faster than the first one.

    It also depends on how you sort the overall result.

    On Oracle, for example, the sub-select solution is faster if you sort on a column that is not used in the "partition by" clause. But if you sort on the same column that you use for the "partition by", the solution with the window function is faster.

  11. #11
    Join Date
    Dec 2008
    Location
    At work...
    Posts
    92
    Shammat, thank you very much!
