I guess determining the current status is relatively easy, and with proper indexes should be quite efficient:
select *
from t
where (item_id, status_dt) = (
    select item_id, max(status_dt)
    from t
    where item_id = 1
    group by item_id)
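For comparison, the same "current status" lookup can be written with a window function, which also needs only one pass over the item's rows. A runnable sketch using SQLite (3.25+ has window functions) with a hypothetical table t(item_id, status_id, status_dt) and made-up sample data:

```python
import sqlite3

# Hypothetical schema and data matching the question: t(item_id, status_id, status_dt).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t (item_id int, status_id int, status_dt date);
    insert into t values
        (1, 1, '2010-10-01'),
        (1, 2, '2011-05-03'),
        (2, 1, '2010-12-01'),
        (2, 3, '2011-04-05');
""")

# Rank each item's rows newest-first, then keep only the top row per item.
row = conn.execute("""
    select item_id, status_id, status_dt
    from (select item_id, status_id, status_dt,
                 row_number() over (partition by item_id
                                    order by status_dt desc) as rn
          from t)
    where rn = 1 and item_id = 1
""").fetchone()
print(row)  # -> (1, 2, '2011-05-03')
```

The `row_number()` form has the advantage that dropping the `item_id = 1` filter gives you the current status of every item in one query.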
The part I'm getting stuck on is when it comes to answering questions like "what was the status of item 1 on 15 Feb 2011?" or "when did item 2 have status 3?" or anything that requires turning this into sequential date ranges, such as:
Example of old school SQL-92 correlated sub-query view:
create view v as
select item_id, status_id, status_dt as start_dt,
       (select min(status_dt) - interval '1' day
        from t
        where item_id = t_main.item_id
          and status_dt > t_main.status_dt) as end_dt
from t t_main;
SQL> select * from v where date '2011-05-01' between start_dt and end_dt;

item_id status_id start_dt   end_dt
======= ========= ========== ==========
      1         1 2010-10-01 2011-05-02
      2         3 2011-04-05 2012-02-09

2 rows found
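The same date ranges can be produced in a single pass with LEAD() instead of the correlated sub-select. A sketch against the same hypothetical t table (sample data chosen to reproduce the end dates above), using SQLite's date() function for the one-day subtraction since SQLite has no interval arithmetic:

```python
import sqlite3

# Hypothetical table and data; the LEAD() technique itself is standard SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t (item_id int, status_id int, status_dt date);
    insert into t values
        (1, 1, '2010-10-01'),
        (1, 2, '2011-05-03'),
        (2, 3, '2011-04-05'),
        (2, 4, '2012-02-10');
""")

# lead() fetches the next status_dt within the same item; subtracting one
# day turns consecutive status rows into non-overlapping date ranges.
rows = conn.execute("""
    select item_id, status_id, status_dt as start_dt,
           date(lead(status_dt) over (partition by item_id
                                      order by status_dt), '-1 day') as end_dt
    from t
    order by item_id, status_dt
""").fetchall()
for r in rows:
    print(r)
```

Note that the latest status of each item gets a NULL end_dt (there is no next row), so a point-in-time filter needs something like `between start_dt and coalesce(end_dt, '9999-12-31')`.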
Excuse me if I'm asking a silly question here, but how can the window function query execute in one single table scan?
Because it is calculated while the table is being scanned. Usually the work of collecting this result during the scan is less than the work imposed by a correlated sub-select. I intentionally say "usually" because sometimes this is not the case. It depends on the nature of the query and the tables involved, and very much on the DBMS.
(I have no idea at all how window functions are executed. Any suggested reading?)
I don't have any references, but if you look at the execution plan you will see that there is only a single table scan involved.
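You can see the difference in plan shape yourself. A small sketch using SQLite's EXPLAIN QUERY PLAN (the exact output text varies by SQLite version, and the schema here is hypothetical) that contrasts the two query forms:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (item_id int, status_id int, status_dt date)")

# Correlated scalar sub-select: re-evaluated for every row of t1.
correlated = """
    select item_id, status_dt,
           (select min(status_dt) from t t2
            where t2.item_id = t1.item_id
              and t2.status_dt > t1.status_dt)
    from t t1
"""
# Window function: computed during a single scan of t.
windowed = """
    select item_id, status_dt,
           lead(status_dt) over (partition by item_id order by status_dt)
    from t
"""

for label, sql in [("correlated", correlated), ("window", windowed)]:
    print(label)
    for row in conn.execute("explain query plan " + sql):
        print("   ", row[3])  # the human-readable plan detail
```

The correlated version's plan contains a "CORRELATED SCALAR SUBQUERY" node with its own scan of t, while the window version shows just one scan of t (plus a sort for the PARTITION BY / ORDER BY).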
Look at the following two execution plans which were produced with similar queries on a table with 700k rows.
First the plan with the sub-select solution:
Index Scan using pk_orders on public.orders o1  (cost=0.00..24943026.54 rows=700000 width=20) (actual time=76.045..26090.823 rows=700000 loops=1)
  Output: o1.id, o1.amount, o1.customer_id, o1.order_date, (SubPlan 1)
  Buffers: shared hit=7770087 read=1916
  SubPlan 1
    ->  Aggregate  (cost=35.57..35.58 rows=1 width=4) (actual time=0.034..0.034 rows=1 loops=700000)
          Buffers: shared hit=7714632
          ->  Bitmap Heap Scan on public.orders o2  (cost=4.39..35.56 rows=3 width=4) (actual time=0.016..0.022 rows=4 loops=700000)
                Output: o2.id, o2.customer_id, o2.order_date, o2.amount, o2.sales_person_id
                Recheck Cond: (o2.customer_id = o1.customer_id)
                Filter: (o2.order_date < o1.order_date)
                Buffers: shared hit=7714632
                ->  Bitmap Index Scan on idx_cust_id  (cost=0.00..4.39 rows=8 width=0) (actual time=0.009..0.009 rows=8 loops=700000)
                      Index Cond: (o2.customer_id = o1.customer_id)
                      Buffers: shared hit=2117321
Total runtime: 26194.317 ms
Even though sorting the overall result used temporary files on disk, the second plan is substantially faster than the first one.
It also depends on how you sort the overall result.
On Oracle, for example, the sub-select solution is faster if you sort on a column that is not used as the "partition by" criterion. But if you sort on the same column that you use for the "partition by", the solution with the window function is faster.