I have four tables in our Oracle 9i DBMS:
sorgeor(
sorgeor_code, sorgeor_start_range, sorgeor_end_range )
Cardinality 230,000
spraddr(
spraddr_pidm, spraddr_atyp_code, spraddr_status_ind, spraddr_zip, etc. )
Cardinality 2,200,000
aprcatg(
aprcatg_pidm, aprcatg_donr_code )
Cardinality 260,000
spbpers(
spbpers_pidm, spbpers_dead_ind )
Cardinality 716,000
SORGEOR is a ‘georegion’ table. Basically we define geographical areas and store the zip code ranges in the sorgeor table. For example, I might have a georegion row that looks like this:
sorgeor{ OH_COL, 43080-0000, 43085-9999 }
sorgeor{ OH_COL, 43090-0000, 43091-9999 }
This means that we’ve defined the Columbus, OH metro georegion to include anyone whose zip code is between 43080-000, 43085-9999 or 43090-0000 to 43091-9999.
I would be included in this list because my spraddr row looks like this:
spraddr{ 12345, ‘MA’, NULL, 43081-1234 }
My brother would not be, because his address has expired:
spraddr{ 11111, ‘MA’, ‘INACTIVE’, 43082-3333 }
My neighbor would not be included even though he lives in Columbus because he is deceased:
spraddr{ 1234, ‘MA’, NULL, 43080 }
spbpers{ 1234, ‘DECEASED’ }
My mom would not be included even though she lives in Columbus because she does not have any aprcatg_donr_codes which start with A:
spraddr{ 9999, ‘MA’, NULL, 43091-1111 }
aprcatg{ 9999, ‘PG’ }
My goal is, for particular georegions, get a count of everyone who is still alive, has an ‘A%’ donr_code, and still lives in that georegion (status_ind IS NULL and spraddr_to_date < SYSDATE or spraddr_to_date IS NULL) and is of type MA (spraddr_atyp_code of ‘MA’).
This is what I currently have:
Code:
SELECT /*+INDEX( SPRADDR SPRADDR_KEY_INDEX ) +INDEX( APRCATG PK_APRCATG ) +ALL_ROWS*/
COUNT( DISTINCT spraddr_pidm )
, sorgeor_geor_code
FROM sorgeor
INNER
JOIN spraddr
ON spraddr_pidm IN( SELECT /*+INDEX( APRCATG PK_APRCATG )*/ aprcatg_pidm
FROM aprcatg
WHERE aprcatg_pidm = spraddr_pidm
AND aprcatg_donr_code LIKE 'A%' )
AND spraddr_pidm IN( SELECT spbpers_pidm
FROM spbpers
WHERE spbpers_pidm = spraddr_pidm
AND spbpers_dead_ind IS NULL )
AND spraddr_atyp_code = 'MA'
AND spraddr_status_ind IS NULL
AND NVL( spraddr_to_date, SYSDATE - 1 ) < SYSDATE
AND spraddr_zip BETWEEN sorgeor_start_range AND sorgeor_end_range
GROUP
BY sorgeor_geor_code
Here is the query plan for the first query:
Code:
0, SELECT STATEMENT
1, SORT, GROUP BY
2, TABLE ACCESS, BY INDEX ROWID, SORGEOR
3, NESTED LOOPS
4, NESTED LOOPS
5, NESTED LOOPS
6, INDEX, FULL SCAN, PK_APRCATG
7, TABLE ACCESS, BY INDEX ROWID, SPRADDR
8, INDEX, RANGE SCAN, SPRADDR_KEY_INDEX
9, TABLE ACCESS, BY INDEX ROWID, SPBPERS
10, INDEX, UNIQUE SCAN, PK_SPBPERS
11, INLIST ITERATOR
12, INDEX, RANGE SCAN, SORGEOR_KEY_INDEX2
Here is if I remove the index hints:
Code:
0, SELECT STATEMENT
1, SORT, GROUP BY
2, NESTED LOOPS
3, NESTED LOOPS
4, HASH JOIN
5, TABLE ACCESS, FULL, SPRADDR
6, TABLE ACCESS, FULL, APRCATG
7, INLIST ITERATOR
8, TABLE ACCESS, BY INDEX ROWID, SORGEOR
9, INDEX, RANGE SCAN, SORGEOR_KEY_INDEX2
10, TABLE ACCESS, BY INDEX ROWID, SPBPERS
11, INDEX, UNIQUE SCAN, PK_SPBPERS
If I try this query:
Code:
SELECT /*+ALL_ROWS INDEX( SPRADDR SPRADDR_KEY_INDEX )*/
COUNT( DISTINCT spraddr_pidm )
, sorgeor_geor_code
FROM sorgeor
INNER
JOIN spraddr
ON EXISTS( SELECT /*+INDEX( APRCATG PK_APRCATG )*/ *
FROM aprcatg
WHERE aprcatg_pidm = spraddr_pidm
AND aprcatg_donr_code LIKE 'A%' )
AND EXISTS ( SELECT /*+INDEX( SPBPERS PK_SPBPERS )*/ *
FROM spbpers
WHERE spbpers_pidm = spraddr_pidm
AND spbpers_dead_ind IS NULL )
AND spraddr_atyp_code = 'MA'
AND spraddr_status_ind IS NULL
AND NVL( spraddr_to_date, SYSDATE - 1 ) < SYSDATE
AND spraddr_zip BETWEEN sorgeor_start_range AND sorgeor_end_range
GROUP
BY sorgeor_geor_code
I get the following plan:
Code:
0, SELECT STATEMENT
1, SORT, GROUP BY
2, NESTED LOOPS
3, MERGE JOIN
4, SORT, JOIN
5, TABLE ACCESS, BY INDEX ROWID, SPRADDR
6, NESTED LOOPS
7, INDEX, FULL SCAN, PK_APRCATG
8, INDEX, RANGE SCAN, SPRADDR_KEY_INDEX
9, FILTER
10, SORT, JOIN
11, TABLE ACCESS, BY INDEX ROWID, SORGEOR
12, INDEX, RANGE SCAN, SORGEOR_KEY_INDEX2
13, TABLE ACCESS, BY INDEX ROWID, SPBPERS
14, INDEX, UNIQUE SCAN, PK_SPBPERS
The third query is unconditionally better than the first two. The problem is that it is still pretty slow. For georegion IREL39369 (371 rows in SORGEOR) it returns 538 people in about four minutes. For georegion IREL39106 (5249 rows) it returns 711 people in about fifty minutes. This is nearly linear. If I extrapolate this linear relationship to the total number of georegion rows I get approximately forty-some odd hours to run for all of them.
Any ideas on how to optimize these queries and/or any query ideas from the gallery?

This is a 3rd party application DB and I cannot (without headaches) change the schema which includes adding indexes or creating new tables.