Unanswered: "order by clause" - query optimization problem
I am facing a weird problem. Basically I have a table with huge data and I have to execute a query which involves a left outer join on this table followed by some conditions.
say i have two tables T1 and T2 my query is:
select * from (select T1.a A1, T1.b A2 from T1,T2 where T1.a = T2.b (+) and T1.b = T2.c (+) and (some conditions......) order by T1.a) where rownum<=300;
Since the driving table (T1) has around 25 lakh rows and T2 has 8 lakh rows, the above query is very time consuming [it takes around 3 minutes] .
So we tried to optimize it in the following way:
The conditions were querying table T1.
select * from (select T1.a A1, T1.b A2 from T1 where (conditions ...)),T2 where A1 = T2.b (+) and A2 = T2.c (+) and rownum<=300;
The above query generates the similar execution plan. It still uses T1 as the driving table [instead of the sub query]. Still takes 4 minutes to complete. However if I modify the query as
select * from (select T1.a A1, T1.b A2 from T1 where (conditions ...) order by T1.a),T2 where A1 = T2.b (+) and A2 = T2.c (+) and rownum<=300;
it takes result set [view] as the driving table for NESTED LOOPS operation. Now the query executes in 30 seconds.
I would like to know why "order by clause in subquery" is making such a huge difference in execution plan.
The word "lakh" is used in Indian English to mean 100,000 (one hundred thousand)
It is not a common word in the English language and therefore, many people on web sites with a potentially global reach will not understand it.
Please use commonly used words to make it easy for you to get your message across.
Re the increase in speed when using order by:
Is it possible that column "a" is not used in the other where (conditions ...) clause for table T1? If, additionally, column "a" has an index on it, then the order by will force the use of the index, thus speeding up the query.
Since your query is using the column a.a3 in the where clause, it may help to create a non-unique index on this column, if feasible. Of course, creating an index on a table that has many different queries running against it may affect the other queries. Also, inserts, updates and deletes will be slightly slower.
In your explain plan, the order by is not really not speeding up the query. Run the two queries one or two more times and check the performance.
The order by clause is causing a lot of difference. If you notice, in the case with order by clause, the input to NESTED LOOP (OUTER) IS A VIEW. This contains only few rows whereas in the other case [without the clause] it is the entire table. Correct me if I am wrong.
If you look at the explain plan outputs, and read three lines from the bottom up, they are the same.
I'll guess here and say that either b1 or (b1,b2) is the uniique key of table b here.
When you sort within the query, you are ordering the data selected from "a" by the unique key columns of "b". Since the index on table b has to work on pre-sorted rows, the actual execution becomes faster.
The increase in speed you see is not the result of just the order by clause, but a result of the fact that the order by clause orders it in a way that can be used by the next join/where clause.
With or without the order by clause, the set of data that is used for the "nested loop" operation should be the same.
If you are really not interested in the ordered set, and are only interested in getting a fiixed number of rows, you may try to add an additional clause inside the first inline view (select from a). Like this:
select a1,a2,a3,b.b1 b1,b.b2 b2,b.b3 b3
(select A.a1 a1,a.a2 a2,a.a3 a3
from A a
where a.a3=30 order by a.a1
and rownum <= 20 ), -- note new clause here
and a2 = b.b2(+)
This should work the same as either of your queries but should be much faster.
In both cases, the nested loop will operate only on the subset of data returned by the inline view (select ... from a where ...).
SQL is a declarative language. That is, you declare what data you want, and the DBMS gets it for you. In general, you do not specify the order of operations.
Order by has elements of procedural logic. It says, first get all the data that matches the specified criteria, then sort it.
As such, with or without an order by, the same identical subset of rows of table "a" will be returned before table "b" is processed.
In both cases, first there will be a full scan of table a. The nested loops operation will only loop over the subset of data returned by the inline select, not the full table. At least, that will be so in the case under discussion.
If you are returning a lot of rows and sorting them ... I have a feeling that you might be doing some "disk sorts" ... Ensure that your sort_area_size and the default next size of the temporary tablespace are equal....
If your sort_area_size is small, you might look at increasing it also, but try to keep the next extent in the temporary tablespace equal to the sort_area_size