Unanswered: Reverse Engineer Hand Written SQL ( Oracle )
My environment encompasses over 2000 tables with over 300000 lines of hand-written Oracle SQL, and what I want to do is reverse engineer the SQL to tell me what tables are being accessed and how...
I have the full list of tables stored in a reference table ( extracted from USER_TABLES ), and I also have every line of SQL loaded into another reference table, of the example format :
FILENAME, LINE_NO, SQL
products.sql,3,where id = 2
customer.sql,1,select * from
customer.sql,2,customers a, products b
customer.sql,3,where a.id = b.id
so as you can see I will be evaluating this on a line by line basis... Now this hand-written SQL will obviously be valid Oracle SQL, but it can be all over the place, in that if i am looking at it on a line-by-line basis, the INSERT, UPDATE, FROM statement may not necessary be on the same line as the table name, etc...Basically the same query can be written in many different ways...
So I am taking the list of tables ( stored in a reference table ), and then also the table holding all of the SQL code, and then basically doing a join with a LIKE clause constraint, to see what SQL lines are using what tables, but as the SQL is all hand-written the rules can get messy...
So basically what Im asking is, does anyone know of any definitive rules for reverse engineering SQL, or any example scripts, etc ? My process basically works like this at the moment to get you started :
If i find an instance of a tablename on a SQL line, it must be prefixed with either nothing, space, tab, comma, and it must be suffixed with either nothing, space, tab, comma, semicolon, right bracket... This firstly stops finding any tables thats name are part of another table, eg :
PRODUCT compared to PRODUCT_LIST
Then I make sure that it is not within a comment, so if its on a line with
/*, */, --, etc, then I disregard it ( or if it was between two sets of start-comment and end-comment i also disregard ) - eg :
SELECT * FROM products
After that, I then search for the word INTO, UPDATE, TRUNCATE, DROP, CREATE INDEX, RENAME on that line, etc to try and determine what I am doing to that table, and if i cant find any of those, I assume its a SELECT... I also look at the previous line for those keywords, because the SQL could have been written like this :
Then I also do a similar thing to look for in-line UPDATES :
SELECT end_dte FROM PRODUCTS a, CUSTOMER b
) SET a.end_dte = '1-Jan-2000'
Because the above line is really updating the PRODUCT table, whereas on a line-by-line basis it looks like a SELECT...
So these are my rules in place thus far... My end result would be to have a table that says these SQL scripts access these tables in this way, for example :
OK, Hopefully you understand what I mean... Basically if anyone has any pre-written code or ideas that would be great, but I am more interested in expanding on these rules to make sure I get it 100% right. As you can imagine, hard-written SQL, that is technically valid SQL can still be written in many, many different ways, so the rules neeed to be very tight... Or can anyone see any flaws in my rules also ?
By the way, Im only interested in what tables are accessed, trying to do the same thing on a column basis would be WAY too hard...
I'd use YACC to construct a tokenizer if I were you. Otherwise you'll go nuts trying to figure out how to parse the SQL. Once you've reduced it to tokens, take all of the table tokens from the token stream and you'll be more than 90% of the way to your goal.