I'd like to know if anyone here has experience querying flat files like a database. I'm specifically interested in something that would perform against files several hundred million records long, with physical size exceeding 2 GB. The company I work for does a lot of flat file processing via ETL tools. I know that using a database such as Oracle would be ideal for this: in the best case actually loading the data into tables, and in the worst case using the external tables feature. A database is the approach that will probably be used during implementation of a solution, when trial & error will mean querying often to improve the business logic. However, once in production, support teams will need a way to query a file on the fly to answer customer questions about why things happened, etc., and it isn't necessarily feasible to load the file into a database every time a customer has a question.
Specifically what I'm hoping to find is:
- A JDBC/ODBC driver that can sit in front of a flat file and allow SQL queries against it
- Something that is not proprietary to a specific technology (e.g. Oracle, one requirement is not to lock ourselves down)
I and several others here have done some research and have not found anything. Does anyone have any experience with anything similar to this? Know where I can find more information to help?
I'm not aware of any database product that runs on flat files and allows reasonable performance for querying or analysing data. The problem is going to be the lack of indexes. You may be able to find a product which can generate an index on the flat file and give you what you want. However, the idea of using a 2GB flat file as your data store gives me the shakes....
If you want to run SQL against it, it will need banging into a server (whether that be Oracle, MS SQL Server, MySQL or whatever is down to your organisation, its budget, its expertise and corporate policy). As a cheap-and-cheerful, slightly horrible approach you could put all 2GB in a single table, each row representing one record from the flat file, but that would be throwing away the power of the server product; you would really want to break each line down into its constituent columns.
if you wanted to go down a trial & error route then MySQL may be a good choice - it would allow you to develop a fairly effective analysis tool for little corporate budget spend. You could then look to see if the project could afford a permanent solution (which could be the MySQL - it could be Oracle, heck it could even be DB2....)
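To make the "bang it into a server" idea concrete, here is a minimal sketch of loading a delimited flat file into one table, broken into its constituent columns, with an index added. SQLite stands in for MySQL/Oracle so the example is self-contained; the file name, delimiter, and column layout (id|name|zip) are illustrative assumptions, not anything from the actual files being discussed.

```python
import csv
import sqlite3

def load_flat_file(path, conn):
    """Load a pipe-delimited flat file into one indexed table."""
    conn.execute("CREATE TABLE records (id TEXT, name TEXT, zip TEXT)")
    with open(path, newline="") as f:
        # each flat-file line becomes one row, split into columns
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)",
                         csv.reader(f, delimiter="|"))
    # the index is exactly what a raw flat file lacks
    conn.execute("CREATE INDEX idx_zip ON records (zip)")
    conn.commit()

# tiny stand-in for the 2GB file
with open("records.dat", "w") as f:
    f.write("id1|Alice|90210\nid2|Bob|10001\nid3|Carol|90210\n")

conn = sqlite3.connect(":memory:")
load_flat_file("records.dat", conn)
print(conn.execute("SELECT name FROM records WHERE zip = '90210'").fetchall())
```

The same load-then-index pattern applies to MySQL (e.g. via its bulk-load facilities); the one-off load cost buys indexed lookups afterwards.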
I think I need to clarify this a bit more so that it doesn't appear to be quite as bad as it seems. The need is not really to run complex SQL but simple queries, such as "give me every record with X zip code" or "give me everyone with either of these ids". It's really for a tool that won't be performing complex analysis of the whole file, but will take a small set of criteria and pull 30 records for examination off a file of 200 million. It will also take the form of a service that needs to fulfil possibly concurrent requests from different users for different customers with different files, with nothing predetermined: each request will need to dynamically load that file, apply a column layout of some sort, and then extract the desired results for further examination. I think the realization for the people requesting this of us is that response time will not be small (a 10 minute response time is OK), but having to take an hour or more to load and index a file in order to give ideal response time for at most 2 or 3 queries at a time - that, to them, is not acceptable.
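The kind of request described above can be sketched as a plain streaming scan: no load step, no index, just read the file once per request, apply a per-request column layout, and stop after the first N matches. The layout and delimiter below are illustrative assumptions; a full scan like this is exactly why a response time measured in minutes, not seconds, is the realistic expectation on 200 million records.

```python
def scan(path, layout, column, wanted, limit=30, delimiter="|"):
    """Return up to `limit` records whose `column` value is in `wanted`.

    `layout` is the per-request column layout, e.g. ["id", "name", "zip"].
    """
    idx = layout.index(column)
    matches = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(delimiter)
            if fields[idx] in wanted:
                matches.append(dict(zip(layout, fields)))
                if len(matches) >= limit:
                    break  # stop early - we only need a handful of records
    return matches

# tiny stand-in file for illustration
with open("records.dat", "w") as f:
    f.write("id1|Alice|90210\nid2|Bob|10001\nid3|Carol|90210\n")

print(scan("records.dat", ["id", "name", "zip"], "zip", {"90210"}))
```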
Does MySQL have any support for external tables similar to Oracle?
i'd just create a front-end for the users that uses grep at the back-end to strip the relevant lines out of the file.
you could return lines that contain certain zip codes, certain ids, etc. for a hefty speedup, host it on a machine with 4gb of ram and leave the whole flat file on a ramdisk.
hacks aside, you really should import it into a mysql database and let it take care of the indexing, querying, etc. if this 2gb flat file is updated overnight or daily or whenever, set a batch job which (upon successful updating) loads it into a new mysql database and backs up the old ones.
as for the users not wanting to deal with excessive load times, but happy with terrible response times.. eh? any time a new data file arrives, load it into a new database. what's better - 1 hour to insert data once, with each query taking 0.1 seconds, or being able to start on data immediately with each query taking 10-15 minutes?
I agree whole-heartedly that under normal circumstances you would want a relational database for this, and if we were just talking a few files that'd be the easiest thing. The problem is that we're talking hundreds of customers and thousands of files, and any of them have to be available to be queried dynamically upon request. It's quite a daunting challenge to provide the means to have every one of these files loaded in a database ready to query beforehand, just in case the need arises. The easiest solution so far is to use Oracle external tables, because we can create the tables and select on the fly. But this doesn't fit with the requirement that we can't lock ourselves into a specific technology. We can abstract the implementation details away from Oracle itself, but that only means we can switch to something else if and when it becomes available. The powers-that-be might end up opting for the write-our-own option, which wouldn't surprise me in the least.
well i just tested my hack - on my machine grep is far-and-away io-bound. querying 2gb took a bit over a minute. not sure why i thought it'd take any longer. that's with a 7200rpm sp1604n ide hard drive.
you'd just need some front-end work to munge the grep results into columnar data for sorting / manipulation, but if you're steadfastly against upfront database creation i don't see any other workable option.
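A minimal sketch of that front-end work, assuming a Unix `grep` on the PATH and a pipe-delimited layout (both illustrative assumptions): shell out to grep for the raw scan, then munge the matching lines into columnar records.

```python
import subprocess

def grep_query(path, pattern, layout, delimiter="|"):
    """Run grep over a flat file and split matches into columnar dicts."""
    proc = subprocess.run(["grep", "-E", pattern, path],
                          capture_output=True, text=True)
    return [dict(zip(layout, line.split(delimiter)))
            for line in proc.stdout.splitlines()]

# tiny stand-in file for illustration
with open("records.dat", "w") as f:
    f.write("id1|Alice|90210\nid2|Bob|10001\nid3|Carol|90210\n")

print(grep_query("records.dat", "90210$", ["id", "name", "zip"]))
```

Once the matches are dicts, sorting and further manipulation on the small result set is cheap; all the cost is in the grep pass itself.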
that is, if the queries are all as simple as the examples you gave.
the solution to your problem is entirely dependent on the queries you're required to support. if there are any joins, a database (a real one, not some bodged up flat file + index) is essential. ad hoc database creation (load in oracle upon first access) sounds unacceptable to your users - unless you schedule database imports overnight they won't stand for the long interruptions, or at least that's how it seems.
if unrelated flat files are all the users want to work with, stick with tools designed for flat file work.