  1. #1
    Join Date
    Nov 2002
    Posts
    2

    Question: database with millions of records

    Hi all,

    Every day, a plain-text report (30,000+ pages) is produced by my employer that lists several hundred thousand insurance claims. I've built an application in visual basic that parses this report and grabs specific claims from it, and my next task is to load them into a database.

    Each claim record contains the following information: claim amount ($), account number (the account that the claim is attached to), and claim date (date the claim occurred).

    When specific accounts are billed for X amount of dollars by my employer, my job is to pull the claims from my database that make up this X amount of dollars.

    For example, if Joe's hardware store is billed for 10,000 dollars worth of insurance claims, I then need to grab the claims that make up this 10,000 dollars from my database, and show them to Joe so that he knows the details of his bill.

    Finally - my question is, what is the optimal way to set up this database so that it can be added to and queried as quickly as possible? Should I dump every single claim into one gigantic table, or should I create a separate table for each of the 2,000 or so accounts? 3 days' worth of claims is easily 1,000,000+ records, so the volume of this database will be big. The database will be added to/queried/deleted from every day.

    Basically I'm just looking for some guidelines to follow to make this application perform as efficiently as possible. I'm not bound by any particular database system, so any advice there would be helpful as well. Thanks for any advice!

    Jesse

  2. #2
    Join Date
    Sep 2002
    Location
    UK
    Posts
    5,171

    Re: database with millions of records

    Originally posted by neophyte
    Hi all,

    Every day, a plain-text report (30,000+ pages) is produced by my employer that lists several hundred thousand insurance claims. I've built an application in visual basic that parses this report and grabs specific claims from it, and my next task is to load them into a database.

    Each claim record contains the following information: claim amount ($), account number (the account that the claim is attached to), and claim date (date the claim occurred).

    When specific accounts are billed for X amount of dollars by my employer, my job is to pull the claims from my database that make up this X amount of dollars.

    For example, if Joe's hardware store is billed for 10,000 dollars worth of insurance claims, I then need to grab the claims that make up this 10,000 dollars from my database, and show them to Joe so that he knows the details of his bill.

    Finally - my question is, what is the optimal way to set up this database so that it can be added to and queried as quickly as possible? Should I dump every single claim into one gigantic table, or should I create a separate table for each of the 2,000 or so accounts? 3 days' worth of claims is easily 1,000,000+ records, so the volume of this database will be big. The database will be added to/queried/deleted from every day.

    Basically I'm just looking for some guidelines to follow to make this application perform as efficiently as possible. I'm not bound by any particular database system, so any advice there would be helpful as well. Thanks for any advice!

    Jesse
    Since this sounds like a reasonably heavy-duty database (processing millions of records per week), I'm surprised that it works by parsing a plain-text report. I mean, doesn't your company already have a database from which this report is coming? Why can't your application work from that database?

    A table per account? No. Whether you require just one table or more depends on the structure of the data, not the number of customers you have.

    Presumably there will be some kind of archiving on this table, e.g. removing data more than X days old? In that case, table partitioning may be of interest. In Oracle (I can't speak for other DBMSs), you can partition a table on date ranges. Then instead of deleting millions of old records, you can just drop the partition, or move it to another database.
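
    For illustration, here's a rough sketch of what a range-partitioned claim table could look like in Oracle - the table name, column names and partition boundaries are invented for the example, not taken from Jesse's system:

        CREATE TABLE claim (
            claim_id     NUMBER        NOT NULL,
            account_id   NUMBER        NOT NULL,
            claim_date   DATE          NOT NULL,
            claim_amount NUMBER(12,2)  NOT NULL
        )
        PARTITION BY RANGE (claim_date) (
            PARTITION claim_2002_q4 VALUES LESS THAN (TO_DATE('2003-01-01','YYYY-MM-DD')),
            PARTITION claim_2003_q1 VALUES LESS THAN (TO_DATE('2003-04-01','YYYY-MM-DD'))
        );

        -- Ageing out a quarter is then one dictionary operation instead of millions of DELETEs
        ALTER TABLE claim DROP PARTITION claim_2002_q4;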

  3. #3
    Join Date
    Nov 2002
    Posts
    2
    Indeed there is a database that this report is coming from, but it is one that our business area has no access to. This of course was my first question as well. Unfortunately I must rebuild a database based on a plain-text report.

  4. #4
    Join Date
    Sep 2002
    Location
    UK
    Posts
    5,171
    Originally posted by neophyte
    Indeed there is a database that this report is coming from, but it is one that our business area has no access to. This of course was my first question as well. Unfortunately I must rebuild a database based on a plain-text report.
    You have my sympathy!

    I can't help wondering: if your VB program has to read 100,000s of text records, parse them to get the information you need, and then insert (and/or update?) 100,000s of database records via ODBC (presumably) - how long is that program going to take to run? If this were an Oracle database, I'd want to consider SQL*Loader and stored procedures, or maybe Pro*C (C with embedded SQL) running on the server.
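
    As a rough idea of what that would involve - assuming the VB parser just writes the claims out as a comma-delimited file, a minimal SQL*Loader control file might look like this (the file, table and column names are assumptions, not Jesse's actual names):

        -- claims.ctl: bulk-load the parsed claims extract into the claim table
        LOAD DATA
        INFILE 'claims.dat'
        APPEND
        INTO TABLE claim
        FIELDS TERMINATED BY ','
        ( account_id,
          claim_date    DATE 'YYYY-MM-DD',
          claim_amount )

    A nightly "sqlldr userid=claims_owner/password control=claims.ctl log=claims.log" would then do the load far faster than row-by-row inserts over ODBC.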

  5. #5
    Join Date
    Dec 2002
    Posts
    1
    Jesse,

    I understand your problem because I'm basically doing the same thing, but for aircraft maintenance. I'm extracting data from about 12 different text files (about 10,000 pages in total), which build 14 Access tables. Also, the new text files replace the current data; there is no history tracking. I'm using Monarch to do all of the data extraction (great tool and easy to use) and to put the data into the Access tables. I don't have the same number of records, though - only around 50,000 total.

    A possible solution - this is how I would do it, using Access as both front end and back end (FE/BE):

    Run #1: extract the company names, and add any company that is not already in your company list.

    Run #2: extract the company name, the claim # as a key, the file name from run #3, the report run date, and a data summary. This can be appended to a single table, since the summary is one line.

    Run #3: extract the detail listing, using the same claim # as the key, into a separate database, and then link the table. I'm assuming you're the only one doing this, so I would keep all the .mdb files local.

    To extract the detailed data for a particular company, I would have an automated process that builds a union query based on the selected company and pulls the data from the relevant tables - there is no way I would do this by hand. Alternatively, have a single union query that searches all of the files.

    Because of the number of records, I would look into using SQL Server as the back end and Access as the front end. I can't say anything about Oracle, since I'm not an Oracle developer.

    I used a batch file to copy the text files from the network to my local drive and execute Monarch. The batch file is triggered when I click the Run Monarch button. The whole process takes about 5 minutes, and the data is then ready for other users on the network.

  6. #6
    Join Date
    Nov 2002
    Location
    San Francisco
    Posts
    251


    There is a piece of software called Monarch (www.datawatch.com). It can extract data from any report (not a data drill, but a report with headers, footers, sections, etc.) and stick it into any ODBC database. They also have a batch version (called DataPump) and a web-based version (a little bit too expensive).

    I used it a lot for getting ending balances from CNC reports. It took me a few minutes to set up a template and seconds to get the data from a report thousands of pages long.



    jiri

  7. #7
    Join Date
    Jan 2003
    Location
    UK
    Posts
    2
    We are a software vendor specialising in data ETL and transformation - apologies for the sales pitch, but both projects sound right up our street, as we have a metadata-driven code generator called Transformation Manager which can build either Java components or XSLT transforms.

    see www.etlsolutions.com

  8. #8
    Join Date
    Oct 2002
    Location
    NZ
    Posts
    25
    My advice would be to stick to basics. If you are able to coax the suppliers of the report into providing it in a fairly clean textual form, something like SQL*Loader (Oracle) should be able to load it pretty quickly - 30,000 records in something like 5-15 minutes, depending on your level of server grunt and disk speed (guessing a bit here).

    You should be able to automate most of this process. Don't worry about Oracle; it will handle 100s of millions of rows. Look at partitioning your tables, since that way you will be querying data based on some common parameters (recent dates, for example).

    In regard to the 2,000 or so accounts, create a static table with details of these accounts, e.g.:
    account{account_id, name, address, phone, email ...}

    then you load a claim table with your text data.

    claim{claim_id, account_id, claim details...} where account_id is a foreign key to the account table.

    So, now you can create a view or query which simply gives you the account details based on any claim via the foreign key account_id.
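
    In SQL, that design and the billing query come out roughly as follows (the extra columns and the bind variables are just placeholders for illustration):

        CREATE TABLE account (
            account_id NUMBER        PRIMARY KEY,
            name       VARCHAR2(100) NOT NULL,
            address    VARCHAR2(200),
            phone      VARCHAR2(30),
            email      VARCHAR2(100)
        );

        CREATE TABLE claim (
            claim_id     NUMBER       PRIMARY KEY,
            account_id   NUMBER       NOT NULL REFERENCES account (account_id),
            claim_date   DATE         NOT NULL,
            claim_amount NUMBER(12,2) NOT NULL
        );

        -- The claims that make up one account's bill for a given period
        SELECT a.name, c.claim_date, c.claim_amount
        FROM   claim c JOIN account a ON a.account_id = c.account_id
        WHERE  a.account_id = :billed_account
        AND    c.claim_date BETWEEN :bill_start AND :bill_end;

    An index on claim (account_id, claim_date) keeps that query quick even with millions of rows in the table.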
    Get your data entry staff to keep the account table updated - i.e., no claim can be entered unless an account record exists. Oracle's SQL*Loader will report anomalies; just keep a check on this.

    Simple.

    cheers
    s.

    PS: I wouldn't be tempted to piss around with XML or the like; you'll just complicate your life with faddy shite that will be superseded in a few years. Stick to loading the raw textual data and you won't go wrong. XML won't speed that process up for you, and that's where your overhead is.

    Last edited by steve63; 02-11-03 at 02:01.

  9. #9
    Join Date
    Jan 2003
    Location
    Dallas/Houston
    Posts
    7

    other ideas

    Since you don't have access to it... oops.

    Try segmenting those records by dividing the tables into ranges of customer numbers,

    and when performing queries, pick the table by customer number (or any other variable):

    if customernumber > 1 and < 1000 then point to customertable1
    if customernumber > 1001 and < 2000 then point to customertable2
    if customernumber > 2001 and < 3000 then point to customertable3

    etc. etc.

    Obviously, do it with a Select Case statement.

  10. #10
    Join Date
    Oct 2002
    Location
    NZ
    Posts
    25

    Re: other ideas

    No, no, no -
    This is silly.

    Use a database that supports SQL - create a view which is a union of all tables then all you need to do is query the view.
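
    For example, if you were stuck with the per-range tables suggested in the previous post, a single view hides the physical split and the application only ever queries the view (a sketch reusing those table names; the column names are assumed from Jesse's description of a claim record):

        CREATE VIEW all_claims AS
            SELECT customernumber, claim_date, claim_amount FROM customertable1
            UNION ALL
            SELECT customernumber, claim_date, claim_amount FROM customertable2
            UNION ALL
            SELECT customernumber, claim_date, claim_amount FROM customertable3;

        -- No IF / Select Case in the application: one statement covers every table
        SELECT * FROM all_claims WHERE customernumber = 1234;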

    "IF" (nor CASE) is not supported in SQL. When you start having to write procedural code to access data in a database you KNOW you are doing something very wrong.

    steve.
    PS: nothing personal.


    Originally posted by carloswydler
    Since you don't have access to it... oops.

    Try segmenting those records by dividing the tables into ranges of customer numbers,

    and when performing queries, pick the table by customer number (or any other variable):

    if customernumber > 1 and < 1000 then point to customertable1
    if customernumber > 1001 and < 2000 then point to customertable2
    if customernumber > 2001 and < 3000 then point to customertable3

    etc. etc.

    Obviously, do it with a Select Case statement.
    Last edited by steve63; 02-18-03 at 19:43.
