  1. #1
    Join Date
    Jun 2008
    Posts
    4

    Red face What to do with huge CSV & flat files

    Alright, so I'm no DBA, but I've been in the IT field long enough to know what's what.

    I'm doing a project for a company. We have several clients' data, all in either Excel, CSV/TXT, or Access. These are totally flat files; some of the clients have 38 columns of data, some have 12, and there's no conformity.

    My client wants these data files to match each other (in number of columns and the information in each row; obviously we're cutting out extraneous columns).

    I'm struggling to understand if this is even the best method. Some of the text files have 1.5+ million records, so if I try to drop one into Access I can't pull it back out in order to send the data to the DB host. (Yeah, it's incredibly roundabout, I know; they don't listen to me...)

    My line of thought is:
    1) Files from different clients should never be mixed with each other.
    2) It'll be more efficient to get each client onto the DBMS (MS SQL) and then simply query the columns and results they want, rather than physically modifying each data file and putting them all together (rough sketch of what I mean below).
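    Something like this is what I have in mind for point 2; the file path, table, and column names are made up, since every client's layout is different:

        -- Hypothetical staging table for one client; one VARCHAR column per column in their file.
        CREATE TABLE dbo.ClientA_Staging (
            TradeDate    VARCHAR(50),
            StockSymbol  VARCHAR(20),
            Quantity     VARCHAR(50),
            Price        VARCHAR(50)
        );

        -- Load one of that client's CSV files straight into the staging table (path is made up).
        BULK INSERT dbo.ClientA_Staging
        FROM 'C:\data\clientA\trades_001.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

        -- Then just query out whichever columns the final layout needs.
        SELECT StockSymbol, TradeDate, Quantity, Price
        FROM dbo.ClientA_Staging;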


    Obviously I need some guidance, or at least enough backup to give my client the big slap in the face and do things the correct way.

    Help is greatly appreciated, and I'm happy to explain anything in more detail.

  3. #3
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    1) Depends.
    2) Yeppers.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    The first thing that I'd do is figure out what I had to start with... How many files, how many columns in each file and (more importantly) what those columns contain, and how many rows/lines/thingies there are in each file. I think that you need to understand what you start with before you can do anything productive (or even predictable) with it.

    Once you have the list of columns, lay out a "super table" that contains enough columns to hold all of your data. Note that because some of the columns had better be equivalent, this should be far less than the sum of all of your columns. In other words, if you have thirty data sources with twelve to thirty elements (columns) each, you probably have less than a hundred total elements. Due to my own preferences, I would make certain that I had two columns just for my own use that have nothing to do with any of the data being imported: I'd make one of them a "valid flag" and the other my surrogate key (SK), which I'll explain shortly.
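    A rough sketch of what I mean; the data column names are invented, and your real element list will be longer:

        -- Sketch only; the real column list comes from your element inventory.
        CREATE TABLE dbo.SuperTable (
            SK          INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key: purely for tracking, never from client data
            ValidFlag   CHAR(1) NOT NULL DEFAULT 'Y',  -- flip this instead of ever deleting rows
            StockSymbol VARCHAR(20),
            TradeDate   VARCHAR(50),   -- keep everything as raw text at first; tighten types once the data is clean
            Quantity    VARCHAR(50),
            Price       VARCHAR(50)
            -- ...one column for every distinct element across all of your sources
        );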

    Now start to move the data into the super table, one source at a time. Don't worry about duplicate data yet, just get everything from all of the data sources into one place so you can start to grapple with it.
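    In practice that's just an INSERT...SELECT from each staging table into the super table, mapping whatever columns that source happens to have (the staging table and column names here are made up):

        -- One source at a time; columns a source doesn't have simply stay NULL.
        INSERT INTO dbo.SuperTable (StockSymbol, TradeDate, Quantity, Price)
        SELECT StockSymbol, TradeDate, Quantity, Price
        FROM dbo.ClientA_Staging;

        -- A source with fewer or differently named columns just maps what it has.
        INSERT INTO dbo.SuperTable (StockSymbol, TradeDate)
        SELECT Symbol, PurchaseDate
        FROM dbo.ClientB_Staging;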

    After you migrate all of the data into one place, start to look for patterns (essentially duplicate information). You'll probably need more than one way to ferret the duplicates out, since there is probably more than one way for a pair of rows to be considered a duplicate. This is where the SK is a lifesaver: since it has nothing to do with the incoming data, it is purely for your convenience. It allows you to track the rows, find the duplicates, and correct the data as you see fit (based on your client's needs) without any need to change any of their data.
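    For example, one pass at "same symbol, same date, same quantity" duplicates might look like this (the matching columns are just a guess at what counts as a duplicate for your data):

        -- Each group sharing these values is a candidate duplicate set;
        -- the SKs tell you exactly which physical rows are involved.
        SELECT StockSymbol, TradeDate, Quantity,
               COUNT(*) AS DupCount,
               MIN(SK)  AS KeepSK      -- arbitrary choice of which row to keep
        FROM dbo.SuperTable
        WHERE ValidFlag = 'Y'
        GROUP BY StockSymbol, TradeDate, Quantity
        HAVING COUNT(*) > 1;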

    You can then start to consolidate the data, logically removing some rows by simply changing the "valid flag" that you added, and possibly creating whole new rows from groups of original rows (so if ten different data sources had information about one entity, you could leave the original rows intact with a valid flag indicating that they'd been merged, then add a whole new row that you created from the merged entities).
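    As a sketch, flagging the originals and adding the merged row can be as simple as this (the 'M' code, the SK values, and the data values are made up):

        -- Mark the original rows as merged rather than deleting them.
        UPDATE dbo.SuperTable
        SET ValidFlag = 'M'                 -- 'M' = merged into a new consolidated row
        WHERE SK IN (1017, 2045, 3300);     -- the SKs of the rows describing the same entity

        -- Add the consolidated row built from those originals.
        INSERT INTO dbo.SuperTable (ValidFlag, StockSymbol, TradeDate, Quantity, Price)
        VALUES ('Y', 'XYZ', '2008-05-30', '1500', '12.85');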

    Once you've "laundered" the data, you can start thinking about how you want to extract the data to send it back to the client. Maybe one table/file, maybe many in different formats, as the need arises. The short answer is that you would then have the ability to manage the data as you see fit instead of staggering under the load of a gazillion files formatted a gazillion different ways!
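    For instance, a view over just the valid rows gives you something you can dump to one file or many, in whatever shape the template calls for (column list is a placeholder):

        -- The laundered data, in the template's column order; export this however you like.
        CREATE VIEW dbo.CleanOutput AS
        SELECT StockSymbol, TradeDate, Quantity, Price
        FROM dbo.SuperTable
        WHERE ValidFlag = 'Y';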

    -PatP

  5. #5
    Join Date
    Jun 2008
    Posts
    4
    Firstly, thanks for your response; a lot of it I've already started on.

    This is where I'm at.

    I already have a list of all the files that they use from their different clients.
    I know how many columns and rows there are, as well as what each column contains.

    They have a template for the final resulting layout they want to achieve.
    There is no duplicate data across all 200+ million rows; however, there is a great deal of room for normalization, but I'm nowhere near that step yet.

    Working with the smaller Excel files is easy enough, but some of the clients have about 100 CSV files with >1.5 million rows each, making it difficult to pull them into any type of program that can edit the files and then re-export them.

    Now, why am I trying to edit and re-export them at all? Well, they don't want to send the off-site database host any data that isn't already in its final form, which ties my hands as to what utilities I can use when changing these files.

    The data contains lists of stock purchases/sales; users simply search for matching stock symbols for different companies in order to file claims. So there's nothing big going on here, there's no editing of the tables, and it's actually fairly simple, just a massive amount of data.


    So what can I even use to start working with such large files? If I try to put them together as one, I'm looking at a single ~14 GB file, which is pretty mind-boggling to manage without using something like MySQL or MS SQL.

  6. #6
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Neither as quick as SQL nor cheap, but I've played around with files much larger than that in SAS and SPSS.

    Could you not get SQL Server Express locally and have several databases (Express does have a DB size limit, doesn't it?), the output of which you consolidate once everything is scrubbed?
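    Assuming one Express database per client (the database names here are invented), the consolidation step can be a plain cross-database query once everything is scrubbed:

        -- Pull the scrubbed output of each client database together in one pass.
        SELECT 'ClientA' AS Client, StockSymbol, TradeDate, Quantity, Price
        FROM ClientA.dbo.SuperTable
        WHERE ValidFlag = 'Y'
        UNION ALL
        SELECT 'ClientB', StockSymbol, TradeDate, Quantity, Price
        FROM ClientB.dbo.SuperTable
        WHERE ValidFlag = 'Y';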

    BTW - I would add one (maybe two) columns to Pat's table: "row source" and perhaps "client".
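    Tacking those onto the super table is trivial and pays for itself the first time a file turns out to be bad (the column names are just suggestions):

        -- Know exactly which file and which client every row came from.
        ALTER TABLE dbo.SuperTable
            ADD RowSource VARCHAR(260) NULL,   -- e.g. the CSV file name it was loaded from
                Client    VARCHAR(100) NULL;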

  7. #7
    Join Date
    Jun 2008
    Posts
    4
    There is a client name field already accounted for. I still feel uneasy about combining multiple clients' data together, but I guess it wouldn't matter much.

    I hadn't thought of SPSS; I wonder where my copy is lying around.
    I'll also look into SQL Server Express; that might be my best bet: query in such a way that it comes up with the resulting table I want, and then export that.
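    Roughly what I have in mind (table and column names are placeholders until I match them to the final template):

        -- Build the result set in the template's layout, then export that table.
        SELECT StockSymbol, TradeDate, Quantity, Price
        INTO dbo.FinalLayout
        FROM dbo.SuperTable
        WHERE ValidFlag = 'Y';

    From there the Import/Export wizard or bcp should be able to write it back out as CSV.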

  8. #8
    Join Date
    Jun 2004
    Location
    Arizona, USA
    Posts
    1,848
    Express's file size limit is 4 GB per database, so one database per client should be workable.
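    If you want to keep an eye on that limit while loading, sp_spaceused with no arguments reports the current database size:

        -- Current database size vs. the 4 GB Express cap.
        EXEC sp_spaceused;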
    Lou
    "To greatly astonish"
    "Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
    "I have my standards. They may be low, but I have them!" - Bette Middler
    "It's a book about a Spanish guy named Manual. You should read it." - Dilbert

