Welcome to the dBforums forums.

You are currently viewing our boards as a guest which gives you limited access to view most discussions, articles and access our other FREE features. By joining our free community you will have access to post topics, communicate privately with other members (PM), respond to polls, upload your own photos and access many other special features. Registration is fast, simple and absolutely free so please, join our community today!

If you have any problems with the registration process or your account login, please contact contact support.

If you prefer not to see double-underlined words and corresponding ads, place your cursor
here for ContentLink opt out.

Go Back  dBforums > General > Database Concepts & Design > What to do with huge CSV & flat files

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 06-04-08, 17:48
cubix cubix is offline
Registered User
 
Join Date: Jun 2008
Posts: 4
Red face What to do with huge CSV & flat files

Alright, so I'm no DBA, but I've been in the IT field long enough to know whats what.

I'm doing a project for a company, we have several clients data, all in either excel, CSV/TXT, access. These are totally flat files, some of the clients have 38 columns of data, some have 12, theres no conformity.

My client wants these data files to match each other (in # of columns, and information in each row, obviously we're cutting out extrenuous columns)

I'm struggling to understand if this is even the best method. Some of the text files have 1.5+ million records, so if I try to drop it in Access I can't pull it back out in order to send the data to the DB host. (Yea, it's incredibly round about, I know, they don't listen to me.......)

My line of thought is,
1) files from different clients should never be mixed with each other
2) It'll be more efficient to get each client on the DBMS (MS SQL) and then simply query the columns and results they want, rather then physically modifying each data file and putting them all together


Obviously I need some guidance, or atleast enough back up to give my client the big slap in the face and do things the correct way.

Help is greatly appreciated and I'll be willing to explain any information further.
Reply With Quote
  #2 (permalink)  
Old 06-04-08, 17:49
cubix cubix is offline
Registered User
 
Join Date: Jun 2008
Posts: 4
Red face What to do with huge CSV & flat files

<-Deleted repeat post because vista sucks->

Last edited by cubix : 06-05-08 at 10:30.
Reply With Quote
  #3 (permalink)  
Old 06-04-08, 21:48
blindman blindman is online now
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 9,270
1) Depends.
2) Yeppers.
__________________
If it's not practically useful, then it's practically useless.

blindman
http://sqlblindman.googlepages.com/main
Reply With Quote
  #4 (permalink)  
Old 06-04-08, 22:09
Pat Phelan Pat Phelan is offline
Resident Curmudgeon
 
Join Date: Feb 2004
Location: In front of the computer
Posts: 9,569
The first thing that I'd do is figure out what I had to start with... How many files, how many columns in each file and more importantly what those columns contained, and how many row/lines/thingies there were in each file. I think that you need to understand what you start with before you can do anything productive (or even predictable) with it.

Once you have the list of columns, lay out a "super table" that contains enough columns to contain all of your data. Note that because some of the columns had better be equivalent, this should be far less than the sum of all of your columns. In other words, if you have thirty data sources with twelve to 30 elements (columns) each, you probably have less than a hundred total elements. Due to my own preferences, I would make certain that I had two columns just for my own use that had nothing to do with any of the data being imported, and I'd make one of them a "valid flag" and the other column my surrogate key (SK) which I'll explain shortly.

Now start to move the data into the super table, one source at a time. Don't worry about duplicate data yet, just get everything from all of the data sources into one place so you can start to grapple with it.

After you migrate all of the data into one place, start to look for patterns (essentially duplicate information). You'll probably need to use more than one way to ferret the duplicates out, since there is probably more than one way for a pair of rows to be considered a duplicate. This is where the SK is a lifesaver, since it had nothing to do with the incomming data, it is purely for your convenience. It allows you to track the rows, find the duplicates, correct the data as you see fit (based on your client's needs) without any need to change any of their data.

You can then start to consolidate the data, logically removing some rows by simply changing the "valid flag" that you added, possibly creating whole new rows from groups of original rows (so if ten different data sources had information about one entity, you could leave the original rows intact with a valid flag that indicated that they'd been merged, then a whole new row that you created from the merged entities.

Once you've "laundered" the data, you can start thinking about how you want to extract the data to send it back to the client. Maybe one table/file, maybe many in different formats, as the need arises. The short answer is that you would then have the ability to manage the data as you see fit instead of staggering under the load of a gazillion files formatted a gazillion different ways!

-PatP
Reply With Quote
  #5 (permalink)  
Old 06-05-08, 10:48
cubix cubix is offline
Registered User
 
Join Date: Jun 2008
Posts: 4
Firstly thanks for your repsonse, a lot of it I've already started on.

This is where I'm at.

I already have a list of all the files that they use from their different clients.
I know how many columns and rows there are, as well as what each column contains.

They have a template for the final resulting layout they want to achieve.
There is no duplicate data across all 200+ million rows, however there is a great deal of room for normalization, but I'm no where near that step yet.

Working with the smaller excel files are easy to manage/modify, but some of the clients has about 100 CSV files with >1.5 million rows each, making it difficult to pull into any type of program that can edit the file, and then re-export it.

Now, why am I trying to edit it and re-export it. Well, they don't want to send the off site database host data that isn't already in their final form, tying my hands with what utilities I can use when changing these files.

The data contains lists of stock purchases/sales, users simply serach for matching stock symbols for different company in order to file claims. So theres nothing big going on here, theres no editing of the tables, it's actually fairly simple, just a massive amount of data.


So what can I even use to start working with such large files, if I try to put them together as one, I'm looking at a single ~14GB file, thats pretty mind boggling to manage with out using something like MySQL or MS Sql.
Reply With Quote
  #6 (permalink)  
Old 06-05-08, 10:58
pootle flump pootle flump is offline
COLOSSAL WIN
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 10,342
Neither quick (as SQL would be) nor cheap but I've played around with files much larger than that in SAS and SSPS.

Could you not get SQL Server Express locally and have several databases (Express does have a DB size limit doesn't it?), the output of which you consolidate once everything is scrubbed.

BTW - I would add one (maybe two) columns to Pat's table - "row source", and perhaps "client".
Reply With Quote
  #7 (permalink)  
Old 06-05-08, 11:03
cubix cubix is offline
Registered User
 
Join Date: Jun 2008
Posts: 4
There is a client name field already accounted for. Though I still feel uneasy that combining multiple clients data together, but I guess it wouldn't much matter.

I haven't thought of SPSS, I wonder where my copy i lying around.
I'll also look into SQL Server express, that might be my best bet, to query in such a way that it'll come up with the resulting table I want and then export that.
Reply With Quote
  #8 (permalink)  
Old 06-10-08, 18:36
loquin loquin is offline
Super Moderator
 
Join Date: Jun 2004
Location: Arizona, USA
Posts: 1,354
Express file size limit is 4 GB per database. Should be workable.
__________________
Lou
使大吃一惊
"Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
"I have my standards. They may be low, but I have them!" - Bette Middler
"It's a book about a Spanish guy named Manual. You should read it." - Dilbert

Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On