Alright, so I'm no DBA, but I've been in the IT field long enough to know what's what.
I'm doing a project for a company. We have several clients' data, all in either Excel, CSV/TXT, or Access format. These are totally flat files; some of the clients have 38 columns of data, some have 12, and there's no conformity.
My client wants these data files to match each other (in number of columns and in the information in each row; obviously we're cutting out extraneous columns).
I'm struggling to understand if this is even the best method. Some of the text files have 1.5+ million records, so if I try to drop one into Access I can't pull it back out in order to send the data to the DB host. (Yeah, it's incredibly roundabout, I know; they don't listen to me...)
My line of thought is:
1) Files from different clients should never be mixed with each other.
2) It'll be more efficient to get each client onto the DBMS (MS SQL) and then simply query the columns and results they want, rather than physically modifying each data file and putting them all together.
Obviously I need some guidance, or at least enough backup to give my client the big slap in the face and do things the correct way.
Help is greatly appreciated and I'll be willing to explain any information further.
The first thing that I'd do is figure out what I had to start with: how many files, how many columns in each file, and more importantly what those columns contained, and how many rows/lines/thingies there were in each file. I think that you need to understand what you start with before you can do anything productive (or even predictable) with it.
Once you have the list of columns, lay out a "super table" that contains enough columns to hold all of your data. Note that because some of the columns had better be equivalent across sources, this should be far fewer than the sum of all of your columns. In other words, if you have thirty data sources with twelve to thirty elements (columns) each, you probably have fewer than a hundred total elements. Due to my own preferences, I would make certain that I had two columns just for my own use that had nothing to do with any of the data being imported: one a "valid flag" and the other a surrogate key (SK), which I'll explain shortly.
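As a rough sketch of that super table, here's one way it could look. This uses SQLite as a lightweight stand-in for MS SQL, and the data columns (client_name, symbol, and so on) are invented examples; the real list would be the union of the columns in your source files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real file path works the same way

conn.execute("""
    CREATE TABLE super_table (
        sk          INTEGER PRIMARY KEY,  -- surrogate key: ours alone, never from the source data
        valid_flag  TEXT DEFAULT 'V',     -- e.g. 'V' valid, 'D' duplicate, 'M' merged away
        source_file TEXT,                 -- which client file each row came from
        -- union of all client columns; these names are invented examples:
        client_name TEXT,
        symbol      TEXT,
        trade_date  TEXT,
        quantity    REAL,
        price       REAL
    )
""")
```

The point of `sk` and `valid_flag` is exactly what's described above: they belong to you, not to any client, so you can track and mark rows without ever altering the imported data.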
Now start to move the data into the super table, one source at a time. Don't worry about duplicate data yet; just get everything from all of the data sources into one place so you can start to grapple with it.
After you migrate all of the data into one place, start to look for patterns (essentially duplicate information). You'll probably need more than one way to ferret the duplicates out, since there is probably more than one way for a pair of rows to be considered a duplicate. This is where the SK is a lifesaver: since it has nothing to do with the incoming data, it is purely for your convenience. It allows you to track the rows, find the duplicates, and correct the data as you see fit (based on your client's needs) without any need to change any of their data.
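To make the SK idea concrete, here's a toy sketch (SQLite again standing in for MS SQL, with invented columns). One definition of "duplicate" is applied, every row after the first in each duplicate group gets flagged, and the original data is never touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE super_table (
        sk INTEGER PRIMARY KEY, valid_flag TEXT DEFAULT 'V',
        source_file TEXT, symbol TEXT, trade_date TEXT, quantity REAL)
""")
rows = [
    ("a.csv", "IBM",  "2009-01-05", 100),
    ("b.csv", "IBM",  "2009-01-05", 100),   # same trade reported by two sources
    ("a.csv", "MSFT", "2009-01-06", 250),
]
conn.executemany(
    "INSERT INTO super_table (source_file, symbol, trade_date, quantity) "
    "VALUES (?, ?, ?, ?)", rows)

# One possible duplicate rule: same symbol, date, and quantity.
# Keep the lowest SK in each group; flag the rest as 'D'.
conn.execute("""
    UPDATE super_table SET valid_flag = 'D'
    WHERE sk NOT IN (SELECT MIN(sk) FROM super_table
                     GROUP BY symbol, trade_date, quantity)
""")
survivors = conn.execute(
    "SELECT sk FROM super_table WHERE valid_flag = 'V'").fetchall()
```

You'd run a pass like this once per duplicate rule; because the flagging keys off the SK, the passes never interfere with the clients' actual data.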
You can then start to consolidate the data, logically removing some rows by simply changing the "valid flag" that you added, and possibly creating whole new rows from groups of original rows (so if ten different data sources had information about one entity, you could leave the original rows intact with a valid flag indicating that they'd been merged, then create a whole new row from the merged entities).
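A sketch of that merge step, under the same stand-in assumptions (SQLite, made-up columns): two sources report quantities for the same entity, the originals are marked 'M' but kept, and one new consolidated row is created from the group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE super_table (
        sk INTEGER PRIMARY KEY, valid_flag TEXT DEFAULT 'V',
        source_file TEXT, symbol TEXT, quantity REAL)
""")
conn.executemany(
    "INSERT INTO super_table (source_file, symbol, quantity) VALUES (?, ?, ?)",
    [("a.csv", "IBM", 100), ("b.csv", "IBM", 50)])

# Mark the originals as merged ('M') but leave them in place...
conn.execute("UPDATE super_table SET valid_flag = 'M' WHERE symbol = 'IBM'")

# ...then build one brand-new row from the group of originals.
conn.execute("""
    INSERT INTO super_table (source_file, symbol, quantity)
    SELECT 'merged', symbol, SUM(quantity)
    FROM super_table WHERE symbol = 'IBM' GROUP BY symbol
""")
```

Nothing is ever deleted, so if a merge rule turns out to be wrong you can flip the flags back and try again.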
Once you've "laundered" the data, you can start thinking about how you want to extract the data to send it back to the client. Maybe one table/file, maybe many in different formats, as the need arises. The short answer is that you would then have the ability to manage the data as you see fit instead of staggering under the load of a gazillion files formatted a gazillion different ways!
Firstly, thanks for your response; a lot of it I've already started on.
This is where I'm at.
I already have a list of all the files that they use from their different clients.
I know how many columns and rows there are, as well as what each column contains.
They have a template for the final resulting layout they want to achieve.
There is no duplicate data across all 200+ million rows; however, there is a great deal of room for normalization, but I'm nowhere near that step yet.
The smaller Excel files are easy to manage/modify, but some of the clients have about 100 CSV files with >1.5 million rows each, making it difficult to pull them into any type of program that can edit the file and then re-export it.
Now, why am I trying to edit them and re-export them? Well, they don't want to send the off-site database host any data that isn't already in its final form, which ties my hands as to what utilities I can use when changing these files.
The data contains lists of stock purchases/sales; users simply search for matching stock symbols for different companies in order to file claims. So there's nothing big going on here, no editing of the tables; it's actually fairly simple, just a massive amount of data.
So what can I even use to start working with such large files? If I try to put them together as one, I'm looking at a single ~14 GB file, and that's pretty mind-boggling to manage without something like MySQL or MS SQL.
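For what it's worth, files that size don't have to fit in memory; any tool that streams rows will cope. A minimal Python sketch of the pattern, with SQLite standing in for the eventual MS SQL database and a small in-memory string standing in for one of the big CSVs:

```python
import csv
import io
import sqlite3

# Stand-in for one of the big client CSVs; with 1.5M+ rows the
# streaming pattern below is identical, just slower.
big_csv = io.StringIO("symbol,quantity\nIBM,100\nMSFT,250\n")

conn = sqlite3.connect(":memory:")  # a real database file works the same way
conn.execute("CREATE TABLE trades (symbol TEXT, quantity REAL)")

reader = csv.reader(big_csv)
next(reader)  # skip the header row

# executemany consumes the reader row by row, so the whole file is
# never held in memory at once.
conn.executemany("INSERT INTO trades VALUES (?, ?)", reader)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
```

With the real files you'd open each CSV with `open(path, newline="")` instead of the `StringIO`, and loop over the 100 files one at a time.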
There is a client name field already accounted for. I still feel uneasy about combining multiple clients' data together, but I guess it wouldn't much matter.
I hadn't thought of SPSS; I wonder where my copy is lying around.
I'll also look into SQL Server Express. That might be my best bet: query in such a way that it produces the resulting table I want, and then export that.
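The export half of that could look something like the following sketch (SQLite standing in for SQL Server Express again; the template column names are invented): query only the columns the final template calls for, then stream the result straight out to a CSV.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (client_name TEXT, symbol TEXT, quantity REAL, extra_col TEXT)")
conn.execute(
    "INSERT INTO trades VALUES ('AcmeFund', 'IBM', 100, 'not wanted in output')")

# Select only the columns the final template asks for, in template order.
out = io.StringIO()  # a real export would open a file instead
writer = csv.writer(out)
writer.writerow(["client_name", "symbol", "quantity"])  # template header
for row in conn.execute("SELECT client_name, symbol, quantity FROM trades"):
    writer.writerow(row)
```

Extraneous columns simply never appear in the SELECT, so the source tables stay untouched while the exported file matches the template.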
Express file size limit is 4 GB per database. Should be workable.
"Lisa, in this house, we obey the laws of thermodynamics!" - Homer Simpson
"I have my standards. They may be low, but I have them!" - Bette Middler
"It's a book about a Spanish guy named Manual. You should read it." - Dilbert