  1. #1
    Join Date
    Dec 2009
    Posts
    4

Unanswered: Need help to design - storing data from a delimiter-separated file

There is input coming from another source in the form of flat files. Each line is a record whose fields are separated by a delimiter.

    Example:
    A,3,453,BE,SOME DATA HERE, 0, 3, 9, 8, etc...

The system I'm building will need to store this data in a database table so the user can view and query it. There can be millions of records in each file, and over time the table will accumulate a lot of data that needs to be queried.

Inserting the data quickly is important, and indexes make inserts slower, so indexing looks like a bad trade: the loading would become slow. I have also heard that indexes don't necessarily help data retrieval in every case.

Users could ask to see matches on any field number against any particular value (e.g. where field #1 is A, and field #3 is 453 or 999, and field #8 is ABC or DDD).

    Two approaches:

    Approach #1
    Table with columns named: Field1, Field2, Field3, etc.

Pros: Fast and simple
    Cons: Will end up with 30+ columns; May get different types of files in the near future and require more columns in the table
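
    For reference, a minimal sketch of what Approach #1 could look like (all names here are invented), with the example query from above written out:

    CREATE TABLE file_record (
        record_id NUMBER PRIMARY KEY,
        field1    VARCHAR2(100),
        field2    VARCHAR2(100),
        field3    VARCHAR2(100),
        -- ... one column per delimited field, 30+ in total
        field8    VARCHAR2(100)
    );

    SELECT *
      FROM file_record
     WHERE field1 = 'A'
       AND field3 IN ('453', '999')
       AND field8 IN ('ABC', 'DDD');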

    Approach #2
Table with columns: RecordNumber and other record-level information
Table with columns: FieldNumber, FieldValue, RecordNumber (foreign key)
So, for each record, every field becomes its own row:
1, <field 1 value>
2, <field 2 value>

Pros: Normalized data structure; supports any number of input fields
Cons: Writing a query such as the one above turned out to be very slow (and ugly)


Currently, Approach #2 is in use, but searching is a nightmare. What would normally be a WHERE field1='somevalue' query becomes something like WHERE (SELECT value FROM FieldTable WHERE fieldNumber='1' ...) = 'somevalue'. Unless I am writing the query incorrectly, what would normally be a column is now stored as data in the table, and I have to pull it back out somehow.
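
    Written out, a query like the example above against this structure ends up roughly like the following, with one EXISTS per searched field (column names approximate):

    SELECT r.recordNumber
      FROM RecordTable r
     WHERE EXISTS (SELECT 1
                     FROM FieldTable f
                    WHERE f.recordNumber = r.recordNumber
                      AND f.fieldNumber  = 1
                      AND f.fieldValue   = 'A')
       AND EXISTS (SELECT 1
                     FROM FieldTable f
                    WHERE f.recordNumber = r.recordNumber
                      AND f.fieldNumber  = 3
                      AND f.fieldValue   IN ('453', '999'));

    Even in this shape it is slow once the table gets big, though it at least gives the optimizer a chance at an index on (fieldNumber, fieldValue, recordNumber) if one exists.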

I also ran into problems with much simpler queries on this table in general. The tables keep track of which file the data came from, so the application has ways to list the files and then view a file's contents. Initially, the contents were built by fetching each line's fields from the normalized table structure. That was slower than fetching the original line from the table and parsing the string myself!

    Another possibility?
Another idea I just thought of is storing the original data line and querying it with wildcards. The application would build the SQL based on what the user wants to search for, shaping it like a line record as it would appear in the file:
WHERE lineInfo LIKE 'A,%,345,%,SEARCH%'

But I'm worried that a query like that would take forever as well, especially if the user wants several values to match in a certain field (say, field 1 matching either A or B):
WHERE lineInfo LIKE 'A,%,345,%,SEARCH%' OR lineInfo LIKE 'B,%,345,%,SEARCH%'
I'm afraid searching in this manner would be difficult for the programmers and a nightmare to maintain.


Has anyone done something similar before? What is the best approach to a system like this that may have to support other input formats (with more fields or fewer)?
I'm about to change the data model to Approach #1 and store everything in one table, even though it's de-normalized, because it is faster. Unless there is another way that is both faster and normalized.

    I've asked "oracle experts" here and they don't really seem to be giving me good answers.
    Last edited by developer2010; 12-25-09 at 17:16.

  2. #2
    Join Date
    Aug 2003
    Location
    Where the Surf Meets the Turf @Del Mar, CA
    Posts
    7,776
    Provided Answers: 1
    >Approach #1
>Pros: Fast and simple
    >Cons: Will end up with 30+ columns; May get different types of files in the near future and require more columns in the table
    Better than #2 below

    >Approach #2
>Pros: Normalized data structure; supports any number of input fields
>Cons: Writing a query such as the one above turned out to be very slow (and ugly)
    "Normalized" only for when it means NOT Third Normal Form
    Another label for such perversion is Name, Value Pair
    You "could" build whole application with a single 3 column table; TABLE_NAME, COLUMN_NAME, COLUMN_VALUE.
    Of course coding it would be a nightmare & you would be hard pressed to design a WORSE performing application.

    Approach #3
    Keep data in multiple flat files that really correspond to Third Normal Form & match real DB tables.
    Define the flat file to be EXTERNAL TABLES so real DB tables can be loaded via
    INSERT INTO REAL_DB_TABLE1 SELECT * FROM EXTERNAL_TABLE1;
    You can lead some folks to knowledge, but you can not make them think.
    The average person thinks he's above average!
    For most folks, they don't know, what they don't know.
    Good judgement comes from experience. Experience comes from bad judgement.

  3. #3
    Join Date
    Dec 2009
    Posts
    4
Quote Originally Posted by anacedent
    >Approach #1

    "Normalized" only for when it means NOT Third Normal Form
    Another label for such perversion is Name, Value Pair
    You "could" build whole application with a single 3 column table; TABLE_NAME, COLUMN_NAME, COLUMN_VALUE.
    Of course coding it would be a nightmare & you would be hard pressed to design a WORSE performing application.

    Approach #3
    Keep data in multiple flat files that really correspond to Third Normal Form & match real DB tables.
    Define the flat file to be EXTERNAL TABLES so real DB tables can be loaded via
    INSERT INTO REAL_DB_TABLE1 SELECT * FROM EXTERNAL_TABLE1;

I am more of a coder than a database person, so can you please explain more about what you mean by Third Normal Form, and about defining the flat file as EXTERNAL TABLES? How will this help with querying each field?

  4. #4
    Join Date
    Aug 2003
    Location
    Where the Surf Meets the Turf @Del Mar, CA
    Posts
    7,776
    Provided Answers: 1
    >so can you please explain more what you mean by Third Normal Form
Third Normal Form is a process for deciding which fields belong in which individual tables.
If the data organization is suboptimal, SELECTs can be noticeably slower or can report inconsistent data.
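
    A made-up example of the idea: instead of repeating a customer's name on every order row, 3NF splits it into its own table.

    -- Not 3NF: customer_name depends on customer_id, not on the order key
    CREATE TABLE orders_flat (
        order_id      NUMBER PRIMARY KEY,
        customer_id   NUMBER,
        customer_name VARCHAR2(50),
        order_date    DATE
    );

    -- 3NF: each non-key column depends only on its own table's key
    CREATE TABLE customers (
        customer_id   NUMBER PRIMARY KEY,
        customer_name VARCHAR2(50)
    );

    CREATE TABLE orders (
        order_id    NUMBER PRIMARY KEY,
        customer_id NUMBER REFERENCES customers (customer_id),
        order_date  DATE
    );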

    >and by the flat file to be EXTERNAL TABLES... ???
    GOOGLE is your friend but only when you actually use it.
EXTERNAL TABLES are a way to allow SQL to query (SELECT only) an external flat file.

    >How will this help with querying each field?
SELECT does not recognize any difference between an EXTERNAL TABLE & a regular DB table.
    You really want to load the external data into regular tables to allow indexes to be built.
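
    E.g., using the table names above (the indexed column is just an example):

    -- direct-path load from the external table into a real table
    INSERT /*+ APPEND */ INTO real_db_table1
    SELECT * FROM external_table1;
    COMMIT;

    -- the index can only exist on the real table, never on the flat file
    CREATE INDEX real_db_table1_ix ON real_db_table1 (field1);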

  5. #5
    Join Date
    Dec 2009
    Posts
    4
    I found an example online:
SQL> create table xtern_empl_rpt
       ( empl_id          varchar2(3),
         last_name        varchar2(50),
         first_name       varchar2(50),
         ssn              varchar2(9),
         email_addr       varchar2(100),
         years_of_service number(2,0)
       )
       organization external
       ( default directory xtern_data_dir
         access parameters
         ( records delimited by newline
           fields terminated by ','
         )
         location ('employee_report.csv')
       );



The issue I see with this type of external table is that I would need one for each file type... what I mean is that one type of file may have 20 fields separated by a comma, another may have 30 fields separated by a |, etc.

    Guess that wouldn't really be too much of an issue.
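
    E.g., following the example above, a second definition for a hypothetical pipe-delimited layout would just change the column list and the terminator:

    create table xtern_pipe_rpt
    ( field1  varchar2(100),
      field2  varchar2(100),
      -- ... one column per field in this layout
      field30 varchar2(100)
    )
    organization external
    ( default directory xtern_data_dir
      access parameters
      ( records delimited by newline
        fields terminated by '|'
      )
      location ('pipe_report.txt')
    );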

  6. #6
    Join Date
    Aug 2003
    Location
    Where the Surf Meets the Turf @Del Mar, CA
    Posts
    7,776
    Provided Answers: 1
    >The issues I see with this type of external table is that I would need one for each file type
    Yes
    & if you were to use sqlldr instead you would have one control file for each file type.

If you have multiple internal tables, you have one DDL statement for each table.
If you have multiple flavors of external tables, you have multiple definitions to create and maintain, one per flavor.

    The reality is that both (sqlldr & EXTERNAL TABLE) use the same code base under the hood to "map" external data into the DB.
    Last edited by anacedent; 12-26-09 at 11:59.

  7. #7
    Join Date
    Dec 2009
    Posts
    4
I don't want a separate table for each individual file, though. Files keep coming in in the same format, and all of their data needs to be accessible via the same table. These CREATE TABLE statements seem to point at one specific file. I need it to point at any number of files in a directory (i.e. *.csv), whatever files happen to be there.
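
    From what I can tell from the docs, the LOCATION clause does take a list of files and can be changed with ALTER TABLE, so one definition could at least be re-pointed at each new batch (no wildcards like *.csv though, as far as I can see):

    alter table xtern_empl_rpt
      location ('employee_report1.csv', 'employee_report2.csv');

    select count(*) from xtern_empl_rpt;   -- now reads both files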

I also forgot that this application does more than just read data from these records. It has to read the lines of the file, do some analysis on them, and assign each a particular status. In that sense, it doesn't really make sense to query only the data, because the "status" has to be queried as well, and that status is in a "real table". Maybe the original method of saving each field in its own column is the better approach.

  8. #8
    Join Date
    May 2009
    Location
    India
    Posts
    66
The second approach is certainly more modular.

Considering that you say there can be millions of records in each file and the table will accumulate a lot of data over time, I presume the load happens in batch mode.

So maybe a combination of Approaches 1 and 2, i.e. build the table of Approach #2 for searching and modular growth, plus a dump of the record as per Approach #1 (with a record number added) to display the whole record.
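
    A sketch of that combined layout (names are only for illustration):

    -- Approach #1-style dump table: one row per record, for display
    CREATE TABLE record_dump (
        record_number NUMBER PRIMARY KEY,
        file_name     VARCHAR2(255),
        line_info     VARCHAR2(4000)    -- the original delimited line
    );

    -- Approach #2-style field table: one row per field, for searching
    CREATE TABLE record_field (
        record_number NUMBER REFERENCES record_dump (record_number),
        field_number  NUMBER,
        field_value   VARCHAR2(100)
    );

    CREATE INDEX record_field_ix
        ON record_field (field_number, field_value, record_number);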

If it is a batch-mode insert, I would drop (or disable) the index during insertion and rebuild it after the load. This may also flatten out the index.
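
    In Oracle that can be done without dropping the index outright, e.g. with the index from the sketch above:

    ALTER INDEX record_field_ix UNUSABLE;

    -- ... bulk insert here; with SKIP_UNUSABLE_INDEXES set to TRUE
    --     (the default in 10g and up, I believe) the inserts proceed
    --     while the index is unusable ...

    ALTER INDEX record_field_ix REBUILD;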

    I would also add a file name to the "index" table.

    End
