  1. #1
    Join Date
    Nov 2003
    Posts
    2

    Unanswered: db design for many column

    Greetings all. I've just joined this forum, and here I am already asking questions.

    I have a row-column text-file to convert to a web-searchable database. It looks like this:

    user-1 user-record-1 col1 col2 col3... col50
    user-1 user-record-2 col1 col2 col3... col50
    ...
    user-1 user-record-500 col1 col2 col3... col50
    user-2 user-record-1 col1 col2 col3... col50
    user-2 user-record-2 col1 col2 col3... col50
    ...
    user-2 user-record-500 col1 col2 col3... col50

    There are 20,000 users. Each user has 500 user-records of 50-columns. Each column may contain a single digit variable from 1 to 9, or be empty.

    As-is, that makes:

    - 500 × 50 = 25,000 possible unique user-record/column combinations per user, each able to hold a variable from 1 to 9. Another table lists a unique 5-digit code for each of these possible combinations, but not every code is necessarily used. The number of filled-in columns generally ranges from 5,000 to 15,000.

    - 20,000 × 500 = 10 million rows in a 50-column table of single-digit or blank entries.

    90% of queries will be of the form

    SELECT COUNT(*) WHERE user-record = 'y' AND column-a = 'b'

    where y, a, and b are specified by the user through an HTML form.

    My concern is optimized database architecture. I see four possibilities

    1. An as-is copy.

    2. Putting the 500-user-records x 50-columns per user into one single record per user of 2,500 columns, each of which may contain a 1-9 digit or be blank. Result is a 20,000 x 2,500 table containing 1-9.

    3. Using the 25,000 unique combination identifier codes to create one column per possible combination and just enter "1" in corresponding columns. Result is a 20,000 x 25,000 table containing blank or "1".

    4. A single "column codes" column per record, filled with a list of the 5-digit codes of all columns that were filled-in. Result is a 20,000 x 2 (primary key + code list) table containing lists of up to ~15,000 5-digit codes.

    Not asking people to give me the answer. Would just like a quantitative method to determine the optimal choice, or at least an idea of the critical factors. Thanks, all.
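
    For a first quantitative cut, the raw cell counts implied by the four layouts can be computed directly. A back-of-envelope sketch in Python, using only the figures quoted above (the variable names are mine, and option 4's figure is the worst-case list length, not a cell count in the same sense):

    ```python
    # Back-of-envelope cell counts for the four candidate layouts,
    # using only the figures quoted in this post. Illustrative
    # arithmetic only -- not a substitute for benchmarking.
    users = 20_000
    records_per_user = 500
    cols = 50
    codes = records_per_user * cols          # 25,000 record/column combinations

    option1 = users * records_per_user * cols   # as-is: 10M rows x 50 cols
    option2 = users * records_per_user * cols   # 2,500-wide rows: same cells, fewer rows
    option3 = users * codes                     # one column per code, mostly blank
    option4_max = users * 15_000                # key + code list, worst-case entries

    for name, cells in [("as-is", option1), ("wide", option2),
                        ("code columns", option3), ("code list", option4_max)]:
        print(f"{name}: {cells:,}")
    ```

    Note that options 1, 2, and 3 all imply the same 500 million cells; they differ in row count, sparsity, and how hard each is to index, which is where the real cost lives.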

  2. #2
    Join Date
    Sep 2002
    Location
    Montreal, Canada
    Posts
    224
    Provided Answers: 1

    Re: db design for many column

    Originally posted by punchard_don
    My concern is optimized database architecture. I see four possibilities

    1. An as-is copy.

    2. Putting the 500-user-records x 50-columns per user into one single record per user of 2,500 columns, each of which may contain a 1-9 digit or be blank. Result is a 20,000 x 2,500 table containing 1-9.

    3. Using the 25,000 unique combination identifier codes to create one column per possible combination and just enter "1" in corresponding columns. Result is a 20,000 x 25,000 table containing blank or "1".

    4. A single "column codes" column per record, filled with a list of the 5-digit codes of all columns that were filled-in. Result is a 20,000 x 2 (primary key + code list) table containing lists of up to ~15,000 5-digit codes.

    Not asking people to give me the answer. Would just like a quantitative method to determine the optimal choice, or at least an idea of the critical factors. Thanks, all.
    Perhaps if you would explain what this database is trying to represent we could help you more. I really doubt this flat table (with multiple records for the same entity) is optimal at all. Besides, a table with 50 columns is often a flag for poor normalization.

    At first glance, you could have a separate table for those 50 columns with an FK pointing to the userid. But as I said, if you could explain what you're trying to do and send us the DESCRIBE for the tables, we'd be able to understand your problem better and provide more help.

  3. #3
    Join Date
    Nov 2003
    Posts
    2
    I'm afraid there is no DESCRIBE statement yet because no tables exist yet - just the original flat-file.

    How else to describe it.....

    Imagine this: the database holds students' school exam results. By the end of their scholastic career, each person has taken 500 exams. Every exam had 50 multiple-choice questions, each answered with a digit from 1 to 9. We have records for 20,000 students of the form:

    student1, exam1, answer1, answer2,..., answer50
    student1, exam2, answer1, answer2,..., answer50
    ...
    student1, exam500, answer1, answer2,..., answer50

    student2, exam1, answer1, answer2,..., answer50
    student2, exam2, answer1, answer2,..., answer50
    ...
    student2, exam500, answer1, answer2,..., answer50

    Now we want to compile statistics on who answered what. What percentage of the students who gave answer y to question x on exam z also gave answer c to question b on exam d? In other words

    SELECT COUNT(*) WHERE response-to-question-x-on-exam-z = y AND response-to-question-b-on-exam-d = c

    and we want to do this over the web letting researchers specify the questions and answers they query on.

    We could:

    1. Design the database exactly as the flat-file and take the intersection of

    SELECT COUNT(*) WHERE (exam=z AND answeri=y)
    with
    SELECT COUNT(*) WHERE (exam=d AND answerj=c)

    2. Append all the records for a given student to make one 2,500 column record per student and search

    SELECT COUNT(*) WHERE examz_answerx = y AND examd_answerb = c

    3. Give each of the possible 25,000 answers to each multiple-choice question a unique 5-digit identifier and its own column, then just signify if that answer was given by a "1" in its column, and search

    SELECT COUNT(*) WHERE ansxxxxx = 1 AND ansyyyyy = 1

    4. Your suggestion: make a separate table for each possible answer with studentID as the key, then take the intersection of searches.

    5. Option (4) makes me think of doing the same but only with question numbers, leaving the multiple-choice answer value in the tables and searching for the desired answer.

    But as I said, I'm hoping for some method or analysis by which to make/justify the choice. "I think XXX would be best" won't impress my boss much. Is there a way to choose - preferably quantitatively? Or does one just have to build, benchmark, and repeat?
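
    On the build-benchmark-repeat route, the harness can be small. A minimal sketch using Python's built-in sqlite3 as a stand-in engine and a scaled-down synthetic sample, just to show the shape of a fair comparison (all names, sizes, and data here are illustrative, not a prescription):

    ```python
    import sqlite3
    import time

    # Minimal benchmark skeleton: load a scaled-down sample into a
    # candidate schema, time the dominant query shape, and compare
    # candidates (or index strategies) on equal footing. sqlite3 is
    # used purely for illustration; substitute the real engine and
    # realistic data volumes before drawing conclusions.
    def time_query(con, sql, params, runs=100):
        start = time.perf_counter()
        for _ in range(runs):
            con.execute(sql, params).fetchone()
        return (time.perf_counter() - start) / runs

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE answers "
                "(userid INT, examnum INT, answernum INT, value CHAR(1))")
    # Synthetic sample: 100 students x 10 exams x 50 questions.
    con.executemany("INSERT INTO answers VALUES (?, ?, ?, ?)",
                    ((u, e, q, str(1 + (u + e + q) % 9))
                     for u in range(100) for e in range(10) for q in range(50)))

    q = ("SELECT COUNT(*) FROM answers "
         "WHERE examnum = ? AND answernum = ? AND value = ?")
    unindexed = time_query(con, q, (3, 7, "2"))
    con.execute("CREATE INDEX ansidx ON answers(examnum, answernum, value)")
    indexed = time_query(con, q, (3, 7, "2"))
    print(f"unindexed {unindexed:.6f}s, indexed {indexed:.6f}s per query")
    ```

    The same harness re-run against each candidate layout, at a few sample sizes, gives numbers rather than opinions to put in front of the boss.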

    Thanks again.

  4. #4
    Join Date
    Oct 2003
    Location
    Denver, Colorado
    Posts
    137
    I think that an answer table that looks like this would be good:

    CREATE TABLE answers (
    userid INT NOT NULL,
    examnum INT NOT NULL,
    answernum INT NOT NULL,
    value CHAR(1) NOT NULL
    );
    ALTER TABLE answers ADD INDEX useridx(userid);
    ALTER TABLE answers ADD INDEX examidx(examnum);
    ALTER TABLE answers ADD INDEX ansidx(answernum);
    ALTER TABLE answers ADD INDEX validx(value);

    This way your SQL statements would be simple and you would avoid the excessive number of columns. This will give you a normalized database and better performance. Your queries will look like this:

    SELECT COUNT(*) FROM answers
    WHERE examnum = 1 AND answernum = 2 AND value = 'b';

    Having an index on each of the columns will allow very fast lookups. It would be unreasonable to create indices on each of the answer columns in the many-column table.
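
    As a quick sanity check of this layout, here is a minimal sketch using Python's built-in sqlite3 purely as a stand-in engine. It also shows that the two-question correlation from the earlier post falls out as a self-join on the same table (the sample data and the composite index are my own illustration, not part of the suggestion above):

    ```python
    import sqlite3

    # Sketch of the normalized answers table; names follow the
    # CREATE TABLE above, everything else is illustrative.
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE answers (
        userid INT NOT NULL,
        examnum INT NOT NULL,
        answernum INT NOT NULL,
        value CHAR(1) NOT NULL)""")
    con.execute("CREATE INDEX ansidx ON answers(examnum, answernum, value)")

    # Three students: two give the same answers to both questions of interest.
    con.executemany("INSERT INTO answers VALUES (?, ?, ?, ?)", [
        (1, 5, 10, "3"), (1, 7, 2, "9"),
        (2, 5, 10, "3"), (2, 7, 2, "9"),
        (3, 5, 10, "3"), (3, 7, 2, "1"),
    ])

    # Single-condition count, as in the query above:
    (n,) = con.execute(
        "SELECT COUNT(*) FROM answers "
        "WHERE examnum = 5 AND answernum = 10 AND value = '3'").fetchone()
    print(n)  # 3

    # Two-condition correlation via a self-join on userid:
    (m,) = con.execute(
        "SELECT COUNT(*) FROM answers a JOIN answers b ON a.userid = b.userid "
        "WHERE a.examnum = 5 AND a.answernum = 10 AND a.value = '3' "
        "  AND b.examnum = 7 AND b.answernum = 2  AND b.value = '9'").fetchone()
    print(m)  # 2
    ```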

    Getting your data into that structure from the flat-file will take a little work. I would import it into a table that looks exactly like the flat-file, then issue a series of fifty queries that look like this:

    INSERT INTO answers
    SELECT userid, examnum, %i, answer%i FROM tmpanswers;
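
    That series of fifty queries can be sketched concretely. A toy version using Python's built-in sqlite3, with three answer columns standing in for the real fifty (table names follow the post; the sample data and the blank-skipping filter are my own illustration):

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    # Toy version of the flat import table: 3 answer columns
    # standing in for the real 50.
    con.execute("""CREATE TABLE tmpanswers (
        userid INT, examnum INT,
        answer1 CHAR(1), answer2 CHAR(1), answer3 CHAR(1))""")
    con.execute("""CREATE TABLE answers (
        userid INT, examnum INT, answernum INT, value CHAR(1))""")
    con.executemany("INSERT INTO tmpanswers VALUES (?, ?, ?, ?, ?)",
                    [(1, 1, "4", "7", "2"),
                     (2, 1, "4", "",  "9")])  # student 2 left answer2 blank

    # One INSERT ... SELECT per answer column, skipping blanks:
    for i in (1, 2, 3):
        con.execute(f"""INSERT INTO answers
            SELECT userid, examnum, {i}, answer{i}
            FROM tmpanswers WHERE answer{i} <> ''""")

    total = con.execute("SELECT COUNT(*) FROM answers").fetchone()[0]
    print(total)  # 5 -- blanks never become rows
    ```

    Skipping blanks this way also means the normalized table only holds the 5,000 to 15,000 filled-in entries per user, not all 25,000 slots.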

