I'm planning to design a website, but I haven't yet decided which technologies and data persistence schemes to choose, so I'm looking forward to your suggestions.
The owner of the website requires that
a) it must support a great deal of concurrent database write activity: roughly 1,000 writes per second, each producing a unique record. There could be 10 tables in the database at present to hold all these new records. These tables average 50 columns, and every column's data type is either VARCHAR (fewer than 15 characters), INT, or DATE.
b) each database table can be created dynamically. Every user can create his own table in the browser and specify its number of columns, so each table is owned by a single user. For instance, the 10 tables above might be used within one day to collect all the information required by 10 users; once that is done, each table is analyzed to produce a well-formatted XML file that only its owner can access, and the tables can be dropped after the users retrieve their XML files. The next day, perhaps another 12 users will create 12 tables to gather their information, and the same lifecycle repeats.
So, there are various technology and persistence options available, such as Java, Ruby on Rails, Hibernate (or other O/R mapping tools), db4o, etc. Which is the best fit for this project?
It is pretty hard to create this kind of design "from whole cloth" where we have very little background to guide us in giving you suggestions. Can you give us some more background on what the "real world" usage of the site will be?
You'll probably get very different answers if you are a primarily Unix shop running military operations than you would for a Linux shop marketing porn or a Microsoft shop tracking sales of spices to convenience stores...
I can't speak for everyone, but I know that I need to have a better handle on what you have now, and where you want to get before I can start to formulate an answer that will help you to get there.
I'm still struggling with an application that would generate nearly 3,600,000 rows of data per hour. That's roughly 10% of the highest transaction volume system I've ever designed, and it is far larger than most banks and approaches the peak hourly transaction rates of the largest web-retailers.
I am sorry for not explaining the background of this project clearly. The primary prototype is an online standardized examination for all the students in two provinces. (If it runs well, more provinces may participate.)
The students are divided into 6 grades, from junior high school grade one to senior high school grade three. Each grade has about 10,000 students per province, so the total number of students in the area is close to 120,000 (6 × 10,000 × 2). Each grade has at least 5 subjects, which differ from the other grades', so there are 30 subjects (5 × 6) in total. Each subject is tested once every seven weeks.
A given subject, such as mathematics for junior high school grade one, has two examiners, one representing each province, so there are 60 examiners in total (5 subjects × 6 grades × 2 provinces). For example, the examiner responsible for the grade-two junior high mathematics test in province A can choose a difficulty level (easy, medium, or hard) and a test duration (40, 50, or 60 minutes) on a page of this website to randomly generate a test paper from a designated question bank. This paper will contain 30 to 70 single-choice or multiple-choice questions, formatted in HTML. For the sake of security and fairness, no examiner does this until a quarter of an hour before the test starts. So the project needs to create an answer table dynamically for each test to store the answers of every examined student.
A possible scenario would be
Task: 10 examiners create 10 test papers.
1) Students in junior high school grade two in the province A will take the mathematics test.
The test paper might contain 61 questions and 10,000 students will take it online.
2) Students in senior high school grade three in the province B will take the history test.
This test paper might contain 52 questions and 10,000 students will take it online.
The number of tasks in this time period can reach 10, corresponding to the 10 examinations, with 100,000 students online at the same time. The peak of database write activity will occur near the end of the test, around 8:40. As a rule of thumb, up to 1,000 students may press the Submit button simultaneously.
The period from 8:45 to 9:40 AM repeats the same process to test another 12 subjects.
A student of junior high school grade two might take 3 tests today and the other 2 tomorrow.
I know that if I create database tables on the fly, I will go against many fundamentals of database design, make things more complex, and reduce efficiency. So I am eager to hear your suggestions.
I don't see why you cannot have existing tables that are populated with the contents of the tests. I think what you are becoming entangled in is the first normal form: "repeating groups". The application is not complicated if you employ proper normalization.
Explain how the "dynamically created tables" will look and we'll go from there.
You are right. I have not figured out how to use a predefined table to store the answers of all the examined students for a given subject test, since the number of questions varies from test to test. I am pondering how to use existing tables without violating 1NF/2NF/3NF normalization, but so far I have no idea.
I have decided to generate an answer table on the fly after an examiner creates a specific subject's test paper. For example, if the paper contains 45 questions, I would use DDL to create an answer table with 46 columns: the first column holds the student_ID, and the others hold the answers to the 45 questions. The table is dedicated to gathering all the answers for that test paper. This solution also breaks many principles of sound database design. The database design of this project baffles me. Please help me out.
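To make the idea concrete, here is roughly what that DDL generation would look like (sketched in Python with sqlite3; the column names a1..a45, the table-name pattern, and the 45-question count are all just illustrative):

```python
import sqlite3

def create_answer_table(conn, test_id, num_questions):
    # One column per question (a1..aN, all short VARCHARs) plus a leading
    # student_id column -- num_questions + 1 columns in total.
    cols = ", ".join(f"a{i} VARCHAR(15)" for i in range(1, num_questions + 1))
    ddl = f"CREATE TABLE answers_{test_id} (student_id INT PRIMARY KEY, {cols})"
    conn.execute(ddl)

conn = sqlite3.connect(":memory:")
create_answer_table(conn, test_id=7, num_questions=45)

# The generated table has 46 columns: student_id plus a1..a45.
ncols = len(conn.execute("SELECT * FROM answers_7").description)
print(ncols)  # 46
```

This is exactly the "repeating groups" shape that violates 1NF, which is why the replies below steer away from it.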
You're getting far too bogged down in the numbers. Dynamic DDL is not the way to go, IMO.
You need to take a look back at your ideas and decide on your entities.
Students, Exams, Questions ...
1 student can take many exams
1 exam can have many questions
It looks like you have a "survey" type situation where there are several surveys (exams/tests) and you need to store the results of the survey.
Looks like there are some key entities:
Questions, qualified by grade, degree of difficulty, province (?), etc.
Tests, which carry information on the examiners who set them, the degree of difficulty, grade level, province, duration, start_time, etc.
Test_Questions, which identifies the questions selected for a test
Students, with id, name, grade, etc.
TestAnswers (test_id, question_#, correct_answer, etc.), storing the correct answers for each test
StudentAnswers (student_id, test_id, question_#, selected_answer, ...)
That should be able to handle your requirements as I understand them.
To find a student's performance, simply count the number of answers the student selected that match the solutions stored in the TestAnswers table.
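A minimal sketch of that design (Python with sqlite3 here; the table and column names follow the post above, but the sample data and the 3-question test are purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students        (student_id INT PRIMARY KEY, name TEXT, grade INT);
CREATE TABLE tests           (test_id INT PRIMARY KEY, subject TEXT, grade INT,
                              province TEXT, duration_min INT);
CREATE TABLE test_questions  (test_id INT, question_no INT, question_id INT,
                              PRIMARY KEY (test_id, question_no));
CREATE TABLE test_answers    (test_id INT, question_no INT, correct_answer TEXT,
                              PRIMARY KEY (test_id, question_no));
CREATE TABLE student_answers (student_id INT, test_id INT, question_no INT,
                              selected_answer TEXT,
                              PRIMARY KEY (student_id, test_id, question_no));
""")

# A 3-question test; student 42 gets questions 1 and 3 right.
conn.executemany("INSERT INTO test_answers VALUES (1, ?, ?)",
                 [(1, "A"), (2, "C"), (3, "B")])
conn.executemany("INSERT INTO student_answers VALUES (42, 1, ?, ?)",
                 [(1, "A"), (2, "D"), (3, "B")])

# Score = count of selected answers matching the stored solutions.
score = conn.execute("""
    SELECT COUNT(*)
    FROM student_answers sa
    JOIN test_answers ta ON ta.test_id = sa.test_id
                        AND ta.question_no = sa.question_no
    WHERE sa.student_id = 42 AND sa.test_id = 1
      AND sa.selected_answer = ta.correct_answer
""").fetchone()[0]
print(score)  # 2
```

Note that every table here is fixed at design time; a new test is just new rows, never new DDL.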
I like Ravi's answer, but I'm a bit lazier... I'd just create a "key" student with the correct answers for every test, and I'd key the grade to the test, not to the student (students move from grade to grade, and more than a few students take classes from different grades in the same term where I live).
Oh yeah, watch out for multiple choice questions... Those can make things complicated.
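The "key student" shortcut can be sketched the same way: keep the correct answers as rows belonging to a reserved student id (0 here is an arbitrary choice, not anything from the posts above) and score with a self-join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE student_answers (
    student_id INT, test_id INT, question_no INT, selected_answer TEXT,
    PRIMARY KEY (student_id, test_id, question_no))""")

KEY_STUDENT = 0  # reserved id whose rows hold the correct answers for each test

# Correct answers for test 1, then a real student's submission (2 of 3 right).
conn.executemany("INSERT INTO student_answers VALUES (?, ?, ?, ?)",
                 [(KEY_STUDENT, 1, 1, "A"), (KEY_STUDENT, 1, 2, "C"),
                  (KEY_STUDENT, 1, 3, "B"),
                  (42, 1, 1, "A"), (42, 1, 2, "D"), (42, 1, 3, "B")])

# Self-join against the key student's rows and count the matches.
score = conn.execute("""
    SELECT COUNT(*)
    FROM student_answers s
    JOIN student_answers k ON k.student_id = ?
                          AND k.test_id = s.test_id
                          AND k.question_no = s.question_no
    WHERE s.student_id = 42 AND s.test_id = 1
      AND s.selected_answer = k.selected_answer
""", (KEY_STUDENT,)).fetchone()[0]
print(score)  # 2
```

One table fewer to maintain, at the cost of keeping the reserved id out of ordinary student queries.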
I was tied down by columns and rows before. You understand my predicament well, and your suggestions are helpful.
Ravi, I think your approach uses existing tables and conforms to 1NF, especially the predefined StudentAnswers table that collects every student's answers. But I am still unsure about performance, both for queries and for concurrent database writes.
Every student takes 5 subjects, and the average test paper (one per subject) has about 50 questions, so each student will create 250 (5 × 50) new answer records in the StudentAnswers table. When all the tests are done (within 2 days), 15,000,000 answer records (250 records per student × 10,000 students per grade × 6 grades) will be stored in the table. To get a certain student's score in a given subject, we would first use test_id and student_id as query conditions to find the target 50 records among the 15,000,000, and then compare them with the "key student" (following PatP's design). The efficiency of this query worries me. Alternatively, we could predefine 30 SubjectAnswers tables (5 subjects per grade × 6 grades), one per subject, with basically the same schema as StudentAnswers but without the test_id column; splitting the records across 30 tables would speed up queries. On the other hand, the drawback is that it requires more design and maintenance work.
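For what it's worth, a 50-out-of-15,000,000 lookup is exactly what a composite index is for; with an index on (test_id, student_id) the database seeks straight to those rows instead of scanning the table, so the 30-table split may not be needed. A quick sketch (sqlite3; the index name is mine, not from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE student_answers (
    student_id INT, test_id INT, question_no INT, selected_answer TEXT)""")
conn.execute(
    "CREATE INDEX idx_test_student ON student_answers (test_id, student_id)")

# Ask the planner how it would fetch one student's answers for one test.
plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM student_answers
    WHERE test_id = 5 AND student_id = 42""").fetchall()
# The plan detail should mention idx_test_student (an index search, not a scan).
print(plan[0][-1])
```

The same idea applies to any full-scale RDBMS; the exact EXPLAIN syntax varies, but the composite index works the same way.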
If 1,000 students press the Submit button at the same time and each has answered 50 questions, then, because each answer becomes a single row in the StudentAnswers table (or in the 30 SubjectAnswers tables), 50,000 new answer records need to be stored at once. How should these large bursts of writes be handled?
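One common way to absorb a burst like that is to write each submission as a single batched transaction rather than 50 separate INSERTs. A sketch with sqlite3 (names illustrative; a production setup would also put a connection pool or queue in front of the database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE student_answers (
    student_id INT, test_id INT, question_no INT, selected_answer TEXT)""")

def submit_paper(conn, student_id, test_id, answers):
    # One transaction per Submit click: all 50 answers commit (or fail) together.
    rows = [(student_id, test_id, q, a) for q, a in answers]
    with conn:  # implicit BEGIN ... COMMIT
        conn.executemany(
            "INSERT INTO student_answers VALUES (?, ?, ?, ?)", rows)

# Simulate the peak: 1,000 students each submitting 50 answers.
for sid in range(1000):
    submit_paper(conn, sid, 1, [(q, "A") for q in range(1, 51)])

total = conn.execute("SELECT COUNT(*) FROM student_answers").fetchone()[0]
print(total)  # 50000
```

Batching turns 50,000 round trips into 1,000 commits, which is well within what the write rates discussed earlier in the thread imply.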
PatP, would you give me more advice on multiple-choice questions? I am all ears.