I am new to this forum, so I hope my question is not inappropriate?
Firstly, just to let you know that my question does not relate to solving a problem for a commercial enterprise, we are an epidemiological research group trying to conduct research in developing countries, and on a shoestring budget. I am a c++ programmer, but have a little knowledge of databases.
We have to develop a database that can store the results / responses to questions in a text format (rtf, word, txt whatever is best). Each document is the questions and responses of the person participating in the research. Each answer may be 1 or two paragraphs long, and one document may be 5 – 10 pages long. There may be as many as 5 – 8,000 documents in the end. Each document represents the responses of one individual to our questionnaire, and each document will be associated with a list of properties, date, time, location, age etc. We need to be able to have our data collectors go through each document and highlight portions of text and associate that text portion with any of a list of keywords. At a later date, for the analysis, we need then to be able to produce reports either something like, how many answers to a question associated with one of the keywords we highlighted did males of age 35 – 40 give, for example. So relating the document properties with our keyword tagged text portions. The second thing we need to do, is to output together all portions of text associated with one keyword into a single separate document/ report.
So, as in my experience to solve such problems, the most critical decisions are made right at the start, my question is what route, in terms of programming languages, database management systems would be best to handle such a task? Something good at handling textual data I guess.
Just to note we have tried out some proprietary software, and although it worked it began to seize up as documents were added.
I hope someone can help, or point us in the right direction for an answer?
Sorry for the long posting!
to answer your question I am looking to write or develop a database that will store the documents in the formats we save them in. Together with the attributes of each document (Time, location, age etc) and to store the keywords for sections of text within the documents. This rather than have a some software read them into the database. Although I hadnt really considered that as a viable option.
Its not the kind of database I have worked on before, as it deals with text rather than quanative data. I could try to make a start on this in C++, but it might not work. I would just like to get advice on the right route to go down before I waste alot of time exploring the wrong option. We are struggling to get answers to this question.
OK, it sounds like you're just after a document storage solution. Most solutions like this allow you to assign custom properties to the document, so you could include all your demographic data in the fields and just link the document to the record.
Did you have a specific database in mind? If you're on a tight budget then something like MySQL would be ideal. I think that you'd probably exceed 4GB of data if you're storing the documents themselves in the database so that rules out Oracle Express.
I had considered MySQL as an option, and yes I am sure there will be more than 4GB of data. I wasn't sure how well MySQL handled this kind of text data. And I wasnt sure how it could provide the function to highlight portions of the text and associate them with a keyword. I also am not sure if my explanation of that part is clear, so just to give an example we might have a part of the questionnaire that asks the respondent a question and record the answer as a transcript of the answer eg.
Q. What are your favorite types of weather?
A. " Well I like it very much in summer if it is sunny,warm and dry, but not too hot. But also in winter I like very cold frosty days when the ground is hard and the sky is blue. If I am inside my home at night I also like the sound of heavy wind and rain."
This is just a simple (made up) example, but our data collectors would then go through all the questionnaires they had collected that day and highlight from within the application, the underlined sentence and tag it with the keyword 'cold'. The application would have to remember that link. We could then at a later date query the database to determine how many men aged 20 -30 like cold weather.
Obviously what we will do is more complex than that, but we are trying not to loose important pieces of data that would not be captured by just having tick boxes in the questionnaire. In this case for example many people might say also that the sky has to be blue for them to like cold weather. We want the respondents to be as free as possible to answer.
We could also then produce a report of all the parts of answers for that question with the keyworde cold, so the application would have to pull all the tagged text into one output, we could look through this and look for important trends, like it also has to be frosty.
I was not sure if MySQL could handle that part of it?
I think we are suffereing from the fact that the proposed design is a bit unusual in terms of databases!
I had considered MySQL as an option, and yes I am sure there will be more than 4GB of data. I wasn't sure how well MySQL handled this kind of text data.
The file would get loaded into a field like a BLOB or whatever the MySQL data type is, longtext or longblob or something like that.
Originally Posted by AndyPAul
And I wasnt sure how it could provide the function to highlight portions of the text and associate them with a keyword.
That's the job of the application, not the database. That bit I'm afraid I can't help with.
It's an interesting project, but as I'm not a programmer I wouldn't have a clue how to create an application that could open all the types of document you want to be able to read actually inside it and then enable you to highlight the text and tag it.
Have you thought about copying the responses from each of the documents into the database. That way in addition to adding keywords or 'tags' to a record, you could use full search capability.
Traditionally, storing the file in the database was avoided in favor of keeping the file in a directory on the operating system and only storing the location of the file in the database.
This allows the Operating System and the Database to concentrate on what they do best, and also allows the data load to be split between servers.
That said, SQL Server 2008 has a new FileStream object designed specifically for storing data files.
If it's not practically useful, then it's practically useless.
Thanks, this is an interesting discussion and I feel like I am making progress.
I had considered perhaps copying and pasting the 'sections of text into a field' but we thought that might be a bit more laborious than just the highlighting and linking we proposed. And the users will have alot of work to do anyway. But its interesting that there might be better search capabilities with that idea.
Perhaps a possible way is to make it some kind of hybrid with MySQL handling the data part and an application written in C++ to handle the other part. Although I am more familair with programming C++ applications for number crunching.
They would have to work together, I guess the application would have to pass the document attributes, and ID together with the location of the highlighted text to the MySQL database together with the keyword to be associated. This way the database is storing a keyword and 'address' of the 'blob' of text. Humm!... maybe its better the application passes the actual blob of text, keyword etc (copy and paste a you suggest), into the database.
Something to think about.