  1. #1
    Join Date
    Aug 2007
    Posts
    4

    database suggestions

    Hello,
    I joined this forum in the hope that someone can help me out. I run a small research-related business. Over the past 5 years we have collected over 5,000 PDF documents of scientific articles. Each of these papers was paid for, and together they have become quite a valuable asset. Unfortunately, we do not have any software or database tools to recognize the content of these PDFs and let us search them for future use. Can someone suggest a program/software that can do this? Thank you very much.

  2. #2
    Join Date
    Jul 2003
    Location
    Michigan
    Posts
    1,941
    If you buy the full version of Adobe Acrobat, you can open those PDFs and save them as RTFs (and other formats).
    Inspiration Through Fermentation

  3. #3
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Coach me a bit here; I don't quite understand what you want to get at the end of this project.

    For the purposes of this discussion, I'm assuming that you are using some flavor of Microsoft Windows. These concepts work just fine in other operating systems, but you'd use different tools and methodologies to get the same results.

    If you copy all of the PDF files into a single common directory, you can search them to see which files contain specific words and phrases. This will achieve 99% of what the users I know have needed in the past.

    If that is not sufficient, you can add tools to your server to allow more sophisticated searching. This would include add-on tools like Microsoft's Full Text Search that allow you to do KWIC-like scanning of specified files and directories.

    You can take this further with server-based tools similar to Google's "Desktop Search" that allow much more sophisticated searching, but those kinds of tools need both familiarity with the file content and some training in order to be used productively. Like any power tool, they do more (be that good or bad) than the simple tools can.
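    To make the "search one common directory" idea concrete, here is a minimal sketch in Python. It assumes the PDFs' text has already been extracted to plain `.txt` files sitting alongside them (for example via Acrobat's export mentioned above); the function name and the `.txt` convention are my own illustration, not a tool from this thread.

    ```python
    import os

    def find_matches(root, phrase):
        """Walk a folder tree and report files whose extracted text
        contains the phrase (case-insensitive)."""
        phrase = phrase.lower()
        matches = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.lower().endswith(".txt"):
                    continue  # only pre-extracted text, not the PDFs themselves
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if phrase in f.read().lower():
                        matches.append(path)
        return sorted(matches)
    ```

    A call like `find_matches(r"C:\archive", "vitamin c prevent cold")` would return just the archive files that mention the phrase, without dragging in the rest of the server.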

    Give us some feedback on just what you've got and what more you want to get out of it, and I'd bet that one or more of us can help!

    -PatP

  4. #4
    Join Date
    Aug 2007
    Posts
    4
    Thanks RNG and PatP.

    I will try to be a little more specific this time. The PDFs have random file names; there is no structure to them. One can be called "14435" and another "Dr_Venger_2003." The files are spread out over several directories on our server, with no central location.

    Some of the PDFs are already searchable. With Windows Search or Google Desktop, we can type in something like "vitamin c prevent cold" and some PDFs will pop up that include those terms in the actual text. The search results, however, usually show only the file name, which forces us to open each result and check the title of the actual article to see if it's relevant.

    Additionally, a lot of our PDFs aren't searchable this way. Some originated as scanned documents, and as far as I know these simple search tools can't recognize the text within scanned files. Also, if we conduct a search using Windows or Google, the search tools search all of our files. All of our work involves science/nutrition-related terms, so if you search "vitamin c prevent cold" you get not only the searchable PDFs that contain those terms but also all of the Word, PDF, Excel, etc. files that we produced ourselves. Looking through tens to hundreds of these results and opening each one to find the paper you're looking for is just as bad as reordering and repurchasing the paper in terms of time wasted!

    Ideally, we want a system that will let us search across all PDFs on our server, scanned or not, and produce results that show the title of the scientific article. As an example, I would search "vitamin c cold," quickly look through the titles of the results, see "Double blind, placebo controlled study on the effects of vitamin c on cold prevention," click on it, and boom! The PDF opens and I print it off.

    We have the full version of Adobe Acrobat, but going through all of our PDFs and saving them as RTFs is very time consuming. I hope there is a better solution. Also, as far as I know, Adobe won't convert scanned articles into RTF.

    I hope I'm being clear. Once again, thanks for the quick reply. It's amazing how important an effective database is. Unfortunately, I didn't realize it until I was swamped with a completely disorganized system.

  5. #5
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Ok, there are a couple of possible solutions here. If you want something cheap, easy, and fast, then you'll have to make a few concessions.

    The first problem I would address is one of scope. If you want to continue using simple search tools, then you HAVE to get the files into one folder, although you can add sub-folders to it as you see fit. This lets you search only the files that you deem to be of interest. There are ways to work around this, but all of them involve either significant coding or extensive manual effort on the part of an administrator (either of which is expensive).

    The next problem that I would address is dealing with the documents that have been scanned as images instead of text. The problem is that the documents are effectively stored as pictures rather than as words, so none of the reasonably priced tools can index them. There are very high end tools with the ability to do this, but they are at best about 95% accurate, time consuming, and very frustrating to administer.

    You have two basic approaches that will help you deal with documents scanned as images. The first approach is to use OCR (Optical Character Recognition) software to do the "brute force" conversion of the documents into a searchable form. This is usually very slow, and it requires meticulous proofreading by someone who understands the content well enough to do that proofreading. It can also result in "lost" material when the OCR fails to recognize a font, sidebar, etc., and fails to report that as an error.
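    One way to make that proofreading pass less painful is to triage the OCR output automatically before a human reads it. The sketch below flags lines containing telltale OCR artifacts; the specific heuristics (stray symbols, digit-letter-digit jumbles like "v1tam1n") are my own illustrative assumptions and would need tuning for real documents.

    ```python
    import re

    # Patterns that commonly signal an OCR misread: stray symbols, or a
    # word with letters sandwiched between digits. Illustrative only.
    SUSPECT = re.compile(r"[|~^`]|\b\w*\d\w*[a-zA-Z]\w*\d\w*\b")

    def flag_suspect_lines(text):
        """Return (line_number, line) pairs that look like OCR errors
        and deserve a human proofreader's attention first."""
        flagged = []
        for i, line in enumerate(text.splitlines(), start=1):
            if SUSPECT.search(line):
                flagged.append((i, line))
        return flagged
    ```

    This doesn't replace the proofreader Pat describes; it just orders their work so the worst-looking pages get eyes on them first.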

    A slightly more expensive method, but one that is almost 100% reliable, is to take two or more of your junior people and have them manually transcribe the contents of each of these documents twice. At first this sounds horribly wasteful, but there are some very important side benefits. It forces those junior folks to become more familiar with the documents and gives them a "pride of ownership" in the final result that will lead them to use your archive more often, both of which are huge positive benefits. By comparing the two transcriptions (which can be automated), one of your more senior people can resolve any conflicts/typos/etc., and this gives them more insight into the abilities and any possible training needs of the junior people.

    Another solution to your problem (moving completely outside of the discussion so far) would be to either buy or build one of the more complex document management systems that would allow you to handle the documents in their existing form. This would require a fair bit of ongoing time from an administrator for your archive, and it would probably also require a relatively large investment up front to acquire and configure the archive management system.

    Let me know what you think of these options, and if you think that they'll work for you!

    -PatP

  6. #6
    Join Date
    Aug 2007
    Posts
    4
    PatP,

    You're being incredibly helpful!

    I suppose moving all of the files into one directory wouldn't be that dreadful. I understand how that would make the searches simpler, allowing us to specify the folder to be searched.

    How do you prevent a scan from turning into an image? There are often times when we receive an article in hard copy and need to scan it to store it on our server.

    When you suggest that our junior people "manually transcribe" each document, what exactly do you mean?

    It sounds like we need a consultant to come in and work out the best solution. Any chance you live in Toronto? For now, I think I will move all of the files into one folder and test whether this makes a significant difference. Can you recommend any other search software besides Google Desktop or Windows? Both return results with random portions of the text containing the search terms, which makes it difficult to determine whether a result is relevant without opening it. A search tool that displayed the title of the article, or even the first page of the PDF, would be ideal!

  7. #7
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Actually, while I don't live in Toronto, I could probably be persuaded to come and visit...

    As a better option than having me visit, one of our moderators, Rudy Limeback, actually does live in Toronto and might well be able to offer more creative / less labor-intensive solutions than I could. You can contact Rudy via his personal web site.

    -PatP

  8. #8
    Join Date
    Aug 2007
    Posts
    4
    Thanks a lot Pat!

  9. #9
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Quote Originally Posted by pwojewnik
    How do you prevent a scan from turning into an image? There are often times where we receive an article in hard copy format and need to scan it to store it on our server.
    This is usually pretty simple as long as the document is in one or two relatively common fonts and is visually "clear" enough to allow OCR scanning to do the bulk of the work for you.

    There are a couple of "red light" kinds of problems that can occur when doing OCR against a published article.

    Most commonly, sidebars (the shaded or boxed discussions that are outside the main paper, but provide important background information or additional technical details) don't always scan correctly because the OCR software doesn't "grok" that the article isn't a continuous stream of text. This can result in the sidebar being lost, rendered as gibberish, or just plain merged into the rest of the text, any one of which makes it practically impossible to do useful searches against the document.

    Technical papers often have graphs and diagrams, and OCR usually struggles with these too. The whole graph might be lost, or bits and pieces of it might be recovered in an unusable form.

    Special fonts and unusual (Greek, mathematical, etc.) characters also pose a real problem for OCR readers. Some handle these things well, some do OK if you give them the proper settings, but most just fail abysmally and don't even report the failure.

    Combining these scanning problems can compound things to the point where, in the worst cases, the scanned text is unusable. In any case, a seasoned professional with enough experience to be completely comfortable with the text has to proofread the results of the scan before you include it in your archive, unless you are willing to accept the consequences of GIGO.
    Quote Originally Posted by pwojewnik
    When you suggest that our junior people "manually transcribe" each document, what exactly do you mean?
    What I'm proposing is to have at least two copies of each document physically retyped by a human being. This is often best handled by junior associates for two very important reasons. First, it gives them at least a passing familiarity with the documents, which allows them to construct much better search strings later than if they had never read the documents in the first place. More importantly, it can instill a "pride of ownership" where they see the archive as something that they helped to build, which brings them to use the archive more than they would if it were just another reference source.

    The reason you want two copies of each document is that junior people will make transcription errors, no matter how careful they might be. They may misunderstand some content, they may not be able to clearly read some formulae, etc. The two documents can be compared (using automated tools) and merged or revised into the final archive document by one of your more senior people (who we'll assume is competent to make the necessary judgement calls). This will also point out opportunities to mentor or provide additional training for the junior associates, based on the quality of their transcription efforts.
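    The automated comparison step can be as simple as a textual diff. Here is a minimal sketch using Python's standard `difflib` (my choice of tool, not something from the thread) that surfaces only the spans where the two transcriptions disagree, so the senior reviewer can jump straight to the conflicts.

    ```python
    import difflib

    def transcription_conflicts(first, second):
        """Compare two independent transcriptions line by line and
        return the disagreeing spans as (lines_from_first, lines_from_second)."""
        a = first.splitlines()
        b = second.splitlines()
        matcher = difflib.SequenceMatcher(None, a, b)
        conflicts = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":  # "replace", "delete", or "insert"
                conflicts.append((a[i1:i2], b[j1:j2]))
        return conflicts
    ```

    An empty result means the two typists agreed exactly; anything else is a candidate typo or misreading for the reviewer to resolve against the original paper.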

    While this is rather time consuming, it brings several really important benefits to the process. It also makes it possible to retrieve the entire content of the articles (since a human being can visually identify a graph, etc. and can include them in the transcript document).

    Whew, that was rather wordy, but I think it will help clear things up a bit!

    -PatP
