    Preliminary Research - Search Engine DB Design


    This is my first post in this forum but I look forward to becoming an active member as I explore the world of databases. I'm a web developer and have experience building databases for simple websites. I've decided I want to work on a personal project to enhance my understanding of databases and database design. I plan to create a website search engine which can search/crawl a website.

    It has become obvious to me that one of the most important factors in creating an efficient search engine is a well-designed database. I'm interested in doing some reading (books/papers/tutorials) on databases specific to search engines. However, searching the web I've not been able to find much research.

    My questions to the members of this forum are:
    Do you know of any good research that has been done on the subject of search engine databases? If not, what are some important concepts I should research that are essential to developing a solid search engine database?

    The search engine I plan to develop should scale from a small website to a large website. It is not intended to be a 'Google-like' search engine which scours the web. This is intended to be an internal website search engine.

    I appreciate any advice you might be able to offer. Thanks!

    Jun 2003
    It would seem that researching Google's, Bing's, and Yahoo's search engine database designs would still be a good start, even if you are not planning to replicate that project scale.
    blindman "sqlblindman"


    Feb 2004
    In front of the computer
    Searching is one of the "holy grail" topics in computing. There is always a lot of research being done on the subject, and I don't know of anything that approaches a "Unified Field Theory" for computer searching. There are several hotbeds of search research going on right now, each of which has specific features/benefits/restrictions.

    What kind of material do you wish to search? This is probably the key question that you need to answer before you can start any real reading or learning. Searching pictures is very different from searching audio, and both are very different from searching text. Each medium requires different tactics and radically different processes to search.

    Once you define your media, you need to decide what is important to be able to find as a result of your search. Macro searches deal with searching huge amounts of information with the goal of getting the searcher into the right vicinity and allowing the searcher to then refine their search. A macro search would be like specifying that you are interested in forests or beaches on Google Earth, then allowing the search to find where you should look to refine your search criteria.

    Micro searches usually follow the macro search, and they are designed to deal with far less data and to return very specific results. An example of this would be to feed a specific pattern of notes or lyrics to SoundHound and have it find that pattern within a given song or playlist. These searches are not possible on large amounts of data, but they can be just what is needed to get to a final answer in a search process.

    Text searches can be very different from "context-aware" searches because there is only one context. Even though text can appear in multiple languages or dialects, the fundamental nature of text limits the amount of data and the ways that data can be organized. Text searching therefore requires a different mindset and different practices from context-aware searching to achieve good results.

    If you could offer some information about what kind(s) of data your website(s) will host, what you expect as search criteria, and what you want as a result of the search, then I could offer more specific suggestions on where you should look for more information.

    Pat Phelan

    Jun 2011
    Thanks for the responses.

    I've been looking at documents on Google. The problem is that it is hard to visualize scaling it down to a single website. I'm sure it's possible though, and I'll definitely be spending some time studying the way they do it.

    At the moment I am going to concentrate on standard text searching. This will include text webpages and documents (pdf, ppt, doc, etc). I know there are ways to grab text from the documents; I'm more concerned with a flexible database structure that will allow me to scale. Once I get it set up I also plan to play around with tuning to get as much efficiency out of it as possible.
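
    For reference, one common starting point for this kind of plain text search is an inverted index: a table of terms, a table of documents, and a "postings" table linking the two. The sketch below is only an illustration under assumptions, not a design anyone in this thread proposed. The table names, the `tokenize`, `index_document`, and `search` helpers, and the frequency-sum ranking are all hypothetical, and it uses Python's built-in `sqlite3` module purely for convenience.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical schema: one row per document, one row per distinct term,
# and one posting row per (term, document) pair with an occurrence count.
SCHEMA = """
CREATE TABLE documents (
    id  INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL
);
CREATE TABLE terms (
    id   INTEGER PRIMARY KEY,
    term TEXT UNIQUE NOT NULL
);
CREATE TABLE postings (
    term_id INTEGER NOT NULL REFERENCES terms(id),
    doc_id  INTEGER NOT NULL REFERENCES documents(id),
    freq    INTEGER NOT NULL,
    PRIMARY KEY (term_id, doc_id)
);
"""


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(conn, url, text):
    """Store a document and one posting per distinct term it contains."""
    doc_id = conn.execute(
        "INSERT INTO documents (url) VALUES (?)", (url,)
    ).lastrowid
    for term, freq in Counter(tokenize(text)).items():
        conn.execute("INSERT OR IGNORE INTO terms (term) VALUES (?)", (term,))
        term_id = conn.execute(
            "SELECT id FROM terms WHERE term = ?", (term,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO postings (term_id, doc_id, freq) VALUES (?, ?, ?)",
            (term_id, doc_id, freq),
        )
    conn.commit()
    return doc_id


def search(conn, query):
    """Return URLs of documents containing every query word, best match first.

    Ranking here is just the summed term frequency, an assumption made to
    keep the sketch short; real engines use more careful scoring.
    """
    words = sorted(set(tokenize(query)))
    if not words:
        return []
    placeholders = ",".join("?" * len(words))
    sql = f"""
        SELECT d.url, SUM(p.freq) AS score
        FROM postings p
        JOIN terms t ON t.id = p.term_id
        JOIN documents d ON d.id = p.doc_id
        WHERE t.term IN ({placeholders})
        GROUP BY d.id
        HAVING COUNT(*) = ?
        ORDER BY score DESC, d.url
    """
    return [row[0] for row in conn.execute(sql, (*words, len(words)))]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    index_document(conn, "/about", "About our database team")
    index_document(conn, "/docs/search", "Search the database; database search tips")
    print(search(conn, "database search"))
```

    The point of the three-table layout is that a query only touches the postings for the words searched, rather than scanning every document's text, which is what lets the same structure serve a small site and a larger one. Text pulled from pdf, ppt, or doc files would go through the same `index_document` path once extracted.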

