  1. #1
    Join Date
    Dec 2006
    Posts
    1

    Unanswered: Web Crawler database design

    A small question about database design, concerning a table that will hold several million records containing URL information.

    Let's say I have a table of about 1,000+ root websites. The crawler fetches links from each root site and builds a huge Url_links table, and from time to time I have to pull the top 1,000 URLs from that table that are still uncrawled, grouped by website_ID and ordered by insert date.

    Once the table grows past about 4M records, the I/O load gets very heavy and slows the whole process down dramatically.

    Are there any tweaks to the table design that could improve this? We already have indexes on website_ID and the date, but it is still very slow. I was thinking of creating a buffer table to separate the uncrawled URLs from the crawled ones (see the sketch below), but maybe you have more creative thoughts.
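    To make this concrete, here is a rough sketch of what I mean (table and column names are simplified guesses, and the syntax is MySQL-flavoured; other databases would need TOP instead of LIMIT, etc.):

        CREATE TABLE Url_links (
            url_id      BIGINT        NOT NULL PRIMARY KEY,
            website_id  INT           NOT NULL,
            url         VARCHAR(2048) NOT NULL,
            crawled     TINYINT       NOT NULL DEFAULT 0,  -- 0 = uncrawled, 1 = crawled
            inserted_at DATETIME      NOT NULL
        );

        -- one composite index that matches the fetch query exactly:
        -- equality filter on crawled, then the grouping/ordering columns
        CREATE INDEX ix_uncrawled
            ON Url_links (crawled, website_id, inserted_at);

        -- the periodic fetch: next 1000 uncrawled URLs, per site, oldest first
        SELECT url_id, website_id, url
          FROM Url_links
         WHERE crawled = 0
         ORDER BY website_id, inserted_at
         LIMIT 1000;

        -- the buffer-table idea: keep only pending work in a small, hot table;
        -- move rows out (or just delete them) once they have been crawled,
        -- so the queue stays at the size of the working set instead of 4M+
        CREATE TABLE Url_queue (
            url_id      BIGINT   NOT NULL PRIMARY KEY,  -- same id as in Url_links
            website_id  INT      NOT NULL,
            inserted_at DATETIME NOT NULL
        );

    The thinking behind putting crawled first in the composite index is that the query filters on it with equality, so the rows come back already sorted by website_id and inserted_at, and the database never has to scan or sort the crawled majority of the table.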

  2. #2
    Join Date
    Jul 2009
    Posts
    1
    The database the WebCrawler builds is available through a search page on the Web. ... and validates some of the design choices made in the WebCrawler.

    ________________
    spam link removed
    Last edited by blindman; 07-13-09 at 12:11.

  3. #3
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by examiz
    A small question about database design, concerning a table that will hold several million records containing URL information. [...]
    Why not show us some slow SQL, and then give us the tables involved and their indexes? That way we're much more likely to be able to solve your problem.
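    For example, something along these lines (MySQL syntax, using made-up names since we haven't seen your schema; substitute your real table and the actual slow statement):

        SHOW CREATE TABLE Url_links;  -- shows the table definition and every index on it

        EXPLAIN
        SELECT url_id, website_id, url
          FROM Url_links
         WHERE crawled = 0
         ORDER BY website_id, inserted_at
         LIMIT 1000;                  -- post the EXPLAIN output together with the query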

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    mike, the post you are replying to is three years old

    the reply made this morning (the one ahead of yours) is the real culprit here -- it is not helpful and has been reported
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  5. #5
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    in fact, the text of that post was lifted from here -- Finding What People Want: Experiences with the WebCrawler
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  6. #6
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by r937
    Finding What People Want:
    I just want less spam - if you're giving your time for free, then it's nice to think that time is being used usefully rather than just promoting some spammer's web site.
