  #1 (permalink)  
12-26-06, 06:28
examiz
Registered User
 
Join Date: Dec 2006
Posts: 1
Web Crawler database design

I have a small question about database design for a table that will hold several million records of URL information.

Say I have a table with 1,000+ root websites. The crawler fetches links from each root website and builds a huge Url_links table. From time to time I have to get the top 1,000 uncrawled URLs from this table, grouped by website_ID and ordered by insert date.

Once this table grows to 4M records or more, the I/O load gets very heavy and the process slows down dramatically.

Are there any tweaks to the table design that could improve this? We already have indexes on the website ID and the date, but it is still very slow. I was thinking of creating a buffer table to separate the uncrawled URLs from the crawled ones, but maybe you have more creative thoughts.
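
For illustration only, the access pattern described above might look like the sketch below. It is written in Python against SQLite (the standard-library sqlite3 module) as a stand-in for MySQL, and everything in it is an assumption: the column names, the `crawled` flag, the index name and the helper functions are hypothetical, and the query reads "top 1,000" as "the 1,000 oldest uncrawled URLs". The idea it shows is one composite index whose leading column is the filter and whose next column is the sort key, rather than separate single-column indexes.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical url_links table plus one composite index."""
    conn.executescript(
        """
        CREATE TABLE url_links (
            id          INTEGER PRIMARY KEY,
            website_id  INTEGER NOT NULL,
            url         TEXT    NOT NULL,
            crawled     INTEGER NOT NULL DEFAULT 0,  -- 0 = not yet crawled
            inserted_at TEXT    NOT NULL             -- ISO-8601 date string
        );
        -- Filter column first, then the sort column, so the oldest
        -- uncrawled rows can be read in index order.
        CREATE INDEX idx_uncrawled_by_date
            ON url_links (crawled, inserted_at);
        """
    )


def next_batch(conn: sqlite3.Connection, limit: int = 1000) -> list:
    """Return up to `limit` uncrawled rows, oldest insert date first."""
    return conn.execute(
        """
        SELECT id, website_id, url
        FROM url_links
        WHERE crawled = 0
        ORDER BY inserted_at, id
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


def mark_crawled(conn: sqlite3.Connection, ids: list) -> None:
    """Flag the given rows as crawled so later batches skip them."""
    conn.executemany(
        "UPDATE url_links SET crawled = 1 WHERE id = ?",
        [(row_id,) for row_id in ids],
    )
```

If batches have to be taken per website instead, an index on (crawled, website_id, inserted_at) would be the analogous choice. Whether either index helps on the real table depends on the actual query and data, which the post does not show.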
  #2 (permalink)  
07-13-09, 08:58
rs_shadow0000
Banned
 
Join Date: Jul 2009
Posts: 1
The database the WebCrawler builds is available through a search page on the Web. ... and validates some of the design choices made in the WebCrawler.

________________
spam link removed

Last edited by blindman; 07-13-09 at 11:11.
  #3 (permalink)  
07-13-09, 09:11
mike_bike_kite
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Why not show us the slow SQL, along with the tables involved and their indexes? That way we're much more likely to solve your problem.
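
As a rough sketch of how that information could be gathered: in MySQL, SHOW CREATE TABLE and SHOW INDEX give the table definition and its indexes, and EXPLAIN gives the plan for a slow query. The Python example below does the equivalent against SQLite (standard-library sqlite3) purely as a stand-in; the helper names `list_indexes` and `query_plan` are hypothetical, and the plan text SQLite prints differs from MySQL's EXPLAIN output.

```python
import sqlite3


def list_indexes(conn: sqlite3.Connection, table: str) -> list:
    """Return the names of the indexes defined on `table`."""
    # In each PRAGMA index_list row, the index name is the second column.
    return [row[1] for row in conn.execute(f"PRAGMA index_list({table})")]


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Return the plan steps SQLite reports for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the plan description text.
    return [row[-1] for row in rows]
```

Posting the output of both alongside the query shows whether the engine is using an index at all or scanning the whole table.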
  #4 (permalink)  
07-13-09, 09:13
r937
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
mike, the post you are replying to is three years old

the reply made this morning (the one ahead of yours) is the real culprit here -- it is not helpful and has been reported
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
  #5 (permalink)  
07-13-09, 09:29
r937
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,534
in fact, the text of that post was lifted from here -- Finding What People Want: Experiences with the WebCrawler
  #6 (permalink)  
07-13-09, 09:48
mike_bike_kite
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by r937
Finding What People Want:
I just want less spam - if you're giving your time for free, it's nice to think that time is being used usefully rather than just to promote some spammer's web site.