If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > General > Database Concepts & Design > scalability issue - array

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 08-30-06, 08:28
Zamolxe Zamolxe is offline
Registered User
 
Join Date: Feb 2004
Location: Bucharest
Posts: 37
scalability issue - array

hello guys

i have the following issue:

- 1 computer, lots of resources
- 1 mysql database (tables url, page, and some other tables with relations between this tables)
- 1 PHP script that is acting like a crawler

The crawler grabs a `url` from the table, and scans the site for links - the method is depth search (grabs a link, then it's childs, then the child-childs, etc until it has no more unique links left, then returns).

Now imagine that when the process scans a site, it holds an array in memory with all the unique links that it has found at that specific date,hour,minute,second. If it finds a new link it pushes it into the array. The array is held in memory during the crawling of the entire `url` because it needs to check if the links found are already in the array (prevent loop).

The obvious problem is that if you decide to run the crawler on a large site (100.000+ unique links), the computer will start having problems with memory. So the more links you have in the array the slower the crawler will work.

What are my options here? Are there any papers on this similar kind of problem?
__________________
My Blog
Reply With Quote
  #2 (permalink)  
Old 08-30-06, 08:45
Pat Phelan Pat Phelan is offline
Resident Curmudgeon
 
Join Date: Feb 2004
Location: In front of the computer
Posts: 12,605
I actually wrote a similar paper (the details were different, but the problem and concept was the same) in 1979.

The problem is that RAM is not the best choice for where to store this kind of information. All memory is finite, and RAM tends to be the smallest form of memory on any given machine.

My solution to the problem was to compute a "message digest" for each reference (equivalent to a URL in your example) using some kind of CRC type code. Store the digest, the URL you found, the URL of the referring site (or an FK to another table with site, scan date & time, etc), and possibly a reference count in a table (smetimes it helps to know how many times one site or even many sites refer to a specific URL).

Keeping the digests in an array or hash in RAM will allow you to determine if there's any chance of finding a URL on disk. This can save you lots of time in the crawling process.

-PatP
Reply With Quote
  #3 (permalink)  
Old 08-31-06, 16:23
Zamolxe Zamolxe is offline
Registered User
 
Join Date: Feb 2004
Location: Bucharest
Posts: 37
re

you mean:

- get the first link of the `url`
- grab its links
- put them somewhere on disk with some references (date, nr crawls)

- from the second link repeat the above process
- also from the second link, when i store the new links on the disk i have to check to insert only the new ones (unique) - this is where the "message digest" is comming in

i'm not sure if my crawling technique is the most eficient:

i start from FIRST PAGE (grab all links in here) -> FIST PAGE LINK #1 (grab all links in here) -> FIRST PAGE LINK #1 LINK #1 .... if no more then return to FIRST PAGE LINK #2 ... an so on

now this is more like breadth-first search. i guess a backlink count or link weight algoritm for crawling will save time (imagine if you hit the sitemap from the second link)

i will not store the array with links in memory, i will store an unique array on disk
__________________
My Blog
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On