- 1 computer, lots of resources
- 1 MySQL database (tables `url`, `page`, and some other tables holding relations between them)
- 1 PHP script that is acting like a crawler
The crawler grabs a `url` from the table and scans the site for links. The method is a depth-first search: it grabs a link, then its children, then the children's children, and so on until no unique links are left, then it returns.
Now imagine that while the process scans a site, it holds an array in memory with all the unique links it has found up to that exact moment. If it finds a new link, it pushes it onto the array. The array is held in memory for the duration of the crawl of the entire `url`, because every link found must be checked against it (to prevent loops).
The obvious problem is that if you run the crawler on a large site (100,000+ unique links), the machine starts running into memory problems, and the more links the array holds, the slower the crawler gets.
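To make the problem concrete, here is a minimal sketch of the approach described above, written in Python for brevity; `fetch_links()` is a hypothetical stand-in for the HTTP fetch + HTML parse step, and the in-memory `seen` set plays the role of the array:

```python
def fetch_links(url, link_graph):
    """Hypothetical fetch + parse step: return the links found on `url`
    (here simply looked up in a test graph)."""
    return link_graph.get(url, [])

def crawl_dfs(start_url, link_graph):
    """Depth-first crawl; `seen` grows with every unique link found,
    which is exactly the memory problem on a 100,000+ link site."""
    seen = set()            # the in-memory array of unique links
    stack = [start_url]
    while stack:
        url = stack.pop()
        if url in seen:
            continue        # loop prevention: already visited
        seen.add(url)
        stack.extend(fetch_links(url, link_graph))
    return seen

# Tiny test graph: a site with a cycle back to the start page.
graph = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": []}
print(sorted(crawl_dfs("/", graph)))   # ['/', '/a', '/b']
```

The `seen` set never shrinks during a crawl, so its memory footprint is proportional to the number of unique links on the site.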
What are my options here? Are there any papers on this similar kind of problem?
I actually wrote a similar paper (the details were different, but the problem and concept was the same) in 1979.
The problem is that RAM is not the best choice for where to store this kind of information. All memory is finite, and RAM tends to be the smallest form of memory on any given machine.
My solution to the problem was to compute a "message digest" for each reference (equivalent to a URL in your example) using some kind of CRC-type code. Store the digest, the URL you found, the URL of the referring site (or an FK to another table with site, scan date & time, etc.), and possibly a reference count, in a table (sometimes it helps to know how many times one site, or even many sites, refer to a specific URL).
Keeping the digests in an array or hash in RAM will allow you to determine if there's any chance of finding a URL on disk. This can save you lots of time in the crawling process.
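A sketch of that idea, in Python for brevity: only fixed-size digests live in RAM, while the full rows go to disk. The `db` dict is a stand-in for the MySQL table, and MD5 is used here as one example of a digest function (the answer leaves the choice open):

```python
import hashlib

def digest(url):
    """Fixed-size fingerprint of a URL: 16 bytes regardless of URL length."""
    return hashlib.md5(url.encode("utf-8")).digest()

seen_digests = set()   # small, fixed-size entries kept in RAM
db = {}                # stand-in for the on-disk table: digest -> (url, referrer, ref_count)

def record_link(url, referrer):
    d = digest(url)
    if d in seen_digests:
        # Only now do we touch "disk" -- just to bump the reference count.
        u, r, count = db[d]
        db[d] = (u, r, count + 1)
        return False          # not a new link
    seen_digests.add(d)
    db[d] = (url, referrer, 1)
    return True               # new link, worth crawling

record_link("http://example.com/a", "http://example.com/")
record_link("http://example.com/a", "http://example.com/b")
print(db[digest("http://example.com/a")][2])   # 2
```

The RAM cost per link drops from the full URL string (often hundreds of bytes) to a fixed 16 bytes, and the disk table is only consulted when the digest is already known.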
- get the first link of the `url`
- grab its links
- put them somewhere on disk with some references (date, nr crawls)
- repeat the above process starting from the second link
- also, when storing the new links on disk from the second link onward, I have to check and insert only the new (unique) ones; this is where the "message digest" comes in
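The steps above can be sketched with SQLite standing in for MySQL (again in Python for brevity): a primary key on the digest column lets the database itself do the "insert only the new ones" check via `INSERT OR IGNORE`. Table and column names here are illustrative, not the actual schema:

```python
import hashlib
import sqlite3
import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE page (
    digest TEXT PRIMARY KEY,       -- message digest of the URL
    url    TEXT NOT NULL,
    found  TEXT NOT NULL,          -- date first seen
    crawls INTEGER DEFAULT 0       -- nr of crawls
)""")

def store_link(url):
    d = hashlib.md5(url.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE: the row is only written when the digest is new.
    cur = conn.execute(
        "INSERT OR IGNORE INTO page (digest, url, found) VALUES (?, ?, ?)",
        (d, url, datetime.date.today().isoformat()))
    return cur.rowcount == 1      # True only for a brand-new link

print(store_link("http://example.com/a"))  # True  (new)
print(store_link("http://example.com/a"))  # False (duplicate ignored)
```

Pushing the uniqueness check into the database this way means the PHP process never needs the full link array in memory at all; MySQL offers the equivalent `INSERT IGNORE` with a UNIQUE index.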
I'm not sure whether my crawling technique is the most efficient:
I start from FIRST PAGE (grab all links there) -> FIRST PAGE LINK #1 (grab all links there) -> FIRST PAGE LINK #1 LINK #1 ... and when there are no more, return to FIRST PAGE LINK #2, and so on.
Now this is depth-first search. I guess a backlink count or link-weight algorithm for choosing what to crawl next would save time (imagine if you hit the sitemap from the second link).
I will not store the array of links in memory; I will keep a unique list of links on disk instead.
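For comparison, the walk described above (page, then its first link, then that page's first link, backtracking when a branch runs dry) is depth-first; a breadth-first crawl instead works through a queue, finishing all of FIRST PAGE's links before going a level deeper. A sketch, again with a hypothetical test graph in place of real fetching:

```python
from collections import deque

def crawl_bfs(start_url, link_graph):
    """Breadth-first crawl: a FIFO queue visits pages level by level."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()              # FIFO, unlike the DFS stack
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# "/" links to "/a" and "/b"; "/a" links one level deeper to "/c".
graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
print(crawl_bfs("/", graph))   # ['/', '/a', '/b', '/c']
```

Breadth-first order tends to reach shallow, heavily-linked pages (like a sitemap) early, which fits the link-weight intuition above; either order still needs the same uniqueness check, so the digest-on-disk scheme applies unchanged.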