Hello everyone,
I have the following setup and problem:
- 1 computer, lots of resources
- 1 MySQL database (tables `url`, `page`, and some other tables with relations between these tables)
- 1 PHP script that acts as a crawler
The crawler grabs a `url` from the table and scans the site for links. The method is depth-first search: it grabs a link, then its children, then the children's children, and so on until there are no more unique links left, then it returns.
While the process scans a site, it holds an array in memory with all the unique links it has found up to that moment. If it finds a new link, it pushes it onto the array. The array stays in memory for the entire crawl of that `url`, because every link found has to be checked against it to avoid loops.
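For illustration, the approach described above can be sketched roughly as follows. This is a minimal Python sketch rather than the actual PHP script; `SITE`, `crawl`, and `get_links` are hypothetical names, and the link graph is a small in-memory dictionary standing in for real page fetches.

```python
# Hypothetical link graph standing in for a real site: page -> links on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/a"],
    "/c": ["/"],
}


def crawl(start_url, get_links):
    """Depth-first crawl that returns every unique link in discovery order."""
    seen = {start_url}    # all unique links found so far, held in memory
    order = [start_url]
    stack = [start_url]   # explicit stack instead of recursion
    while stack:
        url = stack.pop()
        for link in get_links(url):
            if link not in seen:   # the loop-prevention check
                seen.add(link)
                order.append(link)
                stack.append(link)
    return order


found = crawl("/", lambda url: SITE.get(url, []))
```

The `seen` set here plays the role of the array: it grows by one entry per unique link and is only released when the crawl of that site finishes.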
The obvious problem is that if you run the crawler on a large site (100,000+ unique links), the computer starts running into memory problems, and the more links there are in the array, the slower the crawler gets.
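One possible direction, shown only as a sketch: keep the set of seen links in an indexed table on disk instead of an in-memory array, so memory use no longer grows with the size of the site. The example uses Python's built-in `sqlite3` as a stand-in for the MySQL database; the `seen` table and the helper names `open_seen_store` and `mark_seen` are hypothetical. (In MySQL, the comparable statement would be `INSERT IGNORE` against a column with a unique index.)

```python
import sqlite3


def open_seen_store(path=":memory:"):
    """Open a store of seen URLs; pass a file path to keep it on disk."""
    conn = sqlite3.connect(path)
    # The primary key gives an index, so duplicate checks do not scan the table.
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
    return conn


def mark_seen(conn, url):
    """Record url; return True if it was new, False if already recorded."""
    cur = conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
    return cur.rowcount == 1


conn = open_seen_store()
results = [mark_seen(conn, u) for u in ["/a", "/b", "/a"]]
total = conn.execute("SELECT COUNT(*) FROM seen").fetchone()[0]
```

Whether this is faster in practice depends on the workload: each check becomes a database round trip, but the cost per check no longer depends on keeping the whole list in memory.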
What are my options here? Are there any papers on this kind of problem?