Quote:
|
Originally Posted by okayfine
Hello all.
I need to program a web spider / crawler that will log onto various websites (e.g. nfl.com), collect various sports stats automatically after the game is completed, and enter them in my database.
I'm new to programming but I'm a pretty fast learner. Can someone suggest a language / approach that I should be using?
Thanks much in advance.
|
I did a lot of web scraping back in the day. It's *painful* stuff because by the time you get it working the bastards change the layout on you.
There are two approaches:
1. Scripting a browser. I haven't done this since IE 4, so I'm really not sure how IE 7 handles it. This involves sending OLE automation requests to the browser to do all the things a web user does.
Pros: You can do anything a person can do. Pages with Javascript "just work." You look just like a person, if you put in the appropriate delays. You don't have to figure out how their cookies and authentication stuff work.
Cons: Harder to debug, may require a dedicated computer, slower. Steep learning curve because a browser is a complicated app with many concurrent tasks and lots of confusing APIs.
2. A web library. Perl's LWP is good, and LWP::Simple is easy to learn. I'm sure .NET's HttpWebRequest does much the same thing.
Pros: Can be dead simple. perl -MLWP::Simple -e 'getstore "http://dbforums.com/", "index.html";' is all it takes to save a known web page to disk. Can run in the background, and a hung process is easy to restart. Easy to debug.
Cons: It can be painful trying to log into any user areas. Scanning text with regular expressions might work one day, but not the next. Harder to disguise your robot from a savvy webmaster.
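For anyone who doesn't want to start with Perl, here is a rough sketch of the same "save a known page to disk" idea using only Python's standard library. This is a swapped-in equivalent, not LWP itself: the function name getstore is borrowed from LWP::Simple purely for illustration, and the timeout value is an assumption.

```python
import shutil
import urllib.request


def getstore(url, path, timeout=30.0):
    """Fetch `url` and write the raw response body to `path`.

    Named after LWP::Simple's getstore for illustration only; this is
    plain urllib, and the 30-second timeout is an arbitrary assumption.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(path, "wb") as out:
        # Stream the body to disk instead of loading it all into memory.
        shutil.copyfileobj(response, out)


# Example call (does a real network request, so it is left commented out):
# getstore("http://dbforums.com/", "index.html")
```

Like the one-liner, this does nothing about logins or cookies; it only covers the simple case of a page anyone can fetch.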
Personally, I recommend Perl for the latter approach because the LWP libraries are very well developed, as are the HTML parsing libraries. But sometimes you really can't beat telling IE "click this button, follow this hyperlink," and once you've navigated to a page, it's very easy to use its DOM to pull out the data you want.
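To show why a real HTML parser tends to hold up better than regular expressions, here is a minimal sketch that pulls table cells out of a page with Python's built-in html.parser (again a stand-in for the Perl parsing modules, not the same thing). The class and function names are made up for this example, and real stats pages will certainly need more handling than this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the open cell, None outside a cell

    def _close_cell(self):
        # Store the open cell, if any, in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Store the open row, if any, tolerating missing </td> tags.
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_rows(html):
    """Return all table rows in `html` as lists of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Fed a made-up fragment such as `<table><tr><th>Team</th><th>Score</th></tr><tr><td>Home</td><td>21</td></tr></table>`, extract_rows gives back one list per row. It still breaks if the site moves the numbers out of a table, but it survives whitespace and attribute changes that would trip up a regex.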
Quote:
|
watch out though - you may be violating copyright if you scrape the nfl.com site.
|
IANAL, either, but: the bigger issue is whether or not you're abusing their computers, which might violate your ISP's terms of service. If your app hits the site like a normal person would, you can usually fly under their radar. It especially helps to route your requests through open proxies.
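One way to "hit the site like a normal person would" is to put a randomized pause between requests. This is only a sketch in Python: the class name and the 5 to 15 second defaults are my own assumptions, not numbers any site publishes, so pick delays that suit the site you're reading.

```python
import random
import time


class PoliteThrottle:
    """Space requests out by a random delay, roughly like a person browsing.

    The default delay range is an assumption for illustration only.
    """

    def __init__(self, min_delay=5.0, max_delay=15.0,
                 clock=time.monotonic, sleep=time.sleep):
        if min_delay < 0 or max_delay < min_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self._min = min_delay
        self._max = max_delay
        self._clock = clock   # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may go

    def wait(self):
        """Block until a request is allowed, then schedule the next one."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + random.uniform(self._min, self._max)
```

Call throttle.wait() right before each fetch. The first call returns immediately, and every later call sleeps until the randomly chosen gap since the previous request has passed.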
