  1. #1 (Join Date: Feb 2004, Posts: 3)

    Web Spider / Crawler question - Please help!!

    Hello all.

    I need to program a web spider / crawler that will log onto various websites (e.g. nfl.com), collect various sports stats automatically after a game is completed, and enter them into my database.

    I'm new to programming but I'm a pretty fast learner. Can someone suggest a language / approach that I should be using?

    Thanks much in advance.

  2. #2 (Join Date: May 2004, Location: Seattle, Posts: 1,313)
    If you use .NET, you could use the HttpWebRequest class to do this.

    http://msdn2.microsoft.com/en-us/lib...ebrequest.aspx

    Watch out, though - you may be violating copyright if you scrape the nfl.com site. They would probably claim ownership of the data on their site. I'm not a lawyer, though.

  3. #3 (Join Date: Oct 2002, Location: Baghdad, Iraq, Posts: 697)
    Quote Originally Posted by okayfine
    I need to program a web spider / crawler that will log onto various websites (e.g. nfl.com), collect various sports stats automatically after a game is completed, and enter them into my database. I'm new to programming but I'm a pretty fast learner. Can someone suggest a language / approach that I should be using?
    I did a lot of web scraping back in the day. It's *painful* stuff because by the time you get it working the bastards change the layout on you.

    There are two approaches:

    1. Scripting a browser. I haven't done this since IE 4, so I'm really not sure how IE 7 handles it. This involves sending OLE automation requests to the browser to do all the things a web user does (rough sketch below).

    Pros: you can do anything a person can do. Pages with JavaScript "just work." You look just like a person, if you put in the appropriate delays. You don't have to figure out how their cookies and authentication stuff works.

    Cons: Harder to debug, may require a dedicated computer, slower. Steep learning curve because a browser is a complicated app with many concurrent tasks and lots of confusing APIs.

    2. A web library. Perl's LWP is good, and LWP::Simple is easy to learn (rough sketch below). I'm sure .NET's HttpWebRequest covers much the same ground.

    Pros: Can be dead simple. perl -MLWP::Simple -e 'getstore "http://dbforums.com/", "index.html"' is all it takes to save a known web page to disk. Can run in the background, and it's easy to restart a hung process. Easy to debug.

    Cons: It can be painful trying to log into any user areas. Scanning text with regular expressions might work one day, but not the next. Harder to disguise your robot from a savvy webmaster.

    Personally, I recommend Perl for the latter approach because the LWP libraries are very well developed, as are the HTML parsing libraries. But sometimes you really can't beat telling IE "click this button, follow this hyperlink" and once you've navigated to a page, it's very easy to use its DOM to pull out the data you want.
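
    For the browser-scripting route, here's a minimal sketch in Perl using Win32::OLE to drive Internet Explorer. Windows only, and the URL is just a placeholder - point it at whatever page you actually need:

        # Sketch: drive Internet Explorer via OLE automation from Perl.
        # Assumes the Win32::OLE module is installed; the URL is only an example.
        use strict;
        use Win32::OLE;

        my $ie = Win32::OLE->new('InternetExplorer.Application')
            or die "Couldn't start IE: " . Win32::OLE->LastError;
        $ie->{Visible} = 1;                      # watch it work while debugging
        $ie->Navigate('http://www.nfl.com/');

        # Wait until the page (and its scripts) have finished loading.
        sleep 1 while $ie->{Busy} || $ie->{ReadyState} != 4;

        # Once you're on the page, pull the data straight out of the DOM.
        my $text = $ie->{Document}{body}{innerText};
        print $text;

        $ie->Quit;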
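
    And for the web-library route, a minimal sketch with LWP::UserAgent plus HTML::TableExtract from CPAN. The URL and the column headers are invented placeholders - look at the real page to find the actual table headings:

        # Sketch: fetch a page with LWP and pull rows out of an HTML stats table.
        # The URL and the header names below are made-up placeholders.
        use strict;
        use LWP::UserAgent;
        use HTML::TableExtract;

        my $ua   = LWP::UserAgent->new(agent => 'Mozilla/4.0 (compatible)');
        my $resp = $ua->get('http://www.nfl.com/stats');    # hypothetical URL
        die "Fetch failed: ", $resp->status_line unless $resp->is_success;

        # Find the table whose header row contains these column names.
        my $te = HTML::TableExtract->new(headers => ['Team', 'W', 'L']);
        $te->parse($resp->content);

        for my $table ($te->tables) {
            for my $row ($table->rows) {
                # Each row comes back as an arrayref of cell text; from here
                # you'd do a database INSERT (via DBI) instead of printing.
                print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
            }
        }

    From there, swapping the print for a DBI insert gets the rows into your database.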

    Quote:
    Watch out, though - you may be violating copyright if you scrape the nfl.com site.
    IANAL either, but the bigger issue is whether or not you're abusing their computers, which might violate your ISP's terms of service. If your app hits the site like a normal person would, you can usually fly under their radar. It especially helps to route your requests through open proxies.
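
    If you do go the library route, a little throttling goes a long way. Something like this (the URLs are placeholders) keeps the request pattern looking human:

        # Sketch: space requests out so the crawler looks like a person, not a robot.
        use strict;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.0 (compatible)');
        # $ua->proxy('http', 'http://someproxy:8080');  # LWP can also route through a proxy
        my @urls = ('http://www.example.com/page1', 'http://www.example.com/page2');

        for my $url (@urls) {
            my $resp = $ua->get($url);
            warn "$url: ", $resp->status_line unless $resp->is_success;
            sleep(20 + int(rand(40)));               # random delay between requests
        }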

  4. #4 (Join Date: Feb 2004, Posts: 3)
    Quote:
    Personally, I recommend Perl for the latter approach because the LWP libraries are very well developed, as are the HTML parsing libraries. But sometimes you really can't beat telling IE "click this button, follow this hyperlink" and once you've navigated to a page, it's very easy to use its DOM to pull out the data you want.
    How easy is it to do this with Visual Basic? Some of the websites use JavaScript authentication.

    Quote:
    IANAL either, but the bigger issue is whether or not you're abusing their computers, which might violate your ISP's terms of service. If your app hits the site like a normal person would, you can usually fly under their radar.
    As long as I use the data for private use, it should be okay, no? I want to start the learning process by storing sports data, but eventually I want to track commodity prices, etc.

    Thanks in advance for your replies.

  5. #5 (Join Date: Oct 2002, Location: Baghdad, Iraq, Posts: 697)
    Quote Originally Posted by okayfine
    How easy is it to do this with Visual Basic? Some of the websites use JavaScript authentication.
    Read the JavaScript and see what's going back and forth between the web browser and the server. I use OmniWeb for this because it gives me a transcript of the HTTP requests, but that's a Mac-based browser. You can use a packet sniffer, too.
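
    Once you know what the login form actually posts, you can usually replicate it with LWP directly. Here's a rough sketch; the URL and field names are pure guesses, so take the real ones from the form or from your HTTP transcript:

        # Sketch: log in by posting the same form fields the browser would send,
        # and keep the session cookie for later requests. The URL and the
        # username/password field names are hypothetical placeholders.
        use strict;
        use LWP::UserAgent;
        use HTTP::Cookies;

        my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.0 (compatible)');
        $ua->cookie_jar(HTTP::Cookies->new(file => 'cookies.txt', autosave => 1));

        my $resp = $ua->post('http://www.example.com/login',
            { username => 'me', password => 'secret' });
        die "Login failed: ", $resp->status_line
            unless $resp->is_success || $resp->is_redirect;

        # Later requests carry the session cookie automatically.
        my $page = $ua->get('http://www.example.com/members/stats');
        print $page->content if $page->is_success;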

    Quote Originally Posted by okayfine
    As long as I use the data for private use, it should be okay, no?
    Like I said, I'm not a lawyer. If you can't do the time, don't do the crime.
