  #1 (permalink)  
Old 11-23-06, 12:04
okayfine okayfine is offline
Registered User
 
Join Date: Feb 2004
Posts: 3
Web Spider / Crawler question - Please help !!

Hello all.

I need to program a web spider / crawler that will log onto various websites (e.g. nfl.com), collect various sports stats automatically after the game is completed, and enter them in my database.

I'm new to programming but I'm a pretty fast learner. Can someone suggest a language / approach that I should be using?

Thanks much in advance.
  #2 (permalink)  
Old 11-27-06, 01:21
jezemine jezemine is offline
another indirection layer
 
Join Date: May 2004
Location: Seattle
Posts: 1,312
if you use .net, you could use the HttpWebRequest class to do this.

http://msdn2.microsoft.com/en-us/lib...ebrequest.aspx

watch out though - you may be violating copyright if you scrape the nfl.com site. they would probably claim ownership of the data on their site. I'm not a lawyer though.
__________________
elsasoft.org
  #3 (permalink)  
Old 11-27-06, 02:57
sco08y sco08y is offline
Registered User
 
Join Date: Oct 2002
Location: Baghdad, Iraq
Posts: 697
Quote:
Originally Posted by okayfine
Hello all.

I need to program a web spider / crawler that will log onto various websites (e.g. nfl.com), collect various sports stats automatically after the game is completed, and enter them in my database.

I'm new to programming but I'm a pretty fast learner. Can someone suggest a language / approach that I should be using?

Thanks much in advance.
I did a lot of web scraping back in the day. It's *painful* stuff because by the time you get it working the bastards change the layout on you.

There are two approaches:

1. Scripting a browser. I haven't done this since IE 4, so I'm really not sure how IE 7 handles it. This involves sending OLE automation requests to the browser to do all the things a web user does.

Pros: you can do anything a person can do. Pages with Javascript "just work." You look just like a person, if you put in the appropriate delays. You don't have to figure out how their cookies and authentication stuff works.

Cons: Harder to debug, may require a dedicated computer, slower. Steep learning curve because a browser is a complicated app with many concurrent tasks and lots of confusing APIs.

2. A web library. Perl's LWP is good, and LWP::Simple is easy to learn. I'm sure .NET's HttpWebRequest is the same thing.

Pros: Can be dead simple. perl -MLWP::Simple -e 'getstore "http://dbforums.com/", "index.html";' is all it takes to save a known web page to disk. Can run in the background, and is easy to restart a hung process. Easy to debug.

Cons: It can be painful trying to log into any user areas. Scanning text with regular expressions might work one day, but not the next. Harder to disguise your robot from a savvy webmaster.
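
For anyone who would rather not use Perl, here is a rough Python equivalent of the getstore one-liner, using only the standard library. Treat it as a sketch: the function name fetch_to_disk and the 10-second timeout are my own choices, not anything from LWP.

```python
import shutil
import urllib.request


def fetch_to_disk(url, path, timeout=10):
    """Download url and write the raw response body to path.

    fetch_to_disk is a hypothetical helper name; the timeout value is
    an arbitrary assumption.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(path, "wb") as out:
            # Stream the body to disk instead of loading it all into memory.
            shutil.copyfileobj(response, out)
    return path
```

Calling fetch_to_disk("http://dbforums.com/", "index.html") should do roughly what the one-liner does: save a known page to disk so you can parse it later.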

Personally, I recommend Perl for the latter approach because the LWP libraries are very well developed, as are the HTML parsing libraries. But sometimes you really can't beat telling IE "click this button, follow this hyperlink" and once you've navigated to a page, it's very easy to use its DOM to pull out the data you want.
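
To show why an HTML parsing library tends to hold up better than regular expressions, here is a small sketch in Python's standard html.parser that pulls the cell text out of table rows. The class name, the helper name, and the sample table are all made up for illustration; a real stats page will have its own markup.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample markup, standing in for a scraped stats page.
SAMPLE = """
<table class="stats">
  <tr><th>Team</th><th>Score</th></tr>
  <tr><td>Home</td><td> 24 </td></tr>
  <tr><td>Away</td><td>17</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
```

Because the parser keys on tags rather than on exact text layout, cosmetic changes such as added attributes or extra whitespace are less likely to break it than a regular expression would be. A changed table structure will still break it.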

Quote:
watch out though - you may be violating copyright if you scrape the nfl.com site.
IANAL, either, but: The bigger issue is whether or not you're abusing their computers which might violate your ISP's terms of service. If your app hits the site like a normal person would, you can usually fly under their radar. It especially helps to route your requests through open proxies.
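
One way to keep a script from hammering a site is to put a minimum gap between requests. The sketch below is only an illustration: the Throttle class, the 5-second delay, and the 2-second jitter are assumptions of mine, and the right values depend on the site.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between requests.

    Hypothetical helper; clock and sleep are injectable so the class
    can be exercised without real waiting.
    """

    def __init__(self, min_delay=5.0, jitter=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        now = self._clock()
        if self._last is not None:
            gap = self.min_delay + random.uniform(0, self.jitter)
            remaining = gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept
```

Call wait() immediately before each request. The first call returns at once, and later calls block only as long as needed to keep the gap.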
  #4 (permalink)  
Old 11-27-06, 06:04
okayfine okayfine is offline
Registered User
 
Join Date: Feb 2004
Posts: 3
Quote:
2. A web library. Perl's LWP is good, and LWP::Simple is easy to learn. I'm sure .NET's HttpWebRequest is the same thing.

Pros: Can be dead simple. perl -MLWP::Simple -e 'getstore "http://dbforums.com/", "index.html";' is all it takes to save a known web page to disk. Can run in the background, and is easy to restart a hung process. Easy to debug.

Cons: It can be painful trying to log into any user areas. Scanning text with regular expressions might work one day, but not the next. Harder to disguise your robot from a savvy webmaster.

Personally, I recommend Perl for the latter approach because the LWP libraries are very well developed, as are the HTML parsing libraries. But sometimes you really can't beat telling IE "click this button, follow this hyperlink" and once you've navigated to a page, it's very easy to use its DOM to pull out the data you want.
How easy is it to do this with Visual Basic? Some of the websites use JavaScript authentication.

Quote:
IANAL, either, but: The bigger issue is whether or not you're abusing their computers which might violate your ISP's terms of service. If your app hits the site like a normal person would, you can usually fly under their radar. It especially helps to route your requests through open proxies.
As long as I use the data for private use, it should be okay, no? I want to start the learning process by storing sports data, but eventually, I want to track commodity prices, etc.

Thanks in advance for your replies.
  #5 (permalink)  
Old 12-03-06, 16:47
sco08y sco08y is offline
Registered User
 
Join Date: Oct 2002
Location: Baghdad, Iraq
Posts: 697
Quote:
Originally Posted by okayfine
How easy is it to do this with Visual Basic? Some of the websites use JavaScript authentication.
Read the JavaScript and see what's going back and forth between the web browser and server. I use OmniWeb for this because it gives me a transcript of the HTTP requests, but that's a Mac-based browser. You can use a packet sniffer, too.
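
Once the transcript shows what the browser actually sends at login, the script can send the same thing. Here is a minimal Python sketch, assuming the login turns out to be a plain form POST plus cookies. The helper names and the URL and field names in the usage note are placeholders; the real ones come from the transcript.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies like a browser."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(url, fields):
    """Build the form POST that the browser was observed to send.

    fields is a dict of form field names to values; the names must be
    copied from the observed request, not guessed.
    """
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Usage would look something like opener.open(build_login_request("http://example.com/login", {"user": "me", "pass": "secret"})), followed by opener.open(...) on the stats page, so the session cookie from the login is sent along. If the site's JavaScript computes extra values before submitting, those would have to be reproduced as well.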

Quote:
Originally Posted by okayfine
As long as I use the data for private use, it should be okay, no?
Like I said, I'm not a lawyer. If you can't do the time, don't do the crime.