#1
07-18-09, 13:29
lgarcia3
Get data from HTML pages

I have a bunch of HTML pages with static information, and I need to open them and extract that information. How can I go about it? HTML pages are just text, so do I handle them the same way as any other text file? Has anyone done something similar? Pointers to threads, code, or articles would be greatly appreciated.
Thanks!
#2
07-18-09, 14:02
sco08y
Scraping info out of files is not for the faint-hearted or the shallow-pocketed.

Quote:
Originally Posted by lgarcia3
I have a bunch of HTML pages with static information that I need to open and retrieve the information from them.
Is a "bunch" like 100, 10 thousand, 10 million, what?

Are they all very similar? If there's really only a dozen pages, copy and paste is preferable. If you have many and they're all completely different, you're going to need some specialized services.

Quote:
Originally Posted by lgarcia3
HTML pages are just text, do I handle it the same way?
You can, in theory. You probably want an HTML parser. Do a search for MSHTML DOM interface if you're using VB.

If you're liable to do this again, it might be worth it to learn Perl. ActivePerl is a good Windows distribution.
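
A minimal sketch of the parser approach, written in Python (one of the languages suggested later in this thread) rather than VB. It uses the standard-library `html.parser` module. The assumption that the data sits in `<td>` table cells, and the names `TableCellExtractor` and `extract_rows`, are hypothetical; the thread never describes the page layout.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row.

    Assumption: the wanted data lives in table cells. Adjust the tag
    names to match the real pages.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def extract_rows(html_text):
    """Parse one page and return its table rows as lists of strings."""
    parser = TableCellExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Calling `extract_rows(page_text)` on the text of one page returns a list of rows. Unlike plain text matching, the parser decodes entities such as `&amp;` and is not thrown off by attribute order or extra whitespace inside tags.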
#3
07-19-09, 08:11
lgarcia3
Thanks for your reply

BUNCH = 3000 pages or so, maybe more

A parser would be nice. I'll check MSHTML, but I don't have to use VB; I can use other languages. Pretty much any DOM that works with Java, C#, VB, or JavaScript will do. Any suggestions would be appreciated.
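
With roughly 3000 pages, the per-page extraction needs a loop around it that survives the odd malformed file. A rough sketch in Python, assuming the pages sit in one folder with an `.html` extension and UTF-8 encoding; the names `read_html_files` and `process_all`, the folder layout, and the encoding are all illustrative guesses, not details from this thread.

```python
from pathlib import Path


def read_html_files(folder, pattern="*.html", encoding="utf-8"):
    """Yield (path, text) for every matching file in folder, in sorted order."""
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" keeps one badly encoded page from stopping the run.
        yield path, path.read_text(encoding=encoding, errors="replace")


def process_all(folder, extract):
    """Apply extract(text) to each page; collect results and failures separately."""
    results, failures = {}, []
    for path, text in read_html_files(folder):
        try:
            results[path.name] = extract(text)
        except Exception as exc:  # broad on purpose: record it and keep going
            failures.append((path.name, repr(exc)))
    return results, failures
```

Here `extract` is whatever function pulls the data out of one page. Keeping the failures in a separate list makes it easy to rerun only the pages that broke after adjusting the extraction logic.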
#4
07-19-09, 11:32
sco08y
A major criterion ought to be which tool you're most familiar with. After that, I'd avoid languages that involve compiling (C#, VB.NET, Java) because you will waste time writing lots of extra code, and compilation really slows down development. Something that offers a read-eval-print loop (REPL), like Perl or Python, is good because you will need to do lots of little tests to get it right. I'd use either of those, or possibly a combination of awk/sed/grep/etc.

As for books, I'd get a decent guide to regular expressions in your language of choice. You'll wind up using them again and again. For articles or websites, what you're doing is essentially part of web scraping, so search for that phrase, again with your language of choice.
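
To give a feel for the regular-expression route, here is a small Python sketch using the standard `re` module. The markup it targets (a label in `<b>...</b>` followed by its value) is purely hypothetical, as are the names `FIELD_RE` and `extract_fields`; the real pattern depends entirely on how the pages are written.

```python
import re

# Hypothetical layout: "<b>Label:</b> value", with the value running
# up to the next tag. The colon after the label is optional.
FIELD_RE = re.compile(
    r"<b>\s*(?P<label>[^<:]+?)\s*:?\s*</b>\s*(?P<value>[^<]+)",
    re.IGNORECASE,
)


def extract_fields(html_text):
    """Return {label: value} for every '<b>Label:</b> value' pair found."""
    return {
        m.group("label"): m.group("value").strip()
        for m in FIELD_RE.finditer(html_text)
    }
```

Patterns like this tend to work when all the pages come from the same template, and they are quick to adjust interactively. They are brittle against nested tags or layout changes, which is where a real HTML parser earns its keep.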
#5
07-19-09, 13:45
Teddy
That is a very strange reason to avoid compiled languages.

You realize that a script like this compiles in about 2 seconds on a sub-par machine, right? Pressing "F5" doesn't seem that difficult. As for writing lots of extra code, that's exactly the opposite of how C# works. In fact, the way it abstracts and insulates people from writing core code is a major complaint among "serious" .NET detractors. Making the request and receiving the response required in this instance can be done in about six lines of C#.

That said, I'm confused about your criticisms of compiled languages in general.



Also, here's a decent end-to-end solution that covers a very basic example of retrieving and regex'ing your way to sanitized data for storage in a database:

Screen Scraping Tutorial using C# .NET
#6
07-26-09, 14:56
sco08y
Quote:
Originally Posted by Teddy
You realize that a script like this compiles in about 2 seconds on a sub-par machine, right? Pressing "F5" doesn't seem that difficult.
And I realize that I've done little projects like this hundreds of times, as well as big projects that are suited to compiled languages, and my judgement is that I wouldn't generally use a compiled language for a little job like this.

It's not that you can't, it's just not the right tool for the job. The extra compilation phase is just one reason. Sure, a newbie *might* get lucky and not screw up configuring the compiler, but it's also possible s/he'd spend hours sifting through documentation. Interpreted languages aren't immune to this either, but my experience has been it's a more frequent problem with compiled languages, and the reason is that there's more configuration you have to do with a compiled language that brings you no benefit in a project like this.

Further, few of the benefits of compiled languages apply here. C# is statically bound, and that would give me no benefit in this case. I don't need to be able to generate a library or make a GUI or any of that stuff, so I get no benefit from that. All the complexity that comes with an IDE gives me no benefit.

This guy wants to write a script, see what errors come up, fix his code and try again. That mode of development is the classic model for using an interpreted language. This isn't a criticism of compiled languages any more than saying I'd rather not use a monkey wrench to remove an ingrown hair is a criticism of monkey wrenches.