  1. #1
    Join Date
    May 2004
    Posts
    6

    Unanswered: Get data from HTML pages

    I have a bunch of HTML pages with static information that I need to open and retrieve the information from them. How can I go about it? HTML pages are just text, do I handle it the same way? Has anyone done something similar? Any threads, code, or articles you know of would be greatly appreciated.
    Thanks!

  2. #2
    Join Date
    Oct 2002
    Location
    Baghdad, Iraq
    Posts
    697
    Scraping info out of files is not for the faint-hearted or the shallow-pocketed.

    Quote Originally Posted by lgarcia3
    I have a bunch of HTML pages with static information that I need to open and retrieve the information from them.
    Is a "bunch" like 100, 10 thousand, 10 million, what?

    Are they all very similar? If there's really only a dozen pages, copy and paste is preferable. If you have many and they're all completely different, you're going to need some specialized services.

    Quote Originally Posted by lgarcia3
    HTML pages are just text, do I handle it the same way?
    You can, in theory. You probably want an HTML parser. Do a search for MSHTML DOM interface if you're using VB.

    If you're liable to do this again, it might be worth it to learn Perl. ActivePerl is a good Windows distribution.
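The HTML parser suggestion above can be sketched in Python using only the standard library's `html.parser` module. This is a minimal illustration, not a recommendation from the thread: the `TextExtractor` class, the `extract_text` helper, and the sample markup are all made up for the example, and real pages will likely need rules specific to their layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample page, for illustration only.
sample = (
    "<html><head><title>Part 42</title><style>b{}</style></head>"
    "<body><h1>Widget</h1><p>Price: <b>9.99</b></p></body></html>"
)
print(extract_text(sample))  # ['Part 42', 'Widget', 'Price:', '9.99']
```

A parser like this tolerates attribute and whitespace variations that tend to break plain string searches, which matters when the pages were not all generated the same way.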

  3. #3
    Join Date
    May 2004
    Posts
    6
    Thanks for your reply

    BUNCH = 3000 pages or so, maybe more

    A parser would be nice. I'll check MSHTML, but no, I don't have to use VB; I can use other languages. Pretty much any DOM that works with Java, C#, VB, or JavaScript will do. Any suggestions would be appreciated.

  4. #4
    Join Date
    Oct 2002
    Location
    Baghdad, Iraq
    Posts
    697
    A major criterion ought to be what tool you're most familiar with. After that, I'd avoid languages that involve compiling (C#, VB.NET, Java) because you will waste time writing lots of extra code, and compilation really slows down development. Something that offers a Read-Eval-Print Loop (REPL), like Perl or Python, is good because you will need to do lots of little tests to get it right. I'd use either of those, or possibly a combination of awk/sed/grep/etc.

    As for books, I'd get a decent guide to regular expressions in your language of choice. You'll wind up using them again and again. For articles or websites, what you're doing is essentially part of web scraping, so search for that phrase, again with your language of choice.
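As a rough illustration of the regular-expression approach applied to a folder of a few thousand static pages, here is a short Python sketch. The pattern, the "Price" field, and the `scrape_folder` helper are hypothetical; they assume every page holds its value in a two-cell table row, which may not match the real files.

```python
import re
from pathlib import Path

# Hypothetical pattern: matches a row such as <td>Price:</td><td>9.99</td>.
PRICE_RE = re.compile(
    r"<td>\s*Price:\s*</td>\s*<td>\s*([\d.]+)\s*</td>", re.IGNORECASE
)


def scrape_folder(folder):
    """Return {filename: price string or None} for every .html file in folder."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps one badly encoded page from stopping the run.
        text = path.read_text(encoding="utf-8", errors="replace")
        match = PRICE_RE.search(text)
        results[path.name] = match.group(1) if match else None
    return results
```

Files that come back as `None` are the ones whose layout differs from the assumed pattern, so they are the ones worth opening by hand before adjusting the expression.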

  5. #5
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    Provided Answers: 1
    That is a very strange reason to avoid compiled languages.

    You realize that a script like this compiles in about 2 seconds on a sub-par machine, right? Pressing "F5" doesn't seem that difficult. As for writing lots of extra code, that's exactly the opposite of how C# works. Actually, the fact that it abstracts and insulates people from writing core code is a major complaint among "serious" .NET detractors. Making the request and receiving the response required in this instance can be done in about six lines of C#.

    That said, I'm confused about your criticisms of compiled languages in general.



    Also, here's a decent end-to-end solution that covers a very basic example of retrieving and regex'ing your way to sanitized data for storage in a database:

    Screen Scraping Tutorial using C# .NET
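For the storage step mentioned above, here is a minimal sketch in Python with the standard `sqlite3` module. It is not the code from the linked C# tutorial; the `pages` table, its columns, and the `store_rows` helper are assumptions made purely for illustration.

```python
import sqlite3


def store_rows(db_path, rows):
    """Insert (filename, price) pairs and return the number of rows stored."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages "
            "(filename TEXT PRIMARY KEY, price REAL)"
        )
        # Placeholders keep scraped strings from being interpreted as SQL.
        conn.executemany(
            "INSERT OR REPLACE INTO pages (filename, price) VALUES (?, ?)",
            rows,
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    finally:
        conn.close()


# Hypothetical scraped values, stored in a throwaway in-memory database.
print(store_rows(":memory:", [("a.html", 9.99), ("b.html", 12.5)]))  # 2
```

Using the filename as the primary key means re-running the scrape after fixing a pattern overwrites earlier rows instead of duplicating them.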

  6. #6
    Join Date
    Oct 2002
    Location
    Baghdad, Iraq
    Posts
    697
    Quote Originally Posted by Teddy
    You realize that a script like this compiles in about 2 seconds on a sub-par machine, right? Pressing "F5" doesn't seem that difficult.
    And I realize that I've done little projects like this hundreds of times, as well as big projects that are suited to compiled languages, and my judgement is that I generally wouldn't use a compiled language for a little job like this.

    It's not that you can't, it's just not the right tool for the job. The extra compilation phase is just one reason. Sure, a newbie *might* get lucky and not screw up configuring the compiler, but it's also possible s/he'd spend hours sifting through documentation. Interpreted languages aren't immune to this either, but my experience has been it's a more frequent problem with compiled languages, and the reason is that there's more configuration you have to do with a compiled language that brings you no benefit in a project like this.

    Further, few of the benefits of compiled languages apply here. C# is statically bound, and that would give me no benefit in this case. I don't need to be able to generate a library or make a GUI or any of that stuff, so I get no benefit from that. All the complexity that comes with an IDE gives me no benefit.

    This guy wants to write a script, see what errors come up, fix his code and try again. That mode of development is the classic model for using an interpreted language. This isn't a criticism of compiled languages any more than saying I'd rather not use a monkey wrench to remove an ingrown hair is a criticism of monkey wrenches.
