I have a bunch of static HTML pages that I need to open and extract information from. How should I go about it? HTML pages are just text, so do I handle them the same way as any other text file? Has anyone done something similar? Pointers to threads, code, or articles would be greatly appreciated.
A major criterion ought to be what tool you're most familiar with. After that, I'd avoid languages that involve compiling (C#, VB.NET, Java) because you will waste time writing lots of extra code, and compilation really slows down development. Something that offers a read-eval-print loop (REPL), like Perl or Python, is good because you will need to do lots of little tests to get it right. I'd use either of those, or possibly a combination of awk/sed/grep/etc.
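To make the "lots of little tests" point concrete, here is a minimal Python sketch of loading the pages as ordinary text so you can poke at them interactively. The function name `load_pages`, the `*.html` pattern, and the UTF-8 encoding are assumptions for illustration; nothing in the question specifies them.

```python
from pathlib import Path


def load_pages(folder, pattern="*.html"):
    """Return a dict mapping file name to file contents for matching files.

    `folder` and `pattern` are hypothetical; adjust them to your own layout.
    """
    pages = {}
    for path in sorted(Path(folder).glob(pattern)):
        # HTML is plain text, so an ordinary text read is enough.
        # errors="replace" substitutes undecodable bytes instead of raising.
        pages[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return pages
```

At the interactive prompt you could then run `pages = load_pages("some_folder")` and look at a slice of one page's text before deciding how to pull the data out.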
As for books, I'd get a decent guide to regular expressions in your language of choice. You'll wind up using them again and again. For articles or websites, what you're doing is essentially part of web scraping, so search for that phrase, again with your language of choice.
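For a feel of what the regular-expression approach looks like, here is a small Python sketch. The markup it targets (a `<td class="price">` cell) and the names `PRICE_CELL` and `extract_prices` are made up for illustration; your pages will need their own pattern.

```python
import re

# Hypothetical pattern: captures the text inside each <td class="price"> cell.
# DOTALL lets the cell contents span several lines; IGNORECASE tolerates <TD>.
PRICE_CELL = re.compile(
    r'<td\s+class="price"\s*>(.*?)</td>',
    re.IGNORECASE | re.DOTALL,
)


def extract_prices(html):
    """Return the stripped contents of every matching cell in the HTML text."""
    return [cell.strip() for cell in PRICE_CELL.findall(html)]
```

This tends to work when the pages are static and the markup is consistent from page to page, which sounds like the situation described. If the markup varies much (different attribute order, nested tags), the pattern gets fragile and a real HTML parser is probably the safer choice.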
That is a very strange reason to avoid compiled languages.
You realize that a script like this compiles in about 2 seconds on a sub-par machine, right? Pressing "F5" doesn't seem that difficult. As for writing lots of extra code, that's exactly the opposite of how C# works. In fact, the way it abstracts and insulates people from writing core code is a major complaint among "serious" .NET detractors. Making the request and receiving the response required in this instance can be done in about six lines of C#.
That said, I'm confused about your criticisms of compiled languages in general.
Also, here's a decent end-to-end solution that covers a very basic example of retrieving and regex'ing your way to sanitized data for storage in a database:
You realize that a script like this compiles in about 2 seconds on a sub-par machine, right? Pressing "F5" doesn't seem that difficult.
And I realize that I've done little projects like this hundreds of times, as well as big projects that are suited to compiled languages, and my judgement is that I wouldn't generally use a compiled language for a little job like this.
It's not that you can't, it's just not the right tool for the job. The extra compilation phase is just one reason. Sure, a newbie *might* get lucky and not screw up configuring the compiler, but it's also possible s/he'd spend hours sifting through documentation. Interpreted languages aren't immune to this either, but my experience has been it's a more frequent problem with compiled languages, and the reason is that there's more configuration you have to do with a compiled language that brings you no benefit in a project like this.
Further, few of the benefits of compiled languages apply here. C# is statically typed, and that would give me no benefit in this case. I don't need to be able to generate a library or make a GUI or any of that stuff, so I get no benefit from that. All the complexity that comes with an IDE gives me no benefit.
This guy wants to write a script, see what errors come up, fix his code and try again. That mode of development is the classic model for using an interpreted language. This isn't a criticism of compiled languages any more than saying I'd rather not use a monkey wrench to remove an ingrown hair is a criticism of monkey wrenches.