#1
03-14-11, 12:42
AstralWurks
Registered User
 
Join Date: Mar 2011
Posts: 3
Hey everyone. I have a data scraping question

I am new to the forums and fairly new to databases. I will be in charge of a project at work where I need to scrape data from certain sites and, more importantly, format it so that it can be imported into a SQL-based database.

Here's my question: The data comes to us via e-mail with no consistent formatting, and each data source lays it out differently. The person developing the database says that a format like pipe-delimited would be best. A raw copy-and-paste sample of one data set looks like this:

"SPECIAL SEWER LATERAL REPLACEMENTS FY11
Project No. X7388146
LOCATION: Annapolis, MD (Anne Arundel Co.)
ESTIMATED AMOUNT: $500,000 to $1,000,000
CONTRACTING METHOD: Competitive Public Bids
UPDATE: Bids Received March 8, 2011 Under Review.
BIDS OPENED:March 8, 2011
OWNER: Anne Arundel County Dept of Public Works
2662 Riva Road, Annapolis, MD 21401
(410)222-7543 FAX# (410)222-7589
Contact: Nancy Whitnall Phone#410)571-0092
OWNER REP: Anne Arundel County Purchasing
2660 Riva Road, 3rd Floor, Annapolis, MD, 21401
(410)222-7620 FAX# (410)222-7624
DIVISION:
Div33 utilities, sanitary sewerage utilities
NOTES: Questions regarding this Project should be directed to the Project Manager, Nancy Whitnall at (410)571-0092
Plans: Owner
PLAN DEP: $20.00 Not Refundable
A Pre-bid Meeting was held on February 24, 2011 at 9:00 AM at Owner
Industry Type: Engineering
Sub Industry Type: Sewers/Underground Waterlines
Apparent Low Bidders:
1. Burgemeister Bell $876,295.00
FAX# (410)363-0883 ,10331 S. Dolfield Rd., Owings Mills, MD 212 08` (410)363-4081
2. Matricciani Company $887,896.00
3. Schummer Inc $1,217,334.00 "

How on earth do I get that information into a consistent format efficiently? If I have to do it manually, I'll never be able to keep up. I just need some tips and insight on that part of the database process. Thanks in advance, guys!

#2
03-14-11, 12:52
AstralWurks
Registered User
 
Join Date: Mar 2011
Posts: 3
Need help formatting/scraping data

Hello everyone, I'm new here and fairly new to databases. I've worked mostly with Access but will now be in charge of collecting, formatting, and loading data into a SQL database that developers will create. I won't need to do much on the back end, but I will be the main person populating the database.

Here's my question: The data comes to us via e-mail with no consistent formatting. We will have about 5 data streams, and each data source lays its data out differently. The person developing the database says it would be best if I can get the data into a format like pipe-delimited. Here's a raw copy-and-paste sample of what one of the data sets looks like in the e-mail:

"SPECIAL SEWER LATERAL REPLACEMENTS FY11
Project No. X7388146
LOCATION: Annapolis, MD (Anne Arundel Co.)
ESTIMATED AMOUNT: $500,000 to $1,000,000
CONTRACTING METHOD: Competitive Public Bids
UPDATE: Bids Received March 8, 2011 Under Review.
BIDS OPENED:March 8, 2011
OWNER: Anne Arundel County Dept of Public Works
2662 Riva Road, Annapolis, MD 21401
(410)222-7543 FAX# (410)222-7589
Contact: Nancy Whitnall Phone#410)571-0092
OWNER REP: Anne Arundel County Purchasing
2660 Riva Road, 3rd Floor, Annapolis, MD, 21401
(410)222-7620 FAX# (410)222-7624
DIVISION:
Div33 utilities, sanitary sewerage utilities
NOTES: Questions regarding this Project should be directed to the Project Manager, Nancy Whitnall at (410)571-0092
Plans: Owner
PLAN DEP: $20.00 Not Refundable
A Pre-bid Meeting was held on February 24, 2011 at 9:00 AM at Owner
Industry Type: Engineering
Sub Industry Type: Sewers/Underground Waterlines
Apparent Low Bidders:
1. Burgemeister Bell $876,295.00
FAX# (410)363-0883 ,10331 S. Dolfield Rd., Owings Mills, MD 212 08` (410)363-4081
2. Matricciani Company $887,896.00
3. Schummer Inc $1,217,334.00 "

How do I most efficiently get that information into pipe-delimited form? If I have to do it manually, I'll never be able to keep up. I just need some tips and insight on that part of the database process. Thanks in advance, guys!

#3
03-14-11, 13:15
Teddy
Purveyor of Discontent
 
Join Date: Mar 2003
Location: The Bottom of The Barrel
Posts: 6,075
Do you get the same format for all messages from a given datasource?

I'd hit it with regular expressions in C# or Perl.
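
For illustration only, here is a minimal sketch of that regular-expression approach, written in Python rather than C# or Perl. It assumes every message from this source keeps the same labels as the sample in the first post; the column names and the `parse_record` helper are hypothetical, not anything the database developer specified.

```python
import re

# Abbreviated copy of the sample from the first post.
SAMPLE = """SPECIAL SEWER LATERAL REPLACEMENTS FY11
Project No. X7388146
LOCATION: Annapolis, MD (Anne Arundel Co.)
ESTIMATED AMOUNT: $500,000 to $1,000,000
CONTRACTING METHOD: Competitive Public Bids
BIDS OPENED:March 8, 2011
"""

# Hypothetical column names mapped to one pattern each (for this source only).
# [ \t]* skips spaces after the label without running onto the next line.
FIELD_PATTERNS = {
    "project_no": re.compile(r"^Project No\.[ \t]*(\S+)", re.MULTILINE),
    "location": re.compile(r"^LOCATION:[ \t]*(.+)$", re.MULTILINE),
    "estimated_amount": re.compile(r"^ESTIMATED AMOUNT:[ \t]*(.+)$", re.MULTILINE),
    "contracting_method": re.compile(r"^CONTRACTING METHOD:[ \t]*(.+)$", re.MULTILINE),
    "bids_opened": re.compile(r"^BIDS OPENED:[ \t]*(.+)$", re.MULTILINE),
}


def parse_record(text):
    """Return a dict of fields; a label that is missing yields an empty string."""
    # Assumption: the first non-blank line of the message is the project title.
    record = {"title": text.strip().splitlines()[0].strip()}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else ""
    return record


record = parse_record(SAMPLE)
print(record["project_no"], "|", record["bids_opened"])
```

Each data source would need its own set of patterns, so this only pays off if the layout from a given source really is stable.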
__________________
oh yeah... documentation... I have heard of that.

*** What Do You Want In The MS Access Forum? ***

#4
03-14-11, 13:29
AstralWurks
Registered User
 
Join Date: Mar 2011
Posts: 3
Yes, it comes the same way from each source. I don't really know any programming, though. Would something like Google Refine be able to handle that type of thing?

#5
03-14-11, 14:49
Teddy
Purveyor of Discontent
 
Join Date: Mar 2003
Location: The Bottom of The Barrel
Posts: 6,075
I haven't used Google Refine. It looks like it allows you to parse data into multiple rows with regular expressions or their home-brew scripting language, so it might work.

Note that you'll be doing roughly the same amount of "programming" in Google Refine as you would in C# or Perl. You won't have to do the export stuff I suppose, but other than that...
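
To give a sense of how much "programming" that is, here is a rough Python sketch of the multiple-rows part: it turns the "Apparent Low Bidders" block of the sample into one row per bidder. The `parse_bidders` helper and the row keys are made up for illustration, and it assumes each bidder line has the form "N. Name $amount".

```python
import re

# Bidder block copied from the sample in the first post.
BIDDER_BLOCK = """Apparent Low Bidders:
1. Burgemeister Bell $876,295.00
FAX# (410)363-0883 ,10331 S. Dolfield Rd., Owings Mills, MD 212 08` (410)363-4081
2. Matricciani Company $887,896.00
3. Schummer Inc $1,217,334.00
"""

# Rank, a dot, the bidder name, then a dollar amount at the end of the line.
BIDDER_LINE = re.compile(
    r"^(\d+)\.[ \t]+(.+?)[ \t]+\$([\d,]+\.\d{2})[ \t]*$", re.MULTILINE
)


def parse_bidders(text):
    """Return one dict per bidder line; lines that do not match are skipped."""
    rows = []
    for rank, name, amount in BIDDER_LINE.findall(text):
        rows.append({
            "rank": int(rank),
            "bidder": name,
            "amount": amount.replace(",", ""),  # drop thousands separators
        })
    return rows


rows = parse_bidders(BIDDER_BLOCK)
for row in rows:
    print(row["rank"], row["bidder"], row["amount"], sep="|")
```

The fax and address line under the first bidder is skipped here because it does not match the pattern; capturing it would need a further rule for this source.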

#6
03-15-11, 10:17
blindman
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 11,726
You are going to have to write some custom scripts to handle the five(+) formats in which you will receive the data.
There is no getting around that.
Making it pipe delimited is not the problem. Parsing it is the problem.
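
As a sketch of why the delimiter is the easy half: once a parser has produced a dictionary per record, Python's standard `csv` module can write it pipe-delimited in a few lines. The `PARSERS` table, the `parse_source_a` function, and the column list below are hypothetical placeholders; the real work is writing one such parser for each of the five or so sources.

```python
import csv
import io


def parse_source_a(text):
    """Hypothetical parser for one source: splits 'LABEL: value' lines."""
    record = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            # 'ESTIMATED AMOUNT' becomes the key 'estimated_amount'
            record[label.strip().lower().replace(" ", "_")] = value.strip()
    return record


# One custom parser per data source; only one is sketched here.
PARSERS = {"source_a": parse_source_a}

# Assumed column order; the real one would come from the database developer.
COLUMNS = ["location", "estimated_amount", "contracting_method"]


def to_pipe_delimited(records, columns=COLUMNS):
    """Write records as pipe-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="|",
        extrasaction="ignore",  # skip parsed fields that are not in COLUMNS
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # columns missing from a record are left empty
    return buffer.getvalue()


email_body = """LOCATION: Annapolis, MD (Anne Arundel Co.)
ESTIMATED AMOUNT: $500,000 to $1,000,000
CONTRACTING METHOD: Competitive Public Bids
"""
record = PARSERS["source_a"](email_body)
output = to_pipe_delimited([record])
print(output)
```

The `csv` writer also quotes any value that itself contains a pipe, which a hand-rolled `"|".join(...)` would not do.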
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"

#7
03-15-11, 10:28
Teddy
Purveyor of Discontent
 
Join Date: Mar 2003
Location: The Bottom of The Barrel
Posts: 6,075
I merged your other thread. Heads up.