  1. #1
    Join Date
    Mar 2011
    Posts
    3

    Hey everyone. I have a data scraping question

    I am new to the forums and fairly new to databases. I will be in charge of a project at work where I need to scrape data from certain sites and, more importantly, format it so that it can be imported into a SQL-based database.

    Here's my question: The data comes to us in completely unformatted ways via e-mail, and each data source lays it out differently. The person developing the database says that a format like pipe-delimited would be best. So if a raw copy-and-paste sample of my data set looks like this:

    "SPECIAL SEWER LATERAL REPLACEMENTS FY11
    Project No. X7388146
    LOCATION: Annapolis, MD (Anne Arundel Co.)
    ESTIMATED AMOUNT: $500,000 to $1,000,000
    CONTRACTING METHOD: Competitive Public Bids
    UPDATE: Bids Received March 8, 2011 Under Review.
    BIDS OPENED:March 8, 2011
    OWNER: Anne Arundel County Dept of Public Works
    2662 Riva Road, Annapolis, MD 21401
    (410)222-7543 FAX# (410)222-7589
    Contact: Nancy Whitnall Phone#410)571-0092
    OWNER REP: Anne Arundel County Purchasing
    2660 Riva Road, 3rd Floor, Annapolis, MD, 21401
    (410)222-7620 FAX# (410)222-7624
    DIVISION:
    Div33 utilities, sanitary sewerage utilities
    NOTES: Questions regarding this Project should be directed to the Project Manager, Nancy Whitnall at (410)571-0092
    Plans: Owner
    PLAN DEP: $20.00 Not Refundable
    A Pre-bid Meeting was held on February 24, 2011 at 9:00 AM at Owner
    Industry Type: Engineering
    Sub Industry Type: Sewers/Underground Waterlines
    Apparent Low Bidders:
    1. Burgemeister Bell $876,295.00
    FAX# (410)363-0883 ,10331 S. Dolfield Rd., Owings Mills, MD 212 08` (410)363-4081
    2. Matricciani Company $887,896.00
    3. Schummer Inc $1,217,334.00 "

    How on earth do I get that information formatted efficiently? If I have to do it manually, I'll never be able to keep up. I just need some tips and insight on that aspect of the database process. Thanks in advance, guys!

  2. #2
    Join Date
    Mar 2011
    Posts
    3

    Need help formatting/scraping data

    Hello everyone, I'm new here and fairly new to databases. I've worked mostly with Access, but I will now be in charge of collecting, formatting and loading data into a SQL database that developers will create. I won't need to do much on the back end, but I will be the main person actually populating the database.

    Here's my question: The data comes to us in completely unformatted ways via e-mail. We will have about 5 data streams, and each source formats its data completely differently. The person developing the database says it would be best if I can get the data into a format like pipe-delimited. Here's a raw copy-and-paste sample of what one of the data sets looks like in the e-mail:

    "SPECIAL SEWER LATERAL REPLACEMENTS FY11
    Project No. X7388146
    LOCATION: Annapolis, MD (Anne Arundel Co.)
    ESTIMATED AMOUNT: $500,000 to $1,000,000
    CONTRACTING METHOD: Competitive Public Bids
    UPDATE: Bids Received March 8, 2011 Under Review.
    BIDS OPENED:March 8, 2011
    OWNER: Anne Arundel County Dept of Public Works
    2662 Riva Road, Annapolis, MD 21401
    (410)222-7543 FAX# (410)222-7589
    Contact: Nancy Whitnall Phone#410)571-0092
    OWNER REP: Anne Arundel County Purchasing
    2660 Riva Road, 3rd Floor, Annapolis, MD, 21401
    (410)222-7620 FAX# (410)222-7624
    DIVISION:
    Div33 utilities, sanitary sewerage utilities
    NOTES: Questions regarding this Project should be directed to the Project Manager, Nancy Whitnall at (410)571-0092
    Plans: Owner
    PLAN DEP: $20.00 Not Refundable
    A Pre-bid Meeting was held on February 24, 2011 at 9:00 AM at Owner
    Industry Type: Engineering
    Sub Industry Type: Sewers/Underground Waterlines
    Apparent Low Bidders:
    1. Burgemeister Bell $876,295.00
    FAX# (410)363-0883 ,10331 S. Dolfield Rd., Owings Mills, MD 212 08` (410)363-4081
    2. Matricciani Company $887,896.00
    3. Schummer Inc $1,217,334.00 "

    How do I most efficiently get that information pipe-delimited? If I have to do it manually, I'll never be able to keep up. I just need some tips and insight on that aspect of the database process. Thanks in advance, guys!

  3. #3
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    Do you get the same format for all messages from a given datasource?

    I'd hit it with regular expressions in C# or Perl.
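
    To give a rough idea of what the regular-expression approach looks like, here is a minimal sketch in Python (not C# or Perl, purely for illustration). It assumes every message from this source starts with a title line, has a "Project No." line, and then uses "LABEL: value" lines as in the sample above. The names COLUMNS, parse_record and to_pipe_row are made up for this example, and multi-line fields such as OWNER or the bidder list would need extra handling.

```python
import re

# A shortened copy of the sample message from the first post.
SAMPLE = """SPECIAL SEWER LATERAL REPLACEMENTS FY11
Project No. X7388146
LOCATION: Annapolis, MD (Anne Arundel Co.)
ESTIMATED AMOUNT: $500,000 to $1,000,000
CONTRACTING METHOD: Competitive Public Bids
BIDS OPENED:March 8, 2011
"""

# Hypothetical column order; the real one would come from the database developer.
COLUMNS = ["TITLE", "PROJECT NO", "LOCATION", "ESTIMATED AMOUNT",
           "CONTRACTING METHOD", "BIDS OPENED"]

# An upper-case label, a colon, optional spaces, then the value.
LABELED = re.compile(r"^([A-Z][A-Z ]+):\s*(.*)$")
# The project number line has no colon, so it gets its own pattern.
PROJECT_NO = re.compile(r"^Project No\.\s*(\S+)$")


def parse_record(text):
    """Turn one message body into a dict of field name -> value."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    record = {"TITLE": lines[0]}  # assumption: the first line is the title
    for line in lines[1:]:
        m = PROJECT_NO.match(line)
        if m:
            record["PROJECT NO"] = m.group(1)
            continue
        m = LABELED.match(line)
        if m:
            record[m.group(1).strip()] = m.group(2).strip()
    return record


def to_pipe_row(record):
    """Join the known columns with pipes; missing fields become empty."""
    # Swap out any literal pipe in a value so the delimiter stays unambiguous.
    return "|".join(record.get(col, "").replace("|", "/") for col in COLUMNS)


print(to_pipe_row(parse_record(SAMPLE)))
```

    Lines that match neither pattern are silently skipped here; in practice you would probably want to log them so a format change at the source does not go unnoticed.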
    oh yeah... documentation... I have heard of that.

    *** What Do You Want In The MS Access Forum? ***

  4. #4
    Join Date
    Mar 2011
    Posts
    3
    Yes, it comes the same way from each source. I don't really know any programming or anything though. Would something like Google Refine be able to handle that type of thing?

  5. #5
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    I haven't used Google Refine. It looks like it allows you to parse data into multiple rows with regular expressions or their home-brew scripting language, so it might work.

    Note that you'll be doing roughly the same amount of "programming" in Google Refine as you would in C# or Perl. You won't have to do the export stuff I suppose, but other than that...

  6. #6
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    You are going to have to write some custom scripts to handle the five(+) formats in which you will receive the data.
    There is no getting around that.
    Making it pipe delimited is not the problem. Parsing it is the problem.
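
    As a rough illustration of that split, here is a minimal Python sketch: one small parser per source, plus a shared export step. Everything named here is hypothetical (the sender addresses, parse_source_a, parse_source_b, PARSERS, convert); only the csv module calls are standard library. The pipe-delimited output is a single csv.writer call, while the per-source parsers are where the real work would go.

```python
import csv
import io


def parse_source_a(body):
    """Hypothetical parser for a source that sends 'LABEL: value' lines."""
    record = {}
    for line in body.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # only keep lines that actually contain a colon
            record[label.strip()] = value.strip()
    return record


def parse_source_b(body):
    """Hypothetical parser for a source that sends 'label = value' lines."""
    record = {}
    for line in body.splitlines():
        label, sep, value = line.partition("=")
        if sep:
            record[label.strip().upper()] = value.strip()
    return record


# One entry per data source; the addresses are placeholders.
PARSERS = {
    "bids@source-a.example": parse_source_a,
    "bids@source-b.example": parse_source_b,
}


def convert(sender, body, columns):
    """Pick the parser for this sender and return one pipe-delimited row."""
    parser = PARSERS.get(sender)
    if parser is None:
        raise ValueError(f"no parser registered for {sender!r}")
    record = parser(body)
    out = io.StringIO()
    # csv.writer quotes any value that itself contains a pipe.
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow([record.get(col, "") for col in columns])
    return out.getvalue()


body = "LOCATION: Annapolis, MD\nOWNER: Anne Arundel County Dept of Public Works"
print(convert("bids@source-a.example", body, ["LOCATION", "OWNER"]), end="")
```

    Note that the colon-splitting parser is deliberately naive: a line such as the pre-bid meeting note, which contains "9:00 AM", would be misread as a label, which is exactly the kind of per-format detail each custom script has to handle.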
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  7. #7
    Join Date
    Mar 2003
    Location
    The Bottom of The Barrel
    Posts
    6,102
    I merged your other thread. Heads up.
