Results 1 to 11 of 11
  1. #1
    Join Date
    May 2009
    Posts
    41

    Unanswered: What's the most efficient way of organizing thousands of records?

    I've got a whole column of company names, over 15,000, and after running DISTINCT on it, it narrows down to around 5,000, with lots of duplicates.

    I imagine that there are probably only 500 distinct companies in the list, and each of them has some slight string differences.

    How would you go about sorting the 5,000?

  2. #2
    Join Date
    May 2005
    Posts
    1,191
    Me.Geek = True

  3. #3
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by _ANDROID
    How would you go about sorting the 5,000?
    Code:
    SELECT DISTINCT companyname
      FROM daTable
    ORDER
        BY companyname
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  4. #4
    Join Date
    May 2009
    Posts
    41
    Quote Originally Posted by nckdryr
    Oh man...


    Yeah, that's only part of the problem... there are a few other tables with the same amount of records, and a similar number of duplicates across all the tables!!!

    I'm not sure how it got this way, but I've been assigned the task to fix it.

    I'm hoping someone will pull a rabbit out of a hat and produce some wonderful Access trick to narrow down the duplicates properly.

  5. #5
    Join Date
    May 2009
    Posts
    41
    Quote Originally Posted by r937
    Code:
    SELECT DISTINCT companyname
      FROM daTable
    ORDER
        BY companyname
    You guys are just joking around, right?!

    Ok, here's another question... this table with the thousands of duplicates is really a many-to-many in one table... so for a single company there are around 10 different spelling variations, as well as associated data in another column.

    The goal is to get the fix all the company names so that they are unique... then make the appropriate many-to-many relationship by separating it all.

    I should've mentioned this before, however, it doesn't seem any easier.

  6. #6
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by _ANDROID
    You guys are just joking around, right?!
    not really -- i mean, you ~did~ ask how to sort them, and ORDER BY is just the ticket

    as for resolving the duplicates, you would have to do that by inspecting the names
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  7. #7
    Join Date
    May 2009
    Posts
    41
    This is really like a supplier/product relationship... so I've separated the tables into the appropriate many to many relationship with autonumber as the PK...

    There are bout 5000 companies, and twice as many products, both have duplicates (different strings, same object).

    Now... I'm wondering if I go through the companies and make the duplicates have the same string name, is there some code that can be used to:

    start at the top of the 'company name' column,
    check the next row's company name, and
    if they are equal, copy the subrecords from the second into the first, then
    delete the duplicate company name, then
    repeat

    Obviously, I'm a noob to this. I don't even know if that would work.

  8. #8
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    Provided Answers: 59
    have a list of synonyms
    eg
    Ltd = Ltd. = Limited
    Co = Company = Co.
    make sure your company names are compared in a consistent style (ie make sure the capitalisation is consistent)

    but tighten up your data capture and validation to make sure the data is consistent and coherent
    I'd rather be riding on the Tiger 800 or the Norton

  9. #9
    Join Date
    May 2009
    Posts
    258
    I've dealt with something similar in the past, but with a smaller data set. Instead of just using DISTINCT, you can do a GROUP BY and it will give you better information:
    Code:
    SELECT TRIM(CompanyName), COUNT(*)
      FROM TheTable
    GROUP BY TRIM(CompanyName)
    ORDER BY 1, 2 DESC
    With this, you will have a count of the number of times each name occurs. I added TRIM in there just to make sure there aren't some with just extra spaces. You can then tackle the most commonly occurring entries first (using UPDATE queries) and deal with the others on a case-by-case basis. Creating a synonym table as previously suggested would probably help in this area as well. You should also remove any punctuation from the records when checking.

    This may help out as well:
    Microsoft Access tips: Fuzzy matches - Soundex

    Regards,

    Ax
    Last edited by Ax238; 06-10-09 at 10:46.

  10. #10
    Join Date
    May 2009
    Posts
    41
    Thanks for all the tips guys!

    I'm still working on the company table, and all of your suggestions have helped easily get rid of 4000 duplicates...

    Or I should say that the table still has the duplicates, I've just replaced things like "LTD." and "LIMITED" with "LTD" and so on...

    In order for me to remove the duplicates, I have to move all of their related records to a single company, is there a quick way to do this? The composite table (in the many-to-many relationship) has over 40K records.

  11. #11
    Join Date
    May 2009
    Posts
    258
    You'll want to set a company number for every distinct company record, add it as a new field to the composite table or create a 1-1 table for it. Afterwards, update the company name for each composite table record by joining on a master company table. You can also just remove the company name field from the composite table and just use the new company number to determine the company name for each record.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •