Results 1 to 5 of 5
  1. #1
    Join Date
    Jul 2010
    Posts
    10

    Unanswered: Duplicate record Count

    Hello,

    Any help or advice is really appreciated

    I have the following table

    Table A
    vendorname

    walmart
    walmartinc
    walmartinc
    cubfoods
    cubfoods
    cubfood
    asper
    apser


    I need the view the count of the duplicated vendors. if I perform the following

    select vendorname,count(*) as duplicates from vendors
    group by vendorname having count(*) >1

    the following o/p is displayed

    Vendorname duplicates
    walmartinc 2
    cubfoods 2

    As we can see that (walmart, walmart inc),(asper,apser) are the same vendors; but because of the extra character or number its been considered unique; and because of which I am unable to specify the correct number of unique and duplicate vendors in the vendor table of nearly 15000 records.

    please advice me how can i avoid this problem; is there a way to compare two columns based on the first 7 characters.

    Thank you

  2. #2
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Quote Originally Posted by tammy2512 View Post
    As we can see that (walmart, walmart inc),(asper,apser) are the same vendors; but because of the extra character or number its been considered unique
    Maybe you can see that, but without knowing how you make that determination isn't clear.
    Quote Originally Posted by tammy2512 View Post
    is there a way to compare two columns based on the first 7 characters.
    Definitely.

    If you need to retain the current values for the "vendor" column, create a duplicate column to hold the original values. I'll explain that process in a minute.

    Compute the current counts using:
    Code:
    SELECT Count(*), vendor
       FROM [Table A]
       GROUP BY vendor
       ORDER BY vendor
    Examine this output of this statement to find the "duplicates" using your criteria. Each time you find a pair of values that meet your "duplicate" criteria, replace the values with the smaller count with the values of the larger count. For example
    Code:
    UPDATE [Table A]
       SET vendor = 'walmartinc'
       WHERE  'walmart' = vendor
    Once this process is complete, you'll have a valid count.

    If you need to keep the original values, add a colum to hold those values.
    Code:
    SELECT *, vendor AS [Original vendor]
       INTO [my new table name]
       FROM [Table A]
    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  3. #3
    Join Date
    Jul 2010
    Posts
    10
    Thank you so much for your reply

    Examine this output of this statement to find the "duplicates" using your criteria. Each time you find a pair of values that meet your "duplicate" criteria, replace the values with the smaller count with the values of the larger count.


    Most of the time I will not be aware of the duplicate criteria; for example for vendor walmart, i will not know that there is vendor walmartinc; then how am i supposed to mention the duplicate criteria in the following code; becuase i deal with more then 50k records sometimes

    UPDATE [Table A]
    SET vendor = 'walmartinc'
    WHERE 'walmart' = vendor

    Thank you for your time
    Last edited by tammy2512; 08-20-10 at 22:12.

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    If you don't know how to tell if walmart and walmartinc are duplicates, there is no possibility of writing code that can do it. What you have written tells me that human inspection is required, because code could never reliably tell that asper and apser are the same vendor.

    -PatP
    In theory, theory and practice are identical. In practice, theory and practice are unrelated.

  5. #5
    Join Date
    Nov 2004
    Posts
    1,427
    Provided Answers: 4
    You can also create a translation table:
    Code:
    GivenName	CorrectName
    Walmart		Walmart
    Wallmart	Walmart
    WalmartInc	Walmart
    ...
    asper		asper
    apser		asper
    ...
    Do a join on the GivenName and use the CorrectName in the GROUP BY. This way you don't alter the data people have put in. Some may insist that their "Walmart" or "Walmart inc" is the only right way of writing that name.

    To populate the translation table, check for DISTINCT GivenNames that are not yet present in the translation table. First add all the correct names, by entering it in both GivenName and CorrectName. (*) Then run your script again. Now you will get the alternative names that you will have to connect with the correct name, and put (a number of) them in GivenName and CorrectName. Continue from (*) until all used vendor names can be matched with the correct vendor name.

    Each time, before running the report, you can check your table on the occurrence of new GivenNames that are not yet present in the translation table (use the script you wrote before). The maintenance of the translation table is a manual chore.

    The best way to deal with this is to limit the possible domain of values for the Vendor column by using a selection list from witch users can select the vendor name. But that implies control over the data entry application.
    With kind regards . . . . . SQL Server 2000/2005/2012
    Wim

    Grabel's Law: 2 is not equal to 3 -- not even for very large values of 2.
    Pat Phelan's Law: 2 very definitely CAN equal 3 -- in at least two programming languages

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •