Results 1 to 4 of 4
  1. #1
    Join Date
    Jan 2010
    Posts
    2

    Unanswered: How do I search for duplicates? - Same strings but different arrangement

    Hi All,

    I am new here but I have been going through the forum for some great help before my registration.

    I have this table:

    Code:
    CREATE table Node
    {
       code int unique not null;
       token varchar(max);
    }
    and the data (fraction) below in the table.

    Code:
    code	token
    2	MP1 1'(1,0,3,2)+1'(2,1,8,1)+1'(3,0,5,1); MP2 1'(0)+1'(1)+1'(2); 
    8	MP1 1'(2,1,8,1)+1'(1,0,3,2)+1'(3,0,5,1); MP2 1'(1)+1'(0)+1'(2); 
    5	MP1 1'(3,0,5,1)+1'(1,0,3,2)+1'(2,1,8,1); MP2 1'(2)+1'(0)+1'(1); 
    51	MP1 1'(1,0,3,2)+1'(2,2,5,2)+1'(3,0,5,1); MP2 1'(0)+1'(1)+1'(2); 
    33	MP1 1'(1,0,3,2)+1'(2,2,5,2)+1'(3,0,5,1); MP2 1'(0)+1'(2)+1'(1); 
    18	MP1 1'(1,0,3,2)+1'(2,2,5,2)+1'(3,0,5,1); MP2 1'(2)+1'(0)+1'(1); 
    54	MP1 1'(1,0,3,2)+1'(2,2,5,2)+1'(3,0,5,1); MP2 1'(2)+1'(1)+1'(0); 
    88	MP1 1'(1,0,3,2)+1'(2,2,5,2)+1'(3,2,4,2); MP2 1'(2)+1'(1)+1'(0); 
    81	MP1 1'(1,0,3,2)+1'(3,1,8,3)+1'(2,1,8,1); MP2 1'(1)+1'(2)+1'(0); 
    96	MP1 1'(1,0,3,2)+1'(3,1,8,3)+1'(2,1,8,1); MP2 1'(2)+1'(1)+1'(0);
    The data are entered into the database from a client application, as such the arrangement of the strings cannot be controlled.

    I am trying to find duplicate rows based on the column "token". For example, taking a close look at the data, codes (2, 8 and 5) should be duplicates (by my own theory) but the duplicate SQL statement below does not recognize the strings as the same since the order of "1'(0)+1'(1)+1'(2)" and "1'(1,0,3,2)+1'(2,1,8,1)+1'(3,0,5,1)" are not the same, which is acceptable. Same applies to code lines 51,33,18 and 54

    Code:
    		SELECT token, count(*)
    	  		 FROM Node
    	  		    GROUP BY token
    				HAVING count(*) > 1;
    I have been looking for a way to solve this problem. I'd really appreciate if someone can be of help as I have been on it for a while.

    Thanks.

    Note that the "code" datatype is unique and will not allow duplicate values.

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by rotee View Post
    The data are entered into the database from a client application, as such the arrangement of the strings cannot be controlled.
    the concept of duplicate column values is inflexible -- either there is a duplicate value or there is not

    you already have the SQL which will detect duplicates in the column values

    i think you're looking for an application-based logical duplicate, so the correct place to do this checking is in your application layer

    the alternative is to design the database to model the true entities, instead of your scenario where "the arrangement of the strings cannot be controlled"

    so if you can't change that, then you're going to have to write application logic
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Jan 2010
    Posts
    2
    Thanks for the reply.

    I thought about checking the duplicate from the application layer but I think it will not work in my case due to limited memory and large data. Trying to pull more than 500,000 records to the application memory and checking for duplicates, will definitely be costly and crash the application (from experience with the application) during runtime - Reason I decided to move over to DBMS.

    Redesigning the database would have been a good alternative but I will not be able to choose the columns where the token values will be stored during insert since it depends on arrival of the data by MP. It also boils down to checking the columns.

    Any ideas to design a database that can solve this issue or other ways will be much appreciated.

  4. #4
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    so you can't redesign the database but you're asking for advice on how to design the database?

    split the token into its constituent parts before storing it

    so 1'(1,0,3,2)+1'(2,1,8,1)+1'(3,0,5,1) would be stored as

    1'(1,0,3,2)
    1'(2,1,8,1)
    1'(3,0,5,1)

    and then duplicate checking is trivial
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •