Results 1 to 9 of 9
  1. #1
    Join Date
    Jun 2012
    Posts
    3

    Unanswered: Identify duplicates and update the last 2 digits to 0 for both the Orig and Dup

    Hi,

I have a requirement where I have to identify duplicates in a file based on the first 6 characters (it is a fixed-width file with 12-character records). Whenever a duplicate row is found, I compare the last 2 characters of the original row and the duplicate row: if they are not the same, the last 2 characters of both rows should be set to 00; if they are the same, both rows are kept as-is.


I thought of using the uniq command to redirect non-duplicates to one file and duplicates to another, then loop over the duplicates, but given the data volumes I would rather do it in awk/sed.


    here is the sample input and output


    Code:
    input:
    1251233Y1234
    1221249N8821
    1231116Y9945
    1231113Y2123
    1231109Y3212
    1231123N1214
    1231126N1214
    output should be:
    Code:
    1251233Y1234
    1221249N8821
    1231116Y9900
    1231113Y2100
1231109Y3212
1231123N1214
1231126N1214 (since the last 2 digits are the same, nothing is changed)
Any help in achieving the above result using either awk or sed would be greatly appreciated.
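For what it is worth, the transformation can also be done in a single awk program without sorting. This is a two-pass sketch of my own (not from the thread; the temp-file name is made up, and how it treats a key with more than two rows is an assumption): pass 1 counts each key's last-2-digit suffixes, pass 2 zeroes any suffix that is not shared by every row of a repeated key, preserving input order.

```shell
# Hypothetical two-pass awk sketch: pass 1 counts, per 6-char key, how
# often each last-2-digit suffix occurs; pass 2 zeroes the suffix of any
# row whose key repeats but whose suffix is not shared by all rows of
# that key.  Input order is preserved.  Sample data is from the post.
cat > /tmp/dup_sample.txt <<'EOF'
1251233Y1234
1221249N8821
1231116Y9945
1231113Y2123
1231109Y3212
1231123N1214
1231126N1214
EOF
awk 'NR == FNR {                       # pass 1: count keys and suffixes
       k = substr($0,1,6)
       cnt[k]++
       sfx[k, substr($0,11,2)]++
       next
     }
     {                                 # pass 2: rewrite and print
       k = substr($0,1,6)
       if (cnt[k] > 1 && sfx[k, substr($0,11,2)] < cnt[k])
         print substr($0,1,10) "00"
       else
         print
     }' /tmp/dup_sample.txt /tmp/dup_sample.txt
```

On the sample this prints the seven records in their original order, with only 1231116Y9945 and 1231113Y2123 rewritten to a 00 suffix.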

    Thanks,
    Faraway

  2. #2
    Join Date
    Sep 2009
    Location
    Ontario
    Posts
    1,057
    Provided Answers: 1
    Sort the file.
    Then read the sorted file.
    Code:
read prevline
while read line
do
  l=`echo $line | cut -c1-6`
  p=`echo $prevline | cut -c1-6`
  if [ "$l" = "$p" ]
  then
    # perform change to prevline
    # perform change to line
    echo $prevline
  else
    echo $prevline
  fi
  prevline=$line
done
echo $prevline
Run:
$ sort data | myscript > edited.data
    Last edited by kitaman; 06-15-12 at 07:15. Reason: process last record

  3. #3
    Join Date
    Jun 2012
    Posts
    3
    Thanks Kitaman.

I have used something similar with a while loop, but it is taking a really long time to complete; that's why I wanted it in awk/sed.

    here is the sample code I wrote:

    Code:
prev=""
while read line
do
  if [ "$prev" != "" ]
  then
    :  # look for dups, process and print prev line
  fi
  prev=$line
done < input

  4. #4
    Join Date
    Jun 2012
    Posts
    3
Another problem with this script is that it does not update the last 2 digits of both the original and the duplicate row, and duplicates in the last 2 rows of the file are not processed properly.

  5. #5
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713


    Quote Originally Posted by farawaydsky View Post
    Another problem with this script is it will not process both the original and duplicate rows last 2 digits and also if the duplicates at the last 2 rows are not processed properly.
Don't cry, here is your lollipop:
    Code:
awk '{ k = substr($1,1,6); a[k] = a[k] "," $1 }    # group rows by 6-char key
END {
  for (k in a) {
    n = split(a[k], o, ",")                # o[1] is empty; rows are o[2..n]
    if (n > 2) {                           # key occurs more than once
      for (i = 2; i <= n; i++) { d = substr(o[i],11,2); m[k,d] += 1 }
      for (i = 2; i <= n; i++) {
        d = substr(o[i],11,2)
        if (m[k,d] > 1) print o[i]         # suffix shared by another row: keep
        else print substr(o[i],1,10) "00"  # suffix differs: zero it
      }
    }
    else print o[2]                        # unique key: keep as-is
  }
}' < input.txt
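Running the program above against the thread's sample input (my own check) prints all seven records, but the keys come out in awk's internal hash order; the sort here is only to make the result deterministic for comparison:

```shell
# Verify the grouping logic on the thread's sample data; "for (k in a)"
# walks keys in hash order, so sort is appended purely for a stable view.
cat > /tmp/faraway_in.txt <<'EOF'
1251233Y1234
1221249N8821
1231116Y9945
1231113Y2123
1231109Y3212
1231123N1214
1231126N1214
EOF
awk '{k=substr($1,1,6);a[k]=a[k]","$1;}
END {
  for (k in a){
    n=split(a[k],o,",");
    if(n>2){
      for(i=2;i<=n;i++) {d=substr(o[i],11,2); m[k,d]+=1; }
      for(i=2;i<=n;i++) {d=substr(o[i],11,2);
        if (m[k,d]>1) print o[i];
        else print substr(o[i],1,10)"00";
      }
    }
    else print o[2];
  }
}' < /tmp/faraway_in.txt | sort
```

The two 123111 rows come out with 00 suffixes, the two 123112 rows (matching suffixes) are untouched, and the three unique keys pass through unchanged.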
    The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

  6. #6
    Join Date
    Sep 2009
    Location
    Ontario
    Posts
    1,057
    Provided Answers: 1
    Quote Originally Posted by farawaydsky View Post
    Thanks Kitaman.

    I have used something similar using while loop..but it is taking a really long time to complete, thats I wanted it in AWK/SED.

    here is the sample code I wrote:

    Code:
    prev=""
    while read line
    do
      if [ $prev != "" ]
      then
         if look for dups and process and print prev line
      fi
      prev=$line
    done<input
The input file has to be sorted. How does the "look for dups" routine work?
    The entire process should take as long as it takes to sort the file, and read it once and write it once.
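Most of the wall-clock time in the loop-based scripts in this thread goes to the two `echo | cut` pipelines per record, each of which forks processes, rather than to the comparisons themselves. A bash-specific sketch of my own (`dedup_sorted` is a hypothetical name) does the same slicing with `${var:offset:length}` and forks nothing per line:

```shell
# dedup_sorted (hypothetical name): pairwise compare adjacent sorted
# 12-char records, slicing with bash's ${var:offset:length} instead of
# echo|cut, so no processes are forked per input line.
dedup_sorted() {
  IFS= read -r prev || return 0
  while IFS= read -r line; do
    if [ "${line:0:6}" = "${prev:0:6}" ] && [ "${line:10:2}" != "${prev:10:2}" ]; then
      line=${line:0:10}00      # mismatched suffixes on a duplicate key:
      prev=${prev:0:10}00      # zero both rows
    fi
    printf '%s\n' "$prev"
    prev=$line
  done
  printf '%s\n' "$prev"        # flush the final held record
}
# usage: sort data | dedup_sorted > edited.data
```

On large files this keeps the cost at roughly one sort plus one sequential read, which is the bound described above.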

  7. #7
    Join Date
    Sep 2009
    Location
    Ontario
    Posts
    1,057
    Provided Answers: 1
    How many records are in the original file? I am curious about the relative performance of different solutions.

  8. #8
    Join Date
    Sep 2009
    Location
    Ontario
    Posts
    1,057
    Provided Answers: 1
    Code:
read prevline
while read line
do
  l=`echo $line | cut -c1-6`
  p=`echo $prevline | cut -c1-6`
  if [ "$l" = "$p" ]
  then
    ps=`echo $prevline | cut -c11-12`
    ls=`echo $line | cut -c11-12`
    if [ "$ps" != "$ls" ]
    then
      line=`echo $line | cut -c1-10`00
      prevline=`echo $prevline | cut -c1-10`00
    fi
    echo $prevline
  else
    echo $prevline
  fi
  prevline=$line
done
echo $prevline
    This works with the sample data provided. I ran this script and the awk script, but the sample size is so small that I cannot determine which is faster.

  9. #9
    Join Date
    Sep 2009
    Location
    Ontario
    Posts
    1,057
    Provided Answers: 1
    Results:
    Code:
    -bash-3.2# timex ./faraway3 |sort >awk.out       
                                                     
    real        0.05                                 
    user        0.05                                 
    sys         0.00                                 
                                                     
    -bash-3.2# timex ./faraway2 <faraway6 >merge.out 
                                                     
    real       12.54                                 
    user        2.51                                 
    sys         7.68
I created a test data file of 7,999 records using the following script, then used vi to append the first 1,000 records to the end and sorted the resultant file. I then manually edited columns 11 and 12 in some records to exercise that portion of the code.
    Code:
-bash-3.2# more faraway4
p=112001
m="abcd"
s=11
while [ $p -lt 120000 ]
do
  echo $p$m$s
  p=`expr $p + 1`
  s=`expr $s + 3`
  if [ $s -gt 93 ]
  then
    s=11
  fi
done
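The same 7,999 records can be generated without the two `expr` forks per line; a sketch of my own (assuming any POSIX awk) mirroring the loop above:

```shell
# Generate the 112001..119999 test records in one awk process, cycling
# the 2-digit suffix 11,14,...,92 exactly as the shell loop does.
awk 'BEGIN {
  s = 11
  for (p = 112001; p < 120000; p++) {
    printf "%dabcd%02d\n", p, s
    s += 3
    if (s > 93) s = 11
  }
}'
```

Because the whole loop runs inside one process, this finishes in a small fraction of the time the `expr`-based version takes.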
The most interesting part of this experiment is the output of the awk routine. The records come out in an apparently arbitrary sequence: awk's "for (k in a)" walks the array in its internal hash order, which is unspecified, so the output needs an external sort if record order matters.
    Code:
    #first 20 lines of awk output
    119894abcd86 
    119849abcd35 
    119812abcd92 
    119768abcd44 
    119731abcd17 
    119687abcd53 
    119650abcd26 
    119605abcd59 
    119524abcd68 
    119443abcd77 
    119399abcd29 
    119362abcd86 
    119317abcd35 
    119281abcd11 
    119236abcd44 
    119155abcd53 
    119074abcd62 
    119029abcd11 
    118958abcd50 
    118921abcd23 
    118877abcd59 
    118840abcd32 
    118796abcd68
