  1. #1
    Join Date
    Feb 2004
    Location
    UK
    Posts
    43

    Unanswered: Need awk script help urgently: check records from one file against another file

    Hi,

    I have file1, which has 16 million records, and file2, which has 22 million records.

    Both files have only one field, which holds a 10-digit number.

    My problem: I want to check each record from file1 against file2, i.e. find out whether the record exists in file2. If it is found, print "record found in the file"; otherwise print "record not found in the file".

    Please do not give shell/Perl script solutions, as they take too much time to process such large files. I am already running a script written in Perl, and it also gave me an "Out of memory!" error.

    If anyone can provide an awk script, that would be a great help.

    I tried to write it in awk, but I don't know how to pass two files to it, i.e. how to check a record from the first file against the other file. Below is my attempt, which does not manage to check the records from file1 against file2:

    tail abc.txt | awk '$0 ~/$1/ {print file has Found the number" ; next }; {print " has not found in the file ie" $1 }' pqr.txt

    Can anyone correct the above script or provide a solution?

    Thanks in advance,
    ~Pankaj
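
    For reference, the two-file lookup asked about above can be sketched in awk roughly as follows. This is only a minimal sketch: the function name check_records is made up for illustration, and it assumes each record is the first whitespace-separated field on its line. awk keeps every key of the lookup file in memory, so whether 22 million keys fit depends on the machine; that has not been tested on files of this size.

```shell
# Hypothetical helper: for every record of the first file, report whether
# the same record also occurs in the second file.
#   $1 = file whose records are checked, $2 = file that is searched
check_records() {
    awk '
        NR == FNR { seen[$1]; next }   # first file named on the command line: remember each key
        {
            if ($1 in seen) print $1 ": record found in the file"
            else            print $1 ": record not found in the file"
        }
    ' "$2" "$1"                        # the searched file is read first, then the checked file
}
```

    It would be called as check_records file1 file2. Using $1 rather than $0 means leading or trailing spaces around a number do not affect the comparison. Note that the NR == FNR test assumes the searched file is not empty.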

  2. #2
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    . . . If it is found then print "record found in the file" else "record does not found in the file".
    So you are going to print one or the other of the above messages SIXTEEN MILLION TIMES?

    PS: What makes you think awk is going to be much faster than Perl?
    The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

  3. #3
    Join Date
    Feb 2004
    Location
    UK
    Posts
    43

    awk issue

    Yeah, I have to print that many times regardless of the scripting language. This is already done in Perl, and it takes 12 hours, because in Perl we cannot store all of these records in one array; we have to break the array into multiple parts.

    I thought awk would take less time. If you can provide another solution that takes much less time, that's fine with me.

    Thanks

  4. #4
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    If the files are sorted, you could use the diff command.


    PS: Anyway, if the files are NOT sorted, sort them and the Perl script will execute faster.


  5. #5
    Join Date
    Feb 2004
    Location
    UK
    Posts
    43

    Awk issue

    But the diff command compares in both directions, and I do not want that. My problem is:

    Take one record from FILE1 and check whether it exists in FILE2. If it does, write "record exists in the file"; otherwise write "record does not exist in the file". Then take the next record from FILE1 and repeat the check, and so on.
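
    A one-directional check like this can be sketched with sort and comm, since comm can suppress the columns that are not wanted. The function name one_way_check and the output file names are made up for illustration, and sort -u is an assumption: it collapses duplicate records, which may or may not be acceptable here.

```shell
# Hypothetical helper: split FILE1 into a "found" list and a "not found" list.
#   $1 = FILE1, $2 = FILE2, $3 = directory for the output files
one_way_check() {
    # comm needs both inputs sorted the same way; LC_ALL=C makes sort and comm
    # agree on plain byte order. -u drops duplicate records.
    LC_ALL=C sort -u "$1" > "$3/file1.sorted"
    LC_ALL=C sort -u "$2" > "$3/file2.sorted"
    # -12 hides columns 1 and 2: only lines present in both files remain
    LC_ALL=C comm -12 "$3/file1.sorted" "$3/file2.sorted" > "$3/found.txt"
    # -23 hides columns 2 and 3: only lines unique to FILE1 remain
    LC_ALL=C comm -23 "$3/file1.sorted" "$3/file2.sorted" > "$3/not_found.txt"
}
```

    Records that exist only in FILE2 never appear in either output, so the comparison runs in one direction only.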

  6. #6
    Join Date
    May 2005
    Location
    South Africa
    Posts
    1,365
    Provided Answers: 1
    Maybe try this with the sorted files
    Code:
    # cat xx1
    001
    003
    004
    005
    007
    # cat xx2
    002
    003
    007
    009
    011
    112
    # ### Note: on the next line, the whitespace between \( and \) is a single tab character
    # comm -2 xx1 xx2 | sed 's/^\([0-9]\)/not found \1/;s/\(        \)/found /'
    not found 001
    found 003
    not found 004
    not found 005
    found 007
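
    Because a literal tab inside the sed expression is easy to lose when copying and pasting, the same labelling can also be sketched with awk, which accepts \t in a regular expression. The function name label_comm is made up for illustration; the sketch assumes both files are already sorted in the order comm expects.

```shell
# Hypothetical helper: label each line of the first file as found / not found.
#   $1 and $2 are the two sorted files
label_comm() {
    comm -2 "$1" "$2" | awk '
        /^\t/ { sub(/^\t/, ""); print "found " $0; next }   # column 3: line is in both files
              { print "not found " $0 }                    # column 1: line is only in the first file
    '
}
```

    Run as label_comm xx1 xx2 on the sample files above, it should give the same five lines as the sed version.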

  7. #7
    Join Date
    Feb 2004
    Location
    UK
    Posts
    43

    pdreyer : It's not working....

    pdreyer,

    I tried the solution you gave, but it's not working. I have these 2 files.
    comm abc.txt1 abc1.txt1 | sed 's/^\([0-9]\)/not found \1/;s/\( \)/found /' > z.out

    I also put the tab after \( as explained.

    PS: there were two spaces before each record in the abc.txt file; in abc1.txt, records starting with 700 have two leading spaces and records starting with 10 have one.

    I ran it both before and after removing the spaces. Not working....

    abc.txt abc1.txt
    700250154 700250163
    700250155 700250164
    700250156 700250165
    700250157 700250166
    700250158 700250167
    700250159 700250168
    700250160 700250169
    700250161 700250170
    700250162 700250171
    700250163 700250172
    700250164 700352944
    700250165 700352945
    700250166 700352946
    700250167 700352947
    700250168 700352948
    700250169 700352949
    700250170 700352950
    700250171 700352951
    700250172 700352952
    700250173 700656780
    700352925 1006050374
    700352926 1006050378
    700352927 1006050379
    700352928 1006050392
    700352929 1006050508
    700352930 1006050517
    700352931 1006056990
    700352932 1006056991
    700352933 1006056992
    700352934 1006056993
    700352935 1006056998
    700352936 1006056999
    700352937 1006057000
    700352938 1006057001
    700352939 1006057131
    700352940 1006057132
    700352941 1007350605
    700352942 1007350607
    700352943 1007750195
    700352944 1008855857
    700352945 1018950638
    700352946 1019250232
    700352947 1019250233
    700352948 1020350140
    700352949 1020350141
    700352950 1020750300
    700352951 1026350843
    700352952 1026550664
    700656780 1026550666
    700656781 1026551692
    700656782
    700656783
    700656784
    700656785
    700656786
    700656787
    700656788
    700656789
    700656790
    700656791
    700656792
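
    Some possible causes, assuming the files look like the listing above. The command quoted in this post omits the -2 option, so comm prints three columns instead of two. comm also compares whole lines, so leading spaces count as part of the record, and with them the sed pattern ^[0-9] cannot match. Once the spaces are removed, a 10-digit number such as 1006050374 following 700656780 is in numeric order, not the lexical order that comm expects. A minimal sketch of a clean-up step follows; the function name normalize_for_comm and the output file names are made up for illustration.

```shell
# Hypothetical helper: write a cleaned, comm-ready copy of a file.
#   $1 = input file, $2 = output file
normalize_for_comm() {
    # delete all spaces and tabs, discard empty lines,
    # then sort in plain byte order with duplicates removed
    tr -d ' \t' < "$1" | grep -v '^$' | LC_ALL=C sort -u > "$2"
}

# Example use (input names taken from the post above, output names invented):
#   normalize_for_comm abc.txt  abc.clean
#   normalize_for_comm abc1.txt abc1.clean
#   LC_ALL=C comm -2 abc.clean abc1.clean
```

    After this step the 10-digit numbers sort before the 9-digit ones in both files, which is what comm needs even though it is not numeric order.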

  8. #8
    Join Date
    May 2005
    Location
    South Africa
    Posts
    1,365
    Provided Answers: 1
    I am unable to recreate your problem. Maybe check what character you get from the comm command (is it a tab?).

    Running only comm: the left column is "not found" and the right column is "found".
    Code:
    # comm -2 abc.txt abc1.txt
    700250154
    700250155
    700250156
    700250157
    700250158
    700250159
    700250160
    700250161
    700250162
            700250163
            700250164
            700250165
            700250166
            700250167
            700250168
            700250169
            700250170
            700250171
            700250172
    700250173
    700352925
    --snip,snip --
    And adding sed:
    Code:
    # comm -2 abc.txt abc1.txt | sed 's/^\([0-9]\)/not found \1/;s/\(       \)/found /'
    not found 700250154
    not found 700250155
    not found 700250156
    not found 700250157
    not found 700250158
    not found 700250159
    not found 700250160
    not found 700250161
    not found 700250162
    found 700250163
    found 700250164
    found 700250165
    found 700250166
    found 700250167
    found 700250168
    found 700250169
    found 700250170
    found 700250171
    found 700250172
    not found 700250173
    not found 700352925
    --snip,snip --
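
    One way to check which character comm actually emits is to dump the first few lines of its output byte by byte with od -c, which shows a tab as \t and a space as a blank. This is only a sketch; the function name show_comm_bytes is made up for illustration.

```shell
# Hypothetical helper: show the start of the comm output one byte at a time.
#   $1 and $2 are the two sorted files
show_comm_bytes() {
    comm -2 "$1" "$2" | od -c | head -n 5
}
```

    If the lines that should be in the "found" column start with \t in this dump, the tab is there and the sed expression needs a real tab to match it. If the numbers are preceded by blanks instead, the spaces are coming from the input files.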
