  #1 (permalink)  
Old 04-11-06, 06:51
pangup_74 pangup_74 is offline
Registered User
 
Join Date: Feb 2004
Location: UK
Posts: 43
Need awk script help urgently : check record from one file to another file

Hi,

I have file1, which has 16 million records, and file2, which has 22 million records.

Both files have a single field containing a 10-digit number.

My problem: I want to check each record from file1 against file2, i.e. find out whether that record exists in file2. If it is found, print "record found in the file"; otherwise print "record does not found in the file".

Please do not give shell or Perl solutions, as they take too much time to process such large files. I am already running a script written in Perl, and it also fails with an "Out of memory!" error.

If anyone can provide an awk script, that would be a great help.

I tried to write one in awk, but I don't know how to pass it two files, i.e. how to check each record from the first file against the other file. Below is my attempt, which does not manage to check the records from file1 against file2:

tail abc.txt | awk '$0 ~/$1/ {print file has Found the number" ; next }; {print " has not found in the file ie" $1 }' pqr.txt

Can anyone correct the above script or provide a solution?

thanks in advance
~Pankaj
  #2 (permalink)  
Old 04-11-06, 08:48
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456


Quote:
. . . If it is found then print "record found in the file" else "record does not found in the file".
So you are going to print one of the above messages SIXTEEN MILLION TIMES?

PS: What makes you think awk is going to be much faster than Perl?
__________________
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
  #3 (permalink)  
Old 04-11-06, 21:31
pangup_74 pangup_74 is offline
Registered User
 
Join Date: Feb 2004
Location: UK
Posts: 43
awk issue

Yes, I have to print that many times regardless of the scripting language. This is already done in Perl and it takes 12 hours. In Perl we cannot store all of these records in a single array, so we have to split the array into multiple parts.

I thought awk would take less time. If you can provide another solution that takes much less time, that is fine with me.

thanks
  #4 (permalink)  
Old 04-12-06, 14:06
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456


If the files are sorted, you could use the diff command.


PS: Anyway, if the files are NOT sorted, sort them and the Perl script will execute faster.

  #5 (permalink)  
Old 04-12-06, 23:16
pangup_74 pangup_74 is offline
Registered User
 
Join Date: Feb 2004
Location: UK
Posts: 43
Awk issue

But if I use the diff command, it compares the files both ways, and I do not want that. My problem is:

Take one record from FILE1 and check whether it exists in FILE2. If it exists, write "record exists in the file"; otherwise write "record does not exist in the file". Then take the next record from FILE1, check it against FILE2 in the same way, and so on.
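
One common way to express this kind of one-way lookup in awk is to load FILE2 into an associative array and then stream FILE1 against it. The following is only a sketch under assumptions not stated in the thread: the function name check_records and the demo data are made up for illustration, each record is taken to be the first whitespace-separated field, and all of FILE2's keys must fit in memory, so with 22 million records it may hit the same limit the Perl script did.

```shell
#!/bin/sh
# Hypothetical helper: for each record in file $1, report whether it occurs in file $2.
check_records() {
    # awk reads "$2" first. NR == FNR holds only while the first file argument
    # is being read (assuming that file is not empty), so its records become
    # keys of the array "seen". Every record of "$1" is then looked up.
    awk 'NR == FNR { seen[$1]; next }
         { print $1, (($1 in seen) ? "found" : "not found") }' "$2" "$1"
}

# Small demo with throwaway data (illustrative values only).
tmp=$(mktemp -d)
printf '001\n003\n004\n' > "$tmp/file1"
printf '002\n003\n007\n' > "$tmp/file2"
check_records "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

Because the lookup uses $1 rather than $0, leading spaces in either file do not affect the match, and neither file needs to be sorted.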
  #6 (permalink)  
Old 04-13-06, 04:34
pdreyer pdreyer is offline
Registered User
 
Join Date: May 2005
Location: South Africa
Posts: 1,268
Maybe try this with the sorted files
Code:
# cat xx1
001
003
004
005
007
# cat xx2
002
003
007
009
011
112
# ### Note: on the next line, the whitespace between \( and \) is a single tab character
# comm -2 xx1 xx2 | sed 's/^\([0-9]\)/not found \1/;s/\(        \)/found /'
not found 001
found 003
not found 004
not found 005
found 007
  #7 (permalink)  
Old 04-13-06, 06:18
pangup_74 pangup_74 is offline
Registered User
 
Join Date: Feb 2004
Location: UK
Posts: 43
pdreyer : It's not working....

pdreyer ,

I tried the solution you gave, but it's not working. I have these 2 files.
comm abc.txt1 abc1.txt1 | sed 's/^\([0-9]\)/not found \1/;s/\( \)/found /' > z.out

I also put a tab after \( as explained.

PS: there were two spaces before each record in the abc.txt file. In abc1.txt, records starting with 700 have 2 spaces and records starting with 10 have one space.

I ran it both before and after removing the spaces. Not working....

abc.txt abc1.txt
700250154 700250163
700250155 700250164
700250156 700250165
700250157 700250166
700250158 700250167
700250159 700250168
700250160 700250169
700250161 700250170
700250162 700250171
700250163 700250172
700250164 700352944
700250165 700352945
700250166 700352946
700250167 700352947
700250168 700352948
700250169 700352949
700250170 700352950
700250171 700352951
700250172 700352952
700250173 700656780
700352925 1006050374
700352926 1006050378
700352927 1006050379
700352928 1006050392
700352929 1006050508
700352930 1006050517
700352931 1006056990
700352932 1006056991
700352933 1006056992
700352934 1006056993
700352935 1006056998
700352936 1006056999
700352937 1006057000
700352938 1006057001
700352939 1006057131
700352940 1006057132
700352941 1007350605
700352942 1007350607
700352943 1007750195
700352944 1008855857
700352945 1018950638
700352946 1019250232
700352947 1019250233
700352948 1020350140
700352949 1020350141
700352950 1020750300
700352951 1026350843
700352952 1026550664
700656780 1026550666
700656781 1026551692
700656782
700656783
700656784
700656785
700656786
700656787
700656788
700656789
700656790
700656791
700656792
  #8 (permalink)  
Old 04-13-06, 10:28
pdreyer pdreyer is offline
Registered User
 
Join Date: May 2005
Location: South Africa
Posts: 1,268
I am unable to recreate your problem. Maybe check which character you get from the comm command (is it a tab?).
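
One way to make that check is sed's l command, which prints each line in an unambiguous form (a tab shows up as \t and each line ends with $). This is a small sketch; the wrapper name show_whitespace is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print input with tabs shown as \t and line ends as $.
show_whitespace() {
    sed -n l "$@"
}

# Demo: the second line starts with a real tab character.
printf '700250162\n\t700250163\n' | show_whitespace
```

Piping the comm output through this helper shows whether the found records are prefixed by a tab, by spaces, or by more than one tab.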

Running only comm, the left column holds the records that were not found and the right column holds the records that were found:
Code:
# comm -2 abc.txt abc1.txt
700250154
700250155
700250156
700250157
700250158
700250159
700250160
700250161
700250162
        700250163
        700250164
        700250165
        700250166
        700250167
        700250168
        700250169
        700250170
        700250171
        700250172
700250173
700352925
--snip,snip --
And adding sed:
Code:
# comm -2 abc.txt abc1.txt | sed 's/^\([0-9]\)/not found \1/;s/\(       \)/found /'
not found 700250154
not found 700250155
not found 700250156
not found 700250157
not found 700250158
not found 700250159
not found 700250160
not found 700250161
not found 700250162
found 700250163
found 700250164
found 700250165
found 700250166
found 700250167
found 700250168
found 700250169
found 700250170
found 700250171
found 700250172
not found 700250173
not found 700352925
--snip,snip --
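
For reference, here is a hedged sketch of the whole pipeline with the preparation steps made explicit. Several things in it are assumptions rather than anything confirmed in the thread: the function name label_records, the use of LC_ALL=C for a byte-wise sort, and the removal of duplicate records with sort -u. The background is that comm expects both inputs sorted in the same collating order, while the abc1.txt listing in post #7 places 700... records before 1006... records, which is numeric order rather than byte order. The command quoted in post #7 also omits the -2 option, which changes how many tabs precede the common records.

```shell
#!/bin/sh
# Hypothetical helper: label each record of file $1 as found / not found in file $2.
label_records() {
    work=$(mktemp -d)
    # Strip spaces and tabs, then sort both files byte-wise so comm sees one
    # consistent order. sort -u also drops duplicate records (an assumption).
    tr -d ' \t' < "$1" | LC_ALL=C sort -u > "$work/a"
    tr -d ' \t' < "$2" | LC_ALL=C sort -u > "$work/b"
    # comm -2 suppresses lines unique to the second file; lines common to both
    # files are then prefixed by exactly one tab.
    LC_ALL=C comm -2 "$work/a" "$work/b" |
        awk '/^\t/ { sub(/^\t/, ""); print "found " $0; next }
             { print "not found " $0 }'
    rm -rf "$work"
}

# Demo with leading spaces similar to those described in post #7.
demo=$(mktemp -d)
printf '  700250162\n  700250163\n' > "$demo/abc.txt"
printf '  700250163\n 1006050374\n' > "$demo/abc1.txt"
label_records "$demo/abc.txt" "$demo/abc1.txt"
rm -rf "$demo"
```

Using awk for the labelling step avoids having to type a literal tab inside the sed expression.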