  #1 (permalink)  
Old 08-02-04, 11:16
deebee deebee is offline
Registered User
 
Join Date: Jul 2004
Posts: 45
searching a string in gzip files

What's the command to search for a string in a gzipped file? In other words, how do I search for a pattern in *.gz files?
  #2 (permalink)  
Old 08-02-04, 11:28
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456

Try:
Code:
gunzip -c MyFile.gz|grep 'pattern'

NOTE: gz suffix added for clarity only.
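
For more than one file, the same pipeline can run in a loop. This is only a sketch: the sample files and the `work` directory are made up for the demo, and `gzip -dc` is used as the equivalent of `gunzip -c`. If your gzip installation ships the zgrep wrapper (many do), `zgrep 'pattern' *.gz` may do the same job in one command.

```shell
#!/bin/sh
# Demo data: three small gzipped files in a scratch directory (hypothetical names).
work=$(mktemp -d)
printf 'first line\nhas pattern here\n' | gzip > "$work/a.gz"
printf 'nothing\n' | gzip > "$work/b.gz"
printf 'pattern again\n' | gzip > "$work/c.gz"

matches=$(
  for f in "$work"/*.gz; do
    # Decompress to stdout (the .gz file is left untouched), filter with grep,
    # then prefix each hit with the file name so hits can be traced back.
    gzip -dc "$f" | grep 'pattern' | sed "s|^|${f##*/}: |"
  done
)
printf '%s\n' "$matches"
```

The sed step is only there because grep reads from a pipe and so cannot report which file a hit came from.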
__________________
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
  #3 (permalink)  
Old 08-02-04, 12:22
deebee deebee is offline
Registered User
 
Join Date: Jul 2004
Posts: 45
Thanks for your reply. This works too:
gzcat <filename> | grep <pattern>
Do you have any suggestions for speeding up the search when there is a lot of data? For example, with over 1000 .gz files, how can I make the search faster?
Thanks in advance.
  #4 (permalink)  
Old 08-02-04, 13:14
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456

Divide and conquer.

Run gzcat in the background (&) for different file patterns or different directories!
  #5 (permalink)  
Old 08-02-04, 13:26
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456

Divide and conquer.

Run gzcat | grep in the background (&) for different file patterns or different directories!
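
A minimal sketch of that idea, assuming one background job per file. The scratch directory, the sample files, and the `.hits` output names are invented for the demo, and `gzip -dc` stands in for gzcat.

```shell
#!/bin/sh
# Demo data: two small gzipped files (hypothetical contents).
dir=$(mktemp -d)
printf 'alpha\nneedle one\n' | gzip > "$dir/a.gz"
printf 'beta\nneedle two\n'  | gzip > "$dir/b.gz"

for f in "$dir"/*.gz; do
  # The trailing & puts the whole decompress+grep pipeline in the background,
  # so the loop moves on to the next file without waiting.
  gzip -dc "$f" | grep 'needle' > "$f.hits" &
done
wait   # block until every background job has finished

hits=$(cat "$dir"/*.hits | sort)
printf '%s\n' "$hits"
```

Each job writes to its own file so that hits from different jobs do not interleave. With 1000 files this starts 1000 jobs at once, so in practice you would probably launch them in batches.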
  #6 (permalink)  
Old 08-06-04, 11:23
deebee deebee is offline
Registered User
 
Join Date: Jul 2004
Posts: 45
Do you think grep or egrep will work for searching for more than 50,000 strings?
For example:
Assume you have over 50,000 strings to search for in the list of files. If I run
cat <filename> | grep -E <strings>
will this work? I have never tried more than 50 strings, and I suspect it won't work, since grep may have some limit.
What do you think?
What could be a solution for this?
Thanks in advance.
  #7 (permalink)  
Old 08-06-04, 11:38
vgersh99 vgersh99 is offline
Registered User
 
Join Date: Apr 2004
Location: Boston, MA
Posts: 325
Split (man split) your file of search strings into 'manageable' files and do your grepping in the 'divide-and-conquer' manner outlined above, combining the results at the end.

OR

If your search strings have something in common, try to identify a regex that describes them and use it with egrep.
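
The first suggestion might look roughly like this. It is a sketch only: the pattern list, the data file, and the chunk size of 2 are made up so that the demo stays small, and `grep -F -f` is used to read fixed strings from a file.

```shell
#!/bin/sh
# Demo data: a tiny pattern list and one gzipped data file (hypothetical).
work=$(mktemp -d)
printf 'apple\nbanana\ncherry\ndate\n' > "$work/patterns.txt"
printf 'one apple\ntwo\nthree date\n' | gzip > "$work/data.gz"

# Split the pattern list into chunks of 2 lines each:
# patterns.part.aa, patterns.part.ab
split -l 2 "$work/patterns.txt" "$work/patterns.part."

# Search the data once per chunk, then merge the hits and drop duplicates
# (a line can match strings from more than one chunk).
for p in "$work"/patterns.part.*; do
  gzip -dc "$work/data.gz" | grep -F -f "$p"
done | sort -u > "$work/combined.txt"

combined=$(cat "$work/combined.txt")
printf '%s\n' "$combined"
```

Note that every chunk means another decompression of the same data file, which is the cost discussed in the next post.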
__________________
vlad
+-----------------------+
| #include <disclaimer.h> |
+-----------------------+
  #8 (permalink)  
Old 08-06-04, 15:04
deebee deebee is offline
Registered User
 
Join Date: Jul 2004
Posts: 45
In that case, if I split my string file (or pattern file) of 50,000 strings into 50 files (1,000 strings per file) and I have 20 files to scan for those strings, that means 50*20 = 1,000 runs!
i.e. gzcat <file to be searched.gz> | fgrep -f <string1.file>

I tested 1,000 strings in a file against a text file of 672,593 lines. It took 4.50 mins. So the whole job would take 4.5*50*20 mins!!!! That could be reduced if I run multiple processes in the background.

Is there a better way to do it?
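
To put a number on that estimate, here is the arithmetic as a small sketch. It assumes every run takes the same 4.5 minutes as the single test and that the runs happen one after another; the variable names are made up.

```shell
#!/bin/sh
# 4.5 minutes per run, kept in tenths of a minute because shell arithmetic is integer-only.
per_run_tenths=45
pattern_files=50
data_files=20

total_min=$(( per_run_tenths * pattern_files * data_files / 10 ))
total_hours=$(( total_min / 60 ))
echo "$total_min minutes, about $total_hours hours if run one after another"
```

That is 4,500 minutes, or 75 hours, which is why running jobs in parallel or cutting the number of passes matters here.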
  #9 (permalink)  
Old 08-06-04, 16:31
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456

Quote:
Originally Posted by deebee
In that case, if I split my string file (or pattern file) of 50,000 strings into 50 files (1,000 strings per file) and I have 20 files to scan for those strings, that means 50*20 = 1,000 runs!
i.e. gzcat <file to be searched.gz> | fgrep -f <string1.file>

I tested 1,000 strings in a file against a text file of 672,593 lines. It took 4.50 mins. So the whole job would take 4.5*50*20 mins!!!! That could be reduced if I run multiple processes in the background.

Is there a better way to do it?

Maybe you could try something like this:
Code:
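#!/bin/ksh
# Search every .gz file under $dir for the fixed strings listed in $patt_file.
patt_file=50000patterns.txt
# Split the pattern list into 5000-line chunks: $patt_file.aa, $patt_file.ab, ...
split -l 5000 "$patt_file" "$patt_file."
dir=.             #<= Directory where the target files are
result_file=/tmp/patternfnd.result
rm -f "$result_file".*
n=0
for p in "$patt_file".??
do
  i=0
  # Assumes the .gz file names contain no whitespace
  for f in $(find "$dir" -name '*.gz')
  do
    # One background job per (pattern chunk, data file) pair.
    # gzip -dc does the same as gzcat; grep -F does the same as fgrep.
    gzip -dc "$f" | grep -F -n -f "$p" >"${result_file}.P${n}F${i}" &
    (( i += 1 ))
  done
  (( n += 1 ))
done
wait              # Block until every background job has finished


PS: Remember to wait for all the processes to finish before reading the results! The wait at the end of the script takes care of that.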
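
The script above starts one background job for every pattern chunk and data file pair, all at once. If that is too much for the machine, one possible variation is to cap the number of jobs with `xargs -P`. This is a sketch under assumptions: it relies on an xargs that supports `-P` (GNU and BSD xargs do), the limit of 4 and all file names are invented, and it hands the whole pattern list to a single `grep -F` per file so that each file is decompressed only once. Whether your grep copes with 50,000 patterns in one go is something to measure; nothing in this thread confirms it.

```shell
#!/bin/sh
# Demo data: a tiny pattern list and two gzipped files (hypothetical).
work=$(mktemp -d)
printf 'foo\nbar\n' > "$work/patterns.txt"
printf 'x foo\ny\n' | gzip > "$work/one.gz"
printf 'z\nbar q\n' | gzip > "$work/two.gz"

# xargs runs one small sh per file name, at most 4 at a time (-P 4).
# Inside the sh: $1 is the .gz file, $2 is the pattern list.
find "$work" -name '*.gz' -print |
  xargs -P 4 -I{} sh -c 'gzip -dc "$1" | grep -F -n -f "$2" > "$1.hits"' sh {} "$work/patterns.txt"

found=$(cat "$work"/*.hits | sort)
printf '%s\n' "$found"
```

xargs returns only after all of its jobs have ended, so no separate wait is needed in this version.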

Last edited by LKBrwn_DBA; 08-06-04 at 16:43.
  #10 (permalink)  
Old 08-06-04, 16:49
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456

The dBforums server seems to be running slow today.