
08-02-04, 11:16
|
|
Registered User
|
|
Join Date: Jul 2004
Posts: 45
|
|
|
searching a string in gzip files
|
|
What's the command to search for a string in a gzipped file? Or, how do I search for a pattern in *.gz files?
|
|

08-02-04, 11:28
|
|
Registered User
|
|
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456
|
|
Try:
Code:
gunzip -c MyFile.gz|grep 'pattern'

NOTE: gz suffix added for clarity only.
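
For anyone who wants to try the pipeline without a real archive at hand, here is a minimal sketch. The file name sample.log.gz, its contents, and the search string are made up for illustration. It uses gzip -dc, which does the same job as gunzip -c; where zgrep is installed, it wraps the same decompress-and-grep pipeline in one command.

```shell
#!/bin/sh
# Build a small gzipped sample (hypothetical file name and contents).
workdir=$(mktemp -d)
printf 'alpha\nerror: disk full\nbeta\nerror: timeout\n' | gzip -c > "$workdir/sample.log.gz"

# Decompress to stdout and filter; the .gz file on disk is left untouched.
matches=$(gzip -dc "$workdir/sample.log.gz" | grep 'error')
printf '%s\n' "$matches"

# -c counts matching lines instead of printing them.
count=$(gzip -dc "$workdir/sample.log.gz" | grep -c 'error')
echo "matching lines: $count"

rm -rf "$workdir"
```

Because the data is streamed through the pipe, no uncompressed copy of the file is written to disk.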
__________________
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
|
|

08-02-04, 12:22
|
|
Registered User
|
|
Join Date: Jul 2004
Posts: 45
|
|
|
|
Thanks for your reply. This works too:
gzcat <filename> | grep <pattern>
Do you have any suggestions for speeding up the search when there is a lot of data? For example, with over 1000 gz files, how can I make the search faster?
Thanks in advance.
|
|

08-02-04, 13:14
|
|
Registered User
|
|
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456
|
|
Divide and conquer.
Run gzcat | grep in the background (&), one job per file pattern or directory.
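
A minimal sketch of that idea, assuming the files are spread over two directories. The directory names, file contents, the search string 'needle', and the helper function search_dir are all hypothetical. Each background job writes to its own result file so that concurrent writers never interleave, and wait blocks until all jobs are done.

```shell
#!/bin/sh
# Hypothetical layout: two directories of gzipped files, created here for the demo.
workdir=$(mktemp -d)
mkdir "$workdir/dirA" "$workdir/dirB" "$workdir/out"
printf 'foo\nneedle one\n' | gzip -c > "$workdir/dirA/a.gz"
printf 'needle two\nbar\n' | gzip -c > "$workdir/dirB/b.gz"

# search_dir DIR OUTFILE: scan every .gz file in DIR, collecting hits in OUTFILE.
search_dir() {
    for f in "$1"/*.gz; do
        gzip -dc "$f" | grep 'needle'
    done > "$2"
}

# One background job per directory, each with its own output file.
search_dir "$workdir/dirA" "$workdir/out/dirA.hits" &
search_dir "$workdir/dirB" "$workdir/out/dirB.hits" &

wait   # block until both background jobs have finished

hits=$(cat "$workdir/out/dirA.hits" "$workdir/out/dirB.hits")
printf '%s\n' "$hits"
rm -rf "$workdir"
```

How much this helps depends on the machine: decompression is CPU-bound, so more jobs than CPU cores is unlikely to gain much.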

|
|

08-06-04, 11:23
|
|
Registered User
|
|
Join Date: Jul 2004
Posts: 45
|
|
Do you think grep or egrep will work for searching more than 50,000 strings?
For example:
Assume you have over 50,000 strings to search for in the file list. If I run
cat <filename> | grep -E <strings>
will this work? I have never tried more than 50, and I suspect it won't, since grep may have some limit.
What do you think?
What could be a solution for this?
Thanks in advance.
|
|

08-06-04, 11:38
|
|
Registered User
|
|
Join Date: Apr 2004
Location: Boston, MA
Posts: 325
|
|
Split (man split) your file of search strings into 'manageable' files and do your grepping in the 'divide-and-conquer' manner outlined above, combining the results at the end.
OR
If your search strings have something in common, try to identify a regex that describes them and use it with egrep.
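
A small sketch of the second option, under the assumption that the search strings share a fixed shape. The ORD-<4 digits> IDs, the file names, and the data are invented for illustration. It runs the same search twice, once with the full string list (grep -F -f) and once with a single extended regex (grep -E, the POSIX spelling of egrep), so the two results can be compared.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Hypothetical pattern list: order IDs that all share the shape ORD-<4 digits>.
printf 'ORD-1001\nORD-1002\nORD-2417\n' > "$workdir/patterns.txt"
# Hypothetical data to search.
printf 'shipped ORD-1001\nnothing here\nreturned ORD-2417\ninvoice INV-1002\n' > "$workdir/data.txt"

# Fixed-string search against the full list, one pattern per line.
by_list=$(grep -F -f "$workdir/patterns.txt" "$workdir/data.txt")

# One extended regex describing the shape the strings have in common.
by_regex=$(grep -E 'ORD-[0-9]{4}' "$workdir/data.txt")

printf '%s\n' "$by_regex"
rm -rf "$workdir"
```

Note the trade-off: the regex matches any string of that shape, not only the ones in the list, so it is only a safe replacement when every string of that shape is wanted.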
__________________
vlad
+-----------------------+
| #include <disclaimer.h> |
+-----------------------+
|
|

08-06-04, 15:04
|
|
Registered User
|
|
Join Date: Jul 2004
Posts: 45
|
|
In that case, if I split my string file (the pattern file) of 50,000 strings into 50 files (1000 strings per file) and have 20 files to scan against them, that is 50*20 = 1000 runs!
i.e. gzcat <file to be searched.gz> | fgrep -f <string1.file>
I tested 1000 strings in a file against a text file of 672,593 lines. It took 4.5 minutes, so the whole job would take 4.5*50*20 = 4500 minutes! That could be reduced by running multiple processes in the background.
Is there a better way to do it?
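
To put numbers on that estimate, here is the arithmetic as a short sketch. The timing and counts come from the post above; the job count of 10 is an assumed value for illustration, and the parallel figure is a best case that assumes every run takes as long as the test run and that the jobs scale perfectly.

```shell
#!/bin/sh
# Figures from the post above; the job count is a hypothetical value.
secs_per_run=270      # 4.5 minutes for one pattern chunk against one file
pattern_chunks=50     # 50,000 strings split 1,000 per file
target_files=20
jobs=10               # assumed number of concurrent background processes

total_runs=$((pattern_chunks * target_files))
serial_min=$((secs_per_run * total_runs / 60))
serial_hours=$((serial_min / 60))
# Best case only: assumes perfect scaling with no CPU or disk contention.
parallel_min=$((serial_min / jobs))

echo "runs: $total_runs"
echo "serial: $serial_min min (~$serial_hours h)"
echo "with $jobs jobs, at best: $parallel_min min"
```

So the serial estimate is 1000 runs, 4500 minutes, or about 75 hours, which is why running the jobs concurrently matters here.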
|
|

08-06-04, 16:31
|
|
Registered User
|
|
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456
|
|
Quote:
|
Originally Posted by deebee
In that case, if I split my string file (the pattern file) of 50,000 strings into 50 files (1000 strings per file) and have 20 files to scan against them, that is 50*20 = 1000 runs!
i.e. gzcat <file to be searched.gz> | fgrep -f <string1.file>
I tested 1000 strings in a file against a text file of 672,593 lines. It took 4.5 minutes, so the whole job would take 4.5*50*20 = 4500 minutes! That could be reduced by running multiple processes in the background.
Is there a better way to do it?
|
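Maybe you could try something like this:
Code:
#!/bin/ksh
# Split the pattern list into chunks, then scan every .gz file
# against each chunk, one background job per (chunk, file) pair.
patt_file=50000patterns.txt
split -l 5000 "$patt_file" "$patt_file."   # creates $patt_file.aa, .ab, ...
dir=. #<= Directory where the target files are
result_file=/tmp/patternfnd.result
rm -f "$result_file".*
n=0
for p in "$patt_file".??
do
  i=0
  # Assumes the file names contain no whitespace
  for f in $(find "$dir" -name '*.gz')
  do
    # -n prefixes each hit with its line number in the uncompressed file
    gzcat "$f" | fgrep -nf "$p" > "${result_file}.P${n}F${i}" &
    (( i += 1 ))
  done
  (( n += 1 ))
done
wait   # Do not read the result files until every job has finished
PS: Remember to wait for all the processes to finish!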
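
One caveat with the script above: it starts every job at once, so many pattern chunks times many files can mean thousands of simultaneous processes. A possible variation, sketched here under assumptions, launches the jobs in batches and waits for each batch to drain. The limit max_jobs=2, the sample files, and the patterns are hypothetical; grep -F is the POSIX spelling of fgrep.

```shell
#!/bin/sh
workdir=$(mktemp -d)
mkdir "$workdir/gz" "$workdir/out"
# Hypothetical inputs: five small gzipped files and a two-line pattern file.
for k in 1 2 3 4 5; do
    printf 'line %s\nhit-%s\n' "$k" "$k" | gzip -c > "$workdir/gz/f$k.gz"
done
printf 'hit-2\nhit-4\n' > "$workdir/patterns.txt"

max_jobs=2   # assumed limit; tune to the number of CPUs and the disk throughput
running=0
for f in "$workdir"/gz/*.gz; do
    name=$(basename "$f" .gz)
    gzip -dc "$f" | grep -F -n -f "$workdir/patterns.txt" > "$workdir/out/$name.hits" &
    running=$((running + 1))
    # Once a full batch is in flight, wait for it to drain before launching more.
    if [ "$running" -ge "$max_jobs" ]; then
        wait
        running=0
    fi
done
wait   # catch the final partial batch

# Merge the results; with several files, grep prefixes each line with its file name.
merged=$(cd "$workdir/out" && grep '' *.hits)
printf '%s\n' "$merged"
rm -rf "$workdir"
```

Batching is the simplest throttle but not the tightest one: a slow job holds up its whole batch. Each merged line has the form resultfile:linenumber:text, so hits stay traceable to the file they came from.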
|
Last edited by LKBrwn_DBA; 08-06-04 at 16:43.
|

08-06-04, 16:49
|
|
Registered User
|
|
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456
|
|
The DBForums server seems to be running slow today.