  1. #1
    Join Date
    Jul 2004
    Posts
    45

    Unanswered: searching a string in gzip files

    What's the command to search for a string in a gzipped file? In other words, how do I search for a pattern in a *.gz file?

  2. #2
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    Try:
    Code:
    gunzip -c MyFile.gz|grep 'pattern'

    NOTE: gz suffix added for clarity only.
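    For several files at once, the same pipeline can be wrapped in a loop that labels each hit with the file it came from. This is only a minimal sketch and not something posted in the thread: the function name search_gz is made up for illustration, and gzip -dc is used as the equivalent of gunzip -c.

```shell
# search_gz PATTERN FILE...
# Prints "file:matching line" for every hit in every listed .gz file.
# (search_gz is a hypothetical helper name, not a standard command.)
search_gz() {
  pattern=$1
  shift
  for f in "$@"; do
    # gzip -dc decompresses to stdout and leaves the file on disk untouched
    gzip -dc "$f" | grep -- "$pattern" |
      while IFS= read -r line; do
        printf '%s:%s\n' "$f" "$line"
      done
  done
}
```

    A call would look like: search_gz 'pattern' /some/dir/*.gz. Where the zgrep wrapper that ships with gzip is installed, zgrep 'pattern' MyFile.gz should do the same job for a single file.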
    The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

  3. #3
    Join Date
    Jul 2004
    Posts
    45
    Thanks for your reply. This works too:
    gzcat <filename> | grep <pattern>
    Do you have any suggestions for speeding up the search when the amount of data is huge? For example, with over 1000 gz files, how can I make the search faster?
    Thanks in advance.

  4. #4
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    Divide and conquer.

    Run gzcat in the background (&) for different file patterns or different directories!
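    One way this could look, as a rough sketch only: one background job per directory, each writing to its own result file, followed by a wait. The function name search_dirs and the result.N file names are assumptions made for the example, and gzip -dc stands in for gzcat.

```shell
# search_dirs PATTERN OUTDIR DIR...
# Starts one background job per directory; job N writes OUTDIR/result.N.
# (search_dirs and the result.N naming are hypothetical.)
search_dirs() {
  pattern=$1
  outdir=$2
  shift 2
  n=0
  for d in "$@"; do
    # The subshell runs in the background and sees the current value of n
    (
      for f in "$d"/*.gz; do
        [ -f "$f" ] || continue   # the glob matched nothing in this directory
        gzip -dc "$f" | grep -- "$pattern"
      done > "$outdir/result.$n"
    ) &
    n=$((n + 1))
  done
  wait   # block until every background job has finished
}
```

    Separate output files keep the jobs from interleaving their lines; the files can be concatenated once wait has returned.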

  5. #5
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    Divide and conquer.

    Run gzcat | grep in the background (&) for different file patterns or different directories!

  6. #6
    Join Date
    Jul 2004
    Posts
    45
    Do you think grep or egrep will work when searching for more than 50,000 strings?
    For example:
    Assume you have over 50,000 strings to search for in the file list. If I run
    cat <filename> | grep -E <strings>
    will this work? I have never tried more than 50 strings, and I suspect grep has some limit that would stop it from working.
    What do you say?
    What could be a solution for this?
    Thanks in advance.

  7. #7
    Join Date
    Apr 2004
    Location
    Boston, MA
    Posts
    325
    Split (man split) your file of search strings into 'manageable' files and do your grepping in the 'divide-and-conquer' manner outlined above, combining the results at the end.

    OR

    if your searchable strings have some type of commonality between them, try to identify a regex describing the strings and use it with egrep.
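    The first option might be sketched as below. This is an illustration under stated assumptions, not a tested recipe from the thread: search_chunked is a made-up name, the pattern file is assumed to be non-empty and free of blank lines (a blank pattern matches every line), and grep -F is used as the POSIX spelling of fgrep.

```shell
# search_chunked PATTERN_FILE CHUNK_LINES GZ_FILE
# Splits the pattern list, greps the target once per chunk in parallel,
# then merges the per-chunk results. (search_chunked is hypothetical.)
search_chunked() {
  patterns=$1
  chunk_lines=$2
  target=$3
  work=$(mktemp -d) || return 1
  split -l "$chunk_lines" "$patterns" "$work/chunk."
  for c in "$work"/chunk.*; do
    # -F fixed strings, -f patterns from a file, -n prefix the line number
    gzip -dc "$target" | grep -n -F -f "$c" > "$c.out" &
  done
  wait
  # A line that matches patterns from several chunks is reported once per
  # chunk, so sort numerically on the line number and drop the repeats.
  sort -t: -k1,1n -u "$work"/chunk.*.out
  rm -rf "$work"
}
```

    Note that every chunk decompresses the target again, so the number of chunks is a trade-off between pattern-list size and repeated decompression work.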
    vlad
    +-----------------------+
    | #include <disclaimer.h> |
    +-----------------------+

  8. #8
    Join Date
    Jul 2004
    Posts
    45
    In that case, if I split my string file (pattern file) of 50,000 strings into 50 files (1000 strings per file) and I have 20 files to scan for those strings, there will be 50*20 iterations!
    i.e. gzcat <file to be searched.gz> | fgrep -f <string1.file>

    I tested 1000 strings in a file against a text file of 672,593 lines. It took 4.50 minutes, so the whole job would take 4.5*50*20 minutes! That could be reduced if I run multiple processes in the background.
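    Spelled out, that estimate is 4.5 * 50 * 20 = 4500 minutes, which is 75 hours when everything runs one after another. A small sketch of the arithmetic follows; estimate_minutes is a made-up helper, and the optional jobs argument assumes perfect scaling across background processes, which real CPUs and disks are unlikely to deliver.

```shell
# estimate_minutes MIN_PER_RUN PATTERN_FILES TARGET_FILES [PARALLEL_JOBS]
# Prints the estimated wall-clock minutes. (Hypothetical helper; the
# division by PARALLEL_JOBS is an optimistic assumption.)
estimate_minutes() {
  LC_ALL=C awk -v t="$1" -v p="$2" -v f="$3" -v j="${4:-1}" \
    'BEGIN { printf "%.1f\n", t * p * f / j }'
}

estimate_minutes 4.5 50 20      # prints 4500.0 (serial)
estimate_minutes 4.5 50 20 10   # prints 450.0 (10 jobs, ideal scaling)
```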

    Is there a better way to do it?

  9. #9
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    Quote Originally Posted by deebee
    In that case, if I split my string file (pattern file) of 50,000 strings into 50 files (1000 strings per file) and I have 20 files to scan for those strings, there will be 50*20 iterations!
    i.e. gzcat <file to be searched.gz> | fgrep -f <string1.file>

    I tested 1000 strings in a file against a text file of 672,593 lines. It took 4.50 minutes, so the whole job would take 4.5*50*20 minutes! That could be reduced if I run multiple processes in the background.

    Is there a better way to do it?
    Maybe you could try something like this:
    Code:
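    #!/bin/ksh
    patt_file=50000patterns.txt
    # Split the patterns into chunks of 5000 lines: $patt_file.aa, $patt_file.ab, ...
    split -l 5000 "$patt_file" "$patt_file."
    dir=.             #<= Directory where the target files are
    result_file=/tmp/patternfnd.result
    rm -f "$result_file".*
    n=0
    for p in "$patt_file".??
    do
      i=0
      # Assumes the .gz file names contain no whitespace
      for f in $(find "$dir" -name '*.gz')
      do
        # One background job per (pattern chunk, target file) pair
        gzcat "$f" | fgrep -nf "$p" > "${result_file}.P${n}F${i}" &
        (( i += 1 ))
      done
      (( n += 1 ))
    done
    # Block until every background job has finished before reading the results
    wait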


    PS: Remember to wait for all proc's to finish!
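    With many pattern chunks and many files, starting every job at once can swamp the machine. One possible variation, sketched here as an assumption rather than something from the thread, is to start the jobs in batches and wait between batches. The name run_batched and the result.F<i> file names are invented for the example, and grep -F stands in for fgrep.

```shell
# run_batched MAX_JOBS PATTERN_FILE OUTDIR GZ_FILE...
# Runs at most MAX_JOBS searches at a time; file number i writes
# OUTDIR/result.F<i>. (run_batched is a hypothetical helper.)
run_batched() {
  max_jobs=$1
  patterns=$2
  outdir=$3
  shift 3
  running=0
  i=0
  for f in "$@"; do
    gzip -dc "$f" | grep -n -F -f "$patterns" > "$outdir/result.F$i" &
    i=$((i + 1))
    running=$((running + 1))
    if [ "$running" -ge "$max_jobs" ]; then
      wait          # let the current batch drain before starting more
      running=0
    fi
  done
  wait              # final wait: do not read the results before this returns
}
```

    Batching is cruder than a real job queue, since one slow file holds up its whole batch, but it needs nothing beyond the shell's built-in wait.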
    Last edited by LKBrwn_DBA; 08-06-04 at 17:43.

  10. #10
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713

    The DBForums server seems to be running slow today.
