  1. #1
    Join Date
    Jan 2005
    Posts
    3

    Unanswered: Per line processing

    I'm trying to write a script which scrubs illegal characters from an XML file. Each line is essentially a db record and I would like to check each line for illegal characters. If the record has illegal characters, I want to remove that line from the file and put it in another error report file.

    I have written a script that does all this, but it is quite slow, presumably because of process creation overhead (every line spawns a subshell and a tr process). Here is the working script:

    Code:
    while IFS= read -r DIRTY ; do
            # Delete the disallowed bytes; any difference means the line was dirty
            CLEAN=$(printf '%s\n' "$DIRTY" | tr -d '\001-\010\013\014\016-\037\177-\377')
            if [ "$CLEAN" != "$DIRTY" ]; then
                    echo "Found a line of bad XML"
                    printf '%s\n' "$DIRTY" >> "$1.ERR"
            else
                    printf '%s\n' "$DIRTY" >> "$1.NEW"
            fi
    done < "$1"
    It currently takes 3 minutes of CPU time to process a 25MB file with 120000 lines. Perhaps two calls to sed (one to search for lines with bad chars, the other for lines without) would work but I don't know the actual sed commands to do that. Any help?

    mike

  2. #2
    Join Date
    Jan 2005
    Posts
    3
    I've been trying to get sed working instead and have a pretty good start:

    # Find bad lines
    sed -n -e 's/(^*[\\001-\\010\\013\\014\\016-\\037]*$)/ \\1/p' $INPUT > $INPUT.ERR
    # Find good lines
    sed -n -e 's/(^*[^\\001-\\010\\013\\014\\016-\\037]*$)/ \\1/p' $INPUT > $INPUT.NEW
    But I get this error:

    sed: -e expression #1, char 48: Invalid range end

    Anyone know what the problem is?

  3. #3
    Join Date
    Apr 2004
    Location
    Boston, MA
    Posts
    325
    how about a single 'awk':

    nawk -f mp.awk input.txt

    here's mp.awk
    Code:
    BEGIN {
      #PAT_bad="([a-c])|([xyz])"
      PAT_bad="([\\001-\\010])|([\\013\\014])|([\\016-\\037])"
    }
    {
      out= FILENAME "." (match($0,PAT_bad) ? "bad" : "good");
      print > out;
    }

  4. #4
    Join Date
    Jan 2005
    Posts
    3
    Nice, vgersh99. I hadn't realized that many of the GNU tools that handle regexps support predefined POSIX character classes such as [[:cntrl:]]. Here's a solution I came up with that also seems to work:

    # Strip any DOS carriage returns (^M is a literal Ctrl-M) - they would be caught by the search for control characters
    sed 's/^M$//' $1 > $1.CLEAN
    rm $1
    mv $1.CLEAN $1

    grep '[[:cntrl:]]' $1 > $1.ERR
    grep -v '[[:cntrl:]]' $1 > $1.NEW
    That takes 14 seconds instead of 3+ minutes. Thanks for the tip.
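
    One caveat: [[:cntrl:]] also matches tab (\011) and DEL (\177), so lines containing tabs end up in the .ERR file, which the original tr version did not do. Below is a rough sketch of a variant that keeps the same byte set as the tr call. It assumes a POSIX shell and a grep that honours LC_ALL=C; split_by_ctrl is a name made up for illustration.

```shell
# Sketch only: split_by_ctrl is a made-up name, not something from the thread.
# Lines containing any byte the original tr call deleted go to FILE.ERR,
# all other lines go to FILE.NEW. Tab (\011) and CR (\015) are not flagged.
split_by_ctrl() {
    file=$1
    # Build the bracket expression from raw bytes; printf expands the octal escapes.
    bad=$(printf '[\001-\010\013\014\016-\037\177-\377]')
    # LC_ALL=C makes grep work on single bytes, so the ranges are byte ranges.
    # grep exits with 1 when nothing is selected, which is not an error here.
    LC_ALL=C grep -e "$bad" "$file" > "$file.ERR" || [ "$?" -eq 1 ] || return
    LC_ALL=C grep -v -e "$bad" "$file" > "$file.NEW" || [ "$?" -eq 1 ]
}
```

    Usage would be something like: split_by_ctrl input.xml. Note that the \177-\377 range also flags any non-ASCII (e.g. UTF-8) text, exactly as the tr version does.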

  5. #5
    Join Date
    Apr 2004
    Location
    Boston, MA
    Posts
    325
    FYI:

    change this:
    # Strip any DOS carriage returns (^M is a literal Ctrl-M) - they would be caught by the search for control characters
    sed 's/^M$//' $1 > $1.CLEAN
    rm $1
    mv $1.CLEAN $1
    to this:
    Code:
    (echo '%s/^M$//'; echo 'wq') | ex -s "$1"
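
    If typing a literal ^M (Ctrl-V Ctrl-M) is awkward, the carriage return can also be generated with printf. A minimal sketch, assuming a POSIX sh and sed; strip_cr and the .tmp suffix are names made up for illustration:

```shell
# Sketch only: strip_cr is a hypothetical helper name.
# Removes one trailing carriage return from every line of FILE, in place.
strip_cr() {
    file=$1
    # printf produces the CR byte, so no literal ^M has to be typed.
    cr=$(printf '\r')
    # Write to a temporary file first; replace the original only if sed succeeded.
    sed "s/${cr}\$//" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

    Carriage returns in the middle of a line are left alone, so they would still be reported by a later control-character check.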
    vlad
    +-----------------------+
    | #include <disclaimer.h> |
    +-----------------------+
