dBforums > Data Access, Manipulation & Batch Languages > Unix Shell Scripts > Per line processing
#1
01-05-05, 12:32
mperham
Registered User
Join Date: Jan 2005
Posts: 3
Per line processing

I'm trying to write a script which scrubs illegal characters from an XML file. Each line is essentially a db record and I would like to check each line for illegal characters. If the record has illegal characters, I want to remove that line from the file and put it in another error report file.

I have written a script that does all this, but it is quite slow. I assume the cause is process creation overhead, since it starts a new tr process for every line. Here is the working script:

Code:
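# Read line by line without mangling backslashes or leading whitespace
while IFS= read -r DIRTY ; do
        # Delete control and high bytes, keeping tab, LF and CR
        CLEAN=`printf '%s\n' "$DIRTY" | tr -d '\001-\010\013\014\016-\037\177-\377'`
        if [ "$CLEAN" != "$DIRTY" ]; then
                echo "Found a line of bad XML"
                printf '%s\n' "$DIRTY" >> "$1.ERR"
        else
                printf '%s\n' "$DIRTY" >> "$1.NEW"
        fi
done < "$1"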
It currently takes 3 minutes of CPU time to process a 25MB file with 120,000 lines. Perhaps two calls to sed (one to find lines with bad characters, the other to find lines without) would work, but I don't know the actual sed commands to do that. Any help?

mike
#2
01-05-05, 13:22
mperham
Registered User
Join Date: Jan 2005
Posts: 3
I've been trying to get sed working instead and have a pretty good start:

Quote:
# Find bad lines
sed -n -e 's/(^*[\\001-\\010\\013\\014\\016-\\037]*$)/ \\1/p' $INPUT > $INPUT.ERR
# Find good lines
sed -n -e 's/(^*[^\\001-\\010\\013\\014\\016-\\037]*$)/ \\1/p' $INPUT > $INPUT.NEW
But I get this error:

sed: -e expression #1, char 48: Invalid range end

Anyone know what the problem is?
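
A guess at the cause, not a confirmed diagnosis: POSIX bracket expressions treat a backslash as an ordinary character, so `\001` is not necessarily read as an octal escape there, and the `-` signs may end up forming ranges between unintended characters. A minimal sketch that sidesteps this builds the bracket expression from real bytes with `printf` and lets a single sed pass write both files with the `w` command. The function name `split_with_sed` is hypothetical, and `LC_ALL=C` is an assumption so that the ranges compare by byte value.

```shell
# Hypothetical helper: split "$1" into "$1.ERR" (lines containing a
# control byte) and "$1.NEW" (all other lines) in one sed pass.
split_with_sed() {
    input=$1
    # printf turns the octal escapes into real bytes; tab (\011),
    # newline (\012) and carriage return (\015) are left out on purpose.
    bad=$(printf '[\001-\010\013\014\016-\037]')
    # -n suppresses default output. Bad lines are written to .ERR and
    # dropped with 'd'; every line that survives is written to .NEW.
    LC_ALL=C sed -n \
        -e "/$bad/{" \
        -e "w $input.ERR" \
        -e "d" \
        -e "}" \
        -e "w $input.NEW" \
        "$input"
}
```

Called as `split_with_sed "$INPUT"`, this reads the file once instead of twice. Both output files are created even when one of them ends up empty.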
#3
01-05-05, 13:55
vgersh99
Registered User
Join Date: Apr 2004
Location: Boston, MA
Posts: 325
how about a single 'awk':

nawk -f mp.awk input.txt

here's mp.awk
Code:
BEGIN {
  #PAT_bad="([a-c])|([xyz])"
  PAT_bad="([\\001-\\010])|([\\013\\014])|([\\016-\\037])"
}
{
  out= FILENAME "." (match($0,PAT_bad) ? "bad" : "good");
  print > out;
}
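
As a rough sketch of the same idea without a separate mp.awk file, the program can be passed inline from the shell. The function name `split_with_awk` is hypothetical. It writes `<file>.bad` and `<file>.good` as in the script above, collapses the three alternatives into one bracket expression, and assumes an awk that accepts octal escapes in string literals (gawk, mawk and nawk do, as far as I know). `LC_ALL=C` is an assumption to keep the ranges byte-wise.

```shell
# Hypothetical helper: route each line of "$1" to "$1.bad" or "$1.good".
split_with_awk() {
    LC_ALL=C awk '
        BEGIN {
            # Octal escapes in the string literal become real bytes,
            # so the regex sees a plain bracket expression.
            pat = "[\001-\010\013\014\016-\037]"
        }
        {
            # Pick the output file name per line, then append the line to it.
            out = FILENAME "." (($0 ~ pat) ? "bad" : "good")
            print > out
        }
    ' "$1"
}
```

Usage would be `split_with_awk input.txt`. Unlike the sed and grep variants, an output file is only created if at least one line is routed to it.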
#4
01-05-05, 14:34
mperham
Registered User
Join Date: Jan 2005
Posts: 3
Nice, vgersh99. I didn't know this, but many of the tools that deal with regexps (GNU grep, sed and awk among them) support predefined POSIX character classes such as [[:cntrl:]] that you can use in your regexps. Here's a solution I came up with that also seems to work:

Quote:
# Strip DOS carriage returns - they would be flagged as control characters
# (^M below is a literal carriage return, typed as Ctrl-V Ctrl-M)
sed 's/^M$//' "$1" > "$1.CLEAN"
rm "$1"
mv "$1.CLEAN" "$1"

grep '[[:cntrl:]]' "$1" > "$1.ERR"
grep -v '[[:cntrl:]]' "$1" > "$1.NEW"
That takes 14 seconds instead of 3+ minutes. Thanks for the tip.
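
One caveat worth checking against your data: `[[:cntrl:]]` also matches tab, which the original `tr` set kept, and in the C locale it does not match bytes `\200`-`\377`, which the `tr` set removed. A hedged sketch that keeps the original byte set spells the ranges out as real bytes. The function name `split_with_grep` is hypothetical. `-a` (treat the input as text) is a GNU/BSD grep option rather than POSIX, and is assumed here so that grep prints the lines instead of a "Binary file matches" notice.

```shell
# Hypothetical helper: same two grep passes, but with the exact byte set
# from the original tr command (tab, LF and CR are not flagged).
split_with_grep() {
    input=$1
    bad=$(printf '[\001-\010\013\014\016-\037\177-\377]')
    # grep exits with 1 when nothing matches; treat that as success
    # and let real errors (status 2) propagate.
    LC_ALL=C grep -a -e "$bad" "$input" > "$input.ERR" || [ "$?" -eq 1 ] || return
    LC_ALL=C grep -a -v -e "$bad" "$input" > "$input.NEW" || [ "$?" -eq 1 ]
}
```

With this byte set the carriage-return cleanup step is not needed for the classification itself, because `\015` is excluded from the pattern.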
#5
01-05-05, 14:44
vgersh99
Registered User
Join Date: Apr 2004
Location: Boston, MA
Posts: 325
FYI:

change this:
Quote:
# Strip DOS carriage returns - they would be flagged as control characters
# (^M below is a literal carriage return, typed as Ctrl-V Ctrl-M)
sed 's/^M$//' "$1" > "$1.CLEAN"
rm "$1"
mv "$1.CLEAN" "$1"
to this:
Code:
# ^M is a literal carriage return (Ctrl-V Ctrl-M); % applies the substitution to every line
(echo '%s/^M$//'; echo 'wq') | ex -s "$1"
__________________
vlad
+-----------------------+
| #include <disclaimer.h> |
+-----------------------+
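
If typing a literal `^M` is awkward, a sketch like the following builds the carriage return with `printf` instead. The function name `strip_cr` and the temporary-file suffix are hypothetical. It rewrites the file through a temporary copy rather than editing it in place with ex.

```shell
# Hypothetical helper: remove one trailing carriage return from each line of "$1".
strip_cr() {
    file=$1
    # Command substitution only strips trailing newlines, so the CR survives.
    cr=$(printf '\r')
    # Write to a temporary file, then replace the original only if sed succeeded.
    sed "s/$cr\$//" "$file" > "$file.tmp.$$" && mv "$file.tmp.$$" "$file"
}
```

Called as `strip_cr "$1"`, it leaves carriage returns in the middle of a line untouched, which matches the `s/^M$//` substitution above.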