I'm trying to write a script which scrubs illegal characters from an XML file. Each line is essentially a db record and I would like to check each line for illegal characters. If the record has illegal characters, I want to remove that line from the file and put it in another error report file.
I have written a script that does all this, but it is quite slow, presumably because of process creation overhead (one tr process is started for every line). Here is the working script:
Code:
while IFS= read -r DIRTY ; do
    # Delete every disallowed byte; if anything was removed, the line is bad.
    CLEAN=$(printf '%s\n' "$DIRTY" | tr -d '\001-\010\013\014\016-\037\177-\377')
    if [ "$CLEAN" != "$DIRTY" ]; then
        echo "Found a line of bad XML"
        printf '%s\n' "$DIRTY" >> "$1.ERR"
    else
        printf '%s\n' "$DIRTY" >> "$1.NEW"
    fi
done < "$1"
It currently takes 3 minutes of CPU time to process a 25MB file with 120,000 lines. Perhaps two calls to sed (one to select the lines with bad characters, the other to select the lines without) would work, but I don't know the actual sed commands to do that. Any help?
mike
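
A possible faster variant, sketched under assumptions rather than tested on the real data: if the cost really is one tr process per line, a single awk process can scan the whole file once and route each line to the right output. The function name split_xml_lines is hypothetical. The sketch assumes an awk that accepts octal escapes inside bracket expressions (gawk and mawk do), and it truncates the .NEW and .ERR files on each run, whereas the loop above appends to them.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script).
# Splits the file "$1" into "$1.NEW" (clean lines) and "$1.ERR"
# (lines containing a disallowed byte) using one awk process.
split_xml_lines() {
    # LC_ALL=C makes awk work on single bytes, so the octal ranges
    # cover the same byte values that the tr call deletes.
    # Note: awk -v interprets backslash escapes, so this assumes the
    # file name contains no backslashes.
    LC_ALL=C awk -v err="$1.ERR" -v new="$1.NEW" '
        # Any disallowed byte on the line: send it to the error report.
        /[\001-\010\013\014\016-\037\177-\377]/ { print > err; next }
        # Otherwise keep the line.
        { print > new }
    ' "$1"
}
```

It would be called as split_xml_lines "$1" in place of the while loop. Tab, newline, and carriage return (octal 011, 012, 015) are left out of the bracket expression, as in the tr character list, so lines containing only those control characters are kept.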