02-04-09, 05:28
mike_bike_kite
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
To get the median you need to get the length of each sequence, sort the lengths, then pick the middle value if there's an odd number of rows or average the two middle values if there's an even number. You're unlikely to be able to do this in a single line of script, which is why I wrote a small program for you. I've tried to improve the median part of the code and came up with this:
Code:
#!/bin/sh

echo "GAAAAGAGGA
ATATTAGGTTTTTAC
TATATTTAACGCGAATGATT" > original_file.dat

# show average
cat original_file.dat | \
        awk '
                BEGIN   { total=0 }
                        { total = total + length($0) }
                END     { print "AVG=" total/NR}'

# get lengths and sort them
cat original_file.dat | \
        awk '{ print length($0) }' | \
        sort -n \
        > tmp.dat

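# how many records are in the file
RECS=`wc -l < tmp.dat`

# ODD_NUM is 1 for an odd record count, 0 for an even one
ODD_NUM=`expr $RECS % 2`
# position of the middle record (odd count) or the upper of the two middle records (even count)
CUTOFF_POINT=`expr $RECS / 2 + 1`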

if test $ODD_NUM -eq 0
then
        # even count: average the two middle values
        head -n $CUTOFF_POINT tmp.dat | \
                tail -n 2 | \
                awk '
                        BEGIN   { total=0 }
                                { total = total + $0 }
                        END     { print "MEDIAN=" total/NR}'
else
        # odd count: the middle value is the median
        echo "MEDIAN="`head -n $CUTOFF_POINT tmp.dat | tail -n 1`
fi

exit
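
If you'd rather avoid the temp file, the same idea (lengths, sort, pick the middle) can probably be squeezed into one pipeline by letting awk hold the sorted lengths in an array. This is only a rough sketch: the function name median_of_lengths and the "MEDIAN=n/a" output for empty input are my own choices, not anything from your original script.

```shell
#!/bin/sh

# Hypothetical helper: prints the median line length of the given file(s),
# or of standard input when no file is given.
median_of_lengths() {
        awk '{ print length($0) }' "$@" | \
        sort -n | \
        awk '
                        { len[NR] = $1 }        # keep the sorted lengths by position
                END     {
                        if (NR == 0) { print "MEDIAN=n/a"; exit }
                        if (NR % 2 == 1)
                                print "MEDIAN=" len[(NR + 1) / 2]
                        else
                                print "MEDIAN=" (len[NR / 2] + len[NR / 2 + 1]) / 2
                }'
}

# the three sample sequences have lengths 10, 15 and 20
printf 'GAAAAGAGGA\nATATTAGGTTTTTAC\nTATATTTAACGCGAATGATT\n' | median_of_lengths
```

With the three sample sequences this prints MEDIAN=15. Add a fourth sequence such as ACGT and the sorted lengths become 4, 10, 15, 20, so the even branch averages 10 and 15 and prints MEDIAN=12.5. The trade-off is that awk keeps every length in memory, which should be fine unless the file is huge.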
You still haven't explained what you're doing with DNA - is it anything interesting?