  1. #1
    Join Date
    Jan 2009
    Posts
    4

    Unanswered: Urgent: Unix Median And Average/mean Problem. Urgent need help from UNIX expert...

    Median Problem:
    GAAAAGAGGA
    ATATTAGGTTTTTAC
    TATATTTAACGCGAATGATT


    Of the three sequences above, ATATTAGGTTTTTAC is the median, and its length is 15. How can I use a Unix script to show automatically that the median length is 15? What command line should I type?

    Mean problem:
    GAAAAGAGGA
    ATATTAGGTTTTTAC
    TATATTTAACGCGAATGATT

    The average/mean length of the sequences above is 15. How can I use a Unix script to calculate automatically that the average/mean is 15? What command line should I type? My senior advised me to use awk, but I don't know how to write it. Any command line is fine as long as it solves the problem. Many thanks for your advice.

  2. #2
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by patrick chia
    My senior advised me to use awk, but I don't know how to write it. Any command line is fine as long as it solves the problem.
    Average is quite easy but median is a bit more difficult. I had to remind myself what the median was by looking it up on the web - I found about 5 forums with your question on! I think I'm reasonably close with the following:

    Code:
    #!/bin/sh
    
    echo "GAAAAGAGGA
    ATATTAGGTTTTTAC
    TATATTTAACGCGAATGATT" > original_file.dat
    
    # show average
    cat original_file.dat | \
            awk '
                    BEGIN   { total=0 }
                            { total = total + length($0) }
                    END     { print "AVG=" total/NR}'
    
    # get lengths and sort them
    cat original_file.dat | \
            awk '{ print length($0) }' | \
            sort -n \
            > tmp.dat
    
    # how many recs in file
    RECS=`cat tmp.dat | wc -l`
    
    RECS=`expr $RECS / 2`
    
    if test `expr $RECS % 2` -eq 0
    then
            # if record length is even then average two middle values 
            RECS=`expr $RECS + 1`
            cat tmp.dat | head -$RECS | tail -2 |
                    awk '
            BEGIN   { total=0 }
                    { total = total + $0 }
            END     { print "MEDIAN=" total/NR}'
    
    else
            # else just use middle value
            echo "MEDIAN="`cat tmp.dat | head -$RECS | tail -1`
    fi
    
    exit
    The program creates the original file but I assume you have this so you'll need to comment that bit out. The data appears to be genetic codes - out of interest what is it for?

    Mike

  3. #3
    Join Date
    Jan 2009
    Posts
    4
    Hi Mike, thanks for your advice. The command line you suggested for the average works. But the median part doesn't seem to work: it reported that "RECS" is not a command. Do you have another solution to this problem? Thanks a lot for your help.

  4. #4
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by patrick chia
    Thanks for your advice
    It's been 5 days since I gave you a complete solution to your problem. Something tells me I wouldn't even have got this note of thanks if it wasn't for the fact you can't get the program to work!

    Quote Originally Posted by patrick chia
    But the median part doesn't seem to work
    Why can't it work? It works perfectly well for me when I tried it with your data and more complex examples. I assume you realise it's a program and not something you type out line by line.

    Quote Originally Posted by patrick chia
    it reported that "RECS" is not a command
    Unix generally supplies an error message that indicates the problem. I suggest you give me this. Is the tmp.dat file being created and does it contain data? What does the $RECS variable contain after it is set? Are you passing your data through the program?

    Quote Originally Posted by patrick chia
    Do you have another solution to this problem?
    No - the program given works perfectly well as it stands.
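
    The checks asked for above can be sketched roughly as follows. The scratch directory name is made up for illustration; the pipeline that builds tmp.dat is the one from the script. Running the whole script with `sh -x tmp.sh` is another way to see each command and variable as it executes.

```shell
#!/bin/sh
# Hypothetical scratch directory so the checks do not touch real data.
workdir=${TMPDIR:-/tmp}/median_check.$$
mkdir -p "$workdir"

# Rebuild tmp.dat the same way the script does: one length per line, sorted.
printf '%s\n' GAAAAGAGGA ATATTAGGTTTTTAC TATATTTAACGCGAATGATT |
    awk '{ print length($0) }' |
    sort -n > "$workdir/tmp.dat"

# Is tmp.dat being created, and does it contain data?
ls -l "$workdir/tmp.dat"
cat "$workdir/tmp.dat"

# What does RECS contain after it is set? Note: no spaces around "=".
RECS=`cat "$workdir/tmp.dat" | wc -l`
echo "RECS=$RECS"
```

    One possible cause of a message saying RECS is not a command, offered only as a guess, is an assignment typed with spaces around the equals sign: in sh, `RECS = 3` is treated as a request to run a command called RECS.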

  5. #5
    Join Date
    Jan 2009
    Posts
    4
    Hi mike_bike_kite,
    I tried it again the way you explained, and it runs now. But the median comes out as 10 instead of 15. Do you know what is going wrong? The average is calculated correctly as 15.
    Mike, do you have any suggestion for getting a median of 15 instead of 10? I hope you can help me find a command line that I can also use to get the median and average of another, much larger file next time. Thanks a lot for your help.
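
    A likely explanation for the 10, assuming the first script was run unchanged: it halves RECS before testing whether the count is odd or even, so with three records it ends up taking the first sorted value rather than the second. The trace below is only a sketch that replays that arithmetic on the three sorted lengths; the variable names `sorted`, `branch` and `first_median` are made up for illustration.

```shell
#!/bin/sh
# Sorted lengths of the three sample sequences.
sorted="10
15
20"

RECS=3                      # what wc -l reports for three records
RECS=$(( RECS / 2 ))        # integer division: 3 / 2 = 1

# The first script tests the parity of the halved value, not of the count.
if [ $(( RECS % 2 )) -eq 0 ]; then
    branch=even
    first_median="(would average two values)"
else
    branch=odd
    # head takes the first RECS lines (just one), tail keeps the last of them.
    first_median=$(printf '%s\n' "$sorted" | head -n "$RECS" | tail -n 1)
fi

echo "branch=$branch MEDIAN=$first_median"
```

    With three records this takes the else branch and reads line 1 of the sorted lengths, which is 10, whereas the middle value is on line 2.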

  6. #6
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    To get the median you need to get the length of each sequence, sort the lengths, then pick the middle value if there is an odd number of rows or average the two middle values if there is an even number. You're unlikely to be able to do this in a single line of script, hence I produced a small program for you. I've tried to improve the median part of the code and came up with this:
    Code:
    #!/bin/sh
    
    echo "GAAAAGAGGA
    ATATTAGGTTTTTAC
    TATATTTAACGCGAATGATT" > original_file.dat
    
    # show average
    cat original_file.dat | \
            awk '
                    BEGIN   { total=0 }
                            { total = total + length($0) }
                    END     { print "AVG=" total/NR}'
    
    # get lengths and sort them
    cat original_file.dat | \
            awk '{ print length($0) }' | \
            sort -n \
            > tmp.dat
    
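    # how many recs in file
    RECS=`cat tmp.dat | wc -l`
    
    # ODD_NUM is 1 for an odd number of records, 0 for an even number
    ODD_NUM=`expr $RECS % 2`
    # CUTOFF_POINT is the line number of the middle value (odd count)
    # or of the upper of the two middle values (even count)
    CUTOFF_POINT=`expr $RECS / 2`
    CUTOFF_POINT=`expr $CUTOFF_POINT + 1`
    
    if test $ODD_NUM -eq 0
    then
            # even count: average the two middle values
            cat tmp.dat | \
                    head -n $CUTOFF_POINT | \
                    tail -n 2 | \
                    awk '
                            BEGIN   { total=0 }
                                    { total = total + $0 }
                            END     { print "MEDIAN=" total/NR}'
    else
            # odd count: just use the middle value
            echo "MEDIAN="`cat tmp.dat | head -n $CUTOFF_POINT | tail -n 1`
    fi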
    
    exit
    You still haven't explained what you're doing with DNA - is it anything interesting?
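
    For a larger file, the same steps (lengths, sort, middle value or average of the two middle values) can also be sketched as one pipeline with no temporary file. This is an alternative arrangement rather than the program above; `seq_stats` and the sample file name are hypothetical, and it assumes one sequence per line with no blank lines.

```shell
#!/bin/sh
# seq_stats FILE: print the mean and median line length of FILE.
# (seq_stats is a made-up helper name, not part of the program above.)
seq_stats() {
    awk '{ print length($0) }' "$1" |
    sort -n |
    awk '
        { len[NR] = $1; total += $1 }      # remember sorted lengths, sum them
        END {
            if (NR == 0) { print "no input"; exit 1 }
            print "AVG=" total / NR
            if (NR % 2) {
                # odd count: the single middle value
                print "MEDIAN=" len[(NR + 1) / 2]
            } else {
                # even count: mean of the two middle values
                print "MEDIAN=" (len[NR / 2] + len[NR / 2 + 1]) / 2
            }
        }'
}

# Usage with the three sample sequences (hypothetical scratch file).
sample=${TMPDIR:-/tmp}/seq_stats_sample.$$
printf '%s\n' GAAAAGAGGA ATATTAGGTTTTTAC TATATTTAACGCGAATGATT > "$sample"
seq_stats "$sample"
```

    Holding all the lengths in an awk array is fine for files that fit in memory; sort still does the ordering, so awk only has to index into the result.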

  7. #7
    Join Date
    Jan 2009
    Posts
    4
    Yes. It's for a DNA assignment in which I have to use the UNIX command line to find the median length.
    Mike, I tried the modified program. The result just shows "MEDIAN=ATATTAGGTTTTTAC". Why is that? I expected it to show "MEDIAN=15". The average part works. Do you have any suggestion for solving this? Thanks a lot for your advice. Have a nice day.

  8. #8
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Quote Originally Posted by patrick chia
    The average part works.
    I just copied the program above and pasted it into a file called tmp.sh
    I then ran it and it produced the following :
    Code:
    # sh tmp.sh
    AVG=15
    MEDIAN=15
    I altered the sequence by adding a 3 letter code and ran it again :
    Code:
    # sh tmp.sh
    AVG=12
    MEDIAN=12.5
    The average and the median seem correct to me.
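
    The second result can be checked by hand. With a 3-letter sequence added, the sorted lengths are 3, 10, 15 and 20; the short sketch below just redoes that arithmetic (the variable names are made up for illustration).

```shell
#!/bin/sh
# Sorted lengths after adding a 3-letter sequence to the original three.
a=3 b=10 c=15 d=20

# Mean: total length divided by the number of sequences (48 / 4).
avg=$(awk -v t="$(( a + b + c + d ))" 'BEGIN { print t / 4 }')

# Median for an even count: mean of the two middle values, b and c.
median=$(awk -v lo="$b" -v hi="$c" 'BEGIN { print (lo + hi) / 2 }')

echo "AVG=$avg"
echo "MEDIAN=$median"
```

    This prints AVG=12 and MEDIAN=12.5, matching the output of the program.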

    Quote Originally Posted by patrick chia
    Do you have any suggestion for solving this?
    I suggest you simply paste your sequences into Excel and use the functions that Excel provides to do what you need. Alternatively you can ask your senior to show you how the above program is run. If you need to know Unix shell scripting for your course then I can recommend The Unix Programming Environment.
