#1
01-30-09, 01:44
patrick chia
Registered User
Join Date: Jan 2009
Posts: 4
Urgent: Unix Median And Average/mean Problem. Urgent need help from UNIX expert...

Median Problem:
GAAAAGAGGA
ATATTAGGTTTTTAC
TATATTTAACGCGAATGATT


Of the three sequences above, ATATTAGGTTTTTAC is the median and its length is 15. How can I use a Unix script to show automatically that the length of the median is 15? What command line should I type?

Mean problem:
GAAAAGAGGA
ATATTAGGTTTTTAC
TATATTTAACGCGAATGATT

The average/mean length of the sequences above is 15. How can I use a Unix script to calculate automatically that the average/mean is 15? What command line should I type? My senior advised me to use an awk command line, but I don't know how to write it. Any command line will do, as long as it solves the problem. Thanks a lot for your advice.
#2
01-30-09, 05:11
mike_bike_kite
vaguely human
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
My senior advised me to use an awk command line, but I don't know how to write it. Any command line will do, as long as it solves the problem.
The average is quite easy but the median is a bit more difficult. I had to remind myself what the median was by looking it up on the web - I found about 5 forums with your question on them! I think I'm reasonably close with the following:

Code:
#!/bin/sh

echo "GAAAAGAGGA
ATATTAGGTTTTTAC
TATATTTAACGCGAATGATT" > original_file.dat

# show average
cat original_file.dat | \
        awk '
                BEGIN   { total=0 }
                        { total = total + length($0) }
                END     { print "AVG=" total/NR}'

# get lengths and sort them
cat original_file.dat | \
        awk '{ print length($0) }' | \
        sort -n \
        > tmp.dat

# how many recs in file
RECS=`cat tmp.dat | wc -l`

RECS=`expr $RECS / 2`

if test `expr $RECS % 2` -eq 0
then
        # if the number of records is even then average the two middle values
        RECS=`expr $RECS + 1`
        cat tmp.dat | head -$RECS | tail -2 |
                awk '
        BEGIN   { total=0 }
                { total = total + $0 }
        END     { print "MEDIAN=" total/NR}'

else
        # else just use middle value
        echo "MEDIAN="`cat tmp.dat | head -$RECS | tail -1`
fi

exit
The program creates the original file, but I assume you already have it, so you'll need to comment that bit out. The data appears to be genetic code - out of interest, what is it for?

Mike
#3
02-02-09, 20:22
patrick chia
Registered User
Join Date: Jan 2009
Posts: 4
Hi Mike. Thanks for your advice. The command line you suggested for the average works. But the median part doesn't seem to work. It reported that "RECS" is not a command. Do you have any other solution to this problem? Thanks a lot for your help.
#4
02-03-09, 04:54
mike_bike_kite
vaguely human
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by patrick chia
Thanks for your advice
It's been 5 days since I gave you a complete solution to your problem. Something tells me I wouldn't even have got this note of thanks if it wasn't for the fact you can't get the program to work!
Quote:
Originally Posted by patrick chia
But the median part doesn't seem to work.
Why can't it work? It worked perfectly well for me when I tried it with your data and with more complex examples. I assume you realise it's a program and not something you type out line by line.
Quote:
Originally Posted by patrick chia
It reported that "RECS" is not a command
Unix generally supplies an error message that indicates the problem. I suggest you give me this. Is the tmp.dat file being created, and does it contain data? What does the $RECS variable contain after it is set? Are you passing your data through the program?
Quote:
Originally Posted by patrick chia
Do you have any other solution to this problem?
No - the program given works perfectly well as it stands.
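For anyone following along, the checks asked about above could be run roughly like this. It is only a sketch: it rebuilds tmp.dat from the sample data in this thread (using printf rather than the echo in the original script) and then prints the intermediate values. Nothing here comes from the original program apart from the file names and the RECS line.

```shell
#!/bin/sh
# Sketch only: rebuild tmp.dat from the thread's sample data, then print
# the intermediate values that the questions above ask about.
printf '%s\n' GAAAAGAGGA ATATTAGGTTTTTAC TATATTTAACGCGAATGATT > original_file.dat
awk '{ print length($0) }' original_file.dat | sort -n > tmp.dat

# Is tmp.dat being created, and does it contain data?
if test -s tmp.dat
then
        echo "tmp.dat contents:"
        cat tmp.dat
else
        echo "tmp.dat is missing or empty"
fi

# What does $RECS contain after it is set?
RECS=`cat tmp.dat | wc -l`
echo "RECS=$RECS"
```

Running the whole script with sh -x in front of the script name also prints each command as it executes, which usually shows where things go wrong.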
#5
02-03-09, 20:03
patrick chia
Registered User
Join Date: Jan 2009
Posts: 4
Hi mike_bike_kite,
I tried it again the way you showed me, and it runs now. But the median comes out as 10 instead of 15. Do you know what is going wrong? The average is calculated correctly as 15.
Mike, do you have any suggestion to make the median come out as 15 instead of 10? I hope you can help me come up with a more advanced command line that I can use to find the median and average of another, much larger file next time. Thanks a lot for your help.
#6
02-04-09, 05:28
mike_bike_kite
vaguely human
Join Date: Jun 2007
Location: London
Posts: 2,519
To get the median you need to get the length of each sequence, sort the lengths, then pick the middle value if there's an odd number of rows or average the two middle values if it's even. You're unlikely to be able to do this in a single line of script, hence I produced a small program for you. I've tried to improve the median part of the code and came up with this:
Code:
#!/bin/sh

echo "GAAAAGAGGA
ATATTAGGTTTTTAC
TATATTTAACGCGAATGATT" > original_file.dat

# show average
cat original_file.dat | \
        awk '
                BEGIN   { total=0 }
                        { total = total + length($0) }
                END     { print "AVG=" total/NR}'

# get lengths and sort them
cat original_file.dat | \
        awk '{ print length($0) }' | \
        sort -n \
        > tmp.dat

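# how many recs in file
RECS=`cat tmp.dat | wc -l`

# 1 if the number of records is odd, 0 if it is even
ODD_NUM=`expr $RECS % 2`
# position of the middle record (odd count) or upper middle record (even count)
CUTOFF_POINT=`expr $RECS / 2`
CUTOFF_POINT=`expr $CUTOFF_POINT + 1`

if test $ODD_NUM -eq 0
then
        # even number of records: average the two middle values
        cat tmp.dat | \
                head -n $CUTOFF_POINT | \
                tail -n 2 | \
                awk '
                        BEGIN   { total=0 }
                                { total = total + $0 }
                        END     { print "MEDIAN=" total/NR}'
else
        # odd number of records: take the middle value
        echo "MEDIAN="`cat tmp.dat | head -n $CUTOFF_POINT | tail -n 1`
fi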

exit
You still haven't explained what you're doing with DNA - is it anything interesting?
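As a rough illustration of what changed between the two versions: the first script tested whether the halved record count was even, and for an odd branch it read line RECS/2, which is one short of the middle. The sketch below only reports which sorted line(s) each version would read for a given record count. The function names old_pick and new_pick are made up for this example, and it uses $(( )) arithmetic instead of expr for brevity.

```shell
#!/bin/sh
# old_pick and new_pick are hypothetical names used only in this sketch.
# Each prints which line(s) of the sorted lengths a version would read
# when given the record count as $1.
old_pick() {
        half=$(( $1 / 2 ))
        if test $(( half % 2 )) -eq 0       # parity of the halved count
        then
                echo "average lines $half and $(( half + 1 ))"
        else
                echo "line $half"
        fi
}

new_pick() {
        cutoff=$(( $1 / 2 + 1 ))
        if test $(( $1 % 2 )) -eq 0         # parity of the count itself
        then
                echo "average lines $(( cutoff - 1 )) and $cutoff"
        else
                echo "line $cutoff"
        fi
}

echo "3 records: old reads `old_pick 3`, new reads `new_pick 3`"
```

With the three sorted lengths 10, 15 and 20, line 1 is 10 and line 2 is 15, which matches the wrong and right medians reported in this thread.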
#7
02-04-09, 19:55
patrick chia
Registered User
Join Date: Jan 2009
Posts: 4
Yup. I'm working on a DNA assignment that uses the UNIX command line to find the median length.
Mike, I tried the program you modified. The result just shows "MEDIAN=ATATTAGGTTTTTAC". Why does this happen? I was expecting the answer to be "MEDIAN=15". The average part of the program works. Do you have any suggestion to solve this problem? Thanks a lot for your advice. Have a nice day.
#8
02-05-09, 04:16
mike_bike_kite
vaguely human
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Originally Posted by patrick chia
The average part of the program works.
I just copied the program above and pasted it into a file called tmp.sh.
I then ran it and it produced the following:
Code:
# sh tmp.sh
AVG=15
MEDIAN=15
I altered the data by adding a 3-letter sequence and ran it again:
Code:
# sh tmp.sh
AVG=12
MEDIAN=12.5
The average and the median seem correct to me.
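For a larger file, the same two figures could also be produced in one pipeline without the temporary file. This is only a sketch of an alternative, not the program above: seq_stats is a made-up function name, the awk array holds every length in memory, and blank lines in the input would be counted as sequences of length 0.

```shell
#!/bin/sh
# seq_stats is a hypothetical name. It reads sequences (one per line) from
# the file named in $1 and prints the mean and median of their lengths.
seq_stats() {
        awk '{ print length($0) }' "$1" | sort -n |
        awk '
                { len[NR] = $1; total += $1 }
        END     {
                if (NR == 0) exit            # empty input: print nothing
                if (NR % 2) {
                        median = len[(NR + 1) / 2]
                } else {
                        median = (len[NR / 2] + len[NR / 2 + 1]) / 2
                }
                print "AVG=" total / NR
                print "MEDIAN=" median
        }'
}

# Sample data from this thread
printf '%s\n' GAAAAGAGGA ATATTAGGTTTTTAC TATATTTAACGCGAATGATT > original_file.dat
seq_stats original_file.dat
```

On the three sample sequences this prints AVG=15 and MEDIAN=15. Adding any 3-letter sequence to the file gives lengths 3, 10, 15 and 20, so it prints AVG=12 and MEDIAN=12.5, the same as the second run above.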
Quote:
Originally Posted by patrick chia
Do you have any suggestion to solve this problem?
I suggest you simply paste your sequences into Excel and use the functions that Excel provides to do what you need. Alternatively, you can ask your senior to show you how the above program is run. If you need to know Unix shell scripting for your course then I can recommend The Unix Programming Environment.