  1. #1
    Join Date
    Aug 2004
    Location
    Cotswolds, UK
    Posts
    2

    Unanswered: Extracting the latest date from a data file

    I have to parse a big (200MB!) data file in order to efficiently determine the latest date in that file.

    All lines in the data file have the following format:

    <record id>,<date>,<value 1>,<value 2>

    where date is in the format "dd/mm/yyyy"

    Every record represented by "record id" has around 20 lines associated with it, with unique dates covering the last 5 years. Most records have the same dates represented, but in some cases records may have more dates than others.

    The file is delivered unsorted.

    Currently I do this in a Bourne shell script using cut, uniq, grep and sort: I first get a unique list of dates, then use cut and sort on that list to pick out the latest year, then again for the latest month, and finally for the latest day of the month.

    My script is below. Is there a more efficient way of doing this? As I am new to scripting, I may have overlooked a much simpler solution!

    Thanks,
    Andy


    Code:
    #!/bin/sh
    
    # File name is in first param
    
    # Get uniq dates from the file
    cut -f 2 -d ',' -s $1 | uniq > $1.dates
    
    
    # Get latest year from these dates
    for latestyear in `cut -f 3 -d '/' $1.dates | sort -r | uniq`
    do
            break
    done
    
    
    # Get latest month for this year
    for latestmonth in `grep $1.dates -e /$latestyear | cut -f 2 -d '/' | sort -r`
    do
            break
    done
    
    
    ## Get the latest day for this month
    for latestday in `grep $1.dates -e /$latestmonth/$latestyear | cut -f 1 -d '/' | sort -r`
    do
            break
    done
    
    # Delete the dates file
    rm $1.dates
    
    echo $latestday/$latestmonth/$latestyear
    Last edited by andycowling; 08-04-04 at 08:42.

  2. #2
    Join Date
    Apr 2004
    Location
    Boston, MA
    Posts
    325
    Could you please post a representative sample for debugging?
    You can do it all in awk (rough sketch below):
    1. convert the date from "dd/mm/yyyy" to "yyyymmdd" format
    2. using awk associative arrays, go through the file comparing and storing the LATEST record based on the NUMERIC value of "date"
    3. output the final associative array.
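    For example, something along these lines would do all three steps in one pass (a rough sketch only, untested; it assumes the file name is in $1 and the date is always the second comma-separated field):
    Code:
    awk -F, '{
        # dd/mm/yyyy -> yyyymmdd so dates compare numerically
        d = substr($2,7,4) substr($2,4,2) substr($2,1,2)
        # keep the newest date seen so far for each record id
        if (d + 0 > latest[$1] + 0) latest[$1] = d
    }
    END {
        # pick the overall latest and print it back as dd/mm/yyyy
        for (id in latest) if (latest[id] > max) max = latest[id]
        print substr(max,7,2) "/" substr(max,5,2) "/" substr(max,1,4)
    }' "$1"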
    vlad
    +-------------------------+
    | #include <disclaimer.h> |
    +-------------------------+

  3. #3
    Join Date
    Jun 2003
    Location
    West Palm Beach, FL
    Posts
    2,713


    Try this:
    Code:
    #!/bin/sh
    
    # File name is in first param
    # Get uniq dates from the file
    cut -f 2 -d ',' -s $1 | uniq |\
    awk '{print substr($1,7,4) "/" substr($1,4,2) "/" substr($1,1,2);}' |\
    sort  -r | head -1
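    Note that this prints the latest date as yyyy/mm/dd (which is what lets the plain sort -r find the newest one). If you need it back in the original dd/mm/yyyy form, one extra stage along these lines should do it (a sketch, untested):
    Code:
    cut -f 2 -d ',' -s $1 | uniq |\
    awk '{print substr($1,7,4) "/" substr($1,4,2) "/" substr($1,1,2);}' |\
    sort -r | head -1 |\
    awk -F/ '{print $3 "/" $2 "/" $1}'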
    The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

  4. #4
    Join Date
    Jun 2002
    Location
    UK
    Posts
    525
    You could do most of the work using sort...

    sort -rt, -k2.7,2.10 -k2.4,2.5 -k2.1,2.2 yourFile | head -1 | cut -f2 -d,

    or...

    sort -rt, -k2.7,2.10 -k2.4,2.5 -k2.1,2.2 yourFile | awk -F, '{print $2; exit}'

    i.e. sort by the year, then month, then day portions of your date in reverse order, and then take the first date.
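    For reference, each -k key picks out a character range within field 2, so the one-liner can be wrapped up like the original script (a sketch, assuming the file name is passed as the first parameter):
    Code:
    #!/bin/sh
    # -k2.7,2.10  sort on characters 7-10 of field 2 (the year)
    # -k2.4,2.5   then characters 4-5 of field 2 (the month)
    # -k2.1,2.2   then characters 1-2 of field 2 (the day)
    # -r          reverse order, so the newest date comes first
    sort -rt, -k2.7,2.10 -k2.4,2.5 -k2.1,2.2 "$1" | awk -F, '{print $2; exit}'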
    Last edited by Damian Ibbotson; 08-04-04 at 13:53. Reason: got my days and months mixed up!

  5. #5
    Join Date
    Aug 2004
    Location
    Cotswolds, UK
    Posts
    2
    I timed my solution against the awk solution. The awk solution is close to twice as fast! Thanks for this excellent code sample.
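    (For anyone wanting to repeat the comparison, time on each script against the same file is enough; the script and file names below are just placeholders:)
    Code:
    time ./original.sh datafile
    time ./awk_version.sh datafile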

    I won't post sample data here as it now seems unnecessary - but thanks for the offer!

  6. #6
    Join Date
    May 2006
    Posts
    5
    vlad is right.

  7. #7
    Join Date
    May 2006
    Posts
    5
    sort is a good option.
