I have to parse a big (200MB!) data file in order to efficiently determine the latest date in that file.
All lines in the data file have the following format:
<record id>,<date>,<value 1>,<value 2>
where date is in the format "dd/mm/yyyy".
Each record (identified by "record id") has around 20 lines associated with it, each with a unique date, covering the last 5 years. Most records share the same set of dates, but some records have more dates than others.
The file is delivered unsorted.
Currently I do this in a (Bourne) shell script, using cut and uniq to get a list of unique dates. I then use cut and sort on this list to pick out the latest year. Next I use grep, cut and sort to pick out the latest month in that year. Finally I use grep, cut and sort again to pick out the latest day of that month.
My script is below. Is there a more efficient way of doing this? As I am new to scripting I may have overlooked a much simpler solution!
Thanks,
Andy
Code:
#!/bin/sh
# File name is in first param
# Get the dates from the file (uniq only collapses adjacent duplicates,
# because the file is unsorted)
cut -f 2 -d ',' -s "$1" | uniq > "$1.dates"
# Get latest year from these dates
for latestyear in `cut -f 3 -d '/' "$1.dates" | sort -r | uniq`
do
break
done
# Get latest month for this year
for latestmonth in `grep -e "/$latestyear" "$1.dates" | cut -f 2 -d '/' | sort -r`
do
break
done
# Get the latest day for this month
for latestday in `grep -e "/$latestmonth/$latestyear" "$1.dates" | cut -f 1 -d '/' | sort -r`
do
break
done
# Delete the dates file
rm "$1.dates"
echo "$latestday/$latestmonth/$latestyear"
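
For comparison, here is a rough sketch of how the same result might be obtained in a single pass with awk, without the temporary dates file or the repeated sorts. It is not tested against the real 200MB file. The function name latest_date is made up for illustration, and the sketch assumes every date is zero-padded "dd/mm/yyyy" in the second comma-separated field, so that rearranging it to yyyymmdd gives a string that orders the same way as the dates do.

```shell
#!/bin/sh
# latest_date FILE
# Hypothetical helper: prints the latest dd/mm/yyyy date found in field 2.
# Assumes dates are well formed and zero-padded (e.g. 05/03/2004).
latest_date() {
    awk -F',' '
        # Skip lines without a comma, as "cut -s" does in the original script
        NF >= 2 {
            # Split dd/mm/yyyy into d[1]=dd, d[2]=mm, d[3]=yyyy
            if (split($2, d, "/") == 3) {
                # Rebuild as yyyymmdd so a plain string comparison orders by date
                key = d[3] d[2] d[1]
                if (key > max) {
                    max = key
                    latest = $2
                }
            }
        }
        # Print the date in its original format once the whole file is read
        END { if (latest != "") print latest }
    ' "$1"
}
```

Inside a script that takes the file name as its first parameter, this would be called as latest_date "$1". The file is read once and only the largest key seen so far is kept, so memory use does not grow with the number of lines.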