If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Data Access, Manipulation & Batch Languages > Unix Shell Scripts > Extracting the latest date from a data file

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 08-04-04, 07:38
andycowling andycowling is offline
Registered User
 
Join Date: Aug 2004
Location: Cotswolds, UK
Posts: 2
Question Extracting the latest date from a data file

I have to parse a big (200MB!) data file in order to efficiently determine the latest date in that file.

All lines in the data file have the following format:
<record id>,<date>,<value 1>,<value 2>
where date is in the format "dd/mm/yyyy"

Every record represented by "record id" has around 20 lines associated with it, with unique dates covering the last 5 years. Most records have the same dates represented, but in some cases records may have more dates than others.

The file is delivered unsorted.

Currently I do this in a (bourne) shell using grep & uniq to get a uniq list of dates. I then use cut and sort on this list to pick out the latest year. I then use cut and sort again to pick out the latest month. Finally I use cut and sort to pick out the latest day of the month.

My script is below. Is there a more efficient way of doing this? As I am new to scripting I may have overlooked a much simpler solution!

Thanks,
Andy


Code:
#!/bin/sh

# File name is in first param

# Get uniq dates from the file
cut -f 2 -d ',' -s $1 | uniq > $1.dates


# Get latest year from these dates
for latestyear in `cut -f 3 -d '/' $1.dates | sort -r | uniq`
do
        break
done


# Get latest month for this year
for latestmonth in `grep $1.dates -e /$latestyear | cut -f 2 -d '/' | sort -r`
do
        break
done


## Get the latest day for this month
for latestday in `grep $1.dates -e /$latestmonth/$latestyear | cut -f 1 -d '/' | sort -r`
do
        break
done

# Delete the dates file
rm $1.dates

echo $latestday/$latestmonth/$latestyear

Last edited by andycowling; 08-04-04 at 07:42.
Reply With Quote
  #2 (permalink)  
Old 08-04-04, 09:05
vgersh99 vgersh99 is offline
Registered User
 
Join Date: Apr 2004
Location: Boston, MA
Posts: 325
Could you pls post a [bold]representative[/bold] sample for debugging.
You can do it all in awk:
1. convert date from "dd/mm/yyyy" to "yyyymmdd" format
2. using awk associate arrays go through the file comparing ans storing the LATEST record based on the NUMERIC value of "date"
3. output the final associative array.
__________________
vlad
+-----------------------+
| #include <disclaimer.h> |
+-----------------------+
Reply With Quote
  #3 (permalink)  
Old 08-04-04, 10:51
LKBrwn_DBA LKBrwn_DBA is offline
Registered User
 
Join Date: Jun 2003
Location: West Palm Beach, FL
Posts: 2,456
Cool

Try this:
Code:
#!/bin/sh

# File name is in first param
# Get uniq dates from the file
cut -f 2 -d ',' -s $1 | uniq |\
awk '{print substr($1,7,4) "/" substr($1,4,2) "/" substr($1,1,2);}' |\
sort  -r | head -1
__________________
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
Reply With Quote
  #4 (permalink)  
Old 08-04-04, 11:10
Damian Ibbotson Damian Ibbotson is offline
Padawan
 
Join Date: Jun 2002
Location: UK
Posts: 525
You could do most of the work using sort...

sort -rt, -k2.7,2.10 -k2.4,2.5 -k2.1,2.2 yourFile | head -1 | cut -f2 -d,

or...

sort -rt, -k2.7,2.10 -k2.4,2.5 -k2.1,2.2 yourFile | awk -F, '{print $2; exit}'

i.e. 'sort' by the year , then month, then day portions of your date in reverse order and then take the first date

Last edited by Damian Ibbotson; 08-04-04 at 12:53. Reason: got my days and months mixed up!
Reply With Quote
  #5 (permalink)  
Old 08-04-04, 11:22
andycowling andycowling is offline
Registered User
 
Join Date: Aug 2004
Location: Cotswolds, UK
Posts: 2
I timed my solution against the awk solution. The awk solution is close to twice as fast! Thanks for this excellent code sample.

I want post sample data here as it now seems not necessary - but thanks for the offer!
Reply With Quote
  #6 (permalink)  
Old 06-02-06, 18:15
Elizabeth X. Elizabeth X. is offline
Registered User
 
Join Date: May 2006
Posts: 5
vlad is right.
Reply With Quote
  #7 (permalink)  
Old 06-02-06, 18:15
Elizabeth X. Elizabeth X. is offline
Registered User
 
Join Date: May 2006
Posts: 5
sort is a good option.
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On