| |
|
If this is your first visit, be sure to check out the FAQ by clicking the link above.
You may have to register before you can post: click the register link above to proceed.
To start viewing messages, select the forum that you want to visit from the selection below.
|
 |

07-11-07, 18:33
|
|
Registered User
|
|
Join Date: Jul 2007
Posts: 2
|
|
|
Script for file split is taking long time to process
|
|
Hi,
I'm new to this forum and would like to take help with one of the problem I m facing.
I have a huge .CSV file around 130 million records and need to split that file into seven parts based on the conditions on the one of the column in a file.
Basically, I'm trying to convert the SAS code to Unix shell script.
Data in the input file (daily_forecast.csv):
===============
Sample records in the file
Rest_key, Mia_key, time_key, fore_units,base_units
3 ,100000000 ,17344 , 53.17, 53.00
3 ,100000000 ,17345 , 56.16, 51.00
3 ,100000000 ,17346 , 70.73, 76.00
3 ,100000000 ,17347 , 88.51, 78.00
3 ,100000000 ,17348 , 71.77, 64.00
3 ,100000000 ,17349 , 49.08, 51.00
3 ,100000000 ,17350 , 55.61, 55.00
3 ,100000000 ,17351 , 37.73, 35.00
3 ,100000000 ,17352 , 61.18, 49.00
3 ,100000000 ,17353 , 67.56, 66.12
SAS Code
=======
min_time=min_time_key ( Minimum value in the time key column )
max_time=max_time_key; ( Maximum value in the time key column)
diff=floor((max_time-min_time)/7);
if min_time <= time_key < min_time+1*diff then output df1;
(df1 is the first output file )
if min_time+1*diff <= time_key < min_time+2*diff then output df2;
if min_time+2*diff <= time_key < min_time+3*diff then output df3;
if min_time+3*diff <= time_key < min_time+4*diff then output df4;
if min_time+4*diff <= time_key < min_time+5*diff then output df5;
if min_time+5*diff <= time_key < min_time+6*diff then output df6;
if min_time+6*diff <= time_key <= max_time then output df7;
================================================== ========
I wrote a Unix Shell script as below and it is taking 1 hour 20 min for 130 million records to process. The record layout is same as I mentioned above.
Unix Script
#\bin\sh
infile=daily_forecast.csv ( Input file)
cat /usr/home/mc_data/transforms/$infile | cut -f3 -d "," > time_keys.txt ( extracting the 3rd column from the file)
cat time_keys.txt | awk 'NR==1 {m=$1;p=$1} $1 >=m {m =$1} $1 <=p {p = $1} END {print p,m}' > min_max.txt ( Getting the minimum value and max value in the 3rd column)
sed -e 's/ /,/g' min_max.txt > min_max1.txt
min_time_key=`cat min_max1.txt | cut -c1-5` ( Getting the min value from the 3rd column in the file)
max_time_key=`cat min_max1.txt | cut -c7-12` ( Getting the max value from the 3rd column in the file )
echo "$min_time_key , $max_time_key"
sub=`expr $max_time_key - $min_time_key`;
diff=`expr $sub / 7`;
Temp variables for storing the calculated values
==================================
equ=`expr $min_time_key + 1 \* $diff`;
equ1=`expr $min_time_key + 2 \* $diff`;
equ2=`expr $min_time_key + 3 \* $diff`;
equ3=`expr $min_time_key + 4 \* $diff`;
equ4=`expr $min_time_key + 5 \* $diff`;
equ5=`expr $min_time_key + 6 \* $diff`;
echo "$sub , $equ"
for timekey in `cat time_keys.txt` ( time_keys.txt contains all the values in
timekey column)
do
if [ $timekey -ge $min_time_key ] && [ $timekey -lt $equ ];
then
grep -i "$timekey" $infile >> f_forecast_1_1
echo "1"
elif [ $timekey -ge $equ ] && [ $timekey -lt $equ1 ];
then
grep -i "$timekey" $infile >> f_forecast_1_2
elif [ $timekey -ge $equ1 ] && [ $timekey -lt $equ2 ];
then
grep -i "$timekey" $infile >> f_forecast_1_3
elif [ $timekey -ge $equ2 ] && [ $timekey -lt $equ3 ];
then
grep -i "$timekey" $infile >> f_forecast_1_4
elif [ $timekey -ge $equ3 ] && [ $timekey -lt $equ4 ];
then
grep -i "$timekey" $infile >> f_forecast_1_5
elif [ $timekey -ge $equ4 ] && [ $timekey -lt $equ5 ];
then
grep -i "$timekey" $infile >> f_forecast_1_6
elif [ $timekey -ge $equ5 ] && [ $timekey -le $max_time_key ];
then
grep -i "$timekey" $infile >> f_forecast_1_7
else
echo "time key is not with in the specified range"
fi
done
exit 0;
Please suggest if there is any best way of doing this.
Thanks,
Satish
|
|

07-13-07, 17:53
|
|
Registered User
|
|
Join Date: Jan 2004
Location: Bordeaux, France
Posts: 319
|
|
Try and adapt the following awk program:
Code:
awk -F, '
{ gsub(/ /,"", $3) }
NR==1 { min=max=$3 ; next}
NR==FNR { if ($3>max) max=$3; if($3<min) min=$3; next }
NR!=FNR && FNR==1 { diff=(max-min)/7 ; if (diff==0) diff=1}
{ num=int(($3-min)/diff)+1; out="df" num; print $3>out }
' infile infile
__________________
Jean-Pierre.
|
|

07-13-07, 18:11
|
|
Registered User
|
|
Join Date: Aug 2006
Location: The Netherlands
Posts: 248
|
|
|
|
Hi,
I guess the
Code:
cat time_keys.txt | awk 'NR==1 {m=$1;p=$1} $1 >=m {m =$1} $1 <=p {p = $1} END {print p,m}'
alone takes some time. Aren't the records of the basefile written chronologicaly? If they are you shouldn't make the fuzz of extracting the third column of the basefile and read the resultfile with above statement but use the 'head -1l' command on the basefile for the first line and the 'tail -1l' command for the last to get the results.
With files of these sizes I've no experience but I know awk isn't really fast with big files. Maybe it's worth writing it in C?
Regards
|
|

07-24-07, 15:59
|
|
Registered User
|
|
Join Date: Jul 2007
Posts: 2
|
|
|
Changed the code but still taking too long during file split
I m using head and tail to get the max and min values from the file. But the logic inside for loop is taking lot of time. Please advice.
#/bin/ksh
cat forecasts1.csv | sed '1d' > forecastd.csv
#cut -f2 -d "," forecastd.csv > forecast.csv
min_val=`head -1 forecastd.csv | cut -f2 -d ","`
max_val=`sed '$!d' forecastd.csv | cut -f2 -d ","`
echo $min_val
echo $max_val
infile=forecasts1.csv
sub=`expr $max_val - $min_val`;
diff=`expr $sub / 7`;
equ=`expr $min_val + 1 \* $diff`;
equ1=`expr $min_val + 2 \* $diff`;
equ2=`expr $min_val + 3 \* $diff`;
equ3=`expr $min_val + 4 \* $diff`;
equ4=`expr $min_val + 5 \* $diff`;
equ5=`expr $min_val + 6 \* $diff`;
echo "$sub , $equ"
cut -f2 -d "," forecastd.csv > forecastd1.csv
cat forecastd1.csv | head -10
************************************************** *********THIS IS TAKING LOT OF TIME - AROUND 2 HRS
************************************************** ********
Input file size is 1542995968 bytes
************************************************** ********
for timekey in `cat forecastd1.csv`
do
if [ $timekey -ge $min_val ] && [ $timekey -lt $equ ];
then
grep -i "$timekey" $infile >> f_forecast_1_1
elif [ $timekey -ge $equ ] && [ $timekey -lt $equ1 ];
then
grep -i "$timekey" $infile >> f_forecast_1_2
elif [ $timekey -ge $equ1 ] && [ $timekey -lt $equ2 ];
then
grep -i "$timekey" $infile >> f_forecast_1_3
elif [ $timekey -ge $equ2 ] && [ $timekey -lt $equ3 ];
then
grep -i "$timekey" $infile >> f_forecast_1_4
elif [ $timekey -ge $equ3 ] && [ $timekey -lt $equ4 ];
then
grep -i "$timekey" $infile >> f_forecast_1_5
elif [ $timekey -ge $equ4 ] && [ $timekey -lt $equ5 ];
then
grep -i "$timekey" $infile >> f_forecast_1_6
elif [ $timekey -ge $equ5 ] && [ $timekey -le $max_val ];
then
grep -i "$timekey" $infile >> f_forecast_1_7
else
echo "time key is not with in the specified range"
fi
done
exit 0;
|
|

07-27-07, 06:12
|
|
Registered User
|
|
Join Date: Jan 2004
Location: Bordeaux, France
Posts: 319
|
|
You can do all the work with awk
Create the awk program file split.awk :
Code:
#!/usr/bin/awk -f
#
# --------------------------------------------
# Initialisations
# --------------------------------------------
BEGIN {
FS = ",";
if (ARGC != 2) {
printf("Usage: %s csvfile\n",ARGV[0]);
exit 2
}
ARGV[ARGC++] = ARGV[1]; # Input file is read in two pass
}
{ gsub(/ *, */, ",") } # Removes spaces aroud comma
# --------------------------------------------
# Fist Pass - Get Min/Max values for time key
# --------------------------------------------
NR==1 { next } # Skip Header Line
NR==FNR {
if (min_time_key == "") {
min_time_key = $3;
max_time_key = $3;
} else {
if (min_time_key > $3) min_time_key = $3;
if (max_time_key < $3) max_time_key = $3;
}
next;
}
# --------------------------------------------
# Second Pass - Split file
# --------------------------------------------
FNR==1 {
diff_time_key = int((max_time_key - min_time_key) / 7);
equ0 = min_time_key
equ1 = equ0 + diff_time_key;
equ2 = equ1 + diff_time_key;
equ3 = equ2 + diff_time_key;
equ4 = equ3 + diff_time_key;
equ5 = equ4 + diff_time_key;
equ6 = equ5 + diff_time_key;
next; # Skip Header Line
}
{
if ($3 < equ1) print > "f_forecast_1_1"
else if ($3 < equ2) print > "f_forecast_1_2"
else if ($3 < equ3) print > "f_forecast_1_3"
else if ($3 < equ4) print > "f_forecast_1_4"
else if ($3 < equ5) print > "f_forecast_1_5"
else if ($3 < equ6) print > "f_forecast_1_6"
else print > "f_forecast_1_7"
}
And execute it :
Code:
$ awk -f split.awk daily_forecast.csv
or
$ chmod +rx split.awk
$ split.awk daily_forecast.csv
__________________
Jean-Pierre.
|
|

07-27-07, 12:37
|
|
Registered User
|
|
Join Date: Jan 2004
Location: Bordeaux, France
Posts: 319
|
|
My previous solution assume that the file contains an header line isn't sorted on the time_key field, that seems to be false.
So this new script split.sh will run faster :
Code:
inputfile=$1
min=$(awk -F, 'NR==1 {gsub(/ /, "",$3); print $3; exit}' $inputfile)
max=$(tail -1 $inputfile | awk -F, 'NR==1 {gsub(/ /, "",$3); print $3; exit}')
awk -v min_time_key=$min -v max_time_key=$max '
# --------------------------------------------
# Initialisations
# --------------------------------------------
BEGIN {
FS = ",";
diff_time_key = int((max_time_key - min_time_key) / 7);
equ0 = min_time_key
equ1 = equ0 + diff_time_key;
equ2 = equ1 + diff_time_key;
equ3 = equ2 + diff_time_key;
equ4 = equ3 + diff_time_key;
equ5 = equ4 + diff_time_key;
equ6 = equ5 + diff_time_key;
}
{ gsub(/ *, */, ",") } # Removes spaces around comma
# --------------------------------------------
# Split file
# --------------------------------------------
{
if ($3 < equ1) print > "f_forecast_1_1"
else if ($3 < equ2) print > "f_forecast_1_2"
else if ($3 < equ3) print > "f_forecast_1_3"
else if ($3 < equ4) print > "f_forecast_1_4"
else if ($3 < equ5) print > "f_forecast_1_5"
else if ($3 < equ6) print > "f_forecast_1_6"
else print > "f_forecast_1_7"
}
' $inputfile
Run it with the filename as argument :
Code:
$ split.sh daily_forecast.csv
__________________
Jean-Pierre.
|
|
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Linear Mode
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|