If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Data Access, Manipulation & Batch Languages > Unix Shell Scripts > Script for file split is taking long time to process

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 07-11-07, 18:33
Satishbobba Satishbobba is offline
Registered User
 
Join Date: Jul 2007
Posts: 2
Script for file split is taking long time to process

Hi,

I'm new to this forum and would like to take help with one of the problem I m facing.

I have a huge .CSV file around 130 million records and need to split that file into seven parts based on the conditions on the one of the column in a file.

Basically, I'm trying to convert the SAS code to Unix shell script.

Data in the input file (daily_forecast.csv):
===============

Sample records in the file

Rest_key, Mia_key, time_key, fore_units,base_units
3 ,100000000 ,17344 , 53.17, 53.00
3 ,100000000 ,17345 , 56.16, 51.00
3 ,100000000 ,17346 , 70.73, 76.00
3 ,100000000 ,17347 , 88.51, 78.00
3 ,100000000 ,17348 , 71.77, 64.00
3 ,100000000 ,17349 , 49.08, 51.00
3 ,100000000 ,17350 , 55.61, 55.00
3 ,100000000 ,17351 , 37.73, 35.00
3 ,100000000 ,17352 , 61.18, 49.00
3 ,100000000 ,17353 , 67.56, 66.12

SAS Code
=======

min_time=min_time_key ( Minimum value in the time key column )
max_time=max_time_key; ( Maximum value in the time key column)
diff=floor((max_time-min_time)/7);
if min_time <= time_key < min_time+1*diff then output df1;
(df1 is the first output file )
if min_time+1*diff <= time_key < min_time+2*diff then output df2;
if min_time+2*diff <= time_key < min_time+3*diff then output df3;
if min_time+3*diff <= time_key < min_time+4*diff then output df4;
if min_time+4*diff <= time_key < min_time+5*diff then output df5;
if min_time+5*diff <= time_key < min_time+6*diff then output df6;
if min_time+6*diff <= time_key <= max_time then output df7;

================================================== ========

I wrote a Unix Shell script as below and it is taking 1 hour 20 min for 130 million records to process. The record layout is same as I mentioned above.

Unix Script

#\bin\sh
infile=daily_forecast.csv ( Input file)

cat /usr/home/mc_data/transforms/$infile | cut -f3 -d "," > time_keys.txt ( extracting the 3rd column from the file)

cat time_keys.txt | awk 'NR==1 {m=$1;p=$1} $1 >=m {m =$1} $1 <=p {p = $1} END {print p,m}' > min_max.txt ( Getting the minimum value and max value in the 3rd column)

sed -e 's/ /,/g' min_max.txt > min_max1.txt
min_time_key=`cat min_max1.txt | cut -c1-5` ( Getting the min value from the 3rd column in the file)

max_time_key=`cat min_max1.txt | cut -c7-12` ( Getting the max value from the 3rd column in the file )

echo "$min_time_key , $max_time_key"
sub=`expr $max_time_key - $min_time_key`;
diff=`expr $sub / 7`;

Temp variables for storing the calculated values
==================================

equ=`expr $min_time_key + 1 \* $diff`;
equ1=`expr $min_time_key + 2 \* $diff`;
equ2=`expr $min_time_key + 3 \* $diff`;
equ3=`expr $min_time_key + 4 \* $diff`;
equ4=`expr $min_time_key + 5 \* $diff`;
equ5=`expr $min_time_key + 6 \* $diff`;
echo "$sub , $equ"

for timekey in `cat time_keys.txt` ( time_keys.txt contains all the values in
timekey column)
do
if [ $timekey -ge $min_time_key ] && [ $timekey -lt $equ ];
then
grep -i "$timekey" $infile >> f_forecast_1_1
echo "1"
elif [ $timekey -ge $equ ] && [ $timekey -lt $equ1 ];
then
grep -i "$timekey" $infile >> f_forecast_1_2
elif [ $timekey -ge $equ1 ] && [ $timekey -lt $equ2 ];
then
grep -i "$timekey" $infile >> f_forecast_1_3
elif [ $timekey -ge $equ2 ] && [ $timekey -lt $equ3 ];
then
grep -i "$timekey" $infile >> f_forecast_1_4
elif [ $timekey -ge $equ3 ] && [ $timekey -lt $equ4 ];
then
grep -i "$timekey" $infile >> f_forecast_1_5
elif [ $timekey -ge $equ4 ] && [ $timekey -lt $equ5 ];
then
grep -i "$timekey" $infile >> f_forecast_1_6
elif [ $timekey -ge $equ5 ] && [ $timekey -le $max_time_key ];
then
grep -i "$timekey" $infile >> f_forecast_1_7
else
echo "time key is not with in the specified range"
fi
done
exit 0;

Please suggest if there is any best way of doing this.

Thanks,
Satish
Reply With Quote
  #2 (permalink)  
Old 07-13-07, 17:53
aigles aigles is offline
Registered User
 
Join Date: Jan 2004
Location: Bordeaux, France
Posts: 319
Try and adapt the following awk program:
Code:
awk -F, '
   { gsub(/ /,"", $3) }
   NR==1 { min=max=$3 ; next}
   NR==FNR { if ($3>max) max=$3; if($3<min) min=$3; next }
   NR!=FNR && FNR==1 { diff=(max-min)/7 ; if (diff==0) diff=1}
   { num=int(($3-min)/diff)+1; out="df" num; print $3>out }
' infile infile
__________________
Jean-Pierre.
Reply With Quote
  #3 (permalink)  
Old 07-13-07, 18:11
Tyveleyn Tyveleyn is offline
Registered User
 
Join Date: Aug 2006
Location: The Netherlands
Posts: 248
Hi,

I guess the
Code:
cat time_keys.txt | awk 'NR==1 {m=$1;p=$1} $1 >=m {m =$1} $1 <=p {p = $1} END {print p,m}'
alone takes some time. Aren't the records of the basefile written chronologicaly? If they are you shouldn't make the fuzz of extracting the third column of the basefile and read the resultfile with above statement but use the 'head -1l' command on the basefile for the first line and the 'tail -1l' command for the last to get the results.
With files of these sizes I've no experience but I know awk isn't really fast with big files. Maybe it's worth writing it in C?

Regards
Reply With Quote
  #4 (permalink)  
Old 07-24-07, 15:59
Satishbobba Satishbobba is offline
Registered User
 
Join Date: Jul 2007
Posts: 2
Changed the code but still taking too long during file split

I m using head and tail to get the max and min values from the file. But the logic inside for loop is taking lot of time. Please advice.

#/bin/ksh
cat forecasts1.csv | sed '1d' > forecastd.csv
#cut -f2 -d "," forecastd.csv > forecast.csv
min_val=`head -1 forecastd.csv | cut -f2 -d ","`
max_val=`sed '$!d' forecastd.csv | cut -f2 -d ","`
echo $min_val
echo $max_val

infile=forecasts1.csv
sub=`expr $max_val - $min_val`;
diff=`expr $sub / 7`;

equ=`expr $min_val + 1 \* $diff`;
equ1=`expr $min_val + 2 \* $diff`;
equ2=`expr $min_val + 3 \* $diff`;
equ3=`expr $min_val + 4 \* $diff`;
equ4=`expr $min_val + 5 \* $diff`;
equ5=`expr $min_val + 6 \* $diff`;
echo "$sub , $equ"

cut -f2 -d "," forecastd.csv > forecastd1.csv

cat forecastd1.csv | head -10

************************************************** *********THIS IS TAKING LOT OF TIME - AROUND 2 HRS
************************************************** ********
Input file size is 1542995968 bytes
************************************************** ********
for timekey in `cat forecastd1.csv`
do
if [ $timekey -ge $min_val ] && [ $timekey -lt $equ ];
then
grep -i "$timekey" $infile >> f_forecast_1_1
elif [ $timekey -ge $equ ] && [ $timekey -lt $equ1 ];
then
grep -i "$timekey" $infile >> f_forecast_1_2
elif [ $timekey -ge $equ1 ] && [ $timekey -lt $equ2 ];
then
grep -i "$timekey" $infile >> f_forecast_1_3
elif [ $timekey -ge $equ2 ] && [ $timekey -lt $equ3 ];
then
grep -i "$timekey" $infile >> f_forecast_1_4
elif [ $timekey -ge $equ3 ] && [ $timekey -lt $equ4 ];
then
grep -i "$timekey" $infile >> f_forecast_1_5
elif [ $timekey -ge $equ4 ] && [ $timekey -lt $equ5 ];
then
grep -i "$timekey" $infile >> f_forecast_1_6
elif [ $timekey -ge $equ5 ] && [ $timekey -le $max_val ];
then
grep -i "$timekey" $infile >> f_forecast_1_7
else
echo "time key is not with in the specified range"
fi
done
exit 0;
Reply With Quote
  #5 (permalink)  
Old 07-27-07, 06:12
aigles aigles is offline
Registered User
 
Join Date: Jan 2004
Location: Bordeaux, France
Posts: 319
You can do all the work with awk
Create the awk program file split.awk :
Code:
#!/usr/bin/awk -f
#

# --------------------------------------------
# Initialisations
# --------------------------------------------

BEGIN {
   FS = ",";
   if (ARGC != 2) {
      printf("Usage: %s csvfile\n",ARGV[0]);
      exit 2
   }
   ARGV[ARGC++] = ARGV[1];   # Input file is read in two pass
}
   
{ gsub(/ *, */, ",") }  # Removes spaces aroud comma

# --------------------------------------------
# Fist Pass - Get Min/Max values for time key
# --------------------------------------------

NR==1 { next }  # Skip Header Line
NR==FNR  {
   if (min_time_key == "") {
      min_time_key = $3;
      max_time_key = $3;
   } else {
      if (min_time_key > $3) min_time_key = $3;
      if (max_time_key < $3) max_time_key = $3;
   }
   next;
}

# --------------------------------------------
# Second Pass - Split file
# --------------------------------------------

FNR==1 {
   diff_time_key = int((max_time_key - min_time_key) / 7);
   equ0 = min_time_key
   equ1 = equ0 + diff_time_key;
   equ2 = equ1 + diff_time_key;
   equ3 = equ2 + diff_time_key;
   equ4 = equ3 + diff_time_key;
   equ5 = equ4 + diff_time_key;
   equ6 = equ5 + diff_time_key;
   next;        # Skip Header Line
}

{
        if ($3 < equ1) print > "f_forecast_1_1"
   else if ($3 < equ2) print > "f_forecast_1_2"
   else if ($3 < equ3) print > "f_forecast_1_3"
   else if ($3 < equ4) print > "f_forecast_1_4"
   else if ($3 < equ5) print > "f_forecast_1_5"
   else if ($3 < equ6) print > "f_forecast_1_6"
   else                print > "f_forecast_1_7"
}
And execute it :
Code:
$ awk -f split.awk daily_forecast.csv

or

$ chmod +rx split.awk
$ split.awk daily_forecast.csv
__________________
Jean-Pierre.
Reply With Quote
  #6 (permalink)  
Old 07-27-07, 12:37
aigles aigles is offline
Registered User
 
Join Date: Jan 2004
Location: Bordeaux, France
Posts: 319
My previous solution assume that the file contains an header line isn't sorted on the time_key field, that seems to be false.

So this new script split.sh will run faster :

Code:
inputfile=$1

min=$(awk -F, 'NR==1 {gsub(/ /, "",$3); print $3; exit}' $inputfile)
max=$(tail -1 $inputfile | awk -F, 'NR==1 {gsub(/ /, "",$3); print $3; exit}')

awk -v min_time_key=$min -v max_time_key=$max '

# --------------------------------------------
# Initialisations
# --------------------------------------------

BEGIN {
   FS = ",";
   diff_time_key = int((max_time_key - min_time_key) / 7);
   equ0 = min_time_key
   equ1 = equ0 + diff_time_key;
   equ2 = equ1 + diff_time_key;
   equ3 = equ2 + diff_time_key;
   equ4 = equ3 + diff_time_key;
   equ5 = equ4 + diff_time_key;
   equ6 = equ5 + diff_time_key;
}

{ gsub(/ *, */, ",") }  # Removes spaces around comma


# --------------------------------------------
# Split file
# --------------------------------------------
{
        if ($3 < equ1) print > "f_forecast_1_1"
   else if ($3 < equ2) print > "f_forecast_1_2"
   else if ($3 < equ3) print > "f_forecast_1_3"
   else if ($3 < equ4) print > "f_forecast_1_4"
   else if ($3 < equ5) print > "f_forecast_1_5"
   else if ($3 < equ6) print > "f_forecast_1_6"
   else                print > "f_forecast_1_7"
}

' $inputfile
Run it with the filename as argument :
Code:
$ split.sh daily_forecast.csv
__________________
Jean-Pierre.
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On