Results 1 to 6 of 6
  1. #1
    Join Date
    Jul 2007
    Posts
    2

    Unanswered: Script for file split is taking long time to process

    Hi,

    I'm new to this forum and would like to take help with one of the problem I m facing.

    I have a huge .CSV file around 130 million records and need to split that file into seven parts based on the conditions on the one of the column in a file.

    Basically, I'm trying to convert the SAS code to Unix shell script.

    Data in the input file (daily_forecast.csv):
    ===============

    Sample records in the file

    Rest_key, Mia_key, time_key, fore_units,base_units
    3 ,100000000 ,17344 , 53.17, 53.00
    3 ,100000000 ,17345 , 56.16, 51.00
    3 ,100000000 ,17346 , 70.73, 76.00
    3 ,100000000 ,17347 , 88.51, 78.00
    3 ,100000000 ,17348 , 71.77, 64.00
    3 ,100000000 ,17349 , 49.08, 51.00
    3 ,100000000 ,17350 , 55.61, 55.00
    3 ,100000000 ,17351 , 37.73, 35.00
    3 ,100000000 ,17352 , 61.18, 49.00
    3 ,100000000 ,17353 , 67.56, 66.12

    SAS Code
    =======

    min_time=min_time_key ( Minimum value in the time key column )
    max_time=max_time_key; ( Maximum value in the time key column)
    diff=floor((max_time-min_time)/7);
    if min_time <= time_key < min_time+1*diff then output df1;
    (df1 is the first output file )
    if min_time+1*diff <= time_key < min_time+2*diff then output df2;
    if min_time+2*diff <= time_key < min_time+3*diff then output df3;
    if min_time+3*diff <= time_key < min_time+4*diff then output df4;
    if min_time+4*diff <= time_key < min_time+5*diff then output df5;
    if min_time+5*diff <= time_key < min_time+6*diff then output df6;
    if min_time+6*diff <= time_key <= max_time then output df7;

    ================================================== ========

    I wrote a Unix Shell script as below and it is taking 1 hour 20 min for 130 million records to process. The record layout is same as I mentioned above.

    Unix Script

    #\bin\sh
    infile=daily_forecast.csv ( Input file)

    cat /usr/home/mc_data/transforms/$infile | cut -f3 -d "," > time_keys.txt ( extracting the 3rd column from the file)

    cat time_keys.txt | awk 'NR==1 {m=$1;p=$1} $1 >=m {m =$1} $1 <=p {p = $1} END {print p,m}' > min_max.txt ( Getting the minimum value and max value in the 3rd column)

    sed -e 's/ /,/g' min_max.txt > min_max1.txt
    min_time_key=`cat min_max1.txt | cut -c1-5` ( Getting the min value from the 3rd column in the file)

    max_time_key=`cat min_max1.txt | cut -c7-12` ( Getting the max value from the 3rd column in the file )

    echo "$min_time_key , $max_time_key"
    sub=`expr $max_time_key - $min_time_key`;
    diff=`expr $sub / 7`;

    Temp variables for storing the calculated values
    ==================================

    equ=`expr $min_time_key + 1 \* $diff`;
    equ1=`expr $min_time_key + 2 \* $diff`;
    equ2=`expr $min_time_key + 3 \* $diff`;
    equ3=`expr $min_time_key + 4 \* $diff`;
    equ4=`expr $min_time_key + 5 \* $diff`;
    equ5=`expr $min_time_key + 6 \* $diff`;
    echo "$sub , $equ"

    for timekey in `cat time_keys.txt` ( time_keys.txt contains all the values in
    timekey column)
    do
    if [ $timekey -ge $min_time_key ] && [ $timekey -lt $equ ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_1
    echo "1"
    elif [ $timekey -ge $equ ] && [ $timekey -lt $equ1 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_2
    elif [ $timekey -ge $equ1 ] && [ $timekey -lt $equ2 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_3
    elif [ $timekey -ge $equ2 ] && [ $timekey -lt $equ3 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_4
    elif [ $timekey -ge $equ3 ] && [ $timekey -lt $equ4 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_5
    elif [ $timekey -ge $equ4 ] && [ $timekey -lt $equ5 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_6
    elif [ $timekey -ge $equ5 ] && [ $timekey -le $max_time_key ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_7
    else
    echo "time key is not with in the specified range"
    fi
    done
    exit 0;

    Please suggest if there is any best way of doing this.

    Thanks,
    Satish

  2. #2
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    Try and adapt the following awk program:
    Code:
    awk -F, '
       { gsub(/ /,"", $3) }
       NR==1 { min=max=$3 ; next}
       NR==FNR { if ($3>max) max=$3; if($3<min) min=$3; next }
       NR!=FNR && FNR==1 { diff=(max-min)/7 ; if (diff==0) diff=1}
       { num=int(($3-min)/diff)+1; out="df" num; print $3>out }
    ' infile infile
    Jean-Pierre.

  3. #3
    Join Date
    Aug 2006
    Location
    The Netherlands
    Posts
    248
    Hi,

    I guess the
    Code:
    cat time_keys.txt | awk 'NR==1 {m=$1;p=$1} $1 >=m {m =$1} $1 <=p {p = $1} END {print p,m}'
    alone takes some time. Aren't the records of the basefile written chronologicaly? If they are you shouldn't make the fuzz of extracting the third column of the basefile and read the resultfile with above statement but use the 'head -1l' command on the basefile for the first line and the 'tail -1l' command for the last to get the results.
    With files of these sizes I've no experience but I know awk isn't really fast with big files. Maybe it's worth writing it in C?

    Regards

  4. #4
    Join Date
    Jul 2007
    Posts
    2

    Changed the code but still taking too long during file split

    I m using head and tail to get the max and min values from the file. But the logic inside for loop is taking lot of time. Please advice.

    #/bin/ksh
    cat forecasts1.csv | sed '1d' > forecastd.csv
    #cut -f2 -d "," forecastd.csv > forecast.csv
    min_val=`head -1 forecastd.csv | cut -f2 -d ","`
    max_val=`sed '$!d' forecastd.csv | cut -f2 -d ","`
    echo $min_val
    echo $max_val

    infile=forecasts1.csv
    sub=`expr $max_val - $min_val`;
    diff=`expr $sub / 7`;

    equ=`expr $min_val + 1 \* $diff`;
    equ1=`expr $min_val + 2 \* $diff`;
    equ2=`expr $min_val + 3 \* $diff`;
    equ3=`expr $min_val + 4 \* $diff`;
    equ4=`expr $min_val + 5 \* $diff`;
    equ5=`expr $min_val + 6 \* $diff`;
    echo "$sub , $equ"

    cut -f2 -d "," forecastd.csv > forecastd1.csv

    cat forecastd1.csv | head -10

    ************************************************** *********THIS IS TAKING LOT OF TIME - AROUND 2 HRS
    ************************************************** ********
    Input file size is 1542995968 bytes
    ************************************************** ********
    for timekey in `cat forecastd1.csv`
    do
    if [ $timekey -ge $min_val ] && [ $timekey -lt $equ ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_1
    elif [ $timekey -ge $equ ] && [ $timekey -lt $equ1 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_2
    elif [ $timekey -ge $equ1 ] && [ $timekey -lt $equ2 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_3
    elif [ $timekey -ge $equ2 ] && [ $timekey -lt $equ3 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_4
    elif [ $timekey -ge $equ3 ] && [ $timekey -lt $equ4 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_5
    elif [ $timekey -ge $equ4 ] && [ $timekey -lt $equ5 ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_6
    elif [ $timekey -ge $equ5 ] && [ $timekey -le $max_val ];
    then
    grep -i "$timekey" $infile >> f_forecast_1_7
    else
    echo "time key is not with in the specified range"
    fi
    done
    exit 0;

  5. #5
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    You can do all the work with awk
    Create the awk program file split.awk :
    Code:
    #!/usr/bin/awk -f
    #
    
    # --------------------------------------------
    # Initialisations
    # --------------------------------------------
    
    BEGIN {
       FS = ",";
       if (ARGC != 2) {
          printf("Usage: %s csvfile\n",ARGV[0]);
          exit 2
       }
       ARGV[ARGC++] = ARGV[1];   # Input file is read in two pass
    }
       
    { gsub(/ *, */, ",") }  # Removes spaces aroud comma
    
    # --------------------------------------------
    # Fist Pass - Get Min/Max values for time key
    # --------------------------------------------
    
    NR==1 { next }  # Skip Header Line
    NR==FNR  {
       if (min_time_key == "") {
          min_time_key = $3;
          max_time_key = $3;
       } else {
          if (min_time_key > $3) min_time_key = $3;
          if (max_time_key < $3) max_time_key = $3;
       }
       next;
    }
    
    # --------------------------------------------
    # Second Pass - Split file
    # --------------------------------------------
    
    FNR==1 {
       diff_time_key = int((max_time_key - min_time_key) / 7);
       equ0 = min_time_key
       equ1 = equ0 + diff_time_key;
       equ2 = equ1 + diff_time_key;
       equ3 = equ2 + diff_time_key;
       equ4 = equ3 + diff_time_key;
       equ5 = equ4 + diff_time_key;
       equ6 = equ5 + diff_time_key;
       next;        # Skip Header Line
    }
    
    {
            if ($3 < equ1) print > "f_forecast_1_1"
       else if ($3 < equ2) print > "f_forecast_1_2"
       else if ($3 < equ3) print > "f_forecast_1_3"
       else if ($3 < equ4) print > "f_forecast_1_4"
       else if ($3 < equ5) print > "f_forecast_1_5"
       else if ($3 < equ6) print > "f_forecast_1_6"
       else                print > "f_forecast_1_7"
    }
    And execute it :
    Code:
    $ awk -f split.awk daily_forecast.csv
    
    or
    
    $ chmod +rx split.awk
    $ split.awk daily_forecast.csv
    Jean-Pierre.

  6. #6
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    My previous solution assume that the file contains an header line isn't sorted on the time_key field, that seems to be false.

    So this new script split.sh will run faster :

    Code:
    inputfile=$1
    
    min=$(awk -F, 'NR==1 {gsub(/ /, "",$3); print $3; exit}' $inputfile)
    max=$(tail -1 $inputfile | awk -F, 'NR==1 {gsub(/ /, "",$3); print $3; exit}')
    
    awk -v min_time_key=$min -v max_time_key=$max '
    
    # --------------------------------------------
    # Initialisations
    # --------------------------------------------
    
    BEGIN {
       FS = ",";
       diff_time_key = int((max_time_key - min_time_key) / 7);
       equ0 = min_time_key
       equ1 = equ0 + diff_time_key;
       equ2 = equ1 + diff_time_key;
       equ3 = equ2 + diff_time_key;
       equ4 = equ3 + diff_time_key;
       equ5 = equ4 + diff_time_key;
       equ6 = equ5 + diff_time_key;
    }
    
    { gsub(/ *, */, ",") }  # Removes spaces around comma
    
    
    # --------------------------------------------
    # Split file
    # --------------------------------------------
    {
            if ($3 < equ1) print > "f_forecast_1_1"
       else if ($3 < equ2) print > "f_forecast_1_2"
       else if ($3 < equ3) print > "f_forecast_1_3"
       else if ($3 < equ4) print > "f_forecast_1_4"
       else if ($3 < equ5) print > "f_forecast_1_5"
       else if ($3 < equ6) print > "f_forecast_1_6"
       else                print > "f_forecast_1_7"
    }
    
    ' $inputfile
    Run it with the filename as argument :
    Code:
    $ split.sh daily_forecast.csv
    Jean-Pierre.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •