
Thread: File decimation

  1. #1
    Join Date
    Feb 2005
    Posts
    3

    Unanswered: File decimation

    Hi,

    I'm trying to come up with a script to remove every nth line from a large file of navigation data.

    I'd like to decimate the file and keep, for example, every 10th, 50th etc. line, but ensure that the first and last occurrence of each line remains.

    e.g.

    line 1 value 1 (keep)
    line 1 value 2
    line 1 value nth (decimate to every 10th line)
    line 1 value last (keep)
    line 2 value 1 etc....

    Any help very gratefully received....

    Steve.

  2. #2
    Join Date
    Oct 2003
    Location
    Germany
    Posts
    138
    Hi All0uette,

    Is "line 1", "line 2", "line nth" always the beginning string of each line?

    Or, better yet, do you have an original excerpt from your file?
    Greetings from Germany
    Peter F.

  3. #3
    Join Date
    Feb 2005
    Posts
    3
    Thanks for getting in touch, Peter.
    The following is from the file (view in enhanced mode, which gets rid of the wrap):
    (19 header lines)
    753182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
    753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
    753200.69 1269442.25 3.000 5350.000 11.4742803 -60.6790412 80-268
    753210.00 1269450.37 4.000 5400.000 11.4743531 -60.6789553 80-268
    753219.19 1269458.37 5.000 5450.000 11.4744247 -60.6788705 80-268
    753228.44 1269466.50 6.000 5500.000 11.4744974 -60.6787852 80-268
    753237.69 1269474.50 7.000 5550.000 11.4745691 -60.6786999 80-268
    etc....


    Using


    BEGIN {
        factor = 1000
        counter = factor - 1
    }
    {
        counter = counter + 1
        if (counter == factor) {
            print $0
            counter = 0
        }
    }


    gives


    762221.00 1277665.00 981.000 54250.000 11.5479097 -60.5957754 80-268
    771489.00 1286063.00 1981.000 104250.000 11.6230698 -60.5101830 80-268
    765400.00 1264953.00 309.000 150350.000 11.4328074 -60.5676320 80-273
    756938.44 1274157.50 1309.000 100350.000 11.5166131 -60.6444451 80-273
    748510.00 1283398.50 2309.000 50350.000 11.6007344 -60.7209966 80-273
    766960.00 1263225.50 97.000 161050.000 11.4170796 -60.5534777 80-273d
    775426.44 1254012.00 1097.000 211050.000 11.3331789 -60.4766731 80-273d
    876296.00 1143907.00 997.000 806550.000 10.3298524 -59.5649188 80-273x
    867833.50 1153130.00 1997.000 756550.000 10.4139376 -59.6411886 80-273x
    859382.44 1162365.50 2997.000 706550.000 10.4981288 -59.7173925 80-273x

    where 'factor' gives the amount of decimation, e.g. every 1000 lines in this case.
    As you can see, the problem is that it does not capture the first and last occurrence of each line name (in the above example, 80-268), just every 1000th sequential line in the file.
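    Incidentally, the counter logic above is equivalent to a one-line modulo test on awk's built-in record number NR; a quick sketch on a toy input (the numbers 1 to 25, with factor 10 instead of 1000):

```shell
# The counter above prints the 1st line, then every 1000th line after it.
# The same selection can be written as a modulo test on NR, awk's
# built-in record counter; shown here with factor 10 on a toy input.
result=$(seq 1 25 | awk 'NR % 10 == 1')
printf '%s\n' "$result"
```

    This prints the values 1, 11 and 21, i.e. the same lines the counter version would select.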

    What I really need is to capture any header lines in the file (not shown here) and then the first and last record sorted on the final column, the line name (I could obviously awk this to be column one if that is easier), followed by every 10th, 100th or 1000th value of a given line.

    Instead of every single value associated with a line name (80-268) I end up with a subset. This should then repeat each time a new line name is reached.

    The best way is probably some arithmetic method: selecting the first occurrence based on line name, then applying a multiplication factor rather than just sequential numbering within the file. But I'm afraid this is beyond my AWK/scripting talents. Over to you, hopefully!

  4. #4
    Join Date
    Oct 2003
    Location
    Germany
    Posts
    138
    Hi All0uette,
    I hope I understand you correctly! Let me try to explain it in my own words (sorry, my English is not so good):

    When the value in the last field changes, the old section ends and a new section begins. At that moment you want to print the first and the last line of the section. If the section is longer than n lines (the factor), you want to print every nth line too. The header lines are printed out as well.

    Is that correct?

    Please show me the 19 header lines. I need them so the script can filter them out.
    Greetings from Germany
    Peter F.

  5. #5
    Join Date
    Oct 2003
    Location
    Germany
    Posts
    138
    Here is a first solution:

    Try the following DAT file for your first experiments (only 3 header lines).

    INPUTFILE="/usr2/medcom2/bin/dbforum15.dat"
    cat $INPUTFILE | awk ' BEGIN {
        factor = 3
        sec = 0
        last = "BEGINNING"
    }
    ############## Main ################
    {
        # headerfile-line identified
        if ($7 == "") {
            x = x + 1
            header[x] = $0
        }
        else {
            if (sec != $7) {
                print last
                print "------ new Section beginning -------"
                print header[1]
                print header[2]
                print header[3]
                print $0
                sec = $7
                i = 0
                x = 0
            }
            else {
                # print every nth line within the section
                i = i + 1
                if (i == factor) {
                    print $0
                    i = 0
                }
                last = $0
            }
        }
    }'


    DAT-File:

    Headerline 1 from 80-258
    Headerline 2 from 80-258
    Headerline 3 from 80-258
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
    253191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
    353182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
    953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
    Headerline 1 from 80-259
    Headerline 2 from 80-259
    Headerline 3 from 80-259
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
    253191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
    353182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
    453191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
    553182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
    953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
    Headerline 1 from 80-25x
    Headerline 2 from 80-25x
    Headerline 3 from 80-25x
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
    253182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
    353191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26x
    453182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
    953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26x
    Headerline 1 from 80-25y
    Headerline 2 from 80-25y
    Headerline 3 from 80-25y
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
    753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
    753182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
    753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
    753182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
    753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
    953182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y

    The output is the following:

    ------ new Section beginning -------
    Headerline 1 from 80-268
    Headerline 2 from 80-268
    Headerline 3 from 80-268
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
    953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
    953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
    ------ new Section beginning -------
    Headerline 1 from 80-269
    Headerline 2 from 80-269
    Headerline 3 from 80-269
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
    453191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
    953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
    ------ new Section beginning -------
    Headerline 1 from 80-26x
    Headerline 2 from 80-26x
    Headerline 3 from 80-26x
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
    453182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
    953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26x
    ------ new Section beginning -------
    Headerline 1 from 80-26y
    Headerline 2 from 80-26y
    Headerline 3 from 80-26y
    153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
    753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
    953182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
    Greetings from Germany
    Peter F.

  6. #6
    Join Date
    Feb 2005
    Posts
    3
    Peter,

    On first pass your script looks excellent. I've got to travel to the Netherlands today but will test fully on Monday and let you know how things go.

    Once again, thanks very much for your help.

    Talk to you soon

    Steve ( All0uette )

  7. #7
    Join Date
    Oct 2003
    Location
    Germany
    Posts
    138

    Update

    Hi Steve,
    there was a little bug in my last posting:
    the last line of the last section was not printed.

    Note my comments in the part "# headerfile-line identified".

    I have changed the output from screen to the file "OUTPUTFILE".

    Try this update. I hope it works!


    INPUTFILE="/usr2/medcom2/bin/dbforum15.dat"
    OUTPUTFILE="/usr2/medcom2/bin/dbforum15.out"
    cat $INPUTFILE | awk -v OUT=$OUTPUTFILE ' BEGIN {
        factor = 3
        sec = 0
        last = "BEGINNING"
    }
    ############## Main ################
    {
        # headerfile-line identified
        # this test only works if the header line has fewer than 7 words;
        # otherwise the test must be changed.
        if ($7 == "") {
            x = x + 1
            header[x] = $0
        }
        else {
            if (sec != $7) {
                print last > OUT
                print "------ new Section beginning -------" > OUT
                print header[1] > OUT
                print header[2] > OUT
                print header[3] > OUT
                print $0 > OUT
                sec = $7
                i = 0
                x = 0
            }
            else {
                # print every nth line within the section
                i = i + 1
                if (i == factor) {
                    print $0 > OUT
                    i = 0
                }
                last = $0
            }
        }
    }
    END {
        # print the last line of the file
        print last > OUT
    }'


    exit
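
    For comparison, the whole per-section logic can be condensed into a single awk program. This is only a sketch, under the assumptions that the line name is always the last field ($NF) and that data lines always have 7 fields; the function name and the sample data are made up for illustration. Unlike the script above, it also avoids printing a duplicate when the nth line happens to be the section's last line:

```shell
# Sketch: keep the first, every Nth, and the last record of each section,
# where the section key is the last field ($NF). Lines with fewer than
# 7 fields are treated as headers and passed through unchanged.
decimate() {
    awk -v factor="$1" '
        NF < 7 { print; next }            # header line: pass through
        $NF != sec {                      # section key changed
            if (last != "") print last    # close the previous section
            sec = $NF; i = 0; last = ""
            print                         # first record of new section
            next
        }
        {
            if (++i == factor) { print; i = 0; last = "" }
            else last = $0                # remember candidate last line
        }
        END { if (last != "") print last }   # last record of final section
    '
}

# Made-up sample: one header line, two sections A and B, factor 2
result=$(decimate 2 <<'EOF'
HDR sample header
1 0 0 0 0 0 A
2 0 0 0 0 0 A
3 0 0 0 0 0 A
4 0 0 0 0 0 A
5 0 0 0 0 0 B
6 0 0 0 0 0 B
EOF
)
printf '%s\n' "$result"
```

    Here line 3 is the 2nd record after A's first, line 4 is A's last, and line 6 is B's last; section B is too short to trigger the factor, so only its first and last records appear.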
    Last edited by fla5do; 02-12-05 at 16:26.
    Greetings from Germany
    Peter F.
