  #1 (permalink)  
Old 02-07-05, 11:26
All0uette All0uette is offline
Registered User
 
Join Date: Feb 2005
Posts: 3
File decimation

Hi,

I'm trying to come up with a script to thin out a large file of navigation data by removing all but every nth line.

For example, I'd like to decimate the file so that only every 10th, 50th, etc. line is left, while making sure that the first and last occurrence of each line remain.

e.g.

line 1 value 1 (keep)
line 1 value 2
line 1 value nth (decimate to every 10th line)
line 1 value last (keep)
line 2 value 1 etc....

Any help very gratefully received....

Steve.
  #2 (permalink)  
Old 02-07-05, 15:18
fla5do fla5do is offline
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 138
Hi All0uette,

Are "line 1", "line 2", "line nth" always the string at the beginning of each line?

Or better:

Do you have an original excerpt from your file?
__________________
Greetings from Germany
Peter F.
  #3 (permalink)  
Old 02-08-05, 05:09
All0uette All0uette is offline
Registered User
 
Join Date: Feb 2005
Posts: 3
Thanks for getting in touch, Peter.
The file looks like the following (view it in enhanced mode, which gets rid of the line wrap):
(19 header lines)
753182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
753200.69 1269442.25 3.000 5350.000 11.4742803 -60.6790412 80-268
753210.00 1269450.37 4.000 5400.000 11.4743531 -60.6789553 80-268
753219.19 1269458.37 5.000 5450.000 11.4744247 -60.6788705 80-268
753228.44 1269466.50 6.000 5500.000 11.4744974 -60.6787852 80-268
753237.69 1269474.50 7.000 5550.000 11.4745691 -60.6786999 80-268
etc....


Using


BEGIN {
    factor = 1000            # keep one line out of every 1000
    counter = factor - 1     # so that the very first line is printed
}
{
    counter = counter + 1
    if (counter == factor) {
        print $0
        counter = 0
    }
}


gives


762221.00 1277665.00 981.000 54250.000 11.5479097 -60.5957754 80-268
771489.00 1286063.00 1981.000 104250.000 11.6230698 -60.5101830 80-268
765400.00 1264953.00 309.000 150350.000 11.4328074 -60.5676320 80-273
756938.44 1274157.50 1309.000 100350.000 11.5166131 -60.6444451 80-273
748510.00 1283398.50 2309.000 50350.000 11.6007344 -60.7209966 80-273
766960.00 1263225.50 97.000 161050.000 11.4170796 -60.5534777 80-273d
775426.44 1254012.00 1097.000 211050.000 11.3331789 -60.4766731 80-273d
876296.00 1143907.00 997.000 806550.000 10.3298524 -59.5649188 80-273x
867833.50 1153130.00 1997.000 756550.000 10.4139376 -59.6411886 80-273x
859382.44 1162365.50 2997.000 706550.000 10.4981288 -59.7173925 80-273x

where 'factor' gives the amount of decimation, e.g. every 1000th line in this case.
As you can see, the problem is that it does not capture the 1st and last occurrence of each line name (80-268 in the above example), just every 1000th sequential line in the file.
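
For reference, the counter logic in the script above can also be expressed with awk's built-in record number NR. This is only a minimal sketch, assuming the same behaviour as that script (print line 1, then every factor-th line after it); the function name every_nth is made up for illustration.

```shell
# Hypothetical helper: print line 1, then every factor-th line after it
# (lines 1, 1+factor, 1+2*factor, ...), like the counter script above.
every_nth() {
    awk -v factor="$1" '(NR - 1) % factor == 0'
}

# Usage: keep every 3rd line of a 7-line stream (prints 1, 4 and 7)
printf '%s\n' 1 2 3 4 5 6 7 | every_nth 3
```

Like the original, this still counts lines across the whole file and ignores the line name in the last column.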

What I really need is to capture any header lines in the file (not shown here), and then the 1st and last record of each line name, which is the final column (I could obviously awk this to be column one if that is easier), followed by every 10th, 100th or 1000th value of a given line.

Instead of every single value associated with a line name (80-268), I would end up with a subset. This should then repeat each time a new line name comes up.

The best way is probably some arithmetic method: selecting the 1st occurrence based on line name and then applying the factor as a multiple, rather than just numbering sequentially within the file. But I'm afraid this is beyond my AWK/scripting talents. Over to you, hopefully!
  #4 (permalink)  
Old 02-08-05, 13:27
fla5do fla5do is offline
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 138
Hi All0uette,
I hope I understand you correctly! Let me try to explain it in my own words.
Sorry, my English is not so good.

When the value in the last field changes, the old section ends and a new section begins. At that moment you want to print the first and the last line of the section. If the section is longer than n lines (the factor), you want to print every nth line too. The header lines are printed as well.

Is that correct?

Please show me the 19 header lines. I need them so that the script can filter them out.
  #5 (permalink)  
Old 02-08-05, 16:32
fla5do fla5do is offline
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 138
Here is a first solution:

Try the following DAT file for first experiments (only 3 header lines).

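INPUTFILE="/usr2/medcom2/bin/dbforum15.dat"
awk ' BEGIN {
    factor = 3            # print every 3rd line inside a section
    sec = 0               # line name of the current section
    last = "BEGINNING"    # placeholder until a data line has been seen
}
############## Main ################
{
    # headerfile-line identified
    if ($7 == "") {
        x = x + 1
        header[x] = $0
    }
    else {
        if (sec != $7) {
            # new section: close the old one with its last line
            # (the placeholder is skipped for the very first section)
            if (last != "BEGINNING")
                print last
            print "------ new Section beginning -------"
            print header[1]
            print header[2]
            print header[3]
            print $0
            sec = $7
            i = 0
            x = 0
        }
        else {
            # print every nth line within the section
            i = i + 1
            if (i == factor) {
                print $0
                i = 0
            }
            last = $0
        }
    }
}' "$INPUTFILE"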


DAT-File:

Headerline 1 from 80-258
Headerline 2 from 80-258
Headerline 3 from 80-258
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
253191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
353182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
Headerline 1 from 80-259
Headerline 2 from 80-259
Headerline 3 from 80-259
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
253191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
353182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
453191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
553182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
Headerline 1 from 80-25x
Headerline 2 from 80-25x
Headerline 3 from 80-25x
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
253182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
353191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26x
453182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26x
Headerline 1 from 80-25y
Headerline 2 from 80-25y
Headerline 3 from 80-25y
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
753182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
753182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
953182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y

The output is as follows:

------ new Section beginning -------
Headerline 1 from 80-268
Headerline 2 from 80-268
Headerline 3 from 80-268
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-268
953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-268
------ new Section beginning -------
Headerline 1 from 80-269
Headerline 2 from 80-269
Headerline 3 from 80-269
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-269
453191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-269
------ new Section beginning -------
Headerline 1 from 80-26x
Headerline 2 from 80-26x
Headerline 3 from 80-26x
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
453182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26x
953191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26x
------ new Section beginning -------
Headerline 1 from 80-26y
Headerline 2 from 80-26y
Headerline 3 from 80-26y
153182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
753191.50 1269434.25 2.000 5300.000 11.4742087 -60.6791260 80-26y
953182.25 1269426.25 1.000 5250.000 11.4741371 -60.6792113 80-26y
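
In the output above, the line 953191.50 ... 80-268 appears twice because it is both the nth line and the last line of its section. One possible way around that, sketched here under some assumptions, is to remember whether the most recent line has already been printed. The sketch assumes the input contains only data rows (no header lines) and that the line name is the last field; the function name decimate_sections and the variable printed are made up for illustration and are not part of the script above.

```shell
# Hypothetical helper: per line name (last field), print the first line,
# every factor-th line after it, and the last line, without duplicates.
# Assumes data rows only and no blank lines.
decimate_sections() {
    awk -v factor="$1" '
    ($NF "") != (sec "") {          # last field changed: a new section starts
        if (NR > 1 && !printed)     # close the previous section
            print last
        print                       # first line of the new section
        sec = $NF; i = 0; printed = 1; last = $0
        next
    }
    {
        i++
        printed = 0
        if (i == factor) { print; i = 0; printed = 1 }
        last = $0                   # remember the most recent line
    }
    END { if (NR > 0 && !printed) print last }
    '
}

# Usage: two sections (A with 4 rows, B with 6 rows), factor 3.
# Prints: 1 A, 4 A, 1 B, 4 B, 6 B
printf '%s\n' '1 A' '2 A' '3 A' '4 A' \
              '1 B' '2 B' '3 B' '4 B' '5 B' '6 B' | decimate_sections 3
```

The counting matches the script above: the first line of a section is always printed, and the counter starts with the line after it.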
  #6 (permalink)  
Old 02-09-05, 04:53
All0uette All0uette is offline
Registered User
 
Join Date: Feb 2005
Posts: 3
Peter,

At first pass your script looks excellent. I've got to travel to the Netherlands today but will test it fully on Monday and let you know how things go.

Once again, thanks very much for your help.

Talk to you soon

Steve ( All0uette )
  #7 (permalink)  
Old 02-12-05, 15:13
fla5do fla5do is offline
Registered User
 
Join Date: Oct 2003
Location: Germany
Posts: 138
Update

Hi Steve,
there was a little bug in my last posting.
The last line of the last section was not printed.

Note my comments in the part "# headerfile-line identified".

I have changed the output from screen to the file "OUTPUTFILE".

Try this update. I hope it works!


INPUTFILE="/usr2/medcom2/bin/dbforum15.dat"
OUTPUTFILE="/usr2/medcom2/bin/dbforum15.out"
awk -v OUT="$OUTPUTFILE" ' BEGIN {
    factor = 3            # print every 3rd line inside a section
    sec = 0               # line name of the current section
    last = "BEGINNING"    # placeholder until a data line has been seen
}
############## Main ################
{
    # headerfile-line identified
    # this test only works if the header line has fewer than 7 words;
    # otherwise the test must be changed.
    if ($7 == "") {
        x = x + 1
        header[x] = $0
    }
    else {
        if (sec != $7) {
            # new section: close the old one with its last line
            # (the placeholder is skipped for the very first section)
            if (last != "BEGINNING")
                print last > OUT
            print "------ new Section beginning -------" > OUT
            print header[1] > OUT
            print header[2] > OUT
            print header[3] > OUT
            print $0 > OUT
            sec = $7
            i = 0
            x = 0
        }
        else {
            # print every nth line within the section
            i = i + 1
            if (i == factor) {
                print $0 > OUT
                i = 0
            }
            last = $0
        }
    }
} END {
    # print the last line of the file
    if (last != "BEGINNING")
        print last > OUT
}' "$INPUTFILE"


exit
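
The header test in the script relies on header lines having fewer than 7 words. If the header lines all sit at the top of the file, as with the 19 header lines mentioned earlier in the thread, they could instead be recognised by line number. A minimal sketch of that idea only: it assumes a single header block at the top, uses a plain every-nth rule for the data rows rather than the section logic above, and the name keep_headers and its two arguments are hypothetical.

```shell
# Hypothetical helper: copy the first nhead lines unchanged, then print
# the 1st data row and every factor-th data row after it.
# Usage: keep_headers NHEAD FACTOR < file
keep_headers() {
    awk -v nhead="$1" -v factor="$2" '
    NR <= nhead { print; next }          # header block: identified by position
    (NR - nhead - 1) % factor == 0       # data rows: 1st, 1+factor, ...
    '
}

# Usage: 2 header lines (one of them longer than 7 words), 7 data rows,
# factor 3. Prints both headers, then d1, d4 and d7.
printf '%s\n' 'Header one with more than seven words in it' 'Header two' \
              d1 d2 d3 d4 d5 d6 d7 | keep_headers 2 3
```

With the real file this might be called as keep_headers 19 1000, but that has not been tested against the original data.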

Last edited by fla5do; 02-12-05 at 15:26.