  1. #1
    Join Date
    Feb 2004
    Posts
    3

    Unanswered: Remove text up to first blank line in text file

    I need a command-line one-liner that will remove all the text at the beginning of a text file, up to the first blank line, and save the result.

    In fact, I need to do this to multiple files in a folder.

    I have tried the following (and more); none of them work.

    NOTE: I don't care whether the result ends up with a new suffix (i.e. $file.out) or not.

    EXAMPLES OF MY BAD NEWBIE CRAP:

    cd ./txt/ ; for file in *txt ; do sed '1,/^$/d' $file ; done

    cd ./txt/ ; for file in *txt ; do sed '1,/^$/d' 's/w\ \./tmp.tex' ; cp ./tmp.tex ./$file ; done

    cd ./txt/ ; for file in *txt ; do sed '1,/^$/d' ; cp ./$file ./$file.out ; done

    cd ./txt/ ; for file in *txt ; do sed '/^$/,/^$/D' "$file" ; done

  2. #2
    Join Date
    Jun 2002
    Location
    UK
    Posts
    525
    This should do it

    awk '/^$/ && ! textFound {next}{textFound=1; print}' file > newFile

  3. #3
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    Try this :

    Code:
    cd ./txt/
    for file in *txt
    do
       sed '/./,$!d' $file > $file.tmp
       mv $file.tmp $file
    done
    Jean-Pierre.

  4. #4
    Join Date
    Jun 2002
    Location
    UK
    Posts
    525
    Originally posted by aigles
    Try this :

    Code:
    cd ./txt/
    for file in *txt
    do
       sed '/./,$!d' $file > $file.tmp
       mv $file.tmp $file
    done
    Hello again Jean Pierre. I know that the command above is correct but I have no idea why!

    If I wanted to retain the lines from the first non-blank, to the last non-blank, I would do something like...

    sed '/./,/./!d'

    However, this would mean that blank lines at the end would be stripped out.

    I have found that if I wanted to match to the end of the file, I would use...

    sed '/./,/\$/!d'

    I can't understand why I've had to escape the $ and now your example has totally confused me. Could you help clear it up for me?

    Thanks, Damian

  5. #5
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    • Some explanations

      sed '/./,$!d'

      /./ <= Non-empty line (contains at least one character)
      $ <= Last line of the file
      /./,$ <= Select lines from the first non-empty line to the end of the file
      ! <= Negate the selection
      /./,$! <= Select the empty lines at the top of the file
      d <= Delete the selected lines
      /./,$!d <= Delete the empty lines at the top of the file


      In a regular expression (/ . . . /)
      $ = end of line
      \$ = character $

      In address range (outside of RE)
      $ = last line


    • To remove empty lines at end of file :
      Code:
      sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' $file > $file.tmp
    • To remove empty lines at top and end of file
      Code:
      sed '/./,$!d' $file | sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' > $file.tmp
    • Links
      do-it-with-sed
      Sed FAQ
    Jean-Pierre.
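
For anyone who wants to check the breakdown above, here is a minimal sketch that runs the expressions on small samples. The function name `strip_leading_blank`, the variable names and the sample text are made up for illustration; only the sed expressions come from the explanation.

```shell
# Hypothetical helper: delete empty lines at the top of the input only.
strip_leading_blank() {
    sed '/./,$!d'
}

# Two empty lines on top, one empty line in the middle.
leading=$(printf '\n\nfirst\n\nsecond\n' | strip_leading_blank)
printf '%s\n' "$leading"

# As an address (outside a regex), $ selects the last line.
last=$(printf 'a\nb\nc\n' | sed -n '$p')
printf '%s\n' "$last"

# Inside a regex, \$ matches a literal dollar sign.
dollar=$(printf '%s\n' 'cost' '$5' | sed -n '/\$/p')
printf '%s\n' "$dollar"
```

Only the empty lines before "first" should disappear; the empty line in the middle stays. Note that /./ matches any line with at least one character, so a line holding only spaces counts as non-empty here.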

  6. #6
    Join Date
    Jun 2002
    Location
    UK
    Posts
    525
    In a regular expression (/ . . . /)
    $ = end of line
    \$ = character $

    In address range (outside of RE)
    $ = last line
    Of course! It's an address, not an RE. I still don't understand why I have to escape the $ in the following...

    sed '/./,/\$/!d'

    As you see, this is a regex and like you, I believe that this ought to represent the character $. It doesn't btw.


  7. #7
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    If your file doesn't contain the character '$', the address /\$/ never matches, so the range runs to the end of the file, as with $.
    In that case /./,$ and /./,/\$/ are equivalent.
    Jean-Pierre.
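
A small sketch of that point, assuming nothing beyond a standard sed. The sample text and variable names are invented. It compares the same command on input without a '$' and on input that has one.

```shell
# No "$" anywhere: /\$/ never matches, so the range starting at the first
# non-empty line runs to the end of the input.
no_dollar=$(printf '\n\none\n\ntwo\n' | sed '/./,/\$/!d')
printf '%s\n' "$no_dollar"

# A "$" on a later line closes the range there. The empty line after it is
# outside any range and gets deleted; "two" then opens a new range.
with_dollar=$(printf '%s\n' '' 'one' 'cost $5' '' 'two' | sed '/./,/\$/!d')
printf '%s\n' "$with_dollar"
```

So the two forms only agree as long as the file has no '$' in it; with /./,$ the empty line in the middle of the second sample would have been kept.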

  8. #8
    Join Date
    Jun 2002
    Location
    UK
    Posts
    525
    That seems to be correct. Is this behaviour documented anywhere? My man page has the following to say about addresses...

    Certain commands called addressed commands allow you to specify one line or a
    range of lines to which the command should be applied. The following rules apply
    to addressed commands:

    o A command line without an address selects every line.
    o A command line with one address, expressed in context form, selects each
    line that matches the address.
    o A command line with two addresses separated by commas selects the entire
    range from the first line that matches the first address through the next
    line that matches the second. (If the second address is a number less than
    or equal to the line number first selected, only one line is selected.)
    Thereafter, the process is repeated, looking again for the first address.

  9. #9
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    My man page says the same thing.
    The do-it-with-sed document specifies:
    - commands may take 0, 1 or 2 addresses
    - if no address is given, a command is applied to all pattern spaces
    - if 1 address is given, then it is applied to all pattern spaces
    that match that address
    - if 2 addresses are given, then it is applied to all formed pattern spaces
    between the pattern space that matched the first address, and the next
    pattern space matched by the second address.

    If pattern spaces are all the time single lines, this can be said
    like, if 2 addrs are given, then the command will be executed on
    all lines between first addr and second (inclusive)

    If the second address is an RE, then the search starts only on
    the next line. That's why things like this work:

    /foo/,/foo/<cmd>
    This last point may explain this undocumented behavior.

    Extract from Sed FAQ
    Address ranges are:

    (1) Inclusive. The range "/From here/,/eternity/" matches all the lines containing "From here" up to and including the line containing "eternity". It will not stop on the line just prior to "eternity". (If you don't like this, see section 4.24.)

    (2) Plenary. They always match full lines, not just parts of lines. In other words, a command to change or delete an address range will change or delete whole lines; it won't stop in the middle of a line.

    (3) Multi-linear. Address ranges normally match 2 lines or more. The second address will never match the same line the first address did; therefore a valid address range always spans at least two lines, with these exceptions which match only one line:

    - if the first address matches the last line of the file
    - if using the syntax "/RE/,3" and /RE/ occurs only once in the file at line 3 or below
    - if using HHsed v1.5. See section 3.4.
    (4) Minimalist. In address ranges with /regex/ as <address2>, the range "/foo/,/bar/" will stop at the first "bar" it finds, provided that "bar" occurs on a line below "foo". If the word "bar" occurs on several lines below the word "foo", the range will match all the lines from the first "foo" up to the first "bar". It will not continue hopping ahead to find more "bar"s. In other words, address ranges are not "greedy," like regular expressions.

    (5) Repeating. An address range will try to match more than one block of lines in a file. However, the blocks cannot nest. In addition, a second match will not "take" the last line of the previous block. For example, given the following text,

    start
    stop start
    stop

    the sed command '/start/,/stop/d' will only delete the first two lines. It will not delete all 3 lines.

    (6) Relentless. If the address range finds a "start" match but doesn't find a "stop", it will match every line from "start" to the end of the file. Thus, beware of the following behaviors:

    /RE1/,/RE2/   # If /RE2/ is not found, matches from /RE1/ to the
                  # end-of-file.

    20,/RE/       # If /RE/ is not found, matches from line 20 to the
                  # end-of-file.

    /RE/,30       # If /RE/ occurs any time after line 30, each
                  # occurrence will be matched in sed15+, sedmod, and
                  # GNU sed v3.02+. GNU sed v2.05 and 1.18 will match
                  # from the 2nd occurrence of /RE/ to the end-of-file.

    If these behaviors seem strange, remember that they occur because sed does not look "ahead" in
    Jean-Pierre.
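
Two of the FAQ points quoted above, (5) Repeating and (6) Relentless, are easy to try out. This is only a sketch: the first sample is the three-line text from the FAQ, the second sample and the variable names are made up.

```shell
# (5) Repeating: the range closes on line 2 ("stop start"), and that line
# cannot also open a new range, so the final "stop" is left alone.
repeating=$(printf '%s\n' 'start' 'stop start' 'stop' | sed '/start/,/stop/d')
printf '%s\n' "$repeating"

# (6) Relentless: there is no "stop" line at all, so everything from
# "start" to the end of the input is deleted.
relentless=$(printf '%s\n' 'keep' 'start' 'a' 'b' | sed '/start/,/stop/d')
printf '%s\n' "$relentless"
```

The second case is the same mechanism that makes /./,/\$/ run to the end of a file that has no '$' in it.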

  10. #10
    Join Date
    Jun 2002
    Location
    UK
    Posts
    525
    Thank you.

  11. #11
    Join Date
    Oct 2003
    Posts
    706
    Geek alert! Geek alert!



    And that, gentlebeings, is why I personally don't use sed.

    Oh, it's damn powerful, as you can plainly see, but it's incomprehensible. At least in my experience, when I come back to a sed-line that I myself have written, even one day later, I have forgotten what it means and I spend a long time puzzling it out.

    Witness the fact that one line of 'chicken scratches' was followed by about four explanatory posts describing what it means. To me, that's a maintenance issue.

    I'm not saying that awk is much better at this than sed, but at least you have the opportunity to put some comments into it. And you can also write more than one rule, you can write procedures and so forth, which makes the whole process much easier to understand when you encounter your own code a second time.

    'Chicken scratching' certainly tends to give Unix shell programming a bad reputation.

    ... P.S.: Nothing personal intended here! Nothing at all. Just another point of view.
    ChimneySweep(R): fast, automatic
    table repair at a click of the
    mouse! http://www.sundialservices.com
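
To illustrate the point about comments and separate rules, here is a sketch of the awk one-liner from post #2 spread over two commented rules. The wrapper name `strip_leading_blank_awk` and the variable `seen` are made up; the logic is meant to match the earlier one-liner, not to replace anyone's preferred version.

```shell
# Hypothetical wrapper around a commented awk program.
strip_leading_blank_awk() {
    awk '
        # Rule 1: until some text has been seen, skip empty lines.
        /^$/ && !seen { next }

        # Rule 2: print every other line and remember that text has started.
        { seen = 1; print }
    '
}

result=$(printf '\n\nfirst\n\nsecond\n' | strip_leading_blank_awk)
printf '%s\n' "$result"
```

Like the sed version, this only removes empty lines at the top; empty lines further down pass through rule 2 untouched.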

  12. #12
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    hi Guru

    I agree with you.
    I use sed especially to carry out substitutions, for the rest I use awk.
    Jean-Pierre.

  13. #13
    Join Date
    Feb 2004
    Posts
    3
    Originally posted by aigles
    Try this :

    Code:
    cd ./txt/
    for file in *txt
    do
       sed '/./,$!d' $file > $file.tmp
       mv $file.tmp $file
    done
    Did this.....

    cd ./txt/ ; for file in *txt ; do sed '/./,$!d' $file > $file.tmp ; mv $file.tmp $file ; done

    Got this...

    bash: syntax error near unexpected token `do'


    I don't see anything wrong here... except maybe it needs quotes around the file names etc.. so...

    I did this...


    cd /home/Wolfe/Mail/IRCUNDERGROUND/txt/ ; for file in *txt ; do sed '/./,$!d' "$file" > "$file.tmp" ; mv $file.tmp $file ; done


    No errors, but... the headers remain.

  14. #14
    Join Date
    Jan 2004
    Location
    Bordeaux, France
    Posts
    320
    • Put quotes around all file names
      Code:
      cd /home/Wolfe/Mail/IRCUNDERGROUND/txt/
      for file in *txt
      do
         sed '/./,$!d' "$file" > "$file.tmp"
         mv "$file.tmp" "$file"
      done
    • Execute 'set -x' before the for loop to verify commands
    • If your directory doesn't contain any file matching '*txt', the variable 'file' will get the literal value '*txt'.
      To avoid that, you can use the 'find' command:
      Code:
      cd /home/Wolfe/Mail/IRCUNDERGROUND/txt/
      find . -name '*txt' |
      while IFS= read -r file
      do
         sed '/./,$!d' "$file" > "$file.tmp"
         mv "$file.tmp" "$file"
      done
    Jean-Pierre.
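
Another way to handle the "no matching file" case is to keep the for loop and skip anything that is not a regular file. This is a sketch under assumptions: the function name `strip_all` and passing the directory as an argument are invented for the example.

```shell
# Hypothetical helper: strip leading empty lines from every *txt file
# in the directory given as "$1", in place.
strip_all() {
    for file in "$1"/*txt
    do
        # If the pattern matched nothing, "$file" is the literal pattern
        # and not a real file, so skip it.
        [ -f "$file" ] || continue
        sed '/./,$!d' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    done
}
```

It could be called as `strip_all /home/Wolfe/Mail/IRCUNDERGROUND/txt`. The `&&` means the original is only replaced if sed succeeded.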

  15. #15
    Join Date
    Feb 2004
    Posts
    3
    Originally posted by aigles
    Put quotes around all file names
    Well.. I did that and the verdict is...

    No errors... and no results.
    the files remain as they were as if I had done nothing...

    Some other things that I have tried:

    cd /home/Wolfe/Mail/IRCUNDERGROUND/txt/ ; for file in *txt ; do perl -e '$i=0; while(<>){if($i|^[A-Za-z]:|/^\b*$/){print $_};$i++}' < $file > $file.tmp ; done

    All this does is crash. Symbols within the text file tend to screw things up: "<", for example, will cause an error, and so will pipes ("|"), "++>", etc.

    My thinking was that any line in the text that begins with "<text>:" is most likely a header, and therefore the line could be deleted. This would be better than just deleting up to the first blank line, because if someone forwards an email as text to the system, the headers from the first message would show up in the post. The problem, however, is that not all the lines in the header start with a string and a ":". Some start with a "<" etc., and therefore screw things up and don't get deleted. I pre-deleted most of the symbols causing this, but it is a pointless task.

    The thing that is really ticking me off is that I did have a working sed one-liner that did the job. It deleted everything up to the first blank line and echoed the contents to a new file ($file.out), but somehow it got deleted, even from my backups. Oh well, back to the drawing board, as they say.

    Sorry for the long rant.
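
One possible reason for "no errors and no results": `sed '/./,$!d'` only removes empty lines at the very top of a file, so a file that starts with header text comes out unchanged. A sketch of the kind of one-liner described here (delete everything up to and including the first empty line, write to $file.out) might look like this. The helper names are made up, and it assumes the header block ends with a truly empty line.

```shell
# Hypothetical helper: print file "$1" without its header block, that is,
# without everything up to and including the first empty line.
strip_header() {
    sed '1,/^$/d' "$1"
}

# Hypothetical helper: for every *txt file in directory "$1", write the
# stripped text next to it as NAME.out and leave the original alone.
strip_headers_in() {
    for file in "$1"/*txt
    do
        [ -f "$file" ] || continue   # the pattern matched nothing
        strip_header "$file" > "$file.out"
    done
}
```

It could be run as `strip_headers_in ./txt`. Some caveats: if a file has no empty line at all, the range never closes and the output is empty; if the first line is itself empty, the range runs on to the next empty line; and if the files have DOS line endings, the "empty" line still holds a carriage return, so /^$/ will not match it.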
