Results 1 to 8 of 8
  1. #1
    Join Date
    Mar 2004
    Posts
    3

    Question Unanswered: Regexp: Substitute EOL (\n) characters inside the quotations

    Hello!

    I have a problem: I have CSV-file that have to be parsed into array of hashes. There are serveral different ways to parse CSV, but I have one trouble that doesn't allow me to use them.

    My file can contain string parameters, they encloses by quotation marks, e.g. string can be like this:

    1234,"abcd def",567,"xxxxx",999

    But there can exists EOL (\n) characters inside text field, for example:

    1234,"abcd
    def
    ghi",567,"xxxxx",999

    So, I need to substitute EOL characters inside the text fields by some other character ('|' for example) and then I can parse my file as ususal CSV.

    How can i do it?

    I tried this regular expression:
    $str =~ s/\"(.*?)\n(.*?)\"/"$1*$2"/sg;

    but it substitute only one occurence of \n. How can I change it to parse all the EOLs inside the strings but not at the end of the CSV-fields (outside the quotations)?

    Thanks for answers!

  2. #2
    Join Date
    Nov 2002
    Posts
    207
    One option I can think of is, use of chop. But this will only work if the total length of the record is constant. If it is, then you can keep on chop(ping) until you get the required length and then move to the next line.

    HTH.

  3. #3
    Join Date
    Mar 2004
    Posts
    3
    No, all string sizes are different. chomp is not suitable here.

    Any more ideas? I still didn't implement this by regexp, now i'm using small cycle with "for" It is not good

  4. #4
    Join Date
    Nov 2002
    Posts
    207
    How about number of fields in a record with comma separator???

  5. #5
    Join Date
    Mar 2004
    Posts
    3
    It is constant.

    But the problem here is the possible existance of commas inside the quotes -- so, we need to ignore all commas between 1th and 2nd, 3rd and 4th, 5th and 6th and so on (odd, then this odd +1) quotes -- these commas are not separators, they're only part of the text fields.

  6. #6
    Join Date
    Jun 2004
    Location
    Nowhere Near You
    Posts
    89
    Any reason for not using the module Tie::CSV_File?

  7. #7
    Join Date
    Jun 2004
    Location
    Nowhere Near You
    Posts
    89
    Quote Originally Posted by mkalsi
    How about number of fields in a record with comma separator???
    Quote Originally Posted by davemay
    It is constant.

    But the problem here is the possible existance of commas inside the quotes -- so, we need to ignore all commas between 1th and 2nd, 3rd and 4th, 5th and 6th and so on (odd, then this odd +1) quotes -- these commas are not separators, they're only part of the text fields.
    But you know the number of fields, so

    Code:
    #!\user\bin\perl -w
    my($s_NumOfFields)= 3 # Set this to the correct number of fields
    
    open(CSV,'<CSV_TEST.TXT') || die "$!";
    my(@fields);
    my($text);
    while ($text=<CSV>) {
      while (scalar(@fields=ParseCSV($text)) < $s_NumOfFields) {
        $text.=<CSV>;
         };
      # do your thing to @fields here - for example
      print join(' | ',@fields)."\n\n";
       };
    close(CSV);
    
      sub ParseCSV {
        my($text)=shift;
        my(@fields)=();
    
        while ($text =~ m{([^",]*)[,\n]|"((?:[^"]|"")*)"[,\n]}gs) {
          unless (substr($`,-1,1) eq '"') {
            if (defined $1) {
              $field=$1;
               } else {
              ($field=$2)=~s/""/"/g;
               };
            push(@fields,$field);
             };
           };
        return @fields;
         };
    I ran this for $s_NumOfFields=3 on
    Code:
    1,2,3
    "this
     is
     a
     test","so is this, ","along
     with
     this"
    "this
     is
     a
     test","so is this, ","along
     with
     this"
    to get
    Code:
    1 | 2 | 3
    
    this
     is
     a
     test | so is this,  | along
     with
     this
    
    this
     is
     a
     test | so is this,  | along
     with
     this
    Last edited by senza_nome; 07-15-04 at 22:51.

  8. #8
    Join Date
    Jun 2004
    Location
    Nowhere Near You
    Posts
    89
    And a version when the number of fields is not known:
    NB --- here the question of how do you interpret ",\n" comes into play. Perhaps it indicates that more is to come? Unfortunately, the empty field is legitimate and outside of a quoted string the "\n" is a record terminator so we interpret the ",\n" as a empty field terminating the record.
    Code:
    #!\user\bin\perl -w
    
    use Inline::Files;
    
    # Version 2 --- number of fields is not known
    while (@a_Field=GetCSV(@a_Field)) {
      print join(' | ',@a_Field)."\n\n";
       };
    
      sub GetCSV{
        my($s_Line,$s_Text);
        while ($s_Line=<CSV2>) {
          #print "'$s_Line'\n";
          my(@a_Field);
          $s_Text.=$s_Line;
          while ($s_Text =~ m{([^",]*)([,\n])|"((?:[^"]|"")*)"([,\n])}gs) {
            unless (substr($`,-1,1) eq '"') {
              if (defined $1) {
                push(@a_Field,$1);
                if ($2 eq "\n") {
                  return @a_Field;
                   };
                 } else {
                my($s_Field);
                ($s_Field=$3)=~s/""/"/g;
                push(@a_Field,$s_Field);
                if ($4 eq "\n") {
                  return @a_Field;
                   };
                 };
               };
             };
           };
        return ();
         };
    
    __CSV2__
    1,2,3,
    4
    "this
     is
     a
     test","so is this, ","along
     with
     this"
    "this
     is
     a
     test","so is this, ","along
     with
     this",
    "too"
    and got

    Code:
    1 | 2 | 3 |
    
    4
    
    this
     is
     a
     test | so is this,  | along
     with
     this
    
    this
     is
     a
     test | so is this,  | along
     with
     this |
    
    too
    Last edited by senza_nome; 07-16-04 at 11:13.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •