  1. #1
    Join Date
    Apr 2012
    Posts
    2

    Unanswered: SED With Regex to extract Email Address

    Hi Folks,
    In my program, I have a variable which consists of multiple lines. I need to use each line as an input. My intention is to extract the email address of the user from each line and use it for further processing.

    The email address could be anywhere in the whole line. But there will be only one and I need to extract it. The most complex possible format of the email ID is:
    first-name.last-name@xyz.com

    In other words, first and last names are separated by a period (.) and the first and last names may have a hyphen (-). The domain name (@xyz.com) is fixed and only has letters, no numbers.

    We have a Solaris OS and I am using the Korn shell. I read some examples of the SED command and have made a few attempts to use it. But every time, no matter what regex I use, I get the entire input as the output. I must be doing something very basic wrong. Could you please suggest what it is?

    $ echo '92' | sed '/[0-9]+/p'
    Output: 92

    $ echo 'email92' | sed '/[0-9]+/p'
    Output: email92

    $ echo "abc.xyz@comp.com" | sed '/\([A-Za-z0-9]+\)\(-*\)\([A-Za-z0-9]*\)\(\.\)\([A-Za-z0-9]+\)\(-*\)\([A-Za-z0-9]*\)@comp.com/p'
    Output: abc.xyz@comp.com

    $ echo "Email address is abc.xyz@comp.com" | sed '/\([A-Za-z0-9]+\)\(-*\)\([A-Za-z0-9]*\)\(\.\)\([A-Za-z0-9]+\)\(-*\)\([A-Za-z0-9]*\)@comp.com/p'
    Output: Email address is abc.xyz@comp.com

    Note that below is the regex that I arrived at to extract the email address.
    /\([A-Za-z0-9]+\)\(-*\)\([A-Za-z0-9]*\)\(\.\)\([A-Za-z0-9]+\)\(-*\)\([A-Za-z0-9]*\)@comp.com

    Any help is greatly appreciated.

  2. #2
    Join Date
    Sep 2009
    Location
    Ontario
    Posts
    1,057
    Provided Answers: 1
    Code:
    while read line 
    do
    ./findemail $line
    done
    Code:
    #findemail 
    i=1
    n=$*
    IFS="@"
    while [ $i -le $n ]
    do 
       echo $1 |read email domain
       if [ "a$domain" != "a" ]
       then
        echo email address is $email@$domain
        exit
      else
        shift 
      fi
    done
    Untested, but I think it should work. It will fail if there is more than one @ character on a single line.

  3. #3
    Join Date
    Apr 2012
    Posts
    2
    @kitaman,
    Thank you for your suggestion!

    But I am looking to implement it using Regex so that it is simple and a one-liner as my existing code is already long. Plus I also want to learn SED and Regex as they are quite powerful and interesting.

    First of all, I want to know what I could be doing wrong such that my input gets printed directly without any alteration. I wonder if the Regex is getting applied or not.

    For example, in the sample below, I am expecting 92 to be printed, as I am looking for numeric values only.

    $ echo 'email92' | sed '/[0-9]+/p'
    Output: email92

  4. #4
    Join Date
    Sep 2009
    Location
    Ontario
    Posts
    1,057
    Provided Answers: 1
    Well, the first version doesn't work.
    Code:
    while read line  
    do               
    ./findemail $line
    done
    Code:
    #findemail
    i=1
    n=$#
    # Examine each word of the line in turn.
    while [ $i -le $n ]
    do
        # A word containing @ is taken to be the email address.
        echo "$1" | grep "@" >/dev/null
        if [ $? -eq 0 ]
        then
            echo email address is $1
            exit
        elif [ $i -ne $n ]
        then
            shift
        fi
        i=`expr $i + 1`
    done
    #
    This does, provided the email address is space delimited from the other text on the line.
    The biggest issue is the fact that the @ character is interpreted by the shell, and either has to be escaped with a \ or enclosed in quotes, or be data in a read statement.
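
    For comparison, the same word-by-word scan can be sketched without a separate helper script. This is only an illustration, not a replacement for the script above: find_email is a made-up name, and it assumes the default IFS and that the line contains no shell wildcard characters (the unquoted expansion would otherwise glob them).

```shell
# find_email (hypothetical helper): print the first whitespace-delimited
# word of its argument that contains an @ character.
find_email() {
    # Unquoted $1 is deliberately split into words on whitespace.
    for word in $1
    do
        case $word in
            *@*) echo "$word"; return 0 ;;
        esac
    done
    return 1
}

find_email 'Email address is abc.xyz@comp.com'   # prints: abc.xyz@comp.com
```

    Like the script above, it relies on the address being separated from the surrounding text by spaces.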

  5. #5
    Join Date
    Mar 2012
    Posts
    12
    Quote: Originally posted by ragz_82
    I want to know what I could be doing wrong such that my input gets printed directly without any alteration. I wonder if the Regex is getting applied or not.

    For example, in the sample below, I am expecting 92 to be printed, as I am looking for numeric values only.

    $ echo 'email92' | sed '/[0-9]+/p'
    Output: email92
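
    The regex is not matching the line. In sed's basic regular expressions a bare + is a literal plus sign, not a repetition operator, so '[0-9]+' looks for a digit followed by an actual "+". If the line had matched, you would be seeing 2 lines of output, not one. Try this: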
    Code:
     : now you should see 2 lines
    echo email92 | sed '/9/p'
    
    : ditto
    echo email92 | sed '/[0-9]/p; '
    echo email92 | sed '/[0-9][0-9]*/p; '
    
    : if you have GNU sed, you can do this:
    echo email92 | sed '/[0-9]\+/p; '
    echo email92 | sed -r '/[0-9]+/p; '
    This does not do what you want (yet), but it lets you know when a line is matched. Sed is a line-oriented tool. It prints each line of input by default, unless the -n switch is used.

    The command "/some pattern/p" will print the lines that match some pattern. But since lines are printed anyway, this command will print each line once (including blank lines), while those lines that match "some pattern" will be printed twice. Once by default, and once by the 'p' command.

    Maybe you were expecting the line to print only if the pattern matched. In that case you should add the -n switch to sed, and then the 'p' command will print only lines that match the pattern.
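
    A quick way to see the difference, using the same email92 input as before. Nothing here is specific to GNU sed; the third command is an extra case added only to show what a non-matching line does.

```shell
# Without -n: the line is printed once by default and once by 'p' (2 lines).
echo email92 | sed '/[0-9]/p'

# With -n: default printing is off, so only the 'p' output remains (1 line).
echo email92 | sed -n '/[0-9]/p'

# With -n, a line that does not match the pattern produces no output.
echo email | sed -n '/[0-9]/p'
```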

    What you are probably most interested in is the 's' or s/ubsti/tute/ command, which begins with "s" followed by /old_pattern/new_string/, followed by optional flags (single letters or numbers that alter the pattern matching behavior). After you get going with sed, the idea for your situation is to define a pattern that looks like an email address (not hard to do), indicate it by using parentheses, and keep the parenthesized expression while tossing out the rest. It can be done as a one-liner.
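
    As a rough sketch of that idea (one possible way, not the only one): the function below assumes exactly one address per line and the fixed domain comp.com from the examples earlier in the thread, and it treats letters, digits, periods and hyphens as the characters allowed before the @. The name extract_email is made up for illustration. It uses only basic regular expressions, so it should not depend on GNU extensions.

```shell
# extract_email (hypothetical helper): print the address found in "$1",
# or nothing if the line contains no @comp.com address.
extract_email() {
    # 1st command: prepend a space so an address at the very start of the
    #              line is still preceded by a non-address character.
    # 2nd command: keep only the parenthesized group (\1) and print the
    #              line if the substitution succeeded.
    printf '%s\n' "$1" | sed -n \
        -e 's/^/ /' \
        -e 's/.*[^A-Za-z0-9.-]\([A-Za-z0-9.-][A-Za-z0-9.-]*@comp\.com\).*/\1/p'
}

extract_email 'Email address is abc.xyz@comp.com'   # prints: abc.xyz@comp.com
```

    The leading .* is greedy, so the character just before the group ends up being the last one that cannot belong to the address (typically the space in front of it). Lines without an address print nothing because of -n.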

    Basic question here: do you expect only one email address per line, or more than one email address per line?

    If you have GNU grep available, you may be able to get what you want without sed. If you have GNU grep, look at the -o ("only") switch. I'll be glad to help you with the regexes.
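
    A minimal sketch of the grep route, assuming a grep that supports -o (GNU grep does; the stock Solaris grep may not) and the same fixed comp.com domain used in the examples above. The name grep_email is hypothetical.

```shell
# grep_email (hypothetical helper): print only the part of "$1" that
# matches the address pattern, one match per output line.
grep_email() {
    printf '%s\n' "$1" | grep -o '[A-Za-z0-9.-][A-Za-z0-9.-]*@comp\.com'
}

grep_email 'Email address is abc.xyz@comp.com'   # prints: abc.xyz@comp.com
```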

    Eric Pement
