Click to See Complete Forum and Search --> : [RESOLVED] Expressing nesting with regular expressions and matching e-mail addresses
Stephen Philbin
10-25-2008, 09:31 AM
I'm having trouble with the last bit of a regular expression I'm trying to write. The regex is supposed to be an interpretation of the grammar for an E-mail address as specified in RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt).
The part I'm having a problem with at the moment is the comment specification. Here's the (ABNF) specification of comments from section 3.2.3 of RFC 2822.
ccontent = ctext / quoted-pair / comment
comment = "(" *([FWS] ccontent) [FWS] ")"
As you can see, the references are circular to allow for nesting of comments, but I'm having trouble putting this into a regular expression. Can anyone offer help or advice?
If anyone wants to see it, the code I have for the expression so far is $wsp = '[\\x09\\x20]';
$crlf = '\\x0d\\x0a';
$fws = '(' . $wsp . '*' . $crlf . ')?' . $wsp . '+';
$text = '[\\x01-\\x09\\x0b\\x0c\\0e-\\x7f]';
$atext = '[\\x21\\x23-\\x27\\x2a\\x2b\\x2d\\x2f-\\x39\\x3d\\x3f\\x41-\\x5a\\x5e-\\x7e]';
$no_ws_ctl = '[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]';
$ctext = $no_ws_ctl . '|' . '[\\x21-\\x27\\x2a-\\x5b\\x5d-\\x7e]';
##########################################################################
$comment = '\\x28((' . $fws . ')?' . $ccontent . ')*(' . $fws . ')?\\x29';
$ccontent = $ctext . '|' . $quoted_pair . '|' . $comment;
##########################################################################
$cfws = '((' . $fws . ')?' . $comment . ')*(((' . $fws . ')?' . $comment . ')|' . $fws . ')';
$qtext = $no_ws_ctl . '|[\\x21\\x23-\\x5b\\x5d-\\x7e]';
$quoted_pair = '\\x5c' . $text;
$qcontent = $qtext . '|' . $quoted_pair;
$quoted_string = '(' . $cfws . ')?\\x22((' . $fws . ')?' . $qcontent . ')*(' . $fws . ')?\\x22(' . $cfws . ')?';
$atom = '(' . $cfws . ')?' . $atext . '+(' . $cfws . ')?';
$word = $atom . '|' . $quoted_string;
$dtext = $no_ws_ctl . '|' . '[\\x21-\\x5a\\x5e-\\x7e]';
$dcontent = $dtext . '|' . $quoted_pair;
$domain_literal = '(' . $cfws . ')?\\x5b((' . $fws . ')?' . $dcontent . ')*(' . $fws . ')?\\x5d(' . $cfws . ')?';
$local_part = $dot_atom . '|' . $quoted_string;
$domain = $dot_atom . '|' . $domain_literal;
$addr_spec = $local_part . '\\x40' . $domain;
$angle_addr = '(' . $cfws . ')?' . '\\x3c' . $addr_spec . '\\x3e' . '(' . $cfws . ')?';
$display_name = $word. '+';
$name_addr = '(' . $display_name . ')?' . $angle_addr;
$mailbox = $name_addr . '|' . $addr_spec;
$group = $display_name . '\\x3a(' . $mailbox_list . '|' . $cfws . ')?\\x3b(' . $cfws . ')?';
$this->spec_exp = $mailbox . '|' . $group;
jim_keller
10-27-2008, 11:27 AM
it may be easiest to use a recursive function. For instance, the first call to the function checks to see if there are any nested comments. If there are, it first calls itself on the data inside the outermost comment. This continues until the check for nested comments fails, at which point the function does the actual parsing, then returns the value so that the previous call in the stack can operate on the returned value. This will happen all the way on up until you have all of your nested comments parsed.
Stephen Philbin
10-27-2008, 03:15 PM
Thanks for taking the time to read about my problem and offer helpful suggestion, Jim. I had considered the use of a recursive function and, if I had failed to find a way to express this recursion in regex, would have no doubt gone that route.
However, I did a little digging around and found that regular expressions now actually support recursion. It's rather less than straight-forward and I haven't studied the text enough to gain a complete understanding of it. I've just learned what I needed to know to get this particulair job done.
As well as posting the regex that solves my particulair problem, I'll also give some simplified sample code to illustrate what I have learned. That way, anyone with some similair recursion issue can just look at and adapt the simplified sample without having to wade through all the other pattern parts specific to my needs.
Most of the information I used to solve my particulair problem was found at http://www.pcre.org/pcre.txt (from the "BACK REFERENCES" section onwards.)
Here's some sample code, then.
<?php
$string = 'Some outer text (the first comment (the first (second) nested comment)) surrounding nested comments';
$text = '[a-z A-Z]';
$comment = '(?<comment>\(((?&comment)|' . $text . ')*\))';
$final_pattern = '/^' . $text . '*' . $comment . $text . '*' . '$/';
/*
Resulting pattern text:
/^[a-z A-Z]*(?<comment>\(((?&comment)|[a-z A-Z])*\))[a-z A-Z]*$/
*/
echo "<textarea cols=\"130\" rows=\"2\">Match Pattern: $final_pattern\r\nMatch String: $string</textarea><br>\r\n";
var_dump(preg_match($final_pattern, $string));
?>
Most of the above code is just standard stuff for PHP. There's only the actual pattern that might seem a bit unusual. I'll explain it by rewriting it with indented line breaks for clarity.
/^ "Standard start of subject string anchor"
[a-z A-Z]* "ordinary zero-or-more letters and spaces pattern"
(?<comment> "Opening of a `labeled subpattern`. It makes the contents of the brackets known by the text between the `<` and `>`"
\( "Literal opening bracket"
((?&comment)|[a-z A-Z])* "Zero-or-more ocurrences of either text or the expression labeled as `comment`"
\) "Literal closing bracket"
) "Closing bracket of the labeled subpattern"
[a-z A-Z]*
$/ "Standard end of subject string anchor"
As you can see, the part of the regex that references the labeled subexpression "comment" is inside the part of the pattern which defines the labeled subexpression "comment" which allows for an expression of recursion.
There are a few syntaxes for labeling and referencing sections of a pattern, but here I use (?<label_name_here>pattern_section_you_wish_to_label_here) to label a section of a pattern, and (?&label_name_here) to reference a previously labeled section of an expression.
This form of expressing recursion works by its self in the simple example, but it is still not quite right for my original problem. In my pattern that I was trying to build the comments could be recursively nested, but comments formed parts of several sections of the pattern. Using the form of labeling for recursion shown in the simple example obove was no good because it caused the same labeling to recur many times in the overall pattern.
To solve this problem, I used a subpattern definition according to the instructions in the pcre.txt file linked to at the start of this post. The general syntax is (?(DEFINE)(?<label_name_here>pattern_section_you_wish_to_label_here))
Using this define mechanism allows you to define the desired pattern section so that you can reference it by its label later on in the pattern, but without actually having a direct effect on the overall pattern used for matching. You basically define the desired pattern section at the start of your expression and then write the rest of the pattern as normal, but using the reference to the defined pattern wherever you need it.
If the simple example were to use the DEFINE method it would look like this <?php
$string = 'Some outer text (the first comment (the first (second) nested comment)) surrounding nested comments';
$text = '[a-z A-Z]';
$comment_definition = '(?(DEFINE) (?<comment>\(((?&comment)|' . $text . ')*\)))';
$comment = '(?&comment)';
$final_pattern = '/^' . $comment_definition . $text . '*' . $comment . $text . '*' . '$/';
/*
Resulting pattern text:
/^(?(DEFINE) (?<comment>\(((?&comment)|[a-z A-Z])*\)))[a-z A-Z]*(?&comment)[a-z A-Z]*$/
*/
echo "<textarea cols=\"130\" rows=\"2\">Match Pattern: $final_pattern\r\nMatch String: $string</textarea><br>\r\n";
var_dump(preg_match($final_pattern, $string));
?>
Doing things this way allows me to use the PHP variable, $comment, several times in more complex patterns (such as my e-mail matching one) without causing the same labeling section of the regex to appear over and over in the resulting overall pattern.
Sorry if I haven't explained it very well. I tried my best. People are welcome to add clarifications to anything I may have failed to properly explain. I'll post the php code that makes up the regex that solves my problem pretty soon. I've still a little work to do on it yet before it's right.
Thanks again to Jim and anyone else that read my original post and spent some time on trying to come up with a solution.
Stephen Philbin
10-28-2008, 05:15 AM
Ok. Here's my finished the regular expression for matching an e-mail address. If anyone else is planning on using it there's a few things they should know first. The first thing is that it comes from a PHP class that is designed primarily for generating outbound mail. Because of this, it takes on the guidance of RFC 2822 which states that mail software should be ready to accept obsolete syntax, but should never generate it. As a result, this expression is ignorant of obsolete syntax.
The second main thing to know is that the regular main expression string created my the concatenation of the PHP variables if fairly huge. I haven't done any benchmarking, but I'd imagine it would not be particulairly speedy as it stands. I would recommend building smaller, purpose-specific expressions from the building blocks provided (remembering to use the $comment_definition variable at the beginning of any patterns you build that has the $comment PHP variable as one of its building blocks).
UPDATE:
I've just added two more complete expressions in addition to the original pattern: $full_address_spec_pattern. One larger expression and one smaller expression. The smaller one is $single_address_pattern and is intended as a convenience to serve what I would imagine are the needs of the majority of people. If you are only interested in checking the conformance of a single e-mail address (rather than a list of addresses) I would very strongly recommend that you use this one because it is much shorter than the others and will still meet your needs.
The second (and larger) variant I added is the $full_address_spec_list_pattern variable. This is for matching a list of addresses that conform to the full address specification of RFC 2822 and has been added mainly just for completeness. It's definied in RFC 2822 so I also defined it here. If you are going to use the $full_address_spec_list_pattern variable then I strongly suggest that you make absolutely sure that one of the other patterns wouldn't also fit your needs because the literal string that $full_address_spec_list_pattern produces is absolutely huge.
Another thing that's probably worth mentioning is the use of once-only (aka atomic) subpatterns. I don't really know enough about them to try using them in this code without running the risk of causing errors that would be difficult to detect, but from what I've read, there seems to be a chance that they could be used to significantly boost the performance of this pattern. Any help from anyone that has familiarity with atomic matching and RFC 2822 would be greatly appreciated.
Anyway. Enough waffle. Here's the code. $wsp = '[\\x09\\x20]';
$crlf = '\\x0d\\x0a';
$fws = '(' . $wsp . '*' . $crlf . ')?' . $wsp . '+';
$text = '[\\x01-\\x09\\x0b\\x0c\\0e-\\x7f]';
$atext = '[\\x21\\x23-\\x27\\x2a\\x2b\\x2d\\x2f-\\x39\\x3d\\x3f\\x41-\\x5a\\x5e-\\x7e]';
$no_ws_ctl = '[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]';
$ctext = $no_ws_ctl . '|' . '[\\x21-\\x27\\x2a-\\x5b\\x5d-\\x7e]';
$quoted_pair = '\\x5c' . $text;
$comment_definition = '(?(DEFINE) (?<comment>\\x28((' . $fws . ')?(' . $ctext . ')|' . $quoted_pair . '|' . $comment . ')*(' . $fws . ')?\\x29))';
$comment = '(?&comment)';
$cfws = '((' . $fws . ')?' . $comment . ')*(((' . $fws . ')?' . $comment . ')|' . $fws . ')';
$qtext = $no_ws_ctl . '|[\\x21\\x23-\\x5b\\x5d-\\x7e]';
$qcontent = '(' . $qtext . ')|' . $quoted_pair;
$quoted_string = '(' . $cfws . ')?\\x22((' . $fws . ')?' . $qcontent . ')*(' . $fws . ')?\\x22(' . $cfws . ')?';
$atom = '(' . $cfws . ')?' . $atext . '+(' . $cfws . ')?';
$word = $atom . '|' . $quoted_string;
$dtext = $no_ws_ctl . '|' . '[\\x21-\\x5a\\x5e-\\x7e]';
$dcontent = '(' . $dtext . ')|' . $quoted_pair;
$domain_literal = '(' . $cfws . ')?\\x5b((' . $fws . ')?' . $dcontent . ')*(' . $fws . ')?\\x5d(' . $cfws . ')?';
$dot_atom = '(' . $cfws . ')?' . $atext . '+(\\x2e' . $atext . '+)*(' . $cfws . ')?';
$local_part = $dot_atom . '|' . $quoted_string;
$domain = $dot_atom . '|' . $domain_literal;
$addr_spec = '(' . $local_part . ')\\x40(' . $domain . ')';
$angle_addr = '(' . $cfws . ')?' . '\\x3c' . $addr_spec . '\\x3e' . '(' . $cfws . ')?';
$display_name = '(' . $word . ')+';
$name_addr = '(' . $display_name . ')?' . $angle_addr;
$mailbox = $name_addr . '|' . $addr_spec;
$mailbox_list = '(' . $mailbox . ')(\\x2c' . $mailbox . ')*';
$group = $display_name . '\\x3a(' . $mailbox_list . '|' . $cfws . ')?\\x3b(' . $cfws . ')?';
$address = '(' . $mailbox . ')|' . $group;
$single_address_pattern = '/^' . $comment_definition . $name_addr . '$/';
$full_address_spec_pattern = '/^' . $comment_definition . $address . '$/';
$full_address_spec_list_pattern = '/^' . $comment_definition . '(' . $address . ')(\\x2c' . $address . ')*$/';
Thanks for reading!
MrCoder
10-28-2008, 06:54 AM
This should be sticky-ied!
Great posts, great information, good job Stephen (Bookmarked)
Stephen Philbin
10-28-2008, 07:39 AM
Heh. Thanks. I dunno about making it sticky, though. I'll see what other people think first. Not keen on the idea of making my own threads sticky. Feels a bit self-congratulatory.