El módulo Regexp::Grammars escrito por Damian Conway extiende las expresiones regulares Perl con la capacidad de generar representaciones del árbol de análisis sintáctico abstracto y obviando la necesidad de explicitar los blancos. El módulo necesita para funcionar una versión de Perl superior o igual a la 5.10.
La documentación de Regexp::Grammars establece cual es el problema que aborda el módulo:
...Perl5.10 makes possible to use regexes to recognize complex, hierarchical-and even recursive-textual structures. The problem is that Perl 5.10 doesn’t provide any support for extracting that hierarchical data into nested data structures. In other words, using Perl 5.10 you can match complex data, but not parse it into an internally useful form.
An additional problem when using Perl 5.10 regexes to match
complex data formats is that you have to make sure you remember
to insert whitespace- matching constructs (such as \s*
) at every
possible position where the data might contain ignorable whitespace. This
reduces the readability of such patterns, and increases the chance of
errors (typically caused by overlooking a location where whitespace
might appear).
The Regexp::Grammars module solves both those problems.
If you import the module into a particular lexical scope, it preprocesses any regex in that scope, so as to implement a number of extensions to the standard Perl 5.10 regex syntax. These extensions simplify the task of defining and calling subrules within a grammar, and allow those subrule calls to capture and retain the components of they match in a proper hierarchical manner.
Las expresiones regulares Regexp::Grammars aumentan las regexp Perl 5.10. La sintáxis se expande y se modifica:
A Regexp::Grammars specification consists of a pattern (which may include both standard Perl 5.10 regex syntax, as well as special Regexp::Grammars directives), followed by one or more rule or token definitions.
Sigue un ejemplo:
pl@nereida:~/Lregexpgrammars/demo$ cat -n balanced_brackets.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 qr{ 9 (<pp>) 10 11 <rule: pp> \( (?: [^()]*+ | <escape> | <pp> )* \) 12 13 <token: escape> \\. 14 15 }xs; 16 }; 17 18 while (my $input = <>) { 19 while ($input =~ m{$rbb}g) { 20 say("matches: <$&>"); 21 say Dumper \%/; 22 } 23 }
Note that there is no need to explicitly place \s*
subpatterns throughout the rules; that is taken care of automatically.
...
The initial pattern ((<pp>)
) acts like the top rule of the grammar, and must
be matched completely for the grammar to match.
The rules and tokens are declarations only and they are not directly matched. Instead, they act like subroutines, and are invoked by name from the initial pattern (or from within a rule or token).
Each rule or token extends from the directive that introduces it up to either the next rule or token directive, or (in the case of the final rule or token) to the end of the grammar.
(2*(3+5))*4+(2-3)
produce:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 balanced_brackets.pl (2*(3+5))*4+(2-3) matches: <(2*(3+5))> $VAR1 = { '' => '(2*(3+5))', 'pp' => { '' => '(2*(3+5))', 'pp' => '(3+5)' } }; matches: <(2-3)> $VAR1 = { '' => '(2-3)', 'pp' => '(2-3)' };
Each rule calls the subrules specified within it, and then return a hash containing whatever result each of those subrules returned, with each result indexed by the subrule’s name.
In this way, each level of the hierarchical regex can generate hashes recording everything its own subrules matched, so when the entire pattern matches, it produces a tree of nested hashes that represent the structured data the pattern matched.
...
In addition each result-hash has one extra key: the empty string. The value for this key is whatever string the entire subrule call matched.
The difference between a token and a rule is that a token treats any whitespace within it exactly as a normal Perl regular expression would. That is, a sequence of whitespace in a token is ignored if the/x
modifier is in effect, or else matches the same literal sequence of whitespace characters (if/x
is not in effect).
En el ejemplo anterior el comportamiento es el mismo si se reescribe la regla
para el token escape
como:
13 <rule: escape> \\.En este otro ejemplo mostramos que la diferencia entre token y rule es significativa:
pl@nereida:~/Lregexpgrammars/demo$ cat -n tokenvsrule.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 qr{ 9 <s> 10 11 <rule: s> <a> <c> 12 13 <rule: c> c d 14 15 <token: a> a b 16 17 }xs; 18 }; 19 20 while (my $input = <>) { 21 if ($input =~ m{$rbb}) { 22 say("matches: <$&>"); 23 say Dumper \%/; 24 } 25 else { 26 say "Does not match"; 27 } 28 }
Al ejecutar este programa vemos la diferencia en la interpretación de los blancos:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 tokenvsrule.pl ab c d matches: <ab c d> $VAR1 = { '' => 'ab c d', 's' => { '' => 'ab c d', 'c' => 'c d', 'a' => 'ab' } }; a b c d Does not match ab cd matches: <ab cd> $VAR1 = { '' => 'ab cd', 's' => { '' => 'ab cd', 'c' => 'cd', 'a' => 'ab' } };Obsérvese como la entrada
a b c d
es rechazada mientras
que la entrada ab c d
es aceptada.
In a rule, any sequence of whitespace (except those at the very start and the very end of the rule) is treated as matching the implicit subrule<.ws>
, which is automatically predefined to match optional whitespace (i.e.\s*
).
You can explicitly define a<ws>
token to change that default behaviour. For example, you could alter the definition of whitespace to include Perlish comments, by adding an explicit<token: ws>
:
<token: ws> (?: \s+ | #[^\n]* )*
But be careful not to define <ws>
as a rule, as this will lead
to all kinds of infinitely recursive unpleasantness.
El siguiente ejemplo ilustra como redefinir <ws>
:
pl@nereida:~/Lregexpgrammars/demo$ cat -n tokenvsruleandws.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 no warnings 'uninitialized'; 9 qr{ 10 <s> 11 12 <token: ws> (?: \s+ | /\* .*? \*/)*+ 13 14 <rule: s> <a> <c> 15 16 <rule: c> c d 17 18 <token: a> a b 19 20 }xs; 21 }; 22 23 while (my $input = <>) { 24 if ($input =~ m{$rbb}) { 25 say Dumper \%/; 26 } 27 else { 28 say "Does not match"; 29 } 30 }Ahora podemos introducir comentarios en la entrada:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 -w tokenvsruleandws.pl ab /* 1 */ c d $VAR1 = { '' => 'ab /* 1 */ c d', 's' => { '' => 'ab /* 1 */ c d', 'c' => 'c d', 'a' => 'ab' } };
To invoke a rule to match at any point, just enclose the rule’s name in angle brackets (like in Perl 6). There must be no space between the opening bracket and the rulename. For example:
qr{ file: # Match literal sequence 'f' 'i' 'l' 'e' ':' <name> # Call <rule: name> <options>? # Call <rule: options> (it's okay if it fails) <rule: name> # etc. }x;
If you need to match a literal pattern that would otherwise look like a subrule call, just backslash-escape the leading angle:
qr{ file: # Match literal sequence 'f' 'i' 'l' 'e' ':' \<name> # Match literal sequence '<' 'n' 'a' 'm' 'e' '>' <options>? # Call <rule: options> (it's okay if it fails) <rule: name> # etc. }x;
El siguiente programa ilustra algunos puntos discutidos en la cita anterior:
casiano@millo:~/src/perl/regexp-grammar-examples$ cat -n badbracket.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 qr{ 9 (<pp>) 10 11 <rule: pp> \( (?: <b > | \< | < escape> | <pp> )* \) 12 13 <token: b > b 14 15 <token: escape> \\. 16 17 }xs; 18 }; 19 20 while (my $input = <>) { 21 while ($input =~ m{$rbb}g) { 22 say("matches: <$&>"); 23 say Dumper \%/; 24 } 25 }
Obsérvense los blancos en < escape>
y en <token: b > b
.
Pese a ello el programa funciona:
casiano@millo:~/src/perl/regexp-grammar-examples$ perl5.10.1 badbracket.pl (\(\)) matches: <(\(\))> $VAR1 = { '' => '(\\(\\))', 'pp' => { '' => '(\\(\\))', 'escape' => '\\)' } }; (b) matches: <(b)> $VAR1 = { '' => '(b)', 'pp' => { '' => '(b)', 'b' => 'b' } }; (<) matches: <(<)> $VAR1 = { '' => '(<)', 'pp' => '(<)' }; (c) casiano@millo:
...Note, however, that if the result-hash at any level contains only the empty-string key (i.e. the subrule did not call any sub-subrules or save any of their nested result-hashes), then the hash is unpacked and just the matched substring itself if returned.
For example, if <rule: sentence>
had been defined:
<rule: sentence> I see dead people
then a successful call to the rule would only add:
sentence => 'I see dead people'
to the current result-hash.
This is a useful feature because it prevents a series of nested subrule calls from producing very unwieldy data structures. For example, without this automatic unpacking, even the simple earlier example:
<rule: sentence> <noun> <verb> <object>
would produce something needlessly complex, such as:
sentence => { "" => 'I saw a dog', noun => { "" => 'I', }, verb => { "" => 'saw', }, object => { "" => 'a dog', article => { "" => 'a', }, noun => { "" => 'dog', }, }, }
El siguiente ejemplo ilustra este punto:
pl@nereida:~/Lregexpgrammars/demo$ cat -n unaryproductions.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 qr{ 9 <s> 10 11 <rule: s> <noun> <verb> <object> 12 13 <token: noun> he | she | Peter | Jane 14 15 <token: verb> saw | sees 16 17 <token: object> a\s+dog | a\s+cat 18 19 }x; 20 }; 21 22 while (my $input = <>) { 23 while ($input =~ m{$rbb}g) { 24 say("matches: <$&>"); 25 say Dumper \%/; 26 } 27 }
Sigue una ejecución del programa anterior:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 unaryproductions.pl he saw a dog matches: <he saw a dog> $VAR1 = { '' => 'he saw a dog', 's' => { '' => 'he saw a dog', 'object' => 'a dog', 'verb' => 'saw', 'noun' => 'he' } }; Jane sees a cat matches: <Jane sees a cat> $VAR1 = { '' => 'Jane sees a cat', 's' => { '' => 'Jane sees a cat', 'object' => 'a cat', 'verb' => 'sees', 'noun' => 'Jane' } };
Cuando se usa Regexp::Grammars
como parte de
un programa que utiliza otras regexes hay que evitar
que Regexp::Grammars
procese las mismas. Regexp::Grammars
reescribe las expresiones regulares durante la fase de preproceso. Esta por ello
presenta las mismas limitaciones que cualquier otra forma de
'source filtering' (véase perlfilter). Por ello es una buena idea declarar
la gramática en un bloque do
restringiendo de esta forma el ámbito de
acción del módulo.
5 my $calculator = do{ 6 use Regexp::Grammars; 7 qr{ . ........ 28 }xms 29 };
When a grammar has parsed successfully, the %/
variable will contain
a series of nested hashes (and possibly arrays) representing the
hierarchical structure of the parsed data.
Typically, the next step is to walk that tree, extracting or converting or otherwise processing that information. If the tree has nodes of many different types, it can be difficult to build a recursive subroutine that can navigate it easily.
A much cleaner solution is possible if the nodes of the tree are proper objects. In that case, you just define atrasnlate()
method for each of the classes, and have every node call that method on each of its children. The chain oftranslate()
calls would cascade down the nodes of the tree, each one invoking the appropriatetranslate()
method according to the type of node encountered.
The only problem is that, by default, Regexp::Grammars returns a tree of plain-old hashes, notClass::Whatever
objects. Fortunately, it's easy to request that the result hashes be automatically blessed into the appropriate classes, using the<objrule:...>
and<objtoken:...>
directives.
These directives are identical to the<rule:...>
and<token:...>
directives (respectively), except that the rule or token they create will also bless the hash it normally returns, converting it to an object of a class whose name is the same as the rule or token itself.
For example:
<objrule: Element> # ...Defines a rule that can be called as <Element> # ...and which returns a hash-based Element object
The IDENTIFIER
of the rule or token may also be fully qualified. In
such cases, the rule or token is defined using only the final short
name, but the result object is blessed using the fully qualified long
name. For example:
<objrule: LaTeX::Element> # ...Defines a rule that can be called as <Element> # ...and which returns a hash-based LaTeX::Element object
This can be useful to ensure that returned objects don't collide with other namespaces in your program.
Note that you can freely mix object-returning and plain-old-hash-returning rules and tokens within a single grammar, though you have to be careful not to subsequently try to call a method on any of the unblessed nodes.
No siempre el nombre de la regla es el mas apropiado para ser el nombre del resultado:
It is not always convenient to have subrule results stored under the same name as the rule itself. Rule names should be optimized for understanding the behaviour of the parser, whereas result names should be optimized for understanding the structure of the data. Often those two goals are identical, but not always; sometimes rule names need to describe what the data looks like, while result names need to describe what the data means.
For example, sometimes you need to call the same rule twice, to match two syntactically identical components whose positions give then semantically distinct meanings:
<rule: copy_cmd> copy <file> <file>
The problem here is that, if the second call to<file>
succeeds, its result-hash will be stored under the keyfile
, clobbering the data that was returned from the first call to<file>
.
To avoid such problems, Regexp::Grammars
allows you to alias any
subrule call, so that it is still invoked by the original name, but its
result-hash is stored under a different key. The syntax for that is:
<alias=rulename>
. For example:
<rule: copy_cmd> copy <from=file> <to=file>
Here,<rule: file>
is called twice, with the first result-hash being stored under the keyfrom
, and the second result-hash being stored under the keyto
.
Note, however, that the alias before the=
must be a proper identifier (i.e. a letter or underscore, followed by letters, digits, and/or underscores). Aliases that start with an underscore and aliases namedMATCH
have special meaning.
Aliases can also be useful for normalizing data that may appear in different formats and sequences. For example:
<rule: copy_cmd> copy <from=file> <to=file> | dup <to=file> as <from=file> | <from=file> -> <to=file> | <to=file> <- <from=file>
Here, regardless of which order the old and new files are specified, the result-hash always gets:
copy_cmd => { from => 'oldfile', to => 'newfile', }
El siguiente programa ilustra los comentarios de la documentación:
pl@nereida:~/Lregexpgrammars/demo$ cat -n copygrammar.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 qr{ 9 <copy_cmd> 10 11 <rule: copy_cmd> 12 copy <from=file> <to=file> 13 | <from=file> -> <to=file> 14 | <to=file> <- <from=file> 15 16 <token: file> [\w./\\]+ 17 }x; 18 }; 19 20 while (my $input = <>) { 21 while ($input =~ m{$rbb}g) { 22 say("matches: <$&>"); 23 say Dumper \%/; 24 } 25 }Cuando lo ejecutamos obtenemos:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 copygrammar.pl copy a b matches: <copy a b> $VAR1 = { '' => 'copy a b', 'copy_cmd' => { '' => 'copy a b', 'to' => 'b', 'from' => 'a' } }; b <- a matches: <b <- a> $VAR1 = { '' => 'b <- a', 'copy_cmd' => { '' => 'b <- a', 'to' => 'b', 'from' => 'a' } }; a -> b matches: <a -> b> $VAR1 = { '' => 'a -> b', 'copy_cmd' => { '' => 'a -> b', 'to' => 'b', 'from' => 'a' } };
If a subrule call is quantified with a repetition specifier:
<rule: file_sequence> <file>+
then each repeated match overwrites the corresponding entry in the
surrounding rule’s result-hash, so only the result of the final
repetition will be
retained. That is, if the above example matched the string foo.pl bar.py baz.php
,
then the result-hash would contain:
file_sequence { "" => 'foo.pl bar.py baz.php', file => 'baz.php', }
Existe un caveat con el uso de los operadores de repetición y el manejo de los blancos. Véase el siguiente programa:
pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers3.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 9 qr{ 10 <numbers> 11 12 <rule: numbers> 13 (<number>)+ 14 15 <token: number> \s*\d+ 16 }xms; 17 }; 18 19 while (my $input = <>) { 20 if ($input =~ m{$rbb}) { 21 say("matches: <$&>"); 22 say Dumper \%/; 23 } 24 }Obsérvese el uso explícito de espacios
\s*\d+
en la definición de number
.
Sigue un ejemplo de ejecución:
pl@nereida:~/Lregexpgrammars/demo$ perl5_10_1 numbers3.pl 1 2 3 4 matches: <1 2 3 4> $VAR1 = { '' => '1 2 3 4', 'numbers' => { '' => '1 2 3 4', 'number' => ' 4' } };
Si se eliminan los blancos de la definición de
number
:
pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 9 qr{ 10 <numbers> 11 12 <rule: numbers> 13 (<number>)+ 14 15 <token: number> \d+ 16 }xms; 17 }; 18 19 while (my $input = <>) { 20 if ($input =~ m{$rbb}) { 21 say("matches: <$&>"); 22 say Dumper \%/; 23 } 24 }se obtiene una conducta que puede sorprender:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numbers.pl 12 34 56 matches: <12> $VAR1 = { '' => '12', 'numbers' => { '' => '12', 'number' => '12' } };
La explicación está en la documentación: véase la sección Grammar Syntax:
<rule: IDENTIFIER>
Define a rule whose name is specified by the supplied identifier.
Everything following the<rule:...>
directive (up to the next<rule:...>
or<token:...>
directive) is treated as part of the rule being defined.
Any whitespace in the rule is replaced by a call to the<.ws>
subrule (which defaults to matching\s*
, but may be explicitly redefined).
También podríamos haber resuelto el problema introduciendo un blanco explícito dentro del cierre positivo:
<rule: numbers> (<number> )+ <token: number> \d+
Usually, that’s not the desired outcome, so Regexp::Grammars provides another mechanism by which to call a subrule; one that saves all repetitions of its results.
A regular subrule call consists of the rule’s name surrounded by angle
brackets. If, instead, you surround the rule’s name with <[...]>
(angle and square brackets) like so:
<rule: file_sequence> <[file]>+
then the rule is invoked in exactly the same way, but the result of that
submatch is pushed onto an array nested inside the appropriate result-hash
entry. In other words, if the above example matched the same
foo.pl bar.py baz.php
string, the result-hash would contain:
file_sequence { "" => 'foo.pl bar.py baz.php', file => [ 'foo.pl', 'bar.py', 'baz.php' ], }
Teniendo en cuenta lo dicho anteriormente sobre los blancos dentro de los cuantificadores, es necesario introducir blancos dentro del operador de repetición:
pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers4.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 9 qr{ 10 <numbers> 11 12 <rule: numbers> 13 (?: <[number]> )+ 14 15 <token: number> \d+ 16 }xms; 17 }; 18 19 while (my $input = <>) { 20 if ($input =~ m{$rbb}) { 21 say("matches: <$&>"); 22 say Dumper \%/; 23 } 24 }Al ejecutar este programa obtenemos:
pl@nereida:~/Lregexpgrammars/demo$ perl5_10_1 numbers4.pl 1 2 3 4 matches: <1 2 3 4 > $VAR1 = { '' => '1 2 3 4 ', 'numbers' => { '' => '1 2 3 4 ', 'number' => [ '1', '2', '3', '4' ] } };
This listifying subrule call can also be useful for non-repeated subrule calls, if the same subrule is invoked in several places in a grammar. For example if a cmdline option could be given either one or two values, you might parse it:
<rule: size_option> -size <[size]> (?: x <[size]> )?
The result-hash entry for size
would then always contain an array,
with either one or two elements, depending on the input being parsed.
Sigue un ejemplo:
pl@nereida:~/Lregexpgrammars/demo$ cat -n sizes.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 9 qr{ 10 <command> 11 12 <rule: command> ls <size_option> 13 14 <rule: size_option> 15 -size <[size]> (?: x <[size]> )? 16 17 <token: size> \d+ 18 }x; 19 }; 20 21 while (my $input = <>) { 22 while ($input =~ m{$rbb}g) { 23 say("matches: <$&>"); 24 say Dumper \%/; 25 } 26 }Veamos su comportamiento con diferentes entradas:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 sizes.pl ls -size 4 matches: <ls -size 4 > $VAR1 = { '' => 'ls -size 4 ', 'command' => { 'size_option' => { '' => '-size 4 ', 'size' => [ '4' ] }, '' => 'ls -size 4 ' } }; ls -size 2x8 matches: <ls -size 2x8 > $VAR1 = { '' => 'ls -size 2x8 ', 'command' => { 'size_option' => { '' => '-size 2x8 ', 'size' => [ '2', '8' ] }, '' => 'ls -size 2x8 ' } };
Listifying subrules can also be given aliases, just like ordinary subrules. The alias is always specified inside the square brackets:
<rule: size_option> -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?
Here, the sizes are parsed using thepos_integer
rule, but saved in the result-hash in an array under the keysize
.
Sigue un ejemplo:
pl@nereida:~/Lregexpgrammars/demo$ cat -n aliasedsizes.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 9 qr{ 10 <command> 11 12 <rule: command> ls <size_option> 13 14 <rule: size_option> 15 -size <[size=int]> (?: x <[size=int]> )? 16 17 <token: int> \d+ 18 }x; 19 }; 20 21 while (my $input = <>) { 22 while ($input =~ m{$rbb}g) { 23 say("matches: <$&>"); 24 say Dumper \%/; 25 } 26 }Veamos el resultado de una ejecución:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 aliasedsizes.pl ls -size 2x4 matches: <ls -size 2x4 > $VAR1 = { '' => 'ls -size 2x4 ', 'command' => { 'size_option' => { '' => '-size 2x4 ', 'size' => [ '2', '4' ] }, '' => 'ls -size 2x4 ' } };
En este ejemplo aparece <number>+
sin
corchetes ni paréntesis:
pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers5.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 9 qr{ 10 <numbers> 11 12 <rule: numbers> 13 <number>+ 14 15 <token: number> \d+ 16 }xms; 17 }; 18 19 while (my $input = <>) { 20 if ($input =~ m{$rbb}) { 21 say("matches: <$&>"); 22 say Dumper \%/; 23 } 24 }Este programa produce un mensaje de advertencia:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numbers5.pl warn | Repeated subrule <number>+ will only capture its final match | (Did you mean <[number]>+ instead?) |
Si se quiere evitar el mensaje y se está dispuesto a asumir la pérdida de los valores asociados con los elementos de la lista se deberán poner el operando entre paréntesis (con o sin memoria).
Esto es lo que dice la documentación sobre este warning:
Repeated subrule <rule> will only capture its final match
You specified a subrule call with a repetition qualifier, such as:
<ListElem>*
or:
<ListElem>+
Because each subrule call saves its result in a hash entry of the same name, each repeated match will overwrite the previous ones, so only the last match will ultimately be saved. If you want to save all the matches, you need to tell Regexp::Grammars to save the sequence of results as a nested array within the hash entry, like so:
<[ListElem]>*
or:
<[ListElem]>+
If you really did intend to throw away every result but the final one, you can silence the warning by placing the subrule call inside any kind of parentheses. For example:
(<ListElem>)*
or:
(?: <ListElem> )+
Aliases can also be given to standard Perl subpatterns, as well as to code blocks within a regex. The syntax for subpatterns is:
<ALIAS= (SUBPATTERN) >
In other words, the syntax is exactly like an aliased subrule call, except that the rule name is replaced with a set of parentheses containing the subpattern. Any parentheses-capturing or non-capturing-will do.
The effect of aliasing a standard subpattern is to cause whatever that subpattern matches to be saved in the result-hash, using the alias as its key. For example:
<rule: file_command> <cmd=(mv|cp|ln)> <from=file> <to=file>
Here, theSigue un ejemplo:<cmd=(mv|cp|ln)>
is treated exactly like a regular(mv|cp|ln)
, but whatever substring it matches is saved in the result-hash under the key'cmd'
.
pl@nereida:~/Lregexpgrammars/demo$ cat -n subpattern.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 9 qr{ 10 <file_command> 11 12 <rule: file_command> 13 14 <cmd=(mv|cp|ln)> <from=([\w./]+)> <to=([\w./]+)> 15 16 }x; 17 }; 18 19 while (my $input = <>) { 20 while ($input =~ m{$rbb}g) { 21 say("matches: <$&>"); 22 say Dumper \%/; 23 } 24 }y una ejecución:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 subpattern.pl mv a b matches: <mv a b> $VAR1 = { '' => 'mv a b', 'file_command' => { '' => 'mv a b', 'to' => 'b', 'cmd' => 'mv', 'from' => 'a' } }; cp c d matches: <cp c d> $VAR1 = { '' => 'cp c d', 'file_command' => { '' => 'cp c d', 'to' => 'd', 'cmd' => 'cp', 'from' => 'c' } }
The syntax for aliasing code blocks is:
<ALIAS= (?{ your($code->here) }) >
Note, however, that the code block must be specified in the standard
Perl 5.10 regex notation: (?{...})
. A common mistake is to write:
<ALIAS= { your($code->here } >
instead, which will attempt to interpolate$code
before the regex is even compiled, as such variables are only protected from interpolation inside a(?{...})
.
When correctly specified, this construct executes the code in the block
and saves the result of that execution in the result-hash, using the
alias as its key. Aliased code blocks are useful for adding semantic
information based on which branch of a rule is executed. For example,
consider the copy_cmd
alternatives shown earlier:
<rule: copy_cmd> copy <from=file> <to=file> | dup <to=file> as <from=file> | <from=file> -> <to=file> | <to=file> <- <from=file>
Using aliased code blocks, you could add an extra field to the result- hash to describe which form of the command was detected, like so:
<rule: copy_cmd> copy <from=file> <to=file> <type=(?{ 'std' })> | dup <to=file> as <from=file> <type=(?{ 'rev' })> | <from=file> -> <to=file> <type=(?{ 'fwd' })> | <to=file> <- <from=file> <type=(?{ 'bwd' })>
Now, if the rule matched, the result-hash would contain something like:
copy_cmd => { from => 'oldfile', to => 'newfile', type => 'fwd', }
El siguiente ejemplo ilustra lo dicho en la documentación. En la línea 15 hemos introducido una regla para el control de errores31.8:
pl@nereida:~/Lregexpgrammars/demo$ cat -n aliasedcodeblock2.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 6 my $rbb = do { 7 use Regexp::Grammars; 8 qr{ 9 <copy_cmd> 10 11 <rule: copy_cmd> 12 copy (<from=file>) (<to=file>) <type=(?{ 'std' })> 13 | <from=file> -> <to=file> <type=(?{ 'fwd' })> 14 | <to=file> <- <from=file> <type=(?{ 'bwd' })> 15 | .+ (?{ die "Syntax error!\n" }) 16 17 <token: file> [\w./\\]+ 18 }x; 19 }; 20 21 while (my $input = <>) { 22 while ($input =~ m{$rbb}g) { 23 say("matches: <$&>"); 24 say Dumper \%/; 25 } 26 }
La ejecución muestra el comportamiento del programa con tres entradas válidas y una errónea:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 aliasedcodeblock2.pl copy a b matches: <copy a b > $VAR1 = { '' => 'copy a b ', 'copy_cmd' => { '' => 'copy a b ', 'to' => 'b', 'from' => 'a', 'type' => 'std' } }; b <- a matches: <b <- a > $VAR1 = { '' => 'b <- a ', 'copy_cmd' => { '' => 'b <- a ', 'to' => 'b', 'from' => 'a', 'type' => 'bwd' } }; a -> b matches: <a -> b > $VAR1 = { '' => 'a -> b ', 'copy_cmd' => { '' => 'a -> b ', 'to' => 'b', 'from' => 'a', 'type' => 'fwd' } }; cp a b Syntax error!
Note that, in addition to the semantics described above, aliased subpatterns and code blocks also become visible to Regexp::Grammars integrated debugger (see Debugging).
By default, every subrule call saves its result into the result-hash, either under its own name, or under an alias.
However, sometimes you may want to refactor some literal part of a rule into one or more subrules, without having those submatches added to the result-hash. The syntax for calling a subrule, but ignoring its return value is:
<.SUBRULE>
(which is stolen directly from Perl 6).
For example, you may prefer to rewrite a rule such as:
<rule: paren_pair> \( (?: <escape> | <paren_pair> | <brace_pair> | [^()] )* \)
without any literal matching, like so:
<rule: paren_pair> <.left_paren> (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )* <.right_paren> <token: left_paren> \( <token: right_paren> \) <token: non_paren> [^()]
Moreover, as the individual components inside the parentheses probably aren't being captured for any useful purpose either, you could further optimize that to:
<rule: paren_pair> <.left_paren> (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )* <.right_paren>
Note that you can also use the dot modifier on an aliased subpattern:
<.Alias= (SUBPATTERN) >
This seemingly contradictory behaviour (of giving a subpattern a name, then deliberately ignoring that name) actually does make sense in one situation. Providing the alias makes the subpattern visible to the debugger, while using the dot stops it from affecting the result-hash. See Debugging non-grammars for an example of this usage.
Por ejemplo, queremos reconocer listas de números separados por comas. Supongamos también que queremos darle un nombre a la expresión regular de separación. Quizá, aunque no es el caso, porque la expresión regular de separación sea suficientemente compleja. Si no usamos la notación punto la coma aparecerá en la estructura:
pl@nereida:~/Lregexpgrammars/demo$ cat -n numberscomma.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 $Data::Dumper::Indent = 1; 6 7 my $rbb = do { 8 use Regexp::Grammars; 9 10 qr{ 11 <numbers> 12 13 <objrule: numbers> 14 <[number]> (<comma> <[number]>)* 15 16 <objtoken: number> \s*\d+ 17 <token: comma> \s*, 18 }xms; 19 }; 20 21 while (my $input = <>) { 22 if ($input =~ m{$rbb}) { 23 say("matches: <$&>"); 24 say Dumper \%/; 25 } 26 }En efecto, aparece la clave
comma
:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numberscomma.pl 2, 3, 4 matches: <2, 3, 4> $VAR1 = { '' => '2, 3, 4', 'numbers' => bless( { '' => '2, 3, 4', 'number' => [ bless( { '' => '2' }, 'number' ), bless( { '' => '3' }, 'number' ), bless( { '' => '4' }, 'number' ) ], 'comma' => ',' }, 'numbers' ) };Si cambiamos la llamada a la regla
<comma>
por
<.comma>
pl@nereida:~/Lregexpgrammars/demo$ diff numberscomma.pl numberscomma2.pl 14c14 < <[number]> (<comma> <[number]>)* --- > <[number]> (<.comma> <[number]>)*eliminamos la aparición de la innecesaria clave:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numberscomma2.pl 2, 3, 4 matches: <2, 3, 4> $VAR1 = { '' => '2, 3, 4', 'numbers' => bless( { '' => '2, 3, 4', 'number' => [ bless( { '' => '2' }, 'number' ), bless( { '' => '3' }, 'number' ), bless( { '' => '4' }, 'number' ) ] }, 'numbers' ) };
Regexp::Grammars
also offers full manual control over the distillation
process. If you use the reserved word MATCH
as the alias for a subrule
call:
<MATCH=filename>
or a subpattern match:
<MATCH=( \w+ )>
or a code block:
<MATCH=(?{ 42 })>
then the current rule will treat the return value of that subrule, pattern, or code block as its complete result, and return that value instead of the usual result-hash it constructs. This is the case even if the result has other entries that would normally also be returned.
For example, in a rule like:
<rule: term> <MATCH=literal> | <left_paren> <MATCH=expr> <right_paren>
The use ofMATCH
aliases causes the rule to return either whatever<literal>
returns, or whatever<expr>
returns (provided it's between left and right parentheses).
Note that, in this second case, even though<left_paren>
and<right_paren>
are captured to the result-hash, they are not returned, because theMATCH
alias overrides the normal return the result-hash semantics and returns only what its associated subrule (i.e.<expr>
) produces.
El siguiente ejemplo ilustra el uso del alias MATCH
:
$ cat -n demo_calc.pl 1 #!/usr/local/lib/perl/5.10.1/bin/perl5.10.1 2 use v5.10; 3 use warnings; 4 5 my $calculator = do{ 6 use Regexp::Grammars; 7 qr{ 8 <Answer> 9 10 <rule: Answer> 11 <X=Mult> <Op=([+-])> <Y=Answer> 12 | <MATCH=Mult> 13 14 <rule: Mult> 15 <X=Pow> <Op=([*/%])> <Y=Mult> 16 | <MATCH=Pow> 17 18 <rule: Pow> 19 <X=Term> <Op=(\^)> <Y=Pow> 20 | <MATCH=Term> 21 22 <rule: Term> 23 <MATCH=Literal> 24 | \( <MATCH=Answer> \) 25 26 <token: Literal> 27 <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )> 28 }xms 29 }; 30 31 while (my $input = <>) { 32 if ($input =~ $calculator) { 33 use Data::Dumper 'Dumper'; 34 warn Dumper \%/; 35 } 36 }
Veamos una ejecución:
$ ./demo_calc.pl 2+3*5 $VAR1 = { '' => '2+3*5', 'Answer' => { '' => '2+3*5', 'Op' => '+', 'X' => '2', 'Y' => { '' => '3*5', 'Op' => '*', 'X' => '3', 'Y' => '5' } } }; 4-5-2 $VAR1 = { '' => '4-5-2', 'Answer' => { '' => '4-5-2', 'Op' => '-', 'X' => '4', 'Y' => { '' => '5-2', 'Op' => '-', 'X' => '5', 'Y' => '2' } } };Obsérvese como el árbol construido para la expresión
4-5-2
se hunde a derechas dando lugar a una jerarquía errónea.
Para arreglar el problema sería necesario eliminar la
recursividad por la izquierda en las reglas correspondientes.
It's also possible to control what a rule returns from within a code block. Regexp::Grammars provides a set of reserved variables that give direct access to the result-hash.
The result-hash itself can be accessed as %MATCH
within any code block
inside a rule. For example:
<rule: sum> <X=product> \+ <Y=product> <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>
Here, the rule matches a product (aliased'X'
in the result-hash), then a literal'+'
, then another product (aliased to'Y'
in the result-hash). The rule then executes the code block, which accesses the two saved values (as$MATCH{X}
and$MATCH{Y}
), adding them together. Because the block is itself aliased toMATCH
, the sum produced by the block becomes the (only) result of the rule.
It is also possible to set the rule result from within a code block
(instead of aliasing it). The special override return value is
represented by the special variable $MATCH
. So the previous example
could be rewritten:
<rule: sum> <X=product> \+ <Y=product> (?{ $MATCH = $MATCH{X} + $MATCH{Y} })
Both forms are identical in effect. Any assignment to $MATCH
overrides
the normal return all subrule results behaviour.
Assigning to $MATCH
directly is particularly handy if the result may
not always be distillable, for example:
<rule: sum> <X=product> \+ <Y=product> (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) { # Reduce to sum, if both terms are simple scalars... $MATCH = $MATCH{X} + $MATCH{Y}; } else { # Return full syntax tree for non-simple case... $MATCH{op} = '+'; } })
Note that you can also partially override the subrule return
behaviour. Normally, the subrule returns the complete text it matched
under the empty key of its result-hash. That is, of course,
$MATCH{""}
,
so you can override just that behaviour by directly assigning to that
entry.
For example, if you have a rule that matches key/value pairs from a configuration file, you might prefer that any trailing comments not be included in the matched text entry of the rule's result-hash. You could hide such comments like so:
<rule: config_line> <key> : <value> <comment>? (?{ # Edit trailing comments out of "matched text" entry... $MATCH = "$MATCH{key} : $MATCH{value}"; })
Some more examples of the uses of $MATCH
:
<rule: FuncDecl> # Keyword Name Keep return the name (as a string)... func <Identifier> ; (?{ $MATCH = $MATCH{'Identifier'} }) <rule: NumList> # Numbers in square brackets... \[ ( \d+ (?: , \d+)* ) \] # Return only the numbers... (?{ $MATCH = $CAPTURE }) <token: Cmd> # Match standard variants then standardize the keyword... (?: mv | move | rename ) (?{ $MATCH = 'mv'; })
$CAPTURE
and$CONTEXT
are both aliases for the built-in read-only$^N
variable, which always contains the substring matched by the nearest preceding(...)
capture.$^N
still works perfectly well, but these are provided to improve the readability of code blocks and error messages respectively.
El siguiente código implementa una calculadora usando destilación en el código:
pl@nereida:~/Lregexpgrammars/demo$ cat -n demo_calc_inline.pl 1 use v5.10; 2 use warnings; 3 4 my $calculator = do{ 5 use Regexp::Grammars; 6 qr{ 7 <Answer> 8 9 <rule: Answer> 10 <X=Mult> \+ <Y=Answer> 11 (?{ $MATCH = $MATCH{X} + $MATCH{Y}; }) 12 | <X=Mult> - <Y=Answer> 13 (?{ $MATCH = $MATCH{X} - $MATCH{Y}; }) 14 | <MATCH=Mult> 15 16 <rule: Mult> 17 <X=Pow> \* <Y=Mult> 18 (?{ $MATCH = $MATCH{X} * $MATCH{Y}; }) 19 | <X=Pow> / <Y=Mult> 20 (?{ $MATCH = $MATCH{X} / $MATCH{Y}; }) 21 | <X=Pow> % <Y=Mult> 22 (?{ $MATCH = $MATCH{X} % $MATCH{Y}; }) 23 | <MATCH=Pow> 24 25 <rule: Pow> 26 <X=Term> \^ <Y=Pow> 27 (?{ $MATCH = $MATCH{X} ** $MATCH{Y}; }) 28 | <MATCH=Term> 29 30 <rule: Term> 31 <MATCH=Literal> 32 | \( <MATCH=Answer> \) 33 34 <token: Literal> 35 <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )> 36 }xms 37 }; 38 39 while (my $input = <>) { 40 if ($input =~ $calculator) { 41 say '--> ', $/{Answer}; 42 } 43 }
4-2-2
8/4/2
2^2^3
If a rule name (or an alias) begins with an underscore:
<_RULENAME> <_ALIAS=RULENAME> <[_RULENAME]> <[_ALIAS=RULENAME]>
then matching proceeds as normal, and any result that is returned is stored in the current result-hash in the usual way.
However, when any rule finishes (and just before it returns) it first filters its result-hash, removing any entries whose keys begin with an underscore. This means that any subrule with an underscored name (or with an underscored alias) remembers its result, but only until the end of the current rule. Its results are effectively private to the current rule.
This is especially useful in conjunction with result distillation.
El siguiente ejemplo muestra como construir un reconocedor de listas (posiblemente vacías) de números:
casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list.pl 1 #!/soft/perl5lib/bin/perl5.10.1 2 use v5.10; 3 4 use Regexp::Grammars; 5 6 my $list = qr{ 7 <List> 8 9 <rule: List> 10 <digit> <List> 11 | # empty 12 13 <rule: digit> 14 <MATCH=(\d+)> 15 16 }xms; 17 18 while (my $input = <>) { 19 chomp $input; 20 if ($input =~ $list) { 21 use Data::Dumper 'Dumper'; 22 warn Dumper \%/; 23 } 24 else { 25 warn "Does not match\n" 26 } 27 }Sigue una ejecución:
casiano@millo:~/Lregexp-grammar-examples$ ./simple_list.pl 2 3 4 $VAR1 = { '' => '2 3 4', 'List' => { '' => '2 3 4', 'digit' => '2' 'List' => { '' => '3 4', 'digit' => '3' 'List' => { '' => '4', 'digit' => '4' 'List' => '', }, }, } };
Tenga en cuenta que el orden de las reglas influye en el lenguaje reconocido. Véase lo que ocurre si cambiamos en el ejemplo anterior el orden de las reglas:
casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list_empty_first.pl 1 #!/soft/perl5lib/bin/perl5.10.1 2 use v5.10; 3 4 use Regexp::Grammars; 5 6 my $list = qr{ 7 <List> 8 9 <rule: List> 10 # empty 11 | <digit> <List> 12 13 <rule: digit> 14 <MATCH=(\d+)> 15 16 }xms; 17 18 while (my $input = <>) { 19 chomp $input; 20 if ($input =~ $list) { 21 use Data::Dumper 'Dumper'; 22 warn Dumper \%/; 23 } 24 else { 25 warn "Does not match\n" 26 } 27 }Al ejecutar se obtiene:
casiano@millo:~/Lregexp-grammar-examples$ ./simple_list_empty_first.pl 2 3 4 $VAR1 = { '' => '', 'List' => '' };
Por supuesto basta poner anclas en el patrón a buscar para forzar a que se reconozca la lista completa:
pl@nereida:~/Lregexpgrammars/demo$ diff simple_list_empty_first.pl simple_list_empty_first_with_anchors.pl 7c7 < <List> --- > ^<List>$En efecto, la nueva versión reconoce la lista:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_empty_first_with_anchors.pl 2 3 4 $VAR1 = { '' => '2 3 4', 'List' => { 'List' => { 'List' => { 'List' => '', '' => '4', 'digit' => '4' }, '' => '3 4', 'digit' => '3' }, '' => '2 3 4', 'digit' => '2' } };
Si se quiere mantener la producción vacía en primer lugar pero forzar el reconocimiento de la lista completa, se puede hacer uso de un lookahead negativo:
pl@nereida:~/Lregexpgrammars/demo$ cat -n simple_list_empty_first_with_lookahead.pl 1 #!/soft/perl5lib/bin/perl5.10.1 2 use v5.10; 3 4 use strict; 5 use Regexp::Grammars; 6 7 my $list = qr{ 8 <List> 9 10 <rule: List> 11 (?! <digit> ) # still empty production 12 | <digit> <List> 13 14 <rule: digit> 15 <MATCH=(\d+)> 16 17 }xms; 18 19 while (my $input = <>) { 20 chomp $input; 21 if ($input =~ $list) { 22 use Data::Dumper 'Dumper'; 23 warn Dumper \%/; 24 } 25 else { 26 warn "Does not match\n" 27 } 28 }Así, sólo se reducirá por la regla vacía si el siguiente token no es un número. Sigue un ejemplo de ejecución:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_empty_first_with_lookahead.pl 2 3 4 $VAR1 = { '' => '2 3 4', 'List' => { 'List' => { 'List' => { 'List' => '', '' => '4', 'digit' => '4' }, '' => '3 4', 'digit' => '3' }, '' => '2 3 4', 'digit' => '2' } };
¿Cómo podemos hacer que la estructura retornada por el reconocedor sea una lista?. Podemos añadir acciones como sigue:
casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list_action.pl 1 #!/soft/perl5lib/bin/perl5.10.1 2 use v5.10; 3 4 use Regexp::Grammars; 5 6 my $list = qr{ 7 <List> 8 9 <rule: List> 10 <digit> <X=List> <MATCH= (?{ unshift @{$MATCH{X}}, $MATCH{digit}; $MATCH{X} })> 11 | # empty 12 <MATCH= (?{ [] })> 13 14 <rule: digit> 15 <MATCH=(\d+)> 16 17 }xms; 18 19 while (my $input = <>) { 20 chomp $input; 21 if ($input =~ $list) { 22 use Data::Dumper 'Dumper'; 23 warn Dumper \%/; 24 } 25 else { 26 warn "Does not match\n" 27 } 28 }
Al ejecutarse este programa produce una salida como:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_action.pl 2 3 4 $VAR1 = { '' => '2 3 4', 'List' => [ '2', '3', '4' ] };
Los operadores de repetición como *
, +
, etc.
permiten simplificar el análisis de lenguajes de listas:
pl@nereida:~/Lregexpgrammars/demo$ cat -n simple_list_star.pl 1 #!/soft/perl5lib/bin/perl5.10.1 2 use v5.10; 3 4 use Regexp::Grammars; 5 6 my $list = qr{ 7 <List> 8 9 <rule: List> 10 (?: <[digit]>)* 11 12 <rule: digit> 13 <MATCH=(\d+)> 14 15 }xms; 16 17 while (my $input = <>) { 18 chomp $input; 19 if ($input =~ $list) { 20 use Data::Dumper 'Dumper'; 21 warn Dumper \%/; 22 } 23 else { 24 warn "Does not match\n" 25 } 26 }Los corchetes alrededor de
digit
hacen
que el valor asociado con el patrón sea la lista de números.
Si no los ponemos el valor asociado sería el último valor de la lista.
One of the commonest tasks in text parsing is to match a list of unspecified length, in which items are separated by a fixed token. Things like:
1, 2, 3 , 4 ,13, 91 # Numbers separated by commas and spaces g-c-a-g-t-t-a-c-a # Bases separated by dashes /usr/local/bin # Names separated by directory markers /usr:/usr/local:bin # Directories separated by colons
The usual construct required to parse these kinds of structures is either:
<rule: list> <item> <separator> <list # recursive definition | <item> # base case
Or, more efficiently, but less prettily:
<rule: list> <[item]> (?: <separator> <[item]> )* # iterative definition
Because this is such a common requirement, Regexp::Grammars provides a cleaner way to specify the iterative version. The syntax is taken from Perl 6:
<rule: list> <[item]> ** <separator> # iterative definition
This is a repetition specifier on the first subrule (hence the use of**
as the marker, to reflect the repetitive behaviour of*
). However, the number of repetitions is controlled by the second subrule: the first subrule will be repeatedly matched for as long as the second subrule matches immediately after it.
So, for example, you can match a sequence of numbers separated by commas with:
<[number]> ** <comma> <token: number> \d+ <token: comma> \s* , \s*
Note that it's important to use the <[...]>
form for the items being
matched, so that all of them are saved in the result hash. You can also
save all the separators (if that's important):
<[number]> ** <[comma]>
The repeated item must be specified as a subrule call fo some kind, but the separators may be specified either as a subrule or a bracketed pattern. For example:
<[number]> ** ( , )
The separator must always be specified in matched delimiters of some kind: either matching<...>
or matching(...)
. A common error is to write:
<[number]> ** ,
You can also use a pattern as the item matcher, but it must be aliased into a subrule:
<[item=(\d+)]> ** ( , )
Veamos un ejemplo sencillo:
casiano@millo:~/src/perl/regexp-grammar-examples$ cat -n demo_list.pl 1 #!/soft/perl5lib/bin/perl5.10.1 2 use v5.10; 3 4 use Regexp::Grammars; 5 6 my $list_nonempty = qr{ 7 <List> 8 9 <rule: List> 10 \( <[Value]> ** (,) \) 11 12 <token: Value> 13 \d+ 14 }xms; 15 16 my $list_empty = qr{ 17 <List> 18 19 <rule: List> 20 \( (?: <[Value]> ** <_Sep=(,)> )? \) 21 22 <token: Value> 23 \d+ 24 }xms; 25 26 use Smart::Comments; 27 28 29 while (my $input = <>) { 30 my $input2 = $input; 31 if ($input =~ $list_nonempty) { 32 ### nonempty: $/{List} 33 } 34 if ($input2 =~ $list_empty) { 35 ### empty: $/{List} 36 } 37 }Sigue un ejemplo de ejecución:
casiano@millo:~/src/perl/regexp-grammar-examples$ ./demo_list.pl (3,4,5) ### nonempty: { ### '' => '(3,4,5)', ### Value => [ ### '3', ### '4', ### '5' ### ] ### } ### empty: { ### '' => '(3,4,5)', ### Value => [ ### '3', ### '4', ### '5' ### ] ### } () ### empty: '()'
Las expresiones aritméticas puede definirse como una jerarquía de listas como sigue:
pl@nereida:~/Lregexpgrammars/demo$ cat -n calcaslist.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 $Data::Dumper::Indent = 1; 6 7 my $rbb = do { 8 use Regexp::Grammars; 9 10 qr{ 11 \A<expr>\z 12 13 <objrule: expr> <[operands=term]> ** <[operators=addop]> 14 15 <objrule: term> <[operands=uneg]> ** <[operators=mulop]> 16 17 <objrule: uneg> <[operators=minus]>* <[operands=power]> 18 19 <objrule: power> <[operands=factorial]> ** <[operators=powerop]> 20 21 <objrule: factorial> <[operands=factor]> <[operators=(!)]>* 22 23 <objrule: factor> <val=([+-]?\d+(?:\.\d*)?)> 24 | \( <MATCH=expr> \) 25 26 <token: addop> [+-] 27 28 <token: mulop> [*/] 29 30 <token: powerop> \*\*|\^ 31 32 <token: minus> - <MATCH=(?{ 'NEG' })> 33 34 }x; 35 }; 36 37 while (my $input = <>) { 38 chomp($input); 39 if ($input =~ m{$rbb}) { 40 my $tree = $/{expr}; 41 say Dumper $tree; 42 43 } 44 else { 45 say("does not match"); 46 } 47 }
Obsérvese el árbol generado para la expresión 4-2-2
:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 calcaslist.pl 4-2-2 $VAR1 = bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { '' => '4', 'val' => '4' }, 'factor' ) ], '' => '4' }, 'factorial' ) ], '' => '4' }, 'power' ) ], '' => '4' }, 'uneg' ) ], '' => '4' }, 'term' ), bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { '' => '2', 'val' => '2' }, 'factor' ) ], '' => '2' }, 'factorial' ) ], '' => '2' }, 'power' ) ], '' => '2' }, 'uneg' ) ], '' => '2' }, 'term' ), bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { '' => '2', 'val' => '2' }, 'factor' ) ], '' => '2' }, 'factorial' ) ], '' => '2' }, 'power' ) ], '' => '2' }, 'uneg' ) ], '' => '2' }, 'term' ) ], '' => '4-2-2', 'operators' => [ '-', '-' ] }, 'expr' );
La directiva require
es similar
en su funcionamiento al paréntesis 5.10 (??{ Código Perl })
el cuál hace que el Código Perl
sea evaluado durante el tiempo de matching.
El resultado de la evaluación se trata como una expresión regular con la que deberá
casarse.
(véase la sección 31.2.9
para mas detalles).
La sintáxis de la directiva <require:>
es
<require: (?{ CODE }) >
The code block is executed and if its final value is true, matching continues from the same position. If the block's final value is false, the match fails at that point and starts backtracking.
The <require:...>
directive is useful for testing conditions that it's
not easy (or even possible) to check within the syntax of the the regex
itself. For example:
<rule: IPV4_Octet_Decimal> # Up three digits... <MATCH= ( \d{1,3}+ )> # ...but less that 256... <require: (?{ $MATCH <= 255 })>
A require expects a regex codeblock as its argument and succeeds if the final value of that codeblock is true. If the final value is false, the directive fails and the rule starts backtracking.
Note, in this example that the digits are matched with\d{1,3}+
. The trailing+
prevents the{1,3}
repetition from backtracking to a smaller number of digits if the<require:...>
fails.
El programa demo_IP4.pl
ilustra el uso de la directiva:
pl@nereida:~/Lregexpgrammars/demo$ cat -n ./demo_IP4.pl 1 #!/usr//bin/env perl5.10.1 2 use v5.10; 3 use warnings; 4 5 use Regexp::Grammars; 6 7 my $grammar = qr{ 8 \A <IP4_addr> \Z 9 10 <token: quad> 11 <MATCH=(\d{1,3})> 12 <require: (?{ $MATCH < 256 })> 13 14 <token: IP4_addr> 15 <[MATCH=quad]>**(\.) 16 <require: (?{ @$MATCH == 4 })> 17 }xms; 18 19 while (my $line = <>) { 20 if ($line =~ $grammar) { 21 use Data::Dumper 'Dumper'; 22 say Dumper \%/; 23 } 24 else { 25 say 'Does not match' 26 } 27 }Las condiciones usadas en el
require
obligan a que cada
quad31.9 sea menor que 256 y a que existan sólo cuatro quads.
Sigue un ejemplo de ejecución:
pl@nereida:~/Lregexpgrammars/demo$ ./demo_IP4.pl 123 . 145 . 105 . 252 Does not match pl@nereida:~/Lregexpgrammars/demo$ ./demo_IP4.pl 123.145.105.252 $VAR1 = { '' => '123.145.105.252', 'IP4_addr' => [ 123, 145, 105, 252 ] }; pl@nereida:~/Lregexpgrammars/demo$ ./demo_IP4.pl 148.257.128.128 Does not match 0.0.0.299 Does not match pl@nereida:~/Lregexpgrammars/demo$ ./demo_IP4.pl 123.145.105.242.193 Does not matchObsérvese como no se aceptan blancos entre los puntos en esta versión. ¿Sabría explicar la causa?
In some situations a grammar may need a rule that matches dozens,
hundreds, or even thousands of one-word alternatives. For example, when
matching command names, or valid userids, or English words. In such
cases it is often impractical (and always inefficient) to list all the
alternatives between |
alterators:
<rule: shell_cmd> a2p | ac | apply | ar | automake | awk | ... # ...and 400 lines later ... | zdiff | zgrep | zip | zmore | zsh <rule: valid_word> a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ... # ...and 40,000 lines later... ... | zymotize | zymotoxic | zymurgy | zythem | zythum
To simplify such cases, Regexp::Grammars provides a special construct that allows you to specify all the alternatives as the keys of a normal hash. The syntax for that construct is simply to put the hash name inside angle brackets (with no space between the angles and the hash name).
Which means that the rules in the previous example could also be written:
<rule: shell_cmd> <%cmds> <rule: valid_word> <%dict>
provided that the two hashes (%cmds
and%dict
) are visible in the scope where the grammar is created.
Internally, the construct is converted to something equivalent to:
<rule: shell_cmd> (<.hk>) <require: exists $cmds{$CAPTURE}> <rule: valid_word> (<.hk>) <require: exists $dict{$CAPTURE}>
The special<hk>
rule is created automatically, and defaults to\S+
, but you can also define it explicitly to handle other kinds of keys. For example:
<rule: hk> .+ # Key may be any number of chars on a single line <rule: hk> [ACGT]{10,} # Key is a base sequence of at least 10 pairs
Matching a hash key in this way is typically significantly faster than matching a full set of alternations. Specifically, it is O(length of longest potential key), instead of O(number of keys).
Sigue un ejemplo:
pl@nereida:~/Lregexpgrammars/demo$ cat -n hash.pl 1 #!/usr/bin/env perl5.10.1 2 use strict; 3 use warnings; 4 use 5.010; 5 use Data::Dumper; 6 $Data::Dumper::Deparse = 1; 7 8 my %cmd = map { ($_ => undef ) } qw( uname pwd date ); 9 10 my $rbb = do { 11 use Regexp::Grammars; 12 13 qr{ 14 ^<command>$ 15 16 <rule: command> 17 <cmd=%cmd> (?: <[arg]> )* 18 19 <token: arg> [^\s<>`&]+ 20 }xms; 21 }; 22 23 while (my $input = <>) { 24 chomp($input); 25 if ($input =~ m{$rbb}) { 26 say("matches: <$&>"); 27 say Dumper \%/; 28 system $/{''} 29 } 30 else { 31 say("does not match"); 32 } 33 }
Sigue un ejemplo de ejecución:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 hash.pl a2p f1 f2 matches: <a2p f1 f2> $VAR1 = { '' => 'a2p f1 f2', 'command' => { '' => 'a2p f1 f2', 'cmd' => 'a2p', 'arg' => [ 'f1', 'f2' ] } }; pocho 2 5 does not match
Regexp::Grammars provides a number of features specifically designed to help debug both grammars and the data they parse.
All debugging messages are written to a log file (which, by default, is justSTDERR
). However, you can specify a disk file explicitly by placing a"<logfile:...>"
directive at the start of your grammar31.10:
$grammar = qr{ <logfile: LaTeX_parser_log > \A <LaTeX_file> \Z # Pattern to match <rule: LaTeX_file> # etc. }x;
You can also explicitly specify that messages go to the terminal:
<logfile: - >
Whenever a log file has been directly specified, Regexp::Grammars
automatically does verbose static analysis of your grammar. That is,
whenever it compiles a grammar containing an explicit "<logfile:...>"
directive it logs a series of messages explaining how it has interpreted
the various components of that grammar. For example, the following
grammar:
pl@nereida:~/Lregexpgrammars/demo$ cat -n log.pl 1 #!/usr/bin/env perl5.10.1 2 use strict; 3 use warnings; 4 use 5.010; 5 use Data::Dumper; 6 7 my $rbb = do { 8 use Regexp::Grammars; 9 10 qr{ 11 <logfile: -> 12 13 <numbers> 14 15 <rule: numbers> 16 <number> ** <.comma> 17 18 <token: number> \d+ 19 20 <token: comma> , 21 }xms; 22 }; 23 24 while (my $input = <>) { 25 if ($input =~ m{$rbb}) { 26 say("matches: <$&>"); 27 say Dumper \%/; 28 } 29 }
would produce the following analysis in the terminal:
pl@nereida:~/Lregexpgrammars/demo$ ./log.pl warn | Repeated subrule <number>* will only capture its final match | (Did you mean <[number]>* instead?) | info | Processing the main regex before any rule definitions | | | |...Treating <numbers> as: | | | match the subrule <numbers> | | \ saving the match in $MATCH{'numbers'} | | | \___End of main regex | | Defining a rule: <numbers> | |...Returns: a hash | | | |...Treating <number> as: | | | match the subrule <number> | | \ saving the match in $MATCH{'number'} | | | |...Treating <.comma> as: | | | match the subrule <comma> | | \ but don't save anything | | | |...Treating <number> ** <.comma> as: | | | repeatedly match the subrule <number> | | \ as long as the matches are separated by matches of <.comma> | | | \___End of rule definition | | Defining a rule: <number> | |...Returns: a hash | | | |...Treating '\d' as: | | \ normal Perl regex syntax | | | |...Treating '+ ' as: | | \ normal Perl regex syntax | | | \___End of rule definition | | Defining a rule: <comma> | |...Returns: a hash | | | |...Treating ', ' as: | | \ normal Perl regex syntax | | | \___End of rule definition | 2, 3, 4 matches: <2, 3, 4> $VAR1 = { '' => '2, 3, 4', 'numbers' => { '' => '2, 3, 4', 'number' => '4' } };
This kind of static analysis is a useful starting point in debugging a miscreant grammar31.11, because it enables you to see what you actually specified (as opposed to what you thought you'd specified).
Regexp::Grammars also provides a simple interactive debugger, with which you can observe the process of parsing and the data being collected in any result-hash.
To initiate debugging, place a <debug:...>
directive anywhere in your
grammar. When parsing reaches that directive the debugger will be
activated, and the command specified in the directive immediately
executed. The available commands are:
<debug: on> - Enable debugging, stop when entire grammar matches <debug: match> - Enable debugging, stope when a rule matches <debug: try> - Enable debugging, stope when a rule is tried <debug: off> - Disable debugging and continue parsing silently <debug: continue> - Synonym for <debug: on> <debug: run> - Synonym for <debug: on> <debug: step> - Synonym for <debug: try>
These directives can be placed anywhere within a grammar and take effect
when that point is reached in the parsing. Hence, adding a
<debug:step>
directive is very much like setting a breakpoint at that
point in the grammar. Indeed, a common debugging strategy is to turn
debugging on and off only around a suspect part of the grammar:
<rule: tricky> # This is where we think the problem is... <debug:step> <preamble> <text> <postscript> <debug:off>
Once the debugger is active, it steps through the parse, reporting rules that are tried, matches and failures, backtracking and restarts, and the parser's location within both the grammar and the text being matched. That report looks like this:
===============> Trying <grammar> from position 0 > cp file1 file2 |...Trying <cmd> | |...Trying <cmd=(cp)> | | \FAIL <cmd=(cp)> | \FAIL <cmd> \FAIL <grammar> ===============> Trying <grammar> from position 1 cp file1 file2 |...Trying <cmd> | |...Trying <cmd=(cp)> file1 file2 | | \_____<cmd=(cp)> matched 'cp' file1 file2 | |...Trying <[file]>+ file2 | | \_____<[file]>+ matched 'file1' | |...Trying <[file]>+ [eos] | | \_____<[file]>+ matched ' file2' | |...Trying <[file]>+ | | \FAIL <[file]>+ | |...Trying <target> | | |...Trying <file> | | | \FAIL <file> | | \FAIL <target> <~~~~~~~~~~~~~~ | |...Backtracking 5 chars and trying new match file2 | |...Trying <target> | | |...Trying <file> | | | \____ <file> matched 'file2' [eos] | | \_____<target> matched 'file2' | \_____<cmd> matched ' cp file1 file2' \_____<grammar> matched ' cp file1 file2'
The first column indicates the point in the input at which the parser is trying to match, as well as any backtracking or forward searching it may need to do. The remainder of the columns track the parser's hierarchical traversal of the grammar, indicating which rules are tried, which succeed, and what they match.
Provided the logfile is a terminal (as it is by default), the debugger also pauses at various points in the parsing process-before trying a rule, after a rule succeeds, or at the end of the parse-according to the most recent command issued. When it pauses, you can issue a new command by entering a single letter:
m - to continue until the next subrule matches t or s - to continue until the next subrule is tried r or c - to continue to the end of the grammar o - to switch off debugging
Note that these are the first letters of the corresponding <debug:...>
commands, listed earlier. Just hitting ENTER while the debugger is
paused repeats the previous command.
While the debugger is paused you can also type a d
, which will display
the result-hash for the current rule. This can be useful for detecting
which rule isn't returning the data you expected.
Veamos un ejemplo. El siguiente programa activa el depurador:
pl@nereida:~/Lregexpgrammars/demo$ cat -n demo_debug.pl 1 #!/usr/bin/env perl5.10.1 2 use 5.010; 3 use warnings; 4 5 use Regexp::Grammars; 6 7 my $balanced_brackets = qr{ 8 <debug:on> 9 10 <left_delim=( \( )> 11 (?: 12 <[escape=( \\ )]> 13 | <recurse=( (?R) )> 14 | <[simple=( . )]> 15 )* 16 <right_delim=( \) )> 17 }xms; 18 19 while (<>) { 20 if (/$balanced_brackets/) { 21 say 'matched:'; 22 use Data::Dumper 'Dumper'; 23 warn Dumper \%/; 24 } 25 }Al ejecutar obtenemos
pl@nereida:~/Lregexpgrammars/demo$ ./demo_debug.pl (a) =====> Trying <grammar> from position 0 (a)\n |...Trying <left_delim=( \( )> a)\n | _____<left_delim=( \( )> matched '(' c |...Trying <[escape=( \ )]> | \FAIL <[escape=( \ )]> |...Trying <recurse=( (?R) )> =====> Trying <grammar> from position 1 a)\n | |...Trying <left_delim=( \( )> | | \FAIL <left_delim=( \( )> \FAIL <grammar> |...Trying <[simple=( . )]> )\n | _____<[simple=( . )]> matched 'a' |...Trying <[escape=( \ )]> | \FAIL <[escape=( \ )]> |...Trying <recurse=( (?R) )> =====> Trying <grammar> from position 2 )\n | |...Trying <left_delim=( \( )> | | \FAIL <left_delim=( \( )> \FAIL <grammar> |...Trying <[simple=( . )]> \n | _____<[simple=( . )]> matched ')' |...Trying <[escape=( \ )]> | \FAIL <[escape=( \ )]> |...Trying <recurse=( (?R) )> =====> Trying <grammar> from position 3 \n | |...Trying <left_delim=( \( )> | | \FAIL <left_delim=( \( )> \FAIL <grammar> |...Trying <[simple=( . )]> [eos] | _____<[simple=( . )]> matched '' |...Trying <[escape=( \ )]> | \FAIL <[escape=( \ )]> |...Trying <recurse=( (?R) )> =====> Trying <grammar> from position 4 [eos] | |...Trying <left_delim=( \( )> | | \FAIL <left_delim=( \( )> \FAIL <grammar> |...Trying <[simple=( . )]> | \FAIL <[simple=( . )]> |...Trying <right_delim=( \) )> | \FAIL <right_delim=( \) )> <~~~~ |...Backtracking 1 char and trying new match \n |...Trying <right_delim=( \) )> | \FAIL <right_delim=( \) )> <~~~~ |...Backtracking 1 char and trying new match )\n |...Trying <right_delim=( \) )> \n | _____<right_delim=( \) )> matched ')' _____<grammar> matched '(a)' d : { : '' => '(a)', : 'left_delim' => '(', : 'simple' => [ : 'a' : ], : 'right_delim' => ')' : }; o matched: $VAR1 = { '' => '(a)', 'left_delim' => '(', 'simple' => [ 'a' ], 'right_delim' => ')' };
Both static and interactive debugging send a series of predefined log
messages to whatever log file you have specified. It is also possible to
send additional, user-defined messages to the log, using the "<log:...>"
directive.
This directive expects either a simple text or a codeblock as its single argument. If the argument is a code block, that code is expected to return the text of the message; if the argument is anything else, that something else is the literal message. For example:
<rule: ListElem> <Elem= ( [a-z]\d+) > <log: Checking for a suffix, too...> <Suffix= ( : \d+ ) >? <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>
User-defined log messages implemented using a codeblock can also specify
a severity level. If the codeblock of a <log:...>
directive returns
two or more values, the first is treated as a log message severity
indicator, and the remaining values as separate lines of text to be
logged. For example:
<rule: ListElem> <Elem= ( [a-z]\d+) > <Suffix= ( : \d+ ) >? <log: (?{ warn => "Elem was: $MATCH{Elem}", "Suffix was $MATCH{Suffix}", })>
When they are encountered, user-defined log messages are interspersed between any automatic log messages (i.e. from the debugger), at the correct level of nesting for the current rule.
It is possible to use Regexp::Grammars without creating any subrule definitions, simply to debug a recalcitrant regex. For example, if the following regex wasn't working as expected:
my $balanced_brackets = qr{ \( # left delim (?: \\ # escape or | (?R) # recurse or | . # whatever )* \) # right delim }xms;
you could instrument it with aliased subpatterns and then debug it step-by-step, using Regexp::Grammars:
use Regexp::Grammars; my $balanced_brackets = qr{ <debug:step> <.left_delim= ( \( )> (?: <.escape= ( \\ )> | <.recurse= ( (?R) )> | <.whatever=( . )> )* <.right_delim= ( \) )> }xms; while (<>) { say 'matched' if /$balanced_brackets/; }
Note the use of amnesiac aliased subpatterns to avoid needlessly building a result-hash. Alternatively, you could use listifying aliases to preserve the matching structure as an additional debugging aid:
use Regexp::Grammars; my $balanced_brackets = qr{ <debug:step> <[left_delim= ( \( )]> (?: <[escape= ( \\ )]> | <[recurse= ( (?R) )]> | <[whatever=( . )]> )* <[right_delim= ( \) )]> }xms; if ( '(a(bc)d)' =~ /$balanced_brackets/) { use Data::Dumper 'Dumper'; warn Dumper \%/; }
En este punto debo decir que no he podido reproducir el comportamiento
de las directivas <error:>
y <warning:>
tal y como las describe
Conway en el manual de Regexp::Grammars.
El siguiente ejemplo ilustra un conjunto de técnicas de gestión de errores que son independientes del soprote dado por Regexp::Grammars.
Se trata de la misma calculadora explicada en la sección 31.10.18.
pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n calculatorwitherrmanagement.pl 1 #!/usr/bin/env perl5.10.1 2 use strict; 3 use warnings; 4 use 5.010; 5 use Lingua::EN::Inflect qw(PL); 6 use Scalar::Util qw{blessed}; 7 8 my $rbb = do { 9 my ($warnings, $errors); # closure 10 sub warnings { $warnings } # accessor 11 sub errors { $errors } # accessor 12 13 use Regexp::Grammars; 14 qr{ 15 (?{ 16 $warnings = 0; 17 $errors = 0; 18 }) 19 \A<expr> 20 (?: \z 21 | 22 (.*) (?{ 23 # Accept the string but emit a warning 24 $warnings++; 25 local our $expr = \$MATCH{expr}{''}; 26 local our $endlegal = length($$expr) > 4? "... ".substr($$expr, -4) : $$expr; 27 warn "Warning: Unexpected '". substr($^N, 0, 10)."' after '$endlegal'\n"; 28 }) 29 ) 30 31 <objrule: expr> <[operands=term]> ** <[operators=addop]> 32 33 <objrule: term> <[operands=uneg]> ** <[operators=mulop]> 34 35 <objrule: uneg> <[operators=minus]>* <[operands=power]> 36 37 <objrule: power> <[operands=factorial]> ** <[operators=powerop]> 38 39 <objrule: factorial> <[operands=factor]> <[operators=(!)]>* 40 41 <objrule: factor> (<val=([+-]?\d+(?:\.\d*)?)>) 42 | \( <MATCH=expr> \) 43 | ([^-+(0-9]+) (?{ 44 # is + and not * to avoid infinite recursion 45 warn "Error: expecting a number or a open parenthesis, found: '". substr($^N, 0, 10)."'\n"; 46 $warnings++; 47 $errors++; 48 }) <MATCH=factor> 49 50 <token: addop> [+-] 51 52 <token: mulop> [*/] 53 54 <token: powerop> \*\*|\^ 55 56 <token: minus> - <MATCH=(?{ 'NEG' })> 57 58 }x; 59 }; 60 61 sub test_calc { 62 my $prompt = shift; 63 64 print $prompt; 65 while (my $input = <>) { 66 chomp($input); 67 68 local %/; 69 $input =~ m{$rbb}; 70 71 say warnings." ".PL('warning',warnings) if warnings; 72 say errors." ".PL('error',errors) if errors; 73 74 my $tree = $/{expr}; 75 if (blessed($tree)) { 76 do "PostfixCalc.pm"; 77 say "postfix: ".$tree->ceval; 78 79 do "EvalCalc.pm"; 80 say "result: ".$tree->ceval; 81 } 82 print $prompt; 83 } 84 say "Bye!" 85 } 86 87 ########## main 88 test_calc( 89 'Parsing infix arithmetic expressions (CTRL-D to end in unix) ', 90 );
Veamos algunas ejecuciones que incluyen entradas erróneas:
pl@nereida:~/Lregexpgrammars/demo/calculator$ ./calculatorwitherrmanagement.pl Parsing infix arithmetic expressions (CTRL-D to end in unix) 2+3 postfix: 2 3 + result: 5 Parsing infix arithmetic expressions (CTRL-D to end in unix) 2*(3+#) Error: expecting a number or a open parenthesis, found: '#)' Error: expecting a number or a open parenthesis, found: '#' Error: expecting a number or a open parenthesis, found: ')' Warning: Unexpected '*(3+#)' after '2' 4 warnings 3 errors postfix: 2 result: 2 Parsing infix arithmetic expressions (CTRL-D to end in unix) 2+#*4 Error: expecting a number or a open parenthesis, found: '#*' 1 warning 1 error postfix: 2 4 + result: 6 Parsing infix arithmetic expressions (CTRL-D to end in unix) Bye!Obsérvese los mensajes de error repetidos para la entrada
2*(3+#)
. Ellos son debidos a los reiterados intentos de
casar <factor>
en la regla de recuperación de errores:
41 <objrule: factor> (<val=([+-]?\d+(?:\.\d*)?)>) 42 | \( <MATCH=expr> \) 43 | ([^-+(0-9]+) (?{ 44 # is + and not * to avoid infinite recursion 45 warn "Error: expecting a number or a open parenthesis, found: '". substr($^N, 0, 10)."'\n"; 46 $warnings++; 47 $errors++; 48 }) <MATCH=factor>en este caso resulta imposible encontrar un factor. Se puede cambiar la conducta indicando un
(* COMMIT)
antes de la
llamada a <MATCH=factor>
:
41 <objrule: factor> (<val=([+-]?\d+(?:\.\d*)?)>) 42 | \( <MATCH=expr> \) 43 | ([^-+(0-9]+) (?{ 44 # is + and not * to avoid infinite recursion 45 warn "Error: expecting a number or a open parenthesis, found: '". substr($^N, 0, 10)."'\n"; 46 $warnings++; 47 $errors++; 48 }) (*COMMIT) <MATCH=factor>
en este caso la conducta es abandonar en el caso de que no se pueda encontrar un <factor>
:
pl@nereida:~/Lregexpgrammars/demo/calculator$ ./calculatorwitherrmanagement.pl Parsing infix arithmetic expressions (CTRL-D to end in unix) 2*(3+#) Error: expecting a number or a open parenthesis, found: '#)' 1 warning 1 error Parsing infix arithmetic expressions (CTRL-D to end in unix) 2*3 postfix: 2 3 * result: 6 Parsing infix arithmetic expressions (CTRL-D to end in unix) @ Error: expecting a number or a open parenthesis, found: '@' 1 warning 1 error Parsing infix arithmetic expressions (CTRL-D to end in unix) Bye!
Sometimes, you want to detect problems, but not invalidate the entire
parse as a result. For those occasions, the module provides a less
stringent form of error reporting: the <warning:...>
directive.
This directive is exactly the same as an <error:...>
in every respect
except that it does not induce a failure to match at the point it
appears.
The directive is, therefore, useful for reporting non-fatal problems in a parse. For example:
qr{ \A # ...Match only at start of input <ArithExpr> # ...Match a valid arithmetic expression (?: # Should be at end of input... \s* \Z | # If not, report the fact but don't fail... <warning: Expected end-of-input> <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })> ) # Rule definitions here... }xms;
Note that, because they do not induce failure, two or more
<warning:...>
directives can be "stacked" in sequence, as in the
previous example.
pl@nereida:~/Lregexpgrammars/demo$ cat -n exprdamian.pl 1 use strict; 2 use warnings; 3 use 5.010; 4 use Data::Dumper; 5 $Data::Dumper::Indent = 1; 6 7 my $rbb = do { 8 use Regexp::Grammars; 9 10 qr{ 11 \A<expr>\z 12 13 <objrule: expr> <MATCH=term> (?! <addop> ) # bypass 14 | <[operands=term]> ** <[operators=addop]> 15 16 <objrule: term> <MATCH=factor> (?! <mulop> ) # bypass 17 | <[operands=factor]> ** <[operators=mulop]> 18 19 <objrule: factor> <val=([+-]?\d+(?:\.\d*)?)> 20 | \( <MATCH=expr> \) 21 22 <token: addop> [+-] 23 24 <token: mulop> [*/] 25 26 }x; 27 }; 28 29 while (my $input = <>) { 30 chomp($input); 31 if ($input =~ m{$rbb}) { 32 my $tree = $/{expr}; 33 say Dumper $tree; 34 say $tree->ceval; 35 36 } 37 else { 38 say("does not match"); 39 } 40 } 41 42 BEGIN { 43 44 package LeftBinaryOp; 45 use strict; 46 use base qw(Class::Accessor); 47 48 LeftBinaryOp->mk_accessors(qw{operators operands}); 49 50 my %f = ( 51 '+' => sub { shift() + shift() }, 52 '-' => sub { shift() - shift() }, 53 '*' => sub { shift() * shift() }, 54 '/' => sub { shift() / shift() }, 55 ); 56 57 sub ceval { 58 my $self = shift; 59 60 # recursively evaluate the children first 61 my @operands = map { $_->ceval } @{$self->operands}; 62 63 # then combine them 64 my $s = shift @operands; 65 for (@{$self->operators}) { 66 $s = $f{$_}->($s, shift @operands); 67 } 68 return $s; 69 } 70 71 package term; 72 use base qw{LeftBinaryOp}; 73 74 package expr; 75 use base qw{LeftBinaryOp}; 76 77 package factor; 78 79 sub ceval { 80 my $self = shift; 81 82 return $self->{val}; 83 } 84 85 1; 86 }
Ejecuciones:
pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 exprdamian.pl 4-2-2 $VAR1 = bless( { 'operands' => [ bless( { '' => '4', 'val' => '4' }, 'factor' ), bless( { '' => '2', 'val' => '2' }, 'factor' ), bless( { '' => '2', 'val' => '2' }, 'factor' ) ], '' => '4-2-2', 'operators' => [ '-', '-' ] }, 'expr' ); 0 8/4/2 $VAR1 = bless( { 'operands' => [ bless( { '' => '8', 'val' => '8' }, 'factor' ), bless( { '' => '4', 'val' => '4' }, 'factor' ), bless( { '' => '2', 'val' => '2' }, 'factor' ) ], '' => '8/4/2', 'operators' => [ '/', '/' ] }, 'term' ); 1 3 $VAR1 = bless( { '' => '3', 'val' => '3' }, 'factor' ); 3 2*(3+4) $VAR1 = bless( { 'operands' => [ bless( { '' => '2', 'val' => '2' }, 'factor' ), bless( { 'operands' => [ bless( { '' => '3', 'val' => '3' }, 'factor' ), bless( { '' => '4', 'val' => '4' }, 'factor' ) ], '' => '3+4', 'operators' => [ '+' ] }, 'expr' ) ], '' => '2*(3+4)', 'operators' => [ '*' ] }, 'term' ); 14
El siguiente programa calculator.pl
recibe como entrada una expresión en infijo.
La ejecución consta de dos bucles. En la primera parte se inyecta a la jerarquía de clases de los AST generados para las expresiones en infijo una semántica que permite evaluar la expresión:
58 require EvalCalc; 59 60 test_calc( 61 'Evaluating infix arithmetic expressions (CTRL-D to end in unix) ', 62 sub { print &Data::Dumper::Dumper(shift()) }, 63 );En esta primera parte mostraremos además el AST construido para la expresión infija de entrada.
pl@nereida:~/Lregexpgrammars/demo$ ./calculator.pl Evaluating infix arithmetic expressions (CTRL-D to end in unix) 8-4-2 $VAR1 = bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { '' => '8', 'val' => '8' }, 'factor' ) ], '' => '8' }, 'factorial' ) ], '' => '8' }, 'power' ) ], '' => '8' }, 'uneg' ) ], '' => '8' }, 'term' ), bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { '' => '4', 'val' => '4' }, 'factor' ) ], '' => '4' }, 'factorial' ) ], '' => '4' }, 'power' ) ], '' => '4' }, 'uneg' ) ], '' => '4' }, 'term' ), bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { 'operands' => [ bless( { '' => '2', 'val' => '2' }, 'factor' ) ], '' => '2' }, 'factorial' ) ], '' => '2' }, 'power' ) ], '' => '2' }, 'uneg' ) ], '' => '2' }, 'term' ) ], '' => '8-4-2', 'operators' => [ '-', '-' ] }, 'expr' ); 2Observamos que la asociatividad es la correcta. El
2
final es el resultado de la evaluación de
8-4-2
.
La estructura del árbol se corresponde con la de la gramática:
8 my $rbb = do { 9 use Regexp::Grammars; 10 11 qr{ 12 \A<expr>\z 13 14 <objrule: expr> <[operands=term]> ** <[operators=addop]> 15 16 <objrule: term> <[operands=uneg]> ** <[operators=mulop]> 17 18 <objrule: uneg> <[operators=minus]>* <[operands=power]> 19 20 <objrule: power> <[operands=factorial]> ** <[operators=powerop]> 21 22 <objrule: factorial> <[operands=factor]> <[operators=(!)]>* 23 24 <objrule: factor> <val=([+-]?\d+(?:\.\d*)?)> 25 | \( <MATCH=expr> \) 26 27 <token: addop> [+-] 28 29 <token: mulop> [*/] 30 31 <token: powerop> \*\*|\^ 32 33 <token: minus> - <MATCH=(?{ 'NEG' })> 34 35 }x; 36 };
Ahora, en una segunda parte sobreescribimos los métodos
sem
que describen la semántica para producir una traducción
de infijo a postfijo:
66 require PostfixCalc; 67 test_calc('Translating expressions to postfix (CTRL-D to end in unix) ');Ahora al proporcionar la entrada
6--3!
obtenemos:
Translating expressions to postfix (CTRL-D to end in unix) 6--3! 6 3 ! ~ -Aquí
~
es el operador de negación unaria y !
es el operador
factorial.
Estos son los ficheros que integran la aplicación:
pl@nereida:~/Lregexpgrammars/demo/calculator$ tree . |-- EvalCalc.pm # Soporte para la evaluación de la expresión: sem |-- Operator.pm # Soporte a las clases nodo: recorridos |-- PostfixCalc.pm # Soporte para la traducción a postfijo: sem `-- calculator.pl # programa principal
En el programa principal definimos la gramática
y escribimos una subrutina test_calc
que realiza el parsing.
pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n calculator.pl 1 #!/usr/bin/env perl5.10.1 2 use strict; 3 use warnings; 4 use 5.010; 5 use Data::Dumper; 6 $Data::Dumper::Indent = 1; 7 8 my $rbb = do { 9 use Regexp::Grammars; 10 11 qr{ 12 \A<expr>\z 13 14 <objrule: expr> <[operands=term]> ** <[operators=addop]> 15 16 <objrule: term> <[operands=uneg]> ** <[operators=mulop]> 17 18 <objrule: uneg> <[operators=minus]>* <[operands=power]> 19 20 <objrule: power> <[operands=factorial]> ** <[operators=powerop]> 21 22 <objrule: factorial> <[operands=factor]> <[operators=(!)]>* 23 24 <objrule: factor> <val=([+-]?\d+(?:\.\d*)?)> 25 | \( <MATCH=expr> \) 26 27 <token: addop> [+-] 28 29 <token: mulop> [*/] 30 31 <token: powerop> \*\*|^ 32 33 <token: minus> - <MATCH=(?{ 'NEG' })> 34 35 }x; 36 }; 37 38 sub test_calc { 39 my $prompt = shift; 40 my $handler = shift; 41 42 say $prompt; 43 while (my $input = <>) { 44 chomp($input); 45 if ($input =~ m{$rbb}) { 46 my $tree = $/{expr}; 47 $handler->($tree) if $handler; 48 49 say $tree->ceval; 50 51 } 52 else { 53 say("does not match"); 54 } 55 } 56 } 57 58 require EvalCalc; 59 60 test_calc( 61 'Evaluating infix arithmetic expressions (CTRL-D to end in unix) ', 62 sub { print &Data::Dumper::Dumper(shift()) }, 63 ); 64 65 66 require PostfixCalc; 67 test_calc('Translating expressions to postfix (CTRL-D to end in unix) ');
Los nodos del AST poseen un método ceval
que se encarga de
realizar la traducción del nodo.
pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n Operator.pm 1 # Class hierarchy diagram: 2 # $ vgg -t 'Operator(LeftBinaryOp(expr,term),RightBinaryOp(power),PreUnaryOp(uneg),PostUnaryOp(factorial))' 3 # +--------+ 4 # |Operator| 5 # +--------+ 6 # .---------------.----^--------.-------------. 7 # +------------+ +-------------+ +----------+ +-----------+ 8 # |LeftBinaryOp| |RightBinaryOp| |PreUnaryOp| |PostUnaryOp| 9 # +------------+ +-------------+ +----------+ +-----------+ 10 # .---^--. | | | 11 # +----+ +----+ +-----+ +----+ +---------+ 12 # |expr| |term| |power| |uneg| |factorial| 13 # +----+ +----+ +-----+ +----+ +---------+ 14 # 15 # 16 # NOTE: package "factor" actually implements numbers and is 17 # outside this hierarchy 18 # 19 package Operator; 20 use strict; 21 use Carp; 22 23 sub Operands { 24 my $self = shift; 25 26 return () unless exists $self->{operands}; 27 return @{$self->{operands}}; 28 } 29 30 sub Operators { 31 my $self = shift; 32 33 return () unless exists $self->{operators}; 34 return @{$self->{operators}}; 35 } 36 37 sub sem { 38 confess "not defined sem"; 39 } 40 41 sub make_sem { 42 my $class = shift; 43 my %semdesc = @_; 44 45 for my $class (keys %semdesc) { 46 my %sem = %{$semdesc{$class}}; 47 48 # Install 'sem' method in $class 49 no strict 'refs'; 50 no warnings 'redefine'; 51 *{$class."::sem"} = sub { 52 my ($self, $op) = @_; 53 $sem{$op} 54 }; 55 } 56 } 57 58 package LeftBinaryOp; 59 use base qw{Operator}; 60 61 sub ceval { 62 my $self = shift; 63 64 # recursively evaluate the children first 65 my @operands = map { $_->ceval } $self->Operands; 66 67 # then combine them 68 my $s = shift @operands; 69 for ($self->Operators) { 70 $s = $self->sem($_)->($s, shift @operands); 71 } 72 return $s; 73 } 74 75 package RightBinaryOp; 76 use base qw{Operator}; 77 78 sub ceval { 79 my $self = shift; 80 81 # recursively evaluate the children first 82 my @operands = map { $_->ceval } $self->Operands; 83 84 # then combine them 85 my $s = pop @operands; 86 for (reverse $self->Operators) { 87 $s = $self->sem($_)->(pop @operands, $s); 88 } 89 return $s; 90 } 91 92 package PreUnaryOp; 93 use base qw{Operator}; 94 95 sub ceval { 96 my $self = shift; 97 98 # recursively evaluate the children first 99 my @operands = map { $_->ceval } $self->Operands; 100 101 # then combine them 102 my $s = shift @operands; 103 for (reverse $self->Operators) { 104 $s = $self->sem($_)->($s); 105 } 106 return $s; 107 } 108 109 package PostUnaryOp; 110 use base qw{Operator}; 111 112 sub ceval { 113 my $self = shift; 114 115 # recursively evaluate the children first 116 my @operands = map { $_->ceval } $self->Operands; 117 118 # then combine them 119 my $s = shift @operands; 120 for ($self->Operators) { 121 $s = $self->sem($_)->($s); 122 } 123 return $s; 124 } 125 126 package term; 127 use base qw{LeftBinaryOp}; 128 129 package expr; 130 use base qw{LeftBinaryOp}; 131 132 package power; 133 use base qw{RightBinaryOp}; 134 135 package uneg; 136 use base qw{PreUnaryOp}; 137 138 package factorial; 139 use base qw{PostUnaryOp}; 140 141 package factor; 142 143 sub ceval { 144 my $self = shift; 145 146 return $self->{val}; 147 } 148 149 1;
pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n EvalCalc.pm 1 package EvalCalc; 2 use strict; 3 use Carp; 4 5 use Operator; 6 7 #### 8 sub f { 9 $_[0]>1?$_[0]*f($_[0]-1):1; 10 } 11 12 sub fac { 13 my $n = shift; 14 15 confess "Not valid number" unless $n =~ /^\d+$/; 16 f($n); 17 }; 18 19 my $s = sub { shift() ** shift() }; 20 21 Operator->make_sem( 22 expr => { 23 '+' => sub { shift() + shift() }, 24 '-' => sub { shift() - shift() }, 25 }, 26 term => { 27 '*' => sub { shift() * shift() }, 28 '/' => sub { shift() / shift() }, 29 }, 30 power => { 31 '^' => $s, 32 '**' => $s, 33 }, 34 uneg => { 35 'NEG' => sub { -shift() }, 36 }, 37 factorial => { 38 '!' => \&fac, 39 }, 40 ); 41 42 1;
pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n PostfixCalc.pm 1 package PostfixCalc; 2 use strict; 3 4 use Operator; 5 6 # Modify semantics: now translate to postfix 7 my $powers = sub { shift().' '.shift().' **' }; 8 9 Operator->make_sem( 10 expr => { 11 '+' => sub { shift().' '.shift().' +' }, 12 '-' => sub { shift().' '.shift().' -' }, 13 }, 14 term => { 15 '*' => sub { shift().' '.shift().' *' }, 16 '/' => sub { shift().' '.shift().' /' }, 17 }, 18 power => { 19 '^' => $powers, 20 '**' => $powers, 21 }, 22 uneg => { 23 # use ~ for unary minus 24 'NEG' => sub { shift().' ~' }, 25 }, 26 factorial => { 27 '!' => sub { shift().' !'}, 28 }, 29 ); 30 31 1;
pl@nereida:~/Lregexpgrammars/demo$ cat -n calculator.pl 1 #!/usr/bin/env perl5.10.1
$handler
en test_calc
:
42 sub test_calc { 43 my $prompt = shift; 44 my $handler = shift; 45 46 say $prompt; 47 while (my $input = <>) { 48 chomp($input); 49 if ($input =~ m{$rbb}) { 50 my $tree = $/{expr}; 51 $handler->($tree) if $handler; 52 53 say $tree->ceval; 54 55 } 56 else { 57 say("does not match"); 58 } 59 } 60 }
make_sem
, fac
y las llamadas a make_sem
en un módulo Calculator::Semantics
aparte.
2*3+4
se traducirá como + * 2 3 4
Calculator
.
Mueva el módulo Operator
a Calculator::Operator
.
Lea el capítulo Modulos
de los apuntes de LHP.
TODO
. Cuando la funcionalidad
a comprobar esté operativa cambie su estatus.
b = a = 4*2que será traducida a postfijo como:
4 2 * a = b =El operador de asignación es asociativo a derechas. El valor devuelto por una expresión de asignación es el valor asignado.
Use un hash para implantar la relación nombre-valor en el caso de la evaluación
c = { a = 4; b = 2*a }
Los bloques son listas entre llaves de expresiones separadas por punto y coma. El valor retornado por una expresión bloque es el último evaluado en el bloque.
El símbolo de arranque de la gramática (esto es, el patrón regular contra el que hay que casar) será la expresión bloque.
<
, >
, <=
, >=
, ==
y !=
con la prioridad adecuada. Tenga en cuenta que una expresión como:
a = b+2 > c*4deberá entenderse como
a = ((b+2) > (c*4))Esto es, se traducirá como:
b 2 + c 4 * > a =
if ... then ... else
. La parte del else
será opcional:
c = if a > 0 then { a = a -1; 2*a } else { b + 2 }; d = if a > 0 then { a = b -1; 2*b };un
else
casa con el if
mas cercano.
La sentencia:
if (a > 0) then if (b > 0) then {5} else {6}se interpreta como:
if (a > 0) then (if (b > 0) then {5} else {6})y no como:
if (a > 0) then (if (b > 0) then {5}) else {6}Se traducirá como:
a 0 > jz endif124 b 0 > jz else125 5 j endif126 :else125 6 :endif124 :endif125 ...
Casiano Rodríguez León