Subsecciones

Análisis Sintáctico con `Regexp::Grammars`

El módulo Regexp::Grammars escrito por Damian Conway extiende las expresiones regulares Perl con la capacidad de generar representaciones del árbol de análisis sintáctico abstracto y obviando la necesidad de explicitar los blancos. El módulo necesita para funcionar una versión de Perl superior o igual a la 5.10.

Introducción

El Problema

La documentación de Regexp::Grammars establece cual es el problema que aborda el módulo:

...Perl5.10 makes possible to use regexes to recognize complex, hierarchical-and even recursive-textual structures. The problem is that Perl 5.10 doesn’t provide any support for extracting that hierarchical data into nested data structures. In other words, using Perl 5.10 you can match complex data, but not parse it into an internally useful form.

An additional problem when using Perl 5.10 regexes to match complex data formats is that you have to make sure you remember to insert whitespace- matching constructs (such as \s*) at every possible position where the data might contain ignorable whitespace. This reduces the readability of such patterns, and increases the chance of errors (typically caused by overlooking a location where whitespace might appear).

Una solución: `Regexp::Grammars`

The Regexp::Grammars module solves both those problems.

If you import the module into a particular lexical scope, it preprocesses any regex in that scope, so as to implement a number of extensions to the standard Perl 5.10 regex syntax. These extensions simplify the task of defining and calling subrules within a grammar, and allow those subrule calls to capture and retain the components of they match in a proper hierarchical manner.

La sintaxis de una expresión regular `Regexp::Grammars`

Las expresiones regulares Regexp::Grammars aumentan las regexp Perl 5.10. La sintáxis se expande y se modifica:

A Regexp::Grammars specification consists of a pattern (which may include both standard Perl 5.10 regex syntax, as well as special Regexp::Grammars directives), followed by one or more rule or token definitions.

Sigue un ejemplo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n balanced_brackets.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8      qr{
     9        (<pp>)
    10
    11        <rule: pp>   \( (?: [^()]*+ | <escape> | <pp> )* \)
    12
    13        <token: escape> \\.
    14
    15      }xs;
    16  };
    17
    18  while (my $input = <>) {
    19      while ($input =~ m{$rbb}g) {
    20          say("matches: <$&>");
    21          say Dumper \%/;
    22      }
    23  }

Note that there is no need to explicitly place \s* subpatterns throughout the rules; that is taken care of automatically.

...

The initial pattern ((<pp>)) acts like the top rule of the grammar, and must be matched completely for the grammar to match.

The rules and tokens are declarations only and they are not directly matched. Instead, they act like subroutines, and are invoked by name from the initial pattern (or from within a rule or token).

Each rule or token extends from the directive that introduces it up to either the next rule or token directive, or (in the case of the final rule or token) to the end of the grammar.

El hash `%/`: Una representación del AST

Al ejecutar el programa anterior con entrada (2*(3+5))*4+(2-3) produce:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 balanced_brackets.pl
(2*(3+5))*4+(2-3)
matches: <(2*(3+5))>
$VAR1 = {
          '' => '(2*(3+5))',
          'pp' => {
                    '' => '(2*(3+5))',
                    'pp' => '(3+5)'
                  }
        };

matches: <(2-3)>
$VAR1 = {
          '' => '(2-3)',
          'pp' => '(2-3)'
        };

Each rule calls the subrules specified within it, and then return a hash containing whatever result each of those subrules returned, with each result indexed by the subrule’s name.

In this way, each level of the hierarchical regex can generate hashes recording everything its own subrules matched, so when the entire pattern matches, it produces a tree of nested hashes that represent the structured data the pattern matched.

...

In addition each result-hash has one extra key: the empty string. The value for this key is whatever string the entire subrule call matched.

Diferencias entre `token` y `rule`

The difference between a token and a rule is that a token treats any whitespace within it exactly as a normal Perl regular expression would. That is, a sequence of whitespace in a token is ignored if the /x modifier is in effect, or else matches the same literal sequence of whitespace characters (if /x is not in effect).

En el ejemplo anterior el comportamiento es el mismo si se reescribe la regla para el token escape como:

    13        <rule: escape> \\.

En este otro ejemplo mostramos que la diferencia entre token y rule es significativa:

pl@nereida:~/Lregexpgrammars/demo$ cat -n tokenvsrule.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8      qr{
     9        <s>
    10
    11        <rule: s> <a> <c>
    12
    13        <rule: c>  c d
    14
    15        <token: a>  a b
    16
    17      }xs;
    18  };
    19
    20  while (my $input = <>) {
    21      if ($input =~ m{$rbb}) {
    22          say("matches: <$&>");
    23          say Dumper \%/;
    24      }
    25      else {
    26          say "Does not match";
    27      }
    28  }

Al ejecutar este programa vemos la diferencia en la interpretación de los blancos:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 tokenvsrule.pl
ab c d
matches: <ab c d>
$VAR1 = {
          '' => 'ab c d',
          's' => {
                   '' => 'ab c d',
                   'c' => 'c d',
                   'a' => 'ab'
                 }
        };

a b c d
Does not match
ab cd
matches: <ab cd>
$VAR1 = {
          '' => 'ab cd',
          's' => {
                   '' => 'ab cd',
                   'c' => 'cd',
                   'a' => 'ab'
                 }
        };

Obsérvese como la entrada a b c d es rechazada mientras que la entrada ab c d es aceptada.

Redefinición de los espacios en blanco

In a rule, any sequence of whitespace (except those at the very start and the very end of the rule) is treated as matching the implicit subrule <.ws>, which is automatically predefined to match optional whitespace (i.e. \s*).

You can explicitly define a <ws> token to change that default behaviour. For example, you could alter the definition of whitespace to include Perlish comments, by adding an explicit <token: ws>:

                      <token: ws>
                         (?: \s+ | #[^\n]* )*

But be careful not to define <ws> as a rule, as this will lead to all kinds of infinitely recursive unpleasantness.

El siguiente ejemplo ilustra como redefinir <ws>:

pl@nereida:~/Lregexpgrammars/demo$ cat -n tokenvsruleandws.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8      no warnings 'uninitialized';
 9      qr{
10        <s>
11
12        <token: ws> (?: \s+ | /\* .*? \*/)*+
13
14        <rule: s> <a> <c>
15
16        <rule: c>  c d
17
18        <token: a>  a b
19
20      }xs;
21  };
22
23  while (my $input = <>) {
24      if ($input =~ m{$rbb}) {
25          say Dumper \%/;
26      }
27      else {
28          say "Does not match";
29      }
30  }

Ahora podemos introducir comentarios en la entrada:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 -w tokenvsruleandws.pl
ab /* 1 */ c d
$VAR1 = {
          '' => 'ab /* 1 */ c d',
          's' => {
                   '' => 'ab /* 1 */ c d',
                   'c' => 'c d',
                   'a' => 'ab'
                 }
        };

Llamando a las subreglas

To invoke a rule to match at any point, just enclose the rule’s name in angle brackets (like in Perl 6). There must be no space between the opening bracket and the rulename. For example:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               <name>            # Call <rule: name>
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

If you need to match a literal pattern that would otherwise look like a subrule call, just backslash-escape the leading angle:

           qr{
               file:             # Match literal sequence 'f' 'i' 'l' 'e' ':'
               \<name>           # Match literal sequence '<' 'n' 'a' 'm' 'e' '>'
               <options>?        # Call <rule: options> (it's okay if it fails)

               <rule: name>
                   # etc.
           }x;

El siguiente programa ilustra algunos puntos discutidos en la cita anterior:

casiano@millo:~/src/perl/regexp-grammar-examples$ cat -n badbracket.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8      qr{
 9        (<pp>)
10
11        <rule: pp>   \( (?: <b  > | \< | < escape> | <pp> )* \)
12
13        <token: b  > b
14
15        <token: escape> \\.
16
17      }xs;
18  };
19
20  while (my $input = <>) {
21      while ($input =~ m{$rbb}g) {
22          say("matches: <$&>");
23          say Dumper \%/;
24      }
25  }

Obsérvense los blancos en < escape> y en <token: b > b. Pese a ello el programa funciona:

casiano@millo:~/src/perl/regexp-grammar-examples$ perl5.10.1 badbracket.pl
(\(\))
matches: <(\(\))>
$VAR1 = {
          '' => '(\\(\\))',
          'pp' => {
                    '' => '(\\(\\))',
                    'escape' => '\\)'
                  }
        };

(b)
matches: <(b)>
$VAR1 = {
          '' => '(b)',
          'pp' => {
                    '' => '(b)',
                    'b' => 'b'
                  }
        };

(<)
matches: <(<)>
$VAR1 = {
          '' => '(<)',
          'pp' => '(<)'
        };

(c)

casiano@millo:

Eliminación del anidamiento de ramas unarias en `%/`

...Note, however, that if the result-hash at any level contains only the empty-string key (i.e. the subrule did not call any sub-subrules or save any of their nested result-hashes), then the hash is unpacked and just the matched substring itself if returned.

For example, if <rule: sentence> had been defined:

    <rule: sentence>
        I see dead people

then a successful call to the rule would only add:

    sentence => 'I see dead people'

to the current result-hash.

This is a useful feature because it prevents a series of nested subrule calls from producing very unwieldy data structures. For example, without this automatic unpacking, even the simple earlier example:

    <rule: sentence>
        <noun> <verb> <object>

would produce something needlessly complex, such as:

    sentence => {
        ""     => 'I saw a dog',
        noun   => {
            "" => 'I',
        },
        verb   => {
            "" => 'saw',
        },
        object => {
            ""      => 'a dog',
            article => {
                "" => 'a',
            },
            noun    => {
                "" => 'dog',
            },
        },
    }

El siguiente ejemplo ilustra este punto:

pl@nereida:~/Lregexpgrammars/demo$ cat -n unaryproductions.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8      qr{
 9        <s>
10
11        <rule: s> <noun> <verb> <object>
12
13        <token: noun> he | she | Peter | Jane
14
15        <token: verb> saw | sees
16
17        <token: object> a\s+dog | a\s+cat
18
19      }x;
20  };
21
22  while (my $input = <>) {
23      while ($input =~ m{$rbb}g) {
24          say("matches: <$&>");
25          say Dumper \%/;
26      }
27  }

Sigue una ejecución del programa anterior:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 unaryproductions.pl
he saw a dog
matches: <he saw a dog>
$VAR1 = {
          '' => 'he saw a dog',
          's' => {
                   '' => 'he saw a dog',
                   'object' => 'a dog',
                   'verb' => 'saw',
                   'noun' => 'he'
                 }
        };

Jane sees a cat
matches: <Jane sees a cat>
$VAR1 = {
          '' => 'Jane sees a cat',
          's' => {
                   '' => 'Jane sees a cat',
                   'object' => 'a cat',
                   'verb' => 'sees',
                   'noun' => 'Jane'
                 }
        };

Ámbito de uso de `Regexp::Grammars`

Cuando se usa Regexp::Grammars como parte de un programa que utiliza otras regexes hay que evitar que Regexp::Grammars procese las mismas. Regexp::Grammars reescribe las expresiones regulares durante la fase de preproceso. Esta por ello presenta las mismas limitaciones que cualquier otra forma de 'source filtering' (véase perlfilter). Por ello es una buena idea declarar la gramática en un bloque do restringiendo de esta forma el ámbito de acción del módulo.

 5  my $calculator = do{
 6      use Regexp::Grammars;
 7      qr{
 .          ........
28      }xms
29  };

Objetos

When a grammar has parsed successfully, the %/ variable will contain a series of nested hashes (and possibly arrays) representing the hierarchical structure of the parsed data.

Typically, the next step is to walk that tree, extracting or converting or otherwise processing that information. If the tree has nodes of many different types, it can be difficult to build a recursive subroutine that can navigate it easily.

A much cleaner solution is possible if the nodes of the tree are proper objects. In that case, you just define a trasnlate() method for each of the classes, and have every node call that method on each of its children. The chain of translate() calls would cascade down the nodes of the tree, each one invoking the appropriate translate() method according to the type of node encountered.

The only problem is that, by default, Regexp::Grammars returns a tree of plain-old hashes, not Class::Whatever objects. Fortunately, it's easy to request that the result hashes be automatically blessed into the appropriate classes, using the <objrule:...> and <objtoken:...> directives.

These directives are identical to the <rule:...> and <token:...> directives (respectively), except that the rule or token they create will also bless the hash it normally returns, converting it to an object of a class whose name is the same as the rule or token itself.

For example:

    <objrule: Element>
        # ...Defines a rule that can be called as <Element>
        # ...and which returns a hash-based Element object

The IDENTIFIER of the rule or token may also be fully qualified. In such cases, the rule or token is defined using only the final short name, but the result object is blessed using the fully qualified long name. For example:

    <objrule: LaTeX::Element> 
        # ...Defines a rule that can be called as <Element>
        # ...and which returns a hash-based LaTeX::Element object

This can be useful to ensure that returned objects don't collide with other namespaces in your program.

Note that you can freely mix object-returning and plain-old-hash-returning rules and tokens within a single grammar, though you have to be careful not to subsequently try to call a method on any of the unblessed nodes.

Renombrando los resultados de una subregla

Nombre de la regla versus Nombre del Resultado

No siempre el nombre de la regla es el mas apropiado para ser el nombre del resultado:

It is not always convenient to have subrule results stored under the same name as the rule itself. Rule names should be optimized for understanding the behaviour of the parser, whereas result names should be optimized for understanding the structure of the data. Often those two goals are identical, but not always; sometimes rule names need to describe what the data looks like, while result names need to describe what the data means.

Colisión de nombres de reglas

For example, sometimes you need to call the same rule twice, to match two syntactically identical components whose positions give then semantically distinct meanings:

    <rule: copy_cmd>
        copy <file> <file>

The problem here is that, if the second call to <file> succeeds, its result-hash will be stored under the key file, clobbering the data that was returned from the first call to <file>.

Aliasing

To avoid such problems, Regexp::Grammars allows you to alias any subrule call, so that it is still invoked by the original name, but its result-hash is stored under a different key. The syntax for that is: <alias=rulename>. For example:

    <rule: copy_cmd>
        copy <from=file> <to=file>

Here, <rule: file> is called twice, with the first result-hash being stored under the key from, and the second result-hash being stored under the key to.

Note, however, that the alias before the = must be a proper identifier (i.e. a letter or underscore, followed by letters, digits, and/or underscores). Aliases that start with an underscore and aliases named MATCH have special meaning.

Normalización de los resultados mediante aliasing

Aliases can also be useful for normalizing data that may appear in different formats and sequences. For example:

    <rule: copy_cmd>
        copy <from=file>        <to=file>
      | dup    <to=file>  as  <from=file>
      |      <from=file>  ->    <to=file>
      |        <to=file>  <-  <from=file>

Here, regardless of which order the old and new files are specified, the result-hash always gets:

    copy_cmd => {
        from => 'oldfile',
          to => 'newfile',
    }

Ejemplo

El siguiente programa ilustra los comentarios de la documentación:

pl@nereida:~/Lregexpgrammars/demo$ cat -n copygrammar.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8      qr{
     9        <copy_cmd>
    10
    11        <rule: copy_cmd>
    12              copy <from=file> <to=file>
    13          |   <from=file> ->   <to=file>
    14          |   <to=file>   <- <from=file>
    15
    16        <token: file> [\w./\\]+
    17      }x;
    18  };
    19
    20  while (my $input = <>) {
    21      while ($input =~ m{$rbb}g) {
    22          say("matches: <$&>");
    23          say Dumper \%/;
    24      }
    25  }

Cuando lo ejecutamos obtenemos:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 copygrammar.pl
copy a b
matches: <copy a b>
$VAR1 = {
          '' => 'copy a b',
          'copy_cmd' => {
                          '' => 'copy a b',
                          'to' => 'b',
                          'from' => 'a'
                        }
        };

b <- a
matches: <b <- a>
$VAR1 = {
          '' => 'b <- a',
          'copy_cmd' => {
                          '' => 'b <- a',
                          'to' => 'b',
                          'from' => 'a'
                        }
        };

a -> b
matches: <a -> b>
$VAR1 = {
          '' => 'a -> b',
          'copy_cmd' => {
                          '' => 'a -> b',
                          'to' => 'b',
                          'from' => 'a'
                        }
        };

Listas

El operador de cierre positivo

If a subrule call is quantified with a repetition specifier:

           <rule: file_sequence>
               <file>+

then each repeated match overwrites the corresponding entry in the surrounding rule’s result-hash, so only the result of the final repetition will be retained. That is, if the above example matched the string foo.pl bar.py baz.php, then the result-hash would contain:

           file_sequence {
               ""   => 'foo.pl bar.py baz.php',
               file => 'baz.php',
           }

Operadores de listas y espacios en blanco

Existe un caveat con el uso de los operadores de repetición y el manejo de los blancos. Véase el siguiente programa:

pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers3.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8
 9      qr{
10        <numbers>
11
12        <rule: numbers>
13          (<number>)+
14
15        <token: number> \s*\d+
16      }xms;
17  };
18
19  while (my $input = <>) {
20      if ($input =~ m{$rbb}) {
21          say("matches: <$&>");
22          say Dumper \%/;
23      }
24  }

Obsérvese el uso explícito de espacios \s*\d+ en la definición de number.

Sigue un ejemplo de ejecución:

pl@nereida:~/Lregexpgrammars/demo$ perl5_10_1 numbers3.pl
1 2 3 4
matches: <1 2 3 4>
$VAR1 = {
          '' => '1 2 3 4',
          'numbers' => {
                         '' => '1 2 3 4',
                         'number' => ' 4'
                       }
        };

Si se eliminan los blancos de la definición de number:

pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5  
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8  
     9      qr{
    10        <numbers>
    11  
    12        <rule: numbers> 
    13          (<number>)+
    14  
    15        <token: number> \d+
    16      }xms;
    17  };
    18  
    19  while (my $input = <>) {
    20      if ($input =~ m{$rbb}) {
    21          say("matches: <$&>");
    22          say Dumper \%/;
    23      }
    24  }

se obtiene una conducta que puede sorprender:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numbers.pl 
12 34 56
matches: <12>
$VAR1 = {
          '' => '12',
          'numbers' => {
                         '' => '12',
                         'number' => '12'
                       }
        };

La explicación está en la documentación: véase la sección Grammar Syntax:

<rule: IDENTIFIER>

Define a rule whose name is specified by the supplied identifier.

Everything following the <rule:...> directive (up to the next <rule:...> or <token:...> directive) is treated as part of the rule being defined.

Any whitespace in the rule is replaced by a call to the <.ws> subrule (which defaults to matching \s*, but may be explicitly redefined).

También podríamos haber resuelto el problema introduciendo un blanco explícito dentro del cierre positivo:

      <rule: numbers>
        (<number> )+

      <token: number> \d+

Una Solución al problema de recordar los resultados de una lista: El uso de brackets

Usually, that’s not the desired outcome, so Regexp::Grammars provides another mechanism by which to call a subrule; one that saves all repetitions of its results.

A regular subrule call consists of the rule’s name surrounded by angle brackets. If, instead, you surround the rule’s name with <[...]> (angle and square brackets) like so:

           <rule: file_sequence>
               <[file]>+

then the rule is invoked in exactly the same way, but the result of that submatch is pushed onto an array nested inside the appropriate result-hash entry. In other words, if the above example matched the same foo.pl bar.py baz.php string, the result-hash would contain:

           file_sequence {
               ""   => 'foo.pl bar.py baz.php',
               file => [ 'foo.pl', 'bar.py', 'baz.php' ],
           }

Teniendo en cuenta lo dicho anteriormente sobre los blancos dentro de los cuantificadores, es necesario introducir blancos dentro del operador de repetición:

pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers4.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8
     9      qr{
    10        <numbers>
    11
    12        <rule: numbers>
    13          (?:  <[number]> )+
    14
    15        <token: number> \d+
    16      }xms;
    17  };
    18
    19  while (my $input = <>) {
    20      if ($input =~ m{$rbb}) {
    21          say("matches: <$&>");
    22          say Dumper \%/;
    23      }
    24  }

Al ejecutar este programa obtenemos:

pl@nereida:~/Lregexpgrammars/demo$ perl5_10_1 numbers4.pl
1 2 3 4
matches: <1 2 3 4
>
$VAR1 = {
          '' => '1 2 3 4
',
          'numbers' => {
                         '' => '1 2 3 4
',
                         'number' => [ '1', '2', '3', '4' ]
                       }
        };

Otra forma de resolver las colisiones de nombres: salvarlos en una lista

This listifying subrule call can also be useful for non-repeated subrule calls, if the same subrule is invoked in several places in a grammar. For example if a cmdline option could be given either one or two values, you might parse it:

    <rule: size_option>   
        -size <[size]> (?: x <[size]> )?

The result-hash entry for size would then always contain an array, with either one or two elements, depending on the input being parsed.

Sigue un ejemplo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n sizes.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8
 9      qr{
10        <command>
11
12        <rule: command> ls <size_option>
13
14        <rule: size_option>
15            -size <[size]> (?: x <[size]> )?
16
17        <token: size> \d+
18      }x;
19  };
20
21  while (my $input = <>) {
22      while ($input =~ m{$rbb}g) {
23          say("matches: <$&>");
24          say Dumper \%/;
25      }
26  }

Veamos su comportamiento con diferentes entradas:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 sizes.pl
ls -size 4
matches: <ls -size 4
>
$VAR1 = {
          '' => 'ls -size 4
',
          'command' => {
                         'size_option' => {
                                            '' => '-size 4
',
                                            'size' => [ '4' ]
                                          },
                         '' => 'ls -size 4
'
                       }
        };

ls -size 2x8
matches: <ls -size 2x8
>
$VAR1 = {
          '' => 'ls -size 2x8
',
          'command' => {
                         'size_option' => {
                                            '' => '-size 2x8
',
                                            'size' => [ '2', '8' ]
                                          },
                         '' => 'ls -size 2x8
'
                       }
        };

Aliasing de listas

Listifying subrules can also be given aliases, just like ordinary subrules. The alias is always specified inside the square brackets:

    <rule: size_option>   
        -size <[size=pos_integer]> (?: x <[size=pos_integer]> )?

Here, the sizes are parsed using the pos_integer rule, but saved in the result-hash in an array under the key size.

Sigue un ejemplo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n aliasedsizes.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8
 9      qr{
10        <command>
11
12        <rule: command> ls <size_option>
13
14        <rule: size_option>
15            -size <[size=int]> (?: x <[size=int]> )?
16
17        <token: int> \d+
18      }x;
19  };
20
21  while (my $input = <>) {
22      while ($input =~ m{$rbb}g) {
23          say("matches: <$&>");
24          say Dumper \%/;
25      }
26  }

Veamos el resultado de una ejecución:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 aliasedsizes.pl
ls -size 2x4
matches: <ls -size 2x4
>
$VAR1 = {
          '' => 'ls -size 2x4
',
          'command' => {
                         'size_option' => {
                                            '' => '-size 2x4
',
                                            'size' => [
                                                        '2',
                                                        '4'
                                                      ]
                                          },
                         '' => 'ls -size 2x4
'
                       }
        };

Caveat: Cierres y Warnings

En este ejemplo aparece <number>+ sin corchetes ni paréntesis:

pl@nereida:~/Lregexpgrammars/demo$ cat -n numbers5.pl 
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5  
     6  my $rbb = do {
     7      use Regexp::Grammars;
     8  
     9      qr{
    10        <numbers>
    11  
    12        <rule: numbers> 
    13          <number>+
    14  
    15        <token: number> \d+
    16      }xms;
    17  };
    18  
    19  while (my $input = <>) {
    20      if ($input =~ m{$rbb}) {
    21          say("matches: <$&>");
    22          say Dumper \%/;
    23      }
    24  }

Este programa produce un mensaje de advertencia:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numbers5.pl 
  warn | Repeated subrule <number>+ will only capture its final match
       | (Did you mean <[number]>+ instead?)
       |

Si se quiere evitar el mensaje y se está dispuesto a asumir la pérdida de los valores asociados con los elementos de la lista se deberán poner el operando entre paréntesis (con o sin memoria).

Esto es lo que dice la documentación sobre este warning:

Repeated subrule <rule> will only capture its final match

You specified a subrule call with a repetition qualifier, such as:

        <ListElem>*

or:

        <ListElem>+

Because each subrule call saves its result in a hash entry of the same name, each repeated match will overwrite the previous ones, so only the last match will ultimately be saved. If you want to save all the matches, you need to tell Regexp::Grammars to save the sequence of results as a nested array within the hash entry, like so:

        <[ListElem]>*

or:

        <[ListElem]>+

If you really did intend to throw away every result but the final one, you can silence the warning by placing the subrule call inside any kind of parentheses. For example:

        (<ListElem>)*

or:

        (?: <ListElem> )+

Pseudo sub-reglas

Subpatrones

Aliases can also be given to standard Perl subpatterns, as well as to code blocks within a regex. The syntax for subpatterns is:

    <ALIAS= (SUBPATTERN) >

In other words, the syntax is exactly like an aliased subrule call, except that the rule name is replaced with a set of parentheses containing the subpattern. Any parentheses-capturing or non-capturing-will do.

The effect of aliasing a standard subpattern is to cause whatever that subpattern matches to be saved in the result-hash, using the alias as its key. For example:

    <rule: file_command>

        <cmd=(mv|cp|ln)>  <from=file>  <to=file>

Here, the <cmd=(mv|cp|ln)> is treated exactly like a regular (mv|cp|ln), but whatever substring it matches is saved in the result-hash under the key 'cmd'.

Sigue un ejemplo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n subpattern.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8
 9      qr{
10          <file_command>
11
12          <rule: file_command>
13
14          <cmd=(mv|cp|ln)>  <from=([\w./]+)>  <to=([\w./]+)>
15
16      }x;
17  };
18
19  while (my $input = <>) {
20      while ($input =~ m{$rbb}g) {
21          say("matches: <$&>");
22          say Dumper \%/;
23      }
24  }

y una ejecución:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 subpattern.pl
mv a b
matches: <mv a b>
$VAR1 = {
          '' => 'mv a b',
          'file_command' => {
                              '' => 'mv a b',
                              'to' => 'b',
                              'cmd' => 'mv',
                              'from' => 'a'
                            }
        };

cp c d
matches: <cp c d>
$VAR1 = {
          '' => 'cp c d',
          'file_command' => {
                              '' => 'cp c d',
                              'to' => 'd',
                              'cmd' => 'cp',
                              'from' => 'c'
                            }
        }

Bloques de código

The syntax for aliasing code blocks is:

    <ALIAS= (?{ your($code->here) }) >

Note, however, that the code block must be specified in the standard Perl 5.10 regex notation: (?{...}). A common mistake is to write:

    <ALIAS= { your($code->here } >

instead, which will attempt to interpolate $code before the regex is even compiled, as such variables are only protected from interpolation inside a (?{...}).

When correctly specified, this construct executes the code in the block and saves the result of that execution in the result-hash, using the alias as its key. Aliased code blocks are useful for adding semantic information based on which branch of a rule is executed. For example, consider the copy_cmd alternatives shown earlier:

    <rule: copy_cmd>
        copy <from=file>        <to=file>
      | dup    <to=file>  as  <from=file>
      |      <from=file>  ->    <to=file>
      |        <to=file>  <-  <from=file>

Using aliased code blocks, you could add an extra field to the result- hash to describe which form of the command was detected, like so:

    <rule: copy_cmd>
        copy <from=file>        <to=file>  <type=(?{ 'std' })> 
      | dup    <to=file>  as  <from=file>  <type=(?{ 'rev' })> 
      |      <from=file>  ->    <to=file>  <type=(?{ 'fwd' })> 
      |        <to=file>  <-  <from=file>  <type=(?{ 'bwd' })>

Now, if the rule matched, the result-hash would contain something like:

    copy_cmd => {
        from => 'oldfile',
          to => 'newfile',
        type => 'fwd',
    }

El siguiente ejemplo ilustra lo dicho en la documentación. En la línea 15 hemos introducido una regla para el control de errores^31.8:

pl@nereida:~/Lregexpgrammars/demo$ cat -n aliasedcodeblock2.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5
 6  my $rbb = do {
 7      use Regexp::Grammars;
 8      qr{
 9        <copy_cmd>
10
11        <rule: copy_cmd>
12              copy (<from=file>) (<to=file>) <type=(?{ 'std' })>
13          |   <from=file> ->   <to=file> <type=(?{ 'fwd' })>
14          |   <to=file>   <- <from=file> <type=(?{ 'bwd' })>
15          |   .+ (?{ die "Syntax error!\n" })
16
17        <token: file> [\w./\\]+
18      }x;
19  };
20
21  while (my $input = <>) {
22      while ($input =~ m{$rbb}g) {
23          say("matches: <$&>");
24          say Dumper \%/;
25      }
26  }

La ejecución muestra el comportamiento del programa con tres entradas válidas y una errónea:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 aliasedcodeblock2.pl
copy a b
matches: <copy a b
>
$VAR1 = {
          '' => 'copy a b
',
          'copy_cmd' => {
                          '' => 'copy a b
',
                          'to' => 'b',
                          'from' => 'a',
                          'type' => 'std'
                        }
        };

b <- a
matches: <b <- a
>
$VAR1 = {
          '' => 'b <- a
',
          'copy_cmd' => {
                          '' => 'b <- a
',
                          'to' => 'b',
                          'from' => 'a',
                          'type' => 'bwd'
                        }
        };

a -> b
matches: <a -> b
>
$VAR1 = {
          '' => 'a -> b
',
          'copy_cmd' => {
                          '' => 'a -> b
',
                          'to' => 'b',
                          'from' => 'a',
                          'type' => 'fwd'
                        }
        };

cp a b
Syntax error!

Pseudo subreglas y depuración

Note that, in addition to the semantics described above, aliased subpatterns and code blocks also become visible to Regexp::Grammars integrated debugger (see Debugging).

Llamadas a subreglas desmemoriadas

By default, every subrule call saves its result into the result-hash, either under its own name, or under an alias.

However, sometimes you may want to refactor some literal part of a rule into one or more subrules, without having those submatches added to the result-hash. The syntax for calling a subrule, but ignoring its return value is:

    <.SUBRULE>

(which is stolen directly from Perl 6).

For example, you may prefer to rewrite a rule such as:

    <rule: paren_pair> 

        \( 
            (?: <escape> | <paren_pair> | <brace_pair> | [^()] )*
        \)

without any literal matching, like so:

    <rule: paren_pair> 

        <.left_paren>
            (?: <escape> | <paren_pair> | <brace_pair> | <.non_paren> )*
        <.right_paren>
    
    <token: left_paren>   \(
    <token: right_paren>  \)
    <token: non_paren>    [^()]

Moreover, as the individual components inside the parentheses probably aren't being captured for any useful purpose either, you could further optimize that to:

    <rule: paren_pair> 

        <.left_paren>
            (?: <.escape> | <.paren_pair> | <.brace_pair> | <.non_paren> )*
        <.right_paren>

Note that you can also use the dot modifier on an aliased subpattern:

    <.Alias= (SUBPATTERN) >

This seemingly contradictory behaviour (of giving a subpattern a name, then deliberately ignoring that name) actually does make sense in one situation. Providing the alias makes the subpattern visible to the debugger, while using the dot stops it from affecting the result-hash. See Debugging non-grammars for an example of this usage.

Ejemplo: Números entre comas

Por ejemplo, queremos reconocer listas de números separados por comas. Supongamos también que queremos darle un nombre a la expresión regular de separación. Quizá, aunque no es el caso, porque la expresión regular de separación sea suficientemente compleja. Si no usamos la notación punto la coma aparecerá en la estructura:

pl@nereida:~/Lregexpgrammars/demo$ cat -n numberscomma.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5  $Data::Dumper::Indent = 1;
     6
     7  my $rbb = do {
     8      use Regexp::Grammars;
     9
    10      qr{
    11        <numbers>
    12
    13        <objrule: numbers>
    14          <[number]> (<comma> <[number]>)*
    15
    16        <objtoken: number> \s*\d+
    17        <token: comma>  \s*,
    18      }xms;
    19  };
    20
    21  while (my $input = <>) {
    22      if ($input =~ m{$rbb}) {
    23          say("matches: <$&>");
    24          say Dumper \%/;
    25      }
    26  }

En efecto, aparece la clave comma:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numberscomma.pl
2, 3, 4
matches: <2, 3, 4>
$VAR1 = {
  '' => '2, 3, 4',
  'numbers' => bless( {
    '' => '2, 3, 4',
    'number' => [
      bless( { '' => '2' }, 'number' ),
      bless( { '' => '3' }, 'number' ),
      bless( { '' => '4' }, 'number' )
    ],
    'comma' => ','
  }, 'numbers' )
};

Si cambiamos la llamada a la regla <comma> por <.comma>

pl@nereida:~/Lregexpgrammars/demo$ diff numberscomma.pl numberscomma2.pl
14c14
<         <[number]> (<comma> <[number]>)*
---
>         <[number]> (<.comma> <[number]>)*

eliminamos la aparición de la innecesaria clave:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 numberscomma2.pl
2, 3, 4
matches: <2, 3, 4>
$VAR1 = {
  '' => '2, 3, 4',
  'numbers' => bless( {
    '' => '2, 3, 4',
    'number' => [
      bless( { '' => '2' }, 'number' ),
      bless( { '' => '3' }, 'number' ),
      bless( { '' => '4' }, 'number' )
    ]
  }, 'numbers' )
};

Destilación del resultado

Destilación manual

Regexp::Grammars also offers full manual control over the distillation process. If you use the reserved word MATCH as the alias for a subrule call:

    <MATCH=filename>

or a subpattern match:

    <MATCH=( \w+ )>

or a code block:

    <MATCH=(?{ 42 })>

then the current rule will treat the return value of that subrule, pattern, or code block as its complete result, and return that value instead of the usual result-hash it constructs. This is the case even if the result has other entries that would normally also be returned.

For example, in a rule like:

    <rule: term>
          <MATCH=literal>
        | <left_paren> <MATCH=expr> <right_paren>

The use of MATCH aliases causes the rule to return either whatever <literal> returns, or whatever <expr> returns (provided it's between left and right parentheses).

Note that, in this second case, even though <left_paren> and <right_paren> are captured to the result-hash, they are not returned, because the MATCH alias overrides the normal return the result-hash semantics and returns only what its associated subrule (i.e. <expr>) produces.

El siguiente ejemplo ilustra el uso del alias MATCH:

$ cat -n demo_calc.pl
 1  #!/usr/local/lib/perl/5.10.1/bin/perl5.10.1
 2  use v5.10;
 3  use warnings;
 4
 5  my $calculator = do{
 6      use Regexp::Grammars;
 7      qr{
 8          <Answer>
 9
10          <rule: Answer>
11              <X=Mult> <Op=([+-])> <Y=Answer>
12            | <MATCH=Mult>
13
14          <rule: Mult>
15              <X=Pow> <Op=([*/%])> <Y=Mult>
16            | <MATCH=Pow>
17
18          <rule: Pow>
19              <X=Term> <Op=(\^)> <Y=Pow>
20            | <MATCH=Term>
21
22          <rule: Term>
23                 <MATCH=Literal>
24            | \( <MATCH=Answer> \)
25
26          <token: Literal>
27              <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
28      }xms
29  };
30
31  while (my $input = <>) {
32      if ($input =~ $calculator) {
33          use Data::Dumper 'Dumper';
34          warn Dumper \%/;
35      }
36  }

Veamos una ejecución:

$ ./demo_calc.pl
2+3*5
$VAR1 = {
          '' => '2+3*5',
          'Answer' => {
                        '' => '2+3*5',
                        'Op' => '+',
                        'X' => '2',
                        'Y' => {
                                 '' => '3*5',
                                 'Op' => '*',
                                 'X' => '3',
                                 'Y' => '5'
                               }
                      }
        };
4-5-2
$VAR1 = {
          '' => '4-5-2',
          'Answer' => {
                        '' => '4-5-2',
                        'Op' => '-',
                        'X' => '4',
                        'Y' => {
                                 '' => '5-2',
                                 'Op' => '-',
                                 'X' => '5',
                                 'Y' => '2'
                               }
                      }
        };

Obsérvese como el árbol construido para la expresión 4-5-2 se hunde a derechas dando lugar a una jerarquía errónea. Para arreglar el problema sería necesario eliminar la recursividad por la izquierda en las reglas correspondientes.

Destilación en el programa

It's also possible to control what a rule returns from within a code block. Regexp::Grammars provides a set of reserved variables that give direct access to the result-hash.

The result-hash itself can be accessed as %MATCH within any code block inside a rule. For example:

    <rule: sum> 
        <X=product> \+ <Y=product>
            <MATCH=(?{ $MATCH{X} + $MATCH{Y} })>

Here, the rule matches a product (aliased 'X' in the result-hash), then a literal '+', then another product (aliased to 'Y' in the result-hash). The rule then executes the code block, which accesses the two saved values (as $MATCH{X} and $MATCH{Y}), adding them together. Because the block is itself aliased to MATCH, the sum produced by the block becomes the (only) result of the rule.

It is also possible to set the rule result from within a code block (instead of aliasing it). The special override return value is represented by the special variable $MATCH. So the previous example could be rewritten:

    <rule: sum> 
        <X=product> \+ <Y=product>
            (?{ $MATCH = $MATCH{X} + $MATCH{Y} })

Both forms are identical in effect. Any assignment to $MATCH overrides the normal return all subrule results behaviour.

Assigning to $MATCH directly is particularly handy if the result may not always be distillable, for example:

    <rule: sum> 
        <X=product> \+ <Y=product>
            (?{ if (!ref $MATCH{X} && !ref $MATCH{Y}) {
                    # Reduce to sum, if both terms are simple scalars...
                    $MATCH = $MATCH{X} + $MATCH{Y};
                }
                else {
                    # Return full syntax tree for non-simple case...
                    $MATCH{op} = '+';
                }
            })

Note that you can also partially override the subrule return behaviour. Normally, the subrule returns the complete text it matched under the empty key of its result-hash. That is, of course, $MATCH{""}, so you can override just that behaviour by directly assigning to that entry.

For example, if you have a rule that matches key/value pairs from a configuration file, you might prefer that any trailing comments not be included in the matched text entry of the rule's result-hash. You could hide such comments like so:

    <rule: config_line>
        <key> : <value>  <comment>?
            (?{
                # Edit trailing comments out of "matched text" entry...
                $MATCH = "$MATCH{key} : $MATCH{value}";
            })

Some more examples of the uses of $MATCH:

    <rule: FuncDecl>
      # Keyword  Name               Keep return the name (as a string)...
        func     <Identifier> ;     (?{ $MATCH = $MATCH{'Identifier'} })


    <rule: NumList>
      # Numbers in square brackets...
        \[ 
            ( \d+ (?: , \d+)* )
        \]

      # Return only the numbers...
        (?{ $MATCH = $CAPTURE })


    <token: Cmd>
      # Match standard variants then standardize the keyword...
        (?: mv | move | rename )      (?{ $MATCH = 'mv'; })

$CAPTURE and $CONTEXT are both aliases for the built-in read-only $^N variable, which always contains the substring matched by the nearest preceding (...) capture. $^N still works perfectly well, but these are provided to improve the readability of code blocks and error messages respectively.

El siguiente código implementa una calculadora usando destilación en el código:

pl@nereida:~/Lregexpgrammars/demo$ cat -n demo_calc_inline.pl
 1  use v5.10;
 2  use warnings;
 3
 4  my $calculator = do{
 5      use Regexp::Grammars;
 6      qr{
 7          <Answer>
 8
 9          <rule: Answer>
10              <X=Mult> \+ <Y=Answer>
11                  (?{ $MATCH = $MATCH{X} + $MATCH{Y}; })
12            | <X=Mult> - <Y=Answer>
13                  (?{ $MATCH = $MATCH{X} - $MATCH{Y}; })
14            | <MATCH=Mult>
15
16          <rule: Mult>
17              <X=Pow> \* <Y=Mult>
18                  (?{ $MATCH = $MATCH{X} * $MATCH{Y}; })
19            | <X=Pow>  / <Y=Mult>
20                  (?{ $MATCH = $MATCH{X} / $MATCH{Y}; })
21            | <X=Pow>  % <Y=Mult>
22                  (?{ $MATCH = $MATCH{X} % $MATCH{Y}; })
23            | <MATCH=Pow>
24
25          <rule: Pow>
26              <X=Term> \^ <Y=Pow>
27                  (?{ $MATCH = $MATCH{X} ** $MATCH{Y}; })
28            | <MATCH=Term>
29
30          <rule: Term>
31                 <MATCH=Literal>
32            | \( <MATCH=Answer> \)
33
34          <token: Literal>
35              <MATCH=( [+-]? \d++ (?: \. \d++ )?+ )>
36      }xms
37  };
38
39  while (my $input = <>) {
40      if ($input =~ $calculator) {
41          say '--> ', $/{Answer};
42      }
43  }

Ejercicio 31.11.1 Cual es la salida del programa anterior para las entradas:

4-2-2
8/4/2
2^2^3

Llamadas privadas a subreglas y subreglas privadas

If a rule name (or an alias) begins with an underscore:

     <_RULENAME>       <_ALIAS=RULENAME>  
    <[_RULENAME]>     <[_ALIAS=RULENAME]>

then matching proceeds as normal, and any result that is returned is stored in the current result-hash in the usual way.

However, when any rule finishes (and just before it returns) it first filters its result-hash, removing any entries whose keys begin with an underscore. This means that any subrule with an underscored name (or with an underscored alias) remembers its result, but only until the end of the current rule. Its results are effectively private to the current rule.

This is especially useful in conjunction with result distillation.

Mas sobre listas

Reconocimiento manual de listas

Analizando listas manualmente

El siguiente ejemplo muestra como construir un reconocedor de listas (posiblemente vacías) de números:

casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use Regexp::Grammars;
     5
     6  my $list = qr{
     7      <List>
     8
     9      <rule: List>
    10           <digit> <List>
    11         | # empty
    12
    13      <rule: digit>
    14          <MATCH=(\d+)>
    15
    16  }xms;
    17
    18  while (my $input = <>) {
    19      chomp $input;
    20      if ($input =~ $list) {
    21          use Data::Dumper 'Dumper';
    22          warn Dumper \%/;
    23      }
    24      else {
    25        warn "Does not match\n"
    26      }
    27  }

Sigue una ejecución:

casiano@millo:~/Lregexp-grammar-examples$ ./simple_list.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => {
                      '' => '2 3 4',
                      'digit' => '2'
                      'List' => {
                                  '' => '3 4',
                                  'digit' => '3'
                                  'List' => {
                                              '' => '4',
                                              'digit' => '4'
                                              'List' => '',
                                            },
                                },
                    }
        };

Influencia del orden en el lenguaje reconocido

Tenga en cuenta que el orden de las reglas influye en el lenguaje reconocido. Véase lo que ocurre si cambiamos en el ejemplo anterior el orden de las reglas:

casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list_empty_first.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use Regexp::Grammars;
     5
     6  my $list = qr{
     7      <List>
     8
     9      <rule: List>
    10           # empty
    11         | <digit> <List>
    12
    13      <rule: digit>
    14          <MATCH=(\d+)>
    15
    16  }xms;
    17
    18  while (my $input = <>) {
    19      chomp $input;
    20      if ($input =~ $list) {
    21          use Data::Dumper 'Dumper';
    22          warn Dumper \%/;
    23      }
    24      else {
    25        warn "Does not match\n"
    26      }
    27  }

Al ejecutar se obtiene:

casiano@millo:~/Lregexp-grammar-examples$ ./simple_list_empty_first.pl
2 3 4
$VAR1 = {
          '' => '',
          'List' => ''
        };

Por supuesto basta poner anclas en el patrón a buscar para forzar a que se reconozca la lista completa:

pl@nereida:~/Lregexpgrammars/demo$ diff simple_list_empty_first.pl simple_list_empty_first_with_anchors.pl
7c7
<     <List>
---
>     ^<List>$

En efecto, la nueva versión reconoce la lista:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_empty_first_with_anchors.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => {
                      'List' => {
                                  'List' => {
                                              'List' => '',
                                              '' => '4',
                                              'digit' => '4'
                                            },
                                  '' => '3 4',
                                  'digit' => '3'
                                },
                      '' => '2 3 4',
                      'digit' => '2'
                    }
        };

Si se quiere mantener la producción vacía en primer lugar pero forzar el reconocimiento de la lista completa, se puede hacer uso de un lookahead negativo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n simple_list_empty_first_with_lookahead.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use strict;
     5  use Regexp::Grammars;
     6
     7  my $list = qr{
     8      <List>
     9
    10      <rule: List>
    11           (?! <digit> ) # still empty production
    12         | <digit> <List>
    13
    14      <rule: digit>
    15          <MATCH=(\d+)>
    16
    17  }xms;
    18
    19  while (my $input = <>) {
    20      chomp $input;
    21      if ($input =~ $list) {
    22          use Data::Dumper 'Dumper';
    23          warn Dumper \%/;
    24      }
    25      else {
    26        warn "Does not match\n"
    27      }
    28  }

Así, sólo se reducirá por la regla vacía si el siguiente token no es un número. Sigue un ejemplo de ejecución:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_empty_first_with_lookahead.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => {
                      'List' => {
                                  'List' => {
                                              'List' => '',
                                              '' => '4',
                                              'digit' => '4'
                                            },
                                  '' => '3 4',
                                  'digit' => '3'
                                },
                      '' => '2 3 4',
                      'digit' => '2'
                    }
        };

Aplanamiento manual de listas

¿Cómo podemos hacer que la estructura retornada por el reconocedor sea una lista?. Podemos añadir acciones como sigue:

casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list_action.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use Regexp::Grammars;
     5
     6  my $list = qr{
     7      <List>
     8
     9      <rule: List>
    10           <digit> <X=List> <MATCH= (?{ unshift @{$MATCH{X}}, $MATCH{digit}; $MATCH{X} })>
    11         | # empty
    12           <MATCH= (?{ [] })>
    13
    14      <rule: digit>
    15          <MATCH=(\d+)>
    16
    17  }xms;
    18
    19  while (my $input = <>) {
    20      chomp $input;
    21      if ($input =~ $list) {
    22          use Data::Dumper 'Dumper';
    23          warn Dumper \%/;
    24      }
    25      else {
    26        warn "Does not match\n"
    27      }
    28  }

Al ejecutarse este programa produce una salida como:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_action.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => [ '2', '3', '4' ]
        };

Los operadores de repetición

Los operadores de repetición como *, +, etc. permiten simplificar el análisis de lenguajes de listas:

pl@nereida:~/Lregexpgrammars/demo$ cat -n simple_list_star.pl
 1  #!/soft/perl5lib/bin/perl5.10.1
 2  use v5.10;
 3
 4  use Regexp::Grammars;
 5
 6  my $list = qr{
 7      <List>
 8
 9      <rule: List>
10          (?: <[digit]>)*
11
12      <rule: digit>
13          <MATCH=(\d+)>
14
15  }xms;
16
17  while (my $input = <>) {
18      chomp $input;
19      if ($input =~ $list) {
20          use Data::Dumper 'Dumper';
21          warn Dumper \%/;
22      }
23      else {
24        warn "Does not match\n"
25      }
26  }

Los corchetes alrededor de digit hacen que el valor asociado con el patrón sea la lista de números. Si no los ponemos el valor asociado sería el último valor de la lista.

Listas separadas por Algo

One of the commonest tasks in text parsing is to match a list of unspecified length, in which items are separated by a fixed token. Things like:

    1, 2, 3 , 4 ,13, 91        # Numbers separated by commas and spaces

    g-c-a-g-t-t-a-c-a          # Bases separated by dashes

    /usr/local/bin             # Names separated by directory markers

    /usr:/usr/local:bin        # Directories separated by colons

The usual construct required to parse these kinds of structures is either:

    <rule: list>

        <item> <separator> <list               # recursive definition
      | <item>                                 # base case

Or, more efficiently, but less prettily:

    <rule: list>

        <[item]> (?: <separator> <[item]> )*   # iterative definition

Because this is such a common requirement, Regexp::Grammars provides a cleaner way to specify the iterative version. The syntax is taken from Perl 6:

    <rule: list>

        <[item]> ** <separator>                # iterative definition

This is a repetition specifier on the first subrule (hence the use of ** as the marker, to reflect the repetitive behaviour of *). However, the number of repetitions is controlled by the second subrule: the first subrule will be repeatedly matched for as long as the second subrule matches immediately after it.

So, for example, you can match a sequence of numbers separated by commas with:

    <[number]> ** <comma>

    <token: number>  \d+
    <token: comma>   \s* , \s*

Note that it's important to use the <[...]> form for the items being matched, so that all of them are saved in the result hash. You can also save all the separators (if that's important):

    <[number]> ** <[comma]>

The repeated item must be specified as a subrule call fo some kind, but the separators may be specified either as a subrule or a bracketed pattern. For example:

    <[number]> ** ( , )

The separator must always be specified in matched delimiters of some kind: either matching <...> or matching (...). A common error is to write:

    <[number]> ** ,

You can also use a pattern as the item matcher, but it must be aliased into a subrule:

    <[item=(\d+)]> ** ( , )

Ejemplo: Listas de números separados por comas

Veamos un ejemplo sencillo:

casiano@millo:~/src/perl/regexp-grammar-examples$ cat -n demo_list.pl
 1  #!/soft/perl5lib/bin/perl5.10.1
 2  use v5.10;
 3
 4  use Regexp::Grammars;
 5
 6  my $list_nonempty = qr{
 7      <List>
 8
 9      <rule: List>
10          \(  <[Value]> ** (,)  \)
11
12      <token: Value>
13          \d+
14  }xms;
15
16  my $list_empty = qr{
17      <List>
18
19      <rule: List>
20          \(  (?: <[Value]> ** <_Sep=(,)> )?  \)
21
22      <token: Value>
23          \d+
24  }xms;
25
26  use Smart::Comments;
27
28
29  while (my $input = <>) {
30      my $input2 = $input;
31      if ($input =~ $list_nonempty) {
32          ### nonempty: $/{List}
33      }
34      if ($input2 =~ $list_empty) {
35          ### empty: $/{List}
36      }
37  }

Sigue un ejemplo de ejecución:

casiano@millo:~/src/perl/regexp-grammar-examples$ ./demo_list.pl
(3,4,5)

### nonempty: {
###             '' => '(3,4,5)',
###             Value => [
###                        '3',
###                        '4',
###                        '5'
###                      ]
###           }

### empty: {
###          '' => '(3,4,5)',
###          Value => [
###                     '3',
###                     '4',
###                     '5'
###                   ]
###        }
()

### empty: '()'

Ejemplo: AST para las expresiones aritméticas

Las expresiones aritméticas puede definirse como una jerarquía de listas como sigue:

pl@nereida:~/Lregexpgrammars/demo$ cat -n calcaslist.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5  $Data::Dumper::Indent = 1;
 6
 7  my $rbb = do {
 8      use Regexp::Grammars;
 9
10      qr{
11        \A<expr>\z
12
13        <objrule: expr>      <[operands=term]> ** <[operators=addop]>
14
15        <objrule: term>      <[operands=uneg]> ** <[operators=mulop]>
16
17        <objrule: uneg>      <[operators=minus]>* <[operands=power]>
18
19        <objrule: power>     <[operands=factorial]> ** <[operators=powerop]>
20
21        <objrule: factorial> <[operands=factor]>  <[operators=(!)]>*
22
23        <objrule: factor>    <val=([+-]?\d+(?:\.\d*)?)>
24                           | \( <MATCH=expr> \)
25
26        <token: addop>        [+-]
27
28        <token: mulop>        [*/]
29
30        <token: powerop>      \*\*|\^
31
32        <token: minus>        - <MATCH=(?{ 'NEG' })>
33
34      }x;
35  };
36
37  while (my $input = <>) {
38      chomp($input);
39      if ($input =~ m{$rbb}) {
40          my $tree = $/{expr};
41          say Dumper $tree;
42
43      }
44      else {
45          say("does not match");
46      }
47  }

Obsérvese el árbol generado para la expresión 4-2-2:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 calcaslist.pl
4-2-2
$VAR1 = bless( {
  'operands' => [
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '4', 'val' => '4' }, 'factor' )
                  ],
                  '' => '4'
                }, 'factorial' )
              ],
              '' => '4'
            }, 'power' )
          ],
          '' => '4'
        }, 'uneg' )
      ],
      '' => '4'
    }, 'term' ),
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '2', 'val' => '2' }, 'factor' )
                  ],
                  '' => '2'
                }, 'factorial' )
              ],
              '' => '2'
            }, 'power' )
          ],
          '' => '2'
        }, 'uneg' )
      ],
      '' => '2'
    }, 'term' ),
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '2', 'val' => '2' }, 'factor' )
                  ],
                  '' => '2'
                }, 'factorial' )
              ],
              '' => '2'
            }, 'power' )
          ],
          '' => '2'
        }, 'uneg' )
      ],
      '' => '2'
    }, 'term' )
  ],
  '' => '4-2-2',
  'operators' => [
    '-',
    '-'
  ]
}, 'expr' );

La directiva `require`

La directiva require es similar en su funcionamiento al paréntesis 5.10 (??{ Código Perl }) el cuál hace que el Código Perl sea evaluado durante el tiempo de matching. El resultado de la evaluación se trata como una expresión regular con la que deberá casarse. (véase la sección 31.2.9 para mas detalles).

La sintáxis de la directiva <require:> es

                        <require: (?{ CODE }) >

The code block is executed and if its final value is true, matching continues from the same position. If the block's final value is false, the match fails at that point and starts backtracking.

The <require:...> directive is useful for testing conditions that it's not easy (or even possible) to check within the syntax of the the regex itself. For example:

    <rule: IPV4_Octet_Decimal>
        # Up three digits...
        <MATCH= ( \d{1,3}+ )>
        
        # ...but less that 256...
        <require: (?{ $MATCH <= 255 })>

A require expects a regex codeblock as its argument and succeeds if the final value of that codeblock is true. If the final value is false, the directive fails and the rule starts backtracking.

Note, in this example that the digits are matched with \d{1,3}+ . The trailing + prevents the {1,3} repetition from backtracking to a smaller number of digits if the <require:...> fails.

El programa demo_IP4.pl ilustra el uso de la directiva:

pl@nereida:~/Lregexpgrammars/demo$ cat -n ./demo_IP4.pl
 1  #!/usr//bin/env perl5.10.1
 2  use v5.10;
 3  use warnings;
 4
 5  use Regexp::Grammars;
 6
 7  my $grammar = qr{
 8      \A <IP4_addr> \Z
 9
10      <token: quad>
11          <MATCH=(\d{1,3})>
12          <require: (?{ $MATCH < 256 })>
13
14      <token: IP4_addr>
15          <[MATCH=quad]>**(\.)
16          <require: (?{ @$MATCH == 4 })>
17  }xms;
18
19  while (my $line = <>) {
20      if ($line =~ $grammar) {
21          use Data::Dumper 'Dumper';
22          say Dumper \%/;
23      }
24      else {
25          say 'Does not match'
26      }
27  }

Las condiciones usadas en el require obligan a que cada quad^31.9 sea menor que 256 y a que existan sólo cuatro quads.

Sigue un ejemplo de ejecución:

pl@nereida:~/Lregexpgrammars/demo$ ./demo_IP4.pl
123 . 145 . 105 . 252
Does not match
pl@nereida:~/Lregexpgrammars/demo$ ./demo_IP4.pl
123.145.105.252
$VAR1 = {
          '' => '123.145.105.252',
          'IP4_addr' => [
                          123,
                          145,
                          105,
                          252
                        ]
        };
pl@nereida:~/Lregexpgrammars/demo$ ./demo_IP4.pl
148.257.128.128
Does not match
0.0.0.299
Does not match
pl@nereida:~/Lregexpgrammars/demo$  ./demo_IP4.pl
123.145.105.242.193
Does not match

Obsérvese como no se aceptan blancos entre los puntos en esta versión. ¿Sabría explicar la causa?

Casando con las claves de un hash

In some situations a grammar may need a rule that matches dozens, hundreds, or even thousands of one-word alternatives. For example, when matching command names, or valid userids, or English words. In such cases it is often impractical (and always inefficient) to list all the alternatives between | alterators:

    <rule: shell_cmd>
        a2p | ac | apply | ar | automake | awk | ...
        # ...and 400 lines later
        ... | zdiff | zgrep | zip | zmore | zsh

    <rule: valid_word>
        a | aa | aal | aalii | aam | aardvark | aardwolf | aba | ...
        # ...and 40,000 lines later... 
        ... | zymotize | zymotoxic | zymurgy | zythem | zythum

To simplify such cases, Regexp::Grammars provides a special construct that allows you to specify all the alternatives as the keys of a normal hash. The syntax for that construct is simply to put the hash name inside angle brackets (with no space between the angles and the hash name).

Which means that the rules in the previous example could also be written:

    <rule: shell_cmd>
        <%cmds>

    <rule: valid_word>
        <%dict>

provided that the two hashes (%cmds and %dict) are visible in the scope where the grammar is created.

Internally, the construct is converted to something equivalent to:

    <rule: shell_cmd>
        (<.hk>)  <require: exists $cmds{$CAPTURE}>

    <rule: valid_word>
        (<.hk>)  <require: exists $dict{$CAPTURE}>

The special <hk> rule is created automatically, and defaults to \S+, but you can also define it explicitly to handle other kinds of keys. For example:

    <rule: hk>
        .+            # Key may be any number of chars on a single line

    <rule: hk>
        [ACGT]{10,}   # Key is a base sequence of at least 10 pairs

Matching a hash key in this way is typically significantly faster than matching a full set of alternations. Specifically, it is O(length of longest potential key), instead of O(number of keys).

Ejemplo de uso de la directiva hash

Sigue un ejemplo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n hash.pl
 1  #!/usr/bin/env perl5.10.1
 2  use strict;
 3  use warnings;
 4  use 5.010;
 5  use Data::Dumper;
 6  $Data::Dumper::Deparse = 1;
 7
 8  my %cmd = map { ($_ => undef ) } qw( uname pwd date );
 9
10  my $rbb = do {
11      use Regexp::Grammars;
12
13      qr{
14        ^<command>$
15
16        <rule: command>
17          <cmd=%cmd> (?: <[arg]> )*
18
19        <token: arg> [^\s<>`&]+
20      }xms;
21  };
22
23  while (my $input = <>) {
24      chomp($input);
25      if ($input =~ m{$rbb}) {
26          say("matches: <$&>");
27          say Dumper \%/;
28          system $/{''}
29      }
30      else {
31          say("does not match");
32      }
33  }

Sigue un ejemplo de ejecución:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 hash.pl
a2p f1 f2
matches: <a2p f1 f2>
$VAR1 = {
          '' => 'a2p f1 f2',
          'command' => {
                         '' => 'a2p f1 f2',
                         'cmd' => 'a2p',
                         'arg' => [
                                    'f1',
                                    'f2'
                                  ]
                       }
        };

pocho 2 5
does not match

Depuración

Regexp::Grammars provides a number of features specifically designed to help debug both grammars and the data they parse.

All debugging messages are written to a log file (which, by default, is just STDERR). However, you can specify a disk file explicitly by placing a "<logfile:...>" directive at the start of your grammar^31.10:

        $grammar = qr{

            <logfile: LaTeX_parser_log >

            \A <LaTeX_file> \Z    # Pattern to match

            <rule: LaTeX_file>
                # etc.
        }x;

You can also explicitly specify that messages go to the terminal:

            <logfile: - >

Debugging grammar creation

Whenever a log file has been directly specified, Regexp::Grammars automatically does verbose static analysis of your grammar. That is, whenever it compiles a grammar containing an explicit "<logfile:...>" directive it logs a series of messages explaining how it has interpreted the various components of that grammar. For example, the following grammar:

pl@nereida:~/Lregexpgrammars/demo$ cat -n log.pl
     1  #!/usr/bin/env perl5.10.1
     2  use strict;
     3  use warnings;
     4  use 5.010;
     5  use Data::Dumper;
     6
     7  my $rbb = do {
     8      use Regexp::Grammars;
     9
    10      qr{
    11        <logfile: ->
    12
    13        <numbers>
    14
    15        <rule: numbers>
    16          <number> ** <.comma>
    17
    18        <token: number> \d+
    19
    20        <token: comma>   ,
    21      }xms;
    22  };
    23
    24  while (my $input = <>) {
    25      if ($input =~ m{$rbb}) {
    26          say("matches: <$&>");
    27          say Dumper \%/;
    28      }
    29  }

would produce the following analysis in the terminal:

pl@nereida:~/Lregexpgrammars/demo$ ./log.pl
  warn | Repeated subrule <number>* will only capture its final match
       | (Did you mean <[number]>* instead?)
       |
  info | Processing the main regex before any rule definitions
       |    |
       |    |...Treating <numbers> as:
       |    |      |  match the subrule <numbers>
       |    |       \ saving the match in $MATCH{'numbers'}
       |    |
       |     \___End of main regex
       |
       | Defining a rule: <numbers>
       |    |...Returns: a hash
       |    |
       |    |...Treating <number> as:
       |    |      |  match the subrule <number>
       |    |       \ saving the match in $MATCH{'number'}
       |    |
       |    |...Treating <.comma> as:
       |    |      |  match the subrule <comma>
       |    |       \ but don't save anything
       |    |
       |    |...Treating <number> ** <.comma> as:
       |    |      |  repeatedly match the subrule <number>
       |    |       \ as long as the matches are separated by matches of <.comma>
       |    |
       |     \___End of rule definition
       |
       | Defining a rule: <number>
       |    |...Returns: a hash
       |    |
       |    |...Treating '\d' as:
       |    |       \ normal Perl regex syntax
       |    |
       |    |...Treating '+ ' as:
       |    |       \ normal Perl regex syntax
       |    |
       |     \___End of rule definition
       |
       | Defining a rule: <comma>
       |    |...Returns: a hash
       |    |
       |    |...Treating ', ' as:
       |    |       \ normal Perl regex syntax
       |    |
       |     \___End of rule definition
       |
2, 3, 4
matches: <2, 3, 4>
$VAR1 = {
          '' => '2, 3, 4',
          'numbers' => {
                         '' => '2, 3, 4',
                         'number' => '4'
                       }
        };

This kind of static analysis is a useful starting point in debugging a miscreant grammar^31.11, because it enables you to see what you actually specified (as opposed to what you thought you'd specified).

Debugging grammar execution

Regexp::Grammars also provides a simple interactive debugger, with which you can observe the process of parsing and the data being collected in any result-hash.

To initiate debugging, place a <debug:...> directive anywhere in your grammar. When parsing reaches that directive the debugger will be activated, and the command specified in the directive immediately executed. The available commands are:

        <debug: on>    - Enable debugging, stop when entire grammar matches
        <debug: match> - Enable debugging, stope when a rule matches
        <debug: try>   - Enable debugging, stope when a rule is tried
        <debug: off>   - Disable debugging and continue parsing silently

        <debug: continue> - Synonym for <debug: on>
        <debug: run>      - Synonym for <debug: on>
        <debug: step>     - Synonym for <debug: try>

These directives can be placed anywhere within a grammar and take effect when that point is reached in the parsing. Hence, adding a <debug:step> directive is very much like setting a breakpoint at that point in the grammar. Indeed, a common debugging strategy is to turn debugging on and off only around a suspect part of the grammar:

        <rule: tricky>   # This is where we think the problem is...
            <debug:step>
            <preamble> <text> <postscript>
            <debug:off>

Once the debugger is active, it steps through the parse, reporting rules that are tried, matches and failures, backtracking and restarts, and the parser's location within both the grammar and the text being matched. That report looks like this:

        ===============> Trying <grammar> from position 0
        > cp file1 file2 |...Trying <cmd>
                         |   |...Trying <cmd=(cp)>
                         |   |    \FAIL <cmd=(cp)>
                         |    \FAIL <cmd>
                          \FAIL <grammar>
        ===============> Trying <grammar> from position 1
         cp file1 file2  |...Trying <cmd>
                         |   |...Trying <cmd=(cp)>
         file1 file2     |   |    \_____<cmd=(cp)> matched 'cp'
        file1 file2      |   |...Trying <[file]>+
         file2           |   |    \_____<[file]>+ matched 'file1'
                         |   |...Trying <[file]>+
        [eos]            |   |    \_____<[file]>+ matched ' file2'
                         |   |...Trying <[file]>+
                         |   |    \FAIL <[file]>+
                         |   |...Trying <target>
                         |   |   |...Trying <file>
                         |   |   |    \FAIL <file>
                         |   |    \FAIL <target>
         <~~~~~~~~~~~~~~ |   |...Backtracking 5 chars and trying new match
        file2            |   |...Trying <target>
                         |   |   |...Trying <file>
                         |   |   |    \____ <file> matched 'file2'
        [eos]            |   |    \_____<target> matched 'file2'
                         |    \_____<cmd> matched ' cp file1 file2'
                          \_____<grammar> matched ' cp file1 file2'

The first column indicates the point in the input at which the parser is trying to match, as well as any backtracking or forward searching it may need to do. The remainder of the columns track the parser's hierarchical traversal of the grammar, indicating which rules are tried, which succeed, and what they match.

Provided the logfile is a terminal (as it is by default), the debugger also pauses at various points in the parsing process-before trying a rule, after a rule succeeds, or at the end of the parse-according to the most recent command issued. When it pauses, you can issue a new command by entering a single letter:

        m       - to continue until the next subrule matches
        t or s  - to continue until the next subrule is tried
        r or c  - to continue to the end of the grammar
        o       - to switch off debugging

Note that these are the first letters of the corresponding <debug:...> commands, listed earlier. Just hitting ENTER while the debugger is paused repeats the previous command.

While the debugger is paused you can also type a d, which will display the result-hash for the current rule. This can be useful for detecting which rule isn't returning the data you expected.

Veamos un ejemplo. El siguiente programa activa el depurador:

pl@nereida:~/Lregexpgrammars/demo$ cat -n demo_debug.pl
     1  #!/usr/bin/env perl5.10.1
     2  use 5.010;
     3  use warnings;
     4
     5      use Regexp::Grammars;
     6
     7      my $balanced_brackets = qr{
     8          <debug:on>
     9
    10          <left_delim=(  \( )>
    11          (?:
    12              <[escape=(  \\ )]>
    13          |   <recurse=( (?R) )>
    14          |   <[simple=(  .  )]>
    15          )*
    16          <right_delim=( \) )>
    17      }xms;
    18
    19      while (<>) {
    20          if (/$balanced_brackets/) {
    21              say 'matched:';
    22              use Data::Dumper 'Dumper';
    23              warn Dumper \%/;
    24          }
    25      }

Al ejecutar obtenemos

pl@nereida:~/Lregexpgrammars/demo$ ./demo_debug.pl
(a)
=====> Trying <grammar> from position 0
(a)\n  |...Trying <left_delim=(  \( )>

a)\n   |    _____<left_delim=(  \( )> matched '('      c
       |...Trying <[escape=(  \ )]>
       |    \FAIL <[escape=(  \ )]>
       |...Trying <recurse=( (?R) )>
=====> Trying <grammar> from position 1
a)\n   |   |...Trying <left_delim=(  \( )>

       |   |    \FAIL <left_delim=(  \( )>
        \FAIL <grammar>
       |...Trying <[simple=(  .  )]>
)\n    |    _____<[simple=(  .  )]> matched 'a'
       |...Trying <[escape=(  \ )]>

       |    \FAIL <[escape=(  \ )]>
       |...Trying <recurse=( (?R) )>
=====> Trying <grammar> from position 2
)\n    |   |...Trying <left_delim=(  \( )>
       |   |    \FAIL <left_delim=(  \( )>

        \FAIL <grammar>
       |...Trying <[simple=(  .  )]>
\n     |    _____<[simple=(  .  )]> matched ')'
       |...Trying <[escape=(  \ )]>
       |    \FAIL <[escape=(  \ )]>

       |...Trying <recurse=( (?R) )>
=====> Trying <grammar> from position 3
\n     |   |...Trying <left_delim=(  \( )>
       |   |    \FAIL <left_delim=(  \( )>
        \FAIL <grammar>

       |...Trying <[simple=(  .  )]>
[eos]  |    _____<[simple=(  .  )]> matched ''
       |...Trying <[escape=(  \ )]>
       |    \FAIL <[escape=(  \ )]>
       |...Trying <recurse=( (?R) )>

=====> Trying <grammar> from position 4
[eos]  |   |...Trying <left_delim=(  \( )>
       |   |    \FAIL <left_delim=(  \( )>
        \FAIL <grammar>
       |...Trying <[simple=(  .  )]>

       |    \FAIL <[simple=(  .  )]>
       |...Trying <right_delim=( \) )>
       |    \FAIL <right_delim=( \) )>
 <~~~~ |...Backtracking 1 char and trying new match
\n     |...Trying <right_delim=( \) )>
       |    \FAIL <right_delim=( \) )>

 <~~~~ |...Backtracking 1 char and trying new match
)\n    |...Trying <right_delim=( \) )>
\n     |    _____<right_delim=( \) )> matched ')'
        _____<grammar> matched '(a)'   d
              :         {
              :           '' => '(a)',
              :           'left_delim' => '(',
              :           'simple' => [
              :                         'a'
              :                       ],
              :           'right_delim' => ')'
              :         };      o
matched:
$VAR1 = {
          '' => '(a)',
          'left_delim' => '(',
          'simple' => [
                        'a'
                      ],
          'right_delim' => ')'
        };

Mensajes de `log` del usuario

Both static and interactive debugging send a series of predefined log messages to whatever log file you have specified. It is also possible to send additional, user-defined messages to the log, using the "<log:...>" directive.

This directive expects either a simple text or a codeblock as its single argument. If the argument is a code block, that code is expected to return the text of the message; if the argument is anything else, that something else is the literal message. For example:

        <rule: ListElem>

            <Elem=   ( [a-z]\d+) >
                <log: Checking for a suffix, too...>

            <Suffix= ( : \d+   ) >?
                <log: (?{ "ListElem: $MATCH{Elem} and $MATCH{Suffix}" })>

User-defined log messages implemented using a codeblock can also specify a severity level. If the codeblock of a <log:...> directive returns two or more values, the first is treated as a log message severity indicator, and the remaining values as separate lines of text to be logged. For example:

        <rule: ListElem>
            <Elem=   ( [a-z]\d+) >
            <Suffix= ( : \d+   ) >?

                <log: (?{
                    warn => "Elem was: $MATCH{Elem}",
                            "Suffix was $MATCH{Suffix}",
                })>

When they are encountered, user-defined log messages are interspersed between any automatic log messages (i.e. from the debugger), at the correct level of nesting for the current rule.

Depuración de Regexps

It is possible to use Regexp::Grammars without creating any subrule definitions, simply to debug a recalcitrant regex. For example, if the following regex wasn't working as expected:

        my $balanced_brackets = qr{
            \(             # left delim
            (?:
                \\         # escape or
            |   (?R)       # recurse or
            |   .          # whatever
            )*
            \)             # right delim
        }xms;

you could instrument it with aliased subpatterns and then debug it step-by-step, using Regexp::Grammars:

        use Regexp::Grammars;

        my $balanced_brackets = qr{
            <debug:step>

            <.left_delim=  (  \(  )>
            (?:
                <.escape=  (  \\  )>
            |   <.recurse= ( (?R) )>
            |   <.whatever=(  .   )>
            )*
            <.right_delim= (  \)  )>
        }xms;

        while (<>) {
            say 'matched' if /$balanced_brackets/;
        }

Note the use of amnesiac aliased subpatterns to avoid needlessly building a result-hash. Alternatively, you could use listifying aliases to preserve the matching structure as an additional debugging aid:

        use Regexp::Grammars;

        my $balanced_brackets = qr{
            <debug:step>

            <[left_delim=  (  \(  )]>
            (?:
                <[escape=  (  \\  )]>
            |   <[recurse= ( (?R) )]>
            |   <[whatever=(  .   )]>
            )*
            <[right_delim= (  \)  )]>
        }xms;

        if ( '(a(bc)d)' =~ /$balanced_brackets/) {
            use Data::Dumper 'Dumper';
            warn Dumper \%/;
        }

Manejo y recuperación de errores

En este punto debo decir que no he podido reproducir el comportamiento de las directivas <error:> y <warning:> tal y como las describe Conway en el manual de Regexp::Grammars.

El siguiente ejemplo ilustra un conjunto de técnicas de gestión de errores que son independientes del soprote dado por Regexp::Grammars.

Se trata de la misma calculadora explicada en la sección 31.10.18.

pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n calculatorwitherrmanagement.pl
 1  #!/usr/bin/env perl5.10.1
 2  use strict;
 3  use warnings;
 4  use 5.010;
 5  use Lingua::EN::Inflect qw(PL);
 6  use Scalar::Util qw{blessed};
 7
 8  my $rbb = do {
 9      my ($warnings, $errors);    # closure
10      sub warnings { $warnings }  # accessor
11      sub errors { $errors }      # accessor
12
13      use Regexp::Grammars;
14      qr{
15        (?{
16            $warnings = 0;
17            $errors = 0;
18        })
19        \A<expr>
20        (?:   \z
21             |
22               (.*) (?{
23                        # Accept the string but emit a warning
24                        $warnings++;
25                        local our $expr = \$MATCH{expr}{''};
26                        local our $endlegal = length($$expr) > 4? "... ".substr($$expr, -4) : $$expr;
27                        warn "Warning: Unexpected '". substr($^N, 0, 10)."' after '$endlegal'\n";
28                     })
29        )
30
31        <objrule: expr>      <[operands=term]> ** <[operators=addop]>
32
33        <objrule: term>      <[operands=uneg]> ** <[operators=mulop]>
34
35        <objrule: uneg>      <[operators=minus]>* <[operands=power]>
36
37        <objrule: power>     <[operands=factorial]> ** <[operators=powerop]>
38
39        <objrule: factorial> <[operands=factor]>  <[operators=(!)]>*
40
41        <objrule: factor>    (<val=([+-]?\d+(?:\.\d*)?)>)
42                           | \( <MATCH=expr> \)
43                           | ([^-+(0-9]+) (?{
44                                            # is + and not * to avoid infinite recursion
45                                            warn "Error: expecting a number or a open parenthesis, found: '". substr($^N, 0, 10)."'\n";
46                                            $warnings++;
47                                            $errors++;
48                                        }) <MATCH=factor>
49
50        <token: addop>        [+-]
51
52        <token: mulop>        [*/]
53
54        <token: powerop>      \*\*|\^
55
56        <token: minus>        - <MATCH=(?{ 'NEG' })>
57
58      }x;
59  };
60
61  sub test_calc {
62    my $prompt = shift;
63
64    print $prompt;
65    while (my $input = <>) {
66        chomp($input);
67
68        local %/;
69        $input =~ m{$rbb};
70
71        say warnings." ".PL('warning',warnings) if warnings;
72        say errors." ".PL('error',errors)       if errors;
73
74        my $tree = $/{expr};
75        if (blessed($tree)) {
76            do "PostfixCalc.pm";
77            say "postfix: ".$tree->ceval;
78
79            do "EvalCalc.pm";
80            say "result: ".$tree->ceval;
81        }
82        print $prompt;
83    }
84    say "Bye!"
85  }
86
87  ########## main
88  test_calc(
89    'Parsing infix arithmetic expressions (CTRL-D to end in unix) ',
90  );

Veamos algunas ejecuciones que incluyen entradas erróneas:

pl@nereida:~/Lregexpgrammars/demo/calculator$ ./calculatorwitherrmanagement.pl
Parsing infix arithmetic expressions (CTRL-D to end in unix) 2+3
postfix: 2 3 +
result: 5
Parsing infix arithmetic expressions (CTRL-D to end in unix) 2*(3+#)
Error: expecting a number or a open parenthesis, found: '#)'
Error: expecting a number or a open parenthesis, found: '#'
Error: expecting a number or a open parenthesis, found: ')'
Warning: Unexpected '*(3+#)' after '2'
4 warnings
3 errors
postfix: 2
result: 2
Parsing infix arithmetic expressions (CTRL-D to end in unix) 2+#*4
Error: expecting a number or a open parenthesis, found: '#*'
1 warning
1 error
postfix: 2 4 +
result: 6
Parsing infix arithmetic expressions (CTRL-D to end in unix) Bye!

Obsérvese los mensajes de error repetidos para la entrada 2*(3+#). Ellos son debidos a los reiterados intentos de casar <factor> en la regla de recuperación de errores:

41        <objrule: factor>    (<val=([+-]?\d+(?:\.\d*)?)>)
42                           | \( <MATCH=expr> \)
43                           | ([^-+(0-9]+) (?{
44                                            # is + and not * to avoid infinite recursion
45                                            warn "Error: expecting a number or a open parenthesis, found: '". substr($^N, 0, 10)."'\n";
46                                            $warnings++;
47                                            $errors++;
48                                        }) <MATCH=factor>

en este caso resulta imposible encontrar un factor. Se puede cambiar la conducta indicando un (* COMMIT) antes de la llamada a <MATCH=factor>:

 41       <objrule: factor>    (<val=([+-]?\d+(?:\.\d*)?)>)
 42                          | \( <MATCH=expr> \)
 43                          | ([^-+(0-9]+) (?{
 44                                           # is + and not * to avoid infinite recursion
 45                                           warn "Error: expecting a number or a open parenthesis, found: '". substr($^N, 0, 10)."'\n";
 46                                           $warnings++;
 47                                           $errors++;
 48                                       }) (*COMMIT) <MATCH=factor>

en este caso la conducta es abandonar en el caso de que no se pueda encontrar un <factor>:

pl@nereida:~/Lregexpgrammars/demo/calculator$ ./calculatorwitherrmanagement.pl
Parsing infix arithmetic expressions (CTRL-D to end in unix) 2*(3+#)
Error: expecting a number or a open parenthesis, found: '#)'
1 warning
1 error
Parsing infix arithmetic expressions (CTRL-D to end in unix) 2*3
postfix: 2 3 *
result: 6
Parsing infix arithmetic expressions (CTRL-D to end in unix) @
Error: expecting a number or a open parenthesis, found: '@'
1 warning
1 error
Parsing infix arithmetic expressions (CTRL-D to end in unix) Bye!

Mensajes de Warning

Sometimes, you want to detect problems, but not invalidate the entire parse as a result. For those occasions, the module provides a less stringent form of error reporting: the <warning:...> directive.

This directive is exactly the same as an <error:...> in every respect except that it does not induce a failure to match at the point it appears.

The directive is, therefore, useful for reporting non-fatal problems in a parse. For example:

       qr{ \A            # ...Match only at start of input
           <ArithExpr>   # ...Match a valid arithmetic expression

           (?:
               # Should be at end of input...
               \s* \Z
             |
               # If not, report the fact but don't fail...
               <warning: Expected end-of-input>
               <warning: (?{ "Extra junk at index $INDEX: $CONTEXT" })>
           )

           # Rule definitions here...
       }xms;

Note that, because they do not induce failure, two or more <warning:...> directives can be "stacked" in sequence, as in the previous example.

Simplificando el AST

pl@nereida:~/Lregexpgrammars/demo$ cat -n exprdamian.pl
     1  use strict;
     2  use warnings;
     3  use 5.010;
     4  use Data::Dumper;
     5  $Data::Dumper::Indent = 1;
     6
     7  my $rbb = do {
     8      use Regexp::Grammars;
     9
    10      qr{
    11        \A<expr>\z
    12
    13        <objrule: expr>    <MATCH=term> (?! <addop> )                  # bypass
    14                         | <[operands=term]> ** <[operators=addop]>
    15
    16        <objrule: term>    <MATCH=factor> (?! <mulop> )                # bypass
    17                         | <[operands=factor]> ** <[operators=mulop]>
    18
    19        <objrule: factor>    <val=([+-]?\d+(?:\.\d*)?)>
    20                         | \( <MATCH=expr> \)
    21
    22        <token: addop> [+-]
    23
    24        <token: mulop> [*/]
    25
    26      }x;
    27  };
    28
    29  while (my $input = <>) {
    30      chomp($input);
    31      if ($input =~ m{$rbb}) {
    32          my $tree = $/{expr};
    33          say Dumper $tree;
    34          say $tree->ceval;
    35
    36      }
    37      else {
    38          say("does not match");
    39      }
    40  }
    41
    42  BEGIN {
    43
    44    package LeftBinaryOp;
    45    use strict;
    46    use base qw(Class::Accessor);
    47
    48    LeftBinaryOp->mk_accessors(qw{operators operands});
    49
    50    my %f = (
    51      '+' => sub { shift() + shift() },
    52      '-' => sub { shift() - shift() },
    53      '*' => sub { shift() * shift() },
    54      '/' => sub { shift() / shift() },
    55    );
    56
    57    sub ceval {
    58      my $self = shift;
    59
    60      # recursively evaluate the children first
    61      my @operands = map { $_->ceval } @{$self->operands};
    62
    63      # then combine them
    64      my $s = shift @operands;
    65      for (@{$self->operators}) {
    66        $s = $f{$_}->($s, shift @operands);
    67      }
    68      return $s;
    69    }
    70
    71    package term;
    72    use base qw{LeftBinaryOp};
    73
    74    package expr;
    75    use base qw{LeftBinaryOp};
    76
    77    package factor;
    78
    79    sub ceval {
    80      my $self = shift;
    81
    82      return $self->{val};
    83    }
    84
    85    1;
    86  }

Ejecuciones:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 exprdamian.pl
4-2-2
$VAR1 = bless( {
  'operands' => [
    bless( {
      '' => '4',
      'val' => '4'
    }, 'factor' ),
    bless( {
      '' => '2',
      'val' => '2'
    }, 'factor' ),
    bless( {
      '' => '2',
      'val' => '2'
    }, 'factor' )
  ],
  '' => '4-2-2',
  'operators' => [
    '-',
    '-'
  ]
}, 'expr' );

0
8/4/2
$VAR1 = bless( {
  'operands' => [
    bless( {
      '' => '8',
      'val' => '8'
    }, 'factor' ),
    bless( {
      '' => '4',
      'val' => '4'
    }, 'factor' ),
    bless( {
      '' => '2',
      'val' => '2'
    }, 'factor' )
  ],
  '' => '8/4/2',
  'operators' => [
    '/',
    '/'
  ]
}, 'term' );

1
3
$VAR1 = bless( {
  '' => '3',
  'val' => '3'
}, 'factor' );

3
2*(3+4)
$VAR1 = bless( {
  'operands' => [
    bless( {
      '' => '2',
      'val' => '2'
    }, 'factor' ),
    bless( {
      'operands' => [
        bless( {
          '' => '3',
          'val' => '3'
        }, 'factor' ),
        bless( {
          '' => '4',
          'val' => '4'
        }, 'factor' )
      ],
      '' => '3+4',
      'operators' => [
        '+'
      ]
    }, 'expr' )
  ],
  '' => '2*(3+4)',
  'operators' => [
    '*'
  ]
}, 'term' );

14

Reciclando una `Regexp::Grammar`

Ejecución

El siguiente programa calculator.pl recibe como entrada una expresión en infijo.

La ejecución consta de dos bucles. En la primera parte se inyecta a la jerarquía de clases de los AST generados para las expresiones en infijo una semántica que permite evaluar la expresión:

    58  require EvalCalc;
    59
    60  test_calc(
    61    'Evaluating infix arithmetic expressions (CTRL-D to end in unix) ',
    62    sub { print &Data::Dumper::Dumper(shift()) },
    63  );

En esta primera parte mostraremos además el AST construido para la expresión infija de entrada.

pl@nereida:~/Lregexpgrammars/demo$ ./calculator.pl
Evaluating infix arithmetic expressions (CTRL-D to end in unix)
8-4-2
$VAR1 = bless( {
  'operands' => [
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '8', 'val' => '8' }, 'factor' )
                  ],
                  '' => '8'
                }, 'factorial' )
              ],
              '' => '8'
            }, 'power' )
          ],
          '' => '8'
        }, 'uneg' )
      ],
      '' => '8'
    }, 'term' ),
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '4', 'val' => '4' }, 'factor' )
                  ],
                  '' => '4'
                }, 'factorial' )
              ],
              '' => '4'
            }, 'power' )
          ],
          '' => '4'
        }, 'uneg' )
      ],
      '' => '4'
    }, 'term' ),
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '2', 'val' => '2' }, 'factor' )
                  ],
                  '' => '2'
                }, 'factorial' )
              ],
              '' => '2'
            }, 'power' )
          ],
          '' => '2'
        }, 'uneg' )
      ],
      '' => '2'
    }, 'term' )
  ],
  '' => '8-4-2',
  'operators' => [
    '-',
    '-'
  ]
}, 'expr' );
2

Observamos que la asociatividad es la correcta. El 2 final es el resultado de la evaluación de 8-4-2.

La estructura del árbol se corresponde con la de la gramática:

 8  my $rbb = do {
 9      use Regexp::Grammars;
10
11      qr{
12        \A<expr>\z
13
14        <objrule: expr>      <[operands=term]> ** <[operators=addop]>
15
16        <objrule: term>      <[operands=uneg]> ** <[operators=mulop]>
17
18        <objrule: uneg>      <[operators=minus]>* <[operands=power]>
19
20        <objrule: power>     <[operands=factorial]> ** <[operators=powerop]>
21
22        <objrule: factorial> <[operands=factor]>  <[operators=(!)]>*
23
24        <objrule: factor>    <val=([+-]?\d+(?:\.\d*)?)>
25                           | \( <MATCH=expr> \)
26
27        <token: addop>        [+-]
28
29        <token: mulop>        [*/]
30
31        <token: powerop>      \*\*|\^
32
33        <token: minus>        - <MATCH=(?{ 'NEG' })>
34
35      }x;
36  };

Ahora, en una segunda parte sobreescribimos los métodos sem que describen la semántica para producir una traducción de infijo a postfijo:

 66  require PostfixCalc;
 67  test_calc('Translating expressions to postfix (CTRL-D to end in unix) ');

Ahora al proporcionar la entrada 6--3! obtenemos:

Translating expressions to postfix (CTRL-D to end in unix)
6--3!
6 3 ! ~ -

Aquí ~ es el operador de negación unaria y ! es el operador factorial.

Estructura de la aplicación

Estos son los ficheros que integran la aplicación:

pl@nereida:~/Lregexpgrammars/demo/calculator$ tree
.
|-- EvalCalc.pm            # Soporte para la evaluación de la expresión: sem
|-- Operator.pm            # Soporte a las clases nodo: recorridos
|-- PostfixCalc.pm         # Soporte para la traducción a postfijo: sem
`-- calculator.pl          # programa principal

Programa principal

En el programa principal definimos la gramática y escribimos una subrutina test_calc que realiza el parsing.

pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n calculator.pl
    1   #!/usr/bin/env perl5.10.1
    2   use strict;
    3   use warnings;
    4   use 5.010;
    5   use Data::Dumper;
    6   $Data::Dumper::Indent = 1;
    7 
    8   my $rbb = do {
    9       use Regexp::Grammars;
   10 
   11       qr{
   12         \A<expr>\z
   13 
   14         <objrule: expr>      <[operands=term]> ** <[operators=addop]>
   15 
   16         <objrule: term>      <[operands=uneg]> ** <[operators=mulop]>
   17 
   18         <objrule: uneg>      <[operators=minus]>* <[operands=power]>
   19 
   20         <objrule: power>     <[operands=factorial]> ** <[operators=powerop]>
   21 
   22         <objrule: factorial> <[operands=factor]>  <[operators=(!)]>*
   23 
   24         <objrule: factor>    <val=([+-]?\d+(?:\.\d*)?)>
   25                            | \( <MATCH=expr> \)
   26 
   27         <token: addop>        [+-]
   28 
   29         <token: mulop>        [*/]
   30 
   31         <token: powerop>      \*\*|^
   32 
   33         <token: minus>        - <MATCH=(?{ 'NEG' })>
   34 
   35       }x;
   36   };
   37 
   38   sub test_calc {
   39     my $prompt = shift;
   40     my $handler = shift;
   41 
   42     say $prompt;
   43     while (my $input = <>) {
   44         chomp($input);
   45         if ($input =~ m{$rbb}) {
   46             my $tree = $/{expr};
   47             $handler->($tree) if $handler;
   48 
   49             say $tree->ceval;
   50 
   51         }
   52         else {
   53             say("does not match");
   54         }
   55     }
   56   }
   57 
   58   require EvalCalc;
   59 
   60   test_calc(
   61     'Evaluating infix arithmetic expressions (CTRL-D to end in unix) ',
   62     sub { print &Data::Dumper::Dumper(shift()) },
   63   );
   64 
   65 
   66   require PostfixCalc;
   67   test_calc('Translating expressions to postfix (CTRL-D to end in unix) ');

Los nodos del AST poseen un método ceval que se encarga de realizar la traducción del nodo.

Las Clases de nodos del AST

pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n Operator.pm
   1  #   Class hierarchy diagram:
   2  #  $ vgg -t 'Operator(LeftBinaryOp(expr,term),RightBinaryOp(power),PreUnaryOp(uneg),PostUnaryOp(factorial))'
   3  #                           +--------+
   4  #                           |Operator|
   5  #                           +--------+
   6  #          .---------------.----^--------.-------------.
   7  #    +------------+ +-------------+ +----------+ +-----------+
   8  #    |LeftBinaryOp| |RightBinaryOp| |PreUnaryOp| |PostUnaryOp|
   9  #    +------------+ +-------------+ +----------+ +-----------+
  10  #      .---^--.        |              |               |
  11  #    +----+ +----+  +-----+         +----+       +---------+
  12  #    |expr| |term|  |power|         |uneg|       |factorial|
  13  #    +----+ +----+  +-----+         +----+       +---------+
  14  #
  15  #
  16  # NOTE: package "factor" actually implements numbers and is
  17  #       outside this hierarchy
  18  #
  19  package Operator;
  20  use strict;
  21  use Carp;
  22
  23  sub Operands {
  24    my $self = shift;
  25
  26    return () unless exists $self->{operands};
  27    return @{$self->{operands}};
  28  }
  29
  30  sub Operators {
  31    my $self = shift;
  32
  33    return () unless exists $self->{operators};
  34    return @{$self->{operators}};
  35  }
  36
  37  sub sem {
  38    confess "not defined sem";
  39  }
  40
  41  sub make_sem {
  42    my $class = shift;
  43    my %semdesc = @_;
  44
  45    for my $class (keys %semdesc) {
  46      my %sem = %{$semdesc{$class}};
  47
  48      # Install 'sem' method in $class
  49      no strict 'refs';
  50      no warnings 'redefine';
  51      *{$class."::sem"} = sub {
  52        my ($self, $op) = @_;
  53        $sem{$op}
  54      };
  55    }
  56  }
  57
  58  package LeftBinaryOp;
  59  use base qw{Operator};
  60
  61  sub ceval {
  62    my $self = shift;
  63
  64    # recursively evaluate the children first
  65    my @operands = map { $_->ceval } $self->Operands;
  66
  67    # then combine them
  68    my $s = shift @operands;
  69    for ($self->Operators) {
  70      $s = $self->sem($_)->($s, shift @operands);
  71    }
  72    return $s;
  73  }
  74
  75  package RightBinaryOp;
  76  use base qw{Operator};
  77
  78  sub ceval {
  79    my $self = shift;
  80
  81    # recursively evaluate the children first
  82    my @operands = map { $_->ceval } $self->Operands;
  83
  84    # then combine them
  85    my $s = pop @operands;
  86    for (reverse $self->Operators) {
  87      $s = $self->sem($_)->(pop @operands, $s);
  88    }
  89    return $s;
  90  }
  91
  92  package PreUnaryOp;
  93  use base qw{Operator};
  94
  95  sub ceval {
  96    my $self = shift;
  97
  98    # recursively evaluate the children first
  99    my @operands = map { $_->ceval } $self->Operands;
 100
 101    # then combine them
 102    my $s = shift @operands;
 103    for (reverse $self->Operators) {
 104      $s = $self->sem($_)->($s);
 105    }
 106    return $s;
 107  }
 108
 109  package PostUnaryOp;
 110  use base qw{Operator};
 111
 112  sub ceval {
 113    my $self = shift;
 114
 115    # recursively evaluate the children first
 116    my @operands = map { $_->ceval } $self->Operands;
 117
 118    # then combine them
 119    my $s = shift @operands;
 120    for ($self->Operators) {
 121      $s = $self->sem($_)->($s);
 122    }
 123    return $s;
 124  }
 125
 126  package term;
 127  use base qw{LeftBinaryOp};
 128
 129  package expr;
 130  use base qw{LeftBinaryOp};
 131
 132  package power;
 133  use base qw{RightBinaryOp};
 134
 135  package uneg;
 136  use base qw{PreUnaryOp};
 137
 138  package factorial;
 139  use base qw{PostUnaryOp};
 140
 141  package factor;
 142
 143  sub ceval {
 144    my $self = shift;
 145
 146    return $self->{val};
 147  }
 148
 149  1;

Definiendo `sem` para la evaluación de la expresión

pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n EvalCalc.pm
 1  package EvalCalc;
 2  use strict;
 3  use Carp;
 4
 5  use Operator;
 6
 7  ####
 8  sub f {
 9    $_[0]>1?$_[0]*f($_[0]-1):1;
10  }
11
12  sub fac {
13    my $n = shift;
14
15    confess "Not valid number" unless $n =~ /^\d+$/;
16    f($n);
17  };
18
19  my $s = sub { shift() **  shift() };
20
21  Operator->make_sem(
22     expr => {
23        '+' => sub { shift()  +  shift() },
24        '-' => sub { shift()  -  shift() },
25     },
26     term => {
27       '*' => sub { shift()  *  shift() },
28       '/' => sub { shift()  /  shift() },
29     },
30     power => {
31        '^'  => $s,
32        '**' => $s,
33     },
34     uneg => {
35        'NEG' => sub { -shift() },
36     },
37     factorial => {
38        '!' => \&fac,
39     },
40  );
41
42  1;

Definiendo `sem` para la traducción a postfijo

pl@nereida:~/Lregexpgrammars/demo/calculator$ cat -n PostfixCalc.pm
 1  package PostfixCalc;
 2  use strict;
 3
 4  use Operator;
 5
 6  # Modify semantics: now translate to postfix
 7  my $powers = sub { shift().' '.shift().' **' };
 8
 9  Operator->make_sem(
10     expr => {
11        '+' => sub { shift().' '.shift().' +'  },
12        '-' => sub { shift().' '.shift().' -' },
13      },
14      term => {
15        '*' => sub { shift().' '.shift().' *'  },
16        '/' => sub { shift().' '.shift().' /' },
17      },
18      power => {
19        '^'  => $powers,
20        '**' => $powers,
21      },
22      uneg => {
23         # use ~ for unary minus
24        'NEG' => sub { shift().' ~' },
25      },
26      factorial => {
27        '!' => sub { shift().' !'},
28      },
29  );
30
31  1;

Ejercicio 31.11.2

Explique el significado de la primera línea del programa principal

pl@nereida:~/Lregexpgrammars/demo$ cat -n calculator.pl
   1  #!/usr/bin/env perl5.10.1

Explique el significado de $handler en test_calc:

  42  sub test_calc {
  43    my $prompt = shift;
  44    my $handler = shift;
  45
  46    say $prompt;
  47    while (my $input = <>) {
  48        chomp($input);
  49        if ($input =~ m{$rbb}) {
  50            my $tree = $/{expr};
  51            $handler->($tree) if $handler;
  52
  53            say $tree->ceval;
  54
  55        }
  56        else {
  57            say("does not match");
  58        }
  59    }
  60  }

Aisle las funciones relacionadas con la creación de semántica como make_sem, fac y las llamadas a make_sem en un módulo Calculator::Semantics aparte.
Añada un traductor de infijo a prefijo al código presentado en esta sección. Una expresión como 2*3+4 se traducirá como + * 2 3 4

Práctica: Calculadora con `Regexp::Grammars`

Reforme la estructura del ejemplo para que tenga una jerarquía de desarrollo de acuerdo a los estándares de Perl. Use h2xs o bien Module::Starter. Use el espacio de nombres Calculator. Mueva el módulo Operator a Calculator::Operator. Lea el capítulo Modulos de los apuntes de LHP.
Defina el conjunto de pruebas que deberá pasar su traductor. Añádalas como pruebas TODO. Cuando la funcionalidad a comprobar esté operativa cambie su estatus.
Añada variables y la expresión de asignación:
```
b = a = 4*2
```
que será traducida a postfijo como:
```
4 2 * a = b =
```
El operador de asignación es asociativo a derechas. El valor devuelto por una expresión de asignación es el valor asignado.
Use un hash para implantar la relación nombre-valor en el caso de la evaluación
Introduzca la expresión bloque:
```
c = { a = 4; b = 2*a }
```
Los bloques son listas entre llaves de expresiones separadas por punto y coma. El valor retornado por una expresión bloque es el último evaluado en el bloque.
El símbolo de arranque de la gramática (esto es, el patrón regular contra el que hay que casar) será la expresión bloque.
Introduzca las expresiones de comparación <, >, <=, >=, == y != con la prioridad adecuada. Tenga en cuenta que una expresión como:
```
a = b+2 > c*4
```
deberá entenderse como
```
a = ((b+2) > (c*4))
```
Esto es, se traducirá como:
```
b 2 + c 4 * > a =
```

Introduzca la expresión if ... then ... else. La parte del else será opcional:

c = if a > 0 then { a = a -1; 2*a } else { b + 2 };
d = if a > 0 then { a = b -1; 2*b };

un else casa con el if mas cercano. La sentencia:

if (a > 0) then if (b > 0) then {5} else {6}

se interpreta como:

if (a > 0) then (if (b > 0) then {5} else {6})

y no como:

if (a > 0) then (if (b > 0) then {5}) else {6}

Se traducirá como:

        a
        0
        >
        jz endif124
        b
        0
        >
        jz else125
        5
        j endif126
:else125
        6
:endif124
:endif125
        ...

Escriba un intérprete de la máquina orientada a pila definida en los apartados anteriores. El código generado debería poder ejecutarse correctamente en el intérprete.

Casiano Rodríguez León
2013-04-23

Reciclando una Regexp::Grammar

Reciclando una `Regexp::Grammar`