Subsecciones

Introducción

Los rudimentos de las expresiones regulares pueden encontrarse en los trabajos pioneros de McCullogh y Pitts (1940) sobre redes neuronales. El lógico Stephen Kleene definió formalmente el algebra que denominó conjuntos regulares y desarrollo una notación para la descripción de dichos conjuntos, las expresiones regulares.

Durante las décadas de 1960 y 1970 hubo un desarrollo formal de las expresiones regulares. Una de las priemras publicaciones que utilizan las expresiones regulares en un marco informático es el artículo de 1968 de Ken Thompson Regular Expression Search Algorithm en el que describe un compilador de expresiones regulares que produce código objeto para un IBM 7094. Este compilador dió lugar al editor qed, en el cual se basó el editor de Unix ed. Aunque las expresiones regulares de este último no eran tan sofisticadas como las de qed, fueron las primeras en ser utilizadas en un contexto no académico. Se dice que el comando global g en su formato g/re/p que utilizaba para imprimir (opción p) las líneas que casan con la expresión regular re dió lugar a un programa separado al que se denomino grep.

Las expresiones regulares facilitadas por las primeras versiones de estas herramientas eran limitadas. Por ejemplo, se disponía del cierre de Kleene * pero no del cierre positivo + o del operador opcional ?. Por eso, posteriormente, se han introducido los metacaracteres \+ y \?. Existían numerosas limitaciones en dichas versiones, por ej. $ sólo significa ``final de línea'' al final de la expresión regular. Eso dificulta expresiones como

grep 'cierre$\|^Las' viq.tex
Sin embargo, la mayor parte de las versiones actuales resuelven correctamente estos problemas:
nereida:~/viq> grep 'cierre$\|^Las' viq.tex
Las expresiones regulares facilitadas por las primeras versiones de estas herramientas
eran limitadas. Por ejemplo, se disponía del cierre de Kleene \verb|*| pero no del cierre
nereida:~/viq>
De hecho AT&T Bell labs añadió numerosas funcionalidades, como por ejemplo, el uso de \{min, max\}, tomada de lex. Por esa época, Alfred Aho escribió egrep que, no sólo proporciona un conjunto mas rico de operadores sino que mejoró la implementación. Mientras que el grep de Ken Thompson usaba un autómata finito no determinista (NFA), la versión de egrep de Aho usa un autómata finito determinista (DFA).

En 1986 Henry Spencer desarrolló la librería regex para el lenguaje C, que proporciona un conjunto consistente de funciones que permiten el manejo de expresiones regulares. Esta librería ha contribuido a ``homogeneizar'' la sintáxis y semántica de las diferentes herramientas que utilizan expresiones regulares (como awk, lex, sed, ...).

Véase También


Un ejemplo sencillo

Matching en Contexto Escalar

pl@nereida:~/Lperltesting$ cat -n c2f.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   print "Enter a temperature (i.e. 32F, 100C):\n";
    5   my $input = <STDIN>;
    6   chomp($input);
    7 
    8   if ($input !~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i) {
    9     warn "Expecting a temperature, so don't understand \"$input\".\n";
   10   }
   11   else {
   12     my $InputNum = $1;
   13     my $type = $3;
   14     my ($celsius, $farenheit);
   15     if ($type eq "C" or $type eq "c") {
   16       $celsius = $InputNum;
   17       $farenheit = ($celsius * 9/5)+32;
   18     }
   19     else {
   20       $farenheit = $InputNum;
   21       $celsius = ($farenheit -32)*5/9;
   22     }
   23     printf "%.2f C = %.2f F\n", $celsius, $farenheit;
   24   }

Véase también:

Ejecución con el depurador:

pl@nereida:~/Lperltesting$  perl -wd c2f.pl
Loading DB routines from perl5db.pl version 1.28
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(c2f.pl:4):       print "Enter a temperature (i.e. 32F, 100C):\n";
DB<1>  c 8
Enter a temperature (i.e. 32F, 100C):
32F
main::(c2f.pl:8):       if ($input !~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i) {
DB<2>  n
main::(c2f.pl:12):        my $InputNum = $1;
DB<2>  x ($1, $2, $3)
0  32
1  undef
2  'F'
DB<3>  use YAPE::Regex::Explain
DB<4>  p YAPE::Regex::Explain->new('([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$')->explain
The regular expression:
(?-imsx:([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$)
matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [-+]?                    any character of: '-', '+' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    (                        group and capture to \2 (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      \.                       '.'
----------------------------------------------------------------------
      [0-9]*                   any character of: '0' to '9' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )?                       end of \2 (NOTE: because you're using a
                             quantifier on this capture, only the
                             LAST repetition of the captured pattern
                             will be stored in \2)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [CF]                     any character of: 'C', 'F'
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Dentro de una expresión regular es necesario referirse a los textos que casan con el primer, paréntesis, segundo, etc. como \1, \2, etc. La notación $1 se refieré a lo que casó con el primer paréntesis en el último matching, no en el actual. Veamos un ejemplo:

pl@nereida:~/Lperltesting$ cat -n dollar1slash1.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   my $a = "hola juanito";
    5   my $b = "adios anita";
    6 
    7   $a =~ /(ani)/;
    8   $b =~ s/(adios) *($1)/\U$1 $2/;
    9   print "$b\n";
Observe como el $1 que aparece en la cadena de reemplazo (línea 8) se refiere a la cadena adios mientras que el $1 en la primera parte contiene ani:
pl@nereida:~/Lperltesting$ ./dollar1slash1.pl
ADIOS ANIta

Ejercicio 31.1.1   Indique cuál es la salida del programa anterior si se sustituye la línea 8 por
$b =~ s/(adios) *(\1)/\U$1 $2/;

Número de Paréntesis

El número de paréntesis con memoria no está limitado:

pl@nereida:~/Lperltesting$ perl -wde 0
main::(-e:1):   0
            123456789ABCDEF
DB<1> $x = "123456789AAAAAA"
                   1  2  3  4  5  6  7  8  9 10 11  12
DB<2> $r = $x =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\11/; print "$r\n$10\n$11\n"
1
A
A

Véase el siguiente párrafo de perlre (sección Capture buffers):

There is no limit to the number of captured substrings that you may use. However Perl also uses \10, \11, etc. as aliases for \010, \011, etc. (Recall that 0 means octal, so \011 is the character at number 9 in your coded character set; which would be the 10th character, a horizontal tab under ASCII.) Perl resolves this ambiguity by interpreting \10 as a backreference only if at least 10 left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences.

Contexto de Lista

Si se utiliza en un contexto que requiere una lista, el ``pattern match'' retorna una lista consistente en las subexpresiones casadas mediante los paréntesis, esto es $1, $2, $3, .... Si no hubiera emparejamiento se retorna la lista vacía. Si lo hubiera pero no hubieran paréntesis se retorna la lista ($&).

pl@nereida:~/src/perl/perltesting$ cat -n escapes.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3
     4  my $foo = "one two three four five\nsix seven";
     5  my ($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/);
     6  print "List Context: F1 = $F1, F2 = $F2, Etc = $Etc\n";
     7
     8  # This is 'almost' the same than:
     9  ($F1, $F2, $Etc) = split(/\s+/, $foo, 3);
    10  print "Split: F1 = $F1, F2 = $F2, Etc = $Etc\n";
Observa el resultado de la ejecución:
pl@nereida:~/src/perl/perltesting$ ./escapes.pl
List Context: F1 = one, F2 = two, Etc = three four five
Split: F1 = one, F2 = two, Etc = three four five
six seven

El modificador s

La opción s usada en una regexp hace que el punto '.' case con el retorno de carro:

pl@nereida:~/src/perl/perltesting$  perl -wd ./escapes.pl
main::(./escapes.pl:4): my $foo = "one two three four five\nsix seven";
DB<1>  c 9
List Context: F1 = one, F2 = two, Etc = three four five
main::(./escapes.pl:9): ($F1, $F2, $Etc) = split(' ',$foo, 3);
DB<2>  ($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/s)
DB<3>  p "List Context: F1 = $F1, F2 = $F2, Etc = $Etc\n"
List Context: F1 = one, F2 = two, Etc = three four five
six seven

La opción /s hace que . se empareje con un \n. Esto es, casa con cualquier carácter.

Veamos otro ejemplo, que imprime los nombres de los ficheros que contienen cadenas que casan con un patrón dado, incluso si este aparece disperso en varias líneas:

    1  #!/usr/bin/perl -w
    2  #use: 
    3  #smodifier.pl 'expr' files
    4  #prints the names of the files that match with the give expr
    5  undef $/; # input record separator
    6  my $what = shift @ARGV;
    7  while(my $file = shift @ARGV) {
    8    open(FILE, "<$file");
    9    $line =  <FILE>;
   10    if ($line =~ /$what/s) {
   11      print "$file\n";
   12    }
   13  }

Ejemplo de uso:

> smodifier.pl 'three.*three' double.in split.pl doublee.pl
double.in
doublee.pl

Vea la sección 31.4.2 para ver los contenidos del fichero double.in. En dicho fichero, el patrón three.*three aparece repartido entre varias líneas.

El modificador m

El modificador s se suele usar conjuntamente con el modificador m. He aquí lo que dice la seccion Using character classes de la sección 'Using-character-classes' en perlretut al respecto:

Here are examples of //s and //m in action:

   1. $x = "There once was a girl\nWho programmed in Perl\n";
   2.
   3. $x =~ /^Who/; # doesn't match, "Who" not at start of string
   4. $x =~ /^Who/s; # doesn't match, "Who" not at start of string
   5. $x =~ /^Who/m; # matches, "Who" at start of second line
   6. $x =~ /^Who/sm; # matches, "Who" at start of second line
   7.
   8. $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n"
   9. $x =~ /girl.Who/s; # matches, "." matches "\n"
  10. $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n"
  11. $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but //s and //m are occasionally very useful. If //m is being used, the start of the string can still be matched with \A and the end of the string can still be matched with the anchors \Z (matches both the end and the newline before, like $), and \z (matches only the end):

   1. $x =~ /^Who/m; # matches, "Who" at start of second line
   2. $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string
   3.
   4. $x =~ /girl$/m; # matches, "girl" at end of first line
   5. $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string
   6.
   7. $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
   8. $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string
Normalmente el carácter ^ casa solamente con el comienzo de la cadena y el carácter $ con el final. Los \n empotrados no casan con ^ ni $. El modificador /m modifica esta conducta. De este modo ^ y $ casan con cualquier frontera de línea interna. Las anclas \A y \Z se utilizan entonces para casar con el comienzo y final de la cadena. Véase un ejemplo:
nereida:~/perl/src> perl -de 0
  DB<1> $a = "hola\npedro"
  DB<2> p "$a"
hola
pedro
  DB<3> $a =~ s/.*/x/m
  DB<4> p $a
x
pedro
  DB<5> $a =~ s/^pedro$/juan/
  DB<6> p "$a"
x
pedro
  DB<7> $a =~ s/^pedro$/juan/m
  DB<8>  p "$a"
x
juan

El conversor de temperaturas reescrito usando contexto de lista

Reescribamos el ejemplo anterior usando un contexto de lista:

casiano@millo:~/Lperltesting$ cat -n c2f_list.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   print "Enter a temperature (i.e. 32F, 100C):\n";
    5   my $input = <STDIN>;
    6   chomp($input);
    7 
    8   my ($InputNum, $type);
    9 
   10   ($InputNum, $type) = $input =~ m/^
   11                                       ([-+]?[0-9]+(?:\.[0-9]*)?) # Temperature
   12                                       \s*
   13                                       ([cCfF]) # Celsius or Farenheit
   14                                    $/x;
   15 
   16   die "Expecting a temperature, so don't understand \"$input\".\n" unless defined($InputNum);
   17 
   18   my ($celsius, $fahrenheit);
   19   if ($type eq "C" or $type eq "c") {
   20     $celsius = $InputNum;
   21     $fahrenheit = ($celsius * 9/5)+32;
   22   }
   23   else {
   24     $fahrenheit = $InputNum;
   25     $celsius = ($fahrenheit -32)*5/9;
   26   }
   27   printf "%.2f C = %.2f F\n", $celsius, $fahrenheit;

La opción x

La opción /x en una regexp permite utilizar comentarios y espacios dentro de la expresión regular. Los espacios dentro de la expresión regular dejan de ser significativos. Si quieres conseguir un espacio que sea significativo, usa \s o bien escápalo. Véase la sección 'Modifiers' en perlre y la sección 'Building-a-regexp' en perlretut.

Paréntesis sin memoria

La notación (?: ... ) se usa para introducir paréntesis de agrupamiento sin memoria. (?: ...) Permite agrupar las expresiones tal y como lo hacen los paréntesis ordinarios. La diferencia es que no ``memorizan'' esto es no guardan nada en $1, $2, etc. Se logra así una compilación mas eficiente. Veamos un ejemplo:

> cat groupingpar.pl
#!/usr/bin/perl

  my $a = shift;

  $a =~ m/(?:hola )*(juan)/;
  print "$1\n";
nereida:~/perl/src> groupingpar.pl 'hola juan'
juan

Interpolación en los patrones: La opción o

El patrón regular puede contener variables, que serán interpoladas (en tal caso, el patrón será recompilado). Si quieres que dicho patrón se compile una sóla vez, usa la opción /o.

pl@nereida:~/Lperltesting$ cat -n mygrep.pl
     1  #!/usr/bin/perl -w
     2  my $what = shift @ARGV || die "Usage $0 regexp files ...\n";
     3  while (<>) {
     4    print "File $ARGV, rel. line $.: $_" if (/$what/o); # compile only once
     5  }
     6
Sigue un ejemplo de ejecución:
pl@nereida:~/Lperltesting$ ./mygrep.pl
Usage ./mygrep.pl regexp files ...
pl@nereida:~/Lperltesting$ ./mygrep.pl if labels.c
File labels.c, rel. line 7:        if (a < 10) goto LABEL;

El siguiente texto es de la sección 'Using-regular-expressions-in-Perl' en perlretut:

If $pattern won't be changing over the lifetime of the script, we can add the //o modifier, which directs Perl to only perform variable substitutions once

Otra posibilidad es hacer una compilación previa usando el operador qr (véase la sección 'Regexp-Quote-Like-Operators' en perlop). La siguiente variante del programa anterior también compila el patrón una sóla vez:

pl@nereida:~/Lperltesting$ cat -n mygrep2.pl
     1  #!/usr/bin/perl -w
     2  my $what = shift @ARGV || die "Usage $0 regexp files ...\n";
     3  $what = qr{$what};
     4  while (<>) {
     5    print "File $ARGV, rel. line $.: $_" if (/$what/);
     6  }

Véase

Cuantificadores greedy

El siguiente extracto de la sección Matching Repetitions en la sección 'Matching-repetitions' en perlretut ilustra la semántica greedy de los operadores de repetición *+{}? etc.

For all of these quantifiers, Perl will try to match as much of the string as possible, while still allowing the regexp to succeed. Thus with /a?.../, Perl will first try to match the regexp with the a present; if that fails, Perl will try to match the regexp without the a present. For the quantifier * , we get the following:

   1. $x = "the cat in the hat";
   2. $x =~ /^(.*)(cat)(.*)$/; # matches,
   3. # $1 = 'the '
   4. # $2 = 'cat'
   5. # $3 = ' in the hat'

Which is what we might expect, the match finds the only cat in the string and locks onto it. Consider, however, this regexp:

   1. $x =~ /^(.*)(at)(.*)$/; # matches,
   2. # $1 = 'the cat in the h'
   3. # $2 = 'at'
   4. # $3 = '' (0 characters match)

One might initially guess that Perl would find the at in cat and stop there, but that wouldn't give the longest possible string to the first quantifier .*. Instead, the first quantifier .* grabs as much of the string as possible while still having the regexp match. In this example, that means having the at sequence with the final at in the string.

The other important principle illustrated here is that when there are two or more elements in a regexp, the leftmost quantifier, if there is one, gets to grab as much the string as possible, leaving the rest of the regexp to fight over scraps. Thus in our example, the first quantifier .* grabs most of the string, while the second quantifier .* gets the empty string. Quantifiers that grab as much of the string as possible are called maximal match or greedy quantifiers.

When a regexp can match a string in several different ways, we can use the principles above to predict which way the regexp will match:

Regexp y Bucles Infinitos

El siguiente párrafo está tomado de la sección 'Repeated-Patterns-Matching-a-Zero-length-Substring' en perlre:

Regular expressions provide a terse and powerful programming language. As with most other power tools, power comes together with the ability to wreak havoc.

A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:

   1. 'foo' =~ m{ ( o? )* }x;

The o? matches at the beginning of 'foo' , and since the position in the string is not moved by the match, o? would match again and again because of the * quantifier.

Another common way to create a similar cycle is with the looping modifier //g :

   1. @matches = ( 'foo' =~ m{ o? }xg );

or

   1. print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

... Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{} , and for higher-level ones like the /g modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring. Thus

   1.  m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

   1.  m{ (?: NON_ZERO_LENGTH )*
   2.  |
   3.  (?: ZERO_LENGTH )?
   4.  }x;

The higher level-loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited to have a length of zero. This prohibition interacts with backtracking (see Backtracking), and so the second best match is chosen if the best match is of zero length.

For example:

   1. $_ = 'bar';
   2. s/\w??/<$&>/g;

results in <><b><><a><><r><> . At each position of the string the best match given by non-greedy ?? is the zero-length match, and the second best match is what is matched by \w . Thus zero-length matches alternate with one-character-long matches.

Similarly, for repeated m/()/g the second-best match is the match at the position one notch further in the string.

The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos(). Zero-length matches at the end of the previous match are ignored during split.

Ejercicio 31.1.2  

Cuantificadores lazy

Las expresiones lazy o no greedy hacen que el NFA se detenga en la cadena mas corta que casa con la expresión. Se denotan como sus análogas greedy añadiéndole el postfijo ?:

Repasemos lo que dice la sección Matching Repetitions en la sección 'Matching-repetitions' en perlretut:

Sometimes greed is not good. At times, we would like quantifiers to match a minimal piece of string, rather than a maximal piece. For this purpose, Larry Wall created the minimal match or non-greedy quantifiers ?? ,*?, +?, and {}?. These are the usual quantifiers with a ? appended to them. They have the following meanings:

Let's look at the example above, but with minimal quantifiers:

   1. $x = "The programming republic of Perl";
   2. $x =~ /^(.+?)(e|r)(.*)$/; # matches,
   3. # $1 = 'Th'
   4. # $2 = 'e'
   5. # $3 = ' programming republic of Perl'

The minimal string that will allow both the start of the string ^ and the alternation to match is Th , with the alternation e|r matching e. The second quantifier .* is free to gobble up the rest of the string.

   1. $x =~ /(m{1,2}?)(.*?)$/; # matches,
   2. # $1 = 'm'
   3. # $2 = 'ming republic of Perl'

The first string position that this regexp can match is at the first m in programming . At this position, the minimal m{1,2}? matches just one m . Although the second quantifier .*? would prefer to match no characters, it is constrained by the end-of-string anchor $ to match the rest of the string.

   1. $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
   2. # $1 = 'The progra'
   3. # $2 = 'm'
   4. # $3 = 'ming republic of Perl'

In this regexp, you might expect the first minimal quantifier .*? to match the empty string, because it is not constrained by a ^ anchor to match the beginning of the word. Principle 0 applies here, however. Because it is possible for the whole regexp to match at the start of the string, it will match at the start of the string. Thus the first quantifier has to match everything up to the first m. The second minimal quantifier matches just one m and the third quantifier matches the rest of the string.

   1. $x =~ /(.??)(m{1,2})(.*)$/; # matches,
   2. # $1 = 'a'
   3. # $2 = 'mm'
   4. # $3 = 'ing republic of Perl'

Just as in the previous regexp, the first quantifier .?? can match earliest at position a , so it does. The second quantifier is greedy, so it matches mm , and the third matches the rest of the string.

We can modify principle 3 above to take into account non-greedy quantifiers:

Ejercicio 31.1.3   Explique cuál será el resultado de el segundo comando de matching introducido en el depurador:
casiano@millo:~/Lperltesting$ perl -wde 0
main::(-e:1):   0
  DB<1> x ('1'x34) =~ m{^(11+)\1+$}
0  11111111111111111
  DB<2> x ('1'x34) =~ m{^(11+?)\1+$}
????????????????????????????????????

Descripción detallada del proceso de matching

Veamos en detalle lo que ocurre durante un matching. Repasemos lo que dice la sección Matching Repetitions en la sección 'Matching-repetitions' en perlretut:

Just like alternation, quantifiers are also susceptible to backtracking. Here is a step-by-step analysis of the example

   1. $x = "the cat in the hat";
   2. $x =~ /^(.*)(at)(.*)$/; # matches,
   3. # $1 = 'the cat in the h'
   4. # $2 = 'at'
   5. # $3 = '' (0 matches)

  1. Start with the first letter in the string 't'.

  2. The first quantifier '.*' starts out by matching the whole string 'the cat in the hat'.

  3. 'a' in the regexp element 'at' doesn't match the end of the string. Backtrack one character.

  4. 'a' in the regexp element 'at' still doesn't match the last letter of the string 't', so backtrack one more character.

  5. Now we can match the 'a' and the 't'.

  6. Move on to the third element '.*'. Since we are at the end of the string and '.*' can match 0 times, assign it the empty string.

  7. We are done!

Rendimiento

La forma en la que se escribe una regexp puede dar lugar agrandes variaciones en el rendimiento. Repasemos lo que dice la sección Matching Repetitions en la sección 'Matching-repetitions' en perlretut:

Most of the time, all this moving forward and backtracking happens quickly and searching is fast. There are some pathological regexps, however, whose execution time exponentially grows with the size of the string. A typical structure that blows up in your face is of the form

            /(a|b+)*/;

The problem is the nested indeterminate quantifiers. There are many different ways of partitioning a string of length n between the + and *: one repetition with b+ of length $ n$, two repetitions with the first b+ length $ k$ and the second with length $ n-k$, $ m$ repetitions whose bits add up to length $ n$, etc.

In fact there are an exponential number of ways to partition a string as a function of its length. A regexp may get lucky and match early in the process, but if there is no match, Perl will try every possibility before giving up. So be careful with nested *'s, {n,m}'s, and + 's.

The book Mastering Regular Expressions by Jeffrey Friedl [8] gives a wonderful discussion of this and other efficiency issues.

Eliminación de Comentarios de un Programa C

El siguiente ejemplo elimina los comentarios de un programa C.

casiano@millo:~/Lperltesting$ cat -n comments.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   my $progname = shift @ARGV or die "Usage:\n$0 prog.c\n";
    5   open(my $PROGRAM,"<$progname") || die "can't find $progname\n";
    6   my $program = '';
    7   {
    8     local $/ = undef;
    9     $program = <$PROGRAM>;
   10   }
   11   $program =~ s{
   12     /\*  # Match the opening delimiter
   13     .*?  # Match a minimal number of characters
   14     \*/  # Match the closing delimiter
   15   }[]gsx;
   16 
   17   print $program;
Veamos un ejemplo de ejecución. Supongamos el fichero de entrada:
> cat hello.c
#include <stdio.h>
/* first
comment
*/
main() {
  printf("hello world!\n"); /* second comment */
}

Entonces la ejecución con ese fichero de entrada produce la salida:

> comments.pl hello.c
#include <stdio.h>
 
main() {
  printf("hello world!\n");
}
Veamos la diferencia de comportamiento entre * y *? en el ejemplo anterior:

pl@nereida:~/src/perl/perltesting$  perl5_10_1 -wde 0
main::(-e:1):   0
  DB<1>   use re 'debug'; 'main() /* 1c */ { /* 2c */ return; /* 3c */ }' =~ qr{(/\*.*\*/)}; print "\n$1\n"
Compiling REx "(/\*.*\*/)"
Final program:
   1: OPEN1 (3)
   3:   EXACT  (5)
   5:   STAR (7)
   6:     REG_ANY (0)
   7:   EXACT <*/> (9)
   9: CLOSE1 (11)
  11: END (0)
anchored "/*" at 0 floating "*/" at 2..2147483647 (checking floating) minlen 4
Guessing start of match in sv for REx "(/\*.*\*/)" against "main() /* 1c */ { /* 2c */ return; /* 3c */ }"
Found floating substr "*/" at offset 13...
Found anchored substr "/*" at offset 7...
Starting position does not contradict /^/m...
Guessed: match at offset 7
Matching REx "(/\*.*\*/)" against "/* 1c */ { /* 2c */ return; /* 3c */ }"
   7      |  1:OPEN1(3)
   7      |  3:EXACT (5)
   9 <() /*> < 1c */ { />    |  5:STAR(7)
                                  REG_ANY can match 36 times out of 2147483647...
  41 <; /* 3c > <*/ }>       |  7:  EXACT <*/>(9)
  43 <; /* 3c */> < }>       |  9:  CLOSE1(11)
  43 <; /* 3c */> < }>       | 11:  END(0)
Match successful!

/* 1c */ { /* 2c */ return; /* 3c */
Freeing REx: "(/\*.*\*/)"

  DB<2>   use re 'debug'; 'main() /* 1c */ { /* 2c */ return; /* 3c */ }' =~ qr{(/\*.*?\*/)}; print "\n$1\n"
Compiling REx "(/\*.*?\*/)"
Final program:
   1: OPEN1 (3)
   3:   EXACT  (5)
   5:   MINMOD (6)
   6:   STAR (8)
   7:     REG_ANY (0)
   8:   EXACT <*/> (10)
  10: CLOSE1 (12)
  12: END (0)
anchored "/*" at 0 floating "*/" at 2..2147483647 (checking floating) minlen 4
Guessing start of match in sv for REx "(/\*.*?\*/)" against "main() /* 1c */ { /* 2c */ return; /* 3c */ }"
Found floating substr "*/" at offset 13...
Found anchored substr "/*" at offset 7...
Starting position does not contradict /^/m...
Guessed: match at offset 7
Matching REx "(/\*.*?\*/)" against "/* 1c */ { /* 2c */ return; /* 3c */ }"
   7      |  1:OPEN1(3)
   7      |  3:EXACT (5)
   9 <() /*> < 1c */ { />    |  5:MINMOD(6)
   9 <() /*> < 1c */ { />    |  6:STAR(8)
                                  REG_ANY can match 4 times out of 4...
  13 <* 1c > <*/ { /* 2c>    |  8:  EXACT <*/>(10)
  15 <1c */> < { /* 2c *>    | 10:  CLOSE1(12)
  15 <1c */> < { /* 2c *>    | 12:  END(0)
Match successful!

/* 1c */
Freeing REx: "(/\*.*?\*/)"

  DB<3>

Véase también la documentación en la sección 'Matching-repetitions' en perlretut y la sección 'Quantifiers' en perlre.

Negaciones y operadores lazy

A menudo las expresiones X[^X]*X y X.*?X, donde X es un carácter arbitrario se usan de forma casi equivalente.

Esta equivalencia se rompe si no se cumplen las hipótesis establecidas.

En el siguiente ejemplo se intentan detectar las cadenas entre comillas dobles que terminan en el signo de exclamación:

pl@nereida:~/Lperltesting$ cat -n negynogreedy.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3
     4  my $b = 'Ella dijo "Ana" y yo contesté: "Jamás!". Eso fué todo.';
     5  my $a;
     6  ($a = $b) =~ s/".*?!"/-$&-/;
     7  print "$a\n";
     8
     9  $b =~ s/"[^"]*!"/-$&-/;
    10  print "$b\n";

Al ejecutar el programa obtenemos:

> negynogreedy.pl
Ella dijo -"Ana" y yo contesté: "Jamás!"-. Eso fué todo.
Ella dijo "Ana" y yo contesté: -"Jamás!"-. Eso fué todo.

Copia y sustitución simultáneas

El operador de binding =~ nos permite ``asociar'' la variable con la operación de casamiento o sustitución. Si se trata de una sustitución y se quiere conservar la cadena, es necesario hacer una copia:
$d = $s;
$d =~ s/esto/por lo otro/;
en vez de eso, puedes abreviar un poco usando la siguiente ``perla'':
($d = $s) =~ s/esto/por lo otro/;
Obsérvese la asociación por la izquierda del operador de asignación.

Referencias a Paréntesis Previos

Las referencias relativas permiten escribir expresiones regulares mas reciclables. Véase la documentación en la sección 'Relative-backreferences' en perlretut:

Counting the opening parentheses to get the correct number for a backreference is errorprone as soon as there is more than one capturing group. A more convenient technique became available with Perl 5.10: relative backreferences. To refer to the immediately preceding capture group one now may write \g{-1} , the next but last is available via \g{-2}, and so on.

Another good reason in addition to readability and maintainability for using relative backreferences is illustrated by the following example, where a simple pattern for matching peculiar strings is used:

   1. $a99a = '([a-z])(\d)\2\1'; # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel tempted to use it as a part of some other pattern:

   1. $line = "code=e99e";
   2. if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
   3.   print "$1 is valid\n";
   4. } else {
   5.   print "bad line: '$line'\n";
   6. }

But this doesn't match - at least not the way one might expect. Only after inserting the interpolated $a99a and looking at the resulting full text of the regexp is it obvious that the backreferences have backfired - the subexpression (\w+) has snatched number 1 and demoted the groups in $a99a by one rank. This can be avoided by using relative backreferences:

   1. $a99a = '([a-z])(\d)\g{-1}\g{-2}'; # safe for being interpolated

El siguiente programa ilustra lo dicho:

casiano@millo:~/Lperltesting$ cat -n backreference.pl
    1   use strict;
    2   use re 'debug';
    3 
    4   my $a99a = '([a-z])(\d)\2\1';
    5   my $line = "code=e99e";
    6   if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
    7     print "$1 is valid\n";
    8   } else {
    9     print "bad line: '$line'\n";
   10   }
Sigue la ejecución:
casiano@millo:~/Lperltesting$  perl5.10.1 -wd backreference.pl
main::(backreference.pl:4):     my $a99a = '([a-z])(\d)\2\1';
  DB<1>  c 6
main::(backreference.pl:6):     if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
  DB<2>  x ($line =~ /^(\w+)=$a99a$/)
  empty array
  DB<4>  $a99a = '([a-z])(\d)\g{-1}\g{-2}'
  DB<5>  x ($line =~ /^(\w+)=$a99a$/)
0  'code'
1  'e'
2  9

Usando Referencias con Nombre (Perl 5.10)

El siguiente texto esta tomado de la sección 'Named-backreferences' en perlretut:

Perl 5.10 also introduced named capture buffers and named backreferences. To attach a name to a capturing group, you write either (?<name>...) or (?'name'...). The backreference may then be written as \g{name} .

It is permissible to attach the same name to more than one group, but then only the leftmost one of the eponymous set can be referenced. Outside of the pattern a named capture buffer is accessible through the %+ hash.

Assuming that we have to match calendar dates which may be given in one of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write three suitable patterns where we use 'd', 'm' and 'y' respectively as the names of the buffers capturing the pertaining components of a date. The matching operation combines the three patterns as alternatives:

   1.  $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
   2.  $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
   3.  $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
   4.  for my $d qw( 2006-10-21 15.01.2007 10/31/2005 ){
   5.    if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
   6.      print "day=$+{d} month=$+{m} year=$+{y}\n";
   7.    }
   8.  }

If any of the alternatives matches, the hash %+ is bound to contain the three key-value pairs.
En efecto, al ejecutar el programa:
casiano@millo:~/Lperltesting$ cat -n namedbackreferences.pl
     1  use v5.10;
     2  use strict;
     3
     4  my $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
     5  my $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
     6  my $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
     7
     8  for my $d qw( 2006-10-21 15.01.2007 10/31/2005 ){
     9    if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
    10      print "day=$+{d} month=$+{m} year=$+{y}\n";
    11    }
    12  }
Obtenemos la salida:
casiano@millo:~/Lperltesting$ perl5.10.1 -w namedbackreferences.pl
day=21 month=10 year=2006
day=15 month=01 year=2007
day=31 month=10 year=2005

Como se comentó:

... It is permissible to attach the same name to more than one group, but then only the leftmost one of the eponymous set can be referenced.

Veamos un ejemplo:

pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0
main::(-e:1):   0
DB<1>  # ... only the leftmost one of the eponymous set can be referenced
DB<2> $r = qr{(?<a>[a-c])(?<a>[a-f])}
DB<3> print $+{a} if 'ad' =~ $r
a
DB<4> print $+{a} if 'cf' =~ $r
c
DB<5> print $+{a} if 'ak' =~ $r

Reescribamos el ejemplo de conversión de temperaturas usando paréntesis con nombre:

pl@nereida:~/Lperltesting$ cat -n c2f_5_10v2.pl
 1  #!/usr/local/bin/perl5_10_1 -w
 2  use strict;
 3
 4  print "Enter a temperature (i.e. 32F, 100C):\n";
 5  my $input = <STDIN>;
 6  chomp($input);
 7
 8  $input =~ m/^
 9              (?<farenheit>[-+]?[0-9]+(?:\.[0-9]*)?)\s*[fF]
10              |
11              (?<celsius>[-+]?[0-9]+(?:\.[0-9]*)?)\s*[cC]
12           $/x;
13
14  my ($celsius, $farenheit);
15  if (exists $+{celsius}) {
16    $celsius = $+{celsius};
17    $farenheit = ($celsius * 9/5)+32;
18  }
19  elsif (exists $+{farenheit}) {
20    $farenheit = $+{farenheit};
21    $celsius = ($farenheit -32)*5/9;
22  }
23  else {
24    die "Expecting a temperature, so don't understand \"$input\".\n";
25  }
26
27  printf "%.2f C = %.2f F\n", $celsius, $farenheit;

La función exists retorna verdadero si existe la clave en el hash y falso en otro caso.

Grupos con Nombre y Factorización

El uso de nombres hace mas robustas y mas factorizables las expresiones regulares. Consideremos la siguiente regexp que usa notación posicional:

pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0
main::(-e:1):   0
  DB<1> x "abbacddc" =~ /(.)(.)\2\1/
0  'a'
1  'b'
Supongamos que queremos reutilizar la regexp con repetición
  DB<2> x "abbacddc" =~ /((.)(.)\2\1){2}/
  empty array
¿Que ha ocurrido? La introducción del nuevo paréntesis nos obliga a renombrar las referencias a las posiciones:
  DB<3> x "abbacddc" =~ /((.)(.)\3\2){2}/
0  'cddc'
1  'c'
2  'd'
  DB<4> "abbacddc" =~ /((.)(.)\3\2){2}/; print "$&\n"
abbacddc
Esto no ocurre si utilizamos nombres. El operador \k<a> sirve para hacer referencia al valor que ha casado con el paréntesis con nombre a:
  DB<5> x "abbacddc" =~ /((?<a>.)(?<b>.)\k<b>\k<a>){2}/
0  'cddc'
1  'c'
2  'd'
El uso de grupos con nombre y \k31.1en lugar de referencias numéricas absolutas hace que la regexp sea mas reutilizable.

LLamadas a expresiones regulares via paréntesis con memoria

Es posible también llamar a la expresión regular asociada con un paréntesis.

Este parrafo tomado de la sección 'Extended-Patterns' en perlre explica el modo de uso:

(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)

PARNO is a sequence of digits (not starting with 0) whose value reflects the paren-number of the capture buffer to recurse to.

....

Capture buffers contained by the pattern will have the value as determined by the outermost recursion. ....

If PARNO is preceded by a plus or minus sign then it is assumed to be relative, with negative numbers indicating preceding capture buffers and positive ones following. Thus (?-1) refers to the most recently declared buffer, and (?+1) indicates the next buffer to be declared.

Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed buffers are included.

Veamos un ejemplo:

casiano@millo:~/Lperltesting$ perl5.10.1 -wdE 0
main::(-e:1):   0
  DB<1> x "AABB" =~ /(A)(?-1)(?+1)(B)/
0  'A'
1  'B'
  # Parenthesis:       1   2   2                  1
  DB<2> x 'ababa' =~ /^((?:([ab])(?1)\g{-1}|[ab]?))$/
0  'ababa'
1  'a'
  DB<3> x 'bbabababb' =~ /^((?:([ab])(?1)\g{-1}|[ab]?))$/
0  'bbabababb'
1  'b'

Véase también:

Reutilizando Expresiones Regulares

La siguiente reescritura de nuestro ejemplo básico utiliza el módulo Regexp::Common para factorizar la expresión regular:

casiano@millo:~/src/perl/perltesting$ cat -n c2f_5_10v3.pl
 1  #!/soft/perl5lib/bin/perl5.10.1 -w
 2  use strict;
 3  use Regexp::Common;
 4
 5  print "Enter a temperature (i.e. 32F, 100C):\n";
 6  my $input = <STDIN>;
 7  chomp($input);
 8
 9  $input =~ m/^
10              (?<farenheit>$RE{num}{real})\s*[fF]
11              |
12              (?<celsius>$RE{num}{real})\s*[cC]
13           $/x;
14
15  my ($celsius, $farenheit);
16  if ('celsius' ~~ %+) {
17    $celsius = $+{celsius};
18    $farenheit = ($celsius * 9/5)+32;
19  }
20  elsif ('farenheit' ~~ %+) {
21    $farenheit = $+{farenheit};
22    $celsius = ($farenheit -32)*5/9;
23  }
24  else {
25    die "Expecting a temperature, so don't understand \"$input\".\n";
26  }
27
28  printf "%.2f C = %.2f F\n", $celsius, $farenheit;

Véase:

El Módulo Regexp::Common

El módulo Regexp::Common provee un extenso número de expresiones regulares que son accesibles vía el hash %RE. sigue un ejemplo de uso:

casiano@millo:~/Lperltesting$ cat -n regexpcommonsynopsis.pl
     1  use strict;
     2  use Perl6::Say;
     3  use Regexp::Common;
     4
     5  while (<>) {
     6      say q{a number}              if /$RE{num}{real}/;
     7
     8      say q{a ['"`] quoted string} if /$RE{quoted}/;
     9
    10      say q{a /.../ sequence}      if m{$RE{delimited}{'-delim'=>'/'}};
    11
    12      say q{balanced parentheses}  if /$RE{balanced}{'-parens'=>'()'}/;
    13
    14      die q{a #*@%-ing word}."\n"  if /$RE{profanity}/;
    15
    16  }
    17
Sigue un ejemplo de ejecución:
casiano@millo:~/Lperltesting$ perl regexpcommonsynopsis.pl
43
a number
"2+2 es" 4
a number
a ['"`] quoted string
x/y/z
a /.../ sequence
(2*(4+5/(3-2)))
a number
balanced parentheses
fuck you!
a #*@%-ing word

El siguiente fragmento de la documentación de Regexp::Common explica el modo simplificado de uso:

To access a particular pattern, %RE is treated as a hierarchical hash of hashes (of hashes...), with each successive key being an identifier. For example, to access the pattern that matches real numbers, you specify:

        $RE{num}{real}

and to access the pattern that matches integers:

        $RE{num}{int}

Deeper layers of the hash are used to specify flags: arguments that modify the resulting pattern in some way.

For example, to access the pattern that matches base-2 real numbers with embedded commas separating groups of three digits (e.g. 10,101,110.110101101):

        $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}

Through the magic of Perl, these flag layers may be specified in any order (and even interspersed through the identifier keys!) so you could get the same pattern with:

        $RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}

or:

        $RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}

or even:

        $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}

etc.

Note, however, that the relative order of amongst the identifier keys is significant. That is:

        $RE{list}{set}

would not be the same as:

        $RE{set}{list}

Veamos un ejemplo con el depurador:

casiano@millo:~/Lperltesting$ perl -MRegexp::Common -wde 0
main::(-e:1):   0
  DB<1> x 'numero: 10,101,110.110101101 101.1e-1 234' =~ m{($RE{num}{real}{-base => 2}{-sep => ','}{-group => 3})}g
0  '10,101,110.110101101'
1  '101.1e-1'

La expresión regular para un número real es relativamente compleja:

casiano@millo:~/src/perl/perltesting$ perl5.10.1 -wd c2f_5_10v3.pl
main::(c2f_5_10v3.pl:5):     print "Enter a temperature (i.e. 32F, 100C):\n";
  DB<1> p $RE{num}{real}
(?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))

Si se usa la opción -keep el patrón proveído usa paréntesis con memoria:

casiano@millo:~/Lperltesting$ perl -MRegexp::Common -wde 0
main::(-e:1):   0
DB<2> x 'one, two, three, four, five' =~ /$RE{list}{-pat => '\w+'}/
0  1
DB<3> x 'one, two, three, four, five' =~ /$RE{list}{-pat => '\w+'}{-keep}/
0  'one, two, three, four, five'
1  ', '

Smart Matching

Perl 5.10 introduce el operador de smart matching. El siguiente texto es tomado casi verbatim del site de la compañía Perl Training Australia31.2:

Perl 5.10 introduces a new-operator, called smart-match, written ~~. As the name suggests, smart-match tries to compare its arguments in an intelligent fashion. Using smart-match effectively allows many complex operations to be reduces to very simple statements.

Unlike many of the other features introduced in Perl 5.10, there's no need to use the feature pragma to enable smart-match, as long as you're using 5.10 it's available.

The smart-match operator is always commutative. That means that $x ~~ $y works the same way as $y ~~ $x. You'll never have to remember which order to place to your operands with smart-match. Smart-match in action.

As a simple introduction, we can use smart-match to do a simple string comparison between simple scalars. For example:

    use feature qw(say);

    my $x = "foo";
    my $y = "bar";
    my $z = "foo";

    say '$x and $y are identical strings' if $x ~~ $y;
    say '$x and $z are identical strings' if $x ~~ $z;    # Printed

If one of our arguments is a number, then a numeric comparison is performed:

    my $num   = 100;
    my $input = <STDIN>;

    say 'You entered 100' if $num ~~ $input;

This will print our message if our user enters 100, 100.00, +100, 1e2, or any other string that looks like the number 100.

We can also smart-match against a regexp:

    my $input  = <STDIN>;

    say 'You said the secret word!' if $input ~~ /xyzzy/;

Smart-matching with a regexp also works with saved regexps created with qr.

So we can use smart-match to act like eq, == and =~, so what? Well, it does much more than that.

We can use smart-match to search a list:

casiano@millo:~/Lperltesting$ perl5.10.1 -wdE 0
main::(-e:1):   0
  DB<1> @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf)
  DB<2> print "You're a friend" if 'Pippin' ~~ @friends
You're a friend
  DB<3> print "You're a friend" if 'Mordok' ~~ @friends

It's important to note that searching an array with smart-match is extremely fast. It's faster than using grep, it's faster than using first from Scalar::Util, and it's faster than walking through the loop with foreach, even if you do know all the clever optimisations.

Esta es la forma típica de buscar un elemento en un array en versiones anteriores a la 5.10:
casiano@millo:~$ perl -wde 0
main::(-e:1):   0
  DB<1> use List::Util qw{first}
  DB<2> @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf)
  DB<3> x first { $_ eq 'Pippin'} @friends
0  'Pippin'
  DB<4> x first { $_ eq 'Mordok'} @friends
0  undef

We can also use smart-match to compare arrays:

  DB<4> @foo = qw(x y z xyzzy ninja)
  DB<5> @bar = qw(x y z xyzzy ninja)
  DB<7> print "Identical arrays" if @foo ~~ @bar
Identical arrays
  DB<8> @bar = qw(x y z xyzzy nOnjA)
  DB<9> print "Identical arrays" if @foo ~~ @bar
  DB<10>

And even search inside an array using a string:

 DB<11> x @foo = qw(x y z xyzzy ninja)
0  'x'
1  'y'
2  'z'
3  'xyzzy'
4  'ninja'
  DB<12> print "Array contains a ninja " if @foo ~~ 'ninja'

or using a regexp:

  DB<13> print "Array contains magic pattern" if @foo ~~ /xyz/
Array contains magic pattern
  DB<14> print "Array contains magic pattern" if @foo ~~ /\d+/

Smart-match works with array references, too31.3:

  DB<16> $array_ref = [ 1..10 ]
  DB<17> print "Array contains 10" if 10 ~~ $array_ref
Array contains 10
  DB<18> print "Array contains 10" if $array_ref ~~ 10
  DB<19>

En el caso de un número y un array devuelve cierto si el escalar aparece en un array anidado:

casiano@millo:~/Lperltesting$ perl5.10.1 -E 'say "ok" if 42 ~~  [23, 17, [40..50], 70];'
ok
casiano@millo:~/Lperltesting$ perl5.10.1 -E 'say "ok" if 42 ~~  [23, 17, [50..60], 70];'
casiano@millo:~/Lperltesting$

Of course, we can use smart-match with more than just arrays and scalars, it works with searching for the key in a hash, too!

  DB<19> %colour = ( sky   => 'blue', grass => 'green', apple => 'red',)
  DB<20> print "I know the colour" if 'grass' ~~ %colour
I know the colour
  DB<21> print "I know the colour" if 'cloud' ~~ %colour
  DB<22>
  DB<23> print "A key starts with 'gr'" if %colour ~~ /^gr/
A key starts with 'gr'
  DB<24> print "A key starts with 'clou'" if %colour ~~ /^clou/
  DB<25>

You can even use it to see if the two hashes have identical keys:

  DB<26> print 'Hashes have identical keys' if %taste ~~ %colour;
Hashes have identical keys

La conducta del operador de smart matching viene dada por la siguiente tabla tomada de la sección 'Smart-matching-in-detail' en perlsyn:

The behaviour of a smart match depends on what type of thing its arguments are. The behaviour is determined by the following table: the first row that applies determines the match behaviour (which is thus mostly determined by the type of the right operand). Note that the smart match implicitly dereferences any non-blessed hash or array ref, so the "Hash" and "Array" entries apply in those cases. (For blessed references, the "Object" entries apply.)

Note that the "Matching Code" column is not always an exact rendition. For example, the smart match operator short-circuits whenever possible, but grep does not.

 $a      $b        Type of Match Implied    Matching Code
 ======  =====     =====================    =============
 Any     undef     undefined                !defined $a

 Any     Object    invokes ~~ overloading on $object, or dies

 Hash    CodeRef   sub truth for each key[1] !grep { !$b->($_) } keys %$a
 Array   CodeRef   sub truth for each elt[1] !grep { !$b->($_) } @$a
 Any     CodeRef   scalar sub truth          $b->($a)

 Hash    Hash      hash keys identical (every key is found in both hashes)
 Array   Hash      hash slice existence     grep { exists $b->{$_} } @$a
 Regex   Hash      hash key grep            grep /$a/, keys %$b
 undef   Hash      always false (undef can't be a key)
 Any     Hash      hash entry existence     exists $b->{$a}

 Hash    Array     hash slice existence     grep { exists $a->{$_} } @$b
 Array   Array     arrays are comparable[2]
 Regex   Array     array grep               grep /$a/, @$b
 undef   Array     array contains undef     grep !defined, @$b
 Any     Array     match against an array element[3]
                                            grep $a ~~ $_, @$b

 Hash    Regex     hash key grep            grep /$b/, keys %$a
 Array   Regex     array grep               grep /$b/, @$a
 Any     Regex     pattern match            $a =~ /$b/

 Object  Any       invokes ~~ overloading on $object, or falls back:
 Any     Num       numeric equality         $a == $b
 Num     numish[4] numeric equality         $a == $b
 undef   Any       undefined                !defined($b)
 Any     Any       string equality          $a eq $b

Ejercicios

Ejercicio 31.1.4  

Depuración de Expresiones Regulares

Para obtener información sobre la forma en que es compilada una expresión regular y como se produce el proceso de matching podemos usar la opción 'debug' del módulo re. La versión de Perl 5.10 da una información algo mas legible que la de las versiones anteriores:

pl@nereida:~/Lperltesting$ perl5_10_1 -wde 0

Loading DB routines from perl5db.pl version 1.32
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(-e:1):   0
  DB<1> use re 'debug'; 'astr' =~ m{[sf].r}
Compiling REx "[sf].r"
Final program:
   1: ANYOF[fs][] (12)
  12: REG_ANY (13)
  13: EXACT <r> (15)
  15: END (0)
anchored "r" at 2 (checking anchored) stclass ANYOF[fs][] minlen 3
Guessing start of match in sv for REx "[sf].r" against "astr"
Found anchored substr "r" at offset 3...
Starting position does not contradict /^/m...
start_shift: 2 check_at: 3 s: 1 endpos: 2
Does not contradict STCLASS...
Guessed: match at offset 1
Matching REx "[sf].r" against "str"
   1 <a> <str>               |  1:ANYOF[fs][](12)
   2 <as> <tr>               | 12:REG_ANY(13)
   3 <ast> <r>               | 13:EXACT <r>(15)
   4 <astr> <>               | 15:END(0)
Match successful!
Freeing REx: "[sf].r"

Si se usa la opción debug de re con objetos expresión regular, se obtendrá información durante el proceso de matching:

 DB<3> use re 'debug'; $re = qr{[sf].r}
Compiling REx "[sf].r"
Final program:
   1: ANYOF[fs][] (12)
  12: REG_ANY (13)
  13: EXACT <r> (15)
  15: END (0)
anchored "r" at 2 (checking anchored) stclass ANYOF[fs][] minlen 3

  DB<4> 'astr' =~ $re
Guessing start of match in sv for REx "[sf].r" against "astr"
Found anchored substr "r" at offset 3...
Starting position does not contradict /^/m...
start_shift: 2 check_at: 3 s: 1 endpos: 2
Does not contradict STCLASS...
Guessed: match at offset 1
Matching REx "[sf].r" against "str"
   1 <a> <str>               |  1:ANYOF[fs][](12)
   2 <as> <tr>               | 12:REG_ANY(13)
   3 <ast> <r>               | 13:EXACT <r>(15)
   4 <astr> <>               | 15:END(0)
Match successful!


Tablas de Escapes, Metacarácteres, Cuantificadores, Clases

Sigue una sección de tablas con notaciones tomada de perlre:

Metacharacters

The following metacharacters have their standard egrep-ish meanings:

   1. \ Quote the next metacharacter
   2. ^ Match the beginning of the line
   3. . Match any character (except newline)
   4. $ Match the end of the line (or before newline at the end)
   5. | Alternation
   6. () Grouping
   7. [] Character class

Standard greedy quantifiers

The following standard greedy quantifiers are recognized:

   1. * Match 0 or more times
   2. + Match 1 or more times
   3. ? Match 1 or 0 times
   4. {n} Match exactly n times
   5. {n,} Match at least n times
   6. {n,m} Match at least n but not more than m times

Non greedy quantifiers

The following non greedy quantifiers are recognized:

   1. *? Match 0 or more times, not greedily
   2. +? Match 1 or more times, not greedily
   3. ?? Match 0 or 1 time, not greedily
   4. {n}? Match exactly n times, not greedily
   5. {n,}? Match at least n times, not greedily
   6. {n,m}? Match at least n but not more than m times, not greedily

Possesive quantifiers

The following possesive quantifiers are recognized:

   1. *+ Match 0 or more times and give nothing back
   2. ++ Match 1 or more times and give nothing back
   3. ?+ Match 0 or 1 time and give nothing back
   4. {n}+ Match exactly n times and give nothing back (redundant)
   5. {n,}+ Match at least n times and give nothing back
   6. {n,m}+ Match at least n but not more than m times and give nothing back

Escape sequences

   1. \t tab (HT, TAB)
   2. \n newline (LF, NL)
   3. \r return (CR)
   4. \f form feed (FF)
   5. \a alarm (bell) (BEL)
   6. \e escape (think troff) (ESC)
   7. \033 octal char (example: ESC)
   8. \x1B hex char (example: ESC)
   9. \x{263a} long hex char (example: Unicode SMILEY)
  10. \cK control char (example: VT)
  11. \N{name} named Unicode character
  12. \l lowercase next char (think vi)
  13. \u uppercase next char (think vi)
  14. \L lowercase till \E (think vi)
  15. \U uppercase till \E (think vi)
  16. \E end case modification (think vi)
  17. \Q quote (disable) pattern metacharacters till \E

Ejercicio 31.1.5   Explique la salida:
casiano@tonga:~$ perl -wde 0
main::(-e:1):   0
  DB<1> $x = '([a-z]+)'
  DB<2> x 'hola' =~ /$x/
0  'hola'
  DB<3> x 'hola' =~ /\Q$x/
  empty array
  DB<4> x '([a-z]+)' =~ /\Q$x/
0  1

Character Classes and other Special Escapes

   1. \w Match a "word" character (alphanumeric plus "_")
   2. \W Match a non-"word" character
   3. \s Match a whitespace character
   4. \S Match a non-whitespace character
   5. \d Match a digit character
   6. \D Match a non-digit character
   7. \pP Match P, named property. Use \p{Prop} for longer names.
   8. \PP Match non-P
   9. \X Match eXtended Unicode "combining character sequence",
  10.    equivalent to (?>\PM\pM*)
  11. \C Match a single C char (octet) even under Unicode.
  12.    NOTE: breaks up characters into their UTF-8 bytes,
  13.    so you may end up with malformed pieces of UTF-8.
  14.    Unsupported in lookbehind.
  15. \1 Backreference to a specific group.
  16. '1' may actually be any positive integer.
  17. \g1 Backreference to a specific or previous group,
  18. \g{-1} number may be negative indicating a previous buffer and may
  19.        optionally be wrapped in curly brackets for safer parsing.
  20. \g{name} Named backreference
  21. \k<name> Named backreference
  22. \K Keep the stuff left of the \K, don't include it in $&
  23. \v Vertical whitespace
  24. \V Not vertical whitespace
  25. \h Horizontal whitespace
  26. \H Not horizontal whitespace
  27. \R Linebreak

Zero width assertions

Perl defines the following zero-width assertions:

   1. \b Match a word boundary
   2. \B Match except at a word boundary
   3. \A Match only at beginning of string
   4. \Z Match only at end of string, or before newline at the end
   5. \z Match only at end of string
   6. \G Match only at pos() (e.g. at the end-of-match position
   7. of prior m//g)

The POSIX character class syntax

The POSIX character class syntax:

   1. [:class:]

is also available. Note that the [ and ] brackets are literal; they must always be used within a character class expression.

   1. # this is correct:
   2. $string =~ /[[:alpha:]]/;
   3.
   4. # this is not, and will generate a warning:
   5. $string =~ /[:alpha:]/;

Available classes

The available classes and their backslash equivalents (if available) are as follows:

   1. alpha
   2. alnum
   3. ascii
   4. blank
   5. cntrl
   6. digit \d
   7. graph
   8. lower
   9. print
  10. punct
  11. space \s 
  12. upper
  13. word \w 
  14. xdigit

For example use [:upper:] to match all the uppercase characters. Note that the [] are part of the [::] construct, not part of the whole character class. For example:

   1. [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.

Equivalences to Unicode

The following equivalences to Unicode \p{} constructs and equivalent backslash character classes (if available), will hold:

   1. [[:...:]] \p{...} backslash
   2.
   3. alpha IsAlpha
   4. alnum IsAlnum
   5. ascii IsASCII
   6. blank
   7. cntrl IsCntrl
   8. digit IsDigit \d
   9. graph IsGraph
  10. lower IsLower
  11. print IsPrint 
  12. punct IsPunct 
  13. space IsSpace
  14. IsSpacePerl \s
  15. upper IsUpper
  16. word IsWord \w
  17. xdigit IsXDigit

Negated character classes

You can negate the [::] character classes by prefixing the class name with a '^'. This is a Perl extension. For example:

   1. POSIX traditional Unicode
   2.
   3. [[:^digit:]] \D \P{IsDigit}
   4. [[:^space:]] \S \P{IsSpace}
   5. [[:^word:]] \W \P{IsWord}


Variables especiales después de un emparejamiento

Despues de un emparejamiento con éxito, las siguientes variables especiales quedan definidas:



$& El texto que casó
$` El texto que está a la izquierda de lo que casó
$' El texto que está a la derecha de lo que casó
$1, $2, $3, etc. Los textos capturados por los paréntesis
$+ Una copia del $1, $2, ...con número mas alto
@- Desplazamientos de las subcadenas que casan en $1 ...
@+ Desplazamientos de los finales de las subcadenas en $1 ...
$#- El índice del último paréntesis que casó
$#+ El índice del último paréntesis en la última expresión regular

Las Variables de match, pre-match y post-mach



Ejemplo:

   1 #!/usr/bin/perl -w
   2 if ("Hello there, neighbor" =~ /\s(\w+),/) {
   3   print "That was: ($`)($&)($').\n",
   4 }

> matchvariables.pl
That was: (Hello)( there,)( neighbor).

El uso de estas variables tenía un efecto negativo en el rendimiento de la regexp. Véase perlfaq6 la sección Why does using $&, $`, or $' slow my program down?.

Once Perl sees that you need one of these variables anywhere in the program, it provides them on each and every pattern match. That means that on every pattern match the entire string will be copied, part of it to $`, part to $&, and part to $'. Thus the penalty is most severe with long strings and patterns that match often. Avoid $&, $', and $` if you can, but if you can't, once you've used them at all, use them at will because you've already paid the price. Remember that some algorithms really appreciate them. As of the 5.005 release, the $& variable is no longer "expensive" the way the other two are.

Since Perl 5.6.1 the special variables @- and @+ can functionally replace $`, $& and $'. These arrays contain pointers to the beginning and end of each match (see perlvar for the full story), so they give you essentially the same information, but without the risk of excessive string copying.

Perl 5.10 added three specials, ${^MATCH}, ${^PREMATCH}, and ${^POSTMATCH} to do the same job but without the global performance penalty. Perl 5.10 only sets these variables if you compile or execute the regular expression with the /p modifier.

pl@nereida:~/Lperltesting$ cat ampersandoldway.pl
#!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w
use strict;
use Benchmark qw(cmpthese timethese);

'hola juan' =~ /ju/;
my ($a, $b, $c) = ($`, $&, $');


cmpthese( -1, {
    oldway => sub { 'hola juan' =~ /ju/  },
});
pl@nereida:~/Lperltesting$ cat ampersandnewway.pl
#!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w
use strict;
use Benchmark qw(cmpthese timethese);

'hola juan' =~ /ju/p;
my ($a, $b, $c) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});


cmpthese( -1, {
    newway => sub { 'hola juan' =~ /ju/  },
});

pl@nereida:~/Lperltesting$ time ./ampersandoldway.pl
            Rate oldway
oldway 2991861/s     --

real    0m3.761s
user    0m3.740s
sys     0m0.020s
pl@nereida:~/Lperltesting$ time ./ampersandnewway.pl
            Rate newway
newway 8191999/s     --

real    0m6.721s
user    0m6.704s
sys     0m0.016s

Véase

Texto Asociado con el Último Paréntesis

La variable $+ contiene el texto que casó con el último paréntesis en el patrón. Esto es útil en situaciones en las cuáles una de un conjunto de alternativas casa, pero no sabemos cuál:

  DB<9> "Revision: 4.5" =~ /Version: (.*)|Revision: (.*)/ && ($rev = $+);
  DB<10> x $rev
0  4.5
  DB<11> "Version: 4.5" =~ /Version: (.*)|Revision: (.*)/ && ($rev = $+);
  DB<12> x $rev
0  4.5

Los Offsets de los Inicios de los Casamientos: @-

El vector @- contiene los offsets o desplazamientos de los casamientos en la última expresión regular. La entrada $-[0] es el desplazamiento del último casamiento con éxito y $-[n] es el desplazamiento de la subcadena que casa con el n-ésimo paréntesis (o undef si el párentesis no casó). Por ejemplo:

#           012345678
DB<1> $z = "hola13.47"
DB<2> if ($z =~ m{a(\d+)(\.(\d+))?}) { print "@-\n"; }
3 4 6 7
El resultado se interpreta como sigue:

Esto es lo que dice perlvar sobre @-:

This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The nth element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.

After a match against some variable $var:

    $` is the same as substr($var, 0, $-[0])
    $& is the same as substr($var, $-[0], $+[0] - $-[0])
    $' is the same as substr($var, $+[0])
    $1 is the same as substr($var, $-[1], $+[1] - $-[1])
    $2 is the same as substr($var, $-[2], $+[2] - $-[2])
    $3 is the same as substr($var, $-[3], $+[3] - $-[3])

Desplazamientos de los Finales de los Emparejamientos: @+

El array @+ contiene los desplazamientos de los finales de los emparejamientos. La entrada $+[0] contiene el desplazamiento del final de la cadena del emparejamiento completo. Siguiendo con el ejemplo anterior:

#            0123456789
DB<17> $z = "hola13.47x"
DB<18> if ($z =~ m{a(\d+)(\.)(\d+)?}) { print "@+\n"; }
9 6 7 9
El resultado se interpreta como sigue:

Número de paréntesis en la última regexp con éxito

Se puede usar $#+ para determinar cuantos parentesis había en el último emparejamiento que tuvo éxito.

  DB<29> $z = "h"
  DB<30> print "$#+\n" if ($z =~ m{(a)(b)}) || ($z =~ m{(h)(.)?(.)?})
3
  DB<31> $z = "ab"
  DB<32> print "$#+\n" if ($z =~ m{(a)(b)}) || ($z =~ m{(h)(.)?(.)?})
2

Indice del Ultimo Paréntesis

La variable $#- contiene el índice del último paréntesis que casó. Observe la siguiente ejecución con el depurador:

  DB<1> $x = '13.47'; $y = '125'
  DB<2> if ($y =~ m{(\d+)(\.(\d+))?}) { print "last par = $#-, content = $+\n"; }
last par = 1, content = 125
  DB<3> if ($x =~ m{(\d+)(\.(\d+))?}) { print "last par = $#-, content = $+\n"; }
last par = 3, content = 47

@- y @+ no tienen que tener el mismo tamaño

En general no puede asumirse que @- y @+ sean del mismo tamaño.

  DB<1> "a" =~ /(a)|(b)/; @a = @-; @b = @+
  DB<2> x @a
0  0
1  0
  DB<3> x @b
0  1
1  1
2  undef

Véase También

Para saber más sobre las variables especiales disponibles consulte

Ambito Automático

Como sabemos, ciertas variables (como $1, $& ...) reciben automáticamente un valor con cada operación de ``matching''.

Considere el siguiente código:

if (m/(...)/) {
  &do_something();
  print "the matched variable was $1.\n";
}
Puesto que $1 es automáticamente declarada local a la entrada de cada bloque, no importa lo que se haya hecho en la función &do_something(), el valor de $1 en la sentencia print es el correspondiente al ``matching'' realizado en el if.


Opciones

Modificador Significado
e evaluar: evaluar el lado derecho de una sustitución como una expresión
g global: Encontrar todas las ocurrencias
i ignorar: no distinguir entre mayúsculas y minúsculas
m multilínea (^ y $ casan con \n internos)
o optimizar: compilar una sola vez
s ^ y $ ignoran \n pero el punto . ``casa'' con \n
x extendida: permitir comentarios

El Modificador /g

La conducta de este modificador depende del contexto. En un contexto de listas devuelve una lista con todas las subcadenas casadas por todos los paréntesis en la expresión regular. Si no hubieran paréntesis devuelve una lista con todas las cadenas casadas (como si hubiera paréntesis alrededor del patrón global).

   1 #!/usr/bin/perl -w
   2 ($one, $five, $fifteen) = (`uptime` =~ /(\d+\.\d+)/g);
   3 print "$one, $five, $fifteen\n";

Observe la salida:

> uptime
  1:35pm  up 19:22,  0 users,  load average: 0.01, 0.03, 0.00
> glist.pl
0.01, 0.03, 0.00

En un contexto escalar m//g itera sobre la cadena, devolviendo cierto cada vez que casa, y falso cuando deja de casar. En otras palabras, recuerda donde se quedo la última vez y se recomienza la búsqueda desde ese punto. Se puede averiguar la posicion del emparejamiento utilizando la función pos. Si por alguna razón modificas la cadena en cuestión, la posición de emparejamiento se reestablece al comienzo de la cadena.

   1 #!/usr/bin/perl -w
   2 # count sentences in a document
   3 #defined as ending in [.!?] perhaps with
   4 # quotes or parens on either side.
   5 $/ = ""; # paragraph mode
   6 while ($paragraph = <>) {
   7   print $paragraph;
   8   while ($paragraph =~ /[a-z]['")]*[.!?]+['")]*\s/g) {
   9     $sentences++;
  10   }
  11 }
  12 print "$sentences\n";

Observe el uso de la variable especial $/. Esta variable contiene el separador de registros en el fichero de entrada. Si se iguala a la cadena vacía usará las líneas en blanco como separadores. Se le puede dar el valor de una cadena multicarácter para usarla como delimitador. Nótese que establecerla a \n\n es diferente de asignarla a "". Si se deja undef, la siguiente lectura leerá todo el fichero.

Sigue un ejemplo de ejecución. El programa se llama gscalar.pl. Introducimos el texto desde STDIN. El programa escribe el número de párrafos:

> gscalar.pl
este primer parrafo. Sera seguido de un
segundo parrafo.
 
"Cita de Seneca".
 
3

La opción e: Evaluación del remplazo

La opción /e permite la evaluación como expresión perl de la cadena de reemplazo (En vez de considerarla como una cadena delimitada por doble comilla).

   1 #!/usr/bin/perl -w
   2 $_ = "abc123xyz\n";
   3 s/\d+/$&*2/e;
   4 print;
   5 s/\d+/sprintf("%5d",$&)/e;
   6 print;
   7 s/\w/$& x 2/eg;
   8 print;
El resultado de la ejecución es:
> replacement.pl
abc246xyz
abc  246xyz
aabbcc  224466xxyyzz

Véase un ejemplo con anidamiento de /e:

   1 #!/usr/bin/perl 
   2 $a ="one";
   3 $b = "two";
   4 $_ = '$a $b';
   5 print "_ = $_\n\n";
   6 s/(\$\w+)/$1/ge;
   7 print "After 's/(\$\w+)/$1/ge' _ = $_\n\n";
   8 s/(\$\w+)/$1/gee;
   9 print "After 's/(\$\w+)/$1/gee' _ = $_\n\n";
El resultado de la ejecución es:
 
> enested.pl
_ = $a $b
 
After 's/($w+)/$b/ge' _ = $a $b
 
After 's/($w+)/$b/gee' _ = one two

He aqui una solución que hace uso de e al siguiente ejercicio (véase 'Regex to add space after punctuation sign' en PerlMonks) Se quiere poner un espacio en blanco después de la aparición de cada coma:

s/,/, /g;
pero se quiere que la sustitución no tenga lugar si la coma esta incrustada entre dos dígitos. Además se pide que si hay ya un espacio después de la coma, no se duplique

s/(\d[,.]\d)|(,(?!\s))/$1 || "$2 "/ge;

Se hace uso de un lookahead negativo (?!\s). Véase la sección 31.2.3 para entender como funciona un lookahead negativo.

Casiano Rodríguez León
2013-04-23