During the 1960s and 1970s regular expressions underwent a formal development.
One of the first publications to use regular expressions in a computing
context is Ken Thompson's 1968 article Regular Expression Search Algorithm,
which describes a regular expression compiler that produces object code for
an IBM 7094.
This compiler gave rise to the editor qed, on which the Unix editor ed was
based. Although the regular expressions of ed were not as sophisticated as
those of qed, they were the first to be used in a non-academic context.
It is said that the global command g, in its form g/re/p, which was used
to print (option p) the lines matching the regular expression re,
gave rise to a separate program that was named grep.
The regular expressions provided by the first versions of these tools
were limited. For example, the Kleene closure * was available, but not the
positive closure + or the optional operator ?.
For that reason the metacharacters \+ and \? were introduced later.
Those versions had numerous limitations. For example, $ only meant "end of
line" when it appeared at the end of the regular expression. That made it
difficult to write expressions such as

    grep 'cierre$\|^Las' viq.tex

However, most current versions handle these cases correctly:

    nereida:~/viq> grep 'cierre$\|^Las' viq.tex
    Las expresiones regulares facilitadas por las primeras versiones de estas herramientas
    eran limitadas. Por ejemplo, se disponía del cierre de Kleene \verb|*| pero no del cierre
    nereida:~/viq>

In fact, AT&T Bell Labs added numerous features, such as the use of
\{min, max\}, taken from lex.
Around that time, Alfred Aho wrote egrep, which not only provided a richer
set of operators but also improved the implementation.
While Ken Thompson's grep used a nondeterministic finite automaton
(NFA), Aho's egrep uses a deterministic finite automaton (DFA).
In 1986 Henry Spencer developed the regex library for the C language, which
provides a consistent set of functions for handling regular expressions.
This library has helped to "homogenize" the syntax and semantics
of the different tools that use regular expressions (such as awk,
lex, sed, ...).
    pl@nereida:~/Lperltesting$ cat -n c2f.pl
         1  #!/usr/bin/perl -w
         2  use strict;
         3
         4  print "Enter a temperature (i.e. 32F, 100C):\n";
         5  my $input = <STDIN>;
         6  chomp($input);
         7
         8  if ($input !~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i) {
         9    warn "Expecting a temperature, so don't understand \"$input\".\n";
        10  }
        11  else {
        12    my $InputNum = $1;
        13    my $type = $3;
        14    my ($celsius, $farenheit);
        15    if ($type eq "C" or $type eq "c") {
        16      $celsius = $InputNum;
        17      $farenheit = ($celsius * 9/5)+32;
        18    }
        19    else {
        20      $farenheit = $InputNum;
        21      $celsius = ($farenheit -32)*5/9;
        22    }
        23    printf "%.2f C = %.2f F\n", $celsius, $farenheit;
        24  }
See also:
perldoc perlrequick
perldoc perlretut
perldoc perlre
perldoc perlreref
Running the program under the debugger:
    pl@nereida:~/Lperltesting$ perl -wd c2f.pl

    Loading DB routines from perl5db.pl version 1.28
    Editor support available.

    Enter h or `h h' for help, or `man perldebug' for more help.

    main::(c2f.pl:4):       print "Enter a temperature (i.e. 32F, 100C):\n";
      DB<1> c 8
    Enter a temperature (i.e. 32F, 100C):
    32F
    main::(c2f.pl:8):       if ($input !~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i) {
      DB<2> n
    main::(c2f.pl:12):        my $InputNum = $1;
      DB<2> x ($1, $2, $3)
    0  32
    1  undef
    2  'F'
      DB<3> use YAPE::Regex::Explain
      DB<4> p YAPE::Regex::Explain->new('([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$')->explain
    The regular expression:

    (?-imsx:([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        [-+]?                    any character of: '-', '+' (optional
                                 (matching the most amount possible))
    ----------------------------------------------------------------------
        [0-9]+                   any character of: '0' to '9' (1 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        (                        group and capture to \2 (optional
                                 (matching the most amount possible)):
    ----------------------------------------------------------------------
          \.                       '.'
    ----------------------------------------------------------------------
          [0-9]*                   any character of: '0' to '9' (0 or
                                   more times (matching the most amount
                                   possible))
    ----------------------------------------------------------------------
        )?                       end of \2 (NOTE: because you're using a
                                 quantifier on this capture, only the
                                 LAST repetition of the captured pattern
                                 will be stored in \2)
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      (                        group and capture to \3:
    ----------------------------------------------------------------------
        [CF]                     any character of: 'C', 'F'
    ----------------------------------------------------------------------
      )                        end of \3
    ----------------------------------------------------------------------
      $                        before an optional \n, and the end of the
                               string
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
Inside a regular expression, the text matched by the first, second, etc.
pair of parentheses must be referred to as \1, \2, etc. The notation $1
refers to what the first pair of parentheses matched in the last successful
match, not in the current one. Let us look at an example:
    pl@nereida:~/Lperltesting$ cat -n dollar1slash1.pl
         1  #!/usr/bin/perl -w
         2  use strict;
         3
         4  my $a = "hola juanito";
         5  my $b = "adios anita";
         6
         7  $a =~ /(ani)/;
         8  $b =~ s/(adios) *($1)/\U$1 $2/;
         9  print "$b\n";

Note how the $1 that appears in the replacement string (line 8)
refers to the string adios, whereas the $1 in the pattern part
contains ani:
    pl@nereida:~/Lperltesting$ ./dollar1slash1.pl
    ADIOS ANIta
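By contrast, if \1 is used inside the pattern, it refers to the text matched
by the first pair of parentheses in the current match (adios), so the following
variant of line 8 no longer matches adios anita:

    $b =~ s/(adios) *(\1)/\U$1 $2/;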
The number of capturing parentheses is not limited:
    pl@nereida:~/Lperltesting$ perl -wde 0
    main::(-e:1):   0
                  123456789ABCDEF
      DB<1> $x = "123456789AAAAAA"
                        1  2  3  4  5  6  7  8  9  10 11 12
      DB<2> $r = $x =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\11/; print "$r\n$10\n$11\n"
    1
    A
    A
See the following paragraph from perlre (section Capture buffers):
There is no limit to the number of captured substrings that you may use. However Perl also uses \10, \11, etc. as aliases for \010, \011, etc. (Recall that 0 means octal, so \011 is the character at number 9 in your coded character set; which would be the 10th character, a horizontal tab under ASCII.) Perl resolves this ambiguity by interpreting \10 as a backreference only if at least 10 left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences.
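The rule quoted from perlre can be checked with a small sketch. The variable names ($x, $backref, $octal) are chosen here only for illustration; the first pattern opens eleven groups before \11, the second opens none.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Eleven capturing groups precede \11, so \11 is a backreference to group 11.
my $x = "123456789AAAAAA";
my $backref = $x =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\11/ ? 1 : 0;

# No capturing group precedes \11 here, so it is read as octal 011,
# the horizontal tab character.
my $octal = "a\tb" =~ /a\11b/ ? 1 : 0;

print "backreference: $backref, octal tab: $octal\n";
```

Writing \g{11} for the backreference and \o{11} or \t for the tab avoids the ambiguity altogether on Perl versions that support those escapes.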
When used in a context that requires a list,
the pattern match returns a list consisting of
the subexpressions matched by the parentheses,
that is, ($1, $2, $3, ...).
If there is no match, the empty list is returned.
If there is a match but the pattern contains no parentheses, the list
(1) is returned.
    pl@nereida:~/src/perl/perltesting$ cat -n escapes.pl
         1  #!/usr/bin/perl -w
         2  use strict;
         3
         4  my $foo = "one two three four five\nsix seven";
         5  my ($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/);
         6  print "List Context: F1 = $F1, F2 = $F2, Etc = $Etc\n";
         7
         8  # This is 'almost' the same as:
         9  ($F1, $F2, $Etc) = split(/\s+/, $foo, 3);
        10  print "Split: F1 = $F1, F2 = $F2, Etc = $Etc\n";

Observe the result of running it:
    pl@nereida:~/src/perl/perltesting$ ./escapes.pl
    List Context: F1 = one, F2 = two, Etc = three four five
    Split: F1 = one, F2 = two, Etc = three four five
    six seven
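As a minimal sketch of the three possible outcomes of a match in list context (without the /g modifier), the following script uses illustrative names ($text, @groups, @no_caps, @failed) that do not come from the original programs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "one two three";

my @groups  = $text =~ /^(\S+)\s+(\S+)/;   # captured substrings: $1 and $2
my @no_caps = $text =~ /two/;              # success without parentheses
my @failed  = $text =~ /^(\d+)/;           # no match: empty list

printf "groups=(%s) no_caps=(%s) failed has %d elements\n",
    join(', ', @groups), join(', ', @no_caps), scalar(@failed);
```

Because a failed match yields the empty list, an assignment such as `my ($first) = $text =~ /.../` leaves $first undefined on failure, which is convenient for tests with defined.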
The s option used in a regexp makes the dot '.' match the newline character
as well:
    pl@nereida:~/src/perl/perltesting$ perl -wd ./escapes.pl
    main::(./escapes.pl:4): my $foo = "one two three four five\nsix seven";
      DB<1> c 9
    List Context: F1 = one, F2 = two, Etc = three four five
    main::(./escapes.pl:9): ($F1, $F2, $Etc) = split(' ',$foo, 3);
      DB<2> ($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/s)
      DB<3> p "List Context: F1 = $F1, F2 = $F2, Etc = $Etc\n"
    List Context: F1 = one, F2 = two, Etc = three four five
    six seven
The /s option makes . match a \n too.
That is, the dot matches any character.
Let us look at another example, which prints the names of the files containing strings that match a given pattern, even if the match is spread across several lines:
    #!/usr/bin/perl -w
    # usage: smodifier.pl 'expr' files
    # prints the names of the files that match the given expr
    use strict;

    undef $/;                        # no input record separator: slurp whole files
    my $what = shift @ARGV;
    while (my $file = shift @ARGV) {
      open(my $fh, '<', $file) or die "can't open $file: $!\n";
      my $text = <$fh>;              # the whole file as a single string
      close($fh);
      if ($text =~ /$what/s) {       # /s lets . match newlines
        print "$file\n";
      }
    }
Usage example:

    > smodifier.pl 'three.*three' double.in split.pl doublee.pl
    double.in
    doublee.pl
See section 31.4.2 for the contents of the file double.in. In that file,
the pattern three.*three is spread across several lines.
The s modifier is often used together with the m modifier. Here is what
the section 'Using-character-classes' in perlretut says about it:
m modifier (//m): Treat string as a set of multiple lines. '.' matches any character except \n. ^ and $ are able to match at the start or end of any line within the string.
both s and m modifiers (//sm): Treat string as a single long line, but detect multiple lines. '.' matches any character, even \n. ^ and $, however, are able to match at the start or end of any line within the string.
Here are examples of //s and //m in action:
    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"
Most of the time, the default behavior is what is wanted, but //s and //m are occasionally very useful. If //m is being used, the start of the string can still be matched with \A and the end of the string can still be matched with the anchors \Z (matches both the end and the newline before, like $), and \z (matches only the end):
    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

Normally the character ^ matches only at the beginning of the string and
the character $ at the end. Embedded \n characters are not matched
by ^ or $. The /m modifier changes this behavior, so that ^ and $ match
at any internal line boundary. The anchors \A and \Z are then used
to match the beginning and the end of the string.
Here is an example:
    nereida:~/perl/src> perl -de 0
      DB<1> $a = "hola\npedro"
      DB<2> p "$a"
    hola
    pedro
      DB<3> $a =~ s/.*/x/m
      DB<4> p $a
    x
    pedro
      DB<5> $a =~ s/^pedro$/juan/
      DB<6> p "$a"
    x
    pedro
      DB<7> $a =~ s/^pedro$/juan/m
      DB<8> p "$a"
    x
    juan
Let us rewrite the earlier temperature conversion example using list context:
    casiano@millo:~/Lperltesting$ cat c2f_list.pl
    #!/usr/bin/perl -w
    use strict;

    print "Enter a temperature (i.e. 32F, 100C):\n";
    my $input = <STDIN>;
    chomp($input);

    my ($InputNum, $type);

    ($InputNum, $type) = $input =~ m/^
                                        ([-+]?[0-9]+(?:\.[0-9]*)?) # Temperature
                                        \s*
                                        ([cCfF])                   # Celsius or Fahrenheit
                                     $/x;

    die "Expecting a temperature, so don't understand \"$input\".\n" unless defined($InputNum);

    my ($celsius, $fahrenheit);
    if ($type eq "C" or $type eq "c") {
      $celsius = $InputNum;
      $fahrenheit = ($celsius * 9/5)+32;
    }
    else {
      $fahrenheit = $InputNum;
      $celsius = ($fahrenheit -32)*5/9;
    }
    printf "%.2f C = %.2f F\n", $celsius, $fahrenheit;
The /x option in a regexp allows comments and whitespace inside the regular
expression. Whitespace inside the regular expression is no longer significant.
If you need a significant space, use \s (which matches any whitespace
character) or escape the space with a backslash. See the section 'Modifiers'
in perlre and the section 'Building-a-regexp' in perlretut.
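A small sketch of the options for a significant blank under /x follows. The test string and variable names are illustrative assumptions; the bracketed class [ ] is a third common way to write a literal space under plain /x.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = "Ken Thompson";

# Under /x the blanks in the pattern are ignored, so this is really
# /^(\w+)(\w+)$/ and cannot match a string that contains a space.
my $ignored = $name =~ /^ (\w+) (\w+) $/x    ? 1 : 0;

# Three ways to request a significant blank:
my $escaped = $name =~ /^ (\w+) \  (\w+) $/x  ? 1 : 0;  # backslash-escaped space
my $class   = $name =~ /^ (\w+) [ ] (\w+) $/x ? 1 : 0;  # space inside a bracketed class
my $any_ws  = $name =~ /^ (\w+) \s (\w+) $/x  ? 1 : 0;  # any whitespace character

print "ignored=$ignored escaped=$escaped class=$class any_ws=$any_ws\n";
```

Note that \s is broader than a literal space: it also accepts tabs and newlines, which may or may not be what is wanted.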
The notation (?: ... ) introduces grouping parentheses without memory.
(?: ...) groups expressions just as ordinary parentheses do.
The difference is that these parentheses do not "memorize", that is,
they store nothing in $1, $2, etc.
This allows a more efficient compilation. Let us see an example:

    > cat groupingpar.pl
    #!/usr/bin/perl

    my $a = shift;

    $a =~ m/(?:hola )*(juan)/;
    print "$1\n";
    nereida:~/perl/src> groupingpar.pl 'hola juan'
    juan
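The effect on group numbering can be seen by running the same match with and without memory. This is only an illustrative sketch; the string and variable names are not taken from the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $greeting = "hola hola juan";

# Ordinary parentheses: the repeated group is $1, so juan lands in $2.
my @with_memory    = $greeting =~ /(hola )*(juan)/;

# Non-capturing group: nothing is stored for it, so juan is $1.
my @without_memory = $greeting =~ /(?:hola )*(juan)/;

printf "with memory: %d groups, without memory: %d group\n",
    scalar(@with_memory), scalar(@without_memory);
```

Using (?: ... ) for pure grouping keeps the numbering of the groups that matter stable when the pattern is later extended.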
The pattern may contain variables, which are interpolated
(in that case, the pattern is recompiled).
If you want the pattern to be compiled only once, use the /o option.
    pl@nereida:~/Lperltesting$ cat mygrep.pl
    #!/usr/bin/perl -w
    my $what = shift @ARGV || die "Usage $0 regexp files ...\n";
    while (<>) {
      print "File $ARGV, rel. line $.: $_" if (/$what/o); # compile only once
    }

An example run follows:
    pl@nereida:~/Lperltesting$ ./mygrep.pl
    Usage ./mygrep.pl regexp files ...
    pl@nereida:~/Lperltesting$ ./mygrep.pl if labels.c
    File labels.c, rel. line 7:        if (a < 10) goto LABEL;
The following text is from the section 'Using-regular-expressions-in-Perl' in perlretut:
If $pattern won't be changing over the lifetime of the script, we can add the //o modifier, which directs Perl to only perform variable substitutions once
Another possibility is to precompile the pattern with the qr operator
(see the section 'Regexp-Quote-Like-Operators' in perlop).
The following variant of the previous program also compiles the pattern
only once:
    pl@nereida:~/Lperltesting$ cat mygrep2.pl
    #!/usr/bin/perl -w
    my $what = shift @ARGV || die "Usage $0 regexp files ...\n";
    $what = qr{$what};
    while (<>) {
      print "File $ARGV, rel. line $.: $_" if (/$what/);
    }
The following excerpt from the section 'Matching-repetitions' in perlretut
illustrates the greedy semantics of the repetition operators *, +, {}, ?, etc.
For all of these quantifiers, Perl will try to match as much of the string as possible, while still allowing the regexp to succeed. Thus with /a?.../, Perl will first try to match the regexp with the a present; if that fails, Perl will try to match the regexp without the a present. For the quantifier *, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'
Which is what we might expect, the match finds the only cat in the string and locks onto it. Consider, however, this regexp:
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 characters match)

One might initially guess that Perl would find the at in cat and stop there, but that wouldn't give the longest possible string to the first quantifier .*. Instead, the first quantifier .* grabs as much of the string as possible while still having the regexp match. In this example, that means having the at sequence with the final at in the string.
The other important principle illustrated here is that when there are two or more elements in a regexp, the leftmost quantifier, if there is one, gets to grab as much of the string as possible, leaving the rest of the regexp to fight over scraps. Thus in our example, the first quantifier .* grabs most of the string, while the second quantifier .* gets the empty string. Quantifiers that grab as much of the string as possible are called maximal match or greedy quantifiers.
When a regexp can match a string in several different ways, we can use the principles above to predict which way the regexp will match:
Principle 0: Taken as a whole, any regexp will be matched at the earliest possible position in the string.
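Principle 1: In an alternation a|b|c..., the leftmost alternative that allows a match for the whole regexp will be the one used.
Principle 2: The maximal matching quantifiers ?, *, + and {n,m} will in general match as much of the string as possible while still allowing the whole regexp to match.
Principle 3: If there are two or more elements in a regexp, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.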
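The first three principles can be observed with a short sketch. The test strings are in the spirit of the perlretut examples; the variable names $p0, $p1 and $p2 are introduced here only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Principle 0: the match starts at the earliest possible position.
# 'cat' starts one character before the first 'at', so it wins even
# though 'at' is the leftmost alternative.
my ($p0) = "the cat in the hat" =~ /(at|cat)/;

# Principle 1: at a given position, the leftmost alternative that
# lets the whole regexp succeed is used, not the longest one.
my ($p1) = "cats" =~ /(c|ca|cat|cats)/;

# Principle 2: a greedy quantifier takes as much as it can while
# still letting the rest of the regexp match.
my ($p2) = "the cat in the hat" =~ /^(.*)at/;

print "p0='$p0' p1='$p1' p2='$p2'\n";
```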
The following paragraph is taken from the section 'Repeated-Patterns-Matching-a-Zero-length-Substring' in perlre:
Regular expressions provide a terse and powerful programming language. As with most other power tools, power comes together with the ability to wreak havoc.
A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:
    'foo' =~ m{ ( o? )* }x;

The o? matches at the beginning of 'foo', and since the position in the string is not moved by the match, o? would match again and again because of the * quantifier.
Another common way to create a similar cycle is with the looping modifier //g:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().
... Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{}, and for higher-level ones like the /g modifier or split() operator.
The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring. Thus
    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )*
       |
       (?: ZERO_LENGTH )?
     }x;

The higher-level loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited to have a length of zero. This prohibition interacts with backtracking (see Backtracking), and so the second best match is chosen if the best match is of zero length.
For example:
    $_ = 'bar';
    s/\w??/<$&>/g;

results in <><b><><a><><r><>. At each position of the string the best match given by non-greedy ?? is the zero-length match, and the second best match is what is matched by \w. Thus zero-length matches alternate with one-character-long matches.
Similarly, for repeated m/()/g the second-best match is the match at the position one notch further in the string.
The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos(). Zero-length matches at the end of the previous match are ignored during split.
      DB<25> $c = 0
      DB<26> print(($c++).": <$&>\n") while 'aaaabababab' =~ /a*(ab)*/g;
    0: <aaaa>
    1: <>
    2: <a>
    3: <>
    4: <a>
    5: <>
    6: <a>
    7: <>
    8: <>
Lazy or non-greedy expressions make the NFA stop at the shortest string that
matches the expression. They are written like their greedy counterparts with
the suffix ? appended:
{n,m}?
{n,}?
{n}?
*?
+?
??
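A minimal sketch contrasting a greedy quantifier with its lazy counterpart on the same input; the string and the variable names are illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $markup = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the end and backs off only to the last '>'.
my ($greedy) = $markup =~ /<(.+)>/;

# Lazy: .+? grows one character at a time and stops at the first '>'.
my ($lazy)   = $markup =~ /<(.+?)>/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both matches start at the same position (the first <); only the amount of text taken by the quantifier differs.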
Let us review what the section 'Matching-repetitions' in perlretut says:
Sometimes greed is not good. At times, we would like quantifiers to match a minimal piece of string, rather than a maximal piece. For this purpose, Larry Wall created the minimal match or non-greedy quantifiers ??, *?, +?, and {}?. These are the usual quantifiers with a ? appended to them. They have the following meanings:
a?? means: match 'a' 0 or 1 times. Try 0 first, then 1.
a*? means: match 'a' 0 or more times, i.e., any number of times, but as few times as possible
a+? means: match 'a' 1 or more times, i.e., at least once, but as few times as possible
a{n,m}? means: match at least n times, not more than m times, as few times as possible
a{n,}? means: match at least n times, but as few times as possible
a{n}? means: match exactly n times. Because we match exactly n times, a{n}? is equivalent to a{n} and is just there for notational consistency.
Let's look at the example above, but with minimal quantifiers:
    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
                              # $1 = 'Th'
                              # $2 = 'e'
                              # $3 = ' programming republic of Perl'

The minimal string that will allow both the start of the string ^ and the alternation to match is Th, with the alternation e|r matching e. The second quantifier .* is free to gobble up the rest of the string.

    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ming republic of Perl'

The first string position that this regexp can match is at the first m in programming. At this position, the minimal m{1,2}? matches just one m. Although the second quantifier .*? would prefer to match no characters, it is constrained by the end-of-string anchor $ to match the rest of the string.

    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
                                  # $1 = 'The progra'
                                  # $2 = 'm'
                                  # $3 = 'ming republic of Perl'

In this regexp, you might expect the first minimal quantifier .*? to match the empty string, because it is not constrained by a ^ anchor to match the beginning of the word. Principle 0 applies here, however. Because it is possible for the whole regexp to match at the start of the string, it will match at the start of the string. Thus the first quantifier has to match everything up to the first m. The second minimal quantifier matches just one m and the third quantifier matches the rest of the string.

    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
                                 # $1 = 'a'
                                 # $2 = 'mm'
                                 # $3 = 'ing republic of Perl'

Just as in the previous regexp, the first quantifier .?? can match earliest at position a, so it does. The second quantifier is greedy, so it matches mm, and the third matches the rest of the string.
We can modify principle 3 above to take into account non-greedy quantifiers:
Principle 3: If there are two or more elements in a regexp, the leftmost greedy (non-greedy) quantifier, if any, will match as much (little) of the string as possible while still allowing the whole regexp to match. The next leftmost greedy (non-greedy) quantifier, if any, will try to match as much (little) of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.
    casiano@millo:~/Lperltesting$ perl -wde 0
    main::(-e:1):   0
      DB<1> x ('1'x34) =~ m{^(11+)\1+$}
    0  11111111111111111
      DB<2> x ('1'x34) =~ m{^(11+?)\1+$}
    ????????????????????????????????????
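The pattern in the session above writes a number in unary and asks for a block of at least two 1s that, repeated two or more times, covers the whole string, so the captured block has the length of a divisor. The following sketch applies the same idea to a different, composite number (15 is an arbitrary choice made here, as are the variable names) to show how greedy and lazy quantifiers pick different divisors:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $n     = 15;          # assumed composite, so both matches succeed
my $unary = '1' x $n;

# Greedy 11+ tries long blocks first: the first block length that
# divides $n is its largest proper divisor.
my ($largest)  = $unary =~ /^(11+)\1+$/;

# Lazy 11+? tries short blocks first: the first block length that
# divides $n is its smallest divisor greater than 1.
my ($smallest) = $unary =~ /^(11+?)\1+$/;

printf "%d: greedy block has length %d, lazy block has length %d\n",
    $n, length($largest), length($smallest);
```

For a prime number neither pattern matches, so both variables would be left undefined.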
Just like alternation, quantifiers are also susceptible to backtracking. Here is a step-by-step analysis of the example
    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)
Start with the first letter in the string 't'.
The first quantifier '.*' starts out by matching the whole string 'the cat in the hat'.
'a' in the regexp element 'at' doesn't match the end of the string. Backtrack one character.
'a' in the regexp element 'at' still doesn't match the last letter of the string 't', so backtrack one more character.
Now we can match the 'a' and the 't'.
Move on to the third element '.*'. Since we are at the end of the string and '.*' can match 0 times, assign it the empty string.
We are done!
The way a regexp is written can lead to large variations in performance. Let us review what the section 'Matching-repetitions' in perlretut says:
Most of the time, all this moving forward and backtracking happens quickly and searching is fast. There are some pathological regexps, however, whose execution time exponentially grows with the size of the string. A typical structure that blows up in your face is of the form
/(a|b+)*/;
The problem is the nested indeterminate quantifiers. There are many different ways of partitioning a string of length n between the + and *: one repetition with b+ of length n, two repetitions with the first b+ length k and the second with length n-k, m repetitions whose bits add up to length n, etc.
In fact there are an exponential number of ways to partition a string as a function of its length. A regexp may get lucky and match early in the process, but if there is no match, Perl will try every possibility before giving up. So be careful with nested *'s, {n,m}'s, and +'s.
The book Mastering Regular Expressions by Jeffrey Friedl [8] gives a wonderful discussion of this and other efficiency issues.
The following example removes the comments from a C program.

    casiano@millo:~/Lperltesting$ cat comments.pl
    #!/usr/bin/perl -w
    use strict;

    my $progname = shift @ARGV or die "Usage:\n$0 prog.c\n";
    open(my $PROGRAM, '<', $progname) || die "can't find $progname\n";
    my $program = '';
    {
      local $/ = undef;   # slurp mode: read the whole file at once
      $program = <$PROGRAM>;
    }
    $program =~ s{
      /\*  # Match the opening delimiter
      .*?  # Match a minimal number of characters
      \*/  # Match the closing delimiter
    }[]gsx;

    print $program;

Let us look at an example run. Suppose the input file is:

    > cat hello.c
    #include <stdio.h>
    /* first comment */
    main() {
      printf("hello world!\n"); /* second comment */
    }

Then running the program on that input file produces the output:

    > comments.pl hello.c
    #include <stdio.h>

    main() {
      printf("hello world!\n");
    }

Let us see the difference in behavior between * and *? in the previous example:
    pl@nereida:~/src/perl/perltesting$ perl5_10_1 -wde 0
    main::(-e:1):   0
      DB<1> use re 'debug'; 'main() /* 1c */ { /* 2c */ return; /* 3c */ }' =~ qr{(/\*.*\*/)}; print "\n$1\n"
    Compiling REx "(/\*.*\*/)"
    Final program:
       1: OPEN1 (3)
       3:   EXACT </*> (5)
       5:   STAR (7)
       6:     REG_ANY (0)
       7:   EXACT <*/> (9)
       9: CLOSE1 (11)
      11: END (0)
    anchored "/*" at 0 floating "*/" at 2..2147483647 (checking floating) minlen 4
    Guessing start of match in sv for REx "(/\*.*\*/)" against "main() /* 1c */ { /* 2c */ return; /* 3c */ }"
    Found floating substr "*/" at offset 13...
    Found anchored substr "/*" at offset 7...
    Starting position does not contradict /^/m...
    Guessed: match at offset 7
    Matching REx "(/\*.*\*/)" against "/* 1c */ { /* 2c */ return; /* 3c */ }"
       7 <in() > </* 1c */ {>    |  1:OPEN1(3)
       7 <in() > </* 1c */ {>    |  3:EXACT </*>(5)
       9 <() /*> < 1c */ { />    |  5:STAR(7)
                                      REG_ANY can match 36 times out of 2147483647...
      41 <; /* 3c > <*/ }>       |  7:  EXACT <*/>(9)
      43 <; /* 3c */> < }>       |  9:  CLOSE1(11)
      43 <; /* 3c */> < }>       | 11:  END(0)
    Match successful!

    /* 1c */ { /* 2c */ return; /* 3c */
    Freeing REx: "(/\*.*\*/)"

      DB<2> use re 'debug'; 'main() /* 1c */ { /* 2c */ return; /* 3c */ }' =~ qr{(/\*.*?\*/)}; print "\n$1\n"
    Compiling REx "(/\*.*?\*/)"
    Final program:
       1: OPEN1 (3)
       3:   EXACT </*> (5)
       5:   MINMOD (6)
       6:   STAR (8)
       7:     REG_ANY (0)
       8:   EXACT <*/> (10)
      10: CLOSE1 (12)
      12: END (0)
    anchored "/*" at 0 floating "*/" at 2..2147483647 (checking floating) minlen 4
    Guessing start of match in sv for REx "(/\*.*?\*/)" against "main() /* 1c */ { /* 2c */ return; /* 3c */ }"
    Found floating substr "*/" at offset 13...
    Found anchored substr "/*" at offset 7...
    Starting position does not contradict /^/m...
    Guessed: match at offset 7
    Matching REx "(/\*.*?\*/)" against "/* 1c */ { /* 2c */ return; /* 3c */ }"
       7 <in() > </* 1c */ {>    |  1:OPEN1(3)
       7 <in() > </* 1c */ {>    |  3:EXACT </*>(5)
       9 <() /*> < 1c */ { />    |  5:MINMOD(6)
       9 <() /*> < 1c */ { />    |  6:STAR(8)
                                      REG_ANY can match 4 times out of 4...
      13 <* 1c > <*/ { /* 2c>    |  8:  EXACT <*/>(10)
      15 <1c */> < { /* 2c *>    | 10:  CLOSE1(12)
      15 <1c */> < { /* 2c *>    | 12:  END(0)
    Match successful!

    /* 1c */
    Freeing REx: "(/\*.*?\*/)"

      DB<3>
See also the documentation in the section 'Matching-repetitions' in perlretut and the section 'Quantifiers' in perlre.
The patterns X[^X]*X and X.*?X, where X is an arbitrary character, are used
almost interchangeably:
X[^X]*X describes a string that contains no X inside and is delimited by Xs.
X.*?X describes a string that starts at an X and ends at the X nearest to the starting X.
This equivalence breaks down when these hypotheses do not hold, for instance
when more of the pattern has to match right next to the closing delimiter.
The following example tries to detect the double-quoted strings that end with an exclamation mark:
    pl@nereida:~/Lperltesting$ cat negynogreedy.pl
    #!/usr/bin/perl -w
    use strict;

    my $b = 'Ella dijo "Ana" y yo contesté: "Jamás!". Eso fué todo.';
    my $a;
    ($a = $b) =~ s/".*?!"/-$&-/;
    print "$a\n";

    $b =~ s/"[^"]*!"/-$&-/;
    print "$b\n";

Running the program we obtain:

    > negynogreedy.pl
    Ella dijo -"Ana" y yo contesté: "Jamás!"-. Eso fué todo.
    Ella dijo "Ana" y yo contesté: -"Jamás!"-. Eso fué todo.
The binding operator =~ lets us "associate" a variable with the match or
substitution operation. If the operation is a substitution and you want to
keep the original string, you need to make a copy first:

    $d = $s;
    $d =~ s/esto/por lo otro/;

Instead, you can shorten this a little with the following idiom:

    ($d = $s) =~ s/esto/por lo otro/;

Note the parentheses: the assignment is evaluated first and yields $d as an
lvalue, to which the substitution is then applied.
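As a sketch, and assuming Perl 5.14 or later is available, the /r modifier of the substitution operator offers an alternative to the copy idiom: it leaves the original untouched and returns the modified string. The sample string and variable names below are illustrative.

```perl
#!/usr/bin/perl
use 5.014;      # s///r requires Perl 5.14 or later
use strict;
use warnings;

my $s = "cambia esto ya";

# Copy-then-substitute idiom: works on any Perl 5.
(my $copy = $s) =~ s/esto/por lo otro/;

# Non-destructive substitution: $s is not modified, the result is returned.
my $with_r = $s =~ s/esto/por lo otro/r;

print "original: $s\n";
print "copy:     $copy\n";
print "with /r:  $with_r\n";
```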
Relative backreferences make it possible to write more reusable regular expressions. See the documentation in the section 'Relative-backreferences' in perlretut:
Counting the opening parentheses to get the correct number for a backreference is error-prone as soon as there is more than one capturing group. A more convenient technique became available with Perl 5.10: relative backreferences. To refer to the immediately preceding capture group one now may write \g{-1}, the next but last is available via \g{-2}, and so on.
Another good reason in addition to readability and maintainability for using relative backreferences is illustrated by the following example, where a simple pattern for matching peculiar strings is used:
    $a99a = '([a-z])(\d)\2\1';   # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel tempted to use it as a part of some other pattern:

    $line = "code=e99e";
    if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
        print "$1 is valid\n";
    } else {
        print "bad line: '$line'\n";
    }

But this doesn't match - at least not the way one might expect. Only after inserting the interpolated $a99a and looking at the resulting full text of the regexp is it obvious that the backreferences have backfired - the subexpression (\w+) has snatched number 1 and demoted the groups in $a99a by one rank. This can be avoided by using relative backreferences:

    $a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated
The following program illustrates this:

    casiano@millo:~/Lperltesting$ cat -n backreference.pl
         1  use strict;
         2  use re 'debug';
         3
         4  my $a99a = '([a-z])(\d)\2\1';
         5  my $line = "code=e99e";
         6  if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
         7    print "$1 is valid\n";
         8  } else {
         9    print "bad line: '$line'\n";
        10  }

Running it:
    casiano@millo:~/Lperltesting$ perl5.10.1 -wd backreference.pl
    main::(backreference.pl:4):     my $a99a = '([a-z])(\d)\2\1';
      DB<1> c 6
    main::(backreference.pl:6):     if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
      DB<2> x ($line =~ /^(\w+)=$a99a$/)
      empty array
      DB<4> $a99a = '([a-z])(\d)\g{-1}\g{-2}'
      DB<5> x ($line =~ /^(\w+)=$a99a$/)
    0  'code'
    1  'e'
    2  9
The following text is taken from the section 'Named backreferences' in perlretut:
Perl 5.10 also introduced named capture buffers and named backreferences. To attach a name to a capturing group, you write either (?<name>...) or (?'name'...). The backreference may then be written as \g{name}.
It is permissible to attach the same name to more than one group, but then only the leftmost one of the eponymous set can be referenced. Outside of the pattern a named capture buffer is accessible through the %+ hash.
Assuming that we have to match calendar dates which may be given in one of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write three suitable patterns where we use 'd', 'm' and 'y' respectively as the names of the buffers capturing the pertaining components of a date. The matching operation combines the three patterns as alternatives:
    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    # the list is wrapped in parentheses: a bare qw(...) after
    # "for my $d" is no longer accepted by recent perls
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )){
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }
If any of the alternatives matches, the hash %+ is bound to contain the three key-value pairs.
Indeed, when we run the program:

    casiano@millo:~/Lperltesting$ cat namedbackreferences.pl
    use v5.10;
    use strict;

    my $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    my $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    my $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';

    # parentheses around qw(...) keep the loop valid on recent perls
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )){
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }

we get the output:

    casiano@millo:~/Lperltesting$ perl5.10.1 -w namedbackreferences.pl
    day=21 month=10 year=2006
    day=15 month=01 year=2007
    day=31 month=10 year=2005
As noted above:
... It is permissible to attach the same name to more than one group, but then only the leftmost one of the eponymous set can be referenced.
Let us look at an example:

    pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0
    main::(-e:1):   0
      DB<1> # ... only the leftmost one of the eponymous set can be referenced
      DB<2> $r = qr{(?<a>[a-c])(?<a>[a-f])}
      DB<3> print $+{a} if 'ad' =~ $r
    a
      DB<4> print $+{a} if 'cf' =~ $r
    c
      DB<5> print $+{a} if 'ak' =~ $r
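The other groups sharing a name are not lost: Perl 5.10 also provides the hash %-, whose values are array references holding every capture with a given name, from left to right. The following is only a minimal sketch of that idea; the variable names $leftmost and @all are illustrative and do not come from the original notes.

```perl
use strict;
use warnings;

my $r = qr{(?<a>[a-c])(?<a>[a-f])};

my ($leftmost, @all);
if ('ad' =~ $r) {
    $leftmost = $+{a};        # only the leftmost group named "a"
    @all      = @{ $-{a} };   # every group named "a", left to right
}
print "leftmost=$leftmost all=@all\n";
```

With %+ we only see the first capture, while %- exposes both.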
Let us rewrite the temperature conversion example using named parentheses:

    pl@nereida:~/Lperltesting$ cat c2f_5_10v2.pl
    #!/usr/local/bin/perl5_10_1 -w
    use strict;

    print "Enter a temperature (i.e. 32F, 100C):\n";
    my $input = <STDIN>;
    chomp($input);

    # the (?: ... ) group makes ^ and $ apply to both alternatives
    $input =~ m/^
                (?:
                  (?<farenheit>[-+]?[0-9]+(?:\.[0-9]*)?)\s*[fF]
                  |
                  (?<celsius>[-+]?[0-9]+(?:\.[0-9]*)?)\s*[cC]
                )
             $/x;

    my ($celsius, $farenheit);
    if (exists $+{celsius}) {
        $celsius = $+{celsius};
        $farenheit = ($celsius * 9/5)+32;
    }
    elsif (exists $+{farenheit}) {
        $farenheit = $+{farenheit};
        $celsius = ($farenheit -32)*5/9;
    }
    else {
        die "Expecting a temperature, so don't understand \"$input\".\n";
    }

    printf "%.2f C = %.2f F\n", $celsius, $farenheit;

The exists function returns true if the key exists in the hash and false otherwise.
Using names makes regular expressions more robust and easier to factor. Consider the following regexp, which uses positional notation:

    pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0
    main::(-e:1):   0
      DB<1> x "abbacddc" =~ /(.)(.)\2\1/
    0  'a'
    1  'b'

Suppose we want to reuse the regexp with repetition:

      DB<2> x "abbacddc" =~ /((.)(.)\2\1){2}/
      empty array

What happened? Introducing the new parenthesis forces us to renumber the backreferences:

      DB<3> x "abbacddc" =~ /((.)(.)\3\2){2}/
    0  'cddc'
    1  'c'
    2  'd'
      DB<4> "abbacddc" =~ /((.)(.)\3\2){2}/; print "$&\n"
    abbacddc

This does not happen if we use names. The operator \k<a> refers to the value matched by the parenthesis named a:

      DB<5> x "abbacddc" =~ /((?<a>.)(?<b>.)\k<b>\k<a>){2}/
    0  'cddc'
    1  'c'
    2  'd'

Using named groups and \k instead of absolute numeric references makes the regexp more reusable.
It is also possible to call the regular expression associated with a parenthesis.
This paragraph, taken from the section 'Extended Patterns' in perlre, explains how:
(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)
PARNO is a sequence of digits (not starting with 0) whose value reflects the paren-number of the capture buffer to recurse to.
....
Capture buffers contained by the pattern will have the value as determined by the outermost recursion. ....
If PARNO is preceded by a plus or minus sign then it is assumed to be relative, with negative numbers indicating preceding capture buffers and positive ones following. Thus (?-1) refers to the most recently declared buffer, and (?+1) indicates the next buffer to be declared.
Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed buffers are included.
Let us look at an example:

    casiano@millo:~/Lperltesting$ perl5.10.1 -wdE 0
    main::(-e:1):   0
      DB<1> x "AABB" =~ /(A)(?-1)(?+1)(B)/
    0  'A'
    1  'B'
      # Parenthesis:       1   2    2                 1
      DB<2> x 'ababa' =~ /^((?:([ab])(?1)\g{-1}|[ab]?))$/
    0  'ababa'
    1  'a'
      DB<3> x 'bbabababb' =~ /^((?:([ab])(?1)\g{-1}|[ab]?))$/
    0  'bbabababb'
    1  'b'
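A common use of (?1) is matching nested delimiters, which a plain regular expression cannot do. The sketch below is an illustration under our own assumptions: the names $balanced and is_balanced are hypothetical, and the pattern only handles round parentheses.

```perl
use strict;
use warnings;

# Group 1 matches "(", then any mix of non-parenthesis runs and nested
# calls to group 1 itself, then ")".
my $balanced = qr/^(\((?:[^()]++|(?1))*\))$/;

sub is_balanced {
    my ($text) = @_;
    return $text =~ $balanced ? 1 : 0;
}

print "$_: ", (is_balanced($_) ? "balanced" : "not balanced"), "\n"
    for '(a(b)c)', '(a(b)', '()';
```

The possessive [^()]++ keeps the engine from retrying shorter runs of ordinary characters when a closing parenthesis is missing.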
The following rewrite of our basic example uses the Regexp::Common module to factor the regular expression:

    casiano@millo:~/src/perl/perltesting$ cat c2f_5_10v3.pl
    #!/soft/perl5lib/bin/perl5.10.1 -w
    use strict;
    use Regexp::Common;

    print "Enter a temperature (i.e. 32F, 100C):\n";
    my $input = <STDIN>;
    chomp($input);

    # the (?: ... ) group makes ^ and $ apply to both alternatives
    $input =~ m/^
                (?:
                  (?<farenheit>$RE{num}{real})\s*[fF]
                  |
                  (?<celsius>$RE{num}{real})\s*[cC]
                )
             $/x;

    my ($celsius, $farenheit);
    if ('celsius' ~~ %+) {
        $celsius = $+{celsius};
        $farenheit = ($celsius * 9/5)+32;
    }
    elsif ('farenheit' ~~ %+) {
        $farenheit = $+{farenheit};
        $celsius = ($farenheit -32)*5/9;
    }
    else {
        die "Expecting a temperature, so don't understand \"$input\".\n";
    }

    printf "%.2f C = %.2f F\n", $celsius, $farenheit;
The Regexp::Common module provides a large number of regular expressions, which are accessible through the hash %RE. An example of use follows:

    casiano@millo:~/Lperltesting$ cat regexpcommonsynopsis.pl
    use strict;
    use Perl6::Say;
    use Regexp::Common;

    while (<>) {
        say q{a number}              if /$RE{num}{real}/;

        say q{a ['"`] quoted string} if /$RE{quoted}/;

        say q{a /.../ sequence}      if m{$RE{delimited}{'-delim'=>'/'}};

        say q{balanced parentheses}  if /$RE{balanced}{'-parens'=>'()'}/;

        die q{a #*@%-ing word}."\n"  if /$RE{profanity}/;

    }

A sample run follows:

    casiano@millo:~/Lperltesting$ perl regexpcommonsynopsis.pl
    43
    a number
    "2+2 es" 4
    a number
    a ['"`] quoted string
    x/y/z
    a /.../ sequence
    (2*(4+5/(3-2)))
    a number
    balanced parentheses
    fuck you!
    a #*@%-ing word
The following fragment of the Regexp::Common documentation explains the simplified mode of use:
To access a particular pattern, %RE is treated as a hierarchical hash of hashes (of hashes...), with each successive key being an identifier. For example, to access the pattern that matches real numbers, you specify:
$RE{num}{real}
and to access the pattern that matches integers:
$RE{num}{int}
Deeper layers of the hash are used to specify flags: arguments that modify the resulting pattern in some way.
For example, to access the pattern that matches base-2 real numbers with embedded commas separating groups of three digits (e.g. 10,101,110.110101101):
$RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}
Through the magic of Perl, these flag layers may be specified in any order (and even interspersed through the identifier keys!) so you could get the same pattern with:
$RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}
or:
$RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}
or even:
$RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}
etc.
Note, however, that the relative order amongst the identifier keys is significant. That is:
$RE{list}{set}
would not be the same as:
$RE{set}{list}
Let us look at an example with the debugger:

    casiano@millo:~/Lperltesting$ perl -MRegexp::Common -wde 0
    main::(-e:1):   0
      DB<1> x 'numero: 10,101,110.110101101 101.1e-1 234' =~ m{($RE{num}{real}{-base => 2}{-sep => ','}{-group => 3})}g
    0  '10,101,110.110101101'
    1  '101.1e-1'

The regular expression for a real number is relatively complex:

    casiano@millo:~/src/perl/perltesting$ perl5.10.1 -wd c2f_5_10v3.pl
    main::(c2f_5_10v3.pl:5):        print "Enter a temperature (i.e. 32F, 100C):\n";
      DB<1> p $RE{num}{real}
    (?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))

With the -keep option, the pattern provided uses capturing parentheses:

    casiano@millo:~/Lperltesting$ perl -MRegexp::Common -wde 0
    main::(-e:1):   0
      DB<2> x 'one, two, three, four, five' =~ /$RE{list}{-pat => '\w+'}/
    0  1
      DB<3> x 'one, two, three, four, five' =~ /$RE{list}{-pat => '\w+'}{-keep}/
    0  'one, two, three, four, five'
    1  ', '
Perl 5.10 introduces the smart matching operator. The following text is taken almost verbatim from the site of the company Perl Training Australia:
Perl 5.10 introduces a new operator, called smart-match, written ~~. As the name suggests, smart-match tries to compare its arguments in an intelligent fashion. Using smart-match effectively allows many complex operations to be reduced to very simple statements.
Unlike many of the other features introduced in Perl 5.10, there's no need to use the feature pragma to enable smart-match, as long as you're using 5.10 it's available.
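In Perl 5.10.0 the smart-match operator is commutative: $x ~~ $y works the same way as $y ~~ $x, so you never have to remember in which order to place your operands. From Perl 5.10.1 on this no longer holds in general: the behaviour is mostly determined by the type of the right operand, as the perlsyn text quoted further down states, and some of the sessions below print nothing for that reason.
Smart-match in action.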
As a simple introduction, we can use smart-match to do a simple string comparison between simple scalars. For example:
    use feature qw(say);

    my $x = "foo";
    my $y = "bar";
    my $z = "foo";

    say '$x and $y are identical strings' if $x ~~ $y;
    say '$x and $z are identical strings' if $x ~~ $z;    # Printed
If one of our arguments is a number, then a numeric comparison is performed:
    my $num   = 100;
    my $input = <STDIN>;

    say 'You entered 100' if $num ~~ $input;
This will print our message if our user enters 100, 100.00, +100, 1e2, or any other string that looks like the number 100.
We can also smart-match against a regexp:
    my $input = <STDIN>;

    say 'You said the secret word!' if $input ~~ /xyzzy/;
Smart-matching with a regexp also works with saved regexps created with qr.
So we can use smart-match to act like eq, == and =~, so what? Well, it does much more than that.
We can use smart-match to search a list:
    casiano@millo:~/Lperltesting$ perl5.10.1 -wdE 0
    main::(-e:1):   0
      DB<1> @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf)
      DB<2> print "You're a friend" if 'Pippin' ~~ @friends
    You're a friend
      DB<3> print "You're a friend" if 'Mordok' ~~ @friends
It's important to note that searching an array with smart-match is extremely fast. It's faster than using grep, it's faster than using first from List::Util, and it's faster than walking through the loop with foreach, even if you do know all the clever optimisations.
This is the typical way of searching for an element in an array in versions prior to 5.10:

    casiano@millo:~$ perl -wde 0
    main::(-e:1):   0
      DB<1> use List::Util qw{first}
      DB<2> @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf)
      DB<3> x first { $_ eq 'Pippin'} @friends
    0  'Pippin'
      DB<4> x first { $_ eq 'Mordok'} @friends
    0  undef
We can also use smart-match to compare arrays:
      DB<4> @foo = qw(x y z xyzzy ninja)
      DB<5> @bar = qw(x y z xyzzy ninja)
      DB<7> print "Identical arrays" if @foo ~~ @bar
    Identical arrays
      DB<8> @bar = qw(x y z xyzzy nOnjA)
      DB<9> print "Identical arrays" if @foo ~~ @bar
      DB<10>
And even search inside an array using a string:
      DB<11> x @foo = qw(x y z xyzzy ninja)
    0  'x'
    1  'y'
    2  'z'
    3  'xyzzy'
    4  'ninja'
      DB<12> print "Array contains a ninja " if @foo ~~ 'ninja'
or using a regexp:
      DB<13> print "Array contains magic pattern" if @foo ~~ /xyz/
    Array contains magic pattern
      DB<14> print "Array contains magic pattern" if @foo ~~ /\d+/
Smart-match works with array references, too:

      DB<16> $array_ref = [ 1..10 ]
      DB<17> print "Array contains 10" if 10 ~~ $array_ref
    Array contains 10
      DB<18> print "Array contains 10" if $array_ref ~~ 10
      DB<19>
In the case of a number and an array, it returns true if the scalar appears in a nested array:

    casiano@millo:~/Lperltesting$ perl5.10.1 -E 'say "ok" if 42 ~~ [23, 17, [40..50], 70];'
    ok
    casiano@millo:~/Lperltesting$ perl5.10.1 -E 'say "ok" if 42 ~~ [23, 17, [50..60], 70];'
    casiano@millo:~/Lperltesting$
Of course, we can use smart-match with more than just arrays and scalars, it works with searching for the key in a hash, too!
      DB<19> %colour = ( sky => 'blue', grass => 'green', apple => 'red',)
      DB<20> print "I know the colour" if 'grass' ~~ %colour
    I know the colour
      DB<21> print "I know the colour" if 'cloud' ~~ %colour
      DB<22>
      DB<23> print "A key starts with 'gr'" if %colour ~~ /^gr/
    A key starts with 'gr'
      DB<24> print "A key starts with 'clou'" if %colour ~~ /^clou/
      DB<25>
You can even use it to see if the two hashes have identical keys:
      DB<26> print 'Hashes have identical keys' if %taste ~~ %colour;
    Hashes have identical keys
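Later Perl releases marked smart-match as experimental (from 5.18 on), so code that must run on many versions often spells these tests out by hand. The sketch below shows one possible way to write three of the idioms above without ~~; the helper names contains, same_array and same_keys are hypothetical, and all comparisons are assumed to be string comparisons.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Is $item one of the elements of @$list? (string comparison)
sub contains {
    my ($item, $list) = @_;
    my $hit = first { $_ eq $item } @$list;
    return defined $hit ? 1 : 0;
}

# Do two arrays hold the same strings in the same order?
sub same_array {
    my ($x, $y) = @_;
    return 0 unless scalar(@$x) == scalar(@$y);
    for my $i (0 .. $#$x) {
        return 0 unless $x->[$i] eq $y->[$i];
    }
    return 1;
}

# Do two hashes have exactly the same set of keys?
sub same_keys {
    my ($h1, $h2) = @_;
    return 0 unless scalar(keys %$h1) == scalar(keys %$h2);
    my $missing = grep { !exists $h2->{$_} } keys %$h1;
    return $missing == 0 ? 1 : 0;
}

my @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf);
print "You're a friend\n" if contains('Pippin', \@friends);
```

Unlike ~~, these helpers do not descend into nested arrays and do not fall back to numeric comparison.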
The behaviour of the smart matching operator is given by the following table, taken from the section 'Smart matching in detail' in perlsyn:
The behaviour of a smart match depends on what type of thing its arguments are. The behaviour is determined by the following table: the first row that applies determines the match behaviour (which is thus mostly determined by the type of the right operand). Note that the smart match implicitly dereferences any non-blessed hash or array ref, so the "Hash" and "Array" entries apply in those cases. (For blessed references, the "Object" entries apply.)
Note that the "Matching Code" column is not always an exact rendition. For example, the smart match operator short-circuits whenever possible, but grep does not.
    $a      $b        Type of Match Implied    Matching Code
    ======  =====     =====================    =============
    Any     undef     undefined                !defined $a

    Any     Object    invokes ~~ overloading on $object, or dies

    Hash    CodeRef   sub truth for each key[1] !grep { !$b->($_) } keys %$a
    Array   CodeRef   sub truth for each elt[1] !grep { !$b->($_) } @$a
    Any     CodeRef   scalar sub truth          $b->($a)

    Hash    Hash      hash keys identical (every key is found in both hashes)
    Array   Hash      hash slice existence     grep { exists $b->{$_} } @$a
    Regex   Hash      hash key grep            grep /$a/, keys %$b
    undef   Hash      always false (undef can't be a key)
    Any     Hash      hash entry existence     exists $b->{$a}

    Hash    Array     hash slice existence     grep { exists $a->{$_} } @$b
    Array   Array     arrays are comparable[2]
    Regex   Array     array grep               grep /$a/, @$b
    undef   Array     array contains undef     grep !defined, @$b
    Any     Array     match against an array element[3]
                                               grep $a ~~ $_, @$b

    Hash    Regex     hash key grep            grep /$b/, keys %$a
    Array   Regex     array grep               grep /$b/, @$a
    Any     Regex     pattern match            $a =~ /$b/

    Object  Any       invokes ~~ overloading on $object, or falls back:
    Any     Num       numeric equality         $a == $b
    Num     numish[4] numeric equality         $a == $b
    undef   Any       undefined                !defined($b)
    Any     Any       string equality          $a eq $b
    pl@nereida:~/Lperltesting$ cat twonumbers.pl
    $_ = "I have 2 numbers: 53147";
    @pats = qw{
      (.*)(\d*)
      (.*)(\d+)
      (.*?)(\d*)
      (.*?)(\d+)
      (.*)(\d+)$
      (.*?)(\d+)$
      (.*)\b(\d+)$
      (.*\D)(\d+)$
    };

    print "$_\n";
    for $pat (@pats) {
      printf "%-12s ", $pat;
      <>;
      if ( /$pat/ ) {
        print "<$1> <$2>\n";
      } else {
        print "FAIL\n";
      }
    }
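The program above waits for a keystroke before trying each pattern, so the reader can guess the result first. A non-interactive sketch of the same experiment, restricted to four of the patterns, might look as follows; the helper name split_number is hypothetical.

```perl
use strict;
use warnings;

my $text = "I have 2 numbers: 53147";

# Return the two captured groups, or an empty list when the match fails.
sub split_number {
    my ($pattern) = @_;
    return $text =~ /$pattern/ ? ($1, $2) : ();
}

for my $pattern ('(.*)(\d+)', '(.*?)(\d+)', '(.*?)(\d+)$', '(.*\D)(\d+)$') {
    my @groups = split_number($pattern);
    printf "%-14s <%s> <%s>\n", $pattern, @groups;
}
```

The greedy (.*) leaves only the final digit for (\d+), the lazy (.*?) stops at the first digit, and anchoring with $ forces the digits to reach the end of the string.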
To obtain information about how a regular expression is compiled and how the matching process proceeds, we can use the 'debug' option of the re module. Perl 5.10 gives somewhat more readable information than earlier versions:

    pl@nereida:~/Lperltesting$ perl5_10_1 -wde 0

    Loading DB routines from perl5db.pl version 1.32
    Editor support available.

    Enter h or `h h' for help, or `man perldebug' for more help.

    main::(-e:1):   0
      DB<1> use re 'debug'; 'astr' =~ m{[sf].r}
    Compiling REx "[sf].r"
    Final program:
       1: ANYOF[fs][] (12)
      12: REG_ANY (13)
      13: EXACT <r> (15)
      15: END (0)
    anchored "r" at 2 (checking anchored) stclass ANYOF[fs][] minlen 3
    Guessing start of match in sv for REx "[sf].r" against "astr"
    Found anchored substr "r" at offset 3...
    Starting position does not contradict /^/m...
    start_shift: 2 check_at: 3 s: 1 endpos: 2
    Does not contradict STCLASS...
    Guessed: match at offset 1
    Matching REx "[sf].r" against "str"
       1 <a> <str>               |  1:ANYOF[fs][](12)
       2 <as> <tr>               | 12:REG_ANY(13)
       3 <ast> <r>               | 13:EXACT <r>(15)
       4 <astr> <>               | 15:END(0)
    Match successful!
    Freeing REx: "[sf].r"
If the debug option of re is used with regular expression objects, the compilation is reported when the object is built and the matching information appears when the object is used:

      DB<3> use re 'debug'; $re = qr{[sf].r}
    Compiling REx "[sf].r"
    Final program:
       1: ANYOF[fs][] (12)
      12: REG_ANY (13)
      13: EXACT <r> (15)
      15: END (0)
    anchored "r" at 2 (checking anchored) stclass ANYOF[fs][] minlen 3
      DB<4> 'astr' =~ $re
    Guessing start of match in sv for REx "[sf].r" against "astr"
    Found anchored substr "r" at offset 3...
    Starting position does not contradict /^/m...
    start_shift: 2 check_at: 3 s: 1 endpos: 2
    Does not contradict STCLASS...
    Guessed: match at offset 1
    Matching REx "[sf].r" against "str"
       1 <a> <str>               |  1:ANYOF[fs][](12)
       2 <as> <tr>               | 12:REG_ANY(13)
       3 <ast> <r>               | 13:EXACT <r>(15)
       4 <astr> <>               | 15:END(0)
    Match successful!
The following metacharacters have their standard egrep-ish meanings:
    \        Quote the next metacharacter
    ^        Match the beginning of the line
    .        Match any character (except newline)
    $        Match the end of the line (or before newline at the end)
    |        Alternation
    ()       Grouping
    []       Character class
The following standard greedy quantifiers are recognized:
    *        Match 0 or more times
    +        Match 1 or more times
    ?        Match 1 or 0 times
    {n}      Match exactly n times
    {n,}     Match at least n times
    {n,m}    Match at least n but not more than m times
The following non greedy quantifiers are recognized:
    *?       Match 0 or more times, not greedily
    +?       Match 1 or more times, not greedily
    ??       Match 0 or 1 time, not greedily
    {n}?     Match exactly n times, not greedily
    {n,}?    Match at least n times, not greedily
    {n,m}?   Match at least n but not more than m times, not greedily
The following possessive quantifiers are recognized:

    *+       Match 0 or more times and give nothing back
    ++       Match 1 or more times and give nothing back
    ?+       Match 0 or 1 time and give nothing back
    {n}+     Match exactly n times and give nothing back (redundant)
    {n,}+    Match at least n times and give nothing back
    {n,m}+   Match at least n but not more than m times and give nothing back
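What "give nothing back" means is easiest to see when the quantifier is followed by something it could also have consumed. A minimal sketch (the variable names are illustrative only):

```perl
use strict;
use warnings;

# a+ can give one "a" back so that the trailing "a" still matches;
# a++ keeps everything it consumed, so the trailing "a" cannot match.
my $greedy     = ('aaa' =~ /^a+a$/)  ? 1 : 0;
my $possessive = ('aaa' =~ /^a++a$/) ? 1 : 0;

print "greedy=$greedy possessive=$possessive\n";
```

The greedy version succeeds after backtracking one character, while the possessive one fails.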
    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char            (example: ESC)
    \x1B        hex char              (example: ESC)
    \x{263a}    long hex char         (example: Unicode SMILEY)
    \cK         control char          (example: VT)
    \N{name}    named Unicode character
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E
    casiano@tonga:~$ perl -wde 0
    main::(-e:1):   0
      DB<1> $x = '([a-z]+)'
      DB<2> x 'hola' =~ /$x/
    0  'hola'
      DB<3> x 'hola' =~ /\Q$x/
      empty array
      DB<4> x '([a-z]+)' =~ /\Q$x/
    0  1
    \w       Match a "word" character (alphanumeric plus "_")
    \W       Match a non-"word" character
    \s       Match a whitespace character
    \S       Match a non-whitespace character
    \d       Match a digit character
    \D       Match a non-digit character
    \pP      Match P, named property.  Use \p{Prop} for longer names.
    \PP      Match non-P
    \X       Match eXtended Unicode "combining character sequence",
             equivalent to (?>\PM\pM*)
    \C       Match a single C char (octet) even under Unicode.
             NOTE: breaks up characters into their UTF-8 bytes,
             so you may end up with malformed pieces of UTF-8.
             Unsupported in lookbehind.
    \1       Backreference to a specific group.
             '1' may actually be any positive integer.
    \g1      Backreference to a specific or previous group,
    \g{-1}   number may be negative indicating a previous buffer and may
             optionally be wrapped in curly brackets for safer parsing.
    \g{name} Named backreference
    \k<name> Named backreference
    \K       Keep the stuff left of the \K, don't include it in $&
    \v       Vertical whitespace
    \V       Not vertical whitespace
    \h       Horizontal whitespace
    \H       Not horizontal whitespace
    \R       Linebreak
Perl defines the following zero-width assertions:
    \b  Match a word boundary
    \B  Match except at a word boundary
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)
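The \G assertion is mostly used together with the /g and /c modifiers to consume a string piece by piece, each match starting exactly where the previous one ended. The following is a rough sketch of a tiny tokenizer built that way; the function name tokenize and the token labels NUM, ID and OP are invented for the illustration.

```perl
use strict;
use warnings;

# Split a string into tokens; every pattern is anchored with \G at pos().
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if    ($src =~ /\G\s+/gc)            { next }                      # skip blanks
        elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM($1)" }
        elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID($1)" }
        elsif ($src =~ /\G(.)/gcs)           { push @tokens, "OP($1)" }    # any other char
    }
    return @tokens;
}

print join(' ', tokenize('x = 42+y')), "\n";
```

The /c modifier keeps pos() unchanged when a match fails, which is what lets the next alternative try at the same position.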
The POSIX character class syntax:
    [:class:]

is also available. Note that the [ and ] brackets are literal; they must always be used within a character class expression.

    # this is correct:
    $string =~ /[[:alpha:]]/;

    # this is not, and will generate a warning:
    $string =~ /[:alpha:]/;
The available classes and their backslash equivalents (if available) are as follows:
    alpha
    alnum
    ascii
    blank
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s
    upper
    word        \w
    xdigit
For example use [:upper:] to match all the uppercase characters. Note that the [] are part of the [::] construct, not part of the whole character class. For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.
The following equivalences to Unicode \p{} constructs and equivalent backslash character classes (if available), will hold:

    [[:...:]]   \p{...}        backslash

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank
    cntrl       IsCntrl
    digit       IsDigit        \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl    \s
    upper       IsUpper
    word        IsWord         \w
    xdigit      IsXDigit
You can negate the [::] character classes by prefixing the class name with a '^'. This is a Perl extension. For example:

    POSIX         traditional  Unicode

    [[:^digit:]]      \D      \P{IsDigit}
    [[:^space:]]      \S      \P{IsSpace}
    [[:^word:]]       \W      \P{IsWord}
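As a small illustration of the syntax, the sketch below counts characters of an ASCII string with a POSIX class, with its negated form and with the traditional backslash class. The sample string and variable names are arbitrary choices for this example.

```perl
use strict;
use warnings;

my $text = "Perl 5.10, Regexp::Common";

# The "() = ..." idiom forces list context, so the assignment counts matches.
my $upper     = () = $text =~ /[[:upper:]]/g;    # uppercase letters
my $non_digit = () = $text =~ /[[:^digit:]]/g;   # negated POSIX class
my $backslash = () = $text =~ /\D/g;             # traditional equivalent

print "upper=$upper non_digit=$non_digit backslash=$backslash\n";
```

For this string the negated POSIX class and \D select the same characters.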
    $&                The text that matched
    $`                The text to the left of the match
    $'                The text to the right of the match
    $1, $2, $3, etc.  The texts captured by the parentheses
    $+                A copy of the highest-numbered of $1, $2, ...
    @-                Offsets of the beginnings of the substrings matched by $1 ...
    @+                Offsets of the ends of the substrings matched by $1 ...
    $#-               The index of the last parenthesis that matched
    $#+               The index of the last parenthesis in the last regular expression
Example:

    #!/usr/bin/perl -w
    if ("Hello there, neighbor" =~ /\s(\w+),/) {
        print "That was: ($`)($&)($').\n";
    }

    > matchvariables.pl
    That was: (Hello)( there,)( neighbor).
Using these variables used to have a negative effect on regexp performance. See the section 'Why does using $&, $`, or $' slow my program down?' in perlfaq6:
Once Perl sees that you need one of these variables anywhere in the program, it provides them on each and every pattern match. That means that on every pattern match the entire string will be copied, part of it to $`, part to $&, and part to $'. Thus the penalty is most severe with long strings and patterns that match often. Avoid $&, $', and $` if you can, but if you can't, once you've used them at all, use them at will because you've already paid the price. Remember that some algorithms really appreciate them. As of the 5.005 release, the $& variable is no longer "expensive" the way the other two are.
Since Perl 5.6.1 the special variables @- and @+ can functionally replace $`, $& and $'. These arrays contain pointers to the beginning and end of each match (see perlvar for the full story), so they give you essentially the same information, but without the risk of excessive string copying.
Perl 5.10 added three specials, ${^MATCH}, ${^PREMATCH}, and ${^POSTMATCH} to do the same job but without the global performance penalty. Perl 5.10 only sets these variables if you compile or execute the regular expression with the /p modifier.
    pl@nereida:~/Lperltesting$ cat ampersandoldway.pl
    #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w
    use strict;
    use Benchmark qw(cmpthese timethese);

    'hola juan' =~ /ju/;
    my ($a, $b, $c) = ($`, $&, $');

    cmpthese( -1, {
        oldway => sub { 'hola juan' =~ /ju/  },
    });
    pl@nereida:~/Lperltesting$ cat ampersandnewway.pl
    #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w
    use strict;
    use Benchmark qw(cmpthese timethese);

    'hola juan' =~ /ju/p;
    my ($a, $b, $c) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});

    cmpthese( -1, {
        newway => sub { 'hola juan' =~ /ju/  },
    });
    pl@nereida:~/Lperltesting$ time ./ampersandoldway.pl
                Rate oldway
    oldway 2991861/s     --

    real    0m3.761s
    user    0m3.740s
    sys     0m0.020s
    pl@nereida:~/Lperltesting$ time ./ampersandnewway.pl
                Rate newway
    newway 8191999/s     --

    real    0m6.721s
    user    0m6.704s
    sys     0m0.016s
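For comparison with the matchvariables.pl example shown earlier, here is a minimal sketch of the same match written with the /p modifier and the three variables added in Perl 5.10. The variable names $pre, $match and $post are illustrative.

```perl
use strict;
use warnings;

my ($pre, $match, $post);
if ("Hello there, neighbor" =~ /\s(\w+),/p) {
    # On Perl 5.10 these are only set because the pattern carries /p.
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}
print "That was: ($pre)($match)($post).\n";
```

The three pieces are the same ones that $`, $& and $' would give, without mentioning those variables anywhere in the program.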
The variable $+ contains the text matched by the last parenthesis of the pattern that took part in the match. This is useful in situations where one of a set of alternatives matches, but we do not know which one:

      DB<9> "Revision: 4.5" =~ /Version: (.*)|Revision: (.*)/ && ($rev = $+);
      DB<10> x $rev
    0  4.5
      DB<11> "Version: 4.5" =~ /Version: (.*)|Revision: (.*)/ && ($rev = $+);
      DB<12> x $rev
    0  4.5
The array @- contains the offsets of the matches in the last regular expression. The entry $-[0] is the offset of the start of the last successful match, and $-[n] is the offset of the substring matched by the n-th parenthesis (or undef if the parenthesis did not match). For example:

      #           012345678
      DB<1> $z = "hola13.47"
      DB<2> if ($z =~ m{a(\d+)(\.(\d+))?}) { print "@-\n"; }
    3 4 6 7

The result is interpreted as follows:

    $& = a13.47   begins at offset 3
    $1 = 13       begins at offset 4
    $2 = .47      begins at offset 6
    $3 = 47       begins at offset 7
This is what perlvar says about @-:
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The nth element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:

    $` is the same as substr($var, 0, $-[0])
    $& is the same as substr($var, $-[0], $+[0] - $-[0])
    $' is the same as substr($var, $+[0])
    $1 is the same as substr($var, $-[1], $+[1] - $-[1])
    $2 is the same as substr($var, $-[2], $+[2] - $-[2])
    $3 is the same as substr($var, $-[3], $+[3] - $-[3])
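These equivalences can be checked directly. The sketch below rebuilds the whole match and each group from the offsets alone, reusing the string and pattern of the @- example; the names $whole and @groups are our own, and the code assumes that every group took part in the match.

```perl
use strict;
use warnings;

my $var = "hola13.47";
$var =~ m{a(\d+)(\.(\d+))?} or die "no match\n";

# Rebuild the match and its groups from the offsets alone.
my $whole  = substr($var, $-[0], $+[0] - $-[0]);
my @groups = map { substr($var, $-[$_], $+[$_] - $-[$_]) } 1 .. $#+;

print "offsets: @-\n";
print "whole=$whole groups=@groups\n";
```

Because only substr and the offset arrays are used, the program never touches $`, $& or $'.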
The array @+ contains the offsets of the ends of the matches. The entry $+[0] contains the offset of the end of the whole match. Continuing with the previous example:

      #            0123456789
      DB<17> $z = "hola13.47x"
      DB<18> if ($z =~ m{a(\d+)(\.)(\d+)?}) { print "@+\n"; }
    9 6 7 9

The result is interpreted as follows:

    $& = a13.47   ends at offset 9
    $1 = 13       ends at offset 6
    $2 = .        ends at offset 7
    $3 = 47       ends at offset 9
$#+ can be used to determine how many parentheses there were in the last successful match.

      DB<29> $z = "h"
      DB<30> print "$#+\n" if ($z =~ m{(a)(b)}) || ($z =~ m{(h)(.)?(.)?})
    3
      DB<31> $z = "ab"
      DB<32> print "$#+\n" if ($z =~ m{(a)(b)}) || ($z =~ m{(h)(.)?(.)?})
    2
The variable $#- contains the index of the last parenthesis that matched. Observe the following debugger session:

      DB<1> $x = '13.47'; $y = '125'
      DB<2> if ($y =~ m{(\d+)(\.(\d+))?}) { print "last par = $#-, content = $+\n"; }
    last par = 1, content = 125
      DB<3> if ($x =~ m{(\d+)(\.(\d+))?}) { print "last par = $#-, content = $+\n"; }
    last par = 3, content = 47
In general it cannot be assumed that @- and @+ have the same size.

      DB<1> "a" =~ /(a)|(b)/; @a = @-; @b = @+
      DB<2> x @a
    0  0
    1  0
      DB<3> x @b
    0  1
    1  1
    2  undef
As we know, certain variables (such as $1, $& ...) automatically receive a value with each matching operation. Consider the following code:

    if (m/(...)/) {
      &do_something();
      print "the matched variable was $1.\n";
    }

Since $1 is automatically declared local on entry to each block, no matter what was done in the function &do_something(), the value of $1 in the print statement is the one corresponding to the match performed in the if.
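The following sketch makes the point concrete with a do_something that performs a match of its own. The strings and the variable names $inner and $outer are made up for the illustration.

```perl
use strict;
use warnings;

# A helper that performs its own match and therefore sets its own $1.
sub do_something {
    "zzz" =~ /(z+)/;
    return $1;
}

my ($inner, $outer);
if ("abcdef" =~ /(...)/) {
    $inner = do_something();   # the helper saw its own capture
    $outer = $1;               # the caller's capture is untouched
}
print "inner=$inner outer=$outer\n";
```

The capture made inside the helper is discarded when the helper returns, so the caller still sees the text captured by its own match.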
    Modifier   Meaning
    e          evaluate: evaluate the right-hand side of a substitution as an expression
    g          global: find all occurrences
    i          ignore case: do not distinguish between upper and lower case
    m          multiline: ^ and $ match at internal \n
    o          optimize: compile only once
    s          ^ and $ ignore \n, but the dot . matches \n
    x          extended: allow whitespace and comments
    #!/usr/bin/perl -w
    ($one, $five, $fifteen) = (`uptime` =~ /(\d+\.\d+)/g);
    print "$one, $five, $fifteen\n";

Note the output:

    > uptime
      1:35pm  up 19:22,  0 users,  load average: 0.01, 0.03, 0.00
    > glist.pl
    0.01, 0.03, 0.00
In scalar context m//g iterates over the string, returning true each time it matches and false when it stops matching. In other words, it remembers where it stopped the last time and resumes the search from that point. The position of the match can be obtained with the pos function. If for some reason you modify the string in question, the match position is reset to the beginning of the string.

    #!/usr/bin/perl -w
    # count sentences in a document
    # defined as ending in [.!?] perhaps with
    # quotes or parens on either side.
    $/ = ""; # paragraph mode
    while ($paragraph = <>) {
        print $paragraph;
        while ($paragraph =~ /[a-z]['")]*[.!?]+['")]*\s/g) {
            $sentences++;
        }
    }
    print "$sentences\n";
Note the use of the special variable $/. This variable holds the input record separator. If it is set to the empty string, blank lines are used as separators. It can be given a multi-character string to be used as delimiter. Note that setting it to \n\n is different from setting it to "". If it is left undef, the next read will read the whole file.
An example run follows. The program is called gscalar.pl. We enter the text from STDIN. The program prints the number of sentences:

    > gscalar.pl
    este primer parrafo. Sera seguido de un segundo parrafo.
    "Cita de Seneca".
    3
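To see how m//g remembers its position in scalar context, the sketch below records pos() after each match. It reuses the load-average text from the uptime example as a fixed string; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "load average: 0.01, 0.03, 0.00";

my @found;
while ($line =~ /(\d+\.\d+)/g) {
    # pos() is the offset just past the end of the most recent match
    push @found, [ $1, pos($line) ];
}
printf "%s ends at offset %d\n", @$_ for @found;
```

Each iteration resumes where the previous match ended, and the loop stops when no further match is found.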
The /e modifier allows the replacement string to be evaluated as a Perl expression (instead of being treated as a double-quoted string).

    #!/usr/bin/perl -w
    $_ = "abc123xyz\n";
    s/\d+/$&*2/e;
    print;
    s/\d+/sprintf("%5d",$&)/e;
    print;
    s/\w/$& x 2/eg;
    print;

The result of the execution is:

    > replacement.pl
    abc246xyz
    abc  246xyz
    aabbcc  224466xxyyzz
Here is an example with nested /e:

    #!/usr/bin/perl
    $a = "one";
    $b = "two";
    $_ = '$a $b';
    print "_ = $_\n\n";
    s/(\$\w+)/$1/ge;
    print "After 's/(\$\w+)/$1/ge' _ = $_\n\n";
    s/(\$\w+)/$1/gee;
    print "After 's/(\$\w+)/$1/gee' _ = $_\n\n";

The result of the execution is:

    > enested.pl
    _ = $a $b

    After 's/($w+)/$b/ge' _ = $a $b

    After 's/($w+)/$b/gee' _ = one two
Here is a solution that uses e for the following exercise (see 'Regex to add space after punctuation sign' on PerlMonks). We want to put a blank space after every comma:

    s/,/, /g;

but the substitution must not take place if the comma is embedded between two digits. In addition, if there is already a space after the comma, it must not be duplicated:

    s/(\d[,.]\d)|(,(?!\s))/$1 || "$2 "/ge;
A negative lookahead (?!\s) is used. See section 31.2.3 to understand how a negative lookahead works.
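The substitution can be tried on a short string that contains all three cases. In the sketch below the wrapper name space_after_commas and the sample text are assumptions made for the illustration; the substitution itself is the one given above.

```perl
use strict;
use warnings;

# Add a space after each comma, except between digits or before whitespace.
sub space_after_commas {
    my ($text) = @_;
    $text =~ s/(\d[,.]\d)|(,(?!\s))/$1 || "$2 "/ge;
    return $text;
}

print space_after_commas("a,b 1,5 c, d"), "\n";
```

The first alternative consumes digit-comma-digit sequences and puts them back unchanged, so the second alternative only ever sees the remaining commas.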
Casiano Rodríguez León