(?#text)
Un comentario. Se ignora text
.
Si se usa la opción x
basta con poner #
.
Los modificadores de la conducta de una expresión regular pueden ser
empotrados en una subexpresión usando el formato (?pimsx-imsx)
.
Véase el correspondiente texto Extended Patterns de la sección 'Extended-Patterns' en perlre:
One or more embedded pattern-match modifiers, to be turned on (or turned
off, if preceded by '-
' ) for the remainder of the pattern or the remainder
of the enclosing pattern group (if any). This is particularly useful
for dynamic patterns, such as those read in from a configuration file,
taken from an argument, or specified in a table somewhere. Consider
the case where some patterns want to be case sensitive and some do not:
The case insensitive ones merely need to include (?i) at the front of
the pattern. For example:
1. $pattern = "foobar"; 2. if ( /$pattern/i ) { } 3. 4. # more flexible: 5. 6. $pattern = "(?i)foobar"; 7. if ( /$pattern/ ) { }
These modifiers are restored at the end of the enclosing group. For example,
1. ( (?i) blah ) \s+ \1
will matchblah
in any case, some spaces, and an exact (including the case!) repetition of the previous word, assuming the/x
modifier, and no/i
modifier outside this group.
El siguiente ejemplo extiende el ejemplo visto en
la sección
31.1.1
eliminando los comentarios /* ... */
y // ...
de un programa C. En dicho ejemplo se usaba el modificador s
para hacer que el punto casara con cualquier carácter:
casiano@tonga:~/Lperltesting$ cat -n extendedcomments.pl 1 #!/usr/bin/perl -w 2 use strict; 3 4 my $progname = shift @ARGV or die "Usage:\n$0 prog.c\n"; 5 open(my $PROGRAM,"<$progname") || die "can't find $progname\n"; 6 my $program = ''; 7 { 8 local $/ = undef; 9 $program = <$PROGRAM>; 10 } 11 $program =~ s{(?xs) 12 /\* # Match the opening delimiter 13 .*? # Match a minimal number of characters 14 \*/ # Match the closing delimiter 15 | 16 (?-s)//.* # C++ // comments. No s modifier 17 }[]g; 18 19 print $program;Sigue un ejemplo de ejecución. Usaremos como entrada el programa C:
casiano@tonga:~/Lperltesting$ cat -n ehello.c 1 #include <stdio.h> 2 /* first 3 comment 4 */ 5 main() { // A C++ comment 6 printf("hello world!\n"); /* second comment */ 7 } // final commentAl ejecutar el programa eliminamos los comentarios:
casiano@tonga:~/Lperltesting$ extendedcomments.pl ehello.c | cat -n 1 #include <stdio.h> 2 3 main() { 4 printf("hello world!\n"); 5 }
El siguiente fragmento esta 'casi' literalmente tomado de la sección 'Looking-ahead-and-looking-behind' en perlretut:
In Perl regular expressions, most regexp elements 'eat up' a certain
amount of string when they match. For instance, the regexp element
[abc}]
eats up one character of the string when it matches, in the sense
that Perl moves to the next character position in the string after the
match. There are some elements, however, that don't eat up characters
(advance the character position) if they match.
The examples we have seen
so far are the anchors. The anchor ^
matches the beginning of the line,
but doesn't eat any characters.
Similarly, the word boundary anchor
\b
matches wherever a character matching \w
is next to a character that
doesn't, but it doesn't eat up any characters itself.
Anchors are examples of zero-width assertions. Zero-width, because they consume no characters, and assertions, because they test some property of the string.
In the context of our walk in the woods analogy to regexp matching, most regexp elements move us along a trail, but anchors have us stop a moment and check our surroundings. If the local environment checks out, we can proceed forward. But if the local environment doesn't satisfy us, we must backtrack.
Checking the environment entails either looking ahead on the trail, looking behind, or both.
^
looks behind, to see that there are no
characters before.
$
looks ahead, to see that there are no characters
after.
\b
looks both ahead and behind, to see if the characters on either
side differ in their "word-ness".
The lookahead and lookbehind assertions are generalizations of the anchor concept. Lookahead and lookbehind are zero-width assertions that let us specify which characters we want to test for.
The lookahead assertion
is denoted by (?=regexp)
and the lookbehind assertion is denoted by
(?<=fixed-regexp)
.
En español, operador de ``trailing'' o ``mirar-adelante'' positivo.
Por ejemplo, /\w+(?=\t)/
solo casa una palabra si va seguida de un tabulador, pero el tabulador no formará parte de $&
.
Ejemplo:
> cat -n lookahead.pl 1 #!/usr/bin/perl 2 3 $a = "bugs the rabbit"; 4 $b = "bugs the frog"; 5 if ($a =~ m{bugs(?= the cat| the rabbit)}i) { print "$a matches. $& = $&\n"; } 6 else { print "$a does not match\n"; } 7 if ($b =~ m{bugs(?= the cat| the rabbit)}i) { print "$b matches. $& = $&\n"; } 8 else { print "$b does not match\n"; }Al ejecutar el programa obtenemos:
> lookahead.pl bugs the rabbit matches. $& = bugs bugs the frog does not match >
Some examples using the debugger31.4:
DB<1> #012345678901234567890 DB<2> $x = "I catch the housecat 'Tom-cat' with catnip" DB<3> print "($&) (".pos($x).")\n" if $x =~ /cat(?=\s)/g (cat) (20) # matches 'cat' in 'housecat' DB<5> $x = "I catch the housecat 'Tom-cat' with catnip" # To reset pos DB<6> x @catwords = ($x =~ /(?<=\s)cat\w+/g) 0 'catch' 1 'catnip' DB<7> #012345678901234567890123456789 DB<8> $x = "I catch the housecat 'Tom-cat' with catnip" DB<9> print "($&) (".pos($x).")\n" if $x =~ /\bcat\b/g (cat) (29) # matches 'cat' in 'Tom-cat' DB<10> $x = "I catch the housecat 'Tom-cat' with catnip" DB<11> x $x =~ /(?<=\s)cat(?=\s)/ empty array DB<12> # doesn't match; no isolated 'cat' in middle of $x
Véase el nodo A hard RegEx problem en PerlMonks. Un monje solicita:
Hi Monks,
I wanna to match this issues:
Pls help me check. Thanks
Solución:
casiano@millo:~$ perl -wde 0 main::(-e:1): 0 DB<1> x 'aaa2a1' =~ /\A(?=.*[a-z])(?=.*\d)\w{3,10}\z/i 0 1 DB<2> x 'aaaaaa' =~ /\A(?=.*[a-z])(?=.*\d)\w{3,10}\z/i empty array DB<3> x '1111111' =~ /\A(?=.*[a-z])(?=.*\d)\w{3,10}\z/i empty array DB<4> x '1111111bbbbb' =~ /\A(?=.*[a-z])(?=.*\d)\w{3,10}\z/i empty array DB<5> x '111bbbbb' =~ /\A(?=.*[a-z])(?=.*\d)\w{3,10}\z/i 0 1
Note that the parentheses in (?=regexp)
and (?<=regexp)
are non-capturing, since these are zero-width assertions.
Lookahead
(?=regexp)
can match arbitrary regexps, but lookbehind
(?<=fixed-regexp)
only works for regexps of fixed width, i.e., a fixed number of characters
long.
Thus (?<=(ab|bc))
is fine, but (?<=(ab)*)
is not.
The negated
versions of the lookahead and lookbehind assertions are denoted by
(?!regexp)
and (?<!fixed-regexp)
respectively.
They evaluate true if
the regexps do not match:
$x = "foobar"; $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo' $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo' $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo'
Here is an example where a string containing blank-separated words, numbers and single dashes is to be split into its components.
Using /\s+/
alone won't work, because spaces are not required between dashes, or a
word or a dash. Additional places for a split are established by looking
ahead and behind:
casiano@tonga:~$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> $str = "one two - --6-8" DB<2> x @toks = split / \s+ | (?<=\S) (?=-) | (?<=-) (?=\S)/x, $str 0 'one' 1 'two' 2 '-' 3 '-' 4 '-' 5 6 6 '-' 7 8
El siguiente párrafo ha sido extraído la sección 'Look-Around-Assertions' en pelre. Usémoslo como texto de repaso:
Look-around assertions are zero width patterns which match a specific
pattern without including it in $&
. Positive assertions match when their
subpattern matches, negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position,
look-ahead matches text following the current match position.
(?=pattern)
A zero-width positive look-ahead assertion. For example, /\w+(?=\t)/
matches a word followed by a tab, without including the tab in $&
.
(?!pattern)
A zero-width negative look-ahead assertion. For example /foo(?!bar)/
matches any occurrence of foo
that isn't followed by bar
.
Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.
If you are looking for a bar
that isn't preceded by a foo
,
/(?!foo)bar/
will not do what you want.
That's because the
(?!foo)
is just saying that the next thing cannot be foo
-and it's not, it's a bar
, so foobar
will match.
You would have to do something like /(?!foo)...bar/
for that.
We say "like" because there's the case of your bar
not having
three characters before it.
You could cover that this way:
/(?:(?!foo)...|^.{0,2})bar/
. Sometimes it's still easier just to say:
if (/bar/ && $` !~ /foo$/)
For look-behind see below.
(?<=pattern)
A zero-width positive look-behind assertion.
For example, /(?<=\t)\w+/
matches a word that follows a tab, without including the tab in $&
.
Works only for fixed-width look-behind.
\K
There is a special form of this construct, called \K
, which causes
the regex engine to 'keep' everything it had matched prior to the \K
and not include it in $&
. This effectively provides variable length
look-behind. The use of \K
inside of another look-around assertion is
allowed, but the behaviour is currently not well defined.
For various reasons \K
may be significantly more efficient than the
equivalent (?<=...)
construct, and it is especially useful in situations
where you want to efficiently remove something following something else
in a string. For instance
s/(foo)bar/$1/g;
can be rewritten as the much more efficient
s/foo\Kbar//g;
Sigue una sesión con el depurador que ilustra la semántica del operador:
casiano@millo:~$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> print "& = <$&> 1 = <$1>\n" if "alphabet" =~ /([^aeiou][a-z][aeiou])[a-z]/ & = <phab> 1 = <pha> DB<2> print "& = <$&> 1 = <$1>\n" if "alphabet" =~ /\K([^aeiou][a-z][aeiou])[a-z]/ & = <phab> 1 = <pha> DB<3> print "& = <$&> 1 = <$1>\n" if "alphabet" =~ /([^aeiou]\K[a-z][aeiou])[a-z]/ & = <hab> 1 = <pha> DB<4> print "& = <$&> 1 = <$1>\n" if "alphabet" =~ /([^aeiou][a-z]\K[aeiou])[a-z]/ & = <ab> 1 = <pha> DB<5> print "& = <$&> 1 = <$1>\n" if "alphabet" =~ /([^aeiou][a-z][aeiou])\K[a-z]/ & = <b> 1 = <pha> DB<6> print "& = <$&> 1 = <$1>\n" if "alphabet" =~ /([^aeiou][a-z][aeiou])[a-z]\K/ & = <> 1 = <pha> DB<7> @a = "alphabet" =~ /([aeiou]\K[^aeiou])/g; print "$&\n" t DB<8> x @a 0 'al' 1 'ab' 2 'et'
Otro ejemplo: eliminamos los blancos del final en una cadena:
DB<23> $x = ' cadena entre blancos ' DB<24> ($y = $x) =~ s/.*\b\K.*//g DB<25> p "<$y>" < cadena entre blancos>
(?<!pattern)
A zero-width negative look-behind assertion.
For example /(?<!bar)foo/
matches any occurrence of
foo
that does not follow bar
. Works only for fixed-width look-behind.
Veamos un ejemplo de uso. Se quiere
sustituir las extensiones .something
por .txt
en cadenas que contienen una ruta a un fichero:
casiano@millo:~$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> ($b = $a = 'abc/xyz.something') =~ s{\.[^.]*$}{.txt} DB<2> p $b abc/xyz.txt DB<3> ($b = $a = 'abc/xyz.something') =~ s/\.\K[^.]*$/txt/; DB<4> p $b abc/xyz.txt DB<5> p $a abc/xyz.something
Véase también:
Escriba una expresión regular que encuentre la última aparición de la cadena foo
en una cadena dada.
DB<6> x ($a = 'foo foo bar bar foo bar bar') =~ /foo(?!.*foo)/g; print pos($a)."\n" 19 DB<7> x ($a = 'foo foo bar bar foo bar bar') =~ s/foo(?!.*foo)/\U$&/ 0 1 DB<8> x $a 0 'foo foo bar bar FOO bar bar'
Aparentemente el operador ``mirar-adelante'' negativo es parecido a usar el operador ``mirar-adelante'' positivo con la negación de una clase.
/regexp(?![abc])/ |
/regexp(?=[^abc])/ |
Sin embargo existen al menos dos diferencias:
\d+(?!\.)
casa con $a = '452'
, mientras que \d+(?=[^.])
lo hace, pero porque
452
es 45
seguido de un carácter que no es el punto:
> cat lookaheadneg.pl #!/usr/bin/perl $a = "452"; if ($a =~ m{\d+(?=[^.])}i) { print "$a casa clase negada. \$& = $&\n"; } else { print "$a no casa\n"; } if ($a =~ m{\d+(?!\.)}i) { print "$a casa predicción negativa. \$& = $&\n"; } else { print "$b no casa\n"; } nereida:~/perl/src> lookaheadneg.pl 452 casa clase negada. $& = 45 452 casa predicción negativa. $& = 452
Otros dos ejemplos:
^(?![A-Z]*$)[a-zA-Z]*$
casa con líneas formadas por secuencias de letras tales que no todas son mayúsculas. (Obsérvese el uso de las anclas).
^(?=.*?esto)(?=.*?eso)
casan con cualquier línea en la que aparezcan
esto
y eso
. Ejemplo:
> cat estoyeso.pl #!/usr/bin/perl my $a = shift; if ($a =~ m{^(?=.*?esto)(?=.*?eso)}i) { print "$a matches.\n"; } else { print "$a does not match\n"; } >estoyeso.pl 'hola eso y esto' hola eso y esto matches. > estoyeso.pl 'hola esto y eso' hola esto y eso matches. > estoyeso.pl 'hola aquello y eso' hola aquello y eso does not match > estoyeso.pl 'hola esto y aquello' hola esto y aquello does not matchEl ejemplo muestra que la interpretación es que cada operador mirar-adelante se interpreta siempre a partir de la posición actual de búsqueda. La expresión regular anterior es básicamente equivalente a
(/esto/ && /eso/)
.
(?!000)(\d\d\d)
casa con cualquier cadena de tres dígitos que no
sea la cadena 000
.
Nótese que el ``mirar-adelante'' negativo
no puede usarse fácilmente para imitar un ``mirar-atrás'',
esto es, que no se puede imitar la conducta de
(?<!foo)bar
mediante
algo como (/?!foo)bar
. Tenga en cuenta que:
(?!foo)
es que los tres caracteres que siguen no puede ser foo
.
foo
no pertenece a /(?!foo)bar/
, pero
foobar
pertenece a (?!foo)bar/
porque bar
es una cadena
cuyos tres siguientes caracteres son bar
y no son foo
.
(?<!foo)bar
usando un lookahead negativo
tendríamos que escribir algo asi como
/(?!foo)...bar/
que casa con una cadena de tres caracteres que no sea foo
seguida de
bar
(pero que tampoco es exactamente equivalente):
pl@nereida:~/Lperltesting$ cat -n foobar.pl 1 use v5.10; 2 use strict; 3 4 my $a = shift; 5 6 for my $r (q{(?<!foo)bar}, q{(?!foo)bar}, q{(?!foo)...bar}) { 7 if ($a =~ /$r/) { 8 say "$a casa con $r" 9 } 10 else { 11 say "$a no casa con $r" 12 } 13 }
q{(?!foo)...bar}
se apróxima mas a (q{(?<!foo)bar}
:
pl@nereida:~/Lperltesting$ perl5.10.1 foobar.pl foobar foobar no casa con (?<!foo)bar foobar casa con (?!foo)bar foobar no casa con (?!foo)...bar pl@nereida:~/Lperltesting$ perl5.10.1 foobar.pl bar bar casa con (?<!foo)bar bar casa con (?!foo)bar bar no casa con (?!foo)...bar
bar
casa con (?<!foo)bar
pero no con (?!foo)...bar
.
¿Sabría encontrar una expresión regular mas apropiada usando lookahead negativo?
if (/bar/ and $` !~ /foo$/)o aún mejor (véase 31.1.4):
if (/bar/p && ${^PREMATCH} =~ /foo$/)El siguiente programa puede ser utilizado para ilustrar la equivalencia:
pl@nereida:~/Lperltesting$ cat -n foobarprematch.pl 1 use v5.10; 2 use strict; 3 4 $_ = shift; 5 6 if (/bar/p && ${^PREMATCH} =~ /foo$/) { 7 say "$_ no cumple ".q{/bar/p && ${^PREMATCH} =~ /foo$/}; 8 } 9 else { 10 say "$_ cumple ".q{/bar/p && ${^PREMATCH} =~ /foo$/}; 11 } 12 if (/(?<!foo)bar/) { 13 say "$_ casa con (?<!foo)bar" 14 } 15 else { 16 say "$_ no casa con (?<!foo)bar" 17 }Siguen dos ejecuciones:
pl@nereida:~/Lperltesting$ perl5.10.1 foobarprematch.pl bar bar cumple /bar/p && ${^PREMATCH} =~ /foo$/ bar casa con (?<!foo)bar pl@nereida:~/Lperltesting$ perl5.10.1 foobarprematch.pl foobar foobar no cumple /bar/p && ${^PREMATCH} =~ /foo$/ foobar no casa con (?<!foo)bar
foo
por foo,
usando \K
o lookbehind
lookahead
por look-ahead
usando lookaheads y lookbehinds
foo
y bar
siempre que no se incluya la
palabra baz
DB<1> x 'abc' =~ /(?=(.)(.)(.))a(b)/
s/,/, /g;pero se quiere que la sustitución no tenga lugar si la coma esta incrustada entre dos dígitos.
s/,/, /g;pero se quiere que la sustitución no tenga lugar si la coma esta incrustada entre dos dígitos. Además se pide que si hay ya un espacio después de la coma, no se duplique
pl@nereida:~/Lperltesting$ cat -n ABC123.pl 1 use warnings; 2 use strict; 3 4 my $c = 0; 5 my @p = ('^(ABC)(?!123)', '^(\D*)(?!123)',); 6 7 for my $r (@p) { 8 for my $s (qw{ABC123 ABC445}) { 9 $c++; 10 print "$c: '$s' =~ /$r/ : "; 11 <>; 12 if ($s =~ /$r/) { 13 print " YES ($1)\n"; 14 } 15 else { 16 print " NO\n"; 17 } 18 } 19 }
Citando la sección Defining named patterns en el documento la sección 'Defining-named-patterns' en perlretut para perl5.10:
Some regular expressions use identical subpatterns in several places. Starting with Perl 5.10, it is possible to define named subpatterns in a section of the pattern so that they can be called up by name anywhere in the pattern. This syntactic pattern for this definition group is"(?(DEFINE)(?<name>pattern)...)"
An insertion of a named pattern is written as(?&name)
.
Veamos un ejemplo que define el lenguaje de los números en punto flotante:
pl@nereida:~/Lperltesting$ cat -n definingnamedpatterns.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 2 use v5.10; 3 4 my $regexp = qr{ 5 ^ (?<num> 6 (?&osg)[\t\ ]* (?: (?&int)(?&dec)? | (?&dec) ) 7 ) 8 (?: [eE] 9 (?<exp> (?&osg)(?&int)) )? 10 $ 11 (?(DEFINE) 12 (?<osg>[-+]?) # optional sign 13 (?<int>\d++) # integer 14 (?<dec>\.(?&int)) # decimal fraction 15 ) 16 }x; 17 18 my $input = <>; 19 chomp($input); 20 my @r; 21 if (@r = $input =~ $regexp) { 22 my $exp = $+{exp} || ''; 23 say "$input matches: (num => '$+{num}', exp => '$exp')"; 24 } 25 else { 26 say "does not match"; 27 }perlretut comenta sobre este ejemplo:
The example above illustrates this feature.
The three subpatterns that
are used more than once are the optional sign,
the digit sequence for
an integer
and the decimal fraction. The DEFINE
group at the end of
the pattern contains their definition. Notice that the decimal fraction
pattern is the first place where we can reuse the integer pattern.
Curiosamente, (DEFINE)
se considera un caso particular de
las expresiones regulares condicionales de la forma (?(condition)yes-pattern)
(véase la sección 31.2.10).
Esto es lo que dice la sección 'Extended-Patterns' en perlre
al respecto:
A special form is the (DEFINE)
predicate, which never executes
directly its yes-pattern, and does not allow a no-pattern. This allows
to define subpatterns which will be executed only by using the recursion
mechanism. This way, you can define a set of regular expression rules
that can be bundled into any pattern you choose.
It is recommended that for this usage you put the DEFINE
block at the
end of the pattern, and that you name any subpatterns defined within it.
Also, it's worth noting that patterns defined this way probably will not be as efficient, as the optimiser is not very clever about handling them.
An example of how this might be used is as follows:
1. /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT)) 2. (?(DEFINE) 3. (?<NAME_PAT>....) 4. (?<ADRESS_PAT>....) 5. )/x
Note that capture buffers matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing buffers is necessary. Thus$+{NAME_PAT}
would not be defined even though$+{NAME}
would be.
%+
y %-
. Con respecto a el hash %+
:
%LAST_PAREN_MATCH
, %+
Similar to @+
, the %+
hash allows access to the named capture buffers,
should they exist, in the last successful match in the currently active
dynamic scope.
For example, $+{foo}
is equivalent to $1
after the following match:
1. 'foo' =~ /(?<foo>foo)/;
The keys of the %+
hash list only the names of buffers that have
captured (and that are thus associated to defined values).
The underlying behaviour of %+
is provided by the Tie::Hash::NamedCapture
module.
Note: %-
and %+
are tied views into a common internal
hash associated with the last successful regular expression. Therefore
mixing iterative access to them via each
may have unpredictable
results. Likewise, if the last successful match changes, then the results
may be surprising.
%-
Similar to %+
, this variable allows access to the named capture
buffers in the last successful match in the currently active dynamic
scope. To each capture buffer name found in the regular expression,
it associates a reference to an array containing the list of values
captured by all buffers with that name (should there be several of them),
in the order where they appear.
Here's an example:
1. if ('1234' =~ /(?<A>1)(?<B>2)(?<A>3)(?<B>4)/) { 2. foreach my $bufname (sort keys %-) { 3. my $ary = $-{$bufname}; 4. foreach my $idx (0..$#$ary) { 5. print "\$-{$bufname}[$idx] : ", 6. (defined($ary->[$idx]) ? "'$ary->[$idx]'" : "undef"), 7. "\n"; 8. } 9. } 10. }
would print out:
1. $-{A}[0] : '1' 2. $-{A}[1] : '3' 3. $-{B}[0] : '2' 4. $-{B}[1] : '4'
The keys of the %-
hash correspond to all buffer names found in
the regular expression.
Perl 5.10 introduce la posibilidad de definir subpatrones en una sección del patrón. Citando la versión del documento perlretut para perl5.10:
This feature (introduced in Perl 5.10) significantly extends the power
of Perl’s pattern matching. By referring to some other capture group
anywhere in the
pattern with the construct (?group-ref)
, the pattern within the
referenced group is used as an independent subpattern in place of the
group reference itself.
Because the group reference may be contained within the group it refers
to, it is now possible to apply pattern matching to tasks that hitherto
required a
recursive parser.
...
In(?...)
both absolute and relative backreferences may be used. The entire pattern can be reinserted with(?R)
or(?0)
. If you prefer to name your buffers, you can use(?&name)
to recurse into that buffer.
Véase un ejemplo que reconoce los palabra-palíndromos (esto es, la lectura directa y la inversa de la cadena pueden diferir en los signos de puntuación):
casiano@millo:~/Lperltesting$ cat -n palindromos.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 2 use v5.10; 3 4 my $regexp = qr/^(\W* 5 (?: 6 (\w) (?1) \g{-1} # palindromo estricto 7 | 8 \w? # no recursiva 9 ) 10 \W*)$/ix; 11 12 my $input = <>; 13 chomp($input); 14 if ($input =~ $regexp) { 15 say "$input is a palindrome"; 16 } 17 else { 18 say "does not match"; 19 }
i
en la regexp qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix
?Siguen algunos ejemplos de ejecución31.5
pl@nereida:~/Lperltesting$ ./palindromos.pl A man, a plan, a canal: Panama! A man, a plan, a canal: Panama! is a palindrome pl@nereida:~/Lperltesting$ ./palindromos.pl A man, a plan, a cam, a yak, a yam, a canal – Panama! A man, a plan, a cam, a yak, a yam, a canal – Panama! is a palindrome pl@nereida:~/Lperltesting$ ./palindromos.pl A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal – Panama! A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal – Panama! is a palindrome pl@nereida:~/Lperltesting$ ./palindromos.pl saippuakauppias saippuakauppias is a palindrome pl@nereida:~/Lperltesting$ ./palindromos.pl dfghjgfd does not match pl@nereida:~/Lperltesting$ ./palindromos.pl ...,;;;; ...,;;;; is a palindrome
(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)
Similar to (??{ code })
(véase la sección
31.2.9)
except it does not involve compiling any code,
instead it treats the contents of a capture buffer as an independent
pattern that must match at the current position. Capture buffers
contained by the pattern will have the value as determined by the
outermost recursion.
PARNO
is a sequence of digits (not starting with 0) whose value
reflects the paren-number of the capture buffer to recurse to.
(?R)
recurses to the beginning of the whole pattern.(?0)
is an alternate syntax for(?R)
.
If PARNO
is preceded by a plus or minus sign then it
is assumed to be relative, with negative numbers indicating preceding
capture buffers and positive ones following. Thus (?-1) refers to the
most recently declared buffer, and (?+1) indicates the next buffer to
be declared.
Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed buffers are included.Hay una diferencia fundamental entre
\g{-1}
y (?-1)
.
El primero significa lo que casó con el último paréntesis.
El segundo significa que se debe llamar a la expresión regular
que define el último paréntesis. Véase un ejemplo:
pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> x ($a = "12 aAbB 34") =~ s/([aA])(?-1)(?+1)([bB])/-\1\2-/g 0 1 DB<2> p $a 12 -aB- 34
En perlre también se comenta sobre este punto:
If there is no corresponding capture buffer defined, then it is a fatal error. Recursing deeper than 50 times without consuming any input string will also result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build.
El siguiente programa (inspirado en uno
que aparece en perlre)
reconoce una llamada a una función foo()
que puede contener una secuencia de expresiones con paréntesis equilibrados
como argumento:
1 pl@nereida:~/Lperltesting$ cat perlrebalancedpar.pl 2 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 3 use v5.10; 4 use strict; 5 6 my $regexp = qr{ ( # paren group 1 (full function) 7 foo 8 ( # paren group 2 (parens) 9 \( 10 ( # paren group 3 (contents of parens) 11 (?: 12 [^()]+ # Non-parens 13 | 14 (?2) # Recurse to start of paren group 2 15 )* 16 ) # 3 17 \) 18 ) # 2 19 ) # 1 20 }x; 21 22 my $input = <>; 23 chomp($input); 24 my @res = ($input =~ /$regexp/); 25 if (@res) { 26 say "<$&> is balanced\nParen: (@res)"; 27 } 28 else { 29 say "does not match"; 30 }Al ejecutar obtenemos:
pl@nereida:~/Lperltesting$ ./perlrebalancedpar.pl foo(bar(baz)+baz(bop)) <foo(bar(baz)+baz(bop))> is balanced Paren: (foo(bar(baz)+baz(bop)) (bar(baz)+baz(bop)) bar(baz)+baz(bop))
Como se comenta en perlre es conveniente usar índices relativos si se quiere tener una expresión regular reciclable:
The following shows how using negative indexing can make it easier to
embed recursive patterns inside of a qr//
construct for later use:
1. my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; 2. if (/foo $parens \s+ + \s+ bar $parens/x) { 3. # do something here... 4. }Véase la sección 31.2.6 para comprender el uso de los operadores posesivos como
++
.
El siguiente programa presenta una heurística para determinar los bloques de un programa:
1 pl@nereida:~/Lperltesting$ cat blocks.pl 2 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 3 use v5.10; 4 use strict; 5 #use re 'debug'; 6 7 my $rb = qr{(?x) 8 ( 9 \{ # llave abrir 10 (?: 11 [^{}]++ # no llaves 12 | 13 [^{}]*+ # no llaves 14 (?1) # recursivo 15 [^{}]*+ # no llaves 16 )*+ 17 \} # llave cerrar 18 ) 19 }; 20 21 local $/ = undef; 22 my $input = <>; 23 my@blocks = $input =~ m{$rb}g; 24 my $i = 0; 25 say($i++.":\n$_\n===") for @blocks;
Veamos una ejecución. Le daremos como entrada el siguiente programa: Al ejecutar el programa con esta entrada obtenemos:
pl@nereida:~/Lperltesting$ cat -n blocks.c 1 main() { /* 1 */ 2 { /* 2 */ } 3 { /* 3 */ } 4 } 5 6 f(){ /* 4 */ 7 { /* 5 */ 8 { /* 6 */ } 9 } 10 { /* 7 */ 11 { /* 8 */ } 12 } 13 } 14 15 g(){ /* 9 */ 16 } 17 18 h() { 19 {{{}}} 20 } 21 /* end h */ |
pl@nereida:~/Lperltesting$ perl5.10.1 blocks.pl blocks.c 0: { /* 1 */ { /* 2 */ } { /* 3 */ } } === 1: { /* 4 */ { /* 5 */ { /* 6 */ } } { /* 7 */ { /* 8 */ } } } === 2: { /* 9 */ } === 3: { {{{}}} } === |
La posibilidad de combinar en las expresiones regulares Perl 5.10
la recursividad con los constructos (?<name>...)
y ?&name)
así como las secciones (?(DEFINE) ...)
permiten la escritura de expresiones regulares que reconocen lenguajes recursivos.
El siguiente
ejemplo muestra un reconocedor de un subconjunto del lenguaje
LATEX (véase la entrada LaTeX en la wikipedia):
1 pl@nereida:~/Lperltesting$ cat latex5_10.pl 2 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 3 use strict; 4 use v5.10; 5 6 my $regexp = qr{ 7 \A(?&File)\z 8 9 (?(DEFINE) 10 (?<File> (?&Element)*+\s* 11 ) 12 13 (?<Element> \s* (?&Command) 14 | \s* (?&Literal) 15 ) 16 17 (?<Command> \\ \s* (?<L>(?&Literal)) \s* (?<Op>(?&Options)?) \s* (?<A>(?&Args)) 18 (?{ 19 say "command: <$+{L}> options: <$+{Op}> args: <$+{A}>" 20 }) 21 ) 22 23 (?<Options> \[ \s* (?:(?&Option) (?:\s*,\s* (?&Option) )*)? \s* \] 24 ) 25 26 (?<Args> (?: \{ \s* (?&Element)* \s* \} )* 27 ) 28 29 (?<Option> \s* [^][$&%#_{}~^\s,]+ 30 ) 31 32 (?<Literal> \s* ([^][$&%#_{}~^\s]+) 33 ) 34 ) 35 }xms; 36 37 my $input = do{ local $/; <>}; 38 if ($input =~ $regexp) { 39 say "$@: matches:\n$&"; 40 } 41 else { 42 say "does not match"; 43 }
Añadimos una acción semántica al final de la aceptación de un <Command>
.
(?<Command> \\ \s* (?<L>(?&Literal)) \s* (?<Op>(?&Options)?) \s* (?<A>(?&Args)?) (?{ say "command: <$+{L}> options: <$+{Op}> args: <$+{A}>" }) )Esta acción es ejecutada pero no afecta al proceso de análisis. (véase la sección 31.2.8 para mas información sobre las acciones semánticas en medio de una regexp). La acción se limita a mostrar que ha casado con cada una de las tres componentes: el comando, las opciones y los argumentos.
Los paréntesis adicionales, como en (?<L>(?&Literal))
son necesarios
para guardar lo que casó.
Cuando se ejecuta produce la siguiente salida31.6:
pl@nereida:~/Lperltesting$ cat prueba.tex \documentclass[a4paper,11pt]{article} \usepackage{latexsym} \author{D. Conway} \title{Parsing \LaTeX{}} \begin{document} \maketitle \tableofcontents \section{Description} ...is easy \footnote{But not\\ \emph{necessarily} simple}. In fact it's easy peasy to do. \end{document} pl@nereida:~/Lperltesting$ ./latex5_10.pl prueba.tex command: <documentclass> options: <[a4paper,11pt]> args: <{article}> command: <usepackage> options: <> args: <{latexsym}> command: <author> options: <> args: <{D. Conway}> command: <LaTeX> options: <> args: <{}> command: <title> options: <> args: <{Parsing \LaTeX{}}> command: <begin> options: <> args: <{document}> command: <maketitle> options: <> args: <> command: <tableofcontents> options: <> args: <> command: <section> options: <> args: <{Description}> command: <emph> options: <> args: <{necessarily}> command: <footnote> options: <> args: <{But not\\ \emph{necessarily} simple}> command: <end> options: <> args: <{document}> : matches: \documentclass[a4paper,11pt]{article} \usepackage{latexsym} \author{D. Conway} \title{Parsing \LaTeX{}} \begin{document} \maketitle \tableofcontents \section{Description} ...is easy \footnote{But not\\ \emph{necessarily} simple}. In fact it's easy peasy to do. \end{document}La siguiente entrada
prueba3.tex
no pertenece
al lenguaje definido por el patrón regular,
debido a la presencia de la cadena $In$
en la última línea:
pl@nereida:~/Lperltesting$ cat prueba3.tex \documentclass[a4paper,11pt]{article} \usepackage{latexsym} \author{D. Conway} \title{Parsing \LaTeX{}} \begin{document} \maketitle \tableofcontents \section{Description} \comm{a}{b} ...is easy \footnote{But not\\ \emph{necessarily} simple}. $In$ fact it's easy peasy to do. \end{document} pl@nereida:~/Lperltesting$ ./latex5_10.pl prueba3.tex command: <documentclass> options: <[a4paper,11pt]> args: <{article}> command: <usepackage> options: <> args: <{latexsym}> command: <author> options: <> args: <{D. Conway}> command: <LaTeX> options: <> args: <{}> command: <title> options: <> args: <{Parsing \LaTeX{}}> command: <begin> options: <> args: <{document}> command: <maketitle> options: <> args: <> command: <tableofcontents> options: <> args: <> command: <section> options: <> args: <{Description}> command: <comm> options: <> args: <{a}{b}> command: <emph> options: <> args: <{necessarily}> command: <footnote> options: <> args: <{But not\\ \emph{necessarily} simple}> does not match
10 (?<File> (?&Element)*+\s* 11 )¿Que ocurrre si se quita el posesivo y se vuelve a ejecutar
$ ./latex5_10.pl prueba3.tex
?
Véase el nodo Complex regex for maths formulas en perlmonks para la formulación del problema. Un monje pregunta:
Hiya monks,
Im having trouble getting my head around a regular expression to match sequences. I need to catch all exceptions where a mathematical expression is illegal...
There must be either a letter or a digit either side of an operator
parenthesis must open and close next to letters or digits, not next to operators, and do not have to exist
variables must not be more than one letter
Nothing other than a-z,A-Z,0-9,+,-,*,/,(,)
can be used
Can anyone offer a hand on how best to tackle this problem?
many thanks
La solución parte de que una expresión es o bien un término o bien un término seguido de una operador y un término, esto es:
termino
termino op termino op termino
...
termino (op termino)*
.
Un término es un número o un identificador o una expresión entre paréntesis, esto es:
numero
identificador
( expresión )
La siguiente expresión regular recursiva sigue esta idea:
pl@nereida:~/Lperltesting$ cat -n simpleexpressionsna.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 2 use v5.10; 3 use strict; 4 use warnings; 5 6 local our ($skip, $term, $expr); 7 $skip = qr/\s*/; 8 $expr = qr{ (?<EXPR> 9 (?<TERM> # An expression is a TERM ... 10 $skip (?<ID>[a-zA-Z]+) 11 | $skip (?<INT>[1-9]\d*) 12 | $skip \( 13 $skip (?&EXPR) 14 $skip \) 15 ) (?: $skip # possibly followed by a sequence of ... 16 (?<OP>[-+*/]) 17 (?&TERM) # ... operand TERM pairs 18 )* 19 ) 20 }x; 21 my $re = qr/^ $expr $skip \z/x; 22 sub is_valid { shift =~ /$re/o } 23 24 my @test = ( '(a + 3)', '(3 * 4)+(b + x)', '(5 - a)*z', 25 '((5 - a))*((((z)))+2)', '3 + 2', '!3 + 2', '3 + 2!', 26 '3 a', '3 3', '3 * * 3', 27 '2 - 3 * 4', '2 - 3 + 4', 28 ); 29 foreach (@test) { 30 say("$_:"); 31 say(is_valid($_) ? "\n<$_> is valid" : "\n<$_> is not valid") 32 }Podemos usar acciones semánticas empotradas para ver la forma en la que trabaja la expresión regular (véase la sección 31.2.8):
pl@nereida:~/Lperltesting$ cat -n simpleexpressions.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 2 use v5.10; 3 use strict; 4 use warnings; 5 6 use re 'eval'; # to allow Eval-group at runtime 7 8 local our ($skip, $term, $expr); 9 $skip = qr/\s*/; 10 $expr = qr{ (?<EXPR> 11 (?<TERM> # An expression is a TERM ... 12 $skip (?<ID>[a-zA-Z]+) (?{ print "[ID $+{ID}] " }) 13 | $skip (?<INT>[1-9]\d*) (?{ print "[INT $+{INT}] " }) 14 | $skip \( (?{ print "[(] " }) 15 $skip (?&EXPR) 16 $skip \) (?{ print "[)] " }) 17 ) (?: $skip # possibly followed by a sequence of ... 18 (?<OP>[-+*/]) (?{ print "[OP $+{OP}] " }) 19 (?&TERM) # ... operand TERM pairs 20 )* 21 ) 22 }x; 23 my $re = qr/^ $expr $skip \z/x; 24 sub is_valid { shift =~ /$re/o } 25 26 my @test = ( '(a + 3)', '(3 * 4)+(b + x)', '(5 - a)*z', 27 '((5 - a))*((((z)))+2)', '3 + 2', '!3 + 2', '3 + 2!', 28 '3 a', '3 3', '3 * * 3', 29 '2 - 3 * 4', '2 - 3 + 4', 30 ); 31 foreach (@test) { 32 say("$_:"); 33 say(is_valid($_) ? "\n<$_> is valid" : "\n<$_> is not valid") 34 }
Ejecución:
pl@nereida:~/Lperltesting$ ./simpleexpressions.pl (a + 3): [(] [ID a] [OP +] [INT 3] [)] <(a + 3)> is valid (3 * 4)+(b + x): [(] [INT 3] [OP *] [INT 4] [)] [OP +] [(] [ID b] [OP +] [ID x] [)] <(3 * 4)+(b + x)> is valid (5 - a)*z: [(] [INT 5] [OP -] [ID a] [)] [OP *] [ID z] <(5 - a)*z> is valid ((5 - a))*((((z)))+2): [(] [(] [INT 5] [OP -] [ID a] [)] [)] [OP *] [(] [(] [(] [(] [ID z] [)] [)] [)] [OP +] [INT 2] [)] <((5 - a))*((((z)))+2)> is valid 3 + 2: [INT 3] [OP +] [INT 2] <3 + 2> is valid !3 + 2: <!3 + 2> is not valid 3 + 2!: [INT 3] [OP +] [INT 2] <3 + 2!> is not valid 3 a: [INT 3] <3 a> is not valid 3 3: [INT 3] <3 3> is not valid 3 * * 3: [INT 3] [OP *] <3 * * 3> is not valid 2 - 3 * 4: [INT 2] [OP -] [INT 3] [OP *] [INT 4] <2 - 3 * 4> is valid 2 - 3 + 4: [INT 2] [OP -] [INT 3] [OP +] [INT 4] <2 - 3 + 4> is valid
Por defecto, cuando un subpatrón con un cuantificador impide que el patrón global tenga éxito, se produce un backtrack. Hay ocasiones en las que esta conducta da lugar a ineficiencia.
Perl 5.10 provee los cuantificadores posesivos: Un cuantificador posesivo actúa como un cuantificador greedy pero no se produce backtracking.
*+ |
Casar 0 o mas veces y no retroceder |
++ |
Casar 1 o mas veces y no retroceder |
?+ |
Casar 0 o 1 veces y no retroceder |
{n}+ |
Casar exactamente n veces y no retroceder (redundante) |
{n,}+ |
Casar al menos n veces y no retroceder |
{n,m}+ |
Casar al menos n veces y no mas de m veces y no retroceder |
'aaaa'
no casa con /(a++a)/
porque no
hay retroceso después de leer las 4 aes:
pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0 main::(-e:1): 0 DB<1> x 'aaaa' =~ /(a+a)/ 0 'aaaa' DB<2> x 'aaaa' =~ /(a++a)/ empty array
Los operadores posesivos sirven para poder escribir expresiones regulares mas eficientes en aquellos casos en los que sabemos que el retroceso no conducirá a nuevas soluciones, como es el caso del reconocimiento de las cadenas delimitadas por comillas dobles:
pl@nereida:~/Lperltesting$ cat -n ./quotedstrings.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 2 use v5.10; 3 4 my $regexp = qr/ 5 " # double quote 6 (?: # no memory 7 [^"\\]++ # no " or escape: Don't backtrack 8 | \\. # escaped character 9 )*+ 10 " # end double quote 11 /x; 12 13 my $input = <>; 14 chomp($input); 15 if ($input =~ $regexp) { 16 say "$& is a string"; 17 } 18 else { 19 say "does not match"; 20 }
Los paréntesis posesivos (?> ...)
dan lugar a un reconocedor
que rechaza las demandas de retroceso.
De hecho, los operadores posesivos pueden ser reescritos
en términos de los paréntesis posesivos:
La notación X++
es equivalente a (?>X+)
.
El siguiente ejemplo reconoce el lenguaje de los paréntesis balanceados:
pl@nereida:~/Lperltesting$ cat -n ./balancedparenthesis.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 2 use v5.10; 3 4 my $regexp = 5 qr/^( 6 [^()]*+ # no hay paréntesis, no backtrack 7 \( 8 (?> # subgrupo posesivo 9 [^()]++ # no hay paréntesis, + posesivo, no backtrack 10 |(?1) # o es un paréntesis equilibrado 11 )* 12 \) 13 [^()]*+ # no hay paréntesis 14 )$/x; 15 16 my $input = <>; 17 chomp($input); 18 if ($input =~ $regexp) { 19 say "$& is a balanced parenthesis"; 20 } 21 else { 22 say "does not match"; 23 }Cuando se ejecuta produce una salida como:
pl@nereida:~/Lperltesting$ ./balancedparenthesis.pl (2*(3+4)-5)*2 (2*(3+4)-5)*2 is a balanced parenthesis pl@nereida:~/Lperltesting$ ./balancedparenthesis.pl (2*(3+4)-5))*2 does not match pl@nereida:~/Lperltesting$ ./balancedparenthesis.pl 2*(3+4 does not match pl@nereida:~/Lperltesting$ ./balancedparenthesis.pl 4*(2*(3+4)-5)*2 4*(2*(3+4)-5)*2 is a balanced parenthesis
El uso de los operadores posesivos nos permite reescribir la solución al problema de encontrar los bloques maximales de un código dada en la sección 31.2.5 de la siguiente manera:
1 pl@nereida:~/Lperltesting$ cat blocksopti.pl 2 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 3 use v5.10; 4 use strict; 5 #use re 'debug'; 6 7 my $rb = qr{(?x) 8 ( 9 \{ # llave abrir 10 (?: 11 [^{}]++ # no llaves 12 | 13 (?1) # recursivo 14 [^{}]*+ # no llaves 15 )*+ 16 \} # llave cerrar 17 ) 18 }; 19 20 local $/ = undef; 21 my $input = <>; 22 my@blocks = $input =~ m{$rb}g; 23 my $i = 0; 24 say($i++.":\n$_\n===") for @blocks;
A veces conviene tener una forma de acceso uniforme a la lista proporcionada por los paréntesis con memoria. Por ejemplo, la siguiente expresión regular reconoce el lenguaje de las horas en notaciones civil y militar:
pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0 main::(-e:1): 0 DB<1> '23:12' =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/; print "1->$1 2->$2\n" 1->23 2->12 DB<2> '2312' =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/; print "3->$3 4->$4\n" 3->23 4->12Parece inconveniente tener los resultados en variables distintas. El constructo
(?| ...)
hace que los paréntesis se enumeren
relativos a las alternativas:
DB<3> '2312' =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))/; print "1->$1 2->$2\n" 1->23 2->12 DB<4> '23:12' =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))/; print "1->$1 2->$2\n" 1->23 2->12Ahora en ambos casos
$1
y $2
contienen las horas y minutos.
Es posible introducir código Perl dentro de una expresión regular.
Para ello se usa la notación (?{code})
.
El siguiente texto esta tomado de la sección 'A-bit-of-magic:-executing-Perl-code-in-a-regular-expression' en perlretut:
Normally, regexps are a part of Perl expressions. Code evaluation expressions turn that around by allowing arbitrary Perl code to be a part of a regexp. A code evaluation expression is denoted (?code), with code a string of Perl statements.
Be warned that this feature is considered experimental, and may be changed without notice.
Code expressions are zero-width assertions, and the value they return depends on their environment.
There are two possibilities: either the
code expression is used as a conditional in a conditional expression
(?(condition)...)
, or it is not.
If the code expression is a conditional, the code is evaluated and the result (i.e., the result of the last statement) is used to determine truth or falsehood.
If the code expression is not used as a conditional, the assertion always evaluates true and the result is put into the special variable$^R
. The variable$^R
can then be used in code expressions later in the regexp
Las expresiones de código son zero-width assertions: no consumen entrada.
El resultado de la ejecución se salva en la variable especial $^R
.
Veamos un ejemplo:
pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0 main::(-e:1): 0 DB<1> $x = "abcdef" DB<2> $x =~ /abc(?{ "Hi mom\n" })def(?{ print $^R })$/ Hi mom DB<3> $x =~ /abc(?{ print "Hi mom\n"; 4 })def(?{ print "$^R\n" })/ Hi mom 4 DB<4> $x =~ /abc(?{ print "Hi mom\n"; 4 })ddd(?{ print "$^R\n" })/ # does not match DB<5>En el último ejemplo (línea
DB<4>
) ninguno de los print
se ejecuta dado que no hay matching.
Tomado de la sección 'Extended-Patterns' en perlre:
This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated. Currently, the rules to determine where the code ends are somewhat convoluted.
Tomado de la sección 'Extended-Patterns' en perlre:
... can be used with the special variable $^N
to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses. For example:
pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> $x = "The brown fox jumps over the lazy dog" DB<2> x $x =~ /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i 0 'brown' 1 'fox' DB<3> p "color=$color animal=$animal\n" color=brown animal=fox DB<4> $x =~ /the (\S+)(?{ print (substr($_,0,pos($_)))."\n" }) (\S+)/i The brown
Inside the(?{...})
block,$_
refers to the string the regular expression is matching against. You can also usepos()
to know what is the current position of matching within this string.
Si se usa un cuantificador sobre un código empotrado, actúa como un bucle:
pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0 main::(-e:1): 0 DB<1> $x = "aaaa" DB<2> $x =~ /(a(?{ $c++ }))*/ DB<3> p $c 4 DB<4> $y = "abcd" DB<5> $y =~ /(?:(.)(?{ print "-$1-\n" }))*/ -a- -b- -c- -d-
Tomado (y modificado el ejemplo) de la sección 'Extended-Patterns' en perlre:
...The code is properly scoped in the following sense: If the assertion is backtracked (compare la sección 'Backtracking' en perlre), all changes introduced after localization are undone, so that
pl@nereida:~/Lperltesting$ cat embededcodescope.pl use strict; our ($cnt, $res); sub echo { local our $pre = substr($_,0,pos($_)); local our $post = (pos($_) < length)? (substr($_,1+pos($_))) : ''; print("$pre(count = $cnt)$post\n"); } $_ = 'a' x 8; m< (?{ $cnt = 0 }) # Initialize $cnt. ( a (?{ local $cnt = $cnt + 1; # Update $cnt, backtracking-safe. echo(); }) )* aaaa (?{ $res = $cnt }) # On success copy to non-localized # location. >x; print "FINAL RESULT: cnt = $cnt res =$res\n";
will set$res = 4
. Note that after the match,$cnt
returns to the globally introduced value, because the scopes that restrict local operators are unwound.
pl@nereida:~/Lperltesting$ perl5.8.8 -w embededcodescope.pl a(count = 1)aaaaaa aa(count = 2)aaaaa aaa(count = 3)aaaa aaaa(count = 4)aaa aaaaa(count = 5)aa aaaaaa(count = 6)a aaaaaaa(count = 7) aaaaaaaa(count = 8) FINAL RESULT: cnt = 0 res =4
Due to an unfortunate implementation issue, the Perl code contained in these blocks is treated as a compile time closure that can have seemingly bizarre consequences when used with lexically scoped variables inside of subroutines or loops. There are various workarounds for this, including simply using global variables instead. If you are using this construct and strange results occur then check for the use of lexically scoped variables.
For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the periloususe re 'eval'
pragma has been used (see re), or the variables contain results ofqr//
operator (see"qr/STRING/imosx"
in perlop).
This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example:
1. $re = <>; 2. chomp $re; 3. $string =~ /$re/;
Before Perl knew how to execute interpolated code within a pattern, this
operation was completely safe from a security point of view, although
it could raise an exception from an illegal pattern. If you turn on the
use re 'eval'
, though, it is no longer secure, so you should only do
so if you are also using taint
checking. Better yet, use the carefully
constrained evaluation within a Safe
compartment. See perlsec for details
about both these mechanisms. (Véase la sección 'Taint-mode' en perlsec)
Because Perl's regex engine is currently not re-entrant, interpolated code may not invoke the regex engine either directly withm//
ors///
, or indirectly with functions such as split.
Las acciones empotradas pueden utilizarse como mecanismo de depuración y de descubrimiento del comportamiento de nuestras expresiones regulares.
En el siguiente programa se produce una colisión
entre los nombres <i>
y <j>
de los patrones
que ocurren en el patrón <expr>
y en el patrón principal:
pl@nereida:~/Lperltesting$ cat -n clashofnamedofssets.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 2 use v5.10; 3 4 my $input; 5 6 local $" = ", "; 7 8 my $parser = qr{ 9 ^ (?<i> (?&expr)) (?<j> (?&expr)) \z 10 (?{ 11 say "main $+ hash:"; 12 say " ($_ => $+{$_}) " for sort keys %+; 13 }) 14 15 (?(DEFINE) 16 (?<expr> 17 (?<i> . ) 18 (?<j> . ) 19 (?{ 20 say "expr \$+ hash:"; 21 say " ($_ => $+{$_}) " for sort keys %+; 22 }) 23 ) 24 ) 25 }x; 26 27 $input = <>; 28 chomp($input); 29 if ($input =~ $parser) { 30 say "matches: ($&)"; 31 }La colisión hace que la salida sea esta:
pl@nereida:~/Lperltesting$ ./clashofnamedofssets.pl abab expr $+ hash: (i => a) (j => b) expr $+ hash: (i => ab) (j => b) main $+ hash: (i => ab) (j => ab) matches: (abab)Si se evitan las colisiones, se evita la pérdida de información:
pl@nereida:~/Lperltesting$ cat -n namedoffsets.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 2 use v5.10; 3 4 my $input; 5 6 local $" = ", "; 7 8 my $parser = qr{ 9 ^ (?<i> (?&expr)) (?<j> (?&expr)) \z 10 (?{ 11 say "main $+ hash:"; 12 say " ($_ => $+{$_}) " for sort keys %+; 13 }) 14 15 (?(DEFINE) 16 (?<expr> 17 (?<i_e> . ) 18 (?<j_e> . ) 19 (?{ 20 say "expr \$+ hash:"; 21 say " ($_ => $+{$_}) " for sort keys %+; 22 }) 23 ) 24 ) 25 }x; 26 27 $input = <>; 28 chomp($input); 29 if ($input =~ $parser) { 30 say "matches: ($&)"; 31 }
que al ejecutarse produce:
pl@nereida:~/Lperltesting$ ./namedoffsets.pl abab expr $+ hash: (i_e => a) (j_e => b) expr $+ hash: (i => ab) (i_e => a) (j_e => b) main $+ hash: (i => ab) (j => ab) matches: (abab)
Los paréntesis especiales:
(??{ Código Perl })hacen que el
Código Perl
sea evaluado durante el tiempo de matching.
El resultado de la evaluación se trata como una expresión regular. El match continuará
intentando casar con la expresión regular retornada.
Los paréntesis en la expresión regular retornada no cuentan en el patrón exterior. Véase el siguiente ejemplo:
pl@nereida:~/Lperltesting$ cat -n postponedregexp.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 2 use v5.10; 3 use strict; 4 5 my $r = qr{(?x) # ignore spaces 6 ([ab]) # save 'a' or 'b' in $1 7 (??{ "($^N)"x3 }) # 3 more of the same as in $1 8 }; 9 say "<$&> lastpar = $#-" if 'bbbb' =~ $r; 10 say "<$&> lastpar = $#-" if 'aaaa' =~ $r; 11 say "<abab> didn't match" unless 'abab' =~ $r; 12 say "<aaab> didn't match" unless 'aaab' =~ $r;
Como se ve, hemos accedido desde el código interior
al último paréntesis usando $^N
.
Sigue una ejecución:
pl@nereida:~/Lperltesting$ ./postponedregexp.pl <bbbb> lastpar = 1 <aaaa> lastpar = 1 <abab> didn't match <aaab> didn't match
Consideremos el problema de escribir una expresión regular que reconoce secuencias no vacías de dígitos tales que la longitud de la secuencia restante viene determinada por el primer dígito. Esta es una solución:
pl@nereida:~/Lperltesting$ cat -n intints.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 2 use v5.10; 3 use strict; 4 5 my $r = qr{(?x) # ignore spaces 6 (\d) # a digit 7 ( (??{ 8 "\\d{$^N}" # as many as the former 9 }) # digit says 10 ) 11 }; 12 say "<$&> <$1> <$2>" if '3428' =~ $r; 13 say "<$&> <$1> <$2>" if '228' =~ $r; 14 say "<$&> <$1> <$2>" if '14' =~ $r; 15 say "24 does not match" unless '24' =~ $r; 16 say "4324 does not match" unless '4324' =~ $r;
Cuando se ejecuta se obtiene:
pl@nereida:~/Lperltesting$ ./intints.pl <3428> <3> <428> <228> <2> <28> <14> <1> <4> 24 does not match 4324 does not match
Otro ejemplo: queremos escribir una expresión regular que reconozca secuencias
de $n
dígitos en las que no todos los dígitos se repiten. Donde quizá $n
es capturado
de un paréntesis anterior en la expresión regular. Para simplificar la ilustración
de la técnica supongamos que $n = 7
:
pl@nereida:~$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> x join '', map { "(?!".$_."{7})" } 0..9 0 '(?!0{7})(?!1{7})(?!2{7})(?!3{7})(?!4{7})(?!5{7})(?!6{7})(?!7{7})(?!8{7})(?!9{7})' DB<2> x '7777777' =~ /(??{join '', map { "(?!".$_."{7})" } 0..9})(\d{7})/ empty array DB<3> x '7777778' =~ /(??{join '', map { "(?!".$_."{7})" } 0..9})(\d{7})/ 0 7777778 DB<4> x '4444444' =~ /(??{join '', map { "(?!".$_."{7})" } 0..9})(\d{7})/ empty array DB<5> x '4422444' =~ /(??{join '', map { "(?!".$_."{7})" } 0..9})(\d{7})/ 0 4422444
Se trata en este ejercicio de generalizar la expresión regular introducida en la sección 31.2.5 para reconocer los palabra-palíndromos.
Se trata de encontrar una regexp que acepte que la lectura derecha e inversa de una frase en Español pueda diferir en la acentuación (como es el caso del clásico palíndromo dábale arroz a la zorra el abad). Una solución trivial es preprocesar la cadena eliminando los acentos. Supondremos sin embargo que se quiere trabajar sobre la cadena original. He aquí una solucion:
1 pl@nereida:~/Lperltesting$ cat actionspanishpalin.pl 2 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w -CIOEioA 3 use v5.10; 4 use strict; 5 use utf8; 6 use re 'eval'; 7 use Switch; 8 9 sub f { 10 my $char = shift; 11 12 switch($char) { 13 case [ qw{a á} ] { return '[aá]' } 14 case [ qw{e é} ] { return '[eé]' } 15 case [ qw{i í} ] { return '[ií]' } 16 case [ qw{o ó} ] { return '[oó]' } 17 case [ qw{u ú} ] { return '[uú]' } 18 else { return $char }; 19 } 20 } 21 22 my $regexp = qr/^(\W* (?: 23 (\w) (?-2)(??{ f($^N) }) 24 | \w? 25 ) \W* 26 ) 27 $ 28 /ix; 29 30 my $input = <>; # Try: 'dábale arroz a la zorra el abad'; 31 chomp($input); 32 if ($input =~ $regexp) { 33 say "$input is a palindrome"; 34 } 35 else { 36 say "$input does not match"; 37 }
Sigue un ejemplo de ejecución:
pl@nereida:~/Lperltesting$ ./actionspanishpalin.pl dábale arroz a la zorra el abad dábale arroz a la zorra el abad is a palindrome pl@nereida:~/Lperltesting$ ./actionspanishpalin.pl éoíúaáuioé éoíúaáuioé is a palindrome pl@nereida:~/Lperltesting$ ./actionspanishpalin.pl dáed dáed does not match
Véase el nodo Complex regex for maths formulas para la formulación del problema:
Hiya monks,
Im having trouble getting my head around a regular expression to match sequences. I need to catch all exceptions where a mathematical expression is illegal...
There must be either a letter or a digit either side of an operator
parenthesis must open and close next to letters or digits, not next to operators, and do not have to exist
variables must not be more than one letter
Nothing other than a-z,A-Z,0-9,+,-,*,/,(,)
can be used
Can anyone offer a hand on how best to tackle this problem?
many thanks
La respuesta dada por ikegami
usa (?{{ ... })
para conseguir una conducta recursiva en versiones de perl anteriores a la 5.10:
pl@nereida:~/Lperltesting$ cat -n complexformula.pl 1 #!/usr/bin/perl 2 use strict; 3 use warnings; 4 5 sub is_valid_expr { 6 use re 'eval'; # to allow Eval-group at runtime 7 8 local our ($skip, $term, $expr); 9 $skip = qr! \s* !x; 10 $term = qr! $skip [a-zA-Z]+ # A term is an identifier 11 | $skip [1-9][0-9]* # or a number 12 | $skip \( (??{ $expr }) $skip # or an expression 13 \) 14 !x; 15 $expr = qr! $term # A expr is a term 16 (?: $skip [-+*/] $term )* # or a term + a term ... 17 !x; 18 19 return $_[0] =~ / ^ $expr $skip \z /x; 20 } 21 22 print(is_valid_expr($_) ? "$_ is valid\n" : "$_ is not valid\n") foreach ( 23 '(a + 3)', 24 '(3 * 4)+(b + x)', 25 '(5 - a)*z', 26 '3 + 2', 27 28 '!3 + 2', 29 '3 + 2!', 30 31 '3 a', 32 '3 3', 33 '3 * * 3', 34 35 '2 - 3 * 4', 36 '2 - 3 + 4', 37 );
Sigue el resultado de la ejecución:
pl@nereida:~/Lperltesting$ perl complexformula.pl (a + 3) is valid (3 * 4)+(b + x) is valid (5 - a)*z is valid 3 + 2 is valid !3 + 2 is not valid 3 + 2! is not valid 3 a is not valid 3 3 is not valid 3 * * 3 is not valid 2 - 3 * 4 is valid 2 - 3 + 4 is valid
Estos son algunos puntos a tener en cuenta cuando se usan patrones postpuestos.
Véase la entrada (??{ code })
en la sección 'Extended-Patterns' en perlre:
WARNING: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine.
This is a postponed regular subexpression. The code is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered as a regular expression and matched as if it were inserted instead of this construct.
The code is not interpolated.
As before, the rules to determine where the code ends are currently somewhat convoluted.
Because perl's regex engine is not currently re-entrant, delayed code may not invoke the regex engine either directly withm//
ors///
), or indirectly with functions such as split.
Recursing deeper than 50 times without consuming any input string will result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build.
Citando a perlre:
A conditional expression is a form of if-then-else statement that allows one to choose which patterns are to be matched, based on some condition.
There are two types of conditional expression:(?(condition)yes-regexp)
and(?(condition)yes-regexp|no-regexp)
.
(?(condition)yes-regexp)
is like anif () {}
statement in Perl. If the condition is true, the yes-regexp will be matched. If the condition is false, the yes-regexp will be skipped and Perl will move onto the next regexp element.
The second form
is like an if () {} else {}
statement in Perl. If the condition is true,
the yes-regexp will be matched, otherwise the no-regexp will be matched.
The condition can have several forms.
\integer
matched earlier in the regexp. The same thing
can be done with a name associated with a capture buffer, written as
(<name>)
or ('name')
.
(?...)
, either a lookahead, a lookbehind, or a code assertion.
(R)
or is being
called from some capturing group, referenced either by number (R1
,
or by name (R&name)
.
Una expresión condicional puede adoptar diversas formas.
La mas simple es un entero en paréntesis.
Es cierta si la correspondiente referencia
\integer
casó (también se puede usar un nombre
si se trata de un paréntesis con nombre).
En la expresión regular /^(.)(..)?(?(2)a|b)/
si el segundo paréntesis
casa, la cadena debe ir seguida de una a
, si no casa deberá ir seguida de
una b
:
DB<1> x 'hola' =~ /^(.)(..)?(?(2)a|b)/ 0 'h' 1 'ol' DB<2> x 'ha' =~ /^(.)(..)?(?(2)a|b)/ empty array DB<3> x 'hb' =~ /^(.)(..)?(?(2)a|b)/ 0 'h' 1 undef
La siguiente búsqueda casa con patrones de la forma
$x$x
o $x$y$y$x
:
pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0 main::(-e:1): 0 DB<1> x 'aa' =~ m{^(\w+)(\w+)?(?(2)\2\1|\1)$} 0 'a' 1 undef DB<2> x 'abba' =~ m{^(\w+)(\w+)?(?(2)\2\1|\1)$} 0 'a' 1 'b' DB<3> x 'abbc' =~ m{^(\w+)(\w+)?(?(2)\2\1|\1)$} empty array DB<4> x 'juanpedropedrojuan' =~ m{^(\w+)(\w+)?(?(2)\2\1|\1)$} 0 'juan' 1 'pedro'
Una expresión condicional también puede ser un código:
DB<1> $a = 0; print "$&" if 'hola' =~ m{(?(?{$a})hola|adios)} # No hay matching DB<2> $a = 1; print "$&" if 'hola' =~ m{(?(?{$a})hola|adios)} hola
La siguiente expresión regular utiliza un condicional para forzar a que si una cadena comienza por un paréntesis abrir termina con un paréntesis cerrar. Si la cadena no comienza por paréntesis abrir no debe existir un paréntesis final de cierre:
pl@nereida:~/Lperltesting$ cat -n conditionalregexp.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 2 use v5.10; 3 use strict; 4 5 my $r = qr{(?x) # ignore spaces 6 ^ 7 ( \( )? # may be it comes an open par 8 [^()]+ # no parenthesis 9 (?(1) # did we sart with par? 10 \) # if yes then close par 11 ) 12 $ 13 }; 14 say "<$&>" if '(abcd)' =~ $r; 15 say "<$&>" if 'abc' =~ $r; 16 say "<(abc> does not match" unless '(abc' =~ $r; 17 say "<abc)> does not match" unless 'abc)' =~ $r;
Al ejecutar este programa se obtiene:
pl@nereida:~/Lperltesting$ ./conditionalregexp.pl <(abcd)> <abc> <(abc> does not match <abc)> does not match
El siguiente ejemplo muestra el uso de la condición (R)
, la cual comprueba
si la expresión ha sido evaluada dentro de una recursión:
pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> x 'bbaaaabb' =~ /(b(?(R)a+|(?0))b)/ 0 'bbaaaabb' DB<2> x 'bb' =~ /(b(?(R)a+|(?0))b)/ empty array DB<3> x 'bab' =~ /(b(?(R)a+|(?0))b)/ empty array DB<4> x 'bbabb' =~ /(b(?(R)a+|(?0))b)/ 0 'bbabb'La sub-expresión regular
(?(R)a+|(?0))
dice:
si esta siendo evaluada recursivamente admite a+
si no, evalúa la regexp completa recursivamente.
Se trata en este ejercicio de generalizar la expresión
regular introducida en la sección
31.2.5
para reconocer los palabra-palíndromos31.7. Se trata de encontrar una regexp que acepte
que la lectura derecha e inversa de una frase en Español
pueda diferir en la acentuación (como es el caso
del clásico palíndromo dábale arroz a la zorra
el abad). Una solución trivial es preprocesar
la cadena eliminando los acentos. Supondremos sin embargo
que se quiere trabajar sobre la cadena original.
He aquí una solucion parcial (por
consideraciones de legibilidad sólo se consideran las vocales a
y o
:
1 pl@nereida:~/Lperltesting$ cat spanishpalin.pl 2 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w -CIOEioA 3 use v5.10; 4 use strict; 5 use utf8; 6 7 my $regexp = qr/^(?<pal>\W* (?: 8 (?<L>(?<a>[áa])|(?<e>[ée])|\w) # letter 9 (?&pal) # nested palindrome 10 (?(<a>)[áa] # if is an "a" group 11 |(?:((?<e>)[ée] # if is an "e" group 12 |\g{L} # exact match 13 ) # end if [ée] 14 ) # end group 15 ) # end if [áa] 16 | \w? # non rec. case 17 ) \W* # punctuation symbols 18 ) 19 $ 20 /ix; 21 22 my $input = <>; # Try: 'dábale arroz a la zorra el abad'; 23 chomp($input); 24 if ($input =~ $regexp) { 25 say "$input is a palindrome"; 26 } 27 else { 28 say "$input does not match"; 29 }
Ejecución:
pl@nereida:~/Lperltesting$ ./spanishpalin.pl dábale arroz a la zorra el abad dábale arroz a la zorra el abad is a palindrome pl@nereida:~/Lperltesting$ ./spanishpalin.pl óuuo óuuo does not match pl@nereida:~/Lperltesting$ ./spanishpalin.pl éaáe éaáe is a palindrome
Hemos usado la opción -CIOEioA
para asegurarnos
que los ficheros de entrada/saldia y error y la línea de
comandos estan en modo UTF-8.
(Véase la sección )
Esto es lo que dice la documentación de perlrun al respecto:
The -C
flag controls some of the Perl Unicode features.
As of 5.8.1, the -C
can be followed either by a number or a list of option
letters. The letters, their numeric values, and effects are as follows;
listing the letters is equal to summing the numbers.
1 I 1 STDIN is assumed to be in UTF-8 2 O 2 STDOUT will be in UTF-8 3 E 4 STDERR will be in UTF-8 4 S 7 I + O + E 5 i 8 UTF-8 is the default PerlIO layer for input streams 6 o 16 UTF-8 is the default PerlIO layer for output streams 7 D 24 i + o 8 A 32 the @ARGV elements are expected to be strings encoded 9 in UTF-8 10 L 64 normally the "IOEioA" are unconditional, 11 the L makes them conditional on the locale environment 12 variables (the LC_ALL, LC_TYPE, and LANG, in the order 13 of decreasing precedence) -- if the variables indicate 14 UTF-8, then the selected "IOEioA" are in effect 15 a 256 Set ${^UTF8CACHE} to -1, to run the UTF-8 caching code in 16 debugging mode.
For example,-COE
and-C6
will both turn on UTF-8-ness on bothSTDOUT
andSTDERR
. Repeating letters is just redundant, not cumulative nor toggling.
The io options mean that any subsequentopen()
(or similar I/O operations) will have the:utf8
PerlIO layer implicitly applied to them, in other words,UTF-8
is expected from any input stream, andUTF-8
is produced to any output stream. This is just the default, with explicit layers inopen()
and withbinmode()
one can manipulate streams as usual.
-C
on its own (not followed by any number or option list), or the empty string""
for thePERL_UNICODE
environment variable, has the same effect as-CSDL
. In other words, the standard I/O handles and the defaultopen()
layer are UTF-8-fied but only if the locale environment variables indicate a UTF-8 locale. This behaviour follows the implicit (and problematic) UTF-8 behaviour of Perl 5.8.0.
You can use-C0
(or0
forPERL_UNICODE
) to explicitly disable all the above Unicode features.
El pragma use utf8
hace que se utilice una semántica de
carácteres (por ejemplo, la regexp /./
casará con un carácter unicode), el
pragma use bytes
cambia de semántica de caracteres
a semántica de bytes (la regexp .
casará con un byte).
lhp@nereida:~/Lperl/src/testing$ cat -n dot_utf8_2.pl 1 #!/usr/local/bin/perl -w 2 use strict; 3 use utf8; 4 use charnames qw{greek}; 5 6 binmode(STDOUT, ':utf8'); 7 8 my $x = 'αβγδεφ'; 9 10 my @w = $x =~ /(.)/g; 11 print "@w\n"; 12 13 { 14 use bytes; 15 my @v = map { ord } $x =~ /(.)/g; 16 print "@v\n"; 17 }Al ejcutar el programa obtenemos la salida:
pl@nereida:~/Lperltesting$ perl dot_utf8_2.pl
α β γ δ ε φ
206 177 206 178 206 179 206 180 206 181 207 134
Tomado de la sección 'Backtracking-control-verbs' en perlretut:
The control verb(*FAIL)
may be abbreviated as(*F)
. If this is inserted in a regexp it will cause to fail, just like at some mismatch between the pattern and the string. Processing of the regexp continues like after any "normal" failure, so that the next position in the string or another alternative will be tried. As failing to match doesn't preserve capture buffers or produce results, it may be necessary to use this in combination with embedded code.
pl@nereida:~/Lperltesting$ cat -n vowelcount.pl 1 #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1 -w 2 use strict; 3 4 my $input = shift() || <STDIN>; 5 my %count = (); 6 $input =~ /([aeiou])(?{ $count{$1}++; })(*FAIL)/i; 7 printf("'%s' => %3d\n", $_, $count{$_}) for (sort keys %count);Al ejecutarse con entrada
supercalifragilistico
produce la salida:
pl@nereida:~/Lperltesting$ ./vowelcount.pl supercalifragilistico 'a' => 2 'e' => 1 'i' => 4 'o' => 1 'u' => 1
$1
depués de ejecutado el matching
$input =~ /([aeiou])(?{ $count{$1}++; })(*FAIL)/i;
?Véase también:
Tomado de perlretut:
This pattern matches nothing and causes the end of successful matching at the point at which the(*ACCEPT)
pattern was encountered, regardless of whether there is actually more to match in the string. When inside of a nested pattern, such as recursion, or in a subpattern dynamically generated via(??{})
, only the innermost pattern is ended immediately.
If the(*ACCEPT)
is inside of capturing buffers then the buffers are marked as ended at the point at which the(*ACCEPT)
was encountered. For instance:
DB<1> x 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x 0 'AB' 1 'B' 2 undef DB<2> x 'ACDE' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x 0 'ACD' 1 'C' 2 'E'
This zero-width pattern prunes the backtracking tree at the current point when backtracked into on failure. Consider the patternA (*SKIP) B
, whereA
andB
are complex patterns. Until the(*SKIP)
verb is reached,A
may backtrack as necessary to match. Once it is reached, matching continues inB
, which may also backtrack as necessary; however, shouldB
not match, then no further backtracking will take place, and the pattern will fail outright at the current starting position.
It also signifies that whatever text that was matched leading up to the(*SKIP)
pattern being executed cannot be part of any match of this pattern. This effectively means that the regex engineskips
forward to this position on failure and tries to match again, (assuming that there is sufficient room to match).
The name of the(*SKIP:NAME)
pattern has special significance. If a(*MARK:NAME)
was encountered while matching, then it is that position which is used as the "skip point". If no(*MARK)
of that name was encountered, then the(*SKIP)
operator has no effect. When used without a name the "skip point" is where the match point was when executing the(*SKIP)
pattern.
Ejemplo:
pl@nereida:~/Lperltesting$ cat -n SKIP.pl 1 #!/soft/perl5lib/bin/perl5.10.1 -w 2 use strict; 3 use v5.10; 4 5 say "NO SKIP: /a+b?(*FAIL)/"; 6 our $count = 0; 7 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/; 8 say "Count=$count\n"; 9 10 say "WITH SKIP: a+b?(*SKIP)(*FAIL)/"; 11 $count = 0; 12 'aaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/; 13 say "WITH SKIP: Count=$count\n"; 14 15 say "WITH SKIP /a+(*SKIP)b?(*FAIL)/:"; 16 $count = 0; 17 'aaab' =~ /a+(*SKIP)b?(?{print "$&\n"; $count++})(*FAIL)/; 18 say "Count=$count\n"; 19 20 say "WITH SKIP /(*SKIP)a+b?(*FAIL): "; 21 $count = 0; 22 'aaab' =~ /(*SKIP)a+b?(?{print "$&\n"; $count++})(*FAIL)/; 23 say "Count=$count\n";
Ejecución:
pl@nereida:~/Lperltesting$ perl5.10.1 SKIP.pl NO SKIP: /a+b?(*FAIL)/ aaab aaa aa a aab aa a ab a Count=9 WITH SKIP: a+b?(*SKIP)(*FAIL)/ aaab WITH SKIP: Count=1 WITH SKIP /a+(*SKIP)b?(*FAIL)/: aaab aaa Count=2 WITH SKIP /(*SKIP)a+b?(*FAIL): aaab aaa aa a aab aa a ab a Count=9
Tomado de la sección 'Backtracking-control-verbs' en perlretut:
(*MARK:NAME) (*:NAME)
This zero-width pattern can be used to mark the point reached in a string when a certain part of the pattern has been successfully matched. This mark may be given a name. A later(*SKIP)
pattern will then skip forward to that point if backtracked into on failure. Any number of(*MARK)
patterns are allowed, and theNAME
portion is optional and may be duplicated.
In addition to interacting with the(*SKIP)
pattern,(*MARK:NAME)
can be used tolabel
a pattern branch, so that after matching, the program can determine which branches of the pattern were involved in the match.
When a match is successful, the$REGMARK
variable will be set to the name of the most recently executed(*MARK:NAME)
that was involved in the match.
This can be used to determine which branch of a pattern was matched without using a separate capture buffer for each branch, which in turn can result in a performance improvement.
When a match has failed, and unless another verb has been involved in failing the match and has provided its own name to use, the$REGERROR
variable will be set to the name of the most recently executed(*MARK:NAME)
.
pl@nereida:~/Lperltesting$ cat -n mark.pl 1 use v5.10; 2 use strict; 3 4 our $REGMARK; 5 6 $_ = shift; 7 say $REGMARK if /(?:x(*MARK:mx)|y(*MARK:my)|z(*MARK:mz))/; 8 say $REGMARK if /(?:x(*:xx)|y(*:yy)|z(*:zz))/;Cuando se ejecuta produce:
pl@nereida:~/Lperltesting$ perl5.10.1 mark.pl y my yy pl@nereida:~/Lperltesting$ perl5.10.1 mark.pl z mz zz
Se quiere poner un espacio en blanco después de la aparición de cada coma:
s/,/, /g;
pero se quiere que la sustitución no tenga lugar si la coma esta incrustada entre dos dígitos. Además se pide que si hay ya un espacio después de la coma, no se duplique. Sigue una solución que usa marcas:
pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0 main::(-e:1): 0 DB<1> $a = 'ab,cd, ef,12,34,efg,56,78,df, ef,' DB<2> x ($b = $a) =~ s/\d,\d(*:d)|,(?!\s)/($REGMARK eq 'd')? $& : ', '/ge 0 8 DB<3> p "<$b>" <ab, cd, ef, 12,34, efg, 56,78, df, ef, >
Casiano Rodríguez León