PCRE Regular expressions
You can use the PCRE @re_str macro in combination with CombinedParser's constructors.
Constructing Regular expressions
CombinedParsers.Regexp.@re_str - Macro
@re_str(x,flags)
Construct a ParserWithCaptures from PCRE regex syntax, such as re"^[a-z]*$", without interpolation and unescaping (except for the quotation mark ", which still has to be escaped). Plug-in replacement for the PCRE string macro @r_str.
The regex also accepts one or more flags, listed after the ending quote, to change its behaviour:
- i enables case-insensitive matching.
- m treats the ^ and $ tokens as matching the start and end of individual lines, as opposed to the whole string.
- s allows the . modifier to match newlines.
- x enables "comment mode": whitespace is ignored except when escaped with \, and # is treated as starting a comment.
- a disables UCP mode (enables ASCII mode). By default \B, \b, \D, \d, \S, \s, \W, \w, etc. match based on Unicode character properties. With this option, these sequences only match ASCII characters.
- xx enables "extended comment mode": whitespace in bracket character matchers is ignored.
julia> re"a|c"i
|🗄 Either
├─ [aA] ValueIn
└─ [cC] ValueIn
::Char
julia> re"a+c"
🗄 Sequence
├─ a+ |> Repeat
└─ c
::Tuple{Vector{Char}, Char}
See also Regcomb, parse_options.
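As a minimal sketch of how a flag changes matching, the following reuses the re"a|c" pattern from the example above with and without the i flag. It assumes CombinedParsers is installed, that using CombinedParsers.Regexp brings @re_str into scope, and that match returns nothing when there is no match, as Base's match does; the subject strings are made up for illustration.

```julia
using CombinedParsers
using CombinedParsers.Regexp   # assumed to bring @re_str into scope

# Without flags the alternation is case-sensitive.
strict = re"a|c"
# The i flag after the closing quote enables case-insensitive matching.
relaxed = re"a|c"i

m_lower = match(strict, "c")    # expected to match
m_upper = match(strict, "C")    # expected to be nothing, as with r"a|c"
m_flag  = match(relaxed, "C")   # expected to match because of the i flag
```

The same subject strings can be run through r"a|c" and r"a|c"i to check that both macros agree.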
CombinedParsers.Regexp - Module
A regular expression parser transforming a PCRE string into a CombinedParser equivalent to the regular expression.
CombinedParsers.Regexp.Regcomb - Function
Regcomb(x::AbstractString[, flags=""])
Syntax for flags as in @re_str.
Base.getindex - Method
Base.getindex(x::ParseMatch{<:Any,<:SequenceWithCaptures,<:Any},i::Union{Integer,Symbol})
Gets capture i as a SubString.
See the API of RegexMatch.
Base.getproperty - Method
Base.getproperty(m::ParseMatch{<:Any,<:SequenceWithCaptures,<:Any},key::Symbol)
Enables m.captures and m.match.
See the API of RegexMatch.
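The accessors described above can be sketched as follows. The pattern and subject string are invented for illustration, and the example assumes that using CombinedParsers.Regexp brings @re_str into scope; only the getindex and getproperty methods documented above are used.

```julia
using CombinedParsers
using CombinedParsers.Regexp   # assumed to bring @re_str into scope

# One named and one unnamed capture group (illustrative pattern).
p = re"(?<key>[a-z]+)=(\d+)"
m = match(p, "width=42")

whole = m.match      # the matched text, via getproperty
caps  = m.captures   # all captures, via getproperty
first_capture = m[1]      # capture by integer index, via getindex
named_capture = m[:key]   # capture by Symbol name, via getindex
```

Because the accessors follow the RegexMatch API, the same lines should work unchanged when p is replaced by the Base regex r"(?<key>[a-z]+)=(\d+)".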
CombinedParsers.regex_escape - Function
regex_escape(s::AbstractString)
Regular expression metacharacters are escaped, along with whitespace.
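A typical use of escaping is to match user-supplied text literally. The sketch below is an assumption about how regex_escape and Regcomb can be combined: it relies only on the signatures documented on this page, calls both by their fully qualified names in case they are not exported, and uses an invented subject string.

```julia
using CombinedParsers
using CombinedParsers.Regexp

literal = "1+1=2"   # contains the metacharacter +

# Escape metacharacters so that the text is matched literally.
escaped = CombinedParsers.regex_escape(literal)

# Build a parser from the escaped pattern string (default flags).
p = CombinedParsers.Regexp.Regcomb(escaped)

m = match(p, "claim: 1+1=2")   # expected to find the literal text
```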
Compatibility & Unit Tests
CombinedParsers.Regexp.character_class - Constant
julia> CombinedParsers.Regexp.character_class
🗄 Sequence |> map(#57)
├─ \[\:
├─ |🗄 Either
│  ├─ alpha => [\p{L}] ValueIn
│  ├─ lower => [\p{Ll}] ValueIn
│  ├─ upper => [\p{Lu}] ValueIn
│  ├─ word => [\p{L}\p{Nl}\p{Nd}\p{Pc}] ValueIn
│  ├─ digit => [\p{Nd}] ValueIn
│  ├─ xdigit => [[:xdigit:]] ValueIn
│  ├─ alnum => [\p{L}\p{N}] ValueIn
│  ├─ blank => [\t\p{Zs}] ValueIn
│  ├─ cntrl => [\p{Cc}] ValueIn
│  ├─ graph => [^\p{Z}\p{C}] ValueNotIn
│  ├─ print => [\p{C}] ValueIn
│  ├─ punct => [\p{P}] ValueIn
│  └─ space => [\r\v\n\f\t\p{Z}] ValueIn
└─ \:\]
::CombinedParsers.ValueMatcher
TODO:
By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:
- [:alnum:] becomes \p{Xan}
- [:alpha:] becomes \p{L}
- [:blank:] becomes \h
- [:digit:] becomes \p{Nd}
- [:lower:] becomes \p{Ll}
- [:space:] becomes \p{Xps}
- [:upper:] becomes \p{Lu}
- [:word:] becomes \p{Xwd}
Base.:== - Method
==(pcre_m::RegexMatch,pc_m::ParseMatch)
Equal if and only if the values of .match, .offset, .ncodeunits and .captures are equal.
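This comparison is what makes side-by-side compatibility checks against Base's Regex possible. A minimal sketch, assuming @re_str is in scope after using CombinedParsers.Regexp and using an invented pattern and subject string:

```julia
using CombinedParsers
using CombinedParsers.Regexp   # assumed to bring @re_str into scope

subject = "xxaby"

pcre_m = match(r"a(b)", subject)    # RegexMatch from Base
pc_m   = match(re"a(b)", subject)   # ParseMatch from CombinedParsers

# Uses the == method described above.
same = pcre_m == pc_m
```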
CombinedParsers.Regexp.@pcre_tests - Macro
@pcre_testset
Define @syntax pcre_test and @syntax pcre_tests for parsing unit test output of the PCRE library. The parser is used for testing CombinedParser and benchmarking against Regex.
CombinedParsers.Regexp
CombinedParsers._iterate - Method
_iterate(p::ParserWithCaptures, sequence::SequenceWithCaptures,a...)
Calls Base.empty!(sequence) before iteration. (Why?)
Parsing Options
PCRE options are supported.
CombinedParsers.Regexp.with_options - Function
with_options(flags::UInt32,x::AbstractString)
Return x if iszero(flags), otherwise a StringWithOptions with flags.
with_options(flags::UInt32,x::Char)
Return x if iszero(flags), otherwise a CharWithOptions with flags.
with_options(flags::AbstractString,x)
Return with_options(parse_options(flags),x), see parse_options.
with_options(set_flags::UInt32, unset_flags::UInt32,x)
Set options set_flags | ( x.flags & ~unset_flags ) if x isa WithOptions, set options set_flags otherwise.
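The flag arithmetic in the last method can be checked with plain bit operations. The sketch below uses only the option constants that ship with Julia in Base.PCRE; combine_flags is a hypothetical helper written for this illustration and is not part of CombinedParsers.

```julia
# Option constants from Julia's bundled PCRE bindings.
const CASELESS  = Base.PCRE.CASELESS    # 0x00000008
const MULTILINE = Base.PCRE.MULTILINE   # 0x00000400
const DOTALL    = Base.PCRE.DOTALL      # 0x00000020

# Hypothetical helper mirroring the documented rule
# set_flags | (x.flags & ~unset_flags) for a value that already carries flags.
combine_flags(set_flags::UInt32, unset_flags::UInt32, current::UInt32) =
    set_flags | (current & ~unset_flags)

current = CASELESS | MULTILINE                        # flags already attached
updated = combine_flags(DOTALL, MULTILINE, current)   # add s, drop m

@assert updated == (CASELESS | DOTALL)                # 0x00000028
```

Unsetting is applied before setting, so a flag that appears in both set_flags and unset_flags ends up set.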
CombinedParsers.Regexp.parse_options - Function
parse_options(options::AbstractString)
Return the PCRE option mask parsed from options.
Parser for flags in @re_str.
julia> CombinedParsers.Regexp.pcre_options_parser
🗄 Sequence[2]
├─ ^ AtStart
├─ 🗄* Sequence[1] |> Repeat |> map(splat_or)
│  ├─ |🗄 Either
│  │  ├─ dupnames => 0x00000040 |> with_name(:DUPNAMES)
│  │  ├─ xx => 0x01000000 |> with_name(:EXTENDED_MORE)
│  │  ├─ i => 0x00000008 |> with_name(:CASELESS)
│  │  ├─ m => 0x00000400 |> with_name(:MULTILINE)
│  │  ├─ n => 0x00002000 |> with_name(:NO_AUTO_CAPTURE)
│  │  ├─ U => 0x00040000 |> with_name(:UNGREEDY)
│  │  ├─ J => 0x00000040 |> with_name(:DUPNAMES)
│  │  ├─ s => 0x00000020 |> with_name(:DOTALL)
│  │  ├─ x => 0x00000080 |> with_name(:EXTENDED)
│  │  ├─ B => 0x00000000 |> with_name(:BINCODE)
│  │  └─ I => 0x00000000 |> with_name(:INFO)
│  └─ ,? |missing
└─ $ AtEnd
::UInt32
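As a rough usage sketch, a flag string can be turned into an option mask and inspected bit by bit. The example calls parse_options by its fully qualified name in case it is not exported, and assumes that the per-flag values shown in the parser above are OR-ed together (as the splat_or mapping suggests).

```julia
using CombinedParsers
using CombinedParsers.Regexp

# Parse the flag string "im" into a PCRE option mask.
mask = CombinedParsers.Regexp.parse_options("im")

# Check individual bits against the constants bundled with Julia.
has_caseless  = (mask & Base.PCRE.CASELESS)  != 0
has_multiline = (mask & Base.PCRE.MULTILINE) != 0
has_dotall    = (mask & Base.PCRE.DOTALL)    != 0
```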
CombinedParsers.Regexp.StringWithOptions - Type
A lazy element transformation type (e.g. for an AbstractString); getindex wraps elements in with_options(flags,...).
With parsing options.
TODO: make flags a transformation function?
CombinedParsers.Regexp.CharWithOptions - Type
A lazy element transformation type (e.g. for an AbstractString); getindex wraps elements in with_options(flags,...).
With parsing options.
TODO: make flags a transformation function?
CombinedParsers.Regexp.OnOptionsParser - Type
Parser wrapper sequence with if_options.
CombinedParsers.Regexp.on_options - Function
on_options(flags::Integer,parser)
Create a parser that matches if flags are set in the sequence and parser matches.
Used for PCRE parsing, e.g.
Either(
    on_options(Base.PCRE.MULTILINE,
               '^' => at_linestart),
    parser('^' => AtStart())
)
CombinedParsers.Regexp.ParserOptions - Type
A wrapper matching the inner parser on with_options(set_flags, unset_flags, sequence).
CombinedParsers.Regexp.FilterOptions - Type
Lazy wrapper for a sequence, masking elements in getindex with MatchingNever if any of flags are not set.
TODO: make flags a filter function? Resolve the confound of sequence and value, like StringWithOptions, CharWithOptions.
CombinedParsers.Regexp.MatchingNever - Type
Helper struct to mask sequence elements from matchers.
Regular Expression Types
CombinedParsers.Regexp.ParserWithCaptures - Type
Top-level parser supporting the regular expression features captures, backreferences and subroutines. Collects subroutines in field subroutines::Vector and indices of named capture groups in field names::Dict.
Implicitly called in match.
See also Backreference, Capture, Subroutine.
CombinedParsers.Regexp.SequenceWithCaptures - Type
SequenceWithCaptures encapsulates a sequence to be parsed, and parsed captures.
This struct allows captures to be kept as sequence-level state. For the next version, a match-level state passed as an _iterate argument is being considered.
See also ParserWithCaptures.
CombinedParsers.Regexp.Capture - Type
Capture a parser result, optionally with a name. The index field is initialized when calling ParserWithCaptures on the parser.
CombinedParsers.Regexp.Backreference - Type
Backreference(f::Function,index::Integer)
Backreference(f::Function,name::Union{Nothing,Symbol},index::Integer)
Backreference(f::Function,name::AbstractString)
Parser matching a previously captured sequence, optionally with a name. The index field is recursively set when calling ParserWithCaptures on the parser.
CombinedParsers.Regexp.Subroutine - Type
Parser matching a preceding capture, optionally with a name. The index field is recursively set when calling ParserWithCaptures on the parser.
CombinedParsers.Regexp.subroutine_index_reset - Method
https://www.pcre.org/original/doc/html/pcrepattern.html#SEC16
CombinedParsers.Regexp.index - Method
index(parser::Subroutine,sequence)
Index of a subroutine. "If you make a subroutine call to a non-unique named subpattern, the one that corresponds to the first occurrence of the name is used." (What about "In the absence of duplicate numbers (see the previous section) this is the one with the lowest number."?)
CombinedParsers.Regexp.Conditional - Type
Conditional parser; _iterate cycles conditionally on _iterate_condition through matches in fields yes and no, respectively.
CombinedParsers.Regexp.DupSubpatternNumbers - Type
Parser wrapper for ParserWithCaptures, setting resetindex=true in deepmapparser(::typeof(indexedcaptures),...).
julia> p = re"(?|(a)|(b))\1"
🗄 Sequence |> regular expression combinator with 1 capturing groups
├─ |🗄 Either |> DupSubpatternNumbers
│  ├─ (a) |> Capture 1
│  └─ (b) |> Capture 1
└─ \g{1} Backreference
::Tuple{Char, AbstractString}
julia> match(p, "aa")
ParseMatch("aa", 1="a")
julia> match(p, "bb")
ParseMatch("bb", 1="b")
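For comparison, the same branch reset pattern can be run through Julia's built-in Regex, which needs no extra package. This is only a reference sketch of the PCRE behaviour that the example above reproduces; the third subject string is added here to show that the backreference must repeat whichever branch was captured.

```julia
# Base Regex (PCRE2) also understands the branch reset group (?|...):
# both alternatives are numbered as capture group 1.
r = r"(?|(a)|(b))\1"

ma = match(r, "aa")
mb = match(r, "bb")
mc = match(r, "ab")   # no match: \1 must repeat the captured branch

@assert ma.match == "aa" && ma.captures[1] == "a"
@assert mb.match == "bb" && mb.captures[1] == "b"
@assert mc === nothing
```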
See also pcre doc