PCRE Regular expressions

You can use the PCRE string macro @re_str in combination with the CombinedParsers constructors.

Constructing Regular expressions

CombinedParsers.Regexp.@re_str — Macro
@re_str(x,flags)

Construct a ParserWithCaptures from PCRE regex syntax, such as re"^[a-z]*$", without interpolation and unescaping (except for the quotation mark ", which still has to be escaped). Drop-in replacement for the PCRE string macro @r_str.

The regex also accepts one or more flags, listed after the ending quote, to change its behaviour:

  • i enables case-insensitive matching
  • m treats the ^ and $ tokens as matching the start and end of individual lines, as opposed to the whole string.
  • s allows the . modifier to match newlines.
  • x enables "comment mode": whitespace is ignored except when escaped with \, and # is treated as starting a comment.
  • a disables UCP mode (enables ASCII mode). By default \B, \b, \D, \d, \S, \s, \W, \w, etc. match based on Unicode character properties. With this option, these sequences only match ASCII characters.
  • xx enables "extended comment mode": whitespace in bracket character matchers is ignored.
julia> re"a|c"i
|🗄 Either
├─ [aA] ValueIn
└─ [cC] ValueIn
::Char

julia> re"a+c"
🗄 Sequence
├─ a+  |> Repeat
└─ c
::Tuple{Vector{Char}, Char}

See also Regcomb, parse_options.
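
The flag semantics listed above can be tried out with Base's `r"..."` macro, which `re"..."` is described as replacing. The following is a minimal sketch that uses only Base.Regex; it assumes, per the description above, that the same flags on `re"..."` behave the same way, and does not verify that.

```julia
# i: case-insensitive matching
@assert match(r"a|c"i, "C").match == "C"

# m: ^ and $ match at the start and end of individual lines
@assert match(r"^b$"m, "a\nb\nc").match == "b"
@assert match(r"^b$", "a\nb\nc") === nothing

# s: . also matches newlines
@assert match(r"a.b"s, "a\nb").match == "a\nb"
@assert match(r"a.b", "a\nb") === nothing

# x: unescaped whitespace is ignored and # starts a comment
@assert match(r"a b # trailing comment"x, "ab").match == "ab"
```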

CombinedParsers.Regexp — Module

A regular expression parser transforming a PCRE string to a CombinedParser equivalent to the regular expression.

Base.getindex — Method
Base.getindex(x::ParseMatch{<:Any,<:SequenceWithCaptures,<:Any},i::Union{Integer,Symbol})

Gets capture i as SubString.

See API of RegexMatch.

Base.getproperty — Method
Base.getproperty(m::ParseMatch{<:Any,<:SequenceWithCaptures,<:Any},key::Symbol)

Enables m.captures and m.match.

See API of RegexMatch.
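
Since both methods point to the RegexMatch API, a side-by-side sketch may help. It assumes that CombinedParsers is installed, that `using CombinedParsers.Regexp` makes `re"..."` available, and that the named group syntax `(?<run>...)` is accepted by `re"..."`; the pattern and variable names are illustrative only. Expected values are given for the Base.Regex half only.

```julia
using CombinedParsers
using CombinedParsers.Regexp

rm = match(r"(?<run>a+)(c)", "aac")    # Base.Regex reference
pm = match(re"(?<run>a+)(c)", "aac")   # CombinedParsers counterpart (assumed equivalent)

# RegexMatch API: indexing by position or by name, plus .match and .captures
@assert rm.match == "aac"
@assert rm[1] == "aa"
@assert rm[:run] == "aa"
@assert rm.captures == ["aa", "c"]

# The same accessors on the ParseMatch, printed rather than asserted
@show pm.match pm[1] pm[:run] pm.captures
```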

CombinedParsers.regex_escape — Function
    regex_escape(s::AbstractString)

Escapes regular expression metacharacters and whitespace in s.

Compatibility & Unit Tests

CombinedParsers.Regexp.character_class — Constant
julia> CombinedParsers.Regexp.character_class
🗄 Sequence |> map(#57)
├─ \[\: 
├─ |🗄 Either
│  ├─ alpha  => [\p{L}] ValueIn
│  ├─ lower  => [\p{Ll}] ValueIn
│  ├─ upper  => [\p{Lu}] ValueIn
│  ├─ word  => [\p{L}\p{Nl}\p{Nd}\p{Pc}] ValueIn
│  ├─ digit  => [\p{Nd}] ValueIn
│  ├─ xdigit  => [[:xdigit:]] ValueIn
│  ├─ alnum  => [\p{L}\p{N}] ValueIn
│  ├─ blank  => [\t\p{Zs}] ValueIn
│  ├─ cntrl  => [\p{Cc}] ValueIn
│  ├─ graph  => [^\p{Z}\p{C}] ValueNotIn
│  ├─ print  => [\p{C}] ValueIn
│  ├─ punct  => [\p{P}] ValueIn
│  └─ space  => [\r\v\n\f\t\p{Z}] ValueIn
└─ \:\] 
::CombinedParsers.ValueMatcher

TODO:

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

  • [:alnum:] becomes \p{Xan}
  • [:alpha:] becomes \p{L}
  • [:blank:] becomes \h
  • [:digit:] becomes \p{Nd}
  • [:lower:] becomes \p{Ll}
  • [:space:] becomes \p{Xps}
  • [:upper:] becomes \p{Lu}
  • [:word:] becomes \p{Xwd}
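
The replacements listed above can be observed with Base.Regex, which compiles patterns with PCRE's UCP option by default. The sketch below checks four of the listed pairs on a handful of sample characters; it illustrates PCRE behaviour only and makes no claim about how `character_class` is implemented. The names `samples` and `class_pairs` are illustrative.

```julia
# Sample inputs: a Latin letter with diacritic, an ASCII capital,
# an Arabic-Indic digit, a space and an underscore.
samples = ["é", "Z", "٣", " ", "_"]

# (POSIX class, Unicode property it is replaced by under UCP)
class_pairs = [
    (r"^[[:alpha:]]$", r"^\p{L}$"),
    (r"^[[:digit:]]$", r"^\p{Nd}$"),
    (r"^[[:lower:]]$", r"^\p{Ll}$"),
    (r"^[[:upper:]]$", r"^\p{Lu}$"),
]

# Under UCP each POSIX class agrees with its Unicode counterpart.
for (posix, unicode) in class_pairs, s in samples
    @assert occursin(posix, s) == occursin(unicode, s)
end

# Non-ASCII characters are matched by the POSIX classes in this mode.
@assert occursin(r"^[[:alpha:]]$", "é")
@assert occursin(r"^[[:digit:]]$", "٣")
```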
Base.:== — Method
==(pcre_m::RegexMatch,pc_m::ParseMatch)

Equal iff the values of .match, .offset, .ncodeunits and .captures are equal.

CombinedParsers.Regexp.@pcre_tests — Macro
@pcre_testset

Define @syntax pcre_test and @syntax pcre_tests for parsing unit test output of the PCRE library. The parser is used for testing CombinedParser and benchmarking against Regex.

CombinedParsers.Regexp

CombinedParsers._iterate — Method
_iterate(p::ParserWithCaptures, sequence::SequenceWithCaptures,a...)

Calls Base.empty!(sequence) before iteration. (Why?)

Parsing Options

PCRE options are supported.

CombinedParsers.Regexp.with_options — Function
with_options(flags::UInt32,x::AbstractString)

Return x if iszero(flags), otherwise a StringWithOptions with flags.

with_options(flags::UInt32,x::Char)

Return x if iszero(flags), otherwise a CharWithOptions with flags.

with_options(flags::AbstractString,x)

Return with_options(parse_options(flags),x), see parse_options.

with_options(set_flags::UInt32, unset_flags::UInt32,x)

If x isa WithOptions, set its options to set_flags | ( x.flags & ~unset_flags ); otherwise set the options to set_flags.
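
The set/unset formula can be worked through with plain bit arithmetic. This is a sketch only: `merge_flags` is a hypothetical helper, not part of CombinedParsers, and the masks are the standard PCRE2 values for CASELESS, MULTILINE and DOTALL.

```julia
CASELESS  = 0x00000008
MULTILINE = 0x00000400
DOTALL    = 0x00000020

# Hypothetical helper mirroring set_flags | (x.flags & ~unset_flags)
merge_flags(set_flags::UInt32, unset_flags::UInt32, current::UInt32) =
    set_flags | (current & ~unset_flags)

current = CASELESS | MULTILINE                     # flags already on x
merged  = merge_flags(DOTALL, MULTILINE, current)  # set s, unset m

@assert merged == (CASELESS | DOTALL)              # i survives, m is cleared, s is added
@assert merged == 0x00000028
```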

CombinedParsers.Regexp.parse_options — Function
parse_options(options::AbstractString)

Return PCRE option mask parsed from options.

Parser for flags in @re_str.

julia> CombinedParsers.Regexp.pcre_options_parser
🗄 Sequence[2]
├─ ^ AtStart
├─ 🗄* Sequence[1] |> Repeat |> map(splat_or)
│  ├─ |🗄 Either
│  │  ├─ dupnames  => 0x00000040 |> with_name(:DUPNAMES)
│  │  ├─ xx  => 0x01000000 |> with_name(:EXTENDED_MORE)
│  │  ├─ i  => 0x00000008 |> with_name(:CASELESS)
│  │  ├─ m  => 0x00000400 |> with_name(:MULTILINE)
│  │  ├─ n  => 0x00002000 |> with_name(:NO_AUTO_CAPTURE)
│  │  ├─ U  => 0x00040000 |> with_name(:UNGREEDY)
│  │  ├─ J  => 0x00000040 |> with_name(:DUPNAMES)
│  │  ├─ s  => 0x00000020 |> with_name(:DOTALL)
│  │  ├─ x  => 0x00000080 |> with_name(:EXTENDED)
│  │  ├─ B  => 0x00000000 |> with_name(:BINCODE)
│  │  └─ I  => 0x00000000 |> with_name(:INFO)
│  └─ ,? |missing
└─ $ AtEnd
::UInt32
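
The parser shown above maps each flag token to a mask and combines the masks with bitwise or. A simplified sketch of that idea follows, restricted to four single-letter flags. `sketch_flags` and `sketch_parse_options` are hypothetical names, and the sketch handles neither the multi-character tokens (`xx`, `dupnames`) nor the optional commas.

```julia
# Masks copied from the parser tree above
sketch_flags = Dict{Char,UInt32}(
    'i' => 0x00000008,  # CASELESS
    'm' => 0x00000400,  # MULTILINE
    's' => 0x00000020,  # DOTALL
    'x' => 0x00000080,  # EXTENDED
)

# Hypothetical helper: OR together the mask of every flag character
sketch_parse_options(options::AbstractString) =
    reduce(|, (sketch_flags[c] for c in options); init = 0x00000000)

mask = sketch_parse_options("im")
@assert mask == 0x00000408
@assert sketch_parse_options("") == 0x00000000
```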
CombinedParsers.Regexp.StringWithOptions — Type

A lazy element transformation type (e.g. AbstractString), getindex wraps elements in with_options(flags,...).

With parsing options

TODO: make flags a transformation function?

CombinedParsers.Regexp.CharWithOptions — Type

A lazy element transformation type (e.g. AbstractString), getindex wraps elements in with_options(flags,...).

With parsing options

TODO: make flags a transformation function?

CombinedParsers.Regexp.on_options — Function
on_options(flags::Integer,parser)

Create a parser that matches if flags are set in the sequence and parser matches.

Used for PCRE parsing, e.g.

Either(
    on_options(Base.PCRE.MULTILINE, 
           '^' => at_linestart),
    parser('^' => AtStart())
)
CombinedParsers.Regexp.FilterOptions — Type

Lazy wrapper for a sequence, masking elements in getindex with MatchingNever if any of flags are not set.

TODO: make flags a filter function? resolve confound of sequence and value, like StringWithOptions, CharWithOptions

Regular Expression Types

CombinedParsers.Regexp.Backreference — Type
Backreference(f::Function,index::Integer)

Backreference(f::Function,name::Union{Nothing,Symbol},index::Integer)

Backreference(f::Function,name::AbstractString)

Parser matching a previously captured sequence, optionally with a name. The index field is set recursively when ParserWithCaptures is called on the parser.

CombinedParsers.Regexp.Subroutine — Type

Parser matching a preceding capture, optionally with a name. The index field is set recursively when ParserWithCaptures is called on the parser.

CombinedParsers.Regexp.DupSubpatternNumbers — Type

Parser wrapper for ParserWithCaptures, setting resetindex=true in `deepmapparser(::typeof(indexedcaptures),...)`.

julia> p = re"(?|(a)|(b))\1"
🗄 Sequence |> regular expression combinator with 1 capturing groups
├─ |🗄 Either |> DupSubpatternNumbers
│  ├─ (a)  |> Capture 1
│  └─ (b)  |> Capture 1
└─ \g{1} Backreference
::Tuple{Char, AbstractString}

julia> match(p, "aa")
ParseMatch("aa", 1="a")

julia> match(p, "bb")
ParseMatch("bb", 1="b")

See also pcre doc
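
For comparison, the branch-reset group `(?|...)` can be tried with Base.Regex. The sketch below shows why sharing capture number 1 across alternatives matters for the backreference; it uses only Base and assumes the `re"..."` parser above is meant to behave the same way. The variable names are illustrative.

```julia
branch_reset = r"(?|(a)|(b))\1"   # both alternatives fill capture 1
plain        = r"(?:(a)|(b))\1"   # (b) is capture 2, so \1 stays unset for "b"

@assert match(branch_reset, "aa").match == "aa"
@assert match(branch_reset, "bb")[1] == "b"
@assert match(branch_reset, "ab") === nothing   # \1 must repeat the captured text

# Without branch reset, a backreference to the unset group 1 cannot match.
@assert match(plain, "bb") === nothing
```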