PCRE Regular Expressions

You can use the PCRE string macro @re_str in combination with CombinedParser's constructors.

Constructing Regular Expressions

CombinedParsers.Regexp.@re_str — Macro

@re_str(x,flags)

Construct a ParserWithCaptures from PCRE regex syntax, such as re"^[a-z]*$", without interpolation and unescaping (except for quotation mark " which still has to be escaped). Plug-in replacement for PCRE string macro @r_str.

The regex also accepts one or more flags, listed after the ending quote, to change its behaviour:

i enables case-insensitive matching
m treats the ^ and $ tokens as matching the start and end of individual lines, as opposed to the whole string.
s allows the . modifier to match newlines.
x enables "comment mode": whitespace is ignored except when escaped with \, and # is treated as starting a comment.
a disables UCP mode (enables ASCII mode). By default \B, \b, \D, \d, \S, \s, \W, \w, etc. match based on Unicode character properties. With this option, these sequences only match ASCII characters.
xx enables "extended comment mode": whitespace in bracket character matchers is ignored.

julia> re"a|c"i
|🗄 Either
├─ [aA] ValueIn
└─ [cC] ValueIn
::Char

julia> re"a+c"
🗄 Sequence
├─ a+  |> Repeat
└─ c
::Tuple{Vector{Char}, Char}
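
The flags listed above can be tried out with Base's own r"..." literals, which @re_str is documented to replace. The following is a minimal sketch using only Base; it assumes, but does not verify, that re"..." gives the same results for these inputs.

```julia
# Base regex literals; re"..." is documented to accept the same flags.
text = "a\nb\nc"

caseless   = occursin(r"a|c"i, "A")       # i: 'A' matches a|c
multiline  = match(r"^b$"m, text)         # m: ^ and $ match at line boundaries
singleline = match(r"^b$", text)          # without m: ^ and $ refer to the whole string
dotall     = occursin(r"a.c"s, "a\nc")    # s: . also matches a newline
nodotall   = occursin(r"a.c", "a\nc")     # without s: . stops at the newline
extended   = match(r"a b  # trailing comment"x, "ab")  # x: whitespace and comment ignored
```

Here multiline and extended are RegexMatch values, while singleline is nothing.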

Compatibility & Unit Tests

CombinedParsers.Regexp.character_class — Constant

julia> CombinedParsers.Regexp.character_class
🗄 Sequence |> map(#57)
├─ \[\: 
├─ |🗄 Either
│  ├─ alpha  => [\p{L}] ValueIn
│  ├─ lower  => [\p{Ll}] ValueIn
│  ├─ upper  => [\p{Lu}] ValueIn
│  ├─ word  => [\p{L}\p{Nl}\p{Nd}\p{Pc}] ValueIn
│  ├─ digit  => [\p{Nd}] ValueIn
│  ├─ xdigit  => [[:xdigit:]] ValueIn
│  ├─ alnum  => [\p{L}\p{N}] ValueIn
│  ├─ blank  => [\t\p{Zs}] ValueIn
│  ├─ cntrl  => [\p{Cc}] ValueIn
│  ├─ graph  => [^\p{Z}\p{C}] ValueNotIn
│  ├─ print  => [\p{C}] ValueIn
│  ├─ punct  => [\p{P}] ValueIn
│  └─ space  => [\r\v\n\f\t\p{Z}] ValueIn
└─ \:\] 
::CombinedParsers.ValueMatcher

TODO:

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
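
The effect of the replacements above can be observed with Base regexes, which enable UCP by default and disable it with the a flag. This is a sketch of the PCRE behaviour only; whether re"..." already reproduces it is exactly what the TODO leaves open.

```julia
# Cyrillic "ж" is a letter (\p{L}); U+0663 is an Arabic-Indic digit (\p{Nd}).
unicode_alpha = occursin(r"^[[:alpha:]]$", "ж")         # UCP on: [:alpha:] acts like \p{L}
ascii_alpha   = occursin(r"^[[:alpha:]]$"a, "ж")        # a flag: ASCII letters only
unicode_digit = occursin(r"^[[:digit:]]$", "\u0663")    # UCP on: [:digit:] acts like \p{Nd}
ascii_digit   = occursin(r"^[[:digit:]]$"a, "\u0663")   # a flag: 0-9 only
```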

Base.:== — Method

==(pcre_m::RegexMatch,pc_m::ParseMatch)

Equal if and only if the values of .match, .offset, .ncodeunits and .captures are equal.

CombinedParsers.Regexp.@pcre_tests — Macro

@pcre_testset

Define @syntax pcre_test and @syntax pcre_tests for parsing unit test output of the PCRE library. The parser is used for testing CombinedParser and benchmarking against Regex.

CombinedParsers.Regexp

CombinedParsers._iterate — Method

_iterate(p::ParserWithCaptures, sequence::SequenceWithCaptures,a...)

Calls Base.empty!(sequence) before iteration. (Why?)

Parsing Options

PCRE options are supported.

CombinedParsers.Regexp.with_options — Function

with_options(flags::UInt32,x::AbstractString)

Return x if iszero(flags), otherwise StringWithOptions with flags.

with_options(flags::UInt32,x::Char)

Return x if iszero(flags), otherwise CharWithOptions with flags.

with_options(flags::AbstractString,x)

Return with_options(parse_options(flags),x), see parse_options.

with_options(set_flags::UInt32, unset_flags::UInt32,x)

Set options set_flags | ( x.flags & ~unset_flags ) if x isa WithOptions, set options set_flags otherwise.
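
The mask arithmetic in the last method can be checked on bare UInt32 values. In the sketch below, merge_flags is a hypothetical helper that only applies the documented formula; it is not part of the package, and the option bits are the PCRE values listed on this page.

```julia
# Option bits as listed for PCRE on this page
caseless  = 0x00000008
dotall    = 0x00000020
multiline = 0x00000400

# Hypothetical helper: the documented formula applied to a bare mask
merge_flags(set_flags::UInt32, unset_flags::UInt32, current::UInt32) =
    set_flags | (current & ~unset_flags)

current = caseless | multiline                      # 0x00000408
updated = merge_flags(dotall, multiline, current)   # sets DOTALL, clears MULTILINE
```

After the call, updated is 0x00000028: CASELESS is kept, DOTALL is added and MULTILINE is removed.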

CombinedParsers.Regexp.parse_options — Function

parse_options(options::AbstractString)

Return PCRE option mask parsed from options.

Parser for flags in @re_str.

julia> CombinedParsers.Regexp.pcre_options_parser
🗄 Sequence[2]
├─ ^ AtStart
├─ 🗄* Sequence[1] |> Repeat |> map(splat_or)
│  ├─ |🗄 Either
│  │  ├─ dupnames  => 0x00000040 |> with_name(:DUPNAMES)
│  │  ├─ xx  => 0x01000000 |> with_name(:EXTENDED_MORE)
│  │  ├─ i  => 0x00000008 |> with_name(:CASELESS)
│  │  ├─ m  => 0x00000400 |> with_name(:MULTILINE)
│  │  ├─ n  => 0x00002000 |> with_name(:NO_AUTO_CAPTURE)
│  │  ├─ U  => 0x00040000 |> with_name(:UNGREEDY)
│  │  ├─ J  => 0x00000040 |> with_name(:DUPNAMES)
│  │  ├─ s  => 0x00000020 |> with_name(:DOTALL)
│  │  ├─ x  => 0x00000080 |> with_name(:EXTENDED)
│  │  ├─ B  => 0x00000000 |> with_name(:BINCODE)
│  │  └─ I  => 0x00000000 |> with_name(:INFO)
│  └─ ,? |missing
└─ $ AtEnd
::UInt32
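
To make the tree above concrete, here is a Base-only sketch of what such an option parser computes. sketch_parse_options and OPTION_TABLE are hypothetical names, not the package's implementation; the names and bit values are copied from the tree, and the entries are tried in order like the alternatives of Either.

```julia
# Option names and bits copied from the pcre_options_parser tree.
# Order matters: the first matching name wins, so "xx" precedes "x".
OPTION_TABLE = [
    "dupnames" => 0x00000040,
    "xx"       => 0x01000000,
    "i"        => 0x00000008,
    "m"        => 0x00000400,
    "n"        => 0x00002000,
    "U"        => 0x00040000,
    "J"        => 0x00000040,
    "s"        => 0x00000020,
    "x"        => 0x00000080,
    "B"        => 0x00000000,
    "I"        => 0x00000000,
]

# Hypothetical stand-in for parse_options, written with Base only.
function sketch_parse_options(options::AbstractString; table = OPTION_TABLE)
    mask = 0x00000000
    rest = SubString(options, 1)
    while !isempty(rest)
        idx = findfirst(entry -> startswith(rest, first(entry)), table)
        idx === nothing && throw(ArgumentError("unknown option in \"$options\""))
        name, bits = table[idx]
        mask |= bits                                   # OR the option bit into the mask
        rest = SubString(rest, ncodeunits(name) + 1)   # consume the option name
        # an optional comma may separate options
        startswith(rest, ",") && (rest = SubString(rest, 2))
    end
    return mask
end

mask = sketch_parse_options("im")   # CASELESS | MULTILINE
```

With this table, "im" yields 0x00000408 and "i,s" yields 0x00000028.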

CombinedParsers.Regexp.StringWithOptions — Type

A lazy element transformation type (e.g. AbstractString), getindex wraps elements in with_options(flags,...).

With parsing options

TODO: make flags a transformation function?

CombinedParsers.Regexp.CharWithOptions — Type

A lazy element transformation type (e.g. AbstractString), getindex wraps elements in with_options(flags,...).

With parsing options

TODO: make flags a transformation function?

CombinedParsers.Regexp.OnOptionsParser — Type

Parser wrapper sequence with if_options.

CombinedParsers.Regexp.on_options — Function

on_options(flags::Integer,parser)

Create a parser that matches if flags are set in the sequence and parser matches.

Used for PCRE parsing, e.g.

Either(
    on_options(Base.PCRE.MULTILINE, 
           '^' => at_linestart),
    parser('^' => AtStart())
)

CombinedParsers.Regexp.ParserOptions — Type

A wrapper matching the inner parser on with_options(set_flags, unset_flags, sequence).

CombinedParsers.Regexp.FilterOptions — Type

Lazy wrapper for a sequence, masking elements in getindex with MatchingNever if any of flags are not set.

TODO: make flags a filter function? resolve confound of sequence and value, like StringWithOptions, CharWithOptions

CombinedParsers.Regexp.MatchingNever — Type

Helper struct to mask sequence elements from matchers.

Regular Expression Types

CombinedParsers.Regexp.ParserWithCaptures — Type

Top-level parser supporting the regular expression features captures, backreferences, and subroutines. Collects subroutines in field subroutines::Vector and indices of named capture groups in field names::Dict.

Note

Implicitly called in match.

CombinedParsers.Regexp.SequenceWithCaptures — Type

SequenceWithCaptures encapsulates a sequence to be parsed, and parsed captures.

This struct holds captures as sequence-level state. For the next version, a match-level state passed as an _iterate argument is being considered.

See also ParserWithCaptures

CombinedParsers.Regexp.Capture — Type

Capture a parser result, optionally with a name. index field is initialized when calling ParserWithCaptures on the parser.

See also ParserWithCaptures

CombinedParsers.Regexp.Backreference — Type

Backreference(f::Function,index::Integer)

Backreference(f::Function,name::Union{Nothing,Symbol},index::Integer)

Backreference(f::Function,name::AbstractString)

Parser matching previously captured sequence, optionally with a name. index field is recursively set when calling ParserWithCaptures on the parser.

CombinedParsers.Regexp.Subroutine — Type

Parser matching preceding capture, optionally with a name. index field is recursively set when calling ParserWithCaptures on the parser.
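
The difference between a backreference and a subroutine is easiest to see in PCRE syntax. The sketch below uses Base's r"..." literals rather than the Backreference and Subroutine types themselves; it assumes re"..." follows the same PCRE semantics.

```julia
# \1 must repeat the captured text; (?1) re-runs the pattern of group 1.
backref    = match(r"(a|b)\1", "ab")       # no match: "b" is not the captured "a"
subroutine = match(r"(a|b)(?1)", "ab")     # match: the group pattern a|b also accepts "b"

# Named forms: (?&ch) calls the group, \k<ch> refers back to its captured text.
named = match(r"(?<ch>a|b)(?&ch)\k<ch>", "aba")
```

In the named case the capture ch stays "a" after the subroutine call, so the trailing \k<ch> has to match "a".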

CombinedParsers.Regexp.subroutine_index_reset — Method

https://www.pcre.org/original/doc/html/pcrepattern.html#SEC16

CombinedParsers.Regexp.index — Method

index(parser::Subroutine,sequence)

Index of a subroutine. "If you make a subroutine call to a non-unique named subpattern, the one that corresponds to the first occurrence of the name is used." (what about "In the absence of duplicate numbers (see the previous section) this is the one with the lowest number."?)

CombinedParsers.Regexp.Conditional — Type

Conditional parser, _iterate cycles conditionally on _iterate_condition through matches in field yes and no respectively.
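
As an illustration of the PCRE construct behind this type, the following sketch uses a Base regex with a conditional on capture group 1. The yes branch is \) and the no branch is empty; how the Conditional type represents these branches internally is not shown here.

```julia
# (?(1)\)) requires a closing parenthesis only if group 1 matched an opening one.
pattern = r"^(\()?\d+(?(1)\))$"

with_parens    = match(pattern, "(12)")   # group 1 set, yes branch matches ")"
without_parens = match(pattern, "12")     # group 1 unset, empty no branch
unbalanced     = match(pattern, "(12")    # group 1 set but ")" is missing
```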

CombinedParsers.Regexp.DupSubpatternNumbers — Type

Parser wrapper for ParserWithCaptures, setting reset_index=true in deepmap_parser(::typeof(indexed_captures_),...).

julia> p = re"(?|(a)|(b))\1"
🗄 Sequence |> regular expression combinator with 1 capturing groups
├─ |🗄 Either |> DupSubpatternNumbers
│  ├─ (a)  |> Capture 1
│  └─ (b)  |> Capture 1
└─ \g{1} Backreference
::Tuple{Char, AbstractString}

julia> match(p, "aa")
ParseMatch("aa", 1="a")

julia> match(p, "bb")
ParseMatch("bb", 1="b")
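
For comparison, the same pattern can be run through Base's Regex. This is a sketch of the PCRE reference behaviour for duplicate subpattern numbers, using Base only.

```julia
# (?| ... ) resets group numbering per alternative, so both (a) and (b) are group 1.
pcre = r"(?|(a)|(b))\1"

m_aa = match(pcre, "aa")   # group 1 captured "a"
m_bb = match(pcre, "bb")   # group 1 captured "b"
m_ab = match(pcre, "ab")   # no match: \1 must repeat the captured character
```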