Regular Expressions CombinedParser
Parsing is reading by a computer program, formally transforming character sequences into defined and structured representations.
โ name parser slide
Representation and Transformation
โ number list slide
So parsing is a fundamental obstacle before you can begin working with data. Writing parsers is ubiquitous.
What is a regular expression?
Regular expressions and state machines are the most common tools of the parsing trade. Most programmers and data scientists are familiar with regular expressions.
A regular expression is a textual representation of a parser program. The syntax of regular expressions is condensed and arcane.
Parsing has been a painpoint in all of my work as a developer and a researcher with a focus on text while I accumulated plenty of regular expressions. I still use these regexps at many places and do not want to rewrite all regular expressions I still need. So I wrote a CombinedParser that parses regex, and transforms it into an equivalent julia parser program,
julia> using CombinedParsers
This guide demonstrates constructing a recursive CombinedParser
with the @syntax
or push!
to Either
technique. This page outlines essentials of the @re_str
macro
julia> using CombinedParsers.Regexp
Characters and Escaped Characters
The simplest regular expression pattern is a character matcher.
julia> import CombinedParsers.Regexp: meta_chars
julia> char_matcher = Either( CharNotIn(meta_chars), # character Sequence( # escaped_meta 2, # transform: emit unescaped at index '\\', CharIn(meta_chars) ) )
|๐ Either โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โโ ๐ Sequence |> map(#54) โโ \\ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn ::Char
The char_matcher
unescapes regex meta_chars
:
julia> parse(Repeat(char_matcher),raw"a\[b"; log=true)
3-element Vector{Char}: 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase) '[': ASCII/Unicode U+005B (category Ps: Punctuation, open) 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
Repeated patterns
Number of Repetitions: Transforming a Parsing
A basic feature of regular expression is whether a repeatable pattern (e.g. a char) is optional ?
(repeated 0 or 1 times), or should be repeated at least once +
, any times *
, or {min,max}
times (pcre spec). The syntax of regex repetitions require mapping such SubString
s to representions as UnitRange{Int}
.
Above, the :escaped_meta
mapped the parsing to emit the unescaped character at the second position in its sequence. Any Julia function can be used for transformations with the do
syntax after a CombinedParser
constructor to implicitly call map
creating a Transformation
parser.
julia> import CombinedParsers.Regexp: integer, Repeat_max
julia> @with_names repetitions = Either( '?' => 0:1, '+' => 1:Repeat_max, '*' => 0:Repeat_max, Sequence( '{', :min => integer, :max => Optional( Sequence(2, ',', integer | Repeat_max)), '}' ) do v if v.max isa Missing v.min:v.min else v.min:v.max end::UnitRange{Int} end )
|๐ Either |> with_name(:repetitions) โโ \? => 0:1 โโ \+ => 1:1000000 โโ \* => 0:1000000 โโ ๐ Sequence |> map(ntuple) |> map(#1) โโ \{ โโ ๐ Sequence |> map(#31) |> with_name(:min) โ โโ \-? |missing โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โโ ๐? Sequence |> map(#54)|missing |> with_name(:max) โ โโ , โ โโ ๐? Sequence |> map(#31)|1000000 โ โโ \-? |missing โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โโ \} ::UnitRange{Int64}
repetitions
parses PCRE repetitions syntax into a (min, max)
:
julia> parse(repetitions, "?")
0:1
julia> parse(repetitions, "*")
0:1000000
julia> parse(repetitions, "+")
1:1000000
julia> parse(repetitions, "{2}")
2:2
julia> parse(repetitions, "{5,}")
5:1000000
Either
Regular expressions can repeat character matchers and other sub-patterns when appending the repetitions
suffix.
julia> repeatable = Either{CombinedParser}( map(CharIn,char_matcher), '.' => AnyChar() )
|๐ Either โโ |๐ Either |> map(CharIn) โ โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โ โโ ๐ Sequence |> map(#54) โ โโ \\ โ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn โโ \. => . AnyValue ::CombinedParser
We will add capture groups and other sub-patterns to repeatable
later.
Repeatable patterns, Optional
, Lazy
and Atomic
The CombinedParser to parse a repeated pattern from PCRE syntax
julia> @with_names quantified=Sequence( repeatable, Optional(repetitions, default=(1,1)), Optional(map(v->convert(Char,v),with_name("possessive_quantifier",CharIn('+','?')))) ) do v pat = v[1] result = if v[2]==(1,1) pat elseif v[2]==(0,1) Optional(pat) else Repeat(v[2],pat) end if v[3] === missing result elseif v[3]=='+' Atomic(result) elseif v[3]=='?' Lazy(result) else result end end
๐ Sequence |> map(#3) |> with_name(:quantified) โโ |๐ Either โ โโ |๐ Either |> map(CharIn) โ โ โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โ โ โโ ๐ Sequence |> map(#54) โ โ โโ \\ โ โ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn โ โโ \. => . AnyValue โ โโ |๐? Either |> with_name(:repetitions)|(1, 1) โ โโ \? => 0:1 โ โโ \+ => 1:1000000 โ โโ \* => 0:1000000 โ โโ ๐ Sequence |> map(ntuple) |> map(#1) โ โโ \{ โ โโ ๐ Sequence |> map(#31) |> with_name(:min) โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ ๐? Sequence |> map(#54)|missing |> with_name(:max) โ โ โโ , โ โ โโ ๐? Sequence |> map(#31)|1000000 โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ \} โโ [\+\?]? ValueIn |> map(#4) |> with_name(:possessive_quantifier)|missing ::Any
PCRE repetitions syntax can be quantified
:
julia> p = quantified(raw"\++?", log=true)
match repetitions@3-4: \\++? ^ match possessive_quantifier@4-5: \\++? ^ match quantified@1-5: \\++? ^___^ [\+]+? ValueIn |> Repeat |> Lazy ::Vector{Char}
julia> parse(p, "+++")
1-element Vector{Char}: '+': ASCII/Unicode U+002B (category Sm: Symbol, math)
Sequence
s and Alternations
Regular expression patterns can be written in sequence, delimited by |
for matching either one of several alternatives.
julia> subpattern = Either{CombinedParser}( '^' => AtStart(), '$' => AtEnd(), quantified);
ERROR: transforming results with convert(CombinedParser,::Any) ๐ Sequence |> map(#3) |> with_name(:quantified) โโ |๐ Either โ โโ |๐ Either |> map(CharIn) โ โ โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โ โ โโ ๐ Sequence |> map(#54) โ โ โโ \\ โ โ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn โ โโ \. => . AnyValue โ โโ |๐? Either |> with_name(:repetitions)|(1, 1) โ โโ \? => 0:1 โ โโ \+ => 1:1000000 โ โโ \* => 0:1000000 โ โโ ๐ Sequence |> map(ntuple) |> map(#1) โ โโ \{ โ โโ ๐ Sequence |> map(#31) |> with_name(:min) โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ ๐? Sequence |> map(#54)|missing |> with_name(:max) โ โ โโ , โ โ โโ ๐? Sequence |> map(#31)|1000000 โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ \} โโ [\+\?]? ValueIn |> map(#4) |> with_name(:possessive_quantifier)|missing ::Any
julia> @with_names sequence = map( v->sSequence(v...), Repeat( subpattern ));
ERROR: UndefVarError: subpattern not defined
julia> p = sequence(raw"\++?$")
ERROR: UndefVarError: sequence not defined
julia> parse(p, "+++")
1-element Vector{Char}: '+': ASCII/Unicode U+002B (category Sm: Symbol, math)
The alternation parser transforms a regex string into a parser
julia> @with_names pattern = join(sequence, '|') do v sEither(v...)::CombinedParser end;
ERROR: UndefVarError: sequence not defined
julia> result_type(pattern)
ERROR: UndefVarError: pattern not defined
julia> p = parse(pattern, "ab|c", log=(:sequence, :pattern))
ERROR: UndefVarError: pattern not defined
julia> parse(p, "c")
ERROR: ArgumentError: no successfull parsing.
Capture
s
Parentheses allow groupings that are repeatable.
julia> import CombinedParsers.Regexp: name
The push!
to Either
technique allows for the construction of recursive parsers.
julia> @with_names capture_group = map(Capture, Sequence(2, '(', pattern, ')'));
ERROR: UndefVarError: pattern not defined
julia> push!(repeatable,capture_group);
ERROR: UndefVarError: capture_group not defined
Conveniently, the @syntax
with a for <name> in <either>
calls push!
.
julia> @syntax for noncapture_group in repeatable Sequence(2, "(?:", pattern, ')') end;
ERROR: UndefVarError: pattern not defined
julia> import CombinedParsers.Regexp: ParserWithCaptures
julia> regcomb = map(ParserWithCaptures,pattern);
ERROR: UndefVarError: pattern not defined
julia> match(regcomb("(1a(2b?)*)*0"),"1a1a21a2b22b0")
ERROR: UndefVarError: regcomb not defined
julia> match(CombinedParsers.Regexp.Regcomb("(1a(2b?)*)*0"),"1a1a21a2b22b0")
ParseMatch("1a1a21a2b22b0", 1="1a2b22b", 2="2b")
julia> match(Regex("(1a(2b?)*)*0"),"1a1a21a2b22b0")
RegexMatch("1a1a21a2b22b0", 1="1a2b22b", 2="2b")
julia> pattern("(1a(2b?)*)*0")("1a1a21a2b22b0")
ERROR: UndefVarError: pattern not defined
Lookahead
and Lookbehind
julia> @syntax for lookaround in repeatable Sequence( "(?", :direction => Either( '<' => Lookbehind, Always() => Lookahead), :doesmatch => Either( '=' => true, '!' => false), :pattern => pattern, ')') do v v.direction(v.doesmatch,v.pattern)::CombinedParser # annotation needed for type inference # obfuscated by :direction end end;
ERROR: UndefVarError: pattern not defined
julia> match(pattern("a(?=c|d)."),"abad")
ERROR: UndefVarError: pattern not defined
julia> match(Regex("a(?=c|d)."),"abad")
RegexMatch("ad")
Regular Expression Brackets
The regular expression parser packaged with CombinedParser is widely compliant with the PCRE C library used in Julia Base.Regex
โ and in more than half the tests actually faster than C see testing and benchmarking against PCRE tests.
julia> import CombinedParsers.Regexp: bracket, ParserWithCaptures
julia> push!(repeatable,bracket);
@re_str
is tested and benchmarked against the PCRE C library testset (see compliance report).
Syntax is essential, semantics is high-brow. What is the semantics of a regular expression? The semantics of regular expression is a parser program that reads a context free regular language.
This page was generated using Literate.jl.