Regular Expressions CombinedParser
Parsing is reading by a computer program, formally transforming character sequences into defined and structured representations.
โ name parser slide
Representation and Transformation
โ number list slide
So parsing is a fundamental obstacle before you can begin working with data. Writing parsers is ubiquitous.
What is a regular expression?
Regular expressions and state machines are the most common tools of the parsing trade. Most programmers and data scientists are familiar with regular expressions.
A regular expression is a textual representation of a parser program. The syntax of regular expressions is condensed and arcane.
Parsing has been a painpoint in all of my work as a developer and a researcher with a focus on text while I accumulated plenty of regular expressions. I still use these regexps at many places and do not want to rewrite all regular expressions I still need. So I wrote a CombinedParser that parses regex, and transforms it into an equivalent julia parser program,
julia> using CombinedParsers
This guide demonstrates constructing a recursive CombinedParser with the @syntax or push! to Either technique. This page outlines essentials of the @re_str macro
julia> using CombinedParsers.Regexp
Characters and Escaped Characters
The simplest regular expression pattern is a character matcher.
julia> import CombinedParsers.Regexp: meta_charsjulia> char_matcher = Either( CharNotIn(meta_chars), # character Sequence( # escaped_meta 2, # transform: emit unescaped at index '\\', CharIn(meta_chars) ) )|๐ Either โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โโ ๐ Sequence |> map(#54) โโ \\ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn ::Char
The char_matcher unescapes regex meta_chars:
julia> parse(Repeat(char_matcher),raw"a\[b"; log=true)3-element Vector{Char}: 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase) '[': ASCII/Unicode U+005B (category Ps: Punctuation, open) 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
Repeated patterns
Number of Repetitions: Transforming a Parsing
A basic feature of regular expression is whether a repeatable pattern (e.g. a char) is optional ? (repeated 0 or 1 times), or should be repeated at least once +, any times *, or {min,max} times (pcre spec). The syntax of regex repetitions require mapping such SubStrings to representions as UnitRange{Int}.
Above, the :escaped_meta mapped the parsing to emit the unescaped character at the second position in its sequence. Any Julia function can be used for transformations with the do syntax after a CombinedParser constructor to implicitly call map creating a Transformation parser.
julia> import CombinedParsers.Regexp: integer, Repeat_maxjulia> @with_names repetitions = Either( '?' => 0:1, '+' => 1:Repeat_max, '*' => 0:Repeat_max, Sequence( '{', :min => integer, :max => Optional( Sequence(2, ',', integer | Repeat_max)), '}' ) do v if v.max isa Missing v.min:v.min else v.min:v.max end::UnitRange{Int} end )|๐ Either |> with_name(:repetitions) โโ \? => 0:1 โโ \+ => 1:1000000 โโ \* => 0:1000000 โโ ๐ Sequence |> map(ntuple) |> map(#1) โโ \{ โโ ๐ Sequence |> map(#31) |> with_name(:min) โ โโ \-? |missing โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โโ ๐? Sequence |> map(#54)|missing |> with_name(:max) โ โโ , โ โโ ๐? Sequence |> map(#31)|1000000 โ โโ \-? |missing โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โโ \} ::UnitRange{Int64}
repetitions parses PCRE repetitions syntax into a (min, max):
julia> parse(repetitions, "?")0:1julia> parse(repetitions, "*")0:1000000julia> parse(repetitions, "+")1:1000000julia> parse(repetitions, "{2}")2:2julia> parse(repetitions, "{5,}")5:1000000
Either
Regular expressions can repeat character matchers and other sub-patterns when appending the repetitions suffix.
julia> repeatable = Either{CombinedParser}( map(CharIn,char_matcher), '.' => AnyChar() )|๐ Either โโ |๐ Either |> map(CharIn) โ โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โ โโ ๐ Sequence |> map(#54) โ โโ \\ โ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn โโ \. => . AnyValue ::CombinedParser
We will add capture groups and other sub-patterns to repeatable later.
Repeatable patterns, Optional, Lazy and Atomic
The CombinedParser to parse a repeated pattern from PCRE syntax
julia> @with_names quantified=Sequence( repeatable, Optional(repetitions, default=(1,1)), Optional(map(v->convert(Char,v),with_name("possessive_quantifier",CharIn('+','?')))) ) do v pat = v[1] result = if v[2]==(1,1) pat elseif v[2]==(0,1) Optional(pat) else Repeat(v[2],pat) end if v[3] === missing result elseif v[3]=='+' Atomic(result) elseif v[3]=='?' Lazy(result) else result end end๐ Sequence |> map(#3) |> with_name(:quantified) โโ |๐ Either โ โโ |๐ Either |> map(CharIn) โ โ โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โ โ โโ ๐ Sequence |> map(#54) โ โ โโ \\ โ โ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn โ โโ \. => . AnyValue โ โโ |๐? Either |> with_name(:repetitions)|(1, 1) โ โโ \? => 0:1 โ โโ \+ => 1:1000000 โ โโ \* => 0:1000000 โ โโ ๐ Sequence |> map(ntuple) |> map(#1) โ โโ \{ โ โโ ๐ Sequence |> map(#31) |> with_name(:min) โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ ๐? Sequence |> map(#54)|missing |> with_name(:max) โ โ โโ , โ โ โโ ๐? Sequence |> map(#31)|1000000 โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ \} โโ [\+\?]? ValueIn |> map(#4) |> with_name(:possessive_quantifier)|missing ::Any
PCRE repetitions syntax can be quantified:
julia> p = quantified(raw"\++?", log=true)match repetitions@3-4: \\++? ^ match possessive_quantifier@4-5: \\++? ^ match quantified@1-5: \\++? ^___^ [\+]+? ValueIn |> Repeat |> Lazy ::Vector{Char}julia> parse(p, "+++")1-element Vector{Char}: '+': ASCII/Unicode U+002B (category Sm: Symbol, math)
Sequences and Alternations
Regular expression patterns can be written in sequence, delimited by | for matching either one of several alternatives.
julia> subpattern = Either{CombinedParser}( '^' => AtStart(), '$' => AtEnd(), quantified);ERROR: transforming results with convert(CombinedParser,::Any) ๐ Sequence |> map(#3) |> with_name(:quantified) โโ |๐ Either โ โโ |๐ Either |> map(CharIn) โ โ โโ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โ โ โโ ๐ Sequence |> map(#54) โ โ โโ \\ โ โ โโ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn โ โโ \. => . AnyValue โ โโ |๐? Either |> with_name(:repetitions)|(1, 1) โ โโ \? => 0:1 โ โโ \+ => 1:1000000 โ โโ \* => 0:1000000 โ โโ ๐ Sequence |> map(ntuple) |> map(#1) โ โโ \{ โ โโ ๐ Sequence |> map(#31) |> with_name(:min) โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ ๐? Sequence |> map(#54)|missing |> with_name(:max) โ โ โโ , โ โ โโ ๐? Sequence |> map(#31)|1000000 โ โ โโ \-? |missing โ โ โโ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ โโ \} โโ [\+\?]? ValueIn |> map(#4) |> with_name(:possessive_quantifier)|missing ::Anyjulia> @with_names sequence = map( v->sSequence(v...), Repeat( subpattern ));ERROR: UndefVarError: subpattern not definedjulia> p = sequence(raw"\++?$")ERROR: UndefVarError: sequence not definedjulia> parse(p, "+++")1-element Vector{Char}: '+': ASCII/Unicode U+002B (category Sm: Symbol, math)
The alternation parser transforms a regex string into a parser
julia> @with_names pattern = join(sequence, '|') do v sEither(v...)::CombinedParser end;ERROR: UndefVarError: sequence not definedjulia> result_type(pattern)ERROR: UndefVarError: pattern not definedjulia> p = parse(pattern, "ab|c", log=(:sequence, :pattern))ERROR: UndefVarError: pattern not definedjulia> parse(p, "c")ERROR: ArgumentError: no successfull parsing.
Captures
Parentheses allow groupings that are repeatable.
julia> import CombinedParsers.Regexp: name
The push! to Either technique allows for the construction of recursive parsers.
julia> @with_names capture_group = map(Capture, Sequence(2, '(', pattern, ')'));ERROR: UndefVarError: pattern not definedjulia> push!(repeatable,capture_group);ERROR: UndefVarError: capture_group not defined
Conveniently, the @syntax with a for <name> in <either> calls push!.
julia> @syntax for noncapture_group in repeatable Sequence(2, "(?:", pattern, ')') end;ERROR: UndefVarError: pattern not definedjulia> import CombinedParsers.Regexp: ParserWithCapturesjulia> regcomb = map(ParserWithCaptures,pattern);ERROR: UndefVarError: pattern not definedjulia> match(regcomb("(1a(2b?)*)*0"),"1a1a21a2b22b0")ERROR: UndefVarError: regcomb not definedjulia> match(CombinedParsers.Regexp.Regcomb("(1a(2b?)*)*0"),"1a1a21a2b22b0")ParseMatch("1a1a21a2b22b0", 1="1a2b22b", 2="2b")julia> match(Regex("(1a(2b?)*)*0"),"1a1a21a2b22b0")RegexMatch("1a1a21a2b22b0", 1="1a2b22b", 2="2b")julia> pattern("(1a(2b?)*)*0")("1a1a21a2b22b0")ERROR: UndefVarError: pattern not defined
Lookahead and Lookbehind
julia> @syntax for lookaround in repeatable Sequence( "(?", :direction => Either( '<' => Lookbehind, Always() => Lookahead), :doesmatch => Either( '=' => true, '!' => false), :pattern => pattern, ')') do v v.direction(v.doesmatch,v.pattern)::CombinedParser # annotation needed for type inference # obfuscated by :direction end end;ERROR: UndefVarError: pattern not definedjulia> match(pattern("a(?=c|d)."),"abad")ERROR: UndefVarError: pattern not definedjulia> match(Regex("a(?=c|d)."),"abad")RegexMatch("ad")
Regular Expression Brackets
The regular expression parser packaged with CombinedParser is widely compliant with the PCRE C library used in Julia Base.Regex โ and in more than half the tests actually faster than C see testing and benchmarking against PCRE tests.
julia> import CombinedParsers.Regexp: bracket, ParserWithCapturesjulia> push!(repeatable,bracket);
@re_str is tested and benchmarked against the PCRE C library testset (see compliance report).
Syntax is essential, semantics is high-brow. What is the semantics of a regular expression? The semantics of regular expression is a parser program that reads a context free regular language.
This page was generated using Literate.jl.