Regular Expressions CombinedParser

Parsing is reading by a computer program, formally transforming character sequences into defined and structured representations.

โ€“ name parser slide

Representation and Transformation

โ€“ number list slide

So parsing is a fundamental obstacle before you can begin working with data. Writing parsers is ubiquitous.

What is a regular expression?

Regular expressions and state machines are the most common tools of the parsing trade. Most programmers and data scientists are familiar with regular expressions.

A regular expression is a textual representation of a parser program. The syntax of regular expressions is condensed and arcane.

Parsing has been a painpoint in all of my work as a developer and a researcher with a focus on text while I accumulated plenty of regular expressions. I still use these regexps at many places and do not want to rewrite all regular expressions I still need. So I wrote a CombinedParser that parses regex, and transforms it into an equivalent julia parser program,

julia> using CombinedParsers

This guide demonstrates constructing a recursive CombinedParser with the @syntax or push! to Either technique. This page outlines essentials of the @re_str macro

julia> using CombinedParsers.Regexp

Characters and Escaped Characters

The simplest regular expression pattern is a character matcher.

julia> import CombinedParsers.Regexp: meta_chars
julia> char_matcher = Either( CharNotIn(meta_chars), # character Sequence( # escaped_meta 2, # transform: emit unescaped at index '\\', CharIn(meta_chars) ) )|๐Ÿ—„ Either โ”œโ”€ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn โ””โ”€ ๐Ÿ—„ Sequence |> map(#54) โ”œโ”€ \\ โ””โ”€ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn ::Char

The char_matcher unescapes regex meta_chars:

julia> parse(Repeat(char_matcher),raw"a\[b"; log=true)3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '[': ASCII/Unicode U+005B (category Ps: Punctuation, open)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

Repeated patterns

Number of Repetitions: Transforming a Parsing

A basic feature of regular expression is whether a repeatable pattern (e.g. a char) is optional ? (repeated 0 or 1 times), or should be repeated at least once +, any times *, or {min,max} times (pcre spec). The syntax of regex repetitions require mapping such SubStrings to representions as UnitRange{Int}.

Above, the :escaped_meta mapped the parsing to emit the unescaped character at the second position in its sequence. Any Julia function can be used for transformations with the do syntax after a CombinedParser constructor to implicitly call map creating a Transformation parser.

julia> import CombinedParsers.Regexp: integer, Repeat_max
julia> @with_names repetitions = Either( '?' => 0:1, '+' => 1:Repeat_max, '*' => 0:Repeat_max, Sequence( '{', :min => integer, :max => Optional( Sequence(2, ',', integer | Repeat_max)), '}' ) do v if v.max isa Missing v.min:v.min else v.min:v.max end::UnitRange{Int} end )|๐Ÿ—„ Either |> with_name(:repetitions) โ”œโ”€ \? => 0:1 โ”œโ”€ \+ => 1:1000000 โ”œโ”€ \* => 0:1000000 โ””โ”€ ๐Ÿ—„ Sequence |> map(ntuple) |> map(#1) โ”œโ”€ \{ โ”œโ”€ ๐Ÿ—„ Sequence |> map(#31) |> with_name(:min) โ”‚ โ”œโ”€ \-? |missing โ”‚ โ””โ”€ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ”œโ”€ ๐Ÿ—„? Sequence |> map(#54)|missing |> with_name(:max) โ”‚ โ”œโ”€ , โ”‚ โ””โ”€ ๐Ÿ—„? Sequence |> map(#31)|1000000 โ”‚ โ”œโ”€ \-? |missing โ”‚ โ””โ”€ [0-9]+ ValueIn |> Repeat |> ! |> map(#187) โ””โ”€ \} ::UnitRange{Int64}

repetitions parses PCRE repetitions syntax into a (min, max):

julia> parse(repetitions, "?")0:1
julia> parse(repetitions, "*")0:1000000
julia> parse(repetitions, "+")1:1000000
julia> parse(repetitions, "{2}")2:2
julia> parse(repetitions, "{5,}")5:1000000

Either

Regular expressions can repeat character matchers and other sub-patterns when appending the repetitions suffix.

julia> repeatable = Either{CombinedParser}(
           map(CharIn,char_matcher),
           '.' => AnyChar()
       )|๐Ÿ—„ Either
โ”œโ”€ |๐Ÿ—„ Either |> map(CharIn)
โ”‚  โ”œโ”€ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn
โ”‚  โ””โ”€ ๐Ÿ—„ Sequence |> map(#54)
โ”‚     โ”œโ”€ \\
โ”‚     โ””โ”€ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn
โ””โ”€ \.  => . AnyValue

::CombinedParser

We will add capture groups and other sub-patterns to repeatable later.

Repeatable patterns, Optional, Lazy and Atomic

The CombinedParser to parse a repeated pattern from PCRE syntax

julia> @with_names quantified=Sequence(
               repeatable,
               Optional(repetitions, default=(1,1)),
               Optional(map(v->convert(Char,v),with_name("possessive_quantifier",CharIn('+','?'))))
           ) do v
               pat = v[1]
               result = if v[2]==(1,1)
                   pat
               elseif v[2]==(0,1)
                   Optional(pat)
               else
                   Repeat(v[2],pat)
               end
               if v[3] === missing
                   result
               elseif v[3]=='+'
                   Atomic(result)
               elseif v[3]=='?'
                   Lazy(result)
               else
                   result
               end
           end๐Ÿ—„ Sequence |> map(#3) |> with_name(:quantified)
โ”œโ”€ |๐Ÿ—„ Either
โ”‚  โ”œโ”€ |๐Ÿ—„ Either |> map(CharIn)
โ”‚  โ”‚  โ”œโ”€ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn
โ”‚  โ”‚  โ””โ”€ ๐Ÿ—„ Sequence |> map(#54)
โ”‚  โ”‚     โ”œโ”€ \\
โ”‚  โ”‚     โ””โ”€ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn
โ”‚  โ””โ”€ \.  => . AnyValue
โ”‚
โ”œโ”€ |๐Ÿ—„? Either |> with_name(:repetitions)|(1, 1)
โ”‚  โ”œโ”€ \?  => 0:1
โ”‚  โ”œโ”€ \+  => 1:1000000
โ”‚  โ”œโ”€ \*  => 0:1000000
โ”‚  โ””โ”€ ๐Ÿ—„ Sequence |> map(ntuple) |> map(#1)
โ”‚     โ”œโ”€ \{
โ”‚     โ”œโ”€ ๐Ÿ—„ Sequence |> map(#31) |> with_name(:min)
โ”‚     โ”‚  โ”œโ”€ \-? |missing
โ”‚     โ”‚  โ””โ”€ [0-9]+ ValueIn |> Repeat |> ! |> map(#187)
โ”‚     โ”œโ”€ ๐Ÿ—„? Sequence |> map(#54)|missing |> with_name(:max)
โ”‚     โ”‚  โ”œโ”€ ,
โ”‚     โ”‚  โ””โ”€ ๐Ÿ—„? Sequence |> map(#31)|1000000
โ”‚     โ”‚     โ”œโ”€ \-? |missing
โ”‚     โ”‚     โ””โ”€ [0-9]+ ValueIn |> Repeat |> ! |> map(#187)
โ”‚     โ””โ”€ \}
โ””โ”€ [\+\?]? ValueIn |> map(#4) |> with_name(:possessive_quantifier)|missing
::Any

PCRE repetitions syntax can be quantified:

julia> p = quantified(raw"\++?", log=true)   match repetitions@3-4: \\++?
                             ^
   match possessive_quantifier@4-5: \\++?
                                        ^
   match quantified@1-5: \\++?
                         ^___^
[\+]+? ValueIn |> Repeat |> Lazy
::Vector{Char}
julia> parse(p, "+++")1-element Vector{Char}: '+': ASCII/Unicode U+002B (category Sm: Symbol, math)

Sequences and Alternations

Regular expression patterns can be written in sequence, delimited by | for matching either one of several alternatives.

julia> subpattern = Either{CombinedParser}(
           '^' => AtStart(),
           '$' => AtEnd(),
           quantified);ERROR: transforming results with convert(CombinedParser,::Any)
๐Ÿ—„ Sequence |> map(#3) |> with_name(:quantified)
โ”œโ”€ |๐Ÿ—„ Either
โ”‚  โ”œโ”€ |๐Ÿ—„ Either |> map(CharIn)
โ”‚  โ”‚  โ”œโ”€ [^\\\^\$\.\[\|\(\)\?\*\+\{] ValueNotIn
โ”‚  โ”‚  โ””โ”€ ๐Ÿ—„ Sequence |> map(#54)
โ”‚  โ”‚     โ”œโ”€ \\
โ”‚  โ”‚     โ””โ”€ [\\\^\$\.\[\|\(\)\?\*\+\{] ValueIn
โ”‚  โ””โ”€ \.  => . AnyValue
โ”‚
โ”œโ”€ |๐Ÿ—„? Either |> with_name(:repetitions)|(1, 1)
โ”‚  โ”œโ”€ \?  => 0:1
โ”‚  โ”œโ”€ \+  => 1:1000000
โ”‚  โ”œโ”€ \*  => 0:1000000
โ”‚  โ””โ”€ ๐Ÿ—„ Sequence |> map(ntuple) |> map(#1)
โ”‚     โ”œโ”€ \{
โ”‚     โ”œโ”€ ๐Ÿ—„ Sequence |> map(#31) |> with_name(:min)
โ”‚     โ”‚  โ”œโ”€ \-? |missing
โ”‚     โ”‚  โ””โ”€ [0-9]+ ValueIn |> Repeat |> ! |> map(#187)
โ”‚     โ”œโ”€ ๐Ÿ—„? Sequence |> map(#54)|missing |> with_name(:max)
โ”‚     โ”‚  โ”œโ”€ ,
โ”‚     โ”‚  โ””โ”€ ๐Ÿ—„? Sequence |> map(#31)|1000000
โ”‚     โ”‚     โ”œโ”€ \-? |missing
โ”‚     โ”‚     โ””โ”€ [0-9]+ ValueIn |> Repeat |> ! |> map(#187)
โ”‚     โ””โ”€ \}
โ””โ”€ [\+\?]? ValueIn |> map(#4) |> with_name(:possessive_quantifier)|missing
::Any
julia> @with_names sequence = map( v->sSequence(v...), Repeat( subpattern ));ERROR: UndefVarError: subpattern not defined
julia> p = sequence(raw"\++?$")ERROR: UndefVarError: sequence not defined
julia> parse(p, "+++")1-element Vector{Char}: '+': ASCII/Unicode U+002B (category Sm: Symbol, math)

The alternation parser transforms a regex string into a parser

julia> @with_names pattern = join(sequence, '|') do v
           sEither(v...)::CombinedParser
       end;ERROR: UndefVarError: sequence not defined
julia> result_type(pattern)ERROR: UndefVarError: pattern not defined
julia> p = parse(pattern, "ab|c", log=(:sequence, :pattern))ERROR: UndefVarError: pattern not defined
julia> parse(p, "c")ERROR: ArgumentError: no successfull parsing.

Captures

Parentheses allow groupings that are repeatable.

julia> import CombinedParsers.Regexp: name

The push! to Either technique allows for the construction of recursive parsers.

julia> @with_names capture_group =
           map(Capture, Sequence(2, '(', pattern, ')'));ERROR: UndefVarError: pattern not defined
julia> push!(repeatable,capture_group);ERROR: UndefVarError: capture_group not defined

Conveniently, the @syntax with a for <name> in <either> calls push!.

julia> @syntax for noncapture_group in repeatable
           Sequence(2, "(?:", pattern, ')')
       end;ERROR: UndefVarError: pattern not defined
julia> import CombinedParsers.Regexp: ParserWithCaptures
julia> regcomb = map(ParserWithCaptures,pattern);ERROR: UndefVarError: pattern not defined
julia> match(regcomb("(1a(2b?)*)*0"),"1a1a21a2b22b0")ERROR: UndefVarError: regcomb not defined
julia> match(CombinedParsers.Regexp.Regcomb("(1a(2b?)*)*0"),"1a1a21a2b22b0")ParseMatch("1a1a21a2b22b0", 1="1a2b22b", 2="2b")
julia> match(Regex("(1a(2b?)*)*0"),"1a1a21a2b22b0")RegexMatch("1a1a21a2b22b0", 1="1a2b22b", 2="2b")
julia> pattern("(1a(2b?)*)*0")("1a1a21a2b22b0")ERROR: UndefVarError: pattern not defined

Lookahead and Lookbehind

julia> @syntax for lookaround in repeatable
           Sequence(
               "(?",
               :direction => Either(
                   '<' => Lookbehind,
                   Always() => Lookahead),
               :doesmatch => Either(
                   '=' => true,
                   '!' => false),
               :pattern => pattern,
               ')') do v
                   v.direction(v.doesmatch,v.pattern)::CombinedParser
                   # annotation needed for type inference
                   # obfuscated by :direction
               end
       end;ERROR: UndefVarError: pattern not defined
julia> match(pattern("a(?=c|d)."),"abad")ERROR: UndefVarError: pattern not defined
julia> match(Regex("a(?=c|d)."),"abad")RegexMatch("ad")

Regular Expression Brackets

The regular expression parser packaged with CombinedParser is widely compliant with the PCRE C library used in Julia Base.Regex โ€“ and in more than half the tests actually faster than C see testing and benchmarking against PCRE tests.

julia> import CombinedParsers.Regexp: bracket, ParserWithCaptures
julia> push!(repeatable,bracket);

@re_str is tested and benchmarked against the PCRE C library testset (see compliance report).

Syntax is essential, semantics is high-brow. What is the semantics of a regular expression? The semantics of regular expression is a parser program that reads a context free regular language.


This page was generated using Literate.jl.