Overview
ParseMatch
CombinedParsers.jl provides the @re_str macro as a plug-in replacement for the base Julia @r_str macro.
Base Julia PCRE regular expressions:
julia> pattern = r"(?<a>a|B)+c"
r"(?<a>a|B)+c"
julia> mr = match(pattern,"aBc")
RegexMatch("aBc", a="B")
CombinedParsers.Regexp regular expression:
julia> pattern = re"(?<a>a|B)+c"
🗄 Sequence |> regular expression combinator with 1 capturing groups
├─ (?<a>|🗄)+ Either |> Capture 1 |> with_name(:a) |> Repeat
│  ├─ a
│  └─ B
└─ c
::Tuple{Vector{Char}, Char}
julia> mre = match(pattern,"aBc")
ParseMatch("aBc", a="B")
The ParseMatch type has getproperty and getindex methods, so it can be handled like a RegexMatch.
julia> mre.match
"aBc"
julia> mre.captures
1-element Vector{SubString{String}}: "B"
julia> mre[1]
"B"
julia> mre[:a]
"B"
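For comparison, the same accessors on a Base RegexMatch can be sketched as follows. This uses Base Julia only; the variable names are illustrative and not part of either API.

```julia
# Base Julia only; no packages are needed for this sketch.
pattern = r"(?<a>a|B)+c"
m = match(pattern, "aBc")

whole   = m.match      # the full matched substring
groups  = m.captures   # vector of captured substrings
by_pos  = m[1]         # capture selected by position
by_name = m[:a]        # capture selected by group name

println((whole, groups, by_pos, by_name))
```

Code written against these four accessors should need no changes when the pattern is built with @re_str instead, which is the plug-in compatibility claimed above.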
CombinedParsers.jl is tested and benchmarked against the PCRE C library test set; see the compliance report.
Parsing
match searches for the first match of the Regex in the String and returns a RegexMatch/ParseMatch object containing the match and captures, or nothing if the match failed. If a capture group matches repeatedly, only its last match is captured.
julia> match(pattern,"aBBac")
ParseMatch("aBBac", a="a")
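The two behaviours described above (only the last repetition is kept, and nothing is returned on failure) can be checked with Base Regex alone. This is a minimal sketch; the input "xyz" and the variable names are illustrative assumptions.

```julia
# Base Julia only; the same semantics are described above for ParseMatch.
pattern = r"(?<a>a|B)+c"

# The group (?<a>...) matches four times; only the final repetition is kept.
m = match(pattern, "aBBac")
last_capture = m[:a]

# A string without any match yields `nothing` rather than throwing.
no_match = match(pattern, "xyz")

if no_match === nothing
    println("no match; last capture of the successful match was ", last_capture)
end
```

Testing the result with `=== nothing` before touching the captures is the usual guard in calling code.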
Base.parse methods parse a String into a Julia type. A CombinedParser p will parse into an instance of result_type(p). For parsers defined with the @re_str macro, the result types are nested Tuples and Vectors of SubString, Char and Missing.
julia> parse(pattern,"aBBac")
(['a', 'B', 'B', 'a'], 'c')
Iterating
If a parsing is not uniquely defined, the different parsings can be iterated lazily, conforming to Julia's iterate interface.
for p in parse_all(re"^(a|ab|b)+$","abab")
    println(p)
end
(re"^", Union{Char, Tuple{Char, Char}}['a', 'b', 'a', 'b'], re"$")
(re"^", Union{Char, Tuple{Char, Char}}['a', 'b', ('a', 'b')], re"$")
(re"^", Union{Char, Tuple{Char, Char}}[('a', 'b'), 'a', 'b'], re"$")
(re"^", Union{Char, Tuple{Char, Char}}[('a', 'b'), ('a', 'b')], re"$")
Performance
CombinedParsers are fast, using parametric types and generated functions in the Julia compiler.
For comparison, the Base.Regex (PCRE C implementation):
using BenchmarkTools
pattern = r"[aB]+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 690 evaluations.
Range (min … max): 184.778 ns … 23.575 μs ┊ GC (min … max): 0.00% … 98.58%
Time (median): 196.782 ns ┊ GC (median): 0.00%
Time (mean ± σ): 245.407 ns ± 932.857 ns ┊ GC (mean ± σ): 15.63% ± 4.07%
185 ns Histogram: frequency by time 291 ns <
Memory estimate: 224 bytes, allocs estimate: 3.
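CombinedParsers are described as slightly faster in this case and for many other tested parsers, although the run recorded below has a higher median (about 429 ns versus 197 ns for PCRE).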
pattern = re"[aB]+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 200 evaluations.
Range (min … max): 410.505 ns … 82.686 μs ┊ GC (min … max): 0.00% … 99.19%
Time (median): 428.952 ns ┊ GC (median): 0.00%
Time (mean ± σ): 556.538 ns ± 2.745 μs ┊ GC (mean ± σ): 17.04% ± 3.43%
411 ns Histogram: log(frequency) by time 692 ns <
Memory estimate: 544 bytes, allocs estimate: 5.
Matching Regex captures are supported for compatibility:
pattern = r"([aB])+c"
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 506 evaluations.
Range (min … max): 222.158 ns … 38.584 μs ┊ GC (min … max): 0.00% … 99.19%
Time (median): 232.386 ns ┊ GC (median): 0.00%
Time (mean ± σ): 313.363 ns ± 1.480 μs ┊ GC (mean ± σ): 18.85% ± 3.96%
222 ns Histogram: log(frequency) by time 441 ns <
Memory estimate: 288 bytes, allocs estimate: 4.
CombinedParsers.Regexp.Capture parsers are slow compared with PCRE:
pattern = re"([aB])+c";
@benchmark match(pattern,"aBaBc")
BenchmarkTools.Trial: 10000 samples with 148 evaluations.
Range (min … max): 696.054 ns … 96.962 μs ┊ GC (min … max): 0.00% … 98.97%
Time (median): 731.385 ns ┊ GC (median): 0.00%
Time (mean ± σ): 970.875 ns ± 4.370 μs ┊ GC (mean ± σ): 21.06% ± 4.63%
696 ns Histogram: log(frequency) by time 1.15 μs <
Memory estimate: 1.28 KiB, allocs estimate: 10.
But with CombinedParsers you can capture more flexibly with transformations anyway.
julia> pattern = re"[aB]+c";
julia> @btime (mre = match(pattern,"aBaBc"))
415.387 ns (5 allocations: 544 bytes)
ParseMatch("aBaBc")
julia> @btime get(mre)
66.217 ns (2 allocations: 128 bytes)
(['a', 'B'], 'c')
Transformations
Transform the result of a parsing with map. The result_type is inferred automatically using Julia's type inference.
julia> p = map(length,re"(ab)*")
(ab)* Sequence |> Capture 1 |> Repeat |> map(length) |> regular expression combinator with 1 capturing groups
::Int64
julia> parse(p,"abababab")
4
Conveniently, getindex(::CombinedParser,::Integer) and map(::Integer,::CombinedParser) each create a transforming parser that selects an element from the result of the parsing.
julia> parse(map(IndexAt(2),re"abc"),"abc")
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
julia> parse(re"abc"[2],"abc")
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
Next: The User guide provides a summary of CombinedParsers types.