A fast Trie-based parser for a collection of literal Strings.

Matching any one of many Strings (e.g. names) is slow in PCRE with | alternatives.

julia> using Random
julia> s = [ randstring(10) for _ in 1:1000 ];
julia> re = Regex(join(s,"|"));
julia> using BenchmarkTools
julia> @benchmark match(re,s[end])BenchmarkTools.Trial: 10000 samples with 9 evaluations. Range (min … max): 2.022 μs … 5.634 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.053 μs ┊ GC (median): 0.00% Time (mean ± σ): 2.087 μs ± 150.254 ns ┊ GC (mean ± σ): 0.00% ± 0.00% █▄▁ ███▇▆▆▅▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃ 2.02 μs Histogram: frequency by time 2.68 μs < Memory estimate: 240 bytes, allocs estimate: 4.

A optimized state-machine parser will perform much better.

TODO: Benchmark Automa.jl here

CombinedParsers does not use state machines, but provides fast _iterate implementation for Either{Trie{Char}}, that is constructed with Either(::Vector{<:AbstractString})

julia> using CombinedParsers
julia> pc = Either(s);
julia> @benchmark match(pc,s[end])BenchmarkTools.Trial: 10000 samples with 194 evaluations. Range (min … max): 498.912 ns … 70.833 μs ┊ GC (min … max): 0.00% … 98.69% Time (median): 515.570 ns ┊ GC (median): 0.00% Time (mean ± σ): 634.489 ns ± 2.608 μs ┊ GC (mean ± σ): 15.88% ± 3.83% ▁▅██▇▆▅▄▄▄▃▃▃▃▂▂▃▃▃▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁ ▁ ▂ ██████████████████████████████████████▇███▇▇▇▆▇▅▅▅▅▅▅▅▅▄▆▃▁▅ █ 499 ns Histogram: log(frequency) by time 721 ns < Memory estimate: 416 bytes, allocs estimate: 7.

Note that pc does not capture the result by default with parse.

julia> parse(!pc,s[end])ERROR: objects of type Nothing cannot be finalized

For large word collections, PCRE fails

julia> s = [ randstring(10) for _ in 1:10000 ];
julia> re = Regex(join(s,"|"));ERROR: PCRE compilation error: regular expression is too large at offset 109999

CombinedParsers displays good timings also .

julia> pc = Either(s);
julia> @benchmark Either(s);
julia> @benchmark match(pc,s[end])BenchmarkTools.Trial: 10000 samples with 193 evaluations. Range (min … max): 505.933 ns … 81.966 μs ┊ GC (min … max): 0.00% … 99.05% Time (median): 528.373 ns ┊ GC (median): 0.00% Time (mean ± σ): 628.487 ns ± 2.550 μs ┊ GC (mean ± σ): 12.81% ± 3.13% ▅█▅▁ ▁▂▇████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂ 506 ns Histogram: frequency by time 715 ns < Memory estimate: 416 bytes, allocs estimate: 7.

Either{Trie{Char}} can be used to perform fast text search in Julia.

Aho-Corasick algorithm?

The implementation could be expanded to implement Aho-Corasick algorithm.


This page was generated using Literate.jl.