A fast Trie-based parser for a collection of literal Strings.
Matching any one of many String
s (e.g. names) is slow in PCRE with |
alternatives.
julia> using Random
julia> s = [ randstring(10) for _ in 1:1000 ];
julia> re = Regex(join(s,"|"));
julia> using BenchmarkTools
julia> @benchmark match(re,s[end])
BenchmarkTools.Trial: 10000 samples with 9 evaluations. Range (min … max): 2.022 μs … 5.634 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 2.053 μs ┊ GC (median): 0.00% Time (mean ± σ): 2.087 μs ± 150.254 ns ┊ GC (mean ± σ): 0.00% ± 0.00% █▄▁ ███▇▆▆▅▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃ 2.02 μs Histogram: frequency by time 2.68 μs < Memory estimate: 240 bytes, allocs estimate: 4.
A optimized state-machine parser will perform much better.
TODO: Benchmark Automa.jl here
CombinedParsers
does not use state machines, but provides fast _iterate
implementation for Either{Trie{Char}}
, that is constructed with Either(::Vector{<:AbstractString})
julia> using CombinedParsers
julia> pc = Either(s);
julia> @benchmark match(pc,s[end])
BenchmarkTools.Trial: 10000 samples with 194 evaluations. Range (min … max): 498.912 ns … 70.833 μs ┊ GC (min … max): 0.00% … 98.69% Time (median): 515.570 ns ┊ GC (median): 0.00% Time (mean ± σ): 634.489 ns ± 2.608 μs ┊ GC (mean ± σ): 15.88% ± 3.83% ▁▅██▇▆▅▄▄▄▃▃▃▃▂▂▃▃▃▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁ ▁ ▂ ██████████████████████████████████████▇███▇▇▇▆▇▅▅▅▅▅▅▅▅▄▆▃▁▅ █ 499 ns Histogram: log(frequency) by time 721 ns < Memory estimate: 416 bytes, allocs estimate: 7.
Note that pc
does not capture the result by default with parse
.
julia> parse(!pc,s[end])
ERROR: objects of type Nothing cannot be finalized
For large word collections, PCRE fails
julia> s = [ randstring(10) for _ in 1:10000 ];
julia> re = Regex(join(s,"|"));
ERROR: PCRE compilation error: regular expression is too large at offset 109999
CombinedParsers
displays good timings also .
julia> pc = Either(s);
julia> @benchmark Either(s);
julia> @benchmark match(pc,s[end])
BenchmarkTools.Trial: 10000 samples with 193 evaluations. Range (min … max): 505.933 ns … 81.966 μs ┊ GC (min … max): 0.00% … 99.05% Time (median): 528.373 ns ┊ GC (median): 0.00% Time (mean ± σ): 628.487 ns ± 2.550 μs ┊ GC (mean ± σ): 12.81% ± 3.13% ▅█▅▁ ▁▂▇████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂ 506 ns Histogram: frequency by time 715 ns < Memory estimate: 416 bytes, allocs estimate: 7.
Either{Trie{Char}}
can be used to perform fast text search in Julia.
Aho-Corasick algorithm?
The implementation could be expanded to implement Aho-Corasick algorithm.
This page was generated using Literate.jl.