CombinedParsers.jl Documentation
CombinedParsers
— ModuleA package for combining parsers and transforming strings into julia types.
Compose parsers with the functional parser combinator paradigm, utilize Julia's type inferrence for transformations, log conveniently for debugging, and let Julia compile your parser for good performance.
Compose parsers parsimoneously within a functional parser combinator paradigm, utilize Julia's type inference for transformations, log conveniently for debugging, and let Julia compile your parser for performance.
CombinedParsers.jl
is currently a release candidate presented at JuliaCon2020. See the next steps section, if interested.
Package Features
Speed
- optimized julia
@generated function
s for parametric parser and state types (benchmarks in compliance tests) - fast trie-based scanning (example), compile with your custom parsing algorithm (example)
- often matches faster than C library
PCRE
regular expressions - memoization with
WithMemory
- lazy transformations of match states (for
Sequence
andRepeat
)
Simplicity
@syntax
defines parser and result construction without redundancy: Julia infersresult_type
(parser) inmap
.- AbstractTrees.jl printing in the REPL.
with_log
provides colored logging of the parsingwith_name
s.
Interoperability
- TextParse.jl can be composed with CombinedParsers and vice versa.
@re_str
provides pure Julia regular expressions as plug-in replacement forBase.@r_str
(PCRE pattern unit test set).
Generality
- Lazily
iterate
all valid parsings. - Higher-order parsers:
after
matching a left-hand parser a right-hand parser is constructed in arbitrary function of the left-hand match. - Parse binary data
Vector{UInt8}
and any sequence type supportinggetindex
,nextind
,prevind
methods (cf. bson example).
Getting started
CombinedParsers.jl
is a registered package. Install with
] add CombinedParsers
- The Overview provides a tutorial explaining how to get started using CombinedParsers.
- The User guide provides a summary of CombinedParsers types and constructors.
- Some examples of packages using CombinedParsers can be found on the Examples page.
- Matching and parsing
- Parser Templates
- Parser Construction
- composing with regular expressions
- Transformations
See the Index for the complete list of documented functions and types.
If you prefer a video introduction:
8-min JuliaCon2020 talk | 3h JuliaCon2021 workshop |
---|---|
Example: rational numbers arithmetics
Parsing is reading and transforming a sequence of characters. CombinedParsers
provides constructors to combine parsers and transform (sub-)parsings arbitrarily with julia syntax.
using CombinedParsers
using TextParse
This example reads and evaluates arithmetical terms for rational numbers. The following defines an evaluating parser for rational number terms as sequences of subterms interleaved with operators.
Subterms are Either
integer numbers, TextParse.Numeric(Int)
converted to Rational{Int}
, or subterms are written as parentheses around a nested term:
@syntax subterm = Either{Rational{Int}}(Any[TextParse.Numeric(Int)]; convert=true);
@syntax for parenthesis in subterm
mult = evaluate |> join(subterm, CharIn("*/"), infix=:prefix )
@syntax term = evaluate |> join(mult, CharIn("+-"), infix=:prefix )
Sequence(2,'(',term,')')
end;
The @syntax
definition in 5,5 lines is sufficient for parsing and evaluating arithmetics: Base.join
(x, delimiter; infix=:prefix)
is shorthand for Sequence
(x ,
Repeat
( delimiter * x ))
, and f |> parser
is shorthand for map
(f,parser)
. That's all! @syntax
registers a @term_string
macro for parsing and transforming:
julia> result_type(term)
Rational{Int64}
julia> term"(1+2)/5"
3//5
julia> # The defined `CombinedParser` `term` function # provides optional logging of the parsing process. term("1/((1+2)*4+3*(5*2))",log = [:parenthesis])
match parenthesis@4-9: 1/((1+2)*4+3*( ^___^ match parenthesis@14-19: *4+3*(5*2)) ^___^ match parenthesis@3-20: 1/((1+2)*4+3*(5*2)) ^_______________^ 1//42
Is every rational answer ultimately the inverse of a universal question in life?
Note: The evaluate
function definition is detailed in the full example.
julia> evaluate( (0, [ ('+',1), ('-',2) ]) )
-1//1
julia> evaluate( (1, [ ('*',4), ('/',3) ]) )
4//3
Useful Design
- WikitextParser.jl is a
CombinedParser
for parsing wikitext syntax, quite comprehensibly and representing Wikipedia articles within Julia. - OrgmodeParser.jl is a
CombinedParser
for parsing main org mode syntax, representing org files within Julia. - CombinedParserTools.jl is currently more or less my own workspace to provide a set of re-useable parsers, used in
WikitextParser
. - Tries.jl is the abstract implementation of the fast prefix-tree matching in
CombinedParsers
(see docs) - ReversedStrings.jl implements lazy lazy
String
transformations andreverse
(similar to LazyArrays.jl)
If you want to work with any of these open source packages, I will gladly provide professional support. If you are writing your own recursive CombinedParser
and seek inspiration, you might find these comprehensive examples interesting. (currently α release, so beware, dragons!)
The CombinedParsers
design
- is fast due to Julia parametric types, and compiler optimizations with generated functions,
- its strictly typed parsing defines the domain data types,
- is composable and optimizable with Julia method dispatch,
- provides flexible public API for parsing, matching, iteration
Making Julia parametric types central for the parser design equally allows automation of the data pipeline after parsing.
FilingForest demonstrates indexing the German Wiktionary into a columnar database, with fast selecting and measuring. I am finishing the write-up of Wiktionary data parsing into a language graph database including:
- fast db-indexing of text streams (e.g. logging): If you need support indexing logging streams into a (SQL-)Database, the (currently) proprietary TypeGraphs.jl provides
CombinedParsers
plug and play: Table schemas are infered from your parser. - fast out-of core data science/AI on your parsed data: If you need support with storing parsed data in optimized memory-mapped JuliaDB, TypeDB.jl provides
CombinedParsers
plug and play. - fast scientific measurements in a data graph: FilingForest IA.jl provides
CombinedParsers
plug and play: even for recursively nested data.
All (currently) proprietary packages are default-over-configuration for fast integration, and are in active development.
- fast HTTP-serving of parsed data: If you need support with a parsing server-client infrastructure, the (currently) proprietary GraphQLAlchemy.jl provides
CombinedParsers
plug and play: GraphQL schemas and resolver are infered from your parser.
Optimization Strategy
With the C PCRE2 library testset, and for 58% of patterns, CombinedParsers
match faster than Regex
(first 100 pattern). C PCRE2 optimized is among the fastest regex libraries (second behind Rust, running mariomka's benchmark will position CombinedParser among its competition). Explorations for optimization are in git branches.
All benchmarks are wrong, but some are useful - Szilard, benchm-ml
The package is maturing, and optimization is ongoing. If you are interested in and able to dive deeper into the Julia memory layout and compiler, I would gladly collaborate on further optimizations:
- String layout: Parsing requires repeated Char comparisons. In UTF8, frequent characters are encoded shorter (8 bit), rare have longer codes. For this reason, in Julia
String
indices are not consecutive and transversal requires using infamousnextind
andprevind
. Profiling:leftof
andrightof
comsume considerable time.CombinedParsers.NCodeunitsState
can speed up that traversalCombinedParsers
currently operates on the result ofgetindex(::String,index)::Char
(technically oniterate(::String,index)::Tuple{Char,Int}
). Could matching use the raw byte representation directly?
- Macros: make all iteration
@generated
functions using expressions generated by a dispatchediterate_expression
that can be used in a macro@iterate
to generate an unrolled/unnested iteration code. (Profiling hints that function calls do hardly contribute to runtime.)
Acknowledgements
This package is enabled only due to the Julia's compiler and superior type system. Thankfully: a really concise language for powerful computing!
I am thankful for contributions and inspiration from many great packages:
TextParse.jl
A bunch of fast text parsing tools, used in CSV.jl
CombinedParsers
composes with TextParse.jl both ways (CombinedParser <: TextParse.AbstractToken
and provides a method for CombinedParsers.tryparsenext
)
Inspirations
- The work was strongly inspired by the great Scala fastparse package, and also the elm parser.
- Parsers.jl, a collection of parsers for date and primitive types, inspired the
parse
methods. - Automa.jl, a Julia package for text validation, parsing, and tokenizing based on state machine compiler. The package compiles deterministic finite automata. (Currently there is no inter-operation possible, because in
Automa
processing of parsed tokens is done with actions). - ParserCombinator.jl was a great inspiration. Yet I decided for a new design with a focus on transformations and type inference with parametric types, instead of basing this work off
ParserCombinator
, written before 2016 (and fixed for Julia 1.0 in 2018).CombinedParsers
integrates into the Julia 1.0 Iteration API, smallUnion{Nothing,T} where T
types instead of using Nullables, compiler optimizations and generated functions. I want to provide benchmarks comparisons withParserCombinator.jl
.
Next Steps
- [ ] Syntax freeze – your comments are appreciated!
- [ ] decide for a error tracing strategy, backtracking. If you want to collaborate on stepping & debugging, please reach out to me.
- [ ] Performance optimizations
- [ ] streaming
- [ ] test coverage underestimated (PCRE tests are not included in travis)
- [ ]
Contributing and Questions
Contributions and feedback are very welcome, especially regarding brief syntax and constructor dispatch. Please open an issue if you encounter any problems or would just like to ask a question, or contact me at mail@g-kappler.de.
Outline
Manual
- Overview
- User Guide
- Basics
- Character Sets
- Sequence
- Either
- Repeat
- Optional
- Lazy repetitions and optional parsers
- Assertions
- Atomic groups
- A fast Trie-based parser for a collection of literal Strings.
Examples
- Names and Adresses
- Number lists (wikitext references)
- Number ranges
- Numbers
- Joining numbers and ranges
- Inclusion in a wikitext parser
- PCRE papercuts when parsing number sequences
- Regular Expressions
CombinedParser
- What is a regular expression?
- Characters and Escaped Characters
- Repeated patterns
Either
- Repeatable patterns,
Optional
,Lazy
andAtomic
Sequence
s and AlternationsCapture
sLookahead
andLookbehind
- Regular Expression Brackets
Palindromes<:CombinedParser
: a Tutorial for writing your combinable Parser- 1. Regular Expression
- 2. A non-word skipping
Palindrome<:CombinedParser
- Parsing strategy
- Subtyping
<: CombinedParser{STATE,RESULT}
. - Matching:
CombinedParsers._iterate
Base.prevind
andBase.nextind
match
andget
- Iterating through matches
- Performance Optimization
- Padding and combining
- Next...
- JSON
Library API
- Using
CombinedParsers
- Parser Templates
- Composing with
TextParse
- Constants and Conversion
- Parser Building Blocks
- Predefined Parsers
- Predefined Assertions
- Constructing Parsers
- Character Matchers
- Repeating
- Atomic
- Sequences
- Recursive Parsers with
Either
- Parser generating parsers
- Assertions
- Logging and Side-Effects
- other
Transformation
s to anyresult_type
- PCRE Regular expressions
- Constructing Regular expressions
- Compatibility & Unit Tests
- CombinedParsers.Regexp
- Parsing Options
- Regular Expression Types
- Internal API
- Internal Types
Index
CombinedParsers.match_all
CombinedParsers.parse_all
CombinedParsers.tryparse_pos
CombinedParsers.DateParser
CombinedParsers.DateTimeParser
CombinedParsers.caseless
CombinedParsers.integer_base
CombinedParsers.parser
CombinedParsers.trim
CombinedParsers.wrap
CombinedParsers.NumericParser
CombinedParsers.AnyChar
CombinedParsers.CharIn
CombinedParsers.CharNotIn
CombinedParsers.Delayed
CombinedParsers.Lookahead
CombinedParsers.Lookbehind
CombinedParsers.Repeat1
CombinedParsers.Repeat_stop
CombinedParsers.Repeat_until
CombinedParsers._copy
CombinedParsers._ismatch
CombinedParsers.after
CombinedParsers.defaultvalue
CombinedParsers.either_result_type
CombinedParsers.flatten_valuepatterns
CombinedParsers.ismatch
CombinedParsers.log_names
CombinedParsers.log_parser
CombinedParsers.sSequence
CombinedParsers.sequence_result_type
CombinedParsers.sequence_state_type
CombinedParsers.substitute
CombinedParsers.with_effect
CombinedParsers.with_log
CombinedParsers.with_name
CombinedParsers.Always
CombinedParsers.AnyValue
CombinedParsers.AtEnd
CombinedParsers.AtStart
CombinedParsers.Atomic
CombinedParsers.Bytes
CombinedParsers.Either
CombinedParsers.FlatMap
CombinedParsers.Lazy
CombinedParsers.MappedSequenceParser
CombinedParsers.MemoizingParser
CombinedParsers.NamedParser
CombinedParsers.NegativeLookahead
CombinedParsers.NegativeLookbehind
CombinedParsers.Never
CombinedParsers.Optional
CombinedParsers.PositiveLookahead
CombinedParsers.PositiveLookbehind
CombinedParsers.Repeat
CombinedParsers.Sequence
CombinedParsers.UnicodeClass
CombinedParsers.ValueIn
CombinedParsers.ValueMatcher
CombinedParsers.ValueNotIn
CombinedParsers.WithMemory
CombinedParsers._deepmap
CombinedParsers.deepmap
CombinedParsers.dodeepmap
CombinedParsers.infer_result_type
CombinedParsers.result_type
CombinedParsers.Constant
CombinedParsers.IndexAt
CombinedParsers.MatchRange
CombinedParsers.MatchedSubSequence
CombinedParsers.Transformation
CombinedParsers._iterate
CombinedParsers.regex_escape
CombinedParsers.Regexp.Regcomb
CombinedParsers.Regexp.index
CombinedParsers.Regexp.on_options
CombinedParsers.Regexp.parse_options
CombinedParsers.Regexp.subroutine_index_reset
CombinedParsers.Regexp.with_options
CombinedParsers.Regexp.Backreference
CombinedParsers.Regexp.Capture
CombinedParsers.Regexp.CharWithOptions
CombinedParsers.Regexp.Conditional
CombinedParsers.Regexp.DupSubpatternNumbers
CombinedParsers.Regexp.FilterOptions
CombinedParsers.Regexp.MatchingNever
CombinedParsers.Regexp.OnOptionsParser
CombinedParsers.Regexp.ParserOptions
CombinedParsers.Regexp.ParserWithCaptures
CombinedParsers.Regexp.SequenceWithCaptures
CombinedParsers.Regexp.StringWithOptions
CombinedParsers.Regexp.Subroutine
CombinedParsers._deepmap_parser
CombinedParsers._iterate
CombinedParsers._leftof
CombinedParsers._rightof
CombinedParsers.deepmap_parser
CombinedParsers.leftof
CombinedParsers.parsematch_tuple
CombinedParsers.print_constructor
CombinedParsers.regex_inner
CombinedParsers.regex_prefix
CombinedParsers.regex_string
CombinedParsers.regex_suffix
CombinedParsers.reinfer
CombinedParsers.rightof
CombinedParsers.state_type
CombinedParsers.strip_either1
CombinedParsers.tuple_pos
CombinedParsers.tuple_state
CombinedParsers.Assertion
CombinedParsers.CombinedParser
CombinedParsers.ConstantParser
CombinedParsers.FilterParser
CombinedParsers.LeafParser
CombinedParsers.MatchState
CombinedParsers.MatchesIterator
CombinedParsers.MemoTreeChildren
CombinedParsers.NCodeunitsState
CombinedParsers.NIndexParser
CombinedParsers.NoMatch
CombinedParsers.ParseMatch
CombinedParsers.WrappedAssertion
CombinedParsers.WrappedParser
CombinedParsers.Regexp.NoDict