Constructing Parsers

Character Matchers


Parser matching exactly one x::T, returning the value.

julia> AnyChar()
. AnyValue

Bytes{N,T} <: NIndexParser{N,T}

Fast parsing of a fixed number N of indices. The parsed vector is converted to T with reinterpret(T,match)[1] if T is a bits type (isbitstype), or with the constructor T(match) otherwise.

Provide Base.get(parser::Bytes{N,T}, sequence, till, after, i, state) where {N,T} for custom conversion.


Endianness can be handled by mapping bswap over the parser:

julia> map(bswap, Bytes(2,UInt16))([0x16,0x11])

julia> Bytes(2,UInt16)([0x16,0x11])
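
As a rough illustration of what mapping bswap changes, the following sketch uses only Base Julia and does not call CombinedParsers. The variable names are illustrative, and the concrete values are asserted only on a little-endian host.

```julia
# Plain Base Julia; no CombinedParsers involved.
bytes = [0x16, 0x11]

# Reinterpret the two bytes as one UInt16 in host byte order.
host_order = reinterpret(UInt16, bytes)[1]

# bswap reverses the byte order of the value.
swapped = bswap(host_order)

# Swapping twice restores the original value.
@assert bswap(swapped) == host_order

# On a little-endian host the first byte is the low byte.
if ENDIAN_BOM == 0x04030201
    @assert host_order == 0x1116
    @assert swapped == 0x1611
end
```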

Used in ValueIn and ValueNotIn; succeeds if the character at the cursor is in one of the Unicode classes.

julia> match(ValueIn(:L), "aB")

julia> match(ValueIn(:Lu), "aB")

julia> match(ValueIn(:N), "aA1")
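
For intuition only, the class symbols in these examples can be compared with Base Julia character predicates. The pairing below (:L with isletter, :Lu with isuppercase, :N with isnumeric) is an assumption made for illustration and is only approximate; it is not how the package implements the classes.

```julia
# Base predicates as approximate stand-ins for the class symbols (assumption).
@assert findfirst(isletter, "aB") == 1      # first letter is 'a'
@assert findfirst(isuppercase, "aB") == 2   # first uppercase letter is 'B'
@assert findfirst(isnumeric, "aA1") == 3    # first numeric character is '1'
```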

Supported Unicode classes

julia> for (k,v) in CombinedParsers.unicode_class
         println(":",k, " is a ",v[1],", ", v[2],".")
       end
:L is a Letter, any kind of letter from any language.
:Ll is a Lowercase Letter, a lowercase letter that has an uppercase variant.
:Lu is a Uppercase Letter, an uppercase letter that has a lowercase variant.
:Lt is a Titlecase Letter, a letter that appears at the start of a word when only the first letter of the word is capitalized.
:L& is a Cased Letter, a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
:Lm is a Modifier Letter, a special character that is used like a letter.
:Lo is a Other Letter, a letter or ideograph that does not have lowercase and uppercase variants.
:M is a Mark, a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
:Mn is a Non Spacing Mark, a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
:Mc is a Spacing Combining Mark, a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
:Me is a Enclosing Mark, a character that encloses the character it is combined with (circle, square, keycap, etc.).
:Z is a Separator, any kind of whitespace or invisible separator.
:Zs is a Space Separator, a whitespace character that is invisible, but does take up space.
:Zl is a Line Separator, line separator character U+2028.
:Zp is a Paragraph Separator, paragraph separator character U+2029.
:S is a Symbol, math symbols, currency signs, dingbats, box-drawing characters, etc..
:Sm is a Math Symbol, any mathematical symbol.
:Sc is a Currency Symbol, any currency sign.
:Sk is a Modifier Symbol, a combining character (mark) as a full character on its own.
:So is a Other Symbol, various symbols that are not math symbols, currency signs, or combining characters.
:N is a Number, any kind of numeric character in any script.
:Nd is a Decimal Digit Number, a digit zero through nine in any script except ideographic scripts.
:Nl is a Letter Number, a number that looks like a letter, such as a Roman numeral.
:No is a Other Number, a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
:P is a Punctuation, any kind of punctuation character.
:Pc is a Connector Punctuation, a punctuation character such as an underscore that connects words.
:Pd is a Dash Punctuation, any kind of hyphen or dash.
:Ps is a Open Punctuation, any kind of opening bracket.
:Pe is a Close Punctuation, any kind of closing bracket.
:Pi is a Initial Punctuation, any kind of opening quote.
:Pf is a Final Punctuation, any kind of closing quote.
:Po is a Other Punctuation, any kind of punctuation character that is not a dash, bracket, quote or connector.
:C is a Other, invisible control characters and unused code points.
:Cc is a Control, an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
:Cf is a Format, invisible formatting indicator.
:Cs is a Surrogate, one half of a surrogate pair in UTF-16 encoding.
:Co is a Private Use, any code point reserved for private use.
:Cn is a Unassigned, any code point to which no character has been assigned.

Parser matching exactly one element c (character) in a sequence, iff _ismatch(c,x).

julia> a_z = ValueIn('a':'z')
[a-z] ValueIn

julia> parse(a_z, "a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> ac = CharIn("ac")
[ac] ValueIn

julia> parse(ac, "c")
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> l = CharIn(islowercase)
[islowercase(...)] ValueIn

julia> parse(l, "c")
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

ValueNotIn{T}(label::AbstractString, x)

Parser matching exactly one element (character) in a sequence, iff not in x.

ValueNotIn([label::AbstractString="", ]x...)
ValueNotIn{T}([label::AbstractString="", ]x...)

Flattens x with CombinedParsers.flatten_valuepatterns, and tries to infer T if not provided.

julia> a_z = CharNotIn('a':'z')
[^a-z] ValueNotIn

julia> ac = CharNotIn("ca")
[^ca] ValueNotIn

Respects boolean logic:

julia> CharNotIn(CharNotIn("ab"))("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> CharIn(CharIn("ab"))("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> CharIn(CharNotIn("bc"))("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> parse(CharNotIn(CharIn("bc")), "a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

_ismatch(x::Char, set::Union{Tuple,Vector})::Bool

Return _ismatch(x,set...).

_ismatch(x, f, r1, r...)

Check if x matches any of the options f, r1,r...: If ismatch(x,f) return true, otherwise return _ismatch(x, r1, r...).


returns false (out of options)

_ismatch(x, p)

returns x==p


returns p(c)




returns c in p
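
The matching rules above can be sketched with ordinary multiple dispatch. The names matches and matches_any are hypothetical stand-ins chosen for this sketch; they are not the package's _ismatch methods, and the set of container types handled is an assumption.

```julia
# Hypothetical stand-ins for the rules described above; not the package's code.
matches(x, p) = x == p                                   # plain value: equality
matches(x, p::Function) = p(x)                           # predicate: call it
matches(x, p::Union{AbstractRange,AbstractSet,AbstractString}) = x in p  # container: membership
matches(x, set::Union{Tuple,Vector}) = matches_any(x, set...)           # several options

matches_any(x) = false                                   # out of options
matches_any(x, f, r...) = matches(x, f) || matches_any(x, r...)

options = ('a':'c', isuppercase, '_')
@assert matches('b', options)    # in the range
@assert matches('B', options)    # satisfies the predicate
@assert matches('_', options)    # equal to the literal
@assert !matches('z', options)   # no option matches
```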

Base.broadcasted(::typeof((&)), x::ValueNotIn, y::ValueNotIn)

Character matchers m like Union{ValueIn,ValueNotIn,T}, or any type T providing an ismatch(m::T,c::Char)::Bool method, represent a "sparse" bitarray for all characters.


Please consider the broadcast API a draft; comments are welcome.

julia> CharNotIn("abc") .& CharNotIn("z")
[^abcz] ValueNotIn

julia> CharIn("abc") .& CharNotIn("c")
[ab] ValueIn
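
The two results above agree with ordinary set algebra, which can be checked with Base Julia sets. This is only a sketch of the semantics using Set{Char}; the helper matches_both is hypothetical and the package's representation may differ.

```julia
# ValueIn .& ValueNotIn behaves like a set difference.
@assert setdiff(Set("abc"), Set("c")) == Set("ab")

# ValueNotIn .& ValueNotIn excludes the union of both exclusion sets.
excluded = union(Set("abc"), Set("z"))
@assert excluded == Set("abcz")

# Hypothetical helper: a character passes only if neither matcher excludes it.
matches_both(c) = !(c in excluded)
@assert matches_both('d')
@assert !matches_both('z')
```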

Used in ValueMatcher constructors.

Heuristic is roughly:

  • collect ElementIterators in a Set
  • collect everything else in a Tuple (Functions etc.)
  • in the process the label is concatenated
  • return all that was collected as Tuple{String, <:Set, <:Tuple} or Tuple{String, <:Set} or Tuple{String, <:Tuple}.


Repeat(minmax::UnitRange, x...)
Repeat(x...; min=0,max=Repeat_max)
Repeat(min::Integer, x...)
Repeat(min::Integer,max::Integer, x...)

Parser repeating pattern x min:max times.

julia> Repeat(2,2,'a')
a{2}  |> Repeat

julia> Repeat(3,'a')
a{3,}  |> Repeat

(|)(x::AbstractToken{T}, default::Union{T,Missing})

Operator syntax for Optional(x, default=default).

julia> parser("abc") | "nothing"
|🗄 Either
├─ abc
└─ nothing

julia> parser("abc") | missing
abc? |missing
::Union{Missing, SubString{String}}

Parser repeating pattern x one time or more.


Abbreviation for map(f,Repeat1(a...)).


Parser that always succeeds. If parser succeeds, return the result of parser with the cursor behind the match. If parser does not succeed, return default with the cursor unchanged.

julia> match(r"a?","b")

julia> parse(Optional("a", default=42),"b")
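
The "always succeeds" behaviour can be sketched without the package. The helper optional_prefix below is hypothetical and only mimics the description: it returns the matched prefix and the index behind it, or the default with the index unchanged.

```julia
# Hypothetical helper returning (result, next_index); not part of CombinedParsers.
function optional_prefix(s, prefix; default)
    startswith(s, prefix) ? (prefix, ncodeunits(prefix) + 1) : (default, 1)
end

@assert optional_prefix("b", "a"; default=42) == (42, 1)    # no match: default, cursor unchanged
@assert optional_prefix("ab", "a"; default=42) == ("a", 2)  # match: result, cursor behind match

# The regular expression a? also succeeds on "b", with an empty match.
@assert match(r"a?", "b").match == ""
```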

Lazy x repetition matching (instead of default greedy).

julia> german_street_address = !Lazy(Repeat(AnyChar())) * Repeat1(' ') * TextParse.Numeric(Int)
🗄 Sequence
├─ .*? AnyValue |> Repeat |> Lazy |> !
├─ \ +  |> Repeat
└─ <Int64>
::Tuple{SubString{String}, Vector{Char}, Int64}

julia> german_street_address("Konrad Adenauer Allee    42")
("Konrad Adenauer Allee", [' ', ' ', ' ', ' '], 42)
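
The effect of lazy repetition on this input can be reproduced with Base Julia regular expressions. This is a comparison sketch, not the package API: the lazy quantifier stops before the run of spaces, while the greedy one keeps all but one of them.

```julia
address = "Konrad Adenauer Allee    42"

lazy   = match(r"^(.*?) +(\d+)$", address)   # lazy: as few characters as possible
greedy = match(r"^(.*) +(\d+)$", address)    # greedy: as many characters as possible

@assert lazy.captures[1] == "Konrad Adenauer Allee"
@assert greedy.captures[1] == "Konrad Adenauer Allee   "   # three trailing spaces kept
@assert parse(Int, lazy.captures[2]) == 42
```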

PCRE @re_str

julia> re"a+?"
a+?  |> Repeat |> Lazy

julia> re"a??"
a?? |missing |> Lazy
::Union{Missing, Char}

Repeat_stop(p,stop; min=0, max=Repeat_max)

Repeat p until stop (NegativeLookahead), not matching stop. Sets the cursor before stop. Tries min:max times. Returns the results of p.

julia> p = Repeat_stop(AnyChar(),'b') * AnyChar()
🗄 Sequence
├─ 🗄* Sequence[2] |> Repeat
│  ├─ (?!b) NegativeLookahead
│  └─ . AnyValue
└─ . AnyValue
::Tuple{Vector{Char}, Char}

julia> parse(p,"acbX")
(['a', 'c'], 'b')

See also NegativeLookahead
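
The parser tree shown above, a negative lookahead followed by any character and repeated, has a direct regular-expression counterpart. The sketch below uses only Base Julia regexes to mirror the "acbX" example; it is an analogy, not the package implementation.

```julia
# (?:(?!b).)* repeats "any character that does not start a match of b".
m = match(r"^((?:(?!b).)*)(.)", "acbX")

@assert m.captures[1] == "ac"   # repetition stops before 'b'
@assert m.captures[2] == "b"    # the cursor was left in front of 'b'
```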

Repeat_until(p,until, with_until=false; wrap=identity, min=0, max=Repeat_max)

Repeat p until the parser until matches (with Repeat_stop), and set the cursor after the match of until.

Return a Vector{result_type(p)} if with_until==false, otherwise a Tuple{Vector{result_type(p)},result_type(until)}.

To transform the Repeat_stop(p) parser head, provide a function(::Vector{result_type(p)}) in wrap keyword argument, e.g.

julia> p = Repeat_until(AnyChar(),'b') * AnyChar()
🗄 Sequence
├─ 🗄 Sequence[1]
│  ├─ (?>🗄*) Sequence[2] |> Repeat |> Atomic
│  │  ├─ (?!b) NegativeLookahead
│  │  └─ . AnyValue
│  └─ b
└─ . AnyValue
::Tuple{Vector{Char}, Char}

julia> parse(p,"acbX")
(['a', 'c'], 'X')

julia> parse(Repeat_until(AnyChar(),'b';wrap=MatchedSubSequence),"acbX")

See also NegativeLookahead

Base.join(x::Repeat,delim, infix=:skip)

Parser matching repeated x.parser separated by delim.

julia> parse(join(Repeat(AnyChar()),','),"a,b,c")
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> parse(join(Repeat(AnyChar()),',';infix=:prefix),"a,b,c")
('a', [(',', 'b'), (',', 'c')])

julia> parse(join(Repeat(AnyChar()),',';infix=:suffix),"a,b,c")
([('a', ','), ('b', ',')], 'c')

Base.join(x::CombinedParser,delim; kw...)

Shorthand for join(Repeat(x),delim; kw...).

Base.join(f::Function, x::CombinedParser, delim; kw...)

Shorthand for map(f,join(x,delim; kw...)).



A parser matching p, and failing when required to backtrack (behaving like an atomic group in regular expressions).
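
Since the behaviour is described by analogy with atomic groups, a Base Julia regex comparison may help. This sketch does not use the package; it only shows that an atomic group refuses to give back characters it has already consumed.

```julia
# Without an atomic group, a+ backtracks by one 'a' so that "ab" can match.
@assert occursin(r"^a+ab$", "aaab")

# With an atomic group, a+ keeps all three 'a's and the overall match fails.
@assert !occursin(r"^(?>a+)ab$", "aaab")
```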



of parts::P, sequence_state_type==S with sequence_result_type==T.

Sequence(parts::CombinedParser...; tuplestate=true)

of parts, sequence_state_type(p; tuplestate=tuplestate) with sequence_result_type.

Sequences can alternatively be created with *:

julia> german_street_address = !Repeat(AnyChar()) * ' ' * TextParse.Numeric(Int)
🗄 Sequence
├─ .* AnyValue |> Repeat |> !
├─ \
└─ <Int64>
::Tuple{SubString{String}, Char, Int64}

julia> german_street_address("Some Avenue 42")
("Some Avenue", ' ', 42)

Indexing (transformation) can be defined with

julia> e1 = Sequence(!Repeat(AnyChar()), ' ',TextParse.Numeric(Int))[1]
🗄 Sequence[1]
├─ .* AnyValue |> Repeat |> !
├─ \
└─ <Int64>

julia> e1("Some Avenue 42")
"Some Avenue"

State is managed as sequence_state_type(parts; tuplestate). Override it to optimize state types for special cases.

(*)(x::Any, y::AbstractToken)
(*)(x::AbstractToken, y::Any)
(*)(x::AbstractToken, y::AbstractToken)

Chain parsers in sSequence. See also @seq.


Simplifying Sequence: flattens nested Sequences and removes Always assertions.

julia> Sequence('a',CharIn("AB")*'b')
🗄 Sequence
├─ a
└─ 🗄 Sequence
   ├─ [AB] ValueIn
   └─ b
::Tuple{Char, Tuple{Char, Char}}

julia> sSequence('a',CharIn("AB")*'b')
🗄 Sequence
├─ a
├─ [AB] ValueIn
└─ b
::Tuple{Char, Char, Char}

See also Sequence


This function will be removed and replaced with a keyword argument


Create a sequence interleaved with whitespace (horizontal or vertical). The result_type omits the whitespace.

sequence_state_type(pts::Type; tuplestate=true)
  • MatchState if all fieldtypes are MatchState,
  • otherwise if tuplestate, a tuple type with the state_type of parts,
  • or Vector{Any} if !tuplestate.

Todo: NCodeunitsState instead of MatchState might increase performance.

Recursive Parsers with Either

Either{S,T}(p) where {S,T} = new{typeof(p),S,T}(p)

Parser that tries matching the provided parsers in order, accepting the first match, and fails if all parsers fail.

This parser has no == and hash methods because it can recurse.

julia> match(r"a|bc","bc")

julia> parse(Either("a","bc"),"bc")

julia> parse("a" | "bc","bc")

(|)(x::AbstractToken, y)
(|)(x, y::AbstractToken)
(|)(x::AbstractToken, y::AbstractToken)

Operator syntax for Either(x, y; simplify=true).

julia> 'a' | CharIn("AB") | "bc"
|🗄 Either
├─ a
├─ [AB] ValueIn
└─ bc
::Union{Char, SubString{String}}

@syntax name = expr

Convenience macro defining a CombinedParser name=expr and custom parsing macro @name_str.

julia> @syntax a = AnyChar();

julia> a"char"
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

@syntax for name in either; expr; end

Parser expr is added to either with pushfirst!. If either is undefined, it will be created. If either == :text || either == Symbol(:) the parser will be added to the CombinedParser_globals variable in your module.

julia> @syntax street_address = Either(Any[]);

julia> @syntax for german_street_address in street_address
            Sequence(!!Repeat(AnyChar()),
                     " ",
                     TextParse.Numeric(Int)) do v
                (street = v[1], no=v[3])
            end
       end
🗄 Sequence |> map(#50) |> with_name(:german_street_address)
├─ .* AnyValue |> Repeat |> ! |> map(intern)
├─ \
└─ <Int64>
::NamedTuple{(:street, :no), Tuple{String, Int64}}

julia> german_street_address"Some Avenue 42"
(street = "Some Avenue", no = 42)

julia> @syntax for us_street_address in street_address
            Sequence(TextParse.Numeric(Int),
                     " ",
                     !!Repeat(AnyChar())) do v
                (street = v[3], no=v[1])
            end
       end
🗄 Sequence |> map(#52) |> with_name(:us_street_address)
├─ <Int64>
├─ \  
└─ .* AnyValue |> Repeat |> ! |> map(intern)
::NamedTuple{(:street, :no), Tuple{String, Int64}}

julia> street_address"50 Oakland Ave"
(street = "Oakland Ave", no = 50)

julia> street_address"Oakland Ave 50"
(street = "Oakland Ave", no = 50)

Define a parser substitution.


Apply parser substitution, respecting scope in the defined tree:

  • Parser variables are defined within scope of Eithers, for all its NamedParser options.
  • Substitution parsers are replaced with parser variables.
  • strip_either1 is used to simplify in a second phase.

Substitution implementation is experimental pending feedback.

todo: scope NamedParser objects in WrappedParser, Sequence, etc.?

julia> Either(:a => !Either(
                 :b => "X", 
                 :d => substitute(:b),
              :b => "b",
              :c => substitute(:b)
              ) |> substitute
|🗄 Either
├─ |🗄 Either |> ! |> with_name(:a)
│  ├─ X  |> with_name(:b)
│  ├─ X  |> with_name(:b) |> with_name(:d)
│  └─ b  |> with_name(:b) |> with_name(:c)
├─ b  |> with_name(:b)
└─ b  |> with_name(:b) |> with_name(:c)


With substitute you can write recursive parsers in a style inspired by (E)BNF. CombinedParsers.BNF.ebnf uses substitute.

julia> def = Either(:integer => !Either("0", Sequence(Optional("-"), substitute(:natural_number))),
                    :natural_number => !Sequence(substitute(:nonzero_digit), Repeat(substitute(:digit))),
                    :nonzero_digit => re"[1-9]",
                    :digit => Either("0", substitute(:nonzero_digit)))
|🗄 Either
├─ |🗄 Either |> ! |> with_name(:integer)
│  ├─ 0 
│  └─ 🗄 Sequence
│     ├─ \-? |
│     └─  natural_number call substitute!
├─ 🗄 Sequence |> ! |> with_name(:natural_number)
│  ├─  nonzero_digit call substitute!
│  └─ * digit call substitute! |> Repeat
├─ [1-9] ValueIn |> with_name(:nonzero_digit)
└─ |🗄 Either |> with_name(:digit)
   ├─ 0 
   └─  nonzero_digit call substitute!
::Union{Nothing, Char, SubString{String}}

julia> substitute(def)
|🗄 Either
├─ |🗄 Either |> ! |> with_name(:integer)
│  ├─ 0 
│  └─ 🗄 Sequence
│     ├─ \-? |
│     └─ 🗄 Sequence |> ! |> with_name(:natural_number) # branches hidden
├─ 🗄 Sequence |> ! |> with_name(:natural_number)
│  ├─ [1-9] ValueIn |> with_name(:nonzero_digit)
│  └─ |🗄* Either |> with_name(:digit) |> Repeat
│     ├─ 0 
│     └─ [1-9] ValueIn |> with_name(:nonzero_digit)
├─ [1-9] ValueIn |> with_name(:nonzero_digit)
└─ |🗄 Either |> with_name(:digit)
   ├─ 0 
   └─ [1-9] ValueIn |> with_name(:nonzero_digit)
::Union{Char, SubString{String}}

Base.push!(x::Either, option)

Push option to the end of x.options, so that it is tried after the existing options of x have failed.

Recursive parsers can be built with push! to Either.

See also pushfirst! and @syntax.

Base.push!(x::WrappedParser{<:Either}, option)

Push option to x.options of repeated inner parser.

Base.pushfirst!(x::Either, option)

Push option to the front of x.options, so that it is tried first; the existing options of x are tried if option fails.

Recursive parsers can be built with pushfirst! to Either.

See also push! and @syntax.

Base.pushfirst!(x::WrappedParser{<:Either}, option)

Push option as first x.options of repeated inner parser.
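
The idea of building a recursive parser by mutating a list of options can be sketched without the package. Everything below (options, either, lit, seq) is a hand-rolled stand-in invented for this illustration, not the CombinedParsers API. A "parser" here is a function (s, i) returning the index after the match, or nothing.

```julia
# Hand-rolled stand-ins, not the CombinedParsers API.
options = Function[]            # plays the role of x.options

# Try each option in order and accept the first match.
function either(s, i)
    for p in options
        j = p(s, i)
        j === nothing || return j
    end
    return nothing
end

# Match a single literal character.
lit(c) = (s, i) -> (i <= lastindex(s) && s[i] == c) ? nextind(s, i) : nothing

# Match all parsers one after another.
function seq(ps...)
    return function (s, i)
        for p in ps
            i = p(s, i)
            i === nothing && return nothing
        end
        return i
    end
end

push!(options, lit('x'))                              # base case
pushfirst!(options, seq(lit('('), either, lit(')')))  # recursive case, tried first

@assert either("((x))", 1) == 6         # whole input consumed
@assert either("((x)", 1) === nothing   # unbalanced input fails
```

The recursive option refers to either before all options exist, which is why the list is filled after the fact.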

Parser generating parsers


Like Scala's fastparse FlatMap

julia> saying(v) = v == "same" ? v : "different";

julia> p = after(saying, String, "same"|"but")
🗄 FlatMap
├─ |🗄 Either
│  ├─ same 
│  └─ but 
└─ saying

julia> p("samesame")

julia> p("butdifferent")



Parser succeeding if and only if at index 1 with result_type AtStart.

julia> AtStart()

Parser succeeding if and only if at last index with result_type AtEnd.

julia> AtEnd()

Assertion parser matching always and not consuming any input. Returns Always().

julia> Always()

Look behind


Parser that succeeds if and only if parser succeeds before cursor. Consumes no input. The match is returned. Useful for checks like "must be preceded by parser, don't consume its match".


Parser that succeeds if and only if parser does not succeed before cursor. Consumes no input. nothing is returned as match. Useful for checks like "must not be preceded by parser, don't consume its match".

julia> la=NegativeLookbehind("keep")

julia> parse("peek"*la,"peek")
("peek", re"(?<!keep)")
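
For comparison, the same look-behind assertions exist in Base Julia regular expressions. The sketch below is an analogy using regexes only; it does not call the package.

```julia
# Negative look-behind: succeeds because the text before the cursor is "peek", not "keep".
@assert match(r"peek(?<!keep)", "peek").match == "peek"

# It fails when the text before the cursor is exactly "keep".
@assert match(r"keep(?<!keep)", "keep") === nothing

# Positive look-behind: "ek" must be preceded by "pe", which is not consumed.
@assert match(r"(?<=pe)ek", "peek").match == "ek"
```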

Look ahead


Parser that succeeds if and only if parser succeeds, but consumes no input. The match is returned. Useful for checks like "must be followed by parser, but don't consume its match".

julia> la=PositiveLookahead("peek")

julia> parse(la*AnyChar(),"peek")
("peek", 'p')

Parser that succeeds if and only if parser does not succeed, but consumes no input. parser is returned as match. Useful for checks like "must not be followed by parser, don't consume its match".

julia> la = NegativeLookahead("peek")

julia> parse(la*AnyChar(),"seek")
(re"(?!peek)", 's')
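
The look-ahead examples above have direct counterparts in Base Julia regular expressions. This sketch uses regexes only, as an analogy; note that the assertion consumes nothing, so the following . still matches the first character.

```julia
# Positive look-ahead: "peek" must follow, but only one character is consumed.
@assert match(r"(?=peek).", "peek").match == "p"

# Negative look-ahead: "peek" must not follow.
@assert match(r"(?!peek).", "seek").match == "s"
@assert match(r"^(?!peek).", "peek") === nothing
```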

Logging and Side-Effects


Sets names of parsers within begin/end block to match the variables they are assigned to.

For example:

julia> @with_names foo = AnyChar()
. AnyValue |> with_name(:foo)

julia> parse(log_names(foo),"ab")
   match foo@1-2: ab
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

See also log_names and @syntax.

log_parser(message::Type, x::CombinedParser, a...; kw...)
log_parser(message::Function, x::CombinedParser, a...; kw...)

Transform parser including logging statements for sub-parsers of type message or for which calling message does not return nothing.

with_log(s::AbstractString,p, delta=5;nomatch=false)

Log matching process of parser p, displaying delta characters left of and right of match.

If nomatch==true, also log when parser does not match.

See also: log_names, with_effect


Call f(sequence,before_i,after_i,state,a...) if p matches, f(sequence,before_i,before_i,nothing,a...) otherwise.



WrappedParser memoizing all match states. For slow parsers with a lot of backtracking this parser can help improve speed.

(Sharing a good example where memoization makes a difference is appreciated.)

WithMemory(x) <: AbstractString

String wrapper with memoization of next match states for parsers at indices. Memoization is sometimes recommended as a way of improving the performance of parser combinators (like state machine optimization and compilation for regular languages).


A snappy performance gain could not be demonstrated so far, probably because the costs of state memory allocation for caching are often greater than recomputing a match. If you have a case where performance benefits from this, let me know!
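
The general idea of memoizing match results per start index can be sketched in a few lines of Base Julia. The names digits_end and memoize are hypothetical, and the cache layout is an assumption for illustration; this is not how WithMemory is implemented. The sketch also hints at the trade-off mentioned above, since every cached result costs a dictionary entry.

```julia
# Illustrative only: hand-rolled stand-ins, not WithMemory or any package API.
calls = Ref(0)   # counts how often the underlying parser actually runs

# Toy parser: index after a run of digits starting at i, or nothing.
function digits_end(s, i)
    calls[] += 1
    j = i
    while j <= lastindex(s) && isdigit(s[j])
        j = nextind(s, j)
    end
    return j > i ? j : nothing
end

# Cache results per start index; valid only while the same input is parsed.
function memoize(p)
    cache = Dict{Int,Union{Nothing,Int}}()
    return (s, i) -> get!(() -> p(s, i), cache, i)
end

memo_digits = memoize(digits_end)
@assert memo_digits("123a", 1) == 4
@assert memo_digits("123a", 1) == 4   # served from the cache
@assert calls[] == 1                  # the parser ran only once
```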