Constructing Parsers

Character Matchers

CombinedParsers.AnyChar — Function

AnyChar() = AnyValue(Char)

CombinedParsers.AnyValue — Type

AnyValue(T=Char)

Parser matching exactly one x::T, returning the value.

julia> AnyChar()
. AnyValue
::Char

CombinedParsers.Bytes — Type

Bytes{N,T} <: NIndexParser{N,T}

Fast parsing of a fixed number N of indices. The matched vector is converted to T with reinterpret(T,match)[1] if isbitstype(T), and with the constructor T(match) otherwise.

Provide Base.get(parser::Bytes{N,T}, sequence, till, after, i, state) where {N,T} for custom conversion.

Note

Byte order can be swapped by mapping bswap over the parser:

julia> map(bswap, Bytes(2,UInt16))([0x16,0x11])
0x1611

julia> Bytes(2,UInt16)([0x16,0x11])
0x1116
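The two results above depend on the byte order of the host. As a minimal sketch using only Base Julia (no CombinedParsers call is made here), ltoh and ntoh make the byte order explicit, so the values are the same on any machine:

```julia
# Plain Base Julia; illustrates the byte-order arithmetic only.
bytes = [0x16, 0x11]

# Read the two bytes as one UInt16 in the host's native order.
native = reinterpret(UInt16, bytes)[1]

# Interpret the same two bytes explicitly as little-endian and as big-endian.
little = ltoh(native)   # first byte is the low byte:  0x1116
big    = ntoh(native)   # first byte is the high byte: 0x1611
```

On a little-endian host, ntoh is the same operation as bswap, which is why mapping bswap over the parser yields the big-endian reading.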

CombinedParsers.ValueMatcher — Type

A ValueMatcher matches the value c at the current position if and only if ismatch(c, parser). ValueMatcher{T} = NIndexParser{1,T}, and its state_type is MatchState.

See AnyValue, ValueIn, and ValueNotIn.

CombinedParsers.CharIn — Function

CharIn(a...; kw...) = ValueIn{Char}(a...; kw...)

CombinedParsers.UnicodeClass — Type

UnicodeClass(unicode_category::Symbol...)

Used in ValueIn and ValueNotIn; succeeds if the character at the cursor is in one of the given Unicode classes.

julia> match(ValueIn(:L), "aB")
ParseMatch("a")

julia> match(ValueIn(:Lu), "aB")
ParseMatch("B")

julia> match(ValueIn(:N), "aA1")
ParseMatch("1")

Supported Unicode classes

julia> for (k,v) in CombinedParsers.unicode_class
         println(":",k, " is a ",v[1],", ", v[2],".")
       end
:L is a Letter, any kind of letter from any language.
:Ll is a Lowercase Letter, a lowercase letter that has an uppercase variant.
:Lu is a Uppercase Letter, an uppercase letter that has a lowercase variant.
:Lt is a Titlecase Letter, a letter that appears at the start of a word when only the first letter of the word is capitalized.
:L& is a Cased Letter, a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
:Lm is a Modifier Letter, a special character that is used like a letter.
:Lo is a Other Letter, a letter or ideograph that does not have lowercase and uppercase variants.
:M is a Mark, a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
:Mn is a Non Spacing Mark, a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
:Mc is a Spacing Combining Mark, a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
:Me is a Enclosing Mark, a character that encloses the character it is combined with (circle, square, keycap, etc.).
:Z is a Separator, any kind of whitespace or invisible separator.
:Zs is a Space Separator, a whitespace character that is invisible, but does take up space.
:Zl is a Line Separator, line separator character U+2028.
:Zp is a Paragraph Separator, paragraph separator character U+2029.
:S is a Symbol, math symbols, currency signs, dingbats, box-drawing characters, etc..
:Sm is a Math Symbol, any mathematical symbol.
:Sc is a Currency Symbol, any currency sign.
:Sk is a Modifier Symbol, a combining character (mark) as a full character on its own.
:So is a Other Symbol, various symbols that are not math symbols, currency signs, or combining characters.
:N is a Number, any kind of numeric character in any script.
:Nd is a Decimal Digit Number, a digit zero through nine in any script except ideographic scripts.
:Nl is a Letter Number, a number that looks like a letter, such as a Roman numeral.
:No is a Other Number, a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
:P is a Punctuation, any kind of punctuation character.
:Pc is a Connector Punctuation, a punctuation character such as an underscore that connects words.
:Pd is a Dash Punctuation, any kind of hyphen or dash.
:Ps is a Open Punctuation, any kind of opening bracket.
:Pe is a Close Punctuation, any kind of closing bracket.
:Pi is a Initial Punctuation, any kind of opening quote.
:Pf is a Final Punctuation, any kind of closing quote.
:Po is a Other Punctuation, any kind of punctuation character that is not a dash, bracket, quote or connector.
:C is a Other, invisible control characters and unused code points.
:Cc is a Control, an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
:Cf is a Format, invisible formatting indicator.
:Cs is a Surrogate, one half of a surrogate pair in UTF-16 encoding.
:Co is a Private Use, any code point reserved for private use.
:Cn is a Unassigned, any code point to which no character has been assigned.
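The class symbols listed above are the Unicode general categories, which Base Julia's PCRE-backed Regex also accepts as \p{...}. As a rough cross-check of the three match examples, and assuming the two implementations agree on category membership, the same inputs can be tried in plain Base Julia:

```julia
# Base Julia only; \p{...} selects a Unicode general category in PCRE.
letter    = match(r"\p{L}",  "aB")    # first letter of any case
uppercase = match(r"\p{Lu}", "aB")    # first uppercase letter
number    = match(r"\p{N}",  "aA1")   # first numeric character
```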

CombinedParsers.ValueIn — Type

ValueIn(x)

Parser matching exactly one element c (character) in a sequence, if and only if _ismatch(c,x).

julia> a_z = ValueIn('a':'z')
[a-z] ValueIn
::Char

julia> parse(a_z, "a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> ac = CharIn("ac")
[ac] ValueIn
::Char

julia> parse(ac, "c")
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> l = CharIn(islowercase)
[islowercase(...)] ValueIn
::Char

julia> parse(l, "c")
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

CombinedParsers.CharNotIn — Function

CharNotIn(a...; kw...) = ValueNotIn{Char}(a...; kw...)

CombinedParsers.ValueNotIn — Type

ValueNotIn{T}(label::AbstractString, x)

Parser matching exactly one element (character) in a sequence, if and only if it is not in x.

ValueNotIn([label::AbstractString="", ]x...)
ValueNotIn{T}([label::AbstractString="", ]x...)

Flattens x with CombinedParsers.flatten_valuepatterns, and tries to infer T if not provided.

julia> a_z = CharNotIn('a':'z')
[^a-z] ValueNotIn
::Char

julia> ac = CharNotIn("ca")
[^ca] ValueNotIn
::Char

Respects boolean logic:

julia> CharNotIn(CharNotIn("ab"))("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> CharIn(CharIn("ab"))("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> CharIn(CharNotIn("bc"))("a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> parse(CharNotIn(CharIn("bc")), "a")
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

CombinedParsers.ismatch — Function

ismatch(c,p)

returns _ismatch(c, p)

ismatch(c::MatchingNever,p)

returns false.

CombinedParsers._ismatch — Function

_ismatch(x::Char, set::Union{Tuple,Vector})::Bool

Return _ismatch(x,set...).

_ismatch(x, f, r1, r...)

Check if x matches any of the options f, r1,r...: If ismatch(x,f) return true, otherwise return _ismatch(x, r1, r...).

_ismatch(x)

returns false (out of options)

_ismatch(x, p)

returns x==p

_ismatch(c,p::Function)

returns p(c)

_ismatch(c,p::AnyValue)

true

_ismatch(c,p::Union{StepRange,Set})

returns c in p
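The dispatch rules above can be sketched in a few lines of plain Julia. The function name sketch_ismatch is hypothetical and this is not the package source; it only mirrors the listed cases (equality, predicate, membership, splatting of containers, and trying the options left to right).

```julia
# Hypothetical stand-in for the rules listed above; not the package source.
sketch_ismatch(x) = false                                       # out of options
sketch_ismatch(x, p) = x == p                                   # plain value: equality
sketch_ismatch(x, p::Function) = p(x)                           # predicate
sketch_ismatch(x, p::Union{StepRange,Set}) = x in p             # membership
sketch_ismatch(x, p::Union{Tuple,Vector}) = sketch_ismatch(x, p...)  # splat options

# Several options: succeed on the first that matches, else try the rest.
sketch_ismatch(x, f, r1, r...) =
    sketch_ismatch(x, f) || sketch_ismatch(x, r1, r...)

in_range  = sketch_ismatch('c', 'a':'z')                        # membership in a range
by_second = sketch_ismatch('5', isletter, isdigit)              # second option matches
no_option = sketch_ismatch('!', ('a', isdigit, Set("xyz")))     # every option fails
```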

Base.Broadcast.broadcasted — Function

Base.broadcasted(::typeof((&)), x::ValueNotIn, y::ValueNotIn)

Character matchers m like Union{ValueIn,ValueNotIn,T}, or any type T providing an ismatch(m::T,c::Char)::Bool method, represent a "sparse" bitarray over all characters.

Note

Please consider the broadcast API a draft; comments are welcome.

julia> CharNotIn("abc") .& CharNotIn("z")
[^abcz] ValueNotIn
::Char

julia> CharIn("abc") .& CharNotIn("c")
[ab] ValueIn
::Char
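The "sparse bitarray" view can be illustrated with ordinary predicates over Char: combining two matchers with .& then corresponds to an elementwise "and" of their membership tests. The sketch below uses only Base Julia, and its helper names are illustrative rather than part of the package.

```julia
# Plain predicates stand in for the matchers; all names are illustrative only.
not_abc(c) = !(c in "abc")
not_z(c)   = !(c in "z")
in_abc(c)  = c in "abc"
not_c(c)   = !(c in "c")

alphabet = collect('a':'z')

# Two negated sets combined: neither in "abc" nor equal to 'z'.
neither = filter(c -> not_abc(c) & not_z(c), alphabet)

# A set combined with a negated set: in "abc" but not 'c'.
kept = filter(c -> in_abc(c) & not_c(c), alphabet)
```

The first result is the same set a single negated class over "abcz" would accept, and the second reduces to the two characters 'a' and 'b', matching the printed [^abcz] and [ab] above.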

CombinedParsers.flatten_valuepatterns — Function

flatten_valuepatterns(x...)

Used in ValueMatcher constructors.

Heuristic is roughly:

collect ElementIterators in a Set
collect everything else in a Tuple (Functions etc.)
in the process the label is concatenated
return all that was collected as Tuple{String, <:Set, <:Tuple} or Tuple{String, <:Set} or Tuple{String, <:Tuple}.

Repeating

CombinedParsers.Repeat — Type

Repeat(minmax::UnitRange, x...)
Repeat(x...; min=0,max=Repeat_max)
Repeat(min::Integer, x...)
Repeat(min::Integer,max::Integer, x...)

Parser repeating pattern x min:max times.

julia> Repeat(2,2,'a')
a{2}  |> Repeat
::Vector{Char}


julia> Repeat(3,'a')
a{3,}  |> Repeat
::Vector{Char}

Base.:| — Method

(|)(x::AbstractToken{T}, default::Union{T,Missing})

Operator syntax for Optional(x, default=default).

julia> parser("abc") | "nothing"
|🗄 Either
├─ abc
└─ nothing
::SubString{String}

julia> parser("abc") | missing
abc? |missing
::Union{Missing, SubString{String}}

CombinedParsers.Repeat1 — Function

Repeat1(x)

Parser repeating pattern x one time or more.

Repeat1(f::Function,a...)

Abbreviation for Base.map(f,Repeat1(a...)).

CombinedParsers.Optional — Type

Optional(parser;default=defaultvalue(result_type(parser)))

Parser that always succeeds. If parser succeeds, return the result of parser with the cursor behind the match. If parser does not succeed, return default with the cursor unchanged.

julia> match(r"a?","b")
RegexMatch("")

julia> parse(Optional("a", default=42),"b")
42

CombinedParsers.defaultvalue — Function

defaultvalue(T::Type)

Default value if Optional<:CombinedParser is skipped.

T<:AbstractString: ""
T<:Vector{E}: E[]
T<:CombinedParser: Always()
otherwise missing

Note

get will return a CombinedParsers._copy of defaultvalue.

CombinedParsers._copy — Function

_copy(x)

copy(x) if and only if ismutable(x); used when the defaultvalue of an Optional is returned by get.
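A short sketch of why the copy matters: if a mutable default such as an empty vector were handed out directly, a caller mutating its result would also change the stored default. The helper copy_if_mutable below is hypothetical and written in plain Base Julia; it only mirrors the rule stated above.

```julia
# Hypothetical helper mirroring the rule above; not the package source.
copy_if_mutable(x) = ismutable(x) ? copy(x) : x

stored_default = Int[]                      # a default kept alongside some parser
result = copy_if_mutable(stored_default)    # what a caller would receive
push!(result, 1)                            # the caller mutates its own result

immutable_default = missing
same = copy_if_mutable(immutable_default)   # immutable values pass through as-is
```

After the push!, stored_default is still empty, so the next skipped Optional would again yield an empty vector.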

CombinedParsers.Lazy — Type

Lazy(x::Repeat)
Lazy(x::Optional)

Lazy x repetition matching (instead of default greedy).

julia> german_street_address = !Lazy(Repeat(AnyChar())) * Repeat1(' ') * TextParse.Numeric(Int)
🗄 Sequence
├─ .*? AnyValue |> Repeat |> Lazy |> !
├─ \ +  |> Repeat
└─ <Int64>
::Tuple{SubString{String}, Vector{Char}, Int64}

julia> german_street_address("Konrad Adenauer Allee    42")
("Konrad Adenauer Allee", [' ', ' ', ' ', ' '], 42)

Note

PCRE @re_str

julia> re"a+?"
a+?  |> Repeat |> Lazy
::Vector{Char}

julia> re"a??"
a?? |missing |> Lazy
::Union{Missing, Char}
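Since lazy repetition follows the PCRE notion, the difference from greedy matching can be seen with Base Julia's Regex alone. This is an analogy on the street-address input from above, not a CombinedParsers call:

```julia
address = "Konrad Adenauer Allee    42"

# Lazy: the street capture stops as early as the rest of the pattern allows.
lazy = match(r"^(.*?) +(\d+)$", address)

# Greedy: the street capture takes as much as it can, including padding spaces.
greedy = match(r"^(.*) +(\d+)$", address)

lazy_street   = lazy.captures[1]     # no trailing spaces
greedy_street = greedy.captures[1]   # keeps all but the last padding space
```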

CombinedParsers.Repeat_stop — Function

Repeat_stop(p,stop)
Repeat_stop(p,stop; min=0, max=Repeat_max)

Repeat p until stop (NegativeLookahead), not matching stop. Sets cursor before stop. Tries min:max times. Returns results of p.

julia> p = Repeat_stop(AnyChar(),'b') * AnyChar()
🗄 Sequence
├─ 🗄* Sequence[2] |> Repeat
│  ├─ (?!b) NegativeLookahead
│  └─ . AnyValue
└─ . AnyValue
::Tuple{Vector{Char}, Char}

julia> parse(p,"acbX")
(['a', 'c'], 'b')
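The printed tree shows the construction: each repetition is a negative lookahead for the stop followed by the repeated pattern. The same shape can be written as a Base Julia Regex, given here as an analogy rather than package code:

```julia
# (?!b). consumes one character only when it is not 'b'; the group repeats that.
m = match(r"^((?:(?!b).)*)(.)", "acbX")

before_stop = m.captures[1]   # characters consumed before the stop
next_char   = m.captures[2]   # the stop itself, consumed by the trailing `.`
```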

Atomic

CombinedParsers.Atomic — Type

Atomic(x)

A parser matching x, and failing when required to backtrack (behaving like an atomic group in regular expressions).
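For readers unfamiliar with atomic groups, the regular-expression behavior referred to above can be reproduced with Base Julia's Regex. The sketch shows the PCRE semantics only and makes no CombinedParsers call:

```julia
# Without an atomic group, a+ gives back one 'a' so the final 'a' can match.
backtracking = match(r"a+a", "aaa")

# With (?>...), a+ keeps every 'a' it consumed, so the final 'a' cannot match.
atomic = match(r"(?>a+)a", "aaa")
```

Here backtracking matches the whole input, while atomic is nothing.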

Sequences

CombinedParsers.Sequence — Type

Sequence{P,S,T}

of parts::P, sequence_state_type==S with sequence_result_type==T.

Sequence(parts::CombinedParser...; tuplestate=true)

of parts, sequence_state_type(p; tuplestate=tuplestate) with sequence_result_type.

Sequences can alternatively be created with *

julia> german_street_address = !Repeat(AnyChar()) * ' ' * TextParse.Numeric(Int)
🗄 Sequence
├─ .* AnyValue |> Repeat |> !
├─ \
└─ <Int64>
::Tuple{SubString{String}, Char, Int64}

julia> german_street_address("Some Avenue 42")
("Some Avenue", ' ', 42)

Indexing (transformation) can be defined with

julia> e1 = Sequence(!Repeat(AnyChar()), ' ',TextParse.Numeric(Int))[1]
🗄 Sequence[1]
├─ .* AnyValue |> Repeat |> !
├─ \
└─ <Int64>
::SubString{String}

julia> e1("Some Avenue 42")
"Some Avenue"

Note

State is managed as sequence_state_type(parts; tuplestate). Overload it to optimize state types for special cases.

Base.:* — Method

(*)(x::Any, y::AbstractToken)
(*)(x::AbstractToken, y::Any)
(*)(x::AbstractToken, y::AbstractToken)

Chain parsers in sSequence. See also @seq.

CombinedParsers.sSequence — Function

sSequence(x...)

Simplifying Sequence, flatten Sequences, remove Always assertions.

julia> Sequence('a',CharIn("AB")*'b')
🗄 Sequence
├─ a
└─ 🗄 Sequence
   ├─ [AB] ValueIn
   └─ b
::Tuple{Char, Tuple{Char, Char}}


julia> sSequence('a',CharIn("AB")*'b')
🗄 Sequence
├─ a
├─ [AB] ValueIn
└─ b
::Tuple{Char, Char, Char}

Recursive Parsers with Either

CombinedParsers.Delayed — Function

Delayed(T::Type) = Either{T}()

CombinedParsers.Either — Type

Either{S,T}(p) where {S,T} = new{typeof(p),S,T}(p)

Parser that tries matching the provided parsers in order, accepting the first match, and fails if all parsers fail.

This parser has no == and hash methods because it can recurse.

julia> match(r"a|bc","bc")
RegexMatch("bc")

julia> parse(Either("a","bc"),"bc")
"bc"

julia> parse("a" | "bc","bc")
"bc"
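Trying options in order means an earlier option wins even when a later one would match more input. Assuming this corresponds to regex alternation, as the r"a|bc" comparison above suggests, the effect can be seen with Base Julia's Regex:

```julia
# The first alternative that matches wins, even though "ab" would be longer.
first_wins = match(r"a|ab", "ab")

# A later alternative is reached only when the earlier ones fail.
fallback = match(r"a|bc", "bc")
```

When order matters, listing the longer or more specific option first is the usual remedy.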

Base.:| — Method

(|)(x::AbstractToken, y)
(|)(x, y::AbstractToken)
(|)(x::AbstractToken, y::AbstractToken)

Operator syntax for Either(x, y; simplify=true).

julia> 'a' | CharIn("AB") | "bc"
|🗄 Either
├─ a
├─ [AB] ValueIn
└─ bc
::Union{Char, SubString{String}}

CombinedParsers.@syntax — Macro

@syntax name = expr

Convenience macro defining a CombinedParser name=expr and custom parsing macro @name_str.

julia> @syntax a = AnyChar();

julia> a"char"
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

@syntax for name in either; expr; end

Parser expr is added to either with pushfirst!. If either is undefined, it will be created. If either == :text || either == Symbol(:) the parser will be added to the CombinedParser_globals variable in your module.

julia> @syntax street_address = Either(Any[]);

julia> @syntax for german_street_address in street_address
            Sequence(!!Repeat(AnyChar()),
                     " ",
                     TextParse.Numeric(Int)) do v
                (street = v[1], no=v[3])
            end
       end
🗄 Sequence |> map(#50) |> with_name(:german_street_address)
├─ .* AnyValue |> Repeat |> ! |> map(intern)
├─ \
└─ <Int64>
::NamedTuple{(:street, :no), Tuple{String, Int64}}

julia> german_street_address"Some Avenue 42"
(street = "Some Avenue", no = 42)


julia> @syntax for us_street_address in street_address
            Sequence(TextParse.Numeric(Int),
                     " ",
                     !!Repeat(AnyChar())) do v
                (street = v[3], no=v[1])
            end
       end
🗄 Sequence |> map(#52) |> with_name(:us_street_address)
├─ <Int64>
├─ \  
└─ .* AnyValue |> Repeat |> ! |> map(intern)
::NamedTuple{(:street, :no), Tuple{String, Int64}}

julia> street_address"50 Oakland Ave"
(street = "Oakland Ave", no = 50)

julia> street_address"Oakland Ave 50"
(street = "Oakland Ave", no = 50)

CombinedParsers.substitute — Function

substitute(name::Symbol)

Define a parser substitution.

substitute(parser::CombinedParser)

Apply parser substitution, respecting scope in the defined tree:

Parser variables are defined within scope of Eithers, for all its NamedParser options.
Substitution parsers are replaced with parser variables.
strip_either1 is used to simplify in a second phase.

Note

Substitution implementation is experimental pending feedback.

todo: scope NamedParser objects in WrappedParser, Sequence, etc.?

julia> Either(:a => !Either(
                 :b => "X", 
                 :d => substitute(:b),
                 substitute(:c)),
              :b => "b",
              :c => substitute(:b)
              ) |> substitute
|🗄 Either
├─ |🗄 Either |> ! |> with_name(:a)
│  ├─ X  |> with_name(:b)
│  ├─ X  |> with_name(:b) |> with_name(:d)
│  └─ b  |> with_name(:b) |> with_name(:c)
├─ b  |> with_name(:b)
└─ b  |> with_name(:b) |> with_name(:c)
::SubString{String}

Example

With substitute you can write recursive parsers in a style inspired by (E)BNF. CombinedParsers.BNF.ebnf uses substitute.

julia> def = Either(:integer => !Either("0", Sequence(Optional("-"), substitute(:natural_number))),
                    :natural_number => !Sequence(substitute(:nonzero_digit), Repeat(substitute(:digit))),
                    :nonzero_digit => re"[1-9]",
                    :digit => Either("0", substitute(:nonzero_digit)))
|🗄 Either
├─ |🗄 Either |> ! |> with_name(:integer)
│  ├─ 0 
│  └─ 🗄 Sequence
│     ├─ \-? |
│     └─  natural_number call substitute!
├─ 🗄 Sequence |> ! |> with_name(:natural_number)
│  ├─  nonzero_digit call substitute!
│  └─ * digit call substitute! |> Repeat
├─ [1-9] ValueIn |> with_name(:nonzero_digit)
└─ |🗄 Either |> with_name(:digit)
   ├─ 0 
   └─  nonzero_digit call substitute!
::Union{Nothing, Char, SubString{String}}

julia> substitute(def)
|🗄 Either
├─ |🗄 Either |> ! |> with_name(:integer)
│  ├─ 0 
│  └─ 🗄 Sequence
│     ├─ \-? |
│     └─ 🗄 Sequence |> ! |> with_name(:natural_number) # branches hidden
├─ 🗄 Sequence |> ! |> with_name(:natural_number)
│  ├─ [1-9] ValueIn |> with_name(:nonzero_digit)
│  └─ |🗄* Either |> with_name(:digit) |> Repeat
│     ├─ 0 
│     └─ [1-9] ValueIn |> with_name(:nonzero_digit)
├─ [1-9] ValueIn |> with_name(:nonzero_digit)
└─ |🗄 Either |> with_name(:digit)
   ├─ 0 
   └─ [1-9] ValueIn |> with_name(:nonzero_digit)
::Union{Char, SubString{String}}

Base.push! — Function

Base.push!(x::Either, option)

Push option to x.options; it is tried next if the options already in x fail.

Recursive parsers can be built with push! to Either.

Parser generating parsers

CombinedParsers.FlatMap — Type

FlatMap{P,S,Q<:Function,T} <: CombinedParser{S,T}

Like Scala's fastparse FlatMap. See after.

CombinedParsers.after — Function

after(right::Function,left::AbstractToken)
after(right::Function,left::AbstractToken,T::Type)

Like Scala's fastparse FlatMap

julia> saying(v) = v == "same" ? v : "different";

julia> p = after(saying, String, "same"|"but")
🗄 FlatMap
├─ |🗄 Either
│  ├─ same 
│  └─ but 
└─ saying
::String

julia> p("samesame")
"same"

julia> p("butdifferent")
"different"
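The idea behind a flat-map combinator is that the result of a first parser selects the parser that runs next. The sketch below rebuilds the example in plain Julia with closures. Every name in it (literal, either, flatmap, next_for) is hypothetical and unrelated to the package internals, and it assumes ASCII input so that byte indices and character indices coincide.

```julia
# In this sketch a "parser" is a function (input, i) -> (value, next_index) or nothing.

# Match the literal string `s` starting at index `i`.
function literal(s)
    return function (input, i)
        startswith(input[i:end], s) ? (s, i + ncodeunits(s)) : nothing
    end
end

# Try the given parsers in order and return the first success.
function either(parsers...)
    return function (input, i)
        for p in parsers
            r = p(input, i)
            r === nothing || return r
        end
        return nothing
    end
end

# Run `left`, then run the parser that `make_next` builds from its value.
function flatmap(make_next, left)
    return function (input, i)
        r = left(input, i)
        r === nothing && return nothing
        value, j = r
        return make_next(value)(input, j)
    end
end

# After "same" expect "same" again; after anything else expect "different".
next_for(v) = literal(v == "same" ? "same" : "different")
p = flatmap(next_for, either(literal("same"), literal("but")))

r1 = p("samesame", 1)       # ("same", 9)
r2 = p("butdifferent", 1)   # ("different", 13)
r3 = p("butsame", 1)        # nothing: "different" was required after "but"
```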

Assertions

CombinedParsers.AtStart — Type

AtStart()

Parser succeeding if and only if at index 1, with result_type AtStart.

julia> AtStart()
re"^"

CombinedParsers.AtEnd — Type

AtEnd()

Parser succeeding if and only if at last index, with result_type AtEnd.

julia> AtEnd()
re"$"

CombinedParsers.Always — Type

Always()

Assertion parser matching always and not consuming any input. Returns Always().

julia> Always()
re""

CombinedParsers.Never — Type

Never()

Assertion parser matching never.

julia> Never()
re"(*FAIL)"

Look behind

CombinedParsers.Lookbehind — Function

Lookbehind(does_match::Bool, p)

PositiveLookbehind if does_match==true, NegativeLookbehind otherwise.

CombinedParsers.PositiveLookbehind — Type

PositiveLookbehind(parser)

Parser that succeeds if and only if parser succeeds before cursor. Consumes no input. The match is returned. Useful for checks like "must be preceded by parser, don't consume its match".

CombinedParsers.NegativeLookbehind — Type

NegativeLookbehind(parser)

Parser that succeeds if and only if parser does not succeed before cursor. Consumes no input. nothing is returned as match. Useful for checks like "must not be preceded by parser, don't consume its match".

julia> la=NegativeLookbehind("keep")
re"(?<!keep)"

julia> parse("peek"*la,"peek")
("peek", re"(?<!keep)")
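The printed forms re"(?<!keep)" indicate that these assertions follow the PCRE lookbehind syntax. As an analogy in Base Julia's Regex, with no CombinedParsers call made, both variants behave as follows:

```julia
# Negative lookbehind: succeed only if the text before the cursor is not "keep".
ok     = match(r"peek(?<!keep)", "peek")   # preceded by "peek": matches
failed = match(r"keep(?<!keep)", "keep")   # preceded by "keep": no match

# Positive lookbehind: require a preceding '#', but leave it out of the match.
number = match(r"(?<=#)\d+", "issue #42")
```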

Look ahead

CombinedParsers.Lookahead — Function

Lookahead(does_match::Bool, p)

PositiveLookahead if does_match==true, NegativeLookahead otherwise.

CombinedParsers.PositiveLookahead — Type

PositiveLookahead(parser)

Parser that succeeds if and only if parser succeeds, but consumes no input. The match is returned. Useful for checks like "must be followed by parser, but don't consume its match".

julia> la=PositiveLookahead("peek")
re"(?=peek)"

julia> parse(la*AnyChar(),"peek")
("peek", 'p')

CombinedParsers.NegativeLookahead — Type

NegativeLookahead(parser)

Parser that succeeds if and only if parser does not succeed, but consumes no input. parser is returned as match. Useful for checks like "must not be followed by parser, don't consume its match".

julia> la = NegativeLookahead("peek")
re"(?!peek)"

julia> parse(la*AnyChar(),"seek")
(re"(?!peek)", 's')
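Both lookahead variants check the upcoming input without consuming it, so the parser that follows still sees the first character. The same behavior in Base Julia's Regex, given as an analogy only:

```julia
# Positive lookahead: "peek" must follow, yet only one character is consumed.
positive = match(r"(?=peek).", "peek")

# Negative lookahead: "peek" must not follow; the next character is consumed.
negative = match(r"(?!peek).", "seek")

# Anchored at the start, the negative assertion rejects the guarded input.
rejected = match(r"^(?!peek).", "peek")
```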

Logging and Side-Effects

CombinedParsers.NamedParser — Type

NamedParser{P,S,T} <: WrappedParser{P,S,T}

Struct with

    name::Symbol
    parser::P
    doc::String

CombinedParsers.with_name — Function

with_name(name::Symbol,x; doc="")

A parser labelled with name. Labels are useful in printing and logging.

CombinedParsers.@with_names — Macro

@with_names

Sets names of parsers within begin/end block to match the variables they are assigned to.

For example:

julia> @with_names foo = AnyChar()
. AnyValue |> with_name(:foo)
::Char

julia> parse(log_names(foo),"ab")
   match foo@1-2: ab
                  ^
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

Other

CombinedParsers.MappedSequenceParser — Type

MappedSequenceParser(f::F,parser::P) where {F<:Function,P}

Match parser on CharMappedString(f,sequence), e.g. in a caseless parser.

CombinedParsers.MemoizingParser — Type

MemoizingParser{P,S,T}

WrappedParser memoizing all match states. For slow parsers with a lot of backtracking this parser can help improve speed.

(Sharing a good example where memoization makes a difference is appreciated.)

CombinedParsers.WithMemory — Type

WithMemory(x) <: AbstractString

String wrapper with memoization of next match states for parsers at indices. Memoization is sometimes recommended as a way of improving the performance of parser combinators (like state machine optimization and compilation for regular languages).

Note

A snappy performance gain could not be demonstrated so far, probably because the cost of allocating state memory for caching is often greater than the cost of recomputing a match. If you have a case where performance benefits from this, let me know!
