Parser Templates
Composing with TextParse
Parsing numbers or dates is most efficiently done with TextParse.
Dates.tryparsenext - Function
TextParse.tryparsenext(x::CombinedParser,str,i,till,opts=TextParse.default_opts)
TextParse.jl integrates with CombinedParsers.jl both ways.
tryparsenext returns a tuple (result, nextpos) where result is of type Nullable{T}: Nullable{T}() if parsing failed, or a non-null value containing the parsed value if it succeeded. If parsing succeeded, nextpos is the position at which the next token, if any, starts. If parsing failed, nextpos is the position at which parsing failed.
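The (result, nextpos) convention can be illustrated without either package. The following is a minimal sketch using Base Julia only; `try_int_next` is a hypothetical helper, not part of TextParse or CombinedParsers, and it uses `Some(value)` and `nothing` as stand-ins for a non-null and a null Nullable.

```julia
# Hypothetical helper mirroring the (result, nextpos) convention with Base types.
# `Some(value)` stands in for a non-null Nullable, `nothing` for a null one.
function try_int_next(str::AbstractString, i::Int=firstindex(str))
    j = i
    # Advance over consecutive decimal digits starting at index i.
    while j <= lastindex(str) && isdigit(str[j])
        j = nextind(str, j)
    end
    # Failure: no digit consumed, report the position where parsing failed.
    j == i && return (nothing, i)
    # Success: parsed value plus the position where the next token would start.
    return (Some(parse(Int, SubString(str, i, prevind(str, j)))), j)
end

ok, next_ok = try_int_next("42 rest")
bad, next_bad = try_int_next("x1")
```

Here `something(ok)` unwraps the parsed value on success, while `bad === nothing` signals failure and `next_bad` is the index at which the attempt stopped.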
julia> using CombinedParsers
julia> using TextParse
julia> p = ("Number:" * Repeat(' ') * TextParse.Numeric(Int))[3]
🗄 Sequence[3]
├─ Number\:
├─ \ * |> Repeat
└─ <Int64>
::Int64
julia> parse(p, "Number: 42")
42
julia> TextParse.tryparsenext(p, "Number: 42")
(Nullable{Int64}(42), 14)
CombinedParsers.NumericParser - Type
NumericParser(x...) = parser(TextParse.Numeric(x...))
CombinedParsers.DateParser - Function
DateParser(format::DateFormat...)
DateTimeParser(format::DateFormat...)
Create a parser matching any one of the given formats, using TextParse.DateTimeToken for Dates.Date and Dates.DateTime respectively.
DateParser(format::AbstractString...; locale="english")
DateTimeParser(format::AbstractString...; locale="english")
Convenience functions for the above, using Dates.DateFormat.(format, locale).
CombinedParsers.DateTimeParser - Function
DateParser(format::DateFormat...)
DateTimeParser(format::DateFormat...)
Create a parser matching any one of the given formats, using TextParse.DateTimeToken for Dates.Date and Dates.DateTime respectively.
DateParser(format::AbstractString...; locale="english")
DateTimeParser(format::AbstractString...; locale="english")
Convenience functions for the above, using Dates.DateFormat.(format, locale).
For non-base-10 numbers, use:
CombinedParsers.integer_base - Function
integer_base(base,mind=0,maxd=Repeat_max)
Parser matching an integer format in base base.
Uses a second Base.parse call on match.
A custom parser could aggregate the result incrementally while matching.
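The difference between the two strategies can be sketched with Base Julia alone. The helpers `two_pass` and `incremental` below are hypothetical illustrations, not CombinedParsers API: the first matches the digits and then converts them with a separate `Base.parse` call, the second accumulates the value digit by digit while scanning.

```julia
# Two-pass: first find the digit run (here with a Base regex), then convert it
# with a second call to Base.parse using the `base` keyword.
function two_pass(str::AbstractString, base::Integer)
    m = match(r"^[0-9a-zA-Z]+", str)
    m === nothing && return nothing
    return tryparse(Int, m.match; base=base)
end

# Incremental: aggregate the result while scanning, stopping at the first
# character that is not a valid digit in the given base.
function incremental(str::AbstractString, base::Integer)
    acc = 0
    for c in str
        d = tryparse(Int, string(c); base=base)
        d === nothing && break
        acc = acc * base + d
    end
    return acc
end

a = two_pass("ff", 16)
b = incremental("ff", 16)
```

Both calls yield the same integer for valid input. The incremental variant avoids revisiting the matched text, which is the kind of saving the note about a custom parser hints at.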
Constants and Conversion
CombinedParsers.parser - Function
parser(x)
A ConstantParser matching x.
parser(x::StepRange{Char,<:Integer})
ValueIn matching x.
parser(x::Pair{Symbol, P}) where P
A parser labelled with name x.first. Labels are useful in printing and logging.
See also: @with_names, with_name, log_names
parser(x::CharWithOptions)
A ConstantParser matching x, respecting Base.PCRE.CASELESS.
Base.convert - Function
Base.convert(::Type{CombinedParser},x)
parser(x).
Base.convert(::Type{Char},y::CharWithOptions)
Strips options.
CombinedParsers.wrap - Function
wrap(x::CombinedParser; log = nothing, trace = false)
Transform a parser by wrapping sub-parsers in logging and tracing parser types.
Parser Building Blocks
PCRE regular expressions provide established building blocks as escape sequences. Equivalent CombinedParsers are provided by name.
You can also use PCRE regex syntax with the @re_str macro to build identical CombinedParsers!
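Because Base Julia regexes are also PCRE, the escape sequences behind the named parsers in this section can be tried without loading any package. The following sketch uses only Base `Regex`; it illustrates the PCRE escapes themselves, not the CombinedParsers constants.

```julia
# PCRE escape sequences, tried with Base Julia regexes.
horizontal = r"^\h+$"   # horizontal space: tab, space, non-break space, ...
vertical   = r"^\v+$"   # vertical space: linefeed, vertical tab, form feed, carriage return, ...
word       = r"^\w+$"   # word characters: letters, digits, underscore

h_ok  = occursin(horizontal, " \t\u00a0")
h_bad = occursin(horizontal, "\n")
v_ok  = occursin(vertical, "\n\r\f")
w_ok  = occursin(word, "parser_1")

# \R (BSR) matches a newline sequence and treats "\r\n" as a single unit.
crlf = match(r"\R", "\r\n").match
```

With these inputs, `h_ok`, `v_ok` and `w_ok` are true, `h_bad` is false, and `crlf` spans both characters of the carriage return and linefeed pair.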
Predefined Parsers
Horizontal and Vertical Space
Trimming space
CombinedParsers.trim - Function
trim(p...; whitespace=horizontal_space_maybe, left=whitespace, right=whitespace)
Ignore whitespace left and right of sSequence(p...).
CombinedParsers.@trimmed - Macro
@trimmed
Create parsers within whitespace_maybe, named to match the variables they are assigned to.
See also trim.
For example:
julia> @trimmed foo = AnyChar()
🗄 Sequence[2]
├─ (?>[\h]*) ValueIn |> Repeat |> ! |> Atomic
├─ . AnyValue |> with_name(:foo)
└─ (?>[\h]*) ValueIn |> Repeat |> ! |> Atomic
::Char
julia> parse(log_names(foo)," ab ")
match foo@3-4: ab
^
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
Matching Space
CombinedParsers.whitespace_char - Constant
whitespace_char = re"[[:space:]]"
whitespace_maybe = re"(?>[[:space:]]*)"
whitespace = re"(?>[[:space:]]+)"
julia> CombinedParsers.char_label_table(whitespace_char)
| Char      |                     |
|-----------|---------------------|
| '\t'      | Horizontal tab (HT) |
| '\v'      | Vertical tab (VT)   |
| '\f'      | Form feed (FF)      |
| ' '       | Space               |
| '\u85'    | Next line (NEL)     |
| '\u200e'  | Left-to-right mark  |
| '\u200f'  | Right-to-left mark  |
| '\u2028'  | Line separator      |
| '\u2029'  | Paragraph separator |
CombinedParsers.horizontal_space_char - Constant
horizontal_space_char = re"[\h]"
horizontal_space_maybe = re"(?>[\h]*)"
horizontal_space = re"(?>[\h]+)"
julia> CombinedParsers.char_label_table(horizontal_space_char)
| Char      |                           |
|-----------|---------------------------|
| '\t'      | Horizontal tab (HT)       |
| ' '       | Space                     |
| '\ua0'    | Non-break space           |
| '\u1680'  | Ogham space mark          |
| '\u180e'  | Mongolian vowel separator |
| '\u2000'  | En quad                   |
| '\u2001'  | Em quad                   |
| '\u2002'  | En space                  |
| '\u2003'  | Em space                  |
| '\u2004'  | Three-per-em space        |
| '\u2005'  | Four-per-em space         |
| '\u2006'  | Six-per-em space          |
| '\u2007'  | Figure space              |
| '\u2008'  | Punctuation space         |
| '\u2009'  | Thin space                |
| '\u200a'  | Hair space                |
| '\u202f'  | Narrow no-break space     |
| '\u205f'  | Medium mathematical space |
| '\u3000'  | Ideographic space         |
CombinedParsers.vertical_space_char - Constant
vertical_space_char = re"[\v]"
vertical_space_maybe = re"(?>[\v]*)"
vertical_space = re"(?>[\v]+)"
julia> CombinedParsers.char_label_table(vertical_space_char)
| Char      |                      |
|-----------|----------------------|
| '\n'      | Linefeed (LF)        |
| '\v'      | Vertical tab (VT)    |
| '\f'      | Form feed (FF)       |
| '\r'      | Carriage return (CR) |
| '\u85'    | Next line (NEL)      |
| '\u2028'  | Line separator       |
| '\u2029'  | Paragraph separator  |
CombinedParsers.bsr - Constant
CombinedParsers.newline
CombinedParsers.Regexp.bsr
Newlines, PCRE \R (backslash R, BSR).
julia> CombinedParsers.Regexp.bsr
(?>|🗄) Either |> Atomic |> with_name(:bsr)
├─ \r\n
└─ [\n\x0b\f\r\x85] ValueIn |> !
::SubString{String}
Words
CombinedParsers.caseless - Function
caseless(x)
MappedSequenceParser(lowercase, deepmap_parser(lowercase,parser(x))).
julia> p = caseless("AlsO")
🗄 |> MappedSequenceParser
├─ also
└─ lowercase
::SubString{String}
julia> p("also")
"also"
julia> using BenchmarkTools;
julia> @btime match(p,"also");
51.983 ns (2 allocations: 176 bytes)
julia> p = parser("also")
re"also"
julia> @btime match(p,"also");
44.759 ns (2 allocations: 176 bytes)
CombinedParsers.word - Constant
SubString of at least 1 repeated CombinedParsers.word_char.
CombinedParsers.words - Constant
Vector of at least 1 repeated CombinedParsers.word delimited by CombinedParsers.whitespace_horizontal.
CombinedParsers.inline - Constant
inline = !Atomic(Repeat(NegativeLookahead(at_lineend)*AnyChar()))
See at_lineend.
CombinedParsers.word_char - Constant
Equivalent to PCRE \w: Char with Unicode class L, N, or _.
CombinedParsers.word_boundary - Constant
word_boundary = re"\b"
CombinedParsers.beyond_word - Constant
beyond_word = Either(non_word_char,AtStart(),AtEnd())
Parser part of word_boundary.
Predefined Assertions
CombinedParsers.at_linestart - Constant
at_linestart
julia> CombinedParsers.Regexp.at_linestart
|🗄 Either |> with_name(:at_linestart)
├─ ^ AtStart
└─ (?<=🗄)) Either |> Atomic |> with_name(:bsr) |> PositiveLookbehind
   ├─ \r\n
   └─ [\n\x0b\f\r\x85] ValueIn |> !
::Union{AtStart, SubString}
Used in re"^" if Base.PCRE.MULTILINE is set.
CombinedParsers.at_lineend - Constant
at_lineend
julia> CombinedParsers.Regexp.at_lineend
|🗄 Either |> with_name(:at_lineend)
├─ $ AtEnd
└─ (?=(?>|🗄)) Either |> Atomic |> with_name(:bsr) |> PositiveLookahead
   ├─ \r\n
   └─ [\n\x0b\f\r\x85] ValueIn |> !
::Union{AtEnd, SubString{String}}
Used in re"$" if Base.PCRE.MULTILINE is set.
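The effect of the MULTILINE option on ^ and $ can be seen with a plain Base Julia regex, where the `m` flag sets it. This is a small sketch of the PCRE behaviour that at_linestart and at_lineend model, not a call into CombinedParsers.

```julia
text = "ab\ncd"

# With the `m` flag, ^ and $ also match at line starts and line ends.
multiline = [m.match for m in eachmatch(r"^\w+$"m, text)]

# Without it, ^ and $ only anchor to the start and end of the whole string,
# so no single run of word characters spans the entire text.
single = [m.match for m in eachmatch(r"^\w+$", text)]
```

In multiline mode each line is matched on its own, while the unflagged pattern finds nothing because the linefeed is not a word character.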