Parser Templates
Composing with TextParse
Parsing numbers or dates is most efficiently done with TextParse.
Dates.tryparsenext
Function
TextParse.tryparsenext(x::CombinedParser,str,i,till,opts=TextParse.default_opts)
TextParse.jl integrates with CombinedParsers.jl both ways.
tryparsenext returns a tuple (result, nextpos), where result is of type Nullable{T}: Nullable{T}() if parsing failed, or a non-null value containing the parsed result if it succeeded. If parsing succeeded, nextpos is the position at which the next token, if any, starts. If parsing failed, nextpos is the position at which parsing failed.
julia> using CombinedParsers, TextParse
julia> p = ("Number:" * Repeat(' ') * TextParse.Numeric(Int))[3]
🗄 Sequence[3]
├─ Number\:
├─ \ * |> Repeat
└─ <Int64>
::Int64
julia> parse(p, "Number:    42")
42
julia> TextParse.tryparsenext(p, "Number:    42")
(Nullable{Int64}(42), 14)
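As a rough illustration of the (result, nextpos) contract described above, the following sketch uses only Base Julia. The helper name tryparse_digits is hypothetical and is not part of TextParse or CombinedParsers; it also returns nothing on failure instead of an empty Nullable, so treat it as an analogy rather than the library's behaviour.

```julia
# Hypothetical helper (not a TextParse or CombinedParsers function):
# parse a run of ASCII digits in `str`, starting at index `i`,
# and return a (result, nextpos) tuple.
function tryparse_digits(str::AbstractString, i::Int=firstindex(str),
                         till::Int=lastindex(str))
    value = 0
    pos = i
    while pos <= till && isdigit(str[pos])
        value = 10 * value + (str[pos] - '0')  # aggregate digit by digit
        pos = nextind(str, pos)
    end
    # No digit consumed: signal failure and report where parsing stopped.
    pos == i && return (nothing, pos)
    # Success: `pos` is where the next token, if any, would start.
    return (value, pos)
end

ok = tryparse_digits("Number: 42", 9)    # starts at the '4'
failed = tryparse_digits("Number: 42", 1) # starts at the 'N'
```

On success the second element points just past the consumed digits; on failure it is the index at which matching stopped.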
CombinedParsers.NumericParser
Type
NumericParser(x...) = parser(TextParse.Numeric(x...))
CombinedParsers.DateParser
Function
DateParser(format::DateFormat...)
DateTimeParser(format::DateFormat...)
Create a parser matching any one of the given formats, using TextParse.DateTimeToken, for Dates.Date and Dates.DateTime respectively.
DateParser(format::AbstractString...; locale="english")
DateTimeParser(format::AbstractString...; locale="english")
Convenience functions for the above, using Dates.DateFormat.(format, locale).
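The idea of trying several formats, and the broadcast Dates.DateFormat.(format, locale), can be sketched with the Dates standard library alone. The helper first_matching_date is a hypothetical name used only for illustration; it is not how DateParser is implemented, which relies on TextParse.DateTimeToken.

```julia
using Dates

# Broadcasting builds one DateFormat per format string; the locale string
# is treated as a scalar and shared by all of them.
formats = Dates.DateFormat.(("yyyy-mm-dd", "dd.mm.yyyy"), "english")

# Hypothetical helper: return the Date from the first format that parses
# `str`, or `nothing` if none of the formats applies.
function first_matching_date(str::AbstractString, formats)
    for fmt in formats
        d = tryparse(Date, str, fmt)
        d === nothing || return d
    end
    return nothing
end

iso = first_matching_date("2020-12-24", formats)
german = first_matching_date("24.12.2020", formats)
neither = first_matching_date("Dec 24", formats)
```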
CombinedParsers.DateTimeParser
Function
Same docstring as CombinedParsers.DateParser above.
For non-base-10 numbers, use
CombinedParsers.integer_base
Function
integer_base(base,mind=0,maxd=Repeat_max)
Parser matching an integer format on base base.
Uses a second Base.parse call on match.
A custom parser could aggregate the result incrementally while matching.
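The two strategies mentioned above can be sketched with Base alone. This is an illustration under stated assumptions, not the implementation of integer_base: the variable matched stands in for text a parser has already matched, and accumulate_digits is a hypothetical helper name.

```julia
# Strategy 1: match the digits first, then convert them with a second
# Base.parse call on the matched text.
matched = "ff"                          # assumed to be already matched
value = parse(Int, matched; base=16)

# Strategy 2 (hypothetical helper): aggregate the value digit by digit,
# as a custom parser could do while it is matching.
function accumulate_digits(digits::AbstractString, base::Integer)
    acc = 0
    for c in digits
        acc = acc * base + parse(Int, c; base=base)
    end
    return acc
end

incremental = accumulate_digits("ff", 16)
```

Both routes yield the same integer; the second avoids scanning the matched text twice.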
Constants and Conversion
CombinedParsers.parser
Function
parser(x)
A ConstantParser matching x.
parser(x::StepRange{Char,<:Integer})
ValueIn matching x.
parser(x::Pair{Symbol, P}) where P
A parser labelled with name x.first. Labels are useful in printing and logging.
See also: @with_names, with_name, log_names
parser(x::CharWithOptions)
A ConstantParser matching x, respecting Base.PCRE.CASELESS.
Base.convert
Function
Base.convert(::Type{CombinedParser},x)
Returns parser(x).
Base.convert(::Type{Char},y::CharWithOptions)
Strips options.
CombinedParsers.wrap
Function
wrap(x::CombinedParser; log = nothing, trace = false)
Transform a parser by wrapping sub-parsers in logging and tracing parser types.
Parser Building Blocks
PCRE regular expressions provide established building blocks as escape sequences. Equivalent CombinedParser objects are provided by name.
You can also use PCRE regex syntax with the @re_str macro to build identical CombinedParser objects!
Predefined Parsers
Horizontal and Vertical Space
Trimming space
CombinedParsers.trim
Function
trim(p...; whitespace=horizontal_space_maybe,
     left=whitespace, right=whitespace)
Ignore whitespace left and right of sSequence(p...).
CombinedParsers.@trimmed
Macro
@trimmed
Create parsers within whitespace_maybe, labelled after the variables they are assigned to.
See also trim.
so, for example
julia> @trimmed foo = AnyChar()
🗄 Sequence[2]
├─ (?>[\h]*) ValueIn |> Repeat |> ! |> Atomic
├─ . AnyValue |> with_name(:foo)
└─ (?>[\h]*) ValueIn |> Repeat |> ! |> Atomic
::Char
julia> parse(log_names(foo)," ab ")
match foo@3-4: ab
^
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
Matching Space
CombinedParsers.whitespace_char
Constant
whitespace_char = re"[[:space:]]"
whitespace_maybe = re"(?>[[:space:]]*)"
whitespace = re"(?>[[:space:]]+)"
julia> CombinedParsers.char_label_table(whitespace_char)
| Char | |
|-----------|---------------------|
| '\t' | Horizontal tab (HT) |
| '\v' | Vertical tab (VT) |
| '\f' | Form feed (FF) |
| ' ' | Space |
| '\u85' | Next line (NEL) |
| '\u200e' | Left-to-right mark |
| '\u200f' | Right-to-left mark |
| '\u2028' | Line separator |
| '\u2029' | Paragraph separator |
CombinedParsers.horizontal_space_char
Constant
horizontal_space_char = re"[\h]"
horizontal_space_maybe = re"(?>[\h]*)"
horizontal_space = re"(?>[\h]+)"
julia> CombinedParsers.char_label_table(horizontal_space_char)
| Char | |
|-----------|---------------------------|
| '\t' | Horizontal tab (HT) |
| ' ' | Space |
| '\ua0' | Non-break space |
| '\u1680' | Ogham space mark |
| '\u180e' | Mongolian vowel separator |
| '\u2000' | En quad |
| '\u2001' | Em quad |
| '\u2002' | En space |
| '\u2003' | Em space |
| '\u2004' | Three-per-em space |
| '\u2005' | Four-per-em space |
| '\u2006' | Six-per-em space |
| '\u2007' | Figure space |
| '\u2008' | Punctuation space |
| '\u2009' | Thin space |
| '\u200a' | Hair space |
| '\u202f' | Narrow no-break space |
| '\u205f' | Medium mathematical space |
| '\u3000' | Ideographic space |
CombinedParsers.vertical_space_char
Constant
vertical_space_char = re"[\v]"
vertical_space_maybe = re"(?>[\v]*)"
vertical_space = re"(?>[\v]+)"
julia> CombinedParsers.char_label_table(vertical_space_char)
| Char | |
|-----------|----------------------|
| '\n' | Linefeed (LF) |
| '\v' | Vertical tab (VT) |
| '\f' | Form feed (FF) |
| '\r' | Carriage return (CR) |
| '\u85' | Next line (NEL) |
| '\u2028' | Line separator |
| '\u2029' | Paragraph separator |
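The horizontal and vertical tables above can be spot-checked against Julia's built-in Regex, which is backed by PCRE and therefore uses the same \h and \v definitions. The helper names is_horizontal and is_vertical are hypothetical and exist only for this sketch; they are not CombinedParsers functions.

```julia
# Hypothetical helpers: classify a single character with the PCRE
# escape sequences \h (horizontal space) and \v (vertical space).
is_horizontal(c::Char) = occursin(r"\h", string(c))
is_vertical(c::Char) = occursin(r"\v", string(c))

# A few characters taken from the tables above.
horizontal_samples = ['\t', ' ', '\u2003', '\u3000']
vertical_samples = ['\n', '\r', '\u85', '\u2028']

all_horizontal = all(is_horizontal, horizontal_samples)
all_vertical = all(is_vertical, vertical_samples)
```

Note that PCRE's \v covers all vertical space characters, not only the vertical tab.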
CombinedParsers.bsr
Constant
CombinedParsers.newline
CombinedParsers.Regexp.bsr
newlines, PCRE \R backslash R (BSR).
julia> CombinedParsers.Regexp.bsr
(?>|🗄) Either |> Atomic |> with_name(:bsr)
├─ \r\n
└─ [\n\x0b\f\r\x85] ValueIn |> !
::SubString{String}
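For comparison, the same newline behaviour can be observed with the PCRE escape \R in Julia's built-in Regex. This sketch uses Base only and does not call CombinedParsers.

```julia
# \R treats the two-character sequence "\r\n" as a single newline,
# and also matches the single-character newlines.
crlf = match(r"\R", "a\r\nb")
lf = match(r"\R", "a\nb")

# Splitting on \R handles mixed line endings uniformly.
parts = split("one\r\ntwo\nthree", r"\R")
```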
Words
CombinedParsers.caseless
Function
caseless(x)
MappedSequenceParser(lowercase, deepmap_parser(lowercase,parser(x))).
julia> p = caseless("AlsO")
🗄 |> MappedSequenceParser
├─ also
└─ lowercase
::SubString{String}
julia> p("also")
"also"
julia> using BenchmarkTools;
julia> @btime match(p,"also");
51.983 ns (2 allocations: 176 bytes)
julia> p = parser("also")
re"also"
julia> @btime match(p,"also");
44.759 ns (2 allocations: 176 bytes)
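The underlying idea, lowercasing both the pattern and the input before comparing, can be sketched in Base Julia. The helper caseless_startswith is a hypothetical name for illustration and is not how caseless is implemented; the PCRE i flag is shown as the regex counterpart.

```julia
# Hypothetical helper: case-insensitive prefix test, done by mapping
# lowercase over both the input and the pattern.
caseless_startswith(input::AbstractString, pattern::AbstractString) =
    startswith(lowercase(input), lowercase(pattern))

hit = caseless_startswith("ALSO known as", "AlsO")
miss = caseless_startswith("alto", "AlsO")

# The equivalent check with Julia's built-in Regex and the i (caseless) flag.
regex_hit = occursin(r"^also"i, "AlsO")
```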
CombinedParsers.word
Constant
SubString of at least 1 repeated CombinedParsers.word_char.
CombinedParsers.words
Constant
Vector of at least 1 repeated CombinedParsers.word, delimited by CombinedParsers.whitespace_horizontal.
CombinedParsers.inline
Constant
inline = !Atomic(Repeat(NegativeLookahead(at_lineend)*AnyChar()))
See at_lineend.
CombinedParsers.word_char
Constant
Equivalent to PCRE \w: Char with unicode class L, N, or _.
CombinedParsers.word_boundary
Constant
word_boundary = re"\b"
CombinedParsers.beyond_word
Constant
beyond_word = Either(non_word_char,AtStart(),AtEnd())
Parser part of word_boundary.
Predefined Assertions
CombinedParsers.at_linestart
Constant
at_linestart
julia> CombinedParsers.Regexp.at_linestart
|🗄 Either |> with_name(:at_linestart)
├─ ^ AtStart
└─ (?<=🗄)) Either |> Atomic |> with_name(:bsr) |> PositiveLookbehind
   ├─ \r\n
   └─ [\n\x0b\f\r\x85] ValueIn |> !
::Union{AtStart, SubString}
Used in re"^" if Base.PCRE.MULTILINE is set.
CombinedParsers.at_lineend
Constant
at_lineend
julia> CombinedParsers.Regexp.at_lineend
|🗄 Either |> with_name(:at_lineend)
├─ $ AtEnd
└─ (?=(?>|🗄)) Either |> Atomic |> with_name(:bsr) |> PositiveLookahead
   ├─ \r\n
   └─ [\n\x0b\f\r\x85] ValueIn |> !
::Union{AtEnd, SubString{String}}
Used in re"$" if Base.PCRE.MULTILINE is set.
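The effect of the MULTILINE option on ^ and $ can be seen with Julia's built-in Regex, where the m flag sets it. This Base-only sketch illustrates the PCRE semantics that at_linestart and at_lineend mirror; it does not call CombinedParsers.

```julia
text = "ab\ncd"

# Without MULTILINE, ^ only matches at the start of the whole string.
firsts_default = [m.match for m in eachmatch(r"^\w+", text)]

# With MULTILINE (flag m), ^ also matches right after a newline,
# and $ also matches right before one.
firsts_multiline = [m.match for m in eachmatch(r"^\w+"m, text)]
lasts_multiline = [m.match for m in eachmatch(r"\w+$"m, text)]
```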