Parser Templates

Composing with `TextParse`

Parsing numbers or dates is most efficiently done with TextParse.

Dates.tryparsenext — Function

TextParse.tryparsenext(x::CombinedParser,str,i,till,opts=TextParse.default_opts)

TextParse.jl integrates with CombinedParsers.jl both ways.

tryparsenext returns a tuple (result, nextpos), where result is of type Nullable{T}: Nullable{T}() if parsing failed, or a non-null value containing the parsed result if it succeeded. If parsing succeeded, nextpos is the position at which the next token, if any, starts. If parsing failed, nextpos is the position at which the parsing failed.

julia> using TextParse

julia> p = ("Number:" * Repeat(' ') * TextParse.Numeric(Int))[3]
🗄 Sequence[3]
├─ Number\:
├─ \ *  |> Repeat
└─ <Int64>
::Int64

julia> parse(p, "Number:    42")
42

julia> TextParse.tryparsenext(p, "Number:    42")
(Nullable{Int64}(42), 14)

CombinedParsers.NumericParser — Type

NumericParser(x...) = parser(TextParse.Numeric(x...))

CombinedParsers.DateParser — Function

DateParser(format::DateFormat...)
DateTimeParser(format::DateFormat...)

Create a parser matching any one of the given formats, using TextParse.DateTimeToken for Dates.Date and Dates.DateTime respectively.

DateParser(format::AbstractString...; locale="english")
DateTimeParser(format::AbstractString...; locale="english")

Convenience methods for the above, using Dates.DateFormat.(format, locale).

CombinedParsers.DateTimeParser — Function

DateParser(format::DateFormat...)
DateTimeParser(format::DateFormat...)

Create a parser matching any one of the given formats, using TextParse.DateTimeToken for Dates.Date and Dates.DateTime respectively.

DateParser(format::AbstractString...; locale="english")
DateTimeParser(format::AbstractString...; locale="english")

Convenience methods for the above, using Dates.DateFormat.(format, locale).
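The string-based methods are described as building their formats with Dates.DateFormat.(format, locale). The following is a minimal sketch of that broadcast using only the Dates standard library. The helper first_matching_date is hypothetical and not part of CombinedParsers; it only illustrates the idea of trying several formats in turn.

```julia
using Dates

# Broadcast DateFormat over the format strings with one shared locale.
formats = Dates.DateFormat.(("yyyy-mm-dd", "dd.mm.yyyy"), "english")

# Hypothetical helper: return the first Date any format can parse, else nothing.
function first_matching_date(str, formats)
    for fmt in formats
        d = tryparse(Date, str, fmt)
        d === nothing || return d
    end
    return nothing
end

iso    = first_matching_date("2020-01-31", formats)
dotted = first_matching_date("31.01.2020", formats)
```

Both calls yield the same Date, each through a different format.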

For numbers in bases other than 10, use

CombinedParsers.integer_base — Function

integer_base(base,mind=0,maxd=Repeat_max)

Parser matching an integer format in the given base.

Note

Uses a second Base.parse call on match.

A custom parser could aggregate result incrementally while matching.
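The two strategies in this note can be sketched with Base Julia alone. The first converts already-matched digits with a separate Base.parse call; the second folds each digit into the result as it is read. accumulate_digits is a hypothetical helper, not part of CombinedParsers, and it assumes every character is a valid digit in the given base.

```julia
# Strategy 1: match first, then convert the matched digits in a second pass.
hex_digits = "ff"
value = Base.parse(Int, hex_digits; base = 16)

# Strategy 2 (hypothetical helper): aggregate the result digit by digit.
function accumulate_digits(digits::AbstractString, base::Integer)
    acc = 0
    for c in digits
        # parse(Int, c; base) converts a single digit Char to its numeric value
        acc = acc * base + Base.parse(Int, c; base = base)
    end
    return acc
end

incremental = accumulate_digits("ff", 16)
```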

Constants and Conversion

CombinedParsers.parser — Function

parser(x)

A ConstantParser matching x.

parser(x::StepRange{Char,<:Integer})

ValueIn matching x.

parser(x::Pair{Symbol, P}) where P

A parser labelled with name x.first. Labels are useful in printing and logging.

parser(x::CharWithOptions)

A ConstantParser matching x, respecting Base.PCRE.CASELESS.

Base.convert — Function

Base.convert(::Type{CombinedParser},x)

parser(x).

Base.convert(::Type{Char},y::CharWithOptions)

Strips options.

CombinedParsers.wrap — Function

wrap(x::CombinedParser; log = nothing, trace = false)

Transform a parser by wrapping sub-parsers in logging and tracing parser types.

Parser Building Blocks

PCRE regular expressions provide established building blocks as escape sequences. Equivalent CombinedParsers are provided by name.

Note

You can also use PCRE regex syntax with the @re_str macro to build identical CombinedParsers!

Predefined Parsers

Horizontal and Vertical Space

Trimming space

CombinedParsers.trim — Function

trim(p...; whitespace=horizontal_space_maybe, 
           left=whitespace, right=whitespace)

Ignore whitespace left and right of sSequence(p...).

CombinedParsers.@trimmed — Macro

@trimmed

Create parsers within whitespace_maybe that match the variables they are assigned to.

Matching Space

CombinedParsers.whitespace_char — Constant

whitespace_char  = re"[[:space:]]"
whitespace_maybe = re"(?>[[:space:]]*)"
whitespace       = re"(?>[[:space:]]+)"

julia> CombinedParsers.char_label_table(whitespace_char)
|      Char |                     |
|-----------|---------------------|
|      '\t' | Horizontal tab (HT) |
|      '\v' |   Vertical tab (VT) |
|      '\f' |      Form feed (FF) |
|       ' ' |               Space |
|    '\u85' |     Next line (NEL) |
|  '\u200e' |  Left-to-right mark |
|  '\u200f' |  Right-to-left mark |
|  '\u2028' |      Line separator |
|  '\u2029' | Paragraph separator |

CombinedParsers.horizontal_space_char — Constant

horizontal_space_char  = re"[\h]"
horizontal_space_maybe = re"(?>[\h]*)"
horizontal_space       = re"(?>[\h]+)"

julia> CombinedParsers.char_label_table(horizontal_space_char)
|      Char |                           |
|-----------|---------------------------|
|      '\t' |       Horizontal tab (HT) |
|       ' ' |                     Space |
|       ' ' |           Non-break space |
|       ' ' |          Ogham space mark |
|  '\u180e' | Mongolian vowel separator |
|       ' ' |                   En quad |
|       ' ' |                   Em quad |
|       ' ' |                  En space |
|       ' ' |                  Em space |
|       ' ' |        Three-per-em space |
|       ' ' |         Four-per-em space |
|       ' ' |          Six-per-em space |
|       ' ' |              Figure space |
|       ' ' |         Punctuation space |
|       ' ' |                Thin space |
|       ' ' |                Hair space |
|       ' ' |     Narrow no-break space |
|       ' ' | Medium mathematical space |
|       '　' |         Ideographic space |

CombinedParsers.vertical_space_char — Constant

vertical_space_char  = re"[\v]"
vertical_space_maybe = re"(?>[\v]*)"
vertical_space       = re"(?>[\v]+)"

julia> CombinedParsers.char_label_table(vertical_space_char)
|      Char |                      |
|-----------|----------------------|
|      '\n' |        Linefeed (LF) |
|      '\v' |    Vertical tab (VT) |
|      '\f' |       Form feed (FF) |
|      '\r' | Carriage return (CR) |
|    '\u85' |      Next line (NEL) |
|  '\u2028' |       Line separator |
|  '\u2029' |  Paragraph separator |

CombinedParsers.bsr — Constant

CombinedParsers.newline
CombinedParsers.Regexp.bsr

Newlines, equivalent to PCRE \R (backslash R, BSR).

julia> CombinedParsers.Regexp.bsr
(?>|🗄) Either |> Atomic |> with_name(:bsr)
├─ \r\n 
└─ [\n\x0b\f\r\x85] ValueIn |> !
::SubString{String}
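Since these constants are documented as equivalents of PCRE escape sequences, their behaviour can be previewed with Julia's built-in regexes. This sketch uses only Base, not CombinedParsers, and assumes the named parsers match the same characters as the PCRE classes they are defined from.

```julia
# \h+ stops at the first vertical space character.
horizontal = match(r"\h+", " \t\nrest").match

# \v+ consumes CR and LF alike, since both are vertical space.
vertical = match(r"\v+", "a\r\n\nb").match

# \R treats the pair CR LF as a single newline.
crlf = match(r"\R", "\r\nnext").match
```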

Words

CombinedParsers.caseless — Function

caseless(x)

MappedSequenceParser(lowercase, deepmap_parser(lowercase,parser(x))).

julia> p = caseless("AlsO")
🗄  |> MappedSequenceParser
├─ also
└─ lowercase
::SubString{String}

julia> p("also")
"also"

julia> using BenchmarkTools;

julia> @btime match(p,"also");
  51.983 ns (2 allocations: 176 bytes)

julia> p = parser("also")
re"also"

julia> @btime match(p,"also");
  44.759 ns (2 allocations: 176 bytes)
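The definition of caseless above maps both the pattern and the input through lowercase before comparing. The same idea in plain Base Julia might look like the sketch below; caseless_equal is a hypothetical helper for illustration only and is not how CombinedParsers implements matching.

```julia
# Hypothetical helper: lowercase both sides, then compare as usual.
caseless_equal(input, pattern) = lowercase(input) == lowercase(pattern)

same      = caseless_equal("ALSO", "AlsO")
different = caseless_equal("alto", "AlsO")
```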

CombinedParsers.word — Constant

SubString of at least 1 repeated CombinedParsers.word_char.

CombinedParsers.words — Constant

Vector of at least 1 repeated CombinedParsers.word delimited by CombinedParsers.whitespace_horizontal.

CombinedParsers.inline — Constant

inline = !Atomic(Repeat(NegativeLookahead(at_lineend)*AnyChar()))

See at_lineend.

CombinedParsers.word_char — Constant

Equivalent to PCRE \w: Char with unicode class L, N, or _.

CombinedParsers.word_boundary — Constant

word_boundary = re"\b"

CombinedParsers.beyond_word — Constant

beyond_word = Either(non_word_char,AtStart(),AtEnd())

Parser part of word_boundary.
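In PCRE terms, a word boundary (\b) lies between a word character and either a non-word character or the start or end of the input, which is what beyond_word enumerates. A small sketch with Base regexes only (not CombinedParsers) shows the effect:

```julia
# Each match starts and ends at a word boundary, so punctuation and
# spaces are excluded from the collected words.
found = [m.match for m in eachmatch(r"\b\w+\b", "parse me, please")]

# No boundary exists inside a word, so \b cannot match between 'a' and 'b'.
inside = occursin(r"a\bb", "ab")
```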

Predefined Assertions

CombinedParsers.at_linestart — Constant

at_linestart

julia> CombinedParsers.Regexp.at_linestart
|🗄 Either |> with_name(:at_linestart)
├─ ^ AtStart
└─ (?<=🗄)) Either |> Atomic |> with_name(:bsr) |> PositiveLookbehind
   ├─ \r\n
   └─ [\n\x0b\f\r\x85] ValueIn |> !
::Union{AtStart, SubString}

Note

Used in re"^" if Base.PCRE.MULTILINE is set.

CombinedParsers.at_lineend — Constant

at_lineend

julia> CombinedParsers.Regexp.at_lineend
|🗄 Either |> with_name(:at_lineend)
├─ $ AtEnd
└─ (?=(?>|🗄)) Either |> Atomic |> with_name(:bsr) |> PositiveLookahead
   ├─ \r\n 
   └─ [\n\x0b\f\r\x85] ValueIn |> !
::Union{AtEnd, SubString{String}}

Note

Used in re"$" if Base.PCRE.MULTILINE is set.