Number lists (wikitext references)
Wondering what parsing is useful for? This example shows how to read a convenient number list format, e.g. [1-3,9], used in the Wikipedia wikitext format. It also serves as a simple example of what parsing into Julia types means.
To reflect on the amazing Julia type system, the example shows
- different ways to represent such number iterators in Julia,
julia> 1:3 == [1,2,3]
true
- how to parse into any such representation,
julia> using CombinedParsers
- inter-operation with brief regular expression (PCRE) syntax
julia> using CombinedParsers.Regexp
julia> dash = re" *- *"
🗄 Sequence
├─ \ * |> Repeat
├─ \-
└─ \ * |> Repeat
::Tuple{Vector{Char}, Char, Vector{Char}}
- inter-operation with TextParse.jl.
julia> import TextParse
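The first bullet can be illustrated with Base Julia alone. The following is a minimal sketch, not taken from the package documentation; the variable names are only illustrative. It shows that a range and a vector hold the same elements while being different types.

```julia
r = 1:3         # UnitRange{Int}: stores only the first and last value
v = [1, 2, 3]   # Vector{Int}: stores every element

same_elements = (r == v)                   # element-wise equality
different_types = typeof(r) != typeof(v)   # UnitRange{Int} vs. Vector{Int}
total = sum(r) + sum(v)                    # both can be iterated the same way
```

Either representation can serve as the target type of a parser; the range is the more compact choice when the numbers are consecutive.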
Number ranges
julia> @syntax int_range = Sequence(
           Numeric(Int), # 1
           dash,         # 2
           Numeric(Int)  # 3
       ) do v
           v[1]:v[3]
       end;
Because dash allows optional spaces, int_range also matches cases like "8 - 11". Julia's number range format is 1:3. The string macro @int_range_str is defined by @syntax.
julia> int_range"1-3"
1:3
julia> int_range"8-11"
8:11
Julia's Base.collect can be used to convert a range to a Vector.
julia> @syntax int_vector = map(collect, int_range);
julia> int_vector"8 - 11"
4-element Vector{Int64}:
  8
  9
 10
 11
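For comparison, the same range-then-collect idea can be sketched with Base regular expressions only. This is an illustration of what "parsing into a Julia type" means, not how CombinedParsers.jl works internally; parse_int_range is a hypothetical helper name.

```julia
# Hypothetical helper using Base regular expressions, not CombinedParsers.jl.
function parse_int_range(text::AbstractString)
    m = match(r"^(\d+) *- *(\d+)$", text)
    m === nothing && throw(ArgumentError("not a number range: $(repr(text))"))
    # Build a UnitRange{Int} from the two captured endpoints.
    return parse(Int, m[1]):parse(Int, m[2])
end

r = parse_int_range("8 - 11")   # a lazy range
v = collect(r)                  # a materialized Vector{Int}
```

Here the transformation to a range is separate from the matching step, whereas the @syntax definition above attaches the transformation directly to the parser.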
Numbers
Without @syntax, you can parse a string with parse or by calling the parser as a function.
julia> int = map(x -> [x], Numeric(Int));
julia> parse(int,"19")
1-element Vector{Int64}:
 19
julia> int("19")
1-element Vector{Int64}:
 19
The tree displays how a number is read and transformed to a Vector with length 1.
julia> int
<Int64> |> map(#3)
::Vector{Int64}
Joining numbers and ranges
julia> @syntax numbers = map(join(
           Repeat(Either(int_vector, int)),
           re" *, *"
       )) do v
           vcat(v...)::Vector{Int}
       end;
julia> numbers"1-3,9"
4-element Vector{Int64}:
 1
 2
 3
 9
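The flattening step vcat(v...) can be tried in isolation with Base Julia. The sketch below assumes that the joined parser yields one vector per list item; the variable names are illustrative only.

```julia
# The parts a list parser might produce: one vector per item.
parts = [[1, 2, 3], [9]]

# Splatting passes each part as its own argument: vcat([1, 2, 3], [9]).
flat = vcat(parts...)

# reduce(vcat, parts) gives the same result without splatting.
flat_reduce = reduce(vcat, parts)
```

The ::Vector{Int} annotation in the parser definition is a runtime type assertion on this flattened result.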
Another parser can be prepended with *; indexing the resulting sequence with [end] keeps only the result of its last part.
julia> (re"no *"*numbers)[end]("no 2-4,19")
4-element Vector{Int64}:
  2
  3
  4
 19
Inclusion in a wikitext parser
Long and complicated texts such as Wikipedia articles can be parsed with CombinedParsers.jl. The parsers are less painful to write and execute at speeds comparable to PCRE, the industry-standard regular expression library implemented in C. CombinedParsers.jl can inter-operate with Julia packages such as TextParse.jl.
julia> @syntax wiki_references = Sequence(2,"[",numbers,"]");
julia> wiki_references"[1, 7-9, 2]"
5-element Vector{Int64}:
 1
 7
 8
 9
 2
The tree displays how a bracketed, comma-separated sequence of numbers and number ranges is read and transformed to a Vector.
julia> wiki_references
🗄 Sequence |> map(#54) |> with_name(:wiki_references)
├─ \[
├─ 🗄 Sequence |> map(#74) |> map(#5) |> with_name(:numbers)
│  ├─ |🗄 Either
│  │  ├─ 🗄 Sequence |> map(#1) |> map(collect) |> with_name(:int_range) |> with_name(:int_vector)
│  │  │  ├─ <Int64>
│  │  │  ├─ 🗄 Sequence
│  │  │  │  ├─ \ * |> Repeat
│  │  │  │  ├─ \-
│  │  │  │  └─ \ * |> Repeat
│  │  │  └─ <Int64>
│  │  └─ <Int64> |> map(#3)
│  └─ 🗄* Sequence |> map(#54) |> Repeat
│     ├─ 🗄 Sequence
│     │  ├─ \ * |> Repeat
│     │  ├─ ,
│     │  └─ \ * |> Repeat
│     └─ |🗄 Either # branches hidden
└─ \]
::Vector{Int64}
PCRE papercuts when parsing number sequences
Written as a single regular expression, the same parser is tedious to write and understand (although a small regular expression such as re" *- *" is clear). PCRE matching recognizes the match but does not make all required parts accessible: a repeated capture group keeps only its last match, so 7 is not captured.
julia> re = "\\[([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+)(?: *, *([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+))*\\]"
"\\[([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+)(?: *, *([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+))*\\]"
julia> match(Regex("^"*re*"\$"), "[1-3,7,9]")
RegexMatch("[1-3,7,9]", 1="1-3", 2="9")
To make the parsing work with regular expressions, you would choose a stepwise strategy: handle [ and ] first, then step through the ,-separated parts. Parsing was a pain.
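The stepwise strategy described above could look roughly like the following Base Julia sketch. It is an illustration under stated assumptions, not code from CombinedParsers.jl: parse_number_list is a hypothetical helper, and it is slightly more permissive about spaces than the wiki_references parser.

```julia
# Hypothetical helper using Base regular expressions, not CombinedParsers.jl.
function parse_number_list(text::AbstractString)
    # Step 1: handle the enclosing brackets.
    outer = match(r"^\[(.*)\]$", text)
    outer === nothing && throw(ArgumentError("expected a bracketed list: $(repr(text))"))
    numbers = Int[]
    # Step 2: step through the comma-separated parts.
    for part in split(outer[1], ',')
        item = match(r"^ *(\d+)(?: *- *(\d+))? *$", part)
        item === nothing && throw(ArgumentError("invalid list item: $(repr(part))"))
        lo = parse(Int, item[1])
        # Step 3: the second group is `nothing` for a single number.
        hi = item[2] === nothing ? lo : parse(Int, item[2])
        append!(numbers, lo:hi)
    end
    return numbers
end

result = parse_number_list("[1-3,7,9]")
```

Matching, splitting, and conversion are spread over three separate steps here, which is the bookkeeping that the combined parser definition avoids.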
This page was generated using Literate.jl.