Number lists (wikitext references)

Wondering what parsing is useful for? This example shows how to read a convenient number list format, e.g. [1-3,9], as used in Wikipedia wikitext. It also serves as a simple example of what parsing into Julia types means.

To reflect on the amazing Julia type system, the example shows

  • different ways to represent such number Iterators in Julia,
julia> 1:3 == [1,2,3]
true
  • how you parse into any such representation
julia> using CombinedParsers
  • inter-operation with brief regular expression (PCRE) syntax
julia> using CombinedParsers.Regexp
julia> dash = re" *- *"
🗄 Sequence
├─ \ *  |> Repeat
├─ \-
└─ \ *  |> Repeat
::Tuple{Vector{Char}, Char, Vector{Char}}
  • inter-operation with TextParse.jl
julia> import TextParse
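
The first point of the list can be illustrated with Base Julia alone. The following is a minimal sketch, not part of the original example; the variable names r, v and t are chosen here for illustration only.

```julia
# Three ways to hold the numbers 1, 2, 3 in Base Julia (no packages needed).
r = 1:3            # UnitRange: stores only the first and last value
v = collect(r)     # Vector{Int}: stores every element
t = (1, 2, 3)      # Tuple

@assert r == v                 # ranges and vectors compare element-wise
@assert r isa AbstractVector   # a range is itself an AbstractVector
@assert v isa Vector{Int}
@assert sum(r) == sum(v) == sum(t) == 6
```

Which representation a parser should produce is a design choice: a range is compact, while a Vector can also hold lists with gaps such as 1, 2, 3, 9.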

Number ranges

julia> @syntax int_range = Sequence(
           Numeric(Int), # 1
           dash,    # 2
           Numeric(Int)  # 3
       ) do v
           v[1]:v[3]
       end;

This parser matches cases like "8 - 11". Julia's number range syntax is 1:3. The @syntax macro also defines the string macro @int_range_str.
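
To see why the transformation uses the indices 1 and 3, it may help to look at the shape of the value it receives. The sketch below assumes that the sequence result for "8 - 11" is a tuple of the three sub-results, with the middle entry shaped like the result type displayed for dash above; the literal tuple is written by hand here and is not produced by the parser.

```julia
# Assumed shape of the sequence result for "8 - 11" (hand-written, hypothetical):
# (first number, dash result, last number)
v = (8, ([' '], '-', [' ']), 11)

# The transformation ignores the dash result v[2] and builds a range.
range_from_parts = v[1]:v[3]
@assert range_from_parts == 8:11
@assert length(range_from_parts) == 4
```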

julia> int_range"1-3"
1:3

julia> int_range"8-11"
8:11

Julia's Base.collect can be used to convert the range into a Vector:

julia> @syntax int_vector = map(collect, int_range);
julia> int_vector"8 - 11"
4-element Vector{Int64}:
  8
  9
 10
 11

Numbers

Without @syntax, you can parse with parse or by calling the parser directly:

julia> int = map(x -> [x], Numeric(Int));
julia> parse(int,"19")
1-element Vector{Int64}:
 19

julia> int("19")
1-element Vector{Int64}:
 19

The tree displays how a number is read and transformed to a Vector with length 1.

julia> int
 <Int64> |> map(#3)
::Vector{Int64}

Joining numbers and ranges

julia> @syntax numbers = map(join(
           Repeat(Either(int_vector, int)),
           re" *, *"
       )) do v
           vcat(v...)::Vector{Int}
       end;
julia> numbers"1-3,9"
4-element Vector{Int64}:
 1
 2
 3
 9
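
The concatenation step can be tried in plain Julia. This is a small sketch under the assumption that the joined parser hands the transformation a collection of vectors, one per comma-separated part; the variable parts is written by hand here to stand in for that value.

```julia
# Hand-written stand-in for the parsed parts of "1-3,9" (assumed shape).
parts = [collect(1:3), [9]]

# Splatting passes each part as a separate argument to vcat,
# which concatenates them into one flat vector.
flat = vcat(parts...)
@assert flat == [1, 2, 3, 9]

# reduce(vcat, parts) gives the same result without splatting.
@assert reduce(vcat, parts) == flat
```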

Another parser can be prepended with *; indexing with [end] keeps only the result of the last part:

julia> (re"no *"*numbers)[end]("no 2-4,19")
4-element Vector{Int64}:
  2
  3
  4
 19

Inclusion in a wikitext parser

Long and complicated texts such as Wikipedia articles can be parsed with CombinedParsers.jl. The parsers are less painful to write and execute at speeds comparable to PCRE, the industry-standard regular expression library implemented in C. CombinedParsers.jl can inter-operate with the Julia package TextParse.jl.

julia> @syntax wiki_references = Sequence(2,"[",numbers,"]");
julia> wiki_references"[1, 7-9, 2]"
5-element Vector{Int64}:
 1
 7
 8
 9
 2

The tree displays how a bracketed, comma-separated sequence of numbers and number ranges is read and transformed to a Vector.

julia> wiki_references
🗄 Sequence |> map(#54) |> with_name(:wiki_references)
├─ \[
├─ 🗄 Sequence |> map(#74) |> map(#5) |> with_name(:numbers)
│  ├─ |🗄 Either
│  │  ├─ 🗄 Sequence |> map(#1) |> map(collect) |> with_name(:int_range) |> with_name(:int_vector)
│  │  │  ├─ <Int64>
│  │  │  ├─ 🗄 Sequence
│  │  │  │  ├─ \ *  |> Repeat
│  │  │  │  ├─ \-
│  │  │  │  └─ \ *  |> Repeat
│  │  │  └─ <Int64>
│  │  └─  <Int64> |> map(#3)
│  └─ 🗄* Sequence |> map(#54) |> Repeat
│     ├─ 🗄 Sequence
│     │  ├─ \ *  |> Repeat
│     │  ├─ ,
│     │  └─ \ *  |> Repeat
│     └─ |🗄 Either # branches hidden
└─ \]
::Vector{Int64}

PCRE papercuts when parsing number sequences

Written as a single regular expression, the same parser is tedious to write and to understand (although a small regular expression such as re" *- *" is clear). PCRE matching does recognize the input, but it does not make all the parts needed for parsing accessible: a repeated capture group keeps only its last match, so 7 is not captured.

julia> re = "\\[([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+)(?: *, *([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+))*\\]"
"\\[([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+)(?: *, *([[:digit:]]+ *- *[[:digit:]]+|[[:digit:]]+))*\\]"

julia> match(Regex("^"*re*"\$"), "[1-3,7,9]")
RegexMatch("[1-3,7,9]", 1="1-3", 2="9")

To make the parsing work with regular expressions, you would choose a stepwise strategy: handle [ and ] first, then step through the comma-separated parts. Parsing was a pain.
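
For comparison, such a stepwise strategy might look as follows with Base regular expressions only. This is a rough sketch, not code from CombinedParsers.jl: the function name parse_reference_list and its error messages are hypothetical and chosen here for illustration.

```julia
# Hypothetical helper (not part of CombinedParsers.jl): stepwise parsing of a
# reference list such as "[1-3,7,9]" with Base regular expressions only.
function parse_reference_list(s::AbstractString)
    # Step 1: handle the enclosing brackets.
    m = match(r"^\[(.*)\]$", s)
    m === nothing && throw(ArgumentError("expected a bracketed list, got $(repr(s))"))
    numbers = Int[]
    # Step 2: step through the comma-separated parts.
    for part in split(m.captures[1], ',')
        p = match(r"^ *([[:digit:]]+)(?: *- *([[:digit:]]+))? *$", part)
        p === nothing && throw(ArgumentError("cannot parse part $(repr(part))"))
        first_number = parse(Int, p.captures[1])
        # The second capture is `nothing` when the part is a single number.
        last_number = p.captures[2] === nothing ? first_number : parse(Int, p.captures[2])
        append!(numbers, first_number:last_number)
    end
    return numbers
end

@assert parse_reference_list("[1-3,7,9]") == [1, 2, 3, 7, 9]
@assert parse_reference_list("[1, 7-9, 2]") == [1, 7, 8, 9, 2]
```

Unlike the single regular expression above, this version recovers the 7, at the cost of splitting the grammar across two patterns, a loop, and manual conversion of each capture.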


This page was generated using Literate.jl.