BSON is a efficient compressed json-like data encoding. This BSON specification is copied verbatim from The original document uses pseudo-BNF syntax, this document uses rearranged but otherwise quite similar syntax

using CombinedParsers

The Notes contain information for reassembling the parsed part of the byte array into a JSON-like structure. This document adds the julia functions to the parser. The result document is a pure julia BSON parser.

Specification Version 1.1

BSON is a binary format in which zero or more ordered key/value pairs are stored as a single entity. We call this entity a document.

The following grammar specifies version 1.1 of the BSON standard. We've written the grammar using a CombinedParsers syntax. Valid BSON data is represented by the document non-terminal.

Basic Types

The following basic types are used as terminals in the rest of the grammar. Each type must be serialized in little-endian format.

byte = Bytes(1,UInt8)         #  (8-bits)
int32 = Bytes(4,Int32)        # (32-bit signed integer, two's complement)
int64 = Bytes(8,Int64)        # (64-bit signed integer, two's complement)
uint64 = Bytes(8,UInt64)      # (64-bit unsigned integer)
double = Bytes(8,Float64)     # (64-bit IEEE 754-2008 binary floating point)
using DecFP
decimal128 = Bytes(16,Dec128) # (128-bit IEEE 754-2008 decimal floating point)
16 TypedBytes::DecFP.Dec128


The following specifies the rest of the BSON grammar. Note that quoted strings represent terminals, and should be interpreted with C semantics (e.g. "\x01" represents the byte 0000 0001).

@syntax t = map(ValueIn,Sequence(2, "\\x", integer_base(16,2,2)))

element = Either{Any}(Any[])
e_list = Repeat(element)

Also note that we use the * operator as shorthand for repetition (e.g. ("\x01"*2) is "\x01\x01"). When used as a unary operator, * means that the repetition can occur 0 or more times.


* here is replaced with Repeat.

BSON Document. int32 is the total number of bytes comprising the document.

@syntax document = Sequence(int32, e_list, t"\x00")[2]
๐Ÿ—„ Sequence[2] |> with_name(:document)
โ”œโ”€ 4 TypedBytes::Int32
โ”œโ”€ |* Either |> Repeat
โ””โ”€ [0] ValueIn

Zero or more modified UTF-8 encoded characters followed by '\x00'. The (byte*) MUST NOT contain '\x00', hence it is not full UTF-8.

@syntax cstring = map(String,Repeat_until(byte, t"\x00"))
@syntax e_name = map(Symbol,cstring)            # Key name
๐Ÿ—„ Sequence[1] |> map(String) |> map(Symbol) |> with_name(:cstring) |> with_name(:e_name)
โ”œโ”€ (?>๐Ÿ—„*) Sequence[2] |> Repeat |> Atomic
โ”‚  โ”œโ”€ (?![0]) ValueIn |> NegativeLookahead
โ”‚  โ””โ”€ 1 TypedBytes::UInt8
โ””โ”€ [0] ValueIn

String - The int32 is the number bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.

@inline string_until_before(v) = Bytes(v-1,String)
@syntax lstring = (after(string_until_before, String, int32) * t"\x00")[1]
๐Ÿ—„ Sequence[1] |> with_name(:lstring)
โ”œโ”€ ๐Ÿ—„ FlatMap
โ”‚  โ”œโ”€ 4 TypedBytes::Int32
โ”‚  โ””โ”€ string_until_before
โ””โ”€ [0] ValueIn

Binary - The int32 is the number of bytes in the (byte*).

@syntax subtype = Either((
    t"\x00",                                  # Generic binary subtype
    t"\x01",                                  # Function
    t"\x02",                                  # Binary (Old)
    t"\x03",                                  # UUID (Old)
    t"\x04",                                  # UUID
    t"\x05",                                  # MD5
    t"\x06",                                  # Encrypted BSON value
    t"\x80",                                  # User defined
@syntax binary = after(Vector{UInt8}, int32) do v
    subtype * Bytes(v,Vector{UInt8})
๐Ÿ—„ FlatMap |> with_name(:binary)
โ”œโ”€ 4 TypedBytes::Int32
โ””โ”€ #1
@syntax code_w_s = int32 * lstring * document # Code w/ scope โ€” Deprecated
for e in [
    :float64 => t"\x01" * e_name * double,               # 64-bit binary floating point
    :string => t"\x02" * e_name * lstring,                # UTF-8 string
    :embedded => t"\x03" * e_name * document,            # Embedded document
    :array => t"\x04" * e_name * document,               # Array
    t"\x05" * e_name * binary,                           # Binary data
    t"\x06" * e_name * (Always() => missing),            # Undefined (value) โ€” Deprecated
    t"\x07" * e_name * Bytes(12,Vector{UInt8}),          # ObjectId
    :boolean => t"\x08" * e_name * Either(
        t"\x00" => false,                                # Boolean "false"
        t"\x01" => true),                                # Boolean "true"
    :UTC => t"\x09" * e_name * map(int64) do v
    v |> Millisecond |> Dates.UTInstant |> DateTime
    end,                                                 # UTC datetime
    t"\x0A" * e_name * (Always() => nothing),            # Null value
    :regex => t"\x0B" * e_name * map(v->Regex(v...),cstring * cstring),
    # Regular expression - The first cstring is the regex pattern,
    # the second is the regex options string.
    # Options are identified by characters, which must be stored in alphabetical order.
    t"\x0C" * e_name * lstring * Bytes(12,Vector{UInt8}), # DBPointer โ€” Deprecated
    t"\x0D" * e_name * lstring,                           # JavaScript code
    t"\x0E" * e_name * lstring,                           # Symbol. โ€” Deprecated
    t"\x0F" * e_name * code_w_s,                         # JavaScript code w/ scope โ€” Deprecated
    t"\x10" * e_name * int32,                            # 32-bit integer
    t"\x11" * e_name * uint64,                           # Timestamp
    :int => t"\x12" * e_name * int64,                    # 64-bit integer
    t"\x13" * e_name * decimal128,                       # 128-bit decimal floating point
    t"\xFF" * e_name,                                    # Min key
    t"\x7F" * e_name,                                    # Max key
    pairp = parser(e)
          map(if result_type(pairp) <: Tuple && fieldcount(result_type(pairp))==3


  • Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
  • UTC datetime - The int64 is UTC milliseconds since the Unix epoch.
  • Timestamp - Special internal type used by MongoDB replication and sharding. First 4 bytes are an increment, second 4 are a timestamp.
  • Min key - Special type which compares lower than all other possible BSON element values.
  • Max key - Special type which compares higher than all other possible BSON element values.
  • Generic binary subtype - This is the most commonly used binary subtype and should be the 'default' for drivers and tools.
  • The BSON "binary" or "BinData" datatype is used to represent arrays of bytes. It is somewhat analogous to the Java notion of a ByteArray. BSON binary values have a subtype. This is used to indicate what kind of data is in the byte array. Subtypes from zero to 127 are predefined or reserved. Subtypes from 128-255 are user-defined.
    • \x02 Binary (Old) - This used to be the default subtype, but was deprecated in favor of \x00. Drivers and tools should be sure to handle \x02 appropriately. The structure of the binary data (the byte* array in the binary non-terminal) must be an int32 followed by a (byte*). The int32 is the number of bytes in the repetition.
    • \x03 UUID (Old) - This used to be the UUID subtype, but was deprecated in favor of \x04. Drivers and tools for languages with a native UUID type should handle \x03 appropriately.
    • \x80-\xFF "User defined" subtypes. The binary data can be anything.
  • Code w/ scope - Deprecated. The int32 is the length in bytes of the entire codews value. The string is JavaScript code. The document is a mapping from identifiers to values, representing the scope in which the string should be evaluated.

Testing examples

using Dates
testdata = Dict(:a => 1, :b => 3.0, :c => "yeay", :d => true)
Dict{Symbol, Any} with 4 entries:
  :a => 1
  :b => 3.0
  :d => true
  :c => "yeay"

The BSON.jl package provides an optimized and feature-rich BSON serializer and parser.

using BSON
bson("test.bson", testdata)

s = read("test.bson")
parse(document,s, log=true)
4-element Vector{Any}:
 :a => 1
 :b => 3.0
 :d => true
 :c => "yeay"

Aim of this page is demonstrating the straightforward implementation of a BSON parser. The parser is untested and not feature-complete regarding subtypes etc. Parsing logic can be plugged into the elements with map.


using BenchmarkTools

BSON has an optimized parser, used for comparison

@benchmark BSON.parse(IOBuffer(s))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min โ€ฆ max):  1.701 ฮผs โ€ฆ  1.333 ms  โ”Š GC (min โ€ฆ max):  0.00% โ€ฆ 99.52%
 Time  (median):     1.964 ฮผs              โ”Š GC (median):     0.00%
 Time  (mean ยฑ ฯƒ):   2.435 ฮผs ยฑ 22.601 ฮผs  โ”Š GC (mean ยฑ ฯƒ):  16.04% ยฑ  1.72%

  โ–‚โ–†โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–ˆโ–‡โ–ˆโ–‡โ–‡โ–‡โ–†โ–…โ–†โ–…โ–…โ–…โ–„โ–„โ–„โ–„โ–ƒโ–„โ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–โ–‚โ–‚โ–‚ โ–„
  1.7 ฮผs         Histogram: frequency by time         3.2 ฮผs <

 Memory estimate: 1.50 KiB, allocs estimate: 19.

The CombinedParser matching is slower.

@benchmark match(document,s)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min โ€ฆ max):  15.734 ฮผs โ€ฆ   8.759 ms  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 98.79%
 Time  (median):     18.111 ฮผs               โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   21.466 ฮผs ยฑ 121.981 ฮผs  โ”Š GC (mean ยฑ ฯƒ):  7.99% ยฑ  1.40%

  โ–โ–„โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–†โ–…โ–„โ–„โ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ– โ–‚
  15.7 ฮผs         Histogram: frequency by time         34.4 ฮผs <

 Memory estimate: 15.64 KiB, allocs estimate: 120.

Transformation is doing too much work currently - lazy access optimizations are planned.

@benchmark parse(document,s)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min โ€ฆ max):  30.094 ฮผs โ€ฆ  11.930 ms  โ”Š GC (min โ€ฆ max):  0.00% โ€ฆ 99.19%
 Time  (median):     36.957 ฮผs               โ”Š GC (median):     0.00%
 Time  (mean ยฑ ฯƒ):   44.729 ฮผs ยฑ 260.389 ฮผs  โ”Š GC (mean ยฑ ฯƒ):  12.90% ยฑ  2.21%

  โ–โ–‚โ–…โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–…โ–…โ–…โ–„โ–„โ–„โ–„โ–„โ–ƒโ–ƒโ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–‚โ–โ–‚โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ–โ– โ–ƒ
  30.1 ฮผs         Histogram: frequency by time         66.8 ฮผs <

 Memory estimate: 27.02 KiB, allocs estimate: 368.


Profiling reveals that most time is lost due to the Either state type.

# using Profile
# Profile.clear()
# f(n,a...) = for _ in 1:n; match(a...); end
# @profile f(100000,document,s);
# using ProfileView
# ProfileView.view()

This is looked into, I have a strong hunch that the allocation can be removed so CombinedParser can come close to BSON.jl.


Seems, reading a document written with BSON works. BSON.jl has a nonstandard way with Dates...

testdata = Dict(:a => 1, :b => 3.0, :c => "yeay", :d => true, :e => now())
bson("test.bson", testdata)
s = read("test.bson")
parse(document,s, log=true)
5-element Vector{Any}:
 :a => 1
 :b => 3.0
 :d => true
 :e => Any[:tag => "struct", :type => Any[:tag => "datatype", :params => Any[], :name => Any[Symbol("0") => "Dates", Symbol("1") => "DateTime"]], :data => Any[Symbol("0") => Any[:tag => "struct", :type => Any[:tag => "datatype", :params => Any[Symbol("0") => Any[:tag => "datatype", :params => Any[], :name => Any[Symbol("0") => "Dates", Symbol("1") => "Millisecond"]]], :name => Any[Symbol("0") => "Dates", Symbol("1") => "UTInstant"]], :data => Any[Symbol("0") => Any[:tag => "struct", :type => Any[:tag => "datatype", :params => Any[], :name => Any[Symbol("0") => "Dates", Symbol("1") => "Millisecond"]], :data => Any[Symbol("0") => 63766177005737]]]]]]
 :c => "yeay"

This parser does not do any of the magic in BSON.jl to reassemble into julia types. BSON dates work though:

let d = now()
    Any[:e => d] ==
        parse(document,UInt8[0x10,0x0,0x0,0x0, # we actually ignore the 4 byte length of the vector
                             0x09,0x65,0x0,reinterpret(UInt8,[Dates.value(d)])...,0x0], log=true)

This page was generated using Literate.jl.