An idealized Configuration Language

2022-03-31 · tags: netencode, json

JSON brought us one step closer to what an idealized configuration language is, which I define as "data, stripped of all externalities of the system it is working in".

Specifically, JSON is very close to what I consider the minimal properties to represent structured data.

A short history, according to me

In the beginning, Lisp defined s-expressions as a stand-in for an actual syntax. Then, people figured out that it's also a way to represent structured data. It has scalars, which can be nested into lists, recursively.

(this is (a (list) (of lists)))

This provides the first three rules of our idealized language:

A scalar is a primitive value that is domain-specific. We can assume a bunch of bytes here, or a text or an integer.
A list gives an ordering to 0..n (or 1..n) values
Both a scalar and a list are the same kind of "thing" (from here on called value), lists can be created from arbitrary values recursively (for example scalars, or lists of scalars and other lists)

Later, ASN.1 came and had the important insight that the same idealized data structure can be represented in different fashions, for example as a binary-efficient version and a human-readable format.

Then, XML "graced" the world for a decade or two, and the main lesson from it was that you don't want to mix markup languages and configuration languages, and that you don't want a committee to design these things.

In the meantime, Brendan Eich designed Javascript. Its prototype-based object system arguably stripped down the rituals of existing OO-systems. Douglas Crockford later extracted the object format (minus functions) into a syntax, and we got JSON.

{
  "foo": [
    { "nested": "attrs" },
    "some text"
  ],
  "bar": 42
}

JSON adds another fundamental idea into the mix:

Records are unordered collections of name/value pairs. A name is defined to be a unicode string, so a semantic descriptor of the nested value.

Unfortunately, the JSON syntax does not actually specify any semantics of records (objects in JSON lingo), in particular it does not mention what the meaning is if a name appears twice in one record.

If records can have multiple entries with the same name, suddenly ordering becomes important! But wait, remember earlier we defined lists to impose ordering on two values. So in order to rectify that problem, we say that

A name can only appear in a record once, names must be unique.

This is the current state of the programming community at large, where most "modern" configuration languages basically use a version of the JSON model as their underlying data structure. (However not all of them use the same version.)

Improving JSON's data model

We are not yet at the final "idealized" configuration language, though.

Modern languages like Standard ML define their data types as a mixture of

records ("structs" in the C lingo)
and sums (which you can think about as enums that can hold more values inside them)

This allows to express the common pattern where some fields in a record are only meaningful if another field — the so-called tag-field — is set to a specific value.

An easy example: if a request can fail with an error message or succeed with a result.

You could model that as

{
  "was_error": true,
  "error_message": "there was an error"
}

{
  "was_error": false,
  "result": 42
}

in your JSON representation.

But in a ML-like language (like, for example, Rust), you would instead model it as

type RequestResult
  = Error { error_message: String }
  | Success { result: i64 }

where the distinction in Error or Success makes it clear that error_message and result only exist in one of these cases, not the other.

We can encode exactly that idea into JSON in multiple ways, but not a "blessed" way.

Particularly notice the difference between the language representation, where the type is "closed" — only Success or Error can happen — and the data representation where the type is "open", more cases could potentially exist.

This is an important differentiation from a type system: Our idealized configuration language just gives more structure to a bag of data, it does not restrict which value can be where. Think of a value in an unityped language, like Python.

So far we have the notion of

a scalar (a primitive)
a list (ordering on values)
a record (unordered collection of named values)

and in order to get the "open" tagged enumeration values, we introduce

a tag, which gives a name to a value

We can then redefine record to mean "an unordered collection of tagged values", which further reduces the amount of concepts needed.

And that's it, this is the full idealized configuration language.

Some examples of data modelling with tags

This is all well and good, but what does it look like in practice?

For these examples I will be using JSON with a new < "tag": value > syntax to represent tags.

From a compatibility standpoint, tags (or sum types) have dual properties to record types.

With a record, when you have a producer that adds a field to it, the consumer will still be able to handle the record (provided the semantics of the existing fields is not changed by the new field).

With a tag, removing a tag from the producer will mean that the consumer will still be able to handle the tag. It might do one "dead" check on the removed tag, but can still handle the remaining ones just fine.

An example of how that is applied in practice is that in protobuf3, fields of a record are always optional fields.

We can model optional fields by wrapping them in < "Some": value > or < "None": {} > (where the actual value of the None is ignored or always an empty record).

So a protobuf with the fields foo: int and bar: string has to be parsed by the receiver as containing four possibilities:

#	foo	bar
1	`<"None":{}>`	`<"None":{}>`
2	`<"Some":42>`	`<"None":{}>`
3	`<"None":{}>`	`<"Some":"x">`
4	`<"Some":42>`	`<"Some":"x">`

Now, iff the receiver actually handles all four possibilities (and doesn't just crash if a field is not set, as customary in million-dollar-mistake languages), it's easy to see how removing a field from the producer is semantically equal to always setting it to <"None":{}>. Since all receivers should be ready to receive None for every field, this provides a simple forward-compatibility scheme.

We can abstract this to any kind of tag value: If you start with "more" tags, you give yourself space to remove them later without breaking compatibility, typically called "forward compatibility".

To empty list/record or not to

Something to think about is whether records and fields should be defined to always contain at least one element.

As it stands, JSON has multiple ways of expressing the "empty value":

null
[]
{}
""
leave out the field

and two of those come from the possibility of having empty structured values.

Representations of this language

This line of thought originally fell out of me designing netencode as a small human-debuggable format for pipeline serialization.

In addition to the concepts mentioned here (especially tags), it provides a better set of scalars than JSON (specifically arbitrary bytestrings), but it cannot practically be written or modified by hand, which might be a good thing depending on how you look at it.

The way that is compatible with the rest of the ecosystem is probably to use a subset of JSON to represent our idealized language.

There are multiple ways of encoding tags in JSON, which each have their pros and cons.

The most common is probably the "tag field" variant, where the tag is pulled into the nested record:

{
  "_tag": "Success",
  "result": 42
}

Which has the advantage that people know how to deal with it and that it's easy to "just add another field", plus it is backward-compatible when you had a record in the first place.

It has multiple disadvantages however:

If your value wasn't a record (e.g. an int) before, you have to put it in a record and assign an arbitrary name to its field
People are not forced to "unwrap" the tag first, so they are going to forget to check it
The magic "_tag" name cannot be used by any of the record's fields

An in-between version with less downsides is to always push a JSON record onto the stack:

{
  "tag": "Success",
  "value": {
    "result": 42
  }
}

This makes it harder for people to miss checking the tag, but still possible of course. It also makes it easily possible to inspect the contents of value without knowing the exhaustive list of tags, which can be useful in practice (though often not sound!).

Disadvantages:

Breaks backward-compatibility with an existing record-based approach if you want to introduce tags
Verbosity of representation
Hard to distinguish a record with the tag and value fields from a taged value

The final, "most pure" representation is the one from the original introduction:

{
  "Success": {
    "result": 42
  }
}

Now you have to match on the tag name first, before you can actually access your data, and it's less verbose than the above representation.

Disadvantages:

You also have to know what tags to expect, it's harder to query since you need to extract the keys and values from the dict and then take the first one.
Doing a "tag backwards compat" check is harder, because you can't just check whether _tag or tag/value are the keys in the dict.