Schema as the contract¶
Explanation — the central idea everything else follows from.
The idea¶
In this project the XSD schema is the contract between organisations and between tools. It is the single source of truth for what a valid Acoustic Dataset document is. Everything else — the typed data classes, the validation, the HTML reference, the ER diagram, any language bindings — is derived from it, never maintained in parallel with it.
This is the plain statement of two of the delivery plan's carried principles:
- Configure, don't create — if the schema can generate something, we generate it.
- Data as the contract — the schema and the data are the inter-organisational contract; documents and bindings are downstream projections.
Why it matters¶
The failure mode we are designing against is drift: several hand-maintained
representations of "the format" slowly disagreeing. The old write_xml.py was one such
representation living in code; a hand-edited model class would be another; a hand-drawn
diagram a third. Each is a place the truth can rot.
If there is exactly one source (the XSD) and everything else is generated from it, drift becomes structurally impossible — you change the schema and re-generate, or you don't change the format at all. CI enforces this by regenerating and failing on any difference (see ADR 0008).
What flows from the one source¶
flowchart TD
XSD["Enriched XSD<br/>(schema/acoustic_dataset.xsd)"]
XSD --> Models["Typed data classes<br/>(xsdata)"]
XSD --> Validate["Validation gate<br/>(xmlschema)"]
XSD --> Docs["HTML schema reference<br/>(MkDocs Material)"]
XSD --> ERD["Mermaid ERD"]
XSD --> Bindings["Other-language bindings<br/>(Java, JSON Schema — later)"]
Models --> XML["Emitted XML Acoustic Dataset"]
Validate --> XML
"Enriched" — where definitions live¶
An enriched XSD carries human documentation inside the schema using
xs:annotation/xs:documentation (not XML comments, which are discarded at parse time). That
choice has leverage: the same annotation becomes
- the docstring on the generated data class, and
- the prose in the generated HTML reference and the labels in the ERD.
So a definition is written once, in the contract, and shows up everywhere a consumer might look. By contrast, engineering "how it's computed" notes do not belong in the schema — the schema knows nothing about the calculation — they live on the methods in code.
The boundary of the guarantee¶
Working in schema-derived entities is a real gain, but be precise about its limits: entity-level modelling is solid; field-level type strength is only as rich as the XSD declares. If the schema types a field as a plain string, the generated class has a string — the contract can't give you stronger typing than it states. This is why getting the schema right (and enriched) is the high-leverage work.
See also¶
- The two verification gates — why schema-valid isn't enough.
- Pipeline data flow — the entities and how they move.
- ADRs 0001, 0008, 0009.