Specification

Version: 0.1.0-draft
Status: Draft
Scope: a minimal standard for machine-addressable canonical text references.

1. Purpose

TextRefs defines a minimal registry standard for stable, machine-addressable references to texts.

A conforming TextRefs registry MUST provide persistent identifiers for canonical references and MUST describe the citation systems by which those references are formed. It MAY record dereferenceable locations for those references and curated mappings to external identifiers or other references.

The standard is deliberately small. Its centre is a single idea: a reference is an abstract identity, separate from any location, edition, or translation where the referenced text can be read.

2. Conformance

A dataset conforms to the TextRefs Standard if it satisfies all of the following:

It represents registry data using the object types defined in this standard.
Every registry object includes the required fields for its object type.
Every Work.key and CitationSystem.key is a flat, stable key that occupies one URI path segment.
Every CanonicalReference points to one known Work and one known CitationSystem.
Every CanonicalReference.locator validates syntactically against the referenced CitationSystem and is semantically valid, that is, a registered reference point for the referenced Work.
Every CitationSystem declares valid and invalid examples for automated tests.
Every dereferenceable location is represented as an entry in the resolver_targets array of its CanonicalReference, and every external identifier or cross-reference equivalence is represented by a MappingAssertion.
Every registry object includes administrative metadata.
Registry records contain identifiers, metadata, mappings, provenance, and resolver targets rather than primary text content.

3. Normative language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in BCP 14 (RFC 2119 and RFC 8174) when, and only when, they appear in all capitals.

4. Identity versus location

TextRefs separates identity from location.

Identity is abstract and language-independent. Work, CitationSystem, and CanonicalReference answer the question “which passage”: for example the New Testament, book-chapter-verse, John.3.16. There is exactly one such identity, regardless of how many editions, translations, or websites carry it.
Location and equivalence answer “where can I read it” and “what else is this the same as”. The resolver_targets array embedded in each CanonicalReference lists places where the reference can be read (specific translations, editions, or providers). MappingAssertion records that a Work is equivalent to an external identifier or to another Work.

A reference such as John.3.16 is the same identity whether read in Greek, the King James Version, or the Lutherbibel. The translation is a property of the location, never of the identity. This is what lets the model scale to works with many editions and translations (see §13).

TextRefs registry records store identifiers, metadata, mappings, provenance, and resolver targets. This keeps the registry legally reusable and stable across editions. A conforming record MUST NOT include full text, critical apparatus, commentary, translation text, or copyrighted edition content.

5. Core object types

A conforming registry MUST support these object types. Each top-level object MUST carry a type field matching one of them.

Type	Layer	Purpose
`Work`	identity	An abstract textual work.
`CitationSystem`	identity	A notation that fragments works into locators.
`CanonicalReference`	identity + location	One abstract reference point in a work, with embedded resolver targets.
`MappingAssertion`	equivalence	A curated equivalence between a `Work` and an external identifier or another `Work`.

Dereferenceable locations are not a separate object type. They are recorded as entries in the resolver_targets array embedded in each CanonicalReference (see §9). This keeps language-tagged locations co-located with the reference they describe, and means a work with N translations adds N array entries — not N standalone records.

Every object additionally carries the shared administrative metadata of §12 (omitted from the diagram for clarity).

classDiagram
    class Work {
        +URI id
        +string key
        +string preferred_label
    }
    class CitationSystem {
        +URI id
        +string key
        +string preferred_label
        +string locator_regex
        +string normalization_version
    }
    class CanonicalReference {
        +URI id
        +string work_key
        +string citation_system_key
        +string locator
        +string normalization_version
        +ResolverTargetEntry[] resolver_targets
    }
    class ResolverTargetEntry {
        +IRI url
        +string language
        +string edition
        +string provider
        +enum access
        +string license
    }
    class MappingAssertion {
        +URI id
        +URI subject
        +enum relation
        +string source
    }
    CanonicalReference --> "1" Work : work_key
    CanonicalReference --> "1" CitationSystem : citation_system_key
    CanonicalReference *-- "0..*" ResolverTargetEntry : resolver_targets
    MappingAssertion --> "1" Work : subject

MappingAssertion.subject MUST be a Work IRI. Per-passage external identifiers (e.g. the CTS URN of a single verse) are derived from work-level mappings and the reference locator at resolve time, not stored as separate assertions (see §10).
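The diagram can also be read as a set of plain data shapes. The sketch below mirrors it as TypeScript interfaces, using the field names from this document. It is an illustration only: the published JSON Schema (§14) remains the contract, and the names RecordStatus, Admin, RegistryObject, and layerOf are assumptions introduced here. Optional fields on ResolverTargetEntry reflect §9, where only url and access are required.

```typescript
// Illustrative shapes only; the published JSON Schema remains the contract.
type RecordStatus = "candidate" | "active" | "deprecated" | "withdrawn" | "blocked";

// Shared administrative metadata (§12), omitted from the diagram.
interface Admin {
  status: RecordStatus;
  created: string; // YYYY-MM-DD
  modified: string; // YYYY-MM-DD
}

interface Work extends Admin {
  id: string;
  key: string;
  type: "Work";
  preferred_label: string;
}

interface CitationSystem extends Admin {
  id: string;
  key: string;
  type: "CitationSystem";
  preferred_label: string;
  locator_regex: string;
  normalization_version: string;
  examples: { valid: string[]; invalid: string[] };
}

// Embedded entry: it has no id or type of its own.
interface ResolverTargetEntry {
  url: string;
  access: "open" | "paywalled" | "restricted" | "unknown";
  language?: string;
  edition?: string;
  provider?: string;
  license?: string;
}

interface CanonicalReference extends Admin {
  id: string;
  type: "CanonicalReference";
  work_key: string;
  citation_system_key: string;
  locator: string;
  normalization_version: string;
  resolver_targets: ResolverTargetEntry[];
}

interface MappingAssertion extends Admin {
  id: string;
  type: "MappingAssertion";
  subject: string; // a Work IRI
  relation: "exactMatch" | "closeMatch";
  target: { identifier: string; target_kind?: string };
  source: string;
}

type RegistryObject = Work | CitationSystem | CanonicalReference | MappingAssertion;

// Maps an object to the layer named in the table of core object types.
function layerOf(obj: RegistryObject): string {
  if (obj.type === "CanonicalReference") return "identity + location";
  if (obj.type === "MappingAssertion") return "equivalence";
  return "identity"; // Work and CitationSystem
}
```

Because every top-level object carries a type field, a consumer can discriminate on that field alone, as layerOf does.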

6. Work

A Work represents an abstract textual work, independent of editions, translations, manuscripts, files, websites, or resolver targets.

A Work record SHOULD be accepted only for a canonical text with an established reference system. The existence of an author, title, edition, file, or web page is not by itself sufficient.

A Work.key is a single flat registry key used to identify the abstract work in references and deterministic UUID seeds. Choose a stable, human-readable key such as plato.respublica or new-testament, and treat the whole string as the identifier. Rich bibliographic and authority data belongs in external systems and is connected to TextRefs records through MappingAssertions.

{
  "id": "https://textrefs.org/id/work/plato.respublica",
  "key": "plato.respublica",
  "type": "Work",
  "preferred_label": "Republic (Plato)",
  "status": "candidate",
  "created": "2026-05-31",
  "modified": "2026-05-31"
}

Required: id, key, type (Work), preferred_label, plus administrative metadata (status, created, modified; see §12). The id MUST be a persistent TextRefs HTTP URI of the form https://textrefs.org/id/work/{key}, where {key} is one flat key and occupies exactly one URI path segment. The key MUST be stable and suitable for deterministic identity generation.
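The relationship between id and key can be checked mechanically. The following is a minimal sketch, assuming a helper named isWorkIdConsistent (hypothetical). It tests only the URI form and the one-segment rule; the full flat-key syntax is defined in Identifier syntax and is not reproduced here.

```typescript
const WORK_ID_PREFIX = "https://textrefs.org/id/work/";

// Returns true when `id` is exactly the Work prefix followed by `key`
// and `key` is non-empty and stays within one URI path segment.
function isWorkIdConsistent(id: string, key: string): boolean {
  if (key.length === 0) return false;
  // "/" would add a path segment; "?" and "#" would end the path.
  if (/[\/?#]/.test(key)) return false;
  return id === WORK_ID_PREFIX + key;
}

const republicOk = isWorkIdConsistent(
  "https://textrefs.org/id/work/plato.respublica",
  "plato.respublica"
);
const nestedOk = isWorkIdConsistent(
  "https://textrefs.org/id/work/plato/respublica",
  "plato/respublica"
);
```

Here republicOk is true and nestedOk is false, because the second key spans two path segments.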

External identifiers for a Work (e.g. Wikidata Q-ID, DOI, VIAF) are recorded as MappingAssertions whose subject is the Work. They are not fields on the Work itself.

7. CitationSystem

A CitationSystem defines the notation and validation rules used to identify locations within one or more works. It is independent of any edition, provider, resolver service, or software implementation. Different versification or pagination traditions are different citation systems.

A CitationSystem.key is a single flat registry key for a locator notation and its validation rules. Choose a stable, human-readable key such as bekker, stephanus, or bible-book-chapter-verse. The key is used by canonical references through citation_system_key, so changing the key changes identity.

{
  "id": "https://textrefs.org/id/system/bible-book-chapter-verse",
  "key": "bible-book-chapter-verse",
  "type": "CitationSystem",
  "preferred_label": "Bible book-chapter-verse (OSIS-style)",
  "normalization_version": "1.0.0",
  "locator_regex": "^(?<book>[A-Za-z][A-Za-z0-9_]*)\\.(?<chapter>[1-9][0-9]*)\\.(?<verse>[1-9][0-9]*)$",
  "examples": {
    "valid": ["Genesis.1.1", "Psalms.23.1", "Matthew.5.3"],
    "invalid": ["Genesis.0.1", "Genesis.1", "1.1.1", "Genesis 1:1"]
  },
  "status": "candidate",
  "created": "2026-05-31",
  "modified": "2026-05-31"
}

Required: id, key, type (CitationSystem), preferred_label, normalization_version, locator_regex, examples.valid, examples.invalid, plus administrative metadata. The id MUST be a persistent TextRefs HTTP URI of the form https://textrefs.org/id/system/{key}, where {key} is one flat key and occupies exactly one URI path segment.

locator_regex MUST be a valid ECMAScript regular expression.
locator_regex provides machine-checkable pre-validation for locator shape only; it need not fully describe citation systems whose valid references cannot be expressed completely as a regular language.
Citation systems SHOULD use an anchored locator_regex when the pattern is intended to describe the full locator string.
Regex success does not by itself prove that a reference point exists in a work.
normalization_version MUST use semantic versioning.
examples.valid MUST all match locator_regex; examples.invalid MUST all fail it.
Unicode handling for keys and locators MUST follow Identifier syntax.
A pull request that adds or changes a citation system MUST include the profile, valid examples, and invalid examples. See Citation-system profiles.
A CanonicalReference links to its citation system through citation_system_key. JSON-LD serializations MAY additionally expose that relation with skos:inScheme.
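The rule that valid examples match and invalid examples fail lends itself to an automated test. A minimal sketch follows, using the record shown above as input; the names compilePattern, checkExamples, and ExampleReport are assumptions made for illustration, not part of the standard.

```typescript
interface ExampleReport {
  ok: boolean;
  problems: string[];
}

// Compiles a pattern as an ECMAScript regular expression; null when it is not valid.
function compilePattern(pattern: string): RegExp | null {
  try {
    return new RegExp(pattern);
  } catch (_err) {
    return null;
  }
}

// Checks that every valid example matches locator_regex and every invalid one fails it.
function checkExamples(system: {
  locator_regex: string;
  examples: { valid: string[]; invalid: string[] };
}): ExampleReport {
  const re = compilePattern(system.locator_regex);
  if (re === null) {
    return { ok: false, problems: ["locator_regex is not a valid regular expression"] };
  }
  const problems: string[] = [];
  for (const s of system.examples.valid) {
    if (!re.test(s)) problems.push("valid example does not match: " + s);
  }
  for (const s of system.examples.invalid) {
    if (re.test(s)) problems.push("invalid example matches: " + s);
  }
  return { ok: problems.length === 0, problems: problems };
}

const bibleSystem = {
  locator_regex:
    "^(?<book>[A-Za-z][A-Za-z0-9_]*)\\.(?<chapter>[1-9][0-9]*)\\.(?<verse>[1-9][0-9]*)$",
  examples: {
    valid: ["Genesis.1.1", "Psalms.23.1", "Matthew.5.3"],
    invalid: ["Genesis.0.1", "Genesis.1", "1.1.1", "Genesis 1:1"],
  },
};
const exampleReport = checkExamples(bibleSystem);
```

A passing report shows only that the examples agree with the pattern. It does not show that any locator exists in a work.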

8. CanonicalReference

A CanonicalReference represents one atomized, language-independent reference point, identified by combining a work, a citation system, a normalized locator, and a normalization version. It also carries the set of dereferenceable external locations for that reference as an embedded resolver_targets array (see §9).

{
  "id": "https://textrefs.org/id/ref/{uuid}",
  "type": "CanonicalReference",
  "work_key": "new-testament",
  "citation_system_key": "bible-book-chapter-verse",
  "locator": "John.3.16",
  "normalization_version": "1.0.0",
  "resolver_targets": [
    {
      "url": "https://www.stepbible.org/?q=version=SBLG|reference=John.3.16",
      "language": "grc",
      "edition": "SBL Greek New Testament",
      "provider": "STEP Bible",
      "access": "open",
      "license": "CC-BY-4.0"
    }
  ],
  "status": "candidate",
  "created": "2026-05-31",
  "modified": "2026-05-31"
}

Required: id, type (CanonicalReference), work_key, citation_system_key, locator, normalization_version, resolver_targets (MAY be empty), plus administrative metadata.

work_key MUST reference a known Work; citation_system_key MUST reference a known CitationSystem.
work_key and citation_system_key MUST be treated as opaque flat keys. Implementations MUST NOT infer author, corpus, title, hierarchy, or resolver behaviour by splitting either key.
locator MUST match the system’s locator_regex; additional profile-specific validation MAY be required for systems that are not fully regex-checkable.
An accepted CanonicalReference MUST represent an attested reference point for the referenced Work under the referenced CitationSystem.
normalization_version is part of the reference’s identity and is fixed when the reference is minted; it records the normalization in force at that time and need not equal the citation system’s current normalization_version. Its correctness is verified by the deterministic identifier (see §14 and Identifier syntax).
The id MUST be generated deterministically per Identifier syntax; its UUID component is the deterministic seed output.
resolver_targets MUST validate per §9.
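The first three rules above can be sketched as a lookup followed by a pattern test. The sketch assumes an in-memory registry shape (KnownRegistry) and a function name (checkReference) that are hypothetical. It covers syntactic checks only: attestation and the deterministic identifier are outside its scope.

```typescript
interface KnownRegistry {
  workKeys: string[]; // keys of known Work records
  systems: { key: string; locator_regex: string }[]; // known CitationSystem records
}

interface ReferenceFields {
  work_key: string;
  citation_system_key: string;
  locator: string;
}

// Returns a list of problems; an empty list means the syntactic checks passed.
function checkReference(ref: ReferenceFields, registry: KnownRegistry): string[] {
  const errors: string[] = [];
  // Keys are compared whole: they are opaque and are never split.
  if (registry.workKeys.indexOf(ref.work_key) === -1) {
    errors.push("unknown work_key: " + ref.work_key);
  }
  const matches = registry.systems.filter(function (s) {
    return s.key === ref.citation_system_key;
  });
  const system = matches[0];
  if (system === undefined) {
    errors.push("unknown citation_system_key: " + ref.citation_system_key);
    return errors;
  }
  // A regex match is pre-validation only; it does not prove the point is attested.
  if (!new RegExp(system.locator_regex).test(ref.locator)) {
    errors.push("locator does not match locator_regex: " + ref.locator);
  }
  return errors;
}

const sampleRegistry: KnownRegistry = {
  workKeys: ["new-testament"],
  systems: [
    {
      key: "bible-book-chapter-verse",
      locator_regex: "^(?<book>[A-Za-z][A-Za-z0-9_]*)\\.(?<chapter>[1-9][0-9]*)\\.(?<verse>[1-9][0-9]*)$",
    },
  ],
};
```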

9. Embedded resolver targets

resolver_targets is the array on each CanonicalReference that records dereferenceable external locations where the reference can be read — typically specific translations, editions, or providers. Each entry is a plain object; it has no independent id or type of its own, because its identity is the parent reference plus its position in the array.

{
  "url": "https://www.biblegateway.com/passage/?search=John%203%3A16&version=KJV",
  "language": "en",
  "edition": "King James Version",
  "provider": "Bible Gateway",
  "access": "open",
  "license": "CC0-1.0",
  "license_url": null,
  "last_checked": "2026-01-01"
}

Required per entry: url, access.

url MUST be a dereferenceable external IRI (RFC 3987).
language MUST be present when the entry is language-specific (e.g. a translation), as a BCP 47 language tag (RFC 5646). Tags MUST include an ISO 15924 script subtag when the entry uses a non-default script for the language (e.g. grc-Latn for Ancient Greek in Latin transliteration, as opposed to grc-Grek or hbo-Hebr, which name the default scripts).
edition SHOULD name the specific edition or version when known.
access MUST be one of open, paywalled, restricted, unknown.
license SHOULD be a current SPDX license identifier (e.g. CC0-1.0, CC-BY-4.0) when the licence of the target resource is known. For licences not in the SPDX list, omit license and use the optional license_url to point at the licence text.
Values implying permission to host copyrighted full text (e.g. a license of proprietary accompanied by hosted text) are forbidden; the no-text rule in §2 and §4 governs.
A CanonicalReference whose resolver_targets is an empty array remains a valid identity record; adding or removing an entry MUST NOT change the parent reference’s id.
Tombstoning a single bad URL is done by removing the entry; tombstoning the whole reference uses the parent status field. There is no independent status on individual entries.
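A per-entry check for the required fields and the access vocabulary might look as follows. This is a sketch under stated assumptions: checkResolverTarget is a hypothetical name, the language pattern is a deliberately loose approximation of a BCP 47 tag rather than the RFC 5646 grammar, and the url test confirms only that a scheme is present.

```typescript
const ACCESS_VALUES = ["open", "paywalled", "restricted", "unknown"];

// Loose shape check for language[-Script][-Region]; an assumption for this sketch,
// not the full RFC 5646 grammar.
const SIMPLE_LANGUAGE_TAG = /^[A-Za-z]{2,3}(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?$/;

// Checks only that a scheme is present; it does not dereference the IRI.
const HAS_SCHEME = /^[A-Za-z][A-Za-z0-9+.-]*:/;

// Returns a list of problems for one resolver_targets entry.
function checkResolverTarget(entry: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const url = entry["url"];
  const access = entry["access"];
  const language = entry["language"];
  if (typeof url !== "string" || !HAS_SCHEME.test(url)) {
    errors.push("url is required and must be an absolute IRI");
  }
  if (typeof access !== "string" || ACCESS_VALUES.indexOf(access) === -1) {
    errors.push("access must be one of: " + ACCESS_VALUES.join(", "));
  }
  if (
    language !== undefined &&
    (typeof language !== "string" || !SIMPLE_LANGUAGE_TAG.test(language))
  ) {
    errors.push("language must be a BCP 47 language tag");
  }
  return errors;
}

const kjvEntry = {
  url: "https://www.biblegateway.com/passage/?search=John%203%3A16&version=KJV",
  language: "en",
  edition: "King James Version",
  provider: "Bible Gateway",
  access: "open",
  license: "CC0-1.0",
  license_url: null,
  last_checked: "2026-01-01",
};
const kjvErrors = checkResolverTarget(kjvEntry);
```

Whether an entry is language-specific, and therefore needs language at all, is an editorial judgement that this check cannot make.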

10. MappingAssertion

A MappingAssertion records a curated equivalence claim between a TextRefs Work and an external identifier (CTS URN, Wikidata Q-ID, DOI, ARK, …) or another TextRefs Work. There is no separate object type for external identifiers; they are always expressed as mapping targets.

{
  "id": "https://textrefs.org/id/mapping/{uuid}",
  "type": "MappingAssertion",
  "subject": "https://textrefs.org/id/work/new-testament",
  "relation": "exactMatch",
  "target": {
    "target_kind": "wikidata",
    "identifier": "https://www.wikidata.org/entity/Q18813"
  },
  "source": "manual-curation",
  "status": "candidate",
  "created": "2026-05-31",
  "modified": "2026-05-31"
}

Required: id, type (MappingAssertion), subject, relation, target, source, plus administrative metadata.

subject MUST be a Work IRI of the form https://textrefs.org/id/work/{work_key}. Per-passage external identifiers (e.g. the CTS URN of a single verse) are derived from work-level mappings combined with the reference locator at resolve time; they MUST NOT be stored as separate MappingAssertion records.
target.identifier MUST be an IRI (RFC 3987) that identifies a textual resource: a work, edition, manuscript, citation system, or another TextRefs Work.
target.target_kind is OPTIONAL and is a human-readable scheme hint (e.g. "cts", "doi", "wikidata", "textrefs"). Validators MUST NOT key behaviour off it. The presence or absence of target_kind carries no normative weight; the IRI in identifier is authoritative. See Appendix B for non-normative examples.
relation MUST be one of the SKOS-compatible values exactMatch or closeMatch. Use exactMatch only when the mapped resource identifies the same work with sufficient precision; if there is any uncertainty about edition, coverage, or work boundaries, use closeMatch.
source documents the basis for the assertion. A structured W3C PROV-O mapping is reserved for a future version.
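The machine-checkable parts of these rules are the shape of subject, the relation vocabulary, and the presence of a scheme in target.identifier. A minimal sketch follows; checkMapping is a hypothetical name. Whether an IRI actually identifies a textual resource is a curation decision and is not tested here.

```typescript
// One path segment after /id/work/, per the Work identifier form.
const WORK_IRI = /^https:\/\/textrefs\.org\/id\/work\/[^\/?#]+$/;
const RELATION_VALUES = ["exactMatch", "closeMatch"];
const ABSOLUTE_IRI = /^[A-Za-z][A-Za-z0-9+.-]*:/;

// Returns a list of problems for one MappingAssertion.
function checkMapping(m: {
  subject: string;
  relation: string;
  target: { identifier: string; target_kind?: string };
}): string[] {
  const errors: string[] = [];
  if (!WORK_IRI.test(m.subject)) {
    errors.push("subject must be a Work IRI");
  }
  if (RELATION_VALUES.indexOf(m.relation) === -1) {
    errors.push("relation must be exactMatch or closeMatch");
  }
  // target_kind is deliberately ignored: it is a hint and carries no normative weight.
  if (!ABSOLUTE_IRI.test(m.target.identifier)) {
    errors.push("target.identifier must be an absolute IRI");
  }
  return errors;
}

const wikidataMappingErrors = checkMapping({
  subject: "https://textrefs.org/id/work/new-testament",
  relation: "exactMatch",
  target: { target_kind: "wikidata", identifier: "https://www.wikidata.org/entity/Q18813" },
});
```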

11. Identifier policy

TextRefs identifiers MUST be persistent HTTP URIs (RFC 3986) or IRIs (RFC 3987), independent of external URLs, resolver targets, edition identifiers, provider-specific identifiers, and website structures. The deterministic UUID seed remains ASCII-only; see Identifier syntax.

Work identifiers MUST use https://textrefs.org/id/work/{key} and CitationSystem identifiers MUST use https://textrefs.org/id/system/{key}. In both cases {key} is the complete flat key and MUST NOT contain additional path segments. For example, https://textrefs.org/id/work/plato.respublica is valid; https://textrefs.org/id/work/plato/respublica is not.

A CanonicalReference identifier MUST be generated deterministically. The identity seed MUST include work_key, citation_system_key, locator, and normalization_version, in that order (see Identifier syntax).

A MappingAssertion identifier MUST be generated deterministically from subject, relation, and target.identifier, in that order, using the mapping namespace (see Identifier syntax). It MUST remain UUID-based and MUST NOT be derived from provider URLs, corpus paths, or resolver structures. Resolver-target entries do not have their own identifiers.

An implementation MUST NOT silently change the identity-defining fields of an existing CanonicalReference. Because those fields seed the deterministic identifier, any change produces a new CanonicalReference with a new identifier. The prior reference MUST be retained as a tombstone (status deprecated or withdrawn, §12) and SHOULD be linked to its replacement through an exactMatch MappingAssertion (§10).
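An implementation can guard against silent identity changes by comparing the four seed fields before it writes an update. The sketch below assumes helper names (changedIdentityFields, requiresNewReference) that are not defined by this standard. It does not compute the UUID itself, which is governed by Identifier syntax.

```typescript
interface IdentityFields {
  work_key: string;
  citation_system_key: string;
  locator: string;
  normalization_version: string;
}

// The identity-defining fields, in seed order.
const IDENTITY_FIELDS: (keyof IdentityFields)[] = [
  "work_key",
  "citation_system_key",
  "locator",
  "normalization_version",
];

// Lists the identity-defining fields whose values differ between two versions.
function changedIdentityFields(prev: IdentityFields, next: IdentityFields): string[] {
  return IDENTITY_FIELDS.filter(function (field) {
    return prev[field] !== next[field];
  });
}

// True when the edit must mint a new CanonicalReference and tombstone the old one.
// Fields outside the seed, such as resolver_targets, never trigger this.
function requiresNewReference(prev: IdentityFields, next: IdentityFields): boolean {
  return changedIdentityFields(prev, next).length > 0;
}

const mintedRef: IdentityFields = {
  work_key: "new-testament",
  citation_system_key: "bible-book-chapter-verse",
  locator: "John.3.16",
  normalization_version: "1.0.0",
};
const editedRef: IdentityFields = {
  work_key: "new-testament",
  citation_system_key: "bible-book-chapter-verse",
  locator: "John.3.17",
  normalization_version: "1.0.0",
};
const needsNewId = requiresNewReference(mintedRef, editedRef);
```

In this example needsNewId is true, because the locator differs.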

A conforming registry SHOULD publish each /id/{type}/{key} IRI at two static URLs: the canonical URL itself (HTML for browsers) and a sibling with a .json extension carrying the JSON-LD payload. The HTML representation SHOULD advertise the JSON-LD sibling via <link rel="alternate" type="application/json" href="…json"> in the document head. Accept-header content negotiation is not required.

12. Administrative metadata

Every registry object MUST include:

{
  "status": "active",
  "created": "2026-01-01",
  "modified": "2026-01-01"
}

created and modified MUST be ISO 8601 calendar dates in YYYY-MM-DD form.
status MUST be one of:
- candidate — proposed but not yet accepted as stable.
- active — accepted and recommended for use.
- deprecated — retained but no longer recommended.
- withdrawn — removed from active use because it was erroneous or has been superseded. If a successor exists, it is linked by an exactMatch MappingAssertion; see Versioning for tombstones.
- blocked — retained as a visible tombstone because of a rights, trust, or policy dispute.

Deprecated, withdrawn, and blocked records SHOULD remain visible unless removal is required for legal, privacy, or safety reasons.
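The administrative fields can be validated with a status lookup and a strict calendar-date test. The following is a sketch only; isCalendarDate and checkAdmin are hypothetical names, and the date test accepts just the YYYY-MM-DD form required above.

```typescript
const STATUS_VALUES = ["candidate", "active", "deprecated", "withdrawn", "blocked"];

// True for a real calendar date written as YYYY-MM-DD (so 2026-02-30 is rejected).
function isCalendarDate(value: string): boolean {
  const m = /^([0-9]{4})-([0-9]{2})-([0-9]{2})$/.exec(value);
  if (m === null) return false;
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = Number(m[3]);
  // Build the date in UTC and confirm no field rolled over.
  const d = new Date(0);
  d.setUTCFullYear(year, month - 1, day);
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

// Returns a list of problems with the administrative metadata of one object.
function checkAdmin(obj: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const statusValue = obj["status"];
  if (typeof statusValue !== "string" || STATUS_VALUES.indexOf(statusValue) === -1) {
    errors.push("status must be one of: " + STATUS_VALUES.join(", "));
  }
  for (const field of ["created", "modified"]) {
    const value = obj[field];
    if (typeof value !== "string" || !isCalendarDate(value)) {
      errors.push(field + " must be a calendar date in YYYY-MM-DD form");
    }
  }
  return errors;
}

const adminErrors = checkAdmin({ status: "active", created: "2026-01-01", modified: "2026-01-01" });
```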

13. Worked example: a multi-translation work

This is the case that motivates separating identity from location. The New Testament exists in many editions and translations, yet John.3.16 is one reference in the OSIS-style book-chapter-verse system.

One identity — a single Work, CitationSystem, and CanonicalReference. The reference embeds all language-tagged locations as resolver_targets:

{
  "work": {
    "key": "new-testament",
    "type": "Work",
    "preferred_label": "New Testament"
  },
  "citation_system": {
    "key": "bible-book-chapter-verse",
    "type": "CitationSystem",
    "locator_regex": "^(?<book>[A-Za-z][A-Za-z0-9_]*)\\.(?<chapter>[1-9][0-9]*)\\.(?<verse>[1-9][0-9]*)$"
  },
  "canonical_reference": {
    "type": "CanonicalReference",
    "work_key": "new-testament",
    "citation_system_key": "bible-book-chapter-verse",
    "locator": "John.3.16",
    "resolver_targets": [
      {
        "url": "https://www.stepbible.org/?q=version=SBLG|reference=John.3.16",
        "language": "grc",
        "edition": "SBL Greek New Testament",
        "provider": "STEP Bible",
        "access": "open",
        "license": "CC-BY-4.0"
      }
    ]
  }
}

Adding another edition or translation appends one entry to resolver_targets. The reference identity — its UUID, its work, its citation system, its locator — does not change.

Divergent versification is the one case that does create separate references. Where traditions number verses differently (e.g. the Psalms in the Masoretic text versus the Vulgate/Septuagint), each tradition is a distinct CitationSystem, its references are distinct CanonicalReferences, and the equivalence between them is recorded as a closeMatch MappingAssertion — not by collapsing them into one identity.

14. Validation requirements

A conforming validator MUST check:

1. required fields for each object type and for each resolver_targets entry;
2. object type values and TextRefs URI patterns, including Work and CitationSystem IDs whose keys occupy exactly one path segment;
3. flat-key syntax and uniqueness for Work.key and CitationSystem.key;
4. administrative metadata and status values;
5. citation-system locator_regex syntax, and its valid/invalid examples;
6. canonical-reference locator syntax (the normalization_version is the value fixed at minting, verified by the deterministic identifier in item 8, not matched against the system’s current version);
7. canonical-reference semantic validity: accepted records must be registered, attested reference points for their Work and CitationSystem;
8. deterministic-identifier correctness for canonical references and mapping assertions;
9. UUID-based identifier shape for CanonicalReference and MappingAssertion records;
10. resolver_targets entries: access values, BCP 47 syntax of language and its presence for language-specific entries, and SPDX syntax of license when present;
11. mapping relation values and the Work-IRI shape of MappingAssertion.subject;
12. absence of forbidden full-text/apparatus/commentary content.

A validator SHOULD report errors in a machine-readable format, and SHOULD distinguish syntactically valid, registered, mapped, and resolvable references. An input locator that matches locator_regex but has no corresponding registered CanonicalReference is syntactically valid but not a valid TextRefs reference.
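The four levels named above can be reported as independent flags. The sketch below is one possible reading, not a normative definition: it assumes that "mapped" means the reference’s Work has at least one MappingAssertion and that "resolvable" means the reference has at least one resolver_targets entry. The names classifyReference, RegisteredReference, and ReferenceReport are hypothetical, and a single citation system is assumed for brevity.

```typescript
interface RegisteredReference {
  work_key: string;
  locator: string;
  resolver_targets: { url: string }[];
}

interface ReferenceReport {
  syntactic: boolean;
  registered: boolean;
  mapped: boolean;
  resolvable: boolean;
}

// Classifies an input locator against the registry contents.
function classifyReference(
  workKey: string,
  locator: string,
  locatorRegex: string,
  references: RegisteredReference[],
  mappedWorkKeys: string[]
): ReferenceReport {
  const syntactic = new RegExp(locatorRegex).test(locator);
  const hits = references.filter(function (r) {
    return r.work_key === workKey && r.locator === locator;
  });
  const hit = hits[0];
  // Matching the pattern is not enough: the reference must also be registered.
  const registered = syntactic && hit !== undefined;
  return {
    syntactic: syntactic,
    registered: registered,
    mapped: registered && mappedWorkKeys.indexOf(workKey) !== -1,
    resolvable: registered && hit !== undefined && hit.resolver_targets.length > 0,
  };
}

const verseRegex = "^(?<book>[A-Za-z][A-Za-z0-9_]*)\\.(?<chapter>[1-9][0-9]*)\\.(?<verse>[1-9][0-9]*)$";
const registeredRefs: RegisteredReference[] = [
  {
    work_key: "new-testament",
    locator: "John.3.16",
    resolver_targets: [{ url: "https://www.biblegateway.com/passage/?search=John%203%3A16&version=KJV" }],
  },
];
const unregistered = classifyReference("new-testament", "John.3.17", verseRegex, registeredRefs, ["new-testament"]);
```

In this example unregistered.syntactic is true while unregistered.registered is false, which is exactly the case described above.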

A normative JSON Schema 2020-12 document, generated from the canonical Zod schemas, is published at https://textrefs.org/schemas/v1/textrefs.schema.json. The Zod schemas are the implementation source of truth; the JSON Schema is the published machine-readable contract.

15. Extensions

Implementations MAY define extensions, but extensions MUST NOT change the meaning of standard fields and MUST NOT make non-standard fields required for conformance. Content-related extensions MUST be defined separately from this standard.

16. Normative references

This standard relies on the following external standards. Each is normative wherever it is cited above.

Topic	Standard
Normative keywords	BCP 14 / RFC 2119 / RFC 8174
Language tags	BCP 47 / RFC 5646
Script subtags	ISO 15924
Dates	ISO 8601
URIs	RFC 3986
IRIs	RFC 3987
UUIDs	RFC 9562 (obsoletes RFC 4122)
Unicode normalization (NFC)	Unicode Standard Annex #15
Regular expression dialect	ECMA-262 §22.2
Versioning	SemVer 2.0.0
Linked-data serialization	JSON-LD 1.1
Concepts and mapping relations	SKOS
Dates, provenance, language, licence	Dublin Core Terms
URL, provider, edition, work type	schema.org
Licence identifiers	SPDX License List
Machine-readable schema	JSON Schema 2020-12

Appendix A. Conformance boundary

This standard defines the minimum requirements for a TextRefs registry. Applications, resolvers, editorial tools, APIs, and visualizations may be built on top of it; they conform only insofar as their registry records satisfy this standard.

The following concerns are outside the core registry and belong in application, extension, or resolver layers:

full-text hosting, edition/manuscript modelling, translation hosting, textual apparatus, commentary, thematic annotation;
authority-file or catalogue modelling for agents, organisations, subjects, genres, or corpora;
citation-style rendering, recommendation systems, legal rights clearance for external content.

Appendix B. Well-known external identifier schemes (informative)

The following identifier schemes commonly satisfy §10’s “textual resource” rule and are useful values for MappingAssertion.target.identifier. Treat this table as implementation guidance: the authoritative rule is still whether the IRI identifies a textual resource.

Scheme	`target_kind` hint	Example identifier
TextRefs	`textrefs`	`https://textrefs.org/id/ref/988e0b39-…`
CTS URN	`cts`	`urn:cts:greekLit:tlg0031.tlg004:3.16`
DTS	`dts`	`https://dts.example/api/collection?id=urn:cts:…`
DOI	`doi`	`https://doi.org/10.5281/zenodo.7702622`
ARK	`ark`	`https://n2t.net/ark:/12148/btv1b8451636f`
Handle	`handle`	`https://hdl.handle.net/1887/4531`
PURL	`purl`	`https://purl.org/dc/terms/`
URN:NBN	`urn-nbn`	`urn:nbn:de:bvb:12-bsb00012345-2`
Wikidata	`wikidata`	`https://www.wikidata.org/entity/Q18813`

TextRefs keeps mappings focused on textual resources. Identifiers of agents, organisations, instruments, or non-textual datasets (e.g. ROR, ORCID, ISNI) belong in external authority systems reached through mapped textual resources, not in MappingAssertion.target.

A passage-level external identifier (e.g. the CTS URN of a single verse) is derived at resolve time from the work-level mapping plus the reference locator; it is not stored as a separate MappingAssertion. For example, a Work mapping new-testament → urn:cts:greekLit:tlg0031.tlg004 plus the reference locator John.3.16 can yield a derived passage URN for that verse. Source data carries the work-level mapping plus a locator template; the registry does not store one mapping per passage.
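Resolve-time derivation can be sketched as a pure function of the work-level URN and the locator. The format of the locator template is not specified in this document, so the rule used here (drop the leading book segment and keep chapter.verse) is a hypothetical stand-in, as is the function name deriveCtsPassageUrn. The URN values are those of the example above.

```typescript
// Derives a passage-level CTS URN from a work-level URN and a locator.
// The template (drop the book segment, keep "chapter.verse") is an assumption
// made for this sketch; a real registry would read it from source data.
function deriveCtsPassageUrn(workUrn: string, locator: string): string | null {
  const m = /^[^.]+\.([0-9]+)\.([0-9]+)$/.exec(locator);
  if (m === null) return null; // the locator does not fit the assumed template
  return workUrn + ":" + m[1] + "." + m[2];
}

const derivedUrn = deriveCtsPassageUrn("urn:cts:greekLit:tlg0031.tlg004", "John.3.16");
```

Here derivedUrn is urn:cts:greekLit:tlg0031.tlg004:3.16. Nothing is written back to the registry: the passage URN exists only as the output of the function.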