Postprocessing

genschema can optionally run an additional independent postprocessing step after schema generation. This stage does not modify Converter and does not require the converter to know anything about reference extraction.

The main entry point is genschema.postprocessing.SchemaReferencePostprocessor.process.

It analyzes an already generated JSON Schema, finds repeated or highly similar subschemas, merges each candidate group through the regular genschema pipeline, then moves each merged result into $defs and replaces its occurrences with $ref.

Why this is separate

The generator and the postprocessor solve different problems:

  • Converter builds the schema from JSON instances and input schemas

  • SchemaReferencePostprocessor reorganizes the resulting schema for reuse

That separation makes the postprocessing stage reusable for:

  • schemas produced by one JSON input

  • schemas produced by many JSON inputs

  • schemas loaded from elsewhere, as long as they are ordinary JSON Schema dicts

Basic example

from genschema import Converter, PseudoArrayHandler
from genschema.comparators import (
    DeleteElement,
    EmptyComparator,
    FormatComparator,
    RequiredComparator,
)
from genschema.postprocessing import (
    SchemaReferenceExtractionConfig,
    SchemaReferencePostprocessor,
)

conv = Converter(
    pseudo_handler=PseudoArrayHandler(),
    base_of="anyOf",
)

conv.add_json("input1.json")
conv.add_json("input2.json")
conv.add_json("input3.json")

conv.register(FormatComparator())
conv.register(RequiredComparator())
conv.register(EmptyComparator())
conv.register(DeleteElement())
conv.register(DeleteElement("isPseudoArray"))

schema = conv.run()

schema = SchemaReferencePostprocessor.process(
    schema,
    SchemaReferenceExtractionConfig(
        similarity_threshold=0.8,
        min_total_keys=3,
    ),
)

Single JSON also works

The postprocessor does not require multiple input files.

If one document contains repeated or highly similar nested structures, those fragments can still be extracted into shared references:

conv = Converter(pseudo_handler=PseudoArrayHandler(), base_of="anyOf")
conv.add_json("input.json")

conv.register(FormatComparator())
conv.register(RequiredComparator())
conv.register(EmptyComparator())
conv.register(DeleteElement())
conv.register(DeleteElement("isPseudoArray"))

schema = conv.run()
schema = SchemaReferencePostprocessor.process(schema)

For example, one input document may still contain:

  • repeated address objects

  • similar person-like objects such as customer and manager

  • similar location-like objects such as warehouse and pickupPoint

CLI support

Common reference-extraction settings are also available from the CLI:

genschema input.json --extract-refs -o schema.json

genschema input.json \
    --extract-refs \
    --refs-similarity-threshold 0.9 \
    --refs-min-total-keys 4 \
    --refs-min-occurrences 2 \
    -o schema.json

For custom merge strategies, naming strategies, or non-default merge comparators, use the Python API.

Configuration

Use genschema.postprocessing.SchemaReferenceExtractionConfig to adjust the extraction behavior.

Important options:

  • similarity_threshold: minimum similarity score required to group subschemas, in the (0, 1] range

  • min_total_keys: minimum combined number of structural keys a candidate must have before it is considered for extraction

  • min_occurrences: minimum number of matching nodes in a group

  • defs_key: output container for extracted definitions, default $defs

  • ref_prefix: custom prefix for created references

  • merge_base_of: anyOf / oneOf / allOf for the merge stage

  • merge_comparator_factories: comparators used during group merge

  • preserve_common_keywords: enables a final comparator that restores identical non-structural schema keywords such as title or description

  • merge_strategy: custom full merge implementation

  • name_factory: custom naming strategy for created definitions
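As a sketch, several of these options can be combined in one configuration. The option names below come from the list above; the concrete argument values (and the ref_prefix value in particular) are illustrative assumptions, not documented defaults:

```python
from genschema.postprocessing import (
    SchemaReferenceExtractionConfig,
    SchemaReferencePostprocessor,
)

config = SchemaReferenceExtractionConfig(
    similarity_threshold=0.85,  # group subschemas that are at least 85% similar
    min_total_keys=4,           # skip very small fragments
    min_occurrences=2,          # a group needs at least two matching nodes
    defs_key="$defs",           # output container for extracted definitions
    ref_prefix="#/$defs/",      # assumed prefix for created references
    merge_base_of="oneOf",      # combinator used while merging a group
)

schema = SchemaReferencePostprocessor.process(schema, config)
```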

How similarity works

The default strategy compares structural tokens collected from a subschema:

  • object property names

  • pattern-property names

  • nested object and array shape

  • type signatures

  • selected keywords such as format and enum

That means structures may be merged even when they are not identical.

Example:

  • object A has id, fullName, email, phone

  • object B has id, fullName, email, phone, department

With a relaxed enough similarity_threshold, both can become one shared definition. The merged result is then built through the normal genschema merge pipeline, so conflicts are represented using the configured combinator logic.
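The exact scoring function is internal to genschema and also weighs nested shape, type signatures, and keywords such as format and enum. As a simplified illustration only, a Jaccard-style similarity over property-name tokens already behaves the way the example above describes:

```python
def jaccard(tokens_a: set[str], tokens_b: set[str]) -> float:
    """Similarity of two token sets: |intersection| / |union|."""
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Property-name tokens for the two objects from the example above.
a = {"id", "fullName", "email", "phone"}
b = {"id", "fullName", "email", "phone", "department"}

score = jaccard(a, b)  # 4 shared tokens out of 5 distinct -> 0.8
```

Under this simplified metric, a similarity_threshold of 0.8 would place both objects in the same group, while a stricter threshold such as 0.9 would keep them separate.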

Minimum structure size

Extraction is intentionally conservative. Very small fragments are often not worth moving into shared refs because the schema becomes harder to read without meaningful deduplication.

By default:

  • min_total_keys = 3

So two tiny objects with only one or two keys will stay inline unless you lower that threshold.

Result shape

The postprocessor returns a new schema dict. It does not mutate the original input object in place.

Typical result:

{
    "$defs": {
        "Address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "postalCode": {"type": "string"}
            }
        }
    },
    "properties": {
        "billingAddress": {"$ref": "#/$defs/Address"},
        "shippingAddress": {"$ref": "#/$defs/Address"}
    }
}
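The created $ref values are ordinary local JSON Pointers, so any standard JSON Schema tooling can resolve them. As a small self-contained sketch (not part of genschema), following one by hand:

```python
def resolve_local_ref(schema: dict, ref: str) -> dict:
    """Follow a local '#/...' JSON Pointer within the same schema document."""
    if not ref.startswith("#/"):
        raise ValueError(f"not a local reference: {ref}")
    node = schema
    for part in ref[2:].split("/"):
        # Unescape JSON Pointer tokens (~1 -> '/', ~0 -> '~').
        node = node[part.replace("~1", "/").replace("~0", "~")]
    return node

schema = {
    "$defs": {
        "Address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "postalCode": {"type": "string"},
            },
        }
    },
    "properties": {
        "billingAddress": {"$ref": "#/$defs/Address"},
        "shippingAddress": {"$ref": "#/$defs/Address"},
    },
}

ref = schema["properties"]["billingAddress"]["$ref"]
address = resolve_local_ref(schema, ref)  # the shared Address definition
```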

Notes

  • existing $ref nodes are not treated as extraction candidates

  • existing definition sections can be skipped during candidate discovery

  • overlapping candidate groups are resolved so the same fragment is not extracted twice

  • definition names can be customized