Postprocessing¶
genschema can optionally run an additional independent postprocessing step
after schema generation. This stage does not modify Converter and does not
require the converter to know anything about reference extraction.
The main entry point is SchemaReferencePostprocessor.process().
It analyzes an already generated JSON Schema, finds repeated or highly similar
subschemas, merges each candidate group through the regular genschema pipeline,
then moves the result into $defs and replaces occurrences with $ref.
Why this is separate¶
The generator and the postprocessor solve different problems:
Converter: builds the schema from JSON instances and input schemas
SchemaReferencePostprocessor: reorganizes the resulting schema for reuse
That separation makes the postprocessing stage reusable for:
schemas produced by one JSON input
schemas produced by many JSON inputs
schemas loaded from elsewhere, as long as they are ordinary JSON Schema dicts
Basic example¶
```python
from genschema import Converter, PseudoArrayHandler
from genschema.comparators import (
    DeleteElement,
    EmptyComparator,
    FormatComparator,
    RequiredComparator,
)
from genschema.postprocessing import (
    SchemaReferenceExtractionConfig,
    SchemaReferencePostprocessor,
)

conv = Converter(
    pseudo_handler=PseudoArrayHandler(),
    base_of="anyOf",
)
conv.add_json("input1.json")
conv.add_json("input2.json")
conv.add_json("input3.json")

conv.register(FormatComparator())
conv.register(RequiredComparator())
conv.register(EmptyComparator())
conv.register(DeleteElement())
conv.register(DeleteElement("isPseudoArray"))

schema = conv.run()
schema = SchemaReferencePostprocessor.process(
    schema,
    SchemaReferenceExtractionConfig(
        similarity_threshold=0.8,
        min_total_keys=3,
    ),
)
```
Single JSON also works¶
The postprocessor does not require multiple input files.
If one document contains repeated or highly similar nested structures, those fragments can still be extracted into shared references:
```python
conv = Converter(pseudo_handler=PseudoArrayHandler(), base_of="anyOf")
conv.add_json("input.json")

conv.register(FormatComparator())
conv.register(RequiredComparator())
conv.register(EmptyComparator())
conv.register(DeleteElement())
conv.register(DeleteElement("isPseudoArray"))

schema = conv.run()
schema = SchemaReferencePostprocessor.process(schema)
```
For example, one input document may still contain:
repeated address objects
similar person-like objects such as customer and manager
similar location-like objects such as warehouse and pickupPoint
CLI support¶
Common reference-extraction settings are also available from the CLI:
```bash
genschema input.json --extract-refs -o schema.json
```

```bash
genschema input.json \
  --extract-refs \
  --refs-similarity-threshold 0.9 \
  --refs-min-total-keys 4 \
  --refs-min-occurrences 2 \
  -o schema.json
```
For custom merge strategies, naming strategies, or non-default merge comparators, use the Python API.
Configuration¶
Use genschema.postprocessing.SchemaReferenceExtractionConfig to adjust
the extraction behavior.
Important options:
similarity_threshold: similarity score in the (0, 1] range
min_total_keys: minimum combined number of structural keys before a node is worth extracting
min_occurrences: minimum number of matching nodes in a group
defs_key: output container for extracted definitions, default $defs
ref_prefix: custom prefix for created references
merge_base_of: anyOf / oneOf / allOf for the merge stage
merge_comparator_factories: comparators used during group merge
preserve_common_keywords: enables a final comparator that restores identical non-structural schema keywords such as title or description
merge_strategy: custom full merge implementation
name_factory: custom naming strategy for created definitions
How similarity works¶
The default strategy compares structural tokens collected from a subschema:
object property names
pattern-property names
nested object and array shape
type signatures
selected keywords such as format and enum
That means structures may still be merged when they are not perfectly equal.
Example:
object A has id, fullName, email, phone
object B has id, fullName, email, phone, department
With a relaxed enough similarity_threshold, both can become one shared
definition. The merged result is then built through the normal genschema merge
pipeline, so conflicts are represented using the configured combinator logic.
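To get a feel for how two key sets relate to a threshold like 0.8, the overlap of the A and B objects above can be pictured as a Jaccard index over property names. This is only a conceptual sketch with a hypothetical helper (key_similarity); the library's actual strategy also weighs pattern properties, nested shape, type signatures, and keywords:

```python
def key_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two structural key sets: |A ∩ B| / |A ∪ B|.

    Hypothetical helper for illustration only; not genschema's
    actual similarity function.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Objects A and B from the example above.
a = {"id", "fullName", "email", "phone"}
b = {"id", "fullName", "email", "phone", "department"}

score = key_similarity(a, b)  # 4 shared keys out of 5 distinct keys
print(score)
```

With this scoring, the pair lands exactly at 0.8, so a similarity_threshold of 0.8 or lower would let them merge, while a stricter threshold such as 0.9 would keep them separate.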
Minimum structure size¶
Extraction is intentionally conservative. Very small fragments are often not worth moving into shared refs because the schema becomes harder to read without meaningful deduplication.
By default:
min_total_keys = 3
So two tiny objects with only one or two keys will stay inline unless you lower that threshold.
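One way to picture the size gate is as a recursive count of property names in a subschema. The helper below (count_structural_keys) is hypothetical and much cruder than the library's real counter, but it shows why a one-key object fails a min_total_keys of 3 while a three-field address passes:

```python
def count_structural_keys(schema: dict) -> int:
    """Recursively count property names in an object subschema.

    Rough, hypothetical stand-in for the structural-key count that
    min_total_keys gates on; the real counter may weigh other
    keywords as well.
    """
    props = schema.get("properties", {})
    total = len(props)
    for sub in props.values():
        if isinstance(sub, dict):
            total += count_structural_keys(sub)
    return total

tiny = {"type": "object", "properties": {"id": {"type": "string"}}}
address = {
    "type": "object",
    "properties": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "postalCode": {"type": "string"},
    },
}

print(count_structural_keys(tiny))     # too small: stays inline at the default threshold
print(count_structural_keys(address))  # reaches the default min_total_keys of 3
```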
Result shape¶
The postprocessor returns a new schema dict. It does not mutate the original input object in place.
Typical result:
```json
{
  "$defs": {
    "Address": {
      "type": "object",
      "properties": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "postalCode": {"type": "string"}
      }
    }
  },
  "properties": {
    "billingAddress": {"$ref": "#/$defs/Address"},
    "shippingAddress": {"$ref": "#/$defs/Address"}
  }
}
```
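The non-mutating contract and the $defs/$ref rewrite can be sketched with plain dicts. The function below (extract_duplicate) is a hypothetical, simplified illustration that only handles exact duplicates at the top level of properties; the real postprocessor also discovers candidates recursively and merges merely similar nodes:

```python
import json


def extract_duplicate(schema: dict, name: str, fragment: dict) -> dict:
    """Return a new schema in which every top-level property subschema
    equal to `fragment` is replaced by a $ref into $defs.

    The input dict is left untouched, mirroring the postprocessor's
    non-mutating behavior.  Hypothetical sketch, not genschema's
    implementation.
    """
    out = json.loads(json.dumps(schema))  # deep copy via JSON round-trip
    out.setdefault("$defs", {})[name] = json.loads(json.dumps(fragment))
    for key, sub in out.get("properties", {}).items():
        if sub == fragment:
            out["properties"][key] = {"$ref": f"#/$defs/{name}"}
    return out


address = {"type": "object", "properties": {"street": {"type": "string"}}}
doc = {"properties": {"billingAddress": address, "shippingAddress": dict(address)}}

result = extract_duplicate(doc, "Address", address)
print(result["properties"]["billingAddress"])   # now a $ref into $defs
print(doc["properties"]["billingAddress"] == address)  # original untouched
```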
Notes¶
existing $ref nodes are not treated as extraction candidates
existing definition sections can be skipped during candidate discovery
overlapping candidate groups are resolved so the same fragment is not extracted twice
definition names can be customized