Introduction
Schematron is a family of long-context models we trained to extract clean, typed JSON from messy HTML. Purpose-built for web scraping, product data ingestion, and converting arbitrary pages into structured data, these models take a user-defined schema and reliably output conforming JSON. Schematron-8B delivers the highest extraction quality, while Schematron-3B offers nearly comparable performance at half the cost. Both models excel at processing long HTML inputs with strict schema adherence. To learn more about how we trained Schematron and what it can do, see the announcement blog post.
Access Schematron
Schematron is available through our Serverless API and is also open source on Hugging Face (see Models below for the repositories).
For best results, pre-clean HTML with `lxml` to remove scripts, styles, and boilerplate before sending it to the model. See Preprocess HTML (recommended). Alternatives like Readability or regex can work; `lxml` is simply recommended for this model because it matches how the training data was preprocessed. When preprocessing, err on the side of removing less content, as the model will still be able to extract the information you need from noisy markup.
Key Features
- Schema-first extraction – Drive output with your own JSON Schema or a typed model (e.g. Pydantic) and get strictly formatted JSON back.
- Long-context HTML – Trained for long documents with up to 128K-token context and robust to noisy markup.
- Strict JSON mode – 100% schema adherence.
- Cost-efficient quality – Matches the quality of frontier models at a significantly lower cost.
Models
- inference-net/Schematron-8b – Best quality for complex schemas or very long pages.
- inference-net/Schematron-3b – Lower cost, great for simpler, shorter pages.
Getting Started
You’ll need an Inference.net account and API key. See our Quick Start Guide for setup instructions. Install the OpenAI SDK and set the base URL to https://api.inference.net/v1.
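For example, a minimal client setup with the official OpenAI Python SDK (the environment variable name here is an assumption, not a requirement):

```python
# pip install openai
import os

from openai import OpenAI

# Point the OpenAI SDK at Inference.net's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],  # assumed env var name
)
```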
Basic Usage: Extract from HTML into a Pydantic model
Use a typed model (e.g., Pydantic) to define your schema. Pass the model’s JSON Schema and the raw HTML to the model, enable JSON mode, and validate the response. This model does not take free-form user or system instructions; the schema alone drives the extraction.
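A minimal sketch of the flow. The `Product` schema is hypothetical, and placing the JSON Schema in the system message with the raw HTML in the user message is an assumption of this sketch; temperature 0 and JSON mode follow the best practices below.

```python
import json
import os

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ["INFERENCE_API_KEY"],  # assumed env var name
)

# Hypothetical schema for illustration.
class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

html = "<html>...</html>"  # your raw (ideally pre-cleaned) HTML

response = client.chat.completions.create(
    model="inference-net/Schematron-8b",  # or inference-net/Schematron-3b
    messages=[
        # Assumption: schema as the system message, HTML as the user message.
        {"role": "system", "content": json.dumps(Product.model_json_schema())},
        {"role": "user", "content": html},
    ],
    response_format={"type": "json_object"},  # strict JSON mode
    temperature=0,  # Schematron is trained to run at temperature 0
)

# Validate the response against the schema on ingest.
product = Product.model_validate_json(response.choices[0].message.content)
print(product)
```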
Output
For the HTML above, Schematron will produce strictly valid JSON that conforms to your schema. A representative output for the hypothetical Product schema is:
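```json
{
  "name": "Acme Wireless Mouse",
  "price": 24.99,
  "in_stock": true
}
```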
Processing at Scale
For large-scale extraction, use our asynchronous APIs:
Batch API
Submit up to 50,000 extraction requests per batch with a 24-hour completion window, webhook notifications, and roughly 95% cost savings versus synchronous requests.
Group API
Send smaller groups (≤ 50 requests) with a simple JSON body, track progress as a unit, and receive a single callback upon completion.
Best Practices
- Keep temperature at zero – The model is trained to perform best at temperature 0.
- Use JSON mode – Set `response_format={"type": "json_object"}`.
- Provide a clear schema – Pass a JSON Schema or typed model (e.g., `Product.model_json_schema()`). Include required fields and types, optionally describing them directly within the schema (supported by Pydantic and Zod); see the example after this list.
- Pre-clean HTML (recommended) – Remove scripts, styles, and boilerplate to improve accuracy and reduce tokens. See Preprocess HTML (recommended). Alternatives (Readability, Trafilatura, BeautifulSoup, regex) are fine; `lxml` aligns with training.
- Validate on ingest – Parse with Pydantic (or your validator) and handle validation errors explicitly. Although Schematron should always return valid JSON matching your schema, it’s still a good idea to validate the response.
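For example, a hypothetical Pydantic model with per-field descriptions (the field names and descriptions are illustrative, not required by Schematron):

```python
from pydantic import BaseModel, Field

# Hypothetical schema: field descriptions travel with the JSON Schema
# and give the model extra guidance on ambiguous fields.
class Product(BaseModel):
    name: str = Field(description="Product title as displayed on the page")
    price: float = Field(description="Current price, as a number without currency symbols")
    in_stock: bool = Field(description="Whether the product is currently purchasable")

# This JSON Schema (descriptions included) is what you send to the model.
schema = Product.model_json_schema()
```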
Preprocess HTML (recommended)
Schematron models were trained on HTML that had been pre-cleaned with `lxml.html.clean.Cleaner` to strip scripts, styles, and inline JavaScript. Aligning your preprocessing with training typically improves extraction quality and consistency.
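A minimal preprocessing sketch using lxml's Cleaner. The specific cleaner flags below are a reasonable default, not the exact configuration used during training:

```python
# pip install lxml
# Note: since lxml 5.2 the clean module ships separately; if the import
# fails, `pip install lxml_html_clean` (or `pip install "lxml[html_clean]"`).
from lxml.html.clean import Cleaner

cleaner = Cleaner(
    scripts=True,     # remove <script> elements
    javascript=True,  # remove inline JavaScript (event handlers, js: links)
    style=True,       # remove <style> elements and style attributes
    comments=True,    # remove HTML comments
)

def preprocess(raw_html: str) -> str:
    """Strip scripts, styles, and boilerplate before sending HTML to Schematron."""
    return cleaner.clean_html(raw_html)
```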
- You do not have to use `lxml`. Alternatives like Readability (often more aggressive), Trafilatura, BeautifulSoup, or even targeted regex can be acceptable.
- Use whichever tool best fits your content; `lxml` is recommended for this model because it matches how the training data was preprocessed.
Limitations
- Context window – Up to 128K tokens per request. Truncate or chunk very large pages (see the sketch after this list).
- Ambiguous fields – If a field requires summarization or synthesis, make the schema and its field descriptions explicit about expectations. Schematron does not support a prompt or directions; all guidance must be passed through the schema.
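A rough chunking sketch for the context-window limit. The four-characters-per-token ratio is a crude heuristic, not Schematron's actual tokenizer; use a real tokenizer for precise budgeting:

```python
def chunk_html(html: str, max_tokens: int = 120_000, chars_per_token: int = 4) -> list[str]:
    """Split oversized HTML into chunks that fit under the context window."""
    max_chars = max_tokens * chars_per_token  # crude token estimate
    return [html[i : i + max_chars] for i in range(0, len(html), max_chars)]
```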
Support
For technical support or custom deployment options:
- Email: [email protected]
- Schedule a call