Introduction

Schematron is a family of long-context models we trained to extract clean, typed JSON from messy HTML. Purpose-built for web scraping, product data ingestion, and converting arbitrary pages into structured records, these models take a user-defined schema and reliably output conforming JSON. Schematron-8B delivers the highest extraction quality, while Schematron-3B offers nearly comparable performance at half the cost. Both models excel at processing long HTML inputs with strict schema adherence.
For best results, pre-clean HTML with lxml to remove scripts, styles, and boilerplate before sending it to the model. See Preprocess HTML (recommended). Alternatives like Readability or regex can work; lxml is simply recommended for this model because it matches how the training data was preprocessed. When preprocessing, err on the side of removing less content; the model can still extract the information you need from noisier input.

Key Features

  • Schema-first extraction – Drive output with your own JSON Schema or a typed model (e.g. Pydantic) and get strictly formatted JSON back.
  • Long-context HTML – Trained on long documents with a context window of up to 128K tokens, and robust to noisy markup.
  • Strict JSON mode – 100% schema adherence.
  • Cost-efficient quality – Matches the quality of frontier models at a significantly lower cost.

Models

  • inference-net/Schematron-8b – Best quality for complex schemas or very long pages.
  • inference-net/Schematron-3b – Lower cost, great for simpler, shorter pages.

Getting Started

You’ll need an Inference.net account and API key. See our Quick Start Guide for setup instructions. Install the OpenAI SDK and set the base URL to https://api.inference.net/v1.

Basic Usage: Extract from HTML into a Pydantic model

Use a typed model (e.g., Pydantic) to define your schema. Pass the model’s JSON Schema and the raw HTML to the model, enable JSON mode, and validate the response.
The model does not take instructions via system or user prompts; the user message carries only the HTML, and the schema alone determines what is extracted.
import os
from pydantic import BaseModel, Field
from openai import OpenAI

# 1) Define your schema (nested data and lists are supported)

class Product(BaseModel):
    name: str
    price: float = Field(
        ..., description=(
            "Primary price of the product."
        )
    )
    specs: dict = Field(
        default_factory=dict,
        description="Specs of the product.",
    )
    tags: list[str] = Field(
        default_factory=list,
        description="Tags assigned of the product.",
    )

# 2) Messy HTML (could be the full page; trim to the relevant region when possible)
html = """
<div id="item">
    <h2 class="title">MacBook Pro M3</h2>
    <p>Price: <b>$2,499.99</b> USD</p>
    <ul info>
    <li>RAM: 16GB</li>
    <li>Storage: 512GB SSD</li>
    </ul>
    <span class="tag">laptop</span>
    <span class="tag">professional</span>
    <span class="tag">macbook</span>
    <span class="tag">apple</span>
</div>
"""

# 3) Client setup
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

# 4) Typed extraction via the Chat Completions parse helper
response = client.chat.completions.parse(
    model="inference-net/schematron-8b",
    messages=[
        {"role": "user", "content": html},
    ],
    response_format=Product,
)

print(response.choices[0].message.content)

Output

For the HTML above, Schematron will produce strictly valid JSON that conforms to your schema. A representative output is:
{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {
    "RAM": "16GB",
    "Storage": "512GB SSD"
  },
  "tags": [
    "laptop",
    "professional",
    "macbook",
    "apple"
  ]
}
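If you want the typed object rather than the raw JSON string, the SDK's parse helper also exposes a validated Pydantic instance. A minimal sketch, continuing from the example above (note that parsed can be None if parsing fails or the model refuses):
# Typed access: the SDK validates the JSON against Product for you.
product = response.choices[0].message.parsed
if product is not None:
    print(product.name, product.price)

# Or validate the raw JSON string yourself, e.g. when using plain JSON mode.
product = Product.model_validate_json(response.choices[0].message.content)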

Processing at Scale

For large-scale extraction, use our asynchronous APIs.
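As a client-side illustration only, the sketch below fans out many extraction requests concurrently with the SDK's async client. It assumes the Product schema, endpoint, and INFERENCE_API_KEY from the examples above and is not a substitute for the dedicated asynchronous endpoints.
import asyncio
import os

from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

async def extract(html: str) -> Product | None:
    # One page per request; the schema is carried entirely by response_format.
    response = await async_client.chat.completions.parse(
        model="inference-net/schematron-8b",
        messages=[{"role": "user", "content": html}],
        response_format=Product,
        temperature=0,
    )
    return response.choices[0].message.parsed

async def extract_many(pages: list[str]) -> list[Product | None]:
    # Fan requests out concurrently; add rate limiting and retries for large jobs.
    return await asyncio.gather(*(extract(page) for page in pages))

# results = asyncio.run(extract_many(list_of_html_pages))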

Best Practices

  1. Keep temperature at zero – Schematron is trained to perform best at a temperature of 0.
  2. Use JSON mode – Set response_format={"type": "json_object"}.
  3. Provide a clear schema – Pass a JSON Schema or typed model (e.g., Product.model_json_schema()). Include required fields and types, optionally describing them directly within the schema (supported by Pydantic and Zod). For example:
const Product = z.object({
  name: z.string().describe("Exact product name as shown in the title or primary heading."),
  price: z
    .number()
    .describe(
      "Primary price of the product."
    ),
  specs: z.object({
    ram: z.string().describe("RAM specification, e.g. '16GB'."),
    storage: z.string().describe("Storage capacity, e.g. '512GB SSD'."),
  }),
});
  4. Pre-clean HTML (recommended) – Remove scripts, styles, and boilerplate to improve accuracy and reduce tokens. See Preprocess HTML (recommended). Alternatives (Readability, Trafilatura, BeautifulSoup, regex) are fine; lxml aligns with training.
  5. Validate on ingest – Parse with Pydantic (or your validator) and handle validation errors explicitly. Although Schematron should always return valid JSON matching your schema, it’s still a good idea to validate the response.
Schematron models were trained on HTML that had been pre-cleaned with lxml.html.clean.Cleaner to strip scripts, styles, and inline JavaScript. Aligning your preprocessing with training typically improves extraction quality and consistency.
from lxml.html.clean import Cleaner
import lxml.html as LH

HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)


def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml.
    """
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        # If lxml cannot parse the page at all, return an empty string so callers can skip it.
        return ""
Notes:
  • You do not have to use lxml. Alternatives like Readability (often more aggressive), Trafilatura, BeautifulSoup, or even targeted regex can be acceptable.
  • Use whichever tool best fits your content; lxml is recommended for this model because it matches how the training data was preprocessed.
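Putting the pieces together, here is a minimal clean-then-extract sketch that reuses the strip_noise helper above along with the client, schema, and html from the basic usage example:
# Clean first, then extract; skip pages that lxml cannot parse at all.
cleaned = strip_noise(html)
if cleaned:
    response = client.chat.completions.parse(
        model="inference-net/schematron-8b",
        messages=[{"role": "user", "content": cleaned}],
        response_format=Product,
        temperature=0,  # best practice: keep temperature at zero
    )
    product = response.choices[0].message.parsed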

Limitations

  • Context window – Up to 128K tokens per request. Truncate or chunk very large pages (see the chunking sketch below).
  • Ambiguous fields – If a field requires summarization or synthesis, make your schema and field descriptions explicit about expectations. Schematron does not support prompts or directions; all guidance must be conveyed through the schema.
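For pages that exceed the context window, a naive character-based chunker like the sketch below is one option. The 400,000-character budget is an illustrative assumption (characters are only a rough proxy for tokens); each chunk is extracted separately and the resulting records merged downstream.
def chunk_html(text: str, max_chars: int = 400_000) -> list[str]:
    """Split cleaned HTML into pieces under a rough character budget.

    Character counts only approximate tokens; tune max_chars so each piece
    stays safely under the 128K-token context window for your content.
    """
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]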

Support

For technical support or custom deployment options: