Schematron

Introduction

Schematron is a family of long-context models we trained to extract clean, typed JSON from messy HTML. Purpose-built for web scraping, product data ingestion, and converting arbitrary pages into structured data, these models take a user-defined schema and reliably output conforming JSON. Schematron-8B delivers the highest extraction quality, while Schematron-3B offers nearly comparable performance at half the cost. Both models excel at processing long HTML inputs with strict schema adherence. To learn more about how we trained Schematron and what it can do, see the announcement blog post.

For best results, pre-clean HTML with lxml to remove scripts, styles, and boilerplate before sending it to the model. See Preprocess HTML (recommended). Alternatives like Readability or regex can work; lxml is recommended for this model only because it matches how the training data was preprocessed. When preprocessing, err on the side of removing less content: the model can still extract what you need from noisy markup, but it cannot recover content that was stripped.

Quickstart

Schematron-8B

Highest extraction quality for complex schemas and long pages.

Schematron-3B

Lower-cost option with near-parity quality for most use cases.

Notebook Example

End-to-end company scraping walkthrough using Schematron.

You’ll need an Inference.net account, API key, and the OpenAI SDK configured with base_url="https://api.inference.net/v1". See the Quick Start Guide for setup details.

TypeScript Users: The examples below use OpenAI SDK v6 and Zod v3. Install the compatible version with:

npm install zod@^3

Basic Usage: Extract from HTML into a Pydantic model

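Define your schema with a typed model (e.g., Pydantic). Send the model’s JSON Schema as the response format and the raw HTML as the message content, then validate the response. In the example below, the OpenAI SDK’s parse helper derives the JSON Schema from the Pydantic class and validates the reply against it.

Schematron does not follow instructions in system or user prompts. The user message carries only the HTML, and the schema alone determines what is extracted.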

import os
from pydantic import BaseModel, Field
from openai import OpenAI

# 1) Define your schema (nested data and lists are supported)

class Product(BaseModel):
    name: str
    price: float = Field(
        ..., description=(
            "Primary price of the product."
        )
    )
    specs: dict = Field(
        default_factory=dict,
        description="Specs of the product.",
    )
    tags: list[str] = Field(
        default_factory=list,
        description="Tags assigned to the product.",
    )

# 2) Messy HTML (could be the full page; trim to the relevant region when possible)
html = """
<div id="item">
    <h2 class="title">MacBook Pro M3</h2>
    <p>Price: <b>$2,499.99</b> USD</p>
    <ul info>
    <li>RAM: 16GB</li>
    <li>Storage: 512GB SSD</li>
    </ul>
    <span class="tag">laptop</span>
    <span class="tag">professional</span>
    <span class="tag">macbook</span>
    <span class="tag">apple</span>
</div>
"""

# 3) Client setup
client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key=os.environ.get("INFERENCE_API_KEY"),
)

resp = client.beta.chat.completions.parse(
    model="inference-net/schematron-8b",
    messages=[
        {"role": "user", "content": html},
    ],
    response_format=Product,
    temperature=0,  # the model is trained to perform best at temperature 0
)

print(resp.choices[0].message.parsed.model_dump_json(indent=2))

Output

For the HTML above, Schematron will produce strictly valid JSON that conforms to your schema. A representative output is:

{
  "name": "MacBook Pro M3",
  "price": 2499.99,
  "specs": {
    "RAM": "16GB",
    "Storage": "512GB SSD"
  },
  "tags": [
    "laptop",
    "professional",
    "macbook",
    "apple"
  ]
}

Best Practices

Keep temperature at zero – This model is trained to perform best with 0 temperature.
Provide a clear schema – Pass a JSON Schema or typed model (e.g., Product.model_json_schema()). Include required fields and types, optionally describing them directly within the schema (supported by Pydantic and Zod). For example:

import { z } from "zod";

const Product = z.object({
  name: z.string().describe("Exact product name as shown in the title or primary heading."),
  price: z
    .number()
    .describe(
      "Primary price of the product."
    ),
  specs: z.object({
    ram: z.string().describe("Specs of the product."),
    storage: z.string().describe("Specs of the product."),
  }),
});

Pre-clean HTML (recommended) – Remove scripts, styles, and boilerplate to improve accuracy and reduce tokens. See Preprocess HTML (recommended). Alternatives (Readability, Trafilatura, BeautifulSoup, regex) are fine; lxml aligns with training.
Validate on ingest – Parse with Pydantic (or your validator) and handle validation errors explicitly. Although Schematron should always return valid JSON matching your schema, it’s still a good idea to validate the response.
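The "Validate on ingest" practice above can be sketched as follows. This is a minimal illustration, assuming Pydantic v2; the trimmed Product model and the parse_product helper are hypothetical names introduced here, not part of the Schematron API.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Product(BaseModel):
    # Trimmed copy of the schema from Basic Usage, for illustration only.
    name: str
    price: float = Field(..., description="Primary price of the product.")
    tags: list[str] = Field(default_factory=list)


def parse_product(raw: str) -> Optional[Product]:
    """Return a validated Product, or None if the payload does not conform."""
    try:
        return Product.model_validate_json(raw)
    except ValidationError as exc:
        # Report each failing field instead of silently dropping the record.
        for err in exc.errors():
            print(f"invalid field {err['loc']}: {err['msg']}")
        return None
```

Returning None is only one possible policy. Depending on your pipeline, you may prefer to re-raise, retry the request, or route the raw payload to a review queue.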

Preprocess HTML (recommended)

Schematron models were trained on HTML that had been pre-cleaned with lxml.html.clean.Cleaner to strip scripts, styles, and inline JavaScript. Aligning your preprocessing with training typically improves extraction quality and consistency.

Python

# With lxml >= 5.2 the cleaner lives in a separate package:
#   pip install lxml_html_clean
from lxml.html.clean import Cleaner
import lxml.html as LH

HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)


def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        return ""

Notes:

You do not have to use lxml. Alternatives like Readability (often more aggressive), Trafilatura, BeautifulSoup, or even targeted regex can be acceptable.
Use whichever tool best fits your content; lxml is recommended for this model because it matches how the training data was preprocessed.
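If lxml is not available, a rough standard-library fallback might look like the sketch below. It is not the preprocessing used for training: it drops only script and style elements and HTML comments, and unlike the Cleaner configuration above it leaves inline style attributes and inline event handlers (such as onclick) in place. The NoiseStripper class and strip_noise_stdlib function are hypothetical names introduced for illustration.

```python
from html import escape
from html.parser import HTMLParser


class NoiseStripper(HTMLParser):
    """Re-emit HTML while dropping <script>/<style> elements and comments."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.skip_depth = 0

    @staticmethod
    def _render_attrs(attrs) -> str:
        # Valueless attributes (e.g. <ul info>) arrive with value None.
        return "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept unless they are skipped tags.
        if tag not in self.SKIPPED and self.skip_depth == 0:
            self.parts.append(f"<{tag}{self._render_attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text was unescaped by the parser, so escape it again on output.
            self.parts.append(escape(data, quote=False))


def strip_noise_stdlib(html: str) -> str:
    """Return html without script/style elements, using only the stdlib."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Because this keeps more markup than the lxml cleaner, it follows the advice to err on the side of removing less, at the cost of more input tokens.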

Capabilities & Access

Key Features

Schema-first extraction – Drive output with your own JSON Schema or a typed model (e.g. Pydantic) and get strictly formatted JSON back.
Long-context HTML – Trained for long documents with up to 128K-token context and robust to noisy markup.
Strict JSON mode – 100% schema adherence.
Cost-efficient quality – Matches the quality of frontier models at a significantly lower cost.

Models

inference-net/Schematron-8b – Best quality for complex schemas or very long pages.
inference-net/Schematron-3b – Lower cost, great for simpler, shorter pages.

Access Schematron

Serverless API

Open Source on Hugging Face

Processing at Scale

For large-scale extraction, use our asynchronous APIs:

Batch API

Submit up to 50,000 extraction jobs at once, watch their status asynchronously, and receive optional webhook updates within a 24-hour window.

Group API

Send smaller groups (≤ 50 requests) in one JSON payload, follow the entire group as a single job, and get one callback when everything is done.

Limitations

Context window – Up to 128K tokens per request. Truncate or chunk very large pages.
Ambiguous fields – If a field requires summarization or synthesis, make your schema and field descriptions explicit about what you expect. Schematron does not accept a prompt or instructions, so all guidance must be passed through the schema.
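For pages that exceed the context window, chunking can be sketched as below. This is a rough illustration rather than a recommended algorithm: chunk_html is a hypothetical helper, and the 4-characters-per-token ratio is an assumed heuristic, not a measured property of the Schematron tokenizer. Use a real tokenizer count if you need precise limits.

```python
def chunk_html(html: str, max_tokens: int = 100_000, chars_per_token: int = 4) -> list[str]:
    """Split html into pieces that fit an approximate token budget.

    Cuts are made right after a '>' where possible, so no tag is split in half.
    """
    max_chars = max_tokens * chars_per_token
    if max_chars <= 0:
        raise ValueError("max_tokens and chars_per_token must be positive")

    chunks = []
    start = 0
    while start < len(html):
        end = min(start + max_chars, len(html))
        if end < len(html):
            # Back up to the last tag boundary inside the window, if there is one.
            cut = html.rfind(">", start, end)
            if cut > start:
                end = cut + 1
        chunks.append(html[start:end])
        start = end
    return chunks
```

The chunks are not well-formed HTML on their own, and a record that straddles a boundary may be extracted only partially. Merging and de-duplicating the per-chunk results is left to the caller.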

Support

For technical support or custom deployment options:

Email: [email protected]
Schedule a call
