Schema-guided extraction from messy HTML using Schematron
lxml
to remove scripts, styles, and boilerplate before sending it to the model. See Preprocess HTML (recommended).Alternatives like Readability or regex can work; lxml
is simply recommended for this model because it matches how the training data was preprocessed. When preprocessing, err on the side of removing less content, as the model will still be able to extract the information you need.https://api.inference.net/v1
.
response_format={"type": "json_object"}
.Product.model_json_schema()
). Include required fields and types, optionally describing them directly within the schema (supported by Pydantic and Zod). For example:lxml
aligns with training.lxml.html.clean.Cleaner
to strip scripts, styles, and inline JavaScript. Aligning your preprocessing with training typically improves extraction quality and consistency.
lxml
. Alternatives like Readability (often more aggressive), Trafilatura, BeautifulSoup, or even targeted regex can be acceptable.lxml
is recommended for this model because it matches how the training data was preprocessed.