Introduction

The Inference.net Embeddings API provides a fully OpenAI-compatible interface for generating high-quality text embeddings. Embeddings are numerical representations of text that capture semantic meaning, enabling applications such as semantic search, clustering, recommendations, and anomaly detection. The API is a drop-in replacement for OpenAI’s embeddings endpoint, so switching providers typically requires changing only your base URL and API key.
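
For example, with the official OpenAI Python SDK, a minimal setup sketch looks like this (only the base URL and API key differ from an OpenAI configuration):
# Point the official OpenAI Python SDK at Inference.net.
# Only the base URL and API key change; existing embedding code keeps working.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key="YOUR_INFERENCE_API_KEY",  # e.g. read from the INFERENCE_API_KEY env var
)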

Key Features

  • OpenAI Compatible: Use the same code and libraries you already have
  • Multiple Models: Choose from a variety of embedding models optimized for different use cases
  • High Performance: Fast inference with low latency
  • Scalable: Handle high-volume workloads with our robust infrastructure
  • Asynchronous Support: Process large batches with webhooks for completion notifications

Quick Start

API Endpoint

POST https://api.inference.net/v1/embeddings

Basic Request

curl https://api.inference.net/v1/embeddings \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3-embedding-4b",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
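
The same request with the OpenAI Python SDK, using the client configured above:
response = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input="The quick brown fox jumps over the lazy dog"
)
print(len(response.data[0].embedding))  # dimensionality of the returned vector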

Request Parameters

Parameter        Type             Required  Description
---------------  ---------------  --------  ------------------------------------------------------------
model            string           Yes       The embedding model to use. See available models
input            string or array  Yes       Text(s) to embed. Can be a string or array of strings
encoding_format  string           No        Format for the embeddings. Options: float (default) or base64
metadata         object           No        Custom metadata for tracking and webhooks
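
For example, requesting base64 output can shrink response payloads. A decoding sketch, assuming base64 payloads follow OpenAI's convention of little-endian float32 values (an assumption based on the API's OpenAI compatibility):
import base64
import numpy as np

response = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input="The quick brown fox jumps over the lazy dog",
    encoding_format="base64"
)

# Assumption: base64 payloads are little-endian float32, as in OpenAI's API
raw = base64.b64decode(response.data[0].embedding)
vector = np.frombuffer(raw, dtype=np.float32)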

Batch Processing

You can embed multiple texts in a single request by passing an array:
// Point the OpenAI Node SDK at Inference.net
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

const response = await openai.embeddings.create({
  model: "qwen/qwen3-embedding-4b",
  input: [
    "First text to embed",
    "Second text to embed",
    "Third text to embed"
  ]
});

// Access individual embeddings
const firstEmbedding = response.data[0].embedding;
const secondEmbedding = response.data[1].embedding;

Response Format

The API returns embeddings in the standard OpenAI format:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023064255, -0.009327292, ...]
    }
  ],
  "model": "qwen/qwen3-embedding-4b",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

Asynchronous Embeddings

For large-scale embedding tasks, use our asynchronous API with webhooks to receive notifications when processing is complete.

Async Request

To make asynchronous embedding requests, use the /v1/async base URL and include a webhook ID in your request metadata:
curl https://api.inference.net/v1/async/embeddings \
  -H "Authorization: Bearer $INFERENCE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3-embedding-4b",
    "input": ["Text 1", "Text 2", "Text 3"],
    "metadata": {
      "webhook_id": "YOUR_WEBHOOK_ID"
    }
  }'
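
The same async request from Python using the requests library (a sketch; the embeddings themselves are delivered to your webhook, as shown below):
import os
import requests

resp = requests.post(
    "https://api.inference.net/v1/async/embeddings",
    headers={"Authorization": f"Bearer {os.environ['INFERENCE_API_KEY']}"},
    json={
        "model": "qwen/qwen3-embedding-4b",
        "input": ["Text 1", "Text 2", "Text 3"],
        "metadata": {"webhook_id": "YOUR_WEBHOOK_ID"},
    },
)
resp.raise_for_status()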

Webhook Payload

When the embedding process completes, you’ll receive a webhook with the following structure:
{
  "event": "async-embedding.completed",
  "timestamp": "2024-01-15T10:30:00Z",
  "webhook_id": "YOUR_WEBHOOK_ID",
  "generation_id": "EMB_abc123",
  "data": {
    "state": "Success",
    "stateMessage": "Embeddings generated successfully",
    "request": {
      "model": "qwen/qwen3-embedding-4b",
      "input": ["text1", "text2", ...],
      "metadata": { "webhook_id": "YOUR_WEBHOOK_ID" }
    },
    "response": {
      "object": "list",
      "data": [
        {
          "object": "embedding",
          "index": 0,
          "embedding": [...]
        }
      ],
      "usage": {
        "prompt_tokens": 100,
        "total_tokens": 100
      }
    },
    "finishedAt": "2024-01-15T10:30:00Z"
  }
}
See our webhook documentation for more details on setting up and handling webhooks.
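
A minimal receiver sketch using Flask (the route path is arbitrary and store() is a hypothetical persistence helper; signature verification and delivery details are covered in the webhook documentation):
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/inference", methods=["POST"])
def handle_embedding_webhook():
    payload = request.get_json()
    if payload.get("event") == "async-embedding.completed":
        data = payload["data"]
        if data["state"] == "Success":
            # Collect the embedding vectors from the completed job
            embeddings = [item["embedding"] for item in data["response"]["data"]]
            store(payload["generation_id"], embeddings)  # store() is hypothetical
    return "", 200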

Use Cases

Semantic Search

Embed your documents once, embed each query at search time, and rank documents by similarity:
# client is an OpenAI SDK client pointed at Inference.net (see Quick Start)

# Generate embeddings for your documents
documents = ["Document 1 text", "Document 2 text", ...]
doc_embeddings = []

# One request per document for clarity; batching inputs is more efficient (see Best Practices)
for doc in documents:
    response = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=doc
    )
    doc_embeddings.append(response.data[0].embedding)

# Generate embedding for search query
query = "Find documents about machine learning"
query_response = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input=query
)
query_embedding = query_response.data[0].embedding

# Calculate similarities (cosine similarity shown)
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

similarities = [
    cosine_similarity(query_embedding, doc_emb)
    for doc_emb in doc_embeddings
]
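
Sorting by score then yields a ranked result list:
# Rank documents by similarity to the query, highest first
ranked = sorted(zip(documents, similarities), key=lambda pair: pair[1], reverse=True)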

Text Classification

Embeddings can be used as features for classification tasks:
# Generate embeddings for training data
# (training_texts and labels are assumed to be prepared lists of texts and class labels)
train_embeddings = []
for text in training_texts:
    response = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=text
    )
    train_embeddings.append(response.data[0].embedding)

# Use embeddings as features for your classifier
from sklearn.svm import SVC
classifier = SVC()
classifier.fit(train_embeddings, labels)
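
At prediction time, embed new text with the same model before calling the classifier:
# Embed a new text and predict its label
new_embedding = client.embeddings.create(
    model="qwen/qwen3-embedding-4b",
    input="Text to classify"
).data[0].embedding

predicted = classifier.predict([new_embedding])[0]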

Best Practices

  1. Batch Requests: Send multiple texts in a single request for better efficiency
  2. Model Selection: Choose models based on your specific use case. Visit our models page to compare options
  3. Normalization: Some applications benefit from normalizing embeddings to unit length (see the sketch after this list)
  4. Caching: Store generated embeddings to avoid redundant API calls
  5. Async for Large Batches: Use webhooks for processing large datasets asynchronously
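
For item 3, a minimal numpy sketch of unit-length (L2) normalization; once vectors are unit length, a plain dot product equals cosine similarity:
import numpy as np

def normalize(embedding):
    # Scale the vector to unit length (L2 norm = 1)
    arr = np.asarray(embedding, dtype=np.float32)
    return arr / np.linalg.norm(arr)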

Error Handling

The OpenAI SDK raises typed exceptions, so rate limits can be handled separately from other API errors:
import openai

try:
    response = client.embeddings.create(
        model="qwen/qwen3-embedding-4b",
        input=text
    )
except openai.RateLimitError as e:
    # Too many requests: back off and retry (see Rate Limits below)
    print(f"Rate limited: {e}")
except openai.APIError as e:
    print(f"Error generating embedding: {e}")
    # Handle error appropriately
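
The SDK can also retry transient failures automatically; its client accepts a max_retries option (the default is 2):
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.net/v1",
    api_key="YOUR_INFERENCE_API_KEY",
    max_retries=5,  # automatically retry transient errors such as timeouts and 429s
)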

Rate Limits

Embeddings API requests are subject to our standard rate limits. See our rate limits documentation for details.

Pricing

Embeddings are priced per token. Visit our models page for current pricing information.

Next Steps