Learn how to use our Group API to submit multiple inference requests together, perfect for processing related tasks that need to be tracked as a unit. The Group API supports both chat completions and text completions with up to 50 requests per group.

The Group API is available at both the /v1/slow/group/chat/completions and /v1/slow/group/completions endpoints.

Do not mix completion and chat completion requests in the same group.

Overview

The Group API provides a streamlined way to submit multiple asynchronous inference requests as a single unit. Unlike the Batch API, which requires JSONL file uploads, the Group API accepts requests directly in the request body, making it ideal for:

  • Small to medium batches: Process up to 50 requests at once
  • Related tasks: Group related inference requests together
  • Webhook notifications: Get notified when all requests in a group complete
  • Simpler integration: No file uploads or JSONL formatting required
  • Faster implementation: Direct JSON API calls without file management

Group API vs Batch API

Feature             Group API                             Batch API
Maximum requests    50                                    1,000,000
Input format        JSON array in request body            JSONL file upload
File management     Not required                          Required
Use case            Small batches, quick implementation   Large-scale processing
Webhook support     Yes                                   Yes
Completion time     1-72 hours                            1-72 hours

Getting Started

1. Submit a Group Request

Submit multiple requests together by sending them as an array in the request body:

const response = await fetch('https://api.inference.net/v1/slow/group/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    requests: [
      {
        model: "meta-llama/llama-3.2-1b-instruct/fp-8",
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: "What is the capital of France?" }
        ],
        max_tokens: 100
      },
      {
        model: "meta-llama/llama-3.2-1b-instruct/fp-8",
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: "What is the capital of Germany?" }
        ],
        max_tokens: 100
      }
    ],
    webhook_id: "my-webhook-123" // Optional: attach a webhook for notifications
  })
});

const result = await response.json();
console.log(result); // { groupId: "group_abc123", groupSize: 2 }

The response will include a group ID and the number of requests:

{
  "groupId": "group_xY3kL9mN2pQ",
  "groupSize": 2
}

2. Retrieve Group Results

Once your group is processed, retrieve all generation results using the group ID:

const response = await fetch(`https://api.inference.net/v1/slow/group/${groupId}/generations`, {
  headers: {
    'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`
  }
});

const result = await response.json();
console.log(result.generations); // Array of all completed generations

The response includes all generations in the group:

{
  "generations": [
    {
      "_id": "gen_abc123",
      "state": "Success",
      "request": {
        "model": "meta-llama/llama-3.2-1b-instruct/fp-8",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100
      },
      "response": {
        "id": "gen_abc123",
        "object": "chat.completion",
        "choices": [
          {
            "message": {
              "role": "assistant",
              "content": "The capital of France is Paris."
            },
            "finish_reason": "stop"
          }
        ],
        "usage": {
          "prompt_tokens": 25,
          "completion_tokens": 8,
          "total_tokens": 33
        }
      }
    },
    {
      "_id": "gen_def456",
      "state": "Success",
      "request": {
        "model": "meta-llama/llama-3.2-1b-instruct/fp-8",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is the capital of Germany?"}
        ],
        "max_tokens": 100
      },
      "response": {
        "id": "gen_def456",
        "object": "chat.completion",
        "choices": [
          {
            "message": {
              "role": "assistant",
              "content": "The capital of Germany is Berlin."
            },
            "finish_reason": "stop"
          }
        ],
        "usage": {
          "prompt_tokens": 25,
          "completion_tokens": 8,
          "total_tokens": 33
        }
      }
    }
  ]
}
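Because groups are processed asynchronously, some generations may still be in flight when you first call this endpoint. Below is a minimal polling sketch in the style of the examples above. Only the "Success" state appears in the documented response; the "Failed" state name, the polling interval, and the attempt cap are assumptions for illustration.

async function waitForGroup(groupId, maxAttempts = 60) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(`https://api.inference.net/v1/slow/group/${groupId}/generations`, {
      headers: {
        'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`
      }
    });
    const { generations } = await res.json();

    // "Success" is shown in the response above; "Failed" is an assumed
    // terminal state name.
    const terminal = generations.filter(
      (g) => g.state === 'Success' || g.state === 'Failed'
    );
    if (terminal.length === generations.length) return generations;

    // Groups can take hours to complete, so poll sparingly.
    await new Promise((resolve) => setTimeout(resolve, 5 * 60 * 1000));
  }
  throw new Error(`Group ${groupId} did not complete in time`);
}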

Using Webhooks

Attach a webhook to receive notifications when your group completes processing:

const response = await fetch('https://api.inference.net/v1/slow/group/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    requests: [...], // Your requests array
    webhook_id: "my-webhook-123" // Your configured webhook ID
  })
});

When all requests in the group complete, your webhook will receive a notification with:

  • Group ID
  • Completion status
  • Summary of successful and failed requests
  • Custom IDs for each request (if provided)
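As a sketch of consuming that notification with Express, the handler below logs the summary and acknowledges receipt. The payload field names (groupId, status, succeeded, failed) are assumptions for illustration, not a documented schema:

import express from 'express';

const app = express();
app.use(express.json());

app.post('/webhooks/inference-group', (req, res) => {
  // Hypothetical payload shape; the real field names are defined in the
  // Webhook Documentation.
  const { groupId, status, succeeded, failed } = req.body;
  console.log(`Group ${groupId} finished: ${status} (${succeeded} ok, ${failed} failed)`);
  res.sendStatus(200); // Acknowledge quickly; do heavy processing elsewhere
});

app.listen(3000);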

See our Webhook Documentation for setup instructions.

Text Completions Support

The Group API also supports text completions:

const response = await fetch('https://api.inference.net/v1/slow/group/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    requests: [
      {
        model: "meta-llama/llama-3.2-1b-instruct/fp-8",
        prompt: "The capital of France is",
        max_tokens: 10
      },
      {
        model: "meta-llama/llama-3.2-1b-instruct/fp-8",
        prompt: "The capital of Germany is",
        max_tokens: 10
      }
    ]
  })
});

Limits and Constraints

  • Maximum requests per group: 50
  • Request format: Direct JSON (no JSONL files required)
  • Supported endpoints:
    • /v1/slow/group/chat/completions
    • /v1/slow/group/completions
  • Completion time: 1-72 hours
  • Request expiration: Groups expire after 72 hours if not completed
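If a workload exceeds the 50-request cap and the Batch API is not a fit, one option is to split it into multiple groups client-side. A minimal sketch, assuming each chunk is submitted independently and tracked by its own group ID:

const GROUP_LIMIT = 50; // Per-group maximum

async function submitInGroups(requests) {
  const groupIds = [];
  for (let i = 0; i < requests.length; i += GROUP_LIMIT) {
    const response = await fetch('https://api.inference.net/v1/slow/group/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ requests: requests.slice(i, i + GROUP_LIMIT) })
    });
    const { groupId } = await response.json();
    groupIds.push(groupId); // Track each group separately
  }
  return groupIds;
}

For genuinely large workloads, the Batch API remains the better fit.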

Best Practices

  1. Group related requests: Use groups for requests that logically belong together (e.g., analyzing multiple documents from the same source).

  2. Use webhooks for notifications: Instead of polling, configure webhooks to be notified when your group completes.

  3. Handle individual failures: Some requests in a group may fail while others succeed. Check each generation’s status.

  4. Stay under limits: Keep groups to 50 requests or fewer. For larger batches, use the Batch API.

  5. Include metadata: Add custom IDs or metadata to your requests for easier tracking:

    {
      "model": "meta-llama/llama-3.2-1b-instruct/fp-8",
      "messages": [...],
      "metadata": {
        "custom_id": "doc_123",
        "type": "summary"
      }
    }
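Assuming the metadata you attach is echoed back on each generation's request object (the example responses above echo the full submitted request), you can then map results back to your own records:

// Sketch: index generations by the custom_id attached at submission.
// Assumes metadata round-trips inside each generation's `request` object.
const byCustomId = new Map();
for (const gen of result.generations) {
  const customId = gen.request?.metadata?.custom_id;
  if (customId) byCustomId.set(customId, gen);
}

const docSummary = byCustomId.get('doc_123');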

Error Handling

The API validates your request structure immediately. Common errors include:

{
  "error": {
    "message": "Invalid request body.",
    "type": "BadRequestError",
    "fields": {
      "_errors": ["Unrecognized key(s) in object: 'webhook_url'"]
    }
  }
}

Ensure you use the correct field names:

  • webhook_id (correct)
  • webhook_url (incorrect)
  • webhook_idd (typo)
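Since validation errors are returned immediately at submission time rather than during processing, a minimal sketch for surfacing them looks like this:

const response = await fetch('https://api.inference.net/v1/slow/group/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.INFERENCE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ requests, webhook_id: 'my-webhook-123' }) // `requests` is your requests array
});

const body = await response.json();
if (!response.ok) {
  // e.g. "BadRequestError: Invalid request body." plus field-level details
  throw new Error(`${body.error.type}: ${body.error.message}`);
}
console.log(body.groupId);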

When to Use Group API

Choose the Group API when you need:

  • Quick implementation without file management
  • To process 50 or fewer related requests
  • Webhook notifications for a set of requests
  • Simple JSON-based integration

For larger workloads (more than 50 requests), consider using the Batch API instead.