Completions Endpoints

/v1/completions and /v1/chat/completions


/v1/chat/completions

The chat endpoint /v1/chat/completions allows callers to specify the prompt as an exchange between a user and an assistant, freeing applications from prompt-formatting logic and making it easier to switch between models.
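
For example, a multi-turn conversation is expressed as an ordered list of role-tagged messages, and the server applies the model's own prompt template. A minimal sketch, assuming the standard OpenAI-style roles system, user, and assistant (contents illustrative):

const messages = [
  { "role": "system", "content": "You are a concise assistant." },
  { "role": "user", "content": "What is the capital of France?" },
  { "role": "assistant", "content": "Paris." },
  { "role": "user", "content": "And of Italy?" } // the next completion continues this exchange
];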

HTTP request

POST https://api.featherless.ai/v1/chat/completions

Request body

{
  "model": "string",
  "messages": [
    {
      "role": "system",
      "content": "string"
    },
    {
      "role": "user",
      "content": "string"
    }
  ],
  "presence_penalty": "float",
  "frequency_penalty": "float",
  "repetition_penalty": "float",
  "temperature": "float",
  "top_p": "float",
  "top_k": "integer",
  "min_p": "float",
  "seed": "integer",
  "stop": ["string"],
  "stop_token_ids": ["integer"],
  "include_stop_str_in_output": "boolean",
  "max_tokens": "integer",
  "min_tokens": "integer"
}

Parameters

• model (string): ID of the model to use for generating chat completions.

• messages (array): A list of messages comprising the conversation so far.

• presence_penalty (float): Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.

• frequency_penalty (float): Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.

• repetition_penalty (float): Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens; values < 1 encourage repetition.

• temperature (float): Controls sampling randomness. Lower values make the model more deterministic; higher values introduce randomness. Zero is greedy sampling.

• top_p (float): Controls the cumulative probability of the top tokens considered. Must be in (0, 1]. Set to 1 to consider all tokens.

• top_k (integer): Number of top tokens to consider. Set to -1 to consider all tokens.

• min_p (float): Minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. For example, with min_p = 0.1, tokens whose probability is below 10% of the most likely token's are filtered out.

• seed (integer): Random seed for generation. Not fully reliable, as requests may be served by different servers.

• stop (array): List of strings that stop generation when generated. The returned output excludes these strings. (See the sketch after this list.)

• stop_token_ids (array): List of token IDs that stop generation when generated. The returned output may include these tokens unless they are special tokens.

• include_stop_str_in_output (boolean): Whether to include stop strings in the output text. Defaults to false.

• max_tokens (integer): Maximum number of tokens generated per output sequence.

• min_tokens (integer): Minimum number of tokens generated per output sequence before EOS or stop_token_ids can be generated.
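
As an illustration of the stop-related parameters, a hedged fragment of a request body (values are illustrative, not defaults):

// Stop at a turn boundary; the stop string itself is excluded from the output by default.
const stopParams = {
  "stop": ["\nUser:"],
  "include_stop_str_in_output": false,
  "min_tokens": 16,
  "max_tokens": 256
};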

Response body

If successful, the response body will contain data with the following structure:

{
  "id": "string",
  "object": "chat.completion",
  "created": "integer",
  "model": "string",
  "choices": [
    {
      "index": "integer",
      "message": {
        "role": "string",
        "content": "string"
      },
      "finish_reason": "string"
    }
  ],
  "usage": {
    "prompt_tokens": "integer",
    "completion_tokens": "integer",
    "total_tokens": "integer"
  }
}
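
A minimal sketch of reading the reply from this structure, assuming the parsed JSON response is named data:

const choice = data.choices[0];
const reply = choice.message.content;  // the assistant's text
const reason = choice.finish_reason;   // e.g. "stop" or "length"
console.log(`(${data.usage.total_tokens} tokens) ${reply}`);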

Example request

curl https://api.featherless.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FEATHERLESS_API_KEY" \
  -d '{
    "model": "GalrionSoftworks/Margnum-12B-v1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "presence_penalty": 0.5,
    "frequency_penalty": 0.5,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 100
  }'

/v1/completions

The text completions endpoint takes a single string (the prompt) and generates a completion. This endpoint lets users take fine-grained control of the chat format, or use models in non-assistant scenarios such as text formatting.

HTTP Request

POST https://api.featherless.ai/v1/completions

This endpoint generates a completion for the provided prompt and parameters using the specified model.

Request Body

Parameters

• model (string): Required. ID of the model to use.

• prompt (string | array): Required. The prompt(s) to generate completions for, encoded as a string or an array of strings. When an array is supplied, the request is treated as a parallel inference call, where each element of the array is completed according to the other sampling parameters. This mode is subject to the same concurrency restrictions as parallel independent calls: if your plan affords 6 basic connections, one call with an array of length 6 will be permitted, but an array of length 7 will be rejected. (See the sketch after this list.)

• presence_penalty (float): Optional. Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.

• frequency_penalty (float): Optional. Penalizes new tokens based on their frequency in the generated text. Values > 0 encourage new tokens; values < 0 encourage repetition.

• repetition_penalty (float): Optional. Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens; values < 1 encourage repetition.

• temperature (float): Optional. Controls the randomness of sampling. Lower values make the output more deterministic; higher values add more randomness. Zero is greedy sampling.

• top_p (float): Optional. Controls the cumulative probability of the most likely tokens. Must be in (0, 1]. Set to 1 to consider all tokens.

• top_k (integer): Optional. The number of top tokens to consider in the sampling process. Set to -1 to consider all tokens.

• min_p (float): Optional. Minimum probability threshold for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable.

• seed (integer): Optional. Random seed for generation. Not always reliable, as requests may be served by different servers.

• stop (array): Optional. A list of strings that, when encountered in the generated output, stop further generation. The returned output excludes these strings.

• stop_token_ids (array): Optional. Similar to stop, but uses token IDs to halt generation. The output may include these tokens unless they are special tokens.

• include_stop_str_in_output (boolean): Optional. If set to true, includes stop strings in the output text. Defaults to false.

• max_tokens (integer): Optional. The maximum number of tokens to generate in the completion.

• min_tokens (integer): Optional. The minimum number of tokens to generate before EOS or stop_token_ids can be generated.
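
A sketch of such a parallel call, batching three prompts in one request (model name and response handling are illustrative; each array element counts against your plan's concurrency limit):

fetch("https://api.featherless.ai/v1/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${FEATHERLESS_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "model": "GalrionSoftworks/Margnum-12B-v1",
    "prompt": ["The sea is", "The sky is", "The desert is"], // 3 elements = 3 concurrent connections
    "max_tokens": 20
  })
})
  .then((res) => res.json())
  .then((data) => {
    // Assumption: one choice per prompt, with `index` mapping back to the input array.
    data.choices.forEach((c) => console.log(c.index, c.text));
  });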

Example Request

{
  "model": "GalrionSoftworks/Margnum-12B-v1",
  "prompt": "Once upon a time",
  "temperature": 0.7,
  "max_tokens": 150,
  "top_p": 0.9,
  "frequency_penalty": 0.5,
  "presence_penalty": 0.0
}

Example Response

{
  "id": "cmpl-6YgK3ASw92kT14L5f8zJQ7yY",
  "object": "text_completion",
  "created": 1630569482,
  "model": "GalrionSoftworks/Margnum-12B-v1",
  "choices": [
    {
      "text": " in a land far, far away, there lived a wise old owl.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 15,
    "total_tokens": 20
  }
}

Application Headers

For users building software that uses Featherless as an inference provider, we ask that you please set the headers HTTP-Referer and X-Title in your API calls. Doing this will

  • help us support mutual users in the event of issues

  • lay the groundwork for future application-specific analytics (e.g. see which models and sampler settings are common among your users)

HTTP-Referer (and not Referer) should be the URL of your project (e.g. https://app.wyvern.chat/) and X-Title should match the value of <title> provided on that page.

For example, the application Wyvern chat provides the following values in its API calls:

{
  "HTTP-Referer": "https://app.wyvern.chat/"
  "X-Title": "Wyvern chat"
}

Example

fetch("https://api.featherless.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${FEATHERLESS_API_KEY}`,
    "HTTP-Referer": `${YOUR_SITE_URL}`, // Optional: Include your app's URL for tracking
    "X-Title": `${YOUR_SITE_NAME}`,     // Optional: Identify your app in API analytics
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "model": "GalrionSoftworks/Margnum-12B-v1",
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"},
    ],
  })
});

Notes on Sampler Parameters

  • Presence Penalty: Encourages the model to generate new tokens instead of repeating the same ones, ideal for creative writing.

  • Frequency Penalty: Helps balance repetition, useful in avoiding the overuse of specific words or phrases.

  • Temperature & Top-p: Together control the randomness and diversity of the output, crucial for customizing the creativity of the completion.
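
As an illustration, two hedged parameter presets built from these notes (exact values are a starting point, not a prescription):

// Deterministic, extractive answers: greedy decoding, no penalties.
const precise = { "temperature": 0, "top_p": 1, "presence_penalty": 0, "frequency_penalty": 0 };

// Creative writing: more randomness, with mild penalties to discourage repetition.
const creative = { "temperature": 0.9, "top_p": 0.9, "presence_penalty": 0.5, "frequency_penalty": 0.5 };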

Together, these endpoints and sampling parameters give fine-grained control over text generation, making the API adaptable to a wide range of use cases.