Completions Endpoints

/v1/completions and /v1/chat/completions


/v1/chat/completions

The chat endpoint /v1/chat/completions allows callers to specify the prompt as an exchange between a user and an assistant, freeing applications from prompt-formatting logic and making it easier to switch between models.
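
For example, a multi-turn conversation is expressed as an ordered list of role-tagged messages, and the server applies the model's own prompt template. A minimal sketch, assuming the standard OpenAI-style roles system, user, and assistant (contents illustrative):

const messages = [
  { "role": "system", "content": "You are a concise assistant." },
  { "role": "user", "content": "What is the capital of France?" },
  { "role": "assistant", "content": "Paris." },
  { "role": "user", "content": "And of Italy?" } // the next completion continues this exchange
];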

HTTP request

POST https://api.featherless.ai/v1/chat/completions

Request body

{
  "model": "string",
  "messages": [
    {
      "role": "system",
      "content": "string"
    },
    {
      "role": "user",
      "content": "string"
    }
  ],
  "presence_penalty": "float",
  "frequency_penalty": "float",
  "repetition_penalty": "float",
  "temperature": "float",
  "top_p": "float",
  "top_k": "integer",
  "min_p": "float",
  "seed": "integer",
  "stop": ["string"],
  "stop_token_ids": ["integer"],
  "include_stop_str_in_output": "boolean",
  "max_tokens": "integer",
  "min_tokens": "integer"
}

Parameters

• model (string): ID of the model to use for generating chat completions.

• messages (array): A list of messages comprising the conversation so far.

• presence_penalty (float): Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.

• frequency_penalty (float): Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.

• repetition_penalty (float): Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens; values < 1 encourage repetition.

• temperature (float): Controls sampling randomness. Lower values make the model more deterministic; higher values introduce randomness. Zero is greedy sampling.

• top_p (float): Controls the cumulative probability of the top tokens considered. Must be in (0, 1]. Set to 1 to consider all tokens.

• top_k (integer): Number of top tokens to consider. Set to -1 to consider all tokens.

• min_p (float): Minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. For example, with min_p = 0.1, tokens whose probability is below 10% of the most likely token's are filtered out.

• seed (integer): Random seed for generation. Not fully reliable, as requests may be served by different servers.

• stop (array): List of strings that stop generation when generated. The returned output excludes these strings. (See the sketch after this list.)

• stop_token_ids (array): List of token IDs that stop generation when generated. The returned output may include these tokens unless they are special tokens.

• include_stop_str_in_output (boolean): Whether to include stop strings in the output text. Defaults to false.

• max_tokens (integer): Maximum number of tokens generated per output sequence.

• min_tokens (integer): Minimum number of tokens generated per output sequence before EOS or stop_token_ids can be generated.
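
As an illustration of the stop-related parameters, a hedged fragment of a request body (values are illustrative, not defaults):

// Stop at a turn boundary; the stop string itself is excluded from the output by default.
const stopParams = {
  "stop": ["\nUser:"],
  "include_stop_str_in_output": false,
  "min_tokens": 16,
  "max_tokens": 256
};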

Response body

If successful, the response body will contain data with the following structure:

{
  "id": "string",
  "object": "chat.completion",
  "created": "integer",
  "model": "string",
  "choices": [
    {
      "index": "integer",
      "message": {
        "role": "string",
        "content": "string"
      },
      "finish_reason": "string"
    }
  ],
  "usage": {
    "prompt_tokens": "integer",
    "completion_tokens": "integer",
    "total_tokens": "integer"
  }
}
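
A minimal sketch of reading the reply from this structure, assuming the parsed JSON response is named data:

const choice = data.choices[0];
const reply = choice.message.content;  // the assistant's text
const reason = choice.finish_reason;   // e.g. "stop" or "length"
console.log(`(${data.usage.total_tokens} tokens) ${reply}`);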

Example request

curl https://api.featherless.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FEATHERLESS_API_KEY" \
  -d '{
    "model": "GalrionSoftworks/Margnum-12B-v1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "presence_penalty": 0.5,
    "frequency_penalty": 0.5,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 100
  }'

/v1/completions

The text completions endpoint takes a single string (the prompt) and generates a completion. This endpoint lets users take fine-grained control of the chat format, or use models in non-assistant scenarios such as text formatting.

HTTP Request

POST https://api.featherless.ai/v1/completions

This endpoint generates a completion for the provided prompt and parameters using the specified model.

Request Body

Parameters

• model (string): Required. ID of the model to use.

• prompt (string | array): Required. The prompt(s) to generate completions for, encoded as a string or an array of strings. When an array is supplied, the request is treated as a parallel inference call, where each element of the array is completed according to the other sampling parameters. This mode is subject to the same concurrency restrictions as parallel independent calls: if your plan affords 6 basic connections, one call with an array of length 6 will be permitted, but an array of length 7 will be rejected. (See the sketch after this list.)

• presence_penalty (float): Optional. Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.

• frequency_penalty (float): Optional. Penalizes new tokens based on their frequency in the generated text. Values > 0 encourage new tokens; values < 0 encourage repetition.

• repetition_penalty (float): Optional. Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens; values < 1 encourage repetition.

• temperature (float): Optional. Controls the randomness of sampling. Lower values make the output more deterministic; higher values add more randomness. Zero is greedy sampling.

• top_p (float): Optional. Controls the cumulative probability of the most likely tokens. Must be in (0, 1]. Set to 1 to consider all tokens.

• top_k (integer): Optional. The number of top tokens to consider in the sampling process. Set to -1 to consider all tokens.

• min_p (float): Optional. Minimum probability threshold for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable.

• seed (integer): Optional. Random seed for generation. Not always reliable, as requests may be served by different servers.

• stop (array): Optional. A list of strings that, when encountered in the generated output, stop further generation. The returned output excludes these strings.

• stop_token_ids (array): Optional. Similar to stop, but uses token IDs to halt generation. The output may include these tokens unless they are special tokens.

• include_stop_str_in_output (boolean): Optional. If set to true, includes stop strings in the output text. Defaults to false.

• max_tokens (integer): Optional. The maximum number of tokens to generate in the completion.

• min_tokens (integer): Optional. The minimum number of tokens to generate before EOS or stop_token_ids can be generated.
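
A sketch of such a parallel call, batching three prompts in one request (model name and response handling are illustrative; each array element counts against your plan's concurrency limit):

fetch("https://api.featherless.ai/v1/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${FEATHERLESS_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "model": "GalrionSoftworks/Margnum-12B-v1",
    "prompt": ["The sea is", "The sky is", "The desert is"], // 3 elements = 3 concurrent connections
    "max_tokens": 20
  })
})
  .then((res) => res.json())
  .then((data) => {
    // Assumption: one choice per prompt, with `index` mapping back to the input array.
    data.choices.forEach((c) => console.log(c.index, c.text));
  });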

Example Request

{
  "model": "GalrionSoftworks/Margnum-12B-v1",
  "prompt": "Once upon a time",
  "temperature": 0.7,
  "max_tokens": 150,
  "top_p": 0.9,
  "frequency_penalty": 0.5,
  "presence_penalty": 0.0
}

Example Response

{
  "id": "cmpl-6YgK3ASw92kT14L5f8zJQ7yY",
  "object": "text_completion",
  "created": 1630569482,
  "model": "GalrionSoftworks/Margnum-12B-v1",
  "choices": [
    {
      "text": " in a land far, far away, there lived a wise old owl.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 15,
    "total_tokens": 20
  }
}

Application Headers

For users building software that uses Featherless as an inference provider, we ask that you please set the headers HTTP-Referer and X-Title in your API calls. Doing this will

  • help us support mutual users in the event of issues

  • lay the groundwork for future application-specific analytics (e.g. see which models and sampler settings are common among your users)

HTTP-Referer (and not Referer) should be the URL of your project (e.g. https://app.wyvern.chat/) and X-Title should match the value of <title> provided on that page.

For example, the application Wyvern chat provides the following values in its API calls:

{
  "HTTP-Referer": "https://app.wyvern.chat/"
  "X-Title": "Wyvern chat"
}

Example

fetch("https://api.featherless.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${FEATHERLESS_API_KEY}`,
    "HTTP-Referer": `${YOUR_SITE_URL}`, // Optional: Include your app's URL for tracking
    "X-Title": `${YOUR_SITE_NAME}`,     // Optional: Identify your app in API analytics
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "model": "GalrionSoftworks/Margnum-12B-v1",
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"},
    ],
  })
});

Notes on Sampler Parameters

  • Presence Penalty: Encourages the model to generate new tokens instead of repeating the same ones, ideal for creative writing.

  • Frequency Penalty: Helps balance repetition, useful in avoiding the overuse of specific words or phrases.

  • Temperature & Top-p: Together control the randomness and diversity of the output, crucial for customizing the creativity of the completion.
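
As an illustration, two hedged parameter presets built from these notes (exact values are a starting point, not a prescription):

// Deterministic, extractive answers: greedy decoding, no penalties.
const precise = { "temperature": 0, "top_p": 1, "presence_penalty": 0, "frequency_penalty": 0 };

// Creative writing: more randomness, with mild penalties to discourage repetition.
const creative = { "temperature": 0.9, "top_p": 0.9, "presence_penalty": 0.5, "frequency_penalty": 0.5 };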

Together, these endpoints and sampling parameters give fine-grained control over text generation, making the API adaptable to a wide range of use cases.