Completions Endpoints
/v1/completions and /v1/chat/completions
/v1/chat/completions
The chat endpoint /v1/chat/completions allows callers to specify the prompt as an exchange between a user and an assistant, letting applications avoid prompt-formatting logic and switch between models more easily.
HTTP request
POST https://api.featherless.ai/v1/chat/completions
Request body
{
  "model": "string",
  "messages": [
    {
      "role": "system",
      "content": "string"
    },
    {
      "role": "user",
      "content": "string"
    }
  ],
  "presence_penalty": "float",
  "frequency_penalty": "float",
  "repetition_penalty": "float",
  "temperature": "float",
  "top_p": "float",
  "top_k": "integer",
  "min_p": "float",
  "seed": "integer",
  "stop": ["string"],
  "stop_token_ids": ["integer"],
  "include_stop_str_in_output": "boolean",
  "max_tokens": "integer",
  "min_tokens": "integer"
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | ID of the model to use for generating chat completions. |
| messages | array | A list of messages comprising the conversation so far. |
| presence_penalty | float | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition. |
| frequency_penalty | float | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition. |
| repetition_penalty | float | Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens; values < 1 encourage repetition. |
| temperature | float | Controls sampling randomness. Lower values make the model more deterministic; higher values introduce randomness. Zero is greedy sampling. |
| top_p | float | Controls the cumulative probability of considered top tokens. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | integer | Number of top tokens to consider. Set to -1 to consider all tokens. |
| min_p | float | Minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
| seed | integer | Random seed for generation. Not always reliable, as we use multiple servers. |
| stop | array | List of strings that stop generation when generated. The returned output excludes these strings. |
| stop_token_ids | array | List of token IDs that stop generation when generated. The returned output may include these tokens unless they are special tokens. |
| include_stop_str_in_output | boolean | Whether to include stop strings in the output text. Defaults to false. |
| max_tokens | integer | Maximum number of tokens generated per output sequence. |
| min_tokens | integer | Minimum number of tokens generated per output sequence before EOS or stop_token_ids can be generated. |
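For example, a minimal sketch (in TypeScript, assuming a Node 18+ runtime with a global fetch; the model name is reused from the examples in this document) combining a repetition penalty with stop strings:

// Sketch: chat completion with a repetition penalty and stop strings.
// Parameter names follow the table above.
const res = await fetch("https://api.featherless.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.FEATHERLESS_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "GalrionSoftworks/Margnum-12B-v1",
    messages: [{ role: "user", content: "List three birds." }],
    repetition_penalty: 1.1,            // > 1 discourages repeated tokens
    stop: ["\n\n"],                     // halt at the first blank line
    include_stop_str_in_output: false,  // the default; stop string is excluded
    max_tokens: 64
  })
});
const data = await res.json();
console.log(data.choices[0].message.content);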
Response body
If successful, the response body will contain data with the following structure:
{
  "id": "string",
  "object": "chat.completion",
  "created": "integer",
  "model": "string",
  "choices": [
    {
      "index": "integer",
      "message": {
        "role": "string",
        "content": "string"
      },
      "finish_reason": "string"
    }
  ],
  "usage": {
    "prompt_tokens": "integer",
    "completion_tokens": "integer",
    "total_tokens": "integer"
  }
}
Example request
curl https://api.featherless.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FEATHERLESS_API_KEY" \
  -d '{
    "model": "GalrionSoftworks/Margnum-12B-v1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "presence_penalty": 0.5,
    "frequency_penalty": 0.5,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 100
  }'
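Example response
The values below are illustrative (the id and token counts are assumptions); the structure matches the response body described above:
{
  "id": "chatcmpl-6YgK3ASw92kT14L5f8zJQ7yY",
  "object": "chat.completion",
  "created": 1630569482,
  "model": "GalrionSoftworks/Margnum-12B-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 9,
    "total_tokens": 18
  }
}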
/v1/completions
The text completions endpoint takes a single string (the prompt) and generates a completion. It gives callers fine-grained control over chat formatting and supports models in non-assistant scenarios (e.g. text formatting).
HTTP Request
POST https://api.featherless.ai/v1/completions
This endpoint generates a completion for the provided prompt and parameters using the specified model.
Request Body
Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | Required. ID of the model to use. |
| prompt | string or array | Required. The prompt(s) to generate completions for, encoded as a string. When an array is supplied, the call is treated as a parallel inference call: each element of the array is completed according to the other sampling parameters. This mode is subject to the same concurrency restrictions as parallel independent calls; if your plan affords 6 basic connections, one call with an array of length 6 will be permitted, but an array of length 7 will be rejected. See the sketch after this table. |
| presence_penalty | float | Optional. Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition. |
| frequency_penalty | float | Optional. Penalizes new tokens based on their frequency in the generated text. Values > 0 encourage new tokens; values < 0 encourage repetition. |
| repetition_penalty | float | Optional. Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens; values < 1 encourage repetition. |
| temperature | float | Optional. Controls the randomness of sampling. Lower values make the output more deterministic; higher values add more randomness. Zero is greedy sampling. |
| top_p | float | Optional. Controls the cumulative probability of the most likely tokens. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | integer | Optional. The number of top tokens to consider in the sampling process. Set to -1 to consider all tokens. |
| min_p | float | Optional. Sets a minimum probability threshold, relative to the most likely token, for a token to be considered. Must be in [0, 1]. Set to 0 to disable. |
| seed | integer | Optional. Sets a random seed for generation. Not always reliable, as multiple servers may be used. |
| stop | array | Optional. A list of strings that, when encountered in the generated output, stop further generation. The returned output excludes these strings. |
| stop_token_ids | array | Optional. Similar to stop, but a list of token IDs. The returned output may include these tokens unless they are special tokens. |
| include_stop_str_in_output | boolean | Optional. If set to true, includes stop strings in the output text. Defaults to false. |
| max_tokens | integer | Optional. The maximum number of tokens to generate in the completion. |
| min_tokens | integer | Optional. The minimum number of tokens to generate before EOS or stop_token_ids can be generated. |
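A sketch of the parallel mode described for prompt above (TypeScript, assuming a Node 18+ runtime with a global fetch, a plan affording at least three concurrent connections, and the model name reused from this document's examples):

// Sketch: parallel inference by passing an array of prompts.
// Each element is completed under the same sampling parameters, and the
// array length counts against the plan's concurrency limit.
const res = await fetch("https://api.featherless.ai/v1/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.FEATHERLESS_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    model: "GalrionSoftworks/Margnum-12B-v1",
    prompt: [
      "Once upon a time",
      "In a galaxy far away",
      "It was a dark and stormy night"
    ],
    temperature: 0.7,
    max_tokens: 50
  })
});
const data = await res.json();
// Assumption: one choice per prompt, matched by index, mirroring the
// OpenAI-compatible response shape shown below.
for (const choice of data.choices) {
  console.log(choice.index, choice.text);
}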
Example Request
{
  "model": "GalrionSoftworks/Margnum-12B-v1",
  "prompt": "Once upon a time",
  "temperature": 0.7,
  "max_tokens": 150,
  "top_p": 0.9,
  "frequency_penalty": 0.5,
  "presence_penalty": 0.0
}
Example Response
{
  "id": "cmpl-6YgK3ASw92kT14L5f8zJQ7yY",
  "object": "text_completion",
  "created": 1630569482,
  "model": "GalrionSoftworks/Margnum-12B-v1",
  "choices": [
    {
      "text": " in a land far, far away, there lived a wise old owl.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 15,
    "total_tokens": 20
  }
}
Application Headers
For users building software that uses Featherless as an inference provider, we ask that you please set the HTTP-Referer and X-Title headers in your API calls. Doing this:
- helps us support mutual users in the event of issues
- lays the groundwork for future application-specific analytics (e.g. seeing which models and sampler settings are common among your users)
HTTP-Referer (and not Referer) should be the URL of your project (e.g. https://app.wyvern.chat/), and X-Title should match the value of <title> provided on that page.
For example, the application Wyvern chat provides the following values in its API calls:
{
  "HTTP-Referer": "https://app.wyvern.chat/",
  "X-Title": "Wyvern chat"
}
Example
fetch("https://api.featherless.ai/v1/chat/completions", {
method: "POST",
headers: {
"Authorization": `Bearer ${FEATHERLESS_API_KEY}`,
"HTTP-Referer": `${YOUR_SITE_URL}`, // Optional: Include your app's URL for tracking
"X-Title": `${YOUR_SITE_NAME}`, // Optional: Identify your app in API analytics
"Content-Type": "application/json"
},
body: JSON.stringify({
"model": "GalrionSoftworks/Margnum-12B-v1",
"messages": [
{"role": "user", "content": "What is the meaning of life?"},
],
})
});
Notes on Sampler Parameters
Presence Penalty: Encourages the model to generate new tokens instead of repeating the same ones, ideal for creative writing.
Frequency Penalty: Helps balance repetition, useful in avoiding the overuse of specific words or phrases.
Temperature & Top-p: Together control the randomness and diversity of the output, crucial for customizing the creativity of the completion.
This API structure provides flexibility in generating text by allowing fine-tuning with various sampling parameters, making it adaptable for different use cases; the sketch below contrasts two illustrative presets.
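As an illustration of how these parameters combine (the presets and values below are assumptions for demonstration, not official recommendations):

// Illustrative presets only; the specific values are assumptions.
// Deterministic: temperature 0 is greedy sampling, so top_p and the
// penalties have little practical effect.
const deterministic = { temperature: 0, top_p: 1, max_tokens: 100 };

// Creative: a higher temperature plus nucleus sampling (top_p) widens the
// token distribution, while presence/frequency penalties push the model
// away from repeating the same words and phrases.
const creative = {
  temperature: 1.0,
  top_p: 0.9,
  presence_penalty: 0.6,
  frequency_penalty: 0.5,
  max_tokens: 100
};

// Either preset can be spread into a request body, e.g.
// JSON.stringify({ model, messages, ...creative })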