Replies: 15 comments
-
Hi @Calamari, no, there is currently no way to surface the rate information or the current token usage. The token limit is a general thing that applies to all LLMs and should be implemented. The token limit is per model too. I haven't looked into this much at this point.
-
I am not quite sure about models other than OpenAI's, but wouldn't it be a relatively easy solution to add a struct containing either the real response or the headers of that response as a fourth element of the result tuple of the LLMChain's run method?
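A rough illustration of the idea (purely hypothetical: the ResponseMeta struct and the four-element tuple are not the library's actual API, just a sketch of the suggestion):

```elixir
# Hypothetical sketch of the suggestion: a metadata struct carrying the raw
# response details, returned as an extra element from the chain's run function.
defmodule ResponseMeta do
  defstruct status: nil, headers: []
end

# e.g. instead of {:ok, chain, message}, the run could return:
# {:ok, chain, message, %ResponseMeta{status: 200, headers: raw_headers}}
```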
-
Here's what I mean by the model limits varying. Heads up: I'm mostly talking out loud here as I think through it too. Here are the details on ChatGPT's 3.5 models: notice they are 4K or 16K, with one legacy 8K model. The ChatGPT-4 models are 8K or 32K. The cost for using the larger-limit models is higher too. The idea of LangChain is to abstract away some of the differences between models so a config change swaps us to a different model. So the information about the token limit is relative. It's more about "how many tokens do I have left?" I'd like to review how this is managed in the JS or Python LangChain too, since they've had more time to think about it and what's actually helpful.
-
I was also thinking along the lines of the meta info of "how many tokens are left and when do they reset". At least for ChatGPT, they say they provide this info as header parameters in the response. I think it would make sense to somehow pass this through to the caller as well, so they can put some form of rate limiting in place. As far as I can see, right now, if you make a call that brings you over the limit, you don't even get to know when it would reset.
-
I looked into the JS version and they don't have anything documented, at least. The Python version's docs are much more complete here and I like their approach: https://python.langchain.com/docs/modules/model_io/models/llms/token_usage_tracking The caller can provide a callback to get that information. In an Elixir world, passing in an anonymous callback function could be all that's needed. Then, after a call to the LLM, the callback fires with the information in a struct format. The linked docs show an example of the Python result information (tokens used for the prompt and completion, successful requests, and total cost).
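As a rough sketch of what that could look like on the Elixir side (hypothetical only: the TokenUsage struct and the on_usage option are illustrative names, not the library's API):

```elixir
# Hypothetical sketch: an anonymous callback that receives token usage after
# each LLM call. Struct and option names are assumptions for illustration.
defmodule TokenUsage do
  defstruct prompt_tokens: 0, completion_tokens: 0, total_tokens: 0
end

usage_callback = fn %TokenUsage{} = usage ->
  IO.puts(
    "Used #{usage.total_tokens} tokens " <>
      "(#{usage.prompt_tokens} prompt / #{usage.completion_tokens} completion)"
  )
end

# The chain would invoke the callback after each call, e.g.:
# LLMChain.run(chain, on_usage: usage_callback)
```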
-
But this still doesn't tell me what I want to know, which is, "given the model I'm using, how many tokens do I have left?" That's left up to me, the caller, to figure out.
-
A callback sounds like a nice idea. Looking at the API docs, at least for OpenAI this information about how many tokens are left is returned in response headers such as x-ratelimit-remaining-tokens and x-ratelimit-reset-tokens.
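A minimal sketch of pulling those values out of a response, assuming the headers arrive as a list of name/value tuples from the HTTP client (none of this is the library's actual code):

```elixir
# Minimal sketch (assumption: headers come back as [{name, value}, ...]).
# Extracts OpenAI's rate-limit headers into a plain map for the caller.
defmodule RateLimitHeaders do
  @wanted ~w(
    x-ratelimit-limit-tokens
    x-ratelimit-remaining-tokens
    x-ratelimit-reset-tokens
    x-ratelimit-remaining-requests
    x-ratelimit-reset-requests
  )

  def extract(headers) when is_list(headers) do
    for {name, value} <- headers,
        key = String.downcase(name),
        key in @wanted,
        into: %{} do
      {key, value}
    end
  end
end

# RateLimitHeaders.extract([{"x-ratelimit-remaining-tokens", "39500"}, ...])
# => %{"x-ratelimit-remaining-tokens" => "39500"}
```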
-
There are two different types of limits being talked about here.
The first is the model's context size: a fixed count of tokens based on the conversation size. The rate-limit tokens are separate and cover the number of tokens per minute that the caller's account is allowed to use; that count and limit reset based on time. The time-based rate limits are something a server might want to track so it can enforce its own limits on its users across requests. The size-based limits are hard limits, and those force the need to summarize or start new conversations.
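To make the distinction concrete (the numbers and variable names below are illustrative only):

```elixir
# Illustration only; values and names are made up for this example.

# 1. Context-size limit: fixed per model, consumed by the conversation itself.
context_limit = 16_384                 # e.g. a 16K-context model
tokens_in_conversation = 14_900        # prompt + history + expected completion
context_tokens_left = context_limit - tokens_in_conversation

# 2. Time-based rate limit: per account, resets on a schedule the API reports.
rate_remaining_tokens = 2_000          # from x-ratelimit-remaining-tokens
rate_resets_in = "6m0s"                # from x-ratelimit-reset-tokens

IO.puts(
  "#{context_tokens_left} tokens left in context, " <>
    "#{rate_remaining_tokens} tokens left this minute (resets in #{rate_resets_in})"
)
```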
-
Yes, I am currently interested in the time-based rate limits. Since there is currently no way to track them, the server cannot throttle anything and doesn't know when to retry.
-
Hey @Calamari, I'm the maintainer of LiteLLM. Our Router (used for load balancing across different openai/azure/etc. endpoints) uses time-based limits as a way to time out + retry requests: https://github.com/BerriAI/litellm/blob/9b5f52ae635594aeba3cb6f2a3f81dd3da03e169/litellm/router.py#L190 Let me know how our implementation can be improved. Attaching sample code for a quick start below.

```python
import os
from litellm import Router

model_list = [
    {  # list of model deployments
        "model_name": "gpt-3.5-turbo",  # model alias
        "litellm_params": {  # params for litellm completion/embedding call
            "model": "azure/chatgpt-v-2",  # actual model name
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/chatgpt-functioncalling",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "gpt-3.5-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
    },
]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement (call from within an async function)
response = await router.acompletion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)
```
-
@Calamari Just a quick follow-up. I'm not sure how to best support this feature. I'm also thinking of Bumblebee-based LLMs; I've been in talks with that team about getting token counts from those as well. Just letting you know that I'm tracking it but not actively working on implementing it myself at this time.
-
If you're trying to set rate limits, wouldn't it make more sense to set up a proxy which can track the rate limits per deployment across all the calls in the project? @Calamari @brainlid
-
Related to discussion #103
-
As I explained in #103, the next version introduces a callback system that will make it easy to expose this information.
-
This adds support for rate-limit response information in the callbacks. I've added it for OpenAI and Anthropic. It should be very easy to add it for additional services as well, following the same pattern. Hopefully this addresses your need!
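A rough usage sketch of the idea, assuming the rate-limit info is delivered through a handler function registered on the chain (the handler key name, its arguments, and the add_callback call below are assumptions for illustration, not verified against the released API):

```elixir
# Sketch only: handler key and registration are assumptions, not a verified
# example of the released API. The idea: register a handler that receives the
# parsed rate-limit info after each LLM response.
handlers = %{
  on_llm_ratelimit_info: fn _model, info ->
    IO.inspect(info, label: "rate limit info")
  end
}

# Then attach the handlers when building the chain, e.g. something like:
# chain |> LLMChain.add_callback(handlers) |> LLMChain.run()
```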
-
For OpenAI, they specify rate limits here. They add fields to the header to show how many tokens are still left. To build something that respects those limits and retries after the limit has been reset, it would be great to have those available in the response somehow.
I quickly searched in the code but could not find anything. Is there currently a way to handle this?