[Backport M84] Replace tiktoken with gpt-tokenizer #7667

sourcegraph-release-bot · 2025-04-04T07:52:44Z

Changes

In my memory profiling and local stress testing, I consistently see tiktoken leaking memory and causing huge spikes in CPU usage. If I keep hammering the IDE with text changes (for example, holding Enter down for a minute), it can even lead to 100% CPU utilisation that never recovers.

I decided to give some alternatives a try, and the change is easy to see with the naked eye.
Namely, I replaced tiktoken with gpt-tokenizer (https://github.com/niieani/gpt-tokenizer), which claims to be:

> most feature-complete, open-source GPT tokenizer on NPM. This package is a port of OpenAI's [tiktoken]

It is also much more performant and less memory-hungry. Additionally, because it initialises much faster, I deleted a bunch of code for lazy initialisation of tiktoken, and I see no downside.
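To illustrate the kind of simplification described above, here is a minimal sketch contrasting lazy and eager tokenizer setup. It is not the code removed in this PR: the `Tokenizer` interface, `loadSlowTokenizer`, and the toy one-token-per-character encoder are all hypothetical, chosen so the example runs without any dependency.

```typescript
// Hypothetical tokenizer shape; not the actual Cody or gpt-tokenizer API.
interface Tokenizer {
    encode(text: string): number[]
}

// Toy encoder: one "token" per UTF-16 code unit. Real BPE tokenizers differ.
function toyEncode(text: string): number[] {
    return text.split('').map(ch => ch.charCodeAt(0))
}

// Stand-in for an expensive, asynchronous setup step.
async function loadSlowTokenizer(): Promise<Tokenizer> {
    return { encode: toyEncode }
}

// Before: lazy initialisation. The tokenizer is created on first use and
// every caller has to await a shared, cached promise.
let cachedTokenizer: Promise<Tokenizer> | undefined
function getTokenizerLazily(): Promise<Tokenizer> {
    if (!cachedTokenizer) {
        cachedTokenizer = loadSlowTokenizer()
    }
    return cachedTokenizer
}

async function countTokensLazy(text: string): Promise<number> {
    const lazyTokenizer = await getTokenizerLazily()
    return lazyTokenizer.encode(text).length
}

// After: when setup is cheap, the tokenizer can be created at module load
// and token counting becomes a plain synchronous call.
const eagerTokenizer: Tokenizer = { encode: toyEncode }

function countTokens(text: string): number {
    return eagerTokenizer.encode(text).length
}
```

With the eager variant, callers no longer need `await` or a cache for the pending promise, which is the sort of code that can be deleted when initialisation is fast.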

In a practical test with Cody, I held the Enter key down for a minute (adding 1-2k lines).
After the test, I looked at CPU utilisation (whether it persists) and memory usage:

Old version with tiktoken:
100% CPU utilisation, 1.5 GB+ of memory used by the agent process

New version with gpt-tokenizer:
0% CPU utilisation, less than 400 MB of memory used by the agent process

Test plan

All unit tests still pass without any changes to the logic or values.

Scenario 1:

1. Open a big file (e.g. a log file of a few MB)
2. Make sure it is excluded from the context

Scenario 2:

1. Open any file
2. Keep appending text to that file (e.g. by keeping `Enter` pressed) until ~1000 lines are added
3. After that, check CPU utilisation and memory usage: CPU usage should be 0%, and memory utilisation should not have increased by more than a few percent
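
The check in Scenario 2 can be turned into numbers with a small helper like the one below. This is only a sketch: the `CpuSample` shape, the function names, and the thresholds (under 1% CPU standing in for "0%", at most 5% growth standing in for "a few %") are assumptions, not values taken from this PR.

```typescript
// One CPU reading of the agent process; field names are hypothetical.
interface CpuSample {
    cpuMicros: number // cumulative CPU time (user + system), in microseconds
    wallMillis: number // wall-clock time of the reading, in milliseconds
}

// Average CPU utilisation (percent of one core) between two readings.
function cpuPercentBetween(start: CpuSample, end: CpuSample): number {
    const elapsedMicros = (end.wallMillis - start.wallMillis) * 1000
    if (elapsedMicros <= 0) {
        throw new Error('readings must be taken in order, at different times')
    }
    return ((end.cpuMicros - start.cpuMicros) * 100) / elapsedMicros
}

// Memory growth relative to the baseline taken before the stress test.
function memoryGrowthPercent(baselineBytes: number, currentBytes: number): number {
    if (baselineBytes <= 0) {
        throw new Error('baseline must be positive')
    }
    return ((currentBytes - baselineBytes) * 100) / baselineBytes
}

// Assumed thresholds: "0% CPU" is read as below 1%, "a few %" as at most 5%.
function isHealthyAfterStress(
    cpuPercent: number,
    growthPercent: number,
    maxCpuPercent: number = 1,
    maxGrowthPercent: number = 5
): boolean {
    return cpuPercent <= maxCpuPercent && growthPercent <= maxGrowthPercent
}
```

In Node.js the inputs could come from `process.cpuUsage()` (user plus system time, in microseconds), `process.memoryUsage().rss`, and `Date.now()`. Those calls report on the current process only, so they would have to run inside the agent process; otherwise an OS-level process monitor gives the same figures.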
Backport 16c0d79 from Replace tiktoken with gpt-tokenizer #7662

sourcegraph-release-bot requested review from dominiccooney, pkukielka and a team April 4, 2025 07:52

sourcegraph-release-bot added backports backported-to-M84 labels Apr 4, 2025

hitesh-1997 approved these changes Apr 4, 2025

hitesh-1997 merged commit 796041f into M84 Apr 4, 2025
21 of 23 checks passed

hitesh-1997 deleted the backport-7662-to-M84 branch April 4, 2025 11:21
