This project is an attempt to recreate some of the 'OS1' experience from the movie 'Her', running entirely locally in your browser using transformers.js and featuring direct speech interaction with the core LLM, all while trying to keep it lightweight at a whopping ~2GB of model downloads (which are then cached for future use).
It was a tough challenge getting a small 1B model to work in a usable way without it hallucinating too much, but guided prompts and a memory bank helped tremendously. Still, the model is prone to saying nonsense at times and is far from perfect.
Try it out: https://callbacked.github.io/os1/
(On your first visit you will download the required models, so loading takes a bit longer at first. Open your browser's developer tools and watch the Network tab if you really want to see the models being fetched.)
- Speech-to-Speech Conversation: Uses the onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX model, which accepts both audio and text input. This enables a direct conversational flow where your voice is processed directly by the core LLM without first being converted to text (fewer pipeline stages, less latency).
- Parallel Transcription (for Display/Memory): While your voice directly drives the LLM, the whisper-base model runs in parallel to transcribe your speech (see the sketch after this feature list). This transcription is used for:
- Displaying your words on the screen for visual feedback.
- Storing the text representation of what you said to the LLM in the vector storage.
- Note: This transcription is not sent to the LLM for response generation.
- (Why I did this: Even though the Ultravox Llama 3.2 1B model uses a Whisper backbone internally to process speech, there is no way to expose an intermediate transcription from it. Running Whisper separately is therefore necessary for displaying speech feedback and for adding the user's dialogue to vector storage.)
- Client-Side Memory: Stores interactions between the user and assistant using vector storage in the browser's IndexedDB. Contextually relevant memories are automatically retrieved and injected into the LLM's system prompt.
- Local & Private: All conversation data, memory storage, and AI model processing (including LLM, STT, TTS, Embeddings) happen locally on your machine. Nothing is sent to external servers.
- Small Footprint: Requires approximately 2GB of model downloads on first launch (cached for subsequent visits).
- Proactive Greetings: Welcomes users differently on their very first visit versus return visits, attempting to recall the user's name (if previously mentioned) by querying the memory bank.
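For the parallel transcription path, a minimal sketch using the Transformers.js pipeline API is shown below. The task and model id match what's listed in the tech stack; the function itself and its wiring are illustrative, not the project's exact code (the package name assumes Transformers.js v3; older setups import from @xenova/transformers):

```ts
import { pipeline } from "@huggingface/transformers";

// Load the Whisper pipeline once; the weights are cached by the browser after the first download.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-base",
);

// `audio` is mono 16 kHz PCM captured from the microphone.
// The same buffer is handed to Ultravox untouched; this text is only used
// for on-screen feedback and for the memory bank.
async function transcribeForDisplay(audio: Float32Array): Promise<string> {
  const result: any = await transcriber(audio);
  return result.text;
}
```

Running this alongside the Ultravox call, rather than in front of it, is what keeps the extra transcription from adding latency to the spoken response.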
- Frontend: React, Vite, TypeScript, Tailwind CSS
- Models
- Core LLM & Audio Input: onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX
- Speech-to-Text (For Display/Memory): onnx-community/whisper-base
- Text-to-Speech: hexgrad/Kokoro-82M
- Embeddings (Memory): sentence-transformers/all-MiniLM-L6-v2
- Memory: In-browser vector storage using basic cosine similarity search
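Since the memory bank is just embeddings scanned with cosine similarity, the search itself fits in a few lines. A sketch, assuming stored records look roughly like the interface below (the type and function names are illustrative):

```ts
interface MemoryRecord {
  role: "user" | "assistant";
  text: string;
  embedding: number[]; // produced by all-MiniLM-L6-v2 (384 dimensions)
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every stored memory against the query embedding and keep the best matches.
function searchMemories(query: number[], memories: MemoryRecord[], topK = 5, minScore = 0.28) {
  return memories
    .map((m) => ({ memory: m, score: cosineSimilarity(query, m.embedding) }))
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

A brute-force scan like this is presumably acceptable because the store only ever holds a single user's conversation history.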
- Model Loading: Transformers.js downloads and initializes the Ultravox Llama 3.2 1b, Whisper-Base, Kokoro-TTS, and all-MiniLM-L6-v2 models.
- Interaction (Voice Example):
- User clicks the microphone and speaks.
- Audio data is captured.
- In Parallel: Whisper processes the audio to generate text for display and later memory storage.
- The captured audio data is prepared for the Ultravox Llama 3.2 model.
- Context Building:
- An embedding is generated from the Whisper transcription.
- The local memory (in IndexedDB) is searched for relevant past interactions.
- A system prompt is constructed containing persona instructions and relevant memory excerpts.
- The top 5 memories are retrieved, keeping only those that meet the minimum cosine-similarity score of 0.28; anything below the threshold is filtered out (see the context-building sketch after these steps).
- Ultravox Llama 3.2 1b Processing (a hedged sketch of this step follows these steps):
- The system prompt (with context) and a special `<|audio|>` token are sent to the Ultravox processor.
- The processor combines the text prompt embeddings with embeddings generated from the captured audio data.
- The combined embeddings are fed to the Ultravox LLM to generate a text response stream.
- TTS Generation: The generated text stream is sent to Kokoro TTS to generate audio chunks, which are then played back for the user to hear (a rough sketch follows these steps).
- Memory Storage: User input (the Whisper transcription or typed text) is processed. If it's short (e.g., fewer than 15 words), the raw text is stored directly; if longer, it's sent to the LLM for summarization (64 tokens max). The result is then converted into an embedding and stored with the 'user' role in IndexedDB for future context retrieval (sketched below).
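Putting the context-building step into code terms: embed the Whisper transcription with the MiniLM model, run the similarity search sketched earlier, and splice the survivors into the system prompt. This is a hedged sketch; the persona text is a placeholder, and some Transformers.js setups load the Xenova/all-MiniLM-L6-v2 ONNX mirror instead of the sentence-transformers id listed above:

```ts
import { pipeline } from "@huggingface/transformers";

const embedder = await pipeline(
  "feature-extraction",
  "sentence-transformers/all-MiniLM-L6-v2",
);

// Mean-pool and normalize to get a single 384-dimensional sentence embedding.
async function embed(text: string): Promise<number[]> {
  const output = await embedder(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}

// MemoryRecord and searchMemories come from the memory-search sketch above.
async function buildSystemPrompt(transcription: string, memories: MemoryRecord[]) {
  const queryEmbedding = await embed(transcription);
  // Top 5 memories, filtered at the 0.28 similarity threshold.
  const relevant = searchMemories(queryEmbedding, memories);
  const memoryBlock = relevant
    .map((r) => `- ${r.memory.role}: ${r.memory.text}`)
    .join("\n");
  return `You are a warm, curious voice companion. (placeholder persona)\n\nRelevant memories:\n${memoryBlock}`;
}
```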
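The Ultravox step roughly follows the usual Transformers.js multimodal pattern: build a chat prompt containing the `<|audio|>` placeholder, run it through the processor together with the raw audio, then stream tokens from `generate`. Treat the class names (`UltravoxProcessor`, `UltravoxModel`) and the options shown here as assumptions about the library version in use rather than a verified recipe:

```ts
import {
  UltravoxProcessor,
  UltravoxModel,
  TextStreamer,
} from "@huggingface/transformers";

const MODEL_ID = "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX";
const processor = await UltravoxProcessor.from_pretrained(MODEL_ID);
const model = await UltravoxModel.from_pretrained(MODEL_ID);

async function respond(
  systemPrompt: string,
  audio: Float32Array,
  onToken: (t: string) => void,
) {
  const messages = [
    { role: "system", content: systemPrompt },
    // The <|audio|> placeholder is swapped for audio embeddings inside the processor.
    { role: "user", content: "<|audio|>" },
  ];
  const text = processor.tokenizer.apply_chat_template(messages, {
    add_generation_prompt: true,
    tokenize: false,
  });

  const inputs = await processor(text, audio);
  const streamer = new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    callback_function: onToken, // each decoded chunk is forwarded to the TTS stage
  });
  await model.generate({ ...inputs, max_new_tokens: 256, streamer });
}
```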
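For the TTS stage, one way to wire Kokoro up in the browser is via the kokoro-js package and the Web Audio API. The package name, ONNX model id, voice, and chunking here are assumptions, not necessarily what the project ships:

```ts
import { KokoroTTS } from "kokoro-js";

// Assumed ONNX distribution of the hexgrad/Kokoro-82M weights.
const tts = await KokoroTTS.from_pretrained("onnx-community/Kokoro-82M-v1.0-ONNX", {
  dtype: "q8",
});

// Synthesize one sentence-sized chunk of the LLM's streamed output and play it.
async function speak(chunk: string, ctx: AudioContext) {
  const result = await tts.generate(chunk, { voice: "af_heart" });
  const buffer = ctx.createBuffer(1, result.audio.length, result.sampling_rate);
  buffer.copyToChannel(result.audio as Float32Array, 0);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}
```

Chunking on sentence boundaries is what lets playback start before the full response has finished generating.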
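And the memory-storage branch, in sketch form. The 15-word cutoff and 64-token summary budget come straight from the step above; `summarize` and `save` stand in for the project's actual LLM summarization prompt and IndexedDB wrapper, and `embed`/`MemoryRecord` are the helpers from the earlier sketches:

```ts
async function storeUserMemory(
  text: string,
  summarize: (t: string, maxTokens: number) => Promise<string>,
  save: (record: MemoryRecord) => Promise<void>,
) {
  // Short utterances are stored verbatim; longer ones are compressed by the LLM first.
  const wordCount = text.trim().split(/\s+/).length;
  const toStore = wordCount < 15 ? text : await summarize(text, 64);

  const embedding = await embed(toStore); // same all-MiniLM-L6-v2 embedder used for retrieval
  await save({ role: "user", text: toStore, embedding });
}
```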
- Clone the repository:
git clone https://github.com/callbacked/os1
cd os1
- Install dependencies:
npm i
- Run the development server:
npm run dev
- Open your browser and go to http://localhost:5173.
The first time you load the application, the necessary AI models (~2GB) will be downloaded and cached by your browser. This might take a few moments. Subsequent loads should be faster.
- Ensure you have a modern browser that supports WebGPU, such as any Chromium-based browser.
- Performance depends heavily on your machine's hardware.
This project is essentially an amalgamation of the hard work put in by people orders of magnitude smarter than me; I would like to thank them, in no particular order:
- Siyoung Park: For the original MIT-licensed OS1 loading animation concept on CodePen.
- ONNX Community: For providing the ONNX versions of all of the models used here.
- Xenova: For their work on Transformers.js, and for the reference LLM worker code in transformers.js-examples.
- Spike Jonze and the creators of the movie 'Her': For the original inspiration. A very bleak movie to watch 12 years later lmao.