GitHub - cosmaadrian/strawberry-problem: Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"

cosmaadrian/strawberry-problem


The Strawberry Problem 🍓
Emergence of Character-level Understanding in Tokenized Language Models

📘 Abstract

Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
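To make the tokenization limitation concrete, here is a minimal sketch of a letter-counting task next to what a subword model receives as input. It is not taken from the paper's task suite: the `TOY_VOCAB` vocabulary, the `greedy_tokenize` helper, and the token ids are all assumptions made for illustration.

```python
def count_char(word, char):
    """Ground-truth answer for a letter-counting task."""
    return word.count(char)


def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation over a toy vocabulary.

    Falls back to single characters when no vocabulary piece matches.
    This is a hypothetical stand-in for a real subword tokenizer.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Assumed toy vocabulary mapping subword pieces to integer ids.
TOY_VOCAB = {"straw": 0, "berry": 1}

word = "strawberry"
tokens = greedy_tokenize(word, TOY_VOCAB)
token_ids = [TOY_VOCAB[t] for t in tokens]

print("characters:", list(word))
print("tokens:    ", tokens)
print("token ids: ", token_ids)
print("count of 'r':", count_char(word, "r"))
```

Under these assumptions the model input is two opaque ids, while the correct answer depends on characters spread across both tokens. The ids themselves carry no explicit information about which letters each token contains.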

⚒️ Usage

TBD

📖 Citation

If you found our work useful, please cite our paper:

@misc{cosma2025strawberry,
      title={The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models}, 
      author={Adrian Cosma and Stefan Ruseti and Emilian Radoi and Mihai Dascalu},
      year={2025},
      eprint={2505.14172},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14172}, 
}

📝 License

This work is licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
