This repository implements an Extended TextTiling algorithm for semantic topic segmentation within text documents. The method utilizes cosine similarity over LLM (Large Language Model) embeddings of sentence windows to predict semantic or topic shifts. The approach leverages SentenceTransformer and OpenAI models to perform sentence-level sequence labelling on text segmentation benchmarks, including Wiki-727K and WikiSection.
A deployed demo of this repository is available on Hugging Face Spaces.
- LLM Embeddings: Uses OpenAI models or SentenceTransformer models for sentence embedding.
- Cosine Similarity: Computes an aggregated semantic similarity between the sentence windows to the left and right of each candidate boundary to detect topic shifts.
- Sentence Pooling: Offers multiple pooling strategies (mean, max, min) for cosine similarity computation.
- Hyperparameter Tuning: Automatically tunes the window size, pooling method, and boundary threshold using a validation dataset.
- Support for Multiple Datasets: Tested on Wiki-727K and WikiSection (en_city and en_disease) datasets.
- SentenceTransformer models:
  - `all-mpnet-base-v2`
  - `all-MiniLM-L6-v2`
- OpenAI embedding models:
  - `text-embedding-ada-002`
  - `text-embedding-3-small`
  - `text-embedding-3-large`
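For orientation, here is a minimal sketch (not the repository's exact code) of how each model family can embed a list of sentences; the repository may add batching, caching, or normalization on top of this:

```python
from sentence_transformers import SentenceTransformer
from openai import OpenAI

sentences = ["Paris is the capital of France.", "The Seine flows through the city."]

# SentenceTransformer models: encode() returns an (n_sentences, dim) array.
st_model = SentenceTransformer("all-mpnet-base-v2")
st_embeddings = st_model.encode(sentences)

# OpenAI models: the embeddings endpoint accepts a batch of inputs.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(model="text-embedding-3-small", input=sentences)
oa_embeddings = [item.embedding for item in response.data]
```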
- `wiki727k`: Wiki-727K is a large dataset for text segmentation, automatically extracted and labeled from Wikipedia (already tailored for sentence-level sequence labelling: Hugging Face).
- `en_city`, `en_disease`: The WikiSection dataset is a collection of segmented Wikipedia articles related to cities and diseases (already tailored for sentence-level sequence labelling: Hugging Face).
Any custom dataset can be easily integrated as long as it follows the schema below:
- `id`: string - A unique identifier for each document.
- `ids`: list[string] - The sentence ids within the document.
- `sentences`: list[string] - The sentences within the document.
- `titles_mask`: optional[list[uint8]] - A binary mask indicating which sentences are titles.
- `labels`: optional[list[int]] - Binary labels for each sentence, where 0 represents "semantic-continuity" and 1 represents "semantic-shift".
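For illustration, a hypothetical document matching this schema could look like the following Python dict (all values are made up):

```python
# A hypothetical custom document following the schema above (illustrative values only).
document = {
    "id": "doc-0001",
    "ids": ["doc-0001_s0", "doc-0001_s1", "doc-0001_s2"],
    "sentences": [
        "Berlin is the capital of Germany.",
        "It has a population of about 3.7 million.",
        "Measles is a highly contagious disease.",
    ],
    "titles_mask": [0, 0, 0],  # no sentence is a title
    "labels": [0, 0, 1],       # the third sentence starts a new topic
}
```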
- Clone this repository:

  ```bash
  git clone https://github.com/saeedabc/llm-text-tiling.git
  cd llm-text-tiling
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set your OpenAI API key as an environment variable (if using OpenAI models and you don't want to be prompted at runtime):

  ```bash
  export OPENAI_API_KEY="your-openai-api-key"
  ```
Run the script with the following arguments to segment text from one of the supported datasets:
```bash
python llm_tet.py --model_name [MODEL_NAME] \
    --data_name [DATASET_NAME] --max_samples [N] --seed [SEED] \
    --cache_dir [CACHE_DIR]
```
- `--model_name`: The name of the model to use for embedding (`text-embedding-ada-002`, `all-mpnet-base-v2`, etc.).
- `--data_name`: The dataset to use (`wiki727k`, `en_city`, `en_disease`).
- `--keep_titles`: Option to keep titles within the dataset.
- `--max_samples`: Limit the number of samples to process for faster testing.
- `--seed`: Set a random seed for reproducibility.
- `--cache_dir`: Directory to cache embeddings for reuse.
- Inference on `en_city` with `all-mpnet-base-v2`:

  ```bash
  python llm_tet.py --model_name all-mpnet-base-v2 --data_name en_city --max_samples 100 --seed 42 --cache_dir ./cache
  ```

- Inference on `en_city` with `text-embedding-ada-002`:

  ```bash
  python llm_tet.py --model_name text-embedding-ada-002 --data_name en_city --max_samples 100 --seed 42 --cache_dir ./cache
  ```
- Embedding Sentences: Sentences in each document are embedded using the selected model.
- Cosine Similarity Calculation: Cosine similarities between adjacent sentence windows are computed using different pooling strategies.
- Boundary Prediction: A threshold is applied to these cosine similarities to predict topic shift boundaries.
- Evaluation: The predicted boundaries are evaluated using precision, recall, and F1 score against ground truth labels.
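Putting these steps together, here is a compact sketch of the boundary-prediction core under one plausible reading of the method: pairwise cosine similarities between the left and right windows of each candidate sentence are pooled and compared against a threshold. The actual implementation in `llm_tet.py` may differ in details (e.g., pooling embeddings before a single cosine, title handling, edge cases at document boundaries):

```python
import numpy as np

def predict_boundaries(embeddings: np.ndarray, k: int, pool: str, threshold: float) -> list[int]:
    """Label each sentence 1 (topic shift) or 0 (continuity). Sketch only."""
    # Unit-normalize so dot products are cosine similarities.
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    pool_fn = {"mean": np.mean, "max": np.max, "min": np.min}[pool]

    labels = [0] * len(embs)
    for i in range(1, len(embs)):
        left = embs[max(0, i - k): i]   # window before the candidate boundary
        right = embs[i: i + k]          # window starting at the candidate
        sims = left @ right.T           # pairwise cosine similarities across windows
        if pool_fn(sims) < threshold:   # low aggregated similarity => topic shift
            labels[i] = 1
    return labels
```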
The script tunes three hyperparameters:
- Window size (`k`): The number of sentences to include in each window.
- Pooling strategy (`mean`, `max`, `min`): How to combine the cosine similarities of sentence embeddings within a window.
- Cosine similarity threshold: The cutoff to determine whether a topic shift has occurred.
Tuning is performed on the validation set, and the best hyperparameters are applied to the test set.
After tuning, the script reports precision, recall, and F1 score on the test set, along with the best hyperparameter configuration selected on the validation set.
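A sketch of what such a tuning loop might look like, assuming the `predict_boundaries` helper from the sketch above and a hypothetical `validation_docs` list of documents with precomputed embeddings and gold labels (the grid values are illustrative, not the script's actual search space):

```python
from itertools import product
from sklearn.metrics import f1_score

# Illustrative grid; the repository's actual search space may differ.
grid = product([2, 3, 5], ["mean", "max", "min"], [0.1 * t for t in range(1, 10)])

best_f1, best_hp = -1.0, None
for k, pool, threshold in grid:
    preds, golds = [], []
    for doc in validation_docs:  # hypothetical: each doc holds embeddings + gold labels
        preds += predict_boundaries(doc["embeddings"], k, pool, threshold)
        golds += doc["labels"]
    f1 = f1_score(golds, preds)
    if f1 > best_f1:
        best_f1, best_hp = f1, {"k": k, "pool": pool, "threshold": threshold}

print(f"Validation F1={best_f1:.4f} | Tuned hp: {best_hp}")
```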
- Inference on `en_city` with `all-mpnet-base-v2`:

  ```
  ...
  Validation score: F1=36.94, Prec=31.87, Recall=43.94 | Tuned hp: {'k': 5, 'pool': 'mean', 'threshold': 0.19}
  ...
  Test score: F1=34.70, Prec=29.93, Recall=41.28
  ```

- Inference on `en_city` with `text-embedding-ada-002`:

  ```
  ...
  Validation score: F1=39.97, Prec=35.32, Recall=46.02 | Tuned hp: {'k': 3, 'pool': 'mean', 'threshold': 0.775}
  ...
  Test score: F1=40.80, Prec=35.74, Recall=47.52
  ```
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions and pull requests are welcome.