Few-shot learning for joint Named Entity Recognition (NER) and Topic Modeling.
FewTopNER is a cross-lingual model that combines NER and topic modeling in a few-shot learning setting. It uses the WikiNEuRal dataset for NER and Wikipedia data for topic modeling, and covers five languages:
- English
- French
- German
- Spanish
- Italian
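For readers new to few-shot learning, the idea of an N-way K-shot episode can be sketched as follows. This is a generic episodic sampler using only the standard library, not the project's `episode_builder.py`; the function name `sample_episode` and its parameters are assumptions made for illustration. It also assumes one label per example, whereas NER labels are token-level, so the real episode construction is likely more involved.

```python
import random
from collections import defaultdict


def sample_episode(examples, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode from (item, label) pairs.

    Hypothetical helper: returns (support, query), two lists of
    (item, label) pairs that share the same n_way classes.
    """
    rng = rng or random.Random()

    # Group the items by label.
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append(item)

    # A class qualifies only if it can fill both its support and query slots.
    eligible = [lab for lab, items in by_label.items()
                if len(items) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    # Sorting makes the draw reproducible for a seeded rng.
    classes = rng.sample(sorted(eligible), n_way)

    support, query = [], []
    for lab in classes:
        picked = rng.sample(by_label[lab], k_shot + n_query)
        support += [(x, lab) for x in picked[:k_shot]]
        query += [(x, lab) for x in picked[k_shot:]]
    return support, query
```

The support set is what the model adapts on and the query set is what it is scored on, so the two are drawn without overlap.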
git clone https://github.com/ibrahimself/fewtopner.git
cd fewtopner
pip install -r requirements.txt
Download the necessary spaCy models for each language:
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
python -m spacy download de_core_news_sm
python -m spacy download es_core_news_sm
python -m spacy download it_core_news_sm
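spaCy installs each downloaded pipeline as a regular Python package, so a quick check that all five are present can be done without loading them. The sketch below is an optional helper, not part of the repository; the names `SPACY_MODELS` and `missing_models` are assumptions.

```python
from importlib.util import find_spec

# The five pipelines listed in the download commands.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "fr": "fr_core_news_sm",
    "de": "de_core_news_sm",
    "es": "es_core_news_sm",
    "it": "it_core_news_sm",
}


def missing_models(models=SPACY_MODELS):
    """Return the model package names that are not importable."""
    return [name for name in models.values() if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All models installed.")
```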
Create a directory for the WikiNEuRal NER dataset and download it:
mkdir -p data/ner/wikineural
# Download from https://github.com/Babelscape/wikineural
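To inspect the downloaded files, a small CoNLL-style reader is often enough. The sketch below assumes tab-separated lines with the token and its tag in the last two columns and a blank line between sentences; verify this against the files you download. The function name `read_conll` is hypothetical and this is not the project's `ner_processor.py`.

```python
def read_conll(lines):
    """Yield (tokens, tags) for each sentence in CoNLL-style lines.

    Assumes tab-separated columns where the last two are token and tag,
    and that sentences are separated by blank lines.
    """
    tokens, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            raise ValueError(f"expected at least 2 columns, got: {line!r}")
        tokens.append(cols[-2])
        tags.append(cols[-1])
    # Emit the final sentence if the input does not end with a blank line.
    if tokens:
        yield tokens, tags
```

An open file handle can be passed directly as `lines`, since iterating a text file yields one line at a time.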
Create a directory for the topic-modeling data and download the Wikipedia dump for each language:
mkdir -p data/topic/wikipedia
# Refer to https://dumps.wikimedia.org/ for instructions
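As a starting point, the sketch below lists one candidate dump URL per language together with a target path under `data/topic/wikipedia`. It assumes the `latest/<lang>wiki-latest-pages-articles.xml.bz2` naming used on dumps.wikimedia.org, and it does not download anything, since the dumps are very large. Which dump file and directory layout the topic processor expects is not stated here, so treat both as assumptions; `dump_url` and `download_plan` are hypothetical names.

```python
from pathlib import Path

LANGUAGES = ["en", "fr", "de", "es", "it"]
DUMP_DIR = Path("data/topic/wikipedia")


def dump_filename(lang):
    """File name of the latest full-article dump for a language (assumed convention)."""
    return f"{lang}wiki-latest-pages-articles.xml.bz2"


def dump_url(lang):
    """URL of the latest full-article dump for a language (assumed convention)."""
    return f"https://dumps.wikimedia.org/{lang}wiki/latest/{dump_filename(lang)}"


def download_plan(languages=LANGUAGES, dump_dir=DUMP_DIR):
    """Return (url, target_path) pairs; nothing is downloaded here."""
    return [(dump_url(lang), dump_dir / dump_filename(lang)) for lang in languages]


if __name__ == "__main__":
    for url, target in download_plan():
        print(f"{url} -> {target}")
```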
Copy and customize the configuration file:
cp configs/fewtopner_config.yaml configs/my_experiment.yaml
# Modify `my_experiment.yaml` as needed
Run the training script:
python main.py --config configs/my_experiment.yaml
Evaluate the model using a saved checkpoint:
python main.py --config configs/my_experiment.yaml --checkpoint outputs/checkpoints/best_model.pt
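The two commands above differ only in the `--checkpoint` flag. A minimal sketch of how an entry point could handle that is shown below; it is an assumption about the command-line interface, not the actual contents of `main.py`, and the helper names `parse_args` and `select_mode` are hypothetical. In particular, the rule that passing `--checkpoint` switches to evaluation is inferred from the commands, not confirmed.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags used in the commands above."""
    parser = argparse.ArgumentParser(description="FewTopNER training / evaluation")
    parser.add_argument("--config", required=True,
                        help="Path to a YAML configuration file")
    parser.add_argument("--checkpoint", default=None,
                        help="Saved checkpoint to evaluate; omit to train")
    return parser.parse_args(argv)


def select_mode(args):
    """Assumed rule: a checkpoint means evaluation, otherwise training."""
    return "evaluate" if args.checkpoint else "train"


if __name__ == "__main__":
    args = parse_args()
    print(f"mode={select_mode(args)} config={args.config}")
```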
fewtopner/
├── configs/                      # Configuration files
│   └── fewtopner_config.yaml
├── data/                         # Data directories
│   ├── ner/
│   │   └── wikineural/
│   └── topic/
│       └── wikipedia/
├── src/                          # Source code
│   ├── data/                     # Data processing modules
│   │   ├── preprocessing/
│   │   │   ├── ner_processor.py
│   │   │   └── topic_processor.py
│   │   ├── dataset.py
│   │   └── dataloader.py
│   ├── model/                    # Model architecture components
│   │   ├── shared_encoder.py
│   │   ├── entity_branch.py
│   │   ├── topic_branch.py
│   │   ├── bridge.py
│   │   └── fewtopner.py
│   ├── training/                 # Training-related scripts
│   │   ├── trainer.py
│   │   ├── episode_builder.py
│   │   └── metrics.py
│   └── utils/                    # Utility scripts
│       ├── config.py
│       └── multilingual.py
├── main.py                       # Entry point for training/evaluation
└── requirements.txt              # Python dependencies
Alternatively, run the project in Docker using the following Dockerfile:
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/*
# Copy project files
COPY . .
# Install Python dependencies
RUN pip install -r requirements.txt
# Download spaCy models
RUN python -m spacy download en_core_web_sm \
    && python -m spacy download fr_core_news_sm \
    && python -m spacy download de_core_news_sm \
    && python -m spacy download es_core_news_sm \
    && python -m spacy download it_core_news_sm
# Set default command
CMD ["python", "main.py", "--config", "configs/fewtopner_config.yaml"]
Build the Docker image and run the container:
docker build -t fewtopner .
docker run -it --rm fewtopner
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you use FewTopNER in your research, please cite it as follows:
@article{fewtopner2024,
  title={FewTopNER: Few-shot Learning for Joint Named Entity Recognition and Topic Modeling},
  author={Your Name},
  journal={arXiv preprint arXiv:2024.xxxxx},
  year={2024}
}