Code and Data for Language Modeling with Editable External Knowledge.
To set up your environment, run:
conda create -n mem_rewrite python=3.11
conda activate mem_rewrite
# get pytorch with a version of CUDA compatible with your machine
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
# install other requirements
bash setup.sh
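After setup, you can confirm that the CUDA-enabled PyTorch build installed correctly with a quick check (a generic snippet, not a script from this repo):

# Sanity check that PyTorch sees your GPU.
import torch

print(torch.__version__)
print(torch.cuda.is_available())        # expect True on a CUDA machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))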
To run Mixtral, set your TogetherAI API token:
export TOGETHER_API_KEY=<TOGETHER_API_KEY>
Set your OpenAI API token if you wish to run GPT* models:
export OPENAI_API_KEY=<OPENAI_API_KEY>
To run ERASE on the CLARK-News dataset, use:
python lm_eval.py \
--dataset news \
--datapath CLARK_news/ \
(--model_name [mistralai/Mixtral-8x7B-Instruct-v0.1|meta-llama/Llama-3-8b-chat-hf]) \
(--local_model_path <local_model_fp> \)
--context_length [2048|4096] \
--save_as_facts \
--retrieve_facts similarity \
(--overwrite_facts similarity --edit_hops 1)
- --model_name sets the model name for querying the TogetherAI API (for open-source models) or the OpenAI API (for GPT* models). If this flag is set, the respective API is queried for model inference; otherwise, a local model is queried.
- --local_model_path sets the filepath to a local copy of a Huggingface Instruct model. One of --model_name or --local_model_path must be set.
- --context_length sets the context window of the model.
- --save_as_facts toggles saving entries to the KB as facts (rather than as passages).
- --retrieve_facts sets how KB entries are retrieved. Set it to similarity for dense retrieval. To turn off retrieval, omit this flag.
- --overwrite_facts toggles updating existing KB entries according to new documents. Set it to similarity to use dense retrieval to select the facts to update. To turn off updating behavior, omit this flag.
- --edit_hops sets how many "hops" of retrieval to perform when updating existing entries. For each edit_hops > 1, the retriever performs another round of retrieval based on similarity to the facts retrieved in the previous round. Set to 1 by default. (A sketch of this multi-hop retrieval follows the list.)
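For intuition, here is a minimal, self-contained sketch of the multi-hop retrieval these flags describe. It is illustrative only: TF-IDF similarity stands in for the dense encoder, and the KB contents are invented, so it does not mirror the repo's actual implementation.

# Toy sketch of --retrieve_facts similarity with --edit_hops 2.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

kb_facts = [
    "Boris Johnson is the prime minister of the UK.",
    "The UK prime minister lives at 10 Downing Street.",
    "Paris is the capital of France.",
]
new_doc = "Liz Truss became the UK prime minister in September 2022."

vec = TfidfVectorizer().fit(kb_facts + [new_doc])
fact_embs = vec.transform(kb_facts)

def retrieve(queries, k=2):
    # Rank KB facts by cosine similarity to any of the query strings.
    sims = cosine_similarity(vec.transform(queries), fact_embs).max(axis=0)
    return list(np.argsort(-sims)[:k])

# Hop 1: facts similar to the incoming document.
retrieved = retrieve([new_doc])
# Hop 2 (edit_hops > 1): facts similar to the facts retrieved in hop 1.
hop2 = retrieve([kb_facts[i] for i in retrieved])
retrieved += [i for i in hop2 if i not in retrieved]
print([kb_facts[i] for i in retrieved])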
The CLARK-News dataset is available under CLARK_news/.
If you want to collect more data, you may run our data collection process:
- Get Wikidata triples that change over time
python script/get_wikidata_triples.py --data_dir <output_dir>
This saves the final triples to <output_dir>/property_to_results.csv.
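As an illustration of where such triples come from, time-qualified statements can be pulled from the Wikidata SPARQL endpoint. The property choice below (P35, head of state) is an example only; get_wikidata_triples.py may query differently:

# Minimal example of querying Wikidata for triples with a
# start-time qualifier (P580).
import requests

SPARQL = """
SELECT ?countryLabel ?headLabel ?start WHERE {
  ?country p:P35 ?stmt .          # P35 = head of state
  ?stmt ps:P35 ?head ;
        pq:P580 ?start .          # P580 = start time
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""
r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "clark-news-example/0.1"},  # WDQS requires a UA
    timeout=30,
)
for row in r.json()["results"]["bindings"]:
    print(row["countryLabel"]["value"],
          row["headLabel"]["value"],
          row["start"]["value"])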
- Get candidate sources for each fact from Google:
python script/extract_queries.py \
--source_csv <csv_of_wikidata_triples> \
--target_csv <csv_with_candidate_sources>
where csv_of_wikidata_triples is the filepath to the CSV file from step 1. This populates csv_with_candidate_sources with a list of candidate sources from Google.
- Get human-validated annotations (launch annotation interface):
python AnnotationInterface/webserver.py \
--source_file <csv_with_candidate_sources> \
--target_file <csv_with_human_validated_sources> \
--download_date <download_date>
where csv_with_candidate_sources is the filepath to the CSV file from step 2. This populates csv_with_human_validated_sources with human annotations. download_date is the date that step 2 was run, in YYYY-MM-DD format; it is needed to infer the origin date of articles mined from Google.
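A hypothetical example of why the download date matters: Google often reports only relative dates such as "3 days ago", which can be resolved to absolute dates only relative to when the search was run:

# Hypothetical illustration (not code from the repo): resolving a
# relative date mined from Google into an absolute origin date.
from datetime import datetime, timedelta

download_date = datetime.strptime("2024-06-01", "%Y-%m-%d")
origin_date = download_date - timedelta(days=3)   # "3 days ago"
print(origin_date.strftime("%Y-%m-%d"))           # 2024-05-29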
- Pull text of sources from links:
python script/pull_external_sources.py \
--edits_file <csv_with_human_validated_sources> \
--output_dir <output_dir_of_sources>
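Conceptually, this step fetches each URL and extracts the article text. A rough sketch of the idea (the actual extraction logic in pull_external_sources.py may differ):

# Rough sketch of pulling article text from a URL.
import requests
from bs4 import BeautifulSoup

def pull_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Keep paragraph text only; drops nav, scripts, and boilerplate.
    return "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

print(pull_text("https://en.wikipedia.org/wiki/Wikidata")[:500])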
- Automated validation of round 1 annotations:
python script/check_annotations.py # display annotations in annotations.html
- Second round of human annotation to validate round 1 (launch checking interface):
python CheckInterface/webserver.py
- Make questions from Wikidata relations:
python script/generate_wikidata_questions.py \
--wikidata_csv <csv_with_human_validated_sources> \
--output_dir <qs_output_dir>
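As a rough illustration of this step, questions can be templated from (subject, relation, object) triples. The templates below are hypothetical, not the ones used by generate_wikidata_questions.py:

# Hypothetical templating of questions from Wikidata triples.
TEMPLATES = {
    "P35": "Who is the head of state of {subject}?",
    "P6":  "Who is the head of government of {subject}?",
}

def make_question(subject: str, relation: str, obj: str):
    template = TEMPLATES.get(relation)
    if template is None:
        return None  # no template for this relation
    return {"question": template.format(subject=subject), "answer": obj}

print(make_question("France", "P35", "Emmanuel Macron"))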
Coming soon.
To cite this work, you may use the following BibTeX entry:
@misc{li2024language,
title={Language Modeling with Editable External Knowledge},
author={Belinda Z. Li and Emmy Liu and Alexis Ross and Abbas Zeitoun and Graham Neubig and Jacob Andreas},
year={2024},
eprint={2406.11830},
archivePrefix={arXiv},
}