This is the official repo for our WSDM'22 paper, Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval (Best Paper Award).
**************************** Updates ****************************
- 2022-02-19: We released training code for MS MARCO passage and document ranking.
- 2021-11-13: We released code to evaluate zero-shot retrieval performance of RepCONC and used BEIR benchmark as an example.
- 2021-11-05: We released ranking results for queries from MS MARCO development set and TREC 2019 Deep Learning Track.
- 2021-11-03: We released code for encoding corpus and IVF acceleration.
- 2021-11-02: We released our model checkpoints and retrieval code.
- 2021-10-13: Our paper has been accepted by WSDM! Please check out the preprint paper.
- Quick Tour
- Ranking Results
- Requirements
- Preprocess Data
- Evaluate Open-sourced Checkpoints
- Train RepCONC
- Citation
- Related Work
In this work, we propose RepCONC, which models the quantization process as CONstrained Clustering and trains the dual-encoders and the quantization method end-to-end. Constrained clustering involves a clustering loss and a uniform clustering constraint. The clustering loss requires the embeddings to stay close to the quantization centroids to support end-to-end optimization, while the constraint forces the embeddings to be uniformly clustered across all centroids to maximize distinguishability. The training process and the clustering constraint are visualized as follows:
(Figures: training process | constrained clustering)
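For intuition, the uniform clustering constraint can be seen as balancing how many embeddings each centroid receives. Below is a minimal, hedged sketch (not the repo's implementation; the function name, the Sinkhorn-style normalization, and the epsilon value are assumptions that loosely mirror the `--sk_epsilon` and `--mse_weight` options used in the training commands later):

```python
import torch
import torch.nn.functional as F

def constrained_cluster_sketch(embeddings, centroids, epsilon=0.05, n_iters=3):
    """Illustrative sketch only: softly assign embeddings to centroids while
    encouraging every centroid to receive a similar number of embeddings
    (Sinkhorn-style balancing), then compute a clustering (MSE) loss."""
    sim = F.normalize(embeddings, dim=-1) @ F.normalize(centroids, dim=-1).t()
    q = torch.exp(sim / epsilon)                   # (batch, K) soft assignment scores
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)         # balance load across centroids
        q = q / q.sum(dim=1, keepdim=True)         # each embedding's weights sum to one
    assignment = q.argmax(dim=1)                   # discrete Index Assignment
    clustering_loss = F.mse_loss(embeddings, centroids[assignment])
    return assignment, clustering_loss

# toy usage: 128 sub-vectors of dimension 16, K=256 centroids
assignment, loss = constrained_cluster_sketch(torch.randn(128, 16), torch.randn(256, 16))
```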
RepCONC achieves huge compression ratios ranging from 64x to 768x. It supports fast embedding search thanks to the adoption of IVF (inverted file system). With these designs, it outperforms a wide range of first-stage retrieval methods in terms of effectiveness, memory efficiency, and time efficiency. RepCONC also substantially boosts the second-stage ranking performance, as shown below:
We provide the ranking results of RepCONC via the following two links: passage rank and document rank.
The `dev` folder and the `test` folder correspond to MS MARCO development queries and TREC 2019 DL queries, respectively. In either folder, for each `m` value, we provide two ranking files corresponding to different text-id mappings. The one prefixed with 'official' uses the official MS MARCO / TREC 2019 text-id mapping, so you can directly use the official qrel files to evaluate the ranking. The other one uses the mapping generated by our preprocessing, where line offsets are used as ids. Both files give the same metric numbers. The files are generated by run_retrieve.sh. Please see the evaluation section for how to reproduce these ranking results.
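If you need to convert an offset-id ranking file back to official ids yourself, here is a minimal sketch. It assumes `pid2offset.pickle` maps official pids to row offsets (consistent with the Preprocess section); the file names are placeholders, and queries can be handled analogously with `qid2offset.pickle`.

```python
import pickle

# Hedged sketch: convert offset pids in a ranking file back to official pids.
# Assumes pid2offset.pickle stores {official_pid: row_offset}.
with open("./data/passage/preprocess/pid2offset.pickle", "rb") as f:
    pid2offset = pickle.load(f)
offset2pid = {offset: pid for pid, offset in pid2offset.items()}

with open("m32.dev.rank.tsv") as fin, open("m32.dev.official.rank.tsv", "w") as fout:
    for line in fin:
        qid, pid, rank = line.strip().split("\t")
        fout.write(f"{qid}\t{offset2pid[int(pid)]}\t{rank}\n")
```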
This repo needs the following libraries (Python 3.x):
torch == 1.9.0
transformers == 4.3.3
faiss-gpu == 1.7.1
boto3
We reuse many scripts from the JPQ library. Install it with
pip install git+https://github.com/jingtaozhan/JPQ
Here are the commands for preprocessing/tokenization.
If you do not have the MS MARCO dataset, run the following command:
bash download_data.sh
Preprocessing (tokenizing) only requires a simple command:
python -m jpq.preprocess --data_type 0; python -m jpq.preprocess --data_type 1
It will create two directories, i.e., `./data/passage/preprocess` and `./data/doc/preprocess`. We map the original qids/pids to new ids, namely their row numbers in the files. The mappings are saved to `pid2offset.pickle` and `qid2offset.pickle`, and new qrel files (`train/dev/test-qrel.tsv`) are generated. The passages and queries are tokenized and saved in numpy memmap files.
You can download the query encoders and indexes from our dropbox link. After opening this link in your browser, you will see two folders, `doc` and `passage`, which correspond to MS MARCO document ranking and passage ranking, respectively. Each of them contains four folders:
- Encoders:
- Indexes (note: the `pid` in the index is actually the row number of a passage in the `collection.tsv` file instead of the official pid provided by MS MARCO):
  - `official_pq_index`: PQ indexes.
  - `official_ivf_index`: IVF-accelerated PQ indexes. The number of inverted lists is set to 5000.
Different query encoders and indexes correspond to different compression ratios. For example, the query encoder named `m32.marcopass.query.encoder.tar.gz` uses 32 bytes per passage, i.e., a 768*4/32 = 96x compression ratio.
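For reference, the ratio simply compares the uncompressed 768-dimensional float32 embedding (768 * 4 bytes) with the number of bytes per quantized passage/document; the M values below are just examples of the arithmetic:

```python
# Compression ratio of M-byte PQ codes relative to 768-dim float32 embeddings.
def compression_ratio(bytes_per_doc, dim=768, bytes_per_float=4):
    return dim * bytes_per_float / bytes_per_doc

for m in (48, 32, 16, 4):
    print(f"{m} bytes/doc -> {compression_ratio(m):.0f}x")
# 48 -> 64x, 32 -> 96x, 16 -> 192x, 4 -> 768x
```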
We provide several scripts to help you download these data.
sh ./cmds/download_query_encoder.sh
sh ./cmds/download_doc_encoder.sh
sh ./cmds/download_index.sh
In this section, we provide commands for encoding the corpus into compact indexes with our provided encoders. Note that you can skip this section and download the open-sourced indexes by running (only once):
sh ./cmds/download_index.sh
To encode the corpus:
First, you need to preprocess the dataset.
Second, please download the open-sourced query and document encoders. Here are two scripts to help you download them.
sh ./cmds/download_query_encoder.sh
sh ./cmds/download_doc_encoder.sh
Finally, run run_encode.py to encode the corpus. You can refer to the example commands in cmds/run_encode_corpus.sh. Arguments for the run_encode.py script are as follows,
- `--preprocess_dir`: preprocess dir.
  - `./data/passage/preprocess`: default dir for passage preprocessing.
  - `./data/doc/preprocess`: default dir for document preprocessing.
- `--doc_encoder_dir`: The unified query/document encoder trained in the first-stage training process. The script uses it to generate Index Assignments for all passages/documents.
- `--query_encoder_dir`: The query encoder trained in the second-stage training process. The script uses it to set the centroid embeddings. If it is not provided, the centroid embeddings are set according to the `--doc_encoder_dir` model.
- `--output_path`: Output index path.
- `--max_doc_length`: Max passage/document length. Set it to 256 for passages and 512 for documents, respectively.
- `--batch_size`: Encoding batch size.
In this section, we provide commands for using IVF to accelerate search. The IVF index is built upon the PQ index output by run_encode.py. Note that you can skip this section and download the open-sourced indexes by running (only once):
sh ./cmds/download_index.sh
We provide an example command in run_build_ivf_index.sh. It builds an IVFPQ index for MS MARCO Passage Ranking task. It calls build_ivf_index.py. Arguments for this script are as follows,
- `--input_index_path`: The path of the index output by run_encode.py.
- `--output_index_path`: The output index path.
- `--nlist`: The number of inverted lists. A larger nlist improves accuracy at the cost of computation overhead.
- `--nprobe`: The number of searched lists during online retrieval. The ideal IVF speedup ratio equals nlist/nprobe (see the sketch after this list).
- `--threads`: The number of threads.
- `--by_residual`: Whether to combine IVF and PQ embeddings. Default: False.
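If you want to inspect a built index or tune `nprobe` directly with faiss outside the provided scripts, a minimal sketch is shown below; the index path is a placeholder:

```python
import faiss

index = faiss.read_index("path/to/ivf_pq.index")   # placeholder path
ivf = faiss.extract_index_ivf(index)               # unwrap pre-transforms such as OPQ
print("nlist =", ivf.nlist)
ivf.nprobe = 10                                    # probe 10 lists; ideal speedup ~ nlist / nprobe
```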
In this section, we provide commands to reproduce the retrieval results with our open-sourced indexes and query encoders. Since we use the TREC_EVAL toolkit for evaluation, please download and compile it:
sh ./cmds/download_trec_eval.sh
Run the following command to evaluate the retrieval results.
sh ./cmds/run_retrieval.sh
or run this command to evaluate the IVF-accelerated search results:
sh ./cmds/run_ivf_accelerate_retrieval.sh
Both of them will call run_retrieve.py to retrieve candidates. Arguments for this evaluation script are as follows,
- `--preprocess_dir`: preprocess dir.
  - `./data/passage/preprocess`: default dir for passage preprocessing.
  - `./data/doc/preprocess`: default dir for document preprocessing.
- `--mode`: Evaluation mode.
  - `dev`: run retrieval for MS MARCO development queries.
  - `test`: run retrieval for TREC 2019 DL Track queries.
  - `lead`: run retrieval for leaderboard queries.
- `--index_path`: Index path (can be either a PQ index or an IVF-accelerated index).
- `--query_encoder_dir`: Query encoder dir, which involves `config.json` and `pytorch_model.bin`.
- `--output_path`: Output ranking file path, formatted following the MS MARCO guideline (qid\tpid\trank); see the MRR sketch after this list.
- `--max_query_length`: Max query length, default: 32.
- `--batch_size`: Encoding and retrieval batch size at each iteration.
- `--topk`: Retrieve the top-k passages/documents.
- `--gpu_search`: Whether to use GPU for embedding search.
- `--nprobe`: How many inverted lists to probe. This value should lie in [1, number of inverted lists].
The above script requires that the queries are preprocessed (see Preprocess). Therefore, it is a bit troublesome if you just want to retrieve passages/docs for some new queries. Here we provide instructions on how to do this, taking the TREC 2020 queries as an example. We use tokenize_retrieve.py, which supports on-the-fly query tokenization (a tokenization sketch follows the argument list below). Please download the TREC 2020 queries:
sh ./cmds/download_trec20.sh
Run this shell script for retrieval and evaluation:
sh ./cmds/run_tokenize_retrieve.sh
It calls tokenize_retrieve.py. Arguments for this evaluation script are as follows,
- `--query_file_path`: Query file in TREC format.
- `--index_path`: Index path.
- `--query_encoder_dir`: Query encoder dir, which involves `config.json` and `pytorch_model.bin`.
- `--output_path`: Output ranking file path.
- `--pid2offset_path`: Used only for converting offset pids to official pids.
- `--dataset`: "doc" or "passage". It is used when converting offset pids to official pids because MS MARCO doc adds a 'D' prefix to each docid.
- `--max_query_length`: Max query length, default: 32.
- `--nprobe`: How many inverted lists to probe. This value should lie in [1, number of inverted lists].
- `--batch_size`: Encoding and retrieval batch size at each iteration.
- `--topk`: Retrieve the top-k passages/documents.
- `--gpu_search`: Whether to use GPU for embedding search.
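For reference, on-the-fly query tokenization boils down to running the HuggingFace tokenizer with the query length cap. A hedged sketch, assuming a RoBERTa-based query encoder (as in the JPQ lineage); tokenize_retrieve.py is the reference implementation:

```python
from transformers import RobertaTokenizer

# Assumption: the query encoder is RoBERTa-based, so roberta-base's tokenizer applies.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
queries = ["how long is life cycle of flea"]
batch = tokenizer(queries, max_length=32, truncation=True,
                  padding=True, return_tensors="pt")
# batch["input_ids"] and batch["attention_mask"] are fed to the query encoder,
# whose output embeddings are searched against the PQ/IVF index.
```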
This section shows how to use RepCONC for other datasets in a zero-shot fashion.
We use BEIR as an example because it involves a wide range of datasets. For your own dataset, you only need to format it in the same way as BEIR and you are good to go.
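Concretely, the layout expected by BEIR's GenericDataLoader is a `corpus.jsonl`, a `queries.jsonl`, and a `qrels/<split>.tsv`. A minimal toy example (ids and texts are placeholders):

```python
import csv, json, os

# Toy dataset in the BEIR layout: corpus.jsonl, queries.jsonl, qrels/test.tsv.
root = "./my_beir_dataset"
os.makedirs(os.path.join(root, "qrels"), exist_ok=True)

with open(os.path.join(root, "corpus.jsonl"), "w") as f:
    f.write(json.dumps({"_id": "doc1", "title": "Example", "text": "An example document."}) + "\n")

with open(os.path.join(root, "queries.jsonl"), "w") as f:
    f.write(json.dumps({"_id": "q1", "text": "example query"}) + "\n")

with open(os.path.join(root, "qrels", "test.tsv"), "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerow(["q1", "doc1", 1])
```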
Now, we take the TREC-COVID dataset as an example. Run
sh ./cmds/run_eval_beir.sh trec-covid
You can also replace trec-covid with other datasets, such as nq. The script calls jpq.eval_beir. Arguments are as follows,
- `--dataset`: Dataset name in BEIR.
- `--beir_data_root`: Where to save the BEIR dataset.
- `--query_encoder`: Path to the JPQ query encoder.
- `--doc_encoder`: Path to the JPQ document encoder.
- `--split`: test/dev/train.
- `--encode_batch_size`: Batch size, default: 64.
- `--output_index_path`: Optional; where to save the compact index. If the file already exists, it is loaded to save corpus-encoding time.
- `--output_ranking_path`: Optional; where to save the retrieval results.
Here are the NDCG@10 scores on several datasets with M=48, i.e., a 64x compression ratio:
Dataset | TREC-COVID | NFCorpus | NQ | HotpotQA | FiQA-2018 | ArguAna | Touche-2020 | Quora | DBPedia | SCIDOCS | FEVER | Climate-FEVER | SciFact |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RepCONC (64x Compression) | 0.684 | 0.266 | 0.440 | 0.425 | 0.273 | 0.420 | 0.210 | 0.850 | 0.293 | 0.120 | 0.637 | 0.205 | 0.509 |
RepCONC is initialized by STAR. STAR trained on passage ranking is available here. STAR trained on document ranking is available here.
First, use STAR to encode the corpus and run OPQ to initialize the index. For example, on the document ranking task with 48 sub-vectors per document, run:
dataset="doc" # or "passage"
if [ $dataset = "passage" ]; then max_doc_length=256 ; else max_doc_length=512 ; fi
M=48; python -m jpq.run_init \
--preprocess_dir ./data/$dataset/preprocess/ \
--model_dir ./data/$dataset/star \
--max_doc_length $max_doc_length \
--output_dir ./data/$dataset/init \
--subvector_num $M
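The resulting index name mirrors a faiss factory string: an OPQ rotation, a single inverted list (no IVF pruning at this stage), and M sub-vectors with 8-bit codes each. As an illustration only (the metric choice below is an assumption), the same structure can be created directly with faiss:

```python
import faiss

# "OPQ48,IVF1,PQ48x8": OPQ rotation -> one inverted list -> 48 sub-quantizers
# with 8-bit codes each, i.e., 48 bytes per vector.
index = faiss.index_factory(768, "OPQ48,IVF1,PQ48x8", faiss.METRIC_INNER_PRODUCT)
print(index.is_trained)  # False: jpq.run_init trains it on the encoded corpus
```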
Next, mine hard negatives by retrieving passages/documents for the training queries:
dataset="doc" # or "passage"
M=48; python -m jpq.run_retrieval \
--preprocess_dir ./data/$dataset/preprocess/ \
--index_path ./data/$dataset/init/OPQ$M,IVF1,PQ${M}x8.index \
--mode train \
--query_encoder_dir ./data/$dataset/star \
--output_path ./data/$dataset/init/m$M.train.rank.tsv \
--batch_size 128 \
--max_query_length 32 \
--topk 210 \
--gpu_search
To validate the quality of the retrieved passages, use the following command to evaluate MRR. You should get a score of about 0.35-0.36.
M=48
dataset="doc" # or "passage"
if [ $dataset = "passage" ]; then trunc=10 ; else trunc="doc" ; fi
python ./msmarco_eval.py ./data/$dataset/preprocess/train-qrel.tsv ./data/$dataset/init/m$M.train.rank.tsv $trunc
We use the top-200 irrelevant passages as hard negatives (a conceptual sketch follows the command below):
dataset="doc" # or "passage"
M=48; python -m repconc.gen_hardnegs \
--rank ./data/$dataset/init/m48.train.rank.tsv \
--qrel ./data/$dataset/preprocess/train-qrel.tsv \
--top 200 \
--output ./data/$dataset/init/m48.hardneg.json
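Conceptually, this keeps the top-ranked but non-relevant candidates for each training query. Below is a hedged sketch of that filtering; repconc.gen_hardnegs is the reference, and the exact output JSON structure is an assumption, so the sketch writes to a separate placeholder file:

```python
import json
from collections import defaultdict

# Hedged sketch: for each training query, keep up to 200 retrieved pids that are
# not labeled relevant in train-qrel.tsv as hard negatives.
relevant, retrieved = defaultdict(set), defaultdict(list)
with open("./data/passage/preprocess/train-qrel.tsv") as f:
    for line in f:
        qid, _, pid, rel = line.split()
        if int(rel) > 0:
            relevant[qid].add(pid)
with open("./data/passage/init/m48.train.rank.tsv") as f:
    for line in f:
        qid, pid, rank = line.split()
        retrieved[qid].append(pid)
hardnegs = {qid: [p for p in pids if p not in relevant[qid]][:200]
            for qid, pids in retrieved.items()}
with open("./data/passage/init/m48.hardneg.sketch.json", "w") as f:
    json.dump(hardnegs, f)
```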
Third, use the constrained clustering technique to obtain supervised Index Assignments.
M=48
dataset="doc" # or "passage"
if [ $dataset = "passage" ]
then
max_doc_length=110
batch=1024
else
max_doc_length=512
batch=256
fi
multibatch_per_forward=6
num_train_epochs=10
train_root="./data/$dataset/train/m48"
python -m repconc.run_idx_assign_train \
--learning_rate 5e-6 \
--centroid_lr 2e-4 \
--lr_scheduler_type constant \
--num_train_epochs $num_train_epochs \
--max_query_length 24 \
--max_doc_length $max_doc_length \
--preprocess_dir ./data/$dataset/preprocess \
--label_path ./data/$dataset/preprocess/train-qrel.tsv \
--MCQ_M $M \
--MCQ_K 256 \
--opq_path ./data/$dataset/init/OPQ$M,IVF1,PQ${M}x8.index \
--hardneg_path ./data/$dataset/init/m$M.hardneg.json \
--init_model_path ./data/$dataset/star \
--multibatch_per_forward $multibatch_per_forward \
--per_device_train_batch_size $batch \
--fp16 \
--gradient_checkpointing \
--output_dir $train_root/assign_models \
--logging_dir $train_root/assign_log \
--sk_epsilon 0.05 \
--mse_weight 0.05
Hyper-parameters differ for different `M` values. Please refer to our paper for the settings associated with other `M` values.
Models are saved per epoch. You can evaluate the checkpoint with
ckpt=XXXX # the training step corresponding to the saved checkpoint
M=48
dataset="doc" # or "passage"
if [ $dataset = "passage" ]; then max_doc_length=256 ; else max_doc_length=512 ; fi
echo max_doc_length: $max_doc_length
train_root="./data/$dataset/train/m48"
python -m repconc.run_encode \
--preprocess_dir ./data/$dataset/preprocess \
--doc_encoder_dir $train_root/assign_models/checkpoint-$ckpt \
--output_path $train_root/assign_evaluate/checkpoint-$ckpt/m$M.index \
--batch_size 128 \
--max_doc_length $max_doc_length
for mode in "dev" "test"; do
python -m repconc.run_retrieve \
--preprocess_dir ./data/$dataset/preprocess \
--index_path $train_root/assign_evaluate/checkpoint-$ckpt/m$M.index \
--mode $mode \
--query_encoder_dir $train_root/assign_models/checkpoint-$ckpt \
--output_path $train_root/assign_evaluate/checkpoint-$ckpt/m$M.$mode.rank \
--batch_size 128 \
--nprobe 1 \
--gpu_search
done
if [ $dataset = "passage" ]; then trunc=10 ; else trunc="doc" ; fi
python ./msmarco_eval.py ./data/$dataset/preprocess/dev-qrel.tsv $train_root/assign_evaluate/checkpoint-$ckpt/m$M.dev.rank $trunc
./data/trec_eval-9.0.7/trec_eval -c -mrecall.100 -mndcg_cut.10 ./data/$dataset/preprocess/test-qrel.tsv $train_root/assign_evaluate/checkpoint-$ckpt/m$M.test.rank
Finally, we adopt JPQ to train the query encoder and PQ centroids. The Index Assignments are fixed in this stage.
M=48
dataset="doc" # or "passage"
ckpt=xxxx # the initialized RepCONC model checkpoint. Select one with the best dev performance.
train_root="./data/$dataset/train/m48"
python -m repconc.run_centroid_train \
--preprocess_dir ./data/$dataset/preprocess \
--model_save_dir $train_root/centroid_models \
--log_dir $train_root/centroid_log \
--init_index_path $train_root/assign_evaluate/checkpoint-$ckpt/m$M.index \
--init_model_path $train_root/assign_models/checkpoint-$ckpt \
--centroid_lr 2e-5 \
--lr 2e-6 \
--train_batch_size 128 \
--loss list
You can evaluate the checkpoint with
M=48
dataset="doc" # or "passage"
epoch=X # ranging from 1 to total training epoch (default: 6)
train_root="./data/$dataset/train/m48"
for mode in "dev" "test"; do
python -m repconc.run_retrieve \
--preprocess_dir ./data/$dataset/preprocess \
--index_path $train_root/centroid_models/epoch-$epoch/index \
--mode $mode \
--query_encoder_dir $train_root/centroid_models/epoch-$epoch \
--output_path $train_root/centroid_evaluate/epoch-$epoch/m$M.$mode.rank \
--batch_size 128 \
--nprobe 1 \
--gpu_search
done
if [ $dataset = "passage" ]; then trunc=10 ; else trunc="doc" ; fi
python ./msmarco_eval.py ./data/$dataset/preprocess/dev-qrel.tsv $train_root/centroid_evaluate/epoch-$epoch/m$M.dev.rank $trunc
./data/trec_eval-9.0.7/trec_eval -c -mrecall.100 -mndcg_cut.10 ./data/$dataset/preprocess/test-qrel.tsv $train_root/centroid_evaluate/epoch-$epoch/m$M.test.rank
If you find this repo useful, please consider citing our work:
@inproceedings{zhan2022learning,
author = {Zhan, Jingtao and Mao, Jiaxin and Liu, Yiqun and Guo, Jiafeng and Zhang, Min and Ma, Shaoping},
title = {Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval},
year = {2022},
publisher = {Association for Computing Machinery},
url = {https://doi.org/10.1145/3488560.3498443},
doi = {10.1145/3488560.3498443},
booktitle = {Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining},
pages = {1328–1336},
numpages = {9},
location = {Virtual Event, AZ, USA},
series = {WSDM '22}
}
- CIKM 2021: Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance [code]: It presents JPQ and greatly improves the efficiency of Dense Retrieval. RepCONC utilizes JPQ for second-stage training.
- SIGIR 2021: Optimizing Dense Retrieval Model Training with Hard Negatives [code]: It provides theoretical analysis on different negative sampling strategies and greatly improves the effectiveness of Dense Retrieval with hard negative sampling. The proposed negative sampling methods are adopted by RepCONC.
- ARXIV 2020: RepBERT: Contextualized Text Embeddings for First-Stage Retrieval [code]: It is one of the pioneer studies on Dense Retrieval.