Description:
I have been attempting to reproduce the experimental results from your paper "Efficient Test-Time Adaptation of Vision-Language Models" as outlined in the repository. However, after running the experiments using the provided code and data, the results I obtained do not align with those reported in the paper.
Steps to Reproduce:
1. Clone the repository and set up the environment as per the instructions in the README.
2. Run the following commands to reproduce the results:
bash scripts/run_cd_benchmark_rn50.sh
bash scripts/run_cd_benchmark_vit.sh
bash scripts/run_ood_benchmark_rn50.sh
bash scripts/run_ood_benchmark_vit.sh
3. I then modified the code by removing the update_cache() and compute_cache_logits() sections in the function run_test_tda(), so that the model now operates as a zero-shot CLIP. I then re-ran the same commands.
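To make the modification in step 3 concrete, here is a minimal sketch of the zero-shot prediction I expect to remain once the cache terms are removed. This is not the repository's code: the function names, the plain-list feature representation, and the logit scale of 100 are assumptions for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_logits(image_feature, class_text_features, scale=100.0):
    """Scaled cosine similarity between one image embedding and each class text embedding.

    `scale` is an assumed logit scale; no cache term is added to the result.
    """
    img = l2_normalize(image_feature)
    logits = []
    for text_feature in class_text_features:
        txt = l2_normalize(text_feature)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    return logits


def predict(image_feature, class_text_features):
    """Return the index of the class with the highest zero-shot logit."""
    logits = zero_shot_logits(image_feature, class_text_features)
    return max(range(len(logits)), key=logits.__getitem__)
```

If the modified run_test_tda() does anything beyond this (for example, extra processing of the image before the prediction), that could be one place where my baseline differs from the zero-shot CLIP numbers in the paper.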
Results:
As shown in the image, "CLIPRN50/vit" refers to zero-shot CLIP. The accuracy of zero-shot CLIP in my experiment is higher than the results reported in the paper, and it is close to the performance of "TDARN50/vit".
Could you please provide guidance on how to resolve this discrepancy?