This repository contains implementations of RNN-, LSTM-, and Transformer-based image captioning models, developed during the Computer Vision 2024 course.
Models are trained on COCO dataset and evaluated on NICE Challenge dataset.
Beam search is implemented in Transformer_Load.ipynb and load_checkpoint.py; it allows the Transformer to achieve higher BLEU@k and CIDEr scores.
I also found that combining beam search with n-gram blocking further improved CIDEr scores.
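The idea can be sketched as follows. This is a minimal, self-contained illustration of beam search with n-gram blocking, not the repository's exact code; the `score_next` callback and the toy bigram table below are hypothetical stand-ins for a trained decoder.

```python
import math

def beam_search(score_next, bos, eos, beam_size=3, max_len=20, block_ngram=2):
    """Minimal beam search with n-gram blocking (illustrative sketch).

    score_next(seq) -> dict mapping next token -> log-probability.
    Any expansion that would repeat an already-seen `block_ngram`-gram
    is pruned, which discourages repetitive captions.
    """
    beams = [([bos], 0.0)]          # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == eos:      # completed hypothesis: set it aside
                finished.append((seq, logp))
                continue
            for tok, lp in score_next(seq).items():
                new_seq = seq + [tok]
                n = block_ngram
                if len(new_seq) >= n:
                    ngram = tuple(new_seq[-n:])
                    seen = {tuple(new_seq[i:i + n])
                            for i in range(len(new_seq) - n)}
                    if ngram in seen:   # n-gram blocking
                        continue
                candidates.append((new_seq, logp + lp))
        if not candidates:
            break
        # keep the beam_size best partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == eos)
    if not finished:                # nothing reached eos within max_len
        finished = beams
    return max(finished, key=lambda c: c[1])[0]

# Toy bigram "model" (hypothetical data) to show the mechanics:
table = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a":   {"b": math.log(0.7), "</s>": math.log(0.3)},
    "b":   {"a": math.log(0.5), "</s>": math.log(0.5)},
}
caption = beam_search(lambda seq: table[seq[-1]], "<s>", "</s>", beam_size=2)
# blocking prevents the "a b a b ..." loop, so the search terminates
```

In a real captioning decoder, `score_next` would run one step of the Transformer conditioned on the image features and return the log-softmax over the vocabulary.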
My implementation of the CIDEr evaluation metric is available in the cider-python3 repository.
A summary report of this work is attached as ./report/image_captioning.pdf.