This project implements an image captioning system that combines a CNN-based image encoder, an LSTM-based decoder, and a soft attention mechanism. The model is trained and evaluated on the Flickr8k dataset.
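The core of soft attention is simple: at each decoding step, the decoder's hidden state is scored against every image-region feature, the scores are softmaxed into weights, and the context vector is the weighted average of the regions. A minimal sketch in plain Python (dot-product scores stand in for the small MLP a real model would use; all names here are illustrative):

```python
import math

def soft_attention(features, hidden):
    """Attention-weighted context vector over image-region features.

    features: list of region feature vectors
    hidden:   decoder hidden state (same dimensionality as each feature)
    Scores are plain dot products here; a real model learns a scoring MLP.
    """
    scores = [sum(f_i * h_i for f_i, h_i in zip(f, hidden)) for f in features]
    # Softmax over regions -> attention weights that sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the region features
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context

# Toy example: 3 image regions with 2-dim features
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [1.0, 0.0]
weights, context = soft_attention(features, hidden)
# Regions aligned with the hidden state receive higher weight
```

Because the weights are differentiable, the whole attention step trains end-to-end with the rest of the network.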
The architecture follows the principles outlined in the reference paper, including:

- Use of cross-entropy loss
- Training with the Adam optimizer
- Evaluation using standard metrics such as BLEU-4, METEOR, and CIDEr
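The training objective is per-token cross-entropy over the vocabulary: at each step the decoder's logits are softmaxed and the negative log-probability of the ground-truth word is accumulated. A small illustrative sketch (the vocabulary size and logit values are made up):

```python
import math

def cross_entropy(logits, target_index):
    """Per-token cross-entropy: -log softmax(logits)[target_index].

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

# Toy decoding step: vocabulary of 4 words, ground-truth word is index 2
logits = [0.5, 0.1, 2.0, -1.0]
loss = cross_entropy(logits, 2)
# The caption loss is the sum (or mean) of this quantity over all time steps
```

In the full model this loss is summed over the caption and minimized with Adam; the evaluation metrics (BLEU-4, METEOR, CIDEr) are computed separately on generated captions, not optimized directly.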
While this notebook does not implement full transformer-based models (such as ViT or DeiT), it provides a solid baseline and a foundation for further exploration of vision-language tasks.