تفاصيل العمل

This project implements an image captioning system using a combination of CNN-based image encoder, LSTM-based decoder, and a soft attention mechanism. The model is trained and evaluated on the Flickr8k dataset.

The architecture follows the principles outlined in the research paper, including:

Use of cross-entropy loss

Training with Adam optimizer

Evaluation using standard metrics like BLEU-4, METEOR, and CIDEr

While this notebook does not implement full transformer-based models (like ViT or DeiT), it provides a solid baseline and foundation for further exploration into vision-language tasks.

بطاقة العمل

اسم المستقل
عدد الإعجابات
0
عدد المشاهدات
8
تاريخ الإضافة
تاريخ الإنجاز
المهارات