Published 30-06-2023
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Abstract
Deep learning-based end-to-end image captioning models were first introduced in 2014, and since then researchers from several domains have proposed a variety of architectures. These designs tend toward smaller models with fewer parameters and improved performance. Many methods have reached state-of-the-art results by exploiting different properties of spatial regions and temporal sequences, ranging from spatial attention mechanisms to temporal extensions. Despite these advances, a comprehensive performance-oriented evaluation of the models has not been presented.
Image captioning is the process of generating a textual description of a given image. It combines visual recognition and natural language processing. Recently, deep learning methods have been introduced to generate human-like captions automatically. This paper presents a comprehensive survey of deep learning-based image captioning models, from the early approaches to the state-of-the-art models, together with an in-depth, performance-centric comparative analysis. The analysis compares the models qualitatively and quantitatively in terms of spatial and temporal attention mechanisms, encoding and decoding architectures, extensions of RNNs, and the use of extra information such as external language models. Methods for tackling common issues in generated captions and a discussion of evaluation metrics are also presented. Finally, the findings of the comparative analysis are summarized, and the associated challenges and trends for future research are indicated.