From Alt-text to Real Context: Revolutionizing image captioning using the potential of LLM
DOI: https://doi.org/10.32628/CSEIT25111238

Keywords: Meta Llama-3.2, Image captioning, Contextual descriptions, Hugging Face Transformers, Image analysis, Sentiment analysis, Zero-shot capabilities, Large Language Models

Abstract
The "From Alt-Text to Real Context" project harnesses the transformative capability of Meta Llama-3.2-11B-Vision-Instruct, thus unleashing an unprecedented image captioning system with advanced multimodal reasoning and attention mechanisms. These descriptions are richly contextual and domain-adaptive, moving them beyond the boundaries of traditional models. The platform has been constructed very carefully, keeping a heavy robust backend in Python (Flask/FastAPI) and an interactive frontend created in React.js for easier human interaction. With strong, advanced tools that include Hugging Face Transformers, OpenCV, NumPy, and Pandas, it shines for fine-tuning models, image preprocessing, and manipulation for particular, domain-specific customizations suited for a wide variety of applications from accessibility to e-commerce and education. The system generates alt-text and can perform complex image analysis and identify key features, providing more comprehensive scene descriptions. It is equipped with sentiment analysis to decipher emotive signals; it can provide critical metrics to quantify positive sentiment. This sophisticated AI system is built on the power of LLM outshines with zero-shot and multilingual capabilities and can also be utilized in tasks like image-based captioning and storytelling. Combining state-of-the-art AI, high adaptability, and great user-friendliness, this project raises new standards for contextual understanding, applicability in real-life scenarios, and image captioning.
References
Bucciarelli, D., Moratelli, N., Cornia, M., Baraldi, L., & Cucchiara, R. (2024, December 4). Personalizing multimodal large language models for image captioning: An experimental analysis. arXiv.org. https://arxiv.org/abs/2412.03665
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., . . . Amodei, D. (2020, May 28). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023, February 27). LLaMA: Open and Efficient Foundation Language Models. arXiv.org. https://arxiv.org/abs/2302.13971
Llama. (n.d.). Meta Llama. https://www.llama.com/
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020, October 22). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv.org. https://arxiv.org/abs/2010.11929
Zhou, Q., Huang, J., Li, Q., Gao, J., & Wang, Q. (2024, May 28). Text-only synthesis for image captioning. https://arxiv.org/abs/2405.18258
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Philadelphia, Pennsylvania, USA. https://aclanthology.org/P02-1040/ DOI: https://doi.org/10.3115/1073083.1073135
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-Shot Text-to-Image Generation. https://arxiv.org/abs/2102.12092
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. https://arxiv.org/abs/2005.12872 DOI: https://doi.org/10.1007/978-3-030-58452-8_13
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2017). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. https://arxiv.org/abs/1707.07998 DOI: https://doi.org/10.1109/CVPR.2018.00636
Tan, H., & Bansal, M. (2019). LXMERT: Learning Cross-Modality Encoder Representations from Transformers. https://arxiv.org/abs/1908.07490 DOI: https://doi.org/10.18653/v1/D19-1514
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ResearchGate. https://www.researchgate.net/publication/272194766_Show_Attend_and_Tell_Neural_Image_Caption_Generation_with_Visual_Attention
Hugging Face. (n.d.). Transformers Documentation. Retrieved from https://huggingface.co.
OpenCV. (n.d.). Open Source Computer Vision Library Documentation. Retrieved from https://opencv.org.
Pandas Development Team. (2023). Pandas Documentation. Retrieved from https://pandas.pydata.org.
NumPy Developers. (2023). NumPy Documentation. Retrieved from https://numpy.org/doc/stable/
Flask. (n.d.). Flask Documentation (1.x). Retrieved from https://flask.palletsprojects.com/
Sebastián Ramírez. (2020). FastAPI Documentation. Retrieved from https://fastapi.tiangolo.com
TensorFlow. (n.d.). TensorFlow Documentation. Retrieved from https://www.tensorflow.org/learn
License
Copyright (c) 2025 International Journal of Scientific Research in Computer Science, Engineering and Information Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.