Comparative Investigation of Image Feature Extraction Techniques for Text Detection Using UNet, TextSnake, Swin Transformer, and CRAFT

Authors

  • Keerthana Sundar Raj, Research Scholar, Department of Computer Science, Dr. N.G.P. Arts & Science College, Coimbatore, Tamil Nadu, India
  • Dr. J. Savitha, Professor, Department of Computer Science, Dr. N.G.P. Arts & Science College, Coimbatore, Tamil Nadu, India

DOI:

https://doi.org/10.32628/CSEIT25113383

Keywords:

Text Detection, Image Feature Extraction, Deep Learning, UNet, TextSnake, Swin Transformer, CRAFT, Scene Text Recognition, OCR, ICDAR, Computer Vision, PyTorch, TensorFlow

Abstract

Text detection is a fundamental task in computer vision with wide-ranging applications, including document digitization, autonomous navigation, and assistive technologies. This research presents a comprehensive comparative study of four state-of-the-art deep learning models for image feature extraction in text detection: UNet, TextSnake, Swin Transformer, and CRAFT. Each model embodies a distinct architectural approach: segmentation-based, geometry-aware, transformer-based, and character-affinity modeling, respectively. The models are implemented in Python using the PyTorch and TensorFlow frameworks and evaluated on standard benchmark datasets, including ICDAR 2021 and Synth90k. Performance is assessed using objective metrics such as Precision, Recall, F1-Score, Intersection over Union (IoU), Detection Accuracy, Inference Speed (FPS), and Model Size (MB). The experimental results highlight the trade-offs between accuracy, computational efficiency, and generalization capability for each model. Among the models evaluated, CRAFT demonstrates the best overall balance between detection accuracy and robustness, especially in scenarios involving irregular or curved text. These findings provide practical guidance for selecting a text detection model to match specific real-world application requirements.
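
To make the evaluation protocol concrete, the sketch below shows one common way to compute the detection-level Precision, Recall, F1-Score, and Inference Speed (FPS) named above. It is a minimal illustration in plain Python, not the authors' evaluation code; the 0.5 IoU threshold, the greedy one-to-one box matching, and the helper names (iou, score_detections, measure_fps) are assumptions chosen to mirror common ICDAR-style protocols.

import time

def iou(box_a, box_b):
    # Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def score_detections(pred_boxes, gt_boxes, iou_thresh=0.5):
    # Greedy one-to-one matching: a prediction is a true positive if it
    # overlaps a not-yet-matched ground-truth box at IoU >= iou_thresh.
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_v = -1, iou_thresh
        for j, g in enumerate(gt_boxes):
            v = iou(p, g)
            if j not in matched and v >= best_v:
                best_j, best_v = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

def measure_fps(infer, image, warmup=5, runs=50):
    # Rough wall-clock throughput for any single-image inference callable.
    for _ in range(warmup):
        infer(image)
    start = time.perf_counter()
    for _ in range(runs):
        infer(image)
    return runs / (time.perf_counter() - start)

# Example: two predictions scored against two ground-truth boxes.
p, r, f1 = score_detections([(0, 0, 10, 10), (20, 20, 30, 30)],
                            [(1, 1, 10, 10), (50, 50, 60, 60)])
print(f"Precision={p:.2f} Recall={r:.2f} F1={f1:.2f}")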

References

P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai, “Multi-oriented scene text detection via corner localization and region segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1–10.

H. Ghahremannezhad, H. Shi, and C. Liu, “Real-time accident detection in traffic surveillance using deep learning,” in Proc. IEEE Int. Conf. Imaging Systems and Techniques (IST), Kaohsiung, Taiwan, 2022, pp. 1–6.

T. Khan, R. Sarkar, and A. F. Mollah, “Deep learning approaches to scene text detection: A comprehensive review,” Artificial Intelligence Review, vol. 54, no. 5, pp. 3239–3298, 2021.

G. Thiodorus, Y. Arum Sari, and N. Yudistira, “Convolutional neural network with transfer learning for classification of food types in tray box images,” ACM, 2021.

B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2963–2970.

J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” in Proc. British Machine Vision Conference, 2002, pp. 384–393.

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” in Advances in Neural Information Processing Systems, 2014.

X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: An efficient and accurate scene text detector,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2642–2651.

B. Shi, X. Bai, and S. Belongie, “Detecting oriented text in natural images by linking segments,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2550–2558.

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.

S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “TextSnake: A flexible representation for detecting text of arbitrary shapes,” in Proc. European Conf. Computer Vision (ECCV), 2018, pp. 20–36.

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2021, pp. 10012–10022.

Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character Region Awareness for Text Detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9365–9374.

Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++: A nested U-Net architecture for medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2020. https://doi.org/10.1109/TMI.2019.2959609

F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation,” Nature Methods, vol. 18, no. 2, pp. 203–211, 2021. https://doi.org/10.1038/s41592-020-01008-z

Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, and P. Luo, “Rethinking text line detection for reading order detection in documents,” in ECCV Workshops, 2020. https://arxiv.org/abs/2008.05054

Y. Sun, Y. Wang, and C. K. Tang, “TextNet: Irregular text reading from images with an end-to-end trainable network,” Pattern Recognition, vol. 107, art. 107528, 2020. https://doi.org/10.1016/j.patcog.2020.107528

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” arXiv preprint arXiv:2012.12877, 2020. https://arxiv.org/abs/2012.12877

Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection (CRAFT),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 121–132, 2020. https://doi.org/10.1109/TPAMI.2020.2981306

J. Wang, H. Wang, Y. Wu, and M. Wang, “Towards robust curved text detection with conditional spatial expansion,” in Proc. European Conf. Computer Vision (ECCV), 2020, pp. 425–441. https://doi.org/10.1007/978-3-030-58542-6_26

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learning Representations (ICLR), 2015. https://arxiv.org/abs/1409.1556

N. Nayef, F. Yin, H. J. Choi, Y. Feng, D. Karatzas, U. Pal, and S. Uchida, “ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction,” International Journal on Document Analysis and Recognition (IJDAR), vol. 23, no. 1, pp. 59–76, 2020. https://doi.org/10.1007/s10032-020-00338-0

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial intelligence for scene text recognition,” International Journal of Computer Vision, vol. 128, no. 4, pp. 904–920, 2020. https://doi.org/10.1007/s11263-019-01214-z

Published

15-06-2025

Issue

Section

Research Articles