Enhancing Transformer Architecture: Techniques for Efficient Inference
DOI: https://doi.org/10.32628/CSEIT25112757

Abstract
This paper explores recent advancements in optimizing transformer architectures for efficient inference. We investigate various techniques including pruning, quantization, knowledge distillation, and architectural modifications. Our experimental results demonstrate that combining these approaches can reduce inference time by up to 74% while maintaining over 95% of the original performance. We also introduce a novel attention mechanism that dynamically allocates computational resources based on input complexity. Our implementation shows promise for edge device deployment where computational resources are constrained.
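Two of the techniques named above, unstructured magnitude pruning and symmetric 8-bit quantization, can be illustrated on a single weight matrix. The sketch below is not the paper's implementation; the function names and the 50% sparsity / per-tensor scale choices are illustrative assumptions.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization; returns integer codes and a scale."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale
```

In practice the two steps compose: pruning first removes low-magnitude weights, and quantization then stores the survivors in 8 bits, which is one route to the kind of inference-time reduction the abstract reports.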
License
Copyright (c) 2025 International Journal of Scientific Research in Computer Science, Engineering and Information Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.