OPTIMIZATION OF 1-BIT QUANTIZED MATRIX MULTIPLICATION FOR LARGE LANGUAGE MODELS

Authors

  • Dmytro Salnikov
  • Oleg Vasylchenkov
  • Dmytro Karaman

DOI:

https://doi.org/10.26906/SUNZ.2025.3.136

Keywords:

quantized operations, matrix multiplication, transformers, large language models, CUDA, GPU, LLM, PyTorch, neural networks

Abstract

With the rapid development of artificial intelligence systems, natural language processing has become one of the most relevant and in-demand tasks. Tools and algorithms based on large language models (LLMs), which enable natural language processing and speech-to-text conversion, are actively used to automate everyday tasks, power service systems, and support real-time human interaction. Efficient and accurate natural language processing that accounts for syntactic and linguistic peculiarities requires highly complex language models. However, large language models demand significant memory and computational resources, which makes their widespread use challenging on resource-constrained devices such as battery-powered mobile devices, embedded systems, and Internet of Things (IoT) devices. Optimizing language model algorithms and reducing the hardware cost of their deployment is therefore an increasingly pressing issue. To accelerate execution and minimize memory requirements, quantization of language model parameters is employed. This work formulates the key challenges of performing quantized matrix multiplication, reviews popular approaches to implementing matrix multiplication algorithms on GPUs, and presents an optimized high-performance kernel for 1-bit quantized matrix multiplication.
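
As a conceptual illustration of the operation discussed above, the CUDA sketch below shows how a 1-bit quantized matrix product can be evaluated with bitwise operations. It is not the optimized kernel presented in the paper: the kernel name, the row-wise packing of 32 sign bits per uint32_t word, and the assumption that both operands are sign-binarized to {-1, +1} are illustrative choices. With that encoding, the dot product of two length-K bit vectors reduces to the standard identity dot = K - 2 * popcount(a XOR b).

#include <cstdint>

// Illustrative sketch (assumed layout, not the paper's kernel):
// A is M x K and Bt is the N x K transpose of B, both sign-binarized and
// bit-packed row-wise, 32 values per uint32_t word, so K_words = K / 32.
// One thread computes one int32 element of C = A * B.
__global__ void binary_gemm_naive(const uint32_t* __restrict__ A,
                                  const uint32_t* __restrict__ Bt,
                                  int32_t* __restrict__ C,
                                  int M, int N, int K_words)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    int mismatches = 0;
    for (int k = 0; k < K_words; ++k) {
        // XOR marks bit positions where the +1/-1 signs differ;
        // __popc counts them in a single instruction.
        mismatches += __popc(A[row * K_words + k] ^ Bt[col * K_words + k]);
    }
    int K_bits = K_words * 32;
    // Matching signs contribute +1, differing signs contribute -1.
    C[row * N + col] = K_bits - 2 * mismatches;
}

A high-performance kernel would typically add shared-memory tiling, coalesced vectorized loads, and per-tensor scaling factors to map the integer accumulator back to floating point; the sketch above only captures the arithmetic core of 1-bit quantized matrix multiplication.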

Published

2025-09-30