AREA-EFFICIENT HARDWARE MODULES FOR FP16/FP8/FP32 FORMAT CONVERSION IN EMBEDDED SYSTEMS

Dmytro Salnikov; Oleg Vasylchenkov

doi:10.26906/SUNZ.2026.2.243

Keywords:

floating-point formats, reduced-precision number representation, embedded systems, edge computing, FPGA, VHDL, embedded neural network acceleration, area-efficient architecture

Abstract

The rapid proliferation of neural networks in embedded and edge computing systems has increased demand for efficient hardware implementations that support precision-scalable arithmetic. Applications such as autonomous vehicles, intelligent sensors, and industrial automation require high computational performance and low latency under strict energy constraints. Floating-point arithmetic, as defined by the IEEE 754 standard, remains the dominant numerical representation in such systems because of its versatility and wide dynamic range. However, deploying modern deep learning models on resource-limited platforms makes it difficult to balance accuracy, throughput, and hardware footprint. To address this, reduced-precision formats such as FP16, BF16, and FP8 (E4M3, E5M2) have gained popularity for both inference and training, since they reduce memory bandwidth and improve energy efficiency with minimal accuracy degradation. Despite their growing prevalence, many microcontrollers and FPGAs lack native hardware support for these formats, which motivates compact, reconfigurable conversion modules that bridge them to conventional FP32 processing units. This work presents the design, implementation, and hardware evaluation of fully synthesizable VHDL modules for converting between FP8, FP16, BF16, and standard IEEE 754 single-precision (FP32) formats. The proposed architecture uses FPGA look-up tables (LUTs) to perform exponent and mantissa field manipulation, bias adjustment, and classification of special numerical cases such as Infinity and NaN, ensuring full standard compliance. The converters were synthesized using a commercial design flow targeting an Intel Cyclone V device.
Experimental results demonstrate very low resource utilization and high operating frequency: the FP8 E4M3 and FP8 E5M2 converters each require only 14 ALMs (adaptive logic modules) and achieve clock frequencies exceeding 500 MHz. These results confirm that the proposed modules are suitable for mixed-precision computing systems and embedded neural network accelerators, providing an efficient hardware foundation for energy-aware, high-performance AI workloads on constrained platforms.
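
The conversion steps named in the abstract (field extraction, bias adjustment, and special-case classification) can be illustrated with a small software reference model. The sketch below is not the authors' VHDL; it is a hypothetical Python model of the FP8-to-FP32 widening direction only, and the function names `fp8_to_fp32_bits` and `bits_to_float` are introduced here for illustration. It assumes E5M2 follows IEEE-style encodings for Infinity and NaN, and that E4M3 uses the commonly cited convention with no infinities and a single NaN pattern (exponent and mantissa all ones); the abstract does not state which E4M3 convention the modules implement.

```python
import struct


def fp8_to_fp32_bits(byte, exp_bits, man_bits, ieee_specials=True):
    """Widen an 8-bit float (1 sign bit + exp_bits + man_bits) to an FP32 bit pattern.

    ieee_specials=True  : all-ones exponent encodes Inf/NaN (assumed for E5M2).
    ieee_specials=False : only exponent and mantissa all ones is NaN (assumed for E4M3).
    """
    bias = (1 << (exp_bits - 1)) - 1           # 15 for E5M2, 7 for E4M3
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    # Field extraction
    sign32 = ((byte >> 7) & 1) << 31
    exp = (byte >> man_bits) & exp_max
    man = byte & man_max

    # Special-case classification
    if ieee_specials and exp == exp_max:
        if man == 0:
            return sign32 | 0x7F800000         # signed infinity
        # NaN: force the quiet bit and keep the payload in the top mantissa bits
        return sign32 | 0x7FC00000 | (man << (23 - man_bits))
    if not ieee_specials and exp == exp_max and man == man_max:
        return sign32 | 0x7FC00000             # the single NaN pattern
    if exp == 0:
        if man == 0:
            return sign32                      # signed zero
        # Subnormal input: normalize so the leading 1 becomes the hidden bit
        shift = man_bits - man.bit_length() + 1
        man = (man << shift) & man_max
        exp = 1 - shift

    # Bias adjustment (source bias -> FP32 bias 127) and mantissa left-alignment
    exp32 = exp - bias + 127
    return sign32 | (exp32 << 23) | (man << (23 - man_bits))


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a Python float via IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    print(bits_to_float(fp8_to_fp32_bits(0x3C, 5, 2)))                       # 1.0 (E5M2)
    print(bits_to_float(fp8_to_fp32_bits(0x7E, 4, 3, ieee_specials=False)))  # 448.0 (E4M3 max)
```

Widening is exact because every FP8 value, including subnormals, is representable as a normal FP32 number, so no rounding logic is needed in this direction. Narrowing from FP32 to FP8 would additionally require rounding and overflow handling, which this sketch does not cover.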
