Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/155648
Title: TAB : unified and optimized ternary, binary and mixed-precision neural network inference on the edge
Authors: Zhu, Shien
Duong, Luan H. K.
Liu, Weichen
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computer systems organization::Performance of systems
Issue Date: 2022
Source: Zhu, S., Duong, L. H. K. & Liu, W. (2022). TAB : unified and optimized ternary, binary and mixed-precision neural network inference on the edge. ACM Transactions On Embedded Computing Systems. https://dx.doi.org/10.1145/3508390
Project: MOE2019-T2-1-071
MOE2019-T1-001-072
M4082282
M4082087
Journal: ACM Transactions on Embedded Computing Systems
Abstract: Ternary Neural Networks (TNNs) and mixed-precision Ternary Binary Networks (TBNs) have demonstrated higher accuracy than Binary Neural Networks (BNNs) while providing fast, low-power and memory-efficient inference. Related works have improved the accuracy of TNNs and TBNs, but overlooked their optimization on CPU and GPU platforms. First, there is no unified encoding for the binary and ternary values in TNNs and TBNs. Second, existing works store the 2-bit quantized data sequentially in 32/64-bit integers, resulting in bit-extraction overhead. Last, adopting standard 2-bit multiplications for ternary values leads to a complex computation pipeline, and efficient mixed-precision multiplication between ternary and binary values is unavailable. In this paper, we propose TAB as a unified and optimized inference method for ternary, binary and mixed-precision neural networks. TAB includes a unified value representation, an efficient data storage scheme, and novel bitwise dot product pipelines on CPU/GPU platforms. We adopt signed integers for consistent value representation across binary and ternary values. We introduce a bitwidth-last data format that stores the first and second bits of the ternary values separately to remove the bit-extraction overhead. We design the ternary and binary bitwise dot product pipelines based on Gated-XOR, using up to 40% fewer operations than State-Of-The-Art (SOTA) methods. Theoretical speedup analysis shows that our proposed TAB-TNN is 2.3X as fast as the SOTA ternary method RTN, 9.8X as fast as 8-bit integer quantization (INT8), and 39.4X as fast as 32-bit full-precision convolution (FP32). Experimental results on CPU and GPU platforms show that our TAB-TNN achieves up to 34.6X speedup and 16X storage size reduction compared with FP32 layers. TBN, Binary-activation Ternary-weight Network (BTN) and BNN in TAB are up to 40.7X, 56.2X and 72.2X as fast as FP32, respectively. TAB-TNN is up to 70.1% faster and 12.8% more power-efficient than RTN on Darknet-19 while keeping the same accuracy. TAB is open source as a PyTorch Extension for easy integration with existing CNN models.
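The bit-plane storage and gated bitwise dot product described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the 2-bit signed encoding +1 = 01, 0 = 00, -1 = 11, so that the first bit acts as a sign flag and the second bit as a nonzero flag, and it stores each bit in its own plane. The function names `pack_bitplanes`, `popcount` and `ternary_dot` are hypothetical and do not come from the TAB code base.

```python
def pack_bitplanes(values):
    """Pack ternary values (-1, 0, +1) into two separate bit planes.

    Assumed encoding (illustrative only): +1 -> 01, 0 -> 00, -1 -> 11.
    The first bit of each value goes to sign_plane, the second bit to
    nonzero_plane, so no per-element bit extraction is needed later.
    """
    sign_plane = 0
    nonzero_plane = 0
    for i, v in enumerate(values):
        if v not in (-1, 0, 1):
            raise ValueError(f"not a ternary value: {v!r}")
        if v < 0:
            sign_plane |= 1 << i
        if v != 0:
            nonzero_plane |= 1 << i
    return sign_plane, nonzero_plane


def popcount(x):
    """Count the set bits of a non-negative integer."""
    return bin(x).count("1")


def ternary_dot(a, b):
    """Dot product of two packed ternary vectors using only bitwise ops.

    gate     marks positions where both operands are nonzero.
    negative marks gated positions whose signs differ (product is -1).
    Each gated position contributes +1, and each negative one is
    corrected by -2, giving popcount(gate) - 2 * popcount(negative).
    """
    a_sign, a_nonzero = a
    b_sign, b_nonzero = b
    gate = a_nonzero & b_nonzero
    negative = (a_sign ^ b_sign) & gate
    return popcount(gate) - 2 * popcount(negative)


# Usage: compare the bitwise result with a plain elementwise dot product.
activations = [1, -1, 0, 1, -1, 0, 1, 1]
weights = [-1, -1, 1, 0, 1, 0, 1, -1]
bitwise = ternary_dot(pack_bitplanes(activations), pack_bitplanes(weights))
reference = sum(x * y for x, y in zip(activations, weights))
assert bitwise == reference
```

Under this assumed encoding a binary vector of +1/-1 values is simply one whose nonzero plane is all ones, so the same function also covers mixed ternary-binary products. An optimized version would operate on fixed-width machine words with hardware popcount instructions rather than Python integers.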
URI: https://hdl.handle.net/10356/155648
ISSN: 1539-9087
DOI: 10.1145/3508390
DOI (Related Dataset): https://doi.org/10.21979/N9/RZ75BY
Rights: © 2022 Association for Computing Machinery. All rights reserved. This paper was published in ACM Transactions on Embedded Computing Systems and is made available with permission of Association for Computing Machinery.
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Journal Articles

Files in This Item:
File: TECS_2021_TAB_Accepted Version.pdf
Description: The Accepted Version
Size: 3.55 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.