About Vulkan and the Future of High-Performance Computing

Towards a heterogeneous & parallel computing architecture

OpenCL and Vulkan will probably converge into the same thing in the future. [1][2]

To utilise the power of Vulkan (and OpenCL) we do not have to write the code ourselves, especially for deep learning purposes.

Note that some frameworks also utilise OpenGL compute shaders, a higher-level API.

Mali-G series GPUs support the OpenCL 2.0 full profile.
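
As a quick sanity check, the version and profile a given GPU driver actually reports can be queried through the standard OpenCL API. A minimal sketch (error handling omitted; assumes the OpenCL headers and an ICD loader are installed):

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    // Grab the first platform and its first GPU device
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // Query the supported OpenCL version and profile strings
    char version[256], profile[256];
    clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_PROFILE, sizeof(profile), profile, nullptr);

    // Expect something like "OpenCL 2.0 ..." and "FULL_PROFILE" on Mali-G
    printf("%s / %s\n", version, profile);
    return 0;
}
```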

Good Reads for Mobile DL

Neural Network Inference on Mobile SoCs https://arxiv.org/pdf/1908.11450.pdf

A First Look at Deep Learning Apps on Smartphones https://arxiv.org/pdf/1812.05448.pdf

AI Benchmark: Running Deep Neural Networks on Android Smartphones https://arxiv.org/pdf/1810.01109.pdf

  • a good review of mobile SoCs (hardware acceleration, different SDKs)

Chips for Consideration

without NPUs

NXP

i.MX6 or i.MX8 - popular for general-purpose and multimedia usage.

with NPUs

Amlogic

Amlogic A311D (with NPU) (Khadas VIM3)

Amlogic S922X (ODROID-N2)

Amlogic provides its own NPU SDK; boards come from third-party manufacturers. No known popular usage.

Rockchip

Popular chips used in hobbyist and multimedia products. Rockchip contributes significantly to the open-source community.

Rockchip is upgrading its product line in 2020 with the RK3588, as the flagship RK3399 is getting dated. The new chips are to be equipped with NPUs.

RK1808 and RK1806 are supported by Baidu Paddle. Only a single camera input is supported.

MediaTek (APU)

MT8168, MT8175 (Mali-G52, with a 0.3 TOPS APU, 2 CSI camera interfaces)

  • the only chips available?

MediaTek NeuroPilot SDK (Android only)

Supported by Baidu Paddle using API conversion.

Newer series

MediaTek Helio

  • P60: 280 GMAC/s
  • P90: APU 2.0 @ 1165 GMAC/s

No development boards are available.

Kirin

Kirin 970 and Kirin 980 use NPU IP from Cambricon Technologies (supported by HiAI DDK (v100), but Android only?). Kirin 810, 820, and 990 use the Da Vinci NPU.

The Da Vinci NPU is supported by Baidu Paddle using API conversion. No development board is available.

HiSilicon

Hi3516A(V300) - NNIE, 1.0 TOPS (ARM A7)

Hi3519A(V100) - NNIE, 2.0 TOPS @ 2W (ARM A53) Taobao

Hi3559A(V100) - dual NNIE @ 840 MHz (much more powerful CPU: ARM A73 + A53) Taobao (huge size)

http://www.hisilicon.com/en/Products/ProductList/Camera

Custom SDK; may not be easy to use.

Benchmark, Other Blog: YOLOv3 runs at 8 fps, slightly better than a TX1? Inferior to a TX2.

Horizon Robotics

BOOTPRINT X2 96Boards (Sunrise 2.0 AI edge processor, 4 TOPS @ 2W)

There is uncertainty in relying on a start-up's product.

Bottomline

  • Current multimedia SoCs are adding NPUs quickly, but mainly with custom SDKs
  • The sustainability of start-up SoCs is still largely uncertain
  • Do not expect too much from built-in NPU performance; treat it more as off-loading

DL Framework Comparisons

Overview of mobile DL frameworks: https://easyai.tech/blog/10-mobil-deeplearning-frame/

Deployment options for AI neural network model inference on embedded Linux platforms: https://www.jianshu.com/p/d4425b65c6e6

Future of GPU-based High Performance Computing (NPU to replace GPU) https://zhuanlan.zhihu.com/p/114254288

TensorFlow Lite

Not recommended.

Mainly focused on Android and iOS, so not friendly for our robotics use; less documentation and popularity elsewhere.

Blog: TensorFlow Lite Now Faster with Mobile GPUs. This post shows that only Android and iOS are officially supported (basically what Google has in mind).

Note: the full version of TensorFlow can run with custom compilation from source (GitHub). There is no official support, and it is probably CPU-only.

PyTorch

Not possible: it requires CUDA as the sole option for GPU acceleration.

Paddle-Lite by Baidu

Key Features

  • Official support for Mali GPUs (OpenCL), Adreno GPUs, and Apple Metal GPUs (see the usage sketch after this list)
  • Official support for Kirin NPU, MTK APU, and RK NPU
  • Future support includes Cambricon and Bitmain
  • Available in both Lite and full (CUDA) versions; tested on a Jetson TX2
  • Supports YOLOv3 since version 2.0 (launched in late 2019)
  • 5k GitHub stars; QQ support group 696965088
  • Many improvements, tricks, and tools such as x2paddle
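
For reference, a minimal sketch of what Paddle-Lite inference looks like through the 2.x C++ API, following the pattern in the Paddle-Lite Demo. The model name mobilenet_v1.nb (produced offline by the opt converter) and the 1x3x224x224 input shape are placeholder assumptions:

```cpp
#include <paddle_api.h>  // Paddle-Lite C++ inference API
#include <cstdio>

using namespace paddle::lite_api;

int main() {
    // Load a model already optimised into .nb format by the `opt` tool
    MobileConfig config;
    config.set_model_from_file("mobilenet_v1.nb");
    auto predictor = CreatePaddlePredictor<MobileConfig>(config);

    // Fill a dummy NCHW float input; real code would copy image data here
    auto input = predictor->GetInput(0);
    input->Resize({1, 3, 224, 224});
    float* data = input->mutable_data<float>();
    for (int i = 0; i < 1 * 3 * 224 * 224; ++i) data[i] = 0.5f;

    predictor->Run();

    // Read back the scores from the first output tensor
    auto output = predictor->GetOutput(0);
    const float* scores = output->data<float>();
    printf("first score: %f\n", scores[0]);
    return 0;
}
```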

Key Drawbacks

  • Still transitioning from the older Paddle-Mobile to the rebranded Paddle-Lite
  • Documentation is reportedly not beginner-friendly (referring to version 1; not sure whether version 2 improved): Zhihu, Developer Reply

Paddle-Lite benchmark, Paddle-Lite Demo, Release Blog

Bottom Line

  • An interesting framework for testing Kirin NPU (Kirin 970: 1.92 TFLOPS) and RK NPU (RK1808, RK1806; not currently RK3399Pro) performance. However, those chips currently cannot be miniaturised.
  • Hi35xx chips (Hi3559A NPU: dual-core NNIE; Hi3516A: 2 TOPS) are not supported; we would probably need to use NNIE instead (which takes Caffe format)
  • Still good to use for its complete GPU support

NCNN by Tencent

Key Features

  • Designed to be lightweight (library < 1 MB)
  • Optimised memory access; written entirely in C++
  • ARM NEON assembly optimisation; ARM big.LITTLE CPU optimisation
  • Utilises the Vulkan API for GPU acceleration (see the sketch after this list)
  • Supports import from Caffe/PyTorch/MXNet/ONNX
  • QQ support group: 637093648
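
A minimal sketch of Vulkan-accelerated inference through ncnn's C++ API; the model files (produced by a converter such as onnx2ncnn) and the blob names "data"/"prob" are placeholders for whatever the converted network defines:

```cpp
#include <ncnn/net.h>  // ncnn's core Net/Extractor API

int main() {
    ncnn::Net net;
    net.opt.use_vulkan_compute = true;  // ask for the Vulkan GPU backend

    // Files produced by the converters, e.g. `onnx2ncnn model.onnx ...`
    net.load_param("model.param");
    net.load_model("model.bin");

    // Dummy 224x224 3-channel input; real code would convert an image
    ncnn::Mat in(224, 224, 3);
    in.fill(0.5f);

    ncnn::Extractor ex = net.create_extractor();
    ex.input("data", in);     // input blob name depends on the model

    ncnn::Mat out;
    ex.extract("prob", out);  // run the graph and fetch the output blob
    return 0;
}
```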

Key Drawbacks

  • Focused on the Android platform, where it has many users, but largely untested on Linux: https://blog.csdn.net/yuanlulu/article/details/87902106
  • Compilation instructions cover HiSilicon (Hi35xx) and ARM64, but it is unclear whether GPU and NPU acceleration are enabled

Tencent claims that ncnn's CPU optimisation is quite good (the fastest among open-source frameworks) and might even outperform the built-in GPU.

Bottomline

  • Claims to be fast, with good CPU optimisation
  • Could be a good GPU benchmark as well (using Vulkan instead of OpenCL)

Mali-G72 looks good with ncnn

MNN by Alibaba

Publication

Key Features

  • Claimed CPU assembly optimisation
  • Android: OpenCL, Vulkan, and OpenGL support (very comprehensive; see the sketch after this list)
  • Appears to support ordinary Linux too, per the docs
  • Lightweight
  • Supports TensorFlow, TensorFlow Lite, Caffe, and ONNX (PyTorch/MXNet)
  • Feels research-oriented
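
A minimal sketch of MNN's C++ session API with the Vulkan backend requested (model.mnn, produced offline by MNNConvert, is a placeholder; MNN should fall back to the CPU backend if Vulkan is unavailable):

```cpp
#include <MNN/Interpreter.hpp>
#include <memory>

int main() {
    // Load a model converted to .mnn format by MNNConvert
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("model.mnn"));

    // Request the Vulkan backend for this session
    MNN::ScheduleConfig config;
    config.type = MNN_FORWARD_VULKAN;
    MNN::Session* session = net->createSession(config);

    // Default input tensor (a name can be passed instead of nullptr)
    MNN::Tensor* input = net->getSessionInput(session, nullptr);
    // ... copy preprocessed image data into `input` here ...
    (void)input;

    net->runSession(session);

    MNN::Tensor* output = net->getSessionOutput(session, nullptr);
    (void)output;
    net->releaseSession(session);
    return 0;
}
```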

Roadmap

Key Drawbacks

  • Only recently released as open source (2019?) (semi-automated search architecture for better mobile deployment)
  • The feature list looks too good to be true, but let's hope for the best.

Bottomline

  • The paper is worth reading
  • A new option with good potential

blog

MXNet by Amazon

https://github.com/apache/incubator-mxnet

Key Features

  • Used in universities to teach deep learning classes (famous book: Dive into Deep Learning)
  • Great documentation; looks easy to get started with in Python
  • Integration with TVM

Key Drawbacks

  • Seems to have no support for ARM GPUs or NPUs
  • Comparison with TVM from the TVM blog back in 2018; the results are not good for MXNet.

Bottomline

  • Not for us; it targets bigger machines and clusters of machines.

Mace by Xiaomi

Not recommended.

  • Does not support CUDA
  • Does not support the popular Raspberry Pi
  • Hard to find non-Android documentation.
  • Xiaomi's strength is in Qualcomm SoCs
  • Not as big a community using it

Tengine by OPEN AI Lab (Supported by ARM)

https://github.com/OAID/Tengine

Not recommended. Only ARM CPU acceleration? But it claims to be fast?

ARM NN

Not recommended. We should not go here; again, it is platform-dependent!

It should be based on the ARM Compute Library.

SenseTime Parrots (PPL)

Closed source, but claimed to have the best performance among commercial solutions.

TVM (An Aggressive Step: Auto Tuning)

TVM paper

Key Features

  • Supports ARM GPUs (CUDA, OpenCL, or Vulkan backends; see the deployment sketch after this list)
  • Could add custom accelerator support through VTA (e.g. FPGAs)
  • Uses machine learning to search for the best kernel configurations on the heterogeneous hardware
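
For deployment, a compiled model can be driven from TVM's C++ runtime. A rough sketch loosely following TVM's deployment examples; the library name mobilenet.so, the "default" graph-executor factory entry point, and the input name "data" are assumptions about how the model was compiled and exported:

```cpp
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/packed_func.h>

int main() {
    // Load a graph-executor factory compiled offline, e.g. with
    // target "opencl -device=mali" for an ARM Mali GPU
    tvm::runtime::Module factory =
        tvm::runtime::Module::LoadFromFile("mobilenet.so");

    // Instantiate the executor on the first OpenCL device
    DLDevice dev{kDLOpenCL, 0};
    tvm::runtime::Module gmod = factory.GetFunction("default")(dev);
    tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
    tvm::runtime::PackedFunc run = gmod.GetFunction("run");
    tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");

    // Dummy 1x3x224x224 float32 input allocated on the device
    tvm::runtime::NDArray input = tvm::runtime::NDArray::Empty(
        {1, 3, 224, 224}, DLDataType{kDLFloat, 32, 1}, dev);

    set_input("data", input);  // input name depends on the model
    run();
    tvm::runtime::NDArray output = get_output(0);
    (void)output;
    return 0;
}
```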

Key Drawbacks

  • Possibly too many things varying at the same time; hard to debug?

Integrating TVM into PyTorch https://tvm.apache.org/2019/05/30/pytorch-frontend

Intellifusion (云天励飞) builds on TVM: https://zhuanlan.zhihu.com/p/91826247 https://www.intellif.com/int/product/list15.html

Optimizing Mobile Deep Learning on ARM GPU with TVM https://tvm.apache.org/2018/01/16/opt-mali-gpu

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs https://arxiv.org/pdf/1907.02154.pdf