About Vulkan and the Future of High-Performance Computing

Towards a heterogeneous & parallel computing architecture

OpenCL and Vulkan will probably converge into the same thing in the future. [1][2]

To utilise the power of Vulkan (and OpenCL) we do not have to write the code ourselves, especially for deep learning purposes.

Note that some frameworks also utilise OpenGL compute shaders, a higher-level API.

Mali-G series GPUs support the OpenCL 2.0 full profile.
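
As a quick sanity check, the version and profile a given GPU driver actually reports can be queried through the standard OpenCL API. A minimal sketch (error handling omitted; assumes the OpenCL headers and an ICD loader are installed):

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    // Grab the first platform and its first GPU device
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // Query the supported OpenCL version and profile strings
    char version[256], profile[256];
    clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_PROFILE, sizeof(profile), profile, nullptr);

    // Expect something like "OpenCL 2.0 ..." and "FULL_PROFILE" on Mali-G
    printf("%s / %s\n", version, profile);
    return 0;
}
```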

Good Reads for Mobile DL

Neural Network Inference on Mobile SoCs https://arxiv.org/pdf/1908.11450.pdf

A First Look at Deep Learning Apps on Smartphones https://arxiv.org/pdf/1812.05448.pdf

AI Benchmark: Running Deep Neural Networks on Android Smartphones https://arxiv.org/pdf/1810.01109.pdf

  • a good review of mobile SoCs (hardware acceleration, different SDKs)

Chips for Consideration

without NPUs

NXP

i.MX6 or i.MX8 - popular for general-purpose and multimedia usage.

with NPUs

Amlogic

Amlogic A311D (with NPU) (Khadas VIM3)

Amlogic S922X (ODROID-N2)

Amlogic provides its own NPU SDK; boards come from third-party manufacturers. No known popular usage.

Rockchip

Popular chips used in hobbyist and multimedia products. Rockchip contributes significantly to the open-source community.

Rockchip is upgrading its product line in 2020 with the RK3588, as the flagship RK3399 is getting dated. The new chips are to be equipped with NPUs.

RK1808 and RK1806 are supported by Baidu Paddle. Only a single camera input is supported.

MediaTek (APU)

MT8168, MT8175 (Mali-G52, with a 0.3 TOPS APU, 2 CSI camera interfaces)

  • the only chips available?

MediaTek NeuroPilot SDK (Android only)

Supported by Baidu Paddle using API conversion.

Newer series

MediaTek Helio

  • P60: 280 GMAC/s
  • P90: APU 2.0 @ 1165 GMAC/s

No development boards are available.

Kirin

Kirin 970 and Kirin 980 use NPU IP from Cambricon Technologies (supported by HiAI DDK (v100), but Android only?). Kirin 810, 820, and 990 use the Da Vinci NPU.

The Da Vinci NPU is supported by Baidu Paddle using API conversion. No development board is available.

HiSilicon

Hi3516A(V300) - NNIE, 1.0 TOPS (ARM A7)

Hi3519A(V100) - NNIE, 2.0 TOPS @ 2W (ARM A53) Taobao

Hi3559A(V100) - dual NNIE @ 840 MHz (much more powerful CPU: ARM A73 + A53) Taobao (huge size)

http://www.hisilicon.com/en/Products/ProductList/Camera

Custom SDK; may not be easy to use.

Benchmark, Other Blog: YOLOv3 runs at 8 fps, slightly better than a TX1? Inferior to a TX2.

Horizon Robotics

BOOTPRINT X2 96Boards (Sunrise 2.0 AI edge processor, 4 TOPS @ 2W)

There is uncertainty in relying on a start-up's product.

Bottomline

  • Current multimedia SoCs are adding NPUs quickly, but mainly with custom SDKs
  • The sustainability of start-up SoCs is still largely uncertain
  • Do not expect too much from built-in NPU performance; treat it more as off-loading

DL Framework Comparisons

Overview of mobile DL frameworks: https://easyai.tech/blog/10-mobil-deeplearning-frame/

Deployment options for AI neural network model inference on embedded Linux platforms: https://www.jianshu.com/p/d4425b65c6e6

Future of GPU-based High Performance Computing (NPU to replace GPU) https://zhuanlan.zhihu.com/p/114254288

TensorFlow Lite

Not recommended.

Mainly focused on Android and iOS, so not friendly for our robotics use; less documentation and popularity elsewhere.

Blog: TensorFlow Lite Now Faster with Mobile GPUs. This post shows that only Android and iOS are officially supported (basically what Google has in mind).

Note: the full version of TensorFlow can run with custom compilation from source (GitHub). There is no official support, and it is probably CPU-only.

PyTorch

Not possible: it requires CUDA as the sole option for GPU acceleration.

Paddle-Lite by Baidu

Key Features

  • Official support for Mali GPUs (OpenCL), Adreno GPUs, and Apple Metal GPUs (see the usage sketch after this list)
  • Official support for Kirin NPU, MTK APU, and RK NPU
  • Future support includes Cambricon and Bitmain
  • Available in both Lite and full (CUDA) versions; tested on a Jetson TX2
  • Supports YOLOv3 since version 2.0 (launched in late 2019)
  • 5k GitHub stars; QQ support group 696965088
  • Many improvements, tricks, and tools such as x2paddle
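
For reference, a minimal sketch of what Paddle-Lite inference looks like through the 2.x C++ API, following the pattern in the Paddle-Lite Demo. The model name mobilenet_v1.nb (produced offline by the opt converter) and the 1x3x224x224 input shape are placeholder assumptions:

```cpp
#include <paddle_api.h>  // Paddle-Lite C++ inference API
#include <cstdio>

using namespace paddle::lite_api;

int main() {
    // Load a model already optimised into .nb format by the `opt` tool
    MobileConfig config;
    config.set_model_from_file("mobilenet_v1.nb");
    auto predictor = CreatePaddlePredictor<MobileConfig>(config);

    // Fill a dummy NCHW float input; real code would copy image data here
    auto input = predictor->GetInput(0);
    input->Resize({1, 3, 224, 224});
    float* data = input->mutable_data<float>();
    for (int i = 0; i < 1 * 3 * 224 * 224; ++i) data[i] = 0.5f;

    predictor->Run();

    // Read back the scores from the first output tensor
    auto output = predictor->GetOutput(0);
    const float* scores = output->data<float>();
    printf("first score: %f\n", scores[0]);
    return 0;
}
```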

Key Drawbacks

  • Still transitioning from the older Paddle-Mobile to the rebranded Paddle-Lite
  • Documentation is reportedly not beginner-friendly (referring to version 1; not sure whether version 2 improved): Zhihu, Developer Reply

Paddle-Lite benchmark, Paddle-Lite Demo, Release Blog

Bottom Line

  • An interesting framework for testing Kirin NPU (Kirin 970: 1.92 TFLOPS) and RK NPU (RK1808, RK1806; not currently RK3399Pro) performance. However, those chips currently cannot be miniaturised.
  • Hi35xx chips (Hi3559A NPU: dual-core NNIE; Hi3516A: 2 TOPS) are not supported; we would probably need to use NNIE instead (which takes Caffe format)
  • Still good to use for its complete GPU support

NCNN by Tencent

Key Features

  • Designed to be lightweight (library < 1 MB)
  • Optimised memory access; written entirely in C++
  • ARM NEON assembly optimisation; ARM big.LITTLE CPU optimisation
  • Utilises the Vulkan API for GPU acceleration (see the sketch after this list)
  • Supports import from Caffe/PyTorch/MXNet/ONNX
  • QQ support group: 637093648
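
A minimal sketch of Vulkan-accelerated inference through ncnn's C++ API; the model files (produced by a converter such as onnx2ncnn) and the blob names "data"/"prob" are placeholders for whatever the converted network defines:

```cpp
#include <ncnn/net.h>  // ncnn's core Net/Extractor API

int main() {
    ncnn::Net net;
    net.opt.use_vulkan_compute = true;  // ask for the Vulkan GPU backend

    // Files produced by the converters, e.g. `onnx2ncnn model.onnx ...`
    net.load_param("model.param");
    net.load_model("model.bin");

    // Dummy 224x224 3-channel input; real code would convert an image
    ncnn::Mat in(224, 224, 3);
    in.fill(0.5f);

    ncnn::Extractor ex = net.create_extractor();
    ex.input("data", in);     // input blob name depends on the model

    ncnn::Mat out;
    ex.extract("prob", out);  // run the graph and fetch the output blob
    return 0;
}
```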

Key Drawbacks

  • Focused on the Android platform, where it has many users, but largely untested on Linux: https://blog.csdn.net/yuanlulu/article/details/87902106
  • Compilation instructions cover HiSilicon (Hi35xx) and ARM64, but it is unclear whether GPU and NPU acceleration are enabled

Tencent claims that ncnn's CPU optimisation is quite good (the fastest among open-source frameworks) and might even outperform the built-in GPU.

Bottomline

  • Claims to be fast, with good CPU optimisation
  • Could be a good GPU benchmark as well (using Vulkan instead of OpenCL)

Mali-G72 looks good with ncnn

MNN by Alibaba

Publication

Key Features

  • Claimed CPU assembly optimisation
  • Android: OpenCL, Vulkan, and OpenGL support (very comprehensive; see the sketch after this list)
  • Appears to support ordinary Linux too, per the docs
  • Lightweight
  • Supports TensorFlow, TensorFlow Lite, Caffe, and ONNX (PyTorch/MXNet)
  • Feels research-oriented
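
A minimal sketch of MNN's C++ session API with the Vulkan backend requested (model.mnn, produced offline by MNNConvert, is a placeholder; MNN should fall back to the CPU backend if Vulkan is unavailable):

```cpp
#include <MNN/Interpreter.hpp>
#include <memory>

int main() {
    // Load a model converted to .mnn format by MNNConvert
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("model.mnn"));

    // Request the Vulkan backend for this session
    MNN::ScheduleConfig config;
    config.type = MNN_FORWARD_VULKAN;
    MNN::Session* session = net->createSession(config);

    // Default input tensor (a name can be passed instead of nullptr)
    MNN::Tensor* input = net->getSessionInput(session, nullptr);
    // ... copy preprocessed image data into `input` here ...
    (void)input;

    net->runSession(session);

    MNN::Tensor* output = net->getSessionOutput(session, nullptr);
    (void)output;
    net->releaseSession(session);
    return 0;
}
```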

Roadmap

Key Drawbacks

  • Only recently released as open source (2019?) (semi-automated search architecture for better mobile deployment)
  • The feature list looks too good to be true, but let's hope for the best.

Bottomline

  • The paper is worth reading
  • A new option with good potential

blog

MXNet by Amazon

https://github.com/apache/incubator-mxnet

Key Features

  • Used in universities to teach deep learning classes (famous book: Dive into Deep Learning)
  • Great documentation; looks easy to get started with in Python
  • Integration with TVM

Key Drawbacks

  • Seems to have no support for ARM GPUs or NPUs
  • Comparison with TVM from the TVM blog back in 2018; the results are not good for MXNet.

Bottomline

  • Not for us; it targets bigger machines and clusters of machines.

Mace by Xiaomi

Not recommended.

  • Does not support CUDA
  • Does not support the popular Raspberry Pi
  • Hard to find non-Android documentation.
  • Xiaomi's strength is in Qualcomm SoCs
  • Not as big a community using it

Tengine by OPEN AI Lab (Supported by ARM)

https://github.com/OAID/Tengine

Not recommended. Only ARM CPU acceleration? But it claims to be fast?

ARM NN

Not recommended. We should not go here; again, it is platform-dependent!

It should be based on the ARM Compute Library.

SenseTime Parrots (PPL)

Closed source, but claimed to have the best performance among commercial solutions.

TVM (An Aggressive Step: Auto Tuning)

TVM paper

Key Features

  • Supports ARM GPUs (CUDA, OpenCL, or Vulkan backends; see the deployment sketch after this list)
  • Could add custom accelerator support through VTA (e.g. FPGAs)
  • Uses machine learning to search for the best kernel configurations on the heterogeneous hardware
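
For deployment, a compiled model can be driven from TVM's C++ runtime. A rough sketch loosely following TVM's deployment examples; the library name mobilenet.so, the "default" graph-executor factory entry point, and the input name "data" are assumptions about how the model was compiled and exported:

```cpp
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/packed_func.h>

int main() {
    // Load a graph-executor factory compiled offline, e.g. with
    // target "opencl -device=mali" for an ARM Mali GPU
    tvm::runtime::Module factory =
        tvm::runtime::Module::LoadFromFile("mobilenet.so");

    // Instantiate the executor on the first OpenCL device
    DLDevice dev{kDLOpenCL, 0};
    tvm::runtime::Module gmod = factory.GetFunction("default")(dev);
    tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
    tvm::runtime::PackedFunc run = gmod.GetFunction("run");
    tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");

    // Dummy 1x3x224x224 float32 input allocated on the device
    tvm::runtime::NDArray input = tvm::runtime::NDArray::Empty(
        {1, 3, 224, 224}, DLDataType{kDLFloat, 32, 1}, dev);

    set_input("data", input);  // input name depends on the model
    run();
    tvm::runtime::NDArray output = get_output(0);
    (void)output;
    return 0;
}
```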

Key Drawbacks

  • Possibly too many things varying at the same time; hard to debug?

Integrating TVM into PyTorch https://tvm.apache.org/2019/05/30/pytorch-frontend

Intellifusion (云天励飞) builds on TVM: https://zhuanlan.zhihu.com/p/91826247 https://www.intellif.com/int/product/list15.html

Optimizing Mobile Deep Learning on ARM GPU with TVM https://tvm.apache.org/2018/01/16/opt-mali-gpu

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs https://arxiv.org/pdf/1907.02154.pdf