Machine Learning On Edge Devices: Benchmark Report

This article was originally published at Tryolabs’ website. It is reprinted here with the permission of Tryolabs.

Why edge computing?

Humans are generating and collecting more data than ever. We have devices in our pockets that facilitate the creation of huge amounts of data, such as photos, GPS coordinates, audio, and all kinds of personal information we consciously and unconsciously reveal.

Moreover, beyond the data we generate as individuals for personal reasons, data is also being collected, often without our knowledge, by traffic and mobility control systems, video surveillance units, satellites, smart cars, and an ever-growing array of smart devices.

This trend is here to stay and will continue to rise exponentially. In terms of data points, the International Data Corporation (IDC) predicts that the collective sum of the world’s data will grow from 33 zettabytes (ZB) in 2019 to 175 ZB by 2025, an annual growth rate of 61%.

While we’ve been processing data, first in data centers and then in the cloud, these solutions are not suitable for highly demanding tasks with large data volumes. Network capacity and speed are pushed to the limit and new solutions are required. This is the beginning of the era of edge computing and edge devices.

In this report, we’ll benchmark five novel edge devices, using different frameworks and models, to see which combinations perform best. In particular, we’ll focus on performance outcomes for machine learning on the edge.

What is edge computing?

Edge computing consists of delegating data processing tasks to devices on the edge of the network, as close as possible to the data sources. This enables real-time data processing at a very high speed, which is a must for complex IoT solutions with machine learning capabilities. On top of that, it mitigates network limitations, reduces energy consumption, increases security, and improves data privacy.

Under this new paradigm, the combination of specialized hardware and software libraries optimized for machine learning on the edge results in cutting-edge applications and products ready for mass deployment.

The biggest challenges to building these amazing applications are posed by audio, video, and image processing tasks. Deep learning techniques have proven to be highly successful in overcoming these difficulties.

Enabling deep learning on the edge

As an example, let’s take self-driving cars. Here, you need to quickly and consistently analyze incoming data, in order to decipher the world around you and take action within a few milliseconds. Addressing that time constraint is why we cannot rely on the cloud to process the stream of data but instead must do it locally.

The downside of doing it locally is that the hardware is not as powerful as a supercomputer in the cloud, yet we cannot compromise on accuracy or speed.

The solution to this is either stronger, more efficient hardware, or less complex deep neural networks. To obtain the best results, a balance of the two is essential.

Therefore, the real question is: Which edge hardware and what type of network should we bring together in order to maximize the accuracy and speed of deep learning algorithms?

In our quest to identify the optimal combination of the two, we compared several state-of-the-art edge devices in combination with different deep neural network models.

Benchmarking novel edge devices

Based on what we think is the most innovative use case, we set out to measure inference throughput in real-time via a one-at-a-time image classification task, so as to get an approximate frames-per-second score.

To accomplish this, we evaluated top-1 inference accuracy across all categories of a specific subset of ImageNet V2 for several ConvNet models and, where possible, with different frameworks and optimized versions.

Hardware accelerators

While there has been much effort invested over the last few years to improve existing edge hardware, we chose to experiment with these new kids on the block:

Nvidia Jetson Nano
Google Coral Dev Board
Intel Neural Compute Stick
Raspberry Pi (upper bound reference)
2080ti NVIDIA GPU (lower bound reference)

We included the Raspberry Pi and the Nvidia 2080ti so as to be able to compare the tested hardware against well-known systems, one cloud-based and one edge-based.

The lower bound was a no-brainer. Here at Tryolabs, we design and train our own deep learning models. Because of this, we have a lot of computing power at our disposal. So, we used it. To set this lower bound on inference times, we ran the tests on a 2080ti NVIDIA GPU. However, because we were only going to use it as a reference point, we ran the tests using basic models, with no optimizations.

For the upper bound, we went with the defending champion, the most popular single-board computer: the Raspberry Pi 3B.

Neural network models

There are two main networks we wanted to include in this benchmark: the old, well-known, seasoned Resnet-50 and the novel EfficientNets released by Google this year.

For all benchmarks, we used publicly available pre-trained models, which we ran with different frameworks. On the Nvidia Jetson, we tried the TensorRT optimization; on the Raspberry Pi, we used TensorFlow and PyTorch variants; on the Coral devices, we ran the Edge TPU engine versions of the S, M, and L EfficientNet models; and on the Intel devices, we used ResNet-50 compiled with the OpenVINO Toolkit.

The dataset

Because all models were trained on the ImageNet dataset, we used ImageNet V2 MatchedFrequency. It consists of 10,000 images in 1,000 categories.

We ran the inference on each image once, saved the inference time, and then found the average. We calculated the top-1 accuracy from all tests, as well as the top-5 accuracy for certain models.
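The measurement procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the harness used for the benchmark: `infer` is a hypothetical stand-in for whatever call runs a single image through a given framework, and `images` is any iterable of already-loaded inputs.

```python
import time
from statistics import mean


def average_inference_ms(infer, images):
    """Run `infer` once per image and return the mean latency in milliseconds.

    `infer` is a hypothetical callable wrapping one forward pass; only the
    time spent inside that call is measured.
    """
    timings_ms = []
    for image in images:
        start = time.perf_counter()
        infer(image)  # prediction is ignored here; we only time the call
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(timings_ms)


if __name__ == "__main__":
    # Dummy "model" so the sketch runs without any ML framework installed.
    def fake_infer(image):
        return sum(image)

    fake_images = [[0.1] * 1000 for _ in range(50)]
    print(f"average: {average_inference_ms(fake_infer, fake_images):.4f} ms")
```

Timing only the inference call keeps image loading and preprocessing out of the average. Whether those steps should count depends on what you want the number to represent.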

Top-1 accuracy: this is conventional accuracy, meaning that the model’s answer (the one with the highest probability) must equal the exact expected answer.

Top-5 accuracy: means that any one of the model’s top five highest-probability answers must match the expected answer.
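As a small illustration of the two definitions, the sketch below computes top-k accuracy from per-class scores in plain Python. The function name and the toy scores are made up for the example. A real evaluation would take the scores from the model's output layer.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one per sample
    labels: list of true class indices, one per sample
    """
    hits = 0
    for class_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda i: class_scores[i],
                        reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy data: 3 samples, 3 classes (illustrative values only).
    toy_scores = [[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2],
                  [0.2, 0.3, 0.5]]
    toy_labels = [1, 1, 0]
    print("top-1:", top_k_accuracy(toy_scores, toy_labels, k=1))
    print("top-2:", top_k_accuracy(toy_scores, toy_labels, k=2))
```

With k=1 this reduces to conventional accuracy, and larger values of k can only keep the score the same or raise it.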

Something to keep in mind when comparing the results: for fast device-model combinations, we ran the tests incorporating the entire dataset, whereas we only used parts of datasets for the slower combinations.

Results & analysis

The dashboards below display the metrics obtained from the experiments. Due to the large difference in inference times across models and devices, the values are plotted on a logarithmic scale.

[Figures: average inference time by device; average accuracy by model; average inference time by framework; accuracy vs. inference time]

Device	Framework	Model	Inference Time (ms)	Accuracy (top-1) %	Accuracy (top-5) %
Jetson Nano	TF-TensorRT-PyTorch	Resnet-50	2.67	64.30	NaN
GTX-2080ti	PyTorch	Resnet-50	4.41	64.85	NaN
GTX-2080ti	TensorFlow	Efficientnet-B0	4.86	65.05	NaN
Coral (Dev. Board)	Edge TPU	Efficientnet-S	5.42	33.79	55.45
GTX-2080ti	PyTorch	Efficientnet-B0	7.28	63.25	NaN
GTX-2080ti	TensorFlow	Efficientnet-B3	8.60	70.40	NaN
Coral (Dev. Board)	Edge TPU	Efficientnet-M	10.72	45.53	67.67
GTX-2080ti	PyTorch	Efficientnet-B3	11.13	69.80	NaN
GTX-2080ti	TensorFlow	Resnet-50	11.40	54.30	NaN
Coral (USB Accelerator)	Edge TPU	Efficientnet-S	13.18	33.79	55.45
Coral (Dev. Board)	Edge TPU	Efficientnet-L	30.72	42.92	65.06
Jetson Nano	PyTorch	Resnet-50	35.67	64.30	NaN
Intel NCS2	OpenVINO	Resnet-50	58.40	48.00	72.80
Coral (USB Accelerator)	Edge TPU	Efficientnet-M	59.62	45.53	67.67
Jetson Nano	PyTorch	Efficientnet-B0	59.97	61.64	NaN
Jetson Nano	PyTorch	Efficientnet-B3	96.94	67.97	NaN
Jetson Nano	TensorFlow	Efficientnet-B0	98.78	79.61	NaN
Jetson Nano	TF-TensorRT	Efficientnet-B0	154.13	79.61	NaN
Jetson Nano	TF-TensorRT	Resnet-50	176.50	54.10	NaN
Intel Movidius	OpenVINO	Resnet-50	186.40	47.90	72.80
Jetson Nano	TensorFlow	Resnet-50	223.73	54.10	NaN
Coral (USB Accelerator)	Edge TPU	Efficientnet-L	225.28	42.92	65.06
Jetson Nano	TensorFlow	Efficientnet-B3	246.26	85.44	NaN
Jetson Nano	PyTorch	Efficientnet-B5	261.27	72.03	NaN
Jetson Nano	TF-TensorRT	Efficientnet-B3	327.28	85.44	NaN
Raspberry Pi 3B	TensorFlow	Efficientnet-B0	539.15	80.00	NaN
Raspberry Pi 3B	TensorFlow	Resnet-50	1660.55	67.00	NaN
Raspberry Pi 3B	TensorFlow	Efficientnet-B3	1891.05	81.00	NaN
Raspberry Pi 3B	PyTorch	Resnet-50	3915.22	79.00	NaN
Raspberry Pi 3B	PyTorch	Efficientnet-B0	6908.80	78.00	NaN
Raspberry Pi 3B	PyTorch	Efficientnet-B3	13685.62	79.00	NaN

Inference time winner #1: Jetson Nano

In terms of inference time, the winner is the Jetson Nano in combination with ResNet-50, TensorRT, and PyTorch. It finished in 2.67 milliseconds, which is 375 frames per second.
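The frames-per-second figure follows directly from the average latency, since images are processed one at a time. A minimal sketch of that conversion (the function name is ours, not part of any benchmark tooling):

```python
def latency_ms_to_fps(latency_ms):
    """Convert an average per-image latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Latencies taken from the results table above.
    for name, ms in [("Jetson Nano, Resnet-50", 2.67),
                     ("Coral Dev. Board, Efficientnet-S", 5.42)]:
        print(f"{name}: {latency_ms_to_fps(ms):.1f} fps")
```

This assumes no batching or pipelining. With either, throughput could be higher than the reciprocal of single-image latency.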

This result was surprising since it outperformed the inference rate publicized by NVIDIA by a factor of 10. This difference in results is most likely related to the fact that NVIDIA used TensorFlow instead of PyTorch.

Inference time winner #2: Coral Dev. Board

Coming in second was the combination of Coral Dev. Board with EfficientNet-S. It finished in 5.42 milliseconds, which is 185 frames per second.

These results match the 5.5 milliseconds and 182 frames per second promised by Google.

Even though speed was high for this combination, accuracy was not. We couldn’t acquire the exact validation set used by Google for accuracy reporting, but one hypothesis is that they applied different image preprocessing transformations than we did. Since quantized 8-bit models are very sensitive to image preprocessing, this could have had a major impact on the results.
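To see why preprocessing matters so much for 8-bit models, consider the standard affine quantization mapping, q = round(x / scale) + zero_point, clamped to the 0..255 range. The sketch below pushes the same raw pixel through two common normalization conventions and quantizes both. The scale and zero point are invented for illustration, since real values are stored in each model file, and we do not know which convention Google's models actually expect.

```python
def quantize_uint8(x, scale, zero_point):
    """Affine quantization of a real value to an unsigned 8-bit integer."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))  # clamp to the representable range


def normalize_minus1_to_1(pixel):
    """Map a raw 0..255 pixel to roughly [-1, 1]."""
    return pixel / 127.5 - 1.0


def normalize_0_to_1(pixel):
    """Map a raw 0..255 pixel to [0, 1]."""
    return pixel / 255.0


if __name__ == "__main__":
    # Hypothetical input quantization parameters for a model expecting [-1, 1].
    SCALE, ZERO_POINT = 1 / 128, 128
    raw_pixel = 200
    expected = quantize_uint8(normalize_minus1_to_1(raw_pixel), SCALE, ZERO_POINT)
    mismatched = quantize_uint8(normalize_0_to_1(raw_pixel), SCALE, ZERO_POINT)
    print("matching preprocessing:  ", expected)
    print("mismatched preprocessing:", mismatched)
```

Under these assumed parameters, the two conventions feed the network noticeably different integer values for the same pixel. This is the kind of mismatch the hypothesis above refers to.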

Accuracy winner: Jetson Nano

The best results in terms of accuracy came from the Jetson Nano in combination with TF-TRT and EfficientNet-B3, which attained an accuracy of over 85%. However, these results are relative, since we tested some models on a bigger portion of the dataset than others.

We can see that accuracy is higher when the models are evaluated on smaller subsets, and lower when the entire dataset is used. This is because we didn’t randomly sample the smaller subsets, so their images were not adequately balanced.
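One way to avoid this bias in a follow-up run would be to draw the smaller subsets at random with the same number of images per category. The sketch below is a suggestion, not what we did in this benchmark. The function name, the `(path, label)` tuple layout, and the file names are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(samples, per_class, seed=0):
    """Randomly pick up to `per_class` samples from every label.

    samples: list of (image_path, label) tuples
    Returns a shuffled list of (image_path, label) tuples.
    """
    rng = random.Random(seed)  # fixed seed so every device sees the same subset
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    subset = []
    for label, paths in by_label.items():
        chosen = rng.sample(paths, min(per_class, len(paths)))
        subset.extend((path, label) for path in chosen)
    rng.shuffle(subset)
    return subset


if __name__ == "__main__":
    # Toy dataset: 4 classes with 10 images each (placeholder file names).
    toy = [(f"class{c}/img{i}.jpg", c) for c in range(4) for i in range(10)]
    picked = balanced_subset(toy, per_class=3)
    print(len(picked), "images selected")
```

Reusing the same seed on every device and model combination would make the reduced runs comparable with each other, although still not with the full-dataset runs.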

Usability of hardware accelerators

Concerning the usability of these devices, developers will note some major differences.

The Jetson was the most flexible when it came to selecting and employing precompiled models and frameworks. Intel sticks come in second since they provide good libraries, many models, and cool projects. Moreover, the sticks have massively improved between the first and second editions. The only drawback is that their vast library, OpenVINO, is only supported on Ubuntu 16.04 and not by later Linux OS versions.

Compared to Jetson and Intel sticks, Coral devices present some limitations. If you want to run non-official models on it, you have to convert them to TensorFlow Lite, then quantize and compile them for Edge TPU. Depending on the model, this conversion might not be feasible. Nevertheless, we expect improvements for this Google device with future generations.

Conclusions

The research presented here is based on our exploration of state-of-the-art edge computing devices designed for deep learning algorithms.

We found that the Jetson Nano and Coral Dev. Board performed very well in terms of inference time.

In terms of accuracy, the Jetson Nano once again achieved great results, though these figures are relative because not every combination was tested on the full dataset.

Given the overall performance of the Jetson Nano, it was our clear winner.

However, we must mention that we couldn’t test the Jetson Nano and Coral with the same model, due to their different design. We believe that each device will have its own best-case scenario, depending on the specific task to be completed.

We encourage you to perform a detailed benchmarking as it pertains to your specific tasks, and share your results and conclusions in the comments section below. Further research of interest could include the design and training of your own model, utilizing quantization-aware training.

This blog post was written in collaboration with Guillermo Ripa, Full-stack Developer at Tryolabs.

Juan Pablo González
Research and Machine Learning Engineer, Tryolabs
