



Caffe to Zynq: State-of-the-Art Machine Learning Inference Performance in Less Than 5 Watts

Vinod Kathail, Distinguished Engineer

May 24, 2017

MACHINE LEARNING | COMPUTER VISION | SENSOR FUSION | CONNECTIVITY



Agenda

- Why Zynq SoCs for Deep Learning Inference
- Caffe to Zynq SoC in Seconds
- A Full System Example



### Diverse Applications with Diverse Design Targets



## Zynq Offers the Most Efficient Deep Learning Inference




## Zynq SoCs Offer Superior Throughput, Latency

Machine learning inference: 6x images/s/watt

| GoogLeNet @ batch = 1 | Xilinx ZU9 | Xilinx ZU5 | eGPU* |
|-----------------------|------------|------------|-------|
| Images/s              | 370.0      | 155.0      | 70    |
| Power (W)             | 7.0        | 4.5        | 7.9   |
| Images/s/watt         | 53.0       | 34.5       | 8.9   |

Real-time applications: 1/5 the latency

| GoogLeNet @ batch = 1 | Xilinx ZU9 | Xilinx ZU5 | eGPU* |
|-----------------------|------------|------------|-------|
| Images/s              | 370.0      | 155.0      | 70    |
| Latency (ms)          | 2.7        | 6.4        | 14.2  |

| GoogLeNet @ batch = 8 | Xilinx ZU9 | Xilinx ZU5 | eGPU* |
|-----------------------|------------|------------|-------|
| Images/s              | 370.0      | 155.0      | 163   |
| Latency (ms)          | 2.7        | 6.4        | 49.0  |

For large batch sizes, CPU/GPU/DSP latency increases.

Source: Xilinx benchmark
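The headline ratios on this slide can be rechecked from the GoogLeNet figures. The following is a minimal sketch; the dictionary layout and function name are illustrative choices, and only the numeric values come from the slide.

```python
# Figures copied from the GoogLeNet @ batch = 1 benchmark on this slide.
benchmarks = {
    "Xilinx ZU9": {"images_per_s": 370.0, "power_w": 7.0, "latency_ms": 2.7},
    "Xilinx ZU5": {"images_per_s": 155.0, "power_w": 4.5, "latency_ms": 6.4},
    "eGPU": {"images_per_s": 70.0, "power_w": 7.9, "latency_ms": 14.2},
}


def images_per_s_per_watt(entry):
    """Throughput divided by power: the efficiency metric used on the slide."""
    return entry["images_per_s"] / entry["power_w"]


efficiency = {name: images_per_s_per_watt(e) for name, e in benchmarks.items()}

# ZU9 efficiency relative to the eGPU (the slide quotes "6x").
efficiency_gain = efficiency["Xilinx ZU9"] / efficiency["eGPU"]

# ZU9 latency relative to the eGPU (the slide quotes "1/5").
latency_ratio = (
    benchmarks["Xilinx ZU9"]["latency_ms"] / benchmarks["eGPU"]["latency_ms"]
)

for name, value in efficiency.items():
    print(f"{name}: {value:.1f} images/s/watt")
print(f"Efficiency gain (ZU9 vs eGPU): {efficiency_gain:.2f}x")
print(f"Latency ratio (ZU9 / eGPU): {latency_ratio:.2f}")
```

The computed efficiencies agree with the slide's Images/s/watt row to within rounding, and the two ratios come out close to 6x and 1/5.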

### The Divergence of Training and Inference





**Training**: The process by which a machine "learns" and optimizes a model from data

**Inference**: Using trained models to predict or estimate outcomes from new observations in efficient deployments

| Top-5 Accuracy | FP-32 | FIXED-16 (INT16) | FIXED-8 (INT8) | Difference vs FP32 |
|----------------|-------|------------------|----------------|--------------------|
| VGG-16         | 86.6% | 86.6%            | 86.4%          | (0.2%)             |
| GoogLeNet      | 88.6% | 88.5%            | 85.7%          | (2.9%)             |
| SqueezeNet     | 81.4% | 81.4%            | 80.3%          | (1.1%)             |
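The "Difference vs FP32" column is the FP-32 accuracy minus the INT8 accuracy, in percentage points. A small sketch that recomputes it from the table follows; the variable and function names are illustrative.

```python
# Top-5 accuracy (%) copied from the table on this slide.
top5 = {
    "VGG-16": {"FP32": 86.6, "INT16": 86.6, "INT8": 86.4},
    "GoogLeNet": {"FP32": 88.6, "INT16": 88.5, "INT8": 85.7},
    "SqueezeNet": {"FP32": 81.4, "INT16": 81.4, "INT8": 80.3},
}


def accuracy_drop(model, precision):
    """Percentage points lost relative to FP32, rounded to one decimal."""
    return round(top5[model]["FP32"] - top5[model][precision], 1)


int8_drops = {model: accuracy_drop(model, "INT8") for model in top5}
int16_drops = {model: accuracy_drop(model, "INT16") for model in top5}

for model in top5:
    print(f"{model}: INT16 -{int16_drops[model]} pts, INT8 -{int8_drops[model]} pts")
```

On these figures the INT16 results stay within 0.1 percentage points of FP32, while the INT8 loss ranges from 0.2 to 2.9 points depending on the network.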


https://arxiv.org/pdf/1510.00149v5.pdf

© Copyright 2017 Xilinx

### Inference Precisions Moving to Lower and Variable Precision



### Future Proof Architecture for Any Precisions



# **BNN: Unparalleled Performance**

> Reducing precision from 8b to 1b shrinks LUT cost by 40x

> Potential to scale CNN performance to above 23 TOPS (ZU9)

Assuming 300 MHz with 90%/70% DSP/LUT utilization

Resource consumption assumption: 2.5 LUTs/op (INT1), 16 LUTs/op (INT4), 0.25 DSP/op (INT8)
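These assumptions can be turned into a peak-throughput estimate. The sketch below is a back-of-the-envelope calculation, not the presenter's method. The device totals (274,080 LUTs and 2,520 DSP slices for a ZU9-class part) are an assumption from outside the slides and should be checked against the device datasheet.

```python
# Assumptions stated on the slide.
CLOCK_HZ = 300e6
DSP_UTILIZATION = 0.90
LUT_UTILIZATION = 0.70
LUTS_PER_OP = {"INT1": 2.5, "INT4": 16.0}
DSPS_PER_OP_INT8 = 0.25

# ASSUMPTION (not from the slides): resource totals for a ZU9-class device.
ZU9_LUTS = 274_080
ZU9_DSPS = 2_520


def lut_tops(precision, luts=ZU9_LUTS):
    """Peak tera-ops/s when operations at this precision are built from LUTs."""
    ops_per_cycle = luts * LUT_UTILIZATION / LUTS_PER_OP[precision]
    return ops_per_cycle * CLOCK_HZ / 1e12


def dsp_tops_int8(dsps=ZU9_DSPS):
    """Peak tera-ops/s for INT8 operations mapped onto DSP slices."""
    ops_per_cycle = dsps * DSP_UTILIZATION / DSPS_PER_OP_INT8
    return ops_per_cycle * CLOCK_HZ / 1e12


print(f"INT1 on LUTs: {lut_tops('INT1'):.1f} TOPS")
print(f"INT4 on LUTs: {lut_tops('INT4'):.1f} TOPS")
print(f"INT8 on DSPs: {dsp_tops_int8():.1f} TOPS")
```

Under these assumptions the 1-bit estimate lands at roughly 23 TOPS, consistent with the figure quoted for the ZU9, and the INT8 estimate on DSPs comes out near 2.7 TOPS.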

Chart: TX2 vs. ZU9 (values legible in the source: 2.7, 0.3)






10W power assumption on ZU9





# 8 Bits to 1 Bit: What Is the Challenge?

> Small degradation in accuracy, but improving fast



## Low Latency Inference by Layer to Layer Dataflow On Chip



Nvidia TX1 spec: http://wccftech.com/nvidia-tegra-x1-super-chip-announced-ces-2015-features-maxwell-core-architecture-256-cuda-cores/


## xfDNN: Direct Deep Learning Inference from Caffe



Compiles only ARM software code in minutes. No hardware compilation.


## Caffe Prototxt to Zynq



## 32 Bit Training to 8 Bit Inference

> Approach 1: Quick evaluation
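The slide does not spell out how the quick evaluation converts 32-bit weights to 8 bits. As a generic illustration only, the sketch below shows one common scheme, symmetric per-tensor scaling to signed 8-bit integers. The function names `quantize_int8` and `dequantize` are hypothetical and are not part of any Xilinx tool.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale.

    ASSUMPTION: symmetric per-tensor quantization, chosen for illustration;
    the slides do not say which scheme the toolchain uses.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 0.9]  # made-up example weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q_weights)
print("scale:", scale)
print("max reconstruction error:", max_error)
```

With this scheme the reconstruction error of each weight is bounded by half the scale, which is one reason small weights such as 0.003 in the example collapse to zero.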



### Deep Learning Design Examples

|                          |               | May 2017 | Roadmap |
|--------------------------|---------------|----------|---------|
| GoogLeNet<br>@ batch = 1 | Images/s      | 114      | 370     |
|                          | Power (W)     | 6.0      | 7.0     |
| 3.2 Gops/img             | Images/s/watt | 19.0     | 52.9    |

|                    |           | May 2017 | Roadmap |
|--------------------|-----------|----------|---------|
| SSD<br>@ batch = 1 | Images/s  | 6.3      |         |
| 62.4 Gops/img      | Power (W) |          |         |

|                            |               | May 2017 | Roadmap |
|----------------------------|---------------|----------|---------|
| FCN-AlexNet<br>@ batch = 1 | Images/s      | 7.0      |         |
|                            | Power (W)     | 6.0      |         |
| 42.0 Gops/img              | Images/s/watt | 1.2      |         |
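Multiplying each network's cost per image by its frame rate gives the effective compute throughput behind the May 2017 figures. The sketch below does that arithmetic; the data layout and function names are illustrative, and the SSD power is left as `None` because the slide does not show it.

```python
# May 2017 figures copied from the design example tables.
designs = {
    "GoogLeNet": {"gops_per_img": 3.2, "images_per_s": 114.0, "power_w": 6.0},
    "SSD": {"gops_per_img": 62.4, "images_per_s": 6.3, "power_w": None},
    "FCN-AlexNet": {"gops_per_img": 42.0, "images_per_s": 7.0, "power_w": 6.0},
}


def effective_gops(design):
    """Sustained Gops/s implied by the per-image cost and the frame rate."""
    return design["gops_per_img"] * design["images_per_s"]


def images_per_s_per_watt(design):
    """Efficiency metric from the tables; None when power is not reported."""
    if design["power_w"] is None:
        return None
    return design["images_per_s"] / design["power_w"]


for name, design in designs.items():
    efficiency = images_per_s_per_watt(design)
    efficiency_text = "n/a" if efficiency is None else f"{efficiency:.1f}"
    print(
        f"{name}: {effective_gops(design):.1f} Gops/s, "
        f"{efficiency_text} images/s/watt"
    )
```

The three designs work out to roughly 365, 393, and 294 Gops/s respectively, so the large spread in images/s mostly reflects the per-image cost of each network rather than a large difference in sustained compute.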







## **Deep Learning IP Export Flow**

SDSoC Generated Platform DMA AXI-S

- Export DNN IP and ARM scheduler to integrate into a real system
- Compile-time configuration of DNN IP (e.g., DSP, BRAM, buffer size, ...)




## Building a Full Embedded Vision System




## Putting It All Together: CV and CNN with Multiple Sensors




### Summary

- Zynq SoCs offer superior performance and lower latency compared to other SoC offerings
- reVISION stack provides seamless inference of custom deep learning networks from Caffe to Zynq SoCs
- Visit Xilinx.com/reVISION for more information



## Empowering Product Creators to Harness Embedded Vision

The Embedded Vision Alliance (www.Embedded-Vision.com) is a partnership of 60+ leading embedded vision technology and services suppliers.

Mission: Inspire and empower product creators to incorporate visual intelligence into their products.

The Alliance provides low-cost, high-quality technical educational resources for product developers.

Register for updates at www.Embedded-Vision.com

The Alliance enables vision technology providers to grow their businesses through leads, ecosystem partnerships, and insights.

For membership, email us: membership@Embedded-Vision.com












### For more information and resources visit www.xilinx.com/reVISION


