# sensAI

Delivering Milliwatt AI to the Edge with Ultra-Low Power FPGAs



# Rapidly Emerging Edge Computing Trend Driven by Latency, Privacy, and Bandwidth Limitations



Unit growth for edge devices with AI will explode, increasing at over 110% CAGR over the next five years - Semico Research

NASDAQ: LSCC

LATTICE

#### Market Trends

- Most companies know AI has the power to change their business
  - But applying it effectively remains a challenge
- Many are starting to formalize their approach
  - AI moving out of research groups and into product development
- Deployment of AI-based products is becoming a reality

#### Market Trends

- The dataset remains a significant challenge to adoption of AI:
  - Machine Learning for image recognition is only viable with a high quality set of training data
- Ecosystem developing for off-the-shelf solutions requiring no dataset
  - Pre-Trained for common applications
- "Synthetic" Data is becoming viable with computer generated data sets







#### iCE40 UltraPlus High Accuracy, Low Power Accelerator

| iCE40 UltraPlus                |
|--------------------------------|
| Programmable FPGA Fabric       |
| 5,280 LUTs<br>120 Kb Block RAM |
| NVCM                           |
| 8 DSP Blocks                   |
| 1 Mb RAM                       |
| I/Os                           |

- Parallel computing capability
  - In device DSPs and 1Mbit SRAM
- Sensor agnostic flexible inferencing engine
- Single-digit milliwatt power consumption
- Lower latency
- Data pre-processing and result post-processing in device

|                 | Speed    | Power   | Resolution | Accuracy |
|-----------------|----------|---------|------------|----------|
| Advanced MCU    | 1-2 FPS  | 50-70mW | 64x64x3    | Low      |
| iCE40 UltraPlus | 5-10 FPS | 1-7mW   | 128x128x3  | High     |



#### iCE40 UltraPlus FPGA: 8-bit Deep Quantization Support





#### **ECP5 Enables High Speed AI Acceleration**



|            | Previous Release    | New Release          |
|------------|---------------------|----------------------|
| Resolution | 224x224x3           | 224x224x3            |
| Network    | VGG                 | MobileNet v1, Resnet |
| Speed      | 6 frames per second | 17 frames per second |



#### **Focus Applications**

**Object Detection** 



#### Human Machine Interface (HMI)



#### **Object Identification**



Defect detection in smart security and embedded vision cameras

Key Phrase detection to control smart appliances

Feature extraction enabling navigation of robots



#### **Customizable Reference Designs**



## **Reference Design / Demo – Key Phrase Detection**

| FEATURES |                           |
|----------|---------------------------|
| Sensor   | Microphones               |
| Network  | VGG8                      |
| Speed    | 40 evaluations per second |
| Power    | 7 mW on iCE40 UltraPlus   |



#### **SMART APPLIANCE HMI VIA VOICE**





#### **Reference Design / Demo - Human Face Identification**

| FEATURES |                     |
|----------|---------------------|
| Sensor   | CMOS image sensor   |
| Speed    | 2 frames per second |
| Power    | -                   |

- […]ICATION IN VIDEO […]Y DEVICES
- USER IDENTIFICATION IN SMART TOYS
- SLAM FOR CLEANING ROBOTS
- IN SYSTEM OBJECT REGISTRATION WITHOUT RETRAINING



#### **Reference Design / Demo - Human Presence Detection**

| FEATURES |                         |
|----------|-------------------------|
| Sensor   | 8                       |
| Speed    | 5 frames per second     |
| Power    | 7 mW on iCE40 UltraPlus |

- ALWAYS ON HUMAN DETECTION IN APPLIANCE



#### **Reference Design / Demo - Object Counting**

| FEATURES |                                       |
|----------|---------------------------------------|
| Sensor   | CMOS image sensor                     |
| Speed    | 17 frames per second - Lower Latency  |
| Power    | 850 mW on ECP5-85K                    |

- HUMAN DETECTION IN VIDEO SECURITY DEVICES
- […]OUNTING IN RETAIL CAMERA
- […]TION AND OPERATO[…]

#### **Popular sensAI Accelerator Use Cases**





#### **Hardware Platforms**

Modular Platforms for Rapid Prototyping



#### HM01B0 UPduino Shield Board

#### **Key features**

- Video and Audio sensors
- Compact 22 x 50 mm
- Includes HM01B0 image sensor board
- Arduino Micro form factor UltraPlus board

#### **Embedded Vision Development Kit**

#### **Key features**

- ECP5 FPGA consuming under 1 W of power
- Flexible video connectivity with support for MIPI CSI-2, eDP, HDMI, GigE Vision, USB 3.0, and more



#### **Software Tools**

Neural Network Compiler







#### **Key features**

- Implement networks developed using standard frameworks into Lattice FPGAs without prior RTL experience
- Rapidly analyze, simulate, and compile CNNs/BNNs for implementation on Lattice sensAI IP cores



#### **Engine Structure**



- Hand-crafted and predesigned, not HLS-based
- HW engines compute ALL NN functions of one layer
  - No CPU involvement in NN computation
  - All layers have the corresponding HW engines



#### **Engine Structure**



- Multiple engines for different network topologies
  - Reprogram different engines per network
- Focus on HW efficiency and exploit re-programmability







#### FPGA runs not only ML engine but also all the pre/post processors

- Camera control, image processing (e.g., ISP, down scaler), post processing part
- MIC control, I2S master, audio data buffer, spectrograph (timed FFT), output time filter
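The "timed FFT" stage of the audio path can be pictured as a short-time Fourier transform over the buffered microphone samples. The sketch below is only an illustration in plain Python, not the Lattice pre-processor: the function name `spectrogram`, the frame length, the hop size, and the 8 kHz sample rate are all assumptions, and a naive DFT stands in for a real FFT to keep the example dependency-free.

```python
import cmath
import math

def spectrogram(samples, frame_len=64, hop=32):
    """Magnitude spectrogram: Hann-windowed DFT over overlapping frames.

    frame_len and hop are illustrative values, not taken from the slides.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies of a real signal
    spec = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT bin k; a hardware design would use an FFT instead.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(abs(acc))
        spec.append(row)
    return spec

# 0.1 s of a 1 kHz tone at an assumed 8 kHz sample rate.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(800)]
spec = spectrogram(tone)
peak_bins = [row.index(max(row)) for row in spec]
```

Each row of `spec` is one time slice, so the resulting 2-D array can be fed to a CNN in the same way as an image.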



#### **Optimization – Quantization**



- "Quantization during training" instead of "quantization after training"
  - Put a quantization layer into training so that the neurons learn as 8-bit values rather than floating point. The neurons/weights then evolve toward the 8-bit values that minimize the training error
  - Extendible to deeper quantization (4b, 2b, etc.)
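A toy illustration of quantization during training, under stated assumptions: the forward pass only ever sees a weight rounded to the target bit width, while gradient descent updates a full-precision copy. The straight-through gradient used here is a common choice for this kind of training and is an assumption; the slides do not say which method the Lattice flow uses. `fake_quantize` and `train_quantized_weight` are hypothetical names.

```python
def fake_quantize(value, bits=8, max_abs=1.0):
    """Round value onto a signed fixed-point grid and return it as a float."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    scale = max_abs / levels
    clipped = max(-max_abs, min(max_abs, value))
    return round(clipped / scale) * scale

def train_quantized_weight(data, bits=8, lr=0.05, epochs=200):
    """Fit y = w * x while the forward pass only sees the quantized weight."""
    w = 0.0  # full-precision copy that gradient descent updates
    for _ in range(epochs):
        for x, y in data:
            wq = fake_quantize(w, bits)
            error = wq * x - y
            # Straight-through estimator (assumed): treat d(wq)/d(w) as 1.
            w -= lr * error * x
    return fake_quantize(w, bits)

# Synthetic data for the target weight 0.3.
data = [(x / 10.0, 0.3 * x / 10.0) for x in range(-10, 11)]
w8 = train_quantized_weight(data, bits=8)
w4 = train_quantized_weight(data, bits=4)
```

Both results land on a grid point next to the target weight; the 4-bit grid is coarser, which is why deeper quantization usually costs some accuracy.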



#### **Optimization – Memory Assignment**



- Different memory assignments for different blob sizes and FW (weight) sizes
  - Choose different engines per the network requirements (blob size, weight size) and power constraint
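To make the blob-size and weight-size trade-off concrete, the sketch below estimates both for one convolution layer at 8 bits and names the smallest on-chip memory that would hold each. The memory sizes come from the iCE40 UltraPlus figures quoted earlier (120 Kb block RAM, 1 Mb RAM), taking 1 Kb as 1024 bits. The placement policy, the function names, and the 16-channel layer are assumptions made for illustration, not the actual engine-selection rule.

```python
SPRAM_BYTES = 1024 * 1024 // 8   # 1 Mb RAM -> 131072 bytes
EBR_BYTES = 120 * 1024 // 8      # 120 Kb block RAM -> 15360 bytes

def conv_layer_footprint(height, width, c_in, c_out, kernel=3, bits=8):
    """Return (output blob bytes, weight bytes) for a same-padded conv layer."""
    blob = height * width * c_out * bits // 8
    weights = kernel * kernel * c_in * c_out * bits // 8
    return blob, weights

def place(size_bytes):
    """Hypothetical policy: pick the smallest memory that fits the buffer."""
    if size_bytes <= EBR_BYTES:
        return "block RAM"
    if size_bytes <= SPRAM_BYTES:
        return "1 Mb RAM"
    return "external"

# First layer of a 64x64x3 input with an assumed 16 output channels.
blob, weights = conv_layer_footprint(64, 64, c_in=3, c_out=16)
plan = {"blob": place(blob), "weights": place(weights)}
```

Here the 64 KB output blob needs the larger RAM while the 432-byte weight set fits in block RAM, which is the kind of asymmetry that motivates different assignments per layer.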



#### **Optimization – Power Optimization**



#### Minimize the activation of ML engine

- Clock-gate the ML engine while the preprocessor is collecting data to process
- Run the engine as fast as possible, then turn off the clock and/or go to a low-power mode
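The effect of clock gating can be estimated with a simple duty-cycle model: the engine draws its active power only while an inference is running and a much smaller gated power for the rest of the frame period. This is a rough sketch; the function name and every number in the example are made up for illustration and are not measurements.

```python
def average_power_mw(active_mw, gated_mw, inference_ms, frame_period_ms):
    """Time-weighted average power for one inference per frame period."""
    if inference_ms > frame_period_ms:
        raise ValueError("inference does not fit in the frame period")
    duty = inference_ms / frame_period_ms
    return active_mw * duty + gated_mw * (1.0 - duty)

# Illustrative numbers only: 5 FPS (200 ms period), 40 ms per inference.
ungated = average_power_mw(20.0, 20.0, 40.0, 200.0)  # clock never stops
gated = average_power_mw(20.0, 1.0, 40.0, 200.0)     # clock gated when idle
```

With these assumed figures the average drops from 20 mW to about 4.8 mW, which shows why a short burst of computation followed by a long gated interval is attractive.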



#### **Optimization – Multiple FPGAs & Chaining of multiple networks**





- Network is partitioned and mapped into multiple FPGAs for better throughput
  - Blob value is transferred
- Multiple networks are stored in Flash and run in series
  - Output of each network is aggregated or used to invoke the next network
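Running several networks in series, with each output either collected or used to decide whether the next network runs, can be sketched as a small cascade. Everything here is hypothetical: `run_chain`, the two toy "networks", and the sample fields are stand-ins for networks that would be loaded from Flash one after another.

```python
def run_chain(sample, stages):
    """Run networks in order; each one decides whether the next is invoked.

    stages is a list of (name, network) pairs and network(sample) returns
    (result, invoke_next). Results are aggregated by stage name.
    """
    results = {}
    for name, network in stages:
        result, invoke_next = network(sample)
        results[name] = result
        if not invoke_next:
            break
    return results

# Toy stand-ins for trained networks.
def presence(sample):
    found = sample["brightness"] > 0.5
    return found, found          # only continue when something is present

def identify(sample):
    return sample["label"], False

stages = [("presence", presence), ("identify", identify)]
hit = run_chain({"brightness": 0.9, "label": "face"}, stages)
miss = run_chain({"brightness": 0.1, "label": "face"}, stages)
```

In the `miss` case the second network never runs, so a cheap first stage can gate a more expensive one.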



#### **Network Design for Edge Applications**



#### Don't try to run reference models from the web as-is

Hundreds of layers are not needed or suitable for low-power edge applications

#### Most applications can be covered by 8~15 CONV layers

Not much benefit from residual net/dense net

#### Mostly VGG type and MobileNet type

#### **Optimization process**

- Start from a known reference network with a given training set
- Optimize the network (reducing depth and width) while monitoring accuracy
- Dataset clean-up
  - A small network is more sensitive to the quality of the training set
- Augmentation to reflect the sensor characteristics
- Add quantization in training



#### **Object Detection – Human Presence Detection**

- 64\*64\*3 input
- 6-zone search to cover 128\*128\*3
- VGG8-like: 8\*(Conv, BatchNorm) + 4\*Pooling
- 10 FPS; 6~7 mW at 5 FPS
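One way six 64x64 zones could cover a 128x128 frame is a grid of three columns and two rows, with the columns overlapping by half a zone. The slides give only the zone count and the sizes, so this layout is an assumption, as are the function names.

```python
def zone_origins(frame=128, zone=64, cols=3, rows=2):
    """Top-left corners of a cols x rows grid of zones spread over the frame."""
    def offsets(count):
        if count == 1:
            return [0]
        step = (frame - zone) / (count - 1)
        return [round(i * step) for i in range(count)]
    return [(x, y) for y in offsets(rows) for x in offsets(cols)]

def crop(image, x, y, zone=64):
    """Cut one zone out of a row-major image given as a list of rows."""
    return [row[x:x + zone] for row in image[y:y + zone]]

# Assumed layout: x offsets 0, 32, 64 and y offsets 0, 64.
origins = zone_origins()
frame = [[0] * 128 for _ in range(128)]
zones = [crop(frame, x, y) for x, y in origins]
```

Each zone is then passed through the 64x64x3 network in turn, so one full-frame search costs six inferences.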



## Lattice ECP5 FPGA vs SoC and ASICs

- ECP5 has more flexible I/Os and interfaces
- ECP5 can reconfigure itself from one application to another
- ECP5 can support changing ML topologies
- SoC:
  - Has more horsepower but consumes more power
- ASICs:
  - Lack flexibility in topology selection and modification



## Lattice iCE40 UltraPlus vs MCUs

- MCUs generally suffer from performance
  - Need Arm Cortex-M7 class processors to do image-based NN acceleration with good performance
  - 10X higher power consumption for most applications
    - MCU runs at a higher clock frequency, ~200-500 MHz
  - MCU has higher latency, ~500 ms
- iCE40 provides higher efficiency in pre-processing and post-processing
- Designers are comfortable with the MCU environment
  - For acceleration, use a lower-class MCU + FPGA (co-exist)



#### Lowest Power, Performance Optimized

#### Always-on human presence detection

- 128x128x3 (RGB)
- 5 frames/sec



|                            | Performance | Power  | Cost |
|----------------------------|-------------|--------|------|
| MCU                        | ~1-2 FPS    | ~100mW | \$   |
| Lattice iCE40<br>UltraPlus | ~5 FPS      | ~7 mW  | \$   |
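The table's power and frame-rate columns can be combined into energy per processed frame, since mW divided by frames per second gives mJ per frame. The short calculation below uses only the approximate figures from the table; the function name is illustrative.

```python
def energy_per_frame_mj(power_mw, fps):
    """Energy per processed frame in millijoules (mW / FPS = mJ per frame)."""
    return power_mw / fps

mcu_best = energy_per_frame_mj(100, 2)    # MCU at the top of its ~1-2 FPS range
mcu_worst = energy_per_frame_mj(100, 1)   # MCU at the bottom of that range
ice40 = energy_per_frame_mj(7, 5)         # iCE40 UltraPlus at ~5 FPS

ratio_low = mcu_best / ice40
ratio_high = mcu_worst / ice40
```

On these rounded numbers the MCU spends roughly 50 to 100 mJ per frame against about 1.4 mJ for the iCE40 UltraPlus, a gap of roughly 35x to 70x.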



#### Always-on human counting

- 1080p downscaled to 224x224x3 (RGB)
- 17 frames/sec






#### **Summary of Latest sensAI Updates**







The Low Power Programmable Leader

# THANK YOU

