ML-TN-002 - Real-time Social Distancing estimation

From DAVE Developer's Wiki
Applies to: Machine Learning


History

Version | Date | Notes
1.0.0 | February 2021 | First public release

Introduction

Because of the Covid-19 pandemic, everyone has become very familiar with the so-called "social distancing" rules. In spaces shared by many people (such as squares, public or private offices, malls, etc.), however, monitoring compliance with these rules in real time is not easy.

Automatic systems capable of doing this job have been developed. Most of them are implemented as software running on camera-equipped PCs and make use of computer vision techniques. This is not a one-size-fits-all solution, however. In many cases, a properly designed embedded platform is mandatory, for example because of tight space constraints, the need to operate in harsh environments, or cost constraints. Such requirements are typical of industrial-grade applications.

To date, though, the computing power required by such complex algorithms has been a hurdle that is difficult to overcome, hindering the adoption of embedded platforms for these tasks. Recently, however, new systems-on-chip (SoCs) integrating Neural Network hardware accelerators have appeared on the market. Thanks to this increase in computational power, these devices allow the implementation of novel solutions that satisfy all the above-mentioned requirements.

This Technical Note illustrates one such implementation, addressing real-time social distancing estimation. The work started from the publicly available, open-source Social-Distancing project released by the Istituto Italiano di Tecnologia (IIT), which is illustrated in this paper. The goal was to port the IIT code onto one of the DAVE Embedded Systems Single Board Computers (SBC) suitable for building an industrial-grade automatic machine vision system for social distancing.

The hardware/software platform

The choice fell on the ORCA SBC, which is powered by the NXP i.MX8M Plus SoC. This industrial/automotive-grade SoC is built around a quad-core ARM Cortex-A53 CPU and has a rich set of peripherals and systems. It also integrates a 2.3 TOPS Neural Processing Unit (NPU) and native interfaces for connecting image sensors, making it well suited to computer vision applications.

The system software is a Yocto Linux distribution derived from the NXP 5.4.70_2.3.0 BSP. In addition to the default packages, a number of libraries were added to satisfy the application's requirements.

Application software

As stated previously, the main application derives from the IIT Social-Distancing project. It was developed in several steps, starting when only a few alpha samples of the i.MX8M Plus were available. This was possible because DAVE Embedded Systems had joined the component's beta program.

Step #1

The first step was conducted using the official evaluation kit (EVK) by NXP. The goal was to make the Social-Distancing project work on this platform while maintaining its core functionalities. In essence, the code was modified to replace the OpenPose library with PoseNet. This was required to cope with the operations actually supported by the NXP eIQ software stack and the NPU. For those who are familiar with embedded software development, this should be unsurprising: when porting applications from PC-like platforms to embedded platforms, handling such hardware/software constraints is common practice.

The resulting processing pipeline is shown in the following figure.

Processing pipeline

The yellow boxes indicate processing performed by the ARM cores, while the green one refers to the computation carried out by the NPU.

The following screenshots show the application running on the EVK.

The step 1 application running on the EVK (1/2)
The step 1 application running on the EVK (2/2)

It is worth noting that, even though OpenPose was replaced, the software interface between the high-level layers and PoseNet was not altered, which allowed these layers to be left untouched.
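One way to picture this is an adapter that wraps the new pose estimator behind the call that the upper layers already use. The sketch below is purely illustrative: the names (`PoseNetAdapter`, `detect`, `fake_posenet`) and the keypoint layouts are assumptions made for the example, not taken from the IIT or DAVE code.

```python
from typing import Callable, List, Tuple

# Assumed output format expected by the upper layers:
# one (x, y, confidence) tuple per body joint, in pixel coordinates.
Keypoint = Tuple[float, float, float]
Pose = List[Keypoint]

# Assumed backend format: (y, x, score) tuples in normalized [0, 1] coordinates.
RawPose = List[Tuple[float, float, float]]


class PoseNetAdapter:
    """Exposes a PoseNet-like backend through the interface the upper layers expect."""

    def __init__(self, backend: Callable[[object], List[RawPose]],
                 width: int, height: int) -> None:
        self._backend = backend
        self._width = width
        self._height = height

    def detect(self, frame: object) -> List[Pose]:
        """Run the backend and convert its output to pixel-space keypoints."""
        poses = []
        for person in self._backend(frame):
            poses.append([(x * self._width, y * self._height, score)
                          for (y, x, score) in person])
        return poses


def fake_posenet(frame: object) -> List[RawPose]:
    """Stand-in backend: always reports one person with two keypoints."""
    return [[(0.5, 0.25, 0.9), (1.0, 0.5, 0.8)]]


estimator = PoseNetAdapter(fake_posenet, width=640, height=360)
poses = estimator.detect(frame=None)
print(poses)
```

With this kind of structure, swapping the pose estimator only requires a new adapter, while the code that consumes the keypoints stays the same.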

Step #2

Step #2 concerned implementing some optimizations in order to increase the overall frame rate.

As usual, before implementing any optimization, the code was profiled in order to identify the portions worth optimizing. In addition to traditional, well-known techniques, specific NPU-related tools were used as well. For instance, the following dump shows a detailed report referring to the execution of a Convolutional Neural Network (CNN) on the accelerator.

Example of NPU profiling report
LAYER ID | LAYER NAME | OPERATION ID | OPERATION TYPE | TARGET | CYCLES | READ BW [MByte] | WRITE BW [MByte] | AXI READ BW [MByte] | AXI WRITE BW [MByte] | DDR READ BW [MByte] | DDR WRITE BW [MByte] | TIME [μs]
0 | TensorTranspose | 0 | TENSOR_TRANS | TP | 482613 | 0.491743 | 0.445310 | 0.000000 | 0.000000 | 0.491743 | 0.445310 | 631
20 | ConvolutionReluPoolingLayer2 | 0 | RESHUFFLE | TP | 1822 | 0.002380 | 0.000000 | 0.000000 | 0.000000 | 0.002380 | 0.000000 | 136
20 | ConvolutionReluPoolingLayer2 | 0 | RESHUFFLE | TP | 402743 | 0.251754 | 0.000000 | 0.000000 | 0.000000 | 0.251754 | 0.000000 | 539
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
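A report like this is easier to reason about once the time is aggregated per layer. The following is a minimal sketch that does so for the three rows shown in the report. It assumes the rows are available as whitespace-separated text with 13 fields each and that layer names contain no spaces. The function name `time_per_layer` is made up for this example and is not part of any NXP tool.

```python
from collections import defaultdict
from typing import Dict

# The three data rows from the report; the last field is TIME in microseconds.
REPORT_ROWS = """\
0 TensorTranspose 0 TENSOR_TRANS TP 482613 0.491743 0.445310 0.000000 0.000000 0.491743 0.445310 631
20 ConvolutionReluPoolingLayer2 0 RESHUFFLE TP 1822 0.002380 0.000000 0.000000 0.000000 0.002380 0.000000 136
20 ConvolutionReluPoolingLayer2 0 RESHUFFLE TP 402743 0.251754 0.000000 0.000000 0.000000 0.251754 0.000000 539
"""

EXPECTED_FIELDS = 13  # number of columns in the report header


def time_per_layer(report: str) -> Dict[str, int]:
    """Sum the TIME column (microseconds) for each layer name."""
    totals: Dict[str, int] = defaultdict(int)
    for line in report.splitlines():
        fields = line.split()
        # Skip headers, ellipsis rows and blank lines.
        if len(fields) != EXPECTED_FIELDS or not fields[0].isdigit():
            continue
        layer_name = fields[1]
        totals[layer_name] += int(fields[-1])
    return dict(totals)


totals = time_per_layer(REPORT_ROWS)
# Print the layers from the most to the least time-consuming.
for name, micros in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {micros} us")
```

Sorting by accumulated time gives a quick indication of which layers dominate the inference on the accelerator.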

Combining the profiling results with a manual analysis of the code, it was decided to work on the operations performed before the inference. Basically, these tasks were restructured to run in parallel in order to leverage the quad-core ARM Cortex-A53 cluster. The resulting architecture is depicted in the following figure.

Processing pipeline after implementing parallel computations
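The general idea of spreading pre-inference work over four cores can be sketched as follows. This is not the actual implementation: the pre-processing step (`normalize_rows`), the way the frame is split into bands of rows, and all names are assumptions chosen for illustration. Note also that, in CPython, threads typically give a real speed-up only when the work is done in native code that releases the interpreter lock. For pure-Python work like this toy example, a process pool would be needed instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

NUM_WORKERS = 4  # assumption: one worker per Cortex-A53 core

Frame = List[List[int]]  # rows of 8-bit pixel values (toy stand-in for an image)


def normalize_rows(rows: Frame) -> List[List[float]]:
    """Hypothetical pre-processing step: scale 8-bit pixels to [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in rows]


def split_into_bands(frame: Frame, parts: int) -> List[Frame]:
    """Split the frame into at most `parts` contiguous bands of rows."""
    band = max(1, -(-len(frame) // parts))  # ceiling division, at least 1
    return [frame[i:i + band] for i in range(0, len(frame), band)]


def preprocess_parallel(frame: Frame, pool: ThreadPoolExecutor) -> List[List[float]]:
    """Pre-process the bands concurrently and reassemble them in order."""
    bands = split_into_bands(frame, NUM_WORKERS)
    result: List[List[float]] = []
    for processed in pool.map(normalize_rows, bands):  # map() preserves order
        result.extend(processed)
    return result


with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    frame = [[0, 51, 255] for _ in range(8)]
    tensor = preprocess_parallel(frame, pool)

print(len(tensor), tensor[0])
```

Because `map()` returns results in submission order, the reassembled output matches what a sequential pass over the whole frame would produce.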

Step #3

In this step, the application was migrated to the definitive hardware platform, the aforementioned ORCA SBC, which was designed while the software team was working on the EVK.

Testing

The following clip shows the application running on the ORCA SBC.


Social Distancing application running on ORCA SBC


In the example, the system was fed with a 640x360 stream at 25 fps. On average, the frame rate of the processed stream is 23 fps.
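For reference, a 25 fps input delivers a frame every 40 ms, while 23 fps corresponds to roughly 43.5 ms per processed frame. An average frame rate of this kind can be measured by counting frame intervals over the elapsed time. The sketch below shows one possible way to do it. The `FrameRateMeter` class is hypothetical and is not taken from the application's code.

```python
import time
from typing import Callable, Optional


class FrameRateMeter:
    """Average frame rate = completed frame intervals / elapsed time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._first: Optional[float] = None
        self._last: Optional[float] = None
        self._intervals = 0

    def tick(self) -> None:
        """Call once for every processed frame."""
        now = self._clock()
        if self._first is None:
            self._first = now
        else:
            self._intervals += 1
        self._last = now

    def average_fps(self) -> float:
        """Return the average rate, or 0.0 if fewer than two frames were seen."""
        if self._first is None or self._last is None or self._last <= self._first:
            return 0.0
        return self._intervals / (self._last - self._first)


# Deterministic demo with a fake clock: five frames, half a second apart.
timestamps = iter([0.0, 0.5, 1.0, 1.5, 2.0])
meter = FrameRateMeter(clock=lambda: next(timestamps))
for _ in range(5):
    meter.tick()
demo_fps = meter.average_fps()  # 4 intervals over 2.0 seconds
print(demo_fps)
```

In a real processing loop, the default `time.monotonic` clock would be used and `tick()` would be called right after each frame has been fully processed.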

This screenshot illustrates the CPU load during the execution of the application. As expected, the 4 ARM cores are almost fully loaded because of the parallel computation implemented in the algorithm.


CPU load during the execution of the application


For convenience, this test was run using an MPEG-4 video file as input. The well-known OpenCV libraries were used to decompress the video and to retrieve the frames. At the time of this writing, these libraries did not support the i.MX8M Plus's hardware video decoder. As such, it should be taken into account that video decompression is carried out by the ARM cores as well. Thus, in the case of an uncompressed live stream captured from a camera, further processing headroom is expected to be available for the core computations.
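As an illustration of how frames are typically retrieved with OpenCV's Python bindings, the sketch below wraps the `read()` loop of a `cv2.VideoCapture` object in a generator. It is an assumption that the application reads frames in a similar way. The helper names and the `FakeCapture` stand-in (used so the example runs without OpenCV or a video file) are invented for this example.

```python
from typing import Any, Iterator, List, Tuple


def iter_frames(capture: Any) -> Iterator[Any]:
    """Yield frames until capture.read() reports that none are left."""
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        yield frame


def open_video(source: Any) -> Any:
    """Open a video file path (or camera index) with OpenCV."""
    import cv2  # imported lazily so iter_frames() has no hard dependency

    capture = cv2.VideoCapture(source)
    if not capture.isOpened():
        raise IOError(f"cannot open video source: {source!r}")
    return capture


class FakeCapture:
    """Stand-in with the same read() contract as cv2.VideoCapture."""

    def __init__(self, frames: List[Any]) -> None:
        self._frames = list(frames)

    def read(self) -> Tuple[bool, Any]:
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


frames = list(iter_frames(FakeCapture(["frame0", "frame1", "frame2"])))
print(frames)
```

With OpenCV installed, the same loop would be driven by `iter_frames(open_video("input.mp4"))`, where the file name is hypothetical. Passing a camera index such as `0` instead selects a live capture device. In both cases, decoding of a compressed file happens on the CPU unless the OpenCV build uses a hardware-accelerated backend.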