Appearance Filtering

How We Used Machine Learning to Make People Search Smarter

Verkada
Feb 13, 2020

by Preeti Pillai

The Computer Vision team, from left to right: Naresh Nagabushan, Preeti Pillai, and Cameron Franke

Motivation

Imagine you’re waiting at a train station and notice an abandoned bag left on a bench. Or you’re a store owner and discover that an item has been shoplifted from a shelf without anyone noticing. Searching through hours of security footage across dozens of cameras to track down a lost object, a suspicious person, or an intruder takes precious time you likely don’t have to waste.

But what if the system could perform that exhaustive search for you and compile a comprehensive list of summarized clips for your review in a matter of seconds? Wouldn’t that be a huge relief? This is the kind of solution Verkada aims to provide to its customers.

To tackle this problem, the computer vision team leveraged the existing research literature on Person/Pedestrian Attribute Recognition (PAR), a long-explored topic in video surveillance research. By people attributes, we mean human-searchable semantic descriptions that serve as visual biometrics for person re-identification/recognition applications. There are several PAR datasets in the vision community (WIDER, PETA, PA-100K, Market, Duke, CUHK) that together contribute over 50 annotated descriptive attributes.

Role of ‘gender appearance’ in people analytics

Customers obviously won’t have the patience or the memory to type 50 different people attributes into a search bar to look someone up! This is where product managers Brandon Davito and Zaafir Kherani stepped in, walking us through the customer’s thought process and helping us decide which attributes to pursue based on customer demand and feature feasibility.

We first broke down a person’s identifiable semantics into global attributes (age, gender) and local attributes (clothing & accessories). We then ran a preliminary set of classifiers on these open-source datasets and found interesting correlations between local and global attributes. For example, a person whose clothing type was marked as a skirt usually had a co-occurring female gender label; a person with ‘mustache present’ marked true usually had an accompanying male gender label.
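For readers curious what that correlation check looks like in practice, here is a minimal sketch: assuming the annotations live in a table with one row per person crop, a simple cross-tabulation of a local attribute against the gender label surfaces these co-occurrences. The column names and values below are hypothetical, not our actual schema.

    import pandas as pd

    # Hypothetical annotation table: one row per person crop, columns are attribute labels.
    labels = pd.DataFrame({
        "gender":         ["female", "male", "female", "male", "female"],
        "lower_clothing": ["skirt", "trousers", "skirt", "trousers", "trousers"],
        "mustache":       [False, False, False, True, False],
    })

    # Cross-tabulate a local attribute against the global gender label to spot correlations.
    print(pd.crosstab(labels["lower_clothing"], labels["gender"], normalize="index"))
    print(pd.crosstab(labels["mustache"], labels["gender"], normalize="index"))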

Gender and clothing color descriptors, both being search-friendly and visually perceivable attributes, were given the go-ahead for our first customer release. This post focuses on how we took gender appearance search from research to commercial use.

Diverse surveillance scenes and ‘un-viewable’ people faces

The beauty of working with surveillance cameras is that scenes are stationary, with constant scene semantics, fixed room depth, and sometimes constant lighting. The nerve-wracking challenge, on the other hand, is that every surveillance camera faces a totally unique indoor or outdoor scene, with variations in camera viewpoint, illumination, resolution, people occlusions, and blur (outdoor fog, rain). Across 100k Verkada cameras, that means 100k non-matching scenes! We needed a scalable model that would work consistently for any type of surveillance scene. Available gender-categorization algorithms and paid services such as AWS Rekognition offer person gender labels and people recognition learned mainly from face detection features.

But can a model predict gender appearance from a person’s overall appearance alone? The human body is non-rigid and subject to the following:
1) Varying shape/height
2) Body part occlusions
3) Pose changes (standing, bending, sitting)
4) Camera view
5) Clothing appearances
6) Background clutter
7) Face visibility
8) Group occlusions
9) Low illumination
10) Skin tone
All of which makes gender appearance search insanely challenging. The computer vision team wanted to take the solution one step further: a gender appearance filter robust to all ten of these constraints.

Behind the scenes of the Verkada computer vision team

Cameron must be on the phone a lot to warrant a landline plus headset on his desk

Ever since the AlexNet moment eight years ago, the computer vision community has seen a dramatic shift from handcrafted features to deep learning-based approaches. Automatic feature extraction became possible, using multilayer non-linear transformations over an image to decipher meaningful image insights.

We started off with hours of literature crunching and trained diverse convolutional neural network-based models on existing open-source datasets totaling 120K+ person gender annotations. Our initial target was to get a good sense of inference run-time, complexity in terms of memory usage on our production servers, and robustness to diverse people and scene appearances. We selected a model that was fairly deep, incorporated depthwise separable convolutions, ran inference fast, and gave the best F1 scores of the lot.
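The post doesn’t name the exact architecture, but the depthwise separable convolution block it relies on is easy to illustrate. Here is a minimal PyTorch sketch of that building block, factoring a standard convolution into a per-channel spatial convolution followed by a 1x1 pointwise convolution, which is what keeps inference fast and memory light.

    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        """Depthwise separable conv: per-channel 3x3 spatial conv, then a 1x1 pointwise conv."""
        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                       padding=1, groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.bn(self.pointwise(self.depthwise(x))))

    # Example: one block applied to a batch of 32-channel feature maps.
    x = torch.randn(8, 32, 128, 64)
    print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([8, 64, 128, 64])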

We annotated and compiled a benchmark test dataset with gender appearance labels from our surveillance cameras, termed Hamlet_v1* (people crops with diverse clothing, skin color, and hair color variations, and camera angles). The first version of our model scored an average accuracy of 74.63% on this benchmark and was therefore not production-ready for Verkada customers.
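As a rough illustration of how such a benchmark number is computed, here is a small helper that reports overall and per-class accuracy from ground-truth and predicted labels. The function name and labels are hypothetical, but per-class accuracy becomes important later in the story.

    import numpy as np

    def per_class_accuracy(y_true, y_pred, classes=("female", "male")):
        """Overall and per-class accuracy for a binary appearance classifier (hypothetical helper)."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        report = {"overall": float((y_true == y_pred).mean())}
        for c in classes:
            mask = y_true == c
            report[c] = float((y_pred[mask] == c).mean()) if mask.any() else float("nan")
        return report

    print(per_class_accuracy(["female", "male", "female", "male"],
                             ["female", "male", "male", "male"]))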

* Hamlet is a Caribbean fish species that functions as both male & female at the same time.

Driving model improvements

In order to plan the next phase of experiments, we had to figure out why the model failed on 25% of our data and what we could do to improve its performance. Overfitting, poor data quality, data insufficiency, and the trade-off between data quality and quantity were some of the factors that came to mind.

The quality of the 120K training images was really subpar compared to our 1080p/4K surveillance camera resolutions. In some cases, even a human eye couldn’t verify whether the person was male or female. A model can only learn from what it is given to see. We suspected data insufficiency, training data quality, and a lack of appearance diversity to be the actual problems blocking us. To tackle the data volume and quality problem, we shifted gears to collecting gender annotations from Google’s large-scale OpenImages dataset, which consists of high-resolution data. Naresh Nagabushan, my colleague on the computer vision team, stepped in to prepare a gender training dataset of 2 million person crops from the larger OpenImages and helped fine-tune the model with additional pre-processing improvements.
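As an illustration of that data preparation step, here is a sketch of how person crops can be extracted from the OpenImages box annotations. The file paths are assumptions, and this is a simplification of the actual pipeline (which also included pre-processing improvements).

    import pandas as pd
    from pathlib import Path
    from PIL import Image

    BOXES_CSV = "train-annotations-bbox.csv"    # OpenImages box annotations (path is an assumption)
    IMAGE_DIR = Path("openimages/train")        # assumed local image directory
    OUT_DIR = Path("person_crops"); OUT_DIR.mkdir(exist_ok=True)
    PERSON_LABEL = "/m/01g317"                  # OpenImages class ID for "Person"

    boxes = pd.read_csv(BOXES_CSV)
    persons = boxes[boxes["LabelName"] == PERSON_LABEL]

    for i, row in persons.iterrows():
        img_path = IMAGE_DIR / f"{row['ImageID']}.jpg"
        if not img_path.exists():
            continue
        img = Image.open(img_path)
        w, h = img.size
        # OpenImages stores normalized corner coordinates.
        crop = img.crop((int(row["XMin"] * w), int(row["YMin"] * h),
                         int(row["XMax"] * w), int(row["YMax"] * h)))
        crop.save(OUT_DIR / f"{row['ImageID']}_{i}.jpg")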

Inside a deep neural network’s brain

Deep learning models being notorious black boxes, we had to understand what the model was actually looking at to make a prediction. We visualized inference outputs at every stage to ensure that we were on the right path. The heatmap visuals show what the model ‘sees’ to make a prediction: it appears to focus on facial hair, scalp hair, body hair, lower clothing type (skirt vs. trousers), and person contour to make a joint inference decision.
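The post doesn’t say which visualization technique produced these heatmaps; Grad-CAM-style class activation maps are one common way to generate them. Here is a minimal sketch using a generic torchvision backbone as a stand-in for our model.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.mobilenet_v2(weights="DEFAULT").eval()   # stand-in backbone, not our model
    target_layer = model.features[-1]

    activations, gradients = {}, {}
    target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

    x = torch.randn(1, 3, 224, 224)                 # stand-in for a person crop
    score = model(x)[0].max()                       # score of the predicted class
    score.backward()

    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)              # channel importance
    cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)             # normalize to [0, 1]
    print(cam.shape)  # heatmap the size of the input, ready to overlay on the crop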

Background clutter around a person intrudes on the person’s visual appearance and confuses the model. Depending on how crowded a room gets, accurate 2D person segmentation can turn costly real fast, in terms of both time and money. Instead, we deployed a first-pass pose detector on all people crops and ran a cropping filter along the person’s skeletal pose to isolate the individual from the background.
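A rough sketch of that idea: given keypoints from any off-the-shelf pose detector, crop the image to a tight box around the visible skeleton before it reaches the classifier. The margin and confidence threshold below are assumptions for illustration, not our production values.

    import numpy as np

    def crop_around_pose(image, keypoints, margin=0.1):
        """Crop an image (H, W, C array) to a tight box around detected skeletal keypoints.

        `keypoints` is an (N, 3) array of (x, y, confidence) from any pose detector.
        """
        pts = np.asarray(keypoints)
        visible = pts[pts[:, 2] > 0.3][:, :2]       # keep confidently detected joints
        if len(visible) == 0:
            return image                            # fall back to the original crop
        h, w = image.shape[:2]
        x0, y0 = visible.min(axis=0)
        x1, y1 = visible.max(axis=0)
        mx, my = margin * (x1 - x0), margin * (y1 - y0)
        x0, x1 = int(max(0, x0 - mx)), int(min(w, x1 + mx))
        y0, y1 = int(max(0, y0 - my)), int(min(h, y1 + my))
        return image[y0:y1, x0:x1]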

Along with a face visibility loss and a class imbalance loss, all of these incremental steps brought the average benchmark accuracy up to 89%!
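The post doesn’t spell out what the class imbalance loss looks like; one common recipe is simply to weight the cross-entropy loss by inverse class frequency, along these lines (the class counts below are made up):

    import torch
    import torch.nn as nn

    # Weight cross-entropy by inverse class frequency to counter class imbalance.
    class_counts = torch.tensor([30_000.0, 90_000.0])         # hypothetical female/male counts
    class_weights = class_counts.sum() / (len(class_counts) * class_counts)
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    logits = torch.randn(8, 2)                                 # model outputs for a batch
    targets = torch.randint(0, 2, (8,))                        # 0 = female, 1 = male (hypothetical)
    loss = criterion(logits, targets)
    print(class_weights, loss.item())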

Rob Cromwell, our VP of Engineering, set a firm production accuracy goal of 90% for both the male and female classes. However, the per-class accuracy for female appearance was just 83%. We racked our brains for a couple of days, studied the incorrect inference results on the Hamlet_v1 benchmark, and finally identified the root causes: 1) extreme camera scene angles and 2) domain data incompatibility. There was one final step: train on plenty of our own data.

Solving the domain adaptation problem

‘CNNs are designed to cope with translations, but they’re not so good at dealing with other effects of changing viewpoints such as rotation and scaling.’ The conditions under which training data is collected don’t always match those under which the model is applied; “domain” here refers to factors such as lighting, viewpoint, and resolution. Deep learning models make precise predictions when fed high-quality training data from the right distribution. The underlying data distribution of the OpenImages dataset was a tad different from Verkada surveillance camera images, mainly due to viewing-angle variations across customer scenes. Recognizing a front view of a person is much easier than, say, recognizing a 60-degree top-down view.

At Verkada, we take customer data privacy very seriously. We tracked the entire annotation process to ensure that the data stayed secure, from collection and sharing to annotation and retrieval via encrypted servers. Our PM Brandon Davito reached out to customers for prior permission to use their data and helped ensure all data transactions were completed in a secure manner. We hand-picked a large distribution of person-crop data from around 172 customer cameras and employed the services of our trusted annotation partner to label it. The Hamlet_v2 training dataset annotations were verified through consensus algorithms before being fed to our deep models for fine-tuning. After the final steps of hyperparameter tuning and multiple batch-training experiments, the final accuracy on our benchmark dataset was 90.81% for the female class and 93.94% for the male class.
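As a simplified illustration of that consensus step, the most basic form is a majority vote across annotators, with weak agreement sent back for review; the agreement threshold below is an assumption.

    from collections import Counter

    def consensus_label(votes, min_agreement=2 / 3):
        """Majority-vote consensus over multiple annotators' labels for one person crop."""
        label, count = Counter(votes).most_common(1)[0]
        return label if count / len(votes) >= min_agreement else None   # None = send back for review

    print(consensus_label(["female", "female", "male"]))   # female
    print(consensus_label(["female", "male"]))             # None, no consensus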

Fine-tuning the existing model on our own dataset greatly helped lessen the effects of the input domain divergence problem.
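A minimal sketch of that kind of fine-tuning, assuming a generic pretrained backbone: the pretrained feature layers get a small learning rate while the new two-class head gets a larger one, so the model adapts to the surveillance domain without forgetting what it already knows. The backbone, learning rates, and data loader name are all assumptions, not our actual recipe.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Start from a model pretrained on the large generic dataset, then fine-tune on in-domain crops.
    model = models.mobilenet_v2(weights="DEFAULT")
    model.classifier[-1] = nn.Linear(model.last_channel, 2)   # 2 classes: female / male appearance

    # Lower learning rate for pretrained features, higher for the new classification head.
    optimizer = torch.optim.SGD([
        {"params": model.features.parameters(), "lr": 1e-4},
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ], momentum=0.9, weight_decay=1e-4)

    criterion = nn.CrossEntropyLoss()
    # for images, labels in hamlet_v2_loader:        # in-domain training data (hypothetical loader)
    #     optimizer.zero_grad()
    #     criterion(model(images), labels).backward()
    #     optimizer.step()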

A few interesting visualizations from the Verkada Halloween party: the model’s judgment is scarily more accurate than a human’s.

Feature Release

Rolling out the feature was a team effort; we worked closely with the Backend team to integrate all of the model’s outputs into Verkada’s servers, and the Frontend and UX team to create a great customer experience. The Backend team also developed a feature called “Cross-Camera Search,” which allows customers to easily search across all the cameras in their system simultaneously.

Now, months after that initial release, Appearance Filtering is more popular than ever, and customers are requesting new search attributes on a regular basis. Soon, they’ll be able to not only find that person who left their bag at the train station, but also run safety checks to ensure all crew members on a worksite are wearing helmets, or find a bank robber wearing a shirt with a specific pattern. The team effort continues, and we’re excited to see how this project evolves in the months and years ahead.

Working on the computer vision team

CV team brainstorming solutions to tough problems

The Verkada CV team is a Disney World for anyone with a sincere passion for computer vision and deep learning. There isn’t a single computer vision domain not covered by our bucket list of projects: State Classifiers, People/Vehicle/Scene/Face Analytics, Motion Analysis, Floorplan Visualization, Crowd Estimation, Model Optimization, People Activity, and Preventative Analytics to make a scene safer. The list of challenging problems we’re trying to crack keeps us constantly entertained.

Interested in joining the Verkada team?

Check out open roles or email questions to recruiting@verkada.com
