In this employee spotlight, we get insight into troubleshooting with William Minch, Technical Support Manager at Verkada. (And if you’re interested in joining us in our mission of making the world safer, check out our open positions!)
While working from home amid the COVID-19 pandemic, a customer reached out to me with a concern. Video streams would buffer when they tried playing more than one camera at a time. Even short video clips were frustrating to watch. This was preventing their corporate security team from completing critical investigations.
I always like to sync up with the customer to see the problem in real time. It gives me a chance to build rapport and appreciate what they are experiencing. After placing a call to the customer’s security team, I identified the problem: they often review video from three cameras at once using 3x playback speed, and whenever they did, the video would buffer or freeze. Using the Developer Tools in their browser, I was able to confirm what was happening. Video segments were taking longer than normal to retrieve, and the camera’s logs showed frequent timeouts when uploading video to the cloud.
While on the line, I loaded three of their cameras at once and played them back at 3x speed on my computer. I was confident my home network could handle the streams. My intent was to isolate the issue to the camera’s network. Sure enough, the buffering was constant. This showed the issue wasn’t the customer’s computer or the video collection from the cloud.
I worked with the customer to better understand their network environment. This issue was especially tricky because the video was being watched remotely, not over the local network at the site. The customer mentioned they hadn’t seen the issue on their previous system, so they were confident the issue was on the Verkada side. They checked the network and found no drops or configurations limiting the cameras, and they weren’t aware of any bandwidth constraints. I knew the best path to resolution was to dive in and find the root cause.
Asking for packet captures from the customer would usually be my next step. Packet captures help show network transactions and show what is happening moment-to-moment. COVID-19 made collecting captures a difficult ask, since the customer was remote. I decided to take a visit to Verkada’s empty office to recreate the issue in our lab. I rounded up some gear to set up a like-for-like camera simulation of the customer’s deployment.
I enabled the customer’s feature combination on my cameras and set up a packet capture on the switch’s uplink port. This capture positioning let me see all traffic from the three cameras at once. I began testing with a 1 Gbps symmetric internet circuit, far more than the customer was likely to have at their sites. My three cameras played fine, even at 3x speed, so there was no question the cameras could play footage at increased playback speeds. Next, I did some quick math and determined these cameras require about 9 Mbps of upload bandwidth. I used QoS policing to rate limit the traffic to 10 Mbps. Still no problems.
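The quick math behind that 9 Mbps figure can be sketched as a back-of-the-envelope check. The ~3 Mbps per-camera estimate and the helper function are illustrative, not a Verkada tool:

```python
# Back-of-the-envelope upload bandwidth check for simultaneous playback.
# Assumption (illustrative): each camera needs roughly 3 Mbps of upload
# bandwidth for smooth playback, and three cameras stream at once.
PER_CAMERA_MBPS = 3
CAMERAS = 3
REQUIRED_MBPS = PER_CAMERA_MBPS * CAMERAS  # 9 Mbps total

def playback_is_smooth(available_mbps: float) -> bool:
    """True if the uplink can carry all camera streams at once."""
    return available_mbps >= REQUIRED_MBPS

print(REQUIRED_MBPS)           # total upload bandwidth needed
print(playback_is_smooth(10))  # with the 10 Mbps rate limit in place
```

With a 10 Mbps policer there is headroom above the 9 Mbps requirement, which matches the trouble-free result of that test.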
Then I dropped the bandwidth to 5 Mbps, and the video started to stutter on playback. I saw the same buffering and occasional timeouts the customer was seeing. This showed that playback breaks down when there is not enough upload bandwidth.
I shared my findings with the customer, but they were still not convinced the issue was with their network. They agreed to send someone out to collect packet captures. Working with the network engineer on site, we ran tests and gathered captures. The tests showed an average throughput well below what they needed for smooth playback. I had them send over the collected captures for me to analyze with Wireshark.
Network congestion was the first thing I decided to investigate. I wanted to check whether the upload was rate limited on the camera or upstream. Packet capture analysis showed total upload throughput capped at 5.5 Mbps. As a result, each camera was only able to reach a data rate of about 2 Mbps, while each camera requires 3 Mbps for smooth playback. I concluded that there was not enough upload bandwidth.
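The arithmetic behind that conclusion is simple enough to sketch. An even split is only an approximation of what competing TCP flows actually achieve, but it shows why each camera fell short:

```python
# Three TCP flows sharing a capped uplink settle near an even split,
# so a 5.5 Mbps cap leaves each camera short of the 3 Mbps it needs.
TOTAL_UPLOAD_MBPS = 5.5     # observed cap in the packet capture
CAMERAS = 3
NEEDED_PER_CAMERA_MBPS = 3  # required for smooth playback

per_camera_mbps = TOTAL_UPLOAD_MBPS / CAMERAS
print(round(per_camera_mbps, 2))                  # roughly 1.83 Mbps
print(per_camera_mbps >= NEEDED_PER_CAMERA_MBPS)  # False: too slow
```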
Wireshark I/O graphs plot the amount of data in the packet capture against time. This shows the sent and received data on a per-second basis. I used display filters to chart each of the flows. This allowed me to see the per-camera data rate.
Transmission Control Protocol (TCP) flow control explains why the throughput settled where it did. The TCP receive window is established by the receiver, in this case the Verkada endpoint. The calculated receive window is the receiver’s allocated buffer space for this flow, and it dictates how much data the camera can send before it receives an ACK. Since the bytes out cannot exceed the size of the receive window, this was a possible cause of the slowness. The TCP window scaling graph plots the bytes out and the receive window against time.
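As a rough sketch of why the receive window matters: a sender can have at most one window of unacknowledged data in flight per round trip, so the window divided by the round-trip time caps throughput. The window size and RTT below are illustrative, not values from this capture:

```python
# Upper bound on TCP throughput imposed by the receive window:
# at most one window of data can be in flight per round-trip time.
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    return window_bytes * 8 / rtt_seconds / 1_000_000

# Illustrative: a 64 KB window over a 50 ms round trip caps the flow
# at about 10.5 Mbps, regardless of how fast the link itself is.
print(round(max_throughput_mbps(65535, 0.05), 1))
```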
My next step in finding the origin of the slow throughput was to check the congestion window. The congestion window determines how much data the sender will send at any given time. It scales up until there is packet loss, then drops by roughly half and rebuilds from there, limiting throughput to what the network can support. The capture showed frequent packet loss for each camera, which kept the data rate of each camera below 3 Mbps and resulted in continued buffering.
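This grow-then-halve sawtooth can be sketched with a toy additive-increase/multiplicative-decrease (AIMD) loop. The round numbers and loss pattern here are made up for illustration, not taken from the capture:

```python
# Toy AIMD sketch of how a TCP sender's congestion window reacts to
# loss: it grows additively each round trip until a drop is detected,
# then is cut in half and rebuilds from there.
def aimd(rounds_with_loss: set, total_rounds: int, start: float = 1.0) -> list:
    """Return the congestion window (in segments) after each round."""
    cwnd = start
    history = []
    for r in range(total_rounds):
        if r in rounds_with_loss:
            cwnd = max(1.0, cwnd / 2)  # multiplicative decrease on loss
        else:
            cwnd += 1.0                # additive increase per round trip
        history.append(cwnd)
    return history

# Frequent loss keeps the window, and therefore throughput, low.
print(aimd(rounds_with_loss={3, 6}, total_rounds=8))
```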
The congestion window isn’t observable in the packets themselves, but looking at the windowing and tcptrace graphs allows us to infer it.
The windowing graph shows the increase in bytes out. When the congestion window is growing, the bytes out increase; when the sender detects dropped packets, the congestion window shrinks.
Digging deeper into the cause of the reduced congestion window showed it was due to packet loss.
The tcptrace graph shows some information that isn’t found on the graphs we have looked at so far. One such item is selective acknowledgements (SACKs). A SACK arrives as a duplicate ACK, indicating packet loss: a receiver sends one when it receives a segment with a higher sequence number than expected. This packet tells the sender which data has been received and which data needs to be resent.
We noticed many duplicate acknowledgements in the capture at the 15.5 second mark. You can see them highlighted in black. The missing data is from the last acknowledged byte to the left edge of the SACK block.
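In sequence-number terms, the hole the receiver is reporting can be computed directly. The byte offsets below are invented for illustration, not values from the actual capture:

```python
# The last cumulatively ACKed byte and the SACK block's left edge
# bound the missing data the sender must retransmit.
def missing_bytes(last_ack: int, sack_left_edge: int) -> range:
    """Sequence-number range the receiver has not yet received."""
    return range(last_ack, sack_left_edge)

# Hypothetical example: ACKed through byte 148600, SACK block starts
# at byte 150060, so one 1460-byte segment is missing.
hole = missing_bytes(148600, 150060)
print(hole.start, hole.stop, len(hole))
```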
Looking ahead in the capture, that missing TCP segment is re-sent.
This behavior confirms the shrinking congestion window was due to packet loss. We saw this pattern for each camera throughout the capture.
I was now confident their network was dropping traffic, resulting in low throughput. I wrote up an analysis for the customer using the same information I am sharing with you, the reader. They re-checked their network and noticed a configuration they had overlooked before: a QoS policy policing all traffic in the camera’s subnet. Playback issues ceased once they removed this configuration, and the change was successful at each of their sites.
Happy customers are always a great result. But the benefits for Verkada go far beyond this one fix, or even this account. I shared my findings with the Product team. We created a guide to help customers calculate bandwidth requirements for their use-case. This should help smooth out future customer deployments.
Interested in joining the Verkada team?
Want to learn more about the cool problems we troubleshoot in Verkada Technical Support?