Facebook has recently announced the Facebook Surround 360, a high-quality, production-ready 3D-360 hardware and software video capture system. In designing this camera, Facebook wanted to create a professional-grade end-to-end system that would capture, edit and render high-quality 3D-360 video. In doing so, the company hoped to meaningfully contribute to the 3D-360 camera landscape by creating a system that would enable more VR content producers and artists to start producing 3D-360 video.
When the project was started, all the existing 3D-360 video cameras the company had seen were either proprietary (so the community could not access those designs), available only by special request, or fundamentally unreliable as an end-to-end system in a production environment. In most cases, the cameras in these systems would overheat, the rigs weren't sturdy enough to mount to production gear and the stitching would take a prohibitively long time because it had to be done by hand.
So Facebook set out to design and build a 3D-360 video camera that did what you'd expect an everyday camera to do — capture, edit, and render reliably every time, which turned out to be a technically daunting challenge for 3D-360 video.
Many of the technical challenges for 3D video stem from shooting the footage in stereoscopic 360. Monoscopic 360, using two or more cameras to capture the whole 360 scene, is pretty mainstream. The resultant images allow you to look around the whole scene but are rather flat, much like a still photo.
However, things get much more complicated when you want to capture 3D-360 video. Unlike monoscopic video, 3D video requires depth. We get depth by capturing each location in a scene with two cameras — the camera equivalent of your left eye and right eye. That means you have to shoot in stereoscopic 360, with 10 to 20 cameras collectively pointing in every direction. Furthermore, all the cameras must capture 30 or 60 frames per second, exactly and simultaneously. In other words, they must be globally synchronised. Finally, you need to fuse or stitch all the images from each camera into one seamless video and you have to do it twice: once from the virtual position for the left eye and once for the right eye.
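The link between a left/right camera pair and depth can be made concrete with the standard pinhole-stereo relation, depth = focal length × baseline / disparity. The sketch below is only an illustration of that textbook relation, not code from the Surround 360 system; the focal length, baseline, and disparity values are hypothetical.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Textbook pinhole-stereo depth: Z = f * B / d.

    focal_px     -- focal length in pixels (hypothetical value below)
    baseline_m   -- distance between the two camera centres, in metres
    disparity_px -- horizontal shift of the same point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero means a point at infinity")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Assumed numbers for illustration only: 1000 px focal length and a
    # 6.4 cm baseline, roughly the distance between human eyes.
    for disparity in (64.0, 32.0, 8.0):
        depth = depth_from_disparity(1000.0, 0.064, disparity)
        print(f"disparity {disparity:5.1f} px -> depth {depth:.2f} m")
```

Nearby objects shift a lot between the two views and distant objects barely shift, which is why a single camera per direction cannot recover depth.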
Stitching is perhaps the hardest step to achieve, and it requires fairly sophisticated computational photography and computer vision techniques. The good news is that both of these have been active areas of research [1, 2, 3] for more than 20 years. The combination of past algorithm research, the rapid improvement and availability of image sensors and the decreasing cost of memory components like SSDs makes this project possible today. It would have been nearly impossible as recently as five years ago.
With these challenges in mind, the company began experimenting with various prototypes and settled on the three major components it felt were needed to make a reliable, high-quality, end-to-end capture system:
All three are interconnected and require careful design and control to achieve the goals of reliability and quality. Weakness in one area would compromise quality or reliability in another area.
Additionally, the company wanted the hardware to be off-the-shelf. It wanted others to be able to replicate or modify the design based on Facebook's design specs and software without having to rely on the company to build it for them. It wanted to empower technical and creative teams outside of Facebook by allowing them full access to develop on top of this technology.
As with any system, Facebook started by laying out the basic hardware requirements. Relaxing any one of these would compromise quality or reliability and sometimes both.
Facebook addressed each of these requirements in its design. Industrial-strength cameras by Point Grey have global shutters and do not overheat when they run for a long time. The cameras are bolted onto an aluminum chassis, which ensures that the rig and cameras won't bounce around. The outer shell is made with powder-coated steel to protect the internal components from damage.
Once the hardware design was complete, camera control and data movement became the next big issues to tackle. Because the company wanted to keep the system modifiable, it chose to control the cameras using a Linux-based PC with enough system bandwidth to support the transfer of live video streams from all the cameras to disk.
Furthermore, the isochronous nature of the cameras meant that the company needed a real-time thread for capturing the frames, feeding lower-priority, buffered disk-writing threads, to ensure that every frame was captured and none were dropped. At 30Hz, this required a sustained transfer rate on the order of 17 Gb/s. To keep up with the isochronous camera capture rates, it used an 8-way level-5 RAID SSD disk system. This, in combination with the sturdy cameras, allows for minutes to hours of continuous capture.
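The capture-then-buffer arrangement described here is a producer/consumer pattern. The following is a minimal sketch of that structure only, not Facebook's capture software: `run_capture` and its parameters are hypothetical names, frames are stand-in byte strings, and "writing to disk" is simulated by recording the frame index. Standard Python threads have no real-time priority, so a production system would rely on OS-level real-time scheduling instead.

```python
import queue
import threading


def run_capture(num_frames: int, num_writers: int = 2, buffer_size: int = 64):
    """Simulate one capture thread feeding buffered writer threads.

    Returns (written, dropped): frame indices that reached a writer and
    frame indices lost because the buffer was full.
    """
    frames = queue.Queue(maxsize=buffer_size)  # bounded buffer between threads
    written, dropped = [], []
    lock = threading.Lock()

    def capture():
        for index in range(num_frames):
            frame = (index, bytes(16))  # stand-in for a raw camera frame
            try:
                # The capture loop must never block: a late frame is a lost frame.
                frames.put_nowait(frame)
            except queue.Full:
                dropped.append(index)
        for _ in range(num_writers):
            frames.put(None)  # one shutdown sentinel per writer

    def writer():
        while True:
            item = frames.get()
            if item is None:
                break
            with lock:
                written.append(item[0])  # a real writer would flush to disk here

    writers = [threading.Thread(target=writer) for _ in range(num_writers)]
    producer = threading.Thread(target=capture)
    for thread in writers:
        thread.start()
    producer.start()
    producer.join()
    for thread in writers:
        thread.join()
    return written, dropped


if __name__ == "__main__":
    written, dropped = run_capture(1000)
    print(f"written={len(written)} dropped={len(dropped)}")
```

The bounded queue is the design point: it absorbs short stalls in the writers, and any frame that does not fit is counted as dropped rather than delaying the capture loop.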
Complicating matters, the camera control must be performed remotely since the camera itself is capturing the full 360 scene. The solution was to control the camera with a simple web interface, making it possible to control the camera from any device that supports an HTML browser.
In the system, exposure, shutter speed, analogue sensor gain and frame rate are controlled in the custom-designed capture software on a per-camera basis. The cameras are globally synced from the same software. The software captures “raw” Bayer data to ensure quality throughout the image rendering pipeline.
The computational imaging algorithms that stitch all the images together are arguably the heart of the system and its most difficult part. Luckily, the past 20 years of computational photography and computer vision research and an additional 60 years of aerial stereo photogrammetry (used to create topographic maps) gave the company a strong starting point.
Since the stitching process generates a final video image, visual image quality was paramount in the company's image processing and computational imaging pipeline. A subtle aspect of digital image capture is that care must be taken at each step of computation to ensure pixel quality is maintained. It’s quite easy to mistakenly degrade the image quality at any step in the process, after which that resolution is forever lost. Repeat that several times and quality is completely compromised.
The company processes the data in several steps:
The key algorithmic element is optical flow. The code builds on an optical flow algorithm that is mathematically trickier than other stitching solutions but delivers better results. Optical flow allows the software to compute left-right eye stereo disparity between the cameras and to synthesize novel views separately for the left and right eyes. Facebook flows the top and bottom cameras into the side cameras, and similarly flows pairs of cameras to match objects that appear at different distances. Optical flow remains an open research area because it is an ill-posed inverse problem. Its ill-posedness stems from ambiguities caused by occlusions, where one camera cannot see what an adjacent camera can. Having multiple cameras and time-varying capture mitigates this, but it remains a difficulty. The code leverages optical flow to generate seamless stereoscopic 360 panoramas automatically, so dailies are delivered overnight, a dramatic reduction in post-production time.
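To make the idea of flowing one camera into another more tangible, here is a deliberately simplified sketch on a single image row. It swaps in plain block matching for the flow estimate, which is not the algorithm the Surround 360 code uses, and the function names `estimate_flow` and `synthesize_view` are hypothetical. The flow tells us where each left-camera pixel lands in the right camera; moving pixels only a fraction `t` of that distance yields a virtual camera between the two.

```python
import math
import random


def estimate_flow(left, right, max_d=6, radius=2):
    """Block matching on one row: find d such that left[x] ~ right[x + d].

    Returns (flow, cost), where cost is the matching error of the chosen d.
    """
    n = len(left)

    def clamp(i):
        return min(max(i, 0), n - 1)

    # Try small displacements first so ties resolve to the smallest motion.
    candidates = sorted(range(-max_d, max_d + 1), key=abs)
    flow, cost = [], []
    for x in range(n):
        best_d, best_c = 0, float("inf")
        for d in candidates:
            c = sum(abs(left[clamp(x + k)] - right[clamp(x + k + d)])
                    for k in range(-radius, radius + 1))
            if c < best_c:
                best_d, best_c = d, c
        flow.append(best_d)
        cost.append(best_c)
    return flow, cost


def synthesize_view(left, right, flow, cost, t):
    """Forward-warp to a virtual camera at fraction t between left (0) and right (1)."""
    n = len(left)
    out = [None] * n
    out_cost = [float("inf")] * n
    for x in range(n):
        target = math.floor(x + t * flow[x] + 0.5)  # round half up
        source = min(max(x + flow[x], 0), n - 1)
        # When two pixels land on the same spot, keep the more reliable match.
        if 0 <= target < n and cost[x] < out_cost[target]:
            out[target] = (1 - t) * left[x] + t * right[source]
            out_cost[target] = cost[x]
    for x in range(n):  # fill holes from the nearest pixel on the left
        if out[x] is None:
            out[x] = out[x - 1] if x > 0 else left[0]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    left_row = [rng.random() for _ in range(40)]
    shift = 4  # the right camera sees the same texture 4 pixels further along
    right_row = [left_row[x - shift] if x >= shift else 0.0 for x in range(40)]
    row_flow, row_cost = estimate_flow(left_row, right_row)
    middle = synthesize_view(left_row, right_row, row_flow, row_cost, t=0.5)
    print("flow in the interior:", row_flow[10:20])
    print("first middle-view samples:", [round(v, 3) for v in middle[:6]])
```

Pixels near the row ends, where one view has no counterpart in the other, get unreliable matches in this sketch. That is a small-scale version of the occlusion ambiguity described above.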
The company outputs video at 4K, 6K, and 8K per eye. Due to high bandwidth and data demands, the 6K and 8K outputs use Facebook's Dynamic Streaming codec for Gear VR, doubling the industry standard. The output file from the system can be viewed in VR headsets such as the Oculus Rift and Gear VR. Videos can also be exported and shared in places like the Facebook News Feed, in which case only one of the monoscopic views is displayed while you scroll through News Feed, but the full stereo version is downloadable.
Facebook believes that it was successful in creating a reliable, production-ready camera that is a state-of-the-art capture and stitching solution.
Technology challenges remain. Different combinations of optical field-of-view, sensor resolution, camera arrangement, and number of cameras used present a fairly complicated engineering design challenge. More cameras help ease the job of subsequent stitching, but also increase the volume of captured data and bandwidth needed to process it. Similarly, increasing sensor resolution improves final rendering quality but, again, at the cost of increased bandwidth and data volume.
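The trade-off between camera count, sensor resolution, and bandwidth is simple arithmetic. The sketch below computes an uncompressed raw data rate; the function name is hypothetical, and the example configuration (17 cameras, 2048 × 2048 sensors, 8 bits per raw Bayer pixel) is an assumption not stated in this article, chosen because at 30 frames per second it lands close to the 17 Gb/s figure quoted earlier.

```python
def raw_data_rate_gbps(num_cameras, width_px, height_px, bits_per_pixel, fps):
    """Uncompressed capture data rate in gigabits per second (1 Gb = 1e9 bits)."""
    bits_per_second = num_cameras * width_px * height_px * bits_per_pixel * fps
    return bits_per_second / 1e9


if __name__ == "__main__":
    # Assumed configuration; only the 30 Hz frame rate comes from the article.
    base = raw_data_rate_gbps(17, 2048, 2048, 8, 30)
    more_cameras = raw_data_rate_gbps(34, 2048, 2048, 8, 30)
    more_pixels = raw_data_rate_gbps(17, 4096, 4096, 8, 30)
    print(f"assumed rig:             {base:.1f} Gb/s")
    print(f"twice the cameras:       {more_cameras:.1f} Gb/s")
    print(f"twice the linear pixels: {more_pixels:.1f} Gb/s")
```

Doubling the number of cameras doubles the data rate, while doubling the sensor's width and height quadruples it, so resolution is the more expensive knob to turn.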