Facebook highlights AI that converts 2D objects into 3D shapes
October 30, 2019
Machine learning algorithms can now extract two-dimensional objects from photographs and reconstruct them faithfully in three dimensions. The technique is applicable to augmented reality apps and robotics as well as navigation, which is why it's a key area of research for Facebook.
In a blog post ahead of the International Conference on Computer Vision (ICCV) in Seoul, Facebook presented its latest advancements in intelligent content understanding. It says that, together, its systems can detect even complex foreground and background objects, like the legs of a chair or overlapping furniture.
“[Our] research builds on recent advances in using deep learning to predict and localize objects in an image, as well as new tools and architectures for 3D shape understanding, like voxels, point clouds, and meshes,” wrote Facebook researchers Georgia Gkioxari, Shubham Tulsiani, and David Novotny in a blog post. “Three-dimensional understanding will play a central role in advancing the ability of AI systems to more closely understand, interpret, and operate in the real world.”
One of the works highlighted is Mesh R-CNN, a method that predicts three-dimensional shapes from images of cluttered and occluded objects.
Facebook researchers say they augmented the open source Mask R-CNN's two-dimensional object segmentation system with a mesh prediction branch, which they further supported with a library — Torch3d — containing highly optimized three-dimensional operators. Mesh R-CNN effectively uses Mask R-CNN to detect and classify the various objects in an image, after which it infers three-dimensional shapes with the aforementioned predictor.
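The core idea of the mesh branch is coarse-to-fine refinement: start from a coarse shape estimate and repeatedly nudge its vertices toward the object's surface. The toy NumPy sketch below illustrates only that refinement loop, using a unit sphere as a stand-in target; in the real system the vertex offsets come from a learned graph-convolution branch conditioned on image features, so every function here is a hand-rolled illustration, not Facebook's code.

```python
import numpy as np

def refine_vertices(verts, n_steps=5, step=0.5):
    """Move each vertex a fraction of the way toward the unit sphere.

    Stand-in for a learned refinement stage: in Mesh R-CNN the per-vertex
    offsets are predicted from image features instead of computed in closed
    form like this.
    """
    for _ in range(n_steps):
        norms = np.linalg.norm(verts, axis=1, keepdims=True)
        target = verts / norms            # nearest point on the unit sphere
        verts = verts + step * (target - verts)
    return verts

def surface_error(verts):
    """Mean distance of vertices from the unit sphere (a crude proxy for a
    mesh reconstruction loss such as chamfer distance)."""
    return np.abs(np.linalg.norm(verts, axis=1) - 1.0).mean()

# Coarse initialization: the 8 corners of a cube, loosely mimicking the
# voxel-based coarse shape Mesh R-CNN starts from.
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                dtype=float)
print(surface_error(cube))      # corners start at radius sqrt(3) from origin
refined = refine_vertices(cube)
print(surface_error(refined))   # vertices end up much closer to the sphere
```

The two-stage structure — a fast coarse estimate followed by iterative vertex refinement — is what lets the method handle varied topologies while still producing detailed surfaces.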
Facebook says that, evaluated on the publicly available Pix3D corpus, Mesh R-CNN successfully detects objects of all categories and estimates their full three-dimensional shape across scenes of furniture. On a separate data set — ShapeNet — Mesh R-CNN outperformed prior work by a 7% relative margin.
Another Facebook-developed system — Canonical 3D Pose Networks, cheekily shortened to C3DPO — addresses scenarios where meshes and corresponding images aren’t available for training. It builds reconstructions of three-dimensional keypoint models, achieving state-of-the-art reconstruction results using two-dimensional keypoint supervision. (Keypoints in this context refer to tracked parts of objects that provide a set of clues around the geometry and its viewpoint changes.)
C3DPO taps a reconstruction model that predicts the parameters of the corresponding camera viewpoint and the three-dimensional keypoint locations. An auxiliary component learns alongside the model to resolve the ambiguity introduced in factorizing three-dimensional viewpoints and shapes.
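The underlying insight — that 2D keypoints observed across views factor into a camera component and a canonical 3D shape component — predates C3DPO. The sketch below demonstrates it with the classical Tomasi-Kanade rank-3 factorization under orthographic projection, not C3DPO's learned model; C3DPO replaces this SVD with a neural predictor and adds a learned component to resolve the remaining ambiguity.

```python
import numpy as np

def factorize_keypoints(tracks):
    """Classical rank-3 factorization of 2D keypoint tracks.

    tracks: (2 * n_views, n_points) matrix stacking the x- and y-rows of the
    2D keypoints seen in each view, under an orthographic camera.
    Returns (motion, shape): motion (2 * n_views, 3) holds the camera rows,
    shape (3, n_points) the 3D keypoints, recovered up to a 3x3 ambiguity.
    """
    # Remove per-view translation by centering each row.
    centered = tracks - tracks.mean(axis=1, keepdims=True)
    # Under orthographic projection the centered matrix has rank <= 3,
    # so a truncated SVD splits it into camera and shape factors.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    motion = u[:, :3] * s[:3]   # camera factor
    shape = vt[:3]              # canonical 3D shape factor
    return motion, shape

# Synthesize 2D keypoints: one 3D shape observed from several rotations.
rng = np.random.default_rng(0)
shape_gt = rng.standard_normal((3, 8))            # 8 keypoints in 3D
rows = []
for angle in (0.0, 0.4, 0.9, 1.3):
    c, sn = np.cos(angle), np.sin(angle)
    proj = np.array([[c, 0.0, sn], [0.0, 1.0, 0.0]])  # orthographic camera
    rows.append(proj @ shape_gt)
tracks = np.vstack(rows)                          # 4 views -> (8, 8) matrix

motion, shape = factorize_keypoints(tracks)
reproj = motion @ shape
centered = tracks - tracks.mean(axis=1, keepdims=True)
print(np.abs(reproj - centered).max())            # near-zero residual
```

The 3x3 ambiguity left by the SVD is exactly the kind of shape/viewpoint entanglement C3DPO's auxiliary component is trained to resolve.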
Facebook says that such reconstructions weren't previously achievable, in part because of memory constraints. C3DPO's architecture enables three-dimensional reconstruction where hardware capture isn't feasible, like with large-scale objects.
“[Three-dimensional] computer vision has many open research questions, and we are experimenting with multiple problem statements, techniques, and methods of supervision as we explore the best way to push the field forward as we did for two-dimensional understanding,” wrote Gkioxari, Tulsiani, and Novotny. “As the digital world adapts and shifts to use products like 3D Photos and immersive AR and VR experiences, we need to keep pushing sophisticated systems to more accurately understand and interact with objects in a visual scene.”