NZ793982A - Deep learning system for cuboid detection
Abstract
Systems and methods for cuboid detection and keypoint localization in images are disclosed. In one aspect, a deep cuboid detector can be used for simultaneous cuboid detection and keypoint localization in monocular images. The deep cuboid detector can include a plurality of convolutional layers and non-convolutional layers of a trained convolution neural network for determining a convolutional feature map from an input image. A region proposal network of the deep cuboid detector can determine a bounding box surrounding a cuboid in the image using the convolutional feature map. The pooling layer and regressor layers of the deep cuboid detector can implement iterative feature pooling for determining a refined bounding box and a parameterized representation of the cuboid.
Description
Systems and methods for cuboid detection and keypoint localization in images are disclosed.
In one aspect, a deep cuboid detector can be used for simultaneous cuboid detection and
keypoint localization in monocular images. The deep cuboid detector can include a plurality of
convolutional layers and non-convolutional layers of a trained convolution neural network for
determining a convolutional feature map from an input image. A region proposal network of the
deep cuboid detector can determine a bounding box surrounding a cuboid in the image using the
convolutional feature map. The pooling layer and regressor layers of the deep cuboid detector can
implement iterative feature pooling for determining a refined bounding box and a parameterized
representation of the cuboid.
NZ 793982
DEEP LEARNING SYSTEM FOR CUBOID DETECTION
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority to U.S. Patent Application
Number 62/422,547, filed November 15, 2016, entitled “DEEP CUBOID DETECTION:
BEYOND 2D BOUNDING BOXES,” the content of which is hereby incorporated by
reference herein in its entirety.
[0001A] This application is a divisional of New Zealand Patent Application No.
753147, the entire content of which is incorporated herein by reference.
Field
The present disclosure relates generally to systems and methods for three-dimensional
object detection in images and more particularly to deep machine learning
systems for detecting cuboids in images.
Description of the Related Art
A deep neural network (DNN) is a computational machine learning method.
DNNs belong to a class of artificial neural networks (NN). With NNs, a computational graph
is constructed which imitates the features of a biological neural network. The biological
neural network includes features salient for computation and responsible for many of the
capabilities of a biological system that may otherwise be difficult to capture through other
methods. In some implementations, such networks are arranged into a sequential layered
structure in which connections are unidirectional. For example, outputs of artificial neurons
of a particular layer can be connected to inputs of artificial neurons of a subsequent layer. A
DNN can be a NN with a large number of layers (e.g., 10s, 100s, or more layers).
Different NNs are different from one another in different perspectives.
For example, the topologies or architectures (e.g., the number of layers and how the layers
are interconnected) and the weights of different NNs can be different. A weight can be
approximately analogous to the synaptic strength of a neural connection in a biological
system. Weights affect the strength of effect propagated from one layer to r. The
output of an artificial neuron can be a nonlinear function of the ed sum of its inputs.
The weights of a NN can be the weights that appear in these summations.
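By way of a hedged illustration (not part of the original disclosure), the behavior of a single artificial neuron described above, a nonlinear function of the weighted sum of its inputs, can be sketched as follows; the function and variable names are assumptions made for the example:

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs, plus a bias term.
    weighted_sum = float(np.dot(weights, inputs)) + bias
    # Nonlinear activation (here a rectified linear unit) applied to the sum.
    return max(0.0, weighted_sum)

# Example: one neuron of a layer receiving three inputs from the previous layer.
print(neuron_output(np.array([0.2, -1.0, 0.5]),
                    np.array([0.7, 0.1, -0.4]),
                    bias=0.05))
```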
Building a three-dimensional (3D) representation of the world from a
single monocular image is an important challenge in computer vision. The present disclosure
provides examples of systems and methods for detection of 3D cuboids (e.g., box-like
objects) and localization of keypoints in images. In one aspect, a deep cuboid detector can be
used for simultaneous cuboid detection and keypoint localization in images. The deep
cuboid detector can include a plurality of convolutional layers and non-convolutional layers
of a trained convolutional neural network for determining a convolutional feature map from
an input image. A region proposal network of the deep cuboid detector can determine a
bounding box surrounding a cuboid in the image using the convolutional feature map. The
pooling layer and regressor layers of the deep cuboid detector can implement iterative feature
pooling for determining a refined bounding box and a parameterized representation of the
cuboid.
[0005A] In one broad form, the present invention seeks to provide a system for
training a cuboid detector, the system comprising: non-transitory memory configured to store
executable instructions; and one or more hardware processors in communication with the
non-transitory memory, the one or more hardware processors programmed by the executable
instructions to: access a plurality of training images, wherein the plurality of training images
includes a first training image; generate a cuboid detector, wherein the cuboid detector
comprises: a plurality of convolutional layers and non-convolutional layers of a first
convolutional neural network (CNN); a region proposal network (RPN) connected to a first
layer of the plurality of convolutional layers and non-convolutional layers; a pooling layer;
and at least one regressor layer; wherein the pooling layer and the at least one regressor layer
are both connected to a second layer of the plurality of convolutional layers and nonconvolutional
layers; and train the cuboid detector, wherein training the cuboid detector
comprises: determining, by applying the cuboid detector to the first training image, a region
of interest (RoI) at a cuboid image location; determining, by applying the cuboid detector to
the first training image, a representation of a cuboid in the training image; determining a first
difference between a reference cuboid image location and the cuboid image location;
determining a second difference between a reference representation of the cuboid and the
determined representation of the cuboid; and updating weights of the cuboid detector based
on the first difference and the second difference.
[0005B] In one embodiment, the cuboid comprises a cuboid, a cylinder, a sphere,
or any combination thereof.
[0005C] In one embodiment, the first layer and the second layer are identical.
[0005D] In one embodiment, the one or more hardware processors is further
programmed to: generate, using the plurality of convolutional layers and the nonconvolutional
layers, a convolutional feature map for the first training image; determine,
using the RPN, at least one RoI comprising the cuboid at an initial cuboid image location in
the training image; determine, using the initial cuboid image location, a submap of the
convolutional feature map corresponding to the at least one RoI comprising the cuboid; and
determine, using the pooling layer, the at least one regressor layer, and the submap of the
convolutional feature map corresponding to the at least one RoI comprising the cuboid, the
RoI at the cuboid image location and the representation of the cuboid.
[0005E] In one embodiment, the initial cuboid image location is represented as a
two-dimensional (2D) bounding box.
[0005F] In one embodiment, the one or more hardware processors is further
programmed to: iteratively determine, using the pooling layer, the at least one regressor
layer, and the submap of the convolutional feature map corresponding to the RoI comprising
the cuboid, the RoI at the cuboid image location and the representation of the cuboid.
[0005G] In one embodiment, the initial cuboid image location is represented as a
two-dimensional (2D) bounding box.
[0005H] In one embodiment, the one or more hardware processors is further
programmed to: update the weights of the RPN; and update the weights of the at least one
regressor layer.
[0005I] In one embodiment, the one or more hardware processors is further
programmed to: update the weights of the first CNN; update the weights of the RPN; and
update the weights of the at least one regressor layer.
[0005J] In one embodiment, the one or more hardware processors is further
programmed to: receive the first CNN.
[0005K] In one embodiment, the at least one regressor layer comprises two or
more layers.
[0005L] In one embodiment, the two or more layers comprise a fully connected
layer, a non-fully connected layer, or any combination thereof.
[0005M] In one embodiment, the at least one regressor layer is associated with at
least three loss functions during training of the cuboid detector.
[0005N] In one embodiment, the RPN comprises a deep neural network (DNN).
[0005O] In one embodiment, the RPN is associated with at least two loss functions
during the training of the cuboid detector.
[0005P] In one embodiment, the representation of the cuboid comprises a
parameterized representation of the cuboid.
[0005Q] In another broad form, the present invention seeks to provide a method for
training a cuboid detector, the method comprising: accessing a plurality of training images,
wherein the plurality of training images includes a first training image; generating a cuboid
detector, wherein the cuboid detector comprises: a plurality of convolutional layers and nonconvolutional
layers of a first convolutional neural network (CNN), a region proposal
network (RPN) connected to a first layer of the plurality of convolutional layers and nonconvolutional
layers, a pooling layer; and at least one regressor layer, wherein the pooling
layer and the at least one regressor layer are both connected to a second layer of the plurality
of convolutional layers and non-convolutional layers; and training the cuboid detector,
wherein training the cuboid detector comprises: determining, by applying the cuboid detector
to the first training image, a region of interest (RoI) at a cuboid image location; determining,
by applying the cuboid detector to the first training image, a representation of a cuboid in the
training image; determining a first difference between a reference cuboid image location and
the cuboid image location; determining a second difference between a reference
representation of the cuboid and the determined representation of the cuboid; and updating
weights of the cuboid detector based on the first difference and the second difference.
[0005R] In one embodiment, the first layer and the second layer are identical.
[0005S] In one embodiment, the method further comprises: generating, using the
plurality of convolutional layers and the non-convolutional layers, a convolutional feature
map for the first training image; determining, using the RPN, at least one RoI comprising the
cuboid at an initial cuboid image location in the training image; determining, using the initial
cuboid image location, a submap of the convolutional feature map corresponding to the at
least one RoI comprising the cuboid; and determining, using the pooling layer, the at least
one regressor layer, and the submap of the convolutional feature map corresponding to the at
least one RoI comprising the cuboid, the RoI at the cuboid image location and the
representation of the cuboid.
[0005T] In one embodiment, the method further comprises: iteratively
determining, using the pooling layer, the at least one regressor layer, and the submap of the
convolutional feature map corresponding to the RoI comprising the cuboid, the RoI at the
cuboid image location and the representation of the cuboid.
Details of one or more implementations of the subject matter described in
this specification are set forth in the accompanying drawings and the description below.
Other features, aspects, and advantages will become apparent from the description, the
drawings, and the claims. Neither this summary nor the following detailed description
purports to define or limit the scope of the inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is an example monocular image illustrating two-dimensional (2D)
object detection with a bounding box overlaid around an object detected.
FIG. 1B is an example monocular image illustrating three-dimensional
(3D) cuboid detection with a representation of the cuboid overlaid on the object detected.
FIG. 1B shows that one cuboid inside the monocular image is detected and its vertices
localized (shown as eight black circles that are connected).
FIG. 2 depicts an example architecture of a cuboid detector.
FIG. 3 is an example image illustrating region of interest (RoI) normalized
coordinates.
FIGS. 4A-4G show images illustrating example cuboid detection and
keypoint localization. One or more cuboids have been detected in each image with keypoints
of each cuboid localized, shown as white connected circles.
FIGS. 5A-5C show example images showing improved performance with
keypoint refinement via iterative feature pooling.
FIG. 6 is a schematic illustration showing example cuboid vanishing points.
FIGS. 7A-7F are plots showing example performance of a cuboid detector.
FIG. 8 is a flow diagram of an example process of training a cuboid
detector.
FIG. 9 is a flow diagram of an example process of using a cuboid detector
for cuboid detection and keypoint localization.
FIG. 10 schematically illustrates an example of a wearable display system,
which can implement an embodiment of the deep cuboid detector.
Throughout the drawings, reference numbers may be re-used to indicate
correspondence between referenced elements. The drawings are provided to illustrate
example embodiments described herein and are not intended to limit the scope of the
disclosure.
DETAILED DESCRIPTION
Overview
Models representing data relationships and patterns, such as functions,
algorithms, systems, and the like, may accept input, and produce output that corresponds to
the input in some way. For example, a model may be implemented as a machine learning
method such as a convolutional neural network (CNN) or a deep neural network (DNN).
Deep learning is part of a broader family of machine learning methods based on the idea of
learning data representations as opposed to task specific algorithms and shows a great deal of
promise in solving audio-visual computational problems useful for augmented reality, mixed
reality, virtual reality, and machine intelligence. In machine learning, a convolutional
neural network (CNN, or ConvNet) can include a class of deep, feed-forward artificial neural
networks, and CNNs have successfully been applied to analyzing visual imagery. Machine
learning methods include a family of methods that can enable robust and accurate solutions
to a wide variety of problems, including eye image segmentation and eye tracking.
Disclosed herein are examples of a cuboid detector which processes an
input image of a scene and localizes at least one cuboid in the image. For example, a cuboid
detector (such as a deep cuboid detector) can process a consumer-quality Red-Green-Blue
(RGB) image of a cluttered scene and localize some or all three-dimensional (3D) cuboids in
the image. A cuboid can comprise a boxy or a box-like object and can include a polyhedron
(which may be convex) with, e.g., 4, 5, 6, 7, 8, 10, 12, or more faces. For example, cuboids
can include pyramids, cubes, prisms, parallelepipeds, etc. Cuboids are not limited to such
polyhedral shapes from geometry and can include box-like structures such as, e.g., appliances
(e.g., television sets, computer monitors, toasters, washing machines, refrigerators), furniture
(e.g., sofas, chairs, beds, cribs, tables, book cases, cabinets), vehicles (e.g., automobiles,
buses), etc. As further described below, cuboids may be identified in terms of their faces,
vertices, edges, or presence within a bounding box.
In some embodiments, a cuboid can comprise a geometric shape
characterized as a tuple of N parameters. The parameters may be geometric in nature, like the
radius of a sphere or the length, width, and height of the cuboid. A more general way to
parameterize any geometric primitive can be to represent it as a collection of points on the
surface of the primitive. If a random point on the surface of the primitive is chosen, the
random point might not be localizable from a computer-vision point of view. It may be
advantageous for the set of parameterization points to be geometrically informative and
visually discriminative. For example, in the case of cuboids, the set of parameterization
points may be the cuboid’s vertices (which may be referred to sometimes herein as corners or
keypoints).
In some embodiments, a cuboid is represented as a tuple of eight vertices,
where each vertex can be denoted by its coordinates (e.g., Cartesian x,y coordinates) in the
image. In such a representation, a cuboid is represented by 16 parameters: the two
coordinates of each of the eight vertices. Not all 16 parameters might be needed in some
cases, for example, as will be discussed below alternate cuboid representations may not
include some vertices (e.g., use only six vertices) and determine the other vertices using
vanishing points.
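As a hedged sketch of the sixteen-parameter representation discussed above (two image coordinates for each of eight vertices), a cuboid detection could be stored as a small array; the vertex ordering and numbers below are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

# Eight vertices, each with (x, y) pixel coordinates: 8 x 2 = 16 parameters.
cuboid_vertices = np.array([
    [120, 80], [220, 75], [230, 170], [125, 180],   # one face of the cuboid
    [150, 60], [250, 58], [260, 150], [155, 158],   # the opposite face
], dtype=np.float32)

assert cuboid_vertices.size == 16  # the full 16-parameter representation
```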
Contrary to other approaches which fit a 3D model from low-level cues
like corners, edges, and vanishing points, the cuboid detector disclosed herein can be an endto-end
deep learning system that detects cuboids across many semantic categories (e.g.,
ovens, shipping boxes, and furniture). In some implementations, the cuboid detector can
localize a cuboid with a two-dimensional (2D) bounding box, and simultaneously localize the
cuboid’s keypoints (e.g., vertices or corners), effectively producing a 3D interpretation or
representation of a box-like object. The cuboid detector can refine keypoints by pooling
convolutional features iteratively, improving the accuracy of the keypoints detected. Based
on an end-to-end deep learning framework, an advantage of some implementations of the
cuboid detector is that there is little or no need to design custom low-level detectors for line
segments, vanishing points, junctions, etc.
The cuboid detector can include a plurality of convolutional layers and
non-convolutional layers of a convolutional neural network, a region proposal network
(RPN), and a plurality of pooling and regressor layers. The RPN can generate object
proposals in an image. The plurality of convolutional layers and non-convolutional layers can
generate a convolutional feature map of an input image. A convolutional layer of the CNN
can include a kernel stack of kernels. A kernel of a convolutional layer, when applied to its
input, can produce a resulting output activation map showing the response to that particular
learned kernel. The resulting output activation map can then be processed by another layer
of the CNN. Non-convolutional layers of the CNN can include, for example, a normalization
layer, a rectified linear layer, or a pooling layer.
The region proposal network (RPN), which can be a convolutional neural
network or a deep neural network, can determine a 2D bounding box around a cuboid in the
image from the convolutional feature map. The 2D bounding box can represent a region of
interest (RoI) on the image which includes a cuboid at an image location. The plurality of
pooling and regressor layers can include, for example, a pooling layer and two or more fullyconnected
layers (such as 3, 5, 10, or more layers). Based on the initial 2D bounding box,
the plurality of cuboid pooling and regressor layers can, iteratively, determine a refined 2D
bounding box and the cuboid’s keypoints.
The cuboid detector can be trained in an end-to-end fashion and can be
suitable for real-time applications in augmented reality (AR), mixed reality (MR), or robotics
in some implementations. As described below, a wearable mixed reality display device (e.g.,
the wearable display system 1000 described with reference to FIG. 10) can include a
processor programmed to perform cuboid detection on images acquired by an outward-facing
camera of the display device. Some or all parameters of the cuboid detector can be learned in
a process referred to as training. For example, a machine learning model can be trained using
training data that includes input data and the correct or preferred output of the model for the
corresponding input data. The machine learning model can repeatedly process the input data,
and the parameters (e.g., the weight values) of the machine learning model can be modified
in what amounts to a trial-and-error process until the model produces (or “converges” on) the
correct or preferred output. For example, the modification of weight values may be
performed through a process referred to as “back propagation.” Back propagation includes
determining the difference between the expected model output and the obtained model
output, and then determining how to modify the values of some or all parameters of the
model to reduce the difference between the expected model output and the obtained model
output.
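The trial-and-error weight update described above can be sketched, purely for illustration, with a toy linear model and squared error instead of the networks of the disclosure; all names and numbers are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))          # input data
w_true = np.array([1.5, -2.0, 0.3])
y = x @ w_true                         # correct (preferred) model output

w = np.zeros(3)                        # model parameters (weights) to be learned
learning_rate = 0.1
for step in range(200):
    y_pred = x @ w                     # obtained model output
    error = y_pred - y                 # difference between obtained and expected output
    grad = x.T @ error / len(x)        # how to modify the weights to reduce that difference
    w -= learning_rate * grad          # back-propagated weight update
print(w)                               # converges toward w_true
```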
Example Comparison of Object Detection and Cuboid Detection
Building a 3D representation of the world from a single monocular image
is an important problem in computer vision. In some applications, objects having explicit 3D
models are localized with their poses estimated. But without such 3D models, a person or a
computer system (e.g., the wearable display system 1000 described with reference to FIG. 10) may still need to reason about its surroundings in terms of simple combinations of
geometric shapes like cuboids, cylinders, and spheres. Such primitives, sometimes referred
to as geons, can be easy for humans to reason about. Humans can effortlessly make coarse
estimates about the pose of these simple geometric primitives and even compare geometric
parameters like length, radius or area across disparate instances. While many objects are
composed of multiple geometric primitives, a large number of real objects can be well
approximated by as little as one primitive.
For example, a common shape is the box. Many everyday objects can
geometrically be classified as a box (e.g., shipping boxes, cabinets, washing machines, dice,
microwaves, desktop computers). Boxes (which are examples of cuboids) span a diverse set
of everyday object instances, and humans can easily fit imaginary cuboids to these objects
and localize their vertices and faces. People can also compare the dimensions of different
box-like objects even though they are not aware of the exact dimensions of the box-like
objects or even if the objects are not perfect cuboids. Disclosed herein are systems and
methods that implement a cuboid detector for detecting class agnostic geometric entities,
such as cuboids. Class agnostic means that different classes of a geometric entity are not
differentiated. For example, a cuboid detector may not differentiate between different classes
of a cuboid, such as a shipping box, a microwave oven, or a cabinet. All of these box-like
objects can be represented with the same simplified concept, a cuboid.
An embodiment of a cuboid detector can be used for 3D object detection
as follows: fit a 3D bounding box to objects in an image (e.g., an RGB image or an RGBDepth
(RGB-D) image), detect 3D keypoints in the image, or perform 3D model to 2D image
alignment. Because an image might contain multiple cuboids as well as lots of clutter (e.g.,
non-cuboidal objects), the cuboid detector can first determine a shortlist of regions of interest
(RoIs) that correspond to cuboids. In addition to the 2D bounding box enclosing each
cuboid, the cuboid detector can determine the location of all eight vertices.
Deep learning has revolutionized image recognition in the past few years.
Many state-of-the-art methods in object detection today are built on top of deep networks that
have been trained for the task of image classification. A cuboid detector can be a deep
cuboid detector implementing one or more deep learning methods. The cuboid detector can
have high accuracy and run in real-time using the hardware of a mobile device (e.g., the
wearable display system 1000 described with reference to FIG. 10).
FIG. 1A is an example monocular image 100a illustrating two-dimensional
(2D) object detection with a bounding box 104 overlaid around an object
detected. FIG. 1B is an example monocular image 100b illustrating three-dimensional (3D)
cuboid detection with a representation 108 of the cuboid overlaid on the object detected.
FIG. 1B shows that one cuboid 108 inside the monocular image 100 is detected and its
vertices localized. The eight vertices are shown as four black circles 112a-112d that are
connected by four edges 120a-120d (represented as dotted lines) and four additional black
circles 116a-116d connected by four edges 124a-124d (represented as solid lines). Four of
the vertices 112a-112d represent one face 128a of the cuboid, and the other four of the
vertices 116a-116d represent another face 128b of the cuboid. The two faces 128a, 128b of
the cuboid 108 are connected by four edges 132a-132d (represented as dashed lines) through
the vertices 112a-112d, 116a-116d. The cuboid detector can detect box-like objects in a
scene. Unlike object detection, the cuboid detector can determine more than a bounding box
of an object. In addition, the cuboid detector can localize the vertices of the cuboids (e.g.,
compare FIG. 1B with FIG. 1A). In some embodiments, the cuboid detector can be class
agnostic. For example, the cuboid detector does not care about the class of the cuboids being
detected. For example, the cuboid detector can distinguish two classes of objects: a cuboid
and a non-cuboid. The cuboid detector can perform 3D cuboid detection by
determining all cuboids inside a monocular image and localizing their vertices. The cuboid
detector can be trained in an end-to-end fashion. The cuboid detector can run in real-time
and perform cuboid detection with RGB images of cluttered scenes captured using a
consumer-grade camera as input. A wearable display device (e.g., the wearable display
system 1000 described with reference to FIG. 10) can implement the cuboid detector and use
information about the detected cuboids to generate or update a world map indicative of the
environment surrounding the user of the wearable display system.
A cuboid is a geometric object that can be parameterized, and a cuboid
detector (e.g., a deep cuboid detector) can determine parameters of a cuboid in a scene. One
approach to detect a cuboid is to detect the edges and try to fit the model of a cuboid to these
edges. Hence, robust edge selection may be a useful aspect of the system. However, this
becomes challenging when there are misleading textures on cuboidal surfaces, for example, if
edges and corners are occluded or the scene contains considerable background clutter. It can
be challenging to classify whether a given line belongs to a given cuboid with purely local
features. The cuboid detector can learn to detect cuboids in images using a data-driven
approach. The cuboid detector can assign a single label (e.g., “cuboid”) to box-like objects
in a scene, even though the label is spread over many categories like washing
machines, ballot boxes, desks, cars, television sets, etc. The cuboid detector can include a
CNN that is able to successfully learn features that help a system implementing it (e.g., the
wearable display system 1000 described with reference to FIG. 10) identify cuboids in
different scenes.
In some embodiments, a cuboid detector can implement a deep learning
model that jointly performs cuboid detection and keypoint localization. For example, a
cuboid detector can include a deep neural network that jointly performs cuboid detection and
keypoint localization. The cuboid detector can exceed the accuracy of the detection and
localization accuracy performed by other methods. In some implementations, the cuboid
detector can first detect the object of interest and then make coarse or initial predictions
regarding the location of its vertices. The cuboid detector can utilize the coarse or initial predictions
as an attention mechanism, performing refinement of vertices by only looking at regions with
high probability of being a cuboid. In some embodiments, the cuboid detector can
implement an iterative feature pooling mechanism to improve accuracy. The cuboid detector
can combine cuboid-related losses and/or implement alternate parametrizations to improve performance.
Example Cuboid Network Architecture and Loss Function
FIG. 2 depicts an example architecture of a cuboid detector. The cuboid
detector 200 can include one or more of the following components: convolutional layers
204 (also referred to herein as a CNN tower), a Region Proposal Network (RPN) 208, at least
one pooling layer 212, or one or more fully connected layers 216 (e.g., a regional CNN
(R-CNN) regressor (or classifier)). The pooling layer 212 and the fully connected layers 216
can implement iterative feature pooling, which refines cuboid keypoint locations. The R-CNN
can be a Faster R-CNN.
The cuboid detector 200 can implement a deep cuboid detection pipeline.
The first action of the deep cuboid detection pipeline can be determining Regions of Interest
(RoIs) 220a1, 220b, in an image 202a where a cuboid might be present. The Region
Proposal Network (RPN) 208 can be trained to output such RoIs 220a1, 220b as illustrated in
the image 202b. Then, regions 224a with features corresponding to each RoI 220a1, 220b
can be pooled, using one or more pooling layers 212, from a convolutional feature map 228
(e.g., the fifth convolutional feature map, conv5, in VGG-M from the Visual Geometry
Group at Oxford University). These pooled features can be passed through two fully
connected layers 216. In some implementations, instead of just producing a 2D bounding
box, the cuboid detector 200 can output the normalized offsets of the vertices from the center
of the RoI 220a1, 220b. The cuboid detector 200 can refine the predictions by performing
iterative feature pooling. The dashed lines in FIG. 2 show the regions 224a, 224b of the
convolutional feature map 228, corresponding to the RoI 220a1 in the image 202b and a
refined RoI 220a2 in the image 202c, from which features can be pooled. The two fully
connected layers 216 can process the region 224b of the convolutional feature map 228
corresponding to the refined RoI 220a2 to determine a further refined RoI and/or a
representation of a cuboid 232 in the image 202d.
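Purely as an illustrative sketch (not the Caffe implementation referenced below), the pooling-and-regression portion of the pipeline of FIG. 2 might be approximated with off-the-shelf PyTorch modules. The layer sizes follow the numbers quoted in the text (a conv5-like feature map with 512 channels, 7 × 7 RoI pooling, two 4096-unit fully connected layers); all module and variable names are assumptions, and the RPN is omitted, with `rois` standing in for its proposals:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class CuboidHead(nn.Module):
    """Illustrative R-CNN-style head: two fully connected layers feeding three outputs."""
    def __init__(self, channels=512, pool=7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * pool * pool, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls = nn.Linear(4096, 2)        # cuboidness score (cuboid vs. not cuboid)
        self.box = nn.Linear(4096, 4)        # bounding box offsets (dx, dy, dw, dh)
        self.corners = nn.Linear(4096, 16)   # eight keypoints x two coordinates

    def forward(self, pooled):
        features = self.fc(pooled)
        return self.cls(features), self.box(features), self.corners(features)

# Usage sketch: `feature_map` stands in for the conv5 output of the CNN tower,
# `rois` for RPN proposals given as (batch index, x1, y1, x2, y2).
feature_map = torch.randn(1, 512, 38, 50)
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 180.0]])
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
scores, box_offsets, corner_offsets = CuboidHead()(pooled)
```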
The CNN Tower 204 can be the pre-trained fully convolutional part of
ConvNets, such as VGG and ResNets. The convolutional feature map 228 refers to the
output of the last layer of the CNN Tower 204. For example, the convolutional feature map
228 can be the output of the fifth convolutional layer, such as conv5 in VGG16 from the
Visual Geometry Group at Oxford University (with size m × n × 512).
The RPN 208 can be a fully convolutional network that maps every cell in
the convolutional feature map 228 to a distribution over K multi-scale anchor-boxes,
bounding box offsets, and objectness scores. The RPN can have two associated loss
functions: a log loss function for objectness and a smooth L1 loss function for bounding box
regression. The RPN 208 can, for example, use 512 3 × 3 filters, then 18 1 × 1 filters for
objectness and 36 1 × 1 filters for bounding box offsets.
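A hedged sketch of an RPN head with the filter counts quoted above (512 3 × 3 filters, 18 1 × 1 filters for objectness and 36 1 × 1 filters for box offsets, i.e., K = 9 anchors with 2 scores and 4 offsets each); the variable names and feature-map size are assumptions:

```python
import torch
import torch.nn as nn

rpn_conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)   # 512 3x3 filters
objectness = nn.Conv2d(512, 18, kernel_size=1)             # 2 scores for each of 9 anchors
box_offsets = nn.Conv2d(512, 36, kernel_size=1)            # 4 offsets for each of 9 anchors

conv5 = torch.randn(1, 512, 38, 50)                        # stand-in convolutional feature map
hidden = torch.relu(rpn_conv(conv5))
print(objectness(hidden).shape, box_offsets(hidden).shape)
```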
The RoI pooling layer 212 can use, for example, max pooling to convert
the features inside any valid region of interest 220a1, 220a2, 220b into a small fixed-size
feature map (or a submap of the convolutional feature map 228). For example, for conv5 of
size m × n × 512, the pooling layer 212 can produce an output of size 7 × 7 × 512,
independent of the input regions aspect ratio and scale. In some embodiments, spatial
pyramid matching can be implemented.
The fully connected layers 216 (e.g., a R-CNN regressor) can then be
applied to each fixed-size feature vector, outputting a cuboidness score, bounding box offsets
(four numbers), and eight cuboid keypoint locations (16 numbers). The bounding box
regression values (Δx, Δy, Δw, Δh) can be used to fit the initial object proposal tightly around
the object. The keypoint locations can be encoded as offsets from the center of the RoI and
can be normalized by the proposal width/height as shown in FIG. 3. FIG. 3 illustrates RoI-normalized
coordinates of vertices represented as offsets from the center of an RoI 304 in an
image 300 and normalized by the RoI’s width w and height h, with (x_k, y_k) being a
keypoint 308 and (x_c, y_c) being the center 312 of the RoI. Example ground truth targets
for each keypoint are shown in Equations [1] and [2]:
t_x = (x_k − x_c) / w, and      Equation [1]
t_y = (y_k − y_c) / h.      Equation [2]
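A short sketch of the regression targets of Equations [1] and [2]; the example numbers are illustrative assumptions:

```python
def keypoint_targets(x_k, y_k, x_c, y_c, w, h):
    # Offsets of a keypoint from the RoI center, normalized by RoI width and height
    # (Equations [1] and [2]).
    t_x = (x_k - x_c) / w
    t_y = (y_k - y_c) / h
    return t_x, t_y

# Keypoint at (150, 90) inside an RoI centered at (140, 100) that is 80 x 120 pixels.
print(keypoint_targets(150, 90, 140, 100, w=80, h=120))   # (0.125, -0.0833...)
```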
Referring to FIG. 2, the R-CNN can include two fully connected layers
216 (e.g., 4096 neurons each) and can have three associated loss functions: a log loss
function for cuboidness and smooth L1 loss functions for both bounding box and vertex
regression.
When viewed in unison, the RoI pooling layer 212 and R-CNN layers act
as a refinement mechanism, mapping an input box to an improved one, given the feature
map. The cuboid detector 200 can apply the last part of the network multiple times (e.g., 2,
3, 4, or more times), referred to herein as iterative feature pooling.
The loss functions used in the RPN 208 can include L_anchor-cls, the
log loss over two classes (e.g., cuboid vs. not cuboid), and L_anchor-reg, the Smooth L1
loss of the bounding box regression values for each anchor box. The loss functions for the
R-CNN can include L_ROI-cls, the log loss over two classes (e.g., cuboid vs. not cuboid),
L_ROI-reg, the Smooth L1 loss of the bounding box regression values for the RoI, and
L_ROI-corner, the Smooth L1 loss over the RoI’s predicted keypoint locations. The last term
can be referred to as the corner or vertex regression loss. The complete loss function can be
a weighted sum of the above mentioned losses and can be written as shown in Equation [3].
The loss weight λ_i can be different in different implementations, such as 0.1, 0.5, 1, 2, 5, 10,
or more.
L = λ_1 L_anchor-cls + λ_2 L_anchor-reg + λ_3 L_ROI-cls + λ_4 L_ROI-reg +
λ_5 L_ROI-corner.      Equation [3]
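Equation [3] amounts to a weighted sum of the five loss terms; a minimal sketch (names are assumptions) is:

```python
def total_loss(l_anchor_cls, l_anchor_reg, l_roi_cls, l_roi_reg, l_roi_corner,
               weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of Equation [3]; the experiments described below set every weight to one.
    terms = (l_anchor_cls, l_anchor_reg, l_roi_cls, l_roi_reg, l_roi_corner)
    return sum(w * t for w, t in zip(weights, terms))
```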
Example Performance
To determine its performance, an embodiment of the cuboid detector 200
was implemented using Caffe and built on top of an implementation of Faster R-CNN. To
determine the performance, the VGG-M or VGG16 networks that have been pre-trained for
the task of image classification on ImageNet were used. VGG-M is a smaller model with 7
layers while VGG16 contains 16 layers. All models were fine-tuned for 50K iterations using
stochastic gradient descent (SGD) with a learning rate of 0.001, which was reduced by a
factor of 10 after 30K iterations. Additional parameters used include a momentum of 0.9,
weight decay of 0.0005, and dropout of 0.5. Instead of stage-wise training, components of
the cuboid detector 200 were jointly optimized with the values of all the loss weights as one
(e.g., λ_i = 1 in Equation [3]).
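The training hyperparameters quoted above, gathered in one place as an illustrative configuration (a plain dictionary, not the Caffe solver syntax actually used):

```python
solver_config = {
    "solver": "SGD",
    "base_learning_rate": 0.001,
    "lr_decay_factor": 0.1,            # learning rate reduced by a factor of 10...
    "lr_decay_step": 30_000,           # ...after 30K iterations
    "max_iterations": 50_000,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "dropout": 0.5,
    "loss_weights": [1, 1, 1, 1, 1],   # all loss weights set to one (joint training)
}
```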
Data. The SUN Primitive dataset (a comprehensive collection of
annotated images covering a large variety of environmental scenes, places and the objects
within; available from https://groups.csail.mit.edu/vision/SUN/) was used to train the deep
cuboid detector 200. The dataset consists of 3516 images and is a mix of in-door scenes with
lots of clutter, internet images containing only a single cuboid, and outdoor images of
buildings that also look like cuboids. Both cuboid bounding boxes and cuboid keypoints
have ground-truth annotations. This dataset includes 1269 annotated cuboids in 785 images.
The rest of the images are negatives, e.g., they do not contain any cuboids. The dataset was
split to create a training set of 3000 images and their horizontally flipped versions and a test
set with 516 test images.
The cuboid detector 200 was evaluated on two tasks: cuboid bounding box
detection and cuboid keypoint localization. For detection, a bounding box was correct if the
intersection over union (IoU) overlap was greater than 0.5. Detections were sorted by
confidence (e.g., the network’s classifier softmax output) with the mean Average Precision
(AP) as well as the entire Precision-Recall curve reported. For keypoint localization, the
Probability of Correct Keypoint (PCK) and Average Precision of Keypoint (APK) metrics
were used to determine the cuboid detector’s performance. PCK and APK are used in the
human pose estimation literature to measure the performance of systems predicting the
location of human body parts like head, wrist, etc. PCK measures the fraction of annotated
instances that are correct when all the ground truth boxes are given as input to the system. A
predicted keypoint was considered correct if its normalized distance from the annotation was
less than a threshold (α). APK, on the other hand, takes both detection confidence and
keypoint localization into consideration. A normalized distance, α, of 0.1 was used, meaning
that a predicted keypoint was considered to be correct if it lay within a number of pixels of
the ground truth annotation of the keypoint shown in Equation [4]. The normalized distance,
α, can be different in different implementations, such as 0.01, 0.2, 0.3, 0.5, 0.9, or more.
α ∗ max(height, width)      Equation [4]
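A hedged sketch of the PCK criterion of Equation [4]; the coordinates below are illustrative assumptions:

```python
import math

def keypoint_is_correct(pred, gt, box_w, box_h, alpha=0.1):
    # A predicted keypoint counts as correct if its distance to the ground truth
    # is below alpha * max(height, width) of the box (Equation [4]).
    distance = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    return distance < alpha * max(box_w, box_h)

print(keypoint_is_correct((105, 52), (100, 50), box_w=80, box_h=120))  # True: ~5.4 < 12
```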
See FIGS. 7A-7F for these metrics reported on the SUN Primitive test set and samples of
cuboid detections and vertex localizations in monocular images 400a-400y, 404a-404e
illustrated in FIGS. 4A-4G. For example, FIG. 4A shows a monocular image 400a with
example representations 108a-108d of four cuboids each represented as eight vertices. As
another example, FIG. 4A shows another monocular image 400b with an example
representation 108a of a cuboid with four vertices representing one face of the cuboid
connected by four edges (shown as solid lines) and four vertices representing another face of
the cuboid connected by another four edges (shown as dotted lines). The eight vertices on
these two faces of the representation 108a of the cuboid are connected by four edges (shown
as dashed lines).
FIGS. 7A-7F are graphs illustrating example deep cuboid detector
evaluation metrics. APK: Average Precision of Keypoint; PCK: Probability of Correct
Keypoint; normalized distance from GT corners; order of keypoints: front-top-left, back-top-left,
front-bottom-left, front-top-right, back-bottom-left, front-bottom-right, back-top-right,
back-bottom-right. B: bounding box loss, C: corner loss, and I: iterative. FIGS. 4A-4F
show images illustrating example cuboid detection and keypoint localization using VGG16 as
the CNN tower and iterative feature pooling. The cuboid detector 200 was able to localize
the vertices of cuboids in consumer-grade RGB images. The cuboid detector 200 was able to
handle both objects like boxes (that are perfectly modeled by a cuboid) as well as objects like
sinks (that are only approximate cuboids). FIG. 4G shows example images 404a-404e
illustrating improper cuboid detection and keypoint localization, which can be reduced or
eliminated as further described below.
In one implementation, the cuboid detector 200 achieved a mAP of 75.47 for
bounding box detection, which was significantly better than the HOG-based system with a
mAP of 24.0.
Multi-Task learning. Multiple networks, each performing multiple
different tasks, were trained. A base network that just output bounding boxes around cuboids
was trained. This base network performed general object detection using rectangles
enclosing cuboids. The base network output the class of the box and the bounding box
regression values. Next, a different network with additional supervision about the location of
the corners was trained. This network did not output bounding box regression coordinates.
Then, a network (e.g., the cuboid detector 200) that output both the bounding box regression
values and the coordinates of the vertex was trained. A corresponding term was added to the
loss function for each additional task. From testing, adding more tasks (bounding box
detection, keypoint localization, or both bounding box detection and keypoint localization)
affected the performance of the cuboid detector (see Table 1).
Table 1. Multi-task learning Results. A network was trained using only the bounding box
loss, then using the cuboid corner loss.
Additional loss function AP APK PCK
Bounding Box Loss 66.33 - -
Corner Loss 58.39 28.68 27.64
Bounding Box + Corner Loss 67.11 34.62 29.38
Iterative Feature Pooling. In R-CNN, the final output is a classification
score and the bounding box regression values for every region proposal. The bounding box
regression allows moving the region proposal around and scaling it such that the final
bounding box localizes just the object. This implies that the initial region from which the
features are pooled to make this prediction was not entirely correct. In some embodiments,
the cuboid detector 200 goes back and pools features from the refined bounding box. This
can be implemented in the network itself, meaning that the cuboid detector 200 performs
iterative bounding box regression while training and testing in exactly the same way. The
input to the fully-connected layers 216 of the regressor is a fixed-size feature map, a submap of
the convolutional feature map 228, that includes the pooled features from different region
proposals from the conv5 layer. The R-CNN outputs can be used for bounding box regression
on the input object proposals to produce new proposals. Then features can be pooled from
these new proposals and passed through the fully-connected layers 216 of the regressor
again. In some embodiments, the cuboid detector 200 is an “any-time prediction” system
where for applications which are not bound by latency, bounding box regression can be
performed more than once. The performance results (see Table 2) show that iterative feature
pooling can greatly improve both bounding box detection and vertex localization (see FIGS.
5A-5C). There was not a significant change in performance when features were iteratively
pooled two or more times (e.g., 2, 3, 4, 5, 6, or more times). In some implementations, two
iterations are used. FIGS. 5A-5C show example images 500a1-500l1, 500l2
illustrating improved performance (e.g., compare the representations 108b1, 108b2 of the
cuboid in images 500a1, 500a2 and the shape of the bookcase 504 in these images)
with keypoint refinement via iterative feature pooling. Cuboid detection regions were refined
by re-pooling features from conv5 using the predicted bounding boxes.
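An illustrative loop for the iterative feature pooling described above, reusing the `roi_pool`/`CuboidHead` sketch given earlier; `apply_box_deltas` is a hypothetical helper (not from the disclosure) that shifts and scales boxes by the predicted offsets:

```python
from torchvision.ops import roi_pool

def refine_rois(feature_map, rois, head, apply_box_deltas, iterations=2):
    # Pool features, regress refined boxes, then re-pool from the refined boxes.
    for _ in range(iterations):                       # two iterations were used in testing
        pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
        scores, box_offsets, corner_offsets = head(pooled)
        rois = apply_box_deltas(rois, box_offsets)    # refined proposals for the next pass
    return rois, corner_offsets
```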
Table 2. Results for Iterative Feature Pooling. Iterative feature pooling improved the box
detection AP by over 4% and PCK over 7%.
Method AP APK PCK
Corner Loss 58.39 28.68 27.64
Corner Loss + Iterative 62.89 33.98 35.56
BB + Corner Losses 67.11 34.62 29.38
BB + Corner Loss + Iterative 71.72 37.61 36.53
Depth of Network. Two base models, VGG16 and VGG-M, were tested.
While VGG16 has a very deep architecture with 16 layers, VGG-M is a smaller model with 7
layers. Table 3 shows the results of the testing. Interestingly, for this dataset and task, two
iterations through the shallower network outperformed one iteration through the deeper
network. Coupled with the fact that the shallower network with iteration runs twice as fast, a
cuboid detector 200 can advantageously include a shallower CNN tower with fewer than 10
layers (e.g., 5, 7, or 9 layers). In some embodiments, a cuboid detector 200 can include a
deeper CNN tower (e.g., 12, 15, 20, or more layers). The four models tested each had average
precision (AP) higher than the AP of a HOG-based system (24.0).
Table 3. VGG-M (7 layers) vs. VGG16 (16 layers) base network. I: iterative feature pooling
was performed. The deeper cuboid detector outperformed the shallower one.
Method AP APK PCK Size Speed
VGG-M 67.11 34.62 29 334 MB 14 fps
VGG-M + I 71.72 37.61 36 334 MB 10 fps
VGG16 70.50 33.65 35 522 MB 5 fps
VGG16 + I 75.47 41.21 38 522 MB 4 fps
Effect of Training Set Size. The impact of increasing the size of training
data was measured. Three datasets of varying sizes, 1K, 2K and 3K images, were created
and used to train a common network (VGG-M + Iterative). The results (see Table 4) show
significantly improved performance when using larger training set sizes.
Table 4. Performance vs. number of training images. Deep cuboid detection can benefit
from more training images.
Number of Images AP APK PCK
1000 40.47 20.83 26.60
2000 52.17 27.51 29.31
3000 71.72 37.61 36.53
Memory and Runtime Complexity. The cuboid detector 200 was able to
run at interactive rates on a Titan Z GPU while the HOG-based approach would take minutes
to process a single image. The real-time nature of the system may be the result of Faster RCNN
being used as the regressor. In some embodiments, the cuboid detector 200 can
implement a single shot multibox detector (SSD) to further improve its speed performance.
Table 3 shows the model sizes, which can be reduced to run on mobile devices (e.g., the
wearable display system 1000 described with reference to FIG. 10).
Example Keypoint Parameterizations
An embodiment of the cuboid detector 200 can output a cuboid’s vertices
directly. Many convex cuboids have eight vertices, six faces, and twelve edges (not all of
which may be visible in an image). However, certain viewpoints may have an inherent
ambiguity, which may have led to the improper cuboid identification shown in FIG. 4G. For
example, which face of the cube should be labelled the front? Since the cuboid
detector 200 may need to deal with such configurations, alternate cuboid
parametrizations were explored. If the world origin is considered to coincide with camera
center coordinates, a parameterization of a cuboid can be represented with 12 numbers, as
listed below (see also the illustrative sketch after the list). The
following parameterization may be minimal; in other parameterizations, additional or
different parameters can be used.
(X, Y, Z) – Coordinates of the center of the cuboid in 3D
(L, W, H) - Dimensions of the cuboid
(θ, ψ, φ) - 3 angles of rotation of the cuboid (e.g., Euler angles)
(f, c_x, c_y) - Intrinsic camera parameters (e.g., focal length and coordinates of the optical
center)
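An illustrative container for the twelve numbers listed above (field names are assumptions; nothing here prescribes how the detector would predict them):

```python
from dataclasses import dataclass

@dataclass
class CuboidParams12:
    x: float; y: float; z: float                   # (X, Y, Z): center of the cuboid in 3D
    length: float; width: float; height: float     # (L, W, H): dimensions of the cuboid
    theta: float; psi: float; phi: float           # rotation angles (e.g., Euler angles)
    f: float; c_x: float; c_y: float               # intrinsics: focal length, optical center
```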
For many modern cameras, no skew in the camera and equal focal lengths
(in orthogonal directions) can be assumed. The over-parameterization of a cuboid (e.g., a
sixteen-parameter parameterization of a cuboid) may allow a cuboid detector 200 to produce
outputs that do not represent cuboids (see, e.g., some examples in FIG. 4G). Several
different re-parameterizations of a cuboid were tested to better utilize the geometric
constraints. In general, the test results show that the network was able to learn features for
tasks that had more visual evidence in the image and predict parameters which can be scaled
properly for stable optimization. When dealing with 3D geometry and deep learning, proper
parametrization is advantageous. Even image-to-image transformations, such as
homographies (e.g., isomorphisms of projected spaces) may benefit from re-parametrization
(e.g., the four-point parametrization). Such techniques may reduce or eliminate improper
identification of cuboids in images.
Six-corner parametrization. An alternate parameterization was tested in which only
six coordinates of eight cuboid vertices were predicted by the detector. The locations of the
remaining two coordinates were inferred using the relationship that there may be parallel
edges in cuboids. For example, the edges that are parallel in 3D meet at the vanishing point
in the image. There may be two pairs of parallel lines on the top face of the cuboid 600 and
two pairs of parallel lines on the bottom face of the cuboid. The pair of parallel lines 604a,
604b on the top face of the cuboid 600 and the pair of parallel lines 606a, 606b on the bottom
face of the cuboid should meet at the same vanishing point 608a as shown in FIG. 6. The
pair of parallel lines 604c, 604d on the top face of the cuboid 600 and the pair of parallel lines
606c, 606d on the bottom face of the cuboid should meet at the same vanishing point 608b.
Accordingly, the position of the remaining two points 612a, 612b can be inferred. This
allows a cuboid detector 200 to parameterize an output of 12 numbers in some
implementations. FIG. 6 schematically illustrates example cuboid vanishing points 608a,
608b. Vanishing points 608a, 608b produced by extrapolating the edges of a cube form a
vanishing line 616 and can be used to reduce the number of parameters. The Front-Top-Left
(FTL) keypoint 612a and Back-Bottom-Right (BBR) keypoint 612b can be omitted from
the parametrization and inferred using estimated vanishing point (VP) techniques.
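A hedged geometric sketch of the inference step described above: edges that are parallel in 3D meet at a vanishing point in the image, and an omitted vertex can be recovered as the intersection of two lines drawn from known vertices toward the two vanishing points. The helper names and the toy coordinates are assumptions, not values from the disclosure:

```python
import numpy as np

def line_through(p, q):
    # Homogeneous (projective) line through two 2D image points.
    return np.cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def meet(l1, l2):
    # Intersection point of two homogeneous lines, returned as (x, y).
    point = np.cross(l1, l2)
    return point[:2] / point[2]

def vanishing_point(edge_a, edge_b):
    # Two image edges that are parallel in 3D intersect at a vanishing point.
    return meet(line_through(*edge_a), line_through(*edge_b))

def infer_vertex(corner_u, corner_v, vp_u, vp_v):
    # Missing vertex: intersection of the line from one known corner toward one
    # vanishing point with the line from another known corner toward the other.
    return meet(line_through(corner_u, vp_u), line_through(corner_v, vp_v))

# Toy coordinates only, to show the calls.
vp1 = vanishing_point(((0, 0), (100, 10)), ((0, 50), (100, 55)))
vp2 = vanishing_point(((0, 0), (5, 40)), ((100, 10), (102, 45)))
print(infer_vertex((5, 40), (102, 45), vp1, vp2))
```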
Eight-corner parameterization was compared with six-corner
parameterization. The ground truth data for two vertices was not used while training. One
vertex from each of the back and front faces was dropped (those whose detection rates (PCK)
were the worst). A network was trained to predict the location of the remaining six corners.
The locations of the two dropped vertices were inferred using these six corners. The cuboid
detector 200 first determined the vanishing points corresponding to the six points predicted.
This re-parameterization may lead to a reduction in performance (see Table 5). This
degradation may be due to the fact that visual evidence corresponding to the two inferred
vertices present in the image was discarded. Also, any error in prediction of one vertex due
to occlusion or any other reason would directly propagate to the inferred corners. However,
left to itself, the cuboid detector 200 learned multiple models to detect a cuboid. The network of
the cuboid detector 200 was free to use all visual evidence to localize the corners of the
cuboid. The cuboid detector 200 was capable of doing pure geometric reasoning because in
many cases the corner on the back did not have visual evidence in the image due to selfocclusion.
Table 5. Eight-corner vs. six-corner parameterization. Eight-corner parameterization uses all
of the cuboid’s corners, whereas in the six-corner parameterization, the BBR and FTL
corners are dropped (see FIG. 6) and inferred from the vanishing points. This shows how an
example network was able to do geometric reasoning and the over-parameterization may add
robustness to the system. BBR: Back-Bottom-Right and FTL: Front-Top-Left.
Method AP APK PCK PCK of BBR PCK of FTL PCK of Remaining
Corner Corner Corners
6 corners 65.26 29.64 27.36 24.44 21.11 28.89
8 corners 67.11 34.62 29.38 27.22 29.44 29.73
Vanishing point parametrization: Another re-parameterization uses
locations of the two vanishing points and the slopes of six lines which will form the edges of
the cuboid (see FIG. 6). Note that these vanishing points correspond to a particular cuboid
and might be different from the vanishing point of the entire image. The intersection points
of these six lines would give the vertices of the cuboid in this example. However, the
locations of the vanishing points may lie outside the region of interest and have little or
confounding visual evidence in the region of interest or the entire image itself. It also may
become difficult to normalize the targets to predict the vanishing points directly. The slopes
of the six lines can vary between −∞ and +∞. Instead of predicting the slope directly, the
slopes can be regressed to a bounded function of the slope (e.g., sin(tan⁻¹(m)) for a slope m). There can exist a set of
hyperparameters (e.g., loss weights, learning rates, solver, etc.) for which an embodiment of
this network can be trained.
Example s of Training a Cuboid Detector
FIG. 8 is a flow diagram of an example process 800 of training a cuboid
detector. The process 800 starts at block 804, where a plurality of training images each
comprising at least one cuboid is received. Some of the training images can each include one
or more cuboids. The process 800 can include performing a cuboid-specific data augmentation
strategy to improve the performance of a trained cuboid detector.
At block 808, a convolutional neural network is received. The convolutional neural network
can be trained for object detection. For example, the convolutional neural network can be
VGG16 or VGG-M. The convolutional neural network can be a deep neural network in
some implementations.
At block 812, a cuboid detector is generated. The cuboid detector can
include a CNN tower. The CNN tower can include a plurality of convolutional layers and
non-convolutional layers of the convolutional neural network received at block 808. For
example, the CNN tower can include some or all convolutional layers of the convolutional
neural network received. The non-convolutional layers can include a normalization layer, a
brightness normalization layer, a batch normalization layer, a rectified linear layer, an
upsampling layer, a concatenation layer, a pooling layer, a softsign layer, or any combination
thereof. The CNN tower can generate a convolutional feature map from an input image, such
as a monocular image.
The cuboid detector can include a region proposal network (RPN), such as
a CNN or a DNN. The region proposal network can be connected to a layer of the CNN
tower. The region proposal network can determine a region of interest (RoI) comprising a
cuboid in the image using the convolutional feature map. For example, the region of interest
can be represented as a two-dimensional (2D) bounding box enclosing a cuboid at a cuboid
image location. The cuboid can comprise a cuboid, a cylinder, a sphere, or any combination
thereof. The RPN can be associated with at least two loss functions, such as a log loss
function and a smooth L1 loss function during training.
The cuboid detector can include a pooling layer and at least one regressor layer. The pooling layer can be connected to a layer of the CNN tower. The pooling layer can determine, using the cuboid image location, a submap of the convolutional feature map corresponding to the region of interest comprising the cuboid. The pooling layer and the region proposal network can be connected to the same layer of the CNN tower.
The cuboid detector can include two regressor layers, such as two fully-connected layers, of a regional-CNN (R-CNN) or a fast R-CNN. As another example, the regressor layer is not fully connected. The regressor layer can be associated with at least three loss functions during training. For example, the at least three loss functions comprises a log loss function and a smooth L1 loss function.
The cuboid detector can be trained. At block 816, the cuboid detector can determine a region of interest at an image location comprising a cuboid in a training image received at block 804. In some embodiments, a representation of the cuboid in the image can be determined. To determine the RoI at the cuboid image location and the representation of the cuboid, the cuboid detector can generate a convolutional feature map for the training image using the convolutional layers and non-convolutional layers of the CNN tower. Based on the convolutional feature map, the region proposal network can determine the RoI comprising the cuboid at an initial image location in the training image. Based on the initial image location of the cuboid in the training image, the pooling layer of the cuboid detector can determine a submap of the convolutional feature map corresponding to the RoI comprising the cuboid at the initial image location. The at least one regression layer can determine the RoI at the cuboid image location and the representation of the cuboid. The initial cuboid image location or the cuboid image location can be represented as a two-dimensional (2D) bounding box. In some implementations, the method 800 can include iteratively determining, using the pooling layer, the at least one regressor layer, and the submap of the convolutional feature map corresponding to the RoI comprising the cuboid, the RoI at the cuboid image location and the representation of the cuboid.
The representation of the cuboid can be different in different implementations. The representation can include a parameterized representation of the cuboid. For example, the parameterized representation of the cuboid can include locations of a plurality of keypoints of the cuboid in the image, such as six or eight vertices of the cuboid in the image. As another example, the parameterized representation can include normalized offsets of the plurality of keypoints of the cuboid from the center of the image. As a further example, the parameterized representation comprises N tuples, such as 6 tuples. As an example, the parameterized representation of the cuboid comprises a vanishing point parameterization.
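As a non-limiting illustration of the normalized-offset parameterization, the sketch below converts eight cuboid vertices into offsets from the center of a reference box, normalized by the box width and height. The exact ground truth targets are defined by Equations [1] and [2] above; the reference box, function name, and normalization used here are illustrative assumptions only.

```python
import numpy as np

def vertices_to_normalized_offsets(vertices, box):
    # vertices: (8, 2) array of cuboid keypoint (x, y) pixel coordinates.
    # box: (x_min, y_min, x_max, y_max) reference box (e.g., the RoI or image).
    # Returns offsets of each keypoint from the box center, normalized by the
    # box width and height, giving 16 regression targets per cuboid.
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    center = np.array([x_min + w / 2.0, y_min + h / 2.0])
    return (vertices - center) / np.array([w, h])

box = (100.0, 50.0, 300.0, 250.0)
vertices = np.array([[120.0, 60.0], [280.0, 70.0], [130.0, 230.0], [270.0, 240.0],
                     [150.0, 90.0], [250.0, 95.0], [160.0, 200.0], [240.0, 210.0]])
print(vertices_to_normalized_offsets(vertices, box))
```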
At block 820, a first difference between a reference image location and the
determined image location and a second difference between a reference representation of the
cuboid and the determined representation of the cuboid can be determined. The reference
representation of the cuboid can include the ground truth targets for each keypoint as
illustrated in Equations [1] and [2] above. The reference image location can include a
bounding box represented by the ground truth targets.
At block 824, weights of the cuboid detector can be updated based on the first difference and the second difference. The differences can be represented as the loss function (or components thereof) shown in Equation [3]. Some or all of the weights of the cuboid detector can be updated based on the differences determined. For example, the weights of the region proposal network and the weights of the at least one regressor layer can be updated based on the differences. As another example, the weights of the RPN and the weights of the at least one regressor layer can be updated without updating the weights of the first CNN based on the differences. As a further example, the weights of the CNN tower, the weights of the region proposal network, and the weights of the at least one regressor layer can be updated based on the differences. The process 800 can optionally include training the cuboid detector with a larger dataset and synthetic data, and applying network optimization and regularization techniques to improve generalization.
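A minimal sketch of the weight update at blocks 820 and 824 is shown below, written in Python with PyTorch. It assumes the combined loss is a log loss over cuboid classification plus smooth L1 losses over the bounding box and keypoint targets, in the spirit of Equation [3]; the function and argument names are hypothetical, and the prediction tensors are assumed to come from a forward pass through the RPN and regressor layers so that calling backward() updates their weights.

```python
import torch.nn as nn

cls_loss_fn = nn.CrossEntropyLoss()  # a log loss over cuboid / background labels
reg_loss_fn = nn.SmoothL1Loss()      # a smooth L1 loss over box and keypoint targets

def training_step(optimizer, cls_logits, cls_labels,
                  box_preds, box_targets,
                  keypoint_preds, keypoint_targets,
                  box_weight=1.0, keypoint_weight=1.0):
    # First difference: predicted vs. reference image location (bounding box).
    # Second difference: predicted vs. reference cuboid representation (keypoints).
    loss = (cls_loss_fn(cls_logits, cls_labels)
            + box_weight * reg_loss_fn(box_preds, box_targets)
            + keypoint_weight * reg_loss_fn(keypoint_preds, keypoint_targets))
    optimizer.zero_grad()   # the optimizer holds the RPN and regressor parameters
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whether the CNN tower's weights are also updated depends simply on which parameters are handed to the optimizer, corresponding to the alternatives described above.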
Example Process of Using a Cuboid Detector for Cuboid Detection and Keypoint Localization
An example process 900 of using a cuboid detector for cuboid detection and keypoint localization is illustrated as a flow diagram. The process 900 starts at block 904, where a system (e.g., the wearable display system 1000 described below) receives an input image including a possible cuboid. The image can include one or more cuboids. The image can comprise a color image (e.g., RGB or RGB-D), and the image may be monocular. The image may be a frame of a video and may be obtained using the outward-facing imaging system 1044 of the wearable display system 1000 described below.
At block 908, the wearable display system 1000 can access a cuboid detector (such as the cuboid detector trained by the process 800 described above). The cuboid detector can include a CNN tower comprising a plurality of convolutional layers and non-convolutional layers. The cuboid detector can include a region proposal network connected to the CNN tower. The cuboid detector can include a pooling layer and at least one regressor layer. The pooling layer can be connected to the CNN tower.
At block 912, the wearable display system 1000 can generate, using the plurality of convolutional layers and the non-convolutional layers of the CNN tower and the image, a convolutional feature map (e.g., the convolutional feature map 228). At block 916, the wearable display system 1000 can determine, using the region proposal network, at least one RoI comprising a cuboid at a cuboid image location of the image (e.g., the regions of interest 220a1, 220a2, 220b). The cuboid image location can be represented as a two-dimensional (2D) bounding box. At block 920, the wearable display system 1000 can determine, using the pooling layer (e.g., the pooling layer 212) and the cuboid image location, a submap of the convolutional feature map corresponding to the region of interest comprising the cuboid. For example, the submap can be determined from the regions 224a of the convolutional feature map 228 from which the features can be pooled. At block 924, the wearable display system 1000 can determine, using the regressor layer (e.g., an R-CNN regressor) and the submap, a refined RoI at a refined cuboid image location and a representation of the cuboid. The refined cuboid image location can be represented as a two-dimensional (2D) bounding box.
In some embodiments, the method 900 includes iterative feature pooling. For example, the wearable display system 1000 can determine, using the refined cuboid image location, a refined submap of the convolutional feature map corresponding to the refined region of interest comprising the cuboid. For example, the submap can be determined from the regions 224b of the convolutional feature map 228 from which the features can be pooled. The wearable display system 1000 can determine, using the pooling layer, the at least one regressor layer, and the refined submap of the convolutional feature map corresponding to the refined RoI, a further refined RoI at a further refined cuboid image location and a further refined representation of the cuboid.
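A minimal sketch of this iterative feature pooling loop is shown below; pool and regressor are hypothetical callables standing in for the pooling layer and the regressor head, and their signatures are illustrative only.

```python
def iterative_cuboid_refinement(feature_map, initial_roi, pool, regressor, num_iterations=2):
    # Iterative feature pooling: features are re-pooled from the convolutional
    # feature map using the most recent box estimate, and the regressor produces
    # a refined box and cuboid keypoint estimate on each pass.
    roi, keypoints = initial_roi, None
    for _ in range(num_iterations):
        pooled = pool(feature_map, roi)      # submap corresponding to the current RoI
        roi, keypoints = regressor(pooled)   # refined RoI and cuboid representation
    return roi, keypoints
```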
The wearable display system 1000 can interact with a user of the system based on the refined region of interest at the refined cuboid image location and the representation of the cuboid. For example, the cuboid can correspond to a stationary box, and the wearable display system 1000 can generate character animation in relation to the stationary box based on the refined image location of the cuboid and the representation of the cuboid. As another example, the cuboid can correspond to a hand-held cuboid. The wearable display system 1000 can determine a pose of the cuboid using the representation of the cuboid, and interact with the user of the system based on the pose of the cuboid. As a further example, the cuboid can correspond to a rare object not recognizable by a CNN. The wearable display system 1000 can provide the user with a notification that the rare object not recognizable by the CNN is detected. As an example, the cuboid corresponds to a man-made structure (e.g., a building). The wearable display system 1000 can assist the user of the system during an unmanned flight based on the refined RoI at the refined cuboid image location and the representation of the cuboid. As another example, the cuboid can correspond to a marker. The wearable display system 1000 can perform simultaneous location and mapping (SLAM) based on the pose of the cuboid.
Example Applications
Detecting box-like objects in images and extracting 3D information like
pose can help overall scene understanding. Many high-level semantic problems can be
tackled by first detecting boxes in a scene (e.g., extracting the free space in a room by
reducing the objects in a scene to boxes, estimating the support surfaces in the scene and
estimating the scene layout).
The cuboid detectors disclosed herein can open up one or more possibilities for augmented reality (AR), human-computer interaction (HCI), autonomous vehicles, drones, or robotics in general. For example, the cuboid detector can be used as follows.
For Augmented Reality, cuboid vertex localization followed by 6-degree of freedom (6-dof) pose estimation allows a content creator to use the cuboid-centric coordinate system defined by a stationary box to drive character animation. Because the volume of space occupied by the stationary cuboid is known based on cuboid vertex location followed by 6-dof pose estimation, animated characters can jump on the box, hide behind it, and even start drawing on one of the box’s faces. Accordingly, a content creator can use the cuboid detector to build dynamic worlds around cuboids.
For Human-Computer Interaction, users may interact with scenes using boxy objects around them. A content creator may create a game or user environment in which worlds are built up from cuboids. As another example, a hand-held cuboid can be used as a lightweight game controller. A system, such as the wearable display system 1000 described below, can include a camera capturing images of the hand-held cube over time. And the system can estimate the cube’s pose, effectively tracking the cube in 3D space, using the captured images. In some embodiments, the cuboid can serve as a way to improve interaction in AR systems (e.g., the tabletop AR demo using cuboids).
For autonomous vehicles, 3D cuboid detection allows the vehicle to reason about the spatial extent of rare objects that might be missing in the supervised training set. By reasoning about the pose of objects in a class-agnostic manner, autonomous vehicles can be safer drivers.
For drones, man-made structures, such as buildings, houses, or cars, can be well-approximated with cuboids, assisting navigation during unmanned flights. For robotics in general, detecting box-like objects in images and extracting their 3D information like pose helps overall scene understanding. For example, placing a handful of cuboids in a scene (instead of Aruco markers) can make pose tracking more robust for simultaneous location and mapping (SLAM) applications.
Additional Embodiments
In some embodiments, the cuboid detector does not rely on bottom-up image processing and works satisfactorily on real images in real-time. The cuboid detector can be trained using a large training database of 3D models and some kind of learning for 2D-to-3D alignment. In some implementations, the cuboid detector can implement a geometry-based method, a deformable parts model, or a histogram of oriented gradients (HOG)-based model (e.g., a HOG classifier). The cuboid detector can detect cuboid vertices in different views and determine a final cuboid configuration based on a score from the HOG classifier, 2D vertex displacement, edge alignment score, and a 3D shape score that takes into account how close the predicted vertices are to a cuboid in 3D. The cuboid detector can jointly optimize over visual evidence (corners and edges) found in the image while penalizing predictions that stray too far from an actual 3D cuboid.
Without being limited by theory, the cuboid detector may owe its performance to convolutional neural networks. A CNN can be superior to existing methods for the task of image classification. To localize a cuboid in an image, the image is broken down into regions and these regions are classified instead, for example, in real-time. The cuboid detector can perform detection in a single step. A cuboid detector, for example, running on the wearable display system 1000 described below, can process 50-60 frames per second, thus performing real-time cuboid detection and keypoint localization. The iterative keypoint refinement implemented by the cuboid detector can be based on an iterative error feedback approach, on network cascades, or on the iterative bounding box regression of Multi-Region CNN and Inside-Outside Networks. Alternatively, or additionally, the iterative keypoint refinement implemented by the cuboid detector can be based on a recurrent neural network.
Example NN Layers
A layer of a neural network (NN), such as a deep neural network (DNN), can apply a linear or non-linear transformation to its input to generate its output. A deep neural network layer can be a normalization layer, a convolutional layer, a softsign layer, a rectified linear layer, a concatenation layer, a pooling layer, a recurrent layer, an inception-like layer, or any combination thereof. The normalization layer can normalize the brightness of its input to generate its output with, for example, L2 normalization. The normalization layer can, for example, normalize the brightness of a plurality of images with respect to one another at once to generate a plurality of normalized images as its output. Non-limiting examples of methods for normalizing brightness include local contrast normalization (LCN) or local response normalization (LRN). Local contrast normalization can normalize the contrast of an image non-linearly by normalizing local regions of the image on a per pixel basis to have a mean of zero and a variance of one (or other values of mean and variance). Local response normalization can normalize an image over local input regions to have a mean of zero and a variance of one (or other values of mean and variance). The normalization layer may speed up the training process.
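As a non-limiting sketch, a local contrast normalization over square neighborhoods might be implemented as follows; the window size and epsilon are illustrative assumptions rather than the specific LCN or LRN formulations referenced above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalize(image, size=9, eps=1e-6):
    # Normalize each pixel using the mean and variance of its local neighborhood,
    # so local regions have approximately zero mean and unit variance.
    image = image.astype(np.float64)
    local_mean = uniform_filter(image, size=size)
    local_sq_mean = uniform_filter(image * image, size=size)
    local_var = np.maximum(local_sq_mean - local_mean ** 2, 0.0)
    return (image - local_mean) / np.sqrt(local_var + eps)

normalized = local_contrast_normalize(np.random.rand(64, 64) * 255.0)
print(normalized.mean(), normalized.std())
```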
The convolutional layer can apply a set of kernels that convolve its input to generate its output. The softsign layer can apply a softsign function to its input. The softsign function (softsign(x)) can be, for example, (x / (1 + |x|)). The softsign layer may neglect the impact of per-element outliers. The rectified linear layer can be a rectified linear layer unit (ReLU) or a parameterized rectified linear layer unit (PReLU). The ReLU layer can apply a ReLU function to its input to generate its output. The ReLU function (ReLU(x)) can be, for example, max(0, x). The PReLU layer can apply a PReLU function to its input to generate its output. The PReLU function PReLU(x) can be, for example, x if x ≥ 0 and ax if x < 0, where a is a positive number. The concatenation layer can concatenate its input to generate its output. For example, the concatenation layer can concatenate four 5 x 5 images to generate one 10 x 10 image. The pooling layer can apply a pooling function which down samples its input to generate its output. For example, the pooling layer can down sample a 20 x 20 image into a 10 x 10 image. Non-limiting examples of the pooling function include maximum pooling, average pooling, or minimum pooling.
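The activation and pooling functions described above can be illustrated with a short NumPy sketch; the PReLU coefficient a = 0.25 is an arbitrary illustrative value.

```python
import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))

def relu(x):
    return np.maximum(0.0, x)

def prelu(x, a=0.25):
    # x if x >= 0, a * x otherwise, with a a positive number
    return np.where(x >= 0, x, a * x)

def max_pool_2x2(image):
    # Down-sample by taking the maximum over non-overlapping 2 x 2 windows,
    # e.g. a 20 x 20 input becomes a 10 x 10 output.
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(np.random.rand(20, 20)).shape)  # (10, 10)
```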
At a time point t, the recurrent layer can compute a hidden state s(t), and a recurrent connection can provide the hidden state s(t) at time t to the recurrent layer as an input at a subsequent time point t+1. The recurrent layer can compute its output at time t+1 based on the hidden state s(t) at time t. For example, the recurrent layer can apply the softsign function to the hidden state s(t) at time t to compute its output at time t+1. The hidden state of the recurrent layer at time t+1 has as its input the hidden state s(t) of the recurrent layer at time t. The recurrent layer can compute the hidden state s(t+1) by applying, for example, a ReLU function to its input. The inception-like layer can include one or more of the normalization layer, the convolutional layer, the softsign layer, the rectified linear layer such as the ReLU layer and the PReLU layer, the concatenation layer, the pooling layer, or any combination thereof.
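A minimal sketch of a single recurrent layer step is shown below; the weight matrices are hypothetical, and tanh stands in for the softsign or ReLU nonlinearity mentioned above.

```python
import numpy as np

def recurrent_step(x_t, s_prev, W_x, W_s, activation=np.tanh):
    # Hidden state update: s(t+1) is a nonlinear function of the current input
    # x(t) and the previous hidden state s(t), provided via the recurrent connection.
    return activation(x_t @ W_x + s_prev @ W_s)

rng = np.random.default_rng(0)
W_x, W_s = rng.normal(size=(8, 16)), rng.normal(size=(16, 16))
s = np.zeros(16)
for t in range(5):
    x_t = rng.normal(size=8)
    s = recurrent_step(x_t, s, W_x, W_s)   # s(t+1) depends on s(t)
print(s.shape)
```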
The number of layers in the NN can be different in different implementations. For example, the number of layers in the DNN can be 50, 100, 200, or more. The input type of a deep neural network layer can be different in different implementations. For example, a layer can receive the outputs of a number of layers as its input. The input of a layer can include the outputs of five layers. As another example, the input of a layer can include the outputs of 1% of the layers of the NN. The output of a layer can be the inputs of a number of layers. For example, the output of a layer can be used as the inputs of five layers. As another example, the output of a layer can be used as the inputs of 1% of the layers of the NN.
The input size or the output size of a layer can be quite large. The input
size or the output size of a layer can be n x m, where n denotes the width and m denotes the
height of the input or the output. For example, n or m can be 11, 21, 31, or more. The
channel sizes of the input or the output of a layer can be different in different
implementations. For example, the channel size of the input or the output of a layer can be 4,
16, 32, 64, 128, or more. The kernel size of a layer can be different in different
implementations. For example, the kernel size can be n x m, where n denotes the width and
m denotes the height of the kernel. For example, n or m can be 5, 7, 9, or more. The stride
size of a layer can be different in different implementations. For example, the stride size of a
deep neural network layer can be 3, 5, 7 or more.
In some embodiments, a NN can refer to a plurality of NNs that together compute an output of the NN. Different NNs of the plurality of NNs can be trained for different tasks. A processor (e.g., a processor of the local data processing module 1024 described below) can compute outputs of NNs of the plurality of NNs to determine an output of the NN. For example, an output of a NN of the plurality of NNs can include a likelihood score. The processor can determine the output of the NN including the plurality of NNs based on the likelihood scores of the outputs of different NNs of the plurality of NNs.
Example Wearable Display System
In some embodiments, a user device can be, or can be included, in a
wearable display device, which may advantageously provide a more immersive virtual reality
(VR), augmented reality (AR), or mixed reality (MR) experience, where digitally reproduced
images or portions thereof are presented to a wearer in a manner wherein they seem to be, or
may be perceived as, real.
Without being limited by theory, it is believed that the human eye
typically can interpret a finite number of depth planes to provide depth perception.
Consequently, a highly believable simulation of perceived depth may be achieved by
providing, to the eye, different presentations of an image corresponding to each of these
limited number of depth planes. For example, displays containing a stack of waveguides
may be configured to be worn positioned in front of the eyes of a user, or viewer. The stack
of waveguides may be used to provide three-dimensional perception to the eye/brain by using a plurality of waveguides to direct light from an image injection device (e.g., discrete displays or output ends of a multiplexed display which pipe image information via one or more optical fibers) to the viewer’s eye at particular angles (and amounts of divergence)
corresponding to the depth plane associated with a particular waveguide.
In some embodiments, two stacks of waveguides, one for each eye of a
viewer, may be utilized to provide different images to each eye. As one example, an
augmented reality scene may be such that a wearer of an AR technology sees a real-world
park-like setting featuring people, trees, buildings in the background, and a concrete platform. In addition to these items, the wearer of the AR technology may also perceive that he “sees” a robot statue standing upon the real-world platform, and a cartoon-like avatar character flying by which seems to be a personification of a bumble bee, even though the robot statue and the bumble bee do not exist in the real world. The stack(s) of waveguides
may be used to generate a light field corresponding to an input image and in some
implementations, the wearable display comprises a wearable light field display. Examples of
wearable display device and waveguide stacks for providing light field images are described
in U.S. Patent Publication No. 016777, which is hereby incorporated by reference
herein in its entirety for all it contains.
An example wearable display system 1000 can be used to present a VR, AR, or MR experience to a display system wearer or viewer 1004. The wearable display system 1000 may be programmed to perform any of the applications or embodiments described herein (e.g., executing CNNs, reordering values of input activation maps or kernels, eye image segmentation, or eye tracking). The display system 1000 includes a display 1008, and various mechanical and electronic modules and systems to support the functioning of that display 1008. The display 1008 may be coupled to a frame 1012, which is wearable by the display system wearer or viewer 1004 and which is configured to position the display 1008 in front of the eyes of the wearer 1004. The display 1008 may be a light field display. In some embodiments, a speaker 1016 is coupled to the frame 1012 and positioned adjacent the ear canal of the user; in some embodiments, another speaker, not shown, is positioned adjacent the other ear canal of the user to provide for stereo/shapeable sound control. The display system 1000 can include an outward-facing imaging system 1044 (e.g., one or more cameras) that can obtain images (e.g., still images or video) of the environment around the wearer 1004. Images obtained by the outward-facing imaging system 1044 can be analyzed by embodiments of the deep cuboid detector to detect and localize cuboids in the environment around the wearer 1004.
The display 1008 is operatively coupled 1020, such as by a wired lead or wireless connectivity, to a local data processing module 1024 which may be mounted in a variety of configurations, such as fixedly attached to the frame 1012, fixedly attached to a helmet or hat worn by the user, embedded in headphones, or otherwise removably attached to the user 1004 (e.g., in a backpack-style configuration, in a belt-coupling style configuration).
The local processing and data module 1024 may comprise a hardware processor, as well as non-transitory digital memory, such as non-volatile memory (e.g., flash memory), both of which may be utilized to assist in the processing, caching, and storage of data. The data include data (a) captured from sensors (which may be, e.g., operatively coupled to the frame 1012 or otherwise attached to the wearer 1004), such as image capture devices (such as cameras), microphones, inertial measurement units, accelerometers, compasses, GPS units, radio devices, and/or gyros; and/or (b) acquired and/or processed using remote processing module 1028 and/or remote data repository 1032, possibly for passage to the display 1008 after such processing or retrieval. The local processing and data module 1024 may be operatively coupled to the remote processing module 1028 and remote data repository 1032 by communication links 1036, 1040, such as via wired or wireless communication links, such that these remote modules 1028, 1032 are operatively coupled to each other and available as resources to the local processing and data module 1024. The image capture device(s) can be used to capture the eye images used in the eye image segmentation or eye tracking procedures.
In some embodiments, the remote processing module 1028 may comprise one or more processors configured to analyze and process data and/or image information such as video information captured by an image capture device. The video data may be stored locally in the local processing and data module 1024 and/or in the remote data repository 1032. In some embodiments, the remote data repository 1032 may comprise a digital data storage facility, which may be available through the internet or other networking configuration in a “cloud” resource configuration. In some embodiments, all data is stored and all computations are performed in the local processing and data module 1024, allowing fully autonomous use from a remote module.
In some implementations, the local processing and data module 1024 and/or the remote processing module 1028 are programmed to perform embodiments of reordering values of input activation maps or kernels, eye image segmentation, or eye tracking disclosed herein. For example, the local processing and data module 1024 and/or the remote processing module 1028 can be programmed to perform embodiments of the process 900 described above. The local processing and data module 1024 and/or the remote processing module 1028 can be programmed to perform cuboid detection and keypoint localization disclosed herein. The image capture device can capture video for a particular application (e.g., augmented reality (AR), human-computer interaction (HCI), autonomous vehicles, drones, or robotics in general). The video can be analyzed using a CNN by one or both of the processing modules 1024, 1028. In some cases, off-loading at least some of the reordering values of input activation maps or kernels, eye image segmentation, or eye tracking to a remote processing module (e.g., in the “cloud”) may improve efficiency or speed of the computations. The parameters of the CNN (e.g., weights, bias terms, subsampling factors for pooling layers, number and size of kernels in different layers, number of feature maps, etc.) can be stored in data modules 1024 and/or 1032.
The results of the cuboid detection and keypoint localization (e.g., the output of the cuboid detector 200) can be used by one or both of the processing modules 1024, 1028 for additional operations or processing. For example, the processing modules 1024, 1028 of the wearable display system 1000 can be programmed to perform additional applications described herein (such as applications in augmented reality, human-computer interaction (HCI), autonomous vehicles, drones, or robotics in general) based on the output of the cuboid detector 200.
Additional Aspects
In a 1st aspect, a system for cuboid detection and keypoint localization is disclosed. The system comprises: non-transitory memory configured to store: executable instructions, an image for cuboid detection, and a cuboid detector comprising: a plurality of convolutional layers and non-convolutional layers of a first convolutional neural network (CNN) for generating a convolutional feature map from the image, a region proposal network (RPN) comprising a second CNN for determining, using the convolutional feature map, at least one region of interest (RoI) comprising a cuboid at a cuboid image location of the image, and a pooling layer and at least one regressor layer for determining, using the convolutional feature map and the RoI comprising the cuboid, a refined RoI at a refined cuboid image location and a representation of the cuboid; a hardware processor in communication with the non-transitory memory, the hardware processor programmed by the executable instructions to: receive the image; generate, using the plurality of convolutional layers and the non-convolutional layers of the first CNN and the image, the convolutional feature map; determine, using the RPN, the at least one RoI comprising the cuboid at the cuboid image location of the image; determine, using the pooling layer and the cuboid image location, a submap of the convolutional feature map corresponding to the RoI comprising the cuboid; and determine, using the at least one regressor layer and the submap of the convolutional feature map corresponding to the RoI comprising the cuboid, the refined RoI at the refined cuboid image location and the representation of the cuboid.
In a 2nd aspect, the system of aspect 1, wherein the hardware processor is further programmed to: determine, using the refined cuboid image location, a refined submap of the convolutional feature map corresponding to the refined RoI comprising the cuboid; determine, using the pooling layer, the at least one regressor layer, and the refined submap of the convolutional feature map corresponding to the refined RoI comprising the cuboid, a further refined RoI at a further refined cuboid image location and a further refined representation of the cuboid.
In a 3rd aspect, the system of any one of aspects 1-2, wherein the cuboid
image location is represented as a two-dimensional (2D) bounding box.
In a 4th aspect, the system of any one of aspects 1-3, wherein the refined
cuboid image location is represented as a two-dimensional (2D) bounding box.
In a 5th aspect, the system of any one of aspects 1-4, wherein the non-convolutional
layers of the first CNN comprises a normalization layer, a brightness
normalization layer, a batch normalization layer, a rectified linear layer, an upsampling layer,
a concatenation layer, a pooling layer, a softsign layer, or any combination thereof.
In a 6th aspect, the system of any one of aspects 1-5, wherein the at least
one regressor layer comprises two or more layers.
In a 7th aspect, the system of aspect 6, wherein the two or more layers
comprise a fully connected layer, a non-fully connected layer, or any combination thereof.
In a 8th aspect, the system of any one of aspects 1-7, wherein the at least
one regressor layer is associated with at least three loss functions during training.
In a 9th aspect, the system of aspect 8, wherein the at least three loss
functions comprises a log loss function and a smooth L1 loss function.
In a 10th aspect, the system of any one of aspects 1-9, wherein the RPN
comprises a deep neural network (DNN).
In a 11th aspect, the system of any one of aspects 1-10, wherein the RPN
is associated with at least two loss functions during training.
In a 12th aspect, the system of aspect 11, wherein the at least two loss
functions comprises a log loss function and a smooth L1 loss function.
In a 13th aspect, the system of any one of aspects 1-12, wherein the
representation of the cuboid comprises a parameterized representation of the cuboid.
In a 14th aspect, the system of aspect 13, wherein the parameterized
representation of the cuboid comprises locations of a plurality of keypoints of the cuboid in
the image.
In a 15th aspect, the system of aspect 14, wherein the plurality of
keypoints comprises eight vertices of the cuboid in the image.
In a 16th aspect, the system of aspect 13, wherein the parameterized
representation comprises normalized offsets of the plurality of keypoints of the cuboid from
the center of the image.
In a 17th aspect, the system of aspect 13, wherein the parameterized
representation comprises N tuples.
In a 18th aspect, the system of aspect 13, wherein the parameterized
representation of the cuboid comprises 12 parameters.
In a 19th aspect, the system of aspect 13, wherein the parameterized representation of the cuboid comprises a vanishing point parameterization.
In a 20th aspect, the system of any one of aspects 1-19, wherein the
hardware processor is further programmed to: interact with a user of the system based on the
refined RoI at the refined cuboid image location and the representation of the cuboid.
In a 21st aspect, the system of aspect 20, wherein the cuboid corresponds
to a stationary box, and wherein to interact with the user of the system, the hardware processor is further programmed to: generate character animation in relation to the stationary
box based on the refined image location of the cuboid and the representation of the cuboid.
In a 22nd aspect, the system of aspect 20, wherein the cuboid corresponds
to a hand-held cuboid, and wherein to interact with the user of the system, the hardware
processor is further programmed to: determine a pose of the cuboid using the representation
of the cuboid; and interact with the user of the system based on the pose of the cuboid.
In a 23rd aspect, the system of aspect 20, wherein the cuboid corresponds
to a rare object not recognizable by a third CNN, and wherein to interact with the user of the
system, the hardware processor is further programmed to: provide the user with a notification
that the rare object not recognizable by the third CNN is detected.
In a 24th aspect, the system of any one of aspects 1-23, wherein the cuboid
corresponds to a man-made structure, and wherein the hardware processor is further
programmed to: assist a user of the system during an unmanned flight based on the refined
RoI at the refined cuboid image location and the representation of the cuboid.
In a 25th aspect, the system of any one of aspects 1-24, wherein the
cuboid corresponds to a marker, and wherein the hardware processor is further programmed
to: perform simultaneous location and mapping (SLAM) based on the refined RoI at the
refined cuboid image location and the representation of the cuboid.
In a 26th aspect, a wearable display system is disclosed. The wearable display comprises: an outward-facing imaging system configured to obtain an image for
cuboid detection; and the system for cuboid detection and keypoint localization of any one of
aspects 1-25.
In a 27th aspect, a system for training a cuboid detector is disclosed. The system comprises: non-transitory memory configured to store executable instructions; and a hardware processor in communication with the non-transitory memory, the hardware processor programmed by the executable instructions to: receive a plurality of training images each comprising at least one cuboid; generate a cuboid detector, wherein the cuboid detector comprises: a plurality of convolutional layers and non-convolutional layers of a first convolutional neural network (CNN), a region proposal network (RPN) connected to a first layer of the plurality of convolutional layers and non-convolutional layers, and a pooling layer and at least one regressor layer, the pooling layer and the at least one regressor layer connected to a second layer of the plurality of convolutional layers and non-convolutional layers; and train the cuboid detector, wherein to train the cuboid detector, the hardware processor is configured to: determine, using the cuboid detector, a RoI at a cuboid image location and a representation of a cuboid in a training image of the plurality of training images; determine a first difference between a reference cuboid image location and the cuboid image location and a second difference between a reference representation of the cuboid and the determined representation of the cuboid; and update weights of the cuboid detector based on the first difference and the second difference.
In a 28th aspect, the system of aspect 27, wherein the cuboid comprises a
cuboid, a cylinder, a sphere, or any combination thereof.
In a 29th aspect, the system of any one of aspects 27-28, wherein the first layer of the plurality of convolutional layers and non-convolutional layers and the second layer of the plurality of convolutional layers and non-convolutional layers are identical.
In a 30th aspect, the system of any one of aspects 27-29, wherein to determine the RoI at the cuboid image location and the representation of the cuboid, the hardware processor is further programmed to: generate, using the plurality of convolutional layers and the non-convolutional layers, a convolutional feature map for the at least one training image of the plurality of training images; determine, using the RPN, at least one RoI comprising the cuboid at an initial cuboid image location in the training image; determine, using the initial cuboid image location, a submap of the convolutional feature map corresponding to the at least one RoI comprising the cuboid; and determine, using the pooling layer, the at least one regressor layer, and the submap of the convolutional feature map corresponding to the at least one RoI comprising the cuboid, the RoI at the cuboid image location and the representation of the cuboid.
In a 31st aspect, the system of any one of aspects 27-30, wherein the initial
cuboid image location is represented as a two-dimensional (2D) bounding box.
In a 32nd aspect, the system of any one of aspects 27-31, wherein to
determine the RoI at the cuboid image location and the representation of the cuboid, the
hardware processor is further programmed to: iteratively determine, using the pooling layer,
the at least one regressor layer, and the submap of the convolutional feature map
corresponding to the RoI comprising the cuboid, the RoI at the cuboid image location and the
representation of the cuboid.
In a 33rd aspect, the system of any one of aspects 27-32, wherein the
initial cuboid image location is represented as a two-dimensional (2D) bounding box.
In a 34th aspect, the system of any one of aspects 27-33, wherein to update weights of the cuboid detector, the hardware-based processor is programmed to: update the weights of the RPN and the weights of the at least one regressor layer.
In a 35th aspect, the system of any one of aspects 27-33, wherein to update
weights of the cuboid detector, the hardware-based processor is programmed to: update the
weights of the RPN and the weights of the at least one regressor layer without updating the
weights of the first CNN.
In a 36th aspect, the system of any one of aspects 27-33, wherein to update
weights of the cuboid detector, the hardware-based processor is programmed to: update the
weights of the first CNN, the weights of the RPN, and the weights of the at least one
regressor layer.
In a 37th aspect, the system of any one of aspects 27-36, wherein to
generate the cuboid detector, the hardware-based processor is programmed to: receive the first CNN.
In a 38th aspect, the system of any one of aspects 27-37, wherein the at
least one regressor layer comprises two or more layers.
In a 39th aspect, the system of aspect 38, wherein the two or more layers
comprise a fully connected layer, a non-fully connected layer, or any combination thereof.
In a 40th aspect, the system of any one of aspects 27-38, wherein the at
least one regressor layer is associated with at least three loss functions during training of the
cuboid detector.
In a 41st aspect, the system of aspect 40, wherein the at least three loss
functions comprises a log loss function and a smooth L1 loss function.
In a 42nd aspect, the system of any one of aspects 27-41, wherein the RPN
comprises a deep neural network (DNN).
In a 43rd aspect, the system of any one of aspects 27-42, wherein the RPN
is associated with at least two loss functions during the training of the cuboid detector.
In a 44th aspect, the system of aspect 43, wherein the at least two loss
functions comprises a log loss function and a smooth L1 loss function.
In a 45th aspect, the system of any one of aspects 27-44, wherein the
representation of the cuboid comprises a parameterized representation of the cuboid.
In a 46th , the system of aspect 45, wherein the parameterized
representation comprises N tuples.
In a 47th aspect, a wearable display system is disclosed. The wearable display system comprises: an outward-facing imaging system configured to obtain an image of an environment of the wearer of the wearable display system; non-transitory memory configured to store the image; and a hardware processor in communication with the non-transitory memory, the processor programmed to: access the image of the environment; analyze the image to detect a cuboid in the image, wherein to analyze the image, the processor is programmed to: utilize layers of a convolutional neural network (CNN) to generate a convolutional feature map comprising features; utilize a region proposal network (RPN) to map the convolutional feature map into a region of interest (RoI); pool features in the RoI to generate first pooled features; pass the first pooled features through a regressor to generate a first bounding box estimate and a first cuboid vertex estimate; generate second pooled features based on the first bounding box estimate; and pass the second pooled features through the regressor to generate a second bounding box estimate and a second cuboid vertex estimate.
In a 48th aspect, the wearable display system of aspect 47, wherein the
image comprises a monocular color image.
In a 49th aspect, the wearable display system of aspect 47 or aspect 48,
wherein the RPN comprises a CNN that maps the convolutional feature map to the RoI.
In a 50th aspect, the wearable display system of any one of aspects 47 to
49, wherein the first bounding box estimate or the second bounding box estimate comprise
offsets from a center of a bounding box.
In a 51st aspect, a system for detecting a cuboid in an image is disclosed.
The system comprises: non-transitory memory configured to store an image of a region; a
hardware processor in communication with the non-transitory memory, the processor
programmed to: evaluate a convolutional neural network to generate a feature map; analyze
the feature map to obtain a region of interest (RoI); determine that the RoI contains a cuboid; analyze first pooled features in the RoI of the feature map to generate a first estimate for vertices of the cuboid; generate an improved RoI based at least in part on the first estimate for the vertices of the cuboid; analyze second pooled features in the improved RoI of the feature map to generate a second estimate for vertices of the cuboid; and output the second estimate for vertices of the cuboid.
In a 52nd aspect, the system of aspect 51, wherein to analyze the feature
map to obtain a region of interest (RoI), the processor is programmed to evaluate a region proposal network (RPN).
In a 53rd aspect, the system of aspect 51 or 52, wherein the first estimate
for vertices of the cuboid comprise offsets from a center of the RoI, or the second estimate
for vertices of the cuboid comprise offsets from a center of the improved RoI.
In a 54th aspect, a method for cuboid detection and keypoint localization
is disclosed. The method is under control of a hardware processor and comprises: receiving an image; generating, using a plurality of convolutional layers and non-convolutional layers of a first convolutional neural network (CNN) of a cuboid detector and the image, a convolutional feature map; determining, using a region proposal network (RPN) comprising a second CNN of the cuboid detector, at least one RoI comprising a cuboid at a cuboid image location of the image; determining, using a pooling layer of the cuboid detector and the cuboid image location, a submap of the convolutional feature map corresponding to the RoI comprising the cuboid; and determining, using at least one regressor layer of the cuboid detector and the submap of the convolutional feature map corresponding to the RoI comprising the cuboid, a refined RoI at a refined cuboid image location and the representation of the cuboid.
In a 55th aspect, the method of aspect 54, further comprising: determining,
using the refined cuboid image location, a refined submap of the convolutional feature map
corresponding to the refined RoI comprising the cuboid; determining, using the pooling layer, the at least one regressor layer, and the refined submap of the convolutional feature map corresponding to the refined RoI comprising the cuboid, a further refined RoI at a further refined cuboid image location and a further refined representation of the cuboid.
In a 56th aspect, the method of any one of aspects 54-55, wherein the
cuboid image location is represented as a two-dimensional (2D) bounding box.
In a 57th aspect, the method of any one of aspects 54-56, wherein the refined cuboid image location is represented as a two-dimensional (2D) bounding box.
In a 58th aspect, the method of any one of aspects 54-57, wherein the non-convolutional
layers of the first CNN comprises a normalization layer, a brightness
normalization layer, a batch normalization layer, a rectified linear layer, an upsampling layer,
a concatenation layer, a pooling layer, a gn layer, or any combination thereof.
In a 59th aspect, the method of any one of aspects 54-58, wherein the at
least one regressor layer comprises two or more layers.
In a 60th aspect, the method of aspect 59, wherein the two or more layers
comprise a fully connected layer, a non-fully connected layer, or any combination thereof.
In a 61st aspect, the method of any one of aspects 54-60, wherein the RPN
comprises a deep neural network (DNN).
In a 62nd aspect, the method of any one of aspects 54-61, wherein the
representation of the cuboid comprises a parameterized representation of the cuboid.
In a 63rd aspect, the method of aspect 62, wherein the parameterized
representation of the cuboid comprises locations of a plurality of keypoints of the cuboid in
the image.
In a 64th aspect, the method of aspect 63, wherein the plurality of
keypoints comprises eight vertices of the cuboid in the image.
In a 65th aspect, the method of aspect 62, wherein the parameterized
representation comprises normalized offsets of the plurality of keypoints of the cuboid from
the center of the image.
In a 66th aspect, the method of aspect 62, wherein the parameterized
representation comprises N tuples.
In a 67th aspect, the method of aspect 62, wherein the parameterized
representation of the cuboid comprises 12 parameters.
In a 68th aspect, the method of aspect 62, wherein the parameterized
representation of the cuboid comprises a vanishing point parameterization.
In a 69th aspect, the method of any one of aspects 54-58, further
comprising: interacting with a user based on the refined RoI at the refined cuboid image
location and the representation of the cuboid.
In a 70th aspect, the method of aspect 69, wherein the cuboid corresponds
to a stationary box, and interacting with the user comprises: generating character animation
in relation to the stationary box based on the refined image location of the cuboid and the
representation of the cuboid.
In a 71st aspect, the method of aspect 69, wherein the cuboid corresponds
to a hand-held cuboid, and wherein interacting with the user comprises: determining a pose
of the cuboid using the representation of the cuboid; and interacting with the user based on
the pose of the cuboid.
In a 72nd aspect, the method of aspect 69, wherein the cuboid corresponds
to a rare object not recognizable by a third CNN, and wherein interacting with the user comprises: providing the user with a notification that the rare object not recognizable by the third CNN is detected.
In a 73rd aspect, the method of any one of aspects 54-72, further
comprising: assisting a user of the system during an unmanned flight based on the refined
RoI at the refined cuboid image location and the representation of the cuboid, wherein the
cuboid corresponds to a man-made structure.
In a 74th aspect, the method of any one of aspects 54-73, further
comprising: performing simultaneous location and mapping (SLAM) based on the refined RoI at
the refined cuboid image location and the representation of the cuboid, wherein the cuboid
corresponds to a marker.
In a 75th aspect, the method of any one of aspects 54-74, further
comprising: receiving a plurality of training images each comprising at least one training
cuboid; generating the cuboid detector and training the cuboid detector comprising: determining, using the cuboid detector, a training RoI at a training cuboid image location and a representation of a training cuboid in a training image of the plurality of training images; determining a first difference between a reference cuboid image location and the training cuboid image location and a second difference between a reference representation of the training cuboid and the determined representation of the training cuboid; and updating weights of the cuboid detector based on the first difference and the second difference.
In a 76th aspect, the method of aspect 75, wherein determining the training
RoI at the training cuboid image location and the representation of the training cuboid comprises: generating, using the plurality of convolutional layers and the non-convolutional layers, a training convolutional feature map for the at least one training image of the plurality of training images; determining, using the RPN, at least one training RoI comprising the training cuboid at an initial training cuboid image location in the training image; determining, using the initial training cuboid image location, a submap of the convolutional feature map corresponding to the at least one RoI comprising the cuboid; and determining, using the pooling layer, the at least one regressor layer, and the submap of the training convolutional feature map corresponding to the at least one training RoI comprising the training cuboid, the training RoI at the training cuboid image location and the representation of the training cuboid.
In a 77th aspect, the method of aspect 76, wherein the initial training
cuboid image location is represented as a two-dimensional (2D) bounding box.
In a 78th aspect, the method of aspect 75, wherein determining the training
RoI at the training cuboid image location and the representation of the training cuboid
comprises: iteratively determining, using the pooling layer, the at least one regressor layer,
and the submap of the training convolutional feature map corresponding to the training RoI
comprising the training cuboid, the RoI at the training cuboid image location and the
representation of the training cuboid.
In a 79th aspect, the method of aspect 78, wherein the initial training
cuboid image location is represented as a two-dimensional (2D) bounding box.
In a 80th aspect, the method of any one of aspects 75-79, wherein
updating weights of the cuboid detector comprises: updating the weights of the RPN and the
weights of the at least one regressor layer.
In a 81st aspect, the method of any one of aspects 75-79, wherein updating
weights of the cuboid detector comprises: updating the weights of the RPN and the weights of the at least one regressor layer without updating the weights of the first CNN.
In a 82nd aspect, the method of any one of aspects 75-79, wherein updating weights of the cuboid detector comprises: updating the weights of the first CNN, the weights of the RPN, and the weights of the at least one regressor layer.
In a 83rd aspect, the method of any one of aspects 54-82, wherein
generating the cuboid detector comprises: receiving the first CNN.
In a 84th aspect, the method of any one of aspects 75-83, wherein the at
least one regressor layer is associated with at least three loss functions during training of the
cuboid detector.
In a 85th aspect, the method of aspect 84, wherein the at least three loss
functions comprises a log loss function and a smooth L1 loss function.
In a 86th aspect, the method of any one of aspects 75-85, wherein the RPN
is associated with at least two loss functions during the training of the cuboid detector.
In a 87th aspect, the method of aspect 86, wherein the at least two loss functions comprises a log loss function and a smooth L1 loss function.
In a 88th aspect, a method is disclosed. The method is under control of a
hardware processor and comprises: receiving an image of the environment; analyzing the image to detect a cuboid in the image comprising utilizing layers of a convolutional neural network (CNN) to generate a convolutional feature map comprising features; utilizing a region proposal network (RPN) to map the convolutional feature map into a region of interest (RoI); pooling features in the RoI to generate first pooled features; passing the first pooled features through a regressor to generate a first bounding box estimate and a first cuboid
vertex estimate; generating second pooled features based on the first bounding box estimate;
and passing the second pooled features through the regressor to generate a second bounding
box estimate and a second cuboid vertex estimate.
In a 89th aspect, the method of aspect 88, wherein the image comprises a
monocular color image.
In a 90th aspect, the method of aspect 88 or aspect 89, wherein the RPN
comprises a CNN that maps the convolutional feature map to the RoI.
In a 91st aspect, the method of any one of aspects 88 to 89, wherein the
first bounding box estimate or the second bounding box estimate comprise offsets from a
center of a bounding box.
In a 92nd aspect, a method for detecting a cuboid in an image is disclosed.
The method is under control of a hardware processor and comprises: evaluating a convolutional neural network to generate a feature map; analyzing the feature map to obtain a region of interest (RoI); determining that the RoI contains a cuboid; analyzing first pooled features in the RoI of the feature map to generate a first estimate for vertices of the cuboid; generating an improved RoI based at least in part on the first estimate for the vertices of the cuboid; analyzing second pooled features in the improved RoI of the feature map to generate a second estimate for vertices of the cuboid; and outputting the second estimate for vertices of the cuboid.
In a 93rd aspect, the method of aspect 92, wherein analyzing the feature
map to obtain a region of interest (RoI) comprises evaluating a region proposal network
(RPN).
In a 94th aspect, the method of aspect 92 or 93, wherein the first estimate
for vertices of the cuboid comprise offsets from a center of the RoI, or the second estimate
for vertices of the cuboid comprise offsets from a center of the improved RoI.
Conclusion
Each of the processes, methods, and algorithms described herein and/or
depicted in the attached figures may be embodied in, and fully or partially automated by,
code modules executed by one or more physical computing systems, hardware computer processors, application-specific circuitry, and/or electronic hardware configured to execute specific and particular computer instructions. For example, computing systems can include general purpose computers (e.g., servers) programmed with specific computer instructions or special purpose computers, special purpose circuitry, and so forth. A code module may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language. In some implementations, particular operations and methods may be performed by circuitry that is specific to a given function.
Further, certain implementations of the functionality of the present disclosure are sufficiently mathematically, computationally, or technically complex that application-specific hardware or one or more physical computing devices (utilizing appropriate specialized executable instructions) may be necessary to perform the functionality, for example, due to the volume or complexity of the calculations involved or to provide results substantially in real-time. For example, a video may include many frames, with each frame having millions of pixels, and specifically programmed computer hardware is necessary to process the video data to provide a desired image processing task or application in a commercially reasonable amount of time.
Code modules or any type of data may be stored on any type of non-transitory computer-readable medium, such as physical computer storage including hard drives, solid state memory, random access memory (RAM), read only memory (ROM), optical disc, volatile or non-volatile storage, combinations of the same and/or the like. The methods and modules (or data) may also be transmitted as generated data signals (e.g., as part of a carrier wave or other analog or digital modulated signal) on a variety of computer-readable transmission mediums, including wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). The results of the disclosed processes or process steps may be stored, persistently or otherwise, in any type of non-transitory, tangible computer storage or may be communicated via a computer-readable transmission medium.
Any processes, blocks, states, steps, or functionalities in flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing code modules, segments, or portions of code which include one or more executable instructions for implementing specific functions (e.g., logical or arithmetical) or steps in the process. The various processes, blocks, states, steps, or functionalities can be combined, rearranged, added to, deleted from, modified, or otherwise changed from the illustrative examples provided herein. In some embodiments, additional or different computing systems or code modules may perform some or all of the functionalities described herein. The methods and processes described herein are also not limited to any particular sequence, and the blocks, steps, or states relating thereto can be performed in other sequences that are appropriate, for example, in serial, in parallel, or in some other manner. Tasks or events may be added to or removed from the disclosed example embodiments. Moreover, the separation of various system components in the implementations described herein is for illustrative purposes and should not be understood as requiring such separation in all implementations. It should be understood that the described program components, methods, and systems can generally be integrated together in a single computer product or packaged into multiple computer products. Many implementation variations are possible.
The processes, methods, and systems may be implemented in a network
(or distributed) computing environment. Network environments include enterprise-wide
computer networks, intranets, local area networks (LAN), wide area networks (WAN),
personal area networks (PAN), cloud computing networks, crowd-sourced computing
networks, the Internet, and the World Wide Web. The network may be a wired or a wireless
network or any other type of communication network.
The systems and methods of the disclosure each have several innovative aspects, no single one of which is solely responsible or required for the desirable attributes disclosed herein. The various features and processes described herein may be used independently of one another, or may be combined in various ways. All possible combinations and subcombinations are intended to fall within the scope of this disclosure. Various modifications to the implementations described in this disclosure may be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other implementations without departing from the spirit or scope of this disclosure. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.
Certain features that are described in this specification in the context of separate implementations also can be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation also can be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination. No single feature or group of features is necessary or indispensable to each and every embodiment.
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list. In addition, the articles “a,” “an,” and “the” as used in this application and the appended claims are to be construed to mean “one or more” or “at least one” unless specified otherwise.
As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: A, B, or C” is intended to cover: A, B, C, A and B, A and C, B and C, and A, B, and C. Conjunctive language such as the phrase “at least one of X, Y and Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to convey that an item, term, etc. may be at least one of X, Y or Z. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of X, at least one of Y and at least one of Z to each be present.
Similarly, while operations may be depicted in the drawings in a particular order, it is to be recognized that such operations need not be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Further, the drawings may schematically depict one or more example processes in the form of a flowchart. However, other operations that are not depicted can be incorporated in the example methods and processes that are schematically illustrated. For example, one or more additional operations can be performed before, after, simultaneously, or between any of the illustrated operations. Additionally, the operations may be rearranged or reordered in other implementations. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. Additionally, other implementations are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results.
The reference in this specification to any prior publication (or information
derived from it), or to any matter which is known, is not, and should not be taken as an
acknowledgment or admission or any form of suggestion that the prior publication (or
information derived from it) or known matter forms part of the common general knowledge
in the field of endeavour to which this specification relates.
Claims (20)
1. A system for training a cuboid detector, the system comprising:
    non-transitory memory configured to store executable instructions; and
    one or more hardware processors in communication with the non-transitory memory, the one or more hardware processors programmed by the executable instructions to:
        access a plurality of training images, wherein the plurality of training images includes a first training image;
        generate a cuboid detector, wherein the cuboid detector comprises:
            a plurality of convolutional layers and non-convolutional layers of a first convolutional neural network (CNN);
            a region proposal network (RPN) connected to a first layer of the plurality of convolutional layers and non-convolutional layers;
            a pooling layer; and
            at least one regressor layer;
            wherein the pooling layer and the at least one regressor layer are both connected to a second layer of the plurality of convolutional layers and non-convolutional layers; and
        train the cuboid detector, wherein training the cuboid detector comprises:
            determining, by applying the cuboid detector to the first training image, a region of interest (RoI) at a cuboid image location;
            determining, by applying the cuboid detector to the first training image, a representation of a cuboid in the training image;
            determining a first difference between a reference cuboid image location and the cuboid image location;
            determining a second difference between a reference representation of the cuboid and the determined representation of the cuboid; and
            updating weights of the cuboid detector based on the first difference and the second difference.
2. The system of claim 1, wherein the cuboid comprises a cuboid, a cylinder, a sphere, or any combination thereof.
3. The system of claim 1 or claim 2, wherein the first layer and the second layer are identical.
4. The system of any one of claims 1 to 3, wherein the one or more hardware processors is further programmed to: generate, using the plurality of convolutional layers and the non-convolutional layers, a convolutional feature map for the first training image; determine, using the RPN, at least one RoI comprising the cuboid at an initial cuboid image location in the training image; determine, using the initial cuboid image location, a submap of the convolutional feature map corresponding to the at least one RoI comprising the cuboid; and determine, using the pooling layer, the at least one regressor layer, and the submap of the convolutional feature map corresponding to the at least one RoI comprising the cuboid, the RoI at the cuboid image location and the representation of the cuboid.
5. The system of claim 4, wherein the initial cuboid image location is represented as a two-dimensional (2D) bounding box.
6. The system of any one of claims 1 to 5, wherein the one or more hardware processors is further programmed to: iteratively determine, using the pooling layer, the at least one regressor layer, and the submap of the convolutional feature map corresponding to the RoI comprising the cuboid, the RoI at the cuboid image location and the representation of the cuboid.
7. The system of claim 6, wherein the initial cuboid image location is represented as a two-dimensional (2D) bounding box.
8. The system of any one of claims 1 to 7, wherein the one or more hardware processors is further programmed to: update the weights of the RPN; and update the weights of the at least one regressor layer.
9. The system of any one of claims 1 to 7, wherein the one or more hardware processors is further programmed to: update the weights of the first CNN; update the weights of the RPN; and update the weights of the at least one regressor layer.
10. The system of any one of claims 1 to 9, wherein the one or more hardware processors is further programmed to: receive the first CNN.
11. The system of any one of claims 1 to 10, wherein the at least one regressor layer comprises two or more layers.
12. The system of claim 11, wherein the two or more layers comprise a fully connected layer, a non-fully connected layer, or any combination thereof.
13. The system of any one of claims 1 to 12, wherein the at least one regressor layer is associated with at least three loss functions during training of the cuboid detector.
14. The system of any one of claims 1 to 13, wherein the RPN comprises a deep neural network (DNN).
15. The system of any one of claims 1 to 14, wherein the RPN is associated with at least two loss functions during the training of the cuboid detector.
16. The system of any one of claims 1 to 15, wherein the representation of the cuboid comprises a parameterized representation of the cuboid.
17. A method for training a cuboid detector, the method comprising:
    accessing a plurality of training images, wherein the plurality of training images includes a first training image;
    generating a cuboid detector, wherein the cuboid detector comprises:
        a plurality of convolutional layers and non-convolutional layers of a first convolutional neural network (CNN),
        a region proposal network (RPN) connected to a first layer of the plurality of convolutional layers and non-convolutional layers,
        a pooling layer; and
        at least one regressor layer,
        wherein the pooling layer and the at least one regressor layer are both connected to a second layer of the plurality of convolutional layers and non-convolutional layers; and
    training the cuboid detector, wherein training the cuboid detector comprises:
        determining, by applying the cuboid detector to the first training image, a region of interest (RoI) at a cuboid image location;
        determining, by applying the cuboid detector to the first training image, a representation of a cuboid in the training image;
        determining a first difference between a reference cuboid image location and the cuboid image location;
        determining a second difference between a reference representation of the cuboid and the determined representation of the cuboid; and
        updating weights of the cuboid detector based on the first difference and the second difference.
18. The method of claim 17, wherein the first layer and the second layer are identical.
19. The method of claim 17 or claim 18, further comprising: generating, using the plurality of convolutional layers and the non-convolutional layers, a convolutional feature map for the first training image; determining, using the RPN, at least one RoI comprising the cuboid at an initial cuboid image location in the training image; determining, using the initial cuboid image location, a submap of the convolutional feature map corresponding to the at least one RoI comprising the cuboid; and determining, using the pooling layer, the at least one regressor layer, and the submap of the convolutional feature map corresponding to the at least one RoI comprising the cuboid, the RoI at the cuboid image location and the representation of the cuboid.
20. The method of any one of claims 17 to 19, further comprising: iteratively determining, using the pooling layer, the at least one regressor layer, and the submap of the convolutional feature map corresponding to the RoI comprising the cuboid, the RoI at the cuboid image location and the representation of the cuboid.
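To make the architecture and training step recited in claims 1 and 17 more tangible, the following PyTorch-style sketch shows regressor layers operating on pooled RoI features, with a box-location loss (the first difference) and a vertex-representation loss (the second difference) combined to update weights. This is a hypothetical illustration under assumed shapes and module names: the backbone, RPN, and anchor handling are omitted, roi_align merely stands in for the pooling layer, and nothing here is asserted to be the claimed implementation.

```python
# Minimal sketch (assumed shapes/names) of the training step in claims 1 and 17.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class CuboidHeads(nn.Module):
    """Regressor layers applied to pooled RoI features: one head refines the
    2D box (RoI), the other predicts 8 cuboid vertices as 16 offset values."""
    def __init__(self, in_channels=512, pool_size=7, hidden=1024):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * pool_size * pool_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.box_head = nn.Linear(hidden, 4)      # refined RoI (box deltas)
        self.vertex_head = nn.Linear(hidden, 16)  # 8 vertices as (dx, dy) offsets

    def forward(self, feature_map, rois):
        # `rois` is a list with one (N, 4) tensor of proposal boxes per image,
        # as accepted by torchvision.ops.roi_align.
        pooled = roi_align(feature_map, rois, output_size=(7, 7),
                           spatial_scale=1.0 / 16)
        x = self.fc(pooled)
        return self.box_head(x), self.vertex_head(x)

def training_step(feature_map, rois, ref_box_deltas, ref_vertex_offsets,
                  heads, optimizer):
    """One update: the box-location loss is the 'first difference' and the
    vertex-representation loss is the 'second difference' of the claims."""
    box_pred, vertex_pred = heads(feature_map, rois)
    loss_box = F.smooth_l1_loss(box_pred, ref_box_deltas)
    loss_vertex = F.smooth_l1_loss(vertex_pred, ref_vertex_offsets)
    loss = loss_box + loss_vertex
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the two losses are simply summed; a weighted combination, and additional loss terms for the RPN, would be natural variations consistent with the training described above.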
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US62/422,547 | 2016-11-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
NZ793982A true NZ793982A (en) | 2022-11-25 |
Family
ID=
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11797860B2 (en) | Deep learning system for cuboid detection | |
US20230394315A1 (en) | Room layout estimation methods and techniques | |
Henry et al. | RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments | |
Dwibedi et al. | Deep cuboid detection: Beyond 2d bounding boxes | |
NZ793982A (en) | Deep learning system for cuboid detection |