AU2018379393B2 - Monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments - Google Patents
Monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments
- Publication number
- AU2018379393B2, AU2018379393A
- Authority
- AU
- Australia
- Prior art keywords
- pointcloud
- ellipsoid
- cluster
- programmed
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/277—Analysis of motion involving stochastic approaches, e.g. using Kalman filters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20224—Image subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30232—Surveillance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/521—Depth or shape recovery from laser ranging, e.g. using interferometry; from the projection of structured light
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
- G06V20/653—Three-dimensional objects by matching three-dimensional models, e.g. conformal mapping of Riemann surfaces
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
Described herein is a method and computer system programmed to enable prediction of human activity. In one aspect, a method includes a capture phase (101), which includes capture of a stream of raw depth images from a depth camera device. A scene processing phase (102) is applied thereby to process the raw depth images from phase (101) to produce an upright pointcloud of foreground only points, aligned with an identified upright orientation (optionally via a ground plane). This is followed by a pointcloud segmentation stage (103) to spatially segment the upright pointcloud into smaller person proposal clusters which describe potential human targets. A pose extraction phase (104) then extracts a shoulder pose from each proposed cluster, rejecting proposals which violate some basic size and shape constraints. Valid poses are then passed for processing in a tracking phase (105). This includes tracking all detected person poses using a probabilistic estimator, such as a particle filter with social constraints to improve tracking robustness in densely crowded environments. Then, via an output phase (106), a final output is provided in the form of a stream of filtered poses with persistent tracking IDs which describe the trajectory of each person observed by the sensor.
Description
MONITORING SYSTEMS, AND COMPUTER IMPLEMENTED METHODS FOR PROCESSING DATA IN MONITORING SYSTEMS, PROGRAMMED TO ENABLE IDENTIFICATION AND TRACKING OF HUMAN TARGETS IN CROWDED ENVIRONMENTS
FIELD OF THE INVENTION
[0001] The present invention relates to monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments. The technology has primarily been developed to enable tracking of human subjects in crowded spaces, for example to enable automated predictive determination of human behaviour. It will be appreciated, however, that the technology is applicable to a wide range of use cases.
BACKGROUND
[0002] Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
[0003] Estimating and tracking human body pose is a key requirement for robots and intelligent systems that operate in human environments, enabling them to interact with people and make sense of human behaviour. An intelligent system with this capability could, for example, determine when an individual is interacting with other people to model social connections, or determine when a person is interacting with parts of the environment which hold semantic meaning, such as signage or vending machines.
[0004] Human pose estimation and tracking is a challenging task, as human environments are typically dynamic and unstructured, and as people come in a variety of shapes, sizes and appearances. The difficulty of this task is further increased in crowded environments due to frequent visual occlusions, and the number of targets to be tracked. An inner-city train station is a prime example of such an environment.
SUMMARY OF THE INVENTION
[0005] Example embodiments of the invention are provided in the claims below. Embodiments include devices and frameworks described herein (and aspects/elements thereof), methods described herein (and aspects/elements thereof) and computer program products and/or non-transitory carrier medium for carrying computer executable code that,
when executed on a processor, causes the processor to perform a method as described herein.
[0006] A first aspect of the present invention provides a computer system programmed to enable prediction of human activity, the system including:
a sensor device programmed to generate a stream of depth images for a target region;
an input module programmed to receive the stream of depth images for the target region, wherein each image in the stream of depth images is representative of a frame of pointcloud data;
a scene processing module that is programmed to: (i) receive data from the input module; and (ii) process that data thereby to establish the upright orientation for the target region, and transform at least a subset of the frames of pointcloud data into a predefined orientation based on the upright orientation;
a pointcloud segmentation module that is programmed to (i) receive, from the scene processing module, data representative of a plurality of frames of pointcloud data in predefined orientation relative to the upright orientation; and (ii) for at least a subset of those frames, process the data thereby to identify one or more clusters, wherein each cluster represents a potential human target; and
an ellipsoid-based body part extraction module that is programmed to: (i) receive from the pointcloud segmentation module, data representative of the clusters; and (ii) for each cluster, perform an ellipsoid fitting process thereby to, for a given cluster:
(a) in the case that the ellipsoid fitting process is successful, identify a human target cluster; or
(b) in the case that the ellipsoid fitting process is unsuccessful, reject the cluster; and
a pose extraction module programmed to, for each cluster in respect of which a human target is identified, determine pose attributes for the human target based on characteristics of an ellipsoid fitted via the ellipsoid fitting process.
[0007] In some embodiments, the system includes a tracking module, wherein the tracking module is programmed to, for each human target cluster identified by the ellipsoid-based body part extraction module, perform a tracking process using data provided by the pose extraction module. The tracking module may assign a respective probabilistic estimator with social constraints to each human target cluster. The probabilistic estimator is preferably a particle filter.
[0008] In some embodiments, the tracking module is programmed to, for each human target cluster, perform a process including: (i) performing a prediction sub-process thereby to predict a next state of particles based on current state, motion model and social constraints; and (ii) performing a data association sub-process thereby to associate pose attributes for a given frame to one of the particle filters.
[0009] In some embodiments, the method includes performing the data association sub-process thereby to associate pose attributes for a given frame to one of the particle filters based on a greedy nearest neighbours approach.
[0010] In some embodiments, the prediction sub-process includes applying social constraints by: (i) performing a first prediction for each filter thereby to update a mean hypothesis of each filter; (ii) identifying invalid particles in any given filter, including any particles within a fixed radius of any other filter mean position; and (iii) re-drawing invalid particles until valid.
[0011] In some embodiments, the system includes an output module, wherein the output module is programmed to, based on data provided by the tracking module, provide a stream of filtered poses which describe the trajectory of one or more predicted humans respectively associated with each of the human target clusters. The output module preferably associates a persistent tracking ID to each predicted human. The output module preferably enables graphical representation of each predicted human as a 3D avatar capable of displaying pose attributes.
[0012] In some embodiments, the scene processing module is programmed to perform a background subtraction process. The background subtraction process preferably includes processing at least a subset of depth image frames to incrementally update a background model describing the depth values consistent with the static environment, and processing at least a subset of depth image frames to exclude depth values consistent with the background model, wherein only the remaining foreground pixels are provided for further processing. Preferably the background depth image is used to establish the ground plane.
[0013] In some embodiments, the scene processing module is programmed to provide to the pointcloud segmentation module, for the subset of frames, an upright pointcloud of foreground, aligned with the ground plane.
[0014] In some embodiments, the scene processing module is programmed to, for a given frame, sort points in the pointcloud based on height relative to the ground plane, and perform an iterative top-down method thereby to sort the points into clusters based on a separation distance threshold. The iterative top-down method preferably includes iterating through each point pi in the pointcloud and performing a decision to either: (i) add that point to a nearest cluster; or (ii) start a new cluster. Preferably, the decision is based on comparing a horizontal distance di,j between each point pi and each cluster Cj to a fixed separation distance threshold d0, and if a point is less than the separation distance to any cluster mean, adding that point to the nearest such cluster, with the mean of the cluster being updated incrementally.
[0015] In one embodiment, the method includes performing a cluster joining process including checking a distance between cluster means, and joining those clusters with a distance less than d0 in a greedy fashion.
[0016] In one embodiment, the method includes removing clusters having either: (i) less than a threshold number of points; or (ii) less than a threshold representative surface area.
[0017] In some embodiments, the ellipsoid-based body part extraction module is programmed to: (i) receive a given cluster outputted by the pointcloud segmentation module; and (ii) sequentially seek to apply a head ellipsoid representative of a head, and a shoulder ellipsoid representative of shoulders. Preferably, the head ellipsoid is fit to a first vertical window of points having a predefined fixed height, extending downward from the top of the pointcloud. Preferably, the shoulder ellipsoid is fit to a second vertical window of points defined relative to the first vertical window. Preferably, the pose extraction module is programmed to determine the pose attributes based on characteristics of the shoulder ellipsoid. Preferably, horizontal pose attributes are determined from a centre of the shoulder ellipsoid, and a vertical pose attribute is calculated from an intersection between a vertical line passing through the centre of the shoulder ellipsoid and an upper surface of the shoulder ellipsoid.
[0018] In some embodiments, an orientation of the shoulders is determined by projecting a major axis of the shoulder ellipsoid into the horizontal plane and taking the angle of the resulting line, with this angle being rotated 90 degrees to obtain a facing direction of the human, and with a horizontal location of the head ellipsoid centre relative to the shoulder ellipsoid major axis being used to determine the forwards facing direction.
[0019] In some embodiments, the pointcloud segmentation module is further programmed to perform downsampling on at least a subset of the plurality of frames of pointcloud data. Preferably, the downsampling includes voxelizing each frame of the subset of frames of pointcloud data.
[0020] A second aspect of the present invention provides a method for predicting human activity, the method including:
receiving, from a sensor device, a stream of depth images for a target region, wherein each image in the stream of depth images is representative of a frame of pointcloud data;
operating a scene processing module that is programmed to: (i) receive data from the input module; and (ii) process that data thereby to establish an upright orientation for the target region, and transform at least a subset of the frames of pointcloud data into a predefined orientation based on the upright orientation;
operating a pointcloud segmentation module that is programmed to (i) receive, from the scene processing module, data representative of a plurality of frames of pointcloud data in predefined orientation relative to the upright orientation; and (ii) for at least a subset of those frames, process the data thereby to identify one or more clusters, wherein each cluster represents a potential human target; and operating an ellipsoid-based body part extraction module that is programmed to: (i) receive from the pointcloud segmentation module, data representative of the clusters; and (ii) for each cluster, perform an ellipsoid fitting process thereby to, for a given cluster:
(a) in the case that the ellipsoid fitting process is successful, identify a human target cluster; or
(b) in the case that the ellipsoid fitting process is unsuccessful, reject the cluster; and
operating a pose extraction module programmed to, for each cluster in respect of which a human target is identified, determine pose attributes for the human target based on characteristics of an ellipsoid fitted via the ellipsoid fitting process.
[0021] In some embodiments, the pointcloud segmentation module is further programmed to perform downsampling on at least a subset of the plurality of frames of pointcloud data. Preferably, the downsampling includes voxelizing each frame of the subset of frames of pointcloud data.
[0022] A third aspect of the present invention provides a computer program product for performing a method according to the second aspect.
[0023] A fourth aspect of the present invention provides a non-transitory carrier medium for carrying computer executable code that, when executed on a processor, causes the processor to perform a method according to the second aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:
[0025] FIG. 1 illustrates a method according to one embodiment.
[0026] FIG. 2 illustrates example screen outputs generated according to an embodiment.
[0027] FIG. 3 illustrates a system according to an embodiment.
[0028] FIG. 4 illustrates example screen outputs generated according to an embodiment.
[0029] FIG. 5 illustrates an example screen output according to an embodiment.
[0030] FIG. 6 and FIG. 7 provide tables illustrating experimental results.
[0031] FIG. 8 illustrates a background subtraction process.
[0032] FIG. 9 illustrates an example graph of the error between the number of people determined by a method according to an embodiment of the invention in a real-life scenario and the number of people determined by manual count of depth data in that scenario.
[0033] FIG. 10 illustrates an example graph of the actual number of people determined by manual count of the same time period as in FIG. 9.
DETAILED DESCRIPTION
[0034] The present invention relates to monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments. The technology has primarily been developed to enable tracking of human subjects in crowded spaces, for example to enable automated predictive determination of human behaviour. It will be appreciated, however, that the technology is applicable to a wide range of use cases.
[0035] Embodiments relate to extracting and tracking the shoulder pose of multiple people in a crowded environment from a stream of depth images in real-time. Shoulder pose, in this regard, is defined as a 3D position of a subject’s shoulders and an angular orientation of the shoulders about the vertical axis. There are two significant aspects of embodiments described herein, which may be used individually or in combination: (i) technology for extracting shoulder pose by fitting ellipsoids to pointcloud data of a person’s head and shoulders; and (ii) technology including a probabilistic estimator (e.g. particle filter) based tracking algorithm incorporating knowledge of social constraints and human movement to improve robustness in crowded scenes.
Method Overview
[0036] FIG. 1 provides an overview of a pose extraction and tracking method, which is described in more detail further below. This is accompanied by FIG. 2, which provides rendered graphical representations of outputs at each phase of the method of FIG. 1.
[0037] FIG. 1 illustrates a capture phase 101, which includes capture of a stream of raw depth images from a depth camera device. Example depth imaging devices include a time-of-flight camera, a stereo camera system and a structured light projection camera. The stream is optionally provided at a rate of 30 Hz (and processing steps described herein are optionally applied in respect of each frame, or to a subset of frames, for example at a predefined interval). Each frame in the stream is a depth image including pointcloud data, which is renderable as a depth image such as image 201 of FIG. 2.
[0038] A scene processing phase 102 is applied thereby to process the raw depth images from phase 101 to produce an upright pointcloud of foreground only points, aligned with an identified upright orientation (optionally via a ground plane). This is followed by a pointcloud segmentation stage 103 to spatially segment the upright pointcloud into smaller person proposal clusters which describe potential human targets (as shown in image 202). A pose extraction phase 104 then extracts a shoulder pose from each proposed cluster, rejecting proposals which violate some basic size and shape constraints (as shown in image 203). Valid poses are then passed for processing in a tracking phase 105. This includes tracking all detected person poses using a probabilistic estimator, such as a particle filter with social constraints to improve tracking robustness in densely crowded environments. Then, via an output phase 106, a final output is provided in the form of a stream of filtered poses with persistent tracking IDs which describe the trajectory of each person observed by the sensor (as shown in image 204).
Example System
[0039] FIG. 3 illustrates a system according to one embodiment. This system is described by reference to a plurality of modules. The term "module" refers to a software component that is logically separable (a computer program), or a hardware component. The module of the embodiment refers to not only a module in the computer program but also a module in a hardware configuration. The discussion of the embodiment also serves as the discussion of computer programs for causing the modules to function (including a program that causes a computer to execute each step, a program that causes the computer to
function as means, and a program that causes the computer to implement each function), and as the discussion of a system and a method.
[0040] The system of FIG. 3 is programmed to enable prediction of human activity. In this example the system includes a single primary input device, in the form of a sensor device 301. This sensor device is programmed to generate a stream of depth images for a target region; each image in the stream of depth images is representative of a frame of pointcloud data. The input device is in some embodiments a depth camera (such as a Primesense Carmine or Kinect), which is preferably mounted at a height of approximately 2.5 metres and angled down towards the ground such that the floor is the largest visible plane.
[0041] An input module 302 is programmed to receive the stream of depth images for the target region. Input module 302 represents a functional module in which the initial image stream is received. This may include buffering the image stream and optionally performing pre-processing to ensure the images are in an appropriate format for subsequent processing. The input module 302 may also be responsible for converting the depth images to pointcloud data. However, in some embodiments, this conversion process may be performed by the scene processing module described below. The conversion of depth images to pointcloud data requires the intrinsic calibration parameters of the camera. In preferred embodiments, this is performed purely in software. However, it will be appreciated that the conversion process and other functions of the input module 302 may be performed using dedicated hardware such as a Field Programmable Gate Array or the like.
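By way of non-limiting illustration, the following Python sketch shows one way the depth-to-pointcloud conversion could be implemented in software under a standard pinhole camera model. The function name, the depth scale and the example intrinsic values are assumptions for illustration only and are not taken from the described embodiment.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Project a depth image (H x W, raw sensor units) to an N x 3 pointcloud in metres.

    Assumes a pinhole camera model with intrinsics (fx, fy, cx, cy).
    Invalid pixels (depth == 0) are discarded.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32) * depth_scale      # raw units -> metres
    valid = z > 0
    x = (u - cx) * z / fx                           # back-project image columns
    y = (v - cy) * z / fy                           # back-project image rows
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)

# Example usage with illustrative (uncalibrated) intrinsics:
# cloud = depth_to_pointcloud(depth_image, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```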
[0042] This may include all frames of pointcloud data (for example at a 30Hz sensor capture rate), or a subset of the frames.
[0043] A scene processing module 303 is programmed to: (i) receive data from the input module; and (ii) process that data thereby to establish an upright orientation for the target region, and transform at least a subset of the frames of pointcloud data into a predefined orientation based on the upright orientation. Scene processing module 303 is responsible for implementing scene processing phase 102. The upright orientation may be established via various techniques, including (but not limited to) the use of an accelerometer on the input device, and image processing steps that identify a ground plane.
[0044] In an example embodiment, a primary purpose of the scene processing module is to convert each frame of depth data into an upright, foreground pointcloud, that is a pointcloud containing only points which are not part of the static environment and transformed such that the z = 0 plane is aligned with the ground plane of the scene. The
main benefits to this are: (i) the chance of false positive person detections is reduced by removing pixels belonging to the environment from consideration; (ii) the amount of computation required in subsequent stages is reduced by drastically lowering the number of pixels processed, and (iii) downstream clustering and pose estimation algorithms are simplified thanks to alignment of the pointcloud to the floor.
[0045] A first step in the scene processing phase includes applying a bilateral filter to each raw depth image to smooth out noise and artefacts of the sensing process such as prominent steps in depth resolution present in data obtained from structured light depth cameras. The bilateral filter averages each depth value with those in its pixel neighbourhood weighted by a combination of two factors: the pixel distance in the image space, and the difference in the original depth values. In this way, the bilateral filter is able to smooth consistent regions while maintaining sharp edges.
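A minimal sketch of this filtering step is given below, assuming OpenCV is available and the depth image is a single-channel float array in metres; the kernel diameter and sigma values are illustrative assumptions rather than the embodiment's parameters.

```python
import cv2
import numpy as np

def smooth_depth(depth_raw, d=5, sigma_color=0.05, sigma_space=5.0):
    """Bilateral-filter a raw depth image (float32, metres).

    Each depth value is averaged over its pixel neighbourhood, weighted by both
    image-space distance and depth difference, so consistent regions are
    smoothed while sharp depth edges are preserved.
    """
    depth = depth_raw.astype(np.float32)
    return cv2.bilateralFilter(depth, d, sigma_color, sigma_space)
```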
[0046] In order to segment parts of the depth image potentially describing people from those representing the static environment, background subtraction is performed. In some embodiments, the background subtraction process includes, processing at least a subset of depth image frames to incrementally update a background model describing the depth values consistent with the static environment, and processing at least a subset of depth image frames to exclude depth values consistent with the background model. Only the remaining foreground pixels are provided for further processing. In a further embodiment, on a periodic basis (for example once per second) the filtered depth image is used to update a background depth image representing the furthest observed depth at each pixel. Every frame of the filtered depth image is compared with the background image and any pixels with a depth value less than a predefined threshold (for example between 75% and 95%, preferably 90%) of the corresponding background depth are considered part of the foreground with all other pixels set to zero (invalid) in the produced foreground image.
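The following sketch illustrates, under assumed array shapes and using the 90% figure mentioned above as an example threshold, how such a furthest-depth background model and foreground mask could be maintained; the class and method names are assumptions, and this is a simplified illustration rather than the claimed implementation.

```python
import numpy as np

class DepthBackgroundSubtractor:
    """Maintains a per-pixel background depth (furthest observed depth) and
    masks out pixels consistent with it. Simplified illustration only."""

    def __init__(self, shape, ratio=0.9):
        self.background = np.zeros(shape, dtype=np.float32)  # furthest depth seen
        self.ratio = ratio                                    # e.g. 90% threshold

    def update_background(self, filtered_depth):
        # Keep the furthest valid depth observed at each pixel (run periodically).
        valid = filtered_depth > 0
        further = valid & (filtered_depth > self.background)
        self.background[further] = filtered_depth[further]

    def foreground(self, filtered_depth):
        # Pixels closer than ratio * background are foreground; all others are
        # set to zero (invalid) in the produced foreground image.
        fg = filtered_depth.copy()
        is_background = (self.background > 0) & (filtered_depth >= self.ratio * self.background)
        fg[is_background] = 0
        return fg
```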
[0047] As further context, background subtraction aims to segment parts of the depth image potentially describing people from those representing the static environment. In one embodiment a model of the static background is learned incrementally from the depth data and used to mask out pixels of each depth image consistent with the model, leaving only those considered to describe the foreground (as illustrated in FIG. 8). The background model is optionally learned based on a concept whereby an expected value of the background at each pixel in a colour or greyscale scene is modelled as a mixture of Gaussians. Preferably, the approach is to use a single Gaussian per pixel, owing to the inherent relevance of depth information to the task of background modelling.
[0048] The background depth image is also used to determine the ground plane of the scene. For example, a preferred approach includes allowing an initial burn-in time for the background image to be established, and subsequently projecting that image into a pointcloud representation and a plane is fitted using Random Sample Consensus (RANSAC). In all subsequent frames the foreground depth image is projected into a pointcloud representation and transformed using the established ground plane such that the z = 0 plane is aligned with the ground plane.
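Purely as an illustrative sketch, the following shows a basic RANSAC plane fit and the construction of a transform that aligns the fitted ground plane with z = 0; the iteration count, inlier distance and helper names are assumptions of this sketch, not the embodiment's values.

```python
import numpy as np

def fit_ground_plane_ransac(points, iters=200, inlier_dist=0.03, rng=None):
    """Fit a plane n.p + d = 0 to a pointcloud with a basic RANSAC loop."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best_plane = 0, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                                  # degenerate sample
        n = n / np.linalg.norm(n)
        d = -np.dot(n, p0)
        inliers = np.sum(np.abs(points @ n + d) < inlier_dist)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane

def ground_alignment_transform(n, d):
    """Rotation and translation taking the fitted plane onto z = 0 (upright frame)."""
    if n[2] < 0:                                      # make the plane normal point up
        n, d = -n, -d
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)
    c = np.dot(n, z)
    if np.linalg.norm(v) < 1e-9:
        R = np.eye(3)
    else:
        vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        R = np.eye(3) + vx + vx @ vx * (1.0 / (1.0 + c))   # Rodrigues-style alignment
    t = np.array([0.0, 0.0, d])                       # shifts the rotated plane onto z = 0
    return R, t

# Example usage: p_upright = (R @ p_camera) + t for each foreground point.
```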
[0049] As shown in FIG. 4, a bilaterally filtered depth image (left) is used to regularly update a background depth image (middle). Each frame of depth data is compared to the current background image in order to remove pixels that conform with the background depth and produce the foreground depth image (right).
[0050] A pointcloud segmentation module 304 is programmed to implement the pointcloud segmentation phase. This includes (i) receiving, from the scene processing module, data representative of a plurality of frames of pointcloud data in predefined orientation relative to the ground frame; and (ii) for at least a subset of those frames, processing the data thereby to identify one or more clusters, wherein each cluster represents a potential human target.
[0051] In an example embodiment, the pointcloud segmentation phase segments the pointcloud into human sized clusters based on the following assumptions:
[0052] People stand with the length of their body perpendicular to the floor.
[0053] The tallest point on a person’s body is their head.
[0054] People’s heads are spatially separated from one another.
[0055] In some embodiments, before or during the pointcloud segmentation phase, a downsampling process may be performed to reduce the computational workload. This downsampling may include voxelizing each pointcloud frame or a subset of the pointcloud frames. Voxelizing involves converting the pointcloud into a regular grid of voxels. The density of the voxel grid will determine the amount of data to be computed; a coarser voxel grid can be processed more rapidly while a denser voxel grid provides a more accurate representation of the original pointcloud.
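A minimal voxelization sketch follows; the 3 cm voxel size is an assumed example value chosen only to illustrate the speed/accuracy trade-off described above.

```python
import numpy as np

def voxel_downsample(points, voxel_size=0.03):
    """Downsample a pointcloud by averaging the points falling in each voxel.

    A coarser voxel_size lowers the computational load of later stages at the
    cost of geometric detail.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)            # voxel index per point
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3), dtype=np.float64)
    np.add.at(sums, inverse, points)                                  # accumulate per voxel
    return (sums / counts[:, None]).astype(points.dtype)              # centroid per voxel
```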
[0056] The segmentation algorithm starts by sorting the pointcloud in descending height order and then iterates through each point pi in the pointcloud and either: adds it to the nearest cluster, or starts a new cluster. This decision is based on comparing the horizontal distance di,j between each point pi and each cluster Cj to a fixed separation distance threshold d0. If a point is less than the separation distance to any cluster mean, it is added to the nearest such cluster and the mean of the cluster is updated incrementally.
[0057] After clustering is complete there are often cases where a person is split into multiple clusters due to points at a person’s horizontal extremities, such as their shoulders, for which di,j > d0. To manage this effect, a final cluster joining step is performed thereby to assess a distance between cluster means, and join those with a distance less than d0 in a greedy fashion. Finally any clusters with a small number of points or representing a small surface area are removed.
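The following sketch illustrates the top-down clustering and greedy joining described above; the separation threshold d0, minimum cluster size and function name are illustrative assumptions rather than the embodiment's values.

```python
import numpy as np

def cluster_top_down(points, d0=0.4, min_points=50):
    """Greedy top-down clustering of an upright pointcloud into person proposals.

    Points are visited in descending height (z); each point joins the nearest
    cluster whose mean lies within the horizontal separation threshold d0,
    otherwise it seeds a new cluster. Clusters whose means are closer than d0
    are then merged, and very small clusters discarded.
    """
    order = np.argsort(-points[:, 2])                      # tallest points first
    means, members = [], []
    for p in points[order]:
        if means:
            d = np.linalg.norm(np.asarray(means)[:, :2] - p[:2], axis=1)
            j = int(np.argmin(d))
            if d[j] < d0:
                members[j].append(p)
                n = len(members[j])
                means[j] = means[j] + (p - means[j]) / n   # incremental mean update
                continue
        means.append(p.copy())                             # start a new cluster
        members.append([p])

    # Greedy joining of clusters whose means are within d0 of each other.
    merged = [False] * len(means)
    clusters = []
    for i in range(len(means)):
        if merged[i]:
            continue
        group = list(members[i])
        for j in range(i + 1, len(means)):
            if not merged[j] and np.linalg.norm(means[i][:2] - means[j][:2]) < d0:
                group.extend(members[j])
                merged[j] = True
        if len(group) >= min_points:                       # drop very small clusters
            clusters.append(np.asarray(group))
    return clusters
```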
[0058] An ellipsoid-based body part extraction module 305 that is programmed to: (i) receive from the pointcloud segmentation module 304, data representative of the clusters; and (ii) for each cluster, perform an ellipsoid fitting process thereby to, for a given cluster:
(a) in the case that the ellipsoid fitting process is successful, identify a human target cluster; or
(b) in the case that the ellipsoid fitting process is unsuccessful, reject the cluster (in some embodiments, if the model fitting is unsuccessful, observations are still used to update an existing track; although the cluster is rejected, it is not discarded completely).
[0059] A pose extraction module 306 is programmed to, for each cluster in respect of which a human target is identified, determine pose attributes for the human target based on characteristics of an ellipsoid fitted via the ellipsoid fitting process (including operation of an ellipsoid fitting module 307).
[0060] As further context, to extract the pose of each person, a model of the visible surface of the head and shoulders is fit to pointcloud data of the upper body. The surface model should: (1) be similar enough to the shape of human head and shoulders as to provide a good fit, (2) allow extraction of a stable shoulder position and orientation, (3) be flexible enough to encompass the variety of shapes and sizes within the population, and (4) be robust to relative motion between the head and shoulders. With these requirements in mind a pair of ellipsoids was chosen as a suitable surface model: one fit to the head, and one fit to the shoulders, as shown in FIG. 5.
[0061] Each of the clusters output by the pointcloud segmentation stage is provided as input to the pose extraction process, which extracts a shoulder pose comprised of a 3D position and angle of orientation about the vertical axis. In order to fit ellipsoids specifically to the head and shoulders, candidate points must be selected from the pointcloud which are likely to represent these parts of the body. This task requires some assumptions about
the size and shape of the head and shoulders of a person, and is made challenging by the wide range of sizes and shapes within the population. To guide these assumptions, a preferred approach is to use statistical data. For example, an example embodiment makes use of a 2012 Anthropometric Survey of U.S. Army Personnel to set physical selection criteria where needed. Ellipsoids are fit sequentially (head then shoulders) to leverage parameters of the head ellipsoid in selecting candidate points for the shoulder fit, hence adapting our physical criteria to the individual.
[0062] First the head ellipsoid is fit to a vertical window of points with fixed height extending downward from the top of the pointcloud. In the example embodiment, a window size of 21cm is used based on the 10th percentile measurement from the top of head to the cervical to capture most of the points on the head while minimising the chance of including the neck or shoulders. The shoulder ellipsoid is similarly fit to a fixed vertical window of points, extending 21cm downward (90th percentile neck to scye length) from the centre of the head ellipsoid.
[0063] To ensure that the shoulder ellipsoid fits the breadth of the shoulders rather than the neck area, in the example embodiment a copy of the head ellipsoid, with radii dilated by between about 10cm and 20cm (preferably 14cm or within 10% of that value), is used to remove the neck and collar region from the points to be fit. This ensures that the fit is dominated by the shoulder tips, improving the quality of orientation estimates obtained.
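As a simplified sketch of the candidate-point selection described above, the following assumes the 21 cm windows and 14 cm dilation mentioned as example values, and approximates the fitted head ellipsoid by an axis-aligned bounding estimate where the actual ellipsoid fit would be used; the function name and defaults are assumptions only.

```python
import numpy as np

def select_head_and_shoulder_points(cluster, head_window=0.21, shoulder_window=0.21,
                                    neck_dilation=0.14):
    """Select candidate points for the head and shoulder ellipsoid fits.

    The head window extends a fixed distance down from the top of the cluster;
    the shoulder window extends down from the head centre, and a dilated copy
    of the head ellipsoid is used to mask out the neck/collar region.
    """
    top_z = cluster[:, 2].max()
    head_pts = cluster[cluster[:, 2] >= top_z - head_window]

    # Placeholder for the fitted head ellipsoid (centre + semi-axis lengths).
    head_centre = head_pts.mean(axis=0)
    head_radii = np.maximum((head_pts.max(axis=0) - head_pts.min(axis=0)) / 2.0, 1e-3)

    band = (cluster[:, 2] <= head_centre[2]) & (cluster[:, 2] >= head_centre[2] - shoulder_window)
    shoulder_pts = cluster[band]

    # Remove points inside the dilated head ellipsoid (neck and collar region),
    # so the shoulder fit is dominated by the shoulder tips.
    radii = head_radii + neck_dilation
    q = ((shoulder_pts - head_centre) / radii) ** 2
    shoulder_pts = shoulder_pts[q.sum(axis=1) > 1.0]
    return head_pts, shoulder_pts
```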
[0064] Once the head and shoulder ellipsoids have been established they are used to extract a shoulder pose consisting of a 3D position and angle of orientation about the vertical axis. The horizontal components of the pose are taken directly from the centre of the shoulder ellipsoid as this position is more stable than that of the head. However the vertical component of the shoulder ellipsoid is less stable due to its high dependence on the vertical window used to select points for the fit. For this reason the vertical component of the pose is based on the top surface of the shoulder ellipsoid as it is more indicative of the true height of the person’s shoulders. The vertical component of pose is calculated by the intersection between a vertical line passing through the ellipsoid centre and the upper surface of the ellipsoid.
[0065] Finally the orientation of the shoulders is obtained by projecting the major axis of the shoulder ellipsoid into the horizontal plane and taking the angle of the resulting line. This angle is rotated 90° to obtain the facing direction of the person rather than the line of their shoulders, however the forwards direction is ambiguous based on the axis of the shoulders alone. To resolve this ambiguity we make the assumption that the head is forward of the
shoulders. The horizontal location of the head ellipsoid centre relative to the shoulder ellipsoid major axis is used to determine the forwards facing direction and set the orientation angle accordingly.
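The geometry described in the preceding two paragraphs can be sketched as follows; representing each fitted ellipsoid by a centre, semi-axis lengths and an axis matrix is an assumption of this sketch rather than the claimed parameterisation.

```python
import numpy as np

def shoulder_pose(shoulder_centre, shoulder_radii, shoulder_axes, head_centre):
    """Derive a shoulder pose (x, y, z, theta) from fitted ellipsoids.

    shoulder_axes is a 3x3 matrix whose orthonormal columns are the ellipsoid
    axes, and shoulder_radii the corresponding semi-axis lengths.
    """
    x, y = shoulder_centre[0], shoulder_centre[1]

    # Vertical component: intersect a vertical line through the centre with the
    # upper surface of the shoulder ellipsoid.
    ez_local = shoulder_axes.T @ np.array([0.0, 0.0, 1.0])
    t = 1.0 / np.sqrt(np.sum((ez_local / shoulder_radii) ** 2))
    z = shoulder_centre[2] + t

    # Orientation: project the major axis into the horizontal plane and rotate
    # the resulting angle by 90 degrees to obtain a facing direction.
    major = shoulder_axes[:, np.argmax(shoulder_radii)]
    theta = np.arctan2(major[1], major[0]) + np.pi / 2.0

    # Disambiguate forwards/backwards: the head is assumed to sit forward of
    # the shoulder axis, so flip theta if the head lies on the other side.
    facing = np.array([np.cos(theta), np.sin(theta)])
    to_head = head_centre[:2] - shoulder_centre[:2]
    if np.dot(facing, to_head) < 0:
        theta += np.pi
    theta = (theta + np.pi) % (2 * np.pi) - np.pi        # wrap to (-pi, pi]
    return np.array([x, y, z, theta])
```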
[0066] In the crowded scenarios targeted by this work the number of people in the field-of-view of the sensor at any time can be as many as 15. With 2 ellipsoids to be fit per person and 30 frames of depth data per second this could mean fitting up to 900 ellipsoids per second of data. In order to process all data in real-time it is therefore important to use an efficient method for ellipsoid fitting, rendering iterative optimisation methods less applicable.
[0067] The example embodiment leverages an ellipsoid fitting method proposed by Li et al. in Li, Q.L.Q., Griffiths, J.: Least squares ellipsoid specific fitting. Geometric Modeling and Processing, 2004. Proceedings 2004, 4-9 (2004). DOI 10.1109/GMAP.2004.1290055. This finds the least squares fit of a quadric surface of the form:

ax² + by² + cz² + 2fyz + 2gxz + 2hxy + 2px + 2qy + 2rz + d = 0

[0068] to a set of 3D points subject to the constraint:

kJ − I² = 1, with k = 4

[0069] Where:

I = a + b + c and J = ab + bc + ca − f² − g² − h²

[0070] Li et al. show that this constraint is sufficient to guarantee that the quadric surface fit is an ellipsoid, and the problem can be efficiently solved by formulating it as an eigensystem.
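A sketch in the spirit of the cited eigensystem formulation is given below; the constraint matrix, block reduction and eigenvector selection rule used here are assumptions of this sketch based on the cited paper, not necessarily the exact formulation of the embodiment.

```python
import numpy as np

def fit_ellipsoid(points, k=4.0):
    """Ellipsoid-specific least-squares quadric fit (Li & Griffiths style sketch).

    Fits a x^2 + b y^2 + c z^2 + 2f yz + 2g xz + 2h xy + 2p x + 2q y + 2r z + d = 0
    subject to kJ - I^2 = 1 (k = 4 guarantees an ellipsoid), solved as a
    generalised eigenvalue problem. Returns the 10 quadric coefficients.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Design matrix split into quadratic (D1) and linear/constant (D2) parts.
    D1 = np.column_stack([x * x, y * y, z * z, 2 * y * z, 2 * x * z, 2 * x * y])
    D2 = np.column_stack([2 * x, 2 * y, 2 * z, np.ones_like(x)])

    S11 = D1.T @ D1
    S12 = D1.T @ D2
    S22 = D2.T @ D2

    # Constraint matrix for kJ - I^2 acting on (a, b, c, f, g, h).
    C1 = np.zeros((6, 6))
    C1[:3, :3] = (k - 2.0) / 2.0
    np.fill_diagonal(C1[:3, :3], -1.0)
    C1[3, 3] = C1[4, 4] = C1[5, 5] = -k

    # Reduced eigenproblem: M v1 = lambda C1 v1, with the linear part eliminated.
    M = S11 - S12 @ np.linalg.solve(S22, S12.T)
    _, eigvecs = np.linalg.eig(np.linalg.solve(C1, M))

    # Keep an eigenvector that actually satisfies the ellipsoid constraint.
    best = None
    for i in range(6):
        v1 = np.real(eigvecs[:, i])
        c_val = v1 @ C1 @ v1
        if c_val > 0:
            best = v1 / np.sqrt(c_val)               # scale so kJ - I^2 = 1
    if best is None:
        raise ValueError("no ellipsoid-consistent solution found")
    v2 = -np.linalg.solve(S22, S12.T @ best)
    return np.concatenate([best, v2])                # (a, b, c, f, g, h, p, q, r, d)
```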
[0071] A tracking module 308 is programmed to, for each human target cluster identified by the ellipsoid-based body part extraction module, perform a tracking process using data provided by the pose extraction module.
[0072] In the example embodiment, multi-target pose tracking is achieved with a probabilistic estimator in the form of a particle filter per target. Each filter maintains a collection of particles xⁱ, representing the distribution over possible states of the target in terms of position x, y, velocity ẋ, ẏ and orientation θ:

x = [x, y, ẋ, ẏ, θ]ᵀ
[0073] At each time step, triggered either by a new frame of data or by a timer, the tracking module is programmed to perform the following steps:
[0074] Prediction - Next state of particles is predicted based on current state, motion model and social constraints.
[0075] Data Association - Pose observations are associated to existing filters based on a greedy nearest neighbours approach.
[0076] Observation Update - Weights of each particle are updated based on the likelihood of pose observations.
[0077] Human Motion Update - Weights of each particle are updated based on correlation between shoulder orientation and velocity.
[0078] Resampling - Particles are resampled to represent weighted particle distributions as equivalent uniformly weighted particle distributions.
[0079] Track Initiation and Deletion - New tracks are created for unassociated observations and tracks are deleted based on covariance or missed observations.
[0080] Prediction of particle states is performed according to a motion model of constant linear velocity and zero angular velocity:

x(t+1) = x(t) + ẋ(t)·Δt,  y(t+1) = y(t) + ẏ(t)·Δt,  ẋ(t+1) = ẋ(t),  ẏ(t+1) = ẏ(t),  θ(t+1) = θ(t)

[0081] With process noise Q added to account for random changes in position, velocity and orientation.

[0082] Where q and σ² are the constant velocity process noise level and angular variance parameters respectively.
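A minimal sketch of this prediction step, assuming the particle state layout [x, y, ẋ, ẏ, θ] described above and illustrative noise parameters, follows:

```python
import numpy as np

def predict_particles(particles, dt, q=0.5, sigma_theta=0.3, rng=None):
    """Propagate particles [x, y, vx, vy, theta] one time step.

    Constant linear velocity and zero angular velocity, with Gaussian process
    noise added to velocity and orientation. q and sigma_theta are illustrative
    noise parameters, not the claimed values.
    """
    rng = rng or np.random.default_rng()
    out = particles.copy()
    out[:, 0] += particles[:, 2] * dt                          # x += vx * dt
    out[:, 1] += particles[:, 3] * dt                          # y += vy * dt
    out[:, 2] += rng.normal(0.0, q * np.sqrt(dt), len(out))    # velocity noise
    out[:, 3] += rng.normal(0.0, q * np.sqrt(dt), len(out))
    out[:, 4] += rng.normal(0.0, sigma_theta, len(out))        # orientation noise
    out[:, 4] = (out[:, 4] + np.pi) % (2 * np.pi) - np.pi      # wrap angle
    return out
```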
[0083] In crowded environments the close proximity of people to one another can cause tracks to erroneously change targets. This occurs when the separation between two or more targets is small compared to the observation error and is exacerbated in cases where the targets have similar, or near zero, velocities. In order to minimise this effect we use an approach based on Khan, Z., Balch, T., Dellaert, F.: Efficient particle filter-based tracking of multiple interacting targets using an MRF-based motion model; Proceedings of the International Conference on Intelligent Robots and Systems 1 (October), 254-259 (2003); DOI 10.1109/IROS.2003.1250637, which penalises track interaction in the propagation step of the particle filter by redrawing particles which are near other tracks. To implement this, the example embodiment includes first completing the prediction step without social constraints and updating the mean hypothesis of each filter. Based on the predicted states, processing is performed thereby to check for any invalid particles in any filter, i.e. those within a fixed radius of any other filter mean position. Invalid particles are redrawn until valid.
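An illustrative sketch of the redraw step follows; the interaction radius, retry limit and the predict_fn callback (which could, for example, wrap a prediction routine such as the one sketched above) are assumptions of this sketch rather than the embodiment's values.

```python
import numpy as np

def redraw_invalid_particles(prev_sets, pred_sets, predict_fn, radius=0.35, max_attempts=20):
    """Enforce the social constraint after the unconstrained prediction step.

    prev_sets / pred_sets are lists of (N x 5) particle arrays per track, before
    and after prediction. Particles landing within `radius` of any other track's
    mean position are re-drawn from the motion model until valid, or until the
    attempt limit is reached.
    """
    means = np.array([p[:, :2].mean(axis=0) for p in pred_sets])   # mean hypothesis per track
    for i, (prev, pred) in enumerate(zip(prev_sets, pred_sets)):
        others = np.delete(means, i, axis=0)
        if others.size == 0:
            continue
        for _ in range(max_attempts):
            d = np.linalg.norm(pred[:, None, :2] - others[None, :, :], axis=2)
            invalid = (d < radius).any(axis=1)                     # too close to another track
            if not invalid.any():
                break
            pred[invalid] = predict_fn(prev[invalid])              # redraw and re-check
    return pred_sets
```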
[0084] An output module 309 is programmed to, based on data provided by the tracking module, provide a stream of filtered poses which describe the trajectory of one or more predicted humans respectively associated with each of the human target clusters. The output module associates a persistent tracking ID to each predicted human, and enables graphical representation of each predicted human as a 3D avatar capable of displaying pose attributes on a display device 310 (for example as shown in FIG. 2 in image 204).
Experimental Results: Lab dataset with ground truth
[0085] In order to quantify the precision and accuracy of the example embodiment above, a dataset was captured consisting of 9 depth image sequences of people moving in different ways through the depth sensor field-of-view, with accompanying ground truth measured using a commercial motion capture system. The dataset was captured in a data arena, being a circular cinema room, with an Optitrack motion capture system comprising high frame rate cameras with infrared illumination. Each participant in the experiment had a rigid infra-red marker card attached to their back using a velcro strap, used to accurately track the position and rotation of their upper body. A brief description of the different depth sequences is provided below.
• Wandering 1/2/3/4 - Participants casually moving and stopping within the field-of-view of the depth camera (3/4/8/8 people, 120/123/47/98 seconds).
• Alighting 1/2/3 - Participants simulating situations where 4 train passengers wait to board a service while 2 passengers alight. (6 people, 24/21/16 seconds).
• Walkthrough - 4 participants stand still while 4 others repeatedly cross the field-of-view weaving between stationary participants. (8 people, 142 seconds).
• Passing - All participants repeatedly crossing the field-of-view weaving past one another (8 people, 131 seconds).
[0086] The table of FIG. 6 summarises the results of the pose extraction precision. To account for the arbitrary offset between infra-red markers and the centre-of-shoulder position extracted by the described approach, a single 3D offset has been applied to the
ground truth data in the local frame of each person’s marker prior to error calculation. The results therefore are not able to capture any positional bias in the extracted poses but do capture the consistency of the extracted poses, which is more important in this scenario. The starting orientation of the marker cards is also arbitrary and a similar offset has been applied to each card orientation prior to error computation. Orientation errors are also wrapped between ±π to better characterise errors in the face of ambiguity between the forwards and backwards direction. For clarity the percentage of extracted poses which correctly estimated the forwards direction (and hence did not require wrapping) are also given.
[0087] In order to further evaluate the tracking module, an approach was implemented using person detection and pose extraction from the described technology, the output of which was passed into both the described tracking module and a publicly available person tracker published in Linder, T., Breuers, S., Leibe, B., Arras, K.O.: On multi-modal people tracking from mobile platforms in very crowded and dynamic environments. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5512-5519. IEEE (2016). DOI 10.1109/ICRA.2016.7487766.
[0088] The results of this comparison are presented in the table of FIG. 7 in terms of the CLEAR-MOT metrics (see Bernardin, K., Stiefelhagen, R.: Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing 2008, 1-10 (2008). DOI 10.1155/2008/246309), a system of metrics devised to enable intuitive and fair benchmarking of multiple object tracking (MOT) systems. The metrics are multiple object tracking precision (MOTP), a measure of the average position error in real units, and multiple object tracking accuracy (MOTA), the percentage of tracking outputs which are not erroneous in any way. MOTA is calculated based on counts of false positives, false negatives and ID switches which we also report individually to provide greater insight into the causes of errors.
[0089] The described tracking module performed similarly well in the wandering sequences to the tracker from Linder et al. which is unsurprising as both trackers are provided with the same pose detections and use very similar motion models. Interestingly the tracker performed better in terms of MOTA for the Alighting, Walkthrough and Passing sequences, all of which involve movement of people through densely crowded areas in close proximity to others. It’s likely that this improvement can be explained by the addition of social constraints in track prediction which significantly narrow the spread of particle states in crowded scenarios by avoiding predictions in close proximity to others.
[0090] A further evaluation of the above described methodology was performed, this time in a scenario of a crowded train station. This evaluation was performed using depth image sequences captured at a busy inner-city train station (Town Hall Station in Sydney, Australia) using purpose built sensing platforms. Over the course of three days, depth images were recorded at 30 Hz in crowded train platform areas.
[0091] To provide a quantitative evaluation of the system’s performance, the number of active tracks in each frame of tracking data output by the system was compared against a manual count taken by a person directly from the depth data.
[0092] Example results obtained during a difficult 30 minute sequence representing a morning commuter rush hour are illustrated in FIGs. 9 and 10. Despite crowds of up to 26 people in the sensor's FOV the total number of people reported by the system is accurate to within 1 person most of the time, with some bias towards overestimation. This tendency to overestimate total numbers of people can be explained by the lingering of tracks after people leave the sensor's FOV. While the manual count will immediately decrement, our tracker maintains its hypothesis of the person's location until the position covariance reaches an upper threshold. The highest errors in the total person count occur in periods of sharp increase or decrease of the ground truth person count.
Conclusion and Interpretation
[0093] To address the challenges of estimating and tracking human pose in crowded environments, the present disclosure presents technology including a novel method for shoulder pose extraction and a tracking algorithm which exploits social constraints to improve track prediction.
[0094] Reference throughout this specification to “one embodiment”, “some embodiments” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment”, “in some embodiments” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.
[0095] As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so
described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
[0096] In the claims below and the description herein, any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter. For example, the scope of the expression ‘a device comprising A and B’ should not be limited to devices consisting only of elements A and B. Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.
[0097] As used herein, the term “exemplary” is used in the sense of providing examples, as opposed to indicating quality. That is, an “exemplary embodiment” is an embodiment provided as an example, as opposed to necessarily being an embodiment of exemplary quality.
[0098] For convenience of explanation, the phrases "stores information", "causes information to be stored", and other phrases equivalent thereto are used. If the embodiment is a computer program, these phrases are intended to express "causes a memory device to store information" or "controls a memory device to cause the memory device to store information." The modules may correspond to the functions in a one-to-one correspondence. In a software implementation, one module may form one program or multiple modules may form one program. One module may form multiple programs. Multiple modules may be executed by a single computer. A single module may be executed by multiple computers in a distributed environment or a parallel environment. One module may include another module. In the discussion that follows, the term "connection" refers to not only a physical connection but also a logical connection (such as an exchange of data, instructions, and data reference relationship).
[0099] The term "predetermined" means that something is decided in advance of a process of interest. The term "predetermined" is thus intended to refer to something that is decided in advance of a process of interest in the embodiment. Even after a process in the embodiment has started, the term "predetermined" refers to something that is decided in advance of a process of interest depending on a condition or a status of the embodiment at the present point of time or depending on a condition or status heretofore continuing down to the present point of time. If "predetermined values" are plural, the predetermined values
may be different from each other, or two or more of the predetermined values (including all the values) may be equal to each other. A statement that "if A, B is to be performed" is intended to mean "that it is determined whether something is A, and that if something is determined as A, an action B is to be carried out". The statement becomes meaningless if the determination as to whether something is A is not performed.
[00100] The term "system" refers to an arrangement where multiple computers, hardware configurations, and devices are interconnected via a communication network (including a one-to-one communication connection). The term "system", and the term "device", also refer to an arrangement that includes a single computer, a hardware configuration, and a device. The system does not include a social system that is a social "arrangement" formulated by humans.
[00101] At each process performed by a module, or at one of the processes performed by a module, information as a process target is read from a memory device, the information is then processed, and the process results are written onto the memory device. A description related to the reading of the information from the memory device prior to the process and the writing of the processed information onto the memory device subsequent to the process may be omitted as appropriate. The memory devices may include a hard disk, a random-access memory (RAM), an external storage medium, a memory device connected via a communication network, and a register within a CPU (Central Processing Unit).
[00102] Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "processing", "computing", "calculating", "determining", "analyzing" or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.
[00103] In a similar manner, the term "processor" may refer to any device or portion of a device that processes electronic data, e.g., from registers and/or memory to transform that electronic data into other electronic data that, e.g., may be stored in registers and/or memory. A “computer” or a “computing machine” or a "computing platform" may include one or more processors.
[00104] The methodologies described herein are, in one embodiment, performable by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein. Any processor capable of executing a set
of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including main RAM and/or a static RAM, and/or ROM. A bus subsystem may be included for communicating between the components. The processing system further may be a distributed processing system with processors coupled by a network. If the processing system requires a display, such a display may be included, e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT) display. If manual data entry is required, the processing system also includes an input device such as one or more of an alphanumeric input unit such as a keyboard, a pointing control device such as a mouse, and so forth. The term memory unit as used herein, if clear from the context and unless explicitly stated otherwise, also encompasses a storage system such as a disk drive unit. The processing system in some configurations may include a sound output device, and a network interface device. The memory subsystem thus includes a computer-readable carrier medium that carries computer-readable code (e.g., software) including a set of instructions to cause performing, when executed by one or more processors, one or more of the methods described herein. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated. The software may reside in the hard disk, or may also reside, completely or at least partially, within the RAM and/or within the processor during execution thereof by the computer system. Thus, the memory and the processor also constitute a computer-readable carrier medium carrying computer-readable code.
[00105] Furthermore, a computer-readable carrier medium may form, or be included in, a computer program product.
[00106] In alternative embodiments, the one or more processors operate as a standalone device or may be connected, e.g., networked, to other processor(s). In a networked deployment, the one or more processors may operate in the capacity of a server or a user machine in a server-user network environment, or as a peer machine in a peer-to-peer or distributed network environment. The one or more processors may form a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
[00107] Note that while the diagrams only show a single processor and a single memory that carries the computer-readable code, those in the art will understand that many of the components described above are included, but not explicitly shown or described, in order not to obscure the inventive aspect. For example, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[00108] Thus, one embodiment of each of the methods described herein is in the form of a computer-readable carrier medium carrying a set of instructions, e.g., a computer program that is for execution on one or more processors, e.g., one or more processors that are part of a web server arrangement. Thus, as will be appreciated by those skilled in the art, embodiments of the present invention may be embodied as a method, an apparatus such as a special purpose apparatus, an apparatus such as a data processing system, or a computer-readable carrier medium, e.g., a computer program product. The computer-readable carrier medium carries computer-readable code including a set of instructions that when executed on one or more processors cause the processor or processors to implement a method. Accordingly, aspects of the present invention may take the form of a method, an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium.
[00109] The software may further be transmitted or received over a network via a network interface device. While the carrier medium is shown in an exemplary embodiment to be a single medium, the term "carrier medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "carrier medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by one or more of the processors and that cause the one or more processors to perform any one or more of the methodologies of the present invention. A carrier medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical disks, magnetic disks, and magneto-optical disks. Volatile media includes dynamic memory, such as main memory. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise a bus subsystem. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared
data communications. For example, the term "carrier medium" shall accordingly be taken to include, but not be limited to, solid-state memories, a computer product embodied in optical and magnetic media; a medium bearing a propagated signal detectable by at least one processor of one or more processors and representing a set of instructions that, when executed, implement a method; and a transmission medium in a network bearing a propagated signal detectable by at least one processor of the one or more processors and representing the set of instructions.
[00110] It will be understood that the steps of methods discussed are performed in one embodiment by an appropriate processor (or processors) of a processing (i.e., computer) system executing instructions (computer-readable code) stored in storage. It will also be understood that the invention is not limited to any particular implementation or programming technique and that the invention may be implemented using any appropriate techniques for implementing the functionality described herein. The invention is not limited to any particular programming language or operating system.
[00111] It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, FIG., or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.
[00112] Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
[00113] Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Furthermore, an element described herein
of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
[00114] In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
[00115] Similarly, it is to be noticed that the term coupled, when used in the claims, should not be interpreted as being limited to direct connections only. The terms "coupled" and "connected," along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Thus, the scope of the expression a device A coupled to a device B should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means. "Coupled" may mean that two or more elements are either in direct physical or electrical contact, or that two or more elements are not in direct contact with each other but yet still co-operate or interact with each other.
[00116] Thus, while there has been described what are believed to be the preferred embodiments of the invention, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, any formulas given above are merely representative of procedures that may be used. Functionality may be added to or deleted from the block diagrams and operations may be interchanged among functional blocks. Steps may be added to or deleted from methods described within the scope of the present invention.
Claims (32)
1. A computer system programmed to enable prediction of human activity, the system including:
a sensor device programmed to generate a stream of depth images for a target region;
an input module programmed to receive the stream of depth images for the target region, wherein each image in the stream of depth images is representative of a frame of pointcloud data;
a scene processing module that is programmed to: (i) receive data from the input module; and (ii) process that data thereby to establish an upright orientation for the target region, and transform at least a subset of the frames of pointcloud data into a predefined orientation based on the upright orientation;
a pointcloud segmentation module that is programmed to (i) receive, from the scene processing module, data representative of a plurality of frames of pointcloud data in predefined orientation relative to the upright orientation; and (ii) for at least a subset of those frames, process the data thereby to identify one or more clusters, wherein each cluster represents a potential human target; and
an ellipsoid-based body part extraction module that is programmed to: (i) receive from the pointcloud segmentation module, data representative of the clusters; and (ii) for each cluster, perform an ellipsoid fitting process thereby to, for a given cluster:
(a) in the case that the ellipsoid fitting process is successful, identify a human target cluster; or
(b) in the case that the ellipsoid fitting process is unsuccessful, reject the cluster; and
a pose extraction module programmed to, for each cluster in respect of which a human target is identified, determine pose attributes for the human target based on characteristics of an ellipsoid fitted via the ellipsoid fitting process.
2. A system according to claim 1 including a tracking module, wherein the tracking module is programmed to, for each human target cluster identified by the ellipsoid-based body part extraction module, perform a tracking process using data provided by the pose extraction module.
3. A system according to claim 2 wherein the tracking module assigns a respective probabilistic estimator with social constraints to each human target cluster.
4. A system, according to claim 3 wherein the probabilistic estimator is a particle filter.
5. A system according to claim 4 wherein the tracking module is programmed to, for each human target cluster, perform a process including: (i) performing a prediction sub-process thereby to predict a next state of particles based on current state, motion model and social constraints; and (ii) performing a data association sub-process thereby to associate pose attributes for a given frame to one of the particle filters.
6. A system according to claim 5 including performing the data association sub-process thereby to associate pose attributes for a given frame to one of the particle filters based on a greedy nearest neighbours approach.
7. A system according to claim 5 wherein the prediction sub-process includes applying social constraints by: (i) performing a first prediction for each filter thereby to update a mean hypothesis of each filter; (ii) identifying invalid particles in any given filter, including any particles within a fixed radius of any other filter mean position; and (iii) re-drawing invalid particles until valid.
8. A system according to any preceding claim including an output module, wherein the output module is programmed to, based on data provided by the tracking module, provide a stream of filtered poses which describe the trajectory of one or more predicted humans respectively associated with each of the human target clusters.
9. A system according to claim 8 wherein the output module associates a persistent tracking ID to each predicted human.
10. A system according to claim 8 wherein the output module enables graphical representation of each predicted human as a 3D avatar capable of displaying pose attributes.
11. A system according to any preceding claim wherein the scene processing module is programmed to perform a background subtraction process.
12. A system according to claim 11 wherein the background subtraction process includes processing at least a subset of depth image frames to incrementally update a background model describing the depth values consistent with the static environment, and processing at least a subset of depth image frames to exclude
depth values consistent with the background model, wherein only the remaining foreground pixels are provided for further processing.
13. A system according to claim 12 wherein the background depth image is used to establish the ground plane.
14. A system according to any preceding claim wherein the scene processing module is programmed to provide to the pointcloud segmentation module, for the subset of frames, an upright pointcloud of foreground, aligned with the ground plane.
15. A system according to claim 14 wherein the scene processing module is programmed to, for a given frame, sort points in the pointcloud based on height relative to the ground plane, and perform an iterative top-down method thereby to sort the points into clusters based on a separation distance threshold.
16. A system according to claim 15 wherein the iterative top-down method includes iterating through each point p_i in the pointcloud and performing a decision to either: (i) add that point to a nearest cluster; or (ii) start a new cluster.
17. A system according to claim 16 wherein the decision is based on comparing a horizontal distance d_i,j between each point p_i and each cluster C_j to a fixed separation distance threshold d_0, and if a point is closer than the separation distance to any cluster mean, adding that point to the nearest such cluster, with the mean of the cluster being updated incrementally.
18. A system according to claim 17 including performing a cluster joining process including checking a distance between cluster means, and joining those clusters with a distance less than d_0 in a greedy fashion.
19. A system according to claim 18 including removing clusters having either: (i) fewer than a threshold number of points; or (ii) less than a threshold representative surface area.
20. A system according to any preceding claim wherein the ellipsoid-based body part extraction module is programmed to: (i) receive a given cluster outputted by the pointcloud segmentation module; and (ii) sequentially seek to apply a head ellipsoid representative of a head, and a shoulder ellipsoid representative of shoulders.
21. A system according to claim 20 wherein the head ellipsoid is fit to a first vertical window of points having a predefined fixed height, extending downward from the top of the pointcloud.
22. A system according to claim 21 wherein the shoulder ellipsoid is fit to a second vertical window of points defined relative to the first vertical window.
23. A system according to claim 20 wherein the pose extraction module is programmed to determine the pose attributes based on characteristics of the shoulder ellipsoid.
24. A system according to claim 23 wherein horizontal pose attributes are determined from a centre of the shoulder ellipsoid, and a vertical pose attribute is calculated from an intersection between a vertical line passing through the centre of the shoulder ellipsoid and an upper surface of the shoulder ellipsoid.
25. A system according to claim 23 or 24 wherein an orientation of the shoulders is determined by projecting a major axis of the shoulder ellipsoid into the horizontal plane and taking the angle of the resulting line, with this angle being rotated 90 degrees to obtain a facing direction of the human, and with a horizontal location of the head ellipsoid centre relative to the shoulder ellipsoid major axis being used to determine the forwards facing direction.
26. A system according to any one of the preceding claims wherein the pointcloud segmentation module is further programmed to perform downsampling on at least a subset of the plurality of frames of pointcloud data.
27. A system according to claim 26 wherein the downsampling includes voxelizing each frame of the subset of frames of pointcloud data.
28. A method for predicting human activity, the method including:
receiving, from a sensor device, a stream of depth images for a target region, wherein each image in the stream of depth images is representative of a frame of pointcloud data;
operating a scene processing module that is programmed to: (i) receive data from the input module; and (ii) process that data thereby to establish an upright orientation for the target region, and transform at least a subset of the frames of pointcloud data into a predefined orientation based on the upright orientation; operating a pointcloud segmentation module that is programmed to (i) receive, from the scene processing module, data representative of a plurality of frames of pointcloud data in predefined orientation relative to the upright orientation; and (ii) for at least a subset of those frames, process the data thereby to identify one or more clusters, wherein each cluster represents a potential human target; and
operating an ellipsoid-based body part extraction module that is programmed to: (i) receive from the pointcloud segmentation module, data representative of the clusters; and (ii) for each cluster, perform an ellipsoid fitting process thereby to, for a given cluster:
(a) in the case that the ellipsoid fitting process is successful, identify a human target cluster; or
(b) in the case that the ellipsoid fitting process is unsuccessful, reject the cluster; and
operating a pose extraction module programmed to, for each cluster in respect of which a human target is identified, determine pose attributes for the human target based on characteristics of an ellipsoid fitted via the ellipsoid fitting process.
29. A method according to claim 28 wherein the pointcloud segmentation module is further programmed to perform downsampling on at least a subset of the plurality of frames of pointcloud data.
30. A method according to claim 29 wherein the downsampling includes voxelizing each frame of the subset of frames of pointcloud data.
31. A computer program product for performing a method according to any one of claims 28 to 30.
32. A non-transitory carrier medium for carrying computer executable code that, when executed on a processor, causes the processor to perform a method according to any one of claims 28 to 30.
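Illustrative implementation sketches

By way of illustration only, the sketches below show one possible way of realising selected steps recited in the claims above. They are not the claimed implementation; all function names, class names and parameter values are assumptions introduced for clarity. The first sketch, in Python with NumPy, corresponds loosely to the background subtraction of claims 11 to 13: a per-pixel background depth model is incrementally updated from depth values consistent with the static environment, and only foreground pixels are passed on. The update rate alpha and tolerance tol are assumed values.

```python
import numpy as np

class DepthBackgroundModel:
    """Incrementally updated per-pixel background model for depth frames (sketch)."""

    def __init__(self, shape, alpha=0.02, tol=0.08):
        self.bg = np.full(shape, np.nan)  # background depth estimate (metres)
        self.alpha = alpha                # incremental update rate (assumed)
        self.tol = tol                    # tolerance for "static" pixels (assumed)

    def apply(self, depth):
        """Update the model with one depth frame and return a foreground mask."""
        first = np.isnan(self.bg)
        self.bg[first] = depth[first]     # seed the model from the first observation
        # Pixels whose depth is consistent with the stored background are static.
        static = np.abs(depth - self.bg) < self.tol
        self.bg[static] = ((1 - self.alpha) * self.bg[static]
                           + self.alpha * depth[static])
        # Everything else with a valid (non-zero) depth reading is foreground.
        return ~static & (depth > 0)
```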
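The downsampling of claims 26, 27, 29 and 30 can be read as voxelization of each pointcloud frame. A minimal sketch, assuming a fixed voxel edge length, replaces all points falling in the same voxel by their centroid; the function name and the default voxel size are assumptions.

```python
import numpy as np

def voxel_downsample(points, voxel=0.05):
    """Replace all points falling in the same voxel by their centroid.

    points : (N, 3) array; voxel : assumed voxel edge length in metres.
    """
    keys = np.floor(points / voxel).astype(np.int64)
    # Group points by voxel index; 'inverse' maps each point to its voxel group.
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse).astype(float)
    out = np.empty((counts.size, 3))
    for dim in range(3):
        out[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return out
```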
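Claims 15 to 19 describe an iterative top-down clustering of the upright pointcloud. The following sketch sorts points by height above the ground plane, assigns each point to the nearest cluster whose mean lies within a fixed horizontal separation distance d_0 (otherwise starting a new cluster), greedily joins clusters whose means are closer than d_0, and removes clusters that are too small. The threshold values d0 and min_points are assumed, and the surface-area check of claim 19 is omitted for brevity.

```python
import numpy as np

def top_down_cluster(points, d0=0.35, min_points=50):
    """Greedy top-down clustering of an upright pointcloud (sketch).

    points : (N, 3) array, z measured relative to the ground plane.
    d0     : fixed horizontal separation distance threshold (assumed, metres).
    Returns a list of (M, 3) arrays, one per retained cluster.
    """
    order = np.argsort(points[:, 2])[::-1]          # highest points first
    means, clusters = [], []
    for p in points[order]:
        if means:
            # Horizontal (x, y) distance from this point to every cluster mean.
            d = np.linalg.norm(np.asarray(means) - p[:2], axis=1)
            j = int(np.argmin(d))
            if d[j] < d0:
                clusters[j].append(p)
                # Incrementally update the cluster mean.
                means[j] = means[j] + (p[:2] - means[j]) / len(clusters[j])
                continue
        means.append(p[:2].copy())                  # start a new cluster
        clusters.append([p])

    # Greedy joining of clusters whose means are closer than d0.
    i = 0
    while i < len(means):
        j = i + 1
        while j < len(means):
            if np.linalg.norm(means[i] - means[j]) < d0:
                clusters[i].extend(clusters[j])
                means[i] = np.mean([q[:2] for q in clusters[i]], axis=0)
                del clusters[j], means[j]
            else:
                j += 1
        i += 1

    # Remove clusters with too few points to plausibly be a person.
    return [np.array(c) for c in clusters if len(c) >= min_points]
```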
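Claims 20 to 25 recite sequential head and shoulder ellipsoid fitting and the derivation of pose attributes from the shoulder ellipsoid. The sketch below approximates ellipsoid fitting by a PCA of the windowed points, and takes the facing direction as the horizontal shoulder major axis rotated by 90 degrees, disambiguated by the side on which the head centre lies. The window heights, minimum point counts and the approximation of the vertical pose by the top of the shoulder window (rather than an exact line-ellipsoid intersection) are all assumptions.

```python
import numpy as np

HEAD_WINDOW = 0.25      # assumed head window height (m)
SHOULDER_WINDOW = 0.25  # assumed shoulder window height (m)

def fit_ellipsoid(pts, min_pts=20):
    """Crude ellipsoid fit: centre and principal axes/radii via PCA (sketch)."""
    if len(pts) < min_pts:
        return None
    centre = pts.mean(axis=0)
    eigvals, axes = np.linalg.eigh(np.cov((pts - centre).T))
    return centre, axes, np.sqrt(np.maximum(eigvals, 0.0))

def extract_pose(cluster):
    """Fit head then shoulder ellipsoids to a cluster; return pose or None."""
    top = cluster[:, 2].max()
    head_pts = cluster[cluster[:, 2] > top - HEAD_WINDOW]
    band = ((cluster[:, 2] <= top - HEAD_WINDOW)
            & (cluster[:, 2] > top - HEAD_WINDOW - SHOULDER_WINDOW))
    shoulder_pts = cluster[band]

    head = fit_ellipsoid(head_pts)
    shoulders = fit_ellipsoid(shoulder_pts)
    if head is None or shoulders is None:
        return None                                  # fitting failed: reject cluster

    s_centre, s_axes, s_radii = shoulders
    major = s_axes[:, int(np.argmax(s_radii))]       # shoulder major axis
    axis_angle = np.arctan2(major[1], major[0])      # projected into the ground plane
    facing = axis_angle + np.pi / 2                  # facing is 90 degrees from the axis
    # Disambiguate forwards using which side of the shoulder axis the head centre lies on.
    h_centre = head[0]
    side = (major[0] * (h_centre[1] - s_centre[1])
            - major[1] * (h_centre[0] - s_centre[0]))
    if side < 0:
        facing += np.pi
    return {
        "x": float(s_centre[0]),                     # horizontal pose attributes
        "y": float(s_centre[1]),
        "z": float(shoulder_pts[:, 2].max()),        # approx. upper shoulder surface
        "facing": float(np.mod(facing, 2 * np.pi)),
    }
```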
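Claims 3 to 7 describe tracking each human target cluster with a particle filter subject to social constraints, with greedy nearest-neighbour data association. A minimal sketch follows, assuming a constant-velocity motion model, a fixed social-exclusion radius, a bootstrap-style resampling update and a simple gating distance; none of these specifics are asserted to be the claimed estimator.

```python
import numpy as np

class SocialParticleFilter:
    """Per-target particle filter with a simple social-exclusion constraint (sketch)."""

    def __init__(self, pose_xy, n_particles=200, process_noise=0.05):
        # State per particle: [x, y, vx, vy]; start at the observed pose.
        self.particles = np.zeros((n_particles, 4))
        self.particles[:, :2] = pose_xy
        self.noise = process_noise

    def mean(self):
        return self.particles[:, :2].mean(axis=0)

    def predict(self, others, dt=1 / 30.0, radius=0.4, max_redraws=10):
        """Constant-velocity prediction, then re-draw socially invalid particles.

        others : mean positions of the other filters; radius : assumed exclusion radius (m).
        """
        self.particles[:, :2] += self.particles[:, 2:] * dt
        self.particles += np.random.normal(0, self.noise, self.particles.shape)
        for _ in range(max_redraws):
            bad = np.zeros(len(self.particles), dtype=bool)
            for o in others:
                bad |= np.linalg.norm(self.particles[:, :2] - o, axis=1) < radius
            if not bad.any():
                break
            # Re-draw invalid particles around this filter's own mean position.
            self.particles[bad, :2] = self.mean() + np.random.normal(
                0, 2 * self.noise, (int(bad.sum()), 2))

    def update(self, observed_xy, meas_noise=0.1):
        """Importance-resample particles towards the associated observation."""
        d = np.linalg.norm(self.particles[:, :2] - observed_xy, axis=1)
        w = np.exp(-0.5 * (d / meas_noise) ** 2) + 1e-12
        idx = np.random.choice(len(self.particles), len(self.particles), p=w / w.sum())
        self.particles = self.particles[idx]


def associate(filters, observations, gate=0.75):
    """Greedy nearest-neighbour association of pose observations to filters."""
    pairs, used = [], set()
    for i, f in enumerate(filters):
        if not len(observations):
            break
        d = [np.linalg.norm(f.mean() - o) for o in observations]
        j = int(np.argmin(d))
        if d[j] < gate and j not in used:
            pairs.append((i, j))
            used.add(j)
    return pairs
```

In use, one filter instance would be created per human target cluster; each frame, associate pairs the pose observations from the pose extraction step with existing filters, predict is called with the mean positions of all other filters as the social constraint, and update resamples towards the associated observation.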
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2017904919 | 2017-12-06 | ||
AU2017904919A AU2017904919A0 (en) | 2017-12-06 | Monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments | |
PCT/AU2018/051303 WO2019109142A1 (en) | 2017-12-06 | 2018-12-05 | Monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments |
Publications (2)
Publication Number | Publication Date |
---|---|
AU2018379393A1 AU2018379393A1 (en) | 2020-07-02 |
AU2018379393B2 true AU2018379393B2 (en) | 2024-08-29 |
Family
ID=66750030
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2018379393A Active AU2018379393B2 (en) | 2017-12-06 | 2018-12-05 | Monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2018379393B2 (en) |
WO (1) | WO2019109142A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112396630B (en) * | 2019-08-15 | 2024-05-31 | 纳恩博(北京)科技有限公司 | Method and device for determining target object state, storage medium and electronic device |
CN112630749B (en) * | 2019-09-24 | 2023-06-09 | 北京百度网讯科技有限公司 | Method and device for outputting prompt information |
CN113933824A (en) * | 2020-07-14 | 2022-01-14 | 北京熵行科技有限公司 | Method for acquiring human body perception model based on radio frequency technology and utilization method thereof |
CN114612813A (en) * | 2020-12-09 | 2022-06-10 | 中兴通讯股份有限公司 | Identity recognition method, model training method, device, equipment and storage medium |
CN112749638A (en) * | 2020-12-28 | 2021-05-04 | 深兰人工智能(深圳)有限公司 | Error screening method for visual recognition track and visual recognition method for sales counter |
CN113379922A (en) * | 2021-06-22 | 2021-09-10 | 北醒(北京)光子科技有限公司 | Foreground extraction method, device, storage medium and equipment |
CN113570725A (en) * | 2021-08-05 | 2021-10-29 | 中德(珠海)人工智能研究院有限公司 | Three-dimensional surface reconstruction method and device based on clustering, server and storage medium |
CN115222871B (en) * | 2021-08-31 | 2023-04-18 | 达闼科技(北京)有限公司 | Model evaluation method, model evaluation device, storage medium and electronic equipment |
CN114723776B (en) * | 2022-04-01 | 2024-04-19 | 深圳市九天睿芯科技有限公司 | Target tracking method and device |
CN117130772A (en) * | 2023-04-10 | 2023-11-28 | 荣耀终端有限公司 | Resource scheduling method, electronic equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6771818B1 (en) * | 2000-04-04 | 2004-08-03 | Microsoft Corporation | System and process for identifying and locating people or objects in a scene by selectively clustering three-dimensional regions |
US7257237B1 (en) * | 2003-03-07 | 2007-08-14 | Sandia Corporation | Real time markerless motion tracking using linked kinematic chains |
US20140267611A1 (en) * | 2013-03-14 | 2014-09-18 | Microsoft Corporation | Runtime engine for analyzing user motion in 3d images |
Also Published As
Publication number | Publication date |
---|---|
WO2019109142A1 (en) | 2019-06-13 |
AU2018379393A1 (en) | 2020-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2018379393B2 (en) | Monitoring systems, and computer implemented methods for processing data in monitoring systems, programmed to enable identification and tracking of human targets in crowded environments | |
CN114902294B (en) | Fine-grained visual recognition in mobile augmented reality | |
US10002309B2 (en) | Real-time object analysis with occlusion handling | |
CN109035304B (en) | Target tracking method, medium, computing device and apparatus | |
US7940957B2 (en) | Object tracker for visually tracking object motion | |
US7457436B2 (en) | Real-time crowd density estimation from video | |
RU2607774C2 (en) | Control method in image capture system, control apparatus and computer-readable storage medium | |
CN110472599B (en) | Object quantity determination method and device, storage medium and electronic equipment | |
US20160078287A1 (en) | Method and system of temporal segmentation for gesture analysis | |
MX2007016406A (en) | Target detection and tracking from overhead video streams. | |
Niño-Castañeda et al. | Scalable semi-automatic annotation for multi-camera person tracking | |
JP2017168029A (en) | Device, program, and method for predicting position of examination object by action value | |
CN113420682A (en) | Target detection method and device in vehicle-road cooperation and road side equipment | |
US20240096094A1 (en) | Multi-view visual data damage detection | |
KR20170006356A (en) | Method for customer analysis based on two-dimension video and apparatus for the same | |
Pal et al. | Activity-level construction progress monitoring through semantic segmentation of 3D-informed orthographic images | |
Yang et al. | Continuously tracking and see-through occlusion based on a new hybrid synthetic aperture imaging model | |
CN114169425B (en) | Training target tracking model and target tracking method and device | |
CN114842466A (en) | Object detection method, computer program product and electronic device | |
Liu et al. | Automatic 3D tracking system for large swarm of moving objects | |
Gruenwedel et al. | Low-complexity scalable distributed multicamera tracking of humans | |
CN116912877A (en) | Method and system for monitoring space-time contact behavior sequence of urban public space crowd | |
JP4836065B2 (en) | Edge tracking method and computer program therefor | |
Farenzena et al. | Towards a subject-centered analysis for automated video surveillance | |
Hu et al. | Dynamic task decomposition for decentralized object tracking in complex scenes |