1. Introduction
Person-following is an important robot service that can be employed in various scenarios, e.g., an autonomous wheelchair that accompanies people [1] or an item carrier in shopping malls [2]. Generally, a following robot operates in a dynamic and complicated environment, where it faces several environmental challenges. For example, the target person may be fully or partly occluded by other people or obstacles; the illumination may suddenly brighten or darken under changing light sources; the target's appearance may change during the task (e.g., the person changes clothes, puts on a hat, or removes a bag); and the target's pose may change (e.g., squatting down or picking something up from the floor). Therefore, improving the precision and robustness of person localization is a key issue for the person-following function.
The major issue in person-following is how to localize the target person by tracking. Such methods monitor the position of the target person over time and identify them based on motion continuity. One issue concerns sensor selection. On mobile robot platforms, range data, especially from a laser scanner, are often used for person detection and tracking because of their accuracy and wide field of view [3,4,5,6]. These methods detect a person from the shape of the human body in the range scan and track them through the continuous movement of their position. However, signal loss occurs when the person is occluded by others for several seconds, and the robot's person-localization system may then track the person who occluded the target, because human shapes are similar in the range-data representation. Accurate identification of the target person is therefore essential. In contrast to range data, image-based methods have an advantage in identifying the target person, because image data contain the visual appearance of the real world. Owing to the discriminative features that can be extracted from human appearance, these methods can identify the correct target even under full or partial occlusion. Similar to range-based methods, image-based methods detect the target person in the scene using feature points [7,8], skeletons [9,10], gestures [11], stereo information [12], or a combination of the target's height, path, and clothing color [13], and identify them by their continuous movement. However, there are two major drawbacks: limited adaptation of the model to the environment, and high computing cost. Detection with higher accuracy generally incurs a heavier computing cost because of the greater complexity of the detection model, e.g., methods based on deep learning [14]. Even though deep learning approaches achieve excellent accuracy and run in real-time with high-performance GPU support, their heavy computing cost prevents them from running on a mobile platform that offers only a CPU. Unlike traditional methods, deep learning learns features from the data itself, and thus overcomes the limited representational strength of human-designed features. However, another issue arises: the training data never cover all poses and appearances of humans, so the ROI (region of interest) of the target may flicker sharply or be lost, causing a loss of the target's localization. Such a situation causes uncontrollable movement of the person-following robot and leads to failure of the following task.
In summary, ROI flicker or loss is inevitable, no matter whether the method is based on range data or image data, because of the complicated backgrounds and human appearances in real-world environments. Simply extending the training dataset can improve detection accuracy to some extent, but unknown environments will always exist, so ROI flicker or loss still occurs. To address this issue, we propose adding an online feature-learning procedure that adapts our model to the environment during the following task. Our approach focuses on constantly localizing one target person and regards all other objects (human or background) as background. Visual tracking technology (for a review, see [15]) can play the role of an online detector that localizes the target person. One trend is the tracking–learning–detecting strategy [16,17], which couples a tracker with a detector: the tracker follows the movement of the object, the detector verifies the tracker's result between two frames, and the learner updates their models. Note that the major difference between tracking–learning–detecting and detection-by-tracking is whether the model localizes the target object with an online- or offline-trained model. One drawback of visual tracking is that the tracker only feeds online features into the learning procedure, so the person-localization model lacks any restriction on the object's class, as shown in Figure 1; i.e., the online learner treats the feature-rich bag as the target and concentrates the tracking on the bag, which produces an outlier. Apart from this, tracking–learning–detecting also faces the challenges of scale, rotation, appearance change, etc. Our approach follows the tracking–learning–detecting strategy, employing the center-cluster approach [18] as the tracker, the PN classifier [19] as the detector, and an updating procedure as the learner. The method in [18] is a part-based tracker that splits the appearance of the object into different parts and follows each part through the tracking procedure. The edges of a human silhouette are salient in the image, which lets the part-based tracker perform well. However, the tracker is based on feature points, which mainly represent extrema in the image rather than the texture on the object. A problem occurs when two persons overlap: the tracker may lose the correct target because of a similar distribution of feature points. To handle this case, the PN classifier is selected to verify the tracker. The PN classifier is trained online with positive samples (the target person's appearance) and negative samples (the background), and verifies the tracker according to the learned texture features of the target person and the background. The PN classifier selectively searches the region around the tracker's output and evaluates the most confident region of the target.
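As a concrete illustration of this division of labor, the following Python sketch outlines a tracking–learning–detecting loop. `Tracker`, `PNClassifier`, the window-sampling helpers, and the 0.5 confidence threshold are hypothetical stand-ins for the components described above, not our exact implementation.

```python
# Minimal sketch of a tracking-learning-detecting loop.
# `tracker`, `pn_classifier`, `sample_windows_around`, and
# `sample_windows_far_from` are hypothetical stand-ins.

def track_learn_detect(frames, init_roi, tracker, pn_classifier):
    """Follow one target: the tracker proposes an ROI each frame,
    the PN classifier (detector) verifies it, and the learner
    updates both models with the verified result."""
    roi = init_roi
    tracker.init(frames[0], roi)
    pn_classifier.init(frames[0], roi)       # positive: target, negative: background

    for frame in frames[1:]:
        proposal = tracker.update(frame)     # motion-continuity estimate

        # Detector step: search candidate windows near the proposal and
        # keep the one the PN classifier is most confident about.
        candidates = sample_windows_around(proposal)
        best = max(candidates, key=lambda c: pn_classifier.confidence(frame, c))

        if pn_classifier.confidence(frame, best) > 0.5:   # illustrative threshold
            roi = best
            # Learner step: the verified ROI is a positive sample;
            # non-overlapping windows far from it are negatives.
            pn_classifier.learn(frame, positive=roi,
                                negatives=sample_windows_far_from(roi))
            tracker.reinit(frame, roi)
        # else: keep the previous ROI and wait for re-detection
        yield roi
```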
One major difference is that we employ a CNN detector to restrict the model to the human's appearance and to serve as the initialization step. The neural network is trained on our own curated dataset. The CNN detector is first employed to detect human positions in the first frame and provide an initial position of the target to the person-localization model. After each step of the tracking procedure, the CNN detector is used to verify how confidently the output corresponds to a human, in order to avoid the issue shown in Figure 1. Several existing detectors are based on deep neural networks. Starting from R-CNN [20], the two-stage approaches (which first find potential object positions and then classify the high-confidence candidates), Fast and Faster R-CNN [21,22], obtain excellent detection quality, but real-time processing remains a challenge even with GPU acceleration. Another trend is the one-stage approach, such as YOLO [23], which casts detection as a regression problem and achieves very fast detection; however, the accuracy of YOLO is lower than that of other state-of-the-art methods, and its training procedure is highly sensitive to the parameter settings. SSD [24] combines the advantages of Faster R-CNN and YOLO, yielding both fast and accurate detection. For deployment on a mobile platform, we adopt SSD with several optimizations that improve processing speed with limited loss of detection quality. To reduce the time spent in the convolutional layers, our approach uses MobileNet [25] as the backbone network, employing depthwise convolutional layers instead of traditional convolutional layers. Moreover, the network is re-designed with a tiny input resolution (128 × 128) to avoid the computing cost of large-resolution networks. Because of the continuity between two frames, the tracker's output will not be a large outlier; therefore, we feed a slightly extended region around the tracker's output into the neural network, allowing it to rapidly localize the person.
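To illustrate the extended-region idea, the sketch below crops a margin around the tracker's previous ROI, resizes it to the 128 × 128 network input, and maps detections back to image coordinates. The `detector` object and its `detect` method are placeholders for the MobileNet-SSD model, and the margin value is an assumption chosen for illustration.

```python
import cv2

def localize_in_extended_region(image, prev_roi, detector,
                                margin=0.3, input_size=128):
    """Run the tiny 128x128 detector only on a slightly extended region
    around the tracker's previous ROI (x, y, w, h). `detector.detect`
    is a hypothetical call returning (x, y, w, h) boxes in crop pixels."""
    img_h, img_w = image.shape[:2]
    x, y, w, h = prev_roi

    # Extend the ROI by `margin` of its size on every side, clipped to the image.
    x0 = max(0, int(x - margin * w))
    y0 = max(0, int(y - margin * h))
    x1 = min(img_w, int(x + (1.0 + margin) * w))
    y1 = min(img_h, int(y + (1.0 + margin) * h))

    crop = image[y0:y1, x0:x1]
    resized = cv2.resize(crop, (input_size, input_size))
    boxes = detector.detect(resized)          # placeholder detector call

    # Map detections from the 128x128 crop back to full-image coordinates.
    sx, sy = (x1 - x0) / input_size, (y1 - y0) / input_size
    return [(x0 + bx * sx, y0 + by * sy, bw * sx, bh * sy)
            for (bx, by, bw, bh) in boxes]
```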
In this paper, to demonstrate the efficiency of our person-localization approach, we conduct various experiments covering the common challenges in person-following tasks, in both simulated and real-world operation. First, our approach is evaluated both with and without the classification-lock strategy under occlusion, re-detection after the appearance of similar people, scale change, rotation change, and changes in the target's appearance, on four recorded image sequences. The results show that our approach overcomes these challenges: the classification-lock strategy improves the ROI output and prevents the failures that occur when using detection or visual tracking alone. A real-world following task is then carried out. To drive the robot, a simple PID controller that keeps a fixed distance between the user and the robot is employed. The results show that our robot accomplishes the human-following mission despite environmental influences, over a long following time, and that the following service can operate across floors. To further demonstrate its efficiency, our approach is compared with four state-of-the-art tracking methods on the human-subject sequences of the Vojir tracking dataset [26]. The results show that our approach obtains a state-of-the-art recall rate and confirm the success of our classification-lock strategy.
Prior to this paper, we presented an original version of this work in [27]. The initial version combined the tracker, the PN classifier, and a YOLO-based deep neural network to accomplish the classification-lock function, which successfully allowed a following robot built on an Intel NUC platform to follow the user on a single floor. However, there were three major limitations: (1) the system could only reach 13 FPS on our platform; (2) the YOLO model required a dramatically long training time; and (3) the experiment was conducted in a single-floor environment, whereas modern buildings are mostly multi-floor.
Compared with the former work in [27], the major technical improvements of the current work can be summarized in three points: (1) Since the model in the conference paper used the Sigmoid activation function in the YOLO structure, convergence of the deep neural network required very many iterations; in the new network, MobileNet is used as the feature extractor within an SSD structure, which allows the network to converge in only 60,000 iterations. (2) The final layer of the original network was fully connected, which was computationally very time-consuming and achieved only 13 FPS on the mobile platform; in this paper, we use depthwise convolutional layers in the SSD instead of the YOLO structure, which enables the model to run at 30 FPS on the mobile platform while retaining the same person-localization quality. (3) The following robot presented in the conference paper only followed the target in a single-floor environment; a new dataset for the robot operating in an elevator is introduced, and an experiment in which the following robot moves through multiple floors is performed, both of which demonstrate the strength of our robot in multi-floor buildings.
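The depthwise separable convolution underlying points (1) and (2) can be written in a few lines. The PyTorch block below is a generic illustration of the layer type; the channel counts and stride are examples, not our exact architecture.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style building block: a per-channel (depthwise) 3x3
    convolution followed by a 1x1 (pointwise) convolution, which costs
    far fewer multiply-adds than a standard 3x3 convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # Depthwise: groups=in_ch gives one 3x3 filter per input channel.
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # Pointwise: a 1x1 convolution mixes the channels.
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```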
3. Experiments
To evaluate the robustness and performance of our tracking approach, we designed three major scenarios in both static and dynamic scenes, with and without the classification-lock strategy.
Figure 4 shows the entire framework of our person-following system. Our approach was first compared with other state-of-the-art methods on three human-target sequences from the Vojir dataset, to demonstrate the advantage of our tracking method. It was then tested on static scenes with feature-rich distractors, scale change, appearance change, illumination change, and occlusion, to demonstrate the improvement brought by the classification-lock strategy. Afterwards, one dynamic scene was used to simulate the environmental challenges of a real following task, in order to evaluate the robustness of our approach while the robot was moving. Finally, our approach was employed to accomplish a following journey in a multi-floor environment.
One issue concerns the narrow space inside an elevator; the solution was to let the user turn around so that their front faced the robot, and to let the robot turn back when the elevator arrived at the intended floor. The mobile robot platform was built on a TurtleBot, with a simple PID strategy controlling the robot to keep a fixed distance of 1.5 m from the user. The real-robot experiment was only conducted indoors, because the structured-light sensor used to obtain distance information is disturbed by sunlight; outdoors, the robot could not maintain its distance from the target human, which would cause the following experiments to fail. In our experiments, a laptop equipped with an Intel i7-7700HQ CPU and 16 GB of DDR4 memory operated the following-robot system. To acquire image and distance information, a PrimeSense sensor was selected; it contains a 640 × 480 camera with a wide horizontal field of view and a structured-light unit providing a precise depth map, which helped our approach constantly localize the user in the scene and obtain the distance between the user and the robot. During operation, our approach achieved about 30 FPS on average on this mobile platform.
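As a rough sketch of this distance-keeping behavior, the following PID controller outputs a forward velocity from the measured distance to the target; the gains and update rate are illustrative placeholders, not the values tuned on our robot.

```python
class DistancePID:
    """Simple PID controller that outputs a forward velocity command
    to hold a fixed following distance (1.5 m in our experiments).
    The gains kp, ki, kd are illustrative, not our tuned values."""
    def __init__(self, kp=0.8, ki=0.0, kd=0.1, target=1.5):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target = target
        self.integral = 0.0
        self.prev_error = None

    def step(self, measured_distance, dt):
        error = measured_distance - self.target   # > 0: too far, speed up
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Usage each frame, e.g. at ~30 FPS:
#   v = pid.step(depth_at_roi_center, 1.0 / 30.0)
```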
3.1. Training Data
To train the CNN detector, an in-house dataset of 1255 images was collected with two classes: upper body and whole body. Shenlu Jiang, Ling Li, and Cheng Su served as models for part of the dataset; 60% of the images were collected from other colleagues, with their consent for use in the dataset, and the remaining 40% were collected from the Internet under a legal license. The upper-body model was employed for the following task in the real-world experiments: because our robot keeps a fixed distance of 1.5 m from the target human, the person appears at too large a scale for the camera to capture their entire body at this distance, so we used the upper-body detector to localize them. The whole-body model was used to evaluate our approach on the static-view sequences and the Vojir dataset.
Note that the whole-body and upper-body datasets were trained and tested individually, not mixed together. As shown in Figure 5, our dataset includes various rotations, poses, and scales of humans, which supports the robustness of the detector in localizing a person. We then set aside 400 images as test images to evaluate the performance of the neural network (in the subsequent experiments, all data were used to train the model). Mean average precision (mAP) was used for the evaluation.
As shown in Figure 6, the SSD with a 512 × 512 input resolution obtained higher accuracies than the other two in the whole- and upper-body tests (89.1% and 88.5%). We tested SSD 300 and SSD 512 on our platform and measured their processing times: SSD 300 took approximately 130 ms per image, and SSD 512 approximately 300 ms per image. Compared with SSD 128 (about 60 ms), the models at the two larger resolutions obtained higher accuracies, but their computing costs prevented our robot system from running in real-time on a mobile platform. Because the neural network only detects the user's position in a slightly extended region after the tracker and PN classifier, SSD 128 (70.1% and 68.8%, reported in Figure 6 as 70.1% and 69.8%) was sufficient for the person-localization task. Another task for SSD 128 was to provide the potential positions of the user in the scene during the initialization step. SSD 128 achieved only a 69.8% mAP, failing mainly in crowded environments, because the image resolution of the test dataset was much higher than 128 × 128. Thus, we designed a simple initialization setting: a single person stands alone in the scene and uses a mobile phone to connect remotely to the robot and view the scene from the robot's onboard PC. The user interacts with the robot by simply tapping the region of the target person, which selects the initial region of the user in the robot's view.
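This click-based initialization can be sketched as follows; the detection tuple format and the helper name are assumptions made for illustration, not our actual app interface.

```python
def select_target(click_xy, detections):
    """Pick the detected person whose box contains the user's tap.
    `detections` is assumed to be a list of (x, y, w, h, score)
    candidate boxes produced by SSD 128 during initialization."""
    cx, cy = click_xy
    hits = [d for d in detections
            if d[0] <= cx <= d[0] + d[2] and d[1] <= cy <= d[1] + d[3]]
    if not hits:
        return None                       # no box under the tap: ask again
    return max(hits, key=lambda d: d[4])  # highest-confidence containing box
```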
3.2. Benchmark Comparison
First of all, our tracking approach was compared with other state-of-the-art approaches on a common dataset, to demonstrate the quality of our tracking on human targets. Three sequences, gym, surfer, and person, were selected from the Vojir tracking dataset (available online via the link in [26]), as shown in Figure 7. The targets of the three sequences are humans, and the sequences include scale, appearance, and rotation changes.
The recall rate was used to evaluate tracking performance; its formula is given in Equation (7), where TP is the number of true positives and FN the number of false negatives. In the evaluation, we regarded outputs whose overlap with the ground-truth ROI was ≥ 0.8 as true positives, and outputs with an overlap below 0.8 as false negatives. Compared with using precision alone to evaluate the models, the recall rate properly accounts for localization failures and yields a more statistically meaningful result. The recall rate of each item is reported in Table 1. The state-of-the-art tracking methods STR [17], CT [32], STC [33], TLD [16], and CMT [18] were compared with our approach.
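Interpreting the overlap criterion as intersection-over-union, the evaluation reduces to the sketch below, where recall = TP / (TP + FN) over the frames with ground truth; boxes are (x, y, w, h) tuples.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def recall(outputs, ground_truth, thresh=0.8):
    """Recall = TP / (TP + FN): the fraction of ground-truth frames in
    which the tracker's output overlaps the ground truth by >= thresh.
    A frame with no output (None) counts as a false negative."""
    tp = sum(1 for out, gt in zip(outputs, ground_truth)
             if out is not None and iou(out, gt) >= thresh)
    return tp / len(ground_truth)
```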
According to the results shown in Table 1, STR received the two best scores in the gym and person sequences, because those two scenes contain a great number of appearance changes, which demonstrates the success of tracking-by-detecting with learner updating. However, a recall rate of 0.46 in the surfer sequence shows the risk STR runs under a large scale change. TLD, which follows a strategy similar to STR's, received moderate scores in all items, since its simple modeling method limited its ability to update the model and track the object.
The whole-appearance tracker CT received the same score as STR in the gym sequence, but it failed in the surfer and person sequences. This result shows that a whole-appearance tracker handles appearance changes well but cannot overcome scale changes. STC also received a poor score, owing to the same scale-change issue as CT. The part-based tracker CMT received good scores for all three items; however, its static geometric structure limited CMT when significant changes occurred in the surfer and person sequences. Our approach without the classification-lock achieved the second-best recall rate on average, which shows that our tracking strategy can overcome the common challenges of visual tracking. Finally, the classification-lock strategy further improved the precision of our approach, allowing it to achieve the best average score. However, our approach received a lower score in the surfer sequence than without the classification-lock. This was due to a part-based influence: we included partially visible examples in our dataset so that the classifier can recognize a person's appearance even when part of the body is covered, whereas the surfer benchmark labels the coordinates of the person's upper body after partial occlusion occurs; this is the main reason our score was lower than without the classification-lock. Apart from this, our approach ran at 32.7, 28.9, and 30.1 FPS in the person, gym, and surfer sequences, respectively (the number of feature points on the person's appearance differs between sequences), which is sufficient for processing on the following robot.
These results demonstrate the efficiency and quality of our tracking approach, which achieved significantly higher accuracy than other state-of-the-art methods. Compared with the original version, the frame rate of our model tripled, from 13 FPS to roughly 30 FPS on average on a mobile platform, while keeping the same level of accuracy in the visual tracking benchmark comparison.
We then further evaluated the efficiency of the classification-lock strategy on a person-following task. Because the static sequences and the dynamic scene were evaluated as real-robot simulations, the input ROI was detected by SSD 128, unlike in the visual tracking benchmark, where it is set by a human. The ROI detected by the SSD is not perfectly stable, whereas the visual tracking methods need a stable ROI as input to the tracker in the first frame; thus, the state-of-the-art trackers were not compared with our approach in the dynamic and static scenes. Since several kinds of visual tracking challenges may happen at the same time, we had the user act out various poses presenting those challenges in the three image sequences; the results are shown in Figure 8 and Table 2. Our approach was also evaluated without the classification-lock strategy for comparison.
3.3. Static Scene
In the first sequence, a child performed a series of movements with a feature-rich school bag in his hand, to introduce the influence of a feature-rich object. He then took several poses at different scales in the scene, frequently accompanied by walking. Finally, an occlusion by his classmate occurred, simulating an overlap with the target person. This sequence thus included feature-rich influence, scale changes, appearance changes, and occlusion, for the purpose of evaluating how our tracking strategy copes with those challenges. According to the results, our approach with the classification-lock achieved a 0.96 recall rate and localized the correct child with a suitable ROI during the entire sequence. The approach without the classification-lock, however, failed under the feature-rich influence, because the feature points were mainly distributed on the bag, which prevented the tracker from maintaining localization of the child. The classification-lock strategy contributed here by pulling the ROI back whenever the tracker started to localize the wrong target. This demonstrates the accuracy of our approach under the above challenges and its ability to localize the user against scale and appearance changes, even when they occur together.
In the second sequence, the developer performed a series of rotations at different scales, which produced sudden scale, rotation, and appearance changes. Both of our approaches, with and without the classification-lock, maintained constant tracking of the target. However, the implementation with the classification-lock produced a more stable ROI throughout the tracking procedure. The major reason was the instability of the feature points during the scale and appearance changes: during the target's movement, the distribution of the feature points shifted greatly, which could cause a temporarily unstable ROI for the target person. In such cases, the classification-lock strategy helped the tracker localize the person better, with a recall rate 0.08 higher. This sequence shows that our approach can overcome rotation changes at different scales, overlaps, and feature-rich challenges.
In the third sequence, the developer walked into an elevator, rotated inside it so that his front faced the follower, and walked out when the elevator door opened. This sequence contained both overexposed and dark situations, simulating the robot moving between floors. Both of our approaches, with and without the classification-lock strategy, kept constant tracking intensity on the target human, but the one with the classification-lock produced a more precise ROI at the moments when the darkness caused the number of feature points to fluctuate and decrease.
The recall rates of our approach were again higher than without the classification-lock strategy. The three sequences illustrate that our classification-lock strategy further improves the ROI output of the tracker and prevents the accumulation of tracker error. Furthermore, the classification-lock strategy helps the tracker avoid failure when the feature points are unstable. In addition, our approach efficiently overcame scale and appearance changes with or without the lock (except for the feature-rich object influence) in these static scenes.
3.4. Dynamic Scene
In the dynamic scene, a designed task was employed to simulate a real case that could occur while following a person. This image sequence was recorded on a rainy day inside our office, when the illumination was poor. In this sequence, our office served as the background, with complicated lighting and numerous background features. We staged the series of challenges described for the dynamic view, and the image sequence in Figure 9 shows this journey; the background environment was complicated enough to evaluate the robustness of our approach.
Our robot initialized the user's model at a starting position and then began to follow the user through the office environment. During this journey, our person-localization approach withstood the influence of a complex background and the illumination changes between rooms. The robot handled the appearance change when the user decided to wear a hat. Occlusions occurred three times during the following task, and our approach rapidly re-detected the user correctly. We also had the user speed up and pause periodically during the experiment, which produced large scale changes in the scene, and our robot successfully continued to follow the user even when his scale changed greatly.
Our approach was then operated as a real-world following task in a multi-floor environment. One factor was human–robot interaction: our system employed a simple app on a cell phone to communicate with the robot remotely. During initialization, the user could watch the app to check the detector's result in real-time and select his own position to begin following; during operation, the app allowed the user to stop and restart the following function. As shown in Figure 10, our robot first began its journey in our room. The user then turned around in the lobby and entered an elevator. Inside the elevator, the user turned so that his front faced the robot. Once the elevator arrived at the correct floor, the user walked straight ahead, making the robot turn back and move out of the elevator. Afterwards, the user roamed the fifth floor and finally moved to another elevator, behaving the same way as in the first elevator to return to the seventh floor. Finally, the user returned to the starting place and finished the journey.
The results show that, on a single floor, our approach constantly localized the correct target despite appearance changes, scale changes, and illumination challenges, even when some scenes were extremely dark or overexposed. The classification-lock strategy collaborated with the tracker, further improving the tracking quality, and made the ROI output more reasonable based on previously learned human appearance, which benefitted the tracker when it lost feature points in the scene. The successful journey across multiple floors illustrates that our approach is robust enough to serve as a follower in an office environment, even with a simple PID controller.