1. Introduction
Many breakthroughs have been achieved in service robotics in various fields due to technological advances in computer vision and robotics. A mobile service robot is usually a human–machine–environment system in which human participation and coordination are critical to improving system performance. Service robots have broad application prospects, especially in the home environment [1]. The service robot’s perception of the home environment is typically based on the recognition and detection of indoor scenes, objects, and various tools. Reference [2] introduced a visual attention mechanism to detect salient objects in the environment, improving the efficiency, intelligence, and robustness of 3D modeling of the robot environment. In previous research [3], the functional parts of objects were manually annotated and scene categories were identified using public datasets so that the robot could quickly locate the key areas of objects.
An indoor functional area is an area with a specific function, such as a living room, study, bedroom, dining room, or bathroom. Various tools with functional and operational attributes are used in these spaces. Robots require both the perception and the classification of the indoor environment for analysis, decision making, and human–computer interaction [4].
There is a lack of research on knowledge representation and management models for tools and functional areas. Current research has mainly focused on the semantics of tool names and spatial relationships, an approach that suffers from a large number of labels and difficulties in labeling some tools. Functional semantic features are critical in human cognition. Functionality refers to the potential uses of a tool or place and is a better semantic description than the relationship between tool names and spatial semantics. The latest research by Li Feifei et al. [5] shows that integrating functionality into environmental cognition substantially improves the cognitive ability of service robots.
It is expected that the knowledge representation of tools and functional areas through machine learning, especially deep learning, will improve the autonomous knowledge acquisition of service robots. Owing to their natural interaction and humanized services, service robots can already recognize basic semantic information such as tool names and categories. As an important part of semantic knowledge, utility connects the real world and the task world and has attracted the attention of experts and scholars. Service robots typically use passive cognition to obtain the functional semantics of tools from semantic tags. Researchers are also exploring active cognitive methods based on inference learning [6,7].
Intelligent control is an important development direction for people-centered home service robots. The development of artificial neural networks has greatly advanced intelligent control theory, making robots more humanized and better able to think from a human perspective. Developing human-centered, natural, and efficient interaction will be the main goal of the new generation of human–computer interaction technology.
The main contributions of this study are as follows:
- (1)
We use the theory of ontology and knowledge graphs to establish a knowledge graph system suitable for the home environment.
- (2)
An analysis of the characteristics and determinants of robot service is conducted to establish a decision-making template for service tasks by correlating the three types of knowledge in the home environment domain ontology. Knowledge acquisition is based on visual, physical, category, functional, and other attributes.
- (3)
A framework for the rapid classification of scenes and tools for service robots is proposed, since robots require timely information about indoor scenes and must learn from relatively small datasets.
The rest of the manuscript is organized as follows.
Section 2 presents the related work.
Section 3 describes the application of domain knowledge based on the knowledge graph to home service robots.
Section 4 presents the classification of home domain knowledge.
Section 5 provides the experiments with models and frameworks.
Section 6 provides the conclusion.
2. Related Work
Semantic cognition and the effective representation of environmental information are at the core of intelligent services. Representation methods include predicate logic, production rules, and ontologies. Ontologies have attracted attention because of their structured knowledge expression and reasoning ability. Park et al. [8] proposed the design of a knowledge network system based on domain ontology. Hao et al. [9] used an ontology model to recombine original datasets into new datasets with better logic to meet the needs of upper-layer applications. Peng et al. [10] divided knowledge into three layers (basic knowledge, deep knowledge, and application knowledge) and created an agent-based hierarchical knowledge graph construction framework and methodology. Addressing the overall formal representation and construction framework of knowledge graphs, Buchgeher et al. [11] clarified the specific representation modes of the schema layer and the data layer from a logical perspective. Yang et al. [12] proposed a semi-automatic tagging framework to represent Internet of Things (IoT) metadata, in which a probabilistic graph model maps IoT resources to a domain-independent knowledge network to build a knowledge graph that enables joint reasoning over entities, classes, and relationships.
The KnowRob knowledge system [13] was proposed to provide home service robots with richer knowledge of objects and thus improve the intelligence of robot task execution; object knowledge is shared between the system’s modules. The team of Tian et al. [14,15,16] at Shandong University conducted in-depth research on ontology models for multi-domain knowledge sharing and reuse on robotic platforms. Rafferty et al. [17] established a knowledge-based model to determine user intent and perform reasoning in a home environment. Xiangru Zhu et al. [18] improved the ability of machines to understand the real world by constructing a multi-modal knowledge graph. Lin et al. [19] discussed the development of knowledge representation according to the differences among entities, relationships, and properties.
Zhou et al. [20] modeled the co-occurrence of “target pairs” with a Bayesian method to better represent and identify scenes. Ricardo Pereira et al. [21] proposed a two-branch CNN architecture based on 1D and 2D convolution layers that enhances indoor scene classification with the semantic feature of inter-object distance. Zaiwei Zhang et al. [22] proposed a deep generative scene-modeling technique for indoor environments in which the generative model can be trained with a feedforward neural network; the network maps a prior distribution (e.g., a normal distribution) to the distribution of the main objects in an indoor scene. The research of Shengyu Huang et al. [23] also showed that successful scene recognition is not only a matter of identifying single objects unique to certain scene types but depends on several different cues, including rough 3D geometry, color, and the (implicit) distribution of object categories. Chen et al. [24] preprocessed multi-source data at the instance level, mapped them into the content of a knowledge graph, and further used SWRL to effectively extend OWL semantics and realize rule representation. López-Cifuentes et al. [25] proposed a new scene recognition method based on an end-to-end multi-modal CNN, which significantly reduces the number of network parameters by combining image and context information through an attention module. B. Miao et al. [26] proposed an object scene model in which an object feature aggregation module and an object attention module learn object features and object relationships, respectively.
Numerous deep learning image classification networks have been proposed in recent years. Brock et al. [27] proposed NFNets, which improve training efficiency and accuracy by eliminating expensive batch normalization. Srinivas et al. [28] focused on improving training speed and accuracy by adding attention layers to ConvNets. In 2014, the University of Oxford and DeepMind proposed VGGNet, a deep convolutional network with a depth of 16–19 layers. ResNet [29] introduced the residual block, which makes it efficient to increase the depth of deep networks; its depth reached an unprecedented level, and it provided excellent classification performance. These deep networks with many parameters have powerful data representation capabilities. The parameter layers of a neural network can extract shallow features, such as texture, as well as the functional semantic features of objects.
Although deep structures have powerful data representation capabilities, they have high computational complexity. Thus, it is difficult to perform real-time analysis on mobile terminals without efficient GPU resources. Some recent work has aimed to improve training or inference speed rather than parameter efficiency. For example, RegNet [30], ResNeSt [31], and EfficientNet-X [4] focus on GPU and TPU inference speeds. In 2016, lightweight models achieved a significant breakthrough: by using depthwise separable convolution, they require far fewer parameters with only a negligible decrease in accuracy. Examples include SqueezeNet [32], MobileNet [33], and XceptionNet [34]. Significant progress has been made in reducing network size, and MobileNet, ShuffleNet, and XceptionNet are excellent applications of depthwise separable convolution. The combination of complementary search techniques and the NetAdapt method has further improved inference speed [35]. In addition, transfer learning has been applied to image classification with small sample sizes. Wenmei Li et al. [36] proposed a scene classification method for HSRRS images using transfer learning and a deep convolutional neural network (TL-DeCNN) model, which shows clear classification advantages without overfitting. Amsa Shabbir et al. [37] likewise fine-tuned ResNet50 via transfer learning for satellite and scene image classification, classifying images more effectively. Transfer learning is thus also suitable for addressing the shortage of indoor scene recognition data.
3. Using a Knowledge Graph for a Home Service Robot
3.1. Home Service Robot Service System
In this paper, the home service robot system takes the family environment as its specific research scene. By recognizing the common objects and scenes in the semantically labeled text, images, and videos of a sample database, the system learns semantic features, such as the appearance, names, functions, and spatial layout of various objects and scenes, in a supervised manner, thereby acquiring knowledge of home scenes.
As shown in Figure 1, the service system of the home service robot consists of the following parts:
- (1)
Knowledge extraction from the service robot’s data
First, starting from a large number of household object samples (text, RGB-D images, and videos) with functional semantic tags, the RGB-D appearance features and the semantic features of various objects, including their functions and usage methods, are learned; an object knowledge representation combining appearance and semantic features is constructed; and knowledge extraction is carried out by a learning algorithm.
Then, starting from home location scene samples containing functional semantic tags and combining semantic features such as object layout and spatial relationships, scene location knowledge is represented, and knowledge extraction is carried out by the learning algorithm. Finally, starting from samples of family members with semantic tags, the system learns the various people and their behavioral characteristics, represents this knowledge, and extracts it using the learning algorithm.
- (2)
Home service robot domain knowledge graph system design
The home environment domain knowledge graph representation form, which is more in line with the natural interaction requirements of service robots, is first designed. Then, an effective link between object knowledge, environment knowledge, human knowledge, and the knowledge graph is designed to build a domain knowledge graph system for service robots.
- (3)
Research on the construction and application of the knowledge graph for the home service robot system
Based on the service robot’s knowledge acquisition of daily objects, location scenes, and people, as well as the design of the domain knowledge graph system, a household service domain knowledge graph is constructed within a household-service Internet of Things robot system. This involves position and pose observations from the IoT sensor nodes and the service robots.
In addition, application research based on the domain knowledge graph is carried out. For newly acquired information from the IoT sensor nodes, the knowledge graph can accurately represent the relationships between the various types of information and help the service robot reason about and plan service tasks.
In this paper, the service system of the home service robot obtains semantic labels and other information about the scenes and tools in an image through image recognition, connects the semantic information via relationships using knowledge representation and ontology modeling, and finally manages and utilizes the information and relationships through the knowledge graph, with specific development and application in the field of family service. The system combines the visual domain with the knowledge graph to ensure the effective use of the visual information obtained by the service robot.
3.2. Construction of Concept Layer of the Home Service Robot Knowledge Graph
Domain ontology refers to the concepts, entities, and interrelationships in a specific domain and describes the rules and characteristics of knowledge in this domain. A domain ontology of the home environment is constructed by classifying the domain data of the home environment. This approach enables the description of people, objects, and robot service information at the semantic level to enable the sharing of domain knowledge for different pieces of equipment and different types of information.
Two basic methods are typically used for ontology construction. The top-down approach is sufficient to meet the application requirements in fields with relatively mature and complete knowledge systems. The broadest concepts in the field are defined first and are subsequently refined. However, in some fields, the data are not systematic, and the knowledge is incomplete. Thus, a bottom-up and data-driven approach is required. The most specific concepts are defined first and summarized next. The top-down and bottom-up approaches are combined in real-life applications of developing complex domain ontology. A concept layer is first established using the top-down approach. Then, the classes and attributes are supplemented by the bottom-up approach using various data. The following two steps are used in this study to construct the domain ontology model of the home environment:
Determine the classes in the home environment and their hierarchical relationship to establish an empty ontology model with a class structure. This involves analyzing the home environment to categorize relevant entities such as objects and locations. This initial step lays the foundation for the ontology by creating an organized framework of classes that represent the primary components of the home environment.
Determine the object attributes that describe the relationships between classes and the data attributes that describe the class (or instance). The domain and range for each property are established to ensure clear relationships between different classes and their instances.
These two steps are used to construct the concept layer of the knowledge graph, which has the following three parts:
- 1.
Define the classes. The environment class represents the environment domain and includes the object class and the location class as subclasses. The object class contains all objects in the home environment, including furnishings, household items, and household appliances; the location class contains location information. The user class represents user characteristics and includes the people class and the behavior class as subclasses. The people class contains information on family members, and the behavior class describes their behavior. The robot service class represents robot services and includes the operation and service subclasses. The operation class contains information on the robot’s operations on various objects and equipment, and the service class contains details of the robot’s specific service tasks.
- 2.
Define the relevant object attributes, using classes as their domains and ranges, to establish connections between classes.
“isLocation” describes the location of objects in the home environment. It uses the location class as its range to connect the object class and the people class to the location class. “toDo” determines the actions involved in the robot service operation; its domain is the service class, and its range is the operation class. “actOn” determines the objects involved in the robot service operation; its domain is the operation class, and its range is the object class. “hasBehavior” describes the behavior of the family; its domain is the people class, and its range is the behavior class.
- 3.
Define the data attributes used to describe objects in the home environment, i.e., object knowledge representation.
The domain ontology of the home environment is shown in Figure 2.
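To make the concept layer concrete, the following minimal sketch shows how the classes and object properties above could be declared with the owlready2 Python library; the ontology IRI and the exact identifier spellings (e.g., HomeObject for the object class) are illustrative assumptions rather than the paper’s actual implementation.

```python
from owlready2 import Thing, ObjectProperty, get_ontology

# Illustrative IRI; the paper does not specify one.
onto = get_ontology("http://example.org/home_service.owl")

with onto:
    # Environment domain with object and location subclasses.
    class Environment(Thing): pass
    class HomeObject(Environment): pass   # furnishings, household items, appliances
    class Location(Environment): pass     # rooms / functional areas

    # User domain with people and behavior subclasses.
    class User(Thing): pass
    class People(User): pass
    class Behavior(User): pass

    # Robot service domain with operation and service subclasses.
    class RobotService(Thing): pass
    class Operation(RobotService): pass
    class Service(RobotService): pass

    # Object properties connecting the classes (domain -> range).
    class isLocation(ObjectProperty):
        domain = [HomeObject, People]
        range = [Location]

    class toDo(ObjectProperty):
        domain = [Service]
        range = [Operation]

    class actOn(ObjectProperty):
        domain = [Operation]
        range = [HomeObject]

    class hasBehavior(ObjectProperty):
        domain = [People]
        range = [Behavior]

onto.save(file="home_service.owl")
```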
Based on the knowledge graph system of the home service robot, the Semantic Web Rule Language (SWRL) is used to establish a rule base for service inference, and the Jess inference engine is used to parse the ontology knowledge and service rules. Rule inference matches facts against rules in the inference engine to enable autonomous reasoning and decision making when performing services.
To ensure that the knowledge graph remains comprehensive and up-to-date, we draw upon rich data from esteemed public knowledge databases, such as DBpedia and ConceptNet. These sources provide foundational knowledge about household objects, their interrelationships, and domain-specific concepts, enriching the depth and breadth of the knowledge graph. We adopt a strategy of periodic data integration, continuously incorporating updates from these external repositories to maintain the relevance and accuracy of the knowledge base. This dynamic approach allows the knowledge graph to evolve gracefully, reflecting the latest insights and information in the field.
3.3. Home Service Reasoning Based on SWRL Rules
The robot uses the knowledge graph system to obtain and analyze the data in the ontology knowledge base to perform semantic cognition of the home environment. According to the user’s identity, behavior, habits, and the environmental information, the robot can determine which services should be provided using the SWRL rule base.
The subclasses, instances, and attributes of the environment and user domains in the home environment domain ontology are combined as the antecedents of SWRL rules, while the subclasses and instances of the service domain serve as the rules’ conclusions. A decision-making template for service tasks is established by correlating the three categories in the home environment domain ontology.
As shown in Figure 3, the template indicates that the service task that the robot should provide is determined according to the environmental, user identity, and behavior information in the home environment. The establishment of the SWRL rule base should consider the conditions of the various services and determine the service modes that may occur in the home environment and the execution logic of the service tasks. An appropriate system and scale will substantially enhance the robot’s ability to provide services.
In addition, the construction and description of the SWRL-based service inference rules can be divided into three steps:
Determine the types, locations, and operations of the service and define them;
Comprehensively analyze various factors affecting the service and clarify the preconditions by defining their classes and attributes;
Determine the type and operation mode of the service.
Take “water delivery” as a service example. First, determine that the service object is the user Alan and that the operation object includes the water cup, and define both. Then, analyze the factors affecting water delivery, which may include “User Alan is sitting in the bedroom, and it is time to drink water”. The service operation mode corresponding to the “water delivery” service is “send”.
To further enhance the water delivery service, we consider additional conditions, such as the environmental temperature. For example, if the temperature inside the home exceeds a certain threshold (e.g., 30 degrees Celsius), the service robot should proactively deliver water to help prevent dehydration.
Therefore, the SWRL rules for the water delivery service are established as follows:
User(Alan) ^ StateUser(sit) ^ Location(bedroom) ^ Time(t1) ^ hasUserState(Alan, sit) ^ hasLocation(Alan, bedroom) ^ hasTime(sit, t1) → operateOn(send, cup).
User(Alan) ^ Location(home) ^ Temperature(temp) ^ greaterThan(temp, 30) ^ isLocatedIn(Alan, home) ^ hasTemperature(home, temp) → operateOn(send, cup).
These rules ensure that the service robot can intelligently decide when to deliver water based on both user behavior and environmental conditions. The combination of basic and extended rules allows the robot to dynamically adapt its actions to various situations, enhancing the efficiency and responsiveness of the water delivery service.
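For illustration, the two water-delivery rules can be written as executable SWRL rules with owlready2. This is a hedged sketch in which all classes, properties, and individuals (Alan, sit, bedroom, t1, send, cup, home, and so on) are assumed to be declared in the ontology `onto` beforehand, and the Pellet reasoner stands in for the Jess engine used in this paper.

```python
from owlready2 import Imp, sync_reasoner_pellet

with onto:
    # Basic rule: Alan sitting in the bedroom at drinking time -> send the cup.
    rule_basic = Imp()
    rule_basic.set_as_rule(
        "User(?u), hasUserState(?u, sit), hasLocation(?u, bedroom), "
        "hasTime(sit, t1) -> operateOn(send, cup)"
    )

    # Extended rule: indoor temperature above 30 degrees -> send the cup.
    rule_heat = Imp()
    rule_heat.set_as_rule(
        "User(?u), isLocatedIn(?u, home), hasTemperature(home, ?temp), "
        "greaterThan(?temp, 30) -> operateOn(send, cup)"
    )

# Run a DL reasoner so the rule consequents are materialized as facts.
sync_reasoner_pellet(infer_property_values=True, infer_data_property_values=True)
```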
The service inference rule base serves as an essential foundation for the robot’s decision making in service tasks, and it must account for abnormal factors that may occur in real household scenarios. Therefore, the robot must respond appropriately to highly unusual situations involving individuals or environmental conditions, such as a user falling, fainting, requiring rescue, or encountering abnormal environmental conditions like extreme temperatures or harmful gas concentrations.
For instance, in the case of a user fainting, the robot should provide an emergency call service. The SWRL rule for establishing this emergency call service is as follows:
StateUser(sleep) ^ Location(?x) ^ swrlb:booleanNot(?x, bed) → operateOn(call, family).
The service reasoning template based on SWRL rules effectively infers the service tasks that the robot should provide by analyzing diverse daily information and specific abnormal information in the context of household scenarios. This inference is achieved by incorporating detailed environmental information, user data, and user behavior patterns.
3.4. Object Knowledge Reasoning Based on SWRL Rules
It is difficult for the system to identify the color, shape, and material properties of every object instance without making mistakes. Thus, some properties are missing, and reasoning is an effective method for obtaining this missing knowledge. This section describes the dependencies between object attributes. SWRL rules are used to establish an acquisition mechanism for the missing attributes of objects and to perform reasoning using object knowledge. This strategy enhances the logic of the ontology model and enables the robot to infer the missing attributes from the existing attributes of an object instance.
As shown in Figure 4, based on the various attributes of object instances acquired in the initial stage of the system, the mechanism includes the following rules:
The first rule type is the inference of the functional attributes of objects and covers two cases. The first case obtains functional attributes from the category, physical, and visual attributes, with rules such as Category(?x, Cup) → hasAffordance(?x, Containable), which indicates that if the instance ?x is a “Cup”, then it has the functional attribute “Containable”. The second case uses co-occurrence to infer missing functional attributes from known ones, with rules such as hasAffordance(?x, Openable) → hasAffordance(?x, Closable), indicating that if the instance ?x has the functional attribute “Openable”, then it also has the functional attribute “Closable”.
The second rule type is the inference of the operational attributes of objects, which are obtained from the functional attributes. It contains rules such as hasAffordance(?x, Graspable) → hasOperationalAttribute(?x, Grasp), indicating that if the instance ?x has the functional attribute “Graspable”, then it also has the operational attribute “Grasp”.
In addition, as shown in Figure 5, a mechanism for acquiring the missing category, physical, and visual attributes of an object is established. The reasoning relationships between the attributes are formally represented to achieve mutual reasoning among the three attributes as well as within individual attributes. It includes rules such as Category(?x, Ball) → hasShape(?x, Sphere), which indicates that if the instance ?x is a “Ball”, then it has a visual attribute, i.e., its shape is a “Sphere”, and hasShape(?x, cubic) ^ hasWidthcm(?x, 20) → hasHeightcm(?x, 20) ^ hasLengthcm(?x, 20), which means that if the shape of instance ?x is a cube and its width is 20 cm, then its length and height are also 20 cm.
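A hedged owlready2 sketch of these attribute-completion rules is shown below; the class, property, and individual names mirror the rules in the text and are assumed to be declared in the ontology `onto`, and the cube rule is generalized from the 20 cm example to an arbitrary width ?w.

```python
from owlready2 import Imp

with onto:
    rules = [
        # Functional attribute inferred from the category attribute.
        "Category(?x, Cup) -> hasAffordance(?x, Containable)",
        # Co-occurrence between functional attributes.
        "hasAffordance(?x, Openable) -> hasAffordance(?x, Closable)",
        # Operational attribute inferred from a functional attribute.
        "hasAffordance(?x, Graspable) -> hasOperationalAttribute(?x, Grasp)",
        # Visual attribute inferred from the category attribute.
        "Category(?x, Ball) -> hasShape(?x, Sphere)",
        # Cube: the width determines the height and length (in cm).
        "hasShape(?x, cubic), hasWidthcm(?x, ?w) "
        "-> hasHeightcm(?x, ?w), hasLengthcm(?x, ?w)",
    ]
    for body in rules:
        Imp().set_as_rule(body)
```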
4. Classification of Domain Knowledge
4.1. Model Selection and Usage Strategies
The service robot platform is constrained by computing power, which limits the model’s parameter size and inference time. Directly training very deep networks would therefore lead to long training periods in which parameters cannot be updated promptly. Additionally, a large model is difficult to deploy on the service robot platform and delays timely predictions, while the robot requires a high level of responsiveness for real-time services. MobileNetV3 is specifically designed to optimize speed and efficiency: its depthwise separable convolutions significantly reduce the number of parameters and the model size, and its architecture is optimized for deployment on mobile and edge devices, enabling the robot to perform fast local inference with minimal latency and energy consumption. Therefore, in this study, MobileNetV3 is used as the convolutional module of the model to ensure the efficient extraction of high-dimensional features from image data.
As shown in Figure 6, the network structure of MobileNetV3 is divided into three parts. MobileNetV3 comes in two versions (large and small) that are suitable for different scenarios. The overall structure of the two versions is the same; they differ in the number of basic bottleneck units (bneck) and in internal parameters such as the number of channels. Table 1 lists the parameters of MobileNetV3 (large) used in this work.
Since indoor environments are where people live, few indoor datasets are available due to privacy concerns, which makes it difficult to train deep learning models that require many parameters; in this study, we therefore also used the CIFAR-100 and ImageNet datasets. In addition, a large model with many parameters deployed on a mobile terminal may be limited by the hardware computing power, making real-time applications impossible. Moreover, indoor images have worse lighting conditions and more occlusions than outdoor images, so it is difficult to extract effective features that describe the rich semantic content of the images.
Transfer learning strategies compensate for the shortcomings of convolutional neural networks in these tasks. Transfer learning leverages models pre-trained on large and diverse datasets to extract useful features and then fine-tunes the model to adapt it to a new dataset specific to the current task. As shown in Figure 7, the transfer learning strategy applied in this study pre-trains the model on a large dataset so that it learns visual features, such as edges, textures, and patterns, that are shared across domains, along with the corresponding weights. After pre-training, the linear classifier of the pre-trained model is removed, and the remaining network layers are used as convolutional layers with frozen weights; these layers are not updated during further training and retain the general visual features learned from the large dataset. Finally, a new linear layer is used as the classifier, and only the parameter gradients of this last classification layer are updated on the indoor scene dataset, substantially improving the classification accuracy of the model.
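The freeze-and-replace strategy can be sketched in PyTorch as follows; the use of torchvision’s ImageNet weights and a five-class head here is an illustrative assumption (the experiments in Section 5 pre-train on CIFAR-100 and use five indoor scene categories).

```python
import torch.nn as nn
from torchvision.models import mobilenet_v3_large, MobileNet_V3_Large_Weights

# Load a MobileNetV3 (large) backbone with pre-trained weights.
model = mobilenet_v3_large(weights=MobileNet_V3_Large_Weights.DEFAULT)

# Freeze all pre-trained layers so their weights are not updated ("gradient freeze").
for param in model.parameters():
    param.requires_grad = False

# Replace the final linear layer with a new classifier for the target task.
num_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(num_features, 5)  # 5 indoor scene classes

# Only the new classification layer remains trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Note that the replacement layer is created after the freeze loop, so its parameters default to requires_grad=True and are the only ones updated during fine-tuning.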
The parameters of the convolutional layer obtained from pre-training contain the characteristic information of the original field. The classification of indoor functional areas is based on the detection of key objects. If pillows are present, the functional area is likely a bedroom. If there are dining tables, plates, and spoons, the functional area is likely a dining room. Local semantic features in visual information may be more helpful for classification than texture or edge features. Appropriate pre-training methods can improve the network’s ability to extract semantic features, prevent focusing on the entire space, and ultimately improve network performance.
It should be noted that, when computational resources are limited, the width multiplier can be reduced from 1.0 to 0.75 or 0.5 to decrease the number of parameters and the computational load. For simpler detection environments, the number of certain blocks can be reduced to simplify the model structure and shorten inference time. Adjusting the dropout rate according to the size of the dataset helps prevent overfitting and improves the model’s generalization ability. Setting the learning rate too high may prevent the model from finding the optimal solution, while setting it too low slows down convergence; the learning rate should be tuned on the actual dataset to determine the optimal value.
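Two of these adjustments, the dropout rate and the learning rate, might look as follows in PyTorch; the dropout value of 0.4 and the step-decay schedule are illustrative assumptions, and `model` is the network from the previous sketch.

```python
import torch
import torch.nn as nn

# Raise the classifier dropout on a small dataset to curb overfitting.
for i, layer in enumerate(model.classifier):
    if isinstance(layer, nn.Dropout):
        model.classifier[i] = nn.Dropout(p=0.4)

# Tune the learning rate on the actual dataset; 1e-3 matches the initial
# rate used in Section 5, and a scheduler can decay it during training.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
```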
4.2. The Semantic Cognitive Framework
We use the strategy of pre-training with a large sample size and fine-tuning on the indoor dataset. First, the MobileNetV3 network is pre-trained on the CIFAR-100 dataset to obtain its weights and parameters. The final classification layer of the trained network is then removed, and the remaining network is used as a feature extractor. We then add a linear layer as the classifier, perform a “gradient freeze” on the feature extractor, and retrain the model on the indoor scene and tool datasets. As shown in Figure 8, the semantic cognitive framework consists of two stages: feature extraction and type recognition. The feature extraction stage extracts image semantic features at different levels, while the type recognition stage reduces the dimensionality of the extracted features and predicts the sample category.
In the feature extraction stage, a convolutional neural network model pre-trained on a large dataset is used to obtain the transferred network parameters. The structure of the model is changed to construct a feature extraction model whose output is a one-dimensional feature vector containing rich information. In this stage, features are extracted that distinguish the various categories in different images so that the scenes and tools can be recognized in the next stage.
The type recognition stage inputs the scene and tool images into the same feature extraction model as in the previous stage. The feature extraction model processes the features of all scene and tool images in the dataset to represent each scene and tool category, and the category labels of each scene and tool are matched to train a classifier that predicts the scene and tool categories.
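A minimal sketch of the feature extraction stage follows: swapping the classifier head for an identity mapping turns the network into a pure feature extractor that outputs a one-dimensional vector per image (960-dimensional for torchvision’s MobileNetV3-large); the batch of random tensors stands in for real scene or tool images.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large, MobileNet_V3_Large_Weights

# Build the feature extraction model by removing the classification head.
feature_extractor = mobilenet_v3_large(weights=MobileNet_V3_Large_Weights.DEFAULT)
feature_extractor.classifier = nn.Identity()
feature_extractor.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # a dummy batch of RGB images
    features = feature_extractor(images)   # shape: (4, 960), one vector per image
```

These vectors, paired with their category labels, are what the type recognition stage uses to train the classifier.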
5. Experiments with Models and Frameworks
The experimental datasets are as follows:
- (1)
CIFAR-100 dataset
CIFAR-100 is an image dataset with 60,000 images, including various outdoor scenes and objects. There are 100 subcategories with fine labels, with 600 images per category; these 100 categories are grouped into 20 major categories with coarse labels.
Figure 9 shows some examples of the CIFAR-100 dataset and the fine labels.
- (2)
Scene dataset (indoor functional area dataset)
The Scene15 dataset is an extension of the Scene13 scene dataset. Scene13 contains more than 3000 images in 13 categories, including city, country, suburban house, street, highway, high-rise, mountain, coast, forest, living room, bedroom, office, and kitchen. The Scene15 dataset includes images in two additional categories (stores and factories), comprising 15 scenes.
Figure 10 shows some examples of the indoor scene category of the Scene15 dataset.
- (3)
Tool dataset (utility tool dataset)
The UMD part affordance dataset is a comprehensive dataset of general household tools. This dataset was acquired with RGB-D sensors and includes semantic labels for 17 household tools, including cup, turner, ladle, shear, mug, spoon, trowel, knife, tenderizer, scoop, shovel, scissors, hammer, saw, pot, bowl, and mallet.
Figure 11 shows examples of the tool images and classes in the dataset.
We used the Scene15 dataset for indoor functional area classification to verify the proposed recognition model. We considered five indoor scenes based on the home service robot’s working environment: living room, bedroom, kitchen, store, and office. For each scene, 80% of the images were selected as the training set, 10% as the validation set, and 10% as the test set. The hyperparameters for model training included a batch size of 256 images, 200 training epochs, and an initial learning rate of 0.001.
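A hedged sketch of this training setup is shown below; the dataset directory is a hypothetical placeholder for the five selected Scene15 categories, and `model` is the frozen-backbone network from Section 4.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

BATCH_SIZE, EPOCHS, LR = 256, 200, 1e-3   # hyperparameters from the text

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
full_set = datasets.ImageFolder("data/scene15_indoor", transform=transform)

# 80/10/10 split into training, validation, and test sets.
n = len(full_set)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    full_set, [n_train, n_val, n - n_train - n_val]
)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=LR
)

# Standard fine-tuning loop; only the classifier parameters are updated.
for epoch in range(EPOCHS):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```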
Figure 12a,b show the loss values of the training set and the accuracy of the validation set during training.
For the Scene15 dataset, the loss value converges after 50 epochs and stabilizes at around 0.3. The accuracy of the model on the validation set converges to 80% after 30 epochs, increases again after 125 epochs, and eventually stabilizes at around 88%.
In the functional tool classification experiment, the 17 household tools in the UMD part affordance dataset are divided into a training set, validation set, and test set at a ratio of 8:1:1. The hyperparameters for model training include a batch size of 256 images, 200 training epochs, and an initial learning rate of 0.001.
Figure 13a,b show the loss values of the training set and the accuracy of the validation set during training.
Breakpoint (checkpoint) training is used to prevent falling into a local optimum during training. The loss value of the training set converges after 125 epochs and stabilizes below 0.5. The top-1 accuracy on the validation set is about 96%, and the top-5 accuracy is close to 100%.
The proposed model is compared with other advanced indoor scene recognition methods to demonstrate its performance, using classification accuracy and inference speed as evaluation metrics. As shown in Table 2, the results indicate that the proposed method achieves faster inference (63.11 ms) than the other advanced indoor scene recognition methods.
The results in Table 3 indicate excellent performance in predicting the tool type, with accuracy values close to 100%. Classical deep networks such as ResNet have many network parameters, low inference speed, and high computational complexity. In contrast, the proposed method has a high inference speed due to its small number of parameters: its inference time is only about one quarter that of ResNet, and its accuracy is significantly higher than that of the other models.
Note that further expanding the dataset to include more diverse and challenging indoor environments would benefit the model, helping it remain adaptable to a wider range of real-world situations and scale well across various home settings.