
CN113920415A - Scene recognition method, device, terminal and medium - Google Patents

Scene recognition method, device, terminal and medium

Info

Publication number
CN113920415A
CN113920415A
Authority
CN
China
Prior art keywords
image, scene, clustering, recognized, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111135850.9A
Other languages
Chinese (zh)
Inventor
万培佩
刘贤焯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orbbec Inc
Original Assignee
Orbbec Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orbbec Inc
Priority to CN202111135850.9A
Publication of CN113920415A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2323 Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Discrete Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of image processing, and provides a scene recognition method, device, terminal and medium. The scene recognition method specifically comprises the following steps: acquiring an image to be recognized; inputting the image to be recognized into a neural network model, and acquiring feature points of the image to be recognized output by the neural network model and descriptors corresponding to the feature points; inputting the descriptors into a bag-of-words model database for encoding to obtain a feature vector of the image to be recognized; and calculating the similarity between the feature vector of the image to be recognized and the feature vector of each type of scene image, wherein the similarity is used for identifying the scene type to which the image to be recognized belongs. The method and device can meet the real-time requirement of scene recognition and improve the accuracy of scene recognition.

Description

Scene recognition method, device, terminal and medium
Technical Field
The present application relates to the field of image processing, and in particular, to a scene recognition method, apparatus, terminal, and medium.
Background
Scene recognition belongs to a sub-field of image processing, and means that the most similar scene is found from a candidate library by matching an acquired image of a scene to be recognized with a large number of images in the candidate library, so as to determine which scene the scene to be recognized belongs to.
The existing scene recognition algorithms mainly fall into two categories: the traditional bag-of-words model recognition algorithm and the deep-learning scene recognition algorithm. The traditional bag-of-words model recognition algorithm relies on image feature points extracted by hand-crafted algorithms, so its robustness and scene recognition accuracy are poor. The deep-learning scene recognition algorithm extracts global semantic features through a neural network model, but the dimension of these global semantic features is large, so feature matching on them requires a large amount of computation, is slow, and occupies a large amount of memory.
Therefore, a scene recognition algorithm that can satisfy both real-time performance and accuracy is needed.
Disclosure of Invention
The embodiment of the application provides a scene recognition method, a scene recognition device, a terminal and a storage medium, which can meet the real-time requirement of scene recognition and improve the accuracy of scene recognition.
A first aspect of an embodiment of the present application provides a scene identification method, including:
acquiring an image to be identified;
inputting the image to be recognized into a neural network model, and acquiring feature points of the image to be recognized output by the neural network model and descriptors corresponding to the feature points;
inputting the descriptor into a bag-of-words model database for encoding to obtain a feature vector of the image to be recognized;
and calculating the similarity between the feature vector of the image to be identified and the feature vector of each type of scene image, wherein the similarity is used for identifying the scene type to which the image to be identified belongs.
A second aspect of the embodiments of the present application provides a scene recognition apparatus, including:
the image acquisition unit is used for acquiring an image to be identified;
the model processing unit is used for inputting the image to be recognized into a neural network model and acquiring the feature points of the image to be recognized output by the neural network model and descriptors corresponding to the feature points;
the feature extraction unit is used for inputting the descriptors into a bag-of-words model database for encoding to obtain the feature vector of the image to be recognized;
and the scene identification unit is used for calculating the similarity between the feature vector of the image to be identified and the feature vector of each type of scene image, and the similarity is used for identifying the scene type to which the image to be identified belongs.
A third aspect of the embodiments of the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In some embodiments of the present application, the terminal further includes an image capturing device, and the image capturing device is configured to capture the image to be recognized.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the above method.
A fifth aspect of embodiments of the present application provides a computer program product, which when run on a terminal, causes the terminal to perform the steps of the method.
In the embodiments of the present application, the image to be recognized is input into the neural network model, and the feature points of the image to be recognized and the descriptors corresponding to the feature points output by the model are obtained, so that image features with higher precision and stronger robustness can be extracted. The descriptors are then input into the bag-of-words model database for encoding to obtain the feature vector of the image to be recognized, and the similarity between this feature vector and the feature vector of each type of scene image is calculated. Because the feature vector used in feature matching has a lower dimension than global semantic features, the speed of feature matching is increased.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a schematic flowchart illustrating an implementation process of a scene identification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a bag of words model database provided by an embodiment of the present application;
fig. 3 is a schematic diagram of a specific implementation flow of cluster training on a bag-of-words model database according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a specific implementation of inputting a descriptor into a bag-of-words model database for encoding according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a specific implementation of determining the word vector associated with a single descriptor according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a scene recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall be protected by the present application.
Scene recognition belongs to a sub-field of image processing, and means that the most similar scene is found from a candidate library by matching an acquired image of a scene to be recognized with a large number of images in the candidate library, so as to determine which scene the scene to be recognized belongs to.
The existing scene recognition algorithms mainly fall into two categories: the traditional bag-of-words model recognition algorithm and the deep-learning scene recognition algorithm.
The traditional bag-of-words model recognition algorithm extracts feature points of an image through extraction algorithms such as ORB (Oriented FAST and Rotated BRIEF), Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF), then encodes word vectors using the feature points, and searches the candidate library for the most similar scene image using the encoded word vectors.
The deep-learning scene recognition algorithm trains a scene recognition network model with a large amount of data; the network model can extract the global semantic features of images, and the most similar scene images are then searched from the candidate library using the global semantic features extracted by the model.
The traditional bag-of-words model recognition algorithm relies on image feature points extracted by hand-crafted algorithms, so its robustness and scene recognition accuracy are poor, while the deep-learning scene recognition algorithm extracts global semantic features through a neural network model, and the dimension of these global semantic features is large, so feature matching on them requires a large amount of computation, is slow, and occupies a large amount of memory.
Therefore, a scene recognition algorithm that can satisfy both real-time performance and accuracy is needed.
In order to explain the technical means of the present application, the following description will be given by way of specific examples.
Fig. 1 shows a schematic flowchart of an implementation of the scene recognition method provided in an embodiment of the present application. The method may be applied to a terminal, such as a smart phone, a computer, or another intelligent device that needs to perform scene recognition, and is suitable for situations that require both real-time performance and improved accuracy of scene recognition.
Specifically, the scene recognition method may include the following steps S101 to S104.
And step S101, acquiring an image to be identified.
The image to be recognized refers to an image obtained by shooting a scene to be recognized. The scene to be recognized refers to a scene needing scene recognition.
It should be noted that the method for acquiring the image to be recognized is not limited here. In some embodiments of the present application, the terminal may capture the current scene with its own camera to obtain the image to be recognized, so as to perform scene recognition on the current scene; the terminal may also receive an image to be recognized captured by another device of the scene where that device is located, so as to perform scene recognition on that scene.
Step S102, inputting the image to be recognized into the neural network model, and obtaining the feature points of the image to be recognized output by the neural network model and the descriptors corresponding to the feature points.
Because the hierarchical feature extraction of a neural network performs better in visual tasks than hand-crafted feature points, in the embodiments of the present application the terminal may input the image to be recognized into the neural network model and acquire the feature points of the image to be recognized output by the neural network model and the descriptors corresponding to the feature points. A descriptor is a feature vector that describes its corresponding feature point.
The neural network model is trained in advance. Its specific structure can be selected according to actual conditions; for example, the neural network model may be a SuperPoint neural network model.
Specifically, in some embodiments of the present application, the neural network model may include an encoding layer, and a first decoding layer and a second decoding layer respectively connected to the encoding layer. That is, the model contains a shared encoder for processing the input image and reducing its dimensionality, followed by two different decoders: one for feature point detection and one for feature point description.
Therefore, the terminal can input the image to be identified into the coding layer to obtain a low-dimensional image of the image to be identified; then, feature points output by the first decoding layer according to the low-dimensional image and descriptors output by the second decoding layer according to the low-dimensional image are acquired.
It should be noted that the terminal may train the model in advance using a large number of sample images, and perform image processing through the neural network model after training is completed. In some embodiments of the present application, the sample images may be manually labeled, the neural network model may be trained using a plurality of sample image groups, and a combined loss function may be computed from the manual labels to determine whether the neural network model has converged. The combined loss function is the sum of the feature-point loss and the descriptor loss, and each sample image group contains a plurality of sample images of the same type of scene. Once the neural network model converges, it can accurately output the feature points and descriptors of a sample image, so the terminal can use it to process the image to be recognized.
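For illustration only, the following is a minimal sketch of the shared-encoder, two-decoder-head architecture described above, written in PyTorch. The layer sizes, the 65-channel detector head, the descriptor dimension and the class name SceneFeatureNet are assumptions made for the sketch, not parameters taken from this application; the SuperPoint-style layout simply mirrors the encoding layer / first decoding layer / second decoding layer structure of step S102.

```python
# Illustrative sketch of a shared encoder with two decoder heads (assumption:
# PyTorch, SuperPoint-style layout; not the exact model of this application).
import torch
import torch.nn as nn

class SceneFeatureNet(nn.Module):
    def __init__(self, desc_dim: int = 256):
        super().__init__()
        # Encoding layer: shared encoder that downsamples the image (here by 8x).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # First decoding layer: feature-point detection head
        # (65 channels = 64 positions in an 8x8 cell + 1 "no keypoint" bin).
        self.detector = nn.Conv2d(128, 65, 1)
        # Second decoding layer: descriptor head.
        self.descriptor = nn.Conv2d(128, desc_dim, 1)

    def forward(self, image: torch.Tensor):
        feat = self.encoder(image)                   # low-dimensional image
        keypoint_logits = self.detector(feat)        # where the feature points are
        desc = self.descriptor(feat)                 # descriptor for each location
        desc = nn.functional.normalize(desc, dim=1)  # unit-length descriptors
        return keypoint_logits, desc

# Usage on a grayscale image tensor of shape (batch, 1, H, W):
# logits, descriptors = SceneFeatureNet()(torch.randn(1, 1, 240, 320))
```

During training, a combined loss (feature-point loss plus descriptor loss, as described above) would be computed on the two outputs.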
And step S103, inputting the descriptor into a bag-of-words model database for coding to obtain the characteristic vector of the image to be identified.
In an embodiment of the present application, the bag-of-words model database is a model obtained by performing clustering training on a large number of sample images in advance. As shown in fig. 2, the bag-of-words model database may be represented by a tree structure, each layer of nodes in the tree structure respectively represents a cluster center in the cluster training process, and leaf nodes thereof represent the nth layer of cluster centers obtained by the cluster training.
Specifically, in some embodiments of the present application, as shown in fig. 3, the process of performing cluster training on the bag-of-words model database may include the following steps S301 to S306.
In step S301, a sample descriptor of a sample image is acquired.
Wherein the sample image may contain images of different categories of scenes. The number of the sample images can be set by a worker according to actual conditions, and in order to ensure the reliability of the bag-of-words model database, the number of the used sample images is greater than a certain number threshold.
In some embodiments of the present application, the terminal may respectively extract a descriptor of each sample image to obtain a sample descriptor for training the bag-of-words model database, and the specific extraction process may refer to the description of step S102 in the present application, which is not described herein again.
Step S302, selecting K sample descriptors from the sample descriptors as the clustering centers of the current layer for clustering.
K is a positive integer larger than 0, and the specific value of K can be adjusted according to the actual situation.
In some embodiments of the present application, the current-layer cluster centers are the cluster centers currently being used for clustering. After K current-layer cluster centers are selected from the sample descriptors, the sample descriptors can be clustered; that is, with the K selected sample descriptors serving as current-layer cluster centers, every other sample descriptor is assigned to the current-layer cluster center closest to it in Euclidean distance. At this point, each current-layer cluster center may be associated with a plurality of sample descriptors.
It should be noted that the way of selecting K sample descriptors may be selected according to actual situations, and in some embodiments, K sample descriptors may be randomly selected from among the sample descriptors.
Step S303, the current layer clustering center is used as the previous layer clustering center, and K corresponding sample descriptors are selected for each previous layer clustering center and used as the current layer clustering center for clustering.
In some embodiments of the present application, the previous-layer clustering center is a clustering center obtained by clustering last time. For a certain previous-layer clustering center, since it is associated with a plurality of sample descriptors, it is possible to reselect the corresponding K current-layer clustering centers from the sample descriptors associated with the previous-layer clustering center, and cluster the sample descriptors associated with the previous-layer clustering center again.
Step S304, detecting whether the clustering layer number corresponding to the current layer clustering center is N.
Wherein, N is a positive integer greater than 0, and the specific value of N can be set by the staff according to the actual conditions, and the value is generally greater than 1.
After K current-layer clustering centers are selected for each previous-layer clustering center for clustering, clustering of the current clustering layer is completed, and at this time, whether the number of clustering layers corresponding to the current-layer clustering center is N needs to be detected.
Step S305, if the number of clustering layers corresponding to the current layer clustering center is not N, the current layer clustering center is used as the previous layer clustering center, K corresponding sample descriptors are selected for each previous layer clustering center to be used as the current layer clustering center for clustering, and whether the number of clustering layers corresponding to the current layer clustering center is N or not is detected until the number of clustering layers corresponding to the current layer clustering center is N, so that the N layer clustering centers are obtained.
If the number of clustering layers corresponding to the current-layer cluster centers is not N, the final layer of clustering has not been reached, i.e., the number of clustering rounds has not reached the set number. In that case, the current-layer cluster centers are again taken as previous-layer cluster centers, K corresponding current-layer cluster centers are selected for each previous-layer cluster center for clustering, and whether the number of clustering layers corresponding to the current-layer cluster centers is N is detected again, until the number of clustering layers reaches N, i.e., the final layer of clustering is reached, the set number of clustering rounds is completed, and the N layers of cluster centers are obtained.
And S306, constructing a bag-of-words model database by using the N-layer clustering centers.
At this time, the N-layer clustering centers obtained in the clustering training process are respectively used as a node in the tree structure, and then the bag-of-words model database can be constructed. As shown in fig. 2, the root node represents all the sample descriptors, K sample descriptors are selected from the sample descriptors, and the K sample descriptors are used as the cluster center of the current layer (i.e., the cluster center of the first layer) and are clustered. And then continuously selecting K sample descriptors for each first-layer clustering center, taking the K sample descriptors as the current-layer clustering center (namely, the second-layer clustering center), clustering, and so on to finally obtain N-layer clustering centers.
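For illustration, the following NumPy sketch follows steps S301 to S306 literally: K sample descriptors are picked as current-layer cluster centers, the remaining descriptors are assigned to the nearest center by Euclidean distance, and each center's descriptors are re-clustered in the same way until N layers are built. The random choice of the K centers and the names Node and build_tree are assumptions made for the sketch, not details taken from this application.

```python
# Minimal sketch of the vocabulary-tree (bag-of-words database) training,
# steps S301-S306. Assumption: plain NumPy, random selection of centers.
import numpy as np

class Node:
    def __init__(self, center):
        self.center = center    # the sample descriptor chosen as this cluster center
        self.children = []      # next-layer cluster centers under this node

def build_tree(descriptors: np.ndarray, K: int, N: int, depth: int = 1, rng=None):
    """Recursively cluster `descriptors` (num_descriptors x dim) into an N-layer, K-way tree."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if depth > N or len(descriptors) <= K:
        return []    # stop once the required depth N is reached or too few descriptors remain
    # Steps S302/S303: select K sample descriptors as current-layer cluster centers.
    idx = rng.choice(len(descriptors), size=K, replace=False)
    centers = descriptors[idx]
    nodes = [Node(c) for c in centers]
    # Assign every descriptor to the closest current-layer center (Euclidean distance).
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignment = np.argmin(dists, axis=1)
    # Step S305: re-cluster each center's associated descriptors at the next layer.
    for k, node in enumerate(nodes):
        node.children = build_tree(descriptors[assignment == k], K, N, depth + 1, rng)
    return nodes

# Step S306: the database is the tree rooted at the first-layer centers;
# its leaves are the N-th-layer cluster centers.
# tree = build_tree(sample_descriptors, K=10, N=4)
```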
In the embodiment of the application, after the training of the bag-of-words model database is completed, the descriptor can be input into the bag-of-words model database for encoding, and the feature vector of the image to be recognized is obtained.
Specifically, as shown in fig. 4, the encoding process may include the following steps S401 to S403.
Step S401, each Nth-layer clustering center of the N-layer clustering centers in the word bag model database is respectively used as a word vector.
In some embodiments of the present application, each N-th clustering center in the bag-of-words model database may represent feature information of a similar class of descriptors, and therefore, the descriptor corresponding to each N-th clustering center may be respectively used as a word vector, and each word vector is used to represent a certain class of feature attributes of an image.
Step S402, determining word vectors respectively associated with each descriptor, and counting the weight of each word vector.
In some embodiments of the present application, the terminal may determine the word vector associated with each descriptor in turn.
Specifically, as shown in fig. 5, the step of determining the word vector in which the single descriptors are associated may include the following steps S501 to S502.
Step S501, calculating the similarity between a single descriptor and a first-layer cluster center in the N-layer cluster centers, and clustering the single descriptor to a target first-layer cluster center with the highest similarity.
Specifically, by calculating the similarity between a single descriptor and each first-layer cluster center, when the similarity between a certain first-layer cluster center and a single descriptor is the highest, the characteristic that the single descriptor belongs to the first-layer cluster center representation is described, so that the terminal can take the first-layer cluster center with the highest similarity as a target first-layer cluster center and cluster the single descriptor to the target first-layer cluster center.
Step S502, calculating the similarity between the single descriptor and each second-layer clustering center of the target first-layer clustering centers, clustering the single descriptor to the target second-layer clustering center with the highest similarity, and so on until the single descriptor is clustered to the target Nth-layer clustering center with the highest similarity, and taking the target Nth-layer clustering center as a word vector associated with the single descriptor.
In the process of cluster training, a plurality of second-layer cluster centers are included below the target first-layer cluster center, that is, the characteristic of the target first-layer cluster center is further subdivided with multi-class characteristics, so that the terminal can continuously calculate the similarity between a single descriptor and each second-layer cluster center of the target first-layer cluster center, and cluster the single descriptor to the target second-layer cluster center with the highest similarity. And repeating the steps until the single descriptor is clustered to the target N-th layer clustering center with the highest similarity, namely the single descriptor of the image to be recognized is clustered to the target N-th layer clustering center, and the characteristic that the single descriptor belongs to the target N-th layer clustering center is shown, so that the target N-th layer clustering center can be used as a word vector associated with the single descriptor.
By performing the steps described in fig. 5 on each descriptor of the image to be recognized, a word vector associated with each descriptor can be determined.
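As a sketch of steps S501 and S502, the function below descends the vocabulary tree built by the build_tree sketch above, at each layer moving to the child cluster center most similar to the descriptor, until a leaf (an N-th-layer cluster center, i.e., a word vector) is reached. Cosine similarity between the descriptor and the cluster centers is an assumed choice; the application only specifies selecting the center with the highest similarity.

```python
# Sketch of steps S501-S502: find the word (leaf cluster center) associated
# with a single descriptor by descending the tree layer by layer.
import numpy as np

def find_word(descriptor: np.ndarray, nodes):
    """Return the N-th-layer cluster center (word) most similar to one descriptor."""
    current = None
    while nodes:                                   # descend while a deeper layer exists
        sims = [float(np.dot(descriptor, n.center)) /
                (np.linalg.norm(descriptor) * np.linalg.norm(n.center) + 1e-12)
                for n in nodes]
        current = nodes[int(np.argmax(sims))]      # target cluster center at this layer
        nodes = current.children                   # continue with its next-layer centers
    return current                                 # leaf node = word vector of the descriptor
```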
At this time, because the number of descriptors associated with each word vector (i.e., each N-th-layer cluster center) is not the same, the importance of each class of features in the image to be recognized is also not the same, so the terminal further needs to count the weight (IDF) corresponding to each word vector.
Specifically, the terminal may determine a number ratio of the descriptor associated with each word vector in the descriptor corresponding to the image to be recognized, and use the number ratio corresponding to each word vector as a weight corresponding thereto.
That is, the weight corresponding to a word vector is the proportion of the number of its associated descriptors in all the number of descriptors corresponding to the image to be recognized.
Therefore, if the number of descriptors associated with a word vector is larger, the importance degree of the word vector in the image to be recognized is larger, and therefore the weight corresponding to the word vector is also larger.
And step S403, using a set formed by each word vector and the weight corresponding to each word vector as a feature vector of the image to be identified.
In particular, in some embodiments of the present application, the feature vector of the image to be recognized is v_A = {(w_A1, η_A1), (w_A2, η_A2), …, (w_Ai, η_Ai)}, where w_Ai represents the i-th word vector, i.e., the i-th N-th-layer cluster center shown in FIG. 2, and η_Ai represents the weight of the i-th word vector.
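The encoding of steps S401 to S403 can then be sketched as follows: every descriptor of the image is mapped to its word with the find_word sketch above, the number of descriptors falling on each word is counted, and each word's share of the total count is used as its weight, giving the set v_A = {(w_Ai, η_Ai)}. The function name encode_image and the dictionary representation of the feature vector are assumptions made for the sketch.

```python
# Sketch of steps S401-S403: build the weighted bag-of-words feature vector
# of one image from its descriptors. Assumption: the vector is represented
# as a dict mapping leaf node (word) -> weight.
from collections import Counter
import numpy as np

def encode_image(descriptors: np.ndarray, tree) -> dict:
    """Return {word_node: weight} for one image's descriptors."""
    counts = Counter(find_word(d, tree) for d in descriptors)
    total = sum(counts.values()) or 1              # guard against an empty image
    # Weight of a word = share of the image's descriptors associated with it.
    return {word: n / total for word, n in counts.items()}
```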
And step S104, calculating the similarity between the feature vector of the image to be identified and the feature vector of each type of scene image.
Specifically, for a certain type of scene image, that is, for an image corresponding to a certain type of scene, the terminal may obtain the type of scene image, then input the type of scene image into the neural network model, obtain the feature points and descriptors output by the neural network model, and input the descriptors of the type of scene image into the bag-of-words model database for encoding, so as to obtain the feature vectors of the type of scene image.
In the embodiment of the application, the terminal can calculate the similarity between the feature vector of the image to be recognized and the feature vector of each type of scene image, that is, the feature vectors of the image to be recognized and the feature vectors of each type of scene image are subjected to feature matching, and the obtained similarity can be used for recognizing the scene type to which the image to be recognized belongs.
Note that the present application does not limit the calculation method of the similarity. In some embodiments of the present application, in step S104, the step of calculating the similarity between the feature vector of the image to be recognized and the feature vector of the single type of scene image may include: respectively calculating the word vector similarity between the word vector of each position of the feature vector of the image to be recognized and the word vector of the corresponding position in the feature vector of the single-class scene image, and then calculating the similarity between the feature vector of the image to be recognized and the feature vector of the single-class scene image according to the word vector similarity.
More specifically, the similarity between the feature vector of the image to be recognized and the feature vector of a single type of scene image is computed by the formula given in the original publication (reproduced there only as an image), where the feature vector of the single type of scene image is v_B = {(w_B1, η_B1), (w_B2, η_B2), …, (w_Bi, η_Bi)}.
After the similarity between the feature vector of the image to be recognized and the feature vector of each type of scene image is obtained, the terminal can determine the scene type to which the image to be recognized belongs as the scene type to which the scene image with the highest similarity belongs.
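Because the published scoring formula appears only as an image in the original text, the sketch below substitutes a common bag-of-words score (one minus half the L1 distance between the two weight vectors, as used in DBoW-type systems) as an assumption rather than this application's exact expression, and then selects the scene class with the highest score, as described above.

```python
# Sketch of step S104 and the final decision: score the query against every
# scene class and return the class with the highest similarity.
# Assumption: the 1 - 0.5 * L1 score below stands in for the published formula.
def bow_similarity(v_a: dict, v_b: dict) -> float:
    words = set(v_a) | set(v_b)
    return 1.0 - 0.5 * sum(abs(v_a.get(w, 0.0) - v_b.get(w, 0.0)) for w in words)

def recognize_scene(query_vec: dict, scene_vecs: dict) -> str:
    """scene_vecs maps scene label -> encoded feature vector of that scene class."""
    return max(scene_vecs, key=lambda label: bow_similarity(query_vec, scene_vecs[label]))
```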
In the embodiments of the present application, the image to be recognized is input into the neural network model, and the feature points of the image to be recognized and the descriptors corresponding to the feature points output by the model are obtained, so that image features with higher precision and stronger robustness can be extracted. The descriptors are then input into the bag-of-words model database for encoding to obtain the feature vector of the image to be recognized, and the similarity between this feature vector and the feature vector of each type of scene image is calculated. Because the feature vector used in feature matching has a lower dimension than global semantic features, the speed of feature matching is increased.
Experiments show that, compared with the traditional bag-of-words model scene recognition algorithm, the scene recognition method provided by the present application improves precision by 60%, while its feature matching speed is greatly improved compared with that of a deep neural network, so it can be well applied to scene recognition.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders.
Fig. 6 is a schematic structural diagram of a scene recognition device 600 according to an embodiment of the present application, where the scene recognition device 600 is configured on a terminal.
Specifically, the scene recognition apparatus 600 may include:
an image acquisition unit 601 configured to acquire an image to be recognized;
a model processing unit 602, configured to input the image to be recognized into a neural network model, and obtain feature points of the image to be recognized output by the neural network model and descriptors corresponding to the feature points;
a feature extraction unit 603, configured to input the descriptor into a bag-of-words model database for encoding, so as to obtain a feature vector of the image to be identified;
a scene identification unit 604, configured to calculate a similarity between the feature vector of the image to be identified and the feature vector of each type of scene image, where the similarity is used to identify a scene type to which the image to be identified belongs.
In some embodiments of the present application, the scene recognition apparatus 600 further includes an image capturing unit, configured to capture the image to be recognized; correspondingly, the image acquisition unit 601 is further configured to acquire the image to be recognized captured by the image capturing unit.
In some specific embodiments of the present application, the image capturing unit may be any one or more combination of a color camera, a black and white camera, a grayscale camera, an infrared camera, a depth camera, and the like, which is not limited herein.
In some embodiments of the present application, the scene recognition apparatus 600 may further include a cluster training unit, configured to: obtaining a sample descriptor of a sample image; selecting K sample descriptors from the sample descriptors as a current layer clustering center for clustering, wherein K is a positive integer greater than 0; taking the current layer clustering center as a previous layer clustering center, and selecting corresponding K sample descriptors for each previous layer clustering center as the current layer clustering center for clustering; detecting whether the number of clustering layers corresponding to a current layer clustering center is N, wherein N is a positive integer greater than 0; if the number of clustering layers corresponding to the current layer clustering center is not N, re-executing the steps of taking the current layer clustering center as a previous layer clustering center, selecting K corresponding sample descriptors for each previous layer clustering center as the current layer clustering center for clustering, and detecting whether the number of clustering layers corresponding to the current layer clustering center is N or not until the number of clustering layers corresponding to the current layer clustering center is N, and obtaining N layer clustering centers; and constructing the bag-of-words model database by utilizing the N-layer clustering centers.
In some embodiments of the present application, the feature extraction unit 603 may be further specifically configured to: taking each Nth-layer clustering center of the N-layer clustering centers in the bag-of-words model database as a word vector respectively; determining word vectors respectively associated with each descriptor, and counting the weight of each word vector; and taking a set formed by each word vector and the weight corresponding to each word vector as the characteristic vector of the image to be recognized.
In some embodiments of the present application, the feature extraction unit 603, when calculating a word vector associated with a single descriptor, may further specifically be configured to: calculating the similarity between the single descriptor and a first-layer cluster center in the N-layer cluster centers, and clustering the single descriptor to a target first-layer cluster center with the highest similarity; calculating the similarity between the single descriptor and each second-layer cluster center of the target first-layer cluster centers, clustering the single descriptor to the target second-layer cluster center with the highest similarity, and so on until the single descriptor is clustered to the target N-th-layer cluster center with the highest similarity, and taking the target N-th-layer cluster center as a word vector associated with the single descriptor.
In some embodiments of the present application, the feature extraction unit 603 may be further specifically configured to: respectively determining the number ratio of the descriptors associated with each word vector in the descriptors corresponding to the image to be recognized; and taking the number ratio corresponding to each word vector as the corresponding weight.
In some embodiments of the present application, when calculating the similarity between the feature vector of the image to be recognized and the feature vector of the single-type scene image, the scene recognition unit 604 may be specifically configured to: respectively calculate the word vector similarity between the word vector of each position of the feature vector of the image to be recognized and the word vector of the corresponding position in the feature vector of the single-type scene image; and calculate the similarity between the feature vector of the image to be recognized and the feature vector of the single-type scene image according to the word vector similarity.
In some embodiments of the present application, the neural network model includes an encoding layer, and a first decoding layer and a second decoding layer respectively connected to the encoding layer; the model processing unit 602 may be specifically configured to: inputting the image to be identified into the coding layer to obtain a low-dimensional image of the image to be identified; acquiring feature points output by the first decoding layer according to the low-dimensional image and descriptors output by the second decoding layer according to the low-dimensional image.
It should be noted that, for convenience and simplicity of description, the specific working process of the scene recognition apparatus 600 may refer to the corresponding process of the method described in fig. 1 to fig. 5, and is not described herein again.
Fig. 7 is a schematic diagram of a terminal according to an embodiment of the present application. The terminal can be intelligent equipment such as a smart phone and a computer.
The terminal 7 may include: a processor 70, a memory 71 and a computer program 72, such as a scene recognition program, stored in said memory 71 and executable on said processor 70. The processor 70, when executing the computer program 72, implements the steps in the various scene recognition method embodiments described above, such as the steps S101 to S104 shown in fig. 1. Alternatively, the processor 70, when executing the computer program 72, implements the functions of the modules/units in the above-described apparatus embodiments, such as the image acquisition unit 601, the model processing unit 602, the feature extraction unit 603, and the scene recognition unit 604 shown in fig. 6.
The computer program may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to implement the present application, and the one or more modules/units may be a series of computer program instruction segments capable of implementing specific functions, which are used for describing the execution process of the computer program in the terminal.
For example, the computer program may be divided into: the device comprises an image acquisition unit, a model processing unit, a feature extraction unit and a scene recognition unit.
The specific functions of each unit are as follows: the image acquisition unit is used for acquiring an image to be identified; the model processing unit is used for inputting the image to be recognized into a neural network model and acquiring the feature points of the image to be recognized output by the neural network model and descriptors corresponding to the feature points; the feature extraction unit is used for inputting the descriptor into a bag-of-words model database for coding to obtain a feature vector of the image to be identified; and the scene identification unit is used for calculating the similarity between the feature vector of the image to be identified and the feature vector of each type of scene image, and the similarity is used for identifying the scene type to which the image to be identified belongs.
The terminal may include, but is not limited to, a processor 70, a memory 71. It will be appreciated by those skilled in the art that fig. 7 is only an example of a terminal and is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or different components, for example, the terminal may also include input output devices, network access devices, buses, etc.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the terminal, such as a hard disk or a memory of the terminal. The memory 71 may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the memory 71 may also include both an internal storage unit and an external storage device of the terminal. The memory 71 is used for storing the computer program and other programs and data required by the terminal. The memory 71 may also be used to temporarily store data that has been output or is to be output.
In some embodiments of the present application, the terminal may further include an image capturing device, and the image capturing device may be configured to capture the image to be recognized.
In some specific embodiments of the present application, the image capturing device may be any one or more combination of a color camera, a black and white camera, a grayscale camera, an infrared camera, a depth camera, and the like, which is not limited herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal and method may be implemented in other ways. For example, the above-described apparatus/terminal embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for scene recognition, comprising:
acquiring an image to be identified;
inputting the image to be recognized into a neural network model, and acquiring feature points of the image to be recognized output by the neural network model and descriptors corresponding to the feature points;
inputting the descriptor into a bag-of-words model database for encoding to obtain a feature vector of the image to be recognized;
and calculating the similarity between the feature vector of the image to be identified and the feature vector of each type of scene image, wherein the similarity is used for identifying the scene type to which the image to be identified belongs.
2. The scene recognition method of claim 1, wherein the bag-of-words model database is a model obtained by performing cluster training on sample images in advance, and the process of the cluster training comprises:
obtaining a sample descriptor of a sample image;
selecting K sample descriptors from the sample descriptors as a current layer clustering center for clustering, wherein K is a positive integer greater than 0;
taking the current layer clustering center as a previous layer clustering center, and selecting corresponding K sample descriptors for each previous layer clustering center as the current layer clustering center for clustering;
detecting whether the number of clustering layers corresponding to a current layer clustering center is N, wherein N is a positive integer greater than 0;
if the number of clustering layers corresponding to the current layer clustering center is not N, re-executing the steps of taking the current layer clustering center as a previous layer clustering center, selecting K corresponding sample descriptors for each previous layer clustering center as the current layer clustering center for clustering, and detecting whether the number of clustering layers corresponding to the current layer clustering center is N or not until the number of clustering layers corresponding to the current layer clustering center is N, and obtaining N layer clustering centers;
and constructing the bag-of-words model database by utilizing the N-layer clustering centers.
3. The scene recognition method of claim 2, wherein the inputting the descriptor into a bag-of-words model database for encoding to obtain the feature vector of the image to be recognized comprises:
taking each Nth-layer clustering center of the N-layer clustering centers in the bag-of-words model database as a word vector respectively;
determining word vectors respectively associated with each descriptor, and counting the weight of each word vector;
and taking a set formed by each word vector and the weight corresponding to each word vector as the characteristic vector of the image to be recognized.
4. The scene recognition method of claim 3, wherein in the step of determining the word vector associated with each of the descriptors, the step of determining the word vector associated with a single descriptor comprises:
calculating the similarity between the single descriptor and a first-layer cluster center in the N-layer cluster centers, and clustering the single descriptor to a target first-layer cluster center with the highest similarity;
calculating the similarity between the single descriptor and each second-layer cluster center of the target first-layer cluster centers, clustering the single descriptor to the target second-layer cluster center with the highest similarity, and so on until the single descriptor is clustered to the target N-th-layer cluster center with the highest similarity, and taking the target N-th-layer cluster center as a word vector associated with the single descriptor.
5. The scene recognition method of claim 3 or 4, wherein said counting the weight of each of said word vectors comprises:
respectively determining the number ratio of the descriptors associated with each word vector in the descriptors corresponding to the image to be recognized;
and taking the number ratio corresponding to each word vector as the corresponding weight.
6. The scene recognition method according to any one of claims 1 to 4, wherein in the step of calculating the similarity between the feature vector of the image to be recognized and the feature vector of each type of scene image, the step of calculating the similarity between the feature vector of the image to be recognized and the feature vector of a single type of scene image includes:
respectively calculating word vector similarity between the word vector of each position of the feature vector of the image to be recognized and the word vector of the corresponding position in the feature vector of the single scene image;
and calculating the similarity between the feature vector of the image to be recognized and the feature vector of the single-class scene image according to the word vector similarity.
7. A scene recognition apparatus, comprising:
the image acquisition unit is used for acquiring an image to be identified;
the model processing unit is used for inputting the image to be recognized into a neural network model and acquiring the feature points of the image to be recognized output by the neural network model and descriptors corresponding to the feature points;
the feature extraction unit is used for inputting the descriptor into a bag-of-words model database for coding to obtain a feature vector of the image to be identified;
and the scene identification unit is used for calculating the similarity between the feature vector of the image to be identified and the feature vector of each type of scene image, and the similarity is used for identifying the scene type to which the image to be identified belongs.
8. A terminal comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when executing the computer program.
9. The terminal according to claim 8, characterized in that the terminal further comprises an image capturing device, and the image capturing device is configured to capture the image to be recognized in the method according to any one of claims 1 to 6.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202111135850.9A 2021-09-27 2021-09-27 Scene recognition method, device, terminal and medium Pending CN113920415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111135850.9A CN113920415A (en) 2021-09-27 2021-09-27 Scene recognition method, device, terminal and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111135850.9A CN113920415A (en) 2021-09-27 2021-09-27 Scene recognition method, device, terminal and medium

Publications (1)

Publication Number Publication Date
CN113920415A (en) 2022-01-11

Family

ID=79236331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111135850.9A Pending CN113920415A (en) 2021-09-27 2021-09-27 Scene recognition method, device, terminal and medium

Country Status (1)

Country Link
CN (1) CN113920415A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140254923A1 (en) * 2011-10-19 2014-09-11 The University Of Sydney Image processing and object classification
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
CN111523554A (en) * 2020-04-13 2020-08-11 四川轻化工大学 Image recognition method based on reverse bag-of-words model
CN112329798A (en) * 2020-11-27 2021-02-05 重庆理工大学 Image scene classification method based on optimized visual bag-of-words model


Similar Documents

Publication Publication Date Title
CN108875522B (en) Face clustering method, device and system and storage medium
CN115937655B (en) Multi-order feature interaction target detection model, construction method, device and application thereof
Jia et al. Gabor cube selection based multitask joint sparse representation for hyperspectral image classification
CN109558823B (en) Vehicle identification method and system for searching images by images
WO2021017303A1 (en) Person re-identification method and apparatus, computer device and storage medium
CN110175615B (en) Model training method, domain-adaptive visual position identification method and device
Varghese et al. An efficient algorithm for detection of vacant spaces in delimited and non-delimited parking lots
CN111291887A (en) Neural network training method, image recognition method, device and electronic equipment
CN102385592A (en) Image concept detection method and device
CN113762326A (en) Data identification method, device and equipment and readable storage medium
CN113015022A (en) Behavior recognition method and device, terminal equipment and computer readable storage medium
CN110751191A (en) Image classification method and system
CN111373393B (en) Image retrieval method and device and image library generation method and device
CN109492610A (en) A kind of pedestrian recognition methods, device and readable storage medium storing program for executing again
CN105678349B (en) A kind of sub- generation method of the context-descriptive of visual vocabulary
CN115063612A (en) Fraud early warning method, device, equipment and storage medium based on face-check video
CN114492640A (en) Domain-adaptive-based model training method, target comparison method and related device
CN118013372B (en) Heterogeneous transducer asset identification method, system and equipment based on multi-mode data
CN111143544B (en) Method and device for extracting bar graph information based on neural network
CN114998665B (en) Image category identification method and device, electronic equipment and storage medium
CN113920415A (en) Scene recognition method, device, terminal and medium
CN112949672A (en) Commodity identification method, commodity identification device, commodity identification equipment and computer readable storage medium
CN112417925A (en) In-vivo detection method and device based on deep learning and storage medium
CN115618019A (en) Knowledge graph construction method and device and terminal equipment
CN108154107A (en) A kind of method of the scene type of determining remote sensing images ownership

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination