CN111694965B - Image scene retrieval system and method based on multi-modal knowledge graph - Google Patents
- Publication number: CN111694965B
- Application number: CN202010478948.3A
- Authority: CN (China)
- Prior art keywords: scene, information, image, retrieval, target
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F16/367: Information retrieval; creation of semantic tools; ontology
- G06F18/25: Pattern recognition; analysing; fusion techniques
- G06N5/04: Computing arrangements using knowledge-based models; inference or reasoning models
- G06V20/20: Scenes; scene-specific elements in augmented reality scenes
- G06V2201/07: Indexing scheme for image or video recognition; target detection
- Y02D10/00: Energy efficient computing
Abstract
The invention relates to an image scene retrieval system based on a multi-modal knowledge graph, which receives retrieval information and returns retrieval results through a user interface, and which comprises: a scene acquisition module for acquiring a plurality of image scenes containing different target objects; a scene analysis module for performing scene understanding on the acquired images to obtain the different target objects and the relations between them; a multi-modal data management module for storing and querying the target objects and their relations; and a retrieval interaction module for extracting keywords from the retrieval information, acquiring the node information corresponding to the keywords, and finally creating a data exchange file that is returned to the user interface. The invention also provides an image scene retrieval method based on the multi-modal knowledge graph. The system and method reduce the difficulty of storing the results of multiple image scenes, and present the target objects and the relations between them simply, intuitively and comprehensively, while improving both the accuracy and the efficiency of scene retrieval.
Description
Technical Field
The invention relates to the fields of knowledge graphs, scene understanding and information retrieval, and in particular to an image scene retrieval system and method based on a multi-modal knowledge graph.
Background
In the task of scene understanding, although the target objects in an image are the building blocks of the scene graph, it is the relations between those objects that usually determine whether the scene as a whole can be correctly interpreted. Relationship reasoning is a very challenging part of scene understanding: it involves not only detecting low-level entities (i.e., target objects) but also identifying the high-level interactions between them (e.g., spatial relationships, actions, sizes). Understanding the diversity of relationship types not only enables richer question-answering models, but also helps improve image retrieval, scene graph parsing, scene description, and many other visual reasoning tasks.
The field of information retrieval began in the 1950s, when it mainly processed structured text: according to a user's requirements, relevant information is found in information collections stored under a given organization. With the development of multimedia technology, information retrieval has gradually come to include multi-modal information such as audio and images. Because the content of such information is difficult to describe directly and lacks explicit structure, mature techniques for retrieving non-text documents still depend on textual descriptions of those documents rather than on the multi-modal content itself, which limits the accuracy of image scene retrieval. Moreover, current systems cannot search for several objects within the same scene, so retrieval efficiency is low.
A knowledge graph, known in library and information science as knowledge domain visualization or knowledge domain mapping, is a family of graphs that shows the development process and structural relationships of knowledge, using visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and its interrelationships. A knowledge graph is essentially a large structured semantic knowledge base that stores different entities and the relations between them; the entities and relations can be defined according to requirements, and graph-structured visualization can be realized through corresponding means.
Disclosure of Invention
Therefore, the invention provides an image scene retrieval system and method based on a multi-modal knowledge graph, which describe the current scene more comprehensively by fusing the scene understanding result of an image with other corresponding modal information, and which store the information related to the image scene in a knowledge graph, so that the image scene retrieval task can be completed more accurately, efficiently and quickly.
To this end, the invention provides an image scene retrieval system based on a multi-modal knowledge graph, which receives retrieval information through a user interface and returns retrieval results, and which comprises:
The scene acquisition module is used for acquiring a plurality of image scenes containing different target objects.
The scene analysis module is used for performing scene understanding on the images acquired by the scene acquisition module to obtain different target objects and the relations between them.
The multi-modal data management module is used for storing and querying the target objects and the relations among the target objects.
The retrieval interaction module is used for extracting keywords from the retrieval information, acquiring the node information corresponding to the keywords from the multi-modal data management module, and finally creating a data exchange file that is returned to the user interface.
The scene acquisition module is configured to acquire the image scenes through data crawling or a camera module.
The scene analysis module comprises:
The target detection sub-module is used for obtaining the target object features in the image by means of a target detection algorithm and framing the target objects contained in the image.
The relation reasoning sub-module is used for inputting the image features into the selected neural network model and estimating the predicate relations existing between the target objects.
The target object features include a target object bounding box, the coordinate information of the bounding box, and the identified target object class.
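As an illustrative sketch (the patent specifies the content of the target object features, not a concrete data layout), the outputs of the two sub-modules can be modelled as plain records; all type and field names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    """One target object feature record, as produced by the target detection sub-module."""
    x1: int  # bounding box, top-left corner (pixels)
    y1: int
    x2: int  # bounding box, bottom-right corner
    y2: int
    category: str   # identified target object class
    instance: int   # instance number within the image scene

@dataclass
class RelationTriple:
    """A predicate relation between two objects, as estimated by the relation reasoning sub-module."""
    subject: int    # instance number of the subject object
    predicate: str  # e.g. "on", "under", "holding"
    obj: int        # instance number of the object

chair = DetectedObject(120, 340, 260, 520, "chair", 0)
floor = DetectedObject(0, 480, 640, 640, "floor", 1)
rel = RelationTriple(chair.instance, "on", floor.instance)
print(rel)  # RelationTriple(subject=0, predicate='on', obj=1)
```

The relation triples produced this way are exactly what the multi-modal data management module later stores as edges between object nodes.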
The multi-modal data management module includes:
The multi-modal map generation sub-module is configured to store each image scene and its multi-modal information in the same node of the knowledge graph, store each target object and its multi-modal information in the same node of the knowledge graph, and store the association between a node holding scene information and a node holding target object information on the edge of the corresponding nodes in the knowledge graph.
The data management sub-module is configured as a database operation interface: it inserts the information in the knowledge graph into a database, acquires the corresponding information from the database according to the search information, stores it in the data exchange file and returns it to the user interface.
When the multi-modal map generation sub-module stores an image scene, the node tag value is set to the image scene, and the node stores the binary stream or picture address of the original image of the scene, the unique id of the scene in the knowledge graph system, the image size information, the number of target objects contained, the scene description, and the other modal information related to the scene. When a target object is stored, the node tag value is set to the target object, and the node stores the binary stream or picture address of the screenshot of the object in the image scene, the unique id of the object in the knowledge graph system, the screenshot size information, the target category, the id of the image scene to which it belongs, and its instance number within the scene.
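As a sketch of the two node definitions (the patent fixes what each node stores, not a schema), the field names and values below are hypothetical:

```python
# Hypothetical field names; the patent specifies the content of each node, not its layout.
scene_node = {
    "label": "scene",
    "image": "/data/office_01.jpg",       # binary stream or picture address of the original image
    "node_id": "scene-0001",              # unique id in the knowledge graph system
    "size": (640, 480),                   # image size information
    "num_objects": 2,                     # number of target objects contained
    "description": "office meeting room", # scene description
    "extra_modalities": {"geo": (31.23, 121.47)},  # e.g. longitude/latitude of acquisition
}
object_node = {
    "label": "object",
    "crop": "/data/office_01_chair.jpg",  # screenshot of the object within the scene
    "node_id": "obj-0001",                # unique id in the knowledge graph system
    "size": (140, 180),                   # screenshot (bounding box) size
    "category": "chair",                  # target category
    "scene_id": "scene-0001",             # id of the image scene the object belongs to
    "instance": 0,                        # instance number within the scene
}
print(object_node["scene_id"] == scene_node["node_id"])  # True
```

The `scene_id` field is what allows the affiliation edge between an object node and its scene node to be created later.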
The data exchange file is configured to store the unique ids of the different nodes in the knowledge graph and the correspondences among those ids.
The invention also provides an image scene retrieval method based on the multi-modal knowledge graph, which comprises the steps of:
step S200, receiving a search request and analyzing the type of the search information, and if the search information is a common text string, entering step S201; if the search information includes multi-modal information, the process proceeds to step S202.
Step S201, analyzing the common text character string, extracting keywords, searching the database according to the keywords, and returning corresponding node information.
Step S202, extracting corresponding scenes, target objects or relation information among the scenes and the target objects in the retrieval information by utilizing the scene analysis module, retrieving stored multi-mode information in the database, and returning key word information of corresponding nodes and relation edges.
Step S203, retrieving the relation among the scene name, the target class name and the target object in the data conversion file according to the key words, and returning the node information meeting the retrieval requirement.
Step S204, according to the search result, generating a data exchange file through the search interaction module and returning the result to the user interface.
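The steps above can be sketched as a small dispatch routine; every helper here is a toy stand-in for the module the patent describes (keyword extraction, scene analysis, database lookup, exchange-file creation), with hypothetical names and data:

```python
def extract_keywords(text):
    # Toy stand-in for step S201's word segmentation and tagging
    return [w for w in text.lower().split() if w not in {"the", "a", "on"}]

def parse_multimodal(info):
    # Toy stand-in for step S202's scene analysis of multi-modal input
    return list(info.get("tags", []))

def search_nodes(keywords, index=None):
    # Toy stand-in for step S203's database lookup by keyword
    index = index or {"chair": ["obj-1"], "office": ["scene-1"]}
    hits = []
    for kw in keywords:
        hits.extend(index.get(kw, []))
    return hits

def build_exchange_file(node_ids):
    # Toy stand-in for step S204's data exchange file
    return {"nodes": node_ids}

def handle_request(query):
    if isinstance(query, str):          # step S200: plain text string
        keywords = extract_keywords(query)
    else:                               # step S200: multi-modal information
        keywords = parse_multimodal(query)
    return build_exchange_file(search_nodes(keywords))

print(handle_request("chair in the office"))  # {'nodes': ['obj-1', 'scene-1']}
```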
The step S201 includes:
In step S2011, the input text string is segmented and part-of-speech tagged; stems tagged as nouns are extracted as image scene names and target object names, and stems tagged as prepositions are extracted as the relations between target objects.
In step S2012, similarity calculation is performed on the extracted stems based on an ontology knowledge base; several candidate synonyms similar to each extracted stem are selected, and the selected candidate synonyms together with the extracted stems are used as the keywords to be retrieved.
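Step S2012 can be illustrated with a toy similarity calculation; a real system would use word2vec vectors and an ontology knowledge base, whereas the two-dimensional vectors and function names below are invented for the example:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy word vectors; a real system would obtain these from word2vec features.
vectors = {
    "chair": [0.9, 0.1], "seat": [0.85, 0.2], "stool": [0.8, 0.3],
    "table": [0.2, 0.9],
}

def expand(stem, candidates, k=2):
    """Return the stem plus its k most similar candidate synonyms as retrieval keywords."""
    ranked = sorted(candidates, key=lambda w: cosine(vectors[stem], vectors[w]), reverse=True)
    return [stem] + ranked[:k]

print(expand("chair", ["seat", "stool", "table"]))  # ['chair', 'seat', 'stool']
```

In the embodiment k would be 5, and the candidate pool would come from the ontology knowledge base rather than a fixed dictionary.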
The image scene retrieval system based on the multi-modal knowledge graph provided by the invention is not an ordinary question-answer retrieval system. It uses the knowledge graph as the carrier and image processing algorithms as the tool: by fusing the target detection results and the inferred relations obtained from scene understanding with other modal information, multi-modal data is stored directly as the content of nodes in the knowledge graph, a corresponding knowledge graph is generated according to a fixed structure, and the scene retrieval function is then completed on the basis of the graph structure. The invention also provides an image scene retrieval method based on the multi-modal knowledge graph, which reduces the difficulty of storing the results of multiple image scenes and presents the target objects and the relations between them simply, intuitively and comprehensively; moreover, by virtue of the advantages of the knowledge graph, pictures can be retrieved directly, which improves both the accuracy and the efficiency of scene retrieval.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a system block diagram of an image scene retrieval system based on a multimodal knowledge-graph, in accordance with an embodiment of the invention;
fig. 2 is a flowchart of an image scene retrieval method based on a multi-modal knowledge-graph, in accordance with an embodiment of the invention.
Detailed Description
The following description of the preferred embodiments of the present invention is given with reference to the accompanying drawings, so that the function and features of the present invention can be better understood.
As shown in fig. 1, the image scene retrieval system 100 based on the multi-modal knowledge graph of the present invention includes a scene acquisition module 101, a scene analysis module 102, a multi-modal data management module 103, and a retrieval interaction module 104, connected in sequence. Each module is described below, taking the multiple scenes of a science park as an example.
The scene acquisition module 101 acquires the input of the retrieval system, i.e., a plurality of image scenes containing different target objects. Acquisition can be performed either by data crawling, in which the system searches web pages according to URL (uniform resource locator) information input by the user and crawls the corresponding image scenes, or by a camera module, i.e., a component in a terminal device capable of shooting and generating RGB (three-channel color) or RGB-D (three-channel plus depth) images.
In this embodiment, a mobile phone is selected to sample scenes of different offices in a park, and a plurality of office image scenes containing different target objects are obtained after shooting.
The scene analysis module 102 performs the scene understanding task on the images acquired by the scene acquisition module 101 to obtain high-level image information, namely the different target objects and the relations among them. The scene analysis module 102 comprises a target detection sub-module and a relation reasoning sub-module.
The target detection sub-module obtains image features using a target detection algorithm, frames the target objects of specific categories contained in the input image, and predicts the coordinate positions of the target bounding boxes in the current image scene. The target detection algorithm may be a prior-art R-CNN-series or YOLO-series algorithm; the image features comprise the target bounding boxes, their coordinate information and the identified target categories; the specific categories are target categories of interest, designated in advance according to user requirements.
And the relation reasoning sub-module inputs the image characteristics obtained by the target detection sub-module into the selected neural network model to estimate the interactive relation existing between the target objects.
In this embodiment, a Faster R-CNN model trained on the MS COCO dataset is selected to perform target detection on the images acquired by the scene acquisition module 101. The positions and categories of the resulting target bounding boxes are then input into a MOTIFNET model trained on the Visual Genome dataset, or another suitable LSTM (long short-term memory) network, for relationship prediction. The predicted relationship types are predicates and spatial relationships (learned from coordinate positions plus image features), which makes it possible to search for scenes containing specific object combinations: searching for "a person riding a horse" or "apples on the tree" returns the correct scenes, rather than a person leading a horse or apples under the tree.
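The learned relation model itself is delegated to existing networks; purely as an illustrative stand-in, a coarse spatial predicate can be read off the bounding-box geometry alone. The function below and its 20-pixel threshold are hypothetical, not part of the patent:

```python
def spatial_predicate(subj, obj):
    """Guess a coarse spatial predicate from two bounding boxes (x1, y1, x2, y2).

    Toy stand-in for the learned relation model (e.g. MOTIFNET trained on
    Visual Genome); here only the box geometry is used.
    """
    sx1, sy1, sx2, sy2 = subj
    ox1, oy1, ox2, oy2 = obj
    horizontally_aligned = sx1 < ox2 and ox1 < sx2
    if horizontally_aligned and abs(sy2 - oy1) < 20:
        return "on"      # subject's bottom edge meets the object's top edge
    if horizontally_aligned and sy1 >= oy2:
        return "under"   # subject lies entirely below the object
    return "near"

apple, tree_crown = (100, 40, 140, 80), (60, 70, 200, 260)
print(spatial_predicate(apple, tree_crown))  # on
```

Combining such geometric cues with learned image features is what lets the embodiment distinguish "apples on the tree" from apples under it.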
The multi-modal data management module 103 stores the image scenes, the different target objects and inter-object relations obtained by the scene analysis module 102, together with the other multi-modal information corresponding to each scene or object, into a graph-structured knowledge graph, and generates a visual interface. The module 103 comprises a multi-modal map generation sub-module and a data management sub-module.
The multi-modal map generation sub-module fuses each image scene, or each target object framed in an image scene, with its other modal information: each image scene and its multi-modal information are stored in a single node of the knowledge graph, and each target object and its multi-modal information are stored in a single node of the knowledge graph; this is the node definition. The multi-modal information here refers to information from modalities other than the image itself, such as the time or geographic position at which the scene was acquired, which is fused into the node as an attribute of the current scene.
The multi-modal map generation sub-module also stores the correspondence between nodes holding scene information and nodes holding target object information on the edges of the corresponding nodes in the knowledge graph. The correspondence comprises affiliation relations and interaction relations: if two nodes store an image scene and a target object respectively, the relation is the affiliation between the two; if two nodes store different target objects of the same image scene, the relation is the interaction between the two objects in that scene.
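The two kinds of relation edge can be sketched as follows; the edge schema and the "contains" type name are assumptions, chosen only to illustrate affiliation versus interaction edges:

```python
def make_edges(scene_id, object_ids, relations):
    """Create knowledge-graph edges: affiliation edges between a scene node and its
    object nodes, plus interaction edges between objects of the same scene.
    (Hypothetical schema; the patent fixes the semantics, not the field names.)"""
    edges = [{"from": scene_id, "to": oid, "type": "contains"} for oid in object_ids]
    for subj, predicate, obj in relations:
        edges.append({"from": subj, "to": obj, "type": predicate})
    return edges

edges = make_edges("scene-1", ["obj-1", "obj-2"], [("obj-1", "on", "obj-2")])
print(len(edges))  # 3
```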
The node definition is divided into two cases: 1) when an image scene is stored, the node tag value is set to the scene, and the node stores the binary stream or picture address of the original image of the scene, the unique id of the scene in the knowledge graph system, the image size information, the number of target objects contained, the scene description, and the other modal information related to the scene; 2) when a target object in a scene is stored, the node tag value is set to the target object, and the node stores the binary stream or picture address of the screenshot of the object in the image scene, the unique id of the object in the knowledge graph system, the screenshot size information (the size of the target bounding box), the target category, the id of the image scene to which the object belongs, and its instance number within the scene.
The data management sub-module imports the knowledge graph data generated by the multi-modal map generation sub-module into the database, receives the search keywords generated by the retrieval interaction module, and queries and extracts the corresponding knowledge graph data from the database.
In this embodiment, the geographic position information of each scene is fused in, and the system can locate the scene on a map by calling a map API. Nodes are defined for each office image scene and the target objects it contains, in two cases: 1) when an image scene is stored, the node tag value is set to the scene, and the node stores the local storage address of the original image, the unique id of the scene in the knowledge graph system, the image size information, the number of target objects contained, the scene description, and the longitude and latitude coordinates of the scene's geographic position; 2) when a target object in a scene is stored, the node tag value is set to the target object, and the node stores the local storage address of the screenshot of the object in the scene, the unique id of the object in the knowledge graph system, the screenshot size information, the target category, the id of the scene to which it belongs, and its instance number within the scene. Relation edges are then created according to the affiliation between each image scene and its target objects and the interactions between different target objects, and the contents of the nodes and edges are imported into the database.
The retrieval interaction module 104 extracts the required keywords from the search information input by the user and returns the search results to the front end: it obtains the search results by calling the data management sub-module, generates a corresponding data exchange file and returns it to the front end for display. The data exchange file may be in json, xml, csv, txt or a similar format; in this embodiment, the retrieved node content is stored in an "attributes.json" file, and the relation-edge content in a "mapping.json" file.
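A minimal sketch of this exchange-file output, writing the node content to "attributes.json" and the relation-edge content to "mapping.json"; the helper name and the toy node and edge records are assumptions:

```python
import json
import os
import tempfile

def export_exchange_files(nodes, edges, out_dir):
    """Write retrieved node content to attributes.json and relation-edge content
    to mapping.json, as in the embodiment's data exchange files."""
    with open(os.path.join(out_dir, "attributes.json"), "w") as f:
        json.dump(nodes, f, indent=2)
    with open(os.path.join(out_dir, "mapping.json"), "w") as f:
        json.dump(edges, f, indent=2)

out = tempfile.mkdtemp()
export_exchange_files([{"node_id": "scene-1"}],
                      [{"from": "scene-1", "to": "obj-1"}], out)
print(sorted(os.listdir(out)))  # ['attributes.json', 'mapping.json']
```

The front end can then render the graph directly from these two files without querying the database again.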
In this example, text or images may be entered in the user interface to retrieve the desired image scene or target object.
The invention also provides a retrieval method for the image scene retrieval system 100 based on the multi-modal knowledge graph; the overall flow is shown in fig. 2. Two types of retrieval mode can be distinguished according to the input information:
(1) Semantic matching retrieval based on node content, i.e., searching by describing semantic information such as scene names, target object names or the relations between objects; the corresponding search keywords are scene names, target object names and inter-object relations. This corresponds to step S201 described below.
(2) Retrieval based on the multi-modal information fused into the nodes, where the retrieval keys are the multi-modal information of the image scene (such as longitude and latitude coordinates or scene picture information). This corresponds to step S202 described below.
Specifically, the image scene retrieval method based on the multi-modal knowledge graph comprises the following steps:
step S200, receiving a search request and analyzing the type of search information, if the search information is text information input by a user. If the search information is a common text string, the step S201 is entered; if the search information includes multi-modal information, such as a geographic location or an image, the process proceeds to step S202.
Step S201, analyzing the text character string, extracting keywords, and searching the data conversion file according to the keywords, so as to return corresponding node information. The method specifically comprises the following steps:
step S2011: and performing word segmentation and part-of-speech tagging on the input text, extracting word stems with part-of-speech tagging as nouns as image scene names and target object names, and extracting word stems with part-of-speech tagging as prepositions as relations among target objects.
Step S2012: and (3) carrying out similarity calculation on the extracted stem based on the ontology knowledge base, selecting a plurality of candidate synonyms, such as 5 candidate synonyms, which are ranked at the top of the similarity of the extracted stem, and taking the selected candidate synonyms and the stem as keywords to be searched.
In this embodiment, firstly, input text is processed by using Stanford CoreNLP, then candidate synonyms are extracted based on a synonym set (syncet) and a generic word set (Class word) in WordNet, and then word2vec features are used to extract feature vectors of two dimensions. The similarity between the candidate synonyms and the stem words can be obtained by calculating the distances between the feature vectors and the word vectors of the stem words in two dimensions, the smaller the distance is, the larger the similarity is, and the first 5 candidate synonyms and the stem words with the largest similarity with the stem words are selected as keywords to be searched.
Step S202, according to the multi-modal information, such as the local image of the scene, the scene analysis module is utilized to extract the corresponding scene, the target object or the relation information among the corresponding scenes in the search information, the stored multi-modal information is searched in the database, and the key word information of the corresponding nodes and the relation edges is returned.
Step S203: and searching the relation among the scene name, the target class name and the target object in the database according to the keywords, and returning the node information of the image scene or the target object meeting the searching requirement.
Step S204, according to the search result, generating a data exchange file and returning the result to the front-end user.
The retrieval results of both retrieval types are node ids, which the visualization sub-module of the knowledge graph system uses to display the corresponding scenes or target objects and the relations between them at the front end. For the second retrieval type, the visual presentation is divided into two cases: 1) if the search information contains only multi-modal information, taking the geographic position as an example, the system displays a thumbnail of the current scene at the front end through the visualization sub-module, and provides a page-jump entry to show more image scenes or target objects at that geographic position and the relations between them; 2) if the search information additionally includes semantic information, the system returns the node ids in the scene corresponding to the geographic position by applying steps S2011-S2012, and the visualization sub-module then displays the corresponding scenes or target objects and the relations between them at the front end.
In this embodiment, the user can select either of the two retrieval modes or use them in combination. Assuming that the longitude and latitude of the current science park are (a, b), after the user enters "(a, b) + meeting room + chair on the floor" in the retrieval system, the system first performs word segmentation and part-of-speech tagging with Stanford CoreNLP, obtaining the coordinate pair, the nouns "meeting room", "chair" and "floor", and the preposition "on". It searches the attribute file for the node ids corresponding to the coordinates, performs similarity matching on the extracted stems, and selects the 5 highest-ranked candidate synonyms together with the original stems as keywords. Based on the data exchange files "attributes.json" and "mapping.json", it then screens, from the nodes found by the coordinates, the node ids whose scene description matches "meeting room", whose target category matches "chair" and whose relation matches "on the floor" (or their candidate synonyms). Finally, the retrieval interaction module displays the corresponding scenes or target objects and the relations between them at the front end, and calls the map API to locate the node on the map.
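The combined query of this example can be split into its coordinate part and its text part with a few lines of parsing; the query format and the function below are hypothetical reconstructions of the embodiment's input handling:

```python
import re

def parse_query(text):
    """Split a combined query into latitude/longitude coordinates and text terms.

    Assumed query format, following the embodiment:
    "(a, b) + meeting room + chair on the floor".
    """
    m = re.match(r"\(([-\d.]+),\s*([-\d.]+)\)", text)
    coords = (float(m.group(1)), float(m.group(2))) if m else None
    rest = text[m.end():] if m else text
    terms = [t.strip() for t in rest.split("+") if t.strip()]
    return coords, terms

coords, terms = parse_query("(31.23, 121.47) + meeting room + chair on the floor")
print(coords, terms)  # (31.23, 121.47) ['meeting room', 'chair on the floor']
```

The coordinates select the candidate scene nodes, while the text terms feed the keyword-expansion step (S2011-S2012) before the final screening.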
According to the invention, the target objects in the original image are extracted and inserted into the knowledge graph as individual nodes. Current image search engines either rely on manual text labels attached to pictures, effectively replacing image search with text search, or cannot search for several objects within one scene; the invention therefore achieves higher accuracy than ordinary image search.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit its scope; various modifications can be made to the embodiment described above. All simple, equivalent changes and modifications made in accordance with the claims and the specification of this application fall within the scope of the patent claims. Matters not described in detail herein belong to the conventional art.
Claims (7)
1. An image scene retrieval system based on a multi-modal knowledge graph, which receives retrieval information through a user interface and returns retrieval results, characterized by comprising:
the scene acquisition module is used for acquiring a plurality of image scenes containing different target objects;
the scene analysis module is used for carrying out scene understanding on the images acquired by the scene acquisition module to obtain different target objects and the relation between the different target objects; comprising the following steps:
the target detection sub-module is used for obtaining image features by utilizing a target detection algorithm and framing the target objects contained in the image;
the relation reasoning sub-module is used for inputting the image features into the selected neural network model and estimating the interactive relations existing between the target objects;
the multi-modal data management module is used for storing and querying the target objects and the relations among the target objects; comprising:
the multi-modal graph generation sub-module is arranged to store each image scene and its multi-modal information to the same node of the knowledge graph, store each target object and its multi-modal information to the same node of the knowledge graph, and store the relation between the node storing the scene information and the node storing the target object information on the edge connecting the corresponding nodes in the knowledge graph;
the data management sub-module is set as a database operation interface, inserts the information in the knowledge graph into a database, acquires corresponding information from the database according to the search information, stores the information in the data exchange file and returns the information to the user interface;
and the retrieval interaction module is used for extracting keywords from the retrieval information, acquiring node information corresponding to the keywords from the multi-mode data management module, and finally creating a data exchange file and returning the data exchange file to the user interface.
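The storage scheme of claim 1, with scenes and objects as nodes and relations as edges, can be illustrated by a minimal in-memory sketch. This is an assumption-laden stand-in: the patented system would persist these structures through a database operation interface, and all attribute names here are hypothetical.

```python
class MultiModalGraph:
    """Minimal in-memory stand-in for the knowledge graph of claim 1.

    Each image scene and each target object becomes a node holding its
    multi-modal information; the relation between a scene node and an
    object node is stored on the edge connecting them.
    """

    def __init__(self):
        self.nodes = {}   # node id -> attribute dict (label, image, ...)
        self.edges = []   # (scene_id, object_id, relation) triples

    def add_scene(self, node_id, **attrs):
        self.nodes[node_id] = {"label": "image_scene", **attrs}

    def add_object(self, node_id, scene_id, relation, **attrs):
        self.nodes[node_id] = {"label": "target_object", "scene_id": scene_id, **attrs}
        self.edges.append((scene_id, node_id, relation))

    def objects_in_scene(self, scene_id):
        # Query step: follow edges from a scene node to its objects.
        return [(obj, rel) for s, obj, rel in self.edges if s == scene_id]
```

The data management sub-module of claim 1 would replace the two Python containers with database inserts and queries, returning hits through the data exchange file.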
2. The multimodal knowledge-graph-based image scene retrieval system of claim 1, wherein said scene acquisition module is configured to acquire said image scene via data crawling or a camera module.
3. The multi-modal knowledge-graph based image scene retrieval system as recited in claim 1, wherein the image features include a target bounding box, coordinate information of the target bounding box, and an identified target class.
4. The image scene retrieval system based on the multi-modal knowledge graph according to claim 1, wherein when the multi-modal graph generation sub-module stores an image scene, the node label value is set to image scene, and the binary stream or picture address of the original picture of the corresponding image scene, the unique id of the image scene in the knowledge graph system, the image size information, the number of contained target objects, the scene description and other modal information related to the image scene are stored in the node; when a target object is stored, the node label value is set to target object, and the binary stream or picture address of the screenshot of the corresponding target object in the image scene, the unique id of the target object in the knowledge graph system, the screenshot size information, the target category, the id of the image scene to which it belongs and the corresponding instance number in the image scene are stored in the node.
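As a concrete illustration of the two node layouts described in claim 4, the attributes might be arranged as follows. This is only a sketch; every field name and value is hypothetical, not taken verbatim from the patent.

```python
# Hypothetical attribute layout for an image-scene node (claim 4, first half).
scene_node = {
    "label": "image_scene",
    "image": "scenes/scene_001.jpg",   # picture address (could also be a binary stream)
    "id": "scene_001",                 # unique id in the knowledge graph system
    "size": (1920, 1080),              # image size information
    "object_count": 4,                 # number of contained target objects
    "description": "meeting room",     # scene description
}

# Hypothetical attribute layout for a target-object node (claim 4, second half).
object_node = {
    "label": "target_object",
    "crop": "crops/scene_001_obj_0.jpg",  # screenshot of the object within the scene
    "id": "obj_0",                        # unique id in the knowledge graph system
    "size": (200, 350),                   # screenshot size information
    "category": "chair",                  # target category
    "scene_id": "scene_001",              # id of the image scene it belongs to
    "instance": 0,                        # instance number within the scene
}
```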
5. The multi-modal knowledge-graph-based image scene retrieval system as claimed in claim 1, wherein the data exchange file is configured to store unique ids of different nodes in the knowledge graph and correspondence between the ids of the different nodes.
6. An image scene retrieval method based on a multi-modal knowledge graph, characterized in that the image scene retrieval system based on the multi-modal knowledge graph as claimed in claim 1 is built, the method comprising the following steps:
step S200, receiving a search request and analyzing the type of the search information; if the search information is a common text string, the process proceeds to step S201; if the search information includes image information, the process proceeds to step S202;
step S201, analyzing the common text character string, extracting keywords, searching the database according to the keywords, and returning corresponding node information;
step S202, extracting, by the scene analysis module, the corresponding scene, target object or relation information between them from the retrieval information, retrieving the stored multi-modal information in the database, and returning corresponding node information;
step S203, retrieving the relations among the scene names, the target category names and the target objects in the data conversion file according to the keywords, and returning the node information meeting the retrieval requirement;
step S204, according to the search result, a data exchange file is generated through the search interaction module, and is returned to the user interface.
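Steps S200 through S204 amount to a type-based dispatch followed by packaging the hits into a data exchange file. A minimal sketch, in which the three callables are hypothetical stand-ins for the modules named in the claim:

```python
def handle_request(request, text_search, image_search, make_exchange_file):
    """Sketch of steps S200-S204 of claim 6."""
    if isinstance(request, str):          # S200: a common text string
        node_ids = text_search(request)   # S201/S203: keyword retrieval
    else:                                 # S200: contains image information
        node_ids = image_search(request)  # S202: scene-analysis retrieval
    return make_exchange_file(node_ids)   # S204: data exchange file for the UI

# Example with trivial stand-ins:
exchange = handle_request(
    "meeting room + chair",
    text_search=lambda q: [1, 2],
    image_search=lambda img: [3],
    make_exchange_file=lambda ids: {"nodes": ids},
)
# exchange is {"nodes": [1, 2]}
```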
7. The method for retrieving an image scene based on a multimodal knowledge graph as claimed in claim 6, wherein the step S201 includes:
step S2011, word segmentation and part-of-speech tagging are carried out on the input common text character string; stems tagged as nouns are extracted as the image scene name and the target object name, and stems tagged as prepositions are extracted as the relation between target objects;
step S2012, similarity calculation is carried out on the extracted stem based on the ontology knowledge base, a plurality of candidate synonyms similar to the extracted stem are selected, and the selected candidate synonyms and the extracted stem are used as keywords to be retrieved.
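The synonym expansion of step S2012 can be sketched with a plain string-similarity ranking. Here `difflib` stands in for the ontology-based similarity measure of the claim, and the vocabulary is a hypothetical stand-in for the ontology knowledge base:

```python
from difflib import SequenceMatcher

def top_synonyms(stem, vocabulary, k=5):
    """Return the extracted stem plus the k vocabulary terms most similar
    to it, mirroring step S2012 (keywords = stem + top-k candidates)."""
    ranked = sorted(
        vocabulary,
        key=lambda term: SequenceMatcher(None, stem, term).ratio(),
        reverse=True,
    )
    return [stem] + ranked[:k]
```

The claim's top-5 selection corresponds to the default `k=5`; a real ontology-based measure would rank by semantic rather than surface similarity.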
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010478948.3A CN111694965B (en) | 2020-05-29 | 2020-05-29 | Image scene retrieval system and method based on multi-mode knowledge graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111694965A (en) | 2020-09-22 |
CN111694965B (en) | 2023-06-13 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018006472A1 (en) * | 2016-07-07 | 2018-01-11 | 深圳狗尾草智能科技有限公司 | Knowledge graph-based human-robot interaction method and system |
CN109885660A (en) * | 2019-02-22 | 2019-06-14 | 上海乐言信息科技有限公司 | A kind of question answering system and method based on information retrieval that knowledge mapping is energized |
CN110750656A (en) * | 2019-10-29 | 2020-02-04 | 上海德拓信息技术股份有限公司 | Multimedia detection method based on knowledge graph |
CN110895561A (en) * | 2019-11-13 | 2020-03-20 | 中国科学院自动化研究所 | Medical question and answer retrieval method, system and device based on multi-mode knowledge perception |
Non-Patent Citations (2)
Title |
---|
Question Answering Method Based on Multi-modal Knowledge-aware Attention Mechanism; Zhang Yingying et al.; Journal of Computer Research and Development (Issue 05); full text *
Design and Implementation of an Information Query System Based on Knowledge Graph; Yang Rong et al.; Computer & Digital Engineering (Issue 04); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||