
CN118133191B - Target detection method and device for multi-mode data - Google Patents

Target detection method and device for multi-mode data

Info

Publication number
CN118133191B
CN118133191B (application CN202410557496.6A)
Authority
CN
China
Prior art keywords
data
layer information
text
target
abstract layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410557496.6A
Other languages
Chinese (zh)
Other versions
CN118133191A (en)
Inventor
刘微
陈维强
狄建锴
郑维学
高语函
张建安
鞠全永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd filed Critical Hisense Group Holding Co Ltd
Priority to CN202410557496.6A priority Critical patent/CN118133191B/en
Publication of CN118133191A publication Critical patent/CN118133191A/en
Application granted granted Critical
Publication of CN118133191B publication Critical patent/CN118133191B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G06F 18/2431: Pattern recognition; classification techniques relating to the number of classes; multiple classes
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/0455: Neural networks; combinations of networks; auto-encoder networks; encoder-decoder networks
    • G06V 10/25: Image or video recognition; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/82: Image or video recognition using pattern recognition or machine learning using neural networks
    • G06V 20/46: Scenes; extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
    • G10L 25/30: Speech or voice analysis characterised by the analysis technique using neural networks
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a target detection method and device for multi-mode data, which are used for solving the problems of high computational cost and poor scene adaptability in the prior art. The method comprises the following steps: determining multi-mode data to be detected and a detection task, wherein the multi-mode data comprise at least one of the data types video, sound and image, and the detection task comprises at least one detection task text; extracting information from the at least one detection task text and from the multi-mode data respectively, to obtain text abstract layer information of the at least one detection task text and multi-mode data abstract layer information of the multi-mode data; determining, according to an alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type, target abstract layer information in the multi-mode data abstract layer information that matches the text abstract layer information; and determining a target detection result corresponding to the at least one detection task text according to the target abstract layer information.

Description

Target detection method and device for multi-mode data
Technical Field
The present application relates to the field of target detection technologies, and in particular, to a method and an apparatus for detecting a target of multi-mode data.
Background
In the prior art, the multi-target detection method performs feature fusion based on the aggregated features of visual images and the aggregated features of infrared images to obtain fused features, and then performs detection on the fused features. However, this method requires feature aggregation and fusion and detection through the fused features, so the computational cost is large and the resource consumption is high. In addition, the method can only process image data types and cannot be used for target detection on data of other data types, so the scene adaptability is poor.
Disclosure of Invention
The embodiment of the application provides a target detection method for multi-mode data, which is used for solving the problems of high computational cost and poor scene adaptability in the prior art.
In a first aspect, an embodiment of the present application provides a method for detecting a target of multi-modal data, including:
Determining multi-mode data to be detected and a detection task, wherein the multi-mode data comprise at least one data type of video, sound and images; the detection task comprises at least one detection task text; the detection task text is used for expressing the detection requirement of the multi-mode data;
Respectively extracting information from the at least one detection task text and the multi-modal data to obtain text abstract layer information of the at least one detection task text and multi-modal data abstract layer information of the multi-modal data;
Determining target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information according to an alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type;
and determining a target detection result corresponding to the at least one detection task text according to the target abstract layer information.
In one possible implementation manner, the determining, according to an alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type, target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information includes:
Respectively carrying out position coding on the multi-mode data abstraction layer information and the text abstraction information of the at least one detection task text to obtain data position coding information of the multi-mode data and task position coding information corresponding to the at least one detection task text;
And determining target abstract layer information in the multi-mode data abstract layer information according to the corresponding relation between the data position coding information and the task position coding information aiming at each detection task text.
In one possible implementation manner, the determining the target abstract layer information in the multi-mode data abstract layer information according to the corresponding relation between the data position coding information and the task position coding information includes:
Determining target data position coding information matched with task position coding information corresponding to the text abstraction layer information of the detection task text from the data position coding information;
and taking the multi-mode data abstraction layer information corresponding to the target data position coding information as target abstraction layer information.
In one possible implementation, the method further includes:
When the multi-mode data comprises a first data type and a second data type, determining a first network model corresponding to the first data type and a second network model corresponding to the second data type;
Inputting the data corresponding to the first data type in the at least one detection task text and the multi-mode data into the first network model, so that the first network model outputs a target detection result of the data corresponding to the first data type for the at least one detection task text; and inputting the data corresponding to the second data type in the at least one detection task text and the multi-mode data into the second network model, so that the second network model outputs a target detection result of the data corresponding to the second data type for the at least one detection task text.
In a possible implementation manner, the determining, according to the target abstract layer information, a target detection result corresponding to the at least one detection task text includes:
Determining a first target detection result according to the first target abstract layer information; wherein the first target abstract layer information is determined by the first network model from data corresponding to the first data type based on at least one detection task text;
determining a second target detection result according to the second target abstract layer information; wherein the second target abstract layer information is determined by the second network model from data corresponding to the second data type based on at least one detection task text;
And determining a target detection result corresponding to at least one detection task text according to the first target detection result and the second target detection result.
In one possible implementation, the method further includes:
when a preset result exists in a detection task corresponding to a target detection task text, comparing the target detection result with the preset result;
When the target detection result is determined to be different from the preset result, inputting a target detection task text and the multi-mode data into a multi-mode network model to obtain a verification detection result;
and determining the detection result of the multi-mode data according to the target detection result and the verification detection result.
In a second aspect, an embodiment of the present application provides a target detection apparatus for multi-modal data, including:
the system comprises a determining module, a detecting module and a judging module, wherein the determining module is used for determining multi-mode data to be detected and a detection task, and the multi-mode data comprise at least one data type of video, sound and images; the detection task comprises at least one detection task text; the detection task text is used for expressing the detection requirement of the multi-mode data;
The extraction module is used for respectively extracting information of the at least one detection task text and the multi-modal data to obtain text abstract layer information of the at least one detection task text and multi-modal data abstract layer information of the multi-modal data;
the determining module is further configured to determine target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information according to an alignment rule between abstract layer information of each data type and abstract layer information of a text data type; and determining a target detection result corresponding to the at least one detection task text according to the target abstract layer information.
In one possible implementation manner, the determining module is specifically configured to, when determining, in the multi-mode data abstraction layer information according to an alignment rule between abstraction layer information of each data type and abstraction layer information of a text data type, target abstraction layer information matched with the text abstraction layer information in the multi-mode data abstraction layer information:
Respectively carrying out position coding on the multi-mode data abstraction layer information and the text abstraction information of the at least one detection task text to obtain data position coding information of the multi-mode data and task position coding information corresponding to the at least one detection task text;
And determining target abstract layer information in the multi-mode data abstract layer information according to the corresponding relation between the data position coding information and the task position coding information aiming at each detection task text.
In one possible implementation manner, the determining module is specifically configured to, when determining the target abstract layer information in the multi-mode data abstract layer information according to a correspondence between the data position coding information and the task position coding information:
Determining target data position coding information matched with task position coding information corresponding to the text abstraction layer information of the detection task text from the data position coding information;
and taking the multi-mode data abstraction layer information corresponding to the target data position coding information as target abstraction layer information.
In one possible implementation, the determining module is further configured to:
When the multi-mode data comprises a first data type and a second data type, determining a first network model corresponding to the first data type and a second network model corresponding to the second data type;
Inputting the data corresponding to the first data type in the at least one detection task text and the multi-mode data into the first network model, so that the first network model outputs a target detection result of the data corresponding to the first data type for the at least one detection task text; and inputting the data corresponding to the second data type in the at least one detection task text and the multi-mode data into the second network model, so that the second network model outputs a target detection result of the data corresponding to the second data type for the at least one detection task text.
In a possible implementation manner, the determining module is specifically configured to, when determining, according to the target abstract layer information, a target detection result corresponding to the at least one detection task text:
Determining a first target detection result according to the first target abstract layer information; wherein the first target abstract layer information is determined by the first network model from data corresponding to the first data type based on at least one detection task text;
determining a second target detection result according to the second target abstract layer information; wherein the second target abstract layer information is determined by the second network model from data corresponding to the second data type based on at least one detection task text;
And determining a target detection result corresponding to at least one detection task text according to the first target detection result and the second target detection result.
In one possible implementation, the determining module is further configured to:
when a preset result exists in a detection task corresponding to a target detection task text, comparing the target detection result with the preset result;
When the target detection result is determined to be different from the preset result, inputting a target detection task text and the multi-mode data into a multi-mode network model to obtain a verification detection result;
and determining the detection result of the multi-mode data according to the target detection result and the verification detection result.
In a third aspect, an embodiment of the present application provides an execution apparatus, including:
A memory for storing program instructions;
And the processor is used for calling the program instructions stored in the memory and executing the method according to the obtained program instructions and different implementation manners of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the method of the first aspect and the different implementations of the first aspect.
The beneficial effects of the application are as follows:
In the application, information is extracted from the detection task and from the multi-mode data respectively, to obtain the text abstract layer information corresponding to the detection task and the multi-mode data abstract layer information of the multi-mode data. Further, according to the alignment rule between the abstract layer information of each multi-mode data type and the abstract layer information of the text data type, the target abstract layer information matching the text abstract layer information is determined in the multi-mode data abstract layer information, and the target detection result corresponding to the at least one detection task text is then determined according to the target abstract layer information. In this method, alignment among the abstract layer information of different modes' data can be achieved through the alignment rule between the abstract layer information of each mode's data and that of the text, thereby realizing target detection on multi-mode data. In addition, because the alignment is performed directly on the abstract layer information of the modal data, the method is suitable for target detection in various scenes and has good scene adaptability.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 3 is a schematic diagram of another application scenario provided in an embodiment of the present application;
FIG. 4 is a flowchart of target detection of multi-modal data according to an embodiment of the present application;
Fig. 5 is a schematic flow chart of a network training according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of target detection according to an embodiment of the present application;
fig. 7 is a schematic diagram of a result of a railroad switch inspection provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a result of attendance detection according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a target detection device for multi-modal data according to an embodiment of the present application;
fig. 10 is a schematic diagram of an execution device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In order to facilitate understanding of the target detection method and apparatus for multi-mode data provided by the embodiments of the present application, a part of terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) A token is information that represents different aspects and layers of data. In the field of text processing, a token generally refers to the smallest unit in text, such as a word, phrase, or character. In Natural Language Processing (NLP) tasks, text is first segmented and decomposed into tokens, and then subjected to semantic analysis, sentiment analysis, and other tasks. In machine learning and deep learning tasks, a token represents an abstract feature or vector representation of data. For example, in a text classification task, text may be converted into word embeddings that are processed as tokens. In Transformer-architecture models such as large language models (LLM), large vision models (LVM) and vision-language models (VLM), a token generally refers to a text entry or an abstract symbol embedding an image.
(2) The latent abstraction layer representation of a token refers to the symbol or identifier of the data corresponding to the token at a certain abstraction layer, and is used to represent or identify a specific entity, state or attribute of the data.
(3) Position coding refers to assigning to each element in a sequence (e.g., a word or a point in time) a vector that represents its position or order when processing sequence data such as natural language text or time series data. Position coding is typically used in models with attention mechanisms and position awareness so that the model can better understand the relative positional relationships between elements in a sequence. In the Transformer model, the position coding is typically composed of two parts: sine encoding and cosine encoding. These encodings are added to the word embeddings or to the representation of the input data to provide information about the position of each element in the sequence. For example, in natural language processing, if a sentence contains N words and each word is represented as a d-dimensional word embedding vector, then the position coding is an N×d matrix in which each row corresponds to a position in the sentence. By adding the position coding to the word embedding, the model obtains a complete representation of each word that includes both the semantic information of the word and its position information in the sentence. The purpose of position coding is to enable the model to distinguish between elements at different positions in a sequence and to use this position information to better capture long-range dependencies in the sequence.
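As a concrete illustration of the sinusoidal position coding described above, the following sketch builds the sine/cosine matrix and adds it to word embeddings; the sequence length and embedding dimension are illustrative values, not taken from the application.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sine/cosine position codes."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                   # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions: cosine
    return encoding

# Adding the codes to the word embeddings gives each token both its semantic
# information and its position information in the sentence.
word_embeddings = np.random.randn(12, 64)                 # 12 tokens, d_model = 64 (illustrative)
model_input = word_embeddings + sinusoidal_position_encoding(12, 64)
```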
(4) The self-attention mechanism is a mechanism for processing sequence data, and is particularly suitable for the case of long-distance dependency that cannot be well processed by a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN). It allows the model to dynamically focus on different parts of the input sequence as it processes the sequence data and adjust the output accordingly. In the self-attention mechanism, each element in the input sequence is associated with other elements, where the importance of each element to the other elements is represented by an attention weight. These weights are determined by calculating the similarity between each pair of elements and then obtained by normalization processing. Finally, the output of each element is a weighted average of the input sequence, the weights being determined by their degree of association with the other elements. The advantage of the self-attention mechanism is that it is able to capture long-range dependencies in the input sequence and is not limited by the sequence length. In addition, it is also able to process sequence data of varying length, since there is a correlation between each element and all other elements when calculating the attention weight.
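A minimal numpy sketch of the scaled dot-product self-attention computation summarized above; the projection matrices and dimensions are placeholders assumed for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model); wq/wk/wv: (d_model, d_k) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # Similarity between every pair of elements, normalized into attention weights.
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    # Each output element is a weighted average of the whole input sequence.
    return weights @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))                          # 5 elements, d_model = 16 (illustrative)
wq, wk, wv = (rng.standard_normal((16, 16)) for _ in range(3))
out = self_attention(x, wq, wk, wv)                       # (5, 16)
```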
(5) A cross-attention mechanism is an attention mechanism for interaction and association between multiple sequences. Unlike the self-attention mechanism, the cross-attention mechanism considers not only the relationships between elements within a sequence, but also the relationships between different sequences. In a cross-attention mechanism there are typically two sequences; for example, in an image captioning task, one sequence is the feature sequence of the image and the other is the word sequence of the text. Through the cross-attention mechanism, the model learns to associate each word in the text with the relevant part of the image, thereby generating appropriate captions. Specifically, a cross-attention mechanism typically has a query sequence and a key-value pair sequence. The query sequence is used to compute the attention weights, and the key-value pair sequence provides the relevant information. The attention weights are obtained by computing the similarity between the query sequence and the key-value pair sequence and then normalizing, and finally the association result between the query sequence and the key-value pair sequence is obtained. The cross-attention mechanism is widely applied to multi-modal tasks and can effectively associate and integrate information of different modes, thereby improving the performance of the model on such tasks.
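The cross-attention interaction between a text token sequence and an image token sequence can be sketched in the same way as the self-attention example above; queries come from one sequence and keys/values from the other. Shapes and projections are illustrative assumptions.

```python
import numpy as np

def cross_attention(text_tokens, image_tokens, wq, wk, wv):
    """text_tokens: (n_text, d); image_tokens: (n_image, d); wq/wk/wv: (d, d_k)."""
    q = text_tokens @ wq                     # query sequence (from the text)
    k = image_tokens @ wk                    # key sequence (from the image)
    v = image_tokens @ wv                    # value sequence (from the image)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # normalized attention weights
    # Each text token becomes a weighted average of the image features it is most
    # associated with, i.e. the association result of the two sequences.
    return weights @ v
```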
In the prior art, the multi-target detection method performs feature fusion based on the aggregated features of visual images and the aggregated features of infrared images to obtain fused features, and then performs detection on the fused features. However, this method requires feature aggregation and fusion and detection through the fused features, so the computational cost is large and the resource consumption is high. In addition, the method can only process image data types and cannot be used for target detection on data of other data types, so the scene adaptability is poor.
In view of the above problems, embodiments of the present application provide a target detection method and apparatus for multi-modal data, which extract information from the detection task and from the multi-modal data respectively, to obtain the text abstract layer information corresponding to the detection task and the multi-modal data abstract layer information of the multi-modal data. Further, according to the alignment rule between the abstract layer information of each multi-mode data type and the abstract layer information of the text data type, the target abstract layer information matching the text abstract layer information is determined in the multi-mode data abstract layer information, and the target detection result corresponding to the at least one detection task text is then determined according to the target abstract layer information. In this method, alignment among the abstract layer information of data of different modes can be achieved through the alignment rule between the abstract layer information of each mode's data and that of the text, so that target detection on multi-modal data is realized; since no feature fusion is required, the calculated amount can be effectively reduced. In addition, because the alignment is performed on the abstract layer information of the modal data, the method is suitable for target detection in various scenes and has good scene adaptability.
The following description is made for some simple descriptions of application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application, but not limiting. In the specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The target detection method for the multi-mode data provided by the embodiment of the application can be realized by the execution equipment. In some embodiments, the execution device may be a terminal device. The terminal device may be a display device having a display function. The display device may include: smart televisions, cell phones, tablet computers, and the like. In other embodiments, the executing device may be an electronic device, which may be implemented by one or more servers, which may be local servers or cloud servers. In some scenarios, the electronic device may also perform training of the object detection model in embodiments of the present application.
The structure of the execution device and the application scenario are described below taking the execution device as an electronic device as an example. An electronic device may be implemented by one or more servers. Referring to fig. 1, the server 100 may be implemented as a physical server or may be implemented as a virtual server. The server can be realized by a single server, can be realized by a server cluster formed by a plurality of servers, and can be realized by a single server or a server cluster. In fig. 2, the server 100 is connected to the data acquisition device 300 and the display device 200. The data collection device 300 may transmit the collected data to the display device 200. The server 100 may perform a target detection method of multi-modal data. In some scenarios, the server 100 may also receive a target detection task of the multimodal data sent by the display device 200, or send a target detection result of the multimodal data to the display device 200.
The application scenario is described below taking an execution device as an example of a terminal device. In fig. 3, the data acquisition device 300 and the terminal device 400 are shown as an example. The data acquisition device 300 may transmit the acquired data to the terminal device 400. The terminal device 400 may perform the target detection method of the multi-modal data according to the target detection task of the multi-modal data.
When the terminal device is a device without display function, the terminal device can be externally connected with a display so as to enable the terminal device to have the display function, thereby realizing the display of the target detection result of the multi-mode data.
It should be noted that the application scenarios and structures shown in fig. 1-3 are only examples, and the embodiments of the present application are not limited thereto.
The embodiment of the application provides a target detection method of multi-mode data, fig. 4 exemplarily shows a flow of the target detection method of multi-mode data, where the flow may be executed by an executing device, and the executing device may be a server shown in fig. 2 or a terminal device shown in fig. 3, which is not limited in particular. The specific flow is as follows:
The multimodal data to be detected and the detection task are determined 401.
The multi-mode data is data comprising at least one data type of video, sound and image; the detection task comprises at least one detection task text, and the detection task text is used for expressing the detection requirement of the multi-mode data.
As an example, in target detection of multi-mode data, the detection task text serves as an auxiliary rule for the multi-mode data and describes the detection work. For example, when attendance needs to be counted from a classroom video, the detection task text is: count the attendance in the video data.
And 402, respectively extracting information of at least one detection task text and multi-modal data to obtain text abstract layer information of at least one detection task text and multi-modal data abstract layer information of multi-modal data.
In some embodiments, the abstract layer information may represent the semantics and features of the data, so the token abstraction extraction of the detection task text may be implemented through embedding techniques. For example, in natural language processing, embedding techniques can map words, phrases, or sentences into a continuous vector space so that similar semantics have similar representations, which helps the model better understand semantics. Specifically, the input data first needs to be segmented or tokenized and converted into a discrete token sequence. For example, the detection task text is processed at word granularity and converted into numeric form. Further, the token sequence may be mapped to a continuous vector representation through the embedding technique provided by a Transformer, so as to obtain the text abstract layer information.
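A hedged sketch of this step using the Hugging Face `transformers` library; the checkpoint name and the example task text are illustrative assumptions, not part of the application.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any Transformer encoder with an embedding layer would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

task_text = "统计视频数据的出勤情况"           # "count the attendance in the video data"
inputs = tokenizer(task_text, return_tensors="pt")   # word/sub-word granularity -> numeric token IDs
with torch.no_grad():
    outputs = encoder(**inputs)

# A continuous vector per token: the text abstract layer information.
text_abstract_layer = outputs.last_hidden_state       # shape (1, seq_len, hidden_size)
```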
For image data, the abstract layer information of the image may be extracted through the embedding technique in Mamba. Specifically, Mamba may generate visual features of the image data and convert these features into vector form, referred to as the embedding of the image data, i.e. the abstract layer information of the image data. The embedding of the image data contains the abstract layer information of the image data, which carries high-level semantic information about the image content. By representing images as embeddings, images can be more easily compared, matched, and understood in semantic space.
For sound data, the abstract layer information of the sound data may be extracted through the embedding technique in Mamba. Specifically, Mamba may generate abstract features of the sound data and convert these features into vector form, referred to as the embedding of the sound data, i.e. the abstract layer information of the sound data. The embedding of the sound data contains the abstract layer information of the sound data, which may include semantic information, emotional content, and the like.
For video data, the abstract layer information extraction method for sound data and image data can be integrated to obtain the abstract layer information of the video data.
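A sketch of how the two extraction paths might be combined for video; `image_encoder` and `sound_encoder` are hypothetical helpers standing in for the Mamba-based embedding steps described above.

```python
import numpy as np

def video_abstract_layer(frames, audio, image_encoder, sound_encoder) -> np.ndarray:
    """frames: iterable of images; audio: sound track of the clip.
    image_encoder / sound_encoder (hypothetical) each return an (n_tokens, d)
    matrix of abstract layer tokens for their own modality."""
    frame_tokens = np.concatenate([image_encoder(f) for f in frames], axis=0)
    audio_tokens = sound_encoder(audio)
    # The video's abstract layer information is the combined token sequence of
    # its visual and acoustic parts.
    return np.concatenate([frame_tokens, audio_tokens], axis=0)
```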
403, Determining target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information according to an alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type.
In some embodiments, the alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type may be embodied as an alignment rule between the positions of the abstract layer information of each data type and the positions of the text abstract layer information of the detection task text.
Specifically, the multi-mode data abstraction layer information and the text abstraction information of at least one detection task text can be respectively subjected to position coding to obtain data position coding information of the multi-mode data and task position coding information corresponding to the at least one detection task text. Further, for each detection task text, target abstract layer information in the multi-mode data abstract layer information is determined according to the corresponding relation between the data position coding information and the task position coding information.
When determining target abstract layer information in multi-mode data abstract layer information according to the corresponding relation between data position coding information and task position coding information, the concrete method can be realized by the following steps: and determining target data position coding information matched with the task position coding information corresponding to the text abstraction layer information of the detection task text from the data position coding information. Further, multi-mode data abstraction layer information corresponding to the target data position coding information can be used as target abstraction layer information.
The coding rules of the multi-mode data abstraction layer information and the coding rules of the text abstraction layer information are trained in a training stage, so that the position codes of the text abstraction layer information representing the same semantic meaning are the same as the position codes of the data abstraction layer information of all mode data.
As an example, suppose the text is: find the white flowers in the image. "White" is an adjective whose corresponding abstract layer information has position code 2, and "flowers" is a noun whose corresponding abstract layer information has position code 6. During training, the training direction is that the abstract layer information of the RGB data representing "white" in the image is also encoded to position 2, and the abstract layer information of the textures and structures representing "flowers" in the image is encoded to position 6.
When target detection is performed, the data position codes of the multi-mode data abstraction layer information can be determined and compared against the text position codes of the text abstraction layer information of the detection task text; the data position codes that are identical to the text position codes are taken as the target data position codes, and the multi-mode data abstraction layer information corresponding to the target data position codes is taken as the target abstraction layer information.
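A small sketch of this matching step, following the "white flowers" example above (position codes 2 and 6); all values are illustrative.

```python
def select_target_abstract_layer(task_codes, data_codes, data_tokens):
    """Pick the multi-mode tokens whose data position code equals a task position code."""
    targets = []
    for task_code in task_codes:
        for data_code, token in zip(data_codes, data_tokens):
            if data_code == task_code:        # same code -> target abstract layer information
                targets.append(token)
    return targets

task_codes = [2, 6]                                    # "white" -> 2, "flowers" -> 6
data_codes = [5, 2, 6, 9]                              # codes assigned to the image tokens
image_tokens = ["sky token", "white-RGB token", "flower-texture token", "background token"]
print(select_target_abstract_layer(task_codes, data_codes, image_tokens))
# ['white-RGB token', 'flower-texture token']
```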
In some embodiments, sine and cosine position coding, ALiBi relative position coding, RoPE position coding, and the like may be used when position coding the text abstraction layer information and the multi-modal data abstraction layer information. Specifically, this may include spatial two-dimensional data fusion and spatial three-dimensional data mapping and fusion (both used for encoding the position information at the latent abstract layer), i.e., applying coding marks to the original positions of the tokens extracted by the abstract layer of each mode's data.
And 404, determining a target detection result corresponding to at least one detection task text according to the target abstract layer information.
In some embodiments, when the multimodal data includes a first data type and a second data type, determining a first network model corresponding to the first data type and a second network model corresponding to the second data type;
Inputting data corresponding to a first data type in at least one detection task text and multi-mode data into a first network model, so that the first network model outputs a target detection result of the data corresponding to the first data type for the at least one detection task text; and inputting the data corresponding to the second data type in the at least one detection task text and the multi-mode data into the second network model, so that the second network model outputs a target detection result of the data corresponding to the second data type for the at least one detection task text.
Further, when determining a target detection result corresponding to the at least one detection task text according to the target abstract layer information, determining a first target detection result according to the first target abstract layer information; determining a second target detection result according to the second target abstract layer information; the first target abstract layer information is determined by the first network model based on data corresponding to at least one detection task text from the first data type, and the second target abstract layer information is determined by the second network model based on data corresponding to at least one detection task text from the second data type. And further, determining a target detection result corresponding to at least one detection task text according to the first target detection result and the second target detection result.
In some embodiments, when a preset result exists in the detection task corresponding to the target detection task text, comparing the target detection result with the preset result; when the target detection result is different from the preset result, inputting the target detection task text and the multi-mode data into the multi-mode network model to obtain a verification detection result; and determining the detection result of the multi-mode data according to the target detection result and the verification detection result.
Specifically, when target detection is performed on the multi-mode data, the network model of each data type can be used for target detection first; when the target detection result differs from the preset result, the multi-mode data and the detection task can be input into a large model for further detection to obtain a verification detection result. The detection result of the multi-mode data is then determined from the target detection result and the verification detection result. The two rounds of detection can effectively improve the accuracy and precision of target detection. As an example, when attendance statistics are performed on an image, assume a class is preset to have 30 students; if target detection yields an attendance of 28, the target detection result differs from the preset result, so the detection task text and the multi-mode data are input into the multi-mode network model for a second detection to obtain a verification detection result and determine the final attendance. A sketch of this verification flow is given below.
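In the sketch below, `detect_with_small_model` and `detect_with_multimodal_model` are hypothetical stand-ins for the per-data-type network model and the multi-mode network model; the policy for reconciling the two rounds is an assumption, since the application only states that both results are considered.

```python
def verified_detection(task_text, multimodal_data, preset_result,
                       detect_with_small_model, detect_with_multimodal_model):
    """First round with the per-data-type model; second round with the
    multi-mode model only when the result disagrees with the preset result."""
    target_result = detect_with_small_model(task_text, multimodal_data)
    if preset_result is None or target_result == preset_result:
        return target_result
    # e.g. preset attendance 30, first-round result 28: trigger re-detection.
    verification_result = detect_with_multimodal_model(task_text, multimodal_data)
    # Final result determined from both rounds; here we simply adopt the
    # verification result (assumed policy).
    return verification_result
```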
In the application, the target detection model comprises a model corresponding to each data type and a multi-mode network model, both of which are obtained through training. The network training of each network model is shown in fig. 5, with the following specific steps:
Step one: multimodal data acquisition including, but not limited to, near-far sound, multi-view, multi-target image, etc.;
Step two: each mode data label comprises, but is not limited to, sound source distance, sound source age, gender, name and other structural information, sound content, voiceprint information, image view angle, multi-target gender, age and other structural information in the image, target size information of the image and the like; and combining a multi-model decision-making technology, and obtaining a multi-model data set by utilizing the semi-automatic labeling capability.
Step three: training by a deep learning model, combining an attention mechanism, aligning the token abstract layers of the modal data across modes, learning token alignment rules of the abstract layers of the modal data to the text abstract layers, decoding the codes according to coding marks, and simultaneously obtaining target detection capability of the multi-modal potential abstract layers by corresponding to multi-modal source data;
Step four: the potential abstract layer coding means comprise sine and cosine position coding, ALiBi relative position coding, roPE position coding and the like, and coding marks are carried out on the original positions of the tokens extracted by the data token abstract layer of each mode.
This step also includes converting the view angle of each mode into the same space, that is, aligning the data of the various modes to the text data so as to unify the consistency of the spatial two-dimensional data (including two-dimensional mapping of spatial three-dimensional data);
In steps three and four, the embedding technology provided by a Transformer, Mamba or the like is used to extract token abstractions from each kind of multi-mode data separately, and a self-attention mechanism and a cross-attention mechanism are used for the first stage of model pre-training to learn the token characteristics of each mode and of the cross-mode abstract layer between modes. The token characteristics of these abstract layers are discrete rather than continuous, which reduces cross-mode information interference; mathematical connections are established through cross-attention, so the token abstract layer forms a kind of discrete-space "context" connection, and the discrete context connections of the abstract layer tokens are obtained at this point. Using the cross-attention mechanism and the discrete-space context reduces both the amount of training data and the computational complexity, and yields the alignment rule of each mode's tokens at the latent abstract layer, i.e., the latent alignment rule between the codes of the modal information.
Target detection is then performed on the data at the same coding positions to obtain a prediction result; a loss value is determined from the prediction result and the label, and the network parameters are adjusted according to the loss value so that the network model can identify targets accurately (a schematic training loop is sketched after step five).
Step five: and according to the structured and unstructured information such as face recognition, face check and the like, the merging of the identity information base of the image face identity, voiceprint and human body state is realized.
When the target detection is carried out, the primary screening detection is carried out through the network model of each data type, and the fine screening detection is carried out according to the multi-mode network model, so that the accuracy and precision of the target detection are greatly improved. In some scenarios, the network model for each data type may be referred to as a small model, and the multi-modal network model may be referred to as a large model.
In some implementations, when the model is used for detection in different scenarios or on different data, it may be learned and updated using reinforcement incremental learning. Incremental learning can update the model as new data is received without retraining the entire model, which is very useful for processing large-scale data or real-time data streams because it reduces computational resources and time consumption. Reinforcement learning learns how to take actions that maximize the cumulative reward by interacting with the environment. In reinforcement incremental learning, the system learns not only how to adapt to new data, but also how to make appropriate decisions in a changing environment.
When the multi-mode data is subjected to target detection, the detection process is shown in fig. 6, and the specific flow is as follows:
Step one: the multi-mode information is input, certain description and prompt information are given out according to the modes of sound, images and the like to serve as auxiliary rules of the input information, and meanwhile, detection tasks are described to obtain detection task texts, for example, multi-person roll calling and the like are carried out according to the images. The multi-mode information is multi-mode offline data or real-time data.
Step two: the multi-mode detection combines the advantages of the small model, the small model is screened for the first time, a first screening result is given, new data are provided for the enhanced enhancement sub-module, false detection data needing to be re-detected are provided for the multi-mode large model for fine detection, the detection result is finally output, unstructured information is displayed in an image-text mode, structured information forms a comparison report to be summarized, and meanwhile, the result (such as a plurality of roll calls) is output. The small model is a small model of each mode facing to a specific industry, and preliminary detection is carried out by using the small model. When the primary screening result has false detection, the multi-mode large model of the specific industry is used for accurate detection, and the final detection result is determined and output. If the new data cannot be identified in the primary screening result and/or the accurate detection, performing reinforcement increment learning, and performing semi-automatic labeling of each multi-mode data and training the multi-mode large model to update the large model.
In addition, in the actual data reasoning process, the structured and unstructured data fed back by the small models are combined to update the multi-mode data in real time, and the model is further iterated and optimized using reinforcement incremental learning and the newly collected data, thereby improving the detection accuracy of the multi-mode model.
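A sketch of the two-stage inference flow of fig. 6; the routing by data type and the model interfaces (`small_models`, `large_model`, the `suspected_false_detection` flag) are hypothetical stand-ins for the industry-specific small models and the multi-mode large model.

```python
def multimodal_inference(task_text, samples, small_models, large_model):
    """samples: list of (data_type, data) pairs, e.g. ("image", img) or ("sound", wav).
    small_models: dict mapping a data type to its industry-specific small model.
    large_model: multi-mode large model used for fine detection / re-detection."""
    primary_results, needs_recheck = [], []
    for data_type, data in samples:
        result = small_models[data_type].detect(task_text, data)   # first screening
        if result.get("suspected_false_detection"):
            needs_recheck.append((data_type, data))
        else:
            primary_results.append(result)
    # Fine detection of the suspicious samples with the multi-mode large model.
    refined_results = [large_model.detect(task_text, d) for _, d in needs_recheck]
    return primary_results + refined_results
```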
As an example, when the inspection task is to perform personnel inspection and tool inspection on an image during railway turnout inspection, the inspection task and the image are input into a model, and then the target inspection result in the image data can be determined. The target detection result includes the number of people included in the image and the number of tool types, and is represented in the image, as shown in fig. 7. The target boxes with different line types are used for representing different detection task results.
In some scenes, the method can also be applied to an intelligent classroom for attendance detection. The detection task is to identify the number of students present in the video. The detection task and the video data are input into the model for target detection, so that the number of students present in the video data is obtained, as shown in fig. 8. In some embodiments, if the image data in the video data does not yield a detection result for Zhang San, but the target voiceprint is determined to be Zhang San's according to the sound data, the detection results of the image data and the sound data are combined and Zhang San is determined to be present.
In some scenes, in the intelligent classroom, detection of the listening state and of teaching interaction can also be performed. The detection task is to identify the listening state and the classroom teaching interaction in the video. During detection, the detection task can be carried out on multi-mode data such as the students' body posture in the image data and their voices in the sound data, so as to determine the final detection result.
The method can perform multi-target detection on multi-mode data simultaneously, detecting and identifying multiple targets from multiple viewing angles at once. It can be used in scenes such as attendance statistics for students in an intelligent classroom, multi-person collaborative road construction or railway turnout inspection, roll call in class, entry and exit of road construction site workers, and one-time identification and counting of multiple persons' identities.
According to the application, multi-mode information such as the voice voiceprint, the face, and the body posture or sitting posture of a person is combined, and the comprehensive features of the voice tokens, the image tokens and the database text information tokens are aligned, so that the identity information of all persons in a multi-person scene is verified and confirmed at one time. This improves the recognition precision for small face targets and makes the identity recognition and roll-call process unobtrusive: small targets are captured visually and recorded and recognized both audibly and visually, achieving fully implicit recognition. As a result, user experience and recognition precision are improved, the time for target recognition and identity verification in multi-person small-target scenes is saved, and efficiency and effect are improved.
Based on the same technical concept, referring to fig. 9, an embodiment of the present application provides a target detection apparatus 900 for multi-modal data. The apparatus 900 may perform any step of the method for detecting targets of multi-modal data, and in order to avoid repetition, the description is omitted here. The apparatus 900 comprises a determination module 901 and an extraction module 902.
The determining module 901 is configured to determine multi-mode data to be detected and a detection task, where the multi-mode data includes at least one data type of video, sound and image; the detection task comprises at least one detection task text; the detection task text is used for expressing the detection requirement of the multi-mode data;
The extracting module 902 is configured to extract information from the at least one detection task text and the multimodal data, to obtain text abstraction layer information of the at least one detection task text and multimodal data abstraction layer information of the multimodal data;
The determining module 901 is further configured to determine target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information according to an alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type; and determine a target detection result corresponding to the at least one detection task text according to the target abstract layer information.
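By way of illustration only, the division of labour between the two modules can be sketched in Python as follows. Every function below is a trivial stand-in — the patent does not define a concrete encoder, alignment rule or decoder — so only the control flow (determine, extract, align, decode) is meant to be conveyed.

```python
# Skeleton of the apparatus flow described above. The extract_*, align and
# decode_result functions are stubs that only echo their inputs; they stand in
# for the real abstraction-layer extraction and alignment-rule matching.

from dataclasses import dataclass

def extract_text_abstraction(task_text):
    # stub: text abstraction layer information of one detection task text
    return {"task": task_text, "tokens": task_text.split()}

def extract_data_abstraction(modality, data):
    # stub: data abstraction layer information of one modality
    return {"modality": modality, "features": data}

def align(text_abs, data_abs):
    # stub alignment rule: pair every task with every modality's information
    return [(t, d) for t in text_abs for d in data_abs.values()]

def decode_result(target_abs):
    # stub decoding of target abstraction layer information into results
    return [f"result for '{t['task']}' from {d['modality']} data" for t, d in target_abs]

@dataclass
class DetectionInput:
    task_texts: list   # at least one detection task text
    modal_data: dict   # e.g. {"video": ..., "sound": ..., "image": ...}

class TargetDetectionApparatus:
    def determine(self, task_texts, modal_data):
        # determination module: fix the detection tasks and the multi-modal data
        return DetectionInput(task_texts, modal_data)

    def detect(self, inp):
        # extraction module: abstraction layer information for text and data
        text_abs = [extract_text_abstraction(t) for t in inp.task_texts]
        data_abs = {m: extract_data_abstraction(m, d) for m, d in inp.modal_data.items()}
        # determination module again: align, then decode the detection result
        target_abs = align(text_abs, data_abs)
        return decode_result(target_abs)

apparatus = TargetDetectionApparatus()
inp = apparatus.determine(["count attendees"], {"image": "frame", "sound": "clip"})
print(apparatus.detect(inp))
```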
In a possible implementation manner, the determining module 901, when determining, according to an alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type, the target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information, is specifically configured to:
Respectively carrying out position coding on the multi-mode data abstraction layer information and the text abstraction layer information of the at least one detection task text to obtain data position coding information of the multi-mode data and task position coding information corresponding to the at least one detection task text;
And determining target abstract layer information in the multi-mode data abstract layer information according to the corresponding relation between the data position coding information and the task position coding information aiming at each detection task text.
In a possible implementation manner, the determining module 901 is specifically configured to, when determining the target abstract layer information in the multi-mode data abstract layer information according to a correspondence between the data position coding information and the task position coding information:
Determining target data position coding information matched with task position coding information corresponding to the text abstraction layer information of the detection task text from the data position coding information;
and taking the multi-mode data abstraction layer information corresponding to the target data position coding information as target abstraction layer information.
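A toy version of this matching step might look as follows; it assumes that text and data abstraction layer information carrying the same semantics receive the same position code (as the training of the coding rules is intended to ensure), and the integer codes and entries are invented for the example.

```python
# Toy matching by position code: for each detection task text, take the
# multi-modal abstraction layer entries whose data position code equals the
# task position code. Codes and entries here are invented for illustration.

def match_by_position_code(task_codes, data_entries_by_code):
    # task_codes:            {detection task text: task position code}
    # data_entries_by_code:  {data position code: [abstraction layer entries]}
    return {task: data_entries_by_code.get(code, [])
            for task, code in task_codes.items()}

task_codes = {"count persons": 3, "count tools": 7}
data_entries_by_code = {
    3: ["person feature A", "person feature B"],
    7: ["tool feature"],
    9: ["unrelated feature"],   # no matching task, so it is ignored
}
print(match_by_position_code(task_codes, data_entries_by_code))
# {'count persons': ['person feature A', 'person feature B'],
#  'count tools': ['tool feature']}
```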
In one possible implementation, the determining module 901 is further configured to:
When the multi-mode data comprises a first data type and a second data type, determining a first network model corresponding to the first data type and a second network model corresponding to the second data type;
Inputting the at least one detection task text and the data, in the multi-modal data, corresponding to the first data type into the first network model, so that the first network model outputs a target detection result of the data corresponding to the first data type for the at least one detection task text; and inputting the at least one detection task text and the data, in the multi-modal data, corresponding to the second data type into the second network model, so that the second network model outputs a target detection result of the data corresponding to the second data type for the at least one detection task text.
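A minimal sketch of this per-type routing, with stand-in callables in place of the first and second network models, could be:

```python
# Routing sketch: each data type is sent, together with the detection task
# texts, to the network model registered for that type. The two "models" here
# are placeholders that merely describe what they received.

def first_network_model(task_texts, image_data):
    return f"image result for {len(task_texts)} task(s) on {image_data}"

def second_network_model(task_texts, sound_data):
    return f"sound result for {len(task_texts)} task(s) on {sound_data}"

MODEL_BY_TYPE = {"image": first_network_model, "sound": second_network_model}

def detect_per_type(task_texts, modal_data):
    return {data_type: MODEL_BY_TYPE[data_type](task_texts, data)
            for data_type, data in modal_data.items()}

print(detect_per_type(["count attendees"],
                      {"image": "frame.png", "sound": "clip.wav"}))
```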
In a possible implementation manner, the determining module 901 is specifically configured to, when determining, according to the target abstract layer information, a target detection result corresponding to the at least one detection task text:
Determining a first target detection result according to the first target abstract layer information; wherein the first target abstract layer information is determined by the first network model from data corresponding to the first data type based on at least one detection task text;
determining a second target detection result according to the second target abstract layer information; wherein the second target abstract layer information is determined by the second network model from data corresponding to the second data type based on at least one detection task text;
And determining a target detection result corresponding to at least one detection task text according to the first target detection result and the second target detection result.
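The application does not prescribe how the two results are reconciled; one purely illustrative rule — keep a candidate target if either branch is confident enough, and take the higher confidence as its score — is sketched below.

```python
# Illustrative combination of the first (image) and second (sound) target
# detection results. The fusion rule (max score with a threshold) is only one
# of many possibilities and is not mandated by the method.

def combine_results(first_result, second_result, threshold=0.5):
    combined = {}
    for target in first_result.keys() | second_result.keys():
        score = max(first_result.get(target, 0.0), second_result.get(target, 0.0))
        if score >= threshold:
            combined[target] = score
    return combined

image_scores = {"person_1": 0.92, "person_2": 0.31}   # first target detection result
sound_scores = {"person_2": 0.88, "person_3": 0.10}   # second target detection result
print(combine_results(image_scores, sound_scores))
# e.g. {'person_1': 0.92, 'person_2': 0.88}; person_3 falls below the threshold
```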
In one possible implementation, the determining module 901 is further configured to:
when a preset result exists in a detection task corresponding to a target detection task text, comparing the target detection result with the preset result;
When the target detection result is determined to be different from the preset result, inputting a target detection task text and the multi-mode data into a multi-mode network model to obtain a verification detection result;
and determining the detection result of the multi-mode data according to the target detection result and the verification detection result.
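A compact sketch of this check-and-verify flow is shown below; the stand-in multi-modal network model and the decision rule (prefer the verification result on a mismatch) are assumptions made only for illustration.

```python
# When the task tied to the target detection task text has a preset result,
# compare against it; on a mismatch, obtain a verification result from a
# multi-modal network model and decide the final result from both.

def verify_detection(target_result, preset_result, task_text, modal_data, multimodal_model):
    if preset_result is None or target_result == preset_result:
        return target_result                 # nothing to verify, or it already matches
    verification_result = multimodal_model(task_text, modal_data)
    # illustrative decision rule: trust the verification result on mismatch
    return verification_result

# Stand-in multi-modal network model, used only to make the example runnable.
def fake_multimodal_model(task_text, modal_data):
    return 28

print(verify_detection(30, 28, "count attendees", {"video": "clip"}, fake_multimodal_model))  # -> 28
```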
Based on the same inventive concept, an embodiment of the present application provides an electronic device, which can implement the functions of the target detection device for multi-mode data discussed above, and referring to fig. 10, the device 1000 includes a memory 1001 and a processor 1002.
A memory 1001 for storing program instructions;
And a processor 1002, configured to call the program instructions stored in the memory, and execute any step of the target detection method of the multimodal data according to the obtained program instructions.
In an embodiment of the application, the processor 1002 is the control center of the electronic device; it connects various parts of the electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 1001 and invoking data stored in the memory 1001. Optionally, the processor 1002 may include one or more processing units. The processor 1002 may be a control component such as a processor, microprocessor or controller, for example a general purpose central processing unit (CPU), a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
The memory 1001 may be used to store software programs and modules, and the processor 1002 executes the software programs and modules stored in the memory 1001 to perform various functional applications and data processing. The memory 1001 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to business processes, and the like. As a nonvolatile computer-readable storage medium, the memory 1001 may be used to store nonvolatile software programs, nonvolatile computer-executable programs and modules. The memory 1001 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disk, and the like. The memory 1001 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1001 in the embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, an embodiment of the present application provides a computer-readable storage medium, including: computer instructions that, when run on a computer, cause the computer to perform any of the target detection methods for multi-modal data discussed above. Since the principle by which the computer-readable storage medium solves the problem is similar to that of the target detection method for multi-modal data, its implementation can refer to the implementation of the method, and repeated details are not described here.
Based on the same inventive concept, embodiments of the present application also provide a computer program product, comprising: computer program code which, when run on a computer, causes the computer to perform the target detection method for multi-modal data discussed above. Since the principle by which the computer program product solves the problem is similar to that of the target detection method for multi-modal data, its implementation can refer to the implementation of the method, and repeated details are not described here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A method for target detection of multi-modal data, comprising:
Determining multi-mode data to be detected and a detection task, wherein the multi-mode data comprise at least one data type of video, sound and images; the detection task comprises at least one detection task text; the detection task text is used for expressing the detection requirement of the multi-mode data;
Respectively extracting information from the at least one detection task text and the multi-modal data to obtain text abstract layer information of the at least one detection task text and multi-modal data abstract layer information of the multi-modal data;
Determining target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information according to an alignment rule between the abstract layer information of each data type and the abstract layer information of the text data type;
determining a target detection result corresponding to the at least one detection task text according to the target abstract layer information;
When the multi-mode data comprises a first data type and a second data type, determining a target detection result corresponding to the at least one detection task text according to the target abstract layer information, wherein the target detection result comprises:
Determining a first target detection result according to the first target abstract layer information; the first target abstract layer information is determined by a first network model from data corresponding to a first data type based on at least one detection task text;
Determining a second target detection result according to the second target abstract layer information; the second target abstract layer information is determined by a second network model from data corresponding to a second data type based on at least one detection task text;
And determining a target detection result corresponding to at least one detection task text according to the first target detection result and the second target detection result.
2. The method of claim 1, wherein determining target abstract layer information matching the text abstract layer information in the multi-modal data abstract layer information according to an alignment rule between abstract layer information of each data type and abstract layer information of a text data type, comprises:
Respectively carrying out position coding on the multi-mode data abstraction layer information and the text abstraction layer information of the at least one detection task text to obtain data position coding information of the multi-mode data and task position coding information corresponding to the at least one detection task text;
And determining target abstract layer information in the multi-mode data abstract layer information according to the corresponding relation between the data position coding information and the task position coding information aiming at each detection task text.
3. The method of claim 2, wherein determining target abstraction layer information in the multi-modal data abstraction layer information based on a correspondence between data location encoding information and task location encoding information comprises:
Determining target data position coding information matched with task position coding information corresponding to the text abstraction layer information of the detection task text from the data position coding information;
and taking the multi-mode data abstraction layer information corresponding to the target data position coding information as target abstraction layer information.
4. A method as claimed in claim 2 or 3, characterized in that the coding rules of the multi-modal data abstraction layer information and the coding rules of the text abstraction layer information are trained in a training phase, and the position codes of the text abstraction layer information representing the same semantics are the same as the position codes of the data abstraction layer information of each modal data.
5. The method of claim 1, wherein the method further comprises:
when a preset result exists in a detection task corresponding to a target detection task text, comparing the target detection result with the preset result;
When the target detection result is determined to be different from the preset result, inputting a target detection task text and the multi-mode data into a multi-mode network model to obtain a verification detection result;
and determining the detection result of the multi-mode data according to the target detection result and the verification detection result.
6. A target detection apparatus for multi-modal data, comprising:
the system comprises a determining module, a detecting module and a judging module, wherein the determining module is used for determining multi-mode data to be detected and a detection task, and the multi-mode data comprise at least one data type of video, sound and images; the detection task comprises at least one detection task text; the detection task text is used for expressing the detection requirement of the multi-mode data;
The extraction module is used for respectively extracting information of the at least one detection task text and the multi-modal data to obtain text abstract layer information of the at least one detection task text and multi-modal data abstract layer information of the multi-modal data;
the determining module is further configured to determine target abstract layer information matched with the text abstract layer information in the multi-mode data abstract layer information according to an alignment rule between abstract layer information of each data type and abstract layer information of a text data type; determining a target detection result corresponding to the at least one detection task text according to the target abstract layer information;
When the multimodal data includes a first data type and a second data type, the determining module is specifically configured to:
Determining a first target detection result according to the first target abstract layer information; the first target abstract layer information is determined by a first network model from data corresponding to a first data type based on at least one detection task text;
Determining a second target detection result according to the second target abstract layer information; the second target abstract layer information is determined by a second network model from data corresponding to a second data type based on at least one detection task text;
And determining a target detection result corresponding to at least one detection task text according to the first target detection result and the second target detection result.
7. The apparatus of claim 6, wherein the determining module, when determining the target abstract layer information matched with the text abstract layer information in the multi-modal data abstract layer information according to an alignment rule between abstract layer information of each data type and abstract layer information of a text data type, is specifically configured to:
Respectively carrying out position coding on the multi-mode data abstraction layer information and the text abstraction layer information of the at least one detection task text to obtain data position coding information of the multi-mode data and task position coding information corresponding to the at least one detection task text;
And determining target abstract layer information in the multi-mode data abstract layer information according to the corresponding relation between the data position coding information and the task position coding information aiming at each detection task text.
8. An execution device, comprising:
A memory for storing program instructions;
a processor for invoking program instructions stored in said memory and performing the method according to any of claims 1-5 in accordance with the obtained program instructions.
9. A computer readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-5.
CN202410557496.6A 2024-05-08 2024-05-08 Target detection method and device for multi-mode data Active CN118133191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410557496.6A CN118133191B (en) 2024-05-08 2024-05-08 Target detection method and device for multi-mode data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410557496.6A CN118133191B (en) 2024-05-08 2024-05-08 Target detection method and device for multi-mode data

Publications (2)

Publication Number Publication Date
CN118133191A CN118133191A (en) 2024-06-04
CN118133191B true CN118133191B (en) 2024-08-02

Family

ID=91248058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410557496.6A Active CN118133191B (en) 2024-05-08 2024-05-08 Target detection method and device for multi-mode data

Country Status (1)

Country Link
CN (1) CN118133191B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118673465B (en) * 2024-08-22 2024-11-05 青岛理工大学 Open vocabulary target detection method, system, equipment and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972118A (en) * 2024-02-02 2024-05-03 深圳须弥云图空间科技有限公司 Target retrieval method, target retrieval device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443266B (en) * 2018-05-04 2022-06-24 上海商汤智能科技有限公司 Object prediction method and device, electronic equipment and storage medium
CN112487217A (en) * 2019-09-12 2021-03-12 腾讯科技(深圳)有限公司 Cross-modal retrieval method, device, equipment and computer-readable storage medium
CN113240056B (en) * 2021-07-12 2022-05-17 北京百度网讯科技有限公司 Multi-mode data joint learning model training method and device
CN114463594A (en) * 2021-11-19 2022-05-10 中国华能集团清洁能源技术研究院有限公司 Multi-mode deep learning power generation equipment abnormity integrated identification method and equipment
CN115455228A (en) * 2022-11-08 2022-12-09 苏州浪潮智能科技有限公司 Multi-mode data mutual detection method, device, equipment and readable storage medium
CN116561570A (en) * 2023-03-31 2023-08-08 北京京东方技术开发有限公司 Training method, device and equipment for multi-mode model and readable storage medium

Also Published As

Publication number Publication date
CN118133191A (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN112084331B (en) Text processing and model training method and device, computer equipment and storage medium
CN111597830A (en) Multi-modal machine learning-based translation method, device, equipment and storage medium
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN113792871B (en) Neural network training method, target identification device and electronic equipment
CN116824278B (en) Image content analysis method, device, equipment and medium
CN112541529A (en) Expression and posture fusion bimodal teaching evaluation method, device and storage medium
CN113870395A (en) Animation video generation method, device, equipment and storage medium
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN116861258B (en) Model processing method, device, equipment and storage medium
CN118133191B (en) Target detection method and device for multi-mode data
CN114021582A (en) Spoken language understanding method, device, equipment and storage medium combined with voice information
Alon et al. Deep-hand: a deep inference vision approach of recognizing a hand sign language using american alphabet
CN116127080A (en) Method for extracting attribute value of description object and related equipment
CN117172978A (en) Learning path information generation method, device, electronic equipment and medium
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN114676705B (en) Dialogue relation processing method, computer and readable storage medium
CN113254575B (en) Machine reading understanding method and system based on multi-step evidence reasoning
CN115018215B (en) Population residence prediction method, system and medium based on multi-modal cognitive atlas
CN113821498A (en) Data screening method, device, equipment and medium
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN114238587A (en) Reading understanding method and device, storage medium and computer equipment
Malik et al. ML-Based Hand Sign Detection System for Deaf-Mute People

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant