CN114610919A - Multi-expert-based image-text model generation method, device, equipment and medium
Multi-expert-based image-text model generation method, device, equipment and medium
- Publication number: CN114610919A (application CN202210232059.8A)
- Authority: CN (China)
- Prior art keywords: picture, text, sample, target, vector
- Legal status: Pending
Classifications
- G06F16/51—Information retrieval of still image data: Indexing; Data structures therefor; Storage structures
- G06F16/332—Information retrieval of unstructured textual data: Querying; Query formulation
- G06F16/532—Information retrieval of still image data: Querying; Query formulation, e.g. graphical querying
Abstract
The application relates to the field of artificial intelligence, and discloses a multi-expert-based image-text model generation method and device, a storage medium and computer equipment, wherein the method comprises the following steps: acquiring a training sample set; determining an initial picture vector based on a sample picture in a training sample, and inputting the initial picture vector into an initial picture expert module to obtain a first target vector; determining an initial text vector based on a sample text in the training sample, and inputting the initial text vector into an initial text expert module to obtain a second target vector; determining a picture text target vector according to the first target vector and the second target vector, inputting the picture text target vector into an initial picture text expert module, and obtaining a first prediction score based on the output result and a full connection layer; and determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain a multi-expert-based image-text model.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a multi-expert-based image-text model generation method and device, a storage medium and computer equipment.
Background
Currently, large-scale image-text pre-training is generally used to solve three kinds of tasks: picture retrieval tasks, text retrieval tasks, and complex picture-text reasoning tasks. The picture retrieval task includes two types, retrieving pictures by picture and retrieving text by picture, and the text retrieval task includes two types, retrieving text by text and retrieving pictures by text.
However, in the prior art, a pre-trained image-text model is usually a single-expert model, and separate personnel are responsible for its training, deployment and maintenance, which increases the training cost and maintenance cost of the model and occupies a large amount of computer resources.
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for generating a multi-expert-based image-text model, a storage medium, and a computer device, which enable an initial picture expert module, an initial text expert module, and an initial picture text expert module to be trained together, saving the training and maintenance costs of the model and effectively reducing the occupation of computer resources.
According to one aspect of the application, a multi-expert-based image-text model generation method is provided, and comprises the following steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text is provided with a real label indicating the relation with the sample picture;
determining an initial picture vector based on the sample picture in any training sample, and inputting the initial picture vector to an initial picture expert module of a preset picture text model to obtain a first target vector;
determining an initial text vector based on the sample text in any training sample, and inputting the initial text vector to an initial text expert module of the preset picture text model to obtain a second target vector;
determining a picture text target vector according to the first target vector and the second target vector, inputting the picture text target vector to an initial picture text expert module of the preset picture text model, and obtaining a first prediction score between the sample picture and the sample text based on an output result and a full connection layer;
and determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model.
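For illustration, the following is a minimal end-to-end sketch of the five steps above in Python with PyTorch. The class name, the use of a single TransformerEncoderLayer per expert module, and all sizes are assumptions made for the sketch rather than details prescribed by the application:

```python
import torch
import torch.nn as nn

class PresetPictureTextModel(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # one Transformer layer per expert; real experts would be deeper
        self.picture_expert = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.text_expert = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.picture_text_expert = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fc = nn.Linear(dim, 1)  # full connection layer producing the prediction score

    def forward(self, initial_picture_vectors, initial_text_vector):
        first_target = self.picture_expert(initial_picture_vectors)   # picture expert
        second_target = self.text_expert(initial_text_vector)         # text expert
        fused = torch.cat([first_target, second_target], dim=1)       # picture text target vector
        out = self.picture_text_expert(fused)                         # picture text expert
        return torch.sigmoid(self.fc(out.mean(dim=1))).squeeze(-1)    # first prediction score

model = PresetPictureTextModel()
pictures = torch.randn(8, 16, 256)  # 8 samples, 16 sub-sample picture vectors each
texts = torch.randn(8, 12, 256)     # 8 samples, 12 word vectors each
labels = torch.ones(8)              # real labels (1 = text describes picture)
score = model(pictures, texts)
loss = nn.functional.binary_cross_entropy(score, labels)
loss.backward()  # one backward pass co-trains all three expert modules
```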
Optionally, the determining an initial picture vector based on the sample picture in any one of the training samples specifically includes:
determining a picture dimension of the sample picture, wherein the picture dimension comprises a picture height and/or a picture width;
dividing the picture height and/or the picture width of the sample picture based on a preset dividing size to obtain a sub-sample picture corresponding to the sample picture;
and converting the sub-sample pictures into the initial picture vectors corresponding to each sub-sample picture through a preset conversion tool.
Optionally, the determining an initial text vector based on the sample text in any of the training samples specifically includes:
based on a preset word vector database, respectively determining word vectors corresponding to each word in the sample text from the preset word vector database, and splicing the word vectors corresponding to each word in the sample text to obtain the initial text vector.
Optionally, the determining a picture text target vector according to the first target vector and the second target vector specifically includes:
splicing the first target vectors corresponding to each sub-sample picture to obtain picture splicing vectors;
and splicing the picture splicing vector and the second target vector corresponding to the sample text to obtain the picture text target vector.
Optionally, the determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model specifically includes:
determining a model loss value of the preset picture text model through a preset cross entropy loss function based on the first prediction score and the real label corresponding to each training sample in the training sample set;
when the model loss value is larger than a preset loss threshold value, adjusting module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model according to the model loss value to obtain an updated preset picture text model, obtaining a second prediction score between each sample picture and the sample text through the updated preset picture text model and the full-connection layer, and calculating the model loss value again;
and when the model loss value is less than or equal to the preset loss threshold value, obtaining the multi-expert-based image-text model.
Optionally, after obtaining the multi-expert-based image-text model, the method further includes:
receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based image-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture text expert module;
and converting the object to be analyzed into a corresponding target input vector, inputting the target input vector into the target analysis module, obtaining a target output vector corresponding to the object to be analyzed, and obtaining a target result through the target output vector.
Optionally, the determining, according to the format of the object to be analyzed, a corresponding target analysis module from the multi-expert-based image-text model specifically includes:
when the format of the object to be analyzed is a picture format, taking the target picture expert module as the target analysis module;
when the format of the object to be analyzed is a text format, taking the target text expert module as the target analysis module;
and when the format of the object to be analyzed comprises a picture format and a text format, taking the target picture expert module, the target text expert module and the target picture text expert module as the target analysis module.
According to another aspect of the present application, there is provided a multi-expert-based image-text model generation apparatus comprising:
the sample acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text is provided with a real label indicating the relation between the sample picture and the sample text;
the first input module is used for determining an initial picture vector based on the sample picture in any training sample, and inputting the initial picture vector to an initial picture expert module of a preset picture text model to obtain a first target vector;
the second input module is used for determining an initial text vector based on the sample text in any training sample, and inputting the initial text vector to an initial text expert module of the preset picture text model to obtain a second target vector;
the prediction module is used for determining a picture text target vector according to the first target vector and the second target vector, inputting the picture text target vector to an initial picture text expert module of the preset picture text model, and obtaining a first prediction score between the sample picture and the sample text based on an output result and a full connection layer;
and the model training module is used for determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model.
Optionally, the first input module is specifically configured to:
determining a picture dimension of the sample picture, wherein the picture dimension comprises a picture height and/or a picture width; dividing the picture height and/or the picture width of the sample picture based on a preset dividing size to obtain a sub-sample picture corresponding to the sample picture; converting the sub-sample pictures into the initial picture vectors corresponding to each sub-sample picture through a preset conversion tool.
Optionally, the second input module is specifically configured to:
based on a preset word vector database, respectively determining word vectors corresponding to each word in the sample text from the preset word vector database, and splicing the word vectors corresponding to each word in the sample text to obtain the initial text vector.
Optionally, the prediction module is specifically configured to:
splicing the first target vectors corresponding to each sub-sample picture to obtain picture splicing vectors; and splicing the picture splicing vector and the second target vector corresponding to the sample text to obtain the picture text target vector.
Optionally, the model training module is specifically configured to:
determining a model loss value of the preset picture text model through a preset cross entropy loss function based on the first prediction score and the real label corresponding to each training sample in the training sample set; when the model loss value is larger than a preset loss threshold value, adjusting module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model according to the model loss value to obtain an updated preset picture text model, obtaining a second prediction score between each sample picture and the sample text through the updated preset picture text model and the full-connection layer, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold value, obtaining the multi-expert-based image-text model.
Optionally, the apparatus further comprises:
the receiving module is used for receiving the object to be analyzed after the multi-expert-based image-text model is obtained, and determining a corresponding target analysis module from the multi-expert-based image-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture text expert module;
and the third input module is used for converting the object to be analyzed into a corresponding target input vector, inputting the target input vector into the target analysis module, obtaining a target output vector corresponding to the object to be analyzed, and obtaining a target result through the target output vector.
Optionally, the receiving module is specifically configured to:
when the format of the object to be analyzed is a picture format, taking the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, taking the target text expert module as the target analysis module; and when the format of the object to be analyzed comprises a picture format and a text format, taking the target picture expert module, the target text expert module and the target picture text expert module as the target analysis module.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described multi-expert-based image-text model generation method.
According to yet another aspect of the present application, there is provided a computer device, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the multi-expert-based image-text model generation method when executing the program.
By means of the above technical scheme, with the multi-expert-based image-text model generation method and device, the storage medium and the computer device, firstly, a training sample set can be obtained, where the training sample set can comprise a plurality of training samples and each training sample can comprise a sample picture and a sample text. The sample text can also carry a real label indicating the relation with the sample picture. For each training sample in the training sample set, the sample picture in the training sample can be converted to obtain an initial picture vector corresponding to the sample picture. Then, the initial picture vector can be input into an initial picture expert module in a preset picture text model, and the first target vector can be output. An initial text vector can also be determined for the sample text corresponding to the sample picture in the training sample. Then, the initial text vector can be input into an initial text expert module in the preset picture text model, and a second target vector can be output. After the first target vector corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, a picture text target vector can be further determined based on the first target vector and the second target vector. The picture text target vector can then be input into an initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module passes through a full connection layer to output a first prediction score between the sample picture and the sample text. After the first prediction score is obtained, a model loss value of the preset picture text model can be determined according to the first prediction score and the real label of each training sample, the preset picture text model is trained on the basis of the model loss value, and a multi-expert image-text model based on a picture expert, a text expert and a picture text expert can be obtained after the training. According to the embodiments of the application, the initial picture expert module, the initial text expert module and the initial picture text expert module can be trained together, which saves the training and maintenance costs of the model and effectively reduces the occupation of computer resources.
The above description is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer so that it can be implemented in accordance with the content of the description, and to make the above and other objects, features and advantages of the present application easier to understand, the detailed description of the present application is given below.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart illustrating a multi-expert-based graphics-text model generation method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of another multi-expert-based graphics-text model generation method provided in the embodiment of the present application;
fig. 3 shows a schematic structural diagram of a multi-expert-based graphics-text model generation device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In this embodiment, a method for generating a multi-expert-based image-text model is provided, and as shown in fig. 1, the method includes:
Step 101, acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text is provided with a real label indicating the relation with the sample picture.
The multi-expert-based image-text model generation method provided by the embodiment of the application enables the initial picture expert module, the initial text expert module and the initial picture text expert module to be trained together, which saves the training and maintenance costs of the model and effectively reduces the occupation of computer resources. The preset picture text model mainly comprises three parts, namely an initial picture expert module, an initial text expert module and an initial picture text expert module, and a target picture expert module, a target text expert module and a target picture text expert module are correspondingly generated after training is finished. First, a training sample set may be obtained, where the training sample set may include a plurality of training samples, and each training sample may include a sample picture and a sample text. The sample text may further include a true label indicating its relationship with the sample picture: for example, if the sample text is a positive sample of the sample picture, i.e., the sample text describes the sample picture, the true label may be 1; if the sample text is a negative sample of the sample picture, i.e., the sample text does not describe the sample picture, the true label may be 0.
Step 102, determining an initial picture vector based on the sample picture in any training sample, and inputting the initial picture vector to an initial picture expert module of a preset picture text model to obtain a first target vector;
In this embodiment, for each training sample in the training sample set, the sample picture in the training sample may be converted to obtain an initial picture vector corresponding to the sample picture. Then, the initial picture vector may be input into the initial picture expert module in the preset picture text model, and the first target vector may then be output.
Step 103, determining an initial text vector based on the sample text in any training sample, and inputting the initial text vector to an initial text expert module of the preset picture text model to obtain a second target vector;
In this embodiment, an initial text vector of the sample text corresponding to the sample picture in the training sample may also be determined. Then, the initial text vector may be input into the initial text expert module in the preset picture text model, and a second target vector may be output.
Step 104, determining a picture text target vector according to the first target vector and the second target vector, inputting the picture text target vector to an initial picture text expert module of the preset picture text model, and obtaining a first prediction score between the sample picture and the sample text based on an output result and a full connection layer;
In this embodiment, after the first target vector corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, the picture text target vector may be further determined based on the first target vector and the second target vector. Then, the picture text target vector can be input into the initial picture text expert module of the preset picture text model, the output of the initial picture text expert module can be passed through a full connection layer, and a first prediction score between the sample picture and the sample text is output; the first prediction score reflects the degree of correlation between the sample text and the sample picture.
Step 105, determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model.
In this embodiment, after determining the first prediction score between the sample picture and the sample text of each training sample, the model loss value of the preset picture text model may be determined according to the first prediction score and the true label of each training sample. Then, a preset picture text model can be trained on the basis of the model loss value, and a multi-expert image-text model based on the picture experts, the text experts and the picture text experts can be obtained after training.
By applying the technical scheme of this embodiment, first, a training sample set may be obtained, where the training sample set may include a plurality of training samples, and each training sample may include a sample picture and a sample text. The sample text may also include a real label indicating the relation with the sample picture. For each training sample in the training sample set, the sample picture in the training sample may be converted to obtain an initial picture vector corresponding to the sample picture. Then, the initial picture vector may be input into an initial picture expert module in a preset picture text model, and the first target vector may be output. An initial text vector may also be determined for the sample text corresponding to the sample picture in the training sample. Then, the initial text vector may be input into an initial text expert module in the preset picture text model, and a second target vector may be output. After the first target vector corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, a picture text target vector may be further determined based on the first target vector and the second target vector. The picture text target vector may then be input into an initial picture text expert module of the preset picture text model, and the output of the initial picture text expert module passes through a full connection layer to output a first prediction score between the sample picture and the sample text. After the first prediction score is obtained, a model loss value of the preset picture text model may be determined according to the first prediction score and the real label of each training sample, the preset picture text model is trained on the basis of the model loss value, and a multi-expert image-text model based on a picture expert, a text expert and a picture text expert may be obtained after the training. According to the embodiment of the application, the initial picture expert module, the initial text expert module and the initial picture text expert module can be trained together, which saves the training and maintenance costs of the model and effectively reduces the occupation of computer resources.
Further, as a refinement and an extension of the specific implementation of the above embodiment, in order to fully describe the specific implementation process of this embodiment, another multi-expert-based image-text model generation method is provided, and as shown in fig. 2, the method includes:
Step 201, acquiring a training sample set.
In this embodiment, first, a training sample set may be obtained, where the training sample set may include a plurality of training samples, and each training sample may include a sample picture and a sample text. The sample text may further include a true label indicating its relationship with the sample picture: for example, if the sample text is a positive sample of the sample picture, that is, the sample text describes the sample picture, the true label may be 1; if the sample text is a negative sample of the sample picture, that is, the sample text does not describe the sample picture, the true label may be 0.
Step 202, determining the picture dimension of the sample picture.
In this embodiment, the picture dimension may be determined for the sample picture in each training sample, where the picture dimension may include a picture height and a picture width, and may further include a picture channel number. For example, the picture dimension corresponding to the sample picture may be H x W x C, where H represents the picture height of the sample picture, W represents the picture width, and C represents the number of picture channels.
Step 203, dividing the sample picture based on a preset dividing size to obtain sub-sample pictures corresponding to the sample picture.
In this embodiment, after the picture dimension of the sample picture is determined, the sample picture may be divided according to a preset dividing size: the sample picture may be divided only in the picture height direction with the picture width unchanged, divided only in the picture width direction with the picture height unchanged, or divided in both the picture height and picture width directions. After the division, a plurality of sub-sample pictures corresponding to the sample picture can be obtained. For example, if the picture dimension of the sample picture is H x W x C, the sample picture may be divided into a plurality of P x P x C sub-sample pictures according to the preset dividing size, that is, the picture dimension corresponding to each sub-sample picture is P x P x C.
Step 204, converting the sub-sample pictures into the initial picture vectors corresponding to each sub-sample picture through a preset conversion tool.
In this embodiment, after a plurality of sub-sample pictures corresponding to each sample picture are obtained, each sub-sample picture may be converted into the initial picture vector corresponding to that sub-sample picture by a preset conversion tool, that is, each sub-sample picture is directly represented by its corresponding initial picture vector. Here, the preset conversion tool may be reshape. For example, if each sub-sample picture has a picture dimension of P x P x C, then each sub-sample picture can be converted into a vector of dimension P^2 x C by the preset conversion tool, and this P^2 x C vector may be the initial picture vector. In addition, the P^2 x C vector corresponding to each sub-sample picture can be converted into a one-dimensional vector of a specified dimension by way of dimension reduction, and the converted one-dimensional vector is taken as the initial picture vector. Obtaining the initial picture vector through dimension reduction allows it to participate more conveniently in the subsequent operations, which reduces the difficulty of those operations and improves operation efficiency.
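The division and conversion described above can be sketched as follows in Python with NumPy; the sizes H, W, C and P, and the projection used for dimension reduction, are illustrative assumptions:

```python
import numpy as np

H, W, C, P = 224, 224, 3, 16           # picture dimension H x W x C, dividing size P
picture = np.random.rand(H, W, C)      # stands in for a sample picture

# Divide in both the picture height and picture width directions into
# P x P x C sub-sample pictures (H and W are assumed divisible by P here).
patches = picture.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P, P, C)               # (196, 16, 16, 3)

# reshape as the preset conversion tool: each sub-sample picture becomes a
# one-dimensional vector of dimension P^2 x C, the initial picture vector.
initial_picture_vectors = patches.reshape(patches.shape[0], P * P * C)   # (196, 768)

# Optional dimension reduction to a specified dimension (the projection
# matrix here is random for illustration; in practice it would be learned).
projection = np.random.rand(P * P * C, 256)
reduced_vectors = initial_picture_vectors @ projection                   # (196, 256)
```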
Step 205, inputting the initial picture vector corresponding to each sub-sample picture into the initial picture expert module of the preset picture text model to obtain the first target vector.
Step 206, based on a preset word vector database, respectively determining word vectors corresponding to each word in the sample text from the preset word vector database, and splicing the word vectors corresponding to each word in the sample text to obtain the initial text vector.
In this embodiment, the initial picture vector corresponding to each sub-sample picture is input into the initial picture expert module of the preset picture text model, and the first target vector may be output correspondingly. In addition, for each word in the sample text, the word vector corresponding to the word can be found from the preset word vector database, and the word vectors are then spliced according to the order of the words in the sample text to obtain the initial text vector corresponding to each sample text.
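A minimal sketch of the word-vector lookup and splicing, assuming the preset word vector database behaves like a Python dictionary (the words and dimensions are illustrative):

```python
import numpy as np

# Stand-in for the preset word vector database; a real system would load
# pre-trained embeddings keyed by word.
word_vector_db = {
    "a": np.array([0.1, 0.2, 0.3, 0.4]),
    "dog": np.array([0.5, 0.6, 0.7, 0.8]),
    "runs": np.array([0.9, 1.0, 1.1, 1.2]),
}

def initial_text_vector(sample_text: str) -> np.ndarray:
    # Look up the word vector for each word, then splice (concatenate) the
    # vectors in the order the words appear in the sample text.
    return np.concatenate([word_vector_db[w] for w in sample_text.split()])

print(initial_text_vector("a dog runs").shape)  # (12,) = 3 words x 4 dimensions
```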
Step 207, inputting the initial text vector into the initial text expert module of the preset picture text model to obtain the second target vector.
Step 208, splicing the first target vectors corresponding to each of the sub-sample pictures to obtain a picture splicing vector, and splicing the picture splicing vector with the second target vector corresponding to the sample text to obtain the picture text target vector.
In this embodiment, the initial text vector may be input into the initial text expert module in the preset picture text model, and the second target vector may be output. After the plurality of first target vectors corresponding to the sample picture and the second target vector corresponding to the sample text are obtained, they can be spliced, and the picture text target vector is thereby determined.
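The two splicing operations can be sketched as follows; the vector count and dimension are illustrative assumptions:

```python
import torch

# One first target vector per sub-sample picture; sizes are illustrative.
first_target_vectors = [torch.randn(256) for _ in range(196)]
second_target_vector = torch.randn(256)  # second target vector for the sample text

picture_splicing_vector = torch.cat(first_target_vectors)        # (196 * 256,)
picture_text_target_vector = torch.cat(
    [picture_splicing_vector, second_target_vector])              # (196 * 256 + 256,)
```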
Step 209, inputting the picture text target vector into the initial picture text expert module of the preset picture text model, and obtaining the first prediction score between the sample picture and the sample text based on the output result and the full connection layer.
In this embodiment, the picture text target vector is used as the input to the initial picture text expert module of the preset picture text model; the output of the initial picture text expert module can then be passed through a full connection layer to output the first prediction score between the sample picture and the sample text, which reflects the degree of association between the sample text and the sample picture.
Step 210, determining the model loss value of the preset picture text model through a preset cross entropy loss function based on the first prediction score and the real label corresponding to each training sample.
In this embodiment, after the first prediction score corresponding to each training sample is obtained, the model loss value of the preset picture text model can be calculated through a preset cross entropy loss function according to the first prediction score and the corresponding real label. Here, the preset cross entropy loss function may be
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$
wherein $y_i$ is the real label between the sample picture and the sample text and can be 0 or 1, $\hat{y}_i$ is the first prediction score between the sample picture and the sample text, and N is the number of training samples in the training sample set.
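A small sketch checking that the formula above matches the standard binary cross entropy (the sample labels and scores are illustrative):

```python
import torch

def preset_cross_entropy_loss(y_true: torch.Tensor, y_score: torch.Tensor) -> torch.Tensor:
    # L = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    return -(y_true * torch.log(y_score)
             + (1 - y_true) * torch.log(1 - y_score)).mean()

labels = torch.tensor([1.0, 0.0, 1.0])   # real labels
scores = torch.tensor([0.9, 0.2, 0.6])   # first prediction scores
assert torch.isclose(preset_cross_entropy_loss(labels, scores),
                     torch.nn.functional.binary_cross_entropy(scores, labels))
```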
Step 211, when the model loss value is larger than a preset loss threshold value, adjusting the module parameters according to the model loss value and calculating the model loss value again.
In this embodiment, after the model loss value is obtained through calculation, when the model loss value is less than or equal to the preset loss threshold, the preset picture text model may be directly used as the final multi-expert-based image-text model. When the model loss value is greater than the preset loss threshold, the accuracy of the preset picture text model has not yet reached the expectation, and its parameters may be further adjusted; specifically, the parameters of one or more of the initial picture expert module, the initial text expert module and the initial picture text expert module may be adjusted, and an updated preset picture text model is obtained after the adjustment. With the updated preset picture text model, a second prediction score corresponding to each training sample can be obtained from the training sample set, and the model loss value of the updated model can be calculated again through the preset cross entropy loss function according to the second prediction scores and the corresponding real labels. The magnitude relation between the new model loss value and the preset loss threshold is then judged again; when the model loss value is still larger than the preset loss threshold, the parameters are updated again, a third prediction score is calculated, and the model loss value is recomputed from the third prediction scores and the real labels. Adjusting the model parameters and recalculating the model loss value is repeated in this way until the model loss value is less than or equal to the preset loss threshold.
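The threshold-driven loop of steps 210 to 212 can be sketched as follows, assuming a model like the earlier sketch and a loader yielding (picture vectors, text vectors, real labels) batches; the threshold, learning rate and optimizer choice are illustrative:

```python
import torch

def train_until_threshold(model, loader, loss_threshold: float = 0.05,
                          lr: float = 1e-4) -> None:
    """Repeat: score every sample pair, compute the model loss value, and
    adjust the module parameters until the loss is <= the preset threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    while True:
        total, count = 0.0, 0
        for pic_vec, txt_vec, label in loader:
            score = model(pic_vec, txt_vec)  # prediction score per sample pair
            loss = torch.nn.functional.binary_cross_entropy(score, label)
            optimizer.zero_grad()
            loss.backward()   # updates picture, text and picture-text experts together
            optimizer.step()
            total += loss.item() * len(label)
            count += len(label)
        if total / count <= loss_threshold:  # model loss value vs preset loss threshold
            break
```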
Step 212, when the model loss value is less than or equal to the preset loss threshold value, obtaining the multi-expert-based image-text model.
In this embodiment, when the model loss value is less than or equal to the preset loss threshold, the model accuracy has reached the expectation, and the multi-expert-based image-text model is obtained; at this point the model includes the trained target picture expert module, target text expert module and target picture text expert module. When the preset picture text model is trained in the present application, the initial picture expert module, the initial text expert module and the initial picture text expert module are trained simultaneously. Each module corresponds to Transformer layers of an original BERT model: the initial picture expert module and the initial text expert module may each correspond to F layers, and the initial picture text expert module corresponds to the remaining (L - F) layers. Therefore, the values of L and F can be configured flexibly and freely during training according to the resource and time requirements of the actual service situation, so that the training of the model is closer to the actual service requirements. Moreover, the initial picture expert module and the initial text expert module share the parameters of the Multi-Head Attention layers during training, which greatly reduces the parameter quantity of the model and reduces the model's demand on GPU video memory when it is deployed.
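One possible reading of this layout is sketched below: a layer whose Multi-Head Attention parameters are shared by the picture expert and the text expert, with modality-specific feed-forward parts. Whether the non-attention parts are separate is an assumption of this sketch, as are all sizes and the values of L and F:

```python
import torch
import torch.nn as nn

class SharedAttentionExpertLayer(nn.Module):
    def __init__(self, dim: int = 256, nhead: int = 4):
        super().__init__()
        # one Multi-Head Attention whose parameters both experts share
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.picture_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                         nn.Linear(4 * dim, dim))
        self.text_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                      nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)          # shared Multi-Head Attention
        h = x + attn_out                          # residual connection
        ffn = self.picture_ffn if modality == "picture" else self.text_ffn
        return h + ffn(h)                         # modality-specific expert part

# F of the L total layers would be such expert layers; the remaining (L - F)
# layers would form the picture text expert. Example configuration only:
L_TOTAL, F_EXPERT = 12, 8
```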
In this embodiment of the present application, optionally, after step 212, the method further includes: receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based image-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture text expert module; and converting the object to be analyzed into a corresponding target input vector, inputting the target input vector into the target analysis module, obtaining a target output vector corresponding to the object to be analyzed, and obtaining a target result through the target output vector.
In this embodiment, after the multi-expert-based image-text model is obtained, one or more modules can be determined from it and used directly according to the object to be analyzed. Specifically, an object to be analyzed may first be received, where the object to be analyzed may be a picture or a text. After the object to be analyzed is received, its format can be analyzed, and the target analysis module is determined according to that format. After the target analysis module is determined, the object to be analyzed can be converted into a corresponding target input vector, and the target input vector is then input into the target analysis module, which outputs a target output vector corresponding to the object to be analyzed. In this way, the target result can subsequently be obtained from the target output vector. For example, when the object to be analyzed is in a text format, after the target output vector corresponding to the object to be analyzed is obtained, the most similar vector can be found through a corresponding similarity index, so as to search for a similar text or a similar picture of the object to be analyzed. Here, when the object to be analyzed is converted into the corresponding target input vector, a picture may similarly be divided into sub-pictures that are then converted into the corresponding target input vector, or the corresponding word vector may be found for each word in a text and the word vectors finally spliced together into the target input vector.
In this embodiment of the application, optionally, the determining, according to the format of the object to be analyzed, a corresponding target analysis module from the multi-expert-based graphics-text model specifically includes: when the format of the object to be analyzed is a picture format, taking the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, taking the target text expert module as the target analysis module; and when the format of the object to be analyzed comprises a picture format and a text format, taking the target picture expert module, the target text expert module and the target picture text expert module as the target analysis module.
In this embodiment, the target analysis module may be determined according to the format of the object to be analyzed. When the format of the object to be analyzed is a picture format, the target picture expert module in the multi-expert-based image-text model may be used as the target analysis module; when the format is a text format, the target text expert module may be used as the target analysis module. When the format of the object to be analyzed includes not only a picture format but also a text format, the target picture expert module, the target text expert module and the target picture text expert module can all be used as target analysis modules: after the object to be analyzed in the text format is converted into a target input vector, a corresponding output vector is obtained through the target text expert module; after the object to be analyzed in the picture format is converted into a target input vector, a corresponding output vector is obtained through the target picture expert module; and finally the output vector of the target text expert module and the output vector of the target picture expert module are spliced and used as the input of the target picture text expert module to obtain the target output vector. When the object to be analyzed includes both a picture-format object and a text-format object, outputting the vector corresponding to the picture through the target picture expert module and the vector corresponding to the text through the target text expert module, and then splicing them as the input of the target picture text expert module, can improve the accuracy of the target output vector of the target picture text expert and thus the subsequent use effect.
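The dispatch rules above can be sketched as follows; the format strings and module names are illustrative:

```python
# Pick the target analysis module(s) from the format of the object to be
# analyzed, following the three rules described above.
def pick_target_analysis_modules(fmt: str) -> list[str]:
    if fmt == "picture":
        return ["target_picture_expert"]
    if fmt == "text":
        return ["target_text_expert"]
    if fmt == "picture+text":
        return ["target_picture_expert", "target_text_expert",
                "target_picture_text_expert"]
    raise ValueError(f"unsupported format: {fmt}")

print(pick_target_analysis_modules("picture+text"))
```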
Further, as a specific implementation of the method in fig. 1, an embodiment of the present application provides a multi-expert-based image-text model generation apparatus, and as shown in fig. 3, the apparatus includes:
the training system comprises a sample acquisition module, a comparison module and a comparison module, wherein the sample acquisition module is used for acquiring a training sample set, the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text is provided with a real label indicating the relation between the sample picture and the sample text;
the first input module is used for determining an initial picture vector based on the sample picture in any training sample, and inputting the initial picture vector to an initial picture expert module of a preset picture text model to obtain a first target vector;
the second input module is used for determining an initial text vector based on the sample text in any training sample, and inputting the initial text vector to an initial text expert module of the preset picture text model to obtain a second target vector;
the prediction module is used for determining a picture text target vector according to the first target vector and the second target vector, inputting the picture text target vector to an initial picture text expert module of the preset picture text model, and obtaining a first prediction score between the sample picture and the sample text based on an output result and a full connection layer;
and the model training module is used for determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model.
Optionally, the first input module is specifically configured to:
determining a picture dimension of the sample picture, wherein the picture dimension comprises a picture height and/or a picture width; dividing the picture height and/or the picture width of the sample picture based on a preset dividing size to obtain a sub-sample picture corresponding to the sample picture; and converting the sub-sample pictures into the initial picture vectors corresponding to each sub-sample picture through a preset conversion tool.
Optionally, the second input module is specifically configured to:
based on a preset word vector database, respectively determining word vectors corresponding to each word in the sample text from the preset word vector database, and splicing the word vectors corresponding to each word in the sample text to obtain the initial text vector.
Optionally, the prediction module is specifically configured to:
splicing the first target vectors corresponding to each sub-sample picture to obtain picture splicing vectors; and splicing the picture splicing vector and the second target vector corresponding to the sample text to obtain the picture text target vector.
Optionally, the model training module is specifically configured to:
determining a model loss value of the preset picture text model through a preset cross entropy loss function based on the first prediction score and the real label corresponding to each training sample in the training sample set; when the model loss value is larger than a preset loss threshold value, adjusting module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model according to the model loss value to obtain an updated preset picture text model, obtaining a second prediction score between each sample picture and the sample text through the updated preset picture text model and the full-connection layer, and calculating the model loss value again; and when the model loss value is less than or equal to the preset loss threshold value, obtaining the multi-expert-based image-text model.
Optionally, the apparatus further comprises:
the receiving module is used for receiving the object to be analyzed after the multi-expert-based image-text model is obtained, and determining a corresponding target analysis module from the multi-expert-based image-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture text expert module;
and the third input module is used for converting the object to be analyzed into a corresponding target input vector, inputting the target input vector into the target analysis module, obtaining a target output vector corresponding to the object to be analyzed, and obtaining a target result through the target output vector.
Optionally, the receiving module is specifically configured to:
when the format of the object to be analyzed is a picture format, taking the target picture expert module as the target analysis module; when the format of the object to be analyzed is a text format, taking the target text expert module as the target analysis module; and when the format of the object to be analyzed comprises a picture format and a text format, taking the target picture expert module, the target text expert module and the target picture text expert module as the target analysis module.
It should be noted that, for other corresponding descriptions of the functional units involved in the multi-expert-based image-text model generation apparatus provided in the embodiment of the present application, reference may be made to the corresponding descriptions of the methods in fig. 1 to fig. 2, which are not repeated here.
Based on the methods shown in fig. 1 to fig. 2, correspondingly, an embodiment of the present application further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the multi-expert-based image-text model generation method shown in fig. 1 to fig. 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, or the like), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, or the like) to execute the method described in the implementation scenarios of the present application.
Based on the above methods shown in fig. 1 to fig. 2 and the virtual device embodiment shown in fig. 3, in order to achieve the above object, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the computer device includes a storage medium and a processor; the storage medium is used for storing a computer program; the processor is used for executing the computer program to implement the above multi-expert-based image-text model generation method shown in fig. 1 to fig. 2.
Optionally, the computer device may further include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
It will be appreciated by those skilled in the art that the computer device structure provided in this embodiment does not limit the computer device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may further include an operating system and a network communication module. An operating system is a program that manages and maintains the hardware and software resources of a computer device, supporting the operation of information handling programs, as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. Following the method described in the embodiments above, the initial picture expert module, the initial text expert module and the initial picture text expert module can be trained together, which saves the training and maintenance costs of the model and effectively reduces the occupation of computer resources.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.
Claims (10)
1. A multi-expert-based image-text model generation method, characterized by comprising the following steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text is provided with a real label indicating the relation between the sample picture and the sample text;
determining an initial picture vector based on the sample picture in any training sample, and inputting the initial picture vector to an initial picture expert module of a preset picture text model to obtain a first target vector;
determining an initial text vector based on the sample text in any training sample, and inputting the initial text vector to an initial text expert module of the preset picture text model to obtain a second target vector;
determining a picture text target vector according to the first target vector and the second target vector, inputting the picture text target vector to an initial picture text expert module of the preset picture text model, and obtaining a first prediction score between the sample picture and the sample text based on an output result and a full connection layer;
and determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model.
2. The method according to claim 1, wherein the determining an initial picture vector based on the sample picture in any of the training samples comprises:
determining a picture dimension of the sample picture, wherein the picture dimension comprises a picture height and/or a picture width;
dividing the picture height and/or the picture width of the sample picture based on a preset dividing size to obtain a sub-sample picture corresponding to the sample picture;
and converting the sub-sample pictures into the initial picture vectors corresponding to each sub-sample picture through a preset conversion tool.
3. The method according to claim 1, wherein the determining an initial text vector based on the sample text in any of the training samples comprises:
based on a preset word vector database, respectively determining word vectors corresponding to each word in the sample text from the preset word vector database, and splicing the word vectors corresponding to each word in the sample text to obtain the initial text vector.
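For instance, with the preset word-vector database modeled as a plain dictionary (an assumption; the claim only requires some preset database) and out-of-vocabulary words falling back to a zero vector (also an assumption), the splicing could look like:

```python
import torch

def text_to_initial_vector(words, word_vectors, dim=256):
    # Look up each word's vector in the preset database, then splice
    # the vectors end to end into the initial text vector.
    vecs = [word_vectors.get(w, torch.zeros(dim)) for w in words]  # OOV -> zeros (assumption)
    return torch.cat(vecs, dim=-1)
```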
4. The method according to claim 2 or 3, wherein determining a picture text target vector according to the first target vector and the second target vector comprises:
splicing the first target vectors corresponding to each sub-sample picture to obtain picture splicing vectors;
and splicing the picture splicing vector and the second target vector corresponding to the sample text to obtain the picture text target vector.
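Continuing the earlier sketch, the two-level splicing of claim 4 might read as follows; first_targets and second_target are illustrative names for the per-sub-picture first target vectors and the sample text's second target vector.

```python
import torch

def picture_text_target_vector(first_targets, second_target):
    # Splice the first target vectors of all sub-sample pictures, then
    # splice the result with the second target vector of the sample text.
    picture_splice = torch.cat(first_targets, dim=-1)   # picture splicing vector
    return torch.cat([picture_splice, second_target], dim=-1)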
5. The method according to claim 1, wherein the determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model specifically comprises:
determining a model loss value of the preset picture text model through a preset cross entropy loss function based on the first prediction score and the real label corresponding to each training sample in the training sample set;
when the model loss value is larger than a preset loss threshold value, adjusting module parameters corresponding to at least one of the initial picture expert module, the initial text expert module and the initial picture text expert module in the preset picture text model according to the model loss value to obtain an updated preset picture text model, obtaining a second prediction score between each sample picture and the sample text through the updated preset picture text model and the full-connection layer, and calculating the model loss value again;
and when the model loss value is less than or equal to the preset loss threshold value, obtaining the multi-expert-based image-text model.
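A minimal sketch of this threshold-gated loop, assuming the earlier MultiExpertModel, a standard PyTorch optimizer, and a data loader yielding (picture vector, text vector, label) batches; the threshold value, round cap, and names are illustrative.

```python
import torch.nn.functional as F

def train_until_threshold(model, loader, optimizer, loss_threshold=0.1, max_rounds=100):
    for _ in range(max_rounds):
        total, batches = 0.0, 0
        for pic_vec, txt_vec, label in loader:
            scores = model(pic_vec, txt_vec)       # prediction scores via experts + FC layer
            loss = F.cross_entropy(scores, label)  # preset cross entropy loss function
            optimizer.zero_grad()
            loss.backward()                        # adjusts the experts' module parameters
            optimizer.step()
            total, batches = total + loss.item(), batches + 1
        if total / batches <= loss_threshold:      # model loss value <= preset threshold
            return model                           # multi-expert-based image-text model
    return model
```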
6. The method according to claim 1, wherein after the multi-expert-based image-text model is obtained, the method further comprises:
receiving an object to be analyzed, and determining a corresponding target analysis module from the multi-expert-based image-text model according to the format of the object to be analyzed, wherein the target analysis module comprises at least one of a target picture expert module, a target text expert module and a target picture text expert module;
and converting the object to be analyzed into a corresponding target input vector, inputting the target input vector into the target analysis module, obtaining a target output vector corresponding to the object to be analyzed, and obtaining a target result through the target output vector.
7. The method according to claim 6, wherein the determining a corresponding target analysis module from the multi-expert-based image-text model according to the format of the object to be analyzed specifically comprises:
when the format of the object to be analyzed is a picture format, taking the target picture expert module as the target analysis module;
when the format of the object to be analyzed is a text format, taking the target text expert module as the target analysis module;
and when the format of the object to be analyzed comprises a picture format and a text format, taking the target picture expert module, the target text expert module and the target picture text expert module as the target analysis module.
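The format-based routing of claims 6 and 7 could be sketched as below, reusing the attribute names from the earlier MultiExpertModel sketch; the format strings and the list-of-modules return type are assumptions.

```python
def select_target_analysis_module(model, fmt: str):
    # Route the object to be analyzed to the matching expert module(s).
    if fmt == "picture":
        return [model.picture_expert]
    if fmt == "text":
        return [model.text_expert]
    if fmt == "picture+text":  # object contains both formats
        return [model.picture_expert, model.text_expert, model.picture_text_expert]
    raise ValueError(f"unsupported format: {fmt}")
```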
8. A multi-expert-based image-text model generation device, characterized by comprising:
the training system comprises a sample acquisition module, a comparison module and a comparison module, wherein the sample acquisition module is used for acquiring a training sample set, the training sample set comprises a plurality of training samples, each training sample comprises a sample picture and a sample text, and the sample text is provided with a real label indicating the relation between the sample picture and the sample text;
the first input module is used for determining an initial picture vector based on the sample picture in any training sample, and inputting the initial picture vector to an initial picture expert module of a preset picture text model to obtain a first target vector;
the second input module is used for determining an initial text vector based on the sample text in any training sample, and inputting the initial text vector to an initial text expert module of the preset picture text model to obtain a second target vector;
the prediction module is used for determining a picture text target vector according to the first target vector and the second target vector, inputting the picture text target vector to an initial picture text expert module of the preset picture text model, and obtaining a first prediction score between the sample picture and the sample text based on an output result and a full connection layer;
and the model training module is used for determining a model loss value of the preset picture text model based on the first prediction score and the real label, and training the preset picture text model based on the model loss value to obtain the multi-expert-based image-text model.
9. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 7.
10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the computer program.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210232059.8A CN114610919A (en) | 2022-03-09 | 2022-03-09 | Multi-expert-based image-text model generation method, device, equipment and medium |
PCT/CN2022/089730 WO2023168811A1 (en) | 2022-03-09 | 2022-04-28 | Picture-text model generation method and apparatus based on multiple experts, and device and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210232059.8A CN114610919A (en) | 2022-03-09 | 2022-03-09 | Multi-expert-based image-text model generation method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114610919A true CN114610919A (en) | 2022-06-10 |
Family ID: 81861502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210232059.8A Pending CN114610919A (en) | 2022-03-09 | 2022-03-09 | Multi-expert-based image-text model generation method, device, equipment and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114610919A (en) |
WO (1) | WO2023168811A1 (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9875429B2 (en) * | 2015-10-06 | 2018-01-23 | Adobe Systems Incorporated | Font attributes for font recognition and similarity |
CN110781633A (en) * | 2019-10-30 | 2020-02-11 | 广东博智林机器人有限公司 | Image-text design quality detection method, device and system based on deep learning model |
CN111310041B (en) * | 2020-02-12 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Image-text publishing method, model training method and device and storage medium |
CN113283551B (en) * | 2021-07-22 | 2021-10-29 | 智者四海(北京)技术有限公司 | Training method and training device of multi-mode pre-training model and electronic equipment |
Application timeline:
- 2022-03-09: CN application CN202210232059.8A filed, published as CN114610919A (status: active, Pending)
- 2022-04-28: PCT application PCT/CN2022/089730 filed, published as WO2023168811A1 (status: unknown)
Also Published As
Publication number | Publication date |
---|---|
WO2023168811A1 (en) | 2023-09-14 |
Similar Documents
Publication | Title
---|---
US20220230420A1 (en) | Artificial intelligence-based object detection method and apparatus, device, and storage medium
US11222236B2 (en) | Image question answering method, apparatus and system, and storage medium
CN109948700B (en) | Method and device for generating feature map
CN109948699B (en) | Method and device for generating feature map
CN109902763B (en) | Method and device for generating feature map
CN112771578B (en) | Image generation using subdivision scaling and depth scaling
CN107609506B (en) | Method and apparatus for generating image
CN111666416B (en) | Method and device for generating semantic matching model
US20230274566A1 (en) | Sequence recognition method and apparatus, image processing device, and storage medium
CN115457531A (en) | Method and device for recognizing text
US20200193661A1 (en) | Signal change apparatus, method, and program
CN112084920B (en) | Method, device, electronic equipment and medium for extracting hotwords
EP3832475A1 (en) | Sentence processing method and system and electronic device
CN110659639B (en) | Chinese character recognition method and device, computer readable medium and electronic equipment
CN109816023B (en) | Method and device for generating picture label model
CN117237606A (en) | Interest point image generation method, interest point image generation device, electronic equipment and storage medium
US12106555B2 (en) | Method and device for retrieving image
CN115661829A (en) | Image-text recognition method and data processing method of image-text recognition model
CN110674813A (en) | Chinese character recognition method and device, computer readable medium and electronic equipment
CN109919249B (en) | Method and device for generating feature map
CN111797266B (en) | Image processing method and apparatus, storage medium, and electronic device
CN114610919A (en) | Multi-expert-based image-text model generation method, device, equipment and medium
CN117391201A (en) | Question answering method and device and electronic equipment
EP4447006A1 (en) | Font recognition method and apparatus, readable medium, and electronic device
CN113780370B (en) | Visual question-answering method, device, equipment and storage medium
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||