CN117975486A - Text image-based product abstract generation method, system and storage medium - Google Patents
- Publication number
- CN117975486A (application number CN202410372214.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- product
- image
- mode
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
After the product abstract generation model is trained by a multi-modal multi-task learning method, only the multi-modal encoder and the visual-guide decoder are retained when the model is applied and deployed. The product image and the product text description are input into the multi-modal encoder to obtain an image characterization and a text characterization respectively; the image characterization and the text characterization are then input into the visual-guide decoder, which generates the product summary. The invention extends the single-modal BART into a multi-modal BART, improves the interaction and fusion of multi-modal features, and introduces mutual information enhancement to obtain semantically rich representations of the text input. Extensive experiments on the Chinese e-commerce product summary dataset CEPSUM show that, compared with existing methods, the proposed method generates more concise and coherent product summaries.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method, a system and a storage medium for generating a product abstract based on a text image, which are suitable for multi-mode product abstract generating tasks.
Background
In e-commerce advertising, products are typically displayed on web pages rich in information, including product titles, detailed descriptions, and product pictures. While this rich product information enhances customers' understanding of a product, the amount of content on a page is so large that customers must spend valuable time sifting through details to assess whether the product meets their particular needs. In view of this potential obstacle to the shopping experience, vendors are under pressure to provide a compact product summary that distills the unique features and advantages of the product from product information such as textual descriptions and images. A concise and coherent product summary not only reduces the customer's cognitive burden but also significantly reduces the time cost of finding an ideal product. However, manually writing summaries for large quantities of products is a challenging task that requires expertise to condense the critical product information. In response to this challenge, many researchers have focused on automatic product summarization techniques to help marketers write attractive product summaries and promote e-commerce sales.
Early automatic product summarization methods relied primarily on the text modality, using textual information such as product descriptions and product attributes to generate summaries. While these methods exhibit good performance, they ignore the visual modality of the product. In e-commerce advertising, products are usually accompanied by images to appeal to customers, and product images can provide rich information to strengthen automatic product summarization. Multi-modal product summarization (MMPS) therefore uses both the textual description and the images of a product to generate a concise and coherent product summary. The core goal of MMPS is to generate a comprehensive text summary for a product from its multi-modal information, including long text descriptions (such as a concise product title and a detailed product description) and product images. Although early MMPS approaches made progress in advancing automatic product summarization, they still have two limitations:
(1) The existing MMPS methods introduce a large-scale pre-trained visual encoder to extract image features, which severely increases the parameters and inference time of the model. For example, MMPG extracts global visual features using ResNet-101 pre-trained on ImageNet and acquires local visual features using Faster R-CNN pre-trained on Visual Genome.
(2) Furthermore, existing methods employ separate, modality-specific encoders to obtain visual and textual representations. In this design, there is insufficient interaction between the two modalities, resulting in inadequate fusion of the multi-modal information.
Disclosure of Invention
In order to solve the problems, the invention provides a product abstract generating method, a system and a storage medium based on text images.
According to the text image-based product abstract generation method, after the product abstract generation model is trained by a multi-modal multi-task learning method, only the multi-modal encoder and the visual-guide decoder are retained when the model is applied and deployed. Given the multi-modal product information, consisting of a product image and a product text description, the product image and the product text description are input into the multi-modal encoder to obtain an image characterization and a text characterization respectively; the image characterization and the text characterization are then input into the visual-guide decoder, which generates the product summary. The method specifically comprises the following steps:
Step 1: the product image is cut into multiple non-overlapping regions and flattened into a region sequence, and the product text description is expressed as a word sequence; special marks are used to indicate the beginning of the text input and the beginning of the image input;
Step 2: the text and the image preprocessed in Step 1 are mapped by two linear projections into a text embedding and a visual embedding respectively; the text embedding and the visual embedding are concatenated and input into the multi-modal encoder, whose output is the multi-modal representation, comprising the representation of the mark at the beginning of the text input, the representation of the mark at the beginning of the image input, and the representations of the product text description and the product image; the multi-modal encoder is constructed as follows:
wherein the product text description sequence, which comprises the product title and the long text description, is mapped by a linear projection into the text embedding, and the image region sequence is mapped by another linear projection into the visual embedding; the multi-modal embedding is formed by concatenating the text embedding and the visual embedding together with special mark embeddings, which are introduced to mark the beginning and the end of the text input and the beginning of the image input, and a position embedding over the whole multi-modal sequence is added; the multi-modal encoder, applied as a function to this multi-modal embedding, outputs the multi-modal representation; the number of image regions and the number of words in the text sequence determine the lengths of the visual and textual sub-sequences, and the two linear projections, i.e. the embedding matrices, together with the multi-modal encoder constitute the learnable parameters;
Step 3: the multi-modal representation is input into the visual-guide decoder, which autoregressively generates the product summary word by word until the full summary is produced.
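A minimal PyTorch-style sketch of this deployed inference path (Steps 1 to 3) is given below. All class names, dimensions, vocabulary size and special-mark identifiers are illustrative assumptions, and a standard Transformer decoder stands in for the visual-guide decoder, which is detailed later:

```python
import torch
import torch.nn as nn

class MultiModalSummarizer(nn.Module):
    """Illustrative sketch: patchify the product image, embed image regions and
    text tokens into a shared space, encode the concatenated sequence, then
    greedily decode a summary."""

    def __init__(self, vocab_size=21128, d_model=768, patch=32, max_pos=512):
        super().__init__()
        self.patch = patch
        self.text_embed = nn.Embedding(vocab_size, d_model)        # text embedding matrix
        self.vis_embed = nn.Linear(3 * patch * patch, d_model)     # linear projection of image regions
        self.special = nn.Embedding(3, d_model)                    # text-start, text-end, image-start marks
        self.pos_embed = nn.Embedding(max_pos, d_model)            # position embedding of the whole sequence
        enc_layer = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        dec_layer = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def patchify(self, image):
        # Step 1: cut the image into non-overlapping patches and flatten them
        # (a 224x224 image with patch=32 yields 7x7 = 49 regions).
        b, c, h, w = image.shape
        p = self.patch
        patches = image.unfold(2, p, p).unfold(3, p, p)            # (b, c, h/p, w/p, p, p)
        return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

    def encode(self, image, text_ids):
        # Step 2: [text-start] text [text-end] [image-start] image regions
        b = text_ids.size(0)
        marks = self.special(torch.arange(3, device=text_ids.device)).expand(b, 3, -1)
        seq = torch.cat([marks[:, :1], self.text_embed(text_ids), marks[:, 1:2],
                         marks[:, 2:], self.vis_embed(self.patchify(image))], dim=1)
        seq = seq + self.pos_embed(torch.arange(seq.size(1), device=seq.device))
        return self.encoder(seq)                                   # multi-modal representation

    @torch.no_grad()
    def generate(self, image, text_ids, bos_id=101, eos_id=102, max_len=80):
        # Step 3: greedy word-by-word generation conditioned on the encoder output.
        memory = self.encode(image, text_ids)
        out = torch.full((text_ids.size(0), 1), bos_id, device=text_ids.device)
        for _ in range(max_len):
            dec = self.decoder(self.text_embed(out), memory)
            nxt = self.lm_head(dec[:, -1]).argmax(-1, keepdim=True)
            out = torch.cat([out, nxt], dim=1)
            if (nxt == eos_id).all():
                break
        return out
```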
The product abstract generation model is obtained by training with the following multi-modal multi-task learning method:
(1) Guiding the model by the multi-modal product summary generation task:
The visual-guide decoder iteratively attends to the previously generated words and the multi-modal encoder output, predicts the probability of the next word, and generates the next word until all words of the product summary have been produced; the model is trained by minimizing a negative log-likelihood loss function, defined as follows:
wherein the loss sums, over all positions of the generated product summary, the negative log-probability that the model assigns to the current word given the words generated so far and the encoder output;
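A standard form of this negative log-likelihood objective, written with notation introduced here for illustration (y_t for the t-th summary word, N_y for the number of summary words, H for the multi-modal encoder output, theta for the model parameters):

```latex
\mathcal{L}_{\mathrm{gen}} \;=\; -\sum_{t=1}^{N_y} \log P\!\left(y_t \mid y_{<t},\, H;\ \theta\right)
```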
(2) Guiding the model by a mutual information enhancement (MIE) task that maximizes the mutual information between the text input data and the reference summary: the reference summary is input into the multi-modal encoder to obtain its characterization; the text characterization and the reference-summary characterization are then average-pooled, and their similarity is maximized to complete the mutual information enhancement:
wherein special marks indicate the beginning and the end of the reference summary input; the reference summary sequence has a given number of words, and its embedding and its characterization are obtained with the same multi-modal encoder function; the pooling is a mean-pooling operation along the column dimension, and the similarity is computed with a matrix transposition and a normalization operation; the loss is computed over a batch of a given size with a learnable temperature parameter and an exponential function based on the natural constant e, and constitutes the mutual information enhancement loss function;
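One plausible formulation of this objective is an InfoNCE-style contrastive loss over the batch, sketched below with assumed notation (h̄^x_i and h̄^y_i for the mean-pooled text and reference-summary characterizations of the i-th sample, B for the batch size, tau for the learnable temperature, sim for a normalized similarity); the exact similarity and normalization of the original formulation may differ:

```latex
\mathcal{L}_{\mathrm{MIE}}
  \;=\; -\frac{1}{B}\sum_{i=1}^{B}
  \log \frac{\exp\!\big(\mathrm{sim}(\bar{h}^{x}_{i}, \bar{h}^{y}_{i})/\tau\big)}
            {\sum_{j=1}^{B}\exp\!\big(\mathrm{sim}(\bar{h}^{x}_{i}, \bar{h}^{y}_{j})/\tau\big)}
```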
(3) Visual-language knowledge distillation (VLKD) is realized through uni-modal feature alignment (UFA) and cross-modal contrastive learning (CCL) training, which ensures alignment across modalities and promotes a better understanding of cross-modal relations; the multi-modal encoder and the visual-guide decoder are continuously updated throughout the training process, while the CLIP image encoder is kept unchanged:
The product image is input into the CLIP image encoder to obtain its global characterization; the distance between this global characterization and the global-information characterization output by the multi-modal encoder is then minimized, completing the uni-modal feature alignment; at the same time, contrastive learning is performed between the CLIP global image characterization and the text characterization of the multi-modal encoder to complete the cross-modal contrastive training, so that the multi-modal encoder can distill multi-modal knowledge from the CLIP image encoder; the uni-modal feature alignment (UFA) loss function is expressed as follows:
wherein the CLIP image encoder outputs, for each image in a batch, the characterization of the [CLS] mark, which represents the global information of that image; a weight matrix linearly projects the CLIP image encoder output into the space of the multi-modal encoder; the loss is the distance, averaged over the batch, between this projected characterization and the multi-modal encoder's characterization of the mark at the beginning of the corresponding image input;
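An illustrative form of this alignment loss, assuming an L2 distance (the specific norm is an assumption) and notation introduced here (c_i for the CLIP [CLS] characterization of the i-th image, W_c for the projection matrix, h^[IMG]_i for the multi-modal encoder characterization of the i-th image-start mark, B for the batch size):

```latex
\mathcal{L}_{\mathrm{UFA}}
  \;=\; \frac{1}{B}\sum_{i=1}^{B}
  \big\lVert\, W_{c}\, c_{i} \;-\; h^{[\mathrm{IMG}]}_{i} \,\big\rVert_{2}
```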
The cross-modal contrastive learning (CCL) loss function is expressed as follows:
wherein the loss uses a learnable temperature parameter and a matrix transposition; i2t denotes the image-to-text direction and t2i the opposite, text-to-image, direction; for each sample in the batch, the text feature is the multi-modal encoder's characterization of the mark at the beginning of the text input, and the image feature is the CLIP image encoder's characterization of the [CLS]-marked image global information, and the similarity of matched pairs is contrasted against that of all other pairs in the batch;
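A typical symmetric InfoNCE form consistent with this description, with assumed notation (s_ij for the similarity between the projected CLIP characterization of image i and the text-start characterization of sample j, tau for the learnable temperature, B for the batch size); the first term is the image-to-text (i2t) direction and the second the text-to-image (t2i) direction:

```latex
s_{ij} \;=\; \big(W_{c}\, c_{i}\big)^{\!\top} h^{[\mathrm{BOS}]}_{j},
\qquad
\mathcal{L}_{\mathrm{CCL}}
  \;=\; -\frac{1}{2B}\sum_{i=1}^{B}
  \left[
    \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{B}\exp(s_{ij}/\tau)}
    \;+\;
    \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{B}\exp(s_{ji}/\tau)}
  \right]
```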
(4) The total loss function of the product abstract generation model is a combination of all of the above loss functions:
wherein the individual loss terms are combined with predefined loss weights.
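Written out, the combination might take the following weighted form; the number of distinct weights lambda is an assumption, since the description only states that predefined loss weights are used:

```latex
\mathcal{L}
  \;=\; \mathcal{L}_{\mathrm{gen}}
  \;+\; \lambda_{1}\,\mathcal{L}_{\mathrm{MIE}}
  \;+\; \lambda_{2}\,\mathcal{L}_{\mathrm{UFA}}
  \;+\; \lambda_{3}\,\mathcal{L}_{\mathrm{CCL}}
```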
To optimize the use of the encoded visual and textual information, the visual-guide decoder first applies a self-attention mechanism, then attends to the encoded visual representation to generate a visually guided cross-modal representation, and then continuously fuses the textual representation to produce the final multi-modal representation, as follows:
wherein the decoder operates on the representation of the reference summary through multi-head self-attention, cross-attention and layer normalization; two uni-modal cross-attention modules are designed specifically for the visual modality and the text modality, respectively, and using these two cross-attention modules to model the visual and textual features separately enables the visual-guide decoder to be explicitly guided by information from the different modalities.
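A minimal PyTorch-style sketch of one such visual-guide decoder layer is given below; module names, dimensions and the placement of layer normalization are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class VisualGuideDecoderLayer(nn.Module):
    """Sketch of one visual-guide decoder layer: self-attention over the summary
    prefix, then a visual cross-attention, then a textual cross-attention, each
    followed by a residual connection and layer normalization."""

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # visual cross-attention
        self.txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # textual cross-attention
        self.norm1, self.norm2, self.norm3, self.norm4 = (nn.LayerNorm(d_model) for _ in range(4))
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, y, h_vis, h_txt, causal_mask=None):
        # masked multi-head self-attention over previously generated summary tokens
        s, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + s)
        # cross-attention to the encoded image regions (visual guidance)
        v, _ = self.vis_attn(y, h_vis, h_vis)
        y = self.norm2(y + v)
        # cross-attention to the encoded text tokens, fusing the textual representation
        t, _ = self.txt_attn(y, h_txt, h_txt)
        y = self.norm3(y + t)
        return self.norm4(y + self.ffn(y))
```

Stacking several such layers and adding a language-model head would yield the decoder used at inference time.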
The text image-based product abstract generation system comprises a memory and a processor, the memory being configured to store a program and the processor being configured to run the program stored in the memory, wherein the program, when run, executes the above text image-based product abstract generation method.
A non-volatile storage medium comprises a stored program, wherein, when the program runs, the system in which the non-volatile storage medium is located is controlled to execute the above text image-based product abstract generation method.
By solving the above problems, the invention achieves the following beneficial effects:
(1) The present invention designs a multi-modal encoder distilled from a large-scale pre-trained visual language teacher model (e.g., CLIP) that reduces the amount of parameters of the product summary generation model UMPS and the resource consumption during visual feature processing.
(2) The invention extends the single-modal BART, a denoising autoencoder-decoder designed for text-related tasks, into a multi-modal BART: by concatenating the text embedding and the visual embedding as input, the plain-text encoder becomes a multi-modal encoder. This improves the interaction and fusion of multi-modal features in the product abstract generation model UMPS, enables UMPS to capture and fuse the multi-modal features within the multi-modal encoder, and, by fusing textual and visual modality information, provides better text-generation guidance for the visual-guide decoder.
(3) The invention introduces a mutual information enhancement method into the text image-based product abstract generation framework, which encourages the model to learn semantically rich characterizations from the text input.
Drawings
FIG. 1 is a schematic diagram of a multi-modal merchandise summarization task of the present invention;
FIG. 2 is a schematic diagram of a multi-modal multi-task learning method for training the product abstract generation model of the invention;
FIG. 3 is a statistical plot of CEPSUM datasets;
FIG. 4 is a flow chart of generating a product summary after deployment of the model application of the present invention;
Fig. 5 is a schematic block diagram of a text image based product summary generation system of the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In embodiments of the application, the words "exemplary" or "such as" are used to mean that any embodiment or aspect of the application described as "exemplary" or "such as" is not to be interpreted as preferred or advantageous over other embodiments or aspects. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
Example 1
The first embodiment of the invention relates to a text image-based product abstract generation method, which realizes multi-modal product summary generation through a product abstract generation model UMPS. The model UMPS extends a single-modal BART into a multi-modal BART and takes the multi-modal BART as its backbone architecture. After the model is trained by the multi-modal multi-task learning method, only the multi-modal encoder and the visual-guide decoder are retained when the model is applied and deployed, as shown in FIG. 4. Given the multi-modal product information, consisting of a product image and a product text description, the product image and the product text description are input into the multi-modal encoder to obtain an image characterization and a text characterization respectively; the image characterization and the text characterization are then input into the visual-guide decoder, which generates the product summary. The method specifically comprises the following steps:
Step 1: the product image is cut into multiple non-overlapping regions and flattened into a region sequence, and the product text description is expressed as a word sequence; special marks are used to indicate the beginning of the text input and the beginning of the image input;
Step 2: the text and the image preprocessed in Step 1 are mapped by two linear projections into a text embedding and a visual embedding respectively; the text embedding and the visual embedding are concatenated and input into the multi-modal encoder, whose output is the multi-modal representation, comprising the representation of the mark at the beginning of the text input, the representation of the mark at the beginning of the image input, and the representations of the product text description and the product image; the multi-modal encoder is constructed as follows:
wherein the product text description sequence, which comprises the product title and the long text description, is mapped by a linear projection into the text embedding, and the image region sequence is mapped by another linear projection into the visual embedding; the multi-modal embedding is formed by concatenating the text embedding and the visual embedding together with special mark embeddings, which are introduced to mark the beginning and the end of the text input and the beginning of the image input, and a position embedding over the whole multi-modal sequence is added; the multi-modal encoder, applied as a function to this multi-modal embedding, outputs the multi-modal representation; the number of image regions and the number of words in the text sequence determine the lengths of the visual and textual sub-sequences, and the two linear projections, i.e. the embedding matrices, together with the multi-modal encoder constitute the learnable parameters;
Step 3: the multi-modal representation is input into the visual-guide decoder, which autoregressively generates the product summary word by word until the full summary is produced.
The visual-guide decoder is configured to optimize the utilization of the encoded visual and textual information: it first applies a self-attention mechanism, then attends to the encoded visual representation to generate a visually guided cross-modal representation, and then continuously fuses the textual representation to produce the final multi-modal representation, as follows:
wherein the decoder operates on the representation of the reference summary (in this description, plain symbols denote unprocessed original inputs, while bold symbols denote vectors or matrices processed by the model) through multi-head self-attention, cross-attention and layer normalization; two uni-modal cross-attention modules are designed specifically for the visual modality and the text modality, respectively. Modeling the visual features and the text features separately with two cross-attention modules enables the visual-guide decoder to be explicitly guided by information from the different modalities.
The multi-modal encoder: the prior art mainly uses large-scale vision pre-training models to process visual information. While this helps extract visual representations more effectively, it also imposes a significant computational burden. In multi-modal tasks, the processing time of visual information often exceeds that of textual information due to the vast size of the visual backbone model, and this imbalance deviates from the efficiency sought by automated multi-modal product summarization. Unlike previous methods, which use separate dedicated encoders for the different modalities, the present invention maps the text and the image through two linear projections and then feeds them into a single multi-modal encoder. The invention uses a simple but effective linear projection over the image regions to obtain the visual embedding; this not only reduces the computational requirements associated with a huge visual backbone model, but also ensures a more balanced distribution of processing time between the visual and textual modalities. Unlike the conventional BART architecture, which follows a single-modal input paradigm, the modified decoder of the present invention attends to the visual and textual information sequentially.
The product abstract generation model UMPS is trained by a multi-modal multi-task learning method, which specifically comprises the following steps:
Step 1: guiding the model by the multi-modal product summary generation task:
As shown in the upper left-hand portion of FIG. 2, the decoder iteratively attends to the previously generated words and the multi-modal encoder output, predicts the probability of the next word, and generates the next word until all words of the product summary have been produced. Generation proceeds word by word: given the start mark, the decoder generates the first word; given the start mark and the first word, it generates the second word; given the start mark and the first two words, it generates the third word; and so on, until the full summary has been generated;
The model is trained by minimizing a negative log-likelihood loss function, defined as follows:
wherein the loss sums, over all positions of the generated product summary, the negative log-probability that the model assigns to the current word given the words generated so far and the encoder output;
Step 2: guiding the model by a mutual information enhancement (MIE) task that maximizes the mutual information between the text input data and the reference summary. The reference summary is input into the multi-modal encoder to obtain its characterization; the text characterization and the reference-summary characterization are then average-pooled, and their similarity is maximized to complete the mutual information enhancement, specifically:
wherein special marks indicate the beginning and the end of the reference summary input; the reference summary sequence has a given number of words, and its embedding and its characterization, i.e. the characterizations of the individual reference summary words, are obtained with the same multi-modal encoder function; the pooling is a mean-pooling operation along the column dimension, and the similarity is computed with a matrix transposition and a normalization operation; the loss is computed over a batch of a given size with a learnable temperature parameter and an exponential function based on the natural constant e, and constitutes the mutual information enhancement loss function;
Maximizing the mutual information between a learned characterization and the data is a common strategy for enhancing representation learning, and a design objective of the text image-based product abstract generation model UMPS of the invention is to maximize the mutual information between the text input data and the reference summary. This encourages the model to encode the input comprehensively and ensures that information that later proves key and relevant for generating the product summary is captured, as shown in the lower left-hand corner of FIG. 2;
Step 3: visual-language knowledge distillation (VLKD) is implemented through uni-modal feature alignment (UFA) and cross-modal contrastive learning (CCL) training, which ensures alignment across modalities and promotes a better understanding of cross-modal relations; the multi-modal encoder and the visual-guide decoder are continuously updated throughout the training process, while the CLIP image encoder is kept unchanged:
The product image is input into the CLIP image encoder to obtain its global characterization; the distance between this global characterization and the global-information characterization output by the multi-modal encoder is then minimized, completing the uni-modal feature alignment; at the same time, contrastive learning is performed between the CLIP global image characterization and the text characterization of the multi-modal encoder to complete the cross-modal contrastive training, so that the multi-modal encoder can distill multi-modal knowledge from the CLIP image encoder;
As shown in the upper right-hand corner of FIG. 2, the training process of the invention introduces uni-modal feature alignment (UFA) and cross-modal contrastive learning (CCL) while distilling multi-modal knowledge from the CLIP image encoder, in order to reduce the number of UMPS parameters. One key strategy is to keep the model parameters of the CLIP image encoder frozen throughout the training process. This ensures that no gradient propagates back through the frozen parameters, thereby preserving the alignment learned by CLIP's dual encoders; that is, the multi-modal encoder and the visual-guide decoder are updated continuously throughout training, while the CLIP image encoder remains unchanged.
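A minimal sketch of this freezing strategy, assuming a PyTorch module handle for the CLIP image encoder:

```python
import torch.nn as nn

def freeze_teacher(clip_image_encoder: nn.Module) -> nn.Module:
    """Freeze the CLIP image encoder (the distillation teacher) so no gradient
    flows back through it; the student multi-modal encoder and the visual-guide
    decoder remain trainable."""
    for p in clip_image_encoder.parameters():
        p.requires_grad_(False)
    return clip_image_encoder.eval()
```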
Step 3.1: as shown in the lower right-hand corner of FIG. 2, to enhance the ability of the multi-modal encoder to model visual features, a uni-modal feature alignment (UFA) task is designed to minimize the distance between the global characterization output by the CLIP image encoder and the global-information characterization output by the multi-modal encoder, so that the two output characterizations of the same input image are drawn together; the uni-modal feature alignment (UFA) loss function is expressed as follows:
wherein the CLIP image encoder outputs, for each image in a batch, the characterization of the [CLS] mark, which represents the global information of that image; a weight matrix linearly projects the CLIP image encoder output into the space of the multi-modal encoder; the loss is the distance, averaged over the batch, between this projected characterization and the multi-modal encoder's characterization of the mark at the beginning of the corresponding image input;
Step 3.2: as shown in the upper right-hand corner of FIG. 2, cross-modal contrastive learning (CCL) is introduced as an optimization technique to improve and refine the alignment of the semantic space covering the multi-modal features generated by the multi-modal encoder:
A symmetric InfoNCE loss is applied to the text representations generated by the multi-modal encoder and the visual representations generated by the CLIP image encoder; optimizing this loss steers the model to align the representations from the two modalities and to enhance a shared semantic understanding of the underlying content. The cross-modal contrastive learning (CCL) loss function is expressed as follows:
wherein the loss uses a learnable temperature parameter and a matrix transposition; i2t denotes the image-to-text direction and t2i the opposite, text-to-image, direction, the two directions sharing the same form and being distinguished only by their superscripts; for each sample in the batch, the text feature is the multi-modal encoder's characterization of the mark at the beginning of the text input, and the image feature is the CLIP image encoder's characterization of the [CLS]-marked image global information, and the similarity of matched pairs is contrasted against that of all other pairs in the batch;
Step 4: the total loss function of the product abstract generation model UMPS is a combination of all of the above loss functions:
wherein the individual loss terms are combined with predefined loss weights.
The present invention uses the CEPSUM dataset, collected from a Chinese e-commerce platform, during model training. It includes about 1.4 million products covering three categories: home appliances, clothing, and bags. FIG. 3 is a statistical plot of the CEPSUM dataset. As shown in FIG. 1, each product in CEPSUM contains a long product description, a product title, a product image, and a high-quality product summary written by humans.
The present invention uses pre-trained bart-base-chinese to initialize the model parameters so that the multi-mode encoder and visual guide decoder in the model each have six layers, set the batch size B to 16, the maximum length M of the text sequence to 400, and the number of reserved areas per image L to 49. The model was optimized using Adam with an initial learning rate of 3e-5.
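A sketch of the corresponding training configuration, using only the hyper-parameters stated above (model construction, bart-base-chinese initialization and data loading are omitted and assumed elsewhere):

```python
import torch

# Hyper-parameters stated in this embodiment.
BATCH_SIZE = 16        # B
MAX_TEXT_LEN = 400     # M, maximum length of the text sequence
NUM_REGIONS = 49       # L, reserved image regions per image
LEARNING_RATE = 3e-5   # initial learning rate

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Adam optimizer with the initial learning rate reported in the embodiment."""
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```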
Example two
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Specifically, each step of the method embodiment in the embodiment of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in a software form, and the steps of the method disclosed in connection with the embodiment of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
A second embodiment of the present application provides a text image-based product abstract generating system, as shown in fig. 5, the system 700 may include: a memory 710 and a processor 720, the memory 710 being configured to store a computer program and to transfer the program code to the processor 720. In other words, the processor 720 may call and run a computer program from the memory 710 to implement the method in the embodiment of the present application. For example, the processor 720 may be configured to perform the method provided in the first embodiment according to instructions in the computer program.
In some embodiments of the application, the processor 720 may include, but is not limited to:
A general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the application, the memory 710 includes, but is not limited to: volatile memory and/or non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous dynamic RAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct memory bus RAM (DRRAM).
In some embodiments of the present application, the computer program may be divided into one or more modules, which are stored in the memory 710 and executed by the processor 720 to perform the method provided by the first embodiment of the present application. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are intended to describe the execution of the computer program by the system 700.
As shown in fig. 5, the system may further include: a transceiver 730, the transceiver 730 being connectable to the processor 720 or the memory 710. The processor 720 may control the transceiver 730 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. Transceiver 730 may include a transmitter and a receiver. Transceiver 730 may further include antennas, the number of which may be one or more.
It should be appreciated that the various components in the system 700 are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
Example III
The third embodiment of the present invention also provides a computer storage medium having a computer program stored thereon, which when executed by a computer enables the computer to perform the method provided in the first embodiment.
The present application also provides a computer program product comprising a computer program/instructions which, when executed by a computer, cause the computer to perform the method provided in the first embodiment.
When implemented in software, the method may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, the instructions produce, in whole or in part, the flows or functions according to the embodiments of the application. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state drive (SSD)), or the like.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (4)
1. A text image-based product abstract generation method, characterized in that, after the product abstract generation model is trained by a multi-modal multi-task learning method, only the multi-modal encoder and the visual-guide decoder are retained when the model is applied and deployed; given the multi-modal product information, consisting of a product image and a product text description, the product image and the product text description are input into the multi-modal encoder to obtain an image characterization and a text characterization respectively; the image characterization and the text characterization are then input into the visual-guide decoder, which generates the product summary; the method specifically comprises the following steps:
Step 1: the product image is cut into multiple non-overlapping regions and flattened into a region sequence, and the product text description is expressed as a word sequence; special marks are used to indicate the beginning of the text input and the beginning of the image input;
Step 2: the text and the image preprocessed in Step 1 are mapped by two linear projections into a text embedding and a visual embedding respectively; the text embedding and the visual embedding are concatenated and input into the multi-modal encoder, whose output is the multi-modal representation, comprising the representation of the mark at the beginning of the text input, the representation of the mark at the beginning of the image input, and the representations of the product text description and the product image; the multi-modal encoder is constructed as follows:
wherein the product text description sequence, which comprises the product title and the long text description, is mapped by a linear projection into the text embedding, and the image region sequence is mapped by another linear projection into the visual embedding; the multi-modal embedding is formed by concatenating the text embedding and the visual embedding together with special mark embeddings, which are introduced to mark the beginning and the end of the text input and the beginning of the image input, and a position embedding over the whole multi-modal sequence is added; the multi-modal encoder, applied as a function to this multi-modal embedding, outputs the multi-modal representation; the number of image regions and the number of words in the text sequence determine the lengths of the visual and textual sub-sequences, and the two linear projections, i.e. the embedding matrices, together with the multi-modal encoder constitute the learnable parameters;
Step 3: the multi-modal representation is input into the visual-guide decoder, which autoregressively generates the product summary word by word until the full summary is produced;
the product abstract generation model is obtained by training with the following multi-modal multi-task learning method:
(1) Guiding the model by the multi-modal product summary generation task:
The visual-guide decoder iteratively attends to the previously generated words and the multi-modal encoder output, predicts the probability of the next word, and generates the next word until all words of the product summary have been produced; the model is trained by minimizing a negative log-likelihood loss function, defined as follows:
wherein the loss sums, over all positions of the generated product summary, the negative log-probability that the model assigns to the current word given the words generated so far and the encoder output;
(2) Guiding the model by a mutual information enhancement (MIE) task that maximizes the mutual information between the text input data and the reference summary: the reference summary is input into the multi-modal encoder to obtain its characterization; the text characterization and the reference-summary characterization are then average-pooled, and their similarity is maximized to complete the mutual information enhancement:
wherein special marks indicate the beginning and the end of the reference summary input; the reference summary sequence has a given number of words, and its embedding and its characterization are obtained with the same multi-modal encoder function; the pooling is a mean-pooling operation along the column dimension, and the similarity is computed with a matrix transposition and a normalization operation; the loss is computed over a batch of a given size with a learnable temperature parameter and an exponential function based on the natural constant e, and constitutes the mutual information enhancement loss function;
(3) Visual-language knowledge distillation (VLKD) is realized through uni-modal feature alignment (UFA) and cross-modal contrastive learning (CCL) training, which ensures alignment across modalities and promotes a better understanding of cross-modal relations; the multi-modal encoder and the visual-guide decoder are continuously updated throughout the training process, while the CLIP image encoder is kept unchanged:
The product image is input into the CLIP image encoder to obtain its global characterization; the distance between this global characterization and the global-information characterization output by the multi-modal encoder is then minimized, completing the uni-modal feature alignment; at the same time, contrastive learning is performed between the CLIP global image characterization and the text characterization of the multi-modal encoder to complete the cross-modal contrastive training, so that the multi-modal encoder can distill multi-modal knowledge from the CLIP image encoder; the uni-modal feature alignment (UFA) loss function is expressed as follows:
wherein the CLIP image encoder outputs, for each image in a batch, the characterization of the [CLS] mark, which represents the global information of that image; a weight matrix linearly projects the CLIP image encoder output into the space of the multi-modal encoder; the loss is the distance, averaged over the batch, between this projected characterization and the multi-modal encoder's characterization of the mark at the beginning of the corresponding image input;
The cross-modal contrastive learning (CCL) loss function is expressed as follows:
wherein the loss uses a learnable temperature parameter and a matrix transposition; i2t denotes the image-to-text direction and t2i the opposite, text-to-image, direction; for each sample in the batch, the text feature is the multi-modal encoder's characterization of the mark at the beginning of the text input, and the image feature is the CLIP image encoder's characterization of the [CLS]-marked image global information, and the similarity of matched pairs is contrasted against that of all other pairs in the batch;
(4) The total loss function of the product abstract generation model is a combination of all of the above loss functions:
wherein the individual loss terms are combined with predefined loss weights.
2. The text image-based product abstract generation method of claim 1, wherein the visual-guide decoder is adapted to optimize the utilization of the encoded visual and textual information by first applying a self-attention mechanism, then attending to the encoded visual representation to generate a visually guided cross-modal representation, and then continuously fusing the textual representation to generate a final multi-modal representation, as follows:
wherein the decoder operates on the representation of the reference summary through multi-head self-attention, cross-attention and layer normalization; two uni-modal cross-attention modules are designed specifically for the visual modality and the text modality, respectively, and using these two cross-attention modules to model the visual and textual features separately enables the visual-guide decoder to be explicitly guided by information from the different modalities.
3. A text image-based product abstract generation system, comprising a memory for storing a program and a processor for running the program stored in the memory, wherein, when the program runs, the text image-based product abstract generation method according to any one of claims 1 to 2 is performed.
4. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored program, wherein, when the program runs, the system in which the non-volatile storage medium is located is controlled to execute the text image-based product abstract generation method according to any one of claims 1 to 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410372214.5A CN117975486B (en) | 2024-03-29 | 2024-03-29 | Text image-based product abstract generation method, system and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410372214.5A CN117975486B (en) | 2024-03-29 | 2024-03-29 | Text image-based product abstract generation method, system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117975486A true CN117975486A (en) | 2024-05-03 |
CN117975486B CN117975486B (en) | 2024-08-16 |
Family
ID=90860025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410372214.5A Active CN117975486B (en) | 2024-03-29 | 2024-03-29 | Text image-based product abstract generation method, system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117975486B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- KR20200087977A (en) * | 2019-01-14 | 2020-07-22 | 강원대학교산학협력단 | Multimodal document summary system and method |
US20220284321A1 (en) * | 2021-03-03 | 2022-09-08 | Adobe Inc. | Visual-semantic representation learning via multi-modal contrastive training |
US20230237093A1 (en) * | 2022-01-27 | 2023-07-27 | Adobe Inc. | Video recommender system by knowledge based multi-modal graph neural networks |
CN117218503A (en) * | 2023-09-12 | 2023-12-12 | 昆明理工大学 | Cross-Han language news text summarization method integrating image information |
CN116958997A (en) * | 2023-09-19 | 2023-10-27 | 南京大数据集团有限公司 | Graphic summary method and system based on heterogeneous graphic neural network |
CN117421591A (en) * | 2023-10-16 | 2024-01-19 | 长春理工大学 | Multi-modal characterization learning method based on text-guided image block screening |
Also Published As
Publication number | Publication date |
---|---|
CN117975486B (en) | 2024-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112685565B (en) | Text classification method based on multi-mode information fusion and related equipment thereof | |
CN111339415B (en) | Click rate prediction method and device based on multi-interactive attention network | |
US11436451B2 (en) | Multimodal fine-grained mixing method and system, device, and storage medium | |
CN110990555B (en) | End-to-end retrieval type dialogue method and system and computer equipment | |
CN117173504A (en) | Training method, training device, training equipment and training storage medium for text-generated graph model | |
CN110852110B (en) | Target sentence extraction method, question generation method, and information processing apparatus | |
CN116579339B (en) | Task execution method and optimization task execution method | |
CN114510939A (en) | Entity relationship extraction method and device, electronic equipment and storage medium | |
CN116861258B (en) | Model processing method, device, equipment and storage medium | |
CN116720004A (en) | Recommendation reason generation method, device, equipment and storage medium | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN118155023B (en) | Text graph and model training method and device, electronic equipment and storage medium | |
CN116932731A (en) | Multi-mode knowledge question-answering method and system for 5G message | |
CN115994317A (en) | Incomplete multi-view multi-label classification method and system based on depth contrast learning | |
CN115114407B (en) | Intention recognition method, device, computer equipment and storage medium | |
CN116467417A (en) | Method, device, equipment and storage medium for generating answers to questions | |
CN117975486B (en) | Text image-based product abstract generation method, system and storage medium | |
CN114692624A (en) | Information extraction method and device based on multitask migration and electronic equipment | |
CN115292439A (en) | Data processing method and related equipment | |
CN112784156A (en) | Search feedback method, system, device and storage medium based on intention recognition | |
CN113704466B (en) | Text multi-label classification method and device based on iterative network and electronic equipment | |
CN113656555B (en) | Training method, device, equipment and medium for nested named entity recognition model | |
CN115470397B (en) | Content recommendation method, device, computer equipment and storage medium | |
CN113836903A (en) | Method and device for extracting enterprise portrait label based on situation embedding and knowledge distillation | |
CN108170657A (en) | A kind of natural language long text generation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |