We formulate compatible garment recommendation as the task of retrieving a dressing modality, i.e., suggesting suitable complementary fashion items to be paired with a given one.
Our recommendation model is based on two external memory modules in which disentangled color and shape features are stored. Each module acts as an associative memory, relating top features with bottom features and describing different combination modalities for either color or shape. The retrieved modalities are then combined to produce a final recommendation, which can be re-ranked according to general preferences. In the following we present the building blocks of our architecture.
3.1 Disentangled Feature Representation
We process garment images with a convolutional encoder followed by a flattening operation. In this way we map garments into a latent representation
\(\phi\). To obtain separate features for shape and color, we use two different
Multi-Layer Perceptron (MLP) models,
\(MLP_{shape}\) and
\(MLP_{color}\), that yield descriptors
\(\phi _{shape}\) and
\(\phi _{color}\), which are intended to capture different traits of the garment. The two representations are then concatenated, blended together with an additional MLP, and finally decoded with a deconvolutional decoder reconstructing the input image. The model, shown in Figure
2, acts as an autoencoder with two intermediate latent states, trained by optimizing a reconstruction MSE loss
\(\mathcal {L}_{rec}\) over pixels.
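As a reference, the following is a minimal PyTorch sketch of this autoencoder; the class and argument names are ours, the layer dimensions follow Section 3.3, and the convolutional encoder and decoder are passed in as generic modules.

```python
import torch
import torch.nn as nn

class DisentangledAutoencoder(nn.Module):
    def __init__(self, conv_encoder, conv_decoder, feat_dim=5184, latent_dim=256):
        super().__init__()
        self.conv_encoder = conv_encoder        # conv stack + flatten -> phi
        self.mlp_shape = nn.Sequential(         # MLP_shape -> phi_shape
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim), nn.ReLU())
        self.mlp_color = nn.Sequential(         # MLP_color -> phi_color
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim), nn.ReLU())
        self.mlp_blend = nn.Sequential(         # blends the two latents back
            nn.Linear(2 * latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim), nn.ReLU())
        self.conv_decoder = conv_decoder        # transposed convs -> image

    def forward(self, x):
        phi = self.conv_encoder(x)
        phi_shape = self.mlp_shape(phi)
        phi_color = self.mlp_color(phi)
        blended = self.mlp_blend(torch.cat([phi_shape, phi_color], dim=1))
        return self.conv_decoder(blended), phi_shape, phi_color

# L_rec is a pixel-wise MSE between input and reconstruction:
# recon, _, _ = model(x); loss_rec = nn.functional.mse_loss(recon, x)
```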
To disentangle the hidden representations of such states and capture either shape or color, we adopt a contrastive learning approach using a siamese network with three branches (Figure
3). At training time, we feed three images to the model in parallel. The main branch processes the original unaltered image, while the other two branches receive color-jittered and rotated versions of the same image. Thanks to these augmentations, the three images share attributes pair-wise: the main branch and the color branch receive images of garments with the same shape, while the main branch and the shape branch observe images with the same color. The rotated and color-jittered images instead do not share any color/shape attribute. In order to disentangle the latent states of the autoencoder, we optimize two triplet margin losses [
1] across the three branches.
The rationale is to make features of shared attributes close in the latent space, while pushing away the feature of the altered image. For instance, we want the shape features to be similar for the original and color-jittered images while being dissimilar to the ones of the rotated image.
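The augmented views could be generated, for instance, with standard torchvision transforms; the specific jitter and rotation parameters below are placeholders, as the text does not specify them.

```python
import torchvision.transforms as T

# Color branch: jittered view (same shape, altered color).
# Shape branch: rotated view (same color, altered shape).
jitter = T.ColorJitter(brightness=0.4, saturation=0.4, hue=0.4)
rotate = T.RandomRotation(degrees=(90, 270))

def make_branch_inputs(img):
    # returns (main, color-branch, shape-branch) inputs
    return img, jitter(img), rotate(img)
```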
Let the triplet margin loss be defined as
\[\mathcal{L}_{triplet}(\phi, \phi^{+}, \phi^{-}) = \max\big(0,\; d(\phi, \phi^{+}) - d(\phi, \phi^{-}) + M\big),\]
where
\(d(\cdot , \cdot)\) is a distance function,
\(\phi\) is a reference anchor feature,
\(\phi ^{+}\) a positive feature that we want to be close to
\(\phi\), and, vice versa,
\(\phi ^{-}\) is a negative feature that we want to separate from the others by a margin
\(M\). In our experiments we use the cosine distance and
\(M=0.5\).
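A direct implementation of this loss with the cosine distance and \(M=0.5\), assuming batched feature tensors, might look as follows (not the authors' code):

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    # max(0, d(phi, phi+) - d(phi, phi-) + M), averaged over the batch
    return torch.clamp(
        cosine_distance(anchor, positive)
        - cosine_distance(anchor, negative) + margin, min=0.0).mean()
```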
To ensure
\(\phi _{color}\) and
\(\phi _{shape}\) respectively capture color and shape characteristics, we optimize the following losses:
\[\mathcal{L}_{shape} = \mathcal{L}_{triplet}\big(\phi_{shape},\, \phi^{jitter}_{shape},\, \phi^{rot}_{shape}\big),\]
\[\mathcal{L}_{color} = \mathcal{L}_{triplet}\big(\phi_{color},\, \phi^{rot}_{color},\, \phi^{jitter}_{color}\big).\]
Here, the
\(rot\) and
\(jitter\) superscripts indicate that the feature is extracted from the rotated image or the color-jittered image, respectively. Overall, to train the siamese autoencoder, we jointly optimize the reconstruction losses for the three branches, which share all the parameters, and the two triplet margin losses for shape and color:
\[\mathcal{L} = \mathcal{L}_{rec}^{main} + \mathcal{L}_{rec}^{jitter} + \mathcal{L}_{rec}^{rot} + \lambda\,\big(\mathcal{L}_{shape} + \mathcal{L}_{color}\big),\]
where the triplet losses are weighted by a coefficient
\(\lambda\), which we set to 0.01.
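Putting the pieces together, one training step of the siamese autoencoder could be sketched as below, reusing the model and triplet loss defined in the previous sketches; `x_main`, `x_jit`, and `x_rot` are the three branch inputs.

```python
import torch.nn.functional as F

def siamese_step(model, x_main, x_jit, x_rot, lam=0.01):
    rec_m, shape_m, color_m = model(x_main)
    rec_j, shape_j, color_j = model(x_jit)
    rec_r, shape_r, color_r = model(x_rot)
    # reconstruction MSE on all three branches (parameters are shared)
    l_rec = (F.mse_loss(rec_m, x_main) + F.mse_loss(rec_j, x_jit)
             + F.mse_loss(rec_r, x_rot))
    # shape: close to the jittered view, far from the rotated one
    l_shape = triplet_margin_loss(shape_m, shape_j, shape_r)
    # color: close to the rotated view, far from the jittered one
    l_color = triplet_margin_loss(color_m, color_r, color_j)
    return l_rec + lam * (l_shape + l_color)
```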
We train the autoencoder using both top and bottom images, thus obtaining generic encoder/decoder functions that can be used for any garment image. For simplicity, in the following we write \(\phi ^{T}\) for features extracted from top garments and \(\phi ^{B}\) for features extracted from bottoms.
3.2 Model
To perform recommendations, we adopt a Memory Augmented Neural Network (MANN) with two external memories,
\(M_{color}\) and
\(M_{shape}\), as depicted in Figure
4. The idea is derived from [
7], where top and bottom garments are paired in a permanent memory, according to user-defined outfits. Here, instead, we learn to store pairs of top-bottom features concerning either shape or color and we perform a late fusion in the decoding phase. The advantage of our approach is that we can identify non-redundant modalities to pair colors and shapes separately, thus avoiding combinatorial growth in memory size and obtaining more diverse recommendations.
Both memories contain pairs of top-bottom features belonging to known outfits. The memories reflect the feature disentanglement provided by the encoders of the autoencoder. That is, in \(M_{color}\) we only store pairs of color features \((\phi ^{T}_{color}, \phi ^{B}_{color})\) and in \(M_{shape}\) pairs of shape features \((\phi ^{T}_{shape}, \phi ^{B}_{shape})\).
At inference time, a top image is fed as input to our recommendation system and encoded into two latent vectors \(\phi ^{T}_{shape}\) and \(\phi ^{T}_{color}\), using the shape and color encoders. The two features are compared via cosine similarity against the respective memories, acting as read keys to find the most relevant locations.
We retrieve the best \(K\) elements from both memories, retaining only the bottom items. We then combine the bottom features from the two memories with an outer product, creating all \(K^2\) possible combinations of shapes and colors. These combined features represent different ways of matching shapes and colors, and could be decoded into actual garment images. However, to perform recommendations we need to suggest existing garments, so we simply retrieve the training sample with the highest cosine similarity to the concatenation of shape and color features.
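The read-and-combine step can be sketched as follows; the memory layout (parallel tensors of top keys and bottom values), the `bank` of concatenated training features, and all names are our assumptions.

```python
import torch
import torch.nn.functional as F

def read_memory(query, keys, values, K):
    # cosine similarity between the query top feature and all stored keys
    sims = F.cosine_similarity(query.unsqueeze(0), keys, dim=1)
    return values[sims.topk(K).indices]           # K bottom features

def recommend(phi_shape, phi_color, M_shape, M_color, bank, K):
    s = read_memory(phi_shape, M_shape["top"], M_shape["bottom"], K)  # (K, D)
    c = read_memory(phi_color, M_color["top"], M_color["bottom"], K)  # (K, D)
    # all K^2 shape/color pairings, concatenated as [shape | color]
    combos = torch.cat([s.repeat_interleave(K, dim=0), c.repeat(K, 1)], dim=1)
    # snap each combination to the closest existing garment, whose
    # concatenated shape+color features are stored in `bank` (N, 2D)
    sims = F.cosine_similarity(combos.unsqueeze(1), bank.unsqueeze(0), dim=2)
    return sims.argmax(dim=1)                     # K^2 garment indices
```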
Among the \(K^2\) generated pairs, we establish a re-ranking based on the frequency of co-occurrence of color and shape labels \((c, s)\) between top and bottom pairs that belong to the same outfit in the training set. The idea behind this re-ranking strategy is that the MANN extracts good modalities from a purely content-based point of view; by re-ranking with co-occurrence frequencies, we also take into account how common these outfits are according to fashion trends.
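A minimal sketch of this re-ranking, assuming a precomputed co-occurrence table `cooc` mapping pairs of top/bottom \((c, s)\) labels to their frequency among training outfits:

```python
def rerank(candidates, top_labels, bottom_labels_of, cooc):
    # top_labels = (c_top, s_top); bottom_labels_of[i] = (c_i, s_i)
    return sorted(candidates,
                  key=lambda i: cooc.get((top_labels, bottom_labels_of[i]), 0),
                  reverse=True)
```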
Memory Controller Training. In order to fill up the two memories, we train two separate memory controllers, \(C_{shape}\) and \(C_{color}\). The training process for both controllers is identical, so we will refer to a generic memory \(M\) and generic features \((\phi ^{T}, \phi ^{B})\) without any attribute subscript.
Given a top attribute representation \(\phi ^T\), the memory outputs \(K\) bottom features \(\phi _k^B\), for \(k=1, \dots, K\). Since the memory should be able to propose an attribute (either shape or color, depending on the memory), we compare the features of all proposed items against the corresponding feature of the ground-truth bottom \(\bar{\phi }^B\) using the cosine distance:
\[d_k = d\big(\phi ^B_k,\, \bar{\phi }^B\big), \quad k = 1, \dots, K.\]
As in [
7,
21], we take the minimum error and feed it to the memory controller, which is trained to write samples in a non-redundant way, storing only the relevant information necessary to obtain a satisfactory recommendation. The advantage of considering only the best recommendation is that the network is not penalized for recommending a variety of different outputs, while it is still encouraged to recommend at least one item similar to the ground truth.
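In code, the quantity fed to the controller is simply the best of the \(K\) cosine distances; a sketch:

```python
import torch.nn.functional as F

def min_retrieval_error(proposed_bottoms, gt_bottom):
    # d_k for the K proposals against the ground-truth bottom feature
    d = 1.0 - F.cosine_similarity(proposed_bottoms, gt_bottom.unsqueeze(0), dim=1)
    return d.min()                                # d* = min_k d_k
```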
A memory controller is a simple linear layer with sigmoidal activation that emits a writing probability
\(p_w\), taking as input the minimum distance
\(d^{*}=\min \lbrace d_k\rbrace\). A sample gets written in memory when such probability exceeds a given threshold
\(th_w\). Previous works in the literature [
7,
21] have trained similar memory controllers to maximize the writing probability when the error is high, i.e., when the sample should be added to memory to obtain a better prediction, and to minimize the writing probability when the output is already satisfactory, thus avoiding redundancy. Such behavior is obtained by minimizing the following controller loss:
\[\mathcal{L}_{controller} = d^{*}\,(1 - p_w) + (1 - d^{*})\,p_w.\]
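A sketch of the controller and of this base loss, following our reconstruction of the equation above (not the authors' implementation):

```python
import torch
import torch.nn as nn

class MemoryController(nn.Module):
    # single linear layer with sigmoid emitting the writing probability p_w
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, d_star):                    # d_star: (batch, 1)
        return torch.sigmoid(self.linear(d_star))

def controller_loss(p_w, d_star):
    # high error -> encourage writing; low error -> discourage it
    return (d_star * (1.0 - p_w) + (1.0 - d_star) * p_w).mean()
```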
The controller loss in this form, however, suffers from two issues: (1) dependence on the distribution of
\(d^{*}\), which reflects on the number of samples getting written in memory, and (2) collapsing to trivial solutions where
\(p_w\) is always 0 or 1. Both issues arise when
\(d^{*}\) does not follow a normal or uniform distribution, i.e., when there is a strong imbalance toward either low or high distances.
To avoid these issues, we extend this loss in the following way. First, we scale the cosine distances in
\([0, 1]\) and normalize each
\(d^*\) by an estimate of the maximum distance
\(d^*_{max}\), accumulated during training. This has the effect of stretching all errors to cover the whole
\([0, 1]\) interval, making the second term in
\(\mathcal {L}_{controller}\) tend to zero when
\(d^{*}\) is sufficiently high. Second, we add a penalty term to avoid collapsing to trivial solutions. To do so, we accumulate the
Nth percentile of
\(d^{*}\), which we denote with
\(perc_N\), averaging across batches. We assume that samples with errors higher than such value should be written in memory, and therefore we penalize the model when their writing probability is lower than
\(th_{w}\). Vice versa, we still add the penalty when a sample is written but the corresponding
\(d^{*}\) is lower than the
Nth percentile. The penalties can be formalized as margin losses with margin
\(m\):
\[\mathcal{L}_{penalty} = \mathbb{1}\big[\hat{d}^{*} \geq th_w\big]\, \max\big(0,\, th_w - p_w + m\big) + \mathbb{1}\big[\hat{d}^{*} < th_w\big]\, \max\big(0,\, p_w - th_w + m\big),\]
with \(\hat{d}^{*} = d^{*}/d^{*}_{max}\),
where we make the threshold
\(th_w\) adaptive by setting it equal to the estimate of the
Nth percentile of
\(d^*\), normalized by
\(d^*_{max}\):
\[th_w = \frac{perc_N}{d^{*}_{max}}.\]
The final controller loss that we adopt is therefore
\[\mathcal{L}_{C} = \mathcal{L}_{controller} + \alpha\, \mathcal{L}_{penalty}.\]
In our experiments we use \(m=0.3\), \(\alpha =10\), and set the distance threshold to the 99.5th percentile of the distance distribution.
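The extended loss can then be sketched as below; the running statistics `d_max` and `perc_n` are assumed to be accumulated over training batches and passed in, and the formulation follows our reconstruction of the equations above.

```python
import torch

def extended_controller_loss(p_w, d_star, d_max, perc_n, m=0.3, alpha=10.0):
    d_hat = torch.clamp(d_star / d_max, max=1.0)  # stretch errors over [0, 1]
    th_w = perc_n / d_max                         # adaptive writing threshold
    base = d_hat * (1.0 - p_w) + (1.0 - d_hat) * p_w
    # samples above the percentile should be written (p_w above th_w) ...
    pen_hi = (d_hat >= th_w).float() * torch.clamp(th_w - p_w + m, min=0.0)
    # ... and samples below it should not be (p_w below th_w)
    pen_lo = (d_hat < th_w).float() * torch.clamp(p_w - th_w + m, min=0.0)
    return (base + alpha * (pen_hi + pen_lo)).mean()
```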
3.3 Training Details
We train our model in two separate steps. First, we train the autoencoder to learn disentangled features for shape and color, and then we train the memory controllers to store non-redundant samples. Separating training into two phases is necessary since we do not want the representations of stored samples to change during training. During the training phase of the memory controllers, to avoid storing incorrect samples in the first iterations, we reset the memory after each epoch by emptying it and re-initializing it with \(K\) random samples, i.e., as many samples as we want to suggest. When the controller is fully trained, we fill the memory from scratch by iterating over the training samples for an additional epoch. We observed that, once convergence is achieved, different initializations do not lead to substantial differences in the final results.
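The two-phase schedule for one controller could look as follows; the `Memory` abstraction and its methods are ours, not the authors' code, and `loss_fn` would be a closure over the running statistics (e.g., the extended controller loss above).

```python
def fit_controller(controller, memory, loader, optimizer, loss_fn, K, epochs):
    for _ in range(epochs):
        memory.reset(num_random_pairs=K)          # re-initialize each epoch
        for phi_top, phi_bottom in loader:        # one outfit pair at a time
            d_star = memory.min_distance(phi_top, phi_bottom)
            p_w = controller(d_star.view(1, 1))
            optimizer.zero_grad()
            loss_fn(p_w, d_star).backward()
            optimizer.step()
            if p_w.item() > memory.th_w:          # write when p_w exceeds th_w
                memory.write(phi_top, phi_bottom)
    # final pass with the trained controller to fill the memory from scratch
    memory.reset(num_random_pairs=K)
    for phi_top, phi_bottom in loader:
        d_star = memory.min_distance(phi_top, phi_bottom)
        if controller(d_star.view(1, 1)).item() > memory.th_w:
            memory.write(phi_top, phi_bottom)
```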
We train our model on two different datasets, IQON3000 [
29] and FashionVC [
28], as outlined in Section
5. Our final memory modules, trained on the IQON3000 dataset, are filled with 9,282 pairs for color and 2,157 pairs for shape, whereas the memories trained on FashionVC are filled with 399 pairs for color and 262 pairs for shape. The difference in the number of stored pairs reflects the different sizes of the two datasets: IQON3000 contains 308,747 outfits, whereas FashionVC has just 20,726.
As for the components of our model, the autoencoder is composed as follows. The encoder has 4 convolutional layers with kernel size \(3 \times 3\), padding 1, and number of channels equal to 8, 16, 32, and 64. Each layer has a ReLU activation and is followed by a max-pooling operation. The resulting feature is a \(9 \times 9 \times 64\) feature map, which is flattened into a 5,184-dimensional vector and fed to the two MLP encoders, both with a hidden dimension of 1,024 and an output of 256. Again, all outputs are followed by ReLU activations. The MLP decoder and the convolutional decoder follow an inverse structure, replacing convolutions with transposed convolutions with stride 2.
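For reference, the encoder just described can be written compactly as follows; note that a \(9 \times 9 \times 64\) output after four max-poolings implies \(144 \times 144\) RGB inputs, which is our inference rather than a stated detail.

```python
import torch.nn as nn

def make_conv_encoder():
    layers, in_ch = [], 3
    for out_ch in (8, 16, 32, 64):                # four 3x3 conv layers
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(2)]               # halves spatial resolution
        in_ch = out_ch
    layers.append(nn.Flatten())                   # 9 * 9 * 64 = 5184-dim phi
    return nn.Sequential(*layers)
```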
For training both models we use the Adam optimizer with a learning rate of 0.005.