research-article
Open access

Towards Food Image Retrieval via Generalization-Oriented Sampling and Loss Function Design

Published: 25 August 2023

Abstract

Food computing has increasingly received widespread attention in the multimedia field. As a basic task of food computing, food image retrieval has wide applications, that is, food image retrieval can help users to find the desired food from a large number of food images. Besides, the retrieved information can be applied to establish a richer database for the subsequent food content-related recommendation. Food image retrieval aims to achieve better performance on novel categories. Thus, it is worth studying to transfer the embedding ability from the training set to the unseen test set, that is, the generalization of the model. Food is influenced by various factors, such as culture and geography, leading to great differences between domains, such as Asian food and western food. Therefore, it is challenging to study the generalization of the model in food image retrieval. In this article, we improve the classical metric learning framework and propose a generalization-oriented sampling strategy, which boosts the generalization of the model by maximizing the intra-class distance from a proportion of positive pairs to avoid the excessive distance compression in the embedding space. Considering that the existing optimization process is in an opposite direction to our proposed sampling strategy, we further propose an adaptive gradient assignment policy named gradient-adaptive optimization, which can alleviate the intra-class distance compression during optimization by assigning different gradients to different samples. Extensive evaluation on three popular food image datasets demonstrates the effectiveness of the proposed method. We also experiment on three popular general datasets to prove that solving the problem from the generalization can also improve the performance of general image retrieval. Code is available at https://github.com/Jiajun-ISIA/Generalization-oriented-Sampling-and-Loss.

1 Introduction

Food is closely related to the healthy life of human beings; thus, food-related research has become a hotspot. Food image retrieval aims to find all relevant images in a food database for a given query, and it is an important issue in the multimedia field. As an important branch of content-based image retrieval (CBIR), food image retrieval has expanded the scale of CBIR and reflects the progress this field has made. Therefore, food image retrieval has high academic value. In addition, food image retrieval also has high application value. First, food image retrieval can help consumers with special needs to find the desired food from a large number of food images. For example, fitness coaches and body managers usually need to use food-related search engines to find healthy meals from the wide range of dishes displayed by a large number of stores on the Internet. Food image retrieval can help them to find the best match from a large number of food images. Second, food image retrieval can solve the problem of dynamic categories, which cannot be solved by the common food recognition paradigm. There are many kinds of unlabeled novel foods and labelling these food images is costly, whereas food image recognition requires a set of annotated images to train a robust classifier. Food image retrieval can be used to find similar foods among the available ones and can suggest the most appropriate food class to solve the problem. Third, food image retrieval can be used to enrich the database of food recommendation systems. Catering-oriented social media platforms, such as Yelp and Meituan, utilize dish name-based recommendation systems to predict users' preferences. However, they often fail because different restaurants may give different names to the same dish and the recommendation system cannot match the corresponding dish name in the database. A food image-based recommendation system can avoid this problem.
Therefore, food image retrieval helps to recommend food more accurately by enriching the database, which avoids the matching failure caused by the same dish with different names.
With the rapid development of the food field, new food categories are produced every day. Existing methods [10, 15, 43] are not suitable for retrieving corresponding images for these unseen categories. Therefore, we focus on food image retrieval on unseen categories, which is also consistent with general image retrieval in the current multimedia field. The generalization of the model to novel food categories is an important issue to solve for food image retrieval. However, there are few works to explore this issue in food image retrieval. Improving the generalization of the model in food image retrieval is to cross the gap among categories to achieve the knowledge transfer, and the challenge is to solve the problem with large inter-class differences of different food domains. Specifically, there are many domains of food, such as Asian food and western food. Western food is often visually represented as a simple combination of processed ingredients, such as steak, fish and chips, and hamburgers. Due to the uniqueness of cooking methods, Asian food such as Chinese food is often processed again by mixing various ingredients, which is inseparable. The situation leads to the great visual difference of food in different domains. As shown in Figure 1, the crispy sweet and sour pork slices on the left differ greatly from the beefsteak on the right. When two categories belong to the training set and the test set, respectively, the performance may be poor.
Fig. 1. The difference between two domains of food.
To solve this problem, we improve the metric learning framework and propose a generalization-oriented sampling strategy, named \(\rho\) Sampling. The previous work [40] indicates that the generalization of the model on representation learning such as image retrieval under considerable shifts between training and testing distribution is hurt by excessive feature compression. Therefore, making features occupy a larger hyperspace helps improve the generalization of models. A simple example is shown in Figure 2. Although two models both successfully separate the data on the training set, the left model fails to separate all test classes due to the excessive compression, leading to confusion on the test set. In contrast, the right model successfully separates the test classes by keeping the intra-class variants, which means that the model transfers the discrimination ability from the training set to the test set. Based on this, we argue that the training shouldn’t blindly minimize the intra-class distance. It is necessary to maximize the space utilization to achieve better generalization. The proposed sampling strategy avoids excessive compression by maximizing a few intra-class distances during the training, which can take instances from the same class apart to occupy a larger hyperspace to achieve the generalization advancement.
Fig. 2. The toy example for the distribution of the training set and test set. Different colors represent different categories. (a) shows the distribution of the excessive compression. (b) shows the optimized distribution.
After investigating the widely used loss functions from the perspective of optimization under the metric learning framework, we find that recent methods [17, 28] prefer to force all positive instances to be close during training by minimizing the distance between the anchor and the positive one. This optimization method forces positive instances with different confidence toward the anchor position and leads to excessive compression, which is opposite to the above requirement. Therefore, we introduce an adaptive gradient assignment policy, named Gradient-Adaptive Optimization (GAO), to avoid excessive compression. The proposed GAO keeps the intra-class distance by assigning smaller gradients to edge samples, which avoids excessive compression and achieves better generalization. An edge sample is a sample with low confidence, which is far from the class center. Meanwhile, GAO can achieve more stable optimization during training because it reduces the influence of edge samples that may be potential noise data. We verify it on two popular loss functions, contrastive loss and margin loss. A similar strategy can be used with other loss functions for performance improvement. Extensive evaluation on three food image datasets demonstrates the effectiveness of the proposed method. In order to show that addressing the problem from the perspective of generalization can also improve the performance of other tasks, we conduct the evaluation on three popular general datasets to verify the performance on the general image retrieval task.
In summary, the main contributions of this work are listed below.
To the best of our knowledge, we are the first to investigate food image retrieval on unseen categories and to improve the generalization of the model for performance gain.
We improve the classical metric learning framework for food image retrieval and propose a generalization-oriented sampling strategy named \(\rho\) Sampling and an adaptive gradient assignment policy named Gradient-Adaptive Optimization to avoid excessive distance compression in the embedding space to improve the generalization of the model.
We conduct a benchmark on three datasets and experimental results demonstrate that the proposed method achieves state-of-the-art performance. Experiments on three general datasets prove that the method can be further popularized in general image retrieval.

2 Related Work

Our work is closely related to three research fields: (1) food retrieval, (2) deep metric learning, and (3) generalization in image retrieval.

2.1 Food Retrieval

Recently, Min et al. [32] provided a comprehensive survey of food retrieval and other food-related works [19, 22, 31, 37, 54, 64]. Food-relevant retrieval consists of three types: visual food retrieval, recipe retrieval, and cross-modal recipe-image retrieval. For food image retrieval, Barlacchi et al. [2] introduced a search engine for restaurant retrieval based on dishes rather than using their categories. Farinella et al. [15] conducted food image retrieval by comparing images through similarity measures, in which food images were represented as vectors through the combination of different types of features, such as SIFT and Bag of Textons. For recipe retrieval, Wang et al. [53] investigated the underlying features of Chinese recipes. Based on workflow-like cooking procedures, they modelled recipes as graphs, proposed a novel similarity measurement based on frequent patterns, and devised an effective filtering algorithm to support efficient online searching. Recently, Chang et al. [6] proposed an interactive system, RecipeScape, to analyze multiple recipes for one dish. They converted the recipe instructions into a tree-structured representation for recipe similarity calculation. Xie et al. [58] further jointly utilized various features, such as cooking flow, eating, and nutrition features, to create a hybrid semantic item model for recipe search. Besides food/recipe retrieval, there are some studies [9, 29, 33, 52] on cross-modal recipe-image retrieval. Salvador et al. [41] first proposed a large-scale, structured corpus of over one million cooking recipes and food images, called Recipe1M. Micael et al. [29] extended previous works by providing a double-triplet strategy to jointly express both the retrieval loss and the classification loss for cross-modal retrieval. Chen et al. [9] further used an attention mechanism to model a recipe and align the word embedding of the recipe with its image feature.
Compared with previous works, the goal of our task is to search for the most similar image from a number of images, which is different from other food retrieval tasks, that is, recipe retrieval and cross-modal recipe-image retrieval. In addition, in contrast to existing methods, which are suitable for seen categories, our method focuses on unseen categories and improves the generalization of the model for these categories.

2.2 Deep Metric Learning

Deep metric learning methods [30, 55, 57, 62] aim to learn a non-linear transformation to project input images into an embedding space, in which images with similar semantic information have larger similarity and vice versa. Pairwise loss functions represented by contrastive loss [21, 42] were widely used to operate on image pairs. These loss functions maximized the similarity between positive instances while encouraging a margin between negative pairs. In contrast to the above works, triplet networks have been proposed to solve the ranking problem [1, 36, 60] by optimizing the relative distance between positive and negative pairs. However, Cui et al. [11] argued that many pairs or tuples have no effect in updating a model due to their poor discriminative power. They realized that these triplets wasted much time and resources, thus slowing down the training and leading to worse performance. To this end, instead of selecting random samples, a series of sampling strategies were proposed to mine hard examples [11, 35, 42]. Online and offline mining are two strategies to mine hard examples in the dataset. Online hard example mining usually mines hard pairs (triplets) in a training batch during training. Offline mining focuses on the whole dataset or a subset of the dataset after each epoch. Online mining is usually faster than offline hard example mining, whereas offline hard example mining can consider more samples and capture a more realistic data distribution.
In this article, we follow the classical deep metric learning framework and further propose a novel sampling strategy and a loss function design to improve the generalization of the model. Specifically, we propose the \(\rho\) Sampling and GAO to boost the generalization by avoiding the excessive compression within the class.

2.3 Generalization in Image Retrieval

Recent works [7, 13, 25, 47, 59, 63] widely used the metric learning method for the image retrieval task, and a series of sampling methods and loss functions have been further proposed to improve the performance of image retrieval. Tishby and Zaslavsky [48] studied the concept of compression to explain this performance gain on the basis of the learned embedding space for the metric learning method. The recent work of Verma et al. [50] introduced Singular Value Decomposition (SVD) on the data representations and obtained an increased decay of singular values, which linked the compression to the class-conditioned flattening of representation. Thus, class representations occupied a more compact volume, thereby reducing the number of directions with significant variance. For classic classification scenarios with i.i.d. training and test distributions, that is, the overlap setup, researchers strongly focused on the most discriminative directions. However, this overly discarded features that could capture data characteristics outside the training distribution, which damaged the generalization of the model. Bellet and Habrard [3] indicated that generalization in no-overlap problems such as metric learning was hindered by the shift between the training and test distributions. Based on this, Roth et al. [40] proposed to retain a considerable number of directions of significant variance (DoV) to learn a well-generalizing embedding function. Therefore, metric learning under considerable shifts between training and testing distributions is hurt by excessive feature compression.
Inspired by the work in [40], we aim to further avoid excessive compression to improve the generalization of the model. The proposed \(\rho\) Sampling boosts the generalization of the model by maximizing the intra-class distance from a proportion of positive pairs to avoid excessive distance compression in the embedding space. GAO further alleviates the intra-class distance compression during optimization by assigning different gradients to different samples.

3 Our Method

As shown in Figure 3, we improve the classical metric learning framework by introducing the novel sampling strategy and loss function to avoid excessive distance compression in the embedding space and improve the generalization. Particularly, we propose \(\rho\) Sampling and GAO, which maintain the intra-class distance and achieve better generalization performance, which will be detailed in Sections 3.1 and 3.2, respectively. Without loss of generality, we choose Vision Transformer (ViT) [14] as the backbone. In the test phase, all images in the test set are fed into the backbone to obtain image descriptors to compute cosine similarities among the query and other images. Finally, we sort these similarities and obtain the final ranking list.
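The test-phase ranking described above can be illustrated with a minimal sketch. It assumes the image descriptors have already been extracted by the backbone; the function name `rank_gallery` and the toy 2-D vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def rank_gallery(query, gallery):
    """Return gallery indices sorted by decreasing cosine similarity to the query."""
    # L2-normalize so that the dot product equals the cosine similarity.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims, kind="stable")
    return order, sims[order]

# Toy descriptors standing in for backbone outputs (hypothetical values).
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.1])
order, sims = rank_gallery(query, gallery)
print(order)  # ranking list of gallery indices, most similar first
```

Normalizing once and using a matrix product keeps the ranking step cheap even for large galleries, since all similarities for one query come from a single matrix-vector multiplication.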
Fig. 3. The proposed food image retrieval framework. \(\phi _a\), \(\phi _p\) and \(\phi _n\) are image descriptors and \(\phi _{\text{cls}}\) is the logit for classification.

3.1 ρ Sampling for Generalization

Food image retrieval aims to achieve better performance on unseen classes based on the model trained via seen classes. Therefore, generalization becomes crucial. As mentioned above, the generalization of the model on representation learning such as image retrieval under considerable shifts between training and testing distribution is hurt by the excessive distance compression in the embedding space. Due to the obvious differences between different domains of food images, it is more challenging to study the generalization in food image retrieval. In this article, we argue that in order to achieve better generalization performance, the training should not only minimize the intra-class variance and inter-similarity but also distribute the data into wider space. As mentioned above, the model that makes full use of the hyperspace is considered to have better generalization. This model achieves the ability to distribute unseen classes into hyperspace uniformly and widely, and avoids the confusion of the test data.
To this end, we propose a novel sampling strategy, \(\rho\) Sampling, which keeps the intra-class variance by randomly maximizing some intra-class distances during training. Specifically, in order to avoid excessive compression, an intuitive idea is to relax the optimization of positive samples, because metric learning methods pull the anchor and the positive instance closer together, which leads to excessive distance compression in the embedding space. To solve the problem, we randomly select a subset of all tuples \(T = \lbrace (x_{a_1},x_{p_1},x_{n_1}),(x_{a_2},x_{p_2},x_{n_2}),\ldots ,(x_{a_k},x_{p_k},x_{n_k})\rbrace\) for relaxation, assigning the same probability \(\rho\) to each tuple so that approximately a fraction \(\rho\) of all tuples is relaxed, where \(x_a\) is the anchor, \(x_p\) is the positive instance, \(x_n\) is the negative instance, and \(k\) is the size of the batch. For these selected tuples, we replace positive instances \(x_p\) with anchors \(x_a\), and optimize original positive instances as negative ones, which is defined as:
\begin{equation} t = (x_{a},x_{p},x_{n}) \rightarrow t^{\prime } = (x_{a},x_{a},x_{p}) \end{equation}
(1)
This strategy pushes samples from the same class apart, enabling the model to capture extra discriminative features and resulting in better generalization. As shown in Figure 4, the proposed \(\rho\) Sampling avoids excessive distance compression in the embedding space and makes full use of the hyperspace to achieve better generalization. However, it may be unreasonable to select the positive instances to change purely at random. After ensuring that the proportion of selected tuples is \(\rho\), we hope to select more appropriate tuples for this inverse optimization. Therefore, we analyse the distribution of pairwise distance, which follows
\begin{equation} q(d) \propto d^{n-2}\left[1-\frac{1}{4} d^{2}\right]^{\frac{n-3}{2}} , \end{equation}
(2)
where \(d\) is the distance between two embeddings and \(n\) is the dimension of the embedding. In high-dimensional space, \(q(d)\) approaches \(\mathcal {N}(\sqrt {2}, \frac{1}{2 n})\), which means that if we randomly select training samples, the distances between the positive pairs will usually be close to \(\sqrt {2}\), that is, they are hard positive pairs. As shown in Figure 4, if hard positive samples (\(x_1\) and \(x_3\)) are used for inverse optimization, these edge samples will be pushed even farther from the center, which leads to confusion between classes. Although this method maintains the intra-class variance, the potential confusion between categories weakens the discrimination between classes. We believe that occupying a larger hyperspace should still preserve the class boundary. Based on this, we should give priority to samples that are close to the center to keep intra-class variance, because such samples are confident enough and the influence of changing them is smaller. To this end, we define the probability of a tuple being selected as
\begin{equation} P(t^*|a,p) \propto (D_{ap})^{-1} , \end{equation}
(3)
where \(D_{ap}\) is the distance between the anchor and the positive instance \(d(\phi _a,\phi _p)\). In other words, we prefer to select easy positive pairs as altered samples.
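The Gaussian approximation of Equation (2) can be checked numerically. The sketch below assumes embeddings that are L2-normalized and uniformly distributed on the unit sphere, which is the setting in which this distance distribution is usually stated; the function name `random_pair_distances`, the number of pairs, and the seed are illustrative choices, not taken from the source.

```python
import numpy as np

def random_pair_distances(n_dim, n_pairs, seed=0):
    """Euclidean distances between random pairs of points on the unit sphere."""
    rng = np.random.default_rng(seed)
    # Normalized Gaussian vectors are uniformly distributed on the sphere.
    a = rng.standard_normal((n_pairs, n_dim))
    b = rng.standard_normal((n_pairs, n_dim))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return np.linalg.norm(a - b, axis=1)

n = 128  # embedding size used in the experimental setup
d = random_pair_distances(n, 20000)
print("empirical mean:", d.mean(), " sqrt(2):", np.sqrt(2))
print("empirical var: ", d.var(), " 1/(2n): ", 1 / (2 * n))
```

With \(n = 128\), the empirical mean and variance should land close to \(\sqrt{2}\) and \(\frac{1}{2n}\), which is why randomly drawn pairs are almost always far apart and why selecting easy positive pairs needs the explicit weighting of Equation (3).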
Fig. 4. Visualization of the optimization process. Circles of different colors represent instances of different food classes. According to the \(\rho\) Sampling, the positive \(x_2\) is treated as a negative instance to avoid over-compression for better generalization. The dashed arrows represent the optimization processes in training.
Discussion. The conventional metric learning method makes the distance between samples of the same category closer and pushes samples of different categories away. Although this method can distinguish samples of different categories, it will inevitably lead to excessive distance compression of the same kind of samples in the embedding space because this method narrows the distance between all positive samples equally without considering the distribution of a category. As shown in Figure 4, the traditional optimization leads to excessive distance compression in the embedding space. In contrast, the proposed \(\rho\) Sampling maximizes the intra-class distance from a proportion of positive pairs to avoid excessive compression, thus, improving the generalization of the model.
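The tuple relaxation of Equation (1), combined with the selection preference of Equation (3), can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes that exactly `round(rho * k)` tuples are drawn without replacement with probability proportional to \(1/D_{ap}\). The function name `rho_sampling`, the string identifiers, and the toy distances are hypothetical.

```python
import numpy as np

def rho_sampling(tuples, d_ap, rho, seed=0):
    """Relax a fraction rho of (anchor, positive, negative) tuples.

    tuples: list of (a, p, n) identifiers
    d_ap:   anchor-positive distance for each tuple
    """
    rng = np.random.default_rng(seed)
    d_ap = np.asarray(d_ap, dtype=float)
    k = len(tuples)
    n_relaxed = int(round(rho * k))
    # Eq. (3): selection probability proportional to 1 / D_ap (easy positives first).
    weights = 1.0 / np.maximum(d_ap, 1e-12)
    probs = weights / weights.sum()
    chosen = set(rng.choice(k, size=n_relaxed, replace=False, p=probs).tolist())
    relaxed = []
    for i, (a, p, n) in enumerate(tuples):
        # Eq. (1): (a, p, n) -> (a, a, p); the old positive is optimized as a negative.
        relaxed.append((a, a, p) if i in chosen else (a, p, n))
    return relaxed, sorted(chosen)

tuples = [("a0", "p0", "n0"), ("a1", "p1", "n1"), ("a2", "p2", "n2"), ("a3", "p3", "n3")]
d_ap = [0.2, 1.4, 0.5, 1.3]  # toy distances; small values are easy positives
relaxed, chosen = rho_sampling(tuples, d_ap, rho=0.5)
print(chosen, relaxed)
```

Drawing a fixed count keeps the relaxed proportion at \(\rho\) in every batch; assigning each tuple an independent probability instead would match the proportion only on average.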

3.2 Gradient-Adaptive Optimization (GAO)

In order to further reduce excessive intra-class distance compression in the embedding space and improve the generalization performance of the model, we propose an adaptive gradient assignment policy named Gradient-Adaptive Optimization (GAO) to cooperate with the proposed \(\rho\) Sampling. Paying less attention to hard positive instances and correcting them slowly lets the features occupy a larger hyperspace, which benefits the generalization performance of the model. Based on this, the strategy slowly corrects instances with lower confidence by giving them smaller gradients. To this end, we define the strength of a positive instance during the optimization as
\begin{equation} s(x_p) \propto (D_{ap})^{-1} . \end{equation}
(4)
The equation means that the optimization strength of the positive instance decreases with the increase of \(D_{ap}\).
Since derivative functions of distances between anchors and positive instances in original margin loss and contrastive loss are constant functions, they are convenient for us to study the impact of different gradient allocations on retrieval performance. Therefore, we apply the proposed GAO to these two loss functions to further verify the effectiveness of the strategy.
Without loss of generality, we take contrastive loss as an example. It is formulated as
\begin{equation} L_{m_1} = D_{ap} + \max (m - D_{an},0) , \end{equation}
(5)
where \(m\) is the margin. Contrastive loss forces all positive instances to be close during training by minimizing the distance between the anchor and the positive one. Meanwhile, contrastive loss will assign equal strength to different instances and synchronously correct them, shown in Figure 5(b). The gradient of \(D_{ap}\) keeps constant when \(D_{ap}\) changes, which means that positive instances with different confidence will be under the same optimization strength.
Fig. 5. Compared with the original contrastive loss, the proposed training strategy assigns a smaller gradient to harder positives and slowly corrects them to achieve the adaptive optimization.
The contrastive loss combined with GAO is defined as follows:
\begin{equation} L^{\prime }_{m_1} = \log (1 + D_{ap}) + \max (m - D_{an},0) . \end{equation}
(6)
\(L^{\prime }_{m_1}\) and its derivative function with respect to \(D_{ap}\) are shown in Figure 5(b). When \(D_{ap}\) is small, the gradient with respect to \(D_{ap}\) is large, which means strong optimization strength. In contrast, when \(D_{ap}\) is large, the gradient with respect to \(D_{ap}\) is small, which means weak optimization strength. The improved contrastive loss will quickly correct easy samples (with high confidence) while slowly correcting edge samples. This strategy avoids excessive compression, weakens the impact of wrong corrections caused by potential noise data, and achieves more stable optimization.
The proposed strategy is also used to improve margin loss, because the derivative of the original margin loss with respect to the distance between anchors and positive instances is likewise a constant function. The original margin loss is defined as
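The difference between Equations (5) and (6) can be made concrete with a small numerical sketch. The gradient expressions follow directly from differentiating the two losses with respect to \(D_{ap}\); the function names, the margin value, and the sample distances are illustrative assumptions.

```python
import numpy as np

def contrastive(d_ap, d_an, m):
    """Eq. (5): original contrastive loss."""
    return d_ap + np.maximum(m - d_an, 0.0)

def contrastive_gao(d_ap, d_an, m):
    """Eq. (6): contrastive loss with GAO, log(1 + D_ap) on the positive term."""
    return np.log1p(d_ap) + np.maximum(m - d_an, 0.0)

def grad_d_ap_original(d_ap):
    """d/dD_ap of Eq. (5): constant 1 for every positive pair."""
    return np.ones_like(d_ap)

def grad_d_ap_gao(d_ap):
    """d/dD_ap of Eq. (6): 1 / (1 + D_ap), smaller for harder positives."""
    return 1.0 / (1.0 + d_ap)

d_ap = np.array([0.1, 0.5, 1.0, 1.4])  # from easy to hard positives (toy values)
print(grad_d_ap_original(d_ap))
print(grad_d_ap_gao(d_ap))
```

The GAO gradient decreases monotonically with \(D_{ap}\), so an edge sample at distance 1.4 is pulled with less than half the strength of an easy sample near the anchor, whereas the original loss pulls both equally.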
\begin{equation} L_{m_2} = \gamma + \lbrace D_{ap}- \beta \rbrace - \lbrace D_{an}-\beta \rbrace , \end{equation}
(7)
where \(\gamma\) is the triplet margin. The original margin loss extends the standard triplet loss by introducing a dynamic, learnable boundary \(\beta\) between positive and negative pairs, which transfers the common triplet ranking problem to a relative ordering of pairs. The improved margin loss is defined as
\begin{equation} L^{\prime }_{m_2} = \gamma + \lbrace \log (1 + D_{ap})- \beta \rbrace - \lbrace D_{an}-\beta \rbrace . \end{equation}
(8)
Similar to the improved contrastive loss, the improved margin loss redistributes the gradient of \(D_{ap}\) to quickly correct easy examples while slowly correcting edge samples, which avoids excessive compression to improve the generalization of the model.
Finally, we add a classification sub-network, leading to the joint optimization on both the metric learning and classification loss. We adopt the classification loss to improve the discrimination on different classes. Features further occupy a larger hyperspace to improve the generalization of the model. The final loss is defined as
\begin{equation} L = \alpha {L_m} + (1 - \alpha){L_c} , \end{equation}
(9)
where \(\alpha\) is an adjustable hyper-parameter and \(L_m\) and \(L_c\) represent the metric learning loss and classification loss, respectively.
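The weighting in Equation (9) can be sketched in a few lines. The source only states that \(L_c\) is a classification loss; the sketch assumes it is a softmax cross-entropy over the classification logits, and the function names and numeric values are hypothetical.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (assumed form of L_c)."""
    z = logits - logits.max()  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def joint_loss(l_metric, logits, label, alpha):
    """Eq. (9): L = alpha * L_m + (1 - alpha) * L_c."""
    return alpha * l_metric + (1.0 - alpha) * cross_entropy(logits, label)

logits = np.array([2.0, 0.5, -1.0])  # toy classification logits
print(joint_loss(l_metric=0.8, logits=logits, label=0, alpha=0.5))
```

Setting `alpha=1.0` recovers pure metric learning and `alpha=0.0` recovers pure classification, which makes the hyper-parameter easy to sweep.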

4 Experiment

Section 4.1 describes the datasets used for evaluation and their division. Sections 4.2 and 4.3 give details of the adopted evaluation metrics and the experimental setup, respectively. In Section 4.4, we present the comparison with the state-of-the-art, with different loss functions, and with different sampling strategies. We also conduct a cross-domain evaluation experiment to further verify the generalization ability of our method, and show the impact of hyper-parameters. Section 4.5 presents the qualitative analysis. Finally, we present the comparison with the state-of-the-art on general image retrieval in Section 4.6 to further verify the effectiveness of our method.

4.1 Dataset

We use the following three popular food datasets for evaluation.
ETH Food-101 [4] is a western food dataset, which contains 101,000 images from 101 categories. Following the test protocol [43], we use the first 70 classes as the training set and the remaining 31 classes to evaluate models.
Vireo Food-172 [8] contains 110,241 images from 172 categories, which mainly consists of Asian dishes. The first 120 classes are used for training and the remaining 52 classes for testing.
ISIA Food-500 [34] is a large miscellaneous dataset with 399,726 images from 500 classes, which consists of Asian food and western food. We keep the first 350 classes for training and the last 150 classes for testing.

4.2 Evaluation Metrics

We show the pipeline of the test phase in Figure 3. For the test set of all benchmark datasets, every instance is used as the query \(I_q\) and the remaining instances are used as the gallery. We use Recall@K (R@K) and MAP@K as the main evaluation metrics and Normalized Mutual Information (NMI) to evaluate the generalization of the model. MAP@K includes MAP@1K and MAP@C, where C is the largest number of samples in each class. NMI is defined as the mutual information between cluster labels and the ground truth. The dists@intra (intra-class distance) and dists@inter (inter-class distance) [40] are further used to measure the space occupied by datasets.
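The query and gallery protocol above can be sketched for Recall@K. The sketch assumes the convention common in deep metric learning, namely that a query counts as a hit when at least one of its K nearest gallery items shares its label; the source does not spell out this definition. The function name `recall_at_k` and the toy embeddings are hypothetical.

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class item among the top-k neighbors."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    # Every instance is a query; exclude it from its own gallery.
    np.fill_diagonal(sims, -np.inf)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return hits.mean()

embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.6]])
labels = np.array([0, 0, 1, 1, 1])
for k in (1, 2, 3):
    print(f"R@{k} =", recall_at_k(embeddings, labels, k))
```

In the toy data the last point belongs to class 1 but lies closer to the class 0 cluster, so it is only retrieved correctly once K is large enough to reach a class 1 neighbor.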

4.3 Experimental Setup

We apply the platform proposed by the authors of [40] to ensure the fairness of experiments. During training, we use ViT models pretrained on ImageNet [14] and fine-tune them on benchmark datasets. Models are optimized by Adam [23]. The initial learning rate is set to \(10^{-6}\) and the weight decay is \(4\times 10^{-4}\). Batch size is set to 112. To form the mini-batch, we randomly sample k classes and \(|P|\) samples per class, where \(|P|\) is set to 2. We resize images to 256 \(\times\) 256, and randomly crop them to \(224\times 224\) as input. During training, we use random crops and random flipping (\(p = 0.5\)) for data augmentation, and a single center crop of \(224\times 224\) is used during evaluation. The embedding size of all models is 128-D. We use the same random seed 0 to avoid seed-based performance fluctuations.
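The mini-batch construction described above can be sketched as a class-balanced sampler. With a batch size of 112 and \(|P| = 2\), the number of classes per batch works out to 56; this value is inferred from the stated numbers rather than given in the source. The function name `sample_batch` and the toy label array are hypothetical.

```python
import numpy as np

def sample_batch(labels, batch_size=112, per_class=2, seed=0):
    """Pick batch_size // per_class classes, then per_class samples from each."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_classes = batch_size // per_class  # 56 classes for the stated setup
    classes = rng.choice(np.unique(labels), size=n_classes, replace=False)
    batch = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        batch.extend(rng.choice(idx, size=per_class, replace=False).tolist())
    return np.array(batch)

# Toy label array: 70 training classes with 10 images each (illustrative only).
labels = np.repeat(np.arange(70), 10)
batch = sample_batch(labels)
print(batch.shape, np.unique(labels[batch]).size)
```

Sampling exactly two images per class guarantees that every anchor in the batch has one positive partner, which the tuple-based losses in Section 3 require.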

4.4 Result Analysis

4.4.1 Comparison with State-of-the-art.

Tables 1 and 2 show the performance of recent methods and ours on three benchmark datasets. We observe that our method achieves state-of-the-art results. In particular, our model outperforms other methods on all evaluation metrics across all datasets. Specifically, our fusion method outperforms others by 0.3–2.6% on R@K, 0.5–1.1% on MAP@K and 1.1–2.6% on NMI. Compared with baseline methods, that is, contrastive loss and margin loss functions, the proposed method achieves gains of roughly 0.3–5% on Recall@1, 0.5–4.8% on MAP@1K and 1–7% on NMI. This shows that the proposed method achieves not only better retrieval performance but also better generalization performance. It is noteworthy that MAP@1K and MAP@C are the same on ETH Food-101, because each of its categories contains 1,000 samples.
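| | | ETH Food-101 | | | | Vireo Food-172 | | | | ISIA Food-500 | | | |
| Type | Method | R@1 | R@2 | R@4 | NMI | R@1 | R@2 | R@4 | NMI | R@1 | R@2 | R@4 | NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ImageNet [12] (Pre-train) | 60.35 | 72.02 | 81.52 | 44.10 | 68.71 | 77.00 | 83.91 | 44.82 | 49.86 | 60.05 | 69.50 | 40.13 |
| Pair-based | Contrastive [43] | 71.43 | 80.96 | 87.79 | 54.20 | 82.20 | 87.97 | 92.07 | 63.52 | 63.25 | 72.81 | 80.58 | 53.57 |
| Pair-based | Triplet [1] | 71.75 | 81.19 | 87.83 | 54.89 | 82.50 | 88.21 | 92.30 | 64.34 | 62.97 | 72.55 | 80.34 | 53.52 |
| Pair-based | N-Pair [44] | 65.48 | 75.93 | 83.82 | 46.34 | 73.02 | 80.37 | 86.04 | 49.53 | 47.86 | 58.35 | 67.97 | 40.24 |
| Pair-based | Hist. [49] | 72.78 | 82.01 | 88.38 | 52.83 | 80.78 | 86.92 | 91.29 | 62.04 | 61.00 | 71.06 | 79.26 | 52.51 |
| Pair-based | Margin [28] | 73.06 | 82.13 | 88.75 | 56.44 | 82.85 | 88.30 | 92.23 | 65.35 | 64.62 | 73.90 | 81.37 | 54.97 |
| Pair-based | Lifted [18] | 64.46 | 75.33 | 83.48 | 45.27 | 71.92 | 79.26 | 85.28 | 47.76 | 45.24 | 55.72 | 65.64 | 38.32 |
| Pair-based | Circle [46] | 61.11 | 73.56 | 83.24 | 47.11 | 73.53 | 81.49 | 88.10 | 58.72 | 49.13 | 59.21 | 68.70 | 39.81 |
| Proxy-based | Softmax [61] | 73.14 | 82.58 | 88.87 | 57.45 | 82.01 | 87.77 | 91.95 | 64.88 | 67.16 | 76.40 | 83.53 | 59.17 |
| Proxy-based | ProxyNCA [35] | 69.99 | 79.65 | 86.61 | 53.49 | 79.85 | 85.89 | 90.63 | 60.56 | 66.36 | 75.76 | 82.96 | 57.93 |
| Proxy-based | SoftTriple [38] | 63.89 | 74.76 | 82.78 | 44.80 | 73.99 | 81.28 | 87.00 | 51.29 | 51.64 | 61.51 | 70.46 | 40.76 |
| Proxy-based | Proxy Synthesis [16] | 70.72 | 80.69 | 87.80 | 55.05 | 81.17 | 86.66 | 90.76 | 63.32 | 66.98 | 76.32 | 83.42 | 58.90 |
| AP-based | Smooth-AP [5] | 71.55 | 80.74 | 87.51 | 53.95 | 81.95 | 87.84 | 91.75 | 64.11 | 64.22 | 73.66 | 81.08 | 55.60 |
| AP-based | PNP [26] | 71.51 | 80.61 | 87.30 | 53.52 | 82.05 | 87.54 | 91.61 | 64.13 | 63.61 | 73.07 | 80.60 | 54.63 |
| | Ours (Margin+\(\rho\) Sampling) | 73.52 | 82.56 | 88.98 | 57.57 | 82.94 | 88.38 | 92.31 | 65.88 | 64.67 | 74.13 | 81.65 | 57.25 |
| | Ours (Margin+\(\rho\) Sampling+GAO) | 74.10 | 83.20 | 89.28 | 58.58 | 83.13 | 88.49 | 92.32 | 66.21 | 69.70 | 78.77 | 85.35 | 61.71 |
Table 1. Comparison of R@K and NMI Performance for all Loss Functions (%)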
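To make the R@K columns concrete, the sketch below computes Recall@K in the way it is commonly defined for retrieval: a query counts as a hit if at least one of its K nearest neighbours shares its class. The brute-force Euclidean ranking, the function name `recall_at_k`, and the five toy points are assumptions for illustration; the evaluation code of [40] may differ in distance measure and implementation.

```python
import math


def recall_at_k(embeddings, labels, ks=(1, 2, 4)):
    """Percentage of queries with a same-class item among their k nearest neighbours."""
    n = len(embeddings)
    hits = {k: 0 for k in ks}
    for q in range(n):
        # Rank every other item by Euclidean distance to the query.
        ranked = sorted(
            (i for i in range(n) if i != q),
            key=lambda i: math.dist(embeddings[q], embeddings[i]),
        )
        for k in ks:
            if any(labels[i] == labels[q] for i in ranked[:k]):
                hits[k] += 1
    return {k: 100.0 * hits[k] / n for k in ks}


# Toy embeddings: two tight clusters plus one "b" item lying close to cluster "a".
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.0, 3.0)]
labels = ["a", "a", "b", "b", "b"]
scores = recall_at_k(points, labels)  # {1: 80.0, 2: 80.0, 4: 100.0}
```

The stray "b" item is only retrieved correctly once K reaches 4, which is why R@4 is higher than R@1 and R@2 in this toy case.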
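| | | ETH Food-101 | | Vireo Food-172 | | ISIA Food-500 | |
| Type | Method | MAP@1K | MAP@C | MAP@1K | MAP@C | MAP@1K | MAP@C |
|---|---|---|---|---|---|---|---|
| | ImageNet (Pre-train) | 15.12 | 15.12 | 13.14 | 12.08 | 5.24 | 5.27 |
| Pair-based | Contrastive [43] | 22.53 | 22.53 | 31.04 | 28.49 | 12.79 | 12.82 |
| Pair-based | Triplet [1] | 22.92 | 22.92 | 31.45 | 28.87 | 12.56 | 12.58 |
| Pair-based | N-Pair [44] | 19.31 | 19.31 | 19.41 | 17.83 | 5.63 | 5.66 |
| Pair-based | Hist. [49] | 23.17 | 23.17 | 30.59 | 27.96 | 12.21 | 12.23 |
| Pair-based | Margin [28] | 23.38 | 23.38 | 31.79 | 29.17 | 12.79 | 12.82 |
| Pair-based | Lifted [18] | 18.41 | 18.41 | 18.09 | 16.62 | 4.78 | 4.80 |
| Pair-based | Circle [46] | 15.04 | 15.04 | 22.35 | 20.86 | 4.11 | 4.09 |
| Proxy-based | Softmax [61] | 23.88 | 23.88 | 30.89 | 28.01 | 15.35 | 15.37 |
| Proxy-based | ProxyNCA [35] | 21.65 | 21.65 | 30.13 | 27.57 | 15.89 | 15.92 |
| Proxy-based | SoftTriple [38] | 18.11 | 18.11 | 18.65 | 17.03 | 5.97 | 5.98 |
| Proxy-based | Proxy Synthesis [16] | 22.03 | 22.03 | 26.99 | 24.52 | 16.52 | 16.56 |
| AP-based | Smooth-AP [5] | 23.24 | 23.24 | 31.78 | 29.11 | 14.67 | 14.72 |
| AP-based | PNP [26] | 23.20 | 23.20 | 31.51 | 28.92 | 14.87 | 14.91 |
| | Ours (Margin+\(\rho\) Sampling) | 23.50 | 23.50 | 31.66 | 29.09 | 14.52 | 14.53 |
| | Ours (Margin+\(\rho\) Sampling+GAO) | 24.96 | 24.96 | 32.32 | 29.63 | 17.59 | 17.68 |
Table 2. Comparison of MAP@K Performance for all Loss Functions (%)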
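The identical MAP@1K and MAP@C columns for ETH Food-101 can be illustrated with a short sketch. It assumes that MAP@C uses a per-query cutoff equal to the size of the query's class, and it uses a MAP@R-style normalization, \(\frac{1}{K}\sum_{i \le K} P@i \cdot rel_i\). Both are assumptions: the exact definitions are not restated in this section and may differ in the evaluation code. The names `average_precision_at_k` and `mean_ap` are hypothetical.

```python
def average_precision_at_k(is_relevant, k):
    """AP over the top-k of one ranked list: (1/k) * sum of P@i at relevant ranks i <= k.

    Assumed MAP@R-style normalization; other AP variants divide by the number of hits.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(is_relevant[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / k


def mean_ap(ranked_lists, cutoffs):
    """Mean AP in percent; cutoffs[i] is the cutoff applied to query i."""
    aps = [average_precision_at_k(r, k) for r, k in zip(ranked_lists, cutoffs)]
    return 100.0 * sum(aps) / len(aps)


# Two toy queries; True marks a retrieved item of the query's class.
ranked = [[True, False, True, False], [False, True, True, True]]

fixed_cutoff = [3, 3]   # plays the role of the fixed K in MAP@1K
class_sizes = [3, 3]    # plays the role of C when every class has the same size
map_fixed = mean_ap(ranked, fixed_cutoff)
map_class = mean_ap(ranked, class_sizes)
```

When every class has the same number of samples and that number equals the fixed cutoff, the two metrics rank exactly the same prefix of each list and therefore return the same value, as in the ETH Food-101 columns.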

4.4.2 Performance of Different Loss Functions.

We compare the performance of the pre-trained model and 12 different loss functions under the same setting. Table 1 also groups the loss functions into three types. Among the ranking-based (pair-based) methods, margin loss achieves the best performance on ETH Food-101, Vireo Food-172, and ISIA Food-500, outperforming the others by about 0.5–1% on Recall@1 and 0.2–1.2% on MAP@1K. We further compare three proxy-based methods and find that the Softmax method, which is normally used for classification tasks, achieves competitive performance. We also include the recently proposed AP-based method Smooth-AP [5]. In summary, we find no obvious performance divergence between the different types of loss functions on food image retrieval tasks. Interestingly, recent methods such as Circle loss [46] work well on general image retrieval tasks but perform poorly on food datasets. The probable reason is that this local optimization method leads to poor generalization of the model, which further confirms the importance of generalization in food image retrieval.

4.4.3 Performance of Different Sampling Strategies.

We also study different sampling strategies for tuple mining, using contrastive loss and margin loss. The results are shown in Table 3. Given the same loss function, all three sampling methods improve over random sampling. In particular, although contrastive loss yields considerably worse results than margin loss under random sampling, its performance improves significantly when it uses a sampling procedure similar to that of margin loss. Without a sampling strategy (random), the R@1 of the two loss functions differs widely (73.13% vs. 78.38% on Vireo Food-172, and 40.56% vs. 58.21% on ISIA Food-500), whereas an advanced sampling strategy narrows this difference (82.07% vs. 82.58%, and 60.23% vs. 65.48%, with Softhard). This evidence shows the importance of the sampling method. The proposed sampling method achieves competitive performance across datasets and loss functions. These results show that the proposed \(\rho\) Sampling improves the generalization of the model and achieves better performance.
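| | | ETH Food-101 | | | | | Vireo Food-172 | | | | | ISIA Food-500 | | | | |
| Loss | Sampling | R@1 | R@2 | R@4 | MAP | NMI | R@1 | R@2 | R@4 | MAP | NMI | R@1 | R@2 | R@4 | MAP | NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Contrastive | Random [20] | 63.37 | 74.40 | 83.40 | 17.95 | 45.24 | 73.13 | 80.67 | 86.82 | 20.39 | 49.52 | 40.56 | 50.65 | 60.69 | 3.80 | 35.97 |
| Contrastive | Softhard [39] | 68.67 | 78.98 | 86.37 | 21.07 | 50.33 | 82.07 | 87.69 | 91.81 | 31.40 | 63.61 | 60.23 | 70.17 | 78.42 | 11.58 | 51.19 |
| Contrastive | Distance [56] | 71.43 | 80.96 | 87.79 | 22.53 | 54.20 | 82.20 | 87.97 | 92.07 | 31.45 | 63.52 | 63.25 | 72.81 | 80.58 | 12.79 | 53.57 |
| Contrastive | \(\rho\) Sampling | 72.69 | 81.85 | 88.67 | 23.31 | 57.46 | 82.73 | 88.35 | 92.29 | 31.62 | 65.69 | 64.60 | 74.02 | 81.59 | 13.22 | 57.11 |
| Margin | Random [20] | 67.30 | 77.59 | 85.39 | 20.56 | 49.32 | 78.38 | 84.76 | 89.50 | 26.87 | 57.96 | 58.21 | 68.43 | 77.14 | 10.97 | 50.51 |
| Margin | Softhard [39] | 73.13 | 82.18 | 88.55 | 23.35 | 55.82 | 82.58 | 88.28 | 92.26 | 31.46 | 64.46 | 65.48 | 74.94 | 82.24 | 14.04 | 56.31 |
| Margin | Distance [56] | 73.06 | 82.13 | 88.75 | 23.38 | 56.44 | 82.85 | 88.30 | 92.23 | 31.79 | 65.35 | 64.62 | 73.90 | 81.37 | 12.79 | 54.97 |
| Margin | \(\rho\) Sampling | 73.52 | 82.56 | 88.98 | 23.50 | 57.57 | 82.94 | 88.38 | 92.31 | 31.66 | 65.88 | 64.67 | 74.13 | 81.65 | 14.52 | 57.25 |
Table 3. Comparison of R@K and NMI performance for all sampling strategies (%). MAP means MAP@1K.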
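The claim that a better sampling strategy narrows the gap between the two loss functions can be checked with simple arithmetic on the R@1 values in Table 3. The snippet below is only a worked calculation; the dictionary layout and the helper name `loss_gap` are illustrative.

```python
# R@1 values (%) copied from Table 3, keyed by (loss, sampling).
r_at_1 = {
    "Vireo Food-172": {
        ("contrastive", "random"): 73.13, ("margin", "random"): 78.38,
        ("contrastive", "softhard"): 82.07, ("margin", "softhard"): 82.58,
    },
    "ISIA Food-500": {
        ("contrastive", "random"): 40.56, ("margin", "random"): 58.21,
        ("contrastive", "softhard"): 60.23, ("margin", "softhard"): 65.48,
    },
}


def loss_gap(table, sampling):
    """Margin minus contrastive R@1 under one sampling strategy, in percentage points."""
    return round(table[("margin", sampling)] - table[("contrastive", sampling)], 2)


gaps = {
    dataset: {s: loss_gap(table, s) for s in ("random", "softhard")}
    for dataset, table in r_at_1.items()
}
```

On Vireo Food-172 the gap shrinks from 5.25 to 0.51 percentage points, and on ISIA Food-500 from 17.65 to 5.25, when random sampling is replaced by Softhard.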
Figure 6 shows that the proposed \(\rho\) Sampling outperforms the baseline on all metrics. Surprisingly, dists@inter is smaller than dists@intra for the baseline method, and the same holds for \(\rho\) Sampling. This abnormal phenomenon means that the method can cause confusion between categories and weaken the discriminative capability. Fortunately, the fusion method solves the problem and achieves better performance on all three datasets.
Fig. 6. The comparison of dists@intra, dists@inter, and NMI on three datasets. Ours represents the combination of our methods.
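For readers who want to reproduce quantities like those in Figure 6, the sketch below computes one plausible version of the two distance statistics. It assumes that dists@intra is the mean pairwise distance between embeddings of the same class and that dists@inter is the mean pairwise distance between class centroids. These definitions, the lack of any normalization, and the name `intra_inter_distances` are assumptions; the platform of [40] may define the statistics differently.

```python
import math
from collections import defaultdict
from itertools import combinations


def intra_inter_distances(embeddings, labels):
    """Return (mean intra-class distance, mean inter-centroid distance).

    Assumed definitions; requires at least two classes and at least one
    class with two or more members.
    """
    groups = defaultdict(list)
    for embedding, label in zip(embeddings, labels):
        groups[label].append(embedding)

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    # Intra: average pairwise distance inside each class, then averaged over classes.
    intra = mean(
        mean(math.dist(a, b) for a, b in combinations(members, 2))
        for members in groups.values()
        if len(members) > 1
    )
    # Inter: average pairwise distance between class centroids.
    centroids = [tuple(mean(col) for col in zip(*members)) for members in groups.values()]
    inter = mean(math.dist(a, b) for a, b in combinations(centroids, 2))
    return intra, inter


# Toy embeddings: two classes of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
intra, inter = intra_inter_distances(points, labels)  # (2.0, 4.0)
```

In this toy case the classes are well separated, so the inter-class value exceeds the intra-class value; the abnormal situation discussed above corresponds to the opposite ordering.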

4.4.4 Cross-domain Evaluation.

To further verify the generalization ability of our method, we conduct a cross-domain evaluation. Specifically, for ETH Food-101 and Vireo Food-172, we train the model on the training set of one dataset and evaluate it on the test set of the other. For example, we train on the training set of ETH Food-101 and test on the test set of Vireo Food-172, and vice versa. As shown in Table 4, all methods lose performance compared with the basic setting. However, our method still achieves the best performance in the cross-domain setting thanks to its consideration of generalization. Specifically, our method exceeds the other methods by 1–13% on R@1, 0.5–11% on MAP@1K, and 1–14% on NMI.
| | ETH Food-101\(\rightarrow\)Vireo Food-172 | | | | | Vireo Food-172\(\rightarrow\)ETH Food-101 | | | | |
| Method | R@1 | R@2 | R@4 | MAP | NMI | R@1 | R@2 | R@4 | MAP | NMI |
|---|---|---|---|---|---|---|---|---|---|---|
| Contrastive [43] | 75.06 | 82.34 | 87.94 | 17.64 | 51.32 | 70.45 | 80.36 | 87.54 | 20.81 | 52.74 |
| Softmax [61] | 75.09 | 82.40 | 87.96 | 19.64 | 53.53 | 70.21 | 79.69 | 86.45 | 19.92 | 53.68 |
| Smooth-AP [5] | 74.81 | 81.51 | 87.24 | 19.69 | 51.30 | 70.27 | 79.73 | 86.50 | 22.21 | 52.44 |
| Circle [46] | 63.77 | 73.00 | 80.78 | 10.35 | 42.15 | 57.45 | 68.53 | 76.35 | 13.83 | 43.76 |
| Proxy Synthesis [16] | 73.01 | 80.71 | 86.89 | 16.79 | 50.10 | 64.52 | 75.27 | 83.43 | 17.07 | 46.78 |
| PNP [26] | 75.87 | 83.21 | 88.20 | 19.90 | 53.86 | 70.33 | 79.68 | 86.32 | 22.41 | 52.10 |
| Ours | 76.98 | 83.86 | 88.85 | 21.41 | 55.49 | 72.36 | 82.01 | 88.34 | 23.19 | 56.23 |
Table 4. Comparison of R@1 and NMI Performance on Domain Transfer
ETH Food-101\(\rightarrow\)Vireo Food-172 means that we use ETH Food-101 for training and Vireo Food-172 for testing and vice versa (%). MAP means MAP@1K.

4.4.5 Impact of Hyper-parameters.

We investigate the effect of different hyper-parameter settings, that is, probability \(\rho\) in Section 3.2 and loss weight \(\alpha\) in Equation (9) on ETH Food-101 and Vireo Food-172.
Probability \(\rho\): The influence of \(\rho\) is shown in Figure 7(a). The best generalization corresponds to \(\rho\) = 0.2 or 0.3, and performance gradually decreases as \(\rho\) increases further. As mentioned in Section 3.1, the selected \(\rho\) should preserve the boundary between categories while maximizing some intra-class distances during training, that is, it should not cause confusion between the categories. As \(\rho\) increases, too many wrong labels make it difficult to provide enough training information, which confuses the whole training process and leads to poor performance. Therefore, we need an appropriate probability \(\rho\) to balance training efficiency and generalization.
Fig. 7. Impact of different hyper-parameters on ETH Food-101, Vireo Food-172, and ISIA Food-500 (%).
Weight \(\alpha\): The influence of \(\alpha\) is shown in Figure 7(b). We observe that \(\alpha\) = 0.1 yields the highest R@1. \(\alpha\) controls the relative weight of the classification loss and the metric learning loss.

4.5 Qualitative Analysis

Relationship between generalization and over-fitting. In Section 1, we point out the importance of generalization in food image retrieval and that poor generalization limits performance. In experiments, we find that popular metric learning methods, such as contrastive loss and margin loss, over-fit easily, as shown in Figure 8. Although these two methods converge rapidly, their over-fitting is obvious, especially on ISIA Food-500, which indicates limited generalization. Our method converges more slowly than the other two, but this slower growth yields better final performance, which shows that our method is less prone to over-fitting and that our model has better generalization and superior performance.
Fig. 8. The Recall@1 on three test datasets with different methods.
In this section, we qualitatively analyze the performance of the proposed method. Retrieval results on the three benchmark datasets are shown in Figure 9. We select one query from each dataset and visualize the Top-10 retrieval results. Retrieval performance improves greatly with our method. Notably, for the third query, the baseline model mistakenly recalls other food images because of their similar appearance. In contrast, our model successfully captures the main features of the query, excludes hard negative instances, and produces a more favorable ranking list.
Fig. 9. Query instances and Top-10 ranked instances from the retrieval set on ETH Food-101, Vireo Food-172 and ISIA Food-500 datasets when using the baseline model (ImageNet pre-trained weights) and after using our method. Images with a green border are positive instances and images with a red border are negative ones for each query.
To further verify the impact of generalization, we use t-SNE [27] to visualize the feature embeddings optimized by the baseline, that is, contrastive loss, and by our method on the three datasets. We randomly choose 3,000 instances from the test sets as target points. As shown in Figure 10, compared with the baseline, the embeddings of our method occupy a larger region of the embedding space and the different categories are separated more clearly, which indicates better generalization. The visualization is also consistent with the larger dists@intra and dists@inter in Figure 6.
Fig. 10. Visualization of feature embeddings optimized by different methods with t-SNE [27]. Each point represents an image in test sets of three datasets. Different classes are distinguished by different colors.

4.6 Performance of General Image Retrieval

Current results verify the power of our method in food image retrieval. We hope to prove that solving the problem from generalization can also improve the performance of other tasks. In order to further verify the performance on the general image retrieval task, we experiment on three popular general datasets. CUB200-2011 [51] is a fine-grained bird dataset, which contains 11,788 images from 200 categories. Empirically, we use the first 100 classes (5,864 images) for training and the last 100 classes (5,924 images) for testing. CARS196 [24] has 196 car classes with 16,185 images. Training/test sets use the first/last 98 classes (8,054/8,131 images). Stanford Online Products (SOP) [45] contains 120,053 product images divided into 22,634 classes. We use the first 11,318 classes (59,551 images) for training and the last 11,316 classes (60,502 images) for testing. In SOP, unlike the other benchmarks, most classes have few instances, leading to significantly different data distribution compared with CUB200-2011 and CARS196.
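The class-disjoint protocol described above, in which the first classes are used for training and the remaining ones for testing, can be sketched as follows. The helper name `split_by_class` and the toy labels are illustrative, and the image counts in `reported` are copied from the text only to check that each pair of splits adds up to the stated dataset size.

```python
def split_by_class(labels, num_train_classes):
    """Indices of samples whose class is among the first `num_train_classes` ids, and the rest."""
    classes = sorted(set(labels))
    train_classes = set(classes[:num_train_classes])
    train_idx = [i for i, y in enumerate(labels) if y in train_classes]
    test_idx = [i for i, y in enumerate(labels) if y not in train_classes]
    return train_idx, test_idx


# Toy example: 4 classes, the first 2 go to training.
toy_labels = [0, 0, 1, 2, 2, 3]
train_idx, test_idx = split_by_class(toy_labels, num_train_classes=2)
# train_idx == [0, 1, 2], test_idx == [3, 4, 5]

# (train images, test images, total images) as reported in the text.
reported = {
    "CUB200-2011": (5864, 5924, 11788),
    "CARS196": (8054, 8131, 16185),
    "SOP": (59551, 60502, 120053),
}
consistent = all(train + test == total for train, test, total in reported.values())
```

Because no class appears in both splits, every test query belongs to a category the model never saw during training, which is what makes these benchmarks a test of generalization.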
Results show that our method still outperforms the others on the general image retrieval task: by 0.5–1% on R@1, 0.5–1% on MAP@C, and 0.2–1.1% on NMI. The performance gain on CUB200-2011 and CARS196 is lower than that on SOP, because the differences between the categories of the first two datasets are not obvious. Nevertheless, our method still outperforms the others by 0.3% on R@1. Owing to the diversity of its products, SOP is similar to the food datasets; thus, the improvement of our method on SOP is more pronounced, which also suggests that our generalization-oriented idea suits image retrieval tasks with larger differences between categories.
Surprisingly, we find that proxy-based methods such as softmax and proxy synthesis, which perform well on food image datasets, are not effective in the general image retrieval task. Compared with general image retrieval datasets, food datasets have a small number of categories and a large number of samples per category. Proxy-based losses borrow the idea of classification loss: they construct a center for each category and pull samples of the same category toward it. When the number of samples per category is large, these loss functions are very effective, whereas pair-based losses involve a lot of local optimization, which leads to convergence difficulties and affects performance. In contrast, general image retrieval datasets such as SOP have a small number of samples per category, which makes the category centers difficult to construct. Therefore, the performance of proxy-based losses is poor.

5 Conclusions

In this article, we propose a generalization-oriented analysis method for food image retrieval and establish a new benchmark on three food datasets. Focusing on the generalization problem caused by the large differences between categories of different domains in food images, we propose the \(\rho\) Sampling and GAO methods, which greatly improve the generalization of the model. The proposed method also achieves the best performance on three benchmark datasets.
Future work includes the following. (1) The selection of the \(\rho\) parameter. In this article, we investigate the impact of the \(\rho\) parameter and evaluate it on different food image datasets. However, the selection of \(\rho\) is affected by the data distribution, the number of samples in each category, and the number of categories. In future work, we hope to explore an adaptive method that finds the optimal \(\rho\) for each dataset. (2) Generalization for food analysis. In this work, we design a generalization-oriented sampling strategy and a loss function design method to enhance the generalization of the model. In future work, we can also design a feature extraction method to enhance generalization. In addition, given the in-depth research on generalization in tasks such as transfer learning, we can draw on ideas from transfer learning to design a network for food image retrieval.

References

[1]
G. Albert, A. Jon, R. Jerome, and L. Diane. 2016. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision. 241–257.
[2]
G. Barlacchi, A. Abad, E. Rossinelli, and A. Moschitti. 2016. Appetitoso: A search engine for restaurant retrieval based on dishes. In Italian Conference on Computational Linguistics. 46–50.
[3]
A. Bellet and A. Habrard. 2015. Robustness and generalization for metric learning. Neurocomputing 151 (2015), 259–267.
[4]
L. Bossard, M. Guillaumin, and L. Van Gool. 2014. Food-101-mining discriminative components with random forests. In European Conference on Computer Vision. 446–461.
[5]
A. Brown, W. Xie, V. Kalogeiton, and A. Zisserman. 2020. Smooth-AP: Smoothing the path towards large-scale image retrieval. In European Conference on Computer Vision. 677–694.
[6]
M. Chang, L. Guillain, H. Jung, V. Hare, J. Kim, and M. Agrawala. 2018. RecipeScape: An interactive tool for analyzing cooking instructions at scale. In Proc. CHI Conference on Human Factors in Computing Systems. 451:1–451:12.
[7]
C. Chaudhary, P. Goyal, N. Goyal, and Y. Chen. 2020. Image retrieval for complex queries using knowledge embedding. ACM Transactions on Multimedia Computing, Communications and Applications 16, 1 (2020), 13:1–13:23.
[8]
J. Chen and C. Ngo. 2016. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the ACM International Conference on Multimedia. 32–41.
[9]
J. Chen, C. Ngo, F. Feng, and T. Chua. 2018. Deep understanding of cooking procedure for cross-modal recipe retrieval. In Proceedings of the ACM International Conference on Multimedia. 1020–1028.
[10]
G. Ciocca, P. Napoletano, and R. Schettini. 2017. Learning CNN-based features for retrieval of food images. In International Conference on Image Analysis and Processing. 426–434.
[11]
Y. Cui, F. Zhou, Y. Lin, and S. Belongie. 2016. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In IEEE Conference on Computer Vision and Pattern Recognition. 1153–1162.
[12]
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. 248–255.
[13]
T. Do, T. Hoang, D. Tan, H. Le, T. Nguyen, and N. Cheung. 2019. From selective deep convolutional features to compact binary representations for image retrieval. ACM Transactions on Multimedia Computing, Communications and Applications 15, 2 (2019), 43:1–43:22.
[14]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
[15]
G. M. Farinella, D. Allegra, M. Moltisanti, F. Stanco, and S. Battiato. 2016. Retrieval and classification of food images. Computers in Biology and Medicine 77 (2016), 23–39.
[16]
G. Gu, B. Ko, and H. Kim. 2021. Proxy synthesis: Learning with synthetic classes for deep metric learning. In AAAI. 1460–1468.
[17]
R. Hadsell, S. Chopra, and Y. Lecun. 2006. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition. 1735–1742.
[18]
A. Hermans, L. Beyer, and B. Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017).
[19]
S. Horiguchi, S. Amano, M. Ogawa, and K. Aizawa. 2018. Personalized classifier for food image recognition. IEEE Trans. Multimedia 20, 10 (2018), 2836–2848.
[20]
J. Hu, J. Lu, and Y. Tan. 2014. Discriminative deep metric learning for face verification in the wild. In IEEE Conference on Computer Vision and Pattern Recognition. 1875–1882.
[21]
X. Ji, W. Wang, M. Zhang, and Y. Yang. 2017. Cross-domain image retrieval with attention modeling. In Proceedings of the ACM International Conference on Multimedia. 1654–1662.
[22]
S. Jiang, W. Min, Y. Lyu, and L. Liu. 2020. Few-shot food recognition via multi-view representation learning. ACM Transactions on Multimedia Computing, Communications and Applications 16, 3 (2020), 87:1–87:20.
[23]
D. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations. 1–15.
[24]
J. Krause, M. Stark, J. Deng, and F. Li. 2013. 3D object representations for fine-grained categorization. In IEEE International Conference on Computer Vision Workshops. 554–561.
[25]
Y. Li, H. Yao, T. Zhang, and C. Xu. 2021. Part-based structured representation learning for person re-identification. ACM Transactions on Multimedia Computing, Communications and Applications 16, 4 (2021), 134:1–134:22.
[26]
Z. Li, W. Min, J. Song, Y. Zhu, and S. Jiang. 2021. Rethinking ranking-based loss functions: Only penalizing negative instances before positive ones is enough. arXiv preprint arXiv:2102.04640 (2021).
[27]
L. Maaten and G. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605.
[28]
R. Manmatha, C. Wu, A. Smola, and P. Krahenbuhl. 2017. Sampling matters in deep embedding learning. In IEEE International Conference on Computer Vision. 2859–2867.
[29]
C. Micael, C. Rémi, P. David, S. Laure, T. Nicolas, and C. Matthieu. 2018. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In International ACM SIGIR Conference. 35–44.
[30]
T. Milbich, K. Roth, B. Brattoli, and B. Ommer. 2022. Sharing matters for generalization in deep metric learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 1 (2022), 416–427.
[31]
W. Min, S. Jiang, and R. Jain. 2020. Food recommendation: Framework, existing solutions, and challenges. IEEE Transactions on Multimedia 22, 10 (2020), 2659–2671.
[32]
W. Min, S. Jiang, L. Liu, Y. Rui, and R. Jain. 2019. A survey on food computing. Comput. Surveys 52, 5 (2019), 36.
[33]
W. Min, S. Jiang, J. Sang, H. Wang, X. Liu, and L. Herranz. 2017. Being a supercook: Joint food attributes and multi-modal content modeling for recipe retrieval and exploration. IEEE Transactions on Multimedia 19, 5 (2017), 1100–1113.
[34]
W. Min, L. Liu, Z. Wang, Z. Luo, X. Wei, X. Wei, and S. Jiang. 2020. ISIA Food-500: A dataset for large-scale food recognition via stacked global-local attention network. In Proceedings of the ACM International Conference on Multimedia. 393–401.
[35]
Y. Movshovitz-Attias, A. Toshev, T. Leung, S. Ioffe, and S. Singh. 2017. No fuss distance metric learning using proxies. In IEEE International Conference on Computer Vision. 360–368.
[36]
E. Ong, S. Husain, and M. Bober. 2017. Siamese network of deep Fisher-vector descriptors for image retrieval. arXiv preprint arXiv:1702.00338 (2017).
[37]
P. Pouladzadeh and S. Shirmohammadi. 2017. Mobile multi-food recognition using deep learning. ACM Transactions on Multimedia Computing, Communications and Applications 13, 3s (2017), 36:1–36:21.
[38]
Q. Qian, L. Shang, B. Sun, J. Hu, T. Tacoma, H. Li, and R. Jin. 2019. SoftTriple loss: Deep metric learning without triplet sampling. In IEEE International Conference on Computer Vision. 6449–6457.
[40]
K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, and J. Cohen. 2020. Revisiting training strategies and generalization performance in deep metric learning. In ICML. 8242–8252.
[41]
A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba. 2017. Learning cross-modal embeddings for cooking recipes and food images. In IEEE Conference on Computer Vision and Pattern Recognition. 3020–3028.
[42]
F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition. 815–823.
[43]
W. Shimoda and K. Yanai. 2017. Learning food image similarity for food image retrieval. In IEEE 3rd International Conference on Multimedia Big Data. 165–168.
[44]
K. Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems. 1857–1865.
[45]
H. Song, Y. Xiang, S. Jegelka, and S. Savarese. 2016. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition. 4004–4012.
[46]
Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei. 2020. Circle loss: A unified perspective of pair similarity optimization. In IEEE Conference on Computer Vision and Pattern Recognition. 6398–6407.
[47]
Z. Tang and J. Huang. 2022. Harmonious multi-branch network for person re-identification with harder triplet loss. ACM Transactions on Multimedia Computing, Communications and Applications 18, 4 (2022), 98:1–98:21.
[48]
N. Tishby and N. Zaslavsky. 2015. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop. 1–5.
[49]
E. Ustinova and V. Lempitsky. 2016. Learning deep embeddings with histogram loss. In NIPS. 4170–4178.
[50]
V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio. 2019. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning. 6438–6447.
[51]
C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001. California Institute of Technology.
[52]
H. Wang, D. Sahoo, C. Liu, K. Shu, P. Achananuparp, E. Lim, and S. Hoi. 2022. Cross-modal food retrieval: Learning a joint embedding of food images and recipes with semantic consistency and attention mechanism. IEEE Transactions on Multimedia 24 (2022), 2515–2525.
[53]
L. Wang, Q. Li, N. Li, G. Dong, and Y. Yang. 2008. Substructure similarity measurement in Chinese recipes. In Proceedings of the ACM International Conference on World Wide Web. 979–988.
[54]
W. Wang, L. Duan, H. Jiang, P. Jing, X. Song, and L. Nie. 2021. Market2Dish: Health-aware food recommendation. ACM Transactions on Multimedia Computing, Communications and Applications 17, 1 (2021), 33:1–33:19.
[55]
Z. Wang, Y. Li, R. Hong, and X. Tian. 2019. Eigenvector-based distance metric learning for image classification and retrieval. ACM Transactions on Multimedia Computing, Communications and Applications 15, 3 (2019), 84:1–84:19.
[56]
C. Wu, R. Manmatha, A. Smola, and P. Krahenbuhl. 2017. Sampling matters in deep embedding learning. In IEEE International Conference on Computer Vision. 2840–2848.
[57]
J. Wu, J. Jiang, M. Qi, C. Chen, and Y. Liu. 2022. Improving feature discrimination for object tracking by structural-similarity-based metric learning. ACM Transactions on Multimedia Computing, Communications and Applications 18, 4 (2022), 90:1–90:23.
[58]
H. Xie, L. Yu, and Q. Li. 2011. A hybrid semantic item model for recipe search by example. In IEEE International Symposium on Multimedia. 254–259.
[59]
Q. Xu, A. Molino, J. Lin, F. Fang, V. Subbaraju, L. Li, and J. Lim. 2021. Lifelog image retrieval based on semantic relevance mapping. ACM Transactions on Multimedia Computing, Communications and Applications 17, 3 (2021), 92:1–92:18.
[60]
X. Yao, D. She, H. Zhang, J. Yang, M. Cheng, and L. Wang. 2021. Adaptive deep metric learning for affective image retrieval and classification. IEEE Trans. Multimedia 23 (2021), 1640–1653.
[61]
A. Zhai and H. Wu. 2019. Classification is a strong baseline for deep metric learning. In British Machine Vision Conference. 91.
[62]
L. Zhang, H. Guo, K. Zhu, H. Qiao, G. Huang, S. Zhang, H. Zhang, J. Sun, and J. Wang. 2022. Hybrid modality metric learning for visible-infrared person re-identification. ACM Transactions on Multimedia Computing, Communications and Applications 18, 1s (2022), 25:1–25:15.
[63]
S. Zhang, G. Li, W. Zhang, Q. Huang, T. Huang, M. Shah, and N. Sebe. 2022. Introduction to the special issue on fine-grained visual recognition. ACM Transactions on Multimedia Computing, Communications and Applications 18, 1s (2022), 24:1–24:3.
[64]
B. Zhu, C. Ngo, and W. Chan. 2022. Learning from web recipe-image pairs for food recognition: Problem, baselines and performance. IEEE Transactions on Multimedia 24 (2022), 1175–1185.

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 1
    January 2024
    639 pages
    EISSN:1551-6865
    DOI:10.1145/3613542
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 August 2023
    Online AM: 29 May 2023
    Accepted: 22 May 2023
    Revised: 11 May 2023
    Received: 25 June 2022
    Published in TOMM Volume 20, Issue 1

    Author Tags

    1. Food computing
    2. image retrieval
    3. deep learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Nature Science Foundation of China
    • CAAI-Huawei MindSpore Open Fund
