
Open access

Towards Food Image Retrieval via Generalization-Oriented Sampling and Loss Function Design

Authors:

Jiajun Song,

Zhuo Li,

Weiqing Min,

Shuqiang Jiang

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 1

Article No.: 13, Pages 1 - 19

https://doi.org/10.1145/3600095

Published: 25 August 2023


Abstract

Food computing has increasingly received widespread attention in the multimedia field. As a basic task of food computing, food image retrieval has wide applications, that is, food image retrieval can help users to find the desired food from a large number of food images. Besides, the retrieved information can be applied to establish a richer database for the subsequent food content-related recommendation. Food image retrieval aims to achieve better performance on novel categories. Thus, it is worth studying to transfer the embedding ability from the training set to the unseen test set, that is, the generalization of the model. Food is influenced by various factors, such as culture and geography, leading to great differences between domains, such as Asian food and western food. Therefore, it is challenging to study the generalization of the model in food image retrieval. In this article, we improve the classical metric learning framework and propose a generalization-oriented sampling strategy, which boosts the generalization of the model by maximizing the intra-class distance from a proportion of positive pairs to avoid the excessive distance compression in the embedding space. Considering that the existing optimization process is in an opposite direction to our proposed sampling strategy, we further propose an adaptive gradient assignment policy named gradient-adaptive optimization, which can alleviate the intra-class distance compression during optimization by assigning different gradients to different samples. Extensive evaluation on three popular food image datasets demonstrates the effectiveness of the proposed method. We also experiment on three popular general datasets to prove that solving the problem from the generalization can also improve the performance of general image retrieval. Code is available at https://github.com/Jiajun-ISIA/Generalization-oriented-Sampling-and-Loss.

1 Introduction

Food is closely related to the healthy life of human beings; thus, food-related research has become a hotspot. Food image retrieval aims to find all relevant images in a food database for a given query, and it is an important issue in the multimedia field. As an important branch of content-based image retrieval (CBIR), food image retrieval has expanded the scope of CBIR and reflects the progress made in this field. Therefore, food image retrieval has high academic value. In addition, food image retrieval has high application value. First, food image retrieval can help consumers with special needs to find the desired food from a large number of food images. For example, fitness coaches and body managers usually need to use food-related search engines to find healthy meals from a wide range of dishes displayed by a large number of stores on the Internet. Food image retrieval can help them to find the best match from a large number of food images. Second, food image retrieval can solve the problem of dynamic categories, which cannot be solved by the common food recognition paradigm. There are many kinds of unlabeled novel foods, and labelling these food images is costly, whereas food image recognition requires a set of annotated images to train a robust classifier. Food image retrieval can be used to find similar foods among the available ones and to suggest the most appropriate food class. Third, food image retrieval can be used to enrich the database of food recommendation systems. Catering-oriented social media, such as Yelp¹ and Meituan², use dish name-based recommendation systems to predict users’ preferences. However, these systems often fail because different restaurants may give different names to the same dish, so the recommendation system cannot match the corresponding dish name in the database. A food image-based recommendation system can avoid this problem. Therefore, food image retrieval helps to recommend food more accurately by enriching the database, which avoids the matching failure caused by the same dish having different names.

With the rapid development of the food field, new food categories are produced every day. Existing methods [10, 15, 43] are not suitable for retrieving corresponding images for these unseen categories. Therefore, we focus on food image retrieval on unseen categories, which is also consistent with general image retrieval in the current multimedia field. The generalization of the model to novel food categories is an important issue for food image retrieval. However, few works explore this issue. Improving the generalization of the model in food image retrieval means bridging the gap among categories to achieve knowledge transfer, and the challenge lies in the large inter-class differences between food domains. Specifically, there are many domains of food, such as Asian food and western food. Western food is often visually represented as a simple combination of processed ingredients, such as steak, fish and chips, and hamburgers. Due to its distinctive cooking methods, Asian food such as Chinese food is often processed further by mixing various ingredients until they are visually inseparable. This leads to a great visual difference between foods in different domains. As shown in Figure 1, the crispy sweet and sour pork slices on the left differ greatly from the beefsteak on the right. When two such categories belong to the training set and the test set, respectively, the performance may be poor.

Fig. 1.

To solve this problem, we improve the metric learning framework and propose a generalization-oriented sampling strategy, named \(\rho\) Sampling. Previous work [40] indicates that, for representation learning tasks such as image retrieval, excessive feature compression hurts generalization when the training and test distributions differ considerably. Therefore, making features occupy a larger hyperspace helps improve the generalization of models. A simple example is shown in Figure 2. Although both models successfully separate the data on the training set, the left model fails to separate all test classes due to excessive compression, leading to confusion on the test set. In contrast, the right model successfully separates the test classes by preserving the intra-class variance, which means that the model transfers its discrimination ability from the training set to the test set. Based on this, we argue that training should not blindly minimize the intra-class distance. It is necessary to maximize space utilization to achieve better generalization. The proposed sampling strategy avoids excessive compression by maximizing a few intra-class distances during training, which pushes instances of the same class apart so that they occupy a larger hyperspace and the model generalizes better.

Fig. 2.

After investigating the widely used loss functions from the perspective of optimization under the metric learning framework, we find that recent methods [17, 28] tend to force all positive instances to be close during training by minimizing the distance between the anchor and the positive one. This optimization forces positive instances with different confidence toward the anchor position and leads to excessive compression, which is the opposite of the above requirement. Therefore, we introduce an adaptive gradient assignment policy, named Gradient-Adaptive Optimization (GAO), to avoid excessive compression. The proposed GAO preserves the intra-class distance by assigning smaller gradients to edge samples, which avoids excessive compression and achieves better generalization. An edge sample is a sample with low confidence, which is far from the class center. Meanwhile, GAO can achieve more stable optimization during training because it reduces the influence of edge samples that may be potential noise data. We verify it only on two popular loss functions, contrastive loss and margin loss. A similar strategy can be used with other loss functions for performance improvement. Extensive evaluation on three food image datasets demonstrates the effectiveness of the proposed method. In order to show that addressing generalization can also improve the performance of other tasks, we conduct the evaluation on three popular general datasets to verify the performance on the general image retrieval task.

In summary, the main contributions of this work are listed below.

• To the best of our knowledge, we are the first to investigate food image retrieval on unseen categories and to improve the generalization of the model for performance gain.

• We improve the classical metric learning framework for food image retrieval and propose a generalization-oriented sampling strategy named \(\rho\) Sampling and an adaptive gradient assignment policy named Gradient-Adaptive Optimization to avoid excessive distance compression in the embedding space to improve the generalization of the model.

• We conduct a benchmark on three datasets and experimental results demonstrate that the proposed method achieves state-of-the-art performance. Experiments on three general datasets prove that the method can be further popularized in general image retrieval.

2 Related Work

Our work is closely related to three research fields: (1) food retrieval, (2) deep metric learning, and (3) generalization in image retrieval.

2.1 Food Retrieval

Recently, Min et al. [32] provided a comprehensive survey of food retrieval and other food-related works [19, 22, 31, 37, 54, 64]. Food-relevant retrieval consists of three types: visual food retrieval, recipe retrieval, and cross-modal recipe-image retrieval. For food image retrieval, Barlacchi et al. [2] introduced a search engine for restaurant retrieval based on dishes rather than using their categories. Farinella et al. [15] conducted food image retrieval by comparing images through similarity measures, in which food images were represented as vectors through the combination of different types of features, such as SIFT and Bag of Textons. For recipe retrieval, Wang et al. [53] investigated the underlying features of Chinese recipes. Based on workflow-like cooking procedures, they modelled recipes as graphs, proposed a novel similarity measurement based on frequent patterns, and devised an effective filtering algorithm to support efficient online searching. Recently, Chang et al. [6] proposed an interactive system, RecipeScape, to analyze multiple recipes for one dish. They converted the recipe instructions into a tree-structured representation for recipe similarity calculation. Xie et al. [58] further jointly utilized various features, such as cooking flow, eating, and nutrition features, to create a hybrid semantic item model for recipe search. Besides food/recipe retrieval, there are some studies [9, 29, 33, 52] on cross-modal recipe-image retrieval. Salvador et al. [41] first proposed a large-scale, structured corpus of over one million cooking recipes and food images, called Recipe1M. Micael et al. [29] extended previous works by providing a double-triplet strategy to jointly express both the retrieval loss and the classification loss for cross-modal retrieval. Chen et al. [9] further used an attention mechanism to model a recipe and align the word embedding of the recipe with its image feature.

Compared with previous works, the goal of our task is to search for the most similar image from a number of images, which is different from other food retrieval tasks, that is, recipe retrieval and cross-modal recipe-image retrieval. In addition, in contrast to existing methods, which are suitable for seen categories, our method focuses on unseen categories and improves the generalization of the model for these categories.

2.2 Deep Metric Learning

Deep metric learning methods [30, 55, 57, 62] aim to learn a non-linear transformation to project input images into an embedding space, in which images with similar semantic information have larger similarity and vice versa. Pairwise loss functions represented by contrastive loss [21, 42] were widely used for operating on image pairs. These loss functions maximized the similarity between positive instances while encouraging a margin between negative pairs. In contrast to the above works, triplet networks have been proposed to solve the ranking problem [1, 36, 60] by optimizing the relative distance between positive and negative pairs. However, Cui et al. [11] argued that many pairs or tuples have no effect in updating a model due to their poor discriminative power. They observed that these triplets wasted much time and resources, thus slowing down the training and leading to worse performance. To this end, instead of selecting random samples, a series of sampling strategies were proposed to mine hard examples [11, 35, 42]. Online and offline mining are two strategies to mine hard examples in the dataset. Online hard example mining usually mines hard pairs (triplets) in a training batch during training. Offline mining focuses on the whole dataset or a subset of the dataset after each epoch. Online mining is usually faster than offline hard example mining, whereas offline hard example mining can consider more samples and capture a more realistic data distribution.

In this article, we follow the classical deep metric learning framework and further propose a novel sampling strategy and a loss function design to improve the generalization of the model. Specifically, we propose the \(\rho\) Sampling and GAO to boost the generalization by avoiding the excessive compression within the class.

2.3 Generalization in Image Retrieval

Recent works [7, 13, 25, 47, 59, 63] widely used the metric learning method for the image retrieval task, and a series of sampling methods and loss functions have been further proposed to improve the performance of image retrieval. Tishby and Zaslavsky [48] studied the concept of compression to explain this performance gain on the basis of the learned embedding space for the metric learning method. The recent work of Verma et al. [50] introduced Singular Value Decomposition (SVD) on the data representations and obtained an increased decay of singular values, which linked the compression to the class-conditioned flattening of representation. Thus, class representations occupied a more compact volume, thereby reducing the number of directions with significant variance. For classic classification scenarios with i.i.d. training and test distributions, that is, the overlap setup, researchers strongly focused on the most discriminative directions. However, this discarded too many features that could capture data characteristics outside the training distribution, which damaged the generalization of the model. Bellet and Habrard [3] indicated that generalization in no-overlap problems such as metric learning was hindered by the shift between the training and test distributions. Based on this, Roth et al. [40] proposed to retain a considerable number of directions of significant variance (DoV) to learn a well-generalizing embedding function. Therefore, metric learning under considerable shifts between training and testing distributions was hurt by excessive feature compression.

Inspired by the work in [40], we aim to further avoid excessive compression to improve the generalization of the model. The proposed \(\rho\) Sampling boosts the generalization of the model by maximizing the intra-class distance from a proportion of positive pairs to avoid excessive distance compression in the embedding space. GAO further alleviates the intra-class distance compression during optimization by assigning different gradients to different samples.

3 Our Method

As shown in Figure 3, we improve the classical metric learning framework by introducing a novel sampling strategy and loss function to avoid excessive distance compression in the embedding space and improve the generalization. In particular, we propose \(\rho\) Sampling and GAO, which maintain the intra-class distance and achieve better generalization performance; they are detailed in Sections 3.1 and 3.2, respectively. Without loss of generality, we choose Vision Transformer (ViT) [14] as the backbone. In the test phase, all images in the test set are fed into the backbone to obtain image descriptors, from which we compute the cosine similarities between the query and the other images. Finally, we sort these similarities and obtain the final ranking list.
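The test-phase ranking described above can be illustrated with a minimal sketch. It assumes that image descriptors are already available as plain lists of floats; the function names `cosine_similarity` and `rank_gallery` and the toy 2-D descriptors are hypothetical and are not taken from the released code.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(query: Sequence[float], gallery: List[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)


# Toy 2-D descriptors standing in for backbone outputs (hypothetical values).
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]
ranking = rank_gallery(query, gallery)
print(ranking)
```

Because cosine similarity ignores vector length, the second gallery item ranks first here even though its norm differs from that of the query.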

Fig. 3.

3.1 ρ Sampling for Generalization

Food image retrieval aims to achieve better performance on unseen classes based on a model trained on seen classes. Therefore, generalization becomes crucial. As mentioned above, for representation learning tasks such as image retrieval, the generalization of the model under considerable shifts between training and testing distributions is hurt by excessive distance compression in the embedding space. Due to the obvious differences between different domains of food images, it is more challenging to study generalization in food image retrieval. In this article, we argue that in order to achieve better generalization performance, the training should not only minimize the intra-class variance and inter-class similarity but also distribute the data over a wider space. As mentioned above, a model that makes full use of the hyperspace is considered to have better generalization. Such a model is able to distribute unseen classes uniformly and widely in the hyperspace, and avoids the confusion of the test data.

To this end, we propose a novel sampling strategy, \(\rho\) Sampling, which preserves the intra-class variance by randomly maximizing some intra-class distances during training. Specifically, in order to avoid excessive compression, an intuitive idea is to relax the optimization of positive samples, because metric learning methods pull the anchor and the positive instance closer, which leads to excessive distance compression in the embedding space. To solve the problem, we randomly select a subset of all tuples \(T = \lbrace (x_{a_1},x_{p_1},x_{n_1}),(x_{a_2},x_{p_2},x_{n_2}),\ldots ,(x_{a_k},x_{p_k},x_{n_k})\rbrace\) for relaxation. Each tuple is selected with the same probability \(\rho\), so that approximately a \(\rho\) ratio of all tuples is relaxed. Here, \(x_a\) is the anchor, \(x_p\) is the positive instance, \(x_n\) is the negative instance, and \(k\) is the size of the batch. For these selected tuples, we replace positive instances \(x_p\) with anchors \(x_a\), and optimize the original positive instances as negative ones, which is defined as:

\begin{equation} t = (x_{a},x_{p},x_{n}) \rightarrow t^{\prime } = (x_{a},x_{a},x_{p}) \end{equation}

(1)
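The tuple transformation in Equation (1) can be sketched as follows. This is only an illustration under the assumption that samples are represented by identifiers and that each tuple is relaxed independently with probability \(\rho\); the name `rho_sampling` is hypothetical and does not refer to the authors' implementation.

```python
import random
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (anchor, positive, negative) sample identifiers


def rho_sampling(tuples: List[Triplet], rho: float, rng: random.Random) -> List[Triplet]:
    """Relax each tuple with probability rho: (a, p, n) -> (a, a, p)."""
    relaxed = []
    for a, p, n in tuples:
        if rng.random() < rho:
            # The anchor takes the positive slot and the original positive
            # is treated as a negative, which pushes it away from the anchor.
            relaxed.append((a, a, p))
        else:
            relaxed.append((a, p, n))
    return relaxed


batch = [("a1", "p1", "n1"), ("a2", "p2", "n2"), ("a3", "p3", "n3")]
print(rho_sampling(batch, rho=0.3, rng=random.Random(0)))
```

With `rho=0.0` the batch is returned unchanged, and with `rho=1.0` every tuple is relaxed, so the parameter interpolates between conventional training and fully inverse optimization of positives.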

This strategy pushes samples from the same class apart, enabling the model to capture extra discriminative features and resulting in better generalization. As shown in Figure 4, the proposed \(\rho\) Sampling avoids excessive distance compression in the embedding space and makes full use of the hyperspace to have better generalization. However, it may be unreasonable to select the positive instances to change uniformly at random. While keeping the proportion of selected tuples at \(\rho\), we want to select more appropriate tuples for this inverse optimization. Therefore, we analyse the distribution of pairwise distances, which follows

\begin{equation} q(d) \propto d^{n-2}\left[1-\frac{1}{4} d^{2}\right]^{\frac{n-3}{2}} , \end{equation}

(2)

where \(d\) is the distance between two embeddings and \(n\) is the dimension of the embedding. In high-dimensional space, \(q(d)\) approaches \(\mathcal {N}(\sqrt {2}, \frac{1}{2 n})\), which means that if we randomly select training samples, the distances between the positive pairs will usually be close to \(\sqrt {2}\), that is, they will be hard positive pairs. As shown in Figure 4, if hard positive samples (\(x_1\) and \(x_3\)) are used for inverse optimization, these edge samples will move even farther from the center, which leads to confusion between classes. Although this method maintains the intra-class variance, the potential confusion between categories weakens the discrimination between classes. We believe that occupying a larger hyperspace should not come at the cost of the class boundary. Based on this, we should give priority to samples that are close to the center when preserving intra-class variance, because such samples are confident enough and changing them has less influence. To this end, we define the probability of a tuple being selected as

\begin{equation} P(t^*|a,p) \propto (D_{ap})^{-1} , \end{equation}

(3)

where \(D_{ap}\) is the distance between the anchor and the positive instance \(d(\phi _a,\phi _p)\). In other words, we prefer to select easy positive pairs as altered samples.
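One possible reading of Equation (3) is sketched below. It assumes that a fixed number of tuples, `round(rho * k)`, is drawn without replacement, with weights proportional to \(1/D_{ap}\); the exact selection procedure is not specified in the text, so the function `select_tuples_to_relax`, the rounding rule, and the `eps` safeguard are all assumptions made for illustration.

```python
import random
from typing import List


def select_tuples_to_relax(
    d_ap: List[float], rho: float, rng: random.Random, eps: float = 1e-12
) -> List[int]:
    """Pick round(rho * k) tuple indices, favouring small anchor-positive distances."""
    k = len(d_ap)
    n_select = round(rho * k)
    # Weight of each tuple is proportional to 1 / D_ap (eps avoids division by zero).
    weights = [1.0 / (d + eps) for d in d_ap]
    candidates = list(range(k))
    chosen: List[int] = []
    for _ in range(n_select):
        # Weighted draw among the remaining candidates (sampling without replacement).
        pick = rng.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


distances = [0.1, 1.4, 0.3, 1.2]  # hypothetical anchor-positive distances in a batch
print(select_tuples_to_relax(distances, rho=0.5, rng=random.Random(0)))
```

Under this reading, easy positive pairs (small \(D_{ap}\)) are chosen far more often than hard ones, while the number of relaxed tuples per batch stays fixed.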

Fig. 4.
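The concentration of pairwise distances around \(\sqrt {2}\) stated for Equation (2) can be checked numerically. The sketch below assumes embeddings distributed uniformly on the unit sphere, which is the setting in which this distribution holds, and uses the 128-D embedding size from the experimental setup; the helper names are hypothetical.

```python
import math
import random
import statistics
from typing import List, Tuple


def random_unit_vector(n: int, rng: random.Random) -> List[float]:
    """Draw a point uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pairwise_distance_stats(n: int, num_pairs: int, rng: random.Random) -> Tuple[float, float]:
    """Empirical mean and standard deviation of distances between random pairs."""
    dists = []
    for _ in range(num_pairs):
        u = random_unit_vector(n, rng)
        w = random_unit_vector(n, rng)
        dists.append(math.dist(u, w))
    return statistics.mean(dists), statistics.pstdev(dists)


mean_d, std_d = pairwise_distance_stats(n=128, num_pairs=2000, rng=random.Random(0))
print("empirical mean/std:", mean_d, std_d)
print("predicted mean/std:", math.sqrt(2), math.sqrt(1 / (2 * 128)))
```

The empirical values should lie close to the predicted ones, which illustrates why randomly drawn pairs almost never have small distances in high-dimensional embedding spaces.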

Discussion. The conventional metric learning method makes the distance between samples of the same category closer and pushes samples of different categories away. Although this method can distinguish samples of different categories, it will inevitably lead to excessive distance compression of the same kind of samples in the embedding space because this method narrows the distance between all positive samples equally without considering the distribution of a category. As shown in Figure 4, the traditional optimization leads to excessive distance compression in the embedding space. In contrast, the proposed \(\rho\) Sampling maximizes the intra-class distance from a proportion of positive pairs to avoid excessive compression, thus, improving the generalization of the model.

3.2 Gradient-Adaptive Optimization (GAO)

In order to further reduce excessive intra-class distance compression in the embedding space and improve the generalization performance of the model, we propose an adaptive gradient assignment policy named Gradient-Adaptive Optimization (GAO) to cooperate with the proposed \(\rho\) Sampling. Paying less attention to hard positive instances and correcting them slowly allows the class to occupy a larger hyperspace, which benefits the generalization performance of the model. Based on this, the strategy slowly corrects instances with lower confidence by giving them lower gradients. To this end, we define the strength of a positive instance during the optimization as

\begin{equation} s(x_p) \propto (D_{ap})^{-1} . \end{equation}

(4)

The equation means that the optimization strength of the positive instance decreases with the increase of \(D_{ap}\).

Since derivative functions of distances between anchors and positive instances in original margin loss and contrastive loss are constant functions, they are convenient for us to study the impact of different gradient allocations on retrieval performance. Therefore, we apply the proposed GAO to these two loss functions to further verify the effectiveness of the strategy.

Without loss of generality, we take contrastive loss as an example. It is formulated as

\begin{equation} L_{m_1} = D_{ap} + \max (m - D_{an},0) , \end{equation}

(5)

where \(m\) is the margin. Contrastive loss forces all positive instances to be close during training by minimizing the distance between the anchor and the positive one. Meanwhile, contrastive loss will assign equal strength to different instances and synchronously correct them, shown in Figure 5(b). The gradient of \(D_{ap}\) keeps constant when \(D_{ap}\) changes, which means that positive instances with different confidence will be under the same optimization strength.

Fig. 5.

The contrastive loss combined with GAO is defined as follows:

\begin{equation} L^{\prime }_{m_1} = \log (1 + D_{ap}) + \max (m - D_{an},0) . \end{equation}

(6)

\(L^{\prime }_{m_1}\) and its derivative function with respect to \(D_{ap}\) are shown in Figure 5(b). When \(D_{ap}\) is small, the gradient of \(D_{ap}\) is large, which means strong optimization strength. In contrast, when \(D_{ap}\) is large, the gradient of \(D_{ap}\) is small, which means weak optimization strength. The improved contrastive loss will quickly correct easy samples (with high confidence) while slowly correcting edge samples. This strategy avoids excessive compression, weakens the impact of wrong corrections caused by potential noise data, and achieves more stable optimization.
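The difference between Equations (5) and (6) can be made concrete with a small sketch that evaluates both losses and their derivatives with respect to \(D_{ap}\). The derivative of \(\log (1 + D_{ap})\) is \(1/(1 + D_{ap})\), which follows directly from Equation (6); the function names and the example distances are hypothetical.

```python
import math


def contrastive_loss(d_ap: float, d_an: float, margin: float) -> float:
    """Equation (5): linear penalty on the anchor-positive distance."""
    return d_ap + max(margin - d_an, 0.0)


def contrastive_loss_gao(d_ap: float, d_an: float, margin: float) -> float:
    """Equation (6): logarithmic penalty on the anchor-positive distance."""
    return math.log1p(d_ap) + max(margin - d_an, 0.0)


def grad_d_ap_original(d_ap: float) -> float:
    """Derivative of Equation (5) w.r.t. D_ap: constant for every positive."""
    return 1.0


def grad_d_ap_gao(d_ap: float) -> float:
    """Derivative of Equation (6) w.r.t. D_ap: shrinks as D_ap grows."""
    return 1.0 / (1.0 + d_ap)


for d in (0.2, 0.8, 1.4):  # easy, medium, and edge positives (hypothetical values)
    print(d, grad_d_ap_original(d), round(grad_d_ap_gao(d), 3))
```

The original loss pulls every positive with the same strength, whereas the GAO variant pulls an edge positive at \(D_{ap} = 1.4\) with roughly half the strength applied to an easy positive at \(D_{ap} = 0.2\).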

The proposed strategy has also been used to improve margin loss because its derivative with respect to the distance between anchors and positive instances is also a constant function. The original margin loss is defined as

\begin{equation} L_{m_2} = \gamma + \lbrace D_{ap}- \beta \rbrace - \lbrace D_{an}-\beta \rbrace , \end{equation}

(7)

where \(\gamma\) is the triplet margin. The original margin loss extends the standard triplet loss by introducing a dynamic, learnable boundary \(\beta\) between positive and negative pairs, which transfers the common triplet ranking problem to a relative ordering of pairs. The improved margin loss is defined as

\begin{equation} L^{\prime }_{m_2} = \gamma + \lbrace \log (1 + D_{ap})- \beta \rbrace - \lbrace D_{an}-\beta \rbrace . \end{equation}

(8)

Similar to the improved contrastive loss, the improved margin loss redistributes the gradient of \(D_{ap}\) to quickly correct easy examples while slowly correcting edge samples, which avoids excessive compression to improve the generalization of the model.

Finally, we add a classification sub-network, leading to the joint optimization on both the metric learning and classification loss. We adopt the classification loss to improve the discrimination on different classes. Features further occupy a larger hyperspace to improve the generalization of the model. The final loss is defined as

\begin{equation} L = \alpha {L_m} + (1 - \alpha){L_c} , \end{equation}

(9)

where \(\alpha\) is an adjustable hyper-parameter and \(L_m\) and \(L_c\) represent the metric learning loss and classification loss, respectively.
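Equation (9) is a convex combination of the two losses, as the following sketch shows. The text does not specify the form of the classification loss, so softmax cross-entropy is used here purely as an assumption; `cross_entropy`, `joint_loss`, and the numeric values are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one sample (assumed form of the classification loss)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(l_m: float, l_c: float, alpha: float) -> float:
    """Equation (9): weighted sum of metric learning loss and classification loss."""
    return alpha * l_m + (1.0 - alpha) * l_c


l_m = 0.8                                 # hypothetical metric learning loss value
l_c = cross_entropy([2.0, 0.5, -1.0], 0)  # hypothetical logits for a 3-class problem
print(joint_loss(l_m, l_c, alpha=0.5))
```

Setting `alpha=1.0` recovers pure metric learning and `alpha=0.0` pure classification, so \(\alpha\) controls how much the classification sub-network contributes to training.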

4 Experiment

Section 4.1 describes the datasets used for evaluation and their division. Sections 4.2 and 4.3 give details of the adopted evaluation metrics and the experimental setup, respectively. In Section 4.4, we present the results of the comparison with the state of the art, different loss functions, and different sampling strategies. We also conduct a cross-domain evaluation experiment to further verify the generalization ability of our method, and we show the impact of hyper-parameters. Section 4.5 presents the qualitative analysis. Finally, we present the results of the comparison with the state of the art on general image retrieval in Section 4.6 to further verify the effectiveness of our method.

4.1 Dataset

We use the following three popular food datasets for evaluation.

ETH Food-101 [4] is a western food dataset, which contains 101,000 images from 101 categories. Following the test protocol [43], we use the first 70 classes as the training set and the remaining 31 classes to evaluate models.

Vireo Food-172 [8] contains 110,241 images from 172 categories, which mainly consists of Asian dishes. The first 120 classes are used for training and the remaining 52 classes for testing.

ISIA Food-500 [34] is a large miscellaneous dataset with 399,726 images from 500 classes, which consists of Asian food and western food. We keep the first 350 classes for training and the last 150 classes for testing.
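The class-disjoint division used for all three datasets can be sketched as follows. The sketch assumes integer class labels and interprets "the first classes" as the classes with the smallest label indices; `split_by_class` is a hypothetical helper.

```python
from typing import List, Tuple


def split_by_class(labels: List[int], num_train_classes: int) -> Tuple[List[int], List[int]]:
    """Return sample indices for a split whose train and test classes do not overlap."""
    classes = sorted(set(labels))
    train_classes = set(classes[:num_train_classes])
    train_idx = [i for i, y in enumerate(labels) if y in train_classes]
    test_idx = [i for i, y in enumerate(labels) if y not in train_classes]
    return train_idx, test_idx


# Toy example: 5 classes with 2 images each; the first 3 classes are used for training.
labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train_idx, test_idx = split_by_class(labels, num_train_classes=3)
print(train_idx, test_idx)
```

For ETH Food-101, for example, `num_train_classes` would be 70, leaving 31 classes that the model never sees during training.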

4.2 Evaluation Metrics

We show the pipeline of the test phase in Figure 3. For the test set of all benchmark datasets, every instance is used as the query \(I_q\) and the remaining instances are used as the gallery. We use Recall@K (R@K) and MAP@K as the main evaluation metrics and Normalized Mutual Information (NMI) to evaluate the generalization of the model. MAP@K includes MAP@1K and MAP@C; K is the largest number of samples in each class. NMI is defined as the mutual information between cluster labels and the ground truth. The dists@intra (intra-class distance) and dists@inter (inter-class distance) [40] are further used to measure the space occupied by datasets.
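As an illustration of the evaluation protocol, the sketch below computes Recall@K with every test instance used as the query and the remaining instances as the gallery. It assumes the common definition in which a query counts as a hit if at least one of its K nearest gallery images shares its label, and it ranks by cosine similarity; the function names and toy data are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recall_at_k(embeddings: List[Sequence[float]], labels: List[int], k: int) -> float:
    """Fraction of queries with at least one same-class image among the top-k results."""
    hits = 0
    for q in range(len(embeddings)):
        # Every other instance forms the gallery for this query.
        gallery = [i for i in range(len(embeddings)) if i != q]
        gallery.sort(key=lambda i: cosine_similarity(embeddings[q], embeddings[i]), reverse=True)
        if any(labels[i] == labels[q] for i in gallery[:k]):
            hits += 1
    return hits / len(embeddings)


# Toy 2-D embeddings: two images of class 0 and two images of class 1.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
labels = [0, 0, 1, 1]
for k in (1, 2, 3):
    print(k, recall_at_k(embeddings, labels, k))
```

In this toy example the last image lies closer to the class-0 images than to its own class, so it is only retrieved correctly once K reaches 3.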

4.3 Experimental Setup

We apply the platform proposed by the authors of [40] to ensure the fairness of experiments. During training, we use ViT models pretrained on ImageNet [14] and fine-tune them on benchmark datasets. Models are optimized by Adam [23]. The initial learning rate is set to \(10^{-6}\) and the weight decay is \(4\times 10^{-4}\). Batch size is set to 112. To form the mini-batch, we randomly sample k classes and \(|P|\) samples per class, where \(|P|\) is set to 2. We resize images to 256 \(\times\) 256, and randomly crop them to \(224\times 224\) as input. During training, we use random crops and random flipping (\(p = 0.5\)) for data augmentation, and a single center crop of \(224\times 224\) is used during evaluation. The embedding size of all models is 128-D. We use the same random seed 0 to avoid seed-based performance fluctuations.
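The mini-batch construction can be sketched as follows. The sketch assumes that a batch consists of exactly \(|P|\) images from each sampled class, so a batch size of 112 with \(|P| = 2\) would correspond to 56 classes; this value, the helper `sample_batch`, and the toy label list are assumptions rather than details given in the text.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_batch(
    labels: List[int], num_classes: int, samples_per_class: int, rng: random.Random
) -> List[int]:
    """Sample image indices: num_classes distinct classes, samples_per_class images each."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only classes with enough images can contribute a full set of samples.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= samples_per_class]
    batch: List[int] = []
    for c in rng.sample(eligible, num_classes):
        batch.extend(rng.sample(by_class[c], samples_per_class))
    return batch


# Toy label list: 100 classes with 5 images each.
labels = [i // 5 for i in range(500)]
batch = sample_batch(labels, num_classes=56, samples_per_class=2, rng=random.Random(0))
print(len(batch))
```

Sampling two images per class guarantees that every anchor in the batch has exactly one positive partner, which is what pair-based and tuple-based losses require.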

4.4 Result Analysis

4.4.1 Comparison with State-of-the-art.

Tables 1 and 2 show the performance of recent methods and ours on three benchmark datasets. We observe that our method achieves state-of-the-art results. In particular, our model outperforms other methods on all evaluation metrics on all datasets. Specifically, our fusion method outperforms others by 0.3–2.6% on R@K, 0.5–1.1% on MAP@K and 1.1–2.6% on NMI. Compared with the baseline methods, that is, contrastive loss and margin loss functions, the proposed method achieves gains of about 0.3–5% on Recall@1, 0.5–4.8% on MAP@1K and 1–7% on NMI. This shows that the proposed method achieves not only better retrieval performance but also better generalization performance. It is noteworthy that MAP@1K and MAP@C are the same on ETH Food-101, because each of its categories contains 1,000 samples.

Table 1.

Methods		ETH Food-101				Vireo Food-172				ISIA Food-500
Methods		R@1	R@2	R@4	NMI	R@1	R@2	R@4	NMI	R@1	R@2	R@4	NMI
ImageNet [12](Pre-train)		60.35	72.02	81.52	44.10	68.71	77.00	83.91	44.82	49.86	60.05	69.50	40.13
Pair-based	Contrastive [43]	71.43	80.96	87.79	54.20	82.20	87.97	92.07	63.52	63.25	72.81	80.58	53.57
	Triplet [1]	71.75	81.19	87.83	54.89	82.50	88.21	92.30	64.34	62.97	72.55	80.34	53.52
	N-Pair [44]	65.48	75.93	83.82	46.34	73.02	80.37	86.04	49.53	47.86	58.35	67.97	40.24
	Hist. [49]	72.78	82.01	88.38	52.83	80.78	86.92	91.29	62.04	61.00	71.06	79.26	52.51
	Margin [28]	73.06	82.13	88.75	56.44	82.85	88.30	92.23	65.35	64.62	73.90	81.37	54.97
	Lifted [18]	64.46	75.33	83.48	45.27	71.92	79.26	85.28	47.76	45.24	55.72	65.64	38.32
	Circle [46]	61.11	73.56	83.24	47.11	73.53	81.49	88.10	58.72	49.13	59.21	68.70	39.81
Proxy-based	Softmax [61]	73.14	82.58	88.87	57.45	82.01	87.77	91.95	64.88	67.16	76.40	83.53	59.17
	ProxyNCA [35]	69.99	79.65	86.61	53.49	79.85	85.89	90.63	60.56	66.36	75.76	82.96	57.93
	SoftTriple [38]	63.89	74.76	82.78	44.80	73.99	81.28	87.00	51.29	51.64	61.51	70.46	40.76
	Proxy Synthesis [16]	70.72	80.69	87.80	55.05	81.17	86.66	90.76	63.32	66.98	76.32	83.42	58.90
AP-based	Smooth-AP [5]	71.55	80.74	87.51	53.95	81.95	87.84	91.75	64.11	64.22	73.66	81.08	55.60
AP-based	PNP [26]	71.51	80.61	87.30	53.52	82.05	87.54	91.61	64.13	63.61	73.07	80.60	54.63
Ours (Margin+\(\rho\) Sampling)		73.52	82.56	88.98	57.57	82.94	88.38	92.31	65.88	64.67	74.13	81.65	57.25
Ours (Margin+\(\rho\) Sampling+GAO)		74.10	83.20	89.28	58.58	83.13	88.49	92.32	66.21	69.70	78.77	85.35	61.71

Table 1. Comparison of R@K and NMI Performance for all Loss Functions (%)

Table 2.

Methods		ETH Food-101		Vireo Food-172		ISIA Food-500
Methods		MAP@1K	MAP@C	MAP@1K	MAP@C	MAP@1K	MAP@C
ImageNet(Pre-train)		15.12	15.12	13.14	12.08	5.24	5.27
Pair-based	Contrastive [43]	22.53	22.53	31.04	28.49	12.79	12.82
	Triplet [1]	22.92	22.92	31.45	28.87	12.56	12.58
	N-Pair [44]	19.31	19.31	19.41	17.83	5.63	5.66
	Hist. [49]	23.17	23.17	30.59	27.96	12.21	12.23
	Margin [28]	23.38	23.38	31.79	29.17	12.79	12.82
	Lifted [18]	18.41	18.41	18.09	16.62	4.78	4.80
	Circle [46]	15.04	15.04	22.35	20.86	4.11	4.09
Proxy-based	Softmax [61]	23.88	23.88	30.89	28.01	15.35	15.37
	ProxyNCA [35]	21.65	21.65	30.13	27.57	15.89	15.92
	SoftTriple [38]	18.11	18.11	18.65	17.03	5.97	5.98
	Proxy Synthesis [16]	22.03	22.03	26.99	24.52	16.52	16.56
AP-based	Smooth-AP [5]	23.24	23.24	31.78	29.11	14.67	14.72
	PNP [26]	23.20	23.20	31.51	28.92	14.87	14.91
Ours (Margin+\(\rho\) Sampling)		23.50	23.50	31.66	29.09	14.52	14.53
Ours (Margin+\(\rho\) Sampling+GAO)		24.96	24.96	32.32	29.63	17.59	17.68

Table 2. Comparison of MAP@K Performance for all Loss Functions (%)
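The R@K columns in Table 1 can be read with the following minimal sketch in mind. It assumes the standard Recall@K definition used in metric learning (a query counts as a hit if at least one of its K nearest neighbours, excluding itself, shares its label); the metric definitions of Section 4.2 are not reproduced here, so the function name `recall_at_k` and the toy data are illustrative assumptions rather than the authors' evaluation code.

```python
import math


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class item among the
    k nearest neighbours (the query itself is excluded)."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Euclidean distance from the query to every other item.
        neighbours = [
            (math.dist(query, other), labels[j])
            for j, other in enumerate(embeddings)
            if j != i
        ]
        neighbours.sort(key=lambda pair: pair[0])
        if any(label == labels[i] for _, label in neighbours[:k]):
            hits += 1
    return hits / len(embeddings)


# Toy 2-D embeddings (assumed data): the last "a" item sits closer to class "b".
embeddings = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.2, 0.0), (0.65, 0.0)]
labels = ["a", "a", "b", "b", "a"]

print(recall_at_k(embeddings, labels, 1))  # 0.8
print(recall_at_k(embeddings, labels, 2))  # 1.0
```

In the toy data, the fifth item is missed at K=1 because its nearest neighbour belongs to the other class, but it is recovered at K=2, which is why R@K can only grow with K.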

4.4.2 Performance of Different Loss Functions.

We compare the performance of the pre-trained model and 12 different loss functions under the same setting. Table 1 also groups the loss functions into three types: pair-based, proxy-based, and AP-based. Among the pair-based methods, margin loss achieves the best performance on ETH Food-101, Vireo Food-172 and ISIA Food-500, outperforming the others by about 0.5–1% on Recall@1 and 0.2–1.2% on MAP@1K. We further compare three proxy-based methods and find that the Softmax method, which is normally used for classification tasks, achieves competitive performance. We also include the recently proposed AP-based method Smooth-AP [5]. In summary, there is no obvious performance gap between the different types of loss functions on food image retrieval tasks. Interestingly, some recent methods, such as Circle loss [46], work well on general image retrieval tasks but perform poorly on food datasets. The probable reason is that this local optimization method leads to poor generalization of the model, which further confirms the importance of generalization in food image retrieval.

4.4.3 Performance of Different Sampling Strategies.

We also study different sampling strategies for tuple mining, using contrastive loss and margin loss. The performance is shown in Table 3. Given the same loss function, all three sampling methods improve on random sampling. In particular, while contrastive loss yields considerably worse results than margin loss with random sampling, its performance improves significantly when it uses a sampling procedure similar to that of margin loss. With random sampling, the R@1 of the two loss functions differs widely on Vireo Food-172 (73.13% and 78.38%, a gap of 5.25 points) and ISIA Food-500 (40.56% and 58.21%, a gap of 17.65 points), whereas an advanced sampling strategy such as Softhard narrows the gap (82.07% and 82.58%, 0.51 points; 60.23% and 65.48%, 5.25 points). This evidence shows the importance of the sampling method. The proposed sampling method achieves competitive performance across different datasets and loss functions. These results show that the proposed \(\rho\) Sampling improves the generalization of the model and achieves better performance.

Table 3.

Methods		ETH Food-101					Vireo Food-172					ISIA Food-500
Methods		R@1	R@2	R@4	MAP	NMI	R@1	R@2	R@4	MAP	NMI	R@1	R@2	R@4	MAP	NMI
Contrastive	Random [20]	63.37	74.40	83.40	17.95	45.24	73.13	80.67	86.82	20.39	49.52	40.56	50.65	60.69	3.80	35.97
	Softhard [39]	68.67	78.98	86.37	21.07	50.33	82.07	87.69	91.81	31.40	63.61	60.23	70.17	78.42	11.58	51.19
	Distance [56]	71.43	80.96	87.79	22.53	54.20	82.20	87.97	92.07	31.45	63.52	63.25	72.81	80.58	12.79	53.57
	\(\rho\) Sampling	72.69	81.85	88.67	23.31	57.46	82.73	88.35	92.29	31.62	65.69	64.60	74.02	81.59	13.22	57.11
Margin	Random [20]	67.30	77.59	85.39	20.56	49.32	78.38	84.76	89.50	26.87	57.96	58.21	68.43	77.14	10.97	50.51
	Softhard [39]	73.13	82.18	88.55	23.35	55.82	82.58	88.28	92.26	31.46	64.46	65.48	74.94	82.24	14.04	56.31
	Distance [56]	73.06	82.13	88.75	23.38	56.44	82.85	88.30	92.23	31.79	65.35	64.62	73.90	81.37	12.79	54.97
	\(\rho\) Sampling	73.52	82.56	88.98	23.50	57.57	82.94	88.38	92.31	31.66	65.88	64.67	74.13	81.65	14.52	57.25

Table 3. Comparison of R@K and NMI performance for all sampling strategies (%). MAP means MAP@1K.

Figure 6 shows that the proposed \(\rho\) Sampling outperforms the baseline on all metrics. Surprisingly, dists@inter is smaller than dists@intra for the baseline method, and the same holds for \(\rho\) Sampling. This abnormal phenomenon means that the method can cause confusion between categories and weaken the discriminative capability. Fortunately, the fusion method solves the problem and achieves better performance on the three datasets.
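To make the comparison between dists@intra and dists@inter concrete, the sketch below computes two such statistics on toy data. The exact definitions belong to Section 4.2 and are not reproduced here, so this is an assumption: intra-class distance is taken as the mean pairwise distance between samples of the same class, and inter-class distance as the mean pairwise distance between class centres. The function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean


def group_by_class(embeddings, labels):
    """Map each label to the list of embeddings carrying it."""
    groups = {}
    for vec, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(vec)
    return groups


def mean_intra_distance(embeddings, labels):
    """Assumed dists@intra: mean pairwise distance within each class."""
    groups = group_by_class(embeddings, labels)
    return mean(
        math.dist(a, b)
        for vecs in groups.values()
        for a, b in combinations(vecs, 2)
    )


def mean_inter_distance(embeddings, labels):
    """Assumed dists@inter: mean pairwise distance between class centres."""
    groups = group_by_class(embeddings, labels)
    centres = [tuple(mean(dim) for dim in zip(*vecs)) for vecs in groups.values()]
    return mean(math.dist(a, b) for a, b in combinations(centres, 2))


# Toy data (assumed): two spread-out classes whose centres lie close together.
embeddings = [(0.0, 0.0), (4.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
labels = ["a", "a", "b", "b"]

intra = mean_intra_distance(embeddings, labels)
inter = mean_inter_distance(embeddings, labels)
print(intra, inter, inter < intra)
```

On this toy data the class centres are closer to each other than the samples within a class are, which is the "inter smaller than intra" situation described above.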

Fig. 6.

4.4.4 Cross-domain Evaluation.

To further verify the generalization ability of our method, we conduct a cross-domain evaluation experiment. Specifically, for ETH Food-101 and Vireo Food-172, we train the model on the training set of one dataset and evaluate it on the test set of the other. For example, we use the training set of ETH Food-101 for training and the test set of Vireo Food-172 for testing, and vice versa. As shown in Table 4, all methods lose performance compared with the basic setting. However, our method still achieves the best performance on cross-domain datasets thanks to its focus on generalization. Specifically, our method exceeds the other methods by 1–13% on R@1, 0.5–11% on MAP@1K, and 1–14% on NMI.

Table 4.

Method	ETH Food-101\(\rightarrow\)Vireo Food-172					Vireo Food-172\(\rightarrow\)ETH Food-101
Method	R@1	R@2	R@4	MAP	NMI	R@1	R@2	R@4	MAP	NMI
Contrastive [43]	75.06	82.34	87.94	17.64	51.32	70.45	80.36	87.54	20.81	52.74
Softmax [61]	75.09	82.40	87.96	19.64	53.53	70.21	79.69	86.45	19.92	53.68
Smooth-AP [5]	74.81	81.51	87.24	19.69	51.30	70.27	79.73	86.50	22.21	52.44
Circle [46]	63.77	73.00	80.78	10.35	42.15	57.45	68.53	76.35	13.83	43.76
Proxy Synthesis [16]	73.01	80.71	86.89	16.79	50.10	64.52	75.27	83.43	17.07	46.78
PNP [26]	75.87	83.21	88.20	19.90	53.86	70.33	79.68	86.32	22.41	52.10
Ours	76.98	83.86	88.85	21.41	55.49	72.36	82.01	88.34	23.19	56.23

Table 4. Comparison of R@1 and NMI Performance on Domain Transfer

ETH Food-101\(\rightarrow\)Vireo Food-172 means that we use ETH Food-101 for training and Vireo Food-172 for testing and vice versa (%). MAP means MAP@1K.

4.4.5 Impact of Hyper-parameters.

We investigate the effect of different hyper-parameter settings, that is, the probability \(\rho\) in Section 3.1 and the loss weight \(\alpha\) in Equation (9), on ETH Food-101 and Vireo Food-172.

Probability \(\rho\): The influence of \(\rho\) is shown in Figure 7(a). The best generalization is obtained when \(\rho\) equals 0.2 or 0.3, and the performance then gradually decreases as \(\rho\) continues to increase. As mentioned in Section 3.1, the selected \(\rho\) should preserve the class boundary while maximizing some intra-class distances during training, that is, there should be no confusion between the categories. The decrease can therefore be explained as follows: as \(\rho\) increases, too many wrong labels leave too little correct training information, which confuses the whole training process and leads to poor performance. We thus need an appropriate probability \(\rho\) to balance training efficiency and generalization.
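The role of \(\rho\) can be illustrated with a minimal sketch. Section 3.1 is not reproduced here, so the following is an assumption based only on the description above: each positive (same-class) pair is, with probability \(\rho\), treated as a pair to push apart instead of pull together, while negative pairs are left unchanged. The function `rho_sample_pairs` and its return format are hypothetical and are not the authors' implementation.

```python
import random
from itertools import combinations


def rho_sample_pairs(labels, rho, rng):
    """Return (i, j, pull_together) for every pair in a batch.

    Assumption: a same-class pair is flipped to "push apart" with
    probability rho; different-class pairs are always pushed apart.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        same_class = labels[i] == labels[j]
        flipped = same_class and rng.random() < rho
        pairs.append((i, j, same_class and not flipped))
    return pairs


# Toy batch (assumed): two classes with 40 samples each.
rng = random.Random(0)
batch_labels = ["a"] * 40 + ["b"] * 40
pairs = rho_sample_pairs(batch_labels, rho=0.2, rng=rng)

positives = [p for p in pairs if batch_labels[p[0]] == batch_labels[p[1]]]
pushed = sum(1 for _, _, pull in positives if not pull)
print(len(positives), pushed / len(positives))
```

Under this reading, \(\rho\) = 0 recovers ordinary pair labels, while \(\rho\) = 1 pushes every positive pair apart and leaves no correct positive signal, which matches the observation that performance drops once \(\rho\) grows too large.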

Fig. 7.

Weight \(\alpha\): The influence of \(\alpha\) is shown in Figure 7(b). We observe that \(\alpha\) = 0.1 yields the highest R@1. \(\alpha\) balances the classification loss against the metric learning loss.

4.5 Qualitative Analysis

Relationship between generalization and over-fitting. In Section 1, we point out the importance of generalization in food image retrieval and that poor generalization leads to limited performance. In experiments, we find that popular metric learning methods, such as contrastive loss and margin loss, over-fit easily, as shown in Figure 8. Although these two methods converge rapidly, the over-fitting is very obvious, especially on ISIA Food-500, which indicates limited generalization. Our method converges more slowly than the other two methods, but the slower growth leads to better performance, which shows that our method is less prone to over-fitting and that our model generalizes better.

Fig. 8.

In this section, we qualitatively analyze the performance of the proposed method. Retrieval results on three benchmark datasets are shown in Figure 9. We select one query from each dataset and visualize the Top-10 retrieval results. The retrieval performance improves considerably with our method. Notably, for the third query, the baseline model mistakenly recalls other food images because of their similar appearance. In contrast, our model successfully captures the main features of the query, excludes hard negative instances, and produces a more favorable ranking list.

Fig. 9.

To further verify the impact of generalization, we use t-SNE [27] to visualize the feature embeddings optimized by the baseline, that is, contrastive loss, and by our method on three datasets. We randomly choose 3,000 instances from the test sets as target points. As shown in Figure 10, compared with the baseline, the embeddings of our method occupy a larger region of the space and the different categories are more clearly separated, which indicates better generalization. The visualization is also consistent with the larger dists@intra and dists@inter in Figure 6.

Fig. 10.

4.6 Performance of General Image Retrieval

Current results verify the power of our method in food image retrieval. We hope to prove that solving the problem from generalization can also improve the performance of other tasks. In order to further verify the performance on the general image retrieval task, we experiment on three popular general datasets. CUB200-2011 [51] is a fine-grained bird dataset, which contains 11,788 images from 200 categories. Empirically, we use the first 100 classes (5,864 images) for training and the last 100 classes (5,924 images) for testing. CARS196 [24] has 196 car classes with 16,185 images. Training/test sets use the first/last 98 classes (8,054/8,131 images). Stanford Online Products (SOP) [45] contains 120,053 product images divided into 22,634 classes. We use the first 11,318 classes (59,551 images) for training and the last 11,316 classes (60,502 images) for testing. In SOP, unlike the other benchmarks, most classes have few instances, leading to significantly different data distribution compared with CUB200-2011 and CARS196.
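The class-disjoint split described above can be sketched as follows. The helper `split_by_class` and the `(image_id, class_id)` sample format are assumptions made for illustration; only the class and image counts come from the text.

```python
def split_by_class(samples, num_train_classes):
    """Split (image_id, class_id) samples so that the first
    num_train_classes classes are used for training and the rest for testing."""
    class_ids = sorted({class_id for _, class_id in samples})
    train_classes = set(class_ids[:num_train_classes])
    train = [s for s in samples if s[1] in train_classes]
    test = [s for s in samples if s[1] not in train_classes]
    return train, test


# Toy usage (assumed data): four classes, the first two go to training.
toy_samples = [("img0", 1), ("img1", 1), ("img2", 2), ("img3", 3), ("img4", 4)]
toy_train, toy_test = split_by_class(toy_samples, num_train_classes=2)
print(toy_train, toy_test)

# Sanity check of the counts quoted in the text:
# (train images, test images, total images, train classes, test classes, total classes)
quoted_splits = {
    "CUB200-2011": (5864, 5924, 11788, 100, 100, 200),
    "CARS196": (8054, 8131, 16185, 98, 98, 196),
    "SOP": (59551, 60502, 120053, 11318, 11316, 22634),
}
for name, (tr, te, total, ctr, cte, ctotal) in quoted_splits.items():
    assert tr + te == total and ctr + cte == ctotal, name
```

Because training and test classes never overlap, every test query belongs to a category the model has not seen, which is what makes these benchmarks a test of generalization.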

Results show that our method still outperforms the others on the general image retrieval task, by 0.5–1% on R@1, 0.5–1% on MAP@C and 0.2–1.1% on NMI. Specifically, the performance gain on CUB200-2011 and CARS196 is lower than on SOP, because the difference between categories in the first two datasets is not obvious. Nevertheless, our method still outperforms the others by 0.3% on R@1. Owing to the diversity of its products, SOP is similar to food datasets; the performance improvement of our method on SOP is therefore more obvious, which suggests that our generalization-oriented idea is better suited to image retrieval tasks with larger differences between categories.

Surprisingly, we find that proxy-based methods that perform well on food image datasets, that is, softmax and proxy synthesis, are not effective in the general image retrieval task. Compared with general image retrieval datasets, food datasets have a small number of categories and a large number of samples in each category. Proxy-based losses borrow the idea of classification loss: they construct a center for each category and pull samples of the same category toward it. When the number of samples in a category is large, these loss functions are very effective, whereas pair-based losses involve a lot of local optimization, which causes convergence difficulties and hurts performance. In contrast, general image retrieval datasets such as SOP have few samples per category, which makes the category center difficult to construct. Therefore, the performance of proxy-based losses is poor.

5 Conclusions

In this article, we propose a generalization-oriented analysis method for food image retrieval and establish a new benchmark on three food datasets. Focusing on the generalization problem caused by the large differences between categories of different domains in food images, we propose the \(\rho\) Sampling and GAO methods, which greatly improve the generalization of the model. The proposed method also achieves the best performance on the three benchmark datasets.

Future work includes the following. (1) The selection of the \(\rho\) parameter. In this article, we investigate the impact of the \(\rho\) parameter and conduct the evaluation on different food image datasets. However, the selection of \(\rho\) is affected by the data distribution, the number of samples in each category, and the number of categories. In future work, we hope to explore an adaptive method to obtain the optimal \(\rho\) to adapt to different datasets. (2) The generalization for food analysis. In this work, we design a generalization-oriented sampling strategy and a loss function design method to enhance the generalization of the model. In future work, we can also design a feature extraction method to enhance generalization. In addition, due to the in-depth research on generalization in tasks such as transfer learning, we can learn from the idea of transfer learning to design a network for food image retrieval.

Footnotes

https://www.yelp.com/.

https://meituan.com/.

References

[1] G. Albert, A. Jon, R. Jerome, and L. Diane. 2016. Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision. 241–257.
