research-article

Learning Food Appearance by a Supervision with Recipe Text

Authors: Atsushi Hashimoto, Masaaki Iiyama, Michihiko Minoh

CEA2017: Proceedings of the 9th Workshop on Multimedia for Cooking and Eating Activities in conjunction with The 2017 International Joint Conference on Artificial Intelligence

Pages 39 - 44

https://doi.org/10.1145/3106668.3106675

Published: 20 August 2017

Abstract

We aim to train a classifier that identifies food items in images captured during food preparation. Food changes appearance significantly during cooking, and different foods are often mixed together, so manually annotating individual food items throughout preparation is difficult. To train a classifier without manual annotation, we use multimedia recipes that consist of stepwise pairs of instructional text and an image. Such pairs can be informative for training; however, most images contain objects that are not referenced in the accompanying text (missing labels), and, conversely, the text refers to objects that do not appear in the image (inaccurate labels). To reduce these mismatches, we propose a method that first identifies missing labels by searching for label candidates in the upstream processes of the given recipe. It then removes inconsistent word-appearance pairs from the training samples based on differences in model fitting speed, using a convolutional neural network. We conducted an experiment with carrot as the target food. The classifier trained with the proposed method outperformed one trained with a naive implementation, achieving an average precision of 91% compared with 83.8%.
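The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a recipe is a linear list of steps (real recipes may form a workflow graph), that each step's text has already been reduced to a set of ingredient words, and that a sample whose training loss falls slowly is the "inconsistent" one. The abstract only says that differences in fitting speed are used, so that direction, the function names, and the `min_speed` threshold are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def upstream_label_candidates(step_words: Sequence[Set[str]]) -> List[Set[str]]:
    """Return, for each step, the words mentioned in that step or any earlier one.

    Assumption: steps are ordered and every earlier step counts as upstream.
    """
    seen: Set[str] = set()
    candidates: List[Set[str]] = []
    for words in step_words:
        seen |= words
        candidates.append(set(seen))  # copy so later steps do not mutate it
    return candidates


def fitting_speed(loss_history: Sequence[float]) -> float:
    """Average loss decrease per epoch; a larger value means faster fitting."""
    if len(loss_history) < 2:
        return 0.0
    return (loss_history[0] - loss_history[-1]) / (len(loss_history) - 1)


def keep_fast_fitting(
    loss_histories: Dict[str, Sequence[float]], min_speed: float
) -> List[str]:
    """Keep sample ids whose loss dropped at least `min_speed` per epoch.

    Assumption: slowly fitted word-appearance pairs are treated as inconsistent.
    """
    return sorted(
        sample_id
        for sample_id, history in loss_histories.items()
        if fitting_speed(history) >= min_speed
    )


# Toy recipe: carrot is mentioned only in step 1 but may still be visible later.
steps = [{"carrot"}, {"onion"}, set()]
candidates = upstream_label_candidates(steps)

# Hypothetical per-sample loss curves recorded while training a CNN.
histories = {
    "step1/carrot": [1.0, 0.4, 0.1],
    "step3/carrot": [1.0, 0.9, 0.85],
}
kept = keep_fast_fitting(histories, min_speed=0.2)
```

In this toy run, the third step inherits both "carrot" and "onion" as label candidates even though its text mentions neither, and only the pair whose loss dropped quickly is kept for training.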

Published In

CEA2017: Proceedings of the 9th Workshop on Multimedia for Cooking and Eating Activities in conjunction with The 2017 International Joint Conference on Artificial Intelligence

August 2017

64 pages

ISBN:9781450352673

DOI:10.1145/3106668

General Chair: Ichiro Ide (Nagoya University)
Program Chair: Yoko Yamakata (The University of Tokyo)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

The International Joint Conferences on Artificial Intelligence, Inc. (IJCAI)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Japan Society for the Promotion of Science

Conference

CEA2017

CEA2017: 9th Workshop on Multimedia for Cooking and Eating Activities in conjunction with The 2017 International Joint Conference on Artificial Intelligence

August 20, 2017

Melbourne, Australia

Acceptance Rates

CEA2017 Paper Acceptance Rate 7 of 12 submissions, 58%;

Overall Acceptance Rate 20 of 33 submissions, 61%
