DOI: 10.1145/3106668.3106675

Learning Food Appearance by a Supervision with Recipe Text

Published: 20 August 2017

Abstract

We aim to train a classifier that identifies food items in images captured during food preparation. Food changes appearance significantly during cooking, and different foods are often mixed together, so manually annotating individual food items throughout the preparation process is difficult. To train a classifier without manual annotation, we use multimedia recipes composed of stepwise pairs of instructional text and an image. Such pairs are informative for training; however, most images contain objects that are not referenced in the text (missing labels), and the text often references objects that do not appear in the image (inaccurate labels). To reduce these mismatches, we propose a method that first identifies missing labels by searching for label candidates in the upstream processes of the given recipe. It then removes inconsistent word-appearance pairs from the training sample based on differences in the fitting speed of a convolutional neural network. We conducted an experiment with carrot as the target food. The classifier trained with the proposed method outperformed one trained with a naive implementation, achieving an average precision of 91% compared with 83.8%.
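The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Step` structure, the function names, and the rule of keeping samples whose loss falls below a threshold by a given epoch are all assumptions made for illustration, since the abstract does not say how the recipe workflow is represented or how fitting speed is measured.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Step:
    """Hypothetical recipe step: words in its text and indices of upstream steps."""
    words: set
    upstream: list = field(default_factory=list)


def candidate_labels(steps: list, index: int) -> set:
    """Collect words from the given step and from every step upstream of it."""
    seen = set()
    stack = [index]
    labels = set()
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        labels |= steps[i].words
        stack.extend(steps[i].upstream)
    return labels


def keep_fast_fitting(loss_history: dict, epoch: int, threshold: float) -> list:
    """Keep sample ids whose loss is below `threshold` at `epoch`.

    Assumption: a consistent word-appearance pair is fitted quickly, so its
    per-sample loss is already low early in training.
    """
    return [sid for sid, losses in loss_history.items() if losses[epoch] < threshold]


# Toy recipe: steps 0 and 1 feed step 2, which feeds step 3.
steps = [
    Step({"carrot"}),
    Step({"onion"}),
    Step({"pan"}, upstream=[0, 1]),
    Step({"salt"}, upstream=[2]),
]
labels_for_last_image = candidate_labels(steps, 3)

# Toy per-sample loss curves (made-up numbers, one value per epoch).
history = {"a": [0.9, 0.2, 0.1], "b": [0.9, 0.8, 0.7], "c": [0.8, 0.3, 0.2]}
kept = keep_fast_fitting(history, epoch=1, threshold=0.5)
```

In this toy recipe the image for the last step mentions only "salt" in its text, but the upstream search also proposes "carrot", "onion", and "pan" as candidate labels. Sample "b", whose loss stays high, is the one dropped by the filter.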



Published In

CEA2017: Proceedings of the 9th Workshop on Multimedia for Cooking and Eating Activities in conjunction with The 2017 International Joint Conference on Artificial Intelligence
August 2017
64 pages
ISBN: 9781450352673
DOI: 10.1145/3106668
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • The International Joint Conferences on Artificial Intelligence, Inc. (IJCAI)

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Food recognition
  2. Instructional text
  3. Weakly supervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CEA2017

Acceptance Rates

CEA2017 Paper Acceptance Rate 7 of 12 submissions, 58%;
Overall Acceptance Rate 20 of 33 submissions, 61%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 121
  • Downloads (Last 12 months): 3
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Jan 2025
