Ubiquitous cameras are generating huge amounts of visual data, making automatic visual content analysis and recognition essential for effective utilization of those data. Fine-Grained Visual Recognition and Re-Identification (FGVRID) aims to accurately analyze and identify visual objects, and to match re-appearing targets, e.g., persons and vehicles, across a large set of images and videos. It thus offers unprecedented possibilities for intelligent video processing and analysis.
Compared with traditional visual search and classification tasks, FGVRID has several properties that make it more challenging. First, proper object detection algorithms are needed to locate objects, their local parts, or meaningful spatial contexts in videos before proceeding to the identification step. Second, the visual appearance of an object is easily affected by many factors, such as viewpoint changes, illumination changes, occlusions, and camera parameter differences. Third, annotating fine-grained identity or category cues is expensive and time consuming. Finally, to cope with large-scale visual data, scalable indexing or feature coding algorithms are required to ensure online recognition efficiency. In recent years, FGVRID tasks such as person re-identification (re-id), vehicle re-id, multi-object multi-camera tracking, and fine-grained image classification have achieved impressive performance thanks to the development of Convolutional Neural Networks (CNN) and self-supervised learning strategies. Beyond that, novel neural network architectures such as brain-inspired networks and spiking neural networks have exhibited advantages in the detection and recognition of fast-moving objects.
A total of 17 submissions were received for this Special Issue, and each paper was assigned to three to four reviewers. After one or two rounds of revision, 13 papers were finally accepted. They cover a variety of FGVRID tasks. Specifically, four papers are about vehicle and person re-id, four papers are about fine-grained classification, and two papers are about dataset construction for visual recognition. In addition, two papers work on visual representation learning for fine-grained classification, and one paper is about crowd counting. These papers bring novel algorithms, insights, and meaningful discussions to their studied topics, and have advanced the state-of-the-art performance on several commonly used datasets.
Zhang et al. explore the complex within- and cross-modality variations in visible-infrared person re-id. They propose a comprehensive hybrid modality metric learning framework based on both class-level and modality-level similarity constraints. Xu et al. propose a new binary neural network, BiRe-ID, for efficient person re-id. Zhao et al. investigate the incompatibility between sample generation and re-id accuracy in a GAN architecture; they present JoT-GAN, a generative adversarial training framework that lets the generator and the re-id model mutually benefit from each other. Liang et al. introduce a simple yet powerful deep model (EIA-Net) for vehicle model verification, which learns a more discriminative image representation by localizing key vehicle parts and jointly incorporating two distance metrics, i.e., a vehicle-level embedding and a vehicle-part-sensitive embedding.
For fine-grained visual recognition, Yan et al. propose a Multi-feature Fusion and Decomposition (MFD) framework for age-invariant face recognition. Zhai et al. propose incorporating a rectified meta-learning module into a common CNN paradigm to train a noise-robust deep network for image-based plant disease classification. Tan et al. develop a fine-grained image classification model, namely Multi-scale Selective Hierarchical biQuadratic Pooling (MSHQP), which uses hierarchical biquadratic pooling to ensure robust feature interaction. Cucchiara et al. study the problem of fine-grained human analysis under occlusions and perspective constraints. They present solutions that use fine-grained analysis to detect people under occlusions, both in the 2D image plane and in 3D space, using single monocular cameras.
Wu and Ling et al. propose a novel Instance Correlation Graph for Unsupervised Domain Adaptation, referred to as ICGDA, which is trained end-to-end by jointly optimizing three types of losses, i.e., a supervised classification loss, a centroid alignment loss, and an ICG alignment loss. Mugnai et al. introduce a Semi-Supervised Learning (SSL) method that leverages ideas from adversarial entropy optimization and second-order pooling. Their main goal is to reduce the prohibitive annotation cost of fine-grained visual classification under the SSL setting. Luo et al. propose a novel self-supervised method, called Exploring Relations in Untrimmed Videos (ERUV), which can be applied directly to untrimmed videos to learn spatio-temporal features.
Finally, Wang et al. propose an efficient neural architecture search framework for discovering efficient crowd counting network structures. A novel search-from-pretrained strategy enables their cross-task architecture search to efficiently explore a large and flexible search space. Li et al. formulate video summarization as a hierarchical refining process. They propose a hierarchical summarization network with deep Q-learning (HQSN) to realize the refining process and explore temporal dependency. In addition, they collect a new dataset of structured game videos with fine-grained action and importance annotations.
To summarize, these papers illustrate the effectiveness of self-supervised learning, semi-supervised learning, transfer learning, reinforcement learning, and neural architecture search in FGVRID tasks. This Special Issue may benefit a broad readership of researchers, practitioners, and students who are interested in FGVRID. We would like to thank the authors for their contributions to this Special Issue. We also thank the journal, ACM Transactions on Multimedia Computing, Communications, and Applications, for its support!
Shiliang Zhang
Peking University
Guorong Li
University of Chinese Academy of Sciences
Weigang Zhang
Harbin Institute of Technology
Qingming Huang
University of Chinese Academy of Sciences
Tiejun Huang
Peking University
Mubarak Shah
University of Central Florida
Nicu Sebe
University of Trento
Guest Editors