In recent years, with the widespread availability of digital sensors (e.g., cameras) and the growing demand for urban artificial intelligence applications, the ability to learn the representations, similarities, and associations of multimedia data in dynamic environments has become critically important to many multimedia applications. The goal is to design flexible learning machines that learn environmentally robust descriptors of multimedia data and model the complex relationships among them in challenging application scenarios, benefiting diverse tasks such as visual object re-identification, cross-modal retrieval, and human pose estimation. The aim of this Special Section on “Learning Representations, Similarities, and Associations in Dynamic Multimedia Environments” is to bring academic researchers and industry developers together to share recent advances and future trends in the representation/similarity learning and association of complex multimedia data.
The Special Section attracted 25 submissions, and after a rigorous review process, six papers were accepted for publication. Specifically, two papers address person re-identification. The remaining four papers deal with cross-modal matching, human pose estimation, few-shot classification, and compatible representation learning, respectively. These papers bring novel algorithms, insights, and meaningful discussions to their respective tasks.
In the article entitled “Rank-in-Rank Loss for Person Re-identification”, Xu et al. propose a Differentiable Retrieval-Sort Loss (DRSL) to optimize the re-ID model. Since the ranking and sorting operations are non-differentiable and non-convex, the DRSL is designed to support automatic differentiation and backpropagation. The DRSL not only maintains the inter-class distance distribution but also preserves the intra-class similarity structure in terms of angle constraints.
In the article entitled “3D Skeleton and Two Streams Approach to Person Re-identification Using Optimized Region Matching”, Han et al. propose a 3D skeleton and two-stream approach for person re-ID. The first stream uses the 3D skeleton for background filtering and region segmentation, and the second stream uses a Siamese network for global descriptor extraction. The two streams are then fused through an optimized region matching strategy to improve distance learning.
In the article entitled “Guided Graph Attention Learning for Video-Text Matching”, Li et al. propose a Guided Graph Attention Learning (GGAL) model that enhances video embedding learning by capturing important region-level semantic concepts within the spatial-temporal space. The GGAL model builds connections between object regions and performs hierarchical graph reasoning on both frame-level and whole-video-level region graphs. Global context is used to guide the attention learning on this hierarchical graph topology, so that the learned video embedding can be better aligned with text captions.
In the article entitled “GLPose: Global-Local Representation Learning for Human Pose Estimation”, Jiao et al. propose a global-local enhanced pose estimation (GLPose) network to tackle the challenging multi-frame human pose estimation task. The GLPose framework consists of a feature processing module, which conditionally incorporates global semantic information and local visual context to generate a robust human representation, and a feature enhancement module, which excavates complementary information from this aggregated representation to enhance keyframe features for precise estimation.
In the article entitled “Revisiting Local Descriptor for Improved Few-Shot Classification”, He et al. propose a Dense Classification and Attentive Pooling (DCAP) method for few-shot visual object classification. Specifically, the method formulates meta-learning as a two-stage training paradigm: it introduces a dense classification pre-training stage to reduce the semantic discrepancy among local descriptors, and it devises an attentive pooling strategy in meta-finetuning to select more informative local descriptors for few-shot classification.
In the article entitled “CL2R: Compatible Lifelong Learning Representations”, Biondi et al. propose a method that partially mimics natural intelligence to address the problem of learning lifelong representations that remain compatible. The authors identify stationarity as the property that a feature representation must hold to achieve compatibility, and they propose a novel training procedure that encourages local and global stationarity of the learned representation. Thanks to stationarity, the statistical properties of the learned features do not change over time, making them interoperable with previously learned features.
In closing, the guest editors would like to thank all the authors for their significant contributions to this Special Section, as well as the reviewers for their constructive comments and their efforts in respecting the review deadlines. We are also grateful to the Editor-in-Chief, Abdulmotaleb El Saddik, and the Information Director, Mohammad Anwar Hossain, for their support. We hope this Special Section will inspire further research and development on learning representations, similarities, and associations in dynamic multimedia environments.
Xun Yang
University of Science and Technology of China, China
Liang Zheng
Australian National University, Australia
Elisa Ricci
University of Trento, Italy
Meng Wang
Hefei University of Technology, China
Guest Editors