Video2mesh: 3D human pose and shape recovery by a temporal convolutional transformer network
From a 2D video of a person in action, human mesh recovery aims to infer the 3D human pose and shape frame by frame. Despite progress on video‐based human pose and shape estimation, it is still challenging to guarantee high accuracy and smoothness ...
From a video of a person in action, human mesh recovery aims to infer the 3D human pose and shape. We propose Video2mesh, a temporal convolutional transformer (TConvTransformer) network that recovers an accurate and smooth human mesh from 2D ...
MCR: Multilayer cross‐fusion with reconstructor for multimodal abstractive summarisation
Multimodal abstractive summarisation (MAS) aims to generate a textual summary from a multimodal data collection, such as video‐text pairs. Despite the success of recent work, the existing methods lack a thorough analysis of consistency across ...
We propose a novel MCR model for the video‐containing multimodal abstractive summarisation task, aiming to thoroughly model the consistent and complementary semantics in multimodal data. We design the cross‐fusion module implemented by the cross‐modal ...
Self‐supervised non‐rigid structure from motion with improved training of Wasserstein GANs
This study proposes a self‐supervised method to reconstruct 3D limbic structures from 2D landmarks extracted from a single view. The loss of self‐consistency can be reduced by performing a random orthogonal projection of the reconstructed 3D ...
We present SS‐Graphformer, a graph convolution and Transformer‐based method for 3D structure reconstruction from 2D landmarks. In addition, geometric self‐consistency is used to achieve self‐supervision; when combined with the 2D structure discriminator, ...
TANet: Transformer‐based asymmetric network for RGB‐D salient object detection
Existing RGB‐D salient object detection methods mainly rely on a symmetric two‐stream Convolutional Neural Network (CNN)‐based network to extract RGB and depth channel features separately. However, there are two problems with the symmetric ...
In this paper, we propose a Transformer‐based asymmetric network (TANet) to address the problem that Convolutional Neural Network (CNN)‐based models are ineffective in extracting global semantic information while the symmetric two‐stream structures ...
Multi‐directional feature refinement network for real‐time semantic segmentation in urban street scenes
Efficient and accurate semantic segmentation is crucial for autonomous driving scene parsing. Capturing detailed information and semantic information efficiently through two‐branch networks has been widely utilised in real‐time semantic ...
This work proposes a network named MRFNet, based on a two‐branch strategy, to balance the accuracy and speed of segmentation in urban scenes. Experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a ...
Facial expression recognition based on regional adaptive correlation
To address the problem that the features extracted by CNN‐based facial expression recognition (FER) methods do not consider structural information, a region adaptive correlation deep network (RACN) is proposed. The network consists of two branches. In one ...
This paper proposes a regional adaptive correlation network (RACN) to explore a more effective description of the structural information of faces and enrich the expression feature representation. The network consists of two branches. The proposed second‐order ...
Semantics recalibration and detail enhancement network for real‐time semantic segmentation
Real‐time semantic segmentation is a crucial technology in automatic driving scenarios, which must meet both high-precision and real-time requirements. The authors observe that learning complex correlations between object categories is vital in the real‐...
We propose a Semantics Recalibration and Detail Enhancement Network for real‐time semantic segmentation based on BiSeNet V2. On the one hand, a lightweight Semantics Recalibration module is designed to effectively extract global semantic contextual ...
Loop and distillation: Attention weights fusion transformer for fine‐grained representation
Learning subtle discriminative feature representation plays a significant role in Fine‐Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in the traditional image classification field due to its multi‐...
We fuse attention weights grouped by head to reinforce the attention of different regions. Subsequently, we adopt three attention weight fusion blocks and channel grouping methods to accurately select discriminative regions. In addition, we utilise ...
Selective feature fusion network for salient object detection
Fully convolutional neural networks have achieved great success in salient object detection, in which the effective use of multi‐layer features plays a critical role. Based on this advantage, many saliency detectors have emerged in recent years, ...
In this paper, we propose a selective feature fusion network which consists of a selective feature fusion module (SFM) and an attention‐guided hierarchical feature emphasis module (AEM). The selective feature fusion module adaptively selects the important ...
An efficient mixed attention module
Recently, the application of attention mechanisms in convolutional neural networks (CNNs) has become a hot area in computer vision. Most existing methods focus on channel attention or spatial attention. Mixed attention modules usually achieve better ...
Whereas recent attention methods increase performance at the cost of increased complexity, we provide an efficient mixed attention module that aggregates channel information and spatial information through a learnable combinatorial formulation. In this way, the modelling ...
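The general idea of combining channel and spatial attention with a learnable weight can be illustrated with a minimal NumPy sketch. This is one plausible reading of the truncated abstract, not the authors' actual module: `alpha` stands in for the learnable combination parameter, and both attention branches are reduced to simple pooling operations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mixed_attention(feat, alpha=0.5):
    """Hypothetical mixed attention over a (C, H, W) feature map.

    alpha is a stand-in for a learnable scalar that combines the
    channel and spatial attention maps; the real module's details
    are not given in the abstract snippet.
    """
    C, H, W = feat.shape
    # Channel attention: global average pooling over spatial dims -> (C, 1, 1)
    chan = sigmoid(feat.mean(axis=(1, 2))).reshape(C, 1, 1)
    # Spatial attention: mean over channels -> (1, H, W)
    spat = sigmoid(feat.mean(axis=0)).reshape(1, H, W)
    # Learnable combination of the two maps, broadcast to (C, H, W)
    attn = alpha * chan + (1.0 - alpha) * spat
    return feat * attn
```

Because the two attention maps are combined additively before a single element-wise product, the module avoids the sequential two-stage rescaling used by heavier mixed-attention designs.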