Old slides used in a lab reading group: a summary of object detection from its early days through single-shot detectors.
Some rendering issues may appear due to missing fonts (the slides use Tsukushi A Round Gothic and Montserrat).
[DL Reading Group] Generative Models of Visually Grounded Imagination (Deep Learning JP)
The paper proposes a model for visually grounded semantic imagination: generating images from linguistic descriptions of concepts specified by attributes. The model is a variational autoencoder with three inference networks, handling images, attributes, and cases where a modality is missing. The attribute inference distribution is represented as a product of Gaussian experts, which lets the model generate concepts never seen during training by composing individually learned attributes. The paper also introduces three criteria for evaluating such models: correctness, coverage, and compositionality.
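The product-of-experts combination mentioned above has a closed form when every expert is Gaussian: precisions (inverse variances) add, and the means combine precision-weighted. A minimal numpy sketch of that closed form (the function name and the example values are illustrative, not taken from the paper):

```python
import numpy as np

def product_of_gaussian_experts(mus, vars_):
    """Closed-form product of independent Gaussian experts.

    Precisions (1/variance) add; the combined mean is the
    precision-weighted average of the expert means.
    """
    mus = np.asarray(mus)
    precisions = 1.0 / np.asarray(vars_)
    var = 1.0 / precisions.sum(axis=0)
    mu = var * (mus * precisions).sum(axis=0)
    return mu, var

# Two modality experts (e.g. image and attribute inference networks)
# plus a standard-normal prior expert, as in PoE-style multimodal VAEs.
mu, var = product_of_gaussian_experts(
    mus=[np.array([0.0]), np.array([2.0]), np.array([0.0])],
    vars_=[np.array([1.0]), np.array([1.0]), np.array([1.0])],
)
print(mu, var)  # [0.66666667] [0.33333333]
```

With three unit-variance experts the combined precision is 3, so the variance shrinks to 1/3 and the mean is pulled toward the agreeing experts; dropping a modality simply means leaving its expert out of the product.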
ConvMixer is a simple CNN-based model that achieves competitive results on ImageNet classification. It divides the input image into patches and embeds them into high-dimensional vectors, similar to ViT. Unlike ViT, however, it uses no attention: between the patch embedding and the classification layer it applies only simple convolutional layers. Experiments show that, despite its simplicity, ConvMixer outperforms comparably sized ResNet, ViT, and MLP-Mixer variants on ImageNet, suggesting that the patch-embedding representation may matter as much as the attention mechanism for vision tasks.
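The patch embedding shared by ViT, MLP-Mixer, and ConvMixer is just a non-overlapping split of the image followed by a linear projection (equivalently, a convolution with kernel size and stride equal to the patch size). A minimal numpy sketch, with an illustrative random projection matrix (the function name is an assumption, not from the paper):

```python
import numpy as np

def patch_embed(img, patch, dim, rng):
    """Split an image into non-overlapping patches and linearly
    project each patch to a dim-dimensional embedding."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    # (h, w, c) -> (h/p, p, w/p, p, c) -> (h/p, w/p, p*p*c)
    x = img.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(h // patch, w // patch, -1)
    # Random projection stands in for the learned embedding weights.
    W = rng.standard_normal((patch * patch * c, dim)) / np.sqrt(patch * patch * c)
    return x @ W  # grid of patch embeddings, shape (h/p, w/p, dim)

rng = np.random.default_rng(0)
tokens = patch_embed(rng.standard_normal((32, 32, 3)), patch=8, dim=64, rng=rng)
print(tokens.shape)  # (4, 4, 64)
```

ConvMixer keeps this grid layout and mixes it with depthwise and pointwise convolutions, whereas ViT flattens it into a token sequence for attention; the embedding step itself is identical.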