What does "media" mean in the input to PerceiverResampler? Why is the time embedding applied to different media rather than to different frames? · Issue #301 · mlfoundations/open_flamingo · GitHub

Open

According to the original paper, the input to PerceiverResampler should have shape (b, T, v, d), where T is the number of frames in the video and v is the number of visual tokens per frame. However, I'm confused about the concept of "media" here.
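
To make the per-frame reading above concrete, here is a minimal NumPy sketch of adding one time embedding per frame to a (b, T, v, d) input and then flattening the frame and token axes. It is an illustration under assumed sizes, not the open_flamingo implementation: `add_frame_time_embedding`, `time_emb`, and the dimensions are hypothetical, and the random array stands in for a learned parameter.

```python
import numpy as np


def add_frame_time_embedding(x, time_emb):
    """Add one time embedding per frame, then flatten frames and tokens.

    x:        array of shape (b, T, v, d), T frames with v visual tokens each
    time_emb: array of shape (T_max, d), with T_max >= T (hypothetical parameter)
    returns:  array of shape (b, T * v, d)
    """
    b, T, v, d = x.shape
    # Broadcast frame t's embedding over the batch and over that frame's v tokens.
    x = x + time_emb[:T][None, :, None, :]
    # Flatten the frame and token axes into a single sequence of T * v tokens.
    return x.reshape(b, T * v, d)


rng = np.random.default_rng(0)
b, T, v, d = 2, 8, 64, 32            # illustrative sizes only
x = rng.normal(size=(b, T, v, d))
time_emb = rng.normal(size=(T, d))   # stand-in for a learned embedding table
out = add_frame_time_embedding(x, time_emb)
print(out.shape)
```

In this sketch every token within a frame receives the same embedding, so the embedding only distinguishes positions along the T axis. Whether T should index frames or separate media items is exactly the point in question.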
