What is the meaning of "media" in the input to PerceiverResampler? Why are time embeddings given to different media? Shouldn't time embeddings be given to different frames? · Issue #301 · mlfoundations/open_flamingo · GitHub
According to the original paper, the input to PerceiverResampler should have shape (b, T, v, d), where T is the number of video frames along the time axis and v is the number of visual tokens per frame. But I'm confused about the concept of media.
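
To make the shape question concrete, here is a minimal sketch of how I read the paper's description: one time embedding per frame, shared by that frame's v visual tokens, and then the T and v axes are flattened together. The helper names `flatten_frames` and `time_index_per_token` and the sizes are hypothetical, chosen only for illustration. This is not the open_flamingo implementation.

```python
def flatten_frames(shape):
    """Map a (b, T, v, d) feature shape to the flattened (b, T * v, d) shape."""
    b, T, v, d = shape
    return (b, T * v, d)


def time_index_per_token(T, v):
    """Time-embedding index for each flattened token.

    Assumption: one embedding per frame, shared by all v tokens of that frame.
    """
    return [t for t in range(T) for _ in range(v)]


# Hypothetical sizes: batch of 2, 3 frames, 4 visual tokens per frame, width 8.
shape = (2, 3, 4, 8)
print(flatten_frames(shape))       # (2, 12, 8)
print(time_index_per_token(3, 4))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Under this reading, the index only changes when the frame changes, which is why I expected the time embedding to be tied to frames rather than to media.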