Figure 3 and Figure 4 in the paper conflict with each other

Dear authors,

I am very impressed with your work and have carefully read your CVPR 2024 paper.
I apologize if I have misunderstood the paper, but I am confused about the order of the cross-attention operations. In Figure 3, the highest-level features (x1) are the first to be fed into cross-attention; in Figure 4, however, the highest-level features (x1) are the last to be fed into cross-attention.