Open
I have been reading the Torch implementation, your implementation, and the PyTorch implementation. Both your implementation and the Torch implementation use a mask, but the PyTorch implementation does not. Is the role of the mask to select only the valid positions? Without a mask, how would performance and results be affected?
I am training the PyTorch implementation on a handwritten dataset, and the decoded results contain a lot of repetition, as shown below. Is this because I did not use a mask in the attention operation?
groundtruth: the^fragile^nature
prediction: the^fragile^fragile^fragile^fragile^fragile^fragile^fragile^fragile^fragile^fragi
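To make the question concrete, here is a minimal sketch of what I understand the mask to do in the attention step. It is written in plain Python rather than taken from any of the three implementations; the function name `masked_attention_weights`, the scores, and the mask values are all hypothetical. The assumption is that the mask marks padded encoder positions with 0 so that they receive zero attention weight.

```python
import math

def masked_attention_weights(scores, mask):
    """Softmax over scores, restricted to positions where mask is 1.

    Hypothetical helper for illustration; not from any of the repos.
    """
    if not any(mask):
        raise ValueError("mask must keep at least one position")
    # Subtract the largest valid score for numerical stability.
    max_valid = max(s for s, m in zip(scores, mask) if m)
    # Masked (padded) positions contribute nothing to the softmax.
    exps = [math.exp(s - max_valid) if m else 0.0
            for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for 5 encoder positions; the last two are padding.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
mask = [1, 1, 1, 0, 0]

weights = masked_attention_weights(scores, mask)       # padding gets 0 weight
unmasked = masked_attention_weights(scores, [1] * 5)   # padding competes for weight
```

In this made-up case the unmasked version puts most of its weight on the two padded positions, because their scores happen to be the highest. In PyTorch I believe the same effect is usually obtained by calling `scores.masked_fill(mask == 0, float('-inf'))` before the softmax, but I am not sure whether the missing mask is what causes the repetition above.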