This paper was accepted at the WMT conference at EMNLP.
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that, despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by…
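To make the position-wise nature of the FFN concrete, here is a minimal sketch of a standard Transformer FFN block in PyTorch. The module name and the dimensions (`d_model=512`, `d_ff=2048`, the base sizes from the original Transformer paper) are illustrative assumptions, not details taken from this excerpt.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same two-layer MLP is applied to every
    token independently, with no mixing across the sequence axis."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand to the hidden width
        self.w2 = nn.Linear(d_ff, d_model)   # project back to model width
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); each position is transformed
        # independently, unlike attention, which mixes positions.
        return self.w2(self.act(self.w1(x)))

ffn = FeedForward()
print(sum(p.numel() for p in ffn.parameters()))  # ~2.1M parameters
```

With these assumed sizes, the FFN alone holds roughly 2 × 512 × 2048 ≈ 2.1M parameters per layer, versus about 1M for the attention projections, which illustrates why redundancy in the FFN translates into large potential parameter savings.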