transformer
QA
- how many queries are there in memory networks?
- One, but the transformer has multiple queries
- what's the limitation of memory networks?
- results were not ground-breaking
- what does the transformer improve?
- multiple queries
- Multi-head attention (MHA)
- Self-attention layer
- what does self-attention do?
- allows the network to dynamically change a word's representation according to its context
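A minimal NumPy sketch of a single attention head, just to make the mechanics concrete; the sizes (L=5, d=8) and the random weight matrices are made-up stand-ins. Multi-head attention runs H copies of this in parallel on d/H-dimensional projections and concatenates the results.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (L, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each position builds its own query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (L, L): how much each word attends to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ V                             # (L, d): one context-dependent vector per word

L, d = 5, 8                                        # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))                        # stand-in word vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (5, 8)
```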
- what is the computational cost of an RNN, a ConvNet, and a transformer, assuming L is the length of the sequence, d is the dimension of the hidden layer, and k is the kernel size?
- RNN: O(L*d^2)
- ConvNet: O(L*d^2*k)
- Transformer: O(L^2*d)
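Plugging in some illustrative numbers (assumed L=1000, d=512, k=3) makes the comparison concrete:

```python
# Rough operation counts (constants dropped) for assumed L=1000, d=512, k=3
L, d, k = 1000, 512, 3
rnn         = L * d**2       # one d x d state update per position, sequentially
convnet     = L * d**2 * k   # k-wide filters over d channels at each position
transformer = L**2 * d       # every position attends to every other position
print(f"RNN {rnn:.2e}  ConvNet {convnet:.2e}  Transformer {transformer:.2e}")
# The transformer's L^2 term only dominates the RNN's L*d^2 once L exceeds d
```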
- Why do we divide by √(d/H)?
- because the q, k, and v are (d/H)-dimensional vectors, so their dot products grow with d/H; dividing by √(d/H) keeps the softmax inputs near unit scale
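A small sanity check of that scaling, under assumed sizes d=512 and H=8: dot products of (d/H)-dimensional random vectors have variance of about d/H, and dividing by √(d/H) brings them back to roughly unit variance so the softmax does not saturate.

```python
import numpy as np

d, H = 512, 8                          # assumed model dim and number of heads
dk = d // H                            # each head works in d/H = 64 dims
rng = np.random.default_rng(0)
q = rng.normal(size=(10000, dk))       # many random query vectors
k = rng.normal(size=(10000, dk))       # many random key vectors
dots = (q * k).sum(axis=1)             # raw attention logits
print(dots.var())                      # ~64: grows linearly with d/H
print((dots / np.sqrt(dk)).var())      # ~1: scaled logits stay well-behaved
```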