Positional Encoding

For an input $x = (x_1, x_2, \cdots, x_n)$, self-attention produces $z = (z_1, z_2, \cdots, z_n)$, which can be written as follows.

\(z_i = \sum_{j=1}^n \frac{\exp (\alpha_{ij})}{\sum_{j'=1}^n \exp (\alpha_{ij'})} (x_j W^V)\) where \(\alpha_{ij} = \frac{1}{\sqrt{d}}(x_i W^Q)(x_j W^K)^\top\)
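As a concrete check of the formula above, here is a minimal NumPy sketch of single-head self-attention; the matrices `W_Q`, `W_K`, `W_V` and the input `x` are random stand-ins, not values from any trained model.

```python
import numpy as np

def self_attention(x, W_Q, W_K, W_V):
    """z_i = sum_j softmax_j(alpha_ij) (x_j W_V), alpha scaled by 1/sqrt(d)."""
    d = W_Q.shape[1]
    q, k, v = x @ W_Q, x @ W_K, x @ W_V            # (n, d) each
    alpha = (q @ k.T) / np.sqrt(d)                 # scaled dot-product scores
    weights = np.exp(alpha)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
z = self_attention(x, W_Q, W_K, W_V)
print(z.shape)  # (4, 8)
```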

Given position information, the model can distinguish token positions. Absolute positional encoding works by adding a position vector $p_i\in \mathbb{R}^d$ to the word embedding $w_i\in \mathbb{R}^d$.

\(x_i = w_i + p_i\). From this, the self-attention scores can be written as follows.

\[\alpha_{ij} = \frac{1}{\sqrt{d}}((w_i + p_i) W^Q)((w_j + p_j) W^K)^\top\]

The $p_i$ can be fixed in advance or learned during training. The key is to understand which properties emerge from how $p$ is designed. In particular, we would like to know whether position maps the vector into an entirely different space or whether the position information is forgotten. In the case of sinusoidal positional encoding, the signal itself weakens toward the later dimensions, so the earlier dimensions form the subspace most affected by position.

์ƒ๋Œ€์ ์ธ ๋‹จ์–ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ๋ชปํ•˜๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค. (Shaw et al. 2018)

\[\alpha_{ij} = \frac{1}{\sqrt{d}}(x_i W^Q)(x_j W^K + a_{j-i})^\top\]

where $a_{j-i} \in \mathbb{R}^d$ is a learnable parameter.
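A rough sketch of how this relative term can be computed, assuming the embeddings $a_{j-i}$ live in a lookup table `a` clipped at a maximum distance `max_rel` (the names and clipping window are illustrative, not taken from the paper's code):

```python
import numpy as np

def relative_scores(x, W_Q, W_K, a, max_rel):
    """Attention scores with a learned relative-position vector added to the keys."""
    n, d = x.shape[0], W_Q.shape[1]
    q, k = x @ W_Q, x @ W_K
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # clip j - i to [-max_rel, max_rel], shift to a valid table index
            rel = int(np.clip(j - i, -max_rel, max_rel)) + max_rel
            scores[i, j] = q[i] @ (k[j] + a[rel]) / np.sqrt(d)
    return scores

rng = np.random.default_rng(0)
n, d, max_rel = 5, 4, 2
x = rng.normal(size=(n, d))
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
a = rng.normal(size=(2 * max_rel + 1, d))  # one vector per clipped distance
s = relative_scores(x, W_Q, W_K, a, max_rel)
print(s.shape)  # (5, 5)
```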

T5 (Raffel et al. 2019) removes the relative position from the query-key product entirely and instead adds a learned scalar bias.

\[\alpha_{ij} = \frac{1}{\sqrt{d}}(x_i W^Q)(x_j W^K)^\top + b_{j-i}\]

Here $b_{j-i} \in \mathbb{R}$ is a scalar.
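The T5 variant can be sketched similarly. In the real model the bias is bucketed logarithmically and learned per head; here a plain table `b` indexed by clipped distance stands in for it, as an assumption:

```python
import numpy as np

def t5_scores(x, W_Q, W_K, b, max_rel):
    """Position-free query-key scores plus a learned scalar bias b_{j-i}."""
    n, d = x.shape[0], W_Q.shape[1]
    q, k = x @ W_Q, x @ W_K
    content = (q @ k.T) / np.sqrt(d)  # no position inside the dot product
    # matrix of clipped relative distances j - i, shifted to table indices
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                  -max_rel, max_rel) + max_rel
    return content + b[idx]  # broadcast one scalar bias onto every score

rng = np.random.default_rng(0)
n, d, max_rel = 6, 4, 3
x = rng.normal(size=(n, d))
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b = rng.normal(size=(2 * max_rel + 1,))  # one scalar per clipped distance
s = t5_scores(x, W_Q, W_K, b, max_rel)
print(s.shape)  # (6, 6)
```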


Assumptions of the Sinusoidal Encoding

\[PE(t, 2i) = \sin \left(t \cdot 10000^{-\frac{2i}{d}}\right) \\ PE(t, 2i+1) = \cos \left(t \cdot 10000^{-\frac{2i}{d}}\right)\]

๋”ฐ๋ผ์„œ, $2\pi$ ๋ถ€ํ„ฐ $2\pi \cdot 10000$ ๊นŒ์ง€ wavelength๊ฐ€ ์กด์žฌํ•˜๋„๋ก ๊ฐ€์ •. ๊ณ ์ฐจ์›์œผ๋กœ ๊ฐˆ์ˆ˜๋ก position์— ๋Œ€ํ•ด์„œ ํฌ๊ฒŒ ๋ณ€ํ•˜์ง€ ์•Š๋Š” vector ๋ฅผ ๊ฐ€์ง„๋‹ค.