This vignette describes how to implement the attention mechanism - which forms the basis of transformers - in the R language.
We begin by generating encoder representations of four different words.
# encoder representations of four different words
word_1 = matrix(c(1,0,0), nrow=1)
word_2 = matrix(c(0,1,0), nrow=1)
word_3 = matrix(c(1,1,0), nrow=1)
word_4 = matrix(c(0,0,1), nrow=1)
Next, we stack the word embeddings into a single array (in this case a matrix), which we call words.
Let’s see what this looks like.
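The stacking step can be sketched with `rbind()`, which binds the four 1 x 3 matrices row by row. The word vectors are repeated here only so that the block runs on its own; the use of `rbind()` is an assumption, since the text does not name the function used.

```r
# encoder representations of four different words (same values as above)
word_1 = matrix(c(1,0,0), nrow=1)
word_2 = matrix(c(0,1,0), nrow=1)
word_3 = matrix(c(1,1,0), nrow=1)
word_4 = matrix(c(0,0,1), nrow=1)

# stack the embeddings row-wise: one row per word, one column per feature
words = rbind(word_1, word_2, word_3, word_4)
print(words)
#>      [,1] [,2] [,3]
#> [1,]    1    0    0
#> [2,]    0    1    0
#> [3,]    1    1    0
#> [4,]    0    0    1
```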
Next, we initialize the weight matrices with random integers from {0, 1, 2}, obtained by flooring uniform draws on [0, 3).
# initializing the weight matrices (with random values)
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
Next, we generate the Queries (Q), Keys (K), and Values (V) by multiplying the stacked word embeddings by the corresponding weight matrices. The %*% operator performs the matrix multiplication. You can view the R help page using help('%*%').
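A minimal sketch of this projection step is shown below. The inputs are repeated so that the block runs on its own, and the weight matrices are drawn in the same order as above so that `set.seed(0)` reproduces them. The names `Q`, `K` and `V` follow the text; building `words` with `rbind()` is an assumption.

```r
# stacked word embeddings (one row per word)
words = rbind(matrix(c(1,0,0), nrow=1),
              matrix(c(0,1,0), nrow=1),
              matrix(c(1,1,0), nrow=1),
              matrix(c(0,0,1), nrow=1))

# weight matrices, initialized exactly as in the text
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)

# project every word onto its query, key and value vector
Q = words %*% W_Q
K = words %*% W_K
V = words %*% W_V

# each result has one row per word and one column per projected feature
print(dim(Q))
```

Because each word is a row of `words`, a single matrix product projects all four words at once.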
Following this, we score the Queries (Q) against the Key (K) vectors, which are transposed for the multiplication using t() (see help('t') for more info).
# scoring the query vectors against all key vectors
scores = Q %*% t(K)
print(scores)
#>      [,1] [,2] [,3] [,4]
#> [1,]    6    4   10    5
#> [2,]    4    6   10    6
#> [3,]   10   10   20   11
#> [4,]    3    1    4    2
We now generate the weights matrix.
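One way to obtain the weights is a row-wise softmax over scaled scores, sketched below. The helper `softmax_rows()` and the variable `scale_factor` are hypothetical names introduced for illustration. The scale factor of 2 is an assumption inferred from the weights printed in this vignette: it equals the square root of the number of words (4). The more common convention divides by the square root of the key dimension (here `sqrt(3)`), which would give different numbers.

```r
# scores as printed above (queries in rows, keys in columns)
scores = matrix(c( 6,  4, 10,  5,
                   4,  6, 10,  6,
                  10, 10, 20, 11,
                   3,  1,  4,  2), nrow=4, byrow=TRUE)

# row-wise softmax; subtracting each row's maximum avoids overflow in exp()
softmax_rows = function(x) {
  e = exp(x - apply(x, 1, max))
  e / rowSums(e)
}

# assumption: a scale factor of 2 reproduces the weights printed in this vignette
scale_factor = 2
weights = softmax_rows(scores / scale_factor)

# every row is a probability distribution over the four words
print(rowSums(weights))
```

Each row of `weights` sums to 1, so row i describes how much word i attends to each of the four words.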
Let’s have a look at the weights matrix.
print(weights)
#>            [,1]       [,2]      [,3]       [,4]
#> [1,] 0.10679806 0.03928881 0.7891368 0.06477630
#> [2,] 0.03770440 0.10249120 0.7573132 0.10249120
#> [3,] 0.00657627 0.00657627 0.9760050 0.01084244
#> [4,] 0.27600434 0.10153632 0.4550542 0.16740510
Finally, we compute the attention as a weighted sum of the value vectors (which are combined in the matrix V).
Now we can view the results using:
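A minimal end-to-end sketch of this final step follows, with all earlier steps repeated so that the block runs on its own. The name `attention`, the helper `softmax_rows()`, the use of `rbind()`, and the scale factor of 2 (chosen because it reproduces the weights printed above) are assumptions rather than code taken from the text.

```r
# stacked word embeddings (one row per word)
words = rbind(matrix(c(1,0,0), nrow=1),
              matrix(c(0,1,0), nrow=1),
              matrix(c(1,1,0), nrow=1),
              matrix(c(0,0,1), nrow=1))

# weight matrices, initialized as in the text
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)

# queries, keys and values
Q = words %*% W_Q
K = words %*% W_K
V = words %*% W_V

# score every query against every key
scores = Q %*% t(K)

# row-wise softmax (hypothetical helper); the scale factor 2 is an assumption
softmax_rows = function(x) {
  e = exp(x - apply(x, 1, max))
  e / rowSums(e)
}
weights = softmax_rows(scores / 2)

# attention: each output row is a weighted sum of the rows of V
attention = weights %*% V
print(attention)
```

Since every row of `weights` sums to 1, each row of `attention` is a convex combination of the value vectors and therefore lies within the column-wise range of `V`.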