{"id":951,"date":"2023-06-09T02:14:20","date_gmt":"2023-06-09T00:14:20","guid":{"rendered":"https:\/\/janbielak.com\/?p=951"},"modified":"2023-06-26T12:10:18","modified_gmt":"2023-06-26T10:10:18","slug":"some-ideas-on-transformers","status":"publish","type":"post","link":"https:\/\/janbielak.com\/index.php\/2023\/06\/09\/some-ideas-on-transformers\/","title":{"rendered":"Some ideas on Transformers"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Note<\/h2>\n\n\n\n<p>I started writing this post about two months ago. However, I had been coming up with the ideas I write about for much longer. I didn&#8217;t want to publish the article early, unpolished, or with errors. Unfortunately, the field of AI is moving so quickly that I am unable to keep this article updated as I am writing it. This is why some sections will be incomplete and errors may be present.<\/p>\n\n\n\n<p>In this article, I present a variety of ideas and hypotheses about Transformers. Usually, I would go ahead and verify them, but I don&#8217;t have time. I have a dozen ideas, but no resources to verify them. 
So, this article is meant to serve as an inspiration to others.<\/p>\n\n\n\n<p><a href=\"?noamp=available\" rel=\"noamphtml\">Click here to load math.<\/a><\/p>\n\n\n\n\n  <script src=\"https:\/\/cdn.jsdelivr.net\/npm\/mathjax@3\/es5\/tex-chtml-full.js\" type=\"text\/javascript\"><\/script>\n  <!--[if lt IE 9]>\n    <script src=\"\/\/cdnjs.cloudflare.com\/ajax\/libs\/html5shiv\/3.7.3\/html5shiv-printshiv.min.js\"><\/script>\n  <![endif]-->\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-neuro.png?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-neuro.png?w=680&#038;ssl=1\" alt=\"neuro\" data-recalc-dims=\"1\"><\/a><\/p>\n<h2 id=\"what-is-self-attention-really\">What is self-attention,\nreally?<\/h2>\n<p>Since ChatGPT was introduced, I have been interested in the <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Transformer architecture<\/a>. It\nis the machine learning model that underlies practically all current\nbest LLMs, like <a href=\"https:\/\/openai.com\/blog\/chatgpt\">ChatGPT<\/a>,\n<a href=\"https:\/\/openai.com\/research\/gpt-4\">GPT-4<\/a>, <a href=\"https:\/\/ai.facebook.com\/blog\/large-language-model-llama-meta-ai\/\">LLaMA<\/a>,\nand <a href=\"https:\/\/ai.google\/discover\/palm2\">PaLM 2<\/a>. As the\nTransformer paper\u2019s title suggests, \u201cAttention\u201d is the secret\ningredient that makes all these models so impressive (or, if I may\nquote, <a href=\"http:\/\/karpathy.github.io\/2015\/05\/21\/rnn-effectiveness\/\">\u201cunreasonably\neffective\u201d<\/a>).<\/p>\n<p>So what is this \u201cAttention\u201d, really? This is both outlined in the\npaper and explained in many places on the internet. Usually, it is\nexplained by comparing it to a relational database, like <a href=\"https:\/\/www.mysql.com\/\">MySQL<\/a> or <a href=\"https:\/\/www.microsoft.com\/en-us\/microsoft-365\/access\">MS\nAccess<\/a>. 
I personally enjoyed this <a href=\"https:\/\/jalammar.github.io\/illustrated-transformer\/\">animated\nexplanation<\/a>.<\/p>\n<p>This analogy gives the intuitive understanding that, for a given\ntoken, the self-attention layer extracts relevant information from the\ncontext window and imbues the token with this context-specific\ninformation. The token\u2019s query vector represents what the current token\nis \u201clooking for\u201d, the keys of the context\u2019s tokens represent what kind\nof thing they \u201coffer\u201d, while their values are the \u201coffering\u201d itself.<\/p>\n<p>It has been noticed that the self-attention mechanism is very\npowerful and it is currently being used in various models, including\nones that are not transformers or even language models. Some examples\ninclude <a href=\"https:\/\/ai.facebook.com\/blog\/multilingual-model-speech-recognition\/\">MMS<\/a>,\n<a href=\"https:\/\/segment-anything.com\">SAM<\/a>, <a href=\"https:\/\/research.nvidia.com\/labs\/toronto-ai\/VideoLDM\/\">Video\nLDM<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2103.00020\">CLIP<\/a>, and\ngenerally many diffusion models inspired by <a href=\"https:\/\/arxiv.org\/abs\/2112.10752\">Latent Diffusion<\/a>.<\/p>\n<p>Still, the question of why the introduction of self-attention to a\nmodel significantly increases its capabilities remains largely\nunanswered. Let\u2019s try to tackle this problem.<\/p>\n<h3 id=\"a.-attention-is-a-convex-hull\">A. Attention is\u2026 a convex\nhull<\/h3>\n<p>Let\u2019s revisit the self-attention equation: <span class=\"math display\">\\[\nZ=\\mathrm{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V\n\\]<\/span> This form illustrates the bidirectional attention used in the\nencoder. Now, let\u2019s consider a version of this that calculates the\nself-attention for a single token only. 
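<\/p>\n<p>Before we specialize to a single token, the bidirectional equation above can be sketched directly in a few lines of NumPy (the shapes and names below are my own choice, not from the paper):<\/p>

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Z = softmax(Q K^T / sqrt(d_k)) V, with Q and K of shape (n, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) relevance scores
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))
Z = self_attention(Q, K, V)  # one context-imbued vector per token
```

<p>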
Recall that this happens in the\ndecoder &#8211; because of masking, new tokens do not affect the previous\nones, so self-attention is calculated for only one token at a time, as\nthey are generated. Let\u2019s say that the new token is <span class=\"math inline\">\\(\\vec{x}\\)<\/span> and its query vector is <span class=\"math inline\">\\(\\vec{q}:=\\vec{x}W^Q\\)<\/span>. For now, I\u2019ll look\nat only a single attention head. In this case the token\u2019s new value\n<span class=\"math inline\">\\(\\vec{z}\\)<\/span> will be: <span class=\"math display\">\\[\n\\vec{z}=\\sum_{i}s_i{\\left(\\vec{q}\\cdot\\vec{k_i}\\right)\\vec{v_i}}\n\\]<\/span> Here, <span class=\"math inline\">\\(\\vec{k_i}\\)<\/span> are the\nkey vectors of previous tokens while <span class=\"math inline\">\\(\\vec{v_i}\\)<\/span> are their value vectors. <span class=\"math inline\">\\(s_i\\)<\/span> is a scalar that represents the\neffect of applying <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span>.<\/p>\n<p>Because we are considering only one head, we can rewrite the keys and\nvalues as products of the original tokens\u2019 values and the trained\nconversion matrices: <span class=\"math display\">\\[\n\\vec{z}=\\sum_{i}s_i{\\left(\\vec{q}\\cdot\\vec{t_i}W^K\\right)\\vec{t_i}W^V}\n\\]<\/span> Here, <span class=\"math inline\">\\(\\vec{t_i}\\)<\/span> are the\nprevious tokens\u2019 vectors, as they came into the self-attention layer,\nand the <span class=\"math inline\">\\(W^K\\)<\/span> and <span class=\"math inline\">\\(W^V\\)<\/span> matrices are the self-attention\nconversion matrices. Now, let\u2019s look at the attention relevance score\ncalculation <span class=\"math inline\">\\(\\vec{q}\\cdot\\vec{t_i}W^K\\)<\/span>. 
Let\u2019s expand\nout the vector-matrix multiplication: <span class=\"math display\">\\[\n\\vec{t_i}W^K=\\sum_jt_{i,j}\\vec{W_j^K}\n\\]<\/span> Here, we decompose <span class=\"math inline\">\\(W^K\\)<\/span>\ninto the list of its row vectors <span class=\"math inline\">\\(\\vec{W_j^K}\\)<\/span>. Below is the summation that\nhappens in the <span class=\"math inline\">\\(\\vec{t_i}W^K\\)<\/span>\nmultiplication, visualized:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-mult.png?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-mult.png?w=680&#038;ssl=1\" alt=\"vector matrix multiplication diagram\" data-recalc-dims=\"1\"><\/a><\/p>\n<p>Now, let\u2019s take a look at the dot product <span class=\"math inline\">\\(\\vec{q}\\cdot\\vec{t_i}W^K\\)<\/span> again: <span class=\"math display\">\\[\n\\vec{q}\\cdot\\vec{t_i}W^K=\\vec{q}\\cdot\\sum_jt_{i,j}\\vec{W_j^K}\n\\]<\/span> We take the dot product between the vector <span class=\"math inline\">\\(\\vec{q}\\)<\/span> and the sum of vectors <span class=\"math inline\">\\(t_{i,j}\\vec{W_j^K}\\)<\/span>. The dot product is\ndistributive (i.e., <span class=\"math inline\">\\(\\vec{a}\\cdot(\\vec{b}+\\vec{c})=\\vec{a}\\cdot\n\\vec{b}+\\vec{a}\\cdot \\vec{c}\\)<\/span>), so we may rewrite this as a sum\nof dot products: <span class=\"math display\">\\[\n\\vec{q}\\cdot\\vec{t_i}W^K=\\sum_j\\vec{q}\\cdot\n\\left(t_{i,j}\\vec{W_j^K}\\right)\n\\]<\/span> Ok, so our result is this sum. It is a sum of products of\n<span class=\"math inline\">\\(t_{i,j}\\)<\/span> and <span class=\"math inline\">\\(\\vec{W_j^K}\\cdot\\vec{q}\\)<\/span>: <span class=\"math display\">\\[\n\\vec{q}\\cdot\\vec{t_i}W^K=\\sum_j\\left(\\vec{W_j^K}\\cdot\\vec{q}\\right)t_{i,j}\n\\]<\/span> This looks like another dot product. 
It is a sum of\nelement-wise products of <span class=\"math inline\">\\(\\vec{t_i}\\)<\/span>\n(a row vector) and <span class=\"math inline\">\\(W^K\\vec{q}^T\\)<\/span>\n(a column vector):<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-prod2.png?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-prod2.png?w=680&#038;ssl=1\" alt=\"prod2\" data-recalc-dims=\"1\"><\/a><\/p>\n<p>So, finally, we obtain: <span class=\"math display\">\\[\n\\vec{q}\\cdot\\vec{t_i}W^K=\\vec{t_i}\\cdot W^K\\vec{q}^T\n\\]<\/span> To get rid of the column vector, we can transpose it <span class=\"math inline\">\\(\\left(W^K\\vec{q}^T\\right)^T=\\vec{q}\\left({W^K}\\right)^T\\)<\/span>,\nas the dot product doesn\u2019t change: <span class=\"math display\">\\[\n\\vec{q}\\cdot\\vec{t_i}W^K=\\vec{t_i}\\cdot \\vec{q}\\left({W^K}\\right)^T\n\\]<\/span> As there is only one query vector (the one for the new token),\nand the trained matrix doesn\u2019t change, we can precompute this\nmatrix-vector product and, substituting <span class=\"math inline\">\\(\\vec{v}:=\\vec{q}\\left({W^K}\\right)^T\\)<\/span>, get\n<span class=\"math display\">\\[\n\\vec{q}\\cdot\\vec{t_i}W^K=\\vec{t_i}\\cdot\\vec{v}\n\\]<\/span> Formally, we just proved that a matrix can be moved to the\nother side of a dot product by transposing it. 
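<\/p>\n<p>This identity is easy to sanity-check numerically with random data (the shapes below are my own choice):<\/p>

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_k = 6, 4
q = rng.normal(size=d_k)         # query vector of length d_k
t = rng.normal(size=d)           # a previous token vector of length d
WK = rng.normal(size=(d, d_k))   # key projection matrix W^K

lhs = np.dot(q, t @ WK)          # q . (t W^K)
rhs = np.dot(t, q @ WK.T)        # t . (q (W^K)^T)
assert np.isclose(lhs, rhs)      # the matrix moved across the dot product
```

<p>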
I couldn\u2019t easily find a proof of it online,\nwhich is why I included it here.<\/p>\n<blockquote>\n<p><strong>Missing nonlinearity in query calculation<\/strong><\/p>\n<p>We can see that <span class=\"math inline\">\\(\\vec{v}:=\\vec{q}\\left({W^K}\\right)^T=\\vec{x}W^Q\\left({W^K}\\right)^T\\)<\/span>.\nThis is like having our token vector <span class=\"math inline\">\\(\\vec{x}\\)<\/span> pass through two linear layers\nwith no nonlinearity in between.<\/p>\n<p>The intent of multi-headed attention and the introduction of the\n<span class=\"math inline\">\\(W^Q\\)<\/span> and <span class=\"math inline\">\\(W^K\\)<\/span> matrices was to reduce the\ndimensionality of the vectors (from <span class=\"math inline\">\\(d\\)<\/span> to <span class=\"math inline\">\\(d_k\\)<\/span>). However, this shows that\neven if the current multi-headed self-attention design is to be\nmaintained, there should probably be a nonlinearity present between the\ntwo matrices. Otherwise, we are wasting time and memory on training two\nlinear layers that have no nonlinearity between them.<\/p>\n<p>To solve this issue, the query vectors <span class=\"math inline\">\\(\\vec{q_i}\\)<\/span> should pass through a nonlinear\ntransform, such as ReLU or the more recent SwiGLU. In math, we should do\n<span class=\"math inline\">\\(\\left(\\vec{q_i},\n\\vec{k_i}\\right)=\\left(f\\left(\\vec{x}W^Q\\right),\n\\vec{x}W^K\\right)\\)<\/span> instead of the current <span class=\"math inline\">\\(\\left(\\vec{q_i},\n\\vec{k_i}\\right)=\\left(\\vec{x}W^Q, \\vec{x}W^K\\right)\\)<\/span>. This\nshould allow the queries and keys to capture more complex relationships\nmore easily, just as happens in an FNN when a nonlinearity is\nintroduced between two linear layers. 
As far as I know, this idea\nhasn\u2019t been explored before.<\/p>\n<\/blockquote>\n<p>Now, let\u2019s look at the simplified self-attention equation: <span class=\"math display\">\\[\n\\vec{z}=\\sum_{i}s_i{\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}W^V}\n\\]<\/span> What is this? Essentially, this is a sum of vector-matrix\nproducts, <span class=\"math inline\">\\(s_i\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}\\)<\/span>\nbeing the vectors and <span class=\"math inline\">\\(W^V\\)<\/span> being a\nmatrix. Just like the dot product, matrix multiplication is distributive\n(i.e., <span class=\"math inline\">\\(\\left(\\vec{v}+\\vec{u}\\right)M=\\vec{v}M+\\vec{u}M\\)<\/span>).\nThis means that we can \u201cfactor out\u201d the multiplication with <span class=\"math inline\">\\(W^V\\)<\/span>: <span class=\"math display\">\\[\n\\vec{z}=\\left(\\sum_{i}s_i{\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}}\\right)W^V\n\\]<\/span> Now, let\u2019s think about what this equation means. Because we\ncan multiply by the value matrix after doing the sum, the bulk of our\ncomputation is the sum, so let\u2019s focus our <em>attention<\/em> on that:\n<span class=\"math inline\">\\(\\sum_{i}{s_i\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}}\\)<\/span>.<\/p>\n<blockquote>\n<p><strong>Memory savings<\/strong><\/p>\n<p>We are about to go deeper and think about the nature of\nself-attention. But let\u2019s think about what we already have. We showed\nthat the bulk of self-attention is completely independent of the QKV\nmatrices. So, in practice, this means memory savings during inference.\nWe do not need to store the key and value vectors of previous tokens,\nbut only their token values.<\/p>\n<p>Suppose we have <span class=\"math inline\">\\(h\\)<\/span> attention\nheads, each storing a key and a value for each of <span class=\"math inline\">\\(n\\)<\/span> previous tokens. 
If each token vector has\nlength <span class=\"math inline\">\\(d\\)<\/span>, while the queries, keys, and values have length\n<span class=\"math inline\">\\(d_k\\)<\/span>, we have changed the number of\nstored <code>float<\/code>s from <span class=\"math inline\">\\(2hnd_k\\)<\/span> to <span class=\"math inline\">\\(nd\\)<\/span>. Importantly, memory usage is now\nindependent of the headedness of the model. As far as I know, this idea\nhasn\u2019t been explored before.<\/p>\n<\/blockquote>\n<p>Now, to get the result, we accumulate a sum over the set of <span class=\"math inline\">\\(\\vec{t_i}\\)<\/span> vectors. Let\u2019s think about\nthis. For now, I will temporarily omit the <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span> step and assume that\n<span class=\"math inline\">\\(s_i=1\\)<\/span>. We\u2019ll reintroduce it later.\nSo, let\u2019s have a look at <span class=\"math inline\">\\(\\sum_{i}{\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}}\\)<\/span>.<\/p>\n<p>How does the value of this sum change, depending on the input <span class=\"math inline\">\\(\\vec{v}\\)<\/span>? It can easily be shown that\nchanging the magnitude of <span class=\"math inline\">\\(\\vec{v}\\)<\/span>\nonly changes the magnitude of the result &#8211; they are proportional. So,\nit\u2019s the direction of <span class=\"math inline\">\\(\\vec{v}\\)<\/span> that\nbears significance. To try to understand what this sum really is, let\u2019s\nvisualize it.<\/p>\n<p>Let\u2019s start with something simple &#8211; a 2D case with, say, 3\ntokens:<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/kis8rvkxlo?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>Ok, so we have 4 movable points in total &#8211; 3 \u201ctokens\u201d representing\nthe previous tokens in the context and a \u201cquery\u201d that represents <span class=\"math inline\">\\(\\vec{v}\\)<\/span>.<\/p>\n<p>The relations between the point locations seem rather complex. 
Moving\nthe tokens parallel to the axes moves the result on a parabola, but\nthat doesn\u2019t seem useful. Let\u2019s rotate the query and see how the result\nchanges:<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/6q9ezkvlub?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>Looks a bit like an ellipse. Let\u2019s trace the path of the result and\nsee what we get:<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/uasb9picft?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>Looks like an ellipse indeed. And it seems that three tokens are too\nmany, as two are enough to get any ellipse. But maybe this is just a\ncoincidence and we\u2019ll be able to get more complex shapes with more\npoints.<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/smuzsnzy0k?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>Nope, still looks like an ellipse.<\/p>\n<p>Given this observation, we may hypothesize that two points are enough\nto saturate all degrees of freedom in our 2D case. Extending this to\nhigher dimensions, we can postulate that <span class=\"math inline\">\\(d\\)<\/span> vectors are enough to saturate the\nself-attention layer.<\/p>\n<p>The query vector moves along a circle in the animation, while the\nresult sits on an ellipse. This means that the transformation that\nhappens is linear. In this context these <span class=\"math inline\">\\(d\\)<\/span> vectors make sense, as they are used to\nget a square linear transformation matrix.<\/p>\n<p>After thinking for a while, we may find that this scheme resembles a\nsimple neural network. I mean, we know that neural networks need to have\nnonlinear layers interwoven with the linear ones, or otherwise the\nnetwork will just \u201ccollapse\u201d to one layer. Similarly, here, adding extra\ntokens, beyond the initial <span class=\"math inline\">\\(d\\)<\/span>, gives\nno further control over the output shape. 
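<\/p>\n<p>With the <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span> omitted, there is a simple explanation for the ellipses: the whole sum collapses into a single symmetric linear map, so a circle of queries must trace an ellipse, and tokens beyond the first <span class=\"math inline\">\\(d\\)<\/span> cannot add degrees of freedom. A quick numerical check (my own sketch):<\/p>

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 5
T = rng.normal(size=(n, d))  # n token vectors t_i as rows
v = rng.normal(size=d)       # the query direction

# the sum as written: sum_i (t_i . v) t_i ...
explicit = sum((ti @ v) * ti for ti in T)

# ... equals a single linear map v -> v M, with symmetric M = T^T T
M = T.T @ T
assert np.allclose(explicit, v @ M)

# M is d x d, so adding tokens beyond d cannot raise its rank:
assert np.linalg.matrix_rank(M) <= d
```

<p>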
Remember that we temporarily\nignored the <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span>\nfunction. If the analogy is correct, this could be interpreted as\nomitting the nonlinearity between layers. So, with this intuition, let\u2019s\ntry to interpret our sum <span class=\"math inline\">\\(\\sum_{i}{s_i\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}}\\)<\/span>\nas a simple neural network.<\/p>\n<p>It turns out that it\u2019s actually quite simple. We know that the input\nis a <span class=\"math inline\">\\(d\\)<\/span>-dimensional vector <span class=\"math inline\">\\(\\vec{v}\\)<\/span> and that the output is some other\n<span class=\"math inline\">\\(d\\)<\/span>-dimensional vector. This means\nthat the input and output layers need to have <span class=\"math inline\">\\(d\\)<\/span> neurons each. We also concluded that\nthere should be a nonlinear <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span>, so we\u2019ll also need a\nhidden layer. This will get us a simple 2-layer FNN.<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-diag3.png?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-diag3.png?w=680&#038;ssl=1\" alt=\"the created 2-layer neural network\" data-recalc-dims=\"1\"><\/a><\/p>\n<p>But what are the weights and biases? And what is the size of the\nhidden layer? Well, intuitively, that size would be the context length.\nAfter all, this network shows the self-attention mechanism. We know that\nthe context changes over time, so it seems logical to assume that the\nhidden layer\u2019s size depends on it. What about parameters? Well, they can\nbe directly read from our equation.<\/p>\n<p><span class=\"math inline\">\\(\\vec{v}\\)<\/span> is the input. <span class=\"math inline\">\\(\\sum_{i}{s_i\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}}\\)<\/span>\nis the output. How is the output computed? 
It is a sum of <span class=\"math inline\">\\(\\vec{t_i}\\)<\/span> vectors scaled by coefficients.\nThe coefficients are <span class=\"math inline\">\\(s_i\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\)<\/span>.\nSo, in our neural network, <span class=\"math inline\">\\(s_i\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\)<\/span> are\nthe outputs of the <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span>\nlayer, and <span class=\"math inline\">\\(\\vec{t_i}\\cdot\\vec{v}\\)<\/span>\nare the inputs to the <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span> layer. Now we can easily\nsee that the weights of the first linear layer are simply <span class=\"math inline\">\\(\\vec{t_i}\\)<\/span> vectors arranged in a matrix.\nRegarding the second linear layer, the i-th component of the output is\nthe weighted sum of the context tokens\u2019 i-th components. So, the weights\nof the second linear layer are also <span class=\"math inline\">\\(\\vec{t_i}\\)<\/span> vectors arranged in a matrix,\nbut, this time, transposed.<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-diag4.png?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-diag4.png?w=680&#038;ssl=1\" alt=\"self attention layer in matrix multiplication form\" data-recalc-dims=\"1\"><\/a><\/p>\n<p>Mathematically, we can express this as: <span class=\"math display\">\\[\n\\sum_{i}{s_i\\left(\\vec{t_i}\\cdot\\vec{v}\\right)\\vec{t_i}}=\\mathrm{softmax}(\\vec{v}T)T^T\n\\]<\/span> Here, <span class=\"math inline\">\\(T\\)<\/span> is the matrix\ncreated by stacking the current context\u2019s token vectors. 
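<\/p>\n<p>This equality is easy to verify with random data; in the sketch below (my own) I stack the context tokens as rows, so the matrix playing the role of <span class=\"math inline\">\\(T\\)<\/span> is their transpose:<\/p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d, n = 4, 7
tokens = rng.normal(size=(n, d))  # context tokens t_i as rows
v = rng.normal(size=d)

# left-hand side: sum_i softmax_i(t_i . v) t_i, written out
weights = softmax(tokens @ v)
lhs = sum(w * ti for w, ti in zip(weights, tokens))

# right-hand side: softmax(v T) T^T, with tokens as columns of T
T = tokens.T
rhs = softmax(v @ T) @ T.T
assert np.allclose(lhs, rhs)
```

<p>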
Let\u2019s see\nhow this simplifies the self-attention equation: <span class=\"math display\">\\[\n\\vec{z}=\\mathrm{softmax}\\left(\\vec{v}T\\right)T^TW^V\n\\]<\/span> And, substituting <span class=\"math inline\">\\(\\vec{v}:=\\vec{q}\\left({W^K}\\right)^T=\\vec{x}W^Q\\left({W^K}\\right)^T\\)<\/span>,\nwe get: <span class=\"math display\">\\[\n\\vec{z}=\\mathrm{softmax}\\left(\\vec{x}W^Q\\left({W^K}\\right)^TT\\right)T^TW^V\n\\]<\/span> We can also visualize this equation, which captures the\nentirety of the self-attention layer, as a simple FNN:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-diagram5.png?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-diagram5.png?w=680&#038;ssl=1\" alt=\"diagram5\" data-recalc-dims=\"1\"><\/a><\/p>\n<p>As noted earlier, there are nonlinear layers missing, which should be\nall the more visible given this diagram. Also, we no longer need to store\nthe context\u2019s keys and values. This decreases the memory usage, as long\nas <span class=\"math inline\">\\(2hnd_k&gt;nd\\)<\/span>, where <span class=\"math inline\">\\(h\\)<\/span> is the number of self-attention heads,\nand <span class=\"math inline\">\\(n\\)<\/span> is the current context\nlength. As implementations are often I\/O bound, this might actually\nimprove performance, by saving on the memory bandwidth, despite\nperforming more computations.<\/p>\n<blockquote>\n<p><strong>Dynamic heads<\/strong><\/p>\n<p>What is cool about this new equation is that we don\u2019t really care\nabout the key and value matrices. We can make them arbitrary. This\neffectively allows us to perform self-attention for a token with a\ndifferent attention head than the heads used with prior tokens.\nPotentially, the attention head could even be dynamic. 
We could, for\nexample, calculate <span class=\"math inline\">\\(W^K\\)<\/span> and <span class=\"math inline\">\\(W^V\\)<\/span>, using yet another neural network,\nbased on <span class=\"math inline\">\\(\\vec{q}\\)<\/span>. So, instead of\ntraining concrete <span class=\"math inline\">\\(W^K\\)<\/span> and <span class=\"math inline\">\\(W^V\\)<\/span> matrices, we would train an NN that\ncreates them from <span class=\"math inline\">\\(\\vec{q}\\)<\/span>. As far\nas I know, this idea hasn\u2019t been explored before.<\/p>\n<\/blockquote>\n<p>Now that we reintroduced the <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span> nonlinearity, let\u2019s go\nback to our diagram and see what we get.<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/bkvdr38y3n?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>Ok, so we got some colorful blobs. As a reminder, this diagram\ndirectly illustrates how the self-attention mechanism works. Given a set\nof tokens in the context (purple points), it illustrates what the layer\ndoes to every possible input. In the illustration, each circle gets\nblobified into a curve of the same color.<\/p>\n<p>The resultant curves look a bit like splines or B\u00e9zier curves. Also,\nit seems that no matter how hard I try to position the tokens, the\nresultant curve is always closed, smooth, and not self-intersecting.<\/p>\n<p>Here, circles end up transformed into curves. Extrapolating to higher\ndimensions, hyperspheres would end up transformed into differentiable\nmanifolds. 
I note this explicitly, as, if the input vectors were\nnormalized, they\u2019d lie on a unit hypersphere.<\/p>\n<p>I recommend opening <a href=\"https:\/\/www.desmos.com\/calculator\/rjht9zmgk8\">this diagram<\/a> in\na larger window, as you can zoom in really closely and see that the curve\nactually has fine details that depend on the token point placement.\nThere is also <a href=\"https:\/\/www.desmos.com\/calculator\/dv4sp8tgst\">this version<\/a>\nthat also displays the \u201cdensity\u201d of points on the curves. It explains\nwhy the result vector doesn\u2019t traverse the curve with uniform speed.<\/p>\n<p>To me, this looks like a flexible piece of cloth stretched between\nstrings with beads, where token vectors are \u201cwells\u201d that attract the\nbeads to some position. Here are my takeaways:<\/p>\n<ol type=\"1\">\n<li>The tokens that lie on the convex hull of the token point set are\nthe most significant, as they define the overall shape of the\ncurve.<\/li>\n<li>Tokens located inside the hull change the general shape\nslightly, but affect the contour lines inside the hull.<\/li>\n<li>If the tokens are far from the origin, they attract the beads very\nstrongly and the shape is very similar to the convex hull of the token\nset. Most points end up mapped near the tokens themselves.<\/li>\n<\/ol>\n<p>It appears that the self-attention transformation warps the embedding\nspace in a smooth and continuous manner. We can take a closer look at\nthis in the following diagram, which showcases what the self-attention\ntransformation does to the coordinate system grid:<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/qxjbrvk06b?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>We can again see that the space is heavily compressed, as points\nwhich are further than a few units from the origin are all squished\ninto the edges or vertices of the polygon.<\/p>\n<p>In a way, we can consider the tokens as singularities. 
After all,\nthey are the points where the space collapses upon itself. We can focus\nonly on the coordinate system axes:<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/t0capcaygq?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>Their ends eventually end up in one of the tokens. This presents us\nwith an interesting problem &#8211; given that there are five tokens, but\nonly four axis ends, there must be some point at which an end\n\u201cjumps\u201d between two tokens, a discontinuity. This is interesting, as a\nsmall change in the position of a token completely changes which way\nthe space is distorted. This reminds me of a classifier, as it creates a\ndiscrete mapping of axis ends to tokens.<\/p>\n<p>For a given arrangement of tokens, the mapping is continuous and the\ntransformation is differentiable with respect to the query. However, the\nmapping is not differentiable with respect to the tokens.<\/p>\n<p>Recall that there is a residual connection around the self-attention\nmechanism. What would happen if we <em>added<\/em> the mapped\nposition to the original? Let\u2019s look at what would happen if the <span class=\"math inline\">\\(W^Q\\)<\/span>, <span class=\"math inline\">\\(W^K\\)<\/span>, and <span class=\"math inline\">\\(W^V\\)<\/span> matrices were all identity transforms\n(<a href=\"https:\/\/www.desmos.com\/calculator\/wdqkyc2xvp\">here<\/a> is a\nversion including gridlines in the range <span class=\"math inline\">\\([-1,1]\\)<\/span>, if your computer can handle\nthat):<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/bzm5qsawvx?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>At first glance, this seems surprisingly organized to me. I mean,\nthis is just matrix multiplication and a <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span>, with arbitrary token\nvalues. 
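<\/p>\n<p>The way the axis ends are \u201cappropriated\u201d by extreme tokens follows from softmax saturation: for a query far out along a direction <span class=\"math inline\">\\(\\vec{u}\\)<\/span>, the weights collapse to a one-hot on the token maximizing <span class=\"math inline\">\\(\\vec{t_i}\\cdot\\vec{u}\\)<\/span>, so the output lands on that single token. A check with identity projection matrices (my own sketch with arbitrary token values):<\/p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# five fixed 2-D token points (arbitrary example values)
tokens = np.array([[1.0, 0.2], [-0.5, 1.0], [0.3, -1.2],
                   [-1.0, -0.4], [0.1, 0.6]])
u = np.array([1.0, 0.0])  # direction of an axis end

# send the query far out along u (W^Q = W^K = W^V = identity)
v = 1000.0 * u
out = softmax(tokens @ v) @ tokens

# the output coincides with the token of largest t_i . u
winner = tokens[np.argmax(tokens @ u)]
assert np.allclose(out, winner)
```

<p>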
Yet, still, this complex shape is created.<\/p>\n<p>After observing it for a while, we may see that the diagram features\na few areas with distinct characteristics:<\/p>\n<ol type=\"1\">\n<li>The <span class=\"math inline\">\\((-1, -1), (-1, 1), (1, 1), (1,\n-1)\\)<\/span> square is transformed into a shape resembling the convex\nhull of the token point set.<\/li>\n<li>The region inside that square is warped. Non-hull tokens influence\nthe nature of this warping.<\/li>\n<li>Far from the hull, the coordinate system\u2019s axes are straight. They\nlook like funnels that attract gridlines towards them. The ends of the\naxis are \u201cappropriated\u201d by respectively the leftmost, topmost, rightmost\nand bottommost tokens (AABB).<\/li>\n<li>The edges of the hull seem to project funnels outwards. It looks\nlike the gridlines are \u201cavoiding\u201d them and, hence, stretch quickly to\nthe other side.<\/li>\n<li>The remaining parts of the coordinate system seem to be left in\npeace.<\/li>\n<\/ol>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-partitioning.png?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-partitioning.png?w=680&#038;ssl=1\" alt=\"questions\" data-recalc-dims=\"1\"><\/a><\/p>\n<p>I find it impressive that the self-attention layer essentially finds\nthe convex hull of the token point set, as well as the AABB of the hull.\nThis is all, while having no conditional statements or control flow\nconstructs, as it is just two matrix multiplications and a <span class=\"math inline\">\\(softmax\\)<\/span> operation &#8211; <code>add<\/code>,\n<code>sub<\/code>, <code>mul<\/code>, <code>div<\/code>,\n<code>exp<\/code>.<\/p>\n<p>We can think about the \u201cdensity\u201d of this new space. 
It looks to me\nlike the hull and its projected funnels are regions of low density &#8211;\npoints avoid them and prefer not to end up there, and gridlines are\nstretched there. We can confirm this by plotting points instead of\nlines:<\/p>\n<iframe src=\"https:\/\/www.desmos.com\/calculator\/p2nrzwp5k6?embed\" frameborder=\"0\" style=\"filter:invert(1)\">\n<\/iframe>\n<p>Now, if we vary the influence of the self-attention layer &#8211; multiply\nit by a constant in <span class=\"math inline\">\\((0,1)\\)<\/span> before\nadding it to the original value &#8211; we can clearly see that the embedding\nspace ends up split into five regions:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-animation.gif?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/attention-animation.gif?w=680&#038;ssl=1\" data-recalc-dims=\"1\"><\/a><\/p>\n<p>I hope that this insight into how the self-attention layer \u201clooks\u201d\nwill perhaps allow us to draw further conclusions about it or optimise it,\nfor example by constructing clever data structures that will accelerate\nits computation. The convex hull seems to be of importance here.\nFurthermore, if all transformations were somehow made\n\u201ccavity-preserving\u201d, the convex hull would need to be calculated only\nonce.<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/06\/IMG_0240-scaled.jpeg?ssl=1\"><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/06\/IMG_0240-scaled.jpeg?w=680&#038;ssl=1\" alt=\"Multi-headed self-attention with pre-normalization\" data-recalc-dims=\"1\"><\/a><\/p>\n<p><em>Multi-headed self-attention with pre-normalization. Notice how\nmany nonlinear layers are missing.<\/em><\/p>\n<p>Looking at the entirety of multi-headed self-attention, we can see\nthat it is no more than a regular neural network. 
Still, transformer\nblocks are made up of two units: self-attention and a feed-forward\nnetwork. The FNN is said to capture some intra-token relationships and,\nbeing independent of the tokens, potentially add context-independent\ninformation. Yet, seeing that self-attention is just a neural network,\ndo we really need to append a different neural network to it? Why not\njust work with what we already have and somehow put that universal\ncontext-independent knowledge inside the attention mechanism itself\u2026\n<em>continue in section B<\/em><\/p>\n<h3 id=\"b.-attention-is-a-vector-database\">B. Attention is\u2026 a vector\ndatabase<\/h3>\n<p>Recently, it has become popular to imbue large language models with\nexternal knowledge by using a so-called \u201cvector database\u201d.<\/p>\n<p>The goal is to enable the model to answer questions factually\nand, preferably, cite its sources. This can be useful, for example, for\nonline technical assistants. An LLM can be given access to the entire\nmanual, support forum, changelog, issue database, etc. The assistant\nwill serve as a more user-friendly interface for searching all these\nresources and it will also be able to synthesize new\nresponses that will, hopefully, solve the user\u2019s problem &#8211; or at least,\nmake them not dial the call center.<\/p>\n<p>The problem is that LLMs have a limited context window and, hence,\ncannot simply be given the entire text of all these sources combined. As\nsuch, we can provide it with only a subset of our \u201cknowledge\u201d. So, how\ndo we know what to tell it? This is the job of the vector database.<\/p>\n<p>There exist machine learning models that convert text into a vector\nof floating point numbers. These are often called embedding models for\nthe fact that they create <em>text embeddings<\/em>. What we do is we\ntake the combined text of all our sources and we <em>embed<\/em> it. 
We\ndo this by splitting it into parts (sentences, paragraphs, pages &#8211;\noften these are intermixed) and then embedding each of the parts\nseparately. Then, we store the <code>(text, embedding)<\/code> pairs in a\ndatabase that we will later query. To query the database, we just\nembed the user\u2019s prompt and search for embeddings that are \u201csimilar\u201d to\nit.<\/p>\n<p>Many companies are raising millions of dollars, creating complex\naccelerated vector databases. Still, we can construct a naive and simple\ndatabase with a few lines of code. We simply have to pick a subset of\nthe pairs that are the most \u201csimilar\u201d to our query:<\/p>\n<pre class=\"python\"><code>userPrompt: str\ndb: list[tuple[str, list[float]]]\n...\nuserEmbedding: list[float] = embed(userPrompt)\nresults: list[str] = []\nfor text, embedding in db:\n  if similarEnough(embedding, userEmbedding):\n    results.append(text)\nreturn smartAI(context=results, prompt=userPrompt)<\/code><\/pre>\n<p>That would be the general idea. But what is the \u201csimilarity\u201d?\nUsually, it is the cosine of the angle between the text\u2019s embedding and\nthe user query\u2019s embedding, as measured in their multidimensional space.\nAs these embeddings are often normalized, calculating the cosine\nusually amounts to simply computing a dot product of the embeddings.<\/p>\n<p>After thinking for a while about this, I saw a striking resemblance\nto something else I knew. \u201cQuery, dot product, text embedding vector,\ntext string\u201d &#8211; this all sounds just like the self-attention layer in the\ntransformer. We have queries &#8211; users\u2019 prompts, keys &#8211; text embeddings,\nand values &#8211; text strings themselves. The only difference between a\nvector database and the self-attention layer is apparently the storage\nformat of the data. 
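<\/p>\n<p>For completeness, the <code>similarEnough<\/code> check from the snippet above can be sketched explicitly. This is only an illustration &#8211; the 0.8 threshold is an arbitrary placeholder &#8211; but it shows why normalized embeddings make the cosine collapse into a plain dot product:<\/p>

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(angle) = (a . b) / (|a| |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarEnough(embedding: list[float], user_embedding: list[float],
                  threshold: float = 0.8) -> bool:
    # for unit-length vectors both norms are 1, so only the dot product remains
    return cosine_similarity(embedding, user_embedding) >= threshold
```

<p>With pre-normalized vectors, <code>cosine_similarity<\/code> could skip the two square roots entirely, which is exactly the simplification mentioned above.<\/p>\n<p>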
Here are some takeaways from this analogy:<\/p>\n<ol type=\"1\">\n<li>Why are the queries and the text strings using the same embedding\nmodel? In the transformer, there is a separate <span class=\"math inline\">\\(W^Q\\)<\/span> matrix for calculating the query\nvectors and a separate <span class=\"math inline\">\\(W^K\\)<\/span> matrix\nfor the keys.<\/li>\n<li>Why one head? Each string and query has only one embedding. This is\nlike having only a single head in a transformer. Understandably, this is\nprobably caused by the storage requirements. Storing separate embeddings\nfor many \u201cheads\u201d would use too much space. Unless\u2026 we didn\u2019t have to\nstore them at all.<\/li>\n<\/ol>\n<p>In the previous section, we saw that the self-attention layer works\njust fine if, instead of remembering the key and value vector for each\nhead for each token, we remember only the token itself. We can do just\nthat with our embeddings.<\/p>\n<p>The current embedding models are trained to produce these, possibly\noptimal, <em>query-key vectors<\/em>. But we can change this design and\ninstead train the model to create good <em>token vectors<\/em>, and train\nit alongside a set of matrices that we will use to first transform the\nembeddings. Given that current embedding vectors are long already, this\ndoesn\u2019t seem to bring any additional computational cost. Well, we would\nneed to multiply the query embedding by a set of query matrices, but\nthat time is negligible compared to having to scan the entire\ndatabase while querying it. The only component that needs to be changed\nis the embedding model. 
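<\/p>\n<p>This design can be sketched as follows: the database stores raw <em>token vectors<\/em>, while separate trained matrices &#8211; random stand-ins below, with hypothetical dimensions &#8211; derive the query and the keys, giving each head its own view of the same stored vectors:<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, heads, n_chunks = 8, 4, 100          # hypothetical sizes
W_Q = rng.normal(size=(heads, d, d))    # would be trained with the embedder
W_K = rng.normal(size=(heads, d, d))
db = rng.normal(size=(n_chunks, d))     # stored token vectors, one per chunk

def score(query_embedding, db):
    # one cheap transform of the query per head...
    q = np.einsum('d,hde->he', query_embedding, W_Q)
    # ...and keys derived from the stored token vectors during the scan
    k = np.einsum('nd,hde->hne', db, W_K)
    # sum of per-head dot products: one score per stored chunk
    return np.einsum('he,hne->hn', q, k).sum(axis=0)

scores = score(rng.normal(size=d), db)
best = int(np.argmax(scores))           # index of the best-matching chunk
```

<p>In practice the per-head keys could also be precomputed and cached alongside the token vectors, trading the storage savings back for scan speed.<\/p>\n<p>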
The database is good as-is and it is oblivious\nto the change &#8211; it doesn\u2019t care about the \u201cnature\u201d of the vectors it\nstores.<\/p>\n<p>Going a step further, we may also just get rid of the text itself.\nBecause we already store the texts\u2019 vector representations, we can use a\nvalue matrix to turn them into the values that we want to accumulate.\nThat matrix would be trained along the embedding model, just like the\nquery and key matrices.<\/p>\n<p>Ok, so now we can rest as we have freed ourselves from\nvariable-length strings and operate only on vectors. But, wait\u2026 why do\nwe need an <em>external<\/em> vector database at all? Given that we just\nreplicated the self-attention mechanism, albeit with some additional\nsteps, like training a new embedding model, why not just go ahead and\nuse self-attention instead?<\/p>\n<p>Traditionally, the self-attention layer operates on the context given\nto the transformer &#8211; hence <em>self<\/em>-attention. But what if we\nchanged this and introduced a new attention layer &#8211; <em>database-decoder\ncross-attention<\/em>?<\/p>\n<p>Currently, LLMs are trained on vast amounts of data and pick up\nknowledge along the way. Where it is stored, we don\u2019t know (somewhere in\nthe weights, duh). So why not create an explicit container for the\nmodel\u2019s knowledge &#8211; an internal database. We would first train the model\nwith general text, as usual, but then, we would go ahead and teach it\nspecific subjects. (Typically, this would be called fine-tuning, but\nthat name suggests that we are taking a complete model and changing it,\nwhile in this case, we are simply splitting the training process into\nsteps.)<\/p>\n<p>The model would have a database cross-attention layer in each of its\ndecoders. We would append zeros at the end of the database, and train\nthe model on a new subject, while keeping the rest of the parameters\nfrozen. 
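<\/p>\n<p>A minimal numpy sketch of this append-and-freeze step, with hypothetical sizes. One caveat: zero <em>values<\/em> contribute nothing to the attention output, but zero <em>keys<\/em> still receive softmax weight, so preserving the model&#8217;s behaviour exactly would additionally require masking the fresh rows until they are trained:<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, old_rows, new_rows = 8, 32, 8           # hypothetical sizes
M_keys = rng.normal(size=(old_rows, d))    # memory from earlier training
M_vals = rng.normal(size=(old_rows, d))

def cross_attend(x, keys, vals, mask=None):
    scores = x @ keys.T
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # hide untrained rows
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ vals

# teach a new subject: zero-initialized scratch rows, old memory frozen
M_keys = np.vstack([M_keys, np.zeros((new_rows, d))])
M_vals = np.vstack([M_vals, np.zeros((new_rows, d))])
trainable = np.zeros(old_rows + new_rows, dtype=bool)
trainable[-new_rows:] = True        # only these rows receive gradient updates

x = rng.normal(size=(3, d))
# with the fresh rows masked out, the output is exactly the pre-expansion one
out = cross_attend(x, M_keys, M_vals, mask=~trainable)
```
<p>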
This would mean that whatever it learns must be located in the\nnew scratch space we provided it.<\/p>\n<p>What about stability? Fine-tuning a model often decreases its ability\nto generalise and requires additional restorative tuning. Well, it is\npossible that this would also happen here. However, there is a theoretical\nreason why it should not. Namely &#8211; usual fine-tuning appends new layers to a model and\ntrains them. This means that everything the model produces gets\nprocessed by these new layers. As such, it\u2019s easy for the model to\nbecome worse, as all its outputs are garbled by the new layers.\nMeanwhile, this approach of appending data to a database used in a\ncross-attention layer does not interfere with the model\u2019s previous\nfunctioning. As the data is zero-initialized, it has little effect on the\noutput of the model. When we train the model, we only add to its pool of\nknowledge, so theoretically, it shouldn\u2019t forget anything it already\nknew, or lose any abilities, as, fundamentally, we are not modifying the\nmodel at all. Instead, we are adding a sort of plugin, or mix-in, to\nit.<\/p>\n<blockquote>\n<p><strong>TOME<\/strong><\/p>\n<p>Since writing this, I have learned that a similar idea has already\nbeen explored in <a href=\"https:\/\/arxiv.org\/abs\/2110.06176\">TOME (de\nJong et al., 2022)<\/a>. However, there are some differences. The process\nthe paper outlines is roughly:<\/p>\n<ol type=\"1\">\n<li>Create a set of mentions. A mention is a certain type of text string\nthat mentions named entities, their properties and relations between\nthem.<\/li>\n<li>Train a transformer encoder (E) that will create a key and value\nvector for each mention. These keys and values form the memory (M).<\/li>\n<li>Train the main transformer with M-cross-attention. 
M\u2019s contents are\nfrozen.<\/li>\n<li>Add new knowledge to the transformer by encoding it with E and\ninserting the new keys and values into M.<\/li>\n<\/ol>\n<p>Meanwhile, I propose:<\/p>\n<ol type=\"1\">\n<li>Create a fixed-size memory (M) &#8211; a list of key and value vector\npairs. Initialize it to random values.<\/li>\n<li>Train the main transformer with M-cross-attention. M\u2019s contents are\ntrained together with the transformer\u2019s parameters.<\/li>\n<li>Add new knowledge to the transformer by adding some number of rows\nto M and training them. The transformer and the previous M\u2019s contents\nare frozen.<\/li>\n<\/ol>\n<p>Let\u2019s compare the two. My approach does not need gathering any new\ndata. The memory is trained on the same text corpus that the transformer\nis trained on. The paper requires explicitly creating a set of mentions.\nThis means that my approach requires only changing the implementation of\nthe transformer, while the entire process of training remains unchanged;\nthe paper needs changes to both. Using all text as the\nsource of knowledge can allow the transformer to capture more\ninformation inside the memory. However, learning from mentions can be\nmore efficient and possibly more effective, as these mentions can be\nmore information-dense than general text. Generally, the paper seems to\noutline a stricter learning paradigm than I do. In my approach, the\ntransformer can learn arbitrary information in an arbitrary\nformat.<\/p>\n<p>Going further, we can think whether we really need a separate\ncross-attention layer. Maybe we could query both our internal database\nand the context in the same attention layer. By combining the two\ntogether, there would be no distinction between the model\u2019s prior\nknowledge and its working context. Going further still, we could go\nahead and remove FNNs and leave only the attention layers. 
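<\/p>\n<p>Merging the two lookups is mechanically simple &#8211; concatenate the database keys and values with the context&#8217;s before the softmax, so that prior knowledge and working context compete in a single attention pass. A sketch with hypothetical shapes:<\/p>

```python
import numpy as np

def combined_attention(q, ctx_k, ctx_v, mem_k, mem_v):
    # context and internal database share one softmax: the model no longer
    # distinguishes prior knowledge from its working context
    K = np.vstack([ctx_k, mem_k])
    V = np.vstack([ctx_v, mem_v])
    scores = q @ K.T
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 8
out = combined_attention(
    rng.normal(size=(5, d)),                                  # queries
    rng.normal(size=(10, d)), rng.normal(size=(10, d)),       # context K, V
    rng.normal(size=(64, d)), rng.normal(size=(64, d)))       # database K, V
```
<p>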
After all,\nany necessary information can just be saved in our database.<\/p>\n<h3 id=\"c.-multi-decoder\">C. Multi-decoder<\/h3>\n<p>Originally, if ChatGPT was asked what the sum of the squares of\nthe first 20 primes is, it would just make up some number. Now, it\ndescribes, step by step, how to calculate the result and gets it\nright. This is most likely the result of it having been fine-tuned\nby OpenAI on chain-of-thought examples.<\/p>\n<p>The idea behind <a href=\"https:\/\/arxiv.org\/pdf\/2205.11916.pdf\">\u201clet\u2019s\nthink step by step\u201d<\/a>, as well as previous prompt-engineering guidance,\nis that an LLM needs more \u201cspace\u201d when dealing with a \u201charder\u201d task.\nIntuitively, the amount of computation that happens when a token is\ngenerated is constant, regardless of the prompt or the new token. We\ncan treat this amount of computation as time that the model has to\n\u201cthink\u201d. This is because all the LLM\u2019s generation logic happens during,\nwell, token generation. As such, when we ask an LLM a question like\n<code>P=NP? [Y\/N]<\/code>, we do two things &#8211; we give it a hard problem,\nand we give it little time, by forcing it to answer with only Yes or\nNo. We can expect poor performance, as the LLM simply doesn\u2019t have\nenough time to figure out an answer to our question. On the other hand,\nif we tell the model to \u201cthink step by step\u201d, we prompt it to produce a\nlonger output. This essentially means giving the model more time. 
Using\nthis knowledge, we can estimate a task\u2019s difficulty: <span class=\"math display\">\\[\nd=\\frac{c}{\\tau}\n\\]<\/span> The <span class=\"math inline\">\\(d\\)<\/span>ifficulty of a task\nis its <span class=\"math inline\">\\(c\\)<\/span>omplexity per unit <span class=\"math inline\">\\(\\tau\\)<\/span>ime (solving a complex problem in\nlittle time is difficult &#8211; it is easier when more time is available or\nwhen the task is simpler).<\/p>\n<p>The performance of the model is dependent on the type of task given.\nSome LLMs are better at solving particular types of tasks than\nothers, but models\u2019 general capability can also be compared. The\nperformance of a model at a particular task can be roughly modeled as:\n<span class=\"math display\">\\[\np=\\frac{Cf(t)}{d}\n\\]<\/span> The performance is higher for more <span class=\"math inline\">\\(C\\)<\/span>apable models and for models better\n<span class=\"math inline\">\\(f\\)<\/span>it for the particular <span class=\"math inline\">\\(t\\)<\/span>ype of task at hand. At the same time,\nit decreases as the <span class=\"math inline\">\\(d\\)<\/span>ifficulty of the\ntask gets higher.<\/p>\n<p>By leveraging fine-tuning, we can teach an existing LLM to produce\noutputs that resemble a step-by-step reasoning process. Still, it is us\ncontrolling the \u201cthought-process\u201d of the model. This may cause us to\ntrain the model on thought-processes of inadequate length &#8211; different\nLLMs with different capabilities may require different lengths.\nAdditionally, the model has to reason like a human &#8211; fine-tuning limits\nit to \u201cthinking\u201d only in words. Preferably, the model should be able to\n\u201cthink\u201d in a \u201cneural format\u201d &#8211; arbitrary vector representations that\nneed not map to textual tokens.<\/p>\n<p>The key limitation is that LLMs have no real \u201cscratch-space\u201d, as all\ntheir decoded tokens are included in the output. 
This means that their\n\u201cthinking time\u201d is directly bound to the length of their output. One\noption to decouple the two is to introduce scoping tokens, like\n<code>&lt;thought&gt;<\/code> and <code>&lt;\/thought&gt;<\/code>. LLMs are\nproficient at using scopes, both in code, as well as in regular language\n(direct speech). The purpose of these scoping tokens would be simple &#8211;\nanything between them is not decoded and does not affect the loss. The\nproblem with this is that there is no obvious way of making the model\ngenerate these tokens at all, or of limiting the model\u2019s \u201cthinking time\u201d.\nHaving the model generate tokens forever would certainly be an\nundesirable quality.<\/p>\n<p>I suggest a mechanism designed to sidestep these problems that allows\nthe model to reason for as long as it sees fit. Additionally, it requires no\nexplicit training, as it is an extension of the Transformer rather than\na novel training method.<\/p>\n<p>The original Transformer consisted of an encoder and a decoder. Since\nthen, other models have been proposed that include only one of these\nelements. Notably, all leading models, like GPTs, LLaMAs and PaLMs, are\ndecoder-only. This shows that the encoder-decoder cross-attention\nmechanism is not needed for achieving top performance.<\/p>\n<p>Seeing these \u201creduced Transformers\u201d, I naturally wondered if there\nare any models that do the reverse &#8211; add additional encoders or decoders\nto the Transformer. I am not aware of any, and hence propose how one\ncould look.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/multi-e1687774016702.png?w=680&#038;ssl=1\" data-recalc-dims=\"1\"><\/p>\n<p>The key idea is to have the Transformer generate every single token\nin an autoregressive manner itself. This will allow the model to vary\nits \u201cthinking time\u201d. 
This is achieved by introducing an intermediary\ndecoder.<\/p>\n<p>The model consists of a masked encoder and two autoregressive\ndecoders. The encoder (and the decoders) uses causal attention masking &#8211;\ni.e. a token cannot attend to future tokens. In principle, this makes\neach step of the autoregression process self-contained, just like in a\ndecoder-only model.<\/p>\n<p>The first decoder generates an intermediary token sequence. Its\nformat is left opaque &#8211; the model can generate token vectors as it\nsees fit. The produced token could be given to the decoder as-is or\nwith additional positional encoding. The decoder is first\ngiven some BOS (beginning-of-sequence) token. As this decoder does not\nhave direct access to the prompt, it has to use encoder-decoder\ncross-attention to relate to the prompt. This design allows the\nTransformer to \u201cthink\u201d for as long as it needs. The autoregression will\nstop when a certain criterion is met &#8211; it will be discussed shortly.<\/p>\n<p>When the first decoder is done generating, the second decoder begins\nits work. This time, it has access to the input prompt, like in a\ndecoder-only model. Additionally, it has cross-attention layers that\nimbue it with the \u201cthoughts\u201d generated by the first decoder. This\ndecoder produces the next output token, just as usual. When it\ncompletes, the token is decoded, reencoded and the entire process\nautoregresses. The keys and values of prompt tokens can be (and should\nbe) cached inside the encoder and the second decoder, as well as in\ntheir embeddings. The middle decoder can potentially use caching. If it\ndoes, it will have access to all previous \u201cthoughts\u201d, which could\npotentially save it from doing some redundant work. 
On the other hand,\nthis would be more computationally expensive, and would make the\nautoregression step no longer self-contained &#8211; a token\u2019s generation\nwould rely on previous \u201creasoning\u201d.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i0.wp.com\/janbielak.com\/wp-content\/uploads\/2023\/05\/tokenandahalf.png?w=680&#038;ssl=1\" data-recalc-dims=\"1\"><\/p>\n<p>How could this be implemented? The first problem to solve is to\ndetermine the stop criterion. Because this is an internal decoder, we\ndon\u2019t want to teach the model to produce a special EOS token. At the\nsame time, we would like to be able to put an upper bound on the amount\nof time the model can spend. To achieve this, I propose to introduce a\n<code>life<\/code> value. It would be a number in <span class=\"math inline\">\\([0, 1]\\)<\/span> representing how \u201cdone\u201d the\nmodel is &#8211; when <code>life<\/code> reaches zero, this means that the model\nhas concluded \u201cthinking\u201d. After generating each token, this value would\nbe decreased by some amount: <code>life -= f(token)<\/code>. This allows\nthe model to ask for more time or to finish early, by producing an\nappropriate type of token. What happens if <code>life &lt; 0<\/code>?\nThis is the interesting part.<\/p>\n<p>Usually, creating such a transformer would come with a significant\nproblem &#8211; it would be non-differentiable. If a model with parameters\n<span class=\"math inline\">\\(p\\)<\/span> produced <span class=\"math inline\">\\(n\\)<\/span> tokens in the intermediary layer, given\ninput <span class=\"math inline\">\\(i\\)<\/span>,\nhow many would it produce if the values of <span class=\"math inline\">\\(p\\)<\/span> were to slightly change? Normally, we\ncouldn\u2019t tell. That\u2019s because <span class=\"math inline\">\\(n(p,\ni)\\)<\/span> is not continuous. 
As a model can produce only an integer\nnumber of tokens, this function must necessarily be non-differentiable\nwith respect to <span class=\"math inline\">\\(p\\)<\/span>. But what if the\nmodel could produce \u201chalf\u201d a token?<\/p>\n<p>This is the other function of <code>life<\/code>. If it becomes\nnegative after a token is produced, we know that the model \u201cwent too\nfar\u201d. To account for this, the model needs to \u201cunproduce\u201d a part of the\ntoken, to make <code>life<\/code> be capped at zero. This requires\n\u201ccutting a token in half\u201d. Fortunately, these tokens are only vectors.\nSo, naturally, we can just decrease the magnitude of the token vector by\na certain amount. This will decrease its respective influence in the\ncross-attention layer in the second decoder:<\/p>\n<pre class=\"python\"><code>life: float    # remaining thinking budget in [0, 1]\ntoken: vector  # the intermediary token just produced\n\ndl = f(token)         # cost of this token; f always returns positive values\nnl = life - dl\ncredit = max(0, -nl)  # how far below zero we went\n\ninfluence = 1 - (credit \/ dl)  # fraction of the token we can still afford\n\ntoken *= influence    # a partially produced token\naddToContext(token)\nlife = nl\nif life &lt; 0:\n    break             # the model has concluded thinking<\/code><\/pre>\n<p>This allows the LLM to control its own execution (via the\n<code>break<\/code> statement), while retaining its\ndifferentiability.<\/p>\n<p>All model parameters would be random-initialized as usual. As long as\n<code>f(token)<\/code> is made to always return positive values, the\nmodel is guaranteed to stop at some point. This can be\ncontrolled by changing the implementation of <code>f<\/code>. Teaching\nthe model works as usual &#8211; the loss, computed as the difference between\nthe model\u2019s output and the desired one, is backpropagated. This alone\ncan cause the model to work ineffectively, as it doesn\u2019t take\ncomputation time into account. As such, I suggest including the length\n(number of tokens; <code>3.3<\/code> above) of the intermediary sequence\nas part of the loss.<\/p>\n<h3 id=\"d.-attention-is-multiplication\">D. 
Attention is\u2026\nmultiplication<\/h3>\n<p>One of the prevalent questions surrounding transformers is \u201cwhy are\nthey so good?\u201d Usually, the self-attention mechanism is provided as the\nexplanation. But what is so special about self-attention?<\/p>\n<p>We saw that, fundamentally, self-attention is just an FNN. So what\u2019s\nall the fuss about? Is it the <span class=\"math inline\">\\(\\mathrm{softmax}\\)<\/span> activation function? I\nthink the crucial component that self-attention brings is\n<em>multiplication<\/em>.<\/p>\n<p>A neural network is a series of transformations that are applied to\nsome input data. Usually, it is a chain of linear layers intermixed with\nnonlinear ones. Yes, models can be more complex, like an autoregressive\ntransformer &#8211; it has a loop &#8211; or a diffusion model &#8211; it adds noise at\neach iteration. But still, if we limit ourselves to only linear and\nnonlinear layers, we miss a crucial component.<\/p>\n<p>Typical neural networks never multiply two inputs together. If we\nlook at the path a single input <code>float<\/code> traverses, we will see\nthat it gets scaled by pretrained weights, it has pretrained biases\nadded to it, it has expressions of other inputs added to it, and it is\ntransformed by nonlinear functions. But never are two input\n<code>float<\/code>s multiplied together.<\/p>\n<p>Even if we look at RNNs, we can see that the hidden representation as\nwell as the input only get multiplied by pretrained weights. Only LSTMs\nintroduce <em>self-multiplication<\/em> in an indirect way, as\nexpressions dependent on the cell\u2019s current and previous states get\nmultiplied. This happens when going through the various gates, as they\nare dependent on the cell state. 
Although LSTMs are significantly less\nadvanced than Transformers, this introduction of self-multiplication\ncould partly explain their improved performance over\nsimilarly-sized RNNs.<\/p>\n<p>Perhaps it would be beneficial to test\n<em><code>Productional<\/code><\/em> layers, instead of only\n<code>Linear<\/code> ones. There are many ways to design a layer that\ninvolves products of its inputs. Still, I think that adding such\nlayers could potentially result in performance superior to that of purely linear\nones.<\/p>\n<h3 id=\"attention-is-protein-folding\">Attention is\u2026 protein folding<\/h3>\n<p>Protein folding is a task that involves predicting the final\nlocations of a set of particles, given their initial locations. The\ndifficulty stems from the fact that the position of the particles changes\nthe magnitude of the electrostatic forces between them. If we think\nof this more abstractly, a given set of particles is defined by only\ntheir positions, and is associated with a well-defined set of couplings\nbetween the particles that follows directly from those positions. We\ncould think of self-attention in a similar manner. Tokens could be\nconsidered as particles, while their relative compatibilities could be\ninterpreted as measures of some interactions between them. In this case,\nthe token system passing through the various attention layers could be\nseen as a set of particles undergoing folding over time. A critical\nconsequence of this is that the system would have to either \u201cconverge\u201d\nto a stable state or enter an oscillating pattern. In practice, this\nwould mean that Transformers have scalability limitations &#8211; adding more\nlayers to them results in more accurate predictions, but has diminishing\nreturns.<\/p>\n<h3 id=\"attention-is-a-knowledge-graph\">Attention is\u2026 a knowledge\ngraph<\/h3>\n<p>Consider a single attention head. 
It assigns a certain \u201ccompatibility\nvalue\u201d to each pair of tokens &#8211; the dot product between their respective\nquery and key. We can treat tokens in the context as nodes of a clique,\nand these compatibilities as weights of edges between the nodes. If we\nlook at multiple heads, we can interpret the token graph as a multilayer\ngraph. Each head yields a different set of edges &#8211; a layer of the graph.\nThis can be compared to a knowledge graph, in which the tokens model\nentities, while the heads capture relationships between them. Maybe\nattention proves so effective because it allows the model to understand its\ninput in the form of a graph?<\/p>\n<h2 id=\"perplexity\">Perplexity<\/h2>\n<p>Beam width is a hyperparameter of the autoregressive token generation\nprocess. As such, its value is determined mostly using trial and error,\ntrading off performance against generation quality. Could it be\npossible to find an <em>optimal<\/em> value for beam width &#8211; one beyond which larger\nwidths would not yield better results? My idea is to set it to the\nmodel\u2019s perplexity. That is because perplexity is, fundamentally, a\nmeasure of the model\u2019s uncertainty in generating its predictions. When a\nmodel has a perplexity of, say, 15, this means that when it\ngenerates a new token, it is as unsure as if it had to pick between 15\noptions. Perplexity can be treated as <a href=\"https:\/\/towardsdatascience.com\/perplexity-in-language-models-87a196019a94\">a\nweighted branching factor<\/a>. To me, this seems similar to beam width.\nAfter all, the beam width is the number of best predictions that are\nbeing pursued in parallel. So, intuitively, making it equal to the\nperplexity should explore all the model\u2019s plausible outputs.<\/p>\n<h2 id=\"teach---not-train\">Teach &#8211; not train<\/h2>\n<p>How do (human) children learn to read? If you handed The Works of\nShakespeare Combined to a kid, would you expect them to learn English\nfrom them? 
I wouldn\u2019t. Yet, this is how we treat LLMs. LLMs are trained\non trillions of tokens, which is orders of magnitude more than what it\ntakes humans to learn linguistic capabilities. Admittedly, the training\nprocess teaches the models not only to communicate effectively, but also\ngives them knowledge and other capabilities. Yet, still, this seems\nhighly inefficient compared to how humans learn. Usually, this is either\nattributed to an imperfect architecture or limited model size. However,\nto me it seems that the problem lies in the training process, rather\nthan in the model itself.<\/p>\n<p>If artificial neural networks are to mimic their real counterparts in\nany way, their learning patterns should be similar as well. As such, it\nseems understandable to me that a model has to see billions of tokens in\norder to learn proper grammar. This is because the learning material it\nis being given is complete chaos. It is no coincidence that people talk\nto children in a \u201cchildish\u201d manner. This seems to have been designed by\nevolution in order to facilitate the language learning process for the\nkids. In general, optimal learning performance will be achieved only\nwhen the learning material\u2019s difficulty is adequate for the\nlearner\u2019s skill.<\/p>\n<p>This is why training models is so inefficient. Let\u2019s say that we\ntrain a model to distinguish species of fish. This seems a bit awkward,\nas the model is suddenly shown pictures of some \u201cthings\u201d, while it\ndoesn\u2019t even know \u201cwhat a fish is\u201d. In this context, fine-tuning a\nfoundation model can be treated as simply teaching a model new skills.\nRecent research often suggests that large models learn the majority of\ntheir capabilities during their pretraining stages, and fine-tuning only\nallows these capabilities to surface. This is a plausible explanation,\nbut I also see a different one &#8211; pretraining teaches the model how to\nlearn. 
This doesn\u2019t happen directly. Pretraining only makes the model a\nmore skilled student that is able to learn harder tasks quicker, based\non its previous knowledge. And the new task doesn\u2019t have to be\nreflected in the pretraining process.<\/p>\n<p>As such, it could be worth exploring how models would behave if they\nwere taught, rather than trained. This would affect two aspects of\nthe learning process &#8211; the dataset and the optimizer. The dataset would\nhave to be tiered, from easiest to hardest examples. It would also\npreferably be dynamic &#8211; if a model struggles (has high loss on the\nvalidation set from a certain tier), it would be trained on it for\nadditional epochs. In addition to this <i>difficulty gradation<\/i>, other pedagogical\ntechniques could be used. Some examples include repeating earlier examples\nto reinforce older, potentially partially forgotten, abilities.<\/p>\n<p>The other component to be affected is the optimizer. If it is to\nmimic human learning, it should begin with a very large learning rate,\nand then gradually decrease it. This can sound contrary to the usual advice on training\nmodels &#8211; wouldn&#8217;t this cause overfitting? Well, overfitting arises when a model\ncontinuously relearns its training data. Here, however, after the model learns the\nsimpler examples, it never returns to them &#8211; learning them is\nonly one stage of the process. In other words, I propose to overfit the model on purpose,\nas the learned data will get overwritten anyway. This is a major paradigm shift\nin the comprehension of the learning process. Currently, a learning model has only\none goal &#8211; minimize the overall loss. I propose changing this goal to: facilitate further learning.\nThis can seem counterintuitive &#8211; what is the point of teaching a model something, only to\nlater overwrite it? I say that it is to make learning faster, easier and more efficient. 
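<\/p>\n<p>The proposal condenses into a short control loop. Everything below &#8211; <code>Tier<\/code>, the stub model, the thresholds &#8211; is hypothetical scaffolding, meant only to show the flow: tiers ordered from easiest to hardest, a large learning rate that decays as the material gets harder, and repetition of a tier while the model still struggles with it:<\/p>

```python
from dataclasses import dataclass

@dataclass
class Tier:
    train: list
    val: list

class StubModel:
    # Hypothetical stand-in: its validation loss halves on every fit() call.
    def __init__(self):
        self._loss = 1.0
        self.steps = 0

    def fit(self, data, lr):
        self._loss *= 0.5
        self.steps += 1

    def loss(self, data):
        return self._loss

def teach(model, tiers, lr=1.0, decay=0.5, threshold=0.2, max_passes=10):
    # Curriculum loop: easiest tier first; overfitting a tier is fine,
    # since later tiers will overwrite it anyway.
    for tier in tiers:
        for _ in range(max_passes):
            model.fit(tier.train, lr=lr)
            if model.loss(tier.val) < threshold:
                break                 # the tier is mastered, move on
        lr *= decay                   # calmer updates for harder material
    return model

m = teach(StubModel(), [Tier([1], [1]), Tier([2], [2])])
```

<p>A real implementation would also decide <em>when<\/em> to revisit earlier tiers; the stub deliberately leaves that out.<\/p>\n<p>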
This overfitting can be treated not as training, but as a way of initializing the model&#8217;s weights. Instead of random values, we use weights associated with an easier problem.<\/p>\n<h2 id=\"continuous-attention\">Continuous attention<\/h2>\n<p>What would happen if the set contained literally all the possible\nvectors? After all, as the context grows larger and larger, it will\ncontain increasingly more vectors. Well, the answer would always be\n<span class=\"math inline\">\\(0\\)<\/span>, as they would cancel each other\nout. So no &#8211; the context cannot contain all vectors. More generally, it\ncannot contain vectors that are distributed \u201chomogeneously\u201d in\nspace. Speaking of homogeneity, this makes me wonder about the nature of\nthe self-attention context. Currently, it is discrete &#8211; it is a set of\nvectors. However, intuitively, what we actually care about is the\ndistribution of the vectors. If we think of the\nset of vectors as a point cloud, we don\u2019t care about the \u201cparticles of\nthe cloud\u201d. From a distance, it is, well, a cloud. We care about its\ndensity. The current solution of storing a set of vectors is simply a\ndiscretized version of this.<\/p>\n<p>So let\u2019s reevaluate what our inputs are. We have a density function\n<span class=\"math inline\">\\(\\rho(\\vec{t}):\\mathbb{R}^{d_k}\\rightarrow(0,1)\\)<\/span>\nthat can be roughly imagined as our set of points, after having gone\nthrough a Photoshop blurring filter. 
And we have a single vector <span class=\"math inline\">\\(\\vec{v}\\)<\/span> &#8211; the one from the single attention head\nunder consideration now.<\/p>\n<p>I think we can reasonably redefine the attention mechanism using the\nprovided analogy and make it continuous: <span class=\"math display\">\\[\n\\vec{z}=\\left(\\int_{\\mathbb{R}^{d_k}}\\left(\\vec{t}\\cdot\\vec{v}\\right)\\vec{t}\\,\n\\rho(\\vec{t})\\,d^{d_k}\\vec{t}\\right)W^V\n\\]<\/span><\/p>\n","protected":false},"author":1}