Cross Entropy

Cut Cross Entropy: 20x Memory reduction in LLM training through optimized cross entropy kernels

Introduction 2024 saw the first $100M+ training runs. While the enormous compute requirements of training Large Language Models are no secret, this jump brought more attention to the need for tricks and methods that optimize both the transformer architecture and the training process (infrastructure). The biggest and most successful of these optimizations in recent times has been FlashAttention (Dao et al., 2022), an idea that focuses on how the compute-hungry O(N²) self-attention mechanism is computed, tiling the operation so the attention matrices stay in fast on-chip SRAM instead of being materialized in HBM. (Essentially, the idea is that self-attention was optimized by modifying how the operation is performed on the GPU.) The trend of squeezing as much performance as possible out of the training infrastructure continued, with researchers writing their own optimized CUDA kernels. DeepSeek took this a step further, writing their own distributed file system (Fire-Flyer File System), a new attention mechanism (Multi-head Latent Attention, with custom kernels), a highly tuned communication library for mixture-of-experts models (DeepEP), and DeepGEMM, an FP8-optimized matrix multiplication kernel library. ...
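For intuition on why the cross-entropy step is a memory hotspot (the problem Cut Cross Entropy targets), here is a back-of-the-envelope sketch of the logit tensor that a naive implementation materializes; the batch, sequence, and vocabulary sizes are illustrative assumptions, not figures from the post:

```python
# Rough memory cost of materializing the full logits tensor for cross entropy.
# Shapes are illustrative; modern LLM vocabularies run to ~128K tokens.
batch, seq_len, vocab = 8, 4096, 128_256   # vocab sized like Llama 3's tokenizer
bytes_per_elem = 2                         # bf16

logits_bytes = batch * seq_len * vocab * bytes_per_elem
print(f"logits alone: {logits_bytes / 2**30:.1f} GiB")  # ~7.8 GiB, before gradients
```

The kernel-level trick the title alludes to is avoiding ever holding this full matrix in global memory, which is where the headline memory reduction comes from.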

January 16, 2026 · 8 min · 1566 words · Damilola John

Understanding Differential Attention.

Introduction Over the last few years, Transformers have emerged as the de facto deep learning architecture for modelling language. Their unprecedented success at solving complex language tasks, reasoning (or mimicking it), and solving math and coding problems has ushered in a new era in AI, powering successful AI products like ChatGPT. The key innovation of transformers lies in the self-attention mechanism, which allows each token in the input sequence to attend to every other token in the sequence. ...
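As a refresher on the mechanism the post builds on, here is a minimal single-head self-attention sketch in PyTorch, where every token's query is scored against every token's key; names and shapes are illustrative, not from the post:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: each token attends to every token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to queries/keys/values
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # (seq, seq) pairwise scores -> O(N^2)
    weights = F.softmax(scores, dim=-1)          # attention each token pays to each other token
    return weights @ v                           # weighted sum of value vectors

x = torch.randn(8, 64)                           # 8 tokens, 64-dim embeddings
w = [torch.randn(64, 64) for _ in range(3)]
out = self_attention(x, *w)                      # (8, 64)
```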

December 11, 2024 · 7 min · 1387 words · Damilola John

What do you do when the majority of your Coroutines are blocking?

Introduction I recently had to implement a feature at work that sent a ton of emails, with custom attachments, to buyers across the world. Since we had a broker for placing mails on a queue, where they were picked up and sent by workers, our architecture was already primed for asynchronous communication. However, on the client side, placing these emails asynchronously wasn’t as simple. The main path between creating a mail and actually placing it on the service bus (broker) consisted mostly of blocking, CPU-heavy tasks: reading Excel files, adding user information, serializing Excel files to an in-memory buffer, base64-encoding these bytes and decoding them to strings, before finally creating the message JSON and sending it to the queue (this last part is non-blocking). Implementing an efficient way of asynchronously performing these operations was an interesting problem, keeping in mind Python’s GIL among the other limitations I mentioned earlier. I thought I’d write this blog post on my approach ...
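The excerpt cuts off before the post’s actual solution, but a common pattern for exactly this situation is offloading the blocking, CPU-bound steps to a process pool so the event loop stays responsive; processes rather than threads, because the GIL stops CPU-bound threads from running in parallel. A minimal sketch with illustrative names:

```python
import asyncio
import base64
from concurrent.futures import ProcessPoolExecutor

def build_attachment(path: str) -> str:
    """Blocking, CPU-heavy work: read the file and base64-encode it."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

async def main(paths: list[str]) -> list[str]:
    loop = asyncio.get_running_loop()
    # A process pool, not a thread pool: CPU-bound work would serialize on the GIL.
    with ProcessPoolExecutor() as pool:
        tasks = [loop.run_in_executor(pool, build_attachment, p) for p in paths]
        return await asyncio.gather(*tasks)

# asyncio.run(main(["buyers_a.xlsx", "buyers_b.xlsx"]))  # paths are illustrative
```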

December 11, 2024 · 1 min · 163 words · Damilola John

Finetuning GPT2 to Reconstruct Sentences

Two words are anagrams if one can be formed by permuting the letters of the other. Applying the same logic to a sentence would be saying that two sentences are anagrams (no such thing) if their component words can be permuted to form clones of each other. I thought it would be interesting to teach a language model to do this. You might be thinking that simply re-arranging words in a sentence doesn’t require intelligence and can be done with very trivial algorithms, and you would be right, but I added an edge to this task: given a random sequence of words, the language model has to return a grammatically correct sequence using the same set of words. For example, the following sequence: ...
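The excerpt truncates before its example, but the training pairs such a task needs are easy to picture: shuffle the words of a correct sentence for the input, keep the original as the target. A hypothetical sketch, not the post’s code:

```python
import random

def make_example(sentence: str, seed: int = 0) -> dict:
    """Build a (shuffled words -> original sentence) training pair."""
    words = sentence.split()
    shuffled = words[:]
    random.Random(seed).shuffle(shuffled)   # seeded so examples are reproducible
    return {"input": " ".join(shuffled), "target": sentence}

pair = make_example("the quick brown fox jumps over the lazy dog")
# pair["input"] is some permutation of the words; pair["target"] is the original sentence.
```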

June 15, 2024 · 10 min · 2047 words · Damilola John

Classifying Code snippets with BERT.

This is a fun side project where I explored transformer-based text classification for the first time by training BERT to identify 15 of the most popular programming languages. I started with simple machine learning approaches and gradually worked my way up to more complex methods until I had a satisfactory solution. The Dataset Our dataset is a CSV containing 45,000 samples. It is made up of two columns: the ‘code’ feature contains the code snippets we want to classify, and the ‘language’ column, our label, contains the programming language each snippet belongs to. Our train and test datasets were created via stratified sampling based on the target variable. ...
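The stratified split mentioned above can be reproduced with scikit-learn; the column names follow the excerpt, while the file path and split ratio are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("code_snippets.csv")  # columns: 'code', 'language'

# Stratify on the label so every language keeps its proportion in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["language"], random_state=42
)
print(train_df["language"].value_counts(normalize=True))  # matches test_df's distribution
```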

August 19, 2023 · 4 min · 841 words · Damilola John
tokenizers

Byte-Pair Encoding, The Tokenization algorithm powering Large Language Models.

Tokenization is an umbrella term for the methods used to turn text into chunks of words or sub-words. Tokenization has many applications in computer science, from compilers to Natural Language Processing. In this article, we will focus on tokenizers in language models, in particular a method of tokenization called Byte Pair Encoding. The last few years have witnessed a revolution in NLP, catalyzed mainly by the introduction of the transformer architecture in 2017 with the paper ‘Attention Is All You Need’ and epitomized by the introduction of ChatGPT in late 2022. ...
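As a taste of the algorithm the article covers: one BPE training step counts adjacent symbol pairs across the corpus and merges the most frequent pair into a new symbol. A minimal sketch on a toy corpus, not the article’s example:

```python
from collections import Counter

def most_frequent_pair(words: dict[tuple, int]) -> tuple:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words: dict[tuple, int], pair: tuple) -> dict[tuple, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i : i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the two symbols
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words pre-split into characters, with occurrence counts.
words = {("l", "o", "w"): 5, ("l", "o", "g"): 3, ("n", "e", "w"): 2}
pair = most_frequent_pair(words)   # ('l', 'o'), seen 8 times
words = merge_pair(words, pair)    # {('lo', 'w'): 5, ('lo', 'g'): 3, ('n', 'e', 'w'): 2}
```

Repeating this merge step for a fixed number of iterations yields the merge table a BPE tokenizer applies at encoding time.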

July 20, 2023 · 13 min · 2564 words · Damilola John
image sensor

A guide on how AI is changing Computational Photography

And Enhance!! (from Blade Runner), that’s Computational Photography. Computational photography describes the signal processing techniques and algorithms that allow computers to replicate photographic processes like motion-blur correction, auto-focus, depth-sensing and zoom, features that would otherwise be impossible without optics. While some of these processes use artificial intelligence techniques, Computational Photography is more than just AI: it covers the whole chain of processing that takes an image from the ones and zeros captured by image sensors to the final image displayed on screens. This article is mainly focused on computational photography techniques that employ AI. Smartphone cameras have compensated for their hardware limitations, both the limited space to fit actual optics (like movable lenses to alter focus or depth of field) and the limitations that come with the technology behind digital cameras (CMOS sensors), with the enormous computational power of their processors, using clever algorithms to provide features like zoom and object-sensitive focus, among others. These algorithms have incorporated AI techniques in recent times to provide some unimaginable features, like Google Pixel’s night mode, which allows you to take high-definition pictures in extremely low light. ...