Transformers

Cut Cross Entropy Deep Dive: 20x Memory reduction in LLM Pretraining through optimized Triton kernels

Introduction: Whilst working on pretraining SabiYarn in 2025, I came across a paper by a team at Apple called “Cut Your Losses in Large-Vocabulary Language Models” that makes an interesting proposition: the cross-entropy loss function has a memory problem that has quietly crept up alongside a trend in LLM development, large vocabulary sizes. The paper introduces an optimised Triton kernel for computing the cross entropy, called Cut Cross Entropy (CCE). Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the cross-entropy loss computation during training from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. ...

Finetuning GPT2 to Reconstruct Sentences

Two words are anagrams if one can be formed by permuting the letters of the other. Applying the same logic to sentences, two sentences would be anagrams (no such thing) if the words of one can be permuted to form the other. I thought it would be interesting to teach a language model to do this. You might be thinking that simply rearranging words in a sentence doesn’t require intelligence and can be done with very trivial algorithms, and you would be right, but I added an edge to this task: given a random sequence of words, the language model has to return a grammatically correct sequence using the same set of words. For example, the following sequence: ...

Classifying Code snippets with BERT.

This is a fun side project where I explored transformer-based text classification for the first time by training BERT to identify 15 of the most popular programming languages. We start with simple machine learning approaches and gradually work our way up to more complex methods until we have a satisfactory solution. The Dataset: Our dataset is a CSV containing 45,000 samples. It has two columns: the ‘code’ column contains the code snippets we want to classify, and the ‘language’ column, which is our label, contains the programming language each snippet belongs to. Our train and test datasets were created by stratified sampling on the target variable. ...

Byte-Pair Encoding, The Tokenization algorithm powering Large Language Models.

Tokenization is an umbrella term for the methods used to turn text into chunks of words or sub-words. Tokenization has many applications in computer science, from compilers to Natural Language Processing. In this article, we will focus on tokenizers in language models, in particular a method of tokenization called Byte Pair Encoding. The last few years have witnessed a revolution in NLP, catalyzed mainly by the introduction of the transformer architecture in 2017 with the paper ‘Attention Is All You Need’ and epitomized by the introduction of ChatGPT in late 2022. ...