Byte-Pair Encoding, The Tokenization algorithm powering Large Language Models.
Tokenization is an umbrella term for the methods used to turn texts into chunks of words or sub-words. Tokenization has a lot of applications in computer science, from compilers to Natural Language Processing. In this article, we would be focusing on tokenizers in Language models, in particular, a method of tokenization called Byte Pair Encoding. The last few years have witnessed a revolution in NLP catalyzed mainly by the introduction of the transformers architecture in 2017 with the paper ‘Attention is all you need ’ epitomized by the introduction of ChatGPT in late 2022....