Thoughts Brewing Blog

BizTech Q&A 18: Tokenizer, is that even a word?

Written by Damien Griffin | Feb 2, 2025 11:15:00 AM

Answering your questions about business and technology

Question

What is a tokenizer?

~ Anonymous

Answer

Hello,

A tokenizer (yes, it’s a word) is a tool that AI systems like ChatGPT, Gemini, etc., use to break text up into tokens. Helpful, right?


I’ll go backwards slightly since I think it will help. In the AI world, a token is the basic unit of data that a language model can process. It’s analogous to a dollar being the basic unit of currency in the U.S.


This matters because language models recognize patterns in tokens (noticing common sequences of characters) and learn the statistical relationships between tokens.
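To make “statistical relationships” concrete, here is a toy sketch in Python. It just counts which token most often follows which in a tiny made-up sample; real models learn far richer patterns from billions of tokens, but frequency counting is the simplest version of the idea.

```python
from collections import Counter

# A tiny made-up token stream (for illustration only)
tokens = ["the", "river", "the", "jaguar", "the", "river"]

# Count each adjacent pair: (token, the token that follows it)
pairs = Counter(zip(tokens, tokens[1:]))

print(pairs.most_common(1))
# -> [(('the', 'river'), 2)]  -- "river" follows "the" most often
```

Given a prompt ending in “the”, even this crude count would nudge a prediction toward “river.”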


A different way to say that is that the language models use tokens to turn language into math.  And math is way easier to predict than human language.
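Here is a toy sketch of that idea in Python. The vocabulary and the splitting rule are invented for illustration; real tokenizers (like GPT’s) learn sub-word pieces from data, but the pipeline is the same: text in, numbers out.

```python
import re

# A made-up vocabulary mapping token strings to ID numbers.
# Real vocabularies hold on the order of 100,000 entries.
vocab = {"By": 0, "the": 1, "river": 2, "'s": 3,
         "edge": 4, ",": 5, "a": 6, "capybara": 7}

def tokenize(text):
    # Split into words and punctuation chunks
    # (a crude stand-in for learned sub-word splitting)
    return re.findall(r"'s|\w+|[^\w\s]", text)

def encode(text):
    # Map each token string to its ID number
    return [vocab[tok] for tok in tokenize(text)]

print(tokenize("By the river's edge, a capybara"))
# -> ['By', 'the', 'river', "'s", 'edge', ',', 'a', 'capybara']
print(encode("By the river's edge, a capybara"))
# -> [0, 1, 2, 3, 4, 5, 6, 7]
```

Once the text is a list of numbers, the model can do math on it.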


Here is an example of a piece of AI-generated text and a tokenized version of that same text -

  • Normal
    • By the river’s edge, a capybara named Momo befriended every creature—birds perched on her back, fish swam beneath her. One day, a jaguar approached, eyes gleaming. Instead of running, Momo offered a berry. Surprised, the jaguar ate. From that day, even predators joined Momo’s peaceful circle by the river.
  • Tokenized
    • [the same passage, shown broken into individual token chunks]

That chunked view is actually still the “human” version.


The computer will represent each token with a number so it will look more like this -

  • [1582, 290, 20608, 802, 11165, 11, 261, 2328, 88, 25358, 11484, 391, 15150, 50245, 872, 3933, 1753, 46949, 2322, 100222, 183083, 402, 1335, 1602, 11, 13897, 2766, 313, 39397, 1335, 13, 5108, 2163, 11, 261, 12107, 13077, 52390, 11, 9623, 28398, 11300, 13, 21050, 328, 6788, 11, 391, 15150, 10877, 261, 119910, 13, 9568, 638, 5761, 11, 290, 12107, 13077, 28397, 13, 7217, 484, 2163, 11, 1952, 119264, 16863, 391, 15150, 802, 37838, 22005, 656, 290, 20608, 13]
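And the process runs in reverse: decoding just looks each number up in the vocabulary and glues the pieces back together. Here is a sketch with made-up IDs; the real IDs, like the ones above, come from the model’s own vocabulary.

```python
# A made-up reverse vocabulary: ID number -> token string.
# Note the leading spaces -- tokenizers often fold the space
# before a word into the token itself.
id_to_token = {0: "By", 1: " the", 2: " river",
               3: "'s", 4: " edge", 5: ","}

def decode(ids):
    # Concatenate the token strings back into readable text
    return "".join(id_to_token[i] for i in ids)

print(decode([0, 1, 2, 3, 4, 5]))
# -> By the river's edge,
```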


I’ll stop there because that’s where I usually start to lose people (if they’ve made it this far). It gets much more involved than this.


The CliffsNotes version is:

  • Tokens are basic units of data
  • Tokenizers break text into tokens
  • This helps language models find patterns and predict what you want


Hope that helps

Keep ‘em coming

Damien