Answering your questions about business and technology
What is a tokenizer?
~ Anonymous
Hello,
A tokenizer (yes, it’s a word) is a tool that AI systems like ChatGPT, Gemini, etc., use to break text up into tokens. Helpful, right?
I’ll go backwards slightly, since I think it will help. In the AI world, a token is the basic unit of data that a language model can process, analogous to the dollar being the basic unit of currency in the U.S.
This matters because language models recognize patterns in tokens (common sequences of characters) and learn the statistical relationships between tokens.
A different way to say that: language models use tokens to turn language into math. And math is way easier to predict than human language.
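If you want to see what that looks like in the most bare-bones way possible, here’s a toy sketch (nothing like a real model, just counting which token tends to follow which):

```python
from collections import Counter, defaultdict

# Toy example of turning language into math: split a sentence into crude
# word-level tokens, count which token follows which, and use those counts
# to "predict" the next token. Real models learn far richer patterns,
# but the spirit is the same.
tokens = "the cat sat on the mat and the cat slept".split()

next_counts = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    next_counts[current][nxt] += 1

# Given the token "the", which token is statistically most likely to follow?
print(next_counts["the"].most_common(1))  # -> [('cat', 2)]
```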
Here is an example of a piece of AI-generated text and a tokenized version of that same text -
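If you’re curious, you can produce the same kind of thing yourself with OpenAI’s open-source tiktoken library (just one tokenizer of many; the sentence below is one I made up, and you’ll need to pip install tiktoken first):

```python
# A rough sketch using the tiktoken library and one of its built-in vocabularies.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers turn language into math."
token_ids = enc.encode(text)

# The "human" version: each token shown as the chunk of text it covers.
tokens = [enc.decode([tid]) for tid in token_ids]
print(tokens)
# Prints a list of text chunks, something like
# ['Token', 'izers', ' turn', ' language', ' into', ' math', '.']
# (the exact split depends on which tokenizer you use)
```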
That is actually more the “human” version.
The computer will represent each token with a number, so it will look more like this -
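Same sketch as above, just printing the numeric side of it:

```python
# Continuing the sketch: the version the computer actually works with,
# where each token is just its ID number in the tokenizer's vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("Tokenizers turn language into math.")
print(token_ids)
# Prints a list of plain integers, one per token. The exact numbers depend
# entirely on which tokenizer (vocabulary) you use.
```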
I’ll leave it there, because that’s where I usually start to lose people (if they’ve made it this far). It gets much more involved than this…
The CliffsNotes version: a tokenizer breaks your text into tokens, and tokens are what let a language model turn language into math it can work with.
Hope that helps
Keep ‘em coming
Damien