There is barely any field untouched by LLMs right now, and every day we hear more and more about the amazing things they can do. From writing poetry to answering complex questions, it feels like there's no limit to what they can pull off. But here's something we don't often talk about: how these models turn raw probabilities into the final words you see on your screen.
Decoding—the process of converting token probabilities into coherent responses—is a crucial piece of the puzzle. It’s not as simple as just picking the "most likely" next word. Different situations call for different strategies, depending on whether you want something fast, creative, or super-precise. And the best part? We’ve got a lot of clever ways to do this!
In this post, we’ll break it all down:
Let’s dive in!
Okay, imagine this: You ask a model to generate a story, and it predicts a list of possible words for the next step, each with a probability. The easiest approach? Just pick the word with the highest probability every time. Done, right?
Not quite.
Depending on what you’re trying to do, this method might not always give you what you want. For instance:
This is why decoding strategies matter. They shape the final output, balancing things like speed, quality, and creativity. Let’s explore how this works!
First up is greedy decoding. This one's pretty straightforward: at each step, the model picks the token with the highest probability and feeds it back in to generate the next one, the auto-regressive way. Boom, you're done.
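To make this concrete, here's a minimal sketch of a greedy decoding loop using PyTorch and Hugging Face transformers. The model name, prompt, and generation length are just placeholders for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model; any causal LM would work the same way
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Once upon a time"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate up to 20 new tokens
        logits = model(input_ids).logits                 # shape: (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the highest-probability token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # feed it back in (auto-regressive)
        if next_token.item() == tokenizer.eos_token_id:             # stop at end-of-sequence
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

In practice you'd rarely write this loop yourself: calling `model.generate(input_ids, do_sample=False)` gives you the same greedy behavior, but the loop above shows exactly what's happening under the hood.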