There is barely any field untouched by LLMs right now, and every day we hear more and more about the amazing things they can do. From writing poetry to answering complex questions, it feels like there's no limit to what they can pull off. But here's something we don't often talk about: how these models turn raw probabilities into the final words you see on your screen.
Decoding—the process of converting token probabilities into coherent responses—is a crucial piece of the puzzle. It’s not as simple as just picking the "most likely" next word. Different situations call for different strategies, depending on whether you want something fast, creative, or super-precise. And the best part? We’ve got a lot of clever ways to do this!
In this post, we’ll break it all down:
Let’s dive in!
Okay, imagine this: You ask a model to generate a story, and it predicts a list of possible words for the next step, each with a probability. The easiest approach? Just pick the word with the highest probability every time. Done, right?
Not quite.
Depending on what you’re trying to do, this method might not always give you what you want. For instance:
This is why decoding strategies matter. They shape the final output, balancing things like speed, quality, and creativity. Let’s explore how this works!
First up is greedy decoding. This one's pretty straightforward: at each step, the model picks the token with the highest probability and feeds it back in to generate the next one, the auto-regressive way. Boom, you're done.
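To make this concrete, here's a minimal sketch of a greedy decoding loop using PyTorch and Hugging Face transformers. The model name, prompt, and generation length are just placeholders for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model; any causal LM would work the same way
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Once upon a time"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate up to 20 new tokens
        logits = model(input_ids).logits                 # shape: (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the highest-probability token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # feed it back in (auto-regressive)
        if next_token.item() == tokenizer.eos_token_id:             # stop at end-of-sequence
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```

In practice you'd rarely write this loop yourself: calling `model.generate(input_ids, do_sample=False)` gives you the same greedy behavior, but the loop above shows exactly what's happening under the hood.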