In the first part of this series, we explored foundational decoding strategies like greedy methods, randomness injection, and Beam Search. But let's take a step back and look at what we're actually trying to do: produce a coherent answer based on the given query and context. What if we had a smarter decoding method, one that doesn't just look at probabilities one step at a time but also considers the task at hand? As you might have guessed by now, it would involve a second LLM guiding our generation.
Let's look at how we would go about designing such a method. TL;DR: it's aptly called Speculative Decoding: "Generate the sequence auto-regressively using an efficient model, and evaluate it in parallel using the beefy one!"
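To make that one-liner concrete, here is a minimal sketch of the draft-then-verify loop. This is not the exact published algorithm: it uses greedy drafting and greedy verification for clarity, whereas the real method uses a rejection-sampling acceptance rule that preserves the target model's output distribution. The gpt2 / gpt2-large pair and the draft length gamma = 4 are illustrative assumptions.

```python
# A minimal sketch of draft-then-verify speculative decoding. Greedy drafting
# and greedy verification are used for clarity; the published method uses a
# rejection-sampling acceptance rule instead. Model names and gamma are
# illustrative choices, not the ones from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_name, target_name = "gpt2", "gpt2-large"   # assumed small/large pair sharing a tokenizer
tok = AutoTokenizer.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name).eval()
target = AutoModelForCausalLM.from_pretrained(target_name).eval()

@torch.no_grad()
def speculative_step(ids: torch.Tensor, gamma: int = 4) -> torch.Tensor:
    """One round: draft `gamma` tokens cheaply, then check them all with a
    single parallel forward pass of the large target model."""
    # 1) Draft gamma tokens auto-regressively with the small model.
    draft_ids = ids
    for _ in range(gamma):
        logits = draft(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)        # greedy draft
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) One parallel pass of the target model over prompt + drafted tokens.
    target_logits = target(draft_ids).logits
    # The target's own greedy choice at every drafted position.
    target_preds = target_logits[:, ids.shape[1] - 1 : -1, :].argmax(dim=-1)
    drafted = draft_ids[:, ids.shape[1]:]

    # 3) Accept the longest prefix on which draft and target agree.
    agree = (target_preds == drafted).long()[0]
    n_accept = int(agree.cumprod(dim=0).sum())
    if n_accept == gamma:
        # Everything accepted: take a free "bonus" token from the same pass.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        return torch.cat([ids, drafted, bonus], dim=-1)
    # Otherwise keep the accepted prefix plus the target's correction token.
    correction = target_preds[:, n_accept : n_accept + 1]
    return torch.cat([ids, drafted[:, :n_accept], correction], dim=-1)

ids = tok("When water reaches 100 degrees Celsius,", return_tensors="pt").input_ids
for _ in range(5):                       # a few rounds of draft-and-verify
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```

The key point is step 2: the large model scores all gamma drafted tokens in a single forward pass, so its expensive weights are read once per round rather than once per generated token.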
Here’s the ICML talk and the Google Blog for your reference.
But won't using two LLMs double the computation cost? And let's not forget that generation is autoregressive, so the overheads stack up. How can we design around this?
When generating text, it's crucial to recognize that not all words are equally important. Certain words or tokens dictate the flow, structure, and meaning of the generated text, while others play more of a supporting role. For example, consider this question and answer:
What happens when water reaches 100 degrees Celsius? When water reaches 100 degrees Celsius, it boils and turns into steam. This transformation, known as a phase change, marks the transition from liquid to gas.
In this example, generating "100 degrees Celsius" is relatively straightforward because it directly repeats the question and is predictable from the prior context. However, generating the token "boils" is more challenging, as it requires scientific knowledge to identify the correct physical phenomenon. Importantly, "boils" dictates the flow of the entire subsequent sentence, steering the explanation toward phase changes rather than some unrelated topic.
This unequal importance of tokens presents an opportunity: Can we scale compute based on how important the token is for the context?
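As a quick sanity check of this intuition (not part of the decoding algorithm itself), we can ask a small causal LM how much probability it assigns to each token of the example above, given the preceding text; "gpt2" is just an assumed stand-in model here.

```python
# A toy probe of token difficulty: score each token of the example answer by
# the probability a small causal LM assigns to it given the preceding text.
# "gpt2" is an assumed stand-in model, chosen only for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = ("What happens when water reaches 100 degrees Celsius? "
        "When water reaches 100 degrees Celsius, it boils and turns into steam.")
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    probs = model(ids).logits.softmax(dim=-1)

# Probability of each actual token, conditioned on everything before it.
for pos in range(1, ids.shape[1]):
    p = probs[0, pos - 1, ids[0, pos]].item()
    print(f"{tok.decode([ids[0, pos].item()])!r:>14}  p={p:.3f}")
# Tokens that merely echo the question (e.g. " 100", " Celsius") tend to get
# high probability, while knowledge-bearing tokens such as " boils" get less.
```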
Let's take another look at the world of LLMs: typically, when these models are released, they come in multiple versions, often distinguished by their parameter count. For instance, Meta's LLaMA models are available in variations like LLaMA-65B, LLaMA-13B, and LLaMA-7B, each tailored to different resource and performance requirements.
The trade-off between model performance and resource efficiency drives this diversity. Larger models like LLaMA-13B generally exhibit superior performance in terms of fluency, contextual understanding, and generalization. However, they demand significant computational resources, making them impractical for applications where latency and cost are critical factors.
Smaller models like LLaMA-7B, on the other hand, are faster and more resource-efficient but may compromise on the richness of their outputs. They excel in use cases that prioritize speed and lightweight deployment, such as running on edge devices or powering real-time applications.
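To put rough numbers on that trade-off, here is a back-of-the-envelope estimate of the weight memory for a few model sizes in fp16 (weights only; activations and the KV cache are ignored, so treat these as illustrative figures).

```python
# Back-of-the-envelope weight memory in fp16 (2 bytes per parameter),
# ignoring activations and the KV cache -- illustrative numbers only.
BYTES_PER_PARAM_FP16 = 2

for name, n_params in [("LLaMA-7B", 7e9), ("LLaMA-13B", 13e9), ("LLaMA-65B", 65e9)]:
    gib = n_params * BYTES_PER_PARAM_FP16 / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights")
# LLaMA-7B: ~13 GiB, LLaMA-13B: ~24 GiB, LLaMA-65B: ~121 GiB
```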
So we have different versions, but how would we use them efficiently?
Let's look back at the generation task, and how we want to constrain the answer so it stays meaningful: