Dynamic Few-Shot Prompting: Overcoming Context Limit for ChatGPT Text Classification

Iryna Kondrashchenko

The recent explosion in the popularity of large language models like ChatGPT has led to their increased use in classical NLP tasks such as text classification. This involves providing the model with a context (the sample to be classified) and a set of candidate labels to reason about. More precisely, this approach is called zero-shot classification, which implies that the model does not have to be retrained to generalize to new/unseen categories. Naturally, not having to fine-tune a model for a specific task can be highly beneficial from a business perspective, as it significantly reduces development time as well as the additional burden of maintaining custom models.

However, one downside of the zero-shot approach is its limited ability to leverage existing labeled samples (training data). In this article, we will see how to solve this problem using dynamic few-shot prompting, which includes only a relevant subset of the training data in the prompt.

All code examples in this article use the Scikit-LLM library. Please check my previous post for more information.

Recap: Zero-Shot vs Few-Shot prompting

Before diving into dynamic few-shot prompting, let’s briefly recap the concepts of zero-shot and few-shot prompting.

Zero-shot prompting is the method where ChatGPT, or any other language model, is used to classify text without any additional task-specific training. It involves framing a question or task for the model and providing it with options to choose from. Essentially, the model uses the knowledge it was trained with to complete the task.

A very simple zero-shot prompt can look like this:

Input: "The rover landed on Mars after a seven-month journey."
Prompt: "Is this text about Science, Sports, or Politics?"

Under the hood, the `ZeroShotGPTClassifier` from Scikit-LLM also uses zero-shot prompting and allows building an estimator in just three lines of code:

from skllm import ZeroShotGPTClassifier

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)  # X: training texts, y: their labels (used as the set of candidate labels)
labels = clf.predict(X)

On the other hand, few-shot prompting gives ChatGPT several examples alongside the input. This serves to contextualize the task and offer direction for the model’s responses. Seeing these examples, the model understands the expected output, which improves its performance across a range of tasks.

Example prompt:

Example 1:
Input: "The athlete won a gold medal at the Olympics."
Output: "Sports"

Example 2:
Input: "The legislation was passed after a long debate in the Senate."
Output: "Politics"

Example 3:
Input: "The discovery of the Higgs boson at CERN marked a milestone in particle physics."
Output: "Science"

Task:
Input: "The rover landed on Mars after a seven-month journey."
Prompt: "Is this text about Science, Sports, or Politics?"

Example code:

from skllm import FewShotGPTClassifier

clf = FewShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X, y)  # every (X, y) pair is included in the prompt as an example
labels = clf.predict(X)

Dynamic Few-Shot prompting

While few-shot prompting looks great on paper, as it allows the use of information from the training dataset to make predictions, it has significant scalability issues.

To understand the problem, let’s take a sample from the previous few-shot prompt.

Input: "The athlete won a gold medal at the Olympics."
Output: "Sports"

If we pass this through the web interface of the OpenAI tokenizer, we can see that the text corresponds to 19 tokens.

Why is this important?

Firstly, modern LLMs have limited context length. For example, gpt-3.5-turbo, the most popular OpenAI model, has a context limit of 4096 tokens, while most of the current generation open-source models are limited to 2048 tokens. Given that our sample was 19 tokens long, we could provide at most 215 samples to gpt-3.5-turbo or 107 samples to an open-source LLM like LLaMA. In real scenarios, this number will be even lower since we do not account for additional tokens consumed by the prompt itself. Additionally, the text samples that need to be classified are often much longer.
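
We can reproduce this back-of-the-envelope estimate programmatically. Below is a minimal sketch using the tiktoken library (the exact token count may differ slightly depending on the tokenizer version):

import tiktoken

example = 'Input: "The athlete won a gold medal at the Olympics."\nOutput: "Sports"'

# Tokenizer used by gpt-3.5-turbo
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens_per_example = len(enc.encode(example))

context_limit = 4096  # gpt-3.5-turbo context window
print(tokens_per_example)                   # roughly 19 tokens
print(context_limit // tokens_per_example)  # rough upper bound on the number of examples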

Even if the context size was unlimited, processing longer prompts would require more computational resources, which is usually associated with higher financial costs.

A very natural solution to the problem is to use only a subset of the training data for the prompt itself.

This is exactly what `DynamicFewShotGPTClassifier` does. During inference, for each unlabeled datapoint, it dynamically selects N training examples from each class to be used in the prompt.

from skllm import DynamicFewShotGPTClassifier

clf = DynamicFewShotGPTClassifier(n_examples=3)  # 3 nearest training examples per class
clf.fit(X, y)
labels = clf.predict(X)

To better understand how it works, let’s consider a toy example where the goal is to figure out whether the person is talking about books or movies.

from skllm import DynamicFewShotGPTClassifier

X = [
    "I love reading science fiction novels, they transport me to other worlds.",
    "A good mystery novel keeps me guessing until the very end.",
    "Historical novels give me a sense of different times and places.",
    "I love watching science fiction movies, they transport me to other galaxies.",
    "A good mystery movie keeps me on the edge of my seat.",
    "Historical movies offer a glimpse into the past.",
]

y = ["books", "books", "books", "movies", "movies", "movies"]

query = "I have fallen deeply in love with this sci-fi book; its unique blend of science and fiction has me spellbound."

clf = DynamicFewShotGPTClassifier(n_examples=1).fit(X, y)

prompt = clf._get_prompt(query)
print(prompt)

These are the examples that were automatically picked by the classifier to be included in the prompt:

Sample input:
```I love reading science fiction novels, they transport me to other worlds.```

Sample target: books


Sample input:
```I love watching science fiction movies, they transport me to other galaxies.```

Sample target: movies

Notice how both examples are clearly similar to the query: the person is talking about the science-fiction genre in all cases.

But how exactly does it select examples dynamically based on the new input?

This is achieved by adding a classical KNN-like algorithm as an additional preprocessor. If we assume that the most relevant examples are the most similar ones, then the problem reduces to a nearest neighbors search and can be tackled in three steps:

(1) Vectorization

Before doing the nearest neighbors search, the training set must be embedded into fixed-dimensional vectors. This can easily be achieved using the OpenAI embedding API (or any other alternative).
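
As a rough sketch of this step, embeddings can be obtained with the OpenAI Python client (this assumes a recent openai package and an OPENAI_API_KEY environment variable; the embedding model name is just one possible choice, and Scikit-LLM handles this internally):

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    # Embed a list of texts into fixed-dimensional vectors
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

train_embeddings = embed(X)  # X is the training set defined earlier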

(2) Index construction

While any nearest neighbor algorithm can be used for this task, even the brute force option from scikit-learn, it is important to keep in mind the scalability aspects, as we are dealing with very high-dimensional data and potentially a large number of samples as well. Luckily, there are plenty of tools that handle these scenarios extremely well. One example is Annoy, a library for fast approximate nearest neighbor search developed by Spotify. By constructing the index once during training, it is possible to perform fast neighbor searches during inference.
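
For illustration, a minimal Annoy index over the training embeddings (built once at fit time) could look like the sketch below; the variable names reuse the embedding sketch above and are not part of the Scikit-LLM API:

from annoy import AnnoyIndex

dim = train_embeddings.shape[1]

# Build the index once during training
index = AnnoyIndex(dim, "angular")  # angular distance works well for text embeddings
for i, vector in enumerate(train_embeddings):
    index.add_item(i, list(vector))
index.build(10)  # number of trees; more trees -> better accuracy, slower build

# Fast approximate nearest neighbors search during inference
neighbor_ids = index.get_nns_by_vector(list(embed([query])[0]), 3)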

(3) Balanced sampling

The last thing to be accounted for is class balancing. If only the N nearest neighbors are selected for few-shot prompting, there is a very high risk that some of the classes will be underrepresented or missing completely. To mitigate this issue, instead of creating a single index, the training data is partitioned by class. In this way, we are able to sample N examples from each class, ensuring equal representation of every class.
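
Putting the three steps together, a simplified version of the selection logic might look like the sketch below (it reuses the embed() helper from above; the function names are illustrative and do not mirror Scikit-LLM's internal implementation):

from collections import defaultdict
from annoy import AnnoyIndex

def build_class_indexes(X, y, embeddings):
    # One index (and one list of texts) per class, so that every class
    # can contribute its own nearest neighbors to the prompt
    indexes, texts_by_class = {}, defaultdict(list)
    for text, label, vector in zip(X, y, embeddings):
        if label not in indexes:
            indexes[label] = AnnoyIndex(len(vector), "angular")
        indexes[label].add_item(len(texts_by_class[label]), list(vector))
        texts_by_class[label].append(text)
    for index in indexes.values():
        index.build(10)  # number of trees
    return indexes, texts_by_class

def select_examples(query_vector, indexes, texts_by_class, n_examples=1):
    # Take the n_examples most similar training samples from each class
    selected = []
    for label, index in indexes.items():
        for i in index.get_nns_by_vector(list(query_vector), n_examples):
            selected.append((texts_by_class[label][i], label))
    return selected

For the toy example above, calling select_examples(embed([query])[0], ...) with n_examples=1 would return one "books" and one "movies" sample, matching the prompt we inspected earlier.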

Conclusion

In this article, we explored the concept of dynamic few-shot prompting, an enhancement of the zero-shot and few-shot prompting approaches for text classification with large language models like ChatGPT. We learned how it utilizes existing labeled data to improve classification accuracy by dynamically selecting relevant examples to include in the prompt. By leveraging a KNN-like algorithm, dynamic few-shot prompting strikes a balance between the inherent capacity of language models to generalize from limited examples and the need to process large-scale data in practical applications.

Furthermore, it is an efficient method that adapts to the constraints of context length in current language models and optimizes the usage of computational resources.
