6  Prompt engineering

Prompt engineering is the name given to the nascent discipline of designing and optimizing prompts for different tasks, overcoming or navigating around the many limitations that LLMs currently have. While some believe that prompt engineering will become less and less relevant as LLMs get more advanced (and “smart”), I still think some basic principles and techniques will remain useful, simply because no matter how smart an agent is, there are better and worse ways to give instructions.

There are literally thousands of prompt engineering guides and cheatsheets published online, many of which don’t generalize well or become irrelevant as new LLMs are released. Yet the vast majority of prompt hacks you’ll find online boil down to a few key principles. If you master these principles, you can design an optimized prompt for any task with some focused experimentation.

For this reason, in this chapter we won’t dwell on specific prompting templates for concrete tasks. Instead, we’ll concentrate on the most important high-level principles and the most general techniques that apply across tasks and domains.

Principles for effective prompt engineering

The goal of prompt engineering is to design an effective prompt for a given task. This is necessary because LLMs have inherent limitations and design caveats that make them brittle and prone to fail on an otherwise solvable task if the wrong instructions are given. And wrong doesn’t necessarily mean wrong in any objective sense, just not adjusted to the limitations of the LLM you are using.

Thus, to come up with principles for effective prompt engineering, it pays to briefly revisit some of the main limitations of LLMs we saw in Chapter 4. Please read that chapter first for the full picture.

The following principles of prompt engineering stem from the basic structure of statistical language modeling: next-token prediction. An LLM is ultimately an autocompletion machine on steroids, based on word-context correlations learned from data, and this realization informs the principles below.

Context matters

Since every new token generated is conditioned on the previously generated tokens, the response you get for a given prompt will be heavily influenced by the content of the prompt, down to the exact phrasing you use. Though LLMs are notoriously good at capturing the underlying semantics of text, the actual words, the tone, the style, and even the exact order in which you construct the prompt may determine the quality of the response.

Current research suggests, for example, that LLMs tend to focus more on the beginning and final parts of the prompt and less on the middle, although this may change rapidly as novel architectures are invented. But, regardless of the idiosyncrasies of specific models, the critical insight here is that whatever you put in the prompt will heavily influence the response, so everything important should be explicitly mentioned.

Focus matters

For the same reason, it is notoriously hard for LLMs to perform many tasks at once. The more focused the instructions are, the better (and more robust) the output you’ll get. As we saw in Chapter 4, LLMs are weak reasoners and struggle with complicated conditional instructions. Some bigger models may be able to deal with a larger degree of flexibility, but in general you should avoid writing prompts with conditions, and instead make them as straightforward as possible.

For example, you can try prompting “if the user says X then reply Y, otherwise reply Z”, hoping the LLM will correctly classify the input and choose the right response in the same API call. This might work in many cases, depending on how good the LLM is. But the problem is that you’re asking the LLM to solve two problems at once. Instead, you can solve the same problem more robustly with two sequential calls: first ask the LLM to classify the user query, and then pick the corresponding prompt for that category, so the LLM never has to explicitly choose.
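Here is a minimal sketch of this two-step pattern, assuming a generic llm() helper that stands in for whichever chat completion client you use; the category labels and prompt wording are purely illustrative.

```python
# Hypothetical helper that sends a prompt to your LLM provider of choice
# and returns the generated text; replace it with your actual client call.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

# Step 1: a focused classification prompt with a constrained output space.
CLASSIFY_PROMPT = (
    "Classify the following customer message into exactly one category: "
    "refund, shipping, or other. Reply with the category name only.\n\n"
    "Message: {message}"
)

# Step 2: one dedicated prompt per category, so each call solves a single problem.
RESPONSE_PROMPTS = {
    "refund": "Write a polite reply explaining our refund policy to this customer: {message}",
    "shipping": "Write a reply with shipping status instructions for this customer: {message}",
    "other": "Write a brief, helpful reply to this customer message: {message}",
}

def answer(message: str) -> str:
    category = llm(CLASSIFY_PROMPT.format(message=message)).strip().lower()
    template = RESPONSE_PROMPTS.get(category, RESPONSE_PROMPTS["other"])
    return llm(template.format(message=message))
```

Note that the second call never sees the branching logic at all: by the time it runs, the condition has already been resolved in plain code.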

Reasoning requires verbosity

As we saw in Chapter 2, LLMs perform a fixed amount of computation per token. This applies to both input and generated tokens. Thus, intuitively, a larger, more detailed prompt will tend to produce a better response, especially when some complicated reasoning is involved.

But crucially, this also applies to the output generated by the LLM. The more it talks, the more computation it performs in total. For this reason, asking for very terse output is often not optimal. Instead, your prompt should encourage the LLM to be verbose, explain its reasoning, summarize its key points before reaching a conclusion, and so on.

But more is not always better

However, keep in mind that just stuffing the prompt with redundant information or instructions is not always better. The information density in the context matters more than its raw length.

You should strive for a minimum valuable prompt: the shortest prompt that contains the necessary amount of information to produce a successful response. To achieve it, consider making your instructions more intentional, using more precise wording and avoiding vague terms. Provide informative examples where necessary.

But crucially, do not add irrelevant instructions, as LLMs are lousy at ignoring things. A typical issue is adding negative examples to correct some behavior, only to discover the LLM doubles down on it. The reason is simple: everything in the context influences what the LLM generates, so negative examples are still examples, and they will bias the sampling process towards similar content.

Experimentation rules

All of the above being said, prompt engineering is still mostly artisanal, and far from an established science. For this reason, no amount of theory can replace good old experimentation. You should try different orderings of the instructions, different output formats, and different writing styles, and see which gives you better results.

While you should write prompts that are, in general, agnostic to the specific LLM you’re using, keep in mind that the optimally-tuned prompt for one model, say GPT-4, might not be the universally best prompt. Different LLMs, trained on different datasets and tuned with different strategies, may have subtle differences that make one perform better with terse instructions while another prefers verbosity. This can come down to the actual choice of words: replacing a single word with an appropriate synonym may very well improve results significantly.
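When you do experiment, it helps to compare prompt variants systematically rather than eyeballing individual responses. The sketch below runs two hypothetical prompt variants against a tiny labelled sample; the llm() helper, the prompts, and the test cases are all placeholders for your own setup.

```python
# Hypothetical llm() helper standing in for your actual client; the prompts
# and test cases below are illustrative only.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

PROMPT_VARIANTS = {
    "terse": "Sentiment of this review, one word (positive/negative): {text}",
    "verbose": (
        "You will read a product review and decide whether its overall sentiment "
        "is positive or negative. Answer with exactly one word, 'positive' or "
        "'negative'.\n\nReview: {text}"
    ),
}

# A tiny labelled sample; in practice use a representative evaluation set.
TEST_CASES = [
    ("Absolutely loved it, works perfectly.", "positive"),
    ("Broke after two days, very disappointed.", "negative"),
]

for name, template in PROMPT_VARIANTS.items():
    hits = sum(
        llm(template.format(text=text)).strip().lower() == label
        for text, label in TEST_CASES
    )
    print(f"{name}: {hits}/{len(TEST_CASES)} correct")
```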

These principles are high-level insights that should inform how you approach prompt engineering. But keep in mind everything we know about large language models is changing very rapidly at the moment, and many of their current limitations could be fixed or at least reduced considerably with newer models, making some of these principles less relevant in the near future.

Prompt engineering techniques

In the following sections, we will explore specific prompt engineering techniques or patterns that are general enough to be valuable in many contexts and domains. Each of these patterns and techniques is informed by the aforementioned principles, and we will discuss why they seem to work within this framework.

Zero-shot instructions

Zero-shot means providing the model with a single instruction and asking it to solve a problem without any additional training data or examples. This should be the baseline for any new application, and it is useful for complex or novel tasks where there is no existing data to draw from.

Zero-shot learning works by leveraging the model’s ability to generalize from a single instruction. By providing a clear and specific prompt, the model can use its internal knowledge and understanding to generate a solution without needing additional training data.

Some examples of zero-shot learning are:

  • Generating product descriptions for new or unique products.
  • Translating text between languages without parallel data.
  • Summarizing long documents or articles into concise overviews.
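As a minimal sketch of the last use case, the prompt below is a single self-contained instruction with no examples attached; the llm() helper is a hypothetical stand-in for your chat completion client.

```python
# Hypothetical llm() helper standing in for your chat completion client.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

# A single, self-contained instruction: no examples, just a clear task description.
ZERO_SHOT_PROMPT = (
    "Summarize the following article in three sentences, keeping only the main "
    "findings and omitting background details.\n\n"
    "Article:\n{article}"
)

article = "..."  # the document you want summarized
print(llm(ZERO_SHOT_PROMPT.format(article=article)))
```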

Few-shot learning

Few-shot learning involves adding a small set of examples to the model’s input to help it generalize better. This technique is particularly useful for rare or ambiguous queries, as it allows the model to learn from a small number of examples.

Few-shot learning works by providing the model with a few examples of similar queries and their corresponding answers. This helps the model learn from these examples and apply the knowledge to new, unseen queries. The few-shot approach also reinforces the output format, thus improving fidelity.

Some examples of few-shot learning are:

  • Solving tasks that aren’t easy to explain concisely.
  • Reinforcing an output format or response style.
  • Generating recommendations of products, activities, etc., based on examples.
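The sketch below shows the basic shape of a few-shot prompt: a couple of worked examples demonstrate both the task and the expected output format before the new input is appended. The llm() helper, tickets, and format are illustrative.

```python
# Hypothetical llm() helper standing in for your chat completion client;
# the tickets and output format are made up for illustration.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

# A handful of worked examples demonstrating both the task and the output format.
FEW_SHOT_PROMPT = """Extract the product and the complaint from each support ticket.

Ticket: The blender arrived with a cracked lid.
Product: blender | Complaint: cracked lid

Ticket: My headphones stopped charging after a week.
Product: headphones | Complaint: stopped charging

Ticket: {ticket}
"""

print(llm(FEW_SHOT_PROMPT.format(ticket="The kettle leaks from the base.")))
```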

Role playing

Role playing involves informing the model about the audience, tone, role, and other context-specific details to bias it towards a specific complexity level, length, or style. This technique is useful for generating responses that are tailored to a specific audience or context.

Role playing works by providing the model with context-specific information that helps it generate responses that are more relevant and engaging to the target audience. By understanding the audience and context, the model can tailor its responses to meet their needs and expectations.

Examples of role playing include:

  • Writing dialogue for characters with distinct personalities and speaking styles.
  • Generating social media posts tailored to different demographics and platforms.
  • Drafting emails with appropriate tone and formality for different recipients.
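A common way to apply role playing is through a system message that fixes the persona before the user query arrives. The sketch below assumes a hypothetical llm_chat() helper that accepts role-tagged messages, similar in shape to most chat completion APIs; the persona and question are illustrative.

```python
# Hypothetical llm_chat() helper that accepts role-tagged messages, similar in
# shape to most chat completion APIs; the persona and question are illustrative.
def llm_chat(messages: list[dict]) -> str:
    raise NotImplementedError("wire this to your chat completion client")

messages = [
    {
        "role": "system",
        "content": (
            "You are a patient science teacher explaining concepts to "
            "ten-year-olds. Use short sentences, simple words, and one "
            "everyday analogy per answer."
        ),
    },
    {"role": "user", "content": "Why is the sky blue?"},
]

print(llm_chat(messages))
```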

Chain of thought

Chain of thought involves asking the model to output a detailed reasoning process before providing the final answer. This technique is useful for complex queries that require multi-step reasoning or problem-solving.

Chain of thought works by forcing the model to explicitly demonstrate its thought process. This helps ensure that the model’s responses are based on sound reasoning and logic, making them more accurate and trustworthy.

Examples where chain of thought is useful include:

  • Solving logic puzzles and brain teasers by breaking down the steps.
  • Providing step-by-step instructions for complex procedures or recipes.
  • Analyzing data to draw insights and conclusions.
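A chain-of-thought prompt typically asks for the reasoning first and the answer last, which also makes the final answer easy to parse out. The sketch below uses the same hypothetical llm() helper; the problem and the "Answer:" convention are illustrative.

```python
# Hypothetical llm() helper standing in for your chat completion client.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

COT_PROMPT = (
    "Solve the following problem. Think step by step: list the relevant facts, "
    "work through the intermediate calculations, and only then state the final "
    "result on a separate line starting with 'Answer:'.\n\n"
    "Problem: A train leaves at 9:40 and the trip takes 2 hours and 35 minutes. "
    "When does it arrive?"
)

response = llm(COT_PROMPT)

# The reasoning stays in the response, but the final line can be parsed out
# if only the answer is needed downstream.
final = [line for line in response.splitlines() if line.startswith("Answer:")]
print(final[-1] if final else response)
```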

Structured output

Structured output involves instructing the model to produce the output in a structured format, such as JSON. This technique is useful for applications that require structured data, such as database queries or data analysis.

Structured output works by simplifying the parsing of the response and allowing for easier integration with downstream applications. By providing a structured format, the model can generate responses that are easily consumable and actionable.

Examples where structured output is useful:

  • Generating tabular data like schedules, calendars, or price lists.
  • Producing API responses in a standardized JSON format.
  • Extracting structured information like addresses, dates, or product details from text.
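The sketch below shows the extraction case: the prompt pins down the exact JSON keys expected, and the response is parsed with a fallback in case the model strays from the format. The llm() helper, schema, and message are illustrative assumptions.

```python
import json

# Hypothetical llm() helper standing in for your chat completion client;
# the schema and message are illustrative.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

EXTRACT_PROMPT = (
    "Extract the order details from the message below and return them as JSON "
    'with exactly these keys: "product", "quantity", "delivery_date". '
    "Return only the JSON object, with no extra text.\n\n"
    "Message: {message}"
)

raw = llm(EXTRACT_PROMPT.format(message="Send me 3 desk lamps by March 12."))
try:
    order = json.loads(raw)
except json.JSONDecodeError:
    order = None  # in practice: retry, or ask the model to fix its own output
print(order)
```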

Self-reflection

Self-reflection involves asking the model to evaluate its own response and determine if, given new context, it would change it. This technique is useful for identifying and correcting errors or inconsistencies in the model’s original output.

Self-reflection works by allowing the model to assess its own responses and identify potential errors or inconsistencies. By reflecting on its own output, the model can refine its responses and improve their accuracy and fidelity.

Examples of using self-reflection include:

  • Identifying biased or unethical statements in the model’s own outputs.
  • Detecting logical inconsistencies or contradictions in generated text.
  • Refining responses based on feedback or additional context provided.
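In its simplest form, self-reflection is a two-pass pattern: generate a draft, then feed it back with a critique instruction. The sketch below assumes the same hypothetical llm() helper; the question and wording are illustrative.

```python
# Hypothetical llm() helper standing in for your chat completion client.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

question = "List three health benefits of drinking seawater."

# First pass: get a draft answer.
draft = llm(question)

# Second pass: ask the model to critique and, if needed, correct its own draft.
REFLECT_PROMPT = (
    "Here is a question and a draft answer.\n\n"
    "Question: {question}\n\nDraft answer: {draft}\n\n"
    "Check the draft for factual errors, contradictions, or unsafe advice. "
    "If you find any, write a corrected answer; otherwise repeat the draft unchanged."
)

final = llm(REFLECT_PROMPT.format(question=question, draft=draft))
print(final)
```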

Ensembling

Ensembling involves combining the output of several models and asking a final model to produce a consensus answer. This technique is useful for improving the overall accuracy and fidelity of the response.

Ensembling works by leveraging the strengths of multiple models to generate a more accurate and reliable response. By combining the output of multiple models, ensembling can reduce the impact of individual errors and improve the overall quality of the response.

Examples of ensembling include:

  • Combining outputs from models with different specialties or training data.
  • Aggregating responses from models with different decoding strategies.
  • Leveraging models with different strengths to produce well-rounded outputs.
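The sketch below illustrates the basic ensembling loop: collect one candidate answer per model, then ask a final model to merge them into a consensus. The llm() helper, the model identifiers, and the question are placeholders, not real model names.

```python
# Hypothetical llm() helper that also takes a model name; the model identifiers
# and question are placeholders.
def llm(prompt: str, model: str) -> str:
    raise NotImplementedError("wire this to your chat completion client")

MODELS = ["model-a", "model-b", "model-c"]
question = "What are the trade-offs between SQL and NoSQL databases?"

# Collect one candidate answer per model (or per decoding configuration).
candidates = [llm(question, model=m) for m in MODELS]

# Ask a final model to merge the candidates into a consensus answer.
consensus_prompt = (
    f"Below are several candidate answers to the same question.\n"
    f"Question: {question}\n\n"
    + "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    + "\n\nWrite a single consensus answer that keeps the points the candidates "
    "agree on and drops claims that appear in only one candidate."
)

print(llm(consensus_prompt, model=MODELS[0]))
```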

Conclusions

Prompt engineering is a nascent discipline, and much is still unknown about which of these ideas are core principles and which are merely caveats and consequences of the current state of development in LLMs. This means you should be aware that many of these ideas or principles might not remain relevant in the medium term, as novel models and architectures are invented.

Prompt engineering is a very powerful and qualitatively new software development pattern. You can now program a computer to solve novel problems with reasonable effectiveness using natural language! But, as usual, there is no free lunch. The main limitations of Large Language Models (LLMs) in prompt engineering stem from their inherent design caveats and constraints.

LLMs can be brittle and prone to failure on tasks if the instructions provided are not aligned with the model’s capabilities. This means that even solvable tasks can fail if the wrong instructions are given, highlighting the importance of crafting prompts that suit the specific LLM being used. LLMs also struggle with complex reasoning and intricate conditional instructions, which makes it challenging for them to handle multiple instructions or conditions within a single prompt.

LLMs operate with a fixed amount of computation per token, which applies to both input and output tokens. This constraint implies that a more detailed and verbose prompt tends to produce better responses, especially for tasks involving complex reasoning, and that overly terse prompts may not yield optimal results. However, some balance is important, because overly verbose prompts can be more confusing than informative.

Prompt engineering is still an evolving and artisanal process, lacking a standardized approach. Experimentation is essential to determine the optimal prompt for a specific task and LLM. Different LLMs trained on diverse datasets may respond better to varying prompt styles, making it necessary to experiment with different strategies to find the most effective approach.

These limitations underscore the complexity and nuances involved in designing prompts for LLMs, highlighting the need for careful consideration and adaptation to maximize the performance of these models in various tasks and domains. In Part 3 of the book we will apply all of these prompt engineering techniques to concrete problems, and we will have plenty of time to explore the contexts in which they perform well.