Secure LLM Applications

Will BoslerJuly 1, 2024

Deploying LLM applications safely requires an understanding of the ways in which these systems can be exploited. An understanding of prompt hacking techniques, which can be loosely categorised into Prompt Injection and Jailbreaking methods. As well as implications for data privacy, provides a bedrock for designing remedies and safeguards.

Prompt Hacking refers to attacks intended to circumvent the instructions provided by a development team for the LLM. These attacks can range from simple instructions to elaborate systems like recursive prompt hacking and technical exploits that utilise anomalous tokens.

A concern exists for creators of LLM applications and those training models on proprietary data, regarding the ability for end users to recover this data via exploits. To bound this risk we can consider Discoverable Memorization, which provides an upper bound on the training data that could theoretically be recovered. The practical amount of training data that can be discovered is referred to as Extractable Memorization and reflects what an independent attacker could efficiently recover. Both methods involve the development of prompts that can be used to recover example data. The former involves the tester having full knowledge of the training data, the latter none.

Developing a culture of internal red teaming, where testers are aware of exploit methods, enables efficient discovery of vulnerabilities prior to release. Once discovered, these can be mitigated via architecture design, including the deployment of detectors and guardrails.

In the below slides and video walkthrough we discuss all of these concepts.

Click to progress through slides or use < and > arrow keys

Language Model Security