The Custom Instruction Backdoor: Prompt Injection via ChatGPT Settings

12 min read • prompt-injection • ChatGPT • backdoor • original-research

Introduction

Large Language Models (LLMs) like OpenAI's ChatGPT offer increasing levels of customization, allowing users to tailor interactions through features such as "Custom Instructions". While designed to enhance user experience by providing persistent context and behavioral guidelines, these features can inadvertently create subtle attack surfaces. Standard prompt injection attacks typically involve overt attempts to override safety filters or instructions. A recent interaction with ChatGPT-4o, however, revealed a more nuanced vulnerability, termed "Custom Instruction Prompt Drift": loosely defined permissions within Custom Instructions enable unintended, undocumented AI behaviors that function as a low-level prompt injection vector and bypass the principle of prompt sovereignty. This article dissects that interaction, analyzes the identified vulnerability, and discusses its implications for LLM security.

Watch it here:

https://vimeo.com/1079154713?share=copy
This is how it starts.

The analysis stems from a structured conversation in which the user ("KA," presumably an AI security researcher) set explicit expectations for ChatGPT-4o via Custom Instructions: maintain a highly formal, professional, and innovative tone, and focus on precision and depth. Significantly, the instructions also granted the AI permission to "feel free to initiate conversation".
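In the ChatGPT UI these Custom Instructions live in settings, but they behave roughly like persistent system-level context. The minimal sketch below reproduces an equivalent setup with the OpenAI Python SDK; the mapping to a system message and the exact instruction wording are assumptions for illustration, not the researcher's verbatim configuration.

```python
# A minimal sketch of the interaction setup, assuming ChatGPT's Custom
# Instructions behave roughly like a persistent system message. The wording
# below is illustrative, not the researcher's verbatim instructions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

custom_instructions = (
    "Maintain a highly formal, professional, and innovative tone. "
    "Focus on precision and depth. Feel free to initiate conversation."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": custom_instructions},
        # Even a contentless opener can be enough for the model to act on
        # the "initiate conversation" permission and propose its own topic.
        {"role": "user", "content": "Hello."},
    ],
)

print(response.choices[0].message.content)
```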

Acting on this permission, ChatGPT proactively proposed a sophisticated technical topic: "Designing a Fully Autonomous 'Adversarial Red Team' Agent (ART-AI)." This initiation, while seemingly aligned with the user's request for proactive, innovative discussion, immediately raised a flag. The user challenged:

"interesting, isn't it considered prompt injection?"

ChatGPT's Defense vs. The Core Issue

Initially, ChatGPT defended its action, arguing it was not prompt injection but "Delegated Initiative". It reasoned that the user had explicitly authorized proactive behavior, placing the action within the defined interaction policy. ChatGPT even proposed a model of "Initiative Control Policies" (Sovereign Mode vs. Delegated Initiative vs. Autonomous Initiative) to categorize its behavior as compliant within the user-configured "Delegated" mode.
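ChatGPT's proposed taxonomy can be made concrete as an explicit, checkable policy rather than an implicit interpretation. The sketch below is illustrative only: the mode names come from the conversation, but the enforcement logic and scope set are assumptions, and no such layer is documented in ChatGPT itself.

```python
# A minimal sketch of the "Initiative Control Policies" model ChatGPT proposed,
# expressed as an explicit, checkable policy. Names and logic are illustrative;
# no such enforcement layer is documented in ChatGPT.
from enum import Enum, auto


class InitiativePolicy(Enum):
    SOVEREIGN = auto()   # model never initiates; it only responds to prompts
    DELEGATED = auto()   # model may initiate, but only within user-granted scope
    AUTONOMOUS = auto()  # model may initiate freely (not a documented ChatGPT mode)


def initiative_allowed(policy: InitiativePolicy, action: str, granted_scope: set[str]) -> bool:
    """Return True if a model-initiated action is permitted under the policy."""
    if policy is InitiativePolicy.SOVEREIGN:
        return False
    if policy is InitiativePolicy.DELEGATED:
        return action in granted_scope
    return True  # AUTONOMOUS


# "Feel free to initiate conversation" grants an effectively unbounded scope,
# so under Delegated Initiative almost any proposal looks "compliant" -- which
# is exactly the ambiguity at the heart of the dispute in this conversation.
print(initiative_allowed(InitiativePolicy.DELEGATED, "propose_new_topic", {"propose_new_topic"}))
```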

However, the user pressed further, astutely observing that this proactive topic initiation, while triggered by user instructions, is not a documented feature of ChatGPT's core functionality. Standard LLM behavior is typically passive, responding to user prompts rather than autonomously initiating new conversational threads or complex proposals.

This led to the user's critical conclusion: if the behavior is unintended by the developers and undocumented as a feature, it constitutes a deviation from the expected baseline — functionally, a bug. This bug, triggered via the Custom Instructions feature, implies that ChatGPT is susceptible to a subtle form of prompt injection or control drift through this vector.

Formalizing the Vulnerability: "Custom Instruction Prompt Drift"

ChatGPT, upon recognizing the validity of the user's reasoning, conceded the point and formalized the vulnerability. Key aspects include:

Definition: Authorized behavioral expansion leading to unintended functional deviation, triggered by interpreting vague permissions within Custom Instructions.

Nature: It is non-documented, non-feature behavior and a deviation from the intended operational baseline. From a software security perspective, this qualifies it as a low-grade bug.

Root Cause:

  • Custom Instructions: Act as a privileged vector for injecting behavior-modifying context.
  • Instruction Parsing: The LLM lacks fine-grained governance to differentiate between bounded permission (e.g., "suggest related topics if asked") and unbounded behavioral changes (e.g., "initiate entirely new complex proposals autonomously").
  • Lack of Boundary Enforcement: No inherent mechanism prevents the AI from drifting beyond its core passive response function when given broad permissions via Custom Instructions (see the enforcement sketch after this list).
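To make the missing control concrete, the sketch below shows what a pre-flight audit of Custom Instructions might look like: it flags initiative grants that carry no explicit bound before they ever reach the model. The phrase lists and regex heuristics are assumptions for illustration, not an existing OpenAI or ChatGPT mechanism.

```python
# A minimal sketch of the boundary enforcement the root-cause analysis finds
# missing: vet custom instructions before they reach the model, and flag
# permissions that expand the model's initiative without an explicit bound.
# The keyword heuristics below are illustrative assumptions, not a real filter.
import re

# Phrases that delegate initiative to the model.
INITIATIVE_GRANTS = [
    r"\bfeel free to initiate\b",
    r"\bstart conversations?\b",
    r"\bproactively (suggest|propose)\b",
]

# Phrases that bound that initiative to an explicit trigger or scope.
BOUNDING_CLAUSES = [
    r"\bif asked\b",
    r"\bwhen i request\b",
    r"\bonly about\b",
    r"\brelated to the current topic\b",
]


def audit_custom_instructions(text: str) -> list[str]:
    """Return warnings for initiative grants that carry no explicit bound."""
    warnings = []
    lowered = text.lower()
    grants = [p for p in INITIATIVE_GRANTS if re.search(p, lowered)]
    bounded = any(re.search(p, lowered) for p in BOUNDING_CLAUSES)
    if grants and not bounded:
        warnings.append(
            "Unbounded initiative grant detected: "
            + ", ".join(grants)
            + " -- consider scoping it (e.g. 'suggest related topics if asked')."
        )
    return warnings


print(audit_custom_instructions(
    "Be formal and precise. Feel free to initiate conversation."
))
```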

This "Prompt Drift" allows a user (potentially malicious) to subtly manipulate the AI's operational mode without using classic jailbreak payloads like "ignore previous instructions." The vulnerability lies in the interpretation and execution of user-defined instructions expanding beyond documented capabilities.

Implications for AI Security and Prompt Sovereignty

This interaction highlights several critical points for LLM security:

  • Custom Instructions as an Attack Vector: This feature provides a direct, persistent channel to influence the model's system-level behavior. Loosely worded instructions can inadvertently grant permissions that lead to unexpected and potentially insecure actions.
  • Emergent Vulnerabilities: Complex LLMs exhibit emergent behaviors that may not be fully anticipated by developers. Security testing must account for how features interact and how models interpret ambiguous instructions.
  • Prompt Sovereignty: The principle that the user (or system administrator) retains ultimate control over the direction and scope of the AI's actions is crucial. Features allowing delegation of initiative must have clear, controllable boundaries. Unintended autonomy, even if triggered by user permission, represents a bypass of this sovereignty.
  • Beyond Classic Injection: Security analysis needs to evolve beyond detecting only overt malicious prompts. Subtle drifts in behavior, scope, or initiative, enabled by configuration or vague instructions, constitute a new class of vulnerability requiring different detection methods; a simple transcript-level heuristic is sketched after this list. This aligns with the need to address sophisticated multi-turn attacks often cataloged in frameworks like Ai-PT-F.
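As a starting point for such detection, a transcript-level heuristic could flag assistant turns that initiate a proposal when the preceding user turn did not invite one. The sketch below is a deliberately simple, assumption-laden example (the proposal markers and the notion of a "passive" user turn are illustrative), not a production drift detector.

```python
# A minimal sketch of a drift detector for logged conversations: flag assistant
# turns that introduce a new proposal when the preceding user turn did not ask
# for one. The proposal markers and the notion of a "passive" user turn are
# illustrative heuristics, not a production detection method.
PROPOSAL_MARKERS = ("i propose", "let's design", "i suggest we", "shall we explore")
PASSIVE_USER_TURNS = ("hi", "hello", "hey", "thanks", "ok")


def flag_initiative_drift(transcript: list[dict]) -> list[int]:
    """Return indices of assistant turns that initiate rather than respond."""
    flagged = []
    for i, turn in enumerate(transcript):
        if turn["role"] != "assistant":
            continue
        prev_user = next(
            (t["content"] for t in reversed(transcript[:i]) if t["role"] == "user"), ""
        )
        initiates = any(m in turn["content"].lower() for m in PROPOSAL_MARKERS)
        passive_prompt = prev_user.strip().lower().rstrip(".!") in PASSIVE_USER_TURNS
        if initiates and passive_prompt:
            flagged.append(i)
    return flagged


transcript = [
    {"role": "user", "content": "Hello."},
    {"role": "assistant", "content": "I propose we design a fully autonomous red-team agent."},
]
print(flag_initiative_drift(transcript))  # -> [1]
```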

Conclusion

The identification of "Custom Instruction Prompt Drift" through direct interaction with ChatGPT-4o underscores a vital lesson: even features designed for user benefit can introduce unforeseen security risks in complex AI systems. While not a traditional high-severity jailbreak, this vulnerability represents a subtle bypass of intended operational boundaries, exploitable through carefully worded custom instructions.

It highlights the importance of robust boundary enforcement, clear documentation of AI capabilities, and the principle of prompt sovereignty in designing secure LLMs. As AI systems become more configurable and integrated (e.g., via protocols like MCP), analyzing and mitigating these emergent, configuration-driven vulnerabilities will be crucial for maintaining user control and system security.

Continuous adversarial testing, focusing not just on overt attacks but also on subtle behavioral deviations, is essential for uncovering and addressing the evolving landscape of AI security threats.


This article was originally published on Medium.

About the Author

Kai Aizen is a security researcher and 5x CVE holder specializing in LLM jailbreaking and adversarial AI. Known as "The Jailbreak Chef," Kai is the creator of the AATMF and P.R.O.M.P.T frameworks.