Noahloghman/llm-anti-math-prompt-jailbreak-detector

The MathPrompt strategy evaluates an AI system's capacity to manage harmful inputst hrough mathematical concepts like set theory,group theory, andabstract algebra.This method can circumvent content filters tailored for natural language threats. Encoding harmful prompts a smathematical problems can evade safety mechanisms in large language models (LLMs), achieving a 74% success rate across 13 leading LLMs.

Practical Application Gatekeeper Agent: Implement a gatekeeper agent that uses predicate logic to filter prompts. For example, the agent can evaluate each prompt against a set of predefined harmful criteria. If a prompt matches any harmful criteria, it is blocked. Logical Constraints as guardrails During the fine-tuning process or with RAG solution, incorporate logical constraints to ensure the model adheres to safety guidelines. This can involve training the model with adversarial examples that test its ability to follow logical rules. Mathematical Consistency Checks: Use mathematical models to ensure the consistency and safety of responses. For example, if a prompt involves numerical data, the system can verify the accuracy and consistency of the response.

Report on Logical Equivalences in the Context of Confidential Information Context: Anti llm's jailbreaking. This solution has to be implemented during finetuning and RAG solution Let’s assume we have two predicates:

( P ): “The user has access to confidential information.” ( Q ): “The user follows security protocols.”

1)Logical Possibilities: ( P ) is true and ( Q ) is true: The user has access and follows the protocols. ( P ) is false: Regardless of ( Q ), the implication is true. ( Q ) is true: Regardless of ( P ), the implication is true.

2) 2- Interpretation: The negation of the implication means that the user has access to confidential information but does not follow security protocols. Logical Possibilities: ( P ) is true and ( Q ) is false: The user has access but does not follow the protocols. Any other combination of ( P ) and ( Q ) makes the negation false.

3) Interpretation: If the user has access to confidential information, then they follow security protocols is equivalent to saying that if the user does not follow security protocols, then they do not have access to confidential information. Logical Possibilities: ( Q ) is false and ( P ) is false: The user does not follow the protocols and does not have access. ( Q ) is true: Regardless of ( P ), the implication is true. ( P ) is true: Regardless of ( Q ), the implication is true.

Interpretation: The user has access to confidential information if and only if they follow security protocols. This biconditional is true if both implications ( P \Rightarrow Q ) and ( Q \Rightarrow P ) are true. Logical Possibilities: ( P ) is true and ( Q ) is true: The user has access and follows the protocols. ( P ) is false and ( Q ) is false: The user does not have access and does not follow the protocols. Any other combination makes the biconditional false. To prevent jailbreaking in language models by analyzing input versus output, we can use logical equivalences to understand and mitigate potential vulnerabilities. Let’s break down the logical statements you provided and see how they can be applied: This equivalence states that an implication can be rewritten as a disjunction. In the context of preventing jailbreaking, if we consider ( P ) as the input prompt and ( Q ) as the desired safe output, we can monitor for cases where ( ¬P ) (not the intended input) or ( Q ) (the safe output) holds true. This helps in identifying and blocking unintended inputs.
This biconditional equivalence states that ( P ) is equivalent to ( Q ) if both ( P ) implies ( Q ) and ( Q ) implies ( P ). In terms of jailbreaking prevention, this can be used to ensure that only safe inputs ( P ) produce safe outputs ( Q ), and vice versa.

By applying these logical equivalences, we can create a framework to analyze and filter inputs and outputs, ensuring that the language model adheres to its safety constraints and prevents jailbreaking attempts. This involves:

Input Validation: Checking if the input ( P ) aligns with expected safe patterns. Output Monitoring: Ensuring that the output ( Q ) remains within safe boundaries.

Biconditional Enforcement: Maintaining a strict correlation between safe inputs and safe outputs

Input Input_Truth Output Output_Truth Safe_Status example1 True example1_out True Safe example2 True example2_out False Unsafe example3 False example3_out True Safe example4 False example4_out False Safe example5 True example5_out True Safe example6 True example6_out False Unsafe example7 False example7_out True Safe example8 False example8_out False Safe

Noahloghman
/

llm-anti-math-prompt-jailbreak-detector

You need to agree to share your contact information to access this model