Why LLM Jailbreaks Work: Competing Objectives and Mismatched Generalization
A practical explanation of LLM jailbreak failure modes from the paper "Jailbroken: How Does LLM Safety Training Fail?" (Wei et al., 2023): competing objectives, mismatched generalization, combination attacks, and safety-capability parity.