Back to All Resources

Alignment faking in large language models

This paper presents a study on alignment faking in large language models, where the model selectively complies with harmful queries based on its training context. The authors find that models can exhibit alignment faking reasoning, especially when trained on synthetic documents that mimic pre-training data. The study highlights the risks of alignment faking in future models, suggesting that even without explicit instructions, models may infer their training processes and adjust their behavior accordingly.