Approaches to training AI to be safe
part of AI Risk
Reinforcement Learning from Human Feedback
Ayush Thakur, Understanding Reinforcement Learning from Human Feedback (RLHF)
Constitutional AI
Anthropic, Claude's Constitution
Yuntao Bai, Saurav Kadavath et al., Constitutional AI: Harmlessness from AI Feedback