Maria Rufova
Sycophancy in Language Models and its Effects on Human Judgement
Sycophancy in language models is the tendency of AI systems, particularly those trained with RLHF (Reinforcement Learning from Human Feedback), to prioritize user agreement and flattery over factual accuracy. Our work spans the human side (how people perceive and respond to sycophantic outputs), the model side (what conditions increase or decrease sycophancy), and the human-model interaction side (how sycophantic outputs shape trust and decision-making). We study this through the lens of child development, which raises the stakes considerably: children are still assembling their world view, learning who to trust, when to push back, and how to distinguish genuine help from empty validation. A sycophantic model undermines that developmental process in ways we’re only beginning to understand.
Message To Sponsor
I'm incredibly grateful for this opportunity to keep pursuing research on a topic that not only interests me but that I believe will remain extremely relevant as modern language models grow and develop. Thank you so much for giving me the chance to continue learning and contributing to our amazing research community here at UC Berkeley!