"Evaluating LLMs is a minefield" by Arvind Narayanan & Sayash Kapoor: "In short, many things can go wrong when we are trying to evaluate LLMs’ performance on a certain task or behavior in a certain scenario. It has big implications for reproducibility: both for research on LLMs and research that uses LLMs to answer a question in social science or any other field." Great slide deck!