News
ARENA 9.0: Call for Applicants - Less Wrong
2+ hour, 20+ min ago (770+ words) Apply here to participate in ARENA 9.0 before 11:59pm on Sunday July 12th, 2026 (anywhere on Earth). Being situated at LISA brings several benefits to participants, such as productive discussions about AI safety research agendas, allowing participants to form a better picture of what working...
Perspectives on Continual Learning: Survey Results and Forecasts - Less Wrong
1+ day, 1+ hour ago (1047+ words) This is the fifth post in the sequence Implications of Continual Learning for LLM Agents. ...
Expert Views on Continual Learning: Survey Results and Forecasts - Less Wrong
1+ day, 1+ hour ago (1010+ words) This is the fifth post in the sequence Implications of Continual Learning for LLM Agents. ...
Reasoning and learning about injected concepts in language models - Less Wrong
1+ day, 13+ hour ago (1105+ words) This work was done as a part of SPAR, under the mentorship of Mirko Bronzi and Damiano Fornasiere. TL;DR ...
Advocates Can Influence LLM Values By Editing Wikipedia - Less Wrong
3+ day, 1+ hour ago (203+ words) This article is a summary of an original study: Brazilek, J., Navas, M., & Gnauck, A. (2026). Small edits, large models: How Wikipedia advocacy shape...
A Mechanistic Explanation of Prompt Injection (and why you should study roles) - Less Wrong
3+ day, 3+ hour ago (1399+ words) Summary * We've been building a theory of how prompt injections work under the hood. * We show it comes down to how LLMs perceive roles (the humble...
A Theory of Prompt Injection (and why you should study roles) - Less Wrong
3+ day, 3+ hour ago (1399+ words) Summary * We've been building a theory of how prompt injections work under the hood. * We show it comes down to how LLMs perceive roles (the humble...
NLA explanations can be shortened without harming reconstruction - Less Wrong
3+ day, 16+ hour ago (67+ words) Natural language autoencoders are a really cool mostly-unsupervised method for producing free-form text explanations of LLM activations. You should r...
The one-week sprint - Less Wrong
6+ day, 5+ hour ago (441+ words) Recently I've been working in one-week sprints, and I've really enjoyed it! Tl;dr I need to do a lot of creative knowledge work, and have recently fallen into a routine which IMO is pretty good at facilitating that. Monday...
Reinforcement learning towards broadly and persistently beneficial models - Less Wrong
6+ day, 19+ hour ago (375+ words) This is an unofficial automated linkpost. We find that reinforcement learning on realistic scenarios targeting beneficial traits can produce broad improvements across dozens of benchmarks measuring aligned and beneficial behavior. These alignment gains generalize beyond the domains used for training...