Can we reverse engineer transformer language models into human-understandable computer programs? Inspired by the Distill Circuits Thread, we're going to try.
We think interpretability research benefits a lot from interactive articles (see Activation Atlases for a striking example). Previously we would have submitted to Distill, but with Distill on hiatus, we're taking a page from David Ha's approach of simply creating websites (e.g., World Models) for research projects.
As part of our effort to reverse engineer transformers, we've created several resources besides our paper which we hope will be useful. We've collected them on this website, and may add more content here in the future, possibly including collaborations with other institutions.