Small rant : Basically, the title. Instead of answering every question, if it instead said it doesn’t know the answer, it would have been trustworthy.
Small rant : Basically, the title. Instead of answering every question, if it instead said it doesn’t know the answer, it would have been trustworthy.
It’s right in the research I was mentioning:
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Find the section on the model’s representation of self and then the ranked feature activations.
I misremembered the top feature slightly, which was: responding “I’m fine” or gives a positive but insincere response when asked how they are doing.