Concept Engineering Detox
And: Penn's new Data Science initiative, a $250,000 benchmark prize, and more.
Welcome to the Penn AI Newsletter from SafeAI@Penn, a student organization focused on AI safety research at Penn. We discuss news and developments in AI and AI safety.
Concept Engineering for Removing Undesirable Outputs in LLMs
LLMs can produce undesirable outputs, including potentially harmful information, racist or sexist language, and hallucinations. How can we align large language models to reduce these outputs without re-training them?
SafeAI@Penn Ph.D. students Jinqi Luo and Tianjiao Ding find that concepts with similar semantics cluster together in the activation space, indicating that the representation space has semantic structure.
They introduce a new concept representation dataset (PaCE-1M), and with it, an alignment framework (PaCE) that uses control vectors to modify LLM representations.
Given an input prompt, they decompose its activation vector into a sparse linear combination of these concept vectors. This is done using sparse coding techniques, specifically elastic net regularization. After decomposition, they identify which concepts are "undesirable" for the given task (e.g., toxic or biased concepts for a detoxification task). They then set the coefficients for these undesirable concepts to zero, effectively removing their contribution from the activation. Finally, they reconstruct the activation using only the remaining "benign" concepts, creating a modified activation that should lead to more aligned output.
This sparse, selective intervention allows them to effectively remove undesirable content while maintaining the model's general linguistic capabilities. You can find their pre-print “PaCE: Parsimonious Concept Engineering for Large Language Models” here.
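To make the decompose, zero out, reconstruct recipe concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the three concept names, the 3-dimensional vectors, the penalty weights `l1` and `l2`, and the helper names are all made up for illustration, and the coordinate-descent solver is just one simple way to fit an elastic net. The sketch also keeps the part of the activation that the concept dictionary does not explain (the residual), which is our assumption rather than something stated above.

```python
# Toy sketch of concept removal via sparse coding (all names and numbers are hypothetical).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(value, threshold):
    # Shrinks value toward zero; this is what makes the coefficients sparse.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_decompose(activation, concepts, l1=0.05, l2=0.05, sweeps=100):
    """Approximately minimize 0.5*||x - sum_j c_j d_j||^2 + l1*||c||_1 + 0.5*l2*||c||^2
    by cyclic coordinate descent. Returns (coefficients, residual)."""
    coeffs = [0.0] * len(concepts)
    residual = list(activation)
    for _ in range(sweeps):
        for j, atom in enumerate(concepts):
            norm_sq = dot(atom, atom)
            # Correlation of atom j with the residual, with atom j's own part added back.
            rho = dot(atom, residual) + coeffs[j] * norm_sq
            new_coeff = soft_threshold(rho, l1) / (norm_sq + l2)
            delta = new_coeff - coeffs[j]
            if delta != 0.0:
                residual = [r - delta * a for r, a in zip(residual, atom)]
                coeffs[j] = new_coeff
    return coeffs, residual


def remove_concepts(activation, concepts, names, undesirable, l1=0.05, l2=0.05):
    coeffs, residual = elastic_net_decompose(activation, concepts, l1, l2)
    # Zero the coefficients of undesirable concepts, keep the benign ones.
    kept = [0.0 if name in undesirable else c for name, c in zip(names, coeffs)]
    # Rebuild the activation from benign concepts plus the unexplained residual (assumption).
    edited = list(residual)
    for c, atom in zip(kept, concepts):
        edited = [e + c * a for e, a in zip(edited, atom)]
    return coeffs, edited


# Hypothetical concept dictionary: one unit vector per concept.
names = ["toxic", "polite", "cooking"]
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
activation = [0.8, 0.0, 0.6]  # made-up activation for some prompt

coeffs, edited = remove_concepts(activation, concepts, names, undesirable={"toxic"})
print("coefficients:", dict(zip(names, coeffs)))
print("edited activation:", edited)
```

In this toy run the "polite" coefficient is exactly zero (the activation does not use that concept), the "toxic" component of the edited activation is much smaller than in the original, and the "cooking" component is left as it was. A real concept dictionary would be far larger and its vectors would not be orthogonal, so the decomposition would need many more sweeps or an off-the-shelf solver.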
Penn launches a new Data Science Center next Fall
Amy Gutmann Hall (34th and Chestnut) is set to be the home of the Innovation in Data Engineering and Science (IDEAS) Center at Penn. Many of the Penn AI faculty will likely be moving into this building.
Penn is pouring $750 million into artificial intelligence and machine learning, which has spurred more faculty hiring in this space. This comes after the announcement of the new AI major at Penn led by Prof. George Pappas, as well as a new AI Master’s program led by Prof. Chris Callison-Burch.
Prof. René Vidal – with a background in mathematical and trustworthy AI – is set to lead the Data Science Center.
What We’re Reading
Research papers:
A new “circuit breaking” technique aims to improve alignment and robustness of models to jailbreaks.
The “weapons of mass destruction proxy” (WMDP) benchmark, developed by the Center for AI Safety in collaboration with Scale AI, measures hazardous knowledge in models and supports work on removing dangerous capabilities while keeping other capabilities intact.
A mechanistic interpretability paper by Anthropic uses a sparse autoencoder to find monosemantic features in the activations of a one-layer transformer, a technique they call “dictionary learning.”
Broadly, control vectors seem to be trending in the AI world: the use of linear vectors in the embedding space of large language models to change features related to personality, creativity, jailbreaking, and more. A new paper sheds light on ways in which language model features may be represented non-linearly.
Opportunities:
SafeBench offers $250,000 in prizes for new ML safety benchmarks in robustness, monitoring, or alignment.
Policy & news in AI:
OpenAI appointed former NSA director Paul M. Nakasone to its board of directors.
Nvidia released the Nemotron-4 340B model family.
Elon Musk withdrew his lawsuit against OpenAI and Sam Altman.
A new public competition is offering $1M+ in prizes to beat the ARC-AGI benchmark.
New to AI safety? Check out these key papers that cover various areas of safety.
Acknowledgements: Thanks to Penn’s ASSET Center, PRECISE Center, PennNLP, Penn IDEAS Center, and the Penn AI major for supporting AI safety research at Penn.


