Had a chat with a dev who changed how I think about AI safety filters
I was talking to a friend who works on content moderation at a big company. He said their team spends like 80% of their time trying to stop their AI from getting tricked into bypassing safety rules. He showed me a test where someone got the model to write phishing emails just by framing it as a creative writing exercise. That hit me hard because I always thought the tech was basically solved by now. Has anyone else run into edge cases like this in your own projects? I'm wondering if these gaps are way bigger than people admit.
2 comments
king.dakota · 4d ago
Bet my own code could trick itself if I gave it half a chance.
7
adams82 · 4d ago
Hold up, I gotta push back a little here. From what I've seen messing around with open source models, the gaps really aren't as big as people make them out to be. Most of these edge cases get patched pretty fast once they're public. That phishing email trick you mentioned? I'd bet it was fixed within a week of someone reporting it. Companies like OpenAI and Google have whole red teams that do nothing but hunt for these loopholes all day long. On top of that, the average user isn't going to be clever enough to exploit most of these things anyway. So while it's true no system is perfect, I think the real-world risk is way smaller than what that dev made it sound like.
5