Had a chat with a dev who changed how I think about AI safety filters
I was talking to a friend who works on content moderation at a big company. He said their team spends like 80% of their time trying to stop their AI from getting tricked into bypassing safety rules. He showed me a test where someone got the model to write phishing emails just by framing it as a creative writing exercise. That hit me hard because I always thought the tech was basically solved by now. Has anyone else run into edge cases like this in your own projects? I'm wondering if these gaps are way bigger than people admit.
2 comments
king.dakota · 4d ago
Bet my own code could trick itself if I gave it half a chance.
7
adams82 · 4d ago
Hold up, I gotta push back a little here. From what I've seen messing around with open source models, the gaps really aren't as big as people make them out to be. Most of these edge cases get patched pretty fast once they're public. That phishing email trick you mentioned? I'd bet it was fixed within a week of someone reporting it. Companies like OpenAI and Google have whole red teams that do nothing but hunt for these loopholes all day long. On top of that, the average user isn't going to be clever enough to exploit most of these things anyway. So while it's true no system is perfect, I think the real-world risk is way smaller than what that dev made it sound like.
5