0
Had a data scientist tell me 'just feed it more data' and that backfired hard
I was training a small NLP model for classifying customer complaints in my ecommerce side gig. A buddy from a meetup in Austin said 'just dump more raw chat logs in there, more data always helps.' So I added 2000 unlabeled logs from our support queue. Accuracy dropped by 12% and the model started flagging compliment emails as angry rants. Took me 3 days to clean that mess up. Has anyone else gotten bad advice from people who sound confident but haven't actually built something?
2 comments
webb.stella 4d ago
Oh man, I saw a buddy do the exact same thing with his review classifier and it wrecked his model too.
6
holly47 4d ago
Oh man, that's rough. I've been there too. People who say "just add more data" have clearly never had to deal with garbage in, garbage out. I had a friend who tried training a model on raw support logs and ended up classifying "thanks for your help" as a complaint. The issue is that raw, unlabeled logs are full of noise and don't tell the model which category each message belongs to, so they can't teach it the patterns you actually need. You have to clean and label your data properly or it will just confuse the model. Took me a few weeks to learn that lesson the hard way too.
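A rough sketch of what "clean and label first" could look like before anything gets merged into the training set. This is only an illustration, not anyone's actual pipeline: the `clean_labeled_logs` helper, the `text`/`label` keys, and the two label names are all made up for the example.

```python
def clean_labeled_logs(logs, allowed_labels):
    """Keep only non-empty, de-duplicated logs that carry a known label.

    `logs` is assumed to be a list of dicts with hypothetical
    "text" and "label" keys; adjust to however your data is stored.
    """
    seen = set()
    cleaned = []
    for entry in logs:
        text = (entry.get("text") or "").strip()
        label = entry.get("label")
        # Skip empty messages and anything unlabeled or with an unknown label.
        if not text or label not in allowed_labels:
            continue
        # Treat case-insensitive repeats of the same message as duplicates.
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"text": text, "label": label})
    return cleaned


# Made-up sample data, purely for illustration.
raw_logs = [
    {"text": "Thanks for your help!", "label": None},
    {"text": "My order never arrived.", "label": "complaint"},
    {"text": "my order never arrived.", "label": "complaint"},
    {"text": "   ", "label": "complaint"},
    {"text": "Great service, thank you", "label": "compliment"},
]

cleaned = clean_labeled_logs(raw_logs, {"complaint", "compliment"})
for row in cleaned:
    print(row["label"], "|", row["text"])
```

Only the rows that survive the filter would go into training; the unlabeled "Thanks for your help!" message stays out until someone labels it.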
0