Imagine this: You are accused of cheating at your job. But you are categorically NOT cheating. Nevertheless, you are fired.
This absolute horror scenario is what happened to writer Michael Berben (not his real name). Michael has been a freelancer for the past three years, contributed to many different publications, and written about 200 articles over the course of his career.
Michael reached out to us, and I’d like to share his story. Because what happened to Michael is probably happening to a lot of creative workers right now. And it will happen to many more in the future: questionable AI detectors are literally destroying people’s livelihoods.
When writers become collateral damage
Like many writers, Michael has followed the advent of ChatGPT with curiosity on the one hand, and worry on the other. Curious, because ChatGPT might help writers create better articles. Worried, because what if ChatGPT could replace writers entirely in the future?
What he wasn’t worried about: that ChatGPT might get him accused of cheating. Yet that is exactly what happened.
One day, Michael's main client informed him that they had started using an AI detector, and the results were supposedly damning: his most recent articles were flagged as 95% likely to be AI-generated. The client then ran all of his previous articles through the tool, many of them written before ChatGPT was even widely available, and notified Michael that every one of them showed a 65-95% likelihood of being AI-generated. They terminated his contract with immediate effect, a decision based solely on a single number (or range) that the AI detector spat out.
Michael tried everything he could to prove his articles were not AI-generated. He even gave his client access to the full Google Docs history and walked them through his writing process, all edits included. But the seed of doubt the AI detector had sown had already taken root. Michael lost his main client, and with it most of his income.
A number of things about this story are problematic, and I’d like to go over them one by one:
1. The accuracy of general AI detectors is questionable
General AI detection is flawed. Period. It’s not like the detection works almost all the time and Michael's case is one of the few very unfortunate outliers. No, false positives are the norm when it comes to general AI detection.
Even OpenAI stopped offering its own detector for this reason:
"The AI classifier is no longer available due to its low rate of accuracy." OpenAI, creator of ChatGPT, on its own detector
False positives happen because the style of a text is very hard to interpret, and style ultimately says nothing about whether an AI/LLM wrote the text or not. Sure, ChatGPT defaults to a certain style, but that can easily be changed with a bit of prompt engineering.
At the same time, a writer might just coincidentally have a writing style similar to ChatGPT's, even though they do all the work manually. For these reasons alone, trying to detect AI-written content based on individual pieces is futile.
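To see why, it helps to run the numbers. The following back-of-the-envelope calculation is a sketch with assumed figures (95% sensitivity, 95% specificity, and 10% of scanned articles actually being AI-written; no real detector publishes numbers in this form), and it shows that even a detector that sounds impressively accurate will falsely accuse a large share of the writers it flags:

```python
# Back-of-the-envelope: how often is a flagged article actually AI-written?
# All three numbers below are assumptions for illustration, not measurements
# from any specific detector.

sensitivity = 0.95  # P(flagged | article is AI-written)
specificity = 0.95  # P(not flagged | article is human-written)
base_rate = 0.10    # share of scanned articles that really are AI-written

# Bayes' rule: P(AI-written | flagged)
p_flagged = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
precision = (sensitivity * base_rate) / p_flagged

print(f"P(AI-written | flagged) = {precision:.0%}")
# Prints roughly 68%: about 1 in 3 flagged writers would be falsely accused,
# even with a detector that is "95% accurate" on both classes.
```

The lower the share of genuinely AI-written content in the pool, the worse this gets: at a 1% base rate, the same detector's flags would be wrong more than 80% of the time.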
2. AI detectors advertise with cherry-picked numbers
Given general AI detection doesn’t work well, you might think that companies offering their services in that area would be cautious. The opposite appears to be the case. Here are various claims found on AI detector websites:
“99% accuracy”
“99.1% accuracy”
“98% accuracy”
How do they even come up with these numbers? At Authory, we looked into AI detection ourselves, and we concluded that trying to detect AI-written content based on individual pieces does not work. So how come others are confident enough to claim things like “99% accuracy”?
Well, as so often in life, it depends on what exactly you are looking at. If your dataset is small enough (say, only 2,000 articles), chances are high that it contains data that is very easy to distinguish from AI-written content. That could be due to the topics covered, or to a very distinct writing style in the dataset. An accuracy figure based on such a small dataset is basically meaningless.
I am sure you have seen skincare adverts claiming things such as “83% felt their skin to be smoother.” Sounds great, right? The fine print then usually says something like “based on a sample of 40 respondents.” With such a small sample size, it is fairly easy to end up with a group that reports smoother skin after applying your cream. Do this with a sample size of 1,000 respondents (or, in the case of AI detectors, 100,000 articles) and you’ll be a lot more credible, as the quick calculation below shows.
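Here is a rough sketch of the statistics involved. The sample sizes are taken from the examples above, and the normal approximation to the binomial is a standard textbook shortcut, not anything a detector vendor publishes. The margin of error on an accuracy estimate shrinks with the square root of the sample size:

```python
import math

def margin_of_error(p: float, n: int) -> float:
    """Approximate 95% margin of error for an observed proportion p
    measured on n test items (normal approximation to the binomial)."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (40, 2_000, 100_000):
    print(f"n = {n:>7,}: 99% accuracy \u00b1 {margin_of_error(0.99, n):.2%}")

# n =      40: 99% accuracy ± 3.08%
# n =   2,000: 99% accuracy ± 0.44%
# n = 100,000: 99% accuracy ± 0.06%
```

And that is the optimistic view: the calculation assumes the test set was sampled honestly, and does not account for cherry-picking which articles go into it in the first place.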
3. Writers have no way to defend themselves against the AI black box
According to the AI detector his client was using, Michael’s texts were between 65% and 95% likely to be AI-generated, depending on the piece. Every seasoned writer will tell you that kind of accusation hurts, and it hurts badly. But how do you prove the AI detector wrong? You can show the research that went into your pieces, you can highlight your writing style, and you can explain your process. Having to go through all of this is pretty painful. But of course, Michael did it.
Desperate, Michael even reached out to the company providing the AI detector. To no avail. They simply told him to set up an account with them, buy some credits, and test his articles again himself. At the end of the day, nothing changed the fact that an all-encompassing metric was being thrown at him that said: you are cheating.
The story's sad end
I wish I could say that Michael's story had a happy ending after all. But it did not. His client didn’t change their mind. They were terrified of Google downranking AI-generated content, and given the seemingly clear results from their AI checker, they couldn’t be convinced otherwise. Michael lost his client for good.