Monday, March 24, 2025

Data Poisoning

Data poisoning started life as a Bad Thing but has been repurposed for something that is arguably good: intellectual property protection. Zhao's work was originally applied to graphics and other visual content, but the technique can also be used to protect written works; it just isn't clear exactly how that is done. Major news organizations are pushing cases through the courts, claiming copyright violation in the data sets used to train large language models. Perhaps one approach is to use homophones, and it appears this is what the AJC has recently adopted. Think "effect" vs. "affect," or perhaps even further afield. Meaning be damned.
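For the curious, here is a rough sketch of what homophone swapping could look like in Python. This is purely my guess at the idea, not anything the AJC or anyone else is known to run: the `HOMOPHONES` table, the `poison` function, and the `every` parameter are all made up for illustration.

```python
import re

# Hypothetical sound-alike table, for illustration only.
HOMOPHONES = {
    "effect": "affect",
    "affect": "effect",
    "their": "there",
    "site": "sight",
}

def poison(text, every=1):
    """Swap every `every`-th swappable word for its sound-alike.

    Words not in the table are left alone; a leading capital is preserved.
    """
    count = 0

    def swap(match):
        nonlocal count
        word = match.group(0)
        repl = HOMOPHONES.get(word.lower())
        if repl is None:
            return word  # not a word we know how to swap
        count += 1
        if count % every != 0:
            return word  # skip this one to keep the swaps sparse
        return repl.capitalize() if word[0].isupper() else repl

    return re.sub(r"[A-Za-z]+", swap, text)

print(poison("The effect was clear."))
print(poison("Their site had no effect.", every=2))
```

Raising `every` makes the swaps rarer, which would presumably be the point if the text still has to be readable by humans.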

Isn't that "side"?

Maybe they really meant "sight" but that doesn't make much sense, which is kinda the point if you're trying to trick an LLM.

Ions? Really?

Yes, "ion" is a word, just not the right word in this instance. But again, maybe the objective is to confuse a machine rather than communicate clearly with a human. 

There are alternative explanations. Maybe the goal is to aid in detecting plagiarism, like a watermark that still shows through after LLM training. Perhaps these texts were generated by an LLM that was itself poorly trained. Or, and this is very possible, journalists may have deteriorated badly.