Publishers target Common Crawl in battle over AI training data

Danish media outlets have demanded that the non-profit web archive Common Crawl remove copies of their articles from past data sets and immediately stop crawling their websites. The request was issued amid growing outrage over the way artificial intelligence companies such as OpenAI use copyrighted material.

Common Crawl plans to comply with the request issued Monday. Executive Director Rich Skrenta says the organization is “not equipped” to fight media companies and publishers in court.

The Danish Rights Alliance (DRA), an organization representing copyright holders in Denmark, led the campaign. It made the request on behalf of four media outlets, including Berlingske Media and the daily newspaper Jyllands-Posten. The New York Times made a similar request of Common Crawl last year, before filing a lawsuit against OpenAI for using its work without permission. In its complaint, The New York Times highlighted how data from Common Crawl was the most "weighted data set" in GPT-3.

Thomas Heldrup, DRA’s head of content protection and enforcement, says the new effort was inspired by the Times. “Common Crawl is unique in that we see a lot of the big AI companies using their data,” Heldrup says. He sees its corpus as a threat to media companies trying to negotiate with AI giants.

Although Common Crawl has been essential to the development of many text-based generative AI tools, it was not designed with AI in mind. Founded in 2007, the San Francisco-based organization was best known for its value as a research tool before the AI boom. “Common Crawl is caught up in this conflict about copyright and generative AI,” says Stephan Back, a data analyst at the Mozilla Foundation who recently published a report on the role of Common Crawl in AI training. “For many years it was a small project that almost no one knew about.”

Common Crawl did not receive a single request to redact data before 2023. Now, in addition to the requests from The New York Times and this group of Danish publishers, it is also fielding a growing number of requests that have not been made public.

In addition to this sharp increase in requests to redact data, Common Crawl’s web crawler, CCBot, is also being increasingly blocked from collecting new data from publishers. According to AI detection startup Originality AI, which tracks the use of web crawlers, more than 44 percent of the top global news and media sites block CCBot. Aside from BuzzFeed, which began blocking it in 2018, most of the major outlets it analyzed, including Reuters, The Washington Post, and the CBC, only began blocking the crawler in the past year. “They’re being blocked more and more,” says Back.
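
Publishers typically shut out a crawler like CCBot through their sites’ robots.txt files. As a rough, hypothetical sketch (the example.com domain and the surrounding setup are placeholders, not drawn from this article), a few lines of Python using the standard library’s robotparser can check whether a given site’s robots.txt disallows the CCBot user-agent:

from urllib.robotparser import RobotFileParser

site = "https://example.com"  # placeholder domain; substitute any publisher site
parser = RobotFileParser()
parser.set_url(site + "/robots.txt")
parser.read()  # downloads and parses the site's robots.txt file

# "CCBot" is the user-agent string Common Crawl's crawler identifies itself with
if parser.can_fetch("CCBot", site + "/"):
    print("robots.txt allows CCBot to crawl", site)
else:
    print("robots.txt blocks CCBot from crawling", site)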

Common Crawl’s quick compliance with these requests is driven by the realities of keeping a small nonprofit afloat. But compliance does not amount to ideological agreement. Skrenta views the push to remove archival materials from data repositories like Common Crawl as nothing less than an affront to the internet as we know it. “It’s an existential threat,” he says. “They will kill the open web.”
