Unethically Trained. Not Again!

PLUS: Apple Responds

IN THIS ISSUE

  • AI Companies Used YouTube Videos to Train AI

  • Apple Responds To The Controversy

TOP PICKS


AI Companies Used YouTube Videos to Train AI

The videos were used despite YouTube's rules against harvesting its content without permission, and the creators whose videos were taken were never informed.

Subtitles from 173,536 YouTube videos across 48,000 channels were compiled into a dataset called YouTube Subtitles, which has been used for AI training by companies including Anthropic, Nvidia, Apple, and Salesforce.

This violates YouTube’s rules, which prohibit using content from its platform without permission. The creators were never told their original work was being harvested, and even videos that have since been deleted ended up in the training data.

This report was released by Proof News and co-published with Wired.

Why It Matters

High-quality data is the gold standard for training AI models. It is critical for beating the competition, and AI companies are doing everything they can to source this data directly from consumers. [Remember Adobe?]

Subtitles capture how people actually talk. They can help train models on speech-to-text data and teach them to imitate how humans interact with one another through speech.

The YouTube Subtitles dataset includes video transcripts from the European Parliament, English Wikipedia, Khan Academy, MIT, Harvard, the BBC, Jimmy Kimmel Live, The Late Show With Stephen Colbert, and many more.

The controversy arises for two main reasons:

  1. Increased safety concerns

Some content promoted conspiracy theories such as the ‘flat-Earth theory’. Developers at Salesforce found that the training data contained colorful language, gender bias, racial slurs, and biases against certain religious groups. Including such content for model training can introduce serious safety concerns.

  2. Creators getting robbed of a livelihood

Creators invest time, resources, money, and creative energy into their content. Using it without permission or fair remuneration is akin to stealing in broad daylight.

There are already whispers (and controversies) about major studios leaning on AI tools to replicate the voices and likenesses of voice artists and actors.

Creators are reacting strongly against this ‘theft’, calling it exploitation.

The report states that Apple has used this dataset to train OpenELM as it gears up to add AI features to its products.


Apple Responds To The Controversy

Apple says OpenELM was created to contribute to the research community and advance open-source large language model development. It was built purely for research purposes, published as open source, and is available to everyone.

It does not power Apple Intelligence or any of its AI or machine learning features.

Why It Matters

Apple has positioned itself around privacy, an area where other AI companies are struggling. It doesn't look good if Apple gets caught using online creators' content to train its own models without consent, all while passing it off as "privacy".

Apple previously announced that its Apple Intelligence models were trained “on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web crawler.” Website owners must explicitly opt out if they do not want the crawler using their site.
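For site owners, that opt-out lives in the site's robots.txt file. A minimal sketch, assuming Apple's documented Applebot-Extended agent token (the control Apple publishes for keeping content out of AI training):

  # Tell Apple not to use this site's content for AI training
  User-agent: Applebot-Extended
  Disallow: /

Blocking Applebot-Extended excludes a site's content from foundation-model training, while the regular Applebot crawler can still index it for features like Siri and Spotlight search.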

Apple has clearly stated, "We do not use our users’ private personal data or user interactions when training our foundation models."

