Experiments In Subverting the Content ID Algorithm

February 1, 2019

I know I'm not the only person who felt a sense of loss when humankind finally accrued the resources and developed the algorithms necessary to detect whether or not somebody was illicitly backing their YouTube videos with music from Universal Music Group. In 2006 it appeared that rampant file sharing had effectively merc'd the ability of the RIAA to enforce its will, and anyone could sample "Everyday I'm Hustlin'" at any time, with total impunity. We'd hustled into an age where anybody with Windows Movie Maker and a Linkin Park mp3 could shamelessly create content with said music as the soundtrack without having to pay an annual $4,000 fee for the license to do so.

With Google's continued development of the Content ID algorithm that scans all uploaded video for fingerprints identifying copyrighted content (music, video, and even video game footage), we lost the anarchic fad of appropriating the digital intellectual property of others for our own use that prevailed for about a decade. According to their 2010 explainer video on the algorithm, YouTube has poured over $100,000,000 into the development of Content ID, and that was nine years ago:

While I admit that the Content ID system is a bit of a let-down and often pulls false positives (such as flagging white noise, silence, and an average creator's unedited speaking voice), I think that its implementation has protected the platform from powerful litigants. If the cost of protecting copyrighted material is a project worth hundreds of millions of dollars, imagine the cost they'd face if YouTube didn't develop Content ID!

Having studied HCI, I know one thing for sure, though: Users will break anything they touch. It's the nature of the beast that we try to exploit and test any system or software we come into contact with, and today, I want to test the limits of Content ID's fingerprinting of popular music.

Methodology

We need a fat sample size, but not so fat that it takes us forever to perform these tests.

I've struck a happy medium with ten popular songs I can tolerate hearing a ton of times:

  1. "Billie Jean" - Michael Jackson
  2. "Agua de Beber" - Astrud Gilberto
  3. "Love Sosa" - Chief Keef
  4. "Policy of Truth" - Depeche Mode
  5. "Chandelier" - Sia
  6. "Young Lust" - Pink Floyd
  7. "Gunship Politico" - State Radio
  8. "The Morning" - The Weeknd
  9. "What You Know" - T.I.
  10. "Waiting for the Miracle" - Leonard Cohen

"Billie Jean" is the heavyweight here. It was the most popular song that came to mind; the total opposite of obscure. I figure that if I could warp "Billie Jean" such that Content ID couldn't identify the track, I could do the same with any other song, having applied the same techniques.

I've noticed that you can get certain songs past Content ID unedited. Nobody's protecting them. YouTube doesn't have a public database of what songs it protects, though, so figuring these 'safe' tracks out is a matter of intuition and luck, and we're not looking to exploit this today. We're trying to figure out how much distortion a song needs before YouTube decides that I've created a 'unique,' fair use work.

I'm going to take these ten songs and run them through several sets of 'distortions': filters from a basic sound-editing program (Adobe Audition) that might feasibly 'hide' the identity of the songs from YouTube's Content ID algorithm. In total, there'll be 29 variants of distortion per song (with many being combinations of two or three distortions), making for a total of 290 tested audio files. I'll then package the 290 files into videos, upload these videos to YouTube, and see what gets claimed by Content ID.
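For the curious, here's a rough sketch of how that count of 29 falls out. The level and label names are borrowed from the results tables further down; the assumption (mine, inferred from those tables having no untouched row) is that the one combination left out is the completely unedited original.

```python
from itertools import product

# Labels taken from the results tables later in this post.
DOWNSAMPLE_LEVELS = ["0%", "34%", "38.4%", "44%", "54.8%"]
SPEEDS = ["Normal", "Slow"]
PITCHES = ["Normal", "Down", "Up"]
SONG_COUNT = 10

# Assumption: the untouched original is the single combination not tested.
UNTOUCHED = ("0%", "Normal", "Normal")


def build_variants():
    """Return every (downsample, speed, pitch) combination except the original."""
    all_combos = product(DOWNSAMPLE_LEVELS, SPEEDS, PITCHES)
    return [combo for combo in all_combos if combo != UNTOUCHED]


variants = build_variants()
print(len(variants))               # 29 variants per song
print(len(variants) * SONG_COUNT)  # 290 files in total
```

Five downsampling levels times two speeds times three pitches is 30, and dropping the original leaves 29.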

Tested Distortions

Downsampling

We'll be using the Krusher bitcrusher VST by Tritik to downsample our songs at four different levels to test the algorithm's ability to identify copyrighted content! Here's "Agua de Beber" downsampled by 34%:

Wikipedia has a comprehensive article on downsampling that I can't even pretend to understand, but let your ears speak for themselves: downsampled audio sounds like something emitted from a tinny 16-bit game console. I imagine that it's like the audio equivalent of pixelization. Here's downsampling at 38.4%:

As you can hear, with greater downsampling, the audio loses 'fidelity.' At 38.4% we can still discern the lyrics and melody. Now listen to 44%:

44% is 'crunchy.' It's not particularly pleasant to listen to. At this point we're getting near something that sounds like haunted, circuit-bent Tickle Me Elmos. And, finally, brace yourself for 54.8%:

The rhythm's still there, but we've lost anything remotely resembling vocals or instrumentation. A human familiar with this song could still identify it, but it'd be miraculous if the algorithm could do the same with any song downsampled this hard.
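If you'd like a feel for what this kind of downsampling does to the numbers, here's a minimal sample-and-hold sketch. It is not Krusher's actual algorithm, and I don't know how Krusher's percentages map to anything here; the `hold` parameter is a made-up knob for illustration.

```python
import math


def sample_and_hold(samples, hold):
    """Keep every `hold`-th sample and repeat it over the samples in between.

    This throws away detail the same way a lower sample rate would,
    while keeping the clip the same length.
    """
    if hold < 1:
        raise ValueError("hold must be at least 1")
    # i - (i % hold) is the index of the most recent sample we kept.
    return [samples[i - (i % hold)] for i in range(len(samples))]


# A short 440 Hz sine at an 8 kHz sample rate, just to have something to crush.
rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(16)]
crushed = sample_and_hold(tone, hold=4)
print([round(x, 2) for x in crushed])
```

The larger `hold` gets, the blockier the waveform becomes, which is roughly the "pixelization" effect described above.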

Stretching

We'll "stretch" or "lengthen" songs as well. Nowadays you can do it while retaining the original pitch of the song. Here's "Agua de Beber" at half its normal speed:

Pitch-shifting in Adobe Audition.

Pitch-Shifting

And we'll shift the pitch of our songs. Here's the same song, pitched up:

And here it is pitched down:

In the past, channels would upload copyrighted content only slightly pitched higher or sped up; that's how I watched The Boondocks and Daria in high school. Nowadays pitch-shifting isn't the silver bullet it once was. I'm more interested in investigating whether it has any efficacy at all in passing Content ID.
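That old "pitched higher or sped up" trick is, as far as I understand it, plain resampling: play the samples back faster and the clip gets both shorter and higher. Here's a small sketch of that coupling using linear interpolation. It is not what Audition does, and the semitone amounts are arbitrary examples, not the shifts I used in the tests.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2 ** (semitones / 12)


def resample(samples, ratio):
    """Read through `samples` `ratio` times faster, interpolating linearly.

    A ratio above 1 shortens the clip and raises its pitch;
    a ratio below 1 lengthens it and lowers the pitch.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    out = []
    last = len(samples) - 1
    k = 0
    while k * ratio <= last:
        pos = k * ratio
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, last)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
        k += 1
    return out


ramp = [0, 2, 4, 6, 8]
print(resample(ramp, semitone_ratio(12)))  # one octave up: about half as long
print(resample(ramp, 0.5))                 # half speed: about twice as long
```

Changing pitch without changing length (or length without pitch) takes more machinery than this, which is why the two are listed as separate distortions in the tests.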

The Experiment

I've stuck together this simple web app (the Sample Player 2000) so you can listen to the samples I tested yourself and peruse the results. Try spotting the patterns!

Sample Player 2000

Billie Jean - Michael Jackson

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Agua de Beber - Astrud Gilberto

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Love Sosa - Chief Keef

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Policy of Truth - Depeche Mode

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Chandelier - Sia

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Young Lust - Pink Floyd

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Gunship Politico - State Radio

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal CLAIMED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

The Morning - The Weeknd

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

What You Know - T.I.

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal PASSED
Slow Down CLAIMED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Waiting for the Miracle - Leonard Cohen

Downsampled 0%

Speed Pitch Result
Normal Down CLAIMED
Normal Up CLAIMED
Slow Normal CLAIMED
Slow Down PASSED
Slow Up PASSED

Downsampled 34%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 38.4%

Speed Pitch Result
Normal Normal CLAIMED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 44%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Downsampled 54.8%

Speed Pitch Result
Normal Normal PASSED
Normal Down PASSED
Normal Up PASSED
Slow Normal PASSED
Slow Down PASSED
Slow Up PASSED

Results

A combination of two techniques almost always passes. Combining pitch-shifting with downsampling almost always passes the Content ID system, but light downsampling alone will be caught, and pitch-shifting alone will be caught as well. The only exception was my test for "Agua de Beber" pitched high and downsampled at 34%, which suggests that the degree of difference in pitch between the original and your distorted copy matters.

A copyright claim for a pitched-up, downsampled 34% "Agua de Beber."

The Content ID algorithm is relatively weak to stretching. This is a bit of a mystery. If the algorithm can catch a song out of its enormously wide band of possible pitches and still salvage an identity out of light downsampling, why can't it do the same with every possible lengthening or stretching of a song? As humans, we can easily identify a familiar song at half its speed, but Content ID almost never could: only three of the 150 slowed samples were claimed, and none of those three had been downsampled. We can theorize that this is because the design of the algorithm depends on identical song lengths somehow, or because there are only so many pitches a human can possibly hear, whereas we can 'infinitely' stretch bits.

Downsampling alone starts to pass somewhere between 38.4% and 44%. Eight of the ten songs were still claimed at 38.4% with no other distortion, and none were claimed at 44%. That's fair! We can argue that a work is "transformative" somewhere in that range of downsampling, as it only vaguely resembles the original copy. The 'sonic' meaning has been changed at that point.

Content ID is agnostic to the popularity of songs. I haven't observed any difference between less popular songs ("Gunship Politico") and the hits ("Billie Jean") in how sensitive Content ID is to detecting said songs. I can see why. This would be tricky to program and a public relations risk for the video aggregator. At least we can say that YouTube's Content ID algorithm doesn't play favorites with labels.
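If you'd rather count than squint at the tables, here's a small tally sketch. The claims are transcribed by hand from the Sample Player results above, so any transcription slip is mine; the compact encoding (shared claims plus per-song one-offs) is just a convenience for this example.

```python
from itertools import product

SONGS = [
    "Billie Jean", "Agua de Beber", "Love Sosa", "Policy of Truth",
    "Chandelier", "Young Lust", "Gunship Politico", "The Morning",
    "What You Know", "Waiting for the Miracle",
]
LEVELS = ["0%", "34%", "38.4%", "44%", "54.8%"]
SPEEDS = ["Normal", "Slow"]
PITCHES = ["Normal", "Down", "Up"]
UNTOUCHED = ("0%", "Normal", "Normal")  # not part of the tables

# Claimed for every one of the ten songs.
COMMON_CLAIMS = {
    ("0%", "Normal", "Down"),
    ("0%", "Normal", "Up"),
    ("34%", "Normal", "Normal"),
}
# The two songs that passed at 38.4% with no other distortion.
PASSED_AT_38 = {"Chandelier", "The Morning"}
# One-off claims that only show up for a single song.
EXTRA_CLAIMS = {
    "Agua de Beber": {("34%", "Normal", "Up")},
    "Gunship Politico": {("0%", "Slow", "Normal")},
    "What You Know": {("0%", "Slow", "Down")},
    "Waiting for the Miracle": {("0%", "Slow", "Normal")},
}


def claimed_variants(song):
    """Return the set of (level, speed, pitch) variants claimed for `song`."""
    claims = set(COMMON_CLAIMS) | EXTRA_CLAIMS.get(song, set())
    if song not in PASSED_AT_38:
        claims.add(("38.4%", "Normal", "Normal"))
    return claims


def tally_by_level():
    """Map each downsampling level to (claimed, tested) counts over all songs."""
    claimed = {level: 0 for level in LEVELS}
    tested = {level: 0 for level in LEVELS}
    for song in SONGS:
        claims = claimed_variants(song)
        for variant in product(LEVELS, SPEEDS, PITCHES):
            if variant == UNTOUCHED:
                continue
            tested[variant[0]] += 1
            if variant in claims:
                claimed[variant[0]] += 1
    return {level: (claimed[level], tested[level]) for level in LEVELS}


for level, (hit, total) in tally_by_level().items():
    print(f"{level:>6}: {hit}/{total} claimed")
```

By this count, 23 of 50 samples were claimed at 0%, 11 of 60 at 34%, 8 of 60 at 38.4%, and none at 44% or 54.8%, for 42 claims out of 290 uploads.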

Further Questions

We've figured out that lengthening a song will blind Content ID as to its identity, but to what degree is the algorithm blind to time-related distortions? For example, if we uploaded only fractions of a song—say, 5 seconds instead of the whole thing—how would this impact Content ID's ability to catch it? What if we stretched a song to 110% of its length instead of 200%—would it catch it then?

We could test other ways of distorting audio, like applying pass filters to isolate the low end or high end or a smaller width of frequencies, but I suspect that Content ID won't pull an identity out of solely drum samples.
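For anyone who wants to try the pass-filter idea, here's a minimal one-pole filter sketch. I haven't run this test myself, and the cutoff in the example is an arbitrary number picked for illustration; a real test would more likely use the filters built into an audio editor.

```python
import math


def one_pole_alpha(cutoff_hz, sample_rate):
    """Smoothing factor for a simple RC-style one-pole filter."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    return dt / (rc + dt)


def low_pass(samples, alpha):
    """Keep the low end: each output moves a fraction `alpha` toward the input."""
    out = []
    prev = 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def high_pass(samples, alpha):
    """Keep the high end: whatever the low-pass filter removed."""
    return [x - y for x, y in zip(samples, low_pass(samples, alpha))]


# Hypothetical settings: isolate everything below roughly 200 Hz at 44.1 kHz.
alpha = one_pole_alpha(200, 44100)
step = [1.0] * 8
print([round(v, 3) for v in low_pass(step, alpha)])
```

A one-pole filter rolls off gently, so isolating just the drums or just the vocals would need something steeper than this.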

How impervious is Content ID to layering of audio? I know from experience that a person talking, for example, over copyrighted music, will still be caught. But what if we layered a song on top of a song? How much ambient noise over a song would be necessary to obfuscate it?

What about isolation? When I was uploading videos of these songs, they were all at their full length and lined up together in groups of ten; one song would end, and another would immediately play. Does this fool YouTube's algorithm into thinking that this is one 46-minute song? If every song were uploaded as an individual video, would they be more likely to be claimed?

Does Content ID apply greater scrutiny dependent on amount of views and channel size? If you were to design the algorithm for optimal efficiency and for the greatest profit protection for copyright holders, you'd apply minimum scrutiny to tinier channels and be punitive with more profitable channels with larger audiences. As there are far fewer popular content creators than us garden-variety chums, the cost of running Content ID to its fullest potential only on the minority of popular YouTubers and applying a lighter standard for smaller channels is much less than if you ran every single upload of every single user through the same algorithm. When I uploaded my videos, they had zero views; they were run through Content ID and copyright claims were made within minutes. What if these videos accumulated a couple thousand views? Not actual, human viewers; let's just say that the integer of views was higher. Would Content ID be run again at some point, applying greater scrutiny rather than baseline checks?

Practical Applications

It's inadvisable to use this data to bypass Content ID with copyrighted material. Content ID is just an algorithm; there are people paid to scour the web for intellectual property violations, and any channel with a large audience could be subject to the greater sensitivity of manual copyright claimers (unless you're using obscure music, in which case, you might be better off approaching artists directly and offering to pay them).

This is more interesting as a legal/philosophical discussion: Beginning with a piece of digitized audio, is there a quantifiable point of bit-parity at which you've distorted the track such that it's something new, or transformed? If we keep the melody and the lyrics, is it a cover? If we can't even discern the lyrics anymore, and can only hear the downsampled percussion, is that fair use?