
StyleGAN MTV / MTVR

Example of a GAN interpolating through a latent space
Using sound + AI to make music videos

Watch and listen to some outputs of this project at ryanc.lol/mtv. I also used A-Frame to create a 3D, VR-enabled experience - try it at ryanc.lol/mtvr.

After a hiatus from the experiments of mAIking The Band to focus on various client projects, I returned to the world of StyleGAN with a renewed focus on using sound to synthesize images. While the keyboard prototype I had built before was great fun to play and experiment with, it only reacted to the MIDI notes being played - not to the music itself. I yearned for a GAN that reacted to raw audio, with the timbre and rhythm determining what you would see.

After mucking about in a Google Colab notebook for a bit, I wrote some software that takes the spectrogram of a given sound file as the input to the GAN, synthesizing frames that interpolate through its latent space in time with the music. Because the input was only the audio spectrogram, this created a cool effect where the same sounds would produce the same images. You can see the first video I made below - be warned, it does contain a lot of fast movement and cats.

The first successful test - cats to the tune of Billie Jean
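
The core idea fits in a few lines. Here's a minimal sketch of the spectrogram-to-latent mapping - the fixed random projection and the scaling are illustrative choices for the sketch, not the project's exact code:

```python
import numpy as np
import librosa

def audio_to_latents(path, fps=30, z_dim=512, seed=0):
    # Load audio and take one mel spectrogram frame per video frame.
    y, sr = librosa.load(path)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, hop_length=sr // fps, n_mels=128)
    mel = librosa.power_to_db(mel, ref=np.max)       # (n_mels, n_frames)
    mel = (mel - mel.mean()) / (mel.std() + 1e-8)    # normalize loudness

    # Fixed random projection from spectrum space into Z space, so the
    # same sound always lands on the same latent vector.
    proj = np.random.RandomState(seed).randn(z_dim, mel.shape[0])
    latents = (proj @ mel).T                         # (n_frames, z_dim)
    latents /= np.linalg.norm(latents, axis=1, keepdims=True)
    return latents * np.sqrt(z_dim)                  # ~unit-Gaussian scale in Z
```

Each row can then be handed to the generator - e.g. Gs.run(...) in NVIDIA's TensorFlow StyleGAN implementation - to render one video frame.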

From there I rapidly iterated, implementing features to make the viewing experience more pleasant: a window function to smooth out the latent space interpolation, so visual jumps weren't as quick and drastic, and slow, low-frequency movement through the latent space so that the same sound didn't always produce exactly the same image.
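
Both features are straightforward to sketch - a Hann-window moving average over time, plus a slow sinusoidal drift along a fixed direction in latent space. The window size and drift parameters below are assumed values, not the project's actual settings:

```python
import numpy as np

def smooth_and_drift(latents, window=9, drift_scale=0.5, drift_period=600, seed=1):
    # Hann-window moving average over time softens frame-to-frame jumps.
    win = np.hanning(window)
    win /= win.sum()
    smoothed = np.apply_along_axis(
        lambda c: np.convolve(c, win, mode='same'), 0, latents)

    # Slow sinusoidal drift along a fixed random direction in Z, so the
    # same sound no longer always maps to exactly the same image.
    direction = np.random.RandomState(seed).randn(latents.shape[1])
    t = np.arange(len(latents))[:, None]
    return smoothed + drift_scale * np.sin(2 * np.pi * t / drift_period) * direction
```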

Next, I wanted the GAN to react to live sound, with generation parameters tweakable in real time, so I built a quick JavaScript frontend that allowed this.
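
The backend half of that interface is easy to picture as a small Flask app. This is a hypothetical sketch of the parameter endpoint - the knob names are illustrative, not the real interface - with the browser capturing audio via WebAudio and POSTing tweaks:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical parameter store, read by the generation loop each frame;
# the real interface exposed its own set of knobs.
params = {"smoothing": 9, "drift_scale": 0.5, "truncation": 0.7}

@app.route("/params", methods=["GET", "POST"])
def handle_params():
    # The JavaScript frontend POSTs tweaked values here while the
    # backend keeps synthesizing frames from the incoming audio.
    if request.method == "POST":
        params.update(request.get_json())
    return jsonify(params)

if __name__ == "__main__":
    app.run(port=5000)
```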

With these tools in place, I was able to host a DJ session at IDEO's 2020 May Frenzy, where viewers requested songs for the AI to paint. I also briefly ran a Twitch channel, complete with a chat bot that let viewers submit songs as YouTube links.

After playing with the software I'd made for a while, I decided to move back towards offline processing, since rendering ahead of time produced much more stable, higher-framerate, higher-quality output without having to contend with latency. I made the CLI more robust, added more features, and added support for StyleGAN2 and StyleGAN2-ADA models.
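
To give a feel for the offline workflow, a CLI along these lines works well - the flag names here are hypothetical, not the released tool's actual interface:

```python
import argparse

# Hypothetical flags to show the shape of such a CLI; not the actual
# interface of the released code.
parser = argparse.ArgumentParser(
    description="Render a StyleGAN music video from an audio file")
parser.add_argument("audio", help="input audio file")
parser.add_argument("--network", required=True,
                    help="StyleGAN / StyleGAN2 / StyleGAN2-ADA .pkl checkpoint")
parser.add_argument("--fps", type=int, default=30)
parser.add_argument("--smoothing-window", type=int, default=9)
parser.add_argument("--out", default="out.mp4")
args = parser.parse_args()
```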

You can see some of the latest videos at ryanc.lol/mtv, where I've collected a playlist of videos set to pieces by impressionist composers Debussy and Satie. There is also a WebVR version available at ryanc.lol/mtvr if you would like a third dimension to your experience.

I used Python with the TensorFlow implementation of StyleGAN plus librosa as the foundation for audio -> image synthesis. The real-time interface was built with Flask and a simple JavaScript frontend using WebAudio and Vue.js. Code for offline generation can be found here.


