Embed chiptunes in 2D with Convolutional Auto Encoder and Mel Spectrograms

mtxslv/autoencoder-chiptune

Welcome! 👋👋

This project demonstrates how to visualize a set of songs on a 2D plane. It does so by first extracting the mel-spectrograms of the audio files and then passing them through a Convolutional Autoencoder whose latent space (read: bottleneck layer) contains only two axes (i.e., neurons). Of course, more stuff happens along the way, so if you want to know the details, read on below, ok?

oh, btw... this is a pet project, so take it easy.

Data Processing

🎶 Dataset 🎶

Data is available here. I downloaded the files one by one, by hand. 😮‍💨

Once the files were available locally, I converted them to the .wav format. This is done in the convert_data_to_wav.ipynb Jupyter notebook.

Naturally, I should have turned this into a utils function, but nah. 🙄
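For reference, here is a minimal sketch of that conversion step, assuming the downloaded tracks are MP3 files and using pydub (the folder paths are hypothetical, not the ones in the notebook):

```python
from pathlib import Path

from pydub import AudioSegment  # pydub needs ffmpeg available on the system

RAW_DIR = Path("./data/raw")  # hypothetical: folder with the downloaded tracks
WAV_DIR = Path("./data/wav")  # hypothetical: destination for the converted files
WAV_DIR.mkdir(parents=True, exist_ok=True)

for mp3_path in RAW_DIR.glob("*.mp3"):
    # Decode the MP3 and re-encode it as an uncompressed WAV file
    AudioSegment.from_mp3(mp3_path).export(str(WAV_DIR / f"{mp3_path.stem}.wav"), format="wav")
```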

I did a lil' bit of 'exploratory data analysis' in eda.ipynb. It was necessary because I was unsure how to feed audios of different lengths into an autoencoder network. In the end, the question became 'which ones to feed'. This led me to write the metadata.csv file, where I classify each tune as 'soundeffect' or 'soundscape'. More details below 😉

📊 On the data analysis 📊

The file metadata.csv contains a classification I personally made of the audio samples. I classified the samples into two categories:

  • soundeffect
  • soundscape

Soundeffects are, in general, short tunes that play when we interact with something in-game. I expand that concept to also cover melodies played under certain circumstances, like boss fights and cutscenes. Two examples of such soundeffects are:

  • 23. Priest of the Dark Order.wav: this track plays when Agahnim 'finishes off' Zelda. Even though it is a long tune, it is considered a cutscene soundeffect;
  • 01. Title ~ Link to the Past.wav: this track is the menu theme, so I considered it a soundeffect.

Likewise, soundscapes are tunes played during exploration phases, in the open world or in a dungeon (or other closed spaces). It is important to note, though, that some tunes are considered soundscapes even though they are short, like 02. Beginning of the Journey.wav or 08. Princess Zelda's Rescue.wav. One reason is that some of these tunes occur in pairs, with an ambience (like rain or wind) applied to one of them. I kept them in the dataset to check whether they appear close together in the embedded space. For instance, we have the pairs 25. Black Mist.wav and 26. Black Mist (Storm).wav, and 06. Majestic Castle.wav and 07. Majestic Castle (Storm).wav.

✂️🖇️ Preprocessing 🖇️✂️

As seen in processing.ipynb, I pad and crop the soundscapes. My reasoning went something like this:

"Take the distribution of audio lengths. I'll find a good point to crop/pad the files. Well, it looks like padding by copying and pasting over and over is not a big deal, since people on the internet post videos with 10 hours each on youtube and it sounds great (I took the idea of padding by repeating from it). Well, cropping at the 75% quartile looked OK for me because the longest file would lose only 25% of its contents at most and padding the rest was considered fine from design. The code I used to do that was defined at objects.py and dataset.py based on what I learned about Clean Architecture (domain objects, you know?). Of course, some unit testing would make it even prettier, but I did not care at the time, hee hee hee"

The products of preprocessing can be found in the processed/ folder.
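The mel-spectrograms themselves can be computed with librosa; here is a minimal sketch, assuming default parameters (the exact settings and paths used in processing.ipynb may differ):

```python
import librosa
import numpy as np

# Load an already padded/cropped wav file and compute its mel-spectrogram
signal, sr = librosa.load("data/processed/some_soundscape.wav", sr=None)  # hypothetical path
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale is friendlier to neural networks

np.save("data/processed/soundscapes-mel-sgrams/some_soundscape.npy", mel_db)  # hypothetical path
```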

I then zipped the mel-specs with this command:

$ zip -r ./data/processed/soundscapes-mel-sgrams.zip ./data/processed/soundscapes-mel-sgrams

and uploaded the resulting soundscapes-mel-sgrams.zip file to my GDrive.

🧮📉 Running the autoencoder 📉🧮

The autoencoder.ipynb notebook contains the steps to generate the embedded space's coordinates for each tune. Basically, I unzip the data, pad it into a tensor with even dimensions, define a Convolutional Autoencoder, fine-tune it, and extract the embedded (latent) dimensions.

I actually played a lot with the autoencoder class. It is a collage of several sources (you can check them out in the References below). I tried to add more convolutional layers to both the encoder and decoder modules, but that kept messing up the output over and over again. I then moved on and just left the number of filters in the conv layers as adjustable parameters. This 'version' of the class is the one seen in models.py.
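For reference, here is a minimal sketch of a convolutional autoencoder with an adjustable filter count and a two-neuron bottleneck, written in PyTorch. It illustrates the general shape of the model, not the exact class in models.py, and the default flat_dim assumes 1×128×128 inputs:

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, n_filters: int = 16, flat_dim: int = 16 * 32 * 32):
        super().__init__()
        # Encoder: two stride-2 convolutions, then a dense layer down to 2 latent neurons
        self.encoder_conv = nn.Sequential(
            nn.Conv2d(1, n_filters, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(n_filters, n_filters, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_latent = nn.Linear(flat_dim, 2)    # the 2D coordinates used for plotting
        self.from_latent = nn.Linear(2, flat_dim)
        # Decoder: mirror of the encoder, upsampling back to the input size
        self.decoder_conv = nn.Sequential(
            nn.ConvTranspose2d(n_filters, n_filters, kernel_size=3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(n_filters, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder_conv(x)
        return self.to_latent(h.flatten(start_dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.from_latent(self.encode(x))
        # Two stride-2 convolutions shrink each spatial dimension by a factor of 4
        h = h.view(x.shape[0], -1, x.shape[-2] // 4, x.shape[-1] // 4)
        return self.decoder_conv(h)
```

Training then boils down to minimizing a reconstruction loss (e.g. nn.MSELoss()) between the network's output and the input spectrogram.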

I trained this network, which took approximately 36 minutes. The last step was to extract the latent-space coordinates. The result is in the embeddings.json file.
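A sketch of that extraction step, assuming the ConvAutoencoder sketched above, an already trained `model`, and a hypothetical iterable `dataset` of (name, spectrogram tensor) pairs:

```python
import json

import torch

model.eval()
embeddings = {}
with torch.no_grad():
    for name, mel_tensor in dataset:               # hypothetical (name, 1xHxW tensor) pairs
        z = model.encode(mel_tensor.unsqueeze(0))  # add the batch dimension
        embeddings[name] = z.squeeze(0).tolist()   # two coordinates per tune

with open("embeddings.json", "w") as f:
    json.dump(embeddings, f, indent=2)
```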

🔭 Visualizing 🔭

To see the results, check the visualizing.ipynb file.
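For instance, a minimal way to plot those coordinates with matplotlib, assuming the embeddings.json structure sketched above (a mapping from tune name to an [x, y] pair):

```python
import json

import matplotlib.pyplot as plt

with open("embeddings.json") as f:
    embeddings = json.load(f)  # assumed format: {"tune name": [x, y], ...}

xs, ys = zip(*embeddings.values())
plt.scatter(xs, ys)
for name, (x, y) in embeddings.items():
    plt.annotate(name, (x, y), fontsize=6)  # label each tune on the 2D plane
plt.title("Chiptunes embedded in 2D")
plt.show()
```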

📚 References 📚

MP3 to Wav

On Mel Spectrograms

On Autoencoder