The Art of Model Training with Neutone Morpho

    February 25, 2025

    Article written by Alfie Bradic

    I don’t like the term dataset. It has a cold, sterile connotation that feels out of place in our creative domains of music and sound design. Artists make art, not data – and training a Morpho model is no exception. In this article I’ll share some tips and considerations to keep in mind when training your own model. In the process, I hope to highlight how the medium of “datasets” is ripe for human creativity and self-expression. You can use the headings as a checklist to refer to before finalising your own training sounds. Let’s begin!

    Composing for the machine

    Creating your own Morpho model starts with you composing a collection of music or sounds for the network to learn from. I use the word “composing” because the elements you consider for conventional music are still relevant.

    What emotions do I want to convey? What instruments or sounds should I record? What chords or rhythms? What microphone or signal chain? What post-processing?

The old adage “garbage in, garbage out” still applies here too. A dataset composed of flat, repetitive or poorly-recorded sounds will not produce good results. Morpho will train faithfully on your material and will not attempt to “fix” or enhance your sounds. To be specific, Morpho will learn the timbre of your dataset only. The frequencies that make up your sounds, and how those frequencies rise and fall, are all the model cares about. Each sound you add to your dataset is a chance to provide new information to the model that will better equip it for its role as a tone morphing machine.

    Does your sound become softer as it gets quieter? Does it become harsher as it rises in pitch? Does it have a fast attack or a slow attack? The more questions your dataset can answer, the more expressive your model will become.

    1. Length

    The question “how long should my dataset be?” is hard to answer in exact terms. What the model needs is not strictly length but novelty. 1 hour of dynamic performance covering a range of volumes, pitches and playing styles will be vastly superior to 3 hours of a single note played at a single velocity.

    As a general rule of thumb, you should aim for at least 45 minutes of unique audio. For most sound sources, this is enough time to cover a healthy variety of timbres. If you don’t have 45 minutes, check out the augmentation section of this guide for effective ways to generate more.
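
If you want a quick sanity check before training, a few lines of Python can total up the duration of a folder of recordings. This is just a minimal sketch, assuming your files are in formats readable by the soundfile library; the folder name is a placeholder for your own.

```python
# Total the duration of a folder of training audio (minimal sketch).
from pathlib import Path
import soundfile as sf

DATASET_DIR = Path("my_dataset")  # hypothetical folder of training audio
TARGET_MINUTES = 45               # guideline from this section

total_seconds = 0.0
for path in sorted(DATASET_DIR.rglob("*")):
    if path.suffix.lower() in {".wav", ".flac", ".aiff", ".aif"}:
        total_seconds += sf.info(str(path)).duration  # header read only, fast

print(f"Total unique audio: {total_seconds / 60:.1f} min "
      f"(aiming for at least {TARGET_MINUTES} min)")
```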

    2. Frequency range

The overall frequency range of your dataset is of vital importance to how your model behaves. Before we think about pitch (more on that later) we need to think about the overall frequency areas that your sounds occupy. We know intuitively that the treble control on a HiFi system can make things sound brighter when turned up and darker when turned down. The model makes similar observations about sounds during training, and will morph your input signals into the nearest equivalent found in your training data. For example, if your dataset contains just bass guitar and violin recordings, Morpho will transform bright inputs such as flutes into the violin and low inputs such as kick drums into the bass guitar.

    Be mindful of how much of the frequency spectrum your sounds cover. Datasets that are overly skewed towards one area will not translate well to diverse inputs. If your sound source is naturally skewed, consider creative ways to increase the frequency diversity of your audio. For example, if we wanted to include more high frequency content in our bass guitar dataset, we could record passages high up on the neck, or even more abstract sounds such as string noise from sliding our hand up and down.
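
One way to audit this is to measure how your dataset’s energy is spread across the spectrum. The sketch below is a rough, assumed approach using librosa, with arbitrary band edges and a hypothetical file name; a heavily lopsided result (say, 90% of energy below 250 Hz) is a hint to record brighter material.

```python
# Estimate how a file's average spectral energy splits across bands.
import numpy as np
import librosa

def band_balance(path, sr=44100):
    """Fraction of average spectral energy in low / mid / high bands."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=2048)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    mean_spec = spec.mean(axis=1)
    total = mean_spec.sum() + 1e-12
    return {
        "low (<250 Hz)":      float(mean_spec[freqs < 250].sum() / total),
        "mid (250 Hz-4 kHz)": float(mean_spec[(freqs >= 250) & (freqs < 4000)].sum() / total),
        "high (>4 kHz)":      float(mean_spec[freqs >= 4000].sum() / total),
    }

print(band_balance("bass_guitar_take_01.wav"))  # hypothetical file
```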

    3. Transients

    It’s crucial to consider the attack characteristics of your dataset. If your sounds lack strong transients, percussive inputs like drums will also lose their punch when morphed. This is not necessarily a bad thing; perhaps you want to make a drone machine that softens your input sounds in this way. If you do want transients to be preserved however, you will need to think about how to include some appropriate examples in your dataset.

    Some sources, such as ambient field recordings, lack any strong percussive elements. You can try remedying this by loading your recordings into a sampler and getting more hands-on with sound design. For example, with a long recording loaded, randomise the playback start position and tweak the ADSR to get more aggressive transients that decay sharply. You can create a quasi-generative drum dataset with this approach by hooking up some LFOs to key parameters such as filter cutoff, playback speed and release time.
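
To make the sampler trick concrete, here is a minimal offline sketch of the same idea: random slices pulled from a long recording with a sharp attack and exponential decay imposed on each. The file name, slice lengths and envelope shape are all illustrative assumptions, not a prescribed recipe.

```python
# Manufacture percussive hits from a long, non-percussive recording.
import numpy as np
import soundfile as sf

y, sr = sf.read("field_recording.wav")  # hypothetical long source file
if y.ndim > 1:
    y = y.mean(axis=1)  # fold to mono to keep the sketch simple

rng = np.random.default_rng(seed=0)
hits = []
for _ in range(200):
    length = int(rng.uniform(0.05, 0.4) * sr)   # 50-400 ms per slice
    start = int(rng.integers(0, len(y) - length))
    # instant attack, exponential decay: a crude percussive AD envelope
    env = np.exp(-np.linspace(0.0, 8.0, length))
    hits.append(y[start:start + length] * env)
    hits.append(np.zeros(int(0.1 * sr)))        # short gap between hits

sf.write("percussive_variations.wav", np.concatenate(hits), sr)
```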

    Drum datasets have a different consideration: how should the model react to sustained inputs? These models will often exhibit noisy or unexpected behaviour when fed long sustaining sounds as the dataset contains no reference point for them. Drums don’t sustain in the same way as pitched instruments, but you could experiment with including military-style drum rolls for acoustic drums or very fast patterns for electronic drums.

    4. Timbral variation

    Consider the difference between a sine wave and a square wave, or a clean electric guitar and a distorted guitar. The harmonics of a sound are what give it its identity, and Morpho benefits from a healthy balanced diet of different timbres. If your dataset is based on field recordings or acoustic instruments, it’s likely your data will already have enough variety in this area. However, if you’re working with synthesis, try to vary your patches. This is important because the makeup of some synth sounds can remain static across pitch and velocity, with only the fundamental frequency and amplitude changing.
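
If you’re unsure whether a recording is too static, one quick, assumed heuristic is to look at how much the spectral centroid moves over time; a near-zero spread suggests the timbre barely changes. This sketch uses librosa and a hypothetical file name.

```python
# Gauge timbral movement via the spread of the spectral centroid.
import librosa

y, sr = librosa.load("synth_takes.wav", sr=None, mono=True)  # hypothetical file
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
print(f"Centroid mean: {centroid.mean():.0f} Hz, spread: {centroid.std():.0f} Hz")
```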

    5. Pitch

Pitch is one of the more challenging aspects of a melodic dataset. First, any pitch that isn’t found in your dataset will not be reproduced when morphing. This means you won’t get pitches above and below the natural pitch range of the instrument unless you deliberately include re-pitched variations. This extends to pitches on a micro-level – a piano model will only reproduce the fixed pitches from the piano keys, but a vocal model might provide a smoother, more continuous pitch range if the singer used portamento and vibrato techniques.

Second, the Morpho architecture currently does not learn pitch explicitly, and this can result in models not following input pitches exactly. Remember, the model only cares about the timbre of sounds. If it thinks your input C# note is closer in timbre to a D note from its dataset, it will reproduce that instead. As a result, complexity increases further with the inclusion of polyphony such as chords or arrangements of multiple instruments in a single audio file.

    The key to good pitch behaviour is being deliberate about what you include. Here are some approaches you could try:

    Approach A: Fixed tonality (recommended)

Rather than include all the possible notes your instrument can make, decide on a particular scale or mode, and stick to those notes instead. I recommend sacrificing some flexibility for the stability afforded by this method. You could further experiment by only using chords for your sustained sounds and single notes/arpeggios for your short sounds. This way, you’re trying to reinforce the model’s understanding of polyphony with a separate “sustain” variable (Pavlov’s polyphony, anyone?).

    Approach B: Systematic and monophonic

Another approach might be to restrict your instrument to monophonic sounds and record performances of chromatic scales that vary in tempo and velocity. This can be fleshed out further with techniques such as glissando and vibrato. Experiments have shown this approach is more likely to succeed if your instrument’s pitch range is fairly high, such as that of a violin. If you’re working with a bass instrument, I’d recommend the fixed tonality approach instead.

    6. Noise

Noise can ruin an otherwise great dataset. Humans are good at tuning out unwanted noise, but the model will learn all sounds indiscriminately, without caring which sounds are desirable and which are junk. If you recorded your sounds with a phone or portable recorder, check your noise floor. Trim any noisy silences or rustling between takes. A good denoising plugin can do remarkable things as well. iZotope’s RX 11, Klevgrand’s Brusfri and Reaper’s ReaFir are all good options.
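
Trimming those gaps by hand is tedious, so here is a minimal sketch that drops near-silent regions automatically, assuming librosa’s split utility and a hypothetical file name. The top_db threshold is a guess you should tune against your own noise floor, and you should always audition the result.

```python
# Strip quiet, noisy gaps between takes (minimal sketch).
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("session_take.wav", sr=None, mono=True)  # hypothetical file

# keep only regions louder than 40 dB below the file's peak
intervals = librosa.effects.split(y, top_db=40)
kept = np.concatenate([y[start:end] for start, end in intervals])

sf.write("session_take_trimmed.wav", kept, sr)
print(f"Kept {len(kept) / sr:.1f}s of {len(y) / sr:.1f}s")
```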

    7. Mixing

Mixing is your last chance to confirm that your dataset actually sounds pleasing to the ear. If it’s not enjoyable to listen to on its own, it’s unlikely the trained model will fare any better. EQ and multi-band compression are convenient for addressing problematic areas and improving consistency. However, you should avoid adding time-based effects such as delay and reverb unless they are a core theme of your model. Reverb in particular can have a negative impact on the model’s ability to react precisely to transients.

    8. Augmentation

    You can increase the length of your dataset by creating variations of the same recording. This can be a powerful tool but should not be abused or used as a shortcut. I recommend you refer to the topics above and consider which elements are lacking in variety before making any decisions.

    • Pitch shift is a common augmentation tool for audio datasets, but consider whether it is really necessary in your case. Modern pitch shift algorithms do a good job at preserving the timbre of your sounds, so they might not add the variety you need. On the other hand, slowing down the recording for a tape-style pitch shift can have dramatic effects on the timbre at the expense of authenticity.
• If you have too many dark sounds and want to improve your overall EQ balance, consider adding a version of your recording that uses a high-pass filter and a high frequency boost. You can increase the timbral variety further by assigning the high-pass cutoff frequency to a very slow moving LFO (see the sketch after this list).
    • If you’re working with a multitrack recording, you can create variations by combining multiple tracks or stems together. For example, you can use the individual channels of a drum kit recording followed by the whole kit together for more variety. Be careful when doing this with pitched sounds if your approach relies on strict polyphony rules.
• If your dataset is more abstract or stylised, you can try making variations with different effects such as distortion or heavy compression. Transient designers are also useful to generate variations for percussive sources. Just keep in mind that the model weighs these variations equally with the “main” recordings, so be sure they sound just as good!
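
As a concrete example of the LFO-swept high-pass idea from the list above, here is a rough block-based sketch using scipy. The filter order, LFO rate and cutoff range are illustrative assumptions, and carrying the filter state across blocks is a common approximation for time-varying filters. Bouncing the same sweep through your DAW works just as well if you’d rather stay out of code.

```python
# High-pass filter whose cutoff is swept by a very slow LFO (sketch).
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

y, sr = sf.read("dark_source.wav")  # hypothetical dark-sounding take
if y.ndim > 1:
    y = y.mean(axis=1)

block = 4096       # process in short blocks so the cutoff can move
lfo_hz = 0.05      # one full sweep roughly every 20 seconds
out = np.empty_like(y)
zi = np.zeros((1, 2))  # one biquad section's state, carried across blocks

for i in range(0, len(y), block):
    t = i / sr
    # LFO sweeps the cutoff between ~100 Hz and ~1 kHz on a log scale
    cutoff = 100.0 * 10.0 ** (0.5 * (1.0 + np.sin(2 * np.pi * lfo_hz * t)))
    sos = butter(2, cutoff, btype="highpass", fs=sr, output="sos")
    out[i:i + block], zi = sosfilt(sos, y[i:i + block], zi=zi)

sf.write("dark_source_hp_lfo.wav", out, sr)
```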

    Conclusion

    When your dataset is sounding great and it handles the above topics appropriately, you can begin training your own model with confidence. We’re constantly amazed by the creations of our talented community, and we want to hear what you make with your own Morpho models. Please tag us in any Neutone-driven sounds you share online!

    If you have feedback or questions about training your own Morpho models, please don’t hesitate to get in touch. You can shoot us an email or join our Discord to start a discussion with the community.

    support@neutone.ai

    Join our Discord