About 90% of the work in ML is data collection and classification by hand. The rest is the programming.
You don't need CUDA or anything else hardware-wise to train audio models. Your model is probably going to be 10000x smaller than image classification or generation models, so training is going to go fast. Unless you want a PC anyway.
Learn Python.
Forget the boards for now and make it work on a PC. No, seriously, all of that is just noise at this stage. You need to figure out much bigger things than what hardware to run it on.
Also, you should sanity-check whether you need ML/AI at all for it. I had a project where I sunk a lot of time into trying to force it into PyTorch machine learning, only to solve it with scipy in a fraction of the time and with a better result.
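To give a rough idea of what the non-ML route can look like, here is a minimal sketch. It assumes the task is something like "which frequency dominates this clip". That task, the synthetic test clip and the function name `dominant_frequency` are all made up for illustration and are not the project mentioned above.

```python
import numpy as np
from scipy import signal


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) carrying the most power in a mono signal."""
    # Welch's method averages spectra over overlapping segments,
    # which gives a much less noisy estimate than a single FFT.
    freqs, power = signal.welch(samples, fs=sample_rate, nperseg=1024)
    return freqs[np.argmax(power)]


# Hypothetical test data: one second of a 440 Hz tone buried in white noise.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * t) + 0.5 * rng.standard_normal(t.size)

peak = dominant_frequency(clip, sample_rate)
print(f"dominant frequency: {peak:.1f} Hz")
```

With `nperseg=1024` at 16 kHz the frequency bins are about 15.6 Hz apart, so the result is only accurate to within a bin. If a baseline like this already does the job, there is nothing to collect, label or train.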