Sometimes you just want to download your favorite YouTube videos. Given YouTube's continuous march toward censorship, this is never a bad idea, but there are a lot of other reasons to want access to the vast library of YouTube videos.
Step 1: Download YouTube Video/Audio
I recently started work on a little project that will download YouTube videos in order to grab the audio track and run it through Machine Learning algorithms. Step one, of course, is to download the video to your local storage. I originally thought Python would be the best way to proceed with this step, but it seems C# provides the most functionality in terms of downloading YouTube videos.
Check out this GitHub repository for a variety of sample C# code:
https://github.com/mdhatmaker/YouTubeDownload
This code downloads videos (YouTube only), and it also has a basic ability to grab just the audio track and save it to an MP3 file.
Here's where I started when developing this code:
https://stackoverflow.com/questions/39877884/c-sharp-download-the-sound-of-a-youtube-video
Step 2: Convert Speech to Text
Once we have the audio track(s) saved locally, we can start converting the speech to text.
Within my miscellaneous Python code repo, check out the "voice_to_speech_demo.py" file:
https://github.com/mdhatmaker/Misc-python
The Python SpeechRecognition library seems like a good place to start:
https://www.datasciencewiki.com/2019/12/speechvoiceaudio-to-text-conversion.html
My first attempt uses the Google Cloud Speech API to do the audio-to-text conversion. It seems I have to jump through some hoops to get the Google API to convert clips longer than about 60 seconds, but it does work (and the accuracy seems quite good).
The SpeechRecognition Python library supports a variety of speech-recognition engines and APIs.
As of this writing, it supports the following:
- CMU Sphinx (works offline)
- Google Speech Recognition
- Google Cloud Speech API
- Wit.ai
- Microsoft Bing Voice Recognition
- Houndify API
- IBM Speech to Text
- Snowboy Hotword Detection (works offline)
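To give a feel for how the library is used with the engines listed above, here is a minimal sketch (not the code from my repo). It assumes the SpeechRecognition package is installed, that the audio has already been converted to a short WAV file, and that CMU Sphinx support (the pocketsphinx package) is available if you pick the offline engine. The transcribe function and the ENGINE_METHODS table are names I made up for illustration.

```python
# Map a short engine name to the Recognizer method that calls it.
# Only two of the engines listed above are covered in this sketch.
ENGINE_METHODS = {
    "sphinx": "recognize_sphinx",  # CMU Sphinx, offline, needs pocketsphinx
    "google": "recognize_google",  # Google Speech Recognition, needs network
}


def transcribe(wav_path, engine="sphinx"):
    """Return the recognized text for one short WAV/AIFF/FLAC file."""
    if engine not in ENGINE_METHODS:
        raise ValueError(f"unknown engine: {engine!r}")

    # Imported here so the rest of the module loads without the package.
    import speech_recognition as sr  # pip install SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file into memory

    # Look up the matching recognize_* method and run it on the audio.
    return getattr(recognizer, ENGINE_METHODS[engine])(audio)
```

Calling transcribe("clip.wav", engine="google") would send the clip to Google's service, while the default runs locally. The other engines in the list generally need API keys or credentials, so I left them out of this sketch.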
Obviously, I need to address the issue of processing longer audio clips. Also, it would be cool to try an API that does the speech recognition locally (rather than in the cloud). My plan is to try the following engines next and compare the results with Google Cloud Speech API: Google Speech Recognition, Microsoft Bing Voice Recognition, and IBM Speech to Text.
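One way to work around the clip-length limit is to cut the audio into shorter pieces and transcribe each piece separately. The sketch below is an assumption on my part, not something I have settled on: it uses only Python's standard wave module, expects uncompressed WAV input, and the split_wav name, the 50-second default, and the output file naming are all hypothetical.

```python
import wave


def split_wav(path, chunk_seconds=50, prefix="chunk"):
    """Split a WAV file into pieces of at most chunk_seconds each.

    Returns the list of chunk file paths, in order.
    """
    paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width, rate
                dst.writeframes(frames)  # frame count in header is corrected
            paths.append(out_path)
            index += 1
    return paths
```

Each returned path can then be passed to whichever recognizer is in use, and the resulting text joined back together. Cutting at fixed time boundaries can split a word in half, so splitting on silence would probably give cleaner transcripts.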
Step 3: Use Machine Learning Prediction to Simulate Speech Patterns
Once I have the "textified" version of audio from a speaker in some YouTube videos, I can start the process of doing "next-word prediction" to attempt to simulate the speaker's patterns. I'll write follow-up posts as I work through this step...
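I haven't picked a model for this step yet, but the simplest form of next-word prediction is a bigram model: count which word follows which in the transcript, then predict or sample from those counts. Here is a rough sketch of that idea using only the standard library. The function names are mine, and a real attempt would likely need far more text and a better model.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the transcript."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of word, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(model, start, length=10, seed=None):
    """Generate up to length words by sampling followers by frequency."""
    rng = random.Random(seed)
    word = start.lower()
    output = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        choices, weights = zip(*followers.items())
        word = rng.choices(choices, weights=weights)[0]
        output.append(word)
    return " ".join(output)
```

For example, after model = train_bigrams(transcript), predict_next(model, "the") returns whichever word most often followed "the" in that speaker's transcript, and generate(model, "the") strings together a short run of words in the speaker's style.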
Conclusion
The goal of this nerdy little project is to pull a speaker's audio from his/her YouTube videos and then simulate that person's speech patterns. If I want to add some extra "coolness," I will evaluate some text-to-speech libraries to convert the simulated speech back to audio.
Sounds pretty fun. We'll see.