
How to Index Podcasts with Keywords like on Huberman's Website

Company
AssemblyAI

Date published
Feb. 1, 2024

Transcript

So you might know Andrew Huberman. He's a very successful podcaster. I think he's also originally a neuroscientist. He generally talks about wellness and health in his podcast. He has interesting guests over and gets millions of listens every month, on every episode. And when you go on his website, you can actually search for anything that you're interested in and get some results. So for example, I can look for supplements, and then you'll get some results that could include a page where he specifically shares his ideas or thoughts about supplements for health and performance, and links to some of the podcast episodes where he talks about supplements. You can search for sleep quality, for example; if you want to hear his or his guests' thoughts about sleep quality, you can listen to the specific timestamps. So it's interesting, because Andrew Huberman is a very organized creator. I don't know if he did this himself, probably not. He probably had a team behind it to make this happen. But it's still nice to see that all this audio content, which could be deemed unstructured data, can be turned into something very structured and searchable. So I want to show you how you can build this yourself using AssemblyAI. It's actually quite a simple process. They might have used a manual process, but we're going to do this completely automatically, or programmatically, if you like to call it that way, just by coding in Python. So there are multiple services you can use through AssemblyAI to achieve a similar result. If we go all the way to Audio Intelligence, maybe I'll make my screen a bit bigger so it's clearer to see. Under Audio Intelligence you have topic detection, auto chapters, and key phrases. Key phrases and topic detection will give you what is being talked about in the audio. Auto chapters will give you chapters, so your audio divided into chapters, and also a quick headline of what is being talked about in each chapter specifically.
So that one would be useful to make those timestamps that we've seen. So if we search for it again, supplements, you see that he has his timestamps. So this timestamp is a specific point in a specific episode where he's talking about supplements. So this would be one way of getting that information. If you use topic detection, you're going to get a list of topics that are being talked about in your audio, but it's going to be for the whole audio, and these categories, these topics, are predetermined. So there's this thing called the IAB taxonomy, and it's basically a list of things any content can be about, and you're going to get a sublist from that predetermined list. Key phrases, on the other hand, is a little bit more free. You just get some key phrases or words. So these could be phrases, could be more than one word, that are mentioned in the audio that you send to AssemblyAI. So you can definitely build something with all three of these in combination to get something like Huberman's website. But to keep this video a little bit quicker, I'm just going to show you how to use the key phrases model to achieve something similar. But if you want me to cover the other ones too, definitely leave a comment and let us know. All right, so here is the code. I'm not going to go step by step and code everything from scratch, because I think this is going to be a bit easier to show you how everything works. If you want to follow along, though, or if you want to try it yourself, definitely go download it from the GitHub repository. I will leave a link to it in the description. So the first thing that I want to show you is how to collect the audio of the episodes, because, yeah, you can go on Spotify and search for these episodes, or maybe you listen to them on Apple Podcasts. It doesn't matter where you listen; normally there is always an interface, but most of the time these podcasts have something called an RSS feed.
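Before moving on to collecting the episodes, here is a minimal sketch of how the three Audio Intelligence models described above can each be toggled with a boolean flag on the transcription config, assuming the AssemblyAI Python SDK (the exact combination of flags you enable is up to you):

```python
import assemblyai as aai  # third-party SDK: pip install assemblyai

# Each Audio Intelligence model is a boolean flag on the config.
config = aai.TranscriptionConfig(
    iab_categories=True,   # topic detection against the predetermined IAB taxonomy
    auto_chapters=True,    # chapters with summaries and timestamps
    auto_highlights=True,  # key phrases
)
```

In this video only `auto_highlights` (the key phrases model) is used.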
And this RSS feed is basically where Spotify and Apple Podcasts also get the information for a new podcast. So basically we're going to the source, and that's where we're going to collect the episode audio URLs from. So to find Huberman's RSS, generally what I do, honestly, is search for "Huberman Lab podcast RSS", or you can just search for whatever podcast you're looking for on Google. Then you'll probably end up at the landing page of that podcast, and you can say "other players", RSS, and then we end up at a page like this that looks like something broke, but it didn't. This is the RSS feed. In the RSS you actually have a lot of information. You'll probably first have the title of the podcast itself and some information like the website of the podcast and the description. And then if you go further down, you're going to start seeing episodes. So the title for the episode; this link is probably a link to the website and not to the audio, so make sure to scrape the right thing; and then a description; some iTunes-specific tags here, the subtitle; they even provide you with a summary; and probably also a date when it was published. Yeah, pubDate. This is also something we can set aside. And how long this episode is; this is in seconds. And also here, the enclosure URL. If you look at it, it's going to be an MP3. So this is going to be the audio. So let me just copy and paste this and show you that this leads to the audio. "Welcome to the Huberman Lab podcast, where we discuss science and science-based tools for everyday life." All right, so that's basically it. So what you can do, and what I've done here, is use this thing called feedparser. With feedparser you can just pass it an RSS feed and it's going to make it very easy for you to extract anything that you want from there.
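The video uses `feedparser` for this, but to show where the pieces live in the feed, here is a dependency-free sketch using the standard library's `xml.etree.ElementTree` on a tiny, made-up RSS snippet shaped like the feed described above:

```python
import xml.etree.ElementTree as ET

# A minimal RSS snippet shaped like the Huberman Lab feed (made up for illustration).
RSS = """<rss><channel>
  <item>
    <title>Episode 1: Sleep</title>
    <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/>
  </item>
</channel></rss>"""

def collect_episodes(xml_text):
    """Extract title, publish date, and audio URL from each <item>."""
    root = ET.fromstring(xml_text)
    episodes = []
    for item in root.iter("item"):
        episodes.append({
            "title": item.findtext("title"),
            "date_published": item.findtext("pubDate"),
            # The <enclosure> url attribute is the MP3, not the episode web page.
            "audio_url": item.find("enclosure").get("url"),
        })
    return episodes
```

In practice `feedparser` is the better choice, since it handles namespaces (like the iTunes tags) and malformed feeds for you; the sketch is just to make the structure concrete.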
So in this function, all I'm doing is pass this RSS feed that I have already specified in my main Python file to the collect episode URLs function. In the collect episode URLs function, I pass the RSS feed to feedparser to be parsed. And then for each episode, each entry being an episode, I extract the title, the subtitle, the publish date, and then the audio URL. For every podcast, for every RSS feed, this might be a little bit different; these tags might be different. The best way to find out is to first take a look at the RSS here. Or one thing you can do is pretty-print it after you parse it in Python, to understand what everything looks like. I'm not going to do that right now because it's quite a simple process. Leave a comment if you have any problems with this section of it. One other thing that I'm going to get is the episode length, but I want to keep it in hours, minutes, and seconds instead of seconds only, because it doesn't really mean much if it says 11,311 seconds, just so that we can compare how long the episodes are and how long it took us to transcribe them and extract keywords. All right, and then I'm going to append, or kind of keep, all of this information, so the title, description, date published, audio URL, and episode length, in a dictionary, so that it's going to be easier for me to extract information from it later. So in this episode info dictionary, I'm going to have all of this information. And just to make it easier for me to reach it, I also keep the URLs in a separate list where I have just the links to the episode audios, and I will return this. And once I return this, I'm going to pass the episode URLs, and how many episodes I want out of that, to my generate keywords from audio function. The reason I also included a number of episodes is because I realized this RSS feed has, I think, 190 episodes or something like that. And there might be a limit to how far back you want to look.
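The seconds-to-hours conversion mentioned above can be done with a small helper like this (a hypothetical name, not necessarily what the repository uses):

```python
def format_length(total_seconds):
    """Convert a raw seconds count into a readable 'Xh Ym Zs' string."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"
```

For example, `format_length(11311)` returns `"3h 8m 31s"`, which is much easier to compare at a glance than 11,311 seconds.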
You could also make the limit, for example, a date that you want to cut it off at. Maybe you don't want to look at episodes older than a year. But yeah, fair enough, maybe you want to look at all the episodes, which is also possible; then you could go ahead with that. So let's take a look at how we can generate the keywords from audio, and that function is right here. A couple of things before we get started on this section: you definitely need an AssemblyAI API key for this, which is super easy to get. All you have to do is go to assemblyai.com. And then I think if I log in, I'm going to be logged in automatically. Oh no, I wasn't. Well, you just say you want to create an account, fill in your information, and after you've done that, let me just sign in here, and then you get your API key immediately. Yeah, you just need to copy this one. If you are just getting started with AssemblyAI, you can also try these simple examples first. There are a couple of very simple ways of transcribing audio and identifying speakers in your audio, and there are also some example audio files included, so it'll make it easier for you. All right, so once you have your API key, you can either specify it here directly or you can make it one of your environment variables and then get it through your environment variables. It's easier for me this way, so I do not have to share my API key on the screen every time and then go change it. That's why I've opted for this option. All right, so we were going to take a look at the generate keywords from audio function and what we do here. So obviously what I want to do first is only get the number of episodes that I want, and I do that by saying: take the episode URLs list and only take the first n episodes. And then this is maybe the easiest part, which is using AssemblyAI. For that you only need to import assemblyai as aai and create a transcriber object.
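The environment-variable approach described above can look like this (the variable name `ASSEMBLYAI_API_KEY` is an assumption; use whatever name you exported):

```python
import os

# Read the key from the environment so it never appears on screen or in the repo.
# ASSEMBLYAI_API_KEY is a hypothetical variable name; pick your own.
api_key = os.environ.get("ASSEMBLYAI_API_KEY", "")

# With the SDK you would then set:
#   import assemblyai as aai
#   aai.settings.api_key = api_key
```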
And normally you would just say transcriber dot transcribe and then pass your URL to it, and that's it. But this time we want to do some extra things, so I'll show you here what we're doing. I also want to start a timer, and this is where the timer ends, just to see how long this whole transcription is going to take. Because on the first try, I want to try to transcribe ten episodes at the same time; I've calculated it before, and it amounts to more than 24 hours of audio data, and I want to see how long it takes. So I will print the result here, and we will be able to see how long the whole transcription took. Spoiler alert: it's going to be pretty quick. So here, instead of transcribing the audios one by one, you can actually transcribe a couple of them at a time. There is this thing called the concurrency limit at AssemblyAI. Once you've created your account, you can go to your account page and see your concurrency limit. This is how many audio files you can have transcribing at any given time. For me it's 200, and like I said, Huberman apparently had around 190 episodes, so I could even pass all of them if I wanted to. But I'm not actually going to use the results immediately for a project that I have; it's only for this tutorial. So I think ten or twenty would be a good place to start, and that's why I'm only going to pass that many. But anyway, instead of waiting for each episode to be transcribed before starting a new one, I'm just going to pass all of them to AssemblyAI, and they're going to deal with doing them at the same time, in a pretty fast, or short, amount of time. So that's why, through the transcriber, I call transcribe group, and inside transcribe group I basically pass the episodes. So these are going to be the links, the URLs, to the audios of ten different episodes. And then I also want to call key phrases.
When I was cleaning the code, apparently I deleted the wrong thing, so let's go here and take a look at the documentation. For that, I just need to set auto highlights to true. So inside the transcription config I can set auto highlights to be true. It's honestly this easy to toggle the different audio intelligence models on and off. You can also take, for example, speaker labels and set it to true, and then as part of the response that you get from AssemblyAI, you will also have speaker labels included. Then all you have to do is parse them out of there. But for now I only want auto highlights, and we don't really need to do anything else. So as you can see in the documentation, we create the config separately and then pass it here, but in this code we just create the config immediately here. And then, like I said, I'm going to stop the timer to see how long it took, and then we're going to print that, and that's it. As a result, I'm going to get the transcript group from AssemblyAI, and that is going to include the different transcript responses. Within each of them, we're going to get a transcription plus the key phrases that are being talked about in that episode. So once I return that, the next thing is to extract these key phrases and bring them together with the episode information that I collected before. So this includes, let's take a look, what did we include? Title, description, date, and audio URL we already have, but also episode length. This way we will have the key phrases that are being talked about in every episode, plus some extra information, so if you want to use the results in a website like we've seen on Andrew Huberman's website, it will be much easier. The last thing that we want to call is generate analysis. But maybe I can set up some breakpoints here and there, and then we will actually take a look at what these responses look like. So I'm just going to start running this, and I've set the number of episodes to ten.
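The timed, concurrent transcription step just described can be sketched like this. It is a sketch under assumptions, not the repository's exact code: the transcriber is passed in as an argument so the timing logic can be shown on its own, and the commented-out lines show how it would be wired up with the real SDK.

```python
import time

def transcribe_episodes(transcriber, episode_urls, n, config=None):
    """Transcribe the first n episode URLs concurrently and report how long it took."""
    start = time.time()
    # transcribe_group submits all URLs at once, up to your account's concurrency limit,
    # instead of waiting for each episode before starting the next.
    group = transcriber.transcribe_group(episode_urls[:n], config=config)
    print(f"Transcribed {n} episodes in {time.time() - start:.0f}s")
    return group

# With the real SDK (an assumed sketch, based on AssemblyAI's Python documentation):
#   import assemblyai as aai
#   aai.settings.api_key = "YOUR_KEY"  # or read it from an environment variable
#   config = aai.TranscriptionConfig(auto_highlights=True)
#   group = transcribe_episodes(aai.Transcriber(), episode_urls, 10, config)
```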
Let's debug it. All right, so I'll just skip to the next one, because we already talked about what's happening in there. Apparently I forgot a breakpoint in there. Let's go. All right, so let's take a look at what it looks like. I hope it's big enough to see. Maybe too big, but yeah, let's take a look. So far, what we got from collect episode URLs is the episode URLs and the episode info. Episode URLs, like I said, is just a long list of all the episode URLs. Yeah, the length is 191, so 191 episodes. And then episode info is just a list of dictionaries, and in each of these dictionaries we have title, description, date, audio URL, and episode length. As you can see, the episodes are quite long: 3 hours, 8 minutes; another one, 1 hour, 5 minutes; this one, 3 hours, 11 minutes. So they're pretty long. Like I said, I calculated it previously and it was more than 24 hours of data. Okay, let's go to the next one. So right now is where we pass it to AssemblyAI, and where the transcription plus extracting the keywords is going to happen. So let's wait and see how long it's going to take for these 24-plus hours of audio. All right, now the transcription is done, and it took only two minutes and 14 seconds for ten episodes to be transcribed and their keywords to be extracted. And like I said, it's more than 24 hours of audio data, so that is quite fast. Next, where we left off, is generate analysis. What I pass to it is the transcript group and the episode info. So let's take a look at what those look like. We've seen what episode info is before: it's basically a list of dictionaries, and we can see it has title, description, date, audio URL, and episode length. And then let's take a look at what the transcript group is. The transcript group is again a list of transcripts. So if you called AssemblyAI one by one, one of these elements in this list would be what you would get as a response. It includes the audio duration and audio URL.
We turned on auto highlights, so that's why we get our responses here. Inside auto highlights we have results, and in the results we basically have a list. So this list says: one key phrase that is talked about in this episode is pain; how many times it was talked about; the timestamps of when it was talked about, in terms of start and end; and rank, which is basically how relevant this key phrase is to the whole episode. There is no set number of key phrases that are extracted; it depends on the audio. If there are more key phrases, we get a longer list; if there are fewer, we get a shorter list. All right, so I'll just make this smaller and show you what we do. So basically, for each transcript, or for each episode in the transcript group, we get the key phrases, and it's basically just reading them from the JSON response that we received. We read the audio URL, the text (so the text of the key phrase), the rank, the start, and the end, and then we put them again into a list of dictionaries. So once this for loop runs, I have a data frame, and in it I have all the keywords of all the episodes and the timestamps of when they were mentioned. And what I do is merge that with the episode info data frame that I've just created here from that list of dictionaries. I match them on audio URL and make sure that I include all of the keywords. One extra thing I did here is drop the start, end, and rank information, because I don't necessarily need it immediately; I just want to get a list of phrases, not their specific locations just yet. But if you want, you can keep that information. And then I save this to a CSV. So let's just run this and see what our CSV looks like. I mean, we also print it here, but it's probably easier to see in a CSV. So this is the resulting file that we get from our analysis. We have the URL of the audio, then we have the title of the episode, the description of the episode, the date, and the episode length.
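The video does this merge with pandas data frames; the same join can be sketched with the standard library only. This is a hypothetical simplification: it assumes the key phrases have already been collected per audio URL (with start, end, and rank dropped, as described above).

```python
import csv

def merge_keywords(episode_info, highlights_by_url):
    """Join each episode's key phrases onto its metadata, matching on audio URL.

    highlights_by_url maps audio_url -> list of key phrase strings.
    """
    rows = []
    for ep in episode_info:
        phrases = highlights_by_url.get(ep["audio_url"], [])
        rows.append({**ep, "keywords": ", ".join(phrases)})
    return rows

def save_csv(rows, path):
    """Write the merged rows out so they're easy to eyeball in a spreadsheet."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

With pandas, the equivalent step is a `merge` of the keywords frame and the episode info frame on the audio URL column; the stdlib version just makes the join explicit.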
And then we get all the keywords that are mentioned in each episode. So for example, this episode is more about pain. Then we could take a look at this episode; it's about toxic people, strange things. And then we have an episode about red light therapy and the effect of light. I'm assuming we have one about sleep; we have one about food. And this is only ten episodes. Of course, there are different ways you can use this information, so I just want to show you a couple of examples, also inspired by Andrew Huberman's website. So let's go to Huberman's website again, Huberman Lab, and have it next to the results. We had one about light therapy, right? So, for example, red light therapy. Let's see what kind of results we get. Well, we didn't actually get anything. What about light therapy? All right, so now we get something. But if you were specifically looking for red light therapy, it was not possible to find it on Huberman's website. But this one, AMA number 14, yeah, that is the same episode. What if you were curious about what he thinks about junk food, which I think is also something I've seen here in the food section? Yeah, for example, one of the keywords is junk food. Let's see what he thinks about junk food. Again, you don't really get a result; it's maybe too specific, and more generally you'd look for food. But if you want to know in which episode he talks about junk food, then you can take a look at our list and see that it's in the "how sugar and processed foods impact your health" episode. All right, so let's look at another one. For example, it talks about immune cells here, the immune system. So then, yeah, you see some of the other episodes where he's talking about it. One extra addition I found here is that you can also get the timestamps. Like I said, timestamps are included if you get the key phrases, so we can definitely do something similar to this.
But if you want to get more of a chapter where the immune system is being discussed, you can instead try auto chapters, or maybe try it together with key phrases. So these two models from AssemblyAI in combination would help you build something like those timestamps too. But like I said, if you want to see how that works, or if you want to see how we can build it, leave a comment below and I'll make a video for that too. The last thing that I want to show you is how much this all cost me. So I realized when I was coding, where is it, here, that I made a mistake in the start and end time part. As the fate of all demos goes, there was just one thing that I didn't actually test was running, and that didn't work. So I had to run it a couple of times, over and over again. But I want to show you my stats on AssemblyAI. So like I said, if I go to usage, the ten episodes in total were 24 hours and 20 minutes or something like that. Because I had to run it over and over again, I ended up running 152 hours of audio transcription plus the key phrases model. What I paid for that is $56 for the core transcription and key phrases. So if I do a quick calculation, I paid $56.42 for 152.51 hours of transcription, and that comes out to $0.37 per hour of audio transcribed. So it's nice to see that even if you have very long audio you're working with, and this could be a very long podcast, a bunch of podcast episodes, lectures, movies, anything that you might need to transcribe and get insights from, you can do it with AssemblyAI in a pretty fast and economical way. Next up, if you want to learn how to build a program that will let you dictate directly into Google Docs, watch Samantha's video. But for now, thanks for watching. Don't forget to subscribe, and I will see you in the next video.
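A quick sanity check of the pricing arithmetic quoted above:

```python
total_cost = 56.42    # dollars, from the usage dashboard
total_hours = 152.51  # hours of audio transcribed across all runs
rate = total_cost / total_hours
print(f"${rate:.2f} per hour of audio")  # ≈ $0.37, matching the figure in the video
```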


By Matt Makai. 2021-2024.