Scrape data from a set of YouTube videos

Someone wise once wrote: write up the times you figured something out; you never know when someone else (or future you!) will find the walk-through to be helpful. (I’m paraphrasing. Heavily.)

This last week, I finally wrote a script I’d had on my to-do list for a long time. I needed to collect the view counts for a set of YouTube videos whose URLs had been provided to me. Besides being tedious and time-consuming, collecting this data manually also introduces the possibility of recording the wrong numbers. So, in comes Python to save the day.

First, I had to figure out what libraries to use to actually scrape and process the data. There are various options available, but not all are equally easy to use, in my opinion (or at least, not on a deadline). I ended up finding a tutorial from PythonCode that explained how to scrape and process YouTube data with requests_html and Beautiful Soup. It was pretty easy to update my installs of these libraries and adapt the code to loop through a list of URLs and dump the results into a CSV file.

# libraries
import csv
import time # for export filename formatting
from requests_html import HTMLSession # to scrape the urls
from bs4 import BeautifulSoup as bs # to pull out and format information

# list of all youtube video urls
event_urls = list()

# csv of all events data, with youtube video urls as a column
input_file = 'url-events-2.csv' # could refactor into a command-line argument

with open(input_file, newline='',encoding='utf-8-sig') as events_input:
    reader = csv.reader(events_input)
    next(reader) # skip the header row so we don't try to scrape column names
    for row in reader:
        # limit the amount of information to just the essentials
        event_urls.append([row[0], row[1], row[4]]) # change this if shape of CSV changes

session = HTMLSession()

# !!!! need to address: multiple listed youtube URLs, these apparently don't throw an error

for e in event_urls: # for each event
    if e[2]: # only scrape events that have a youtube url
        try:
            response = session.get(e[2]) # fetch the page
            response.html.render(sleep=1) # execute javascript so the meta tags exist
            soup = bs(response.html.html, 'html.parser') # parse the rendered html
            views = soup.find("meta", itemprop="interactionCount")['content'] # get the view count
        except Exception:
            views = 'n/a' # record n/a on any scraping/parsing failure
            print('problem with event:', e[1])
        e.append(views)
        # print(e)

# create filename based on the datetime right now
timestr = time.strftime("%Y%m%d-%H%M%S") # get datetime for output filename
output_filename = 'events_youtube_views_' + timestr + '.csv'

# write data to file
with open(output_filename, 'w', newline='', encoding='utf-8') as f: # newline='' as the csv docs recommend
    writer = csv.writer(f)
    writer.writerows(event_urls)

I have a few notes to myself in the code above, mainly about its lack of robustness and about ways to make it work without having to modify it. Once those issues are resolved, I’ll probably update this post. If you use this code, beware that as-is it depends on the CSV being formatted in a certain way. It’s formatted that way because I had extra information to which I needed to append the video view data. You can get around this by using a simple list of URLs, iterating through that, and pulling all the video metadata from YouTube instead.
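On the multiple-URLs note in the code: one way to handle a cell that lists more than one video would be to pull out every YouTube-looking link and scrape each of them in turn. This is just a sketch under my own assumption that links in a cell are separated by whitespace or commas; split_youtube_urls is a hypothetical helper name, not something from either library.

```python
import re

def split_youtube_urls(cell):
    """Return every YouTube link found in a single CSV cell, in order.

    Assumes links are separated by whitespace or commas; matches both
    youtube.com and youtu.be forms.
    """
    return re.findall(r'https?://(?:www\.)?(?:youtube\.com|youtu\.be)/[^\s,]+', cell)
```

With something like this, the loop above could treat e[2] as a list of URLs and record one view count per link, instead of silently passing a multi-URL string to session.get().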
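The simpler variant I mention — a plain list of URLs rather than a CSV with extra columns — could look something like this. The parsing is factored into a small helper so it can be tried without touching the network; views_from_html and scrape are names I made up for this sketch, and the scraping part still requires requests_html (and its headless browser) to be installed.

```python
from bs4 import BeautifulSoup as bs

def views_from_html(html):
    """Read the interactionCount meta tag out of a rendered YouTube page."""
    tag = bs(html, 'html.parser').find("meta", itemprop="interactionCount")
    return tag['content'] if tag is not None else 'n/a'

def scrape(urls, sleep=1):
    """Fetch and render each URL, returning a {url: view count or 'n/a'} dict."""
    from requests_html import HTMLSession  # imported here so the helper above works without it
    session = HTMLSession()
    results = {}
    for url in urls:
        response = session.get(url)
        response.html.render(sleep=sleep)  # execute javascript so the meta tags exist
        results[url] = views_from_html(response.html.html)
    return results
```

The same timestamped-CSV export from the script above would then work on scrape()'s results.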

Maybe this will help someone, but most likely it will help future me! Either way, I was glad to figure it out and automate a tedious task.
