Scraping Movie Data with Scrapy

HW 2
Author

Alvin

In this blog post we’re going to use the Scrapy framework to scrape data from the Internet! The GitHub repository for this project can be found here.

To set up a Scrapy project, run these three commands in your terminal at the directory you want your project to be contained in:

conda activate <environment name>
scrapy startproject TMDB_scraper
cd TMDB_scraper

This will create a lot of files, but the only ones we’ll have to worry about are settings.py and another Python file we create in the spiders directory named tmdb_spider.py. In this file we define the TmdbSpider class.
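For reference, the generated project is laid out roughly like this (the exact files may vary a bit between Scrapy versions), with tmdb_spider.py being the file we add ourselves:

TMDB_scraper/
    scrapy.cfg
    TMDB_scraper/
        settings.py
        items.py
        middlewares.py
        pipelines.py
        spiders/
            tmdb_spider.py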

The main goal of our spider (the scraper bot) is to find a list of actors that have worked on a particular movie and for each of those actors find out which movies or shows they have been in. The movie we’ll be starting with is Parasite, but you can easily change the start_urls variable to whichever movie or show you like.

Aside from start_urls, notice that the TmdbSpider class has a name variable. Every spider must have a unique name.
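To make this concrete, here is a rough sketch of what the top of tmdb_spider.py looks like (the start_urls entry below is the TMDB page for Parasite; swap in any other movie or show page you like):

import scrapy

class TmdbSpider(scrapy.Spider):
    #unique name used to invoke the spider from the command line
    name = "tmdb_spider"

    #TMDB page of the movie we start from (example URL for Parasite)
    start_urls = ["https://www.themoviedb.org/movie/496243-gisaengchung"]

    #parse(), parse_full_credits(), and parse_actor_page() are defined below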

Next, we implement three methods:

    def parse(self, response):
        '''
        Assuming we start on a movie page, navigates to the cast page
        '''
        cast_url = response.request.url + "/cast"
        yield scrapy.Request(cast_url, callback=self.parse_full_credits)

When the spider is run, Scrapy sends a request to each URL in start_urls and returns the page’s HTML in response, on which parse() is then called. Here, we simply want to navigate from the movie page to the cast page. This part is hard-coded in, since the Cast and Crew URL of a movie is simply the movie URL with “/cast” appended. All we do here is yield another Scrapy Request, but this time we specify that the response should be handled by our next method, parse_full_credits().

    def parse_full_credits(self, response):
        '''
        Scrapes list of actor pages from cast page and navigates to each
        '''
        
        #finds list of actor pages linked in cast page
        cast = response.css("section.panel.pad")[0]
        actor_pages = cast.css("div.info p a::attr(href)").getall()
        
        #calls parse_actor_page for each actor page in list
        for page in actor_pages:
            yield response.follow(page, callback=self.parse_actor_page)

Here’s where things get more interesting. The css() method parses through the response and returns every block of HTML matched by the selector we specify. For example, when viewing the source code for the Cast and Crew webpage, notice that the link to each actor page is contained within the tags <div class="info"> <p> <a>, so we call css("div.info p a::attr(href)") (the ::attr(href) part tells the selector to return the value of each matched element’s href attribute, which is exactly the link we want to follow). However, the same tags are used for crew information as well, which we don’t need, so we have to find some way to isolate the cast only.

It turns out that each Cast and Crew page is separated into one panel for the cast and one for the crew. We can use css("section.panel.pad") to find both panel objects, i.e. whatever is contained within a <section class="panel pad"> tag, and keep just the first one. In this way we end up with a list of actor page links to follow, and for each of them we use follow() to yield another Scrapy request on the given URL. Here, we again specify a different method, parse_actor_page(), to handle the actor pages.
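If you want to experiment with these selectors before committing them to the spider, Scrapy’s interactive shell is handy. As a quick sketch (using the Parasite cast page as an example URL):

scrapy shell "https://www.themoviedb.org/movie/496243-gisaengchung/cast"
>>> cast = response.css("section.panel.pad")[0]
>>> actor_pages = cast.css("div.info p a::attr(href)").getall()
>>> actor_pages[:3]   #relative links of the form '/person/<id>-<name>'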

    def parse_actor_page(self, response):
        '''
        Starts on actor page, scrapes and yields movies/shows 
        that actor has been in
        '''
        
        #finds actor name
        actor_name = response.css("title::text").get().split(" — ")[0]
        
        #finds list of movies/shows that actor has been in
        credits = response.css("table.card.credits")[0]
        credit_names = credits.css("td.role a.tooltip bdi::text").getall()
        
        #yields dict for each movie/show, corresponds to a line in .csv
        for credit in credit_names:
            yield {"actor":actor_name,"movie_or_TV_name":credit}

The parse_actor_page() method starts on each of our actor pages and first isolates the actor’s name, which can be found in the <title> tag. Then, like before, we use the css() method to isolate the table that contains the list of movies and shows that the actor has been in, and then scrape that table for said list. Finally, we yield a dictionary for each item in the list with the actor name and the movie or show name. Each yielded dictionary corresponds to one row of the CSV file that will be output.

Now that we have our spider code, we can run it by navigating to the TMDB_scraper directory in the terminal and running the command scrapy crawl tmdb_spider -o movies.csv. This runs the spider and saves its output in a file called movies.csv. Let’s use Pandas to see our results!
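One quick note before we do: in recent versions of Scrapy, -o appends to movies.csv if the file already exists, so if you rerun the spider you may prefer the capital-O flag, which overwrites the file instead:

scrapy crawl tmdb_spider -O movies.csv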

import pandas as pd

movies = pd.read_csv("movies.csv")
movies
       actor          movie_or_TV_name
0      Song Kang-ho   Uncle Samshik
1      Song Kang-ho   Cobweb
2      Song Kang-ho   One Win
3      Song Kang-ho   Emergency Declaration
4      Song Kang-ho   Civilization Express - The Great Feast of Movies
...    ...            ...
644    Lee Jung-eun   KBS Drama Special
645    Lee Jung-eun   Mother
646    Lee Jung-eun   Wanee & Junah
647    Lee Jung-eun   A Masterpiece in My Life
648    Lee Jung-eun   The Oscars

649 rows × 2 columns

This is exactly what we wanted: a list of every actor that appeared in Parasite, as well as every movie or show that they in turn have worked on. We can now run some analysis on this data. For example, what if we want to know which movie or TV show has the largest number of actors in common with Parasite?

#counts number of appearances of each movie or show in dataframe and sorts descending
count = movies.groupby("movie_or_TV_name").apply(pd.DataFrame.count)
count = count[["actor"]].reset_index()
count = count.sort_values(by="actor", ascending=False)
count = count.rename(columns={"movie_or_TV_name": "name", "actor": "number of shared actors"})
count.head()
     name                           number of shared actors
284  Parasite                       44
411  The Oscars                     8
262  Okja                           5
43   Baeksang Arts Awards           5
248  Next Entertainment, Visionary  4

Okay, so there are a few rows that we don’t actually need; namely, award shows and Parasite itself. We’re just going to have to manually remove those.

count = count.drop([284,411,43,248])
count.head(10)
     name                        number of shared actors
262  Okja                        5
395  The Host                    3
101  Emergency Declaration       3
273  Our Little Summer Vacation  3
110  Fight For My Way            3
320  Secret Sunshine             3
287  Peninsula                   3
367  The Attorney                3
128  Hellbound                   3
204  Man of Will                 3
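As an aside, dropping rows by their index labels is a little fragile, since the labels will shift if you rescrape or the data changes; a name-based filter is a slightly more robust alternative. A minimal sketch using the same rows we removed above:

#removes the same rows by name rather than by index label
to_remove = ["Parasite", "The Oscars", "Baeksang Arts Awards", "Next Entertainment, Visionary"]
count = count[~count["name"].isin(to_remove)]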

And just because we can, let’s graph our top 10.

#imports and visualization
from plotly import express as px
import plotly.io as pio 

pio.renderers.default = 'iframe'

#plotting
df = count.head(10)
fig = px.bar(data_frame = df, 
             x = "name", 
             y = "number of shared actors",
             width = 800,
             height = 400)

#title and axis labels
fig.update_layout(title = "Movies/TV Shows Sharing Actors with 'Parasite'",
                  xaxis_title = 'Movie/TV Show',
                  yaxis_title = 'Number of Shared Actors')

#display the figure
fig.show()

It turns out that director Bong Joon-ho likes to work with the same actors! The top two films listed, Okja and The Host, were also directed by him. That aside, if you like seeing familiar faces in your South Korean films, maybe you can take inspiration from the movies listed here. And if not, then at least you now know how to find similar recommendations for a movie or show that you personally enjoy!