import pandas as pd
movies = pd.read_csv("movies.csv")
In this blog post we’re going to use the Scrapy framework to scrape data from the Internet! The GitHub repository for this project can be found here.
To set up a Scrapy project, run these three commands in your terminal at the directory you want your project to be contained in:
conda activate <environment name>
scrapy startproject TMDB_scraper
cd TMDB_scraper
This will create a lot of files, but the only ones we’ll have to worry about are settings.py and another Python file we create in the spiders directory named tmdb_spider.py. In this file we define the TmdbSpider class.
The main goal of our spider (the scraper bot) is to find a list of actors that have worked on a particular movie and, for each of those actors, find out which movies or shows they have been in. The movie we’ll be starting with is Parasite, but you can easily change the start_urls variable to whichever movie or show you like.

Aside from start_urls, notice that the TmdbSpider class has a name variable. Every spider must have a unique name.
Next, we implement three methods:
def parse(self, response):
    '''
    Assuming spider starts on a movie page, navigates to the cast page
    '''
    cast_url = response.request.url + "/cast"
    yield scrapy.Request(cast_url, callback=self.parse_full_credits)
When the program is run, a Scrapy request is made on the URL(s) in start_urls, which returns the HTML document in response. Then parse() is called on this response. Here, we simply want to navigate from the movie page to the cast page. This part is hard-coded, since the Cast and Crew URL of a movie is simply the movie URL appended with “/cast”. All we do here is yield another Scrapy Request, but this time we want the response to be handled by our next method, parse_full_credits().
def parse_full_credits(self, response):
    '''
    Scrapes list of actor pages from cast page and navigates to each
    '''
    # finds list of actor pages linked in cast page
    cast = response.css("section.panel.pad")[0]
    actor_pages = cast.css("div.info p a::attr(href)").getall()

    # calls parse_actor_page for each actor page in list
    for page in actor_pages:
        yield response.follow(page, callback=self.parse_actor_page)
Here’s where things get more interesting. The css() function, when called, parses the response and finds every block of HTML contained within the tags we specify. For example, when viewing the source code for the Cast and Crew webpage, notice that the link to each actor page is contained within the tags <div class="info"> <p> <a>, so we call css("div.info p a::attr(href)") (the ::attr(href) part tells the function to return the links to follow, which are stored in the HTML as an href attribute). However, these tags hold crew information as well, which we don’t need, so we need some way to isolate the cast only.

It turns out that each Cast and Crew page is separated into one panel for the cast and one for the crew. We can use css("section.panel.pad") to find both panel objects, i.e. whatever is contained within the tag <section class="panel pad">, and isolate just the first one. In this way we end up with a list of actor page links to follow, and for each of them we use follow() to yield another Scrapy request on the given URL. Here, we again specify a different method to use on the actor pages.
def parse_actor_page(self, response):
    '''
    Starts on actor page, scrapes and yields movies/shows
    that actor has been in
    '''
    # finds actor name
    actor_name = response.css("title::text").get().split(" — ")[0]

    # finds list of movies/shows that actor has been in
    credits = response.css("table.card.credits")[0]
    credit_names = credits.css("td.role a.tooltip bdi::text").getall()

    # yields dict for each movie/show, corresponds to a line in .csv
    for credit in credit_names:
        yield {"actor": actor_name, "movie_or_TV_name": credit}
The parse_actor_page() method starts on each of our actor pages and first isolates the actor’s name, which can be found in the <title> tag. Then, like before, we use the css() function to isolate the table that contains the list of movies and shows that the actor has been in, and then scrape that table for said list. Finally, we yield a dictionary for each item in the list with the actor name and movie or show name. Each yielded dictionary corresponds to a line in the spreadsheet that will be output.
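The name extraction relies on the actor page’s <title> text following the pattern “<actor name> — <site name>” (the exact suffix below is an assumption for illustration); splitting on the separator and keeping the first piece isolates the name:

```python
# hypothetical <title> text from an actor page
title = "Song Kang-ho — The Movie Database (TMDB)"

# split on the " — " separator and keep everything before it
actor_name = title.split(" — ")[0]
print(actor_name)  # Song Kang-ho
```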
Now that we have our spider code, we can run it by navigating to the TMDB_scraper directory in the terminal and running the command scrapy crawl tmdb_spider -o movies.csv. This runs the spider and saves its output in a file called movies.csv. Let’s use Pandas to see our results!
movies
|     | actor        | movie_or_TV_name                                 |
|-----|--------------|--------------------------------------------------|
| 0   | Song Kang-ho | Uncle Samshik                                    |
| 1   | Song Kang-ho | Cobweb                                           |
| 2   | Song Kang-ho | One Win                                          |
| 3   | Song Kang-ho | Emergency Declaration                            |
| 4   | Song Kang-ho | Civilization Express - The Great Feast of Movies |
| ... | ...          | ...                                              |
| 644 | Lee Jung-eun | KBS Drama Special                                |
| 645 | Lee Jung-eun | Mother                                           |
| 646 | Lee Jung-eun | Wanee & Junah                                    |
| 647 | Lee Jung-eun | A Masterpiece in My Life                         |
| 648 | Lee Jung-eun | The Oscars                                       |

649 rows × 2 columns
This is exactly what we wanted: a list of every actor that appeared in Parasite, as well as every movie or show that they in turn have worked on. We can now run some analysis on this data. For example, what if we want to know which movie or TV show has the largest number of actors in common with Parasite?
#counts number of appearances of each movie or show in dataframe and sorts descending
count = movies.groupby("movie_or_TV_name").apply(pd.DataFrame.count)
count = count[["actor"]].reset_index()
count = count.sort_values(by="actor", ascending=False)
count = count.rename(columns={"movie_or_TV_name": "name", "actor": "number of shared actors"})
count.head()
|     | name                          | number of shared actors |
|-----|-------------------------------|-------------------------|
| 284 | Parasite                      | 44                      |
| 411 | The Oscars                    | 8                       |
| 262 | Okja                          | 5                       |
| 43  | Baeksang Arts Awards          | 5                       |
| 248 | Next Entertainment, Visionary | 4                       |
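As a side note, the groupby pipeline above can be collapsed into a single value_counts() call, which tallies and sorts in one step. A sketch on a tiny stand-in dataframe (hypothetical rows, not the real scrape output):

```python
import pandas as pd

# hypothetical miniature of the scraped data
movies = pd.DataFrame({
    "actor": ["A", "A", "B", "B", "C"],
    "movie_or_TV_name": ["Parasite", "Okja", "Parasite", "Okja", "Parasite"],
})

# value_counts tallies appearances and sorts descending in one step
count = (movies["movie_or_TV_name"]
         .value_counts()
         .rename_axis("name")
         .reset_index(name="number of shared actors"))
print(count)
```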
Okay, so there are a few rows that we don’t actually need; namely, award shows and Parasite itself. We’re just going to have to manually remove those.
count = count.drop([284, 411, 43, 248])
count.head(10)
|     | name                      | number of shared actors |
|-----|---------------------------|-------------------------|
| 262 | Okja                      | 5                       |
| 395 | The Host                  | 3                       |
| 101 | Emergency Declaration     | 3                       |
| 273 | Our Little Summer Vacation| 3                       |
| 110 | Fight For My Way          | 3                       |
| 320 | Secret Sunshine           | 3                       |
| 287 | Peninsula                 | 3                       |
| 367 | The Attorney              | 3                       |
| 128 | Hellbound                 | 3                       |
| 204 | Man of Will               | 3                       |
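Dropping by those integer index labels works, but the labels change whenever the scrape is re-run; filtering by name is sturdier. A sketch on a hypothetical miniature of the count dataframe:

```python
import pandas as pd

# hypothetical miniature of the count dataframe
count = pd.DataFrame({
    "name": ["Parasite", "The Oscars", "Okja", "Baeksang Arts Awards"],
    "number of shared actors": [44, 8, 5, 5],
})

# drop rows by name instead of by positional index label
unwanted = ["Parasite", "The Oscars", "Baeksang Arts Awards"]
count = count[~count["name"].isin(unwanted)].reset_index(drop=True)
print(count)
```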
And just because we can, let’s graph our top 10.
#imports and visualization
from plotly import express as px
import plotly.io as pio
pio.renderers.default = 'iframe'

#plotting
df = count.head(10)
fig = px.bar(data_frame = df,
             x = "name",
             y = "number of shared actors",
             width = 800,
             height = 400)

#labels and axis
fig.update_layout(title = "Movies/TV Shows Sharing Actors with 'Parasite'",
                  xaxis_title = 'Movie/TV Show',
                  yaxis_title = 'Number of Shared Actors')
It turns out that director Bong Joon-ho likes to work with the same actors! The top two films listed, Okja and The Host, are both also movies directed by him. That aside, if you like seeing familiar faces in your South Korean films, maybe you can take inspiration from the movies listed here. And if not, then at least you know how to find similar recommendations for a movie or show that you personally enjoy!