Using Python BeautifulSoup
In this notebook we scrape Wikipedia pages to create a dataset of Disney movies.
The project covers the following steps:
First, I start with the data collection process, where I take data from multiple sources and build a disney_movie dataset.
The data is then stored in JSON format. Next, I load that loosely structured data into CSV format and start cleaning and preprocessing it.
I also collected additional data, such as the IMDb rating and Rotten Tomatoes score, from the OMDb API for every title and attached it to the dataset.
Task #1: Scrape the infobox from Toy Story 3 wiki page (save in python dictionary)
https://en.wikipedia.org/wiki/Toy_Story_3
Task #2: Scrape infobox for all movies in List of Disney Films (save as list of dictionaries)
https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films
#### Import Necessary Libraries
from bs4 import BeautifulSoup as bs
import requests
r = requests.get("https://en.wikipedia.org/wiki/Toy_Story_3")
# Convert to a beautiful soup object
soup = bs(r.content)
# Print out the HTML
contents = soup.prettify()
#print(contents)
The data we need is in the table whose class is "infobox vevent".
info_box = soup.find(class_="infobox vevent")
#print(info_box.prettify())
Let's get all the table rows in this infobox so that it's easy to go through them one by one.
info_rows = info_box.find_all("tr")
for row in info_rows:
    print(row.prettify())
<tr> <th class="infobox-above summary" colspan="2" style="font-size: 125%; font-style: italic;"> Toy Story 3 </th> </tr> <tr> <td class="infobox-image" colspan="2"> <a class="image" href="/wiki/File:Toy_Story_3_poster.jpg" title="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3."> <img alt="All of the toys packed close together, holding up a large numeral '3', with Buzz, who is putting a friendly arm around Woody's shoulder, and Woody holding the top of the 3." class="thumbborder" data-file-height="326" data-file-width="220" decoding="async" height="326" src="//upload.wikimedia.org/wikipedia/en/6/69/Toy_Story_3_poster.jpg" width="220"/> </a> <div class="infobox-caption"> Theatrical release poster </div> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Directed by </th> <td class="infobox-data"> <a href="/wiki/Lee_Unkrich" title="Lee Unkrich"> Lee Unkrich </a> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Screenplay by </th> <td class="infobox-data"> <a href="/wiki/Michael_Arndt" title="Michael Arndt"> Michael Arndt </a> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Story by </th> <td class="infobox-data"> <div class="plainlist"> <ul> <li> <a href="/wiki/John_Lasseter" title="John Lasseter"> John Lasseter </a> </li> <li> <a href="/wiki/Andrew_Stanton" title="Andrew Stanton"> Andrew Stanton </a> </li> <li> Lee Unkrich </li> </ul> </div> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Produced by </th> <td class="infobox-data"> <a href="/wiki/Darla_K._Anderson" title="Darla K. Anderson"> Darla K. 
Anderson </a> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Starring </th> <td class="infobox-data"> <div class="plainlist"> <ul> <li> <a href="/wiki/Tom_Hanks" title="Tom Hanks"> Tom Hanks </a> </li> <li> <a href="/wiki/Tim_Allen" title="Tim Allen"> Tim Allen </a> </li> <li> <a href="/wiki/Joan_Cusack" title="Joan Cusack"> Joan Cusack </a> </li> <li> <a href="/wiki/Don_Rickles" title="Don Rickles"> Don Rickles </a> </li> <li> <a href="/wiki/Wallace_Shawn" title="Wallace Shawn"> Wallace Shawn </a> </li> <li> <a href="/wiki/John_Ratzenberger" title="John Ratzenberger"> John Ratzenberger </a> </li> <li> <a href="/wiki/Estelle_Harris" title="Estelle Harris"> Estelle Harris </a> </li> <li> <a href="/wiki/Ned_Beatty" title="Ned Beatty"> Ned Beatty </a> </li> <li> <a href="/wiki/Michael_Keaton" title="Michael Keaton"> Michael Keaton </a> </li> <li> <a href="/wiki/Jodi_Benson" title="Jodi Benson"> Jodi Benson </a> </li> <li> <a href="/wiki/John_Morris_(actor)" title="John Morris (actor)"> John Morris </a> </li> </ul> </div> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Cinematography </th> <td class="infobox-data"> <div class="plainlist"> <ul> <li> Jeremy Lasky </li> <li> Kim White </li> </ul> </div> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Edited by </th> <td class="infobox-data"> <a href="/wiki/Ken_Schretzmann" title="Ken Schretzmann"> Ken Schretzmann </a> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Music by </th> <td class="infobox-data"> <a href="/wiki/Randy_Newman" title="Randy Newman"> Randy Newman </a> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;"> Production <br/> companies </div> 
</th> <td class="infobox-data"> <div style="vertical-align: middle;"> <div class="plainlist"> <ul> <li> <a href="/wiki/Walt_Disney_Pictures" title="Walt Disney Pictures"> Walt Disney Pictures </a> </li> <li> <a class="mw-redirect" href="/wiki/Pixar_Animation_Studios" title="Pixar Animation Studios"> Pixar Animation Studios </a> </li> </ul> </div> </div> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Distributed by </th> <td class="infobox-data"> <a href="/wiki/Walt_Disney_Studios_Motion_Pictures" title="Walt Disney Studios Motion Pictures"> Walt Disney Studios <br/> Motion Pictures </a> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;white-space: normal;"> Release dates </div> </th> <td class="infobox-data"> <div class="plainlist"> <ul> <li> June 12, 2010 <span style="display:none"> ( <span class="bday dtstart published updated"> 2010-06-12 </span> ) </span> ( <a href="/wiki/Taormina_Film_Fest" title="Taormina Film Fest"> Taormina Film Fest </a> ) </li> <li> June 18, 2010 <span style="display:none"> ( <span class="bday dtstart published updated"> 2010-06-18 </span> ) </span> (United States) </li> </ul> </div> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> <div style="display:inline-block; padding:0.1em 0;line-height:1.2em;white-space: normal;"> Running time </div> </th> <td class="infobox-data"> 103 minutes <sup class="reference" id="cite_ref-mojo1_1-0"> <a href="#cite_note-mojo1-1"> [1] </a> </sup> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Country </th> <td class="infobox-data"> United States </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Language </th> <td class="infobox-data"> English </td> 
</tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Budget </th> <td class="infobox-data"> $200 million <sup class="reference" id="cite_ref-mojo1_1-1"> <a href="#cite_note-mojo1-1"> [1] </a> </sup> </td> </tr> <tr> <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;"> Box office </th> <td class="infobox-data"> $1.067 billion <sup class="reference" id="cite_ref-mojo1_1-2"> <a href="#cite_note-mojo1-1"> [1] </a> </sup> </td> </tr>
movie_info_ = {}
for index, row in enumerate(info_rows):
    if index == 0:
        # The first row holds the movie title
        movie_info_['title'] = row.find("th").get_text()
    elif index == 1:
        # The second row holds the poster image, so skip it
        continue
    else:
        content_key = row.find("th").get_text()
        content_value = row.find("td").get_text()
        movie_info_[content_key] = content_value
print(movie_info_)
{'title': 'Toy Story 3', 'Directed by': 'Lee Unkrich', 'Screenplay by': 'Michael Arndt', 'Story by': '\nJohn Lasseter\nAndrew Stanton\nLee Unkrich\n', 'Produced by': 'Darla K. Anderson', 'Starring': '\nTom Hanks\nTim Allen\nJoan Cusack\nDon Rickles\nWallace Shawn\nJohn Ratzenberger\nEstelle Harris\nNed Beatty\nMichael Keaton\nJodi Benson\nJohn Morris\n', 'Cinematography': '\nJeremy Lasky\nKim White\n', 'Edited by': 'Ken Schretzmann', 'Music by': 'Randy Newman', 'Productioncompanies': '\nWalt Disney Pictures\nPixar Animation Studios\n', 'Distributed by': 'Walt Disney StudiosMotion Pictures', 'Release dates': '\nJune\xa012,\xa02010\xa0(2010-06-12) (Taormina Film Fest)\nJune\xa018,\xa02010\xa0(2010-06-18) (United States)\n', 'Running time': '103 minutes[1]', 'Country': 'United States', 'Language': 'English', 'Budget': '$200\xa0million[1]', 'Box office': '$1.067\xa0billion[1]'}
In the output above, fields with multiple names come back as a single newline-separated string, so we need to iterate through the list items instead.
We will split the content value based on the li elements in each cell.
def get_content_value(row_data):
    # Cells that contain a list become a Python list, one entry per <li>
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

movie_info = {}
for index, row in enumerate(info_rows):
    if index == 0:
        movie_info['title'] = row.find("th").get_text(" ", strip=True)
    elif index == 1:
        continue
    else:
        content_key = row.find("th").get_text(" ", strip=True)
        content_value = get_content_value(row.find("td"))
        movie_info[content_key] = content_value
movie_info
{'title': 'Toy Story 3', 'Directed by': 'Lee Unkrich', 'Screenplay by': 'Michael Arndt', 'Story by': ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich'], 'Produced by': 'Darla K. Anderson', 'Starring': ['Tom Hanks', 'Tim Allen', 'Joan Cusack', 'Don Rickles', 'Wallace Shawn', 'John Ratzenberger', 'Estelle Harris', 'Ned Beatty', 'Michael Keaton', 'Jodi Benson', 'John Morris'], 'Cinematography': ['Jeremy Lasky', 'Kim White'], 'Edited by': 'Ken Schretzmann', 'Music by': 'Randy Newman', 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'], 'Distributed by': 'Walt Disney Studios Motion Pictures', 'Release dates': ['June 12, 2010 ( 2010-06-12 ) ( Taormina Film Fest )', 'June 18, 2010 ( 2010-06-18 ) (United States)'], 'Running time': '103 minutes [1]', 'Country': 'United States', 'Language': 'English', 'Budget': '$200 million [1]', 'Box office': '$1.067 billion [1]'}
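To see why the second version passes a separator to get_text, here is a minimal sketch on a hand-written snippet. The HTML string is an illustrative stand-in for the "Distributed by" cell, not something fetched from Wikipedia, and the variable names are only for this example.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a cell whose text is split by a <br/> tag
html = '<td><a href="#">Walt Disney Studios<br/>Motion Pictures</a></td>'
cell = BeautifulSoup(html, "html.parser").find("td")

# Default get_text() glues the pieces together with no separator
plain = cell.get_text()

# A separator plus strip=True keeps the pieces apart and trims whitespace
spaced = cell.get_text(" ", strip=True)

print(repr(plain))
print(repr(spaced))
```

This matches the difference between the two dictionaries above, where 'Walt Disney StudiosMotion Pictures' became 'Walt Disney Studios Motion Pictures'.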
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")
# Convert to a beautiful soup object
soup = bs(r.content)
# Print out the HTML
contents = soup.prettify()
#print(contents)
movies = soup.select(".wikitable.sortable i")
movies[0:10]
[<i><a href="/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons" title="Academy Award Review of Walt Disney Cartoons">Academy Award Review of Walt Disney Cartoons</a></i>, <i><a href="/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)" title="Snow White and the Seven Dwarfs (1937 film)">Snow White and the Seven Dwarfs</a></i>, <i><a href="/wiki/Pinocchio_(1940_film)" title="Pinocchio (1940 film)">Pinocchio</a></i>, <i><a href="/wiki/Fantasia_(1940_film)" title="Fantasia (1940 film)">Fantasia</a></i>, <i><a href="/wiki/The_Reluctant_Dragon_(1941_film)" title="The Reluctant Dragon (1941 film)">The Reluctant Dragon</a></i>, <i><a href="/wiki/Dumbo" title="Dumbo">Dumbo</a></i>, <i><a href="/wiki/Bambi" title="Bambi">Bambi</a></i>, <i><a href="/wiki/Saludos_Amigos" title="Saludos Amigos">Saludos Amigos</a></i>, <i><a href="/wiki/Victory_Through_Air_Power_(film)" title="Victory Through Air Power (film)">Victory Through Air Power</a></i>, <i><a href="/wiki/The_Three_Caballeros" title="The Three Caballeros">The Three Caballeros</a></i>]
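The selector used here can be tried out on a small hand-written page. The sketch below assumes a made-up HTML snippet (not the real Wikipedia markup) and shows that ".wikitable.sortable i a" only returns links that sit inside an i tag within a table carrying both classes.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one table with both classes, one with only "wikitable"
html = """
<table class="wikitable sortable">
  <tr>
    <td><i><a href="/wiki/Dumbo" title="Dumbo">Dumbo</a></i></td>
    <td><a href="/wiki/1941_in_film">1941</a></td>
  </tr>
</table>
<table class="wikitable">
  <tr><td><i><a href="/wiki/Bambi" title="Bambi">Bambi</a></i></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Both classes must be present on the table, and the link must be inside <i>
links = soup.select(".wikitable.sortable i a")
hrefs = [a["href"] for a in links]
print(hrefs)
```

The non-italic year link and the link in the second table are both skipped, which is why the selector is a reasonable way to pick out film titles.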
movies[0].a['href']
'/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons'
movies[0].a['title']
'Academy Award Review of Walt Disney Cartoons'
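The href above is a relative path, so it has to be combined with the site root before it can be requested. One way to do that, sketched here with the standard library's urljoin (the variable names are assumptions for this example), avoids worrying about duplicate or missing slashes.

```python
from urllib.parse import urljoin

# Site root and a relative path like the one returned by movies[0].a['href']
base_url = "https://en.wikipedia.org/"
relative_path = "/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons"

# urljoin resolves the relative path against the base URL
full_url = urljoin(base_url, relative_path)
print(full_url)
```

Plain string concatenation also works as long as the base and the path do not both contribute a slash.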
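tag.decompose() removes a tag and everything inside it from the parsed tree.
We can use it to remove the span tags from the release dates, along with the sup reference markers such as [1].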
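As a small illustration of the effect, the sketch below runs decompose() on a hand-written cell that imitates a release date entry. The HTML string is an assumption made for this example, simplified from the real infobox markup.

```python
from bs4 import BeautifulSoup

# Simplified imitation of a release date cell with a hidden span and a reference
html = (
    '<td>June 18, 2010 <span style="display:none">(2010-06-18)</span>'
    ' (United States) <sup class="reference">[1]</sup></td>'
)
cell = BeautifulSoup(html, "html.parser").find("td")
before = cell.get_text(" ", strip=True)

# Remove every <sup> and <span> tag, together with its contents
for tag in cell.find_all(["sup", "span"]):
    tag.decompose()
after = cell.get_text(" ", strip=True)

print(before)
print(after)
```

The hidden machine-readable date and the citation marker disappear, leaving only the text a reader would see.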
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        # Cells split by <br> tags become a list of their text pieces
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

# Remove reference markers and hidden spans
def clean_tags(soup):
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()

def get_info_box(url):
    r = requests.get(url)
    soup = bs(r.content)
    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all("tr")
    clean_tags(soup)
    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['title'] = row.find("th").get_text(" ", strip=True)
        else:
            header = row.find('th')  # only rows with a table header carry a key
            if header:
                content_key = row.find("th").get_text(" ", strip=True)
                content_value = get_content_value(row.find("td"))
                movie_info[content_key] = content_value
    return movie_info
get_info_box("https://en.wikipedia.org/wiki/One_Little_Indian_(film)")
{'title': 'One Little Indian', 'Directed by': 'Bernard McEveety', 'Written by': 'Harry Spalding', 'Produced by': 'Winston Hibler', 'Starring': ['James Garner', 'Vera Miles', 'Pat Hingle', 'Morgan Woodward', 'Jodie Foster'], 'Cinematography': 'Charles F. Wheeler', 'Edited by': 'Robert Stafford', 'Music by': 'Jerry Goldsmith', 'Production company': 'Walt Disney Productions', 'Distributed by': 'Buena Vista Distribution', 'Release date': ['June 20, 1973'], 'Running time': '90 Minutes', 'Country': 'United States', 'Language': 'English', 'Box office': '$2 million'}
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")
soup = bs(r.content)
movies = soup.select(".wikitable.sortable i a")

# No trailing slash here, because each href already starts with "/"
base_path = "https://en.wikipedia.org"
movie_info_list = []
for index, movie in enumerate(movies):
    # Print progress every 10 movies
    if index % 10 == 0:
        print(index)
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        title = movie['title']
        movie_info_list.append(get_info_box(full_path))
    except Exception as e:
        # Report the movie that failed and keep going
        print(movie.get_text())
        print(e)
0 10 20 30 40 Zorro the Avenger 'NoneType' object has no attribute 'find' The Sign of Zorro 'NoneType' object has no attribute 'find' 50 60 70 80 90 100 110 120 130 140 The London Connection 'NoneType' object has no attribute 'find' 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 The Beatles: Get Back – The Rooftop Concert 'NoneType' object has no attribute 'find' 490 500 61 'NoneType' object has no attribute 'find_all' All Night Long 'NoneType' object has no attribute 'find' 510 Keeper of the Lost Cities 'NoneType' object has no attribute 'find_all' Muppet Man 'NoneType' object has no attribute 'find_all' 520 Sister Act 3 'NoneType' object has no attribute 'find' The Thief 'NoneType' object has no attribute 'find_all' Tom Sawyer 'NoneType' object has no attribute 'find_all' 530 Tower of Terror 'NoneType' object has no attribute 'find_all' Tron: Ares 'NoneType' object has no attribute 'find' FC Barcelona 'NoneType' object has no attribute 'find_all' Young Woman and the Sea 'NoneType' object has no attribute 'find_all'
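The 'NoneType' messages in the log above are consistent with two situations: a page with no "infobox vevent" table (so find_all is called on None) and a row that has a th but no td (so find is called on None). The sketch below is one possible defensive variant, not the code that produced the log. The name parse_info_box is hypothetical, and it takes an HTML string instead of a URL so that it can be run offline.

```python
from bs4 import BeautifulSoup

def parse_info_box(html):
    """Hypothetical helper: return infobox rows as a dict, or None if no infobox."""
    soup = BeautifulSoup(html, "html.parser")
    info_box = soup.find(class_="infobox vevent")
    if info_box is None:
        # Page has no infobox, e.g. an unreleased or redirected title
        return None
    info = {}
    for index, row in enumerate(info_box.find_all("tr")):
        header = row.find("th")
        data = row.find("td")
        if index == 0 and header:
            info["title"] = header.get_text(" ", strip=True)
        elif header and data:
            # Only keep rows that have both a label and a value
            info[header.get_text(" ", strip=True)] = data.get_text(" ", strip=True)
    return info

# Made-up page with one complete row and one header-only row
page = (
    '<table class="infobox vevent">'
    "<tr><th>Toy Story 3</th></tr>"
    "<tr><th>Directed by</th><td>Lee Unkrich</td></tr>"
    "<tr><th>Header only</th></tr>"
    "</table>"
)
print(parse_info_box(page))
print(parse_info_box("<p>no infobox here</p>"))
```

Returning None for pages without an infobox lets the calling loop skip them explicitly instead of relying on a broad except clause.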
len(movie_info_list)
519
Saving all dictionaries as a JSON file
import json

def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_data(title):
    with open(title, encoding="utf-8") as f:
        return json.load(f)
save_data("disney_movies_data_cleaned.json", movie_info_list)
movie_info_list = load_data("disney_movies_data_cleaned.json")
movie_info_list[-40]
{'title': 'Jungle Cruise', 'Directed by': 'Jaume Collet-Serra', 'Screenplay by': ['Michael Green', 'Glenn Ficarra', 'John Requa'], 'Story by': ['John Norville', 'Josh Goldstein', 'Glenn Ficarra', 'John Requa'], 'Based on': "Walt Disney 's Jungle Cruise", 'Produced by': ['John Davis', 'John Fox', 'Beau Flynn', 'Dwayne Johnson', 'Dany Garcia', 'Hiram Garcia'], 'Starring': ['Dwayne Johnson', 'Emily Blunt', 'Édgar Ramírez', 'Jack Whitehall', 'Jesse Plemons', 'Paul Giamatti'], 'Cinematography': 'Flavio Labiano', 'Edited by': 'Joel Negron', 'Music by': 'James Newton Howard', 'Production companies': ['Walt Disney Pictures', 'Davis Entertainment', 'Seven Bucks Productions', 'Flynn Picture Company'], 'Distributed by': 'Walt Disney Studios Motion Pictures', 'Release dates': ['July 24, 2021 ( Disneyland Resort )', 'July 30, 2021 (United States)'], 'Running time': '128 minutes', 'Country': 'United States', 'Language': 'English', 'Budget': '$200 million', 'Box office': '$220.9 million'}
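The save and load helpers use ensure_ascii=False together with UTF-8 encoding, which keeps names such as 'Édgar Ramírez' readable in the file. Here is a minimal round-trip sketch that writes to a temporary directory; the sample record and file name are assumptions for this example only.

```python
import json
import os
import tempfile

# Small sample record containing non-ASCII characters
records = [{"title": "Jungle Cruise", "Starring": ["Dwayne Johnson", "Édgar Ramírez"]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")

    # Write with ensure_ascii=False so accents are stored as-is, not as \u escapes
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # Read the raw text back, then parse it
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()
    loaded = json.loads(raw_text)

print("Édgar Ramírez" in raw_text)
print(loaded == records)
```

With the default ensure_ascii=True the data would still load correctly, but the file itself would contain escape sequences instead of the accented characters.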
print([movie.get('Running time', 'N/A') for movie in movie_info_list])
['41 minutes (74 minutes 1966 release)', '83 minutes', '88 minutes', '126 minutes', '74 minutes', '64 minutes', '70 minutes', '42 minutes', '70 min', '71 minutes', '75 minutes', '94 minutes', '73 minutes', '75 minutes', '82 minutes', '68 minutes', '74 minutes', '96 minutes', '75 minutes', '84 minutes', '77 minutes', '92 minutes', '69 minutes', '81 minutes', ['60 minutes (VHS version)', '71 minutes (original)'], '127 minutes', '192 minutes', '76 minutes', '75 minutes', '73 minutes', '85 minutes', '81 minutes', '70 minutes', '90 min.', '80 minutes', '75 minutes', '83 minutes', '83 minutes', '72 minutes', '97 minutes', '75 minutes', '104 minutes', '93 minutes', '105 minutes', '95 minutes', '97 minutes', '134 minutes', '69 minutes', '92 minutes', '126 minutes', '79 minutes', '97 minutes', '128 minutes', '73 minutes', '91 minutes', '105 minutes', '98 minutes', '130 minutes', '89 minutes', '93 minutes', '67 minutes', '98 minutes', '100 minutes', '118 minutes', '103 minutes', '110 minutes', '80 min.', '74 minutes', '91 minutes', '91 minutes', '97 minutes', '118 minutes', '139 minutes', '131 mins.', '92 minutes', '87 minutes', '116 minutes', '93 minutes', '110 min.', '110 min.', '131 minutes', '101 minutes', '108 minutes', '84 minutes', '78 minutes', '75 minutes', ['164 minutes', '(', 'Los Angeles', 'premiere)', '144 minutes', '(', 'New York City', 'premiere)', '118 minutes', '(General release)', '172 minutes', '(', "Director's cut", ')'], '106 minutes', '110 minutes', '99 minutes', '113 mins.', '108 minutes', '112 minutes', '93 minutes', '91 minutes', '93 minutes', '100 minutes', '100 minutes', '79 minutes', '96 minutes', '113 minutes', '89 minutes', ['117 minutes (1971 original version)', '139 minutes (1996 reconstruction version)'], '92 minutes', '88 minutes', '92 minutes', '87 minutes', '93 minutes', '93 minutes', '93 minutes', '90 Minutes', '83 minutes', '96 minutes', '88 minutes', '89 minutes', '91 minutes', '93 minutes', '92 minutes', '97 minutes', '100 minutes', 
'100 minutes', '89 minutes', 'N/A', '91 minutes', '112 minutes', '115 minutes', '95 minutes', '91 min.', '97 minutes', '104 minutes', '74 minutes', '48 minutes', '77 minutes', '104 minutes', '128 minutes', '101 minutes', '94 minutes', '104 minutes', '90 minutes', '100 minutes', '88 minutes', '93 minutes', '98 minutes', '112 minutes', '84 minutes', '97 minutes', '97 minutes', '114 minutes', '96 minutes', '97 minutes', '109 minutes', '83 minutes', '90 minutes', '107 minutes', '96 minutes', '103 minutes', '91 min', '95 minutes', '105 minutes', '113 minutes', '80 minutes', '101 minutes', '90 minutes', '74 minutes', '90 minutes', '89 minutes', '110 minutes', '74 minutes', '93 minutes', '84 minutes', '83 minutes', '74 minutes', '77 minutes', '107 minutes', '93 minutes', '88 minutes', '108 minutes', '84 minutes', '121 minutes', '89 minutes', '104 minutes', '90 minutes', '86 minutes', '84 minutes', '108 minutes', '107 minutes', '96 minutes', '98 minutes', '105 minutes', '108 minutes', '94 minutes', '106 minutes', '102 minutes', '88 minutes', '102 minutes', '102 minutes', '97 minutes', '111 minutes', '100 minutes', '96 minutes', '98 minutes', '78 minutes', '81 minutes', '108 minutes', '89 minutes', '99 minutes', '89 minutes', '81 minutes', '92 minutes', '100 minutes', '89 minutes', '79 minutes', '91 minutes', '101 minutes', '104 minutes', '103 minutes', '86 minutes', '105 minutes', '75 minutes', '93 minutes', '92 minutes', '98 minutes', '95 minutes', '93 minutes', '87 minutes', '93 minutes', '87 minutes', '128 minutes', '77 minutes', '86 minutes', '95 minutes', '114 minutes', '93 minutes', '83 minutes', '83 minutes', '88 minutes', '78 minutes', '112 minutes', '92 minutes', '78 minutes', '72 minutes', '82 minutes', '75 minutes', '104 minutes', '75 minutes', '113 minutes', '100 minutes', '78 minutes', '83 minutes', '69 minutes', '96 minutes', '115 minutes', '86 minutes', '92 minutes', '65 minutes', '99 minutes', '73 minutes', '73 minutes', '66 minutes', '128 minutes', '85 
minutes', '88 minutes', '125 minutes', '96 minutes', '104 minutes', '95 minutes', '74 minutes', '72 minutes', '88 minutes', '75 minutes', '61 minutes', '117 minutes', '94 minutes', '100 minutes', '143 minutes', '96 minutes', '64 minutes', '87 minutes', '85 minutes', '86 minutes', '50 minutes', '74 minutes', '136 minutes', '76 minutes', '85 minutes', '65 minutes', '76 minutes', '40 minutes', '120 minutes', '84 minutes', '113 minutes', '65 minutes', '115 minutes', '67 minutes', '131 minutes', '100 minutes', '79 minutes', '68 minutes', '95 minutes', '97 minutes', '75 minutes', '100 minutes', '119 minutes', '100 minutes', '76 minutes', '68 minutes', '67 minutes', '120 minutes', '81 minutes', '143 minutes', '72 minutes', '118 minutes', '40 minutes', '72 minutes', '120 minutes', '99 minutes', '82 minutes', '117 minutes', '72 minutes', '150 minutes', '104 minutes', '73 minutes', '76 minutes', '92 minutes', '69 minutes', '70 minutes', '95 minutes', '94 minutes', '167 minutes', '111 minutes', '85 minutes', '82 minutes', '87 minutes', '110 minutes', '107 minutes', '124 minutes', '74 minutes', '82 minutes', '150 minutes', '97 minutes', '77 minutes', '91 minutes', '100 minutes', '112 minutes', '93 minutes', '78 minutes', '96 minutes', '99 minutes', ['76 minutes', '85 minutes'], '98 minutes', '97 minutes', ['99 minutes (UK)', '90 minutes (U.S.)'], '101 minutes', '96 minutes', 'N/A', '90 minutes', '101 minutes', '107 minutes', '82 minutes', '101 minutes', '96 minutes', '88 minutes', '97 minutes', '108 minutes', ['104 minutes (Original cut)', '84 minutes (Disney cut)'], '116 minutes', '103 minutes', '109 minutes', '115 minutes', '74 minutes', '79 minutes', '123 minutes', '108 minutes', '100 minutes', '125 minutes', 'N/A', '84 minutes', '90 minutes', '88 minutes', '109 minutes', '89 minutes', '104 minutes', '137 minutes', '106 minutes', '63 minutes', '88 minutes', '103 minutes', '93 minutes', '95 minutes', '132 minutes', '77 minutes', '96 minutes', '93 minutes', '104 minutes', '89 
minutes', '87 minutes', '75 minutes', '101 minutes', '130 minutes', '77 minutes', '104 minutes', '149 minutes', '92 minutes', '81 minutes', '102 minutes', '125 minutes', '107 minutes', '78 minutes', '77 minutes', '124 minutes', '97 minutes', '84 minutes', '127 minutes', '81 minutes', '102 minutes', '124 minutes', '129 minutes', '76 minutes', '106 minutes', '82 minutes', '130 minutes', '95 minutes', '154 minutes', '97 minutes', '117 minutes', '108 minutes', '106 minutes', '99 minutes', '113 minutes', '97 minutes', '118 minutes', '103 minutes', '124 minutes', '107 minutes', '161 minutes', '80 minutes', '129 minutes', '76 minutes', '129 minutes', '102 minutes', 'N/A', '162 minutes', '114 minutes', '105 minutes', '109 minutes', '118 minutes', '104 minutes', '99 minutes', '112 minutes', '131 minutes', '112 minutes', '76 minutes', '128 minutes', '100 minutes', '118 minutes', '119 minutes', '104 minutes', '100 minutes', '104 minutes', '114 minutes', '99 minutes', '102 minutes', '107 minutes', '78 minutes', '86 minutes', '95 minutes', '160 minutes', '85 minutes', '100 minutes', '95 minutes', '115 minutes', '110 minutes', '122 minutes', '101 minutes', '121 minutes', '95 minutes', '107 minutes', '134 minutes', '95 minutes', '128 minutes', '109 minutes', '58 minutes', 'N/A', '82 minutes', '99 minutes', '107 minutes', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '58 minutes', 'N/A', 'N/A', 'N/A', 'N/A', '102 minutes', 'N/A', '81 minutes', '70 minutes', 'N/A', 'N/A', '93 minutes', '91 minutes', '78 minutes', '85 minutes', '124 minutes', '83 minutes', 'N/A', 'N/A', '107 minutes', 'N/A', '79 minutes', '74 minutes', ['468 minutes'], '108 minutes']
# "85 minutes", '41 minutes (74 minutes 1966 release)','N/A', ['468 minutes'], '70 min'
def minutes_to_integer(running_time):
    if running_time == "N/A":
        return None
    if isinstance(running_time, list):
        # Use the first listed running time
        return int(running_time[0].split(" ")[0])
    else:  # is a string
        return int(running_time.split(" ")[0])

for movie in movie_info_list:
    movie['Running time (int)'] = minutes_to_integer(movie.get('Running time', "N/A"))
print([movie.get('Running time (int)', 'N/A') for movie in movie_info_list])
[41, 83, 88, 126, 74, 64, 70, 42, 70, 71, 75, 94, 73, 75, 82, 68, 74, 96, 75, 84, 77, 92, 69, 81, 60, 127, 192, 76, 75, 73, 85, 81, 70, 90, 80, 75, 83, 83, 72, 97, 75, 104, 93, 105, 95, 97, 134, 69, 92, 126, 79, 97, 128, 73, 91, 105, 98, 130, 89, 93, 67, 98, 100, 118, 103, 110, 80, 74, 91, 91, 97, 118, 139, 131, 92, 87, 116, 93, 110, 110, 131, 101, 108, 84, 78, 75, 164, 106, 110, 99, 113, 108, 112, 93, 91, 93, 100, 100, 79, 96, 113, 89, 117, 92, 88, 92, 87, 93, 93, 93, 90, 83, 96, 88, 89, 91, 93, 92, 97, 100, 100, 89, None, 91, 112, 115, 95, 91, 97, 104, 74, 48, 77, 104, 128, 101, 94, 104, 90, 100, 88, 93, 98, 112, 84, 97, 97, 114, 96, 97, 109, 83, 90, 107, 96, 103, 91, 95, 105, 113, 80, 101, 90, 74, 90, 89, 110, 74, 93, 84, 83, 74, 77, 107, 93, 88, 108, 84, 121, 89, 104, 90, 86, 84, 108, 107, 96, 98, 105, 108, 94, 106, 102, 88, 102, 102, 97, 111, 100, 96, 98, 78, 81, 108, 89, 99, 89, 81, 92, 100, 89, 79, 91, 101, 104, 103, 86, 105, 75, 93, 92, 98, 95, 93, 87, 93, 87, 128, 77, 86, 95, 114, 93, 83, 83, 88, 78, 112, 92, 78, 72, 82, 75, 104, 75, 113, 100, 78, 83, 69, 96, 115, 86, 92, 65, 99, 73, 73, 66, 128, 85, 88, 125, 96, 104, 95, 74, 72, 88, 75, 61, 117, 94, 100, 143, 96, 64, 87, 85, 86, 50, 74, 136, 76, 85, 65, 76, 40, 120, 84, 113, 65, 115, 67, 131, 100, 79, 68, 95, 97, 75, 100, 119, 100, 76, 68, 67, 120, 81, 143, 72, 118, 40, 72, 120, 99, 82, 117, 72, 150, 104, 73, 76, 92, 69, 70, 95, 94, 167, 111, 85, 82, 87, 110, 107, 124, 74, 82, 150, 97, 77, 91, 100, 112, 93, 78, 96, 99, 76, 98, 97, 99, 101, 96, None, 90, 101, 107, 82, 101, 96, 88, 97, 108, 104, 116, 103, 109, 115, 74, 79, 123, 108, 100, 125, None, 84, 90, 88, 109, 89, 104, 137, 106, 63, 88, 103, 93, 95, 132, 77, 96, 93, 104, 89, 87, 75, 101, 130, 77, 104, 149, 92, 81, 102, 125, 107, 78, 77, 124, 97, 84, 127, 81, 102, 124, 129, 76, 106, 82, 130, 95, 154, 97, 117, 108, 106, 99, 113, 97, 118, 103, 124, 107, 161, 80, 129, 76, 129, 102, None, 162, 114, 105, 109, 118, 104, 99, 112, 131, 112, 76, 128, 100, 118, 
119, 104, 100, 104, 114, 99, 102, 107, 78, 86, 95, 160, 85, 100, 95, 115, 110, 122, 101, 121, 95, 107, 134, 95, 128, 109, 58, None, 82, 99, 107, None, None, None, None, None, None, None, None, 58, None, None, None, None, 102, None, 81, 70, None, None, 93, 91, 78, 85, 124, 83, None, None, 107, None, 79, 74, 468, 108]
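The conversion above relies on the number being the first space-separated token. As an alternative sketch, a regular expression can look for the first integer that is followed by a "min" unit, which would also tolerate leading text. The function name minutes_from_text is hypothetical and is not part of the notebook's pipeline.

```python
import re

def minutes_from_text(running_time):
    """Hypothetical variant: first integer followed by a 'min...' unit, else None."""
    if isinstance(running_time, list):
        # Join list entries so the first match across all pieces is used
        running_time = " ".join(running_time)
    if not isinstance(running_time, str):
        return None
    match = re.search(r"(\d+)\s*min", running_time, flags=re.I)
    return int(match.group(1)) if match else None

samples = [
    "85 minutes",
    "41 minutes (74 minutes 1966 release)",
    "N/A",
    ["468 minutes"],
    "70 min",
    "131 mins.",
    "90 Minutes",
]
print([minutes_from_text(s) for s in samples])
```

Values without a recognizable unit, including "N/A", come back as None rather than raising an error.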
print([movie.get('Budget', 'N/A') for movie in movie_info_list])
['N/A', '$1.49 million', '$2.6 million', '$2.28 million', '$600,000', '$950,000', '$858,000', 'N/A', '$788,000', 'N/A', '$1.35 million', '$2.125 million', 'N/A', '$1.5 million', '$1.5 million', 'N/A', '$2.2 million', '$1,800,000', '$3 million', 'N/A', '$4 million', '$2 million', '$300,000', '$1.8 million', 'N/A', '$5 million', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$700,000', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$6 million', 'under $1 million or $1,250,000', 'N/A', '$2 million', 'N/A', 'N/A', '$2.5 million', 'N/A', 'N/A', '$4 million', '$3.6 million', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$3 million', 'N/A', 'N/A', 'N/A', 'N/A', '$4.4–6 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$4 million', 'N/A', 'N/A', 'N/A', '$6.3 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', 'N/A', '$8 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'AU$1 million', 'N/A', 'N/A', 'N/A', 'N/A', '$5 million', 'N/A', 'N/A', 'N/A', '$7.5 million', 'N/A', '$10 million', 'N/A', 'N/A', '$3.5 to 4 million', 'N/A', 'N/A', 'N/A', '$5.25 million', '$20 million', 'N/A', '$9 million', 'N/A', '$6-8 million', '$20 million', 'N/A', 'N/A', '$18 million', '$12 million', '$14 million', 'N/A', '$17 million', '$5 million', 'unknown', '$20 million', '$11 million', '$28 million', '$44 million', 'N/A', 'N/A', '$14 million', '$9 million', 'N/A', 'A$8.7 million', '$31 million', '$18 million', '$5 million', '$40 million', '$20 million', 'N/A', '$14 million', '60 million Norwegian Kroner (around $8.7 million in 1989)', 'N/A', '$35-40 million', '$25 million', '$15 million', '$32 million', '$14 million', '$28 million', '$12 million', 'N/A', 'N/A', '$6.5 million', '$28 million', '$17 million', 
'$30 million', 'N/A', '$13 million', 'N/A', 'N/A', '$45 million', '$31 million', 'N/A', '$22 million', '$30 million', 'N/A', '$22 million', ['$32 million', '(estimated)'], '$18 million', '$55 million', '$24 million', '$15 million', '$12 million', 'N/A', '$30 million', 'N/A', '$31 million', 'N/A', '$38 million', '$70 million', '$15 million', 'N/A', '$67 million', 'N/A', '$32 million', '$7 million', '$85 million', '$55 million', '$3 million', '$16 million', '$80 million', '$30 million', '$24 million', '$90 million', '$15 million', 'N/A', '$30 million', '$120 million', '$90 million', '$65 million', '$5 million', 'N/A', '$130 million', '$75–90 million', '$10 million', '$90 million', '$15 million –$30 million', '$4,000,000 (estimated)', '$127.5 million', '$80–$85 million', '$65 million', 'N/A', '$30 million', '$85 million', '$100 million', '$23 million', 'N/A', '$90–120 million', '$26 million', '$25 million', '$115 million', 'N/A', '$33 million', '$20 million', '$5 million', 'N/A', '$22 million', '$80 million', '$35 million', '(US$15–19.2 million)', '$15 million', '$65 million', '$140 million', 'N/A', '$20 million', '$12 million', '$46 million', '$13 million', '$20 million', '$17 million', '$94 million', '$140 million', '$26 million', 'N/A', 'N/A', '$46 million', '$90 million', 'N/A', '$10 million', '$28 million', 'N/A', '$15 million', 'N/A', '$110 million', 'N/A', '$110 million', 'N/A', '$45 million', 'N/A', '$92–145 million', 'N/A', '$100 million', 'N/A', 'N/A', '$20 million', '$56 million', '$25 million', 'N/A', '$50 million', ['¥', '2.4 billion', 'US$24 million'], '$35 million', '$35 million', 'N/A', 'N/A', '$25 million', '$150 million', '$180 million', 'N/A', '$30 million', '$1 million', 'N/A', '$40 million', '$50 million', '$80 million', '$120 million', 'N/A', '$225 million', '$30 million', 'N/A', '$24 million', '$12 million', 'N/A', 'N/A', '$17 million', '$150 million', '$300 million', '$150 million', 'N/A', '$25 million', 'N/A', '$22 million', '$85 million', 
'$130 million', '$7 million', '$25 million', '$225 million', '$180 million', 'N/A', '$20 million', 'N/A', '$11 million', 'N/A', '$50 million', '$150 million', '$80 million', 'N/A', '$50 million', '$30 million', '$47 million', 'N/A', '$175 million', 'N/A', '$150 million', ['¥', '3.4 billion', '(', 'US$', '34 million)'], 'N/A', '$30—$35 million', '$8 million ( ₽ 350 million)', '$175–200 million', '$35 million', '$105 million', '$150 –$200 million', '$80 million', '$150–200 million', '$200 million', '$150 million', '$22 million', 'N/A', '$30—$35 million', '$35 million', 'N/A', '$260 million', '$170 million', '', 'N/A', 'N/A', '$150 million', 'N/A', '$5 million', '$8 million', ['$410.6 million (gross)', '$378.5 million (net)'], '$200 million', '$30 million', 'N/A', '$45 million', 'N/A', '$23 million', ['$306.6 million (gross)', '$263.7 million (net)'], '$5 million', 'N/A', '$185 million', '$25 million', 'N/A', '$39 million', '$30—35 million', '$165 million', '$200 million', 'N/A', '$200 million', '$225–250 million', '$50 million', 'N/A', '$150 million', '$35 million', '$50 million', 'N/A', '$5 million', '$25 million', '$180–263 million', '$50 million', '(US$3.0 million)', '$28 million', '$165 million', '$50 million', '$17 million', 'N/A', '$84.21-95 million', '$5–10 million', '$180–190 million', '$175 million', '', '$175–200 million', '$70–80 million', '$150 million', '$175–177 million', 'N/A', '$170 million', '$175–200 million', '$140 million', '$65 million', '$15 million', '$150–175 million', '', '$8 million', '$160–255 million', '$5–10 million', '$230–320 million', '$175 million', 'N/A', ['131 crore'], ['~$8 million', '₽', '370 million'], '$175–225 million', '$100–130 million', '$200 million', '$75 million', '$120–133 million', '$175 million', '$130 million', '$170 million', 'N/A', '$183 million', '$200 million', '$250–260 million', '$185 million', '$60 million', 'N/A', '$150 million', '$40 million', '$42 million', '$175–200 million', 'N/A', 'N/A', 'N/A', '$125 
million', '$12.5 million (stage production)', 'N/A', '$24 million', 'N/A', '$200 million', 'Unknown', '$26 million', '$150 million', '₽ 650 million', 'N/A', '$100 million+', '$100 million', 'N/A', '$200 million', '$120–150 million', 'N/A', '₽ 454 237 000', 'N/A', '$175 million', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', '$11 million', 'N/A', '$28 million', '$858,000', '$53.4 million', 'N/A', '$85 million', '$70 million', '$75–90 million', '$80 million', '$130 million', '$5 million', 'N/A', 'N/A', '$18 million or $25 million', 'N/A', '$4 million', '$3 million', 'N/A', '$35-40 million']
import re
amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"
word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"
def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]
def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value
def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value
'''
money_conversion("$12.2 million") --> 12200000 ## Word syntax
money_conversion("$790,000") --> 790000 ## Value syntax
'''
def money_conversion(money):
    if money == "N/A":
        return None
    if isinstance(money, list):
        money = money[0]
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)
    if word_syntax:
        return parse_word_syntax(word_syntax.group())
    elif value_syntax:
        return parse_value_syntax(value_syntax.group())
    else:
        return None
for movie in movie_info_list:
    movie['Budget (float)'] = money_conversion(movie.get('Budget', "N/A"))
    movie['Box office (float)'] = money_conversion(movie.get('Box office', "N/A"))
movie_info_list[-40]
{'title': 'Jungle Cruise', 'Directed by': 'Jaume Collet-Serra', 'Screenplay by': ['Michael Green', 'Glenn Ficarra', 'John Requa'], 'Story by': ['John Norville', 'Josh Goldstein', 'Glenn Ficarra', 'John Requa'], 'Based on': "Walt Disney 's Jungle Cruise", 'Produced by': ['John Davis', 'John Fox', 'Beau Flynn', 'Dwayne Johnson', 'Dany Garcia', 'Hiram Garcia'], 'Starring': ['Dwayne Johnson', 'Emily Blunt', 'Édgar Ramírez', 'Jack Whitehall', 'Jesse Plemons', 'Paul Giamatti'], 'Cinematography': 'Flavio Labiano', 'Edited by': 'Joel Negron', 'Music by': 'James Newton Howard', 'Production companies': ['Walt Disney Pictures', 'Davis Entertainment', 'Seven Bucks Productions', 'Flynn Picture Company'], 'Distributed by': 'Walt Disney Studios Motion Pictures', 'Release dates': ['July 24, 2021 ( Disneyland Resort )', 'July 30, 2021 (United States)'], 'Running time': '128 minutes', 'Country': 'United States', 'Language': 'English', 'Budget': '$200 million', 'Box office': '$220.9 million', 'Running time (int)': 128, 'Budget (float)': 200000000.0, 'Box office (float)': 220900000.0}
money_conversion(str(movie_info_list[-40]["Budget"]))
200000000.0
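The budget strings printed above write ranges in more than one way, and the two patterns do not treat them alike. The sketch below restates the regexes from this section in a self-contained form and spot-checks a few strings taken from that list. `first_dollar_amount` and `scale` are hypothetical names used only for this illustration; they are not part of the notebook.

```python
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"
word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"
scale = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def first_dollar_amount(money):
    """Hypothetical single-string restatement of money_conversion."""
    word_match = re.search(word_re, money, flags=re.I)
    if word_match:
        text = word_match.group()
        # The first number inside the matched text is the one that gets scaled
        value = float(re.search(number, text).group().replace(",", ""))
        word = re.search(amounts, text, flags=re.I).group().lower()
        return value * scale[word]
    value_match = re.search(value_re, money)
    if value_match:
        digits = re.search(number, value_match.group()).group()
        return float(digits.replace(",", ""))
    return None


samples = ["$200 million", "$790,000", "$75–90 million",
           "$80–$85 million", "₽ 650 million"]
results = {s: first_dollar_amount(s) for s in samples}
for text, value in results.items():
    print(f"{text!r:22} -> {value}")
```

With these patterns, `$75–90 million` resolves to its lower bound, while `$80–$85 million` resolves to its upper bound, because the second `$` prevents a match from the start of the string. Amounts that are not written with `$` come back as `None`.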
# Convert Dates into datetimes
print([movie.get('Release date', 'N/A') for movie in movie_info_list])
[['May 19, 1937'], 'N/A', 'N/A', ['November 13, 1940'], ['June 27, 1941'], 'N/A', 'N/A', 'N/A', ['July 17, 1943'], 'N/A', 'N/A', 'N/A', ['September 27, 1947'], 'May 27, 1948', 'N/A', ['October 5, 1949'], 'N/A', 'N/A', 'N/A', 'N/A', ['February 5, 1953 (United States)'], ['July 23, 1953 (US)'], ['November 10, 1953'], 'N/A', ['August 17, 1954'], ['December 23, 1954'], 'May 25, 1955', ['June 22, 1955'], ['September 14, 1955'], 'December 22, 1955', 'June 8, 1956', 'July 18, 1956', ['September 4, 1956'], ['December 20, 1956'], 'June 19, 1957', 'August 28, 1957', ['December 25, 1957'], ['July 8, 1958'], ['August 12, 1958'], ['December 25, 1958'], ['January 29, 1959'], ['March 19, 1959'], 'N/A', ['November 10, 1959'], 'January 21, 1960 ( Sarasota, FL )', ['February 24, 1960'], 'May 19, 1960', 'N/A', ['November 1, 1960'], ['December 21, 1960'], ['January 25, 1961'], 'March 16, 1961', ['June 21, 1961'], ['July 12, 1961'], ['July 17, 1961'], ['December 14, 1961'], 'April 5, 1962', ['May 17, 1962'], ['June 6, 1962'], 'September 26, 1962', ['November 7, 1962 (Los Angeles)'], 'N/A', ['January 16, 1963'], ['March 29, 1963 (U.S.)'], ['June 1, 1963'], ['July 7, 1963'], 'November 20, 1963', ['December 25, 1963'], ['March 12, 1964'], ['February 11, 1964 (Los Angeles)'], 'N/A', ['July 2, 1964'], 'N/A', 'November 10, 1964', ['December 18, 1964'], ['August 18, 1965 (Los Angeles)'], ['December 2, 1965'], 'N/A', 'N/A', 'October 1, 1966', ['December 1, 1966'], ['February 8, 1967 (San Francisco)'], ['June 15, 1967'], ['July 12, 1967'], ['October 18, 1967'], ['October 19, 1967'], 'N/A', ['February 8, 1968'], ['March 21, 1968 ( Radio City Music Hall )'], ['June 26, 1968'], ['December 3, 1968'], 'N/A', ['May 9, 1969'], 'September 4, 1969', ['November 28, 1969'], ['February 11, 1970'], 'July 1, 1970', 'N/A', 'N/A', 'March 17, 1971', 'June 22, 1971', ['June 30, 1971'], ['October 7, 1971'], ['March 22, 1972 (US)'], 'July 12, 1972', ['July 5, 1972'], ['October 18, 1972'], ['December 22, 1972'], 
'February 1, 1973', ['March 23, 1973'], ['June 20, 1973'], ['November 8, 1973'], ['December 14, 1973'], ['June 6, 1974'], ['July 31, 1974'], ['August 1, 1974'], 'December 20, 1974 (with Winnie the Pooh and Tigger Too )', ['February 6, 1975'], ['March 21, 1975 (United States)'], ['July 1, 1975'], ['July 9, 1975'], ['October 8, 1975'], ['1948–1960'], 'December 25, 1975', ['February 11, 1976 (Los Angeles)'], ['July 1, 1976'], ['July 7, 1976'], ['December 16, 1976'], ['December 17, 1976'], 'N/A', ['March 11, 1977'], ['June 20, 1977'], ['June 22, 1977'], ['June 24, 1977'], ['November 3, 1977'], 'December 16, 1977', ['March 10, 1978'], ['June 30, 1978 (New York)'], 'July 5, 1978', 'February 9, 1979', ['June 27, 1979'], 'N/A', 'N/A', ['February 8, 1980'], 'N/A', ['June 27, 1980'], ['July 9, 1980 (Los Angeles)'], 'N/A', 'February 11, 1981 (Los Angeles)', 'March 20, 1981', ['June 26, 1981'], ['July 10, 1981'], ['August 7, 1981'], ['February 5, 1982'], ['July 9, 1982'], 'July 30, 1982', ['March 11, 1983'], ['April 29, 1983 (United States)'], 'N/A', ['June 21, 1985 (United States)'], ['July 24, 1985'], ['September 27, 1985'], ['November 22, 1985'], ['July 2, 1986'], ['August 1, 1986'], ['June 5, 1987'], ['24 March 1988 (Australia)'], ['November 18, 1988'], ['June 23, 1989'], 'August 18, 1989', ['November 17, 1989'], ['August 3, 1990'], ['November 16, 1990'], ['January 18, 1991'], 'N/A', ['May 24, 1991'], ['June 21, 1991'], 'N/A', ['April 10, 1992'], ['July 17, 1992'], ['October 2, 1992'], ['November 25, 1992'], ['December 11, 1992'], 'February 3, 1993', ['March 12, 1993'], ['April 2, 1993'], ['July 16, 1993'], ['October 1, 1993'], ['November 12, 1993'], ['January 14, 1994'], ['February 11, 1994'], ['March 25, 1994'], ['April 14, 1994'], ['June 15, 1994 (United States)'], ['July 15, 1994'], ['October 28, 1994'], ['November 11, 1994'], ['December 25, 1994'], ['February 17, 1995'], ['March 3, 1995'], ['March 24, 1995'], ['April 7, 1995'], 'N/A', ['July 28, 1995'], ['August 11, 
1995'], ['September 27, 1995'], 'N/A', 'N/A', ['December 22, 1995'], ['February 16, 1996'], ['March 8, 1996'], 'N/A', 'N/A', ['December 20, 1996'], ['October 4, 1996'], ['November 27, 1996 (United States)'], ['February 14, 1997'], 'N/A', 'March 18, 1997', ['June 13, 1997 (United States)'], ['July 16, 1997'], ['August 1, 1997'], ['October 10, 1997'], 'N/A', ['December 25, 1997'], ['March 27, 1998 (United States)'], ['June 19, 1998'], ['July 29, 1998 (United States)'], 'September 29, 1998', ['November 13, 1998'], ['November 20, 1998'], ['December 25, 1998'], ['February 12, 1999'], ['March 26, 1999'], ['May 14, 1999'], 'N/A', ['July 23, 1999'], 'N/A', 'N/A', 'February 11, 2000', ['March 10, 2000'], ['May 19, 2000 (United States)'], 'N/A', ['July 7, 2000'], ['September 19, 2000'], ['September 29, 2000'], ['November 22, 2000 (United States)'], 'N/A', 'N/A', ['February 27, 2001'], 'N/A', 'N/A', ['October 5, 2001'], 'N/A', ['November 6, 2001'], ['January 18, 2002'], 'N/A', ['February 26, 2002'], ['March 19, 2002'], ['March 29, 2002'], 'N/A', 'N/A', ['20 July 2001 (Japan)'], ['October 11, 2002 (U.S.)'], ['November 1, 2002'], ['November 27, 2002'], ['January 21, 2003'], ['February 14, 2003'], ['March 11, 2003'], 'N/A', 'N/A', ['April 18, 2003 (United States)'], ['May 2, 2003 (United States)'], ['May 30, 2003'], 'N/A', 'N/A', ['August 26, 2003'], ['October 21, 2003'], ['November 1, 2003'], ['November 26, 2003'], ['December 25, 2003'], ['January 16, 2004'], ['February 20, 2004'], ['February 10, 2004'], ['February 17, 2004 (United States)'], ['March 9, 2004'], ['April 2, 2004'], ['April 22, 2004'], ['June 16, 2004'], ['July 2, 2004'], ['August 11, 2004'], ['August 17, 2004'], 'N/A', ['November 9, 2004'], ['November 19, 2004'], ['January 28, 2005'], 'N/A', ['February 11, 2005'], ['March 4, 2005'], ['March 18, 2005'], ['June 14, 2005'], ['June 22, 2005'], 'N/A', ['July 29, 2005'], 'N/A', ['August 30, 2005'], ['September 13, 2005'], ['September 30, 2005'], 'N/A', 'N/A', 'December 
13, 2005', 'January 13, 2006', ['January 27, 2006'], 'N/A', ['February 17, 2006 (United States)'], ['March 10, 2006'], ['April 14, 2006'], 'N/A', 'N/A', 'N/A', ['August 25, 2006'], ['August 29, 2006'], 'N/A', ['November 3, 2006'], ['December 12, 2006'], ['February 6, 2007'], ['February 16, 2007'], 'N/A', 'N/A', 'N/A', ['June 29, 2007'], ['August 3, 2007'], ['August 28, 2007'], 'N/A', 'N/A', ['December 21, 2007'], 'February 1, 2008', ['March 7, 2008'], 'N/A', 'N/A', ['August 26, 2008'], ['October 3, 2008 (United States)'], 'October 17, 2008', 'N/A', 'N/A', ['October 28, 2008'], ['November 21, 2008'], ['December 25, 2008'], ['February 27, 2009'], 'N/A', ['April 10, 2009'], 'N/A', 'N/A', 'N/A', 'N/A', ['July 24, 2009'], ['July 19, 2008'], ['2008'], 'N/A', ['October 29, 2009'], 'N/A', ['November 25, 2009'], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', ['July 29, 2006'], 'N/A', 'N/A', 'N/A', ['8 October 2010'], 'N/A', 'N/A', ['14 January 2011 (India)'], ['February 1, 2011'], ['17 February 2011'], 'N/A', ['22 April 2011'], 'N/A', ['April 29, 2011'], 'N/A', 'N/A', 'N/A', ['September 20, 2011'], 'N/A', ['January 31, 2012'], ['17 July 2010'], 'N/A', 'N/A', ['25 May 2012'], 'N/A', ['August 15, 2012'], ['September 18, 2012'], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', ['August 27, 2013'], 'N/A', 'N/A', 'N/A', 'N/A', ['April 18, 2014 (United States)'], 'N/A', 'N/A', 'N/A', ['19 September 2014'], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', ['April 17, 2015 (United States)'], 'N/A', 'N/A', ['19 June 2015'], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', ['24 June 2005'], 'N/A', 'N/A', 'N/A', 'N/A', ['June 30, 2017 (United States)'], ['14 July 2017'], ['October 29, 2017 ( Russia )'], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', ['April 17, 2019 (United States)'], 'N/A', 'N/A', 'N/A', ['October 18, 2019'], ['November 12, 2019 (United States)'], ['November 12, 2019 (United States)'], 'N/A', ['December 20, 2019 (United 
States)'], 'N/A', 'N/A', ['March 13, 2020 (United States)'], 'N/A', ['April 3, 2020'], ['June 12, 2020'], ['July 3, 2020'], ['July 31, 2020'], ['August 14, 2020'], ['August 21, 2020 (United States)'], 'N/A', ['December 4, 2020 (United States)'], ['December 11, 2020 (United States)'], 'N/A', ['January 1, 2021 ( Russia )'], ['February 19, 2021 (United States)'], ['March 5, 2021 (United States)'], 'N/A', 'N/A', 'N/A', 'N/A', ['December 3, 2021'], ['December 23, 2021'], 'N/A', 'N/A', 'N/A', 'N/A', ['April 22, 2022'], ['May 20, 2022'], ['June 17, 2022'], ['March 10, 2023'], 'May 26, 2023', 'September 2022', ['2022'], ['December 3, 2021'], ['2022'], 'N/A', 'N/A', ['2022'], ['November 25, 1987'], 'N/A', 'N/A', 'N/A', 'N/A', '1950–present', ['June 13, 1997 (United States)'], 'N/A', ['July 23, 1999'], 'N/A', ['December 21, 2007'], ['November 8, 1973'], 'N/A', 'N/A', ['June 6, 1986 (United States)'], 'N/A', 'N/A', ['December 25, 1963'], 'N/A', ['June 21, 1991']]
movie_info_list[-50]
{'title': 'The One and Only Ivan', 'Directed by': 'Thea Sharrock', 'Screenplay by': 'Mike White', 'Based on': ['The One and Only Ivan', 'by', 'K. A. Applegate'], 'Produced by': ['Angelina Jolie', 'Allison Shearmur', 'Brigham Taylor'], 'Starring': ['Sam Rockwell', 'Angelina Jolie', 'Danny DeVito', 'Helen Mirren', 'Ramón Rodríguez', 'Ariana Greenblatt', 'Bryan Cranston'], 'Cinematography': 'Florian Ballhaus', 'Edited by': 'Barney Pilling', 'Music by': 'Craig Armstrong', 'Production companies': ['Walt Disney Pictures', 'Jolie Pas Productions'], 'Distributed by': 'Walt Disney Studios Motion Pictures', 'Release date': ['August 21, 2020 (United States)'], 'Running time': '95 minutes', 'Country': 'United States', 'Language': 'English', 'Running time (int)': 95, 'Budget (float)': None, 'Box office (float)': None}
# June 28, 1950
from datetime import datetime
dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]
def clean_date(date):
    return date.split("(")[0].strip()
def date_conversion(date):
    if isinstance(date, list):
        date = date[0]
    if date == "N/A":
        return None
    date_str = clean_date(date)
    fmts = ["%B %d, %Y", "%d %B %Y"]
    for fmt in fmts:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            # This format did not match; try the next one
            pass
    return None
for movie in movie_info_list:
    movie['Release date (datetime)'] = date_conversion(movie.get('Release date', 'N/A'))
movie_info_list[50]
{'title': '101 Dalmatians', 'Directed by': ['Clyde Geronimi', 'Hamilton Luske', 'Wolfgang Reitherman'], 'Story by': 'Bill Peet', 'Based on': ['The Hundred and One Dalmatians', 'by', 'Dodie Smith'], 'Produced by': 'Walt Disney', 'Starring': ['Rod Taylor', 'Cate Bauer', 'Betty Lou Gerson', 'Ben Wright', 'Bill Lee (singing voice)', 'Lisa Davis', 'Martha Wentworth'], 'Edited by': ['Roy M. Brewer, Jr.', 'Donald Halliday'], 'Music by': 'George Bruns', 'Production company': 'Walt Disney Productions', 'Distributed by': 'Buena Vista Distribution', 'Release date': ['January 25, 1961'], 'Running time': '79 minutes', 'Country': 'United States', 'Language': 'English', 'Budget': '$3.6 million', 'Box office': '$303 million', 'Running time (int)': 79, 'Budget (float)': 3600000.0, 'Box office (float)': 303000000.0, 'Release date (datetime)': datetime.datetime(1961, 1, 25, 0, 0)}
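The release-date strings come in several shapes, so it helps to see which ones the two formats accept. The sketch below replays the same cleaning step and format list on a few strings copied from the printed list above. `parse_release_date` is a hypothetical stand-alone name chosen for this illustration.

```python
from datetime import datetime

FORMATS = ["%B %d, %Y", "%d %B %Y"]  # "January 25, 1961" and "24 March 1988"


def parse_release_date(date):
    """Hypothetical stand-alone version of date_conversion."""
    if isinstance(date, list):
        date = date[0]
    if date == "N/A":
        return None
    # Drop trailing notes such as "(United States)"
    date_str = date.split("(")[0].strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None


samples = [
    ["January 25, 1961"],
    "January 21, 1960 ( Sarasota, FL )",
    ["24 March 1988 (Australia)"],
    "September 2022",
    ["1948–1960"],
]
parsed = [parse_release_date(s) for s in samples]
for raw, value in zip(samples, parsed):
    print(raw, "->", value)
```

Month-only values and year ranges match neither format, so they end up as `None` alongside the genuine `N/A` entries.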
import pickle
def save_data_pickle(name, data):
    with open(name, 'wb') as f:
        pickle.dump(data, f)
def load_data_pickle(name):
    with open(name, 'rb') as f:
        return pickle.load(f)
save_data_pickle("disney_movie_data_cleaned_more.pickle", movie_info_list)
a = load_data_pickle("disney_movie_data_cleaned_more.pickle")
a == movie_info_list
True
movie_info_list = load_data_pickle('disney_movie_data_cleaned_more.pickle')
movie_info_list[-50]
{'title': 'The One and Only Ivan', 'Directed by': 'Thea Sharrock', 'Screenplay by': 'Mike White', 'Based on': ['The One and Only Ivan', 'by', 'K. A. Applegate'], 'Produced by': ['Angelina Jolie', 'Allison Shearmur', 'Brigham Taylor'], 'Starring': ['Sam Rockwell', 'Angelina Jolie', 'Danny DeVito', 'Helen Mirren', 'Ramón Rodríguez', 'Ariana Greenblatt', 'Bryan Cranston'], 'Cinematography': 'Florian Ballhaus', 'Edited by': 'Barney Pilling', 'Music by': 'Craig Armstrong', 'Production companies': ['Walt Disney Pictures', 'Jolie Pas Productions'], 'Distributed by': 'Walt Disney Studios Motion Pictures', 'Release date': ['August 21, 2020 (United States)'], 'Running time': '95 minutes', 'Country': 'United States', 'Language': 'English', 'Running time (int)': 95, 'Budget (float)': None, 'Box office (float)': None, 'Release date (datetime)': datetime.datetime(2020, 8, 21, 0, 0)}
The Open Movie Database
The OMDb API is a RESTful web service for obtaining movie information. All content and images on the site are contributed and maintained by its users.
# http://www.omdbapi.com/?apikey=[yourkey]&
import requests
import urllib.parse
import os
def get_omdb_info(title):
    base_url = "http://www.omdbapi.com/?"
    parameters = {"apikey": os.environ['OMDB_API_KEY'], 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()
def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None
get_omdb_info("The One and Only Ivan")
{'Title': 'The One and Only Ivan', 'Year': '2020', 'Rated': 'PG', 'Released': '21 Aug 2020', 'Runtime': '95 min', 'Genre': 'Adventure, Comedy, Drama', 'Director': 'Thea Sharrock', 'Writer': 'Mike White, Katherine Applegate', 'Actors': 'Sam Rockwell, Bryan Cranston, Phillipa Soo', 'Plot': 'A gorilla named Ivan tries to piece together his past with the help of an elephant named Ruby as they hatch a plan to escape from captivity.', 'Language': 'English', 'Country': 'United States', 'Awards': 'Nominated for 1 Oscar. 2 wins & 4 nominations total', 'Poster': 'https://m.media-amazon.com/images/M/MV5BZWY3OTNhNWUtMDk2My00ZGVhLWE5ODQtM2NkOTZiMWM2MGY2XkEyXkFqcGdeQXVyNjMwMzc3MjE@._V1_SX300.jpg', 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '6.6/10'}, {'Source': 'Rotten Tomatoes', 'Value': '71%'}, {'Source': 'Metacritic', 'Value': '58/100'}], 'Metascore': '58', 'imdbRating': '6.6', 'imdbVotes': '11,829', 'imdbID': 'tt3661394', 'Type': 'movie', 'DVD': '21 Aug 2020', 'BoxOffice': 'N/A', 'Production': 'N/A', 'Website': 'N/A', 'Response': 'True'}
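To see what the request URL looks like before anything is sent, the query string can be built on its own. This is a small sketch using only the standard library; `"YOUR_KEY"` is a placeholder, whereas the notebook reads the real key from the `OMDB_API_KEY` environment variable.

```python
from urllib.parse import urlencode

base_url = "http://www.omdbapi.com/?"
# "YOUR_KEY" is a placeholder, not a working API key
parameters = {"apikey": "YOUR_KEY", "t": "The One and Only Ivan"}
full_url = base_url + urlencode(parameters)
print(full_url)

# Characters with special meaning in a URL are percent-encoded
tricky = urlencode({"t": "Lilo & Stitch"})
print(tricky)
```

Encoding the parameters this way matters for titles containing `&` or other reserved characters, which would otherwise be read as the start of a new parameter.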
for movie in movie_info_list:
    title = movie['title']
    omdb_info = get_omdb_info(title)
    movie['imdb'] = omdb_info.get('imdbRating', None)
    movie['metascore'] = omdb_info.get('Metascore', None)
    movie['rotten_tomatoes'] = get_rotten_tomato_score(omdb_info)
movie_info_list[-50]
{'title': 'The One and Only Ivan', 'Directed by': 'Thea Sharrock', 'Screenplay by': 'Mike White', 'Based on': ['The One and Only Ivan', 'by', 'K. A. Applegate'], 'Produced by': ['Angelina Jolie', 'Allison Shearmur', 'Brigham Taylor'], 'Starring': ['Sam Rockwell', 'Angelina Jolie', 'Danny DeVito', 'Helen Mirren', 'Ramón Rodríguez', 'Ariana Greenblatt', 'Bryan Cranston'], 'Cinematography': 'Florian Ballhaus', 'Edited by': 'Barney Pilling', 'Music by': 'Craig Armstrong', 'Production companies': ['Walt Disney Pictures', 'Jolie Pas Productions'], 'Distributed by': 'Walt Disney Studios Motion Pictures', 'Release date': ['August 21, 2020 (United States)'], 'Running time': '95 minutes', 'Country': 'United States', 'Language': 'English', 'Running time (int)': 95, 'Budget (float)': None, 'Box office (float)': None, 'Release date (datetime)': datetime.datetime(2020, 8, 21, 0, 0), 'imdb': '6.6', 'metascore': '58', 'rotten_tomatoes': '71%'}
save_data_pickle('disney_movie_data_final.pickle', movie_info_list)
movie_info_list[100]
{'title': 'Scandalous John', 'Directed by': 'Robert Butler', 'Written by': 'Bill Walsh', 'Produced by': 'Bill Walsh', 'Starring': ['Brian Keith', 'Alfonso Arau', 'Michele Carey'], 'Cinematography': 'Frank V. Phillips', 'Edited by': 'Cotton Warburton', 'Music by': 'Rod McKuen', 'Production company': 'Walt Disney Productions', 'Distributed by': 'Buena Vista Distribution', 'Release date': 'June 22, 1971', 'Running time': '113 minutes', 'Country': 'United States', 'Language': 'English', 'Running time (int)': 113, 'Budget (float)': None, 'Box office (float)': None, 'Release date (datetime)': datetime.datetime(1971, 6, 22, 0, 0), 'imdb': '5.8', 'metascore': 'N/A', 'rotten_tomatoes': '20%'}
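The three scores arrive as strings (`'5.8'`, `'N/A'`, `'20%'`), so they need one more step before they can be sorted or averaged. A minimal sketch of that step follows; `score_to_float` is a hypothetical helper and is not defined anywhere in the notebook.

```python
def score_to_float(raw):
    """Hypothetical helper: '6.6', '58', '71%' -> float; 'N/A' or None -> None."""
    if raw is None or raw == "N/A":
        return None
    # Rotten Tomatoes values carry a trailing percent sign
    return float(raw.rstrip("%"))


record = {"imdb": "5.8", "metascore": "N/A", "rotten_tomatoes": "20%"}
numeric = {key: score_to_float(value) for key, value in record.items()}
print(numeric)
```

The scales are left as they are (IMDb out of 10, the other two out of 100); putting them on a common scale would be a separate decision.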
movie_info_copy = [movie.copy() for movie in movie_info_list]
for movie in movie_info_copy:
    current_date = movie['Release date (datetime)']
    if current_date:
        movie['Release date (datetime)'] = current_date.strftime("%B %d, %Y")
    else:
        movie['Release date (datetime)'] = None
save_data("disney_data_final.json", movie_info_copy)
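The copy-and-convert step is needed because the `json` module cannot serialize `datetime` objects by default. The sketch below shows the failure and one alternative: passing a `default` function to `json.dumps`. This is not what the notebook does, and `encode_datetime` is a hypothetical name.

```python
import json
from datetime import datetime

record = {"title": "101 Dalmatians",
          "Release date (datetime)": datetime(1961, 1, 25)}

try:
    json.dumps(record)
    failed = False
except TypeError:
    failed = True  # datetime objects are not JSON serializable by default


def encode_datetime(obj):
    """Hypothetical fallback encoder used only for non-serializable values."""
    if isinstance(obj, datetime):
        return obj.strftime("%B %d, %Y")
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


as_json = json.dumps(record, default=encode_datetime)
print(as_json)
```

Either approach writes the date back in the `"%B %d, %Y"` form, so anything reading the JSON file has to parse that string again to recover a `datetime`.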
import pandas as pd
disney_movie_data = pd.DataFrame(movie_info_list)
disney_movie_data.head()
| | title | Production company | Distributed by | Release date | Running time | Country | Language | Box office | Running time (int) | Budget (float) | ... | Original concept by | Created by | Original work | Owner | Music | Lyrics | Book | Basis | Productions | Awards |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Academy Award Review of | Walt Disney Productions | United Artists | [May 19, 1937] | 41 minutes (74 minutes 1966 release) | United States | English | $45.472 | 41.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Snow White and the Seven Dwarfs | Walt Disney Productions | RKO Radio Pictures | NaN | 83 minutes | United States | English | $418 million | 83.0 | 1490000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Pinocchio | Walt Disney Productions | RKO Radio Pictures | NaN | 88 minutes | United States | English | $164 million | 88.0 | 2600000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Fantasia | Walt Disney Productions | RKO Radio Pictures | [November 13, 1940] | 126 minutes | United States | English | $76.4–$83.3 million (United States and Canada) | 126.0 | 2280000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | The Reluctant Dragon | Walt Disney Productions | RKO Radio Pictures | [June 27, 1941] | 74 minutes | United States | English | $960,000 (worldwide rentals) | 74.0 | 600000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 50 columns
disney_movie_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 519 entries, 0 to 518 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 519 non-null object 1 Production company 214 non-null object 2 Distributed by 517 non-null object 3 Release date 339 non-null object 4 Running time 496 non-null object 5 Country 464 non-null object 6 Language 498 non-null object 7 Box office 401 non-null object 8 Running time (int) 496 non-null float64 9 Budget (float) 307 non-null float64 10 Box office (float) 389 non-null float64 11 Release date (datetime) 332 non-null datetime64[ns] 12 imdb 501 non-null object 13 metascore 501 non-null object 14 rotten_tomatoes 374 non-null object 15 Directed by 514 non-null object 16 Written by 227 non-null object 17 Based on 278 non-null object 18 Produced by 505 non-null object 19 Starring 480 non-null object 20 Music by 509 non-null object 21 Release dates 172 non-null object 22 Budget 317 non-null object 23 Story by 172 non-null object 24 Narrated by 68 non-null object 25 Cinematography 389 non-null object 26 Edited by 464 non-null object 27 Languages 19 non-null object 28 Screenplay by 244 non-null object 29 Countries 49 non-null object 30 Color process 4 non-null object 31 Production companies 302 non-null object 32 Japanese 5 non-null object 33 Hepburn 5 non-null object 34 Adaptation by 1 non-null object 35 Animation by 1 non-null object 36 Traditional 2 non-null object 37 Simplified 2 non-null object 38 Original title 1 non-null object 39 Layouts by 2 non-null object 40 Original concept by 1 non-null object 41 Created by 1 non-null object 42 Original work 1 non-null object 43 Owner 1 non-null object 44 Music 1 non-null object 45 Lyrics 1 non-null object 46 Book 1 non-null object 47 Basis 1 non-null object 48 Productions 1 non-null object 49 Awards 1 non-null object dtypes: datetime64[ns](1), float64(3), object(46) memory usage: 202.9+ KB
disney_movie_data.to_csv("disney_movie_data_final.csv")
running_times = disney_movie_data.sort_values(['Running time (int)'], ascending=False)
running_times.head(10)
| | title | Production company | Distributed by | Release date | Running time | Country | Language | Box office | Running time (int) | Budget (float) | ... | Original concept by | Created by | Original work | Owner | Music | Lyrics | Book | Basis | Productions | Awards |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
517 | Tinker Bell | DisneyToon Studios | [Walt Disney Studios, Home Entertainment] | NaN | [468 minutes] | United States | English | NaN | 468.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26 | Davy Crockett: King of the Wild Frontier | Walt Disney Productions | Buena Vista Film Distribution Co., Inc. | May 25, 1955 | 192 minutes | United States | English | $50 million (US) | 192.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
328 | Pirates of the Caribbean: At World's End | NaN | Buena Vista Pictures | NaN | 167 minutes | United States | English | $960.9 million | 167.0 | 300000000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
86 | The Happiest Millionaire | Walt Disney Productions | Buena Vista Distribution | NaN | [164 minutes, (, Los Angeles, premiere), 144 m... | United States | English | $5 million (U.S./Canada rentals) | 164.0 | 5000000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
441 | Jagga Jasoos | NaN | UTV Motion Pictures | [14 July 2017] | 162 minutes | India | Hindi | 83 crore | 162.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
434 | Dangal | NaN | UTV Motion Pictures | NaN | 161 minutes | India | Hindi | est. est. | 161.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
466 | Hamilton | NaN | Walt Disney Studios Motion Pictures | [July 3, 2020] | 160 minutes | United States | English | NaN | 160.0 | 12500000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
422 | ABCD 2 | Walt Disney Pictures | UTV Motion Pictures | [19 June 2015] | 154 minutes | India | Hindi | est. | 154.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
319 | Pirates of the Caribbean: Dead Man's Chest | NaN | Buena Vista Pictures | NaN | 150 minutes | United States | English | $1.066 billion | 150.0 | 225000000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
338 | The Chronicles of Narnia: Prince Caspian | NaN | Walt Disney Studios Motion Pictures | NaN | 150 minutes | NaN | English | $419.7 million | 150.0 | 225000000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 50 columns