Bud Spencer and Terence Hill movies
Download worked project
Among the greatest gifts of Italy to the world we can certainly count Terence Hill and Bud Spencer movies.
Their film career can be found in Wikidata, a project by the Wikimedia foundation which aims to store only machine-readable data, like numbers, strings, and so on interlinked with many references. Each entity in Wikidata has an identifier, for example Terence Hill is the entity Q243430 and Bud Spencer is Q221074.
Wikidata can be queried using the SPARQL language: we performed this query repeated for several languages, and downloaded CSV files (among the many formats which can be chosen). Even if not necessary for the purposes of the exercise, you are invited to play a bit with the interface, like trying different visualizations (i.e. try clicking the eye in the middle-left corner and then select Graph) - or see other examples.
REQUIREMENTS: Having read Relational data tutorial , which contains also instructions for installing required libraries.
What to do
Unzip exercises zip in a folder, you should obtain something like this:
bud-spencer-terence-hill-movies-prj
bud-spencer-terence-hill-movies.ipynb
bud-spencer-terence-hill-movies-sol.ipynb
bud-spencer-terence-hill-movies-de.csv
bud-spencer-terence-hill-movies-en.csv
bud-spencer-terence-hill-movies-es.csv
bud-spencer-terence-hill-movies-it.csv
soft.py
jupman.py
WARNING: to correctly visualize the notebook, it MUST be in an unzipped folder !
open Jupyter Notebook from that folder. Two things should open, first a console and then a browser. The browser should show a file list: navigate the list and open the notebook
bud-spencer-terence-hill-movies.ipynb
Go on reading the notebook, and write in the appropriate cells when asked
Shortcut keys:
to execute Python code inside a Jupyter cell, press
Control + Enter
to execute Python code inside a Jupyter cell AND select next cell, press
Shift + Enter
to execute Python code inside a Jupyter cell AND a create a new cell aftwerwards, press
Alt + Enter
If the notebooks look stuck, try to select
Kernel -> Restart
The datasets
You are given some CSVs of movies, all having names ending in -xy.csv
, where xy
can be a language tag like it
, en
, de
, es
… They mostly contain the same data except for the movie labels which are in the corresponding language. The final goal will be displaying the network of movies and put in evidence the ones co-starring the famous duo.
Each file row contains info about a single actor starring in a movie. Multiple lines with same movie id will mean multiple actors are co-starring. We can see an excerpt of first four lines of english version: notice second movie has id Q180638 and is co-starred by both Bud Spencer and Terence Hill
star,starLabel,movie,movieLabel,firstReleased
http://www.wikidata.org/entity/Q221074,Bud Spencer,http://www.wikidata.org/entity/Q116187,Thieves and Robbers,1983-02-11T00:00:00Z
http://www.wikidata.org/entity/Q221074,Bud Spencer,http://www.wikidata.org/entity/Q180638,Odds and Evens,1978-10-28T00:00:00Z
http://www.wikidata.org/entity/Q243430,Terence Hill,http://www.wikidata.org/entity/Q180638,Odds and Evens,1978-10-28T00:00:00Z
1. load
Write a function that given a filename_prefix
and list of languages
, parses the corresponding files and RETURNS a dictionary of dictionaries, which maps movies id to movies data, in the format as in the exerpt.
When a label is missing, you will find instead an id like
Q3778078
: substitute it with empty string (HINT: to recognize ids you might useis_digit()
method)convert date numbers to proper integers
DO NOT put constant ids nor language tags in the code (so no
'Q221074'
nor'it'
…)
Example (complete output can be found in expected_db.py):
>>> load('bud-spencer-terence-hill-movies', ['en', 'it', 'de'])
{
'Q116187': {
'actors': [('Q221074', 'Bud Spencer')],
'first_release': (1983, 2, 11),
'names': {'de': 'Bud, der Ganovenschreck',
'en': 'Thieves and Robbers',
'it': 'Cane e gatto'}
}
'Q180638': {
'actors': [('Q221074', 'Bud Spencer'), ('Q243430', 'Terence Hill')],
'first_release': (1978, 10, 28),
'names': {'de': 'Zwei sind nicht zu bremsen',
'en': 'Odds and Evens',
'it': 'Pari e dispari'}
}
'Q231967': {
'actors': [('Q221074', 'Bud Spencer'), ('Q243430', 'Terence Hill')],
'first_release': (1981, 1, 1),
'names': {'de': 'Zwei Asse trumpfen auf',
'en': 'A Friend Is a Treasure',
'it': 'Chi trova un amico, trova un tesoro'}
}
.
.
.
}
[2]:
import csv
def load(filename_prefix, languages):
raise Exception('TODO IMPLEMENT ME !')
movies_db = load('bud-spencer-terence-hill-movies', ['en', 'it', 'de'])
#movies_db = load('bud-spencer-terence-hill-movies', ['es', 'en', 'de','it'])
movies_db
[3]:
# TESTING
from pprint import pformat; from expected_movies_db import expected_movies_db
for sid in expected_movies_db.keys():
if sid not in movies_db: print('\nERROR: MISSING movie', sid); break
for k in expected_movies_db[sid]:
if k not in movies_db[sid]:
print('\nERROR at movie', sid,'\n\n MISSING key:', k); break
if expected_movies_db[sid][k] != movies_db[sid][k]:
print('\nERROR at movie', sid, 'key:',k)
print(' ACTUAL:\n', pformat(movies_db[sid][k]))
print(' EXPECTED:\n', pformat(expected_movies_db[sid][k]))
break
if len(movies_db) > len(expected_movies_db):
print('ERROR! There are more movies than expected!')
print(' ACTUAL:\n', len(movies_db))
print(' EXPECTED:\n', len(expected_movies_db))
2. save_table
Write a function that given a movies db and a list of languages
, writes a new file merged.csv
separate actor names with
and
use only the year as date
file must be formatted like this
movie_id,name en,name it,first_release,actors
Q116187,Thieves and Robbers,Cane e gatto,1983,Bud Spencer
Q180638,Odds and Evens,Pari e dispari,1978,Bud Spencer and Terence Hill
Complete expected file is in expected-merged.csv
Show solution[5]:
import csv
def save_table(movies, languages):
raise Exception('TODO IMPLEMENT ME !')
save_table(movies_db, ['en','it'])
#save_table(movies_db, ['de'])
saved file to merged.csv
[6]:
# TESTING
with open('expected-merged.csv',encoding='utf-8', newline='') as expected_f:
with open('merged.csv',encoding='utf-8', newline='') as f:
expected_reader = csv.reader(expected_f, delimiter=',')
reader = csv.reader(f, delimiter=',')
i = 0
for expected_row in expected_reader:
try:
row = next(reader)
except:
print('ERROR at row', i, ': ACTUAL rows are less than EXPECTED!')
break
for j in range(len(expected_row)):
if expected_row[j] != row[j]:
print('ERROR at row', i, ' cell index', j)
print(row)
print('\nACTUAL :', row[j])
print('\nEXPECTED:', expected_row[j])
break
i += 1
3. show_graph
Display a NetworkX graph of movies (see examples) from since_year
(included) to until_year
(included), in the given language
display actor names as capitalized
display co-starred movies, non co-starred movies and actors with different colors by setting node attributes
style='filled'
and i.e.fillcolor='green'
(see some color names)
DO NOT use labels as node ids
DO NOT write constants in your code, so no 'Terence'
nor 'TERENCE'
…
Example 1
>>> show_graph(movies_db, 1970, 1975, 'en')
Example 2
>>> show_graph(movies_db, 1970, 1974, 'it')
[7]:
import networkx as nx
from soft import draw_nx
def show_graph(movies, since_year, until_year, language):
G = nx.DiGraph()
G.graph['graph']= { 'layout':'neato'} # don't delete these!
raise Exception('TODO IMPLEMENT ME !')
show_graph(movies_db, 1970, 1975, 'en')
[8]:
show_graph(movies_db, 1970, 1974, 'it')
[ ]: