Wikispeedia
Download worked project
What is the semantic distance between Batman and the Bible? Wikispeedia is a fun game where you are given (or can choose) apparently unrelated source and a target Wikipedia pages, and you are asked to reach target page by only clicking links you find along the pages you visit. These click paths provide valuable information regarding human behaviour and the semantic connection between different topics, for example search engines might use such information to better understand user queries. You will analyze a dataset of such paths.
Data source: https://snap.stanford.edu/data/wikispeedia.html
- Robert West and Jure Leskovec: Human Wayfinding in Information Networks. 21st International World Wide Web Conference (WWW), 2012. 
- Robert West, Joelle Pineau, and Doina Precup: Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts. 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009. 

REQUIREMENTS: Having read Relational data tutorial , which contains also instructions for installing required libraries.
What to do
- Unzip exercises zip in a folder, you should obtain something like this: 
wikispeedia-prj
    wikispeedia.ipynb
    wikispeedia-sol.ipynb
    paths_finished.tsv
    soft.py
    jupman.py
WARNING: to correctly visualize the notebook, it MUST be in an unzipped folder !
- open Jupyter Notebook from that folder. Two things should open, first a console and then a browser. The browser should show a file list: navigate the list and open the notebook - wikispeedia.ipynb
- Go on reading the notebook, and write in the appropriate cells when asked 
Shortcut keys:
- to execute Python code inside a Jupyter cell, press - Control + Enter
- to execute Python code inside a Jupyter cell AND select next cell, press - Shift + Enter
- to execute Python code inside a Jupyter cell AND a create a new cell aftwerwards, press - Alt + Enter
- If the notebooks look stuck, try to select - Kernel -> Restart
The dataset
Each row of the dataset paths_finished.tsv is a user session, where user navigates from start page to end page. Columns are hashedIpAddress, timestamp, durationInSec, path and rating.
We define a session group as all sessions which have same start page and same end page, for example all these paths start with Linguistics and end in Rome:
[1]:
import pandas as pd
import numpy as np
pd.options.display.max_colwidth = None
df = pd.read_csv('paths_finished.tsv', encoding='UTF-8', skiprows=16, header=None, sep='\t')
df[5890:5902]
[1]:
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 5890 | 2f8c281d5e0b0e93 | 1248913182 | 106 | Linguistics;Philosophy;Aristotle;Ancient_Greece;Italy;Rome | 3.0 | 
| 5891 | 389b67fa365b727a | 1249604227 | 89 | Linguistics;Language;English_language;Latin;Rome | 2.0 | 
| 5892 | 0299542414c3f20a | 1257970576 | 94 | Linguistics;Philosophy;Plato;Latin;Rome | NaN | 
| 5893 | 2b6e83d366a7514d | 1260188882 | 113 | Linguistics;Philosophy;Thomas_Aquinas;Italy;Rome | 2.0 | 
| 5894 | 0d57c8c57d75e2f5 | 1282028286 | 153 | Linguistics;Language;Spanish_language;Vulgar_Latin;Roman_Empire;Rome | NaN | 
| 5895 | 0d57c8c57d75e2f5 | 1295244051 | 62 | Linguistics;Philosophy;Socrates;Ancient_Greece;Ancient_Rome;Rome | 1.0 | 
| 5896 | 772843f73d9cf93d | 1307880434 | 177 | Linguistics;Language;German_language;Latin_alphabet;<;Italy;Rome | NaN | 
| 5897 | 654430d34a08a0f5 | 1339755238 | 81 | Linguistics;Philosophy;Augustine_of_Hippo;Italy;Rome | 2.0 | 
| 5898 | 12470aee3d5ad152 | 1344167616 | 40 | Linguistics;Philosophy;Plato;Italy;Rome | NaN | 
| 5899 | 6365b049395c53ce | 1345421532 | 67 | Linguistics;Culture;Ancient_Rome;Rome | 1.0 | 
| 5900 | 05786cb24102850d | 1347671980 | 65 | Linguistics;Language;English_language;Latin;Rome | 2.0 | 
| 5901 | 369a3f7e77700217 | 1349302559 | 56 | Linguistics;Language;English_language;Latin;Rome | NaN | 
In this other session group, all sessions start with Pikachu and end with Sun:
[2]:
df[45121:45138]
[2]:
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| 45121 | 0d57c8c57d75e2f5 | 1278403914 | 17 | Pikachu;North_America;Earth;Planet;Sun | 1.0 | 
| 45122 | 0d57c8c57d75e2f5 | 1278403938 | 42 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45123 | 0d57c8c57d75e2f5 | 1278403972 | 17 | Pikachu;Tree;Plant;<;Sunlight;Sun | 1.0 | 
| 45124 | 0d57c8c57d75e2f5 | 1278403991 | 8 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45125 | 0d57c8c57d75e2f5 | 1278404007 | 9 | Pikachu;Tree;Sunlight;Sun | 1.0 | 
| 45126 | 0d57c8c57d75e2f5 | 1278404048 | 8 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45127 | 0d57c8c57d75e2f5 | 1278404061 | 16 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45128 | 0d57c8c57d75e2f5 | 1278404065 | 8 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45129 | 0d57c8c57d75e2f5 | 1278404070 | 8 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45130 | 0d57c8c57d75e2f5 | 1278404089 | 8 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45131 | 0d57c8c57d75e2f5 | 1278404095 | 7 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45132 | 0d57c8c57d75e2f5 | 1278404101 | 9 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45133 | 0d57c8c57d75e2f5 | 1278404117 | 6 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45134 | 0d57c8c57d75e2f5 | 1278404124 | 10 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45135 | 0d57c8c57d75e2f5 | 1278404129 | 9 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45136 | 0d57c8c57d75e2f5 | 1278404142 | 7 | Pikachu;Tree;Sunlight;Sun | NaN | 
| 45137 | 0d57c8c57d75e2f5 | 1278404145 | 6 | Pikachu;Tree;Sunlight;Sun | NaN | 
1. filter_back
Whenever a user clicks the Back button, she navigates back one page. This fact is tracked in the data by the presence of a '<' symbol. Write a function which RETURN a NEW path without pages which were navigated back.
NOTE: you can have duplicates even without presence of <, because a user might end up to a previous page just by following circular links.
DO NOT misuse search methods, I’m watching you }:-[
Show solution[3]:
def filter_back(path):
    raise Exception('TODO IMPLEMENT ME !')
assert filter_back([]) == []
assert filter_back(['alfa']) == ['alfa']
assert filter_back(['beta','alfa','charlie']) == ['beta','alfa','charlie']
assert filter_back(['charlie', 'tango','<']) == ['charlie']
inp = ['charlie', 'tango','<']
assert filter_back(inp) == ['charlie']  # new
assert inp == ['charlie', 'tango','<']
assert filter_back(['alfa', 'beta', 'charlie','<','<','delta']) == ['alfa','delta']
assert filter_back(['alfa', 'beta', 'charlie','delta','eagle','<','<','golf','<','<','hotel']) \
       == ['alfa','beta','hotel']
# circular paths
assert filter_back(['alfa','beta','alfa','alfa','beta']) ==  ['alfa','beta','alfa','alfa','beta']
assert filter_back(['alfa','beta','alfa','<','charlie','charlie','delta','charlie','<','charlie','delta']) \
       == ['alfa','beta','charlie','charlie','delta','charlie','delta']
2. load_db
Load the tab-separated file paths_finished.tsv with a CSV Reader. The file has some rows to skip and no column names: parse it and RETURN a list of dictionaries, with hashedIpAddress, timestamp, durationInSec, path and rating as fields:
- path: convert it with- filter_backfunction
- timestampand- durationInSec: convert to integer
- rating: convert to integer, if- NULL, set it to- None
Example:
>>> sessions_db = load_db('paths_finished.tsv')
Parsed 51318 sessions
>>> sessions_db[:2]
[{'hashedIpAddress': '6a3701d319fc3754',
  'timestamp': 1297740409,
  'durationInSec': 166,
  'path': ['14th_century',
   '15th_century',
   '16th_century',
   'Pacific_Ocean',
   'Atlantic_Ocean',
   'Accra',
   'Africa',
   'Atlantic_slave_trade',
   'African_slave_trade'],
  'rating': None},
 {'hashedIpAddress': '3824310e536af032',
  'timestamp': 1344753412,
  'durationInSec': 88,
  'path': ['14th_century',
   'Europe',
   'Africa',
   'Atlantic_slave_trade',
   'African_slave_trade'],
  'rating': 3}]
[4]:
import csv
def load_db(filename):
    raise Exception('TODO IMPLEMENT ME !')
sessions_db = load_db('paths_finished.tsv')
sessions_db[:2]
[5]:
# TESTING
from pprint import pprint
from expected_db import expected_db
for i in range(len(expected_db)):
    if expected_db[i] != sessions_db[i]:
        print('\nERROR at index', i, ':')
        print('  ACTUAL:')
        pprint(sessions_db[i])
        print('  EXPECTED:')
        pprint(expected_db[i])
        break
if len(sessions_db) != len(expected_db):
    print("ERROR: different lengths! sessions_db:", len(sessions_db), "expected_db", len(expected_db))
3. calc_stats
Write a function which takes the sessions db and RETURN a NEW dictionary which maps sessions groups expressed as tuples (start, end) to a dictionary of statistics about them
- dictionary key: tuple with start,end page 
- sessions: the number of sessions in that group
- avg_len: the average length (as number of edges) of all paths in the group
- pages: the total number of DISTINCT pages found among all sessions in that group
- freqs: a dictionary which maps edges found in all sessions of that group to their count
- max_freq: the highest count among all freqs
Output example (for complete output see expected_stats_db.py):
>>> calc_stats(sessions_db)
{
    ('Linguistics', 'Rome'): {'avg_len': 4.166666666666667,
                              'freqs': { ('Ancient_Greece', 'Ancient_Rome'): 1,
                                         ('Ancient_Greece', 'Italy'): 1,
                                         ('Ancient_Rome', 'Rome'): 2,
                                         ('Aristotle', 'Ancient_Greece'): 1,
                                         ('Augustine_of_Hippo', 'Italy'): 1,
                                         ('Culture', 'Ancient_Rome'): 1,
                                         ('English_language', 'Latin'): 3,
                                         ('German_language', 'Italy'): 1,
                                         ('Italy', 'Rome'): 5,
                                         ('Language', 'English_language'): 3,
                                         ('Language', 'German_language'): 1,
                                         ('Language', 'Spanish_language'): 1,
                                         ('Latin', 'Rome'): 4,
                                         ('Linguistics', 'Culture'): 1,
                                         ('Linguistics', 'Language'): 5,
                                         ('Linguistics', 'Philosophy'): 6,
                                         ('Philosophy', 'Aristotle'): 1,
                                         ('Philosophy', 'Augustine_of_Hippo'): 1,
                                         ('Philosophy', 'Plato'): 2,
                                         ('Philosophy', 'Socrates'): 1,
                                         ('Philosophy', 'Thomas_Aquinas'): 1,
                                         ('Plato', 'Italy'): 1,
                                         ('Plato', 'Latin'): 1,
                                         ('Roman_Empire', 'Rome'): 1,
                                         ('Socrates', 'Ancient_Greece'): 1,
                                         ('Spanish_language', 'Vulgar_Latin'): 1,
                                         ('Thomas_Aquinas', 'Italy'): 1,
                                         ('Vulgar_Latin', 'Roman_Empire'): 1},
                           'max_freq': 6,
                           'pages': 19,
                           'sessions': 12},
     ('Pikachu', 'Sun'):   {'avg_len': 3.0588235294117645,
                            'freqs': {('Earth', 'Planet'): 1,
                                      ('North_America', 'Earth'): 1,
                                      ('Pikachu', 'North_America'): 1,
                                      ('Pikachu', 'Tree'): 16,
                                      ('Planet', 'Sun'): 1,
                                      ('Sunlight', 'Sun'): 16,
                                      ('Tree', 'Sunlight'): 16},
                            'max_freq': 16,
                            'pages': 7,
                            'sessions': 17},
    .
    .
    .
}
[6]:
def calc_stats(sessions):
    raise Exception('TODO IMPLEMENT ME !')
stats_db = calc_stats(sessions_db)
[7]:
# TESTING
from pprint import pprint
from expected_stats_db import expected_stats_db
for t in expected_stats_db:
    if not t in stats_db:
        print('\nERROR: missing key for session group', t)
        break
    elif expected_stats_db[t] != stats_db[t]:
        print('\nERROR at key for session group', t, ':')
        print('  ACTUAL:')
        pprint(stats_db[t])
        print('  EXPECTED:')
        pprint(expected_stats_db[t])
        break
diff = stats_db.keys() - expected_stats_db.keys()
if len(diff) > 0:
    print('ERROR! Found these extra keys in stats_db:', diff)
4. plot_network
Given a sessions group (start_page, target_page), we want to display the pages clicked by users for all its sessions. Since some sessions share edges, we will show their click frequency.
- Set edges attributes - 'weight',- 'label'as \(freq\),- 'penwidth'as \(5 \large \frac{freq}{max\_freq}\) and- 'color'as- '#45daed'
- Only display edges (and pages connected by such edges) for which the click count is strictly greater than the given - threshold
- NOTE: when applying a threshold, it’s fine if some nodes don’t appear linked to either source or target 
Example 1:
>>> plot_network(stats_db,'Linguistics', 'Rome')

Example 2:
>>> plot_network(stats_db, 'Batman', 'Bible', 0)  # default threshold zero, big graph

[9]:
from soft import draw_nx
import networkx as nx
def plot_network(stats, source_page, target_page, threshold=0):
    raise Exception('TODO IMPLEMENT ME !')
plot_network(stats_db,'Linguistics', 'Rome')
[10]:
plot_network(stats_db, 'Batman', 'Bible', 0)  # default threshold zero, big graph
[11]:
plot_network(stats_db, 'Batman', 'Bible', 1)   # we take only edges > 1
