Analytics with Pandas: 2. Advanced operations
Download exercises zip
Let’s see how to do more advanced operations with pandas, like grouping with groupby, joining tables with merge, and performing geospatial analysis with GeoPandas (only mentioned).
We chose to collect these topics in this notebook because problems are more likely to arise while executing these operations, typically requiring some further internet searching.
1. Grouping
Reference: pandas groupby documentation
To group items and perform statistics on each group, you can use the groupby method.
Let’s see an example of a possible grouping. First, we reload the astropi.csv described in the previous tutorial:
[1]:
import pandas as pd
import numpy as np
df = pd.read_csv('astropi.csv', encoding='UTF-8')
Suppose we want to calculate how many humidity readings fall into the interval defined by each integer humidity value humidity_int, so as to be able to plot a bar chart like this (there are actually faster methods with numpy for making histograms, but here we follow the step-by-step approach):

1.1 Let’s see a group
To get an initial idea, we could start by checking only the rows that belong to group 42, that is, the rows with a humidity value from 42.0 (included) up to 43.0 (excluded). We can use the transform method as previously seen, noting that group 42 holds 2776 rows:
[2]:
df[ df['humidity'].transform(int) == 42]
[2]:
| | time_stamp | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | mag_y | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19222 | 2016-02-18 16:37:00 | 33.18 | 28.96 | 26.51 | 42.99 | 1006.10 | 1.19 | 53.23 | 313.69 | 9.081925 | -32.244905 | -35.135448 | -0.000581 | 0.018936 | 0.014607 | 0.000563 | 0.000346 | -0.000113 | 0 |
| 19619 | 2016-02-18 17:43:50 | 33.34 | 29.06 | 26.62 | 42.91 | 1006.30 | 1.50 | 52.54 | 194.49 | -53.197113 | -4.014863 | -20.257249 | -0.000439 | 0.018838 | 0.014526 | -0.000259 | 0.000323 | -0.000181 | 0 |
| 19621 | 2016-02-18 17:44:10 | 33.38 | 29.06 | 26.62 | 42.98 | 1006.28 | 1.01 | 52.89 | 195.39 | -52.911983 | -4.207085 | -20.754475 | -0.000579 | 0.018903 | 0.014580 | 0.000415 | -0.000232 | 0.000400 | 0 |
| 19655 | 2016-02-18 17:49:51 | 33.37 | 29.07 | 26.62 | 42.94 | 1006.28 | 0.93 | 53.21 | 203.76 | -43.124080 | -8.181511 | -29.151436 | -0.000432 | 0.018919 | 0.014608 | 0.000182 | 0.000341 | 0.000015 | 0 |
| 19672 | 2016-02-18 17:52:40 | 33.33 | 29.06 | 26.62 | 42.93 | 1006.24 | 1.34 | 52.71 | 206.97 | -36.893841 | -10.130503 | -31.484077 | -0.000551 | 0.018945 | 0.014794 | -0.000378 | -0.000013 | -0.000101 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 110864 | 2016-02-29 09:24:21 | 31.56 | 27.52 | 24.83 | 42.94 | 1005.83 | 1.58 | 49.93 | 129.60 | -15.169673 | -27.642610 | 1.563183 | -0.000682 | 0.017743 | 0.014646 | -0.000264 | 0.000206 | 0.000196 | 0 |
| 110865 | 2016-02-29 09:24:30 | 31.55 | 27.50 | 24.83 | 42.72 | 1005.85 | 1.89 | 49.92 | 130.51 | -15.832622 | -27.729389 | 1.785682 | -0.000736 | 0.017570 | 0.014855 | 0.000143 | 0.000199 | -0.000024 | 0 |
| 110866 | 2016-02-29 09:24:41 | 31.58 | 27.50 | 24.83 | 42.83 | 1005.85 | 2.09 | 50.00 | 132.04 | -16.646212 | -27.719479 | 1.629533 | -0.000647 | 0.017657 | 0.014799 | 0.000537 | 0.000257 | 0.000057 | 0 |
| 110867 | 2016-02-29 09:24:50 | 31.62 | 27.50 | 24.83 | 42.81 | 1005.88 | 2.88 | 49.69 | 133.00 | -17.270447 | -27.793136 | 1.703806 | -0.000835 | 0.017635 | 0.014877 | 0.000534 | 0.000456 | 0.000195 | 0 |
| 110868 | 2016-02-29 09:25:00 | 31.57 | 27.51 | 24.83 | 42.94 | 1005.86 | 2.17 | 49.77 | 134.18 | -17.885872 | -27.824149 | 1.293345 | -0.000787 | 0.017261 | 0.014380 | 0.000459 | 0.000076 | 0.000030 | 0 |
2776 rows × 19 columns
1.2 groupby
We can generalize and associate to each integer group the number of rows belonging to that group with the groupby method. First, let’s make a column holding the integer humidity value for each row:
[3]:
df['humidity_int'] = df['humidity'].transform( lambda x: int(x) )
[4]:
df[ ['time_stamp', 'humidity_int', 'humidity'] ].head()
[4]:
| | time_stamp | humidity_int | humidity |
|---|---|---|---|
| 0 | 2016-02-16 10:44:40 | 44 | 44.94 |
| 1 | 2016-02-16 10:44:50 | 45 | 45.12 |
| 2 | 2016-02-16 10:45:00 | 45 | 45.12 |
| 3 | 2016-02-16 10:45:10 | 45 | 45.32 |
| 4 | 2016-02-16 10:45:20 | 45 | 45.18 |
Then we can call groupby, specifying:

- first, the column to group by (humidity_int)
- second, the column on which to calculate the statistics
- finally, the statistic to be computed, in this case .count() (other common ones are sum(), min(), max(), mean()…)
[5]:
df.groupby(['humidity_int'])['humidity'].count()
[5]:
humidity_int
42 2776
43 2479
44 13029
45 32730
46 35775
47 14176
48 7392
49 297
50 155
51 205
52 209
53 128
54 224
55 164
56 139
57 183
58 237
59 271
60 300
Name: humidity, dtype: int64
Note the result is a Series:
[6]:
result = df.groupby(['humidity_int'])['humidity'].count()
[7]:
type(result)
[7]:
pandas.core.series.Series
Since we would like a customized bar chart, for the sake of simplicity we can use the native plt.bar function of matplotlib, for which we will need one sequence for the \(xs\) coordinates and another one for the \(ys\).
The sequence for \(xs\) can be extracted from the index of the Series:
[8]:
result.index
[8]:
Int64Index([42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60],
dtype='int64', name='humidity_int')
For the \(ys\) sequence we can directly use the Series like this:
[9]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.bar(result.index, result)
plt.xlabel('humidity groups')
plt.ylabel('count')
plt.title('Humidity distribution')
plt.xticks(result.index, result.index) # shows labels as integers
plt.tick_params(bottom=False) # removes the little bottom lines
plt.show()
1.3 Modifying a dataframe by plugging in the result of a grouping
Notice we’ve got only 19 rows in the grouped series:
[10]:
df.groupby(['humidity_int'])['humidity'].count()
[10]:
humidity_int
42 2776
43 2479
44 13029
45 32730
46 35775
47 14176
48 7392
49 297
50 155
51 205
52 209
53 128
54 224
55 164
56 139
57 183
58 237
59 271
60 300
Name: humidity, dtype: int64
How could we fill the whole original table, assigning to each row the count of its own group?
We can use transform like this:
[11]:
df.groupby(['humidity_int'])['humidity'].transform('count')
[11]:
0 13029
1 32730
2 32730
3 32730
4 32730
...
110864 2776
110865 2776
110866 2776
110867 2776
110868 2776
Name: humidity, Length: 110869, dtype: int64
As usual, groupby does not modify the dataframe: if we want the result stored in the dataframe, we need to assign it to a new column:
[12]:
df['humidity_counts'] = df.groupby(['humidity_int'])['humidity'].transform('count')
[13]:
df
[13]:
| | time_stamp | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | ... | mag_z | accel_x | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | humidity_int | humidity_counts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-02-16 10:44:40 | 31.88 | 27.57 | 25.01 | 44.94 | 1001.68 | 1.49 | 52.25 | 185.21 | -46.422753 | ... | -12.129346 | -0.000468 | 0.019439 | 0.014569 | 0.000942 | 0.000492 | -0.000750 | 20 | 44 | 13029 |
| 1 | 2016-02-16 10:44:50 | 31.79 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.03 | 53.73 | 186.72 | -48.778951 | ... | -12.943096 | -0.000614 | 0.019436 | 0.014577 | 0.000218 | -0.000005 | -0.000235 | 0 | 45 | 32730 |
| 2 | 2016-02-16 10:45:00 | 31.66 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.24 | 53.57 | 186.21 | -49.161878 | ... | -12.642772 | -0.000569 | 0.019359 | 0.014357 | 0.000395 | 0.000600 | -0.000003 | 0 | 45 | 32730 |
| 3 | 2016-02-16 10:45:10 | 31.69 | 27.52 | 25.01 | 45.32 | 1001.69 | 1.57 | 53.63 | 186.03 | -49.341941 | ... | -12.615509 | -0.000575 | 0.019383 | 0.014409 | 0.000308 | 0.000577 | -0.000102 | 0 | 45 | 32730 |
| 4 | 2016-02-16 10:45:20 | 31.66 | 27.54 | 25.01 | 45.18 | 1001.71 | 0.85 | 53.66 | 186.46 | -50.056683 | ... | -12.678341 | -0.000548 | 0.019378 | 0.014380 | 0.000321 | 0.000691 | 0.000272 | 0 | 45 | 32730 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 110864 | 2016-02-29 09:24:21 | 31.56 | 27.52 | 24.83 | 42.94 | 1005.83 | 1.58 | 49.93 | 129.60 | -15.169673 | ... | 1.563183 | -0.000682 | 0.017743 | 0.014646 | -0.000264 | 0.000206 | 0.000196 | 0 | 42 | 2776 |
| 110865 | 2016-02-29 09:24:30 | 31.55 | 27.50 | 24.83 | 42.72 | 1005.85 | 1.89 | 49.92 | 130.51 | -15.832622 | ... | 1.785682 | -0.000736 | 0.017570 | 0.014855 | 0.000143 | 0.000199 | -0.000024 | 0 | 42 | 2776 |
| 110866 | 2016-02-29 09:24:41 | 31.58 | 27.50 | 24.83 | 42.83 | 1005.85 | 2.09 | 50.00 | 132.04 | -16.646212 | ... | 1.629533 | -0.000647 | 0.017657 | 0.014799 | 0.000537 | 0.000257 | 0.000057 | 0 | 42 | 2776 |
| 110867 | 2016-02-29 09:24:50 | 31.62 | 27.50 | 24.83 | 42.81 | 1005.88 | 2.88 | 49.69 | 133.00 | -17.270447 | ... | 1.703806 | -0.000835 | 0.017635 | 0.014877 | 0.000534 | 0.000456 | 0.000195 | 0 | 42 | 2776 |
| 110868 | 2016-02-29 09:25:00 | 31.57 | 27.51 | 24.83 | 42.94 | 1005.86 | 2.17 | 49.77 | 134.18 | -17.885872 | ... | 1.293345 | -0.000787 | 0.017261 | 0.014380 | 0.000459 | 0.000076 | 0.000030 | 0 | 42 | 2776 |
110869 rows × 21 columns
1.4 Exercise - meteo pressure intervals
✪✪✪ The dataset meteo.csv contains the weather data of Trento, November 2017 (source: www.meteotrentino.it). We would like to subdivide the pressure readings into three intervals A (low), B (medium), C (high), and count how many readings have been made for each interval.
IMPORTANT: assign the dataframe to a variable called meteo, so as to avoid confusion with other dataframes
1.4.1 Where are the intervals?
First, let’s find the pressure values for these 3 intervals and plot them as segments, so as to end up with a chart like this:

Before doing the plot, we will need to know at which height we should plot the segments.
Load the dataset with pandas, calculate the following variables and PRINT them (a sketch of a possible solution follows the expected output below):

- use UTF-8 as encoding
- round values with the round function
- the excursion is the difference between maximum and minimum
- note intervalC coincides with the maximum
- DO NOT use min and max as variable names (they would shadow the Python built-in functions!)
[14]:
import pandas as pd
# write here
minimum: 966.3
maximum: 998.3
excursion: 32.0
intervalA: 976.97
intervalB: 987.63
intervalC: 998.3
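A minimal sketch of one possible solution: dividing the excursion into three equal parts is our assumption, though it matches the printed values.

import pandas as pd

meteo = pd.read_csv('meteo.csv', encoding='UTF-8')

# do not shadow the built-ins: we use 'minimum' and 'maximum' as names
minimum = meteo['Pressure'].min()
maximum = meteo['Pressure'].max()
excursion = round(maximum - minimum, 2)
intervalA = round(minimum + excursion / 3, 2)      # upper bound of A (low)
intervalB = round(minimum + 2 * excursion / 3, 2)  # upper bound of B (medium)
intervalC = maximum                                # coincides with the maximum

print('minimum:', minimum)
print('maximum:', maximum)
print('excursion:', excursion)
print('intervalA:', intervalA)
print('intervalB:', intervalB)
print('intervalC:', intervalC)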
1.4.2 Segments plot
Now try to plot the chart of the pressure and the 4 horizontal segments (a possible sketch follows the cell below).

- to overlay the segments with different colors, just make repeated calls to plt.plot
- a segment is defined by two points: so just find the coordinates of those two points
- try leaving some space above and below the chart
- REMEMBER title and labels
[15]:
# write here
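A possible sketch of the segments plot, reusing the variables computed above (colors and spacing are just an assumption):

import matplotlib.pyplot as plt

plt.plot(meteo['Pressure'], label='pressure')
n = len(meteo)
# each horizontal segment is just two points sharing the same y coordinate
plt.plot([0, n], [minimum, minimum], label='minimum')
plt.plot([0, n], [intervalA, intervalA], label='intervalA')
plt.plot([0, n], [intervalB, intervalB], label='intervalB')
plt.plot([0, n], [intervalC, intervalC], label='intervalC')
plt.ylim(minimum - 5, maximum + 5)   # some space above and below the chart
plt.xlabel('reading')
plt.ylabel('pressure')
plt.title('Pressure intervals - Trento, November 2017')
plt.legend()
plt.show()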
1.4.3 Assigning the intervals
We literally made a picture of where the intervals are located. Let’s now ask ourselves how many readings were taken in each interval.

First, try creating a column which assigns to each reading the interval it belongs to (one possible sketch follows the expected output below).

- HINT 1: use transform
- HINT 2: in the function you are going to define, do not recalculate values such as the minimum, maximum, intervals etc., because it would slow down Pandas. Instead, use the variables we’ve already defined: remember that transform re-executes the argument function for each row!
[16]:
# write here
[16]:
| | Date | Pressure | Rain | Temp | PressureInterval |
|---|---|---|---|---|---|
| 0 | 01/11/2017 00:00 | 995.4 | 0.0 | 5.4 | C (high) |
| 1 | 01/11/2017 00:15 | 995.5 | 0.0 | 6.0 | C (high) |
| 2 | 01/11/2017 00:30 | 995.5 | 0.0 | 5.9 | C (high) |
| 3 | 01/11/2017 00:45 | 995.7 | 0.0 | 5.4 | C (high) |
| 4 | 01/11/2017 01:00 | 995.7 | 0.0 | 5.3 | C (high) |
| ... | ... | ... | ... | ... | ... |
| 2873 | 30/11/2017 23:00 | 980.0 | 0.0 | 0.2 | B (medium) |
| 2874 | 30/11/2017 23:15 | 980.2 | 0.0 | 0.5 | B (medium) |
| 2875 | 30/11/2017 23:30 | 980.2 | 0.0 | 0.6 | B (medium) |
| 2876 | 30/11/2017 23:45 | 980.5 | 0.0 | 0.2 | B (medium) |
| 2877 | 01/12/2017 00:00 | 980.6 | 0.0 | -0.3 | B (medium) |
2878 rows × 5 columns
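One possible sketch, using transform with a function that only reads the precomputed bounds (the helper name to_interval is our own):

# classify a single pressure reading using the bounds computed earlier,
# without recalculating them at every call
def to_interval(p):
    if p <= intervalA:
        return 'A (low)'
    elif p <= intervalB:
        return 'B (medium)'
    else:
        return 'C (high)'

meteo['PressureInterval'] = meteo['Pressure'].transform(to_interval)
meteo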
1.4.4 Grouping by intervals
We would like to have a histogram like this one:

First, create a grouping to count occurrences:
[17]:
# write here
[17]:
PressureInterval
A (low) 255
B (medium) 1243
C (high) 1380
Name: Pressure, dtype: int64
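A sketch of the grouping, following the same pattern used above for the humidity counts (it assumes the PressureInterval column from the previous step):

counts = meteo.groupby(['PressureInterval'])['Pressure'].count()
counts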
Now plot it (a possible sketch follows the cell below).

- NOTE: the result of groupby is also a Series, so it’s plottable as we’ve already seen…
- REMEMBER title and axis labels
[18]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
# write here
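A possible sketch of the plot, reusing the counts Series from the sketch above:

plt.bar(counts.index, counts)
plt.xlabel('pressure interval')
plt.ylabel('count')
plt.title('Pressure readings per interval')
plt.show()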
1.5 Exercise - meteo average temperature
✪✪✪ Calculate the average temperature for each day, and show it in the plot, so as to have a couple of new columns like these:
Day Avg_day_temp
01/11/2017 7.983333
01/11/2017 7.983333
01/11/2017 7.983333
. .
. .
02/11/2017 7.384375
02/11/2017 7.384375
02/11/2017 7.384375
. .
. .
HINT 1: add a 'Day' column by extracting only the day from the date. To do it, use the .str accessor applied to the whole column.
HINT 2: There are various ways to solve the exercise (a sketch of the efficient solution follows the expected output below):

- The most performant and elegant is with the groupby operator, see Pandas transform - more than meets the eye
- As an alternative, you may use a for loop to cycle through the days. Typically, using a for loop is not a good idea with Pandas, as on large datasets it can take a long time to perform the updates. Still, since this dataset is small enough, you should get results in a decent amount of time.
[19]:
# write here
[20]:
SOLUTION 1 (EFFICIENT) - best solution with groupby and transform
Date Pressure Rain Temp Day Avg_day_temp
0 01/11/2017 00:00 995.4 0.0 5.4 01/11/2017 7.983333
1 01/11/2017 00:15 995.5 0.0 6.0 01/11/2017 7.983333
2 01/11/2017 00:30 995.5 0.0 5.9 01/11/2017 7.983333
3 01/11/2017 00:45 995.7 0.0 5.4 01/11/2017 7.983333
4 01/11/2017 01:00 995.7 0.0 5.3 01/11/2017 7.983333
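A minimal sketch of the efficient approach (the exact steps are our assumption, consistent with the hints above):

# extract the day part '01/11/2017' from strings like '01/11/2017 00:15'
meteo['Day'] = meteo['Date'].str[:10]
# per-day average temperature, broadcast back to every row of that day
meteo['Avg_day_temp'] = meteo.groupby('Day')['Temp'].transform('mean')
meteo[['Date', 'Pressure', 'Rain', 'Temp', 'Day', 'Avg_day_temp']].head()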
2. Merging tables
Suppose we want to add a column with the geographical position of the ISS. To do so, we would need to join our dataset with another one containing such information. Let’s take for example the dataset iss-coords.csv:
[23]:
iss_coords = pd.read_csv('iss-coords.csv', encoding='UTF-8')
[24]:
iss_coords
[24]:
| | timestamp | lat | lon |
|---|---|---|---|
| 0 | 2016-01-01 05:11:30 | -45.103458 | 14.083858 |
| 1 | 2016-01-01 06:49:59 | -37.597242 | 28.931170 |
| 2 | 2016-01-01 11:52:30 | 17.126141 | 77.535602 |
| 3 | 2016-01-01 11:52:30 | 17.126464 | 77.535861 |
| 4 | 2016-01-01 14:54:08 | 7.259561 | 70.001561 |
| ... | ... | ... | ... |
| 333 | 2016-02-29 13:23:17 | -51.077590 | -31.093987 |
| 334 | 2016-02-29 13:44:13 | 30.688553 | -135.403820 |
| 335 | 2016-02-29 13:44:13 | 30.688295 | -135.403533 |
| 336 | 2016-02-29 18:44:57 | 27.608774 | -130.198781 |
| 337 | 2016-02-29 21:36:47 | 27.325186 | -129.893278 |
338 rows × 3 columns
We notice there is a timestamp column, which unfortunately has a slightly different name than the time_stamp column (notice the underscore _) in the original astropi dataset:
[25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110869 entries, 0 to 110868
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time_stamp 110869 non-null object
1 temp_cpu 110869 non-null float64
2 temp_h 110869 non-null float64
3 temp_p 110869 non-null float64
4 humidity 110869 non-null float64
5 pressure 110869 non-null float64
6 pitch 110869 non-null float64
7 roll 110869 non-null float64
8 yaw 110869 non-null float64
9 mag_x 110869 non-null float64
10 mag_y 110869 non-null float64
11 mag_z 110869 non-null float64
12 accel_x 110869 non-null float64
13 accel_y 110869 non-null float64
14 accel_z 110869 non-null float64
15 gyro_x 110869 non-null float64
16 gyro_y 110869 non-null float64
17 gyro_z 110869 non-null float64
18 reset 110869 non-null int64
19 humidity_int 110869 non-null int64
20 humidity_counts 110869 non-null int64
dtypes: float64(17), int64(3), object(1)
memory usage: 17.8+ MB
To merge the two datasets on those columns, we can use the merge method like this:
[26]:
# remember merge produces a NEW dataframe
geo_astropi = df.merge(iss_coords, left_on='time_stamp', right_on='timestamp')
# merge will add both time_stamp and timestamp columns,
# so we remove the duplicate column `timestamp`
geo_astropi = geo_astropi.drop('timestamp', axis=1)
[27]:
geo_astropi
[27]:
| | time_stamp | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | ... | accel_y | accel_z | gyro_x | gyro_y | gyro_z | reset | humidity_int | humidity_counts | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-02-19 03:49:00 | 32.53 | 28.37 | 25.89 | 45.31 | 1006.04 | 1.31 | 51.63 | 34.91 | 21.125001 | ... | 0.018851 | 0.014607 | 0.000060 | -0.000304 | 0.000046 | 0 | 45 | 32730 | 31.434741 | 52.917464 |
| 1 | 2016-02-19 14:30:40 | 32.30 | 28.12 | 25.62 | 45.57 | 1007.42 | 1.49 | 52.29 | 333.49 | 16.083471 | ... | 0.018687 | 0.014502 | 0.000208 | -0.000499 | 0.000034 | 0 | 45 | 32730 | -46.620658 | -57.311657 |
| 2 | 2016-02-19 14:30:40 | 32.30 | 28.12 | 25.62 | 45.57 | 1007.42 | 1.49 | 52.29 | 333.49 | 16.083471 | ... | 0.018687 | 0.014502 | 0.000208 | -0.000499 | 0.000034 | 0 | 45 | 32730 | -46.620477 | -57.311138 |
| 3 | 2016-02-21 22:14:11 | 32.21 | 28.05 | 25.50 | 47.36 | 1012.41 | 0.67 | 52.40 | 27.57 | 15.441683 | ... | 0.018800 | 0.014136 | -0.000015 | -0.000159 | 0.000221 | 0 | 47 | 14176 | 19.138359 | -140.211489 |
| 4 | 2016-02-23 23:40:50 | 32.32 | 28.18 | 25.61 | 47.45 | 1010.62 | 1.14 | 51.41 | 33.68 | 11.994554 | ... | 0.018276 | 0.014124 | 0.000368 | 0.000368 | 0.000030 | 0 | 47 | 14176 | 4.713819 | 80.261665 |
| 5 | 2016-02-24 10:05:51 | 32.39 | 28.26 | 25.70 | 46.83 | 1010.51 | 0.61 | 51.91 | 287.86 | 6.554283 | ... | 0.018352 | 0.014344 | -0.000664 | -0.000518 | 0.000171 | 0 | 46 | 35775 | -46.061583 | 22.246025 |
| 6 | 2016-02-25 00:23:01 | 32.38 | 28.18 | 25.62 | 46.52 | 1008.28 | 0.90 | 51.77 | 30.80 | 9.947132 | ... | 0.018502 | 0.014366 | 0.000290 | 0.000314 | -0.000375 | 0 | 46 | 35775 | 47.047346 | 137.958918 |
| 7 | 2016-02-27 01:43:10 | 32.42 | 28.34 | 25.76 | 45.72 | 1006.79 | 0.57 | 49.85 | 10.57 | 7.805606 | ... | 0.017930 | 0.014378 | -0.000026 | -0.000013 | -0.000047 | 0 | 45 | 32730 | -41.049112 | 30.193004 |
| 8 | 2016-02-27 01:43:10 | 32.42 | 28.34 | 25.76 | 45.72 | 1006.79 | 0.57 | 49.85 | 10.57 | 7.805606 | ... | 0.017930 | 0.014378 | -0.000026 | -0.000013 | -0.000047 | 0 | 45 | 32730 | -8.402991 | -100.981726 |
| 9 | 2016-02-28 09:48:40 | 32.62 | 28.62 | 26.02 | 45.15 | 1006.06 | 1.12 | 50.44 | 301.74 | 10.348327 | ... | 0.017620 | 0.014725 | -0.000358 | -0.000301 | -0.000061 | 0 | 45 | 32730 | 50.047523 | 175.566751 |
10 rows × 23 columns
Exercise 2.1 - better merge
If you notice, the above table does have lat and lon columns, but it has very few rows. Why? Try to merge the tables in some meaningful way, so as to have all the original rows and all the cells of lat and lon filled (a possible sketch follows the expected output below).

- For other merging strategies, read about the how attribute in Why And How To Use Merge With Pandas in Python
- To fill the missing values don’t use fancy interpolation techniques, just put the station position in that given day or hour
[28]:
# write here
[28]:
| | time_stamp | temp_cpu | temp_h | temp_p | humidity | pressure | pitch | roll | yaw | mag_x | ... | accel_z | gyro_x | gyro_y | gyro_z | reset | humidity_int | humidity_counts | timestamp | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-02-16 10:44:40 | 31.88 | 27.57 | 25.01 | 44.94 | 1001.68 | 1.49 | 52.25 | 185.21 | -46.422753 | ... | 0.014569 | 0.000942 | 0.000492 | -0.000750 | 20 | 44 | 13029 | NaN | NaN | NaN |
| 1 | 2016-02-16 10:44:50 | 31.79 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.03 | 53.73 | 186.72 | -48.778951 | ... | 0.014577 | 0.000218 | -0.000005 | -0.000235 | 0 | 45 | 32730 | NaN | NaN | NaN |
| 2 | 2016-02-16 10:45:00 | 31.66 | 27.53 | 25.01 | 45.12 | 1001.72 | 1.24 | 53.57 | 186.21 | -49.161878 | ... | 0.014357 | 0.000395 | 0.000600 | -0.000003 | 0 | 45 | 32730 | NaN | NaN | NaN |
| 3 | 2016-02-16 10:45:10 | 31.69 | 27.52 | 25.01 | 45.32 | 1001.69 | 1.57 | 53.63 | 186.03 | -49.341941 | ... | 0.014409 | 0.000308 | 0.000577 | -0.000102 | 0 | 45 | 32730 | NaN | NaN | NaN |
| 4 | 2016-02-16 10:45:20 | 31.66 | 27.54 | 25.01 | 45.18 | 1001.71 | 0.85 | 53.66 | 186.46 | -50.056683 | ... | 0.014380 | 0.000321 | 0.000691 | 0.000272 | 0 | 45 | 32730 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 110866 | 2016-02-29 09:24:21 | 31.56 | 27.52 | 24.83 | 42.94 | 1005.83 | 1.58 | 49.93 | 129.60 | -15.169673 | ... | 0.014646 | -0.000264 | 0.000206 | 0.000196 | 0 | 42 | 2776 | NaN | NaN | NaN |
| 110867 | 2016-02-29 09:24:30 | 31.55 | 27.50 | 24.83 | 42.72 | 1005.85 | 1.89 | 49.92 | 130.51 | -15.832622 | ... | 0.014855 | 0.000143 | 0.000199 | -0.000024 | 0 | 42 | 2776 | NaN | NaN | NaN |
| 110868 | 2016-02-29 09:24:41 | 31.58 | 27.50 | 24.83 | 42.83 | 1005.85 | 2.09 | 50.00 | 132.04 | -16.646212 | ... | 0.014799 | 0.000537 | 0.000257 | 0.000057 | 0 | 42 | 2776 | NaN | NaN | NaN |
| 110869 | 2016-02-29 09:24:50 | 31.62 | 27.50 | 24.83 | 42.81 | 1005.88 | 2.88 | 49.69 | 133.00 | -17.270447 | ... | 0.014877 | 0.000534 | 0.000456 | 0.000195 | 0 | 42 | 2776 | NaN | NaN | NaN |
| 110870 | 2016-02-29 09:25:00 | 31.57 | 27.51 | 24.83 | 42.94 | 1005.86 | 2.17 | 49.77 | 134.18 | -17.885872 | ... | 0.014380 | 0.000459 | 0.000076 | 0.000030 | 0 | 42 | 2776 | NaN | NaN | NaN |
110871 rows × 24 columns
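A hedged sketch of one possible strategy: a left merge keeps all the original rows (with NaN where no coordinate matches, as in the table above), and the missing values can then be filled with the average known position of that day (the day helper column is our own):

# left merge: every astropi row survives, unmatched rows get NaN lat/lon
geo2 = df.merge(iss_coords, how='left',
                left_on='time_stamp', right_on='timestamp')
# extract the day part, e.g. '2016-02-19'
geo2['day'] = geo2['time_stamp'].str[:10]
for col in ['lat', 'lon']:
    # fill NaNs with the per-day mean of the known positions
    geo2[col] = geo2[col].fillna(geo2.groupby('day')[col].transform('mean'))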
3. GeoPandas
You can easily manipulate geographical data with the GeoPandas library. For a nice online tutorial, we refer to the Geospatial Analysis and Representation for Data Science course website of the Master in Data Science at the University of Trento, by Maurizio Napolitano (FBK).
Continue
Go on with the challenges worksheet