Part 1 — Weather data: collection¶

At the time of my IB exams, I was in Hong Kong, so I navigated to the Hong Kong Observatory to download the csv files, and was very quickly reality-checked when I realized the data I sought were nowhere to be found. Eventually I found a subpage that had daily extracts of records for my timeframe, but the problem was, the page looked like this:

A nice front-end presentation with no access to the back. (source)

And there was no 'Download' button in sight. In defeat, I copy-pasted 2019's records for March, April, May, and June (although June was not in the Toggl data) into a Jupyter notebook and prepared for some wrangling. With each month imported as a string, by some miracle, the text was delimited with tabs:

In [4]:

for line in mar.split('\n')[:5]: print(line)

01	1016.1	22.4	20.8	19.6	18.8	89	91	0.4	0.4	060	30.6
02	1012.7	23.9	21.5	19.9	19.2	87	84	Trace	5.0	070	23.3
03	1011.3	23.5	21.5	20.0	19.1	87	85	6.3	4.2	010	8.0
04	1013.7	22.6	20.9	19.3	17.7	82	83	10.2	3.2	080	26.1

We ideally want this in a pandas data frame or in .csv format, so we can begin by stripping any newlines and replacing the delimiter.

In [5]:

(mar, apr, may, jun) = (mon.strip('\n').replace('\t', ',') for mon in (mar, apr, may, jun))

Then, we can aggregate all months into one yearly string, and format the dates properly by joining 2019/0 with the month and date:

In [6]:

yearly = ""
for m, month_obj in enumerate((mar, apr, may, jun), start=3):
    yearly += f'2019/0{m}/' + month_obj.replace('\n', f'\n2019/0{m}/') + '\n'

The data also have some special characters indicating partial records, but we'll just strip those too, and add our headers. This gives the following:

In [7]:

yearly = yearly.rstrip('\n').replace('Trace', '0.0').replace('#', '') # remove trailing \n at end, markers
yearly = "date,mean_pressure,max_temp,mean_temp,min_temp,mean_dew,humid,cloud,rain,sunshine,wind_dir,wind_speed\n" + yearly

In [8]:

for line in yearly.split('\n')[:5]: print(line)

date,mean_pressure,max_temp,mean_temp,min_temp,mean_dew,humid,cloud,rain,sunshine,wind_dir,wind_speed
2019/03/01,1016.1,22.4,20.8,19.6,18.8,89,91,0.4,0.4,060,30.6
2019/03/02,1012.7,23.9,21.5,19.9,19.2,87,84,0.0,5.0,070,23.3
2019/03/03,1011.3,23.5,21.5,20.0,19.1,87,85,6.3,4.2,010,8.0
2019/03/04,1013.7,22.6,20.9,19.3,17.7,82,83,10.2,3.2,080,26.1

Now we can write this to a file and load it with pandas.

In [9]:

with open("hk_weather.csv", "w") as csv_file:
    print(yearly, file=csv_file)
hk_df = pd.read_csv('hk_weather.csv', header=0, index_col=False)
print(hk_df.shape)

(122, 12)

Our last preprocessing job is to tell pandas our date column is of datetime format, and our job is done.

In [10]:

hk_df["date"] = pd.to_datetime(hk_df["date"])
hk_df.head()

Out[10]:

	date	mean_pressure	max_temp	mean_temp	min_temp	mean_dew	humid	cloud	rain	sunshine	wind_dir	wind_speed
0	2019-03-01	1016.1	22.4	20.8	19.6	18.8	89	91	0.4	0.4	060	30.6
1	2019-03-02	1012.7	23.9	21.5	19.9	19.2	87	84	0.0	5.0	070	23.3
2	2019-03-03	1011.3	23.5	21.5	20.0	19.1	87	85	6.3	4.2	010	8.0
3	2019-03-04	1013.7	22.6	20.9	19.3	17.7	82	83	10.2	3.2	080	26.1
4	2019-03-05	1012.1	26.7	22.2	17.7	20.0	88	89	30.3	2.0	070	22.7

Because I'm curious, let's check out how temperature and rain are distributed over the four months:

In [11]:

temps = alt.Chart(hk_df).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("mean_temp:Q", bin=True, title="Average Temperature (C)"),
    alt.Y("count()", title="Number of days"),
    color=alt.Color("month(date):O", title="Month", legend=alt.Legend(direction="horizontal", orient="top"))
).properties(width=250, height=200)

rains = alt.Chart(hk_df).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("rain:Q", bin=alt.Bin(extent=[0,120], step=10), title="Rainfall (mm)"),
    alt.Y("count()", title=None),
    color="month(date):O"
).properties(width=250, height=200)

temps | rains

Out[11]:

This is not just an excuse to show off the sheer power of Altair (the chart is interactive, too), but we can see that there are not that many rainy days in our dataset. Either way, we've essentially got half of the data processing done.

Part 2 — Toggl data: quantifying productivity¶

I glossed over this in the introduction, but if we're going to approach the question "Does weather impact my productivity?" we're going to need to define 'productivity.' Before we tackle that, let's load the data from Toggl:

In [12]:

toggl = pd.read_csv('time_entries.csv', header=0)
toggl['Email'] = '-'
toggl.head()

Out[12]:

	User	Email	Client	Project	Task	Description	Billable	Start date	Start time	End date	End time	Duration	Tags	Amount ()
0	Jack Oliver	-	NaN	Study / Revision	NaN	Physics Notes	No	2019-03-30	19:30:36	2019-03-30	20:35:15	01:04:39	NaN	NaN
1	Jack Oliver	-	NaN	Study / Revision	NaN	Physics Notes	No	2019-03-30	20:47:40	2019-03-30	21:19:10	00:31:30	NaN	NaN
2	Jack Oliver	-	NaN	Entertainment	NaN	Darkest Dungeon	No	2019-03-30	21:23:38	2019-03-30	21:48:38	00:25:00	NaN	NaN
3	Jack Oliver	-	NaN	Entertainment	NaN	Darkest Dungeon	No	2019-03-30	21:48:47	2019-03-30	22:13:42	00:24:55	NaN	NaN
4	Jack Oliver	-	NaN	Study / Revision	NaN	Physics Notes	No	2019-03-31	09:50:52	2019-03-31	10:15:52	00:25:00	NaN	NaN

Understanding the Toggl data¶

In Toggl, each entry requires a 'description' and a 'project' (originally designed for firms to track employee hours). Here, a 'description' in the data represents a specific task done, e.g. Physics Notes, and a 'project' represents what type of task that belongs to, e.g. Study / Revision. Toggl also records the date and time you started and ended the task, in addition to the duration.

The data frame is obviously quite bloated (and includes my email!), so let's subset the columns:

In [13]:

toggl = toggl.iloc[:, [3, 5, 7, 8, 11]]
toggl.head()

Out[13]:

	Project	Description	Start date	Start time	Duration
0	Study / Revision	Physics Notes	2019-03-30	19:30:36	01:04:39
1	Study / Revision	Physics Notes	2019-03-30	20:47:40	00:31:30
2	Entertainment	Darkest Dungeon	2019-03-30	21:23:38	00:25:00
3	Entertainment	Darkest Dungeon	2019-03-30	21:48:47	00:24:55
4	Study / Revision	Physics Notes	2019-03-31	09:50:52	00:25:00

Defining productivity¶

So how can we quantify productivity? There would be many ways to do this, but my approach is simple: assume that productivity is directly proportional to the minutes of schoolwork logged (in some timeframe). This way, if I goofed off, I wouldn't have recorded any work, so Minutes would be 0. An extra benefit of totalling the minutes worked rather than the amount of tasks done in a day is that we don't interpret doing many small tasks as 'being more productive' than doing one large task (also, we lack sufficient data to count entries as units of productivity).

It may be argued this approach is counterintuitive given productivity is largely thought of alongside efficiency, and this approach may actually penalize efficiency by rewarding longer tasks (even if the effort is low). However, since our sample is quite small, my intuition says we can't reliably derive a measure like minutes worked per complete task, so we'll stick to the prior approach. As an aside, this is an example of an assumption and design decision made in the data science process that may fundamentally alter the results of an analysis.

With that out of the way, let's add a new row to our data frame called Minutes that is the total minutes worked for one entry. We can pass the string Duration to a function to convert the HH:MM:SS format to a numeric type:

In [14]:

def str_to_dt(in_string):
    timeq = [int(i) for i in in_string.split(":")]
    return timeq[0]*60 + timeq[1] + timeq[2]/60

In [15]:

toggl["Minutes"] = toggl["Duration"].map(str_to_dt)

The map() function here is essentially taking every record of Duration and passing it to str_to_dt() to compute the minutes worked. Let's take this opportunity to also correct some types, and we'll add a new column called Task_Start to store both the date and time of starting a time entry.

In [16]:

toggl["Task_Start"] = pd.to_datetime(toggl["Start date"] + " " + toggl["Start time"])
toggl["Start date"] = pd.to_datetime(toggl["Start date"])
toggl["Start time"] = pd.to_datetime(toggl["Start time"])

In [17]:

toggl.head()

Out[17]:

	Project	Description	Start date	Start time	Duration	Minutes	Task_Start
0	Study / Revision	Physics Notes	2019-03-30	2021-10-01 19:30:36	01:04:39	64.650000	2019-03-30 19:30:36
1	Study / Revision	Physics Notes	2019-03-30	2021-10-01 20:47:40	00:31:30	31.500000	2019-03-30 20:47:40
2	Entertainment	Darkest Dungeon	2019-03-30	2021-10-01 21:23:38	00:25:00	25.000000	2019-03-30 21:23:38
3	Entertainment	Darkest Dungeon	2019-03-30	2021-10-01 21:48:47	00:24:55	24.916667	2019-03-30 21:48:47
4	Study / Revision	Physics Notes	2019-03-31	2021-10-01 09:50:52	00:25:00	25.000000	2019-03-31 09:50:52

Now we're ready to visualize the data. The two plots effortlessly made with Altair display how minutes are distributed in each project, and when each task was done. Click on a colored bar in the top plot to explore a project, and hover over points on the lower chart to see individual tasks. You can also try clicking and dragging on the lower plot to highlight a timeframe to see the minutes distribution update.

In [18]:

proj_select = alt.selection(type='interval', encodings=['x'])
hist_select = alt.selection_single(fields=["Project"])

hist_alt = alt.Chart(toggl).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("Minutes:Q", bin=alt.Bin(extent=[0,250], step=15), title="Minutes spent doing entry"),
    alt.Y("count()", title="Number of entries"),
    color="Project:N"
).properties(
    width=400
).add_selection(
    hist_select
).transform_filter(
    proj_select
).transform_filter(
    hist_select
)

line_alt = alt.Chart(toggl).mark_point(size=60).encode(
    alt.X("Start date:T", title="Date"),
    alt.Y("Minutes:Q", title="Minutes spent doing entry"),
    color=alt.condition(hist_select, 'Project:N', alt.value('lightgray')),
    opacity=alt.value(0.65),
    tooltip=alt.Tooltip('Description:N')
).add_selection(
    proj_select
).properties(
    width=400
)

hist_alt & line_alt

Out[18]:

Unsurprisingly, homework and revision dominate the entries. The 'Entertainment' category is also quick to fade as exams begin to creep up (beginning on May 6 in 2019). There are also some hints my study habits were more sporadic towards the beginning of exams too, with entries becoming more densely distributed at shorter minute counts.

We can also inquire into the day specifics. The two charts below implicitly answer the questions "Which day was I most productive throughout the week?" and "At what time did I record the most tasks?"

In [19]:

w_chart = alt.Chart(toggl).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("day(Task_Start):T", title="Day of week"),
    alt.Y("count()", title="Number of entries"),
    color=alt.Color("Project:N", legend=alt.Legend(orient='top'))
).properties(
    width=250,
    height=200
)

h_chart = alt.Chart(toggl).mark_area(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("hours(Task_Start):T", title="Time of day"),
    alt.Y("count()", title=None),
    color='Project:N'
).properties(
    width=250,
    height=200
)

w_chart | h_chart

Out[19]:

These charts show entertainment and homework taking over during weekends, and revision stagnating throughout the week (remember, I was on study-leave for these months).

Part 3 — Answering the question(s)¶

Let's move on to trying to answer the question at hand: "Does the weather affect my productivity?"

Rain analysis¶

First, I want to investigate whether rain makes me more prone to working hard. We're aiming for data with just productive entries, so let's drop all rows that aren't concerning 'Homework' or 'Study / Revision':

In [20]:

toggl = toggl[(toggl["Project"] == "Study / Revision") & (toggl["Project"] != "Homework")].reset_index(drop=True)

In [21]:

toggl.head()

Out[21]:

	Project	Description	Start date	Start time	Duration	Minutes	Task_Start
0	Study / Revision	Physics Notes	2019-03-30	2021-10-01 19:30:36	01:04:39	64.65	2019-03-30 19:30:36
1	Study / Revision	Physics Notes	2019-03-30	2021-10-01 20:47:40	00:31:30	31.50	2019-03-30 20:47:40
2	Study / Revision	Physics Notes	2019-03-31	2021-10-01 09:50:52	00:25:00	25.00	2019-03-31 09:50:52
3	Study / Revision	Physics Notes	2019-03-31	2021-10-01 10:26:23	00:25:00	25.00	2019-03-31 10:26:23
4	Study / Revision	Physics Notes	2019-03-31	2021-10-01 10:56:28	00:25:00	25.00	2019-03-31 10:56:28

Now let's take a look at our minutes data for 'productive' entries:

In [22]:

toggl["Minutes"].describe()

Out[22]:

count    204.000000
mean      39.876797
std       26.987821
min        0.050000
25%       19.787500
50%       31.900000
75%       56.320833
max      126.966667
Name: Minutes, dtype: float64

As an aside, can we investigate which subject/task took me the longest to complete work for, on average:

In [23]:

toggl.groupby('Description')['Minutes'].describe().sort_values('mean', ascending=False)

Out[23]:

	count	mean	std	min	25%	50%	75%	max
Description
English Notes	5.0	61.806667	30.260271	29.450000	32.200000	65.133333	89.000000	93.250000
Planning Schedule	1.0	56.333333	NaN	56.333333	56.333333	56.333333	56.333333	56.333333
Mathematics Past Papers	34.0	53.134804	34.469471	5.550000	25.041667	41.800000	83.041667	126.966667
Computer Science Notes	21.0	39.665079	24.453009	11.150000	20.050000	36.950000	58.000000	105.750000
Physics Past Papers	23.0	38.766667	26.062142	5.516667	15.150000	44.400000	57.958333	80.200000
Spanish Notes	6.0	38.233333	22.977312	12.050000	18.445833	38.841667	56.212500	66.016667
Economics Past Papers	11.0	37.972727	27.203248	6.133333	18.358333	25.216667	52.025000	89.416667
Economics Notes	61.0	36.199727	25.419009	2.366667	18.416667	29.366667	44.366667	120.333333
Computer Science Past Papers	16.0	36.158333	23.603877	0.050000	25.670833	30.075000	44.720833	93.733333
Spanish Past Papers	9.0	32.940741	20.024442	0.433333	24.133333	30.333333	35.416667	72.516667
Physics Notes	14.0	30.442857	19.917956	0.216667	21.025000	25.000000	40.191667	64.650000
English Past Papers	1.0	28.266667	NaN	28.266667	28.266667	28.266667	28.266667	28.266667
Mathematics Notes	2.0	26.783333	24.701597	9.316667	18.050000	26.783333	35.516667	44.250000

So English seems to be the most laborious with Mathematics past papers trailing. Economics also has the highest entry total of 72. I could aggregate these further into classes, but I digress.

At this point we should ask about the number of days we have:

In [24]:

toggl["Start date"].nunique()

Out[24]:

39 is a pretty small sample size, but the analysis is still worth a shot anyways. Let's start by just grabbing the related rows of our weather data frame:

In [25]:

raindf = hk_df.iloc[:, [0, 8]]

In [26]:

raindf.head()

Out[26]:

	date	rain
0	2019-03-01	0.4
1	2019-03-02	0.0
2	2019-03-03	6.3
3	2019-03-04	10.2
4	2019-03-05	30.3

Then we can aggregate our Toggl data into two rows: days, and minutes worked for that day.

In [27]:

toggl_days = toggl.groupby('Start date')['Minutes'].sum().reset_index().rename(columns={"Start date": "date", "Minutes": "minutes"})

This operation isn't too intimidating — in English, this is telling pandas to compute the sum of minutes worked for each day, and .reset_index() fixes our indices (left-most bolded column), and we finish by renaming the columns.

In [28]:

toggl_days.head()

Out[28]:

	date	minutes
0	2019-03-30	96.150000
1	2019-03-31	82.050000
2	2019-04-01	185.583333
3	2019-04-02	94.050000
4	2019-04-03	184.066667

A nice result. We can now seamlessly join these tables and write them to a .csv:

In [29]:

df = pd.merge(toggl_days, raindf)
df.head()

Out[29]:

	date	minutes	rain
0	2019-03-30	96.150000	0.0
1	2019-03-31	82.050000	7.7
2	2019-04-01	185.583333	0.0
3	2019-04-02	94.050000	0.0
4	2019-04-03	184.066667	0.0

In [30]:

df.to_csv("rain_raw.csv", index=False)

To gauge whether rain has an effect on minutes worked, we want to be able to classify each day as either 'rainy' or 'not rainy'. One method for classifying days based on precipitation is to use a threshold $\boldsymbol{\lambda} \in \{0.1,0.2,1,10,25\}$ such that:

$$ \verb|rain| = \begin{cases} \text{False} \quad &\text{if} \, \verb|rain| \lt \lambda \\ \text{True} \quad &\text{if} \, \verb|rain| \ge \lambda \\ \end{cases} $$

To choose $\lambda$, let's check out our data for rain:

In [31]:

df['rain'].describe()

Out[31]:

count    39.000000
mean      7.169231
std      14.187896
min       0.000000
25%       0.000000
50%       0.900000
75%       8.350000
max      75.800000
Name: rain, dtype: float64

In [32]:

alt.Chart(df).mark_bar().encode(
    alt.X("rain:Q", bin=alt.Bin(extent=[0,80], step=5)),
    alt.Y("count()")
)

Out[32]:

Given Hong Kong's humid nature alongside the median precipitation, $\lambda=1\,\text{mm}$ appears to be a good fit for the data. We can redefine our rain column to achieve the following data frame:

In [33]:

df.loc[df['rain'] >= 1, 'rain_day'] = "Yes"
df.loc[df['rain'] < 1, 'rain_day'] = "No"

df.drop(columns='rain', inplace=True)
df.rename(columns={"rain_day": "rain"}, inplace=True)

In [34]:

df.head()

Out[34]:

	date	minutes	rain
0	2019-03-30	96.150000	No
1	2019-03-31	82.050000	Yes
2	2019-04-01	185.583333	No
3	2019-04-02	94.050000	No
4	2019-04-03	184.066667	No

Let's see how the data are split:

In [35]:

df['rain'].value_counts()

Out[35]:

No     20
Yes    19
Name: rain, dtype: int64

Which is as good of a split as we could've hoped for. Now for some preliminary visualization: let's see how rainy/non-rainy days are distributed against minutes:

In [36]:

alt.Chart(df).mark_bar(tooltip=alt.TooltipContent("encoding")).encode(
    alt.X("minutes:Q", bin=alt.Bin(extent=[25,500], step=25), title="Minutes spent on schoolwork"),
    alt.Y("count()", title="Number of entries"),
    alt.Color("rain:N", title="Rain",
              scale=alt.Scale(domain=["No", "Yes"], range=['#ffdfbf', 'steelblue']))
)

Out[36]:

By eye, it looks like rainy days have slightly higher minute counts. Let's compute the mean of both types of days:

In [37]:

df.groupby('rain')['minutes'].mean()

Out[37]:

rain
No     200.647500
Yes    216.942982
Name: minutes, dtype: float64

So the average task time is higher on average, but is the difference statistically significant? One way of answering this is to knock on classical statistics' door and ask for a p-value. We want to test the alternative hypothesis that rainy days, on average, have higher minute counts. So our null hypothesis will be that there is no difference between minutes worked on average on rainy and non-rainy days. So our set-up is as follows:

$$ H_0:\mu_R=\mu_N \\ \text{vs} \\ H_A:\mu_R\gt\mu_N$$

where $\mu_R,\mu_N$ are the true means of minutes worked for rainy and non-rainy days respectively. Let's write the data frame to a .csv and head into R to perform a simple t-test:

In [38]:

df.to_csv("rain_school.csv", index=False)

rain <- read.csv("rain_school.csv")
t.test(rain[rain$rain=="Yes",]$minutes, rain[rain$rain=="No",]$minutes, alternative='greater')

    Welch Two Sample t-test

data:  rain[rain$rain == "Yes", ]$minutes and rain[rain$rain == "No", ]$minutes
t = 0.48793, df = 35.485, p-value = 0.3143
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -40.11065       Inf
sample estimates:
mean of x mean of y 
 216.9430  200.6475

So we obtain a p-value of $\approx0.3$, which is quite large. To interpret this, we can say that if we repeated the experiment for minutes worked on rainy/non-rainy days, and there really was no impact of rain on minutes worked, the probability of seeing more minutes worked for rainy days than non-rainy days to the extent we saw or to a greater extent just by chance is 30%.

While that is a mouthful, it tells us there is a relatively high probability of seeing a positive deviation in minutes worked just by chance if there was no difference in rainy/non-rainy days — so we fail to reject the null hypothesis. The mini-conclusion here is that more data are definitely needed.

Predicting minutes worked¶

Let's backtrack and look at the rest of the weather data:

In [39]:

hk_df.head()

Out[39]:

	date	mean_pressure	max_temp	mean_temp	min_temp	mean_dew	humid	cloud	rain	sunshine	wind_dir	wind_speed
0	2019-03-01	1016.1	22.4	20.8	19.6	18.8	89	91	0.4	0.4	060	30.6
1	2019-03-02	1012.7	23.9	21.5	19.9	19.2	87	84	0.0	5.0	070	23.3
2	2019-03-03	1011.3	23.5	21.5	20.0	19.1	87	85	6.3	4.2	010	8.0
3	2019-03-04	1013.7	22.6	20.9	19.3	17.7	82	83	10.2	3.2	080	26.1
4	2019-03-05	1012.1	26.7	22.2	17.7	20.0	88	89	30.3	2.0	070	22.7

There are an abundance of variables — let's just choose mean_temp, humid, cloud, rain, and sunshine as factors we think may have an impact on productivity.

In [40]:

weather = hk_df.iloc[:, [0, 3, 6, 7, 8, 9]]
weather.head()

Out[40]:

	date	mean_temp	humid	cloud	rain	sunshine
0	2019-03-01	20.8	89	91	0.4	0.4
1	2019-03-02	21.5	87	84	0.0	5.0
2	2019-03-03	21.5	87	85	6.3	4.2
3	2019-03-04	20.9	82	83	10.2	3.2
4	2019-03-05	22.2	88	89	30.3	2.0

Let's join this with our previous table containing the minutes, and write that to a .csv:

In [41]:

df = pd.merge(df.iloc[: , [0, 1]], weather)
df.to_csv("full_days.csv", index=False)
df.head()

Out[41]:

	date	minutes	mean_temp	humid	cloud	rain	sunshine
0	2019-03-30	96.150000	23.1	86	81	0.0	0.6
1	2019-03-31	82.050000	21.4	85	93	7.7	0.1
2	2019-04-01	185.583333	20.3	81	91	0.0	0.2
3	2019-04-02	94.050000	20.7	76	89	0.0	3.6
4	2019-04-03	184.066667	22.8	80	85	0.0	7.0

If you were wondering, sunshine is the number of hours of bright sunshine hours, not total. Let's head over to R to try to build some models for predicting minutes. We can begin by just checking out the correlations between our variables:

> days <- read.csv("full_days.csv")
> cor(days[c(2:7)])
            minutes   mean_temp      humid       cloud        rain    sunshine
minutes    1.00000000  0.08525068  0.1479811 -0.07462029 -0.04984429  0.07714833
mean_temp  0.08525068  1.00000000 -0.3882086 -0.63293077 -0.23396924  0.70265663
humid      0.14798108 -0.38820860  1.0000000  0.56440892  0.53635898 -0.63297290
cloud     -0.07462029 -0.63293077  0.5644089  1.00000000  0.33323933 -0.90797842
rain      -0.04984429 -0.23396924  0.5363590  0.33323933  1.00000000 -0.38933099
sunshine   0.07714833  0.70265663 -0.6329729 -0.90797842 -0.38933099  1.00000000

We can see the weather associations we'd expect — temperature/cloud and sunshine, humidity and cloud, etc. But the correlations among minutes look pretty unpromising. Regardless, let's begin by partitioning our data with a 75/25 training/testing split:

set.seed(30734592) # for reproduceability
train.ind <- sample(seq_len(nrow(days)), size = floor(.75*nrow(days)))
days.train <- days[train.ind,]
days.test <- days[-train.ind,]

And fit a multiple regression equation by modeling minutes as a function of our weather variables:

> days.fit <- lm(minutes ~ mean_temp + humid + cloud + rain + sunshine, data=days.train)
> summary(days.fit)

Residuals:
    Min       1Q   Median       3Q      Max 
-117.588  -65.600   -2.683   69.162  151.200 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -139.4934   413.7243  -0.337    0.739
mean_temp    -13.0997     9.7955  -1.337    0.194
humid          7.0194     4.1107   1.708    0.101
cloud          0.2519     2.0980   0.120    0.905
rain          -1.7436     1.2900  -1.352    0.190
sunshine      10.9424    14.1435   0.774    0.447

Residual standard error: 86.59 on 23 degrees of freedom
Multiple R-squared:  0.1723,    Adjusted R-squared:  -0.007695 
F-statistic: 0.9572 on 5 and 23 DF,  p-value: 0.4639

Our p-values are pretty large across the board, but there are some hints a few variables may be associated with minutes. We can calculate the Root Mean Square Error (RMSE) by

$$ \text{RMSE} = \sqrt{ \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2} {n} }$$

where $ y_i $ are the true values for our targets, and $\hat{y}_i$ are the predicted values, and $n$ is our sample size. This can be done in R for our test dataset:

> sqrt(mean((days.test$minutes - predict(days.fit, days.test))^2))
[1] 160.763

Which is quite a large average difference. We can try a few other models — bagged regression trees are generally good for reducing variance. We can create one in R with the rpart package:

rain_bag <- bagging(
    formula = minutes ~ mean_temp + humid + cloud + rain + sunshine,
    data = days.train,
    nbagg = 100,
    coob = TRUE,
    control = rpart.control(minsplit=2, cp=0)
)

> sqrt(mean((days.test$minutes - predict(rain_bag, days.test))^2))
[1] 154.6983

And we achieve a new RMSE of 155, which is a slight improvement over 161. We can try one more model — a random forest, for fun. We can use the randomForest library in R to elegantly do this:

rain.rf <- randomForest(minutes ~ mean_temp + humid + cloud + rain + sunshine,
                        data=days.train,
                        importance=TRUE)

> sqrt(mean((days.test$minutes - predict(rain.rf, days.test))^2))
[1] 149.7549

And an RMSE of 150 is again an improvement over 155 and 161, but is still very high. To conclude, we can plot the predictions and compare models:

Conclusion¶

As is typical with analysis of small samples, more data are needed to conclusively suggest there is any difference in (my) productivity owing to the weather. Looking back at the linear model, perhaps an interesting route to explore is humidity due to its notable t-statistic and seemingly borderline association ( $p\approx0.1$). Of course, this may very well be solely due to chance (the more low-quality features we include, the more likely we find a non-existent correlation due to chance).

At the very least, I'm happy I had the opportunity to poke around in some of my old data and refresh my pandas/R/Altair skils, and conduct some surface-level exploratory analysis even if the results were inconclusive. As is sometimes the case with throwaway personal projects, the famous saying can be invoked — it's about the journey, not the destination.

Does the Weather Affect my Productivity?