Does the Weather Affect my Productivity?

A lazy summer morning, evening?
Image from The New Yorker (source)

For most of us, the environment in which we work is an extremely impactful factor that contributes to our capacity for productivity. Lighting, odor, temperature, furniture — these are parameters of our environment we can tune to allow ourselves to work better.

But we can't control everything, that would be cheating. Weather conditions are an external factor that may or may not affect productivity: one study suggests worse weather is correlated with higher productivity.

This got me thinking — could I conduct a toy case-study on myself? Where would I get the data? Suddenly, I remembered in the months of study-leave leading up to my International Baccalaureate (IB) exams, I had tracked my time spent studying and doing homework using Toggl.

My hours logged leading up to the IB examinations

So I had 135 hours' worth of data over March, April, and May 2019 that could be downloaded as a .csv — not bad. Also, at the time, I was totally blind to what I would do with the data, meaning I possibly ducked data collection biases by being naïve! I wagered it wouldn't be too hard to find the weather data, and the rest of this blog post is what ensued.

Part 1 — Weather data: collection

At the time of my IB exams, I was in Hong Kong, so I navigated to the Hong Kong Observatory to download the csv files, and was very quickly reality-checked when I realized the data I sought were nowhere to be found. Eventually I found a subpage that had daily extracts of records for my timeframe, but the problem was, the page looked like this:

A nice front-end presentation with no access to the back.
A nice front-end presentation with no access to the back. (source)

And there was no 'Download' button in sight. In defeat, I copy-pasted 2019's records for March, April, May, and June (although June was not in the Toggl data) into a Jupyter notebook and prepared for some wrangling. With each month imported as a string, by some miracle, the text was delimited with tabs:

In [4]:
for line in mar.split('\n')[:5]: print(line)
01	1016.1	22.4	20.8	19.6	18.8	89	91	0.4	0.4	060	30.6
02	1012.7	23.9	21.5	19.9	19.2	87	84	Trace	5.0	070	23.3
03	1011.3	23.5	21.5	20.0	19.1	87	85	6.3	4.2	010	8.0
04	1013.7	22.6	20.9	19.3	17.7	82	83	10.2	3.2	080	26.1

We ideally want this in a pandas data frame or in .csv format, so we can begin by stripping any newlines and replacing the delimiter.

In [5]:
(mar, apr, may, jun) = (mon.strip('\n').replace('\t', ',') for mon in (mar, apr, may, jun))

Then, we can aggregate all months into one yearly string, and format the dates properly by joining 2019/0 with the month and date:

In [6]:
yearly = ""
for m, month_obj in enumerate((mar, apr, may, jun), start=3):
    yearly += f'2019/0{m}/' + month_obj.replace('\n', f'\n2019/0{m}/') + '\n'

The data also have some special characters indicating partial records, but we'll just strip those too, and add our headers. This gives the following:

In [7]:
yearly = yearly.rstrip('\n').replace('Trace', '0.0').replace('#', '') # remove trailing \n at end, markers
yearly = "date,mean_pressure,max_temp,mean_temp,min_temp,mean_dew,humid,cloud,rain,sunshine,wind_dir,wind_speed\n" + yearly
In [8]:
for line in yearly.split('\n')[:5]: print(line)
date,mean_pressure,max_temp,mean_temp,min_temp,mean_dew,humid,cloud,rain,sunshine,wind_dir,wind_speed
2019/03/01,1016.1,22.4,20.8,19.6,18.8,89,91,0.4,0.4,060,30.6
2019/03/02,1012.7,23.9,21.5,19.9,19.2,87,84,0.0,5.0,070,23.3
2019/03/03,1011.3,23.5,21.5,20.0,19.1,87,85,6.3,4.2,010,8.0
2019/03/04,1013.7,22.6,20.9,19.3,17.7,82,83,10.2,3.2,080,26.1

Now we can write this to a file and load it with pandas.

In [9]:
with open("hk_weather.csv", "w") as csv_file:
    print(yearly, file=csv_file)
hk_df = pd.read_csv('hk_weather.csv', header=0, index_col=False)
print(hk_df.shape)
(122, 12)

Our last preprocessing job is to tell pandas our date column is of datetime format, and our job is done.

In [10]:
hk_df["date"] = pd.to_datetime(hk_df["date"])
hk_df.head()
Out[10]:
date mean_pressure max_temp mean_temp min_temp mean_dew humid cloud rain sunshine wind_dir wind_speed
0 2019-03-01 1016.1 22.4 20.8 19.6 18.8 89 91 0.4 0.4 060 30.6
1 2019-03-02 1012.7 23.9 21.5 19.9 19.2 87 84 0.0 5.0 070 23.3
2 2019-03-03 1011.3 23.5 21.5 20.0 19.1 87 85 6.3 4.2 010 8.0
3 2019-03-04 1013.7 22.6 20.9 19.3 17.7 82 83 10.2 3.2 080 26.1
4 2019-03-05 1012.1 26.7 22.2 17.7 20.0 88 89 30.3 2.0 070 22.7

Because I'm curious, let's check out how temperature and rain are distributed over the four months:

In [11]:
temps = alt.Chart(hk_df).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("mean_temp:Q", bin=True, title="Average Temperature (C)"),
    alt.Y("count()", title="Number of days"),
    color=alt.Color("month(date):O", title="Month", legend=alt.Legend(direction="horizontal", orient="top"))
).properties(width=250, height=200)

rains = alt.Chart(hk_df).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("rain:Q", bin=alt.Bin(extent=[0,120], step=10), title="Rainfall (mm)"),
    alt.Y("count()", title=None),
    color="month(date):O"
).properties(width=250, height=200)

temps | rains
Out[11]:

This is not just an excuse to show off the sheer power of Altair (the chart is interactive, too), but we can see that there are not that many rainy days in our dataset. Either way, we've essentially got half of the data processing done.

Part 2 — Toggl data: quantifying productivity

I glossed over this in the introduction, but if we're going to approach the question "Does weather impact my productivity?" we're going to need to define 'productivity.' Before we tackle that, let's load the data from Toggl:

In [12]:
toggl = pd.read_csv('time_entries.csv', header=0)
toggl['Email'] = '-'
toggl.head()
Out[12]:
User Email Client Project Task Description Billable Start date Start time End date End time Duration Tags Amount ()
0 Jack Oliver - NaN Study / Revision NaN Physics Notes No 2019-03-30 19:30:36 2019-03-30 20:35:15 01:04:39 NaN NaN
1 Jack Oliver - NaN Study / Revision NaN Physics Notes No 2019-03-30 20:47:40 2019-03-30 21:19:10 00:31:30 NaN NaN
2 Jack Oliver - NaN Entertainment NaN Darkest Dungeon No 2019-03-30 21:23:38 2019-03-30 21:48:38 00:25:00 NaN NaN
3 Jack Oliver - NaN Entertainment NaN Darkest Dungeon No 2019-03-30 21:48:47 2019-03-30 22:13:42 00:24:55 NaN NaN
4 Jack Oliver - NaN Study / Revision NaN Physics Notes No 2019-03-31 09:50:52 2019-03-31 10:15:52 00:25:00 NaN NaN

Understanding the Toggl data

In Toggl, each entry requires a 'description' and a 'project' (originally designed for firms to track employee hours). Here, a 'description' in the data represents a specific task done, e.g. Physics Notes, and a 'project' represents what type of task that belongs to, e.g. Study / Revision. Toggl also records the date and time you started and ended the task, in addition to the duration.

The data frame is obviously quite bloated (and includes my email!), so let's subset the columns:

In [13]:
toggl = toggl.iloc[:, [3, 5, 7, 8, 11]]
toggl.head()
Out[13]:
Project Description Start date Start time Duration
0 Study / Revision Physics Notes 2019-03-30 19:30:36 01:04:39
1 Study / Revision Physics Notes 2019-03-30 20:47:40 00:31:30
2 Entertainment Darkest Dungeon 2019-03-30 21:23:38 00:25:00
3 Entertainment Darkest Dungeon 2019-03-30 21:48:47 00:24:55
4 Study / Revision Physics Notes 2019-03-31 09:50:52 00:25:00

Defining productivity

So how can we quantify productivity? There would be many ways to do this, but my approach is simple: assume that productivity is directly proportional to the minutes of schoolwork logged (in some timeframe). This way, if I goofed off, I wouldn't have recorded any work, so Minutes would be 0. An extra benefit of totalling the minutes worked rather than the amount of tasks done in a day is that we don't interpret doing many small tasks as 'being more productive' than doing one large task (also, we lack sufficient data to count entries as units of productivity).

It may be argued this approach is counterintuitive given productivity is largely thought of alongside efficiency, and this approach may actually penalize efficiency by rewarding longer tasks (even if the effort is low). However, since our sample is quite small, my intuition says we can't reliably derive a measure like minutes worked per complete task, so we'll stick to the prior approach. As an aside, this is an example of an assumption and design decision made in the data science process that may fundamentally alter the results of an analysis.

With that out of the way, let's add a new row to our data frame called Minutes that is the total minutes worked for one entry. We can pass the string Duration to a function to convert the HH:MM:SS format to a numeric type:

In [14]:
def str_to_dt(in_string):
    timeq = [int(i) for i in in_string.split(":")]
    return timeq[0]*60 + timeq[1] + timeq[2]/60
In [15]:
toggl["Minutes"] = toggl["Duration"].map(str_to_dt)

The map() function here is essentially taking every record of Duration and passing it to str_to_dt() to compute the minutes worked. Let's take this opportunity to also correct some types, and we'll add a new column called Task_Start to store both the date and time of starting a time entry.

In [16]:
toggl["Task_Start"] = pd.to_datetime(toggl["Start date"] + " " + toggl["Start time"])
toggl["Start date"] = pd.to_datetime(toggl["Start date"])
toggl["Start time"] = pd.to_datetime(toggl["Start time"])
In [17]:
toggl.head()
Out[17]:
Project Description Start date Start time Duration Minutes Task_Start
0 Study / Revision Physics Notes 2019-03-30 2021-10-01 19:30:36 01:04:39 64.650000 2019-03-30 19:30:36
1 Study / Revision Physics Notes 2019-03-30 2021-10-01 20:47:40 00:31:30 31.500000 2019-03-30 20:47:40
2 Entertainment Darkest Dungeon 2019-03-30 2021-10-01 21:23:38 00:25:00 25.000000 2019-03-30 21:23:38
3 Entertainment Darkest Dungeon 2019-03-30 2021-10-01 21:48:47 00:24:55 24.916667 2019-03-30 21:48:47
4 Study / Revision Physics Notes 2019-03-31 2021-10-01 09:50:52 00:25:00 25.000000 2019-03-31 09:50:52

Now we're ready to visualize the data. The two plots effortlessly made with Altair display how minutes are distributed in each project, and when each task was done. Click on a colored bar in the top plot to explore a project, and hover over points on the lower chart to see individual tasks. You can also try clicking and dragging on the lower plot to highlight a timeframe to see the minutes distribution update.

In [18]:
proj_select = alt.selection(type='interval', encodings=['x'])
hist_select = alt.selection_single(fields=["Project"])

hist_alt = alt.Chart(toggl).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("Minutes:Q", bin=alt.Bin(extent=[0,250], step=15), title="Minutes spent doing entry"),
    alt.Y("count()", title="Number of entries"),
    color="Project:N"
).properties(
    width=400
).add_selection(
    hist_select
).transform_filter(
    proj_select
).transform_filter(
    hist_select
)

line_alt = alt.Chart(toggl).mark_point(size=60).encode(
    alt.X("Start date:T", title="Date"),
    alt.Y("Minutes:Q", title="Minutes spent doing entry"),
    color=alt.condition(hist_select, 'Project:N', alt.value('lightgray')),
    opacity=alt.value(0.65),
    tooltip=alt.Tooltip('Description:N')
).add_selection(
    proj_select
).properties(
    width=400
)

hist_alt & line_alt
Out[18]:

Unsurprisingly, homework and revision dominate the entries. The 'Entertainment' category is also quick to fade as exams begin to creep up (beginning on May 6 in 2019). There are also some hints my study habits were more sporadic towards the beginning of exams too, with entries becoming more densely distributed at shorter minute counts.

We can also inquire into the day specifics. The two charts below implicitly answer the questions "Which day was I most productive throughout the week?" and "At what time did I record the most tasks?"

In [19]:
w_chart = alt.Chart(toggl).mark_bar(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("day(Task_Start):T", title="Day of week"),
    alt.Y("count()", title="Number of entries"),
    color=alt.Color("Project:N", legend=alt.Legend(orient='top'))
).properties(
    width=250,
    height=200
)

h_chart = alt.Chart(toggl).mark_area(tooltip=alt.TooltipContent('encoding')).encode(
    alt.X("hours(Task_Start):T", title="Time of day"),
    alt.Y("count()", title=None),
    color='Project:N'
).properties(
    width=250,
    height=200
)

w_chart | h_chart
Out[19]:

These charts show entertainment and homework taking over during weekends, and revision stagnating throughout the week (remember, I was on study-leave for these months).

Part 3 — Answering the question(s)

Let's move on to trying to answer the question at hand: "Does the weather affect my productivity?"

Rain analysis

First, I want to investigate whether rain makes me more prone to working hard. We're aiming for data with just productive entries, so let's drop all rows that aren't concerning 'Homework' or 'Study / Revision':

In [20]:
toggl = toggl[(toggl["Project"] == "Study / Revision") & (toggl["Project"] != "Homework")].reset_index(drop=True)
In [21]:
toggl.head()
Out[21]:
Project Description Start date Start time Duration Minutes Task_Start
0 Study / Revision Physics Notes 2019-03-30 2021-10-01 19:30:36 01:04:39 64.65 2019-03-30 19:30:36
1 Study / Revision Physics Notes 2019-03-30 2021-10-01 20:47:40 00:31:30 31.50 2019-03-30 20:47:40
2 Study / Revision Physics Notes 2019-03-31 2021-10-01 09:50:52 00:25:00 25.00 2019-03-31 09:50:52
3 Study / Revision Physics Notes 2019-03-31 2021-10-01 10:26:23 00:25:00 25.00 2019-03-31 10:26:23
4 Study / Revision Physics Notes 2019-03-31 2021-10-01 10:56:28 00:25:00 25.00 2019-03-31 10:56:28

Now let's take a look at our minutes data for 'productive' entries:

In [22]:
toggl["Minutes"].describe()
Out[22]:
count    204.000000
mean      39.876797
std       26.987821
min        0.050000
25%       19.787500
50%       31.900000
75%       56.320833
max      126.966667
Name: Minutes, dtype: float64

As an aside, can we investigate which subject/task took me the longest to complete work for, on average:

In [23]:
toggl.groupby('Description')['Minutes'].describe().sort_values('mean', ascending=False)
Out[23]:
count mean std min 25% 50% 75% max
Description
English Notes 5.0 61.806667 30.260271 29.450000 32.200000 65.133333 89.000000 93.250000
Planning Schedule 1.0 56.333333 NaN 56.333333 56.333333 56.333333 56.333333 56.333333
Mathematics Past Papers 34.0 53.134804 34.469471 5.550000 25.041667 41.800000 83.041667 126.966667
Computer Science Notes 21.0 39.665079 24.453009 11.150000 20.050000 36.950000 58.000000 105.750000
Physics Past Papers 23.0 38.766667 26.062142 5.516667 15.150000 44.400000 57.958333 80.200000
Spanish Notes 6.0 38.233333 22.977312 12.050000 18.445833 38.841667 56.212500 66.016667
Economics Past Papers 11.0 37.972727 27.203248 6.133333 18.358333 25.216667 52.025000 89.416667
Economics Notes 61.0 36.199727 25.419009 2.366667 18.416667 29.366667 44.366667 120.333333
Computer Science Past Papers 16.0 36.158333 23.603877 0.050000 25.670833 30.075000 44.720833 93.733333
Spanish Past Papers 9.0 32.940741 20.024442 0.433333 24.133333 30.333333 35.416667 72.516667
Physics Notes 14.0 30.442857 19.917956 0.216667 21.025000 25.000000 40.191667 64.650000
English Past Papers 1.0 28.266667 NaN 28.266667 28.266667 28.266667 28.266667 28.266667
Mathematics Notes 2.0 26.783333 24.701597 9.316667 18.050000 26.783333 35.516667 44.250000

So English seems to be the most laborious with Mathematics past papers trailing. Economics also has the highest entry total of 72. I could aggregate these further into classes, but I digress.

At this point we should ask about the number of days we have:

In [24]:
toggl["Start date"].nunique()
Out[24]:
39

39 is a pretty small sample size, but the analysis is still worth a shot anyways. Let's start by just grabbing the related rows of our weather data frame:

In [25]:
raindf = hk_df.iloc[:, [0, 8]]
In [26]:
raindf.head()
Out[26]:
date rain
0 2019-03-01 0.4
1 2019-03-02 0.0
2 2019-03-03 6.3
3 2019-03-04 10.2
4 2019-03-05 30.3

Then we can aggregate our Toggl data into two rows: days, and minutes worked for that day.

In [27]:
toggl_days = toggl.groupby('Start date')['Minutes'].sum().reset_index().rename(columns={"Start date": "date", "Minutes": "minutes"})

This operation isn't too intimidating — in English, this is telling pandas to compute the sum of minutes worked for each day, and .reset_index() fixes our indices (left-most bolded column), and we finish by renaming the columns.

In [28]:
toggl_days.head()
Out[28]:
date minutes
0 2019-03-30 96.150000
1 2019-03-31 82.050000
2 2019-04-01 185.583333
3 2019-04-02 94.050000
4 2019-04-03 184.066667

A nice result. We can now seamlessly join these tables and write them to a .csv:

In [29]:
df = pd.merge(toggl_days, raindf)
df.head()
Out[29]:
date minutes rain
0 2019-03-30 96.150000 0.0
1 2019-03-31 82.050000 7.7
2 2019-04-01 185.583333 0.0
3 2019-04-02 94.050000 0.0
4 2019-04-03 184.066667 0.0
In [30]:
df.to_csv("rain_raw.csv", index=False)

To gauge whether rain has an effect on minutes worked, we want to be able to classify each day as either 'rainy' or 'not rainy'. One method for classifying days based on precipitation is to use a threshold $\boldsymbol{\lambda} \in \{0.1,0.2,1,10,25\}$ such that:

$$ \verb|rain| = \begin{cases} \text{False} \quad &\text{if} \, \verb|rain| \lt \lambda \\ \text{True} \quad &\text{if} \, \verb|rain| \ge \lambda \\ \end{cases} $$

To choose $\lambda$, let's check out our data for rain:

In [31]:
df['rain'].describe()
Out[31]:
count    39.000000
mean      7.169231
std      14.187896
min       0.000000
25%       0.000000
50%       0.900000
75%       8.350000
max      75.800000
Name: rain, dtype: float64
In [32]:
alt.Chart(df).mark_bar().encode(
    alt.X("rain:Q", bin=alt.Bin(extent=[0,80], step=5)),
    alt.Y("count()")
)
Out[32]:

Given Hong Kong's humid nature alongside the median precipitation, $\lambda=1\,\text{mm}$ appears to be a good fit for the data. We can redefine our rain column to achieve the following data frame:

In [33]:
df.loc[df['rain'] >= 1, 'rain_day'] = "Yes"
df.loc[df['rain'] < 1, 'rain_day'] = "No"

df.drop(columns='rain', inplace=True)
df.rename(columns={"rain_day": "rain"}, inplace=True)
In [34]:
df.head()
Out[34]:
date minutes rain
0 2019-03-30 96.150000 No
1 2019-03-31 82.050000 Yes
2 2019-04-01 185.583333 No
3 2019-04-02 94.050000 No
4 2019-04-03 184.066667 No

Let's see how the data are split:

In [35]:
df['rain'].value_counts()
Out[35]:
No     20
Yes    19
Name: rain, dtype: int64

Which is as good of a split as we could've hoped for. Now for some preliminary visualization: let's see how rainy/non-rainy days are distributed against minutes:

In [36]:
alt.Chart(df).mark_bar(tooltip=alt.TooltipContent("encoding")).encode(
    alt.X("minutes:Q", bin=alt.Bin(extent=[25,500], step=25), title="Minutes spent on schoolwork"),
    alt.Y("count()", title="Number of entries"),
    alt.Color("rain:N", title="Rain",
              scale=alt.Scale(domain=["No", "Yes"], range=['#ffdfbf', 'steelblue']))
)
Out[36]:

By eye, it looks like rainy days have slightly higher minute counts. Let's compute the mean of both types of days:

In [37]:
df.groupby('rain')['minutes'].mean()
Out[37]:
rain
No     200.647500
Yes    216.942982
Name: minutes, dtype: float64

So the average task time is higher on average, but is the difference statistically significant? One way of answering this is to knock on classical statistics' door and ask for a p-value. We want to test the alternative hypothesis that rainy days, on average, have higher minute counts. So our null hypothesis will be that there is no difference between minutes worked on average on rainy and non-rainy days. So our set-up is as follows:

$$ H_0:\mu_R=\mu_N \\ \text{vs} \\ H_A:\mu_R\gt\mu_N$$

where $\mu_R,\mu_N$ are the true means of minutes worked for rainy and non-rainy days respectively. Let's write the data frame to a .csv and head into R to perform a simple t-test:

In [38]:
df.to_csv("rain_school.csv", index=False)
rain <- read.csv("rain_school.csv")
t.test(rain[rain$rain=="Yes",]$minutes, rain[rain$rain=="No",]$minutes, alternative='greater')
    Welch Two Sample t-test

data:  rain[rain$rain == "Yes", ]$minutes and rain[rain$rain == "No", ]$minutes
t = 0.48793, df = 35.485, p-value = 0.3143
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -40.11065       Inf
sample estimates:
mean of x mean of y 
 216.9430  200.6475 

So we obtain a p-value of $\approx0.3$, which is quite large. To interpret this, we can say that if we repeated the experiment for minutes worked on rainy/non-rainy days, and there really was no impact of rain on minutes worked, the probability of seeing more minutes worked for rainy days than non-rainy days to the extent we saw or to a greater extent just by chance is 30%.

While that is a mouthful, it tells us there is a relatively high probability of seeing a positive deviation in minutes worked just by chance if there was no difference in rainy/non-rainy days — so we fail to reject the null hypothesis. The mini-conclusion here is that more data are definitely needed.

Predicting minutes worked

Let's backtrack and look at the rest of the weather data:

In [39]:
hk_df.head()
Out[39]:
date mean_pressure max_temp mean_temp min_temp mean_dew humid cloud rain sunshine wind_dir wind_speed
0 2019-03-01 1016.1 22.4 20.8 19.6 18.8 89 91 0.4 0.4 060 30.6
1 2019-03-02 1012.7 23.9 21.5 19.9 19.2 87 84 0.0 5.0 070 23.3
2 2019-03-03 1011.3 23.5 21.5 20.0 19.1 87 85 6.3 4.2 010 8.0
3 2019-03-04 1013.7 22.6 20.9 19.3 17.7 82 83 10.2 3.2 080 26.1
4 2019-03-05 1012.1 26.7 22.2 17.7 20.0 88 89 30.3 2.0 070 22.7

There are an abundance of variables — let's just choose mean_temp, humid, cloud, rain, and sunshine as factors we think may have an impact on productivity.

In [40]:
weather = hk_df.iloc[:, [0, 3, 6, 7, 8, 9]]
weather.head()
Out[40]:
date mean_temp humid cloud rain sunshine
0 2019-03-01 20.8 89 91 0.4 0.4
1 2019-03-02 21.5 87 84 0.0 5.0
2 2019-03-03 21.5 87 85 6.3 4.2
3 2019-03-04 20.9 82 83 10.2 3.2
4 2019-03-05 22.2 88 89 30.3 2.0

Let's join this with our previous table containing the minutes, and write that to a .csv:

In [41]:
df = pd.merge(df.iloc[: , [0, 1]], weather)
df.to_csv("full_days.csv", index=False)
df.head()
Out[41]:
date minutes mean_temp humid cloud rain sunshine
0 2019-03-30 96.150000 23.1 86 81 0.0 0.6
1 2019-03-31 82.050000 21.4 85 93 7.7 0.1
2 2019-04-01 185.583333 20.3 81 91 0.0 0.2
3 2019-04-02 94.050000 20.7 76 89 0.0 3.6
4 2019-04-03 184.066667 22.8 80 85 0.0 7.0

If you were wondering, sunshine is the number of hours of bright sunshine hours, not total. Let's head over to R to try to build some models for predicting minutes. We can begin by just checking out the correlations between our variables:

> days <- read.csv("full_days.csv")
> cor(days[c(2:7)])
            minutes   mean_temp      humid       cloud        rain    sunshine
minutes    1.00000000  0.08525068  0.1479811 -0.07462029 -0.04984429  0.07714833
mean_temp  0.08525068  1.00000000 -0.3882086 -0.63293077 -0.23396924  0.70265663
humid      0.14798108 -0.38820860  1.0000000  0.56440892  0.53635898 -0.63297290
cloud     -0.07462029 -0.63293077  0.5644089  1.00000000  0.33323933 -0.90797842
rain      -0.04984429 -0.23396924  0.5363590  0.33323933  1.00000000 -0.38933099
sunshine   0.07714833  0.70265663 -0.6329729 -0.90797842 -0.38933099  1.00000000

We can see the weather associations we'd expect — temperature/cloud and sunshine, humidity and cloud, etc. But the correlations among minutes look pretty unpromising. Regardless, let's begin by partitioning our data with a 75/25 training/testing split:

set.seed(30734592) # for reproduceability
train.ind <- sample(seq_len(nrow(days)), size = floor(.75*nrow(days)))
days.train <- days[train.ind,]
days.test <- days[-train.ind,]

And fit a multiple regression equation by modeling minutes as a function of our weather variables:

> days.fit <- lm(minutes ~ mean_temp + humid + cloud + rain + sunshine, data=days.train)
> summary(days.fit)

Residuals:
    Min       1Q   Median       3Q      Max 
-117.588  -65.600   -2.683   69.162  151.200 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -139.4934   413.7243  -0.337    0.739
mean_temp    -13.0997     9.7955  -1.337    0.194
humid          7.0194     4.1107   1.708    0.101
cloud          0.2519     2.0980   0.120    0.905
rain          -1.7436     1.2900  -1.352    0.190
sunshine      10.9424    14.1435   0.774    0.447

Residual standard error: 86.59 on 23 degrees of freedom
Multiple R-squared:  0.1723,    Adjusted R-squared:  -0.007695 
F-statistic: 0.9572 on 5 and 23 DF,  p-value: 0.4639

Our p-values are pretty large across the board, but there are some hints a few variables may be associated with minutes. We can calculate the Root Mean Square Error (RMSE) by

$$ \text{RMSE} = \sqrt{ \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2} {n} }$$

where $ y_i $ are the true values for our targets, and $\hat{y}_i$ are the predicted values, and $n$ is our sample size. This can be done in R for our test dataset:

> sqrt(mean((days.test$minutes - predict(days.fit, days.test))^2))
[1] 160.763

Which is quite a large average difference. We can try a few other models — bagged regression trees are generally good for reducing variance. We can create one in R with the rpart package:

rain_bag <- bagging(
    formula = minutes ~ mean_temp + humid + cloud + rain + sunshine,
    data = days.train,
    nbagg = 100,
    coob = TRUE,
    control = rpart.control(minsplit=2, cp=0)
)
> sqrt(mean((days.test$minutes - predict(rain_bag, days.test))^2))
[1] 154.6983

And we achieve a new RMSE of 155, which is a slight improvement over 161. We can try one more model — a random forest, for fun. We can use the randomForest library in R to elegantly do this:

rain.rf <- randomForest(minutes ~ mean_temp + humid + cloud + rain + sunshine,
                        data=days.train,
                        importance=TRUE)
> sqrt(mean((days.test$minutes - predict(rain.rf, days.test))^2))
[1] 149.7549

And an RMSE of 150 is again an improvement over 155 and 161, but is still very high. To conclude, we can plot the predictions and compare models:

Models predicting on the test dataset
Models predicting on the test dataset

Conclusion

As is typical with analysis of small samples, more data are needed to conclusively suggest there is any difference in (my) productivity owing to the weather. Looking back at the linear model, perhaps an interesting route to explore is humidity due to its notable t-statistic and seemingly borderline association ( $p\approx0.1$). Of course, this may very well be solely due to chance (the more low-quality features we include, the more likely we find a non-existent correlation due to chance).

At the very least, I'm happy I had the opportunity to poke around in some of my old data and refresh my pandas/R/Altair skils, and conduct some surface-level exploratory analysis even if the results were inconclusive. As is sometimes the case with throwaway personal projects, the famous saying can be invoked — it's about the journey, not the destination.

Comments