DS 3000 - Summer 2020

DS Report

Identifying Patterns in Plant Growth

Yevhen Horban, Pranav Ahluwalia


Executive Summary:

This project is focused on identifying the patterns and predicting the number of roses produced in a specific greenhouse based on the conditions inside and outside of the greenhouse. Plants usually need nutritious soil, water, sunlight, and carbon dioxide to grow. While the greenhouse controls the constant supply of nutrition and water, other conditions are variable. The amount of how much roses are produced appears to have a sinusoidal pattern with a variable vertical shift, amplitude, and period that depend on weather conditions.



1. INTRODUCTION

Problem Statement

In our project, we analyze the effects of different environmental factors on the number of roses a specific greenhouse produces. This project assumes that if the environment were controlled, then the production of flowers would follow a sinusoidal pattern with no noise. Since the climate is not controlled, it is our objective to explore the relationship between its changes and the shift in the pattern of production.

Significance of the Problem

The nature of this particular business is such that the flowers are often sold before they are produced, and it is necessary to know ahead of time how much produce to sell. Also, at times when the peaks of the production curves of different flowers align the freezers that preserve flowers before they are sold run out of the room and the lifespan of the flowers that are left in the room temperature shortens. Tackling this problem matters because if a company knows the amount of product it is going to produce, it can plan for accommodating the storage and adjust prices if necessary to maximize its profit.

Questions/Hypothesis

Given the problem and its importance, as mentioned earlier, we set out to tackle the following questions:

  • Is there a linear relationship between the weather conditions and the number of flowers produced? If not, what is the relationship?
  • Is there an accumulative or delayed effect of the weather conditions on the dependent variable? What is it?
  • Is there a pattern in the number of flowers produced? What periods appear to be more stable or have less noise in the number of flowers produced? What parts of the natural cycle (amplitude, period, vertical shift) are affected by the weather?
  • Is there a relationship between the weather conditions and the deviation from the natural production cycle? Which features are the most influential on the deviations of the cycle?
  • Is there a linear or nonlinear regression model that can predict the number of flowers produced? Which one is more accurate?
  • How do the features correlate with each other? Can we reduce their number to make the prediction model more accurate?

Hypotheses:

  • Hypothesis about features:
    • Null: The quantity of the flowers produced and the selected weather features have no relationship
    • Alternative: The number of flowers produced and the selected weather features have a relationship
  • Hypothesis about machine learning algorithms:
    • Null: Linear regression IS the best predictor of the quantity of the flowers produced based on the weather features.
    • Alternative: Linear regression IS NOT the best predictor of the quantity of the flowers produced based on the weather features.
  • Hypothesis about additional features on regression models:
    • Null: Mathematically manipulated features WILL NOT change the accuracy of the models.
    • Alternative: Mathematically manipulated features WILL change the accuracy of the models.


2. METHOD

2.1. Data Acquisition

We obtained our time-series data from the company that owns the greenhouse and combined it with the historical weather data from Weather Underground to get 1036 by 18 data set. We are going to use the daily averages of the amount of sunlight, the number of minutes the lamps were turned on, the amount of CO2, the temperature inside and outside, dew point, humidity, and atmospheric pressure to predict the number of flowers produced on the given date using regression.

https://www.wunderground.com/history/monthly/ua/brovary/UKBB

In [1]:
import pandas as pd

url_org_data = 'https://raw.githubusercontent.com/h0rban/flower-harvest-prediction/master/data/redeagle.csv'
df = pd.read_csv(url_org_data)
df
Out[1]:
Date Rsum Lamps T_inside CO2 T_Max T_Avg T_Min DP_Max DP_Avg DP_Min H_Max H_Avg H_Min P_Max P_Avg P_Min Produced
0 7/31/17 2566 0 24.4 573 28.9 22.7 13.9 17.2 14.7 12.8 94 64.3 37 29.7 29.6 29.6 800
1 8/1/17 2182 0 27.0 650 33.9 25.4 17.2 21.1 17.3 13.9 94 65.0 31 29.8 29.7 29.7 716
2 8/2/17 2390 0 27.6 715 33.9 26.9 18.9 20.0 18.2 15.0 100 63.3 34 29.7 29.7 29.6 1125
3 8/3/17 2239 0 27.1 595 32.2 26.5 20.0 21.1 18.7 17.2 88 64.5 40 29.6 29.5 29.5 1280
4 8/4/17 2141 0 27.1 634 33.9 26.6 18.9 22.2 17.9 12.8 100 64.0 29 29.5 29.4 29.4 1971
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1032 5/28/20 1174 680 20.2 657 17.2 12.2 7.8 12.8 10.7 7.8 100 91.4 72 29.7 29.6 29.4 10697
1033 5/29/20 728 430 20.3 717 17.8 14.7 12.2 13.9 12.1 11.1 100 85.3 64 29.4 29.3 29.3 7234
1034 5/30/20 466 310 20.0 698 17.2 13.9 12.2 13.9 12.4 10.0 100 91.5 68 29.3 29.2 29.1 8939
1035 5/31/20 1338 250 20.0 564 17.8 13.4 10.0 11.1 9.7 7.8 94 79.3 59 29.3 29.3 29.2 6104
1036 6/1/20 1266 250 19.4 694 33.9 29.7 25.0 28.9 26.2 23.9 100 80.9 52 29.3 29.3 29.3 10040

1037 rows × 18 columns

2.2. Variables

Rsum is in lux
Pressure is in Hg
Lamps are in minutes
Humidity is in percent
Temperatures are in ºC
CO2 in parts per million

Independent Variables (feature variables)

inside of the greenhouse:
    - lux of sun
    - temperature
    - concentration of CO2
    - minutes the lamps were turned on

minimum, maximum, and average outside of the greenhouse:
    - temperature
    - dew point
    - humidity
    - pressure


Dependent Variables (target variable)

 quantity of flowers produced

Rationale

Our features are all the independent variables listed above. They are already continuous, numerical data points and do not require feature engineering with them. Our target variable is the number of flowers produced, which is also a continuous numerical value and a target variable for a regression model. However, we will try to modify both features and the target variable to reduce noise and explore possible relationships.

2.3. Data Analysis

Predictive Model

We will predict the number of flowers produced the following day, given the weather conditions. These predictors are significant because flowers are living creatures that respond to their environment. In addition to the features described above, we are going to mathematically manipulate the data to explore some of our questions and hypothesis.

A Supervised Learning Problem

Since will are seeking to obtain a continuous output variable, this is a supervised regression problem.

Machine Learning Algorithms to be Applied

We will be using Multiple Linear, Ridge, Lasso, kNN, Bayesian Ridge, KernelRidge, and Gaussian Process regression algorithms because it is not clear whether they will be accurate predictors of the number of flowers produced. We expect Bayesian Ridge, KernelRidge, and Gaussian Process regression algorithms work the best because the examples in the scikit learn documentation suggests that they are good at predicting sinusoidal data.


3. RESULTS

3.1. Data Wrangling

Simple data cleaning

Drop the rows with the missing values

In [2]:
df.dropna(inplace = True)

Process variables

Convert string representation of date to Date object

In [3]:
df['Date'] =  pd.to_datetime(df['Date'], format='%m/%d/%y')

Data Wrangling

Select features and target form the dataframe

In [4]:
features = df.drop(['Date', 'Produced'], axis = 1)
features
Out[4]:
Rsum Lamps T_inside CO2 T_Max T_Avg T_Min DP_Max DP_Avg DP_Min H_Max H_Avg H_Min P_Max P_Avg P_Min
0 2566 0 24.4 573 28.9 22.7 13.9 17.2 14.7 12.8 94 64.3 37 29.7 29.6 29.6
1 2182 0 27.0 650 33.9 25.4 17.2 21.1 17.3 13.9 94 65.0 31 29.8 29.7 29.7
2 2390 0 27.6 715 33.9 26.9 18.9 20.0 18.2 15.0 100 63.3 34 29.7 29.7 29.6
3 2239 0 27.1 595 32.2 26.5 20.0 21.1 18.7 17.2 88 64.5 40 29.6 29.5 29.5
4 2141 0 27.1 634 33.9 26.6 18.9 22.2 17.9 12.8 100 64.0 29 29.5 29.4 29.4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1032 1174 680 20.2 657 17.2 12.2 7.8 12.8 10.7 7.8 100 91.4 72 29.7 29.6 29.4
1033 728 430 20.3 717 17.8 14.7 12.2 13.9 12.1 11.1 100 85.3 64 29.4 29.3 29.3
1034 466 310 20.0 698 17.2 13.9 12.2 13.9 12.4 10.0 100 91.5 68 29.3 29.2 29.1
1035 1338 250 20.0 564 17.8 13.4 10.0 11.1 9.7 7.8 94 79.3 59 29.3 29.3 29.2
1036 1266 250 19.4 694 33.9 29.7 25.0 28.9 26.2 23.9 100 80.9 52 29.3 29.3 29.3

1037 rows × 16 columns

In [5]:
target = df['Produced']
target
Out[5]:
0         800
1         716
2        1125
3        1280
4        1971
        ...  
1032    10697
1033     7234
1034     8939
1035     6104
1036    10040
Name: Produced, Length: 1037, dtype: int64

Feature extraction

There might not be a linear relationship between features and the target variable, so we use a polynomial feature extraction technique to explore nonlinear relationships in our predictive models.

In [6]:
from sklearn.preprocessing import PolynomialFeatures

# returns a data frame that contains polynomials from 2 to the given degree of the given features
def features_to_poly(features, degrees):
    poly_feature_dfs = []
    for feature in features.columns:
        values = features[feature]
        poly = PolynomialFeatures(degree = degrees, include_bias = False) # we tried using interaction_only = False but it did not work for some reason
        feature_poly = poly.fit_transform(values.values.reshape(-1,1))
        feature_df = pd.DataFrame(feature_poly, columns = [feature + '^' + str(index) for index in range(1, degrees + 1)])
        poly_feature_dfs.append(feature_df.drop(feature + '^' + str(1), axis = 1))
    return pd.concat(poly_feature_dfs, axis = 1)


features_poly = features_to_poly(features, 10)
features_poly
Out[6]:
Rsum^2 Rsum^3 Rsum^4 Rsum^5 Rsum^6 Rsum^7 Rsum^8 Rsum^9 Rsum^10 Lamps^2 ... P_Avg^10 P_Min^2 P_Min^3 P_Min^4 P_Min^5 P_Min^6 P_Min^7 P_Min^8 P_Min^9 P_Min^10
0 6584356.0 1.689546e+10 4.335374e+13 1.112457e+17 2.854565e+20 7.324813e+23 1.879547e+27 4.822918e+30 1.237561e+34 0.0 ... 5.163178e+14 876.16 25934.336 767656.3456 2.272263e+07 6.725898e+08 1.990866e+10 5.892963e+11 1.744317e+13 5.163178e+14
1 4761124.0 1.038877e+10 2.266830e+13 4.946223e+16 1.079266e+20 2.354958e+23 5.138519e+26 1.121225e+30 2.446513e+33 0.0 ... 5.340286e+14 882.09 26198.073 778082.7681 2.310906e+07 6.863390e+08 2.038427e+10 6.054128e+11 1.798076e+13 5.340286e+14
2 5712100.0 1.365192e+10 3.262809e+13 7.798113e+16 1.863749e+20 4.454360e+23 1.064592e+27 2.544375e+30 6.081056e+33 0.0 ... 5.340286e+14 876.16 25934.336 767656.3456 2.272263e+07 6.725898e+08 1.990866e+10 5.892963e+11 1.744317e+13 5.163178e+14
3 5013121.0 1.122438e+10 2.513138e+13 5.626916e+16 1.259867e+20 2.820841e+23 6.315864e+26 1.414122e+30 3.166219e+33 0.0 ... 4.991375e+14 870.25 25672.375 757335.0625 2.234138e+07 6.590708e+08 1.944259e+10 5.735564e+11 1.691991e+13 4.991375e+14
4 4583881.0 9.814089e+09 2.101197e+13 4.498662e+16 9.631635e+19 2.062133e+23 4.415027e+26 9.452572e+29 2.023796e+33 0.0 ... 4.824733e+14 864.36 25412.184 747118.2096 2.196528e+07 6.457791e+08 1.898591e+10 5.581856e+11 1.641066e+13 4.824733e+14
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1032 1378276.0 1.618096e+09 1.899645e+12 2.230183e+15 2.618235e+18 3.073808e+21 3.608650e+24 4.236555e+27 4.973716e+30 462400.0 ... 5.163178e+14 864.36 25412.184 747118.2096 2.196528e+07 6.457791e+08 1.898591e+10 5.581856e+11 1.641066e+13 4.824733e+14
1033 529984.0 3.858284e+08 2.808830e+11 2.044829e+14 1.488635e+17 1.083726e+20 7.889528e+22 5.743577e+25 4.181324e+28 184900.0 ... 4.663116e+14 858.49 25153.757 737005.0801 2.159425e+07 6.327115e+08 1.853845e+10 5.431765e+11 1.591507e+13 4.663116e+14
1034 217156.0 1.011947e+08 4.715673e+10 2.197504e+13 1.024037e+16 4.772011e+18 2.223757e+21 1.036271e+24 4.829022e+26 96100.0 ... 4.506387e+14 846.81 24642.171 717087.1761 2.086724e+07 6.072366e+08 1.767058e+10 5.142140e+11 1.496363e+13 4.354416e+14
1035 1790244.0 2.395346e+09 3.204974e+12 4.288255e+15 5.737685e+18 7.677022e+21 1.027186e+25 1.374374e+28 1.838913e+31 62500.0 ... 4.663116e+14 852.64 24897.088 726994.9696 2.122825e+07 6.198650e+08 1.810006e+10 5.285217e+11 1.543283e+13 4.506387e+14
1036 1602756.0 2.029089e+09 2.568827e+12 3.252135e+15 4.117203e+18 5.212378e+21 6.598871e+24 8.354171e+27 1.057638e+31 62500.0 ... 4.663116e+14 858.49 25153.757 737005.0801 2.159425e+07 6.327115e+08 1.853845e+10 5.431765e+11 1.591507e+13 4.663116e+14

1037 rows × 144 columns

One of the questions we are exploring in this project is whether the weather features have a delayed or an accumulating effect on the target variable. Hence, we will use a feature extraction techniques to generate additional features to explore hidden relationships in our data.

In [9]:
import statistics as st

# returns the given values after applying a given function to the given number of moving values
# this allows to use moving average, range and other functions by ignoring the missing vaules on the edges
def n_moving_func(values, n, func):
    out = []
    length = len(values)
    for index in range(0, length):
        s = index - n // 2
        e = index + n // 2 + 1
        if s < 0:
            s = 0
        if e >= length:
            e = length 
        out.append(func(values[s : e]))
    return out


# returns the given values after applying a given function to the given number of previous values
# this allows to sum, or find the range and other functions by ignoring the missing values in the begining
def n_prev_func(values, n, func):
    out = []
    for index in range(0, len(values)):
        s = index - n + 1
        if s < 0:
            s = 0
            temp = values[s : index + 1]
            temp.append(values[index])
            out.append(func(temp))
        else:
            out.append(func(values[s : index + 1]))
    return out


# returns the derivative of given values
# the first value will be the same as the second
def df_dt(values):
    out = [values[1] - values[0]]
    for index in range(1, len(values)):
        out.append(values[index] - values[index - 1])
    return out


# returns linked relatives of the given values
# the first value and dividing by zero will result in a 1
def link_r(values):
    out = [1]
    for index in range(1, len(values)):
        out.append(1) if values[index - 1] == 0 else out.append(values[index] / values[index - 1])
    return out


# returns the range of the given values
def get_range(values):
    return max(values) - min(values)


# returns a dataframe with the result of applying every given function to every column in the given dataframe
def apply_func_to_df(dataframe, funcs, include_original = False):
    new_df = pd.DataFrame()
    for feature in dataframe.columns:
        if include_original:
            new_df[feature] = dataframe[feature]
        for name, func in funcs.items():
            new_df[feature + '_' + name] = pd.Series(func(dataframe[feature]))
    return new_df


# functions that are going to applied to the features
dd_and_lr = {'dd': df_dt, 'lr' : link_r}
funcs = {'3_moving_mean': lambda values: n_moving_func(values, 3, st.mean),
         '5_moving_mean': lambda values: n_moving_func(values, 5, st.mean),
         '7_moving_mean': lambda values: n_moving_func(values, 7, st.mean),
         '3_moving_sd': lambda values: n_moving_func(values, 3, st.stdev),
         '5_moving_sd': lambda values: n_moving_func(values, 5, st.stdev),
         '7_moving_sd': lambda values: n_moving_func(values, 7, st.stdev),
         '3_prev_sum': lambda values: n_prev_func(values.tolist(), 3, sum),
         '7_prev_sum': lambda values: n_prev_func(values.tolist(), 7, sum),
         '21_prev_sum': lambda values: n_prev_func(values.tolist(), 21, sum),
         '3_prev_sd': lambda values: n_prev_func(values.tolist(), 3, st.stdev),
         '7_prev_sd': lambda values: n_prev_func(values.tolist(), 7, st.stdev),
         '21_prev_sd': lambda values: n_prev_func(values.tolist(), 21, st.stdev),
         '3_prev_range': lambda values: n_prev_func(values.tolist(), 3, get_range),
         '7_prev_range': lambda values: n_prev_func(values.tolist(), 7, get_range),
         '21_prev_range': lambda values: n_prev_func(values.tolist(), 21, get_range)}

# this will take a while
features_additional =  apply_func_to_df(apply_func_to_df(features, funcs), dd_and_lr, include_original = True)
features_additional
Out[9]:
Rsum_3_moving_mean Rsum_3_moving_mean_dd Rsum_3_moving_mean_lr Rsum_5_moving_mean Rsum_5_moving_mean_dd Rsum_5_moving_mean_lr Rsum_7_moving_mean Rsum_7_moving_mean_dd Rsum_7_moving_mean_lr Rsum_3_moving_sd ... P_Min_21_prev_sd_lr P_Min_3_prev_range P_Min_3_prev_range_dd P_Min_3_prev_range_lr P_Min_7_prev_range P_Min_7_prev_range_dd P_Min_7_prev_range_lr P_Min_21_prev_range P_Min_21_prev_range_dd P_Min_21_prev_range_lr
0 2374.000000 5.333333 1.000000 2379.333333 -35.083333 1.000000 2344.250000 -40.650000 1.000000 271.529004 ... 1.000000 0.0 1.000000e-01 1.000000 0.0 0.1 1.000000 0.0 0.1 1.0
1 2379.333333 5.333333 1.002247 2344.250000 -35.083333 0.985255 2303.600000 -40.650000 0.982660 192.222094 ... 1.000000 0.1 1.000000e-01 1.000000 0.1 0.1 1.000000 0.1 0.1 1.0
2 2270.333333 -109.000000 0.954189 2303.600000 -40.650000 0.982660 2263.500000 -40.100000 0.982592 107.481781 ... 0.866025 0.1 0.000000e+00 1.000000 0.1 0.0 1.000000 0.1 0.0 1.0
3 2256.666667 -13.666667 0.993980 2203.000000 -100.600000 0.956329 2178.857143 -84.642857 0.962605 125.436571 ... 1.673320 0.2 1.000000e-01 2.000000 0.2 0.1 2.000000 0.2 0.1 2.0
4 2147.666667 -109.000000 0.951699 2100.800000 -102.200000 0.953609 1863.714286 -315.142857 0.855363 88.189191 ... 1.447494 0.2 3.552714e-15 1.000000 0.3 0.1 1.500000 0.3 0.1 1.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1032 1081.333333 -153.333333 0.875810 979.600000 12.400000 1.012821 948.571429 -86.000000 0.916874 317.315826 ... 0.952095 0.3 2.000000e-01 3.000000 0.3 0.1 1.500000 0.7 0.0 1.0
1033 789.333333 -292.000000 0.729963 1009.600000 30.000000 1.030625 1071.714286 123.142857 1.129819 357.962754 ... 1.017321 0.4 1.000000e-01 1.333333 0.4 0.1 1.333333 0.7 0.0 1.0
1034 844.000000 54.666667 1.069257 994.400000 -15.200000 0.984945 1052.333333 -19.380952 0.981916 447.423737 ... 1.092212 0.3 -1.000000e-01 0.750000 0.6 0.2 1.500000 0.7 0.0 1.0
1035 1023.333333 179.333333 1.212480 949.500000 -44.900000 0.954847 994.400000 -57.933333 0.944948 484.005510 ... 1.035749 0.2 -1.000000e-01 0.666667 0.6 0.0 1.000000 0.7 0.0 1.0
1036 1302.000000 278.666667 1.272313 1023.333333 73.833333 1.077760 949.500000 -44.900000 0.954847 50.911688 ... 0.937762 0.2 0.000000e+00 1.000000 0.6 0.0 1.000000 0.7 0.0 1.0

1037 rows × 720 columns

The following line combines the resulting data frames of feature extraction into a one. Also, add a day number column to two of the data frames (it turned out to be significant to the models)

In [10]:
features_combined = pd.concat([features, apply_func_to_df(features, dd_and_lr), features_poly, features_additional], axis = 1)

features['DayNum'] = df.index
features_combined['DayNum'] = df.index

features_combined
Out[10]:
Rsum Lamps T_inside CO2 T_Max T_Avg T_Min DP_Max DP_Avg DP_Min ... P_Min_3_prev_range P_Min_3_prev_range_dd P_Min_3_prev_range_lr P_Min_7_prev_range P_Min_7_prev_range_dd P_Min_7_prev_range_lr P_Min_21_prev_range P_Min_21_prev_range_dd P_Min_21_prev_range_lr DayNum
0 2566 0 24.4 573 28.9 22.7 13.9 17.2 14.7 12.8 ... 0.0 1.000000e-01 1.000000 0.0 0.1 1.000000 0.0 0.1 1.0 0
1 2182 0 27.0 650 33.9 25.4 17.2 21.1 17.3 13.9 ... 0.1 1.000000e-01 1.000000 0.1 0.1 1.000000 0.1 0.1 1.0 1
2 2390 0 27.6 715 33.9 26.9 18.9 20.0 18.2 15.0 ... 0.1 0.000000e+00 1.000000 0.1 0.0 1.000000 0.1 0.0 1.0 2
3 2239 0 27.1 595 32.2 26.5 20.0 21.1 18.7 17.2 ... 0.2 1.000000e-01 2.000000 0.2 0.1 2.000000 0.2 0.1 2.0 3
4 2141 0 27.1 634 33.9 26.6 18.9 22.2 17.9 12.8 ... 0.2 3.552714e-15 1.000000 0.3 0.1 1.500000 0.3 0.1 1.5 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1032 1174 680 20.2 657 17.2 12.2 7.8 12.8 10.7 7.8 ... 0.3 2.000000e-01 3.000000 0.3 0.1 1.500000 0.7 0.0 1.0 1032
1033 728 430 20.3 717 17.8 14.7 12.2 13.9 12.1 11.1 ... 0.4 1.000000e-01 1.333333 0.4 0.1 1.333333 0.7 0.0 1.0 1033
1034 466 310 20.0 698 17.2 13.9 12.2 13.9 12.4 10.0 ... 0.3 -1.000000e-01 0.750000 0.6 0.2 1.500000 0.7 0.0 1.0 1034
1035 1338 250 20.0 564 17.8 13.4 10.0 11.1 9.7 7.8 ... 0.2 -1.000000e-01 0.666667 0.6 0.0 1.000000 0.7 0.0 1.0 1035
1036 1266 250 19.4 694 33.9 29.7 25.0 28.9 26.2 23.9 ... 0.2 0.000000e+00 1.000000 0.6 0.0 1.000000 0.7 0.0 1.0 1036

1037 rows × 913 columns

Target Extraction

The data for the target variable may be too noisy for the prediction algorithms to identify the pattern. Hence, we will explore the possibilities of smoothening the target variable.

Please see these links for scatterplots:

  • Target vs Time
  • Moving Average Target vs Time
  • Average Previous Target vs Time
In [11]:
import plotly.express as px

target = df['Produced']
px.line(df, x = 'Date', y = target)\
.update_layout(title = "Target vs Time",
               title_x = 0.5,
               yaxis_title = "Quantity").show()
px.line(df, x = 'Date', y = n_moving_func(target.tolist(), 7, st.mean))\
.update_layout(title = "Moving average of Target (n = 7) vs Time",
               title_x = 0.5,
               yaxis_title = "Quantity").show()
px.line(df, x = 'Date', y = n_prev_func(target.tolist(), 7, st.mean))\
.update_layout(title = "Average of previous values of Target (n = 7) vs Time",
               title_x = 0.5,
               yaxis_title = "Quantity").show()
target_smooth = pd.Series(n_prev_func(target.tolist(), 7, st.mean))
target_smooth
Out[11]:
0        800.000000
1        744.000000
2        941.500000
3       1040.200000
4       1310.500000
           ...     
1032    8531.428571
1033    8254.142857
1034    8335.142857
1035    8120.000000
1036    8425.714286
Length: 1037, dtype: float64

The following cell attempts to mathematically fit a curve that would represent the patterns in the data.
Refer to the following links for graphs:

  • Sin Fit vs Time
  • Logistical Sin Fit vs Time
In [12]:
import numpy as np
import plotly.graph_objects as go
from scipy.optimize import leastsq
from sklearn.metrics import r2_score

t = df.index 
data = target_smooth # any target can be used here
date = df['Date']


def fit_sin(x, params):
    # f(x) = a * sin(bx + c) + d
    assert len(params) == 4
    return params[0] * np.sin(params[1] * x + params[2]) + params[3]


def fit_sin_log(x, params):
    # g(x) = a / (1 + e^(bx)) + d
    # s(x) = g(x) * f(x)
    assert len(params) == 4 + 3
    return (params[0] / (1 + np.exp(params[1] * x)) + params[2]) * fit_sin(x, params[3:])


# returns the optimized parameters and the data of the model for given x 
# by optimizing fitting and optimizing given function to the given data
def optimize(x, params, func, data, display_graph = False):
    data_guess = func(x, params)
    optimize_func = lambda var: (func(x, var) - data)**2
    params_opt = leastsq(optimize_func, params)[0]
    data_opt = func(x, params_opt)
    if display_graph == True:
        go.Figure()\
        .add_trace(go.Scatter(x = date, 
                              y = data, 
                              mode = 'lines', 
                              name = 'target', 
                              opacity = 0.7))\
        .add_trace(go.Scatter(x = date, 
                              y = data_guess, 
                              mode = 'lines', 
                              name = 'first guess', 
                              opacity = 0.7))\
        .add_trace(go.Scatter(x = date, 
                              y = data_opt, 
                              mode = 'lines', 
                              name = 'after fitting', 
                              opacity = 0.7))\
        .update_layout(title = "Smooth Target vs Time",
                   title_x = 0.5,
                   yaxis_title = "Quantity").show()
    return params_opt, data_opt


# guess parameters for sin and optimize
params = [st.mean(data) / 4,
          2 * np.pi / (558 - 507), 
          545, 
          st.mean(data)]
results_sin = optimize(t, params, fit_sin, data, display_graph = True)


# guess parameters for logistical function and optimize
params = [2, -0.012, -1]
params.extend(results_sin[0])
results_sl = optimize(t, params, fit_sin_log, data, display_graph = True)

str_summary = 'Optimized parameters P for model: \nsl(x) = (P_1 / (1 + e^(P_2 * x)) + P_3) * (P_4 * sin(P_5 * x + P_6) + P_7)'
for i in range(0, len(results_sl[0])):
      str_summary += '\n\t\tP_' +  str(i + 1) + ' = ' + str(round(results_sl[0][i], 4))
str_summary += '\n\nVariance explained by the model: ' + str(round(r2_score(target_smooth, results_sl[1]),4))
print(str_summary)
Optimized parameters P for model: 
sl(x) = (P_1 / (1 + e^(P_2 * x)) + P_3) * (P_4 * sin(P_5 * x + P_6) + P_7)
		P_1 = 1.4215
		P_2 = -0.008
		P_3 = -0.5313
		P_4 = 1889.5082
		P_5 = 0.13
		P_6 = 541.4696
		P_7 = 9588.2094

Variance explained by the model: 0.4821

Assuming that the target variable follows this exact model, we can identify another target variable, which is the error between the fitted model and the actual data. If our prediction models find the relationship between the features and error of this model, we will be able to predict the actual target. Se the following link for the graph:

  • Fitted Error
In [13]:
target_dif = data - results_sl[1]
px.line(df, x = 'Date', y = target_dif)\
.update_layout(title = "Fitted Model Error vs Time",
               title_x = 0.5,
               yaxis_title = "Error (Quantity)").show()

target_dif
Out[13]:
0      -1224.866961
1      -1329.988384
2      -1176.548214
3      -1116.387394
4       -878.735424
           ...     
1032     273.488812
1033     208.875637
1034     494.342039
1035     472.009489
1036     955.624657
Length: 1037, dtype: float64

Preprocessing and Training/Test Selection and Feature Selection

We created an object Split to keep track of our splitted, scaled, and selected sets better. The object is initialized with features target and a name. It has functions that scales X sets selects features.

In [14]:
import math
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# class that represents a training/testing split of features and targets
class Split:
    
    # constructor
    def __init__(self, features, target, name):
        self.features, self.target, self.name = features, target, name
        self.num_features = len(features.columns)
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(features, target, random_state = 3000)
        self.selected_features = []
    
    # scale X sets
    def scale_test_sets(self):
        # create scaler object based on X_train
        scaler = MinMaxScaler().fit(self.X_train)
        # scale X train and test sets
        self.X_train_scaled = scaler.transform(self.X_train)
        self.X_test_scaled = scaler.transform(self.X_test)
        # assing selected sets to scales X train and test sets
        self.X_train_selected = self.X_train_scaled
        self.X_test_selected = self.X_test_scaled
        
    # define selected features
    def select_features(self, num_features = None):
        # defualt number of features to select is the square root of the current number of features + 1
        if num_features == None:
            num_features = math.floor(math.sqrt(self.num_features)) + 1
        # instantiate selection object
        feature_selection = RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = num_features)
        # fit scaled X train set to y train set
        feature_selection.fit(self.X_train_scaled, self.y_train)
        # transform previous scaled sets and assign them to selected
        self.X_train_selected = feature_selection.transform(self.X_train_scaled)
        self.X_test_selected = feature_selection.transform(self.X_test_scaled)

        # print selected feature names
        out = 'Selected features for ' + self.name + ':\n'
        for feature in [feature for feature, status in zip(self.features, feature_selection.get_support()) if status == True]:
            out += '\t' + feature + '\n'
            self.selected_features.append(feature)
        print(out + '\n')
        

# list of different Split objects
splits = [Split(features, target, 'Features and target'),
          Split(features, target_smooth, 'Features and smoothed target'),
          Split(features, target_dif, 'Features and Fitted Error'),
          Split(features_combined, target, 'Features Combined and target'),
          Split(features_combined, target_smooth, 'Features Combined and smoothed target'),
          Split(features_combined, target_dif, 'Features Combined and Fitted Error')]

# scale training sets and select features for each Split
for split in splits:
    split.scale_test_sets()
    split.select_features()
Selected features for Features and target:
	Rsum
	T_inside
	CO2
	T_Avg
	DayNum


Selected features for Features and smoothed target:
	Rsum
	T_inside
	CO2
	H_Min
	DayNum


Selected features for Features and Fitted Error:
	Rsum
	Lamps
	T_inside
	H_Avg
	DayNum


Selected features for Features Combined and target:
	H_Avg_lr
	CO2^10
	Rsum_3_moving_sd
	Rsum_3_moving_sd_dd
	Rsum_7_moving_sd_dd
	Rsum_21_prev_sd
	T_inside_7_moving_sd
	T_inside_21_prev_sum
	CO2_7_moving_sd_lr
	CO2_7_prev_sum
	CO2_21_prev_sum
	DP_Avg_21_prev_sum_lr
	DP_Avg_7_prev_range
	DP_Min_7_moving_mean
	DP_Min_21_prev_sd_dd
	H_Max_7_moving_mean
	H_Max_7_prev_sd_lr
	H_Avg_3_moving_sd_dd
	H_Avg_7_moving_sd
	H_Avg_7_prev_sum
	H_Avg_3_prev_sd
	H_Min_21_prev_sum
	H_Min_21_prev_sd
	H_Min_7_prev_range
	P_Max_7_moving_sd
	P_Max_3_prev_sum
	P_Max_21_prev_sum
	P_Max_21_prev_range
	P_Avg_7_prev_sum
	P_Min_21_prev_sd_dd
	DayNum


Selected features for Features Combined and smoothed target:
	Rsum_7_prev_sum
	Rsum_7_prev_sd
	Rsum_21_prev_range
	Lamps_21_prev_sd
	Lamps_7_prev_range
	Lamps_21_prev_range
	T_inside_21_prev_sum
	T_inside_7_prev_sd_lr
	CO2_7_prev_sum
	CO2_21_prev_sum
	CO2_21_prev_range
	T_Avg_7_prev_sum
	T_Avg_7_prev_sum_lr
	T_Avg_21_prev_sum_dd
	T_Avg_7_prev_sd
	DP_Max_7_moving_sd_dd
	DP_Avg_21_prev_range
	DP_Min_5_moving_mean
	DP_Min_7_prev_sum
	DP_Min_21_prev_sum
	DP_Min_7_prev_range
	H_Max_7_prev_range
	H_Avg_21_prev_sum
	H_Min_5_moving_mean
	H_Min_7_moving_sd
	P_Max_3_moving_mean
	P_Max_21_prev_sd
	P_Avg_7_moving_mean
	P_Avg_21_prev_sum
	P_Min_21_prev_sd
	DayNum


Selected features for Features Combined and Fitted Error:
	Rsum_7_moving_sd
	Rsum_7_moving_sd_dd
	Lamps_5_moving_mean_dd
	Lamps_21_prev_sum_lr
	Lamps_7_prev_range
	Lamps_21_prev_range
	T_inside_7_prev_sum
	T_inside_21_prev_sum
	CO2_7_prev_sd
	CO2_21_prev_sd
	T_Max_7_prev_sum
	T_Avg_21_prev_range
	DP_Max_7_prev_sum_dd
	DP_Avg_21_prev_sd_dd
	DP_Min_7_moving_sd
	DP_Min_7_prev_sum
	DP_Min_21_prev_range
	H_Max_7_moving_sd
	H_Max_7_prev_sum
	H_Max_21_prev_sum
	H_Max_21_prev_sd
	H_Max_21_prev_sd_dd
	H_Avg_7_moving_mean
	H_Avg_7_moving_sd
	H_Min_21_prev_sum
	P_Max_5_moving_sd_dd
	P_Max_7_prev_sum
	P_Min_7_moving_mean
	P_Min_21_prev_sum
	P_Min_7_prev_sd
	DayNum


3.2. Data Exploration

The following data exploration is only going to cover the data we identified as significant. First, we will explore the relationship between the selected features in one of our splits and then explore the correlations between features and the target variable.

In [17]:
import matplotlib.pyplot as plt
import seaborn as sns

# enable in-line rendering
%matplotlib notebook

# construct the data frame of the selected features for fifth split and add a produced column
heat_df = features_combined[splits[4].selected_features]
heat_df['Produced'] = target_smooth
plt.rcParams.update({'font.size': 7})
plt.figure(figsize=(10, 10))
sns.heatmap(heat_df.corr(), cmap='coolwarm', square = True)
plt.savefig("heatmap.png")
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Visualization Explanation - Heatmap

Please see the heatmap image here:

  • HeatMap

This heatmap shows the correlations between the selected features in split "Features Combined and smoothed target" and smooth target. As can be seen on the graph, there is no strong linear correlation between the smooth target and the selected features. This is somewhat reasonable given the nonlinear nature of our data. More importantly, we suspect that our target variable is going to be affected by a combination of these selected features. The only strong correlation appears to be between similar selected features for obvious reasons. On the graph, they can be seen as rectangles of the same color.

In [15]:
for feature in ['Rsum', 'Lamps', 'T_inside', 'H_Avg']:
    px.scatter(df,
               x = feature,
               y = target_dif,
               template = 'none',
               color = target_dif,
               opacity = 0.8,
               trendline = 'ols')\
    .update_traces(marker={"size":7})\
    .update_layout(title = "Error vs " + feature,
                   title_x = 0.5,
                   yaxis_title = "Fitted Error (Quantity)")\
    .show()

Visualization Explanation - Scatter Plots

Please see the scatter plots here:

  • Error vs Rsum
  • Error vs Lamps
  • Error vs T_inside
  • Error vs H_avg

In Figures 1-4, we plot the selected features of the third split against the fitted mathematical model's error and display the ordinary least squares line. On all of the above graphs besides T_inside, we observe a zero correlation, which makes sense due to the sinusoidal nature of our data. The points are evenly distributed on top and the bottom of the zero line and, hence, cancel each other out. The only significant positive correlation is present on the plot of Error vs. T_inside. If you look at the earlier graph called 'Fitted Model Error vs. Time,' you can observe that most positive errors occur in the summer, while most negative errors occur in the winter. Similarly, the temperature tends to be higher in the summer and colder in the winter. Hence, these two variables are very likely to correlate.

3.3. Model Construction

Based on the observations made from the visualizations above, we suspect there will be a relationship between the target variables and the selected mathematically summarized weather conditions.

In [18]:
# model dictionary
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor

estimators = {'Linear': LinearRegression(),
              'Ridge': Ridge(max_iter = 100000), 
              'Lasso': Lasso(max_iter = 100000), 
              'kNN': KNeighborsRegressor(),
              'BayesianRidge': BayesianRidge(n_iter = 100000),
              'KernelRidge': KernelRidge(),
              'GaussianProcess': GaussianProcessRegressor()}

estimators.values()
Out[18]:
dict_values([LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=100000,
      normalize=False, random_state=None, solver='auto', tol=0.001), Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=100000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False), KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform'), BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
              fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=100000,
              normalize=False, tol=0.001, verbose=False), KernelRidge(alpha=1, coef0=1, degree=3, gamma=None, kernel='linear',
            kernel_params=None), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True, kernel=None,
                         n_restarts_optimizer=0, normalize_y=False,
                         optimizer='fmin_l_bfgs_b', random_state=None)])
In [19]:
def fit_regressors(estimators, split):
    
    X_train, X_test, y_train, y_test = split.X_train_selected, split.X_test_selected, split.y_train, split.y_test
    
    # select a classifier and change parameters
    for name, estimator in estimators.items():
        
        # fit a model to the training data
        estimator.fit(X = X_train, y = y_train)
        
        # compute R^2 scores
        train_score = round(r2_score(y_train, estimator.predict(X_train)), 4)
        test_score = round(r2_score(y_test, estimator.predict(X_test)), 4)
        
        #print accuracy
        print(name, ':\n\tR^2 value for training set: ', train_score, 
              '\n\tR^2 value for testing set : ', test_score, '\n', sep = '')


for split in splits:
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + split.name.upper()  + '\n')
    fit_regressors(estimators, split)
–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES AND TARGET

Linear:
	R^2 value for training set: 0.236
	R^2 value for testing set : 0.2104

Ridge:
	R^2 value for training set: 0.2352
	R^2 value for testing set : 0.208

Lasso:
	R^2 value for training set: 0.236
	R^2 value for testing set : 0.2106

kNN:
	R^2 value for training set: 0.5269
	R^2 value for testing set : 0.2031

BayesianRidge:
	R^2 value for training set: 0.2358
	R^2 value for testing set : 0.2092

KernelRidge:
	R^2 value for training set: 0.2323
	R^2 value for testing set : 0.2019

GaussianProcess:
	R^2 value for training set: 0.7732
	R^2 value for testing set : -7.4677

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES AND SMOOTHED TARGET

Linear:
	R^2 value for training set: 0.2879
	R^2 value for testing set : 0.313

Ridge:
	R^2 value for training set: 0.2874
	R^2 value for testing set : 0.3097

Lasso:
	R^2 value for training set: 0.2879
	R^2 value for testing set : 0.3121

kNN:
	R^2 value for training set: 0.6644
	R^2 value for testing set : 0.5381

BayesianRidge:
	R^2 value for training set: 0.2878
	R^2 value for testing set : 0.3113

KernelRidge:
	R^2 value for training set: 0.284
	R^2 value for testing set : 0.3165

GaussianProcess:
	R^2 value for training set: 0.88
	R^2 value for testing set : -3.9679

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES AND FITTED ERROR

Linear:
	R^2 value for training set: 0.0435
	R^2 value for testing set : 0.0824

Ridge:
	R^2 value for training set: 0.0433
	R^2 value for testing set : 0.0784

Lasso:
	R^2 value for training set: 0.0434
	R^2 value for testing set : 0.0811

kNN:
	R^2 value for training set: 0.5203
	R^2 value for testing set : 0.2887

BayesianRidge:
	R^2 value for training set: 0.042
	R^2 value for testing set : 0.0702

KernelRidge:
	R^2 value for training set: 0.031
	R^2 value for testing set : 0.0567

GaussianProcess:
	R^2 value for training set: 0.8157
	R^2 value for testing set : -21.3513

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND TARGET

Linear:
	R^2 value for training set: 0.3059
	R^2 value for testing set : 0.2438

Ridge:
	R^2 value for training set: 0.2989
	R^2 value for testing set : 0.2257

Lasso:
	R^2 value for training set: 0.305
	R^2 value for testing set : 0.2388

kNN:
	R^2 value for training set: 0.6663
	R^2 value for testing set : 0.4208

BayesianRidge:
	R^2 value for training set: 0.2964
	R^2 value for testing set : 0.2235

KernelRidge:
	R^2 value for training set: 0.2989
	R^2 value for testing set : 0.2257

GaussianProcess:
	R^2 value for training set: 1.0
	R^2 value for testing set : -0.4826

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND SMOOTHED TARGET

Linear:
	R^2 value for training set: 0.4611
	R^2 value for testing set : 0.522

Ridge:
	R^2 value for training set: 0.4486
	R^2 value for testing set : 0.4718

Lasso:
	R^2 value for training set: 0.4574
	R^2 value for testing set : 0.509

kNN:
	R^2 value for training set: 0.9383
	R^2 value for testing set : 0.9103

BayesianRidge:
	R^2 value for training set: 0.4516
	R^2 value for testing set : 0.4805

KernelRidge:
	R^2 value for training set: 0.4486
	R^2 value for testing set : 0.4727

GaussianProcess:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.9374

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND FITTED ERROR

Linear:
	R^2 value for training set: 0.3161
	R^2 value for testing set : 0.2717

Ridge:
	R^2 value for training set: 0.265
	R^2 value for testing set : 0.2217

Lasso:
	R^2 value for training set: 0.309
	R^2 value for testing set : 0.2709

kNN:
	R^2 value for training set: 0.8855
	R^2 value for testing set : 0.7317

BayesianRidge:
	R^2 value for training set: 0.2999
	R^2 value for testing set : 0.2677

KernelRidge:
	R^2 value for training set: 0.2653
	R^2 value for testing set : 0.2217

GaussianProcess:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.8141

3.4. Model Evaluation

We discarded innefective models and decided to evaluate those with a passable performance using the following

Predicting Smoothed Target

  • Linear for features combined with smoothed target
  • KNN for features combined with smoothed target
  • Gaussian for features combined with smoothed target

Predicting Fitted Error

  • Linear for features combined and fitted error
  • KNN for features combined with smoothed target
  • Guassiann for features combiend with smoothed target

Evaluation Metrics using Cross Validation

  • R^2
  • Mean Accuracy
  • Standard Deviation
In [20]:
f_combined_smoothed = splits[4]
f_combined_smoothed_err = splits[5]

Cross validation of highest performing models on selected features for smoothed target and error

In [21]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits = 10, random_state = 3000, shuffle = True)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

evalEstimators = {'Linear': LinearRegression(),
              'kNN': KNeighborsRegressor(),
              'GaussianProcess': GaussianProcessRegressor()}

#Dictionary of model names and scores
scores_target_smoothed = {}
scores_error_smoothed = {}
for name, model in evalEstimators.items():
    scores_target_smoothed[name] = cross_val_score(estimator=model,X=f_combined_smoothed.X_train_selected,
                                                   y=f_combined_smoothed.y_train,cv = 5)
    print(name + " Target_Smoothed done")
    scores_error_smoothed[name] = cross_val_score(estimator=model,X=f_combined_smoothed_err.X_train_selected,
                                                  y=f_combined_smoothed_err.y_train,cv = 5)
    print(name + " Error Fitted done")
Linear Target_Smoothed done
Linear Error Fitted done
kNN Target_Smoothed done
kNN Error Fitted done
GaussianProcess Target_Smoothed done
GaussianProcess Error Fitted done

Prediction Accuracy

  • KNN and Gaussian appear to have the best predictive performance on the validation tests
  • The same is true when predicting Error CV mean scores
  • Low standard deviations indicate that the afforementioned models are reliably accurate
In [22]:
print("Target Smoothed CV scores")
for name, score in scores_target_smoothed.items():
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + name  + '\n')
    print("Mean Score:",score.mean())
    print("Standard Deviation",f"{score.std():.2%}")
print("\n\n")
print("\t\tError CV Mean Scores")
for name, score in scores_target_smoothed.items():
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + name  + '\n')
    print("Mean Score:",score.mean())
    print("Standard Deviation",f"{score.std():.2%}")
Target Smoothed CV scores
–––––––––––––––––––––––––––––––––––––––––––––––––

		Linear

Mean Score: 0.4060851503976419
Standard Deviation 4.94%
–––––––––––––––––––––––––––––––––––––––––––––––––

		kNN

Mean Score: 0.8445501092238917
Standard Deviation 2.61%
–––––––––––––––––––––––––––––––––––––––––––––––––

		GaussianProcess

Mean Score: 0.9041986090947965
Standard Deviation 2.35%



		Error CV Mean Scores
–––––––––––––––––––––––––––––––––––––––––––––––––

		Linear

Mean Score: 0.4060851503976419
Standard Deviation 4.94%
–––––––––––––––––––––––––––––––––––––––––––––––––

		kNN

Mean Score: 0.8445501092238917
Standard Deviation 2.61%
–––––––––––––––––––––––––––––––––––––––––––––––––

		GaussianProcess

Mean Score: 0.9041986090947965
Standard Deviation 2.35%

3.5. Model Optimization

Rationale For Tuning

After performing a cross-validation test to evaluate our three best models, we can then select from that set, two models: kNN and Gaussian Process for hyper-parameter tuning. Since these models have a high degree of accuracy, we want to reduce the risk of over-fitting these models to retain predictive value upon future data for flower production.

Tuning

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

param_grid1 = {'alpha':[.001, .01, .1, 1, 10, 100]}
param_grid2 = {"n_neighbors":[1, 2, 3, 5, 10], "metric": ["euclidean", "manhattan", "minkowski"]}
best_estimators = {
              'kNN': KNeighborsRegressor(),
              'GaussianProcess': GaussianProcessRegressor()}

def optimize(x_train, y_train):
    out = {}
    for name, model in best_estimators.items():
        if name == 'kNN':
            grid_search = GridSearchCV(model, param_grid2, cv=5)
        if name == 'GaussianProcess':
            grid_search = GridSearchCV(model, param_grid1, cv=5)
        grid_search.fit(X=x_train, y=y_train)
        print(name)
        print("\tBest cross-validation score: ", grid_search.best_score_)
        print("\tBest parameters: ", grid_search.best_params_)
        print()
        out[name] = grid_search.best_params_
    return out

Results on Training Set For Parameter Tuning

In [24]:
print("Model Optimization for Selected features on Smoothed Target data \n__________________________________________________ \n")
best_params = optimize(f_combined_smoothed.X_train_selected, f_combined_smoothed.y_train)
print('\n')
print("Model Optimization for selected features for error prediction \n__________________________________________________ \n")
best_params_error = optimize(f_combined_smoothed_err.X_train_selected, f_combined_smoothed_err.y_train)
print('\n')
Model Optimization for Selected features on Smoothed Target data 
__________________________________________________ 

kNN
	Best cross-validation score:  0.9447797514224517
	Best parameters:  {'metric': 'manhattan', 'n_neighbors': 2}

GaussianProcess
	Best cross-validation score:  0.91222188459969
	Best parameters:  {'alpha': 0.001}



Model Optimization for selected features for error prediction 
__________________________________________________ 

kNN
	Best cross-validation score:  0.9025645999268768
	Best parameters:  {'metric': 'manhattan', 'n_neighbors': 1}

GaussianProcess
	Best cross-validation score:  0.8186171169981172
	Best parameters:  {'alpha': 0.001}



3.6. Model Testing

Using our findings from the parameter tuning phase, we will test the accuracy and performance of our models on our testing data set

In [25]:
param_grid1 = {'alpha':[.001, .01, .1, 1, 10, 100]}
param_grid2 = {"n_neighbors":[1, 2, 3, 5, 10], "metric": ["euclidean", "manhattan", "minkowski"]}
optimal_estimators = {
              'kNN Smoothed Target': KNeighborsRegressor(n_neighbors = 1, metric = 'manhattan'),
              'GaussianProcess Smoothed Target': GaussianProcessRegressor(alpha = .001),
                'kNN Error' : KNeighborsRegressor(n_neighbors = 2),
                'GaussianProcess Error' :GaussianProcessRegressor(alpha = .001) }

# selected on smooth and error
for split in splits[4:]:
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + split.name.upper()  + '\n')
    fit_regressors(optimal_estimators, split)
–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND SMOOTHED TARGET

kNN Smoothed Target:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.9693

GaussianProcess Smoothed Target:
	R^2 value for training set: 0.9983
	R^2 value for testing set : 0.9497

kNN Error:
	R^2 value for training set: 0.9872
	R^2 value for testing set : 0.966

GaussianProcess Error:
	R^2 value for training set: 0.9983
	R^2 value for testing set : 0.9497

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND FITTED ERROR

kNN Smoothed Target:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.9413

GaussianProcess Smoothed Target:
	R^2 value for training set: 0.9966
	R^2 value for testing set : 0.833

kNN Error:
	R^2 value for training set: 0.9749
	R^2 value for testing set : 0.7965

GaussianProcess Error:
	R^2 value for training set: 0.9966
	R^2 value for testing set : 0.833


4. DISCUSSION

Summary of Analysis

In order to perform our analysis, we split our weather conditions and greenhouse data into feature variables and target variable, the number of flowers produced. Then we created several functions like sum, range, or standard deviation that were applied to every feature and recorder in a 'features combined' data frame along with making our features polynomial to explore nonlinear relationships. We also smoothened our target variable and tried to come up with a mathematical model describing its distribution. As a result, we obtained two additional targets (smoothened amount of flowers produced and the error between the mathematical model and the smoothened quantity produced) to explore in our data analysis. We did not visualize all of our data because of the sinusoidal nature of our target variable but focused on the selected relationships among our features and features with the number of flowers produced.

To run our regression models, we created six splits of data with a different combination of features and target variables (features, features combined, target, smooth target, error), scaled the appropriate sets, and selected the most significant features. We evaluated seven different supervised machine learning models - subcategory regression - using all six of the training and testing sets described previously. f We then evaluated the regression models using a cross validation split in order to adaquetely determine which models were the most promising and warranted hyper-parameter tuning. Our results dicated that kNN and Gaussian Process were our best models from the cross mean validation scores which were: 84% and 90% respectively.

Afterwards, we applied a GridSearchCV to fine tune the following parameters:

Smoothed Target Value

  • Gaussian Process: Alpha Value = .001
  • kNN: n_neighbors = 2 and Metric = Manhattan

Error Prediction

  • Gaussian Process: Alpha Value = .001
  • kNN: n_neighbors = 1 and Metric = Manhattan

Interpretation of Findings

Algorithms Compared

We compared Multiple Linear, Ridge, Lasso, kNN, BayesianRidge, KernelRidge, and GaussianProcess regression algorithms.

Algorithms with Best Performance The algorithms with the best performance were kNN and Gaussian process, and Linear Regression

Smoothed Target Value r^2
* Multiple Linear: 0.40
* kNN: .84
* Gaussian Process: .90

Error Prediction r^2
* Multiple Linear: 0.40
* kNN: .84
* Gaussian Process: .90

Evaluation after Optimization

After tuning and optimization we saw a moderate improvement in the results. After optimization, we were unable to improve the linear regression model to a sufficient standard, and as such, it was discarded from our final analysis.

The accuracy of the models on our testing data improved to the following values.

Smoothed Target Value r^2
* kNN: .96
* Gaussian Process: .94

Error Prediction r^2
* kNN: .98
* Gaussian Process: .94



Algorithms for Use in Predictive Model

We expected Bayesian Ridge, KernelRidge, and Gaussian Process regression algorithms to work best because the examples in the scikit learn documentations suggests that they are good at predicting sinusoidal data. However, upon review of our results we found that Gaussian Process was the only model from our original predictions that produced sufficient results. Knn and Gaussian models form the best predictive models for our data set.

Our Original Research Questions

Is there a linear relationship between the weather conditions and the number of flowers produced? If not, what is the relationship?

In our data visualization and analysis, we found that there is no direct linear relationship between the weather conditions and the number of flowers produced. Moreover, the linear regression model proved to one of the least accurate when it came to predicting the number of flowers produced or the deviation from the assumed cycle.

Is there an accumulative or delayed effect of the weather conditions on the dependent variable? What is it?

When we applied the feature selection technique, we discovered that accumulative, delayed, and other mathematically modified features had a more significant effect on the dependent variable than the current weather conditions.

Is there a pattern in the number of flowers produced? What periods appear to be more stable or have less noise in the number of flowers produced? What parts of the natural cycle (amplitude, period, vertical shift) are affected by the weather?

It is evident from our visualizations and attempts to mathematically describe the pattern that the number of flowers produced follows a sinusoidal pattern.

Is there a relationship between the weather conditions and the deviation from the natural production cycle? Which features are the most influential on the deviations of the cycle?

We were able to construct an accurate model with the deviation from the cycle as a target variable, which means that there has to be a relationship. The list of significant features can be found in the data analysis section, but we are not sure about the order of significance. Coming up with twenty or so selected features has proven to be very computationally expensive for the combined dataset.

Is there a linear or nonlinear regression model that can predict the number of flowers produced? Which one is more accurate?

Yes. kNN (non linear) and Gaussian Process were both predictive of our data. However, neither of those models are linear.

How do the features correlate with each other? Can we reduce their number to make the prediction model more accurate?

In the heat map visualization of our data, we found that the most significant features do not correlate with each other. However, we did not check all the features, and the ones we have already been selected as significant. Reducing the number of features in the combined data set proved to produce more accurate prediction models and less accurate predictive models for an original data set of features.

Reflection on Our Findings

In our original problem, we wanted to find out if our selected (strongly correlated) wheather conditions were predictive of the sinusoidal pattern originally observed in our data. After testing a variety of models including several linear models and kNN and Gaussian Process we found that linear models were not predictive of our data but kNN and Gaussian Process models were. Upon tuning the parameters of our models and employing our feature selection, we were able to generate two highly accurate models that performed well under cross validation tests.

Conclusion

We expected better results from the Kernel Ridge regression as scikit learn documentation says it is often highly predictive of sinusoidal data. Perhaps we could have done a better job fine tuning the prameters of that specific model or perhaps a Kernel ridge regression warrants a more specific feature selection process. We were surprised that kNN model could be tuned to such a high degree of accuracy. The performance of the Gaussian Process was not surprising as it is often highly predictive of cyclical data.

Future work of this project could involve combining the error prediction model with the smoothed target prediction. Additionally we believe this model has effective real world utility for greenhouse companies and possibly for other crops that are farmed. The real world implications of this project could perhaps be employed by certain hedge funds or investment firms to predict yields of various crops (other than flowers) in order to effectively price commodity futures.


CONTRIBUTIONS

Jake:

  • problem description and specification
  • data acquisition
  • feature and target extraction
  • preprocessing and feature selection
  • data exploration

Pranav

  • Model Fitting and Selection
  • Model Prediction
  • Model Optimization
  • Model Evaluation
  • Model Testing
In [ ]: