DS 3000 - Summer 2020

DS Report

Identifying Patterns in Plant Growth

Yevhen Horban, Pranav Ahluwalia

Executive Summary:¶

This project is focused on identifying the patterns and predicting the number of roses produced in a specific greenhouse based on the conditions inside and outside of the greenhouse. Plants usually need nutritious soil, water, sunlight, and carbon dioxide to grow. While the greenhouse controls the constant supply of nutrition and water, other conditions are variable. The amount of how much roses are produced appears to have a sinusoidal pattern with a variable vertical shift, amplitude, and period that depend on weather conditions.

1. INTRODUCTION¶

Problem Statement

In our project, we analyze the effects of different environmental factors on the number of roses a specific greenhouse produces. This project assumes that if the environment were controlled, then the production of flowers would follow a sinusoidal pattern with no noise. Since the climate is not controlled, it is our objective to explore the relationship between its changes and the shift in the pattern of production.

Significance of the Problem

The nature of this particular business is such that the flowers are often sold before they are produced, and it is necessary to know ahead of time how much produce to sell. Also, at times when the peaks of the production curves of different flowers align the freezers that preserve flowers before they are sold run out of the room and the lifespan of the flowers that are left in the room temperature shortens. Tackling this problem matters because if a company knows the amount of product it is going to produce, it can plan for accommodating the storage and adjust prices if necessary to maximize its profit.

Questions/Hypothesis

Given the problem and its importance, as mentioned earlier, we set out to tackle the following questions:

Is there a linear relationship between the weather conditions and the number of flowers produced? If not, what is the relationship?
Is there an accumulative or delayed effect of the weather conditions on the dependent variable? What is it?
Is there a pattern in the number of flowers produced? What periods appear to be more stable or have less noise in the number of flowers produced? What parts of the natural cycle (amplitude, period, vertical shift) are affected by the weather?
Is there a relationship between the weather conditions and the deviation from the natural production cycle? Which features are the most influential on the deviations of the cycle?
Is there a linear or nonlinear regression model that can predict the number of flowers produced? Which one is more accurate?
How do the features correlate with each other? Can we reduce their number to make the prediction model more accurate?

Hypotheses:

Hypothesis about features:
- Null: The quantity of the flowers produced and the selected weather features have no relationship
- Alternative: The number of flowers produced and the selected weather features have a relationship
Hypothesis about machine learning algorithms:
- Null: Linear regression IS the best predictor of the quantity of the flowers produced based on the weather features.
- Alternative: Linear regression IS NOT the best predictor of the quantity of the flowers produced based on the weather features.
Hypothesis about additional features on regression models:
- Null: Mathematically manipulated features WILL NOT change the accuracy of the models.
- Alternative: Mathematically manipulated features WILL change the accuracy of the models.

2. METHOD¶

2.1. Data Acquisition¶

We obtained our time-series data from the company that owns the greenhouse and combined it with the historical weather data from Weather Underground to get 1036 by 18 data set. We are going to use the daily averages of the amount of sunlight, the number of minutes the lamps were turned on, the amount of CO2, the temperature inside and outside, dew point, humidity, and atmospheric pressure to predict the number of flowers produced on the given date using regression.

https://www.wunderground.com/history/monthly/ua/brovary/UKBB

import pandas as pd

url_org_data = 'https://raw.githubusercontent.com/h0rban/flower-harvest-prediction/master/data/redeagle.csv'
df = pd.read_csv(url_org_data)
df

2.2. Variables¶

Rsum is in lux
Pressure is in Hg
Lamps are in minutes
Humidity is in percent
Temperatures are in ºC
CO2 in parts per million

Independent Variables (feature variables)

inside of the greenhouse:
    - lux of sun
    - temperature
    - concentration of CO2
    - minutes the lamps were turned on

minimum, maximum, and average outside of the greenhouse:
    - temperature
    - dew point
    - humidity
    - pressure

Dependent Variables (target variable)

 quantity of flowers produced

Rationale

Our features are all the independent variables listed above. They are already continuous, numerical data points and do not require feature engineering with them. Our target variable is the number of flowers produced, which is also a continuous numerical value and a target variable for a regression model. However, we will try to modify both features and the target variable to reduce noise and explore possible relationships.

2.3. Data Analysis¶

Predictive Model

We will predict the number of flowers produced the following day, given the weather conditions. These predictors are significant because flowers are living creatures that respond to their environment. In addition to the features described above, we are going to mathematically manipulate the data to explore some of our questions and hypothesis.

A Supervised Learning Problem

Since will are seeking to obtain a continuous output variable, this is a supervised regression problem.

Machine Learning Algorithms to be Applied

We will be using Multiple Linear, Ridge, Lasso, kNN, Bayesian Ridge, KernelRidge, and Gaussian Process regression algorithms because it is not clear whether they will be accurate predictors of the number of flowers produced. We expect Bayesian Ridge, KernelRidge, and Gaussian Process regression algorithms work the best because the examples in the scikit learn documentation suggests that they are good at predicting sinusoidal data.

3. RESULTS¶

3.1. Data Wrangling¶

Simple data cleaning¶

Drop the rows with the missing values

df.dropna(inplace = True)

Process variables¶

Convert string representation of date to Date object

df['Date'] =  pd.to_datetime(df['Date'], format='%m/%d/%y')

Data Wrangling¶

Select features and target form the dataframe

features = df.drop(['Date', 'Produced'], axis = 1)
features

target = df['Produced']
target

0         800
1         716
2        1125
3        1280
4        1971
        ...  
1032    10697
1033     7234
1034     8939
1035     6104
1036    10040
Name: Produced, Length: 1037, dtype: int64

Feature extraction¶

There might not be a linear relationship between features and the target variable, so we use a polynomial feature extraction technique to explore nonlinear relationships in our predictive models.

from sklearn.preprocessing import PolynomialFeatures

# returns a data frame that contains polynomials from 2 to the given degree of the given features
def features_to_poly(features, degrees):
    poly_feature_dfs = []
    for feature in features.columns:
        values = features[feature]
        poly = PolynomialFeatures(degree = degrees, include_bias = False) # we tried using interaction_only = False but it did not work for some reason
        feature_poly = poly.fit_transform(values.values.reshape(-1,1))
        feature_df = pd.DataFrame(feature_poly, columns = [feature + '^' + str(index) for index in range(1, degrees + 1)])
        poly_feature_dfs.append(feature_df.drop(feature + '^' + str(1), axis = 1))
    return pd.concat(poly_feature_dfs, axis = 1)


features_poly = features_to_poly(features, 10)
features_poly

One of the questions we are exploring in this project is whether the weather features have a delayed or an accumulating effect on the target variable. Hence, we will use a feature extraction techniques to generate additional features to explore hidden relationships in our data.

import statistics as st

# returns the given values after applying a given function to the given number of moving values
# this allows to use moving average, range and other functions by ignoring the missing vaules on the edges
def n_moving_func(values, n, func):
    out = []
    length = len(values)
    for index in range(0, length):
        s = index - n // 2
        e = index + n // 2 + 1
        if s < 0:
            s = 0
        if e >= length:
            e = length 
        out.append(func(values[s : e]))
    return out


# returns the given values after applying a given function to the given number of previous values
# this allows to sum, or find the range and other functions by ignoring the missing values in the begining
def n_prev_func(values, n, func):
    out = []
    for index in range(0, len(values)):
        s = index - n + 1
        if s < 0:
            s = 0
            temp = values[s : index + 1]
            temp.append(values[index])
            out.append(func(temp))
        else:
            out.append(func(values[s : index + 1]))
    return out


# returns the derivative of given values
# the first value will be the same as the second
def df_dt(values):
    out = [values[1] - values[0]]
    for index in range(1, len(values)):
        out.append(values[index] - values[index - 1])
    return out


# returns linked relatives of the given values
# the first value and dividing by zero will result in a 1
def link_r(values):
    out = [1]
    for index in range(1, len(values)):
        out.append(1) if values[index - 1] == 0 else out.append(values[index] / values[index - 1])
    return out


# returns the range of the given values
def get_range(values):
    return max(values) - min(values)


# returns a dataframe with the result of applying every given function to every column in the given dataframe
def apply_func_to_df(dataframe, funcs, include_original = False):
    new_df = pd.DataFrame()
    for feature in dataframe.columns:
        if include_original:
            new_df[feature] = dataframe[feature]
        for name, func in funcs.items():
            new_df[feature + '_' + name] = pd.Series(func(dataframe[feature]))
    return new_df


# functions that are going to applied to the features
dd_and_lr = {'dd': df_dt, 'lr' : link_r}
funcs = {'3_moving_mean': lambda values: n_moving_func(values, 3, st.mean),
         '5_moving_mean': lambda values: n_moving_func(values, 5, st.mean),
         '7_moving_mean': lambda values: n_moving_func(values, 7, st.mean),
         '3_moving_sd': lambda values: n_moving_func(values, 3, st.stdev),
         '5_moving_sd': lambda values: n_moving_func(values, 5, st.stdev),
         '7_moving_sd': lambda values: n_moving_func(values, 7, st.stdev),
         '3_prev_sum': lambda values: n_prev_func(values.tolist(), 3, sum),
         '7_prev_sum': lambda values: n_prev_func(values.tolist(), 7, sum),
         '21_prev_sum': lambda values: n_prev_func(values.tolist(), 21, sum),
         '3_prev_sd': lambda values: n_prev_func(values.tolist(), 3, st.stdev),
         '7_prev_sd': lambda values: n_prev_func(values.tolist(), 7, st.stdev),
         '21_prev_sd': lambda values: n_prev_func(values.tolist(), 21, st.stdev),
         '3_prev_range': lambda values: n_prev_func(values.tolist(), 3, get_range),
         '7_prev_range': lambda values: n_prev_func(values.tolist(), 7, get_range),
         '21_prev_range': lambda values: n_prev_func(values.tolist(), 21, get_range)}

# this will take a while
features_additional =  apply_func_to_df(apply_func_to_df(features, funcs), dd_and_lr, include_original = True)
features_additional

The following line combines the resulting data frames of feature extraction into a one. Also, add a day number column to two of the data frames (it turned out to be significant to the models)

features_combined = pd.concat([features, apply_func_to_df(features, dd_and_lr), features_poly, features_additional], axis = 1)

features['DayNum'] = df.index
features_combined['DayNum'] = df.index

features_combined

Target Extraction¶

The data for the target variable may be too noisy for the prediction algorithms to identify the pattern. Hence, we will explore the possibilities of smoothening the target variable.

Please see these links for scatterplots:

import plotly.express as px

target = df['Produced']
px.line(df, x = 'Date', y = target)\
.update_layout(title = "Target vs Time",
               title_x = 0.5,
               yaxis_title = "Quantity").show()
px.line(df, x = 'Date', y = n_moving_func(target.tolist(), 7, st.mean))\
.update_layout(title = "Moving average of Target (n = 7) vs Time",
               title_x = 0.5,
               yaxis_title = "Quantity").show()
px.line(df, x = 'Date', y = n_prev_func(target.tolist(), 7, st.mean))\
.update_layout(title = "Average of previous values of Target (n = 7) vs Time",
               title_x = 0.5,
               yaxis_title = "Quantity").show()
target_smooth = pd.Series(n_prev_func(target.tolist(), 7, st.mean))
target_smooth

0        800.000000
1        744.000000
2        941.500000
3       1040.200000
4       1310.500000
           ...     
1032    8531.428571
1033    8254.142857
1034    8335.142857
1035    8120.000000
1036    8425.714286
Length: 1037, dtype: float64

The following cell attempts to mathematically fit a curve that would represent the patterns in the data.
Refer to the following links for graphs:

import numpy as np
import plotly.graph_objects as go
from scipy.optimize import leastsq
from sklearn.metrics import r2_score

t = df.index 
data = target_smooth # any target can be used here
date = df['Date']


def fit_sin(x, params):
    # f(x) = a * sin(bx + c) + d
    assert len(params) == 4
    return params[0] * np.sin(params[1] * x + params[2]) + params[3]


def fit_sin_log(x, params):
    # g(x) = a / (1 + e^(bx)) + d
    # s(x) = g(x) * f(x)
    assert len(params) == 4 + 3
    return (params[0] / (1 + np.exp(params[1] * x)) + params[2]) * fit_sin(x, params[3:])


# returns the optimized parameters and the data of the model for given x 
# by optimizing fitting and optimizing given function to the given data
def optimize(x, params, func, data, display_graph = False):
    data_guess = func(x, params)
    optimize_func = lambda var: (func(x, var) - data)**2
    params_opt = leastsq(optimize_func, params)[0]
    data_opt = func(x, params_opt)
    if display_graph == True:
        go.Figure()\
        .add_trace(go.Scatter(x = date, 
                              y = data, 
                              mode = 'lines', 
                              name = 'target', 
                              opacity = 0.7))\
        .add_trace(go.Scatter(x = date, 
                              y = data_guess, 
                              mode = 'lines', 
                              name = 'first guess', 
                              opacity = 0.7))\
        .add_trace(go.Scatter(x = date, 
                              y = data_opt, 
                              mode = 'lines', 
                              name = 'after fitting', 
                              opacity = 0.7))\
        .update_layout(title = "Smooth Target vs Time",
                   title_x = 0.5,
                   yaxis_title = "Quantity").show()
    return params_opt, data_opt


# guess parameters for sin and optimize
params = [st.mean(data) / 4,
          2 * np.pi / (558 - 507), 
          545, 
          st.mean(data)]
results_sin = optimize(t, params, fit_sin, data, display_graph = True)


# guess parameters for logistical function and optimize
params = [2, -0.012, -1]
params.extend(results_sin[0])
results_sl = optimize(t, params, fit_sin_log, data, display_graph = True)

str_summary = 'Optimized parameters P for model: \nsl(x) = (P_1 / (1 + e^(P_2 * x)) + P_3) * (P_4 * sin(P_5 * x + P_6) + P_7)'
for i in range(0, len(results_sl[0])):
      str_summary += '\n\t\tP_' +  str(i + 1) + ' = ' + str(round(results_sl[0][i], 4))
str_summary += '\n\nVariance explained by the model: ' + str(round(r2_score(target_smooth, results_sl[1]),4))
print(str_summary)

Optimized parameters P for model: 
sl(x) = (P_1 / (1 + e^(P_2 * x)) + P_3) * (P_4 * sin(P_5 * x + P_6) + P_7)
		P_1 = 1.4215
		P_2 = -0.008
		P_3 = -0.5313
		P_4 = 1889.5082
		P_5 = 0.13
		P_6 = 541.4696
		P_7 = 9588.2094

Variance explained by the model: 0.4821

Assuming that the target variable follows this exact model, we can identify another target variable, which is the error between the fitted model and the actual data. If our prediction models find the relationship between the features and error of this model, we will be able to predict the actual target. Se the following link for the graph:

target_dif = data - results_sl[1]
px.line(df, x = 'Date', y = target_dif)\
.update_layout(title = "Fitted Model Error vs Time",
               title_x = 0.5,
               yaxis_title = "Error (Quantity)").show()

target_dif

0      -1224.866961
1      -1329.988384
2      -1176.548214
3      -1116.387394
4       -878.735424
           ...     
1032     273.488812
1033     208.875637
1034     494.342039
1035     472.009489
1036     955.624657
Length: 1037, dtype: float64

Preprocessing and Training/Test Selection and Feature Selection¶

We created an object Split to keep track of our splitted, scaled, and selected sets better. The object is initialized with features target and a name. It has functions that scales X sets selects features.

import math
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# class that represents a training/testing split of features and targets
class Split:
    
    # constructor
    def __init__(self, features, target, name):
        self.features, self.target, self.name = features, target, name
        self.num_features = len(features.columns)
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(features, target, random_state = 3000)
        self.selected_features = []
    
    # scale X sets
    def scale_test_sets(self):
        # create scaler object based on X_train
        scaler = MinMaxScaler().fit(self.X_train)
        # scale X train and test sets
        self.X_train_scaled = scaler.transform(self.X_train)
        self.X_test_scaled = scaler.transform(self.X_test)
        # assing selected sets to scales X train and test sets
        self.X_train_selected = self.X_train_scaled
        self.X_test_selected = self.X_test_scaled
        
    # define selected features
    def select_features(self, num_features = None):
        # defualt number of features to select is the square root of the current number of features + 1
        if num_features == None:
            num_features = math.floor(math.sqrt(self.num_features)) + 1
        # instantiate selection object
        feature_selection = RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = num_features)
        # fit scaled X train set to y train set
        feature_selection.fit(self.X_train_scaled, self.y_train)
        # transform previous scaled sets and assign them to selected
        self.X_train_selected = feature_selection.transform(self.X_train_scaled)
        self.X_test_selected = feature_selection.transform(self.X_test_scaled)

        # print selected feature names
        out = 'Selected features for ' + self.name + ':\n'
        for feature in [feature for feature, status in zip(self.features, feature_selection.get_support()) if status == True]:
            out += '\t' + feature + '\n'
            self.selected_features.append(feature)
        print(out + '\n')
        

# list of different Split objects
splits = [Split(features, target, 'Features and target'),
          Split(features, target_smooth, 'Features and smoothed target'),
          Split(features, target_dif, 'Features and Fitted Error'),
          Split(features_combined, target, 'Features Combined and target'),
          Split(features_combined, target_smooth, 'Features Combined and smoothed target'),
          Split(features_combined, target_dif, 'Features Combined and Fitted Error')]

# scale training sets and select features for each Split
for split in splits:
    split.scale_test_sets()
    split.select_features()

Selected features for Features and target:
	Rsum
	T_inside
	CO2
	T_Avg
	DayNum


Selected features for Features and smoothed target:
	Rsum
	T_inside
	CO2
	H_Min
	DayNum


Selected features for Features and Fitted Error:
	Rsum
	Lamps
	T_inside
	H_Avg
	DayNum


Selected features for Features Combined and target:
	H_Avg_lr
	CO2^10
	Rsum_3_moving_sd
	Rsum_3_moving_sd_dd
	Rsum_7_moving_sd_dd
	Rsum_21_prev_sd
	T_inside_7_moving_sd
	T_inside_21_prev_sum
	CO2_7_moving_sd_lr
	CO2_7_prev_sum
	CO2_21_prev_sum
	DP_Avg_21_prev_sum_lr
	DP_Avg_7_prev_range
	DP_Min_7_moving_mean
	DP_Min_21_prev_sd_dd
	H_Max_7_moving_mean
	H_Max_7_prev_sd_lr
	H_Avg_3_moving_sd_dd
	H_Avg_7_moving_sd
	H_Avg_7_prev_sum
	H_Avg_3_prev_sd
	H_Min_21_prev_sum
	H_Min_21_prev_sd
	H_Min_7_prev_range
	P_Max_7_moving_sd
	P_Max_3_prev_sum
	P_Max_21_prev_sum
	P_Max_21_prev_range
	P_Avg_7_prev_sum
	P_Min_21_prev_sd_dd
	DayNum


Selected features for Features Combined and smoothed target:
	Rsum_7_prev_sum
	Rsum_7_prev_sd
	Rsum_21_prev_range
	Lamps_21_prev_sd
	Lamps_7_prev_range
	Lamps_21_prev_range
	T_inside_21_prev_sum
	T_inside_7_prev_sd_lr
	CO2_7_prev_sum
	CO2_21_prev_sum
	CO2_21_prev_range
	T_Avg_7_prev_sum
	T_Avg_7_prev_sum_lr
	T_Avg_21_prev_sum_dd
	T_Avg_7_prev_sd
	DP_Max_7_moving_sd_dd
	DP_Avg_21_prev_range
	DP_Min_5_moving_mean
	DP_Min_7_prev_sum
	DP_Min_21_prev_sum
	DP_Min_7_prev_range
	H_Max_7_prev_range
	H_Avg_21_prev_sum
	H_Min_5_moving_mean
	H_Min_7_moving_sd
	P_Max_3_moving_mean
	P_Max_21_prev_sd
	P_Avg_7_moving_mean
	P_Avg_21_prev_sum
	P_Min_21_prev_sd
	DayNum


Selected features for Features Combined and Fitted Error:
	Rsum_7_moving_sd
	Rsum_7_moving_sd_dd
	Lamps_5_moving_mean_dd
	Lamps_21_prev_sum_lr
	Lamps_7_prev_range
	Lamps_21_prev_range
	T_inside_7_prev_sum
	T_inside_21_prev_sum
	CO2_7_prev_sd
	CO2_21_prev_sd
	T_Max_7_prev_sum
	T_Avg_21_prev_range
	DP_Max_7_prev_sum_dd
	DP_Avg_21_prev_sd_dd
	DP_Min_7_moving_sd
	DP_Min_7_prev_sum
	DP_Min_21_prev_range
	H_Max_7_moving_sd
	H_Max_7_prev_sum
	H_Max_21_prev_sum
	H_Max_21_prev_sd
	H_Max_21_prev_sd_dd
	H_Avg_7_moving_mean
	H_Avg_7_moving_sd
	H_Min_21_prev_sum
	P_Max_5_moving_sd_dd
	P_Max_7_prev_sum
	P_Min_7_moving_mean
	P_Min_21_prev_sum
	P_Min_7_prev_sd
	DayNum

3.2. Data Exploration¶

The following data exploration is only going to cover the data we identified as significant. First, we will explore the relationship between the selected features in one of our splits and then explore the correlations between features and the target variable.

import matplotlib.pyplot as plt
import seaborn as sns

# enable in-line rendering
%matplotlib notebook

# construct the data frame of the selected features for fifth split and add a produced column
heat_df = features_combined[splits[4].selected_features]
heat_df['Produced'] = target_smooth
plt.rcParams.update({'font.size': 7})
plt.figure(figsize=(10, 10))
sns.heatmap(heat_df.corr(), cmap='coolwarm', square = True)
plt.savefig("heatmap.png")

/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Visualization Explanation - Heatmap¶

Please see the heatmap image here:

This heatmap shows the correlations between the selected features in split "Features Combined and smoothed target" and smooth target. As can be seen on the graph, there is no strong linear correlation between the smooth target and the selected features. This is somewhat reasonable given the nonlinear nature of our data. More importantly, we suspect that our target variable is going to be affected by a combination of these selected features. The only strong correlation appears to be between similar selected features for obvious reasons. On the graph, they can be seen as rectangles of the same color.

for feature in ['Rsum', 'Lamps', 'T_inside', 'H_Avg']:
    px.scatter(df,
               x = feature,
               y = target_dif,
               template = 'none',
               color = target_dif,
               opacity = 0.8,
               trendline = 'ols')\
    .update_traces(marker={"size":7})\
    .update_layout(title = "Error vs " + feature,
                   title_x = 0.5,
                   yaxis_title = "Fitted Error (Quantity)")\
    .show()

Visualization Explanation - Scatter Plots¶

Please see the scatter plots here:

In Figures 1-4, we plot the selected features of the third split against the fitted mathematical model's error and display the ordinary least squares line. On all of the above graphs besides T_inside, we observe a zero correlation, which makes sense due to the sinusoidal nature of our data. The points are evenly distributed on top and the bottom of the zero line and, hence, cancel each other out. The only significant positive correlation is present on the plot of Error vs. T_inside. If you look at the earlier graph called 'Fitted Model Error vs. Time,' you can observe that most positive errors occur in the summer, while most negative errors occur in the winter. Similarly, the temperature tends to be higher in the summer and colder in the winter. Hence, these two variables are very likely to correlate.

3.3. Model Construction¶

Based on the observations made from the visualizations above, we suspect there will be a relationship between the target variables and the selected mathematically summarized weather conditions.

# model dictionary
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor

estimators = {'Linear': LinearRegression(),
              'Ridge': Ridge(max_iter = 100000), 
              'Lasso': Lasso(max_iter = 100000), 
              'kNN': KNeighborsRegressor(),
              'BayesianRidge': BayesianRidge(n_iter = 100000),
              'KernelRidge': KernelRidge(),
              'GaussianProcess': GaussianProcessRegressor()}

estimators.values()

dict_values([LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=100000,
      normalize=False, random_state=None, solver='auto', tol=0.001), Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=100000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False), KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform'), BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
              fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=100000,
              normalize=False, tol=0.001, verbose=False), KernelRidge(alpha=1, coef0=1, degree=3, gamma=None, kernel='linear',
            kernel_params=None), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True, kernel=None,
                         n_restarts_optimizer=0, normalize_y=False,
                         optimizer='fmin_l_bfgs_b', random_state=None)])

def fit_regressors(estimators, split):
    
    X_train, X_test, y_train, y_test = split.X_train_selected, split.X_test_selected, split.y_train, split.y_test
    
    # select a classifier and change parameters
    for name, estimator in estimators.items():
        
        # fit a model to the training data
        estimator.fit(X = X_train, y = y_train)
        
        # compute R^2 scores
        train_score = round(r2_score(y_train, estimator.predict(X_train)), 4)
        test_score = round(r2_score(y_test, estimator.predict(X_test)), 4)
        
        #print accuracy
        print(name, ':\n\tR^2 value for training set: ', train_score, 
              '\n\tR^2 value for testing set : ', test_score, '\n', sep = '')


for split in splits:
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + split.name.upper()  + '\n')
    fit_regressors(estimators, split)

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES AND TARGET

Linear:
	R^2 value for training set: 0.236
	R^2 value for testing set : 0.2104

Ridge:
	R^2 value for training set: 0.2352
	R^2 value for testing set : 0.208

Lasso:
	R^2 value for training set: 0.236
	R^2 value for testing set : 0.2106

kNN:
	R^2 value for training set: 0.5269
	R^2 value for testing set : 0.2031

BayesianRidge:
	R^2 value for training set: 0.2358
	R^2 value for testing set : 0.2092

KernelRidge:
	R^2 value for training set: 0.2323
	R^2 value for testing set : 0.2019

GaussianProcess:
	R^2 value for training set: 0.7732
	R^2 value for testing set : -7.4677

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES AND SMOOTHED TARGET

Linear:
	R^2 value for training set: 0.2879
	R^2 value for testing set : 0.313

Ridge:
	R^2 value for training set: 0.2874
	R^2 value for testing set : 0.3097

Lasso:
	R^2 value for training set: 0.2879
	R^2 value for testing set : 0.3121

kNN:
	R^2 value for training set: 0.6644
	R^2 value for testing set : 0.5381

BayesianRidge:
	R^2 value for training set: 0.2878
	R^2 value for testing set : 0.3113

KernelRidge:
	R^2 value for training set: 0.284
	R^2 value for testing set : 0.3165

GaussianProcess:
	R^2 value for training set: 0.88
	R^2 value for testing set : -3.9679

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES AND FITTED ERROR

Linear:
	R^2 value for training set: 0.0435
	R^2 value for testing set : 0.0824

Ridge:
	R^2 value for training set: 0.0433
	R^2 value for testing set : 0.0784

Lasso:
	R^2 value for training set: 0.0434
	R^2 value for testing set : 0.0811

kNN:
	R^2 value for training set: 0.5203
	R^2 value for testing set : 0.2887

BayesianRidge:
	R^2 value for training set: 0.042
	R^2 value for testing set : 0.0702

KernelRidge:
	R^2 value for training set: 0.031
	R^2 value for testing set : 0.0567

GaussianProcess:
	R^2 value for training set: 0.8157
	R^2 value for testing set : -21.3513

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND TARGET

Linear:
	R^2 value for training set: 0.3059
	R^2 value for testing set : 0.2438

Ridge:
	R^2 value for training set: 0.2989
	R^2 value for testing set : 0.2257

Lasso:
	R^2 value for training set: 0.305
	R^2 value for testing set : 0.2388

kNN:
	R^2 value for training set: 0.6663
	R^2 value for testing set : 0.4208

BayesianRidge:
	R^2 value for training set: 0.2964
	R^2 value for testing set : 0.2235

KernelRidge:
	R^2 value for training set: 0.2989
	R^2 value for testing set : 0.2257

GaussianProcess:
	R^2 value for training set: 1.0
	R^2 value for testing set : -0.4826

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND SMOOTHED TARGET

Linear:
	R^2 value for training set: 0.4611
	R^2 value for testing set : 0.522

Ridge:
	R^2 value for training set: 0.4486
	R^2 value for testing set : 0.4718

Lasso:
	R^2 value for training set: 0.4574
	R^2 value for testing set : 0.509

kNN:
	R^2 value for training set: 0.9383
	R^2 value for testing set : 0.9103

BayesianRidge:
	R^2 value for training set: 0.4516
	R^2 value for testing set : 0.4805

KernelRidge:
	R^2 value for training set: 0.4486
	R^2 value for testing set : 0.4727

GaussianProcess:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.9374

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND FITTED ERROR

Linear:
	R^2 value for training set: 0.3161
	R^2 value for testing set : 0.2717

Ridge:
	R^2 value for training set: 0.265
	R^2 value for testing set : 0.2217

Lasso:
	R^2 value for training set: 0.309
	R^2 value for testing set : 0.2709

kNN:
	R^2 value for training set: 0.8855
	R^2 value for testing set : 0.7317

BayesianRidge:
	R^2 value for training set: 0.2999
	R^2 value for testing set : 0.2677

KernelRidge:
	R^2 value for training set: 0.2653
	R^2 value for testing set : 0.2217

GaussianProcess:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.8141

3.4. Model Evaluation¶

We discarded innefective models and decided to evaluate those with a passable performance using the following

Predicting Smoothed Target¶

Linear for features combined with smoothed target
KNN for features combined with smoothed target
Gaussian for features combined with smoothed target

Predicting Fitted Error¶

Linear for features combined and fitted error
KNN for features combined with smoothed target
Guassiann for features combiend with smoothed target

Evaluation Metrics using Cross Validation¶

R^2
Mean Accuracy
Standard Deviation

f_combined_smoothed = splits[4]
f_combined_smoothed_err = splits[5]

Cross validation of highest performing models on selected features for smoothed target and error¶

from sklearn.model_selection import KFold
kfold = KFold(n_splits = 10, random_state = 3000, shuffle = True)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

evalEstimators = {'Linear': LinearRegression(),
              'kNN': KNeighborsRegressor(),
              'GaussianProcess': GaussianProcessRegressor()}

#Dictionary of model names and scores
scores_target_smoothed = {}
scores_error_smoothed = {}
for name, model in evalEstimators.items():
    scores_target_smoothed[name] = cross_val_score(estimator=model,X=f_combined_smoothed.X_train_selected,
                                                   y=f_combined_smoothed.y_train,cv = 5)
    print(name + " Target_Smoothed done")
    scores_error_smoothed[name] = cross_val_score(estimator=model,X=f_combined_smoothed_err.X_train_selected,
                                                  y=f_combined_smoothed_err.y_train,cv = 5)
    print(name + " Error Fitted done")

Linear Target_Smoothed done
Linear Error Fitted done
kNN Target_Smoothed done
kNN Error Fitted done
GaussianProcess Target_Smoothed done
GaussianProcess Error Fitted done

Prediction Accuracy¶

KNN and Gaussian appear to have the best predictive performance on the validation tests
The same is true when predicting Error CV mean scores
Low standard deviations indicate that the afforementioned models are reliably accurate

print("Target Smoothed CV scores")
for name, score in scores_target_smoothed.items():
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + name  + '\n')
    print("Mean Score:",score.mean())
    print("Standard Deviation",f"{score.std():.2%}")
print("\n\n")
print("\t\tError CV Mean Scores")
for name, score in scores_target_smoothed.items():
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + name  + '\n')
    print("Mean Score:",score.mean())
    print("Standard Deviation",f"{score.std():.2%}")

Target Smoothed CV scores
–––––––––––––––––––––––––––––––––––––––––––––––––

		Linear

Mean Score: 0.4060851503976419
Standard Deviation 4.94%
–––––––––––––––––––––––––––––––––––––––––––––––––

		kNN

Mean Score: 0.8445501092238917
Standard Deviation 2.61%
–––––––––––––––––––––––––––––––––––––––––––––––––

		GaussianProcess

Mean Score: 0.9041986090947965
Standard Deviation 2.35%


		Error CV Mean Scores
–––––––––––––––––––––––––––––––––––––––––––––––––

		Linear

Mean Score: 0.4060851503976419
Standard Deviation 4.94%
–––––––––––––––––––––––––––––––––––––––––––––––––

		kNN

Mean Score: 0.8445501092238917
Standard Deviation 2.61%
–––––––––––––––––––––––––––––––––––––––––––––––––

		GaussianProcess

Mean Score: 0.9041986090947965
Standard Deviation 2.35%

3.5. Model Optimization¶

Rationale For Tuning¶

After performing a cross-validation test to evaluate our three best models, we can then select from that set, two models: kNN and Gaussian Process for hyper-parameter tuning. Since these models have a high degree of accuracy, we want to reduce the risk of over-fitting these models to retain predictive value upon future data for flower production.

Tuning¶

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

param_grid1 = {'alpha':[.001, .01, .1, 1, 10, 100]}
param_grid2 = {"n_neighbors":[1, 2, 3, 5, 10], "metric": ["euclidean", "manhattan", "minkowski"]}
best_estimators = {
              'kNN': KNeighborsRegressor(),
              'GaussianProcess': GaussianProcessRegressor()}

def optimize(x_train, y_train):
    out = {}
    for name, model in best_estimators.items():
        if name == 'kNN':
            grid_search = GridSearchCV(model, param_grid2, cv=5)
        if name == 'GaussianProcess':
            grid_search = GridSearchCV(model, param_grid1, cv=5)
        grid_search.fit(X=x_train, y=y_train)
        print(name)
        print("\tBest cross-validation score: ", grid_search.best_score_)
        print("\tBest parameters: ", grid_search.best_params_)
        print()
        out[name] = grid_search.best_params_
    return out

Results on Training Set For Parameter Tuning¶

print("Model Optimization for Selected features on Smoothed Target data \n__________________________________________________ \n")
best_params = optimize(f_combined_smoothed.X_train_selected, f_combined_smoothed.y_train)
print('\n')
print("Model Optimization for selected features for error prediction \n__________________________________________________ \n")
best_params_error = optimize(f_combined_smoothed_err.X_train_selected, f_combined_smoothed_err.y_train)
print('\n')

Model Optimization for Selected features on Smoothed Target data 
__________________________________________________ 

kNN
	Best cross-validation score:  0.9447797514224517
	Best parameters:  {'metric': 'manhattan', 'n_neighbors': 2}

GaussianProcess
	Best cross-validation score:  0.91222188459969
	Best parameters:  {'alpha': 0.001}


Model Optimization for selected features for error prediction 
__________________________________________________ 

kNN
	Best cross-validation score:  0.9025645999268768
	Best parameters:  {'metric': 'manhattan', 'n_neighbors': 1}

GaussianProcess
	Best cross-validation score:  0.8186171169981172
	Best parameters:  {'alpha': 0.001}

3.6. Model Testing¶

Using our findings from the parameter tuning phase, we will test the accuracy and performance of our models on our testing data set

param_grid1 = {'alpha':[.001, .01, .1, 1, 10, 100]}
param_grid2 = {"n_neighbors":[1, 2, 3, 5, 10], "metric": ["euclidean", "manhattan", "minkowski"]}
optimal_estimators = {
              'kNN Smoothed Target': KNeighborsRegressor(n_neighbors = 1, metric = 'manhattan'),
              'GaussianProcess Smoothed Target': GaussianProcessRegressor(alpha = .001),
                'kNN Error' : KNeighborsRegressor(n_neighbors = 2),
                'GaussianProcess Error' :GaussianProcessRegressor(alpha = .001) }

# selected on smooth and error
for split in splits[4:]:
    print('–––––––––––––––––––––––––––––––––––––––––––––––––\n\n\t\t' + split.name.upper()  + '\n')
    fit_regressors(optimal_estimators, split)

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND SMOOTHED TARGET

kNN Smoothed Target:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.9693

GaussianProcess Smoothed Target:
	R^2 value for training set: 0.9983
	R^2 value for testing set : 0.9497

kNN Error:
	R^2 value for training set: 0.9872
	R^2 value for testing set : 0.966

GaussianProcess Error:
	R^2 value for training set: 0.9983
	R^2 value for testing set : 0.9497

–––––––––––––––––––––––––––––––––––––––––––––––––

		FEATURES COMBINED AND FITTED ERROR

kNN Smoothed Target:
	R^2 value for training set: 1.0
	R^2 value for testing set : 0.9413

GaussianProcess Smoothed Target:
	R^2 value for training set: 0.9966
	R^2 value for testing set : 0.833

kNN Error:
	R^2 value for training set: 0.9749
	R^2 value for testing set : 0.7965

GaussianProcess Error:
	R^2 value for training set: 0.9966
	R^2 value for testing set : 0.833

4. DISCUSSION¶

Summary of Analysis¶

In order to perform our analysis, we split our weather conditions and greenhouse data into feature variables and target variable, the number of flowers produced. Then we created several functions like sum, range, or standard deviation that were applied to every feature and recorder in a 'features combined' data frame along with making our features polynomial to explore nonlinear relationships. We also smoothened our target variable and tried to come up with a mathematical model describing its distribution. As a result, we obtained two additional targets (smoothened amount of flowers produced and the error between the mathematical model and the smoothened quantity produced) to explore in our data analysis. We did not visualize all of our data because of the sinusoidal nature of our target variable but focused on the selected relationships among our features and features with the number of flowers produced.

To run our regression models, we created six splits of data with a different combination of features and target variables (features, features combined, target, smooth target, error), scaled the appropriate sets, and selected the most significant features. We evaluated seven different supervised machine learning models - subcategory regression - using all six of the training and testing sets described previously. f We then evaluated the regression models using a cross validation split in order to adaquetely determine which models were the most promising and warranted hyper-parameter tuning. Our results dicated that kNN and Gaussian Process were our best models from the cross mean validation scores which were: 84% and 90% respectively.

Afterwards, we applied a GridSearchCV to fine tune the following parameters:

Smoothed Target Value¶

Gaussian Process: Alpha Value = .001
kNN: n_neighbors = 2 and Metric = Manhattan

Error Prediction¶

Gaussian Process: Alpha Value = .001
kNN: n_neighbors = 1 and Metric = Manhattan

Interpretation of Findings¶

Algorithms Compared

We compared Multiple Linear, Ridge, Lasso, kNN, BayesianRidge, KernelRidge, and GaussianProcess regression algorithms.

Algorithms with Best Performance The algorithms with the best performance were kNN and Gaussian process, and Linear Regression

Smoothed Target Value r^2
* Multiple Linear: 0.40
* kNN: .84
* Gaussian Process: .90

Error Prediction r^2
* Multiple Linear: 0.40
* kNN: .84
* Gaussian Process: .90

Evaluation after Optimization

After tuning and optimization we saw a moderate improvement in the results. After optimization, we were unable to improve the linear regression model to a sufficient standard, and as such, it was discarded from our final analysis.

The accuracy of the models on our testing data improved to the following values.

Smoothed Target Value r^2
* kNN: .96
* Gaussian Process: .94

Error Prediction r^2
* kNN: .98
* Gaussian Process: .94

Algorithms for Use in Predictive Model

We expected Bayesian Ridge, KernelRidge, and Gaussian Process regression algorithms to work best because the examples in the scikit learn documentations suggests that they are good at predicting sinusoidal data. However, upon review of our results we found that Gaussian Process was the only model from our original predictions that produced sufficient results. Knn and Gaussian models form the best predictive models for our data set.

Our Original Research Questions¶

Is there a linear relationship between the weather conditions and the number of flowers produced? If not, what is the relationship?

In our data visualization and analysis, we found that there is no direct linear relationship between the weather conditions and the number of flowers produced. Moreover, the linear regression model proved to one of the least accurate when it came to predicting the number of flowers produced or the deviation from the assumed cycle.

Is there an accumulative or delayed effect of the weather conditions on the dependent variable? What is it?

When we applied the feature selection technique, we discovered that accumulative, delayed, and other mathematically modified features had a more significant effect on the dependent variable than the current weather conditions.

Is there a pattern in the number of flowers produced? What periods appear to be more stable or have less noise in the number of flowers produced? What parts of the natural cycle (amplitude, period, vertical shift) are affected by the weather?

It is evident from our visualizations and attempts to mathematically describe the pattern that the number of flowers produced follows a sinusoidal pattern.

Is there a relationship between the weather conditions and the deviation from the natural production cycle? Which features are the most influential on the deviations of the cycle?

We were able to construct an accurate model with the deviation from the cycle as a target variable, which means that there has to be a relationship. The list of significant features can be found in the data analysis section, but we are not sure about the order of significance. Coming up with twenty or so selected features has proven to be very computationally expensive for the combined dataset.

Is there a linear or nonlinear regression model that can predict the number of flowers produced? Which one is more accurate?

Yes. kNN (non linear) and Gaussian Process were both predictive of our data. However, neither of those models are linear.

How do the features correlate with each other? Can we reduce their number to make the prediction model more accurate?

In the heat map visualization of our data, we found that the most significant features do not correlate with each other. However, we did not check all the features, and the ones we have already been selected as significant. Reducing the number of features in the combined data set proved to produce more accurate prediction models and less accurate predictive models for an original data set of features.

Reflection on Our Findings

In our original problem, we wanted to find out if our selected (strongly correlated) wheather conditions were predictive of the sinusoidal pattern originally observed in our data. After testing a variety of models including several linear models and kNN and Gaussian Process we found that linear models were not predictive of our data but kNN and Gaussian Process models were. Upon tuning the parameters of our models and employing our feature selection, we were able to generate two highly accurate models that performed well under cross validation tests.

Conclusion¶

We expected better results from the Kernel Ridge regression as scikit learn documentation says it is often highly predictive of sinusoidal data. Perhaps we could have done a better job fine tuning the prameters of that specific model or perhaps a Kernel ridge regression warrants a more specific feature selection process. We were surprised that kNN model could be tuned to such a high degree of accuracy. The performance of the Gaussian Process was not surprising as it is often highly predictive of cyclical data.

Future work of this project could involve combining the error prediction model with the smoothed target prediction. Additionally we believe this model has effective real world utility for greenhouse companies and possibly for other crops that are farmed. The real world implications of this project could perhaps be employed by certain hedge funds or investment firms to predict yields of various crops (other than flowers) in order to effectively price commodity futures.

CONTRIBUTIONS¶

Jake:

problem description and specification
data acquisition
feature and target extraction
preprocessing and feature selection
data exploration

Pranav

Model Fitting and Selection
Model Prediction
Model Optimization
Model Evaluation
Model Testing

	Date	Rsum	Lamps	T_inside	CO2	T_Max	T_Avg	T_Min	DP_Max	DP_Avg	DP_Min	H_Max	H_Avg	H_Min	P_Max	P_Avg	P_Min	Produced
0	7/31/17	2566	0	24.4	573	28.9	22.7	13.9	17.2	14.7	12.8	94	64.3	37	29.7	29.6	29.6	800
1	8/1/17	2182	0	27.0	650	33.9	25.4	17.2	21.1	17.3	13.9	94	65.0	31	29.8	29.7	29.7	716
2	8/2/17	2390	0	27.6	715	33.9	26.9	18.9	20.0	18.2	15.0	100	63.3	34	29.7	29.7	29.6	1125
3	8/3/17	2239	0	27.1	595	32.2	26.5	20.0	21.1	18.7	17.2	88	64.5	40	29.6	29.5	29.5	1280
4	8/4/17	2141	0	27.1	634	33.9	26.6	18.9	22.2	17.9	12.8	100	64.0	29	29.5	29.4	29.4	1971
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1032	5/28/20	1174	680	20.2	657	17.2	12.2	7.8	12.8	10.7	7.8	100	91.4	72	29.7	29.6	29.4	10697
1033	5/29/20	728	430	20.3	717	17.8	14.7	12.2	13.9	12.1	11.1	100	85.3	64	29.4	29.3	29.3	7234
1034	5/30/20	466	310	20.0	698	17.2	13.9	12.2	13.9	12.4	10.0	100	91.5	68	29.3	29.2	29.1	8939
1035	5/31/20	1338	250	20.0	564	17.8	13.4	10.0	11.1	9.7	7.8	94	79.3	59	29.3	29.3	29.2	6104
1036	6/1/20	1266	250	19.4	694	33.9	29.7	25.0	28.9	26.2	23.9	100	80.9	52	29.3	29.3	29.3	10040

	Rsum	Lamps	T_inside	CO2	T_Max	T_Avg	T_Min	DP_Max	DP_Avg	DP_Min	H_Max	H_Avg	H_Min	P_Max	P_Avg	P_Min
0	2566	0	24.4	573	28.9	22.7	13.9	17.2	14.7	12.8	94	64.3	37	29.7	29.6	29.6
1	2182	0	27.0	650	33.9	25.4	17.2	21.1	17.3	13.9	94	65.0	31	29.8	29.7	29.7
2	2390	0	27.6	715	33.9	26.9	18.9	20.0	18.2	15.0	100	63.3	34	29.7	29.7	29.6
3	2239	0	27.1	595	32.2	26.5	20.0	21.1	18.7	17.2	88	64.5	40	29.6	29.5	29.5
4	2141	0	27.1	634	33.9	26.6	18.9	22.2	17.9	12.8	100	64.0	29	29.5	29.4	29.4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1032	1174	680	20.2	657	17.2	12.2	7.8	12.8	10.7	7.8	100	91.4	72	29.7	29.6	29.4
1033	728	430	20.3	717	17.8	14.7	12.2	13.9	12.1	11.1	100	85.3	64	29.4	29.3	29.3
1034	466	310	20.0	698	17.2	13.9	12.2	13.9	12.4	10.0	100	91.5	68	29.3	29.2	29.1
1035	1338	250	20.0	564	17.8	13.4	10.0	11.1	9.7	7.8	94	79.3	59	29.3	29.3	29.2
1036	1266	250	19.4	694	33.9	29.7	25.0	28.9	26.2	23.9	100	80.9	52	29.3	29.3	29.3

	Rsum^2	Rsum^3	Rsum^4	Rsum^5	Rsum^6	Rsum^7	Rsum^8	Rsum^9	Rsum^10	Lamps^2	...	P_Avg^10	P_Min^2	P_Min^3	P_Min^4	P_Min^5	P_Min^6	P_Min^7	P_Min^8	P_Min^9	P_Min^10
0	6584356.0	1.689546e+10	4.335374e+13	1.112457e+17	2.854565e+20	7.324813e+23	1.879547e+27	4.822918e+30	1.237561e+34	0.0	...	5.163178e+14	876.16	25934.336	767656.3456	2.272263e+07	6.725898e+08	1.990866e+10	5.892963e+11	1.744317e+13	5.163178e+14
1	4761124.0	1.038877e+10	2.266830e+13	4.946223e+16	1.079266e+20	2.354958e+23	5.138519e+26	1.121225e+30	2.446513e+33	0.0	...	5.340286e+14	882.09	26198.073	778082.7681	2.310906e+07	6.863390e+08	2.038427e+10	6.054128e+11	1.798076e+13	5.340286e+14
2	5712100.0	1.365192e+10	3.262809e+13	7.798113e+16	1.863749e+20	4.454360e+23	1.064592e+27	2.544375e+30	6.081056e+33	0.0	...	5.340286e+14	876.16	25934.336	767656.3456	2.272263e+07	6.725898e+08	1.990866e+10	5.892963e+11	1.744317e+13	5.163178e+14
3	5013121.0	1.122438e+10	2.513138e+13	5.626916e+16	1.259867e+20	2.820841e+23	6.315864e+26	1.414122e+30	3.166219e+33	0.0	...	4.991375e+14	870.25	25672.375	757335.0625	2.234138e+07	6.590708e+08	1.944259e+10	5.735564e+11	1.691991e+13	4.991375e+14
4	4583881.0	9.814089e+09	2.101197e+13	4.498662e+16	9.631635e+19	2.062133e+23	4.415027e+26	9.452572e+29	2.023796e+33	0.0	...	4.824733e+14	864.36	25412.184	747118.2096	2.196528e+07	6.457791e+08	1.898591e+10	5.581856e+11	1.641066e+13	4.824733e+14
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1032	1378276.0	1.618096e+09	1.899645e+12	2.230183e+15	2.618235e+18	3.073808e+21	3.608650e+24	4.236555e+27	4.973716e+30	462400.0	...	5.163178e+14	864.36	25412.184	747118.2096	2.196528e+07	6.457791e+08	1.898591e+10	5.581856e+11	1.641066e+13	4.824733e+14
1033	529984.0	3.858284e+08	2.808830e+11	2.044829e+14	1.488635e+17	1.083726e+20	7.889528e+22	5.743577e+25	4.181324e+28	184900.0	...	4.663116e+14	858.49	25153.757	737005.0801	2.159425e+07	6.327115e+08	1.853845e+10	5.431765e+11	1.591507e+13	4.663116e+14
1034	217156.0	1.011947e+08	4.715673e+10	2.197504e+13	1.024037e+16	4.772011e+18	2.223757e+21	1.036271e+24	4.829022e+26	96100.0	...	4.506387e+14	846.81	24642.171	717087.1761	2.086724e+07	6.072366e+08	1.767058e+10	5.142140e+11	1.496363e+13	4.354416e+14
1035	1790244.0	2.395346e+09	3.204974e+12	4.288255e+15	5.737685e+18	7.677022e+21	1.027186e+25	1.374374e+28	1.838913e+31	62500.0	...	4.663116e+14	852.64	24897.088	726994.9696	2.122825e+07	6.198650e+08	1.810006e+10	5.285217e+11	1.543283e+13	4.506387e+14
1036	1602756.0	2.029089e+09	2.568827e+12	3.252135e+15	4.117203e+18	5.212378e+21	6.598871e+24	8.354171e+27	1.057638e+31	62500.0	...	4.663116e+14	858.49	25153.757	737005.0801	2.159425e+07	6.327115e+08	1.853845e+10	5.431765e+11	1.591507e+13	4.663116e+14

	Rsum_3_moving_mean	Rsum_3_moving_mean_dd	Rsum_3_moving_mean_lr	Rsum_5_moving_mean	Rsum_5_moving_mean_dd	Rsum_5_moving_mean_lr	Rsum_7_moving_mean	Rsum_7_moving_mean_dd	Rsum_7_moving_mean_lr	Rsum_3_moving_sd	...	P_Min_21_prev_sd_lr	P_Min_3_prev_range	P_Min_3_prev_range_dd	P_Min_3_prev_range_lr	P_Min_7_prev_range	P_Min_7_prev_range_dd	P_Min_7_prev_range_lr	P_Min_21_prev_range	P_Min_21_prev_range_dd	P_Min_21_prev_range_lr
0	2374.000000	5.333333	1.000000	2379.333333	-35.083333	1.000000	2344.250000	-40.650000	1.000000	271.529004	...	1.000000	0.0	1.000000e-01	1.000000	0.0	0.1	1.000000	0.0	0.1	1.0
1	2379.333333	5.333333	1.002247	2344.250000	-35.083333	0.985255	2303.600000	-40.650000	0.982660	192.222094	...	1.000000	0.1	1.000000e-01	1.000000	0.1	0.1	1.000000	0.1	0.1	1.0
2	2270.333333	-109.000000	0.954189	2303.600000	-40.650000	0.982660	2263.500000	-40.100000	0.982592	107.481781	...	0.866025	0.1	0.000000e+00	1.000000	0.1	0.0	1.000000	0.1	0.0	1.0
3	2256.666667	-13.666667	0.993980	2203.000000	-100.600000	0.956329	2178.857143	-84.642857	0.962605	125.436571	...	1.673320	0.2	1.000000e-01	2.000000	0.2	0.1	2.000000	0.2	0.1	2.0
4	2147.666667	-109.000000	0.951699	2100.800000	-102.200000	0.953609	1863.714286	-315.142857	0.855363	88.189191	...	1.447494	0.2	3.552714e-15	1.000000	0.3	0.1	1.500000	0.3	0.1	1.5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1032	1081.333333	-153.333333	0.875810	979.600000	12.400000	1.012821	948.571429	-86.000000	0.916874	317.315826	...	0.952095	0.3	2.000000e-01	3.000000	0.3	0.1	1.500000	0.7	0.0	1.0
1033	789.333333	-292.000000	0.729963	1009.600000	30.000000	1.030625	1071.714286	123.142857	1.129819	357.962754	...	1.017321	0.4	1.000000e-01	1.333333	0.4	0.1	1.333333	0.7	0.0	1.0
1034	844.000000	54.666667	1.069257	994.400000	-15.200000	0.984945	1052.333333	-19.380952	0.981916	447.423737	...	1.092212	0.3	-1.000000e-01	0.750000	0.6	0.2	1.500000	0.7	0.0	1.0
1035	1023.333333	179.333333	1.212480	949.500000	-44.900000	0.954847	994.400000	-57.933333	0.944948	484.005510	...	1.035749	0.2	-1.000000e-01	0.666667	0.6	0.0	1.000000	0.7	0.0	1.0
1036	1302.000000	278.666667	1.272313	1023.333333	73.833333	1.077760	949.500000	-44.900000	0.954847	50.911688	...	0.937762	0.2	0.000000e+00	1.000000	0.6	0.0	1.000000	0.7	0.0	1.0