
GitHub Repository: calculuslab/Calculus_Lab
Path: blob/main/141-Labs/Optional Project - Linear and Quadratic Regression.ipynb
Kernel: SageMath 9.2

Project - Linear and Quadratic Regression

Overview

In this lab, we will learn what regression is and how to find a regression line. The task for the project will be to find a quadratic function that best fits some data (i.e., you will perform quadratic regression).

Important SageMath Commands Introduced in this Lab

| Command | Description |
| --- | --- |
| **scatter_plot(...)** | Plot a list of (x, y) data points |
| **var(...)** | Declare a symbolic variable |
| **sum(...)** | Add up the entries of a list |
| **diff(...)** | Differentiate an expression with respect to a variable |
| **solve(...)** | Solve an equation or a system of equations |

The following Python script generates some data for us to play with. You don't have to understand the code in the cell, just its output.

We generated some x-values stored in a list named "X" and some y-values stored in a list called "Y". We then created a list called "Points" that contains the (x, y)-values together. Each x-value is paired with the y-value at the same spot in the list; for example, the 2nd x-value is paired with the 2nd y-value.

import numpy as np

noise = np.random.normal(size=20)   # random noise to add to the data
X = np.arange(0,10,.5)              # 20 x-values: 0, 0.5, 1, ..., 9.5
Y = 2*X + noise                     # y-values: roughly the line y = 2x, plus noise
Points = []
for i in range(20):
    Points.append([X[i],Y[i]])      # pair each x-value with its y-value

Now that we've created our data, let's see what it looks like. We can plot our data using the **scatter_plot(...)** command.

scatter_plot(Points,title='Our Data')

Our goal now is to find the line that best fits the data. This line, called a regression line, can be used to make predictions about the outcome of an experiment based on your data. So how do we find such a line?

Any line is given by the equation $y = ax + b$. So we really need to find the $a$ and $b$ that best fit the data. To find such $a$ and $b$, we need to ask ourselves "What do we mean by 'best fit'?".

A common interpretation of "best fit" is that the average error between the line and the data is as small as it can be. But what does error mean? Recall that the distance between two numbers $c$ and $d$ is given by $|c-d|$. So if $(x_1,y_1)$ is a data point, the distance from the line's prediction $y(x_1)$ to the actual value $y_1$ is given by $|ax_1+b-y_1|$. The problem with this definition of error is that it is not differentiable at every point, so minimizing it can be difficult.

So instead of minimizing that error, we can minimize the squared error. The squared error of the prediction $y(x_1)$ is given by $(ax_1+b-y_1)^2$. The average of these squared errors over all the data points is often called the "Mean Squared Error" or "MSE".

Now if we have $N$ points, then the average error of our prediction is given by the formula $$\text{Error} = \frac{1}{N}\Big( (ax_1+b-y_1)^2 + (ax_2+b-y_2)^2 +\cdots + (ax_N+b-y_N)^2\Big)$$
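
Written more compactly in summation notation, this is $\text{Error} = \frac{1}{N}\sum_{i=1}^{N}(ax_i+b-y_i)^2$.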

If we want to minimize the error above, then we need to solve the system of equations given by

$$\left\{ \begin{array}{l} \frac{d\,\text{Error}}{da}=0 \\ \frac{d\,\text{Error}}{db}=0 \end{array} \right.$$
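
For reference, carrying out these derivatives by hand on the summation form above gives

$$\frac{d\,\text{Error}}{da} = \frac{2}{N}\sum_{i=1}^{N} x_i\,(ax_i+b-y_i) = 0, \qquad \frac{d\,\text{Error}}{db} = \frac{2}{N}\sum_{i=1}^{N} (ax_i+b-y_i) = 0.$$

You do not need to do this by hand; SageMath will compute the derivatives for us below.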

To compute Error, we first need to let $a$ and $b$ be symbolic variables, which we do in the following cell using the **var(...)** command.

var('a')
var('b')

Next, any prediction $y(x_i)$ is given by $y(x_i) = ax_i + b$ for each $x$ data point $x_i$. Since we have all of our $x$ data stored in the list $X$, we can make a new list called "predictions" by multiplying each $x$ data point by $a$ and adding $b$. This can be accomplished in Sage using the following cell:

predictions = a*X+b
predictions

We can then subtract the actual value at each $x$ point from the predicted value by subtracting the list $Y$ from "predictions", and then squaring the result. So we'll create a new list called "errors" containing these values.

errors = (predictions-Y)^2
errors

Lastly, our average error is given by adding up the numbers in the list "errors" and dividing by the number of points, which is 20. This can be accomplished using the **sum(...)** command.

error = (1/20)*sum(errors)

Now we want to minimize "error" with respect to $a$ and $b$. To do this we use the **diff(...)** command and save the equations $\frac{d\,\text{Error}}{da}=0$ and $\frac{d\,\text{Error}}{db}=0$ to the variable names "eq1" and "eq2", respectively. We do this in the following cell.

eq1 = diff(error,a) == 0
eq2 = diff(error,b) == 0

Finally, to find $a$ and $b$, we solve that system of equations, which can be done in Sage using the **solve(...)** command.

show(solve([eq1,eq2],[a,b]))
S = solve([eq1,eq2],[a,b])
S

We then copy and paste the "a" and "b" values and save them to the variables "a" and "b", appending ".0" to the end of each value so that the number is saved as a decimal instead of a ridiculous fraction.

a = 474251695107426/284162969002297.0
b = 51199133380809/29911891473926.0
a
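
If you would rather not copy the numbers by hand, you can also pull the values out of the solution list "S" directly. Here is a minimal sketch, assuming S contains a single solution listing $a$ first and $b$ second, as in the output above:

sol = S[0]            # the first (and only) solution returned by solve
a = sol[0].rhs().n()  # numerical value of a, from the right-hand side of "a == ..."
b = sol[1].rhs().n()  # numerical value of b, from the right-hand side of "b == ..."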

Let's look at the line we just found and see how it compares to our data. So in Sage, define $y = ax + b$ and plot it in the same graph as the data.

def y(x): return a*x+b

One new thing here: if you have two different plots of different types, you can graph them in the same window by running plot1 + plot2, as below.

plot(y,xmin=0,xmax=10) + scatter_plot(Points)

It looks like our regression line fits the data pretty well. You can now use the line we just found to predict the y-values for x-values that lie in between our data points.
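
For example, here is a quick check of the prediction at a point between two of our data values (the input 3.25 is just an arbitrary choice):

y(3.25)   # predicted y-value at x = 3.25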

Let's look at the average error of the regression line. This can be found by computing $y(x_i)$ for each data point $x_i$, adding up $(y(x_i)-y_i)^2$ for each data point $(x_i,y_i)$, and dividing the sum by the number of points, 20.

First we create a list called "squared_errors" by running the line (y(X)-Y)^2. This list contains all the values $(y(x_i)-y_i)^2$.

squared_errors = (y(X)-Y)^2
MSE = (1/20)*sum(squared_errors)

The average error of our prediction is now stored in the variable "MSE".

MSE

For the next portion of the lab, I have created some "quadratic-looking" data below. Your job will be to find the "parabola of best fit". After all, most phenomena in science are not linear, so we need to understand how to fit non-linear data as well. Once again, don't worry about understanding the cell below, as it's just creating the data for you.

All you need to know is that the x-values of the data are stored in the list "X", the y-values are stored in the list "Y", and the corresponding $(x,y)$ coordinates are stored in the list "Points".

import numpy as np

noise = np.random.normal(size=20)   # random noise to add to the data
X = np.arange(-5,5,.5)              # 20 x-values: -5, -4.5, ..., 4.5
Y = -X^2 + 2*X + 1 + noise          # y-values: roughly the parabola y = -x^2 + 2x + 1, plus noise
Points = []
for i in range(20):
    Points.append([X[i],Y[i]])      # pair each x-value with its y-value

Here I have plotted the quadratic data for you so that you can better understand what is meant by "parabola of best fit". **Note that xmin = -5 and xmax = 5 for the data. Keep this in mind when plotting your parabola later.**

scatter_plot(Points)

Instructions

Find the parabola of best fit and complete the project report as specified by the Project Report Guidelines located on the lab website. The due date will be specified by your TA.

Remember that any quadratic function is of the form $ax^2 + bx + c$, so this time you will have 3 equations in the 3 unknowns $a$, $b$, and $c$. You will also need to transform the list $X$ accordingly. The rest of your code will be nearly identical to the linear case.

Your Job

Your task is to do the following:

  1. Write a system of 3 equations in the 3 unknowns $a$, $b$, and $c$. Make sure to include your equations in your report and explain the reasoning behind them.

  2. Solve the equations in (1) with SageMath to find the values of $a$, $b$, and $c$.

  3. Define the regression function $y(x)=ax^2+bx+c$ and plot it in the same window as your data.

  4. Use SageMath to compute the MSE of your predictions and discuss the error in your report.

Extra Credit

Fabricate some data of your own that follows a sine, logarithmic, cubic (or really any function you want) pattern and find the function of best fit accordingly. Let me know if you need help creating the appropriate data so that you can perform the regression. I'm more than happy to help with that part.

Discuss the difference in error that you may notice for your more complicated function. If your regression has more error, why do you think that is? Hypothesize on this.

One caveat to using "fancier" regression is that SageMath can have some problems solving complex systems of equations that are not linear. For instance, SageMath can regress $a\cdot\sin(x)+b$ just fine, but has trouble with $a\cdot\sin(bx)+c$, where an unknown is on the inside of the $\sin$. To save yourself time and frustration, I would suggest sticking with a pattern like $a\cdot f(x) + b\cdot g(x) + c$ where all unknowns are **outside** of the functions $f$ and $g$.
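
To make that suggested pattern concrete, here is a minimal sketch of how the least-squares setup from the linear case would carry over to a model like $a\cdot\sin(x)+b$. It is only an illustration, and it assumes you have already generated sine-like data and stored it in the arrays "X" and "Y" (20 points), exactly as in the examples above:

import numpy as np
var('a')
var('b')
predictions = a*np.sin(X) + b               # the model a*sin(x) + b at each x data point
error = (1/20)*sum((predictions - Y)^2)     # mean squared error for 20 data points
eq1 = diff(error,a) == 0
eq2 = diff(error,b) == 0
show(solve([eq1,eq2],[a,b]))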