
GitHub Repository: calculuslab/Calculus_Lab
Path: blob/main/141-Labs/Optional Project - Linear and Quadratic Regression.ipynb
Kernel: SageMath 9.2

Project - Linear and Quadratic Regression

Overview

In this lab, we will learn what regression is and how to find a regression line. The task for the project will be to find a quadratic function that best fits some data (i.e., you will perform quadratic regression).

Important SageMath Commands Introduced in this Lab

| Command | Description |
| --- | --- |
| **scatter_plot(...)** | Plot a list of (x, y) data points |
| **var(...)** | Declare a symbolic variable |
| **sum(...)** | Add up the entries of a list |
| **diff(...)** | Differentiate an expression with respect to a variable |
| **solve(...)** | Solve an equation or a system of equations |

The following Python script generates some data for us to play with. You don't have to understand the code in the cell, just its output.

We generated some x-values stored in a list named "X" and some y-values stored in a list called "Y". We then created a list called "Points" that contains the (x, y)-values together. Each x-value is paired with the y-value at the same spot in the list; for example, the 2nd x-value is paired with the 2nd y-value.

import numpy as np

noise = np.random.normal(size=20)   # random noise to add to the data
X = np.arange(0,10,.5)              # 20 x-values: 0, 0.5, 1, ..., 9.5
Y = 2*X + noise                     # y-values: roughly the line y = 2x, plus noise
Points = []
for i in range(20):
    Points.append([X[i],Y[i]])      # pair each x-value with its y-value

Now that we've created our data, let's see what it looks like. We can plot our data using the **scatter_plot(...)** command.

scatter_plot(Points,title='Our Data')

Our goal now is to find the line that best fits the data. This line, called a regression line, can be used to make predictions about the outcome of an experiment based on your data. So how do we find such a line?

Any line is given by the equation $y = ax + b$. So we really need to find the $a$ and $b$ that best fit the data. To find such $a$ and $b$, we need to ask ourselves "What do we mean by 'best fit'?".

A common interpretation of "best fit" is that the average error between the line and the data is as small as it can be. But what does error mean? Recall that the distance between two numbers $c$ and $d$ is given by $|c-d|$. So if $(x_1,y_1)$ is a data point, the distance from the line's prediction $y(x_1)$ to the actual value $y_1$ is given by $|ax_1+b-y_1|$. The problem with this definition of error is that it is not differentiable at every point, so minimizing it can be difficult.

So instead of minimizing that error, we can minimize the squared error. The squared error of the prediction $y(x_1)$ is given by $(ax_1+b-y_1)^2$. The average of these squared errors over all the data points is often called the "Mean Squared Error" or "MSE".

Now if we have $N$ points, then the average error of our prediction is given by the formula $$\text{Error} = \frac{1}{N}\Big( (ax_1+b-y_1)^2 + (ax_2+b-y_2)^2 +\cdots + (ax_N+b-y_N)^2\Big)$$
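
Written more compactly in summation notation, this is $\text{Error} = \frac{1}{N}\sum_{i=1}^{N}(ax_i+b-y_i)^2$.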

If we want to minimize the error above, then we need to solve the system of equations given by

$$\left\{ \begin{array}{l} \frac{d\,\text{Error}}{da}=0 \\ \frac{d\,\text{Error}}{db}=0 \end{array} \right.$$
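
For reference, carrying out these derivatives by hand on the summation form above gives

$$\frac{d\,\text{Error}}{da} = \frac{2}{N}\sum_{i=1}^{N} x_i\,(ax_i+b-y_i) = 0, \qquad \frac{d\,\text{Error}}{db} = \frac{2}{N}\sum_{i=1}^{N} (ax_i+b-y_i) = 0.$$

You do not need to do this by hand; SageMath will compute the derivatives for us below.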

To compute Error, we first need to let $a$ and $b$ be symbolic variables, which we do in the following cell using the **var(...)** command.

var('a')
var('b')

Next, any prediction $y(x_i)$ is given by $y(x_i) = ax_i + b$ for each $x$ data point $x_i$. Since we have all of our $x$ data stored in the list $X$, we can make a new list called "predictions" by multiplying each $x$ data point by $a$ and adding $b$. This can be accomplished in Sage using the following cell:

predictions = a*X+b
predictions

We can then subtract the actual value at each $x$ point from the predicted value by subtracting the list $Y$ from "predictions", and then squaring the result. So we'll create a new list called "errors" containing these values.

errors = (predictions-Y)^2
errors

Lastly, our average error is given by adding up the numbers in the list "errors" and dividing by the number of points, which is 20. This can be accomplished using the **sum(...)** command.

error = (1/20)*sum(errors)

Now we want to minimize "error" with respect to $a$ and $b$. To do this we use the **diff(...)** command and save the equations $\frac{d\,\text{Error}}{da}=0$ and $\frac{d\,\text{Error}}{db}=0$ to the variable names "eq1" and "eq2", respectively. We do this in the following cell.

eq1 = diff(error,a) == 0
eq2 = diff(error,b) == 0

Finally, to find $a$ and $b$, we solve that system of equations, which can be done in Sage using the **solve(...)** command.

show(solve([eq1,eq2],[a,b]))
S = solve([eq1,eq2],[a,b])
S

We then copy and paste the "a" and "b" values and save them to the variables "a" and "b", appending ".0" to the end of each value so that the number is saved as a decimal instead of a ridiculous fraction.

a = 474251695107426/284162969002297.0
b = 51199133380809/29911891473926.0
a
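
If you would rather not copy the numbers by hand, you can also pull the values out of the solution list "S" directly. Here is a minimal sketch, assuming S contains a single solution listing $a$ first and $b$ second, as in the output above:

sol = S[0]            # the first (and only) solution returned by solve
a = sol[0].rhs().n()  # numerical value of a, from the right-hand side of "a == ..."
b = sol[1].rhs().n()  # numerical value of b, from the right-hand side of "b == ..."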

Let's look at the line we just found and see how it compares to our data. So in Sage, define $y = ax + b$ and plot it in the same graph as the data.

def y(x): return a*x+b

One new thing here: if you have two different plots of different types, you can graph them in the same window by running plot1 + plot2, as below.

plot(y,xmin=0,xmax=10) + scatter_plot(Points)

It looks like our regression line fits the data pretty well. You can now use the line we just found to predict the y-values for x-values that lie in between our data points.
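
For example, here is a quick check of the prediction at a point between two of our data values (the input 3.25 is just an arbitrary choice):

y(3.25)   # predicted y-value at x = 3.25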

Let's look at the average error of the regression line. This can be found by computing $y(x_i)$ for each data point $x_i$, adding up $(y(x_i)-y_i)^2$ for each data point $(x_i,y_i)$, and dividing the sum by the number of points, 20.

First we create a list called "squared_errors" by running the line (y(X)-Y)^2. This list contains all the values $(y(x_i)-y_i)^2$.

squared_errors = (y(X)-Y)^2
MSE = (1/20)*sum(squared_errors)

The average error of our prediction is now stored in the variable "MSE".

MSE

For the next portion of the lab, I have created some "quadratic-looking" data below. Your job will be to find the "parabola of best fit". After all, most phenomena in science are not linear, so we need to understand how to fit non-linear data as well. Once again, don't worry about understanding the cell below, as it's just creating the data for you.

All you need to know is that the x-values of the data are stored in the list "X", the y-values are stored in the list "Y", and the corresponding $(x,y)$ coordinates are stored in the list "Points".

import numpy as np

noise = np.random.normal(size=20)   # random noise to add to the data
X = np.arange(-5,5,.5)              # 20 x-values: -5, -4.5, ..., 4.5
Y = -X^2 + 2*X + 1 + noise          # y-values: roughly the parabola y = -x^2 + 2x + 1, plus noise
Points = []
for i in range(20):
    Points.append([X[i],Y[i]])      # pair each x-value with its y-value

Here I have plotted the quadratic data for you so that you can better understand what is meant by "parabola of best fit". **Note that xmin = -5 and xmax = 5 for the data. Keep this in mind when plotting your parabola later.**

scatter_plot(Points)

Instructions

Find the parabola of best fit and complete the project report as specified by the Project Report Guidelines located on the lab website. The due date will be specified by your TA.

Remember that any quadratic function is of the form $ax^2 + bx + c$, so this time you will have 3 equations in the 3 unknowns $a$, $b$, and $c$. You will also need to transform the list $X$ accordingly. The rest of your code will be nearly identical to the linear case.

Your Job

Your task is to do the following:

  1. Write a system of 3 equations in the 3 unknowns $a$, $b$, and $c$. Make sure to include your equations in your report and explain the reasoning behind them.

  2. Solve the equations in (1) with SageMath to find the values of $a$, $b$, and $c$.

  3. Define the regression function $y(x)=ax^2+bx+c$ and plot it in the same window as your data.

  4. Use SageMath to compute the MSE of your predictions and discuss the error in your report.

Extra Credit

Fabricate some data of your own that follows a sine, logarithmic, cubic (or really any function you want) pattern and find the function of best fit accordingly. Let me know if you need help creating the appropriate data so that you can perform the regression. I'm more than happy to help with that part.

Discuss the difference in error that you may notice for your more complicated function. If your regression has more error, why do you think that is? Hypothesize on this.

One caveat to using "fancier" regression is that SageMath can have some problems solving complex systems of equations that are not linear. For instance, SageMath can regress $a\cdot\sin(x)+b$ just fine, but has trouble with $a\cdot\sin(bx)+c$, where an unknown is on the inside of the $\sin$. To save yourself time and frustration, I would suggest sticking with a pattern like $a\cdot f(x) + b\cdot g(x) + c$ where all unknowns are **outside** of the functions $f$ and $g$.
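
To make that suggested pattern concrete, here is a minimal sketch of how the least-squares setup from the linear case would carry over to a model like $a\cdot\sin(x)+b$. It is only an illustration, and it assumes you have already generated sine-like data and stored it in the arrays "X" and "Y" (20 points), exactly as in the examples above:

import numpy as np
var('a')
var('b')
predictions = a*np.sin(X) + b               # the model a*sin(x) + b at each x data point
error = (1/20)*sum((predictions - Y)^2)     # mean squared error for 20 data points
eq1 = diff(error,a) == 0
eq2 = diff(error,b) == 0
show(solve([eq1,eq2],[a,b]))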