Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Project - Linear and Quadratic Regression
Overview
In this lab, we will learn what regression is and how to find a regression line. The task for the project will be to find a quadratic function that best fits some data (i.e., you will perform quadratic regression).
Important SageMath Commands Introduced in this Lab
[Table of commands not rendered.]
The following Python script generates some data for us to play with. You don't have to understand the code in the cell, just its output.
We generated some x-values stored in a list named "X" and some y-values stored in a list called "Y". We then created a list called "Points" that contains the (x, y) pairs. Each x-value is paired with the y-value at the same position in its list. For example, the 2nd x-value is paired with the 2nd y-value.
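The pairing described here can be sketched in plain Python with `zip`. The three values in each list are made up for illustration and are not the lab's data; only the names X, Y, and Points come from the text.

```python
# Illustrative values only; the lab's own cell generates its own data.
X = [1.0, 2.0, 3.0]
Y = [2.9, 5.1, 7.2]

# zip pairs the k-th x-value with the k-th y-value.
Points = list(zip(X, Y))

# The 2nd x-value is paired with the 2nd y-value (index 1 in Python).
second_point = Points[1]
print(second_point)
```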
Now that we've created our data, let's see what it looks like. We can plot our data using the scatter_plot(...) command.
Our goal now is to find the line that best fits the data. This line, called a regression line, can be used to make predictions about the outcome of an experiment by using your data. So how do we find such a line?
Any line is given by the equation $y = ax + b$. So we really need to find the $a$ and $b$ that best fit the data. To find such $a$ and $b$ we need to ask ourselves "What do we mean by 'best fit'?".
A common interpretation of "best fit" is that the average error between the line and the data is as small as it can be. Well, what does error mean? Recall that the distance between two numbers $x$ and $y$ is given by $|x - y|$. So if $(x_i, y_i)$ is a data point, the distance from the line's prediction $ax_i + b$ to the actual value $y_i$ is given by $|ax_i + b - y_i|$. The problem with this definition of error is that it's not differentiable at every point, so minimizing it can be difficult.
So instead of minimizing the error itself, we can minimize the squared error. The squared error of the prediction at a data point $(x_i, y_i)$ is given by $(ax_i + b - y_i)^2$. The average of these squared errors over all data points is often called the "Mean Squared Error" or "MSE".
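Now if we have $n$ points, then the average error of our prediction is given by the formula
$$\text{Error}(a, b) = \frac{1}{n}\sum_{i=1}^{n} (ax_i + b - y_i)^2$$
If we want to minimize the error above, then we need to solve the system of equations given by
$$\frac{\partial\, \text{Error}}{\partial a} = 0, \qquad \frac{\partial\, \text{Error}}{\partial b} = 0$$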
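As a worked step, assuming the line is $y = ax + b$ and the average error is $\text{Error}(a, b) = \frac{1}{n}\sum_{i=1}^{n} (ax_i + b - y_i)^2$, the chain rule turns the two conditions into:

$$\frac{\partial\, \text{Error}}{\partial a} = \frac{2}{n}\sum_{i=1}^{n} x_i\,(ax_i + b - y_i) = 0, \qquad \frac{\partial\, \text{Error}}{\partial b} = \frac{2}{n}\sum_{i=1}^{n} (ax_i + b - y_i) = 0$$

Both equations are linear in $a$ and $b$, which is why a computer algebra system can solve them exactly.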
To compute Error, we first need to let $a$ and $b$ be variables, which we do in the following cell using the var(...) command.
Next, any prediction is given by $ax_i + b$ for any data point $x_i$. Since we have all of our x-values stored in the list X, we can make a new list called "predictions" by multiplying each data point by $a$ and adding $b$. This can be accomplished in Sage using the following cell:
We can then subtract the actual values at each point from the predicted values by subtracting the entries of the list Y from the entries of predictions, and then squaring the result. So we'll create a new list called "errors" containing these values.
Lastly, our average error is given by adding up the numbers in the list "errors" and dividing by the number of points, which is 20. This can be accomplished using the sum(...) command.
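The three steps (predictions, squared errors, average) can be sketched in plain Python. This is not the lab's Sage cell: in Sage, $a$ and $b$ stay symbolic, whereas here they are ordinary numbers so that the error of a candidate line can be evaluated directly. The data values and the function name `average_error` are made up for illustration.

```python
# Made-up sample data standing in for the lab's lists X and Y.
X = [0.0, 1.0, 2.0, 3.0]
Y = [1.0, 3.0, 5.0, 7.0]   # these points lie exactly on y = 2x + 1

def average_error(a, b, X, Y):
    """Mean squared error of the line y = a*x + b on the data."""
    predictions = [a * x + b for x in X]
    errors = [(p - y) ** 2 for p, y in zip(predictions, Y)]
    return sum(errors) / len(errors)

perfect = average_error(2.0, 1.0, X, Y)   # the line through every point
worse = average_error(1.0, 1.0, X, Y)     # a line with the wrong slope
print(perfect, worse)
```

A line that passes through every point has average error 0, and any other choice of slope or intercept gives a larger value, which is what the minimization is looking for.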
Now we want to minimize "error" with respect to $a$ and $b$. To do this we use the diff(...) command and save the equations $\frac{\partial\, \text{Error}}{\partial a} = 0$ and $\frac{\partial\, \text{Error}}{\partial b} = 0$ to the variable names "eq1" and "eq2" respectively. We do this in the following cell.
Finally, to find $a$ and $b$, we solve that system of equations for $a$ and $b$, which can be done in Sage using the solve(...) command.
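For a line, the two equations can also be solved by hand, which gives a way to check the result. The sketch below is plain Python rather than the lab's Sage cell; the function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(X, Y):
    """Solve dError/da = 0 and dError/db = 0 for the line y = a*x + b."""
    n = len(X)
    sx = sum(X)
    sy = sum(Y)
    sxx = sum(x * x for x in X)
    sxy = sum(x * y for x, y in zip(X, Y))
    # The two equations reduce to the linear system:
    #   a*sxx + b*sx = sxy
    #   a*sx  + b*n  = sy
    det = n * sxx - sx * sx   # zero only if all x-values are equal
    if det == 0:
        raise ValueError("need at least two distinct x-values")
    a = (n * sxy - sx * sy) / det
    b = (sy - a * sx) / n
    return a, b

# Made-up data lying exactly on y = 2x + 1.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

On data that lies exactly on a line, the solution recovers that line's slope and intercept.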
We then copy and paste the "a" and "b" values and save them to the variables "a" and "b". Then append a numerical conversion such as ".n()" to the end of each value so that the number is saved as a decimal instead of a ridiculous fraction.
Let's look at the line we just found and see how it compares to our data. So in Sage, define $f(x) = ax + b$ and plot it in the same graph as the data.
One new thing here: if you have two plots of different types, you can graph them in the same window by adding them together and showing the sum, like below.
It looks like our regression line fits the data pretty well. You can now use the line we just found to predict the y-values for x-values that lie in between our data points.
Let's look at the average error of the regression line. This can be found by computing $(f(x_i) - y_i)^2$ for each data point $(x_i, y_i)$, adding up these values, and dividing the sum by the number of points, 20.
First we create a list that contains all the values $(f(x_i) - y_i)^2$, one per data point.
Adding up the entries of that list and dividing by 20 gives the average error of our prediction, which we store in a variable.
For the next portion of the lab, I have created some "quadratic-looking" data below. Your job will be to find the "parabola of best fit". After all, most phenomena in science are not linear, so we need to understand how to fit non-linear data as well. Once again, don't worry about understanding the cell below, as it's just creating the data for you.
All you need to know is that the x-values of the data are stored in the list "X", the y-values of the data are stored in the list "Y", and the corresponding coordinates are stored in the list "Points".
Here I have plotted the quadratic data for you so that you can understand better what is meant by "parabola of best fit".
Instructions
Find the parabola of best fit and complete the project report as specified by the Project Report Guidelines located on the lab website. The due date will be specified by your TA.
Remember that any quadratic function is of the form $f(x) = ax^2 + bx + c$, so this time you will have 3 equations with 3 unknowns $a$, $b$, and $c$. You will also need to transform the list accordingly. The rest of your code will be nearly identical to the linear case.
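As a starting point, and assuming the same definition of average error used for the line with $n$ data points $(x_i, y_i)$, the quantity to minimize for a quadratic $f(x) = ax^2 + bx + c$ would be:

$$\text{Error}(a, b, c) = \frac{1}{n}\sum_{i=1}^{n} (ax_i^2 + bx_i + c - y_i)^2$$

The 3 equations then come from setting the partial derivative with respect to each of $a$, $b$, and $c$ equal to zero.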
Your Job
Your task is to do the following:
1. Write a system of 3 equations in the 3 unknowns $a$, $b$, and $c$. Make sure to include your equations in your report, and explain the reasoning behind them.
2. Solve the equations in (1) with SageMath to find the values $a$, $b$, and $c$.
3. Define the regression function and plot it in the same window as your data.
4. Use SageMath to compute the MSE of your predictions and discuss the error in your report.
Extra Credit
Fabricate some data of your own that follows a sine, logarithmic, cubic (or really any function you want) pattern and find the function of best fit accordingly. Let me know if you need help creating the appropriate data so that you can perform the regression. I'm more than happy to help with that part.
Discuss the difference in error that you may notice for your more complicated function. If your regression has more error, why do you think that is? Hypothesize on this.
One caveat to using "fancier" regression is that SageMath can have some problems solving complex systems of equations that are not linear. For instance, SageMath can regress $a\sin(x) + b$ just fine, but has trouble with $\sin(ax + b)$, where an unknown is on the inside of the $\sin$. To save yourself time and frustration, I would suggest just sticking with a pattern like $a\,f(x) + b\,g(x)$, where all unknowns are coefficients of the functions $f$ and $g$.
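The reason this pattern is safe is that a model of the form $a\,f(x) + b\,g(x)$ is linear in the unknowns, so setting the partial derivatives of the squared error to zero still gives a linear system. A minimal plain-Python sketch of that idea follows. It is not the lab's Sage workflow; the function name `fit_two_terms`, the choice of $\sin$ as $f$, and the data are assumptions made for illustration.

```python
import math

def fit_two_terms(f, g, X, Y):
    """Least-squares a, b for the model a*f(x) + b*g(x)."""
    F = [f(x) for x in X]
    G = [g(x) for x in X]
    sff = sum(u * u for u in F)
    sfg = sum(u * v for u, v in zip(F, G))
    sgg = sum(v * v for v in G)
    sfy = sum(u * y for u, y in zip(F, Y))
    sgy = sum(v * y for v, y in zip(G, Y))
    # Setting both partial derivatives of the squared error to zero gives
    #   a*sff + b*sfg = sfy
    #   a*sfg + b*sgg = sgy
    det = sff * sgg - sfg * sfg
    if det == 0:
        raise ValueError("f and g are not independent on these x-values")
    a = (sfy * sgg - sfg * sgy) / det   # Cramer's rule
    b = (sff * sgy - sfg * sfy) / det
    return a, b

# Made-up data lying exactly on 3*sin(x) + 2.
X = [0.5 * k for k in range(10)]
Y = [3 * math.sin(x) + 2 for x in X]
a, b = fit_two_terms(math.sin, lambda x: 1.0, X, Y)
print(round(a, 6), round(b, 6))
```

With $f(x) = x$ and $g(x) = 1$ this reduces to the straight-line case. A model such as $\sin(ax + b)$ does not fit this pattern, because its derivative equations are no longer linear in $a$ and $b$.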