CoCalc -- exam-1.ipynb

Rise of Terrorism Project Report

Micah Jennings - GEOG 30323: Data Analysis & Visualization

Kernel: Anaconda (Python 3)

Score: 82/100

Take-home exam #1

You're starting to get the hang of working with data in Python; now, it's time to put those skills to the test. This is your first of three take-home examinations this semester, which are designed to evaluate your grasp of the concepts and techniques you're learning in this course.

The take-home examinations will consist of a series of structured and semi-structured questions that you will answer with data analysis. In this examination, you will answer ten questions which pertain to two different datasets I've provided for you. Each question will be worth ten points, giving a total possible score of 100; your scores will then be weighted as this take-home examination constitutes 16 percent of your overall course grade.

Your responses to the following questions should consist of the following components:

The code used to answer the question, and any results you've produced;
Visualizations, when appropriate;
A short write-up in Markdown explaining what you did, and analyzing/interpreting your results when appropriate.

Responses to the questions that are "correct" will receive at least 7 out of the possible 10 points. Responses that receive 8, 9, or 10 points will show evidence of effort beyond the minimum necessary to answer the question. This could include but is not limited to noticeable thoughtfulness in the interpretation of the data; responses that blend computation, visualization, and description quite well; or evidence of going beyond the basic parameters of the assignment.

This examination should be completed in this Jupyter Notebook, like your weekly assignments. It is due by the end of the day on Sunday, October 16 for full credit, and will incur a 10 percent penalty (which is equivalent to one question!) for each day it is submitted past the deadline.

Part I: Large colleges/universities in Texas

The first five questions of this take-home exam will employ data from the U.S. Department of Education's College Scorecard, https://collegescorecard.ed.gov/. The College Scorecard includes hundreds of indicators about U.S. colleges and universities; however, you'll only be working with a few here. I've trimmed down the data so that it includes the largest Bachelor's degree granting universities in the state of Texas - specifically, those with an undergraduate enrollment of 3500 or more.

The data are found in your Exam 1 directory in a CSV file named txcolleges2016.csv. The columns in the dataset are as follows:

instnm: Institution name
city: City the institution is located in
sat_avg: The average combined SAT score (math + verbal) of admitted students
ugds: The number of undergraduates enrolled at the institution
pctfloan: The percentage of students receiving federal loans
grad_rate: The percentage of entering freshmen who graduate within 6 years
median_earn: The median earnings of graduates 10 years after graduation, for students who received federal financial aid

You'll be analyzing the data for patterns and trends. The dataset is fairly small - but don't be tempted to just do some of the work in Excel. You must show how you answered the question in Python to receive credit for your responses.

Question 1: Describe the distribution of six-year graduation rates among large universities in Texas. Which universities have the highest graduation rates? The lowest? What does the "shape" of the distribution look like? Are there any outliers?

In [2]:

tx_college = "txcolleges2016.csv"

In [3]:

import pandas as pd
import seaborn as sns
tx = pd.read_csv(tx_college)

In [4]:

tx_sorted = tx.sort_values(by = 'grad_rate', ascending = True)
tx_sorted.head()

Out[4]:

In [5]:

tx_sorted = tx.sort_values(by = 'grad_rate', ascending = False)
tx_sorted.head()

Out[5]:

In [6]:

tx.hist( "grad_rate", color = "yellow")

Out[6]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f658c21e2b0>]], dtype=object)

According to the data, Rice University has the highest graduation rate of all the colleges studied, and Texas Southern University has the lowest. The histogram shows that the majority of the colleges have a graduation rate of 0.4 to 0.459 (roughly 40% to 46%). As a college student and a concerned citizen, I find this quite interesting, and I am curious about the graduation rate of students who finish college within four years.

Score: 7.5/10
KW comments: You are on the right track here, but I'd like to see some more depth from you in your response. Can you comment on the shape of the distribution? How about some additional analysis?
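One way to comment on the shape of the distribution is to compare the mean with the median and compute the skewness. The sketch below is only an illustration: `summarize_shape` is a hypothetical helper, and the rates in `toy_rates` are made up rather than taken from txcolleges2016.csv.

```python
import pandas as pd

def summarize_shape(values):
    """Return summary statistics that describe the shape of a distribution."""
    s = pd.Series(values).dropna()
    return {
        "mean": s.mean(),
        "median": s.median(),
        "skew": s.skew(),  # > 0: longer right tail; < 0: longer left tail
        "min": s.min(),
        "max": s.max(),
    }

# Made-up graduation rates for illustration only (not the exam data).
toy_rates = [0.30, 0.35, 0.40, 0.42, 0.45, 0.50, 0.93]
shape = summarize_shape(toy_rates)
print(shape)
```

With the notebook's data, the equivalent call would be `summarize_shape(tx["grad_rate"])`. A mean well above the median together with a positive skew would suggest that a few high-graduation-rate schools stretch the right tail.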

Question 2: Describe the distribution of federal loan recipient percentages among large universities in Texas. Which universities have the highest percentages of their student bodies receiving federal loans? Which have the lowest? What does the "shape" of the distribution look like? Are there any outliers?

In [7]:

tx_sorted = tx.sort_values(by = 'pctfloan', ascending = True)
tx_sorted.head()

Out[7]:

In [8]:

tx_sorted = tx.sort_values(by = 'pctfloan', ascending = False)
tx_sorted.head()

Out[8]:

In [9]:

sns.boxplot(x = tx_sorted.pctfloan, color = "red")

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f658b19d4a8>

The majority of the colleges studied fall between 0.42 and about 0.56. The overall range is from 0.3 to 0.8. There are two major outliers in the data, which represent the highest and lowest values. The college with the highest percentage of federal loan recipients is Wayland Baptist University. The college with the lowest percentage is Rice University.

Score: 7.5/10
KW comments: Similar to question 1 - this answers the question, but does not go beyond the basic parameters of it. I'd encourage you to do some deeper analysis here to boost your score.
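The outliers visible in a box plot can also be listed by name. The following is a minimal sketch, assuming the common 1.5 × IQR rule (the default whisker length in seaborn and matplotlib box plots). `iqr_outliers` is a hypothetical helper, and the table below uses invented values, not the exam data.

```python
import pandas as pd

def iqr_outliers(df, column, k=1.5):
    """Return the rows whose value lies more than k * IQR outside the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Invented values for illustration only.
toy = pd.DataFrame({
    "instnm": ["A", "B", "C", "D", "E", "F", "G"],
    "pctfloan": [0.30, 0.45, 0.48, 0.50, 0.52, 0.55, 0.80],
})
flagged = iqr_outliers(toy, "pctfloan")
print(flagged)
```

Applied to the notebook's data, this would be `iqr_outliers(tx, "pctfloan")`, which returns the institution names alongside the flagged values.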

Question 3: Analyze the (potential) relationship between the six-year graduation rates of colleges and the median earnings of graduates 10 years after graduation. What relationship do you observe? Is this relationship meaningful, in your opinion, or is it "spurious" - meaning that there is something else that we aren't accounting for that actually explains the relationship?

In [10]:

sns.lmplot(x = 'grad_rate', y = 'median_earn',   data = tx_sorted)

Out[10]:

<seaborn.axisgrid.FacetGrid at 0x7f6592e2c3c8>

This chart shows that colleges with a low graduation rate are more likely to have graduates who earn lower incomes. There is clearly a positive correlation between the two variables. I think this is a meaningful relationship, because colleges with low graduation rates do tend to have graduates who earn less money in their future careers. Still, one could argue that the relationship is spurious, since we don't know the types of majors each college offers or the amount of funding the school itself has. Schools with a lot of money generally have better graduation rates.

Score: 8.5/10
KW comments: You've used `lmplot` appropriately here and provided some interesting reflection on the topic - though I do think there is opportunity to do more. For example - you mention that we don't know the types of majors offered, or amount of funding available. Consider doing a little extra research and finding this out for a couple colleges on either side of the plot - this could lead to some interesting discussion.
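To pick out colleges "on either side of the plot", one option is to fit the same kind of straight line that `lmplot` draws and rank colleges by how far they sit above or below it. This is a sketch under assumptions: `add_residuals` is a hypothetical helper, the fit uses `numpy.polyfit`, and the table holds invented values.

```python
import numpy as np
import pandas as pd

def add_residuals(df, x, y):
    """Fit y = slope * x + intercept by least squares and attach residuals."""
    clean = df.dropna(subset=[x, y]).copy()
    slope, intercept = np.polyfit(clean[x], clean[y], deg=1)
    clean["residual"] = clean[y] - (slope * clean[x] + intercept)
    # Most negative residual first (earning less than the line predicts).
    return clean.sort_values("residual")

# Invented values for illustration only.
toy = pd.DataFrame({
    "instnm": ["A", "B", "C", "D", "E"],
    "grad_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "median_earn": [10000, 20000, 30000, 40000, 80000],
})
ranked = add_residuals(toy, "grad_rate", "median_earn")
print(ranked[["instnm", "residual"]])
```

The first and last rows of the result are natural candidates for the extra research suggested above, since they earn noticeably less or more than their graduation rate alone would predict.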

Question 4: Choose two of the quantitative variables in the dataset, and examine whether they appear to be interrelated. You can't choose the same two from Question 3, of course, though you could select one or the other.

What relationships do you observe? In your opinion, what factors might explain the observed relationship? Is there anything else you think should be accounted for?

In [11]:

# Scatterplot with a fitted line to describe the relationship between federal loans and SAT scores
sns.lmplot(x = 'pctfloan', y = 'sat_avg', data = tx_sorted)

Out[11]:

<seaborn.axisgrid.FacetGrid at 0x7f658b1f44a8>

For this question I decided to study the relationship between SAT averages and the percentage of students receiving federal loans. After creating a scatterplot, I discovered that there is a negative correlation between the two factors. Colleges whose students have higher average SAT scores have smaller percentages of students asking for federal loans. A careful look at the graph shows that most of the colleges have students who score around 900-1000 points on the SAT, and at those colleges about 50% of students apply for federal loans. This means that most colleges tend to provide aid to students with the amount determined by their SAT score. This can also help us determine the academic ability of the students that the colleges accept.

Score: 8.5/10
KW comments: I like this comparison that you've chosen and you have some good discussion here; I'd encourage you to think as well about the broader context. For example: colleges with small proportions of their student body receiving federal loans presumably are attracting students from wealthier backgrounds - who might have access to SAT prep courses, more resources, etc. that can improve scores.

Question 5: Certainly, there are many other variables that we could include in such an analysis, but are not found in the dataset. Think of two other variables or pieces of information that you'd want to include in an analysis of colleges in Texas. What are these variables? What types of analyses might you do with them?

The two factors that I would like to include are Family Income and Distance from . These variables can help us better understand the overall student body of a college as it pertains to social demographics. Are the students mostly from rich and affluent families and locations, or are they from more middle-class and local areas? With these two variables it would be easier to understand the graduation rate and median income of graduates from a given college. Universities that attract more affluent students and families are bound to have more money than the average college. With more money come more programs that help students learn better and have a truly experiential college experience, especially when the college hosts distinguished majors in the medical, science/engineering, and business fields. Using these two variables would give a clearer view of graduation rate and median income, and as a result would make the relationship between graduation rate and median income more meaningful.

Score: 9/10
KW comments: Agreed - this would be interesting to look at! How might you analyze the data technique-wise?
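Technique-wise, one plausible approach is to merge the extra variable into the college table by institution name and then correlate it with an existing column. The sketch below assumes a second table with a column named `median_family_income`. Both that name and every value shown are hypothetical, since no such data ships with the exam.

```python
import pandas as pd

# Stand-in for the exam's college table; values are invented.
colleges = pd.DataFrame({
    "instnm": ["College A", "College B", "College C"],
    "grad_rate": [0.40, 0.55, 0.80],
})

# Hypothetical extra table; "median_family_income" is an assumed column name.
income = pd.DataFrame({
    "instnm": ["College A", "College B", "College C"],
    "median_family_income": [40000, 60000, 110000],
})

# A left join keeps every college even if the extra table lacks a match.
merged = colleges.merge(income, on="instnm", how="left")
r = merged["grad_rate"].corr(merged["median_family_income"])
print(merged)
print(round(r, 3))
```

Once merged, the new column can be explored with the same tools used earlier in the exam, such as `sns.lmplot`.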

Part II: 2015 NFL quarterback statistics

As we've discussed in class, sports is one of the most popular domains for public consumption and analysis of data. The statistics of teams and athletes are meticulously recorded in many instances, and heavily scrutinized. In the second part of this take-home examination, you'll be working with a dataset of NFL quarterback statistics from 2015 for quarterbacks who played in at least 10 games. The data are originally from Yahoo! Sports, and adapted slightly from the version found here: https://nathanbrixius.wordpress.com/2016/05/02/2015-nfl-statistics-by-player-and-team/.

The dataset is found in your Exam 1 directory as a CSV file named qb2015.csv, and has the following columns:

Name: The name of the quarterback
Team: An abbreviation for the quarterback's team
G: Games played in the 2015 regular season
QBRat: The quarterback rating for the season
Comp: The number of passes completed
Att: The number of passes attempted
Pct: The pass completion percentage
PassYds: The number of passing yards during the regular season
YdsPerG: The average number of passing yards per game
YdsPerA: The average number of passing yards per attempt
TD: The number of passing touchdowns
Int: The number of interceptions thrown
Rush: The number of rushing attempts
RushYds: The number of rushing yards during the regular season
RushYdsG: The average number of rushing yards per game
RushAvg: The average number of rushing yards per carry
RushTD: The number of rushing touchdowns
Sack: The number of times the quarterback was sacked
YdsL: The number of yards lost due to sacks
Fum: The number of fumbles
FumL: The number of fumbles lost

If you are unfamiliar with football rules and need assistance with any of the terminology, please let me know.

Question 6: Chart the distribution of NFL quarterback ratings. Who were the highest-rated quarterbacks? Who were the lowest? What does the distribution of quarterback ratings look like?

In [12]:

qb_rates = "qb2015.csv"

In [13]:

import pandas as pd
import seaborn as sns
qb = pd.read_csv(qb_rates)
qb_rates = qb.sort_values(by = 'QBRat', ascending = True)
qb_rates.head()

Out[13]:

In [14]:

qb_rates = qb.sort_values(by = 'QBRat', ascending = False)
qb_rates.head()

Out[14]:

In [15]:

qb.hist( "QBRat", color = "green")

Out[15]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f658a343dd8>]], dtype=object)

In [16]:

sns.barplot(x = 'QBRat', y = 'Name', data = qb_rates)

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f658a32e160>

In 2015 the highest-rated quarterback in the NFL was Russell Wilson, with a 110.1 rating, and the lowest-rated quarterback was Peyton Manning. According to the histogram, most of the quarterbacks are rated within the range of 80-110, with the highest number of quarterbacks falling between 90 and 95. There are two quarterbacks that are considered to be outliers: Peyton Manning and Nick Foles.

Score: 9/10
KW comments: This is a solid response to the question with good choice of visualization - I'd just recommend some more depth in the response.

Question 7: Your friend Taylor McRube says boldly, "Quarterbacks who rush the ball more throw more interceptions; it is a proven fact. Mobile quarterbacks are simply much more error-prone." Is Taylor correct or incorrect in this statement?

In [17]:

sns.lmplot(x = 'RushYds', y = 'Int',   data = qb_rates)

Out[17]:

<seaborn.axisgrid.FacetGrid at 0x7f6589d143c8>

Looking at this graph, we can easily tell Taylor McRube that his bold statement is completely false. The majority of interceptions came from quarterbacks who rush the ball much less than average.

Score: 7.5/10
KW comments: You've used appropriate methods here - but I think that there is more that you could do. For example: there are multiple metrics for rushing in the dataset; it would be a good idea to test them all. You could also quantify these relationships with correlation coefficients, and draw out some specific examples from the data as well for discussion.
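Testing every rushing metric against interceptions can be done in one step with `DataFrame.corrwith`, which returns one Pearson coefficient per column. This is a minimal sketch: the column names come from the dataset description, `rushing_vs_interceptions` is a hypothetical helper, and the table is invented for illustration.

```python
import pandas as pd

RUSHING_COLUMNS = ["Rush", "RushYds", "RushYdsG", "RushAvg"]

def rushing_vs_interceptions(df, rushing_columns=RUSHING_COLUMNS):
    """Correlate each rushing metric with interceptions thrown."""
    return df[rushing_columns].corrwith(df["Int"])

# Invented values for illustration only (not the 2015 statistics).
toy = pd.DataFrame({
    "Rush":     [10, 20, 30, 40],
    "RushYds":  [30, 90, 100, 200],
    "RushYdsG": [2.0, 6.0, 6.5, 12.5],
    "RushAvg":  [3.0, 4.5, 3.3, 5.0],
    "Int":      [16, 12, 8, 4],
})
result = rushing_vs_interceptions(toy)
print(result)
```

With the notebook's data the call would be `rushing_vs_interceptions(qb)`. Coefficients near zero or below zero across all four metrics would contradict Taylor's claim more convincingly than a single plot.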

Question 8: The NFL quarterback rating is a composite of quarterbacks' statistical performances, including information on completion percentage, passing yards per attempt, touchdown passes, and interceptions. You can read more here: https://en.wikipedia.org/wiki/Passer_rating#NFL_formula.

Based on the data provided, which of these four variables in your dataset (completion percentage, passing yards per attempt, touchdown passes thrown, and interceptions thrown) appears to be most strongly correlated to the quarterback's passer rating? Hint: if you are feeling ambitious, this can all be done in a for loop!

In [18]:

for qbs in ['YdsPerA','Int','TD', 'Pct']:
    sns.kdeplot(qb_rates[qbs], shade = True, label = qbs)

Out[18]:

/projects/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

In [19]:

sns.lmplot(x = 'YdsPerA', y = 'QBRat',   data = qb_rates)

Out[19]:

<seaborn.axisgrid.FacetGrid at 0x7f6589c730f0>

In [20]:

sns.lmplot(x = 'Pct', y = 'QBRat',   data = qb_rates)

Out[20]:

<seaborn.axisgrid.FacetGrid at 0x7f6582b46518>

In [21]:

sns.lmplot(x = 'TD', y = 'QBRat',   data = qb_rates)

Out[21]:

<seaborn.axisgrid.FacetGrid at 0x7f6582ab4518>

In [22]:

sns.lmplot(x = 'Int', y = 'QBRat',   data = qb_rates)

Out[22]:

<seaborn.axisgrid.FacetGrid at 0x7f6582a9f4e0>

In [23]:

mylist = ['Pct','YdsPerA','Int','TD']
for string in mylist:
    print(qb_rates[string].corr(qb_rates['QBRat']))

Out[23]:

0.622821441429
0.778227501315
-0.483321642197
0.703262495974

Of the four variables tested, the one with the strongest correlation is the number of passing yards per attempt. Above I have placed four graphs, each showing the relationship between quarterback rating and one of the four variables. Three of the graphs show the data in a scattered and disorderly fashion with no sense of correlation, while the one for passing yards per attempt has the data clustered together and proceeding in a particular direction.

Score: 9/10
KW comments: Nicely done with the `for` loop! I wouldn't necessarily agree with your interpretation, however; all four variables are correlated with Quarterback Rating, which the charts do show. It is just that yards per attempt happens to be the strongest.
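Labelling the coefficients and sorting them by absolute value makes the comparison explicit. The sketch below reuses the four values printed by the loop, in the loop's order (Pct, YdsPerA, Int, TD). It assumes each printed value carries a leading "0.", since a correlation coefficient must lie between -1 and 1.

```python
import pandas as pd

# Coefficients from the notebook's loop output, labelled in loop order.
# Assumption: every value lies in [-1, 1], as a correlation must.
r = pd.Series({
    "Pct": 0.622821441429,
    "YdsPerA": 0.778227501315,
    "Int": -0.483321642197,
    "TD": 0.703262495974,
})

# Rank by strength, ignoring sign; a strong negative r is still strong.
ranked = r.reindex(r.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting on the absolute value matters because interceptions are negatively related to rating, so ranking the raw values would understate that relationship.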

Question 9: Now, you pick! Conduct an exploratory data analysis of any column in the dataset (with the exception of quarterback rating, which you've already done). Options for analysis include an examination of univariate characteristics (e.g. the data distribution) and multivariate relationships with other columns.

In [24]:

sns.barplot(x = 'Fum', y = 'Name', data = qb_rates)

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f65829f9940>

In [25]:

sns.barplot(x = 'Sack', y = 'Name', data = qb_rates)

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f658294d6d8>

In [26]:

for qbs in ['Fum','Sack']:
    sns.kdeplot(qb_rates[qbs], shade = True, label = qbs)

Out[26]:

/projects/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

I decided to research the relationship between the number of fumbles and the number of sacks a quarterback has encountered. I then discovered that most of the quarterbacks have around the same number of fumbles as well as sacks. Judging by this, I can accurately hypothesize that quarterbacks tend to fumble the ball mostly when they are sacked.

Score: 7.5/10
KW comments: You've chosen a very interesting comparison here. I'd recommend a different approach, however. The bar charts are a good start; with the next step, use a scatterplot to determine the relationship between fumbles and sacks, and then draw out some examples on either end of the distributions. For example, Blake Bortles fumbles the most but is also sacked the most; this is worth mentioning.
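Alongside a scatterplot, a rough per-quarterback ratio of fumbles to sacks can help draw out examples at either end. This is only a sketch: `fumbles_per_sack` is a hypothetical helper, the rows are invented, and the ratio is a crude indicator because the dataset does not say which fumbles actually happened on sacks.

```python
import numpy as np
import pandas as pd

def fumbles_per_sack(df):
    """Rank quarterbacks by fumbles per sack taken (NaN when never sacked)."""
    sacks = df["Sack"].replace(0, np.nan)  # avoid dividing by zero
    out = df[["Name", "Fum", "Sack"]].copy()
    out["FumPerSack"] = df["Fum"] / sacks
    return out.sort_values("FumPerSack", ascending=False)

# Invented rows for illustration only (not the 2015 statistics).
toy = pd.DataFrame({
    "Name": ["QB A", "QB B", "QB C"],
    "Fum": [2, 6, 0],
    "Sack": [20, 30, 0],
})
rates = fumbles_per_sack(toy)
print(rates)
```

With the notebook's data the call would be `fumbles_per_sack(qb)`, and `qb["Fum"].corr(qb["Sack"])` would quantify the overall relationship.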

Question 10: Understandably, our analysis here is limited to the techniques we've learned to this point in the course as well as the available data. If you could extend your analysis of NFL quarterbacks - either with additional data or different ways of analyzing the data - what would you want to do?

One way I could change up the data is to study quarterbacks' statistics under different weather conditions. It would be interesting to see whether a quarterback plays better in the heat, in the cold, when it's raining, or when it's windy. I would also like to add two columns to the dataset. The first would be how many touchdowns they scored themselves rather than by throwing a touchdown pass. The second would be the number of injuries during the season. If a quarterback has an injury, then I believe that their statistics will be drastically affected.

Score: 8/10
KW comments: Good ideas here; how might you analyze this information? Provide more details.
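If game-level data with a weather column were available, a grouped average would be one simple way to analyze it. Everything in the sketch below is hypothetical: qb2015.csv holds season totals only, so the `games` table, its `Weather` column, and all of its values are invented to show the technique.

```python
import pandas as pd

# Hypothetical game-level table; the exam data has no such rows or columns.
games = pd.DataFrame({
    "Name": ["QB A", "QB A", "QB A", "QB B", "QB B", "QB B"],
    "Weather": ["clear", "rain", "clear", "clear", "rain", "rain"],
    "PassYds": [300, 200, 280, 250, 240, 260],
})

# Average passing yards per quarterback under each weather condition,
# reshaped so that each condition becomes its own column.
by_weather = games.groupby(["Name", "Weather"])["PassYds"].mean().unstack()
print(by_weather)
```

The same pattern would apply to an injury indicator: group by whether the quarterback was injured and compare the averages.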
