CoCalc -- exam-2.ipynb

Rise of Terrorism Project Report

Micah Jennings - GEOG 30323: Data Analysis & Visualization

Kernel: Anaconda (Python 3)

Score: 86.5/100

Geography 30323: Take-home exam #2

This second take home exam will continue to evaluate your skills in using Python for data analysis, with a focus on data wrangling and data visualization.

The examination consists of a series of structured and semi-structured questions that you will answer with data analysis. In this examination, you will answer a series of questions which pertain to two different datasets I've provided for you. There is a total possible score of 100; your scores will then be weighted as this take-home examination constitutes 16 percent of your overall course grade.

Your responses to the following questions should consist of the following components:

The code used to answer the question, and any results you've produced;
Visualizations, when appropriate;
A short write-up in Markdown explaining what you did, and analyzing/interpreting your results when appropriate.

Responses that receive 80% credit and up will show evidence of effort beyond the minimum necessary to answer the question. This could include but is not limited to noticeable thoughtfulness in the interpretation of the data; responses that blend computation, visualization, and description quite well; or evidence of going beyond the basic parameters of the assignment.

This examination should be completed in this Jupyter Notebook, like your weekly assignments. It is due by the end of the day on Sunday, November 20 for full credit, and will incur a 10 percent penalty (equivalent to 1.6 percent of your overall grade for the semester!) for each day it is submitted past the deadline.

This exam is more open-ended than Exam 1 - so take care with your responses!

Part I: More with baby names!

In Assignment 9, you got practice working with the babynames dataset derived from the Social Security Administration. In the first part of this take-home examination, you are going to extend your work in that assignment to answer a series of questions. As a reminder: the dataset is available from the following URL: http://personal.tcu.edu/kylewalker/data/babynames.csv.

Question 1 (10 points): What were the ten most popular female baby names in the United States during the 1960s? How did their popularity vary throughout the decade?

In [1]:

import pandas as pd
import seaborn as sns
pd.options.mode.chained_assignment = None # Turning off the "SettingWithCopyWarning"

df = pd.read_csv('http://personal.tcu.edu/kylewalker/data/babynames.csv')

df.head()

Out[1]:

In [53]:

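# Take the 50 largest single-year counts for female names and list the distinct names among them.
# Note: the year filter is inclusive of 1970, so it spans 1960-1970.
Sixtys = df[(df.year >= 1960) & (df.year <=1970) & (df.sex == 'F')].sort_values('n', ascending = False)[0:50].name.unique()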

Sixtys

Out[53]:

array(['Lisa', 'Mary', 'Jennifer', 'Susan', 'Linda', 'Karen', 'Michelle',
       'Kimberly', 'Donna', 'Patricia'], dtype=object)

In [54]:

SixtyPP= ['Lisa', 'Mary', 'Jennifer', 'Susan', 'Linda', 'Karen', 'Michelle',
       'Kimberly', 'Donna', 'Patricia']
web2 = df[(df.name.isin(SixtyPP)) & (df.sex == 'F') & (df.year >= 1960) &  (df.year <=1970)]
web2['per1000'] = web2.prop *1000
web2.head()

Out[54]:

In [55]:

sns.set_style('whitegrid')

afro = sns.factorplot(data = web2, x = 'year', y = 'n', hue = 'sex', col = 'name', palette = ['purple', 'red'], 
                       col_wrap = 2, markers = '.', join = True, sharey = False, size = 5)

afro.set_xticklabels(step = 10)
afro.set_axis_labels('Year', 'Number of names')

Out[55]:

<seaborn.axisgrid.FacetGrid at 0x7fb5bc214940>

In [56]:

wide1 = web2.pivot(index = 'name', columns = 'year', values = 'per1000')

sns.heatmap(wide1)

Out[56]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fb5b4acdda0>

In [57]:

import matplotlib.pyplot as plt

sns.heatmap(wide1, annot = True, cmap = 'Greens')

plt.xticks(rotation = 45)
plt.ylabel("")
plt.xlabel("")
plt.title("Most popular baby names in the 60s  - Female (rate per 1000)")

Out[57]:

<matplotlib.text.Text at 0x7fb5b49b5dd8>

What were the ten most popular female baby names in the United States during the 1960s? How did their popularity vary throughout the decade?

The ten most popular female names of the 1960s were Donna, Jennifer, Karen, Kimberly, Linda, Lisa, Mary, Michelle, Patricia and Susan, and tracing them through the decade led me to an interesting conclusion. Only four of the ten names actually increased in popularity, while the other six declined. The name that stood out to me was "Lisa". It rose through the first half of the decade and reached the highest rate of any name in the time frame (32 per 1000), but it began to decline after 1965. It declined to the point where it was surpassed by Jennifer, which, interestingly enough, was the least popular of the ten at the beginning of the 1960s (3.5 per 1000). The 1960s were also the decade of the Civil Rights movement, the Cuban Missile Crisis and the Vietnam War. Could these events have anything to do with the changes in name popularity? It's hard to say. Though this decade was good for some names, it was very unfortunate for others. Names like Mary, Susan and Linda suffered strong declines in popularity as the decade went on. In fact, Linda fell so far that it ended up the least popular of the ten names, with a rate of 4.8 per 1000.

Score: 9 / 10
KW comments: You've done a good job with this question; your visualizations are solid and your data work to get you there makes sense. One thing you might consider is calculating a grouped sum over the decade, though your method works for generating the requisite names for the heat map. In your response, you touch on some potential interesting analysis, but don't really follow through with it; consider ways that you could form more concrete hypotheses about the patterns you've observed.
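
The grouped sum over the decade mentioned in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `toy` frame and its counts are made up, and only the column names (`year`, `sex`, `name`, `n`) mirror the babynames data.

```python
import pandas as pd

# Toy stand-in for the babynames data (made-up counts, same column names).
toy = pd.DataFrame({
    'year': [1960, 1960, 1960, 1965, 1965, 1965, 1970],
    'sex':  ['F', 'F', 'M', 'F', 'F', 'M', 'F'],
    'name': ['Mary', 'Lisa', 'Michael', 'Mary', 'Lisa', 'Michael', 'Jennifer'],
    'n':    [500, 300, 900, 400, 700, 800, 1000],
})

# Keep female records from 1960 through 1969 only.
sixties = toy[(toy.sex == 'F') & (toy.year >= 1960) & (toy.year <= 1969)]

# Sum births per name across the whole decade, then keep the largest totals.
decade_totals = sixties.groupby('name')['n'].sum().sort_values(ascending=False)
top_names = decade_totals.head(10)
print(top_names)
```

Summing over the decade ranks names by their total across all ten years, rather than by whichever single year happened to be largest.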

Question 2 (10 points): For each decade since 1950, what were the most popular (the number one) male and female baby names?

In [58]:

Sixtys = df[(df.year >= 1950) & (df.year <=1959) & (df.sex == 'M')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[58]:

array(['Michael'], dtype=object)

In [59]:

Sixtys = df[(df.year >= 1960) & (df.year <=1969) & (df.sex == 'M')].sort_values('n', ascending = False)[0:1].name.unique()
Sixtys

Out[59]:

array(['Michael'], dtype=object)

In [60]:

Sixtys = df[(df.year >= 1970) & (df.year <=1979) & (df.sex == 'M')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[60]:

array(['Michael'], dtype=object)

In [61]:

Sixtys = df[(df.year >= 1980) & (df.year <=1989) & (df.sex == 'M')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[61]:

array(['Michael'], dtype=object)

In [62]:

Sixtys = df[(df.year >= 1990) & (df.year <=1999) & (df.sex == 'M')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[62]:

array(['Michael'], dtype=object)

In [63]:

Sixtys = df[(df.year >= 2000) & (df.year <=2010) & (df.sex == 'M')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[63]:

array(['Jacob'], dtype=object)

In [64]:

Sixtys = df[(df.year >= 1950) & (df.year <=1959) & (df.sex == 'F')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[64]:

array(['Linda'], dtype=object)

In [65]:

Sixtys = df[(df.year >= 1960) & (df.year <=1969) & (df.sex == 'F')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[65]:

array(['Lisa'], dtype=object)

In [66]:

Sixtys = df[(df.year >= 1970) & (df.year <=1979) & (df.sex == 'F')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[66]:

array(['Jennifer'], dtype=object)

In [67]:

Sixtys = df[(df.year >= 1980) & (df.year <=1989) & (df.sex == 'F')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[67]:

array(['Jennifer'], dtype=object)

In [68]:

Sixtys = df[(df.year >= 1990) & (df.year <=1999) & (df.sex == 'F')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[68]:

array(['Jessica'], dtype=object)

In [69]:

Sixtys = df[(df.year >= 2000) & (df.year <=2010) & (df.sex == 'F')].sort_values('n', ascending = False)[0:1].name.unique()

Sixtys

Out[69]:

array(['Emily'], dtype=object)

In [70]:

SixtyGirl = ['Linda', 'Lisa', 'Jennifer', 'Jennifer', 'Jessica','Emily']

web2 = df[(df.name.isin(SixtyGirl)) & (df.sex == 'F') & (df.year >= 1960) &  (df.year <=2010)]
web2['per1000'] = web2.prop *1000
web2.head()

Out[70]:

In [71]:

missess = web2.pivot(index = 'year', columns = 'name', values = 'per1000')
missess.plot()
plt.ylabel("Popculture names per 1000", fontsize = 12)
plt.xlabel("")
plt.title("Modernized Names in the World", fontsize = 15)
plt.legend(title = "", fontsize = 12)

Out[71]:

<matplotlib.legend.Legend at 0x7fb5b4a08668>

In [72]:

SixtyMen = ['Michael', 'Jacob']
Yearly = [1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010]
web2 = df[(df.name.isin(SixtyMen)) & (df.sex == 'M') & (df.year.isin(Yearly))]
web2['per1000'] = web2.prop *1000
web2.head()

Out[72]:

In [73]:

misster = web2.pivot(index = 'year', columns = 'name', values = 'per1000')
misster.plot()

Out[73]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fb5b6b3ce80>

In [74]:

# `Sir` was never defined in this notebook, which raised a NameError.
# Assumption: `misster` (the year-by-name table pivoted in the previous cell) is the intended input.
sns.heatmap(misster, annot = True, cmap = 'Greens')

plt.xticks(rotation = 45)
plt.ylabel("")
plt.xlabel("")
plt.title("Most popular male baby names, 1960-2010 (rate per 1000)")

Out[74]:

What were the most popular (the number one) male and female baby names?

The results for this question proved to be quite interesting. For male names, Michael was the most popular name in each of the five decades from 1950 to 2000. From 2000 to 2010, however, Jacob finally surpassed Michael. In the line graph above for the two male names, you can see that the popularity of Michael decreased while the popularity of Jacob increased until the two lines intersected around the late 1990s. In the year 2000, "Jacob" came out on top. My guess is that the popularity of Michael was driven by the successful careers of famous people who carried the name, such as Michael Jordan or Michael Jackson; once their careers began to decline, so did the popularity of the name.

For females, the most popular name changed repeatedly over the decades. From 1950 to 1960 the most popular name was Linda, and from 1960 to 1970 it was Lisa. Jennifer held the top spot from 1970 to 1990, Jessica from 1990 to 2000, and Emily from 2000 to 2010. In the line graph that I created, you can see that each name peaked at a different time. After close examination, I conclude that Jennifer was the most popular female name of the period overall, because it holds the record for the highest popularity rate among these names. Still, just like the other names, once it reached its peak it declined. Another takeaway from this analysis is that Emily was very unpopular at the beginning of the period, but over time it grew while the others fell.

Score: 8.5 / 10
KW comments: You are on the right track here with your analysis; you've picked out the names with the greatest rate in a given year for each decade, and then used some visualizations to illustrate some trends around these names. I'd encourage you to be more systematic and conclusive in your comments. For example, as with the previous question, you approach some analysis but could do more to back up your assertions in the response with data (e.g. your mention of Michael Jordan & Michael Jackson). Also: while you have great specific examples in your analysis, your writing does not incorporate this data, but rather talks about your results in general terms. Seek to tighten this connection.
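
One way to answer this question for every decade in a single pass, instead of one cell per decade, is sketched below. It is a hedged illustration: the `toy` frame and its counts are invented, and the `decade` column is a helper introduced here, not part of the babynames data.

```python
import pandas as pd

# Toy stand-in for the babynames data (made-up counts, same column names).
toy = pd.DataFrame({
    'year': [1950, 1955, 1950, 1955, 1960, 1965, 1960, 1965],
    'sex':  ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
    'name': ['Michael', 'James', 'Linda', 'Mary', 'Michael', 'David', 'Lisa', 'Mary'],
    'n':    [900, 800, 700, 600, 950, 700, 650, 500],
})

# Label each record with the first year of its decade (1955 -> 1950).
toy['decade'] = (toy.year // 10) * 10

# Total births per decade, sex and name.
totals = toy.groupby(['decade', 'sex', 'name'], as_index=False)['n'].sum()

# Within each decade/sex group, keep the row with the largest total.
top = totals.loc[totals.groupby(['decade', 'sex'])['n'].idxmax()]
print(top[['decade', 'sex', 'name', 'n']])
```

Because the decade label is computed from the year, the same code covers any range of years without editing the filter bounds by hand.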

Question 3 (10 points): Choose one name - male or female - and trace its popularity since 1880. When did the name you've chosen "peak" in popularity? What trends over time do you observe in its popularity?

In [2]:

prince = ['Ben', 'Anna']

sub2 = df[(df.name.isin(prince)) & (df.sex == 'M') & (df.year >= 1880) & (df.year <=2010)]

sub2['n'] = sub2.prop * 1000

sub2.head()

Out[2]:

In [3]:

wide2 = sub2.pivot(index = 'year', columns = 'name', values = 'n')

wide2.plot()

Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3ec5154a90>

In [4]:

disney_yrs = {'Ben': 1890, 
             'Anna': 1900}

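# .loc is the label-based replacement for the .ix indexer, which was removed in pandas 1.0.
# The loop variable is named `name` so that it no longer shadows the `prince` list.
for name, year in disney_yrs.items(): 
    print(name + ': ' + str(wide2[name].loc[year]))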

Out[4]:

Ben: 3.032556119
Anna: 0.154175377575

In [6]:

import matplotlib.pyplot as plt
wide2.plot()

plt.annotate('Ben', xy = (1890, 3.03), xycoords = 'data', 
            xytext = (1985, 0.64), textcoords = 'data', arrowprops = dict(arrowstyle = '->'), 
            fontsize = 10)

plt.annotate('Anna', xy = (1900, 0.15), xycoords = 'data', 
            xytext = (1900, 0.15), textcoords = 'data', arrowprops = dict(arrowstyle = '->'), 
            fontsize = 10)

Out[6]:

<matplotlib.text.Annotation at 0x7f3ec4e03ef0>

For this question, the name that I chose to research is "Ben". The popularity of "Ben" reached its peak around 1890, at roughly 3 per 1000. As the years went by, its popularity declined sharply. Though there were times when the name increased, it was only for the span of about a year before it dropped even further. My guess is that as more people immigrated to the United States, the name Ben began to really lose its popularity. What we have to realize is that the data accounts for every demographic in the United States, and names that are mostly used by whites will lose popularity as the populations of African Americans, Mexicans and Asian Americans in the country increase.

Score: 8.5 / 10
KW comments: Good analysis here; given your focus on Ben, I would recommend removing Anna from the dataset as you are analyzing male Annas, which isn't relevant to this analysis. I do think a bit deeper discussion would be warranted as well; Ben was a very popular name in the 1800s but may have become less so given the overall diffusion of name choices since then.
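
The peak year of a single name can also be read directly from the data rather than from the chart. The sketch below assumes a made-up `toy` frame with the same column names as the babynames data; the proportions are invented for illustration.

```python
import pandas as pd

# Toy stand-in for one name's records (made-up proportions).
toy = pd.DataFrame({
    'year': [1880, 1890, 1900, 1950, 2000],
    'sex':  ['M'] * 5,
    'name': ['Ben'] * 5,
    'prop': [0.0025, 0.0030, 0.0022, 0.0008, 0.0003],
})

# Restrict to one name and one sex, indexed by year.
ben = toy[(toy.name == 'Ben') & (toy.sex == 'M')].set_index('year')
per1000 = ben.prop * 1000

# idxmax returns the index label (here, the year) of the largest value.
peak_year = per1000.idxmax()
peak_rate = per1000.max()
print(peak_year, round(peak_rate, 2))
```

Filtering on a single name also avoids carrying an unrelated second name through the pivot.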

Question 4 (20 points): Certain names wax and wane in popularity over time; for example, some names that were very popular in the early 1900s are rarely seen today. However, some names have had remarkable staying power over time. Design an analysis to identify those names that have been the most consistently popular throughout the dataset.

In [7]:

Conname = df[(df.year >= 1900) & (df.year <=2010) & (df.sex == 'M')].sort_values('n', ascending = False)[0:430].name.unique()

Conname

Out[7]:

array(['James', 'Michael', 'Robert', 'John', 'David', 'William',
       'Christopher', 'Richard', 'Mark', 'Jason', 'Matthew', 'Thomas',
       'Joshua', 'Charles', 'Gary', 'Daniel', 'Steven'], dtype=object)

In [8]:

ConMan= ['James', 'Micheal', 'Robert', 'John', 'David', 'William', 'Chistopher',
       'Richard', 'Mark', 'Jason', 'Matthew', 'Thomas','Joshua','Charles', 'Gary', 'Daniel','Steven']
arm = df[(df.name.isin(ConMan)) & (df.sex == 'M') & (df.year >= 1900) &  (df.year <=2010)]
arm['per1000'] = arm.prop *1000
arm.head()

Out[8]:

In [9]:

import cufflinks as cf
cf.go_offline()
gun = arm.pivot(index = 'year', columns = 'name', values = 'n')
gun.iplot(subplots = True, kind = 'line', shape = (6, 3), subplot_titles = True, 
           shared_yaxes = True, shared_xaxes = True, legend = False, title = '')

Out[9]:

For this question, I decided to look at consistency among the top 17 names from 1900 to 2010. The graphs above show the names that were highest on the list and how often each was used over the allotted time frame. If you look at the graphs, you can see that very few names show a consistent pattern; for most, general popularity is ever-changing throughout the years. In these graphs the names Joshua, Micheal and Steven show the most consistency, with Micheal being the most consistent. In the previous questions we saw that the name Micheal had huge changes in popularity, but looking at it over a wider time frame, we can see that "Micheal" has not changed as much as we thought. In fact, there is barely a change at all.

Score: 16 / 20
KW comments: I really like what you do have here; the small multiples Plotly chart is excellent and illuminative. That said - your analysis does not always do here what you say it is doing. You've fetched those names that have the greatest frequencies in any given year during your specified time period; it would be more effective to determine this using the `prop` column given that the baseline population values change quite a bit since 1900. As such - it is difficult to see how this empirically represents consistent popularity; in fact, some of the names on your chart - e.g. "Jason" - weren't very popular for a large proportion of the study range. Further - take care with some misspellings, which crept their way into your analysis; while there clearly are some Micheals in your dataset, "Michael" would give you a lot more to work with.
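
A `prop`-based measure of consistent popularity, as suggested in the comments above, might look like the following sketch. It is one possible design, not the required one: the `toy` data are invented, and the `TOP_N` cutoff is a hypothetical parameter chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the babynames data (made-up proportions).
toy = pd.DataFrame({
    'year': [1900, 1900, 1900, 1950, 1950, 1950, 2000, 2000, 2000],
    'sex':  ['M'] * 9,
    'name': ['John', 'James', 'Jason'] * 3,
    'prop': [0.060, 0.050, 0.0001, 0.040, 0.045, 0.001, 0.010, 0.008, 0.012],
})

males = toy[toy.sex == 'M'].copy()

# Rank names within each year by proportion (1 = most popular that year).
males['rank'] = males.groupby('year')['prop'].rank(ascending=False, method='min')

# Count in how many distinct years each name placed within the cutoff.
TOP_N = 2  # hypothetical cutoff; something like 50 may suit the full dataset
years_in_top = (males[males['rank'] <= TOP_N]
                .groupby('name')['year']
                .nunique()
                .sort_values(ascending=False))
print(years_in_top)
```

Ranking within each year uses `prop` rather than raw counts, so growth in the number of births over the century does not favor recent names.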

Part II: The Ebola epidemic of 2014-2015

In the second part of this exam, you'll be analyzing data regarding the 2014-2015 Ebola outbreak in West Africa. You may already be familiar with some of the details; however, it might be useful for you to familiarize yourselves with the epidemic from the following links:

Wikipedia: https://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa
CDC: http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
World Health Organization: http://www.who.int/csr/disease/ebola/en/

You'll be working with a dataset compiled from WHO data on the Ebola outbreak and made available via the Humanitarian Data Exchange (https://data.hdx.rwlabs.org/). The dataset itself comes from the following page: https://data.hdx.rwlabs.org/dataset/ebola-cases-2014 and has four columns:

Indicator: The type of information represented by the row, in regards to the Ebola outbreak. A note: not every indicator is represented for every country at every date in the dataset.
Country: The country corresponding to the row;
Date: The date corresponding to the record, in YYYY-MM-DD format.
value: The number of cases represented by the record.

I'll give you a little help reading in the file. While it is fairly straightforward, you are going to use a new parameter in pd.read_csv(), parse_dates, to tell pandas to automatically recognize the values in the Date field as dates. Run the cell below to get started and generate a new data frame named df.

In [78]:

import pandas as pd

ebola_data = 'ebola.csv'

df = pd.read_csv(ebola_data, parse_dates = ['Date'])
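
To see what `parse_dates` changes, the sketch below reads the same small CSV text twice, once with and once without the parameter. The two rows and their values are made up for illustration; only the column names follow the dataset description above.

```python
import io
import pandas as pd

# Made-up two-row sample with the same four columns as the Ebola file.
csv_text = """Indicator,Country,Date,value
Cumulative number of confirmed Ebola cases,Guinea,2014-08-29,482.0
Cumulative number of confirmed Ebola cases,Liberia,2014-08-29,322.0
"""

# Without parse_dates the Date column is read as text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the Date column becomes a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(plain.Date.dtype)
print(parsed.Date.dtype)

# Datetime columns expose the .dt accessor, e.g. for extracting the month.
print(parsed.Date.dt.month.tolist())
```

Having real datetimes makes it possible to sort, resample and plot the records along a proper time axis.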

Certainly, there is ample information about the Ebola outbreak online, including data analyses and visualizations. You are welcome to use those visualizations and that information to inspire your responses. However, your responses themselves must be generated by you in Python with replicable Python code.

Take care with your responses. As mentioned above, not all indicators are available for every country at each date. Further, some indicators are only measured at specific time intervals. Also: make sure you understand what the different indicators represent before performing operations with them or using them in an analysis.

Question 5 (10 points): Explain the basic parameters of the dataset. What "shape" is the dataset in? What information is contained in the dataset? What does each row of the dataset represent?

In [79]:

df.head()

Out[79]:

In [80]:

df.Indicator.unique()

Out[80]:

array(['Case fatality rate (CFR) of confirmed Ebola cases',
       'Case fatality rate (CFR) of confirmed, probable and suspected Ebola cases',
       'Case fatality rate (CFR) of probable Ebola cases',
       'Case fatality rate (CFR) of suspected Ebola cases',
       'Cumulative number of confirmed Ebola cases',
       'Cumulative number of confirmed Ebola deaths',
       'Cumulative number of confirmed, probable and suspected Ebola cases',
       'Cumulative number of confirmed, probable and suspected Ebola deaths',
       'Cumulative number of probable Ebola cases',
       'Cumulative number of probable Ebola deaths',
       'Cumulative number of suspected Ebola cases',
       'Cumulative number of suspected Ebola deaths',
       'Number of confirmed Ebola cases in the last 21 days',
       'Number of confirmed Ebola cases in the last 7 days',
       'Number of confirmed Ebola deaths in the last 21 days',
       'Number of confirmed, probable and suspected Ebola cases in the last 21 days',
       'Number of confirmed, probable and suspected Ebola cases in the last 7 days',
       'Number of confirmed, probable and suspected Ebola deaths in the last 21 days',
       'Number of probable Ebola cases in the last 21 days',
       'Number of probable Ebola cases in the last 7 days',
       'Number of probable Ebola deaths in the last 21 days',
       'Number of suspected Ebola cases in the last 21 days',
       'Number of suspected Ebola cases in the last 7 days',
       'Number of suspected Ebola deaths in the last 21 days',
       'Proportion of confirmed Ebola cases that are from the last 21 days',
       'Proportion of confirmed Ebola cases that are from the last 7 days',
       'Proportion of confirmed Ebola deaths that are from the last 21 days',
       'Proportion of confirmed, probable and suspected Ebola cases that are from the last 21 days',
       'Proportion of confirmed, probable and suspected Ebola cases that are from the last 7 days',
       'Proportion of confirmed, probable and suspected Ebola deaths that are from the last 21 days',
       'Proportion of probable Ebola cases that are from the last 21 days',
       'Proportion of probable Ebola cases that are from the last 7 days',
       'Proportion of probable Ebola deaths that are from the last 21 days',
       'Proportion of suspected Ebola cases that are from the last 21 days',
       'Proportion of suspected Ebola cases that are from the last 7 days',
       'Proportion of suspected Ebola deaths that are from the last 21 days'], dtype=object)

In [81]:

df.dtypes

Out[81]:

Indicator            object
Country              object
Date         datetime64[ns]
value               float64
dtype: object

In [82]:

df.describe

Out[82]:

<bound method NDFrame.describe of                                                Indicator       Country  \
    Case fatality rate (CFR) of confirmed Ebola cases        Guinea   
    Case fatality rate (CFR) of confirmed Ebola cases       Liberia   
    Case fatality rate (CFR) of confirmed Ebola cases  Sierra Leone   
    Case fatality rate (CFR) of confirmed Ebola cases       Nigeria   
    Case fatality rate (CFR) of confirmed Ebola cases        Guinea   
    Case fatality rate (CFR) of confirmed Ebola cases       Liberia   
    Case fatality rate (CFR) of confirmed Ebola cases  Sierra Leone   
    Case fatality rate (CFR) of confirmed Ebola cases       Nigeria   
    Case fatality rate (CFR) of confirmed Ebola cases       Senegal   
    Case fatality rate (CFR) of confirmed Ebola cases        Guinea   
   Case fatality rate (CFR) of confirmed Ebola cases       Liberia   
   Case fatality rate (CFR) of confirmed Ebola cases  Sierra Leone   
   Case fatality rate (CFR) of confirmed Ebola cases       Nigeria   
   Case fatality rate (CFR) of confirmed Ebola cases       Senegal   
   Case fatality rate (CFR) of confirmed Ebola cases        Guinea   
   Case fatality rate (CFR) of confirmed Ebola cases       Liberia   
   Case fatality rate (CFR) of confirmed Ebola cases  Sierra Leone   
   Case fatality rate (CFR) of confirmed Ebola cases       Nigeria   
   Case fatality rate (CFR) of confirmed Ebola cases       Senegal   
   Case fatality rate (CFR) of confirmed, probabl...        Guinea   
   Case fatality rate (CFR) of confirmed, probabl...       Liberia   
   Case fatality rate (CFR) of confirmed, probabl...  Sierra Leone   
   Case fatality rate (CFR) of confirmed, probabl...       Nigeria   
   Case fatality rate (CFR) of confirmed, probabl...        Guinea   
   Case fatality rate (CFR) of confirmed, probabl...       Liberia   
   Case fatality rate (CFR) of confirmed, probabl...  Sierra Leone   
   Case fatality rate (CFR) of confirmed, probabl...       Nigeria   
   Case fatality rate (CFR) of confirmed, probabl...       Senegal   
   Case fatality rate (CFR) of confirmed, probabl...        Guinea   
   Case fatality rate (CFR) of confirmed, probabl...       Liberia   
...                                                  ...           ...   
Proportion of suspected Ebola cases that are f...       Nigeria   
Proportion of suspected Ebola cases that are f...       Senegal   
Proportion of suspected Ebola cases that are f...        Guinea   
Proportion of suspected Ebola cases that are f...       Liberia   
Proportion of suspected Ebola cases that are f...  Sierra Leone   
Proportion of suspected Ebola cases that are f...        Guinea   
Proportion of suspected Ebola cases that are f...       Liberia   
Proportion of suspected Ebola cases that are f...  Sierra Leone   
Proportion of suspected Ebola cases that are f...        Guinea   
Proportion of suspected Ebola cases that are f...       Liberia   
Proportion of suspected Ebola cases that are f...  Sierra Leone   
Proportion of suspected Ebola cases that are f...        Guinea   
Proportion of suspected Ebola cases that are f...       Liberia   
Proportion of suspected Ebola cases that are f...  Sierra Leone   
Proportion of suspected Ebola cases that are f...        Guinea   
Proportion of suspected Ebola cases that are f...       Liberia   
Proportion of suspected Ebola cases that are f...  Sierra Leone   
Proportion of suspected Ebola cases that are f...        Guinea   
Proportion of suspected Ebola cases that are f...       Liberia   
Proportion of suspected Ebola cases that are f...  Sierra Leone   
Proportion of suspected Ebola deaths that are ...        Guinea   
Proportion of suspected Ebola deaths that are ...       Liberia   
Proportion of suspected Ebola deaths that are ...  Sierra Leone   
Proportion of suspected Ebola deaths that are ...        Guinea   
Proportion of suspected Ebola deaths that are ...       Liberia   
Proportion of suspected Ebola deaths that are ...  Sierra Leone   
Proportion of suspected Ebola deaths that are ...       Nigeria   
Proportion of suspected Ebola deaths that are ...        Guinea   
Proportion of suspected Ebola deaths that are ...       Liberia   
Proportion of suspected Ebola deaths that are ...  Sierra Leone   

            Date  value  
   2014-08-29   60.0  
   2014-08-29   70.0  
   2014-08-29   41.0  
   2014-08-29   40.0  
   2014-09-05   60.0  
   2014-09-05   70.0  
   2014-09-05   39.0  
   2014-09-05   39.0  
   2014-09-05    0.0  
   2014-09-12   59.0  
  2014-09-12   76.0  
  2014-09-12   37.0  
  2014-09-12   37.0  
  2014-09-12    0.0  
  2014-09-16   58.0  
  2014-09-16   71.0  
  2014-09-16   35.0  
  2014-09-16   37.0  
  2014-09-16    0.0  
  2014-08-29   66.0  
  2014-08-29   50.0  
  2014-08-29   41.0  
  2014-08-29   37.0  
  2014-09-05   64.0  
  2014-09-05   58.0  
  2014-09-05   39.0  
  2014-09-05   36.0  
  2014-09-05    0.0  
  2014-09-12   65.0  
  2014-09-12   55.0  
...          ...    ...  
2014-09-18  100.0  
2014-09-18    0.0  
2014-09-24   79.0  
2014-09-24   64.0  
2014-09-24   59.0  
2014-10-01   62.0  
2014-10-01   60.0  
2014-10-01   56.0  
2014-10-08   88.0  
2014-10-08   55.0  
2014-10-08   64.0  
2014-10-15   91.0  
2014-10-15   40.0  
2014-10-15   60.0  
2014-10-22    0.0  
2014-10-22   13.0  
2014-10-22   18.0  
2014-10-29  100.0  
2014-10-29   19.0  
2014-10-29   19.0  
2014-08-29    0.0  
2014-08-29    8.0  
2014-08-29   25.0  
2014-09-05   33.0  
2014-09-05   67.0  
2014-09-05   55.0  
2014-09-05    0.0  
2014-09-08    0.0  
2014-09-08   71.0  
2014-09-08   55.0  

[16972 rows x 4 columns]>

The dataset that we received to use in our research is a CSV file of 16,972 rows by 4 columns. Indicator and Country are stored as objects (strings), Date as a datetime, and value as a float. The dataset contains information about the outbreak of the Ebola virus in every country that recorded a case of it. There are numerous indicators in the dataset, such as the case fatality rate of confirmed cases, which is the rate at which people who contracted the disease succumbed to it. The cumulative numbers of cases, along with the numbers and proportions of confirmed, suspected and probable cases, offer different ways of analyzing the population of those who had Ebola. The dataset also records the deaths that occurred due to the virus.

Score: 8.5 / 10
KW comments: This is straightforward and on the right track. I would encourage you to provide a deeper look at the data; one issue with this dataset, for example, is that it is not complete for every time period for every indicator; as such, be sure to account for its limitations, which have implications for further analysis.
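
The completeness issue raised in the comments above can be checked by counting how many dates are recorded for each indicator and country. The sketch below is an illustration on a made-up `toy` frame with the same four columns; the values are invented.

```python
import pandas as pd

# Made-up sample in the same long layout as the Ebola file.
toy = pd.DataFrame({
    'Indicator': ['Cumulative number of confirmed Ebola cases'] * 5,
    'Country': ['Guinea', 'Guinea', 'Guinea', 'Liberia', 'Senegal'],
    'Date': pd.to_datetime(['2014-08-29', '2014-09-05', '2014-09-12',
                            '2014-08-29', '2014-09-05']),
    'value': [400.0, 450.0, 500.0, 300.0, 1.0],
})

# (rows, columns) of the table.
print(toy.shape)

# Number of records plus first and last report date per indicator and country.
coverage = (toy.groupby(['Indicator', 'Country'])['Date']
            .agg(['count', 'min', 'max']))
print(coverage)
```

Unequal counts across countries signal that comparisons on a given date may be missing some countries.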

Question 6 (20 points): Describe the overall global characteristics of the Ebola epidemic, using the information available to you in the dataset. Questions you could address include: How many people have been infected during the epidemic? How many people have died? Globally, when did the epidemic "peak"? Pursue any other analyses you feel would be interesting and relevant to this question.

In [83]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
import cufflinks as cf
from ipywidgets import interact
from pandas_datareader import wb
ctrys = ['Guinea','Liberia', 'Sierra Leone','Nigeria','Senegal', 'United States of America','Spain','Mali','United Kingdom','Italy'] 
sick = df[(df.Country.isin(ctrys)) & (df.Indicator == 'Cumulative number of confirmed Ebola cases')]
sick['value'] = sick.value 
sick.head()

Out[83]:

In [84]:

sns.barplot(x = 'value', y = 'Country', data = sick)

Out[84]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fb5b47ba2b0>

In [85]:

cf.go_offline()
eb = sick.pivot(index = 'Date', columns = 'Country', values = 'value')
eb.iplot(subplots = True, kind = 'line', shape = (5, 2), subplot_titles = True, 
           shared_yaxes = True, shared_xaxes = True, legend = False, title = '')

Out[85]:

In [86]:

ctrys = ['Guinea','Liberia', 'Sierra Leone','Nigeria','Senegal', 'United States of America','Spain','Mali','United Kingdom','Italy']
sickdeath = df[(df.Country.isin(ctrys)) & (df.Indicator == 'Number of confirmed, probable and suspected Ebola cases in the last 21 days')]
sickdeath['value'] = sickdeath.value 
sickdeath.head()

Out[86]:

In [87]:

wide2 = sickdeath.pivot(index = 'Date', columns = 'Country', values = 'value')

wide2.plot()
plt.ylabel("Number of cases", fontsize = 12)
plt.xlabel("")
plt.title("Number of Confirmed,Probable and Suspected Cases", fontsize = 15)
plt.legend(title = "", fontsize = 12)

Out[87]:

<matplotlib.legend.Legend at 0x7fb5b45fa5f8>

In [95]:

ctrys = ['Guinea','Liberia', 'Sierra Leone','Nigeria','Senegal', 'United States of America','Spain','Mali','United Kingdom','Italy'] 
sicky = df[(df.Country.isin(ctrys)) & (df.Indicator == 'Cumulative number of confirmed Ebola deaths')]
sicky['value'] = sicky.value 
sicky.head()

Out[95]:

In [97]:

sns.barplot(x = 'value', y = 'Country', data = sicky)

Out[97]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fb5b454c7b8>

Describe the overall global characteristics of the Ebola epidemic, using the information available to you in the dataset. Questions you could address include: How many people have been infected during the epidemic? How many people have died? Globally, when did the epidemic "peak"? Pursue any other analyses you feel would be interesting and relevant to this question.

The Ebola virus had a huge impact on some of the countries we examined. Sierra Leone and Liberia saw enormous numbers of Ebola cases in 2014 and 2015: Sierra Leone had a cumulative total of about 8,000 cases and Liberia about 3,000. As for deaths, I can confidently say that Sierra Leone suffered the most at the hands of the Ebola virus. Of its roughly 8,000 cases, about 3,000 ended in death, more than in either Guinea (around 1,600 deaths) or Liberia. In December 2014 the virus reached its peak in Sierra Leone, with about 1,400 cases. However, this was not the highest count recorded in a single country. That record goes to Liberia, which peaked at about 1,700 cases in October 2014. If I were to guess, the outbreak as a whole reached its peak during November and December of 2014. What I find interesting is that Liberia had a sharp decline in cases after November 2014, likely due to the increase in health aid and regulations. During that time the cases and deaths in Sierra Leone skyrocketed, until they finally dropped sharply at the end of December.
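
One way to check the guess about the global peak would be to add up the 21-day case counts across countries for each report date and look for the largest total. The sketch below assumes a wide table shaped like `wide2` above (dates as the index, one column per country). The numbers are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical wide table: rows are report dates, columns are countries
wide = pd.DataFrame(
    {
        "Guinea": [100, 150, 120],
        "Liberia": [400, 300, 50],
        "Sierra Leone": [200, 350, 500],
    },
    index=pd.to_datetime(["2014-10-01", "2014-11-01", "2014-12-01"]),
)

# Sum across countries to get one global series (missing values are skipped)
global_cases = wide.sum(axis=1)

peak_date = global_cases.idxmax()   # date with the largest global total
peak_value = global_cases.max()     # size of that total

print(peak_date.date(), peak_value)
```

Plotting `global_cases` as a line would show the same peak visually.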

Score: 17 / 20
KW comments: This is a good response with some good interpretation of your findings. I do think it would be helpful to integrate your analysis, visualizations, and text a bit more, to help orient the reader in regards to your results. Additionally, take care with your chart choices. The bar plots are visualizing the mean by country of cumulative cases over time; this isn't as informative as the line plots, which are more appropriate for working with the time-series data.
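
To illustrate the point about the bar plots: `sns.barplot` aggregates with the mean by default, so a cumulative series gets averaged over every report date. A possible alternative, assuming the most recent report carries the running total, is to take each country's last value by date. The toy frame below is made up for the sketch.

```python
import pandas as pd

# Hypothetical long-format data: cumulative confirmed cases by report date
toy = pd.DataFrame({
    "Country": ["Liberia"] * 3 + ["Sierra Leone"] * 3,
    "Date": pd.to_datetime(["2014-10-01", "2014-11-01", "2014-12-01"] * 2),
    "value": [1000, 2500, 3000, 2000, 5000, 8000],
})

# What a default bar plot would show: the mean over all report dates
mean_by_country = toy.groupby("Country")["value"].mean()

# The most recent cumulative count per country
latest = toy.sort_values("Date").groupby("Country")["value"].last()

print(mean_by_country)
print(latest)
```

The `latest` series could then be drawn as a bar chart, for example with `latest.plot(kind="barh")`, so that each bar is a final total rather than an average.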

__Question 7 (20 points):__ Focus in this question on the three countries most impacted by the Ebola outbreak: Guinea, Sierra Leone, and Liberia.

Your job is to conduct a comparative analysis of the Ebola outbreak in these three countries. The parameters of your response are up to you; however, you might consider looking at how the scale of the outbreak has varied between the countries; when the epidemic appears to have "peaked" in each country; and how cumulative and new cases in the three countries have varied over time.

In [88]:

# The three most affected countries
ctrys = ['Guinea', 'Liberia', 'Sierra Leone']
# Keep only the rolling 21-day counts of confirmed, probable and suspected cases
sicken = df[(df.Country.isin(ctrys)) & (df.Indicator == 'Number of confirmed, probable and suspected Ebola cases in the last 21 days')]
sicken.head()

Out[88]:

In [89]:

wide2 = sicken.pivot(index = 'Date', columns = 'Country', values = 'value')

wide2.plot()

plt.ylabel("Number of cases", fontsize = 12)
plt.xlabel("")
plt.title("Number of Confirmed, Probable and Suspected Cases (Last 21 Days)", fontsize = 15)
plt.legend(title = "", fontsize = 12)

Out[89]:

<matplotlib.legend.Legend at 0x7fb5b478a470>

In [90]:

# The three most affected countries
ctrys = ['Guinea', 'Liberia', 'Sierra Leone']
# Keep only the share of confirmed deaths that occurred in the last 21 days
sickly = df[(df.Country.isin(ctrys)) & (df.Indicator == 'Proportion of confirmed Ebola deaths that are from the last 21 days')]
sickly.head()

Out[90]:

In [91]:

wide2 = sickly.pivot(index = 'Date', columns = 'Country', values = 'value')

wide2.plot()
plt.ylabel("Proportion of deaths", fontsize = 12)
plt.xlabel("")
plt.title("Proportion of Confirmed Ebola Deaths from the Last 21 Days", fontsize = 15)
plt.legend(title = "", fontsize = 12)

Out[91]:

<matplotlib.legend.Legend at 0x7fb5b4530ef0>

In [92]:

# The three most affected countries
ctrys = ['Guinea', 'Liberia', 'Sierra Leone']
# Keep only the cumulative confirmed case counts
sickman = df[(df.Country.isin(ctrys)) & (df.Indicator == 'Cumulative number of confirmed Ebola cases')]
sickman.head()

Out[92]:

In [93]:

wide2 = sickman.pivot(index = 'Date', columns = 'Country', values = 'value')

wide2.plot()
plt.ylabel("Cumulative Number of cases", fontsize = 12)
plt.xlabel("")
plt.title("Cumulative Number of Confirmed Cases", fontsize = 15)
plt.legend(title = "", fontsize = 12)

Out[93]:

<matplotlib.legend.Legend at 0x7fb5b4639630>

For this question, I focused on Guinea, Liberia and Sierra Leone. All three countries are located in Africa, where the Ebola virus originated. For this reason, the African countries suffered far more than the affected countries in the Americas and Europe. The United States had 4 cases while the United Kingdom had only 1, and nearly all of those patients survived. The countries in the Americas and Europe had the technology and resources to combat the disease, so there isn't much data on them to work with. Though Sierra Leone had the most cases of Ebola, a larger share of Liberia's infected population appears to have succumbed to the disease. For all three countries, the number of cases did not slow down until the beginning of 2015, when health assistance finally arrived. The disease peaked at different times: in Sierra Leone in December 2014, and in both Liberia and Guinea in October 2014. New cases were abundant at first, but they eventually died down as people came to understand the disease and how to prevent it. Again, this happened around January 2015. Comparing the first and third graphs for Question 7, you can see that the number of confirmed, probable and suspected cases of Ebola dropped at around the same time as the cumulative number of cases began to level out, to the point where no more cases were appearing.
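
The link between the cumulative curve leveling out and new cases dropping can be sketched directly: the change between consecutive cumulative reports approximates the new cases in each interval. The example assumes a wide table of cumulative counts like the one plotted above, and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical cumulative confirmed cases; rows are report dates
cumulative = pd.DataFrame(
    {
        "Guinea": [500, 900, 1000, 1000],
        "Sierra Leone": [1000, 3000, 7000, 7100],
    },
    index=pd.to_datetime(
        ["2014-10-01", "2014-11-01", "2014-12-01", "2015-01-01"]
    ),
)

# Difference between consecutive reports: approximate new cases per interval
new_cases = cumulative.diff()

# Date of the largest increase for each country (the first row is NaN and is skipped)
peak_dates = new_cases.idxmax()

print(new_cases)
print(peak_dates)
```

When a country's `new_cases` values approach zero, its cumulative line goes flat, which is the pattern described in the comparison of the first and third graphs.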

Score: 19 / 20
KW comments: This is a strong answer, with clear, readable charts and appropriate choices of indicators to analyze and visualize - nice work.
