This is a basic python activity supporting your assessments 2 in FIT5145 TP1 2017. Python is a general purpose programming language
Of course there are lots of alternatives:
Compute mathematical expressions:
3+5
Define variables and assign values: Formula syntax comes from C, so some expressions are interpreted differently from R: • e.g. to compute 2 cubed type 2**3 not 2^3
A = 10
5*A + 6
Define a vector:
B = [5,5,3,0] #Can use square or round brackets: B = (5,5,3,0)
B
Access part of vector:
# BEWARE: First index of array is 0 not 1!
B[0]
B[1:3]
Load a library with “from … import … as …” keywords, e.g.:
from matplotlib import pyplot as plt
# Can also use shorter “.” notaion rather than “from” to load library: import matplotlib.pyplot as plt
# plt is the name we are giving to the imported library.
# We use it whenever we want to call a function provided by the library, e.g., plt.boxplot(…)
The pandas library provides a table structure called DataFrame
# importing the library and name it as pd
import pandas as pd
# create the dataframe
df = pd.DataFrame({
'StudentID' : [264422,264423,264444,264445,264446], # StudentID column and its data
'Name' : ['Steven','Alex', 'Bill', 'Mark','Bob'], # Name column and its data
'EnrolYear' : [2010,2010, 2011, 2011,2013], # Enrolment year column and its data
'Math' : [100,90,90,40,60], # Mathematics score column and its data
'English' : [60,70,80,80,60] # English score column and its data
})
# print out the dataframe
df
# Select a column by using its column name:
df['Name']
# Select row when it meet conditions
df[df['Name'] == 'Alex'] # select record of Alex
# Selection with complex conditions
filt = (df.Name != 'Bob') & (df.StudentID > 264423) # select rows where name is not 'Bob' and student ID is larger than 2644423
df[filt]
# operation on columns
# create a total score column
df['Total'] = df['Math'] + df['English']
df
Have a look at the each subject in different rows, so we melt the table
#column name 'Math' and 'English' are now values of column 'variable'
df = pd.melt(df, id_vars=['EnrolYear','Name','StudentID'],value_vars=['Math','English'],var_name='Subject')
# you can also rename a column seperately:
df.rename(columns = {'value':'Score'}, inplace = True)
df
Now assume if you got another data table storing the height of the student and suppose now for some reason you want to merge two data sets:
# notice in df2, 'Bob' (ID: 264446) is missing
df2 = pd.DataFrame({
'StudentID' : [264422,264423,264444,264445], # StudentID column and its data
'Height' : [160,155,175,175], # Height column and its data
})
df2
merge = pd.merge(df,df2, on=['StudentID'])
merge # have a look
ufo_reports = pd.read_csv('uforeports.csv')
ufo_reports.head() # print the first 5 records out to have a look / ufo_reports.tail()
# very similar if we have a xls file instead
ufo_reports_xls = pd.read_excel('uforeports_excel.xls', sheetname = 'uforeports') # we will have to specify the sheet name
ufo_reports_xls.head() # print the first 5 records out to have a look
# notice that here we just have the same data in two different file format, and so python recognise the Time variable differently
Notice that Time columns in two dataframe are in indeed, in different formats, check it out!
print (ufo_reports.Time.dtypes) # just an object
print (ufo_reports_xls.Time.dtypes) # datetime format, this comes along with xls, you can format columns
What is the difference between object and datatime?
print(ufo_reports.Time[1])
type(ufo_reports.Time[1])
print(ufo_reports_xls.Time[1])
print(type(ufo_reports_xls.Time[1]))
ufo_reports_xls.Time[1].year
So in the dataframe reading from csv files, the Time column is just an object, we can change that to datatime format so they are consistent.
ufo_reports.Time = pd.to_datetime(ufo_reports.Time) # change to datetime format
print (ufo_reports.Time.dtypes) # print its datatype now
ufo_reports.head() # print the head to check whether it looks the same as the other dataframe
We could easily do some auditing or exploration of the data using python. Graphical data exploration for outlier identification could be found in Section 5. Here let's use the titanic data set.
titanic = pd.read_csv('titanic.csv') # read in titanic data into pandas dataframe
print (titanic.shape) # have a look at the dimension of the dataframe
titanic.head() # printing out the first few rows
# have a look at the data types of each columns
titanic.dtypes
We can do summary statistics to see the count, number of unique values, as well as the value range of numeric fields.
titanic.describe()
# looking at unique values of a column
titanic.age.value_counts()
Continue to use the titanic data set, let's focus on the age of passengers. We show how to look at the mean of column using describe() function (alternatively, you can get the mean by titanic['age'].mean()). Here we will go for a more detailed look at the age for different groups.
titanic.head()
# this groupby has created a series with a hierachical index. We can then apply methods to the individual groups
sex_class = titanic.groupby(['sex','class'])['age']
# have a look
sex_class.mean()
In a more sophisticated case, if you want to do statistics on multiple columns or if you want to apply different aggregate function to different columns, this is one way to go....
fun = {'who':{'passengers':'count'},'age':{'average age':'mean'}} # specify aggregation functions to columns 'who' and 'age'
# by doing this, we've also give the statistics a name,
# e.g., the number of records in 'who' is named as 'passengers', and,
# the mean values of 'age' is called 'average age'
groupbyClass = titanic.groupby('class').agg(fun) # groupby and apply functions
# you can have a look now
# groupbyClass
# we try to make it look pretty, i.e., make it as a dataframe, you can have a look at it by yourself step by step
groupbyClass = groupbyClass.reset_index() # reset its index
groupbyClass.columns = groupbyClass.columns.droplevel(0) # drop level 0 index
groupbyClass.rename(columns = {'':'class'},inplace = True) # rename the first column
# now have a look, so this tells statistic about the average age and the number of passengers in each class
groupbyClass
Apply function is a built-in function in python, it is a very handy tool when we want to apply a function to each element of a list, series, or rows/columns of dataframes etc.
It is also very handy when we combine the use of lambda function (also called anonymous function: is a function that is defined without a name).
# here, we want to see: the age diversity in each class; and the number of people elder than 50 in each class
# so in the following, the two lamda functions apply to the age column,
# the 'unique age values' counts it unique not null values,
# and the 'elder age group' counts the number of values larger than 50
# notice that here we apply two functions to the same column 'age'.
fun = {'age':{'unique age values':lambda x: x.nunique(),'elder age group':lambda x: x[x>50].count()}}
# just making it pretty
groupbyClass = titanic.groupby('class').agg(fun).reset_index() # reset its index
groupbyClass.columns = groupbyClass.columns.droplevel(0) # drop level 0 index
groupbyClass.rename(columns = {'':'class'},inplace = True) # rename the first column
groupbyClass # now have a look
You can just simiply plot the value to have a look. But sometimes this wouldn't tell anything, e.g.
import matplotlib.pyplot as plt #import library for plotting
# This following magic line is to make sure graph shown inline of the notebook
%matplotlib inline
# continue to use titanic data set
plt.plot(titanic.fare) # this simply plots the fare of each passenger, so there are about 900 passengers.
plt.show()
Instead, we could have a look at the distribution by histogram.
titanic.fare.hist(bins = 200) # you can try different bin number to have a look
plt.xlim(0, 300) # setting limit on x-axis
plt.ylim(0,600) # setting limit on y-axis
Or alternatively, we can use boxplot, this helps you to identify any outliers
A boxplot (also called a box and whisker diagram) is a simple visual representation of key features of a univariate sample. It displays five-point summaries and potential outliers in graphical form.
The following figure shows an illustration graph of boxplot elements. <img src="Boxplot.png",width=500,height=500>
titanic.boxplot(column = 'fare')
plt.ylim(0, 600) # setting limit on y-axis
# we can see that there are lots of outliers in column 'fare'
We can see that there are lots of outliers in column 'fare', but it probably make much more sense to look at it by different classes:
titanic.boxplot(column = 'fare', by = 'class')
plt.ylim(0, 600) # setting limit on y-axis
# we can see that there are lots of outliers in column 'fare'
Now, we can remove the outliers in different classes. Notice that you could have make the following code more efficient and accurate. But we are trying to keep the simplicity here.
# deleting outliers for First class
filt = ~((titanic['class'] == 'First') & (titanic['fare'] > 160))
titanic = titanic[filt]
# deleting outliers for Second class
filt = ~((titanic['class'] == 'Second') & (titanic['fare'] > 50))
titanic = titanic[filt]
# deleting outliers for Third class
filt = ~((titanic['class'] == 'Third') & (titanic['fare'] > 30))
titanic = titanic[filt]
# Now have a look at it, should be much better!
titanic.boxplot(column = 'fare', by = 'class')
plt.ylim(0, 600) # setting limit on y-axis
# we can see that there are lots of outliers in column 'fare'
We can compare fare for different classes and for kids/adult by bar chart
# prepare the statistics of children fare (under 18) and adult fare in different classes
fun = {'fare':{'child fare':lambda x: x[x<=18].mean(),'adult fare':lambda x: x[x>18].mean()}}
groupbyClass = titanic.groupby('class').agg(fun) # groupby and apply functions
groupbyClass = groupbyClass.reset_index() # reset its index
groupbyClass.columns = groupbyClass.columns.droplevel(0) # drop level 0 index
groupbyClass.rename(columns = {'':'class'},inplace = True) # rename the first column
# now have a look, so this tells statistic about the average age and the number of passengers in each class
groupbyClass
import numpy as np
ind = np.arange(0, 3*2, 2)
width = 0.5 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, groupbyClass['child fare'], width, color='g')
rects2 = ax.bar(ind + width, groupbyClass['adult fare'], width, color='c')
# add some text for labels, title and axes ticks
ax.set_ylabel('fare')
ax.set_title('Child and Adult fare for different classes')
ax.set_xticks(ind + width+0.1)
ax.set_xticklabels(groupbyClass['class'],rotation='vertical')
ax.legend((rects1[0], rects2[0]), ('child fare', 'adult fare'))
fig.set_size_inches(18.5, 10.5)
Input a simple data frame.
df = pd.DataFrame({'Name' : ['Mike Hussey','Aaron Finch', 'Brad Hogg' , 'Steve Smith' , 'George Bailey', 'Mitchell Johnson', 'Shaun Marsh', 'Glenn Maxwell','Pat Cummins' , 'Mitchell Starc', 'David Warner'],
'Age' : [39,28,44,25,32,33,31,26,22,25,28],
'IPLSal' :[310,662,103,828,672,1340,455,1240,207,1030,1140]})
df
Have a quick look at the data.
plt.scatter(df['Age'], df['IPLSal'])
# and plot to see data
plt.show()
We now have 3 views of the same dataset, the raw data embedded in the code (or in a file), the dataframe (fairly similar), and the plot. What information do you gain/lose in these different views?
We can see a general trend, from left to right, which you could probably draw a line through to show an estimate (linear regression).
from scipy.stats import linregress
# We use a standard package:
slope, intercept, r_value, p_value, std_err = linregress(df['Age'],df['IPLSal'])
# Here's our function:
line = [slope*xi + intercept for xi in df['Age']]
# plot up the line
plt.plot(df['Age'],line,'r-', linewidth=3)
# plot up the data points
plt.scatter(df['Age'], df['IPLSal'])
plt.show()
In the above code what is ‘slope’? And what is ‘intercept’? What do they represent on the curve?
slope
intercept