Kernel: Python 2 (SageMath)

GHCND

Sean Paradiso

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
            
data = pd.read_csv('598354.csv')
data

Out[1]:

As described by GHCND_documentation.pdf, this data is 'a composite of climate records from numerous sources that were merged and then subjected to a suite of quality assurance reviews.'

In [2]:

data.describe()

Out[2]:

In [3]:

station = data.groupby("STATION").size()
#station

In [4]:

#station.index

In [5]:

#station.describe()

The following cell lists the column names, which gives a compact overview of the fields available.

In [6]:

data.columns

Out[6]:

Index([u'STATION', u'STATION_NAME', u'DATE', u'MDPR', u'DAPR', u'PRCP',
       u'TMAX', u'TMIN', u'TOBS'],
      dtype='object')

Here we display the names of the stations with recorded data, along with the number of rows recorded for each. The sizes are shown so that, when selecting three stations, we can compare sample sizes and choose stations with a similar number of data points.

In [7]:

name = data.groupby("STATION_NAME").size()
name

Out[7]:

STATION_NAME
ACTON CALIFORNIA CA US                 610
ACTON ESCONDIDO CANYON CA US           273
ADELANTO 3.1 S CA US                    22
ADIN MOUNTAIN CA US                    596
ADIN RANGER STATION CA US              582
AHWAHNEE 2.5 NNW CA US                 571
ALAMO 1.0 WSW CA US                      5
ALBION 4.0 SE CA US                    563
ALDER POINT CALIFORNIA CA US           562
ALDER SPRINGS CALIFORNIA CA US         610
ALPINE CA US                           610
ALPINE CALIFORNIA CA US                610
ALTA SIERRA 0.4 WSW CA US              567
ALTA SIERRA 1.3 S CA US                 45
ALTA SIERRA 1.4 SSW CA US               63
ALTA SIERRA 2.3 WSW CA US               55
ALTADENA 0.7 ESE CA US                 583
ALTADENA CA US                         577
ALTURAS CA US                          497
ALTURAS MUNICIPAL AIRPORT CA US        608
AMBOY CA US                            370
AMERICAN CANYON 0.3 S CA US            138
AMERICAN CANYON 3.5 NE CA US            16
ANAHEIM 4.9 E CA US                    580
ANAHEIM 4.9 ENE CA US                  602
ANAHEIM 7.3 E CA US                    394
ANAHEIM CA US                          610
ANAHEIM HILLS 1.1 SE CA US              38
ANDERSON 2.6 NE CA US                  104
ANDERSON 8.5 WNW CA US                 205
                                      ... 
WINDSOR 0.6 NNE CA US                  597
WINDSOR 1.2 NNW CA US                   54
WINDSOR 1.4 SE CA US                   609
WINDSOR 1.5 WNW CA US                   81
WINDSOR 1.8 SE CA US                    63
WINTERS CA US                          604
WOFFORD HEIGHTS CALIFORNIA CA US       610
WOLVERTON CALIFORNIA CA US             610
WOODACRE 0.6 SW CA US                  145
WOODACRE CALIFORNIA CA US              610
WOODLAND 1 WNW CA US                   527
WOODLAND 2.8 SE CA US                  449
WOODLAND HILLS PIERCE COLLEGE CA US    610
WOODSIDE 3.4 S CA US                   558
WOODSIDE FIRE STATION 1 CA US          356
WRIGHTWOOD 1.2 WNW CA US               395
YOLLA BOLLA CALIFORNIA CA US           610
YOSEMITE LAKES 4.7 S CA US             597
YOSEMITE PARK HDQUARTERS CA US         502
YOSEMITE VILLAGE 12 W CA US            606
YREKA 0.9 WNW CA US                    481
YREKA 4.5 S CA US                      295
YREKA CA US                            608
YUCAIPA 1.5 NNE CA US                  500
YUCCA MESA CA US                       577
YUCCA VALLEY 1.1 SW CA US               29
YUCCA VALLEY 2.7 ENE CA US             531
YUCCA VALLEY CA US                     577
YUCCA VALLEY CALIFORNIA CA US          575
YUROK CALIFORNIA CA US                 601
Length: 1345, dtype: int64
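The size comparison described above can also be done programmatically rather than by reading the listing. The following is a minimal sketch on a hypothetical five-station sample; the helper `stations_near` and its `target` and `tolerance` parameters are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the station-size table; the real one has 1345 stations.
sizes = pd.Series(
    {
        "AMBOY CA US": 370,
        "ALTURAS CA US": 497,
        "ANAHEIM CA US": 610,
        "ALAMO 1.0 WSW CA US": 5,
        "ALPINE CA US": 610,
    },
    name="n_rows",
)


def stations_near(sizes, target, tolerance):
    """Return the stations whose row count is within `tolerance` of `target`."""
    close = sizes[(sizes - target).abs() <= tolerance]
    return close.sort_values(ascending=False)


similar = stations_near(sizes, target=600, tolerance=20)
print(similar)
```

On the real data, `sizes` would be the result of `data.groupby("STATION_NAME").size()`.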

The cell below is an optional alternative way of displaying the station names, using a for loop over the groups; the loop itself is left commented out.

In [8]:

group1 = data.groupby("STATION_NAME")
#for name, group in group1:
    #print(name)
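To illustrate what the commented-out loop does, here is a small sketch on a hypothetical three-row sample. It assumes only that the frame has a STATION_NAME column, as in the notebook's CSV, and compares the loop with a loop-free way of getting the same names.

```python
import pandas as pd

# Hypothetical three-row sample; the column names follow the notebook's CSV.
sample = pd.DataFrame({
    "STATION_NAME": ["AMBOY CA US", "ANAHEIM CA US", "AMBOY CA US"],
    "TMAX": [310, 250, 305],
})

# Iterating a GroupBy yields (key, sub-frame) pairs, one pair per station.
names_from_loop = [name for name, group in sample.groupby("STATION_NAME")]

# The same names without a loop: the distinct values of the column, sorted.
names_direct = sorted(sample["STATION_NAME"].unique().tolist())

print(names_from_loop)
print(names_direct)
```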

The next three cells select the stations, and the three cells after them (tmm1, tmm2, tmm3) narrow each selection to the only columns we are interested in, namely maximum and minimum temperature.

In [9]:

selection1 = group1.get_group('AMBOY CA US')
#selection1

In [10]:

selection2 = group1.get_group('ALTURAS CA US')
#selection2

In [11]:

selection3 = group1.get_group('ANAHEIM CA US')
#selection3

In [12]:

tmm1 = selection1.iloc[:,6:8]
#tmm1

In [13]:

tmm2 = selection2.iloc[:,6:8]
#tmm2

In [14]:

tmm3 = selection3.iloc[:,6:8]
#tmm3
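The positional slice `iloc[:, 6:8]` used above depends on the column order shown by `data.columns`. As a hedged alternative, the same two columns can be selected by name. The sketch below uses two hypothetical rows; the station ID `"S1"` and the DATE values are made up for illustration.

```python
import pandas as pd

# Column order as reported by data.columns in the notebook.
columns = ["STATION", "STATION_NAME", "DATE", "MDPR", "DAPR",
           "PRCP", "TMAX", "TMIN", "TOBS"]

# Hypothetical rows; "S1" and the DATE values are placeholders for illustration.
sample = pd.DataFrame(
    [["S1", "AMBOY CA US", 20150601, -9999, -9999, 0, 417, 256, -9999],
     ["S1", "AMBOY CA US", 20150602, -9999, -9999, 3, -9999, 250, -9999]],
    columns=columns,
)

by_position = sample.iloc[:, 6:8]            # columns 6 and 7, as in the notebook
by_label = sample.loc[:, ["TMAX", "TMIN"]]   # the same columns, named explicitly

print(by_position.equals(by_label))
```

Selecting by label keeps working if columns are added or reordered, which is the main reason to prefer it here.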

Here we correct the data: there are numerous entries of -9999, which is clearly not a recorded value and is most likely a placeholder for a missing observation. To make everything readable and plottable, we replace every instance of -9999 with NaN (not a number).

Below these long columns of data we have the actual plot of the minimum and maximum temperatures for our first selection.
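As a small illustration of the placeholder correction described above, the sketch below applies the same boolean-indexing idiom to a hypothetical three-row frame and compares it with `replace`, which targets the -9999 value exactly. The sample values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical TMAX/TMIN readings; -9999 marks a missing observation.
tmm = pd.DataFrame({"TMAX": [311, -9999, 298], "TMIN": [150, 144, -9999]})

# Boolean indexing on a DataFrame keeps entries where the mask is True and
# fills the rest with NaN; this is what tmm[tmm > -9999] does in the notebook.
masked = tmm[tmm > -9999]

# replace() swaps only exact matches of the placeholder for NaN.
replaced = tmm.replace(-9999, np.nan)

# Count the missing entries per column after each correction.
print(masked.isna().sum())
print(replaced.isna().sum())
```

Because NaN is skipped by pandas plotting and by most summary statistics, either form keeps the placeholder from distorting the temperature curves.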

In [15]:

corrected1 = tmm1[tmm1 > -9999]
corrected1

Out[15]:

In [16]:

corrected1.plot()

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f521c28bad0>

We again correct for the -9999 values and plot the necessary data for our second selection.

In [17]:

corrected2 = tmm2[tmm2 > -9999]
corrected2

Out[17]:

In [18]:

corrected2.plot()

Out[18]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f521c5fd8d0>

Finally, we correct and plot the data for our third station.

In [19]:

corrected3 = tmm3[tmm3 > -9999]
corrected3

Out[19]:

In [20]:

corrected3.plot()

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f521a06bb50>

Here we turn to precipitation. We first count how often each PRCP value occurs in the data, then pair every date with its precipitation and narrow the selection to the June 2015 span that is required.

In [21]:

prcp = data.groupby("PRCP").size()
prcp

Out[21]:

PRCP
-9999     171378
           ...
Length: 707, dtype: int64

In [22]:

date = data.iloc[:, [2, 5]]
date

June2015 = date[516:546]
June2015

Out[22]:

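As the output above shows, every precipitation value in the June 2015 slice is -9999. Note that this slice is positional: it covers the 30 rows at positions 516 through 545 of the table, not every station's June records. We double-check the data and replace these unwanted values with 0, as seen below. The check confirmed our initial result, so the plot below shows no actual precipitation for this span, owing to the sheer lack of data.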

In [23]:

percip = data.iloc[:,5:6]
# Replace only the -9999 placeholders with 0; real readings are left untouched.
corper = percip.where(percip > -9999, 0)['PRCP']
junepercip = corper.iloc[516:546]
junepercip
junepercip.plot()

Out[23]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f5219f73810>
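The June 2015 selection above relies on row positions 516 to 545. A date-based filter is one alternative that covers every station at once. The sketch below is built on an assumption the notebook does not confirm, namely that DATE is stored as YYYYMMDD integers; the sample rows are hypothetical.

```python
import pandas as pd

# ASSUMPTION: DATE is stored as YYYYMMDD integers (e.g. 20150601). The notebook
# never displays the column, so adjust `format` if the file differs.
sample = pd.DataFrame({
    "STATION_NAME": ["AMBOY CA US"] * 3 + ["ANAHEIM CA US"] * 2,
    "DATE": [20150531, 20150601, 20150630, 20150615, 20150701],
    "PRCP": [0, -9999, 5, 12, 0],
})

# Parse the integers into timestamps so comparisons are calendar-aware.
when = pd.to_datetime(sample["DATE"].astype(str), format="%Y%m%d")

# Keep the rows whose date falls in June 2015, whatever their position.
in_june = (when >= pd.Timestamp("2015-06-01")) & (when < pd.Timestamp("2015-07-01"))
june2015 = sample[in_june]

print(june2015)
```

Filtering on the parsed dates does not depend on how many rows each station contributes, which matters here because the station sizes listed earlier range from 5 to 610 rows.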

In [24]:

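s1p = selection1.iloc[:,5:6]
# Replace only the -9999 placeholders with 0; real readings are left untouched.
s1corper = s1p.where(s1p > -9999, 0)
# Note: this slice is taken by row position from the full-dataset series `corper`.
s1june = corper.iloc[516:546]
s1june
s1corper.plot()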

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f5219f9fb90>

In [25]:

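s2p = selection2.iloc[:,5:6]
# Replace only the -9999 placeholders with 0; real readings are left untouched.
s2corper = s2p.where(s2p > -9999, 0)
# Note: this slice is taken by row position from the full-dataset series `corper`.
s2june = corper.iloc[516:546]
s2june
s2corper.plot()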

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f52141b30d0>

In [26]:

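s3p = selection3.iloc[:,5:6]
# Replace only the -9999 placeholders with 0; real readings are left untouched.
s3corper = s3p.where(s3p > -9999, 0)
# Note: this slice is taken by row position from the full-dataset series `corper`.
s3june = corper.iloc[516:546]
s3june
s3corper.plot()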

Out[26]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f521410d290>
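Since the precipitation plots depend heavily on how many readings are placeholders, it can help to quantify the gaps per station before plotting. The following is a minimal sketch on hypothetical readings; the names `is_missing` and `missing_share` are illustrative.

```python
import pandas as pd

# Hypothetical precipitation readings for two of the selected stations.
sample = pd.DataFrame({
    "STATION_NAME": ["AMBOY CA US"] * 4 + ["ANAHEIM CA US"] * 4,
    "PRCP": [-9999, -9999, 0, -9999, 0, 12, -9999, 3],
})

# Flag placeholder rows, then average the flags within each station:
# the mean of a boolean column is the fraction of True values.
is_missing = sample["PRCP"] == -9999
missing_share = is_missing.groupby(sample["STATION_NAME"]).mean()

print(missing_share)
```

A station whose share is close to 1 has essentially no usable precipitation record, which is the situation the June 2015 slice ran into.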
