Annotated follow-along guide: Date string manipulations with Python¶

Import packages and libraries¶

In [ ]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
In [ ]:
#Read the dataset
df = pd.read_csv('./eda_manipulate_date_strings_with_python.csv')
df.head()
Out[ ]:
date number_of_strikes center_point_geom
0 2016-08-05 16 POINT(-101.5 24.7)
1 2016-08-05 16 POINT(-85 34.3)
2 2016-08-05 16 POINT(-89 41.4)
3 2016-08-05 16 POINT(-89.8 30.7)
4 2016-08-05 16 POINT(-86.2 37.9)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10479003 entries, 0 to 10479002
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   date               object
 1   number_of_strikes  int64 
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 239.8+ MB

Create new time columns¶

In [ ]:
#Convert the `date` column to datetime.
df['date']=pd.to_datetime(df['date'])

Next, let's create four new columns: week, month, quarter, and year. We can do this by using the datetime.strftime() method of the datetime object. strftime is short for "string format time." We will use this method on the datetime data in the week column, and it will extract the information we specify, formatted as a string.

To specify the information to extract, we will use strftime format codes. You can find a full list of available codes to use in the strftime format codes documentation. In this case, we will use %Y for year, %V for week number, %q for quarter.

In [ ]:
#Create four new columns.
df['week']=df['date'].dt.strftime('%Y-W%V')
df['month']=df['date'].dt.strftime('%Y-%m')
df['quarter']=df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year']=df['date'].dt.strftime('%Y')
In [ ]:
df.head()
Out[ ]:
date number_of_strikes center_point_geom week month quarter year
0 2016-08-05 16 POINT(-101.5 24.7) 2016-W31 2016-08 2016-Q3 2016
1 2016-08-05 16 POINT(-85 34.3) 2016-W31 2016-08 2016-Q3 2016
2 2016-08-05 16 POINT(-89 41.4) 2016-W31 2016-08 2016-Q3 2016
3 2016-08-05 16 POINT(-89.8 30.7) 2016-W31 2016-08 2016-Q3 2016
4 2016-08-05 16 POINT(-86.2 37.9) 2016-W31 2016-08 2016-Q3 2016

Plot the number of weekly lightning strikes in 2018¶

Next, we will plot the number of weekly lightning strikes.

In [ ]:
#Create a new dataframe view of just 2018 data, summed by week.
df_by_week_2018=df[df['year']=='2018'][['week','number_of_strikes']].groupby('week').sum().sort_values('week',ascending=False).reset_index()
In [ ]:
df_by_week_2018.head()
Out[ ]:
week number_of_strikes
0 2018-W52 105388
1 2018-W51 53170
2 2018-W50 66790
3 2018-W49 85965
4 2018-W48 47041
In [ ]:
#Plot a bar graph of weekly strike totals in 2018
plt.bar(x=df_by_week_2018['week'],
        height=df_by_week_2018['number_of_strikes'])
plt.xlabel("Week number")
plt.ylabel("Number of lightning strikes")
plt.title('Number of lightning strikes per week (2018)')
plt.show()
No description has been provided for this image
In [ ]:
#increase the output size
plt.figure(figsize=(20,5))
plt.bar(x=df_by_week_2018['week'],
        height=df_by_week_2018['number_of_strikes'])
plt.xlabel('Week number')
plt.ylabel('Number of lightning strikes')
plt.title("Number of lightning strikes per week (2018)")
plt.xticks(rotation = 45, fontsize = 8) # Rotate x-axis labels and decrease font size.

plt.show()
No description has been provided for this image

Plot the number of quarterly lightning strikes from 2016-2018¶

Next, let's plot lightning strikes by quarter for the full date range of available data. For a visualization, it will be easiest to work with numbers in millions, such as 25.2 million. As an example, the following code will divide the number_of_strikes column by one million.

In [ ]:
df_by_quarter=df['number_of_strikes'].div(1000000)
df_by_quarter.head()
Out[ ]:
0    0.000016
1    0.000016
2    0.000016
3    0.000016
4    0.000016
Name: number_of_strikes, dtype: float64
In [ ]:
#Group 2016-2018 data by quarter and sum
df_by_quarter=df[['quarter','number_of_strikes']].groupby(['quarter']).sum().reset_index()

#Format as text, in millions.
df_by_quarter['number_of_strikes_formatted']=df_by_quarter['number_of_strikes'].div(1000000).round(1).astype(str) + "M"

df_by_quarter.head()
Out[ ]:
quarter number_of_strikes number_of_strikes_formatted
0 2016-Q1 2683798 2.7M
1 2016-Q2 15084857 15.1M
2 2016-Q3 21843820 21.8M
3 2016-Q4 1969754 2.0M
4 2017-Q1 2444279 2.4M

Add labels¶

Before we start plotting, let's write a function that will help label each bar in the plot with its corresponding number_of_strikes_formatted text. The function uses plt.text(), which is a pyplot function whose positional arguments are x, y, and s. x represents the x-axis coordinates, y represents the y-axis coordinates, and s represents the text that we want to appear at these coordinates.

In [ ]:
def addlabels(x,y,labels):
    '''
        Iterates over data and plots text labels above each bar of bar graph. 
    '''
    for i in range(len(x)):
        plt.text(i,y[i],labels[i],ha='center',va='bottom')

Plot the bar graph¶

In [ ]:
plt.figure(figsize = (15, 5))
plt.bar(x = df_by_quarter['quarter'], height = df_by_quarter['number_of_strikes'])
addlabels(df_by_quarter['quarter'], df_by_quarter['number_of_strikes'], df_by_quarter['number_of_strikes_formatted'])
plt.plot()
plt.xlabel('Quarter')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per quarter (2016-2018)')
plt.show()
No description has been provided for this image

Create a grouped bar chart¶

Now, we will create a grouped bar chart to better compare year-over-year changes each quarter. We can do this by creating two new columns that break out the quarter and year from the quarter column. We will use the quarter column and take the last two characters to get quarter_number, and take the first four characters to get year.

In [ ]:
# Create two new columns.
df_by_quarter['quarter_number'] = df_by_quarter['quarter'].str[-2:]
df_by_quarter['year'] = df_by_quarter['quarter'].str[:4]
df_by_quarter.head()
Out[ ]:
quarter number_of_strikes number_of_strikes_formatted quarter_number year
0 2016-Q1 2683798 2.7M Q1 2016
1 2016-Q2 15084857 15.1M Q2 2016
2 2016-Q3 21843820 21.8M Q3 2016
3 2016-Q4 1969754 2.0M Q4 2016
4 2017-Q1 2444279 2.4M Q1 2017
In [ ]:
plt.figure(figsize = (15, 5))
p = sns.barplot(
    data = df_by_quarter,
    x = 'quarter_number',
    y = 'number_of_strikes',
    hue = 'year')
for b in p.patches:
    p.annotate(str(round(b.get_height()/1000000, 1))+'M', 
                   (b.get_x() + b.get_width() / 2., b.get_height() + 1.2e6), 
                   ha = 'center', va = 'bottom', 
                   xytext = (0, -12), 
                   textcoords = 'offset points')
plt.xlabel("Quarter")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per quarter (2016-2018)")
plt.show()
No description has been provided for this image