Annotated follow-along guide: Date string manipulations with Python¶
Import packages and libraries¶
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
#Read the dataset
df = pd.read_csv('./eda_manipulate_date_strings_with_python.csv')
df.head()
| date | number_of_strikes | center_point_geom | |
|---|---|---|---|
| 0 | 2016-08-05 | 16 | POINT(-101.5 24.7) |
| 1 | 2016-08-05 | 16 | POINT(-85 34.3) |
| 2 | 2016-08-05 | 16 | POINT(-89 41.4) |
| 3 | 2016-08-05 | 16 | POINT(-89.8 30.7) |
| 4 | 2016-08-05 | 16 | POINT(-86.2 37.9) |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10479003 entries, 0 to 10479002 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 date object 1 number_of_strikes int64 2 center_point_geom object dtypes: int64(1), object(2) memory usage: 239.8+ MB
Create new time columns¶
#Convert the `date` column to datetime.
df['date']=pd.to_datetime(df['date'])
Next, let's create four new columns: week, month, quarter, and year. We can do this by using the datetime.strftime() method of the datetime object. strftime is short for "string format time." We will use this method on the datetime data in the week column, and it will extract the information we specify, formatted as a string.
To specify the information to extract, we will use strftime format codes. You can find a full list of available codes to use in the strftime format codes documentation. In this case, we will use %Y for year, %V for week number, %q for quarter.
#Create four new columns.
df['week']=df['date'].dt.strftime('%Y-W%V')
df['month']=df['date'].dt.strftime('%Y-%m')
df['quarter']=df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year']=df['date'].dt.strftime('%Y')
df.head()
| date | number_of_strikes | center_point_geom | week | month | quarter | year | |
|---|---|---|---|---|---|---|---|
| 0 | 2016-08-05 | 16 | POINT(-101.5 24.7) | 2016-W31 | 2016-08 | 2016-Q3 | 2016 |
| 1 | 2016-08-05 | 16 | POINT(-85 34.3) | 2016-W31 | 2016-08 | 2016-Q3 | 2016 |
| 2 | 2016-08-05 | 16 | POINT(-89 41.4) | 2016-W31 | 2016-08 | 2016-Q3 | 2016 |
| 3 | 2016-08-05 | 16 | POINT(-89.8 30.7) | 2016-W31 | 2016-08 | 2016-Q3 | 2016 |
| 4 | 2016-08-05 | 16 | POINT(-86.2 37.9) | 2016-W31 | 2016-08 | 2016-Q3 | 2016 |
Plot the number of weekly lightning strikes in 2018¶
Next, we will plot the number of weekly lightning strikes.
#Create a new dataframe view of just 2018 data, summed by week.
df_by_week_2018=df[df['year']=='2018'][['week','number_of_strikes']].groupby('week').sum().sort_values('week',ascending=False).reset_index()
df_by_week_2018.head()
| week | number_of_strikes | |
|---|---|---|
| 0 | 2018-W52 | 105388 |
| 1 | 2018-W51 | 53170 |
| 2 | 2018-W50 | 66790 |
| 3 | 2018-W49 | 85965 |
| 4 | 2018-W48 | 47041 |
#Plot a bar graph of weekly strike totals in 2018
plt.bar(x=df_by_week_2018['week'],
height=df_by_week_2018['number_of_strikes'])
plt.xlabel("Week number")
plt.ylabel("Number of lightning strikes")
plt.title('Number of lightning strikes per week (2018)')
plt.show()
#increase the output size
plt.figure(figsize=(20,5))
plt.bar(x=df_by_week_2018['week'],
height=df_by_week_2018['number_of_strikes'])
plt.xlabel('Week number')
plt.ylabel('Number of lightning strikes')
plt.title("Number of lightning strikes per week (2018)")
plt.xticks(rotation = 45, fontsize = 8) # Rotate x-axis labels and decrease font size.
plt.show()
Plot the number of quarterly lightning strikes from 2016-2018¶
Next, let's plot lightning strikes by quarter for the full date range of available data. For a visualization, it will be easiest to work with numbers in millions, such as 25.2 million. As an example, the following code will divide the number_of_strikes column by one million.
df_by_quarter=df['number_of_strikes'].div(1000000)
df_by_quarter.head()
0 0.000016 1 0.000016 2 0.000016 3 0.000016 4 0.000016 Name: number_of_strikes, dtype: float64
#Group 2016-2018 data by quarter and sum
df_by_quarter=df[['quarter','number_of_strikes']].groupby(['quarter']).sum().reset_index()
#Format as text, in millions.
df_by_quarter['number_of_strikes_formatted']=df_by_quarter['number_of_strikes'].div(1000000).round(1).astype(str) + "M"
df_by_quarter.head()
| quarter | number_of_strikes | number_of_strikes_formatted | |
|---|---|---|---|
| 0 | 2016-Q1 | 2683798 | 2.7M |
| 1 | 2016-Q2 | 15084857 | 15.1M |
| 2 | 2016-Q3 | 21843820 | 21.8M |
| 3 | 2016-Q4 | 1969754 | 2.0M |
| 4 | 2017-Q1 | 2444279 | 2.4M |
Add labels¶
Before we start plotting, let's write a function that will help label each bar in the plot with its corresponding number_of_strikes_formatted text. The function uses plt.text(), which is a pyplot function whose positional arguments are x, y, and s. x represents the x-axis coordinates, y represents the y-axis coordinates, and s represents the text that we want to appear at these coordinates.
def addlabels(x,y,labels):
'''
Iterates over data and plots text labels above each bar of bar graph.
'''
for i in range(len(x)):
plt.text(i,y[i],labels[i],ha='center',va='bottom')
Plot the bar graph¶
plt.figure(figsize = (15, 5))
plt.bar(x = df_by_quarter['quarter'], height = df_by_quarter['number_of_strikes'])
addlabels(df_by_quarter['quarter'], df_by_quarter['number_of_strikes'], df_by_quarter['number_of_strikes_formatted'])
plt.plot()
plt.xlabel('Quarter')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per quarter (2016-2018)')
plt.show()
Create a grouped bar chart¶
Now, we will create a grouped bar chart to better compare year-over-year changes each quarter. We can do this by creating two new columns that break out the quarter and year from the quarter column. We will use the quarter column and take the last two characters to get quarter_number, and take the first four characters to get year.
# Create two new columns.
df_by_quarter['quarter_number'] = df_by_quarter['quarter'].str[-2:]
df_by_quarter['year'] = df_by_quarter['quarter'].str[:4]
df_by_quarter.head()
| quarter | number_of_strikes | number_of_strikes_formatted | quarter_number | year | |
|---|---|---|---|---|---|
| 0 | 2016-Q1 | 2683798 | 2.7M | Q1 | 2016 |
| 1 | 2016-Q2 | 15084857 | 15.1M | Q2 | 2016 |
| 2 | 2016-Q3 | 21843820 | 21.8M | Q3 | 2016 |
| 3 | 2016-Q4 | 1969754 | 2.0M | Q4 | 2016 |
| 4 | 2017-Q1 | 2444279 | 2.4M | Q1 | 2017 |
plt.figure(figsize = (15, 5))
p = sns.barplot(
data = df_by_quarter,
x = 'quarter_number',
y = 'number_of_strikes',
hue = 'year')
for b in p.patches:
p.annotate(str(round(b.get_height()/1000000, 1))+'M',
(b.get_x() + b.get_width() / 2., b.get_height() + 1.2e6),
ha = 'center', va = 'bottom',
xytext = (0, -12),
textcoords = 'offset points')
plt.xlabel("Quarter")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per quarter (2016-2018)")
plt.show()