Discovering¶
Part-01¶
The NOAA keeps a daily record of lightning strikes across much of North America. We have been tasked with performing EDA on this dataset so that it can be used to predict future lightning strikes in this region.
Overview¶
We will use pandas to examine 2018 lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA). Then, we will calculate the total number of strikes for each month and plot this information on a bar graph.
Import packages and libraries¶
Before getting started, we need to import all the required libraries and extensions. Throughout the course, we will use pandas, numpy, and datetime for data operations, and matplotlib.pyplot and seaborn for plotting.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
# Read the dataset
df = pd.read_csv('./eda_using_basic_data_functions_in_python_dataset1.csv')
# Inspect the first 10 rows
df.head(10)
|   | date | number_of_strikes | center_point_geom |
|---|---|---|---|
| 0 | 2018-01-03 | 194 | POINT(-75 27) |
| 1 | 2018-01-03 | 41 | POINT(-78.4 29) |
| 2 | 2018-01-03 | 33 | POINT(-73.9 27) |
| 3 | 2018-01-03 | 38 | POINT(-73.8 27) |
| 4 | 2018-01-03 | 92 | POINT(-79 28) |
| 5 | 2018-01-03 | 119 | POINT(-78 28) |
| 6 | 2018-01-03 | 35 | POINT(-79.3 28) |
| 7 | 2018-01-03 | 60 | POINT(-79.1 28) |
| 8 | 2018-01-03 | 41 | POINT(-78.7 28) |
| 9 | 2018-01-03 | 119 | POINT(-78.6 28) |
Notice that the data is structured as one row per date and location: each row records the number of strikes at a given geometric point on a given day.
A quick way to determine how many rows and columns of data there are in total is to use df.shape. The information is output as a tuple: (rows, columns).
df.shape
(3401012, 3)
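On a toy frame with the same columns (synthetic values, just for illustration), shape and size behave like this:

```python
import pandas as pd

# A tiny frame mimicking the strike data's columns (made-up values)
toy = pd.DataFrame({
    'date': ['2018-01-03', '2018-01-03'],
    'number_of_strikes': [194, 41],
    'center_point_geom': ['POINT(-75 27)', 'POINT(-78.4 29)'],
})

print(toy.shape)  # (2, 3): 2 rows, 3 columns
print(toy.size)   # 6: rows * columns
```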
# Get more information about the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #   Column             Dtype
---  ------             -----
 0   date               object
 1   number_of_strikes  int64
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB
Convert the date column to datetime¶
As you can see, the date column's data type is 'object' rather than a date type; objects are just strings. The datetime data type provides additional functionality for working with dates and times, so we must convert the column first.
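On a toy Series the conversion looks like this (values are illustrative; passing an explicit format string is optional here but makes the parsing unambiguous):

```python
import pandas as pd

s = pd.Series(['2018-01-03', '2018-08-29'])
converted = pd.to_datetime(s, format='%Y-%m-%d')

print(converted.dtype)  # datetime64[ns]
# The .dt accessor now works, e.g. pulling out the month
print(converted.dt.month.tolist())  # [1, 8]
```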
# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #   Column             Dtype
---  ------             -----
 0   date               datetime64[ns]
 1   number_of_strikes  int64
 2   center_point_geom  object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 77.8+ MB
Calculate the days with the most strikes¶
As part of discovery, you want to get an idea of the highest data points. For this dataset, we can calculate the top 10 days of 2018 with the highest number of lightning strikes using the groupby(), sum(), and sort_values() functions from pandas.
When using groupby() on the date column, the function combines all rows with the same date into a single row.
Then, sum() performs a sum calculation on all other summable columns. In this case, we are summing all the lightning strikes that happened on each day. Notice that the center_point_geom column is not included in the output. That's because, as a string object, this column is not summable.
Finally, sort_values() returns the results in descending order of total strikes for each day in the data.
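The same chain can be checked on a toy frame (synthetic values) where two locations share a date:

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['2018-08-29', '2018-08-29', '2018-08-17'],
    'number_of_strikes': [100, 50, 75],
})

top = (toy[['date', 'number_of_strikes']]
       .groupby(['date'])
       .sum()
       .sort_values('number_of_strikes', ascending=False)
       .reset_index())

print(top)
# The two 2018-08-29 rows collapse into one row totalling 150 strikes
```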
# Calculate the dates with the most lightning strikes
df[['date', 'number_of_strikes']].groupby(['date']).sum().sort_values('number_of_strikes', ascending=False).reset_index().head(10)
|   | date | number_of_strikes |
|---|---|---|
| 0 | 2018-08-29 | 1070457 |
| 1 | 2018-08-17 | 969774 |
| 2 | 2018-08-28 | 917199 |
| 3 | 2018-08-27 | 824589 |
| 4 | 2018-08-30 | 802170 |
| 5 | 2018-08-19 | 786225 |
| 6 | 2018-08-18 | 741180 |
| 7 | 2018-08-16 | 734475 |
| 8 | 2018-08-31 | 723624 |
| 9 | 2018-08-15 | 673455 |
Extract the month data¶
# Create a new 'month' column
df['month'] = df['date'].dt.month
df.head()
|   | date | number_of_strikes | center_point_geom | month |
|---|---|---|---|---|
| 0 | 2018-01-03 | 194 | POINT(-75 27) | 1 |
| 1 | 2018-01-03 | 41 | POINT(-78.4 29) | 1 |
| 2 | 2018-01-03 | 33 | POINT(-73.9 27) | 1 |
| 3 | 2018-01-03 | 38 | POINT(-73.8 27) | 1 |
| 4 | 2018-01-03 | 92 | POINT(-79 28) | 1 |
Calculate the number of strikes per month¶
# Calculate the total number of strikes per month
df[['month', 'number_of_strikes']].groupby(['month']).sum().sort_values('number_of_strikes', ascending=False)
|   | number_of_strikes |
|---|---|
| month |   |
| 8 | 15525255 |
| 7 | 8320400 |
| 6 | 6445083 |
| 5 | 4166726 |
| 9 | 3018336 |
| 2 | 2071315 |
| 4 | 1524339 |
| 10 | 1093962 |
| 1 | 860045 |
| 3 | 854168 |
| 11 | 409263 |
| 12 | 312097 |
Convert the month number to text¶
To help read the data more easily, let's convert the month number to text using the datetime accessor dt.month_name() and add this as a new column in the dataframe. str.slice(stop=3) keeps only the first three letters of each month name.
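A quick check of the two-step transformation on a pair of toy dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2018-01-03', '2018-08-29']))

# Full month name, then keep only the first three letters
month_txt = dates.dt.month_name().str.slice(stop=3)
print(month_txt.tolist())  # ['Jan', 'Aug']
```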
# Create a new 'month_txt' column
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)
df.head()
|   | date | number_of_strikes | center_point_geom | month | month_txt |
|---|---|---|---|---|---|
| 0 | 2018-01-03 | 194 | POINT(-75 27) | 1 | Jan |
| 1 | 2018-01-03 | 41 | POINT(-78.4 29) | 1 | Jan |
| 2 | 2018-01-03 | 33 | POINT(-73.9 27) | 1 | Jan |
| 3 | 2018-01-03 | 38 | POINT(-73.8 27) | 1 | Jan |
| 4 | 2018-01-03 | 92 | POINT(-79 28) | 1 | Jan |
Create a new dataframe¶
The objective is to plot the total number of strikes per month as a bar graph. To help with the plotting, we will create a new dataframe called df_by_month. This will allow us to easily access the month, month text, and total number of strikes for each month.
# Create a new helper dataframe for plotting
df_by_month = df[['month', 'month_txt', 'number_of_strikes']].groupby(['month', 'month_txt']).sum().sort_values('month', ascending=True).reset_index()
df_by_month
|   | month | month_txt | number_of_strikes |
|---|---|---|---|
| 0 | 1 | Jan | 860045 |
| 1 | 2 | Feb | 2071315 |
| 2 | 3 | Mar | 854168 |
| 3 | 4 | Apr | 1524339 |
| 4 | 5 | May | 4166726 |
| 5 | 6 | Jun | 6445083 |
| 6 | 7 | Jul | 8320400 |
| 7 | 8 | Aug | 15525255 |
| 8 | 9 | Sep | 3018336 |
| 9 | 10 | Oct | 1093962 |
| 10 | 11 | Nov | 409263 |
| 11 | 12 | Dec | 312097 |
Make a bar chart¶
plt.bar(x=df_by_month['month_txt'],
        height=df_by_month['number_of_strikes'], label='Number of strikes')
plt.xlabel("Months (2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by month")
plt.legend()
plt.show()
Part-02¶
The firm wants insights into unicorn companies: companies that are valued at over one billion dollars. The data you will use for this task provides information on over 1,000 unicorn companies, including their industry, country, year founded, and select investors. You will use this information to gain insights into how and when companies reach this prestigious milestone and to make recommendations for next steps to the investing firm.
# Load the dataframe
companies = pd.read_csv('./Unicorn_Companies.csv')
Display the first five rows of the data¶
companies.head()
|   | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bytedance | $180B | 4/7/17 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S... |
| 1 | SpaceX | $100B | 12/1/12 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen... |
| 2 | SHEIN | $100B | 7/3/18 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China... |
| 3 | Stripe | $95B | 1/23/14 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG |
| 4 | Klarna | $46B | 12/12/11 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita... |
Exploration
The "Date Joined" column records when the company became a "unicorn," reaching one billion dollars in valuation.
The "Select Investors" column lists the top investors in the company.
Assess the size of the dataset¶
companies.size
10740
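size counts individual cells rather than rows: it always equals the product of the two shape dimensions, which is why 1074 rows by 10 columns gives 10740. A toy check (made-up values):

```python
import pandas as pd

toy = pd.DataFrame({'Company': ['A', 'B', 'C'], 'Valuation': ['$1B', '$2B', '$3B']})

print(toy.shape)  # (3, 2)
print(toy.size)   # 6 = 3 rows * 2 columns
```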
Get basic information about the dataset¶
companies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   object
 2   Date Joined       1074 non-null   object
 3   Industry          1074 non-null   object
 4   City              1058 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64
 8   Funding           1074 non-null   object
 9   Select Investors  1073 non-null   object
dtypes: int64(1), object(9)
memory usage: 84.0+ KB
Exploration
Dtype is listed as int64 in the Year Founded column. This means that the year a company was founded is represented as an integer.
Dtype is listed as object for the Date Joined column. This means that the date a company became a unicorn is represented as an object.
Descriptive statistics¶
# Get descriptive statistics
companies.describe()
|   | Year Founded |
|---|---|
| count | 1074.000000 |
| mean | 2012.895717 |
| std | 5.698573 |
| min | 1919.000000 |
| 25% | 2011.000000 |
| 50% | 2014.000000 |
| 75% | 2016.000000 |
| max | 2021.000000 |
Convert the Date Joined column to the datetime data type¶
companies['Date Joined'] = pd.to_datetime(companies['Date Joined'])
C:\Users\pc\AppData\Local\Temp\ipykernel_7180\1458111891.py:1: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  companies['Date Joined']=pd.to_datetime(companies['Date Joined'])
companies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   object
 2   Date Joined       1074 non-null   datetime64[ns]
 3   Industry          1074 non-null   object
 4   City              1058 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64
 8   Funding           1074 non-null   object
 9   Select Investors  1073 non-null   object
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 84.0+ KB
Create a Year Joined column¶
companies['Year Joined'] = companies['Date Joined'].dt.year
companies.head()
|   | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Year Joined |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bytedance | $180B | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S... | 2017 |
| 1 | SpaceX | $100B | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen... | 2012 |
| 2 | SHEIN | $100B | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China... | 2018 |
| 3 | Stripe | $95B | 2014-01-23 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG | 2014 |
| 4 | Klarna | $46B | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita... | 2011 |
Result and evaluation¶
Take a sample of the data¶
It is not necessary to take a sample of the data in order to conduct the visualizations and EDA that follow. But you may encounter scenarios in the future where you will need to take a sample of the data due to time and resource limitations. For the purpose of developing your skills around sampling, take a sample of the data and work with that sample for the next steps of analysis. Use the sample() function for this task.
- Use sample() with the n parameter set to 50 to randomly sample 50 unicorn companies from the data. Be sure to specify the random_state parameter to ensure reproducibility of your work. Save the result to a variable called companies_sample.
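The effect of random_state can be sketched on a toy frame: the same seed always returns the same rows, which is what makes the analysis reproducible.

```python
import pandas as pd

toy = pd.DataFrame({'Company': list('ABCDEFGHIJ'), 'Valuation': range(10)})

s1 = toy.sample(n=3, random_state=42)
s2 = toy.sample(n=3, random_state=42)

# Same seed, same sample
print(s1.index.equals(s2.index))  # True
```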
# Sample the data
companies_sample = companies.sample(n=50, random_state=42)
Visualize the time it took companies to reach unicorn status¶
# Prepare data for plotting
# Create a new `years_till_unicorn` column
companies_sample['years_till_unicorn'] = companies_sample['Year Joined'] - companies_sample['Year Founded']
# Group the data by `Industry` and total the `years_till_unicorn` column for each industry
grouped = companies_sample[['Industry', 'years_till_unicorn']].groupby('Industry').sum().sort_values('years_till_unicorn', ascending=True).reset_index()
grouped
|   | Industry | years_till_unicorn |
|---|---|---|
| 0 | Consumer & retail | 1 |
| 1 | Auto & transportation | 2 |
| 2 | Artificial intelligence | 11 |
| 3 | Data management & analytics | 18 |
| 4 | Mobile & telecommunications | 21 |
| 5 | Cybersecurity | 23 |
| 6 | Other | 26 |
| 7 | Health | 32 |
| 8 | Supply chain, logistics, & delivery | 37 |
| 9 | E-commerce & direct-to-consumer | 38 |
| 10 | Fintech | 57 |
| 11 | Internet software & services | 95 |
# Create the plot
plt.bar(grouped.Industry, grouped.years_till_unicorn)
# Set title
plt.title("Bar plot of total years taken by companies to become unicorns per industry (from sample)")
# Set x-axis label
plt.xlabel("Industry")
# Set y-axis label
plt.ylabel("Total number of years")
# Rotate labels on the x-axis to avoid overlapping text
plt.xticks(rotation=45, horizontalalignment='right')
plt.show()
- This bar plot shows that for this sample of unicorn companies, the largest combined time to reach unicorn status occurred in the Internet software & services industry, while the smallest occurred in the Consumer & retail industry.
Visualize the maximum unicorn company valuation per industry¶
Visualize unicorn companies' maximum valuation for each industry represented in the sample. To create a bar plot to visualize this, use the bar() function from the matplotlib.pyplot module. Before plotting, create a new column that represents the companies' valuations as numbers (instead of strings, as they're currently represented). Then, use this new column to plot your data.
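As a sketch (not the notebook's exact steps), the same cleanup can be done in one pass with str.strip and pd.to_numeric on a toy Series:

```python
import pandas as pd

valuation = pd.Series(['$180B', '$2B', '$95B'])

# Strip the leading '$' and trailing 'B', then convert to numbers
as_billions = pd.to_numeric(valuation.str.strip('$B'))
print(as_billions.tolist())  # [180, 2, 95]
```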
# Create a column representing company valuation as numeric data
# Create new column
companies_sample['valuation_billions'] = companies_sample['Valuation']
# Remove the '$' from each value
companies_sample['valuation_billions'] = companies_sample['valuation_billions'].str.replace('$', '')
# Remove the 'B' from each value
companies_sample['valuation_billions'] = companies_sample['valuation_billions'].str.replace('B', '')
# Convert column to type int
companies_sample['valuation_billions'] = companies_sample['valuation_billions'].astype('int')
companies_sample.head()
| Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Year Joined | years_till_unicorn | valuation_billions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 542 | Aiven | $2B | 2021-10-18 | Internet software & services | Helsinki | Finland | Europe | 2016 | $210M | Institutional Venture Partners, Atomico, Early... | 2021 | 5 | 2 |
| 370 | Jusfoun Big Data | $2B | 2018-07-09 | Data management & analytics | Beijing | China | Asia | 2010 | $137M | Boxin Capital, DT Capital Partners, IDG Capital | 2018 | 8 | 2 |
| 307 | Innovaccer | $3B | 2021-02-19 | Health | San Francisco | United States | North America | 2014 | $379M | M12, WestBridge Capital, Lightspeed Venture Pa... | 2021 | 7 | 3 |
| 493 | Algolia | $2B | 2021-07-28 | Internet software & services | San Francisco | United States | North America | 2012 | $334M | Accel, Alven Capital, Storm Ventures | 2021 | 9 | 2 |
| 350 | SouChe Holdings | $3B | 2017-11-01 | E-commerce & direct-to-consumer | Hangzhou | China | Asia | 2012 | $1B | Morningside Ventures, Warburg Pincus, CreditEa... | 2017 | 5 | 3 |
# Prepare data for plotting
grouped = (companies_sample[["Industry", "valuation_billions"]]
           .groupby("Industry")
           .max()
           .sort_values(by="valuation_billions")
           )
grouped
|   | valuation_billions |
|---|---|
| Industry |   |
| Auto & transportation | 1 |
| Consumer & retail | 1 |
| Other | 2 |
| Supply chain, logistics, & delivery | 2 |
| Cybersecurity | 3 |
| Health | 3 |
| Data management & analytics | 4 |
| E-commerce & direct-to-consumer | 4 |
| Internet software & services | 5 |
| Mobile & telecommunications | 7 |
| Fintech | 10 |
| Artificial intelligence | 12 |
# Create a bar plot
# with the Industry column as the categories of the bars
# and the new valuation column as the heights of the bars
plt.bar(grouped.index, grouped["valuation_billions"])
# Set title
plt.title("Bar plot of maximum unicorn company valuation per industry (from sample)")
# Set x-axis label
plt.xlabel("Industry")
# Set y-axis label
plt.ylabel("Maximum valuation in billions of dollars")
# Rotate labels on the x-axis to avoid overlap in the positions of the text
plt.xticks(rotation=45, horizontalalignment='right')
# Display the plot
plt.show()
- This bar plot shows that for this sample of unicorn companies, the highest maximum valuation occurred in the Artificial intelligence industry, while the lowest maximum valuation occurred in the Auto & transportation and Consumer & retail industries.
Considerations¶
What are some key takeaways from this analysis?
- Functions in the pandas library can be used to gather characteristics about the data at hand.
  - The info() and describe() functions were especially useful for gathering basic information about a dataset and finding descriptive statistics, respectively.
- Functions in the matplotlib.pyplot module can be used to create visualizations to further understand specific aspects of the data.
  - The bar() function allowed you to create bar plots that helped visualize categorical information about the data. You were able to visualize the time to become a unicorn and the maximum valuation for each industry represented in the sample taken from the data.
What findings would you share with others?
- There are 1074 unicorn companies represented in this dataset.
- Some companies took longer to reach unicorn status but have accrued high valuation as of March 2022. Companies could take longer to achieve unicorn status for a number of reasons, including requiring more funding or taking longer to develop a business model.
What recommendations would you share with stakeholders based on these findings?
It may be helpful to focus more on industry specifics. Next steps to consider:
- Identify the main industries that the investing firm is interested in investing in.
- Select a subset of this data that includes only companies in those industries.
- Analyze that subset more closely. Determine which companies have higher valuation but do not have as many investors currently. They may be good candidates to consider investing in.
References¶
Bhat, M.A. (2022, March). Unicorn Companies [Data set].