%%html
<style>
p {
max-width: 45em;
word-wrap: break-word;
}
li{
max-width: 40em;
word-wrap: break-word;
margin:0px 0px 5px 0px;
}
.output {
margin:0px 0px 0px 0px;
}
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
margin:10px 0px 30px -30px;
}
</style>
Exploratory Data Analysis in Python
# Import packages and set package parameters
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from math import sqrt, floor, ceil
import numpy as np
import pandas as pd
pd.set_option('precision',2)
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter, MaxNLocator
plt.rcParams['figure.figsize'] = (10,6) # Set default figure size in inches (width, height)
import seaborn as sns
pyv = !python --version
print(f'{"PACKAGE":<20} VERSION')
print(f'{"Python":<20} {str(pyv[0]).split()[1]}')
print('\n'.join(f'{m.__name__:<20} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))
The goal of this notebook is to illustrate the process of exploratory data analysis (EDA) with Jupyter Notebook using Python. Carrying out an EDA using Jupyter Notebook provides several advantages that other data analysis software may lack:
These features make Jupyter Notebook an ideal tool to carry out, share and reproduce data analyses.
This notebook demonstrates how easy it is to get an overview of a clean dataset and to visualize and identify relationships between variables in a dataset using a systematic approach. This is why a sample dataset contained in a Python package will be used rather than extracting a raw dataset from a primary source seeing as accessing and preparing a dataset for analysis often constitutes the bulk of the work.
Several functions are created in this notebook to significantly improve on existing plotting functions and make it easier to quickly review the data. Of course writing these functions initially takes a bit of time, but the point is that they can be easily reused on any other dataset and packaged into a module and they are all that is needed to get a quick and complete overview of all the variables contained in the dataset.
It is also worth noting here that there exists several Python packages that offer useful tools for EDA: Pandas Profiling, Sweetviz and DataPrep are worth checking out.
The seaborn package contains several built-in sample datasets listed below. Other built-in datasets can be accessed with the statsmodels and scikit-learn packages. Furthermore, statsmodels also provides easy access to R datasets through the Rdatasets project. The vega_datasets package provides offline access to the collection of datasets used in Vega and Vega-lite examples.
sns.get_dataset_names()
To carry out this illustrative EDA, the tips dataset will be used as an example since it contains numerical continuous, numerical discrete and categorical variables making it a perfect basis to explore the capabilities of Python for exploring relationships between different types of variables. This dataset is also included in the Rdatasets collection with the following description:
"One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:
In all he recorded 244 tips. The data was reported in a collection of case studies for business statistics:
Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL, USA: Richard D. Irwin Publishing."
The restaurant's location is unfortunately not provided, but considering the publisher's location, it is safe to assume that it is somewhere in the United States and therefore the currency is in US dollars.
Obviously, the tip variable is of most interest as the outcome variable and the main objective is to understand how the other variables measured influence it. But the purpose of this EDA is primarily to apply a systematic approach to analyze all variables on an equal footing so that no major feature is missed, seeing as an excessive focus on the outcome variable may prevent the uncovering of interesting relationships between other variables.
The systematic part of this EDA is carried out in sections 2 to 5. It is this part that can easily be reused with any other dataset containing variables of similar types. Part 6 explores the tips dataset further, investigating features uncovered in the systematic part.
tips = sns.load_dataset('tips')
tips
tips.info()
These two tables show that this dataset does not require any further preparation before analysis, there are no missing values and the variables as well as the values of categorical variables are all named clearly.
The dataset contains three types of variables:
The next steps involve looking at some key descriptive statistics describing these variables and visualizing how the scores of each variable are distributed.
tips.describe()
The range of bill totals does not exceed about \$50 which appears rather low, even accounting for the additional service charge paid as tips, considering that it is a restaurant with parties going up to 6 people. Indeed, prices were certainly lower compared to today's standard due to inflation, for example food priced at \$20 in 1995 would cost close to \$36 in 2020. Furthermore, information about the location and the type of restaurant is missing.
Tips do not go below \$1 which is understandable seeing as tips are part of the wage of waiters and waitresses in the United States, it is expected that customers give a tip of 15-20% of the bill.
The size of parties ranges from 1 to 6 people. As it is a numerical discrete variable containing only 6 unique values, it will be regarded as a categorical variable for visualization purposes in most sections of this EDA.
Taking a closer look at the distribution of numerical variables can sometimes provide more insights. Below are plotted the histograms for the total_bill and tip variables. The default histogram functions in the matplotlib and seaborn packages automatically group the variable scores into 10 bins, a number which can easily by modified but there is the issue that the bins do not automatically align with round values of the variable's unit of measurement.
For this reason, a function is created here which automatically creates bins that are aligned with round values based on a given bin width. This makes it easier to read the histogram. A kernel density estimator is also included to visualize the probability density. This can be useful for statistical inference purposes where it is important to know whether or not the distribution is close to normal, or whether it may correspond to another type of theoretical distribution.
# Create histogram function
def hist_align_bins(variable, xlabel, title, bin_width):
"""
Create a histogram with bins of a selected width and overlaid with
a kernel density estimator.
A list of bins for the histogram 'bins' argument is created based on
the range of the variable scores and on the selected width of bins.
The bins are automatically aligned with round units making the
histogram more readable.
Parameters
----------
variable: pandas Series with dtype integer or float
Numerical variable to plot in histogram.
xlabel: str
Label x axis.
title: str
Figure title.
bin_width: integer, float
Width of histogram bins in the units of the variable.
"""
leftmost_bin_edge = variable.min() - variable.min()%bin_width
rightmost_bin_edge = variable.max() - variable.max()%bin_width + 2*bin_width
bins_list = np.arange(leftmost_bin_edge, rightmost_bin_edge, bin_width)
ax_hist = sns.histplot(data=variable, bins=bins_list, kde=True, alpha= 0.9,
edgecolor='white', linewidth=0.5,
line_kws=dict(alpha=0.5, linewidth=1.5,
label='kernel density estimator'))
ax_hist.get_lines()[0].set_color('black') # manually edit line color due to bug with line_kws
# Additional formatting
ax_hist.set_xlabel(xlabel, size=12, labelpad=15)
ax_hist.set_ylabel('Count', size=12, labelpad=15)
plt.title(title, size=14, pad=30)
plt.legend(frameon=False)
sns.despine()
hist_align_bins(variable=tips['total_bill'],
xlabel='Bill total',
title='Distribution of bill totals',
bin_width=5)
hist_align_bins(variable=tips['tip'],
xlabel='Tip',
title='Distribution of tips',
bin_width=0.5);
As could be expected, tips are usually rounded at the nearest dollar. (The bins are left inclusive.)
The size variable is numerical discrete, containing only 6 unique values, so it will be regarded as a categorical variable in this section.
# Size data type is converted to str using lambda function to avoid
# creating a copy of the dataframe only for this table
tips.astype(str).describe(exclude=[np.number])
3 out of 4 of the categorical variables are binary and each of these have one score that clearly outweighs the other, as shown in more detail in the distributions printed below.
# Print counts of unique values of categorical variables and numerical
# discrete variables with 10 or fewer unique values
nb_unique_limit=10
for variable in tips:
if len(tips[variable].unique()) <= nb_unique_limit:
print(variable)
print(tips[variable].value_counts())
print()
The distribution of categorical variables can also be visualized with bar charts. It is also useful to represent numerical discrete variables in this way when they contain few unique values, such as in the case of the variable size.
def bar_sbs(df, nb_unique_limit=10):
"""
Create side-by-side bar charts of all categorical and numerical
discrete variables within dataset than contain a number of levels
less or equal to the selected limit of unique values.
Parameters
----------
df: pandas DataFrame
DataFrame containing categorical and/or numerical discrete
variables.
nb_unique_limit: integer, default=10
Number of unique values used as a cut-off to maintain chart
readability.
"""
# Select variables to plot
var_cat = df[[variable for variable in df if\
len(df[variable].unique()) <= nb_unique_limit]]
ncols = len(var_cat.columns)
# Create figure with required number of subplots
fig, axes = plt.subplots(1, ncols, figsize=(12, 4))
fig.tight_layout(pad=2)
sns.despine()
# Draw seaborn bar charts
for i, var in enumerate(var_cat):
sns.countplot(x=var_cat[var], saturation=1, ax=axes[i])
axes[i].tick_params(which='major', axis='x', length=0, pad=5)
if i > 0:
axes[i].set(ylabel='')
bar_sbs(tips)
Key observations:
This marks the end of the initial overview of the dataset. It is now time to examine whether or not there are any apparent relationships between them. This is done in the following sections by systematically visualizing all variables in different data graphics depending on the type of each variable. The numerical discrete variable size is visualized both as a numerical and as a categorical variable.
The relationships between numerical variables can be conveniently visualized with a grid of scatter plots using the seaborn package.
sns.pairplot(tips,
diag_kws=dict(edgecolor='white', linewidth=0.5, alpha=1))
plt.gcf().set_size_inches(12,8);
The histograms in the diagonal show that all three numerical variable in this dataset are right-skewed. The scatter plots show that tips are positively associated with bills, as are bills with the size of the party. The tip vs total_bill scatter plot shows that there are more points in the lower right than the upper left. This means people are more often not so generous with their tip, the tips do not increase proportionaly to the bill in a constant way.
Seeing as the dataset contains three categorical variables with just two levels (sex, smoker, time) it can be interesting to look at how these groups are distributed in these scatter plots. Doing this for variables with more than two groups is rarely useful as readability decreases, for those cases it is better to use boxplots (see further below) or scatter plots drawn as small multiples (also known as 'trellis' or 'lattice' graphs).
sns.pairplot(tips, hue='sex',
plot_kws=dict(s=10),
diag_kws=dict(facecolor='None'))
plt.gcf().set_size_inches(13,8);
It seems that men give larger tips than women probably due to the fact that they pay larger bills more frequently. Indeed, bills of parties of 4 are more often paid by men relative to the male/female ratio of party sizes of 2 and 3. A possible explanation is that these parties may often be families where men take on the responsibility of making payments.
sns.pairplot(tips, hue='smoker',
plot_kws=dict(s=10),
diag_kws=dict(facecolor='None'))
plt.gcf().set_size_inches(13,8);
From the tip vs total_bill scatter plot, it appears that parties with at least one smoker that have a large bill to pay tend to tip less generously on average than non-smoker parties paying bills of a similar amount.
sns.pairplot(tips, hue='time',
plot_kws=dict(s=10),
diag_kws=dict(facecolor='None'))
plt.gcf().set_size_inches(13,8);
Bills and therefore tips are higher at dinner compared to lunch. This may be explained by the fact that party sizes are generally larger at dinner.
These last three plots show that even if it possible to spot some slight differences between colored groups, due to the fact that tips are positively associated with bills by the nature of how tipping works, the groups overlap a lot making it difficult to spot differences. Therefore to better visualize differences between groups, it is best in this case to use boxplots, which is done in the following sections.
The relationship between numerical variables and categorical variables is best represented using boxplots. This works best for numerical continuous variables. For numerical discrete variables, it depends on the number of unique values and how they are distributed. A rough guideline can be to regard numerical discrete variables as categorical variables when they contain fewer than 10 unique values. The size variable contains only 6 unique values so it will be regarded as a categorical variable in the following plots.
Also if any categorical variable were to have many unique values (e.g. names of countries of the world), it would have to be split in smaller groups to be plotted separately to keep the figures in this section and the next readable. With this dataset, this is not an issue as there are no more than 4 unique values per categorical variable.
The boxplots are drawn with whiskers that reach out to farthest data point within the interval contained by the 5th and 95th percentiles. Outliers are plotted as separate dots. The number of observations in each group is indicated beneath the ticklabels.
def boxgrid(df, nb_unique_limit=10):
"""
Create grid of side-by-side boxplots with categorical and numerical
discrete variables on the x axis and numerical continuous variables
on the y axis.
Parameters
----------
df: pandas DataFrame
DataFrame containing numerical continuous and categorical and/or
numerical discrete variables.
nb_unique_limit: integer, default=10
Number of unique values used as a cut-off to maintain chart
readability.
"""
# Select variables to plot
var_num = df[[variable for variable in df if\
len(df[variable].unique()) > nb_unique_limit]]
var_cat = df[[variable for variable in df if\
len(df[variable].unique()) <= nb_unique_limit]]
nrows = len(var_num.columns)
ncols = len(var_cat.columns)
# Create figure with required dimensions to fit all subplots
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*2.5, nrows*4))
fig.tight_layout(pad=2)
sns.despine()
# Draw boxplots with additional formatting
for n, var_n in enumerate(var_num):
for c, var_c in enumerate(var_cat):
sns.boxplot(data=df, x=var_c, y=var_n,
whis=(5, 95), linewidth=1, width=0.3,
fliersize=1, saturation=1, ax=axes[n][c])
## Additional formatting
# Create labels showing number of observations (nobs) per
# group and position them in terms of xy data coordinates
nobs = df.groupby(var_c)[var_n].count().values
for x_position, nobs in enumerate(nobs):
y_pos_factor = -0.1-0.005*nrows
axes[n][c].text(x=x_position,
y=y_pos_factor*df[var_n].max(),
s=f'n{nobs}',
horizontalalignment='center',
size=9)
# Remove y axis labels for plots not on grid edge
if c > 0:
axes[n][c].set(ylabel='')
# Remove x axis labels for plots not on grid edge
if n < nrows-1:
axes[n][c].set(xlabel='')
# Add padding for the x axis labels to avoid overlap with
# nobs labels
else:
axes[n][c].set_xlabel(axes[n][c].get_xlabel(), labelpad=20)
# Set bottom limit y axis at 0 to improve readability
axes[n][c].set_ylim(0)
boxgrid(tips);
Key points regarding bills:
Key points regarding tips:
Any exact values represented in these plots can be extracted using the pandas groupby function. For example in the last plot bottom right, the line representing the median value for a group size 5 is not visible so it must be overlapping one of the edges of the box. The following function shows what this value is:
tips.groupby('size')['tip'].median()
Further descriptive statistics can be obtained with the describe function. For example, for the chart on the bottom left:
tips.groupby('sex')['tip'].describe()
There are several ways to visualize relationships between categorical variables. Using the function defined below, each categorical variable is displayed on two rows of bar charts except for one variable, the values of which are used to split the other variables into groups:
This makes it possible to visualize a lot of information in a small amount of space. The absolute counts show how many observations were made in each group (e.g. how many per day of the week) which helps provide more context on the data collection. The relative counts show the ratios between groups, which is more helpful for understanding the relationships between variables.
def bar_hue_sbs(df, hue, nb_unique_limit=10):
"""
Create side-by-side bar charts for each categorical and numerical
discrete variables within the dataset that contain a number of
levels less or equal to the selected limit of unique values. Groups
are created from the variable given as hue.
Parameters
----------
df: pandas DataFrame
DataFrame containing categorical and/or numerical discrete
variables.
hue: str
Name of variable contained in df used to create colored groups.
nb_unique_limit: integer, default=10
Number of unique values used as a cut-off to maintain chart
readability.
"""
# Select variables to plot
var_cat = df[[variable for variable in df if\
len(df[variable].unique())\
<= nb_unique_limit]].drop(columns=hue)
ncols = len(var_cat.columns)
nrows = 2
# Create figure with required dimensions to fit all subplots
fig, axes = plt.subplots(nrows, ncols, figsize=(12, 8))
fig.tight_layout(pad=2)
sns.despine()
fig.suptitle(f'Distribution of categorical variables grouped by {hue}',
verticalalignment='bottom',
fontsize=13)
for i, var in enumerate(var_cat):
# Draw bar chart representing var_cat values count conditioned
# by the hue variable
sns.countplot(x=var_cat[var], hue=df[hue], saturation=1, ax=axes[0][i])
# Draw the same values but as normalized stacked bars
pd.crosstab(var_cat[var], df[hue])\
.apply(lambda x: x*100/sum(x), axis=1)\
.plot.bar(stacked=True, ax=axes[1][i]).tick_params(rotation=0)
## Additional formatting
axes[1][i].yaxis.set_major_formatter(PercentFormatter(decimals=0))
for row in range(nrows):
# Remove x axis tick marks
axes[row][i].tick_params(which='major', axis='x', length=0, pad=5)
# Remove y axis labels from plots not on grid edge
if i > 0:
axes[row][i].set(ylabel='')
# Remove x axis labels from plots not on grid edge
if row < nrows-1:
axes[row][i].set(xlabel='')
# Add legend to upper right plot
axes[row][i].legend(frameon=False)
if i < ncols-1 or row==1:
axes[row][i].legend().remove()
bar_hue_sbs(tips, 'sex')
Men pay the bill more often Saturday and Sunday, at dinner time and for any party size of 2 and above, though parties of 1, 5, or 6 people are quite rare so the ratios for those should not be considered as representative of the norm.
bar_hue_sbs(tips, 'smoker')
There are relatively more parties with smokers on Fridays and Saturdays compared to Thursdays and Sundays.
bar_hue_sbs(tips, 'day')
The two charts on the left reiterate features already uncovered. The third one brings some valuable new insights. The majority of observations were made at lunch time on Thursdays and at dinner time on Saturdays and Sundays. This suggests that the waiter who collected this data must have had most of his shifts at these times.
This also impacts how the insights gathered from this dataset should be evaluated. This shines a new light on the boxplot grid figure in section 4:
bar_hue_sbs(tips, 'time')
Again, men appear to pay the bill more often than women at dinner time (mainly on Saturdays and Sundays) compared to lunch time (mainly on Thursdays). Party sizes can be examined in the next figure.
bar_hue_sbs(tips, 'size')
The last chart on the bottom right shows that lunch time (Thursdays mainly and Fridays) attracts mostly parties of two. Larger parties are more common at dinner on weekends.
This last figure marks the end of the systematic part of this EDA. Many insights have been uncovered thanks to the systematic approach of visualizing the distributions of each variable as well as the relationships between all variables when analyzing them two-by-two. Proceeding in this way may help avoid missing interesting features in the dataset, compared to if the analysis had focused mainly on the variable of most interest that is the amount of tips. Indeed, the analysis of categorical variables has provided some useful contextual information.
The key points observed so far that relate to the size of tips can be summarized here (extracted from figures in sections 3 and 4):
Different steps can be taken to analyze the data further. One thing that can be done is to further study relationships between variables by further splitting groups and creating plots representing 3 or more variables. One case of displaying 3 variables has already been covered in section 3 where the scatter plots representing 2 numerical variables were split into colored groups according to each of the binary categorical variables. Another thing that can be done is to create new variables by transforming or combining the existing ones and see if new patterns emerge.
In the case of this dataset, it is obviously interesting to look at the size of tips relative to the bill. In the United States, tips in restaurants are expected to be at a rate of 15-20% of the bill. The scatter plots in section 3 showed that tips were well correlated with the bill but with some minor patterns of deviation. Visualizing the tip rate relative to the other categorical variables should help make some of those features observed much more obvious. Indeed, any pair of numerical variables as well correlated as tip and total_bill make good candidates for being combined into a single one. This makes it much easier to study the influence from other factors measured with categorical variables.
This is why based on the observations made so far, it would be interesting to investigate the following questions regarding the tip rate:
To do this, first the tip rate must be computed. Then the visualizations produced for the numerical variables in sections 2 and 4 can be reproduced for this new variable.
# Compute tip rate in percent of total bill
tips['tip_rate'] = tips['tip']/tips['total_bill']*100
tips['tip_rate'].describe()
The tip rate is closer to 15% than 20%, there is at least one very stingy person (6.1%) and one very generous person (71%).
hist_align_bins(variable=tips['tip_rate'],
xlabel='Tip rate (%)',
title='Distribution of tip rates',
bin_width=2.5)
The histogram highlights several extreme outliers. We can find out how they relate to the bill and tip size by drawing scatter plots.
# Draw scatter plots of tip rate relative to other numerical variables
# (seaborn pairplot not used due to bug in package v 0.11)
x_vars = ['total_bill', 'tip', 'size']
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))
fig.subplots_adjust(wspace=0.1)
for i, x_var, ax in zip([0,1,2], x_vars, axes):
tips.plot.scatter(x_var, 'tip_rate', edgecolor='white', linewidth=0.3, ax=ax)
if i > 0:
ax.set_ylabel('')
ax.set_yticklabels([])
sns.despine()
The extreme tip rates (>30%) concern small bills only so they do not affect much the total tips received by the waiter. The tip rate decreases with bill and party sizes.
Next, the relationship between tip rate and the categorical variables can be visualized with side-by-side boxplots. The tip variable is shown again for comparison. Note that the three outliers above 30% are left out of the figures shown in the rest of this section to make it easier to see differences.
boxgrid(tips.loc[tips['tip_rate']<30].drop(columns=['total_bill']))
The following points can be made regarding the questions stated:
A few extra figures can help provide answers to these additional questions.
The following figure displays tip rates of male vs female bill payer depending on meal time and day of the week.
# Create seaborn facetgrid to draw subplots according to time and day
grid = sns.FacetGrid(data=tips[tips['tip_rate']<30], row='time', col='day')
# Draw boxplots of tip rate by sex
grid.map_dataframe(sns.boxplot, x='sex', y='tip_rate',
whis=(5, 95), linewidth=1, width=0.3, fliersize=False,
hue_order='sex', palette=sns.color_palette(), saturation=1)
# Draw stripplots on top of boxplots
grid.map_dataframe(sns.stripplot, x='sex', y='tip_rate',
color='black', size=3, alpha=0.7)
# Additional formatting
grid.set_axis_labels('', 'Tip rate (%)')
plt.gcf().set_size_inches(9, 9);
The boxplots are overlaid with stripplots to show how many observations are represented by each boxplot seeing as the number of points per plot can be quite low once the observations are split into groups depending on time and day. This works well with this dataset seeing as it contains a small number of observations.
In line with the previous figure, there does not appear to be any notable difference in the average tip rate between men and women depending on the time and on the day. The data from Friday is too sparse to make a reliable conclusion, it would definitely be interesting to have more data from that day. The women appear to tip at a slightly more consistent rate than men, especially on Thursdays and Sundays.
The following figure displays tip rates of smoker vs non-smoker parties depending on the meal time and the sex of the bill payer.
# Create seaborn facetgrid to draw subplots according to time
grid = sns.FacetGrid(data=tips[tips['tip_rate']<30], col='time')
# Draw boxplots of tip rate by smoker status and sex
grid.map_dataframe(sns.boxplot, x='smoker', y='tip_rate', dodge=True,
whis=(5, 95), linewidth=1, fliersize=False,
hue='sex', palette=sns.color_palette(), saturation=1)
# Draw stripplots on top of boxplots
grid.map_dataframe(sns.stripplot, x='smoker', y='tip_rate', dodge=True,
hue='sex', color='black', size=3, alpha=0.7)
# Additional formatting
grid.set_axis_labels('Smoker', 'Tip rate (%)')
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles, labels[:2], frameon=False)
plt.gcf().set_size_inches(6, 6)
plt.gcf().tight_layout(pad=3);
Lunch time tip rates (on Thursdays and Fridays) show no clear difference between smoker and non-smoker parties. Dinner time rates (on Saturdays and Sundays) show that non-smoker parties tip at a more consistent rate, with men who appear to tip at higher rate than men in smoker parties. Women from non-smoker parties have the most consistent tip rate and they seem to have the tendency to tip at a rate either close to 15% or close to 20%, as points seem to cluster near those values.
To further explore factors potentially affecting the tip rate according to smoker status, the plots can be split according to size. The only remaining categorical variable day is already well represented by time so it will not be considered further.
# Create seaborn facetgrid to draw subplots according to size and time
grid = sns.FacetGrid(data=tips[tips['tip_rate']<30], col='size', row='time')
# Draw boxplots of tip rate by smoker status
grid.map_dataframe(sns.boxplot, x='smoker', y='tip_rate', dodge=True,
hue='sex', palette=sns.color_palette(),
whis=(5, 95), linewidth=1, fliersize=False, saturation=1)
# Draw stripplots with hue by sex
grid.map_dataframe(sns.stripplot, x='smoker', y='tip_rate',
dodge=True, hue='sex', color='black', size=3,
alpha=0.7, label='')
# Additional formatting
grid.set_axis_labels('Smoker', 'Tip rate (%)')
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles, labels[:2], frameon=False)
plt.gcf().set_size_inches(12, 8);
Tip rates are most consistent for non-smoker parties of 3 and 4 people, falling closely around 15%. Regarding parties of 2 people, again it appears that tip rates of women from non-smoker parties cluster at 15% and 20%. Most plots have too few points to extract reliable information, which illustrates the limits of small multiples when analyzing a small dataset.
To begin exploring this issue, it is interesting to first get some more context on how the tips are distributed between parties of different sizes. We can start by looking at how the median tips vary by party size and compare that to the median bill paid, on a per person basis. The median is used rather than the average to remove the effect from extreme outliers.
# Create dataframe containing median values by party size
tips_median = tips.groupby('size').median()
tips_median.columns = ['total_bill_median', 'tip_median', 'tip_rate']
tips_median
# Create dataframe containing median values of bill and tip on a
# per person basis
tips_per_person = tips_median.divide(tips['size'].sort_values()\
.unique(), axis='index')
tips_per_person['tip_rate'] = tips_per_person['tip_rate']\
*tips['size'].sort_values().unique()
tips_per_person.columns = ['bill_per_person', 'tip_per_person', 'tip_rate']
tips_per_person
# Create figure of bar charts representing median bill and median tip
# paid per person by party size
fig, axes = plt.subplots(1, 3, figsize=(13, 5))
fig.tight_layout(pad=4)
fig.suptitle(f'Distribution of median tip and bill paid per person by party size',
fontsize=13)
# Draw bar charts
tips_per_person.plot.bar(y=['bill_per_person', 'tip_per_person'],
width=0.8, legend='Bill per person', ax=axes[0])
tips_per_person.plot.bar(y=['bill_per_person', 'tip_per_person'],
stacked=True, ax=axes[1])
tips_per_person.drop(columns='tip_rate')\
.apply(lambda x: x*100/sum(x), axis=1)\
.plot.bar(y=['bill_per_person', 'tip_per_person'],
stacked=True, ax=axes[2])
# Additional formatting
sns.despine()
for ax in axes:
ax.legend().remove()
ax.set_xlabel('Party size', size=10, labelpad=20)
ax.set_ylabel('US dollars per person', labelpad=10)
ax.tick_params(which='major', axis='x', length=0, pad=5, rotation=0)
axes[0].set_ylim(axes[1].get_ylim())
axes[0].legend(frameon=False)
axes[2].set_ylabel('')
axes[2].yaxis.set_major_formatter(PercentFormatter(decimals=0))
# Create labels on x axis showing number of observations per group
for ax in axes:
nobs = tips.groupby('size')['size'].count().values
for x_position, nobs in enumerate(nobs):
ax.text(x=x_position,
y=-0.1*max(ax.get_ylim()),
s=f'n{nobs}',
horizontalalignment='center',
size=9)
The median tip and bill paid per person decrease with party size. A possible explanation in the decrease of median bill per person is that larger parties tend to be families with children who consume less on average than adults. The last chart shows that. relative to the bill. the tip decreases a bit with party size, as previously illustrated with boxplots. We can take a closer look at this in the next figure.
# Draw graph of median tip rate by party size
ax = tips_per_person.plot.line(y='tip_rate')
ax.yaxis.set_major_formatter(PercentFormatter(decimals=0))
ax.set_xlabel('Party size', size=12, labelpad=10)
ax.set_ylabel('Median tip rate', size=12, labelpad=10)
ax.legend().remove()
sns.despine();
Note that the tip rates shown here are obtained by extracting the median rates by party from the original table. The rates reflected by the difference in shares shown in the stacked bar chart above are obtained after stacking the median bill and median tip per person, which gives a slightly different result (representing non-observed median tip rates).
Contrary to the boxplots shown earlier, taking the median tip rate alone makes the decrease from party size 2 to 4 less apparent seeing as the rate actually increases a bit from 2 to 3. Nevertheless, looking at the rates in this way makes differences between party sizes look rather significant, which brings us back to the initial question: how much does this inconsistent tip rate affect the total tips of the waiter? A Pareto chart comes in handy in such a situation.
# Create dataframe containing total tips by party size and sort in
# descending order
tips_sum = tips.drop(columns='tip_rate').groupby('size').sum()\
.reset_index().sort_values('tip', ascending=False)
tips_sum['size'] = tips_sum['size'].apply(str) # Change data type for charting issue
tips_sum = tips_sum.set_index('size')
tips_sum.columns = ['total_bill_sum', 'tip_sum']
tips_sum
# Compute cumulative percentage of total tips
tips_sum['tip_cumul_pc'] = 100*tips_sum['tip_sum'].cumsum()\
/tips_sum['tip_sum'].sum()
tips_sum
# Create Pareto chart of cumulative tips by party size
# code based on https://stackoverflow.com/questions/53577630/how-to-make-pareto-chart-in-python
fig, ax = plt.subplots()
# Draw bar chart total tips by party size
ax_bar = tips_sum.plot.bar(use_index=True, y='tip_sum',
color='mediumseagreen',
ax=ax)
ax_bar.set_ylabel('Total tips (USD)', size=12, labelpad=15)
ax_bar.set_xlabel('Party size', size=12, labelpad=15)
# Draw graph of cumulative percentage
ax_line = tips_sum.plot.line(use_index=True, y='tip_cumul_pc',
marker='o', color='sienna', alpha=0.7,
ax=ax, secondary_y=True)
ax_line.set_ylim([0,110])
ax_line.tick_params(axis='y', colors='sienna')
ax_line.yaxis.set_major_formatter(PercentFormatter())
ax_line.set_ylabel('Cumulative percentage of total tips',
color='sienna', size=12, labelpad=15)
# Additional formatting
ax.tick_params(which='major', axis='x', length=0, pad=5)
ax.legend().remove()
sns.despine(right=False);
Most of the tips are earned from parties of 2 to 4 people, slightly more than half of which come from parties of 2. We know that parties of 3 and 4 tip at a slightly different median rate than parties of 2, so the next point to investigate is what difference does this make with regards to total tips: how much would the total tips change if everyone would tip at a consistent rate regardless of party size? To answer this, the median tip rate of parties of 2 can be used as a benchmark.
# Assign median tip rate for parties of 2
tip_rate_2p = tips_median['tip_rate'][2]
tip_rate_2p
# Compute total theoretical tips at median rate of parties of 2 by
# party size
tip_2p_rate_sum = tips.assign(tip_2p_rate_sum=lambda tips:\
tips['total_bill']*tip_rate_2p/100)\
.groupby('size').sum()['tip_2p_rate_sum']
tip_2p_rate_sum.index = tip_2p_rate_sum.index.astype(str)
tip_2p_rate_sum
# Merge total theoretical tips with table of Pareto chart
tips_2p = tips_sum.merge(tip_2p_rate_sum, left_index=True, right_index=True)
tips_2p
# Compute cumulative percentage of total theoretical tips
tips_2p['tip_2p_rate_cumul_pc'] = 100*tips_2p['tip_2p_rate_sum'].cumsum()\
/tips_2p['tip_2p_rate_sum'].sum()
tips_2p
# Create Pareto chart of total tips measured compared to total
# theoretical tips if everyone tipped at the median rate of parties of 2
fig, ax = plt.subplots()
# Draw bar chart of benchmark (theoretical) total tips and measured
# total tips by party size
ax_bar = tips_2p.plot.bar(use_index=True, y=['tip_sum', 'tip_2p_rate_sum'],
color=['mediumseagreen', 'lightgrey'],
label=['Total tips measured',
'Total tips if everyone tipped at the median rate of parties of 2'],
ax=ax)
# Draw graph of benchmark and measured cumulative percentages
ax_line = tips_2p.plot.line(use_index=True,
y=['tip_cumul_pc','tip_2p_rate_cumul_pc'],
marker='o', color=['sienna', 'grey'], alpha=0.7,
ax=ax, secondary_y=True)
ax_line.set_ylim([0,130])
ax_line.set_ylabel('Cumulative percentage of total tips', color='sienna',
size=12, labelpad=15)
ax_line.tick_params(axis='y', colors='sienna')
ax_line.yaxis.set_major_formatter(PercentFormatter())
# Additional formatting
ax.set_xlabel('Party size', size=12, labelpad=15)
ax.set_ylabel('Total tips (USD)', size=12, labelpad=15)
ax.tick_params(which='major', axis='x', length=0, pad=5, rotation=0)
ax.legend(frameon=False, loc='upper center')
sns.despine(right=False);
Applying a consistent median tip rate to all party sizes produces some differences in total tips per party size but nothing very significant. Indeed, effects from applying a different tip rate to the original bills, which vary a lot between parties of the same size (in addition to the tip rate, see boxplots section 4), end up being somewhat balanced out. The changes to total tips for party sizes of 3 and 4 are noticeable but not that much more than the change to the total tips of party size 2 that is purely due to the effects from applying a different tip rate to bills that vary a lot within that group.
Also, the median rates of party sizes 2-4 happen to be quite similar despite the tip rate boxplots (displayed at the beginning of section 6) showing a clear downward trend. It is quite possible that if more data was collected on parties of 3 and 4, the differences in median tip rate would become more apparent, causing a more significant impact on total tips by party size.
This last figure marks the end of the further analysis concerning the research questions. It has provided useful insights regarding the tip rate and how it varies (or not) according to other factors recorded in the dataset. The key observations made can be summarized here again:
The analysis of this dataset on restaurant tips has provided many insights. The main points are that obviously tips increase with the bill, though not at the same rate. The tip rate decreases slightly with party size. Overall, people tip mostly around 15% of the bill, meaning they are generally more stingy than generous with regards to the 15-20% rate recommended in the United States. The per person consumption decreases with party size, meaning the restaurant and therefore the waiter earn less per chair occupied by larger parties. Whether this is an issue or not depends on the resources needed to serve larger parties vs smaller ones on a per person basis. To evaluate this, more data is needed, especially information on time spent serving each table.
Information on additional factors may also bring new insights. For example it could be interesting to know the sex of all members in parties as well as the number of children. Above all, it would be interesting to obtain more complete data covering all the shifts of the week, including the hours spent per shift. Gathering new data would also make it possible to carry out statistical tests to see whether any of the differences uncovered are statistically significant. This could then help both the waiter and the restaurant owner take some steps to increase profit.
The exploratory data analysis carried out on this sample dataset illustrates how much knowledge can be gained by using a systematic approach to studying all the variables, even with a small and somewhat sparse dataset such as this one. Doing this in Jupyter Notebook using Python makes it easy to follow each step. It also makes it easy to repeat the process with another dataset that would contain a similar variety of types of variables. The systematic part of the EDA in sections 2 to 5 helps ensure that no significant feature in the dataset is missed. As shown with this example, it also helps shape interesting research questions for further analysis. This is what makes Jupyter Notebook a valuable tool for analyzing data and efficiently extracting valuable insights.