Creating frequency distribution tables and histograms using Python.

An important part of data analysis is to be able to see what the data looks like.

Common methods for this are frequency distribution tables and histograms.

In this article, we will use Python to:

Create frequency distribution tables and histograms
Display histograms of numerical data using Matplotlib
Create cumulative frequency distribution tables and plot them as histograms

How to create frequency distributions and histograms in Python

To create a frequency distribution table in Python, all you need is the data and the modules used to build the table.

Reading the required modules and datasets

First, the data: I used the number of ambulance dispatches from 'Introduction to Statistics by Example'.

You can use any data you have at hand.

The following two modules are used to produce the frequency distribution table:

pandas: for tabular data and time series analysis
numpy: for numerical calculations

Now let’s load the data.

# Load the required modules
import pandas as pd
import numpy as np
# Load the data file
ambulance_Dep = pd.read_excel("救急車出動回数.xlsx")
ambulance_Dep.head()  # show only the first 5 rows

Then, for later convenience, use the set_index() function to make the '年月日' (year, month and day) column the index of the loaded data.

ambulance_Dep = ambulance_Dep.set_index('年月日')  
ambulance_Dep.head()
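
As a minimal sketch of what set_index() does, the snippet below applies it to a tiny table. The table, its column names ("date", "dispatches") and its values are invented for illustration and are not the article's dataset.

```python
import pandas as pd

# Hypothetical three-row table; the dates and counts are invented for illustration.
toy = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "dispatches": [2, 0, 3],
})

# "date" moves from an ordinary column to the row index.
indexed = toy.set_index("date")

print(indexed.columns.tolist())                 # only the data column remains
print(indexed.loc["2020-01-02", "dispatches"])  # look a row up by its date
```

Once the date is the index, rows can be selected by date with .loc, which is the convenience the step above is aiming for.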

Getting summary statistics

We want to know what the data looks like, so we use the describe() function to look at the basic statistics for the number of ambulance runs.

ambulance_Dep.describe()

count 40: sample size (40 days)
mean 2.05: average number of dispatches per day during the period
std 1.299: standard deviation
min 0: minimum number of dispatches in a day
max 5: maximum number of dispatches in a day

In this way, basic data information can be obtained.
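
One detail worth knowing when reading the std row: describe() reports the sample standard deviation (ddof=1), while NumPy's np.std defaults to the population version (ddof=0). The sketch below checks this on a short series of invented values, not the article's data.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
counts = pd.Series([2, 0, 3, 1, 4])

summary = counts.describe()
print(summary[["count", "mean", "std", "min", "max"]])

# describe() uses the sample standard deviation (ddof=1),
# whereas np.std defaults to the population version (ddof=0).
sample_std = np.std(counts.to_numpy(), ddof=1)
population_std = np.std(counts.to_numpy())
print(sample_std, population_std)
```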

Create frequency distribution tables

Finding the number of classes from the Sturges formula

When creating a frequency distribution table, you first need to decide the number of classes, that is, how many intervals the data is divided into.

Sturges' formula is useful for this purpose.

Sturges' formula: $k = \log_{2} N + 1$

k is the number of classes.
N is the sample size (number of observations).
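
As a quick worked example, the formula can be evaluated for the 40 observations used in this article. The helper name sturges_classes is hypothetical, and rounding to the nearest whole number is one common convention rather than part of the formula itself.

```python
import math

def sturges_classes(n: int) -> float:
    """Return the raw Sturges value k = log2(n) + 1 for a sample of size n."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return math.log2(n) + 1

k = sturges_classes(40)  # 40 days of observations, as in this article
print(round(k, 2))       # raw value before rounding
print(round(k))          # number of classes after rounding to a whole number
```

For N = 40 the raw value is about 6.32, so the data would be divided into 6 classes.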

Before applying this formula to the data, the relevant DataFrame column is converted to an ndarray object.

dispatch = np.array(ambulance_Dep['出動回数'])

From here, following this page, the frequency distribution table can be created in one go.

Run the following code:

def frequency_distribution(data, class_width=None):
    data = np.asarray(data)
    if class_width is None:
        # Number of classes from Sturges' formula, then the width of each class
        class_size = 1 + int(np.log2(data.size).round())
        class_width = round((data.max() - data.min()) / class_size)

    # Class boundaries, frequency per class, and running total
    bins = np.arange(0, data.max() + class_width + 1, class_width)
    hist = np.histogram(data, bins)[0]
    cumsum = hist.cumsum()

    return pd.DataFrame({'階級値': (bins[1:] + bins[:-1]) / 2,  # class mark (midpoint)
                         '度数': hist,  # frequency
                         '累積度数': cumsum,  # cumulative frequency
                         '相対度数': hist / cumsum[-1],  # relative frequency
                         '累積相対度数': cumsum / cumsum[-1]},  # cumulative relative frequency
                        index=pd.Index([f'{bins[i]}以上{bins[i+1]}未満'  # "a or more, less than b"
                                        for i in range(hist.size)],
                                       name='階級'))  # class

frequency_distribution(dispatch)
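
The function relies on np.histogram to count how many values fall into each class. As a small sketch of how that counting treats the class boundaries, the snippet below uses an invented array (not the article's data) with integer edges from 0 to 6.

```python
import numpy as np

# Made-up counts for illustration only; not the article's dataset.
sample = np.array([0, 1, 1, 2, 2, 2, 3, 5])

# Integer edges 0, 1, ..., 6 give six classes of width 1.
edges = np.arange(0, 7, 1)
counts, _ = np.histogram(sample, bins=edges)

# Every class is [left, right) except the last, which is [left, right].
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {c}")

# A value equal to the final edge still lands in the last class.
closed, _ = np.histogram(np.array([6]), bins=edges)
print(closed)
```

Each class includes its lower boundary and excludes its upper one, which matches the "a or more, less than b" labels in the table. The only exception is the last class, which also includes its upper boundary.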

Draw a histogram of this frequency distribution table.

This requires the Matplotlib plotting module.

import matplotlib.pyplot as plt  # import the plotting module
fig = plt.figure() 
ax = fig.add_subplot(1,1,1) 
ax.hist(dispatch, bins = 6, histtype = 'barstacked', ec = 'black')
plt.show()
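
The cumulative frequencies can be drawn in a similar way. The following is a minimal sketch, assuming Matplotlib's cumulative=True option for hist(); the data array, axis labels and output file name are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Made-up counts for illustration only; replace with your own array.
sample = np.array([0, 1, 1, 2, 2, 2, 3, 5])

fig, ax = plt.subplots()
# cumulative=True makes each bar show the running total up to that class.
heights, edges, _ = ax.hist(sample, bins=np.arange(0, 7), cumulative=True, ec="black")
ax.set_xlabel("dispatches per day")
ax.set_ylabel("cumulative frequency")
fig.savefig("cumulative_hist.png")  # hypothetical output file name
```

The height of the last bar equals the sample size, which corresponds to the final value in the cumulative frequency column of the table.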

That covers the process from building a frequency distribution table to drawing a histogram.
