Bivariate Analysis (08)

Practical Explanation


My previous blog post introduced bivariate analysis as part of exploratory data analysis (EDA). In this post, we will gain practical insight into it by implementing each technique in Python.

Types

As we already know, bivariate analysis comes in three kinds:

  1. Numerical-Numerical Analysis

  2. Numerical-Categorical Analysis

  3. Categorical-Categorical Analysis
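
To make the three kinds concrete, here is a minimal sketch that names the kind of analysis for a pair of columns based on their pandas dtypes. The helper bivariate_kind and the toy DataFrame are hypothetical, made up for illustration. One caveat: integer-coded categories (and booleans) count as numeric under this simple rule.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def bivariate_kind(df: pd.DataFrame, col_a: str, col_b: str) -> str:
    """Name the kind of bivariate analysis that fits two columns (hypothetical helper)."""
    # Count how many of the two columns hold numeric values
    n_numeric = sum([is_numeric_dtype(df[col_a]), is_numeric_dtype(df[col_b])])
    if n_numeric == 2:
        return "Numerical-Numerical"
    if n_numeric == 1:
        return "Numerical-Categorical"
    return "Categorical-Categorical"


# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Height": [160.0, 172.5, 181.0],
    "Weight": [55.0, 70.2, 84.1],
    "Gender": ["F", "M", "M"],
    "Car Type": ["SUV", "Sedan", "SUV"],
})

print(bivariate_kind(toy, "Height", "Weight"))
print(bivariate_kind(toy, "Gender", "Height"))
print(bivariate_kind(toy, "Gender", "Car Type"))
```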

In this blog post, I will also touch on the statistical test that goes with each kind, along with hypothesis-testing illustrations.

Don't worry, we'll go through each of them in detail.

Let's begin the practical approach for each of these:

Numerical-Numerical Analysis

For numerical-numerical analysis, we will use a dataset containing information about the heights and weights of individuals. We will use a scatter plot and correlation coefficient to explore the relationship between height and weight.

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('height_weight.csv')

# Create a scatter plot
plt.scatter(data['Height'], data['Weight'])
plt.title('Height vs Weight')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.show()

# Calculate the correlation coefficient
correlation = data['Height'].corr(data['Weight'])
print('Correlation between Height and Weight:', correlation)
Code Explained
The above code creates a scatter plot of height against weight and calculates the correlation coefficient between the two variables. By default, corr computes the Pearson coefficient, which measures the strength of a linear relationship. A value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value close to 0 indicates little or no linear correlation (a non-linear relationship may still exist).
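
To see what the correlation coefficient actually measures, the sketch below computes Pearson's r by hand and compares it with the pandas result. The height and weight values are hypothetical, made up for illustration, and the Spearman line is an optional rank-based alternative (pandas relies on SciPy for it).

```python
import numpy as np
import pandas as pd

# Hypothetical height (cm) and weight (kg) values, made up for illustration only
data = pd.DataFrame({
    "Height": [150, 160, 165, 172, 180, 188],
    "Weight": [52, 58, 63, 70, 79, 90],
})

x = data["Height"].to_numpy(dtype=float)
y = data["Weight"].to_numpy(dtype=float)

# Pearson r: sum of products of deviations, divided by the product of their magnitudes
dx = x - x.mean()
dy = y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = data["Height"].corr(data["Weight"])  # Pearson is the default method
spearman_r = data["Height"].corr(data["Weight"], method="spearman")  # rank-based alternative

print("Manual Pearson r:", manual_r)
print("pandas Pearson r:", pandas_r)
print("Spearman rho:", spearman_r)
```

The manual value and the pandas value should agree up to floating-point rounding. Spearman's coefficient only looks at the ordering of the values, so it is less sensitive to outliers and to curved but monotonic relationships.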

Suppose we want to test whether there is a significant difference in the mean height of males and females in a population. Here gender acts as a grouping variable for the numerical height values. We can perform a two-sample t-test to evaluate this hypothesis.

import pandas as pd
import scipy.stats as stats

# Load the dataset
data = pd.read_csv('height_gender.csv')

# Separate data by gender
male_data = data[data['Gender'] == 'M']['Height']
female_data = data[data['Gender'] == 'F']['Height']

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(male_data, female_data)
print("t-statistic: ", t_statistic)
print("p-value: ", p_value)
Code Explained
In the above code, we first load the dataset containing information about height and gender and split the heights into a male sample and a female sample. We then perform a two-sample t-test using the ttest_ind function from the scipy.stats module, which returns the t-statistic and the p-value. The null hypothesis is that both groups have the same mean height. If the p-value is less than our significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant difference in the mean height of males and females. Note that ttest_ind assumes equal variances in the two groups by default.
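
If we are not sure the two groups have equal variances, a common alternative is Welch's t-test, which ttest_ind runs when equal_var=False is passed. Here is a minimal sketch with the decision step written out. The heights are hypothetical, generated from a fixed seed so that the example runs without the CSV file.

```python
import numpy as np
from scipy import stats

# Hypothetical heights in cm, generated with a fixed seed for illustration only
rng = np.random.default_rng(0)
male_heights = rng.normal(loc=175, scale=7, size=200)
female_heights = rng.normal(loc=162, scale=6, size=200)

alpha = 0.05  # significance level, chosen before looking at the data

# equal_var=False requests Welch's t-test, which does not assume equal variances
t_statistic, p_value = stats.ttest_ind(male_heights, female_heights, equal_var=False)

print("t-statistic:", t_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis: the mean heights differ.")
else:
    print("Fail to reject the null hypothesis.")
```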

Numerical-Categorical Analysis

For numerical-categorical analysis, we will use a dataset containing information about the gender and salaries of individuals. We will use a box plot to explore the relationship between gender and salary.

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('gender_salary.csv')

# Create a box plot
data.boxplot(column='Salary', by='Gender')
plt.suptitle('')  # remove the automatic "Boxplot grouped by Gender" super-title
plt.title('Salary by Gender')
plt.xlabel('Gender')
plt.ylabel('Salary')
plt.show()
Code Explained
The above code will create a box plot that displays the relationship between gender and salary. The boxplot function is used to plot our data, and the by argument is used to specify the variable we want to group by. In this case, we want to group by gender.
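
A box plot is a picture of a few summary numbers, so it can help to print them as well. The sketch below computes the quartiles and the interquartile range (IQR) per gender. The salary figures are hypothetical, made up for illustration only.

```python
import pandas as pd

# Hypothetical salaries, made up for illustration only
data = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Salary": [42000, 48000, 51000, 60000, 40000, 50000, 55000, 90000],
})

# Quartiles per group: the numbers a box plot draws as the box edges and the median line
summary = data.groupby("Gender")["Salary"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["Q1", "Median", "Q3"]
summary["IQR"] = summary["Q3"] - summary["Q1"]  # height of the box

print(summary)
```

In this toy data the male group has a wider IQR because of the single 90000 salary, which is the kind of difference in spread that a box plot makes visible at a glance.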

Suppose we want to test if there is a significant difference in the average salary of employees based on their job positions. We can perform a one-way ANOVA (Analysis of Variance) test to evaluate this hypothesis.

import pandas as pd
import scipy.stats as stats

# Load the dataset
data = pd.read_csv('job_salary.csv')

# Perform one-way ANOVA test
f_statistic, p_value = stats.f_oneway(data['Salary'][data['Job Position'] == 'Manager'],
                                      data['Salary'][data['Job Position'] == 'Engineer'],
                                      data['Salary'][data['Job Position'] == 'Analyst'])
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)
Code Explained
In the above code, we first load the dataset containing information about job position and salary. We then perform a one-way ANOVA test using the f_oneway function from the scipy.stats module, passing one salary sample per job position. The output of the test is the F-statistic and the p-value. If the p-value is less than our significance level (usually 0.05), we reject the null hypothesis that all job positions have the same mean salary and conclude that at least one position's mean salary differs from the others. ANOVA alone does not tell us which positions differ.
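
Typing out every job position by hand gets tedious when there are many of them. As a sketch, we can let groupby build one sample per position and unpack the list into f_oneway. The salaries below are hypothetical, generated from a fixed seed for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical salaries per job position, generated with a fixed seed for illustration only
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "Job Position": np.repeat(["Manager", "Engineer", "Analyst"], 50),
    "Salary": np.concatenate([
        rng.normal(90000, 8000, 50),
        rng.normal(75000, 8000, 50),
        rng.normal(60000, 8000, 50),
    ]),
})

# One array of salaries per job position, whatever positions the data contains
groups = [group["Salary"].to_numpy() for _, group in data.groupby("Job Position")]

# f_oneway accepts any number of samples, so we unpack the list with *
f_statistic, p_value = stats.f_oneway(*groups)
print("Number of groups:", len(groups))
print("F-statistic:", f_statistic)
print("p-value:", p_value)
```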

Categorical-Categorical Analysis

For categorical-categorical analysis, we will use a dataset containing information about the type of cars and the number of accidents. We will use a stacked bar chart to explore the relationship between car types and accident counts.

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('car_accidents.csv')

# Create a stacked bar chart
data.groupby('Car Type')['Accident Count'].sum().plot(kind='bar', stacked=True)
plt.title('Accident Count by Car Type')
plt.xlabel('Car Type')
plt.ylabel('Accident Count')
plt.show()
Code Explained
The above code creates a bar chart that displays the relationship between car types and accident counts. The groupby function groups the data by car type, and the sum function calculates the total accident count for each car type. The plot function draws the result, and the kind argument is set to 'bar' to create a bar chart. The stacked argument is set to True, but because we are plotting a single series it has no visible effect, so the result is a regular bar chart. Stacking only takes effect when a DataFrame with several columns is plotted.
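
For a chart where the bars are actually stacked, we need one column per category. A minimal sketch, assuming the data has one row per car and a categorical Accident column with values such as Yes and No (the records below are hypothetical, made up for illustration). The Agg backend and the output file name are also assumptions, chosen so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical records, one row per car, made up for illustration only
data = pd.DataFrame({
    "Car Type": ["SUV", "SUV", "SUV", "Sedan", "Sedan", "Sedan", "Truck", "Truck"],
    "Accident": ["Yes", "No", "No", "Yes", "Yes", "No", "No", "No"],
})

# Rows are car types, columns are accident categories, cells are counts
counts = pd.crosstab(data["Car Type"], data["Accident"])

# With several columns, stacked=True piles the categories on top of each other
ax = counts.plot(kind="bar", stacked=True)
ax.set_title("Accidents by Car Type")
ax.set_xlabel("Car Type")
ax.set_ylabel("Number of cars")
plt.tight_layout()
plt.savefig("accidents_by_car_type.png")  # hypothetical output file name
```

In an interactive session we could drop the matplotlib.use line and call plt.show() instead of saving the figure.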

Suppose we want to test if there is an association between the car type and the number of accidents. We can perform a chi-square test to evaluate this hypothesis.

import pandas as pd
import scipy.stats as stats

# Load the dataset
data = pd.read_csv('car_accidents.csv')

# Create a contingency table
contingency_table = pd.crosstab(data['Car Type'], data['Accident'])

# Perform chi-square test
chi2_statistic, p_value, dof, expected_values = stats.chi2_contingency(contingency_table)
print("Chi-square statistic: ", chi2_statistic)
print("p-value: ", p_value)
Code Explained
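The above code loads the car accidents dataset, builds a contingency table with pd.crosstab(), and runs a chi-square test of independence on it with stats.chi2_contingency(). The function returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies under independence; we print the first two. If the p-value is less than our significance level (usually 0.05), we reject the null hypothesis that car type and accident are independent and conclude that the two are associated. Note that the contingency table is built from the Accident column rather than the Accident Count column used in the bar chart; crosstab treats each distinct value in it as a category.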
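
The other two return values are worth a look too. The sketch below prints the degrees of freedom and the expected counts, and checks the common rule of thumb that every expected count should be at least 5 for the chi-square approximation to be reliable. The contingency table is hypothetical, made up for illustration only.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts of cars by type and accident status, made up for illustration only
contingency_table = pd.DataFrame(
    {"No": [90, 60, 30], "Yes": [10, 40, 20]},
    index=pd.Index(["SUV", "Sedan", "Truck"], name="Car Type"),
)

chi2_statistic, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Expected counts under independence, in the same layout as the observed table
expected_table = pd.DataFrame(
    expected, index=contingency_table.index, columns=contingency_table.columns
)

alpha = 0.05
print("Chi-square statistic:", chi2_statistic)
print("Degrees of freedom:", dof)  # (rows - 1) * (columns - 1)
print(expected_table)
print("All expected counts >= 5:", bool((expected_table >= 5).all().all()))
print("Reject independence:", p_value < alpha)
```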

That's the end of the article, readers!

I will be explaining more in my upcoming blogs!

"The best thing about being a statistician is that you get to play in everyone's backyard." - John Tukey

Do subscribe and keep supporting! 😊
