Practical Encoding (12)

In this post, we will revisit the encoding techniques I explained earlier, this time through Python code:

One Hot Encoding

One hot encoding is a popular technique for converting categorical variables into a format that can be used for machine learning models. Here's an example of how to perform one hot encoding in Python using the Pandas library:

import pandas as pd

# create a sample dataframe with categorical data
df = pd.DataFrame({'fruit': ['apple', 'banana', 'banana', 'orange', 'apple']})

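# perform one hot encoding on the 'fruit' column
# prefix='fruit' names the new columns fruit_apple, fruit_banana, ...
# dtype=int gives 0/1 values (recent pandas versions default to True/False)
one_hot = pd.get_dummies(df['fruit'], prefix='fruit', dtype=int)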

# combine the one hot encoded data with the original dataframe
df = pd.concat([df, one_hot], axis=1)

# remove the original 'fruit' column
df.drop('fruit', axis=1, inplace=True)

# display the resulting dataframe
print(df)

In this example, we start by creating a sample dataframe with a categorical variable called "fruit". We then use the get_dummies() function from the Pandas library to perform one hot encoding on the "fruit" column. This creates a new dataframe with a column for each possible value of "fruit", and a 1 in the corresponding column for each row where that value appears.

We then use the concat() function to combine the one hot encoded data with the original dataframe, and the drop() function to remove the original "fruit" column. Finally, we print the resulting dataframe to see the one hot encoded data in action.

Output

   fruit_apple  fruit_banana  fruit_orange
0            1             0             0
1            0             1             0
2            0             1             0
3            0             0             1
4            1             0             0
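
To make the mechanics concrete, here is a minimal sketch that builds the same indicator columns by hand and compares them with what get_dummies() returns. The variable names manual and auto are only illustrative, and the explicit dtype=int is an assumption made so that both results hold 0/1 integers.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'banana', 'orange', 'apple']})

# build one indicator column per distinct category, in sorted order
manual = pd.DataFrame({
    f'fruit_{value}': (df['fruit'] == value).astype(int)
    for value in sorted(df['fruit'].unique())
})

# the same encoding from get_dummies, with an explicit prefix and integer dtype
auto = pd.get_dummies(df['fruit'], prefix='fruit', dtype=int)

# compare column names and cell values of the two results
print(list(manual.columns) == list(auto.columns))
print((manual.to_numpy() == auto.to_numpy()).all())
```

Each row ends up with exactly one 1, in the column that matches its original category.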

One Hot Encoding with Many Categorical Variables

Here's an example of how to perform one hot encoding on several categorical variables at once in Python using the Pandas library:

import pandas as pd

# create a sample dataframe with categorical data
df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'banana', 'orange', 'apple'],
    'color': ['red', 'yellow', 'green', 'orange', 'green'],
    'size': ['small', 'medium', 'medium', 'large', 'small']
})

# perform one hot encoding on all categorical columns
# dtype=int gives 0/1 values (recent pandas versions default to True/False)
one_hot = pd.get_dummies(df, columns=['fruit', 'color', 'size'], dtype=int)

# display the resulting dataframe
print(one_hot)

In this example, we start by creating a sample dataframe with multiple categorical variables, including "fruit", "color", and "size". We then use the get_dummies() function from the Pandas library to perform one hot encoding on all of the categorical columns in the dataframe.

We specify the columns to encode by passing a list of column names to the columns parameter of the get_dummies() function. This creates a new dataframe with a column for each possible value of each categorical variable, and a 1 in the corresponding column for each row where that value appears.

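Output (pandas may wrap or truncate these columns depending on your display width)

   fruit_apple  fruit_banana  fruit_orange  color_green  color_orange  color_red  color_yellow  size_large  size_medium  size_small
0            1             0             0            0             0          1             0           0            0           1
1            0             1             0            0             0          0             1           0            1           0
2            0             1             0            1             0          0             0           0            1           0
3            0             0             1            0             1          0             0           1            0           0
4            1             0             0            1             0          0             0           0            0           1

Note that the resulting dataframe has ten columns instead of the original three, because each categorical variable has been expanded into one column per distinct value, with the column name prefixed by the variable it came from.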
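
As a quick sanity check on the column count, the sketch below compares the number of one hot columns with the total number of distinct values across the three variables. The name expected_columns is only illustrative, and dtype=int is an assumption made so the values are 0/1 integers.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'banana', 'orange', 'apple'],
    'color': ['red', 'yellow', 'green', 'orange', 'green'],
    'size': ['small', 'medium', 'medium', 'large', 'small']
})

one_hot = pd.get_dummies(df, columns=['fruit', 'color', 'size'], dtype=int)

# distinct values per variable: fruit has 3, color has 4, size has 3
expected_columns = int(df.nunique().sum())

print(expected_columns, one_hot.shape[1])  # 10 10

# every row has exactly one 1 per original variable, so each row sums to 3
print(one_hot.sum(axis=1).tolist())  # [3, 3, 3, 3, 3]
```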

Mean Encoding

Mean encoding, also known as target encoding, is another technique for encoding categorical variables that can be used for machine learning models. Here's an example of how to perform mean encoding in Python using the Pandas library:

import pandas as pd

# create a sample dataframe with categorical data
df = pd.DataFrame({'fruit': ['apple', 'banana', 'banana', 'orange', 'apple'], 'target': [1, 0, 1, 1, 0]})

# compute the mean target value for each category
means = df.groupby('fruit')['target'].mean()

# replace the categorical values with their mean target values
df['fruit_mean'] = df['fruit'].map(means)

# display the resulting dataframe
print(df)

In this example, we start by creating a sample dataframe with a categorical variable called "fruit" and a target variable called "target". We then compute the mean target value for each category of "fruit" using the groupby() and mean() functions from the Pandas library.

We create a new column in the dataframe called "fruit_mean" by mapping each categorical value of "fruit" to its corresponding mean target value using the map() function. This replaces each categorical value with its mean target value.

Finally, we print the resulting dataframe to see the mean encoded data in action. The resulting dataframe will look like this:

    fruit  target  fruit_mean
0   apple       1         0.5
1  banana       0         0.5
2  banana       1         0.5
3  orange       1         1.0
4   apple       0         0.5
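
The groupby-then-map steps can also be written in one line. The sketch below is an alternative under the same sample data, not a replacement for the approach above: it uses groupby(...).transform('mean'), which returns the group mean aligned to each original row.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'banana', 'orange', 'apple'],
    'target': [1, 0, 1, 1, 0]
})

# transform('mean') broadcasts each category's mean target back onto its rows
df['fruit_mean'] = df.groupby('fruit')['target'].transform('mean')

# apple: (1 + 0) / 2, banana: (0 + 1) / 2, orange: 1 / 1
print(df['fruit_mean'].tolist())  # [0.5, 0.5, 0.5, 1.0, 0.5]
```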

Label Encoding

Label Encoding is a technique used to encode categorical data into numerical values. In this technique, each unique category is assigned a value from 0 to n-1, where n is the number of unique categories.

Here's an example of how to perform Label Encoding in Python:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# create a sample dataframe with categorical data
df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'banana', 'orange', 'apple']
})

# create a label encoder object
le = LabelEncoder()

# fit the label encoder to the 'fruit' column
le.fit(df['fruit'])

# transform the 'fruit' column using the label encoder
df['fruit_encoded'] = le.transform(df['fruit'])

# display the resulting dataframe
print(df)

Here's a breakdown of what each part of the code does:

First, we import the necessary libraries: pandas for creating a sample dataframe, and LabelEncoder from sklearn.preprocessing for performing the label encoding.
We then create a sample dataframe with a single categorical column called "fruit".
Next, we create a LabelEncoder object called le.
We fit the le object to the "fruit" column of the dataframe using the fit() method.
We then transform the "fruit" column using the transform() method of the le object, which assigns numerical values to each unique category.
We add the resulting encoded column to the original dataframe with the name "fruit_encoded".
Finally, we display the resulting dataframe with the encoded column.

Output:

    fruit  fruit_encoded
0   apple              0
1  banana              1
2  banana              1
3  orange              2
4   apple              0
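
To see which integer belongs to which category, you can inspect the fitted encoder. The sketch below uses the classes_ attribute and the inverse_transform() method of LabelEncoder; the names codes and mapping are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit_transform() combines the fit() and transform() steps
codes = le.fit_transform(['apple', 'banana', 'banana', 'orange', 'apple'])

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'apple': 0, 'banana': 1, 'orange': 2}

# inverse_transform() converts codes back to the original labels
print(list(le.inverse_transform([2, 0])))
```

Because the codes follow alphabetical order, they carry no meaning about the categories themselves, which is worth keeping in mind before feeding them to a model that treats numbers as ordered.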

Target Guided Ordinal Encoding

Target Guided Ordinal Encoding is a technique used for encoding categorical variables where the numerical values are assigned based on the relationship between the variable and the target variable. This encoding technique can be useful when there is a strong correlation between the categorical variable and the target variable, and when other encoding techniques like Label Encoding or One-Hot Encoding do not perform well.

Here's how you can implement Target Guided Ordinal Encoding in Python:

import pandas as pd

# create a sample dataframe with categorical and target data
df = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'Paris'],
    'target': [1, 0, 1, 0, 0, 1]
})

# calculate the mean target value for each unique category
mean_target = df.groupby('city')['target'].mean()

# sort the categories based on their mean target value
mean_target_sorted = mean_target.sort_values()

# create a mapping from each category to its rank in that sorted order
mapping = {category: i for i, category in enumerate(mean_target_sorted.index)}

# apply the mapping to the 'city' column to create the encoded column
df['city_encoded'] = df['city'].map(mapping)

# display the resulting dataframe
print(df)

In this example, we first create a sample dataframe with a categorical variable 'city' and a target variable 'target'. We then group the dataframe by the 'city' column and calculate the mean target value for each unique category. We sort the categories by their mean target value and create a mapping from each category to its rank in that order, starting at 0. Finally, we apply the mapping to the 'city' column to create the encoded column 'city_encoded'.

The resulting dataframe will have a new column 'city_encoded' that contains the integer rank assigned to each category. Categories with a higher mean target value are assigned a higher integer, so the encoding preserves the ordering of the categories by mean target. Here, Tokyo (mean 0.0) maps to 0, Paris (mean of about 0.33) maps to 1, and London (mean 1.0) maps to 2.
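
The same steps can be wrapped in a small reusable function. This is only a sketch: target_guided_ordinal is a hypothetical helper name, not part of Pandas, and it assumes the category means are distinct so that the ordering is unambiguous.

```python
import pandas as pd

def target_guided_ordinal(df, column, target):
    """Hypothetical helper: rank categories by mean target, lowest first."""
    ordered = df.groupby(column)[target].mean().sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

df = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'Paris'],
    'target': [1, 0, 1, 0, 0, 1]
})

mapping = target_guided_ordinal(df, 'city', 'target')
print(mapping)  # {'Tokyo': 0, 'Paris': 1, 'London': 2}

df['city_encoded'] = df['city'].map(mapping)
print(df['city_encoded'].tolist())  # [2, 1, 2, 0, 1, 1]
```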

That's the end of the article, readers!

I will be explaining more in my upcoming blogs!

"Knowledge is knowing the right answer, intelligence is asking the right question." - Unknown
