Kinds of Encoding (11)

Encoding in Pandas



Definition

Encoding refers to the process of converting categorical data (i.e., data that consists of groups or categories, rather than numerical values) into a format that can be used by machine learning algorithms.

Put simply, encoding is the conversion of categorical values into numerical values.

Types

There are broadly two kinds of encoding:

  1. Nominal Encoding (no ranks, no hierarchy, etc.)

    Nominal encoding is used when the categories of the categorical data do not have any particular order.

  2. Ordinal Encoding (ranks, hierarchy, etc.)

    Ordinal encoding is used when the categories of the categorical data have a particular order.

Nominal Types

Under Nominal Encoding, there are three types of encoding:

  1. One Hot Encoding

  2. One Hot Encoding with many Categorical

  3. Mean Encoding

Ordinal Types

Under Ordinal Encoding, there are two types of encoding:

  1. Label Encoding

  2. Target Guided Ordinal Encoding

Let's Understand Encoding in Detail!

One Hot Encoding

One-hot encoding is a technique used to convert categorical data into a format that can be used for machine learning models. It creates a binary column for each category in the data, with a value of 1 indicating that the observation belongs to that category, and a value of 0 indicating that it does not.

Let's take an example to understand this better. Suppose you have a dataset with a categorical column called "fruit", which has three categories: apple, banana, and orange. Using one-hot encoding, you can create three new columns called "fruit_apple", "fruit_banana", and "fruit_orange", where each row has a value of 1 in the column corresponding to its fruit and 0 in the other columns.

So, if you have an observation with the fruit value "apple", the row will have a value of 1 in the "fruit_apple" column and 0 in the "fruit_banana" and "fruit_orange" columns.

Here's another example to make things more interesting. Suppose you have a dataset with a categorical column called "animal", which has five categories: dog, cat, bird, fish, and turtle. Using one-hot encoding, you can create five new columns called "animal_dog", "animal_cat", "animal_bird", "animal_fish", and "animal_turtle".

If you have an observation with the animal value "turtle", the row will have a value of 0 in the "animal_dog", "animal_cat", "animal_bird", and "animal_fish" columns, and a value of 1 in the "animal_turtle" column.
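The fruit example above can be sketched with the pandas get_dummies function. This is a minimal sketch, not the only way to do it; the DataFrame and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three categories.
df = pd.DataFrame({"fruit": ["apple", "banana", "orange", "apple"]})

# prefix="fruit" names the new columns fruit_apple, fruit_banana, fruit_orange.
# dtype=int gives 0/1 values instead of booleans.
one_hot = pd.get_dummies(df["fruit"], prefix="fruit", dtype=int)

print(one_hot)
```

Each row ends up with exactly one 1, in the column that matches its fruit.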

One Hot Encoding with many Categorical

One-hot encoding with many categorical variables applies the same technique to several categorical columns at once. It is sometimes loosely called dummy encoding, although strictly speaking dummy encoding drops one of the binary columns for each variable. It works by creating a binary column for each category in each categorical column, with a value of 1 indicating that the observation belongs to that category, and a value of 0 indicating that it does not.

Let's take an example to understand this better. Suppose you have a dataset with three categorical columns: "fruit", "color", and "size". The "fruit" column has three categories: apple, banana, and orange; the "color" column has two categories: red and green; and the "size" column has two categories: small and large.

Using one-hot encoding with many categorical variables, you can create seven new columns: "fruit_apple", "fruit_banana", "fruit_orange", "color_red", "color_green", "size_small", and "size_large". Each row will have a value of 1 in the appropriate column(s) for its categories and a value of 0 in all other columns.

For example, if you have an observation with the fruit value "banana", the color value "green", and the size value "large", the row will have a value of 1 in the "fruit_banana", "color_green", and "size_large" columns, and a value of 0 in all other columns.
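Here is a small sketch of the same idea in pandas, assuming a made-up three-row DataFrame with the "fruit", "color", and "size" columns described above.

```python
import pandas as pd

# Hypothetical toy data with three categorical columns.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange"],
    "color": ["red", "green", "green"],
    "size": ["small", "large", "small"],
})

# Encode all three columns in one call; each new column is named <column>_<category>.
encoded = pd.get_dummies(df, columns=["fruit", "color", "size"], dtype=int)

# The second row is the banana / green / large observation.
print(encoded.loc[1])
```

The result has seven binary columns (3 + 2 + 2), and every row has exactly three 1s, one per original column.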

Mean Encoding

Mean encoding, also known as target encoding, is a technique used to convert categorical data into numerical values based on the mean value of the target variable for each category. It works by replacing each category in the categorical column with the mean value of the target variable for observations within that category.

Let's take an example to understand this better. Suppose you have a dataset with a categorical column called "city", which has several categories such as "New York", "Los Angeles", "Chicago", and "Miami". You also have a target variable called "price", which is the price of a house in each city.

Using mean encoding, you can replace each category in the "city" column with the mean value of the "price" variable for observations with that city. For example, the mean price of a house in New York might be $500,000, while the mean price in Los Angeles might be $400,000.

So, if you have an observation with the "city" value "New York", you would replace it with the mean value of $500,000. This can be done for all categories in the "city" column, resulting in a numerical representation of the categorical data.
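A minimal sketch of mean encoding with pandas could look like this. The prices are invented so that the city means match the figures used above; they are not real housing data.

```python
import pandas as pd

# Hypothetical toy data: prices are made up for illustration only.
df = pd.DataFrame({
    "city": ["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"],
    "price": [550_000, 450_000, 420_000, 380_000, 300_000],
})

# Mean of the target ("price") for each category of "city".
city_means = df.groupby("city")["price"].mean()

# Replace each city with the mean price of its group.
df["city_encoded"] = df["city"].map(city_means)

print(df)
```

In a real project the means would normally be computed on the training data only and then applied to the test data, so that the target does not leak into the features.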

Label Encoding

Label encoding is a technique used to convert categorical data into numerical values by assigning each category a unique integer value. It works by replacing each category in the categorical column with an integer value, starting from 0 for the first category and increasing by 1 for each subsequent category.

Let's take an example to understand this better. Suppose you have a dataset with a categorical column called "color", which has three categories: red, green, and blue. Using label encoding, you would replace each category with an integer value: 0 for red, 1 for green, and 2 for blue.

So, if you have an observation with the "color" value "red", you would replace it with the integer value 0. This can be done for all categories in the "color" column, resulting in a numerical representation of the categorical data.
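The color example can be sketched with a plain dictionary and the pandas map method. The mapping below is written by hand to match the numbers above; the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# Hand-written mapping that matches the example: red -> 0, green -> 1, blue -> 2.
color_to_int = {"red": 0, "green": 1, "blue": 2}

# Replace each category with its integer label.
df["color_encoded"] = df["color"].map(color_to_int)

print(df)
```

Writing the mapping yourself keeps the order under your control. Library helpers pick their own order; for instance, scikit-learn's LabelEncoder assigns integers in sorted order of the labels, so "blue" would get 0 there.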

Target Guided Ordinal Encoding

Target guided ordinal encoding, also known as ordered mean encoding, is a technique used to convert categorical data into numerical values based on the relationship between the categories and the target variable. It works by computing the mean of the target variable for each category, ranking the categories by that mean, and replacing each category with its rank.

Let's take an example to understand this better. Suppose you have a dataset with a categorical column called "grade", which has five categories: A, B, C, D, and E. You also have a target variable called "pass", which indicates whether a student passed or failed a test.

Using target guided ordinal encoding, you can replace each category in the "grade" column with a numerical value that reflects the relationship between the category and the target variable. For example, if the proportion of students who passed the test was highest for category A and lowest for category E, you would assign a higher numerical value to category A and a lower numerical value to category E.

So, if you have an observation with the "grade" value "A", you would replace it with a higher numerical value, reflecting the higher proportion of students who passed the test in that category. This can be done for all categories in the "grade" column, resulting in a numerical representation of the categorical data that reflects the relationship between the categories and the target variable.
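One way to sketch this in pandas is to rank the grades by their pass rate and map each grade to its rank. The data below is made up so that A has the highest pass rate and E the lowest; the names pass_rate and grade_to_rank are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: four students per grade, 1 = passed, 0 = failed.
grades = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4 + ["E"] * 4
passed = [1, 1, 1, 1,  1, 1, 1, 0,  1, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 0]
df = pd.DataFrame({"grade": grades, "pass": passed})

# Proportion of students who passed, per grade.
pass_rate = df.groupby("grade")["pass"].mean()

# Order the grades from lowest to highest pass rate and number them 0, 1, 2, ...
ordered = pass_rate.sort_values().index
grade_to_rank = {grade: rank for rank, grade in enumerate(ordered)}

# Replace each grade with its rank.
df["grade_encoded"] = df["grade"].map(grade_to_rank)

print(grade_to_rank)
```

With this data, E gets the lowest value and A the highest, which is the ordering described above. As with mean encoding, the ranks would usually be learned from the training data only.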

That's the end of the article, readers!

I will be explaining more in my upcoming blogs!

"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." - Abraham Lincoln

Do subscribe and keep supporting! 😊
