A categorical variable is one that is discrete in nature: it takes one of a limited number of possible values, usually two or more. Such variables are also called attribute variables or qualitative variables. Most mathematical models need numeric data to work with, so categorical features must first be converted to numbers before they can be used in a model.
Types of Categorical Variables
Categorical variables fall into two main types: nominal and ordinal.
They are often stored as text, where each value represents a different trait of an observation.
For example, gender can be recorded as male or female. This is a nominal feature: there is no order of preference among the values.
In contrast, economic status can be divided into high, medium and low. These values have a natural order, so we refer to them as ordinal features.
Encoding Techniques for Categorical Variables
There are various techniques to encode categorical features. Some of the common methods are discussed here.
A useful first step in EDA (Exploratory Data Analysis) on categorical data is to check the frequency distribution of the categories. The simplest encoding is then to replace each category value with a suitable number chosen by hand.
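The frequency check and a manual replacement can be sketched in plain Python. The sample column and the replacement numbers below are made up for illustration; in practice the mapping is a choice the analyst makes.

```python
from collections import Counter

# Hypothetical sample column; the values are invented for this sketch.
status = ["high", "low", "medium", "low", "low", "high"]

# Frequency distribution: how often each category occurs.
freq = Counter(status)

# Hand-written replacement map (assumption: numbers chosen by the analyst).
replacements = {"low": 1, "medium": 2, "high": 3}
encoded = [replacements[value] for value in status]

print(freq.most_common())
print(encoded)
```

A value that is missing from the map raises a KeyError here, which is a simple way to notice categories that were overlooked.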
Another popular method is label encoding, also known as ordinal encoding. It converts each distinct value in a column to an integer, so every label is assigned its own number. Because integers are ordered, this suits ordinal data; applied to nominal data, a model may read the numbers as a ranking that does not exist.
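A minimal sketch of label encoding, assuming integers are assigned in sorted order of the labels. The helper name label_encode and the sample data are hypothetical; real libraries may assign the codes in a different order.

```python
def label_encode(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Hypothetical sample column.
colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)

print(mapping)
print(codes)
```

The returned mapping should be kept so that the same codes can be applied to new data later.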
One hot encoding is often a better choice than label encoding for nominal data, because it does not impose an order on the categories. It splits the column into multiple columns, one per category, and each new feature is binary: 1 if the observation belongs to that category and 0 otherwise. In short, it creates dummy variables. The drawback is that one hot encoding increases dimensionality, which becomes a problem when a column has many distinct categories.
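One hot encoding can be sketched as follows. The helper one_hot_encode and the sample data are hypothetical, and the column order (sorted category names) is an assumption of this sketch.

```python
def one_hot_encode(values):
    """Return the category names and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical sample column.
gender = ["male", "female", "female", "male"]
categories, rows = one_hot_encode(gender)

print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is where the growth in dimensionality comes from.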
In target encoding, the mean of the target variable for each category replaces the categorical value. It is a Bayesian encoding technique. Target encoder implementations typically transform only the categorical columns. However, this method may lead to overfitting, since the encoding is derived from the target itself, and it is sensitive to differences in the distribution of categories between train and test data. To reduce these issues, each category's mean can be shrunk toward the overall mean of the target, so that categories with few observations do not receive extreme values.
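The sketch below illustrates target encoding with an optional shrinkage toward the overall mean. The function target_encode, the smoothing parameter and the sample data are assumptions for illustration; this additive form is one common way to smooth, not the formula of any particular library. To limit leakage, the statistics would be computed on the training data only.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=0.0):
    """Replace each category by a (smoothed) mean of the target.

    With smoothing=0 this is the plain per-category mean. A larger value
    pulls categories with few observations toward the overall mean.
    """
    overall_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        c: (sums[c] + smoothing * overall_mean) / (counts[c] + smoothing)
        for c in counts
    }

# Hypothetical training data: a city column and a binary target.
city = ["A", "A", "B", "B", "B", "C"]
bought = [1, 0, 1, 1, 0, 1]

plain = target_encode(city, bought)
smoothed = target_encode(city, bought, smoothing=2.0)

print(plain)
print(smoothed)
```

Category C appears only once, so its plain encoding equals its single target value; with smoothing, that value moves toward the overall mean.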
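Another type of encoding is effect encoding. It works like one hot encoding with one category left out as a reference, but uses three values, -1, 0 and 1, to represent the data: the reference category is coded as -1 in every column instead of 0.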
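A minimal sketch of effect encoding, assuming one category is picked as the reference. The helper effect_encode, its reference parameter and the sample data are hypothetical.

```python
def effect_encode(values, reference=None):
    """Encode with -1, 0 and 1; the reference category becomes all -1."""
    categories = sorted(set(values))
    if reference is None:
        # Assumption of this sketch: default to the last category in sorted order.
        reference = categories[-1]
    columns = [c for c in categories if c != reference]
    rows = []
    for v in values:
        if v == reference:
            rows.append([-1] * len(columns))
        else:
            rows.append([1 if v == c else 0 for c in columns])
    return columns, rows

# Hypothetical sample column.
status = ["high", "low", "medium", "low"]
columns, rows = effect_encode(status, reference="medium")

print(columns)
for row in rows:
    print(row)
```

The result has one column fewer than one hot encoding, because the reference category is expressed through the -1 rows rather than a column of its own.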
Encoding categorical data is an unavoidable step when working with non-numeric data. Every method has its benefits along with some unwanted effects, so it is necessary to understand which method suits which use case.