One-Hot Encoding in Data Science
Categorical Data
Data processing is an important step in the machine learning pipeline: it transforms raw data into useful and efficient features. Categorical data, in particular, must be converted into a suitable numeric form before being fed to a machine learning model.
In statistics, a categorical variable (also called a qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values.
In this article, we will see how to convert categorical data with one-hot encoding in Python using Pandas and Scikit-Learn.
One-Hot Encoding
One-hot encoding is a vector representation in which each category in a variable's value set is converted to a binary feature that contains 1 where the category is present in the current record and 0 otherwise.
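As a minimal sketch of this idea (the toy color values here are invented for illustration), a single column with three categories becomes three binary columns:

```python
import pandas as pd

# Toy column with three categories
colors = pd.Series(["Black", "Blue", "Red", "Blue"])

# One binary column per category; astype(int) gives 0/1 instead of booleans
# (recent pandas versions return boolean dummies by default)
encoded = pd.get_dummies(colors, prefix="c").astype(int)
print(encoded)
```

Each row has exactly one 1, in the column matching its original category.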
For the sake of simplicity, I constructed a small dataset representing a list of cars.
import pandas as pd
df = pd.DataFrame({
    "name": ["Golf", "A3", "Leon", "Passat", "X6M"],
    "price": [32000, 38000, 28000, 36000, 75000],
    "brand": ["VW", "Audi", "Seat", "VW", "BMW"],
    "color": ["Black", "Blue", "Red", "Blue", "Black"]
})
In this dataset, we have two categorical variables, brand and color. Each has a finite set of values: {"VW", "Audi", "Seat", "BMW"} and {"Black", "Blue", "Red"} respectively.
When working with real data that contains a huge number of rows, it is helpful to check the possible values of a categorical column in a DataFrame as follows:
brands = list(df["brand"].unique())
colors = list(df["color"].unique())
print("Brands labels:", brands)
print("Colors labels:", colors)
One-Hot Encoding with Pandas
One-hot encoding can be implemented with pandas using the get_dummies function, which takes the following parameters (learn more in the pandas documentation):
data: array-like, Series, or DataFrame — the data containing the categorical variables for which to get dummy indicators.
columns: list-like (default: None) — column names in the DataFrame to be encoded. By default (None), all columns with object or category dtype are converted.
prefix: str, list of str, or dict of str (default: None) — a prefix prepended to the converted column names; it can be a single str, a list of strings with the same length as the columns list, or a dict mapping column names to prefixes.
drop_first: bool (default: False) — whether to drop the first level to get k-1 dummies out of k categorical levels.
df_oh = pd.get_dummies(
    data=df,
    columns=["brand", "color"],
    prefix=["b", "c"])
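To illustrate the drop_first parameter, here is a sketch reusing the same toy dataset: dropping the first level of each variable leaves k-1 dummies, which is enough information to recover the original category and avoids redundant columns.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Golf", "A3", "Leon", "Passat", "X6M"],
    "price": [32000, 38000, 28000, 36000, 75000],
    "brand": ["VW", "Audi", "Seat", "VW", "BMW"],
    "color": ["Black", "Blue", "Red", "Blue", "Black"]
})

# drop_first=True removes the first (alphabetical) level of each variable,
# leaving k-1 dummies: here b_Audi and c_Black are dropped
df_k1 = pd.get_dummies(df, columns=["brand", "color"],
                       prefix=["b", "c"], drop_first=True)
print(df_k1.columns.tolist())
```

A row with all zeros in the remaining brand columns can only be an Audi, so no information is lost.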
For example, we can see that the first binary column b_Audi contains only one 1, because there is only one car (A3) of the brand Audi, whereas the fourth binary column b_VW contains two 1s, because two cars (Golf and Passat) are of the brand VW. The same holds for the color columns, where we have two Black, two Blue, and one Red.
One-Hot Encoding with scikit-learn
The scikit-learn library provides the OneHotEncoder class, a transformer that takes an array-like of integers or strings and converts it to a one-hot numeric array. By default, this transformer returns a sparse matrix; a dense array can be obtained by setting the sparse_output parameter (named sparse before scikit-learn 1.2) to False.
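As a quick sketch of the sparse default (the tiny input array is invented for illustration), the sparse matrix can also be converted to a dense one after the fact with .toarray():

```python
from sklearn.preprocessing import OneHotEncoder

X = [["VW"], ["Audi"], ["VW"]]

enc = OneHotEncoder()            # sparse output by default
sparse_m = enc.fit_transform(X)  # scipy sparse matrix
dense = sparse_m.toarray()       # dense numpy array

# Note: string categories are ordered alphabetically: ['Audi', 'VW']
print(dense)
```

Sparse output saves a lot of memory when a variable has many categories, since each row contains a single 1 among many zeros.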
Before passing the categorical data to the encoder, it would be helpful to construct the list of new columns names as follows:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
cat_cols = ["brand", "color"]
cat_cols_encoded = []
for col in cat_cols:
    # OneHotEncoder sorts string categories alphabetically, so sort the
    # unique values to keep the names aligned with the encoded columns
    cat_cols_encoded += [f"{col[0]}_{cat}" for cat in sorted(df[col].unique())]
cat_cols_encoded
['b_Audi', 'b_BMW', 'b_Seat', 'b_VW', 'c_Black', 'c_Blue', 'c_Red']
Once the list of columns names is constructed, we can fit and transform the categorical data using the One-Hot Encoder.
oh_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # use sparse=False on scikit-learn < 1.2
encoded_cols = oh_encoder.fit_transform(df[cat_cols])
df_enc = pd.DataFrame(encoded_cols, columns=cat_cols_encoded)
df_oh = df.join(df_enc)
df_oh
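One benefit of handle_unknown='ignore' shows up when transforming a record whose category the encoder never saw during fitting: instead of raising an error, the unseen value simply encodes to all zeros in its feature group. A sketch (the Tesla row is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "brand": ["VW", "Audi", "Seat", "VW", "BMW"],
    "color": ["Black", "Blue", "Red", "Blue", "Black"]
})

# Fit on the known brands and colors; default sparse output, so use .toarray()
oh_encoder = OneHotEncoder(handle_unknown='ignore')
oh_encoder.fit(df[["brand", "color"]])

# "Tesla" was never seen during fitting: its brand block encodes to all zeros,
# while the known color "Red" still gets its 1
new_car = pd.DataFrame({"brand": ["Tesla"], "color": ["Red"]})
row = oh_encoder.transform(new_car).toarray()[0]
print(row)
```

This makes the encoder robust to new categories appearing at prediction time, at the cost of those records carrying no brand information.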
Conclusion
One-hot encoding is not the only way to handle categorical variables, but it is popular in data science alongside other methods such as label encoding, ordinal encoding, dummy encoding, etc. Each method has its own pros and cons, so I encourage you to explore the other methods in order to decide which one is most suitable for your project.
Feel free to leave a comment or contact me if you have any questions / suggestions.
You can find the Jupyter-Notebook here to reproduce the results shown in this article.