OneHotEncoder
This post continues the previous post about LabelEncoder. This time the topic is a technique called one-hot encoding. Once the categories have been converted into numbers, we can go one step further and expand them into several columns, one per category. Each column holds a 1 if the row belongs to that category and a 0 otherwise. We use this method when an algorithm would treat the integer codes as ordinary numbers and therefore assume an order between them, for example that the category coded 2 is "greater" than the category coded 0, even though the categories have no natural order.
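To make the ordering problem concrete, here is a small illustrative sketch. The category names `a`, `b`, `c`, their codes, and the `euclidean` helper are made up for this example and do not come from any library. It compares the distances between three unordered categories when they are stored as integer codes and when they are stored as one-hot vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical integer codes for three unordered categories.
int_codes = {"a": [0], "b": [1], "c": [2]}
# The same categories written as one-hot vectors.
one_hot = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}

# With integer codes, "a" appears twice as far from "c" as from "b".
d_int_ab = euclidean(int_codes["a"], int_codes["b"])
d_int_ac = euclidean(int_codes["a"], int_codes["c"])

# With one-hot vectors, every pair of categories is equally far apart.
d_hot_ab = euclidean(one_hot["a"], one_hot["b"])
d_hot_ac = euclidean(one_hot["a"], one_hot["c"])

print(d_int_ab, d_int_ac)
print(d_hot_ab, d_hot_ac)
```

A distance-based model would read the integer codes as saying that some categories are closer together than others. The one-hot version removes that artificial structure.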
Suppose we have a categorical variable "color" with three possible values: red, green, and blue.
Variable | Color |
---|---|
1 | red |
2 | green |
3 | blue |
4 | green |
5 | red |
After label encoding, blue is converted to 0, green to 1, and red to 2. LabelEncoder first sorts the distinct values (alphabetically for strings) and then numbers them in that sorted order, not in the order in which they appear in the data.
Variable | Color |
---|---|
1 | 2 |
2 | 1 |
3 | 0 |
4 | 1 |
5 | 2 |
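The numbering step can be sketched in plain Python. This is only an illustration of the idea, assuming the encoder numbers the sorted distinct values. The `label_encode` helper is hypothetical and is not how scikit-learn is implemented internally.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[v] for v in values]


colors = ["red", "green", "blue", "green", "red"]
classes, codes = label_encode(colors)

print(classes)  # ['blue', 'green', 'red']
print(codes)    # [2, 1, 0, 1, 2]
```

Because the classes are sorted before numbering, the code assigned to a value does not depend on where it first appears in the data.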
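We then process this data using the one-hot encoding technique. Each category gets its own column, listed here in the sorted order that LabelEncoder uses (blue, green, red), and each row has a 1 only in the column of its own color:
Variable | Blue | Green | Red |
---|---|---|---|
1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 |
3 | 1 | 0 | 0 |
4 | 0 | 1 | 0 |
5 | 0 | 0 | 1 |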
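Turning integer codes into such a table takes only a few lines of plain Python. The sketch below is illustrative: the `one_hot` helper is hypothetical, and the codes assume the sorted mapping blue = 0, green = 1, red = 2.

```python
def one_hot(codes, n_classes):
    """Return one row per code, with a 1 in the column selected by that code."""
    return [[1 if col == code else 0 for col in range(n_classes)]
            for code in codes]


# Codes for red, green, blue, green, red (assuming blue=0, green=1, red=2).
codes = [2, 1, 0, 1, 2]
rows = one_hot(codes, n_classes=3)

for row in rows:
    print(row)
```

Every row contains exactly one 1, so the columns sum to the number of rows in each category.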
```python
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from __future__ import print_function

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Print arrays in full, without truncation
np.set_printoptions(threshold=np.inf)

X = np.array(['red', 'green', 'blue', 'green', 'red'])

# Convert the string labels to integer codes
lbl_enc = LabelEncoder()
lbl_enc.fit(X)
X_cat = lbl_enc.transform(X)
print(X_cat)

# OneHotEncoder expects a 2D array of shape (n_samples, n_features),
# so reshape the 1D array of codes into a single column
X_cat = X_cat.reshape(-1, 1)

# One-hot encode the integer codes; transform() returns a sparse matrix,
# which toarray() converts to a dense NumPy array
encoder = OneHotEncoder()
encoder.fit(X_cat)
X_cat = encoder.transform(X_cat)
X_cat = X_cat.toarray()
print()
print(X_cat)
```
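At the output we first get the label codes (LabelEncoder sorts the classes, so blue is 0, green is 1, and red is 2), followed by the one-hot matrix with one column per code:
```
[2 1 0 1 2]

[[ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
```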
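As a side note, newer scikit-learn releases let OneHotEncoder work on string categories directly, so the LabelEncoder step can be skipped. The sketch below assumes scikit-learn 0.20 or later and Python 3; it is a variation on the script above, not a replacement for it.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single feature column of string categories, shape (n_samples, 1)
X = np.array(['red', 'green', 'blue', 'green', 'red']).reshape(-1, 1)

# Assumes scikit-learn >= 0.20, where string input is accepted
encoder = OneHotEncoder()
X_hot = encoder.fit_transform(X).toarray()  # sparse result -> dense array

# categories_ holds the sorted categories found for each input column
print(encoder.categories_)
print(X_hot)
```

The column order of `X_hot` follows `encoder.categories_[0]`, which is again the sorted list of colors.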