OneHotEncoder

February 17, 2017

This post continues the previous post about LabelEncoder. This time the topic is a technique called one-hot encoding. Once the categories have been converted into integers, we can go one step further and expand them into several columns, one per category, each containing zeros and ones that indicate whether a row belongs to that category. We use this method with algorithms that would misread the integer codes as ordered quantities: the codes are arbitrary labels, but such an algorithm would treat 2 as greater than 1 and 1 as greater than 0.

Suppose we have a categorical variable "color" with three possible values: red, green, blue.

Variable	Color
1	red
2	green
3	blue
4	green
5	red

After label encoding, blue is converted to zero, green to one, and red to two. LabelEncoder from scikit-learn first sorts the distinct values (alphabetically, for strings) and then numbers them in that sorted order.

Variable	Color
1	2
2	1
3	0
4	1
5	2
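
The numbering can be reproduced by hand. The following is a minimal sketch of the idea in plain Python, not scikit-learn's actual implementation; the names `colors`, `classes`, `mapping` and `codes` are illustrative. With the English strings used here, alphabetical sorting gives blue = 0, green = 1, red = 2.

```python
# Illustrative sketch: mimic label encoding with plain Python.
colors = ['red', 'green', 'blue', 'green', 'red']

# Sort the distinct values; strings sort alphabetically.
classes = sorted(set(colors))

# Assign each class its position in the sorted list.
mapping = {label: index for index, label in enumerate(classes)}

# Replace every value with its integer code.
codes = [mapping[color] for color in colors]

print(classes)  # ['blue', 'green', 'red']
print(codes)    # [2, 1, 0, 1, 2]
```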

We then apply one-hot encoding to this data, which produces one column per category:

Variable	Blue	Green	Red
1	0	0	1
2	0	1	0
3	1	0	0
4	0	1	0
5	0	0	1
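
To make the idea concrete, here is a small NumPy-only sketch that builds a one-hot table from integer codes. It assumes the codes blue = 0, green = 1, red = 2 and is only an illustration of the technique, not what OneHotEncoder does internally; the names `codes`, `n_categories` and `one_hot` are made up for this example.

```python
import numpy as np

# Integer codes for red, green, blue, green, red
# (assumption: blue = 0, green = 1, red = 2).
codes = np.array([2, 1, 0, 1, 2])

# Row i of the identity matrix is the one-hot vector for code i,
# so indexing the identity matrix by the codes builds the whole table.
n_categories = codes.max() + 1
one_hot = np.eye(n_categories, dtype=int)[codes]

print(one_hot)
```

Each row contains exactly one 1, placed in the column whose index equals the row's integer code.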

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-

from __future__ import print_function

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

np.set_printoptions(threshold=np.inf)


X = np.array(['red', 'green', 'blue', 'green', 'red'])

lbl_enc = LabelEncoder()
lbl_enc.fit(X)
X_cat = lbl_enc.transform(X)

print(X_cat)

# OneHotEncoder expects a 2-D array (samples x features), so we reshape
# the 1-D array into a single column; otherwise the script throws an error
X_cat = X_cat.reshape(-1, 1)

encoder = OneHotEncoder()
encoder.fit(X_cat)
X_cat = encoder.transform(X_cat)
X_cat = X_cat.toarray()

print()
print(X_cat)

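The script prints the output below. LabelEncoder sorts the classes alphabetically (blue = 0, green = 1, red = 2), and the one-hot columns follow the same order:

[2 1 0 1 2]

[[ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]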
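
As a side note, and as an assumption about the reader's environment rather than something the script above relies on: from scikit-learn 0.20 onward, OneHotEncoder accepts string categories directly, so the LabelEncoder step can be skipped. A minimal sketch, with illustrative variable names:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One feature column with five samples; OneHotEncoder expects a 2-D array.
X = np.array(['red', 'green', 'blue', 'green', 'red']).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(X).toarray()

# categories_ holds the column order for each input feature.
print(encoder.categories_[0])
print(one_hot)

# inverse_transform maps the one-hot rows back to the original labels.
print(encoder.inverse_transform(one_hot).ravel())
```

The `categories_` attribute is the thing to check when in doubt about which column corresponds to which category.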