4 scikit-learn functions for preprocessing data for machine learning



Hello everyone, this is candle. This time we will preprocess data with scikit-learn, a machine learning library for Python.

With scikit-learn you can use what is called a converter (a transformer, in scikit-learn's own terminology) and convert the input data with its fit_transform() method. Since there are many converters, I will introduce four that are often used in machine learning: Imputer, StandardScaler, MinMaxScaler, and OneHotEncoder.



scikit-learn 0.19.1

To run the sample code, you also need numpy in addition to the library above.


Imputer replaces the missing values contained in the data (NaN by default) with another value.
The following values are set by default as arguments.

Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

missing_values is an integer or the string 'NaN'. Every entry in the data equal to this value is treated as missing and replaced. Set it when missing data is marked by a number rather than NaN.
strategy is a str and selects the replacement value: 'mean' (average), 'median', or 'most_frequent' (mode).
axis is an int. When 0 is specified, a missing value is replaced with the value computed from its column (vertical).
When 1 is specified, it is replaced with the value computed from its row (horizontal).

Let's try it. Create a file wherever you like.


Write the following code.

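from sklearn.preprocessing import Imputer
import numpy as np

# np.nan marks the missing entry; the default missing_values='NaN' matches it.
data = np.array([[7, 2, 3],
                 [8, np.nan, 3],
                 [3, 8, 5]])
imputer = Imputer()  # defaults: strategy='mean', axis=0 (column means)
new_data = imputer.fit_transform(data)
print(new_data)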

Run it.

[[ 7.  2.  3.]
 [ 8.  5.  3.]
 [ 3.  8.  5.]]

The missing value was replaced with 5, the mean of the other values in its column (2 and 8).
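To make the replacement rule concrete, here is a minimal sketch of column-mean imputation written with numpy alone. The helper name impute_column_mean is hypothetical and not part of scikit-learn. It only mirrors the default settings (strategy='mean', axis=0) and is not how the library is implemented. Note that newer scikit-learn releases provide sklearn.impute.SimpleImputer in place of Imputer, and this sketch depends on neither.

```python
import numpy as np


def impute_column_mean(data):
    """Replace each NaN with the mean of the non-missing values in its column.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    data = np.asarray(data, dtype=float)
    # nanmean skips NaN entries when averaging down each column (axis=0).
    column_means = np.nanmean(data, axis=0)
    # Where an entry is NaN, take the mean of its column; otherwise keep it.
    return np.where(np.isnan(data), column_means, data)


data = np.array([[7, 2, 3],
                 [8, np.nan, 3],
                 [3, 8, 5]])
filled = impute_column_mean(data)
print(filled)
```

The middle entry of the second row becomes 5, the same result as the Imputer example.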


StandardScaler standardizes the data so that each column has a mean of 0 and a standard deviation of 1.
The following values are set by default as arguments.

StandardScaler(copy=True, with_mean=True, with_std=True)

Create a file.


Write the following code.

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[7., 2., 3.],
                 [8., 5., 3.],
                 [3., 8., 5.]])
standard_scaler = StandardScaler()
# Each column is shifted to mean 0 and scaled to standard deviation 1.
new_data = standard_scaler.fit_transform(data)
print(new_data)

Run it.


[[ 0.46291005 -1.22474487 -0.70710678]
 [ 0.9258201   0.         -0.70710678]
 [-1.38873015  1.22474487  1.41421356]]
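The numbers above come from subtracting each column's mean and dividing by its standard deviation. The following is a minimal sketch of that arithmetic in plain numpy, assuming the population standard deviation (ddof=0). The helper name standardize_columns is hypothetical, and the guard for constant columns is a simplifying assumption of this sketch.

```python
import numpy as np


def standardize_columns(data):
    """Scale each column to mean 0 and standard deviation 1.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation (ddof=0)
    # Assumption: leave constant columns unscaled instead of dividing by zero.
    std[std == 0] = 1.0
    return (data - mean) / std


data = np.array([[7., 2., 3.],
                 [8., 5., 3.],
                 [3., 8., 5.]])
scaled = standardize_columns(data)
print(scaled)
```

For the first column the mean is 6 and the standard deviation is about 2.16, so 7 maps to (7 - 6) / 2.16, roughly 0.46.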


MinMaxScaler maps the data to a specified range.
The following values are set by default as arguments.

MinMaxScaler(feature_range=(0, 1), copy=True)

feature_range is a tuple specified as (minimum value, maximum value).
By default, the data is mapped to the range from 0 to 1.

Create a file.


Write the following code.

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[0., 2.],
                 [3., 4.],
                 [10., 7.]])
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
# Each column's minimum maps to 0 and its maximum maps to 1.
new_data = min_max_scaler.fit_transform(data)
print(new_data)

Run it.


[[ 0.   0. ]
 [ 0.3  0.4]
 [ 1.   1. ]]

As you can see from the output, when the input to the converter is a two-dimensional array, the mapping is performed separately for each column (axis = 0).
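The per-column mapping can be written out by hand. Below is a minimal sketch of the usual min-max formula in plain numpy. The helper name min_max_scale is hypothetical, and the guard for constant columns is an assumption of this sketch rather than a description of the library's internals.

```python
import numpy as np


def min_max_scale(data, feature_range=(0.0, 1.0)):
    """Map each column linearly so its min and max hit feature_range.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    data = np.asarray(data, dtype=float)
    low, high = feature_range
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column has zero span; avoid dividing by zero.
    span[span == 0] = 1.0
    # First map each column to [0, 1], then stretch and shift to [low, high].
    unit = (data - col_min) / span
    return unit * (high - low) + low


data = np.array([[0., 2.],
                 [3., 4.],
                 [10., 7.]])
scaled = min_max_scale(data)
print(scaled)

# The same data mapped to the range -1 to 1 instead.
scaled_symmetric = min_max_scale(data, feature_range=(-1, 1))
print(scaled_symmetric)
```

In the second column the minimum is 2 and the maximum is 7, so 4 maps to (4 - 2) / (7 - 2) = 0.4.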


OneHotEncoder converts integer labels to one-hot labels.
The following values are set by default as arguments.

OneHotEncoder(n_values='auto', categorical_features='all', dtype=<class 'numpy.float64'>, sparse=True, handle_unknown='error')

Create a file.


Write the following code.

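from sklearn.preprocessing import OneHotEncoder
import numpy as np

# The encoder expects a 2-D array, so reshape the labels into one column.
data = np.array([0, 2, 1, 1]).reshape(-1, 1)
one_hot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
new_data = one_hot.fit_transform(data).toarray()
print(new_data)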

Run it.


[[ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]]
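Each row of the output has a single 1 in the position of its label. As a minimal sketch of the idea, the same matrix can be built in plain numpy by picking rows of an identity matrix. The helper name one_hot_rows is hypothetical and is not the way OneHotEncoder is implemented.

```python
import numpy as np


def one_hot_rows(labels):
    """Return one row per label with a single 1 in the label's position.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    labels = np.asarray(labels).ravel()
    # categories: sorted unique labels; indices: position of each label in them.
    categories, indices = np.unique(labels, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i.
    return np.eye(len(categories))[indices]


encoded = one_hot_rows([0, 2, 1, 1])
print(encoded)
```

Label 2 selects the third row of the 3x3 identity matrix, which is why the second output row is 0, 0, 1.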


If the labels are strings, you can use the pandas Series method factorize() to convert them to integer labels.

import pandas as pd

data = pd.Series(["apple", "orange", "banana", "banana"])
# factorize() returns the integer codes and the unique labels; keep the codes.
new_data, _ = data.factorize()
print(new_data)

[0 1 2 2]
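The integer codes are assigned in order of first appearance. The following is a minimal pure-Python sketch of that numbering. The function name factorize_labels is hypothetical, and it leaves out details of the pandas method such as missing-value handling.

```python
def factorize_labels(labels):
    """Assign each distinct label an integer in order of first appearance.

    Hypothetical helper for illustration only; not the pandas implementation.
    """
    index_of = {}  # label -> integer code
    codes = []
    for label in labels:
        if label not in index_of:
            # A new label gets the next unused integer.
            index_of[label] = len(index_of)
        codes.append(index_of[label])
    # Dicts keep insertion order, so the keys are the labels in code order.
    return codes, list(index_of)


codes, uniques = factorize_labels(["apple", "orange", "banana", "banana"])
print(codes)
print(uniques)
```

The resulting codes can then be reshaped into a single column and passed to OneHotEncoder.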


Taking the time to preprocess your data before doing machine learning can be expected to improve learning performance. Please take advantage of it.


