Preprocessing data in Python

1. Rescale Data

# import libraries
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler

# URL of the data set
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# column names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# build the dataframe from the data at the given URL, using the column names above
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

# initialise the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
# learn the minimum and maximum of each column, then rescale the data
rescaledX = scaler.fit_transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

After rescaling, all of the values fall in the range between 0 and 1.

Output: 

[[ 0.353  0.744  0.59   0.354  0.0    0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.0    0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.0    0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.0  ]
 [ 0.0    0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
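
To make the rescaling step concrete, here is a minimal sketch that applies the usual min-max formula, (x - min) / (max - min), column by column and compares the result with MinMaxScaler. The small matrix X_demo and the variable names are made up for illustration and are not part of the Pima dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 3 samples, 2 features (illustrative values only)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [4.0, 40.0]])

# Manual min-max rescaling, computed per column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The same rescaling through scikit-learn
sk_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(manual)
print(np.allclose(manual, sk_scaled))
```

Each column is rescaled independently, so the smallest value in a column maps to 0 and the largest to 1 regardless of the column's original units.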
 
2. Binarize Data (Make Binary)  

We can transform our data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below it are marked 0.
This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.
We can create new binary attributes in Python using scikit-learn with the Binarizer class.
Code: Python code for binarization

# import libraries
from sklearn.preprocessing import Binarizer
import pandas
import numpy

# URL of the data set
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# column names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# build the dataframe from the data at the given URL, using the column names above
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# binarize: values above the threshold become 1, all others become 0
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data
numpy.set_printoptions(precision = 3)
print(binaryX[0:5,:])

We can see that all values equal to or less than 0 are marked 0 and all values above 0 are marked 1.

Output: 

[[ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  0.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 0.  1.  1.  1.  1.  1.  1.  1.]]
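
The probability use case mentioned above can be sketched as follows. The probs array and the 0.5 cut-off are assumptions chosen for illustration; they do not come from the dataset used in this article.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities (illustrative values only)
probs = np.array([[0.10, 0.50, 0.51],
                  [0.90, 0.49, 0.75]])

# Values strictly above 0.5 become 1; values equal to or below 0.5 become 0
crisp = Binarizer(threshold=0.5).fit_transform(probs)

# Equivalent NumPy expression, shown for comparison
manual_crisp = (probs > 0.5).astype(float)

print(crisp)
print(np.array_equal(crisp, manual_crisp))
```

Note that a value exactly equal to the threshold (0.50 here) is marked 0, which matches the rule stated in the text.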

3. Standardize Data

Standardization transforms attributes that have a Gaussian distribution with differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
We can standardize data using scikit-learn with the StandardScaler class.
Code: Python code to standardize data (0 mean, 1 stdev)


# importing libraries
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
 
# URL of the data set
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# column names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# build the dataframe from the data at the given URL, using the column names above
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
 
# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# learn the mean and standard deviation of each column, then standardize
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
 
# summarize transformed data
numpy.set_printoptions(precision = 3)
print(rescaledX[0:5,:])

The values for each attribute now have a mean of 0 and a standard deviation of 1.

Output: 

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
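
As a quick check of the claim that each attribute ends up with mean 0 and standard deviation 1, the sketch below standardizes a small made-up matrix, compares the result with the manual formula (x - mean) / std, and maps the values back with inverse_transform. X_demo and the other variable names are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 samples, 2 features (illustrative values only)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [4.0, 40.0]])

scaler_demo = StandardScaler().fit(X_demo)
z = scaler_demo.transform(X_demo)

# Manual standardization; numpy's std defaults to the population
# standard deviation (ddof=0), which is what StandardScaler uses
manual_z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(z.mean(axis=0))   # per-column means, approximately 0
print(z.std(axis=0))    # per-column standard deviations, approximately 1

# Recover the original values from the standardized ones
restored = scaler_demo.inverse_transform(z)
print(np.allclose(restored, X_demo))
```

Keeping the fitted scaler around is what makes the reverse mapping possible, since it stores the per-column mean and standard deviation learned in fit.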
 
 