Each row belongs to a different customer and includes the customer's name, age, salary, and the value we want to predict.

x --> features (first three columns) <br>
y --> dependent variable vector (last column)
- - - - -
## Importing the dataset
> What is the difference between the independent variables and the dependent variable?

The independent variables are the input data that you have, with which you want to predict something. That something is the dependent variable.
<br>
> In Python, why do we create X and y separately?

Because we want to work with NumPy arrays instead of Pandas dataframes. NumPy arrays are the most convenient format to work with when doing data preprocessing and building Machine Learning models. So we create two separate arrays: one that contains our independent variables (also called the input features), and another one that contains our dependent variable (what we want to predict).
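
As a minimal sketch (assuming the dataset lives in a file named 'Data.csv' with the dependent variable in the last column):

```python
import pandas as pd

# Load the dataset (the file name is an assumption for illustration)
dataset = pd.read_csv('Data.csv')

# X: all rows, every column except the last -> independent variables
X = dataset.iloc[:, :-1].values

# y: all rows, only the last column -> dependent variable
y = dataset.iloc[:, -1].values
```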
<br>
> In Python, what does 'iloc' exactly do?

It locates rows and columns by their integer index. In other words, using 'iloc' allows us to select columns simply by their position, without referring to them by name.
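
For example, on the dataset loaded above:

```python
# iloc is purely positional: [row selection, column selection]
first_row = dataset.iloc[0]         # the row at index 0
first_three = dataset.iloc[:, 0:3]  # all rows, columns 0, 1 and 2
last_col = dataset.iloc[:, -1]      # all rows, last column only
```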
<br>
> In Python, what does '.values' exactly do?

It returns the values of the columns you selected (by their index) inside a NumPy array. That is basically how X and y become NumPy arrays.
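
A quick way to see this on the dataset loaded above (newer versions of Pandas recommend '.to_numpy()' as an equivalent of '.values'):

```python
print(type(dataset.iloc[:, :-1]))         # <class 'pandas.core.frame.DataFrame'>
print(type(dataset.iloc[:, :-1].values))  # <class 'numpy.ndarray'>
```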
<br>
- - - - -
## Taking care of missing data
> In Python, what is the difference between fit and transform?

The fit part is used to extract some information from the data on which the object is applied (here, the imputer will spot the missing values and compute the mean of each column). Then, the transform part is used to apply the actual transformation (here, the imputer will replace the missing values with that mean).
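
A minimal sketch with made-up numbers (in current scikit-learn the class is SimpleImputer; older versions called it Imputer):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_num = np.array([[44.0, 72000.0],
                  [27.0, 48000.0],
                  [30.0, np.nan]])  # one missing salary

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_num)                # fit: learn the mean of each column
X_num = imputer.transform(X_num)  # transform: replace NaNs with those means
```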
<br>
- - - - -
## Encoding categorical data
> In Python, what do the two 'fit_transform' methods do?

When the 'fit_transform()' method is called from the LabelEncoder() class, it transforms the category strings into integers. For example, it transforms France, Spain and Germany into 0, 1 and 2. Then, when the 'fit_transform()' method is called from the OneHotEncoder() class, it creates one separate column for each different label, with binary values 0 and 1. Those separate columns are the dummy variables.
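
A sketch of both calls, assuming the country names sit in the first column of X:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LabelEncoder: strings -> integers (e.g. France/Spain/Germany -> 0/1/2)
le = LabelEncoder()
X[:, 0] = le.fit_transform(X[:, 0])

# OneHotEncoder: one dummy column per category, filled with 0s and 1s
ohe = OneHotEncoder()
dummies = ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()
X = np.concatenate([dummies, X[:, 1:]], axis=1)
```

Note that recent scikit-learn versions usually wrap the OneHotEncoder in a ColumnTransformer and apply it to the raw string column directly, skipping the label-encoding step for X.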
<br>
- - - - -
## Splitting the dataset into the Training set and Test set
> What is the difference between the training set and the test set?

The training set is the subset of your data on which the model learns how to predict the dependent variable from the independent variables. The test set is the complementary subset, on which you evaluate the model to see whether it correctly predicts the dependent variable from the independent variables.
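
The split itself is typically done with scikit-learn (an 80/20 split is a common choice):

```python
from sklearn.model_selection import train_test_split

# 80% of the rows train the model, the remaining 20% evaluate it;
# random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```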
<br>
> Why do we split on the dependent variable?

Because we want well-distributed values of the dependent variable in both the training set and the test set. For example, if the training set only contained a single value of the dependent variable, the model would not be able to learn any correlation between the independent and dependent variables.
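
When the dependent variable is categorical, 'train_test_split' can enforce this directly (a sketch assuming y holds class labels):

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of the dependent variable
# the same in the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
```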
<br>
- - - - -
## Feature scaling
> Do we really have to apply Feature Scaling on the dummy variables?

Yes, if you want to optimize the accuracy of your model's predictions. <br>
No, if you want to retain as much interpretability as possible in your model.
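
Both options in code (a sketch assuming the first three columns of X_train are the dummy variables from the encoding step):

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Option A: scale everything, dummy variables included
# X_train = sc.fit_transform(X_train)

# Option B: scale only the numeric columns, keeping the dummies at 0/1
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
```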
<br>
> When should we use Standardization and Normalization?

Generally you should normalize (normalization) when the data is normally distributed, and scale (standardization) when the data is not normally distributed. When in doubt, you should go for standardization. However, what is commonly done in practice is to test both scaling methods.
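
Both methods are available in scikit-learn, so testing the two takes a few lines (standardization: x' = (x - mean) / std; normalization: x' = (x - min) / (max - min)); this sketch reuses the numeric columns of X_train from the previous example:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Standardization: center on the mean, divide by the standard deviation
X_std = StandardScaler().fit_transform(X_train[:, 3:])

# Normalization (min-max scaling): squeeze every value into [0, 1]
X_norm = MinMaxScaler().fit_transform(X_train[:, 3:])
```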