Update README.md

4 years ago · ccb53f2159
parent efa0d102e5
commit ccb53f2159
1 changed files with 95 additions and 0 deletions
--- a/1-Introduction/5-data-preprocessing-tool/README.md
+++ b/1-Introduction/5-data-preprocessing-tool/README.md
@@ -5,4 +5,99 @@ Each row belong to different customers and includes customer name, age, salary a
 x --> features (first three columns) <br>
 y --> dependent variable vector (last column)

+- - - - -

+## Importing dataset
+> What is the difference between the independent variables and the dependent variable?<br>
+
+The independent variables are the input data you have, with which you want to predict something. That
+something is the dependent variable.
+
+<br>
+
+> In Python, why do we create X and y separately?
+
+Because we want to work with NumPy arrays instead of pandas DataFrames. NumPy arrays are a convenient
+format for data preprocessing and for building Machine Learning models. So we create two separate arrays:
+one that contains our independent variables (also called the input features), and another one that contains
+our dependent variable (what we want to predict).
+
+<br>
+
+> In Python, what does ’iloc’ exactly do?
+
+It selects rows and columns by their integer position. In other words, ’iloc’ lets us take the columns we
+want just by giving their index.
+
+<br>
+
+> In Python, what does ’.values’ exactly do?
+
+It returns the values of the columns you are taking (by their index) as a NumPy array. That is basically
+how X and y become NumPy arrays.
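
As a minimal sketch of the three answers above, the snippet below builds a small stand-in table and splits it into X and y with `iloc` and `.values`. The column names and all values are made up for illustration; they are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the customer dataset (names and values are made up).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Salary": [72000, 48000, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "No"],
})

# iloc selects by integer position: [rows, columns].
# ":" keeps every row, ":-1" keeps every column except the last one.
X = dataset.iloc[:, :-1].values  # independent variables (features)
y = dataset.iloc[:, -1].values   # dependent variable (last column)

print(type(X).__name__, X.shape)  # ndarray (4, 3)
print(type(y).__name__, y.shape)  # ndarray (4,)
```

Here X is a 2-D array with one row per customer and one column per feature, while y is a 1-D array with one entry per customer.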
+
+<br>
+
+- - - - -
+
+## Taking care of missing data
+> In Python, what is the difference between fit and transform?
+
+The fit part is used to extract some information from the data on which the object is applied (here, Imputer will
+spot the missing values and compute the mean of the column). Then, the transform part is used to apply a
+transformation (here, Imputer will replace each missing value with that mean).
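
The two steps can be sketched separately as below. This sketch assumes a recent scikit-learn, where the older `Imputer` class is no longer available and `SimpleImputer` from `sklearn.impute` plays the same role; the numbers are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns (age, salary) with one missing value in each.
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit: learn the mean of each column, ignoring the missing entries.
imputer.fit(X)
print(imputer.statistics_)

# transform: replace each missing entry with the learned column mean.
X_filled = imputer.transform(X)
print(X_filled)
```

Calling `fit_transform(X)` would do both steps in one call; keeping them apart makes it visible that nothing in X changes until `transform` runs.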
+
+
+<br>
+
+- - - - -
+
+## Encoding categorical data
+> In Python, what do the two ’fit_transform’ methods do?
+
+When the ’fit_transform()’ method is called on a LabelEncoder() object, it transforms the category
+strings into integers. For example, it transforms France, Germany and Spain into 0, 1 and 2 (labels are
+assigned in sorted order). Then, when the ’fit_transform()’ method is called on a OneHotEncoder() object, it
+creates a separate column for each distinct label, with binary values 0 and 1. Those separate columns are the
+dummy variables.
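
A small sketch of both encoders on a made-up country column follows. It assumes a recent scikit-learn, where `OneHotEncoder` accepts string categories directly, so it is applied to the raw strings here rather than to the integer labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up categorical column.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# LabelEncoder: one integer per category, assigned in sorted order.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(countries)
print(label_encoder.classes_)  # ['France' 'Germany' 'Spain']
print(labels)                  # [0 2 1 2]

# OneHotEncoder: one 0/1 column per category (the dummy variables).
# It expects a 2-D input, hence the reshape to a single column.
one_hot_encoder = OneHotEncoder()
dummies = one_hot_encoder.fit_transform(countries.reshape(-1, 1)).toarray()
print(dummies)
```

Each row of `dummies` contains exactly one 1, in the column of that row's country.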
+
+<br>
+
+- - - - -
+
+## Splitting the dataset into the Training set and Test set
+> What is the difference between the training set and the test set?
+
+
+The training set is a subset of your data on which your model will learn how to predict the dependent
+variable from the independent variables. The test set is the complementary subset of the training set, on
+which you will evaluate your model to see whether it correctly predicts the dependent variable from the
+independent variables.
+
+<br>
+
+> Why do we split on the dependent variable?
+
+
+Because we want well-distributed values of the dependent variable in both the training set and the test set. For
+example, if the training set contained only a single value of the dependent variable, our model wouldn’t be
+able to learn any relationship between the independent and dependent variables.
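
One way to sketch the split is with scikit-learn's `train_test_split`. The `stratify=y` argument used here is an assumption on top of the text above: it is one option for keeping the class proportions of the dependent variable the same in both subsets. The data, `test_size` and `random_state` values are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 2 features, balanced binary dependent variable.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Hold out 20% of the rows as the test set.
# stratify=y keeps the 50/50 class balance in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(np.bincount(y_train), np.bincount(y_test))
```

Without `stratify`, the split is purely random, so a small test set can end up with only one value of the dependent variable.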
+
+<br>
+
+- - - - -
+
+## Feature scaling
+
+> Do we really have to apply Feature Scaling on the dummy variables?
+
+
+Yes, if you want to optimize the accuracy of your model predictions.<br>
+No, if you want to keep your model as interpretable as possible.
+
+<br>
+
+> When should we use Standardization and Normalization?
+
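+Normalization rescales each feature to the range [0, 1] using its minimum and maximum; standardization
+subtracts the mean and divides by the standard deviation.
+Generally you should normalize (normalization) when the data is normally distributed, and scale (standardization)
+when the data is not normally distributed. When in doubt, go for standardization. However, what is commonly
+done is to test both scaling methods.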
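
To try both methods side by side, a minimal sketch with scikit-learn's `StandardScaler` (standardization) and `MinMaxScaler` (normalization) could look like this. The feature values are made up, and fitting each scaler on the training set only is a choice made here, not something stated above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up numeric features (age, salary).
X_train = np.array([
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [44.0, 72000.0],
])
X_test = np.array([[35.0, 58000.0]])

# Standardization: (x - mean) / standard deviation, per column.
standard_scaler = StandardScaler()
X_train_std = standard_scaler.fit_transform(X_train)
X_test_std = standard_scaler.transform(X_test)  # reuses the training mean and std

# Normalization: (x - min) / (max - min), per column.
min_max_scaler = MinMaxScaler()
X_train_mm = min_max_scaler.fit_transform(X_train)
X_test_mm = min_max_scaler.transform(X_test)    # reuses the training min and max

print(X_train_std)
print(X_train_mm)
```

After standardization each training column has mean 0 and standard deviation 1; after normalization each training column spans exactly 0 to 1.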