From 9bf7a4a61daaba6be73786fd31ab43930ebca2c1 Mon Sep 17 00:00:00 2001
From: Jen Looper
Date: Tue, 8 Jun 2021 11:27:09 -0400
Subject: [PATCH] a note about balancing data

---
 4-Classification/1-Introduction/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/4-Classification/1-Introduction/README.md b/4-Classification/1-Introduction/README.md
index 5c31b02ef..7c3d56bc5 100644
--- a/4-Classification/1-Introduction/README.md
+++ b/4-Classification/1-Introduction/README.md
@@ -176,7 +176,9 @@ Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev
 oversample = SMOTE()
 transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
 ```
-By balancing your data, you'll have better results when classifying it. Now you can check the numbers of labels per ingredient:
+By balancing your data, you'll have better results when classifying it. Think about a binary classification. If most of your data is one class, an ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data takes any skewed data and helps remove this imbalance.
+
+Now you can check the numbers of labels per ingredient:
 
 ```python
 print(f'new label count: {transformed_label_df.value_counts()}')
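
The note on skewed classes can be illustrated with a small standalone sketch. It is not part of the patch and does not use SMOTE: it uses plain random oversampling (copying existing minority rows) as a simpler stand-in, and the toy labels, the `random_oversample` helper, and the 90/10 split are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Copy randomly chosen rows of each smaller class until all classes
    are as large as the biggest one.

    This is plain random oversampling, a simpler stand-in for SMOTE,
    which synthesizes new points instead of duplicating existing ones.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_features, out_labels = list(features), list(labels)
    for cls, n in counts.items():
        rows = [f for f, l in zip(features, labels) if l == cls]
        for _ in range(target - n):
            out_features.append(rng.choice(rows))
            out_labels.append(cls)
    return out_features, out_labels


# Hypothetical binary data set: 90 rows of one class, 10 of the other.
labels = ["majority"] * 90 + ["minority"] * 10
features = [[i] for i in range(100)]

# A "model" that always predicts the most common class looks accurate
# on skewed data even though it has learned nothing about the minority.
most_common_class = Counter(labels).most_common(1)[0][0]
baseline_accuracy = sum(l == most_common_class for l in labels) / len(labels)

balanced_features, balanced_labels = random_oversample(features, labels)
balanced_counts = Counter(balanced_labels)

print(f"always-majority accuracy: {baseline_accuracy}")
print(f"label counts after balancing: {dict(balanced_counts)}")
```

After balancing, the same always-majority guess would be right only half the time, which is why a classifier trained on the resampled data has less incentive to favor one class.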