Skip to Content
Learn
Decision Trees
Information Gain

We know that we want to end up with leaves with a low Gini Impurity, but we still need to figure out which features to split on in order to achieve this. For example, is it better if we split our dataset of students based on how much sleep they got or how much time they spent studying?

To answer this question, we can calculate the information gain of splitting the data on a certain feature. Information gain measures difference in the impurity of the data before and after the split. For example, let’s say you had a dataset with an impurity of 0.5. After splitting the data based on a feature, you end up with three groups with impurities 0, 0.375, and 0. The information gain of splitting the data in that way is 0.5 - 0 - 0.375 - 0 = 0.125.

Not bad! By splitting the data in that way, we’ve gained some information about how the data is structured — the datasets after the split are purer than they were before the split. The higher the information gain the better — if information gain is 0, then splitting the data on that feature was useless! Unfortunately, right now it’s possible for information gain to be negative. In the next exercise, we’ll calculate weighted information gain to fix that problem.

Instructions

1.

We’ve given you a set of labels named unsplit_labels and two different ways of splitting those labels into smaller subsets. Let’s calculate the information gain of splitting the labels in this way.

At the bottom of your code, begin by creating a variable named info_gain. info_gain should start at the Gini impurity of the unsplit_labels.

2.

We now want to subtract the impurity of each subset in split_labels_1 from info_gain.

Loop through every subset in split_labels_1. We want to change the value of info_gain.

For every subset, calculate the Gini impurity and subtract it from info_gain.

3.

Outside of your loop, print info_gain.

We’ve given you a second way to split the data. Instead of looping through the subsets in split_labels_1, loop through the subsets in split_labels_2.

Which split resulted in more information gain?

Once again, in the next exercise, we’ll put the code you wrote into a function named information_gain that takes unsplit_labels and split_labels as parameters.

Folder Icon

Sign up to start coding

Already have an account?