Sklearn DecisionTreeClassifier F-Score Different Results with Each run

I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data, and f1_score as my evaluation metric. The strange thing is that my model gives different results on each run, although the results follow a pattern.

data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.

The following code is what I did to preprocess and format my data:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score


# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")

X = data.iloc[:, :-1]
y = data.iloc[:, 6]

# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)

for i in range(len(categorical_data)):
    X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])


# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)

The next code is for the actual decision tree model training:

dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')

print("Score is = {}".format(score))

The output that I get (i.e. the score) varies, but within a narrow band: it fluctuates between roughly 0.39 and 0.42.

On some runs, I even get an UndefinedMetricWarning that says "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."

I understand what the UndefinedMetricWarning means after searching this community and Google. My two questions are:

Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?

I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?

Thank you.

edited Nov 22 '18 at 16:06

asked Nov 22 '18 at 16:00

Seankala



Try dectree = DecisionTreeClassifier(random_state=42)

– Sociopath
Nov 22 '18 at 16:04

Hello. Thanks for the comment. I've tried with that and noticed that I no longer get the warning message and also get a consistent value. May I ask why 42 though? Also, I've noticed that some values for random_state produce the warning message but others don't... would you happen to know why?

– Seankala
Nov 22 '18 at 16:09
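
A note on why the warning can show up for some seeds and not others: macro F1 averages one F1 score per class, so whenever a particular split and tree happen to produce no predictions at all for some class, that class has a precision of 0/0 and its F1 is reported as 0.0. The following is a minimal hand-rolled sketch of that arithmetic, not scikit-learn's implementation; the helper macro_f1 and the toy labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 computed by hand; undefined per-class scores count as 0.0."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # No predicted samples for this label means tp + fp == 0 (precision is 0/0).
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        per_class[label] = 2 * precision * recall / denom if denom else 0.0
    return sum(per_class.values()) / len(labels), per_class


# Hypothetical toy labels: class 2 exists in y_val but is never predicted.
y_val = [0, 0, 0, 0, 1, 1, 1, 2]
predictions = [0, 0, 0, 1, 1, 1, 0, 0]

score, per_class = macro_f1(y_val, predictions)
print(per_class)
print("Score is = {}".format(score))
```

Here classes 0 and 1 each score about 0.67, while class 2 contributes 0.0 and pulls the macro average down to about 0.44. A rare class is more likely to end up unpredicted under some random splits than others, which would be consistent with the warning coming and going as the seed changes.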


Use random_state everywhere it is applicable. In your case, that means train_test_split() and DecisionTreeClassifier(). Also, use the stratify option in train_test_split() to keep the class proportions balanced between the train and test data (which may help in avoiding the UndefinedMetricWarning).

– Vivek Kumar
Nov 23 '18 at 5:58
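
The stratify suggestion above could look roughly like the sketch below. The imbalanced labels and random features are synthetic stand-ins (an assumption, since the real train.csv is not shown); only train_test_split and its stratify and random_state parameters come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the real target column (an assumption).
rng = np.random.RandomState(0)
y = np.array([0] * 1800 + [1] * 180 + [2] * 20)
X = rng.rand(len(y), 6)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=13, stratify=y
)

# With stratify=y, each class keeps roughly its overall proportion in both splits,
# so even the rarest class is represented in the validation set.
val_counts = np.bincount(y_val, minlength=3)
print(val_counts)
```

Stratifying only guarantees that every class appears in both splits in proportion. The tree can still fail to predict a rare class, so the warning may become less frequent rather than disappear.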

Thanks for the feedback @VivekKumar. I didn't know about these methods. I'll make sure to keep them in mind next time. :)

– Seankala
Nov 24 '18 at 2:10

python machine-learning scikit-learn


1 Answer

train_test_split shuffles the data before splitting it, so each run trains the model on a different training set and evaluates it on a different validation set. DecisionTreeClassifier is randomized as well: scikit-learn randomly permutes the features at each split. As a result, the F-score falls within a range rather than on a single value, depending on how well the model happens to fit that particular split.

To reproduce the same result on every run, set the random_state parameter. It seeds the random number generator, so the same sequence of random numbers is generated each time you run the code. The value itself is arbitrary; any integer works.

# Train/test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)

# Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)

answered Nov 22 '18 at 16:15

Naveen
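
To check that seeding both steps really makes the score repeatable, something like the following sketch can be run twice in the same process. The data is a synthetic stand-in for train.csv and the helper run_once is hypothetical. Unlike the question's code, this sketch calls transform() rather than fit_transform() on the validation set, so both splits are scaled with the ranges learned from the training data; that is a separate change from the seeding and is noted here as a deviation.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier


def run_once(X, y, split_seed=13, tree_seed=2018):
    """Split, scale, train, and score with every source of randomness seeded."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=split_seed, stratify=y
    )
    # Fit the scaler on the training data only, then reuse it for validation.
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)

    dectree = DecisionTreeClassifier(class_weight='balanced', random_state=tree_seed)
    dectree.fit(X_train_scaled, y_train)
    predictions = dectree.predict(X_val_scaled)
    return f1_score(y_val, predictions, average='macro')


# Synthetic stand-in for the real data (an assumption): 2000 rows, 6 features, 3 classes.
rng = np.random.RandomState(0)
X = rng.rand(2000, 6)
y = rng.randint(0, 3, size=2000)

first = run_once(X, y)
second = run_once(X, y)
print("Score is = {}".format(first))
print(first == second)
```

Changing either seed generally changes the score, which is the same variation the question describes; fixing the seeds only pins down one point in that range rather than improving the model.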
