How to evaluate text based models with scikit-learn?

I have the following dataframe with data:

index   field1      field2            field3
1079    COMPUTER    long text....     3

field1 is a category, field2 is a text description, and field3 is just an integer representation of field1.

I am using the following code to learn the mapping from field2 to the category with sklearn:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state=0)

    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X_train)

    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

    clf = MultinomialNB().fit(X_train_tfidf, y_train)

After training the model, I can use it to predict a category, and it works well. However, I would like to evaluate the model on the test set.

    X_test_counts = count_vect.fit_transform(X_test)
    X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
    clf.score(X_test_tfidf, y_test)

It throws the following error:

ValueError: dimension mismatch

Is there a way to test the model and get the score or accuracy with such a dataset?

UPDATE: Added a similar transformation for the test set.

edited Nov 25 '18 at 17:26

asked Nov 25 '18 at 14:37

Istvan


Note: a similar transformation on the test data does not mean fit_transform, only transform. ;) That is why you get that error. See my answer below.

– Prayson W. Daniel
Nov 25 '18 at 18:29

python scikit-learn nlp

3 Answers

From the code you have provided it looks like you may have forgotten to convert/transform X_test like you did with X_train.

Update:

As for the new error that is now displayed in the question:

ValueError: dimension mismatch

Since the vectorizer and the transformer have already been fitted to the training set, you should just call .transform() on the test set:

    X_test_counts = count_vect.transform(X_test)
    X_test_tfidf = tfidf_transformer.transform(X_test_counts)

More info here.
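For completeness, here is a minimal sketch of the whole train/evaluate flow with only `.transform()` applied to the test split. The `texts` and `labels` lists are made-up stand-ins for `df['Text']` and `df['category_id']` (an assumption, since the real dataframe is not shown); the other names follow the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for df['Text'] and df['category_id'].
texts = [
    "cheap laptop with fast processor",
    "gaming computer and keyboard",
    "wireless mouse for the computer",
    "new laptop screen and keyboard",
    "wooden kitchen table and chairs",
    "soft sofa for the living room",
    "oak dining table with bench",
    "comfortable chairs and sofa",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, random_state=0)

# Fit the vectorizer and the tf-idf transformer on the training split only.
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Reuse the fitted objects on the test split: transform, not fit_transform.
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# score() returns the mean accuracy on the given test data and labels.
accuracy = clf.score(X_test_tfidf, y_test)
print(accuracy)
```

Because the test matrix is built from the training vocabulary, it has the same number of columns as the training matrix, so `clf.score` no longer raises the dimension mismatch.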

edited Nov 25 '18 at 18:22

answered Nov 25 '18 at 15:29

runcoderun


I have updated the question. I have tried that too.

– Istvan
Nov 25 '18 at 17:27

Fair enough. I have updated my answer accordingly. But is it safe to assume that the fact that you have changed the error in your question means that my answer solved the original error?

– runcoderun
Nov 25 '18 at 18:14


The MultinomialNB classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts, whereas the TF-IDF transform encodes documents as continuous-valued features. However, in practice, fractional counts such as tf-idf may also work [reference].
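As a small illustration of that last point, the sketch below fits MultinomialNB directly on tf-idf features, which are floats rather than integer counts. The documents and labels are made up for the example, and `TfidfVectorizer` is used here as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, only for illustration.
docs = [
    "cheap laptop computer",
    "fast computer keyboard",
    "red kitchen table",
    "wooden dining table",
]
labels = [0, 0, 1, 1]

# tf-idf produces a sparse matrix of non-integer (float) weights.
X_tfidf = TfidfVectorizer().fit_transform(docs)
print(X_tfidf.dtype)

# MultinomialNB accepts these fractional "counts" without complaint.
clf = MultinomialNB().fit(X_tfidf, labels)
predicted = clf.predict(X_tfidf)
print(predicted)
```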

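To fix your issue, change your code to something like this:

    # Build the vocabulary once on the full corpus, then split, so that the
    # training and test matrices have the same number of columns.
    # Note: this lets the vocabulary see the test documents as well.
    count_vect = CountVectorizer()
    X_counts = count_vect.fit_transform(df['Text'].values.tolist())
    X_train, X_test, y_train, y_test = train_test_split(X_counts, df['category_id'], random_state=0)

    clf = MultinomialNB().fit(X_train, y_train)
    clf.predict(X_test)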

To improve your code, use a Pipeline:

    from sklearn.pipeline import Pipeline

    X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state=0)

    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])

    text_clf = text_clf.fit(X_train, y_train)
    text_clf.predict(X_test)
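Since the question asks for a score, one possible way to turn the pipeline's predictions into numbers is sketched below with `accuracy_score` and `classification_report` from `sklearn.metrics`. The `texts` and `labels` lists are made-up stand-ins for `df['Text']` and `df['category_id']`, so the printed values say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df['Text'] and df['category_id'].
texts = [
    "cheap laptop with fast processor",
    "gaming computer and keyboard",
    "wireless mouse for the computer",
    "new laptop screen and keyboard",
    "wooden kitchen table and chairs",
    "soft sofa for the living room",
    "oak dining table with bench",
    "comfortable chairs and sofa",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, random_state=0)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)

# The pipeline vectorizes the raw test texts with the training vocabulary.
y_pred = text_clf.predict(X_test)

# Overall accuracy, plus per-class precision, recall, and F1.
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
print(classification_report(y_test, y_pred, zero_division=0))
```

`text_clf.score(X_test, y_test)` gives the same accuracy in a single call; the report is mainly useful when the classes are imbalanced.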

edited Nov 25 '18 at 18:20

answered Nov 25 '18 at 18:03

Amir


You should only transform your test data, not fit_transform it.
You fit_transform the training data and only transform the test data.
So if you remove "fit_" from the calls on the test data, it should work.
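To see where the dimension mismatch comes from, here is a small sketch with made-up documents (hypothetical data, not from the question). Refitting on the test documents learns a different vocabulary, so the resulting matrix has a different number of columns than the one the classifier was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, only for illustration.
train_docs = ["cheap laptop computer", "fast computer keyboard", "red kitchen table"]
test_docs = ["wooden table", "computer mouse"]

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_docs)

# Wrong: a vectorizer fitted on the test documents learns its own vocabulary.
wrong_vect = CountVectorizer()
X_test_wrong = wrong_vect.fit_transform(test_docs)

# Right: reuse the vocabulary learned from the training documents.
X_test_right = count_vect.transform(test_docs)

print(X_train_counts.shape)  # (3, 8): 8 distinct training words
print(X_test_wrong.shape)    # (2, 4): only 4 distinct test words
print(X_test_right.shape)    # (2, 8): same columns as the training matrix
```

Words that appear only in the test documents ("wooden", "mouse") are simply ignored by `transform`, which is the expected behaviour.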

It is better to use a pipeline, which applies the transformations and then trains, scores, or predicts. For example:

    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    model = Pipeline(steps=[
        ('word_vec', CountVectorizer()),
        ('word_tdf', TfidfTransformer()),
        ('mnb', MultinomialNB()),
    ])

    model.fit(X_train, y_train)
    model.score(X_test, y_test)

This gives you simpler code and makes it less likely that you fit_transform your test data by mistake.
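A side benefit of the pipeline, sketched here as an optional extra rather than something the question requires, is that it can be passed straight to `cross_val_score`: the vectorizer is then refitted on the training part of each fold, so the held-out part never leaks into the vocabulary. The `texts` and `labels` lists are again made-up stand-ins for the real columns.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df['Text'] and df['category_id'].
texts = [
    "cheap laptop with fast processor",
    "gaming computer and keyboard",
    "wireless mouse for the computer",
    "new laptop screen and keyboard",
    "wooden kitchen table and chairs",
    "soft sofa for the living room",
    "oak dining table with bench",
    "comfortable chairs and sofa",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = Pipeline(steps=[
    ('word_vec', CountVectorizer()),
    ('word_tdf', TfidfTransformer()),
    ('mnb', MultinomialNB()),
])

# Two folds only because the toy corpus is tiny; each fold refits the whole pipeline.
scores = cross_val_score(model, texts, labels, cv=2)
print(scores)
print(scores.mean())
```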

edited Nov 25 '18 at 18:26

answered Nov 25 '18 at 18:09

Prayson W. Daniel
