How to evaluate text-based models with scikit-learn?
I have the following dataframe with data:
index field1 field2 field3
1079 COMPUTER long text.... 3
Field1 is a category, field2 is a free-text description, and field3 is just an integer encoding of field1.
I am using the following code to learn the field2-to-category mapping with scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
After training the model I can use it to predict a category, and it works well. However, I would like to evaluate the model on the test set.
X_test_counts = count_vect.fit_transform(X_test)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
clf.score(X_test_tfidf, y_test)
It throws the following error:
ValueError: dimension mismatch
Is there a way to test the model and get its score or accuracy with such a dataset?
UPDATE: Added a similar transformation to the test set.
python scikit-learn nlp
Note: a similar transformation on text data does not mean fit_transform but only transform. ;) That is why you get that error. See my answer below.
– Prayson W. Daniel, Nov 25 '18 at 18:29
3 Answers
From the code you have provided, it looks like you may have forgotten to transform X_test the way you did X_train.
Update:
As for the new error that is now displayed in the question:
ValueError: dimension mismatch
Since the transformer has already been fitted to the training set, you should just call .transform() on the test set:
tfidf_transformer.transform(X_test_counts)
More info here.
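For completeness, a minimal sketch of the full evaluation step, assuming count_vect, tfidf_transformer, and clf were fitted on the training data exactly as in the question:
# Reuse the already-fitted vectorizer and transformer: transform only, no refitting
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Mean accuracy of the classifier on the held-out test set
print(clf.score(X_test_tfidf, y_test))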
I have updated the question. I have tried that too.
– Istvan, Nov 25 '18 at 17:27
Fair enough. I have updated my answer accordingly. But is it safe to assume that the fact that you have changed the error in your question means that my answer solved the original error?
– runcoderun, Nov 25 '18 at 18:14
The MultinomialNB classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts, whereas a TF-IDF transform encodes documents as continuous-valued features. However, in practice, fractional counts such as tf-idf may also work [reference].
To fix your issue, change your code to something like this:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df['Text'].values.tolist())
X_train, X_test, y_train, y_test = train_test_split(X_train_counts, df['category_id'], random_state=0)
clf = MultinomialNB().fit(X_train, y_train)
clf.predict(X_test)
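Under the same assumption (the features are vectorized before the split, as in the snippet above), the held-out accuracy could then be read off directly:
clf.score(X_test, y_test)  # mean accuracy on the held-out split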
To enhance your code, use a Pipeline:
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state=0)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
text_clf.predict(X_test)
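Since the question is about evaluation, one possible follow-up (not part of the original answer): the fitted pipeline can be scored on the held-out split, and sklearn.metrics gives a per-class breakdown:
from sklearn.metrics import classification_report

print(text_clf.score(X_test, y_test))                            # overall accuracy
print(classification_report(y_test, text_clf.predict(X_test)))   # per-class precision/recall/F1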
You should only transform your test data, not fit_transform it.
You fit_transform the training data and only transform the test data.
So if you drop the fit_ prefix when transforming the test data, it should work.
It is better to use a pipeline that does the transformations and then the train/score/predict steps, e.g.:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

model = Pipeline(steps=[
    ('word_vec', CountVectorizer()),
    ('word_tdf', TfidfTransformer()),
    ('mnb', MultinomialNB()),
])

model.fit(X_train, y_train)
model.score(X_test, y_test)
This gives you simpler code and makes it less likely that you accidentally fit_transform your test data.
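As an optional further sketch, the whole pipeline can also be cross-validated, so the vectorizer is refitted on each training fold; df['Text'] and df['category_id'] are the columns from the question:
from sklearn.model_selection import cross_val_score

# Each fold fits the vectorizer, transformer, and classifier on the training portion only
scores = cross_val_score(model, df['Text'], df['category_id'], cv=5)
print(scores.mean())  # average accuracy across the 5 folds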