How to evaluate text based models with scikit-learn?












I have the following dataframe with data:



index   field1      field2            field3
1079 COMPUTER long text.... 3


field1 is a category, field2 is a text description, and field3 is just an integer representation of field1.



I am using the following code to learn the mapping from field2 to the category with sklearn:



from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)


After training the model, I can use it to predict a category, and it works well. However, I would like to evaluate the model on the test set.



X_test_counts = count_vect.fit_transform(X_test)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
clf.score(X_test_tfidf, y_test)


It throws the following error:



ValueError: dimension mismatch


Is there a way to test the model and get its score or accuracy on such a dataset?



UPDATE: Added a similar transformation for the test set.










  • Note: a similar transformation on the test data does not mean fit_transform but only transform. ;) That is why you get that error. See my answer below.

    – Prayson W. Daniel
    Nov 25 '18 at 18:29


















python scikit-learn nlp






asked Nov 25 '18 at 14:37 by Istvan (edited Nov 25 '18 at 17:26)













3 Answers
From the code you have provided it looks like you may have forgotten to convert/transform X_test like you did with X_train.



Update:

As for the new error that is now displayed in the question:



ValueError: dimension mismatch


Since the vectorizer and the transformer have already been fitted to the training set, you should just call .transform() on the test set:

X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)



More info here.
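As a rough illustration of the fit-on-train, transform-on-test pattern, here is a minimal sketch. The toy corpus and labels are made up for this example and merely stand in for df['Text'] and df['category_id']; only the scikit-learn calls are taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the question's dataframe columns.
train_texts = ["fast laptop with ssd", "gaming laptop and mouse",
               "wooden dining table", "oak table and chairs"]
train_labels = [0, 0, 1, 1]
test_texts = ["laptop with keyboard", "oak chairs"]
test_labels = [0, 1]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# Learn the vocabulary and the IDF weights from the training set only.
X_train_counts = count_vect.fit_transform(train_texts)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, train_labels)

# Reuse the fitted objects on the test set: transform, not fit_transform.
X_test_counts = count_vect.transform(test_texts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Both matrices are built from the training vocabulary, so the number of
# columns matches and score() no longer raises a dimension mismatch.
accuracy = clf.score(X_test_tfidf, test_labels)
print(X_train_tfidf.shape[1], X_test_tfidf.shape[1], accuracy)
```

Words that appear only in the test set (such as "keyboard" here) are simply ignored by transform, which is what keeps the feature dimension fixed.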






  • I have updated the question. I have tried that too.

    – Istvan
    Nov 25 '18 at 17:27











  • Fair enough. I have updated my answer accordingly. But is it safe to assume that the fact that you have changed the error in your question means that my answer solved the original error?

    – runcoderun
    Nov 25 '18 at 18:14





















The MultinomialNB classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts, whereas the tf-idf transform encodes documents as continuous-valued features. However, in practice, fractional counts such as tf-idf may also work [reference].
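To make the difference between the two feature types concrete, here is a small sketch that fits MultinomialNB once on raw counts and once on tf-idf weights. The corpus and labels are hypothetical and not taken from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with two categories.
texts = ["fast laptop with ssd", "gaming laptop and mouse",
         "wooden dining table", "oak table and chairs"]
labels = [0, 0, 1, 1]

counts = CountVectorizer().fit_transform(texts)   # integer word counts
tfidf = TfidfTransformer().fit_transform(counts)  # fractional tf-idf weights

print(counts.dtype, tfidf.dtype)

# MultinomialNB accepts either matrix as long as all values are non-negative.
nb_counts = MultinomialNB().fit(counts, labels)
nb_tfidf = MultinomialNB().fit(tfidf, labels)

pred_counts = nb_counts.predict(counts)
pred_tfidf = nb_tfidf.predict(tfidf)
```

Which representation scores better depends on the data, so it is worth evaluating both on a held-out set rather than assuming one of them.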



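To fix your issue, change your code to something like this:

# Build the count features once for all documents, then split the matrix.
# Note: the vocabulary is learned from every row, including the future test rows.
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(df['Text'].values.tolist())
X_train, X_test, y_train, y_test = train_test_split(X_counts, df['category_id'], random_state=0)

clf = MultinomialNB().fit(X_train, y_train)
clf.predict(X_test)
clf.score(X_test, y_test)  # mean accuracy on the test split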


To simplify your code, use a Pipeline:

from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['category_id'], random_state=0)
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
text_clf.predict(X_test)
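One possible way to evaluate such a pipeline is cross-validation, sketched below. This goes slightly beyond the answer: cross_val_score is a standard scikit-learn helper, but the toy corpus, the labels, and the choice of cv=3 are assumptions made only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two categories (0 = computer, 1 = furniture).
texts = [
    "fast laptop with ssd", "gaming laptop and mouse", "laptop keyboard and screen",
    "wooden dining table", "oak table and chairs", "dining chairs and table",
]
labels = [0, 0, 0, 1, 1, 1]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Each fold refits the whole pipeline on its training part only, so the
# held-out part never influences the vocabulary or the IDF weights.
scores = cross_val_score(text_clf, texts, labels, cv=3)
print(scores, scores.mean())
```

Because the raw strings are passed in and the vectorizer lives inside the pipeline, there is no separate fit_transform call that could accidentally be applied to the evaluation data.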





You should only transform your test data, not fit_transform it.
You fit_transform the training data and only transform the test data.
So if you remove “fit_” for the test data, it should work.

It is better to use a pipeline, which performs the transformations and then trains, scores, or predicts. E.g.:



from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

model = Pipeline(steps=[
    ('word_vec', CountVectorizer()),
    ('word_tdf', TfidfTransformer()),
    ('mnb', MultinomialNB()),
])

# X_train and X_test are the raw text columns from train_test_split.
model.fit(X_train, y_train)
model.score(X_test, y_test)


This gives you simpler code and makes it less likely that you accidentally fit_transform your test data.
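For a fuller picture than a single score, the pipeline's predictions can be passed to the functions in sklearn.metrics. The following is only a sketch: the toy corpus, the labels, and the split parameters are assumptions for illustration, not the data from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two categories (0 = computer, 1 = furniture).
texts = [
    "fast laptop with ssd", "gaming laptop and mouse",
    "laptop keyboard and screen", "desktop computer with monitor",
    "wooden dining table", "oak table and chairs",
    "dining chairs and table", "wooden shelf and oak desk",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Stratify so that the small test split contains both categories.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

model = Pipeline(steps=[
    ("word_vec", CountVectorizer()),
    ("word_tdf", TfidfTransformer()),
    ("mnb", MultinomialNB()),
])
model.fit(X_train, y_train)

# predict() applies transform (never fit_transform) to the test texts.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print(accuracy)
print(report)
```

Here accuracy_score returns the same value as model.score(X_test, y_test), while classification_report adds per-category precision, recall, and F1.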






Answer 1: answered Nov 25 '18 at 15:29 by runcoderun (edited Nov 25 '18 at 18:22)













Answer 2: answered Nov 25 '18 at 18:03 by Amir (edited Nov 25 '18 at 18:20)























Answer 3: answered Nov 25 '18 at 18:09 by Prayson W. Daniel (edited Nov 25 '18 at 18:26)





























