Pandas iloc wrong index causing problems with subtraction



























I should start by saying that I am quite new to pandas and numpy (and machine learning in general).



I am trying to learn some basic machine learning algorithms and am doing linear regression. I have completed this problem in MATLAB, but wanted to try implementing it in Python, as that is a more widely used language in practice. I am having a very difficult time doing basic matrix operations with these libraries, and I think it comes down to a lack of understanding of how pandas indexes a DataFrame.



I have found several posts discussing the differences between iloc and ix, and saying that ix is being deprecated so I should use iloc, but iloc is causing me loads of issues. I am simply trying to pull the first n-1 columns out of a dataframe into a new dataframe, then the final column into another dataframe to get my label values. Then I want to evaluate the cost function once to see what my current cost is with theta = 0. Currently, my dataset has only one label, but I'd like to code as if I had more. Here is my code and my output:



import os

import numpy as np
import pandas as pd

# build the path portably instead of concatenating a '\' separator
path = os.path.join(os.getcwd(), 'ex1data1.txt')
data = pd.read_csv(path, header=None)

numRows = data.shape[0]
numCols = data.shape[1]

# the first n-1 columns are the features
X = data.iloc[:, 0:numCols-1].copy()
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))
# the slice (-1:) keeps y as a one-column DataFrame, matching the (97, 1) shape printed below
y = data.iloc[:, -1:].copy()

# start computing cost: sum((X*theta - y).^2)
predictions = X.dot(theta)
print("predictions shape: {0}".format(predictions.shape))
print(predictions.head())
print("y shape: {0}".format(y.shape))
print(y.head())

errors = predictions.subtract(y)

print("errors shape: {0}".format(errors.shape))
print(errors.head())


output:



predictions shape: (97, 1)
0
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
y shape: (97, 1)
1
0 17.5920
1 9.1302
2 13.6620
3 11.8540
4 6.8233
errors shape: (97, 2)
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN


I can see that y and X have the same shape, but for some reason, when I display them, it seems that y begins its indexing at column 1 (its original position in the first dataframe) while X keeps its original column of 0. As a result, pandas does the subtraction by column label and fills any missing values with NaN. As y has no column 0 values, they are all NaN, and as X has no column 1 values, they are all NaN, resulting in a 97x2 NaN matrix.
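
The behaviour described above can be reproduced on a tiny frame. This is only a sketch: the three rows are made-up stand-ins for the real 97-row file, and it assumes y is taken as a one-column DataFrame (a `-1:` slice), which is what the printed shape of (97, 1) suggests.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the two-column data file (illustrative values only).
data = pd.DataFrame([[6.1101, 17.5920],
                     [5.5277, 9.1302],
                     [8.5186, 13.6620]])

predictions = pd.DataFrame(np.zeros((3, 1)))  # single column, labelled 0
y = data.iloc[:, -1:]                         # single column, still labelled 1

# DataFrame arithmetic aligns on labels: the result has the union of the
# column labels {0, 1}, and neither label exists in both frames, so every
# cell comes out as NaN.
errors = predictions.subtract(y)
print(errors.shape)               # (3, 2)
print(errors.isna().all().all())  # True
```

iloc selects by position, but the selected column keeps the label it had in the source frame, which is why y still carries the label 1.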



If I use y = data.ix[:,-1:0], the above code does the correct calculations. Output:



 errors shape: (97, 1)
0
0 -6.1101
1 -5.5277
2 -8.5186
3 -7.0032
4 -5.8598


But I am trying to stay away from ix, since it is reportedly being deprecated.



How do I tell pandas that the new matrix has a start column of 0, and why is this not the default behavior?










      python pandas
















asked Nov 22 '18 at 20:23 by Aserian
























          1 Answer














It looks like the calculation you actually want is between Series (individual columns), so you should be able to do:

predictions[0].subtract(y[1])

to get the values you want. This looks confusing because your DataFrame columns are labelled with numbers: you are selecting the columns you want by label (0 in predictions, 1 in y) and subtracting one from the other.

Or, using iloc as you originally suggested, which gives you something closer to matrix-style positional indexing, you could do this:

predictions.iloc[:, 0].subtract(y.iloc[:, 0])

because in each DataFrame you want all the rows and the first column.
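
Both variants can be checked on a small made-up frame. The sketch below also shows two alternatives that are not part of the answer and are offered only as assumptions about what may suit the asker: relabelling y's column so the labels match, and dropping to NumPy arrays so that no label alignment happens at all.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the real file has 97 rows.
data = pd.DataFrame([[6.1101, 17.5920],
                     [5.5277, 9.1302],
                     [8.5186, 13.6620]])
predictions = pd.DataFrame(np.zeros((3, 1)))   # column label 0
y = data.iloc[:, -1:]                          # column label 1

# Label-based selection from the answer: pick each column as a Series.
by_label = predictions[0].subtract(y[1])

# Position-based selection from the answer: first column of each frame.
by_position = predictions.iloc[:, 0].subtract(y.iloc[:, 0])

# Alternative (assumption, not from the answer): relabel y's columns 0..n-1
# so that frame-to-frame subtraction lines up and stays a (3, 1) DataFrame.
y_relabelled = y.copy()
y_relabelled.columns = range(y_relabelled.shape[1])
as_frame = predictions.subtract(y_relabelled)

# Alternative (assumption): plain arrays carry no labels to align on.
as_array = predictions.to_numpy() - y.to_numpy()

print(by_label.tolist())  # [-17.592, -9.1302, -13.662]
```

The Series variants return one-dimensional results, while the relabelled and array variants keep the column shape, which may matter if more than one label column is added later.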






answered Nov 22 '18 at 20:35 by Sven Harris (edited Nov 22 '18 at 20:43)

• Thank you very much for the help! I didn't realize that the columns, or column names rather, mattered. Is there a more succinct way to turn a matrix into two separate matrices? Or is the way that I am doing it acceptable?
  – Aserian, Nov 22 '18 at 20:42

• Yeah, looks pretty acceptable overall.
  – Sven Harris, Nov 22 '18 at 20:52

• Thank you for your help.
  – Aserian, Nov 22 '18 at 20:57
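
On the follow-up about a more succinct split, one pattern that may help (a sketch, not something proposed in the thread) is to slice the frame once and convert both parts to NumPy arrays. The data values are made up, and the cost follows the question's own comment, sum((X*theta - y).^2), with no extra scaling factor assumed.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the last column is treated as the label.
data = pd.DataFrame([[6.1101, 17.5920],
                     [5.5277, 9.1302],
                     [8.5186, 13.6620]])

# Split once into plain arrays: all but the last column, and the last column.
X = data.iloc[:, :-1].to_numpy()   # shape (m, n-1)
y = data.iloc[:, -1:].to_numpy()   # shape (m, 1); the slice keeps it 2-D

theta = np.zeros((X.shape[1], 1))  # theta = 0, as in the question

# Sum of squared errors, following the question's sum((X*theta - y).^2).
errors = X @ theta - y
cost = float(np.sum(errors ** 2))
print(errors.shape, cost)
```

Because the arrays carry no row or column labels, the subtraction is purely positional, which is closer to how the same expression behaves in MATLAB.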












