Measuring covariance on several rows



























I'm new to Python and I'm trying to find my way by performing some calculations (I can do them easily in Excel, but now I want to know how to do them in Python).



One calculation is the covariance.
I have a simple example with 3 items that are sold, and for each item we have the demand over 24 months.



Here is a snapshot of the Excel file:



Items and their demand over 24 months



The goal is to measure the covariance between all three items, i.e. between items 1 and 2, 1 and 3, and 2 and 3. I also want to know how to do this for more than 3 items, let's say for a thousand items.



The calculations are as follows:



First I have to calculate the average per item. I already figured this out with the following code, after these imports:



import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


I imported the file:



df = pd.read_excel(r"Directory\Covariance.xlsx")


And calculated the average per row:



x=df.iloc[:,1:].values
df['avg'] = x.mean(axis=1)


This gives the file with an extra column, the average (avg):



Items, their demand and the average



The next step is to calculate the covariance between, for example, items 1 and 2. Mathematically, this is done as follows:



(column "1" of item 1 - column "avg" of item 1) * (column "1" of item 2 - column "avg" of item 2). This has to be done for columns "1" to "24", so 24 times, which adds 24 columns to df.



After this, we take the average of these 24 columns, which gives the covariance between items 1 and 2. This has to be done N-1 times per item, so in this simple case each item gets 2 covariance numbers (item 1 with items 2 and 3, item 2 with items 1 and 3, and item 3 with items 1 and 2).



So the first question is: how can I achieve this for these 3 items, so that the file shows 2 covariance outcomes per item (the first item should have one column with the covariance between items 1 and 2 and a second column with the covariance between items 1 and 3, and so on)?



The second question is of course: what if I have 1000 items? Then I have 999 covariance numbers per item and thus 999 extra columns, but also 999*25 extra columns if I calculate it via the above methodology. So how do I perform this calculation for every item as efficiently as possible?














































      python pandas statistics covariance














      edited Nov 23 '18 at 17:22









      desertnaut











      asked Nov 23 '18 at 8:01









      Steven Pauly
























          1 Answer














          Pandas has a built-in method to calculate the covariance matrix, but first you need to make sure your dataframe is in the correct format. The first column in your data actually contains the row labels, so let's put those in the index:



          df = pd.read_excel(r"Directory\Covariance.xlsx", index_col=0)


          You can then also calculate the mean more easily, but don't put it back in your dataframe yet, or it would be treated as an extra month in the covariance calculation!



          avg = df.mean(axis=1)


          To calculate the covariance matrix, just call .cov(). This, however, calculates pairwise covariances of columns, so transpose the dataframe first:



          cov = df.T.cov()
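To see how this relates to the step-by-step calculation described in the question, here is a minimal sketch. The data is made up (random numbers standing in for the Excel sheet) and names such as `cov_sample` and `cov_population` are only for illustration. One thing worth knowing: pandas' `.cov()` normalises by N-1 by default (sample covariance), whereas taking a plain average of the 24 products divides by N, so the two differ by a factor of 24/23.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet: 3 items (rows) x 24 months (columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(50, 150, size=(3, 24)).astype(float),
    index=["item1", "item2", "item3"],
    columns=range(1, 25),
)

# pandas: pairwise covariance between rows, normalised by N-1.
cov = df.T.cov()

# Manual version of the steps in the question: subtract each item's mean,
# multiply the deviations pairwise and sum over the 24 months.
x = df.to_numpy()
centered = x - x.mean(axis=1, keepdims=True)
n_months = x.shape[1]
cov_sample = centered @ centered.T / (n_months - 1)  # same as df.T.cov()
cov_population = centered @ centered.T / n_months    # plain average, as in the question

print(np.allclose(cov.to_numpy(), cov_sample))
print(np.allclose(np.cov(x, ddof=0), cov_population))
```

The matrix product computes all pairs at once, so no intermediate columns are needed; for 1000 items the result is simply a 1000 x 1000 matrix.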


          If you want, you can put everything together in one dataframe:



          df['avg'] = avg
          df = df.join(cov, rsuffix='_cov')


          Note: the diagonal of the covariance matrix holds the covariance of each item with itself, which is the variance of that item.
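If you only want the covariances between different items, one possible approach is to read off the upper triangle of the matrix, skipping the diagonal. This is just a sketch on made-up data; the item labels and the column names `item_a`, `item_b` and `covariance` are my own choices, not something from your file.

```python
import numpy as np
import pandas as pd

# Hypothetical demand data: 4 items (rows) x 24 months (columns).
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.normal(100, 15, size=(4, 24)),
    index=["A", "B", "C", "D"],
    columns=range(1, 25),
)
cov = df.T.cov()

# The diagonal holds each item's variance.
variances = pd.Series(np.diag(cov.to_numpy()), index=cov.index)

# Positions of the upper triangle, excluding the diagonal (k=1),
# so every pair of distinct items appears exactly once.
i, j = np.triu_indices(len(cov), k=1)
pairs = pd.DataFrame({
    "item_a": cov.index[i],
    "item_b": cov.columns[j],
    "covariance": cov.to_numpy()[i, j],
})
print(pairs)
```

With N items this gives N*(N-1)/2 rows, one per pair, which is usually easier to work with than N-1 extra columns per item.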






























          • thanks! This works perfect!

            – Steven Pauly
            Nov 23 '18 at 13:35














































          answered Nov 23 '18 at 8:25









          Rob






























