Measuring covariance on several rows
I'm new to Python and I'm trying to find my way by performing some calculations (I can do them easily in Excel, but now I want to know how to do them in Python).
One calculation is the covariance.
I have a simple example with 3 items that are sold, and for each item we have the demand over 24 months.
Here is a snapshot of the Excel file:
Items and their demand over 24 months
The goal is to measure the covariance between all three items, i.e. the covariance between items 1 and 2, 1 and 3, and 2 and 3. I also want to know how to do this for more than 3 items, say for a thousand items.
The calculations are as follows:
First I have to calculate the average per item. I already worked this out with the following code.
After importing the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
I imported the file:
df = pd.read_excel(r"Directory\Covariance.xlsx")
And calculated the average per row:
x = df.iloc[:, 1:].values
df['avg'] = x.mean(axis=1)
This gives the file with an extra column, the average (avg):
Items, their demand and the average
The next step is to calculate the covariance between, for example, items 1 and 2. Mathematically, this is done as follows:
(column "1" of item 1 - column "avg" of item 1) * (column "1" of item 2 - column "avg" of item 2). This has to be done for columns "1" to "24", so 24 times. This would add 24 columns to df.
After this, we take the average of these 24 columns, which gives the covariance between items 1 and 2. We have to do this N-1 times per item, so in this simple case each item has 2 covariance numbers (for the first item, the covariance with items 2 and 3; for the second item, the covariance with items 1 and 3; and for the third item, the covariance with items 1 and 2).
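The calculation described above can be sketched in NumPy as follows. This is only an illustration under assumptions: the `demand` array is made-up data with 4 months instead of 24, and the names `dev`, `cov_12` and `cov_all` are hypothetical. It follows the description literally, so it divides by the number of months (a plain average).

```python
import numpy as np

# Hypothetical demand data: 3 items (rows) x 4 months (columns).
demand = np.array([
    [10.0, 12.0, 14.0, 16.0],
    [5.0, 7.0, 6.0, 8.0],
    [20.0, 18.0, 16.0, 14.0],
])

# Average demand per item (one value per row, kept as a column for broadcasting).
avg = demand.mean(axis=1, keepdims=True)

# Deviation of every month from the item's own average.
dev = demand - avg

# Covariance between item 1 and item 2: average of the month-by-month products.
cov_12 = (dev[0] * dev[1]).mean()

# All pairs at once: entry (i, j) is the average of dev[i] * dev[j].
cov_all = dev @ dev.T / demand.shape[1]

print(cov_12)
print(cov_all)
```

The matrix product computes every pair in one step, so no intermediate product columns need to be stored.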
So the first question is: how can I achieve this for these 3 items, so that the file has columns that display the 2 covariance outcomes per item (the first item should have one column with the covariance between items 1 and 2 and a second column with the covariance between items 1 and 3, and so on)?
The second question is of course: what if I have 1000 items? Then I have 999 covariance numbers per item and thus 999 extra columns, but also 999*25 extra columns if I calculate it via the above methodology. So how do I perform this calculation for every item as efficiently as possible?
python pandas statistics covariance
edited Nov 23 '18 at 17:22 by desertnaut
asked Nov 23 '18 at 8:01 by Steven Pauly
1 Answer
Pandas has a built-in function to calculate the covariance matrix, but first you need to make sure your dataframe is in the correct format. The first column in your data actually contains the row labels, so let's put those in the index:
df = pd.read_excel(r"Directory\Covariance.xlsx", index_col=0)
Then you can also calculate the mean more easily, but don't put it back in your dataframe yet!
avg = df.mean(axis=1)
To calculate the covariance matrix, just call .cov(). However, this calculates pairwise covariances between columns, so transpose the dataframe first:
cov = df.T.cov()
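To see why the transpose matters, here is a small sketch on made-up data standing in for the Excel file (the values and the labels "item 1" to "item 3" are assumptions, and only 4 months are used):

```python
import pandas as pd

# Hypothetical stand-in for the Excel file: items as the index, months as columns.
df = pd.DataFrame(
    {
        1: [10.0, 5.0, 20.0],
        2: [12.0, 7.0, 18.0],
        3: [14.0, 6.0, 16.0],
        4: [16.0, 8.0, 14.0],
    },
    index=["item 1", "item 2", "item 3"],
)

# .cov() compares columns, so without transposing the result is month-by-month (4 x 4).
cov_months = df.cov()

# Transposing first makes the items the columns, giving an item-by-item (3 x 3) matrix.
cov_items = df.T.cov()

print(cov_months.shape)
print(cov_items)
```

The resulting matrix is labelled by item on both axes, so a single pair can be looked up with something like `cov_items.loc["item 1", "item 2"]`.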
If you want, you can put everything together in 1 dataframe:
df['avg'] = avg
df = df.join(cov, rsuffix='_cov')
Note: the covariance matrix also includes the covariance of each item with itself, which is the variance of that item.
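One point worth checking against the manual calculation in the question: taking a plain average of the 24 products divides by N, whereas .cov() by default divides by N - 1 (the sample covariance). Assuming pandas 1.1 or newer, where .cov() accepts a ddof argument, the sketch below shows both variants on randomly generated data for 1000 items; the data and the "item i" labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 items (rows) x 24 months (columns) of random demand.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(1000, 24)).astype(float),
    index=[f"item {i}" for i in range(1, 1001)],
    columns=range(1, 25),
)

# Default: divides by N - 1 (sample covariance).
cov_sample = df.T.cov()

# ddof=0 divides by N, i.e. the plain average described in the question.
cov_population = df.T.cov(ddof=0)

# The diagonal holds each item's covariance with itself, i.e. its variance.
# DataFrame.var also divides by N - 1 by default, so it matches cov_sample.
variances = df.var(axis=1)

print(cov_sample.shape)
```

The full 1000 x 1000 matrix is computed in one call, without creating any intermediate product columns.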
thanks! This works perfect!
– Steven Pauly
Nov 23 '18 at 13:35
answered Nov 23 '18 at 8:25 by Rob