Compute how many samples have been improved, according to a minimum threshold or confidence interval, in a...
I have the following dataframe:



ID      VAL1    VAL2
Q2241   0.3333  0.3353
Q2242   0.5     0.5
Q2243   0.3333  0.3333
Q2244   0.2137  0.4792
Q2245   0.1429  0.2
Q2246   0.5     0.5
Q2247   0.4167  0.6667
Q2248   1       1
Q2249   0.125   0.0909
Q2250   0.2     0.2
Q2251   0.325   0.2667
Q2252   0.1667  0.2
Q2253   0.3333  0.25
Q2254   0.45    0.8333
Q2255   0.3333  0.5
Q2256   1       1
Q2257   0.5     0.51
Q2258   0.3929  0.3333
Q2259   0.3611  0.625


Is there a way to correctly compute the number of samples (IDs) for which VAL2 is significantly higher or lower than VAL1 in a given dataframe?
I'm looking for something like a t-test, a measure that produces results like the following example:



Win  Tie  Loss
 64   36   137


where:




Win: number of IDs where VAL2 is higher than VAL1 beyond some confidence threshold
Tie: number of IDs where VAL2 ≈ VAL1 (no significant difference, e.g. within 0.0001)
Loss: number of IDs where VAL2 is lower than VAL1 beyond some confidence threshold
      python dataframe statistics difference
      asked Nov 24 '18 at 10:28
Belkacem Thiziri

      69111




      69111
























          1 Answer
    import pandas as pd  # assumes df already holds the ID/VAL1/VAL2 table

    tol = 0.0001
    # Count a win/loss only when the gap exceeds the tolerance; otherwise it's a tie.
    win = (df.VAL2 > (df.VAL1 + tol)).sum()
    loss = (df.VAL2 < (df.VAL1 - tol)).sum()
    tie = ((df.VAL1 - df.VAL2).abs() <= tol).sum()

    result = pd.DataFrame([{'Win': win, 'Tie': tie, 'Loss': loss}])
    print(result)
    # Loss Tie Win
    # 0 4 6 9
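The same thresholded counting can be checked without pandas; here is a plain-Python sketch over the 19 (VAL1, VAL2) pairs transcribed from the question's table:

```python
# (VAL1, VAL2) pairs transcribed from the question's table.
rows = [
    (0.3333, 0.3353), (0.5, 0.5), (0.3333, 0.3333), (0.2137, 0.4792),
    (0.1429, 0.2), (0.5, 0.5), (0.4167, 0.6667), (1, 1),
    (0.125, 0.0909), (0.2, 0.2), (0.325, 0.2667), (0.1667, 0.2),
    (0.3333, 0.25), (0.45, 0.8333), (0.3333, 0.5), (1, 1),
    (0.5, 0.51), (0.3929, 0.3333), (0.3611, 0.625),
]

tol = 0.0001
# Summing booleans counts the rows that satisfy each condition.
win = sum(v2 > v1 + tol for v1, v2 in rows)
tie = sum(abs(v1 - v2) <= tol for v1, v2 in rows)
loss = sum(v2 < v1 - tol for v1, v2 in rows)
print(win, tie, loss)  # 9 6 4
```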
• Thanks @Ghilas BELHADJ, I already tried something like that, but I was wondering whether there is a specific statistical method for this kind of statistic, some standard test like the t-test?

  – Belkacem Thiziri
  Nov 24 '18 at 12:31

• There is a list of software implementations at the bottom of the Wikipedia page. So basically, you can use scipy.stats.ttest_ind in Python.

  – Ghilas BELHADJ
  Nov 24 '18 at 12:37

• The t-test evaluates the significance of the difference between two distributions, but it does not tell you how many samples are significantly different. A better solution would combine the significance probability given by the t-test with something else to count those samples. Do you think computing a t-test value for each line of my dataframe would make sense?

  – Belkacem Thiziri
  Nov 24 '18 at 12:51

• Probably not, but I'll let you know if I find something.

  – Ghilas BELHADJ
  Nov 24 '18 at 13:38

• Okay, thanks. I'll discuss that with my advisors and then leave a comment here.

  – Belkacem Thiziri
  Nov 24 '18 at 13:41
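Since VAL1 and VAL2 are paired per ID, the test the comments gesture at would be the paired variant, scipy.stats.ttest_rel. Note it tests whether the *average* difference is significant, not how many individual IDs differ, which is the limitation raised above. A minimal standard-library sketch of what it computes, on the question's sample rows (the stdlib implementation is illustrative, not the answer's code):

```python
import math

# Paired (VAL1, VAL2) samples from the question, one tuple per ID.
pairs = [
    (0.3333, 0.3353), (0.5, 0.5), (0.3333, 0.3333), (0.2137, 0.4792),
    (0.1429, 0.2), (0.5, 0.5), (0.4167, 0.6667), (1, 1),
    (0.125, 0.0909), (0.2, 0.2), (0.325, 0.2667), (0.1667, 0.2),
    (0.3333, 0.25), (0.45, 0.8333), (0.3333, 0.5), (1, 1),
    (0.5, 0.51), (0.3929, 0.3333), (0.3611, 0.625),
]

def paired_t(pairs):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    diffs = [b - a for a, b in pairs]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (Bessel's correction, n - 1).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, n - 1

t, dof = paired_t(pairs)
print(t, dof)  # t > 0 here: VAL2 is higher than VAL1 on average
```

A p-value would then come from the t distribution with `dof` degrees of freedom, which is what `scipy.stats.ttest_rel` returns directly.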
          answered Nov 24 '18 at 11:20
Ghilas BELHADJ