Pandas iloc wrong index causing problems with subtraction
I should start by saying that I am quite new to pandas and numpy (and machine learning in general).
I am trying to learn some basic machine learning algorithms and am doing linear regression. I have completed this problem using MATLAB, but wanted to try implementing it in Python, as that is a more practically used language. I am having a very difficult time doing basic matrix operations with these libraries, and I think it's down to a lack of understanding of how pandas indexes the dataframe.
I have found several posts discussing the differences between iloc and ix, and saying that ix is being deprecated so iloc should be used instead, but iloc is causing me loads of issues. I am simply trying to pull the first n-1 columns out of a dataframe into a new dataframe, then the final column into another dataframe to get my label values. Then I want to evaluate the cost function once to see what my current cost is with theta = 0. Currently, my dataset has only one label, but I'd like to code as if I had more. Here is my code and my output:
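import os

import numpy as np
import pandas as pd

# Build the path portably instead of concatenating a '\' separator by hand
path = os.path.join(os.getcwd(), 'ex1data1.txt')
data = pd.read_csv(path, header=None)
numRows = data.shape[0]
numCols = data.shape[1]
# Features: every column except the last
X = data.iloc[:, 0:numCols-1].copy()
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))
# Labels: the last column, sliced with -1: so y is the (97, 1) DataFrame shown in the output
y = data.iloc[:, -1:].copy()
# start computing cost: sum((X*theta - y).^2)
predictions = X.dot(theta)
print("predictions shape: {0}".format(predictions.shape))
print(predictions.head())
print("y shape: {0}".format(y.shape))
print(y.head())
# DataFrame.subtract aligns on row and column labels before subtracting
errors = predictions.subtract(y)
print("errors shape: {0}".format(errors.shape))
print(errors.head())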
output:
predictions shape: (97, 1)
0
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
y shape: (97, 1)
1
0 17.5920
1 9.1302
2 13.6620
3 11.8540
4 6.8233
errors shape: (97, 2)
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I can see that y and X have the same shape, but when I display them it seems that y begins its indexing at column 1 (its original position in the first dataframe), while X keeps its original column of 0. As a result, pandas aligns the two frames on their column labels when subtracting and fills any missing values with NaN. As y has no column 0 values, they are all NaN, and as X has no column 1 values, they are all NaN, resulting in a 97x2 NaN matrix.
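The alignment behaviour described here can be reproduced on a tiny made-up frame. This is only a sketch: the three-row DataFrame is a hypothetical stand-in for ex1data1.txt, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ex1data1.txt: two unnamed columns, three rows
data = pd.DataFrame([[6.1101, 17.5920], [5.5277, 9.1302], [8.5186, 13.6620]])

predictions = pd.DataFrame(np.zeros((3, 1)))  # its only column is labelled 0
y = data.iloc[:, -1:]                         # its only column keeps the label 1

# subtract() matches columns by label; 0 and 1 never match, so every cell is NaN
misaligned = predictions.subtract(y)
print(misaligned.shape)               # (3, 2)
print(misaligned.isna().all().all())  # True
```

iloc selects by position, but the resulting frame keeps the labels the columns had in the original dataframe, which is why y still reports column 1.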
If I use y = data.ix[:,-1:0] the above code does the correct calculations. Output:
errors shape: (97, 1)
0
0 -6.1101
1 -5.5277
2 -8.5186
3 -7.0032
4 -5.8598
But I am trying to stay away from ix, as it is reportedly being deprecated.
How do I tell pandas that the new matrix has a start column of 0, and why is this not the default behavior?
python pandas
asked Nov 22 '18 at 20:23
Aserian
1 Answer
It looks like the calculation you actually want is between Series (individual columns), so you should be able to do:
predictions[0].subtract(y[1])
to get the values you want. This looks confusing because your DataFrame columns are labelled with numbers: you are selecting the columns you want by label (0 in predictions, 1 in y) and subtracting one from the other.
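To illustrate why selecting single columns sidesteps the problem, here is a minimal sketch on made-up data (the three-row frame is a hypothetical stand-in for ex1data1.txt). Selecting by column label returns a Series, and two Series align only on the row index, which both share:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: features in column 0, labels in column 1
data = pd.DataFrame([[6.1101, 17.5920], [5.5277, 9.1302], [8.5186, 13.6620]])
predictions = pd.DataFrame(np.zeros((3, 1)))  # single column labelled 0
y = data.iloc[:, -1:]                         # single column, still labelled 1

# Each selection is a Series indexed by rows 0..2, so the column labels no longer matter
errors = predictions[0].subtract(y[1])
print(errors.tolist())  # [-17.592, -9.1302, -13.662]
```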
Or, using iloc as you originally suggested, which gives you more matrix-style positional indexing, you could do this:
predictions.iloc[:, 0].subtract(y.iloc[:, 0])
because in each DataFrame you want all the rows and the first column.
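A sketch of the positional version on the same kind of made-up data, plus two ways to keep a 2-D result if that is preferred. The relabelling and NumPy variants are not part of the answer; they are assumptions about what might suit a matrix-style workflow, and the three-row frame is again a hypothetical stand-in for the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data
data = pd.DataFrame([[6.1101, 17.5920], [5.5277, 9.1302], [8.5186, 13.6620]])
predictions = pd.DataFrame(np.zeros((3, 1)))
y = data.iloc[:, -1:].copy()

# Positional selection: the first column of each frame, whatever its label
errors_series = predictions.iloc[:, 0].subtract(y.iloc[:, 0])

# Option 1 (assumption): give y the same column labels, then subtract whole frames
y_relabelled = y.copy()
y_relabelled.columns = predictions.columns
errors_frame = predictions.subtract(y_relabelled)

# Option 2 (assumption): drop the labels entirely and work on NumPy arrays
errors_array = predictions.to_numpy() - y.to_numpy()

print(errors_series.shape, errors_frame.shape, errors_array.shape)
```

The Series version is the shortest, while relabelling keeps everything as a DataFrame, which may generalize more naturally if there is ever more than one label column.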
edited Nov 22 '18 at 20:43
answered Nov 22 '18 at 20:35
Sven Harris
Thank you very much for the help! I didn't realize that the columns, or column names rather, mattered. Is there a more succinct way to turn a matrix into two separate matrices? Or is the way that I am doing it acceptable?
– Aserian
Nov 22 '18 at 20:42
Yeah looks pretty acceptable overall
– Sven Harris
Nov 22 '18 at 20:52
Thank you for your help
– Aserian
Nov 22 '18 at 20:57