Drop rows based on float length in Python
I have a DataFrame with zip codes, among other things. The data, as a sample, looks like this:
Zip Item1 Item2 Item3
78264.0 pan elephant blue
73909.0 steamer panda yellow
2602.0 pot rhino orange
59661.0 fork zebra green
861893.0 sink ocelot red
77892.0 spatula doggie brown
Some of these zip codes are invalid, having either too many or too few digits. I'm trying to remove those rows that have an invalid number of characters/digits (seven characters in this case, because I am checking length based on str()
and the .0
is included in there). The following lengths
loop:
zips = mydata.iloc[:,0].astype(str)
lengths =
for i in zips:
lengths.append(len(i))
produces a series (not to be confused with Series, although maybe it is--I'm new at Python) of zip code character lengths for each row. I am then trying to subset the DataFrame based on the information from the lengths
variable. I tried a couple of different ways; this following was the latest version:
for i in lengths.index(i):
if mydata.iloc[i:,0] != 7:
mydata.iloc[i:,0].drop()
Naturally, this fails, with a ValueError: '44114.0' is not in list
error. Can anyone give some advice as to how to do what I'm trying to accomplish?
python-3.x pandas dataframe subset
add a comment |
I have a DataFrame with zip codes, among other things. The data, as a sample, looks like this:
Zip Item1 Item2 Item3
78264.0 pan elephant blue
73909.0 steamer panda yellow
2602.0 pot rhino orange
59661.0 fork zebra green
861893.0 sink ocelot red
77892.0 spatula doggie brown
Some of these zip codes are invalid, having either too many or too few digits. I'm trying to remove those rows that have an invalid number of characters/digits (seven characters in this case, because I am checking length based on str()
and the .0
is included in there). The following lengths
loop:
zips = mydata.iloc[:,0].astype(str)
lengths =
for i in zips:
lengths.append(len(i))
produces a series (not to be confused with Series, although maybe it is--I'm new at Python) of zip code character lengths for each row. I am then trying to subset the DataFrame based on the information from the lengths
variable. I tried a couple of different ways; this following was the latest version:
for i in lengths.index(i):
if mydata.iloc[i:,0] != 7:
mydata.iloc[i:,0].drop()
Naturally, this fails, with a ValueError: '44114.0' is not in list
error. Can anyone give some advice as to how to do what I'm trying to accomplish?
python-3.x pandas dataframe subset
if the zip codes are stored as floats, containing too few digits might be because they start with 0 (which is valid for zip codes), but those leading 0s get dropped for floats
– Henry Woody
Nov 26 '18 at 2:36
can you post a snippet of your data for example
– Henry Woody
Nov 26 '18 at 2:37
Correct. It's part of what I factored into my analysis. It's how I received the data, and it's part of why I now need to fix that. The data is damaged, but it is what it is.
– Yehuda
Nov 26 '18 at 2:37
add a comment |
I have a DataFrame with zip codes, among other things. The data, as a sample, looks like this:
Zip Item1 Item2 Item3
78264.0 pan elephant blue
73909.0 steamer panda yellow
2602.0 pot rhino orange
59661.0 fork zebra green
861893.0 sink ocelot red
77892.0 spatula doggie brown
Some of these zip codes are invalid, having either too many or too few digits. I'm trying to remove those rows that have an invalid number of characters/digits (seven characters in this case, because I am checking length based on str()
and the .0
is included in there). The following lengths
loop:
zips = mydata.iloc[:,0].astype(str)
lengths =
for i in zips:
lengths.append(len(i))
produces a series (not to be confused with Series, although maybe it is--I'm new at Python) of zip code character lengths for each row. I am then trying to subset the DataFrame based on the information from the lengths
variable. I tried a couple of different ways; this following was the latest version:
for i in lengths.index(i):
if mydata.iloc[i:,0] != 7:
mydata.iloc[i:,0].drop()
Naturally, this fails, with a ValueError: '44114.0' is not in list
error. Can anyone give some advice as to how to do what I'm trying to accomplish?
python-3.x pandas dataframe subset
I have a DataFrame with zip codes, among other things. The data, as a sample, looks like this:
Zip Item1 Item2 Item3
78264.0 pan elephant blue
73909.0 steamer panda yellow
2602.0 pot rhino orange
59661.0 fork zebra green
861893.0 sink ocelot red
77892.0 spatula doggie brown
Some of these zip codes are invalid, having either too many or too few digits. I'm trying to remove those rows that have an invalid number of characters/digits (seven characters in this case, because I am checking length based on str()
and the .0
is included in there). The following lengths
loop:
zips = mydata.iloc[:,0].astype(str)
lengths =
for i in zips:
lengths.append(len(i))
produces a series (not to be confused with Series, although maybe it is--I'm new at Python) of zip code character lengths for each row. I am then trying to subset the DataFrame based on the information from the lengths
variable. I tried a couple of different ways; this following was the latest version:
for i in lengths.index(i):
if mydata.iloc[i:,0] != 7:
mydata.iloc[i:,0].drop()
Naturally, this fails, with a ValueError: '44114.0' is not in list
error. Can anyone give some advice as to how to do what I'm trying to accomplish?
python-3.x pandas dataframe subset
python-3.x pandas dataframe subset
edited Nov 26 '18 at 2:44
Yehuda
asked Nov 26 '18 at 2:33
YehudaYehuda
189113
189113
if the zip codes are stored as floats, containing too few digits might be because they start with 0 (which is valid for zip codes), but those leading 0s get dropped for floats
– Henry Woody
Nov 26 '18 at 2:36
can you post a snippet of your data for example
– Henry Woody
Nov 26 '18 at 2:37
Correct. It's part of what I factored into my analysis. It's how I received the data, and it's part of why I now need to fix that. The data is damaged, but it is what it is.
– Yehuda
Nov 26 '18 at 2:37
add a comment |
if the zip codes are stored as floats, containing too few digits might be because they start with 0 (which is valid for zip codes), but those leading 0s get dropped for floats
– Henry Woody
Nov 26 '18 at 2:36
can you post a snippet of your data for example
– Henry Woody
Nov 26 '18 at 2:37
Correct. It's part of what I factored into my analysis. It's how I received the data, and it's part of why I now need to fix that. The data is damaged, but it is what it is.
– Yehuda
Nov 26 '18 at 2:37
if the zip codes are stored as floats, containing too few digits might be because they start with 0 (which is valid for zip codes), but those leading 0s get dropped for floats
– Henry Woody
Nov 26 '18 at 2:36
if the zip codes are stored as floats, containing too few digits might be because they start with 0 (which is valid for zip codes), but those leading 0s get dropped for floats
– Henry Woody
Nov 26 '18 at 2:36
can you post a snippet of your data for example
– Henry Woody
Nov 26 '18 at 2:37
can you post a snippet of your data for example
– Henry Woody
Nov 26 '18 at 2:37
Correct. It's part of what I factored into my analysis. It's how I received the data, and it's part of why I now need to fix that. The data is damaged, but it is what it is.
– Yehuda
Nov 26 '18 at 2:37
Correct. It's part of what I factored into my analysis. It's how I received the data, and it's part of why I now need to fix that. The data is damaged, but it is what it is.
– Yehuda
Nov 26 '18 at 2:37
add a comment |
3 Answers
3
active
oldest
votes
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata.astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df
defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
I do these index functions in R all the time, but I'm not familiar enough with Python's index functions yet. I'm going to be referencing this question many times in the future. Where can I find the documentation for Python bracket indexing and slicing?
– Yehuda
Nov 26 '18 at 2:56
The slice notation is shorthand for using .where which accepts a boolean series of the same shape and returns rows for which the series isTrue
, you can think of it as getting rows for which the condition (in brackets) is true. Here is a long page on indexing and selecting in Pandas: pandas.pydata.org/pandas-docs/stable/indexing.html
– Henry Woody
Nov 26 '18 at 3:03
add a comment |
Using str.len
df[df.iloc[:,0].astype(str).str.len()!=7]
A
1 1.222222
2 1.222200
dput :
df=pd.DataFrame({'A':[1.22222,1.222222,1.2222]})
add a comment |
See if this works
df1 = df['ZipCode'].astype(str).map(len)==5
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53474062%2fdrop-rows-based-on-float-length-in-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata.astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df
defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
I do these index functions in R all the time, but I'm not familiar enough with Python's index functions yet. I'm going to be referencing this question many times in the future. Where can I find the documentation for Python bracket indexing and slicing?
– Yehuda
Nov 26 '18 at 2:56
The slice notation is shorthand for using .where which accepts a boolean series of the same shape and returns rows for which the series isTrue
, you can think of it as getting rows for which the condition (in brackets) is true. Here is a long page on indexing and selecting in Pandas: pandas.pydata.org/pandas-docs/stable/indexing.html
– Henry Woody
Nov 26 '18 at 3:03
add a comment |
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata.astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df
defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
I do these index functions in R all the time, but I'm not familiar enough with Python's index functions yet. I'm going to be referencing this question many times in the future. Where can I find the documentation for Python bracket indexing and slicing?
– Yehuda
Nov 26 '18 at 2:56
The slice notation is shorthand for using .where which accepts a boolean series of the same shape and returns rows for which the series isTrue
, you can think of it as getting rows for which the condition (in brackets) is true. Here is a long page on indexing and selecting in Pandas: pandas.pydata.org/pandas-docs/stable/indexing.html
– Henry Woody
Nov 26 '18 at 3:03
add a comment |
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata.astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df
defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
You can write this more concisely using Pandas filtering rather than loops and ifs.
Here is an example:
valid_zips = mydata[mydata.astype(str).str.len() == 7]
or
zip_code_upper_bound = 100000
valid_zips = mydata[mydata < zip_code_upper_bound]
assuming fractional numbers are not included in your set. Note that the first example will remove shorter zips, while the second will leave them in, which you might want as they could have had leading zeros.
Sample output:
With df
defined as (from your example):
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
2 2602.0 pot rhino orange
3 59661.0 fork zebra green
4 861893.0 sink ocelot red
5 77892.0 spatula doggie brown
Using the following code:
df[df.Zip.astype(str).str.len() == 7]
The result is:
Zip Item1 Item2 Item3
0 78264.0 pan elephant blue
1 73909.0 steamer panda yellow
3 59661.0 fork zebra green
5 77892.0 spatula doggie brown
edited Nov 26 '18 at 2:57
answered Nov 26 '18 at 2:43
Henry WoodyHenry Woody
4,6123927
4,6123927
I do these index functions in R all the time, but I'm not familiar enough with Python's index functions yet. I'm going to be referencing this question many times in the future. Where can I find the documentation for Python bracket indexing and slicing?
– Yehuda
Nov 26 '18 at 2:56
The slice notation is shorthand for using .where which accepts a boolean series of the same shape and returns rows for which the series isTrue
, you can think of it as getting rows for which the condition (in brackets) is true. Here is a long page on indexing and selecting in Pandas: pandas.pydata.org/pandas-docs/stable/indexing.html
– Henry Woody
Nov 26 '18 at 3:03
add a comment |
I do these index functions in R all the time, but I'm not familiar enough with Python's index functions yet. I'm going to be referencing this question many times in the future. Where can I find the documentation for Python bracket indexing and slicing?
– Yehuda
Nov 26 '18 at 2:56
The slice notation is shorthand for using .where which accepts a boolean series of the same shape and returns rows for which the series isTrue
, you can think of it as getting rows for which the condition (in brackets) is true. Here is a long page on indexing and selecting in Pandas: pandas.pydata.org/pandas-docs/stable/indexing.html
– Henry Woody
Nov 26 '18 at 3:03
I do these index functions in R all the time, but I'm not familiar enough with Python's index functions yet. I'm going to be referencing this question many times in the future. Where can I find the documentation for Python bracket indexing and slicing?
– Yehuda
Nov 26 '18 at 2:56
I do these index functions in R all the time, but I'm not familiar enough with Python's index functions yet. I'm going to be referencing this question many times in the future. Where can I find the documentation for Python bracket indexing and slicing?
– Yehuda
Nov 26 '18 at 2:56
The slice notation is shorthand for using .where which accepts a boolean series of the same shape and returns rows for which the series is
True
, you can think of it as getting rows for which the condition (in brackets) is true. Here is a long page on indexing and selecting in Pandas: pandas.pydata.org/pandas-docs/stable/indexing.html– Henry Woody
Nov 26 '18 at 3:03
The slice notation is shorthand for using .where which accepts a boolean series of the same shape and returns rows for which the series is
True
, you can think of it as getting rows for which the condition (in brackets) is true. Here is a long page on indexing and selecting in Pandas: pandas.pydata.org/pandas-docs/stable/indexing.html– Henry Woody
Nov 26 '18 at 3:03
add a comment |
Using str.len
df[df.iloc[:,0].astype(str).str.len()!=7]
A
1 1.222222
2 1.222200
dput :
df=pd.DataFrame({'A':[1.22222,1.222222,1.2222]})
add a comment |
Using str.len
df[df.iloc[:,0].astype(str).str.len()!=7]
A
1 1.222222
2 1.222200
dput :
df=pd.DataFrame({'A':[1.22222,1.222222,1.2222]})
add a comment |
Using str.len
df[df.iloc[:,0].astype(str).str.len()!=7]
A
1 1.222222
2 1.222200
dput :
df=pd.DataFrame({'A':[1.22222,1.222222,1.2222]})
Using str.len
df[df.iloc[:,0].astype(str).str.len()!=7]
A
1 1.222222
2 1.222200
dput :
df=pd.DataFrame({'A':[1.22222,1.222222,1.2222]})
answered Nov 26 '18 at 2:38
Wen-BenWen-Ben
116k83369
116k83369
add a comment |
add a comment |
See if this works
df1 = df['ZipCode'].astype(str).map(len)==5
add a comment |
See if this works
df1 = df['ZipCode'].astype(str).map(len)==5
add a comment |
See if this works
df1 = df['ZipCode'].astype(str).map(len)==5
See if this works
df1 = df['ZipCode'].astype(str).map(len)==5
answered Nov 26 '18 at 2:37
Ken DekalbKen Dekalb
317112
317112
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53474062%2fdrop-rows-based-on-float-length-in-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
if the zip codes are stored as floats, containing too few digits might be because they start with 0 (which is valid for zip codes), but those leading 0s get dropped for floats
– Henry Woody
Nov 26 '18 at 2:36
can you post a snippet of your data for example
– Henry Woody
Nov 26 '18 at 2:37
Correct. It's part of what I factored into my analysis. It's how I received the data, and it's part of why I now need to fix that. The data is damaged, but it is what it is.
– Yehuda
Nov 26 '18 at 2:37