Pandas newbie, looking for suggestion for improvement
The following works, but it seems overly complex to me. Is there an easier way to calculate the time differences and their summary statistics? I am especially looking to replace the for loop.
import pandas as pd
import numpy as np
# Read in the csv file using the 'record_id' field as the index, keeping only the timestamp
df = pd.read_csv("my_data.csv", sep=',', index_col='record_id', usecols=["record_id", "timestamp"])
# Group them by record_id
record_id_grouping = df.groupby("record_id")
# Create a list of data frames, each with a different record_id
df_list = [x for _, x in record_id_grouping]
new_df_list = []
# Iterate over the list of data frames
for df in df_list:
    # Add a time difference column
    df['diff'] = df["timestamp"].diff()
    # Drop the timestamp column and any data frame rows with NaN
    df = df.loc[:, ["diff"]].dropna()
    # Append the new data frame to a new list
    new_df_list.append(df)
# Remove any data frames from the list that are empty
new_df_list = [df for df in new_df_list if not df.empty]
# Put all the data frames in the list back into a single data frame
new_df = pd.concat(new_df_list)
# Calculate mean, std, max, min and count for each record_id in the data frame
final_df = new_df.groupby("record_id").agg(['mean', 'std', 'max', 'min', 'count'])
# Drop the diff level
final_df.columns = final_df.columns.droplevel()
# Drop any rows that have NaN in them.
final_df = final_df.dropna()
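One possible way to drop the for loop is to let `groupby` compute the differences within each group directly. The following is only a minimal sketch under stated assumptions: the inline sample data is hypothetical and stands in for `my_data.csv`, and `timestamp` is assumed to be numeric (if it holds date strings, it would need to be parsed first, which this sketch does not cover).

```python
import io

import pandas as pd

# Hypothetical sample data standing in for my_data.csv (assumption, not from the question)
csv_text = """record_id,timestamp
a,10
a,13
a,19
b,5
c,1
c,2
"""

df = pd.read_csv(io.StringIO(csv_text), usecols=["record_id", "timestamp"])

# Differences between consecutive timestamps within each record_id;
# the first row of every group has no predecessor and becomes NaN
df["diff"] = df.groupby("record_id")["timestamp"].diff()

final_df = (
    df.dropna(subset=["diff"])              # drop the NaN first row of each group
      .groupby("record_id")["diff"]         # aggregate a single column, so no column level to drop
      .agg(["mean", "std", "max", "min", "count"])
      .dropna()                             # drop groups whose std is NaN (only one difference)
)

print(final_df)
```

Selecting the `diff` column before calling `agg` yields flat column names, so the `droplevel()` step is not needed. As in the original code, a `record_id` with fewer than three rows ends up excluded, because its standard deviation is undefined.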
pandas
asked 3 mins ago
ACRL