Error message when running jobs using torque. read_tcp_reply, Mismatching protocols. Expected protocol 4 but...
My system is CentOS 7, and I installed torque-6.1.0,
configured with ./configure --prefix=/opt/pbs --with-debug --with-scp --disable-gcc-warnings
My server is named "node00", and I added a compute node called "node01".
[root@node00 torque]# pbsnodes
node01
state = free
power_state = Running
np = 16
ntype = cluster
status = opsys=linux,uname=Linux node01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64,nsessions=0,nusers=0,idletime=7057,totmem=98382176kb,availmem=97993700kb,physmem=32846184kb,ncpus=16,loadave=0.00,gres=,netload=286314300,state=free,varattr= ,cpuclock=Fixed,macaddr=0c:c4:7a:02:ba:98,version=6.1.0,rectime=1481028058,jobs=
mom_service_port = 15002
mom_manager_port = 15003
I submitted a simple job with echo "sleep 5" | qsub
and qstat -f then reported an error:
queue_type = E
sched_hint = Unable to copy files back - please see the mother superior's
log for exact details.
comment = Job started on Tue Dec 06 at 21:35
So I opened the mother superior's log with vi /var/spool/torque/mom_logs/20161206:
12/06/2016 21:35:33.397;02; pbs_mom.14693;Svr;Log;Log opened
12/06/2016 21:35:33.397;02; pbs_mom.14693;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;setpbsserver;node00
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;mom_server_add;server node00 added
12/06/2016 21:35:33.405;02; pbs_mom.14694;n/a;initialize;independent
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;dep_initialize;mom is now oom-killer safe
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/06/2016 21:35:33.407;128; pbs_mom.14694;Svr;pbs_mom;before init_abort_jobs
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;pbs_mom;Is up
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/pbs/sbin/pbs_mom 1481027487
12/06/2016 21:35:33.414;02; pbs_mom.14694;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 2 MOM status update intervals
It seems that node01 and node00 can't send data to each other. Is that right, and how can I fix it?
pbs torque
Things to check: confirm that the server and the compute node run the same version. Verify that you don't have extra pbs_mom processes running. Disable ipchains and iptables on both the server and the node. Increase $loglevel in mom_priv/config, restart pbs_mom, and then check the MOM's log and syslog.
– clusterdude
Dec 6 '16 at 14:21
I went through everything on your checklist. I then installed an older Torque version, and its error messages are more helpful, for example: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password). It looks like an SSH permission problem. But I built this version with ./configure --with-rcp=/usr/bin/rcp, so why does Torque still want to use ssh?
– bsjun
Dec 6 '16 at 14:45
Anyway, I generated authorized keys and now it runs well. It still confuses me why Torque doesn't use rsh, though. Thanks!
– bsjun
Dec 6 '16 at 15:11
asked Dec 6 '16 at 12:49
bsjun
1 Answer
With respect to the headline text: "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0"
This error shows up in the following situations:
- pbs_mom runs on a node that pbs_server does not know about (the node is missing from the nodes file).
- The /var/spool/torque/server_priv/jobs directory gets clogged with job files that should have been removed on job termination (this can easily grow to thousands of files, as pbs_server is notoriously bad at cleaning up). The same applies to the /var/spool/torque/server_priv/arrays directory.
- Even with the two situations above cleared, it is still seen on a system with 400 nodes and 1000 jobs (queued and/or running), where it happens 5-10 times an hour.
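The first two situations can be checked on the pbs_server host with a short script. The following is only a minimal Python sketch, not part of Torque: the function names are made up for illustration, and it assumes the nodes file has one node per line with the node name as the first whitespace-separated token and "#" starting a comment.

```python
import os


def node_is_listed(nodes_file, hostname):
    """Return True if hostname is the first token of a non-comment line.

    Assumption: one node per line, node name first, '#' starts a comment.
    """
    with open(nodes_file) as handle:
        for line in handle:
            tokens = line.split("#", 1)[0].split()
            if tokens and tokens[0] == hostname:
                return True
    return False


def count_entries(directory):
    """Count the entries in a directory; return None if it cannot be read."""
    try:
        return sum(1 for _ in os.scandir(directory))
    except OSError:
        return None
```

As a usage example, node_is_listed("/var/spool/torque/server_priv/nodes", "node01") would check the node from the question (the nodes file path is an assumption based on the spool directory used here), and count_entries("/var/spool/torque/server_priv/jobs") gives a quick idea of whether stale job files are piling up.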
In all cases, tcpdump on the pbs_server side shows that the MOM is sent a TCP reset after it has sent a status update. This is easily traced with:
tcpdump -i <interface> tcp port 15001 and tcp[13]=4
08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0
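For readers unfamiliar with the filter expression: tcp[13] is the flags byte of the TCP header, and the value 4 is the RST bit on its own, which tcpdump prints as Flags [R]. The small Python sketch below (illustrative helper names, not part of any tool) decodes the flags byte and shows the difference between the exact match used above and a broader "any RST" match.

```python
# Bit values of the TCP flags byte (byte 13 of the TCP header).
TCP_FLAG_BITS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
]


def decode_tcp_flags(flags_byte):
    """Return the names of the flags set in the given flags byte."""
    return [name for bit, name in TCP_FLAG_BITS if flags_byte & bit]


def matches_exact_rst(flags_byte):
    """Mirror the filter tcp[13]=4: RST set and no other flag."""
    return flags_byte == 0x04


def matches_any_rst(flags_byte):
    """Mirror the broader filter tcp[13] & 4 != 0: RST set, others allowed."""
    return (flags_byte & 0x04) != 0
```

A segment carrying RST together with ACK has a flags byte of 0x14, so it is skipped by the exact filter; if such resets matter as well, the broader expression is the one to use.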
On the node this is logged:
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
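To see how often this happens on a given node, the occurrences can be counted per hour from the MOM log. This is a rough Python sketch under the assumption that every log line starts with a timestamp in the format shown above, followed by a semicolon; the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Text that identifies the error discussed here, copied from the log above.
MARKER = "LOG_ERROR::read_tcp_reply, Mismatching protocols"


def mismatches_per_hour(lines):
    """Count 'Mismatching protocols' errors per hour of the log timestamp."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        # The timestamp is everything before the first semicolon.
        stamp = line.split(";", 1)[0].strip()
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S.%f")
        counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)
```

Feeding it the lines of a file under /var/spool/torque/mom_logs/ gives one count per hour, which makes it easy to compare the rate before and after a configuration change.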
UPDATE:
We finally solved the problem by implementing a MOM hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy.
For a 500-node cluster we defined 8 groups (each a path in mom_hierarchy), each with a top level of 2 nodes and a second level holding the rest of the nodes in that group. Something like this:
<path>
<level>node1,node2</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node2,node1</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node3,node4</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node4,node3</level>
<level>comma separated list of some 60 nodes</level>
</path>
.....
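Writing such a file by hand for hundreds of nodes is error-prone, so it may be easier to generate it. The following Python sketch only illustrates the layout shown above; the function names and the c001-style node names are hypothetical, and whether the top-level nodes should also appear in a second level is not something this answer settles.

```python
def split_into_groups(nodes, group_count):
    """Split nodes into group_count contiguous groups of near-equal size."""
    size, extra = divmod(len(nodes), group_count)
    groups, start = [], 0
    for index in range(group_count):
        # The first 'extra' groups take one additional node each.
        end = start + size + (1 if index < extra else 0)
        groups.append(nodes[start:end])
        start = end
    return groups


def render_mom_hierarchy(top_levels, groups):
    """Render one <path> per group: a top level, then the group's nodes."""
    lines = []
    for top, members in zip(top_levels, groups):
        lines.append("<path>")
        lines.append("<level>%s</level>" % ",".join(top))
        lines.append("<level>%s</level>" % ",".join(members))
        lines.append("</path>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    compute = ["c%03d" % i for i in range(1, 7)]  # hypothetical node names
    tops = [["node1", "node2"], ["node2", "node1"]]
    print(render_mom_hierarchy(tops, split_into_groups(compute, 2)), end="")
```

The two top-level lists are given in swapped order on purpose, mirroring the node1,node2 and node2,node1 pairs in the example above.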
edited Nov 28 '18 at 11:32
answered Oct 13 '18 at 15:41
John Damm Sørensen