Error message when running jobs using torque. read_tcp_reply, Mismatching protocols. Expected protocol 4 but...



























My system is CentOS 7, and I installed torque-6.1.0, configured with ./configure --prefix=/opt/pbs --with-debug --with-scp --disable-gcc-warnings



My server is named "node00", and I added a compute node called "node01".



[root@node00 torque]# pbsnodes
node01
state = free
power_state = Running
np = 16
ntype = cluster
status = opsys=linux,uname=Linux node01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64,nsessions=0,nusers=0,idletime=7057,totmem=98382176kb,availmem=97993700kb,physmem=32846184kb,ncpus=16,loadave=0.00,gres=,netload=286314300,state=free,varattr= ,cpuclock=Fixed,macaddr=0c:c4:7a:02:ba:98,version=6.1.0,rectime=1481028058,jobs=
mom_service_port = 15002
mom_manager_port = 15003


I submitted a simple job with echo "sleep 5" | qsub
and qstat -f then showed this error message:



queue_type = E
sched_hint = Unable to copy files back - please see the mother superior's
log for exact details.
comment = Job started on Tue Dec 06 at 21:35


So I read the mother superior's log with vi /var/spool/torque/mom_logs/20161206:



12/06/2016 21:35:33.397;02;   pbs_mom.14693;Svr;Log;Log opened
12/06/2016 21:35:33.397;02; pbs_mom.14693;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;setpbsserver;node00
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;mom_server_add;server node00 added
12/06/2016 21:35:33.405;02; pbs_mom.14694;n/a;initialize;independent
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;dep_initialize;mom is now oom-killer safe
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/06/2016 21:35:33.407;128; pbs_mom.14694;Svr;pbs_mom;before init_abort_jobs
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;pbs_mom;Is up
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/pbs/sbin/pbs_mom 1481027487
12/06/2016 21:35:33.414;02; pbs_mom.14694;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 2 MOM status update intervals


It seems that node01 and node00 cannot send data to each other. Is that right, and how can I fix this?

































  • Things to check: confirm that the server and the compute node run the same version. Verify that you don't have extra pbs_mom processes running. Disable ipchains and iptables on both the server and the node. Increase $loglevel in mom_priv/config, restart pbs_mom, and then check the mom's log and syslog.

    – clusterdude
    Dec 6 '16 at 14:21











  • I have gone through everything on your checklist. I then installed an older Torque version, and its error messages are more helpful, for example: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password). It looks like an SSH permission problem. But I configured this version with ./configure --with-rcp=/usr/bin/rcp, so why does Torque still want to use ssh?

    – bsjun
    Dec 6 '16 at 14:45











  • Anyway, I generated authorized keys and now it runs well. But I am still confused about why Torque doesn't use rsh. Thanks!

    – bsjun
    Dec 6 '16 at 15:11
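
The last item on the checklist in the first comment (raising $loglevel in mom_priv/config) can be sketched as a small shell helper. This is only a sketch: the function name set_mom_loglevel is made up for illustration, and the config path in the usage note assumes the /var/spool/torque spool directory seen in the question.

```shell
# Sketch only: set_mom_loglevel is a hypothetical helper, not part of Torque.

# set_mom_loglevel CONFIG LEVEL
# Replace an existing "$loglevel" line in CONFIG, or append one if none exists.
set_mom_loglevel() {
    config=$1
    level=$2
    if grep -q '^\$loglevel' "$config" 2>/dev/null; then
        # Rewrite the existing directive (via a temp file, for portability).
        sed "s/^\\\$loglevel.*/\$loglevel $level/" "$config" > "$config.tmp" &&
            mv "$config.tmp" "$config"
    else
        # No directive yet: append one.
        printf '$loglevel %s\n' "$level" >> "$config"
    fi
}
```

For example, set_mom_loglevel /var/spool/torque/mom_priv/config 7 (7 is used here only as an example of a high value), followed by a restart of pbs_mom as the comment suggests.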
















pbs torque
















asked Dec 6 '16 at 12:49









bsjun

























1 Answer














Regarding the error in the title, "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0":
This error shows up on systems in the following situations:




  1. pbs_mom runs on a node that is unknown to the pbs_server (it is missing from the nodes file).

  2. The /var/spool/torque/server_priv/jobs directory gets clogged with job files that should have been removed on job termination (this can easily grow to thousands of files, as pbs_server is notoriously bad at cleaning up). The same applies to the /var/spool/torque/server_priv/arrays directory.

  3. Even with the two situations above ruled out, it is still seen on a system with 400 nodes and 1000 jobs (queued and/or running). In this case it happens 5-10 times an hour.
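
The first two situations can be checked with a couple of shell helpers. This is a rough sketch: the function names node_in_nodes_file and count_files are invented for illustration, and the nodes file is assumed to list one node per line with the node name as the first field.

```shell
# Sketch only: both helper names are hypothetical.

# node_in_nodes_file NODE NODES_FILE
# Succeeds if NODE is the first field of some line in NODES_FILE.
node_in_nodes_file() {
    awk -v n="$1" '$1 == n { found = 1 } END { exit !found }' "$2"
}

# count_files DIR
# Print the number of regular files directly inside DIR.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

On the server this might be used as node_in_nodes_file node01 /var/spool/torque/server_priv/nodes and count_files /var/spool/torque/server_priv/jobs; a count in the thousands would point at the second situation.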


In all cases, tcpdump on the pbs_server side shows that the mom is sent a TCP reset after it has sent a status update. It is easily traced with:



    tcpdump -i <interface> tcp port 15001 and tcp[13]=4

08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0

On the node this is logged:
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
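
The filter tcp[13]=4 above matches only segments whose flags byte is exactly RST; a reset that also carries ACK would not match. The sketch below builds a filter that tests just the RST bit using pcap's symbolic names, and adds a helper that counts the error per hour in a mom log. Both function names are made up for illustration, and the log helper assumes the timestamp layout shown in the log lines above.

```shell
# Sketch only: rst_filter and count_mismatch_per_hour are hypothetical helpers.

# rst_filter PORT
# Print a pcap filter matching any segment on PORT with the RST bit set.
rst_filter() {
    printf 'tcp port %s and tcp[tcpflags] & tcp-rst != 0\n' "$1"
}

# count_mismatch_per_hour LOGFILE
# Print "MM/DD/YYYY HH count" for each hour in which the error was logged.
count_mismatch_per_hour() {
    grep 'Mismatching protocols' "$1" |
        awk '{ print $1, substr($2, 1, 2) }' |   # keep date and hour only
        sort | uniq -c |
        awk '{ print $2, $3, $1 }'               # move the count to the end
}
```

A possible invocation is tcpdump -i <interface> "$(rst_filter 15001)"; the quotes keep the shell from interpreting the & in the filter.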


UPDATE:
We finally solved the problem by implementing a MOM hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy.
For a 500-node cluster we defined 8 groups (each one a path in mom_hierarchy), with a top level of 2 nodes and a second level containing the rest of the nodes in that group. Something like this:



<path>
<level>node1,node2</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node2,node1</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node3,node4</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node4,node3</level>
<level>comma separated list of some 60 nodes</level>
</path>
.....
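
Writing out eight such paths by hand is error-prone, so a small generator may help. The following is a minimal sketch that emits one path entry in the layout shown above; the function name mom_path and the node names in the usage note are purely illustrative.

```shell
# Sketch only: mom_path is a hypothetical helper.

# mom_path TOP_LEVEL NODE...
# Print one <path> entry: TOP_LEVEL as the first level (already comma
# separated), and the remaining arguments joined by commas as the second.
mom_path() {
    top=$1
    shift
    # "$*" joins the arguments with the first character of IFS.
    nodes=$(IFS=,; printf '%s' "$*")
    printf '<path>\n<level>%s</level>\n<level>%s</level>\n</path>\n' "$top" "$nodes"
}
```

For instance, mom_path node1,node2 node5 node6 node7 prints a path whose second level is node5,node6,node7; the output of several calls can be concatenated into the mom_hierarchy file.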





share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f40995829%2ferror-message-when-running-jobs-using-torque-read-tcp-reply-mismatching-protoc%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    With respect to the headline text: "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0"
    This is an error that shows on systems when:




    1. pbs_mom runs on a node unknown to the pbs_server (excluded from the nodes file)

    2. When the /var/spool/torque/server_priv/jobs directory get clogged with job files that should have been removed on job termination (this can easily grow to thousands of files as pbs_server is notoriously bad doing cleanup). Same thing applies for /var/spool/torque/server_priv/arrays directory.

    3. Clearing the above two situations it is still seen on a system with 400 nodes and 1000 jobs (queued and/or running). In this case it happens 5-10 times an hour.


    In all cases tcpdump shows on the pbs_server side that the mom is sent a tcp reset after it sent a status update. It is easily traced with:



        tcpdump -i <interface> tcp port 15001 and tcp[13]=4

    08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0

    On the node this is logged:
    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals


    UPDATE:
    We finally solved the problem by implementing MOM hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy.
    For a 500 node cluster we defined 8 groups (path in mom_hierarchy) with a top level of 2 nodes and one level with the rest of the nodes in that group. Something like this:



    <path>
    <level>node1,node2</level>
    <level> comma separated list of some 60 nodes</level>
    </path>
    <path>
    <level>node2,node1</level>
    <level comma separated list of some 60 nodes</level>
    </path>
    <path>
    <level>node3,node4</level>
    <level>comma separated list of some 60 nodes</level>
    </path>
    <path>
    <level>node4,node3</level>
    <level>comma separated list of some 60 nodes</level>
    </path>
    .....





    share|improve this answer






























      0














      With respect to the headline text: "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0"
      This is an error that shows on systems when:




      1. pbs_mom runs on a node unknown to the pbs_server (excluded from the nodes file)

      2. When the /var/spool/torque/server_priv/jobs directory get clogged with job files that should have been removed on job termination (this can easily grow to thousands of files as pbs_server is notoriously bad doing cleanup). Same thing applies for /var/spool/torque/server_priv/arrays directory.

      3. Clearing the above two situations it is still seen on a system with 400 nodes and 1000 jobs (queued and/or running). In this case it happens 5-10 times an hour.


      In all cases tcpdump shows on the pbs_server side that the mom is sent a tcp reset after it sent a status update. It is easily traced with:



          tcpdump -i <interface> tcp port 15001 and tcp[13]=4

      08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0

      On the node this is logged:
      10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
      10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
      10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
      10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
      10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals


      UPDATE:
      We finally solved the problem by implementing MOM hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy.
      For a 500 node cluster we defined 8 groups (path in mom_hierarchy) with a top level of 2 nodes and one level with the rest of the nodes in that group. Something like this:



      <path>
      <level>node1,node2</level>
      <level> comma separated list of some 60 nodes</level>
      </path>
      <path>
      <level>node2,node1</level>
      <level comma separated list of some 60 nodes</level>
      </path>
      <path>
      <level>node3,node4</level>
      <level>comma separated list of some 60 nodes</level>
      </path>
      <path>
      <level>node4,node3</level>
      <level>comma separated list of some 60 nodes</level>
      </path>
      .....





      share|improve this answer




























        0












        0








        0







        edited Nov 28 '18 at 11:32

        answered Oct 13 '18 at 15:41

        John Damm Sørensen































