Error message when running jobs using torque. read_tcp_reply, Mismatching protocols. Expected protocol 4 but...

My system is CentOS 7, and I installed torque-6.1.0, configured with ./configure --prefix=/opt/pbs --with-debug --with-scp --disable-gcc-warnings

My server name is "node00" and I added a slave node called "node01"

    [root@node00 torque]# pbsnodes
    node01
         state = free
         power_state = Running
         np = 16
         ntype = cluster
         status = opsys=linux,uname=Linux node01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64,nsessions=0,nusers=0,idletime=7057,totmem=98382176kb,availmem=97993700kb,physmem=32846184kb,ncpus=16,loadave=0.00,gres=,netload=286314300,state=free,varattr= ,cpuclock=Fixed,macaddr=0c:c4:7a:02:ba:98,version=6.1.0,rectime=1481028058,jobs=
         mom_service_port = 15002
         mom_manager_port = 15003

I submitted a simple job with echo "sleep 5" | qsub, and qstat -f then showed the following error:

    queue_type = E
    sched_hint = Unable to copy files back - please see the mother superior's
        log for exact details.
    comment = Job started on Tue Dec 06 at 21:35

So I read the mother superior's log (vi /var/spool/torque/mom_logs/20161206):

    12/06/2016 21:35:33.397;02;   pbs_mom.14693;Svr;Log;Log opened
    12/06/2016 21:35:33.397;02;   pbs_mom.14693;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
    12/06/2016 21:35:33.404;02;   pbs_mom.14693;Svr;setpbsserver;node00
    12/06/2016 21:35:33.404;02;   pbs_mom.14693;Svr;mom_server_add;server node00 added
    12/06/2016 21:35:33.405;02;   pbs_mom.14694;n/a;initialize;independent
    12/06/2016 21:35:33.405;02;   pbs_mom.14694;Svr;dep_initialize;mom is now oom-killer safe
    12/06/2016 21:35:33.405;02;   pbs_mom.14694;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
    12/06/2016 21:35:33.407;128;   pbs_mom.14694;Svr;pbs_mom;before init_abort_jobs
    12/06/2016 21:35:33.410;02;   pbs_mom.14694;Svr;pbs_mom;Is up
    12/06/2016 21:35:33.410;02;   pbs_mom.14694;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/pbs/sbin/pbs_mom 1481027487
    12/06/2016 21:35:33.414;02;   pbs_mom.14694;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
    12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
    12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
    12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
    12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
    12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
    12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
    12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
    12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
    12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
    12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 2 MOM status update intervals

It seems that node01 and node00 can't send data to each other. Is that right? And how can I fix this?

asked Dec 6 '16 at 12:49

bsjun

217

Things to check: confirm that the server and the compute node run the same version. Verify that you don't have extra pbs_mom processes running. Disable ipchains and iptables on both the server and the node. Increase $loglevel in mom_priv/config, restart pbs_mom, and then check the mom's log and syslog.

– clusterdude
Dec 6 '16 at 14:21

I went through everything on your checklist and it all checks out. I then installed an older Torque version, and its error messages are more helpful, for example: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password). It looks like an SSH permission problem. But I configured this version with ./configure --with-rcp=/usr/bin/rcp, so why does Torque still want to use ssh?

– bsjun
Dec 6 '16 at 14:45

Anyway, I generated authorized keys and now it runs well. But it still confuses me why Torque doesn't use rsh. Thanks!

– bsjun
Dec 6 '16 at 15:11

pbs torque


1 Answer

With respect to the headline text: "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0"
This error shows up on systems in the following situations:

pbs_mom runs on a node that is unknown to the pbs_server (that is, not listed in the nodes file).

The /var/spool/torque/server_priv/jobs directory gets clogged with job files that should have been removed on job termination (this can easily grow to thousands of files, as pbs_server is notoriously bad at cleaning up). The same applies to the /var/spool/torque/server_priv/arrays directory.

Even with both of these situations cleared, the error is still seen on a system with 400 nodes and 1000 jobs (queued and/or running). In that case it happens 5-10 times an hour.
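To check whether the second situation applies, a quick count of the entries in both directories can help. The following is only a minimal Python sketch: the function names `count_entries` and `report` and the threshold of 1000 are illustrative assumptions, not Torque settings, and the paths are the ones mentioned above (adjust them if your spool directory lives elsewhere).

```python
import os

# Directories named above; adjust if your Torque spool lives elsewhere.
SPOOL_DIRS = [
    "/var/spool/torque/server_priv/jobs",
    "/var/spool/torque/server_priv/arrays",
]


def count_entries(path):
    """Return the number of directory entries in path, or None if it cannot be read."""
    try:
        with os.scandir(path) as entries:
            return sum(1 for _ in entries)
    except OSError:
        return None


def report(dirs=SPOOL_DIRS, threshold=1000):
    """Print the entry count per directory; threshold is an arbitrary illustrative cutoff."""
    for path in dirs:
        count = count_entries(path)
        if count is None:
            print(f"{path}: not readable (try as root on the pbs_server host)")
        elif count > threshold:
            print(f"{path}: {count} entries, worth comparing against the jobs qstat knows about")
        else:
            print(f"{path}: {count} entries")


if __name__ == "__main__":
    report()
```

The sketch only counts entries. Deciding which files are really stale is a separate step that should be checked against the jobs the server still tracks.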

In all cases, tcpdump on the pbs_server side shows that the mom is sent a TCP reset after it has sent a status update. This is easily traced with (the filter is quoted so that the shell does not interpret the brackets):

    tcpdump -i <interface> 'tcp port 15001 and tcp[13]=4'

    08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0
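For readers wondering what `tcp[13]=4` selects: byte 13 of the TCP header holds the flag bits, and the value 4 means that only RST is set. A small Python sketch of that decoding (the helper name `decode_tcp_flags` is made up for illustration):

```python
# Bit masks of the TCP flags byte (offset 13 in the TCP header), lowest bit first.
TCP_FLAGS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
    (0x40, "ECE"),
    (0x80, "CWR"),
]


def decode_tcp_flags(flags_byte):
    """Return the names of the flags set in the given TCP flags byte."""
    return [name for mask, name in TCP_FLAGS if flags_byte & mask]


print(decode_tcp_flags(4))     # ['RST']
print(decode_tcp_flags(0x14))  # ['RST', 'ACK']
```

Because the filter tests for equality, a reset that also carries ACK (value 0x14) would not match. If those should be captured as well, a mask test such as `tcp[13] & 4 != 0` is the usual pcap-filter form.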



On the node this is logged:

    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
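To see how often this happens on a given node, the mom log can be tallied per hour. This is a rough Python sketch that assumes the log layout shown above (timestamp as the first ';'-separated field); the function name `count_mismatches_per_hour` is illustrative.

```python
from collections import Counter
from datetime import datetime

MARKER = "read_tcp_reply, Mismatching protocols"


def count_mismatches_per_hour(lines):
    """Count log lines containing MARKER, grouped by 'YYYY-MM-DD HH' of their timestamp."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        # The timestamp is the first ';'-separated field, e.g. '10/13/2018 08:15:25.161'.
        stamp = line.strip().split(";", 1)[0]
        try:
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S.%f")
        except ValueError:
            continue  # skip lines whose timestamp does not match the assumed layout
        counts[when.strftime("%Y-%m-%d %H")] += 1
    return counts


sample = [
    "10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0",
    "10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File",
]
print(dict(count_mismatches_per_hour(sample)))  # {'2018-10-13 08': 1}
```

For a real log, pass an open file instead of the sample list, for example `count_mismatches_per_hour(open("/var/spool/torque/mom_logs/20161206"))`.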

UPDATE:
We finally solved the problem by implementing a MOM hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy.
For a 500-node cluster we defined 8 groups (each a path in mom_hierarchy), with a top level of 2 nodes and a second level holding the rest of the nodes in that group. Something like this:

    <path>
    <level>node1,node2</level>
    <level>comma separated list of some 60 nodes</level>
    </path>
    <path>
    <level>node2,node1</level>
    <level>comma separated list of some 60 nodes</level>
    </path>
    <path>
    <level>node3,node4</level>
    <level>comma separated list of some 60 nodes</level>
    </path>
    <path>
    <level>node4,node3</level>
    <level>comma separated list of some 60 nodes</level>
    </path>
    .....
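Writing such a file by hand for hundreds of nodes is tedious, so it can be generated. The following Python sketch follows the pattern above: two paths per pair of top-level nodes, with the pair order swapped in the second path. The function name `build_mom_hierarchy`, the round-robin split of the remaining nodes, the two-space indentation, and the node names in the example are all assumptions made for illustration, not something Torque prescribes.

```python
def build_mom_hierarchy(top_nodes, other_nodes):
    """Return mom_hierarchy text: two paths per pair of top nodes, order swapped in the second."""
    if not top_nodes or len(top_nodes) % 2 != 0:
        raise ValueError("top_nodes must contain a non-zero, even number of names")

    # Each consecutive pair (a, b) yields two paths: one led by a,b and one by b,a.
    tops = []
    for i in range(0, len(top_nodes), 2):
        a, b = top_nodes[i], top_nodes[i + 1]
        tops.append((a, b))
        tops.append((b, a))

    # Assumption: deal the remaining nodes round-robin so all second levels are similar in size.
    n_paths = len(tops)
    chunks = [other_nodes[i::n_paths] for i in range(n_paths)]

    lines = []
    for (first, second), chunk in zip(tops, chunks):
        lines.append("<path>")
        lines.append(f"  <level>{first},{second}</level>")
        if chunk:
            lines.append(f"  <level>{','.join(chunk)}</level>")
        lines.append("</path>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names, only to show the shape of the output.
    tops = ["node1", "node2", "node3", "node4"]
    others = [f"node{i}" for i in range(5, 13)]
    print(build_mom_hierarchy(tops, others), end="")
```

Review the generated text before copying it to server_priv/mom_hierarchy. Which nodes serve as the top level is a site decision that this sketch does not make for you.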

edited Nov 28 '18 at 11:32

answered Oct 13 '18 at 15:41

John Damm Sørensen

444
