Errors while putting tasks into the qsub queue

Hi! A few minutes ago I tried to put tasks into the qsub queue using the following command:
qsub -t 1-80 mainMcQsub.sh
But for several tasks in the array I got errors. qstat shows me:
1085375 0.55500 MainMcQsub vplotnik Eqw 04/07/2020 17:02:03 1 1-16:1,18,20,21,24,27,32,33,36,37,40,43,46,47,50,52,56,59,63,65,70,74,77
Is there a problem on the cluster?
Posted by Vsilii Plotnikov
Asked on 07/04/2020 17:15
0

[vplotnik@ncx106 CheckRootVersion]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
--------------------------------------------------------------------------------
1088974 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:29 all.q@ncx131.jinr.ru 1
1088984 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:30 all.q@ncx141.jinr.ru 1
1088988 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:43 all.q@ncx145.jinr.ru 1
1089002 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:44 all.q@ncx159.jinr.ru 1
1089009 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:45 all.q@ncx166.jinr.ru 1
1088953 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:19 ncx110 1
1088956 0.55500 CheckRootV vplotnik Eqw 04/15/2020 12:48:19 ncx113 1
1088961 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:21 ncx118 1
1088962 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:21 ncx119 1
1088963 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:21 ncx120 1
1088969 0.55500 CheckRootV vplotnik Eqw 04/15/2020 12:48:23 ncx126 1
1088971 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:23 ncx128 1
1088972 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:24 ncx129 1
1088977 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:25 ncx134 1
1088987 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:27 ncx144 1
1089016 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:38 ncx173 1
1089020 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:39 ncx177 1
1089021 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:39 ncx178 1
1089022 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:39 ncx179 1
1089026 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:40 ncx183 1
1089030 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:41 ncx187 1
1089031 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:42 ncx188 1
1089032 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:42 ncx189 1
1089033 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:42 ncx190 1
1089034 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:43 ncx191 1
1089035 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:46 ncx192 1
1089036 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:47 ncx193 1
1089037 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:47 ncx194 1
1089038 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:47 ncx195 1
1089039 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:48 ncx196 1
1089040 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:48 ncx197 1
1089041 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:48 ncx198 1
1089042 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:49 ncx199 1
1089043 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:49 ncx200 1
1089047 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:51 ncx204 1
1089052 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:53 ncx209 1
1089053 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:53 ncx210 1
1089054 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:54 ncx211 1
1089062 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 ncx219 1
1089063 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 ncx220 1
1089064 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 ncx221 1
1089065 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 ncx222 1
1089072 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:03 ncx229 1
1089073 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:03 ncx230 1
1089082 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:06 ncx239 1
1089083 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:06 ncx240 1

Posted by Vsilii Plotnikov
Answered On 15/04/2020 13:33
0

Hi! I tried to run a test task on each batch machine in the range from ncx110 to ncx240 (131 machines). The task
finished successfully on 85 machines. On the other 46 machines it hung with various statuses.
The qstat output is shown below. I cannot understand why the "t" and "qw" statuses appear. The detailed description
of the error for the 2 machines with the status "Eqw" is:

[vplotnik@ncx106 CheckRootVersion]$ qstat -j 1088956 | grep error
error reason 1: 04/15/2020 12:48:28 [3809:16249]: can't stat() "/nica/mpd19/plotnikov/CheckRootVersion" as stdout_path: Trans$

Why was the test task not executed on those 46 machines?

[vplotnik@ncx106 CheckRootVersion]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
--------------------------------------------------------------------------------
1088974 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:29 all.q@ncx131.jinr.ru 1
1088984 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:30 all.q@ncx141.jinr.ru 1
1088988 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:43 all.q@ncx145.jinr.ru 1
1089002 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:44 all.q@ncx159.jinr.ru 1
1089009 0.55500 CheckRootV vplotnik t 04/15/2020 12:53:45 all.q@ncx166.jinr.ru 1
1088953 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:19 1
1088956 0.55500 CheckRootV vplotnik Eqw 04/15/2020 12:48:19 1
1088961 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:21 1
1088962 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:21 1
1088963 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:21 1
1088969 0.55500 CheckRootV vplotnik Eqw 04/15/2020 12:48:23 1
1088971 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:23 1
1088972 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:24 1
1088977 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:25 1
1088987 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:27 1
1089016 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:38 1
1089020 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:39 1
1089021 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:39 1
1089022 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:39 1
1089026 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:40 1
1089030 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:41 1
1089031 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:42 1
1089032 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:42 1
1089033 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:42 1
1089034 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:43 1
1089035 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:46 1
1089036 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:47 1
1089037 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:47 1
1089038 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:47 1
1089039 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:48 1
1089040 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:48 1
1089041 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:48 1
1089042 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:49 1
1089043 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:49 1
1089047 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:51 1
1089052 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:53 1
1089053 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:53 1
1089054 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:54 1
1089062 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 1
1089063 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 1
1089064 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 1
1089065 0.55500 CheckRootV vplotnik qw 04/15/2020 12:48:56 1
1089072 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:03 1
1089073 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:03 1
1089082 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:06 1
1089083 0.55500 CheckRootV vplotnik qw 04/15/2020 12:49:06 1
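
A listing this long is easier to read as a per-state tally. The following is a minimal sketch, assuming the default qstat layout shown above (data rows start with a numeric job ID and carry the state in the fifth column); count_states is a hypothetical helper name, not part of Grid Engine.

```shell
# Tally jobs per state from default `qstat` output read on stdin.
# Assumes the layout shown above: data rows start with a numeric job ID
# and carry the state (t, qw, Eqw, ...) in the fifth column.
# Header and separator lines are skipped because their first field is not numeric.
count_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Usage (on a submit host):
#   qstat | count_states
```

Applied to the listing above, this would report 2 jobs in Eqw, 39 in qw and 5 in t, which adds up to the 46 machines mentioned.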

Posted by Vsilii Plotnikov
Answered On 15/04/2020 13:30
0

I checked: the mpd19 disk is mounted on ncx130. Everything should work now, please check.

Posted by Ivan Slepov
Answered On 14/04/2020 16:20
0

Hi! When will all batch machines have access to the disk /nica/mpd19?

Posted by Vsilii Plotnikov
Answered On 09/04/2020 00:02
0

Thank you for the reply! I got the error description:
[vplotnik@ncx106 vp_r7_v2]$ qstat -j 1085520 | grep error
error reason 18: 04/08/2020 11:56:28 [3809:285478]: can't stat() "/nica/mpd19/plotnikov/vp_r7_v2" as stdout_path: Transport endpoint is not connected KRB5CCNAME=none uid=3809 gid=363 363 24245
As I understand it, the cause of the error is that the disk /nica/mpd19 is not mounted on particular batch machines.
I selected a batch machine that had executed one task from the array successfully, in my case ncx130. I then started my new task on that batch machine, and again it ran successfully.
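
The workaround described here, running only on a machine known to have the disk, can be expressed at submission time by requesting a specific queue instance with qsub -q. The sketch below only prints the qsub commands so they can be reviewed first. It assumes the queue instances are named all.q@ncxNNN.jinr.ru, as they appear in the qstat listings elsewhere in this thread; pinned_qsub_commands and test.sh are hypothetical names.

```shell
# Print one `qsub` command per host in the range ncx<first>..ncx<last>,
# each pinned to that host's queue instance with -q.
# Assumption: queue instances are named all.q@ncxNNN.jinr.ru.
pinned_qsub_commands() {
    first=$1
    last=$2
    script=$3
    i=$first
    while [ "$i" -le "$last" ]; do
        printf 'qsub -q all.q@ncx%s.jinr.ru %s\n' "$i" "$script"
        i=$((i + 1))
    done
}

# Review the command for a single host:
#   pinned_qsub_commands 130 130 test.sh
# Submit once it looks right:
#   pinned_qsub_commands 130 130 test.sh | sh
```

The same helper with a wider range, for example 110 to 240, produces one pinned submission per batch machine, which is one way to probe which hosts can reach the disk.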

Posted by Vsilii Plotnikov
Answered On 08/04/2020 12:30
1
Posted by Konstantin Gertsenberger
Answered On 08/04/2020 12:04
1

Hi,
You can use the following command to display the error that occurred:
qstat -j [JOB_NUMBER] | grep error

To make it possible to restart your task, submit it with:
qsub -r yes myscript.sh
Then you can reschedule it:
qmod -r job_id
If it is an array job, the following command restarts only the task that needs it:
qmod -r job_id.task_id
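
For an array job it helps to see which tasks failed and why in one pass. The following is a minimal sketch that pulls the number and message out of each error line; error_reasons is a hypothetical helper, the line format is assumed to match the "error reason N: ..." lines quoted elsewhere in this thread, and N is assumed to be the array task index.

```shell
# Read `qstat -j JOB_NUMBER` output on stdin and print one line per
# "error reason" entry in the form: <number><TAB><message>.
# Assumption: lines look like "error reason   18:   <message>".
error_reasons() {
    awk '/^error reason/ {
        task = $3
        sub(/:$/, "", task)                      # "18:" -> "18"
        msg = $0
        sub(/^error reason +[0-9]+: */, "", msg) # drop the prefix, keep the message
        print task "\t" msg
    }'
}

# Usage (on a submit host):
#   qstat -j JOB_NUMBER | error_reasons
```

Once the underlying cause is fixed, Grid Engine's qmod -cj job_id is the usual way to clear an Eqw state so the job can be scheduled again; check man qmod on the cluster before relying on it.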

Posted by Konstantin Gertsenberger
Answered On 08/04/2020 12:00