Fixing a “Failed” worker
When a job fails for the first time, it is deferred to the end of the phase and another job is assigned to that worker.
What If the Job Fails a Second Time?
- If the run time of the job was < 10 min => the job is deferred to the end of the phase and another job is assigned to that worker.
- If the run time of the job was >= 10 min => the job status is shown as “Failed”.
What If the Job Fails a Third Time?
If the job fails a third time, the job status is shown as “Failed”.
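The retry rules above can be summarized as a small decision function. This is only an illustrative sketch: job_outcome is a hypothetical helper, not part of the AD utilities, and it assumes the run time is given in whole minutes.

```shell
#!/bin/sh
# Hypothetical helper (not an AD utility): job_outcome <failure_count> <run_time_minutes>
# Prints "deferred" or "failed" following the rules described in the text.
job_outcome() {
    failures=$1
    runtime_min=$2
    if [ "$failures" -le 1 ]; then
        # First failure: always deferred to the end of the phase.
        echo "deferred"
    elif [ "$failures" -eq 2 ] && [ "$runtime_min" -lt 10 ]; then
        # Second failure of a short job (< 10 min): deferred again.
        echo "deferred"
    else
        # Second failure of a long job, or any third failure.
        echo "failed"
    fi
}
```

For example, job_outcome 2 12 prints failed, while job_outcome 2 5 prints deferred.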
Review the Worker log information into
Example: adwork001.log is the log file for worker number 1.
After fixing the error, start AD Controller (if it is not already running) and choose option 2, “Tell worker to restart a failed job”.
When prompted, specify the worker that must be restarted.
If all the workers have failed, type all to restart all workers.
Restarting a Failed Patch Process
During a patch process (or an adadmin process), if a job fails and cannot be restarted, the patch must be restarted.
Here are the steps for doing this:
- Tell worker to quit (for all workers) => manually shuts down the workers
- Tell manager that a worker failed its job
- Tell manager that a worker acknowledges quit => the manager stops and AutoPatch restarts the patch
PLEASE NOTE: When the patch restarts, all the information in the database about this session must be accurate.
How to Determine Whether a Process Is Hanging
- We can check the worker log file to see whether new information is still being added to it.
- We can determine whether the worker process is consuming CPU by issuing the command below:
$ ps -eo pcpu,pid,user,args | grep workerid
- We can check whether any child processes are consuming CPU by issuing the following command:
$ ps -eo pcpu,pid,ppid,user,args | grep <Parent Process> | grep -v grep
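The log check and the child-process check above can be wrapped in two small helpers. This is a rough sketch, not part of the AD utilities: log_is_growing and children_of are hypothetical names, the log path is whatever file you pass in, and children_of assumes its input has the column order produced by ps -eo pcpu,pid,ppid,user,args. Matching on the PPID column avoids the false hits that a plain grep on the process number can produce.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of the AD utilities.

# log_is_growing <file> <seconds>
# Succeeds (exit status 0) if the file gained bytes during the interval.
log_is_growing() {
    file=$1
    interval=$2
    before=$(( $(wc -c < "$file") ))
    sleep "$interval"
    after=$(( $(wc -c < "$file") ))
    [ "$after" -gt "$before" ]
}

# children_of <parent_pid>
# Reads `ps -eo pcpu,pid,ppid,user,args` output on stdin and keeps the
# header line plus every row whose third field (PPID) equals <parent_pid>.
children_of() {
    awk -v ppid="$1" 'NR == 1 || $3 == ppid'
}
```

Possible usage, with the same placeholder as in the command above: ps -eo pcpu,pid,ppid,user,args | children_of <Parent Process>. Similarly, log_is_growing adwork001.log 60 succeeds only if the worker wrote to its log within a minute. A quiet log is not proof of a hang on its own, so it is worth combining with the CPU checks.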
Restarting a Hanging Worker Process
1. At the OS level, kill the processes associated with the hanging worker:
$ kill -9 ProcessNumber
2. Fix the problem
3. Restart the worker (or the job)
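As a variation on step 1, some administrators prefer to send SIGTERM first and fall back to SIGKILL (the kill -9 shown above) only if the process is still alive after a grace period. The sketch below illustrates that pattern. It is a swapped-in technique rather than the procedure described here: stop_process is a hypothetical helper, the 5-second default is an arbitrary assumption, and the text does not say whether worker processes react cleanly to SIGTERM.

```shell
#!/bin/sh
# Hypothetical helper: stop_process <pid> [grace_seconds]
# Sends SIGTERM, waits up to grace_seconds (default 5, an assumed value),
# then sends SIGKILL if the process still exists.
stop_process() {
    pid=$1
    grace=${2:-5}
    # If the signal cannot be delivered, the process is already gone
    # (or we lack permission); nothing more to do in this sketch.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        # kill -0 only tests whether the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    # Still present after the grace period: force termination.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

Call it once per process found in the hang checks, for example stop_process ProcessNumber 10.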
Restarting an AD Utility after a Node Crash
a. Start AD Controller
b. Choose “4. Tell manager that a worker failed its job”
c. Choose “2. Tell worker to restart a failed job”
d. Restart the AD utility that was running when the node crashed
Shutting down the Manager
1. Start AD Controller
2. Choose “3. Tell worker to quit”
3. Verify that no worker processes are running