<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="es">
	<id>http://wiki.cima.fcen.uba.ar/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pzaninelli</id>
	<title>Wikicima - Contribuciones del usuario [es]</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.cima.fcen.uba.ar/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pzaninelli"/>
	<link rel="alternate" type="text/html" href="http://wiki.cima.fcen.uba.ar/index.php/Especial:Contribuciones/Pzaninelli"/>
	<updated>2026-05-10T21:53:36Z</updated>
	<subtitle>Contribuciones del usuario</subtitle>
	<generator>MediaWiki 1.41.1</generator>
	<entry>
		<id>http://wiki.cima.fcen.uba.ar/index.php?title=WRF4L&amp;diff=1290</id>
		<title>WRF4L</title>
		<link rel="alternate" type="text/html" href="http://wiki.cima.fcen.uba.ar/index.php?title=WRF4L&amp;diff=1290"/>
		<updated>2019-03-13T18:12:51Z</updated>

		<summary type="html">&lt;p&gt;Pzaninelli: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;There is a far more powerful tool to manage WRF work-flows: [http://www.meteo.unican.es/software/wrf4g/ WRF4G]. &lt;br /&gt;
&lt;br /&gt;
Llu&amp;amp;iacute;s developed a simpler one, which is described here.&lt;br /&gt;
&lt;br /&gt;
WRF work-flow management is done via 5 scripts (these are the specifics for &amp;lt;code&amp;gt;hydra&amp;lt;/code&amp;gt;, CIMA&#039;s cluster):&lt;br /&gt;
* &amp;lt;code&amp;gt;EXPERIMENTparameters.txt&amp;lt;/code&amp;gt;: General ASCII file which configures the experiment and the chain of simulations (chunks). This is the only file to modify&lt;br /&gt;
* &amp;lt;code&amp;gt;run_experiment.pbs&amp;lt;/code&amp;gt;: PBS-queue job which prepares the environment of the experiment&lt;br /&gt;
* &amp;lt;code&amp;gt;run_WPS.pbs&amp;lt;/code&amp;gt;: PBS-queue job which launches the WPS section of the model: &amp;lt;code&amp;gt;ungrib.exe&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;metgrid.exe&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;real.exe&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;run_WRF.pbs&amp;lt;/code&amp;gt;: PBS-queue job which launches &amp;lt;code&amp;gt;wrf.exe&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;launch_pbs.bash&amp;lt;/code&amp;gt;: Necessary shell script to launch jobs which use more than one node in CIMA&#039;s &amp;lt;code&amp;gt;hydra&amp;lt;/code&amp;gt; cluster&lt;br /&gt;
* There is a folder called &amp;lt;code&amp;gt;components&amp;lt;/code&amp;gt; with shell and python scripts necessary for the work-flow management&lt;br /&gt;
&lt;br /&gt;
An experiment covering a period of simulation is divided into &#039;&#039;&#039;chunks&#039;&#039;&#039;, small pieces of time which are manageable by the model. The work-flow follows these steps using &amp;lt;code&amp;gt;run_experiment.pbs&amp;lt;/code&amp;gt;:&lt;br /&gt;
# Copies and links all the required files for a given &#039;&#039;&#039;chunk&#039;&#039;&#039; of the whole period of simulation following the content of &amp;lt;code&amp;gt;EXPERIMENTparameters.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
# Launches &amp;lt;code&amp;gt;run_WPS.pbs&amp;lt;/code&amp;gt;, which produces the necessary files for the period of the given &#039;&#039;&#039;chunk&#039;&#039;&#039;&lt;br /&gt;
# Launches &amp;lt;code&amp;gt;run_WRF.pbs&amp;lt;/code&amp;gt;, which simulates the period of the given &#039;&#039;&#039;chunk&#039;&#039;&#039; (it waits until the end of &amp;lt;code&amp;gt;run_WPS.pbs&amp;lt;/code&amp;gt;)&lt;br /&gt;
# Launches the next &amp;lt;code&amp;gt;run_experiment.pbs&amp;lt;/code&amp;gt; (which waits until the end of &amp;lt;code&amp;gt;run_WRF.pbs&amp;lt;/code&amp;gt;)&lt;br /&gt;
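The chaining above relies on PBS job dependencies (&amp;lt;code&amp;gt;qsub -W depend=afterany:[jobid]&amp;lt;/code&amp;gt;). A minimal sketch of how such a dependency argument can be built; the function name and job id are illustrative, the actual logic lives in the &amp;lt;code&amp;gt;*.pbs&amp;lt;/code&amp;gt; scripts and may differ:&lt;br /&gt;

```shell
#!/bin/bash
# Minimal sketch: build the '-W depend=...' argument used to chain PBS jobs.
# In the real work-flow run_WRF.pbs waits for run_WPS.pbs, and the next
# run_experiment.pbs waits for run_WRF.pbs; the names here are illustrative.
build_depend_arg() {
  # $1: job id the new job has to wait for (e.g. '397.hydra')
  echo "-W depend=afterany:$1"
}

wps_id='397.hydra'                      # id as returned by 'qsub run_WPS.pbs'
echo "qsub $(build_depend_arg "$wps_id") run_WRF.pbs"
```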
&lt;br /&gt;
All the scripts are located in &amp;lt;code&amp;gt;hydra&amp;lt;/code&amp;gt; at:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/share/tools/work-flows/WRF4L/hydra&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== How to simulate ==&lt;br /&gt;
# Create a new folder from which to launch the experiment [ExperimentName] (e.g. somewhere in $HOME)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir [ExperimentName]&lt;br /&gt;
$ cd [ExperimentName]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Copy the WRF4L files to this folder&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ cp /share/tools/work-flows/WRF4L/hydra/EXPERIMENTparameters.txt ./&lt;br /&gt;
$ cp /share/tools/work-flows/WRF4L/hydra/run_experiment.pbs ./&lt;br /&gt;
$ cp /share/tools/work-flows/WRF4L/hydra/run_WPS.pbs ./&lt;br /&gt;
$ cp /share/tools/work-flows/WRF4L/hydra/run_WRF.pbs ./&lt;br /&gt;
$ cp /share/WRF/launch_pbs.bash ./&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Edit the configuration/set-up of the experiment&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ vim EXPERIMENTparameters.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Launch the simulation of the experiment&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qsub run_experiment.pbs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
While it is running one would see (the WPS job &amp;lt;code&amp;gt;wps_[SimName]&amp;lt;/code&amp;gt; running `R&#039;, and &amp;lt;code&amp;gt;wrf_[SimName]&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;exp_[SimName]&amp;lt;/code&amp;gt; on hold `H&#039;):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat -u $USER&lt;br /&gt;
&lt;br /&gt;
hydra: &lt;br /&gt;
                                                                         Req&#039;d  Req&#039;d   Elap&lt;br /&gt;
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time&lt;br /&gt;
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----&lt;br /&gt;
397.hydra            lluis.fi larga    wps_              27567     1  16   20gb 168:0 R   -- &lt;br /&gt;
398.hydra            lluis.fi larga    wrf_                --      1   1   20gb 168:0 H   -- &lt;br /&gt;
399.hydra            lluis.fi larga    exp_                --      1   1    2gb 168:0 H   -- &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the simulation crashes, after fixing the issue go to &amp;lt;code&amp;gt;[runHOME]/[ExpName]/[SimName]&amp;lt;/code&amp;gt; and re-launch the experiment (after the first run, &amp;lt;code&amp;gt;scratch&amp;lt;/code&amp;gt; is automatically switched to `false&#039;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qsub run_experiment.pbs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Checking the experiment ==&lt;br /&gt;
Once the experiment runs, one needs to look at (following the names of the variables from &amp;lt;code&amp;gt;EXPERIMENTparameters.txt&amp;lt;/code&amp;gt;):&lt;br /&gt;
* &amp;lt;code&amp;gt;[runHOME]/[ExpName]/[SimName]&amp;lt;/code&amp;gt;: will contain the copies of the templates &amp;lt;code&amp;gt;namelist.wps&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;namelist.input&amp;lt;/code&amp;gt; and a file &amp;lt;code&amp;gt;chunk_attemps.inf&amp;lt;/code&amp;gt; which counts how many times a &#039;&#039;&#039;chunk&#039;&#039;&#039; has been attempted (if it reaches 4 attempts, &amp;lt;code&amp;gt;WRF4L&amp;lt;/code&amp;gt; stops)&lt;br /&gt;
* &amp;lt;code&amp;gt;[runHOME]/[ExpName]/[SimName]/run&amp;lt;/code&amp;gt;: actual folder where the computing nodes run the model. In a folder called &amp;lt;code&amp;gt;wrfout&amp;lt;/code&amp;gt; there is a folder for each &#039;&#039;&#039;chunk&#039;&#039;&#039; with the standard output of the model&lt;br /&gt;
* &amp;lt;code&amp;gt;[runHOME]/[ExpName]/[SimName]/run/wrfout/[YYYYi][MMi][DDi][HHi][MIi][SSi]-[YYYYf][MMf][DDf][HHf][MIf][SSf]&amp;lt;/code&amp;gt;: folder with the standard output and all the required files to run a given &#039;&#039;&#039;chunk&#039;&#039;&#039;. The content of all this folder is compressed and kept in  &amp;lt;code&amp;gt;[storageHOME]/[ExpName]/[SimName]/config_[YYYYi][MMi][DDi][HHi][MIi][SSi]-[YYYYf][MMf][DDf][HHf][MIf][SSf].tar.gz&amp;lt;/code&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;[storageHOME]/[ExpName]/[SimName]&amp;lt;/code&amp;gt; (in [storageHOST]): output of the already ran &#039;&#039;&#039;chunks&#039;&#039;&#039; as &amp;lt;code&amp;gt;[YYYYi][MMi][DDi][HHi][MIi][SSi]-[YYYYf][MMf][DDf][HHf][MIf][SSf]&amp;lt;/code&amp;gt; for a chunk from &amp;lt;code&amp;gt;[YYYYi]/[MMi]/[DDi] [HHi]:[MIi]:[SSi]&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;[YYYYf]/[MMf]/[DDf] [HHf]:[MIf]:[SSf]&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== When something went wrong ===&lt;br /&gt;
If there has been any problem, check the last chunk (in wrfout/[PERIODchunk]) to try to understand what happened and where the problem comes from:&lt;br /&gt;
* &amp;lt;code&amp;gt;rsl.[error/out].[nnnn]&amp;lt;/code&amp;gt;: These files contain the standard output while running the model, one file per process. If the problem is related to the model execution and the code has error handling for it, a meaningful message should appear. Look first at the largest files:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ls -lrS rsl.error.*&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* &amp;lt;code&amp;gt;run_wrf.log&amp;lt;/code&amp;gt;: This file contains the standard output of the model. Search for `segmentation faults&#039; and similar runtime errors, which appear in a form like (it might differ):&lt;br /&gt;
&amp;lt;pre&amp;gt;forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write&lt;br /&gt;
Image              PC                Routine            Line        Source&lt;br /&gt;
orchidee_ol        00000000032B736A  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        00000000032B5EE5  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        0000000003265966  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        0000000003226EB5  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        0000000003226671  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        000000000324BC3C  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        0000000003249C94  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        00000000004184DC  Unknown               Unknown  Unknown&lt;br /&gt;
libc.so.6          000000319021ECDD  Unknown               Unknown  Unknown&lt;br /&gt;
orchidee_ol        00000000004183D9  Unknown               Unknown  Unknown&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* On &amp;lt;code&amp;gt;[runHOME]/[ExpName]/[SimName]&amp;lt;/code&amp;gt;, check the output of the PBS jobs, which are called:&lt;br /&gt;
** &amp;lt;code&amp;gt;exp-[SimName].o[nnnn]&amp;lt;/code&amp;gt;: output of the &amp;lt;code&amp;gt;run_experiment.pbs&amp;lt;/code&amp;gt;&lt;br /&gt;
** &amp;lt;code&amp;gt;wps-[SimName].o[nnnn]&amp;lt;/code&amp;gt;: output of the &amp;lt;code&amp;gt;run_WPS.pbs&amp;lt;/code&amp;gt;&lt;br /&gt;
** &amp;lt;code&amp;gt;wrf-[SimName].o[nnnn]&amp;lt;/code&amp;gt;: output of the &amp;lt;code&amp;gt;run_WRF.pbs&amp;lt;/code&amp;gt;&lt;br /&gt;
* Check &amp;lt;code&amp;gt;[runHOME]/[ExpName]/[SimName]/run/namelist.output&amp;lt;/code&amp;gt; which holds all the parameters (even the default ones) used in the simulation&lt;br /&gt;
&lt;br /&gt;
== EXPERIMENTparameters.txt ==&lt;br /&gt;
This ASCII file configures the whole simulation. It assumes:&lt;br /&gt;
* Required files, forcings, storage and the compiled version of the code might be on different machines.&lt;br /&gt;
* There is a folder with a given template version of the &amp;lt;code&amp;gt;namelist.input&amp;lt;/code&amp;gt; which will be used and changed according to the requirements of the experiment&lt;br /&gt;
&lt;br /&gt;
Name of the experiment&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Experiment name&lt;br /&gt;
ExpName = WRFsensSFC&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Name of the simulation. It is understood that a given experiment may have the model configured with different set-ups (each identified by a different simulation name)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Simulation name&lt;br /&gt;
SimName = control&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Which &amp;lt;code&amp;gt;python&amp;lt;/code&amp;gt; 2.x binary to use&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# python binary&lt;br /&gt;
pyBIN=/home/lluis.fita/bin/anaconda2/bin/python2.7&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Whether this simulation should be run from the beginning. If set to `true&#039;, all the pre-existing content of the folder [ExpName]/[SimName] in the running and storage spaces will be removed. &#039;&#039;&#039;Be careful&#039;&#039;&#039;. If `false&#039;, the simulation will continue from the last successfully run &#039;&#039;&#039;chunk&#039;&#039;&#039; (checking the restart files).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Start from the beginning (keeping folder structure)&lt;br /&gt;
scratch = false&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Period of the simulation (in this example from 1979 Jan 1st to 2015 Jan 1st)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Experiment starting date&lt;br /&gt;
exp_start_date = 19790101000000&lt;br /&gt;
# Experiment ending date&lt;br /&gt;
exp_end_date = 20150101000000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Length of the chunks (do not make chunks longer than 1 month!)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Chunk Length [N]@[unit]&lt;br /&gt;
#  [unit]=[year, month, week, day, hour, minute, second]&lt;br /&gt;
chunk_length = 1@month&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
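With GNU &amp;lt;code&amp;gt;date&amp;lt;/code&amp;gt;, the end of a chunk can be derived from its start date and the [N]@[unit] length; a minimal sketch (the actual computation is done by the scripts in &amp;lt;code&amp;gt;components&amp;lt;/code&amp;gt; and may differ):&lt;br /&gt;

```shell
#!/bin/bash
# Minimal sketch: compute the end date of a chunk from its start date
# (YYYYMMDDHHMISS) and a chunk_length of the form [N]@[unit].
# GNU date is assumed.
chunk_end() {
  local start=$1 length=$2
  local n=${length%@*} unit=${length#*@}
  local iso="${start:0:4}-${start:4:2}-${start:6:2} ${start:8:2}:${start:10:2}:${start:12:2}"
  date -ud "$iso $n $unit" +%Y%m%d%H%M%S
}

chunk_end 19790101000000 1@month   # prints 19790201000000
```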
&lt;br /&gt;
Selection of the machines, and the user on each machine, where the different required files are located and where the output should be placed. &lt;br /&gt;
* &#039;&#039;&#039;NOTE:&#039;&#039;&#039; this will only work if one sets up the &amp;lt;code&amp;gt;.ssh&amp;lt;/code&amp;gt; public/private keys for each involved USER/HOST. &lt;br /&gt;
* &#039;&#039;&#039;NOTE 2:&#039;&#039;&#039; All the forcings, compiled code, ... are already on &amp;lt;code&amp;gt;hydra&amp;lt;/code&amp;gt; in the common space called &amp;lt;code&amp;gt;share&amp;lt;/code&amp;gt;&lt;br /&gt;
* &#039;&#039;&#039;NOTE 3:&#039;&#039;&#039; From the computing nodes one cannot access the &amp;lt;code&amp;gt;/share&amp;lt;/code&amp;gt; folder or any of CIMA&#039;s storage machines (skogul, freyja, ...). For that reason this system of &amp;lt;code&amp;gt;[USER]@[HOST]&amp;lt;/code&amp;gt; accounts is needed. The &amp;lt;code&amp;gt;*.pbs&amp;lt;/code&amp;gt; scripts use a series of wrappers of the standard commands &amp;lt;code&amp;gt;cp, ln, ls, mv, ...&amp;lt;/code&amp;gt; which manage them `from&#039; and `to&#039; different pairs of &amp;lt;code&amp;gt;[USER]@[HOST]&amp;lt;/code&amp;gt;. &#039;&#039;&#039;NOTE:&#039;&#039;&#039; This will only work if the public/private ssh key pairs have been set up (see more details at [[llaves_ssh]])&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hosts&lt;br /&gt;
#   list of different hosts and specific user&lt;br /&gt;
#     [USER]@[HOST]&lt;br /&gt;
#   NOTE: this will only work if public keys have been set-up&lt;br /&gt;
##&lt;br /&gt;
# Host with compiled code, namelist templates&lt;br /&gt;
codeHOST=lluis.fita@hydra&lt;br /&gt;
# forcing Host with forcings (atmospherics and morphologicals)&lt;br /&gt;
forcingHOST=lluis.fita@hydra&lt;br /&gt;
# output Host with storage of output (including restarts)&lt;br /&gt;
outHOST=lluis.fita@hydra&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
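The wrappers mentioned above can be pictured as commands that fall back to a plain local copy when the source host is the local machine and use &amp;lt;code&amp;gt;scp&amp;lt;/code&amp;gt; otherwise. A minimal, hypothetical sketch (the function name &amp;lt;code&amp;gt;wcp&amp;lt;/code&amp;gt; is illustrative; the real wrappers live in the &amp;lt;code&amp;gt;components&amp;lt;/code&amp;gt; folder and may behave differently):&lt;br /&gt;

```shell
#!/bin/bash
# Hypothetical sketch of a 'cp' wrapper aware of [USER]@[HOST]:path sources.
wcp() {
  local src=$1 dst=$2
  case $src in
    *@*:*)                             # source given as [USER]@[HOST]:path
      local host=${src%%:*}
      host=${host#*@}
      if [ "$host" = "$(hostname)" ]; then
        cp "${src#*:}" "$dst"          # host is the local machine: plain cp
      else
        scp "$src" "$dst"              # truly remote: copy over ssh
      fi ;;
    *)
      cp "$src" "$dst" ;;              # plain local path
  esac
}
```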
&lt;br /&gt;
Templates of the configuration of WRF: the &amp;lt;code&amp;gt;namelist.wps&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;namelist.input&amp;lt;/code&amp;gt; files. &#039;&#039;&#039;NOTE:&#039;&#039;&#039; they will be changed according to the content of &amp;lt;code&amp;gt;EXPERIMENTparameters.txt&amp;lt;/code&amp;gt;: period of the &#039;&#039;&#039;chunk&#039;&#039;&#039;, atmospheric forcing, differences of the set-up, ... (located on the &amp;lt;code&amp;gt;[codeHOST]&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Folder with the `namelist.wps&#039;, `namelist.input&#039; and `geo_em.d[nn].nc&#039; of the experiment&lt;br /&gt;
domainHOME = /home/lluis.fita/salidas/estudios/dominmios/SA50k&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Folder where the WRF model will run on the computing nodes (on top of it two more folders, [ExpName]/[SimName], will be created). WRF will run in the folder [ExpName]/[SimName]/run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Running folder&lt;br /&gt;
runHOME = /home/lluis.fita/estudios/WRFsensSFC/sims&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Folder with the compiled version of the WPS (located at &amp;lt;code&amp;gt;[codeHOST]&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Folder with the compiled source of WPS&lt;br /&gt;
wpsHOME = /share/WRF/WRFV3.9.1/ifort/dmpar/WPS&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Folder with the compiled version of the WRF (located at &amp;lt;code&amp;gt;[codeHOST]&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Folder with the compiled source of WRF&lt;br /&gt;
wrfHOME = /share/WRF/WRFV3.9.1/ifort/dmpar/WRFV3&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Folder to store all the output of the model (history files, restarts and a compressed file with the configuration and the standard output of the given run). The content of the folder will be organized by chunks (located at &amp;lt;code&amp;gt;[storageHOST]&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Storage folder of the output&lt;br /&gt;
storageHOME = /home/lluis.fita/salidas/estudios/WRFsensSFC/sims/output&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Whether modules should be loaded (not used on &amp;lt;code&amp;gt;hydra&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Modules to load (&#039;None&#039; for none)&lt;br /&gt;
modulesLOAD = None&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Names of the files used to check that the &#039;&#039;&#039;chunk&#039;&#039;&#039; has run properly&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Model reference output names (to be used as checking file names)&lt;br /&gt;
nameLISTfile = namelist.input # namelist&lt;br /&gt;
nameRSTfile = wrfrst_d01_ # restart file&lt;br /&gt;
nameOUTfile = wrfout_d01_ # output file&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Extensions of the files with the configuration of WRF (to be retrieved from &amp;lt;code&amp;gt;codeHOST&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;domainHOME&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Extensions of the files with the configuration of the model&lt;br /&gt;
configEXTS = wps:input&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To continue from a previous &#039;&#039;&#039;chunk&#039;&#039;&#039; one needs to use the `restart&#039; files. They need to be renamed, because otherwise they would be over-written. Here one specifies the original name of the file &amp;lt;code&amp;gt;[origFile]&amp;lt;/code&amp;gt; and the name to be used to avoid the over-writing &amp;lt;code&amp;gt;[destFile]&amp;lt;/code&amp;gt;. A rather complex bash script handles this and can even deal with the change of dates according to the period of the &#039;&#039;&#039;chunk&#039;&#039;&#039; (&#039;:&#039;-separated list of &amp;lt;code&amp;gt;[origFile]@[destFile]&amp;lt;/code&amp;gt;). The files will be located at the &amp;lt;code&amp;gt;[storageHOST]&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# restart file names&lt;br /&gt;
# &#039;:&#039; list of [tmplrstfilen|[NNNNN1]?[val1]#[...[NNNNNn]?[valn]]@[tmpllinkname]|[NNNNN1]?[val1]#[...[NNNNNn]?[valn]]&lt;br /&gt;
#    [tmplrstfilen]: template name of the restart file (if necessary with [NNNNN] variables to be substituted)&lt;br /&gt;
#      [NNNNN]: section of the file name to be automatically substituted&lt;br /&gt;
#        `[YYYY]&#039;: year in 4 digits&lt;br /&gt;
#        `[YY]&#039;: year in 2 digits&lt;br /&gt;
#        `[MM]&#039;: month in 2 digits&lt;br /&gt;
#        `[DD]&#039;: day in 2 digits&lt;br /&gt;
#        `[HH]&#039;: hour in 2 digits&lt;br /&gt;
#        `[SS]&#039;: second in 2 digits&lt;br /&gt;
#        `[JJJ]&#039;: julian day in 3 digits&lt;br /&gt;
#      [val]: value to use (which is systematically defined in `run_OR.pbs&#039;)&lt;br /&gt;
#        `%Y%&#039;: year in 4 digits&lt;br /&gt;
#        `%y%&#039;: year in 2 digits&lt;br /&gt;
#        `%m%&#039;: month in 2 digits&lt;br /&gt;
#        `%d%&#039;: day in 2 digits&lt;br /&gt;
#        `%h%&#039;: hour in 2 digits&lt;br /&gt;
#        `%s%&#039;: second in 2 digits&lt;br /&gt;
#        `%j%&#039;: julian day in 3 digits&lt;br /&gt;
#    [tmpllinkname]: template name of the link of the restart file (if necessary with [NNNNN] variables to be substituted)&lt;br /&gt;
rstFILES=wrfrst_d01_[YYYY]-[MM]-[DD]_[HH]:[MI]:[SS]|YYYY?%Y#MM?%m#DD?%d#HH?%H#MI?%M#SS?%S@wrfrst_d01_[YYYY]-[MM]-[DD]_[HH]:[MI]:[SS]|YYYY?%Y#MM?%m#DD?%d#HH?%H#MI?%M#SS?%S&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
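The [NNNNN]?[val] substitution above amounts to replacing each bracketed token in the template with the corresponding field of the chunk date. A minimal sketch of that expansion (assumed behaviour; the function name is illustrative and the real logic in the work-flow scripts may differ):&lt;br /&gt;

```shell
#!/bin/bash
# Minimal sketch: expand a restart-file template such as
#   wrfrst_d01_[YYYY]-[MM]-[DD]_[HH]:[MI]:[SS]
# for a chunk date given as YYYYMMDDHHMISS.
expand_rst_name() {
  local tmpl=$1 d=$2
  tmpl=${tmpl//\[YYYY\]/${d:0:4}}
  tmpl=${tmpl//\[MM\]/${d:4:2}}
  tmpl=${tmpl//\[DD\]/${d:6:2}}
  tmpl=${tmpl//\[HH\]/${d:8:2}}
  tmpl=${tmpl//\[MI\]/${d:10:2}}
  tmpl=${tmpl//\[SS\]/${d:12:2}}
  echo "$tmpl"
}

expand_rst_name 'wrfrst_d01_[YYYY]-[MM]-[DD]_[HH]:[MI]:[SS]' 19790201000000
# prints wrfrst_d01_1979-02-01_00:00:00
```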
&lt;br /&gt;
Folder with the input data (located at &amp;lt;code&amp;gt;[forcingHOST]&amp;lt;/code&amp;gt;). &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Folder with the input morphological forcing data&lt;br /&gt;
indataHOME = /share/DATA/re-analysis/ERA-Interim&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Format of the input data and name of files&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Data format (grib, nc)&lt;br /&gt;
indataFMT= grib&lt;br /&gt;
# For `grib&#039; format&lt;br /&gt;
#   Head and tail of indata files names.&lt;br /&gt;
#     Assuming ${indataFheader}*[YYYY][MM]*${indataFtail}.[grib/nc]&lt;br /&gt;
indataFheader=ERAI_&lt;br /&gt;
indataFtail=&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In case of netCDF input data, there is a bash script which transforms the data to grib, to be used later by &amp;lt;code&amp;gt;ungrib&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Variable table to use in &amp;lt;code&amp;gt;ungrib&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#   Type of Vtable for ungrib as Vtable.[VtableType]&lt;br /&gt;
VtableType=ERA-interim.pl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Folder with the atmospheric forcing data (located at &amp;lt;code&amp;gt;[forcingHOST]&amp;lt;/code&amp;gt;). &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# For `nc&#039; format&lt;br /&gt;
#   Folder which contents the atmospheric data to generate the initial state&lt;br /&gt;
iniatmosHOME = ./&lt;br /&gt;
#   Type of atmospheric data to generate the initial state&lt;br /&gt;
#     `ECMWFstd&#039;: ECMWF &#039;standard&#039; way ERAI_[pl/sfc][YYYY][MM]_[var1]-[var2].grib&lt;br /&gt;
#     `ERAI-IPSL&#039;: ECMWF ERA-INTERIM stored in the common IPSL way (.../4xdaily/[AN\_PL/AN\_SF])&lt;br /&gt;
iniatmosTYPE = &#039;ECMWFstd&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here one can change values in the template &amp;lt;code&amp;gt;namelist.input&amp;lt;/code&amp;gt;. The provided parameters will be set to the given new values. If a given parameter is not in the template of the &amp;lt;code&amp;gt;namelist.input&amp;lt;/code&amp;gt;, it will be automatically added.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
## Namelist changes&lt;br /&gt;
nlparameters = ra_sw_physics;4,ra_lw_physics;4,time_step;180&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
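Applying &amp;lt;code&amp;gt;nlparameters&amp;lt;/code&amp;gt; can be pictured as a loop over the `;&#039; key/value pairs that edits the namelist in place. A minimal sketch (assumed behaviour; the function name is illustrative):&lt;br /&gt;

```shell
#!/bin/bash
# Minimal sketch: apply nlparameters ('key;value' pairs separated by ',')
# to a namelist file, replacing existing entries and appending new ones.
apply_nlparameters() {
  local params=$1 nml=$2
  local IFS=','
  local p key val
  for p in $params; do
    key=${p%;*}
    val=${p#*;}
    if grep -q "^ *$key *=" "$nml"; then
      sed -i "s|^ *$key *=.*| $key = $val,|" "$nml"   # replace existing entry
    else
      echo " $key = $val," >> "$nml"                  # append missing entry
    fi
  done
}
```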
&lt;br /&gt;
Name of WRF&#039;s executable (located in the &amp;lt;code&amp;gt;[wrfHOME]&amp;lt;/code&amp;gt; folder on &amp;lt;code&amp;gt;[codeHOST]&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Name of the executable&lt;br /&gt;
nameEXEC=wrf.exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;:&#039; separated list of netCDF file names from WRF&#039;s output which do not need to be kept&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# netCDF Files which will not be kept anywhere&lt;br /&gt;
NokeptfileNAMES=&#039;&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;:&#039; separated list of headers of netCDF file names from WRF&#039;s output which need to be kept&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Headers of netCDF files that need to be kept&lt;br /&gt;
HkeptfileNAMES=wrfout_d:wrfxtrm_d:wrfpress_d&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;:&#039; separated list of headers of restarts netCDF file names from WRF&#039;s output which need to be kept&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Headers of netCDF restart files that need to be kept&lt;br /&gt;
HrstfileNAMES=wrfrst_d&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Parallel configuration of the run.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# WRF parallel run configuration&lt;br /&gt;
## Number of nodes&lt;br /&gt;
Nnodes = 1&lt;br /&gt;
## Number of mpi procs&lt;br /&gt;
Nmpiprocs = 16&lt;br /&gt;
## Number of shared memory threads (&#039;None&#039; for no openMP threads)&lt;br /&gt;
Nopenthreads = None&lt;br /&gt;
## Memory size of shared memory threads&lt;br /&gt;
SIZEopenthreads = 200M&lt;br /&gt;
## Memory for PBS jobs&lt;br /&gt;
MEMjobs = 30gb&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Generic definitions&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
## Generic&lt;br /&gt;
errormsg=ERROR -- error -- ERROR -- error&lt;br /&gt;
warnmsg=WARNING -- warning -- WARNING -- warning&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Pzaninelli</name></author>
	</entry>
	<entry>
		<id>http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1144</id>
		<title>procesos zombies</title>
		<link rel="alternate" type="text/html" href="http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1144"/>
		<updated>2018-12-07T18:19:44Z</updated>

		<summary type="html">&lt;p&gt;Pzaninelli: /* script general */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A zombie process is understood as a process that was launched through the PBS job-management queue and occupies space on the cluster&#039;s system, but is no longer actually running.&lt;br /&gt;
&lt;br /&gt;
This situation usually happens when the cluster shuts down in an uncontrolled way, or when the space in the cluster&#039;s &amp;lt;code&amp;gt;&#039;home&#039;&amp;lt;/code&amp;gt; fills up.&lt;br /&gt;
&lt;br /&gt;
It is important to stop these processes, since they put an extra load on the nodes on which they are running.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE:&#039;&#039;&#039; this is the only case in which logging into the cluster&#039;s computing nodes is allowed.&lt;br /&gt;
&lt;br /&gt;
== general script ==&lt;br /&gt;
There is a bash script which shows the &amp;lt;CODE&amp;gt;`wrf.exe&#039;&amp;lt;/CODE&amp;gt; processes of a user that are running on all the nodes of the cluster&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
/share/tools/work-flows/components/bats/check_hydra.bash [user]&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
Where &amp;lt;CODE&amp;gt;[user]&amp;lt;/CODE&amp;gt; is the numeric code of the user (e.g.: lluis.fita --&amp;gt; 1624)&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! name&lt;br /&gt;
! number &lt;br /&gt;
|-&lt;br /&gt;
| lluis.fita&lt;br /&gt;
| 1624&lt;br /&gt;
|-&lt;br /&gt;
| pablo.zaninelli&lt;br /&gt;
| 1561&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Case 1: Steps to follow =&lt;br /&gt;
# Identify the user&#039;s processes in the system queue (e.g. lluis.fita)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1699:08: R larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      649:09:3 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      645:40:5 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 603:42:2 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 600:18:4 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 Q larga  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Process &amp;lt;code&amp;gt;4426&amp;lt;/code&amp;gt; is a WRF simulation. &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat -f 4426&lt;br /&gt;
Job Id: 4426.hydra&lt;br /&gt;
    Job_Name = wrf_control&lt;br /&gt;
    Job_Owner = lluis.fita@node48&lt;br /&gt;
    resources_used.cput = 1700:00:37&lt;br /&gt;
    resources_used.mem = 6264020kb&lt;br /&gt;
    resources_used.vmem = 15322976kb&lt;br /&gt;
    resources_used.walltime = 72:40:26&lt;br /&gt;
    job_state = R&lt;br /&gt;
    queue = larga&lt;br /&gt;
    server = hydra&lt;br /&gt;
    Checkpoint = u&lt;br /&gt;
    ctime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    depend = beforeany:4427.hydra@hydra&lt;br /&gt;
    Error_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/contr&lt;br /&gt;
	ol/wrf_control.e4426&lt;br /&gt;
    exec_host = node48/23+node48/22+node48/21+node48/20+node48/19+node48/18+no&lt;br /&gt;
	de48/17+node48/16+node48/15+node48/14+node48/13+node48/12+node48/11+no&lt;br /&gt;
	de48/10+node48/9+node48/8+node48/7+node48/6+node48/5+node48/4+node48/3&lt;br /&gt;
	+node48/2+node48/1+node48/0+node51/23+node51/22+node51/21+node51/20+no&lt;br /&gt;
	de51/19+node51/18+node51/17+node51/16+node51/15+node51/14+node51/13+no&lt;br /&gt;
	de51/12+node51/11+node51/10+node51/9+node51/8+node51/7+node51/6+node51&lt;br /&gt;
	/5+node51/4+node51/3+node51/2+node51/1+node51/0&lt;br /&gt;
    exec_port = 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+&lt;br /&gt;
	15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500&lt;br /&gt;
	3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003&lt;br /&gt;
    Hold_Types = n&lt;br /&gt;
    Join_Path = oe&lt;br /&gt;
    Keep_Files = n&lt;br /&gt;
    Mail_Points = ae&lt;br /&gt;
    Mail_Users = lluis.fita@cima.fcen.uba.ar&lt;br /&gt;
    mtime = Sat Apr  7 10:48:06 2018&lt;br /&gt;
    Output_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/cont&lt;br /&gt;
	rol/wrf_control.o4426&lt;br /&gt;
    Priority = 0&lt;br /&gt;
    qtime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    Rerunable = True&lt;br /&gt;
    Resource_List.mem = 30gb&lt;br /&gt;
    Resource_List.nodect = 2&lt;br /&gt;
    Resource_List.nodes = 2:ppn=24&lt;br /&gt;
    Resource_List.vmem = 30gb&lt;br /&gt;
    Resource_List.walltime = 168:00:00&lt;br /&gt;
    session_id = 19561&lt;br /&gt;
    Variable_List = PBS_O_QUEUE=larga,PBS_O_HOST=node48,&lt;br /&gt;
	PBS_O_HOME=/home/lluis.fita,PBS_O_LANG=en_US.UTF-8,&lt;br /&gt;
	PBS_O_LOGNAME=lluis.fita,&lt;br /&gt;
	PBS_O_PATH=/usr/local/bin:/opt/intel/composerxe-2011.3.174/bin/intel6&lt;br /&gt;
	4:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:&lt;br /&gt;
	/usr/local/maui/bin:/usr/local/maui/sbin:/usr/local/bin:/opt/intel/com&lt;br /&gt;
	poserxe/bin:/usr/local/maui/bin:/usr/local/maui/sbin:/home/lluis.fita/&lt;br /&gt;
	bin:/opt/intel/composerxe-2011.3.174/mpirt/bin/intel64:/usr/local/maui&lt;br /&gt;
	/bin:/usr/local/maui/sbin,PBS_O_MAIL=/var/spool/mail/lluis.fita,&lt;br /&gt;
	PBS_O_SHELL=/bin/bash,PBS_SERVER=hydra,&lt;br /&gt;
	PBS_O_WORKDIR=/home/lluis.fita/estudios/WRFsensSFC/simulations/contro&lt;br /&gt;
	l&lt;br /&gt;
    etime = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    submit_args = -W depend=afterany:4425 /home/lluis.fita/estudios/WRFsensSFC&lt;br /&gt;
	/simulations/control/run_WRF.pbs&lt;br /&gt;
    start_time = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    Walltime.Remaining = 343110&lt;br /&gt;
    start_count = 1&lt;br /&gt;
    fault_tolerant = False&lt;br /&gt;
    submit_host = node48&lt;br /&gt;
    init_work_dir = /home/lluis.fita/estudios/WRFsensSFC/simulations/control&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# The variable &amp;lt;code&amp;gt;PBS_O_WORKDIR&amp;lt;/code&amp;gt; contains the job's execution path. One has to make sure the process (here a WRF simulation) is actually running, i.e. that the rsl.out/error.[nnnn] files are being updated. In this case the WRF run lives in $PBS_O_WORKDIR/run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ls -l /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
-rw-r--r-- 1 lluis.fita cima 1204224 Apr  8 19:07 /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
$ date&lt;br /&gt;
Tue Apr 10 11:32:35 ART 2018&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
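The freshness check above can be wrapped in a small helper. This is a minimal sketch: the function name &amp;lt;code&amp;gt;check_rsl&amp;lt;/code&amp;gt; and the 10-minute threshold are assumptions for illustration, not part of the original procedure.&lt;br /&gt;

```shell
# check_rsl DIR MINUTES: print ACTIVE if any rsl.* file under DIR was
# modified within the last MINUTES minutes, STALLED otherwise.
check_rsl() {
    if [ -n "$(find "$1" -name 'rsl.*' -mmin "-$2" 2>/dev/null)" ]; then
        echo ACTIVE
    else
        echo STALLED
    fi
}

# Example (hypothetical path):
# check_rsl "$PBS_O_WORKDIR/run" 10
```

A STALLED result for a job that qstat still reports as R is the zombie signature described above.&lt;br /&gt;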
# Clearly WRF is no longer running properly. One has to log into the nodes where it runs and stop the processes by hand, since the PBS queue system no longer controls them. WRF runs on nodes (values of &amp;lt;code&amp;gt;exec_host&amp;lt;/code&amp;gt;) node48 and node51.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node48&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:35:40 up 5 days, 47 min,  0 users,  load average: 23.00, 22.99, 23.03&lt;br /&gt;
Tasks: 246 total,  24 running, 198 sleeping,   0 stopped,  24 zombie&lt;br /&gt;
Cpu(s): 87.4%us,  8.4%sy,  0.0%ni,  4.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 18055956k used, 14886952k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12102316k cached&lt;br /&gt;
&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
21531 lluis.fi  20   0  508m 256m  28m R 100.0  0.8   4360:00 wrf.exe                                                                                                      &lt;br /&gt;
21532 lluis.fi  20   0  503m 248m  23m R 100.0  0.8   4358:44 wrf.exe                                                                                                      &lt;br /&gt;
21537 lluis.fi  20   0  503m 247m  23m R 100.0  0.8   4359:53 wrf.exe                                                                                                      &lt;br /&gt;
21540 lluis.fi  20   0  507m 252m  23m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21545 lluis.fi  20   0  503m 249m  24m R 100.0  0.8   4359:35 wrf.exe                                                                                                      &lt;br /&gt;
21546 lluis.fi  20   0  503m 247m  22m R 100.0  0.8   4359:51 wrf.exe                                                                                                      &lt;br /&gt;
21547 lluis.fi  20   0  507m 253m  23m R 100.0  0.8   4360:04 wrf.exe                                                                                                      &lt;br /&gt;
21548 lluis.fi  20   0  507m 254m  24m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21549 lluis.fi  20   0  495m 241m  23m R 100.0  0.7   4359:44 wrf.exe                                                                                                      &lt;br /&gt;
21550 lluis.fi  20   0  507m 251m  23m R 100.0  0.8   4359:47 wrf.exe                                                                                                      &lt;br /&gt;
21552 lluis.fi  20   0  501m 249m  27m R 100.0  0.8   4359:49 wrf.exe                                                                                                      &lt;br /&gt;
21529 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:42 wrf.exe                                                                                                       &lt;br /&gt;
21530 lluis.fi  20   0  500m 246m  25m R 99.7  0.8   4360:05 wrf.exe                                                                                                       &lt;br /&gt;
21533 lluis.fi  20   0  507m 253m  23m R 99.7  0.8   4359:55 wrf.exe                                                                                                       &lt;br /&gt;
21534 lluis.fi  20   0  507m 254m  24m R 99.7  0.8   4360:04 wrf.exe                                                                                                       &lt;br /&gt;
21535 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:40 wrf.exe                                                                                                       &lt;br /&gt;
21536 lluis.fi  20   0  503m 250m  24m R 99.7  0.8   4359:37 wrf.exe                                                                                                       &lt;br /&gt;
21539 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:01 wrf.exe                                                                                                       &lt;br /&gt;
21541 lluis.fi  20   0  503m 248m  23m R 99.7  0.8   4360:07 wrf.exe                                                                                                       &lt;br /&gt;
21542 lluis.fi  20   0  500m 250m  27m R 99.7  0.8   4359:49 wrf.exe                                                                                                       &lt;br /&gt;
21543 lluis.fi  20   0  500m 244m  23m R 99.7  0.8   4359:43 wrf.exe                                                                                                       &lt;br /&gt;
21544 lluis.fi  20   0  507m 251m  23m R 99.7  0.8   4359:34 wrf.exe                                                                                                       &lt;br /&gt;
21551 lluis.fi  20   0  507m 257m  27m R 99.7  0.8   4360:02 wrf.exe        &lt;br /&gt;
[lluis.fita@node48 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:37:32 up 5 days, 49 min,  0 users,  load average: 19.46, 22.23, 22.79&lt;br /&gt;
Tasks: 199 total,   1 running, 197 sleeping,   0 stopped,   1 zombie&lt;br /&gt;
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 12665572k used, 20277336k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12048020k cached&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
   10 root      20   0     0    0    0 S  0.3  0.0  39:27.92 kworker/0:1                                                                                                   &lt;br /&gt;
  262 root      20   0     0    0    0 S  0.3  0.0   0:18.40 kpktgend_13                                                                                                   &lt;br /&gt;
  272 root      20   0     0    0    0 S  0.3  0.0   0:18.47 kpktgend_23                                                                                                   &lt;br /&gt;
27134 lluis.fi  20   0 15280 1256  888 R  0.3  0.0   0:00.01 top                                                                                                           &lt;br /&gt;
    1 root      20   0 25660 1736 1428 S  0.0  0.0   0:05.71 init                                                                                                          &lt;br /&gt;
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                                      &lt;br /&gt;
    3 root      20   0     0    0    0 S  0.0  0.0   0:01.17 ksoftirqd/0  &lt;br /&gt;
[lluis.fita@node48 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Repeat the procedure on as many nodes as the job occupies (in this case also &amp;lt;code&amp;gt;node51&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node51&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# The job now shows up as &amp;lt;code&amp;gt;cancelled&amp;lt;/code&amp;gt; (letter &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1703:44: C larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 R larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      654:14:4 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      650:46:2 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 608:46:5 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 605:22:5 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 R larga     &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# If the job is not cancelled automatically, it has to be cancelled by hand&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qdel 4426&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Other jobs that were waiting in the PBS queue are now running, since the nodes have been freed. The system's [http://scad.cima.fcen.uba.ar/ganglia/ ganglia] shows how nodes 48 and 51 shed their compute load&lt;br /&gt;
&lt;br /&gt;
[[File:Ganglia_afterZombie.png|frame|50px|Example of nodes 48 and 51 shedding load after killing a &#039;zombie&#039; process]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Processes without PBS: Steps to follow II =&lt;br /&gt;
It may happen that a process occupies a node without appearing as a job in the queue. The node is then running a process that, as far as the PBS queue system is concerned, uses no resources, so the node ends up overloaded. &lt;br /&gt;
&lt;br /&gt;
In this case the cluster's Ganglia shows the node in red and all greyed out (nodes 47, 50 and 53 in the earlier Ganglia example). To be certain, one has to log into every node of the cluster, one by one, and check that it is not running any process without an associated job.&lt;br /&gt;
&lt;br /&gt;
Example with user &amp;lt;code&amp;gt;pzaninelli&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Look at the running jobs&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      1248:26: R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      1244:57: R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 1201:37: R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 1197:29: R larga          &lt;br /&gt;
4445.hydra                 wrf_control      lluis.fita      593:41:0 R larga          &lt;br /&gt;
4446.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Determine the nodes and execution path of all the user's jobs.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4432&lt;br /&gt;
Job Id: 4432.hydra&lt;br /&gt;
    Job_Name = wrf_control50k&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node46/23+node46/22+node46/21+node46/20+node46/19+node46/18+no&lt;br /&gt;
	de46/17+node46/16+node46/15+node46/14+node46/13+node46/12+node46/11+no&lt;br /&gt;
	de46/10+node46/9+node46/8+node46/7+node46/6+node46/5+node46/4+node46/3&lt;br /&gt;
	+node46/2+node46/1+node46/0+node47/23+node47/22+node47/21+node47/20+no&lt;br /&gt;
	de47/19+node47/18+node47/17+node47/16+node47/15+node47/14+node47/13+no&lt;br /&gt;
	de47/12+node47/11+node47/10+node47/9+node47/8+node47/7+node47/6+node47&lt;br /&gt;
	/5+node47/4+node47/3+node47/2+node47/1+node47/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
(...)&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4438&lt;br /&gt;
Job Id: 4438.hydra&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node49/23+node49/22+node49/21+node49/20+node49/19+node49/18+no&lt;br /&gt;
	de49/17+node49/16+node49/15+node49/14+node49/13+node49/12+node49/11+no&lt;br /&gt;
	de49/10+node49/9+node49/8+node49/7+node49/6+node49/5+node49/4+node49/3&lt;br /&gt;
	+node49/2+node49/1+node49/0+node50/23+node50/22+node50/21+node50/20+no&lt;br /&gt;
	de50/19+node50/18+node50/17+node50/16+node50/15+node50/14+node50/13+no&lt;br /&gt;
	de50/12+node50/11+node50/10+node50/9+node50/8+node50/7+node50/6+node50&lt;br /&gt;
	/5+node50/4+node50/3+node50/2+node50/1+node50/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# This gives:&lt;br /&gt;
* 4432: uses nodes 46 and 47 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
* 4438: uses nodes 49 and 50 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
# The steps to follow for each node are these. Every node has to be visited, since there is no other way to know which processes are running on it&lt;br /&gt;
## Log into the node&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh [nombreNodo]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
## Check the processes&lt;br /&gt;
### If the name of the possibly zombie application is unknown&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### If the application name is known (in this example the &amp;lt;code&amp;gt;WRF&amp;lt;/code&amp;gt; model)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep [aplicación]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Stop any processes that do not belong (orphans without a PBS job)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
kill -9 [procesoID]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Leave the node and start with the next one&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
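The per-node loop above can be sketched in bash, under the assumption that passwordless ssh to the nodes works as in the sessions shown; the function name &amp;lt;code&amp;gt;survey_nodes&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;SSH_CMD&amp;lt;/code&amp;gt; override are invented for illustration.&lt;br /&gt;

```shell
# Survey a list of nodes for the current user's processes.
# SSH_CMD can be overridden (e.g. with 'echo') for a dry run.
SSH_CMD=${SSH_CMD:-ssh}

survey_nodes() {
    for n in "$@"; do
        echo "== $n =="
        # Any process listed here that matches no PBS job on this node
        # is a candidate zombie (compare against qstat -f exec_host).
        $SSH_CMD "$n" 'ps -ef | grep -v grep | grep "$USER"' || true
    done
}

# Example (hypothetical node names):
# survey_nodes node40 node41 node42
```

This only lists the processes; deciding which ones to kill still requires comparing against the job's working directory, as in the steps above.&lt;br /&gt;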
# After checking nodes 40 to 46, logging into node 47&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ ssh node47&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     17995     1 74 Apr07 ?        3-00:53:43 ./wrf.exe&lt;br /&gt;
1561     17996     1 74 Apr07 ?        3-00:53:36 ./wrf.exe&lt;br /&gt;
1561     17997     1 74 Apr07 ?        3-00:49:04 ./wrf.exe&lt;br /&gt;
1561     17998     1 74 Apr07 ?        3-00:53:03 ./wrf.exe&lt;br /&gt;
1561     17999     1 74 Apr07 ?        3-00:57:12 ./wrf.exe&lt;br /&gt;
1561     18000     1 74 Apr07 ?        3-00:47:14 ./wrf.exe&lt;br /&gt;
1561     18001     1 74 Apr07 ?        3-00:50:55 ./wrf.exe&lt;br /&gt;
1561     18002     1 74 Apr07 ?        3-00:51:08 ./wrf.exe&lt;br /&gt;
1561     18003     1 74 Apr07 ?        3-00:53:33 ./wrf.exe&lt;br /&gt;
1561     18005     1 74 Apr07 ?        3-00:56:54 ./wrf.exe&lt;br /&gt;
1561     18006     1 74 Apr07 ?        3-00:49:51 ./wrf.exe&lt;br /&gt;
1561     18007     1 74 Apr07 ?        3-00:51:28 ./wrf.exe&lt;br /&gt;
1561     18008     1 74 Apr07 ?        3-00:53:13 ./wrf.exe&lt;br /&gt;
1561     18009     1 74 Apr07 ?        3-00:49:40 ./wrf.exe&lt;br /&gt;
1561     18010     1 74 Apr07 ?        3-00:52:07 ./wrf.exe&lt;br /&gt;
1561     18011     1 74 Apr07 ?        3-00:52:12 ./wrf.exe&lt;br /&gt;
1561     18012     1 74 Apr07 ?        3-00:54:05 ./wrf.exe&lt;br /&gt;
1561     18013     1 74 Apr07 ?        3-00:52:44 ./wrf.exe&lt;br /&gt;
1561     18014     1 74 Apr07 ?        3-00:49:45 ./wrf.exe&lt;br /&gt;
1561     18015     1 74 Apr07 ?        3-00:48:31 ./wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:16:12 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:07:45 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:50:56 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:07:46 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:33:20 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:35:49 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:26:10 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:15:46 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:05:37 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:50:07 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-04:57:29 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:08:26 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:17:23 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-04:57:51 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-04:59:43 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:00:49 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:55:14 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:21:52 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:28:01 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:36:03 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:12:05 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:35:22 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:46:58 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:16:21 ./wrf.exe&lt;br /&gt;
1561     27514 27474  0 12:30 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# A &#039;zombie&#039; process will show a very long running time. Two distinct groups of processes can be seen: &amp;lt;code&amp;gt;3-00:53:43 ./wrf.exe&amp;lt;/code&amp;gt; (3 days and 53 minutes) and &amp;lt;code&amp;gt;1-04:50:56 ./wrf.exe&amp;lt;/code&amp;gt; (1 day, 4 hours and 50 minutes)&lt;br /&gt;
# Find out where each process is running&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 17995&lt;br /&gt;
17995: /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&lt;br /&gt;
&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 23427&lt;br /&gt;
23427: /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k/run&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# From the analysis above, node 47 only hosts PBS job &amp;lt;code&amp;gt;4432&amp;lt;/code&amp;gt;, which runs in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&amp;lt;/code&amp;gt;. The processes of the other group (IDs 17995 to 18015), which run in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&amp;lt;/code&amp;gt;, are therefore zombies and can be removed&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ kill -9 $(seq 17995 18015)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
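The decision made in the previous steps (keep the processes whose working directory matches the job's PBS_O_WORKDIR, flag the rest) can be sketched as a helper. The function name &amp;lt;code&amp;gt;find_orphans&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;PWDX_CMD&amp;lt;/code&amp;gt; override are assumptions for illustration.&lt;br /&gt;

```shell
# find_orphans EXPECTED_DIR PID...
# Prints the PIDs whose working directory (as reported by pwdx) does
# NOT start with EXPECTED_DIR: those are the kill -9 candidates.
PWDX_CMD=${PWDX_CMD:-pwdx}

find_orphans() {
    expected=$1; shift
    for pid in "$@"; do
        dir=$($PWDX_CMD "$pid" 2>/dev/null | awk '{print $2}')
        case $dir in
            "$expected"*) ;;     # matches the legitimate job: keep it
            *) echo "$pid" ;;    # no matching PBS job on this node
        esac
    done
}

# Hypothetical usage, piping the candidates straight into kill:
# find_orphans /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k \
#     $(pgrep -u pzaninelli wrf.exe) | xargs -r kill -9
```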
# Searching for the processes again now shows only those that belong to the PBS job&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:20:01 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:11:40 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:54:40 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:11:37 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:37:15 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:40:07 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:30:09 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:19:26 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:09:19 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:54:05 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-05:01:22 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:12:17 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:21:10 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-05:01:46 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-05:03:41 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:04:38 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:58:56 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:25:47 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:31:58 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:40:00 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:16:06 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:39:17 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:50:49 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:20:11 ./wrf.exe&lt;br /&gt;
1561     27524 27474  0 12:37 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Repeat the process on the remaining nodes and watch them shed their workload&lt;br /&gt;
&lt;br /&gt;
[[File:GangliaAfterZombie.png|frame|50px|Example of nodes 47, 50 and 53 shedding load after killing a &#039;zombie&#039; process]]&lt;/div&gt;</summary>
		<author><name>Pzaninelli</name></author>
	</entry>
	<entry>
		<id>http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1143</id>
		<title>procesos zombies</title>
		<link rel="alternate" type="text/html" href="http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1143"/>
		<updated>2018-12-07T18:19:03Z</updated>

		<summary type="html">&lt;p&gt;Pzaninelli: /* script general */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A zombie process is understood as a process that was launched through the PBS job-management queue and occupies resources on the cluster, but no longer corresponds to a job currently tracked by the queue.&lt;br /&gt;
&lt;br /&gt;
This situation usually arises when the cluster shuts down in an uncontrolled way or when the space in the cluster's &amp;lt;code&amp;gt;home&amp;lt;/code&amp;gt; fills up.&lt;br /&gt;
&lt;br /&gt;
It is important to stop these processes, since they put an extra strain on the nodes on which they are running.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE:&#039;&#039;&#039; this is the only case in which logging into the cluster's compute nodes is allowed.&lt;br /&gt;
&lt;br /&gt;
== general script ==&lt;br /&gt;
There is a bash script that lists the &amp;lt;code&amp;gt;wrf.exe&amp;lt;/code&amp;gt; processes of a given user running on every node of the cluster&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/share/tools/work-flows/components/bats/check_hydra.bash [user]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Where &amp;lt;code&amp;gt;[user]&amp;lt;/code&amp;gt; is the user's numeric code (e.g. lluis.fita --&amp;gt; 1624)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! name&lt;br /&gt;
! number&lt;br /&gt;
|-&lt;br /&gt;
| pablo.zaninelli&lt;br /&gt;
| 1561&lt;br /&gt;
|}&lt;br /&gt;
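The numeric code in the table is simply the account's Unix UID, which matches the UID column shown in the ps listings above; it can be looked up with id(1). A minimal sketch; the helper name &amp;lt;code&amp;gt;uid_of&amp;lt;/code&amp;gt; is invented.&lt;br /&gt;

```shell
# uid_of NAME: print the numeric UID of an account, i.e. the value
# check_hydra.bash expects as its [user] argument.
uid_of() { id -u "$1"; }

# Example: look up the current account
uid_of "$(id -un)"
```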
&lt;br /&gt;
= Case 1: Steps to follow =&lt;br /&gt;
# Identify the user's jobs in the system queue (e.g. lluis.fita)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1699:08: R larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      649:09:3 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      645:40:5 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 603:42:2 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 600:18:4 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 Q larga  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Job &amp;lt;code&amp;gt;4426&amp;lt;/code&amp;gt; is a WRF simulation.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat -f 4426&lt;br /&gt;
Job Id: 4426.hydra&lt;br /&gt;
    Job_Name = wrf_control&lt;br /&gt;
    Job_Owner = lluis.fita@node48&lt;br /&gt;
    resources_used.cput = 1700:00:37&lt;br /&gt;
    resources_used.mem = 6264020kb&lt;br /&gt;
    resources_used.vmem = 15322976kb&lt;br /&gt;
    resources_used.walltime = 72:40:26&lt;br /&gt;
    job_state = R&lt;br /&gt;
    queue = larga&lt;br /&gt;
    server = hydra&lt;br /&gt;
    Checkpoint = u&lt;br /&gt;
    ctime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    depend = beforeany:4427.hydra@hydra&lt;br /&gt;
    Error_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/contr&lt;br /&gt;
	ol/wrf_control.e4426&lt;br /&gt;
    exec_host = node48/23+node48/22+node48/21+node48/20+node48/19+node48/18+no&lt;br /&gt;
	de48/17+node48/16+node48/15+node48/14+node48/13+node48/12+node48/11+no&lt;br /&gt;
	de48/10+node48/9+node48/8+node48/7+node48/6+node48/5+node48/4+node48/3&lt;br /&gt;
	+node48/2+node48/1+node48/0+node51/23+node51/22+node51/21+node51/20+no&lt;br /&gt;
	de51/19+node51/18+node51/17+node51/16+node51/15+node51/14+node51/13+no&lt;br /&gt;
	de51/12+node51/11+node51/10+node51/9+node51/8+node51/7+node51/6+node51&lt;br /&gt;
	/5+node51/4+node51/3+node51/2+node51/1+node51/0&lt;br /&gt;
    exec_port = 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+&lt;br /&gt;
	15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500&lt;br /&gt;
	3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003&lt;br /&gt;
    Hold_Types = n&lt;br /&gt;
    Join_Path = oe&lt;br /&gt;
    Keep_Files = n&lt;br /&gt;
    Mail_Points = ae&lt;br /&gt;
    Mail_Users = lluis.fita@cima.fcen.uba.ar&lt;br /&gt;
    mtime = Sat Apr  7 10:48:06 2018&lt;br /&gt;
    Output_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/cont&lt;br /&gt;
	rol/wrf_control.o4426&lt;br /&gt;
    Priority = 0&lt;br /&gt;
    qtime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    Rerunable = True&lt;br /&gt;
    Resource_List.mem = 30gb&lt;br /&gt;
    Resource_List.nodect = 2&lt;br /&gt;
    Resource_List.nodes = 2:ppn=24&lt;br /&gt;
    Resource_List.vmem = 30gb&lt;br /&gt;
    Resource_List.walltime = 168:00:00&lt;br /&gt;
    session_id = 19561&lt;br /&gt;
    Variable_List = PBS_O_QUEUE=larga,PBS_O_HOST=node48,&lt;br /&gt;
	PBS_O_HOME=/home/lluis.fita,PBS_O_LANG=en_US.UTF-8,&lt;br /&gt;
	PBS_O_LOGNAME=lluis.fita,&lt;br /&gt;
	PBS_O_PATH=/usr/local/bin:/opt/intel/composerxe-2011.3.174/bin/intel6&lt;br /&gt;
	4:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:&lt;br /&gt;
	/usr/local/maui/bin:/usr/local/maui/sbin:/usr/local/bin:/opt/intel/com&lt;br /&gt;
	poserxe/bin:/usr/local/maui/bin:/usr/local/maui/sbin:/home/lluis.fita/&lt;br /&gt;
	bin:/opt/intel/composerxe-2011.3.174/mpirt/bin/intel64:/usr/local/maui&lt;br /&gt;
	/bin:/usr/local/maui/sbin,PBS_O_MAIL=/var/spool/mail/lluis.fita,&lt;br /&gt;
	PBS_O_SHELL=/bin/bash,PBS_SERVER=hydra,&lt;br /&gt;
	PBS_O_WORKDIR=/home/lluis.fita/estudios/WRFsensSFC/simulations/contro&lt;br /&gt;
	l&lt;br /&gt;
    etime = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    submit_args = -W depend=afterany:4425 /home/lluis.fita/estudios/WRFsensSFC&lt;br /&gt;
	/simulations/control/run_WRF.pbs&lt;br /&gt;
    start_time = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    Walltime.Remaining = 343110&lt;br /&gt;
    start_count = 1&lt;br /&gt;
    fault_tolerant = False&lt;br /&gt;
    submit_host = node48&lt;br /&gt;
    init_work_dir = /home/lluis.fita/estudios/WRFsensSFC/simulations/control&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# The &amp;lt;code&amp;gt;PBS_O_WORKDIR&amp;lt;/code&amp;gt; variable holds the execution path of the job. One has to make sure that the process (a WRF simulation in this case) is actually running (here, that the rsl.out/error.[nnnn] files are being updated). In this case the WRF run lives in $PBS_O_WORKDIR/run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ls -l /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
-rw-r--r-- 1 lluis.fita cima 1204224 Apr  8 19:07 /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
$ date&lt;br /&gt;
Tue Apr 10 11:32:35 ART 2018&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
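The freshness check above can be automated. A minimal sketch, assuming GNU `stat` and `date` (as on the Linux nodes of hydra); the one-hour threshold and the warning message are illustrative choices, not part of any cluster tool:

```shell
#!/bin/bash
# is_stale FILE MAX_AGE_SECONDS
# Succeeds (exit 0) when FILE was last modified more than
# MAX_AGE_SECONDS ago -- the signature of a stalled WRF run.
is_stale () {
  local file=$1 max_age=$2
  local now mtime
  now=$(date +%s)                 # current time, epoch seconds
  mtime=$(stat -c %Y "$file")     # file mtime, epoch seconds (GNU stat)
  [ $(( now - mtime )) -gt "$max_age" ]
}

# Example: warn if rsl.error.0000 has not moved in the last hour.
# The path is the one from the session above.
log=/home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000
if [ -f "$log" ] && is_stale "$log" 3600; then
  echo "WRF looks stalled: $log is stale"
fi
```

Run periodically (e.g. from cron), this flags a dead simulation long before two days have passed.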
# It is clear that WRF is not running properly: the log was last written on Apr 8, while the current date is Apr 10. One therefore has to log into the nodes where it runs and stop the job by hand, since the PBS queue system no longer controls this process. WRF runs (see the &amp;lt;CODE&amp;gt;exec_host&amp;lt;/CODE&amp;gt; values) on node48 and node51.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node48&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:35:40 up 5 days, 47 min,  0 users,  load average: 23.00, 22.99, 23.03&lt;br /&gt;
Tasks: 246 total,  24 running, 198 sleeping,   0 stopped,  24 zombie&lt;br /&gt;
Cpu(s): 87.4%us,  8.4%sy,  0.0%ni,  4.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 18055956k used, 14886952k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12102316k cached&lt;br /&gt;
&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
21531 lluis.fi  20   0  508m 256m  28m R 100.0  0.8   4360:00 wrf.exe                                                                                                      &lt;br /&gt;
21532 lluis.fi  20   0  503m 248m  23m R 100.0  0.8   4358:44 wrf.exe                                                                                                      &lt;br /&gt;
21537 lluis.fi  20   0  503m 247m  23m R 100.0  0.8   4359:53 wrf.exe                                                                                                      &lt;br /&gt;
21540 lluis.fi  20   0  507m 252m  23m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21545 lluis.fi  20   0  503m 249m  24m R 100.0  0.8   4359:35 wrf.exe                                                                                                      &lt;br /&gt;
21546 lluis.fi  20   0  503m 247m  22m R 100.0  0.8   4359:51 wrf.exe                                                                                                      &lt;br /&gt;
21547 lluis.fi  20   0  507m 253m  23m R 100.0  0.8   4360:04 wrf.exe                                                                                                      &lt;br /&gt;
21548 lluis.fi  20   0  507m 254m  24m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21549 lluis.fi  20   0  495m 241m  23m R 100.0  0.7   4359:44 wrf.exe                                                                                                      &lt;br /&gt;
21550 lluis.fi  20   0  507m 251m  23m R 100.0  0.8   4359:47 wrf.exe                                                                                                      &lt;br /&gt;
21552 lluis.fi  20   0  501m 249m  27m R 100.0  0.8   4359:49 wrf.exe                                                                                                      &lt;br /&gt;
21529 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:42 wrf.exe                                                                                                       &lt;br /&gt;
21530 lluis.fi  20   0  500m 246m  25m R 99.7  0.8   4360:05 wrf.exe                                                                                                       &lt;br /&gt;
21533 lluis.fi  20   0  507m 253m  23m R 99.7  0.8   4359:55 wrf.exe                                                                                                       &lt;br /&gt;
21534 lluis.fi  20   0  507m 254m  24m R 99.7  0.8   4360:04 wrf.exe                                                                                                       &lt;br /&gt;
21535 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:40 wrf.exe                                                                                                       &lt;br /&gt;
21536 lluis.fi  20   0  503m 250m  24m R 99.7  0.8   4359:37 wrf.exe                                                                                                       &lt;br /&gt;
21539 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:01 wrf.exe                                                                                                       &lt;br /&gt;
21541 lluis.fi  20   0  503m 248m  23m R 99.7  0.8   4360:07 wrf.exe                                                                                                       &lt;br /&gt;
21542 lluis.fi  20   0  500m 250m  27m R 99.7  0.8   4359:49 wrf.exe                                                                                                       &lt;br /&gt;
21543 lluis.fi  20   0  500m 244m  23m R 99.7  0.8   4359:43 wrf.exe                                                                                                       &lt;br /&gt;
21544 lluis.fi  20   0  507m 251m  23m R 99.7  0.8   4359:34 wrf.exe                                                                                                       &lt;br /&gt;
21551 lluis.fi  20   0  507m 257m  27m R 99.7  0.8   4360:02 wrf.exe        &lt;br /&gt;
[lluis.fita@node48 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:37:32 up 5 days, 49 min,  0 users,  load average: 19.46, 22.23, 22.79&lt;br /&gt;
Tasks: 199 total,   1 running, 197 sleeping,   0 stopped,   1 zombie&lt;br /&gt;
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 12665572k used, 20277336k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12048020k cached&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
   10 root      20   0     0    0    0 S  0.3  0.0  39:27.92 kworker/0:1                                                                                                   &lt;br /&gt;
  262 root      20   0     0    0    0 S  0.3  0.0   0:18.40 kpktgend_13                                                                                                   &lt;br /&gt;
  272 root      20   0     0    0    0 S  0.3  0.0   0:18.47 kpktgend_23                                                                                                   &lt;br /&gt;
27134 lluis.fi  20   0 15280 1256  888 R  0.3  0.0   0:00.01 top                                                                                                           &lt;br /&gt;
    1 root      20   0 25660 1736 1428 S  0.0  0.0   0:05.71 init                                                                                                          &lt;br /&gt;
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                                      &lt;br /&gt;
    3 root      20   0     0    0    0 S  0.0  0.0   0:01.17 ksoftirqd/0  &lt;br /&gt;
[lluis.fita@node48 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Repeat the procedure on as many nodes as the job occupies (in this case also on &amp;lt;code&amp;gt;node51&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node51&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
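Repeating the same ssh/killall sequence by hand scales poorly. A small loop does it for every node listed in &amp;lt;code&amp;gt;exec_host&amp;lt;/code&amp;gt;; this is only a sketch, with the real &amp;lt;code&amp;gt;ssh ... killall&amp;lt;/code&amp;gt; call left commented out so nothing is killed by accident:

```shell
#!/bin/bash
# clean_nodes NODE [NODE ...]
# Visits each node that hosted the dead job and would kill the leftover
# wrf.exe processes there. Here the ssh call is commented out, so the
# function only reports what it would do.
clean_nodes () {
  local node
  for node in "$@"; do
    echo "cleaning $node"
    # ssh "$node" killall wrf.exe   # enable on the cluster
  done
}

# Nodes taken from the exec_host field of job 4426:
clean_nodes node48 node51
```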
# The job now appears as &amp;lt;code&amp;gt;cancelled&amp;lt;/code&amp;gt; (letter &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1703:44: C larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 R larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      654:14:4 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      650:46:2 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 608:46:5 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 605:22:5 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 R larga     &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# If the job were not cancelled automatically, it has to be cancelled by hand&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qdel 4426&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Other jobs that were waiting in the PBS queue are now running, since the nodes have been freed! The system [http://scad.cima.fcen.uba.ar/ganglia/ ganglia] page shows how nodes 48 and 51 shed their computing load&lt;br /&gt;
&lt;br /&gt;
[[File:Ganglia_afterZombie.png|frame|50px|Example of nodes 48 and 51 shedding load after killing a &#039;zombie&#039; process]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Processes without PBS: steps to follow II =&lt;br /&gt;
It may happen that a process occupies a node without showing up as a job in the queue. The node is then running a process that, as far as the PBS queue system knows, uses no resources, so the node ends up overloaded. &lt;br /&gt;
&lt;br /&gt;
In this case, the Ganglia page of the cluster shows the node in red while it appears completely grey (nodes 47, 50 and 53 in the previous Ganglia example). To be sure, it is necessary to log into every node of the cluster, one by one, and check that none of them runs a process without an associated job.&lt;br /&gt;
&lt;br /&gt;
Example with user &amp;lt;CODE&amp;gt;pzaninelli&amp;lt;/CODE&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Look at the running jobs&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      1248:26: R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      1244:57: R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 1201:37: R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 1197:29: R larga          &lt;br /&gt;
4445.hydra                 wrf_control      lluis.fita      593:41:0 R larga          &lt;br /&gt;
4446.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Determine the nodes and execution path of all the user&#039;s jobs.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4432&lt;br /&gt;
Job Id: 4432.hydra&lt;br /&gt;
    Job_Name = wrf_control50k&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node46/23+node46/22+node46/21+node46/20+node46/19+node46/18+no&lt;br /&gt;
	de46/17+node46/16+node46/15+node46/14+node46/13+node46/12+node46/11+no&lt;br /&gt;
	de46/10+node46/9+node46/8+node46/7+node46/6+node46/5+node46/4+node46/3&lt;br /&gt;
	+node46/2+node46/1+node46/0+node47/23+node47/22+node47/21+node47/20+no&lt;br /&gt;
	de47/19+node47/18+node47/17+node47/16+node47/15+node47/14+node47/13+no&lt;br /&gt;
	de47/12+node47/11+node47/10+node47/9+node47/8+node47/7+node47/6+node47&lt;br /&gt;
	/5+node47/4+node47/3+node47/2+node47/1+node47/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
(...)&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4438&lt;br /&gt;
Job Id: 4438.hydra&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node49/23+node49/22+node49/21+node49/20+node49/19+node49/18+no&lt;br /&gt;
	de49/17+node49/16+node49/15+node49/14+node49/13+node49/12+node49/11+no&lt;br /&gt;
	de49/10+node49/9+node49/8+node49/7+node49/6+node49/5+node49/4+node49/3&lt;br /&gt;
	+node49/2+node49/1+node49/0+node50/23+node50/22+node50/21+node50/20+no&lt;br /&gt;
	de50/19+node50/18+node50/17+node50/16+node50/15+node50/14+node50/13+no&lt;br /&gt;
	de50/12+node50/11+node50/10+node50/9+node50/8+node50/7+node50/6+node50&lt;br /&gt;
	/5+node50/4+node50/3+node50/2+node50/1+node50/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# This yields:&lt;br /&gt;
* 4432: uses nodes 46 and 47 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
* 4438: uses nodes 49 and 50 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
# The steps below have to be followed on each node. Every node must be visited, since there is no other way to know which processes are running on it&lt;br /&gt;
## Log into the node&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh [nombreNodo]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
## Check the processes&lt;br /&gt;
### If the name of the application that might be a zombie is not known&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### If the application name is known (in this example, the &amp;lt;code&amp;gt;WRF&amp;lt;/code&amp;gt; model)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep [aplicación]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Stop the processes that do not belong (orphaned from any PBS job)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
kill -9 [procesoID]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Leave the node and start with the next one&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
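The per-node inspection above can be condensed into one helper. A sketch, assuming procps &amp;lt;code&amp;gt;ps&amp;lt;/code&amp;gt;, whose &amp;lt;code&amp;gt;etimes&amp;lt;/code&amp;gt; column gives the elapsed run time in seconds; a zombie left over from an earlier chunk will have a much larger &amp;lt;code&amp;gt;etimes&amp;lt;/code&amp;gt; than the processes of the live job:

```shell
#!/bin/bash
# list_old_pids APP MAX_SECONDS
# Prints the PIDs of every process named APP that has been running for
# more than MAX_SECONDS (procps 'etimes' = elapsed time in seconds).
list_old_pids () {
  local app=$1 max_seconds=$2
  ps -e -o pid=,etimes=,comm= \
    | awk -v app="$app" -v max="$max_seconds" \
        '$3 == app && $2 > max { print $1 }'
}

# Example: wrf.exe processes older than 2 days are kill candidates.
# Review the list (e.g. with pwdx) before running:
#   kill -9 $(list_old_pids wrf.exe 172800)
```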
# After checking nodes 40 to 46, log into node 47&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ ssh node47&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     17995     1 74 Apr07 ?        3-00:53:43 ./wrf.exe&lt;br /&gt;
1561     17996     1 74 Apr07 ?        3-00:53:36 ./wrf.exe&lt;br /&gt;
1561     17997     1 74 Apr07 ?        3-00:49:04 ./wrf.exe&lt;br /&gt;
1561     17998     1 74 Apr07 ?        3-00:53:03 ./wrf.exe&lt;br /&gt;
1561     17999     1 74 Apr07 ?        3-00:57:12 ./wrf.exe&lt;br /&gt;
1561     18000     1 74 Apr07 ?        3-00:47:14 ./wrf.exe&lt;br /&gt;
1561     18001     1 74 Apr07 ?        3-00:50:55 ./wrf.exe&lt;br /&gt;
1561     18002     1 74 Apr07 ?        3-00:51:08 ./wrf.exe&lt;br /&gt;
1561     18003     1 74 Apr07 ?        3-00:53:33 ./wrf.exe&lt;br /&gt;
1561     18005     1 74 Apr07 ?        3-00:56:54 ./wrf.exe&lt;br /&gt;
1561     18006     1 74 Apr07 ?        3-00:49:51 ./wrf.exe&lt;br /&gt;
1561     18007     1 74 Apr07 ?        3-00:51:28 ./wrf.exe&lt;br /&gt;
1561     18008     1 74 Apr07 ?        3-00:53:13 ./wrf.exe&lt;br /&gt;
1561     18009     1 74 Apr07 ?        3-00:49:40 ./wrf.exe&lt;br /&gt;
1561     18010     1 74 Apr07 ?        3-00:52:07 ./wrf.exe&lt;br /&gt;
1561     18011     1 74 Apr07 ?        3-00:52:12 ./wrf.exe&lt;br /&gt;
1561     18012     1 74 Apr07 ?        3-00:54:05 ./wrf.exe&lt;br /&gt;
1561     18013     1 74 Apr07 ?        3-00:52:44 ./wrf.exe&lt;br /&gt;
1561     18014     1 74 Apr07 ?        3-00:49:45 ./wrf.exe&lt;br /&gt;
1561     18015     1 74 Apr07 ?        3-00:48:31 ./wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:16:12 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:07:45 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:50:56 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:07:46 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:33:20 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:35:49 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:26:10 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:15:46 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:05:37 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:50:07 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-04:57:29 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:08:26 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:17:23 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-04:57:51 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-04:59:43 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:00:49 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:55:14 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:21:52 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:28:01 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:36:03 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:12:05 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:35:22 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:46:58 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:16:21 ./wrf.exe&lt;br /&gt;
1561     27514 27474  0 12:30 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# A &#039;zombie&#039; process shows a very long run time. Two distinct groups of processes can be seen: &amp;lt;code&amp;gt;3-00:53:43 ./wrf.exe&amp;lt;/code&amp;gt; (3 days and 53 minutes) and &amp;lt;code&amp;gt;1-04:50:56 ./wrf.exe&amp;lt;/code&amp;gt; (1 day, 4 hours and 50 minutes)&lt;br /&gt;
# Find out where each process is executing&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 17995&lt;br /&gt;
17995: /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&lt;br /&gt;
&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 23427&lt;br /&gt;
23427: /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k/run&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
# From the previous analysis, node 47 only hosts PBS job &amp;lt;code&amp;gt;4432&amp;lt;/code&amp;gt;, which runs in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&amp;lt;/code&amp;gt;. Therefore the processes of the first group (PIDs 17995 to 18015), which run in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&amp;lt;/code&amp;gt;, are zombies and can now be killed&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ kill -9 $(seq 17995 18015)&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
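The decision rule just applied (a &amp;lt;code&amp;gt;wrf.exe&amp;lt;/code&amp;gt; is a zombie if its working directory does not match the PBS_O_WORKDIR of the live job) can also be scripted. A sketch, Linux-only since it reads the cwd symlink under /proc; the &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; stands in for the real &amp;lt;code&amp;gt;kill&amp;lt;/code&amp;gt; so the output can be reviewed first:

```shell
#!/bin/bash
# kill_foreign APP GOOD_DIR
# Reports every process named APP whose current working directory is
# NOT under GOOD_DIR. Linux-only: reads the /proc/PID/cwd symlink.
kill_foreign () {
  local app=$1 good_dir=$2 pid cwd
  for pid in $(pgrep -x "$app"); do
    cwd=$(readlink "/proc/$pid/cwd") || continue  # process may have exited
    case $cwd in
      "$good_dir"*) : ;;                       # belongs to the live job
      *) echo "would kill $pid ($cwd)" ;;      # replace with: kill -9 "$pid"
    esac
  done
}

# On node47, with job 4432 the only legitimate job:
#   kill_foreign wrf.exe /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k
```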
# Searching for the processes again now shows only those that belong to the PBS job&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:20:01 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:11:40 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:54:40 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:11:37 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:37:15 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:40:07 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:30:09 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:19:26 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:09:19 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:54:05 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-05:01:22 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:12:17 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:21:10 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-05:01:46 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-05:03:41 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:04:38 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:58:56 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:25:47 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:31:58 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:40:00 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:16:06 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:39:17 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:50:49 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:20:11 ./wrf.exe&lt;br /&gt;
1561     27524 27474  0 12:37 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
# Repeat the process on the rest of the nodes and watch how they shed their workload&lt;br /&gt;
&lt;br /&gt;
[[File:GangliaAfterZombie.png|frame|50px|Example of nodes 47, 50 and 53 shedding load after killing a &#039;zombie&#039; process]]&lt;/div&gt;</summary>
		<author><name>Pzaninelli</name></author>
	</entry>
	<entry>
		<id>http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1142</id>
		<title>procesos zombies</title>
		<link rel="alternate" type="text/html" href="http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1142"/>
		<updated>2018-12-07T18:18:16Z</updated>

		<summary type="html">&lt;p&gt;Pzaninelli: /* script general */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A zombie process is understood as a process that was launched through the PBS job-management queue and that still occupies space on the cluster, yet is no longer actually running.&lt;br /&gt;
&lt;br /&gt;
This situation usually occurs when the cluster shuts down in an uncontrolled way or when the &amp;lt;code&amp;gt;&#039;home&#039;&amp;lt;/code&amp;gt; space of the cluster fills up.&lt;br /&gt;
&lt;br /&gt;
It is important to stop these processes, since they put extra load on the nodes where they are running.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE:&#039;&#039;&#039; this is the only case in which logging into the compute nodes of the cluster is allowed.&lt;br /&gt;
&lt;br /&gt;
== general script ==&lt;br /&gt;
There is a bash script that shows the &amp;lt;CODE&amp;gt;wrf.exe&amp;lt;/CODE&amp;gt; processes of a given user running on all the nodes of the cluster&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
/share/tools/work-flows/components/bats/check_hydra.bash [user]&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
Where &amp;lt;CODE&amp;gt;[user]&amp;lt;/CODE&amp;gt; is the numeric code of the user (e.g.: lluis.fita --&amp;gt; 1624)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! name&lt;br /&gt;
! number&lt;br /&gt;
|-&lt;br /&gt;
| pablo.zaninelli&lt;br /&gt;
| 1561&lt;br /&gt;
|}&lt;br /&gt;
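The numeric code in the table is simply the POSIX UID of the account, so it does not have to be maintained in a hand-written table: &amp;lt;code&amp;gt;id -u&amp;lt;/code&amp;gt; prints it directly. A one-line helper (the wrapper name &amp;lt;code&amp;gt;uid_of&amp;lt;/code&amp;gt; is ours, not a cluster tool):

```shell
#!/bin/bash
# uid_of [USER] -- print the numeric UID of USER (default: current user),
# i.e. the [user] argument expected by check_hydra.bash.
uid_of () { id -u "${1:-$(id -un)}"; }

# Usage on the cluster:
#   /share/tools/work-flows/components/bats/check_hydra.bash "$(uid_of pzaninelli)"
```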
&lt;br /&gt;
= Case 1: steps to follow =&lt;br /&gt;
# Identify the jobs of the user in the system queue (e.g. lluis.fita)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1699:08: R larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      649:09:3 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      645:40:5 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 603:42:2 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 600:18:4 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 Q larga  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
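On busy days the listing gets long; it can be narrowed down to one user&#039;s running jobs. A hedged sketch over two sample lines copied from the listing above (on the cluster one would pipe qstat straight into the filter):&lt;br /&gt;

```shell
# Sketch: select the running (state R) jobs of one user from a qstat
# listing; the two sample lines are taken from the output above.
qstat_out='4426.hydra  wrf_control  lluis.fita  1699:08: R larga
4432.hydra  wrf_control50k  pzaninelli  649:09:3 R larga'
printf '%s\n' "$qstat_out" | awk -v u=lluis.fita '$3 == u { if ($5 == "R") print $1 }'
```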
&lt;br /&gt;
# Job &amp;lt;code&amp;gt;4426&amp;lt;/code&amp;gt; is a WRF simulation. Inspect it in detail: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat -f 4426&lt;br /&gt;
Job Id: 4426.hydra&lt;br /&gt;
    Job_Name = wrf_control&lt;br /&gt;
    Job_Owner = lluis.fita@node48&lt;br /&gt;
    resources_used.cput = 1700:00:37&lt;br /&gt;
    resources_used.mem = 6264020kb&lt;br /&gt;
    resources_used.vmem = 15322976kb&lt;br /&gt;
    resources_used.walltime = 72:40:26&lt;br /&gt;
    job_state = R&lt;br /&gt;
    queue = larga&lt;br /&gt;
    server = hydra&lt;br /&gt;
    Checkpoint = u&lt;br /&gt;
    ctime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    depend = beforeany:4427.hydra@hydra&lt;br /&gt;
    Error_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/contr&lt;br /&gt;
	ol/wrf_control.e4426&lt;br /&gt;
    exec_host = node48/23+node48/22+node48/21+node48/20+node48/19+node48/18+no&lt;br /&gt;
	de48/17+node48/16+node48/15+node48/14+node48/13+node48/12+node48/11+no&lt;br /&gt;
	de48/10+node48/9+node48/8+node48/7+node48/6+node48/5+node48/4+node48/3&lt;br /&gt;
	+node48/2+node48/1+node48/0+node51/23+node51/22+node51/21+node51/20+no&lt;br /&gt;
	de51/19+node51/18+node51/17+node51/16+node51/15+node51/14+node51/13+no&lt;br /&gt;
	de51/12+node51/11+node51/10+node51/9+node51/8+node51/7+node51/6+node51&lt;br /&gt;
	/5+node51/4+node51/3+node51/2+node51/1+node51/0&lt;br /&gt;
    exec_port = 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+&lt;br /&gt;
	15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500&lt;br /&gt;
	3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003&lt;br /&gt;
    Hold_Types = n&lt;br /&gt;
    Join_Path = oe&lt;br /&gt;
    Keep_Files = n&lt;br /&gt;
    Mail_Points = ae&lt;br /&gt;
    Mail_Users = lluis.fita@cima.fcen.uba.ar&lt;br /&gt;
    mtime = Sat Apr  7 10:48:06 2018&lt;br /&gt;
    Output_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/cont&lt;br /&gt;
	rol/wrf_control.o4426&lt;br /&gt;
    Priority = 0&lt;br /&gt;
    qtime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    Rerunable = True&lt;br /&gt;
    Resource_List.mem = 30gb&lt;br /&gt;
    Resource_List.nodect = 2&lt;br /&gt;
    Resource_List.nodes = 2:ppn=24&lt;br /&gt;
    Resource_List.vmem = 30gb&lt;br /&gt;
    Resource_List.walltime = 168:00:00&lt;br /&gt;
    session_id = 19561&lt;br /&gt;
    Variable_List = PBS_O_QUEUE=larga,PBS_O_HOST=node48,&lt;br /&gt;
	PBS_O_HOME=/home/lluis.fita,PBS_O_LANG=en_US.UTF-8,&lt;br /&gt;
	PBS_O_LOGNAME=lluis.fita,&lt;br /&gt;
	PBS_O_PATH=/usr/local/bin:/opt/intel/composerxe-2011.3.174/bin/intel6&lt;br /&gt;
	4:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:&lt;br /&gt;
	/usr/local/maui/bin:/usr/local/maui/sbin:/usr/local/bin:/opt/intel/com&lt;br /&gt;
	poserxe/bin:/usr/local/maui/bin:/usr/local/maui/sbin:/home/lluis.fita/&lt;br /&gt;
	bin:/opt/intel/composerxe-2011.3.174/mpirt/bin/intel64:/usr/local/maui&lt;br /&gt;
	/bin:/usr/local/maui/sbin,PBS_O_MAIL=/var/spool/mail/lluis.fita,&lt;br /&gt;
	PBS_O_SHELL=/bin/bash,PBS_SERVER=hydra,&lt;br /&gt;
	PBS_O_WORKDIR=/home/lluis.fita/estudios/WRFsensSFC/simulations/contro&lt;br /&gt;
	l&lt;br /&gt;
    etime = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    submit_args = -W depend=afterany:4425 /home/lluis.fita/estudios/WRFsensSFC&lt;br /&gt;
	/simulations/control/run_WRF.pbs&lt;br /&gt;
    start_time = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    Walltime.Remaining = 343110&lt;br /&gt;
    start_count = 1&lt;br /&gt;
    fault_tolerant = False&lt;br /&gt;
    submit_host = node48&lt;br /&gt;
    init_work_dir = /home/lluis.fita/estudios/WRFsensSFC/simulations/control&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
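The long exec_host value can be collapsed to the list of nodes involved. A minimal sketch (the string below is an abbreviated stand-in for the full value above):&lt;br /&gt;

```shell
# Sketch: reduce a PBS exec_host string to its unique node names.
exec_host="node48/23+node48/22+node48/0+node51/23+node51/1+node51/0"
printf '%s\n' "$exec_host" | tr '+' '\n' | cut -d/ -f1 | sort -u
```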
# The &amp;lt;code&amp;gt;PBS_O_WORKDIR&amp;lt;/code&amp;gt; variable contains the execution path of the job. One has to make sure that the process (a WRF simulation in this case) is actually running (here, that the rsl.out/error.[nnnn] files are being updated). In this case the WRF run is located in $PBS_O_WORKDIR/run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ls -l /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
rw-r--r-- 1 lluis.fita cima 1204224 Apr  8 19:07 /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
$ date&lt;br /&gt;
Tue Apr 10 11:32:35 ART 2018&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
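The same check can be scripted by comparing the age of the log file against a threshold. A hedged sketch (the temporary file and the one-hour threshold are stand-ins; on the cluster one would point it at rsl.error.0000):&lt;br /&gt;

```shell
# Sketch: decide whether a run looks stalled by comparing a log file's
# age against a threshold. logfile here is a freshly created stand-in,
# so the sketch reports "active".
logfile=$(mktemp)
threshold=3600    # seconds without updates before calling it stalled
age=$(( $(date +%s) - $(stat -c %Y "$logfile") ))
if [ "$age" -gt "$threshold" ]; then echo stalled; else echo active; fi
```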
# It is clear that WRF is not running properly, so one has to log into the nodes where it is executing and stop the job, since the PBS queue system no longer controls this process. WRF runs on the nodes (values of &amp;lt;CODE&amp;gt;exec_host&amp;lt;/CODE&amp;gt;) node48 and node51.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node48&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:35:40 up 5 days, 47 min,  0 users,  load average: 23.00, 22.99, 23.03&lt;br /&gt;
Tasks: 246 total,  24 running, 198 sleeping,   0 stopped,  24 zombie&lt;br /&gt;
Cpu(s): 87.4%us,  8.4%sy,  0.0%ni,  4.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 18055956k used, 14886952k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12102316k cached&lt;br /&gt;
&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
21531 lluis.fi  20   0  508m 256m  28m R 100.0  0.8   4360:00 wrf.exe                                                                                                      &lt;br /&gt;
21532 lluis.fi  20   0  503m 248m  23m R 100.0  0.8   4358:44 wrf.exe                                                                                                      &lt;br /&gt;
21537 lluis.fi  20   0  503m 247m  23m R 100.0  0.8   4359:53 wrf.exe                                                                                                      &lt;br /&gt;
21540 lluis.fi  20   0  507m 252m  23m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21545 lluis.fi  20   0  503m 249m  24m R 100.0  0.8   4359:35 wrf.exe                                                                                                      &lt;br /&gt;
21546 lluis.fi  20   0  503m 247m  22m R 100.0  0.8   4359:51 wrf.exe                                                                                                      &lt;br /&gt;
21547 lluis.fi  20   0  507m 253m  23m R 100.0  0.8   4360:04 wrf.exe                                                                                                      &lt;br /&gt;
21548 lluis.fi  20   0  507m 254m  24m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21549 lluis.fi  20   0  495m 241m  23m R 100.0  0.7   4359:44 wrf.exe                                                                                                      &lt;br /&gt;
21550 lluis.fi  20   0  507m 251m  23m R 100.0  0.8   4359:47 wrf.exe                                                                                                      &lt;br /&gt;
21552 lluis.fi  20   0  501m 249m  27m R 100.0  0.8   4359:49 wrf.exe                                                                                                      &lt;br /&gt;
21529 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:42 wrf.exe                                                                                                       &lt;br /&gt;
21530 lluis.fi  20   0  500m 246m  25m R 99.7  0.8   4360:05 wrf.exe                                                                                                       &lt;br /&gt;
21533 lluis.fi  20   0  507m 253m  23m R 99.7  0.8   4359:55 wrf.exe                                                                                                       &lt;br /&gt;
21534 lluis.fi  20   0  507m 254m  24m R 99.7  0.8   4360:04 wrf.exe                                                                                                       &lt;br /&gt;
21535 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:40 wrf.exe                                                                                                       &lt;br /&gt;
21536 lluis.fi  20   0  503m 250m  24m R 99.7  0.8   4359:37 wrf.exe                                                                                                       &lt;br /&gt;
21539 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:01 wrf.exe                                                                                                       &lt;br /&gt;
21541 lluis.fi  20   0  503m 248m  23m R 99.7  0.8   4360:07 wrf.exe                                                                                                       &lt;br /&gt;
21542 lluis.fi  20   0  500m 250m  27m R 99.7  0.8   4359:49 wrf.exe                                                                                                       &lt;br /&gt;
21543 lluis.fi  20   0  500m 244m  23m R 99.7  0.8   4359:43 wrf.exe                                                                                                       &lt;br /&gt;
21544 lluis.fi  20   0  507m 251m  23m R 99.7  0.8   4359:34 wrf.exe                                                                                                       &lt;br /&gt;
21551 lluis.fi  20   0  507m 257m  27m R 99.7  0.8   4360:02 wrf.exe        &lt;br /&gt;
[lluis.fita@node48 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:37:32 up 5 days, 49 min,  0 users,  load average: 19.46, 22.23, 22.79&lt;br /&gt;
Tasks: 199 total,   1 running, 197 sleeping,   0 stopped,   1 zombie&lt;br /&gt;
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 12665572k used, 20277336k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12048020k cached&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
   10 root      20   0     0    0    0 S  0.3  0.0  39:27.92 kworker/0:1                                                                                                   &lt;br /&gt;
  262 root      20   0     0    0    0 S  0.3  0.0   0:18.40 kpktgend_13                                                                                                   &lt;br /&gt;
  272 root      20   0     0    0    0 S  0.3  0.0   0:18.47 kpktgend_23                                                                                                   &lt;br /&gt;
27134 lluis.fi  20   0 15280 1256  888 R  0.3  0.0   0:00.01 top                                                                                                           &lt;br /&gt;
    1 root      20   0 25660 1736 1428 S  0.0  0.0   0:05.71 init                                                                                                          &lt;br /&gt;
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                                      &lt;br /&gt;
    3 root      20   0     0    0    0 S  0.0  0.0   0:01.17 ksoftirqd/0  &lt;br /&gt;
[lluis.fita@node48 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Repeat the procedure on as many nodes as the job occupies (in this case also &amp;lt;code&amp;gt;node51&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node51&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# The job reappears as &amp;lt;code&amp;gt;cancelled&amp;lt;/code&amp;gt; (letter &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1703:44: C larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 R larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      654:14:4 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      650:46:2 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 608:46:5 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 605:22:5 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 R larga     &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# If the job is not cancelled automatically, it must be cancelled by hand&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qdel 4426&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Other jobs that were waiting in the PBS queue are now running, since the nodes have been freed. The system&#039;s [http://scad.cima.fcen.uba.ar/ganglia/ ganglia] shows how nodes 48 and 51 shed their compute load&lt;br /&gt;
&lt;br /&gt;
[[File:Ganglia_afterZombie.png|frame|50px|Example of how nodes 48 and 51 shed their load after killing a &#039;zombie&#039; process]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Processes without PBS: Steps to follow II =&lt;br /&gt;
It may happen that a process occupies a node without showing up as a job in the queue. In that case the node is executing a process, but as far as the PBS queue system is concerned no resources are in use, so the node ends up overloaded. &lt;br /&gt;
&lt;br /&gt;
In this case, when consulting the cluster&#039;s Ganglia, the node shows up in red yet appears completely grey (in the previous Ganglia example, nodes 47, 50 and 53). To be certain, it is necessary to log into every node of the cluster, one by one, and make sure it hosts no process without an associated job.&lt;br /&gt;
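Visiting every node can be semi-automated by generating the node names first. A hedged sketch (the range 40..53 is an assumption based on the nodes mentioned on this page; the actual ssh step is left commented out):&lt;br /&gt;

```shell
# Sketch: generate the node names to visit one by one.
for n in $(seq 40 53); do
    echo "node$n"
    # ssh "node$n" 'ps -ef | grep $USER'    # the actual per-node check
done
```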
&lt;br /&gt;
Example with the user &amp;lt;CODE&amp;gt;pzaninelli&amp;lt;/CODE&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Look at the running jobs&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      1248:26: R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      1244:57: R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 1201:37: R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 1197:29: R larga          &lt;br /&gt;
4445.hydra                 wrf_control      lluis.fita      593:41:0 R larga          &lt;br /&gt;
4446.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Determine the nodes and execution path of all the jobs of the user.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4432&lt;br /&gt;
Job Id: 4432.hydra&lt;br /&gt;
    Job_Name = wrf_control50k&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node46/23+node46/22+node46/21+node46/20+node46/19+node46/18+no&lt;br /&gt;
	de46/17+node46/16+node46/15+node46/14+node46/13+node46/12+node46/11+no&lt;br /&gt;
	de46/10+node46/9+node46/8+node46/7+node46/6+node46/5+node46/4+node46/3&lt;br /&gt;
	+node46/2+node46/1+node46/0+node47/23+node47/22+node47/21+node47/20+no&lt;br /&gt;
	de47/19+node47/18+node47/17+node47/16+node47/15+node47/14+node47/13+no&lt;br /&gt;
	de47/12+node47/11+node47/10+node47/9+node47/8+node47/7+node47/6+node47&lt;br /&gt;
	/5+node47/4+node47/3+node47/2+node47/1+node47/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
(...)&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4438&lt;br /&gt;
Job Id: 4438.hydra&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node49/23+node49/22+node49/21+node49/20+node49/19+node49/18+no&lt;br /&gt;
	de49/17+node49/16+node49/15+node49/14+node49/13+node49/12+node49/11+no&lt;br /&gt;
	de49/10+node49/9+node49/8+node49/7+node49/6+node49/5+node49/4+node49/3&lt;br /&gt;
	+node49/2+node49/1+node49/0+node50/23+node50/22+node50/21+node50/20+no&lt;br /&gt;
	de50/19+node50/18+node50/17+node50/16+node50/15+node50/14+node50/13+no&lt;br /&gt;
	de50/12+node50/11+node50/10+node50/9+node50/8+node50/7+node50/6+node50&lt;br /&gt;
	/5+node50/4+node50/3+node50/2+node50/1+node50/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# This yields:&lt;br /&gt;
* 4432: uses nodes 46 and 47 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
* 4438: uses nodes 49 and 50 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
# The steps to follow on each node are the following. Every node has to be visited, since there is no other way to know which processes are executing on it&lt;br /&gt;
## Log into the node&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh [nombreNodo]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
## Check the processes&lt;br /&gt;
### If the name of the application that might be a zombie is unknown&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### If the name of the application is known (in this example, the &amp;lt;code&amp;gt;WRF&amp;lt;/code&amp;gt; model)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep [aplicación]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Stop any processes that do not belong there (orphaned from their PBS job)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
kill -9 [procesoID]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Leave the node and move on to the next one&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
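The four sub-steps above can be wrapped into a single loop. A hedged sketch that defaults to a dry run (it echoes the ssh commands instead of executing them; the node names are assumptions):&lt;br /&gt;

```shell
# Sketch: the per-node routine as one loop. SSH defaults to a dry-run
# echo so the sketch is safe to execute; set SSH=ssh on the cluster to
# actually connect. Node names below are illustrative assumptions.
SSH=${SSH:-"echo ssh"}
for node in node46 node47; do
    $SSH "$node" 'ps -ef | grep [w]rf.exe'
done
```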
# After checking nodes 40 through 46, log into node 47&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ ssh node47&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     17995     1 74 Apr07 ?        3-00:53:43 ./wrf.exe&lt;br /&gt;
1561     17996     1 74 Apr07 ?        3-00:53:36 ./wrf.exe&lt;br /&gt;
1561     17997     1 74 Apr07 ?        3-00:49:04 ./wrf.exe&lt;br /&gt;
1561     17998     1 74 Apr07 ?        3-00:53:03 ./wrf.exe&lt;br /&gt;
1561     17999     1 74 Apr07 ?        3-00:57:12 ./wrf.exe&lt;br /&gt;
1561     18000     1 74 Apr07 ?        3-00:47:14 ./wrf.exe&lt;br /&gt;
1561     18001     1 74 Apr07 ?        3-00:50:55 ./wrf.exe&lt;br /&gt;
1561     18002     1 74 Apr07 ?        3-00:51:08 ./wrf.exe&lt;br /&gt;
1561     18003     1 74 Apr07 ?        3-00:53:33 ./wrf.exe&lt;br /&gt;
1561     18005     1 74 Apr07 ?        3-00:56:54 ./wrf.exe&lt;br /&gt;
1561     18006     1 74 Apr07 ?        3-00:49:51 ./wrf.exe&lt;br /&gt;
1561     18007     1 74 Apr07 ?        3-00:51:28 ./wrf.exe&lt;br /&gt;
1561     18008     1 74 Apr07 ?        3-00:53:13 ./wrf.exe&lt;br /&gt;
1561     18009     1 74 Apr07 ?        3-00:49:40 ./wrf.exe&lt;br /&gt;
1561     18010     1 74 Apr07 ?        3-00:52:07 ./wrf.exe&lt;br /&gt;
1561     18011     1 74 Apr07 ?        3-00:52:12 ./wrf.exe&lt;br /&gt;
1561     18012     1 74 Apr07 ?        3-00:54:05 ./wrf.exe&lt;br /&gt;
1561     18013     1 74 Apr07 ?        3-00:52:44 ./wrf.exe&lt;br /&gt;
1561     18014     1 74 Apr07 ?        3-00:49:45 ./wrf.exe&lt;br /&gt;
1561     18015     1 74 Apr07 ?        3-00:48:31 ./wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:16:12 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:07:45 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:50:56 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:07:46 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:33:20 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:35:49 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:26:10 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:15:46 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:05:37 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:50:07 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-04:57:29 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:08:26 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:17:23 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-04:57:51 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-04:59:43 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:00:49 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:55:14 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:21:52 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:28:01 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:36:03 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:12:05 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:35:22 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:46:58 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:16:21 ./wrf.exe&lt;br /&gt;
1561     27514 27474  0 12:30 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# A &#039;zombie&#039; process will show a very long run time. Note that there are two distinct groups of processes: &amp;lt;code&amp;gt;3-00:53:43 ./wrf.exe&amp;lt;/code&amp;gt; (3 days and 53 minutes) and &amp;lt;code&amp;gt;1-04:50:56 ./wrf.exe&amp;lt;/code&amp;gt; (1 day, 4 hours and 50 minutes)&lt;br /&gt;
# Check where each process is executing&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 17995&lt;br /&gt;
17995: /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&lt;br /&gt;
&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 23427&lt;br /&gt;
23427: /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k/run&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
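A telltale sign in the ps listing above: the orphaned wrf.exe processes hang directly off init (PPID 1), while the PBS-controlled ones are children of launch_pbs.bash. A hedged sketch that lists such orphan candidates (empty output means none were found):&lt;br /&gt;

```shell
# Sketch: list wrf.exe processes whose parent is init (PID 1), which in
# the listing above marked the orphaned group.
ps -eo pid,ppid,comm | awk '$2 == 1 { if ($3 == "wrf.exe") print $1 }'
```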
# From the previous analysis, node 47 only hosts the PBS job &amp;lt;code&amp;gt;4432&amp;lt;/code&amp;gt;, which runs in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&amp;lt;/code&amp;gt;. Therefore the group of processes (IDs 17995 to 18015) running in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&amp;lt;/code&amp;gt; are zombies, and they can now be removed&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ kill -9 $(seq 17995 18015)&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
# Searching for the processes again now shows only those depending on the PBS job&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:20:01 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:11:40 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:54:40 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:11:37 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:37:15 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:40:07 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:30:09 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:19:26 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:09:19 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:54:05 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-05:01:22 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:12:17 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:21:10 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-05:01:46 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-05:03:41 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:04:38 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:58:56 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:25:47 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:31:58 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:40:00 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:16:06 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:39:17 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:50:49 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:20:11 ./wrf.exe&lt;br /&gt;
1561     27524 27474  0 12:37 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
# Repeat the procedure on the rest of the nodes and watch how they reduce their workload&lt;br /&gt;
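Besides Ganglia, the drop in load can be confirmed on the node itself. A minimal sketch reading the 1-minute load average from /proc/loadavg (Linux-specific):&lt;br /&gt;

```shell
# Sketch: print a node's 1-minute load average; after the cleanup it
# should fall well below the core count.
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "1-min load: $load1"
```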
&lt;br /&gt;
[[File:GangliaAfterZombie.png|frame|50px|Example of how nodes 47, 50 and 53 shed their load after killing a &#039;zombie&#039; process]]&lt;/div&gt;</summary>
		<author><name>Pzaninelli</name></author>
	</entry>
	<entry>
		<id>http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1141</id>
		<title>procesos zombies</title>
		<link rel="alternate" type="text/html" href="http://wiki.cima.fcen.uba.ar/index.php?title=procesos_zombies&amp;diff=1141"/>
		<updated>2018-12-07T18:16:44Z</updated>

		<summary type="html">&lt;p&gt;Pzaninelli: /* script general */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;A zombie process is understood as a process that was launched through the PBS job-management queue, still occupies space on the cluster&#039;s system, but is no longer actually running under the queue&#039;s control.&lt;br /&gt;
&lt;br /&gt;
This situation usually occurs when the cluster shuts down in an uncontrolled way, for instance when the space in the cluster&#039;s &amp;lt;code&amp;gt;&#039;home&#039;&amp;lt;/code&amp;gt; fills up.&lt;br /&gt;
&lt;br /&gt;
It is important to stop these processes, since they impose an extra load on the nodes on which they are running.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE:&#039;&#039;&#039; this is the only case in which logging into the cluster&#039;s compute nodes is allowed.&lt;br /&gt;
&lt;br /&gt;
== general script ==&lt;br /&gt;
There is a bash script that lists the &amp;lt;CODE&amp;gt;wrf.exe&amp;lt;/CODE&amp;gt; processes of a given user that are running on all the nodes of the cluster&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
/share/tools/work-flows/components/bats/check_hydra.bash [user]&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
Where &amp;lt;CODE&amp;gt;[user]&amp;lt;/CODE&amp;gt; is the user&#039;s numeric code (e.g.: lluis.fita --&amp;gt; 1624)&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
! name !! number&lt;br /&gt;
|-&lt;br /&gt;
| pablo.zaninelli || 1561&lt;br /&gt;
|}&lt;br /&gt;
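The per-node check that a script like &amp;lt;CODE&amp;gt;check_hydra.bash&amp;lt;/CODE&amp;gt; performs can be sketched in bash. This is a hypothetical reconstruction (the real script is not shown here); the key step is filtering &amp;lt;CODE&amp;gt;ps -ef&amp;lt;/CODE&amp;gt; output by the numeric uid in the first column and the command in the last one:

```shell
# Hypothetical sketch: pick out the wrf.exe PIDs of a given numeric uid from
# 'ps -ef'-style output. Here the input is a sample string; on the cluster it
# would come from running 'ps -ef' on each node over ssh.
uid=1561
ps_output='1561     17995     1 74 Apr07 ?        3-00:53:43 ./wrf.exe
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe
1624     12345     1 74 Apr07 ?        2-10:11:12 ./wrf.exe'
# Field 1 is the uid, field 2 the PID, field 8 the command.
pids=$(printf '%s\n' "$ps_output" | awk -v u="$uid" '$1==u { if ($8=="./wrf.exe") print $2 }')
echo "$pids"
```

Note that the launcher lines (&amp;lt;CODE&amp;gt;/bin/bash ./launch_pbs.bash ./wrf.exe&amp;lt;/CODE&amp;gt;) are excluded on purpose: only the actual &amp;lt;CODE&amp;gt;./wrf.exe&amp;lt;/CODE&amp;gt; processes are of interest.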
&lt;br /&gt;
= Case 1: Steps to follow =&lt;br /&gt;
# Identify the user&#039;s jobs in the system queue (e.g. lluis.fita)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1699:08: R larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      649:09:3 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      645:40:5 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 603:42:2 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 600:18:4 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 Q larga  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Job &amp;lt;code&amp;gt;4426&amp;lt;/code&amp;gt; is a WRF simulation. Inspect it in detail: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qstat -f 4426&lt;br /&gt;
Job Id: 4426.hydra&lt;br /&gt;
    Job_Name = wrf_control&lt;br /&gt;
    Job_Owner = lluis.fita@node48&lt;br /&gt;
    resources_used.cput = 1700:00:37&lt;br /&gt;
    resources_used.mem = 6264020kb&lt;br /&gt;
    resources_used.vmem = 15322976kb&lt;br /&gt;
    resources_used.walltime = 72:40:26&lt;br /&gt;
    job_state = R&lt;br /&gt;
    queue = larga&lt;br /&gt;
    server = hydra&lt;br /&gt;
    Checkpoint = u&lt;br /&gt;
    ctime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    depend = beforeany:4427.hydra@hydra&lt;br /&gt;
    Error_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/contr&lt;br /&gt;
	ol/wrf_control.e4426&lt;br /&gt;
    exec_host = node48/23+node48/22+node48/21+node48/20+node48/19+node48/18+no&lt;br /&gt;
	de48/17+node48/16+node48/15+node48/14+node48/13+node48/12+node48/11+no&lt;br /&gt;
	de48/10+node48/9+node48/8+node48/7+node48/6+node48/5+node48/4+node48/3&lt;br /&gt;
	+node48/2+node48/1+node48/0+node51/23+node51/22+node51/21+node51/20+no&lt;br /&gt;
	de51/19+node51/18+node51/17+node51/16+node51/15+node51/14+node51/13+no&lt;br /&gt;
	de51/12+node51/11+node51/10+node51/9+node51/8+node51/7+node51/6+node51&lt;br /&gt;
	/5+node51/4+node51/3+node51/2+node51/1+node51/0&lt;br /&gt;
    exec_port = 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+&lt;br /&gt;
	15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+1500&lt;br /&gt;
	3+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15&lt;br /&gt;
	003+15003+15003&lt;br /&gt;
    Hold_Types = n&lt;br /&gt;
    Join_Path = oe&lt;br /&gt;
    Keep_Files = n&lt;br /&gt;
    Mail_Points = ae&lt;br /&gt;
    Mail_Users = lluis.fita@cima.fcen.uba.ar&lt;br /&gt;
    mtime = Sat Apr  7 10:48:06 2018&lt;br /&gt;
    Output_Path = node48:/home/lluis.fita/estudios/WRFsensSFC/simulations/cont&lt;br /&gt;
	rol/wrf_control.o4426&lt;br /&gt;
    Priority = 0&lt;br /&gt;
    qtime = Sat Apr  7 10:47:06 2018&lt;br /&gt;
    Rerunable = True&lt;br /&gt;
    Resource_List.mem = 30gb&lt;br /&gt;
    Resource_List.nodect = 2&lt;br /&gt;
    Resource_List.nodes = 2:ppn=24&lt;br /&gt;
    Resource_List.vmem = 30gb&lt;br /&gt;
    Resource_List.walltime = 168:00:00&lt;br /&gt;
    session_id = 19561&lt;br /&gt;
    Variable_List = PBS_O_QUEUE=larga,PBS_O_HOST=node48,&lt;br /&gt;
	PBS_O_HOME=/home/lluis.fita,PBS_O_LANG=en_US.UTF-8,&lt;br /&gt;
	PBS_O_LOGNAME=lluis.fita,&lt;br /&gt;
	PBS_O_PATH=/usr/local/bin:/opt/intel/composerxe-2011.3.174/bin/intel6&lt;br /&gt;
	4:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:&lt;br /&gt;
	/usr/local/maui/bin:/usr/local/maui/sbin:/usr/local/bin:/opt/intel/com&lt;br /&gt;
	poserxe/bin:/usr/local/maui/bin:/usr/local/maui/sbin:/home/lluis.fita/&lt;br /&gt;
	bin:/opt/intel/composerxe-2011.3.174/mpirt/bin/intel64:/usr/local/maui&lt;br /&gt;
	/bin:/usr/local/maui/sbin,PBS_O_MAIL=/var/spool/mail/lluis.fita,&lt;br /&gt;
	PBS_O_SHELL=/bin/bash,PBS_SERVER=hydra,&lt;br /&gt;
	PBS_O_WORKDIR=/home/lluis.fita/estudios/WRFsensSFC/simulations/contro&lt;br /&gt;
	l&lt;br /&gt;
    etime = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    submit_args = -W depend=afterany:4425 /home/lluis.fita/estudios/WRFsensSFC&lt;br /&gt;
	/simulations/control/run_WRF.pbs&lt;br /&gt;
    start_time = Sat Apr  7 10:47:14 2018&lt;br /&gt;
    Walltime.Remaining = 343110&lt;br /&gt;
    start_count = 1&lt;br /&gt;
    fault_tolerant = False&lt;br /&gt;
    submit_host = node48&lt;br /&gt;
    init_work_dir = /home/lluis.fita/estudios/WRFsensSFC/simulations/control&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# The &amp;lt;code&amp;gt;PBS_O_WORKDIR&amp;lt;/code&amp;gt; variable holds the job&#039;s execution path. One has to make sure that the process (a WRF simulation in this case) is actually running (here, whether the rsl.out/error.[nnnn] files are being updated). In this case the WRF run lives in $PBS_O_WORKDIR/run&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ls -l /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
-rw-r--r-- 1 lluis.fita cima 1204224 Apr  8 19:07 /home/lluis.fita/estudios/WRFsensSFC/simulations/control/run/rsl.error.0000&lt;br /&gt;
$ date&lt;br /&gt;
Tue Apr 10 11:32:35 ART 2018&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
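This freshness check can be automated by looking for rsl files that have not been modified recently. The 30-minute threshold below is an assumption (a healthy WRF run updates its rsl.* files continuously), and the example works on a temporary directory with a fake old log instead of the real run path:

```shell
# Sketch: detect stale rsl.* logs (not modified within the last 30 minutes).
# On the cluster 'rundir' would be $PBS_O_WORKDIR/run; here we fake it.
rundir=$(mktemp -d)
touch -d '2 hours ago' "$rundir/rsl.error.0000"   # simulate a stalled run
stale=$(find "$rundir" -name 'rsl.*' -mmin +30)   # prints stale files, if any
echo "$stale"
```

An empty result means every rsl file was updated recently, i.e. the run is probably alive.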
# Clearly WRF is not running properly (the log was last updated two days ago). So one has to log into the nodes where it is running and stop the job, since the PBS queue system no longer controls this process. WRF is running (see the &amp;lt;CODE&amp;gt;exec_host&amp;lt;/CODE&amp;gt; values) on node48 and node51.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node48&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:35:40 up 5 days, 47 min,  0 users,  load average: 23.00, 22.99, 23.03&lt;br /&gt;
Tasks: 246 total,  24 running, 198 sleeping,   0 stopped,  24 zombie&lt;br /&gt;
Cpu(s): 87.4%us,  8.4%sy,  0.0%ni,  4.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 18055956k used, 14886952k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12102316k cached&lt;br /&gt;
&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
21531 lluis.fi  20   0  508m 256m  28m R 100.0  0.8   4360:00 wrf.exe                                                                                                      &lt;br /&gt;
21532 lluis.fi  20   0  503m 248m  23m R 100.0  0.8   4358:44 wrf.exe                                                                                                      &lt;br /&gt;
21537 lluis.fi  20   0  503m 247m  23m R 100.0  0.8   4359:53 wrf.exe                                                                                                      &lt;br /&gt;
21540 lluis.fi  20   0  507m 252m  23m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21545 lluis.fi  20   0  503m 249m  24m R 100.0  0.8   4359:35 wrf.exe                                                                                                      &lt;br /&gt;
21546 lluis.fi  20   0  503m 247m  22m R 100.0  0.8   4359:51 wrf.exe                                                                                                      &lt;br /&gt;
21547 lluis.fi  20   0  507m 253m  23m R 100.0  0.8   4360:04 wrf.exe                                                                                                      &lt;br /&gt;
21548 lluis.fi  20   0  507m 254m  24m R 100.0  0.8   4359:52 wrf.exe                                                                                                      &lt;br /&gt;
21549 lluis.fi  20   0  495m 241m  23m R 100.0  0.7   4359:44 wrf.exe                                                                                                      &lt;br /&gt;
21550 lluis.fi  20   0  507m 251m  23m R 100.0  0.8   4359:47 wrf.exe                                                                                                      &lt;br /&gt;
21552 lluis.fi  20   0  501m 249m  27m R 100.0  0.8   4359:49 wrf.exe                                                                                                      &lt;br /&gt;
21529 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:42 wrf.exe                                                                                                       &lt;br /&gt;
21530 lluis.fi  20   0  500m 246m  25m R 99.7  0.8   4360:05 wrf.exe                                                                                                       &lt;br /&gt;
21533 lluis.fi  20   0  507m 253m  23m R 99.7  0.8   4359:55 wrf.exe                                                                                                       &lt;br /&gt;
21534 lluis.fi  20   0  507m 254m  24m R 99.7  0.8   4360:04 wrf.exe                                                                                                       &lt;br /&gt;
21535 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:40 wrf.exe                                                                                                       &lt;br /&gt;
21536 lluis.fi  20   0  503m 250m  24m R 99.7  0.8   4359:37 wrf.exe                                                                                                       &lt;br /&gt;
21539 lluis.fi  20   0  507m 252m  24m R 99.7  0.8   4359:01 wrf.exe                                                                                                       &lt;br /&gt;
21541 lluis.fi  20   0  503m 248m  23m R 99.7  0.8   4360:07 wrf.exe                                                                                                       &lt;br /&gt;
21542 lluis.fi  20   0  500m 250m  27m R 99.7  0.8   4359:49 wrf.exe                                                                                                       &lt;br /&gt;
21543 lluis.fi  20   0  500m 244m  23m R 99.7  0.8   4359:43 wrf.exe                                                                                                       &lt;br /&gt;
21544 lluis.fi  20   0  507m 251m  23m R 99.7  0.8   4359:34 wrf.exe                                                                                                       &lt;br /&gt;
21551 lluis.fi  20   0  507m 257m  27m R 99.7  0.8   4360:02 wrf.exe        &lt;br /&gt;
[lluis.fita@node48 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node48 ~]$ top&lt;br /&gt;
top - 11:37:32 up 5 days, 49 min,  0 users,  load average: 19.46, 22.23, 22.79&lt;br /&gt;
Tasks: 199 total,   1 running, 197 sleeping,   0 stopped,   1 zombie&lt;br /&gt;
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st&lt;br /&gt;
Mem:  32942908k total, 12665572k used, 20277336k free,     3972k buffers&lt;br /&gt;
Swap:        0k total,        0k used,        0k free, 12048020k cached&lt;br /&gt;
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       &lt;br /&gt;
   10 root      20   0     0    0    0 S  0.3  0.0  39:27.92 kworker/0:1                                                                                                   &lt;br /&gt;
  262 root      20   0     0    0    0 S  0.3  0.0   0:18.40 kpktgend_13                                                                                                   &lt;br /&gt;
  272 root      20   0     0    0    0 S  0.3  0.0   0:18.47 kpktgend_23                                                                                                   &lt;br /&gt;
27134 lluis.fi  20   0 15280 1256  888 R  0.3  0.0   0:00.01 top                                                                                                           &lt;br /&gt;
    1 root      20   0 25660 1736 1428 S  0.0  0.0   0:05.71 init                                                                                                          &lt;br /&gt;
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                                      &lt;br /&gt;
    3 root      20   0     0    0    0 S  0.0  0.0   0:01.17 ksoftirqd/0  &lt;br /&gt;
[lluis.fita@node48 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Repeat the procedure on as many nodes as the job occupies (in this case also &amp;lt;code&amp;gt;node51&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ ssh node51&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ killall wrf.exe&lt;br /&gt;
[lluis.fita@node51 ~]$ top&lt;br /&gt;
[lluis.fita@node51 ~]$ exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# The job reappears as &amp;lt;code&amp;gt;cancelled&amp;lt;/code&amp;gt; (letter &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt;)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[lluis.fita@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4426.hydra                 wrf_control      lluis.fita      1703:44: C larga          &lt;br /&gt;
4427.hydra                 ...nsSFC-control lluis.fita             0 R larga          &lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      654:14:4 R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      650:46:2 R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 608:46:5 R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 605:22:5 R larga          &lt;br /&gt;
4443.hydra                 WRF17O           lluis.fita             0 R larga     &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# If the job were not cancelled, manual cancellation becomes necessary&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ qdel 4426&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Other jobs that were waiting in the PBS queue are now running, since the nodes have been freed! The system&#039;s [http://scad.cima.fcen.uba.ar/ganglia/ ganglia] shows how nodes 48 and 51 shed their computing load&lt;br /&gt;
&lt;br /&gt;
[[File:Ganglia_afterZombie.png|frame|50px|Example of nodes 48 and 51 unloading after killing a &#039;zombie&#039; process]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Processes without PBS: Steps to follow II =&lt;br /&gt;
It may happen that a process is occupying a node without showing up as a job in the queue. In that case the node is executing a process that, as far as the PBS queue system is concerned, uses no resources, so the node ends up overloaded. &lt;br /&gt;
&lt;br /&gt;
In this case, when checking the cluster&#039;s Ganglia, the node appears in red even though it shows no activity (nodes 47, 50 and 53 in the previous Ganglia example). To be sure, it is necessary to log into every node of the cluster one by one and verify that it hosts no process without an associated job.&lt;br /&gt;
&lt;br /&gt;
Example with user &amp;lt;CODE&amp;gt;pzaninelli&amp;lt;/CODE&amp;gt;&lt;br /&gt;
&lt;br /&gt;
# Look at the running jobs&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat&lt;br /&gt;
Job id                    Name             User            Time Use S Queue&lt;br /&gt;
------------------------- ---------------- --------------- -------- - -----&lt;br /&gt;
4432.hydra                 wrf_control50k   pzaninelli      1248:26: R larga          &lt;br /&gt;
4433.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4438.hydra                 wrf_phy1         pzaninelli      1244:57: R larga          &lt;br /&gt;
4439.hydra                 run_experiment   pzaninelli             0 H larga          &lt;br /&gt;
4440.hydra                 WRF17O           victoria.gallig 1201:37: R larga          &lt;br /&gt;
4441.hydra                 WRF17O           victoria.gallig 1197:29: R larga          &lt;br /&gt;
4445.hydra                 wrf_control      lluis.fita      593:41:0 R larga          &lt;br /&gt;
4446.hydra                 ...nsSFC-control lluis.fita             0 H larga          &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# Determine the nodes and execution path of all of the user&#039;s jobs.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4432&lt;br /&gt;
Job Id: 4432.hydra&lt;br /&gt;
    Job_Name = wrf_control50k&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node46/23+node46/22+node46/21+node46/20+node46/19+node46/18+no&lt;br /&gt;
	de46/17+node46/16+node46/15+node46/14+node46/13+node46/12+node46/11+no&lt;br /&gt;
	de46/10+node46/9+node46/8+node46/7+node46/6+node46/5+node46/4+node46/3&lt;br /&gt;
	+node46/2+node46/1+node46/0+node47/23+node47/22+node47/21+node47/20+no&lt;br /&gt;
	de47/19+node47/18+node47/17+node47/16+node47/15+node47/14+node47/13+no&lt;br /&gt;
	de47/12+node47/11+node47/10+node47/9+node47/8+node47/7+node47/6+node47&lt;br /&gt;
	/5+node47/4+node47/3+node47/2+node47/1+node47/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
(...)&lt;br /&gt;
[pzaninelli@hydra ~]$ qstat -f 4438&lt;br /&gt;
Job Id: 4438.hydra&lt;br /&gt;
(...)&lt;br /&gt;
    exec_host = node49/23+node49/22+node49/21+node49/20+node49/19+node49/18+no&lt;br /&gt;
	de49/17+node49/16+node49/15+node49/14+node49/13+node49/12+node49/11+no&lt;br /&gt;
	de49/10+node49/9+node49/8+node49/7+node49/6+node49/5+node49/4+node49/3&lt;br /&gt;
	+node49/2+node49/1+node49/0+node50/23+node50/22+node50/21+node50/20+no&lt;br /&gt;
	de50/19+node50/18+node50/17+node50/16+node50/15+node50/14+node50/13+no&lt;br /&gt;
	de50/12+node50/11+node50/10+node50/9+node50/8+node50/7+node50/6+node50&lt;br /&gt;
	/5+node50/4+node50/3+node50/2+node50/1+node50/0&lt;br /&gt;
(...)&lt;br /&gt;
	PBS_O_WORKDIR=/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# This yields:&lt;br /&gt;
* 4432: uses nodes 46 and 47 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&lt;br /&gt;
* 4438: uses nodes 49 and 50 and runs in /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1&lt;br /&gt;
# The steps to follow for each node are given below. Every node has to be entered, since there is no other way to know which processes are running on it&lt;br /&gt;
## Log into the node&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ssh [nombreNodo]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
## Check the processes&lt;br /&gt;
### If the name of the possibly zombie application is unknown&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep $USER&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### If the application name is known (in this example, the &amp;lt;code&amp;gt;WRF&amp;lt;/code&amp;gt; model)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
ps -ef | grep [aplicación]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Stop the processes that do not belong (orphaned from any PBS job)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
kill -9 [procesoID]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
### Leave the node and move on to the next one&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
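The per-node inspection steps above can be scripted. The sketch below only prints the command it would run on each node (a dry run), so nothing is killed by accident; the node names and the application name are taken from this example and are otherwise assumptions:

```shell
# Dry-run sketch of the loop over nodes: build the inspection command for
# each node instead of executing it over ssh.
app=wrf.exe
cmds=$(for node in node46 node47 node49 node50; do
    echo "ssh $node 'ps -ef | grep $app | grep -v grep'"
done)
echo "$cmds"
```

Dropping the outer &amp;lt;CODE&amp;gt;echo&amp;lt;/CODE&amp;gt; and running the generated commands would reproduce the manual steps, but the decision of which PIDs to kill should remain manual.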
# After checking nodes 40 to 46, entering node 47&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[pzaninelli@hydra ~]$ ssh node47&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     17995     1 74 Apr07 ?        3-00:53:43 ./wrf.exe&lt;br /&gt;
1561     17996     1 74 Apr07 ?        3-00:53:36 ./wrf.exe&lt;br /&gt;
1561     17997     1 74 Apr07 ?        3-00:49:04 ./wrf.exe&lt;br /&gt;
1561     17998     1 74 Apr07 ?        3-00:53:03 ./wrf.exe&lt;br /&gt;
1561     17999     1 74 Apr07 ?        3-00:57:12 ./wrf.exe&lt;br /&gt;
1561     18000     1 74 Apr07 ?        3-00:47:14 ./wrf.exe&lt;br /&gt;
1561     18001     1 74 Apr07 ?        3-00:50:55 ./wrf.exe&lt;br /&gt;
1561     18002     1 74 Apr07 ?        3-00:51:08 ./wrf.exe&lt;br /&gt;
1561     18003     1 74 Apr07 ?        3-00:53:33 ./wrf.exe&lt;br /&gt;
1561     18005     1 74 Apr07 ?        3-00:56:54 ./wrf.exe&lt;br /&gt;
1561     18006     1 74 Apr07 ?        3-00:49:51 ./wrf.exe&lt;br /&gt;
1561     18007     1 74 Apr07 ?        3-00:51:28 ./wrf.exe&lt;br /&gt;
1561     18008     1 74 Apr07 ?        3-00:53:13 ./wrf.exe&lt;br /&gt;
1561     18009     1 74 Apr07 ?        3-00:49:40 ./wrf.exe&lt;br /&gt;
1561     18010     1 74 Apr07 ?        3-00:52:07 ./wrf.exe&lt;br /&gt;
1561     18011     1 74 Apr07 ?        3-00:52:12 ./wrf.exe&lt;br /&gt;
1561     18012     1 74 Apr07 ?        3-00:54:05 ./wrf.exe&lt;br /&gt;
1561     18013     1 74 Apr07 ?        3-00:52:44 ./wrf.exe&lt;br /&gt;
1561     18014     1 74 Apr07 ?        3-00:49:45 ./wrf.exe&lt;br /&gt;
1561     18015     1 74 Apr07 ?        3-00:48:31 ./wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:16:12 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:07:45 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:50:56 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:07:46 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:33:20 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:35:49 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:26:10 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:15:46 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:05:37 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:50:07 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-04:57:29 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:08:26 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:17:23 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-04:57:51 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-04:59:43 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:00:49 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:55:14 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:21:52 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:28:01 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:36:03 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:12:05 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:35:22 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:46:58 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:16:21 ./wrf.exe&lt;br /&gt;
1561     27514 27474  0 12:30 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
# A &#039;zombie&#039; process will show a very long running time. Two distinct groups of processes can be seen: &amp;lt;code&amp;gt;3-00:53:43 ./wrf.exe&amp;lt;/code&amp;gt; (3 days and 53 minutes) and &amp;lt;code&amp;gt;1-04:50:56 ./wrf.exe&amp;lt;/code&amp;gt; (1 day, 4 hours and 50 minutes)&lt;br /&gt;
# Check from where each process is running&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 17995&lt;br /&gt;
17995: /home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&lt;br /&gt;
&lt;br /&gt;
[pzaninelli@node47 ~]$ pwdx 23427&lt;br /&gt;
23427: /home/pzaninelli/workdir/SENSHeatWave03/sims/control50k/run&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
# From the previous analysis, node 47 only hosts the PBS job &amp;lt;code&amp;gt;4432&amp;lt;/code&amp;gt;, which runs in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/control50k&amp;lt;/code&amp;gt;. Therefore the group of processes (PIDs 17995 to 18015) running in &amp;lt;code&amp;gt;/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run&amp;lt;/code&amp;gt; are zombies and can now be removed&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ kill -9 $(seq 17995 18015)&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
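A somewhat safer alternative to killing a hand-typed PID range is to select the processes by their working directory, reusing the same &amp;lt;CODE&amp;gt;pwdx&amp;lt;/CODE&amp;gt; check as above. This is a sketch assuming &amp;lt;CODE&amp;gt;pgrep&amp;lt;/CODE&amp;gt; and &amp;lt;CODE&amp;gt;pwdx&amp;lt;/CODE&amp;gt; (procps) are available on the nodes; &amp;lt;CODE&amp;gt;cwd_of&amp;lt;/CODE&amp;gt; is a helper name introduced here:

```shell
# Helper (hypothetical name): working directory of a PID, parsed from pwdx,
# whose output has the form 'PID: /some/path'.
cwd_of() { pwdx "$1" 2>/dev/null | awk '{print $2}'; }

# Kill only the wrf.exe processes running from the orphaned run directory
# identified with pwdx, instead of a raw PID range.
phy1dir=/home/pzaninelli/workdir/SENSHeatWave03/sims/phy1/run
for pid in $(pgrep -u "$(id -un)" -x wrf.exe); do
    if [ "$(cwd_of "$pid")" = "$phy1dir" ]; then
        kill -9 "$pid"
    fi
done
```

If no &amp;lt;CODE&amp;gt;wrf.exe&amp;lt;/CODE&amp;gt; process matches the path, the loop does nothing, which makes accidentally killing the healthy PBS-controlled run less likely.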
# Searching for processes now shows only those that depend on the PBS job&lt;br /&gt;
&amp;lt;PRE&amp;gt;&lt;br /&gt;
[pzaninelli@node47 ~]$ ps -ef | grep wrf.exe&lt;br /&gt;
1561     23401 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23402 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23403 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23404 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23405 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23406 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23407 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23408 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23409 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23410 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23411 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23412 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23413 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23414 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23415 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23416 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23417 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23418 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23419 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23420 23401 56 Apr09 ?        1-05:20:01 ./wrf.exe&lt;br /&gt;
1561     23421 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23422 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23423 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23424 23402 55 Apr09 ?        1-05:11:40 ./wrf.exe&lt;br /&gt;
1561     23425 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23426 23390  0 Apr09 ?        00:00:00 /bin/bash ./launch_pbs.bash ./wrf.exe&lt;br /&gt;
1561     23427 23404 55 Apr09 ?        1-04:54:40 ./wrf.exe&lt;br /&gt;
1561     23428 23405 55 Apr09 ?        1-05:11:37 ./wrf.exe&lt;br /&gt;
1561     23429 23406 56 Apr09 ?        1-05:37:15 ./wrf.exe&lt;br /&gt;
1561     23430 23417 56 Apr09 ?        1-05:40:07 ./wrf.exe&lt;br /&gt;
1561     23431 23407 56 Apr09 ?        1-05:30:09 ./wrf.exe&lt;br /&gt;
1561     23432 23421 56 Apr09 ?        1-05:19:26 ./wrf.exe&lt;br /&gt;
1561     23433 23415 55 Apr09 ?        1-05:09:19 ./wrf.exe&lt;br /&gt;
1561     23434 23403 55 Apr09 ?        1-04:54:05 ./wrf.exe&lt;br /&gt;
1561     23435 23426 55 Apr09 ?        1-05:01:22 ./wrf.exe&lt;br /&gt;
1561     23436 23419 55 Apr09 ?        1-05:12:17 ./wrf.exe&lt;br /&gt;
1561     23437 23414 56 Apr09 ?        1-05:21:10 ./wrf.exe&lt;br /&gt;
1561     23438 23409 55 Apr09 ?        1-05:01:46 ./wrf.exe&lt;br /&gt;
1561     23439 23411 55 Apr09 ?        1-05:03:41 ./wrf.exe&lt;br /&gt;
1561     23440 23408 55 Apr09 ?        1-05:04:38 ./wrf.exe&lt;br /&gt;
1561     23441 23410 55 Apr09 ?        1-04:58:56 ./wrf.exe&lt;br /&gt;
1561     23442 23423 56 Apr09 ?        1-05:25:47 ./wrf.exe&lt;br /&gt;
1561     23443 23422 56 Apr09 ?        1-05:31:58 ./wrf.exe&lt;br /&gt;
1561     23444 23413 56 Apr09 ?        1-05:40:00 ./wrf.exe&lt;br /&gt;
1561     23445 23425 55 Apr09 ?        1-05:16:06 ./wrf.exe&lt;br /&gt;
1561     23446 23412 56 Apr09 ?        1-05:39:17 ./wrf.exe&lt;br /&gt;
1561     23447 23418 57 Apr09 ?        1-05:50:49 ./wrf.exe&lt;br /&gt;
1561     23448 23416 56 Apr09 ?        1-05:20:11 ./wrf.exe&lt;br /&gt;
1561     27524 27474  0 12:37 pts/0    00:00:00 grep wrf.exe&lt;br /&gt;
&amp;lt;/PRE&amp;gt;&lt;br /&gt;
# Repeat the process on the remaining nodes and watch how their workload decreases&lt;br /&gt;
&lt;br /&gt;
[[File:GangliaAfterZombie.png|frame|50px|Example of nodes 47, 50 and 53 unloading after killing a &#039;zombie&#039; process]]&lt;/div&gt;</summary>
		<author><name>Pzaninelli</name></author>
	</entry>
</feed>