[dmccune@pppl.gov: "rmonitor" database -- run monitoring]

From: Doug Mccune (dmccune@pppl.gov)
Date: Thu Oct 11 2001 - 15:55:45 PDT


------- Start of forwarded message -------
Date: Thu, 11 Oct 2001 18:15:18 -0400 (EDT)
From: Doug Mccune <dmccune@pppl.gov>
To: nfs-workers@apollo.gat.com
Subject: "rmonitor" database -- run monitoring
Reply-to: dmccune@pppl.gov

Hi folks,

I am going to send two messages. The first is a description of our
currently used but rather ancient TRANSP run queue monitoring software
database (really just a collection of ascii files). The second message
will describe the scheme currently used to control access to TRANSP
compute services in this, our legacy production system.

It is to be understood that this is not the be all / end all of queueing
systems ... rather, just a data point for our Collaboratory design
process...

.............................................

First message: run monitoring software...

Background: the legacy TRANSP system has a hand built queue manager that
farms runs out to a small set of dedicated servers (about 10 workstations).
Users can ENQUEUE runs into this system, and monitor their progress;
in addition users can:

  -- DEQUEUE -- remove a run that has not yet started execution
  -- HALT -- stop a run that is executing; leave in *aborted* state.
  -- ARCHIVE -- stop and archive a partially completed run
  -- DELETE -- stop and delete a run
  -- LOOK -- fetch interim output from a run, leaving it running
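
As a rough sketch of how these commands relate to run state, the table
below records which status each command applies to, using the status
codes given later in this message. The names Command, ALLOWED and
is_allowed are made up for illustration, and treating all three
in-progress phases as "executing" is an assumption. The message does not
say whether ARCHIVE or DELETE also apply to aborted or completed runs,
so only the stated cases are covered.

```python
from enum import Enum

# Status codes as listed in this message.
ABORTED, QUEUED, PREPROCESSING, EXECUTING, POSTPROCESSING, COMPLETED = -1, 0, 1, 2, 3, 4

# Assumption: any of the three in-progress phases counts as "executing".
IN_PROGRESS = {PREPROCESSING, EXECUTING, POSTPROCESSING}


class Command(Enum):
    DEQUEUE = "dequeue"   # remove a run that has not yet started
    HALT = "halt"         # stop a run; leaves it in the *aborted* state
    ARCHIVE = "archive"   # stop and archive a partially completed run
    DELETE = "delete"     # stop and delete a run
    LOOK = "look"         # fetch interim output, run keeps going


# Hypothetical lookup table: statuses in which each command makes sense.
ALLOWED = {
    Command.DEQUEUE: {QUEUED},
    Command.HALT: IN_PROGRESS,
    Command.ARCHIVE: IN_PROGRESS,
    Command.DELETE: IN_PROGRESS,
    Command.LOOK: IN_PROGRESS,
}


def is_allowed(command, status):
    """Return True if the command applies to a run in the given status."""
    return status in ALLOWED[command]
```

For example, is_allowed(Command.DEQUEUE, QUEUED) is true, while
is_allowed(Command.DEQUEUE, EXECUTING) is false.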

To check on their runs, users run RMONITOR, a little command line
program that shows run status. For Collaboratory
purposes it would make sense to have something Java-esque and accessible
from anywhere using a web browser...

Here are the fields used in the current implementation ... plus a few
that I might add if TRANSP were a parallel code or if I had more time
to work on this now. The lengths of character strings and other details
can be thought of as peculiar to TRANSP, but are included for completeness.

<runid> -- a short (~10 char) character string identifying the TRANSP run

<project> -- a very short character string identifying the "project", in
             this context a tokamak experiment -- 4 characters

<username>-- a short (~10 char) character string identifying the user

<status> -- an integer code with allowed values -1 <= status <= 4
             interpretation:

              -1: "*aborted*"
               0: "Queued"
               1: "preprocessing"
               2: "executing"
               3: "postprocessing"
               4: "*completed*"

   if status==0 (Queued but not yet executing) there are the following
   additional fields:

      <priority> -- an integer number between 1 and 9 -- 5 is "normal"

      <date-time> -- date-time the run was queued

         (queueing order is by priority, then by order of submission)

      <options> -- character string of text options
         (can be used e.g. to select a specific server or class of servers)
         (~80 characters)

      <comments> -- 0 or more user supplied lines of 80 character text:
         --if high priority is requested, a non-null comment must be
           supplied. Other users running the "rmonitor" software can
           see who the user is, and read such comments which are meant
           to explain why high priority is required.

   if status<>0, the run at least started some phase of execution, and the
   following additional fields are defined:

      <server-id> -- a short character string identifying the "server"
         or "queue" in which the run is/was executing (10 characters)

      <physics-time> -- physical time to which the (time dependent)
         simulation has advanced-- real number, seconds.

      (things I would like to add):

      <ncpu> -- (would add this if TRANSP were parallelized) number of
         processors that are/were used.

      <cpu-time> -- (real number) -- elapsed cpu time / processor, hours

      <wall-time> -- (real number) -- elapsed wall clock time, hours

   if status==-1 this means the run crashed. Current users of the legacy
   system have other ways of determining this, but for a generic system
   there should probably be

      <error-string> -- 80 characters

      which would give a very terse characterization of the error, e.g.

         "crashed during execution -- see logfile"

      or

         "MDSplus error, input tree node not found: <nodename>"

   for the Collaboratory we might also want something bigger, like

      <log-tail>, the last 100 lines of the logfile of the aborted run.

   ...that's it...
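
To make the field list concrete, here is a minimal sketch of one record
as a Python dataclass. It is not how the legacy Fortran code stores
anything. The class and attribute names, the problems() check, and the
rule that a priority above 5 counts as "high" are all assumptions made
for illustration; the status codes and labels are the ones listed above.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Status(IntEnum):
    ABORTED = -1
    QUEUED = 0
    PREPROCESSING = 1
    EXECUTING = 2
    POSTPROCESSING = 3
    COMPLETED = 4


# Display strings as listed in the message.
STATUS_LABELS = {
    Status.ABORTED: "*aborted*",
    Status.QUEUED: "Queued",
    Status.PREPROCESSING: "preprocessing",
    Status.EXECUTING: "executing",
    Status.POSTPROCESSING: "postprocessing",
    Status.COMPLETED: "*completed*",
}

NORMAL_PRIORITY = 5  # "5 is normal"; treating > 5 as "high" is an assumption


@dataclass
class RunRecord:
    runid: str
    project: str
    username: str
    status: Status
    # Defined while the run is queued (status == 0).
    priority: Optional[int] = None
    queued_at: Optional[str] = None
    options: str = ""
    comments: List[str] = field(default_factory=list)
    # Defined once the run has started some phase of execution.
    server_id: Optional[str] = None
    physics_time: Optional[float] = None   # seconds
    # Wish-list fields from the message.
    ncpu: Optional[int] = None
    cpu_time_hours: Optional[float] = None
    wall_time_hours: Optional[float] = None
    error_string: Optional[str] = None
    log_tail: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of rule violations (empty if none were found)."""
        found = []
        if self.status == Status.QUEUED:
            if self.priority is None or not 1 <= self.priority <= 9:
                found.append("priority must be between 1 and 9")
            elif (self.priority > NORMAL_PRIORITY
                  and not any(c.strip() for c in self.comments)):
                found.append("high priority requires a comment")
        elif self.server_id is None:
            found.append("a started run needs a server-id")
        return found
```

A queued run with priority 9 and no comment would be flagged by
problems(), matching the rule that a high-priority request must be
explained.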

I should mention that the ancient VMS based "rmonitor" program is just
using a few small ascii files for its "database"... this old fortran code
is definitely not scalable to the big time. The only reason it is
interesting is that the system has been used successfully to monitor a
multi-user multi-project (and recently, multi-institutional) shared
resource (a collection of Unix workstations employed as TRANSP compute
servers) for over 10 years-- the users have liked it.

Typical output:

BIRCH$ rmonitor dq

...date of this display: 11-OCT-2001 17:42:46.90
RUN             OWNER         STATUS           DETAILS
=============== ============= ================ ================================
38975A01 JT60   HILL          *completed*      BIRCH,
38975A02 JT60   HILL          executing        MARS, RESTART at 7.920 secs
===============
25196A03 TORS   BUDNY         executing        TYR, RESTART at 10.40 secs
===============
106242A02 NSTX  ERIC          executing        SATURN, RESTART at 0.2752 secs
106244A06 NSTX  ERIC          *completed*      BIRCH,
   ...default queueing is to UNIX machines.
[USER ACKNOWLEDGE - HIT ANY KEY]

BIRCH$

(BIRCH is an old slow VMS machine. One of the reasons I am looking forward
to the collaboratory is I would like to get off BIRCH).

the fields seen in this display are:

 <runid> <project> <user> <status> <server> <physics-time>
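
For anyone wanting to scrape this display into those fields, a minimal
sketch follows. It only handles the two row shapes in the sample output
above and assumes the fields are separated by whitespace; DisplayRow and
parse_row are hypothetical names, not part of rmonitor.

```python
import re
from typing import NamedTuple, Optional


class DisplayRow(NamedTuple):
    runid: str
    project: str
    user: str
    status: str
    server: str
    physics_time: Optional[float]   # seconds; None if the row shows no time


# Matches the "... at 7.920 secs" part of the DETAILS column.
_TIME = re.compile(r"at\s+([0-9.]+)\s+secs")


def parse_row(line: str) -> DisplayRow:
    """Split one data row of the sample display into its fields."""
    runid, project, user, status, details = line.split(None, 4)
    server = details.split(",", 1)[0].strip()
    match = _TIME.search(details)
    physics_time = float(match.group(1)) if match else None
    return DisplayRow(runid, project, user, status, server, physics_time)
```

Calling parse_row on the MARS row above gives server "MARS" and a
physics time of 7.92 seconds; the *completed* rows give no time.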

Because not all servers are engaged, there are no runs queued.
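
Had runs been queued, they would be ordered by the rule given with the
<priority> field: by priority, then by order of submission. A minimal
sketch of that ordering is below. It assumes a larger priority number is
served first (the message does not say which direction is "high") and
uses a hypothetical integer sequence number in place of <date-time>.

```python
from typing import List, NamedTuple


class QueuedRun(NamedTuple):
    runid: str
    priority: int    # 1..9, 5 is "normal"
    submitted: int   # hypothetical sequence number standing in for <date-time>


def queue_order(runs: List[QueuedRun]) -> List[QueuedRun]:
    """Order runs by priority, then by order of submission.

    Assumption: a larger priority number means the run is served sooner.
    """
    return sorted(runs, key=lambda r: (-r.priority, r.submitted))
```

With this rule, a priority 9 run submitted last still goes ahead of two
priority 5 runs, and the two priority 5 runs keep their submission order.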

Projects are: JT60, a Japanese tokamak; TORS (Tore-Supra), a French
tokamak; and NSTX, a PPPL compact tokamak. The owners are PPPL physicists
working on NSTX or collaborations. Sometimes we also have CMOD runs owned
by "TRANSP" -- these are runs that were requested by CMOD software and
connected to the legacy system by using anonymous ftp, scripts, and MDSplus.
One of the weaknesses of the CMOD implementation is that they cannot run
RMONITOR conveniently from MIT, so... I get calls like the one this
afternoon from MIT physicist Paul Bonoli, who wanted to know what happened
to one of his runs. Turns out it got caught by an anonymous ftp server
crash that happened yesterday...

Users have indicated to me that they very much like the compactness of the
RMONITOR output, allowing a quick overview of the situation. "rmonitor"
options allow examination of further details.
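
One further detail worth having is the <log-tail> field suggested
earlier, the last 100 lines of the logfile of an aborted run. A minimal
sketch of collecting it is below; log_tail is a hypothetical helper, not
an existing rmonitor option.

```python
from collections import deque
from typing import Iterable, List


def log_tail(lines: Iterable[str], n: int = 100) -> List[str]:
    """Return the last n lines, reading the input only once."""
    # A bounded deque discards older lines as newer ones arrive.
    return list(deque(lines, maxlen=n))
```

An open file can be passed straight in, e.g. log_tail(open(path)) for
some logfile path, since a file object iterates over its lines.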

A java program that showed such a summary but allowed one to click on
"*aborted*" and get <log-tail> would be really nice...!

-------------------------Doug

  Doug McCune, co-head, PPPL Computational Plasma Physics Group
------- End of forwarded message -------

===============================================================================

This message was sent to the SciDAC National Fusion Collaboratory (NFC)
workers list nfc-workers. Visit the Collaboratory at
<http://www.fusiongrid.org/>.

David P. Schissel: <schissel@fusion.gat.com> <http://fusion.gat.com/~schissel/>


