                  EUGridPMA/IGTF Nagios Monitor v 0.1b
                  ====================================

                      Last revision: 29 Sept 2005

                           Jan Jona Javorsek
                          jona.javorsek@ijs.si


STATUS

This package is a collection of plugins, support scripts and
configurations for the Nagios monitoring application, used to monitor
the EUGridPMA/IGTF CA infrastructure. It is provided as is, with no
warranty and not enough testing. At this stage it is meant as a public
test of the facility. Many settings and scripts are improvised and
inherently brittle. You are invited to review the code and to make
improvements and suggestions. Many scripts will have to be
reimplemented properly to avoid design pitfalls or external
dependencies.


PURPOSE

This is intended as a public service for participating PMAs, CAs and
relying parties. The source, such as it is, is available, and the
server should contain only public data. (Local networking could be
hidden from the guest user.)

But any change to the parameters of this server can turn it into a
denial-of-service tool, so only server administrators and those
receiving direct notifications from the server should be allowed
access to the command interface.

Several of the checks (i.e. CA certificate data, proposed CRL URLs,
signing_policy data) could and should be implemented as self-control
checks for the PMA distribution building. But I consider a public
re-check, accessible to relying parties both as a service and as
source, a good thing. It adds more hooks for warnings, transparency
and different implementations of the inevitable bugs. Please remember
that no amount of automated checking can prevent the usual human
stupidity, and that this service will be wrong more often than the CA
and PMA maintainers. Your mileage may vary.

Finally, this is to complement the existing cron-based CA monitor
service by Min Tsai at TWGrid:
 
 http://goc.grid.sinica.edu.tw/camonitor/

or, alternatively,

 http://www.eugridpma.org/sinica/
 http://goc.grid.sinica.edu.tw:8080/ 

If these services differ, the Nagios one is probably wrong at this
stage.


REQUIREMENTS

This package is prepared for a recent version of Nagios - it was
tested with version 2.0b3 (April 03, 2005). You should install or
compile Nagios before going any further. Note that this setup is not
meant to be used with a pre-existing Nagios setup.

Plugins require Perl and a binary version of Debian's dpkg - it is
used by the uscan Perl script for version comparison and will be
phased out as soon as possible. Source and an x86 binary are included
with the package.
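
As an illustration of what a dpkg-free version comparison could look
like, here is a minimal sketch in Python (not part of this package,
whose plugins are Perl). It assumes that distribution versions are
purely numeric fields separated by dots or dashes, such as "1.2" or
"1.2-1"; the full Debian version ordering that dpkg implements is
more general and is not reproduced here. The function names are
hypothetical.

```python
import re


def parse_version(text):
    """Split a version such as '1.2-1' into a tuple of integers: (1, 2, 1).

    Assumption: only digits separated by '.' or '-' are supported.
    """
    parts = re.split(r"[.-]", text.strip())
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"unsupported version string: {text!r}")
    return tuple(int(part) for part in parts)


def is_newer(candidate, installed):
    """Return True if `candidate` is a later release than `installed`."""
    return parse_version(candidate) > parse_version(installed)
```

Comparing integer tuples rather than strings means that "1.10" is
treated as newer than "1.9", which a plain string comparison would
get wrong.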


INSTALLATION

Warning: Make-based configurable installation is still missing!

You should manually install the setup files from nagios-setup/ and the
plugins from nagios-plugins/.

You should edit at least the main config file (nagios.cfg - to set up
locations, log files, permissions, authentication etc) and the local
config (local.cfg - you must set up your own network infrastructure
and you probably only want any warning messages to be sent to you at
this stage).

You should set your Nagios directories and other settings in the
setup/makecaconfs.sh script and run:

 cd setup; ./makecaconfs.sh

If all went well, the default Nagios conf files were installed, the
latest EU Grid PMA distribution was downloaded, unpacked and installed
in the Nagios configuration directory, and a number of CA config files
were created from it.

There are sure to be errors - this is beta software. Try running
Nagios in verification mode to check the configuration, like this:

 /path/to/nagios/bin/nagios -v /path/to/nagios/etc/nagios.cfg

For further debugging, it is best to look at the log. Its location is
specified in the main configuration file:

 grep 'nagios\.log' /path/to/nagios/etc/nagios.cfg

Then set up your web server. There are many examples for setting up
Apache in the Nagios documentation, and some examples are under
examples/apache/ XXX in this distribution.


WARNINGS

This is a low-frequency setup where approximately 4 checks per day are
made for services that are UP. If a problem is detected, a number of
fast consecutive checks will be made before it is reported.

This means that you should probably not change any timings, and that
you should run this service on a regular basis only by prior
arrangement with the people on the <dg-eur-ca@services.cnrs.fr>
mailing list.

If you run this service, please make sure that the command interface
is not open to the general public - it can turn the server into an
unintended denial-of-service tool.

If you are only interested in service availability, you should
probably use either the cron-based CA monitor at
http://goc.grid.sinica.edu.tw/camonitor/ or the current beta Nagios
monitor at http://signet-ca.ijs.si/nagios/.

If you are planning to test and contribute to this package, you have
come to the right place!


FEATURES

* Accredited repository interface: Nagios checks the repository for
server response and new versions - a warning is issued when a newer
version is released (or if something is wrong with the repository,
i.e. the current version is not available any more etc.).

* Autoupgrade: a handler is run when a new version is released. At
this time, the distribution is not yet actually downloaded and
installed, but this feature is planned. (Actually, just calling wget
URL; make upgrade; should suffice. This is not fully supported yet!)

* Local connectivity support: local infrastructure is checked so that
network connectivity failures are not reported as CA-related problems.

* Accredited repository used for conf-base: individual CA conf files
are created based on the repository.

* Data presentation: CA short name, hash, DN, key validity, CRL time to
live and length are presented on the Service Overview page for easy
searching.

* CA Certificate check: the CA cert is parsed, and startdate, enddate
and DN are extracted. If validity is longer than the currently
recommended maximum of 20 years, a note is added to the report (too
long).

* CA time to live: enddate is checked periodically. Warning and
critical notifications are issued 30 and 5 days before end of
validity, respectively.

* CRL repository check: the CRL is pulled from its location
periodically, parsed, and its signature is validated against the
corresponding CA certificate.

* CRL validity check: a note is printed for CRLs with validity of over
31 days (too long) or under 25 days (short). A warning is issued when
CRL validity reaches 7 days, and a critical notification when it
reaches one day.

* CRL with shorter validity support: if a CRL has validity under 25
days, corresponding fractions of its validity are used for warning and
critical notifications. The resulting warning threshold is still over
one day, so we will probably have to change that to avoid warnings for
NIIF, Cesnet etc. (But a day and a half seems reasonable compared to
7 days for 30-day CRLs.)
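
The CA time-to-live and CRL validity thresholds above can be sketched
as follows. This is an illustration in Python, not the code of the
Perl plugins in this package. The function names are hypothetical,
and the scaling for short-lived CRLs (7/30 and 1/30 of the CRL
lifetime) is an assumption derived from the "7 days for 30-day CRLs"
remark; the actual plugins may compute the fractions differently. The
return values are the standard Nagios plugin exit codes.

```python
from datetime import datetime, timedelta

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def ca_status(not_after, now):
    """Classify a CA certificate by the time left before its enddate."""
    remaining = not_after - now
    if remaining <= timedelta(days=5):
        return CRITICAL
    if remaining <= timedelta(days=30):
        return WARNING
    return OK


def crl_thresholds(last_update, next_update):
    """Return (warning, critical) thresholds for the remaining CRL validity.

    Assumption: CRLs valid for under 25 days get thresholds scaled as
    7/30 and 1/30 of their lifetime; longer ones get 7 days and 1 day.
    """
    lifetime = next_update - last_update
    if lifetime < timedelta(days=25):
        return lifetime * 7 / 30, lifetime / 30
    return timedelta(days=7), timedelta(days=1)


def crl_note(last_update, next_update):
    """Return the note printed for unusual CRL validity periods."""
    lifetime = next_update - last_update
    if lifetime > timedelta(days=31):
        return "too long"
    if lifetime < timedelta(days=25):
        return "short"
    return ""


def crl_status(last_update, next_update, now):
    """Classify a CRL by the time left before its next update is due."""
    warning, critical = crl_thresholds(last_update, next_update)
    remaining = next_update - now
    if remaining <= critical:
        return CRITICAL
    if remaining <= warning:
        return WARNING
    return OK
```

Under these assumptions, a CRL issued for 7 days gets a warning
threshold of 49/30 days, which is roughly the day and a half
mentioned above.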


MISSING FEATURES

* Only EUGridPMA CAs are included - other members of IGTF could be
  added (but we are waiting for common repositories/distribution formats
  etc. - lazy)

* Nothing is done with signing_policy data: DN consistency could be
  checked, signing namespaces displayed and revoked DNs in CRLs
  checked for inconsistencies with signing namespaces (harder).

* The CRL URL from certificates is not extracted, is not compared with
  the URL in the repository, and is not added as a secondary test if
  there is more than one.

* No support for autoresponders (but no certificates contain responder
  data at this time).

* CRL version and extensions are not displayed.

* If multiple repositories get implemented, merge them in the
  system. (Can of worms.)

* No data on responsible admins, no notifications for specific
  CA-admins, no access control for specific CA admins etc. It would be
  nice if responsible admins could control notification, flag for
  scheduled downtime etc. for their own services (using Nagios feature
  allow-notified-user, for example).


NOTIFICATIONS AND LOGGING

Currently, no individual contacts in CAs are implemented - only
infrastructure and PMA contacts are used, and these are hardwired in
the 'local' configuration.

Instead of direct notifications, a mailing list is available at the
following address: <grid-ca-monitor@ijs.si> (or via the Mailman web
interface at
http://mailman.ijs.si/mailman/listinfo/grid-ca-monitor). It is
archived, accepts Mailman commands, is not moderated, accepts any
subscribers and only accepts posts from the Nagios service.

A less reliable but more detailed list of events is available in the
Event Log under the section Reporting in the server's web interface.

If you want to receive notifications, simply subscribe to the list.

We are hoping to provide individual access control, a command CGI
interface and individual notifications to CA administrators in the
future. Considerable support for this already exists in the Nagios
application, so this is not considered difficult. These are actually
also the only potential advantages over the existing CA monitor.

(** Any suggestions on how this could be done without manually adding
contact data and access control mechanisms are welcome - please note
that we cannot rely on certificates for run-time access to the
service, since administrators might wish to access the service because
the certificates are not being accepted for some reason, but they are
of course OK when setting up the account. **)


ADVANTAGES (FOR THE FUTURE)

While the Nagios approach is much more complicated than
straightforward scripts, cron jobs and web pages, it could offer some
advantages.

* notifications: direct and controlled e-mail or SMS notifications can
  be provided for faster response

* remote commands: an administrator could announce downtime of a site
  or request re-scheduling of a check

* scalability: Nagios can scale and scale well. Distributed monitoring
  could be implemented when the PMAs grow - each PMA can run its own
  services, or regional services can be run, but all Nagios servers can
  display aggregated information from the whole federation by running
  replicas of a "central server". (See
  http://nagios.sourceforge.net/docs/2_0/distributed.html for more
  info.)

None of these is implemented at this time.


BUGS

* mailing list might not be up at this moment

* tmpdirs: check_crl and check_distro seem to be leaving around tmp
  dirs for no reason.

* XXX make upgrade not implemented; move accredited distribution
  installation to a local directory.

* host checks are implemented with http - pings are assumed to be
  blocked by firewalls way too often to be used outside the local
  network. A smarter way of using http, https, ftp and ssh
  sequentially is planned. (And a parse of traceroute showing the
  blocking point would be nice - or at least the traceroute CGI.)

* uscan depends on the dpkg binary; the dpkg binary reports errors on
  missing Debian infrastructure

* plugins use external shell commands where Perl and OpenSSL APIs
  could be used
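
The planned sequential host check (http, https, ftp, then ssh) could
look roughly like the sketch below. It is written in Python for
illustration only and is not part of this package. It only tests
whether a TCP connection can be opened on each protocol's well-known
port; it does not speak the protocol itself. The function names, the
port order and the 10-second timeout are all assumptions.

```python
import socket

# Well-known ports, probed in the order the protocols are listed above.
DEFAULT_PORTS = (("http", 80), ("https", 443), ("ftp", 21), ("ssh", 22))


def tcp_probe(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def first_reachable_service(host, ports=DEFAULT_PORTS, probe=tcp_probe):
    """Return the name of the first protocol that answers, or None."""
    for name, port in ports:
        if probe(host, port):
            return name
    return None
```

A host would be reported as down only when every probe fails, that is,
when first_reachable_service() returns None. The probe argument exists
so that the logic can be exercised without network access.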


REFERENCES

* Experimental Nagios CA Monitor by SiGNET CA, http://signet-ca.ijs.si/nagios/

* Experimental Nagios CA Monitor mailing list: <igtf-monitor.ijs.si>
  http://mailman.ijs.si/mailman/listinfo/grid-ca-monitor

* CA Monitor by Min Tsai at TWGrid, http://goc.grid.sinica.edu.tw/camonitor/

* EU Grid Policy Management Authority (EUGridPMA), http://www.eugridpma.org/,
  <dg-eur-ca@services.cnrs.fr>

* International Grid Trust Federation (IGTF), http://www.gridpma.org/, 
  http://www.eugridpma.org/igtf/