EUGridPMA/IGTF Nagios Monitor v 0.1b
====================================

Last revision: 29 Sept 2005
Jan Jona Javorsek
jona.javorsek@ijs.si


STATUS

This package is a collection of plugins, support scripts and
configurations for the Nagios monitoring application, used to monitor
the EUGridPMA/IGTF CA infrastructure. It is provided as is, with no
warranty and not enough testing.

At this stage it is meant as a public test of the facility. Many
settings and scripts are improvised and inherently brittle. You are
invited to review the code and to make improvements and suggestions.
Many scripts will have to be reimplemented properly to avoid design
pitfalls or external dependencies.


PURPOSE

This is intended as a public service for participating PMAs, CAs and
relying parties. The source, such as it is, is available, and the
server should contain only public data. (Local networking could be
hidden from the guest user.) But any change to the parameters of this
server can turn it into a denial-of-service tool, so only server
administrators and those receiving direct notifications from the
server should be allowed access to the command interface.

Several of the checks (i.e. CA certificate data, proposed CRL URLs,
signing_policy data) could and should be implemented as self-control
checks in the PMA distribution build. But I consider a public
re-check, accessible to relying parties both as a service and as
source, a good thing. It adds more hooks for warnings, more
transparency and different implementations of the inevitable bugs.

Please remember that no amount of automated checking can prevent the
usual human stupidity, and that this service will be wrong more often
than the CA and PMA maintainers. Your mileage may vary.

Finally, this is meant to complement the existing cron-based CA
monitor service by Min Tsai at TWGrid:

    http://goc.grid.sinica.edu.tw/camonitor/

or, alternatively:

    http://www.eugridpma.org/sinica/
    http://goc.grid.sinica.edu.tw:8080/

If these services differ, the Nagios one is probably wrong at this
stage.


REQUIREMENTS

This package is prepared for a recent version of Nagios. It was tested
with Version 2.0b3 (April 03, 2005). You should install or compile
Nagios before going any further. Note that this setup is not meant to
be used with a pre-existing Nagios setup.

The plugins require Perl and a binary version of Debian's dpkg. It is
used by the uscan Perl script for version comparison and will be
phased out as soon as possible. Source and an x86 binary are included
with the package.


INSTALLATION

Warning: Make-based configurable installation is still missing!

You should manually install the setup files from nagios-setup/ and the
plugins from nagios-plugins/.

You should edit at least the main config file (nagios.cfg, to set up
locations, log files, permissions, authentication etc.) and the local
config (local.cfg: you must set up your own network infrastructure,
and you probably want warning messages to be sent only to you at this
stage).

You should set your Nagios directories and other settings in the
setup/makecaconfs.sh script and run:

    cd setup; ./makecaconfs.sh

If all went well, the default Nagios conf files were installed, the
latest EUGridPMA distribution was downloaded, unpacked and installed
in the Nagios configuration directory, and a number of CA config files
were created from it.

There are sure to be errors - this is beta software. Try running
Nagios to debug the configuration, like this:

    /path/to/nagios/bin/nagios -v /path/to/nagios/etc/nagios.cfg

For further debugging, it is best to look at the log. Its location is
specified in the main configuration file:

    $ grep nagios\.log /path/to/nagios/etc/nagios.cfg

Then set up your web server.
There are many examples for setting up Apache in the Nagios
documentation, and some examples are under examples/apache/ XXX in
this distribution.


WARNINGS

This is a low-frequency setup where approximately 4 checks per day are
made for services that are UP. If a problem is detected, a number of
fast consecutive checks are made before it is reported. This means
that you should probably not change any timings, and should only run
this service on a regular basis by prior arrangement with the people
on the mailing list.

If you run this service, please make sure that the command interface
is not open to the general public - it can turn the server into an
unwanted denial-of-service tool.

If you are only interested in service availability, you should
probably use either the cron-based CA monitor at
http://goc.grid.sinica.edu.tw/camonitor/ or the current beta Nagios
monitor at http://signet-ca.ijs.si/nagios/.

If you are planning to test and contribute to this package, you have
come to the right place!


FEATURES

* Accredited repository interface: Nagios checks the repository for
  server response and new versions - a warning is issued when a newer
  version is released (or if something is wrong with the repository,
  i.e. the current version is not available any more etc.).
* Autoupgrade: a handler is run when a new version is released. At
  this time, the distribution is not yet actually downloaded and
  installed, but this feature is planned. (Actually just calling
  wget URL; make upgrade; should suffice. This is not fully supported
  yet!)
* Local connectivity support: local infrastructure is checked so that
  network connectivity failures are not reported as CA-related
  problems.
* Accredited repository used for conf-base: individual CA conf files
  are created based on the repository.
* Data presentation: CA short name, hash, DN, key validity, CRL time
  to live and length are presented on the Service Overview page for
  easy searching.
* CA Certificate check: the CA certificate is parsed and its
  startdate, enddate and DN are extracted. If the validity is longer
  than the currently recommended maximum of 20 years, a note is added
  to the report (too long).
* CA time to live: the enddate is checked periodically. Warning and
  critical notifications are issued 30 and 5 days before the end of
  validity, respectively.
* CRL repository check: the CRL is pulled from its location
  periodically and parsed, and its signature is validated against the
  corresponding CA certificate.
* CRL validity check: a note is printed for CRLs with a validity of
  over 31 days (too long) or under 25 days (short). A warning is
  issued when the remaining CRL validity reaches 7 days, and a
  critical notification when it reaches one day.
* CRL with shorter validity support: if a CRL has a validity under 25
  days, corresponding fractions of its validity are used for the
  warning and critical notifications. This is over one day, so we will
  probably have to change that to avoid warnings for NIIF, Cesnet etc.
  (But one day and a half seems reasonable compared to 7 days for
  30-day CRLs.)


MISSING FEATURES

* Only EUGridPMA CAs are included - other members of IGTF could be
  added (but we are waiting for common repositories/distribution
  formats etc. - lazy).
* Nothing is done with signing_policy data: DN consistency could be
  checked, signing namespaces displayed, and revoked DNs in CRLs
  checked for inconsistencies with signing namespaces (harder).
* The CRL URL is not extracted from certificates, is not compared with
  the URL in the repository, and is not added as a secondary test if
  there is more than one.
* No support for autoresponders (but no certificates contain responder
  data at this time).
* CRL version and extensions are not displayed.
* If multiple repositories get implemented, merge them in the system.
  (Can of worms.)
* No data on responsible admins, no notifications for specific CA
  admins, no access control for specific CA admins etc. It would be
  nice if responsible admins could control notification, flag for
  scheduled downtime etc.
  for their own services (using the Nagios feature
  allow-notified-user, for example).


NOTIFICATIONS AND LOGGING

Currently, no individual contacts in CAs are implemented - only
infrastructure and PMA contacts are used, and these are hardwired in
the 'local' configuration.

Instead of direct notifications, a mailing list is available at the
following address: (or via Mailman web access at
http://mailman.ijs.si/mailman/listinfo/grid-ca-monitor). It is
archived, accepts mailman commands, is not moderated, accepts any
subscribers and only accepts posts from the Nagios service.

A less reliable but more detailed list of events is available in the
Event Log under the section Reporting in the server's web interface.

If you want to receive notifications, simply subscribe to the list.

We are hoping to provide individual access control, a command CGI
interface and individual notifications to CA administrators in the
future. Considerable support for this already exists in the Nagios
application, so this is not considered difficult. These are actually
also the only potential advantages over the existing CA monitor.

(** Any suggestions on how this could be done without manually adding
contact data and access control mechanisms are welcome - please note
that we cannot rely on certificates for run-time access to the
service, since administrators might wish to access the service
precisely because the certificates are not being accepted for some
reason, but they are of course OK when setting up the account. **)


ADVANTAGES (FOR THE FUTURE)

While the Nagios approach is much more complicated than
straightforward scripts, cron jobs and web pages, it could offer some
advantages.

* notifications: direct and controlled e-mail or SMS notifications can
  be provided for faster response
* remote commands: an administrator could announce downtime of a site
  or request re-scheduling of a check
* scalability: Nagios can scale and scale well.
  Distributed monitoring could be implemented when the PMAs grow -
  each PMA can run its own services, or regional services can be run,
  but all Nagios servers can display aggregated information from the
  whole federation by running replicas of a "central server". (See
  http://nagios.sourceforge.net/docs/2_0/distributed.html for more
  info.)

None of these is implemented at this time.


BUGS

* The mailing list might not be up at this moment.
* tmpdirs: check_crl and check_distro seem to be leaving tmp dirs
  around for no reason.
* XXX make upgrade not implemented; move accredited distribution
  installation to a local directory.
* Host checks are implemented with http - pings are assumed to be
  blocked by firewalls way too often to be used outside the local
  network. A smarter way of using http, https, ftp and ssh
  sequentially is planned. (And a parse of traceroute showing the
  blocking point would be nice - or at least the traceroute CGI.)
* uscan depends on the dpkg binary; the dpkg binary reports errors on
  missing Debian infrastructure.
* Plugins use external shell commands where Perl and OpenSSL APIs
  could be used.


REFERENCES

* Experimental Nagios CA Monitor by SiGNET CA,
  http://signet-ca.ijs.si/nagios/
* Experimental Nagios CA Monitor mailing list,
  http://mailman.ijs.si/mailman/listinfo/grid-ca-monitor
* CA Monitor by Min Tsai at TWGrid,
  http://goc.grid.sinica.edu.tw/camonitor/
* EU Grid Policy Management Authority (EUGridPMA),
  http://www.eugridpma.org/
* International Grid Trust Federation (IGTF),
  http://www.gridpma.org/, http://www.eugridpma.org/igtf/
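

EXAMPLE: CRL VALIDITY THRESHOLDS

The CRL validity rules listed under FEATURES can be sketched in a few
lines. This is an illustration only, not the code of the check_crl
plugin: the plugins in this package are Perl, while Python is used
here for brevity. The function name crl_status, the constants and the
message format are made up for this example. The scaling of the
thresholds for short-lived CRLs (7/30 and 1/30 of the CRL lifetime) is
one possible reading of "corresponding fractions" and is an
assumption. The return codes 0, 1 and 2 are the usual Nagios plugin
codes for OK, WARNING and CRITICAL.

```python
from datetime import datetime, timedelta

# Figures quoted under FEATURES (all in days).
WARN_DAYS = 7.0      # warning when this much validity remains
CRIT_DAYS = 1.0      # critical when this much validity remains
SHORT_DAYS = 25.0    # lifetimes below this get the note "(short)"
LONG_DAYS = 31.0     # lifetimes above this get the note "(too long)"
# Assumption: 30 days is the reference lifetime used for scaling.
NOMINAL_DAYS = 30.0

# Usual Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def crl_status(this_update, next_update, now):
    """Return (nagios_code, message) for a CRL valid from
    this_update to next_update, evaluated at time now."""
    lifetime = (next_update - this_update) / timedelta(days=1)
    remaining = (next_update - now) / timedelta(days=1)

    warn, crit, note = WARN_DAYS, CRIT_DAYS, ""
    if lifetime < SHORT_DAYS:
        # Assumed reading: scale both thresholds with the lifetime.
        scale = lifetime / NOMINAL_DAYS
        warn, crit = WARN_DAYS * scale, CRIT_DAYS * scale
        note = " (short)"
    elif lifetime > LONG_DAYS:
        note = " (too long)"

    if remaining <= crit:
        code, label = CRITICAL, "CRITICAL"
    elif remaining <= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    message = "CRL %s: %.1f days left of %.0f%s" % (
        label, remaining, lifetime, note)
    return code, message


if __name__ == "__main__":
    issued = datetime(2005, 9, 1)
    code, message = crl_status(issued,
                               issued + timedelta(days=30),
                               issued + timedelta(days=24))
    print(code, message)
```

Under this reading, a 7-day CRL gets a warning threshold of about 1.6
days, which is in line with the "one day and a half" mentioned under
FEATURES.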