.. _section-administration:

Administration
================================================================================

This document describes various recommendations and tips for successful
administration and management of the Mentat system.

.. _section-administration-configuration:

Configuration
--------------------------------------------------------------------------------

Using RAM-based filesystem for message queues
````````````````````````````````````````````````````````````````````````````````

In case you are encountering heavy IO traffic on your server, you may wish to
speed things up by using a RAM-based filesystem for the folders containing the
message queues. Please consider all the drawbacks before implementation: due to
the volatility of RAM, you will lose any data that have not yet been stored in
the database in case of a power outage or system crash.

If you choose to implement this solution, you may follow this simple procedure:

.. code-block:: shell

    # 1. Stop your receiving Warden client.

    # 2. Wait a moment for your Mentat daemons to process all remaining messages.

    # 3. Stop all Mentat daemons:
    mentat-controller.py --command stop
    mentat-controller.py --command disable

    # 4. Delete the current content of your message processing queues:
    rm -rf /var/mentat/spool/mentat-*

    # 5. Add the following line to your /etc/fstab file (adjust the size of the RAM disk as necessary):
    tmpfs /var/mentat/spool tmpfs nodev,nosuid,noexec,nodiratime,size=2048M 0 0

    # 6. Mount the newly added filesystem and check:
    /bin/mount -a
    mount | grep mentat
    df -h | grep mentat

    # 7. Start all Mentat daemons:
    mentat-controller.py --command enable
    mentat-controller.py --command start

    # 8. Start your receiving Warden client.

    # 9. Check that the IDEA messages are passing through the processing chain:
    tail -f /var/mentat/log/mentat-storage.py.log

Please adjust the variables (like the queue folder location and the RAM
filesystem size) in the procedure above according to your setup and
preferences. In this example we are using the `tmpfs
<https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt>`__ filesystem,
one of whose intended use cases is exactly this one. Also, according to the
documentation, no RAM is actually wasted while the ramdisk is empty: the
ramdisk allocation size is technically just an upper limit, and overflowing
data will be swapped to the hard disk.


Local customizations of reporting translations
````````````````````````````````````````````````````````````````````````````````

It is possible to locally customize the reporting templates and translations.
All relevant files reside inside the ``/etc/mentat/templates`` configuration
directory:

* ``/etc/mentat/templates/informant`` - Templates and message catalogs for ``mentat-informant.py``
* ``/etc/mentat/templates/reporter`` - Templates and message catalogs for ``mentat-reporter.py``

In each of the directories above you should aim for the following files:

* ``*.j2`` - Jinja templates for email reports
* ``translations/cs/LC_MESSAGES/messages.po`` - Message catalog for Czech translations

The workflow for customizing the templates is as follows::

    # Step 1: Modify the appropriate '.j2' file

    # Step 2: Update the message catalogs:
    hawat-cli repintl update

    # Step 3: Translate newly added strings in the appropriate '*.po' file(s)

    # Step 4: Compile the message catalogs:
    hawat-cli repintl compile
    hawat-cli repintl clean
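For illustration, a finished entry in one of the ``messages.po`` catalogs looks
like the following (the particular strings here are hypothetical; your catalogs
will contain the strings actually extracted from your own templates)::

    # Translate the 'msgstr' line for every newly extracted 'msgid':
    msgid "Dear administrator,"
    msgstr "Vážený správce,"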
.. _section-administration-monitoring:

Monitoring
--------------------------------------------------------------------------------

Periodical monitoring is of course key to keeping the whole system healthy and
preventing problems. We heavily use the `Nagios <https://www.nagios.org/>`__
system for monitoring. Some features of the Mentat system have built-in support
for Nagios; for monitoring others you have to use existing Nagios plugins and
configure them to your liking. You may consider monitoring the following
features of the Mentat system:

#. Monitoring database (low level)
#. Monitoring Mentat database
#. Monitoring Mentat system
#. Monitoring message queues
#. Monitoring log files

You may also want to make use of our `Ansible <https://www.ansible.com/>`__
role ``honzamach.mentat``, which is capable of configuring the Nagios
monitoring for you. Or you may use its appropriate tasks as a model for your
custom configuration.


Monitoring database (low level)
````````````````````````````````````````````````````````````````````````````````

Currently there is no built-in mechanism for monitoring the database status. We
are using the Nagios plugin ``check_procs`` and the `check_postgres
<https://bucardo.org/check_postgres/>`__ plugins for monitoring the database.
You may use something like the following as your NRPE configuration:

.. code-block:: shell

    #
    # Check running processes.
    #
    command[check_postgresql]=/usr/lib/nagios/plugins/check_procs -c 1:100 -C postgres

    #
    # Common checks
    #
    command[check_pg_log]=/usr/lib/nagios/plugins/check_postgres_log /var/log/postgresql/postgresql-12-main.log
    command[check_pg_hitratio]=/usr/lib/nagios/plugins/check_postgres_hitratio --dbuser=watchdog
    command[check_pg_querytime]=/usr/lib/nagios/plugins/check_postgres_query_time --dbuser=watchdog --warning='1 minutes' --critical='1 minutes'
    command[check_pg_backends]=/usr/lib/nagios/plugins/check_postgres_backends --dbuser=watchdog

    #
    # Checks for database 'mentat_events'.
    #
    command[check_pg_con_mentat_events]=/usr/lib/nagios/plugins/check_postgres_connection --dbname=mentat_events --dbuser=watchdog
    command[check_pg_blt_mentat_events]=/usr/lib/nagios/plugins/check_postgres_bloat --dbname=mentat_events --dbuser=watchdog --warning='8G' --critical='14G' --exclude='pg_catalog.' --exclude='alembic_version'
    command[check_pg_anl_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_analyze --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
    command[check_pg_vac_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_vacuum --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
    command[check_pg_aan_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_autoanalyze --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
    command[check_pg_ava_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_autovacuum --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'

    #
    # Checks for database 'mentat_main'.
    #
    command[check_pg_con_mentat_main]=/usr/lib/nagios/plugins/check_postgres_connection --dbname=mentat_main --dbuser=watchdog
    command[check_pg_blt_mentat_main]=/usr/lib/nagios/plugins/check_postgres_bloat --dbname=mentat_main --dbuser=watchdog --warning='256M' --critical='2G' --exclude='pg_catalog.' --exclude='alembic_version'
    command[check_pg_anl_mentat_main]=/usr/lib/nagios/plugins/check_postgres_last_analyze --dbname=mentat_main --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
    command[check_pg_vac_mentat_main]=/usr/lib/nagios/plugins/check_postgres_last_vacuum --dbname=mentat_main --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
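Note that the checks above connect to PostgreSQL as a dedicated unprivileged
monitoring role named ``watchdog``. Such a role is not created by Mentat
itself; the following is a minimal sketch of creating it (the role name and the
plain ``GRANT CONNECT`` privileges are assumptions, adjust both to your local
security policy and authentication setup):

.. code-block:: shell

    # Create an unprivileged PostgreSQL role matching the --dbuser option above:
    sudo -u postgres createuser --no-createdb --no-createrole --no-superuser watchdog

    # Allow the role to connect to both Mentat databases:
    sudo -u postgres psql -c 'GRANT CONNECT ON DATABASE mentat_events TO watchdog;'
    sudo -u postgres psql -c 'GRANT CONNECT ON DATABASE mentat_main TO watchdog;'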
The custom ``check_postgres_log`` Nagios monitoring plugin for checking the
PostgreSQL log file for errors can be found in our Ansible role
``honzamach.postgresql``.


Monitoring Mentat database
````````````````````````````````````````````````````````````````````````````````

A very useful thing to monitor is the health of the message processing chain,
verifying that new messages are constantly being added to the database. For
this there is a built-in feature in the :ref:`section-bin-mentat-dbmngr`
utility. It contains the ``watchdog-events`` command, which can be executed
periodically to check the database for new messages. It can be used in
conjunction with the ``--nagios-plugin`` option to be incorporated into your
monitoring infrastructure:

.. code-block:: shell

    command[check_mentat_edb]=/usr/local/bin/mentat-dbmngr.py --command watchdog-events --nagios-plugin --log-level warning --shell --user nagios --group nagios

Additionally, there is a bundle of useful check scripts in the
``/etc/mentat/scripts`` directory, which can be used to help with keeping the
data quality at sane levels. These scripts are currently really simple: they
just perform a hardcoded database query and send the query results via email to
a list of configured recipients. Target email addresses can be configured in
the ``/etc/default/mentat`` configuration file or passed directly to a script
as command line parameters. To correctly configure these scripts, please pay
attention to the following settings in ``/etc/default/mentat``:

``MENTAT_IS_ENABLED``
    Master switch. Unless the value is set to ``yes``, no checks will be
    performed.

``MENTAT_CHECKS_MAIL_TO``
    List of recipients of check reports (must be an array).

``MENTAT_HAWAT_URL``
    Base URL of Mentat's web interface. It will be used to generate URLs to
    example events.
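For illustration, the relevant part of the ``/etc/default/mentat`` file might
look as follows (the file uses shell syntax; the values below are examples
only, substitute your own recipients and URL):

.. code-block:: shell

    # Master switch: no checks are performed unless this is set to 'yes'.
    MENTAT_IS_ENABLED=yes

    # Recipients of the check reports (must be an array).
    MENTAT_CHECKS_MAIL_TO=(admin@domain.org another-admin@domain.org)

    # Base URL of the Mentat web interface, used to generate links to example events.
    MENTAT_HAWAT_URL="https://mentat.domain.org/mentat/"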
To enable these scripts, please configure them to be launched periodically via
``cron``. The following scripts are available:

``/etc/mentat/scripts/mentat-check-alive.sh``
    Query the IDEA event database and find a list of event detectors that have
    stopped sending new events. This can be used to detect possible problems
    with detectors suddenly going offline.

``/etc/mentat/scripts/mentat-check-inspectionerrors.sh``
    Query the IDEA event database and detect a list of all inspection errors
    along with example events. The :ref:`section-bin-mentat-inspector` module
    is by default configured to perform event sanity inspection and log the
    errors it finds directly into the event. This script can provide a summary
    of all current inspection errors, so you can go and fix malfunctioning
    detectors.

``/etc/mentat/scripts/mentat-check-noeventclass.sh``
    Query the IDEA event database and detect a list of events without an
    assigned internal classification. The event classification is an internal
    mechanism for aggregating events representing similar event classes,
    possibly from different detectors (e.g. SSH bruteforce attacks detected by
    different detectors may be described by slightly different IDEA events).
    In the best case scenario every IDEA event should be assigned exactly one
    event class and there should not be any events without an event class.

``/etc/mentat/scripts/mentat-check-volatiledescription.sh``
    Query the IDEA event database and detect a list of detectors that are
    putting variable data into the ``Description`` key within the event. The
    description should contain only constant data; things like IP addresses,
    timestamps and so on should be placed into the ``Note`` key.

``/etc/mentat/scripts/mentat-check-test.sh``
    Query the IDEA event database and detect a list of detectors that have been
    sending events with the ``Test`` category for a "longer than normal" time.
    Usually, when a new detector is added to the system, it is smart to assess
    the quality of the provided data before letting the messages be handled in
    full. However, detectors should not use this feature permanently; instead
    the data source should either move to production level by starting to omit
    the ``Test`` category, or stop sending those messages altogether.

The following is an example ``cron`` configuration enabling all these checks:

.. code-block:: shell

    # root@host$ crontab -e
    10 0 * * mon /etc/mentat/scripts/mentat-check-alive.sh 7
    11 0 * * mon /etc/mentat/scripts/mentat-check-inspectionerrors.sh 7
    12 0 * * mon /etc/mentat/scripts/mentat-check-noeventclass.sh 7
    # As an example use 14 days as check interval here instead of 7 days
    13 0 * * mon /etc/mentat/scripts/mentat-check-volatiledescription.sh 14
    # As an example send these reports to some different people
    14 0 * * mon /etc/mentat/scripts/mentat-check-test.sh 7 admin@domain.org another-admin@domain.org

All these scripts send their reports via email with the following headers,
which you may use for automated email processing:

* ``From: Mentat Sanity Checker``
* ``X-Mentat-Report-Class: sanity-check``
* ``X-Mentat-Report-Type: check-[xxx]``
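For example, if your mail delivery pipeline supports Sieve filtering, you may
use these headers to file all check reports into a dedicated folder (a minimal
sketch; the mailbox name is just an example)::

    require ["fileinto"];

    # File all Mentat sanity check reports into a dedicated mailbox folder:
    if header :is "X-Mentat-Report-Class" "sanity-check" {
        fileinto "Mentat/sanity-checks";
    }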
Monitoring Mentat system
````````````````````````````````````````````````````````````````````````````````

For overall system state monitoring there is a feature built into the
:ref:`section-bin-mentat-controller` utility. You may use the ``status``
command to detect the current overall state of Mentat modules:

.. code-block:: shell

    root@mentat:~# mentat-controller.py
    2018-09-26 13:31:17,752 INFO: Executing script command 'status'
    2018-09-26 13:31:17,981 INFO: Status of configured Mentat real-time modules:
    2018-09-26 13:31:17,981 INFO: Real-time module 'mentat-storage.py': 'Process is running or service is OK (1)'
    2018-09-26 13:31:17,981 INFO: Real-time module 'mentat-enricher.py': 'Process is running or service is OK (1)'
    2018-09-26 13:31:17,982 INFO: Real-time module 'mentat-inspector.py': 'Process is running or service is OK (1)'
    2018-09-26 13:31:17,982 INFO: Overall real-time module status: 'All modules are running OK'
    2018-09-26 13:31:17,982 INFO: Status of configured Mentat cronjob modules:
    2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-backup-py': 'Cronjob is enabled'
    2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-cleanup-py': 'Cronjob is enabled'
    2018-09-26 13:31:17,982 INFO: Cronjob module 'fetch-geoipdb-sh': 'Cronjob is enabled'
    2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-informant-py': 'Cronjob is enabled'
    2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-precache-py': 'Cronjob is enabled'
    2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-reporter-py': 'Cronjob is enabled'
    2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-statistician-py': 'Cronjob is enabled'
    2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-watchdog-events-py': 'Cronjob is enabled'
    2018-09-26 13:31:17,983 INFO: Overall cronjob module status: 'All cronjobs are enabled'
    2018-09-26 13:31:17,983 INFO: Overall Mentat system status: 'All modules are running OK and all cronjobs are enabled'
    2018-09-26 13:31:17,984 INFO: Application runtime: '0:00:00.329097' (effectivity 70.49 %)
    2018-09-26 13:31:17,985 INFO: Application persistent state saved to file '/var/mentat/run/mentat-controller.py.pstate'
    2018-09-26 13:31:17,985 INFO: Application runlog saved to file '/var/mentat/run/mentat-controller.py/201809261331.runlog'

You may use the built-in command line option ``--nagios-plugin`` to force the
output and return code to conform to the `Nagios plugin API
<https://nagios-plugins.org/doc/guidelines.html>`__. In that case you may use
something like the following as your NRPE configuration:

.. code-block:: shell

    command[check_mentat]=/usr/local/bin/mentat-controller.py --command status --nagios-plugin --log-level warning --shell
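You can also verify the plugin behavior manually; in this mode the exit code
follows the standard Nagios plugin convention:

.. code-block:: shell

    # Run the status check in Nagios plugin mode and inspect the exit code:
    /usr/local/bin/mentat-controller.py --command status --nagios-plugin --log-level warning --shell
    echo $?    # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN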
Monitoring message queues
````````````````````````````````````````````````````````````````````````````````

Currently there is no built-in mechanism for monitoring the number of messages
in the message queues. We are using the Nagios plugin ``check_file_count`` for
monitoring the number of messages in the queues. You may use something like the
following as your NRPE configuration:

.. code-block:: shell

    command[check_mentat_inspector_a_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/errors -w 100 -c 1000
    command[check_mentat_inspector_a_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/pending -w 100 -c 1000
    command[check_mentat_inspector_a_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/incoming -w 5000 -c 10000
    command[check_mentat_enricher_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/errors -w 100 -c 1000
    command[check_mentat_enricher_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/pending -w 100 -c 1000
    command[check_mentat_enricher_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/incoming -w 5000 -c 10000
    command[check_mentat_storage_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/errors -w 100 -c 1000
    command[check_mentat_storage_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/pending -w 100 -c 1000
    command[check_mentat_storage_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/incoming -w 5000 -c 10000


Monitoring log files
````````````````````````````````````````````````````````````````````````````````

You may consider using tools like ``logwatch``, ``logcheck``, ``Kibana`` or
``Graylog`` to monitor the log files in ``/var/mentat/log``. No such solutions
are currently part of the package, so you have to implement your own.


.. _section-administration-maintenance:

Maintenance
--------------------------------------------------------------------------------

Database
````````````````````````````````````````````````````````````````````````````````

References:

* `Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT <https://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT>`__

.. code-block:: shell

    # Launch tmux or screen.
    tmux

    # Stop Mentat system.
    printf 'SetOutputFilter SUBSTITUTE;DEFLATE\nSubstitute "s/__MAINTENANCE_START__/%b/n"\nSubstitute "s/__MAINTENANCE_END__/%b/n"\n' "`date '+%F %R'`" "`date -d '+4 hour' '+%F %R'`" > /etc/mentat/apache/maintenance/.htaccess
    a2enmod substitute
    a2dissite site_mentat-ng.conf
    a2ensite site_maintenance.conf
    systemctl restart apache2
    mentat-controller.py --command disable
    mentat-controller.py --command stop
    systemctl restart postgresql

    # Perform database maintenance tasks.
    time psql mentat_events -c 'VACUUM FULL VERBOSE;'
    time psql mentat_events -c 'CLUSTER VERBOSE;'
    time psql mentat_events -c 'ANALYZE VERBOSE;'
    time psql mentat_main -c 'VACUUM FULL VERBOSE;'
    time psql mentat_main -c 'CLUSTER VERBOSE;'
    time psql mentat_main -c 'ANALYZE VERBOSE;'

    # Start Mentat system.
    systemctl restart postgresql
    mentat-controller.py --command start
    mentat-controller.py --command enable
    a2dismod substitute
    a2dissite site_maintenance.conf
    a2ensite site_mentat-ng.conf
    systemctl restart apache2

For your convenience there is a script
``/etc/mentat/scripts/sqldb-maintenance.sh`` that can be used to perform all of
the above tasks for you in a single command. We recommend executing it in a
``tmux`` or ``screen`` terminal, so that it is not dependent on your current
session.
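A typical invocation might look like this (the session name is arbitrary):

.. code-block:: shell

    # Start a detachable terminal session and run the maintenance script in it:
    tmux new-session -s mentat-maintenance
    /etc/mentat/scripts/sqldb-maintenance.sh

    # Detach with 'Ctrl-b d' if needed and reattach later with:
    tmux attach-session -t mentat-maintenance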