Administration

This document provides various recommendations and tips for successful administration and management of the Mentat system.

Configuration

Using RAM-based filesystem for message queues

If you are encountering heavy IO traffic on your server, you may wish to speed things up by using a RAM-based filesystem for the folders containing the message queues. Please consider all the drawbacks before implementing this. Due to the volatility of RAM, in case of a power outage or system crash you will lose any data that has not yet been stored in the database.

If you choose to implement this solution, you may follow this simple procedure:

# 1. Stop your receiving Warden client.

# 2. Wait a moment for your Mentat daemons to process all remaining messages.

# 3. Stop all Mentat daemons:
mentat-controller.py --command stop
mentat-controller.py --command disable

# 4. Delete current content of your message processing queues:
rm -rf /var/mentat/spool/mentat-*

# 5. Add the following line to your /etc/fstab file (adjust the size of the RAM disk as necessary):
tmpfs  /var/mentat/spool  tmpfs  nodev,nosuid,noexec,nodiratime,size=2048M 0 0

# 6. Mount the newly added filesystem and check:
/bin/mount -a
mount | grep mentat
df -h | grep mentat

# 7. Start all Mentat daemons:
mentat-controller.py --command enable
mentat-controller.py --command start

# 8. Start your receiving Warden client.

# 9. Check that the IDEA messages are passing through the processing chain:
tail -f /var/mentat/log/mentat-storage.py.log

Please adjust the variables (like the queue folder location and RAM filesystem size) in the procedure above according to your setup and preferences. In this example we are using the tmpfs filesystem, one of whose intended use cases is exactly this one. Also, according to the documentation, no RAM is actually wasted while the ramdisk is empty: the configured ramdisk size is technically just an upper limit and overflowing data will be swapped to the hard disk.
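
Should the chosen size later turn out to be too small, a tmpfs filesystem can be resized on a live system without unmounting it. A minimal sketch, assuming the mount point from the example above (remember to also update the size in /etc/fstab so the change survives a reboot):

/bin/mount -o remount,size=4096M /var/mentat/spool
df -h /var/mentat/spool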

Local customizations of reporting translations

It is possible to locally customize the reporting templates and translations. All relevant files reside inside the /etc/mentat/templates configuration directory.

  • /etc/mentat/templates/informant - Templates and message catalogs for mentat-informant.py

  • /etc/mentat/templates/reporter - Templates and message catalogs for mentat-reporter.py

In each of the directories above, focus on the following files:

  • *.j2 - Jinja templates for email reports

  • translations/cs/LC_MESSAGES/messages.po - Message catalog for Czech translations

The workflow for customizing the templates is as follows:

# Step 1: Modify the appropriate '.j2' file

# Step 2: Update the message catalogs:
hawat-cli repintl update

# Step 3: Translate newly added strings in appropriate '*.po' file(s)

# Step 4: Compile the message catalogs:
hawat-cli repintl compile
hawat-cli repintl clean
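
After the update command has run, newly added strings appear in the message catalogs as entries with an empty msgstr. A minimal sketch of what a translated entry in translations/cs/LC_MESSAGES/messages.po might look like (the msgid and the source reference are purely illustrative):

#: informant/report.html.j2:42
msgid "Overall statistics"
msgstr "Celkové statistiky"

The compile command then turns the '*.po' catalogs into the binary '*.mo' files used at runtime.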

Monitoring

Periodical monitoring is of course key to keeping the whole system healthy and preventing problems. We make heavy use of the Nagios system for monitoring. Some features of the Mentat system have built-in support for Nagios; for monitoring others you have to use existing Nagios plugins and configure them to your liking.

You may consider monitoring the following features of the Mentat system:

  1. Monitoring database (low level)

  2. Monitoring Mentat database

  3. Monitoring Mentat system

  4. Monitoring message queues

  5. Monitoring log files

You may also want to make use of our Ansible role honzamach.mentat, which is capable of configuring the Nagios monitoring for you. Alternatively, you may use its appropriate tasks as a model for your custom configuration.
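
A minimal playbook sketch for applying the role might look as follows; the host group name is a placeholder and the role typically requires additional variables, so please consult its documentation for the actual interface:

- hosts: mentat_servers
  become: yes
  roles:
    - role: honzamach.mentat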

Monitoring database (low level)

Currently there is no built-in mechanism for monitoring low-level database status. We are using the Nagios plugins check_procs and check_postgres for monitoring the database. You may use something like the following as your NRPE configuration:

#
# Check running processes.
#
command[check_postgresql]=/usr/lib/nagios/plugins/check_procs -c 1:100 -C postgres

#
# Common checks
#
command[check_pg_log]=/usr/lib/nagios/plugins/check_postgres_log /var/log/postgresql/postgresql-12-main.log
command[check_pg_hitratio]=/usr/lib/nagios/plugins/check_postgres_hitratio --dbuser=watchdog
command[check_pg_querytime]=/usr/lib/nagios/plugins/check_postgres_query_time --dbuser=watchdog --warning='1 minutes' --critical='1 minutes'
command[check_pg_backends]=/usr/lib/nagios/plugins/check_postgres_backends --dbuser=watchdog
#
# Checks for database 'mentat_events'.
#
command[check_pg_con_mentat_events]=/usr/lib/nagios/plugins/check_postgres_connection --dbname=mentat_events --dbuser=watchdog
command[check_pg_blt_mentat_events]=/usr/lib/nagios/plugins/check_postgres_bloat --dbname=mentat_events --dbuser=watchdog --warning='8G' --critical='14G' --exclude='pg_catalog.' --exclude='alembic_version'
command[check_pg_anl_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_analyze --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
command[check_pg_vac_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_vacuum --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
command[check_pg_aan_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_autoanalyze --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
command[check_pg_ava_mentat_events]=/usr/lib/nagios/plugins/check_postgres_last_autovacuum --dbname=mentat_events --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
#
# Checks for database 'mentat_main'.
#
command[check_pg_con_mentat_main]=/usr/lib/nagios/plugins/check_postgres_connection --dbname=mentat_main --dbuser=watchdog
command[check_pg_blt_mentat_main]=/usr/lib/nagios/plugins/check_postgres_bloat --dbname=mentat_main --dbuser=watchdog --warning='256M' --critical='2G' --exclude='pg_catalog.' --exclude='alembic_version'
command[check_pg_anl_mentat_main]=/usr/lib/nagios/plugins/check_postgres_last_analyze --dbname=mentat_main --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'
command[check_pg_vac_mentat_main]=/usr/lib/nagios/plugins/check_postgres_last_vacuum --dbname=mentat_main --dbuser=watchdog --warning='3d' --critical='7d' --exclude='pg_catalog.' --exclude='alembic_version'

The custom check_postgres_log Nagios monitoring plugin for checking the PostgreSQL log file for errors can be found in our Ansible role honzamach.postgresql.

Monitoring Mentat database

A very useful thing to monitor is the health of the message processing chain, verifying that new messages are constantly being added to the database. For this there is a built-in feature in the mentat-dbmngr.py utility. It contains the watchdog-events command, which can be executed periodically to check the database for new messages. It can be used in conjunction with the nagios-plugin option to be incorporated into your monitoring infrastructure:

command[check_mentat_edb]=/usr/local/bin/mentat-dbmngr.py --command watchdog-events --nagios-plugin --log-level warning --shell --user nagios --group nagios

Additionally, there is a bundle of useful check scripts in the /etc/mentat/scripts directory, which can be used to help keep the data quality at sane levels. These scripts are currently really simple: they just perform a hardcoded database query and send the query results via email to a list of configured recipients. Target email addresses can be configured in the /etc/default/mentat configuration file or passed directly to the script as command line parameters.

To correctly configure these scripts, please pay attention to the following configuration options in /etc/default/mentat (an example snippet follows the list):

MENTAT_IS_ENABLED

Master switch. Unless the value is set to yes, no checks will be performed.

MENTAT_CHECKS_MAIL_TO

List of recipients of check reports (must be an array).

MENTAT_HAWAT_URL

Base URL of the Mentat web interface. It will be used to generate URLs to example events.
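
A minimal sketch of the relevant part of /etc/default/mentat, assuming the file is sourced as a shell script (which the array requirement suggests); the URL and email addresses are placeholders:

MENTAT_IS_ENABLED=yes
MENTAT_CHECKS_MAIL_TO=(admin@domain.org another-admin@domain.org)
MENTAT_HAWAT_URL="https://mentat.domain.org/mentat/"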

To enable these scripts, please configure them to be launched periodically via cron.

/etc/mentat/scripts/mentat-check-alive.sh

Query the IDEA event database and find a list of event detectors that have stopped sending new events. This can be used to detect possible problems with detectors suddenly going offline.

/etc/mentat/scripts/mentat-check-inspectionerrors.sh

Query the IDEA event database and detect a list of all inspection errors along with example events. The mentat-inspector.py module is by default configured to perform event sanity inspection and to log any errors it finds directly into the event. This script can provide a summary of all current inspection errors, so you can go and fix malfunctioning detectors.

/etc/mentat/scripts/mentat-check-noeventclass.sh

Query the IDEA event database and detect a list of events without an assigned internal classification. The event classification is an internal mechanism for aggregating events, possibly coming from different detectors, that represent similar event classes (e.g. SSH bruteforce attacks detected by different detectors may be described by slightly different IDEA events). In the best case scenario every IDEA event should be assigned exactly one event class and there should not be any events without an event class.

/etc/mentat/scripts/mentat-check-volatiledescription.sh

Query the IDEA event database and detect a list of detectors that are putting variable data into the Description key within the event. The description should contain only constant data; things like IP addresses, timestamps and so on should be placed into the Note key.

/etc/mentat/scripts/mentat-check-test.sh

Query the IDEA event database and detect a list of detectors that have been sending events with the Test category for a “longer than normal” time. Usually, when a new detector is added to the system, it is smart to assess the quality of the provided data before letting the messages be handled in full. However, detectors should not use this feature permanently; instead the data source should either move to production level by starting to omit the Test category, or stop sending those messages altogether.

The following is an example cron configuration enabling all these checks:

# root@host$ crontab -e
10 0 * * mon /etc/mentat/scripts/mentat-check-alive.sh 7
11 0 * * mon /etc/mentat/scripts/mentat-check-inspectionerrors.sh 7
12 0 * * mon /etc/mentat/scripts/mentat-check-noeventclass.sh 7
# As an example use 14 days as check interval here instead of 7 days
13 0 * * mon /etc/mentat/scripts/mentat-check-volatiledescription.sh 14
# As an example send these reports to some different people
14 0 * * mon /etc/mentat/scripts/mentat-check-test.sh 7 admin@domain.org another-admin@domain.org

All these scripts send their reports via email with the following headers, which you may use for automated email processing (see the filtering sketch below the list):

  • From: Mentat Sanity Checker <mentat@hostname.fqdn>

  • X-Mentat-Report-Class: sanity-check

  • X-Mentat-Report-Type: check-[xxx]
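
For example, if your mail delivery supports Sieve filtering, a rule sketch like the following could file the reports into a dedicated folder automatically (the folder name is just an illustration):

require ["fileinto"];
if header :is "X-Mentat-Report-Class" "sanity-check" {
    fileinto "Mentat/sanity-checks";
}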

Monitoring Mentat system

For overall system state monitoring there is a feature built into the mentat-controller.py utility. You may use the status command to detect the current overall state of Mentat modules:

root@mentat:~# mentat-controller.py
2018-09-26 13:31:17,752 INFO: Executing script command 'status'
2018-09-26 13:31:17,981 INFO: Status of configured Mentat real-time modules:
2018-09-26 13:31:17,981 INFO: Real-time module 'mentat-storage.py': 'Process is running or service is OK (1)'
2018-09-26 13:31:17,981 INFO: Real-time module 'mentat-enricher.py': 'Process is running or service is OK (1)'
2018-09-26 13:31:17,982 INFO: Real-time module 'mentat-inspector.py': 'Process is running or service is OK (1)'
2018-09-26 13:31:17,982 INFO: Overall real-time module status: 'All modules are running OK'
2018-09-26 13:31:17,982 INFO: Status of configured Mentat cronjob modules:
2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-backup-py': 'Cronjob is enabled'
2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-cleanup-py': 'Cronjob is enabled'
2018-09-26 13:31:17,982 INFO: Cronjob module 'fetch-geoipdb-sh': 'Cronjob is enabled'
2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-informant-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-precache-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-reporter-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-statistician-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-watchdog-events-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Overall cronjob module status: 'All cronjobs are enabled'
2018-09-26 13:31:17,983 INFO: Overall Mentat system status: 'All modules are running OK and all cronjobs are enabled'
2018-09-26 13:31:17,984 INFO: Application runtime: '0:00:00.329097' (effectivity  70.49 %)
2018-09-26 13:31:17,985 INFO: Application persistent state saved to file '/var/mentat/run/mentat-controller.py.pstate'
2018-09-26 13:31:17,985 INFO: Application runlog saved to file '/var/mentat/run/mentat-controller.py/201809261331.runlog'

You may use the built-in command line option nagios-plugin to force the output and return code to conform to the Nagios plugin API. In that case you may use something like the following as your NRPE configuration:

command[check_mentat]=/usr/local/bin/mentat-controller.py --command status --nagios-plugin --log-level warning --shell
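
Once the NRPE command is in place, you can verify it from the Nagios server side with the standard check_nrpe plugin (the host name below is a placeholder):

/usr/lib/nagios/plugins/check_nrpe -H mentat.example.org -c check_mentat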

Monitoring message queues

Currently there is no built-in mechanism for monitoring the number of messages in the message queues. We are using the Nagios plugin check_file_count to monitor the number of messages in the queues. You may use something like the following as your NRPE configuration:

command[check_mentat_inspector_a_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/errors -w 100 -c 1000
command[check_mentat_inspector_a_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/pending -w 100 -c 1000
command[check_mentat_inspector_a_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/incoming -w 5000 -c 10000

command[check_mentat_enricher_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/errors -w 100 -c 1000
command[check_mentat_enricher_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/pending -w 100 -c 1000
command[check_mentat_enricher_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/incoming -w 5000 -c 10000

command[check_mentat_storage_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/errors -w 100 -c 1000
command[check_mentat_storage_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/pending -w 100 -c 1000
command[check_mentat_storage_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/incoming -w 5000 -c 10000

Monitoring log files

You may consider using tools like logwatch, logcheck, Kibana or Graylog to monitor the log files in /var/mentat/log. No such solution is currently part of the package, so you have to implement your own.
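
As a very simple starting point, a cron-driven sketch like the following could mail you any error lines found in the logs. It assumes that error records contain the string 'ERROR' and that a mail command is available; the recipient address is a placeholder:

#!/bin/sh
# Hypothetical helper, not part of Mentat: mail any ERROR lines from the Mentat logs.
ERRORS=$(grep -H 'ERROR' /var/mentat/log/*.log)
if [ -n "$ERRORS" ]; then
    printf '%s\n' "$ERRORS" | mail -s 'Mentat log errors' admin@domain.org
fi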

Maintenance

Database

References:

  • Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT

# Launch tmux or screen.
tmux

# Stop Mentat system.
printf 'SetOutputFilter SUBSTITUTE;DEFLATE\nSubstitute "s/__MAINTENANCE_START__/%b/n"\nSubstitute "s/__MAINTENANCE_END__/%b/n"\n' "`date '+%F %R'`" "`date -d '+4 hour' '+%F %R'`" > /etc/mentat/apache/maintenance/.htaccess
a2enmod substitute
a2dissite site_mentat-ng.conf
a2ensite site_maintenance.conf
systemctl restart apache2
mentat-controller.py --command disable
mentat-controller.py --command stop
systemctl restart postgresql

# Perform database maintenance tasks.
time psql mentat_events -c 'VACUUM FULL VERBOSE;'
time psql mentat_events -c 'CLUSTER VERBOSE;'
time psql mentat_events -c 'ANALYZE VERBOSE;'
time psql mentat_main -c 'VACUUM FULL VERBOSE;'
time psql mentat_main -c 'CLUSTER VERBOSE;'
time psql mentat_main -c 'ANALYZE VERBOSE;'

# Start Mentat system.
systemctl restart postgresql
mentat-controller.py --command start
mentat-controller.py --command enable
a2dismod substitute
a2dissite site_maintenance.conf
a2ensite site_mentat-ng.conf
systemctl restart apache2

For your convenience there is a script /etc/mentat/scripts/sqldb-maintenance.sh that can be used to perform all of the above tasks for you in a single command. We recommend executing it inside a tmux or screen session, so that it does not depend on your current terminal session.
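
For example, a maintenance run could be started like this (the session name is arbitrary):

# Start a named tmux session ...
tmux new-session -s mentat-maintenance
# ... and inside it launch the bundled maintenance script:
/etc/mentat/scripts/sqldb-maintenance.sh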