Administration

This document describes various recommendations and tips for successful administration and management of the Mentat system.

Configuration

Using RAM-based filesystem for message queues

If you are encountering heavy I/O traffic on your server, you may wish to speed things up by using a RAM-based filesystem for the folders containing the message queues. Please consider all the drawbacks before implementation: because RAM is volatile, any data that has not yet been stored in the database will be lost in case of a power outage or system crash.

If you choose to implement this solution, you may follow this simple procedure:

# 1. Stop your receiving Warden client.

# 2. Wait a moment for your Mentat daemons to process all remaining messages.

# 3. Stop all Mentat daemons:
mentat-controller.py --command stop

# 4. Delete the current content of your message processing queues:
rm -rf /var/mentat/spool/mentat-*

# 5. Add the following line to your /etc/fstab file (adjust the size of the RAM disk as necessary):
tmpfs  /var/mentat/spool  tmpfs  nodev,nosuid,noexec,nodiratime,size=2048M 0 0

# 6. Mount the newly added filesystem and check:
/bin/mount -a
mount | grep mentat
df -h | grep mentat

# 7. Start all Mentat daemons:
mentat-controller.py --command start

# 8. Start your receiving Warden client.

# 9. Check that the IDEA messages are passing through the processing chain:
tail -f /var/mentat/log/mentat-storage.py.log

Please adjust the variables (like the queue folder location and the RAM filesystem size) in the procedure above according to your setup and preferences. This example uses the tmpfs filesystem, one of whose intended use cases is exactly this scenario. Also, according to its documentation, no RAM is wasted while the RAM disk is empty: the allocation size is just an upper limit, and data exceeding available memory may be swapped out to disk.
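
Since a failed or missing mount would silently make the daemons write their queues back to the disk-backed directory, you may also want to verify that the tmpfs is really in place before starting Mentat, for example from a small wrapper script. The following is only a minimal sketch under that assumption; the mount point matches the /etc/fstab entry above, the rest is illustrative:

#!/bin/sh
# Refuse to start the Mentat daemons unless the RAM-based spool is mounted.
SPOOL=/var/mentat/spool

if ! mountpoint -q "$SPOOL"; then
    echo "ERROR: $SPOOL is not a mount point, refusing to start Mentat" >&2
    exit 1
fi

# Double-check that the mounted filesystem really is tmpfs.
if [ "$(findmnt -n -o FSTYPE "$SPOOL")" != "tmpfs" ]; then
    echo "ERROR: $SPOOL is mounted, but not as tmpfs" >&2
    exit 1
fi

mentat-controller.py --command start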

Monitoring

Periodic monitoring is of course key to keeping the whole system healthy and preventing problems. We make heavy use of the Nagios system for monitoring. Some features of the Mentat system have built-in support for Nagios; for others you have to use existing Nagios plugins and configure them to your liking.

You may consider monitoring the following aspects of the Mentat system:

  1. Monitoring system state

  2. Monitoring database state

  3. Monitoring message queues

  4. Monitoring log files

You may also want to make use of our Ansible role honzamach.mentat, which is capable of configuring the Nagios monitoring for you, or you may use its relevant tasks as a model for your custom configuration.
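
If you go the Ansible route, the role is typically pulled in via Ansible Galaxy and then referenced from your own playbook. The command below is just an assumed, minimal way of obtaining it; the playbook layout itself depends entirely on your environment:

ansible-galaxy install honzamach.mentat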

Monitoring system state

For overall system state monitoring, there is a feature built into the mentat-controller.py utility. You may use the status command to determine the current overall state of all Mentat modules:

root@mentat:~# mentat-controller.py
2018-09-26 13:31:17,752 INFO: Executing script command 'status'
2018-09-26 13:31:17,981 INFO: Status of configured Mentat real-time modules:
2018-09-26 13:31:17,981 INFO: Real-time module 'mentat-storage.py': 'Process is running or service is OK (1)'
2018-09-26 13:31:17,981 INFO: Real-time module 'mentat-enricher.py': 'Process is running or service is OK (1)'
2018-09-26 13:31:17,982 INFO: Real-time module 'mentat-inspector-b.py': 'Process is running or service is OK (1)'
2018-09-26 13:31:17,982 INFO: Real-time module 'mentat-inspector.py': 'Process is running or service is OK (1)'
2018-09-26 13:31:17,982 INFO: Overall real-time module status: 'All modules are running OK'
2018-09-26 13:31:17,982 INFO: Status of configured Mentat cronjob modules:
2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-backup-py': 'Cronjob is enabled'
2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-cleanup-py': 'Cronjob is enabled'
2018-09-26 13:31:17,982 INFO: Cronjob module 'fetch-geoipdb-sh': 'Cronjob is enabled'
2018-09-26 13:31:17,982 INFO: Cronjob module 'mentat-informant-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-precache-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-reporter-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-statistician-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Cronjob module 'mentat-watchdog-events-py': 'Cronjob is enabled'
2018-09-26 13:31:17,983 INFO: Overall cronjob module status: 'All cronjobs are enabled'
2018-09-26 13:31:17,983 INFO: Overall Mentat system status: 'All modules are running OK and all cronjobs are enabled'
2018-09-26 13:31:17,984 INFO: Application runtime: '0:00:00.329097' (effectivity  70.49 %)
2018-09-26 13:31:17,985 INFO: Application persistent state saved to file '/var/mentat/run/mentat-controller.py.pstate'
2018-09-26 13:31:17,985 INFO: Application runlog saved to file '/var/mentat/run/mentat-controller.py/201809261331.runlog'

You may use the built-in command line option nagios-plugin to force the output and return code to conform to the Nagios plugin API. In that case you may use something like the following as your NRPE configuration:

command[check_mentat]=/usr/local/bin/mentat-controller.py --command status --nagios-plugin --log-level warning --shell
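
Before wiring this into NRPE, you can run the very same command by hand and inspect its exit code, which with the nagios-plugin option should follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A quick manual test might look like this:

/usr/local/bin/mentat-controller.py --command status --nagios-plugin --log-level warning --shell
echo $?    # expect 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN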

Monitoring database state

First of all, you may wish to use the check_procs Nagios plugin to check that the database is indeed running:

command[check_postgresql]=/usr/lib/nagios/plugins/check_procs -c 5:100 -C postgres

The next very useful thing to monitor is the health of the message processing chain, verifying that new messages are constantly being added to the database. For this there is a built-in feature in the mentat-dbmngr.py utility: the watchdog-events command, which can be executed periodically to check the database for new messages. It can be used in conjunction with the nagios-plugin option to be incorporated into your monitoring infrastructure:

command[check_mentat_edb]=/usr/local/bin/mentat-dbmngr.py --command watchdog-events --nagios-plugin --log-level warning --shell --user nagios --group nagios
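
Once the NRPE command is defined, it is also worth verifying from the Nagios server side that the check is reachable at all. Assuming the standard check_nrpe plugin is installed there, a manual test could look as follows (mentat.example.org is just a placeholder for your Mentat host):

/usr/lib/nagios/plugins/check_nrpe -H mentat.example.org -c check_mentat_edb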

Additionally, there is a bundle of useful check scripts in the /etc/mentat/scripts directory, which can be used to help with keeping the data quality at sane levels. These scripts are currently really simple: they just perform a hardcoded database query and send the query results via email to a list of configured recipients. Target email addresses can be configured in the /etc/default/mentat configuration file with the configuration key MENTAT_CHECKS_MAIL_TO. These scripts can be set to launch periodically via cron (see the example cron entry after the script descriptions below):

/etc/mentat/scripts/mentat-check-alive.sh

Query the IDEA event database and find a list of event detectors that have stopped sending new events. This can be used to detect possible problems with detectors suddenly going offline.

/etc/mentat/scripts/mentat-check-inspectionerrors.sh

Query the IDEA event database and detect a list of all inspection errors along with example messages. One of the mentat-inspector.py modules is by default configured to perform message sanity inspection and log any errors it finds directly into the message. This script can provide a summary of all current inspection errors, so you can go and fix malfunctioning detectors.

/etc/mentat/scripts/mentat-check-no-eventclass.sh

Query the IDEA event database and detect a list of events without an assigned internal classification. The event classification is an internal mechanism for aggregating messages that represent similar events, possibly coming from different detectors.

/etc/mentat/scripts/mentat-check-test.sh

Query the IDEA event database and detect a list of detectors that have been sending messages with the Test category for “longer than normal”. Usually, when a new detector is added to the system, it is smart to assess the quality of the provided data before letting the messages be handled in full. However, detectors should not use this feature permanently: the data source should either move to production level by starting to omit the Test category, or stop sending those messages.

/etc/mentat/scripts/mentat-check-volatile-description.sh

Query the IDEA event database and detect a list of detectors that are putting variable data into the Description key within the message. The description should contain only constant data; things like IP addresses, timestamps and so on should be placed into the Note key.
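
As noted above, these check scripts are meant to be launched periodically via cron. The snippet below is only an illustrative sketch of a possible /etc/cron.d entry; the schedule, the selection of scripts and the user the jobs run under (mentat here) are assumptions you should adjust to your installation:

# /etc/cron.d/mentat-checks -- illustrative example only
# Run the data quality check scripts once a day in the early morning.
15 6 * * *  mentat  /etc/mentat/scripts/mentat-check-alive.sh
30 6 * * *  mentat  /etc/mentat/scripts/mentat-check-inspectionerrors.sh
45 6 * * *  mentat  /etc/mentat/scripts/mentat-check-test.sh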

Monitoring message queues

Currently there is no built-in mechanism for monitoring the number of messages in the message queues. We are using the Nagios plugin check_file_count to monitor the number of messages in the queues. You may use something like the following as your NRPE configuration:

command[check_mentat_inspector_a_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/errors -w 100 -c 1000
command[check_mentat_inspector_a_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/pending -w 100 -c 1000
command[check_mentat_inspector_a_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector.py/incoming -w 5000 -c 10000
command[check_mentat_inspector_b_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector-b.py/errors -w 100 -c 1000
command[check_mentat_inspector_b_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector-b.py/pending -w 100 -c 1000
command[check_mentat_inspector_b_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-inspector-b.py/incoming -w 5000 -c 10000
command[check_mentat_enricher_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/errors -w 100 -c 1000
command[check_mentat_enricher_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/pending -w 100 -c 1000
command[check_mentat_enricher_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-enricher.py/incoming -w 5000 -c 10000
command[check_mentat_storage_errors_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/errors -w 100 -c 1000
command[check_mentat_storage_pending_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/pending -w 100 -c 1000
command[check_mentat_storage_incoming_dir]=/usr/lib/nagios/plugins/check_file_count -d /var/mentat/spool/mentat-storage.py/incoming -w 5000 -c 10000
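
If you just want a quick ad-hoc overview of the queue sizes without involving Nagios, a trivial shell loop over the spool subdirectories gives the same numbers the checks above are based on. This is only a convenience sketch:

# Print the number of queued files in every Mentat spool subdirectory.
for dir in /var/mentat/spool/mentat-*/*/; do
    printf '%6d  %s\n' "$(find "$dir" -maxdepth 1 -type f | wc -l)" "$dir"
done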

Monitoring log files

You may consider using tools like logwatch, logcheck, Kibana or Graylog to monitor the log files in /var/mentat/log. No such solution is currently part of the package; you have to implement your own.
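
If you do not want to deploy a full log management stack right away, even a simple periodic scan for higher-severity lines can catch many problems. The sketch below assumes the log format shown earlier in this document (a timestamp followed by the level name) and that the standard WARNING/ERROR/CRITICAL levels are used; adjust it to your needs:

# Show recent higher-severity lines from all Mentat logs.
grep -E ' (WARNING|ERROR|CRITICAL): ' /var/mentat/log/*.log | tail -n 50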