Database

The Mentat system uses the awesome PostgreSQL as its primary database backend storage. For performance reasons two different database abstraction layers are used to access the data.

mentat.services.eventstorage

This library/module is used to access the IDEA event database. This database contains the main table for storing IDEA events as well as some additional tables that implement thresholding and relapse mechanism for reporting. For performance reasons it is implemented directly on top of the psycopg2 Python driver for PostgreSQL.

mentat.services.sqlstorage

This library/module is used to access the metadata database and work with any other objects like user accounts, groups, reporting filters, etc. For convenience it is implemented on top of sqlalchemy Python universal DBAL and ORM toolkit.

Database configuration tuning

The main resources used for database configuration tuning:

Example server hardware specs

Configuration and tuning tips in following sections are based on our experience with operating Mentat system and its PostgreSQL database on following hardware:

  • Dell PowerEdge R630

  • 2 x Intel Xeon E5-2695 v4 2.1GHz,45M Cache,9.60GT/s QPI,Turbo,HT,18C/36T (120W) Max Mem 2400MHz

  • 256 GB RAM (8 x 32GB RDIMM, 2400MT/s, Dual Rank, x4 Data Width)

  • PERC H730 Integrated RAID Controller, 1GB Cache

  • 3,6 TB HDD (6 x 600GB 15K RPM SAS 2.5in Hot-plug Hard Drive)

General settings and performance optimizations

Following options were changed in the main configuration file /etc/postgresql/11/main/postgresql.conf:

# Original default value 'localtime'
timezone = 'UTC'
  • Timezone must be set to UTC to enable proper time based searching. All timestamps

are converted to UTC prior to storing to database and stored without the timezone information according to PostgreSQL best practices.

# Original default value 128MB
shared_buffers = 16GB
  • Shared buffers are used to hold database representation in memory before writing it to disk. In INSERT/UPDATE heavy systems, larger values are beneficial. Setting this value too large does not help performance as OS disk cache is also used by PostgreSQL. Shared buffers smooth the IO operations rate and prevent spikes. Value of 16GB seems enough for this system and typical Mentat load. An increase might be in order if either the INSERT rate or mean event size increase substantially to keep the buffering effect.

# Original default value 4 MB
work_mem = 16MB
  • Specifies the memory size for use in internal operations such as in-memory sorting and hash table calculation. This limit is per internal operation, in typical query multiple such operations are executed. If the operation would require more memory than this limit, a theoretically slower on-disk variant is used (which might still be performed in disk subsystem cache with no real IO operations to storage, possibly even with better performance). Value of 16MB seems to work right for this system. As the internal sort is used in ORDER BY, some queries where index can not be used for sorting might profit from larger values (as the amount of data to sort is seldom too high in Mentat with typical query using LIMIT 100).

# Original default value 64MB
maintenance_work_mem = 256MB
  • Maintenance work memory is similar to work_mem but used for maintenance operations such as VACUUM or CREATE INDEX, larger values (when compared to work_mem) decrease their run time considerably. Only one such operation can run per database at the same time, value of 256MB is generally enough for the typical table sizes of Mentat and IOps reserves of this system.

# Original default value was commented out
wal_buffers = -1
  • Defines the maximum amount of memory (in shared_buffers) for write ahead log (WAL) data buffering. Typically WAL is written to disk after every transaction (hard requirement, otherwise journaling stops functioning as intended) so this should be large enough to hold all WAL data for a typical transaction for optimum performance. Value of -1 means no limit so transactions of arbitrary size (limited by shared_buffers, so essentialy no limit) are supported optimally. Limiting this setting only makes sense on resource constrained systems with small typical transaction size (few queries per transaction with small row size).

# Original default value 4GB
effective_cache_size = 200GB
  • Effective cache size is an estimate of the amount of system memory usable as disk cache for PostgreSQL. On a modern Linux system, this value is generally the total memory size minus the memory used by application programs. An efficient way to determine a suitable value is to monitor the targer system under expected load and note the amount of memory dedicated to disk buffers over time. On a system running PostgreSQL only, reasonable effective_cache_size is equal to the disk buffer size. If the DB shares the system with other disk write/read (total size, not IOps) intensive applications, this value should be decreased accordingly. This value represents the amount of on-disk data that PostgreSQL can estimate to be accessible in constant time. If it is larger than the sum of sizes of all tables and indices, disk IO is ammortized to initial read at first affected query after startup. Mentat is IOps intensive but quite light on total amout of data read/written.

Import tuning

Following options were changed in the main configuration file /etc/postgresql/11/main/postgresql.conf:

# Original default value 1GB
max_wal_size = 32GB
  • Write ahead log serves as a journal of changes against the on-disk snapshot of database that is created by checkpoint. In case of failure, the journal is replayed over the checkpoint to get to the state that includes all commited transactions. As checkpoint is a costly process that requires a table lock, it is best not performed too often (with the downside of a slightly longer database rebuild time after failure). Mentat is an INSERT heavy application as new events are recorded at all time at a rate of ~14 per second. Meanwhile Mentat can tolerate slightly longer database rebuild times. An increase in maximum write ahead log size from 1GB to 32GB increased the time between required consecutive checkpoints considerably, reducing overall system load and improving SELECT latency (as analytic queries are less likely to hit the less frequent checkpoints), furthermore this improves INSERT performance for batches of larger events (up to 250kB as opposed to long time running average of 10kB) which appear at an increasing rate lately.

# Original default value 5min
checkpoint_timeout = 15min
  • Checkpoint timeout defines the maximum time between two consecutive checkpoints. A value too small cases frequent checkpointing, an disk IO intensive operation. As Mentat writes are mostly INSERTs and batch DELETEs - larger values are sustainable. The increase from 5 minutes to 15 minutes leads to lower load of disk subsystem, leaving more bandwidth for SELECT - improving performance in case of hitting an INSERT heavy timeslot (large events), before all three firing caused disk subsystem overload (extremely poor performance), after the change the probability decreased considerably.

Enabling query parallelism

Following options were changed in the main configuration file /etc/postgresql/11/main/postgresql.conf:

As of PostgreSQL 9.6, with significant further improvements in 10.0, individual queries can be performed by parallel subworkers. Default settings are very conservative and allow this type of parallelism only in extreme cases and with very few workers. For any system, configuration has to be tuned individually. List of changes to improve parallelism follows:

# Original default value 8
max_parallel_workers = 36
  • Defines maximum number of parallel workers shared by all queries. As with Mentat we generally only have a low number of concurrently executed queries, this value was set to accommodate two queries with full parallelism, see max_parallel_workers_per_gather.

# Original default value 2
max_parallel_workers_per_gather = 18
  • Defines the maximum number of parallel workers for one query. To allow the OS task scheduler to place all workers on one physical CPU and to stop threads competing for resources because of HyperThreading, the value was set to number of physical cores on one CPU (18 for Intel(R) Xeon(R) CPU E5-2695 v4).

# Original default value 1
effective_io_concurrency = 4
  • Defines the optimal number of parallel IO requests for the disk subsystem. Default value of 1 is aimed at non-RAID HDDs; SATA SSDs perform best with values of 4 - 8; NVMe SSDs can support much higher values. Chosen value of 4 seems representative of example server’s storage capabilities.

# Original default value 1.0
seq_page_cost = 0.1
  • PostgreSQL query planner uses costs on arbitrary scale for local optimization of function of suboperation costs. These cost values represent relative cost of suboperations to each other. Due to performant storage, the sequential page cost, a representation of cost of sequential write when compared to CPU processing speed was decreased from 1.0 to 0.1 (10x).

# Original default value 4.0
random_page_cost = 0.15
  • Similarly to seq_page_cost, this value represents the cost of random IO. The default ratio of random_page_cost:seq_page_cost = 4:1 is aimed at HDDs. With example server’s storage, random reads are almost equal to sequential (on HW level completely equal, cache manager is able to predict and buffer sequential reads so those are a bit more performant).

# Original default value 0.1
parallel_tuple_cost = 0.005
  • The cost for parallel tuple retrieval is set to 1/2 of cpu_tuple_cost (for single threaded retrieval) to make parallel execution preferable, using more workers. This reduces the time it takes to perform the parallel portion significantly, making the total execution time converge to the time of serial only portion. This is important for systems with high CPU counts (60+). If the amount of available CPUs is lower, it does not need to be decreased that much. In any case, altering min_parallel_index_scan_size first is preferable as it brings results sooner.

# Original default value 1000
parallel_setup_cost = 10.0
  • Represents the cost to set up worker threads (pass the task to them as they already run). This is generally negligible and the default value of 1000.0 limits the parallel execution to extreme cases only.

# Original default value 512kB
min_parallel_index_scan_size = 128kB
  • Defines the minimum segment size to partition indices for parallel execution. Lowering this value to 128kB enables parallel index scans over the most used index events_cesnet_storagetime_idx to run with optimal number of workers across a large range of stored event counts when filtering for a whole day worth of events. Affected queries no longer run over threshold (2s). Lowering this value helps most if either the system is equipped with a high parallel I/O capable disk subsystem or the whole DB fits into RAM. Otherwise it is advised to watch out for workers starving on I/O.

Note

Conclusion:

The two most important changes enabling parallelism for queries are those for effective_io_concurrency, which limits storage subsystem queues, and parallel_setup_cost, which makes parallel execution seem viable even for simpler queries (and Mentat uses simple queries almost exclusively). The changes to costs are aimed at better prediction of actual run time (comparison of estimated and real run time is provided by EXPLAIN ANALYZE), after are the estimations better matching (not 1:1 but linear function). Number of workers was set based on current load, common number of concurrent queries and memory consumption settings from initial configuration. If under higher analytical (OLAP) load (multiple Hawat users), those values might be the best place to start further optimization.

Regular VACUUM and ANALYZE

PostgreSQL has two main maintenance tasks to maintain performance. VACUUM invalidates vacated entries in both tables and indices, to prevent bloat. ANALYZE performs frequency analysis on table columns, which are then used by query planner. Both procedures are assigned a command of the same name and need to be performed regularly. Ongoing VACUUM and ANALYZE is performed automatically by the AUTOVACUUM daemon. Default AUTOVACUUM settings are suitable for common use, in comparison to which the Mentat’s event table is very different. It is a table with ~125M rows, with constant INSERTs and batch DELETEs. Therefore the settings for this use case have to be altered.

# Original default value 100
default_statistics_target = 1000
  • A multiplicative parameter for depth of analysis for ANALYZE, it allows for a trade-off between ANALYZE speed and relative frequency data precision by defining statistic subset size. Default value of 100 is reasonable for smaller tables or tables with roughly uniform distribution of individual values in columns. As AUTOANALYZE is performed in the background and requires no lock, we can allow for a higher value allowing for better query plans. If the number of rows grows in the future, it might be a good idea to scale this value accordingly to allow for good estimations for the more rare of values.

ALTER TABLE events SET (autovacuum_vacuum_threshold=10000);
ALTER TABLE events SET (autovacuum_vacuum_scale_factor=0.0);
  • Automatic VACUUM is performed on a table when the number of invalidated rows (most commonly done by DELETE) gets over “threshold + scale_factor * table_size_in_rows”. The default values of threshold = 50 and scale_factor = 0.2 are suitable for smaller tables but mean serious index bloat in our use case (9M invalidated rows takes about 3 weeks under current load). Current setting instruct to perform autovacuum after invalidated 10000 rows, effectively after each batch DELETE, which keeps table and indices much more compact. Occasional REINDEX (best performed by CLUSTER) is still required after change, but the frequency decreased from daily to about monthly.

ALTER TABLE events SET (autovacuum_analyze_threshold=20000);
ALTER TABLE events SET (autovacuum_analyze_scale_factor=0.0);
  • The meaning and impact of these parameters is similar to their vacuum counterparts with the difference that new rows (INSERT) are also included in the calculation. Under constant INSERT that we see on Mentat’s events table, 20000 is reached roughly every 15 - 20 minutes. This interval allows for good query plan estimation for queries with time based conditions in recent past (select events detected in last 4 hours) as is often the case. If the number of INSERTs per second increases, it might be a good idea to scale this value accordingly.

# Original default value -1 (use vacuum_cost_limit)
autovacuum_vacuum_cost_limit = 5000

# Original default value 20ms
autovacuum_vacuum_delay = 10ms
  • These settings limit the automatic VACUUM operation, when the cost limit is reached (same scale as other costs), the operation is postponed for the length of delay. Default values of cost 200 and delay 20 ms are aimed at very low additional load to be generated by this background task to affect normal operation. As this is mostly CPU bound and runs in single thread, we can allow for more aggressive values in our use case. For large tables such as events, this allows the autovacuum to finish in a few minutes instead of tens of minutes with default values. The risk of cancelling background VACUUM due to conflicting operation is therefore lowered and the reclamation of invalidated rows is performed much more promptly what reduces the index and table bloat considerably. The chosen value of delay still enables brief autovacuum process sleeps during the run, what is aimed at preventing disk subsystem overload as VACUUM tends to be an IO heavy operation.

Current database schema

Metadata database

Related modules:

The database schema is generated using the sqlalchemy from model classess and it will not be described here. Because of the database use-cases in-depth knowledge of this database should not be necessary. If that is not your case, please study documentation of appropriate modules.

IDEA event database

Related module: mentat.services.eventstorage

#
# Main IDEA event storage.
#
CREATE TABLE IF NOT EXISTS events(
    id text PRIMARY KEY,
    detecttime timestamp NOT NULL,
    category text[] NOT NULL,
    description text,
    source_ip iprange[],
    target_ip iprange[],
    source_ip_aggr_ip4 ip4r,
    source_ip_aggr_ip6 ip6r,
    target_ip_aggr_ip4 ip4r,
    target_ip_aggr_ip6 ip6r,
    source_port integer[],
    target_port integer[],
    source_type text[],
    target_type text[],
    protocol text[],
    node_name text[] NOT NULL,
    node_type text[],
    cesnet_storagetime timestamp NOT NULL,
    cesnet_resolvedabuses text[],
    cesnet_eventclass text,
    cesnet_eventseverity text,
    cesnet_inspectionerrors text[],
    event bytea
);

#
# Storage for reporting thresholds.
#
CREATE TABLE IF NOT EXISTS thresholds(
    id text PRIMARY KEY,
    relapsetime timestamp NOT NULL,
    ttltime timestamp NOT NULL
);

#
# Storage for thresholded (postponed) events.
#
CREATE TABLE IF NOT EXISTS events_thresholded(
    eventid text NOT NULL,
    keyid text NOT NULL,
    groupname text NOT NULL,
    eventseverity text NOT NULL,
    createtime timestamp NOT NULL,
    PRIMARY KEY(eventid, keyid)
);

CREATE INDEX IF NOT EXISTS events_detecttime_idx ON events USING BTREE (detecttime);
CREATE INDEX IF NOT EXISTS events_cesnet_storagetime_idx ON events USING BTREE (cesnet_storagetime);
CREATE INDEX IF NOT EXISTS events_cesnet_eventseverity_idx ON events USING BTREE (cesnet_eventseverity) WHERE cesnet_eventseverity IS NOT NULL;
CREATE INDEX IF NOT EXISTS events_combined_idx ON events USING GIN (category, node_name, protocol, source_port, target_port, source_type, target_type, node_type, cesnet_resolvedabuses, cesnet_inspectionerrors);

CREATE INDEX IF NOT EXISTS thresholds_relapsetime_idx ON thresholds USING BTREE (relapsetime);
CREATE INDEX IF NOT EXISTS thresholds_ttltime_idx ON thresholds USING BTREE (ttltime);

CREATE INDEX IF NOT EXISTS events_thresholded_combined_idx ON events_thresholded USING BTREE (groupname, eventseverity);
CREATE INDEX IF NOT EXISTS events_thresholded_createtime_idx ON events_thresholded USING BTREE (createtime);

Usefull maintenance queries

Sources:

Show index usage

SELECT
    t.tablename,
    indexname,
    c.reltuples AS num_rows,
    pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size,
    pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size,
    CASE WHEN indisunique THEN 'Y'
       ELSE 'N'
    END AS UNIQUE,
    idx_scan AS number_of_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_tables t
LEFT OUTER JOIN pg_class c ON t.tablename=c.relname
LEFT OUTER JOIN
    ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns, idx_scan, idx_tup_read, idx_tup_fetch, indexrelname, indisunique FROM pg_index x
           JOIN pg_class c ON c.oid = x.indrelid
           JOIN pg_class ipg ON ipg.oid = x.indexrelid
           JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid )
    AS foo
    ON t.tablename = foo.ctablename
WHERE t.schemaname='public'
ORDER BY 1,2;

Show table size statistics

SELECT
    *,
    pg_size_pretty(total_bytes) AS total,
    pg_size_pretty(index_bytes) AS INDEX,
    pg_size_pretty(toast_bytes) AS toast,
    pg_size_pretty(table_bytes) AS TABLE
FROM (
    SELECT
        *,
        total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes
    FROM (
        SELECT
            c.oid,
            nspname AS table_schema,
            relname AS TABLE_NAME,
            c.reltuples AS row_estimate,
            pg_total_relation_size(c.oid) AS total_bytes,
            pg_indexes_size(c.oid) AS index_bytes,
            pg_total_relation_size(reltoastrelid) AS toast_bytes
        FROM pg_class c
        LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE relkind = 'r' AND relname = %[table_name]
    ) a
) a;