The Mentat system uses the awesome PostgreSQL as its primary database backend storage. For performance reasons two different database abstraction layers are used to access the data.
This library/module is used to access the IDEA event database. This database contains the main table for storing IDEA events as well as some additional tables that implement thresholding and relapse mechanism for reporting. For performance reasons it is implemented directly on top of the psycopg2 Python driver for PostgreSQL.
This library/module is used to access the metadata database and work with any other objects like user accounts, groups, reporting filters, etc. For convenience it is implemented on top of sqlalchemy Python universal DBAL and ORM toolkit.
Database configuration tuning¶
The main resources used for database configuration tuning:
Example server hardware specs¶
Configuration and tuning tips in following sections are based on our experience with operating Mentat system and its PostgreSQL database on following hardware:
Dell PowerEdge R630
2 x Intel Xeon E5-2695 v4 2.1GHz,45M Cache,9.60GT/s QPI,Turbo,HT,18C/36T (120W) Max Mem 2400MHz
256 GB RAM (8 x 32GB RDIMM, 2400MT/s, Dual Rank, x4 Data Width)
PERC H730 Integrated RAID Controller, 1GB Cache
3,6 TB HDD (6 x 600GB 15K RPM SAS 2.5in Hot-plug Hard Drive)
General settings and performance optimizations¶
Following options were changed in the main configuration file
# Original default value 'localtime' timezone = 'UTC'
Timezone must be set to UTC to enable proper time based searching. All timestamps
are converted to UTC prior to storing to database and stored without the timezone information according to PostgreSQL best practices.
# Original default value 128MB shared_buffers = 16GB
Shared buffers are used to hold database representation in memory before writing it to disk. In
UPDATEheavy systems, larger values are beneficial. Setting this value too large does not help performance as OS disk cache is also used by PostgreSQL. Shared buffers smooth the IO operations rate and prevent spikes. Value of 16GB seems enough for this system and typical Mentat load. An increase might be in order if either the
INSERTrate or mean event size increase substantially to keep the buffering effect.
# Original default value 4 MB work_mem = 16MB
Specifies the memory size for use in internal operations such as in-memory sorting and hash table calculation. This limit is per internal operation, in typical query multiple such operations are executed. If the operation would require more memory than this limit, a theoretically slower on-disk variant is used (which might still be performed in disk subsystem cache with no real IO operations to storage, possibly even with better performance). Value of 16MB seems to work right for this system. As the internal sort is used in
ORDER BY, some queries where index can not be used for sorting might profit from larger values (as the amount of data to sort is seldom too high in Mentat with typical query using
# Original default value 64MB maintenance_work_mem = 256MB
Maintenance work memory is similar to
work_membut used for maintenance operations such as
CREATE INDEX, larger values (when compared to
work_mem) decrease their run time considerably. Only one such operation can run per database at the same time, value of 256MB is generally enough for the typical table sizes of Mentat and IOps reserves of this system.
# Original default value was commented out wal_buffers = -1
Defines the maximum amount of memory (in
shared_buffers) for write ahead log (WAL) data buffering. Typically WAL is written to disk after every transaction (hard requirement, otherwise journaling stops functioning as intended) so this should be large enough to hold all WAL data for a typical transaction for optimum performance. Value of -1 means no limit so transactions of arbitrary size (limited by
shared_buffers, so essentialy no limit) are supported optimally. Limiting this setting only makes sense on resource constrained systems with small typical transaction size (few queries per transaction with small row size).
# Original default value 4GB effective_cache_size = 200GB
Effective cache size is an estimate of the amount of system memory usable as disk cache for PostgreSQL. On a modern Linux system, this value is generally the total memory size minus the memory used by application programs. An efficient way to determine a suitable value is to monitor the targer system under expected load and note the amount of memory dedicated to disk buffers over time. On a system running PostgreSQL only, reasonable
effective_cache_sizeis equal to the disk buffer size. If the DB shares the system with other disk write/read (total size, not IOps) intensive applications, this value should be decreased accordingly. This value represents the amount of on-disk data that PostgreSQL can estimate to be accessible in constant time. If it is larger than the sum of sizes of all tables and indices, disk IO is ammortized to initial read at first affected query after startup. Mentat is IOps intensive but quite light on total amout of data read/written.
Following options were changed in the main configuration file
# Original default value 1GB max_wal_size = 32GB
Write ahead log serves as a journal of changes against the on-disk snapshot of database that is created by checkpoint. In case of failure, the journal is replayed over the checkpoint to get to the state that includes all commited transactions. As checkpoint is a costly process that requires a table lock, it is best not performed too often (with the downside of a slightly longer database rebuild time after failure). Mentat is an
INSERTheavy application as new events are recorded at all time at a rate of ~14 per second. Meanwhile Mentat can tolerate slightly longer database rebuild times. An increase in maximum write ahead log size from 1GB to 32GB increased the time between required consecutive checkpoints considerably, reducing overall system load and improving
SELECTlatency (as analytic queries are less likely to hit the less frequent checkpoints), furthermore this improves
INSERTperformance for batches of larger events (up to 250kB as opposed to long time running average of 10kB) which appear at an increasing rate lately.
# Original default value 5min checkpoint_timeout = 15min
Checkpoint timeout defines the maximum time between two consecutive checkpoints. A value too small cases frequent checkpointing, an disk IO intensive operation. As Mentat writes are mostly
DELETEs- larger values are sustainable. The increase from 5 minutes to 15 minutes leads to lower load of disk subsystem, leaving more bandwidth for
SELECT- improving performance in case of hitting an
INSERTheavy timeslot (large events), before all three firing caused disk subsystem overload (extremely poor performance), after the change the probability decreased considerably.
Enabling query parallelism¶
Following options were changed in the main configuration file
As of PostgreSQL 9.6, with significant further improvements in 10.0, individual queries can be performed by parallel subworkers. Default settings are very conservative and allow this type of parallelism only in extreme cases and with very few workers. For any system, configuration has to be tuned individually. List of changes to improve parallelism follows:
# Original default value 8 max_parallel_workers = 36
Defines maximum number of parallel workers shared by all queries. As with Mentat we generally only have a low number of concurrently executed queries, this value was set to accommodate two queries with full parallelism, see
# Original default value 2 max_parallel_workers_per_gather = 18
Defines the maximum number of parallel workers for one query. To allow the OS task scheduler to place all workers on one physical CPU and to stop threads competing for resources because of HyperThreading, the value was set to number of physical cores on one CPU (18 for Intel(R) Xeon(R) CPU E5-2695 v4).
# Original default value 1 effective_io_concurrency = 4
Defines the optimal number of parallel IO requests for the disk subsystem. Default value of 1 is aimed at non-RAID HDDs; SATA SSDs perform best with values of 4 - 8; NVMe SSDs can support much higher values. Chosen value of 4 seems representative of example server’s storage capabilities.
# Original default value 1.0 seq_page_cost = 0.1
PostgreSQL query planner uses costs on arbitrary scale for local optimization of function of suboperation costs. These cost values represent relative cost of suboperations to each other. Due to performant storage, the sequential page cost, a representation of cost of sequential write when compared to CPU processing speed was decreased from 1.0 to 0.1 (10x).
# Original default value 4.0 random_page_cost = 0.15
seq_page_cost, this value represents the cost of random IO. The default ratio of random_page_cost:seq_page_cost = 4:1 is aimed at HDDs. With example server’s storage, random reads are almost equal to sequential (on HW level completely equal, cache manager is able to predict and buffer sequential reads so those are a bit more performant).
# Original default value 0.1 parallel_tuple_cost = 0.005
The cost for parallel tuple retrieval is set to 1/2 of
cpu_tuple_cost(for single threaded retrieval) to make parallel execution preferable, using more workers. This reduces the time it takes to perform the parallel portion significantly, making the total execution time converge to the time of serial only portion. This is important for systems with high CPU counts (60+). If the amount of available CPUs is lower, it does not need to be decreased that much. In any case, altering
min_parallel_index_scan_sizefirst is preferable as it brings results sooner.
# Original default value 1000 parallel_setup_cost = 10.0
Represents the cost to set up worker threads (pass the task to them as they already run). This is generally negligible and the default value of 1000.0 limits the parallel execution to extreme cases only.
# Original default value 512kB min_parallel_index_scan_size = 128kB
Defines the minimum segment size to partition indices for parallel execution. Lowering this value to 128kB enables parallel index scans over the most used index events_cesnet_storagetime_idx to run with optimal number of workers across a large range of stored event counts when filtering for a whole day worth of events. Affected queries no longer run over threshold (2s). Lowering this value helps most if either the system is equipped with a high parallel I/O capable disk subsystem or the whole DB fits into RAM. Otherwise it is advised to watch out for workers starving on I/O.
The two most important changes enabling parallelism for queries are those for
effective_io_concurrency, which limits storage subsystem queues, and
parallel_setup_cost, which makes parallel execution seem viable even for simpler queries (and Mentat uses simple queries almost exclusively). The changes to costs are aimed at better prediction of actual run time (comparison of estimated and real run time is provided by
EXPLAIN ANALYZE), after are the estimations better matching (not 1:1 but linear function). Number of workers was set based on current load, common number of concurrent queries and memory consumption settings from initial configuration. If under higher analytical (OLAP) load (multiple Hawat users), those values might be the best place to start further optimization.
Regular VACUUM and ANALYZE¶
PostgreSQL has two main maintenance tasks to maintain performance.
VACUUM invalidates vacated entries in both tables and indices, to prevent bloat.
ANALYZE performs frequency analysis on table columns, which are then used by query planner. Both procedures are assigned a command of the same name and need to be performed regularly. Ongoing
ANALYZE is performed automatically by the
AUTOVACUUM daemon. Default
AUTOVACUUM settings are suitable for common use, in comparison to which the Mentat’s event table is very different. It is a table with ~125M rows, with constant
INSERTs and batch
DELETEs. Therefore the settings for this use case have to be altered.
# Original default value 100 default_statistics_target = 1000
A multiplicative parameter for depth of analysis for
ANALYZE, it allows for a trade-off between
ANALYZEspeed and relative frequency data precision by defining statistic subset size. Default value of 100 is reasonable for smaller tables or tables with roughly uniform distribution of individual values in columns. As
AUTOANALYZEis performed in the background and requires no lock, we can allow for a higher value allowing for better query plans. If the number of rows grows in the future, it might be a good idea to scale this value accordingly to allow for good estimations for the more rare of values.
ALTER TABLE events SET (autovacuum_vacuum_threshold=10000); ALTER TABLE events SET (autovacuum_vacuum_scale_factor=0.0);
VACUUMis performed on a table when the number of invalidated rows (most commonly done by
DELETE) gets over “threshold + scale_factor * table_size_in_rows”. The default values of threshold = 50 and scale_factor = 0.2 are suitable for smaller tables but mean serious index bloat in our use case (9M invalidated rows takes about 3 weeks under current load). Current setting instruct to perform autovacuum after invalidated 10000 rows, effectively after each batch DELETE, which keeps table and indices much more compact. Occasional
REINDEX(best performed by
CLUSTER) is still required after change, but the frequency decreased from daily to about monthly.
ALTER TABLE events SET (autovacuum_analyze_threshold=20000); ALTER TABLE events SET (autovacuum_analyze_scale_factor=0.0);
The meaning and impact of these parameters is similar to their vacuum counterparts with the difference that new rows (
INSERT) are also included in the calculation. Under constant
INSERTthat we see on Mentat’s events table, 20000 is reached roughly every 15 - 20 minutes. This interval allows for good query plan estimation for queries with time based conditions in recent past (select events detected in last 4 hours) as is often the case. If the number of
INSERTsper second increases, it might be a good idea to scale this value accordingly.
# Original default value -1 (use vacuum_cost_limit) autovacuum_vacuum_cost_limit = 5000 # Original default value 20ms autovacuum_vacuum_delay = 10ms
These settings limit the automatic
VACUUMoperation, when the cost limit is reached (same scale as other costs), the operation is postponed for the length of delay. Default values of cost 200 and delay 20 ms are aimed at very low additional load to be generated by this background task to affect normal operation. As this is mostly CPU bound and runs in single thread, we can allow for more aggressive values in our use case. For large tables such as events, this allows the autovacuum to finish in a few minutes instead of tens of minutes with default values. The risk of cancelling background
VACUUMdue to conflicting operation is therefore lowered and the reclamation of invalidated rows is performed much more promptly what reduces the index and table bloat considerably. The chosen value of delay still enables brief autovacuum process sleeps during the run, what is aimed at preventing disk subsystem overload as
VACUUMtends to be an IO heavy operation.
Current database schema¶
The database schema is generated using the sqlalchemy from model classess and it will not be described here. Because of the database use-cases in-depth knowledge of this database should not be necessary. If that is not your case, please study documentation of appropriate modules.
IDEA event database¶
# # Main IDEA event storage. # CREATE TABLE IF NOT EXISTS events( id text PRIMARY KEY, detecttime timestamp NOT NULL, category text NOT NULL, description text, source_ip iprange, target_ip iprange, source_ip_aggr_ip4 ip4r, source_ip_aggr_ip6 ip6r, target_ip_aggr_ip4 ip4r, target_ip_aggr_ip6 ip6r, source_port integer, target_port integer, source_type text, target_type text, protocol text, node_name text NOT NULL, node_type text, cesnet_storagetime timestamp NOT NULL, cesnet_resolvedabuses text, cesnet_eventclass text, cesnet_eventseverity text, cesnet_inspectionerrors text, event bytea ); # # Storage for reporting thresholds. # CREATE TABLE IF NOT EXISTS thresholds( id text PRIMARY KEY, relapsetime timestamp NOT NULL, ttltime timestamp NOT NULL ); # # Storage for thresholded (postponed) events. # CREATE TABLE IF NOT EXISTS events_thresholded( eventid text NOT NULL, keyid text NOT NULL, groupname text NOT NULL, eventseverity text NOT NULL, createtime timestamp NOT NULL, PRIMARY KEY(eventid, keyid) ); CREATE INDEX IF NOT EXISTS events_detecttime_idx ON events USING BTREE (detecttime); CREATE INDEX IF NOT EXISTS events_cesnet_storagetime_idx ON events USING BTREE (cesnet_storagetime); CREATE INDEX IF NOT EXISTS events_cesnet_eventseverity_idx ON events USING BTREE (cesnet_eventseverity) WHERE cesnet_eventseverity IS NOT NULL; CREATE INDEX IF NOT EXISTS events_combined_idx ON events USING GIN (category, node_name, protocol, source_port, target_port, source_type, target_type, node_type, cesnet_resolvedabuses, cesnet_inspectionerrors); CREATE INDEX IF NOT EXISTS thresholds_relapsetime_idx ON thresholds USING BTREE (relapsetime); CREATE INDEX IF NOT EXISTS thresholds_ttltime_idx ON thresholds USING BTREE (ttltime); CREATE INDEX IF NOT EXISTS events_thresholded_combined_idx ON events_thresholded USING BTREE (groupname, eventseverity); CREATE INDEX IF NOT EXISTS events_thresholded_createtime_idx ON events_thresholded USING BTREE (createtime);
Usefull maintenance queries¶
Show index usage¶
SELECT t.tablename, indexname, c.reltuples AS num_rows, pg_size_pretty(pg_relation_size(quote_ident(t.tablename)::text)) AS table_size, pg_size_pretty(pg_relation_size(quote_ident(indexrelname)::text)) AS index_size, CASE WHEN indisunique THEN 'Y' ELSE 'N' END AS UNIQUE, idx_scan AS number_of_scans, idx_tup_read AS tuples_read, idx_tup_fetch AS tuples_fetched FROM pg_tables t LEFT OUTER JOIN pg_class c ON t.tablename=c.relname LEFT OUTER JOIN ( SELECT c.relname AS ctablename, ipg.relname AS indexname, x.indnatts AS number_of_columns, idx_scan, idx_tup_read, idx_tup_fetch, indexrelname, indisunique FROM pg_index x JOIN pg_class c ON c.oid = x.indrelid JOIN pg_class ipg ON ipg.oid = x.indexrelid JOIN pg_stat_all_indexes psai ON x.indexrelid = psai.indexrelid ) AS foo ON t.tablename = foo.ctablename WHERE t.schemaname='public' ORDER BY 1,2;
Show table size statistics¶
SELECT *, pg_size_pretty(total_bytes) AS total, pg_size_pretty(index_bytes) AS INDEX, pg_size_pretty(toast_bytes) AS toast, pg_size_pretty(table_bytes) AS TABLE FROM ( SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM ( SELECT c.oid, nspname AS table_schema, relname AS TABLE_NAME, c.reltuples AS row_estimate, pg_total_relation_size(c.oid) AS total_bytes, pg_indexes_size(c.oid) AS index_bytes, pg_total_relation_size(reltoastrelid) AS toast_bytes FROM pg_class c LEFT JOIN pg_namespace n ON n.oid = c.relnamespace WHERE relkind = 'r' AND relname = %[table_name] ) a ) a;