HEALTH-CHECK MONITORING

The SELECT health_check_monitoring command enables system administrators to monitor the database’s health across four categories:

  • storage provides insights into cluster storage chunks and their fragmentation, helping to prevent table-reading bottlenecks by alerting when fragmentation occurs. It also indicates, per table, when to trigger cleanup executions to free up storage and improve read performance.

  • metadata_stats offers information on Worker and metadata reactivity, enabling you to assess system performance during peak loads and to reveal potential concurrency issues.

  • license gives details on the customer’s license, including storage capacity and restrictions, and proactively alerts administrators before limitations are reached.

  • self_healing supplies essential details on ETL and load processes, monitors query execution flow, tracks Workers per service, identifies idle Workers, and detects issues such as stuck snapshots. These insights are crucial for regular monitoring and provide a clear picture during the Root Cause Analysis (RCA) process for optimal resource allocation.

Here you can find details on configuring monitoring for each of the four categories, along with instructions on how to access and interpret the log files for each category.

Before You Begin

  • Download the Health-Check Monitor input.json configuration file and save it anywhere you want.

  • The SELECT health_check_monitoring command requires SUPERUSER permissions.

Configuration

There are two types of Health-Check Monitor metrics: configurable and non-configurable. Non-configurable metrics provide information about the system, such as total storage capacity in gigabytes. Configurable metrics are set with a low threshold, a high threshold, or both, defining a valid range; any deviation from this range is reported. An example is the number of days remaining on your SQreamDB license.

Default Metric Values

The Health-Check Monitor configuration file comes preconfigured with best-practice values; however, you may customize any default metric value to suit your needs. All metrics presented below are defined with valid ranges, so any value outside the range triggers a warning. Note that if you configure only one threshold, the Health-Check Monitor treats the omitted threshold as infinity (see the example following the default configuration below).

{
   "totalNumberOfFragmentedChunks":{
          "from":0,
          "to":100
   },
   "percentageStorageCapacity":{
          "from":0,
          "to":0.9
   },
   "daysForLicenseExpire":{
          "from":60
   },
   "stuckSnapshots":{
          "from":0,
          "to":2
   },
   "queriesInQueue":{
          "from":0,
          "to":100
   },
   "availableWorkers":{
          "from":0,
          "to":5
   },
   "nodeHeartbeatMsgMaxResponseTimeMS":{
          "from":0,
          "to":1000
   },
   "checkLocksMsgMaxResponseTimeMS":{
          "from":0,
          "to":1000
   },
   "keysAndValuesNMaxResponseTimeMS":{
          "from":0,
          "to":1000
   },
   "keysWithPrefixMsgMaxResponseTimeMS":{
          "from":0,
          "to":1000
   },
   "nodeHeartbeatMsgVariance":{
          "from":0,
          "to":1000
   },
   "checkLocksMsgVariance":{
          "from":0,
          "to":1000
   },
   "keysAndValuesNVariance":{
          "from":0,
          "to":1000
   },
   "keysWithPrefixMsgVariance":{
          "from":0,
          "to":1000
   }
}
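For example, to be warned about license expiration earlier than the default 60 days, you could raise the lower bound of the daysForLicenseExpire entry in your copy of the configuration file (90 is an illustrative value):

"daysForLicenseExpire":{
       "from":90
}

Because no "to" value is set, the upper threshold remains infinity, so only a remaining-days value below 90 is reported as a warning.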

General Syntax

SELECT health_check_monitoring('<category>', '<input_file>', '<export_path>')

category ::= { storage | metadata_stats | license | self_healing }
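For example, a license health check might look like the following; the input file and log directory paths are illustrative and should be replaced with paths from your own environment:

SELECT health_check_monitoring('license', '/home/sqream/health_check/input.json', '/home/sqream/health_check/logs')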
Parameters

  • category: Specifies the system domain for which health information is to be retrieved.

  • input_file: Specifies the path to the configuration file of the designated category for which you want to obtain information.

  • export_path: Specifies the directory path where you want the monitoring log file to be extracted.

Health-Check Log Structure

After executing the SELECT health_check_monitoring command, a health-check log file and a CLI result set are generated. In addition to the metric values, the CLI result set also shows the metric range configuration you initially set and the location of your exported log file. Note that a separate log is generated for each of the four Health-Check Monitor categories.

The log file and the result set both output the following information:

Log Output

  • metric_time: The time when the specific metric was checked.

  • metric_category: The system domain for which health information is retrieved; either storage, metadata_stats, license, or self_healing.

  • metric_name: The specific metric being evaluated.

  • metric_description: For metrics that need a detailed analysis breakdown, this column shows the breakdown alongside any additional information.

  • metric_value: The value of the specific metric.

  • metric_validation_status: One of three statuses:
      • info: the metric value is within its defined valid range
      • none: the metric provides information about the system and has no valid range
      • warn: the metric deviates from its defined valid range

  • response_time_sec: The time taken to gather information for the specific metric; helpful for timing health-check executions.

Handling Warnings

Upon reviewing your log output, you’ll observe that the metric_validation_status column reflects one of three statuses: info, none, or warn. This section offers guidance on handling metrics whenever a warn status is encountered.

  • Category: Storage
    Metric: No. fragmented chunks
    How to handle warn: Recreate the table to trigger defragmentation.

  • Category: Metadata Statistics
    Metrics: NodeHeartbeatMsg, CheckLocksMsg, KeysAndValuesNMsg, KeysWithPrefixMsg
    How to handle warn: Gather your metadata statistics by executing the following commands and send the information to SQreamDB Support (a usage example with illustrative paths follows this list):

    SELECT export_leveldb_stats('<EXPORT_FILE_PATH>');
    SELECT export_statement_queue_stats('<EXPORT_FILE_PATH>');
    SELECT export_conn_stats('<EXPORT_FILE_PATH>');

  • Category: License
    Metrics: % of used storage capacity, License expiration date
    How to handle warn: Contact SQreamDB Support for license expansion.

  • Category: Self Healing
    Metric: Queries in queue
    How to handle warn: To prevent bottlenecks in the service, reallocate service Workers. Distributing or reallocating service Workers strategically can help optimize performance and mitigate potential bottlenecks. Learn more about Worker allocation.

  • Category: Self Healing
    Metric: Available workers per service
    How to handle warn: Efficiently utilize resources by reallocating idle Workers to a busier service. This approach optimizes resource consumption and helps balance the workload across services. Learn more about Worker allocation.

  • Category: Self Healing
    Metric: Stuck snapshots
    How to handle warn: The Healer is designed to autonomously address stuck snapshots based on its configured timeout. The healerDetectionFrequencySeconds session flag determines when the Healer detects and takes corrective action for stuck snapshots. To manually address the situation, execute a graceful shutdown of the statement’s Worker.
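For illustration, assuming a hypothetical export location of /tmp/md_stats, the three statistics exports above might be run as:

SELECT export_leveldb_stats('/tmp/md_stats/leveldb_stats');
SELECT export_statement_queue_stats('/tmp/md_stats/statement_queue_stats');
SELECT export_conn_stats('/tmp/md_stats/conn_stats');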

Health-Check Category Specifications

Storage

Provides insights into cluster storage chunks and their fragmentation. It also offers an indication of irrelevant storage files in the cluster, helping to prevent potential bottlenecks in chunk iteration during table reads.

storage monitoring has a lengthy execution time, necessitating low-frequency checks to prevent undue strain on your environment.

SELECT health_check_monitoring('storage', 'path/to/my/input.json', 'directory/where/i/save/logs')

When monitoring your storage health, you may also filter information retrieval by database, by database and schema, or by database, schema, and table.

SELECT health_check_monitoring('storage', 'master', 'path/to/my/input.json', 'path/to/where/i/save/logs')

SELECT health_check_monitoring('storage', 'master', 'schema1', 'path/to/my/input.json', 'path/to/where/i/save/logs')

SELECT health_check_monitoring('storage', 'master', 'schema1', 'table1', 'path/to/my/input.json', 'path/to/where/i/save/logs')

Storage Metrics

  • Metric: No. storage chunks
    Configuration flag: NA
    Default value: NA
    Description: Chunk status, categorized as either clean, mixed, or deleted. This classification aids in understanding potential slowdowns when reading from a table. Clean indicates that your table is free of physically lingering deleted data. Mixed indicates that your table contains data marked for deletion but not yet purged (awaiting the removal of deleted data). Deleted indicates that the table has undergone the cleanup process. This categorization is valuable for scrutinizing deletion and clean-up practices, particularly when visualizing data through dedicated tools.

  • Metric: No. fragmented chunks
    Configuration flag: totalNumberOfFragmentedChunks
    Default value: "from":0, "to":100
    Description: The number of fragmented chunks.

Metadata Statistics

Provides information on Worker and metadata reactivity. Regular monitoring helps identify how the system performs during peak loads, indicating periods of heavy system load. This insight can be invaluable for uncovering potential concurrency issues.

SELECT health_check_monitoring('metadata_stats', 'path/to/my/input.json', 'directory/where/i/save/logs')

Metadata Statistics Metrics

  • Metric: NodeHeartbeatMsg
    Configuration flags: nodeHeartbeatMsgMaxResponseTimeMS, nodeHeartbeatMsgVariance
    Default value: "from":0, "to":1000
    Description: Ensures Worker vitality through metadata pings. Max response time indicates the peak time for the monitored category, while variance represents the standard deviation between the peak time and the monitoring time.

  • Metric: CheckLocksMsg
    Configuration flags: checkLocksMsgMaxResponseTimeMS, checkLocksMsgVariance
    Default value: "from":0, "to":1000
    Description: Provides details on current locks at the metadata to determine the feasibility of executing the statement. Max response time indicates the peak time for the monitored category, while variance represents the standard deviation between the peak time and the monitoring time.

  • Metric: KeysAndValuesNMsg
    Configuration flags: keysAndValuesNMaxResponseTimeMS, keysAndValuesNVariance
    Default value: "from":0, "to":1000
    Description: Iterates through metadata keys and values. Max response time indicates the peak time for the monitored category, while variance represents the standard deviation between the peak time and the monitoring time.

  • Metric: KeysWithPrefixMsg
    Configuration flags: keysWithPrefixMsgMaxResponseTimeMS, keysWithPrefixMsgVariance
    Default value: "from":0, "to":1000
    Description: Iterates through metadata keys and values with a specific prefix. Max response time indicates the peak time for the monitored category, while variance represents the standard deviation between the peak time and the monitoring time.

License

Provides details about the customer’s license, including database storage capacity and licensing restrictions. Proactively alerts the customer before reaching license limitations, ensuring awareness and timely action.

SELECT health_check_monitoring('license', 'path/to/my/input.json', 'directory/where/i/save/logs')

License Metrics

  • Metric: Total storage capacity
    Configuration flag: NA
    Default value: NA
    Description: Indicates your licensed storage capacity, outlining the permissible limit for your usage.

  • Metric: Used storage capacity
    Configuration flag: NA
    Default value: NA
    Description: Indicates current storage utilization.

  • Metric: % of used storage capacity
    Configuration flag: percentageStorageCapacity
    Default value: "from":0, "to":0.9
    Description: Indicates the current storage utilization percentage.

  • Metric: License expiration date
    Configuration flag: daysForLicenseExpire
    Default value: "from":60
    Description: Indicates how many days remain until your license expires.

self_healing

Supplies details on customer ETLs and loads, monitors the execution flow of queries over time, tracks the number of Workers per service, identifies idle Workers, and detects potential issues such as stuck snapshots. It is imperative to regularly monitor this data. During the Root Cause Analysis (RCA) process, it provides a clear understanding of executed operations at specific times, offering customers guidance on optimal resource allocation, particularly in terms of Workers per service.

Monitoring self_healing frequently is a best practice to maximize its value.

SELECT health_check_monitoring('self_healing', 'path/to/my/input.json', 'directory/where/i/save/logs')

self_healing Metrics

  • Metric: Queries in queue
    Configuration flag: queriesInQueue
    Default value: "from":0, "to":100
    Description: Indicates the number of currently queued queries.

  • Metric: Available workers per service
    Configuration flag: availableWorkers
    Default value: "from":0, "to":5
    Description: Indicates the number of unused Workers per service.

  • Metric: Stuck snapshots
    Configuration flag: stuckSnapshots
    Default value: "from":0, "to":2
    Description: Indicates the number of currently stuck snapshots.