HEALTH-CHECK MONITORING
The SELECT health_check_monitoring command enables system administrators to monitor database health across four categories.

In the storage domain, it provides insight into cluster storage chunks and their fragmentation, helping prevent table-reading bottlenecks by alerting when fragmentation occurs. It also indicates, per table, when to trigger cleanup operations to free up storage and improve read performance. The metadata_stats category reports on Worker and metadata responsiveness, making it possible to identify system performance during peak loads and uncover potential concurrency issues. Addressing licensing concerns, the license category details the customer’s license, including storage capacity and restrictions, and proactively alerts administrators before limits are reached. Lastly, self_healing supplies essential details on ETL and load processes, monitors query execution flow, tracks Workers per service, identifies idle Workers, and detects issues such as stuck snapshots. Regular monitoring of this category is crucial and offers clear insight during the Root Cause Analysis (RCA) process for optimal resource allocation.
Here, you can discover details on configuring the monitoring for each of the four categories, along with instructions on how to access and interpret the log files for each category.
Before You Begin
Download the Health-Check Monitor input.json configuration file and save it in a location of your choice.

The SELECT health_check_monitoring command requires SUPERUSER permissions.
Configuration
There are two types of Health-Check Monitor metrics: configurable and non-configurable. Non-configurable metrics provide information about the system, such as total storage capacity measured in gigabytes. Configurable metrics are set with a low threshold, a high threshold, or both, defining a valid range; any deviation from that range is reported. For example, this could include the number of days remaining on your SQreamDB license.
Default Metric Values
The Health-Check Monitor configuration file comes preconfigured with best-practice values. However, you have the flexibility to customize any default metric value based on your preferences. All metrics presented below are defined with valid ranges, so any value outside the range triggers a warning. Note that if you configure only one threshold, the Health-Check Monitor treats the omitted threshold as infinity.
{
    "totalNumberOfFragmentedChunks": { "from": 0, "to": 100 },
    "percentageStorageCapacity": { "from": 0, "to": 0.9 },
    "daysForLicenseExpire": { "from": 60 },
    "stuckSnapshots": { "from": 0, "to": 2 },
    "queriesInQueue": { "from": 0, "to": 100 },
    "availableWorkers": { "from": 0, "to": 5 },
    "nodeHeartbeatMsgMaxResponseTimeMS": { "from": 0, "to": 1000 },
    "checkLocksMsgMaxResponseTimeMS": { "from": 0, "to": 1000 },
    "keysAndValuesNMaxResponseTimeMS": { "from": 0, "to": 1000 },
    "keysWithPrefixMsgMaxResponseTimeMS": { "from": 0, "to": 1000 },
    "nodeHeartbeatMsgVariance": { "from": 0, "to": 1000 },
    "checkLocksMsgVariance": { "from": 0, "to": 1000 },
    "keysAndValuesNVariance": { "from": 0, "to": 1000 },
    "keysWithPrefixMsgVariance": { "from": 0, "to": 1000 }
}
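The threshold semantics described above can be sketched in Python. This is a hypothetical helper, not part of SQreamDB: a metric passes validation when its value falls inside the configured range, and an omitted bound is treated as infinity.

```python
import json
import math

def validate_metric(config: dict, name: str, value: float) -> str:
    """Return "warn" if the value falls outside the configured range,
    otherwise "none". An omitted bound is treated as infinity."""
    thresholds = config.get(name, {})
    low = thresholds.get("from", -math.inf)   # missing "from" -> -infinity
    high = thresholds.get("to", math.inf)     # missing "to"   -> +infinity
    return "none" if low <= value <= high else "warn"

# A fragment of the configuration file shown above
config = json.loads("""
{
    "daysForLicenseExpire": { "from": 60 },
    "queriesInQueue": { "from": 0, "to": 100 }
}
""")

# daysForLicenseExpire has only a lower bound, so its upper bound is infinite
print(validate_metric(config, "daysForLicenseExpire", 45))   # below 60 -> warn
print(validate_metric(config, "daysForLicenseExpire", 400))  # no upper bound -> none
print(validate_metric(config, "queriesInQueue", 150))        # above 100 -> warn
```

The sketch mirrors the rule stated above: configuring only `"from"` for daysForLicenseExpire means a very large value never warns, while a value below 60 does.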
General Syntax
SELECT health_check_monitoring('<category>', '<input_file>', '<export_path>')

category ::= { storage | metadata_stats | license | self_healing }

Parameter | Description
---|---
category | Specifies the system domain for which health information is to be retrieved
input_file | Specifies the path to the configuration file of the designated category for which you want to obtain information
export_path | Specifies the directory path where you want the monitoring log file to be extracted
Health-Check Log Structure
After executing the SELECT health_check_monitoring command, a health-check log file and a CLI result set are generated. When you read the health-check log through the CLI, it shows, in addition to the metric values, your configured metric ranges and the location of the exported log file. Note that a separate log is generated for each of the four Health-Check Monitor categories.
The log file and the result set both output the following information:
Log Column Name | Description
---|---
 | The time when the specific metric was checked
 | The system domain for which health information is retrieved; either storage, metadata_stats, license, or self_healing
 | The specific metric that is being evaluated
 | For metrics that need a detailed analysis breakdown, this column shows the breakdown alongside any additional information
 | The value of the specific metric
metric_validation_status | The validation status of the metric: info, none, or warn
 | Indicates the time taken to gather information for a specific metric. This is helpful for timing health-check executions
Handling Warnings
Upon reviewing your log output, you’ll observe that the metric_validation_status column reflects one of three potential statuses: info, none, or warn. This section offers guidance on handling warnings whenever a warn status is encountered.
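As a sketch of how you might triage warnings programmatically, the snippet below filters log rows by metric_validation_status. It assumes, hypothetically, a CSV-formatted log with category, metric, and metric_validation_status columns; check your actual exported log format before relying on this.

```python
import csv
import io

# Hypothetical sample log content; the real column layout may differ.
log_text = """category,metric,metric_validation_status
license,daysForLicenseExpire,warn
self_healing,queriesInQueue,none
storage,totalNumberOfFragmentedChunks,warn
"""

def warn_rows(text: str) -> list:
    """Return only the rows whose metric_validation_status is "warn"."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["metric_validation_status"] == "warn"]

for row in warn_rows(log_text):
    print(row["category"], row["metric"])
```

A filter like this could feed an alerting pipeline, so that only warn rows are escalated using the guidance in the table below each category.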
Health-Check Category | Metric Name | How to Handle warn
---|---|---
Storage |  | Recreate the table to trigger defragmentation
Metadata Statistics |  | Gather your metadata statistics by executing the following commands and send the information to SQreamDB Support: SELECT export_leveldb_stats('<EXPORT_FILE_PATH>'); SELECT export_statement_queue_stats('<EXPORT_FILE_PATH>'); SELECT export_conn_stats('<EXPORT_FILE_PATH>');
License |  | Contact SQreamDB Support for license expansion
Self Healing |  | To prevent bottlenecks in the service, reallocate service Workers. Distributing service Workers strategically can help optimize performance and mitigate potential bottlenecks. Learn more about Worker allocation
Self Healing |  | Efficiently utilize resources by reallocating idle Workers to a busier service. This optimizes resource consumption and helps balance the workload across services. Learn more about Worker allocation
Self Healing |  | The Healer autonomously addresses stuck snapshots based on its configured timeout. The healerDetectionFrequencySeconds session flag determines when the Healer detects and corrects stuck snapshots. To address a situation manually, execute a graceful shutdown of the statement’s Worker
Health-Check Category Specifications
Storage
Provides insight into cluster storage chunks and their fragmentation. It also flags irrelevant storage files in the cluster, preventing potential bottlenecks in chunk iteration during table reads.

storage monitoring has a lengthy execution time, so run it at low frequency to prevent undue strain on your environment.
SELECT health_check_monitoring('storage', 'path/to/my/input.json', 'directory/where/i/save/logs')
When monitoring your storage health, you may also narrow information retrieval to a specific database, to a database and schema, or to a database, schema, and table, as shown in the following examples.
SELECT health_check_monitoring('storage', 'master', 'path/to/my/input.json', 'path/to/where/i/save/logs')
SELECT health_check_monitoring('storage', 'master', 'schema1', 'path/to/my/input.json', 'path/to/where/i/save/logs')
SELECT health_check_monitoring('storage', 'master', 'schema1', 'table1', 'path/to/my/input.json', 'path/to/where/i/save/logs')
Metric | Configuration Flag | Default Value | Description
---|---|---|---
 | NA | NA | Chunk status; categorized as either
 | totalNumberOfFragmentedChunks | 0-100 | Defines the number of fragmented chunks
Metadata Statistics
Provides information on Worker and metadata reactivity. Regular monitoring allows for the identification of the system’s performance during peak loads, indicating periods of heavy system load. This insight can be invaluable for uncovering potential concurrent issues.
SELECT health_check_monitoring('metadata_stats', 'path/to/my/input.json', 'directory/where/i/save/logs')
Metric | Configuration Flag | Default Value | Description
---|---|---|---
 | nodeHeartbeatMsgMaxResponseTimeMS, nodeHeartbeatMsgVariance | 0-1000 | Ensures Worker vitality through metadata pings
 | checkLocksMsgMaxResponseTimeMS, checkLocksMsgVariance | 0-1000 | Provides details on current locks at the metadata to determine the feasibility of executing the statement
 | keysAndValuesNMaxResponseTimeMS, keysAndValuesNVariance | 0-1000 | Iterates through metadata keys and values
 | keysWithPrefixMsgMaxResponseTimeMS, keysWithPrefixMsgVariance | 0-1000 | Iterates through metadata keys and values with a specific prefix
License
Provides details about the customer’s license, including database storage capacity and licensing restrictions. Proactively alerts the customer before reaching license limitations, ensuring awareness and timely action.
SELECT health_check_monitoring('license', 'path/to/my/input.json', 'directory/where/i/save/logs')
Metric | Configuration Flag | Default Value | Description
---|---|---|---
 | NA | NA | Indicates your licensed storage capacity, outlining the permissible limit for your usage
 | NA | NA | Indicates current storage utilization
 | percentageStorageCapacity | 0-0.9 | Indicates current storage utilization percentage
 | daysForLicenseExpire | 60 (lower threshold only) | Indicates how many days remain until your license expires
Self Healing
Supplies details on customer ETLs and loads, monitors the execution flow of queries over time, tracks the number of Workers per service, identifies idle Workers, and detects potential issues such as stuck snapshots. It is imperative to regularly monitor this data. During the Root Cause Analysis (RCA) process, it provides a clear understanding of executed operations at specific times, offering customers guidance on optimal resource allocation, particularly in terms of Workers per service.
Monitoring self_healing frequently is a best practice and maximizes its value.
SELECT health_check_monitoring('self_healing', 'path/to/my/input.json', 'directory/where/i/save/logs')
Metric | Configuration Flag | Default Value | Description
---|---|---|---
 | queriesInQueue | 0-100 | Indicates the number of currently queued queries
 | availableWorkers | 0-5 | Indicates the number of unused Workers per service
 | stuckSnapshots | 0-2 | Indicates the number of currently stuck snapshots
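To run all four category checks in one pass from a client script, you could generate the statements programmatically, following the general syntax above. This is a sketch only: the paths are placeholders taken from the earlier examples, and the connection layer (e.g. a SQreamDB client driver) is omitted.

```python
# The four Health-Check Monitor categories, per the general syntax
CATEGORIES = ("storage", "metadata_stats", "license", "self_healing")

def build_statement(category: str, input_file: str, export_path: str) -> str:
    """Compose a health_check_monitoring call for one category."""
    return (f"SELECT health_check_monitoring("
            f"'{category}', '{input_file}', '{export_path}')")

# Placeholder paths; substitute your own configuration file and log directory
statements = [
    build_statement(c, "path/to/my/input.json", "directory/where/i/save/logs")
    for c in CATEGORIES
]
for stmt in statements:
    print(stmt)
```

Each generated statement would then be executed through your client connection, producing one log file per category as described above.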