Health Monitoring

The Health Monitoring service enhances observability, enabling shorter investigation times and facilitating both high-level and detailed drill-downs.

Before You Begin

Complete the following prerequisites before you begin:

Installation

All Health Monitoring-related installations are stand-alone installations.

Grafana

Grafana is an open-source analytics and monitoring platform designed for visualizing and analyzing real-time and historical data through customizable dashboards. It offers both an open-source version and an enterprise edition, catering to varying needs and scales of deployment.

For more details, refer to the Grafana specification.

Note

Log in as the root user.

Disabling SELinux

  1. Check the current SELinux status:

    getenforce
    
  2. Open the SELinux configuration file:

    vim /etc/sysconfig/selinux
    
  3. Set SELINUX to disabled:

    SELINUX=disabled
    
  4. Reboot your system:

    reboot
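
Steps 2 and 3 can also be applied without opening an editor. The following is a minimal sketch, not a definitive procedure: the function name set_selinux_disabled is hypothetical, and it assumes GNU sed (for the --follow-symlinks option, which matters because /etc/sysconfig/selinux is commonly a symbolic link).

```shell
#!/bin/sh
# Hypothetical helper: rewrite the SELINUX= line in a SELinux configuration file.
# Usage: set_selinux_disabled [config_file]
set_selinux_disabled() {
    config_file="${1:-/etc/sysconfig/selinux}"
    # Keep a .bak copy and edit the link target rather than replacing the link.
    # Only the SELINUX= line is changed; SELINUXTYPE= is left untouched.
    sed -i.bak --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/' "$config_file"
}

# Example (run as root, then reboot and confirm with getenforce):
#   set_selinux_disabled /etc/sysconfig/selinux
```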
    

Installing Grafana via YUM Repository

  1. Create a repo file:

    vim /etc/yum.repos.d/grafana.repo
    
  2. Add the following settings to the repo file:

    [grafana]
    name=grafana
    baseurl=https://packages.grafana.com/oss/rpm
    repo_gpgcheck=1
    enabled=1
    gpgcheck=1
    gpgkey=https://packages.grafana.com/gpg.key
    sslverify=1
    sslcacert=/etc/pki/tls/certs/ca-bundle.crt
    
  3. Install Grafana:

    sudo yum install grafana
    

    The installed package performs the following actions:

    • Installs the Grafana server binary at /usr/sbin/grafana-server

    • Copies the init.d script to /etc/init.d/grafana-server

    • Places the default environment variables file in /etc/sysconfig/grafana-server

    • Copies the main configuration file to /etc/grafana/grafana.ini

    • Installs the systemd service file (if systemd is supported) as grafana-server.service

    • Configures logging to /var/log/grafana/grafana.log by default

  4. Install the FreeType and URW fonts:

    yum install fontconfig
    yum install 'freetype*'
    yum install urw-fonts
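
The repository file from steps 1 and 2 can also be written non-interactively. This is a sketch under stated assumptions: the function name write_grafana_repo is hypothetical, and the file content is copied from step 2 unchanged.

```shell
#!/bin/sh
# Hypothetical helper: write the Grafana repository definition to a file.
# Usage: write_grafana_repo [repo_file]
write_grafana_repo() {
    repo_file="${1:-/etc/yum.repos.d/grafana.repo}"
    # The quoted 'EOF' delimiter prevents any shell expansion inside the file body.
    cat > "$repo_file" <<'EOF'
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
}

# Example (run as root):
#   write_grafana_repo
#   yum install -y grafana fontconfig 'freetype*' urw-fonts
```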
    

Enabling the Grafana Service

  1. Check the service status:

    systemctl status grafana-server
    
  2. If not active, start the service:

    systemctl start grafana-server
    
  3. Enable the Grafana service on system boot:

    systemctl enable grafana-server.service
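
The three steps above can be combined into one idempotent helper. This is only a sketch: the function name ensure_service is hypothetical, and it relies on the standard systemctl is-active, start, and enable subcommands.

```shell
#!/bin/sh
# Hypothetical helper: start a systemd service only if it is not already active,
# then enable it so that it starts on system boot.
# Usage: ensure_service <service_name>
ensure_service() {
    service="$1"
    if ! systemctl is-active --quiet "$service"; then
        systemctl start "$service" || return 1
    fi
    systemctl enable "$service"
}

# Example (run as root):
#   ensure_service grafana-server
```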
    

Modifying your Firewall

  1. Open the Grafana port:

    firewall-cmd --zone=public --add-port=3000/tcp --permanent
    
  2. Reload the firewall service:

    firewall-cmd --reload
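
If the procedure may be run more than once, the port can be opened only when it is not already present in the permanent configuration. A minimal sketch, assuming firewalld and the public zone used above; the function name open_tcp_port is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: open a TCP port in the public zone only if the
# permanent configuration does not already contain it.
# Usage: open_tcp_port <port>
open_tcp_port() {
    port="$1"
    if firewall-cmd --permanent --zone=public --query-port="${port}/tcp" >/dev/null 2>&1; then
        echo "port ${port}/tcp is already open"
        return 0
    fi
    # Add the port permanently, then reload so the running firewall picks it up.
    firewall-cmd --zone=public --add-port="${port}/tcp" --permanent &&
        firewall-cmd --reload
}

# Example (run as root):
#   open_tcp_port 3000
```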
    

Prometheus

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. Prometheus can be used to scrape and store metrics, enabling real-time monitoring, alerting, and analysis of performance and health.

Your SQreamDB installation includes a Prometheus YAML file.

  1. Download Prometheus.

  2. Set the YML path:

    PROMETHEUS_YML_PATH=<GRAFANA_PROJECT_PATH>/ymls/prometheus.yml
    
  3. Run the following script:

    Prometheus_Server_install () {
    echo "Prometheus_Server_install"
    sudo useradd --no-create-home --shell /bin/false prometheus
    sudo mkdir -p /etc/prometheus
    sudo mkdir -p /var/lib/prometheus
    sudo touch /etc/prometheus/prometheus.yml
    cat <<EOF | sudo tee /etc/prometheus/prometheus.yml
    
    #node_exporter port : 9100
    #nvidia_exporter port: 9445
    #process-exporter port: 9256
    
    global:
      scrape_interval: 10s
    
    scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets:
              - 0.0.0.0:9090
      - job_name: 'prosesses'
        scrape_interval: 5s
        static_configs:
          - targets:
              - <process exporter IP>:9256
              - <another process exporter IP>:9256
      - job_name: 'nvidia'
        scrape_interval: 5s
        static_configs:
          - targets:
              - <nvidia exporter IP>:9445
              - <another nvidia exporter IP>:9445
      - job_name: 'nodes'
        scrape_interval: 5s
        static_configs:
          - targets:
              - <node exporter IP>:9100
              - <another node exporter IP>:9100
    EOF
    # Assign ownership of the files above to prometheus user
    sudo chown -R prometheus:prometheus /etc/prometheus
    sudo chown prometheus:prometheus /var/lib/prometheus
    
    # Download prometheus and copy utilities to where they should be in the filesystem
    #VERSION=2.2.1
    #VERSION=$(curl https://raw.githubusercontent.com/prometheus/prometheus/master/VERSION)
    wget https://github.com/prometheus/prometheus/releases/download/v2.31.1/prometheus-2.31.1.linux-amd64.tar.gz
    
    tar xvzf prometheus-2.31.1.linux-amd64.tar.gz
    
    sudo cp prometheus-2.31.1.linux-amd64/prometheus /usr/local/bin/
    sudo cp prometheus-2.31.1.linux-amd64/promtool /usr/local/bin/
    sudo cp -r prometheus-2.31.1.linux-amd64/consoles /etc/prometheus
    sudo cp -r prometheus-2.31.1.linux-amd64/console_libraries /etc/prometheus
    
    # Assign the ownership of the tools above to prometheus user
    sudo chown -R prometheus:prometheus /etc/prometheus/consoles
    sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries
    sudo chown prometheus:prometheus /usr/local/bin/prometheus
    sudo chown prometheus:prometheus /usr/local/bin/promtool
    
    # Populate configuration files
    #cat ./prometheus/prometheus.yml | sudo tee /etc/prometheus/prometheus.yml
    #cat ./prometheus/prometheus.rules.yml | sudo tee /etc/prometheus/prometheus.rules.yml
    cat <<EOF | sudo tee /etc/systemd/system/prometheus.service
    [Unit]
    Description=Prometheus
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
            --config.file /etc/prometheus/prometheus.yml \
            --storage.tsdb.path /var/lib/prometheus/ \
            --web.console.templates=/etc/prometheus/consoles \
            --web.console.libraries=/etc/prometheus/console_libraries
    
    [Install]
    WantedBy=multi-user.target
    EOF
    # systemd
    sudo systemctl daemon-reload
    sudo systemctl enable prometheus
    sudo systemctl start prometheus
    
    # Installation cleanup
    #rm prometheus-${VERSION}.linux-amd64.tar.gz
    #rm -rf prometheus-${VERSION}.linux-amd64
    }
    
    Prometheus_Server_install
    

    This script creates, enables, and starts a Prometheus systemd service.

  4. Ensure that the user specified in the /etc/systemd/system/prometheus.service file has permission to run Prometheus.
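
Because YAML is indentation-sensitive, it can help to generate each scrape job rather than type it by hand. The following sketch prints one job in the same layout as the script's prometheus.yml. The function name render_scrape_job is hypothetical, and the addresses are documentation placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print one scrape job for prometheus.yml.
# Usage: render_scrape_job <job_name> <port> <host> [<host> ...]
render_scrape_job() {
    job_name="$1"
    port="$2"
    shift 2
    printf "  - job_name: '%s'\n" "$job_name"
    printf '    scrape_interval: 5s\n'
    printf '    static_configs:\n'
    printf '      - targets:\n'
    # One target line per remaining argument.
    for host in "$@"; do
        printf '          - %s:%s\n' "$host" "$port"
    done
}

# Example: two node exporters on the port noted in the script (9100).
render_scrape_job nodes 9100 192.0.2.10 192.0.2.11
```

After editing the configuration, it can be checked with promtool check config /etc/prometheus/prometheus.yml before Prometheus is restarted; the script above copies promtool to /usr/local/bin.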

Loki and Promtail

Loki is a log aggregation system designed to store and query logs, while Promtail is an agent that collects logs and forwards them to Loki.

  1. Download Loki:

    wget https://github.com/grafana/loki/releases/download/v3.0.0/loki-3.0.0.x86_64.rpm
    
  2. Download Promtail:

    wget https://github.com/grafana/loki/releases/download/v3.0.0/promtail-3.0.0.x86_64.rpm
    
  3. Install the RPM packages on the appropriate machines:

    sudo rpm -i loki-3.0.0.x86_64.rpm
    sudo rpm -i promtail-3.0.0.x86_64.rpm
    
  4. Open the loki.service file:

    sudo vim /etc/systemd/system/loki.service
    
  5. Configure the service file:

    [Unit]
    Description=Loki
    
    [Service]
    ExecStart=/usr/bin/loki -config.file=<LOKI_YML>
    User=root
    Group=<GROUP>
    
    [Install]
    WantedBy=multi-user.target
    
  6. Reload systemd to recognize the new service:

    systemctl daemon-reload
    
  7. Restart the Promtail service:

    sudo systemctl restart promtail
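
The service file from step 5 can be generated with the placeholders filled in. This is a sketch only: the function name render_loki_unit, the configuration path, and the group in the example are assumptions, not values taken from this guide.

```shell
#!/bin/sh
# Hypothetical helper: print a loki.service unit for a given configuration file and group.
# Usage: render_loki_unit <loki_yml> <group>
render_loki_unit() {
    loki_yml="$1"
    group="$2"
    cat <<EOF
[Unit]
Description=Loki

[Service]
ExecStart=/usr/bin/loki -config.file=${loki_yml}
User=root
Group=${group}

[Install]
WantedBy=multi-user.target
EOF
}

# Example (run as root); the path and group below are placeholders:
#   render_loki_unit /etc/loki/config.yml root > /etc/systemd/system/loki.service
#   systemctl daemon-reload
```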
    

Exporters

An Exporter is a software component that gathers metrics from various sources (such as hardware, software, or services) and exposes them in a format that Prometheus can scrape and store.

  1. Download the exporter package from https://github.com/utkuozdemir/nvidia_gpu_exporter/releases.

  2. Install Exporters:

    rpm -i <rpm_file>
    
  3. Reload the systemd manager configuration:

    sudo systemctl daemon-reload
    
  4. Restart the exporter service:

    sudo systemctl restart nvidia_gpu_exporter
    

CPU Exporter

  1. Download the CPU Exporter.

  2. Extract package content:

    tar -xvf <package>
    
  3. Move the node_exporter binary to the /usr/bin directory:

    sudo mv <node_exporter_folder>/node_exporter /usr/bin
    
  4. Open the /etc/systemd/system/node_exporter.service file:

    sudo vim /etc/systemd/system/node_exporter.service
    

  5. Add the following to the service file:

    [Unit]
    Description=Node Exporter
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=prometheus
    Group=prometheus
    Restart=always
    SyslogIdentifier=prometheus
    ExecStart=/usr/bin/node_exporter
    
    [Install]
    WantedBy=default.target
    
  6. Reload the systemd manager configuration:

    sudo systemctl daemon-reload
    
  7. Restart the Node Exporter service:

    sudo systemctl restart node_exporter
    

Process Exporter

  1. Install the Process Exporter (Prometheus exporter installation).

  2. Start the Exporter:

    /usr/bin/process-exporter --config.path /etc/process-exporter/all.yaml --web.listen-address=:9256 &> process_exporter.out &
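
Once the exporters are running, a quick check that each one answers on its metrics endpoint can save time before configuring dashboards. A minimal sketch, assuming curl is installed and that each exporter serves the conventional /metrics path; the function name check_exporters is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether each exporter answers on /metrics.
# Usage: check_exporters <host:port> [<host:port> ...]
# Returns 0 if every target answered, 1 otherwise.
check_exporters() {
    failed=0
    for target in "$@"; do
        if curl --silent --fail --max-time 5 --output /dev/null "http://${target}/metrics"; then
            echo "OK   ${target}"
        else
            echo "FAIL ${target}"
            failed=1
        fi
    done
    return "$failed"
}

# Example, using the ports listed in the Prometheus configuration comments:
#   check_exporters <node exporter IP>:9100 <process exporter IP>:9256
```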
    

Deployment

Grafana

  1. Access the Grafana web interface by entering your server IP or host name in the following URL:

    http://<server ip or host name>:3000/
    
  2. Log in with admin as both the user name and the password.

  3. Change your password.

  4. Go to Data Sources and choose Prometheus.

  5. Go to Data Sources and choose Loki.

  6. Set the URL to your Prometheus server IP address.

  7. Go to Dashboards and choose Import.

  8. Import dashboards one by one.
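
As an alternative to adding the data sources in the web interface (steps 4 to 6), Grafana can load them from a provisioning file at startup. The sketch below prints such a file. The function name render_datasources, the output file name, and the addresses are assumptions; ports 9090 and 3100 are the Prometheus and Loki ports used elsewhere in this guide.

```shell
#!/bin/sh
# Hypothetical helper: print a Grafana data source provisioning file.
# Usage: render_datasources <prometheus_host> <loki_host>
render_datasources() {
    prometheus_host="$1"
    loki_host="$2"
    cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://${prometheus_host}:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://${loki_host}:3100
EOF
}

# Example (run as root); addresses and file name are placeholders:
#   render_datasources 192.0.2.10 192.0.2.10 > /etc/grafana/provisioning/datasources/health_monitoring.yaml
#   systemctl restart grafana-server
```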

Using the Monitor Service

The Monitor service package includes two files (which must be placed in the same folder):

  • monitor_service (an executable)

  • monitor_input.json

Configuring the Monitor Service Worker

Before running the monitor service worker, ensure the following SQreamDB configuration flags are properly set:

======================================  ======================  ========================================================
Flag                                    Configuration File      Description
======================================  ======================  ========================================================
"cudaMemQuota": 0                       Worker configuration    Disables GPU memory usage for the monitor service. The
                                        file                    Worker must therefore be a non-GPU Worker to avoid
                                                                exceptions from the monitor service.

"initialSubscribedServices": "monitor"  Worker configuration    Specifies that the monitor service runs on a designated
                                        file                    non-GPU Worker, so that it is not mixed with GPU Worker
                                                                processes. The default service name is monitor, which
                                                                you can change if needed.

"enableNvprofMarkers": false            Cluster and session     Must be set to false. Enabling this flag on a non-GPU
                                        configuration file      Worker causes exceptions because no GPU instances are
                                                                involved.
======================================  ======================  ========================================================
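
Before starting the monitor service, the flags can be verified with a simple text search. This sketch makes no assumption about where in the file a flag appears; it only looks for a "flag": value pair. The function name config_has_flag and the file paths in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a configuration file contains "<flag>": <value>.
# The value must be followed by a comma, a closing brace, or the end of the line,
# so that 0 does not match 0.5 or 05.
# Usage: config_has_flag <file> <flag> <value>
config_has_flag() {
    file="$1"
    flag="$2"
    value="$3"
    grep -Eq "\"${flag}\"[[:space:]]*:[[:space:]]*${value}[[:space:]]*([,}]|\$)" "$file"
}

# Example; replace the paths with your own configuration files:
#   config_has_flag /path/to/worker_config.json cudaMemQuota 0
#   config_has_flag /path/to/worker_config.json initialSubscribedServices '"monitor"'
#   config_has_flag /path/to/cluster_config.json enableNvprofMarkers false
```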

Execution Arguments

When executing the Monitor service, you can configure the following flags:

===================  =======  ========================================  =========  =========
Flag                 Type     Description                               State      Default
===================  =======  ========================================  =========  =========
-h, --help           option   Shows the help message
--host               string   The SQreamDB host address                 Optional   localhost
--port               integer  The SQreamDB port number                  Optional   5000
--database           string   The SQreamDB database name                Optional   master
--username           string   The SQreamDB username                     Mandatory  sqream
--password           string   The SQreamDB password                     Mandatory  sqream
--clustered          option   Use when the server_picker is running     Optional   False
--service            string   The SQreamDB service name                 Optional   monitor
--loki_host          string   The Loki instance host address            Optional   localhost
--loki_port          integer  The Loki port number                      Optional   3100
--log_file_path      string   The path where log files are saved        Optional   NA
--metrics_json_path  string   The path to the monitor_input.json file   Optional
===================  =======  ========================================  =========  =========

Example

Execution example:

    ./monitor_service --username=sqream --password=sqream --host=1.2.3.4 --port=2711 --service=monitor --loki_host=1.2.3.5 --loki_port=3100 --metrics_json_path='/home/arielw/monitor_service/monitor_input.json'
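
For repeated runs, the command line can be assembled from environment variables. This is a sketch under stated assumptions: the function name run_monitor_service and the variable names (SQREAM_HOST and so on) are hypothetical, the defaults follow the Execution Arguments section, and the default for --metrics_json_path is an assumption based on monitor_input.json being in the same folder as the executable.

```shell
#!/bin/sh
# Hypothetical wrapper: build the monitor_service command line from environment
# variables and run it, or print it when DRY_RUN=1.
run_monitor_service() {
    set -- ./monitor_service \
        --username="${SQREAM_USER:-sqream}" \
        --password="${SQREAM_PASSWORD:-sqream}" \
        --host="${SQREAM_HOST:-localhost}" \
        --port="${SQREAM_PORT:-5000}" \
        --service="${SQREAM_SERVICE:-monitor}" \
        --loki_host="${LOKI_HOST:-localhost}" \
        --loki_port="${LOKI_PORT:-3100}" \
        --metrics_json_path="${METRICS_JSON:-./monitor_input.json}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # print the command instead of running it
    else
        "$@"
    fi
}

# Example: preview the command for a remote SQreamDB host.
#   DRY_RUN=1 SQREAM_HOST=1.2.3.4 SQREAM_PORT=2711 run_monitor_service
```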

Monitor Service Output Example

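============================================  =====
Type                                          Color
============================================  =====
Information about monitor service triggering  Blue
Successful insertion                          Green
Error                                         Red
============================================  =====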

[Image: monitor service output example]