NVIDIA GPU SMI SSH

Pack Assets

Monitored Objects

The Pack NVIDIA GPU collects metrics for:

Gpu-stats

Collected Metrics

Gpu-stats

Metric name	Description	Unit
devices.gpu.total.count	Number of GPU devices
product_name:id#device.gpu.utilization.percentage	Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which one or more kernels was executing on the GPU	%
product_name:id#device.gpu.memory.utilization.percentage	Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which global (device) memory was being read or written	%
product_name:id#device.gpu.encoder.utilization.percentage	Percent of time over the past sample period (sampling rate is variable) during which the GPU video encoder was being used	%
product_name:id#device.gpu.decoder.utilization.percentage	Percent of time over the past sample period (sampling rate is variable) during which the GPU video decoder was being used	%
product_name:id#device.gpu.frame_buffer.memory.usage.bytes	On-board frame buffer memory usage	B
product_name:id#device.gpu.frame_buffer.memory.free.bytes	On-board frame buffer memory available	B
product_name:id#device.gpu.frame_buffer.memory.usage.percentage	On-board frame buffer memory usage in percentage	%
product_name:id#device.gpu.bar1.memory.usage.bytes	BAR1 memory usage	B
product_name:id#device.gpu.bar1.memory.free.bytes	BAR1 memory available	B
product_name:id#device.gpu.bar1.memory.usage.percentage	BAR1 memory usage in percentage	%
product_name:id#device.gpu.fan.speed.percentage	Fan speed value	%
product_name:id#device.gpu.temperature.celsius	Temperature value	C
product_name:id#device.gpu.power.consumption.watt	The last measured power draw for the entire board	W
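The percentage metrics for frame buffer and BAR1 memory can be cross-checked against the byte metrics. The sketch below assumes that the percentage is computed as used / (used + free); this assumption is consistent with the sample plugin output shown later on this page, but it is not taken from the plugin source. The function name `usage_pct` is hypothetical.

```shell
#!/bin/sh
# usage_pct USED_BYTES FREE_BYTES
# Prints used / (used + free) as a percentage with two decimals.
# Assumption: total memory = used + free.
usage_pct() {
    awk -v used="$1" -v free="$2" \
        'BEGIN { printf "%.2f\n", 100 * used / (used + free) }'
}

# Frame buffer values of the first GPU in the sample output:
# 1349517312 B used, 11449401344 B free.
usage_pct 1349517312 11449401344
```

With the sample values, the result matches the reported frame_buffer.memory.usage.percentage of 10.54%.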

Prerequisites

The centreon-engine user opens an SSH connection to the remote system as a dedicated user. This user must have enough privileges to run the nvidia-smi command.
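Before configuring the host, it can help to confirm from the Poller that the remote user is able to run nvidia-smi over SSH. The following is a minimal sketch, assuming key-based authentication and the OpenSSH client; the function name `check_nvidia_smi` and the `SSH_BIN` override are hypothetical and not part of the Plugin.

```shell
#!/bin/sh
# Hypothetical pre-flight check, meant to be run on the Poller
# as the centreon-engine user.
# SSH_BIN can be overridden (for example for a dry run);
# it defaults to the OpenSSH client.
SSH_BIN="${SSH_BIN:-ssh}"

check_nvidia_smi() {
    user="$1"
    host="$2"
    port="${3:-22}"
    # BatchMode=yes disables password prompts, so the check fails fast
    # when key-based authentication is not set up.
    # "nvidia-smi -L" lists the GPUs visible to the remote user.
    "$SSH_BIN" -o BatchMode=yes -p "$port" "$user@$host" nvidia-smi -L
}

# Example, using the user and host of the test command on this page:
# check_nvidia_smi centreon 10.30.2.81
```

A non-zero exit status suggests either an SSH authentication problem or insufficient privileges to run nvidia-smi.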

Setup

Online License

Install the Centreon Plugin on every Poller:

yum install centreon-plugin-Hardware-Devices-Nvidia-Gpu-Smi-Ssh

On the Centreon Web interface in Configuration > Connectors > Monitoring Connectors, install the NVIDIA GPU SMI SSH Pack.

Offline License

Install the Centreon Plugin on every Poller:

yum install centreon-plugin-Hardware-Devices-Nvidia-Gpu-Smi-Ssh

On the Centreon Central server, install the Centreon Pack from the RPM:

yum install centreon-pack-hardware-devices-nvidia-gpu-smi-ssh

On the Centreon Web interface in Configuration > Connectors > Monitoring Connectors, install the NVIDIA GPU SMI SSH Pack.

Host configuration

Add a new Host and apply the HW-Device-Nvidia-Gpu-Smi-SSH-custom Host Template.

Once the template is applied, some Macros have to be configured. Three SSH backends are available to connect to the remote server: sshcli, plink and libssh. Each one is detailed below.

sshcli backend

Mandatory	Name	Description
X	SSHBACKEND	Name of the backend: `sshcli`
X	SSHUSERNAME	By default, the user running the `centengine` process on your Poller is used
	SSHPASSWORD	Cannot be used with this backend. Only SSH key authentication is supported
	SSHPORT	By default: 22
	SSHEXTRAOPTIONS	Customize it with your own if needed. E.g.: `--ssh-priv-key=/user/.ssh/id_rsa`

Warning: With this backend, you have to validate the target server fingerprint manually (with the SSHUSERNAME used).

plink backend

Mandatory	Name	Description
X	SSHBACKEND	Name of the backend: `plink`
X	SSHUSERNAME	By default, the user running the `centengine` process on your Poller is used
	SSHPASSWORD	Can be used. If not set, SSH key authentication is used
	SSHPORT	By default: 22
	SSHEXTRAOPTIONS	Customize it with your own if needed. E.g.: `--ssh-priv-key=/user/.ssh/id_rsa`

Warning: With this backend, you have to validate the target server fingerprint manually (with the SSHUSERNAME used).

libssh backend (default)

Mandatory	Name	Description
X	SSHBACKEND	Name of the backend: `libssh`
X	SSHUSERNAME	By default, the user running the `centengine` process on your Poller is used
	SSHPASSWORD	Can be used. If not set, SSH key authentication is used
	SSHPORT	By default: 22
	SSHEXTRAOPTIONS	Customize it with your own if needed. E.g.: `--ssh-priv-key=/user/.ssh/id_rsa`

With this backend, you do not have to validate the target server fingerprint manually.
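To see how these Macros relate to the Plugin options, the sketch below prints a command line built from a set of Macro values. The mapping of each Macro to an `--ssh-*` option is an assumption based on the option names used elsewhere on this page; the real command line is defined by the Pack's check command. The helper name `build_check_command` is hypothetical, and the function only prints the command without running it.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not execute) the plugin command line
# that a set of Macro values would roughly correspond to.
# Arguments: HOST SSHBACKEND SSHUSERNAME [SSHPORT] [SSHEXTRAOPTIONS]
build_check_command() {
    host="$1"
    backend="$2"
    user="$3"
    port="${4:-22}"   # SSHPORT defaults to 22
    extra="$5"        # SSHEXTRAOPTIONS, appended as-is when set

    printf '%s' "/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl"
    printf ' %s' \
        "--plugin=hardware::devices::nvidia::gpu::smi::plugin" \
        "--mode=stats" \
        "--hostname=$host" \
        "--ssh-backend=$backend" \
        "--ssh-username=$user" \
        "--ssh-port=$port"
    if [ -n "$extra" ]; then
        printf ' %s' "$extra"
    fi
    printf '\n'
}

# Example: sshcli backend with a private key passed through SSHEXTRAOPTIONS.
build_check_command 10.30.2.81 sshcli centreon 22 --ssh-priv-key=/user/.ssh/id_rsa
```

Printing the command first makes it easy to copy it into a shell on the Poller and compare the result with what the Centreon Web interface reports.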

How to test the Plugin and what are the main options for?

Once the Plugin is installed, log into your Poller using the centreon-engine user account and test it by running the following command:

/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
    --plugin=hardware::devices::nvidia::gpu::smi::plugin \
    --mode=stats \
    --hostname=10.30.2.81 \
    --ssh-username=centreon \
    --ssh-password='centreon-password' \
    --ssh-backend=libssh \
    --verbose

Expected command output is shown below:

OK: All devices are ok | 'devices.gpu.total.count'=2;;;0; 'Quadro K6000:00000000:08:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.bytes'=1349517312B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.free.bytes'=11449401344B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.percentage'=10.54%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.bytes'=13631488B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.free.bytes'=254803968B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.percentage'=5.08%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:08:00.0#device.gpu.power.consumption.watt'=24.16W;;;0; 'Quadro K6000:00000000:84:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.bytes'=732954624B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.free.bytes'=12065964032B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.percentage'=5.73%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.bytes'=5242880B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.free.bytes'=263192576B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.percentage'=1.95%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:84:00.0#device.gpu.power.consumption.watt'=23.86W;;;0;
checking device gpu 'Quadro K6000:00000000:08:00.0'
    utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
    frame buffer memory usage total: 11.92 GB used: 1.26 GB (10.54%) free: 10.66 GB (89.46%)
    bar1 memory usage total: 256.00 MB used: 13.00 MB (5.08%) free: 243.00 MB (94.92%)
    fan speed: 26.00 %
    gpu temperature: 40 C
    power consumption: 24.16 W
checking device gpu 'Quadro K6000:00000000:84:00.0'
    utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
    frame buffer memory usage total: 11.92 GB used: 699.00 MB (5.73%) free: 11.24 GB (94.27%)
    bar1 memory usage total: 256.00 MB used: 5.00 MB (1.95%) free: 251.00 MB (98.05%)
    fan speed: 26.00 %
    gpu temperature: 40 C
    power consumption: 23.86 W

The command above gets GPU statistics (--mode=stats).

It authenticates as the SSH user centreon (--ssh-username=centreon) with the password centreon-password (--ssh-password='centreon-password'), uses the libssh SSH backend (--ssh-backend=libssh) and connects to the host 10.30.2.81 (--hostname=10.30.2.81). The --ssh-port option is not set in the command, so the default SSH port 22 is used (equivalent to --ssh-port=22).
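In the output above, the part after the | separator is performance data, where each entry has the form 'label'=value[unit];warning;critical;min;max. When debugging, it can be handy to pull a single value out of that line. The sketch below does this with grep and awk; the function name `perf_value` is hypothetical and not part of the Plugin.

```shell
#!/bin/sh
# perf_value LABEL
# Reads plugin output on stdin and prints the value (with its unit)
# of the perfdata entry whose label is exactly LABEL.
perf_value() {
    label="$1"
    # grep -o emits one quoted perfdata entry per line, e.g.
    #   'devices.gpu.total.count'=2;;;0;
    # awk splits each entry on the "'=" separator, compares the label,
    # and keeps only the part of the value before the first ";".
    grep -o "'[^']*'=[^ ]*" |
        awk -F"'=" -v want="'$label" \
            '$1 == want { split($2, f, ";"); print f[1] }'
}

# Example, assuming the plugin command shown above is stored in "$cmd":
# $cmd | perf_value 'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'
```

The label must be quoted in the shell because it contains spaces and a # character.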

All the options as well as all the available thresholds can be displayed by adding the --help parameter to the command:

/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
    --plugin=hardware::devices::nvidia::gpu::smi::plugin \
    --mode=stats \
    --help

Troubleshooting

See the Troubleshooting plugins documentation.
