One of the biggest problems I run into when diagnosing client issues is correlating the time an issue was reported with log data. The data usually isn't granular enough, or it wasn't collected at all because of peak capacity or connectivity loss. Recently, we had an issue where high load caused high latency and connectivity loss. We use Solarwinds for trending and historical data, among other things, so that is where I went to correlate the lost database connections with monitoring data. I'm still learning my way around Solarwinds, and the first thing I noticed is that with a 1-day view at 1-minute intervals you need a microscope to pick out an individual data point. There are also gaps in data collection at high-load times, which doesn't help identify what load the hardware was actually seeing.
So you see a confetti of data points and a few gaps, but the gaps could themselves be network connectivity issues caused by the load. I did notice that one particular CPU was pegged, and I wanted to capture the actual CPU utilization so I could rule out high memory or interface utilization as the cause. The device happened to be a Checkpoint 12600 firewall with performance issues. Its operating system is based on the Linux kernel, which opens many doors from an SNMP monitoring standpoint because the standard system MIBs are available. I didn't want to spend my time actively watching "top" and waiting. After doing some "research," I was able to locate the OIDs for identifying the CPU hardware and polling the total utilization of individual CPUs, shown below:
[Expert@12600:0]# snmpwalk -c <community> -v2c 127.0.0.1 HOST-RESOURCES-MIB::hrDeviceDescr
...
HOST-RESOURCES-MIB::hrDeviceDescr.768 = STRING: GenuineIntel: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
HOST-RESOURCES-MIB::hrDeviceDescr.769 = STRING: GenuineIntel: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
HOST-RESOURCES-MIB::hrDeviceDescr.770 = STRING: GenuineIntel: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
...
To determine current utilization for the specific CPU:
[Expert@12600:0]# snmpget -c <community> -v2c 127.0.0.1 HOST-RESOURCES-MIB::hrProcessorLoad.768
HOST-RESOURCES-MIB::hrProcessorLoad.768 = INTEGER: 79 <--Percent Utilization
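If you don't yet know which CPU is the busy one, walking the whole hrProcessorLoad column returns one line per CPU in the same format as the output above. Here is a minimal sketch of a helper that picks the highest value out of that output. The function name busiest_cpu is my own invention for illustration, and it assumes the default Net-SNMP output format shown above.

```shell
#!/bin/bash
# busiest_cpu (hypothetical helper name): reads snmpwalk output for
# HOST-RESOURCES-MIB::hrProcessorLoad on stdin and prints
# "<index> <load>" for the CPU with the highest load.
# Assumes lines look like:
#   HOST-RESOURCES-MIB::hrProcessorLoad.768 = INTEGER: 79
busiest_cpu() {
    awk '
        $NF ~ /^[0-9]+$/ {
            # The CPU index is the last dot-separated piece of the OID
            n = split($1, parts, "[.]")
            if (idx == "" || $NF + 0 > max + 0) {
                max = $NF
                idx = parts[n]
            }
        }
        END { if (idx != "") print idx, max }
    '
}

# Usage on the firewall (not run here):
#   snmpwalk -c <community> -v2c 127.0.0.1 HOST-RESOURCES-MIB::hrProcessorLoad | busiest_cpu
```

The index it prints is the same suffix used in the snmpget above, so it can be dropped straight into that command.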
From that, I built this Bash script to poll continuously over a given period so I could correlate the results:
[Expert@12600:0]# cat /tmp/cpu1.sh
#!/bin/bash
# Print a timestamped load sample for CPU index 768 once per second until interrupted.
while true; do
    atime=$(date)
    cpu=$(snmpwalk -c <community> -v2c 127.0.0.1 HOST-RESOURCES-MIB::hrProcessorLoad.768)
    echo "$atime $cpu"
    sleep 1
done
[Expert@12600:0]#
I took this script and redirected its output to a temporary file, which let me gather granular data every second without constantly watching, which is nice. After talking with support and looking at the trending data, we were told that our version of code doesn't distribute load evenly enough to fully utilize the CPUs.
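For anyone wanting to do the same, here is a rough sketch of the capture-and-review step. The log path /tmp/cpu1.log, the function name high_load_samples, and the default threshold of 90 are all assumptions of mine, not something from the original session. It relies on each log line ending with the utilization value, as the script's output does.

```shell
#!/bin/bash
# Capture in the background (not run here); /tmp/cpu1.log is a hypothetical path:
#   nohup /tmp/cpu1.sh > /tmp/cpu1.log 2>&1 &

# high_load_samples (hypothetical helper name): print the log lines whose
# last field is a utilization value at or above the threshold.
#   $1 = log file written by the polling script
#   $2 = threshold in percent (assumed default: 90)
# Lines whose last field is not a plain number are skipped.
high_load_samples() {
    awk -v limit="${2:-90}" '$NF ~ /^[0-9]+$/ && $NF + 0 >= limit + 0' "$1"
}

# Usage (not run here):
#   high_load_samples /tmp/cpu1.log 95
```

Because each matching line keeps its timestamp, the output can be lined up directly against the time an issue was reported.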
Nice work there, Chris!