I'm a sysadmin at ScaleEngine, a video content distribution network. I'm responsible for overseeing a fleet of over 100 servers worldwide, spread across a wide variety of countries and providers.
The most important monitoring is of the services that you provide. If you're an email provider, having one of your mail exchangers be down is a major issue, whereas if you're a webhost, you might not be so concerned.
Whether your server is consuming half its memory or two thirds doesn't really matter. What matters is when it runs out and breaks the services you need to keep up.
Also monitor things we don't directly care about, but that might be useful for detecting a problem early (precursor metrics).
There are two good ways to calculate CPU usage. One is to use the UCD ssCpuRaw* values, which are counters of CPU time spent, and average the difference over your monitoring interval. The other is to use the UCD ssCpu* values, which are integers, pre-averaged over 1 minute, divided down by the number of processors, and rebased as a percentage.
$ snmpget -c public -v 2c server.example.com UCD-SNMP-MIB::ssCpuIdle.0
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 83
$ snmpget -c public -v 2c server.example.com UCD-SNMP-MIB::ssCpuRawIdle.0
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 1653347551
Neither is necessarily better; both have their weaknesses. If you have a cron job that pins the processor for 30s and you check CPU every 5 min, a rolling average will show only 10% usage, while a snapshot might show either 100% or 0%.
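The counter approach can be sketched roughly as follows. This is a minimal sketch, assuming you have already polled the ssCpuRaw* counters twice and stored each sample in a dict; the function name and the dict keys ("user", "system", "idle") are made up for the example.

```python
COUNTER32_WRAP = 2 ** 32  # the ssCpuRaw* values are Counter32


def cpu_busy_percent(prev, curr):
    """Return the busy percentage between two samples of raw CPU counters.

    prev and curr map a state name to its raw counter value.
    The key names are hypothetical; "idle" must be present.
    """
    # The modulo keeps the delta correct across a single counter wrap.
    deltas = {state: (curr[state] - prev[state]) % COUNTER32_WRAP
              for state in curr}
    total = sum(deltas.values())
    if total == 0:
        return 0.0  # no ticks elapsed between samples
    return 100.0 * (total - deltas["idle"]) / total
```

Because the result covers the whole interval between the two polls, a 30s burst inside a 5 min interval shows up as roughly 10%, which is exactly the averaging weakness described above.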
Although load average seems like the easiest thing to check, it has a number of issues. It fails to account for multiple cores, and the distinction between cores and hyperthreads is lost on it. Additionally, it varies wildly between workloads: our monitor servers will show a loadavg of 16 on a 4 core machine with only 30% CPU usage, while a pinned video server might only show ~8 on an 8 core machine.
Mem: 9588K Active, 103M Inact, 2786M Wired, 4992K Cache, 55M Free
ARC: 1935M Total, 603M MFU, 1202M MRU, 560K Anon, 19M Header, 111M Other
All of the statistics for the ARC and regular memory usage can be found in sysctl. Note that ARC stats are in bytes and memory stats are in pages, so you need to multiply the memory stats by the page size to compare apples to apples.
$ sysctl kstat.zfs.misc.arcstats
$ sysctl vm.stats
$ sysctl vm.stats.vm.v_page_size
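As an illustration of the pages-versus-bytes conversion, here is a small sketch that takes the text printed by sysctl and puts everything in bytes. The function names are made up for the example, and the sysctl names it reads (v_free_count, v_wire_count, arcstats.size) are the ones I would expect on a FreeBSD host; check them against your own output.

```python
def parse_sysctl(text):
    """Turn 'name: value' lines into a dict, keeping only integer values."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            values[name.strip()] = int(value)
    return values


def memory_bytes(values):
    """Convert page counts to bytes so they can be compared with the ARC."""
    page = values["vm.stats.vm.v_page_size"]
    return {
        "free": values["vm.stats.vm.v_free_count"] * page,    # pages -> bytes
        "wired": values["vm.stats.vm.v_wire_count"] * page,   # pages -> bytes
        "arc": values["kstat.zfs.misc.arcstats.size"],        # already bytes
    }
```

Feeding it the combined output of the sysctl commands above gives numbers that can be graphed on the same axis.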
You can get a count of how many times the ARC has been throttled due to memory pressure:
$ sysctl kstat.zfs.misc.arcstats.memory_throttle_count
kstat.zfs.misc.arcstats.memory_throttle_count: 22
$ snmpget -c public -v 2c server IF-MIB::ifHCOutOctets.2
IF-MIB::ifHCOutOctets.2 = Counter64: 26790537050371
$ snmpget -c public -v 2c server IF-MIB::ifHCInOctets.2
IF-MIB::ifHCInOctets.2 = Counter64: 17901892810225
Checking the network utilization is the most obvious. It's pretty straightforward with SNMP. Use IF-MIB::ifXTable.
Usage is stored as a counter of total bytes (octets) sent since system boot. Store the value in a temp file and subtract the old value from the new one to get a delta.
Things to watch for: make sure you're using the 64-bit counters (ifXTable instead of ifTable), since a 32-bit counter rolls over very fast. At 1Gb/s, a 32-bit octet counter wraps in about 34 seconds.
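The delta calculation can be sketched like this. It is a minimal example, assuming you have the previous counter value and the number of seconds since it was taken; the function name and parameters are made up for illustration.

```python
def octet_rate_bps(old, new, seconds, bits=64):
    """Return the average rate in bits per second between two octet counters.

    bits is the counter width: 64 for ifHC*Octets, 32 for the ifTable ones.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # The modulo handles one rollover of the counter between samples.
    delta = (new - old) % (2 ** bits)
    return delta * 8 / seconds
```

One caveat: a reboot resets the counter to zero, and the modulo will then report an absurdly large rate, so it is probably worth discarding a sample when the uptime says the box rebooted in between.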
Use pings from the monitor to detect loss and high latency. However, tune your sensitivity carefully. The internet is made of baling wire and hope, so some packets will be lost. If you're too sensitive, ICMP checks become meaningless. But what do we alert on? 10% packet loss sounds reasonable, but what is 10% loss? Unless you're sending more than 10 echoes in each ping, it's a single lost packet. Don't wake yourself up until it's dropping packets for more than one check.
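A rough sketch of the "more than one check" rule, assuming your monitor keeps a list of recent loss fractions per host (oldest first). The function name, the 10% threshold default, and the choice of two consecutive checks are illustrative, not a recommendation.

```python
def should_alert(loss_history, threshold=0.10, consecutive=2):
    """Alert only if the last `consecutive` checks all met the loss threshold.

    loss_history holds packet loss fractions (0.0 to 1.0), oldest first.
    """
    recent = loss_history[-consecutive:]
    if len(recent) < consecutive:
        return False  # not enough history yet to judge
    return all(loss >= threshold for loss in recent)
```

With 5 echoes per check, a single dropped packet is 20% loss, so one bad check on its own stays quiet and only a repeat pages you.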
Link speed is not the most obvious thing to look for, but sometimes autonegotiation goes about as well as Canada's trade negotiations, and you only get 10Mb/s. Also, some providers will throttle your connection to 100Mb/s if you're over your quota, so this is good to watch.
On FreeBSD, ifconfig has all the information. On Linux, you need to use ethtool to get link information.
$ ifconfig | grep media
media: Ethernet autoselect (1000baseT
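Pulling the negotiated speed out of that media line might look like the sketch below. The regular expression is an assumption based on the media strings I have seen from FreeBSD's ifconfig (10baseT, 100baseTX, 1000baseT, 10Gbase-T); other drivers may word things differently, and the function name is made up.

```python
import re

# Matches the active media inside the parentheses, e.g. "(1000baseT" or "(10Gbase-T".
MEDIA_RE = re.compile(r"media:.*\((\d+)(G?)base", re.IGNORECASE)


def link_speed_mbps(line):
    """Return the negotiated speed in Mb/s, or None if the line has none."""
    match = MEDIA_RE.search(line)
    if match is None:
        return None
    speed = int(match.group(1))
    # A "G" suffix means the number is in Gb/s rather than Mb/s.
    return speed * 1000 if match.group(2) else speed
```

Comparing the result against the speed you expect for that port catches both bad autonegotiation and provider throttling.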
Keeping track of your bandwidth usage against your quota is also worthwhile. We use RTG to get the counters out of SNMP, then pull the information out of MySQL and tally it up. It's also good to have the warnings scale with the month, so you don't burn through all of your quota at the start.
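Scaling the warning with the month could be done along these lines. This is only a sketch: it assumes the quota resets on the first of the calendar month, counts the current day as already elapsed, and uses a made-up 10% slack factor so a slightly front-loaded month doesn't alert.

```python
import calendar
import datetime


def over_prorated_quota(used_bytes, quota_bytes, today, slack=1.1):
    """Return True if usage is ahead of the pro-rated share of the quota.

    today is a datetime.date; slack is a hypothetical tolerance multiplier.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    elapsed_fraction = today.day / days_in_month
    allowed = quota_bytes * elapsed_fraction * slack
    return used_bytes > allowed
```

Halfway through a 30-day month with a 100TB quota, this warns once usage passes roughly 55TB rather than waiting for the full 100TB.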
The most obvious thing to monitor is the disk space. Just like memory and CPU, we need free disk space or bad things happen. The standard way to set this up is with SNMP to check free disk space on / and we're good to go.
Of course, that works right up until applications start crashing and complaining that they can't write files, leaving you confused because the check reports 50GB free. That happens because you are checking the root of the filesystem, and you were also a good boy and set a 50GB reservation on the / dataset. So SNMP sees 50GB free on / and doesn't notice that /usr is out of space.
The better way to monitor this is by checking through ZFS, since it's much better at dealing with itself:
$ zfs list -pH -o name,used,avail
mjolnir               46128820224  67251605504
mjolnir/ROOT          12187820032  67251605504
mjolnir/ROOT/default  12187729920  67251605504
mjolnir/tmp           157851648    67251605504
mjolnir/usr           29228777472  67251605504
mjolnir/usr/home      27944427520  67251605504
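A check built on that output might look like the following sketch. It assumes the text comes from `zfs list -pH -o name,used,avail`, where -H separates the fields with tabs and -p prints exact byte counts; the function name and the threshold parameter are made up for the example.

```python
def low_space_datasets(zfs_output, min_free_bytes):
    """Return (dataset, avail_bytes) pairs whose free space is below the limit."""
    low = []
    for line in zfs_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, _used, avail = fields
        if int(avail) < min_free_bytes:
            low.append((name, int(avail)))
    return low
```

Because every dataset reports its own avail, a reservation on one dataset can no longer hide another dataset that has run out of space.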
ZFS introduces both advantages and disadvantages. On the upside, it offers per-dataset granularity, quota information is readily available, and jail usage is easy to break down.
If you're only able to work with SNMP, then make sure you're checking all of the datasets you care about, not just /.
Disk health means SMART. Yes, I know smartctl is horrid. It's the best option. However, you don't have to care about all the different SMART metrics. Only 4 matter:
SMART overall-health self-assessment test result: PASSED
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
The SSD wear stats are also useful, but these vary between manufacturers.
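Checking those four values from captured smartctl output can be sketched as below. It assumes the usual attribute table layout where the raw value is the tenth whitespace-separated column, and it treats any non-zero raw value on attributes 5, 197, and 198 as bad; both the function names and that policy are illustrative.

```python
WATCHED_IDS = {5, 197, 198}  # the attribute IDs shown above


def health_passed(smartctl_output):
    """Return True if the overall-health line reports PASSED."""
    for line in smartctl_output.splitlines():
        if "overall-health" in line:
            return line.strip().endswith("PASSED")
    return False  # no health line at all is treated as a failure


def failing_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    bad = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # not an attribute row
        if int(fields[0]) in WATCHED_IDS and fields[9].isdigit():
            raw = int(fields[9])
            if raw > 0:
                bad[fields[1]] = raw
    return bad
```

Some drives report raw values in other formats, which this sketch simply skips, so treat it as a starting point rather than a complete parser.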
Just because the disk is ok doesn't mean ZFS is ok with it. There could be a cable error, soft errors, or just operator error. Checking zpool health is pretty simple:
$ zpool list
NAME     SIZE  ALLOC  FREE   EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
mjolnir  109G  43.0G  66.0G  -         27%   39%  1.00x  ONLINE  -
If you want more details, you can also parse through zpool status to find the exact drive that has the problems.
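For a script, the health column is easier to consume in machine-readable form. The sketch below assumes the text comes from `zpool list -H -o name,health`, which prints one tab-separated line per pool; the function name is made up.

```python
def unhealthy_pools(zpool_output):
    """Return {pool_name: health} for every pool that is not ONLINE."""
    bad = {}
    for line in zpool_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, health = fields
        if health != "ONLINE":
            bad[name] = health
    return bad
```

An empty result means every pool is ONLINE; anything else is a cue to go and read zpool status.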
Having nice graphs of your uptime is important for proving your superiority to all the silly people who patch their operating systems, but it's also useful for catching reboots. Check whether the uptime is less than twice the alert interval to catch a reboot. This also works for switches, allowing you to be less sensitive with ICMP.
interval * 2 - 1
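That threshold can be sketched as follows, assuming uptime and check interval are both whole seconds. For integers, "at most interval * 2 - 1" is the same test as "less than twice the interval". The function names are made up; the tick conversion relies on SNMP TimeTicks being hundredths of a second.

```python
def timeticks_to_seconds(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to whole seconds."""
    return ticks // 100


def rebooted_recently(uptime_seconds, interval_seconds):
    """Return True if the host booted within the last two check intervals."""
    return uptime_seconds <= interval_seconds * 2 - 1
```

Using two intervals rather than one means a reboot is still caught even if a single check was missed while the host was down.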
Watch your NTP time offset. If your clock drifts out of sync, you're going to get problems with SSL and the like. For best results, check against a different server than the one you're synchronizing to, or you create a common failure mode.
$ ntpdate -q 0.pool.ntp.org
server 22.214.171.124, stratum 2, offset -0.007518, delay 0.04460
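Turning that output into a check could look like this sketch. It assumes the "server ..., stratum ..., offset ..., delay ..." line format shown above, takes the median when several servers answer, and uses a made-up default limit of half a second.

```python
import re
import statistics

# One match per "server" line; the captured group is the offset in seconds.
SERVER_RE = re.compile(
    r"^server \S+, stratum \d+, offset ([+-]?\d+(?:\.\d+)?), delay",
    re.MULTILINE,
)


def clock_offsets(ntpdate_output):
    """Return the offsets (in seconds) reported by each queried server."""
    return [float(value) for value in SERVER_RE.findall(ntpdate_output)]


def offset_too_large(ntpdate_output, max_offset=0.5):
    """Return True if the median offset exceeds the limit or nobody answered."""
    offsets = clock_offsets(ntpdate_output)
    if not offsets:
        return True  # no usable answer is a problem in its own right
    return abs(statistics.median(offsets)) > max_offset
```

The median keeps one misbehaving pool server from raising an alert by itself.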
You can get the CPU temperatures out of sysctl. Thermal throttling will leave you short of CPU.
nvidia-smi lets you get the temperature and utilization of your GPUs. Important if you're doing compute on them. Watch the correct metric for your workload.
Jails offer a lot of challenges in obtaining meaningful statistics, especially if you care about isolating them to the jail. CPU and network metrics are basically unobtainium, unless you're using vnet (which is the default in FreeBSD 12). SNMP won't work normally in jails: you have to build net-snmp with special options for it to work there at all. You also need good coupling in your monitoring system, so that you can associate CPU/memory/network/etc. on the jail with the appropriate host.
Be flexible with your measures. As soon as you hardcode a warning threshold into your check, you'll find that one server that runs 10 degrees hotter than the rest, and you'll have to modify scripts. Take your warning values, file paths, and anything else that might change from arguments (this also makes it easier to repurpose scripts).
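One way to keep thresholds out of the script body is sketched below. The option names (--warn, --crit) and the three status strings are made up for the example; the point is only that the same check can be reused for the hot server by changing its arguments.

```python
import argparse


def parse_args(argv):
    """Parse thresholds and the measured value from the command line."""
    parser = argparse.ArgumentParser(description="Generic threshold check")
    parser.add_argument("--warn", type=float, required=True,
                        help="value at which to warn")
    parser.add_argument("--crit", type=float, required=True,
                        help="value at which to go critical")
    parser.add_argument("value", type=float, help="the measured value")
    return parser.parse_args(argv)


def classify(value, warn, crit):
    """Map a measurement onto OK / WARNING / CRITICAL."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


# Example: the same code with per-host thresholds.
# args = parse_args(["--warn", "70", "--crit", "85", "72.5"])
# classify(args.value, args.warn, args.crit)
```

The server that runs hot just gets a different --warn and --crit in its check definition, and the script itself stays untouched.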
ls, cut, uname, grep, sort, ps, ifconfig, test, route, netstat, sockstat, anything involving networks or disks, etc.
Portability is a major concern. Most basic utilities have at least a few non-portable features, so if you ever need to use your scripts on another server, you need to be careful what you use. You're better off using a portable scripting language, like Perl or Ruby, to head off the problem. If you have to use shell, do be careful.