Monitoring FreeBSD systems

By Andrew Fengler

andrew.fengler@scaleengine.com

Overview

Reasoning behind monitoring

General Unix monitoring

Getting information we need

Interpreting the information

Avoiding certain pitfalls

FreeBSD specific pitfalls

About me

Sysadmin at ScaleEngine

Manage a fleet of over 100 servers

Worldwide distribution

I'm a sysadmin at ScaleEngine, a video content distribution network. I'm responsible for overseeing a fleet of over 100 servers spread worldwide across a variety of countries and providers.

Why do we monitor computer systems?

Monitor the state of services we care about

The most important monitoring is of the services that you provide. If you're an email provider, having one of your mail exchangers be down is a major issue, whereas if you're a webhost, you might not be so concerned.

Monitor what our services depend on

Whether your server is consuming half its memory or two thirds doesn't really matter; what matters is when it runs out and breaks the services you need to keep up.

Detecting future problems

Monitor things we don't care about directly, but that might be useful for detecting a problem early (precursor metrics)

CPU

Time vs utilization percent

There are two good ways to calculate CPU usage. One is to use the UCD ssCpuRaw* values, which are counters of CPU time spent; average them over your monitoring interval. The other is to use the UCD ssCpu* values, which are integers pre-averaged over one minute, divided by the number of processors, and rebased as a percentage.

$ snmpget -c public -v 2c server.example.com UCD-SNMP-MIB::ssCpuIdle.0
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 83
$ snmpget -c public -v 2c server.example.com UCD-SNMP-MIB::ssCpuRawIdle.0
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 1653347551
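
A minimal sketch of the counter approach, assuming the net-snmp command-line tools and a writable state file (the host and file path are placeholders):

#!/bin/sh
# Sketch: CPU busy % from the UCD raw counters, averaged between runs.
# user+system+idle is enough for a rough number; add the nice/interrupt
# counters if you want precision. Counters can wrap; discard negative deltas.
HOST=server.example.com
STATE=/var/tmp/cpu_raw.state

user=$(snmpget -Oqv -c public -v 2c $HOST UCD-SNMP-MIB::ssCpuRawUser.0)
sys=$(snmpget -Oqv -c public -v 2c $HOST UCD-SNMP-MIB::ssCpuRawSystem.0)
idle=$(snmpget -Oqv -c public -v 2c $HOST UCD-SNMP-MIB::ssCpuRawIdle.0)

if [ -f "$STATE" ]; then
    read ouser osys oidle < "$STATE"
    busy=$(( (user - ouser) + (sys - osys) ))
    total=$(( busy + (idle - oidle) ))
    [ "$total" -gt 0 ] && echo "cpu busy: $(( 100 * busy / total ))%"
fi
echo "$user $sys $idle" > "$STATE"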

Running counter vs snapshot

Neither is necessarily better; both have their weaknesses. If you have a cron job that pins the processor for 30 seconds and you check CPU every 5 minutes, a rolling average will show only 10% usage, and a snapshot might show either 100% or 0%.

loadavg is considered harmful

Although it seems like the easiest thing to check, loadavg has a number of issues. It fails to account for multiple cores, and the distinction between cores and hyperthreads is lost on it. It also varies wildly between workloads: our monitor servers will show a loadavg of 16 on a 4-core machine with only 30% CPU usage, while a pinned video server might show only ~8 on an 8-core machine.

CPU Graphs

Memory

Monitor what you have left

Be aware of ARC

Mem: 9588K Active, 103M Inact, 2786M Wired, 4992K Cache, 55M Free
ARC: 1935M Total, 603M MFU, 1202M MRU, 560K Anon, 19M Header, 111M Other

All of the statistics for the ARC and regular memory usage can be found in sysctl. Note that ARC stats are in bytes and memory stats are in pages, so you need to multiply the memory stats by the page size to compare apples to apples:

$ sysctl kstat.zfs.misc.arcstats
$ sysctl vm.stats
$ sysctl vm.stats.vm.v_page_size
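
For example, a quick sketch that puts the two in the same units before comparing (the output values are illustrative, chosen to line up with the top output above):

$ pagesize=$(sysctl -n vm.stats.vm.v_page_size)
$ echo "free: $(( $(sysctl -n vm.stats.vm.v_free_count) * pagesize )) bytes"
free: 57671680 bytes
$ echo "arc: $(sysctl -n kstat.zfs.misc.arcstats.size) bytes"
arc: 2028994560 bytes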

You can get a count of how many times the ARC has been throttled due to memory pressure:

$ sysctl kstat.zfs.misc.arcstats.memory_throttle_count
kstat.zfs.misc.arcstats.memory_throttle_count: 22

Network

Network Speed

$ snmpget -c public -v 2c server IF-MIB::ifHCOutOctets.2
IF-MIB::ifHCOutOctets.2 = Counter64: 26790537050371
$ snmpget -c public -v 2c server IF-MIB::ifHCInOctets.2
IF-MIB::ifHCInOctets.2 = Counter64: 17901892810225

Checking the network utilization is the most obvious. It's pretty straightforward with SNMP. Use IF-MIB::ifXTable.

Usage is stored as a counter of total bytes (octets) sent since system boot. Store the value in a temp file and subtract the old value from the new one to get a delta.

Things to watch for: make sure you're using the 64-bit counters (ifXTable instead of ifTable), since 32-bit counters roll over very fast.
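
A minimal sketch of the temp-file approach (interface index 2 and the state-file path are assumptions for illustration):

#!/bin/sh
# Sketch: average outbound bytes/sec on ifIndex 2 since the last run.
HOST=server.example.com
STATE=/var/tmp/ifout.state

now=$(date +%s)
octets=$(snmpget -Oqv -c public -v 2c $HOST IF-MIB::ifHCOutOctets.2)

if [ -f "$STATE" ]; then
    read old_time old_octets < "$STATE"
    elapsed=$(( now - old_time ))
    [ "$elapsed" -gt 0 ] && echo "avg out: $(( (octets - old_octets) / elapsed )) bytes/sec"
fi
echo "$now $octets" > "$STATE"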

Network quality

Use pings from the monitor to detect loss and high latency, but tune your sensitivity carefully. The internet is made of baler wire and hope, so some packets will be lost; if you're too sensitive, ICMP checks become meaningless. But what do we alert on? 10% packet loss sounds reasonable, but what is 10% loss? Unless you're sending more than 10 echo requests in each check, it's a single lost packet. Don't wake yourself up until it's dropping packets for more than one check.
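
As a sketch, sending 20 echo requests gets you 5% resolution, and the loss is easy to parse out of ping's summary line (output shown is FreeBSD ping's format):

$ ping -q -c 20 server.example.com | grep 'packet loss'
20 packets transmitted, 20 packets received, 0.0% packet loss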

Interface speed

Not the most obvious thing to look for, but sometimes autonegotiation goes about as well as Canada's trade negotiations, and you only get 10Mb/s. Also, some providers will throttle your connection to 100Mb/s if you're over your quota, so this is good to watch.

On FreeBSD, ifconfig has all the information. On Linux, you need to use ethtool to get link information.

$ ifconfig | grep media
        media: Ethernet autoselect (1000baseT <full-duplex>)

Bandwidth Usage

Obviously keeping track of your usage is also important. We use RTG to collect the counters over SNMP, then pull the data out of MySQL and tally it up. It's also good to have the warnings scale with the day of the month, so you don't burn through your whole quota at the start.
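
A sketch of the scaling idea, pro-rating the quota by day of month (the 10000GB quota and the usage argument are placeholders; date -v is FreeBSD-specific):

#!/bin/sh
# Sketch: warn when month-to-date usage exceeds the pro-rated quota.
QUOTA_GB=10000                      # hypothetical monthly quota
used_gb=$1                          # month-to-date usage, e.g. from your RTG tally

day=$(date +%d); day=${day#0}       # strip leading zero to avoid octal arithmetic
last=$(date -v1d -v+1m -v-1d +%d)   # last day of this month (FreeBSD date)

allowed=$(( QUOTA_GB * day / last ))
[ "$used_gb" -gt "$allowed" ] && echo "WARNING: ${used_gb}GB used, ${allowed}GB budgeted by day $day"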

Disks

Usage

The most obvious thing to monitor is the disk space. Just like memory and CPU, we need free disk space or bad things happen. The standard way to set this up is with SNMP to check free disk space on / and we're good to go.

Of course, that's right up until applications start crashing and complaining that they can't write files, while you sit confused because the check reports 50GB free. That's because you're checking the root of the filesystem, and you were also a good boy and set a 50GB reservation on the / dataset. So SNMP sees 50GB free and doesn't notice that /usr is out of space.

The better way to monitor this is by checking through ZFS, since it's much better at dealing with itself:

$ zfs list -pH -o name,used,avail
mjolnir 46128820224     67251605504
mjolnir/ROOT    12187820032     67251605504
mjolnir/ROOT/default    12187729920     67251605504
mjolnir/tmp     157851648       67251605504
mjolnir/usr     29228777472     67251605504
mjolnir/usr/home        27944427520     67251605504

ZFS introduces both advantages and disadvantages. On the upside, it offers per-dataset granularity, quota information is readily available, and jail usage is easy to break down.

If you're only able to work with SNMP, then make sure you're checking all the datasets you care about, not just /.
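
A sketch of a per-dataset check built on the same zfs list output (the 1GB floor is an arbitrary example threshold):

#!/bin/sh
# Sketch: flag any dataset with less than MINFREE bytes available.
MINFREE=$(( 1024 * 1024 * 1024 ))   # 1GB; pick your own floor

zfs list -pH -o name,avail | while read name avail; do
    [ "$avail" -lt "$MINFREE" ] && echo "WARNING: $name has only $avail bytes free"
done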

Disk Health

Disk health means SMART. Yes, I know smartctl is horrid, but it's the best option. You don't have to care about all the different SMART metrics, though; only four matter:

Overall-health

SMART overall-health self-assessment test result: PASSED

Reallocated sector count(5)

  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0

Current Pending Sector/Offline Uncorrectable (197/198)

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

Also of use are the SSD wear stats, but these vary between manufacturers.
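
A sketch that pulls exactly those attributes for one disk (/dev/ada0 is a placeholder; the output shown is what a healthy drive typically reports):

$ smartctl -H /dev/ada0 | grep 'test result'
SMART overall-health self-assessment test result: PASSED
$ smartctl -A /dev/ada0 | awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2, $10 }'
Reallocated_Sector_Ct 0
Current_Pending_Sector 0
Offline_Uncorrectable 0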

Zpool Health

Just because the disk is OK doesn't mean ZFS is OK with it. There could be a cable error, soft errors, or plain operator error. Checking zpool health is pretty simple:

$ zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mjolnir   109G  43.0G  66.0G         -    27%    39%  1.00x  ONLINE  -

If you want more details, you can also parse through zpool status to find the exact drive that has the problems.
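
For scripting it, -H -o health gives you just the word you want to test (pool name as above):

$ zpool list -H -o health mjolnir
ONLINE
$ zpool list -H -o health mjolnir | grep -qvx ONLINE && echo "WARNING: pool not healthy"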

SMART graphs

Other fun things to check

Uptime

Having nice graphs of your uptime is important for proving your superiority to all the silly people who patch their operating systems, but it's also useful for catching reboots. Check if the uptime is less than twice the alert interval to catch a reboot. This also works for switches, allowing you to be less sensitive with ICMP.

interval * 2 - 1
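
A sketch of that check using kern.boottime (the 5-minute check interval is an assumption):

#!/bin/sh
# Sketch: alert if the box rebooted since (roughly) the last check.
INTERVAL=300                        # check interval in seconds (assumed)

boot=$(sysctl -n kern.boottime | awk '{print $4}' | tr -d ',')
up=$(( $(date +%s) - boot ))

[ "$up" -lt $(( INTERVAL * 2 - 1 )) ] && echo "WARNING: rebooted ${up}s ago"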

NTP

Watch your NTP time offset. If your clock drifts out of sync, you're going to get problems with SSL and the like. For best results, check against a different server than the one you're synchronizing to, or you create a common failure mode.

$ ntpdate -q 0.pool.ntp.org
server 89.149.59.102, stratum 2, offset -0.007518, delay 0.04460
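
A sketch that turns that into a check (the 0.5s tolerance is an arbitrary example):

#!/bin/sh
# Sketch: warn if the absolute NTP offset exceeds half a second.
ntpdate -q 0.pool.ntp.org | awk -F'offset ' 'NR == 1 {
    split($2, a, ","); o = (a[1] < 0) ? -a[1] : a[1]
    if (o > 0.5) print "WARNING: clock offset " a[1] "s"
}'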

Temperature

You can get the CPU temperatures out of sysctl. Thermal throttling will cause a CPU shortage, so it's worth watching:

dev.cpu.n.temperature
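
For example (n is the core number; the coretemp(4) or amdtemp(4) driver needs to be loaded, and the reading shown is illustrative):

$ sysctl dev.cpu.0.temperature
dev.cpu.0.temperature: 38.0C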

GPUs

nvidia-smi lets you get the temperature and utilization of your GPUs. Important if you're doing compute on them. Watch the correct metric for your workload.

$ nvidia-smi -q -x
...
    <utilization>
        <gpu_util>6 %</gpu_util>
        <memory_util>4 %</memory_util>
        <encoder_util>14 %</encoder_util>
        <decoder_util>42 %</decoder_util>
    </utilization>
...

Other notes

Jails

Jails offer a lot of challenges in obtaining meaningful statistics, especially if you want statistics isolated per jail. CPU and network metrics are basically unobtainium unless you're using vnet (which is the default in 12). SNMP won't work normally in jails; you have to build net-snmp with special options for it to work in jails at all. You also need good coupling in your monitoring system, so that you can associate CPU/memory/network/etc. on the jail with the appropriate host.

We named the cow Bessie

Be flexible with your measures. As soon as you hardcode a warning threshold into your check, you'll find that one server that runs 10 degrees hotter than the rest, and you'll be back modifying scripts. Take your warning values from arguments, config files, or anything else that can change without editing the script (this also makes it easier to repurpose scripts).
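
A sketch of the pattern, with thresholds taken from the command line and Nagios-style exit codes (the script name and default values are made up):

#!/bin/sh
# Sketch: check cpu0 temperature against thresholds passed as arguments.
# Usage: check_temp.sh <warn> <crit>
WARN=${1:-70}
CRIT=${2:-85}

temp=$(sysctl -n dev.cpu.0.temperature | tr -d 'C')

if [ "${temp%.*}" -ge "$CRIT" ]; then
    echo "CRITICAL: cpu0 at ${temp}C"; exit 2
elif [ "${temp%.*}" -ge "$WARN" ]; then
    echo "WARNING: cpu0 at ${temp}C"; exit 1
fi
echo "OK: cpu0 at ${temp}C"; exit 0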

Portability

ls, cut, uname, grep, sort, ps, ifconfig, test, route, netstat, sockstat, anything involving networks or disks, etc.

Portability is a major concern. Most basic utilities have at least a few non-portable features, so if you ever need to use your scripts on another server, you need to be careful what you use. You're better off heading off the problem with a portable scripting language like Perl or Ruby. If you have to use shell, do be careful.

Graphs

Don't go overboard

Questions?

Presentation at:

http://www.fengler.ca/articles/monitoring.html