A common request to hosting companies is a guarantee of uptime for hosts and services. Naturally, this is very hard to guarantee, because servers go down in the middle of the night, often without an obvious reason. So what are you going to do? Sit there all night, visiting the websites to confirm they’re working? Ping the servers all night long, checking for downtime? Luckily, tools exist to monitor your hosts, and today I’ll write a bit about our journey integrating one of them, Nagios, into our preferred communications channel: Slack.
Nagios Ain’t Gonna Insist On Sainthood
Nagios gets its name from the Greek word ‘άγιος’, which could loosely be translated as ‘sainthood’. For some, it is indeed a saint, allowing them to easily check for downtime and giving precise messages about what exactly is wrong with hosts or the services running on them. Other users would call the name ironic, as it often seems little is sacred to this program. For those unfamiliar with Nagios: it is used to detect problems with hosts, or with services running on those hosts, and to notify users about them. It seems to support almost any kind of monitoring, partly thanks to its amazing extensibility. Users can define hosts, services and even the commands used to check whether those hosts and services are working within acceptable parameters. That being said, Nagios has a curious way of building this data.
Handling data the Nagios way
Looking at this formula of services running on hosts, which in turn are part of host groups, which are monitored by certain users who exist in certain communication channels, and which are checked using certain commands, should give the experienced reader a feeling of joy. I originally assumed this data lived in a database, given its neatly structured and seemingly relational format. However, Nagios has its own way of doing this. Sometimes this way may seem strange, but usually there’s a good reason for everything.
Defining Data
Nagios stores its user-defined data in .cfg files on your system. Each file contains one or more object definitions, described according to Nagios’ exacting standards. Let me show you how a typical host is defined.
define host {
host_name google.com
alias google
address 1.2.3.4
parents name.of.parent.com
use google-hosts_template
}
Author’s note: Nagios allows for an incredible number of parameters, options and other neat features in object definitions. For the sake of your time, I will only go into the most common ones.
The host name is simply the name you give this host. It does not have to match the host’s DNS name. This name lets you refer to the host from other objects; more on that in a bit. The alias is a simplification of the name, making it easier for users to filter or order in external commands (more on this later). The address is of course the host’s public-facing IP address or DNS name. The use directive lets you inherit properties from already defined templates, which makes grouping and configuring large numbers of hosts very easy. However, Nagios can do more for you than check whether your hosts are reachable.
On our hosts we like to run services. Sadly, we often find our services having problems: MySQL servers running out of disk space, and so on. Luckily, we can get Nagios to monitor these for us, using the following config:
define service {
service_description service-mysql
check_command check_whether_its_working
host_name google.com
contact_groups awesome_admins
use database_templates
}
Here we see one of Nagios’ more awesome features in action: it lets us attach services to hosts, and use templates to define what our services should look like. However, as every object-oriented programmer worth their salt knows, inheritance is a double-edged sword. Use it too liberally and you will find yourself knee-deep in dependencies, which are especially annoying considering how these files are structured. I would therefore strongly recommend a bit of planning with your favorite design tools before creating groups, templates and all that good stuff. Another cool feature of Nagios is that you can define its commands yourself, including how a host should be checked. In a bit, I’ll show you how we leveraged command definitions to send notifications to our Slack.
define command{
command_name your command name
command_line /bin/echo "your very own command to make the world a better place goes here!"
}
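To make the template inheritance from this section a bit more tangible, here is a minimal Python sketch of how a chain of use directives could be resolved. It is a toy model, not Nagios' own code: the function name effective, the generic-host template and its directive values are made up for illustration, and it only handles a single parent per object.

```python
# Toy model of Nagios-style "use" inheritance (illustration only, not Nagios code).

def effective(definition, templates, _chain=()):
    """Return the directives of `definition` after following its "use" chain."""
    merged = {}
    parent = definition.get("use")
    if parent is not None:
        if parent in _chain:
            # Guard against templates that end up using each other.
            raise ValueError("circular template chain: " + " -> ".join(_chain + (parent,)))
        merged.update(effective(templates[parent], templates, _chain + (parent,)))
    # Directives set directly on an object win over inherited ones.
    merged.update({key: value for key, value in definition.items() if key != "use"})
    return merged


# Hypothetical templates and host, loosely based on the host definition above.
templates = {
    "generic-host": {"check_interval": "5", "max_check_attempts": "3"},
    "google-hosts_template": {"use": "generic-host", "check_interval": "1"},
}
host = {"use": "google-hosts_template", "host_name": "google.com", "address": "1.2.3.4"}

resolved = effective(host, templates)
print(resolved["check_interval"], resolved["max_check_attempts"])
```

Even in this tiny example you have to read three definitions to learn what one host ends up with, which is why a bit of planning up front pays off.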
Checking configuration
Because of the relational nature of the objects, and the fact that users have to define these by hand in their favorite text editor, typos happen. But fear not! Nagios offers a command to determine whether it likes your offerings or not.
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
This command prints output to your terminal, giving hints as to what offends it. Using Nagios is fairly straightforward if you plan on giving all users access to its slightly archaic-looking web interface. We decided we would not, preferring to keep our information centralised in one channel.
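If you generate or edit config files from a script, the same check can be run from Python. The sketch below is only an illustration: the helper names are hypothetical, the paths are the ones from the command above, and it assumes Nagios exits with a non-zero status when the check fails.

```python
import subprocess


def verify_argv(nagios_bin, main_cfg):
    """Build the argument list for the config check (the -v flag shown above)."""
    return [nagios_bin, "-v", main_cfg]


def run_check(argv):
    """Run a command and return (succeeded, combined stdout and stderr)."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    # Assumption: a non-zero exit status means the configuration was rejected.
    return proc.returncode == 0, proc.stdout


# Example call (commented out, because it needs a Nagios installation):
# ok, output = run_check(verify_argv("/usr/local/nagios/bin/nagios",
#                                    "/usr/local/nagios/etc/nagios.cfg"))
# if not ok:
#     print(output)
```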
External commands
One of the few downsides of Nagios is that it does not offer an API by default. Plugins can be found that add one, but I have no experience with them, as I did not attempt to convince our server admins to enable them. Instead I decided to do things the old-fashioned way. To use “external commands”, as Nagios calls them, you must first configure your system to allow this feature. Luckily, it is most likely enabled by default. If it isn’t, consider poking your bearded server admin until it is. Once this feature is enabled, you will hopefully find this file on your system:
/usr/local/nagios/var/rw/nagios.cmd
I cannot guarantee the file will be in this exact location, as Nagios offers a lot of installation options, and results can differ greatly even between installations on the same distro. The file extension itself might make our more experienced readers curious, however: “A cmd file on a Unix system?” This file is a named pipe (a FIFO), which behaves like a queue: the first message jammed into the pipe is the first to come out.
Nagios is very specific about the messages it likes. Below you will see an example of an external command, copied from the Nagios external commands page.
#!/bin/sh
# This is a sample shell script showing how you can submit the ACKNOWLEDGE_SVC_PROBLEM command
# to Nagios. Adjust variables to fit your environment as necessary.
now=`date +%s`
commandfile='/usr/local/nagios/var/rw/nagios.cmd'
/bin/printf "[%lu] ACKNOWLEDGE_SVC_PROBLEM;host1;service1;2;1;1;Some One;Some Acknowledgement Comment\n" $now > $commandfile
Every command starts with a timestamp between a pair of brackets. I have not found an explanation as to why Nagios does not add these itself as soon as it receives a message in its queue, like a shopkeeper handing out tickets to customers to keep order in his shop. The closest thing I found to an explanation for this behavior, and for many of its other quirks, is “Nagios Ain’t Gonna Insist On Sainthood”, as the developers are fond of saying. The timestamp is followed by the command itself, which may take parameters, separated by semicolons, in the form of a couple of numbers, author names, comments, or even more timestamps. On that note, the highest number of timestamps I’ve managed to cram into a command so far is 3. Please email me if you break this high score. This may all sound gruesome, but do keep in mind that Nagios has a huge community offering some excellent documentation. If you ever find yourself lost, you can quite literally google your exact question and find a solution by a user who dealt with the exact same issue.
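The format described above is easy to get subtly wrong by hand, so it can help to build the line in one place. This is a minimal sketch, not part of our bot: format_external_command is a hypothetical helper, and the example values are the ones from the sample shell script.

```python
import time


def format_external_command(name, *args, now=None):
    """Build one external command line: "[timestamp] NAME;arg1;arg2\n"."""
    timestamp = int(time.time()) if now is None else int(now)
    fields = [name] + [str(arg) for arg in args]
    for field in fields:
        # A newline inside a field would start a second command in the pipe.
        if "\n" in field or "\r" in field:
            raise ValueError("fields must not contain line breaks")
    return "[{}] {}\n".format(timestamp, ";".join(fields))


line = format_external_command(
    "ACKNOWLEDGE_SVC_PROBLEM", "host1", "service1", 2, 1, 1,
    "Some One", "Some Acknowledgement Comment",
    now=1500000000,  # fixed timestamp so the output is reproducible
)
print(repr(line))
# '[1500000000] ACKNOWLEDGE_SVC_PROBLEM;host1;service1;2;1;1;Some One;Some Acknowledgement Comment\n'
```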
Commands from different channels
Now, you might not want to have your users on either your Nagios web interface or the server itself. In our company we found ourselves in that situation. Using a pair of SSL sockets and a liberal amount of patience, we managed to link a Slack bot to our Nagios instance. In the next few paragraphs, I’ll explain how we did that. First, some tips for anyone attempting to do the same:
Nagios will not output anything if there’s something off with the formatting of your message. Strongly consider printing your message and comparing it to the examples on the Nagios external commands page.
Nagios will not output anything for some correct commands either. Verify that your commands worked by peeking in the web interface of your installation (hostname:port/nagios/).
In our experience, the “schedule check” command does not actually schedule a check for the time you give it; it adds said time to the next check. If you want to check a host now, send your timestamp minus your check interval off to the pipe. If you do not feel inclined to tinker with your notification time settings, you can simply schedule your check for half an hour ago. This way, Nagios will instantly send out its notifications.
SSL sockets are some of the most obscure parts of any programming language, other than C. If you hope to google your errors like usual, ignore any GitHub issue you come across: no matter how similar it looks to your problem, it will most likely not help you in the slightest.
So, let’s walk through our implementation of the Nagios connection step by step. The first thing I wanted to achieve was getting the notifications out of Nagios and into our Slack. Remember those commands we could define ourselves? In your contacts.cfg file, Nagios lets you specify the command used to notify users, meaning we could define for ourselves what should happen with the notification messages. Here you can see what I came up with:
# 'notify-by-slack' command definition
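The "half an hour ago" trick from the tips above could look roughly like this. It is a sketch under assumptions: schedule_check_in_past is a made-up helper name, and it uses the SCHEDULE_SVC_CHECK external command with the host, service and check time as parameters. Printing the repr of the line also makes it easy to compare against the documented examples, trailing newline included.

```python
import time


def schedule_check_in_past(host, service, minutes_ago=30, now=None):
    """Build a SCHEDULE_SVC_CHECK line whose check time lies in the past."""
    now = int(time.time()) if now is None else int(now)
    check_time = now - minutes_ago * 60
    return "[{}] SCHEDULE_SVC_CHECK;{};{};{}\n".format(now, host, service, check_time)


line = schedule_check_in_past("host1", "service1", now=1500000000)
print(repr(line))
# '[1500000000] SCHEDULE_SVC_CHECK;host1;service1;1499998200\n'
```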
define command{
command_name notify-by-slack
command_line /bin/echo "{\"Slack_server\": \"you'd like to know that, wouldn't you?\",\"Port\": \"one of many\", \"Notification_type\": \"$NOTIFICATIONTYPE$\",\"Service\": \"$SERVICEDESC$\",\"Host\": \"$HOSTNAME$\",\"State\": \"$SERVICESTATE$\",\"Info\":\"$SERVICEOUTPUT$\",\"Comment\":\"$SERVICEACKCOMMENT$\",\"Date\":\"$SHORTDATETIME$\"}" | python /opt/nagios/bin/client.py
}
In this command definition I gave it the name notify-by-slack and piped the available data to a client socket written in Python. To save myself some work, I formatted the output message as a JSON dictionary. Writing the sockets themselves turned out to be a real pain. Having never written anything networking-related before, I figured writing these would just be a simple exercise in some low-level programming. My first issue arose when I realised my socket was sending fairly sensitive data over unencrypted traffic. Being the naïve intern I am, I wrote my own encryption scheme.
# Python 2 code: str objects double as byte strings here.
import Crypto.Random
from Crypto.Cipher import AES
import hashlib
import os

# salt size in bytes
SALT_SIZE = 16
# number of iterations in the key generation
NUMBER_OF_ITERATIONS = 512
# padding multiple; any multiple of AES' 16-byte block size works
AES_MULTIPLE = 64

def generate_key(password, salt, iterations):
    assert iterations > 0
    key = password + salt
    for i in range(iterations):
        key = hashlib.sha256(key).digest()
    return key

def pad_text(text, multiple):
    extra_bytes = len(text) % multiple
    padding_size = multiple - extra_bytes
    padding = chr(padding_size) * padding_size
    padded_text = text + padding
    return padded_text

def unpad_text(padded_text):
    padding_size = ord(padded_text[-1])
    text = padded_text[:-padding_size]
    return text

def encrypt(plaintext, password):
    salt = Crypto.Random.get_random_bytes(SALT_SIZE)
    key = generate_key(password, salt, NUMBER_OF_ITERATIONS)
    # note: ECB mode encrypts identical plaintext blocks to identical ciphertext blocks
    cipher = AES.new(key, AES.MODE_ECB)
    padded_plaintext = pad_text(plaintext, AES_MULTIPLE)
    ciphertext = cipher.encrypt(padded_plaintext)
    ciphertext_with_salt = salt + ciphertext
    return ciphertext_with_salt

def decrypt(ciphertext, password):
    salt = ciphertext[0:SALT_SIZE]
    ciphertext_sans_salt = ciphertext[SALT_SIZE:]
    key = generate_key(password, salt, NUMBER_OF_ITERATIONS)
    cipher = AES.new(key, AES.MODE_ECB)
    padded_plaintext = cipher.decrypt(ciphertext_sans_salt)
    plaintext = unpad_text(padded_plaintext)
    return plaintext
I jammed this class onto our Nagios server along with a key my laptop generated, which our Nagios server shared. During the next daily standup, I proudly announced these achievements: I had written two sockets, and encrypted it all myself. “Why aren’t you using SSL sockets?” was the first response from the scrum team. So I decided to switch to SSL traffic and drop the do-it-yourself encryption scheme. Soon I discovered SSL sockets are finicky things. They didn’t send out error messages like I was used to, but often just froze ominously. When one did print an exception, it often contradicted the next one I got when I tried to fix it. Finally it worked, after messing with self-signed certs and other strange issues for what seemed like an eternity. Here you can see a slightly censored version of my socket. I tried to use self-explanatory function names, so you can easily copy and paste this code and add your own code for the missing methods.
#! /usr/bin/python2.7
import ssl
import socket
import sys
import re
import logging
from systemd import journal
import os
import OpenSSL
import configReader

# initialising: allow creating a new socket while there are still unsent packets
# on the network, and allow the reuse of addresses
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# time out after no response in 5 seconds
s.settimeout(5)
config = configReader.config
cert_file = '{}{}'.format(config.get('nagios', 'NAG_KEY_DIR'), config.get('nagios', 'NAG_CERT_NAME'))
# Require a certificate from the server. We used a self-signed certificate
ssl_sock = ssl.wrap_socket(s,
                           ca_certs=cert_file,
                           keyfile='{}{}'.format(config.get('nagios', 'NAG_KEY_DIR'), config.get('nagios', 'NAG_SERVER_KEY')),
                           certfile=cert_file,
                           cert_reqs=ssl.CERT_REQUIRED)

def send(msg, sock):
    total_sent = 0
    while total_sent < len(msg):
        sent = sock.send(msg[total_sent:])
        if sent == 0:
            raise RuntimeError("socket connection broken")
        total_sent = total_sent + sent

lines = sys.stdin.read()
# slackserver_adr_re, port_re, validate and __find_data are among the parts
# left out of this censored listing
try:
    slackserver_adr = slackserver_adr_re.search(lines).groups()[0]
    port = port_re.search(lines).groups()[0]
    ssl_sock.connect((slackserver_adr, int(port)))
    # we validate whether we have the right host by checking if the cert we got
    # is the one we issued to the server.
    recv_cert = ssl_sock.getpeercert()
    my_cert = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, open(__find_data("server.crt"), 'r').read())
    validate(ssl_sock, my_cert, recv_cert)
    ssl_sock.write(lines)
except socket.error as err:
    stream = journal.stream('nagios_communication sender client')
    res = stream.write(err.message)
s.close()
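For readers on Python 3, where ssl.wrap_socket is no longer the recommended entry point, the same idea can be sketched with an SSLContext. This is not the client we ran: the function names are hypothetical, the payload is parsed with the json module instead of the regexes that were left out above, and the key names Slack_server and Port are the ones from the notify-by-slack command. Pinning a self-signed certificate by passing it as the CA file only works if the certificate's name matches the host you connect to.

```python
import json
import socket
import ssl


def parse_notification(raw):
    """Return (host, port, payload bytes) from the JSON piped in by Nagios."""
    data = json.loads(raw)
    return data["Slack_server"], int(data["Port"]), raw.encode("utf-8")


def build_client_context(ca_file=None, cert_file=None, key_file=None):
    """Create a TLS context that verifies the server certificate and hostname."""
    # ca_file may point at the self-signed server certificate to pin it.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file:
        # Optional client certificate, if the bot side asks for one.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def send_notification(raw, ctx, timeout=5):
    """Open a TLS connection to the bot and forward the raw JSON message."""
    host, port, payload = parse_notification(raw)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(payload)


# Example wiring (commented out, because it needs a listening server):
# import sys
# send_notification(sys.stdin.read(), build_client_context(ca_file="server.crt"))
```

Letting the context handle verification replaces the manual certificate comparison, at the price of needing a certificate whose name actually matches the server.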
Adding this command to my contact template, I could now notify users via Slack. The next step was to send messages from Slack back to Nagios, so users could reply to these notifications. This one took a while to figure out. In the end I settled on running a socket server as a daemon on the Nagios server. One of the key things I wanted to achieve here was to give this socket absolutely minimal permissions, as it takes input from the internet. And every dev knows input from the internet is inherently evil.
I achieved this by configuring Nagios to accept external commands from the listener without the sudo command. This might cause some pain for some, but personally I feel it is definitely worth it for the peace of mind it buys you. After some rigorous validation, the listener class pipes the input it takes to our previously mentioned nagios.cmd file. At first, still naive about the Unix philosophy, I tried to open this cmd file as a regular file in my Python class. Unfortunately it doesn’t work that way. Because, as mentioned before, it’s a Unix pipe, you can’t simply treat it as you would a regular file: it does not have an EOF. In Python I resolved this problem by opening it with the low-level os.open call and writing to the file descriptor directly.
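import os
import threading

class NagiosCheckHandler:
    # cmdfile is the command pipe, e.g. '/usr/local/nagios/var/rw/nagios.cmd'
    def __init__(self, cmdfile):
        self.cmd = cmdfile
        self.lock = threading.Lock()

    def perform_command(self, command_str):
        # hold the lock for the whole open/write/close so commands never interleave
        with self.lock:
            fd = os.open(self.cmd, os.O_WRONLY)
            try:
                os.write(fd, command_str + '\n')
            finally:
                # close the descriptor even if the write fails
                os.close(fd)
        return "nagios scheduled command for specified problem"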
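The "rigorous validation" mentioned above is not shown in this post, so here is only a rough idea of what such a check could look like before a line reaches the handler. Everything in it is an assumption for illustration: the allow-list, the regex and the function name are made up, and a real listener would probably also check each command's parameters.

```python
import re

# Hypothetical allow-list: only commands the bot is supposed to send.
ALLOWED_COMMANDS = {"ACKNOWLEDGE_SVC_PROBLEM", "SCHEDULE_SVC_CHECK"}

# "[timestamp] COMMAND;param;param" on a single line, without line breaks.
COMMAND_RE = re.compile(r"\[(\d{1,10})\] ([A-Z_]+)((?:;[^;\r\n]*)*)")


def validate_command(line):
    """Return the line unchanged if it looks acceptable, else raise ValueError."""
    match = COMMAND_RE.fullmatch(line)
    if match is None:
        raise ValueError("malformed external command")
    if match.group(2) not in ALLOWED_COMMANDS:
        raise ValueError("command not allowed: " + match.group(2))
    return line


print(validate_command("[1500000000] ACKNOWLEDGE_SVC_PROBLEM;host1;service1;2;1;1;Some One;ok"))
```

Rejecting line breaks matters most here: the handler appends the newline itself, so a smuggled one would turn a single message into two commands.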
Hi Mr Robot!
So, now that we could receive and send traffic from a server, it seems sensible to have an actual bot to receive, format and forward these messages to your communication channel, right? Considering every language has an implementation of sockets, which always seem suspiciously similar to the ones in C, my solution would work for any programming language of your choosing, or any communication channel you prefer. In our company we use Slack, so that’s what I wrote my bot for.
Slack offers an excellent API (unlike some other programs mentioned here). Slack’s instant messaging API is a firehose API, meaning it sprays data for every event that happens in your bot’s presence. Because of the excellent wrapping around this data, however, this is not a problem to deal with at all: simply filter by looking at the keys in one of the many dictionaries the Slack API sends you.
While structuring this Slack bot, my main concern was its extensibility. Nagios offers a ton of commands, and I couldn’t possibly implement them all in a couple of sprints. In the end, I settled on the following structure. A command has to be addressed to the bot ("@nagiosbot" in this case). Once the bot sees a command addressed to it, it instantiates a class with the same name as the command. Inspired by the way Flask routes requests, I decided to create my own router class. Its code looks like this:
import re

class Router():
    def __init__(self):
        # maps compiled regexes to the functions that handle them
        self.routes = {}

    def route(self, exp):
        def decorator(f):
            self.routes[re.compile(exp)] = f
            return f
        return decorator

    def serve(self, instance, user_input, user_id):
        for r, f in self.routes.items():
            m = r.match(user_input)
            if m:
                # named capture groups become keyword arguments
                m = m.groupdict()
                m['user_id'] = user_id
                return f(instance, **m)
        raise RouteNotFoundError('Route "{}" has not been registered'.format(user_input))

class RouteNotFoundError(Exception):
    def __init__(self, string):
        super(RouteNotFoundError, self).__init__(string)
        self.message = string
Every command class has its own private instance of this class, so functions and regexes can be reused between classes without any collisions. The downside, however, is that commands cannot easily be shared between classes, as a superclass has a different instance than its subclasses. Anyhow, in these command classes, functions have a decorator attached to them, which sports a regex. If the user’s input matches this regex, the command is selected and its parameters are filled in from the regex’s named capture groups. This leads to an incredible system where users can add functions and command classes without ever touching anyone else’s code! Now we can read from and write to Nagios, so we never have to open that medieval Nagios web app again, right? Sadly, Nagios has some limitations to its external commands. To define hosts we still need to create and write a *.cfg file by hand for Nagios. Worse still, we have to restart Nagios every time we add something!
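To show how the pieces fit together, here is a small end-to-end sketch: filter a Slack event by its keys, strip the bot mention, and hand the rest to a command class through its router. It is an illustration rather than our production bot. The Router is a compact copy of the one above, while BOT_ID, addressed_text, handle, the Acknowledge class and its regex are hypothetical names, and the event dictionaries only mimic the type, user and text keys of a Slack message event.

```python
import re


class RouteNotFoundError(Exception):
    pass


class Router:
    """Compact copy of the router shown earlier."""

    def __init__(self):
        self.routes = {}

    def route(self, exp):
        def decorator(f):
            self.routes[re.compile(exp)] = f
            return f
        return decorator

    def serve(self, instance, user_input, user_id):
        for pattern, func in self.routes.items():
            match = pattern.match(user_input)
            if match:
                return func(instance, user_id=user_id, **match.groupdict())
        raise RouteNotFoundError(user_input)


BOT_ID = "U0NAGIOS"  # hypothetical Slack user id of the bot


def addressed_text(event, bot_id=BOT_ID):
    """Return the text after the bot mention, or None if the event is not for us."""
    if event.get("type") != "message" or "text" not in event:
        return None
    prefix = "<@{}>".format(bot_id)
    text = event["text"].strip()
    if not text.startswith(prefix):
        return None
    return text[len(prefix):].strip()


class Acknowledge:
    router = Router()  # every command class gets its own router instance

    @router.route(r"acknowledge (?P<host>\S+) (?P<service>\S+)$")
    def service(self, host, service, user_id):
        return "ACKNOWLEDGE_SVC_PROBLEM;{};{};2;1;1;{};acknowledged via Slack".format(
            host, service, user_id)


COMMANDS = {"acknowledge": Acknowledge}  # first word of the message -> command class


def handle(event):
    text = addressed_text(event)
    if text is None:
        return None  # not a message addressed to the bot
    command_class = COMMANDS.get(text.split(" ", 1)[0])
    if command_class is None:
        raise RouteNotFoundError(text)
    return command_class.router.serve(command_class(), text, event.get("user"))


event = {"type": "message", "user": "U123", "text": "<@U0NAGIOS> acknowledge host1 service1"}
print(handle(event))
# ACKNOWLEDGE_SVC_PROBLEM;host1;service1;2;1;1;U123;acknowledged via Slack
```

Adding a new command then means adding one class and one entry in COMMANDS, without touching the existing handlers.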
Our end goal
So, now we have a bot which sends and receives messages, and by doing so allows users to interact with our server monitor using Slack. Why did we go through all this trouble? For one, we have centralised our communications, making it easier for us to see what’s going on in our company. Second, we (hopefully) made life easier for those brave engineers on malfunction watch. When something goes wrong, their phone buzzes with a Slack message. Instantly they have all the information they need on hand, and if they’re especially sleepy, they can open up a terminal app on their phone (I’d recommend Termux) and fix a customer’s broken server without even having to get out of bed.
A competitor arises
At one point, some of the developers seem to have gotten into an argument about good software design, as we often do. So they forked Nagios and called it Icinga. Icinga still contains all the lovely features we just discussed, but has a modern web interface and actually offers an API by default. This API has cool new features, like adding and deleting hosts remotely. Our company is currently looking to migrate our monitoring to Icinga. However, as Icinga stable was released 59 days ago at the time of writing, there may still be some problems lurking. Stay tuned for our follow-up blog posts on our experience migrating to Icinga!