Turned A No-Brainer Task Into A Challenging Job
Yes, this is a no-brainer job. However, if you have to do it for close to a hundred servers, a no-brainer job becomes a nerve-racking one. Sooner or later you will swear like hell. BTW, I swore too. After all the swearing, I wondered whether I could do a better job than the admin staff used to. I cannot possibly do this manually every month, right?
After some exploratory work to understand how the files store this information, I realised that I should be able to do it programmatically by dumping the RRD files into ASCII text, which in this case is XML. My next question was: should I use an XML parser to extract the information? Not in this case, because the system does not have any XML toolkit installed. Also, some XML toolkits may filter out the comments, which I need to tap into (they carry the timestamp in yyyy-mm-dd format and in epoch time). The timestamp is a very useful piece of information for determining whether the server is "amber" or "red".
I have always believed that I can extract anything from output that is generated by a program, because it ought to have a pattern. I am showing you a dump of the RRD at the end of this blog post. Can you see the pattern?
I will not show you my code because it is rather involved and messy. However, I will describe my approach to getting things done. Basically, I use a lot of UNIX pipes between a mixture of AWK and sed. FYI, I avoided using temporary files for all the processing.
- In my case, the 1st <database> stores the daily info, the 2nd the weekly, the 3rd the monthly and the 4th the yearly (this depends on how you create your RRD)
- Use AWK/sed to pick up the data from the 2nd <database>, ignore the NaN (not a number) records, and extract the timestamps
- Pipe that into another AWK to work out which day of the week has the highest CPU utilisation
- Open up that day's RRD (apparently it is stored in another RRD file)
- Retrieve the data from the 1st <database>, which is the daily data
- Work out the time difference between those records that are above the threshold
- Count the records above the threshold. Suppose the polling interval is 5 minutes: a continuous run then shows up as a constant 300-second time difference between the filtered records.
- If the count reaches the time specified (a continuous 1 hour means 12 data points), we flag the server as either Amber or Red depending on the threshold
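To make the counting logic concrete, here is a minimal sketch of the steps above. It is written in Python rather than the AWK/sed pipeline I actually used, so treat it as an illustration and not as my code. The threshold values, the function names and the UTC+8 offset for SGT are assumptions made for the example, and the records are assumed to be (epoch, value) pairs that have already been extracted from the dump.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: the real values depend on your own policy.
AMBER_THRESHOLD = 70.0
RED_THRESHOLD = 90.0
POLL_INTERVAL = 300   # seconds between samples (5-minute polling)
MIN_RUN = 12          # 12 consecutive samples = 1 continuous hour


def busiest_day(records, tz_offset_hours=8):
    """Return the yyyy-mm-dd date of the record with the highest value.

    records is a list of (epoch, value) tuples; the default offset
    assumes SGT (UTC+8), as in the dump comments.
    """
    epoch, _ = max(records, key=lambda r: r[1])
    tz = timezone(timedelta(hours=tz_offset_hours))
    return datetime.fromtimestamp(epoch, tz).strftime("%Y-%m-%d")


def longest_run(records, threshold, interval=POLL_INTERVAL):
    """Length of the longest streak of samples above threshold.

    A streak continues only while consecutive above-threshold samples
    are exactly `interval` seconds apart; records must be sorted by epoch.
    """
    best = current = 0
    prev_epoch = None
    for epoch, value in records:
        if value > threshold:
            if prev_epoch is not None and epoch - prev_epoch == interval:
                current += 1
            else:
                current = 1          # a gap starts a new streak
            prev_epoch = epoch
            best = max(best, current)
        else:
            current = 0              # a sample below threshold ends the streak
            prev_epoch = None
    return best


def classify(records):
    """Flag a day's records as Red, Amber or OK."""
    if longest_run(records, RED_THRESHOLD) >= MIN_RUN:
        return "Red"
    if longest_run(records, AMBER_THRESHOLD) >= MIN_RUN:
        return "Amber"
    return "OK"
```

You would call busiest_day() on the weekly records to pick the day, then classify() on that day's daily records. Checking the red threshold first means that a server which qualifies for both flags is reported as Red.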
I hope you are still with me. The moral of the story is not about the above steps; it is that we should always try to find joy in doing our work, no matter how dumb it is. It looked like a no-brainer job at first, but in the end it turned out to be a pretty challenging one.
Here is the RRD file dump:
rrdtool dump some-rrd-file.rrd

<!-- Round Robin Database Dump -->
<rrd>
    <version> 0001 </version>
    <step> 15 </step> <!-- Seconds -->
    <lastupdate> 1222743000 </lastupdate> <!-- 2008-09-30 10:50:00 SGT -->
    <ds>
        <name> ds0 </name>
        <type> GAUGE </type>
        <minimal_heartbeat> 600 </minimal_heartbeat>
        <min> 0 </min>
        <max> 1.0000000e+02 </max>
        <!-- PDP Status -->
        <last_ds> 4.3240000e+01 </last_ds>
        <value> 0.0000000000e+00 </value>
        <unknown_sec> 0 </unknown_sec>
    </ds>
    <!-- Round Robin Archives -->
    <rra>
        <cf> AVERAGE </cf>
        <pdp_per_row> 1 </pdp_per_row> <!-- 300 seconds -->
        <xff> 5.0000000000e-01 </xff>
        <cdp_prep>
            <ds><value> NaN </value> <unknown_datapoints> 0 </unknown_datapoints></ds>
        </cdp_prep>
        <database>
            <!-- 2008-10-24 09:50:00 SGT / 1224813000 --> <row><v> 1.5000000e+01 </v></row>
            <!-- 2008-10-24 09:55:00 SGT / 1224813300 --> <row><v> 1.0234000e+01 </v></row>
            .....
        </database>
    </rra>
    <rra>
        ....
        <database>
            <!-- 2008-10-20 12:00:00 SGT / 1224475200 --> <row><v> 3.1365000e+01 </v></row>
            <!-- 2008-10-20 12:30:00 SGT / 1224477000 --> <row><v> 2.4532000e+01 </v></row>
            .....
        </database>
    </rra>
    <rra>
        ....
        <database>
            .....
        </database>
    </rra>
</rrd>
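
The pattern in the dump is that every data row is preceded by a comment carrying the date and the epoch time. As a rough sketch of how that pattern can be picked up without an XML parser, here is a regular-expression version in Python. Again, this is not my AWK/sed code, and the names ROW_RE, DATABASE_RE and parse_dump are made up for the example. It assumes a single data source, that is, one <v> per <row>, as in the dump above.

```python
import re

# One "<!-- date time TZ / epoch --> <row><v> value </v></row>" pair.
ROW_RE = re.compile(
    r"<!--\s*(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \S+ / (\d+)\s*-->\s*"
    r"<row><v>\s*(\S+)\s*</v></row>"
)
# Everything between one <database> and its closing tag.
DATABASE_RE = re.compile(r"<database>(.*?)</database>", re.DOTALL)


def parse_dump(text):
    """Return one list of (date, epoch, value) tuples per <database> block.

    The lists come back in file order, so index 0 is the 1st <database>,
    index 1 the 2nd, and so on. NaN rows are skipped.
    """
    archives = []
    for block in DATABASE_RE.findall(text):
        rows = []
        for date, _time, epoch, value in ROW_RE.findall(block):
            if value.lower() == "nan":
                continue  # unknown sample, nothing to measure
            rows.append((date, int(epoch), float(value)))
        archives.append(rows)
    return archives
```

Feeding it the text printed by rrdtool dump gives one list per archive, so with my RRD layout the second list would hold the weekly records.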