Shane Alcock's Blog
Spent much of my week getting BSOD ready to be wheeled out at Open Day once again. During this process, I found and fixed a couple of bugs in the server that were causing nasty crashes. I also tracked down a bug in the client where the UI elements aren't redrawn properly if the window is resized. Normally this wouldn't be a big problem, but newer versions of Gnome like to silently resize full-screen apps, which meant that our UI was disappearing off the bottom of the screen. As an interim fix, I've disabled resizing in the BSOD client, but we really should be handling resize events properly.
Received a bug report for libtrace about the compression detection occasionally giving a false positive for uncompressed ERF traces. This is because the ERF header has no identifying 'magic' at the start, so every now and again the first few bytes (where the timestamp is stored) happen to match the bytes we use to identify a gzip header. I've strengthened the gzip check to examine an extra byte, so the chance of a false positive is now about 1 in 16 million. I've also added a special URI format called rawerf: so users can force libtrace to treat a trace as uncompressed ERF.
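The strengthened check amounts to matching the first three bytes of a gzip stream (the 0x1f 0x8b magic plus the deflate method byte 0x08) rather than just the two magic bytes. Libtrace's actual check is in C; this Python sketch only illustrates the idea:

```python
def looks_like_gzip(header: bytes) -> bool:
    # gzip streams start with the magic bytes 0x1f 0x8b, followed by
    # the compression method byte, which is 0x08 (deflate) in practice.
    # Matching three bytes instead of two drops the false-positive rate
    # from 1 in 65,536 to roughly 1 in 16.7 million (2^24).
    return (len(header) >= 3 and header[0] == 0x1f
            and header[1] == 0x8b and header[2] == 0x08)
```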
Friday was mostly consumed with looking after our displays at Open Day. BSOD continued to impress quite a few people and we were reasonably busy most of the day, so it seemed a worthwhile exercise.
Spent a little time reviewing my old YouTube paper in preparation for discussing it in 513.
Tracked down and fixed a few outstanding bugs in my new and improved anomaly_ts. The main problem was with my algorithm for keeping a running update of the median -- I had a rather obscure bug, which caused all sorts of problems, when inserting a new value that fell between the two values being averaged to calculate the median.
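For illustration, a running median over a fixed window can be maintained by keeping a sorted buffer alongside the arrival-order buffer. This sketch is not the actual anomaly_ts code, just one straightforward way to do it -- the case where a new value lands between the two middle values is exactly where subtle bugs creep in:

```python
import bisect

class RunningMedian:
    """Maintain a sorted window of recent values for median queries."""

    def __init__(self, maxlen):
        self.maxlen = maxlen
        self.window = []   # arrival order, for evicting the oldest value
        self.sorted = []   # kept sorted, for median lookup

    def add(self, value):
        if len(self.window) == self.maxlen:
            oldest = self.window.pop(0)
            self.sorted.pop(bisect.bisect_left(self.sorted, oldest))
        self.window.append(value)
        bisect.insort(self.sorted, value)

    def median(self):
        n = len(self.sorted)
        if n == 0:
            return None
        mid = n // 2
        if n % 2:
            return self.sorted[mid]
        # Even-sized window: average the two middle values.
        return (self.sorted[mid - 1] + self.sorted[mid]) / 2.0
```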
Added an API to ampy for querying the event database. This will hopefully allow us to add little event markers on our time series graphs. Also integrated my code for querying data for Munin time series into ampy.
Churned out a revised version of my L7 filter paper for the IEEE Workshop on Network Measurements. I have repositioned the paper as an evaluation of open-source payload-based traffic classifiers rather than a critique of L7 filter. I also spent a fair chunk of time replacing my nice pass-fail system for representing results with the exact accuracy numbers, because apparently reviewers found the former confusing.
Tried to continue my work in tidying up and releasing various trace sets, but ran into some problems with my rsyncs flooding the faculty network. This was quite a nuisance, so we need to be more careful in future about how we move traces around (despite it not really being our fault!).
Managed to get a decent little algorithm going for quickly detecting a change between a noisy and constant time series. Seems to work fairly well with the examples I have so far.
Decided to completely re-factor the existing anomaly_ts code as it was getting a little unkempt, especially if we hope to have students working on it. For instance, there were several implementations of a buffer containing the recent history for a time series spread across the various detector modules. Also, most of the detectors that we had implemented were not being used and were creating a lot of confusion. On top of that, our main source file had a lot of branching based on the metric being used by a time series, e.g. latency, bytes, users.
It took the whole week, but I managed to produce a fresh implementation that was clean, tidy and free of extraneous code. All of the old detectors were placed in an archive directory in case we need them later. Each time series metric is now implemented as a separate class, so there is a lot less branching in the main source. There is also now a single HistoryBuffer implementation that can be used by any detector, including future detectors.
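As a rough illustration of the shared-buffer idea (the names here are hypothetical, not the actual anomaly_ts classes), the whole thing can be as simple as a single ring buffer of timestamped measurements that every detector queries instead of rolling its own:

```python
from collections import deque

class HistoryBuffer:
    """A fixed-size buffer of recent (timestamp, value) measurements
    that any detector can share rather than reimplement."""

    def __init__(self, size):
        # deque with maxlen silently evicts the oldest entry when full.
        self.values = deque(maxlen=size)

    def update(self, timestamp, value):
        self.values.append((timestamp, value))

    def recent(self, n):
        """Return the n most recent values, oldest first."""
        return [v for (_, v) in list(self.values)[-n:]]

    def __len__(self):
        return len(self.values)
```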
Released the ISP DSL I traces on WITS -- we are now sharing (anonymised) residential DSL traces for the first time, which will no doubt prove to be very popular.
Finished up the 513 marking (eventually!) and released the marks to the students.
Released a new version of libtrace -- 3.0.17.
Started working on releasing some new public trace sets. Waikato 8 is now available on WITS and the DSL traffic from our 2009 ISP traces will hopefully soon follow. In the process, I found a couple of little glitches in traceanon that I was able to fix before the libtrace release.
Decided that our anomaly detection code does not handle time series that switch from constant to noisy and back again particularly well. A classic example is latency to Google: during working hours it is noisy, but at other times it is constant. We detect the switch, but only after a long time. I would like to detect this change sooner and report it as an event (although not necessarily alert on it). I've started looking into an alternative method of detecting the change in time series style based on a pair of sliding windows: one for the last hour, one for the 12 hours before that. It is working better, but is currently a bit too sensitive to the effect of an individual outlier.
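A minimal sketch of the two-window idea -- the function name, the use of standard deviation and the threshold are all illustrative assumptions, not the real detector's values:

```python
import statistics

def detect_style_change(recent, baseline, ratio=3.0):
    """Compare the variability of a short recent window against a
    longer baseline window. Returns 'noisier' or 'quieter' when the
    spread changes by more than the given ratio, else None.
    Hypothetical sketch; thresholds are illustrative only."""
    if len(recent) < 2 or len(baseline) < 2:
        return None
    r = statistics.pstdev(recent)
    b = statistics.pstdev(baseline)
    if b == 0 and r == 0:
        return None                    # both windows are constant
    if b == 0 or (r > 0 and r / b >= ratio):
        return 'noisier'               # constant -> noisy transition
    if r == 0 or b / r >= ratio:
        return 'quieter'               # noisy -> constant transition
    return None
```

One possible way to blunt the effect of a single outlier would be to use a robust spread measure such as the median absolute deviation instead of the standard deviation, though that is only a suggestion here.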
Libtrace 3.0.17 has finally been released.
This release adds some new convenience functions to the libtrace API and fixes a number of bugs, many of which have been reported by users.
The major changes in this release are:
* Added API functions for getting the IP address from a packet as a string.
* Added API functions for calculating packet checksums at the IP and transport layers.
* Fixed major bug where the event API was not working with int: inputs.
* Fixed broken checksum calculations in tracereplay.
* Fixed bug where IP headers embedded inside ICMP messages were not being anonymised by traceanon.
* Added API support for working with ICMPv6 headers.
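The checksum calculations behind two of the items above boil down to the standard RFC 1071 ones'-complement sum over 16-bit words. Libtrace implements this in C; the Python sketch below is only an illustration of the algorithm, not libtrace code:

```python
def ip_checksum(header: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit words, as used
    for the IPv4 header checksum. Illustrative sketch only."""
    if len(header) % 2:
        header += b"\x00"              # pad odd-length input
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                 # fold the carries back in
        total = (total & 0xffff) + (total >> 16)
    return ~total & 0xffff
```

Verifying a received header is the same operation: summing a header that already contains a correct checksum yields zero.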
The full list of changes in this release can be found in the libtrace ChangeLog.
You can download the new version of libtrace from the libtrace website.
Fixed the bugs in the anomaly_ts / eventing chain that I introduced last week. We're back reporting events again on the web dashboard.
Wrote ampy modules for retrieving smokeping and munin data from NNTSC so that Brendon could plot graphs of those time series. Doing this showed up some (more) problems in the graphing which Brendon eventually tracked down to being related to how aggregation was being performed within the NNTSC database.
Spent a large chunk of my week marking the 513 libtrace assignment. It is a much bigger class than previous years (over 30 students) so it was pretty time consuming to mark. In general, it was pleasing to see most students had gotten the basics of passive measurement worked out and hopefully they got some valuable experience from it. My biggest disappointment was how many students didn't read the instructions carefully -- especially those who missed the requirement to write original programs rather than blindly copying huge chunks of the example code.
Another short week, due to being away on Tuesday and Wednesday.
Started writing up a decent description of the design and implementation of NNTSC, which would hopefully make for a decent blog post. It also means that the entire thing is stored somewhere other than in my head...
Revisited the eventing side of our anomaly detection process. Had a long but eventually productive discussion with Brendon about what information needs to be stored in the events database to be able to support the visualisation side. We decided that, given the NNTSC query mechanism, events should have information about the collection and stream that they belong to so that we can easily filter them based on those parameters. We used to use "source" and "destination" for this, but streams are defined using more than just a source and destination now.
Updated anomalyfeed, anomaly_ts and eventing to support the new info that needs to be exported all the way to the eventing program. In the process, I moved eventing into the anomaly_ts source tree (because they shared some common header files) and wrangled automake into building them properly as separate tools. Got to the stage where everything was building happily, but not running so well :(
Very short week this week, but managed to get a few little things sorted.
Added a new dataparser to NNTSC for reading the RRDs used by Munin, a program that Brad is using to monitor the switches in charge of our red cables. The data in these RRDs is a lot noisier than smokeping data, so it will be interesting to see how our anomaly detection copes with it. Also finally got the AMP data exported to our anomaly detector -- the glue program that converts NNTSC data into something anomaly_ts can read wasn't parsing AMP records properly.
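For a flavour of what pulling data out of an RRD involves, `rrdtool fetch` emits a simple text table of timestamped values, with NaN marking unknown samples. The helper below is a hypothetical sketch of parsing that output; the real NNTSC dataparser may well read the RRDs through a binding instead:

```python
def parse_rrd_fetch(output):
    """Parse the text output of `rrdtool fetch <file> AVERAGE ...`
    into (timestamp, value) pairs, skipping unknown (NaN) samples.
    Hypothetical helper, not the actual NNTSC dataparser."""
    rows = []
    for line in output.splitlines():
        if ':' not in line:
            continue                   # DS-name header or blank line
        ts, _, rest = line.partition(':')
        try:
            timestamp = int(ts.strip())
        except ValueError:
            continue
        fields = rest.split()
        if not fields:
            continue
        value = float(fields[0])
        if value == value:             # NaN != NaN, so this drops unknowns
            rows.append((timestamp, value))
    return rows
```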
Spent a bit of time working on adding some new rules to libprotoident to identify previously unknown traffic in some traces sent to me by one of our users.
Spent Friday afternoon talking with Brian Trammell about some mutual interests, in particular passive measurement of TCP congestion window state and large-scale measurement data collection, storage and access. In terms of the latter, it looks like many of the design decisions we have reached with NNTSC are very similar to those that he had reached with mPlane (albeit mPlane is a fair bit more ambitious than what we are doing), which I think was pretty reassuring for both sides. Hopefully we will be able to collaborate more in this space, e.g. developing translation code to make our data collection compatible with mPlane.
Exporting from NNTSC is now back to a functional state and the whole event detection chain is back online. Added table and view descriptions for more complicated AMP tests; traceroute, http2 and udpstream are now all present. Hopefully we can get new AMP collecting and reporting data for these tests soon so we can test whether it actually works!
Had some user-sourced libtrace patches come in, so spent a bit of time integrating these into the source tree and testing the results. One simply cleans up the libpacketdump install directory so it doesn't create as many useless or unused files (e.g. static libraries and versioned library symlinks). The other adds support for the OpenBSD loopback DLT, which is actually a real nuisance because OpenBSD isn't entirely consistent with other OSes about the values of some DLTs.
Helped Nathan with some TCP issues that Lightwire were seeing on a link. Was nice to have an excuse to bust out tcptrace again...
Looks like my L7 Filter paper is going to be rejected. Started thinking about ways in which it can be reworked to be more palatable, maybe presenting it as a comparative evaluation of open-source traffic classifiers instead.
Turns out that once again, the current design of NNTSC didn't quite meet all of the requirements for storing AMP data. The more complicated traceroute and HTTP tests needed multiple tables for storing their results, which wasn't quite going to work with the "one stream table, one data table" design I had implemented.
Managed to come up with a new design that will hopefully satisfy Brendon while still allowing for a consistent querying approach. Implemented the data collection side of this, including creating tables for the traceroute test. This was a bit trickier than planned, because SQLAlchemy doesn't natively support views and also the traceroute view was rather complicated.
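On the SQLAlchemy point: since it has no first-class CREATE VIEW construct, a common workaround is to issue the view DDL as raw SQL alongside the tables it manages. The sqlite-based sketch below just illustrates that pattern -- the table, column and view names are hypothetical, not NNTSC's actual schema (and NNTSC's real views are considerably more complicated):

```python
import sqlite3

# Stand-in for the database NNTSC talks to (which is PostgreSQL via
# SQLAlchemy in reality); sqlite keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# A per-test data table, following the "data_" + module naming idea.
conn.execute("CREATE TABLE data_amp_traceroute "
             "(stream_id INTEGER, timestamp INTEGER, hop_count INTEGER)")

# The view is created with hand-written DDL, since the ORM layer
# cannot express it natively.
conn.execute("CREATE VIEW vw_traceroute AS "
             "SELECT stream_id, timestamp, hop_count "
             "FROM data_amp_traceroute")

conn.execute("INSERT INTO data_amp_traceroute VALUES (1, 1000, 12)")
rows = conn.execute("SELECT * FROM vw_traceroute").fetchall()
```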
Currently working on updating the exporting side to use id numbers to identify collections rather than names, since there is no longer any guarantee that the data will be located in a table called "data_" + module + module_subtype.
Also spent a fair bit of time reading over Meenakshee's report and covering it in red pen. Pretty happy with how it is coming together.