Saturday, February 20, 2010

Parallelizing JUnit test runs

Test runs should be as fast as possible to allow a lean development cycle. One application of fast test runs is Continuous Deployment (see Lean Startup).

Using powerful multi-core machines to run tests is not enough, since most unit tests use a single thread per test. Apart from reducing I/O to a minimum, running tests in parallel is required to squeeze all the juice out of the machine. An additional benefit of parallelizing JUnit test runs is that it helps ensure tests do not depend on each other.

We implemented parallel test runs by creating an Ant target with a parallel task containing N JUnit tasks, where N is the number of cores. For example:

<parallel>
  <junit printsummary="yes" haltonfailure="yes" fork="true" maxmemory="${maxmemory}" showoutput="yes">
    <jvmarg value="-XX:MaxPermSize=${permMem}"/>
    <jvmarg value="-Xms${minMem}"/>
    <jvmarg value="-Xmx${maxmemory}"/>
    <classpath refid="classpath.test" />
    <formatter type="xml" usefile="true" />
    <test name="com.kaching.GroupedTests$GroupA" todir="${testTargetJunit}" unless="testcase"/>
  </junit>
  <junit printsummary="yes" haltonfailure="yes" fork="true" maxmemory="${maxmemory}" showoutput="yes">
    <jvmarg value="-XX:MaxPermSize=${permMem}"/>
    <jvmarg value="-Xms${minMem}"/>
    <jvmarg value="-Xmx${maxmemory}"/>
    <classpath refid="classpath.test" />
    <formatter type="xml" usefile="true" />
    <test name="com.kaching.GroupedTests$GroupB" todir="${testTargetJunit}" unless="testcase"/>
  </junit>
  ...
</parallel>
The GroupedTests$GroupX classes extend TestCase and have a public static Test suite() method. The suite() method creates a JUnit test suite on the fly by loading all the tests in scope and filtering them. If, for example, there are four cores, you would like to have four suites, each running a fourth of the tests. The suites use a java.util.Random seeded with the commit revision number. Using the Random object, we decide which suite each test case is placed in:
random.nextInt(numOfTestSuites) == testSuiteId
Therefore each test case goes to exactly one suite, and the test cases are distributed roughly evenly between suites. Randomness in assigning test cases to suites (and therefore to processes) gives us some reassurance that there are no dependencies between tests.
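The selection rule can be sketched in plain Java as follows. This is a minimal illustration rather than the actual GroupedTests code: the class name SuitePartitioner, the method testsForSuite, and the use of test names as strings are assumptions made for the example. It also assumes that every suite walks the tests in the same order (here enforced by sorting), since each suite replays the same random sequence and keeps only the tests whose draw matches its own id.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the original build.
final class SuitePartitioner {

    /** Returns the tests that suite testSuiteId (0-based) should run. */
    static List<String> testsForSuite(List<String> testNames, long revision,
                                      int numOfTestSuites, int testSuiteId) {
        // Sort a copy so every suite (each in its own JVM) sees the same order.
        List<String> ordered = new ArrayList<>(testNames);
        Collections.sort(ordered);

        // Same seed in every suite, so the sequence of draws is identical everywhere.
        Random random = new Random(revision);
        List<String> selected = new ArrayList<>();
        for (String name : ordered) {
            // One draw per test; exactly one suite id matches each draw.
            if (random.nextInt(numOfTestSuites) == testSuiteId) {
                selected.add(name);
            }
        }
        return selected;
    }
}
```

Because the seed changes with every commit, the grouping changes too, so a hidden dependency between two tests is likely to surface after a few revisions.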

As a result, the tests run about twice as fast as they did before: about 2 min 20 sec for about 4.6k tests for one of our components, including code fetch from the source repository, clean, build, and test suite setup time. It gives us a nice 100% test machine utilization and a faster commit-to-production deployment cycle (about four minutes).

Friday, February 12, 2010

select product(value) from mytable

A nice tip from Harold Fuchs to calculate the product in MySQL:

select exp(sum(log(coalesce(value,1)))) from mytable
The coalesce() function is there to guard against trying to calculate the
logarithm of a null value and may be optional depending on your
circumstances.
Here's an example
+------+-------+
| id   | value |
+------+-------+
|    1 |     3 |
|    1 |     2 |
|    2 |     5 |
|    3 |     7 |
|    3 |     3 |
+------+-------+
and as expected
select id,exp(sum(log(coalesce(value,1)))) from mytable group by id;
yields
+------+----------------------------------+
| id   | exp(sum(log(coalesce(value,1)))) |
+------+----------------------------------+
|    1 |                                6 |
|    2 |                                5 |
|    3 |                               21 |
+------+----------------------------------+
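
The trick relies on the identity log(a) + log(b) = log(a*b), so it only holds for strictly positive values, and the result is a floating-point number that may need rounding. Here is a small Java sketch of the same arithmetic; the class and method names are made up for the illustration, and a null entry stands in for a SQL NULL.

```java
// Hypothetical illustration of exp(sum(log(coalesce(value, 1)))).
final class LogProduct {

    /** Product of strictly positive values; null entries are treated as 1. */
    static double product(Double... values) {
        double sumOfLogs = 0.0;
        for (Double v : values) {
            double x = (v == null) ? 1.0 : v;  // mirrors coalesce(value, 1)
            sumOfLogs += Math.log(x);          // only meaningful for x > 0
        }
        return Math.exp(sumOfLogs);
    }
}
```

For the rows with id 1 above this computes exp(log 3 + log 2), which is 6 up to floating-point error.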

Tuesday, February 2, 2010

Voldemort in the Wild

At kaChing, we've tried to embrace as much of the lean startup methodology as possible. In keeping with that spirit, we've worked to scale our infrastructure smartly, using data to drive our decisions and discarding speculation. As part of our infrastructure, we've embraced Project Voldemort as a highly performant and reliable data store. One experiment we've been looking into is how the use of Solid State Drives (SSDs) may improve the performance of Voldemort and, perhaps even more importantly, how that performance compares relative to the cost of the hardware.

Before even starting, every indication pointed to SSDs providing a significant performance boost in almost every type of benchmark, but we are solely concerned with how an SSD performs in our production infrastructure. I realize that's a huge caveat, and I suppose there are plenty of artifacts that affect the performance numbers (purists shudder), but there are plenty of good reasons to use our production systems. First, I don't have to take any machines out of rotation (unused resources are costly). Second, I don't need to do any work to set up the benchmark; I just need to instrument already running services. Third, and probably most importantly, I get performance numbers on actual data(!), not just a contrived benchmark. If I plug more SSDs into my infrastructure, I know exactly how they will perform on the data that is most unique and special to me.

Setting up the benchmark was fairly easy. Since the source for Voldemort is available on GitHub, I just cloned the latest version and added some stopwatches using Perf4J. I was primarily concerned with the round-trip times for the standard operations 'get', 'put', 'getAll' from the view of my clients (As an aside, Voldemort makes server-side stats available via JMX if that's your interest). Then, we use a Log4J Appender to forward the stats collected from Perf4J to a central hub. I wrote a nice little parser in Scala (woot!) and generated some charts using JFreeChart.
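
To give an idea of what client-side instrumentation of this kind looks like, here is a minimal plain-Java stand-in. It is not the Perf4J API and not the code from my branch; the class OperationTimer and its methods are hypothetical. It wraps an operation, measures the round trip with System.nanoTime(), and keeps a call count and total time per tag, which is enough to report an average time and a number of calls per operation. The sketch is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a stopwatch library; single-threaded use only.
final class OperationTimer {

    private static final class Stats {
        long calls;
        long totalNanos;
    }

    private final Map<String, Stats> statsByTag = new LinkedHashMap<>();

    /** Runs the operation and records its elapsed time under the given tag. */
    <T> T time(String tag, Supplier<T> operation) {
        long start = System.nanoTime();
        try {
            return operation.get();
        } finally {
            // Recorded even when the operation throws.
            Stats stats = statsByTag.computeIfAbsent(tag, k -> new Stats());
            stats.calls++;
            stats.totalNanos += System.nanoTime() - start;
        }
    }

    long calls(String tag) {
        Stats stats = statsByTag.get(tag);
        return stats == null ? 0 : stats.calls;
    }

    double averageMillis(String tag) {
        Stats stats = statsByTag.get(tag);
        return (stats == null || stats.calls == 0) ? 0.0 : stats.totalNanos / 1e6 / stats.calls;
    }
}
```

A client call would then be wrapped along the lines of timer.time("get", () -> store.get(key)), where store and key are whatever the client already uses.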

It also makes sense to talk about the hardware used in the experiment. In general, the machines run similar Linux OSes and mostly similar services. They all have 32-bit, dual-core processors with 4GB of memory. The biggest difference is that half the machines are configured with the Physical Address Extension (PAE) feature while the others are not. Machines with PAE enabled are capable of addressing the full 4GB of memory, while non-PAE machines reserve ~1GB for the kernel.

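+-----------+-----------+---------------------+
| Machine   | Memory    | CPU                 |
+-----------+-----------+---------------------+
| #14 (SSD) | 3GB       | Intel E2140 1.60GHz |
| #1        | 4GB (PAE) | Intel E2180 2.00GHz |
| #6        | 3GB       | Intel E2140 1.60GHz |
| #8        | 3GB       | Intel E2140 1.60GHz |
| #29       | 4GB (PAE) | Intel E2160 1.80GHz |
| #30       | 4GB (PAE) | Intel E2160 1.80GHz |
+-----------+-----------+---------------------+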

So, with no further delay, here are the charts and data! For the charts, the left side is the average time in milliseconds between when an operation starts and stops from the client. The timespan is over 5 full stock market trading days. The chart data was captured at 10 minute increments, while the tables show the data rolled up into averages for the entire day. In addition, the daily tables show the number of times the operation was called as the second number in the table cell. The SSD machine is highlighted as a red line so it's clearly distinguishable from the others.

This first set of charts shows stock ticker data that we store in Voldemort. We fetch stock information from our provider and put the data as protocol buffers into Voldemort, essentially using it as a persistent cache. As you can see in the chart, the process starts ramping up around 6am EST, and is consistent throughout the day until right before the stock market closes at 4pm. The size of the data is roughly 190 bytes per object.

Stock Ticker / Get


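+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Machine   | 01/10              | 01/11              | 01/12              | 01/13              | 01/14              |
+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| #14 (SSD) | 2.10 ms / 12579905 | 1.51 ms / 12808948 | 2.00 ms / 12372251 | 2.10 ms / 12775541 | 2.10 ms / 13021540 |
| #1        | 4.50 ms / 13639223 | 2.86 ms / 14066013 | 4.50 ms / 14035117 | 4.30 ms / 14014719 | 4.30 ms / 14381169 |
| #6        | 3.30 ms / 14697628 | 2.47 ms / 14974348 | 3.50 ms / 14976403 | 3.20 ms / 14992560 | 3.30 ms / 15297108 |
| #8        | 3.30 ms / 14268500 | 2.41 ms / 14541938 | 3.40 ms / 14543452 | 3.10 ms / 14537675 | 3.20 ms / 14933763 |
| #29       | 3.80 ms / 15638044 | 2.58 ms / 15978472 | 4.00 ms / 16044711 | 3.20 ms / 15962932 | 3.70 ms / 16407025 |
| #30       | 3.10 ms / 13613129 | 2.30 ms / 13844155 | 3.30 ms / 13930480 | 3.10 ms / 13768293 | 3.30 ms / 14126112 |
+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+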


Stock Ticker / Put


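+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Machine   | 01/10              | 01/11              | 01/12              | 01/13              | 01/14              |
+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+
| #14 (SSD) | 2.20 ms / 12191506 | 1.57 ms / 12458999 | 2.10 ms / 12328035 | 2.20 ms / 12433188 | 2.20 ms / 12712775 |
| #1        | 4.70 ms / 13203697 | 2.92 ms / 13646533 | 4.60 ms / 13620714 | 4.50 ms / 13596018 | 4.40 ms / 13952884 |
| #6        | 3.50 ms / 14313209 | 2.53 ms / 14611432 | 3.70 ms / 14576707 | 3.30 ms / 14590833 | 3.50 ms / 14941805 |
| #8        | 3.40 ms / 13969025 | 2.47 ms / 14268355 | 3.50 ms / 14189613 | 3.20 ms / 14229597 | 3.40 ms / 14575123 |
| #29       | 4.20 ms / 15652314 | 2.75 ms / 15966916 | 4.30 ms / 15929493 | 3.40 ms / 15952355 | 4.00 ms / 16369449 |
| #30       | 3.60 ms / 14872750 | 2.58 ms / 15160371 | 3.80 ms / 15213295 | 3.60 ms / 15075920 | 3.90 ms / 15508978 |
+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+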


Stock Ticker / GetAll


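+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Machine   | 01/10             | 01/11             | 01/12             | 01/13             | 01/14             |
+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+
| #14 (SSD) | 1.40 ms / 160011  | 2.79 ms / 165053  | 2.10 ms / 179204  | 1.20 ms / 230847  | 2.50 ms / 284638  |
| #1        | 9.20 ms / 168622  | 4.37 ms / 175207  | 5.80 ms / 190591  | 5.20 ms / 252607  | 5.80 ms / 310047  |
| #6        | 2.70 ms / 165798  | 4.37 ms / 171471  | 4.80 ms / 185357  | 2.50 ms / 237498  | 5.10 ms / 293108  |
| #8        | 2.50 ms / 159300  | 4.50 ms / 164113  | 4.90 ms / 177945  | 2.20 ms / 229931  | 3.90 ms / 286053  |
| #29       | 3.30 ms / 169287  | 3.51 ms / 174495  | 3.70 ms / 188762  | 3.30 ms / 251796  | 4.90 ms / 309301  |
| #30       | 10.50 ms / 164491 | 9.78 ms / 170083  | 11.70 ms / 183780 | 10.90 ms / 240214 | 11.70 ms / 294018 |
+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+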


The next data set is generated by a batch processing job that calculates portfolio performance from the day's market data. It's scheduled to start after the market closes and runs for a few hours to completion. The data is represented as a protobuf, and its size is roughly 500 bytes per object. The daily values in the table are a little skewed (read: messed up) because the snapshot is taken at 16:00 EST, which is right around the time of some of the calculation and seems to miss some of the data. The charts are unaffected, since they're taken every 10 minutes.

Data Crunching / Get


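+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Machine   | 01/10             | 01/11             | 01/12             | 01/13             | 01/14             |
+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+
| #14 (SSD) | 6.40 ms / 14502   | 13.13 ms / 249326 | 8.70 ms / 258187  | 11.90 ms / 13453  | 12.30 ms / 302971 |
| #1        | 10.90 ms / 17765  | 15.31 ms / 270925 | 13.30 ms / 279968 | 20.30 ms / 14883  | 20.80 ms / 328361 |
| #6        | 11.60 ms / 16839  | 13.60 ms / 288287 | 11.30 ms / 298032 | 15.90 ms / 15487  | 18.00 ms / 349492 |
| #8        | 11.30 ms / 17217  | 13.68 ms / 281386 | 11.20 ms / 290325 | 19.40 ms / 15535  | 20.00 ms / 340501 |
| #29       | 5.60 ms / 19824   | 12.15 ms / 304663 | 9.50 ms / 314920  | 17.90 ms / 16477  | 16.80 ms / 368730 |
| #30       | 10.10 ms / 18200  | 14.88 ms / 268220 | 12.70 ms / 276024 | 18.80 ms / 15299  | 22.60 ms / 321462 |
+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+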


Data Crunching / Put


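+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Machine   | 01/10             | 01/11             | 01/12             | 01/13             | 01/14             |
+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+
| #14 (SSD) | 10.00 ms / 5530   | 12.79 ms / 177981 | 8.30 ms / 183660  | 15.00 ms / 7435   | 12.20 ms / 216806 |
| #1        | 14.30 ms / 6575   | 14.91 ms / 196125 | 12.70 ms / 202345 | 25.00 ms / 8225   | 20.80 ms / 238778 |
| #6        | 12.90 ms / 6746   | 12.69 ms / 201875 | 10.10 ms / 208506 | 19.00 ms / 8536   | 17.10 ms / 245850 |
| #8        | 15.20 ms / 7035   | 12.68 ms / 207179 | 10.10 ms / 213601 | 26.50 ms / 8778   | 18.40 ms / 251763 |
| #29       | 9.30 ms / 7321    | 11.34 ms / 216955 | 8.90 ms / 224183  | 21.40 ms / 9135   | 16.40 ms / 264182 |
| #30       | 11.70 ms / 7260   | 13.77 ms / 215958 | 11.50 ms / 222130 | 27.40 ms / 9115   | 21.30 ms / 260285 |
+-----------+-------------------+-------------------+-------------------+-------------------+-------------------+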



Data Crunching / GetAll


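+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+
| Machine   | 01/10           | 01/11           | 01/12           | 01/13           | 01/14           |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+
| #14 (SSD) | 0.60 ms / 3682  | 1.00 ms / 4153  | 0.70 ms / 4989  | 0.50 ms / 2739  | 0.70 ms / 6381  |
| #1        | 10.40 ms / 3872 | 6.62 ms / 4326  | 8.70 ms / 5077  | 8.80 ms / 2776  | 7.00 ms / 5278  |
| #6        | 8.10 ms / 4031  | 8.64 ms / 4617  | 10.70 ms / 5642 | 9.20 ms / 3052  | 8.60 ms / 6120  |
| #8        | 7.20 ms / 4470  | 6.97 ms / 5239  | 9.70 ms / 5784  | 8.60 ms / 3424  | 7.40 ms / 6473  |
| #29       | 3.30 ms / 4423  | 5.13 ms / 5077  | 6.20 ms / 5687  | 6.60 ms / 3428  | 5.50 ms / 5923  |
| #30       | 14.80 ms / 4103 | 12.08 ms / 5114 | 20.10 ms / 5474 | 21.20 ms / 3219 | 17.40 ms / 6352 |
+-----------+-----------------+-----------------+-----------------+-----------------+-----------------+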


This last set of charts is also a portfolio analytics calculation that runs as a batch processing job and uses Voldemort as a persistent intermediary. The data is also stored as a protobuf, and the object size is about 8-9K per object. Of the three stores, this one is the least interesting, as it's only used for a short period of time during the day and is untouched during the rest of the day. I also believe the sample size is somewhat small, but of all the experiments, the SSD performs the worst on this data set. Also, due to the timing of the daily cut-off and the low number of calls, I'm not going to bother including the table.

Gains / Get



Gains / Put



Gains / GetAll



Some Conclusions


Clearly, the performance of the solid state drive is better than our other stores backed by traditional drives. One other interesting thing we notice in the data is that our machines without the Physical Address Extension seem to outperform the machines with the extension. Some of the reasoning may be that PAE adds an additional level of indirection required for memory operations, but I'd be interested in whatever thoughts others have on the topic. It's a little tough to see from the charts (since I was trying to highlight the SSD), but hopefully in a post down the line I'll emphasize that difference. Also, since we collected data for this experiment, we've added a 64-bit machine into the rotation, so it should be interesting to see what kind of results we get from that machine.

Finally, special thanks to Andrew Schwabecher and Will over at Central Host for helping us out! Also, for anyone who's interested in seeing what changes I made to Voldemort to perform the benchmarking, check out my branch of Voldemort. (Try to ignore the hack I have in there for passing the configuration in to Voldemort instead of reading the config from disk; it's unrelated to the profiling.)

Tuesday, January 26, 2010

Amusing log message

Some of us found this (a little too) amusing:

[pool-http-jetty-exec-thread-3] 20100122140052,WARN,com.kaching.trading.core.BalanceCalculation$1,
missing price for stock AAGH on portfolio [elided]
com.kaching.trading.core.StockMissingException: AAGH @ 01/22/2010
at com.kaching.trading.core.PositionMetrics.getPositionValues(PositionMetrics.java:265)
at com.kaching.trading.core.PositionMetrics.getLongPositionValues(PositionMetrics.java:251)
at com.kaching.trading.core.PositionMetrics.getAggregateLongValue(PositionMetrics.java:274)
at com.kaching.trading.core.PositionMetrics.getTotalValue(PositionMetrics.java:286)

Sunday, January 10, 2010

Complement TDD with MDA

Test Driven Development (aka TDD) is on the rise. Good developers understand that code with no proper testing is dead code. You can't trust it to do what you want, and it's hard to change.
I'm a strong believer in Dijkstra's observation that "Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence."

Dijkstra's statement doesn't contradict TDD. A test exercises a limited state machine. We do hope it will cover the bloody battlefield of production, confronting live data from users, but when we find that users did something unexpected that broke our software, we add a test emulating the users' behavior and fix the problem.

Introducing Monitoring Driven Architecture (aka MDA)!
MDA is a second line of defense behind TDD. MDA means that you bake monitoring into your architecture. Once you have MDA and the software is written in a monitorable (it is a word) way, you can detect problems faster and automatically roll back faulty code. Without it, it is not uncommon for a small number of users to suffer from a problem that manifests itself as an NPE thrown in one of the logs once in a blue moon, and for the operations team to find out about it only after a long while.

This is why I'm so excited about John's new Flexible Log Monitoring with Scribe, Esper, and Nagios deployment. It means that when we do find a problem we'll of course fix it, but in addition we'll make sure we express it in the logs and have our monitoring tools pick it up and send alerts about its existence, without counting on anyone to manually look at the logs.

Saturday, January 9, 2010

Actually Implementing Group Management Using ZooKeeper

ZooKeeper offers, in the words of its documentation, "off-the-shelf... group management". The "off-the-shelf" part is inaccurate; it really offers the proper primitives to *implement* group management, but it's up to you to fill in a few missing pieces.

I'll be describing one type of group management system I built at KaChing using ZooKeeper:

  • A group contains some logical service. The *meaning* of belonging to a group is typically "the instance is available for use by clients over the network".
  • Services can join and leave the group. The special case of a service crashing or a network outage needs to be handled as leaving the group.
  • Joined services share metadata about how to communicate with them, e.g., IP address, base URL, etc.
  • Clients can ask what instances are in the group, i.e., available.
  • Clients are notified when group membership changes so they can mutate their local state.
These map onto ZooKeeper as:
  • A group is a (permanent) node in the ZooKeeper hierarchy. Clients and services must be told the path to this node.
  • A service joins the group by creating an ephemeral node whose parent is the group node. By using an ephemeral node, if the service dies then the service is automatically removed from the group.
  • The ephemeral node's data contains the service metadata in some format like JSON, XML, Avro, Protobufs, Thrift, etc. ZooKeeper has no equivalent of HTTP's "Content-Type" header to identify the metadata representation, so services and clients must agree upon the format in some manner.
  • Clients can query for the children of the group node to identify the members of the group.
  • Clients can place a watch on the group node to be notified if nodes have joined or left the group.
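
To make the mapping concrete without pulling in a ZooKeeper client, here is a tiny in-memory model of the same contract. It is only a sketch for illustration: the class InMemoryGroup and its methods are hypothetical and are not the ZooKeeper or zkclient API. Joining corresponds to creating an ephemeral child, leaving corresponds to that child disappearing (explicitly or because the session died), and watchers receive the new member list. One difference to keep in mind: real ZooKeeper watches fire once and must be re-registered, while this model keeps notifying.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical in-memory model of a group node and its ephemeral children.
final class InMemoryGroup {

    // Member name -> metadata (for example a JSON string agreed upon by services and clients).
    private final Map<String, String> members = new TreeMap<>();
    private final List<Consumer<List<String>>> watchers = new ArrayList<>();

    /** A service joins: models creating an ephemeral child holding its metadata. */
    void join(String member, String metadata) {
        members.put(member, metadata);
        notifyWatchers();
    }

    /** A service leaves or its session dies: models the ephemeral child being removed. */
    void leave(String member) {
        if (members.remove(member) != null) {
            notifyWatchers();
        }
    }

    /** Models querying the children of the group node. */
    List<String> currentMembers() {
        return new ArrayList<>(members.keySet());
    }

    String metadata(String member) {
        return members.get(member);
    }

    /** Models a client watching the group node for membership changes. */
    void watch(Consumer<List<String>> watcher) {
        watchers.add(watcher);
    }

    private void notifyWatchers() {
        List<String> snapshot = currentMembers();
        for (Consumer<List<String>> watcher : watchers) {
            watcher.accept(snapshot);
        }
    }
}
```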
To help with development I use zkclient (pros: provides a much more natural interface to zk from Java compared to the actual ZooKeeper library, somewhat responsive committers; cons: some of the API semantics are difficult to understand, somewhat responsive committers). zkconf from Patrick Hunt makes it trivial to get zk running.

I've looked a bit at norbert from LinkedIn but the documentation is very slim (no README even!); from what I can make out from the code it seems to be well thought-out and provide a super-set of the system I've described. LinkedIn-ners, can you help a brother out?

Bonus section: Service Manifest

One downside to the system I've described is that it doesn't avoid the "rogue service": some service that shouldn't be running actually is running. It's happened to everyone before, don't be shy; you retired service X but it wasn't wiped from the servers and when the box rebooted some cron job restarted it.... oops.

To handle this you need a service "manifest" that lists all the services that *should* be running, so clients can filter out group members that are available but shouldn't be. In ZooKeeper this can be a parallel tree to the group node that uses permanent nodes rather than ephemeral nodes for group members, or just one permanent node whose data contains the group members, or some variation on that. And make sure you have your clients sound the alarm when a rogue service shows up.
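
On the client side the filtering boils down to two set operations. A minimal sketch, assuming the client already has the live group members and the manifest as sets of names (the class ServiceManifest and its method names are hypothetical):

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical client-side helper for the manifest check.
final class ServiceManifest {

    /** Members that are both live in the group and listed in the manifest. */
    static Set<String> usable(Set<String> liveMembers, Set<String> manifest) {
        Set<String> result = new TreeSet<>(liveMembers);
        result.retainAll(manifest);
        return result;
    }

    /** Members that are live but not listed: the rogue services to alert on. */
    static Set<String> rogue(Set<String> liveMembers, Set<String> manifest) {
        Set<String> result = new TreeSet<>(liveMembers);
        result.removeAll(manifest);
        return result;
    }
}
```

Clients would route traffic only to usable(...) and raise an alert whenever rogue(...) is non-empty.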

.. Adam

P.S. the Blogger post editor really really sucks.

Tuesday, January 5, 2010

Flexible Log Monitoring with Scribe, Esper, and Nagios

If you have a decent-sized cluster, there's a good chance that you've had the following experience: one day, while routinely browsing some server logs, you stumble upon some concerning entries that you wish you had been made aware of sooner. You could go back and write some custom scripts that email out warnings, but here's a solution that might scale better for your needs.

The idea here will be to use Scribe, Esper, and Nagios/NSCA to build a flexible way to monitor your logs for issues. First, a little background from each of their corresponding websites:

Scribe
Scribe is a server for aggregating log data streamed in real time from a large number of servers. It is designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine. Scribe was developed at Facebook and released as open source.

Esper
Esper is a component for CEP and ESP applications, available for Java as Esper, and for .NET as NEsper. Complex Event Processing, or CEP, is technology to process events and discover complex patterns among multiple streams of event data. ESP stands for Event Stream Processing and deals with the task of processing multiple streams of event data with the goal of identifying the meaningful events within those streams, and deriving meaningful information from them.

Nagios
Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. Nagios monitors your entire IT infrastructure to ensure systems, applications, services, and business processes are functioning properly. In the event of a failure, Nagios can alert technical staff of the problem, allowing them to begin remediation processes before outages affect business processes, end-users, or customers.

Nagios Service Check Acceptor
NSCA allows you to integrate passive alerts and checks from remote machines and applications with Nagios. Useful for processing security alerts, as well as deploying redundant and distributed Nagios setups.

Now that we know a little bit about these technologies, here's the game plan. Use Scribe to aggregate all of our logs of interest to a central location, process all of the log entries with Esper searching for dubious messages, and then forward on the alert to our monitoring infrastructure in Nagios.

Prerequisites
In order to save a little space in this post and to reuse the existing documentation around the net, we're going to leave setting up Scribe and Nagios as an exercise to the reader. In the Scribe source tree on github, you'll find a good tutorial for an example scribe configuration (see examples/README in the Scribe source). Nagios and NSCA will have packages available for most distributions, and there is plenty of documentation floating around for configuration and help if you should run into trouble. If you don't feel like going through the hassle, or if you have some other infrastructure you'd prefer to use, just skip this part and I'll outline some quick alternatives along the way.

Log Aggregation
The key first step in the process is collecting all the data we want to process into a single location. Admittedly, getting Scribe installed can be a little tough, especially with the dependencies, but once you've set it up, there's a return on investment. If you've set up Scribe in a configuration like their 2nd example, which sets up a Scribe client and a central Scribe server, a good way to get your logs forwarded to the central Scribe server is to use a handy Python script called scribe_log by Silas Sewell. Another good suggestion for keeping scribe_log running smoothly is to use Supervisor, which is outlined in Silas' blog post.

Log Processing
The next step is to process the aggregated logs looking for anomalies. For this, we'll write some Java code using the Esper libraries. We'll create a class called SimpleEsperEngine and fill it with some configuration. Don't worry about copying and pasting; at the end of the post there's a link to the source on GitHub.

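import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.UpdateListener;

// Members of the SimpleEsperEngine class.
private EPServiceProvider epService;
private EPStatement statement;

public SimpleEsperEngine(String esperStatement) {
    // Register LogLine so the statement can refer to it by name.
    Configuration config = new Configuration();
    config.addEventType(LogLine.class);

    this.epService = EPServiceProviderManager.getDefaultProvider(config);
    this.statement = epService.getEPAdministrator().createEPL(esperStatement);
}

// Registers a callback that fires when the statement produces output.
public void addUpdateListener(UpdateListener listener) {
    statement.addListener(listener);
}

// Wraps a raw log line in a LogLine event and sends it into the engine.
private void processLine(String line) {
    LogLine event = new LogLine(line);
    epService.getEPRuntime().sendEvent(event);
}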
This code sets up the Esper engine with a standard configuration and registers the LogLine event which we'll define in the LogLine class later. Next we'll define a simple notification that just prints out our status code and message, which will later correspond with the passive alert we'll send Nagios/NSCA.
public void notifyCommand(NagiosStatus status, String message) {
    System.out.println(String.format("Status: %s, Message: %s", status, message));
}
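The NagiosStatus type used by notifyCommand is not shown in this post (the real one is in the source tree linked at the end). A minimal sketch of what such an enum might look like, assuming only the Critical constant and the getStatusCode() method that the code in this post relies on, plus the standard Nagios codes 0, 1, and 2:

```java
// Hypothetical sketch; the enum in the actual source tree may differ.
enum NagiosStatus {
    Okay(0),
    Warning(1),
    Critical(2);

    private final int statusCode;

    NagiosStatus(int statusCode) {
        this.statusCode = statusCode;
    }

    /** The numeric status that Nagios expects in a passive check result. */
    int getStatusCode() {
        return statusCode;
    }
}
```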
And here's the LogLine class:
class LogLine {

    private String line;

    public LogLine(String line) {
        this.line = line;
    }

    public String getLine() {
        return line;
    }

}
Finally, here's our main that pulls it all together. You'll probably want to pass your Esper statement or keyword in as parameters to the program, but I'll leave that for you to set up to your preferences. The key is the esperStatement, which says to count the number of times a LogLine event contains our keyword inside a 60-second sliding window. To be honest, the things you can describe in Esper can be quite complex, so this particular example is really only scratching the surface of what you can put together. (Also as a note, the NagiosStatus.Critical passed into the notifyCommand() method maps to the Critical status code for Nagios, where 2=Critical, 1=Warning, 0=Okay.)
public static void main(String[] args) {

    final String keyword = "NullPointerException";
    final String esperStatement =
        "select line, count(*) as cnt from LogLine.win:time(60 second) where line like '%" + keyword + "%'";
    final String messageFormat = "CRITICAL: Found %s line(s) containing '%s'";

    final SimpleEsperEngine test = new SimpleEsperEngine(esperStatement);
    test.addUpdateListener(new UpdateListener() {
        public void update(EventBean[] newEvents, EventBean[] oldEvents) {
            // Guard against updates that carry no new events.
            if (newEvents == null || newEvents.length == 0) {
                return;
            }
            EventBean event = newEvents[0];
            long count = (Long) event.get("cnt");
            if (count >= 1) {
                try {
                    test.notifyCommand(NagiosStatus.Critical,
                        String.format(messageFormat, event.get("cnt"), keyword));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    });

    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    try {
        // readLine() returns null at end of input, so stop there instead of looping forever.
        String line;
        while ((line = in.readLine()) != null) {
            test.processLine(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
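If the Esper statement feels opaque, the following plain-Java sketch shows roughly what it computes: the number of lines containing the keyword seen during the last 60 seconds. It is a simplified stand-in and not Esper code; the class SlidingKeywordCounter is hypothetical, timestamps are passed in explicitly, and unlike Esper it only re-evaluates when a new line arrives rather than when old events expire.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the sliding-window count; not the Esper API.
final class SlidingKeywordCounter {

    private final String keyword;
    private final long windowMillis;
    // Arrival times of matching lines still inside the window, oldest first.
    private final Deque<Long> matchTimes = new ArrayDeque<>();

    SlidingKeywordCounter(String keyword, long windowMillis) {
        this.keyword = keyword;
        this.windowMillis = windowMillis;
    }

    /** Feeds one line observed at nowMillis and returns the match count in the window. */
    int onLine(String line, long nowMillis) {
        // Drop matches that have fallen out of the window.
        while (!matchTimes.isEmpty() && nowMillis - matchTimes.peekFirst() >= windowMillis) {
            matchTimes.removeFirst();
        }
        if (line != null && line.contains(keyword)) {
            matchTimes.addLast(nowMillis);
        }
        return matchTimes.size();
    }
}
```

The appeal of Esper is that this bookkeeping, and far more complex patterns, are expressed in a single statement instead of hand-written code.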
It's also helpful to set up a little one-liner script like this one (don't worry, it's in the source tree as run-demo.sh)
java -cp classes:lib/antlr-runtime-3.1.1.jar:lib/cglib-nodep-2.2.jar:lib/commons-logging-1.1.1.jar:lib/esper-3.2.0.jar:lib/jsendnsca-core-1.2.jar:log4j-1.2.15.jar SimpleEsperEngine
As is, we can run this from the command line, and type input to System.in to test it out. In this case, whenever we type 'NullPointerException' on a line, we'll see the CRITICAL message.
sh run-demo.sh
Adding Nagios to the Mix
Finally, to incorporate Nagios/NSCA into the mix, we'll use the JSendNSCA library and modify our notifyCommand.
public void notifyCommand(NagiosStatus status, String message) {
    NagiosSettings nagiosSettings = new NagiosSettings();
    nagiosSettings.setNagiosHost("localhost");
    nagiosSettings.setPort(5667);
    nagiosSettings.setEncryptionMethod(NagiosSettings.NO_ENCRYPTION);

    NagiosPassiveCheckSender sender = new NagiosPassiveCheckSender(
        nagiosSettings);

    MessagePayload payload = new MessagePayload();
    payload.setHostname("localhost");
    payload.setLevel(status.getStatusCode());

    payload.setServiceName("TestMessage");
    payload.setMessage(message);

    try {
        sender.send(payload);
    } catch (NagiosException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
There's plenty of Nagios configuration that needs to be done, but here's the service I've defined that matches the alert that is sent from the program.
define service {
    service_description     TestMessage
    host_name               localhost
    retry_check_interval    5
    check_command           check_dummy!0
    stalking_options        w,c,u
    passive_checks_enabled  1
    use                     generic-service
}
Once you've got it all configured you can put all the pieces together.
With scribe, just run:
tail -f /path/to/scribed/test/test_current | sh run-demo.sh &

and then send the exception through scribe with their included scribe_cat tool:
echo NullPointerException | scribe_cat test
Or, if you didn't set up scribe, but want to tail your own log:
touch test.log
tail -f test.log | sh run-demo.sh &
echo NullPointerException >> test.log
So, that's about the essence of it. Unfortunately, the topic of setting up Scribe and Nagios/NSCA was too long for this blog post, but at least they are fairly well documented out there. I'm sure you'll find that there are countless uses for these three technologies, and I hope this has served as an introduction to what might be possible. Happy coding. Thanks to the dev teams that put these things together, and another thanks to Silas for some handy utils.
- John H.

References
Source Code on GitHub: http://github.com/hitch17/esper-nsca-demo

Scribe: http://developers.facebook.com/scribe/
Esper: http://esper.codehaus.org/about/esper/esper.html
Nagios: http://www.nagios.org/
Nagios Service Check Acceptor: http://www.nagios.org/download/addons/
Supervisor: http://supervisord.org/
jsendnsca: http://code.google.com/p/jsendnsca/
scribe_log: http://www.silassewell.com/blog/tag/scribe_log/

  ©kaChing