Analyzing the performance of a relatively complex map-reduce job may be tricky. Many factors come into play and all this often happens in the constantly evolving environment where IT guys change their networks, software engineers change the code, other engineers modify some job or cluster settings trying to get better performance etc. In this environment it may be often had to say - what exactly was the result of this last change we have done?

In my practice I had to deal with a relatively complex map-reduce job. It was somewhat questionable design - I do believe this job could have been simplified by splitting the process into phases, but I have joined the project at the moment it was already a bit too late to rethink it. Plus, everyone, including me was at some point on Hadoop learning curve so at that time most of us knew much less than we do today.

Skipping some details, I would describe the job as very I/O intensive. It was designed to crunch large amounts of data nearly in real-time. The fight was to be able to crunch all the data we get without being back-logged for too long and leave some resources for other things.

Usual situation: weeks before the first release and the team sees that…well, we are not there yet. Of course, there were tons of options, ideas, proposals - most of them did make sense. But the question was: what exactly do we want to do to get the best bang for buck considering that there is not too much time left and we just could not do and try it all?

Yes, we did have QA team doing the testing, including the performance tests. But the input data they used was often inconsistent, not representative enough and often just simply wrong - they were also working on their test procedures. So we could not fully rely on their reports at that time.

At that time the team has made one good investment of the time - we have decided to take some time and develop a good test that we can run at any moment with a simple command what would allow us almost instantly to run our map-reduce job on a predefined set of data. We called it “reference test” - because we wanted to use it to test all the changes we were planning to do.

How would it look like? Assuming we have some fixed amount of input data and our job completes in X seconds. Someone makes a code change that may affect the performance. Or changes one of hundred parameters Hadoop offers you to tweak its behavior. We would re-run the test and in a few minutes we could see if the change was for good or for bad.

Here is how the test looked like:

  • it was using exactly the same set of JAR files as the real product - so one could simply copy the JARs from the real build and run the test
  • it was using a custom map-reduce job launcher that was as close as possible to the real one from the product
  • it was operating on a pre-generated immutable set of input data that was once copied to HDFS and remained there forever
  • the test was controlled via shell scripts so launching it was as simple as just running a script
  • the test was independent from the real map-reduce job, i.e. it could co-exist with the real product running on the same cluster (using own HDFS paths, own HBase table etc)
  • the test could be used with various amount of input data - depending on the goal of the test. So one could run the test with only a small subset of the input data just to smoke-test the job itself
  • the launcher script was written in such a way that it was resetting the environment before running the test. So, regardless of how many times you run the test, you would get almost identical performance numbers from it, no data accumulation of any kind would happen

Once we have got this test working, the team work has become much more systematic. We knew what we were doing, we knew exactly what was the gain of moving from Gzip to Snappy compression, what was the outcome of changing the number of regions for an HBase table. We have set ourselves a goal: by that day our test needs to crunch that much data in under Z seconds. The goal was clear, the tool was simple and we have done it.

Yes, Hadoop does come with famous TeraSort test. You can use it to tune some of your cluster parameters. But each job is different, especially if the job is relatively complex. Not all parameters that work well for TeraSort will serve your job in the same way. I strongly suggest for anyone who works with map-reduce tasks to come up with a strategy for testing them in isolation on the real cluster.

blog comments powered by Disqus


06 March 2013