Nov 17, 2014

My (rather popular) first post on this topic explained the benefits of compression (which comes as the default option with the new WiredTiger storage engine) for systems with lesser IO capabilities.  The intent was to first show that the new storage engine saved space on disk and then to show that this could be translated into a gain in terms of performance when reading that data (slowly) off disk.

The first part of that story worked out pretty well, the data was nicely compressed on disk and it was easy to show it in the graph.  The second part of that story did not work out as expected, the graph was a little off from expectations and my initial speculation that it was a non-optimal access pattern didn’t pan out.  In fact, I determined that the slowness I was seeing was independent of IO and was due to how slow the in-memory piece was when using WiredTiger to do a table scan.  Needless to say, I started to talk to engineering about the issue and tried tweaking various options – each one essentially reinforced the original finding.

It was soon obvious that we had a bug that needed to be addressed (one that was still present in the first release candidate 2.8.0-rc0). I gathered the relevant data and opened SERVER-16150 to investigate the problem. Thanks to the ever excellent folks in MongoDB engineering (this one in particular), we soon had the first patched build attempting to address the issue (more, with graphs after the jump).  Before that, for anyone looking to reproduce this testing, I would recommend waiting until SERVER-16150 has been closed and integrated into the next release candidate (2.8.0-rc1), you won’t see the same results from 2.8.0-rc0 (it will instead look like the first set of results).

Continue reading »

Nov 12, 2014

CAVEAT: This post deals with a development version of MongoDB and represents very early testing. The version used was not even a release candidate – 2.7.9-pre to be specific.  Therefore any and all details may change significantly before the release of 2.8, so please be aware that nothing is yet finalized and, as always, read the release notes once 2.8.0 ships.

Update (Nov 17th, 2014): Good news! I have re-tested with a patched version of 2.8.0-rc0 and the results are very encouraging compared to figure 2 below.  For full details (including an updated graph), see MongoDB 2.8: Improving WiredTiger Performance

Anyone that follows the keynotes from recent MongoDB events will know that we have demonstrated the concurrency performance improvements coming in version 2.8 several times now.  This is certainly the headline performance improvement for MongoDB 2.8, with concurrency constraints in prior versions leading to complex database/collection layouts, complex deployments and more to work around the per-database locking limitations.

However, the introduction of the new WiredTiger storage engine that was announced at MongoDB London also adds another capability with a performance component that has long been requested: compression.

Eliot also gave a talk about the new storage engines at MongoDB London last week after announcing the availability of WiredTiger in the keynote.  Prior to that we were talking about what would be a good way to structure that talk and I suggested showing the effects and benefits of compression. Unfortunately there wasn’t enough time to put something meaningful together on the day, but the idea stuck with me and I have put that information together for this blog post instead.

It’s not a short post, and it has graphs, so I’ve put the rest after the jump.

Continue reading »

Jun 20, 2014

Setting readahead (RA from now on) appropriately is a contentious subject.  There are a lot of variables involved, but in my particular case I am setting out to minimize those variables, get a baseline, and have a reasonable idea of what to expect out of this configuration:

  • Environment: Amazon EC2
  • Instance Size: m3.xlarge (4 vCPU, 15GiB RAM)
  • Disk Config: Single EBS Volume, 1000 PIOPS

The testing I am going to be doing is pretty detailed, and intended for use in a future whitepaper, so I wanted to get some prep done and figure out exactly what I was dealing with here before I moved forward.  The initial testing (which is somewhat unusual for MongoDB) involves a lot of sequential IO.  Normally, I am tweaking an instance for random IO and optimizing for memory utilization efficiency – a very different beast which generally means low RA settings.  For this testing, I figured I would start with my usual config (and the one I was using on a beefy local server) and do some tweaking to see what the impact was.
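For anyone following along, readahead on Linux can be inspected and changed with blockdev.  A minimal sketch (the device name /dev/xvdf is an assumption – substitute whatever your EBS volume is attached as):

```shell
# Inspect the current readahead value (reported in 512-byte sectors)
sudo blockdev --getra /dev/xvdf

# Set readahead to 32 sectors (16K) for the next test run
sudo blockdev --setra 32 /dev/xvdf

# Note: settings made this way do not survive a reboot; persist them
# via a udev rule or an init script if you need them permanently
```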

I was surprised to find a huge cliff in operations per second hitting the volume when I dropped RA to 16.  I expected the larger readahead settings to help up to a certain point because of the sequential IO (probably up to the point where I saturated the bandwidth to the EBS volume or similar).  But I did not expect the “cliff” between an RA setting of 32 and an RA setting of 16.

To elaborate: one of the things I was keeping an eye on was the page faulting rate within MongoDB.  MongoDB only reports “hard” page faults, where the data is actually fetched off the disk.  Since I was wiping out the system caches between runs, all of the data I was reading had to come from the disk, so the fault rate should be pretty predictable, and IO was going to be my limiting factor.

With RA set to 32, my tests took longer than with 64; 64 took longer than 128, and so on until the results for 256 and 512 were close enough to make no difference and RA was no longer really a factor.  At 32, the faulting rate was relatively normal – somewhere around 20 faults/sec at peak and well within the capacity of the PIOPS volume to satisfy; this was a little higher than the 64 RA fault rate, which ran at ~15 faults/sec.  I was basically just keeping an eye on it, and it did not seem to be playing too big a part.

With an RA of 16, though, things slowed down dramatically.  The fault rate spiked to over 1000 faults/sec and stayed there – a ~50x increase over the RA 32 setting, and basically pegging the max PIOPS I have on that volume.  Needless to say, the test takes a **lot** longer to run with the IO pegged.  To show this graphically, here are the run completion times with the differing RA settings (click for larger view):

mongodump test runs, using various readahead settings

TL;DR I will be using RA settings of 128 for this testing, and will be very careful before dropping RA below 32 on EBS volumes in EC2 in future.

Update: A bit of digging revealed that the max/default size of an IO request on provisioned IOPS volumes is 16K.  Since readahead is measured in 512-byte sectors, an RA of 32 is exactly 16K and matches that request size well, whereas an RA of 16 is only 8K – half an IO request, and essentially a bad mismatch.  Not sure it justifies the 1000+ IOPS that suddenly appear, but at least it’s a partial explanation.
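The sector arithmetic is easy to forget, so it is worth spelling out:

```shell
# readahead values are expressed in 512-byte sectors, so:
echo $((32 * 512))   # 16384 bytes = 16K, exactly one provisioned IOPS request
echo $((16 * 512))   # 8192 bytes = 8K, half a request - the mismatch
```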

Apr 15, 2014

This tweet encourages people to read the timeline related to the Heartbleed discovery and dissemination and draw their own interesting conclusions – challenge accepted!

There is plenty of fodder in there for the conspiracy theorists, but taking a step back for a second I would draw a conclusion not based on who knew what, but rather how to be one of those entities that knew early.  Why take that approach?

Well, the companies that learned of this bug early (the timeline lists the ones that admit they did, there were likely others) had a significant advantage here.  They were able to take the necessary steps to protect their systems while the bug was largely unknown, they could evaluate the situation calmly and without their customers/shareholders/interested parties screaming for updates, exposure assessments, time lines for fixes and the like.

As an ex-operations guy, an ex-support guy and someone that’s had to deal with this stuff before, I would definitely like to be working for one of the companies with an early heads up here rather than the ones in the dark until the public announcement.

Hence, the question I would ask is this: How do I make sure that I am on the early notification list for such issues?

Now, that may seem way too broad a question, but let’s break it down another way:

  • What technologies are truly critical to my business?
  • How do I ensure that I am up to date immediately regarding significant developments?
  • How do I ensure I can prioritize features, get key issues for my business addressed?

Sometimes, with closed source technology, the answer is simple – you make sure you are an important customer for the owner of the technology, whether it is Microsoft, Oracle, Cisco or anyone else.  This might be a matter of paying them enough money, or it could be that you make sure you are a visible and valuable partner, provide a nice reference use case etc. – basically whatever it takes to make sure that you are at the top of their list for any vital information flow or feature decision.

What about open source software though?  How do you make sure you are similarly informed for the Apache HTTP web server, the HAProxy load balancer, OpenSSL, the Linux Kernel, or whatever OSS technology your business relies on?

Take a look at that timeline again, consider who learned about the issue early.  I haven’t crunched the numbers, but I’d bet good money that the companies that learned early (and were not part of discovery) either have a significant number of employees contributing to OpenSSL, or they have employees that know the main contributors well (and let’s face it, most of them will be contributing to other OSS projects – geeks and nerds gossip just like everyone else).

Therefore, the conclusion I draw from the Heartbleed timeline is this:

If I am using open source software that is critical to my business, I should be employing people that actively contribute to that software, that are known to the core developers, if not core developers themselves.  They will work for me, but devote a significant portion (perhaps all) of their time to the specific project(s) identified as critical.

There are many benefits to this – besides getting early notification of issues, you would have an expert on hand to answer those screaming for updates, to evaluate your exposure and perhaps even fix the issue internally before the public fix is available.  You also get a respected voice in terms of setting the direction of the project, have a way to prioritize key features and more.  Finally, you get the good will of the community, help make the product better for everyone, and become a possible destination for other smart contributors to work.

The key here is about actually committing resources.  It’s often amazing (to me) how quickly the commitment of actual resources will focus an otherwise overly broad discussion.  If you start by asking people to list all of the OSS technology that is critical to the business, you will likely end up with a massive list.  Now tell them that they are going to have to commit headcount, budget to support every piece of technology on the list (and justify it) – it will likely shrink rapidly.

Mar 28, 2014

Note: I have also written this up in Q&A format over on StackOverflow for visibility.

When I am testing MongoDB, I often need to insert a bunch of data quickly into a collection so I can manipulate it, check performance, try out different indexes etc.  There’s nothing particularly complex about this data usually, so a simple for loop generally suffices.  Here is a basic example that inserts 100,000 docs:

for(var i = 0; i < 100000; i++){db.timecheck.insert({"_id" : i, "date" : new Date(), "otherID" : new ObjectId()})};

Generally, I would just copy and paste that into the mongo shell, and then go about using the data.  With 2.4 and below, this is pretty fast.  To test, I’ve simplified even more and kept it to a single field (_id) and added some very basic timing.  Here’s the result with the 2.4 shell:

> db.timecheck.drop();
> start = new Date(); for(var i = 0; i < 100000; i++){db.timecheck.insert({"_id" : i})}; end = new Date(); print(end - start);

A little over 2 seconds to insert 100,000 documents, not bad.  Now, let’s try the same thing with the 2.6.0-rc2 shell:

> db.timecheck.drop();
> start = new Date(); for(var i = 0; i < 100000; i++){db.timecheck.insert({"_id" : i})}; end = new Date(); print(end - start);

Oh dear – over 37 seconds to insert the same number of documents, that’s more than 15x slower!  You might be tempted to despair and think 2.6 performance is terrible, but in fact this is just a behavioral change in the shell (I will explain that shortly).  Just to make it clear that it’s not something weird caused by running things in a single line in the shell, let’s pass the same code in as a JavaScript snippet.  This time we’ll just use the time command to measure:

2.4 shell:

$ time mongo ~/mongo/insert100k.js --port 31100
MongoDB shell version: 2.4.6
connecting to:

real    0m2.253s
user    0m0.942s
sys    0m0.432s

2.6 shell:

$ time ./mongo ~/mongo/insert100k.js --port 31100
MongoDB shell version: 2.6.0-rc2
connecting to:

real    0m34.691s
user    0m22.203s
sys    0m2.272s

So, no real change, things are pretty slow with a 2.6 shell.  It should be noted that I ran both against a 2.6 mongod, only the shells I am using are different.  So, of course, you can work around it by using the 2.4 shell to connect to 2.6 but that is not exactly future proof.

(UPDATE: if anyone saw my original post, I had screwed up and run a 2.4 shell thanks to a PATH mix up, there is no difference between passing in the file and an interactive loop).

To explain: prior to 2.6, the interactive shell would run through the loop and only check the success (using getLastError) of the last operation in the loop (more specifically, it called getLastError after each carriage return, with the last operation being the last insert in the loop).  With 2.6, the shell now checks the status of each individual operation within the loop.  Essentially, that means the “slowness” with 2.6 can be attributed to acknowledged versus unacknowledged write performance rather than an actual issue.

Acknowledged writes have been the default for some time now, and so I think the 2.6 behavior is more correct, though a little inconvenient for those of us used to the original behavior.  Using the 2.4 shell is a workaround, but ideally we want to use the latest shell with the latest server.  So the question remains: how do I do a simple bulk insert from the 2.6 shell quickly if I truly don’t care about failures?

The answer is to use the new unordered bulk insert API:

> db.timecheck.drop();
> var bulk = db.timecheck.initializeUnorderedBulkOp(); start = new Date(); for(var i = 0; i < 100000; i++){bulk.insert({"_id" : i})}; bulk.execute({w:1}); end = new Date(); print(end - start);

Success!  And essentially the same performance at just over 2 seconds. Sure, it’s a little more bulky (pardon the pun), but you know exactly what you are getting, which I think is a good thing in general. There is also an upside here, when you are not looking for timing information. Let’s get rid of that and run the insert again:

> db.timecheck.drop();
> var bulk = db.timecheck.initializeUnorderedBulkOp(); for(var i = 0; i < 100000; i++){bulk.insert({"_id" : i})}; bulk.execute({w:1});
BulkWriteResult({
	"writeErrors" : [ ],
	"writeConcernErrors" : [ ],
	"nInserted" : 100000,
	"nUpserted" : 0,
	"nMatched" : 0,
	"nModified" : 0,
	"nRemoved" : 0,
	"upserted" : [ ]
})
Now we get a nice result document when we do the bulk insert. Because it is an unordered bulk operation, it will continue should it encounter an error and report on each one in this document. There are none to be seen here, but it’s easy to create a failure scenario, let’s just pre-insert a value we know will come up and hence cause a duplicate key error on the (default) unique _id index:

> db.timecheck.drop();
> db.timecheck.insert({_id : 500})
WriteResult({ "nInserted" : 1 })
> var bulk = db.timecheck.initializeUnorderedBulkOp(); for(var i = 0; i < 100000; i++){bulk.insert({"_id" : i})}; bulk.execute({w:1});
2014-03-28T16:19:40.923+0000 BulkWriteError({
	"writeErrors" : [
		{
			"index" : 500,
			"code" : 11000,
			"errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.timecheck.$_id_ dup key: { : 500.0 }",
			"op" : {
				"_id" : 500
			}
		}
	],
	"writeConcernErrors" : [ ],
	"nInserted" : 99999,
	"nUpserted" : 0,
	"nMatched" : 0,
	"nModified" : 0,
	"nRemoved" : 0,
	"upserted" : [ ]
})

Now we can see how many were successful, which one failed (and why). It may be a little more complicated to set up, but overall I think we can call this a win.

Mar 11, 2014

This will be my first multi-part blog post, and I am actually not sure just how many parts it will have by the time I am finished.  My original intent was to test some failure scenarios whereby I would emulate a WAN link disappearing.  That quickly expanded into a more ambitious test, I still wanted to test failure scenarios but I also wanted to test with some real world (ish) values for latency, packet loss, jitter etc.

In particular, I wanted to see how a sharded MongoDB cluster would behave with this type of variable performance and what I could do to make things better (I have some interesting ideas there), as well as test some improvements in the 2.6 version.  I’ve created configurations like this in labs with hardware (expensive), XenServer (paid) but I wanted something others could reproduce, reuse and preferably easily and at no cost.  Hence I decided to see if I could make this work with VirtualBox (I also plan to come up with something similar for Docker/Containers having read this excellent summary, but that is for later).

My immediate thought was to use Traffic Control but I had a vague recollection of having used a nice utility in the past that gave me a nice (basic) web interface for configuring various options, and was fairly easy to set up.  A bit of quick Googling got me to WANem and this was indeed what I had used in the past.  I recalled the major drawback at the time was that after booting it, we needed to reconfigure each time because it was a live CD.  Hence the first task was to fix that and get it to the point that it was a permanent VM (note: there is a pre-built VMWare appliance available for those on that platform).

That was reasonably straight forward, and I wrote up the process over at SuperUser as a Q&A:

Once that was done, it was time to configure the interfaces, make routing work and test that the latency and packet loss settings actually worked (continues after the jump).
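For anyone who would rather skip WANem and drive Traffic Control directly, the same latency, jitter and packet loss shaping can be sketched with tc and the netem qdisc.  This is a rough illustration only – the interface name eth1 and the specific values are assumptions to adjust for your own setup:

```shell
# Add 80ms of latency (with 10ms of jitter) and 0.5% packet loss on eth1
sudo tc qdisc add dev eth1 root netem delay 80ms 10ms loss 0.5%

# Verify what is currently applied to the interface
tc qdisc show dev eth1

# Remove the shaping again when the test is done
sudo tc qdisc del dev eth1 root
```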

Continue reading »

Feb 10, 2014

I’ve meant to write this post up for a while.  I have spent a lot of my time in the last 2 years recruiting, interviewing, making offers etc. as I built a team up from nothing to 9 people.  I also interviewed for other positions in the company, helping others build out their teams.  It was very tough, at times it seemed there were no decent candidates out there, but with persistence and perseverance we would eventually get a good candidate over the line and my faith would be renewed.

In October 2013, I was contacted about doing this again by another company.  Now, I am happy in MongoDB, but I always listen to interesting opportunities – that’s how I ended up in MongoDB in the first place after all, by listening to an interesting proposal.  This company wanted me to join them as they built out their office in Dublin, and they essentially wanted me to do the same thing I had done for MongoDB: build a support team from scratch.  Despite thinking that it sounded a little Sisyphean to go back to square one, I was willing to talk to them, find out about the opportunity, the company and what the role would actually entail.

That part actually went quite well.  It is an exciting company, another start up, and one that I had heard of in only positive ways.  I did some due diligence, checked out the product, looked into the company size, funding, recent news and the like, in preparation for talking to them.  It all checked out, and my curiosity was piqued.

Then came the interview itself: an introductory interview with the person who was setting up the Dublin office.  We talked a little about their ambitions for Dublin, what I’d done, what the role would be about, how the company was doing and where they wanted to go in the future.  It all went really well until we got to the discussion about compensation.

Salary was not an issue, they had no problem beating my current level (encouraging).  Their benefits had not been nailed down yet, but I know how that goes and was not worried.  Then we got to stock options.  I described my current stock option vesting schedule and the current valuation MongoDB had, and then the conversation fell apart.  There was “no way” I would get “anything even close” to the options levels in their company that I was currently getting in MongoDB.  If I was expecting something like that, then it was not going to work.

I found that reaction very odd – why be so strident about the stock being a blocker so quickly?  If I had a great candidate for a role, I would not let something like this become a sticking point so early in the process.  I would flag it as a potential problem, tell the candidate that I would see what I could do, and then run it up the chain to see if there was any way we could come up with a competitive package.  Going so negative so early left me with a few impressions:

  1. The company (or perhaps just the recruiter) did not value the role, or the department they were setting up in Dublin sufficiently (a very negative impression to leave)
  2. The person in question was being far too conservative in a competitive market – hedge, definitely, but get an idea of who the candidate is and what they can do first
  3. While I have a decent options plan, it is by no means extravagant and I would expect there to be leeway for leadership recruitment.  If there is not, then there is something wrong.
  4. If they can’t even get close, then it is by far a lesser opportunity for me, I can’t see any other way to look at it – the company does not have a significantly better profile or upside than my current company

Now, it is entirely possible that I would not have fit the role for other reasons, but now we will never know.  When I followed up to say that I could not see the point of leaving a better opportunity for a lesser one (as politely as I could) I never received an acknowledgment of any kind.  Overall, the process has left a sour taste and tainted my opinion of the company in question, which is why I have been careful to leave their name out of this post.  It also reinforced my thinking on the subject, and taught me an excellent lesson in terms of how to manage my own recruiting efforts, so at least I learned something :)

Oct 30, 2013

As mentioned previously, I like getting my little gamification rewards and I have been meaning to add new content here for quite some time.  In order to kill two birds with one stone, I took a couple of my ideas and turned them into the Q&A format that is encouraged on StackOverflow and the DBA StackExchange site.

Hence we now have these two new questions (and answers):

The first one is a bit specific, but is easily adapted for other purposes.  The second one is something I threw together for a support issue and have re-used multiple times and generally found rather useful.  It also made me finally put versions on Github rather than on my laptop, which has to be a good thing :)

Dec 11, 2012

I recently presented at MongoDB Melbourne and MongoDB Sydney and the slides have now been made available on the 10gen website:

These were not recorded, but at least the slides are now up for reference, which several people had asked about.  The Amazon whitepaper should be updated shortly too, but still contains good information for reference purposes, even if it is lacking some of the newer features/options.  Interestingly, my Sydney presentation fell on the same day as the announcement of the availability of the Sydney EC2 region – nice timing :)

May 28, 2012

The Rollback state in MongoDB can cause a lot of confusion when it occurs.  While there are documents and posts that describe how to deal with the state, there is nothing that explains step by step how you might end up in it.  I often find that figuring out how to simulate an event makes it a lot clearer what is going on, while at the same time allowing people to figure out how they should deal with the situation when it arises.  Here is the basic outline:

  1. Create a replica set, load up some data, verify it is working as expected, start to insert data
  2. Break replication (artificially) on the set so that the secondary falls behind
  3. Stop the primary while the secondary is still lagging behind
  4. This will cause the lagging secondary to be promoted

I’ll go through each step, including diagrams to explain what is going on, after the jump.
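The outline above can be sketched as a handful of commands.  This is a rough illustration only, with several assumptions baked in: a three-member set with the primary on port 31000 plus an arbiter (so the lagging secondary can still win an election), and <secondary-ip> as a placeholder you would fill in yourself:

```shell
# Step 2: artificially break replication by blocking the secondary's
# connection to the primary (primary assumed to be on port 31000)
sudo iptables -A INPUT -p tcp --dport 31000 -s <secondary-ip> -j DROP

# ...keep inserting on the primary so the secondary falls further behind...

# Step 3: stop the primary while the secondary is still lagging
mongo --port 31000 admin --eval "db.shutdownServer()"

# Step 4: the arbiter lets the lagging secondary be elected primary;
# restore connectivity and restart the old primary, and it will have to
# roll back the writes that were never replicated
sudo iptables -D INPUT -p tcp --dport 31000 -s <secondary-ip> -j DROP
```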