Wednesday, 15 March 2017

MongoDB with node.js apparent deadlock on Linux but not Windows

Problem

I'm having trouble with deadlock-like behavior running integration tests of our node.js server application against a MongoDB instance. Both our test code and server code use Mongoose and the node.js driver to access MongoDB.

The test framework drops the database and re-creates the indexes for every collection at the start of each test suite (or sometimes a single test). This occurs many times during the course of a test run.

The deadlock-like behavior appears to happen during a series of calls to db.collection.ensureIndex. Once this happens, all execution stops until the test timeout expires, which aborts the test. If I run the tests without a timeout, the hang appears to be forever.

At this point I feel like I'm flailing and would love to hear suggestions on a debugging approach or a direction to look in for clues. Thanks much in advance.

Details

According to the MongoDB Concurrency FAQ, ensureIndex requires an exclusive database write lock ('W') when called without the background:true option. However, I've verified through Mongoose debug logging that background:true is used during these calls, e.g:

mails.ensureIndex({ expires: 1, mailType: 1, readOn: 1 }) { background: true }

However, the MongoDB log appears (to me at least) to show that an exclusive lock is attempted:

command: createIndexes { createIndexes: "mails", indexes: [ { name: "expires_1_mailType_1_readOn_1", key: { expires: 1, mailType: 1, readOn: 1 }, background: true } ], writeConcern: { w: 1 } } numYields:0 reslen:113 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1, W: 2 }, acquireWaitCount: { w: 1, W: 2 }, timeAcquiringMicros: { w: 1027663, W: 193312 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_query 1289ms

Meanwhile, when the system is in this state, I cannot connect to the database using any client, including the mongo CLI, mongotop, mongostat, or the node.js driver. Therefore, I can't run db.currentOp() or db.serverStatus() to gather info about the state of the DB. Once I kill the test, the clients can connect and the DB is back in a normal state. I don't know whether this is caused by a global exclusive lock or not, but it sure seems global.

This problem appears 100% of the time when the node.js client code executes in a docker container (running Linux). However, it never occurs when running the same client code on Windows against the exact same MongoDB instance.

This makes me wonder whether the problem might be with the Linux version of the node.js driver for MongoDB. I've researched this to some extent but haven't found anything conclusive yet. I'm in the process of setting up another Linux environment outside of a container to see if I can repro the behavior there.

Code

Drop DB

mongoose.connection.db.dropDatabase(callback);

Rebuild Indexes

// (inside a larger async.waterfall block)    
// Loop through all schemas and recreate all indexes.
_.each(mongoose.connection.base.modelSchemas, function (schema, key) {
    asyncFunctions.push(function (cb) {
        mongoose.model(key, schema).ensureIndexes(err => {
            return cb(err);
        });
    });
});

async.parallel(asyncFunctions, function (err) {
    return callback(err);
});

Context

  • mongoose@4.4.20 (mongodb@2.1.18)
  • connect-mongodb-session@1.3.0 (mongodb@2.2.24) <-- different mongodb version
  • node --version => v6.9.1
  • mongod --version => v3.2.12 (from https://hub.docker.com/_/mongo/)
  • Host OS: Windows 10 Pro Version 1607 (OS Build 14393.693)
  • Guest OS: Linux 4.9.12-moby
  • Docker v. 17.03.0-ce

Other

I'm not claiming this is a true DB deadlock. I can't tell for sure yet. It just acts like a deadlock.

I've also tried using * mongoose@4.9.0 (mongodb@2.2.24) * mongod --version => v3.4.2 This added driver-level timeouts, but when I set them high, the deadlock-like behavior remained.

I tried replacing the call to drop the DB with calls to collection.remove() for every collection, but that took so long it caused timeouts in that phase of the operation.

I also tried replacing the DB drop with collection.drop() for every collection, but that moved the deadlock to this phase of the operation (instead of ensureIndex).

Finally, I tried replacing the async.parallel() call to ensureIndexes (above) with async.series(), but that just delayed the deadlock.



via Matthew Walker

No comments:

Post a Comment