Thursday, 4 May 2017

Writing streamed JSON data to MongoDB using buffers

I'm trying to write JSON data streamed through JSONStream to MongoDB. Streaming is necessary because the data can get very large (up to tens of GBs), and I would like to use MongoDB's bulk write capability to speed the process up further. To do that, I need to buffer the data, issuing a bulk write every 1000 JSON objects or so.

My problem is that when I buffer the writes, not all of the data gets written; the last few thousand objects are left out. For example, if I try to write 100,000 JSON objects, my code writes only 97,000 of them. I have tried buffering both MongoDB bulk writes and normal writes, with similarly incorrect results.

My code:

var JSONStream = require('JSONStream');
var mongodb = require('mongodb');

// DB connect boilerplate here

var coll = database.collection('Collection');
var bulk = coll.initializeOrderedBulkOp();
var bufferSizeLimit = 1000;
var recordCount = 0;
var jsonStream = JSONStream.parse(['items', true]);
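// The source is piped into jsonStream separately, along the lines of
// fs.createReadStream('data.json').pipe(jsonStream); ('data.json' is a placeholder)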

jsonStream.on('data', (data) => {
  bulk.insert(data);
  recordCount++;

  // Write when bulk commands reach buffer size limit
  if (recordCount % bufferSizeLimit === 0) {
    bulk.execute((err, result) => {
      bulk = coll.initializeOrderedBulkOp();
    });
  }
});

jsonStream.on('end', () => {
  // Flush remaining buffered objects to DB
  if (recordCount % bufferSizeLimit !== 0) {
    bulk.execute((err, result) => {
      database.close();
    });
  }
});

If I replace the buffered write code with a simple per-document MongoDB insert, the code works properly. Is there anything I am missing here?
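For reference, the working unbuffered variant is essentially one insert per streamed object, straight from the data handler. This is only a minimal sketch, assuming the same jsonStream and coll as above, with connection cleanup omitted:

// Unbuffered variant: one insert per streamed object (works, but slower)
jsonStream.on('data', (data) => {
  coll.insertOne(data, (err, result) => {
    if (err) console.error(err);
  });
});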

via iambas
