Tuesday, 30 May 2017

JSON to CSV to gzip to S3 while streaming without using too much memory

We need to implement a cron service in Node.js that follows this flow:

  1. query a lot of data from Postgres (about 500 MB)
  2. transform the JSON data into another JSON shape
  3. convert the JSON to CSV
  4. gzip it
  5. upload it to S3 with the "upload" method

Obviously, we need to implement this procedure using streams, without generating memory overhead.
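
Roughly, this is the pipeline shape we are aiming for: a single pipe chain from a readable object stream down to the S3 upload, so that backpressure propagates through every stage. This is only a sketch; arrayToObjectStream, renameKeys and toCsvLine are hypothetical helpers sketched further below, and the bucket and key names are made up.

    const zlib = require('zlib');
    const AWS = require('aws-sdk');

    const s3 = new AWS.S3();

    // rows: the array the Sequelize query (step 1) already resolved to
    function exportRows(rows, callback) {
      const body = arrayToObjectStream(rows) // wrap the query result in an object stream
        .pipe(renameKeys())                  // step 2: reshape each row
        .pipe(toCsvLine())                   // step 3: objects -> CSV lines
        .pipe(zlib.createGzip());            // step 4: gzip

      // step 5: s3.upload() accepts a readable stream as Body and performs a
      // managed (multipart) upload instead of buffering the whole file in RAM
      s3.upload(
        { Bucket: 'my-bucket', Key: 'export.csv.gz', Body: body },
        callback
      );
    }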

We ran into a lot of problems:

  1. We are using Sequelize, a SQL ORM, and with it we can't stream query results. So we are converting the JSON returned by the query into a readable stream (see the first sketch after this list).
  2. We can't find an elegant and clean way to implement a transform stream that reshapes the JSON returned by the query, for example input -> [{a:1,b:2}, ...] --> output -> [{a1:1,b1:2}, ...] (see the second sketch after this list).
  3. While logging and trying to write to the filesystem instead of S3 (using fs.createWriteStream), it seems that the file is created as soon as the pipeline starts, but its size stays at about 10 bytes and only reaches its real size when the streaming process has finished. Furthermore, a lot of RAM is used, so the streaming approach seems to bring no benefit in terms of memory usage.
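
For problem 1, this is roughly how we wrap the array Sequelize gives back into a readable object stream. To be clear, it only changes the interface: the whole result set has already been materialized in memory by the time Sequelize resolves, which is probably part of our RAM problem.

    const { Readable } = require('stream');

    // Wrap an in-memory array of rows in a readable object stream.
    // This respects backpressure (one row per read() call) but does not
    // avoid loading the full ~500 MB result set up front.
    function arrayToObjectStream(rows) {
      let index = 0;
      return new Readable({
        objectMode: true,
        read() {
          // push the next row, or null to signal end-of-stream
          this.push(index < rows.length ? rows[index++] : null);
        }
      });
    }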

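For problem 2, the simplest thing we have come up with is a plain Transform in object mode; the key renaming below just mirrors the a -> a1 example above and would be replaced by our real mapping.

    const { Transform } = require('stream');

    // Object-mode transform: reshape each row, e.g. { a: 1, b: 2 } -> { a1: 1, b1: 2 }
    function renameKeys() {
      return new Transform({
        objectMode: true,
        transform(row, _encoding, callback) {
          const out = {};
          Object.keys(row).forEach(function (key) {
            out[key + '1'] = row[key];
          });
          callback(null, out); // emit the reshaped row downstream
        }
      });
    }
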
How would you write this flow in Node.js? I've used the following libraries during my experiments:

  • json2csv-stream
  • JSONStream
  • oboe
  • zlib
  • fs
  • aws-sdk
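
For completeness, this is the kind of naive stand-in we sketched for the CSV step (step 3) while struggling to wire json2csv-stream into the chain. It assumes every row has the same keys and does no quoting or escaping, so it is only a placeholder.

    const { Transform } = require('stream');

    // Naive objects -> CSV transform: a header row based on the first
    // object's keys, then one comma-separated line per row.
    function toCsvLine() {
      let headerWritten = false;
      return new Transform({
        writableObjectMode: true,  // takes objects in...
        readableObjectMode: false, // ...and pushes plain strings out
        transform(row, _encoding, callback) {
          const keys = Object.keys(row);
          let chunk = '';
          if (!headerWritten) {
            chunk += keys.join(',') + '\n';
            headerWritten = true;
          }
          chunk += keys.map(function (key) { return String(row[key]); }).join(',') + '\n';
          callback(null, chunk);
        }
      });
    }

In the shape sketched at the top, this sits between renameKeys() and zlib.createGzip().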


via Radar155
