Wednesday, 24 May 2017

Node.js HTTP request responds with broken URI-decoded data

I'm scraping a certain website using node's https module, like so:

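var https = require('https');
var zlib = require('zlib');

https.request({}, function (res) {
    // Decompress the gzipped response body
    var gzip = zlib.createGunzip();
    res.pipe(gzip);
    var output = gzip;

    ...
});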

Using Firefox or Chrome, the page's HTML contains this:

brouck%C3%A8re

However, in the string I get from the response object (an http.IncomingMessage), the URL-encoded part turns into an invalid character:

brouck�re

Why is it not staying URL-encoded? I'm not decoding it anywhere in my flow.

I'm concatenating the data, but setting the encoding correctly:

output.on('data', function gotData(data) {
    body += data.toString('utf-8');
});
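
One thing worth ruling out with this kind of per-chunk concatenation: if a multi-byte UTF-8 character happens to straddle two chunks, decoding each chunk separately can mangle it. The sketch below is only an illustration under that assumption. The helper names (decodePerChunk, decodeOnce, decodeStreaming) and the hand-made chunks are hypothetical and not part of the original code.

```javascript
const { StringDecoder } = require('string_decoder');

// Decode every chunk on its own, as in the snippet above.
function decodePerChunk(chunks) {
  let body = '';
  for (const chunk of chunks) {
    body += chunk.toString('utf8');
  }
  return body;
}

// Collect the raw Buffers and decode once at the end.
function decodeOnce(chunks) {
  return Buffer.concat(chunks).toString('utf8');
}

// Decode incrementally; StringDecoder holds back incomplete
// multi-byte sequences until the next chunk arrives.
function decodeStreaming(chunks) {
  const decoder = new StringDecoder('utf8');
  let body = '';
  for (const chunk of chunks) {
    body += decoder.write(chunk);
  }
  return body + decoder.end();
}

// "brouckère": the è (U+00E8) is the two bytes c3 a8 in UTF-8.
const whole = Buffer.from('brouck\u00e8re', 'utf8');
// Hypothetical chunk boundary that falls between those two bytes.
const chunks = [whole.subarray(0, 7), whole.subarray(7)];

console.log(decodePerChunk(chunks));  // the è is replaced by U+FFFD characters
console.log(decodeOnce(chunks));      // the è survives
console.log(decodeStreaming(chunks)); // the è survives
```

Calling output.setEncoding('utf8') on the stream should have a similar effect to the StringDecoder variant, since the stream then decodes across chunk boundaries itself.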

So what's going on here?
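
Another possibility I can't rule out is that the server sends the raw character in a single-byte encoding such as ISO-8859-1 rather than a percent-encoded sequence, and that decoding those bytes as UTF-8 produces the replacement character. The snippet below is a minimal sketch of that scenario only; the bytes are made up for illustration and are not taken from the actual response.

```javascript
// Hypothetical bytes: "brouckère" as it would look in ISO-8859-1 (latin1),
// where è is the single byte e8.
const raw = Buffer.from('brouck\u00e8re', 'latin1');

// Inspecting the hex dump shows which bytes really arrived.
const hex = raw.toString('hex');

// e8 followed by an ASCII byte is not a valid UTF-8 sequence,
// so decoding as UTF-8 substitutes U+FFFD for it.
const asUtf8 = raw.toString('utf8');

// Decoding with the matching encoding restores the character.
const asLatin1 = raw.toString('latin1');

console.log(hex);
console.log(asUtf8);
console.log(asLatin1);
```

If a hex dump of the real response body shows a lone e8 where the character should be, checking the charset in the Content-Type header or in the page's meta tags would be the next step.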



via skerit
