Friday 21 April 2017

Delaying requests using the request and cheerio modules

This is the code I use to crawl my pages (I'm using the request and cheerio modules):

for (let j = 1; j < nbRequest; j++) {
  const currentPromise = new Promise((resolve, reject) => {
    request(
      `https://www.url${j}`,
      (error, response, body) => {
        if (error || !response) {
          console.log("Error: " + error);
          return reject(error); // bail out so we don't read response.statusCode below
        }

        console.log("Status code: " + response.statusCode + ", Connected to the page");

        var $ = cheerio.load(body);
        let output = {
          ranks: [],
          names: [],
          numbers: [],
        };

        $('td.rangCell').each(function (index) {
          if ($(this).text().trim() != "Rang") {
            output.ranks.push($(this).text().trim().slice(0, -1));
            nbRanks++;
          }
        });

        $('td.nameCell:has(label)').each(function (index) {
          output.names.push($(this).find('label.nameValue > a').text().trim());
        });

        $('td.numberCell').each(function (index) {
          if ($(this).text().trim() != "Nombre") {
            output.numbers.push($(this).text().trim());
          }
        });

        console.log("HERE 1");
        return resolve(output);
      }
    );
  });
  promises.push(currentPromise);
}

After that, I parse and save the results in a CSV file using a Node module. So far I've been able to crawl about 100 pages, but with much bigger numbers (1000+) I start getting 500 responses, which I take to mean the server is blocking me. I think the best solution is to delay the requests, but I haven't found how to do it. Do you have any idea what the code would look like?
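One way to sketch this (an assumption on my part, not the original author's code): instead of creating all the promises up front, run the requests sequentially and pause between each one. Here `fetchPage` is a hypothetical stand-in for the promise-wrapped `request` call from the post, and the pause duration is an arbitrary placeholder.

```javascript
// A small promise-based sleep helper.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stand-in for the real HTTP call (e.g. request() wrapped in a Promise,
// resolving with the response body). Replace with the actual fetch logic.
async function fetchPage(url) {
  return `body of ${url}`;
}

// Crawl the URLs one at a time, waiting pauseMs between requests,
// so the server never sees a burst of simultaneous connections.
async function crawlAll(urls, pauseMs) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url)); // one request at a time
    await delay(pauseMs);               // pause before the next one
  }
  return results;
}
```

With this shape, the cheerio parsing from the loop above would go inside `fetchPage` (or on each resolved body), and the per-page `output` objects end up in `results` in order.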



via ezdin gharbi
