Thursday 4 May 2017

Node.js scraping with chrome-remote-interface

I have been trying to scrape a website protected by Distil Networks, where using Selenium (with Python) always fails.

I did a few searches, and my conclusion is that the site can detect that you are using Selenium through some sort of JavaScript. I then took a look at chrome-remote-interface, which looks like the thing I want, but then I got stuck.

What I would like to do is automate the following steps:

  1. Open a Chrome instance
  2. Navigate to a page
  3. Run some JavaScript
  4. Collect data and save to file
  5. Repeat steps 2 - 4
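The steps above could be sketched roughly as follows. This is only a minimal sketch, not a tested scraper: `scrapePages`, `urls`, `expression` and `outFile` are hypothetical names, and `client` is assumed to be the object you get from `require('chrome-remote-interface')()` once Chrome is already running with remote debugging enabled (step 1). It says nothing about whether the site's bot detection will still trigger.

```javascript
const fs = require('fs');

// Assumption: `client` comes from chrome-remote-interface, e.g.
//   const CDP = require('chrome-remote-interface');
//   const client = await CDP({ host: '127.0.0.1', port: 9222 });
async function scrapePages(client, urls, expression, outFile) {
    const { Page, Runtime } = client;
    await Page.enable(); // required before Page load events are delivered

    const rows = [];
    for (const url of urls) {
        // Step 2: subscribe to the load event first, then navigate.
        const loaded = Page.loadEventFired();
        await Page.navigate({ url });
        await loaded;

        // Step 3: run JavaScript in the page. returnByValue asks Chrome to
        // send back a JSON copy of the result instead of an object handle.
        const { result, exceptionDetails } = await Runtime.evaluate({
            expression,
            returnByValue: true,
        });
        if (exceptionDetails) {
            throw new Error(`Evaluation failed on ${url}: ${exceptionDetails.text}`);
        }

        // Step 4: append one JSON line per page to the output file.
        const row = { url, data: result.value };
        fs.appendFileSync(outFile, JSON.stringify(row) + '\n');
        rows.push(row);
        // Step 5: the loop repeats steps 2 - 4 for the next URL.
    }
    return rows;
}
```

Called as `scrapePages(client, ['https://google.com'], 'document.title', 'out.jsonl')`, this would write one line per visited page; remember to call `client.close()` when done.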

I know that I can open an instance of Chrome for debugging with:

google-chrome --remote-debugging-port=9222

And I can open a console in Node with:

chrome-remote-interface -t 127.0.0.1 -p 9222 inspect -r

I can also run simple scripts like:

Page.navigate({url:"https://google.com"})
Runtime.evaluate({expression:"1+1"})

But I can't access the DOM directly from Node.js the way I can in the Chrome Developer Tools console. Basically, what I want is to run scripts from Node the same way I would in the Chrome Developer Tools console.

Also, there is not enough documentation on using chrome-remote-interface for scraping. Are there any good links for that?
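One possible way to get close to the DevTools console experience is to write the page-side code as an ordinary function, turn it into an expression string, and send it through `Runtime.evaluate` with `returnByValue: true`, so that plain data comes back to Node instead of a handle to a DOM object. The sketch below assumes that approach; `toExpression`, `evaluateInPage` and `collectLinks` are hypothetical helper names, not part of chrome-remote-interface, and `Runtime` is assumed to be the Runtime domain of a connected client.

```javascript
// Build an expression that calls `fn` inside the page with the given
// arguments. Assumes `fn` is a plain or arrow function and that every
// argument is JSON-serializable.
function toExpression(fn, ...args) {
    const argList = args.map((a) => JSON.stringify(a)).join(', ');
    return `(${fn.toString()})(${argList})`;
}

// Assumption: `Runtime` is `client.Runtime` from chrome-remote-interface.
async function evaluateInPage(Runtime, fn, ...args) {
    const { result, exceptionDetails } = await Runtime.evaluate({
        expression: toExpression(fn, ...args),
        returnByValue: true, // copy the value to Node, not a remote object handle
        awaitPromise: true,  // if fn returns a promise, wait for it to settle
    });
    if (exceptionDetails) {
        throw new Error(exceptionDetails.text);
    }
    return result.value;
}

// Hypothetical page function. It runs in the browser, so `document` refers
// to the page's DOM there; it is never executed by Node itself.
function collectLinks(selector) {
    return Array.from(document.querySelectorAll(selector), (a) => ({
        text: a.textContent.trim(),
        href: a.href,
    }));
}
```

With a connected client, `await evaluateInPage(client.Runtime, collectLinks, 'a')` would be expected to return an array of plain `{text, href}` objects. DOM nodes themselves cannot be returned by value, which is why the page function maps them to plain data first.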



via Gabriel
