I have been trying to scrape a website protected by Distil Networks, and Selenium (with Python) always fails on it.
I did a few searches, and my conclusion is that the site can detect Selenium through some sort of JavaScript check. I then took a look at chrome-remote-interface, which seems like the thing I want, but then I got stuck.
What I would like to do is automate the following steps (a rough sketch of the loop follows the list):
- Open a Chrome instance
- Navigate to a page
- Run some javascript
- Collect data and save to file
- Repeat steps 2 - 4
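In Node terms, here is roughly what I imagine that loop looking like (an untested sketch: it assumes the chrome-remote-interface npm package, a Chrome instance already listening on port 9222, and a placeholder urls array):

const CDP = require('chrome-remote-interface');

const urls = ['https://example.com/page1', 'https://example.com/page2']; // placeholder list

(async () => {
    // Attach to the Chrome instance started with --remote-debugging-port=9222
    const client = await CDP({host: '127.0.0.1', port: 9222});
    const {Page} = client;
    await Page.enable();

    for (const url of urls) {
        await Page.navigate({url});
        await Page.loadEventFired();  // wait for the page to finish loading
        // ...run some JavaScript here and save the result (see the end of this post)
    }

    await client.close();
})();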
I know that I can open an instance of Chrome for debugging with:
google-chrome --remote-debugging-port=9222
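If it helps, I think Chrome can also be started headless and with a throwaway profile for this, something like (flags as I understand them, not verified on every Chrome version):

google-chrome --headless --disable-gpu --user-data-dir=/tmp/chrome-scrape --remote-debugging-port=9222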
And I can open a console in Node with:
chrome-remote-interface -t 127.0.0.1 -p 9222 inspect -r
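As far as I can tell, the same endpoint can also be reached from a Node script instead of the interactive console, for example (untested sketch, just listing the debuggable targets):

const CDP = require('chrome-remote-interface');

(async () => {
    // List the debuggable targets Chrome exposes on port 9222
    const targets = await CDP.List({host: '127.0.0.1', port: 9222});
    console.log(targets.map(t => t.url));
})();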
I can also run simple scripts like
Page.navigate({url:"https://google.com"})
Runtime.evaluate({expression:"1+1"})
But I can't get at the DOM directly from Node.js the way I can in the Chrome Developer Tools console. Basically, what I want is to run scripts from Node just as I would run them in the DevTools console.
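What I am hoping for is something along these lines (again an untested sketch: returnByValue and the outerHTML expression are my guesses at how to pull the whole DOM back into Node, and page.html is just an example output file):

const fs = require('fs');
const CDP = require('chrome-remote-interface');

(async () => {
    const client = await CDP({host: '127.0.0.1', port: 9222});
    const {Page, Runtime} = client;
    await Page.enable();

    await Page.navigate({url: 'https://google.com'});
    await Page.loadEventFired();

    // Evaluate an expression in the page, like typing it in the DevTools console;
    // returnByValue: true serialises the result back to Node as a plain value
    const {result} = await Runtime.evaluate({
        expression: 'document.documentElement.outerHTML',
        returnByValue: true
    });

    fs.writeFileSync('page.html', result.value);  // save the collected data
    await client.close();
})();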
Also, there is not much documentation on chrome-remote-interface for scraping. Are there any good links for that?