Monday, 22 May 2017

Node.js: extract all text from an HTML body

I am using node-crawler and would like to know how to properly extract all the text from an HTML document so that I get clean results. I want to extract all words and keywords from the document.

$("body").text();

The code above also returns all the JavaScript code inside the body, which is not what I want. I would also like to get the words without tabs or extra whitespace, ideally stored in an array.

Any suggestions for libraries that can do this? Is it possible with jQuery selectors, or should I write my own functions to parse the HTML the way I want?
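
For illustration, here is a minimal sketch of one way this might be done with jQuery-style selectors alone. It assumes `$` is the cheerio-style function that node-crawler exposes in its callback (typically as `res.$`); the helper names `toWords` and `extractBodyWords` are hypothetical and not part of any library. The idea is to remove the elements that hold code or styling before calling `.text()`, then collapse the whitespace and split the result into an array.

```javascript
// Collapse every run of whitespace (tabs, newlines, spaces) into one space
// and split the result into an array of words.
function toWords(text) {
  const normalized = text.replace(/\s+/g, " ").trim();
  return normalized === "" ? [] : normalized.split(" ");
}

// Assumption: `$` is a cheerio/jQuery-style selector function,
// for example `res.$` inside a node-crawler callback.
function extractBodyWords($) {
  // Drop elements whose contents are code or styling, not readable text.
  // Note that this modifies the parsed document held by `$`.
  $("script, style, noscript").remove();
  return toWords($("body").text());
}
```

Inside a crawler callback this could be called as `extractBodyWords(res.$)`. One caveat: `.text()` simply concatenates text nodes, so words in adjacent elements with no whitespace between the tags may run together, and punctuation stays attached to the words. Further cleanup would be needed if that matters.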



via Azarus
