Friday, 26 May 2017

Finding near duplicates/similar documents in RethinkDB

My application ingests a feed of social media posts, which I reduce to their text content. My RethinkDB database therefore has a table of documents like this:

[{text: "This is a document."}, {text: "This is another document."}, ...]

These posts are constantly being added to the database.

I want to skip saving near-duplicate documents. In other words, a new post should not be saved if a stored post already says essentially the same thing.

For example:

{text: 'I ate ice-cream today!'} would be similar to {text: 'I ate a big bowl of ice-cream! #icecream'} but not to {text: 'I have visited the Disneyland!'}
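
To make "similar" concrete, here is a minimal sketch of one possible measure: Jaccard similarity over the sets of words in each post. The measure, the tokenization rule and the helper names `tokens` and `jaccard` are assumptions chosen for illustration; nothing here is specific to RethinkDB.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty posts are treated as identical
    return len(a & b) / len(a | b)


posts = [
    "I ate ice-cream today!",
    "I ate a big bowl of ice-cream! #icecream",
    "I have visited the Disneyland!",
]

# Compare the first post against the other two.
base = tokens(posts[0])
scores = [jaccard(base, tokens(p)) for p in posts[1:]]

for post, score in zip(posts[1:], scores):
    print(f"{score:.2f}  {post}")
# prints:
# 0.40  I ate a big bowl of ice-cream! #icecream
# 0.11  I have visited the Disneyland!
```

With this measure, a cut-off somewhere between the two scores (say 0.3, an arbitrary choice) would treat the second post as a near duplicate of the first and the third as new.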

What is the most efficient way to handle this task, preferably one that is specific to RethinkDB?
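
For illustration only, the "skip on save" behaviour I have in mind could be sketched in memory as below. The class name `NearDuplicateFilter`, the word-set Jaccard measure and the 0.3 threshold are all assumptions, and the token-to-post index is a plain dictionary here; in RethinkDB something like a multi index on a stored array of tokens might play the same candidate-lookup role, but I have not verified that this would be efficient.

```python
import re
from collections import defaultdict


class NearDuplicateFilter:
    """Hypothetical in-memory filter that rejects near-duplicate posts."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold      # assumed cut-off, needs tuning
        self.docs = []                  # token set of each stored post
        self.index = defaultdict(set)   # token -> ids of posts containing it

    @staticmethod
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add_if_new(self, text):
        """Store the post and return True unless a similar one exists."""
        toks = self.tokenize(text)

        # Only posts sharing at least one token can have similarity > 0,
        # so the index avoids comparing against every stored post.
        candidates = set()
        for tok in toks:
            candidates |= self.index.get(tok, set())

        for doc_id in candidates:
            other = self.docs[doc_id]
            if len(toks & other) / len(toks | other) >= self.threshold:
                return False  # near duplicate: skip saving

        doc_id = len(self.docs)
        self.docs.append(toks)
        for tok in toks:
            self.index[tok].add(doc_id)
        return True


dedup = NearDuplicateFilter()
results = [
    dedup.add_if_new("I ate ice-cream today!"),
    dedup.add_if_new("I ate a big bowl of ice-cream! #icecream"),
    dedup.add_if_new("I have visited the Disneyland!"),
]
print(results)  # [True, False, True]
```

The second post is skipped because it shares 4 of 10 distinct words with the first (0.4), while the third shares only 1 of 9 and is kept.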



via nainy
