Detecting similar news

I was faced with an interview question recently that sounded like a nice puzzle to play with. If my memory's right, the question was something like the following:

You are a teacher; your students have been given the task of writing a 1,000-word paper on a subject of their choice. Find a way to help detect plagiarism.

I've been doing IT because I love puzzles, and this one was sweet. My solution was something like the following:

for every student:
    for every word in the student's paper:
        store the word in a word set
    store the word set in a dict (key being the student)

for every student's word set:
    only keep rare words
    // i.e. words present in less than 70% of the papers
    // to remove common words and prevent noisy matches

plagiarism as a defaultdict to 0
for every (a)student's word set:
    pop the current student's paper
    // to avoid comparing it against itself or
    // duplicating the comparison A->B then B->A
    for every other (b)student's word set:
        for every word in (a)student's word set:
            // increment on each similarity between a and b
            if word in (b)student's word set:
                plagiarism["%s_%s" % (a, b)] += 1

Even though I believe my algorithm works, I still need a way to prove my point (hopefully I'm right). That's where the Building a simple crawler article comes in; it is actually the reason I worked on the crawler in the first place. I'll pull pages from different tech websites and see if I can discover which articles are about the same subject (title and URL don't count, only content counts).

Retrieving the articles

So the following script will fill up the database with pages (about 250 MB of data after one crawl, and it took about an hour... you've been warned).

#!/usr/bin/python
# filename: run.py
import re  
from crawler import Crawler, CrawlerCache

if __name__ == "__main__":  
    # Using SQLite as a cache to avoid pulling twice
    crawler = Crawler(CrawlerCache('crawler.db'))
    root_re = re.compile('^/$').match
    crawler.crawl('http://techcrunch.com/', no_cache=root_re)
    crawler.crawl('http://www.engadget.com/', no_cache=root_re)
    crawler.crawl('http://gizmodo.com/', no_cache=root_re)
    crawler.crawl('http://www.zdnet.com/', no_cache=root_re)
    crawler.crawl('http://www.wired.com/', no_cache=root_re)

So I've got the data in SQLite; I just need to retrieve the articles and strip the HTML from them.

Find the correct URLs

So the first task is to find the articles themselves. Luckily, modern routes have patterns, and articles have simple ones. For TechCrunch, a URL looks like /2014/08/03/hello-world-2/; as a regular expression it can be detected with ^/\d{4}/\d{2}/\d{2}/*.
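As a quick sanity check of that pattern (the sample paths here are just illustrations, not URLs from the crawl):

import re

article_re = re.compile('^/\d{4}/\d{2}/\d{2}/*').match

print bool(article_re('/2014/08/03/hello-world-2/'))  # True: looks like an article path
print bool(article_re('/events/disrupt-sf/'))         # False: not an article path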

Find the content

The next step is extracting the article's words, stripping away the tags. For that, I'm using BeautifulSoup. I'll build a lambda per domain that accepts a BeautifulSoup object and returns the element containing the article. Finally, I can use BeautifulSoup's get_text() method to strip away the tags.
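A minimal sketch of that idea, using a made-up page and the TechCrunch selector from the config further down:

from bs4 import BeautifulSoup

# hypothetical page, just to illustrate the idea
html = '<html><body><div class="article-entry">Hello <b>world</b></div></body></html>'

# one selector per domain, e.g. a <div class="article-entry"> for TechCrunch
get_content = lambda page: page.find('div', 'article-entry')

content = get_content(BeautifulSoup(html))
if content:
    print content.get_text().split()  # ['Hello', 'world'] -- the tags are gone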

defaultdict

If you've never encountered a defaultdict: imagine it as a dict with a shortcut for creation. The callable you pass to the constructor is used to build the value for any non-existing key you access. If x were a defaultdict with int as the default class:

x = defaultdict(int)  
x[key] += 1  

is equivalent to

x = {}
if key not in x:
    x[key] = int()
x[key] += 1

The code

#!/usr/bin/python
# filename: plagia.py

# native libs
import sys  
import re  
from collections import defaultdict

# external libs
from bs4 import BeautifulSoup

# local libs
from crawler import CrawlerCache

crawler_cache = CrawlerCache('crawler.db')

# Config per domain
sites = (  
    {
        'domain': 'techcrunch.com',
        'url_re': re.compile('^/\d{4}/\d{2}/\d{2}/*').match,
        'get_content': lambda page: page.find('div', 'article-entry'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'www.engadget.com',
        'url_re': re.compile('^/\d{4}/\d{2}/\d{2}/*').match,
        'get_content': lambda page: page.find('div', 'post-body'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'gizmodo.com',
        'url_re': re.compile('^/[a-z0-9\-]*-\d{5,12}').match,
        'get_content': lambda page: page.find('article', 'post'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'www.zdnet.com',
        'url_re': re.compile('^/[a-z0-9\-]*-\d{5,12}/$').match,
        'get_content': lambda page: page.find('article', 'post'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'www.wired.com',
        'url_re': re.compile('^/\d{4}/\d{2}/[^/]*/$').match,
        'get_content': lambda page: page.find('article', 'post'),
        'words': defaultdict(int),
        'urls': {},
    }
)

# store all the words for statistical value

# filter the URLs, extract the content, get the words
for site in sites:  
    domain = site['domain']
    # retrieve all the URLs for the current domain matching the article format
    urls = set([u for u in crawler_cache.get_urls(domain) if site['url_re'](u)])
    for url in urls:
        html = crawler_cache.get(domain=domain, url=url)
        if html:
            # Use BeautifulSoup to navigate the document
            page = BeautifulSoup(html)
            # retrieve the content of the article
            content = site['get_content'](page)
            if content:
                # remove script tag content from article
                # yes, there is JS in the article :/
                [s.clear() for s in content.find_all('script')]
                # trim the tags from the article
                article_words = content.get_text().split()
                # articles with fewer than 200 words kind of suck
                # so let's ignore those
                if len(article_words) > 200:
                    # keep unique words by putting them in a set
                    article_words = set(w.lower() for w in article_words)
                    site['urls'][url] = article_words
                    # count the word occurrences per domain
                    for word in article_words:
                        site['words'][word] += 1


# Now let's remove words that are common across the domain's articles
for site in sites:  
    # words present in over 5% of the domain's articles are removed
    threshold = len(site['urls']) * .05
    noisy_words = set(w for w, c in site['words'].items() if c > threshold)
    for url in site['urls'].keys():
        # remove them using the set difference feature, pretty sweet
        site['urls'][url] = site['urls'][url].difference(noisy_words)


# We can now compare article to each others across domains
plagia = defaultdict(list)  
for site in sites:  
    for other_site in sites:
        # We don't match site against itself :|
        if other_site['domain'] == site['domain']:
            continue
        # grab every article for the domain
        for url in site['urls'].keys():
            # words in the current article
            words = site['urls'][url]
            # minimum match has to be 10%
            best_score = len(words) * .1
            match = ''
            # compare the article to the other domain's articles
            for other_url in other_site['urls'].keys():
                # words in the article from another domain
                other_words = other_site['urls'][other_url]
                # count how many common "rare" words
                score = len(words.intersection(other_words))
                if score > best_score:
                    # woohoo, if you're here you're the new best match
                    match = other_url
                    best_score = score
            if match:
                full_url = 'http://%s%s' % (site['domain'], url)
                full_other_url = 'http://%s%s' % (other_site['domain'], match)
                plagia[full_url].append((
                    best_score,
                    (best_score * 100.0) / len(words), # percentage
                    full_other_url,
                ))

for url, matches in plagia.items():  
    print url
    for match in matches:
        print '\t%s\t%.d%%\t%s' % match

Output

The first line is the URL, followed by the corresponding articles found on other domains.

So what?

Well, I can't think of a direct use for it; it was mostly fun to work on. Overall, the concept of removing common words made sense. Some of my initial estimates were pretty wrong, though.

I discovered that my first estimate, that words present in 70% of the articles are common, was completely wrong. I had to lower the threshold to 5%, meaning words present in 1 out of 20 articles are declared common (the crawler accumulated about 40+ articles per domain). Even at 10% I got very bad matches. I guess the idea is that you want to compare what is truly unique to an article.
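To put rough numbers on it, assuming the ~40 articles per domain mentioned above:

articles_per_domain = 40           # rough figure from this crawl

print articles_per_domain * .70    # 28.0: a word had to appear in 28 articles to be dropped
print articles_per_domain * .05    #  2.0: now any word appearing in more than 2 articles is dropped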

Then you have to compare A to B and B to A. At first I thought it would not be necessary, but while A might have less than 10% in common with B, B can have more than 10% in common with A. Compare these two (a quick sketch of the numbers follows the list):

  • Are you hungry?
  • I'm very hungry today I want to eat a steak right now! etc...
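A minimal sketch of that asymmetry, treating those two sentences as the "articles" (punctuation dropped for simplicity):

a = set('are you hungry'.split())
b = set("i'm very hungry today i want to eat a steak right now".split())

common = a.intersection(b)          # {'hungry'}
print len(common) * 100.0 / len(a)  # ~33%: a big chunk of A appears in B
print len(common) * 100.0 / len(b)  # ~8%: barely any of B appears in A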

I also had to remove very short articles (fewer than 200 words); as you can imagine from the previous example, they often ended up being bad matches to other articles, or were completely emptied by the noise reduction.

This solution definitely calls for more. I've been planning on writing about extracting the theme and tags out of a text...

Gist and comments

Code is available as a gist on GitHub; comments are on Reddit /r/python and Hacker News