I was faced with an interview question recently that sounded like a nice puzzle to play with. If my memory's right, the question went something like this:
You are a teacher; your students have been given the task of writing a 1,000-word paper on a subject of their choice. Find a way to help detect plagiarism.
I've been doing IT because I love puzzles, and this one was sweet. My solution was something like the following:
for every student:
    for every word in the student's paper:
        store the word in a word set
        store the word in a dict (key being the student)

for every student's words:
    only keep rare words
    // aka words present in less than 70% of the papers
    // to remove common words and prevent noisy matches

plagiarism as a defaultdict to 0
for every (a) student's words:
    pop the current student's paper
    // to avoid comparing it against itself or
    // duplicating the comparison A->B then B->A
    for every other (b) student's words:
        for every word in (a) student's words:
            // increment on each similarity between a and b
            if word in (b) student's words:
                plagiarism["%s_%s" % (a, b)] += 1
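To make that pseudocode concrete, here is a rough Python sketch of the same idea; it is only an illustration of the answer above, not code from the interview. The papers argument (a dict mapping each student to their paper's text), the helper names and the 70% default are my own choices for the sketch.

from collections import defaultdict

def detect_plagiarism(papers, common_ratio=0.7):
    # papers: hypothetical dict mapping student name -> paper text
    words_by_student = dict((s, set(text.lower().split())) for s, text in papers.items())

    # count in how many papers each word appears
    doc_count = defaultdict(int)
    for words in words_by_student.values():
        for word in words:
            doc_count[word] += 1

    # only keep "rare" words: present in less than 70% of the papers
    threshold = common_ratio * len(papers)
    rare = dict((s, set(w for w in words if doc_count[w] < threshold))
                for s, words in words_by_student.items())

    # count shared rare words for every pair, comparing each pair only once
    plagiarism = defaultdict(int)
    students = sorted(rare)
    for i, a in enumerate(students):
        for b in students[i + 1:]:
            plagiarism['%s_%s' % (a, b)] = len(rare[a] & rare[b])
    return plagiarism

Since the score here is a symmetric count of shared words, each pair only needs to be compared once; the percentage-based comparison used for the articles below turns out to be a different story.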
Even though I believe my algorithm works, I still need a way to prove my point (hopefully I'm right). That's where the Building a simple crawler article comes in. Actually, this article is the reason why I worked on the crawler in the first place. I'll pull pages from different tech websites and see if I can discover which articles are about the same subject (title and URL don't count, only content counts).
Retrieving the articles
So the following script will fill up the database with pages (about 250 MB of data after one crawl, and it took about an hour... you've been warned).
#!/usr/bin/python
# filename: run.py
import re
from crawler import Crawler, CrawlerCache

if __name__ == "__main__":
    # Using SQLite as a cache to avoid pulling twice
    crawler = Crawler(CrawlerCache('crawler.db'))
    root_re = re.compile('^/$').match
    crawler.crawl('http://techcrunch.com/', no_cache=root_re)
    crawler.crawl('http://www.engadget.com/', no_cache=root_re)
    crawler.crawl('http://gizmodo.com/', no_cache=root_re)
    crawler.crawl('http://www.zdnet.com/', no_cache=root_re)
    crawler.crawl('http://www.wired.com/', no_cache=root_re)
So I've got this data in SQLite; I just need to retrieve the articles and strip the HTML from them.
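The cache interface from the crawler article is all I need for that: get_urls(domain) lists the cached paths for a domain and get(domain=..., url=...) returns the raw HTML, exactly as used in the full script further down. A minimal sketch:

from crawler import CrawlerCache

crawler_cache = CrawlerCache('crawler.db')
# list every cached path for a domain, then pull each page's raw HTML
for url in crawler_cache.get_urls('techcrunch.com'):
    html = crawler_cache.get(domain='techcrunch.com', url=url)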
Find the correct URLs
So the first task is to find the articles themselves. Luckily, modern routes have patterns, and articles have simple ones. For TechCrunch, it is something like /2014/08/03/hello-world-2/; as a regular expression, it can be detected with ^/\d{4}/\d{2}/\d{2}/*.
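For instance, filtering a few paths (the sample paths here are made up for illustration):

import re

article_re = re.compile('^/\d{4}/\d{2}/\d{2}/*').match

paths = ['/2014/08/03/hello-world-2/', '/about/', '/events/']
articles = [p for p in paths if article_re(p)]
# -> ['/2014/08/03/hello-world-2/']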
Find the content
The next step is extracting the article's words, stripping out the tags. For that, I'm using BeautifulSoup. I'll build a lambda per domain that accepts a BeautifulSoup object and returns the element containing the article. Finally, I can use BeautifulSoup's get_text() method to strip away the tags.
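As a minimal sketch, using the TechCrunch selector from the config further down; the html string here is a made-up stand-in for a page pulled from the cache:

from bs4 import BeautifulSoup

# hypothetical page, stand-in for the HTML retrieved from the cache
html = '<html><body><div class="article-entry"><p>Hello <b>world</b></p></div></body></html>'

get_content = lambda page: page.find('div', 'article-entry')  # TechCrunch's article container

page = BeautifulSoup(html)
content = get_content(page)
if content:
    article_words = content.get_text().split()  # -> ['Hello', 'world'], tags stripped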
defaultdict
If you have never encountered a defaultdict: imagine it as a dict with a shortcut for creation. The class you pass to the constructor will be instantiated as the value for any non-existing key you query. If x were a defaultdict with int as the default class:
x = defaultdict(int)
x[key] += 1
is equivalent to
x = {}
if key not in x:
    x[key] = int()
x[key] += 1
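That is exactly how the per-domain word counters below work, for example (sample sentence made up):

from collections import defaultdict

counts = defaultdict(int)
for word in 'the cat sat on the mat'.split():
    counts[word] += 1
# counts['the'] == 2, counts['cat'] == 1, and an unseen key such as counts['dog'] is 0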
The code
#!/usr/bin/python
# filename: plagia.py
# native libs
import sys
import re
from collections import defaultdict

# external libs
from bs4 import BeautifulSoup

# local libs
from crawler import CrawlerCache

crawler_cache = CrawlerCache('crawler.db')

# Config per domain
sites = (
    {
        'domain': 'techcrunch.com',
        'url_re': re.compile('^/\d{4}/\d{2}/\d{2}/*').match,
        'get_content': lambda page: page.find('div', 'article-entry'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'www.engadget.com',
        'url_re': re.compile('^/\d{4}/\d{2}/\d{2}/*').match,
        'get_content': lambda page: page.find('div', 'post-body'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'gizmodo.com',
        'url_re': re.compile('^/[a-z0-9\-]*-\d{5,12}').match,
        'get_content': lambda page: page.find('article', 'post'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'www.zdnet.com',
        'url_re': re.compile('^/[a-z0-9\-]*-\d{5,12}/$').match,
        'get_content': lambda page: page.find('article', 'post'),
        'words': defaultdict(int),
        'urls': {},
    },
    {
        'domain': 'www.wired.com',
        'url_re': re.compile('^/\d{4}/\d{2}/[^/]*/$').match,
        'get_content': lambda page: page.find('article', 'post'),
        'words': defaultdict(int),
        'urls': {},
    },
)

# store all the words for statistical value:
# filter the URLs, extract the content, get the words
for site in sites:
    domain = site['domain']
    # retrieve all the URLs for the current domain matching an article format
    urls = set([u for u in crawler_cache.get_urls(domain) if site['url_re'](u)])
    for url in urls:
        html = crawler_cache.get(domain=domain, url=url)
        if html:
            # use BeautifulSoup to navigate the document
            page = BeautifulSoup(html)
            # retrieve the content of the article
            content = site['get_content'](page)
            if content:
                # remove script tag content from the article
                # yes, there is JS in the article :/
                [s.clear() for s in content.find_all('script')]
                # trim the tags from the article
                article_words = content.get_text().split()
                # articles with less than 200 words kind of suck
                # so let's ignore those
                if len(article_words) > 200:
                    # keep unique words by putting them in a set
                    article_words = set(w.lower() for w in article_words)
                    site['urls'][url] = article_words
                    # count the word occurrences per domain
                    for word in article_words:
                        site['words'][word] += 1

# Now let's remove words that are common in the domain's articles
for site in sites:
    # words present in over 5% of the domain's articles are removed
    threshold = len(site['urls']) * .05
    noisy_words = set(w for w, c in site['words'].items() if c > threshold)
    for url in site['urls'].keys():
        # remove them using the set difference feature, pretty sweet
        site['urls'][url] = site['urls'][url].difference(noisy_words)

# We can now compare articles to each other across domains
plagia = defaultdict(list)
for site in sites:
    for other_site in sites:
        # We don't match a site against itself :|
        if other_site['domain'] == site['domain']:
            continue
        # grab every article for the domain
        for url in site['urls'].keys():
            # words in the current article
            words = site['urls'][url]
            # minimum match has to be 10%
            best_score = len(words) * .1
            match = ''
            # compare the article to the other domain's articles
            for other_url in other_site['urls'].keys():
                # words in the article from the other domain
                other_words = other_site['urls'][other_url]
                # count how many "rare" words they have in common
                score = len(words.intersection(other_words))
                if score > best_score:
                    # woohoo, if you're here you're the new best match
                    match = other_url
                    best_score = score
            if match:
                full_url = 'http://%s%s' % (site['domain'], url)
                full_other_url = 'http://%s%s' % (other_site['domain'], match)
                plagia[full_url].append((
                    best_score,
                    (best_score * 100.0) / len(words),  # percentage
                    full_other_url,
                ))

for url, matches in plagia.items():
    print url
    for match in matches:
        print '\t%s\t%.d%%\t%s' % match
Output
The first line is the URL, followed by the corresponding articles on other domains:
- http://gizmodo.com/electricity-eating-bacteria-are-real-and-more-common-th-1606910533
- http://www.zdnet.com/microsoft-ordered-to-hand-over-overseas-email-throwing-eu-privacy-rights-in-the-fire-7000032210/
- http://gizmodo.com/amazon-fire-phone-review-a-shaky-first-step-1608853105
- http://techcrunch.com/2014/07/31/internet-org-app/
- http://techcrunch.com/2014/08/02/sony-and-ea-experiment-with-new-business-models-for-older-games/
- http://www.wired.com/2014/06/amazon-fire-phone-hands-on/
- http://www.engadget.com/2014/08/03/microsoft-nfl-sideline-viewing-system/
- http://www.wired.com/2014/07/amazon-fire-phone-2/
- http://www.zdnet.com/apple-refreshes-macbook-pro-with-retina-display-lineup-drops-prices-7000032080/
- http://www.engadget.com/2014/04/02/xiaomi-pressy-clone/
- http://www.wired.com/2014/07/the-big-costs-of-nsa-surveillance-that-no-ones-talking-about/
- http://www.wired.com/2014/07/updated-macbook-pros/
- http://www.wired.com/2014/07/james-cameron-deepsea-challenge/
- http://www.zdnet.com/amazon-fire-phone-hands-on-initial-impressions-7000030729/
- http://www.engadget.com/2014/07/31/twitter-transparent/
- http://www.zdnet.com/apple-closes-acquisition-of-beats-music-7000032274/
- http://techcrunch.com/2014/08/03/this-extension-gives-you-a-new-gif-every-single-time-you-open-a-tab/
- http://www.wired.com/2014/07/usb-security/
- http://www.zdnet.com/how-to-improve-your-iphones-battery-life-updated-for-ios7-7000014902/
- http://techcrunch.com/2014/08/03/nfl-will-let-teams-use-microsoft-surface-tablets-on-the-sidelines-during-games/
- http://www.wired.com/2007/10/cockroach-birth/
- http://www.wired.com/2014/07/nokia-lumia-635/
- http://www.wired.com/2014/07/internet-org-zambia/
- http://www.zdnet.com/badusb-big-bad-usb-security-problems-ahead-7000032211/
- http://gizmodo.com/this-weeks-top-comedy-video-how-to-spell-my-last-name-1615154242
- http://www.zdnet.com/why-i-bit-the-bullet-and-finally-switched-from-outlook-to-gmail-7000032179/
- http://www.wired.com/2014/07/wifi-dongle-sharing/
- http://techcrunch.com/2014/07/30/founders-on-depression/
- http://www.zdnet.com/how-one-judge-single-handedly-killed-trust-in-the-us-technology-industry-7000032257/
- http://www.wired.com/2014/07/virgin-mobiles-new-wireless-plan-is-like-netflix-for-your-phone/
- http://techcrunch.com/2014/07/31/watch-this-robot-learn-to-walk-off-an-injury/
So what?
Well, I can't think of any direct use for it; it was mostly fun to work on. Overall, the concept of removing common words made sense. Some of my initial estimates were pretty wrong, though.
I discovered that my first estimate, that words present in 70% of the articles are common, was completely wrong. I had to lower this to 5%, meaning words present in 1 out of 20 articles are declared common (the crawler accumulated about 40+ articles per domain). Even at 10% I got very bad matches. I guess the idea is that you want to compare what is very unique to an article.
Then you have to compare A to B and B to A. I thought it would not be necessary at first, but while A might have less than 10% in common with B, B can have more than 10% in common with A. Compare these two:
Are you hungry?
I'm very hungry today I want to eat a steak right now! etc...
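A quick sketch of that asymmetry using the two sentences above (crude tokenization, just for illustration):

import re

def words(text):
    return set(re.findall(r"[a-z']+", text.lower()))

a = words("Are you hungry?")
b = words("I'm very hungry today I want to eat a steak right now!")
common = a & b  # {'hungry'}

print '%d%% of A is shared with B' % (100 * len(common) / len(a))  # 33% -> above the 10% bar
print '%d%% of B is shared with A' % (100 * len(common) / len(b))  # 8% -> below it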
I also had to remove very short articles (fewer than 200 words); as you can imagine from the previous example, they often ended up being bad matches to others, or were completely emptied by the noise reduction.
This solution definitely calls for more. I've been planning to write about extracting the theme and tags out of a text...
Gist and comments
The code is available as a gist on GitHub; comments are on reddit /r/python and Hacker News.