backstage.bbc.co.uk

Use Our Stuff To Build Your Stuff

Prototypes

Bayesian News by Email

I'm the author of a reasonably popular open source app called phplist. My app is capable of handling RSS sources and sending out regular emails. But there's no point setting it up to send BBC news emails, because your email service is already doing that, and probably doing it better. But then I thought, inspired by the Bayesian spam filters of Mozilla mail (which I use), why not have a go at Bayesian news filtering? I am rather impressed with the success rate of my spam filter, and all of that is based on a reasonably simple algorithm. So I rather quickly whacked up a prototype site of my system. The emails it sends contain some extra controls for tagging news stories as "interesting" or "not interesting". Those tags can then be fed back to the system to update the Bayesian filters, and could be used to personalise the news stories from the BBC. After all, aren't we all receiving too much information anyway?

For now, the system only registers the filters and updates the database. There is no Bayesian algorithm involved at all yet, and proper filtering will take a while anyway. But on the whole the infrastructure for this to work is now set up, and filtering can be activated with just a little more time to implement it. After all, Bayesian filtering is not that complex (if you read http://www.paulgraham.com/better.html). The system runs fully automated, pulling in the RSS feeds and sending the emails without any intervention. I have had a system like this running for the indymedia website for easily 2 years now (just to test that it works; hardly anybody actually knows about it). But then again, we're talking about rather less volume in that case.
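
To give an idea of what the missing filtering step might look like, here is a minimal sketch in Python rather than PHP. It is not the phplist code. It loosely follows the token-probability idea from the Paul Graham essay linked above, simplified for two classes ("interesting" and "not interesting"). The class name NewsFilter, the 0.01/0.99 clamps, the neutral 0.5 for unseen words and the 15-token cutoff are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


class NewsFilter:
    """Toy Bayesian filter: 'interesting' (True) vs 'not interesting' (False)."""

    def __init__(self):
        # Hypothetical in-memory store; a real system would keep this per user in a database.
        self.tokens = {True: Counter(), False: Counter()}  # stories containing each token
        self.stories = {True: 0, False: 0}                 # stories tagged per class

    @staticmethod
    def tokenize(text):
        # Lowercased words as a set, so each token counts once per story.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, interesting):
        """Record one tagged story (the feedback coming from the email links)."""
        self.stories[interesting] += 1
        self.tokens[interesting].update(self.tokenize(text))

    def token_probability(self, token):
        """Estimate P(interesting | token), clamped away from 0 and 1."""
        seen_i = self.tokens[True][token]
        seen_n = self.tokens[False][token]
        if seen_i + seen_n == 0:
            return 0.5  # assumption: unseen words are treated as neutral
        rate_i = seen_i / max(self.stories[True], 1)
        rate_n = seen_n / max(self.stories[False], 1)
        return min(0.99, max(0.01, rate_i / (rate_i + rate_n)))

    def score(self, text, top_n=15):
        """Combine the top_n most decisive tokens into one probability."""
        probs = sorted(
            (self.token_probability(t) for t in self.tokenize(text)),
            key=lambda p: abs(p - 0.5),
            reverse=True,
        )[:top_n]
        # Summing log-odds is equivalent to prod(p) / (prod(p) + prod(1 - p)).
        log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probs)
        return 1.0 / (1.0 + math.exp(-log_odds))


news_filter = NewsFilter()
news_filter.train("Election results announced in parliament", True)
news_filter.train("Football transfer rumours and match report", False)
print(news_filter.score("Parliament to vote after election"))
```

A score above 0.5 suggests the story resembles ones tagged "interesting"; where to draw the cutoff for including a story in the email would need tuning against real feedback.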

This is really a prototype, and certainly not capable of big loads.

So it's quite obvious what I'd do, if I had more time and money.

- implement the filtering
- make it scalable
- add loads more security and privacy checks
- add some branding
- wrap up the code, either in a plugin for phplist or in the core code
- probably loads more, there is always something in a project you don't think about.

  • 18 May 2005 11:05 AM

Comments

  • 1.
  • On 18 May 2005 01:38 PM,
  • roberson said:

very good !!!

  • 2.
  • On 18 May 2005 02:12 PM,
  • Scott said:

I like the idea. I do think it's the sort of thing that could be implemented on a web page rather than just by email. I had some thoughts along those lines a few months ago, which you can find on my blog:

http://www.matthewman.net/archives/2004/08/28/bayesian_filtering_for_website_front_pages.php

  • 3.
  • On 19 May 2005 01:05 AM,
  • Ed said:

Great idea. I'll have a look :)

I wonder if Mozilla Thunderbird lets you use Bayesian filtering on RSS feeds. (That's how I get all my news)

  • 5.
  • On 21 May 2005 06:03 PM,
  • Ryan Clark said:

I think this is a really cool idea. However, I'm having a bit of an issue. Today I got my first news feed. When I clicked the "Update filters" link it took me to http://bbcnews.phplist.com/lists/experiment.php and said "Sorry, user not found". Any ideas why this would happen?


P.S. By random chance, I just happened to be learning about and setting up PHPlist for a client at the same time I read this. Great work.

  • 6.
  • On 23 May 2005 06:02 PM,
  • David Roussel said:

It seems there are a number of implementations already out there.

  • 7.
  • On 26 May 2005 01:11 PM,
  • michiel said:

In reference to Ryan's comment about the User not found error:

It appears the ability to post from an email to a website has been disabled by quite a few mail readers, so implementing this via email may not be the way to go.

I use Mozilla mail and it works fine, but it's been reported not to work with Thunderbird and Mac OS X's mail.

  • 8.
  • On 06 Jun 2005 11:17 AM,
  • eAi said:

Yes, it doesn't work in Thunderbird, which is a pain. Why not use good old GET instead of POST?

Interesting - I described something like this in a post about hypothetical RSS readers a month ago (http://mike.teczno.com/notes/competition_for_bloglines.html). Do you find it's enough to have just two states, "interesting" and "not interesting"?
