The Spam and the Spider
I just finished reading a great book called The Starfish and the Spider. The book does a really good job of illustrating how decentralized organizations trump centralized companies time and time again. Examples cited include the Apache web server, Napster, Alcoholics Anonymous, and Skype.
I’m still trying to figure out how I feel about some of their examples. The issue I have is how they group different types of decentralization. The authors associate decentralized organizations like Alcoholics Anonymous, Apache, and Linux with distributed platforms like Skype and Napster. There’s a big difference between an organization and a technology platform and it doesn’t make sense to group the examples together.
That aside, I was most interested in how this theory applies to the problem of spam. I think spam is a classic example of an “evil” starfish organization like Al Qaeda. The more you fight it, the more virulent it becomes.
The authors cite three primary strategies for fighting decentralized organizations: challenge their ideology, centralize them, or decentralize yourself. I don’t think spammers have much of an ideology. They’re just trying to make a buck. There’s really no “cause” here. I don’t see much to centralize either. However, taking a decentralized approach to filtering spam seems to make a lot of sense.
Every major webmail provider has a spam box that allows you to mark spam. Surely the vast majority of spam hits most of the major webmail providers’ systems. If a person marks a particular address or piece of content as spam, it should get blacklisted in a shared database accessible to all email providers. This approach would basically create a distributed human computing engine to combat spam. Of course some people would occasionally mark non-spam messages (such as opt-in retail mailers) as spam, but statistically the wisdom of crowds would prevail and the system should be 99.9% accurate.
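To make the “wisdom of crowds” part concrete, here is a minimal sketch of how pooled reports could be turned into a blacklist decision. Everything in it is my own assumption for illustration: the class name `SharedSpamReports`, the idea of counting “not spam” clicks alongside “spam” clicks, and both thresholds. It is not how any provider actually does it.

```python
from collections import defaultdict


class SharedSpamReports:
    """Hypothetical pool of user reports shared across email providers."""

    def __init__(self, min_reports=50, min_spam_ratio=0.9):
        # Both thresholds are illustrative assumptions, not real provider settings.
        self.min_reports = min_reports
        self.min_spam_ratio = min_spam_ratio
        self.spam_votes = defaultdict(int)
        self.ham_votes = defaultdict(int)

    def report(self, key, is_spam):
        """Record one user's verdict on a sender address or piece of content."""
        if is_spam:
            self.spam_votes[key] += 1
        else:
            self.ham_votes[key] += 1

    def is_blacklisted(self, key):
        """Blacklist only when enough users have voted and most of them agree."""
        spam = self.spam_votes.get(key, 0)
        total = spam + self.ham_votes.get(key, 0)
        if total < self.min_reports:
            return False
        return spam / total >= self.min_spam_ratio
```

Requiring a minimum number of reports and a high spam ratio is what keeps a handful of people who flag an opt-in retail mailer from blacklisting it for everyone.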
Most of the spam I get in my mailbox is highly repetitive. Almost all the messages have similar subject lines and content. I use Yahoo Mail to manage all my emails from multiple accounts. Why can’t the power of all Yahoo users be harnessed to filter my spam intelligently? Why can’t the power of all email users across all email providers be harnessed to filter spam intelligently?
You have companies like Symantec that keep a central database of all viruses. Why not a company that keeps a central database of all emails marked as spam across all email providers, which could, in turn, be licensed to each email provider? It seems like a win-win venture to me. Each new email company that signs up provides additional user-generated spam data, and the analysis of that data provides dramatically improved real-time filtering data to all webmail providers. As soon as an email gets marked as spam in one mailbox, it gets filtered as spam in all mailboxes. If Google can index and offer instant search results on all the content in the world, surely a company can index and offer instant filtering on all the spam in the world?
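One way to picture the “index of all spam” is as a set of message fingerprints that every provider can query. The sketch below is only an illustration under my own assumptions: the `fingerprint` function, the `SpamIndex` class, and the choice of normalizing case and whitespace before hashing are all hypothetical.

```python
import hashlib
import re


def fingerprint(subject, body):
    """Hash a message after normalizing case and whitespace (an assumed scheme)."""
    text = f"{subject}\n{body}".lower()
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class SpamIndex:
    """Hypothetical central index of fingerprints for messages marked as spam."""

    def __init__(self):
        self._known = set()

    def mark_spam(self, subject, body):
        # Called when a user at any participating provider marks a message.
        self._known.add(fingerprint(subject, body))

    def is_spam(self, subject, body):
        # Called by any participating provider before showing a message.
        return fingerprint(subject, body) in self._known
```

An exact hash like this only catches copies that are identical after normalization. Spam that varies a few words per message would need some kind of fuzzy matching, which this sketch does not attempt.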
This entry was posted on Monday, September 17th, 2007 at 1:43 am and is filed under Books, Business.
on September 17, 2007 at 7:41 pm Aaron wrote:
Hey Lorenz,
The large email providers do use algorithms like this to cut down on spam, and to some degree they share information with each other (centralizing themselves to mutual benefit). In fact, Google does this quite nicely: a nice addition to the usual “This is Spam!” button lets small groups of its users keep everyone else from receiving spam, even after it’s been “delivered”.
If you sent 50,000 messages to AOL users and 10% identified it as spam, they might block your IP address from sending mail to AOL. To reduce the load on users, they could hold back most of the messages and pass along only the first 5,000 to see what people think. A trial run, if you will. Based on that small sample they can still tell if it’s spam and take appropriate action.
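The trial run can be sketched roughly as follows. This is only an illustration using the numbers from the example (5,000 of 50,000, a 10% complaint rate); the function names `split_for_trial` and `decide` are made up, and nothing here reflects AOL’s real system.

```python
def split_for_trial(recipients, sample_size):
    """Return (trial batch, held-back remainder) for a mailing."""
    return recipients[:sample_size], recipients[sample_size:]


def decide(delivered, complaints, threshold=0.10):
    """Decide what to do with the held-back messages.

    The 10% threshold comes from the example above and is an assumption,
    not a documented provider setting.
    """
    if delivered == 0:
        return "hold"  # nothing delivered yet, so there is no signal
    if complaints / delivered >= threshold:
        return "block"
    return "deliver"


recipients = [f"user{i}@example.com" for i in range(50_000)]
trial, held_back = split_for_trial(recipients, 5_000)
```

With 600 complaints out of the 5,000 trial messages, `decide(len(trial), 600)` returns `"block"` and the other 45,000 messages are never delivered.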
Gmail adds a twist and a dramatic improvement. Using the same example, Gmail would deliver *all* 50,000 messages and then watch what happens. Let’s say they see that 250 people have actually viewed the message, and 23.7% of them said it was spam. Their anti-spam system might say, “Good enough, we can act on that”. And this is where something special happens.
Gmail quietly reaches into the inboxes of those other 49,750 people and gently nudges the message into the spam folder. Their key insight was that because they’re a web-based email provider, they don’t so much “deliver” mail as make it available for viewing. All the people who haven’t logged in or refreshed their browser window haven’t actually “received” the message yet. If you make it disappear, they’ll never know it was there, and they won’t have to trip over it. Huzzah!
Neat huh?
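Here is a rough sketch of that retroactive nudge, assuming a simplified model where each mailbox tracks which messages its owner has already seen. The `Mailbox` class and `retract_unseen` function are hypothetical, and this is not a description of Gmail’s actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    """Simplified, assumed model of one user's webmail state."""
    inbox: set = field(default_factory=set)
    spam: set = field(default_factory=set)
    seen: set = field(default_factory=set)


def retract_unseen(mailboxes, message_id):
    """Move a message to the spam folder wherever it has not been seen yet.

    Returns the number of mailboxes that were changed.
    """
    moved = 0
    for box in mailboxes:
        if message_id in box.inbox and message_id not in box.seen:
            box.inbox.remove(message_id)
            box.spam.add(message_id)
            moved += 1
    return moved
```

Mailboxes where the message was already seen are left alone, since those users have had their chance to judge it for themselves.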
Another very cool centralizing anti-spam concept is Akismet, from the folks who brought us WordPress. It aggregates knowledge of what is and isn’t spam based on comment submissions across hundreds of thousands of blogs. Quite brilliant really, and exactly the same idea you were talking about above.
- aaron
on September 17, 2007 at 7:51 pm Lorenz wrote:
Aaron, thanks a lot for clarifying this. I was hoping someone who knew something about this would give me some clarity on the issue. Great info!