EBlogger

October 24, 2005

On Vandalism

(See this post for a little bit of context.)

Vandalism is a problem, Wikipedians are quick to assert, but one that is solved by constant vigilance–Wikipedians are watching recent changes “like hawks”.

“Yes, vandalism is common on Wikipedia,” we read in the recent collaboratively edited press release, “but Wikipedia heals quickly.” After all, “IBM researchers found that most vandalism on Wikipedia was reverted in less than five minutes.”

We see this statement frequently repeated at Wikipedia and elsewhere.

Most vandalism on Wikipedia is reverted in less than five minutes. Let us assume, for the moment, that that statement is true. Does it imply that vandalism is a solved problem for Wikipedia? Well, no. Suppose that 99 out of every 100 articles that get vandalized are reverted within 24 hours. The hundredth is not, so unless it is caught later, there is more vandalism in Wikipedia today than there was yesterday. Without knowing the rate at which uncorrected vandalism is added to Wikipedia, it is entirely possible that the percentage of vandalized articles is greater today than it was yesterday. How quickly most vandalism is reverted isn’t the right question to ask; we should be concerned with whether the amount of vandalism is shrinking or growing.
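To make the arithmetic concrete, here is a rough sketch in Python. Every number in it is made up for illustration: the 100 vandalized articles per day, the 99-in-100 revert rate, and above all the assumption that whatever slips past the first round of reverts is never fixed later. The function name and parameters are mine, not anything Wikipedia or IBM measured.

```python
def unreverted_backlog(days, vandalized_per_day, reverted_per_hundred):
    """Return the running total of unreverted vandalized articles, day by day.

    Hypothetical model: a fixed number of articles is vandalized each day,
    a fixed share is reverted, and anything missed is assumed to stay
    vandalized indefinitely.
    """
    backlog = 0
    history = []
    for _ in range(days):
        # Integer arithmetic keeps the illustration free of rounding noise.
        reverted = vandalized_per_day * reverted_per_hundred // 100
        backlog += vandalized_per_day - reverted
        history.append(backlog)
    return history


# Illustrative inputs only: 100 vandalized articles a day, 99 of them reverted.
history = unreverted_backlog(days=5, vandalized_per_day=100, reverted_per_hundred=99)
print(history)  # [1, 2, 3, 4, 5]
```

Under these assumptions the backlog never shrinks, however fast the other 99 reverts happen. Whether real-world vandalism behaves this way depends on how much of the leftover is eventually caught, which is exactly the number nobody quotes.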

But it gets worse than that. Most vandalism on Wikipedia is reverted in less than five minutes. Is that a meaningful thing to say? In order to know that most vandalism is reverted within minutes, wouldn’t we need to identify all vandalism, at least for a representative sample of Wikipedia articles? At best what we really mean is that most known vandalism is reverted in less than five minutes. Unknown vandalism is, well, unknown.

But wait–there’s more. Most vandalism on Wikipedia is reverted in less than five minutes. Did IBM researchers actually say that? Well, no. As far as I can see, the article to which everyone links seems to have only one paragraph on vandalism, which reads as follows:

“As publicly editable sites, Wikis are vulnerable to vandalism. We’ve examined many pages on Wikipedia that treat controversial topics, and have discovered that most have, in fact, been vandalized at some point in their history. But we’ve also found that vandalism is usually repaired extremely quickly–so quickly that most users will never see its effects. The pictures below tell the story.”

The “pictures below” are:

“Visualizing every saved version of the page on “abortion”, with each version getting equal space. The vertical black interruptions indicate times when a visitor has deleted most of the page.”



and

“Same page on “abortion”, but here horizontal spacing corresponds to time, so that rapid-fire changes show up almost on top of each other. Because vandalism is repaired so quickly, it does not show up in this view of the visualization”



Wait a minute. The IBM tool visualizes (a) the number of lines in the article and (b) who created those lines. It doesn’t give any insight at all into the content of those lines. It seems that they’ve defined “vandalism” as “deleting most of the page”, and that in the articles they’ve examined this is usually repaired “extremely quickly”. Wikipedians don’t even enumerate “deleting most of the page” on their list of common types of vandalism.

Where’s the “most vandalism” part? Or even the “five minutes” part? What IBM researchers really say is that for the controversial articles they have examined, page-wipes are restored quickly.

It seems that this “IBM researchers found most vandalism on Wikipedia is reverted in less than five minutes” line is a complete myth: IBM researchers didn’t actually make that claim, it’s not a meaningful claim to make, and it doesn’t really tell us anything at all about the volume of vandalism within Wikipedia.

October 11, 2005

Wikipedia is not Open Source

(See this post for a little bit of context.)

Wikipedia and other “open content” initiatives are often lumped together with “open source” projects.

For instance, a Google search on “wikipedia open source” currently finds over 8 million hits. The expression “open source encyclopedia” currently finds more than 12 million. Wikipedians themselves are fond of drawing a comparison to open source projects, invoking Linus’s Law (also here), citing a benevolent dictator, or comparing the project to Linux or the Apache Web Server.

While the Wikipedia is certainly “open” for editing and is made available under a license derived from one used for open source software, it is managed differently than every open source project on the planet, at least every one I’m aware of.

In an open source software project, one is free to use the software, to obtain and examine the software’s source code, to modify it locally, and, with various limitations, to redistribute it in binary or source form. One is encouraged, and in some circumstances required, to make one’s modifications available for others to use. But there is always someone, or a team of someones, who acts as the maintainer of the software. In the case of the Linux kernel, it was for a long time a single individual, and is now that individual and a team of trusted lieutenants. In the case of the Apache Web Server, it is the “Project Management Committee”, a group, in principle, of the most meritorious contributors (who approve new members by unanimous vote). While there are many contributors to each project, and many proposed contributions, there is always someone (a maintainer, a gatekeeper, an authority, an expert) who reviews and approves each contribution.

While I’ve never followed the day-to-day Linux development, I can tell you that at the Apache Software Foundation there is an extensive, formal, and documented process to ensure that every contribution is carefully reviewed. The Foundation is legally accountable for certain types of copyright and patent infringement, and prides itself on the quality of the software it produces. Reviews, and the “web-of-trust” that determines who is qualified to do such a review, are an important part of the Apache development process. Presumably it is not a coincidence that this process produces the most popular web server in the world, and one that is remarkably secure, robust and stable.

The absence of gatekeepers is not a new complaint about Wikipedia. The obvious retort, of course, is that other contributors will review changes after the fact. This is sometimes known as a “commit then review” protocol in open source circles. But open source projects only allow commit-then-review contributions from a trusted few. The Wikipedia review process, by allowing arbitrary commit-then-review contributions, assumes (a) that someone is actually reviewing the contribution, and (b) that this someone is capable of performing an informed review of that contribution. It is possible for both of these assumptions to be correct. It is worth noting, however, that thus far at least, these are unproven assumptions.

The presence of errors within the Wikipedia (and let’s be honest, the presence of more errors than virtually any “traditional” encyclopedia)–despite its impressive popularity–makes one wonder just how many eyeballs are needed before all bugs become shallow.

Update [11 Oct 2005 20:03 GMT]:

Based on comments here and elsewhere, I seem to have either riled or confused some folks, so perhaps I wasn’t quite clear. Let me restate the above as follows:

1) When people (including Wikipedia contributors) talk about Wikipedia they often appeal to a comparison to open source, and ascribe aspects/virtues of open source initiatives to Wikipedia.

2) Wikipedia is organized differently than other “open” projects, in the sense that every open source project (as opposed to open content) maintains a gatekeeper in one form or another, while Wikipedia does not.

3) As a result, some of the aspects ascribed to Wikipedia via the comparison in point #1 may not apply. Since (among many differences) they follow a different review process, things that are true about Linux or httpd may not be true about Wikipedia.

In other words, the essence of Wikipedia may be different than that of open source projects. (In fact, the essence of Wikipedia is much more like that of Ward’s Wiki than many would seem to like to admit.)

October 5, 2005

Britannica, Wikipedia and General Reference

I’ve been a contributor to open source projects, an employee of Encyclopædia Britannica, and an observer of (and occasional contributor to) Wikipedia for several years now. Over the years, I’ve noticed a number of misconceptions that Wikipedians have about Britannica, that Britannicians have about Wikipedia, and that the public at large seems to have about both Wikipedia and Britannica, and about general reference sources as a category.

Over the coming weeks, I’m going to attempt to address some of these misconceptions in a series of blog posts. As I do, I’ll update this post with links to the subsequent entries, so that this post can serve as a sort of index of related entries.
