Issue #19 May 2006

Lyceum: One installation, many (open source) blogs


Lyceum is a blogging services system, based on WordPress. Lyceum was conceived, funded, and developed by ibiblio, an online digital library that has been around for more than a decade, was one of the original mirrors of Linux distributions, and is the home of Project Gutenberg, Groklaw, the Linux Documentation Project, iCommons, and more than 1500 other collections.

Fig 1. Admin screen of a default Lyceum installation. Look familiar?

In 2004 and 2005, blogging was becoming a Big Deal, and the blogging software scene began to mature. Ibiblio saw how important blogging was as a new web-enabled medium. Through our experience providing services and support to our collections, and research of available solutions, we saw an obvious deficiency in the blog system market: software that could support multiple blogs from one installation.

In evaluating current offerings, we saw WordPress emerging as the leader in flexibility and popularity. Not only did it have an excellent feature set, but it also had a loyal and sophisticated community of theme and plugin developers.

The solution seemed clear: people needed an easy way to build and manage an arbitrary number of WordPress blogs. After evaluating WordPress' schema, we were happy to find that such a system would scale quite well.

Fig 2. Admin screen of a default WordPress installation. You can see the family resemblance.

Lyceum has evolved into a full-featured blogging services package, offering the following features:

  • One installation with a static number of database tables supports an arbitrary number of users and blogs.
  • Each blog delivers the complete WordPress user experience. Lyceum changes very little in terms of user interface elements or per-blog features.
  • Users and blogs are not tied to one another one-to-one. Any user may have privileges on zero or more blogs, and any blog may have an arbitrary number of users, each with arbitrary permissions.
  • Complete compatibility with WordPress themes (requiring very slight modification).
  • Complete compatibility with WordPress plugins which do not use the database. Those which use the database may be modified to be compatible with Lyceum.
  • Activation of plugins on a system-wide basis.
  • Activation, by each blog administrator, of plugins on a per-blog basis.

The development of Lyceum has been an exciting journey, and it has only just begun. Below I offer a description of how we began the project, and the architectural features we have implemented so far.

Project infrastructure

Any software project requires proper infrastructure. No matter how big or small, no matter how few programmers, there are two things that every software project lasting more than a week should absolutely have: (1) source control, and (2) issue tracking. For these two things, we went the route that many open source projects are going these days: Subversion and Trac. These powerful free tools provide stellar infrastructure for an open source software project.

Subversion is considered by many to be the successor to the venerable CVS. Some of its immediately noticeable advantages over CVS are:

  • Atomic commits. A commit either goes through in full or not at all: if multiple files are being committed at once, and any one of them is out of date and needs to be updated first, none of the changes are committed.
  • The ability to move files within the repository while maintaining the file's history (including information about the move).
  • The ability to delete a file, and then create a new file with the same name in the same path, with a different history.

Trac is a project management system. It has three core functions:

  • Issue-tracking, similar to Bugzilla.
  • Seamless integration with a Subversion repository, providing an intuitive graphical view of changesets.
  • A wiki. Lyceum documentation is developed in an installation of MediaWiki, so we do not use the Trac wiki very much.

One of the things that makes Trac so nice to work with is the ability to link between the objects of each system using simple syntax. For example, when writing a comment on a "ticket" (a bug report or a feature request), the string changeset:101 will render as a link to a graphical view of that changeset (including a visualization of the changes made in each file). Similarly, in the comments of a Subversion commit, one may use the string ticket:42, which will then render as a link to that ticket. Tickets, changesets, and wikipages may all link to one another arbitrarily.

To facilitate the growth of a Lyceum community, we also made sure that mailing lists were in place. We went with the standard trifecta: developer, user, and announcement. For those who want immediate answers and fascinating discussion, #lyceum on freenode is the coolest place to be.

Source management / Code merging

Our incentive for using WordPress is to take advantage of its interface, plugins, and themes. Therefore staying parallel with its codebase is extremely important. When we began to work on Lyceum, WordPress 2 development had just begun to pick up steam. We felt it was important to merge in changes on a weekly basis, to avoid surprises down the road. We needed an easy way to regularly and reliably merge changes from the WordPress trunk.

Our initial solution, while a little hacky, was effective and (mostly) reliable. We found that Subversion is happy enough to merge changes from one repository into another. An example merge command, executed from /src/lyceum/:

svn merge -r2955:2958 http://svn.automattic.com/wordpress/trunk/

This command results in the familiar update output, noting files that were added, deleted, or updated, those that caused a conflict, and those that were merged. Merges are manually documented in /dev/wordpress_patch_history.txt. I call this technique a "foreign merge."

Subversion-savvy readers are now thinking that a foreign merge will result in inconsistent, meaningless, and/or corrupt Subversion metadata, and that it is prohibited in Subversion version 1.3 and beyond. Both of these thoughts are correct. To my knowledge there is no documented or recommended way to use Subversion to perform a foreign merge. This method worked well for us through the last time we used it, when we synchronized with WordPress 2.0.1. After this point, Lyceum's file structure became significantly different from WordPress', as described below in the section on security. To help resolve this issue, it would be ideal if Subversion could:

  • Allow and track the history of copies from an external source into the main repository.
  • Allow the merging of changes from an external source into the main repository, using knowledge of changed file structure in either repository, to apply changes to moved files.

Lyceum has made significant changes to WordPress' file structure, and intends to merge in almost all of its changes for the indefinite future. I have coined a phrase for this type of project: "Benevolent Rogue Heterostructural Parallel Branch." Such projects are certainly few and far between, and therefore their needs will probably not be directly addressed with Subversion features in the near future. We have yet to come up with a perfect solution, but it will probably entail shell scripts hard-coded with file path information, command-line diff and patch, and manual documentation. If any readers have ideas as to how we may more easily manage merges, please contact us at lyceum AT ibiblio DOT org.

Architectural differences between Lyceum and WordPress

Lyceum multiblogifies WordPress and adds few other features. The Lyceum database schema, as one might expect, is very similar to the WordPress schema. The only added infrastructure is used to associate data with blogs.

Fig 3. General admin screen for the whole Lyceum installation

The first thing to notice about the Lyceum schema is the addition of the blogs table. The blogs table has the following columns:

  • id (int) The id number of the blog.
  • slug (varchar) The string that is used in the blog's base URL. http://myslug.blogs.example.com/ or http://blogs.example.com/myslug
  • status (enum: active, deleted) Blogs in Lyceum can be deactivated by a system administrator (from http://example.com/system-admin/blog-management.php?b=system).

Fig 4. General admin screen for one blog in a Lyceum installation

The following tables from the WordPress schema have a blog (int) column added to them, which references a row in the blogs table:

  • categories
  • linkcategories
  • options
  • usermeta

Fig 5. General admin screen for one user in a Lyceum installation

And lastly, a few additional columns have been added:

  • options.optiondomain (enum: system, blog) - This indicates if an option is relevant to the entire Lyceum system, or just one blog
  • usermeta.meta_domain (enum: system, blog) - This indicates if a user meta-property is relevant to the entire Lyceum system, or just one blog
  • users.user_locked (enum: 0,1) - A flag which indicates if a user has been restricted from logging into the system (managed from http://example.com/system-admin/user-management.php)
  • users.user_admin (enum: 0,1) - a flag which denotes if a user is a Lyceum system administrator for the entire system

The rest of the WordPress schema remains unchanged. Those who don't do much DB design may be wondering why there isn't a blog column in more (or all) of the tables. This is because some of the tables are associated with a blog through another table that does have a blog column. For example, all posts in WordPress must have a category. All categories in Lyceum may only be associated with one blog. Therefore, posts are associated with a blog through their categories, and the posts table does not need a blog column.

The URL for each blog in a Lyceum installation includes a "slug," either as a subdomain (http://myslug.blogs.example.com/) or a directory (http://blogs.example.com/myslug), specifying which blog is being accessed. Using mod_rewrite (which is required to run Lyceum), this slug is transformed into a URL variable, b (for "blog"). The magic of this architecture is that, in the development of Lyceum, the vast majority of business logic, including that which generates URLs, did not need to be modified.

Consider the following relative path in WordPress:

file.php?x=5&y=42

This represents the URL:

http://example.com/file.php?x=5&y=42

In Lyceum, the code which generates this URL is unchanged. But since the Lyceum URL already has the base:

http://example.com/myslug/

The link will lead to:

http://example.com/myslug/file.php?x=5&y=42

The rewrite engine will then transform this to:

http://example.com/file.php?b=myslug&x=5&y=42

providing the b variable that is used throughout the system.
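
The transformation above can be sketched in PHP. This is only an emulation for illustration: in Lyceum the work is done by mod_rewrite, not by application code, and the function name rewrite_blog_url() is hypothetical. The sketch assumes the directory-style URL form and that the first path segment is always the slug.

```php
<?php
// Hypothetical helper: emulates, in PHP, the rewrite that the article
// attributes to mod_rewrite. It is not Lyceum's actual rule set.
function rewrite_blog_url(string $url): string
{
    $parts = parse_url($url);
    // Split "/myslug/file.php" into the slug and the remaining path.
    $segments = explode('/', ltrim($parts['path'] ?? '/', '/'), 2);
    $slug = $segments[0];
    $rest = $segments[1] ?? '';
    if ($slug === '') {
        return $url; // no slug present, nothing to rewrite
    }
    // Put b=<slug> in front of whatever query string was already there.
    $query = 'b=' . rawurlencode($slug);
    if (!empty($parts['query'])) {
        $query .= '&' . $parts['query'];
    }
    return $parts['scheme'] . '://' . $parts['host'] . '/' . $rest . '?' . $query;
}

echo rewrite_blog_url('http://example.com/myslug/file.php?x=5&y=42'), "\n";
// prints http://example.com/file.php?b=myslug&x=5&y=42
```

Because the slug is peeled off before WordPress' own code sees the request, the code that builds relative links never needs to know it exists.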

Fig 6. mod_rewrite rules for a Lyceum installation

Included in the collection of unmodified URLs are those generated by the WordPress permalink engine. URL design in WordPress is very flexible and powerful. Through the web interface, WordPress users can use a multitude of tags in many different combinations to create permalinks of their liking. Examples include:

  • http://example.com/<post id>
  • http://example.com/archives/<post id>
  • http://example.com/<year>/<month>/<day>/<post title>
  • http://example.com/<year>/<month> (will show all posts in that specific month)

Furthermore, searches can be done as such:

http://example.com/search/<term1>+<term2>+...

And the RSS feed URL is simple and intuitive:

http://example.com/feed

In Lyceum, all of these features remain usable and adjustable on a per-blog basis, and no URL logic needed to be modified to achieve this. Those familiar with the WordPress source code may be surprised to see the results of performing a diff on WordPress and Lyceum's respective classes.php files.

Although the business logic of WordPress happily lives within Lyceum's schema, it was still necessary to tweak a large portion of the SQL statements in order to be relevant to the new schema. To assist in the modification of the SQL statements, Lyceum uses SQL generator functions for the two most frequent types of commands: those which perform selections on only the posts table, and those which perform selections on only the comments table. Here is the generator for selections on the posts table:

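// Build a SELECT on the posts table that is restricted to the current blog.
// Posts have no blog column of their own, so the blog is reached by joining
// through post2cat to categories.
function make_post_query($columns, $criteria){
	global $blog; // the blog being served by this request
	$sql = "
		SELECT DISTINCT $columns
		FROM posts
			INNER JOIN post2cat ON (post_id = ID)
				INNER JOIN categories ON (category_id = cat_ID)
		WHERE
			blog = '$blog' AND
			$criteria
	;";
	return $sql;
}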

This generator is used within a series of one-liner functions that wrap around existing WordPress database functions. For example, the wrapper for the function get_results() is get_post_results(). Here is an example of a call to the DB in WordPress:

get_results("
	SELECT ID, post_title
	FROM posts
	WHERE post_status = 'draft' AND post_author IN ($editable) AND post_author != '$user_id'
");

And the equivalent action in Lyceum:

get_post_results(
	'ID, post_title',
	"post_status = 'draft' AND post_author IN ($editable) AND post_author != '$user_id'"
);
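
A minimal sketch of what such a one-liner wrapper might look like follows. The body of get_post_results() is an assumption based on the description above, not Lyceum's actual source. To keep the sketch runnable without a database, get_results() is replaced by a stand-in that simply returns the SQL it is given, and the blog id 7 is a made-up value.

```php
<?php
// Stand-in for WordPress' get_results(): returns the SQL instead of
// running it, so the sketch works without a database.
function get_results($sql)
{
    return $sql;
}

// Generator from the article, repeated here so the sketch is self-contained.
function make_post_query($columns, $criteria)
{
    global $blog;
    return "
        SELECT DISTINCT $columns
        FROM posts
            INNER JOIN post2cat ON (post_id = ID)
                INNER JOIN categories ON (category_id = cat_ID)
        WHERE
            blog = '$blog' AND
            $criteria
    ;";
}

// Assumed shape of the wrapper: generate the blog-scoped SQL, then hand
// it to the existing WordPress function unchanged.
function get_post_results($columns, $criteria)
{
    return get_results(make_post_query($columns, $criteria));
}

$GLOBALS['blog'] = 7; // hypothetical blog id
echo get_post_results('ID, post_title', "post_status = 'draft'"), "\n";
```

The appeal of this arrangement is that a call site only has to split its original query into a column list and a WHERE clause; the blog restriction is added in one place.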

All of the columns in the join are unique and indexed (B-tree), so each lookup takes logarithmic time. The worst-case time complexity for the entire join is therefore O(3 × ln(n)), where n is the cardinality of the posts or categories table, whichever is greater. It will take quite a large data set for this query to become a significant problem, and even then the problem can be solved by scaling the hardware linearly with n. For applications where such large data sets may arise, a problem that can be solved by the linear scaling of hardware is not a problem.
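
To give a rough feel for that bound, here is the arithmetic for two hypothetical table sizes. The figures are in abstract units of work per row fetched, taken straight from the 3 ln(n) expression; they are not measurements of any real installation.

```latex
\begin{align*}
n = 10^{6}: &\quad 3\ln(10^{6}) = 18\ln(10) \approx 41.4 \\
n = 10^{9}: &\quad 3\ln(10^{9}) = 27\ln(10) \approx 62.2
\end{align*}
```

Under this model, a thousandfold increase in table size raises the cost of each lookup chain by only about half.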

Add to all of this the fact that systems with gobs of memory are rather affordable these days (a 1U 64 bit server with 16 GB of RAM can be had for just over $10k), and you end up with the MySQL query cache cutting down query overhead by orders of magnitude.

In the future we will be experimenting with the possibility of using views in MySQL 5 to further optimize these queries.

To support large installations, it is essential that Lyceum use InnoDB tables instead of the MyISAM tables typically used in WordPress installations. InnoDB supports row locking (vs. MyISAM's table locking) and other features designed for large, high-demand datasets. Unfortunately, InnoDB does not support full-text search in MySQL 4, and it is important that Lyceum support MySQL 4. In order to support full-text searching, we have created a lone MyISAM table, postsearch, for the purpose of storing redundant searchable content.

Security

WordPress and Lyceum's security needs differ in two ways. The first is the dramatically increased cost associated with an intrusion on Lyceum. A single security breach on a Lyceum installation can affect thousands of blogs and compromise data and services for thousands of users. This makes a breach more devastating, and creates a more desirable target for attackers.

The second is the social circumstances under which the two systems are used. Though WordPress allows for unmoderated and arbitrary account registration, few installations exercise this option. The vast majority of WordPress installations have one or a handful of authors, all of whom know one another personally. Lyceum, on the other hand, will often be used in environments where account registration is either wide open or mapped one-to-one onto some other user namespace (such as a university network). In these situations, there is no baseline of trust that users can expect from one another. Thus, comprehensive security design is a must.

There are two main security enhancements that Lyceum brings to the WordPress codebase. The first is a simple Best Practices redesign of the file structure: all of the non-web-requestable files in the system have been moved out of the web server document root. For open source projects (where there is no security gained by hiding the source code), compromises involving this vulnerability are very rare. Even so, they can be devastating and this is a key part of a "defense in depth" security strategy.

For the reasons mentioned above, it is not as crucial for WordPress to be secured in this manner. In addition, it would also be impractical; many hosting facilities do not allow users to configure the document root of their websites. In these situations it would be very difficult or impossible to install a web application that used files outside of the document root. In fact, this was the #1 complaint when Lyceum was first released. In response, we designed a system that allows the lib, config, and installation directories to be moved into the document root at the user's discretion. See the installation instructions and the comments in src/lyceum/private.php for more information.

The second architectural security enhancement that Lyceum introduces is the comprehensive, system-wide use of security tokens (a.k.a. "nonces") in all POST and mutational GET requests, to defeat cross-site scripting, cross-site request forgeries, and spoofed forms. Each token is a single-use SHA-1 hash that is specific to a user, an action, and sometimes an object. Whenever a page in the admin interface is requested, all of the controls within it that perform data-changing actions have a security token associated with them. Security tokens are included in a hidden field in all forms, and as an HTTP variable in all link URLs.

For example, in the Manage→Posts section of the admin interface there is a list of posts, each with a 'Delete' link to a URL that looks something like this:

http://example.com/blogname/admin/post.php?token=738ee8556d23baa54b2fa28ba6cc96fc85786e9f&action=delete&post=1

The token has been generated randomly:

$token = sha1(uniqid(rand(), TRUE));

And placed into an array in the PHP session, indexed by another hash that has been generated like this:

$key = sha1($targetscript.$action.$id.$userdata->ID);

In this case $targetscript is 'post.php', $action is 'delete', $id is 1, and $userdata->ID is the user id of the current authenticated user. After $token has been generated and placed in the session, the only way for Lyceum to access it is by first generating the proper $key, which requires using the correct user id, which restricts usage of the token to the user who generated it in the first place. Therefore, the security token system is as secure as the authentication system.
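
The issue-and-check cycle described above might be sketched as follows. The two sha1() expressions are the ones quoted in the article; everything else is an assumption made for illustration, including the function names token_key(), issue_token(), and consume_token(), the 'tokens' array layout, and the choice to discard a token after the first attempt. A plain array stands in for $_SESSION so the sketch runs from the command line.

```php
<?php
// Hypothetical helper: the session index for one (script, action, object, user).
function token_key($targetscript, $action, $id, $user_id)
{
    return sha1($targetscript . $action . $id . $user_id);
}

// Generate a random token and remember it in the session under its key.
function issue_token(array &$session, $targetscript, $action, $id, $user_id)
{
    $token = sha1(uniqid((string) rand(), true));
    $session['tokens'][token_key($targetscript, $action, $id, $user_id)] = $token;
    return $token;
}

// Check a submitted token. The stored token is removed on the first
// attempt, whether or not it matched, so it can never be replayed.
function consume_token(array &$session, $token, $targetscript, $action, $id, $user_id)
{
    $key = token_key($targetscript, $action, $id, $user_id);
    $expected = $session['tokens'][$key] ?? null;
    unset($session['tokens'][$key]);
    return $expected !== null && hash_equals($expected, $token);
}

$session = []; // stand-in for $_SESSION
$token = issue_token($session, 'post.php', 'delete', 1, 42);
var_dump(consume_token($session, $token, 'post.php', 'delete', 1, 42)); // bool(true)
var_dump(consume_token($session, $token, 'post.php', 'delete', 1, 42)); // bool(false)
```

The second call fails because the token was already spent, which is what makes a captured 'Delete' URL useless to anyone who replays it.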

Currently, the Lyceum authentication system is essentially identical to that of WordPress. It stores a local double-hashed password in a cookie, and authenticates on every request. This is problematic: if the cookie is stolen, it may be used by an attacker indefinitely, unless the user changes their password. It is also somewhat less efficient to pass a username and password back and forth instead of a single session key. The solution, which Lyceum aims to implement in a forthcoming release, is to store the authentication information in a session, which periodically times out and then refreshes itself automatically. Then, if a session key is stolen, it is only good until the true user logs in again (with a stale session key). The true user will be prompted to reauthenticate, therefore canceling the stolen session being used by the attacker.

There are various other security features in Lyceum that are part of a "defense in depth" security strategy. One is the storing of session information in a database table instead of in temporary system files. This dramatically reduces the vulnerability of session data files, particularly in a shared hosting environment running PHP as an Apache module, where access to session data in /tmp by other users on the machine is trivial. Another feature requires the system admin user (the 'root' user of a Lyceum installation) to authenticate from a fixed list of IP addresses.
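
The IP restriction on the system admin user could be as simple as the following check. This is a sketch under assumptions: the function name admin_ip_allowed() is hypothetical, the addresses are placeholders from the documentation ranges, and how Lyceum actually stores and matches its list is not described here.

```php
<?php
// Hypothetical check: allow the system admin to authenticate only when
// the request comes from one of a fixed list of addresses.
function admin_ip_allowed(string $remote_addr, array $allowed_ips): bool
{
    // Strict comparison, so only exact string matches count.
    return in_array($remote_addr, $allowed_ips, true);
}

$allowed = ['192.0.2.10', '198.51.100.7']; // placeholder addresses
var_dump(admin_ip_allowed('192.0.2.10', $allowed));  // bool(true)
var_dump(admin_ip_allowed('203.0.113.5', $allowed)); // bool(false)
```

In a web request the first argument would typically come from $_SERVER['REMOTE_ADDR'].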

Spam

Blog spam is an ever-increasing problem for the Web. 87% of blog comments are spam ["Live Spam Zeitgeist," 8 May 2006]. Most single-blog systems available today include anti-spam tools, with varying degrees of effectiveness. For blogging services systems such as Lyceum, quality of service, resource availability, and consistency of user experience for hundreds or thousands of users are a primary concern. A comprehensive and evolving set of anti-spam tools is essential. The Lyceum anti-spam strategy combines several standard anti-spam measures with a few that are specific to the needs of a large-scale blogging services system. Some of the features below are already implemented, and all of them will be included in future releases:

  • Obfuscation of the name of the file wp-comments-post.php. The name of this file is changed periodically, to defeat bots that trawl the web looking specifically for wp-comments-post.php, or that find the receiving script of a given comment system and then continue to reuse it without re-checking. An added benefit is reduced processing load on the server. If certain bots are not even accessing an available resource in the first place, then none of the other comment or spam logic needs to run at all.
  • The same one-use token system described in the security section above is used for comments. This defeats bots that do not reload the comment form with each spam submission to learn the new hidden field value.
  • Requiring the client to calculate random, trivial JavaScript arithmetic before submitting a comment. This will defeat bots that do not execute JavaScript.
  • Requiring process-intensive JavaScript to be executed per-comment, which will be barely perceptible to human commenters but will dramatically impact the efficacy of a bot-driven spam campaign. This strategy is known as Hashcash [Hashcash.org].
  • WordPress already limits the rate at which a given IP address may comment on a blog. Lyceum expands this concept to per-IP rate-limiting on an entire Lyceum installation.
  • We will also explore the possibility of installation-wide rate limiting on other things, such as identical message bodies or email addresses.
  • Commenters' IP addresses are checked against open proxy databases such as Blitzed.
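
As an illustration of the Hashcash-style item in the list above, here is a simplified proof-of-work check. It is a sketch under stated assumptions, not Lyceum's code and not the real Hashcash stamp format: the function names are hypothetical, the challenge string is made up, and difficulty is counted in leading zero hex digits of a SHA-1 digest rather than in bits. The solver is written in PHP only so the example is self-contained; in the scheme described above that work would be done by JavaScript in the commenter's browser.

```php
<?php
// Server side: one hash to verify that the client did the work.
function pow_is_valid(string $challenge, string $nonce, int $difficulty): bool
{
    $digest = sha1($challenge . ':' . $nonce);
    return strncmp($digest, str_repeat('0', $difficulty), $difficulty) === 0;
}

// Stand-in for the client: try nonces until one satisfies the check.
function pow_solve(string $challenge, int $difficulty): string
{
    for ($n = 0; ; $n++) {
        if (pow_is_valid($challenge, (string) $n, $difficulty)) {
            return (string) $n;
        }
    }
}

$challenge = 'comment-form-1234';  // hypothetical; would be random per form
$nonce = pow_solve($challenge, 3); // about 16^3 = 4096 hashes on average
var_dump(pow_is_valid($challenge, $nonce, 3)); // bool(true)
```

The asymmetry is the point: the commenter pays thousands of hashes per comment, while the server pays one hash to check each submission.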

With all of the above strategies in place, including the JavaScript components, these anti-spam measures will eliminate all spam originating from bots. For now. The spam arms race never ends.

Some administrators may not want to require that a user have JavaScript enabled to post a comment. This will allow a portion of spam bots to penetrate the system. An even bigger problem is human spammers. For this kind of spam, other heuristic techniques need to be employed. There are a variety of stand-alone anti-spam plugins for WordPress that will work equally well in Lyceum. Some may even be made more effective because the dataset of the entire Lyceum installation can be used to help identify spam.

An anti-spam solution that stands out above the others is Akismet, produced by Automattic, the people who develop WordPress. Akismet is an anti-spam system that consists of a WordPress plugin, and a centralized service, through which comments are filtered. The heuristics employed by Akismet are unpublished, but our blackbox evaluation has shown the system to be very effective.

We are excited to see the ways in which people use Lyceum. The feedback we have gotten so far has been very positive, and already we are benefiting significantly from bug reports and patches being submitted by our users, even so early in the project: a testament to the power of open source. If you would like to learn more about Lyceum, please explore the links below, or to discuss this article, visit this thread.

More information

About the author

John Joseph Bachir is a computer science graduate of Rice University, and is the chief architect of Lyceum. He currently works at ibiblio.org, one of the nation's premier public domain resources and research labs. Ibiblio is located at the University of North Carolina at Chapel Hill and is a collaborative project of the School of Information and Library Science and the School of Journalism and Mass Communication. Ibiblio hosts the Lyceum project, among many others.