Issue #19 May 2006
Lyceum is a blogging services system, based on WordPress. Lyceum was conceived, funded, and developed by ibiblio, an online digital library that has been around for more than a decade, was one of the original mirrors of Linux distributions, and is the home of Project Gutenberg, Groklaw, the Linux Documentation Project, iCommons, and more than 1500 other collections.
In 2004 and 2005, blogging was becoming a Big Deal, and the blogging software scene began to mature. Ibiblio saw how important blogging was as a new web-enabled medium. Through our experience providing services and support to our collections, and research of available solutions, we saw an obvious deficiency in the blog system market: software that could support multiple blogs from one installation.
In evaluating current offerings, we saw WordPress emerging as the leader in flexibility and popularity. Not only did it have an excellent feature set, but it also had a loyal and sophisticated community of theme and plugin developers.
The solution seemed clear: people needed an easy way to build and manage an arbitrary number of WordPress blogs. After evaluating WordPress' schema, we were happy to find that such a system would scale quite well.
Lyceum has evolved into a full-featured blogging services package, offering the following features:
The development of Lyceum has been an exciting journey, and it has only just begun. Below I offer a description of how we began the project, and the architectural features we have implemented so far.
Any software project requires proper infrastructure. No matter how big or small, no matter how few programmers, there are two things that every software project lasting more than a week should absolutely have: (1) source control, and (2) issue tracking. For these two things, we went the route that many open source projects are going these days: Subversion and Trac. These powerful free tools provide stellar infrastructure for an open source software project.
Subversion is considered by many to be the successor to the venerable CVS. Some of its immediately noticeable advantages over CVS are:
Trac is a project management system. It has three core functions:
One of the things that makes Trac so nice to work with is the ability to link between the objects of each system using simple syntax. For example, when writing a comment on a "ticket" (a bug report or a feature request), the string changeset:101 will render as a link to a graphical view of that changeset (including a visualization of the changes made in each file). Similarly, in the comments of a Subversion commit, one may use the string ticket:42, which will then render as a link to that ticket. Tickets, changesets, and wikipages may all link to one another arbitrarily.
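To make the linking idea concrete, here is a minimal sketch in PHP. It is not Trac's own code; link_trac_refs() and the base URL are hypothetical, and the sketch only covers the ticket:N and changeset:N forms mentioned above.

```php
<?php
// Hypothetical re-implementation of the idea, not Trac's own code.
// $base is an assumed URL for the Trac installation.
function link_trac_refs(string $text, string $base): string
{
    // Matches "ticket:42" or "changeset:101" as whole words.
    return preg_replace_callback(
        '/\b(ticket|changeset):(\d+)\b/',
        function (array $m) use ($base): string {
            // $m[1] is the object type, $m[2] is its number.
            $url = $base . '/' . $m[1] . '/' . $m[2];
            return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
        },
        $text
    );
}

echo link_trac_refs('Fixed in changeset:101, see ticket:42.', 'http://trac.example.com'), "\n";
```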
To facilitate the growth of a Lyceum community, we also made sure that mailing lists were in place. We went with the standard trifecta: developer, user, and announcement. For those who want immediate answers and fascinating discussion, #lyceum on freenode is the coolest place to be.
Our incentive for using WordPress is to take advantage of its interface, plugins, and themes. Therefore staying parallel with its codebase is extremely important. When we began to work on Lyceum, WordPress 2 development had just begun to pick up steam. We felt it was important to merge in changes on a weekly basis, to avoid surprises down the road. We needed an easy way to regularly and reliably merge changes from the WordPress trunk.
Our initial solution, while a little hacky, was effective and (mostly) reliable. We found that Subversion is happy enough to merge changes from one repository into another. An example merge command, executed from /src/lyceum/:
svn merge -r2955:2958 http://svn.automattic.com/wordpress/trunk/
This command results in the familiar update output, noting files that were added, deleted, or updated, those that caused a conflict, and those that were merged. Merges are manually documented in /dev/wordpress_patch_history.txt. I call this technique a "foreign merge."
Subversion-savvy readers are now thinking that a foreign merge will result in inconsistent, meaningless, and/or corrupt Subversion metadata, and that it is prohibited in Subversion version 1.3 and beyond. Both of these thoughts are correct. To my knowledge there is no documented or recommended way to use Subversion to perform a foreign merge. This method worked well for us through the last time we used it, when we synchronized with WordPress 2.0.1. After this point, Lyceum's file structure became significantly different from WordPress', as described below in the section on security. To help resolve this issue, it would be ideal if Subversion could:
Lyceum has made significant changes to WordPress' file structure, and intends to merge in almost all of its changes for the indefinite future. I have coined a phrase for this type of project: "Benevolent Rogue Heterostructural Parallel Branch." Such projects are certainly few and far between, and therefore their needs will probably not be directly addressed with Subversion features in the near future. We have yet to come up with a perfect solution, but it will probably entail shell scripts hard-coded with file path information, command-line diff and patch, and manual documentation. If any readers have ideas as to how we may more easily manage merges, please contact us at lyceum AT ibiblio DOT org.
Lyceum multiblogifies WordPress and adds few other features. The Lyceum database schema, as one might expect, is very similar to the WordPress schema. The only added infrastructure is used to associate data with blogs.
The first thing to notice about the Lyceum schema is the addition of the blogs table. The blogs table has the following columns:
id (int): The id number of the blog.
slug (varchar): The string that is used in the blog's base URL, as in http://myslug.blogs.example.com/ or http://blogs.example.com/myslug
status (enum: active, deleted): Blogs in Lyceum can be deactivated by a system administrator (from http://example.com/system-admin/blog-management.php?b=system).

The following tables from the WordPress schema have a blog (int) column added to them, which references a row in the blogs table:
And lastly, a few additional columns have been added:
The rest of the WordPress schema remains unchanged. Those who don't do much DB design may be wondering why there isn't a blog column in more (or all) of the tables. This is because some of the tables are associated with a blog through another table that does have a blog column. For example, all posts in WordPress must have a category. All categories in Lyceum may only be associated with one blog. Therefore, posts are associated with a blog through their categories, and the posts table does not need a blog column.
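The indirection can be sketched with a few in-memory rows. This is only an illustration: the column names are taken from the queries shown later in this article, the sample data and the blog_of_post() helper are hypothetical, and it assumes the categories table is one of those carrying the blog column.

```php
<?php
// Hypothetical in-memory rows; column names follow the article's queries.
$categories = [
    ['cat_ID' => 1, 'blog' => 10],
    ['cat_ID' => 2, 'blog' => 20],
];
$post2cat = [
    ['post_id' => 100, 'category_id' => 1],
    ['post_id' => 101, 'category_id' => 2],
];

// Find the blog a post belongs to by following post -> category -> blog.
function blog_of_post(int $postId, array $post2cat, array $categories): ?int
{
    foreach ($post2cat as $link) {
        if ($link['post_id'] !== $postId) {
            continue;
        }
        foreach ($categories as $category) {
            if ($category['cat_ID'] === $link['category_id']) {
                return $category['blog'];
            }
        }
    }
    return null; // no category found, so no blog can be derived
}

var_dump(blog_of_post(101, $post2cat, $categories)); // int(20)
```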
The URL for each blog in a Lyceum installation includes a "slug," either as a subdomain (http://myslug.blogs.example.com/) or a directory (http://blogs.example.com/myslug), specifying which blog is being accessed. Using mod_rewrite (which is required to run Lyceum), this slug is transformed into a URL variable, b (for "blog"). The magic of this architecture is that, in the development of Lyceum, the vast majority of business logic, including that which generates URLs, did not need to be modified.
Consider the following relative path in WordPress:
file.php?x=5&y=42
This represents the URL:
http://example.com/file.php?x=5&y=42
In Lyceum, the code which generates this URL is unchanged. But since the Lyceum URL already has the base:
http://example.com/myslug/
The link will lead to:
http://example.com/myslug/file.php?x=5&y=42
The rewrite engine will then transform this to:
http://example.com/file.php?b=myslug&x=5&y=42
providing the b variable that is used throughout the system.
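The transformation above can be mimicked in a few lines of PHP. Lyceum itself does this with mod_rewrite rules; the rewrite_blog_path() helper below is a hypothetical stand-in that only handles the directory-style URLs from the example.

```php
<?php
// Hypothetical helper that mimics, in plain PHP, what the rewrite rule does
// for directory-style blog URLs: /<slug>/<rest>?<query> becomes
// /<rest>?b=<slug>&<query>. Lyceum itself relies on mod_rewrite for this.
function rewrite_blog_path(string $requestUri): string
{
    $path  = (string) parse_url($requestUri, PHP_URL_PATH);
    $query = (string) parse_url($requestUri, PHP_URL_QUERY);

    // Split "/myslug/file.php" into the slug and the remaining path.
    if (!preg_match('#^/([^/]+)/?(.*)$#', $path, $m)) {
        return $requestUri; // no slug present, leave the request untouched
    }
    [, $slug, $rest] = $m;

    // Prepend the b variable and keep the original query string as-is.
    $newQuery = 'b=' . rawurlencode($slug) . ($query !== '' ? '&' . $query : '');
    return '/' . $rest . '?' . $newQuery;
}

echo rewrite_blog_path('/myslug/file.php?x=5&y=42'), "\n";
// prints /file.php?b=myslug&x=5&y=42
```

Because the slug is removed before the application sees the request, code that builds relative links never needs to know which blog it is serving.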
Included in the collection of unmodified URLs are those generated by the WordPress permalink engine. URL design in WordPress is very flexible and powerful. Through the web interface, WordPress users can use a multitude of tags in many different combinations to create permalinks of their liking. Examples include:
Furthermore, searches can be done as such:
http://example.com/search/<term1>+<term2>+...
And the RSS feed URL is simple and intuitive:
http://example.com/feed
In Lyceum, all of these features remain usable and adjustable on a per-blog basis, and no URL logic needed to be modified to achieve this. Those familiar with the WordPress source code may be surprised to see the results of performing a diff on WordPress and Lyceum's respective classes.php files.
Although the business logic of WordPress happily lives within Lyceum's schema, it was still necessary to tweak a large portion of the SQL statements to make them relevant to the new schema. To assist in the modification of the SQL statements, Lyceum uses SQL generator functions for the two most frequent types of commands: those which perform selections on only the posts table, and those which perform selections on only the comments table. Here is the generator for selections on the posts table:
function make_post_query($columns, $criteria){
    // $blog identifies the blog being accessed.
    global $blog;
    // Join posts to their categories so the selection is limited to that blog.
    $sql = "
        SELECT DISTINCT $columns
        FROM posts
        INNER JOIN post2cat ON (post_id = ID)
        INNER JOIN categories ON (category_id = cat_ID)
        WHERE
            blog = '$blog' AND
            $criteria
        ;";
    return $sql;
}
This generator is used within a series of one-liner functions that wrap around existing WordPress database functions. For example, the wrapper for the function get_results() is get_post_results(). Here is an example of a call to the DB in WordPress:
get_results("
    SELECT ID, post_title
    FROM posts
    WHERE post_status = 'draft' AND post_author IN ($editable) AND post_author != '$user_id'
");
And the equivalent action in Lyceum:
get_post_results( 'ID, post_title', "post_status = 'draft' AND post_author IN ($editable) AND post_author != '$user_id'" );
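A wrapper of this kind might look like the sketch below. The body of get_post_results() is an assumption inferred from the call above, not Lyceum's actual source, and get_results() is replaced with a stand-in that returns the SQL instead of running it (in WordPress itself, get_results() is a method of the database object). The blog id is a made-up value.

```php
<?php
// Stand-in for the WordPress database call so the sketch runs on its own:
// it simply returns the SQL it was given instead of executing it.
function get_results(string $sql)
{
    return $sql;
}

// Same generator as in the article, repeated here so the example is complete.
function make_post_query($columns, $criteria)
{
    global $blog;
    return "
        SELECT DISTINCT $columns
        FROM posts
        INNER JOIN post2cat ON (post_id = ID)
        INNER JOIN categories ON (category_id = cat_ID)
        WHERE
            blog = '$blog' AND
            $criteria
        ;";
}

// Assumed shape of the one-liner wrapper: build the blog-scoped SQL,
// then hand it to the original database function.
function get_post_results($columns, $criteria)
{
    return get_results(make_post_query($columns, $criteria));
}

$GLOBALS['blog'] = 7; // hypothetical id of the current blog
echo get_post_results('ID, post_title', "post_status = 'draft'"), "\n";
```

The caller supplies only the column list and the criteria; the join and the blog restriction are added in one place.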
All of the columns in the join are unique and indexed (B-tree), providing unfettered logarithmic access time to the data location. So the worst-case time complexity for the entire join is O(3 × ln(n)), that is, O(log n), where n is the cardinality of the posts or categories table, whichever is greater. It will take quite a large data set for this query to become a significant problem, and even then the problem can be solved by scaling the hardware linearly with n. For applications where such large data sets may arise, a problem that can be solved by the linear scaling of hardware is not a problem.
Add to all of this the fact that systems with gobs of memory are rather affordable these days (a 1U 64 bit server with 16 GB of RAM can be had for just over $10k), and you end up with the MySQL query cache cutting down query overhead by orders of magnitude.
In the future we will be experimenting with the possibility of using views in MySQL 5 to further optimize these queries.
To support large installations, it is essential that Lyceum use InnoDB tables instead of the MyISAM tables typically used in WordPress installations. InnoDB supports row-level locking (vs. MyISAM's table-level locking) and other features designed for large, high-demand datasets. Unfortunately, InnoDB does not support full-text search in MySQL 4, and it is important that Lyceum support MySQL 4. (InnoDB full-text indexes did not arrive until MySQL 5.6.) In order to support full-text searching, we have created a lone MyISAM table, postsearch, for the purpose of storing redundant searchable content.
WordPress and Lyceum's security needs differ in two ways. The first is the dramatically increased cost associated with an intrusion on Lyceum. A single security breach on a Lyceum installation can affect thousands of blogs and compromise data and services for thousands of users. This makes a breach more devastating, and creates a more desirable target for attackers.
The second is the social circumstances under which the two systems are used. Though WordPress allows for unmoderated and arbitrary account registration, few installations exercise this option. The vast majority of WordPress installations have one or a handful of authors, all of whom know one another personally. Lyceum, on the other hand, will often be used in environments where account registration is either wide open, or isomorphically associated with some other user namespace (such as a university network). In these situations, there is no lower bar of trust that users can expect from one another. Thus, comprehensive security design is a must.
There are two main security enhancements that Lyceum brings to the WordPress codebase. The first is a simple Best Practices redesign of the file structure: all of the non-web-requestable files in the system have been moved out of the web server document root. For open source projects (where there is no security gained by hiding the source code), compromises involving this vulnerability are very rare. Even so, they can be devastating and this is a key part of a "defense in depth" security strategy.
For the reasons mentioned above, it is not as crucial for WordPress to be secured in this manner. In addition, it would also be impractical; many hosting facilities do not allow users to configure the document root of their websites. In these situations it would be very difficult or impossible to install a web application that used files outside of the document root. In fact, this was the #1 complaint when Lyceum was first released. In response, we designed a system that allows the lib, config, and installation directories to be moved into the document root at the user's discretion. See the installation instructions and the comments in src/lyceum/private.php for more information.
The second architectural security enhancement that Lyceum introduces is the comprehensive, system-wide use of security tokens (a.k.a. "nonces") in all POST and mutational GET requests, to defeat cross-site scripting, cross-site request forgeries, and spoofed forms. Each token is a one-use, per-user, per-action, and sometimes per-object SHA-1 hash. Whenever a page in the admin interface is requested, all of the controls within it that perform data-changing actions have a security token associated with them. Security tokens are included in a hidden field in all forms, and as an HTTP variable in all link URLs.
For example, in the Manage→Posts section of the admin interface there is a list of posts, each with a 'Delete' link to a URL that looks something like this:
http://example.com/blogname/admin/post.php?token=738ee8556d23baa54b2fa28ba6cc96fc85786e9f&action=delete&post=1
The token has been generated randomly:
$token = sha1(uniqid(rand(), TRUE));
And placed into an array in the php session, indexed by another hash that has been generated like this:
$key = sha1($targetscript.$action.$id.$userdata->ID);
In this case $targetscript is 'post.php', $action is 'delete', $id is 1, and $userdata->ID is the user id of the current authenticated user. After $token has been generated and placed in the session, the only way for Lyceum to access it is by first generating the proper $key, which requires using the correct user id, which restricts usage of the token to the user who generated it in the first place. Therefore, the security token system is as secure as the authentication system.
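The issue-and-check cycle described above can be sketched as follows. This is an illustration under stated assumptions rather than Lyceum's code: the function names are hypothetical, a plain array stands in for the PHP session, and the choice to discard a token after its first check is one reading of "one-use."

```php
<?php
// Hypothetical helpers; $session stands in for PHP's $_SESSION so the
// sketch can run without starting a real session.

// The key ties a token to one script, action, object, and user.
function token_key(string $targetscript, string $action, int $id, int $userId): string
{
    return sha1($targetscript . $action . $id . $userId);
}

// Generate a random token, as in the article, and store it under its key.
function issue_token(array &$session, string $targetscript, string $action, int $id, int $userId): string
{
    $token = sha1(uniqid(rand(), true));
    $session['tokens'][token_key($targetscript, $action, $id, $userId)] = $token;
    return $token;
}

// Accept a token only if it matches the stored one, then discard it so
// that it cannot be used a second time.
function consume_token(array &$session, string $token, string $targetscript, string $action, int $id, int $userId): bool
{
    $key = token_key($targetscript, $action, $id, $userId);
    if (!isset($session['tokens'][$key])) {
        return false; // nothing was issued for this script/action/object/user
    }
    $valid = hash_equals($session['tokens'][$key], $token);
    unset($session['tokens'][$key]); // one-use: gone after the first check
    return $valid;
}

$session = [];
$token = issue_token($session, 'post.php', 'delete', 1, 42);
var_dump(consume_token($session, $token, 'post.php', 'delete', 1, 42)); // bool(true)
var_dump(consume_token($session, $token, 'post.php', 'delete', 1, 42)); // bool(false)
```

The second call fails because the token was removed from the session by the first, so a replayed link does nothing.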
Currently, the Lyceum authentication system is essentially identical to that of WordPress. It stores a local double-hashed password in a cookie, and authenticates on every request. This is problematic: if the cookie is stolen, it may be used by an attacker indefinitely, unless the user changes their password. It is also somewhat less efficient to pass a username and password back and forth instead of a single session key. The solution, which Lyceum aims to implement in a forthcoming release, is to store the authentication information in a session, which periodically times out and then refreshes itself automatically. Then, if a session key is stolen, it is only good until the true user logs in again (with a stale session key). The true user will be prompted to reauthenticate, therefore canceling the stolen session being used by the attacker.
There are various other security features in Lyceum that are part of a "defense in depth" security strategy. One is the storing of session information in a database table instead of in temporary system files. This dramatically reduces the vulnerability of session data files, particularly in a shared hosting environment running PHP as an Apache module, where access to session data in /tmp by other users on the machine is trivial. Another feature requires the system admin user (the 'root' user of a Lyceum installation) to authenticate from a fixed list of IP addresses.
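The address restriction for the system admin user amounts to a simple membership test. The sketch below is hypothetical (the function name and the addresses are invented for illustration) and is not taken from Lyceum's source.

```php
<?php
// Hypothetical check: the system admin may only authenticate from listed addresses.
function admin_ip_allowed(string $remoteAddr, array $allowedIps): bool
{
    // Strict comparison avoids loose string/number matches.
    return in_array($remoteAddr, $allowedIps, true);
}

$allowed = ['192.0.2.10', '192.0.2.11']; // example addresses from a documentation range
var_dump(admin_ip_allowed('192.0.2.10', $allowed));  // bool(true)
var_dump(admin_ip_allowed('203.0.113.5', $allowed)); // bool(false)
```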
Blog spam is an ever increasing problem for the Web. 87% of blog comments are spam ["Live Spam Zeitgeist," 8 May 2006]. Most single-blog systems available today include anti-spam tools, with varying degrees of effectiveness. For blogging services systems such as Lyceum, quality of service, resource availability, and consistency of user experience for hundreds or thousands of users are a primary concern. A comprehensive and evolving set of anti-spam tools is essential. The Lyceum anti-spam strategy encapsulates several standard anti-spam measures, and a few that are specific to the needs of a large-scale blogging services system. Some of the below features are already implemented, and all of them will be included in future releases:
If using the above strategies, including the Javascript components, these anti-spam measures will eliminate all spam originating from bots. For now. The spam arms race never ends.
Some administrators may not want to require that a user have Javascript enabled to post a comment. This will allow a portion of spam bots to penetrate the system. An even bigger problem is human spammers. For this kind of spam, other heuristic techniques need to be employed. There are a variety of stand-alone anti-spam plugins for WordPress that will work equally well in Lyceum. Some may even be made more effective because the dataset of the entire Lyceum installation can be used to help identify spam.
An anti-spam solution that stands out above the others is Akismet, produced by Automattic, the people who develop WordPress. Akismet is an anti-spam system that consists of a WordPress plugin, and a centralized service, through which comments are filtered. The heuristics employed by Akismet are unpublished, but our blackbox evaluation has shown the system to be very effective.
We are excited to see the ways in which people use Lyceum. The feedback we have gotten so far has been very positive, and already we are benefiting significantly from bug reports and patches being submitted by our users, even so early in the project. This is a testament to the power of open source. If you would like to learn more about Lyceum, please explore the links below. To discuss this article, visit this thread.
John Joseph Bachir is a computer science graduate of Rice University, and is the chief architect of Lyceum. He currently works at ibiblio.org, one of the nation's premier public domain resources and research labs. Ibiblio is located at the University of North Carolina at Chapel Hill and is a collaborative project of the School of Information and Library Science and the School of Journalism and Mass Communication. Ibiblio hosts the Lyceum project, among many others.