
OSSNET - Proposal for Swarming Data Propagation



(Alan Cox mentioned a theoretical idea for bittorrent in data
propagation for yum... so this seemed like the most appropriate time to post this again. Comments would be greatly welcomed.)



OSSNET Proposal
October 28, 2003
Warren Togami <warren togami com>

The following describes my proposal for the "OSSNET" swarming data
propagation network. This was originally posted to mirror-list-d
during April 2003. This proposal has been cleaned up a bit and
amended.

Unified Namespace
=================
This can be shared with all Open Source projects and distributions.
Imagine this type of unified namespace for theoretical protocol "ossnet".

ossnet://%{publisher}/path/to/data
Where %{publisher} is the vendor or project's master tracker.
The client finds it with standard DNS.

Examples:
ossnet://swarm.redhat.com/linux/fedora/1/en/iso/i386/
ossnet://ossnet.kernel.org/pub/linux/kernel/
ossnet://swarm.openoffice.org/stable/1.2beta/
ossnet://central.debian.org/dists/woody/
ossnet://swarm.k12ltsp.org/3.1.1/
ossnet://master.mozilla.org/mozilla1.7/
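To make the namespace idea concrete, here is a minimal sketch of how a client might split such a URI into the publisher (the master tracker's host name) and the path. The "ossnet" scheme is the theoretical protocol from this proposal, and the function name parse_ossnet_uri is made up for illustration. The DNS lookup of the publisher is left out on purpose.

```python
from urllib.parse import urlsplit


def parse_ossnet_uri(uri):
    """Split an ossnet:// URI into (publisher, path).

    "ossnet" is the proposal's hypothetical scheme. The publisher is the
    host name of the master tracker, which a client would then resolve
    with standard DNS (not done here).
    """
    parts = urlsplit(uri)
    if parts.scheme != "ossnet":
        raise ValueError("not an ossnet URI: %r" % uri)
    if not parts.hostname:
        raise ValueError("missing publisher in %r" % uri)
    # An empty path means the root of the publisher's repository.
    return parts.hostname, parts.path or "/"


publisher, path = parse_ossnet_uri("ossnet://ossnet.kernel.org/pub/linux/kernel/")
print(publisher, path)
```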

Each project tracker has their own official data source with the entire
repository GPG signed for automatic ossnet client verification.

Phase 1 - Swarming for Mirrors only
===================================
Initial implementation would be something like rsync, except swarming
like bittorrent and used only for mirror propagation. It may need
encryption, some kind of access control, and tracking in order to
prevent intrusion, i.e. hold new release secret until release day.

(This paragraph below about access control and encryption was written after the release of RH9, and the failure of "Easy ISO" early access due to bandwidth overloading and bittorrent. In the new Fedora episteme this access control stuff may actually not be needed anymore. We can perhaps implement OSSNET without it at first.)

I believe access control can be done with the central tracker (i.e. Red
Hat) generating public/private keys, and giving the public key to the
mirror maintainers. Each mirror maintainer would choose which
directories they want to permanently mirror, and which to exclude. Each
mirror server that communicates with another mirror would first need to
verify identity with the master tracker somehow. If somebody leaks
before a release, they can be punished by revoking their key, then the
master tracker and other mirrors will reject them.
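As a rough illustration of the bookkeeping only, the sketch below models a master tracker that remembers which mirror keys are authorized for which directories and rejects a mirror once its key is revoked. Everything here is an assumption: the class name, the method names, and the use of plain strings as key fingerprints. A real implementation would verify cryptographic signatures rather than compare strings.

```python
class MasterTracker:
    """Toy model of the tracker's authorization records (hypothetical)."""

    def __init__(self):
        self._authorized = {}   # key fingerprint -> set of mirrored directories
        self._revoked = set()   # fingerprints of punished mirrors

    def authorize(self, fingerprint, directories):
        """Record the directories a mirror maintainer chose to carry."""
        self._authorized[fingerprint] = set(directories)

    def revoke(self, fingerprint):
        """Punish a leak: this key is rejected from now on."""
        self._revoked.add(fingerprint)

    def may_serve(self, fingerprint, directory):
        """Other mirrors would ask this before exchanging data with a peer."""
        if fingerprint in self._revoked:
            return False
        return directory in self._authorized.get(fingerprint, set())


tracker = MasterTracker()
tracker.authorize("MIRROR-A", ["/linux/fedora/1/"])
print(tracker.may_serve("MIRROR-A", "/linux/fedora/1/"))
tracker.revoke("MIRROR-A")
print(tracker.may_serve("MIRROR-A", "/linux/fedora/1/"))
```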

Even without the encryption/authorization part this would be powerful.
This would make mirror propagation far faster while dramatically
reducing load on the master mirror. Huge money savings for the data
publisher... but it gets better.


Phase 2 - Swarming clients for users
====================================
I was also thinking about end-user swarming clients. up2date, apt or
yum could integrate this functionality, and this would work well because
they already maintain local caches. The protocol described above would
need to behave differently for end-users in several ways.

Other than the package manager tools, a simple "wget" like program would be best for ISO downloads.

Unauthenticated clients could join the swarm with upload turned off by
default and encryption turned off (reduce server CPU usage). Most users
don't want to upload, and that's okay because the Linux mirrors are
always swarming outgoing data. Clients can optionally turn on upload,
set an upload rate cap, and specify network subnets where uploading is
allowed. This would allow clients within an organization to act as
caches for each other, or a network administrator could set up a client
running as a swarm cache server uploading only to the LAN, saving tons
of ISP bandwidth. A DSL/cable modem ISP would be easy to convince to
set up their own cache server to efficiently serve their customers. This
is because setting up a server can be done quickly & unofficially.
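The client-side upload settings described above could be sketched roughly like this. The names UploadPolicy, rate_cap_kbps and allowed_subnets are invented for illustration. One assumption worth flagging: if upload is enabled but no subnets are listed, this sketch allows uploading to anyone.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class UploadPolicy:
    """Hypothetical end-user upload settings; upload is off by default."""
    enabled: bool = False
    rate_cap_kbps: int = 0                               # 0 means "no cap set"
    allowed_subnets: list = field(default_factory=list)  # e.g. ["192.168.1.0/24"]

    def may_upload_to(self, peer_ip):
        """Decide whether this client would upload to the given peer."""
        if not self.enabled:
            return False
        if not self.allowed_subnets:
            return True  # assumption: no restriction listed means any peer
        addr = ipaddress.ip_address(peer_ip)
        return any(addr in ipaddress.ip_network(net)
                   for net in self.allowed_subnets)


# A LAN cache server that only uploads inside the organization.
lan_cache = UploadPolicy(enabled=True, rate_cap_kbps=512,
                         allowed_subnets=["192.168.1.0/24"])
print(lan_cache.may_upload_to("192.168.1.20"))
print(lan_cache.may_upload_to("8.8.8.8"))
```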

Clients joining the swarm would greatly complicate things because the
protocol would need to know about "nearby" nodes, like your nearest
swarming mirror or your LAN cache server. This may need to be a
configuration option for end-user clients. These clients would need to
make more intelligent use of nearby caches rather than randomly swarm
packets from hosts over the ISP link. The (bittorrent) protocol would
need to be changed to allow "leeching" under certain conditions without
returning packets to the network. Much additional thought would be
needed in these design considerations.

Region Complication
===================
Due to higher costs of intercontinental bandwidth, or commodity Internet
over I2 cost within America, we may need to implement a "cost table"
system that calculates best near-nodes taking bandwidth cost into account.

Perhaps this may somehow use dedicated "alternate master trackers"
within each cost region, for example Australia, which are GPG identified by the master tracker as being authoritative for the entire region. Then end-user clients that connect to the master tracker are immediately told about their nearer regional tracker.
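One way to picture the "cost table" is a lookup keyed by (client region, peer region), used to sort candidate peers from cheapest to most expensive. This is only a sketch: the region names, the cost numbers, the host names and the function name rank_peers are all made up, and a real system would need to measure or configure the costs somehow.

```python
def rank_peers(peers, client_region, cost_table):
    """Return peers sorted from cheapest to most expensive for this client.

    peers is a list of dicts with "host" and "region" keys. cost_table maps
    (client_region, peer_region) to a relative bandwidth cost. Region pairs
    missing from the table are treated as the most expensive.
    """
    def cost(peer):
        return cost_table.get((client_region, peer["region"]), float("inf"))
    return sorted(peers, key=cost)


# Illustrative values only.
cost_table = {
    ("au", "au"): 1,
    ("au", "us"): 10,
    ("us", "us"): 1,
    ("us", "au"): 10,
}
peers = [
    {"host": "mirror.example.com", "region": "us"},
    {"host": "mirror.example.org.au", "region": "au"},
]
for peer in rank_peers(peers, "au", cost_table):
    print(peer["host"])
```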


Possible Multicasting
=====================
This isn't required, but multicasting could be utilized in addition to
unicast in order to more efficiently seed the larger and more connected
worldwide mirrors. Multicast would significantly increase the complexity
of both the network router setup and the software, so I am not
advocating this be worked on until the rest of the system is implemented.



Possible Benefits?
==================
* STATISTICS! As BitTorrent has demonstrated, end-user downloads could
possibly be tracked and counted. It would be fairly easy to standardize
data collection in this type of system. Today we have no realistic way
to collect download data from many mirrors due to the setup hassles and
many different types of servers. Imagine how useful package download
frequency data would be. We would have a real idea of what software
people are using, and could possibly use that data to gauge where QA
should be focused to improve the software and make users/customers happier.

* Unified namespace!
Users never have a need to find mirrors anymore, although optionally setting cache addresses would help it be faster and more efficient.


* Public mirrors (even unofficial) can easily be set up and shut down at any
time. Immediately after going online they will join the swarm and begin
contributing packets to the world. THAT is an unprecedented and amazing
ability. The server maintainer can set an upload cap so it never kills
their network. For example, businesses or schools could increase their
upload cap during periods of low activity (like night?) and contribute
to the world. The only difference between an official and unofficial
mirror would be that unofficial mirrors cannot download or serve
access-controlled data, since they are not cryptographically identified
by the master tracker. Any client (client == mirror) can choose what
data they want to serve, and what they do not want to serve.


* Automatic failover: If your nearest or preferred mirror is down, as
long as you can still reach the master tracker you can still download
from the swarm.

* Most of everything I described above is ALREADY WRITTEN AND PROVEN
CONCEPTS in existing Open Source implementations like bittorrent
(closest in features), freenet (unified namespace) and swarmcast
(progenitor?). I don't think the access control and dynamic update
mechanisms have been implemented yet. bittorrent may be a good point to start
development from since it is written in python ... although scalability may be a factor with python, so a C rewrite may be needed. (?)


FAQ
===
1. This idea sucks, I don't want to upload!
RTFM! This proposal says that clients have upload DISABLED by default.

2. This idea sucks, I don't want to upload to other people!
RTFM! In this proposal you can set your mirror to upload only to certain
subnets, at certain set upload rate caps.

3. Won't this plan fail for clients behind NAT?
Incoming TCP sessions are only needed if you upload to the swarm, as
other clients connect to you. Uploading is DISABLED by default.
Downloading only requires outgoing TCP connections.


4. What if outgoing connections on high ports are disallowed?
Then you are SOL, unless we implement a "proxy" mode. Your LAN could have
a single proxy mirror that serves only your local network and downloads
on your behalf.



Conclusion
==========
Just imagine how much of a benefit this would be to the entire Open
Source community! Never again would anyone need to find mirrors. Simply
point ossnet compatible clients to the unified namespace URI, and it
*just works*. We could make a libossnet library, and easily extend
existing programs like wget, curl, Mozilla, galeon, or Konqueror to
browse this namespace.

This is an AWESOME example of legitimate use of P2P, and far easier to
police abuse than traditional illegal use of P2P clients. Data
publishers need to run a publicly accessible tracker and must be held
legally accountable. This is more like a web server with legal content
and millions of worldwide proxy caches. In any case the web server would
be held accountable for the legality of their content.


That is how this differs from Freenet which uses encryption everywhere
and is decentralized. Freenet can be used for both good and evil, while
ossnet can only sustainably be used for good, because normal law
enforcement can easily locate and (rightly) prosecute offenders. This is
existing copyright law, how it was meant to be used. If this idea became
reality, we could point to this glowing example of legitimate P2P as a
weapon to fight RIAA/MPAA interests.


I hope I can work on this project one day. This could be world
changing... and sure would be fun to develop. Maybe Red Hat could
develop this, in cooperation with other community benefactors of such an
awesome distribution system.

Comments? =)

Warren Togami
warren togami com

p.s. Time to short Akamai stock. <evil grin>



