Behlendorf on Open Source
Tim: I understand you played a key role in the creation of the Apache web server. How did that come about?
Behlendorf: In 1995 I was working at Wired magazine and also starting a web company of my own called Organic. I was working with this pre-existing web server from NCSA, the same group who put out Mosaic, and I had some patches which did things like improve performance here and there, and fix some bugs and security holes. I was sharing these patches with other NCSA server users that I knew about, through different mailing lists and standards groups. Eventually this group I was sharing these patches with came to the conclusion that NCSA wasn't going to release another version because they'd just lost all their developers to Netscape. So why don't we create our own distribution of the server with all the patches included? It felt like the right thing, and we looked at the license for NCSA, and it said this code is public domain, do whatever you want, just give NCSA credit if you create a derivative work and pass it along. So we said, that's a good philosophy, why don't we adopt that ourselves so that if we ever get into a situation where we're all hired away by Netscape or whatever, the next group can come along and do the same thing to us. So that was the beginning of the Apache project, back in 1995.
In 1998 we decided that in order to give some legal backing to what we did we needed to form a shell around the project, and so we created a non-profit organization called the Apache Software Foundation. I was president of that for the first three years, all the while focusing both on the tools the developers were using, and on how to bring corporate entities like IBM and Sun and Apple and all these others into a project like Apache without corrupting the development processes that we felt lent themselves to high-quality open-source software. Figuring that out led me to think that it might be interesting to help other companies figure out how to work with the open source community, and that was the genesis behind CollabNet.
Tim: So what was CollabNet all about, and how did you come up with Subversion?
Behlendorf: The company was started on the premise that we'd figured out what it is about open source that works. I looked at what I was doing with Apache in terms of the back-end tools and the infrastructure, which seemed pretty unique. I started working on integrating these kinds of tools together more tightly, and started to get some resonance out there with companies like Sun, who wanted something like that for OpenOffice, and Hewlett Packard, who wanted something like that for their own internal software development.
We decided to start with the tools that the open source community are using. So we started with CVS, but we realised that CVS had some obvious limitations. It had been around for 20 years and was well known for both its features and its limitations. We needed something better. We could have licensed a commercial product, but we didn't like that option, so I said, "Why not let's start it as an open-source project?" I wanted something that would have the same degree of widespread use and trust that the community has in CVS.
So that was why we started Subversion. I hired this guy called Karl Fogel, who had written a book, Open Source Development with CVS, that I really liked. He warned me that the existing CVS code was really ugly and on its last legs, and I said, let's totally start from scratch, fix the key things that we know are wrong with CVS, but also create what could be a base platform for doing more interesting things.
We could afford to hire three developers. I couldn't afford to hire thirty to build the team I thought would be necessary. So I said to those three, "You have to start the project and be responsible for the kernel of that code, but you are also responsible for building a community around it." That was around 2000, and it took around three and a half years before they hit a 1.0 at the beginning of this year. At the 1.0 release there were already 2000 public installations. We know this because Netcraft would report Apache installations that had the Subversion module installed. In addition there were a bunch of companies using it privately. And you know, it still needed some things like the FSFS back end and things like that, that have improved reliability, but it's on a really good trajectory right now.
Tim: Do you see FSFS replacing the Berkeley DB back end?
Behlendorf: Initially it was seen as something that could be equivalent, but I think a lot of people are not seeing any upside to Berkeley DB. It doesn't look like it's faster in any way, and it's not more reliable. I think we'll continue to support it, since there's a lot of people with existing Berkeley DB installations out there, but I think FSFS is going to turn out to be the recommended route. We could deprecate Berkeley DB at some point, but I leave those decisions up to the community.
Tim: Some people are saying that Subversion is fantastic, but it's the archetypal centralized repository, and maybe we should be moving more to distributed repositories, such as Arch. What's your view?
Behlendorf: There's a couple of projects out there that are looking at building a distributed system on top of Subversion. Things like svk, hosted at svk.tigris.org, which is essentially an intelligent proxy server, so you could have a team in India making their commits to a local proxy, and somebody charged with reviewing those patches and sending them to the upstream server.
A lot of what's driving the desire for decentralized repositories is wanting to have a lot more flexibility. Developers say, "I don't have commit privileges to the upstream repository, but I still want to check things into a place and share that check-in with other people." I think svk will do a lot of that.
Arch and BitKeeper take a fundamentally different approach in their concept of how a repository works. I think those projects are valuable to have as reference points, but right now the focus is on making Subversion the best kind of centralized repository, and on fighting some of the things that make centralization painful. If we can solve those problems then centralization won't sound like the dirty word that it is right now.
We are also driven by what the corporate or enterprise customers want. And they do want centralization. They do want to know that all the code is sitting there in one place that is backed up, that perhaps has mirrors in a bunch of different places, but that there's some global view of what's available, rather than having to worry that maybe there's a collection of good and useful stuff, but it's sitting on some developer's private server that he's sharing with three other developers, and that the rest of the organization doesn't know about. That kind of global knowledge and global permissions management is important in the enterprise.
Tim: Companies like Rational and Borland are promoting the idea of software lifecycle management with highly integrated tools. There's a danger of tools vendor lock-in. What's your view?
Behlendorf: I think an integrated environment has a lot of value. It's something that's been our argument from day one, with CollabNet Enterprise Edition. For example, when you make a commit, and you mention an issue ID, there's value in having bi-directional links between the issue database and the version control repository. Or you make a commit and a commit message appears on the mailing list, people can respond and that response goes to the developers' list. There is a useful set of integration points between these tools that is difficult to do if you have a version control tool from one vendor, a bug database from another vendor, and communication tools from another vendor. So I think there's a lot of value in having an integrated collaboration toolset across an Enterprise. Yet we've never walked in and said, "There's one particular methodology". We're not here to say there's one specific open-source methodology, we're not here to say that RUP [Rational Unified Process] doesn't have any value. What ends up happening is that most companies adopt their own methodologies, their own standards for workflow within the Enterprise. Perhaps they even vary from team to team, and you need that integrated toolset to be programmable. We already see our tool used throughout the entire development lifecycle, beginning with product managers talking with sales people and customer support about what the product should do. Those issues and requirements become artefacts in the system that get traced all the way through development to production and deployment. At the same time we're not specifying a particular front-end IDE or developer tool. You can use Eclipse with our infrastructure, you can use JBuilder, you can use Visual Studio, or you can use the command-line clients. We're trying to focus on the server side, using the web as an interesting way to present the UI.
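[Editor's illustration] The commit-to-issue integration point described above could be sketched roughly as follows. This is a minimal sketch, not CollabNet's implementation: the function name `link_commits_to_issues` and the convention that issue IDs appear in commit messages as "issue 42" or "#42" are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed convention (hypothetical): issue IDs are written in commit
# messages as "issue 42" or "#42".
ISSUE_RE = re.compile(r"(?:issue\s+|#)(\d+)", re.IGNORECASE)


def link_commits_to_issues(commits):
    """Build two-way links between revisions and issue IDs.

    commits: iterable of (revision, message) pairs.
    Returns (issues_by_revision, revisions_by_issue).
    """
    issues_by_revision = {}
    revisions_by_issue = defaultdict(list)
    for revision, message in commits:
        # Deduplicate and sort the IDs mentioned in this commit message.
        ids = sorted({int(match) for match in ISSUE_RE.findall(message)})
        issues_by_revision[revision] = ids
        for issue_id in ids:
            revisions_by_issue[issue_id].append(revision)
    return issues_by_revision, dict(revisions_by_issue)


if __name__ == "__main__":
    history = [
        (101, "Fix crash on empty path (issue 42)"),
        (102, "Refactor; see #42 and #57"),
        (103, "Update docs"),
    ]
    by_revision, by_issue = link_commits_to_issues(history)
    print(by_revision)
    print(by_issue)
```

With both mappings available, an issue page can list the revisions that mention it and a revision page can list the issues it touches, which is the bi-directional link in its simplest form.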
Tim: What about modelling?
Behlendorf: I'm agnostic on it. People have been talking about 4GLs for twenty years, about higher-level representations and visualizations of code, and having the underlying software code become more of a derivative work. The reality is that the source code itself is still the authoritative description of what the software does. If you have a defect, you still have to dive into the code. When you really care about performance you want to dive in and understand how things work below the API. When you're too abstract, when you don't understand how registers work and how the underlying architectures work, then you tend to write inefficient code or create subtle bugs that take a long time to resolve. So I'm sceptical, but I think there are teams that are successful with things like UML models and round-trip engineering. I wouldn't say that our tools have a preference or any kind of enforcement either way. You can track UML models as just another artefact.
Tim: I'd like to ask some more general questions about open source development. It works very well for big projects, but there are lots of smaller projects out there in various states of bugginess and abandonment. Without a crystal ball, how can you tell whether a particular open source project is going to be high-risk?
Behlendorf: For companies who need to mitigate the risk around this, the best thing to do is work with what are essentially agencies. Companies such as Covalent who are experts with Apache web server code and server-side Java code; or companies like JBoss and MySQL, who act as a buffer for these projects. Customers buy the commercially supported releases and if they find a bug they can still work through these companies to get the bug fixed.
There's a whole wave of new companies that provide this service as well. You might have heard of SpikeSource, for example - I'm on the board of advisors there, so that's a disclaimer; but there's also SourceLabs, there's OpenLogic, there's BlueCoat, these are all new companies that are basically stating that they will create a stack of software they certify, composed of open source software that they will support. They will manage the risk of the project someday being abandoned. They'll probably choose those components that aren't just one-person projects, that have a multi-person longevity to them.
The other thing is I think you'll see open source projects themselves start to take some responsibility for this. Within Apache for example we are pretty fierce about preventing or avoiding having projects in our community that are one-person projects. There's even been a couple of examples of projects where it's become clear that there was one developer who was vetoing everything that they didn't write and not really building a community around what they did. They had firm and very strong opinions about what was right and what was wrong, and that led to a very weak community and a very weak collaboration environment. We've kicked developers out on some occasions and shut down projects because of that. We have something called the incubator for new projects, where they might start as one or two-person projects, and they have to pass that incubation phase in order to become real Apache projects. The criterion is that there is a vibrant multi-participant, multi-vendor development community around it.
Tim: Just to clarify, some of these agencies might be the guardians of the project they are responsible for, such as JBoss, but others work with a variety of different projects, they don't actually own a project in any sense but they work with that project and provide a buffer for the customer to give them some security.
Behlendorf: Yes, and I don't think owning a community is really needed in order to be able to provide that buffer. I think you need to be smart, almost like you're a mutual fund manager, deciding which companies to invest in, which companies are mature, which companies have risk. There is an emerging open source maturity model, trying to recognise how long has this project been around and actively developed, how many users are out there, how many developers are checking code in, in a given week, how much peer review goes on.
There's also a company out there called Open Source Risk Management, which provides some of this business-oriented or actuarial information around projects. They don't themselves sell insurance policies, but they can provide the information that's needed to make some sort of numeric estimation of the risk around using something like Linux or Apache.
With CollabNet and with the Subversion project, we fund a lot of the core full-time developers but we've tried to make it very clear, we're not the ones dictating direction. We are enablers of the community. We do have our own priorities and things we'd like to see the project do, driven by how our corporate customers are using it, but we recognise that the strength of the product comes from the strength of the community. We can't throw in a quick implementation because we need it to make a sale. There is a process around it, and a community around it that needs to be engaged on the evolution of the code. That's a model that is different from how JBoss and MySQL manage their communities, where those companies hire their key developers, and drive the roadmap. We'd rather take a more multi-vendor, multi-participant approach to it, because when we sell Subversion to a company, they want some guarantee that it's going to be around in five years. This is one reason we put an Apache license on the code itself. We wanted to allow other companies to come in and provide support for it. We wanted to reassure companies that if they became customers of CollabNet, and something went wrong, that they would still be able to support themselves if they needed to. It's not going to disappear or lock them into a proprietary system. That's been a huge win for us.
Copyright Tim Anderson November 2004