Category: Python

Moving on

Posted on January 2, 2010

The new year is bringing some big changes for me. A few weeks back I accepted a position at Relaxed Inc. and notified Mozilla that I would be leaving at the end of the year.

Mozilla

I started working at Mozilla 2 years ago, the day after my employment at the Open Source Applications Foundation ended. At this point I already take for granted some of the best parts of working at Mozilla: working for a public benefit organization, spending 100% of my time working on Open Source, and working with very smart people in the open (lists, IRC, etc.).

But Mozilla is even more than all that. Succeeding at Mozilla means something more than a pat on the back and a good end of the year review. When you succeed at Mozilla you impact one of the most important products on the internet. You reach hundreds of millions of users and contribute to keeping the web an open and free (as in speech) world. There is no other place in the world you can work where you can conceivably have this kind of impact.

Mozilla as an organization is truly unique. Last year was the hardest I’ve ever had. I suffered a huge loss in my personal life and Mozilla was as supportive during this time as any of my friends or family. There are a lot of places that let you put so much of yourself into the organization to help it attain its goals, but there are only a handful that are there to support you when you need it.

Relaxed Inc.

I started using CouchDB in 2008 after a great talk by Jan Lehnardt at OSCON, and over the next year it re-shaped how I think about web development and applications. In the last 6 months my group at Mozilla has become a heavy CouchDB user, not just because of my own interest but because CouchDB was the only solution for some of the harder problems we needed to solve with our results storage.

As I’ve used CouchDB more and more and become a part of the CouchDB community I’ve had the pleasure of getting to know some of the core contributors, three of whom have decided to found a new startup around CouchDB: Jan Lehnardt, J Chris Anderson, and the creator of CouchDB, Damien Katz. Shortly after they received their funding they made me an offer. It’s an amazing opportunity, and while the decision to leave Mozilla is one of the hardest I’ve ever had to make, I’m very excited about my future at Relaxed.

The Future

I’m really looking forward to working with everyone at Relaxed. It’s an exciting time and I’m not 100% sure yet which projects I currently work on that I will still have time to maintain. In the next week or so I’ll be doing a blog post on all the libraries I currently work on and maintain (it’s a long list) and what their status is moving forward. I still maintain code I wrote long before I worked at Mozilla and have every intention of continuing to work on some of the projects I started at Mozilla.

One thing is certain. I’m not the guy who figures out how to test the browser any more. Windmill and Mozmill are important projects that I have every intention of supporting by making time for code reviews and community support, but I won’t be available to put time into new feature work and refactoring like I have in the past. Luckily there are solid communities behind both of these projects and I’m confident that there are people who can continue to drive them in the future.

I don’t know what is going to happen next. All I know is that it should be fun, it won’t be like anything I’ve done before, and it will certainly continue to include lots of JavaScript and Python.

For everyone who depends on me and the code I’ve written over the last few years, I’ll be sure to keep you all up to date. And one thing I can promise is that if you want to fix anything in one of my projects, fork it on GitHub and send me a pull request and I will always find time to look at it :)

Hosting?

Posted on November 29, 2009

I’m starting to work on a simple blog to replace this WordPress instance.

I’ve had a great run with WordPress but I have a few ideas I want to experiment with and I also want to dogfood couchdb-pythonviews a little more.

This blog is hosted on Dreamhost. Dreamhost has been a great host for a low impact blog: the uptime hasn’t been 100%, but all the maintenance has been easy and it has remained dirt cheap for the last few years.

I need to find a new hosting provider. I have one dedicated server but I don’t plan on running a blog there because that server is a little busy.

I need something cheap. I need root (or some kind of sudo jail) where I can run CouchDB and nginx and manage Python. Preferably Debian. Definitely Linux. Decent uptime.

I’ve considered EC2 but for a low impact site it’s actually quite expensive (~30 dollars a month before bandwidth), and I’m told the performance is about 5x slower than a MacBook.

Backups aren’t necessary since I have CouchDB replication for backing up all the important bits.

I’m open to any and all suggestions.

JSON Performance in Python

Posted on November 20, 2009

As part of my ongoing performance work in our CouchDB+Python application I’ve decided to sit down and profile JSON performance in the different open source libraries available for Python.

I ran this test against four libraries with a large JSON document: json (the pure Python simplejson that ships in the Python stdlib), simplejson compiled with C speedups, cjson, and jsonlib2. The test decodes and encodes a large JSON object 100 times. It then runs that test 100 times in each library in succession in order to find the average encode/decode time for each library and to minimize the effect of other environmental factors. These numbers were taken on my MacBook Air running Mac OS X 10.6.1 with the default Python 2.6.
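The procedure described above can be sketched roughly as follows. This is not the original benchmark script: the function names, the sample document, and the use of `time.perf_counter` (Python 3) are assumptions made for illustration, and only the stdlib `json` module is exercised.

```python
import json
import time


def time_calls(func, argument, rounds=100):
    """Return the milliseconds taken to call func(argument) `rounds` times."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(argument)
    return (time.perf_counter() - start) * 1000.0


def average_ms(func, argument, repeats=100, rounds=100):
    """Repeat the timed run `repeats` times and return the mean in milliseconds."""
    total = sum(time_calls(func, argument, rounds) for _ in range(repeats))
    return total / repeats


# Hypothetical stand-in for the large JSON document used in the post.
sample_obj = {"results": [{"id": i, "passed": True, "log": "x" * 50}
                          for i in range(200)]}
sample_text = json.dumps(sample_obj)

# Small repeat counts keep this quick; the post used 100 runs of 100 calls.
decode_ms = average_ms(json.loads, sample_text, repeats=3, rounds=10)
encode_ms = average_ms(json.dumps, sample_obj, repeats=3, rounds=10)
print("stdlib json decode: %.3f ms, encode: %.3f ms" % (decode_ms, encode_ms))
```

To compare libraries, the same two calls would be repeated with each library's decode and encode functions passed in place of `json.loads` and `json.dumps`.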

The time represents in milliseconds how long it takes to encode/decode this JSON object 100 times.

[Image: JSONPerf, chart of encode/decode times per library]

I honestly didn’t expect the stdlib json to be this far behind.

Among the other C-based libraries there isn’t a clear winner. cjson is the best decoder but the slowest encoder, simplejson compiled with C speedups is the fastest encoder but the slowest decoder, and jsonlib2 is somewhere in the middle for both cases.

Also, annoyingly, cjson doesn’t implement the same API as the other libraries (its equivalents of the dumps and loads functions are named encode and decode), making it much more difficult for a library to include support for all available libraries. Now rather than just being able to add a user defined json module I’ll need to add support for user defined parsing and encoding functions to couchdb-pythonviews, couchquery, and couchdb-wsgi.
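One way to paper over the naming difference is a small resolver that hands back a decode/encode pair regardless of which names a module uses. This is a minimal sketch, not the approach taken in couchdb-pythonviews, couchquery, or couchdb-wsgi; `resolve_json_functions` is a hypothetical helper name.

```python
import json


def resolve_json_functions(module):
    """Return a (decode, encode) pair for a JSON module, whatever it names them."""
    # simplejson-style API: loads() and dumps()
    if hasattr(module, "loads") and hasattr(module, "dumps"):
        return module.loads, module.dumps
    # cjson-style API: decode() and encode()
    if hasattr(module, "decode") and hasattr(module, "encode"):
        return module.decode, module.encode
    raise TypeError("%r exposes neither loads/dumps nor decode/encode" % (module,))


# The stdlib module resolves through the first branch.
decode, encode = resolve_json_functions(json)
```

A library could accept either a module or an explicit pair of functions from the user and fall back to this resolver when only a module is given.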

CouchDB View Performance (Python vs JavaScript)

Posted on November 4, 2009

We’re gearing up for some heavy CouchDB usage in a new automation system and it has fallen upon me to do some performance benchmarking.

The most important thing for us to figure out was whether or not the CentOS virtual machine we’re currently running CouchDB on is going to be enough even in the short term. Until today we’ve been running 0.9 and have encountered performance problems.

Our main bottleneck is, and has always been, view generation and update performance. We tend to have medium to large size documents (jobs are relatively small but results from test runs can be incredibly large).

View generation of large documents has typically been our biggest issue. We have previously mitigated it by refreshing all views after any large write, but that isn’t going to work for the number of results that we plan on pouring into the new system.
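The "refresh all views after a large write" mitigation can be sketched as below. This is an assumption about how such a refresh might look, not the code used in the system described here: `view_urls` and `refresh_views` are hypothetical helpers, and the base URL, database name, and design document are made up. It relies on a view query triggering an index update before responding, with `limit=0` so that no rows are returned.

```python
from urllib.request import urlopen


def view_urls(base_url, db_name, design_docs):
    """Build one query URL per view defined in the given design documents."""
    urls = []
    for ddoc in design_docs:
        ddoc_id = ddoc["_id"]  # e.g. "_design/results"
        for view_name in ddoc.get("views", {}):
            urls.append("%s/%s/%s/_view/%s?limit=0"
                        % (base_url, db_name, ddoc_id, view_name))
    return urls


def refresh_views(urls, fetch=urlopen):
    """Request each view once so its index is brought up to date."""
    for url in urls:
        fetch(url).read()


# Hypothetical design document; no request is made until refresh_views runs.
design_docs = [{"_id": "_design/results", "views": {"by_type": {}, "by_id": {}}}]
urls = view_urls("http://localhost:5984", "tests", design_docs)
```

The `fetch` parameter exists only so the request function can be swapped out, for example in tests.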

Last weekend I wrote a Python view server for CouchDB. couchdb-python includes a view server but in the past I’ve heard complaints about performance (although none recently). In addition, the view server in couchdb-python only supports map and reduce, which is only about 1/5 of the current view server spec. The spec also includes handlers for update, show, list, filter, and validate, which provide the groundwork for CouchDB as an application platform. As of Sunday my view server passes all of the current CouchDB spec and initial performance tests showed it to be faster than the JavaScript view server.

Below are the performance graphs for CouchDB trunk running on a CentOS virtual machine. I’m using Python 2.6 with the default stdlib json library. The spidermonkey core is 1.7 (I don’t know what the status of using 1.8 with CouchDB is but as we’ll see below, this won’t improve performance too much for these tests).

These graphs show view generation time for a given number of documents in a new database. The design doc I used had two views, one does emit(doc['type'],doc), the other emit(doc['_id'], 1).
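For readers unfamiliar with map views, here is a rough Python rendering of those two views together with a toy index builder. The generator-style functions and the `build_view` helper are assumptions for illustration only. They are not the couchdb-pythonviews API and they skip everything a real view index does beyond collecting and sorting rows.

```python
def by_type(doc):
    # Equivalent in spirit to emit(doc['type'], doc)
    yield doc['type'], doc


def by_id(doc):
    # Equivalent in spirit to emit(doc['_id'], 1)
    yield doc['_id'], 1


def build_view(map_fun, docs):
    """Run a map function over every document and return rows sorted by key."""
    rows = []
    for doc in docs:
        rows.extend(map_fun(doc))
    return sorted(rows, key=lambda row: row[0])


# Hypothetical documents standing in for jobs and builds.
docs = [
    {"_id": "b", "type": "job"},
    {"_id": "a", "type": "build"},
]
id_rows = build_view(by_id, docs)
type_rows = build_view(by_type, docs)
```

Note that the first view emits the whole document as its value, so every document is serialized again on its way into the index, which matters for the large-document test.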

The graphs support zooming, mouseover and all kinds of flot goodness :)

JavaScript is the yellow line. Python is the blue line.

This is a test of moderately sized documents, about what we normally expect for the size of a job or build description. Each document is identical and fairly simple with a size of ~1,588 bytes.

These documents were incredibly large; they were taken from a full fennec mochitest run. Each document is identical and, while large, consists mostly of small JSON objects inside a much larger JSON object, coming in at ~139,096 bytes.

I had also intended to chart the reduce performance with a simple sum operation but all the results were sub-second regardless of the number of documents I threw at it, with Python being only a little faster than JavaScript.
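A "simple sum" reduce is about as small as a reduce function gets. The sketch below assumes the conventional `(keys, values, rereduce)` signature used by CouchDB reduce functions; it is not the exact function used in the test.

```python
def sum_reduce(keys, values, rereduce):
    """Sum the values, whether they are emitted values or partial sums."""
    # On a rereduce pass, `values` holds earlier reduce results, which for a
    # sum are simply added together again.
    return sum(values)


# First pass over emitted values, then a rereduce over two partial sums.
first_pass = sum_reduce([["a", "doc1"], ["b", "doc2"], ["c", "doc3"]], [1, 1, 1], False)
combined = sum_reduce(None, [first_pass, 2], True)
```

Because only small numbers cross the process boundary here, very little JSON needs to be encoded or decoded per call.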

The nearly identical reduce time tells me that the actual code processing time inside the view functions is hardly different, which means that the large difference in performance during view generation is most likely due to JSON serialization time. This also explains why larger documents cause an even greater difference in performance between Python and JavaScript.
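To see why serialization can dominate, it helps to look at the shape of a view server's main loop: every document arrives as a line of JSON and every result leaves as a line of JSON. The sketch below handles only the `reset`, `add_fun`, and `map_doc` commands and is not the view server described in this post. The convention that function source defines a generator named `fun`, and the exact error payload, are assumptions for illustration.

```python
import json


def handle_line(line, map_funs):
    """Handle one request line and return the JSON-encoded response."""
    command, *args = json.loads(line)      # decode cost is paid on every request
    if command == "reset":
        del map_funs[:]                    # forget all registered map functions
        result = True
    elif command == "add_fun":
        namespace = {}
        exec(args[0], namespace)           # assumption: source defines fun(doc)
        map_funs.append(namespace["fun"])
        result = True
    elif command == "map_doc":
        doc = args[0]
        # One list of [key, value] rows per registered map function.
        result = [[list(row) for row in fun(doc)] for fun in map_funs]
    else:
        result = {"error": "unknown_command", "reason": command}
    return json.dumps(result)              # encode cost is paid on every response


# Example session with a single map function and a single document.
funs = []
handle_line('["reset"]', funs)
handle_line(json.dumps(["add_fun", "def fun(doc):\n    yield doc['type'], doc"]), funs)
response = handle_line(json.dumps(["map_doc", {"_id": "a", "type": "job"}]), funs)
```

With a map function that emits the whole document, a ~139 KB document is decoded once on the way in and encoded again on the way out, while the map function itself does almost no work.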

Improving Performance

The Python view server is already as optimized as I can imagine for processing time inside the views. Since CouchDB doesn’t provide a way for the view server to support its own concurrency we’ve basically hit the wall here on what Python can provide. If we increased the complexity of the view functions I think that Python would start to show better than Spidermonkey 1.7, but 1.8 with tracing enabled would likely bridge that gap, possibly even showing JavaScript faster than Python.

The big problem is JSON serialization. We can make Python faster by compiling simplejson with C speedups. But using the C based JSON parser in newer versions of Spidermonkey requires some other changes to CouchDB since there are differences in the encoding of undefined.
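On the Python side, a common pattern is to prefer simplejson when it is installed and fall back to the stdlib module otherwise. The sketch below shows that pattern. The `has_c_speedups` check is an assumption about module internals (both modules expose a `decoder` submodule whose `c_scanstring` is `None` when no C scanner was loaded) and should be treated as a hint rather than a guarantee.

```python
try:
    import simplejson as json_impl   # third-party; may include compiled C speedups
except ImportError:
    import json as json_impl         # standard library fallback


def has_c_speedups(module):
    """Best-effort guess at whether a C string scanner is in use."""
    decoder = getattr(module, "decoder", None)
    return getattr(decoder, "c_scanstring", None) is not None


print("using %s (C speedups: %s)" % (json_impl.__name__, has_c_speedups(json_impl)))
```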

At the end of the day though, this all looks great. CouchDB trunk (pre-0.11) is going to run fast enough for what we need right now even on a virtual machine and if we start to see view generation bottlenecks on views that aren’t hit as often and have to update a large number of documents we can just move those views to Python and the performance should be back down to sub-second.

Introducing… couchdb-wsgi

Posted on October 28, 2009

Last weekend I put together some pretty useful code that converts [CouchDB's external process](http://wiki.apache.org/couchdb/ExternalProcesses) JSON request/responses to a WSGI compliant interface.

This means you should be able to run any modern Python web framework in an external process :)

The simplest example:

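```python
#!/usr/bin/python
import couchdb_wsgi

# A minimal WSGI application that always answers with a plain-text greeting.
def application(environ, start_response):
    start_response('200 OK', [('content-type', 'text/plain')])
    return ['Hello World']

# Wrap the application so CouchDB can drive it as an external process.
couchdb_wsgi.CouchDBWSGIHandler(application).run()
```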

But a far more interesting example is running a django app :)

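```python
#!/usr/bin/python
import os, sys
import couchdb_wsgi

# Make the Django project importable and point Django at its settings module.
django_project = os.path.join(os.path.dirname(__file__), 'mysite')
sys.path.append(django_project)
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'

# Import the handler only after the settings module has been configured.
import django.core.handlers.wsgi

application = django.core.handlers.wsgi.WSGIHandler()

# Hand the Django WSGI application to the CouchDB external process handler.
couchdb_wsgi.CouchDBWSGIHandler(application).run()
```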

All the code is [up on github](http://github.com/mikeal/couchdb-wsgi) and I’ve written up some solid [Sphinx docs that are up on gh-pages](http://mikeal.github.com/couchdb-wsgi/). I also pushed an [initial release to PyPI](http://pypi.python.org/pypi/couchdb-wsgi).
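For a sense of what converting an external process request into a WSGI call involves, here is a rough sketch. It is not the couchdb-wsgi implementation: `call_wsgi` is a hypothetical function, the request field names (`verb`, `path`, `query`, `headers`, `body`) and the response shape (`code`, `headers`, `body`) are assumptions about the external process JSON, and the code targets Python 3.

```python
import io
from urllib.parse import urlencode


def call_wsgi(application, request):
    """Run a WSGI application against one CouchDB-style request dict."""
    body = (request.get("body") or "").encode("utf-8")
    environ = {
        "REQUEST_METHOD": request.get("verb", "GET"),
        "SCRIPT_NAME": "",
        "PATH_INFO": "/" + "/".join(request.get("path", [])),
        "QUERY_STRING": urlencode(request.get("query", {})),
        "CONTENT_LENGTH": str(len(body)),
        "SERVER_NAME": "localhost",       # placeholder values, not read from CouchDB
        "SERVER_PORT": "5984",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "http",
        "wsgi.input": io.BytesIO(body),
        "wsgi.errors": io.StringIO(),
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }
    # Request headers become HTTP_* keys, as WSGI expects.
    for name, value in request.get("headers", {}).items():
        environ["HTTP_" + name.upper().replace("-", "_")] = value

    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = application(environ, start_response)
    return {
        "code": int(captured["status"].split()[0]),
        "headers": dict(captured["headers"]),
        "body": b"".join(chunks).decode("utf-8"),
    }


def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [("Hello " + environ["PATH_INFO"]).encode("utf-8")]


response = call_wsgi(hello_app, {"verb": "GET", "path": ["db", "_hello"],
                                 "query": {"a": "1"}})
```

A real handler would also read requests line by line from stdin and write each response back as a line of JSON.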

GitHub is the winner

Posted on July 20, 2009

I’m not lucky enough to get to choose one source control manager and use it exclusively. On a daily basis I use git, svn, and hg. Every week or so I also use bzr. Luckily, I no longer have to touch darcs.

I haven’t dug into the internals of these tools enough to say which one has the superior technical merits, although I will say that I’ve never seen a git conflict resolution interface, even across unbelievably hairy merges.

I write a lot of small libraries and a couple of big ones. I care far more about the social effects and contribution workflows a tool provides than any other features. There are different public web applications that try to provide infrastructure for the social effects of DVCS, and after months of working with different approaches I have to say that GitHub is the winner by a mile.

At the end of the day there are two factors that make GitHub such a clear winner. The first is zero friction publishing. The second is the democratizing effect of scrapping any notion of a “central” repository.

Nearly a year ago I hit the Google Code project limit and had to call in some favors to get the limit pushed up for my account. I have to push lots of small libraries, so simple and seamless publishing of repositories has made my life much easier. The fact that I can just push my repository and worry about turning that repo on GitHub into a “project” later, instead of the other way around, means that I have no reason **not** to publish every little thing I do.

The second and more controversial feature of GitHub, and possibly of git itself, is that there is never a clear central repository. There is my repository, and your repository, and every other **person’s** repository. This throws off FLOSS projects that have always relied on a “committer” hierarchy to manage the influx of work into a project. Nearly every book on community driven open source focuses on the creation of a class of contributors with special write permissions to the repository. There has been a huge discussion on how to translate that process to DVCS, and some tools, in particular hg, make it fairly easy to simulate older workflows with a central repository.

After living with GitHub for a while and seeing the potential for new collaboration I think the answer to translating the “committer” model to DVCS is to **not translate it at all**. GitHub makes **everyone** a committer and that enables a new class of contribution that the old model totally excluded.

Since code can travel seamlessly through different developers’ repositories, each change takes on a life of its own. People who made what they thought were small changes for their own personal use can easily share them with other developers, and those changes can move around repositories, hopefully making it into an official release. New contributors don’t have to worry about a giant wall of process behind getting a patch in. They simply write the patch, push it, and send pull requests to other relevant contributors and module owners, eventually getting those changes pushed up into the repository that gets packaged and distributed.

Someone is always going to be responsible for releasing a product, and someone owns the keys to the distribution mechanisms, so I find the notion that some amount of authority over the project’s direction is lost by not centralizing the repository to be exaggerated. Although the previously defined class of committers does lose some authority, the democratization of write permissions encourages a bigger class of contribution that would otherwise be lost, excluded by the laborious process of attaching patches to bugs and the upstream process required to get the work committed. This also means that a number of contributors can live with changesets for an extended period of time before they get packaged in a release, which increases confidence in large changesets that many projects reject outright for fear of instability.

GitHub solves the social problems of open source collaboration by taking a much more anarchist approach to the contribution process and while this is certainly shaking the foundation of traditional contribution models I’m loving it :)

Up for a Pint?

Posted on July 2, 2009

I’m in London for the next few days and would love to grab a drink with any community members be you Mozilla, CouchDB, Python, Windmill, JavaScript or just plain old coffee, whisky or beer geeks :)

Heading to EuroPython

Posted on June 26, 2009

I’m getting all packed up and leaving Sunday for [EuroPython](http://www.europython.eu/) in Birmingham, UK.

This will be my first time at EuroPython and my first time in Europe!

I’ll be giving two talks, one on [Windmill](http://www.getwindmill.com) and one about [CouchDB](http://couchdb.apache.org/) and Python. The Windmill talk will be more or less the talk that I gave at [Open Source Bridge](http://opensourcebridge.org/) last week, which went very well. This is the first time I’ll be talking about CouchDB, the most exciting new technology on the web. The talk will mostly be about breaking our old data modeling habits that we developed to deal with SQL and what libraries and tools are available for interacting with CouchDB in Python.

I will also be in London for a few extra days after the conference so anyone interested in a meetup should ping me.

Conference Season Begins

Posted on June 15, 2009

I’ll be leaving tomorrow morning for [Open Source Bridge](http://opensourcebridge.org/) in Portland, Oregon.

I’m putting together a new [Windmill talk](http://opensourcebridge.org/sessions/36) that tries to incorporate all the feedback we’ve received over the last year of speaking which I’ll be presenting on Thursday.

Mozilla is also [sponsoring](http://opensourcebridge.org/sponsors/) the conference and there are going to be some great [Firefox related sprints in the hacker lounge](http://opensourcebridge.org/wiki/Hacker_Lounge). Dietrich is also giving what sounds like an awesome talk on extending Firefox called [Firefox Switchblade](http://opensourcebridge.org/sessions/251).

Hope to see you all there!

PS. I’ll also be at EuroPython and the Community Leadership Summit, more on those later :)

Windmill 1.1 (the PyCon release)

Posted on April 7, 2009

So much good stuff landed in Windmill over the last few weeks that we decided to push another major release.

The biggest new features are:

* [django management command](http://trac.getwindmill.com/wiki/WindmillAndDjango) for running windmill tests ([Jacob](http://jacobian.org/) said the existing django support wasn’t good enough and I agreed so I wrote this during the PyCon Sprints)
* new [nose plugin](http://trac.getwindmill.com/wiki/BookChapter-5-RunningTests#RunningTestsfromNose)
* cygwin support contributed by [Simon Law](http://sfllaw.livejournal.com/) (he went and wrote an implementation of [winreg for cygwin](http://pypi.python.org/pypi/cygwinreg) to get this to work).

There were also some really good bug fixes that landed:

* much better unicode handling and serialization (adam)
* fix for POST to foreign domains ([Anthony Lenton](http://anthony.lenton.com.ar))
* continued improvements to click simulation (adam)

The release is [up on PyPI](http://cheeseshop.python.org/pypi/windmill) and you can install/update with:

$ easy_install -U windmill

For anyone interested in working **on** windmill we’re having a Sprint in #windmill on irc.freenode.net tomorrow, April 8th, 2009, for pretty much all day. We’re going to be improving the unit tests for Windmill itself.

The next planned major release will be 1.2 which will include the much anticipated SSL support, courtesy of some great work being done by Anthony.