Month: November 2009

Hosting?

Posted by on November 29, 2009

I’m starting to work on a simple blog to replace this WordPress instance.

I’ve had a great run with WordPress but I have a few ideas I want to experiment with and I also want to dogfood couchdb-pythonviews a little more.

This blog is hosted on Dreamhost. Dreamhost has been a great host for a low impact blog, the uptime hasn’t been 100% but all the maintenance has been easy and it’s also remained dirt cheap for the last few years.

I need to find a new hosting provider. I have one dedicated server but I don’t plan on running a blog there because that server is a little busy.

I need something cheap. I need root (or some kind of sudo jail) where I can run CouchDB and nginx and manage Python. Preferably Debian. Definitely Linux. Decent uptime.

I’ve considered EC2 but for a low impact site it’s actually quite expensive (~30 dollars a month before bandwidth) and the performance I’m told is about 5x slower than a Macbook.

Backups aren’t necessary since I have CouchDB replication for backing up all the important bits.

I’m open to any and all suggestions.

JSON Performance in Python

Posted by on November 20, 2009

In part of my ongoing performance work in our CouchDB+Python application I’ve decided to sit down and profile JSON performance in the different open source libraries available for Python.

I ran this test profiling json (pure Python simplejson) available in Python stdlib, simplejson compiled with C speedups, cjson, and jsonlib2, with a large JSON document. The test decodes and encodes a large JSON object 100 times. It then runs that test 100 times in each library in succession in order to find the average encode/decode time for each library and minimize other environmental factors that may occur. These numbers were taken on my MacBook Air running Mac OS X 1.6.1 with the default Python 2.6.

The time represents in milliseconds how long it takes to encode/decode this JSON object 100 times.

JSONPerf

I honestly didn’t expect the stdlib json to be this far behind.

Among the other C based libraries there isn’t a clear winner. cjson is the best decoder but the slowest encoder, simplejson compiled with C speedups is the fastest encoder but the slowest decoder while jsonlib2 is somewhere in the middle for both cases.

Also, annoyingly, cjson doesn’t implement the same API as the other libraries (dump and load functions are named encode and decode) making it much more difficult for a library to include support for all available libraries. Now rather than just being able to add a user defined json module I’ll need to add support for user defined parsing and encoding functions to couchdb-pythonviews, couchquery, and couchdb-wsgi.

CouchDB View Performance (Python vs JavaScript)

Posted by on November 4, 2009

We’re gearing up for some heavy CouchDB usage in a new automation system and it has fallen upon me to do some performance benchmarking.

The most important thing for us to figure out was whether or not the CentOS virtual machine we’re currently running CouchDB on is going to be enough even in the short term. Until today we’ve been running 0.9 and have encountered performance problems.

Our main bottleneck is, and has always been, view generation and update performance. We tend to have medium to large size documents (jobs are relatively small but results from test runs can be incredibly large).

View generation of large documents has typically been our biggest issue which we have previously mitigated by refreshing all views after any large write but that isn’t going to work for the amount of results that we plan on pouring in to the new system.

Last weekend I wrote a Python view server for CouchDB. couchdb-python includes a view server but in the past I’ve heard complaints about performance (although none recently). In addition, the view server in couchdb-python only supports map and reduce, which is only about 1/5 of the current view server spec which includes handlers for update, show, list, filter, and validate which provide the groundwork for CouchDB as an application platform. As of Sunday my view server passes all of the current CouchDB spec and initial performance tests showed it faster than the JavaScript view server.

Below are the performance graphs for CouchDB trunk running on a CentOS virtual machine. I’m using Python 2.6 with the default stdlib json library. The spidermonkey core is 1.7 (I don’t know what the status of using 1.8 with CouchDB is but as we’ll see below, this won’t improve performance too much for these tests).

These graphs show view generation time for a given number of documents in a new database. The design doc I used had two views, one does emit(doc['type'],doc), the other emit(doc['_id'], 1).

The graphs support zooming, mouseover and all kinds of flot goodness :)

JavaScript is the yellow line. Python is the Blue line.

This is a test of moderately sized documents, what we normally expect the size of a job or build description. Each document is identical and fairly simple with a size of ~1,588 bytes.

These documents were incredibly large, they were taken from a full fennec mochitest run. Each document is identical and while large it consists mostly of small sized JSON objects inside a much larger JSON object coming in at ~139,096 bytes.

I had also intended to chart the reduce performance with a simple sum operation but all the results were sub-second regardless of the amount of documents I threw at it with Python being only a little faster than JavaScript.

The nearly identical reduce time tells me that the actual code processing time inside the view functions are hardly different which means that the large difference in performance during view generation is most likely due to JSON serialization time. This also explains why larger documents cause an even greater difference in performance between Python and JavaScript.

Improving Performance

The Python view server is already as optimized as I can imagine for processing time inside the views. Since CouchDB doesn’t provide a way for the view server to support it’s own concurrency we’ve basically hit the wall here on what Python can provide. If we increased the complexity of the view functions I think that Python would start to show better than Spidermonkey 1.7, but 1.8 with traceing enabled would likely bridge that gap, possibly even showing JavaScript faster than Python.

The big problem is JSON serialization. We can make Python faster by compiling simplejson with C speedups. But using the C based JSON parser in newer versions of Spidermonkey requires some other changes to CouchDB since there are differences in the encoding of undefined.

At the end of the day though, this all looks great. CouchDB trunk (pre-0.11) is going to run fast enough for what we need right now even on a virtual machine and if we start to see view generation bottlenecks on views that aren’t hit as often and have to update a large number of documents we can just move those views to Python and the performance should be back down to sub-second.