Introducing leveldb-server

We just released leveldb-server. Start forking it on GitHub: leveldb-server


leveldb-server

  • Async leveldb server and client based on zeromq
  • Storage engine: leveldb
  • Networking library: zeromq
  • We use leveldb-server at Safebox

License

New BSD license. Please see license.txt for more details.

Features

  • Very simple key-value storage
  • Data is sorted by key – allows range queries
  • Data is automatically compressed
  • Can act as persistent cache
  • For our use at Safebox it replaced memcached+mysql
  • Simple backups: cp -rf level.db backup.db
  • Networking/wiring from zeromq messaging library – allows many topologies
  • Async server for scalability and capacity
  • Sync client for easy coding
  • Easy polyglot client bindings. See zmq bindings
>>> db.put("k3", "v3")
'True'
>>> db.get("k3")
'v3'
>>> db.range()
'[{"k1": "v1"}, {"k2": "v2"}, {"k3": "v3"}]'
>>> db.range("k1", "k2")
'[{"k1": "v1"}, {"k2": "v2"}]'
>>> db.delete('k1')
>>>
We will be adding high availability, replication, and autosharding using the same zeromq framework.
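
The range calls in the session above work because keys are kept in sorted order. Here is a minimal in-memory sketch of that idea. It is not leveldb or leveldb-server code: the SortedStore class and its methods are hypothetical names used only for illustration, and the inclusive end bound is an assumption taken from the session output above.

```python
import bisect


class SortedStore(object):
    """Toy stand-in for a key-sorted store (hypothetical, not leveldb)."""

    def __init__(self):
        self._keys = []    # keys, always kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while keeping order
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def delete(self, key):
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def range(self, start=None, end=None):
        # Binary search for both bounds; the end key is included,
        # which is an assumption based on the session output above.
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if end is None else bisect.bisect_right(self._keys, end)
        return [{k: self._values[k]} for k in self._keys[lo:hi]]


db = SortedStore()
for key, value in [("k3", "v3"), ("k1", "v1"), ("k2", "v2")]:
    db.put(key, value)
print(db.range("k1", "k2"))  # [{'k1': 'v1'}, {'k2': 'v2'}]
```

Because the keys are sorted, a range query is two binary searches plus a slice, regardless of insertion order.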

Dependencies

python 2.6+ (older versions with simplejson)
zmq
pyzmq
leveldb
pyleveldb 

Getting Started

Instructions for an EC2 Ubuntu box.

Installing zeromq

wget http://download.zeromq.org/zeromq-2.1.10.tar.gz
tar xvfz zeromq-2.1.10.tar.gz
cd zeromq-2.1.10
sudo ./configure
sudo make
sudo make install

Installing pyzmq

wget https://github.com/zeromq/pyzmq/downloads/pyzmq-2.1.10.tar.gz
tar xvfz pyzmq-2.1.10.tar.gz
cd pyzmq-2.1.10/
sudo python setup.py configure --zmq=/usr/local/lib/
sudo python setup.py install

Installing leveldb and pyleveldb

svn checkout http://py-leveldb.googlecode.com/svn/trunk/ py-leveldb-read-only
cd py-leveldb-read-only
sudo sh ./compile_leveldb.sh
sudo python setup.py install

Starting the leveldb-server

> python leveldb-server.py -h
Usage: leveldb-server.py 
 -p [port and host settings] Default: tcp://127.0.0.1:5147
 -f [database file name] Default: level.db

leveldb-server

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -p HOST, --host=HOST  
  -d DBFILE, --dbfile=DBFILE
> python leveldb-server.py

Using the leveldb-client-py

> cd clients/py/
> sudo python setup.py install
> python 
>>> from leveldbClient import database
>>> db = database.leveldb()
>>> db.get("Key")
>>> db.put("K", "V")
>>> db.range()
>>> db.range(start, end)
>>> db.delete("K")

Backups
cp -rpf level.db backup.db

Known issues and work in progress

Would love your pull requests on
  • Benchmarking and performance analysis
  • Client libraries for other languages
  • [issue] zeromq performance issues with 1M+ inserts at a time
  • [feature] timeouts in client library
  • [feature] support for counters
  • [feature] limit support in range queries
  • Serializing and separate threads for get/put/range in leveldb-server
  • HA/replication/autosharding and possibly pub-sub for replication

Thanks

Thanks to all the folks who have contributed to all the dependencies. Special thanks to the author of pyzmq/examples/mongo* for the inspiration.

Launching Safebox for ur dropbox

The site just went live a few hours ago. Go ahead and get Safebox for your Dropbox.



-Srini

Handling HN traffic - some tricks

My request "Review my startup: sanetax.com - A real Tax Professional prepares your taxes (sanetax.com)" was on the front page of HN for ~4-5 hours. During that time, the visitor count on my website SaneTax.com spiked to 10 times the average. Just before submitting for review, I had taken the following steps to make sure the site wouldn't crash. I thought I would share them so that fellow readers can benefit.

  1. nginx for static html pages (~10 html pages) - python tornado with reverse proxy for the app (login and the real site) and mysql db store
  2. all static assets (images, css, js) on s3 with Cache-Control option of 30 days
  3. cloudflare as free dns service manager (default html cache for 2hr)
  4. tee app server log files (not a biggy but tee to make sure I capture everything app emits and throws up)

1. nginx for static html pages (~10 html pages) - python tornado with reverse proxy for the app and mysql db store

This is pretty obvious. All public pages are static HTML pages. I built them from a template that I created manually for the whole site. I have 4 instances of python tornado running on AWS and one mysql instance, all on micro instances.

2. all static assets (images, css, js) on s3 with Cache-Control option of 30 days

I put all the CSS and PNG files in a static folder and used the s3afe.py script to upload them to S3. My modified s3afe is here in a GitHub repository. It adds support for quick directory upload and Cache-Control.
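
For reference, a 30-day Cache-Control value is expressed in seconds. The sketch below only shows that arithmetic. The helper name cache_control_header is hypothetical, the "public" directive is an assumption, and this is not taken from s3afe.py.

```python
from datetime import timedelta


def cache_control_header(days):
    """Build a Cache-Control value for the given lifetime in days.

    Hypothetical helper; the "public" directive is an assumption.
    """
    max_age = int(timedelta(days=days).total_seconds())
    return "public, max-age=%d" % max_age


print(cache_control_header(30))  # public, max-age=2592000
```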

3. cloudflare as free dns service manager (default html cache for 2hr)

Cloudflare is one of the coolest DNS/CDN providers, and it doesn't get mentioned enough. I use Cloudflare for pretty much all my sites. All the HTML files are cached, with a DNS TTL of 1 day. Though Cloudflare had an outage a couple of days ago, overall I am really satisfied with the service and would recommend that anyone give it a try.

4. app server log files as tee

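Not a biggy, but instead of redirecting output straight to a file, use tee to make sure we capture everything the app server emits and throws up (in bash 4+, |& is shorthand for 2>&1 |, so stderr is captured too). Something like: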

python app.py --port=<port number> |& tee machine-<port number>.log &

-Srini


Benchmarking JavaScript Engines - V8, SFX-Nitro, Carakan, Tracemonkey

We were evaluating JavaScript engines for integration into a prototype product. Chrome's V8, Safari's SquirrelFish Extreme (aka Nitro), and Firefox's TraceMonkey were the three options we had to consider (simply because all three are open source). There are a lot of comparison studies online, but all of them are either old or no longer relevant, so I thought I would share what we are seeing with respect to speed. Other metrics influence adoption, but this article is solely about speed. I have included Opera results for completeness. IE is not included because of our time constraints and because it is dog slow.

Setup:


  • We used Dromaeo for the comparison. It is super easy to run and compare.
  • Every benchmark was run multiple times and averaged, though the variation was minimal.
  • During every run, the machine had only 3 apps open: one PuTTY session, one Process Explorer, and the relevant browser. So machine resources were the same for all.
  • All the benchmarks were run on a Windows box: Dell PowerEdge SC 420, XP, Pentium 4 HT, 2 CPU 2.8GHz, 2GB RAM.
  • A MacBook Pro 15" was used to make sure the trend is the same.
  • All were the latest release versions: Chrome - 4.1.249.1045 (Build 42898), Safari - 4.0.3 (Build 531.9.1), Opera - 10.51 (Build 3315), Firefox - 3.6.3

Opera has a save-as-text-file feature. So after all the runs, the results were extracted by loading the HTML page in Opera » save as text file » then running the following one-liner to get every run into a ":"-separated text file that is easy to load into Excel.

more * | egrep ":|runs\/s|txt" | egrep -v "URL:|http:|Origin, Source, Tests:|run on:|::::::|rv:" | perl -ne 'chomp; s/runs/ :runs/g; s/txt/txt: /g; if(/runs/){print "$_ \n";}else{print}' > ../foo.txt

more * is used above on the assumption that all the text files are saved in the present working directory.

Sunspider:

A well-balanced benchmark suite covering various areas, including math problems, string operations, crypto, raytracers, etc.

As seen above, Carakan is the clear winner, followed by V8, SFX and TraceMonkey, for the SunSpider benchmark suite.

DOM Core:


These tests include setting and getting DOM attributes, DOM traversal, DOM element querying, etc.

SFX is the clear winner, followed by V8, TraceMonkey and Carakan, for the DOM Core suite.

V8 Test:


These include tests like looping, function calls, object manipulation, etc.

As seen above, V8 is the clear winner, followed by Carakan, SFX, and TraceMonkey, for the V8 benchmark suite.

JS Lib:


From jQuery, the JS Library tests include a lot of DOM modifications, events, prototypes, jQuery DOM traversal, styling, etc.

As seen above, V8 is the clear winner, followed by SFX, Carakan, and TraceMonkey, for the JS Lib benchmark suite.

I haven't included the Dromaeo and CSS Selector benchmarks. Dromaeo, because the Carakan results were like 100x on the regexp cases, which I wanted to double-check. CSS Selector, because each run takes ~20 minutes, which is quite a bit considering multiple runs.

Conclusion:


If you follow the technical documentation on V8, SFX, Carakan and TraceMonkey, you can clearly see similarities in the approach each takes to the problem. Some of the common techniques are virtual machines, JIT compilation, inline caches, and GC optimizations. Yet there are a few differences between the engines.

V8 doesn't generate bytecode; instead it generates native code directly, similar to compilation for static languages. However, it does follow virtual machine techniques under the hood without having two separate steps. SFX generates bytecode and then native code based on the platform it is running on. This approach is similar to LuaJIT, where JIT techniques are applied to the bytecode. Carakan seems to take the best of both: native code compilation for loops and a bytecode interpreter for the remaining code. TraceMonkey is slightly different from the others in that it is the only engine which tries to record/trace as the interpreter kicks in - more details on this here. JaegerMonkey is the next version (still in the works) after TraceMonkey, which seems to take the SFX assembler and combine it with the existing SpiderMonkey and tracer.

We have decided to start off with V8 and keep SFX/Nitro as the backup JavaScript engine. Feel free to drop in a comment/question. 




US population distribution for marketing and planning purposes

We had to do a quick study on where to focus geographically/state-wise for one of our products. Ideally it shouldn't matter where the user is, but we had a constraint to pick 10 states to focus on and needed stats to back the choice.

The obvious start was to look at population stats: more population translates to a larger user/customer base, which translates to focus. For the data source, we used US Census Bureau population projections for each state from year 2000 to year 2009 - available here.

As you can see in the picture below, we added two extra columns to the sorted list: percentage and cumulative percentage.


To our surprise, the first 9 states make up more than 50% of the whole US population. I had looked at the population spread before, but it is definitely a surprise to see 50% of the whole US population living in 9 states.
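
The percentage and cumulative percentage columns described above can be computed with a few lines of Python. This is a minimal sketch: the function name cumulative_share is hypothetical, and the sample figures are made-up placeholders rather than Census data.

```python
def cumulative_share(populations):
    """Return (name, population, percent, cumulative percent) rows,
    sorted from the largest population to the smallest."""
    total = float(sum(populations.values()))
    rows = []
    running = 0.0
    ordered = sorted(populations.items(), key=lambda kv: kv[1], reverse=True)
    for name, pop in ordered:
        pct = 100.0 * pop / total
        running += pct
        rows.append((name, pop, round(pct, 1), round(running, 1)))
    return rows


# Placeholder figures for illustration only (not real state populations).
sample = {"State A": 50, "State B": 30, "State C": 20}
for row in cumulative_share(sample):
    print(row)
```

Reading down the last column until it crosses 50 gives the number of states that together hold half the population.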



Books on Start Up and entrepreneurship

Here is a write-up on a couple of books that I liked for learning the internals of start-ups and how it all works. I would recommend both of them equally. Each has its own strong points. For example, 'High Tech Start Up' shines in the area of market analysis and product planning; 'Engineering Your Start-up' shines in the area of accounting and finance, with lots of templates and ready-made Excel sheets to get started with planning.

Again, as with any good book or education, learn as much as possible and see what makes sense and is relevant for your situation.



        
 


'High Tech Start Up' has 14 chapters in total, spanning over 270 pages. What impressed me most in this book is the market and product planning. According to the author, if we take the x-axis to be the product and the y-axis to be the market, the following quadrants can be made.


Quadrant 1: New Product and New Market - Missionary sales and tech push. This is where consumers don't know they need the product and there is no existing product. Typically high risk and costly.

Quadrant 2: Existing Product and New Market - Marketing driven and essentially you are selling new use of the existing product. 

Quadrant 3: Existing Product and Existing Market - You face a lot of competition. Can be used for income substitution.

Quadrant 4: New Product and Existing Market - Tech push, market pull. Delivering value for an existing problem.
 
It is good if you are in Q4, where you already have the market and you are innovating with your product. If you are in Q1, you will need to raise a lot of money.

I found chapter 5, which deals with business planning, interesting in that you come to realize how many details go into good planning.





'Engineering Your Start-up' has a total of 21 chapters spanning over 400 pages. Accounting, finances, and term sheets are covered really well in this book. The 'Rule of X Competitors' and the financial statements material really impressed me.

Rule of X Competitors: In any market, at any point in time, there is room for only X viable competitors. By definition, the smaller the X, the better. According to the author, if X is more than 7, you might want to reconsider the plan.

On the Financial Statements, it is tough to cover the details in a post like this; however, the following 3 statements are something you want to keep track of all the time.
1. Balance Sheet
2. Income Statement
3. Cash Flow Statement


Final Note: Rules are there to break and laws are there to follow. So you can read and learn about how it is typically done, but adapt the rules to your situation.

Encouraging financial charts

Business Insider has a couple of charts which I thought were interesting.


First one: About the stimulus money and how much of it is still left to be spent. 

Most of us who follow financial news expected the US economy to be off the government help line by Feb 2010. Contrary to that, it seems 2/3 of the total money is still to be spent. Looking at the progress and stability since a year ago, it is quite encouraging.


Second one: About the stock market's historic trend, compared to the '70s trend.

This is somewhat controversial and I don't agree. One can find patterns anywhere, as the article mentioned. However, given the EU situation, the unemployment situation, and the instability in US real estate, I don't see this happening.




Just like any other financial forecast, it could go either way. Only time will tell.