
Cassandra Primary Keys

December 11, 2017

Cassandra schemas can be a bit hard to design and are especially important to design correctly because of the distributed nature of Cassandra. Many new users of Cassandra try to design schemas similar to relational databases because of CQL’s similar syntax to SQL.


CREATE TABLE example (
    key uuid PRIMARY KEY
);

Composite Key

CREATE TABLE example (
    key1 text,  // partition key:  determines how data is partitioned across nodes
    key2 int,   // clustering key: determines how data is sorted within a partition
    PRIMARY KEY(key1, key2)
);

The goals for designing Cassandra keys (stolen from the DataStax documentation) are:

  1. Spread data evenly around the cluster
  2. Minimize the number of partitions read
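To make the two roles in the composite key concrete, here is a toy Python model. It is only a sketch: the node names are made up, and the MD5-plus-modulo placement stands in for Cassandra's real partitioner and token ring, which work differently. It shows that the partition key alone decides which node holds a row, while the clustering key decides the order of rows inside a partition.

```python
import hashlib
from collections import defaultdict

# Hypothetical three-node cluster, for illustration only.
NODES = ['node-a', 'node-b', 'node-c']


def node_for(partition_key):
    """Pick a node from the partition key (toy stand-in for a real partitioner)."""
    digest = hashlib.md5(partition_key.encode('utf-8')).digest()
    return NODES[int.from_bytes(digest[:8], 'big') % len(NODES)]


def store(rows):
    """Place (key1, key2, value) rows: key1 picks the node, key2 sorts the partition."""
    cluster = defaultdict(lambda: defaultdict(list))
    for key1, key2, value in rows:
        cluster[node_for(key1)][key1].append((key2, value))
    for partitions in cluster.values():
        for partition in partitions.values():
            partition.sort()  # rows within a partition are ordered by clustering key
    return cluster


rows = [('alice', 3, 'c'), ('alice', 1, 'a'), ('bob', 2, 'b'), ('alice', 2, 'b')]
cluster = store(rows)
print(cluster[node_for('alice')]['alice'])
```

Reading every row for 'alice' touches a single partition on a single node, which is the second goal; choosing a key1 with many distinct values is what serves the first.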


MyPy Review

November 2, 2017

I recently added type annotations to two of my projects git-browse and git-reviewers using mypy and found it to be relatively enjoyable. Adding types to python definitely helps to make code self-documenting and effectively increases the number of tests in your code. There are a few large issues though for anyone trying to add type annotations:

  1. In order to use python’s type annotation syntax (rather than type comments), your code must be Python 3 only. (Yes, you should use python 3 whether or not you’re adding type annotations)

  2. You must have library stubs of your imports so that MyPy can infer types. So far, there are very few library stubs available and even some extremely popular packages like Flask aren’t covered. This limits type checking to packages with few if any external dependencies.

Adding in type annotations, I also ran into a few issues:

  1. The docs are quite good but given python typing’s obscurity, it’s still hard to find answers for more esoteric features.

  2. The syntax for default values, e.g. (from the mypy docs):

def greeting(name: str, prefix: str = 'Mr.') -> str:
    return 'Hello, {} {}'.format(name, prefix)

puts the default value after the type annotation. For a person who hasn’t worked with the type syntax before, at first glance it looks like a string value is being assigned to str within a dictionary.

  3. MyPy requires newly instantiated empty iterables (lists, sets, dictionaries) to include annotations so it can type check elements. However, the native python syntax (before the variable annotations added in Python 3.6) has no support for it, which requires adding type comments, resulting in:

data = []  # type: List[int]

  4. The comment syntax has a bug where its types require imports which set off linters like Flake8 as an unused import. From the above example, this ends up requiring odd code to pass both the flake8 linter and mypy:

from typing import List  # NOQA
data = []  # type: List[int]

MyPy is still under heavy development with significant hands-on support from Guido van Rossum himself. Overall though, adding in types was still a relatively easy and useful exercise and helped prompt some refactorings.

PS - Turns out that python-markdown2 has a bug when rendering code fences inside of lists.


Griping about time zones

October 26, 2017

Daylight savings is ending in two weeks and I’m going to gripe about it. Not the usual gripe about people having to adjust their clocks and schedules (I, for one, actually like daylight savings), but about how basically everyone writes time zones incorrectly. When people give a time with a time zone, such as five o’clock in San Francisco, many people just write “5 PDT” or “5 PST” without giving a thought to where that “D” or “S” comes from. Well, those letters stand for “Daylight” and “Standard,” respectively, so you should only use “PDT” to refer to times in the summer, and “PST” to refer to times in the winter. If you want to be generic, you can use “PT” for everything and let context take care of the exact time zone.
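As a rough illustration of the rule, here is a minimal Python sketch that picks the abbreviation from the date. It assumes the US rules in force since 2007 (daylight time runs from the second Sunday of March to the first Sunday of November) and works at whole-day granularity, ignoring the 2 a.m. switchover. The function names are made up for this example.

```python
from datetime import date, timedelta


def nth_sunday(year, month, n):
    """Return the date of the n-th Sunday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0 and Sunday is 6
    days_until_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_until_sunday + 7 * (n - 1))


def pacific_abbreviation(day):
    """Return 'PDT' or 'PST' for a date (assumes post-2007 US DST rules)."""
    dst_start = nth_sunday(day.year, 3, 2)   # second Sunday of March
    dst_end = nth_sunday(day.year, 11, 1)    # first Sunday of November
    return 'PDT' if dst_start <= day < dst_end else 'PST'


print(pacific_abbreviation(date(2017, 10, 26)))
print(pacific_abbreviation(date(2017, 12, 11)))
```

For real code, a time zone database is the better choice since the rules change over time; the sketch only spells out where the “D” and “S” come from.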


Bundling Python Packages with PyInstaller and Requests

September 23, 2017

I recently tried using PyInstaller to bundle python applications as a single binary executable. PyInstaller was relatively easy to use and its documentation is pretty good. However, I ran into a bit of trouble bundling the python requests package because of problems with requests looking for a trusted certificates file, usually emitting an error like OSError: Could not find a suitable TLS CA certificate bundle, invalid path: .... In a typical installation, the certifi package includes a set of trusted CA certificates but when PyInstaller bundles the requests and certifi packages, certifi can’t provide a file path for requests to use.

The way to fix this is to set the REQUESTS_CA_BUNDLE variable (documentation) within your code before using requests:

import os
import pkgutil
import tempfile

import requests

# Read the cert data
cert_data = pkgutil.get_data('certifi', 'cacert.pem')

# Write the cert data to a temporary file
handle = tempfile.NamedTemporaryFile(delete=False)
handle.write(cert_data)
handle.flush()

# Set the temporary file name to an environment variable for the requests package
os.environ['REQUESTS_CA_BUNDLE'] = handle.name

# Make requests using the requests package
requests.get('https://example.com')  # placeholder URL

# Clean up the temp file
handle.close()
os.unlink(handle.name)


Go Receiver Pointers vs. Values

September 4, 2017

When writing a method in Go, should you use a pointer or a value receiver?

Type                                Use
Basic                               Value
Map                                 Value
Func                                Value
Chan                                Value
Slice (no reslicing/reallocating)   Value
Small Struct/Array                  Value
Concurrent mutations                Value if possible
Is Mutated By Method                Pointer
Large Struct/Array                  Pointer
Contains a sync.Mutex               Pointer
Contains Pointers                   Pointer
🤷                                  Pointer

Distilled from Golang Code Review Comments


Fixing latency

September 1, 2017

Last night, I finally discovered and fixed the reason for the very high latencies on my website. Many of its data-heavy pages have had multi-second response times, and although a Django/MySQL site on a minimally provisioned server isn’t the epitome of performance engineering, I’ve always bet it should run faster. After four years of optimizing parts of the website, I finally found a way to reduce latencies by an order of magnitude and bring sub-second response times.

These were some of the things that I tried which gave relatively minimal benefit:

  • Adding memcached to cache model data and view partials
  • Adding a Cloudflare CDN
  • Upgrading the server, particularly increasing CPU cores and memory
  • Optimizing the MySQL configuration
  • Rate limiting web crawlers
  • Denormalizing database models

What I did last night was check django-silk, where I noticed that on certain pages, multiple simple but slow SQL queries were made that filtered by indexed fields. Some of these queries were taking on the order of hundreds of milliseconds, over 10X the latency of the same query on a local unoptimized virtual machine. Digging deeper with EXPLAIN queries and checking the database schema, I found several indices were missing. Although the indices were included in the models, they were either never added or were dropped some time ago, probably by Django South, Django’s old migration tool. Evidently, one should not rely too much on ORMs, and manually checking your MySQL schemas can result in some amazing latency improvements:

StatsOnIce Latencies
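The same kind of check can be sketched in a self-contained way. The example below swaps in SQLite (from Python's standard library) for MySQL, and the table and column names are hypothetical. It asks the database for its query plan before and after the index exists, which is the moral equivalent of the EXPLAIN queries mentioned above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical table; the real schema is not shown in the post.
conn.execute('CREATE TABLE game (id INTEGER PRIMARY KEY, team_id INTEGER, score INTEGER)')
conn.executemany(
    'INSERT INTO game (team_id, score) VALUES (?, ?)',
    [(i % 30, i) for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's plan description for a lookup filtered by team_id."""
    rows = connection.execute(
        'EXPLAIN QUERY PLAN SELECT * FROM game WHERE team_id = 7'
    ).fetchall()
    return ' '.join(row[-1] for row in rows)  # last column holds the plan detail


plan_without_index = query_plan(conn)   # full table scan
conn.execute('CREATE INDEX game_team_id ON game (team_id)')
plan_with_index = query_plan(conn)      # index search

print(plan_without_index)
print(plan_with_index)
```

If the plan for a query that filters on a supposedly indexed field still reports a scan, the index the model declares probably does not exist in the database.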


Showing schemas in different databases

August 26, 2017


describe keyspace <keyspace>;




SELECT *
FROM information_schema.columns
WHERE table_schema = '<DATABASE NAME>';


Straight lines

June 2, 2017

There are a ton of straight lines commonly seen in English text.

  • - - the hyphen, used to join breaks within words or compound words
  • – - the en-dash, used for spans of numbers or compound adjectives
  • — - the em-dash, used in place of colons, commas, and parentheses
  • _ - the underscore, originally used to underline words by typewriters
  • | - the vertical bar of many uses.

This doesn’t even include the dozens of singular lines available in Unicode.


Emerson on Intellect

May 29, 2017

Before PyCon, I visited Powell’s City of Books and stumbled into the fiction aisle of D through H authors. I sampled some good books by Dostoyevsky, Dumas, and Hemingway, but one particular passage stuck in my memory:

Every human being has a choice between truth and repose. Take which you please, you cannot have both. Between these, as a pendulum, man oscillates. He in whom the love of repose predominates will accept the first creed, the first philosophy, the first political party he meets– most likely his father’s. He gets rest, commodity, and reputation; but he shuts the door on truth. He in whom a love of truth predominates will keep himself aloof from all moorings and afloat. He will abstain from dogmatism and recognize all the opposite negations between which, as walls, his being is swung. He submits to the inconvenience of suspense and imperfect opinion but he is a candidate for truth, as the other is not, and respects the highest law of being.

Ralph Waldo Emerson Essays: First Series Essay XI Intellect


Core metric for developer productivity

May 21, 2017

A lot of software companies are concerned about the concept of developer productivity and maximizing the amount of output each engineer produces. The problem is that few companies measure productivity in any quantitative manner, instead using subjective metrics like developer happiness or irrelevant metrics like lines of code. Additionally, companies frequently confuse product management processes, like agile development, with ways of improving software engineering metrics. This results in software being developed slowly and quite a lot of grumbling when the scrum master burns hours of each engineer’s time on sprint planning.

Looking around, I think that the core metric for developer productivity should be the average frequency of the edit/test/debug cycle. In all but the most intellectually challenging of programming problems, there is a clear direction of what needs to be built but the limiting factor is making sure the programmer’s code works as intended. Optimizing the edit/test/debug cycle therefore makes developers produce value faster. Many features in IDEs and modern development practices are aimed at shortening steps in the cycle, reducing the number of cycle iterations needed to complete the development work, or helping developers stay within the cycle. Indeed, Joel Spolsky’s famous test for programming environments can be summarized as checking the health of a development team’s edit/test/debug cycle. I therefore hope that when you evaluate processes and products to help you be more productive, you think of how they will benefit your edit/test/debug cycle.


How to capture a camera image with python

May 7, 2017

While working on sky-color, I found that taking a photo using a webcam with python was pretty hard. opencv has some pretty opaque documentation since it’s primarily written for C developers, and simplecv is dead and doesn’t support python 3. Stackoverflow is also filled with outdated, incorrect answers. I therefore had to figure out a way to take a photo and save it to a file myself using python 3.6 and MacOS.

Prerequisites: Install numpy and opencv. My requirements.txt file looks like:



import time
import cv2

camera_id = 0
file_name = 'image.png'
cam = cv2.VideoCapture(camera_id)
time.sleep(1)  # Give some time for the webcam to automatically adjust brightness levels
ret_val, img = cam.read()  # ret_val is False if no frame could be captured
cv2.imwrite(file_name, img)
cam.release()

Further reference: OpenCV API


Python has a ridiculous number of inotify implementations

May 2, 2017

Mostly stolen from watchdog’s readme:

Looking through a few of these, I think I recommend watchdog and inotify_simple.


Projects: Gentle-Alerts

April 27, 2017

Gentle-Alerts is a chrome extension that I built to fix the problem of noisy popup alerts in Chrome. Using Google Calendar a lot, I used to get a popup alert before every event that I was invited to. Fiddling with its built-in “browser notifications”, I wasn’t very satisfied because of its pop-over UX. I therefore decided to create Gentle-Alerts to solve this problem for Calendar and all other websites.

Gentle-Alerts works by overriding the window.alert built-in function with a custom function that shows a browser modal. In building Gentle-Alerts, I had some fun with some different frontend programming rules. Rather than the usual problem of writing javascript code that has to be compatible with different browsers with a known environment, writing the javascript for Gentle-Alerts required me to write javascript code compatible specifically for Chrome but running against the javascript environment of any website. I therefore kept the code pretty simple and used only vanilla javascript without any third-party dependencies.

Thanks to Chris Lewis, David Hamme, Song Feng, and Scott Kennedy for testing the extension.


Creating a new PyPI release

April 24, 2017

As a reminder to myself for the magic incantations for uploading a repository to PyPI:

pip install twine
python setup.py sdist bdist_wheel
twine upload dist/*


Eva Air USB Ports

April 24, 2017

I just got off an Eva Air flight which had in-seat USB ports not only for power but also for data. I found that when I plugged in USB keys, the system could navigate through FAT32 and NTFS memory sticks, which makes me think that the in-flight entertainment system was based on an embedded Windows OS. Several of the games in the system also had multiplayer modes, which would mean that there must be some LAN within the plane, and since the plane’s audio announcement system could pipe audio through the seatback system, that must be connected as well.

Although I doubt the Boeing 777’s designers would also link up flight-critical systems like avionics, there is something to be said about the possibilities arising from putting some sufficiently determined hacker on a plane with WIFI, an electrical socket, physical access to a Windows-backed USB port, and twelve-plus hours of boredom.

From the Eva Air Website:

In-seat USB Port

If you are traveling on our selected B777-300ER (Royal Laurel Class) and A330-300 aircrafts (Premium Laurel Class), you can navigate through PDF files, photos and other multimedia content stored in your storage devices (iPod, USB flash drive**, AV connector-enabled device, etc.) on your seat-back screen. Instructions are shown on the screen once connected.


Projects: Git-Browse

March 18, 2017

I’ve recently worked on a project called Git-Browse to help look up information in github and uber’s phabricator. Quite often, I’ve found the need to look up information about a git repository in order to share code with people, find history, or file issues. Having to manually look up the repository on github or phabricator takes excessive time and can easily lead to incorrect information from looking at forks. Git-Browse solves the problem by introspecting a git repository’s .git/config file and automatically opening the git repository in a browser. Git-Browse can then be integrated in your local or global .gitconfig as an alias so you can open repository objects with git browse <path>.
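The idea can be sketched in a few lines of Python. This is not Git-Browse's actual code, only an illustration under simplifying assumptions: it looks only at the "origin" remote, handles only GitHub-style SSH and HTTPS remotes, and both function names are made up.

```python
import re


def origin_url(git_config_text):
    """Return the url of the "origin" remote from .git/config text, or None."""
    in_origin = False
    for raw_line in git_config_text.splitlines():
        line = raw_line.strip()
        if line.startswith('['):
            # Track whether we are inside the [remote "origin"] section
            in_origin = line == '[remote "origin"]'
        elif in_origin:
            key, _, value = line.partition('=')
            if key.strip() == 'url':
                return value.strip()
    return None


def browser_url(remote_url):
    """Turn a git remote into a web URL (assumes GitHub-style hosts)."""
    match = re.match(r'^git@([^:]+):(.+?)(\.git)?$', remote_url)
    if match:
        return 'https://{}/{}'.format(match.group(1), match.group(2))
    return re.sub(r'\.git$', '', remote_url)


sample_config = '''[core]
\trepositoryformatversion = 0
[remote "origin"]
\turl = git@github.com:user/repo.git
\tfetch = +refs/heads/*:refs/remotes/origin/*
'''
print(browser_url(origin_url(sample_config)))
```

The resulting URL could then be handed to Python's webbrowser module to open it.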

While working on git-browse, I found that it is similar to github hub’s browse command, but it would be a lot easier to add support for additional repository hosts to git-browse. Hub doesn’t support opening arbitrary branches or commits either, but it does support opening issues and wikis.

Git-Browse requires python 3 to run. Install it by following the Readme Instructions.


Cassandra Compaction Strategies

March 5, 2017

When setting up Cassandra tables, you should specify the compaction strategy Cassandra should use to store data internally. To do so, just add

WITH compaction = { 'class': '<compactionName>' }

to an ALTER TABLE or CREATE TABLE command.

Name                            Acronym   Used For
SizeTieredCompactionStrategy    STCS      Insert-Heavy Tables
LeveledCompactionStrategy       LCS       Read-Heavy Tables
DateTieredCompactionStrategy    DTCS      Time Series Data


Code Is Like Tissue Paper

January 25, 2017

Code is like Tissue Paper, it falls apart after one use

Code is like Tissue Paper, there’s holes everywhere

Code is like Tissue Paper, it sucks to have to use someone else’s

Code is like Tissue Paper, you get a new one, even for the same problems

Code is like Tissue Paper, you should not feel bad when you throw it away

Code is like Tissue Paper, there’s many layers

Code is like Tissue Paper, other people won’t like to use yours


Seen in a bathroom stall at MIT

January 24, 2017

Do you compute?
No, I come poop


Underused Python Package: webbrowser

January 21, 2017

While I was working on git-browse (post coming soon), I found out about python’s webbrowser package. It’s a super-simple way of opening a URL in one of many different browsers. Python’s standard library is pretty awesome.
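A minimal sketch of how it can be used follows. The wrapper function is made up for this example; it exists only so the opener can be swapped out in tests or in environments without a browser.

```python
import webbrowser


def open_in_browser(url, opener=webbrowser.open):
    """Open url in a new browser tab where possible.

    webbrowser.open returns True if a browser could be launched.
    The opener parameter is a hypothetical hook for testing.
    """
    return opener(url, new=2)  # new=2 asks for a new tab


if __name__ == '__main__':
    open_in_browser('https://www.python.org')
```

To target a specific browser instead of the default one, webbrowser.get('firefox').open(url) works when that browser is installed and registered.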


Pax ?

January 5, 2017

There’s an idea in (popular) political science about Pax Romana, the idea that for several hundred years, Ancient Rome’s preponderance of power created an atmosphere of relative peace both inside and outside its borders. This was supported by a massive, well-organized military whose job it was to enforce Roman law inside its borders, and suppress any enemies internal or external.

This idea has been applied to other cases, in particular Pax Britannica and Pax Americana. In all three of these cases, the hegemon has had a dominating military in a specific battlespace compared to any rival. Rome had superior land armies, Britain had superior navies, and the US has a superior air (and space) force. States were thus geared for supporting these forces and reaping the economic benefits they gave. Rome’s armies sustained Rome’s economies through slaves, Britain through mercantile trade, and the US through rapid global soft and hard power projection.

The question therefore is what does the future hold for hegemonic peace? We’re probably in the middle of a Pax Americana so during the next few decades we’ll have 1) continued Pax Americana, 2) a shift of power to a different hegemon (as happened to Pax Britannica at the end of the 19th century), or 3) a world order that devolves into several localized powers that do not necessarily cooperate for mutual peace and gain (as happened after Pax Romana in the Middle Ages). Since military and economic power are tightly linked and interdependent, I believe the new economic shift to the Internet will require a new hegemon to develop and maintain technical and information superiority to challenge the current world order. Many countries, particularly Russia and China, have developed these cyberwarfare capabilities and successfully tested them against adversaries. The US also has significant capabilities but cyberwarfare is so far in such a nascent state that it’s hard to establish dominance. Hopefully, it won’t require a real test like a repeat of the Napoleonic wars or WWI/WWII.


Golang Review

January 2, 2017

I’ve been using Go for several projects at Uber and personally. From my experience, I’ve developed some opinions on the Go programming language, from both objective and subjective points of view. Compared to other languages, I find that Go leaves much to be desired in terms of its language design.

Let’s start off with the nice points. Go is pretty opinionated about its development setup with standardized layouts of packages and build systems. Language-wise, it has a simple syntax that can be easily learned by anybody with a background in C, Java, or Python. It’s statically typed, expressive while legible, and has understandable concurrency primitives.

On the other hand, Go has a lot of downsides in terms of environment and language. Environment first. A glaring issue with Go is dependency management. Go’s development layout assumes you’re working on a monorepo (more on this later). If you’re not working on a monorepo and/or you don’t have direct control over your dependencies, you’re going to have a bad time trying to keep your dependencies up to date without breaking things. Many tools have been written to work around this, but the core issue is that many Go packages aren’t themselves versioned. Semantic versioning and even change logs seem to be new and controversial in many Go communities.

My other gripe with Go’s environment is the work it takes to set up and maintain a new project. A go repo not only requires the entire $GOPATH to be set up, with attendant dependencies and repositories, but also requires Makefiles, and a lot of testing boilerplate. Arguably, in a monorepo-style development process, this is a cost that scales O(1) instead of O(n), but I feel that in many Go projects, the amount of Bash and scripting language (most commonly python) code exceeds the amount of actual Go code. Maintaining this (often untested and hacky) code is an extra source of work for a Go developer.

As for the language itself, I’ll freely echo the common criticism about Go’s lack of generics and while I have gotten used to the lack of exceptions, this all seems to oversimplify the language. Many nice programming idioms found in other languages (e.g. monads, classes, duck typing) are unavailable in Go due to the restricted feature set.

From these points, I find that Go is a fine language for large teams of average programmers creating large monorepo systems. Go caters to this demographic of programmers by being essentially a compilable, statically-typed BASIC while ignoring the last several decades of Programming Language Theory.


Wadler's Law

December 15, 2016

In any language design, the total time spent discussing
a feature in this list is proportional to two raised to
the power of its position.

0. Semantics
1. Syntax
2. Lexical syntax
3. Lexical syntax of comments


Tunnel v2

December 8, 2016

I posted a handy single-line trick to forward connections from one ip/port address to another. I recently had a problem where I needed to forward to a port on a local host but the service was only listening on the public interface. OpenSSH however tries to be clever and rewrites the public ip to localhost, which isn’t overridable (the tunnel entrance is overridable but not the exit, which is what I was trying to change).

I therefore present tunnel v2, using NodeJS instead of SSH:

'use strict';

var net = require('net');
var process = require('process');
var console = require('console');

// parse "80" and "localhost:80" or even ""
var addrRegex = /^(([a-zA-Z\-\.0-9]+):)?(\d+)$/;

var addr = {
    from: addrRegex.exec(process.argv[2]),
    to: addrRegex.exec(process.argv[3])
};

if (!addr.from || !addr.to) {
    console.log('Usage: <from> <to>'); // eslint-disable-line no-console
    throw new Error('Not enough arguments');
}

net.createServer(function onServer(from) {
    // Connect to the destination and pipe data in both directions
    var to = net.createConnection({
        host: addr.to[2],
        port: addr.to[3]
    });
    from.pipe(to);
    to.pipe(from);
}).listen(addr.from[3], addr.from[2]);

Adapted from Andrey Sidorov



December 5, 2016

I have a particular interest in multicolor pens and I’ve developed a taste for them over the years since using the common BIC 4-Color Ballpoint Pen. Those pens only have four basic colors, and the actual tips and ink result in inconsistent sticky ink flow. BIC also makes 4-Color pens with finer points but that only results in spotty writing.

Next up, I’ve also used the Zebra Multi Color Pen which is a similar pen but with an additional pencil included. The ink and ballpoint are also slightly better than the BIC’s.

For those who optimize for options, there are 6-color and 10-color pens. If the ridiculousness of the number of colors doesn’t keep you from using the pens every day, the problem with having so many colors is that their individual ink sticks are more off-center. This causes more bending of the ink stick and can make the side of a ballpoint pen scrape along paper. It also makes writing more spongy because of the increased room and bend within the pen.

The pen that I use now is the Uni Jetstream Pen. It contains the four standard black, red, green, and blue colors, plus a pencil and eraser. It has some reasonable weight, the writing is consistent, and it looks pretty professional.


SSH Tunnel

September 18, 2016

This is a handy command to memorize:

ssh -fN -L $port1:$host1:$port2 $host2

This opens $port1 on your local machine and forwards requests made to it, through $host2, to $host1 on $port2.


That time I was a whitehat hacker

September 18, 2016

I’ve been trying to find a replacement for github streaks after they removed them a few months ago. I was pretty happy to find GithubOriginalStreak which had browser plugins for Chrome, Firefox, and Opera. After installing it and noticing it wasn’t correctly reporting streak lengths, I dug into its source code and was surprised to see it was using github gists as a datastore for streak information.

This presented a few problems:

  1. Github gists aren’t supposed to be used as a high performance database. Github probably rate limits access to its data.
  2. The packaged browser extensions contain read/write keys to the account that owns the github gist. The GithubOriginalStreak repository itself doesn’t have the keys but the keys are easily extractable from the extensions anyways.
  3. Neither the gist nor the code does any validation of incoming data before the supposed streak lengths are displayed inline in the Github page.

This last problem was the most critical. A malicious attacker could have gotten write-privileges by downloading and unpacking the extension, then modified the gist to inject an XSS attack into someone else’s browser. The best part is that the gist contains a list of all people who use the extension so you could target a specific person for XSS.

I talked to the author afterwards and thankfully he was receptive to the feedback. The extension is still using Github gists but is now doing some data validation. With the new profile design, extensions like these shouldn’t be needed anymore.


Comparison of country and company GDPs

September 8, 2016

I’ve noticed that many companies have yearly revenues on the order of those of many (non-insignificant) countries. With countries though, yearly revenues are usually called gross domestic product (GDP). I therefore present a comparison of national revenues and corporate GDPs:

Company?   Rank within type   GDP (billions of USD)   Name
           1                  18,558                  USA
           2                  11,383                  China
           23                 509                     Taiwan
Y          1                  482                     Walmart
           24                 474                     Poland
           37                 306                     Israel
Y          6                  305                     Samsung
           38                 302                     Denmark
           39                 295                     Singapore
Y          7                  273                     Royal Dutch Shell
Y          8                  270                     Vitol
Y          9                  268                     ExxonMobil
           40                 266                     South Africa
           42                 253                     Colombia
Y          11                 245                     Volkswagen
           43                 235                     Chile
           44                 234                     Finland
Y          12                 234                     Apple
           45                 226                     Bangladesh

This table does not include state-owned companies. Fun fact - Vitol, which has a revenue of $270B is headquartered in Switzerland, which has a GDP of $652B.



Sketching Science

September 8, 2016

Sketching Science


Tech Hiring Misperceptions At Different Companies

July 22, 2016

I’ve seen many companies’ technical interview processes and I feel many of them are wrong. For almost any advice about technical interviewing you’re likely to hear the opposite from a different person. What people don’t realize is that different types of companies need different types of engineers, and that you can’t select the correct type of engineer if you’re interviewing for the same qualities.

Many people, especially those firmly in the Silicon Valley Blogosphere, opine that you should only hire ninja rockstar jedi engineers (nobody calls them that anymore, but the mindset is still there) for your startup (because you obviously work at a startup, right?) whether your startup is trying to sell a static program analysis engine or be an “Uber for X” company. The fact is that for most companies, good software will not save a failing company and bad software will not sink a successful company (case in point: Yahoo and your typical government contractor, respectively). Therefore, if you’re in the category of companies whose success doesn’t depend on the strength of your engineering organization (which is most companies), I think you should stop trying to attract really good engineers - they’re going to get bored writing yet another CRUD app and you’re going to be paying a lot more money.


Calculating Rails Database Connections

June 26, 2016

I recently ran into a problem with calculating the number of database connections used by Rails. It turns out that for a typical production environment, it’s actually hard to find the maximum number of connections that would be made. MySQL and PostgreSQL also have relatively low default maximum connection limits (151 and 100, respectively) which means it’s really easy to get an error like “PG::ConnectionBad: FATAL: sorry, too many clients already.”

After some digging, I believe the maximum number of open connections is the “pool” value in config/database.yml multiplied by the number of processes (workers in Puma). If you’re running Sidekiq or another background job processor, you’ll also need to add the number of background processes to your web server’s process count.
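As a quick sanity check, the formula can be written out as a small Python function. The function name and the deployment numbers are hypothetical; the limit of 100 is the PostgreSQL default quoted earlier in this post.

```python
def max_db_connections(pool_size, web_processes, background_processes=0):
    """Upper bound on open connections: pool size times total process count."""
    return pool_size * (web_processes + background_processes)


# Hypothetical deployment: pool of 5, 4 Puma workers, 2 background job processes
needed = max_db_connections(pool_size=5, web_processes=4, background_processes=2)
postgres_default_limit = 100  # default limit quoted in the post

print(needed, needed <= postgres_default_limit)
```

With these made-up numbers the app stays under the default limit, but doubling either the pool or the process count across a few servers would quickly exceed it.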


DevOps Reactions

June 12, 2016


Tuning Postgres

June 9, 2016

I was trying to tune a PostgreSQL database and found that default settings for PostgreSQL are optimized for really old computers. If you want to fix it, you can spend an hour reading through documentation, or you can use pgtune. If CLI isn’t your thing, you can try the web version.



June 4, 2016

Romanesco Broccoli

(Romanesco broccoli + Fibonacci)