email: sam at this domain
Sam lives in El Cerrito.
- ~40 custom cookbooks for in-house applications
- ~30 wrapper cookbooks to customize official cookbooks
- ~90 server roles
- ~5 Chef environments for development (one per team), one for staging, and one for production. All configuration changes can be tested before going into production.
- ~400 nodes under Chef control
- Opscode's hosted Chef service
- Berkshelf for cookbook dependency management
- Chef search (Solr) is used for configure-time service discovery. For instance, database master and slave servers can be configured to discover each other automatically, exchange credentials, and set up firewall rules so they can communicate.
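The discovery step can be illustrated with a minimal sketch in plain Python. This is not Chef's API: the `Node` type, the `search` helper, the role names `db-master` and `db-slave`, the addresses, and the rule format are all hypothetical stand-ins for a query against the Chef server's node index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    ip: str
    environment: str
    roles: tuple


def search(nodes, role, environment):
    """Stand-in for an index query: nodes holding `role` in `environment`, sorted by name."""
    return sorted(
        (n for n in nodes if role in n.roles and n.environment == environment),
        key=lambda n: n.name,
    )


def firewall_rules(nodes, environment, port=3306):
    """Build one allow rule per (slave, master) pair found at configure time."""
    masters = search(nodes, "db-master", environment)
    slaves = search(nodes, "db-slave", environment)
    return [
        f"allow tcp from {s.ip} to {m.ip} port {port}"
        for m in masters
        for s in slaves
    ]


# Hypothetical inventory; in practice this data would come from the node index.
INVENTORY = [
    Node("db1", "10.0.0.1", "production", ("db-master",)),
    Node("db2", "10.0.0.2", "production", ("db-slave",)),
    Node("db3", "10.0.0.3", "staging", ("db-slave",)),
]

for rule in firewall_rules(INVENTORY, "production"):
    print(rule)
```

Because the rules are derived from the search result on every run, adding a new slave to the inventory is enough for the master's firewall to open up on its next configuration pass; nothing has to be edited by hand.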
This is an ongoing project to document each part of our system with an operations audience in mind.
I identified a need for an operations handoff document and wrote the first example readme. I coached and gave feedback to other developers as they were writing readmes for their components.
Readmes are the starting point for writing Chef recipes, Nagios checks, and emergency runbooks. Each readme includes everything you need to run and monitor a single component. The readmes cover customer-facing components, back-end components, and third-party software used for internal infrastructure.
Readmes are based on a standard template which includes:
- Availability Requirements
- Business effects - including customer-visible effects.
- Data responsibility. What data is this component responsible for and how do you make backups of it?
- Deployment considerations - special requirements for installation and upgrades
- Service Dependencies - which internal and 3rd party services/components does this component depend on?
- Architecture description - what kind of failures can it tolerate?
- Health check procedures - positive and negative health indicators
- Failover and disaster recovery procedures
- Routine maintenance procedures
- Known Operational Issues and Workarounds
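A template like this lends itself to a simple automated check. The sketch below is an illustration only, not tooling from the project: it assumes each section title appears on a line of its own (optionally prefixed with `#` or followed by a colon), and the function name `missing_sections` is hypothetical.

```python
# Section titles taken from the template list above.
REQUIRED_SECTIONS = [
    "Availability Requirements",
    "Business effects",
    "Data responsibility",
    "Deployment considerations",
    "Service Dependencies",
    "Architecture description",
    "Health check procedures",
    "Failover and disaster recovery procedures",
    "Routine maintenance procedures",
    "Known Operational Issues and Workarounds",
]


def missing_sections(readme_text):
    """Return the template sections that have no matching heading line."""
    headings = {
        line.strip().lstrip("#").strip().rstrip(":").lower()
        for line in readme_text.splitlines()
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


# Hypothetical draft that only covers two sections so far.
draft = """# Availability Requirements
Must be reachable during business hours.

# Health check procedures
Check the status page.
"""

for section in missing_sections(draft):
    print("missing:", section)
```

Run against a draft readme, it lists the sections still to be written, which could make a review of a handoff document quicker.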
Centralized RRD-based graphing system for parking event detection software
- Stores ~15,000 data points per second
- RabbitMQ (1800 messages per second)
- Eventually performed well enough, but we should have just used Graphite
A big monolithic Rails app
- Ruby on Rails (1.2 -> 2.3 -> 3.1)
- Google Maps based network and sensor installation planning tool
- Customer-facing data presentation applications
- Receive, process and store events from the parking event detection system
- RVM, RubyGems, Bundler
- Ruby on Rails 3, REST APIs
- Chef (Berkshelf, knife-vsphere)
- Ember.js and SproutCore
- Google Maps API (older versions)
- Subversion to Git transition, including introductory training
- MySQL - intermediate tuning, built-in replication
- RabbitMQ, bunny AMQP gem
- Amazon S3 (key/value store), Route53 (DNS), EC2
- Vagrant (development VM provisioning)
- Jenkins - basic
- introductory tech talks on Ruby and Git
- TDD/BDD with RSpec
- Comprehensive automation.
- Your infrastructure configuration should be described in a machine readable format and stored in a revision control system
- You should be able to configure a whole datacenter without manual intervention.
- If you find it necessary to log in to a system to perform an action, you should open a ticket so that your task can eventually be automated.
- Orchestration - how do you model, automate, and change-manage operational events like software upgrades, architectural changes, datacenter moves, failover events, etc.?
- Share-nothing / eventually consistent architectures like Wave Operational Transform, Dynamo/Cassandra/Riak, etc.
- Tools to reinforce good development habits, like using Gerrit or GitHub pull requests to make code reviews an automatic part of the development process.
- Present common operational tasks through an easily customizable UI like Jenkins so that these tasks are easy to perform in a consistent manner and there is a record of who made a change and when it was made.
- Elasticsearch / Logstash / Kibana
- Clojure and its ecosystem
- http://mesos.apache.org/ Cluster manager. Like Google's Borg, but from Twitter and UC Berkeley
- http://airbnb.github.io/chronos/ cron replacement that runs on top of Mesos
- Cassandra or Riak
- Docker - (Linux containers / BSD jails / Solaris Zones)
- CoreOS - A Docker-based Linux distribution with a mostly read-only root filesystem.
- industrial design
- armchair cognitive science
- cars, from a technical standpoint (ex-mechanic)