• Background
  • What Could Possibly Go Wrong?
  • Compute Engine Power Tools
  • Demonstration Prototype


What strikes you immediately is the scale of things. The room is so huge you can almost see the curvature of Earth. You're in the throbbing heart of the Internet. You really feel it.
— Steven Levy (Photo: Google/Connie Zhou)

Google Compute Engine 101

  • Infrastructure as a Service (IaaS)
  • Virtual Machines, Networks, Storage
  • Built with Google DNA

Q: "Can I play too?"

A: "Yes!"

Q: What Could Possibly Go Wrong?

A: Everything.

Seriously, what can go wrong?

  • Human Error
    • config, procedure, accidents
  • Software Failures
    • resource, bug, bad recovery, failed upgrades
  • Hardware/Environment Events
    • scheduled maintenance
    • outages
  • Security Failures
    • password leak, DoS, worms/viruses, browser/auth, theft
Recognizing hazard and successfully operating inside tolerable performance boundaries requires intimate contact with failure.
— Richard Cook, M.D.
How Complex Systems Fail

Process Guidelines

  • Set reliability goals & track metrics
  • Document everything - learn from your outages
  • Automate everything - manual procedure == failure waiting to happen
  • Prepare for maintenance events & outages
  • Build a capacity planning model and a zone/region move plan
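
The first guideline comes down to simple arithmetic: an availability target implies a downtime budget, and tracked outage minutes show whether you are inside it. A minimal sketch, assuming a 30-day month; the helper names and the example targets are hypothetical, not figures from the talk:

```javascript
// Hypothetical helper: minutes of downtime allowed in a period for an
// availability target given as a fraction (0.999 means "three nines").
function downtimeBudgetMinutes(availability, periodDays) {
  if (availability <= 0 || availability > 1) {
    throw new RangeError('availability must be in (0, 1]');
  }
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - availability);
}

// Hypothetical helper: availability actually achieved, computed from the
// outage minutes you tracked over the same period.
function measuredAvailability(outageMinutes, periodDays) {
  const totalMinutes = periodDays * 24 * 60;
  return 1 - outageMinutes / totalMinutes;
}

// Assumption: a 30-day month and three illustrative targets.
for (const target of [0.99, 0.999, 0.9999]) {
  const budget = downtimeBudgetMinutes(target, 30);
  console.log(target + ' -> ' + budget.toFixed(1) + ' min/month');
}
```

Under these assumptions a three-nines goal leaves roughly 43 minutes of downtime per month, which is a useful number to hold every manual procedure against.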

Power Tools

Know Your Storage!

Scratch Disk
Persistent Disk
Cloud Storage
Cloud SQL
Cloud Datastore
Bring your own data store

Storage: Persistent Disks

  • Sweet spot: Durable, flexible and consistent
  • Features: Global snapshots, bootable, attachable, read-only sharable
  • Uses: Database storage, fast boot, static data distribution, migration
Tooling should be used to automate every step.
— Adrian Cockcroft, Netflix Cloud Architect

Automation: Instance Metadata

Simple, programmable, built-in data store for VMs

  • Dictionary of key/value pairs
  • Set from the API, read by the Instance
  • Accessible at metadata server (http://metadata/...)
  • Useful for small amounts of configuration data
  • Project level metadata inherited by all instances
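
Because the metadata server is plain HTTP, an application can read its own configuration at startup. A minimal Node.js sketch, assuming the v1beta1 base URL used elsewhere in this talk (newer API versions may need different paths or headers); `attributeUrl` and `getAttribute` are hypothetical helper names:

```javascript
const http = require('http');

// Base URL as used in this talk; only resolvable from inside a VM.
const MDS = 'http://metadata/computeMetadata/v1beta1/instance';

// Build the URL for one custom attribute, URL-encoding the key.
function attributeUrl(key) {
  return MDS + '/attributes/' + encodeURIComponent(key);
}

// Fetch one attribute value as a string. Rejects on network errors
// and on any non-200 response (for example, an unknown key).
function getAttribute(key) {
  return new Promise((resolve, reject) => {
    http.get(attributeUrl(key), (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // drain the body so the socket is released
        reject(new Error('metadata server returned ' + res.statusCode));
        return;
      }
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}
```

On an instance created with `--metadata=role:master`, calling `getAttribute('role')` would be expected to resolve with that value; nothing in the sketch runs a request until you call it.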

Automation: Instance Metadata

me@workstation$ gcutil addinstance metadata-example \
--metadata=role:master --metadata_from_file=config:config.txt
me@workstation$ gcutil ssh metadata-example
me@metadata-example$ MDS=http://metadata/computeMetadata/v1beta1/instance
me@metadata-example$ curl ${MDS}/attributes/role
me@metadata-example$ curl ${MDS}/attributes/config
[...file content...]

Start Up Scripts

Simple Bootstrapping

  • Builds on Metadata
  • Equivalent to rc.local
  • Example Usage:
    • Install packages, start services
    • Use Google Cloud Storage to grab data, code and binaries
  • Bootstrap other management systems

Automation: Start Up Scripts

me@workstation$ cat [...]
#! /bin/bash
sudo rpm -Uvh epel-release-6*.rpm
yum install -y npm
npm install express request

ROLE=$(curl http://metadata/computeMetadata/v1beta1/instance/attributes/role)
gsutil cp gs://my-app/roles/$ROLE [...]
me@workstation$ gcutil addinstance start-me-up [...]
me@workstation$ gcutil ssh start-me-up
me@start-me-up$ cat /var/log/google.log

Automation: Managing Images

  • Try to avoid images if you can
  • Script the image build
Useful pattern for custom images, scratch boot and PD boot:
#! /bin/bash
if [ ! -e "$IMAGE_MARK" ]; then
  [... runs one time ...]
fi
[... runs every boot ...]
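
The same run-once pattern can be written in Node.js when the bootstrap logic lives with the application. This is a sketch only: `bootstrap` is a hypothetical helper, and writing the marker file after the one-time work is an assumption about how the mark gets created, since the slide elides that step.

```javascript
const fs = require('fs');

// Run oneTime() only when the marker file is absent, then create it.
// everyBoot() always runs. Returns true if the one-time work ran.
function bootstrap(markPath, oneTime, everyBoot) {
  let ranOneTime = false;
  if (!fs.existsSync(markPath)) {
    oneTime();
    // Write the marker only after oneTime() returned without throwing,
    // so a failed first boot is retried on the next boot.
    fs.writeFileSync(markPath, new Date().toISOString() + '\n');
    ranOneTime = true;
  }
  everyBoot();
  return ranOneTime;
}
```

Keeping the marker on the boot disk means a persistent-disk boot skips the one-time work on later boots, while a fresh scratch-disk boot repeats it.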

Next Step: Orchestration

  • Drive Compute Engine from App Engine
  • Use commercial or open source systems to automate
    • Chef, Puppet, RightScale, Scalr and others
  • Common features:
    • Manage configuration and deployment
    • Autoscaling based on load
    • Self healing of broken instances
    • Monitoring and tracking cost and usage
    • Configure and verify software in VM
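
Autoscaling and self-healing both reduce to small decision functions that an orchestrator evaluates on a timer. A minimal sketch of one possible policy; the function names, the load unit, and the thresholds are hypothetical and are not taken from any of the systems listed above:

```javascript
// Hypothetical policy: size the cluster so that each instance carries
// about targetPerInstance units of load, clamped to [min, max].
function desiredInstances(totalLoad, targetPerInstance, min, max) {
  if (targetPerInstance <= 0) {
    throw new RangeError('targetPerInstance must be > 0');
  }
  const needed = Math.ceil(totalLoad / targetPerInstance);
  return Math.min(max, Math.max(min, needed));
}

// Self-healing: names of the instances that failed their last health
// check and should be deleted and recreated.
function instancesToReplace(instances) {
  return instances.filter((i) => !i.healthy).map((i) => i.name);
}
```

The lower clamp keeps some capacity running when load drops to zero, and the upper clamp caps cost if a load metric misbehaves.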

Scaling + Stability

  • Load balancing brings stability and scale
    • Open source: nginx, HAProxy
    • Coming soon: Google hosted load balancing
  • Geographic diversity: zones and regions
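
Stability comes from the balancer sending traffic only to backends that pass their health checks. A sketch of round-robin selection that skips unhealthy backends; the `RoundRobin` class is illustrative and does not describe how nginx, HAProxy, or a Google-hosted balancer is implemented:

```javascript
// Round-robin over named backends, skipping any marked unhealthy.
class RoundRobin {
  constructor(names) {
    this.backends = names.map((name) => ({ name, healthy: true }));
    this.next = 0; // index where the next search starts
  }

  // Record the result of a health check for one backend.
  setHealth(name, healthy) {
    const backend = this.backends.find((b) => b.name === name);
    if (backend) backend.healthy = healthy;
  }

  // Return the next healthy backend name, or null if none is healthy.
  pick() {
    const n = this.backends.length;
    for (let i = 0; i < n; i++) {
      const backend = this.backends[(this.next + i) % n];
      if (backend.healthy) {
        this.next = (this.next + i + 1) % n;
        return backend.name;
      }
    }
    return null;
  }
}
```

Spreading the named backends across zones gives the same selection loop geographic diversity with no extra logic.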

Fast Startup Times

Demonstration Prototype

Ping Me!

  • Web app served by a cluster
  • NEW Stack (Node.js + Express + Websockets)
  • Server selected by Load Balancing with health checking
  • Connect to random server, press button to ping entire cluster
  • Pings streamed via websockets to all clients connected on all servers
  • Self-configuring, dynamic, shared-nothing cluster
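
The fan-out rule is what keeps the cluster shared-nothing and loop-free: a ping from a browser goes to local browsers and is forwarded to every peer, while a forwarded ping is delivered locally and never forwarded again. A minimal in-memory sketch of that rule with no real sockets; `PingServer` and `mesh` are hypothetical stand-ins for the demo's socket handlers:

```javascript
// In-memory stand-in for one server in the cluster.
class PingServer {
  constructor(name) {
    this.name = name;
    this.peers = [];    // other PingServer objects
    this.browsers = []; // callbacks standing in for connected browsers
  }

  connectBrowser(onPing) {
    this.browsers.push(onPing);
  }

  // A local browser pressed the button: notify local browsers,
  // then forward the ping to every peer.
  browserPing(data) {
    this.browsers.forEach((notify) => notify(data));
    this.peers.forEach((peer) => peer.forwardedPing(data));
  }

  // A peer forwarded a ping: deliver locally only. Never re-forward,
  // otherwise pings would circulate between servers forever.
  forwardedPing(data) {
    this.browsers.forEach((notify) => notify(data));
  }
}

// Connect every server to every other server.
function mesh(servers) {
  for (const server of servers) {
    server.peers = servers.filter((peer) => peer !== server);
  }
}
```

With three meshed servers and one browser on each, a single `browserPing` on any server reaches each browser exactly once.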

  • $ wc -l app.js index.html
     136 app.js
      62 index.html
     198 total

Demo App



app.get('/', function (req, res) {
  res.sendfile(__dirname + '/index.html');
});

app.get('/health', function (req, res) {
  [...]
});

socket.on('browser_ping', function (data) {
  [...]('browsers').emit('browser_ping', data);
  [...]('peers').emit('forwarded_ping', data);
});
socket.on('forwarded_ping', function (data) {
  [...]('browsers').emit('browser_ping', data);
});


<button id="Ping" onclick="ping()">Ping</button>
<div id="servers"></div>
<div id="pings"></div>
  function connect() {
    socket = io.connect('http://' + ip_addr + '/');
    socket.on('ping', on_ping);
    [...]
  }
  function on_ping(data) {
    [...]
  }
  function send_ping() {
    [...]
  }

Single Server

Multi Server

Adding a Storage Tier

Leveraging the Cloud

Takeaways...

  • Reliable systems: infrastructure + design + people
  • Manage configuration, deployment, and scaling like code
  • Fast-starting servers help scaling and availability
  • Use power tools to build self-configuring VMs
  • Take advantage of managed services (Cloud Storage, Cloud Datastore)
  • Plan for [app|server|zone|region|*] outages
  • Stay tuned for more awesomeness (like Load Balancing and System Monitoring)

Your Homework: Build Awesome Services

Where to learn more...