Agile Infrastructure
IT operations
differentiator
enabler
faster feedback
more flexible
Infrastructure renaissance
you can change faster
you can change easier
developers and operations can work together
Hero culture
Heroism is a virtue
Patching live production systems at 5 a.m.
Bad mistakes
NO! Heroism is not good for operations
BECAUSE everyone takes it for granted that you'll keep the machines running 24/7
Different environments
it works in the test environment but not in production
you cannot keep the test and production environments in sync
the dev environment should not be very different from the production environment
Done is deployed
Techniques
Version Control
Network configurations
System configurations
Applications configurations
Application code
Database schema
Documentation
preferably executable
Anything that matters
Configuration Management
put systems into a known state
audit and enforce consistency
manage server lifecycle
reason about services, instead of systems
apply dev-test-prod cycle to infrastructure
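The first two points ("put systems into a known state", "audit and enforce consistency") can be illustrated with a minimal sketch, assuming a machine's state is modelled as a plain dict of setting -> value. The functions audit and converge are hypothetical names for illustration; real configuration management tools do far more than this.

```python
# Minimal sketch of desired-state configuration management.
# Assumption: a machine's state is a plain dict of setting -> value.
# audit() and converge() are hypothetical helpers, not a real tool's API.


def audit(desired: dict, actual: dict) -> dict:
    """Report every setting whose actual value differs from the desired one.

    Returns {setting: (actual_value, desired_value)}; an empty dict means the
    machine is in the known state. Settings not listed in `desired` are ignored.
    """
    return {
        key: (actual.get(key), wanted)
        for key, wanted in desired.items()
        if actual.get(key) != wanted
    }


def converge(desired: dict, actual: dict) -> dict:
    """Enforce the desired state in place. Idempotent: a second run changes nothing."""
    for key, (_, wanted) in audit(desired, actual).items():
        actual[key] = wanted  # overwrite the drifted value
    return actual


if __name__ == "__main__":
    desired = {"ntp_server": "ntp.example.com", "max_connections": 200}
    web1 = {"ntp_server": "10.0.0.5", "max_connections": 200}
    print(audit(desired, web1))  # {'ntp_server': ('10.0.0.5', 'ntp.example.com')}
    converge(desired, web1)
    print(audit(desired, web1))  # {}
```

Because converge only touches settings that audit reports as different, it can be run on every machine, on every change, as often as you like.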
Build from source
automated provisioning and deployment of services
roll config changes forward (dev-test-prod)
dev, test and prod not out of sync
no one edits config files by hand; they are pulled automatically from version control (svn)
test from a known state
setup process
scaling
building infrastructure is not a big manual process
speed of thought
disaster recovery
One step deploy
one automated process from version control to live services
one process for devs, testers and IT operations, across all environments
computers are really good at running the same commands over and over
you DO NOT want to have manual scripts
having people deploy manually is immoral
manual deployment is error prone
lower the fixed cost of deploy
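A rough sketch of "one automated process ... across all environments": the same ordered steps run for every environment, and a revision is rolled forward dev -> test -> prod. The step names and the run_step callback are hypothetical placeholders; a real pipeline would check out from version control, build, push configuration and restart services.

```python
# Sketch of a one-step deploy: a single automated process, identical for
# devs, testers and operations, run the same way in every environment.
# The step names and the `run_step` callback are hypothetical placeholders.
from typing import Callable, List, Tuple

ENVIRONMENTS = ("dev", "test", "prod")
STEPS = ("checkout", "build", "apply_config", "deploy", "smoke_test")

# run_step(step, revision, environment) -> True on success
StepRunner = Callable[[str, str, str], bool]


def deploy(revision: str, environment: str, run_step: StepRunner) -> Tuple[bool, List[str]]:
    """Run every step in order; stop at the first failure.

    Returns (succeeded, steps_completed).
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    completed: List[str] = []
    for step in STEPS:
        if not run_step(step, revision, environment):
            return False, completed
        completed.append(step)
    return True, completed


def promote(revision: str, run_step: StepRunner) -> List[str]:
    """Roll one revision forward dev -> test -> prod, halting on the first failure.

    Returns the environments the revision reached successfully.
    """
    reached: List[str] = []
    for env in ENVIRONMENTS:
        ok, _ = deploy(revision, env, run_step)
        if not ok:
            break
        reached.append(env)
    return reached


if __name__ == "__main__":
    def echo(step: str, revision: str, environment: str) -> bool:
        print(f"[{environment}] {step} r{revision}")
        return True

    print(promote("1234", echo))  # ['dev', 'test', 'prod']
```

Since every environment goes through identical code, a deploy that worked in test is the same deploy that runs in prod; only the target changes.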
Monitoring
what does 'normal' look like?
Feedback
You have to know what 'green' looks like to know what 'red' looks like
don't just look at the data when things are bad
you need baseline charts and trends
test driven?
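One simple way to capture "what 'green' looks like" is a statistical baseline. The sketch below assumes a metric is a list of numbers (e.g. response times) and flags values more than k standard deviations from the baseline mean; the k=3 threshold and the sample data are assumptions, and real baselines usually also need to model trends and daily cycles.

```python
# Sketch: learn what "normal" (green) looks like from historical samples,
# then flag new values that fall outside it. The k=3 standard-deviation
# threshold is an assumption; real baselines also need trends and daily cycles.
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """(mean, sample standard deviation) of samples taken while things were fine."""
    return mean(samples), stdev(samples)


def anomalies(values: Sequence[float], base: Tuple[float, float], k: float = 3.0) -> List[int]:
    """Indices of values lying more than k standard deviations from the baseline mean."""
    mu, sigma = base
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


if __name__ == "__main__":
    # Hypothetical response times (ms) collected on a normal day.
    normal = [120, 130, 125, 118, 122, 127, 131, 119]
    today = [123, 126, 410, 121]
    print(anomalies(today, baseline(normal)))  # [2]
```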
Continuous Integration
test new builds
assert services are running
run functional tests
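The "assert services are running" step can be as simple as checking that each service accepts TCP connections before the functional tests start. This is a minimal sketch using only the standard socket module; the service names and ports are hypothetical, and a real check would usually also call an application-level health endpoint.

```python
# Sketch of the "assert services are running" CI step: after deploying a new
# build, check that each service accepts TCP connections.
# Service names and ports below are hypothetical.
import socket
from typing import Dict, List, Tuple


def service_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure, ...
        return False


def services_down(services: Dict[str, Tuple[str, int]]) -> List[str]:
    """Names of the services that are not accepting connections."""
    return sorted(name for name, (host, port) in services.items()
                  if not service_is_up(host, port))


if __name__ == "__main__":
    expected = {"web": ("127.0.0.1", 8080), "db": ("127.0.0.1", 5432)}
    down = services_down(expected)
    print("all services up" if not down else f"services not running: {down}")
```

In a CI job, a non-empty result from services_down would fail the build before any functional test runs.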
Deploy early and often
there should be no ritual
the ceremony belongs to a waterfall process
Tag everything
who?
what?
when?
synchronization - get all machines sync'd
Correlate
have the same power as a failing test in TDD: know exactly why something is wrong
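A minimal sketch of tagging and correlating: every change is recorded with who/what/when, and an incident time is matched against the changes applied just before it. The ChangeEvent type, the 30-minute default window and the example names are assumptions made for illustration.

```python
# Sketch: tag every change with who/what/when, then correlate an incident
# with the changes applied shortly before it. The ChangeEvent type and the
# 30-minute default window are assumptions made for illustration.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    who: str        # person or automated job that made the change
    what: str       # revision, config file, or service that changed
    when: datetime  # time the change was applied


def suspects(changes: List[ChangeEvent], incident_at: datetime,
             window: timedelta = timedelta(minutes=30)) -> List[ChangeEvent]:
    """Changes applied within `window` before the incident, most recent first."""
    hits = [c for c in changes if incident_at - window <= c.when <= incident_at]
    return sorted(hits, key=lambda c: c.when, reverse=True)


if __name__ == "__main__":
    # Hypothetical change log.
    log = [
        ChangeEvent("alice", "app code r1201", datetime(2010, 6, 1, 9, 0)),
        ChangeEvent("deploy-bot", "nginx.conf", datetime(2010, 6, 1, 10, 50)),
        ChangeEvent("bob", "db schema r1202", datetime(2010, 6, 1, 10, 55)),
    ]
    for change in suspects(log, incident_at=datetime(2010, 6, 1, 11, 5)):
        print(change.who, change.what)  # bob first, then deploy-bot
```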
Information radiators
share metrics
dev and ops see the same thing, in the same place
helps two groups (dev and ops) to have a conversation
Share the repository
keep configs in sync with application code
everyone knows where to look
everyone sees everyone else working
minimize surprise
ops can see the work devs are doing
conversations early in the process
get rid of ceremony, pagers, antagonism, etc.
boundary objects
Always ship trunk
Configuration drift
inconsistencies between machines (2, 4, 8, 100?)
mistakes
confusion
changes are painful
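Drift between machines can be detected by fingerprinting each machine's configuration and comparing it with the most common fingerprint. This sketch assumes each machine's configuration has already been collected as a {path: contents} dict (how it is collected is out of scope); fingerprint and drifted are hypothetical helper names.

```python
# Sketch: detect configuration drift by fingerprinting each machine's
# configuration and comparing it with the majority fingerprint.
# Assumption: configs are already collected as {path: file contents} dicts.
import hashlib
from collections import Counter
from typing import Dict, List


def fingerprint(config: Dict[str, str]) -> str:
    """Stable SHA-256 over all (path, contents) pairs, in sorted path order."""
    h = hashlib.sha256()
    for path in sorted(config):
        h.update(path.encode())
        h.update(b"\0")
        h.update(config[path].encode())
        h.update(b"\0")
    return h.hexdigest()


def drifted(machines: Dict[str, Dict[str, str]]) -> List[str]:
    """Names of machines whose configuration differs from the majority.

    On a tie, the fingerprint seen first counts as the reference.
    """
    if not machines:
        return []
    prints = {name: fingerprint(cfg) for name, cfg in machines.items()}
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(name for name, fp in prints.items() if fp != majority)


if __name__ == "__main__":
    base = {"/etc/ntp.conf": "server ntp1\n", "/etc/app.ini": "workers=4\n"}
    fleet = {"web1": dict(base), "web2": dict(base),
             "web3": {**base, "/etc/app.ini": "workers=8\n"}}
    print(drifted(fleet))  # ['web3']
```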
Agile for development; Waterfall for deployment ?!
Communication between the dev team and the ops team is mediated by a ticket system
Operations are stakeholders!
Non-functional requirements
Your site cannot be down - you are losing money!
The mystery machine
The machine in the corner that everyone is afraid to turn off, but no one knows why it's on
Infrastructure is code
API driven abstraction (cloud computing, etc.)
Infrastructure is application
Fail happens
Questions
Can you afford to be down?
How long?
How fast can you be back up?
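The questions above become concrete once an availability target is turned into a downtime budget. This worked arithmetic assumes a 365-day year; the targets and the incident count are examples, not recommendations, and max_recovery_minutes is a hypothetical helper that simply spreads the yearly budget evenly over incidents.

```python
# Worked arithmetic: turn an availability target into a downtime budget.
# Assumes a 365-day year; the targets below are examples only.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_budget_hours(availability_percent: float) -> float:
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def max_recovery_minutes(availability_percent: float, incidents_per_year: int) -> float:
    """Average time to be back up if the yearly budget is spread over incidents."""
    return downtime_budget_hours(availability_percent) * 60 / incidents_per_year


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}%: {downtime_budget_hours(target):.2f} h/year")
    # 99.0%: 87.60 h/year, 99.9%: 8.76 h/year, 99.99%: 0.88 h/year
    print(f"{max_recovery_minutes(99.9, 4):.1f} min per incident")  # 131.4
```

For example, at 99.9% with four incidents a year, each outage has to be fixed in a little over two hours on average.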
Fail safe
Try not to cause it
Practice makes perfect
"Out of the window" test
Fire drills
Go and unplug your system :D
Try different failure cases
Be confident
Culture
There is only us
Learning and respect
Work together
Manage flow
Planning for fires is hard
The best way to fight fires is never to let them get started