AMA with Charity Majors - June 15, 2017

On June 15th, we’re going to have the amazing Charity Majors as an “Ask Me Anything” guest!

Charity is the cofounder and CEO of Honeycomb, a next-generation observability SaaS for debugging and understanding the next generation of infrastructure problems. She likes free software, free speech, and single malt scotch.

She’ll be online and answering your questions on observability, reliability, and more, live from 1:30-2:30PM Pacific, June 15th, 2017.

Ask your questions in the thread below (either beforehand or during), and don’t forget to use the “watch” feature on the thread to be notified.


Hi Charity! Thanks for doing this!

Having run so much infrastructure in your career and now running Honeycomb, what is the biggest mistake that most teams make when collecting and analyzing metrics?

Hi there Charity!

What role do you envision for machine learning and AI in the era of Observability? Which patterns do you think the robots can extract, and where would the humans need to get involved?

Hi Charity,

Do you have any advice for introducing tracing into an existing metrics and logging infrastructure?

Hi Charity!

Here’s a question on reliability of the Observability tools.
Among the tools used for tracing, logging, and instrumentation, do you think that teams sometimes over-instrument, over-trace(?), or over-log an application?
Follow-up question: what are some common anti-patterns in this regard (i.e. over-logging, under-logging, etc.)?

Charity, could you tell us about your upcoming movie Rampart?

Or alternately, what are the metrics that SRE teams should be bringing to the rest of business to … for lack of a better word…, show off? It’s easy to talk about uptime but 99.99 and 99.995 look very similar even if there’s an order of magnitude in skill between them.

Hi Charity. Given your background as a Production Engineering Manager at Facebook, what is the state of the art with auto-remediation and self-healing systems at the Googles and Facebooks of this world? As DevOps practices standardize, get codified over time, and eventually become the norm (CI already there, CD still feels like a cutting-edge approach for many shops), do you think auto-remediation will become a prerequisite feature of production systems?

Dear Charity,

Please share your thoughts on a buy-vs-build decision. When should an engineer buy a new tool, when should you use something open source, and when is building your own the best choice?

Howdy Charity!

What’s one technical decision you’ve previously made in your career (a particular design, or implementation detail, etc) you’d make differently now? Is there any specific way you’d recommend for junior engineers to learn some of the pitfalls and gotchas in designing and operating infrastructure?

Hi Charity! User experiences vary in quality across a lot of tools used in our industry. How does Honeycomb go about designing user experiences and interfaces for a highly technical audience?

hey :slight_smile:

Is this what you thought you’d do when you grew up, or anything close to it? As you’ve talked to people all over the world that do “this stuff”, are there any commonalities that could help in finding more people that might want to do reliability engineering for distributed computer systems? What could we (all) be doing better at to make our discipline more attractive and welcoming?

Given the data set you have at Honeycomb and your experience to date, are there any trends you are seeing in the world of reliability, resiliency, distributed systems that you just don’t like?

HI! Thank you so much for having me, and for the delicious after-lunch bottle of Ardbeg Dark Cove. Sitting here with Eric and Alex, sippin’ a dark Islay, y’all should be jealous.


Hah. The robots are always gonna be the best at aggregating massive, massive amounts of data and looking for known problems, and everything that requires good judgment is going to be better done by humans. This has always been true and will keep being true for a long ass time. The goal is to keep the list of things you HAVE to care about as small and tractable as possible.

I’m always perplexed by the eagerness of engineers to jump straight to the robots doing everything. My current bugbear is how they jump right over the social solutions … what is my team doing, what did the last oncall find, what do the service owners collectively do or think? … and try to go straight from NO automation to ROBOT EVERYTHING. There’s still a lot of room for humans to exercise good judgment in a manageably small list of things, while delegating the deluge of shit to the machines. Only in my day we just called that … automation.

But automation doesn’t raise quite as much venture cash, does it. :roll_eyes:

Oh right you actually asked a question too, didn’t you? :smiley:

The biggest mistake most teams make when collecting metrics … is actually using metrics. Using metrics instead of events was a giant detour we took in systems land; we consciously stripped out all the context because of scarce resources. It’s a mistake to keep investing in metrics instead of starting to tip back towards event-based analytics. Just watch, the next few years are going to be very interesting. :slight_smile:
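To make the metrics-versus-events point concrete, here is a minimal sketch (an illustration, not anything from Honeycomb's product or API). The event fields and both helper functions are hypothetical. It shows how a pre-aggregated counter keeps one number and drops the context, while raw events can still be sliced by any field after the fact.

```python
from collections import Counter

# Hypothetical request events; the field names are illustrative only.
events = [
    {"endpoint": "/checkout", "status": 500, "customer": "acme", "duration_ms": 912},
    {"endpoint": "/checkout", "status": 200, "customer": "globex", "duration_ms": 88},
    {"endpoint": "/search", "status": 500, "customer": "acme", "duration_ms": 1430},
    {"endpoint": "/search", "status": 200, "customer": "initech", "duration_ms": 61},
]


def error_count_metric(events):
    """A pre-aggregated metric: a single number with all context discarded."""
    return sum(1 for e in events if e["status"] >= 500)


def errors_by(events, field):
    """An event-based query: group errors by any field, chosen after the fact."""
    return Counter(e[field] for e in events if e["status"] >= 500)


metric = error_count_metric(events)          # only tells you "how many"
by_customer = errors_by(events, "customer")  # tells you "who"
by_endpoint = errors_by(events, "endpoint")  # tells you "where"
```

The counter can only ever answer the question it was built for; the events can answer questions nobody thought to ask when the data was written.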

Are you talking about distributed tracing, e.g. zipkin, opentracing, lightstep? Well, it’s very time consuming, there aren’t a lot of shortcuts. It’s unfortunate too because you really have to fully instrument your entire stack before reaping the major benefits. So I guess I’d say aim for instrumenting things in a reusable way, so that the events are consumable by other tools along the way too.

My bias is showing here, obviously. You can think of tracing as depth-first, whereas something like honeycomb is breadth-first and doesn’t have the same limitations. With something like honeycomb, you can start capturing events anywhere and they are immediately useful – from a database, for example. Event-based analytics are mind-blowingly powerful for understanding databases. And you can get like 80% of a full observability solution just by instrumenting your edge and propagating timing headers back. You can also consume tracing events with honeycomb. So I think it’s more versatile than just DT alone.
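As a rough illustration of "instrumenting your edge and propagating timing headers back", here is a minimal sketch under stated assumptions. The header name `X-Backend-Duration-Ms`, the event fields, and `build_edge_event` are all hypothetical; the thread does not specify any of them. The idea is that the backend reports how long it took in a response header, and the edge emits one structured event per request that includes that timing.

```python
import json

# Hypothetical header name; the thread does not name a specific header.
BACKEND_TIMING_HEADER = "X-Backend-Duration-Ms"


def build_edge_event(method, path, status, total_ms, response_headers):
    """Build one structured event per request as seen at the edge."""
    event = {
        "method": method,
        "path": path,
        "status": status,
        "total_ms": total_ms,
    }
    backend_raw = response_headers.get(BACKEND_TIMING_HEADER)
    if backend_raw is not None:
        backend_ms = float(backend_raw)
        event["backend_ms"] = backend_ms
        # Time spent outside the backend (network, proxying, queueing).
        event["edge_overhead_ms"] = total_ms - backend_ms
    return event


example = build_edge_event(
    "GET", "/search", 200, 120.0, {BACKEND_TIMING_HEADER: "95.5"}
)
no_header = build_edge_event("GET", "/health", 200, 3.0, {})

# One JSON line per request is a common shape for shipping events elsewhere.
line = json.dumps(example, sort_keys=True)
```

Requests whose backend did not report a timing still produce an event, just without the backend fields, so instrumentation can be rolled out one service at a time.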

It’s not so much over-instrumenting that’s the problem, as collecting useless and misleading data. That definitely happens a lot.

Dear god, I have to start submitting shorter answers.

Antipatterns with events (logs are just baby events) … i will refer you to the blog series i am curating on this topic to publish next week. :slight_smile:

What the fuck is a Rampart

I think SRE teams (every eng team) need to get better about thinking of themselves as arms of a business. The metrics you bring to the table will be different for every org, and it completely depends on your business objectives. This may seem like a copout answer but it isn’t – this is where the creativity and ingenuity of engineering really lies, in deeply understanding your objectives and why they are your objectives and delivering a custom fit solution, every time. :slight_smile:

And engineering for the sake of engineering is a waste of time. Delivering on company goals is what makes you a fucking badass. :slight_smile: