Pipeline Team

People

What we're building

  • PostHog Customer Data Platform

    We cover some of the functionality of existing CDP solutions, such as Segment. By creating a UI built specifically around the ideas of "Sources" and "Destinations", and by building out more of the integrations, we can turn PostHog into a leading CDP solution.

    Project updates

    No updates yet. Engineers are currently hard at work, so check back soon!

Roadmap

Here’s what we’re considering building next. Vote for your favorites or share a new idea on GitHub.

Recently shipped

New data pipelines UI launched

Back in May we launched the beta for a new data pipelines UI, and today we've rolled that beta out to everyone and retired the old version.

As before, the new UI breaks the data pipeline down into destinations and transformations, filters for your active pipelines, and includes teeny, weeny graphs so you can see volumes at a glance.

Goals

Goals: Q2 2024

  1. Batch exports UX improvements, e.g. error notifications and UI revamp
    • why: reduces confusion around how batch exports work, which cuts support load and also saves costs
  2. Person data batch export
    • why: it's currently only possible to export event data, but person data is needed for full analysis downstream
  3. Support adding new products
    • why: helping other teams ship faster while ensuring the stability of existing products
  4. Iterate on person processing to make it faster and cheaper
    • why: This is our biggest bottleneck and the main source of the strict ordering constraints in events ingestion
  5. Visibility into what's in the ingestion queues and past performance
    • why: To help reduce time to incident resolution and help find bottlenecks
  6. Fast configuration options to speed up incident recovery, e.g. routing a token's events to overflow or dropping them
    • why: To help reduce time to incident resolution
  7. Deprecate posthog-events by moving fully to capture-rs
    • why: the new capture-rs service improves reliability and efficiency, and all events should benefit from it
    • ingestion of session replays is out of scope for this quarter
  8. Deprecate scheduler & jobs deployments, runEveryX plugins and kafkajs consumers
    • why: Reduce operational and support load and costs

Handbook

Responsibilities

Our team owns our ingestion pipeline up to the Kafka topics in front of ClickHouse. That means we own:

  • Capture APIs
  • Ingestion app server
  • Data pipelines, including:
    • Transformation apps
    • Data Export and webhooks
  • Client libraries, where it pertains to event ingestion
  • Kafka setup, where it pertains to event ingestion

Our work generally falls into one of three categories, in order of priority:

Ingestion robustness

On the road to providing the best events pipeline in the world, we need to build a system that is robust.

To do so, we must ensure, in order of priority:

  • Data integrity: Events ingested should be correct
  • Availability: We should not lose events
  • Scalability: We should be able to scale to massive event volumes
  • Maintainability: It should be easy to debug and contribute to our ingestion pipeline

Thus, it is our responsibility to consistently revise our past decisions and improve processes where we see fit, from client library behaviors to ClickHouse schemas.

Scaffolding to support core PostHog features

In order to achieve company goals or introduce new features (often owned by other teams), changes to our ingestion pipeline may be required.

An example of this is the work to remodel our events to store person and group data, which is essential to ensuring we can provide fast querying for users. While querying data is not owned by this team, the change to enable faster queries requires a large restructuring of our events pipeline, and thus we are owners of that component of the project.

In short, a core responsibility of our team is to enable other teams to be successful.

Pipelines

We're building a flexible and integrated data pipeline, which we treat as a separate product just like product analytics and feature flags, for example. The PostHog data warehouse is one destination, and the most important, but we care about getting data into all the places where it is valuable.

How do we work?

We run a 30-minute sync meeting on Mondays, Wednesdays, and Fridays, and extend the slot if we feel the need to have a longer synchronous discussion about a specific topic. We look at the Pipeline Team GitHub Board, and we document every sync in this doc.

We are happy to sync anytime if we feel it is important to do so. This is generally coordinated on Slack where someone will start a Slack huddle. Some of the reasons we sync include: debugging outages, sharing context (including shadowing), making decisions when there's been a deadlock, and pairing sessions.

We have a single owner assigned to each priority and sprint goal, but we work together on goals as a team. We try to avoid lone-wolfing, so in general we'll have 3 sprint goals per sprint.

Secondary (a.k.a. Luigi)

We have a secondary schedule. The idea is to have a single person catch all of the interruptions and context switching. A secondary rotation is two weeks long and aligns with our sprints. We don't assign sprint goals to the secondary.

During this rotation, their first priority is these responsibilities:

  • Firefighting
  • Questions in #support-pipeline
  • Customer Success triage
  • Adding or improving runbooks
  • Follow ups from outages
  • Cleaning up Sentry

If there is a long running customer issue, we'll assign an owner as a team.

It's acceptable for the secondary responsibilities to take up all of your time. If they don't, you can help the rest of the team with sprint goals. It's up to you to prioritize correctly.

Slack channel

#team-ingestion