Lessons from the Trenches: Processes, Pitfalls, and the Future of TPM in AI
After a grueling product launch, I reflect on the essential processes of Technical Program Management and how they adapt in the age of AI. From incident management to balancing governance and speed, I explore the healthy and unhealthy patterns that shape our work.
Launch Day: Excitement Meets Anxiety
It was a Friday afternoon when the notification pinged my phone: our eagerly anticipated AI product had officially launched. The excitement was palpable, but so was the anxiety—my team had poured countless hours into this project, and we were all too aware of the pitfalls that could await us. As I sat at my desk, staring blankly at the screen, I couldn’t help but feel a mix of pride and trepidation. This product was supposed to revolutionize the way our users interacted with AI, but I knew that the road ahead was fraught with complications.
As a Technical Program Manager (TPM), I’ve learned that processes are the backbone of any successful project. They are what keep the chaos at bay, especially in the fast-paced world of AI. Yet, as I sat there reflecting on our recent launch, I realized that not all processes are created equal. In fact, the difference between healthy and unhealthy patterns can make or break a project.
Incident Management and Blameless Postmortems
Take incident management, for example. In our rush to launch, we encountered a major issue that caused a significant disruption for users. The temptation was there to assign blame, to point fingers at the engineering team for not catching the bug earlier or at the product team for pushing the timeline. But we resisted that urge. Instead, we held a blameless postmortem. This process allowed us to dissect what went wrong without the fear of reprisal. It’s vital to create an environment where team members feel safe to discuss failures openly. Through this process, we uncovered not just the technical missteps, but also the communication gaps that contributed to the incident.
SLO/SLA Hygiene
Then there’s the matter of Service Level Objectives (SLOs) and Service Level Agreements (SLAs). These are critical in AI, where the performance of models can fluctuate significantly. After our launch, we found that our SLOs had not been adequately defined, leading to a scramble to set them post-launch. This reactive approach is an anti-pattern that creates confusion and undermines trust with stakeholders. In hindsight, establishing clear SLOs during the planning phase would have provided us with a guiding star, helping to align expectations and evaluate our performance accurately.
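One way to make SLO hygiene concrete is to track an error budget: the number of failures the SLO target permits, and how much of that allowance has been spent. Here is a minimal sketch of that calculation; the 99.9% target and the request counts are illustrative assumptions, not figures from our launch.

```python
def error_budget_remaining(total_requests: int,
                           failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Return the fraction of the error budget still unspent (0.0 to 1.0)."""
    # The budget is the number of failures the SLO target tolerates.
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    spent = failed_requests / allowed_failures
    return max(0.0, 1.0 - spent)

# Example: 1,000,000 requests with 400 failures against a 99.9% target.
# The budget allows 1,000 failures, so 60% of the budget remains.
remaining = error_budget_remaining(1_000_000, 400)
print(f"{remaining:.0%} of error budget remaining")
```

Defining the target and the budget up front, rather than post-launch, is exactly what turns an SLO into the guiding star it's meant to be: stakeholders can see at a glance how much room is left before the agreement is at risk.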
Release Trains and Quality Gates
Another area where we stumbled was in our release trains and quality gates. We had initially planned a strict set of quality checks before each release, but as deadlines loomed, we relaxed these gates. It seemed like a pragmatic choice at the time, but it backfired spectacularly. The rush led to bugs slipping through, which became evident once users started interacting with the product. In the world of AI, where the consequences of a bug can be amplified, this approach serves as a cautionary tale. Establishing robust quality gates is non-negotiable; they provide the necessary checks to ensure that we don’t compromise on quality for the sake of speed.
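The principle of a quality gate can be sketched in a few lines: a release proceeds only when every check clears its threshold, with no case-by-case relaxation. The check names and thresholds below are illustrative assumptions, not our actual gate configuration.

```python
from dataclasses import dataclass

@dataclass
class GateCheck:
    """One quality check: a measured value compared against a threshold."""
    name: str
    value: float
    threshold: float

    def passes(self) -> bool:
        return self.value >= self.threshold

def evaluate_gates(checks: list[GateCheck]) -> tuple[bool, list[str]]:
    """Return (release_ok, names of failing checks). All checks must pass."""
    failures = [c.name for c in checks if not c.passes()]
    return (not failures, failures)

# Hypothetical gate for an AI release: tests, model evals, and latency.
checks = [
    GateCheck("unit_test_pass_rate", 0.998, 0.995),
    GateCheck("model_eval_accuracy", 0.91, 0.90),
    GateCheck("latency_budget_met", 0.97, 0.99),  # this one fails the gate
]
ok, failing = evaluate_gates(checks)
print("release blocked by:", failing)
```

The point of encoding the gate, even this simply, is that loosening it becomes a visible, deliberate change to the thresholds rather than a quiet judgment call made under deadline pressure.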
Design/PRD Review Rituals
Our design and Product Requirements Document (PRD) review rituals also fell victim to a cargo-cult mentality. We had all the right processes in place, but as we became overwhelmed, we started skipping important discussions. The irony? We ended up creating more work for ourselves, backtracking to address fundamental design flaws that could have been caught early on. This experience reinforced for me the importance of adhering to rituals that promote thorough review and discussion. Healthy patterns in these processes lead to better outcomes and stronger team collaboration.
Balancing Governance with Speed
As I sat there, I pondered the balance between governance and speed. In the world of AI, this balance is more crucial than ever. We need to move quickly to stay relevant, but we cannot afford to do so at the expense of thoroughness. The challenge lies in crafting processes that are lightweight yet effective, allowing teams to adapt as needed without falling into bureaucratic traps. It’s a delicate dance, one that requires constant attention and adjustment.
So, where do we go from here? The hype surrounding generative models and AI is hard to ignore, but as TPMs, we must remain skeptical and grounded. We must focus on building robust processes that foster innovation while also ensuring that we learn from our mistakes. The lessons from our recent launch have been invaluable. They remind me that while the tools and technologies we use may change, the principles of good program management remain the same.
As I reflect on the challenges we faced, I’m reminded that the journey of a TPM is rarely straightforward. It’s filled with twists and turns, successes and setbacks. But through it all, one truth stands clear: effective processes, when executed well, can be the difference between chaos and clarity. And in the rapidly evolving landscape of AI, that clarity is essential for success.