Mar 05 2024
Behind gen AI: building an infrastructure for the future

Kevin L., a technical program manager (TPM), studied mechanical engineering at the University of Waterloo with the dream of “building planes, cars and fast things.” When his career began, however, he was transported by a different kind of speed. “At Meta, we move fast,” he shares. “We ask each other, ‘What would you do if you weren’t afraid?’ I had this eye-opening moment that I could break out of my way of doing things.”

When Kevin joined Meta as an intern and thermal engineer in 2011, he was building cooling solutions and supporting the first generation of the Open Compute Project. “My manager sent me to a data center to fix overheating network switches. When it didn’t end up being a thermal problem, I had an opportunity to still find a solution. Meta changed my perspective on problem-solving: look at the whole issue and focus on solving integrated challenges. It doesn’t matter what your degree or title is, at Meta we are all problem-solvers.”

With this mindset, Kevin spent the next decade at Meta bringing his skills to an exciting new place: the world of AI.



Solving integrated challenges

Kevin joined the release to production team as a full-time validation engineer in 2012. “Our team was scrappy — we just had to figure out issues. My focus was testing hardware in the data center, but I was also learning to debug systems and hack the Linux kernel — skills I’d later use to solve open-ended questions around AI.”

When Meta kickstarted its AI efforts, the team approached Kevin for his expertise in graphics processing units (GPUs) for training models — hardware Meta didn’t yet have. “I bought the materials, plugged GPUs into the servers, changed power supplies and helped build our first iteration of AI compute at Meta,” he shares. As Meta shifted more toward AI, so did Kevin’s roles. In 2016, he became a TPM focused on vision and strategy for AI; in 2018 he became a TPM manager, leading a group for AI acceleration; and in 2020 he moved to the Fundamental AI Research (FAIR) org, where he helped build the AI Research SuperCluster — which Meta used to train its first Llama large language model.

“Nothing is just a hardware problem,” Kevin explains. “It’s an integrated problem from hardware to software to AI research. Moving from an engineer who deep dives into things, to a TPM who looks more broadly, helped me expand my scope.”


Building infrastructure for the future of AI

Today, Kevin sits within the infra foundation team, creating fundamental building blocks for AI infrastructure at Meta. “Our work accelerates what Meta does as a company, impacting our ability to scale and maintain reliability and functionality,” he shares. “Engineers and researchers use this infrastructure to build products and do better research. For instance, Meta recently announced two versions of our 24,576-GPU data center scale cluster, which support our current and next-gen AI models, including Llama 3, the successor to Llama 2, our open-source LLM.”

As AI models continue getting bigger, Kevin and his team need to scale the infrastructure that trains them — but building large systems is a fundamental challenge. If one GPU slows or fails, Kevin explains, the whole system is affected. “We’re already collaborating with vendors to design solutions that can minimize the impact of system failures. We’re also improving checkpointing, which is like a ‘save state’ in video games, where we can return to the last known good state to resume training. All of this is so that we can continue training even bigger and more complex models.”
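The “save state” idea Kevin describes can be sketched in plain Python. This is a minimal illustration, not Meta’s actual checkpointing system — the file path, function names and toy “weights” are all hypothetical:

```python
import json
import os

CKPT_PATH = "checkpoint.json"  # hypothetical path, for illustration only


def save_checkpoint(step, weights, path=CKPT_PATH):
    """Persist the last known good state, like a 'save state' in a video game."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp, path)  # atomic rename, so a crash never leaves a partial file


def load_checkpoint(path=CKPT_PATH):
    """Return (step, weights) from the last checkpoint, or a fresh start."""
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        return state["step"], state["weights"]
    return 0, [0.0]


def train(total_steps=10, ckpt_every=3):
    step, weights = load_checkpoint()  # resume from the last known good state
    while step < total_steps:
        weights = [w + 0.1 for w in weights]  # stand-in for one training step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, weights)
    return step, weights
```

If a GPU failure kills the job between checkpoints, a restart replays only the steps since the last save rather than starting from zero — which is why, at the scale of tens of thousands of GPUs, the frequency and speed of checkpointing matter so much.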



Collaborating on a cross-functional mission

Kevin describes the AI clusters as being collaborative in every way — bringing together innovation in hardware, storage, software and network fabrics. “Picture a cluster as one big chip being split across the data center. We have to connect one GPU to another with cables. Can you imagine how many cables and switches are needed? Ensuring we maintain low latency and high bandwidth is complex from a network perspective — cross-functional work is vital.”

Kevin highlights the culture at Meta as “set up for collaborative communication” — an advantage to his team’s success. “I have people explore different teams and tracks, acquiring diverse skills and knowledge — which we all benefit from. Every team across the company is thinking about AI and how we can build it better. The flexibility and focus we share at Meta are highly unique.”

Collaboration at Meta extends to the open source community as well. “The commitment to open source at Meta is tried and true. We’ve been doing it, and we’ll continue to do it,” Kevin says.

“We open up our hardware designs to get input from the community on how to improve. We’re aiming to build something that’s useful for everyone, whether it’s Llama or Open Compute.”

That’s what innovation means to Kevin — pushing the boundaries of what’s possible. “Every few years we see things that change what we’re doing and set us up for the next big thing, from the internet to smartphones. Gen AI will help every person in every industry be more creative and productive. Together, with these tools, we’ll do bigger and better things.”
