Chakra and ASTRA-sim: An open-source ecosystem for advancing co-design for future distributed AI systems
AI systems play a pivotal role in unlocking the full potential of emerging AI workloads such as LLMs and DLRMs by addressing their unique compute, memory, and network demands. These systems require judicious SW-HW co-design to drive optimization and innovation around AI models, software, and next-generation hardware. We identify two specific challenges in SW-HW co-design for AI systems. The first is access to realistic workloads. While industry-approved full-stack benchmark suites (such as MLPerf) play a crucial role in enabling fair comparisons across different SW and HW stacks on current AI systems, running an entire MLPerf workload software stack over large-scale distributed systems is often prohibitively expensive in practice and is possible only for a handful of cloud vendors today. The second is simulation fidelity. Modeling a future large-scale AI system and simulating it in reasonable time with acceptable accuracy is an extremely challenging engineering task.
To address these challenges, we have been developing an open ecosystem of frameworks to enable the design and optimization of future systems. On the workload side, we are developing Chakra, an open and interoperable graph-based representation for standardizing AI workload execution traces. Chakra's execution traces represent the key operators (computation, memory, and communication) executed when running a distributed workload, along with control dependencies, timing, and resource constraints. Chakra also includes a complementary set of tools and capabilities to enable the collection (pre- or post-execution), analysis, generation, and adoption of Chakra execution traces by a broad range of simulators, emulators, and operator replay tools. On the simulation side, we are developing ASTRA-sim, a multi-fidelity simulation framework for distributed AI systems. ASTRA-sim schedules the operators from Chakra traces over pluggable compute and network simulators. ASTRA-sim's key innovation is a common API through which diverse compute and network simulators can be plugged in, letting users choose their preferred open or proprietary simulator depending on the scale of the system and the simulation fidelity they care about. Together, the Chakra + ASTRA-sim ecosystem enables an agile methodology for collecting and reproducing workload behavior on current production systems and extends it to SW-HW co-design of future AI systems. We are continuing to cultivate this ecosystem in open partnership with industry and academia as part of the Chakra Working Group in MLCommons.
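
To make the trace-plus-simulator flow concrete, the following minimal Python sketch schedules a small dependency graph of operators over interchangeable cost models. It is an illustration under stated assumptions only: the `Node` fields, the `simulate` function, and the two cost models are hypothetical names invented for this example and do not reflect the actual Chakra schema or the ASTRA-sim API. Each operator kind is assumed to run on a single resource that executes one operator at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One operator in a toy execution trace (hypothetical, not the Chakra schema)."""
    name: str
    kind: str                  # "compute" or "comm" in this example
    size: float                # FLOPs for compute, bytes for comm
    deps: List[str] = field(default_factory=list)


# A cost model maps an operator to a duration in seconds.
CostModel = Callable[[Node], float]


def compute_model(node: Node) -> float:
    """Assumed compute backend: a fixed 1e12 FLOP/s device."""
    return node.size / 1e12


def make_link_model(bandwidth_bytes_per_s: float) -> CostModel:
    """Assumed network backend: a single link with the given bandwidth."""
    return lambda node: node.size / bandwidth_bytes_per_s


def simulate(nodes: List[Node], models: Dict[str, CostModel]) -> Dict[str, float]:
    """Return the finish time of every node.

    A node starts once all of its dependencies have finished and the
    resource for its kind is free; each kind is one serial resource.
    """
    finish: Dict[str, float] = {}
    free_at = {kind: 0.0 for kind in models}
    pending = list(nodes)

    def earliest_start(node: Node) -> float:
        deps_done = max((finish[d] for d in node.deps), default=0.0)
        return max(deps_done, free_at[node.kind])

    while pending:
        ready = [n for n in pending if all(d in finish for d in n.deps)]
        if not ready:
            raise ValueError("trace has a cycle or a missing dependency")
        node = min(ready, key=earliest_start)   # ties keep trace order
        end = earliest_start(node) + models[node.kind](node)
        finish[node.name] = end
        free_at[node.kind] = end
        pending.remove(node)
    return finish


# Toy data-parallel training step: gradient all-reduces overlap with backward compute.
trace = [
    Node("fwd", "compute", 1e12),
    Node("bwd2", "compute", 1e12, deps=["fwd"]),
    Node("ar2", "comm", 2e9, deps=["bwd2"]),
    Node("bwd1", "compute", 1e12, deps=["bwd2"]),
    Node("ar1", "comm", 1e9, deps=["bwd1"]),
    Node("opt", "compute", 5e11, deps=["ar1", "ar2"]),
]

# Swap only the network model; the trace and the compute model stay unchanged.
slow = simulate(trace, {"compute": compute_model, "comm": make_link_model(1e9)})
fast = simulate(trace, {"compute": compute_model, "comm": make_link_model(2e9)})
print(slow["opt"], fast["opt"])  # step time in seconds: 5.5 4.0
```

The same trace is replayed twice with only the network model changed, mirroring in miniature how a common interface lets a user trade fidelity or explore a different system without touching the workload description.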