📄️ Machine Learning Systems Overview
📄️ Fault Tolerance in LLM Training/Serving Systems
MLSys 有下面几点优化方向,本文主要关注第三点
📄️ MLSys Learning PLAN
MLSys 学习大纲
📄️ Distributed Training System Overview
1 动机:解决单机性能瓶颈
📄️ Distributed Training Approaches
概述
📄️ Machine Learning Cluster Architecture
机器学习集群架构
📄️ Distributed Training Collective Communication
集合通信
📄️ Distributed Training Parameters Server
参数服务器
📄️ Distributed Training Overview
If a failure occurs, torchrun will terminate all the processes and restart them. Each process entry point first loads and initializes the last saved snapshot, and continues training from there. So at any failure, you only lose the training progress from the last saved snapshot.
📄️ How to do research in the era of LLM
大模型时代下做科研的四个思路
📄️ GPU
基本概念