MLSys | Fenglyu's Doc

📄️ Machine Learning Systems Overview

📄️ Fault Tolerance in LLM Training/Serving Systems

MLSys has the following optimization directions; this article focuses mainly on the third.

📄️ MLSys Learning PLAN

MLSys learning outline

📄️ Distributed Training System Overview

1 Motivation: overcoming single-machine performance bottlenecks

📄️ Distributed Training Approaches

Overview

📄️ Machine Learning Cluster Architecture

Machine learning cluster architecture

📄️ Distributed Training Collective Communication

Collective communication

📄️ Distributed Training Parameters Server

Parameter server

📄️ Distributed Training Overview

If a failure occurs, torchrun terminates all the processes and restarts them. Each process entry point first loads and initializes the last saved snapshot, then continues training from there. So at any failure, you lose only the training progress made since the last saved snapshot.

📄️ How to do research in the era of LLM

Four approaches to doing research in the era of large models

📄️ GPU

Basic concepts