Archives for HPDL Blog

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

Characterization of Large Language Model Development in the Datacenter

EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs

Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding

ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning

Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

PGLBox: Multi-GPU Graph Learning Framework for Web-Scale Recommendation

QLoRA: Efficient Finetuning of Quantized LLMs

Some Methods for Mixed Precision Training

Toward Efficient Low-Precision Training: Data Format Optimization and Hysteresis Quantization

Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks

MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism

Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Elastic Averaging for Efficient Pipelined DNN Training

MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud

SANCUS: Staleness-Aware Communication-Avoiding Full-Graph Decentralized Training in Large-Scale Graph Neural Networks

Persia: Optimizations for Large-Scale Recommendation Models; and Tutel: Optimizations for Large-Scale MoE Models

TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting

Deep Neural Network Training With Distributed K-FAC

DeepSpeed-MoE

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

GNNLab: A Factored System for Sample-based GNN Training over GPUs

2D, 2.5D, and 3D Tensor Parallelism in Colossal-AI

Rematerialization and Swapping

ICLR 2017 MoE

PaGraph: Scaling GNN Training on Large Graphs via Computation-aware Caching

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Piper: Multidimensional Planner for DNN Parallelization

Paper Walkthrough: FlexFlow and Automatic Parallelism

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

Ultra-Low Precision 4-bit Training of Deep Neural Networks

GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Combining Label Propagation and Simple Models Out-performs Graph Neural Networks

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Pelican User Manual

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines