Amazon Web Services provides the most elastic and scalable cloud infrastructure to run your distributed machine learning applications. With AWS’s virtually unlimited capacity, engineers, researchers, and computational system owners can innovate beyond the limitations of on-premises infrastructure.
Compute on AWS removes the long wait times and lost productivity that can come from using fixed, on-premises computational clusters. Flexible configuration and virtually unlimited scalability allow you to grow and shrink your infrastructure as your workloads dictate, not the other way around. Additionally, with access to a broad portfolio of cloud-based services like Data Analytics, Artificial Intelligence (AI), and Machine Learning (ML), you can redefine traditional computational workflows to innovate faster.
In this workshop, you will deploy an HPC cluster using AWS ParallelCluster and run distributed training jobs with PyTorch. To begin the workshop, continue to the Workshop Overview.
AWS ParallelCluster is an AWS-supported, open source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters on AWS. ParallelCluster uses a text file (YAML) to model and provision the resources needed for your computational applications in an automated and secure manner. It also supports a variety of job schedulers such as AWS Batch and Slurm for easy job submissions.
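A ParallelCluster configuration might look like the sketch below, which defines a head node and a single Slurm queue of compute nodes. This is a minimal illustration, not the configuration used in this workshop; the region, instance types, subnet IDs, key pair name, and queue names are placeholder assumptions you would replace with your own values.

```yaml
# Minimal ParallelCluster (v3) configuration sketch.
# All identifiers below are illustrative placeholders.
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-12345678   # placeholder subnet ID
  Ssh:
    KeyName: my-key-pair        # placeholder EC2 key pair name
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: train
      ComputeResources:
        - Name: gpu-nodes
          InstanceType: p3.2xlarge
          MinCount: 0           # scale to zero when idle
          MaxCount: 4           # cap the queue size
      Networking:
        SubnetIds:
          - subnet-12345678     # placeholder subnet ID
```

Because compute nodes can scale between `MinCount` and `MaxCount`, the cluster grows when jobs are queued and shrinks back when idle, so you pay only for the capacity your jobs actually use.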
AWS ParallelCluster is released via the Python Package Index (PyPI). ParallelCluster’s source code is hosted on the Amazon Web Services repository on GitHub. AWS ParallelCluster is available at no additional charge, and you pay only for the AWS resources you use to run your applications.
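Since ParallelCluster is distributed through PyPI, it can be installed with `pip`. The sketch below shows one way to install the CLI in a virtual environment and create a cluster from a configuration file; the cluster name and configuration file path are illustrative assumptions, and the `create-cluster` command provisions billable AWS resources, so run it only with valid credentials and a reviewed configuration.

```shell
# Install the ParallelCluster CLI from PyPI into a virtual environment.
python3 -m venv ~/pcluster-env
source ~/pcluster-env/bin/activate
pip install --upgrade aws-parallelcluster

# Confirm the CLI is available.
pcluster version

# Create a cluster from a YAML configuration file
# (cluster name and file path are placeholders).
pcluster create-cluster --cluster-name demo-cluster \
    --cluster-configuration cluster-config.yaml
```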