Welcome to the Expanse User Guide! This document provides comprehensive instructions for utilizing the Expanse system, covering architecture, job management, storage, security, and troubleshooting. Start here to optimize your experience.
System Architecture Overview
Expanse’s standard compute nodes use AMD EPYC 7742 processors, providing 128 cores and 256 GB of DRAM per node. The system delivers a peak of 5.16 petaflops across 728 standard compute nodes and 52 GPU nodes.
2.1. Standard Compute Nodes Specifications
Expanse’s standard compute nodes are powered by dual AMD EPYC 7742 processors, each featuring 64 cores, totaling 128 cores per node. These nodes are equipped with 256 GB of DDR4 memory and 1 TB of NVMe storage. The AMD EPYC 7742 operates at a base clock speed of 2.25 GHz, providing robust performance for compute-intensive tasks. Each node supports a wide range of applications, from scientific simulations to data analytics. With 728 standard compute nodes available, Expanse offers ample resources for parallel processing and scalable workflows. The nodes are interconnected via Mellanox HDR InfiniBand, ensuring high-speed communication for distributed computing. This architecture makes Expanse an ideal platform for researchers and scientists requiring reliable, high-performance computing capabilities.
2.2. GPU Nodes Configuration
Expanse’s GPU nodes are optimized for accelerated computing, featuring NVIDIA V100 GPUs. Each GPU node is equipped with 4 V100 GPUs, providing significant parallel processing power for tasks like AI, machine learning, and data analytics. These nodes are powered by 40 Intel Xeon cores, ensuring balanced CPU-GPU performance. Each GPU node has 384 GB of CPU DRAM and 1.6 TB of NVMe storage, enabling efficient data handling and high-speed computations. The nodes are interconnected with Mellanox HDR InfiniBand, facilitating fast data transfer and scalability for distributed workloads. With 52 GPU nodes available, Expanse caters to demanding applications requiring accelerated performance. This configuration makes Expanse an excellent choice for researchers and developers focusing on GPU-intensive workloads, ensuring optimal resource utilization and faster time-to-solution.
2.3. Performance Features and Capabilities
Expanse delivers exceptional performance with a peak capacity of 5.16 petaflops, making it a powerful tool for high-performance computing. The system features 728 standard compute nodes, each equipped with dual AMD EPYC 7742 processors, offering 128 cores per node and 256 GB of DDR4 memory. For accelerated workloads, 52 GPU nodes are available, each with 4 NVIDIA V100 GPUs, 40 Intel Xeon cores, and 384 GB of CPU DRAM. The storage subsystem includes 810 TB of NVMe, providing fast data access and high throughput. Expanse also supports composable systems and cloud bursting, allowing users to dynamically allocate resources for diverse workloads. The interconnect fabric, based on Mellanox HDR InfiniBand, ensures low-latency and high-bandwidth communication, enabling efficient scaling for parallel applications. These features make Expanse ideal for scientific simulations, data analytics, and AI-driven research, ensuring high efficiency and scalability for complex computational tasks.
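The per-node figures above can be multiplied out to get system-wide totals, which helps when sizing a resource request. The following is a minimal sketch that uses only the numbers quoted in this section; the function and constant names are illustrative and not part of any Expanse tool.

```python
# Per-node figures quoted in this guide.
STANDARD_NODES = 728
CORES_PER_STANDARD_NODE = 128
DRAM_GB_PER_STANDARD_NODE = 256

GPU_NODES = 52
GPUS_PER_GPU_NODE = 4
CPU_CORES_PER_GPU_NODE = 40
DRAM_GB_PER_GPU_NODE = 384


def aggregate_resources():
    """Return system-wide totals derived from the per-node figures."""
    return {
        "standard_cores": STANDARD_NODES * CORES_PER_STANDARD_NODE,
        "standard_dram_gb": STANDARD_NODES * DRAM_GB_PER_STANDARD_NODE,
        "dram_gb_per_core": DRAM_GB_PER_STANDARD_NODE / CORES_PER_STANDARD_NODE,
        "gpus": GPU_NODES * GPUS_PER_GPU_NODE,
        "gpu_node_cpu_cores": GPU_NODES * CPU_CORES_PER_GPU_NODE,
        "gpu_node_dram_gb": GPU_NODES * DRAM_GB_PER_GPU_NODE,
    }


if __name__ == "__main__":
    for name, value in aggregate_resources().items():
        print(f"{name}: {value}")
```

The ratio of 2 GB of installed DRAM per core on standard nodes is a convenient rule of thumb for memory requests, though the memory actually available to jobs is usually somewhat lower than the installed amount.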
Getting Started with Expanse
Welcome to Expanse! This section guides you through the initial steps of accessing and using the system. Obtain your account, log in, and start exploring its capabilities. Contact support for assistance.
3.1. Logging Onto the System
To access Expanse, users must first obtain an account through the appropriate channels. Once credentials are granted, log in via SSH using your assigned username and password. Ensure two-factor authentication (2FA) is enabled as required for enhanced security. The Expanse User Portal also provides a web-based interface for accessing the system, allowing users to log in securely and manage their sessions. For command-line access, use an SSH client such as ssh or ssh.exe to connect to the Expanse login nodes. After logging in, users can navigate the system, submit jobs, and manage files. For additional security, consider setting up SSH keys for passwordless access. Always verify your connection details and ensure compliance with security policies. If issues arise, contact HPC support for assistance. This section provides the essential steps to securely access and begin using the Expanse system.
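For key-based command-line access, an entry in the OpenSSH client configuration file saves retyping connection details. The sketch below generates such an entry. The hostname `login.expanse.sdsc.edu`, the alias `expanse`, and the key path are assumptions that do not come from this guide, so confirm them against your account information before use.

```python
def ssh_config_entry(username, alias="expanse",
                     hostname="login.expanse.sdsc.edu",  # assumed login host
                     identity_file="~/.ssh/id_ed25519"):  # assumed key path
    """Build an OpenSSH client config stanza for key-based login."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {username}",
        f"    IdentityFile {identity_file}",
        "    IdentitiesOnly yes",  # offer only the key named above
    ]
    return "\n".join(lines) + "\n"


# "your_username" is a placeholder for the assigned account name.
print(ssh_config_entry("your_username"))
```

Once the printed stanza is appended to `~/.ssh/config`, running `ssh expanse` would connect with the listed settings. Any 2FA prompt required by the system still applies.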
3.2. Navigating the Expanse User Portal
The Expanse User Portal serves as a central hub for managing your interactions with the system. Upon logging in, you will be greeted by the Home Dashboard, which provides an overview of system status, recent jobs, and available resources. The portal is divided into key sections: File Manager for transferring and organizing files, Job Manager for submitting and monitoring jobs, and Resources for accessing documentation and support. Navigation is intuitive, with a sidebar menu offering quick access to all main features. Users can also customize their view by selecting preferred layouts and settings. For first-time users, a guided tour is available to familiarize yourself with the interface. The portal supports secure file transfers via SFTP and provides real-time updates on job statuses and system health. This section will help you efficiently navigate and utilize the Expanse User Portal for streamlined workflows.
Compiling and Running Jobs on Expanse
Compiling and running jobs on Expanse requires understanding the system’s architecture and available tools. Expanse supports various compilers, including Intel, Portland Group (PGI), and GNU, ensuring compatibility with diverse applications. For optimal performance, Intel compilers paired with MVAPICH2 MPI implementations are recommended. To compile code, use an MPI compiler wrapper for MPI-enabled applications (mpicc with MVAPICH2, or mpiicc with Intel MPI) or icc for serial code. Job scripts should specify resource requirements, such as CPU cores, memory, and walltime, using directives like #SBATCH --ntasks=4 or #SBATCH --mem=32G. Submit jobs via sbatch for batch processing or srun for interactive sessions. Ensure all scripts include the #SBATCH --partition= directive to target the correct queue. The Job Management and Monitoring section of this guide covers submission and monitoring in more detail.
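The directives described above can be assembled into a complete batch script. The following is a minimal sketch that renders one from a handful of parameters. The partition name `compute`, the executable `./my_mpi_app`, and the job name are hypothetical placeholders; the optional `account` argument is included on the assumption that your allocation requires one.

```python
def render_batch_script(job_name, partition, nodes, ntasks, mem,
                        walltime, command, account=None):
    """Return the text of a Slurm batch script with the given resource requests."""
    directives = [
        f"--job-name={job_name}",
        f"--partition={partition}",
        f"--nodes={nodes}",
        f"--ntasks={ntasks}",
        f"--mem={mem}",
        f"--time={walltime}",
        "--output=%x.%j.out",  # %x expands to the job name, %j to the job ID
    ]
    if account is not None:
        directives.append(f"--account={account}")

    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH {d}" for d in directives]
    lines += ["", command, ""]
    return "\n".join(lines)


script = render_batch_script(
    job_name="mpi-test",
    partition="compute",          # assumed partition name
    nodes=1,
    ntasks=4,
    mem="32G",
    walltime="00:30:00",
    command="srun ./my_mpi_app",  # hypothetical executable
)
print(script)
```

Saving the printed text to a file such as `job.sh` and running `sbatch job.sh` would submit it. Generating the script from parameters keeps the resource requests in one place when the same job is run at several sizes.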
Job Management and Monitoring
Effective job management and monitoring are crucial for maximizing productivity on Expanse. Users submit and manage jobs through the Slurm Workload Manager, using sbatch for batch submissions and srun for interactive jobs. Monitor job status with squeue, and track resource usage with sacct. Key features include job queues, resource allocation, and priority scheduling. Use sbatch --partition= to target specific queues. For detailed monitoring, sacct --jobs= reports a job’s performance and resource consumption. Regularly check logs to ensure jobs run efficiently. Best practices include testing jobs at scale and leveraging job arrays for parallel tasks. This section helps users optimize job submissions and monitor their workflows effectively. See the Troubleshooting Common Issues section for help diagnosing failed jobs.
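Accounting output from sacct is easier to post-process when requested in pipe-delimited form. The sketch below builds such a query and parses its output into records. The field selection is one reasonable choice rather than a recommendation from this guide, and the sample output is illustrative, not captured from a real run.

```python
SACCT_FIELDS = ["JobID", "State", "Elapsed", "MaxRSS"]


def sacct_command(job_id):
    """Argument list for querying one job's accounting record."""
    return [
        "sacct",
        f"--jobs={job_id}",
        f"--format={','.join(SACCT_FIELDS)}",
        "--parsable2",  # pipe-delimited output without a trailing delimiter
        "--noheader",
    ]


def parse_sacct(output):
    """Turn pipe-delimited sacct output into a list of dicts, one per job step."""
    records = []
    for line in output.splitlines():
        if not line.strip():
            continue
        values = line.split("|")
        records.append(dict(zip(SACCT_FIELDS, values)))
    return records


# Illustrative output only; real values depend on the job.
sample = "123456|COMPLETED|00:12:03|\n123456.batch|COMPLETED|00:12:03|2048K\n"
for record in parse_sacct(sample):
    print(record)
```

On a login node, the argument list could be passed to `subprocess.run(sacct_command(job_id), capture_output=True, text=True, check=True)` and the resulting `stdout` handed to `parse_sacct`.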
Storage and Data Management Options
Expanse offers various storage solutions to meet different needs, ensuring efficient data management. Home directories provide persistent storage for user files, while project directories enable collaboration. Scratch storage is optimized for temporary, high-performance computing tasks. NVMe storage offers ultra-fast access for demanding workloads. Each storage type has specific quotas and lifetimes, detailed in the user guide. Best practices include regular cleanup of scratch spaces and archiving unused data. Use quota commands to monitor storage usage. Transfer data securely with tools like scp or rsync. Backup critical files to avoid data loss. This section helps users optimize storage utilization and ensure data integrity, aligning with Expanse’s high-performance environment. Proper data management is essential for seamless workflows. Consult this guide for storage policies and tools to enhance your experience on Expanse.
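Regular cleanup of scratch space starts with finding files that have not been touched recently. The following is a minimal sketch that lists such files without deleting anything. The age threshold and the directory passed in are up to the user, and no Expanse-specific paths or purge policies are assumed.

```python
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return files under `root` not modified within `max_age_days`, oldest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for path in Path(root).rglob("*"):
        # Skip directories and symlinks; only regular files are reported.
        if path.is_file() and not path.is_symlink():
            mtime = path.stat().st_mtime
            if mtime < cutoff:
                stale.append((mtime, path))
    return [path for _, path in sorted(stale, key=lambda item: item[0])]
```

Calling `find_stale_files(scratch_dir, 30)` with your own scratch directory returns candidates for archiving or removal. Reviewing the list before deleting anything is the safer workflow, which is why the function only reports.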
Security Measures and Best Practices
Expanse prioritizes security to protect user accounts and data. Enable two-factor authentication (2FA) using tools like Google Authenticator to enhance account security. Use strong, unique passwords and avoid sharing credentials. Regularly update your software and ensure your SSH keys are securely managed. When transferring data, use encrypted methods such as scp or rsync. Be cautious with sensitive information and avoid storing it in insecure locations. Monitor your account activity for unusual behavior and report suspicious incidents to support. Familiarize yourself with Expanse’s security policies to comply with regulations. This section provides guidelines to safeguard your work and maintain system integrity. By following these practices, you contribute to a secure computing environment on Expanse.
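One concrete part of managing SSH keys securely is making sure a private key file is readable only by its owner. The sketch below checks a key file's permission bits on a POSIX system; the function name is illustrative.

```python
import stat
from pathlib import Path


def key_permission_problems(key_path):
    """Return a list of problems with a private key file's permissions."""
    mode = Path(key_path).stat().st_mode
    problems = []
    if mode & stat.S_IRWXG:  # any group read/write/execute bit set
        problems.append("group has access")
    if mode & stat.S_IRWXO:  # any world read/write/execute bit set
        problems.append("others have access")
    return problems
```

An empty list means the file is restricted to its owner. If problems are reported, `chmod 600` on the key file removes group and world access; OpenSSH itself refuses to use private keys that are accessible to other users.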
Troubleshooting Common Issues
Expanse users may encounter various issues while using the system. Common problems include login failures, job submission errors, and data transfer issues. For login problems, ensure your credentials are correct and 2FA is properly configured. If jobs fail, check your script for errors and verify resource requests. Data transfer issues often arise from incorrect paths or network problems—use tools like scp or rsync for reliability. If you encounter system-related issues, refer to the Expanse status page or contact support. Regularly update your software and scripts to avoid compatibility problems. Familiarize yourself with error messages and logs to diagnose issues effectively. For persistent problems, submit a support ticket with detailed information. This section helps you identify and resolve common challenges, ensuring smooth operation on the Expanse system.
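When a job fails, the state Slurm records for it is often the quickest pointer to the cause. The following sketch maps a few standard Slurm job states to a first diagnostic step. The hint texts are suggestions written for this example, not official guidance.

```python
STATE_HINTS = {
    "TIMEOUT": "Job hit its walltime limit; request more time or checkpoint.",
    "OUT_OF_MEMORY": "Job exceeded its memory request; raise --mem or reduce usage.",
    "FAILED": "Job exited with a non-zero code; inspect its output and error files.",
    "NODE_FAIL": "A node failed during the run; resubmit and report if it recurs.",
    "CANCELLED": "Job was cancelled by a user or administrator.",
}

DEFAULT_HINT = "No specific hint; check logs or contact support."


def hint_for_state(state):
    """Return a first troubleshooting step for a Slurm job state string."""
    words = state.split()
    if not words:
        return DEFAULT_HINT
    # sacct may report e.g. "CANCELLED by 1001" or a truncated "CANCELLED+".
    key = words[0].rstrip("+")
    return STATE_HINTS.get(key, DEFAULT_HINT)


for example in ["TIMEOUT", "CANCELLED by 1001", "COMPLETED"]:
    print(f"{example}: {hint_for_state(example)}")
```

The state string can be read from the State column of sacct output for the job in question.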
Additional Resources and Support
For further assistance, Expanse offers a variety of resources to help users maximize their experience. The Expanse User Portal provides access to documentation, tutorials, and FAQs. The SDSC Help Desk is available for technical inquiries, and support tickets can be submitted via email. The system architecture details and user guides are accessible online, ensuring users stay informed about updates and best practices. For troubleshooting, refer to the FAQ section or contact support directly. Community forums and user groups are also available for peer-to-peer assistance. Regular updates and notifications are shared through the Expanse status page, keeping users informed about system maintenance and improvements. These resources empower users to efficiently navigate and utilize the Expanse system.
Mastering the Expanse system requires practice and exploration. Regularly review the Expanse User Guide to stay updated on features and best practices. Familiarize yourself with the Expanse User Portal for efficient job submissions and monitoring. Utilize training resources and support options to address challenges promptly. Optimize performance by selecting appropriate compute nodes and leveraging advanced features. Security is crucial, so ensure two-factor authentication is enabled and follow best practices. For technical issues, contact the SDSC Help Desk or submit a support ticket. Explore community forums for shared knowledge and solutions. Keep an eye on the Expanse status page for system updates and maintenance schedules. By following these tips, you can maximize your productivity and efficiently utilize the Expanse system’s capabilities. Happy computing!