
This quick start demonstrates how to implement a parallel (MPI) Fortran/C program on HCC supercomputers. The sample codes and submit scripts can be downloaded from <mpi_dir.zip>. 

Log in to an HCC Cluster (Tusker or Sandhills)

Log in to an HCC cluster through PuTTY (for Windows users) or a terminal (for Mac/Linux users) and make a subdirectory called mpi_dir under the $WORK directory.

$ cd $WORK
$ mkdir mpi_dir

In the subdirectory mpi_dir, save all the relevant codes. Here we include two demo programs, demo_f_mpi.f90 and demo_c_mpi.c, that compute the sum of the integers from 1 to 20 using parallel processes. A straightforward parallelization scheme is used for demonstration purposes. First, each core uses its rank (myid) to claim an equal share of the workload, where the number of cores is specified by --ntasks in the submit script. Each core then computes a partial sum over its share. Finally, the master core (myid=0) collects the partial sums from all worker cores and performs the overall summation. With N=20 and 5 tasks, for example, each rank handles N_local = 20/5 = 4 consecutive integers, so rank 2 computes i = 9 through 12. For easy comparison with the serial code (Fortran/C on HCC), the lines added for MPI are marked with "!=" (Fortran) or "//=" (C).

demo_f_mpi.f90
Program demo_f_mpi
!====== MPI =====
	use mpi		
!================
	implicit none
	integer, parameter :: N = 20
	real*8 w
	integer i
	common/sol/ x
	real*8 x
	real*8, dimension(N) :: y 
!============================== MPI =================================
	integer ind
	real*8, dimension(:), allocatable :: y_local					
	integer numnodes,myid,rc,ierr,start_local,end_local,N_local		
	real*8 allsum													
!====================================================================
	
!============================== MPI =================================
	call mpi_init( ierr )											
	call mpi_comm_rank ( mpi_comm_world, myid, ierr )				
	call mpi_comm_size ( mpi_comm_world, numnodes, ierr )			
	N_local = N/numnodes											
	allocate ( y_local(N_local) )									
	start_local = N_local*myid + 1 									
	end_local =  N_local*myid + N_local								
!====================================================================
	do i = start_local, end_local
		w = i*1d0
		call proc(w)
		ind = i - N_local*myid
		y_local(ind) = x
!		y(i) = x
!		write(6,*) 'i, y(i)', i, y(i)
	enddo	
!		write(6,*) 'sum(y) =',sum(y)
!============================================== MPI =====================================================
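!	mpi_reduce sums the partial sums from every rank into allsum on rank 0;
!	mpi_gather assembles each rank's y_local into the full array y on rank 0.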
	call mpi_reduce( sum(y_local), allsum, 1, mpi_real8, mpi_sum, 0, mpi_comm_world, ierr )				
	call mpi_gather ( y_local, N_local, mpi_real8, y, N_local, mpi_real8, 0, mpi_comm_world, ierr )		
																										
	if (myid == 0) then																					
		write(6,*) '-----------------------------------------'											
		write(6,*) '*Final output from... myid=', myid													
		write(6,*) 'numnodes =', numnodes																
		write(6,*) 'mpi_sum =', allsum	
		write(6,*) 'y=...'
		do i = 1, N
			write(6,*) y(i)
		enddo																						
		write(6,*) 'sum(y)=', sum(y)																
	endif																								
																										
	deallocate( y_local )																				
	call mpi_finalize(rc)																				
!========================================================================================================
	
Stop
End Program
Subroutine proc(w)
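!	The result x is passed back to the main program via the common block /sol/.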
	real*8, intent(in) :: w
	common/sol/ x
	real*8 x
	
	x = w
	
Return
End Subroutine
demo_c_mpi.c
//demo_c_mpi
#include <stdio.h>
//======= MPI ========
#include "mpi.h"	
#include <stdlib.h>	
//====================

double proc(double w){
		double x;		
		x = w;	
		return x;
}

int main(int argc, char* argv[]){
	int N=20;
	double w;
	int i;
	double x;
	double y[N];
	double sum;
//=============================== MPI ============================
	int ind;													
	double *y_local;											
	int numnodes, myid, start_local, end_local, N_local;
	double allsum;												
//================================================================
//=============================== MPI ============================
	MPI_Init(&argc, &argv);
	MPI_Comm_rank( MPI_COMM_WORLD, &myid );
	MPI_Comm_size ( MPI_COMM_WORLD, &numnodes );
	N_local = N/numnodes;
	y_local=(double *) malloc(N_local*sizeof(double));
	start_local = N_local*myid + 1;
	end_local = N_local*myid + N_local;
//================================================================
	
	for (i = start_local; i <= end_local; i++){        
		w = i*1e0;
		x = proc(w);
		ind = i - N_local*myid;
		y_local[ind-1] = x;
//		y[i-1] = x;
//		printf("i,x= %d %lf\n", i, y[i-1]) ;
	}
	sum = 0e0;
	for (i = 1; i<= N_local; i++){
		sum = sum + y_local[i-1];	
	}
//	printf("sum(y)= %lf\n", sum);    
//====================================== MPI ===========================================
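	// MPI_Reduce sums each rank's partial sum into allsum on rank 0;
	// MPI_Gather assembles each rank's y_local into the full array y on rank 0.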
	MPI_Reduce( &sum, &allsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
	MPI_Gather( &y_local[0], N_local, MPI_DOUBLE, &y[0], N_local, MPI_DOUBLE, 0, MPI_COMM_WORLD );
	
	if (myid == 0){
	printf("-----------------------------------\n");
	printf("*Final output from... myid= %d\n", myid);
	printf("numnodes = %d\n", numnodes);
	printf("mpi_sum = %lf\n", allsum);
	printf("y=...\n");
	for (i = 1; i <= N; i++){
		printf("%lf\n", y[i-1]);
	}	
	sum = 0e0;
	for (i = 1; i<= N; i++){
		sum = sum + y[i-1];	
	}
	
	printf("sum(y) = %lf\n", sum);
	
	}
	
	free( y_local );
	MPI_Finalize ();
//======================================================================================		

return 0;
}
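
Note that both demo programs assume N is evenly divisible by the number of tasks: the integer division N_local = N/numnodes drops any remainder, so leftover values of i would simply never be computed. Below is a minimal C sketch of one common fix, assigning one extra element to each of the first N mod numnodes ranks. The program is illustrative only and is not part of the downloadable sample.

// Sketch: split N elements across ranks when N % numnodes != 0.
// The first (N % numnodes) ranks each take one extra element.
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[]){
	int N = 23;	// deliberately not divisible by the task count
	int numnodes, myid, i;
	double local_sum = 0e0, allsum;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank( MPI_COMM_WORLD, &myid );
	MPI_Comm_size( MPI_COMM_WORLD, &numnodes );

	int base = N/numnodes;	// minimum share per rank
	int rem  = N%numnodes;	// elements left over
	int N_local     = base + (myid < rem ? 1 : 0);
	int start_local = myid*base + (myid < rem ? myid : rem) + 1;
	int end_local   = start_local + N_local - 1;

	for (i = start_local; i <= end_local; i++){
		local_sum = local_sum + i*1e0;
	}

	MPI_Reduce( &local_sum, &allsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
	if (myid == 0){
		printf("sum(1..%d) = %lf\n", N, allsum);
	}
	MPI_Finalize();
	return 0;
}

Gathering unequal chunks into a single array would likewise require MPI_Gatherv, which accepts per-rank counts and displacements, in place of MPI_Gather.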

Compiling the Code

Compiling an MPI code requires first loading a compiler "engine" such as gcc, intel, or pgi, and then loading the MPI wrapper openmpi. Here we use the GNU Compiler Collection, gcc, for demonstration.

$ module load compiler/gcc/6.1 openmpi/2.0

$ mpif90 demo_f_mpi.f90 -o demo_f_mpi.x
$ mpicc demo_c_mpi.c -o demo_c_mpi.x

 

The above commands load the gcc compiler with the openmpi wrapper. The wrapper commands mpif90 and mpicc then compile the codes into .x files (executables).
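
To confirm which underlying compiler a wrapper invokes, Open MPI's wrapper compilers accept a --showme option that prints the full compile command without executing it:

$ mpicc --showme
$ mpif90 --showme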


Creating a Submit Script

Create a submit script that requests 5 cores (with --ntasks). On the last line, the parallel execution command mpirun must precede the executable name; under SLURM, Open MPI's mpirun launches one MPI process per allocated task, so no -np flag is needed. Because the demo codes compute N_local = N/numnodes with integer division, --ntasks should divide 20 evenly (5 works, as do 2, 4, 10, and 20).

submit_f.mpi
#!/bin/sh
#SBATCH --ntasks=5
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:01:00
#SBATCH --job-name=Fortran
#SBATCH --error=Fortran.%J.err
#SBATCH --output=Fortran.%J.out

mpirun ./demo_f_mpi.x 
submit_c.mpi
#!/bin/sh
#SBATCH --ntasks=5
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:01:00
#SBATCH --job-name=C
#SBATCH --error=C.%J.err
#SBATCH --output=C.%J.out

mpirun ./demo_c_mpi.x 

Submit the Job

The jobs can be submitted with the sbatch command. Their status can be monitored by running squeue with the -u option.

$ sbatch submit_f.mpi
$ sbatch submit_c.mpi
$ squeue -u <username>
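
A queued or running job can be cancelled with scancel, using the job ID reported by sbatch (also shown in the squeue listing):

$ scancel <job_id>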

Sample Output

The sum of the integers from 1 to 20 is computed and printed to the .out file (see below). The partial results from the 5 cores are collected and combined by the master core (i.e. myid=0).

Fortran.out
 -----------------------------------------
 *Final output from... myid=           0
 numnodes =           5
 mpi_sum =   210.00000000000000     
 y=...
   1.0000000000000000     
   2.0000000000000000     
   3.0000000000000000     
   4.0000000000000000     
   5.0000000000000000     
   6.0000000000000000     
   7.0000000000000000     
   8.0000000000000000     
   9.0000000000000000     
   10.000000000000000     
   11.000000000000000     
   12.000000000000000     
   13.000000000000000     
   14.000000000000000     
   15.000000000000000     
   16.000000000000000     
   17.000000000000000     
   18.000000000000000     
   19.000000000000000     
   20.000000000000000     
 sum(y)=   210.00000000000000     
 
C.out
-----------------------------------
*Final output from... myid= 0
numnodes = 5
mpi_sum = 210.000000
y=...
1.000000
2.000000
3.000000
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000
10.000000
11.000000
12.000000
13.000000
14.000000
15.000000
16.000000
17.000000
18.000000
19.000000
20.000000
sum(y) = 210.000000