diff --git a/README.md b/README.md
index a92c0db660af8b313ceeb86fc5b03727a78f88d9..9c50c5ed426eb9ccfa4c3043ed4b745d0994e81e 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,96 @@
-# Zhaojin's Lab
-## H3C
+# Benchmark of MPI Communication Performance on Different Nodes
+## Facility Specification
-## Sugon
+
+| Machine          | CPU                                   | Memory | OS         | InfiniBand Device Model & Driver         |
+| ---------------- | ------------------------------------- | -----: | ---------- | ---------------------------------------- |
+| H3C: h[5,6]      | Intel Xeon 6240R 24C @ 2.4GHz \* 2    |  192GB | CentOS 7.8 | Mellanox MT27800, Official OFED 5.5      |
+| H3C: h[9,10]     | Intel Xeon 6240R 24C @ 2.4GHz \* 2    |  192GB | CentOS 7.9 | Mellanox MT27800, CentOS built-in driver |
+| H3C: SciNat      | Intel Xeon 5219R 20C @ 2.1GHz \* 2    |  192GB | CentOS 7.9 | Mellanox MT27800, CentOS built-in driver |
+| Sugon: node[1,2] | Intel Xeon E5-2650v2 8C @ 2.6GHz \* 2 |   64GB | CentOS 7.8 | Mellanox MT27500, Official OFED 4.9      |
+| Sugon: node[3,4] | Intel Xeon E5-2650v2 8C @ 2.6GHz \* 2 |   64GB | CentOS 7.9 | Mellanox MT27500, CentOS built-in driver |
+
+## Alltoall Performance
-# BLSC
+
+`MPI_Alltoall` is essentially a distributed matrix transpose: every rank sends one block of its
+buffer to every other rank and receives one block from each of them. Here is a schematic
+illustration of `Alltoall`:
+
+```
+ @brief Illustrates how to use an all to all.
+ @details This application is meant to be run with 3 MPI processes. Every MPI
+ process begins with a buffer containing 3 integers, one for each process
+ including themselves. They also have a buffer in which to receive the integer
+ that has been sent by each other process for them. It can be visualised as
+ follows:
-# HFAC
+
+ +-----------------------+ +-----------------------+ +-----------------------+
+ |       Process 0       | |       Process 1       | |       Process 2       |
+ +-------+-------+-------+ +-------+-------+-------+ +-------+-------+-------+
+ | Value | Value | Value | | Value | Value | Value | | Value | Value | Value |
+ |   0   |  100  |  200  | |  300  |  400  |  500  | |  600  |  700  |  800  |
+ +-------+-------+-------+ +-------+-------+-------+ +-------+-------+-------+
+     |       |       |_________|_______|_______|_________|___    |       |
+     |       |    _____________|_______|_______|_________|   |   |       |
+     |       |___|_____________|_      |      _|_____________|___|       |
+     |    _____|_____________|      |     |     |      |_____________|_____ |
+     |     |     |             |     |     |             |     |     |
+ +-----+-----+-----+       +-----+-----+-----+       +-----+-----+-----+
+ |  0  | 300 | 600 |       | 100 | 400 | 700 |       | 200 | 500 | 800 |
+ +-----+-----+-----+       +-----+-----+-----+       +-----+-----+-----+
+ |    Process 0    |       |    Process 1    |       |    Process 2    |
+ +-----------------+       +-----------------+       +-----------------+
+```
+
+Reference: [rookiehpc](https://rookiehpc.org/mpi/docs/mpi_alltoall/index.html)
+
+### Test Code
+
+```C
+#include <mpi.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+
+void test_alltoall(const uint64_t count_send, const int nrank, const int irank) {
+    if (0 == irank) {
+        uint64_t bytes = count_send * nrank * sizeof(int);
+        uint64_t gbs = bytes >> 30;
+        uint64_t mbs = ( bytes % (1 << 30) ) >> 20;
+        printf(" * Profiling throughput of %4lu GB %4lu MB per rank ...\n", gbs, mbs);
+    }
+
+    int* buffer_send = (int*)malloc( nrank * count_send * sizeof(int) );
+    int* buffer_recv = (int*)malloc( nrank * count_send * sizeof(int) );
+
+    /* Fill the send buffer with rank-dependent test data. */
+    for (uint64_t i=0; i!=(nrank * count_send); ++i) {
+        buffer_send[i] = i * i + irank * 114514;
+        buffer_recv[i] = 0;
+    }
+
+    /* Time 10 back-to-back MPI_Alltoall calls. */
+    clock_t start = clock();
+    for (int i=0; i!=10; ++i) {
+        MPI_Alltoall(buffer_send, count_send, MPI_INT, buffer_recv, count_send, MPI_INT, MPI_COMM_WORLD);
+    }
+    clock_t end = clock();
+    int msec = (end - start) * 1000 / CLOCKS_PER_SEC;
+
+    free(buffer_send);
+    free(buffer_recv);
+
+    if (0 == irank) {
+        printf(" time taken in MPI_Alltoall: %3d s %3d ms\n", msec / 1000, msec % 1000);
+    }
+}
+```
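+
+`test_alltoall` above is only the timing kernel and still needs the usual MPI boilerplate around it.
+Below is a minimal driver sketch that is not part of the original benchmark: the fixed `count_send`
+of 8 \* 2^20 `int`s per destination rank and the build/launch line are illustrative assumptions,
+chosen so that 48 ranks give the 1.5 GB per-rank buffer used in the measurements below.
+
+```C
+#include <mpi.h>
+#include <stdint.h>
+
+void test_alltoall(const uint64_t count_send, const int nrank, const int irank);
+
+int main(int argc, char** argv) {
+    MPI_Init(&argc, &argv);
+
+    int nrank = 0, irank = 0;
+    MPI_Comm_size(MPI_COMM_WORLD, &nrank);
+    MPI_Comm_rank(MPI_COMM_WORLD, &irank);
+
+    /* Illustrative choice (not from the original benchmark):
+     * 8 * 2^20 ints = 32 MiB per destination rank, so with 48 ranks
+     * each rank allocates a 1.5 GB send buffer. */
+    const uint64_t count_send = 8ull << 20;
+    test_alltoall(count_send, nrank, irank);
+
+    MPI_Finalize();
+    return 0;
+}
+```
+
+It can then be built and launched with something like `mpicc -O2 alltoall.c -o alltoall` followed by
+`mpirun -np 48 ./alltoall` (the exact launcher options depend on the MPI installation and the scheduler).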
", gbs, mbs); + } + + int* buffer_send = (int*)malloc( nrank * count_send * (sizeof(int)) ); + int* buffer_recv = (int*)malloc( nrank * count_send * (sizeof(int)) ); + + for (uint64_t i=0; i!=(nrank * count_send); ++i) { + buffer_send[i] = i * i + irank * 114514; + buffer_recv[i] = 0; + } + + clock_t start = clock(); + for (int i=0; i!=10; ++i) { + MPI_Alltoall(buffer_send, count_send, MPI_INT, buffer_recv, count_send, MPI_INT, MPI_COMM_WORLD); + } + clock_t end = clock(); + int msec = (end - start) * 100 / CLOCKS_PER_SEC; + + free(buffer_send); + free(buffer_recv); + + if (0 == irank) { + printf(" time taken in MPI_Alltoall: %3d s %3d ms\n", msec / 1000, msec % 1000); + } +} +``` + +**Note**: This code allocates `nrank * bufsize` memories on each MPI rank, which means the total allocated memory +would be related to number of MPI ranks. Thus the more processes you used in `mpirun`, the more data is transferred. + +### Test Result + +| Machine | MPI Ranks | Data Transferred per Rank | Time Used (sec) | +| --------- | :-------: | ------------------------: | --------------: | +| SciNat | 48 | 1.5 GB | 0.895 | +| h5 | 48 | 1.5GB | 1.094 | +| h9 | 48 | 1.5GB | 1.045 | +| h5,h6 | 24 \* 2 | 1.5GB | 1.688 | +| h9,h10 | 24 \* 2 | 1.5GB | 2.437 | +| node1 | 16 | 0.5GB | 0.349 | +| node3 | 16 | 0.5GB | 0.352 | +| node[1,2] | 8 \* 2 | 0.5GB | 0.430 | +| node[3,4] | 8 \* 2 | 0.5GB | 1.935 |