# Benchmark of MPI Communication Performance on Different Nodes
## Facility Specification
| Machine          | CPU                                   | Memory | OS         | InfiniBand Device Model & Driver         |
| ---------------- | ------------------------------------- | -----: | ---------- | ---------------------------------------- |
| H3C: h[5,6]      | Intel Xeon 6240R 24C @ 2.4GHz \* 2    | 192GB  | CentOS 7.8 | Mellanox MT27800, official OFED 5.5      |
| H3C: h[9,10]     | Intel Xeon 6240R 24C @ 2.4GHz \* 2    | 192GB  | CentOS 7.9 | Mellanox MT27800, CentOS built-in driver |
| H3C: SciNat      | Intel Xeon 5219R 20C @ 2.1GHz \* 2    | 192GB  | CentOS 7.9 | Mellanox MT27800, CentOS built-in driver |
| Sugon: node[1,2] | Intel Xeon E5-2650v2 8C @ 2.6GHz \* 2 | 64GB   | CentOS 7.8 | Mellanox MT27500, official OFED 4.9      |
| Sugon: node[3,4] | Intel Xeon E5-2650v2 8C @ 2.6GHz \* 2 | 64GB   | CentOS 7.9 | Mellanox MT27500, CentOS built-in driver |
## Alltoall Performance
`Alltoall` is very similar to a matrix transpose. Here is a schematic
illustration of `Alltoall`:
```
@brief Illustrates how to use an all-to-all.
@details This application is meant to be run with 3 MPI processes. Every MPI
process begins with a buffer containing 3 integers, one for each process
including themselves. They also have a buffer in which they receive the
integer sent to them by each process. It can be visualised as follows:
+-----------------------+ +-----------------------+ +-----------------------+
| Process 0 | | Process 1 | | Process 2 |
+-------+-------+-------+ +-------+-------+-------+ +-------+-------+-------+
| Value | Value | Value | | Value | Value | Value | | Value | Value | Value |
| 0 | 100 | 200 | | 300 | 400 | 500 | | 600 | 700 | 800 |
+-------+-------+-------+ +-------+-------+-------+ +-------+-------+-------+
| | |_________|_______|_______|_________|___ | |
| | _____________|_______|_______|_________| | | |
| |___|_____________|_ | _|_____________|___| |
| _____|_____________| | | | |_____________|_____ |
| | | | | | | | |
+-----+-----+-----+ +-----+-----+-----+ +-----+-----+-----+
| 0 | 300 | 600 | | 100 | 400 | 700 | | 200 | 500 | 800 |
+-----+-----+-----+ +-----+-----+-----+ +-----+-----+-----+
| Process 0 | | Process 1 | | Process 2 |
+-----------------+ +-----------------+ +-----------------+
```
Reference: [rookiehpc](https://rookiehpc.org/mpi/docs/mpi_alltoall/index.html)
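To make the picture concrete, here is a minimal, self-contained sketch (not part of this repository; it simply mirrors the 3-process diagram above). It can be compiled with `mpicc` and run with `mpirun -np 3`:

```C
/* Minimal sketch mirroring the diagram above; assumes exactly 3 MPI ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int irank, nrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &irank);
    MPI_Comm_size(MPI_COMM_WORLD, &nrank);
    if (nrank != 3) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Each rank contributes one integer for every rank, including itself. */
    int send[3] = { irank * 300, irank * 300 + 100, irank * 300 + 200 };
    int recv[3] = { 0, 0, 0 };
    MPI_Alltoall(send, 1, MPI_INT, recv, 1, MPI_INT, MPI_COMM_WORLD);

    /* Rank 0 prints 0 300 600, rank 1 prints 100 400 700, rank 2 prints 200 500 800. */
    printf("Process %d received: %d %d %d\n", irank, recv[0], recv[1], recv[2]);

    MPI_Finalize();
    return 0;
}
```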
### Test Code
```C
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

void test_alltoall(const uint64_t count_send, const int nrank, const int irank) {
    if (0 == irank) {
        uint64_t bytes = count_send * nrank * sizeof(int);
        uint64_t gbs = bytes >> 30;
        uint64_t mbs = ( bytes % (1 << 30) ) >> 20;
        printf(" * Profiling throughput of %4lu GB %4lu MB per rank ... ", gbs, mbs);
    }

    /* Each rank sends count_send integers to every rank, so each buffer
       holds nrank * count_send integers. */
    int* buffer_send = (int*)malloc( nrank * count_send * sizeof(int) );
    int* buffer_recv = (int*)malloc( nrank * count_send * sizeof(int) );
    for (uint64_t i = 0; i != (nrank * count_send); ++i) {
        buffer_send[i] = i * i + irank * 114514;
        buffer_recv[i] = 0;
    }

    clock_t start = clock();
    for (int i = 0; i != 10; ++i) {
        MPI_Alltoall(buffer_send, count_send, MPI_INT, buffer_recv, count_send, MPI_INT, MPI_COMM_WORLD);
    }
    clock_t end = clock();
    int msec = (end - start) * 1000 / CLOCKS_PER_SEC;  /* elapsed time in milliseconds */

    free(buffer_send);
    free(buffer_recv);

    if (0 == irank) {
        printf(" time taken in MPI_Alltoall: %3d s %3d ms\n", msec / 1000, msec % 1000);
    }
}
```
**Note**: This code allocates two buffers of `nrank * count_send * sizeof(int)` bytes on each MPI rank, so the total allocated memory grows with the number of MPI ranks. Thus the more processes you launch with `mpirun`, the more data is transferred. The reported time covers the whole loop of 10 `MPI_Alltoall` calls.
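The commit only shows the `test_alltoall` function itself. A driver along the following lines could call it; this is a sketch under assumptions, not the author's actual `main` (the per-rank buffer size in MB read from `argv[1]` is hypothetical):

```C
/* Hypothetical driver for test_alltoall(); the real main() is not shown in
 * this commit. The per-rank buffer size (in MB) is taken from argv[1]. */
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

void test_alltoall(const uint64_t count_send, const int nrank, const int irank);

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int irank, nrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &irank);
    MPI_Comm_size(MPI_COMM_WORLD, &nrank);

    /* count_send integers go to EACH rank, so the per-rank buffer is
     * count_send * nrank * sizeof(int) bytes; invert that to get count_send. */
    uint64_t mb = (argc > 1) ? strtoull(argv[1], NULL, 10) : 1536;
    uint64_t count_send = (mb << 20) / ((uint64_t)nrank * sizeof(int));
    test_alltoall(count_send, nrank, irank);

    MPI_Finalize();
    return 0;
}
```

With such a driver, `mpirun -np 48 ./alltoall 1536` would correspond to the 1.5 GB-per-rank rows below, and `mpirun -np 16 ./alltoall 512` to the 0.5 GB rows.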
### Test Result
| Machine | MPI Ranks | Data Transferred per Rank | Time Used (sec) |
| --------- | :-------: | ------------------------: | --------------: |
| SciNat | 48 | 1.5 GB | 0.895 |
| h5        | 48        | 1.5 GB | 1.094 |
| h9        | 48        | 1.5 GB | 1.045 |
| h5,h6     | 24 \* 2   | 1.5 GB | 1.688 |
| h9,h10    | 24 \* 2   | 1.5 GB | 2.437 |
| node1     | 16        | 0.5 GB | 0.349 |
| node3     | 16        | 0.5 GB | 0.352 |
| node[1,2] | 8 \* 2    | 0.5 GB | 0.430 |
| node[3,4] | 8 \* 2    | 0.5 GB | 1.935 |