sessions: add support for ucx and more #12723

Open

hppritcha wants to merge 2 commits into main from topic/ucx_excid_support

Conversation

hppritcha
Member

Greatly simplify support for MPI_Comm_create_from_group and MPI_Intercomm_create_from_group by removing the need to support the 128-bit excid notion.

Instead, make use of a PMIx capability - PMIX_GROUP_LOCAL_CID and the notion of PMIX_GROUP_INFO. This capability was introduced in Open PMIx 4.1.3. This capability allows us to piggy-back a local cid selected for the new communicator on the PMIx_Group_construct operation. Using this approach, a lot of the complex active message style operations implemented in the OB1 PML to support excids can be avoided.

This PR also includes simplifications to the OFI MTL to make use of the PMIX_GROUP_LOCAL_CID feature.

Infrastructure for debugging communicator management routines was also introduced, along with a new MCA parameter - mpi_comm_verbose.

Related to #12566
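
For readers unfamiliar with the PMIx side of this, here is a rough sketch of the piggy-back described above. It is illustrative only, not the PR's code: tag, procs/nprocs and peer_proc come from the surrounding communicator code, the uint32_t type for the local cid is an assumption, and error handling is omitted.

/* 1. attach my locally selected cid to the group construct as group info */
pmix_status_t rc;
pmix_info_t info, *results = NULL;
size_t nresults = 0;
pmix_data_array_t darray;

void *ilist = PMIx_Info_list_start();
uint32_t my_local_cid = 42;   /* the cid this rank picked for the new communicator */
PMIx_Info_list_add(ilist, PMIX_GROUP_LOCAL_CID, &my_local_cid, PMIX_UINT32);
PMIx_Info_list_convert(ilist, &darray);
PMIx_Info_list_release(ilist);
PMIX_INFO_LOAD(&info, PMIX_GROUP_INFO, &darray, PMIX_DATA_ARRAY);

rc = PMIx_Group_construct(tag, procs, nprocs, &info, 1, &results, &nresults);
PMIX_DATA_ARRAY_DESTRUCT(&darray);

/* 2. later, read back the cid a peer selected, qualified by the group's context id
 *    (the context id itself is returned in the results array above) */
size_t cid_base = 0;          /* PMIX_GROUP_CONTEXT_ID value from results */
pmix_info_t qual;
pmix_value_t *val = NULL;
PMIX_INFO_LOAD(&qual, PMIX_GROUP_CONTEXT_ID, &cid_base, PMIX_SIZE);
PMIX_INFO_SET_QUALIFIER(&qual);
rc = PMIx_Get(&peer_proc, PMIX_GROUP_LOCAL_CID, &qual, 1, &val);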

@hppritcha changed the title from "sessions: add support for ucx more" to "sessions: add support for ucx and more" on Jul 30, 2024
@hppritcha
Member Author

@rhc54 could you check the PMIx usage here?

@hppritcha requested a review from wenduwan on July 30, 2024 16:50
@wenduwan
Contributor

Running AWS CI

@hppritcha
Member Author

hmmm that coredump is coming from prrte -

Core was generated by `prterun --map-by :OVERSUBSCRIBE -np 4 python3 ./main.py -v -f'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000ffffb0a9c7b8 in prte_grpcomm_base_get_tracker (sig=0xffffa83e1ae0, create=true) at base/grpcomm_base_stubs.c:313
313	        if (coll->dmns[n] == PRTE_PROC_MY_NAME->rank) {
[Current thread is 1 (Thread 0xffffb0c28f40 (LWP 3683323))]
(gdb) bt
#0  0x0000ffffb0a9c7b8 in prte_grpcomm_base_get_tracker (sig=0xffffa83e1ae0, create=true) at base/grpcomm_base_stubs.c:313
#1  0x0000ffffb0a9bac4 in allgather_stub (fd=-1, args=4, cbdata=0xffffa80df400) at base/grpcomm_base_stubs.c:159
#2  0x0000ffffb04ee628 in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
#3  0x0000ffffb04ef290 in event_base_loop () from /lib64/libevent_core-2.1.so.6
#4  0x0000000000408fd4 in main (argc=9, argv=0xfffff697ac38) at prte.c:1297
(gdb) list
308	
309	    /* see if I am in the array of participants - note that I may
310	     * be in the rollup tree even though I'm not participating
311	     * in the collective itself */
312	    for (n = 0; n < coll->ndmns; n++) {
313	        if (coll->dmns[n] == PRTE_PROC_MY_NAME->rank) {
314	            coll->nexpected++;
315	            break;
316	        }
317	    }
(gdb) print n
$1 = 0
(gdb) print coll
$2 = (prte_grpcomm_coll_t *) 0x137b0da0
(gdb) print coll->ndmns
$3 = 1
(gdb) print coll->dmns
$4 = (pmix_rank_t *) 0x0
(gdb) 

@@ -104,6 +104,7 @@ bool ompi_ftmpi_enabled = false;
#endif /* OPAL_ENABLE_FT_MPI */

static int ompi_stream_buffering_mode = -1;
int ompi_comm_verbose_level = 0;
Member

What is the use of this? I can see it set but I don't see it used anywhere.

Member Author

You haven't looked closely enough. See the changes to ompi_comm_init.
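
For context, a small sketch of how such a verbosity MCA parameter typically feeds an opal output stream. The names ompi_comm_output and ompi_comm_verbose_level come from this PR; the init function itself is illustrative, not the PR's exact code.

#include "opal/util/output.h"

extern int ompi_comm_verbose_level;   /* set by the mpi_comm_verbose MCA parameter */
int ompi_comm_output = -1;

static void ompi_comm_verbose_init(void)
{
    /* open a dedicated output stream and tie its verbosity to the MCA parameter */
    ompi_comm_output = opal_output_open(NULL);
    opal_output_set_verbosity(ompi_comm_output, ompi_comm_verbose_level);
    OPAL_OUTPUT_VERBOSE((1, ompi_comm_output,
                         "communicator verbosity set to %d", ompi_comm_verbose_level));
}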


rc = ompi_group_to_proc_name_array (newcomm->c_local_group, &name_array, &proc_count);
if (OPAL_UNLIKELY(OMPI_SUCCESS != rc)) {
return rc;
}

if ( OMPI_COMM_IS_INTER (newcomm) ){
Member

Why such a convoluted approach with multiple mallocs and memcpys? ompi_group_to_proc_name_array is only used here; you can change it to populate an already-allocated array, allowing this code to allocate name_array only once (with the correct size depending on the communicator type) and then populate it once with the necessary groups.

Also, as this array is only ever used converted to PMIx names, you might want to convert the OPAL names directly to PMIx names.

However, the conversion from OPAL naming to PMIx naming is an extremely costly operation, linear in the number of processes and the number of known jobids. At scale, especially when the number of jobids increases (which is the case when sessions are in play), this will be significant.

Contributor

Not sure I understand why converting OPAL to PMIx names should be an "extremely costly" operation. When you record the process names for a job, the rank obviously just carries across - and you can do a one-time call to get the numerical jobid corresponding to that job string so all procs get recorded in OPAL/OMPI using the numerical jobid. If that doesn't help resolve the problem, we can look at other methods.

Member Author

Good point about the arrays. Reworked to no longer use the proc_name_array approach.

rc = PMIx_Group_construct(tag, procs, proc_count, &pinfo, 1, &results, &nresults);
PMIX_INFO_DESTRUCT(&pinfo);
rc = PMIx_Group_construct(tag, procs, proc_count, pinfo, ninfo, &results, &nresults);
PMIX_DATA_ARRAY_DESTRUCT(&darray);
Member

There must be some optimizations behind this centralized group creation for it to scale. I would love to hear more about them.

Contributor

Depends on the RTE, of course, as all PMIx does is construct the request and pass it up. For PRRTE, it is just the tree-based allgather we use for pretty much all collectives. I don't believe anyone has exerted much effort towards optimizing it.

Member

That's exactly what I'm concerned about as I have seen no performance report on this change.

@rhc54
Contributor

rhc54 commented Jul 31, 2024

hmmm that coredump is coming from prrte -

Not surprising - we know that the group construct code in PRRTE master has problems. I've been working on it and am almost done with the fix, but am getting pulled in multiple directions right now. So it is sitting off to the side for now.

I'll try to look thru the PMIx stuff as time permits.

@hppritcha
Member Author

hmmm that coredump is coming from prrte -

Not surprising - we know that the group construct code in PRRTE master has problems. I've been working on it and am almost done with the fix, but am getting pulled in multiple directions right now. So it is sitting off to the side for now.

I'll try to look thru the PMIx stuff as time permits.

I did some digging into this and it turns out that Open MPI is feeding a corrupted proc list into PMIx_Group_construct for at least one rank; that ends up causing problems in prte and ultimately the segfault. Not sure how PMIx could validate that kind of input.

@rhc54
Contributor

rhc54 commented Jul 31, 2024

You mean one proc in the array has a bad nspace? Yeah, I'm not sure how we check for a bad string, though we do know the max size of the nspace - so we should not be charging off into nowhere land. PMIX_CHECK_NSPACE has the protection for that much - could be we have a place in PRRTE that isn't using it.

@hppritcha
Member Author

You mean one proc in the array has a bad nspace? Yeah, I'm not sure how we check for a bad string, though we do know the max size of the nspace - so we should not be charging off into nowhere land. PMIX_CHECK_NSPACE has the protection for that much - could be we have a place in PRRTE that isn't using it.

A namespace sanity check in the prrte group construct may have caught this, but I'm not sure.

@hppritcha force-pushed the topic/ucx_excid_support branch 2 times, most recently from 7f17065 to 23865b7 on August 1, 2024 15:19
@wenduwan
Contributor

wenduwan commented Aug 1, 2024

FYI AWS CI passed for this PR

@hppritcha requested a review from janjust on August 1, 2024 15:21
@hppritcha
Member Author

@wenduwan could you rerun AWS CI?

Greatly simplify support for MPI_Comm_create_from_group and
MPI_Intercomm_create_from_group by removing the need to support
the 128-bit excid notion.

Instead, make use of a PMIx capability - PMIX_GROUP_LOCAL_CID and the notion of
PMIX_GROUP_INFO. This capability was introduced in Open PMIx 4.1.3.
This capability allows us to piggy-back a local cid selected
for the new communicator on the PMIx_Group_construct operation.
Using this approach, a lot of the complex active message style operations
implemented in the OB1 PML to support excids can be avoided.

This PR also includes simplifications to the OFI MTL to make use of the
PMIX_GROUP_LOCAL_CID feature.

Infrastructure for debugging communicator management routines was also
introduced, along with a new MCA parameter -  mpi_comm_verbose.

Related to open-mpi#12566

Signed-off-by: Howard Pritchard <[email protected]>
if (NULL == mtl_comm->c_index_vec) {
ret = OMPI_ERR_OUT_OF_RESOURCE;
OBJ_RELEASE(mtl_comm);
goto error;
} else {
for (uint32_t i=0; i < comm_size; i++) {
mtl_comm->c_index_vec[i].c_index_state = MCA_MTL_OFI_CID_NOT_EXCHANGED;
Contributor

In this change c_index_vec is zero-initialized, whereas it had a different initial state previously, i.e. MCA_MTL_OFI_CID_NOT_EXCHANGED. Is this intentional (I guess not)?

#define MCA_MTL_OFI_CID_NOT_EXCHANGED 2
#define MCA_MTL_OFI_CID_EXCHANGING 1
#define MCA_MTL_OFI_CID_EXCHANGED 0

Member Author

Yes, this is intentional. The various MCA_MTL_OFI_CID... bits were used to track the state of the exchange of the no-longer-used excid info. The new approach takes advantage of the fact that the cid for a communicator that isn't WORLD or self or null can't be zero. I'll add a blurb about this in the comment before the call to ompi_comm_get_remote_cid.
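
A minimal sketch of the zero-as-sentinel check described here, assuming c_index_vec ends up as a plain uint32_t array (as discussed later in this review) and that ompi_comm_get_remote_cid fills in the peer's local cid; the surrounding variable names and the call signature are illustrative.

uint32_t r_cid = mtl_comm->c_index_vec[dst];
if (0 == r_cid) {
    /* 0 cannot be a valid cid for a communicator created from a group
     * (only WORLD, self, or null can have cid 0), so 0 means the peer's
     * cid has not been fetched via PMIx yet */
    ompi_comm_get_remote_cid(comm, dst, &r_cid);
    mtl_comm->c_index_vec[dst] = r_cid;
}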

@@ -81,8 +81,7 @@ extern opal_thread_local int ompi_mtl_ofi_per_thread_ctx;
#define MCA_MTL_OFI_CID_EXCHANGED 0

typedef struct {
uint32_t c_index:30;
Contributor

Unrelated to this change... how was uint32_t c_index:30 supposed to work? I'm not familiar with bit fields, but is it safe to shrink a 32-bit field to 30 bits?

Member Author

Yes, as long as the value for c_index fits into 30 bits. But we're not using that code now.
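
For anyone curious, a standalone demo of how the old 30/2-bit packing behaved; values larger than 2^30 - 1 simply would not fit in c_index. This is illustrative only and unrelated to the current code.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t c_index:30;       /* communicator index, max 2^30 - 1 */
    uint32_t c_index_state:2;  /* exchange state, max 3 */
} packed_index_t;

int main(void)
{
    packed_index_t p = { .c_index = (1u << 30) - 1, .c_index_state = 2 };
    /* on common ABIs both bit fields share a single 32-bit word */
    printf("index=%u state=%u sizeof=%zu\n",
           (unsigned)p.c_index, (unsigned)p.c_index_state, sizeof(p));
    return 0;
}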

@@ -191,6 +191,12 @@ OMPI_DECLSPEC extern bool ompi_enable_timing;
OMPI_DECLSPEC extern int ompi_mpi_event_tick_rate;
OMPI_DECLSPEC extern bool ompi_mpi_yield_when_idle;

/**
* A integer value specifying verbosity level for communicator management
Contributor

Super nit: An integer ... the communicator management subsystem.

@@ -1509,9 +1509,6 @@ static int ompi_comm_idup_with_info_activate (ompi_comm_request_t *request)

static int ompi_comm_idup_with_info_finish (ompi_comm_request_t *request)
Contributor

Shouldn't we simply remove this static function?

Member Author

I'd prefer to leave this in case something changes later on where we actually need this function to do something - note it's appended to the schedule for a comm constructor at line 1505 of comm.c.

@@ -1741,7 +1738,7 @@ int ompi_intercomm_create_from_groups (ompi_group_t *local_group, int local_lead
ompi_communicator_t **newintercomm)
{
ompi_communicator_t *newcomp = NULL, *local_comm, *leader_comm = MPI_COMM_NULL;
ompi_comm_extended_cid_block_t new_block;
ompi_comm_extended_cid_block_t new_block = {{0}};
Contributor

I think you only need {0}
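
For reference, a standalone example of why {0} suffices: in C, a lone 0 initializer zero-initializes the entire aggregate, nested members included; {{0}} is equivalent and merely adds the inner brace (which also silences -Wmissing-braces on some compilers). The struct below is an illustrative stand-in, not the real ompi_comm_extended_cid_block_t layout.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    struct { uint64_t cid_base; uint64_t cid_sub; } block_cid;
    uint32_t block_level;
} demo_block_t;

int main(void)
{
    demo_block_t a = {0};      /* zero-initializes every member, including the nested struct */
    demo_block_t b = {{0}};    /* equivalent, with an explicit inner brace */
    printf("%llu %u %llu\n",
           (unsigned long long)a.block_cid.cid_base, (unsigned)a.block_level,
           (unsigned long long)b.block_cid.cid_sub);
    return 0;
}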

rc = PMIx_Info_list_add(grpinfo, PMIX_GROUP_ASSIGN_CONTEXT_ID, NULL, PMIX_BOOL);
if (PMIX_SUCCESS != rc) {
OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Info_list_add failed %s %d", PMIx_Error_string(rc), __LINE__));
return OMPI_ERR_OUT_OF_RESOURCE ;
Contributor

Same as below for the list: IIUC we need to clean up before returning errors by calling PMIx_Info_list_release.

Member Author

good catch!
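
A minimal sketch of the cleaned-up error path being suggested; grpinfo and the surrounding calls are the ones visible in the hunk above, with PMIx_Info_list_release added before bailing out.

rc = PMIx_Info_list_add(grpinfo, PMIX_GROUP_ASSIGN_CONTEXT_ID, NULL, PMIX_BOOL);
if (PMIX_SUCCESS != rc) {
    OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Info_list_add failed %s %d",
                         PMIx_Error_string(rc), __LINE__));
    PMIx_Info_list_release(grpinfo);   /* release the info list before returning */
    return OMPI_ERR_OUT_OF_RESOURCE;
}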

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha
Member Author

@wenduwan when you have a chance could you check changes per your comments?

@hppritcha requested a review from wenduwan on August 7, 2024 18:56
@hppritcha
Member Author

@janjust please review - esp. UCX part when you get a chance

Contributor

@wenduwan left a comment

My comments have been addressed. Also passed AWS CI.

@@ -81,8 +81,7 @@ extern opal_thread_local int ompi_mtl_ofi_per_thread_ctx;
#define MCA_MTL_OFI_CID_EXCHANGED 0

typedef struct {
uint32_t c_index:30;
uint32_t c_index_state:2;
uint32_t c_index;
} c_index_vec_t;
Member

Why do we need this type now that everything is an int?

Member Author

We don't need it now. I'll remove that!

Member

@bosilca left a comment

It took me a while to understand the change, but overall I think this is a good and necessary move. Two comments:

  • I would move it up from the PML comm into the OMPI comm, and then it will be automatic for all MTL/PML;
  • I would be interested in seeing some performance at scale (we now have a critical component depending on software we have no control over).

@rhc54
Contributor

rhc54 commented Aug 8, 2024

  • we now have a critical component depending on software we have no control over

The dependency has been true for a very long time, actually - OMPI doesn't run at all without PMIx and hasn't for years. In particular for this PR, PMIx has been in the path for connect/accept for years, and the collective used here is the same as the collective currently in that path. The only difference being gradually introduced is to replace the multiple calls to PMIx functions combined with a bunch of MPI ops with a single call to the PMIx group op.

So you should eventually see a performance improvement, although it may take another PR step to get it as I withdrew the PR that consolidated to a single op. This just takes an initial step in that direction.

As for control - OMPI continues to have (in some opinions, an undue) significant influence over PMIx. As noted elsewhere, the only real "disconnect" has been the lack of code contribution by OMPI members, which means that things like faster "allgather" in PRRTE don't get implemented - not due to refusal, but just due to lack of contribution.

@hppritcha
Member Author

Just for clarification - we only use the PMIx_Group_construct method for communicators being created using MPI_Comm_create_from_group and MPI_Intercomm_create_from_groups. Fortunately for us, the mpi4py testsuite has a number of tests which exercise this functionality.
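
For context, a minimal sessions-style program that exercises this path (MPI_Comm_create_from_group without ever touching MPI_COMM_WORLD). This is a generic MPI-4 example, not code from the PR or the mpi4py suite, and the string tag is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;
    int rank;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &session);
    /* build a group from the built-in "world" process set */
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    /* this is the call that ends up going through PMIx_Group_construct with this PR */
    MPI_Comm_create_from_group(group, "example.tag", MPI_INFO_NULL,
                               MPI_ERRORS_ARE_FATAL, &comm);

    MPI_Comm_rank(comm, &rank);
    printf("rank %d in a sessions-created communicator\n", rank);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
    return 0;
}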

}

/*
* This code takes advantage of the fact that c_index is 0 for MPI_COMM_WORLD
Contributor

Maybe instead of a comment we can assert()?

if (OMPI_COMM_IS_INTRA(comm)) {
pml_comm->c_index_vec[comm->c_my_rank] = comm->c_index;
}
comm->c_pml_comm = pml_comm;
Contributor

so we only ever add one pml_comm to ompi_communicator_t?

@@ -755,12 +801,27 @@ int mca_pml_ucx_isend_init(const void *buf, size_t count, ompi_datatype_t *datat
return OMPI_ERROR;
}

if (OPAL_LIKELY(OMPI_COMM_IS_GLOBAL_INDEX(comm))) {
cid = comm->c_index;
} else {
Contributor

Do we need to check that comm->c_pml_comm != NULL?

@@ -898,14 +962,28 @@ int mca_pml_ucx_isend(const void *buf, size_t count, ompi_datatype_t *datatype,
return OMPI_ERROR;
}

if (OPAL_LIKELY(OMPI_COMM_IS_GLOBAL_INDEX(comm))) {
Contributor

Need to move that code to a separate function and call it from mca_pml_ucx_isend and mca_pml_ucx_isend_init.
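
A sketch of the kind of shared helper being suggested, using the names that appear elsewhere in this review (OMPI_COMM_IS_GLOBAL_INDEX, c_index, c_pml_comm, c_index_vec); the helper name, the cast, and the exact lookup are illustrative, not the PR's code.

static inline uint32_t mca_pml_ucx_get_cid(ompi_communicator_t *comm, int dst)
{
    if (OPAL_LIKELY(OMPI_COMM_IS_GLOBAL_INDEX(comm))) {
        /* globally indexed communicators can use c_index directly */
        return comm->c_index;
    }
    /* otherwise use the per-peer cid vector kept in the PML's per-communicator data */
    struct mca_pml_comm_t *pml_comm = (struct mca_pml_comm_t *)comm->c_pml_comm;
    return pml_comm->c_index_vec[dst];
}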

@@ -1019,17 +1100,32 @@ int mca_pml_ucx_send(const void *buf, size_t count, ompi_datatype_t *datatype, i
OMPI_SPC_BYTES_SENT_USER, OMPI_SPC_BYTES_SENT_MPI);
#endif

if (OPAL_LIKELY(OMPI_COMM_IS_GLOBAL_INDEX(comm))) {
Contributor

Deduplicate using a common function, as commented above.

@@ -282,7 +282,7 @@ void mca_pml_ucx_completed_request_init(ompi_request_t *ompi_req)
mca_pml_ucx_request_init_common(ompi_req, false, OMPI_REQUEST_ACTIVE,
mca_pml_completed_request_free,
mca_pml_completed_request_cancel);
ompi_req->req_mpi_object.comm = &ompi_mpi_comm_world.comm;
ompi_req->req_mpi_object.comm = &ompi_mpi_comm_null.comm;
Contributor

Why this change?

@@ -64,6 +64,14 @@ struct mca_pml_ucx_module {
extern mca_pml_base_component_2_1_0_t mca_pml_ucx_component;
extern mca_pml_ucx_module_t ompi_pml_ucx;

typedef struct mca_pml_comm_t {
opal_object_t super;
uint32_t c_index;
Contributor

this c_index member is not used?

@@ -145,6 +145,10 @@ mca_pml_ucx_component_init(int* priority, bool enable_progress_threads,
*priority = (support_level == OPAL_COMMON_UCX_SUPPORT_DEVICE) ?
ompi_pml_ucx.priority : 19;
PML_UCX_VERBOSE(2, "returning priority %d", *priority);

/** this pml supports the extended CID space */
ompi_pml_ucx.super.pml_flags |= MCA_PML_BASE_FLAG_SUPPORTS_EXT_CID;
Contributor

Out of curiosity, is it possible to display that capability for UCX in ompi_info output?

((((uint64_t) (_tag) ) << (PML_UCX_RANK_BITS + PML_UCX_CONTEXT_BITS)) | \
(((uint64_t)(_comm)->c_my_rank ) << PML_UCX_CONTEXT_BITS) | \
((uint64_t)(_comm)->c_index))
((uint64_t)(_c_index)))
Contributor

Since we only have 20 bits here, shall we check that cid is always lower than 2^20? Or are we somehow guaranteed that cid is within the limit?

Member Author

who has only 20 bits?

Contributor

PML_UCX_CONTEXT_BITS is 20 bits, so if _c_index were ever to be bigger it would spill into the ->c_my_rank field; maybe that's not the case?

Contributor

Guess I'm curious - OMPI's communicator index has always been 32 bits. So have you just been lucky because the index never grew big enough to violate your 20-bit threshold? If so, then it seems you have a design problem, as there is no guarantee that the index won't be greater than 20 bits.

Contributor

@yosefe Aug 21, 2024

Yes - the pml_max_contextid field in the UCX module definition should guarantee the communicator index stays below 20 bits.

Contributor

But in connect/accept and comm_spawn operations, the context id is being set by PMIx, which provides a 32-bit ID. There are other places in the code that also set the ID, not just in the PML. So how do you handle IDs beyond 20 bits? Or is UCX simply limited to 20-bit IDs and fails if OMPI sets anything above that?

@@ -614,6 +616,13 @@ static inline struct ompi_proc_t* ompi_comm_peer_lookup (const ompi_communicator
return ompi_group_peer_lookup(comm->c_remote_group,peer_id);
}

static inline bool ompi_comm_instances_same(const ompi_communicator_t *comm1, const ompi_communicator_t *comm2)
Contributor

Suggested change
static inline bool ompi_comm_instances_same(const ompi_communicator_t *comm1, const ompi_communicator_t *comm2)
static inline bool ompi_comm_instances_same (const ompi_communicator_t *comm1, const ompi_communicator_t *comm2)

PMIX_INFO_LOAD(&tinfo[1], PMIX_GROUP_CONTEXT_ID, &excid.cid_base, PMIX_SIZE);
PMIX_INFO_SET_QUALIFIER(&tinfo[1]);
if (PMIX_SUCCESS != (rc = PMIx_Get(&pmix_proc, PMIX_GROUP_LOCAL_CID, tinfo, 2, &val))) {
OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Get failed for PMIX_GROUP_LOCAL_CID cid_base %ld %s", excid.cid_base, PMIx_Error_string(rc)));
Contributor

Suggested change
OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Get failed for PMIX_GROUP_LOCAL_CID cid_base %ld %s", excid.cid_base, PMIx_Error_string(rc)));
OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Get failed for PMIX_GROUP_LOCAL_CID cid_base %ld %s", excid.cid_base, PMIx_Error_string(rc)));
rc = opal_pmix_convert_status(rc);
goto done;

Also, please remove the if (PMIX_SUCCESS == rc) { check.

PMIx_Info_construct(&tinfo[0]);
PMIX_INFO_LOAD(&tinfo[0], PMIX_TIMEOUT, &ompi_pmix_connect_timeout, PMIX_UINT32);

excid = ompi_comm_get_extended_cid (comm);
Contributor

Suggested change
excid = ompi_comm_get_extended_cid (comm);
excid = ompi_comm_get_extended_cid(comm);

Comment on lines +369 to +375
if (PMIX_SUCCESS != rc) {
OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Info_list_add failed %s %d", PMIx_Error_string(rc), __LINE__));
PMIX_DATA_ARRAY_DESTRUCT(&darray);
rc = OMPI_ERR_OUT_OF_RESOURCE;
goto fn_exit;
}
PMIX_DATA_ARRAY_DESTRUCT(&darray);
Contributor

Suggested change
if (PMIX_SUCCESS != rc) {
OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Info_list_add failed %s %d", PMIx_Error_string(rc), __LINE__));
PMIX_DATA_ARRAY_DESTRUCT(&darray);
rc = OMPI_ERR_OUT_OF_RESOURCE;
goto fn_exit;
}
PMIX_DATA_ARRAY_DESTRUCT(&darray);
PMIX_DATA_ARRAY_DESTRUCT(&darray);
if (PMIX_SUCCESS != rc) {
OPAL_OUTPUT_VERBOSE((10, ompi_comm_output, "PMIx_Info_list_add failed %s %d", PMIx_Error_string(rc), __LINE__));
rc = OMPI_ERR_OUT_OF_RESOURCE;
goto fn_exit;
}

@@ -464,18 +537,11 @@ static int ompi_comm_nextcid_ext_nb (ompi_communicator_t *newcomm, ompi_communic
is_new_block = true;
}


Contributor

Suggested change

@@ -498,7 +564,7 @@ int ompi_comm_nextcid_nb (ompi_communicator_t *newcomm, ompi_communicator_t *com

/* old CID algorighm */

/* if we got here and comm is NULL then that means the app is invoking MPI-4 Sessions or later
/* if we got here and comm is NULL then that means the app is invoking MPI-4 Sessions or later
Contributor

Suggested change
/* if we got here and comm is NULL then that means the app is invoking MPI-4 Sessions or later
/* if we got here and comm is NULL then that means the app is invoking MPI-4 Sessions or later

@@ -194,6 +196,7 @@ static int mca_pml_ucx_recv_worker_address(ompi_proc_t *proc,
return ret;
}


Contributor

Suggested change

@rhc54
Contributor

rhc54 commented Aug 13, 2024

I have seen the question of the size of the context ID raised several times, so let me provide a little perspective on it. In PMIx, the context ID for a group is specified to be size_t. Of course, the actual size of the value being returned is set by the RTE since it is solely responsible for assigning that value. In the case of PRRTE, the variable tracking context IDs has a uint32_t definition.

@hppritcha had previously (a long time ago, now) requested that PRRTE start assigning values at UINT32_MAX and count downwards from there. IIRC, this was to avoid conflict with OMPI's internal CID assignment code that started at 0 (for MPI_COMM_WORLD) and counted up.

I don't really understand all the bit manipulation going on here, so I am probably missing something - but it strikes me that if you are limiting the context ID to something less than 32-bits, and using the CID returned by PMIx, then you have a problem.

@hppritcha
Member Author

Okay sorry for not getting back to this thread sooner. Just very busy in my ompi time with other things at the moment.
There's a very important PML parameter that PML components are supposed to initialize - pml_max_contextid - before being used for sending/receiving communicator-based messages.

There's also a variant of this for the MTL components.

Anyway, this quantity is used to control the maximum value of c_index that is the subject of the comments above.
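
A sketch of what that looks like for the UCX PML, assuming the field is assigned during component init next to the pml_flags line shown earlier in this review; the exact expression is illustrative.

/* advertise the largest communicator index this PML can encode in its
 * 20-bit tag field, so the CID allocation code never hands it anything larger */
ompi_pml_ucx.super.pml_max_contextid = (1UL << PML_UCX_CONTEXT_BITS) - 1;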

@hppritcha
Member Author

Also, the PMIx group id is now only used to extract modex data, as can be seen in the code changes in this PR. The excid that Nathan developed (which used the PMIx group id in a 128-bit extended context id) is not really used now except in OB1. Eventually we'll remove that, but not as part of 6.0.x. It was one of those ideas that's good in theory, but in the general case of external libraries providing the lower-level APIs for moving data around, it's not really maintainable. The excid stuff is described in https://doi.org/10.1109/CLUSTER.2019.8891002

@rhc54
Contributor

rhc54 commented Aug 21, 2024

Sooo...you are telling me that I went thru all the trouble of adding the ability for PMIx and PRRTE to provide context IDs as part of the group operation that returns modex data - for nothing? You'd rather not only do the group op, but also do a kluge multi-stage operation to create the ID??

@hppritcha
Member Author

Huh? Look at the code changes in this PR for the file ompi/communicator/comm_cid.c. There you see we do indeed use the PMIx group context id to look up the local cid for the target receiver. We are just looking up the receiver's local cid in a different way than Nathan's difficult-to-maintain (much less implement in PMLs besides OB1) out-of-band bootstrap method.

@rhc54
Contributor

rhc54 commented Aug 21, 2024

Okay, @hppritcha - I confess to confusion. Your statement was:

The excid that Nathan developed (which used the PMIx group id in a 128-bit extended context id) is not really used now except in OB1. Eventually we'll remove that, but not as part of 6.0.x. It was one of those ideas that's good in theory, but in the general case of external libraries providing the lower-level APIs for moving data around, it's not really maintainable.

I was reacting to this statement as it implied (as I read it) that the use of PMIx to generate a context ID was something to be removed, as PMIx is the only "external library" involved (to the best of my knowledge). I confess I haven't read thru this PR to try and fully grok all the changes, so if I have misunderstood, I apologize for it. I do see the use of PMIx group ops - I just took it that you intended to remove those in some post-6.0 step.

AFAIK, the group operation is working correctly in terms of generating IDs. I know there is a problem in the PRRTE/PMIx master branches (and in their release branches) regarding correctly collecting all the required info to be exchanged. I've been working on it, albeit slowly. Had to step away for a while, but am now gradually attempting to restart it.

7 participants