Improve multithread support #188

Merged: 7 commits into ORNL-CEES:master, Jun 10, 2019

Conversation

@masterleinad (Collaborator) commented May 30, 2019

Part of #185. This pull request allows hierarchy_driver to run multithreaded. The main modification is that we have to copy the Evaluator for each agglomerate to avoid race conditions.

I have no idea why the formatting in include/mfmg/dealii/dealii_mesh_evaluator.hpp changed that much.

@codecov-io commented May 30, 2019

Codecov Report

Merging #188 into master will decrease coverage by 0.05%.
The diff coverage is 92.45%.

@@            Coverage Diff             @@
##           master     #188      +/-   ##
==========================================
- Coverage   88.63%   88.57%   -0.06%     
==========================================
  Files          57       57              
  Lines        3193     3203      +10     
==========================================
+ Hits         2830     2837       +7     
- Misses        363      366       +3
Impacted Files                                              Coverage Δ
tests/main.cc                                               100% <ø> (ø) ⬆️
.../mfmg/dealii/dealii_matrix_free_mesh_evaluator.hpp       10% <0%> (+3.75%) ⬆️
include/mfmg/dealii/amge_host.templates.hpp                 93.4% <100%> (+0.03%) ⬆️
tests/hierarchy_driver.cc                                   69.18% <100%> (-0.17%) ⬇️
tests/test_hierarchy_helpers.hpp                            70.5% <100%> (+0.87%) ⬆️
tests/test_hierarchy.cc                                     87.04% <100%> (ø) ⬆️
...rce/dealii/dealii_matrix_free_hierarchy_helpers.cc       99.15% <97.43%> (ø) ⬆️
include/mfmg/dealii/dealii_mesh_evaluator.hpp               33.33% <0%> (+13.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ced9993...f34704a.

@@ -174,8 +174,7 @@ int main(int argc, char *argv[])
{
namespace boost_po = boost::program_options;

MPI_Init(&argc, &argv);
dealii::MultithreadInfo::set_thread_limit(1);
dealii::Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv);
@dalg24 (Collaborator) commented May 30, 2019

I noticed that deal.II does not assert that the level of thread support available is sufficient (does not assert provided >= wanted)

I have nothing against the change you suggest, even more so since this line is the only direct call to MPI_Init() in mfmg, but I am curious if this actually was a bug. Can you expand on this?

Edit: @masterleinad, I realized I had pasted the wrong link.

@masterleinad (Collaborator, Author) replied:

For requesting explicit thread support, we need to call MPI_Init_thread instead, which is what dealii::Utilities::MPI::MPI_InitFinalize does (apart from initializing Zoltan and emptying memory pools).

MPI_Init does not seem to allow calling MPI functions from multiple threads and hence behaves similarly to MPI_Init_thread with MPI_THREAD_SINGLE (or possibly MPI_THREAD_FUNNELED).

In the end, I was just seeing MPI errors when calling MPI functions (inside the MatrixFree constructor, to be precise) when running multithreaded. Using MPI_Init instead of MPI_Init_thread or dealii::Utilities::MPI::MPI_InitFinalize was likely just an oversight.
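
A minimal sketch of the difference, for a standalone MPI program (this is not the mfmg code itself; dealii::Utilities::MPI::MPI_InitFinalize performs the equivalent initialization internally):

```cpp
#include <mpi.h>

#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[])
{
  // Request multithreaded MPI support explicitly; plain MPI_Init() makes no
  // such guarantee.
  int provided = MPI_THREAD_SINGLE;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  // MPI may grant a lower level than requested, so check it explicitly
  // (deal.II does not assert provided >= wanted).
  if (provided < MPI_THREAD_MULTIPLE)
  {
    std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  }

  // ... run the multithreaded code ...

  MPI_Finalize();
  return 0;
}
```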

@dalg24 (Collaborator) replied:

Updated link to deal.II code

tests/test_hierarchy.cc (review thread resolved)
AgglomerateOperator agglomerate_operator(*dealii_mesh_evaluator,
agglomerate_dof_handler,
agglomerate_constraints);
agglomerate_operator.vmult(correction, delta_eig);
Collaborator:

These changes look reasonable but are lost in all the code formatting mess.

{
ASSERT_THROW_NOT_IMPLEMENTED();
return std::make_unique<DealIIMatrixFreeMeshEvaluator>(*this);
}
Collaborator:

Please remind me why this cannot be pure virtual.

Collaborator:

Need to discuss why this is OK.

My objections are: we force the user to implement boilerplate code, and we have no way of knowing how much data they stuffed into their derived class, so the copy could be costly.
Also, I would like to understand where the race condition happens :/

@masterleinad (Collaborator, Author) replied:

This would be much easier if we had a uniform interface for global/agglomerate initialization and evaluation, or separate user classes.

@masterleinad (Collaborator, Author) replied:

/usr/include/c++/7/ext/new_allocator.h:136:4: error: invalid new-expression of abstract class type ‘TestMeshEvaluator<mfmg::DealIIMatrixFreeMeshEvaluator<2> >’
  { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from mfmg/tests/test_hierarchy.cc:37:0:
mfmg/tests/test_hierarchy_helpers.hpp:202:7: note:   because the following virtual functions are pure within ‘TestMeshEvaluator<mfmg::DealIIMatrixFreeMeshEvaluator<2> >’:
 class TestMeshEvaluator final : public MeshEvaluator
       ^~~~~~~~~~~~~~~~~
In file included from mfmg/include/mfmg/dealii/amge_host.hpp:16:0,
                 from mfmg/include/mfmg/dealii/dealii_hierarchy_helpers.hpp:16,
                 from mfmg/include/mfmg/common/hierarchy.hpp:19,
                 from mfmg/tests/test_hierarchy.cc:14:
mfmg/include/mfmg/dealii/dealii_matrix_free_mesh_evaluator.hpp:63:58: note: 	std::unique_ptr<mfmg::DealIIMatrixFreeMeshEvaluator<dim> > mfmg::DealIIMatrixFreeMeshEvaluator<dim>::clone() const [with int dim = 2]
   virtual std::unique_ptr<DealIIMatrixFreeMeshEvaluator> clone() const = 0;
                                                          ^~~~~

I am not sure if we need to modify the matrix-based version as well.
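
To illustrate the boilerplate objection above: with clone() pure virtual, every user-provided evaluator has to supply its own deep copy. A rough sketch, assuming a hypothetical user class MyMeshEvaluator (not part of mfmg), based on the signature quoted in the compiler output:

```cpp
#include <mfmg/dealii/dealii_matrix_free_mesh_evaluator.hpp>

#include <memory>

// Hypothetical user-provided evaluator; constructors and the
// matrix_free_*_agglomerate() members are omitted for brevity.
template <int dim>
class MyMeshEvaluator final : public mfmg::DealIIMatrixFreeMeshEvaluator<dim>
{
public:
  // The boilerplate in question: a deep copy of the evaluator, including
  // whatever (possibly large) data the derived class stores.
  std::unique_ptr<mfmg::DealIIMatrixFreeMeshEvaluator<dim>>
  clone() const override
  {
    return std::make_unique<MyMeshEvaluator<dim>>(*this);
  }
};
```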

@masterleinad (Collaborator, Author) commented:

The matrix-based version needs more work, so I restricted the number of threads to one in that case again. To actually check the matrix-free version here, I changed the default to "matrix-free" as well (forcing a Chebyshev smoother since that is the only possibility).

@Rombur (Collaborator) commented May 31, 2019

So the problem is that MatrixFree is not thread-safe? I don't get why we need to clone the MeshEvaluator

@masterleinad (Collaborator, Author) replied:

> So the problem is that MatrixFree is not thread-safe? I don't get why we need to clone the MeshEvaluator

No, that is not the problem. So far, we had a common Evaluator object responsible for both agglomerate evaluations and global evaluations. If we use multiple threads, we try to evaluate different agglomerates at the same time. In particular, we change the state of the object in initialize_agglomerate, and that state is then used in evaluate_agglomerate. Hence, all the agglomerate-related member variables might be involved in race conditions. Apart from that, we might end up in an inconsistent state between initialize_agglomerate and evaluate_agglomerate.
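
A minimal sketch of the cloning pattern; the names Evaluator, agglomerate_data, and process_agglomerates are hypothetical and do not correspond to the actual mfmg classes:

```cpp
#include <memory>
#include <vector>

// Hypothetical evaluator with per-agglomerate state.
struct Evaluator
{
  // Written by initialize_agglomerate() and read by evaluate_agglomerate();
  // sharing one instance between threads makes this a data race.
  std::vector<double> agglomerate_data;

  void initialize_agglomerate(int agglomerate_id)
  {
    agglomerate_data.assign(1, static_cast<double>(agglomerate_id));
  }

  double evaluate_agglomerate() const { return agglomerate_data.front(); }

  std::unique_ptr<Evaluator> clone() const
  {
    return std::make_unique<Evaluator>(*this);
  }
};

void process_agglomerates(Evaluator const &global_evaluator, int n_agglomerates)
{
  // The loop body may run on different threads; each agglomerate works on its
  // own deep copy, so initialize/evaluate never touch shared mutable state.
  for (int agg = 0; agg < n_agglomerates; ++agg)
  {
    auto local_evaluator = global_evaluator.clone();
    local_evaluator->initialize_agglomerate(agg);
    local_evaluator->evaluate_agglomerate();
  }
}
```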

@masterleinad (Collaborator, Author) commented:

Compared to #192, the timings for running with one thread are unaffected. These are the run times (in seconds) on my notebook:

# threads |  2x2 |  4x4
        1 | 37.3 | 20.5
        2 | 25.8 | 12.6
        4 | 13.4 |  8.9
        8 | 11.9 |  8.32

@masterleinad (Collaborator, Author) commented:

For the 4x4 patch and one thread, I get

| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |      2.01s |        10% |
| Apply: fine levels              |        74 |         2s |        10% |
| Setup                           |         1 |      14.8s |        75% |
| Setup: build restrictor         |         1 |      14.8s |        75% |

while 8 threads give me

| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |      1.92s |        23% |
| Apply: fine levels              |        74 |      1.91s |        23% |
| Setup                           |         1 |      3.96s |        48% |
| Setup: build restrictor         |         1 |      3.95s |        48% |

@masterleinad (Collaborator, Author) commented Jun 5, 2019

Using vectorization (AVX2) on top of that gives me (1 thread)

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      7.73s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |     0.557s |       7.2% |
| Apply: fine levels              |        74 |     0.553s |       7.2% |
| Setup                           |         1 |      6.88s |        89% |
| Setup: build restrictor         |         1 |      6.88s |        89% |
+---------------------------------+-----------+------------+------------+

and with 8 threads

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      2.73s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |       0.6s |        22% |
| Apply: fine levels              |        74 |     0.596s |        22% |
| Setup                           |         1 |      1.84s |        67% |
| Setup: build restrictor         |         1 |      1.83s |        67% |
+---------------------------------+-----------+------------+------------+

@masterleinad (Collaborator, Author) commented:

Using MPI instead of threads gives (1 MPI process)

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      8.89s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |     0.665s |       7.5% |
| Apply: fine levels              |        74 |     0.659s |       7.4% |
| Setup                           |         1 |      7.88s |        89% |
| Setup: build restrictor         |         1 |      7.87s |        89% |
+---------------------------------+-----------+------------+------------+

and with 8 MPI processes

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      2.12s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |     0.349s |        16% |
| Apply: fine levels              |        74 |     0.331s |        16% |
| Setup                           |         1 |      1.62s |        76% |
| Setup: build restrictor         |         1 |      1.61s |        76% |
+---------------------------------+-----------+------------+------------+

@masterleinad (Collaborator, Author) commented:

Does anyone want to discuss a different solution?

@Rombur (Collaborator) commented Jun 6, 2019

@dalg24 said he was going to look at it

@@ -12,6 +12,7 @@
#include <mfmg/common/instantiation.hpp>
#include <mfmg/common/operator.hpp>
#include <mfmg/dealii/amge_host.hpp>
#include <mfmg/dealii/amge_host.templates.hpp>
Collaborator:

Add a comment that this is included for MatrixFreeAgglomerateOperator, and note that it would probably not be a bad idea to move the definition elsewhere.
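
One possible form of the suggested comment (the wording is a suggestion, not the actual change):

```cpp
// Included for the definition of MatrixFreeAgglomerateOperator; it would
// probably be better to move that definition elsewhere.
#include <mfmg/dealii/amge_host.templates.hpp>
```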

@@ -56,6 +56,16 @@ class DealIIMatrixFreeMeshEvaluator : public DealIIMeshEvaluator<dim>
*/
virtual ~DealIIMatrixFreeMeshEvaluator() override = default;

/**
* Create a deep copy of this class such that initializing on another
* agglomerate works.
Collaborator:

Please add a note that this was introduced because calls to the member functions matrix_free_initialize_agglomerate() and matrix_free_evaluate_agglomerate() (implemented in the user-provided class deriving from MatrixFreeMeshEvaluator) are not thread safe, and that it is not the only option to solve the problem.
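
One possible wording for that note, attached to the declaration quoted in the compiler output above (the exact doxygen text is a suggestion):

```cpp
/**
 * Create a deep copy of this class such that initializing on another
 * agglomerate works.
 *
 * This member was introduced because calls to
 * matrix_free_initialize_agglomerate() and matrix_free_evaluate_agglomerate()
 * (implemented in the user-provided class deriving from this one) are not
 * thread safe if a single evaluator is shared between agglomerates. Cloning
 * the evaluator per agglomerate is one way, though not the only one, to avoid
 * the resulting race conditions.
 */
virtual std::unique_ptr<DealIIMatrixFreeMeshEvaluator> clone() const = 0;
```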

@dalg24 merged commit 1beeef6 into ORNL-CEES:master on Jun 10, 2019
@masterleinad (Collaborator, Author) commented:

Thanks!

@masterleinad deleted the multithreading branch on June 10, 2019 19:05
@masterleinad mentioned this pull request on Jun 13, 2019