Improve multithread support #188

Merged: 7 commits into ORNL-CEES:master, Jun 10, 2019

Conversation

@masterleinad (Collaborator) commented May 30, 2019

Part of #185. This pull request allows hierarchy_driver to run multithreaded. The main modification is that we have to copy the Evaluator for each agglomerate to avoid race conditions.

I have no idea why the formatting in include/mfmg/dealii/dealii_mesh_evaluator.hpp changed that much.

@codecov-io commented May 30, 2019

Codecov Report

Merging #188 into master will decrease coverage by 0.05%.
The diff coverage is 92.45%.

@@            Coverage Diff             @@
##           master     #188      +/-   ##
==========================================
- Coverage   88.63%   88.57%   -0.06%     
==========================================
  Files          57       57              
  Lines        3193     3203      +10     
==========================================
+ Hits         2830     2837       +7     
- Misses        363      366       +3
Impacted Files                                              Coverage Δ
tests/main.cc                                               100% <ø> (ø) ⬆️
.../mfmg/dealii/dealii_matrix_free_mesh_evaluator.hpp       10% <0%> (+3.75%) ⬆️
include/mfmg/dealii/amge_host.templates.hpp                 93.4% <100%> (+0.03%) ⬆️
tests/hierarchy_driver.cc                                   69.18% <100%> (-0.17%) ⬇️
tests/test_hierarchy_helpers.hpp                            70.5% <100%> (+0.87%) ⬆️
tests/test_hierarchy.cc                                     87.04% <100%> (ø) ⬆️
...rce/dealii/dealii_matrix_free_hierarchy_helpers.cc       99.15% <97.43%> (ø) ⬆️
include/mfmg/dealii/dealii_mesh_evaluator.hpp               33.33% <0%> (+13.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ced9993...f34704a.

@@ -174,8 +174,7 @@ int main(int argc, char *argv[])
{
namespace boost_po = boost::program_options;

MPI_Init(&argc, &argv);
dealii::MultithreadInfo::set_thread_limit(1);
dealii::Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv);
@dalg24 (Collaborator) commented May 30, 2019

I noticed that deal.II does not assert that the level of thread support available is sufficient (does not assert provided >= wanted)

I have nothing against the change you suggest, even more so since this line is the only direct call to MPI_Init() in mfmg, but I am curious if this actually was a bug. Can you expand on this?

Edit: @masterleinad, I realized I had pasted the wrong link.

@masterleinad (Collaborator, Author) replied:

For requesting explicit thread support, we need to call MPI_Init_thread instead, which is what dealii::Utilities::MPI::MPI_InitFinalize does (apart from initializing Zoltan and emptying memory pools).

MPI_Init does not seem to allow calling MPI functions from multiple threads and hence behaves similarly to MPI_Init_thread with MPI_THREAD_SINGLE (or possibly MPI_THREAD_FUNNELED).

In the end, I was just seeing MPI errors when calling MPI functions (inside the MatrixFree constructor, to be precise) when running multithreaded. Using MPI_Init instead of MPI_Init_thread or dealii::Utilities::MPI::MPI_InitFinalize was likely just an oversight.
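
A minimal sketch of the difference, for a standalone MPI program (this is not the mfmg code itself; dealii::Utilities::MPI::MPI_InitFinalize performs the equivalent initialization internally):

```cpp
#include <mpi.h>

#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[])
{
  // Request multithreaded MPI support explicitly; plain MPI_Init() makes no
  // such guarantee.
  int provided = MPI_THREAD_SINGLE;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  // MPI may grant a lower level than requested, so check it explicitly
  // (deal.II does not assert provided >= wanted).
  if (provided < MPI_THREAD_MULTIPLE)
  {
    std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  }

  // ... run the multithreaded code ...

  MPI_Finalize();
  return 0;
}
```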

@dalg24 (Collaborator) replied:

Updated link to deal.II code

tests/test_hierarchy.cc (review thread resolved)
AgglomerateOperator agglomerate_operator(*dealii_mesh_evaluator,
agglomerate_dof_handler,
agglomerate_constraints);
agglomerate_operator.vmult(correction, delta_eig);
Collaborator:

These changes look reasonable but are lost in all the code formatting mess.

{
ASSERT_THROW_NOT_IMPLEMENTED();
return std::make_unique<DealIIMatrixFreeMeshEvaluator>(*this);
}
Collaborator:

Please remind me why this cannot be pure virtual.

Collaborator:

Need to discuss why this is OK.

My objections are: we force the user to implement boilerplate code, and we have no way of knowing how much data they stuffed into their derived class, so the copy could be costly.
Also, I would like to understand where the race condition happens :/

@masterleinad (Collaborator, Author) replied:

This would be much easier if we had a uniform interface for global/agglomerate initialization and evaluation, or separate user classes.

@masterleinad (Collaborator, Author) replied:

/usr/include/c++/7/ext/new_allocator.h:136:4: error: invalid new-expression of abstract class type ‘TestMeshEvaluator<mfmg::DealIIMatrixFreeMeshEvaluator<2> >’
  { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from mfmg/tests/test_hierarchy.cc:37:0:
mfmg/tests/test_hierarchy_helpers.hpp:202:7: note:   because the following virtual functions are pure within ‘TestMeshEvaluator<mfmg::DealIIMatrixFreeMeshEvaluator<2> >’:
 class TestMeshEvaluator final : public MeshEvaluator
       ^~~~~~~~~~~~~~~~~
In file included from mfmg/include/mfmg/dealii/amge_host.hpp:16:0,
                 from mfmg/include/mfmg/dealii/dealii_hierarchy_helpers.hpp:16,
                 from mfmg/include/mfmg/common/hierarchy.hpp:19,
                 from mfmg/tests/test_hierarchy.cc:14:
mfmg/include/mfmg/dealii/dealii_matrix_free_mesh_evaluator.hpp:63:58: note: 	std::unique_ptr<mfmg::DealIIMatrixFreeMeshEvaluator<dim> > mfmg::DealIIMatrixFreeMeshEvaluator<dim>::clone() const [with int dim = 2]
   virtual std::unique_ptr<DealIIMatrixFreeMeshEvaluator> clone() const = 0;
                                                          ^~~~~

I am not sure if we need to modify the matrix-based version as well.
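
To illustrate the boilerplate objection above: with clone() pure virtual, every user-provided evaluator has to supply its own deep copy. A rough sketch, assuming a hypothetical user class MyMeshEvaluator (not part of mfmg), based on the signature quoted in the compiler output:

```cpp
#include <mfmg/dealii/dealii_matrix_free_mesh_evaluator.hpp>

#include <memory>

// Hypothetical user-provided evaluator; constructors and the
// matrix_free_*_agglomerate() members are omitted for brevity.
template <int dim>
class MyMeshEvaluator final : public mfmg::DealIIMatrixFreeMeshEvaluator<dim>
{
public:
  // The boilerplate in question: a deep copy of the evaluator, including
  // whatever (possibly large) data the derived class stores.
  std::unique_ptr<mfmg::DealIIMatrixFreeMeshEvaluator<dim>>
  clone() const override
  {
    return std::make_unique<MyMeshEvaluator<dim>>(*this);
  }
};
```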

@masterleinad (Collaborator, Author) commented:

The matrix-based version needs more work, so I restricted the number of threads to one in that case again. To actually check the matrix-free version here, I changed the default to "matrix-free" as well (forcing a Chebyshev smoother since that is the only possibility).

@Rombur (Collaborator) commented May 31, 2019

So the problem is that MatrixFree is not thread-safe? I don't get why we need to clone the MeshEvaluator

@masterleinad (Collaborator, Author) replied:

> So the problem is that MatrixFree is not thread-safe? I don't get why we need to clone the MeshEvaluator

No, that is not the problem. So far, we had a common Evaluator object responsible for both agglomerate evaluations and global evaluations. If we use multiple threads, we try to evaluate different agglomerates at the same time. In particular, we change the state of the object in initialize_agglomerate, and that state is then used in evaluate_agglomerate. Hence, all the agglomerate-related member variables might be involved in race conditions. Apart from that, we might end up in an inconsistent state between initialize_agglomerate and evaluate_agglomerate.
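
A minimal sketch of the cloning pattern; the names Evaluator, agglomerate_data, and process_agglomerates are hypothetical and do not correspond to the actual mfmg classes:

```cpp
#include <memory>
#include <vector>

// Hypothetical evaluator with per-agglomerate state.
struct Evaluator
{
  // Written by initialize_agglomerate() and read by evaluate_agglomerate();
  // sharing one instance between threads makes this a data race.
  std::vector<double> agglomerate_data;

  void initialize_agglomerate(int agglomerate_id)
  {
    agglomerate_data.assign(1, static_cast<double>(agglomerate_id));
  }

  double evaluate_agglomerate() const { return agglomerate_data.front(); }

  std::unique_ptr<Evaluator> clone() const
  {
    return std::make_unique<Evaluator>(*this);
  }
};

void process_agglomerates(Evaluator const &global_evaluator, int n_agglomerates)
{
  // The loop body may run on different threads; each agglomerate works on its
  // own deep copy, so initialize/evaluate never touch shared mutable state.
  for (int agg = 0; agg < n_agglomerates; ++agg)
  {
    auto local_evaluator = global_evaluator.clone();
    local_evaluator->initialize_agglomerate(agg);
    local_evaluator->evaluate_agglomerate();
  }
}
```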

@masterleinad (Collaborator, Author) commented:

Compared to #192, the timings for running with one thread are unaffected. These are the run times (in seconds) on my notebook:

# threads |  2x2 |  4x4
        1 | 37.3 | 20.5
        2 | 25.8 | 12.6
        4 | 13.4 |  8.9
        8 | 11.9 |  8.32

@masterleinad (Collaborator, Author) commented:

For the 4x4 patch and one thread, I get

| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |      2.01s |        10% |
| Apply: fine levels              |        74 |         2s |        10% |
| Setup                           |         1 |      14.8s |        75% |
| Setup: build restrictor         |         1 |      14.8s |        75% |

while 8 threads give me

| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |      1.92s |        23% |
| Apply: fine levels              |        74 |      1.91s |        23% |
| Setup                           |         1 |      3.96s |        48% |
| Setup: build restrictor         |         1 |      3.95s |        48% |

@masterleinad (Collaborator, Author) commented Jun 5, 2019

Using vectorization (AVX2) on top of that gives me (1 thread)

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      7.73s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |     0.557s |       7.2% |
| Apply: fine levels              |        74 |     0.553s |       7.2% |
| Setup                           |         1 |      6.88s |        89% |
| Setup: build restrictor         |         1 |      6.88s |        89% |
+---------------------------------+-----------+------------+------------+

and with 8 threads

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      2.73s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |       0.6s |        22% |
| Apply: fine levels              |        74 |     0.596s |        22% |
| Setup                           |         1 |      1.84s |        67% |
| Setup: build restrictor         |         1 |      1.83s |        67% |
+---------------------------------+-----------+------------+------------+

@masterleinad (Collaborator, Author) commented:

Using MPI instead of threads gives (1 MPI process)

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      8.89s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |     0.665s |       7.5% |
| Apply: fine levels              |        74 |     0.659s |       7.4% |
| Setup                           |         1 |      7.88s |        89% |
| Setup: build restrictor         |         1 |      7.87s |        89% |
+---------------------------------+-----------+------------+------------+

and with 8 MPI processes

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      2.12s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Apply                           |        74 |     0.349s |        16% |
| Apply: fine levels              |        74 |     0.331s |        16% |
| Setup                           |         1 |      1.62s |        76% |
| Setup: build restrictor         |         1 |      1.61s |        76% |
+---------------------------------+-----------+------------+------------+

@masterleinad (Collaborator, Author) commented:

Does anyone want to discuss a different solution?

@Rombur (Collaborator) commented Jun 6, 2019

@dalg24 said he was going to look at it

@@ -12,6 +12,7 @@
#include <mfmg/common/instantiation.hpp>
#include <mfmg/common/operator.hpp>
#include <mfmg/dealii/amge_host.hpp>
#include <mfmg/dealii/amge_host.templates.hpp>
Collaborator:

Add a comment that this is included for MatrixFreeAgglomerateOperator, and note that it would probably not be a bad idea to move the definition elsewhere.
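
One possible form of the suggested comment (the wording is a suggestion, not the actual change):

```cpp
// Included for the definition of MatrixFreeAgglomerateOperator; it would
// probably be better to move that definition elsewhere.
#include <mfmg/dealii/amge_host.templates.hpp>
```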

@@ -56,6 +56,16 @@ class DealIIMatrixFreeMeshEvaluator : public DealIIMeshEvaluator<dim>
*/
virtual ~DealIIMatrixFreeMeshEvaluator() override = default;

/**
* Create a deep copy of this class such that initializing on another
* agglomerate works.
Collaborator:

Please add a note that this was introduced because calls to the member functions matrix_free_initialize_agglomerate() and matrix_free_evaluate_agglomerate() (implemented in the user-provided class deriving from MatrixFreeMeshEvaluator) are not thread safe, and that it is not the only option to solve the problem.
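
One possible wording for that note, attached to the declaration quoted in the compiler output above (the exact doxygen text is a suggestion):

```cpp
/**
 * Create a deep copy of this class such that initializing on another
 * agglomerate works.
 *
 * This member was introduced because calls to
 * matrix_free_initialize_agglomerate() and matrix_free_evaluate_agglomerate()
 * (implemented in the user-provided class deriving from this one) are not
 * thread safe if a single evaluator is shared between agglomerates. Cloning
 * the evaluator per agglomerate is one way, though not the only one, to avoid
 * the resulting race conditions.
 */
virtual std::unique_ptr<DealIIMatrixFreeMeshEvaluator> clone() const = 0;
```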

@dalg24 merged commit 1beeef6 into ORNL-CEES:master on Jun 10, 2019
@masterleinad (Collaborator, Author) commented:

Thanks!

@masterleinad deleted the multithreading branch on June 10, 2019 19:05
@masterleinad mentioned this pull request on Jun 13, 2019