
Support UDP/TCP port forwarding to a host without setting up a tun #1179

Open
wants to merge 34 commits into base: master
Conversation


@cre4ture cre4ture commented Jul 14, 2024

This is intended to implement #1014

Build-Status twin PR: cre4ture#1

Usage is shown in the example config.yml:

# By using port forwarding (port tunnels) it's possible to establish connections
# from/into the nebula network without using a tun/tap device, and thus without requiring root
# access on the host. Port forwarding is only supported when the setting "tun.user" is true,
# i.e. when a user-space tun is used instead of a real one.
# IMPORTANT: For incoming tunnels, don't forget to also open the firewall for the relevant ports.
port_forwarding:
  outbound:
  # format of local and remote address: <host/ip>:<port>
  #- local_address: 127.0.0.1:3399
  #  remote_address: 192.168.100.92:4499
     # format of protocol lists (yml-list): [tcp], [udp], [tcp, udp]
  #  protocols: [tcp, udp]
  inbound:
  # format of forward_address: <host/ip>:<port>
  #- port: 5599
  #  forward_address: 127.0.0.1:5599
     # format of protocol lists (yml-list): [tcp], [udp], [tcp, udp]
  #  protocols: [tcp, udp]
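Uncommented, a minimal working setup based on the example above might look like this (the addresses and ports are the illustrative values from the comments):

```yaml
port_forwarding:
  outbound:
    - local_address: 127.0.0.1:3399
      remote_address: 192.168.100.92:4499
      protocols: [tcp, udp]
  inbound:
    - port: 5599
      forward_address: 127.0.0.1:5599
      protocols: [tcp, udp]
```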

So far I have only done some basic manual tests with netcat, which showed the expected behavior.
Further tests were done by @sybrensa and also by myself.

These points are still open:

  • Naming: "tunnel" vs. "forwarding", "ingoing" vs. "inbound", ... Please help me find the right terminology, as I'm not an expert in this topic.
  • Mutexes needed? I honestly didn't yet explicitly check for potential concurrent accesses. This was a TODO - DONE.
  • Clean shutdown - currently it produces some errors when shutting down. This was also a TODO - DONE.
  • Memory leaks - I think DONE.
  • Performance tuning - DONE, but could be improved further.
  • 2nd iteration of performance tuning - DONE, but could be improved further.
  • Reloadable (SIGHUP) port forwarding configuration.

I did file copy tests. I got half the throughput and around 4 times the CPU usage per transferred byte compared to the kernel-tun case.
I tried multiple different improvements in my code, but I fear it's actually more a limitation of the gvisor netstack :-/

Any help or idea is welcome.

Update: My file copy tests now achieve around 90% of the speed at 160% of nebula's CPU load. This is a significant improvement and acceptable (at least for my use cases). Thanks @akernet for the support here.


Thanks for the contribution! Before we can merge this, we need @cre4ture to sign the Salesforce Inc. Contributor License Agreement.

@akernet

akernet commented Jul 21, 2024

Thanks a ton for taking this project on!

I took a stab at this before finding your changes here, it's here if you want to take a look.

In particular e764eb adds a script that sets up a two device loopback tunnel (without root) and runs an iperf3 speed test.

I was able to improve throughput by ~7x by adding some buffering to the UserDevice (a16fdb) which you might want to try too. With this change it is running at 1.6 Gbit/s with a 1300 MTU which makes it quite usable to me at least.

I still think there are several improvements left on the table:

  • Re-use allocations in this buffering by using something like sync.Pool.
  • Or even better, avoid these copies altogether, although that would require some bigger changes in the rest of Nebula.
  • Depending on how the internal locking of gvisor works, you could probably gain some throughput by running several net stacks in parallel, each handling its own set of connections (distributing packets based on source address + destination port). This would not improve single-socket connections though.

@cre4ture
Author


Hey, cool. I could use some support, especially regarding performance and automated testing, but also with other Go specifics, as I'm rather inexperienced there.

I tried to do some performance profiling with the help of pprof.
The results point me to problems in gvisor itself, as far as I can read from this diagram:
[pprof profile graph]

It seems that it's doing some dynamic memory allocation, but from a first glance at the relevant source code I could not see the reason for it. Do you have an idea? I'm actually already failing to upgrade gvisor to the latest version; something is wrong such that it complains about multiple packages in a directory. But how could this go undiscovered by the maintainers of gvisor? I have the feeling that I'm doing something wrong... :-/

I will for sure try your idea with the buffered queues.
I didn't yet look at your branch in detail, but it seems we took a similar approach using the gvisor stack.
I would be glad if you could review my code. It seems the maintainers of the repo have been busy with other stuff so far. ;-)

@johnmaguire
Collaborator

Hi @cre4ture - this PR looks very interesting! I wanted to let you know that the maintainers best suited to review this PR are currently working on #6 which requires some major rework of Nebula, so it will probably be a bit before we are able to dig in. That being said, thanks for the contribution!

@cre4ture
Author

cre4ture commented Jul 22, 2024

@akernet I added your test and afterwards also the performance improvement via cherry-picking. Hope this is fine for you :-)

Your performance improvement led to a 6x speedup for the test in your test-script commit on my hardware. I will do a further test using it on a real-world example (the file-copy test with two separate machines) to confirm the improvement. I'm a bit concerned because the pprof profiling pointed me in a somewhat different direction; at least that is how I interpret it right now. I will let you know.

@cre4ture
Author

cre4ture commented Jul 22, 2024

@akernet

it seems to me that the improvement with the buffered pipes has no impact on my testing scenario with the file copy between two different machines. I'm using a "private cloud" server storage called "garage", mounted via "rclone". The two machines are my laptop and a NUC PC ("server"), both connected via ethernet cable (1 GBit/s) to my local network. I use the official release binary on the laptop and swap out the binary on the server side for testing purposes. I measure the throughput rate (nautilus file copy) on the laptop and the CPU load on the server side. These are my results:

1. test case: official release binary - kernel tun:
95.5 MB/s
CPU load on server side:
88 % nebula
71 % garage

2. test case: local build with buffered pipes - gvisor tun+stack:
52.8 MB/s
CPU load on server side:
197 % nebula
34 % garage

3. test case: local build without buffered pipes - gvisor tun+stack:
52.3 MB/s
CPU load on server side:
195 % nebula
35 % garage

4. test case: local build with buffered pipes - but kernel tun (as in test case 1):
99.5 MB/s
CPU load on server side:
94 % nebula
72 % garage

The last test demonstrates that the locally compiled binary has performance comparable to the release binary.
What exactly this means is not yet clear to me. It seems that the buffered pipes have no significant impact in this test scenario, but I can't reason about why.

In general, the userspace tun/stack has an impact of a factor of 4, which results from combining the 2x CPU load with half the throughput rate.

@akernet

akernet commented Jul 23, 2024

I would be glad if you could review my code. It seems the maintainers of the repo have been busy with other stuff so far. ;-)

I'm also new to Go but I'll try to find some time in the next few days! :)

I added your test and afterwards also the performance improvement via cherry-picking. Hope this is fine for you :-)

Ofc!

I'm using a "private cloud" server storage called "garage"

Cool, I've been looking at garage for the past weeks so nice to see that you are using it with Nebula!

It seems that the buffered pipes have no significant impact in this test scenario. But I can't reason about it.

Yeah, this is a bit strange. How is the overall system load, is it close to max? The buffering should not make Nebula more CPU-efficient; in fact it's probably going to be more costly due to the copying. What it does is decouple the outside and inside (gvisor) parts, allowing them to run in parallel. If the system is already at its limits, this is unlikely to help, however. I'll try to do some benchmarking in CPU-limited scenarios too; gvisor is always going to use some additional resources, but the 66% in your pprof run seems a bit high.

@cre4ture
Author

cre4ture commented Jul 26, 2024

Update regarding performance:
Yesterday I achieved a significant improvement for my file-copy test scenario:

"performance 5":

89.8 MB/s
CPU load on server side:
163 % nebula
63 % garage

This brings the copy speed to almost the reference (99 MB/s), with only a 10% difference remaining.
The CPU load of nebula is now only about +60%.

So overall we are at around 175% of the reference, which is significant compared to the initial ~400%.

@akernet I think the main difference comes from a further improvement on top of your "buffering to UserDevice" tuning. I achieved it by additionally avoiding data-copy and dynamic allocation steps (I think).


@akernet akernet left a comment


Really nice! Cool that you got the buffer reuse to work

cmd/nebula/main.go Outdated
// It's there to avoid a fresh dynamic memory allocation of 1 byte
// each time it's used.
var BYTE_SLICE_ONE []byte = []byte{1}


I would suggest moving this to a separate PR since it affects normal configs too

examples/config.yml Outdated
examples/config.yml Outdated
examples/config.yml Outdated
port-forwarder/fwd_tcp.go Outdated
port-forwarder/fwd_tcp.go Outdated
default:
}

rn, r_err := from.Read(buf)

I wonder if this could be replaced with io.Copy()

Author

It can. I tried this once, but it didn't improve performance, so I moved on to other experiments. Now that we have solved the issue to a large extent, I will introduce it again, as it simplifies the code. Thanks for pointing it out.

Author

I added a variant with io.Copy().
Problem: it seems to slightly decrease performance, so I'm wondering whether I should keep the original implementation instead.

port-forwarder/fwd_tcp.go Outdated
port-forwarder/fwd_tcp.go Outdated

Thanks for the contribution! Before we can merge this, we need @akernet to sign the Salesforce Inc. Contributor License Agreement.

@cre4ture cre4ture force-pushed the feature/try_with_gvisor_stack branch 3 times, most recently from 9718675 to 3a03eb9 Compare August 2, 2024 21:34
@cre4ture cre4ture requested a review from akernet August 3, 2024 09:48
@cre4ture
Author

cre4ture commented Aug 3, 2024

@akernet please re-review. And it seems you also need to sign the CLA, as I cherry-picked from your branch to honor your contribution. :-)

@cre4ture
Author

cre4ture commented Aug 13, 2024

Hi @cre4ture - I don't believe that CI runs the same test in parallel multiple times. There is a test matrix which runs the tests on various OS in individual containers - so ports should not conflict. Looking at the failed CI, it appears it succeeded on windows-latest, macos-latest, linux with boringcrypto, but failed on ubuntu-linux. Oftentimes this is indicative of some kind of intermittent test failure (i.e. flaky test, maybe a race condition.)

It seems that the tests get messed up due to using the same port numbers.

I admittedly only took a very quick look, but I didn't find a log message that suggested a requested port was in use. Perhaps I missed it. Can you share what you're looking at?

@johnmaguire I'm also not fully sure about this. An error message like you suggest would make it clear. But otherwise I can't explain why the failing test complains about non-matching ca-certificates.
What I'm looking at is this:

Invalid certificate from host" error="certificate validation failed: could not find ca for the certificate

and also this:

"Invalid certificate from host" error="certificate validation failed: certificate is expired"

But for the second one, I meanwhile have an explanation: the generated certificate for the test services is only valid for 5 minutes. As the relevant test runs into a timeout after 10 minutes, it's clear that this error occurs after 5 minutes.

Still, I can't explain the first error. The ca key pair is generated freshly each time the test runs. From the ca key pair, the signed service key pair and certificate are generated anew each time. So it should not be possible for these not to match. Except if one considers that tests run in parallel and accidentally connect to the wrong test partner.

But I'm open to other ideas. Any help is welcome.

@johnmaguire
Collaborator

@cre4ture I'm not sure off-hand. But the fact that it succeeded on most platforms and failed on one again makes me wonder if it's a flaky test. I probably would've tried re-running the CI w/o changes to see whether it failed on a repeat run. If not, then we need to think about whether there's any timing issue that could affect that test. (I haven't looked at the test code really, but just as an example, maybe the updated CA bundle hasn't finished writing to disk when the test starts. Or maybe there was a silent error doing so?)

@johnmaguire
Collaborator

FYI we recently merged #1181 which caused conflicts. One conflict is caused by returning *gonet.TCPConn in lieu of net.Conn. Is it necessary to return the former, or can we continue to return the interface? Thanks!

…isor_stack

# Conflicts:
#	examples/go_service/main.go
#	service/service.go
@cre4ture
Author

FYI we recently merged #1181 which caused conflicts. One conflict is caused by returning *gonet.TCPConn in lieu of net.Conn. Is it necessary to return the former, or can we continue to return the interface? Thanks!

I merged the changes from main. It seems that the interface is OK.
Please tell me if you prefer a rebase and/or a squash commit/merge.

udp/udp_linux.go Outdated
@@ -315,6 +321,10 @@ func (u *StdConn) getMemInfo(meminfo *[unix.SK_MEMINFO_VARS]uint32) error {

func (u *StdConn) Close() error {
//TODO: this will not interrupt the read loop
Author

@cre4ture cre4ture Sep 14, 2024

@johnmaguire I think I have the test stable now, at least on Linux. The issue was an unclean shutdown of the nebula service at the end of each test, which caused confusion in the following tests.

The change here in this file solved the issue.
Now that I have found the issue, I will check in the next days whether there are some sleeps to clean up.

return s.eg.Wait()
err := s.eg.Wait()

s.ipstack.Destroy()
Author

@cre4ture cre4ture Sep 15, 2024

Adding this line made the Windows tests stable. They now run 200+ times in a row without issue.

@@ -323,6 +329,14 @@ func (u *RIOConn) Close() error {
windows.PostQueuedCompletionStatus(u.rx.iocp, 0, 0, nil)
windows.PostQueuedCompletionStatus(u.tx.iocp, 0, 0, nil)

u.rx.mu.Lock() // for waiting till active reader is done
Author

I had to run the test 1000 times to get a stable reproduction on my Windows 11 machine. It is fixed with this change.

@cre4ture
Author

@johnmaguire after extensive testing and a few findings, I dare to claim that the tests are now stable on Windows and Linux. Can you please approve another workflow run on this PR? And have a look at what still needs to be done to merge this?

Successfully merging this pull request may close these issues.

Feature request: Support UDP/TCP port forwarding to a host without setting up a tun
3 participants