
NBD integration into controller and replica #1109

Open · Toutou98 wants to merge 10 commits into master

Conversation

Toutou98 commented

Which issues this PR references

Issues:

  • longhorn/longhorn#6590 (comment)
  • longhorn/longhorn#5002 (comment)
  • longhorn/longhorn#5374 (comment)

What this PR does

This PR integrates the Network Block Device (NBD) protocol as an option for both the kernel-to-controller path (instead of iSCSI) and the controller-to-replica path (instead of the custom Longhorn engine protocol). NBD is a well-established protocol, available prebuilt in most Linux distributions, and its current implementation supports multiple concurrent connections, which this PR takes advantage of.
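
For reference, attaching an NBD device with multiple connections from the client side looks roughly like this (the export name, address, device node, and connection count are illustrative; -connections requires a reasonably recent nbd-client and kernel):

sudo nbd-client -N vol-name 127.0.0.1 /dev/nbd0 -connections 16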

The benefit is increased performance, especially at the frontend. In my tests, on fairly recent PC hardware, I found tgtd to be a major bottleneck, limiting frontend (kernel-to-controller) R/W IOPS to ~50k, whereas with NBD and multiple concurrent connections this number increases almost 10-fold. Backend (controller-to-replica) performance improves similarly, although there is probably still considerable room for improvement inside the controller. The test setup and performance results are given below.

This PR includes all code changes, including additional parameters to the engine binaries to enable NBD and control the number of concurrent connections, as well as the necessary additions to the produced container image.

Test setup and results

Hardware configuration:

  • CPU: AMD Ryzen 7 5700X @ 3.4 GHz
  • DRAM: 32 GB DDR4, 3600 MT/s
  • Disk: 10 GB ramdisk
  • OS: Ubuntu 22.04
  • Kernel: Linux 6.5.0-14-generic

Three series of tests were conducted, using fio to drive load against the Longhorn device:

  • Frontend test to measure the performance up to the controller
  • Dataconn test to measure the performance up to the replica (including the network overhead)
  • End to end test to measure the performance when using a disk (a ramdisk was used to factor out device speed)

Below is a table of results; the best NBD numbers were achieved with 16 connections.

[Image: benchmark results table]

fio configuration file:

[global]
group_reporting
ramp_time=5
time_based=1
runtime=60
direct=1
ioengine=libaio
filename=/dev/longhorn/vol-name

[iops-write]
gtod_reduce=1
blocksize=4K
iodepth=16
numjobs=16
rw=randwrite

[iops-read]
stonewall
wait_for=iops-write
gtod_reduce=1
blocksize=4K
iodepth=16
numjobs=16

Notes for reviewer

To replicate the results, you need libnbd-dev/libnbd-devel (v1.13.1) and nbd-client installed on the host system. The nbd kernel module must also be available in the kernel (it is on Ubuntu 22.04).
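
If the nbd module is not loaded automatically, it can be loaded manually; nbds_max is the module parameter controlling how many /dev/nbdX nodes are created (16 here is illustrative):

sudo modprobe nbd nbds_max=16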

I have modified the launch-simple-longhorn script to use NBD at both the frontend and backend, with a configurable number of parallel connections. The command used to run the engine with 16 frontend and 16 backend connections is:

sudo docker run --privileged --net=host -v /lib/modules/<kernel>:/lib/modules/<kernel> -v /dev:/host/dev -v /proc:/host/proc -v /volume longhornio/longhorn-engine:<tag> launch-simple-longhorn vol-name 10g nbd 16 16
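
Based on the description above, the trailing arguments are the volume name, the volume size, the frontend type, the number of frontend connections, and the number of backend connections.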

Additional information or context

Frontend implementation details:

  • The client side uses the nbd kernel module that Linux already provides.
  • The server side, within the controller, uses a modified version of go-nbd (modified for concurrency); a minimal sketch of the upstream serving pattern is shown after this list.
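
The sketch below shows only the upstream go-nbd library's basic serving pattern (API names from pojntfx/go-nbd as I understand them); it is an approximation for orientation, not the PR's actual modified code:

package main

import (
	"log"
	"net"
	"os"

	"github.com/pojntfx/go-nbd/pkg/backend"
	"github.com/pojntfx/go-nbd/pkg/server"
)

func main() {
	// Open the file backing the export (path is illustrative).
	f, err := os.OpenFile("/var/lib/longhorn/vol-name.img", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	l, err := net.Listen("tcp", ":10809") // default NBD port
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()

	for {
		conn, err := l.Accept()
		if err != nil {
			continue
		}

		// One goroutine per connection; with the kernel client's
		// multi-connection support, several of these serve one device.
		go func(c net.Conn) {
			defer c.Close()

			if err := server.Handle(c, []*server.Export{{
				Name:    "vol-name",
				Backend: backend.NewFileBackend(f),
			}}, &server.Options{
				ReadOnly:           false,
				MinimumBlockSize:   512,
				PreferredBlockSize: 4096,
				MaximumBlockSize:   4096 * 1024,
			}); err != nil {
				log.Println("client disconnected:", err)
			}
		}(conn)
	}
}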

Backend implementation details:

  • The client side uses libnbd (development version); an illustrative read using the upstream Go bindings is sketched after this list.
  • The server side uses the same go-nbd-based implementation as the controller.
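
For the backend client side, a minimal read using libnbd's upstream Go bindings (libguestfs.org/libnbd) might look like the following; the URI and sizes are illustrative, and the PR may drive libnbd differently:

package main

import (
	"fmt"
	"log"

	"libguestfs.org/libnbd"
)

func main() {
	h, err := libnbd.Create()
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	// Connect to the replica's NBD server and export.
	if err := h.ConnectUri("nbd://127.0.0.1:10809/vol-name"); err != nil {
		log.Fatal(err)
	}

	// Read 4 KiB at offset 0.
	buf := make([]byte, 4096)
	if err := h.Pread(buf, 0, nil); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes\n", len(buf))
}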


mergify bot commented May 10, 2024

This pull request is now in conflict. Could you fix it @Toutou98? 🙏


This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions bot added the stale label Jun 25, 2024
derekbit (Member) commented

What we can do for the feature later

  • Compare the performance of new data paths based on NBD and UBLK.
  • Analyze the advantages and disadvantages of each data path and their use cases.

cc @shuo-wu @PhanLe1010 @c3y1huang @mantissahz @Vicente-Cheng @WebberHuang1118 @innobead

github-actions bot removed the stale label Jun 26, 2024

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions bot added the stale label Aug 10, 2024

This PR was closed because it has been stalled for 10 days with no activity.

github-actions bot closed this Aug 21, 2024
PhanLe1010 (Contributor) commented

Reopening. I am actively investigating this PR.

PhanLe1010 reopened this Aug 23, 2024

mergify bot commented Aug 23, 2024

This pull request is now in conflict. Could you fix it @Toutou98? 🙏

github-actions bot removed the stale label Aug 24, 2024