
[BUG] Spark Operator Lock identity is empty while HA #2063

Open

tankim opened this issue Jun 15, 2024 · 3 comments

tankim commented Jun 15, 2024

Description

Running the Spark Operator with more than one replica (for HA) causes every operator pod to crash on startup with the fatal error "Lock identity is empty".

  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code [Required]

Steps to reproduce the behavior:

  • Set replicaCount in the Helm chart values to anything higher than 1:
replicaCount: 2

Expected behavior

  • An additional operator pod launches and participates in leader election.

Actual behavior

  • Both operator pods crash on startup with the following fatal error.

Terminal Output Screenshot(s)

+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ echo 0
+ echo 0
+ echo root:x:0:0:root:/root:/bin/bash
0
0
root:x:0:0:root:/root:/bin/bash
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator -v=4 -logtostderr -namespace= -enable-ui-service=true -ingress-url-format= -controller-threads=600 -resync-interval=30 -enable-batch-scheduler=false -label-selector-filter= -enable-metrics=true -metrics-labels=app_type -metrics-port=10254 -metrics-endpoint=/metrics -metrics-prefix= -enable-webhook=true -webhook-svc-namespace=dataplatform-common-dev -webhook-port=8080 -webhook-timeout=30 -webhook-svc-name=spark-operator-webhook -webhook-config-name=spark-operator-webhook-config -webhook-namespace-selector=spark-webhook-enabled=true -enable-resource-quota-enforcement=false -leader-election=true -leader-election-lock-namespace=dataplatform-common-dev -leader-election-lock-name=spark-operator-lock
F0615 02:58:37.044201      10 main.go:146] Lock identity is empty

goroutine 1 [running]:
github.com/golang/glog.Fatal(...)
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:664
main.main()
	/workspace/main.go:146 +0x1418

SIGABRT: abort
PC=0x40708e m=2 sigcode=18446744073709551610

goroutine 1 gp=0xc0000061c0 m=2 mp=0xc000092808 [running, locked to thread]:
runtime/internal/syscall.Syscall6()
	/usr/local/go/src/runtime/internal/syscall/asm_linux_amd64.s:36 +0xe fp=0xc0004cba88 sp=0xc0004cba80 pc=0x40708e
syscall.RawSyscall6(0xc00034e038?, 0xc0006a0120?, 0xc00060c060?, 0x2be5440?, 0x548220?, 0x2be54d8?, 0xc0004cbaf0?)
	/usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 +0xd fp=0xc0004cbad0 sp=0xc0004cba88 pc=0x40706d
syscall.RawSyscall(0x2be54d8?, 0x0?, 0xc0004cbb70?, 0xc0004cbb50?)
	/usr/local/go/src/syscall/syscall_linux.go:62 +0x15 fp=0xc0004cbb18 sp=0xc0004cbad0 pc=0x48a8f5
syscall.Tgkill(0xba?, 0x0?, 0x0?)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:894 +0x25 fp=0xc0004cbb48 sp=0xc0004cbb18 pc=0x488aa5
github.com/golang/glog.abortProcess()
	/go/pkg/mod/github.com/golang/[email protected]/glog_file_linux.go:35 +0x87 fp=0xc0004cbb90 sp=0xc0004cbb48 pc=0x548387
github.com/golang/glog.ctxfatalf({0x0?, 0x0?}, 0xc000280110?, {0x1b8f1eb?, 0x411d65?}, {0xc000280110?, 0x185ca80?, 0xc000328201?})
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:647 +0x6a fp=0xc0004cbbf8 sp=0xc0004cbb90 pc=0x54606a
github.com/golang/glog.fatalf(...)
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:657
github.com/golang/glog.FatalDepth(0x1, {0xc000280110, 0x1, 0x1})
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:670 +0x57 fp=0xc0004cbc48 sp=0xc0004cbbf8 pc=0x5461f7
github.com/golang/glog.Fatal(...)
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:664
main.main()
	/workspace/main.go:146 +0x1418 fp=0xc0004cbf50 sp=0xc0004cbc48 pc=0x172f418
runtime.main()
	/usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc0004cbfe0 sp=0xc0004cbf50 pc=0x4404fd
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004cbfe8 sp=0xc0004cbfe0 pc=0x473721

Environment & Versions

  • Spark Operator App version: v1beta2-1.4.6-3.5.0
  • Helm Chart Version: 1.2.15
  • Kubernetes Version: 1.28
  • Apache Spark version: 3.5.0

Additional context

@yuchaoran2011
Contributor

Honestly, I don't see a need to run multiple replicas for HA purposes. The Kubernetes Deployment controller essentially provides HA out of the box: if the operator pod dies, the Deployment recreates it.

@tankim
Author

tankim commented Jun 15, 2024

I fixed this by upgrading to Helm chart version 1.4.0.

@tankim
Author

tankim commented Jun 15, 2024

> Honestly, I don't see a need to run multiple replicas for HA purposes. The Kubernetes Deployment controller essentially provides HA out of the box.

In our current workload, tens to hundreds of Spark applications are triggered simultaneously, and that number may grow to thousands in the future. If the single operator pod becomes unstable under that load, no new applications are processed until it recovers, so we believe an HA setup is necessary for stable operation (aiming for zero downtime). How much this matters will vary with the specific issues each deployment is facing.
