
[BUG] Spark Operator Lock identity is empty while HA #2063

Open

tankim opened this issue Jun 15, 2024 · 3 comments

tankim commented Jun 15, 2024

Description

Running the Spark Operator with more than one replica (for HA) causes every operator pod to crash on startup with the fatal error "Lock identity is empty".

  • [x] ✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code [Required]

Steps to reproduce the behavior:

  • Set replicaCount in the Helm chart values to anything higher than 1:
replicaCount: 2

Expected behavior

  • An additional operator pod launches and participates in leader election.

Actual behavior

  • Both operator pods crash on startup with the following fatal error.

Terminal Output Screenshot(s)

+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ echo 0
+ echo 0
+ echo root:x:0:0:root:/root:/bin/bash
0
0
root:x:0:0:root:/root:/bin/bash
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator -v=4 -logtostderr -namespace= -enable-ui-service=true -ingress-url-format= -controller-threads=600 -resync-interval=30 -enable-batch-scheduler=false -label-selector-filter= -enable-metrics=true -metrics-labels=app_type -metrics-port=10254 -metrics-endpoint=/metrics -metrics-prefix= -enable-webhook=true -webhook-svc-namespace=dataplatform-common-dev -webhook-port=8080 -webhook-timeout=30 -webhook-svc-name=spark-operator-webhook -webhook-config-name=spark-operator-webhook-config -webhook-namespace-selector=spark-webhook-enabled=true -enable-resource-quota-enforcement=false -leader-election=true -leader-election-lock-namespace=dataplatform-common-dev -leader-election-lock-name=spark-operator-lock
F0615 02:58:37.044201      10 main.go:146] Lock identity is empty

goroutine 1 [running]:
github.com/golang/glog.Fatal(...)
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:664
main.main()
	/workspace/main.go:146 +0x1418

SIGABRT: abort
PC=0x40708e m=2 sigcode=18446744073709551610

goroutine 1 gp=0xc0000061c0 m=2 mp=0xc000092808 [running, locked to thread]:
runtime/internal/syscall.Syscall6()
	/usr/local/go/src/runtime/internal/syscall/asm_linux_amd64.s:36 +0xe fp=0xc0004cba88 sp=0xc0004cba80 pc=0x40708e
syscall.RawSyscall6(0xc00034e038?, 0xc0006a0120?, 0xc00060c060?, 0x2be5440?, 0x548220?, 0x2be54d8?, 0xc0004cbaf0?)
	/usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 +0xd fp=0xc0004cbad0 sp=0xc0004cba88 pc=0x40706d
syscall.RawSyscall(0x2be54d8?, 0x0?, 0xc0004cbb70?, 0xc0004cbb50?)
	/usr/local/go/src/syscall/syscall_linux.go:62 +0x15 fp=0xc0004cbb18 sp=0xc0004cbad0 pc=0x48a8f5
syscall.Tgkill(0xba?, 0x0?, 0x0?)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:894 +0x25 fp=0xc0004cbb48 sp=0xc0004cbb18 pc=0x488aa5
github.com/golang/glog.abortProcess()
	/go/pkg/mod/github.com/golang/[email protected]/glog_file_linux.go:35 +0x87 fp=0xc0004cbb90 sp=0xc0004cbb48 pc=0x548387
github.com/golang/glog.ctxfatalf({0x0?, 0x0?}, 0xc000280110?, {0x1b8f1eb?, 0x411d65?}, {0xc000280110?, 0x185ca80?, 0xc000328201?})
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:647 +0x6a fp=0xc0004cbbf8 sp=0xc0004cbb90 pc=0x54606a
github.com/golang/glog.fatalf(...)
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:657
github.com/golang/glog.FatalDepth(0x1, {0xc000280110, 0x1, 0x1})
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:670 +0x57 fp=0xc0004cbc48 sp=0xc0004cbbf8 pc=0x5461f7
github.com/golang/glog.Fatal(...)
	/go/pkg/mod/github.com/golang/[email protected]/glog.go:664
main.main()
	/workspace/main.go:146 +0x1418 fp=0xc0004cbf50 sp=0xc0004cbc48 pc=0x172f418
runtime.main()
	/usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc0004cbfe0 sp=0xc0004cbf50 pc=0x4404fd
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004cbfe8 sp=0xc0004cbfe0 pc=0x473721

Environment & Versions

  • Spark Operator App version: v1beta2-1.4.6-3.5.0
  • Helm Chart Version: 1.2.15
  • Kubernetes Version: 1.28
  • Apache Spark version: 3.5.0

Additional context

@yuchaoran2011
Contributor

Honestly, I don't see a need to run multiple replicas for HA purposes. The Kubernetes Deployment controller essentially provides HA out of the box: if the operator pod dies, the Deployment recreates it.

@tankim
Author

tankim commented Jun 15, 2024

I fixed this by upgrading to Helm chart version 1.4.0.

@tankim
Author

tankim commented Jun 15, 2024

> Honestly, I don't see a need to run multiple replicas for HA purposes. The Kubernetes Deployment controller essentially provides HA out of the box.

In our current workload, tens to hundreds of Spark applications are triggered simultaneously, and that number may grow to thousands in the future. If the single operator pod becomes unstable under that load, no new applications are processed until it recovers, so we believe an HA setup is necessary for stable operation (aiming for zero downtime). How much this matters will vary with the specific issues each deployment is facing.
