The node was low on resource: ephemeral-storage. Container spark-kubernetes-driver was using 48136Ki, which exceeds its request of 0. #1546
Ephemeral storage is usually limited by the amount of local disk on the node. Check that.
@jkleckner Do you mean that there isn't any free space left on the storage of the node that is running the pod? Thanks.
It can be either running out of space on the node or the request/limit for your pod. I have seen situations where we needed to stop a pod from using all local (ephemeral) storage and affecting other pods. Check your monitoring of this resource. See [1] for a definition of ephemeral storage as local node-attached storage, and [2] for how you can limit (or not) the amount used by a pod. [1] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#local-ephemeral-storage
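As a sketch of the mechanism described in [1], ephemeral-storage requests and limits on a plain (non-operator) pod look like this; the pod name, container name, and sizes are illustrative, not values from this thread:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-demo            # illustrative name
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          ephemeral-storage: "2Gi"  # scheduler only places the pod on a node with this much free local storage
        limits:
          ephemeral-storage: "4Gi"  # kubelet evicts the pod if its local writable layer + emptyDirs + logs exceed this
```

With a request of 0 (the default, as in the eviction message above), the scheduler reserves nothing, so the pod is among the first candidates for eviction when the node runs low.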
Also note this cryptic comment that the
I look at the local (ephemeral) disk usage on executor nodes and try to keep its daily peak below 15% or so for headroom. Also, on GKE, the IOPS are proportional to the volume size, so over-provisioning the volume is also a throughput adjustment. You can observe throttled I/O to see whether it is a bottleneck.
Thanks for your response; I know how to limit resources on Pods, Deployments, and other Kubernetes objects. I have tried the following:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: synonym-data-gathering-scheduled
  namespace: spark
spec:
  resources:
    limits:
      - default:
          ephemeral-storage: 1Gi
        defaultRequest:
          ephemeral-storage: 1Gi
  suspend: true
  schedule: "@every 1m"
  concurrencyPolicy: Allow
  successfulRunHistoryLimit: 5
  failedRunHistoryLimit: 3
  template:
    deps:
      packages:
        - 'io.delta:delta-core_2.12:1.0.0'
        - 'org.apache.hadoop:hadoop-hdfs-client:3.3.0'
    type: Python
    pythonVersion: '3'
    mode: cluster
    image: "<CICD_IMAGE_PLACEHOLDER>"
    imagePullPolicy: IfNotPresent
    mainApplicationFile: "local:///app/src/synonyms/1_data_gathering.py"
    sparkVersion: "3.1.1"
    restartPolicy:
      type: Never
    volumes: [
      #<CICD_K8S_VOLUMES_PLACEHOLDER>
    ]
    driver:
      javaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
      volumeMounts: [
        #<CICD_K8S_VOLUME_MOUNTS_PLACEHOLDER>
      ]
      cores: 4
      memory: "12g"
      labels:
        version: 3.1.1
      serviceAccount: spark-app
    executor:
      javaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
      volumeMounts: [
        #<CICD_K8S_VOLUME_MOUNTS_PLACEHOLDER>
      ]
      cores: 3
      instances: 1
      memory: "3g"
      labels:
        version: 3.1.1
```

and faced the following error:
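For what it's worth, the `resources.limits` block in the manifest above resembles a Kubernetes `LimitRange` spec rather than any `ScheduledSparkApplication` field, which may be why the operator ignores it. As a hedged sketch (the object name is illustrative), a namespace-wide default for ephemeral storage would normally live in its own object:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-defaults   # illustrative name
  namespace: spark
spec:
  limits:
    - type: Container
      default:
        ephemeral-storage: 1Gi       # applied as the limit when a container sets none
      defaultRequest:
        ephemeral-storage: 1Gi       # applied as the request when a container sets none
```

A `LimitRange` applies to every container created in the namespace, including the driver and executor pods the operator launches, so it may be a workaround when the CRD itself does not expose these fields.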
I don't know how this operator is implemented, or whether we can do everything in memory without using disk.
Yes, I noticed that the operator doesn't seem to surface the ephemeral storage request fields. It was a non-Spark pod on which I had needed to set a limit, so this is not a limitation of this operator. Since the default is no limit, I think you need to look at where your Spark pods are creating the "spill storage"; it is most likely that your node-local filesystems are filling up. If your clusters have some monitoring of node storage, look at that when this error occurs to confirm it. If confirmed, the simplest remedy is to increase the amount of node-local storage in your cluster. I use a node pool strictly for executors, configured differently and with more local storage than the general node pool, and use taints to schedule executors on that node pool.
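A minimal sketch of that dedicated-node-pool setup, assuming a node pool labeled `pool=spark-executors` and tainted with `dedicated=spark-executor:NoSchedule` (the label and taint key/values are illustrative, not from this thread); the operator's `executor` spec accepts `nodeSelector` and `tolerations` like a regular pod spec:

```yaml
# Assumed taint on the executor node pool (applied out of band, e.g.:
#   kubectl taint nodes -l pool=spark-executors dedicated=spark-executor:NoSchedule)
executor:
  nodeSelector:
    pool: spark-executors            # illustrative node-pool label
  tolerations:
    - key: dedicated                 # must match the node taint
      operator: Equal
      value: spark-executor
      effect: NoSchedule
```

The taint keeps general workloads off the storage-heavy nodes, and the toleration plus selector steers only executors onto them.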
As an aside, I don't want to limit the ephemeral storage of my executors via Kubernetes. You do that indirectly by controlling the compute graph, caching, and the number/size/memory of executors so that the spill storage remains limited. 1GiB of ephemeral storage is very small for a computation worthy of using Spark.
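As a hedged sketch of that indirect control, knobs like these in the application's `sparkConf` (a real field of the operator's CRD; the values here are illustrative, not recommendations) influence how much Spark spills to local disk:

```yaml
spec:
  sparkConf:
    "spark.sql.shuffle.partitions": "200"    # partition count shapes per-task shuffle file sizes
    "spark.memory.fraction": "0.6"           # share of heap for execution/storage before spilling to disk
    "spark.local.dir": "/tmp/spark-scratch"  # where spill and shuffle files are written
  executor:
    instances: 4       # more executors spread spill across more nodes
    memory: "6g"       # more memory per executor reduces spill volume
```

None of these caps disk usage directly; they reduce or relocate spill so the node-local filesystems stay within headroom.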
I had a similar error. My analysis: pods on the same k8s node share the ephemeral storage, which (if no special configuration is used) Spark uses to store temporary data for Spark jobs (disk spill and shuffle data). The amount of ephemeral storage on a node is basically the size of the available storage on your k8s node. If some executor pods use up all of a node's ephemeral storage, other pods will fail when they try to write to it. In your case the failing pod is the driver pod, but it could have been any other pod on that node; in my case it was an executor that failed with a similar error message. I would try to optimize the Spark code first before changing the deployment configuration.
If you know upfront the amount of storage each executor requires, maybe you can try to set the resource requests (and not limits) for ephemeral storage to the right amount.
I also faced this issue. I increased the size of the /run mount path where the pod actually runs and set the ephemeral-storage quota in the deployment, which resolved the issue; however, I don't know whether it's the right solution.
Hi @mostafaghadimi @hiendang @jkleckner,
Hi! I hoped the error below would disappear, but it keeps happening after the upgrade.
Updating the chart to the latest version does not update the metadata included in #1661? Am I missing something, or is the problem not fixed yet?
Hi,
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it. |
We are experiencing this error:

> The node was low on resource: ephemeral-storage. Container spark-kubernetes-driver was using 48136Ki, which exceeds its request of 0.

and found the pod evicted. I have two questions: should we set an ephemeral-storage limit in order to get rid of the error?