
The node was low on resource: ephemeral-storage. Container spark-kubernetes-driver was using 48136Ki, which exceeds its request of 0. #1546

Closed
mostafaghadimi opened this issue Jun 8, 2022 · 15 comments

Comments

@mostafaghadimi

mostafaghadimi commented Jun 8, 2022

We are experiencing the error "The node was low on resource: ephemeral-storage. Container spark-kubernetes-driver was using 48136Ki, which exceeds its request of 0." and found the pod evicted. I have two questions:

  1. The pod is using only a small amount of its assigned memory. Is there any way to prevent the pod from writing data to storage while it still has free memory?
  2. How do I set an ephemeral-storage limit in order to get rid of the error?
@jkleckner
Contributor

Ephemeral storage is usually limited by the amount of local disk on the node. Check that.

@mostafaghadimi
Author

@jkleckner
Hi Jim 👋🏻,

You mean that there isn't any free space left on the storage of the node that is running the pod?
Would you please give me more details about finding where this problem comes from and how I can resolve it?

Thanks.

@jkleckner
Contributor

It can be either the node running out of space or the request/limit set on your pod.

I have seen situations where we needed to limit a pod from using all local (ephemeral) storage and affecting other pods.

Check your monitoring of this resource.

See [1] for a definition of ephemeral storage as local node attached storage.

See [2] for how you can limit (or not) the amount used by a pod.

[1] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#local-ephemeral-storage
[2] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage
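
For reference, the plain-Kubernetes form of those fields from [2] looks roughly like this (the pod name and image below are placeholders, not anything from this thread):

apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-demo          # placeholder name
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9  # placeholder image
      resources:
        requests:
          ephemeral-storage: "2Gi"      # scheduler only places the pod on a node with this much free local storage
        limits:
          ephemeral-storage: "4Gi"      # kubelet evicts the pod if writable layer + emptyDirs + logs exceed this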

@jkleckner
Contributor

Also note this cryptic comment [1] that the SPARK_WORKER_DIR can be used to alter the volume where ephemeral data is stored.

[1] apache/spark@71fc113
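
For Spark on Kubernetes specifically, a related knob is where the scratch directories live. A sketch only, assuming the spark-local-dir-* volume-name convention from the Spark on Kubernetes docs and the operator's sparkConf pass-through (the /spark-scratch path and 20Gi size are made-up values; verify against your Spark version):

spec:
  template:
    sparkConf:
      # Spark-on-K8s treats volumes named spark-local-dir-* as local scratch/spill space.
      "spark.kubernetes.driver.volumes.emptyDir.spark-local-dir-1.mount.path": "/spark-scratch"
      "spark.kubernetes.driver.volumes.emptyDir.spark-local-dir-1.options.sizeLimit": "20Gi"
      "spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-1.mount.path": "/spark-scratch"
      "spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-1.options.sizeLimit": "20Gi"

Note that an emptyDir still lives on the node's local disk, so this mainly helps by capping the size or by backing the directory with something other than the node's root disk.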

@jkleckner
Contributor

I look at the local (ephemeral) disk usage on executor nodes and try to keep its peak during a day below roughly 15% for headroom.

Also, on GKE, IOPS are proportional to the volume size, so over-provisioning the volume is also a throughput adjustment.

You can observe throttled I/O to see whether it is a bottleneck.

@mostafaghadimi
Author

Thanks for your response. I know how to limit the resources of Pods/Deployments and other Kubernetes objects. I have tried the following:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: synonym-data-gathering-scheduled
  namespace: spark
spec:
  resources:
    limits:
      - default:
          ephemeral-storage: 1Gi
        defaultRequest:
          ephemeral-storage: 1Gi
  suspend: true
  schedule: "@every 1m"
  concurrencyPolicy: Allow
  successfulRunHistoryLimit: 5
  failedRunHistoryLimit: 3
  template:
    deps:
      packages:
        - 'io.delta:delta-core_2.12:1.0.0'
        - 'org.apache.hadoop:hadoop-hdfs-client:3.3.0'
    type: Python
    pythonVersion: '3'
    mode: cluster
    image: "<CICD_IMAGE_PLACEHOLDER>"
    imagePullPolicy: IfNotPresent
    mainApplicationFile: "local:///app/src/synonyms/1_data_gathering.py"
    sparkVersion: "3.1.1"
    restartPolicy:
      type: Never
    volumes: [
      #<CICD_K8S_VOLUMES_PLACEHOLDER>
    ]
    driver:
      javaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
      volumeMounts: [
        #<CICD_K8S_VOLUME_MOUNTS_PLACEHOLDER>
      ]
      cores: 4
      memory: "12g"
      labels:
        version: 3.1.1
      serviceAccount: spark-app
    executor:
      javaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
      volumeMounts: [
        #<CICD_K8S_VOLUME_MOUNTS_PLACEHOLDER>
      ]
      cores: 3
      instances: 1
      memory: "3g"
      labels:
        version: 3.1.1

and got the following error:

 error: error validating "spark_jobs/synonym-data-gathering.yml": error validating data: ValidationError(ScheduledSparkApplication.spec): unknown field "resources" in io.k8s.sparkoperator.v1beta2.ScheduledSparkApplication.spec; if you choose to ignore these errors, turn validation off with --validate=false

I don't know how this operator is implemented or whether we can do everything in memory without using disk.
I would be thankful if you could help me.

@jkleckner
Contributor

jkleckner commented Jun 11, 2022

Yes, I noticed that the operator doesn't seem to surface the ephemeral-storage request fields. It was a non-Spark pod where I had needed to set a limit, so I haven't hit that as a limitation of this operator.

Since the default is no limit, I think you need to look at where your Spark pods are creating their "spill storage"; most likely your node-local filesystems are filling up. If your cluster has any monitoring of node storage, check it when this error occurs to confirm that.

If confirmed, the simplest remedy is to increase the amount of node local storage in your cluster.

I use a node pool strictly for executors, configured differently and with more local storage than the general node pool, and use taints to schedule executors on that pool.
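
A rough sketch of that layout in the operator spec, assuming your CRD version exposes nodeSelector and tolerations on the executor spec (the pool label and taint key/value here are made up):

spec:
  template:
    executor:
      nodeSelector:
        pool: spark-executors        # hypothetical label on the dedicated executor node pool
      tolerations:
        - key: dedicated             # hypothetical taint that keeps other workloads off that pool
          operator: Equal
          value: spark-executors
          effect: NoSchedule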

@jkleckner
Contributor

As an aside, I don't want to limit the ephemeral storage via Kubernetes for my executors. You do that indirectly by controlling the compute graph, caching, and the number/size/memory of executors so that the spill storage remains limited.

1GiB of ephemeral storage is very small for a computation big enough to warrant Spark.
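
Concretely, the knobs I mean are the executor fields already in your spec plus sparkConf entries, e.g. (the values below are purely illustrative):

spec:
  template:
    sparkConf:
      "spark.sql.shuffle.partitions": "200"   # shuffle parallelism changes the number and size of shuffle files
    executor:
      instances: 4      # more executors spread spill/shuffle data across more nodes
      cores: 3
      memory: "8g"      # more heap generally means less spilling to disk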

@hiendang

hiendang commented Jul 24, 2022

I had a similar error. My analysis:

Pods on the same k8s node share the node's ephemeral storage, which (if no special configuration was used) is what Spark uses to store temporary data for its jobs (disk spill and shuffle data). The amount of ephemeral storage on a node is basically the size of the available storage on that k8s node.

If some executor pods use up all of the ephemeral storage on a node, other pods will fail when they try to write data to ephemeral storage. In your case the failing pod is the driver pod, but it could have been any other pod on that node. In my case it was an executor that failed with a similar error message.

I would try to optimize the spark code first before changing the deployment configuration.

  • reduce disk spill and shuffle writes
  • split transforms if possible
  • and increase the number of executors as a last resort :)

If you know upfront how much storage each executor requires, maybe you can try to set the resource requests (and not limits) for ephemeral storage to the right amount; one possible sketch is below.
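
If your operator version doesn't expose an ephemeral-storage field in the CRD, one possible workaround (a sketch I haven't verified against the operator's webhook) is Spark's own pod template mechanism: point spark.kubernetes.executor.podTemplateFile at a template shipped in the image, for example:

# Hypothetical template file baked into the image at /opt/spark/conf/executor-template.yaml,
# referenced from sparkConf as
#   "spark.kubernetes.executor.podTemplateFile": "/opt/spark/conf/executor-template.yaml"
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor                    # first container is treated as the executor container
                                        # (see podTemplateContainerName if you use several)
      resources:
        requests:
          ephemeral-storage: "10Gi"     # request only (no limit), sized to the spill you actually observe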

@rameshd99

rameshd99 commented Jul 28, 2022

I faced this issue too. I increased the size of the /run mount path where the pod actually runs and set an ephemeral-storage quota in the deployment, which resolved the issue; however, I don't know whether it is the right solution.

@dheerajpanangat

Hi @mostafaghadimi @hiendang @jkleckner,
Did we get a resolution for this?
I see the issue is still open.
Is this an issue with the latest 1.1.27 version?

@pedrohff

Hi!
I've just updated from 1.1.26 to 1.1.27

I hoped the error below would disappear, but it keeps happening after the upgrade:

error: error validating "optallsmall.yaml": error validating data: ValidationError(ScheduledSparkApplication.spec.template.volumes[2]): unknown field "ephemeral" in io.k8s.sparkoperator.v1beta2.ScheduledSparkApplication.spec.template.volumes

Does updating the chart to the latest version not update the metadata included in #1661? Am I missing something, or is the problem not fixed yet?

@cinesia

cinesia commented Apr 12, 2024

Hi,
I have answered a similar question here: #1942.
Here is the link to the specific comment: #1942 (comment).


github-actions bot commented Sep 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@github-actions github-actions bot closed this as not planned (stale) Sep 23, 2024