filesystem error count metric #3113

Open
anarcat opened this issue Sep 6, 2024 · 2 comments
@anarcat
Contributor

anarcat commented Sep 6, 2024

We are porting various alerts from Nagios to the Prometheus ecosystem, and we've found one Nagios check that is quite useful but seems to be missing from the node exporter. It inspects ext filesystems with the tune2fs -l command and (basically) greps for the "FS Error count" field.

This count should normally be zero, but under certain circumstances (failing disk, filesystem bug, power outage) it will rise. Running fsck on the filesystem will fix this. Normally a reboot after a power outage runs fsck, but under certain circumstances it might not fully repair the filesystem.

So I think the node exporter should do this. I've tried to find metrics about this in our node exporters and couldn't find anything under the node_filesystem_* namespace. There is node_filesystem_readonly and, according to this post, node_filesystem_device_error (but I can't see that metric here), but neither of those is the same as the error count.

Am I missing something, or is this missing from the node exporter?

Here's a copy of the check, called dsa-check-filesystems here:

```ruby
#!/usr/bin/ruby

require 'filesystem'

ignorefs = ["NFS", "nfs", "nfs4", "nfsd", "afs", "binfmt_misc", "proc", "smbfs",
	   "autofs", "iso9660", "ncpfs", "coda", "devpts", "ftpfs", "devfs",
	   "mfs", "shfs", "sysfs", "cifs", "lustre_lite", "tmpfs", "usbfs",
	   "udf", "fusectl", "fuse.snapshotfs", "rpc_pipefs"]
mountpoints = {}

FileSystem.mounts.each do |m|
	if ((not ignorefs.include?(m.fstype)) && (m.options !~ /bind/))
		mountpoints[m.device] = { 'type' => m.fstype, 'mount' => m.mount }
	end
end

def check_ext3(dev, mnt)
	output=%x{tune2fs -l #{dev}}
	if output =~ /FS Error count:\s*(\d+)/ and $1.to_i > 0
		return "#{dev} (#{mnt}) has #{$1} errors"
	end
end

output = []
mountpoints.keys.each do |m|
	temp = ''
	begin
		if mountpoints[m]['type'] =~ /ext/
			temp = check_ext3(m, mountpoints[m]['mount'])
		end
	rescue Exception => e
	end
	if temp && (temp.length > 0)
		output << temp
	end
end

if output.length > 0
	puts output.join("\n")
	exit 1
end
puts "OK: All filesystems ok."
exit 0
```
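
For anyone who wants to try the matching logic without root or a real block device, here is a minimal sketch that isolates the regex from the check above. The helper name fs_error_count and the sample text are made up for illustration; the sample is not captured from a real system.

```ruby
# Extract the "FS Error count" value from `tune2fs -l` output.
# Returns an Integer, or nil when the field is absent from the output.
# fs_error_count is a hypothetical helper name, not part of the original check.
def fs_error_count(tune2fs_output)
  match = tune2fs_output.match(/^FS Error count:\s*(\d+)/)
  match && match[1].to_i
end

# Illustrative sample only, written by hand to mimic the field layout.
sample = <<~OUT
  Filesystem volume name:   <none>
  FS Error count:           3
OUT

puts fs_error_count(sample).inspect
puts fs_error_count("Filesystem volume name:   <none>\n").inspect
```

Returning nil instead of 0 for a missing field keeps "no errors recorded" distinguishable from "could not find the field", which the original check treats the same way.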
@SuperQ
Member

SuperQ commented Sep 6, 2024

The node_exporter collector policy does not allow subprocess execution. It also does not allow for functions that require root privileges.

This can probably be solved by reading from /sys/fs/ext4/. There is a work in progress to implement this in prometheus/procfs.
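
As a rough sketch of the sysfs approach (in Ruby, to match the check above, not the Go code a real collector would need): this assumes the kernel exposes a world-readable errors_count file under each /sys/fs/ext4/<device>/ directory, which should be verified on the target kernel. The function names and the metric name example_ext4_errors_count are hypothetical and are not something the node exporter or prometheus/procfs provides.

```ruby
# Collect per-device ext4 error counters from sysfs, with no subprocess and
# no root access. sysfs_root is a parameter so a test tree can be substituted.
# Assumption: each device directory holds an "errors_count" file.
def ext4_error_counts(sysfs_root = "/sys/fs/ext4")
  counts = {}
  Dir.glob(File.join(sysfs_root, "*", "errors_count")).sort.each do |path|
    device = File.basename(File.dirname(path))
    begin
      counts[device] = Integer(File.read(path).strip, 10)
    rescue ArgumentError, SystemCallError
      # Unreadable or non-numeric value: skip this device rather than abort.
      next
    end
  end
  counts
end

# Render the counts in the Prometheus text exposition format.
# The metric name is hypothetical, chosen only for this illustration.
def render_metrics(counts)
  lines = ["# TYPE example_ext4_errors_count gauge"]
  counts.each do |device, count|
    lines << %(example_ext4_errors_count{device="#{device}"} #{count})
  end
  lines.join("\n") + "\n"
end

if __FILE__ == $PROGRAM_NAME
  print render_metrics(ext4_error_counts)
end
```

Directories without an errors_count file (for example a non-device entry under /sys/fs/ext4) never match the glob, so they are skipped. The value is typed as a gauge here because the count can go back to zero after fsck.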

@anarcat
Contributor Author

anarcat commented Sep 7, 2024

right, running tune2fs seemed like an odd idea in the first place, i was hoping for something exactly like that.

the PR you linked to has been merged, so we're getting close? :)

i don't quite understand what it takes to percolate stuff from procfs into the node exporter itself. would we now need a stub to call that ext4.fs.ProcStat() thing? or does procfs need to make a release first?
