scrutiny: [Standby-Support][BUG] Disks in standby mode are not correctly processed/detected - Scrutiny will create a duplicate disk with empty data

Describe the bug

I installed scrutiny through the linuxserver Docker container on a Raspberry Pi 4, to which I have attached two disks over USB. When I run the scrutiny-collector-metrics run command for the first time, both disks are recognized: one is an HDD and the other is an SSD. I therefore have two entries in scrutiny, which is the expected behaviour. However, after some time (around a day), a new “empty” entry appears. I have marked it in red in the screenshot below. Do you know how to get rid of that fake entry?

Expected behavior

Scrutiny should list only my two disks. It does list them, but it also shows an entry for a “phantom” disk.

Screenshots

[image: screenshot with the phantom disk entry marked in red]

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 2
  • Comments: 20 (8 by maintainers)

Most upvoted comments

ahh ok. I just took a closer look at that non-zero exit code of 2 and it matches bit 1 in the docs: http://www.linuxguide.it/command_line/linux-manpage/do.php?file=smartctl#sect7

Return Values

The return values of smartctl are defined by a bitmask. If all is well with the disk, the return value (exit status) of smartctl is 0 (all bits turned off). If a problem occurs, or an error, potential error, or fault is detected, then a non-zero status is returned. In this case, the eight different bits in the return value have the following meanings for ATA disks; some of these values may also be returned for SCSI disks.

Bit 0: Command line did not parse.

Bit 1: Device open failed, or device did not return an IDENTIFY DEVICE structure.

Bit 2: Some SMART command to the disk failed, or there was a checksum error in a SMART data structure (see '-b' option above).

Bit 3: SMART status check returned “DISK FAILING”.

Bit 4: We found prefail Attributes <= threshold.

Bit 5: SMART status check returned “DISK OK” but we found that some (usage or prefail) Attributes have been <= threshold at some time in the past.

Bit 6: The device error log contains records of errors.

Bit 7: The device self-test log contains records of errors.

To test within the shell for whether or not the different bits are turned on or off, you can use the following type of construction (this is bash syntax):
smartstat=$(($? & 8))
This looks at only bit 3 of the exit status $? (since 8=2^3). The shell variable $smartstat will be nonzero if SMART status check returned “disk failing” and zero otherwise.
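The same masking idea extends to all eight bits. Below is a minimal sketch that prints one line per set bit; the function name `describe_smartctl_status` is made up for illustration, and the descriptions are shortened from the man page text quoted above.

```shell
# Print one line for every bit that is set in a smartctl exit status.
# describe_smartctl_status is a hypothetical helper, not part of smartctl.
describe_smartctl_status() {
    local code="$1" bit=0 meaning
    while [ "$bit" -le 7 ]; do
        # Test a single bit by masking with 2^bit.
        if [ $(( code & (1 << bit) )) -ne 0 ]; then
            case "$bit" in
                0) meaning="command line did not parse" ;;
                1) meaning="device open failed, or no IDENTIFY DEVICE structure returned" ;;
                2) meaning="a SMART command failed, or a checksum error occurred" ;;
                3) meaning="SMART status check returned DISK FAILING" ;;
                4) meaning="prefail attributes <= threshold" ;;
                5) meaning="attributes were <= threshold at some time in the past" ;;
                6) meaning="device error log contains records of errors" ;;
                7) meaning="device self-test log contains records of errors" ;;
            esac
            echo "bit $bit: $meaning"
        fi
        bit=$(( bit + 1 ))
    done
}
```

Calling `describe_smartctl_status 2` reports bit 1 only, which is the exit code seen for the sleeping drive in this thread.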

In general, scrutiny shouldn’t be forwarding empty, invalid data from the collector to the webapp. However, we currently ignore non-zero exit codes, because bit 3 is set when smartctl reports a failing disk, and that response is still valid, with real, meaningful data.
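To make that distinction concrete, here is a rough sketch of a check that separates "smartctl never talked to the drive" from "smartctl reported a problem but returned data". It is not scrutiny's actual logic. The function name is hypothetical, and treating bit 0 as unusable alongside bit 1 is my own assumption; only bit 1 (exit code 2) was observed in this issue.

```shell
# Return 0 (usable) unless bit 0 or bit 1 is set in the smartctl exit status.
# Bits 2-7 (including bit 3, "DISK FAILING") still come with real data.
# smartctl_output_is_usable is a hypothetical helper for illustration only.
smartctl_output_is_usable() {
    local code="$1"
    # 3 = 2^0 + 2^1: command line did not parse, or device open failed.
    [ $(( code & 3 )) -eq 0 ]
}

# Possible usage (commented out because it needs a real device):
#   smartctl -x -j -d sat /dev/sdc > /tmp/sdc.json
#   if smartctl_output_is_usable "$?"; then
#       echo "publish /tmp/sdc.json"
#   else
#       echo "skip sdc, drive did not respond" >&2
#   fi
```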

I’ll need to come up with a generic process to handle disks that are in standby mode or that haven’t been “seen” in a while – maybe just a new notification rule. At least I know what to look for now. Thanks for all the data everyone! 👍

Just to add to the logs and provide some suggestions. This is a sleeping external USB hard drive. The drive appears to be too slow to respond to smartctl while it spins up.

Suggestion 1: Include a quick read from the drive before running smartctl:
# dd if=/dev/sda bs=1k count=1 of=/dev/zero
This will cause the drive to wake up ahead of running smartctl. It should be safe on all drives, and it will block scrutiny until the read succeeds.

Suggestion 2: Include a probe of the drives before running smartctl:
# partprobe -d /dev/sda
This should cause the drive to wake up.

Suggestion 3: Workaround. If smartctl returns that particular error, the app could sleep for 10 seconds and simply retry the command.
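The retry idea in Suggestion 3 could look roughly like the wrapper below. This is only a sketch: the function name, the attempt limit, and the choice to print only the final attempt's output are my assumptions, and scrutiny itself is written in Go, so this just illustrates the control flow.

```shell
# retry_on_open_failure MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND while its exit status has bit 1 set (device open failed),
# up to MAX_ATTEMPTS times. Only the last attempt's stdout is printed, so a
# caller capturing JSON does not get several documents back to back.
# Hypothetical helper for illustration only.
retry_on_open_failure() {
    local max_attempts="$1" delay="$2" attempt=1 code out
    shift 2
    while :; do
        out=$("$@")
        code=$?
        # Stop on success, on any failure other than bit 1, or when out of attempts.
        if [ $(( code & 2 )) -eq 0 ] || [ "$attempt" -ge "$max_attempts" ]; then
            printf '%s\n' "$out"
            return "$code"
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}
```

With the 10 second pause suggested above, a call might be `retry_on_open_failure 3 10 smartctl -x -j -d sat /dev/sdc`. Other non-zero codes, such as 8 for a failing disk, are passed through without a retry.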

Recommendation: I’d probably do one of the first two, as there is little downside in a quick read/probe prior to running smartctl. It would ensure it always has a live drive to read from.
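A wake-then-query wrapper along the lines of Suggestion 1 might look like the sketch below. The function name is hypothetical. It writes to /dev/null, the conventional sink, where the suggestion above uses /dev/zero (both discard the data). One caveat I have not tested: a small read can sometimes be served from the kernel's page cache, in which case it may not reach the drive at all.

```shell
# wake_then_run DEVICE COMMAND [ARGS...]
# Read 1 KiB from DEVICE so a sleeping drive spins up, then run COMMAND.
# If the read fails, COMMAND is not run and 1 is returned.
# Hypothetical helper for illustration only.
wake_then_run() {
    local device="$1"
    shift
    # dd blocks until the read completes, i.e. until the drive has answered.
    if ! dd if="$device" of=/dev/null bs=1k count=1 2>/dev/null; then
        echo "wake_then_run: could not read from $device" >&2
        return 1
    fi
    "$@"
}
```

For the sleeping drive in the logs below, this would be called as `wake_then_run /dev/sdc smartctl -x -j -d sat /dev/sdc`.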

Additional enhancement: Irrespective of the above, scrutiny should probably not add the drive to the dashboard if the drive wasn’t detected properly.

Logs showing one sleeping drive (sdc)
# scrutiny-collector-metrics run
2022/01/12 16:41:36 Loading configuration file: /scrutiny/config/collector.yaml

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                               dev-0.3.12

INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: smartctl --scan -j         type=metrics
INFO[0000] Executing command: smartctl --info -j -d sat /dev/sdb  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j -d sat /dev/sdc  type=metrics
ERRO[0008] Could not retrieve device information for sdc: exit status 2  type=metrics
INFO[0008] Executing command: smartctl --info -j -d sat /dev/sda  type=metrics
INFO[0009] Generating WWN                                type=metrics
INFO[0009] Sending detected devices to API, for filtering & validation  type=metrics
INFO[0009] Collecting smartctl results for sdb           type=metrics
INFO[0009] Executing command: smartctl -x -j -d sat /dev/sdb  type=metrics
INFO[0009] Publishing smartctl results for ...  type=metrics
INFO[0009] Collecting smartctl results for sdc           type=metrics
INFO[0009] Executing command: smartctl -x -j -d sat /dev/sdc  type=metrics
ERRO[0013] smartctl returned an error code (2) while processing sdc  type=metrics
ERRO[0013] smartctl could not open device                type=metrics
INFO[0013] Publishing smartctl results for               type=metrics
INFO[0013] Collecting smartctl results for sda           type=metrics
INFO[0013] Executing command: smartctl -x -j -d sat /dev/sda  type=metrics
INFO[0013] Publishing smartctl results for ...  type=metrics
INFO[0013] Main: Completed                               type=metrics

# scrutiny-collector-metrics run
2022/01/12 16:41:58 Loading configuration file: /scrutiny/config/collector.yaml

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                               dev-0.3.12

INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: smartctl --scan -j         type=metrics
INFO[0000] Executing command: smartctl --info -j -d sat /dev/sda  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j -d sat /dev/sdb  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j -d sat /dev/sdc  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Sending detected devices to API, for filtering & validation  type=metrics
INFO[0000] Collecting smartctl results for sda           type=metrics
INFO[0000] Executing command: smartctl -x -j -d sat /dev/sda  type=metrics
INFO[0000] Publishing smartctl results for ... type=metrics
INFO[0000] Collecting smartctl results for sdb           type=metrics
INFO[0000] Executing command: smartctl -x -j -d sat /dev/sdb  type=metrics
INFO[0001] Publishing smartctl results for ... type=metrics
INFO[0001] Collecting smartctl results for sdc           type=metrics
INFO[0001] Executing command: smartctl -x -j -d sat /dev/sdc  type=metrics
INFO[0001] Publishing smartctl results for ... type=metrics
INFO[0001] Main: Completed                               type=metrics

As a workaround while waiting for a fix, I added a cron job to wake the disk up (ls /mounted/path) a minute before midnight when the collector task is scheduled to kick in. On a fresh install, this seems to avoid the problem - no phantom disk has been detected after running it for 5 days. It does look like the problem is related to the disk entering sleep mode.
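For anyone wanting to copy this workaround, a sketch of what the wake-up job could look like is below. The function name, the script location, and the exact crontab line are assumptions for illustration; /mounted/path is the placeholder used above and stands for wherever the disk is mounted.

```shell
# wake_paths DIR [DIR...]
# List each mount point so that a sleeping disk behind it spins up.
# Returns 1 if any directory could not be listed.
# Hypothetical helper for illustration only.
wake_paths() {
    local dir failed=0
    for dir in "$@"; do
        if ! ls "$dir" >/dev/null 2>&1; then
            echo "wake_paths: cannot list $dir" >&2
            failed=1
        fi
    done
    return "$failed"
}

# To use it from cron, save the function in a script that ends with:
#   wake_paths "$@"
# and add a crontab entry that fires at 23:59, one minute before a
# midnight collector run (the script path is an assumption):
#   59 23 * * * /usr/local/bin/wake-disks.sh /mounted/path
```

Whether a directory listing is enough to spin the disk up probably depends on caching, so it may be worth confirming on your own setup that the phantom entry stays away.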