When a server shows symptoms such as disk write failures, files that cannot be created, logs that suddenly stop updating, application errors, or partitions mounted read-only, it typically means the EXT4 file system has detected an anomaly and has switched to read-only mode to protect the data from further damage. These problems are particularly challenging in production environments because they can lead to database write failures, website service interruptions, and cache update failures, severely impacting business continuity.
EXT4 file system errors stem from two broad categories: hardware and software. Hardware-related causes include disk aging, bad blocks, worn-out SSDs, RAID controller malfunctions, storage array failures, and sudden power loss that prevents cached writes from reaching disk. Software-related causes include abnormal shutdowns, devices that were not cleanly unmounted, kernel panics that leave journal commits incomplete, and application or script errors that corrupt metadata. Regardless of the cause, when EXT4 detects an inconsistency in an inode, block bitmap, superblock, or the journal, it automatically remounts the file system read-only to protect the data, which means every subsequent write operation fails.
The first step in troubleshooting these issues is to confirm whether the file system is already in read-only mode. The following command can be used to check the mount status:
mount | grep "on / "
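As an alternative, findmnt can print just the mount options of the root filesystem, which makes the ro/rw state easier to spot:

findmnt -n -o OPTIONS /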
If the output includes ro among the mount options, the root partition (or the mount point in question) has been forced into read-only mode. You can also verify this by trying to create a file on the affected filesystem; for the root partition, pick a path that actually lives on it, such as /root, since /tmp is often a separate tmpfs mount.
touch /root/testfile
If the error message "Read-only file system" is returned, then the file system is indeed not writable. Next, you need to check the kernel logs to confirm the EXT4 error message.
dmesg | grep -i ext4
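If the server has been rebooted since the error occurred, the dmesg ring buffer may no longer contain the relevant messages. On systemd-based systems with persistent journaling enabled, the kernel messages from the previous boot can be retrieved instead (a supplementary check, not a required step):

journalctl -k -b -1 | grep -i "EXT4-fs"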
Common error messages include "EXT4-fs error," "journal has aborted," "inode checksum invalid," and "metadata corruption detected." These logs can help determine whether the problem stems from metadata corruption, journal errors, or hardware I/O errors.
After confirming file system errors, the core repair tool is fsck (file system check and repair tool). It scans the EXT4 inodes, block bitmaps, directory structure, superblock, and journal metadata, and attempts to repair any anomalies. Before using fsck, ensure the file system is not mounted or is mounted as read-only; otherwise, further damage may occur. If it's the root partition, you need to enter rescue mode or a LiveCD environment. Non-root partitions can be unmounted first.
umount /dev/sda1
If the device is busy, you can use lsof or fuser to check which processes are using it.
lsof /dev/sda1
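fuser can do the same job and is often quicker to read; with -m the argument is treated as a mounted filesystem and -v adds process details:

fuser -vm /dev/sda1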
Or perform a lazy unmount, which detaches the mount point immediately and finishes the cleanup once it is no longer busy:
umount -l /dev/sda1
Once the unmount completes, run fsck to check and repair the filesystem:
fsck -f /dev/sda1
The `-f` option forces a full scan, even if the filesystem is marked as clean. During the repair process, `fsck` will check inodes, directories, block bitmaps, reference counts, and group summary information step by step, prompting for repairs when errors are found. To automatically repair all repairable issues, you can use:
fsck -y /dev/sda1
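fsck reports its outcome through the exit status: 0 means no errors, 1 means errors were corrected, 2 means errors were corrected but a reboot is recommended, and 4 means errors were left uncorrected. Checking it immediately after the run tells you whether another pass or a reboot is needed:

echo $?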
The most common repair cases during the scan are: orphan inodes, i.e. inodes left isolated when a file deletion was interrupted before the journal was fully committed, which fsck recovers into the lost+found directory; superblock corruption, which fsck can repair because EXT4 keeps multiple backup superblocks across the partition; and block bitmap errors or inode checksum mismatches, which fsck attempts to restore to a consistent state using the journal.
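When the primary superblock itself is damaged, fsck can be pointed at one of the backups. A minimal sketch, assuming /dev/sda1 as above; the block number 32768 is only a typical location for 4 KiB block sizes, so use whatever dumpe2fs actually reports for your partition:

dumpe2fs /dev/sda1 | grep -i "backup superblock"
e2fsck -b 32768 /dev/sda1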
If serious errors occur during the fsck repair process, such as the inability to find the physical volume or device I/O errors, it may indicate a physical problem with the hard drive. In this case, further troubleshooting using hardware diagnostic tools is necessary, such as using SMART to check the hard drive's health status.
smartctl -a /dev/sda
Monitor attributes such as Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable. If these values are abnormal, back up your data and replace the drive as soon as possible. For cloud servers, contact your service provider to migrate the data or replace the storage volume.
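To pull just those attributes out of the full SMART report (this filter assumes an ATA-style drive; NVMe devices report health in a different format):

smartctl -A /dev/sda | grep -E "Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable"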
Assuming the hardware is functioning correctly, if fsck completes the repair, the file system can usually regain write functionality. However, it is still recommended to check the logs for any remaining error messages.
dmesg | grep -i ext4
If the logs are clear and error-free, the file system can be remounted.
mount /dev/sda1 /mnt
After mounting, if the filesystem is still flagged read-only (as it will be when the kernel forced it into that state), switch it back to read-write, then create a file at the mount point to verify write access:
mount -o remount,rw /mnt
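A quick write test; the file name is arbitrary:

touch /mnt/testfile && rm /mnt/testfile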
After the root partition repair is complete, the server should be restarted to ensure the system runs in normal read/write mode, and the startup log should be checked for EXT4 errors.
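On systemd-based systems, the log of the current boot can be checked for lingering EXT4 errors after the restart:

journalctl -b | grep -i "EXT4-fs"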
Furthermore, to prevent future write failures, operations personnel need to track down the root cause of the problem, covering hardware health, protection against power loss, application write behavior, and scheduled fsck checks. For directories that receive frequent writes, such as databases, logs, and caches, disk usage should be monitored regularly and alert thresholds set. For critical data in the production environment, regular backups are essential to prevent business data loss caused by file system errors.
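Periodic checks can be scheduled at the filesystem level with tune2fs. A minimal sketch for the partition used throughout this article; the values (check every 30 mounts or every month, whichever comes first) are only examples and should be adapted to your maintenance window:

tune2fs -c 30 -i 1m /dev/sda1
tune2fs -l /dev/sda1 | grep -iE "mount count|check"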
In certain special cases, if the file system repeatedly drops to read-only or fsck cannot fully repair it, rebuilding the EXT4 file system and restoring the data can be considered. The typical procedure is: back up the data → recreate the EXT4 file system on the partition → restore the data. Although this method carries a higher risk, it is an effective way to ensure long-term system stability when metadata is severely corrupted, the block device is aging, or errors keep recurring. A minimal sketch of that procedure follows.
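The sketch below assumes /dev/sda1 is the affected partition, /mnt is its mount point as in the earlier examples, and /backup is an illustrative path with enough free space to hold a full copy:

rsync -aHAX /mnt/ /backup/data/    # back up while the filesystem is still readable
umount /dev/sda1
mkfs.ext4 /dev/sda1                # recreate the EXT4 file system (destroys existing data)
mount /dev/sda1 /mnt
rsync -aHAX /backup/data/ /mnt/    # restore the data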