When using the Hadoop Distributed File System, many users focus only on read/write behavior and node status, neglecting the configuration and management of log storage paths. However, in HDFS 3.x, logs not only record detailed information about system operation but also serve as the primary basis for troubleshooting and analyzing performance bottlenecks. Deploying HDFS on CentOS 8 without understanding the log paths can lead to full disks, inefficient debugging, and even data risks. Understanding how these paths are laid out and how to adjust them makes operations and maintenance more efficient and controllable.
After installing HDFS 3.x on CentOS 8, the default log path is typically the logs folder inside the Hadoop installation directory, such as /opt/hadoop/logs or /usr/local/hadoop/logs, depending on where the archive was unpacked during installation. This directory stores the operational logs of the various components, including the NameNode, DataNode, SecondaryNameNode, JournalNode, and ZKFC. At process startup, HDFS derives the log directory from the hadoop.log.dir property, which the startup scripts set from HADOOP_LOG_DIR in hadoop-env.sh; its default value is ${HADOOP_HOME}/logs. If the operations staff does not adjust this, all component logs are centralized here. This default configuration is simple, but it can be confusing in a multi-node cluster because each node writes its logs to its own local disk, which makes collection less intuitive.
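As a quick sanity check after installation, you can list the directory to confirm that each daemon is writing there; the path below assumes Hadoop was unpacked to /opt/hadoop, so adjust it to your environment:
ls -lh /opt/hadoop/logs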
NameNode logs are typically named "hadoop-username-namenode-hostname.log," while DataNode logs are named "hadoop-username-datanode-hostname.log." In addition, corresponding ".out" files record standard output. During operation, the logging framework automatically rolls logs into historical files such as ".log.1" and ".log.2," either daily or when a file exceeds a configured size, depending on the appender in use, which makes retrospective analysis easier. To identify the cause of a NameNode startup failure, the most direct approach is to go to the log directory and use "tail -f" to follow the log in real time, or "less" to scroll through the history. For example:
tail -f /usr/local/hadoop/logs/hadoop-root-namenode-centos8.log
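If the failure is not obvious from the live tail, filtering the same file for error-level lines usually narrows it down faster; the path here is the same illustrative one as above:
grep -iE "error|exception|fatal" /usr/local/hadoop/logs/hadoop-root-namenode-centos8.log | tail -n 50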
A Hadoop 3.x deployment also produces YARN logs, which are kept separate from the HDFS NameNode and DataNode logs. The log path for the ResourceManager and NodeManager is controlled by the yarn.log.dir property and, by default, also points to $HADOOP_HOME/logs. To separate responsibilities more cleanly, however, some operators specify a dedicated partition or directory in their configuration, such as /data/logs/yarn, to avoid contention for disk space. In a large cluster with frequent log writes, failing to set a separate path makes it easy for log files to fill the disk and cause service anomalies.
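A quick way to gauge how much pressure the logs are putting on a disk is to check the directory size and the free space of the filesystem it lives on; the path is illustrative:
du -sh /usr/local/hadoop/logs
df -h /usr/local/hadoop/logs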
Changing the log path is straightforward: add or modify the following variable in hadoop-env.sh:
export HADOOP_LOG_DIR=/data/hdfs_logs
Save the file and restart the HDFS-related services for the change to take effect. For YARN, adjust the corresponding variable in yarn-env.sh:
export YARN_LOG_DIR=/data/yarn_logs
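Before restarting, make sure the new directories exist and are writable by the accounts that run the daemons; the user and group names below (hdfs, yarn, hadoop) are assumptions, so substitute the ones used on your cluster:
mkdir -p /data/hdfs_logs /data/yarn_logs
chown hdfs:hadoop /data/hdfs_logs
chown yarn:hadoop /data/yarn_logs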
This approach allows for separate storage of logs and data, improving security and read/write performance. It also facilitates centralized log collection later, such as with ELK or Fluentd for analysis.
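As a minimal sketch of what that collection can look like, a Fluentd tail source can follow the relocated log directory; the tag, pos_file location, and downstream output are placeholders to adapt to your own pipeline:
<source>
  @type tail
  path /data/hdfs_logs/*.log
  pos_file /var/log/fluentd/hdfs_logs.pos
  tag hdfs.daemon
  <parse>
    @type none
  </parse>
</source>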
In addition to the operational logs generated by the system, HDFS also records audit logs of user file access operations. Audit logging is driven by the NameNode's log4j audit appender (selected via the dfs.namenode.audit.loggers property); by default, the audit log is written as hdfs-audit.log in the same directory as the NameNode logs. Audit logs are particularly important for businesses with stringent security compliance requirements, such as cross-border operations, because they help administrators track who accessed which directories and when, providing evidence in the event of a security incident.
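In the stock log4j.properties shipped with Hadoop, the audit logger defaults to a NullAppender; the lines below are a sketch of switching it to the rolling file audit appender (RFAAUDIT) and then following the resulting file, with names that may vary by distribution:
hdfs.audit.logger=INFO,RFAAUDIT
tail -f /data/hdfs_logs/hdfs-audit.log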
In a CentOS 8 environment, if you manage the HDFS daemons with systemd, additional startup information may be recorded in the systemd journal and under /var/log, and can be viewed with the journalctl command. This is separate from Hadoop's own logging mechanism and is a system-level service log. While it does not replace HDFS's internal logs, it can provide clues to issues such as startup failures and missing environment variables.
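For example, assuming a custom unit named hadoop-namenode.service (Hadoop ships no standard systemd units, so the name is hypothetical), the service's journal can be followed with:
journalctl -u hadoop-namenode.service -f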
After adjusting the log path, you also need to consider the log rotation policy. In HDFS 3.x, rotation is handled by the log4j configuration in $HADOOP_HOME/etc/hadoop/log4j.properties. You can modify log4j.appender.RFA.MaxFileSize and log4j.appender.RFA.MaxBackupIndex to control the size of each log file and the number of historical files kept. For example, limiting a single log file to 100MB and retaining at most 10 historical files prevents uncontrolled growth from exhausting disk space:
log4j.appender.RFA.MaxFileSize=100MB
log4j.appender.RFA.MaxBackupIndex=10
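In the stock log4j.properties these appender settings are usually wired to two convenience variables, so if your file follows that layout you can adjust the limits in one place instead; check your file for the exact names before editing:
hadoop.log.maxfilesize=100MB
hadoop.log.maxbackupindex=10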
Optimizing the log path not only makes logs easier to inspect but also prevents cascading issues during operation caused by disk pressure or slow file lookups. In enterprise cluster operations, centralized logging systems are often used to aggregate logs from all nodes, eliminating the need to log in to each server individually and allowing anomalies to be identified quickly with search and analysis tools. A clear log directory layout and a reasonable rotation strategy are essential for troubleshooting HDFS performance bottlenecks, frequent DataNode disconnections, and NameNode heartbeat anomalies.
In short, understanding and planning the log storage path is essential when deploying HDFS 3.x on CentOS 8. While the default path is convenient, it's not suitable for all environments, especially those with large clusters, diverse node distribution, or compliance requirements.
By properly adjusting hadoop.log.dir, yarn.log.dir, and the audit log path, and combining them with log rotation and centralized collection, you can not only improve operational efficiency but also mitigate risk, keeping the system stable and controllable. In this way, HDFS log management truly adds value to daily operations and troubleshooting instead of becoming a source of risk.