Overview
- The system is down and the connection to the SLS database nodes is lost. Ambari shows that SLS was reported down some time ago.
- The collector box ran out of space on the /opt/data partition and on the /var partition, both of which are at almost 100% disk utilization.
- The collector error and transaction logs show no successful loads for many days.
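The disk-space symptom described above can be confirmed from a shell on the collector. The following is a minimal sketch, assuming a POSIX `df` and `awk` are available; the function names `usage_pct` and `check_partitions` and the 90% threshold are illustrative choices, not part of the product.

```shell
#!/bin/sh
# Print the used-space percentage (without the % sign) of the filesystem
# that holds the given path. Relies on the POSIX output format of `df -P`,
# where the fifth column is the capacity.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical alert threshold, chosen only for illustration.
THRESHOLD=90

# Report each given partition whose utilization is at or above THRESHOLD.
check_partitions() {
    for part in "$@"; do
        if [ ! -d "$part" ]; then
            continue                      # path does not exist on this host
        fi
        pct=$(usage_pct "$part")
        case "$pct" in
            ''|*[!0-9]*) continue ;;      # df reported no numeric capacity
        esac
        if [ "$pct" -ge "$THRESHOLD" ]; then
            echo "$part is at ${pct}% (threshold ${THRESHOLD}%)"
        fi
    done
}

# The two partitions named in this article.
check_partitions /opt/data /var
```

A partition that prints nothing is below the threshold or does not exist on the host where the script runs.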
Solution
Perform immediate maintenance to free up space on the collector, then restart the cluster and the collector:
- Turn down the collector to avoid further noloads until the system is back up.
- Remove old Ambari logs and core dumps to free up disk space on the collector's /var partition.
- Manually start the SLS/EDW cluster and perform quick health checks to confirm that there is no corruption in the SLS tables:
# /opt/hexis/hawkeye-ap/bin/clssh --hosts=<hostname>,<hostname> "/opt/hexis/hawkeye-ap/etc/init.d/sensage_edw start"
- Reset the noloads and start the collector to begin processing the backlog.
# /opt/hexis/hawkeye-ap/bin/noload_reset.sh
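For the /var cleanup step above, it may help to list removal candidates before deleting anything. The sketch below only prints file names. The Ambari log directories and the 7-day cutoff are assumptions that should be checked against the actual installation, and `list_old_files` is a hypothetical helper name. The article does not say where core dumps are written, so no path is given for them; the same function can be pointed at that location once it is known.

```shell
#!/bin/sh
# List regular files under a directory that were last modified more than
# the given number of days ago. Nothing is deleted; names are only printed.
list_old_files() {
    dir=$1
    days=$2
    find "$dir" -type f -mtime +"$days" -print 2>/dev/null
}

# Assumed log locations and retention; adjust to the actual system.
AMBARI_LOG_DIRS="/var/log/ambari-server /var/log/ambari-agent"
KEEP_DAYS=7

for d in $AMBARI_LOG_DIRS; do
    if [ -d "$d" ]; then
        list_old_files "$d" "$KEEP_DAYS"
    fi
done
```

Once the printed list has been reviewed, the same `find` expression can be rerun with `-exec rm -f {} +` in place of `-print` to remove the files.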
Once the collector is running:
- On the collector's /opt/data partition, remove logs in the done directories to allow ingesting more logs. You can also move data from one of the log queues to free up disk space in the /opt/data partition of the collector node. After the backlog is successfully processed, move the logs back to the corresponding log queue.
- Ensure that the current /opt/data utilization is normal and that the files in the done directories will be swept automatically by the scheduled cron job after 2 days. Reduce the retention time if it is more than 2 days.
- Review the inode utilization and the fragmented leaves for the table to check that they are normal.
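The done-directory and inode checks above can be approximated with standard tools. This is a sketch under stated assumptions: it treats any directory literally named done below /opt/data as a sweep target, uses the 2-day retention mentioned above, and relies on `df -i`, which GNU and BSD `df` provide but POSIX does not require. The helper names are hypothetical. Checking fragmented leaves needs product-specific tooling and is not covered here.

```shell
#!/bin/sh
# Print files inside any directory named "done" below the given root that
# are older than the given number of days. Read-only: nothing is removed.
list_stale_done_files() {
    root=$1
    days=$2
    find "$root" -type f -path '*/done/*' -mtime +"$days" -print 2>/dev/null
}

# Print the inode utilization percentage of the filesystem holding a path.
# Some filesystems report "-" here instead of a number.
inode_pct() {
    df -Pi "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

RETENTION_DAYS=2    # retention stated in this article

if [ -d /opt/data ]; then
    echo "Files in done directories older than ${RETENTION_DAYS} days:"
    list_stale_done_files /opt/data "$RETENTION_DAYS"
    echo "Inode utilization on /opt/data: $(inode_pct /opt/data)%"
fi
```

If the first list is not empty, the scheduled sweep is probably not running or its retention is set longer than 2 days.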