System down | Collector at full disk utilization

Overview

  • The system is down and the connection to the SLS database nodes is lost. Ambari shows that SLS has been reported down for some time.
  • The collector box ran out of space on /opt/data and on the /var partition, both of which are at almost 100% disk utilization.
  • The collector error and transaction logs show no successful loads for many days.
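
The disk symptoms above can be confirmed from a shell on the collector. The following is a minimal sketch, not a product tool: the function names and the 90 percent threshold are assumptions chosen for illustration, and only the /opt/data and /var paths come from this article.

```shell
#!/bin/sh
# Extract the use percentage (digits only) from `df -P` output on stdin.
# `df -P` prints one header line and one data line per filesystem.
parse_pct() {
    awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Print WARN or OK for one path. The 90 percent default is an assumption.
check_partition() {
    path=$1
    limit=${2:-90}
    pct=$(df -P "$path" 2>/dev/null | parse_pct)
    case $pct in
        ''|*[!0-9]*) echo "UNKNOWN $path"; return 0 ;;
    esac
    if [ "$pct" -ge "$limit" ]; then
        echo "WARN $path at ${pct}%"
    else
        echo "OK $path at ${pct}%"
    fi
}

# Check the two partitions named in this article, if they exist on this host.
for p in /opt/data /var; do
    if [ -d "$p" ]; then check_partition "$p"; fi
done
```

A WARN line for either path matches the condition described in this overview.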

 

Solution

Perform immediate maintenance to free up space on the collector, then restart the cluster and the collector:

  1. Stop the collector to avoid further noloads until the system is back up.
  2. Remove old Ambari logs and core dumps to free up disk space on the collector's /var partition.
  3. Manually start the SLS/EDW cluster and perform quick health checks to confirm that there is no corruption in the SLS tables:
    # /opt/hexis/hawkeye-ap/bin/clssh --hosts=<hostname>,<hostname> "/opt/hexis/hawkeye-ap/etc/init.d/sensage_edw start"
  4. Reset the noloads and start the collector to begin processing the backlog.
    # /opt/hexis/hawkeye-ap/bin/noload_reset.sh
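
Before deleting anything in step 2, it can help to list the candidates first. The sketch below only prints files and removes nothing. The directories and the 14-day age are assumptions: /var/log/ambari-server and /var/log/ambari-agent are common Ambari log locations, and /var/crash is a hypothetical core dump location, so adjust all three to the actual layout of the collector.

```shell
#!/bin/sh
# Print regular files under a directory that were last modified more than
# the given number of days ago. Prints nothing if the directory is missing.
list_old_files() {
    dir=$1
    days=$2
    [ -d "$dir" ] || return 0
    find "$dir" -type f -mtime +"$days" -print
}

# Assumed locations and age; review the output before removing any file.
AGE_DAYS=14
for d in /var/log/ambari-server /var/log/ambari-agent /var/crash; do
    list_old_files "$d" "$AGE_DAYS"
done
```

Once the list has been reviewed, the same find expression can be used with a removal action.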

Once the collector is running:

  • On the collector's /opt/data partition, remove logs in the done directories to make room for ingesting more logs. You can also temporarily move data out of one of the log queues to free up disk space in the /opt/data partition of the collector node. After the backlog is successfully processed, move the logs back to the corresponding log queue.
  • Ensure that the current /opt/data utilization is normal and that files in the done directories are automatically swept by the scheduled cron job after 2 days. Reduce the retention time if it is more than 2 days.
  • Review the inode utilization and the fragmented leaves for the table to check that both are normal.
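
The retention and inode checks above can be sketched as follows. This assumes the done directories are literally named "done" somewhere under /opt/data. The function name is hypothetical, and `df -i` is the GNU coreutils option for inode usage.

```shell
#!/bin/sh
# Count files under any "done" directory that have outlived the retention
# window. Note that `-mtime +2` matches files at least 3 full days old, so
# every file counted here should already have been swept by a 2-day job.
count_stale_done() {
    root=$1
    days=$2
    find "$root" -type f -path '*/done/*' -mtime +"$days" | wc -l | tr -d ' '
}

ROOT=/opt/data
RETENTION_DAYS=2
if [ -d "$ROOT" ]; then
    # Inode usage for the partition (see the IUse% column).
    df -P -i "$ROOT"
    echo "stale files in done directories: $(count_stale_done "$ROOT" "$RETENTION_DAYS")"
fi
```

A nonzero count suggests that the sweep job is not running or that its retention time is longer than 2 days.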
