- The system is down. Lost connection to SLS database nodes. Status in Ambari shows SLS was reported down a while ago.
- The collector box ran out of space on /opt/data and on the /var partition, both of which are at almost 100% disk utilization.
- The collector error and transaction logs show no successful loads for many days.
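
To confirm the disk-space symptom, a sketch along the following lines can report utilization for the two partitions named above. The function names and the 90% threshold in the usage note are illustrative assumptions, not part of the product.

```shell
#!/bin/sh
# Print the used-space percentage (integer, no % sign) of the filesystem
# holding the given path. `df -P` gives the stable POSIX column layout.
usage_pct() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: check_partitions THRESHOLD PATH...
# Prints one line per path: WARNING at or above the threshold, OK below it,
# SKIP when the usage figure cannot be read (for example, a missing path).
check_partitions() {
    threshold=$1
    shift
    for path in "$@"; do
        pct=$(usage_pct "$path")
        case $pct in
            ''|*[!0-9]*) echo "SKIP: cannot read usage for $path"; continue ;;
        esac
        if [ "$pct" -ge "$threshold" ]; then
            echo "WARNING: $path is at ${pct}% (threshold ${threshold}%)"
        else
            echo "OK: $path is at ${pct}%"
        fi
    done
}
```

On the collector, `check_partitions 90 /opt/data /var` would print one status line per partition.
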
Do some immediate maintenance to free up space on the collector, then restart the cluster and the collector:
- Stop the collector to avoid further noloads until the system is back up.
- Remove old Ambari logs and core dumps to free up disk space on the collector's
- Manually start the SLS/EDW cluster and perform quick health checks to confirm that there is no corruption in the SLS tables:
clssh --hosts=<hostname>,<hostname> "/opt/hexis/hawkeye-ap/etc/init.d/sensage_edw start"
- Reset the noloads and start the collector to begin processing the backlog.
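
For the cleanup step above, one cautious approach is to list old files first and delete them only after reviewing the list. The sketch below assumes hypothetical helper names; the directory to pass in (for example, wherever Ambari logs or core dumps are written on your collector) is site-specific and is not given in this article.

```shell
#!/bin/sh
# Usage: list_old_files DIR DAYS
# List regular files under DIR that match `find -mtime +DAYS`. find counts
# whole 24-hour periods, so DAYS=7 matches files at least 8 days old.
# Listing only: nothing is deleted.
list_old_files() {
    dir=$1
    days=$2
    find "$dir" -type f -mtime +"$days" -print
}

# Usage: purge_old_files DIR DAYS
# Delete exactly the files that list_old_files would list.
# Review the listing before running this.
purge_old_files() {
    dir=$1
    days=$2
    find "$dir" -type f -mtime +"$days" -exec rm -f -- {} +
}
```

Keeping the listing and the deletion in separate functions makes it easy to inspect what would be removed before any space is reclaimed.
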
Once the collector is running:
- On the collector's /opt/data partition, remove logs in the done directories to allow ingesting more logs. You can also move data from one of the log queues to free up disk space in the /opt/data partition of the collector node. After the backlog is successfully processed, move the logs back to the corresponding log queue.
- Ensure the current /opt/data utilization is normal, and that the files in the done directories are automatically swept by the scheduled cron job after 2 days. Reduce the retention time if it is more than 2 days.
- Review the inode utilization and the fragmented leaves for the table to check that they are normal.
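
The sweep of the done directories described above could look roughly like the following sketch. It is not the product's scheduled job: the function name is hypothetical, and it assumes that swept files live in directories literally named done somewhere beneath the root you pass in.

```shell
#!/bin/sh
# Usage: sweep_done_dirs ROOT MIN_DAYS
# Delete regular files that are at least MIN_DAYS days old and that sit
# inside a directory named "done" anywhere beneath ROOT. Files outside
# "done" directories are never touched.
sweep_done_dirs() {
    root=$1
    min_days=$2
    if [ "$min_days" -lt 1 ]; then
        echo "min_days must be at least 1" >&2
        return 1
    fi
    # find truncates age to whole days, so "-mtime +(N-1)" means
    # "at least N days old".
    find "$root" -type f -path '*/done/*' \
        -mtime +"$((min_days - 1))" -exec rm -f -- {} +
}
```

With a 2-day retention, the call would be `sweep_done_dirs /opt/data 2`. For the inode check, `df -i /opt/data` reports inode utilization for the partition on systems with GNU coreutils.
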