Loader Speed Degrading Due to IO Performance Issues

Overview

  • The SenSage AP collector Loader cannot keep up with the incoming load stream, resulting in a major backlog. Files in the log queue take too long to process.
  • Manual loads show that some nodes have storage issues: loads on these nodes take much longer to finish, CPU wait times run at 2-4%, and the load time averages ramp up to very high values.
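
To put a number on the CPU wait time on a given node, the following is a minimal sketch, assuming that the "CPU wait" figure above refers to the iowait counter that Linux exposes in /proc/stat. The function names iowait_pct and sample_iowait are hypothetical helpers and are not part of SenSage AP.

```shell
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9; i++) total += $i   # user through steal
          t[NR] = total; w[NR] = $6 }            # $6 is iowait
        END { d = t[2] - t[1]
              if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
              else print "0.0" }'
}

# Assumption: a 5-second sampling interval is long enough to be meaningful.
# Takes two samples and prints the iowait percentage over that interval.
sample_iowait() {
    local interval="${1:-5}" first second
    first=$(grep '^cpu ' /proc/stat)
    sleep "$interval"
    second=$(grep '^cpu ' /proc/stat)
    iowait_pct "$first" "$second"
}
```

Running sample_iowait 10 on each node while a manual load is in progress gives one comparable figure per node. A node that consistently reports a higher value than its peers is a candidate for the storage tests described under Diagnosis.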

 

Solution

Diagnosis

Storage IO performance degradation originates in the customer's environment and should be resolved by the customer's System Admin (SA) or IO Storage team. SenSage AP Customer Support can assist with troubleshooting and recommendations while the storage is upgraded.

A possible cause of the performance degradation is an incorrectly configured storage volume on the affected nodes, for example a mismatch between the existing storage and storage newly added to the servers.

To pinpoint the nodes affected by a storage/IO performance issue, run the following manual tests:

  • Stop the collector and run several manual file loads and compacts. Identify the nodes that take much longer to finish.
  • With the help of your SA team, install the iotop command and run a cycle of manual load and manual compact tests. If, on the most affected nodes, the processes using the most IO resources are jbd2 and flush, which are intrinsic to storage, this is a strong indication of IO performance issues.
  • Test the write speed of the storage device where /opt/data/sensage is mounted, using dd:

    dd if=/dev/zero of=/opt/data/sensage/test.img bs=1G count=1 oflag=dsync

    Example output:
    -bash-4.1# dd if=/dev/zero of=/opt/data/sensage/test.img bs=1G count=1 oflag=dsync
    1+0 records in
    1+0 records out
    1073741824 bytes (1.1 GB) copied, 2.44623 s, 439 MB/s

    Run the test on each node and note the nodes where the storage speed is much lower than the average or fluctuates widely between runs. Delete test.img when the test is finished.
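
Because a single dd run does not show whether the speed fluctuates, it can help to repeat the test several times per node. The following is a minimal sketch, assuming GNU dd and enough free space on the target volume. The function name io_write_test and its parameters are hypothetical. Note that each run rewrites the test file, so the default settings write 1 GB per run.

```shell
# Hypothetical helper: repeat the dd write test and print one summary line per run.
# Usage: io_write_test [directory] [runs] [block size]
io_write_test() {
    local dir="${1:-/opt/data/sensage}"   # volume under test
    local runs="${2:-5}"                  # number of repetitions
    local bs="${3:-1G}"                   # amount written per run
    local img="$dir/test.img"
    local i
    for i in $(seq 1 "$runs"); do
        printf 'run %s: ' "$i"
        # dd reports its statistics on stderr; keep only the final summary line.
        dd if=/dev/zero of="$img" bs="$bs" count=1 oflag=dsync 2>&1 | tail -n 1
    done
    # Remove the test file so it does not consume space on the data volume.
    rm -f "$img"
}
```

Calling io_write_test with no arguments runs five 1 GB writes against /opt/data/sensage. Compare the MB/s figures across runs on one node and across nodes: a healthy node should report similar values on every run.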

Solution

Engage the IO Storage team to decide the next steps.

It may be possible to correct the storage configuration on the current nodes, for example by adding more storage. If the last resort is to migrate the data, the IO Storage team will migrate the affected hosts to new storage.

  • While the data is being migrated or mirrored to the new servers, the collector can be stopped to speed up the migration.
  • Prepare a collector config.xml file with the retrievers disabled and the loaders enabled, to be used for loading the backlog once the storage migration is done.
  • Also prepare a config.xml file with all the retrievers and loaders enabled, to be applied once the entire backlog has loaded and you are ready to resume retrieving data for normal operation.

Related Articles

Changing Timestamps to Fix Fragmentation Issues After Collector Restart
