Overview
- The SenSage AP collector Loader cannot keep up with the incoming load stream, resulting in a major backlog. Files in the log queue are taking too long to process.
- Manual loads show that some nodes exhibit storage issues: they take much longer to finish, show 2-4% CPU wait times, and their average load times ramp up to very high values.
Solution
Diagnosis
Storage IO performance degradation issues are on the customer side and should be resolved by the customer's System Administrator (SA)/IO Storage team. SenSage AP Customer Support can assist with troubleshooting and recommendations while the storage is upgraded.
An incorrectly configured storage volume on the affected nodes, such as a mismatch between the existing storage and any newly added storage on the servers, can be the cause of the performance degradation.
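As a first pass, it can help to compare the data volume's layout and mount options on a suspect node against a healthy one. The following is a minimal sketch assuming the data directory is /opt/data/sensage, as in the dd test below; adjust the path to your installation.
# Run on both an affected node and a healthy node and compare the output;
# differences in device type, size, filesystem, or mount options can point
# to a storage configuration mismatch.
lsblk -o NAME,SIZE,TYPE,ROTA,MOUNTPOINT   # block device layout
df -h /opt/data/sensage                   # capacity and usage of the data volume
findmnt -T /opt/data/sensage              # filesystem type and mount options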
To pinpoint the nodes affected by a storage/IO performance issue, run some manual tests:
- Stop the collector and run several manual file loads and compacts. Identify the affected nodes that take much longer to finish.
- With the help of your SA team, install the iotop command and run a cycle of manual load and manual compact tests. If, on the most affected nodes, the processes using the most IO resources are jbd2 and flush, which are intrinsic to storage, this is a strong indication of IO performance issues.
- Also test the IO speed to the storage device where /opt/data/sensage is mounted, using dd:
dd if=/dev/zero of=/opt/data/sensage/test.img bs=1G count=1 oflag=dsync
Example output:
-bash-4.1# dd if=/dev/zero of=/opt/sensage/test.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 2.44623 s, 439 MB/s
Confirm the server nodes where the storage speed is much lower than average or highly fluctuating. A sketch of one way to script the iotop and dd checks follows this list.
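The following is a minimal sketch of scripting these checks on a node; the /opt/data/sensage path, run counts, and sampling intervals are assumptions to adjust for your environment, and iotop generally needs to run as root. Run the manual load/compact cycle in parallel so iotop captures the relevant activity.
# Repeat the dd write test to spot low or fluctuating throughput
# (throughput is reported on the last line of each run).
for i in 1 2 3 4 5; do
    dd if=/dev/zero of=/opt/data/sensage/test.img bs=1G count=1 oflag=dsync
    rm -f /opt/data/sensage/test.img
done

# Sample per-process IO in batch mode while a manual load/compact cycle runs:
# -b batch mode, -o only processes doing IO, -P processes (not threads),
# -a accumulated totals, -d 5 seconds between samples, -n 24 samples.
# Sustained heavy IO from jbd2 or flush on the affected nodes supports the diagnosis.
iotop -boPa -d 5 -n 24 > /tmp/iotop_load_test.log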
Solution
Engage the IO Storage team to decide on the next steps.
It may be possible to correct the storage configuration on the current nodes, for example by adding more storage, but if migrating the data is the last-resort solution, the IO Storage team will migrate the affected hosts to new storage.
- While the data migration/mirroring to the new servers is in progress, the collector can be stopped to speed up the migration.
- Prepare a load collector config.xml file with the retrievers disabled and the loaders enabled, in preparation to load the backlog after the storage migration is done.
- Also prepare a config.xml file with all the retrievers and loaders enabled, to apply when all the backlog has loaded and you are ready to start retrieving data for normal operation. A sketch of one way to stage and swap the two files follows this list.
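A minimal sketch of staging and swapping the two prepared files is below. The directory, the file names, and the restart step are placeholders rather than the actual SenSage AP layout: adjust them to where your collector's config.xml lives, and restart the collector with your standard procedure after each swap.
# Placeholder path: adjust to where the live collector config.xml resides.
CONFIG_DIR=/opt/sensage/collector/etc

# Stage the two prepared variants alongside the live file:
#   config.backlog.xml - retrievers disabled, loaders enabled (backlog catch-up)
#   config.normal.xml  - retrievers and loaders enabled (normal operation)
cp config.backlog.xml config.normal.xml "$CONFIG_DIR"/

# After the storage migration completes, activate the backlog configuration,
# then restart the collector using your standard procedure.
cp "$CONFIG_DIR"/config.backlog.xml "$CONFIG_DIR"/config.xml

# Once the backlog has fully loaded, switch to the normal configuration
# and restart the collector again.
cp "$CONFIG_DIR"/config.normal.xml "$CONFIG_DIR"/config.xml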
Related Articles
Changing Timestamps to Fix Fragmentation Issues After Collector Restart