Disk I/O is overloaded on the Zabbix server trigger status. What can we do about it?

The Zabbix server’s trigger status is showing “Disk I/O is overloaded”. We can help you with it.

This alert usually fires when the iowait value exceeds a configured threshold. Iowait is the share of time a processor spends idle while it waits for a disk operation to complete.
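To make the iowait figure concrete, here is a minimal sketch of how it can be derived from two samples of the aggregate "cpu" line in /proc/stat. It assumes a Linux host and the standard field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The function names are illustrative and are not part of Zabbix.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Return iowait as a percentage of the CPU time between two samples."""
    t0, t1 = cpu_times(before), cpu_times(after)
    deltas = [b - a for a, b in zip(t0, t1)]
    # Only the first eight fields are summed: guest and guest_nice are
    # already accounted for inside user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Index 4 is the iowait counter.
    return 100.0 * deltas[4] / total


# Two hypothetical samples taken a few seconds apart.
sample_1 = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_2 = "cpu  150 0 70 1500 280 0 0 0 0 0"
print(iowait_percent(sample_1, sample_2))
```

On a real host, the two strings would be the first line of /proc/stat read at the start and end of the sampling interval.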

With our Server Management Services, we perform Zabbix trigger audits on a regular basis to keep your systems healthy.

Below, you’ll see how our Support Engineers help customers resolve this alert.

What is a Zabbix server?

Zabbix is an open-source monitoring tool for servers, networks, applications, and databases.

We use it for database monitoring, application monitoring, and network monitoring, among many other things.

With monitoring in place, we learn about service failures as soon as they happen. Zabbix triggers alert the server administrator to a problem, so that it can be dealt with promptly.

Disk I/O alerts usually fire when the physical hard drives are either saturated or too slow. Either condition can make the whole server very slow.

We’ve been getting a lot of requests concerning Zabbix errors. We help our customers by providing quick fixes and helping them correct these issues.

Understanding Disk I/O (Input/Output) – When should you be worried?

If you’ve ever used floppy disks, you might remember the sound the drive made while retrieving data. That sound was the telltale sign of an I/O bottleneck. For example, when playing Oregon Trail from a floppy disk, the game would stop for a minute or two at regular intervals while new data was loaded from the disk. The CPU had to sit idle for that whole time. If the floppy drive had been faster, you’d already be running the Columbia River rapids.

A disk bottleneck is hard enough to spot on your own machine, and harder still when the disk sits in a remote server rather than on your desk.

What impacts I/O performance?

For random-access workloads such as a database, mail server, or file server, the main performance measure to consider is how many input/output operations the storage can perform per second (IOPS).

The four main factors that influence IOPS are:

  • Multidisk Arrays – More disks in the array mean more IOPS. If one disk can perform 150 IOPS, two disks can perform 300 IOPS.

  • Average IOPS per-drive – The more IOPS each drive can handle, the greater the total IOPS capacity. For spinning disks, this is largely determined by the rotational speed of the drive.

  • RAID Factor – Your application is likely using a RAID configuration for storage, which spreads data over multiple disks for reliability and redundancy. Some RAID levels carry a significant penalty for write operations, while read-heavy workloads are affected far less. With RAID 1 or RAID 10, a write request requires only two disk operations. The fewer disk operations each request needs, the higher the IOPS capacity.

  • Read and Write Workload – If your workload is write-heavy and your RAID level needs many disk operations to complete each write, your IOPS will be lower.

Calculating your maximum IOPS

There are several ways to calculate theoretical IOPS, which is the maximum number of I/O operations your storage can perform per second. You can then compare it to your actual IOPS to see whether there is a problem.

You can estimate the theoretical maximum IOPS of a disk array with the following equation:

$$
\text{I/O operations per second} = \frac{\text{number of disks} \times \text{average I/O operations per second on one disk}}{\%\ \text{of read workload} + (\text{RAID factor} \times \%\ \text{of write workload})}
$$
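As a worked example, the equation above can be read as (number of disks × per-disk IOPS) divided by (read share + RAID factor × write share), with the shares written as fractions between 0 and 1. The sketch below follows that reading. The RAID 1/10 factor of 2 comes from the list above. The other penalties in the table are commonly cited values added here for illustration, and the function name is an assumption.

```python
# Disk operations needed per write request. RAID 1/10 is from the text;
# the other entries are commonly cited values, included as assumptions.
RAID_WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}


def theoretical_iops(num_disks, iops_per_disk, read_fraction, raid_factor):
    """Estimate the maximum IOPS of an array for a given read/write mix."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    if num_disks <= 0 or iops_per_disk <= 0 or raid_factor <= 0:
        raise ValueError("disk count, per-disk IOPS and RAID factor must be positive")
    write_fraction = 1.0 - read_fraction
    raw_iops = num_disks * iops_per_disk
    return raw_iops / (read_fraction + raid_factor * write_fraction)


# Four 150-IOPS disks in RAID 10 with an even read/write split.
print(theoretical_iops(4, 150, 0.5, RAID_WRITE_PENALTY["raid10"]))
```

With a purely read workload the denominator is 1, so the result equals the raw capacity of the array. Every extra share of writes lowers it according to the RAID factor.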

Every value in this equation can be taken from your hardware specs except the read/write workload. To measure the read/write workload, use a tool like sar.

Once you’ve calculated your theoretical IOPS, compare it to the tps column that sar displays for the device. Tps stands for transfers per second and indicates the number of transfers issued to the device each second. If tps is approaching your theoretical IOPS, the disks are probably close to capacity.
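The comparison can be sketched as a simple ratio of observed tps to theoretical IOPS. The tps values are assumed to have been copied by hand from sar output. The 0.8 warning threshold and the function names are illustrative assumptions, not values from sar or Zabbix.

```python
def iops_utilization(tps_samples, theoretical_iops):
    """Return peak observed tps as a fraction of the theoretical IOPS."""
    if theoretical_iops <= 0:
        raise ValueError("theoretical_iops must be positive")
    if not tps_samples:
        raise ValueError("at least one tps sample is required")
    return max(tps_samples) / theoretical_iops


def near_capacity(tps_samples, theoretical_iops, threshold=0.8):
    """Flag the device when peak tps reaches the chosen share of capacity."""
    return iops_utilization(tps_samples, theoretical_iops) >= threshold


# Hypothetical tps readings for one device, against a 400 IOPS estimate.
readings = [120.0, 300.0, 360.0]
print(iops_utilization(readings, 400.0), near_capacity(readings, 400.0))
```

Using the peak rather than the average is a deliberate choice here: short bursts at capacity are what users notice as stalls.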

What’s the best path to fixing an I/O bottleneck?

If your disk I/O bottleneck is a problem now, tuning your hardware might not be the fastest remedy. Hardware changes require a lot of testing, data migration, and communication between developers and sysadmins.

We first try to figure out which service is the most I/O intensive and cache more of its data in RAM. For example, we usually give our database servers as much RAM as possible (up to 64 GB) so that MySQL can cache as much of the data as possible in memory.

Three takeaways

  • Disk access is slooowww – Disk access is much slower than RAM access. By some estimates, reading data from disk can take thousands of times longer than reading it from memory.

  • Optimize your apps first – Tuning your disk hardware is likely to be a long and difficult process. However, read-heavy services (such as databases) can often serve requests more quickly if they’re reading from memory.

  • Measure – Changes to your application can have a big impact on how much it reads from and writes to storage. Record the key I/O metrics over time to understand how much disk load each modification adds.
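One way to record such a metric is to sample /proc/diskstats twice and divide the change in completed I/Os by the interval. The sketch below assumes a Linux host and the standard diskstats layout (major, minor, device name, reads completed, reads merged, sectors read, ms reading, writes completed, and so on). The device name and sample text are made up for illustration.

```python
def completed_ios(diskstats_text, device):
    """Return reads + writes completed for `device` from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields[2] is the device name, fields[3] reads completed,
        # fields[7] writes completed.
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]) + int(fields[7])
    raise KeyError(device)


def measured_iops(before_text, after_text, device, interval_seconds):
    """Average IOPS for `device` between two /proc/diskstats snapshots."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = completed_ios(after_text, device) - completed_ios(before_text, device)
    return delta / interval_seconds


# Two hypothetical snapshots taken 10 seconds apart.
snap_1 = "   8       0 sda 1000 10 8000 500 2000 20 16000 900 0 700 1400"
snap_2 = "   8       0 sda 1600 12 9000 650 2900 25 18000 990 0 820 1640"
print(measured_iops(snap_1, snap_2, "sda", 10))
```

Storing this number at a fixed interval gives a baseline to compare against after each application change.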


What is a housekeeper in Zabbix?

According to the documentation, the housekeeper is a periodic process executed by the Zabbix server.

The process removes outdated information and information deleted by the user.

Additionally, this function keeps the database from accumulating data that is no longer needed.

The housekeeper itself is a common cause of performance problems, because each cleanup run puts extra load on the database.

When you store some records that you want to keep for a long time and others that you only need temporarily, the housekeeper removes the temporary ones once they are outdated.

How did we fix the Zabbix I/O alert?

Recently, one of our customers approached us with an interesting Zabbix request. They were getting the alert “Disk I/O is overloaded on Zabbix server” in the trigger status.

Here, this happened because of the housekeeper settings. In the Zabbix configuration file, we found that HousekeepingFrequency was disabled.

With HousekeepingFrequency set to 1, the housekeeper runs every hour and deletes no more than 4 hours of outdated information per cycle.

We also defined a mechanism called “hysteresis” on these particular triggers:

({TRIGGER.VALUE} = 0 & {zabbix:system.cpu.util[,iowait].avg(5m)}>70) | ({TRIGGER.VALUE} = 1 & {zabbix:system.cpu.util[,iowait].avg(5m)}>40)

With this expression, the trigger raises an alert when the 5-minute average iowait rises above 70, and the alert stays active until that average falls back to 40 or below.
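The hysteresis behaviour of that expression can be illustrated with a small simulation. This is only a sketch of the logic, not Zabbix code: the trigger state plays the role of {TRIGGER.VALUE}, the inputs are hypothetical 5-minute iowait averages, and the function names are assumptions.

```python
def next_state(in_problem, iowait_avg, high=70.0, low=40.0):
    """Evaluate the hysteresis expression for one 5-minute average.

    Mirrors: (value = 0 and avg > high) or (value = 1 and avg > low).
    """
    if in_problem:
        # Already alerting: stay in problem state until avg drops to `low`.
        return iowait_avg > low
    # Currently OK: only a reading above `high` raises the alert.
    return iowait_avg > high


def simulate(averages, high=70.0, low=40.0):
    """Return the trigger state after each successive average."""
    state = False
    states = []
    for avg in averages:
        state = next_state(state, avg, high, low)
        states.append(state)
    return states


# Hypothetical series of 5-minute iowait averages.
print(simulate([30, 50, 75, 60, 45, 40, 50]))
# → [False, False, True, True, True, False, False]
```

Note that the reading of 50 is ignored before the alert fires and again after it clears, but keeps the alert active in between. That gap between the two thresholds is what stops the trigger from flapping.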

In short, the problem occurs while the housekeeper is running, so we need to adjust its frequency accordingly.
