High Load

From GoBlueMich Wiki



Purpose

This wiki page provides the information needed to troubleshoot load issues on servers. It may not cover every scenario, but it should help fill in the gaps.

Types of Load

There are generally three resources that get overloaded on a server: CPU, RAM, and Disk IO.

CPU

This is the raw processing power of the server. If a server is maxing out its CPU, it will be very sluggish to respond to commands and may report as down, but it generally will not completely crash.

Examples of things that can max out the CPU of a server include:

  • PHP Scripts that require lots of calculation
  • GZIP
  • MySQL queries

RAM

Every program running on a server runs in RAM. If the server runs out of RAM and has a swap partition, it will begin using the hard drive as additional memory.

If a server is properly configured, this swap usage will be brief and will not crash the server. However, if the server is configured to be able to use, for example, 10G of memory while it only has 2G of RAM and 2G of swap, then it is configured in a way that can crash it, and quickly.

Examples of things that can max out the RAM of a server:

  • PHP/Apache
    • If the PHP memory_limit multiplied by Apache's MaxClients exceeds the amount of RAM in the server, the server can crash by running out of memory.
  • MySQL
    • If MySQL's maximum configured memory limit exceeds the amount of RAM in the server, it can crash the server.
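The PHP/Apache case above reduces to simple arithmetic. This is a minimal sketch with made-up numbers; substitute the server's real memory_limit, MaxClients, and RAM:

```shell
# Hypothetical values -- read the real ones from php.ini and httpd.conf.
MEMORY_LIMIT_MB=64   # PHP memory_limit
MAXCLIENTS=150       # Apache MaxClients
SERVER_RAM_MB=4000   # physical RAM

# Worst case: every Apache slot runs a PHP script at the full memory_limit.
WORST_CASE_MB=$((MEMORY_LIMIT_MB * MAXCLIENTS))
echo "Worst case: ${WORST_CASE_MB}M of ${SERVER_RAM_MB}M RAM"

if [ "$WORST_CASE_MB" -gt "$SERVER_RAM_MB" ]; then
  echo "These settings can OOM the server under full load"
fi
```

Here 64M x 150 = 9600M against 4000M of RAM, so this configuration is capable of crashing the server.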

Running Out Of Ram

Why servers run out of memory and crash:

Running out of RAM is completely avoidable. The PHP memory_limit and Apache MaxClients settings are designed specifically to stop this from happening. If they are configured properly, PHP and Apache cannot cause a server to OOM.

A common misconception:

It is a common misconception that it is OK to set these higher than a server can handle. The reason usually given is that "most PHP scripts will be using less than 1M of memory, so it is OK to set MaxClients really high."

This is exactly why we have servers that OOM and crash regularly. The truth of the matter is that if most PHP scripts on the server used less than 1M of memory, we would be able to set the global memory_limit to 1M. That is not usually the case.

Why the following approach is better:

The following approach to memory management is superior because we use the resources and tools available to determine how much the server can actually handle, then set the settings appropriately for that limit. With this approach we can take the math to the customer and explain clearly why the settings need to be this way.

Once the server settings are optimized to prevent crashing, the limits will only be reached during periods when the server would otherwise have crashed. So instead of the server crashing and us trying to figure out what happened afterwards, we have the opportunity to log in while the problem is still going on and find out why those limits are being reached.

What memory_limit does:

The PHP memory_limit caps how much memory each PHP script can use per execution, i.e. per call from a single visitor. This means that if the memory_limit were 32M and file.php used 10M with 10 simultaneous visitors (100M total), no limits would be reached. Whereas if file.php required 40M of memory and received 1 or 10 visitors, 1 or 10 memory_limit errors would occur respectively.

How memory_limit should be set:

The first step is getting the memory_limit as low as possible so that the majority of sites/pages on the server still work and then increase it locally where required. This requires time, effort, and downtime.

We must set the memory_limit as low as possible, restart Apache, and then test the sites. It is helpful to tail the Apache error_log or the site's error_log during this to see if there are any memory errors. If the majority of sites/scripts on the server have problems, increase it and try again.
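To spot memory errors while testing, filter the log for PHP's out-of-memory message. The log path below is an example and varies by build, and the sample log line is fabricated to show what a hit looks like:

```shell
# Live monitoring (path is an example; adjust for the server):
#   tail -f /usr/local/apache/logs/error_log | grep -i "allowed memory"

# What a memory_limit failure looks like in the log (fabricated sample line):
LOGLINE='[error] PHP Fatal error:  Allowed memory size of 33554432 bytes exhausted (tried to allocate 1048576 bytes) in /home/user/public_html/index.php on line 42'
HITS=$(printf '%s\n' "$LOGLINE" | grep -ci "allowed memory size")
echo "memory errors found: $HITS"
```

A climbing count here while you lower the global memory_limit tells you which sites need a local increase.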

Once we have a good value for the global memory_limit, we can go through and increase it for individual scripts, entire folders, or entire accounts as needed. Keep in mind, however, that you will need to use your best judgement to factor those higher local limits in when setting MaxClients.

Seeing MySQL Max Ram

Optimizing MySQL is currently outside the scope of this wiki; there is a MySQL wiki specifically for that. You can see the current MySQL memory limit by running the /scripts/tuning-primer.sh script, found in the MySQL wiki. It gives output like this:

 MEMORY USAGE
 Max Memory Ever Allocated : 208 M
 Configured Max Per-thread Buffers : 637 M
 Configured Max Global Buffers : 202 M
 Configured Max Memory Limit : 839 M
 Physical Memory : 7.73 G
 Max memory limit seem to be within acceptable norms

If MySQL has been running for a long time and the Max Memory Ever Allocated is much lower than the Configured Max Memory Limit, consider tuning MySQL for lower limits, since the limits are not being reached and may be set too high.

Setting Apache MaxClients


The name of the Apache MaxClients setting is slightly misleading. The setting limits how many Apache connections can be actively processed and serving pages to clients at once. If this limit is reached, however, the server does not outright refuse additional connections as the name might imply. The additional connections are queued up to a limit set by the ListenBacklog setting, which is 500 by default.

This means that if MaxClients is 60, and ListenBacklog is 500, there could be up to 560 connections at once.

The goal is to let the server serve as many connections at once as possible without staying in swap for very long, if at all. The server will queue additional connections up to the ListenBacklog limit and will serve those queued connections much faster if it stays in RAM and out of swap.

For the sake of example we will assume the following:

  • Server ram 4000M
  • MySQL memory_limit 500M
  • PHP memory_limit 32M
  • Swap size 4000M

Subtract MySQL from total ram:

4000 - 500 = 3500

Divide by php memory_limit:

 3500 / 32 = 109 

Other system processes (exim, courier, the kernel, etc.) will also be using RAM, but those are less likely to suddenly cause the server to OOM.

Using my best judgement, I would say that with this example a MaxClients of 100-109 would be a good starting point to keep the server in RAM and out of swap/OOMing.
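The two steps above can be scripted so the math travels with the ticket. This sketch simply reproduces the example numbers:

```shell
# All values in megabytes -- substitute the real server's figures.
SERVER_RAM=4000
MYSQL_MAX=500    # "Configured Max Memory Limit" from tuning-primer
PHP_LIMIT=32     # global PHP memory_limit

AVAILABLE=$((SERVER_RAM - MYSQL_MAX))   # RAM left after MySQL: 3500
MAXCLIENTS=$((AVAILABLE / PHP_LIMIT))   # connections that fit in RAM: 109
echo "Suggested MaxClients starting point: $MAXCLIENTS"
```

Integer division rounds down, which errs on the safe side for this calculation.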

The server may still need additional tweaking after this. Perhaps the other services on the server use more memory than we expected, so we may need to lower MaxClients further to keep the server out of swap; alternatively, we may be able to tweak the memory_limit or MySQL to save some memory.

Things to watch out for

  • php.ini files in sites
  • Scripts setting their own memory_limit
  • .htaccess files setting the memory_limit

If any of these things are present, the server could still run out of memory and crash if your calculation was based solely on the global memory_limit.

How to deal with reaching MaxClients

Once the server is no longer crashing, it will likely reach MaxClients at points, perhaps often or even constantly.

  • Make sure sites are using caching to get the pages loaded and completed as fast as possible
  • Make sure there is an Opcode caching plugin installed and working if applicable.
  • Check the Apache Status page in WHM to see what all the connections are to
    • If everything is coming from 1 IP, consider ways to deal with it
      • Blocking the offending IP is a possibility. Do a whois first and make sure it's not Google.
    • If the connections are from search engines consider using a robots.txt file

There are many other ways to deal with this, but reaching MaxClients is better than crashing. Crashing corrupts databases, breaks filesystems, and causes all sorts of problems. Reaching MaxClients instead of crashing gives us a chance to optimize things further.


How to explain this to the customer

Customers tend to get worried when you tell them that you would like to lower the MaxClients on the server. When you explain the math behind the settings, they usually understand. Another good point to bring up is that these settings will only be reached when the server would previously have crashed; instead of crashing, most visitors will still be able to see the sites.

Also keep in mind that this is always the customer's decision; if you explain the math to them and they would prefer to get more RAM, that is up to them.


Disk IO

This is how fast the server can read from and write to its hard drives. One of the most common causes of high disk IO is running out of RAM. This is because once the server runs out of RAM it will begin using the disk as additional memory, and hard drives are orders of magnitude slower than RAM.

Examples of things that can max out the Disk IO of a server:

  • Running out of ram and swapping to disk
  • MySQL queries writing temporary tables to disk
  • Large amounts of email being sent from the server
  • Backups running

Tools

There are many ways a server can be overloaded; the first step is determining which server resource is being maxed out so we can look into what is causing it.

Mr Radar in Billing

The first tool at our disposal for seeing the problem is the Sub Accounts page in Billing. Click the arrow to the left of the server in question and you should see the Mr Radar graph on the right hand side. This graph by default shows the 15 minute load average. Any part of this graph that is in red indicates our monitoring servers were unable to poll the service chosen for "Report on down" in the monitoring subsection of this page. By default on managed servers this is httpd.

Useful information that can be obtained from this:

  • Patterns: is the problem always happening at the same time?
  • Swap usage
    • Click the dropdown to change "15 min load" to "Swap Used"
    • Is the server always swapping to disk before the outage?
  • Correlations between load and bandwidth spikes
    • Does the server's load spike when there is a bandwidth usage spike?

Example 15 min load:

File:Load radar 15min.jpg

Example Swap usage:

File:Load radar swap.jpg

System Commands

Top

The top command is useful when the load is happening while you are logged into the server. This is a partial example of what it looks like:

File:Load top.jpg

What this means:

  • load average: 0.68, 0.57, 0.48
    • 0.68 1 minute load average
    • 0.57 5 minute load average
    • 0.48 15 minute load average
  • Tasks: 170 total, 1 running, 169 sleeping, 0 stopped, 0 zombie
    • 170 total processes open
    • 1 actively using CPU
    • 169 idle, but still using RAM
  • Cpu(s): 9.6%us, 0.4%sy, 0.0%ni, 89.7%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
    • 9.6%us
      • Percentage of CPU time spent on User level processes (apache, mysql, most other services)
    • 0.4%sy
      • Percentage of CPU time spent on System/Kernel processes
    • 89.7%id
      • Percentage of CPU time spent idle
    • 0.3%wa
      • Percentage of CPU time spent waiting on Disk IO


htop

LiquidWeb Install:

 mkdir -p /home/lwtemp
 wget -O /home/lwtemp/htop-installer.sh http://layer3.liquidweb.com/lantern/htop.sh
 chmod +x /home/lwtemp/htop-installer.sh
 /home/lwtemp/htop-installer.sh
 rm /home/lwtemp/htop-installer.sh

Source Install:

 cd /usr/local/src/
 wget http://www.sfr-fresh.com/linux/misc/htop-0.9.tar.gz
 tar zxf htop-0.9.tar.gz
 cd htop-0.9
 ./configure && make && make install

Sar

The sar command is useful for seeing the resource usage of a server over time. It parses information logged by sysstat, which by default captures data every 10 minutes. It gives a historic view of the same information you get from the Cpu(s) row of top.

Common flags:

  • None, just sar by itself will show CPU usage and IO wait over time
  • -r, `sar -r` will show you memory and swap usage over time

Example of stock `sar` command:

File:Load sar.jpg

What to look for:

  • Are there particular times when there are problems?
  • Is the problem CPU usage?
  • Is the problem Disk IO (iowait)?
    • If it is Disk IO, see the section for sar -r

Example of `sar -r` command:

File:Load sar-r.jpg

What to look for:

  • The main thing to look at here is %swpused
    • If it is a stable percentage like 1.78 all the way down there is likely no problem.
    • If it grows rapidly or goes up and down a lot, that would explain high iowait in sar and top.

Iostat

iostat -x

  • avgqu-sz shows how many requests are in the queue
  • await shows the average time each request waits (in milliseconds)
  • example: 20 requests waiting for the disk with await near 300 means roughly 20 x 300 = 6,000 ms, or 6 seconds
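That estimate is just queue depth times per-request wait; a quick check with the example numbers:

```shell
AVGQU_SZ=20    # requests already queued (avgqu-sz)
AWAIT_MS=300   # average wait per request in ms (await)
WAIT_MS=$((AVGQU_SZ * AWAIT_MS))
echo "Estimated queue wait: ${WAIT_MS} ms"   # 6000 ms, about 6 seconds
```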


grep -si imuxsock /var/log/messages

  • usually if it's high I/O, something is being logged excessively or the named query cache is getting hammered
  • example: Oct 7 09:49:27 host rsyslogd-2177: imuxsock begins to drop messages from pid 1791 due to rate-limiting
  • ps aux | grep 1791 to identify the process flooding the log

Check Domlogs

Check for WordPress Brute Force

Link to WordPress Wiki

EA3

Check for POSTs to wp-login.php today

 grep -s "$(date +"%d/%b/%Y")" /usr/local/apache/domlogs/* | grep wp-login.php | grep POST | cut -d: -f1 | sort | uniq -c | sort -nr | head -25


EA4

Check domlogs for POSTs to wp-login.php today

 grep -s "$(date +"%d/%b/%Y")" /etc/apache2/logs/domlogs/*|grep wp-login.php|grep POST|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

Check rotated domlogs for POSTs to wp-login.php (adjust the month/year in the filename to match the day you are searching)

zgrep -s "$(date +"%d/%b/%Y")" /home*/*/logs/*-Aug-2018.gz |grep wp-login.php|grep POST|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5
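The counting pipeline used throughout these commands is easier to trust once you have seen it run on known input. This demo feeds it two fabricated domlog lines in the `filename:logline` shape that grep emits when searching multiple files:

```shell
printf '%s\n' \
  '/etc/apache2/logs/domlogs/site-a.com:1.2.3.4 - - [26/Aug/2018:06:01:02 -0400] "POST /wp-login.php HTTP/1.1" 200 1234' \
  '/etc/apache2/logs/domlogs/site-a.com:5.6.7.8 - - [26/Aug/2018:06:02:03 -0400] "POST /wp-login.php HTTP/1.1" 200 1234' \
  | grep wp-login.php | grep POST \
  | awk '{print $1}' | cut -d':' -f1 \
  | sort | uniq -c | sort -n | tail -n5
# prints "2 /etc/apache2/logs/domlogs/site-a.com" -- the vhost log with the most POSTs
```

awk's `$1` is `filename:clientIP`, and the `cut -d':' -f1` strips the IP so uniq counts hits per log file.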


Grep Domlogs For $DATE (EA3)

Set $DATE Variable:

Date Formatting: "26/Mar/2018:16:42:27"

DATE="26/Mar/2018:16:42:27"

Search all of today:

DATE="$(date +"%d/%b/%Y")"

Ex: Search 6AM hour today:

DATE="$(date +"%d/%b/%Y"):06"
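Before grepping gigabytes of logs, it is worth confirming the pattern matches the domlog timestamp shape (this assumes an English/C locale, which is what domlogs use):

```shell
DATE="$(date +"%d/%b/%Y")"        # e.g. 26/Mar/2018
echo "$DATE"
# The pattern should look like DD/Mon/YYYY:
echo "$DATE" | grep -Eq '^[0-9]{2}/[A-Za-z]{3}/[0-9]{4}$' && echo "format OK"
```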

Find which site was getting hit

POST:

 grep -s "$DATE" /usr/local/apache/logs/domlogs/*|grep POST|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

GET:

 grep -s "$DATE" /usr/local/apache/logs/domlogs/*|grep GET|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

Set $SITE Variable:

SITE="/usr/local/apache/logs/domlogs/domain.com"

or

SITE="/home/CPUSER/logs/domain.com"

Find what was being hit on the site

POST:

 grep -s "$DATE" $SITE|grep POST|awk '{print $7}'|sort |uniq -c|sort -n|tail -n5

GET:

 grep -s "$DATE" $SITE|grep GET|awk '{print $7}'|sort |uniq -c|sort -n|tail -n5

Find the IP hitting that site

POST:

 grep -s "$DATE" $SITE|grep POST|awk '{print $1}'|sort |uniq -c|sort -n|tail -n5

GET:

 grep -s "$DATE" $SITE|grep GET|awk '{print $1}'|sort |uniq -c|sort -n|tail -n5

Grep Domlogs For $DATE (EA4)

Set $DATE Variable

echo "Setting DATE variable to check"

Date Formatting: "26/Mar/2018:16:42:27"

DATE="26/Mar/2018:16:42:27"

Search all of today:

DATE="$(date +"%d/%b/%Y")"

Ex: Search 6AM hour today:

DATE="$(date +"%d/%b/%Y"):06"

Set $MO and $YR Variables

echo "Setting MO and YR variables to check rotated logs"
MO=`echo "$DATE" | egrep -o "[a-zA-Z]+"`
YR=`echo "$DATE" | egrep -o "20[0-9]{2}"`
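Running the extraction on a sample date shows what lands in each variable (the year regex here allows any 20xx year):

```shell
DATE="26/Mar/2018:16:42:27"
MO=$(echo "$DATE" | grep -Eo "[a-zA-Z]+")    # the only alphabetic run: Mar
YR=$(echo "$DATE" | grep -Eo "20[0-9]{2}")   # the year: 2018
echo "rotated logs match: *-${MO}-${YR}.gz"
```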

Find which site was getting hit


POST:

 echo "Checking current domlogs sites with most POST requests"
 grep -s "$DATE" /etc/apache2/logs/domlogs/*|grep POST|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

GET:

 echo "Checking current domlogs sites with most GET requests"
 grep -s "$DATE" /etc/apache2/logs/domlogs/*|grep GET|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

Rotated POST:

 echo "Checking rotated domlogs sites with most POST requests"
zgrep -s "$DATE" /home*/*/logs/*-"$MO"-"$YR".gz |grep POST|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

Rotated GET:

 echo "Checking rotated domlogs sites with most GET requests"
zgrep -s "$DATE" /home*/*/logs/*-"$MO"-"$YR".gz |grep GET|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

Set $SITE Variable:

 echo "Setting SITE var based on high request count"
SITE="/etc/apache2/logs/domlogs/domain.com"

or

SITE="/home/CPUSER/logs/domain.com"

Find what was being hit on the site

POST:

 echo "Checking POST requests for SITE"
 zgrep -s "$DATE" "$SITE"|grep POST|awk '{print $7}'|sort |uniq -c|sort -n|tail -n5

GET:

 echo "Checking GET requests for SITE"
 zgrep -s "$DATE" "$SITE"|grep GET|awk '{print $7}'|sort |uniq -c|sort -n|tail -n5

Find the IP hitting that site

POST:

 zgrep -s "$DATE" "$SITE"|grep POST|awk '{print $1}'|sort |uniq -c|sort -n|tail -n5

GET:

 zgrep -s "$DATE" "$SITE"|grep GET|awk '{print $1}'|sort |uniq -c|sort -n|tail -n5


Grep Domlogs For Today Plesk 12+

Set $DATE Variable:

Date Formatting: "26/Mar/2018:16:42:27"

DATE="26/Mar/2018:16:42:27"

Search all of today:

DATE="$(date +"%d/%b/%Y")"

Ex: Search 6AM hour today:

DATE="$(date +"%d/%b/%Y"):06"

Find which site was getting hit


POST:

zgrep -s "$DATE" /var/www/vhosts/*/logs/access_log* |grep POST|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5

GET:

zgrep -s "$DATE" /var/www/vhosts/*/logs/access_log* |grep GET|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5


Set $SITE Variable:

SITE="/var/www/vhosts/DOMAIN/logs/access_log"

Find what was being hit on the site:

POST:

zgrep -s "$DATE" "$SITE" |grep POST|awk '{print $7}'|sort |uniq -c|sort -n|tail -n5

GET:

zgrep -s "$DATE" "$SITE" |grep GET|awk '{print $7}'|sort |uniq -c|sort -n|tail -n5


Find the IP hitting that site:

POST:

zgrep -s "$DATE" "$SITE" |grep POST|awk '{print $1}'|sort |uniq -c|sort -n|tail -n5

GET

zgrep -s "$DATE" "$SITE" |grep GET|awk '{print $1}'|sort |uniq -c|sort -n|tail -n5

Console Smart Server

LoadWatch