Maximizing performance
This article covers basic performance diagnostics and the steps that can be taken to reduce resource consumption or otherwise optimize the system. The goal is an improvement in performance, whether perceived or measured.
The basics
Know your system
The best way to tune a system is to target bottlenecks, or subsystems which limit overall speed. The system specifications can help identify them.
- If the computer becomes slow when large applications (such as OpenOffice.org and Firefox) run at the same time, check if the amount of RAM is sufficient. Use the following command, and check the "available" column:
$ free -h
- If boot time is slow, and applications take a long time to load at first launch (only), then the hard drive is likely to blame. The speed of a hard drive can be measured with the hdparm command:
# hdparm -t /dev/sdx
hdparm indicates only the raw read speed of a hard drive, and is not a valid benchmark. A value higher than 40 MB/s (while idle) is, however, acceptable on an average system.
- If CPU load is consistently high even with enough RAM available, then lowering CPU use should be a priority. This can be monitored in several ways, for example with htop:
$ htop
- If only applications using direct rendering are slow (i.e. those which use the GPU, such as video players and games), then improving GPU performance should help. The first step is to verify that direct rendering is actually enabled. This is indicated by the glxinfo command:
$ glxinfo | grep direct
glxinfo is part of the mesa-demos package.
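The RAM check from the first item can also be scripted. The following is a minimal sketch, assuming a kernel that exposes the MemAvailable field in /proc/meminfo; the function name mem_available_pct is made up for this illustration.

```shell
# Hypothetical helper: print available memory as a whole-number percentage of
# total memory. Reads /proc/meminfo-style text from standard input.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }   # value in kB
        /^MemAvailable:/ { avail = $2 }   # value in kB
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    '
}
```

It would be called as mem_available_pct < /proc/meminfo. A result that stays very low while large applications are open suggests that RAM is the bottleneck.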
The first thing to do
The simplest and most efficient way of improving overall performance is to run lightweight environments and applications.
- Use a window manager instead of a desktop environment. Choices include Awesome, dwm, Fluxbox, i3, JWM, Openbox, wmii and xmonad. If choosing a desktop environment, consider LXDE or Xfce.
- Use pstree or htop to list running daemons and their resource use.
Benchmarking
The effects of optimization are often difficult to judge. They can however be measured by benchmarking tools.
Storage devices
Swap files
Placing your swap files on a separate disk can also help considerably, especially if your machine swaps frequently. Swapping happens when there is not enough RAM for your environment. Using KDE with all its features and applications may require several GiB of memory, whereas a tiny window manager with console applications fits comfortably in less than 512 MiB.
RAID
If you have multiple disks available, you can set them up as a software RAID for serious speed improvements. A RAID 0 array has no redundancy in case of drive failure, but each additional disk in the array adds roughly its own throughput to that of the array.
Multiple hardware paths
An internal hardware path is how the storage device is connected to your motherboard. There are different ways to connect to the motherboard, such as TCP/IP through a NIC, or directly via PCIe/PCI, Firewire, a RAID card, USB, etc. By spreading your storage devices across these connection points you make the most of your motherboard's capabilities: for example, 6 hard drives connected via USB would be much slower than 3 over USB and 3 over Firewire. The reason is that each entry path into the motherboard is like a pipe, and there is a set limit to how much can go through that pipe at any one time. The good news is that the motherboard usually has several pipes.
More Examples
- Directly to the motherboard using PCI/PCIe/ATA
- Using an external enclosure to house the disk over USB/Firewire
- Turn the device into a network storage device by connecting over TCP/IP
Note also that if you have 2 USB ports on the front of your machine and 4 USB ports on the back, and you have 4 disks, it would probably be fastest to put 2 on the front and 2 on the back, or 3 on the back and 1 on the front. This is because internally the front ports are likely on a separate root hub from the back ports, meaning you can send twice as much data by using both rather than just one. Use the following commands to determine the various paths on your machine.
USB Device Tree
$ lsusb -tv
PCI Device Tree
$ lspci -tv
Partitioning
If using a traditional spinning HDD, your partition layout can influence the system's performance. Sectors at the beginning of the drive (closer to the outside of the disk) are faster than those at the end. Also, a smaller partition requires fewer movements of the drive's head, which speeds up disk operations. Therefore, it is advised to create a small partition (10 GB, more or less depending on your needs) only for your system, as near to the beginning of the drive as possible. Other data (pictures, videos) should be kept on a separate partition; this is usually achieved by separating the home directory (/home/user) from the system (/).
Choosing and tuning your filesystem
Choosing the best filesystem for a specific system is very important because each has its own strengths. The File systems article provides a short summary of the most popular ones. You can also find relevant articles here.
Mount options
The noatime option is known to improve performance of the filesystem.
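To see which mounted filesystems do not yet use the option, the mount table can be filtered. The following is a minimal sketch, assuming input in the /proc/mounts format (device, mount point, type, options); the function name is hypothetical.

```shell
# Hypothetical helper: list mount points of /dev-backed filesystems whose
# options do not include noatime. Reads /proc/mounts-style text from stdin.
mounts_without_noatime() {
    awk '$1 ~ /^\/dev\// && $4 !~ /(^|,)noatime(,|$)/ { print $2 }'
}
```

It would be called as mounts_without_noatime < /proc/mounts. For each mount point it prints, noatime can be added to the options (fourth) field of the matching /etc/fstab entry.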
Other mount options are filesystem specific, therefore see the relevant articles for the filesystems:
- Ext3
- Ext4#Tips and tricks
- JFS Filesystem#Optimizations
- XFS
- Btrfs#Defragmentation and Btrfs#Compression
- ZFS#Tuning
Reiserfs
The data=writeback mount option improves speed, but may corrupt data during power loss. The notail mount option increases the space used by the filesystem by about 5%, but also improves overall speed. You can also reduce disk load by putting the journal and data on separate drives. This is done when creating the filesystem:
# mkreiserfs -j /dev/sda1 /dev/sdb1
Replace /dev/sda1 with the partition reserved for the journal, and /dev/sdb1 with the partition for data. You can learn more about reiserfs with this article.
Tuning kernel parameters
There are several key tunables affecting the performance of block devices; see sysctl#Virtual memory for more information.
Tuning IO schedulers
The kernel supports different schedulers for storage input/output (IO): CFQ (Completely Fair Queuing), NOOP and Deadline. Another, BFQ (Budget Fair Queueing), is available in the linux-ck kernel from the AUR.
An HDD has spinning disks and a head that moves physically to the required location. This structure leads to the following characteristics:
- Random latency is quite high; for a modern HDD it is ~10 ms (ignoring the disk controller's write buffer).
- Sequential access provides much higher throughput, because the head needs to move a shorter distance.
If many running processes make IO requests to different parts of the storage (i.e. random access), a disk can be expected to handle ~100 IO requests per second (one request per ~10 ms). Because modern systems can easily generate a load much higher than 100 requests per second, requests queue up waiting for access to the storage. One way to improve throughput is to linearize access, i.e. to order the waiting requests by logical address and always choose the closest one. Historically, this was the first Linux IO scheduler, called the elevator scheduler.
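The "always choose the closest request" policy can be illustrated with a toy model. This is only a sketch of the idea as described here, not kernel code; the function name and the sample addresses are hypothetical.

```shell
# Toy model: print pending request addresses in the order a "closest request
# first" policy would serve them.
# $1 is the current head position; the remaining arguments are the logical
# addresses of the pending requests (non-negative integers).
# POSIX sh has no local variables, so the names below are global.
service_order() {
    pos=$1; shift
    pending="$*"
    order=""
    while [ -n "$pending" ]; do
        best=""; best_dist=""
        # Find the pending request nearest to the current head position.
        for req in $pending; do
            if [ "$req" -ge "$pos" ]; then dist=$((req - pos)); else dist=$((pos - req)); fi
            if [ -z "$best" ] || [ "$dist" -lt "$best_dist" ]; then
                best=$req; best_dist=$dist
            fi
        done
        order="$order $best"
        pos=$best
        # Drop the first occurrence of the served request from the pending list.
        rest=""; removed=0
        for req in $pending; do
            if [ "$removed" -eq 0 ] && [ "$req" = "$best" ]; then removed=1; else rest="$rest $req"; fi
        done
        pending=${rest# }
    done
    echo "${order# }"
}
```

With the head at address 50, service_order 50 10 60 95 45 prints 45 60 95 10: the request at address 10 arrived first but is served last, because a closer request was always available.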
One of the problems with the elevator algorithm is that it penalizes processes with sequential access. Such processes read a block of data, process it for several microseconds, then read the next block, and so on. The elevator scheduler does not know that the process is about to read another block nearby and thus moves on to a request at some other location. To overcome this problem, the anticipatory IO scheduler was added. For synchronous requests, this algorithm waits for a short amount of time before moving to another request.
While these schedulers try to improve total throughput, they might also leave some unlucky requests waiting for a very long time. As an example, imagine the majority of processes make requests at the beginning of the storage space while an unlucky process makes a request at the other end. So developers tried to make the algorithm fairer, and the deadline scheduler was added. It has a queue ordered by address (the same as the elevator). If a request sits in this queue for a long time, it moves to an "expired" queue ordered by expiry time. The scheduler checks the expired queue first, processes requests from there, and only then moves to the elevator queue. It is important to understand that this algorithm sacrifices total throughput for fairness.
CFQ (the default scheduler nowadays) combines all the ideas above and adds cgroup support, which allows reserving some amount of IO for a specific cgroup. It is useful on shared (and cloud) hosting: users who paid for 20 IO/s want to get their share if needed.
The characteristics of an SSD are different. It does not have moving parts. Random access is as fast as sequential access. An SSD can handle multiple requests at the same time. The throughput of modern devices is ~10K IO/s, which is higher than the workload on most systems. Essentially, a user cannot generate enough requests to saturate an SSD, so the request queue is effectively always empty. In this case an IO scheduler does not provide any improvement. Thus, it is recommended to use the noop scheduler for an SSD.
It is possible to change the scheduler at runtime and even to use different schedulers for separate storage devices at the same time. See SSD IO scheduler for commands and examples.
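The kernel lists the schedulers available for a device in /sys/block/<device>/queue/scheduler and marks the active one with square brackets, for example noop deadline [cfq]. The following is a minimal sketch for reading that file; the function name is hypothetical, sda is only an example device, and the scheduler names offered depend on the running kernel.

```shell
# Hypothetical helper: print the active IO scheduler, i.e. the name enclosed
# in square brackets. Reads the contents of a queue/scheduler file from stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

It would be called as active_scheduler < /sys/block/sda/queue/scheduler. Writing one of the listed names to the same file as root, for example echo noop > /sys/block/sda/queue/scheduler, switches the scheduler for that device until the next reboot.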
RAM disks
See [1].
USB storage devices
If copying files to USB drives such as pendrives is slow, add these three lines to a systemd tmpfiles.d configuration file:
/etc/tmpfiles.d/local.conf
w /sys/kernel/mm/transparent_hugepage/enabled - - - - madvise
w /sys/kernel/mm/transparent_hugepage/defrag - - - - madvise
w /sys/kernel/mm/transparent_hugepage/khugepaged/defrag - - - - 0
See also sysctl#Virtual memory, [2] and [3].
CPU
The only way to directly improve CPU speed is overclocking. As it is a complicated and risky task, it is not recommended for anyone except experts. The best way to overclock is through the BIOS. When purchasing your system, keep in mind that most Intel motherboards are notorious for disabling the capability to overclock.
Many Intel i5 and i7 chips, even when overclocked properly through the BIOS or UEFI interface, will not report the correct clock frequency to acpi_cpufreq and most other utilities. This will result in excessive messages in dmesg about delays unless the module acpi_cpufreq is unloaded and blacklisted. The only tool known to correctly read the clock speed of these overclocked chips under Linux is i7z. The i7z package is available in the community repo and i7z-git is available in the AUR.
Another way to affect performance is to use Con Kolivas' desktop-centric kernel patchset, which, among other things, replaces the Completely Fair Scheduler (CFS) with the Brain Fuck Scheduler (BFS).
Kernel PKGBUILDs that include the BFS patch can be installed from the AUR or Unofficial user repositories. See linux-ck (and the Linux-ck wiki page), linux-pf (and the Linux-pf wiki page) or linux-bfs for more information on their additional patches.
Verynice
VeryNice is a daemon, available in the AUR as verynice, for dynamically adjusting the nice levels of executables. The nice level represents the priority of the executable when allocating CPU resources. Simply define executables for which responsiveness is important, like X or multimedia applications, as goodexe in /etc/verynice.conf. Similarly, CPU-hungry executables running in the background, like make, can be defined as badexe. This prioritization greatly improves system responsiveness under heavy load.
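The same kind of prioritization can be applied by hand with the standard nice utility. The following is a minimal sketch of that manual approach, not of VeryNice's own mechanism; the wrapper name is hypothetical.

```shell
# Hypothetical wrapper: run a command at nice level 19, the lowest CPU
# priority, so that interactive programs are scheduled ahead of it.
run_low_priority() {
    nice -n 19 "$@"
}
```

For example, run_low_priority make starts a build that yields the CPU to other processes whenever they need it.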
cgroups
See cgroups.
irqbalance
The purpose of irqbalance is to distribute hardware interrupts across processors on a multiprocessor system in order to increase performance. It can be controlled by the provided irqbalance.service.
Graphics
As with CPUs, overclocking can directly improve performance, but is generally recommended against. There are several packages in the AUR, such as rovclock, amdoverdrivectrl (ATI), and nvclock (NVIDIA).
Xorg.conf configuration
Graphics performance may depend on the settings in /etc/X11/xorg.conf; see the NVIDIA, ATI and Intel articles. Improper settings may stop Xorg from working, so caution is advised.
Driconf
driconf is a small utility for changing direct rendering settings for open source drivers. Enabling HyperZ may improve performance.
RAM and swap
Relocate files to tmpfs
Relocate files, such as your browser profile, to a tmpfs file system such as /tmp or /dev/shm to improve application response, as all the files are then stored in RAM.
Use an active management script for maximal reliability and ease of use.
Refer to the Profile-sync-daemon wiki article for more information on syncing browser profiles.
Refer to the Anything-sync-daemon wiki article for more information on syncing any specified folder.
Root on RAM overlay
If running off a medium that is slow to write to (USB, spinning HDDs) and storage requirements are low, the root may be run on a RAM overlay on top of a read-only root (on disk). This can vastly improve performance at the cost of limited writable space for root. See liveroot in the AUR.
Zram or zswap
The zram kernel module (previously called compcache) provides a compressed block device in RAM. If you use it as a swap device, the RAM can hold much more information, at the cost of more CPU use. Still, it is much quicker than swapping to a hard drive. If a system often falls back to swap, this could improve responsiveness.
Example: To set up one lz4 compressed zram device with 32GiB capacity and a higher-than-normal priority (only for the current session):
# modprobe zram
# echo lz4 > /sys/block/zram0/comp_algorithm
# echo 32G > /sys/block/zram0/disksize
# mkswap --label zram0 /dev/zram0
# swapon --priority 100 /dev/zram0
To disable it again, either reboot or run
# swapoff /dev/zram0
# rmmod zram
If you want to automatically initialize zram on every boot, consider writing a systemd service for it.
A detailed explanation of all steps, options and potential problems is provided in the official documentation of the module here.
The AUR package zramswap provides an automated script for setting up such swap devices with optimal settings for your system (such as RAM size and CPU core number). The script creates one zram device per CPU core with a total space equivalent to the RAM available. To do this automatically on every boot, enable zramswap.service.
You will have a compressed swap with higher priority than your regular swap, which will utilize multiple CPU cores for compressing data.
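The sizing rule described above (one device per CPU core, with a total equal to the available RAM) comes down to a single division. The following is a sketch of that arithmetic only, not of the zramswap script itself; the function name is hypothetical.

```shell
# Hypothetical helper: size in KiB of each zram device when the total RAM
# ($1, in KiB) is split evenly across one device per CPU core ($2).
zram_size_per_device_kib() {
    echo $(( $1 / $2 ))
}
```

On a running system the two inputs could be taken from the MemTotal line of /proc/meminfo and from nproc. For 8 GiB of RAM (8388608 KiB) and 4 cores, each device would get 2097152 KiB, i.e. 2 GiB.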
Similar benefits (at similar costs) can be achieved using zswap rather than zram. The two are similar in intent but not in operation: zswap operates as a compressed RAM cache and neither requires nor permits extensive userspace configuration.
Swap on zRAM using a udev rule
The example below describes how to set up swap on zRAM automatically at boot with a single udev rule. No extra package should be needed to make this work.
First, enable the module:
/etc/modules-load.d/zram.conf
zram
Configure the number of /dev/zram nodes you need.
/etc/modprobe.d/zram.conf
options zram num_devices=2
Create the udev rule as shown in the example.
/etc/udev/rules.d/99-zram.rules
KERNEL=="zram0", ATTR{disksize}="512M", RUN="/usr/bin/mkswap /dev/zram0", TAG+="systemd"
KERNEL=="zram1", ATTR{disksize}="512M", RUN="/usr/bin/mkswap /dev/zram1", TAG+="systemd"
Add /dev/zram to your fstab.
/etc/fstab
/dev/zram0 none swap defaults 0 0
/dev/zram1 none swap defaults 0 0
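After a reboot, the result can be checked against the list of active swap devices in /proc/swaps. The following is a minimal sketch, assuming the usual five-column layout of that file (Filename, Type, Size, Used, Priority); the function name is hypothetical.

```shell
# Hypothetical helper: print the name and priority of every active zram swap
# device. Reads /proc/swaps-style text (with its header line) from stdin.
zram_swaps() {
    awk 'NR > 1 && $1 ~ /^\/dev\/zram/ { print $1, $5 }'
}
```

It would be called as zram_swaps < /proc/swaps. With the configuration above, both /dev/zram0 and /dev/zram1 should be listed.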
Using the graphic card's RAM
In the unlikely case that you have very little RAM and a surplus of video RAM, you can use the latter as swap. See Swap on video ram.
Network
Use a DNS caching server in your local network. Every time a connection is made by host name, the fully qualified domain name must first be resolved to an IP address; only then can the connection be established. Using a DNS caching server on your local network decreases the latency of new connections. Your DSL router should contain such a server; if not, you can install your own. See Dnsmasq for more details.
Application-specific tips
Firefox
See Firefox tweaks#Performance and Firefox Ramdisk.
Firefox in the official repositories is built with the profile guided optimization flag enabled. You may want to use it in your custom build. To do this append:
ac_add_options --enable-profile-guided-optimization
to your .mozconfig file.