Solid State Drives

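This article covers many Linux-related aspects of solid state drives (SSDs); however, several of the principles and key learnings described below are general enough to apply to users running SSDs on other operating systems such as Windows and Mac OS X. Beyond that, Linux users can gain better performance from the tweaks and optimizations presented here.

Introduction

Solid state drives (SSDs) are not plug-and-play devices. Getting the best performance from an SSD requires special attention to partition alignment, choice of filesystem, TRIM support, and so on. This article attempts to collect the key learnings that let users get the most out of SSDs under Linux. Users are encouraged to read the article in its entirety before acting on it, since the content is organized by topic rather than in any systematic or chronological order.

Note: This article is targeted at users running Linux, but much of the content is also relevant to users of other operating systems such as Windows and Mac OS X.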

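Advantages over HDDs

  • Faster read speeds - 2-3 times faster than a typical desktop HDD (7,200 RPM on a SATA2 interface).
  • Sustained read speeds - read speed does not decrease across the device. On an HDD, performance tapers off as the heads move from the outer edge of the platters toward the center.
  • Minimal access time - approximately 100 times faster than an HDD. For example, 0.1 ms (100 µs) vs. 12-20 ms (12,000-20,000 µs) for desktop HDDs.
  • High degree of reliability.
  • No moving parts.
  • Minimal heat production.
  • Minimal power consumption - around 1 W at idle and 1-2 W while reading, vs. 10-30 W for an HDD (depending on RPM).
  • Light-weight - ideal for laptops.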

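Limitations

  • Per-storage cost (dollars per GB, vs. cents per GB for HDDs).
  • Capacity of marketed models is lower than that of HDDs.
  • Large cells require different filesystem optimizations than traditional storage media. The flash translation layer hides the raw flash access that a modern operating system could otherwise use to optimize access.
  • Partitions and filesystems need some SSD-specific tuning. Page size and erase block size are not autodetected.
  • Cells wear out. Consumer MLC cells at a mature 50nm process can handle 10000 writes each; 35nm generally handles 5000 writes, and 25nm handles 3000 (smaller processes mean higher density and lower cost). If writes are properly spread out, are not too small, and align well with the cells, this translates into a lifetime write volume for the SSD equal to the cycle count above multiplied by its capacity. Daily write volumes have to be balanced against the expected lifetime.
  • Firmware and controllers are complex and occasionally have bugs. Modern ones consume about as much power as an HDD. They implement the equivalent of a log-structured filesystem with garbage collection, and they translate SATA commands that were traditionally designed for rotating media. Some of them also compress data on the fly. They spread repeated writes across the whole flash area to prevent premature wear. They also coalesce writes, so that small writes do not cause repeated erases of large cells. Finally, they move cells that contain data so that a cell does not lose its contents over time.
  • Performance can drop as the disk fills up. Garbage collection is not universally well implemented, which means freed space is not always collected into entirely free cells.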

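Pre-Purchase Considerations

There are several key features to consider before purchasing an SSD.

Key Features

  • Native TRIM support is a vital feature that both prolongs SSD lifetime and reduces the loss of write performance over time.
  • Buying the right-sized SSD is key. For most filesystems, keeping all SSD partitions below 75 % occupancy ensures efficient use by the kernel.

Reviews

This section is not meant to be all-inclusive; it contains only a few key reviews.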

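Tips for Maximizing SSD Performance

Partition Alignment

High-level Overview

Proper partition alignment is essential for optimal performance and longevity. The key to alignment is partitioning to (at least) the EBS (erase block size) of the SSD.

Note: The EBS varies widely between manufacturers; a Google search on the model of interest is a good idea! The Intel X25-M, for example, is thought to have an EBS of 512 KiB, but Intel has never officially published this information.
Note: If you do not know the EBS of your SSD, you can still use 512 KiB (or 1024 KiB to be safer, if losing the first 1 MiB of the disk is not a concern). This number is greater than or equal to all current EBS values, and aligning to it means the partitions are also aligned for all smaller sizes. This is how Windows 7 and Ubuntu optimize partitions for SSDs.

If the partitions are not aligned to begin at multiples of the EBS, aligning the filesystem is pointless, because everything is skewed by the start offset of the partition. Traditionally, hard drives were addressed by cylinder, head, and sector to locate the data to be read or written. These represented the radial position, the drive head, and the axial position of the data respectively. With LBA (logical block addressing), this is no longer the case. Instead, the entire drive is addressed as one continuous stream of data.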
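
As a rough illustration of the alignment rule, the following shell sketch checks whether a partition's start sector falls on a multiple of the erase block size. The helper name is_aligned is hypothetical, and the sketch assumes 512-byte logical sectors and an EBS given in KiB.

```shell
# Sketch: check whether a partition start is a multiple of the erase block size.
# Assumptions: 512-byte logical sectors; is_aligned is a made-up helper name.

# is_aligned START_SECTOR EBS_KIB
# Succeeds (exit status 0) if START_SECTOR * 512 bytes is a multiple of EBS_KIB KiB.
is_aligned() {
    start_bytes=$(( $1 * 512 ))
    ebs_bytes=$(( $2 * 1024 ))
    [ $(( start_bytes % ebs_bytes )) -eq 0 ]
}

# Possible use on a live system (sdX and sdXN are placeholders):
#   is_aligned "$(cat /sys/block/sdX/sdXN/start)" 512 && echo "aligned"
```

A start at sector 2048 passes this check for both 512 KiB and 1024 KiB, while the old MS-DOS-style start at sector 63 does not.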

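Using GPT - Recommended Method

GPT is an alternative, more recent partitioning style that is intended to replace the old Master Boot Record (MBR) system. GPT has several advantages over MBR, which dates back to MS-DOS times. With recent versions of the partitioning tools fdisk (MBR) and gdisk (GPT), it is equally easy to use GPT or MBR and still get maximum performance.

Choosing between GPT and MBR

The choice basically comes down to the following:

  • If you use GRUB Legacy as the bootloader, you must use MBR. See the MBR method below.
  • If you want to dual-boot with Windows, you must use MBR. See the MBR method below.
    • A special exception: when dual-booting 64-bit Windows Vista/7 with UEFI instead of BIOS, you must use GPT.
  • If none of the above apply, choose freely between GPT and MBR. Since GPT is more modern, it is recommended.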
Gdisk Usage Summary

gdisk is a GPT-capable partitioning tool similar to fdisk. When partitioning, it automatically aligns on a 2048-sector (1024 KiB) basis, which is compatible with most SSDs. GNU parted also supports GPT, but is less user-friendly for aligning partitions. A summary of gdisk usage follows:

  • Install gdisk from the extra repository (gptfdisk package).
  • Open the SSD with gdisk.
  • If the SSD is brand new, or if you want to start over, create a new empty GUID partition table with the 'o' command.
  • Create a new partition with the 'n' command (primary type/1st partition).
  • Assuming the partition is new, gdisk will pick the highest possible alignment. Otherwise, it will pick the largest power of two that divides all partition offsets.
  • If the chosen start sector is lower than 2048, gdisk will automatically shift it to 2048. This ensures a 2048-sector alignment (as a sector is 512 B, this is a 1024 KiB alignment, which should fit any SSD NAND erase block).
  • Use the +x{M,G} format to extend the partition x megabytes or gigabytes. If the chosen size is not a multiple of the alignment size (1024 KiB), gdisk will shrink the partition to the nearest lower multiple.
  • Select the partition's type id, the default, 'Linux/Windows data' (code 0700), should be fine for most use. Press L to show the codes list.
  • Assign other partitions in a like fashion.
  • Write the table to disk and exit via the 'w' command.
  • Create the filesystems as usual.
Warning: If planning to use the GPT partitioned SSD as a boot disk on a BIOS based system (most systems except Apple computers and some very rare motherboard models with an Intel chipset), one may have to create, preferably at the beginning of the disk, a 1 MiB partition with the partition type BIOS boot or bios_grub (gdisk type code EF02) in order to boot from the disk using GRUB2. Syslinux does not need a separate 1 MiB bios_grub partition, but it does need a separate /boot partition with the Legacy BIOS Bootable partition attribute enabled (using gdisk). See GPT for more information.
Warning: GRUB legacy does not support GUID partitioning scheme, users must use burg, GRUB2 or Syslinux.
Warning: If planning to dual boot with Windows (XP, Vista or 7), do NOT use GPT, since these versions do NOT support booting from a GPT disk on BIOS systems! Users need to use the legacy MBR method described below for dual-boot on BIOS systems! This limitation does not apply when booting in UEFI mode with 64-bit Windows Vista or 64-bit Windows 7. For 32-bit Windows Vista and 7, and for 32-bit and 64-bit Windows XP, users need to use MBR partitioning and boot in BIOS mode only.
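
The sector arithmetic described in the gdisk steps above can be sketched in shell as follows. This only mirrors the behaviour as summarized in this article (start sectors moved up to a 2048-sector boundary, sizes shrunk to the nearest lower multiple); it is not taken from gdisk itself, and the names ALIGN, align_up and align_down are hypothetical.

```shell
# Sketch of 2048-sector alignment arithmetic (512-byte sectors assumed).
# ALIGN, align_up and align_down are illustrative names, not gdisk internals.
ALIGN=2048

# align_up SECTOR: round a start sector up to the next multiple of ALIGN.
align_up() {
    echo $(( ( $1 + ALIGN - 1 ) / ALIGN * ALIGN ))
}

# align_down SECTORS: round a size in sectors down to a multiple of ALIGN.
align_down() {
    echo $(( $1 / ALIGN * ALIGN ))
}
```

For example, a requested start at sector 34 (typically the first usable sector on a GPT disk) becomes 2048, and a requested size of 5000 sectors becomes 4096.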

Using MBR - Legacy Method

Using MBR, the utility for editing the partition table is called fdisk. Recent versions of fdisk have abandoned the deprecated system of using cylinders as the default display unit, as well as MS-DOS compatibility by default. The latest fdisk automatically aligns all partitions to 2048 sectors, or 1024 KiB, which should work for all EBS sizes that are known to be used by SSD manufacturers. This means that the default settings will give you proper alignment.

Note that in the olden days, fdisk used cylinders as the default display unit, and retained an MS-DOS compatibility quirk that messed with SSD alignment. Therefore one will find many guides around the internet from around 2008-2009 making a big deal out of getting everything correct. With the latest fdisk, things are much simpler, as reflected in this guide.

Fdisk Usage Summary
  • Start fdisk.
  • If the SSD is brand new, create a new empty DOS partition table with the 'o' command.
  • Create a new partition with the 'n' command (primary type/1st partition).
  • Use the +xG format to extend the partition x gigabytes.
  • Change the partition's system id from the default type of Linux (type 83) to the desired type via the 't' command. This is an optional step should the user wish to create another type of partition for example, swap, NTFS, LVM, etc. Note that a complete listing of all valid partition types is available via the 'l' command.
  • Assign other partitions in a like fashion.
  • Write the table to disk and exit via the 'w' command.

When finished, users may format their newly created partitions with the 'mkfs.x /dev/sdXN' where x is the filesystem, X is the drive letter, and N is the partition number. The following example will format the first partition on the first disk to ext4 using the defaults specified in /etc/mke2fs.conf:

# mkfs.ext4 /dev/sda1
Warning: Using the mkfs command can be dangerous as a simple mistake can result in formatting the WRONG partition and in data loss! TRIPLE check the target of this command before hitting the Enter key!
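Special Considerations for Logical Partitions

---Place holder for content.

Special Considerations for RAID0 Setups with Multiple SSDs

---Place holder for content.

Encrypted partitions

When using cryptsetup, define a sufficient payload alignment (see here). The value is given in 512-byte sectors, so 8192 corresponds to 4 MiB: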

cryptsetup luksFormat --align-payload=8192 ...

Remember that the DISCARD/TRIM feature is NOT SUPPORTED by device-mapper, although work on it is under way (see here). August 2011 news: support will be in Linux 3.1, and involves a userspace dm-crypt update as well [1].

Mount Flags

There are several key mount flags to use in one's /etc/fstab entries for SSD partitions.

  • noatime - Reading accesses to the file system will no longer result in an update to the atime information associated with the file. The importance of the noatime setting is that it eliminates the need by the system to make writes to the file system for files which are simply being read. Since writes can be somewhat expensive as mentioned in previous section, this can result in measurable performance gains. Note that the write time information to a file will continue to be updated anytime the file is written to with this option enabled.
  • discard - The discard flag will enable the benefits of the TRIM command as long as one is using kernel version >=2.6.33. It does not work with ext3; using the discard flag for an ext3 root partition will result in it being mounted read-only.
/dev/sda1 / ext4 defaults,noatime,discard 0 1
/dev/sda2 /home ext4 defaults,noatime,discard 0 1
Warning: It is critically important that users switch the controller driving the SSD to AHCI mode (not IDE mode) to ensure that the kernel is able to use the TRIM command.
Warning: Users need to be certain that kernel version 2.6.33 or above is being used AND that their SSD supports TRIM before attempting to mount a partition with the discard flag. Data loss can occur otherwise!
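
The kernel version requirement in the warning above could be checked with a small shell helper before adding the discard flag. This is a minimal sketch: the function name version_at_least is hypothetical, and it relies on the -V (version sort) option of GNU sort.

```shell
# Sketch: compare a kernel release string against a minimum version.
# version_at_least is a made-up helper; requires GNU sort for the -V option.

# version_at_least HAVE WANT
# Succeeds if HAVE (e.g. the output of uname -r) is at least WANT.
version_at_least() {
    have=${1%%-*}   # drop distribution suffixes such as "-ARCH"
    [ "$(printf '%s\n%s\n' "$2" "$have" | sort -V | head -n 1)" = "$2" ]
}

# Possible use on a live system:
#   version_at_least "$(uname -r)" 2.6.33 || echo "kernel too old for discard"
```

Whether the drive itself supports TRIM still has to be verified separately. The capability listing printed by hdparm -I /dev/sdX normally mentions TRIM on drives that support it.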

Special Considerations for Mac Computers

By default, Apple's firmware switches SATA drives into IDE mode (not AHCI mode) when booting any OS besides Mac OS. It is easy to switch back to AHCI if you are using GRUB2 with an Intel SATA controller.

First determine the PCI identifier of your SATA controller. Run the command

# lspci -nn

and find the line that says "SATA AHCI Controller". The PCI identifier is in square brackets and should look like 8086:27c4 (but the last digits may be different).

Now edit /boot/grub/grub.cfg and add the line

setpci -d 8086:27c4 90.b=40

right above the "set root" line of each OS you want to enable AHCI for. Be sure to substitute the appropriate PCI identifier.

(credit: http://darkfader.blogspot.com/2010/04/windows-on-intel-mac-and-ahci-mode.html)

If you have a late 2008 unibody MacBook (5,1), you do not have an Intel controller. Instead, you have an "nVidia Corporation MCP79 SATA Controller".

In that case, add this line to /boot/grub/grub.cfg instead:

setpci -d 10de:0ab5 9c.b=06

I/O Scheduler

Consider switching from the default scheduler, which under Arch is cfq (completely fair queuing), to the noop or deadline scheduler for an SSD. The latter two offer performance boosts over cfq. The noop scheduler, for example, simply processes requests in the order they are received, without giving any consideration to where the data physically resides on the disk. This is thought to be advantageous for SSDs, since seek times are identical for all sectors on the SSD.

However, for some SSDs, particularly earlier, JMicron-based ones, you may experience better performance sticking with the default scheduler (see here for one such benchmark); on these, while seek times are similar for all sectors, random access throughput is bad enough to offset any advantage. If the SSD was manufactured within the last year or so, or is made by Intel, this probably does not apply to you.

For more on schedulers, see this Linux Magazine article (needs registration).

About the default scheduler for ssd drives: FS#22605

The cfq scheduler is enabled by default on Arch. Verify this by viewing the contents of /sys/block/sdX/queue/scheduler, where X is the letter of the device:

$ cat /sys/block/sdX/queue/scheduler
noop deadline [cfq]

The scheduler currently in use is denoted from the available schedulers by the brackets.
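
For use in scripts, the bracketed entry can be extracted with sed. The following is a minimal sketch; the function name active_scheduler is hypothetical, and it assumes the one-line format shown above.

```shell
# Sketch: print the scheduler that is enclosed in brackets.
# active_scheduler is a made-up helper; it reads the scheduler line on stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Possible use on a live system (sdX is a placeholder):
#   active_scheduler < /sys/block/sdX/queue/scheduler
```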

There are several ways to change the scheduler.

Note: Only switch the scheduler to noop or deadline for SSDs. Keeping the cfq scheduler for all other physical HDDs is highly recommended.
Using the sys virtual filesystem

This method is preferred when the system has several physical storage devices (for example an SSD and an HDD). Add the following line in /etc/rc.local:

echo noop > /sys/block/sdX/queue/scheduler

where X is the letter for the SSD device.

Because of the potential for udev to assign different /dev/ nodes to drives before and after a kernel update, users must take care that the noop scheduler is applied to the correct device upon boot. One way to do this is by using the SSD's device ID to determine its /dev/ node. To do this automatically, use the following snippet instead of the line above and add it to /etc/rc.local:

SSD=(ata-OCZ-ONYX_XA1Y7CE3709UG79OTHVM ata-SAMSUNG_SSD_830_Series_S0VTNYABA01063)

for id in "${SSD[@]}"; do
  # Resolve the by-id symlink to its kernel device name (for example, sda)
  NODE=$(basename "$(readlink -f "/dev/disk/by-id/$id")")
  echo noop > "/sys/block/$NODE/queue/scheduler"
done

where SSD is a Bash array containing the device IDs of all SSD devices. Device IDs are listed in /dev/disk/by-id/ as symbolic links pointing to their corresponding /dev/ nodes. To view the links listed with their targets, issue the following command:

ls -l /dev/disk/by-id/
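
As an alternative to listing device IDs by hand, the kernel exposes a rotational flag for each block device under /sys/block/<device>/queue/rotational (0 for non-rotating media). The sketch below is not part of the method described above: the function name list_ssds is hypothetical, it only considers sd* devices, and the flag can be misreported behind some controllers or in virtual machines, so treat the result as a hint.

```shell
# Sketch: list sd* block devices that the kernel reports as non-rotational.
# list_ssds is a made-up helper; the optional argument overrides the sysfs
# directory, which is mainly useful for testing.
list_ssds() {
    sys_block=${1:-/sys/block}
    for dev in "$sys_block"/sd*; do
        # Skip entries without a readable flag (including an unmatched glob)
        [ -r "$dev/queue/rotational" ] || continue
        if [ "$(cat "$dev/queue/rotational")" = "0" ]; then
            basename "$dev"
        fi
    done
}

# Possible use in /etc/rc.local:
#   for dev in $(list_ssds); do
#     echo noop > "/sys/block/$dev/queue/scheduler"
#   done
```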
Kernel parameter

If the sole storage device in the system is an SSD, consider setting the I/O scheduler for the entire system via the elevator kernel parameter:

elevator=noop

For example, with GRUB, in /boot/grub/menu.lst:

kernel /vmlinuz-linux root=/dev/sda3 ro elevator=noop

or with GRUB2, in /etc/default/grub (remember to regenerate grub.cfg afterwards, for example with grub-mkconfig -o /boot/grub/grub.cfg):

GRUB_CMDLINE_LINUX="elevator=noop"

Swap Space on SSDs

One can place a swap partition on an SSD. Note that most modern desktops with more than 2 GiB of memory rarely use swap at all. The notable exception is systems which make use of the hibernate feature. The following is a recommended tweak for SSDs with a swap partition; it reduces the "swappiness" of the system and thus avoids writes to swap.

# echo 1 > /proc/sys/vm/swappiness

Or one can simply modify /etc/sysctl.d/99-sysctl.conf as recommended in the Maximizing Performance wiki article.

vm.swappiness=1
vm.vfs_cache_pressure=50

SSD Memory Cell Clearing

On occasion, users may wish to completely reset an SSD's cells to the same virgin state they were in when the device was installed, thus restoring its factory default write performance. Write performance is known to degrade over time even on SSDs with native TRIM support. TRIM only safeguards against file deletes, not replacements such as an incremental save.

The reset is easily accomplished in a three step procedure denoted on the SSD Memory Cell Clearing wiki article.

Tips for Minimizing SSD Read/Writes

An overarching theme for SSD usage should be 'simplicity' in terms of locating high-read/write operations either in RAM (Random Access Memory) or on a physical HDD rather than on an SSD. Doing so will add longevity to an SSD. This is primarily due to the large erase block size (512 KiB in some cases); a lot of small writes result in huge effective writes.

Note: A 32GB SSD with a mediocre 10x write amplification factor, a standard 10000 write/erase cycle, and 10GB of data written per day, would get an 8 years life expectancy. It gets better with bigger SSDs and modern controllers with less write amplification.
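
The estimate in the note above can be reproduced with simple integer arithmetic. This is a back-of-the-envelope sketch only: the function name ssd_life_days is hypothetical, and it assumes perfectly even wear leveling and a constant daily write volume.

```shell
# Sketch: rough SSD lifetime from capacity, cycle count, write amplification
# and daily writes. ssd_life_days is a made-up helper using integer math.

# ssd_life_days CAPACITY_GB ERASE_CYCLES WRITE_AMPLIFICATION DAILY_WRITES_GB
ssd_life_days() {
    echo $(( $1 * $2 / ( $3 * $4 ) ))
}

# Figures from the note: 32 GB, 10000 cycles, 10x amplification, 10 GB/day
days=$(ssd_life_days 32 10000 10 10)
echo "$days days, about $(( days / 365 )) years"
```

With the figures from the note, this gives 3200 days, which is where the life expectancy of roughly 8 years comes from.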

Use "iotop -oPa" and sort by disk writes to see how much your programs are writing to disk.

Intelligent Partition Scheme

Consider relocating the /var partition to a physical HDD on the system rather than keeping it on the SSD, to avoid read/write wear. Many users elect to keep only / and /home on the SSD (/boot is okay too), locating /var and /tmp on a physical HDD.

# SSD
/
/home

# HDD
/boot
/var
/media/data (and other extra partitions, etc.)

If the SSD is the only storage device on the system (i.e. no HDDs), consider allocating a separate partition for /var to allow for better crash recovery for example in the event of a broken program wasting all the space on / or if some run away log file maxes out the space, etc.

Another intelligent option is to locate /tmp in RAM, provided the system has enough to spare. See below for more on this procedure.

The noatime Mount Flag

Assign the noatime flag to partitions residing on SSDs. See the Mount Flags section above for more.

Locating Browser Profiles in RAM

One can easily mount browser profile(s) such as chromium, firefox, opera, etc. into RAM via tmpfs and also use rsync to keep them synced with HDD-based backups. In addition to the obvious speed enhancements, users will also save read/write cycles on their SSD by doing so.

The AUR contains several packages to automate this process.

Compiling in tmpfs

Intentionally compiling in /tmp is a great way to minimize writes to the SSD. For systems with more than 4 GiB of memory, the tmp line in /etc/fstab can be tweaked to use more than 1/2 of the physical memory on the system via the size flag.

Example of a machine with 8 GB of physical memory:

tmpfs /tmp tmpfs nodev,nosuid,size=7G 0 0

Disabling Journaling on the Filesystem?

Using a journaling filesystem such as ext3 or ext4 on an SSD WITHOUT a journal is an option to decrease read/writes. The obvious drawback of using a filesystem with journaling disabled is data loss as a result of an ungraceful dismount (i.e. after a power failure, kernel lockup, etc.). With modern SSDs, Ted Ts'o advocates that journaling can be enabled with minimal extraneous read/write cycles under most circumstances:

Amount of data written (in megabytes) on an ext4 file system mounted with noatime.

operation     journal   w/o journal   percent change
git clone     367.0     353.0          3.81 %
make          207.6     199.4          3.95 %
make clean      6.45      3.73        42.17 %

"What the results show is that metadata-heavy workloads, such as make clean, do result in almost twice the amount data written to disk. This is to be expected, since all changes to metadata blocks are first written to the journal and the journal transaction committed before the metadata is written to their final location on disk. However, for more common workloads where we are writing data as well as modifying filesystem metadata blocks, the difference is much smaller."

Note: The make clean example from the table above typifies the importance of intentionally compiling in tmpfs (such as /tmp or /dev/shm) as recommended in the preceding section of this article!
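
The "percent change" column in the table above is the reduction in data written relative to the journaled figure. The following awk-based sketch recomputes it from the first two columns; the function name journal_overhead is hypothetical.

```shell
# Sketch: recompute the "percent change" column of the journaling table.
# journal_overhead is a made-up helper; it uses awk for floating point math.

# journal_overhead MB_WITH_JOURNAL MB_WITHOUT_JOURNAL
journal_overhead() {
    awk -v j="$1" -v n="$2" 'BEGIN { printf "%.2f\n", (j - n) / j * 100 }'
}

journal_overhead 367.0 353.0   # git clone
journal_overhead 207.6 199.4   # make
journal_overhead 6.45 3.73     # make clean
```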

Choice of Filesystem

Many filesystems are available to choose from, including ext2, ext3, ext4, btrfs, and so on.

Btrfs

Btrfs support has been included with the mainline 2.6.29 release of the Linux kernel. Some feel that it is not mature enough for production use while there are also early adopters of this potential successor to ext4. It should be noted that at the time this article was originally written (27-June-2010), a stable version of btrfs did not exist. See this blog entry for more on btrfs. Be sure to read the btrfs wiki as well.

Warning: At the time this entry was written (21-Nov-2010) there is NO fsck utility to fix/diagnose errors on btrfs partitions. While Btrfs is stable on a stable machine, it is currently possible to corrupt a filesystem irrecoverably in the event of a crash or power loss on disks that do not handle flush requests correctly.

Ext4

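Ext4 is another filesystem with SSD support. It has been considered mature and stable enough for everyday use since kernel 2.6.28. In contrast to Btrfs, ext4 does not automatically detect the capabilities of the disk; TRIM must be enabled explicitly with the discard mount option in fstab (or by setting it as a default mount option with tune2fs -o discard /dev/sdaX). For more on ext4, see the official in-kernel-tree documentation.

SSD Benchmarking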

See the SSD Benchmarking article for a general process of benchmarking your SSD or to see some of the SSDs in the database.

Firmware Updates

OCZ

For Linux, OCZ provides a command line utility on their forum that works on both i686 and x86_64.