Optimizing Redis

Benchmarking, profiling, and optimizations for memory and latency

1: Redis benchmark
2: Redis CPU profiling
3: Diagnosing latency issues
4: Redis latency monitoring
5: Memory optimization

1 - Redis benchmark

Using the redis-benchmark utility to benchmark a Redis server

Redis includes the redis-benchmark utility, which simulates N clients concurrently sending a total of M queries. The utility provides a default set of tests, or you can supply a custom set of tests.

The following options are supported:

Usage: redis-benchmark [-h <host>] [-p <port>] [-c <clients>] [-n <requests>] [-k <boolean>]

 -h <hostname>      Server hostname (default 127.0.0.1)
 -p <port>          Server port (default 6379)
 -s <socket>        Server socket (overrides host and port)
 -a <password>      Password for Redis Auth
 -c <clients>       Number of parallel connections (default 50)
 -n <requests>      Total number of requests (default 100000)
 -d <size>          Data size of SET/GET value in bytes (default 2)
 --dbnum <db>       SELECT the specified db number (default 0)
 -k <boolean>       1=keep alive 0=reconnect (default 1)
 -r <keyspacelen>   Use random keys for SET/GET/INCR, random values for SADD
  Using this option the benchmark will expand the string __rand_int__
  inside an argument with a 12 digits number in the specified range
  from 0 to keyspacelen-1. The substitution changes every time a command
  is executed. Default tests use this to hit random keys in the
  specified range.
 -P <numreq>        Pipeline <numreq> requests. Default 1 (no pipeline).
 -q                 Quiet. Just show query/sec values
 --csv              Output in CSV format
 -l                 Loop. Run the tests forever
 -t <tests>         Only run the comma separated list of tests. The test
                    names are the same as the ones produced as output.
 -I                 Idle mode. Just open N idle connections and wait.

You need to have a running Redis instance before launching the benchmark. You can run the benchmarking utility like so:

redis-benchmark -q -n 100000

Running only a subset of the tests

You don’t need to run all the default tests every time you execute redis-benchmark. For example, to select only a subset of tests, use the -t option as in the following example:

$ redis-benchmark -t set,lpush -n 100000 -q
SET: 74239.05 requests per second
LPUSH: 79239.30 requests per second

This example runs the tests for the SET and LPUSH commands and uses quiet mode (see the -q switch).

You can even benchmark a specific command:

$ redis-benchmark -n 100000 -q script load "redis.call('set','foo','bar')"
script load redis.call('set','foo','bar'): 69881.20 requests per second

Selecting the size of the key space

By default, the benchmark runs against a single key. Because Redis is an in-memory system, the difference between such a synthetic benchmark and a real one is not huge. However, you can stress cache misses, and more generally simulate a more realistic workload, by using a large key space.

Use the -r switch for this. For instance, to run one million SET operations, using a random key for every operation out of 100k possible keys, use the following command line:

$ redis-cli flushall
OK

$ redis-benchmark -t set -r 100000 -n 1000000
====== SET ======
  1000000 requests completed in 13.86 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1

99.76% <= 1 milliseconds
99.98% <= 2 milliseconds
100.00% <= 3 milliseconds
100.00% <= 3 milliseconds
72144.87 requests per second

$ redis-cli dbsize
(integer) 99993

Using pipelining

By default, every client (the benchmark simulates 50 clients unless otherwise specified with -c) sends the next command only after receiving the reply to the previous one. This means the server will likely need a read call to read each command from every client, and the round-trip time (RTT) is paid for every command.

Redis supports pipelining, so it is possible to send multiple commands at once, a feature often exploited by real-world applications. Pipelining can dramatically improve the number of operations per second a server is able to deliver.

This is an example of running the benchmark on a MacBook Air 11" with a pipeline of 16 commands:

$ redis-benchmark -n 1000000 -t set,get -P 16 -q
SET: 403063.28 requests per second
GET: 508388.41 requests per second

Using pipelining results in a significant increase in performance.
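
To make the idea concrete, here is a minimal sketch of what a pipelining client puts on the wire: several commands encoded in the Redis serialization protocol (RESP) and concatenated into a single buffer, so that one write and one network round trip carry all of them. The helper names encode_command and build_pipeline are hypothetical, and the sketch deliberately does not open a socket or read replies.

```python
from typing import Iterable, Sequence


def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        # Each argument is a bulk string: $<length>\r\n<bytes>\r\n
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def build_pipeline(commands: Iterable[Sequence[str]]) -> bytes:
    """Concatenate several encoded commands into one buffer.

    Sending this buffer with a single write is what pipelining amounts to:
    the server reads many commands at once and the client reads all the
    replies afterwards, in the same order.
    """
    return b"".join(encode_command(*cmd) for cmd in commands)


if __name__ == "__main__":
    # 16 SET commands in one buffer, mirroring the -P 16 run above.
    batch = [("SET", f"key:{i}", "xxx") for i in range(16)]
    payload = build_pipeline(batch)
    print(len(batch), "commands in", len(payload), "bytes")
```

A real client library handles the replies and error cases as well. The point here is only that the commands travel together instead of one per round trip.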

Pitfalls and misconceptions

The first point is obvious: the golden rule of a useful benchmark is to only compare apples to apples. For instance, different versions of Redis can be compared on the same workload, or the same version of Redis with different options. If you plan to compare Redis to something else, it is important to evaluate the functional and technical differences and take them into account.

Redis is a server: all commands involve network or IPC round trips. It is meaningless to compare it to embedded data stores, because the cost of most operations is primarily in network/protocol management.
Redis commands return an acknowledgment for all usual commands. Some other data stores do not. Comparing Redis to stores involving one-way queries is only mildly useful.
Naively iterating on synchronous Redis commands does not benchmark Redis itself; it measures your network (or IPC) latency and the intrinsic latency of the client library. To really test Redis, you need multiple connections (like redis-benchmark), pipelining to aggregate several commands, multiple threads or processes, or a combination of these.
Redis is an in-memory data store with some optional persistence options. If you plan to compare it to transactional servers (MySQL, PostgreSQL, etc.), then you should consider activating AOF and decide on a suitable fsync policy.
Redis is, mostly, a single-threaded server from the point of view of command execution (modern versions of Redis do use threads for other things). It is not designed to benefit from multiple CPU cores. To scale out on several cores, you are expected to launch several Redis instances. It is not really fair to compare a single Redis instance to a multi-threaded data store.

The redis-benchmark program is a quick and useful way to get some figures and evaluate the performance of a Redis instance on given hardware. However, by default, it does not represent the maximum throughput a Redis instance can sustain. By using pipelining and a fast client (hiredis), it is fairly easy to write a program that generates more throughput than redis-benchmark. By default, redis-benchmark achieves throughput by exploiting concurrency only (that is, it creates several connections to the server). Unless pipelining is explicitly enabled via the -P parameter, it uses no pipelining or other parallelism (at most one pending query per connection, and no multi-threading). So running redis-benchmark while triggering, for example, a BGSAVE operation in the background will give numbers closer to the worst case than to the best case.

To run a benchmark using pipelining mode (and achieve higher throughput), you need to explicitly use the -P option. Please note that it is still a realistic behavior since a lot of Redis based applications actively use pipelining to improve performance. However you should use a pipeline size that is more or less the average pipeline length you’ll be able to use in your application in order to get realistic numbers.

The benchmark should apply the same operations, and work in the same way with the multiple data stores you want to compare. It is absolutely pointless to compare the result of redis-benchmark to the result of another benchmark program and extrapolate.

For instance, Redis and memcached in single-threaded mode can be compared on GET/SET operations. Both are in-memory data stores that work mostly in the same way at the protocol level. Provided their respective benchmark applications aggregate queries in the same way (pipelining) and use a similar number of connections, the comparison is actually meaningful.

When you’re benchmarking a high-performance, in-memory database like Redis, it may be difficult to saturate the server. Sometimes, the performance bottleneck is on the client side, and not the server-side. In that case, the client (i.e., the benchmarking program itself) must be fixed, or perhaps scaled out, to reach the maximum throughput.

Factors impacting Redis performance

There are multiple factors having direct consequences on Redis performance. We mention them here, since they can alter the result of any benchmarks. Please note however, that a typical Redis instance running on a low end, untuned box usually provides good enough performance for most applications.

Network bandwidth and latency usually have a direct impact on performance. It is good practice to use the ping program to quickly check that the latency between the client and server hosts is normal before launching the benchmark. Regarding bandwidth, it is generally useful to estimate the throughput in Gbit/s and compare it to the theoretical bandwidth of the network. For instance, a benchmark setting 4 KB strings in Redis at 100000 q/s would actually consume 3.2 Gbit/s of bandwidth and probably fit within a 10 Gbit/s link, but not a 1 Gbit/s one. In many real-world scenarios, Redis throughput is limited by the network well before being limited by the CPU. To consolidate several high-throughput Redis instances on a single server, it is worth considering a 10 Gbit/s NIC or multiple 1 Gbit/s NICs with TCP/IP bonding.
CPU is another very important factor. Being single-threaded, Redis favors fast CPUs with large caches and not many cores. At this game, Intel CPUs are currently the winners. It is not uncommon to get only half the performance on an AMD Opteron CPU compared to similar Nehalem EP/Westmere EP/Sandy Bridge Intel CPUs with Redis. When client and server run on the same box, the CPU is the limiting factor with redis-benchmark.
Speed of RAM and memory bandwidth seem less critical for global performance especially for small objects. For large objects (>10 KB), it may become noticeable though. Usually, it is not really cost-effective to buy expensive fast memory modules to optimize Redis.
Redis runs slower on a VM than it does without virtualization on the same hardware. If you have the chance to run Redis on a physical machine, this is preferred. However, this does not mean that Redis is slow in virtualized environments: the delivered performance is still very good, and most of the serious performance issues you may encounter in virtualized environments are due to over-provisioning, non-local disks with high latency, or old hypervisor software with a slow fork syscall implementation.
When the server and client benchmark programs run on the same box, both the TCP/IP loopback and unix domain sockets can be used. Depending on the platform, unix domain sockets can achieve around 50% more throughput than the TCP/IP loopback (on Linux for instance). The default behavior of redis-benchmark is to use the TCP/IP loopback.
The performance benefit of unix domain sockets compared to TCP/IP loopback tends to decrease when pipelining is heavily used (i.e. long pipelines).
When an Ethernet network is used to access Redis, aggregating commands using pipelining is especially efficient when the size of the data is kept under the Ethernet packet size (about 1500 bytes). In fact, processing 10-byte, 100-byte, or 1000-byte queries results in almost the same throughput. See the graph below.

Data size impact

On multi-socket servers, Redis performance becomes dependent on the NUMA configuration and process location. The most visible effect is that redis-benchmark results seem non-deterministic because client and server processes are distributed randomly across the cores. To get deterministic results, you need to use process placement tools (on Linux: taskset or numactl). The most efficient combination is always to put the client and server on two different cores of the same CPU to benefit from the L3 cache. Here are some results of a 4 KB SET benchmark for 3 server CPUs (AMD Istanbul, Intel Nehalem EX, and Intel Westmere) with different relative placements. Please note this benchmark is not meant to compare CPU models with each other (the exact CPU models and frequencies are therefore not disclosed).

NUMA chart

With high-end configurations, the number of client connections is also an important factor. Being based on epoll/kqueue, the Redis event loop is quite scalable. Redis has already been benchmarked at more than 60000 connections, and was still able to sustain 50000 q/s in these conditions. As a rule of thumb, an instance with 30000 connections can only process half the throughput achievable with 100 connections. Here is an example showing the throughput of a Redis instance per number of connections:

connections chart

With high-end configurations, it is possible to achieve higher throughput by tuning the NIC(s) configuration and associated interrupts. Best throughput is achieved by setting an affinity between Rx/Tx NIC queues and CPU cores, and activating RPS (Receive Packet Steering) support. More information in this thread. Jumbo frames may also provide a performance boost when large objects are used.
Depending on the platform, Redis can be compiled against different memory allocators (libc malloc, jemalloc, tcmalloc), which may behave differently in terms of raw speed and internal and external fragmentation. If you did not compile Redis yourself, you can use the INFO command to check the mem_allocator field. Please note most benchmarks do not run long enough to generate significant external fragmentation (contrary to production Redis instances).
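
The bandwidth estimate from the network item above can be reproduced with a few lines of arithmetic. The sketch below counts only the value payload and ignores protocol, TCP/IP, and Ethernet overhead, so real traffic will be somewhat higher. The function names are hypothetical, and 4 KB is taken as 4000 bytes to match the 3.2 Gbit/s figure.

```python
def payload_gbit_per_s(value_bytes: int, queries_per_s: int) -> float:
    """Payload bandwidth in Gbit/s, ignoring all protocol overhead."""
    return value_bytes * 8 * queries_per_s / 1e9


def fits_link(value_bytes: int, queries_per_s: int, link_gbit_per_s: float) -> bool:
    """True if the payload alone stays under the link's theoretical bandwidth."""
    return payload_gbit_per_s(value_bytes, queries_per_s) < link_gbit_per_s


if __name__ == "__main__":
    needed = payload_gbit_per_s(4000, 100_000)
    print(f"needed: {needed:.1f} Gbit/s")
    print("fits 10 Gbit/s link:", fits_link(4000, 100_000, 10.0))
    print("fits 1 Gbit/s link:", fits_link(4000, 100_000, 1.0))
```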

Other things to consider

One important goal of any benchmark is to get reproducible results, so they can be compared to the results of other tests.

A good practice is to try to run tests on isolated hardware as much as possible. If it is not possible, then the system must be monitored to check the benchmark is not impacted by some external activity.
Some configurations (desktops and laptops for sure, some servers as well) have a variable CPU core frequency mechanism. The policy controlling this mechanism can be set at the OS level. Some CPU models are more aggressive than others at adapting the frequency of the CPU cores to the workload. To get reproducible results, it is better to set the highest possible fixed frequency for all the CPU cores involved in the benchmark.
An important point is to size the system appropriately for the benchmark. The system must have enough RAM and must not swap. On Linux, do not forget to set the overcommit_memory parameter correctly. Please note 32-bit and 64-bit Redis instances do not have the same memory footprint.
If you plan to use RDB or AOF for your benchmark, please check there is no other I/O activity in the system. Avoid putting RDB or AOF files on NAS or NFS shares, or on any other devices impacting your network bandwidth and/or latency (for instance, EBS on Amazon EC2).
Set Redis logging level (loglevel parameter) to warning or notice. Avoid putting the generated log file on a remote filesystem.
Avoid using monitoring tools that can alter the result of the benchmark. For instance, using INFO at regular intervals to gather statistics is probably fine, but MONITOR will impact the measured performance significantly.

Other Redis benchmarking tools

There are several third-party tools that can be used for benchmarking Redis. Refer to each tool’s documentation for more information about its goals and capabilities.

memtier_benchmark from Redis Ltd. is a NoSQL Redis and Memcache traffic generation and benchmarking tool.
rpc-perf from Twitter is a tool for benchmarking RPC services that supports Redis and Memcache.
YCSB from Yahoo is a benchmarking framework with clients to many databases, including Redis.

2 - Redis CPU profiling

Performance engineering guide for on-CPU profiling and tracing

Filling the performance checklist

Redis is developed with a great emphasis on performance. We do our best with every release to make sure you’ll experience a very stable and fast product.

Nevertheless, if you’re finding room to improve the efficiency of Redis or are pursuing a performance regression investigation you will need a concise methodical way of monitoring and analyzing Redis performance.

To do so you can rely on different methodologies (some more suited than others, depending on the class of issues or analysis you intend to make). A curated list of methodologies and their steps is enumerated by Brendan Gregg at the following link.

We recommend the Utilization Saturation and Errors (USE) Method for answering the question of what is your bottleneck. Check the following mapping between system resource, metric, and tools for a practical deep dive: USE method.

Ensuring the CPU is your bottleneck

This guide assumes you’ve followed one of the above methodologies to perform a complete check of system health, and identified the CPU as the bottleneck. If you have identified that most of the time is spent blocked on I/O, locks, timers, paging/swapping, etc., this guide is not for you.

Build Prerequisites

For a proper On-CPU analysis, Redis (and any dynamically loaded library like Redis Modules) requires stack traces to be available to tracers, which you may need to fix first.

By default, Redis is compiled with the -O2 switch (which we intend to keep during profiling). This means that compiler optimizations are enabled. Many compilers omit the frame pointer as a runtime optimization (saving a register), thus breaking frame pointer-based stack walking. This makes the Redis executable faster, but at the same time it makes Redis (like any other program) harder to trace: on-CPU time can be wrongly attributed to the last available frame pointer of a call stack that actually goes much deeper but cannot be walked.

It’s important that you ensure that:

debug information is present: compile option -g
frame pointer register is present: -fno-omit-frame-pointer
we still run with optimizations to get an accurate representation of production run times, meaning we will keep: -O2

You can do it as follows within the main Redis repository:

$ make REDIS_CFLAGS="-g -fno-omit-frame-pointer"

A set of instruments to identify performance regressions and/or potential on-CPU performance improvements

This document focuses specifically on on-CPU resource bottlenecks analysis, meaning we’re interested in understanding where threads are spending CPU cycles while running on-CPU and, as importantly, whether those cycles are effectively being used for computation or stalled waiting (not blocked!) for memory I/O, and cache misses, etc.

For that we will rely on toolkits (perf, bcc tools), and hardware specific PMCs (Performance Monitoring Counters), to proceed with:

Hotspot analysis (perf or bcc tools): to profile code execution and determine which functions are consuming the most time and thus are targets for optimization. We’ll present two options to collect, report, and visualize hotspots either with perf or bcc/BPF tracing tools.
Call counts analysis: to count events including function calls, enabling us to correlate several calls/components at once, relying on bcc/BPF tracing tools.
Hardware event sampling: crucial for understanding CPU behavior, including memory I/O, stall cycles, and cache misses.

Tool prerequisites

The following steps rely on Linux perf_events (aka “perf”), bcc/BPF tracing tools, and Brendan Gregg’s FlameGraph repo.

We assume beforehand you have:

Installed the perf tool on your system. Most Linux distributions will likely package this as a package related to the kernel. More information about the perf tool can be found at perf wiki.
Followed the install bcc/BPF instructions to install bcc toolkit on your machine.
Cloned Brendan Gregg’s FlameGraph repo and made the difffolded.pl and flamegraph.pl files accessible, to generate the collapsed stack traces and Flame Graphs.

Hotspot analysis with perf or eBPF (stack traces sampling)

Profiling CPU usage by sampling stack traces at a timed interval is a fast and easy way to identify performance-critical code sections (hotspots).

Sampling stack traces using perf

To profile both user- and kernel-level stacks of redis-server for a specific length of time, for example 60 seconds, at a sampling frequency of 999 samples per second:

$ perf record -g --pid $(pgrep redis-server) -F 999 -- sleep 60

Displaying the recorded profile information using perf report

By default perf record will generate a perf.data file in the current working directory.

You can then report with a call-graph output (call chain, stack backtrace), with a minimum call graph inclusion threshold of 0.5%, with:

$ perf report -g "graph,0.5,caller"

See the perf report documentation for advanced filtering, sorting and aggregation capabilities.

Visualizing the recorded profile information using Flame Graphs

Flame graphs allow for a quick and accurate visualization of frequent code-paths. They can be generated using Brendan Gregg’s open source programs on GitHub, which create interactive SVGs from folded stack files.

Specifically, for perf we need to convert the generated perf.data into the captured stacks, and fold each of them into single lines. You can then render the on-CPU flame graph with:

$ perf script > redis.perf.stacks
$ stackcollapse-perf.pl redis.perf.stacks > redis.folded.stacks
$ flamegraph.pl redis.folded.stacks > redis.svg

By default, perf script reads the perf.data file in the current working directory. See the perf script documentation for advanced usage.

See FlameGraph usage options for more advanced stack trace visualizations (like the differential one).

To analyze the perf.data contents on a machine other than the one where collection happened, you need to export, along with the perf.data file, all object files with build-ids found in the record data file. This can be done easily with the perf-archive.sh script:

$ perf-archive.sh perf.data

Now please run:

$ tar xvf perf.data.tar.bz2 -C ~/.debug

on the machine where you need to run perf report.

Sampling stack traces using bcc/BPF’s profile

Similarly to perf, as of Linux kernel 4.9, BPF-optimized profiling is now fully available with the promise of lower overhead on CPU (as stack traces are frequency counted in kernel context) and disk I/O resources during profiling.

In addition, if stack trace analysis is the main goal, relying solely on bcc/BPF’s profile tool removes the need for perf.data and the intermediate steps. You can use bcc’s profile tool to output folded format directly, for flame graph generation:

$ /usr/share/bcc/tools/profile -F 999 -f --pid $(pgrep redis-server) --duration 60 > redis.folded.stacks

That way, we’ve removed any preprocessing and can render the on-CPU flame graph with a single command:

$ flamegraph.pl redis.folded.stacks > redis.svg

Call counts analysis with bcc/BPF

A function may consume significant CPU cycles either because its code is slow or because it’s frequently called. To answer at what rate functions are being called, you can rely upon call counts analysis using BCC’s funccount tool:

$ /usr/share/bcc/tools/funccount 'redis-server:(call*|*Read*|*Write*)' --pid $(pgrep redis-server) --duration 60
Tracing 64 functions for "redis-server:(call*|*Read*|*Write*)"... Hit Ctrl-C to end.

FUNC                                    COUNT
call                                      334
handleClientsWithPendingWrites            388
clientInstallWriteHandler                 388
postponeClientRead                        514
handleClientsWithPendingReadsUsingThreads      735
handleClientsWithPendingWritesUsingThreads      735
prepareClientToWrite                     1442
Detaching...

The above output shows that, while tracing, Redis’s call() function was called 334 times, handleClientsWithPendingWrites() 388 times, etc.

Hardware event counting with Performance Monitoring Counters (PMCs)

Many modern processors contain a performance monitoring unit (PMU) exposing Performance Monitoring Counters (PMCs). PMCs are crucial for understanding CPU behavior, including memory I/O, stall cycles, and cache misses, and provide low-level CPU performance statistics that aren’t available anywhere else.

The design and functionality of a PMU is CPU-specific and you should assess your CPU supported counters and features by using perf list.

To calculate the number of instructions per cycle, the number of micro-ops executed, the number of cycles during which no micro-ops were dispatched, and the number of cycles stalled on memory (including a breakdown of stalls per memory level), for a duration of 60 seconds and specifically for the Redis process:

$ perf stat -e "cpu-clock,cpu-cycles,instructions,uops_executed.core,uops_executed.stall_cycles,cache-references,cache-misses,cycle_activity.stalls_total,cycle_activity.stalls_mem_any,cycle_activity.stalls_l3_miss,cycle_activity.stalls_l2_miss,cycle_activity.stalls_l1d_miss" --pid $(pgrep redis-server) -- sleep 60

Performance counter stats for process id '3038':

  60046.411437      cpu-clock (msec)          #    1.001 CPUs utilized          
  168991975443      cpu-cycles                #    2.814 GHz                      (36.40%)
  388248178431      instructions              #    2.30  insn per cycle           (45.50%)
  443134227322      uops_executed.core        # 7379.862 M/sec                    (45.51%)
   30317116399      uops_executed.stall_cycles #  504.895 M/sec                    (45.51%)
     670821512      cache-references          #   11.172 M/sec                    (45.52%)
      23727619      cache-misses              #    3.537 % of all cache refs      (45.43%)
   30278479141      cycle_activity.stalls_total #  504.251 M/sec                    (36.33%)
   19981138777      cycle_activity.stalls_mem_any #  332.762 M/sec                    (36.33%)
     725708324      cycle_activity.stalls_l3_miss #   12.086 M/sec                    (36.33%)
    8487905659      cycle_activity.stalls_l2_miss #  141.356 M/sec                    (36.32%)
   10011909368      cycle_activity.stalls_l1d_miss #  166.736 M/sec                    (36.31%)

  60.002765665 seconds time elapsed
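
The ratios that perf prints in the right-hand column, plus a few that it does not, can be derived directly from the raw counters. The sketch below copies the values from the output above and computes instructions per cycle, the cache miss rate, and the share of cycles spent stalled. Treating the stall counters as cycle counts that can be divided by cpu-cycles is an assumption based on the event names. Also note that the percentages in parentheses indicate the counters were multiplexed, so all of these figures are estimates.

```python
# Counter values copied from the perf stat output above.
counters = {
    "cpu-cycles": 168_991_975_443,
    "instructions": 388_248_178_431,
    "cache-references": 670_821_512,
    "cache-misses": 23_727_619,
    "cycle_activity.stalls_total": 30_278_479_141,
    "cycle_activity.stalls_mem_any": 19_981_138_777,
}


def ratio(numerator: str, denominator: str) -> float:
    """Ratio between two named counters."""
    return counters[numerator] / counters[denominator]


# Instructions retired per CPU cycle.
ipc = ratio("instructions", "cpu-cycles")
# Share of cache references that missed, as a percentage.
cache_miss_pct = 100 * ratio("cache-misses", "cache-references")
# Share of cycles with no progress at all, and the part attributed to memory.
stalled_fraction = ratio("cycle_activity.stalls_total", "cpu-cycles")
mem_stalled_fraction = ratio("cycle_activity.stalls_mem_any", "cpu-cycles")

if __name__ == "__main__":
    print(f"IPC:                     {ipc:.2f}")
    print(f"cache miss rate:         {cache_miss_pct:.3f} %")
    print(f"cycles stalled (total):  {stalled_fraction:.1%}")
    print(f"cycles stalled (memory): {mem_stalled_fraction:.1%}")
```

The first two values match the "insn per cycle" and "% of all cache refs" annotations in the perf output. The stall fractions give a quick feel for how much of the on-CPU time is spent waiting rather than computing.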

It’s important to know that there are two very different ways in which PMCs can be used (counting and sampling), and we’ve focused solely on PMC counting for the sake of this analysis. Brendan Gregg clearly explains it on the following link.

3 - Diagnosing latency issues

Finding the causes of slow responses

This document will help you understand what the problem could be if you are experiencing latency problems with Redis.

In this context latency is the maximum delay between the time a client issues a command and the time the reply to the command is received by the client. Usually Redis processing time is extremely low, in the sub microsecond range, but there are certain conditions leading to higher latency figures.

I’ve little time, give me the checklist

The following documentation is very important in order to run Redis in a low latency fashion. However, I understand that we are busy people, so let’s start with a quick checklist. If following these steps does not solve your problem, please return here to read the full documentation.

Make sure you are not running slow commands that are blocking the server. Use the Redis Slow Log feature to check this.
For EC2 users, make sure you use HVM based modern EC2 instances, like m3.medium. Otherwise fork() is too slow.
Transparent huge pages must be disabled from your kernel. Use echo never > /sys/kernel/mm/transparent_hugepage/enabled to disable them, and restart your Redis process.
If you are using a virtual machine, it is possible that you have an intrinsic latency that has nothing to do with Redis. Check the minimum latency you can expect from your runtime environment using ./redis-cli --intrinsic-latency 100. Note: you need to run this command on the server, not on the client.
Enable and use the Latency monitor feature of Redis in order to get a human readable description of the latency events and causes in your Redis instance.

In general, use the following list for durability vs. latency/performance tradeoffs, ordered from stronger safety to better latency.

AOF + fsync always: this is very slow, you should use it only if you know what you are doing.
AOF + fsync every second: this is a good compromise.
AOF + fsync every second + no-appendfsync-on-rewrite option set to yes: this is the same as the above, but avoids fsync during rewrites to lower the disk pressure.
AOF + fsync never. Fsyncing is up to the kernel in this setup, even less disk pressure and risk of latency spikes.
RDB. Here you have a vast spectrum of tradeoffs depending on the save triggers you configure.

And now for people with 15 minutes to spend, the details…

Measuring latency

If you are experiencing latency problems, you probably know how to measure it in the context of your application, or maybe your latency problem is very evident even macroscopically. However redis-cli can be used to measure the latency of a Redis server in milliseconds, just try:

redis-cli --latency -h <host> -p <port>

Using the internal Redis latency monitoring subsystem

Since Redis 2.8.13, Redis provides latency monitoring capabilities that are able to sample different execution paths to understand where the server is blocking. This makes debugging of the problems illustrated in this documentation much simpler, so we suggest enabling latency monitoring ASAP. Please refer to the Latency monitor documentation.

While the latency monitoring sampling and reporting capabilities will make it simpler to understand the source of latency in your Redis system, it is still advised that you read this documentation extensively to better understand the topic of Redis and latency spikes.

Latency baseline

There is a kind of latency that is inherently part of the environment where you run Redis, that is the latency provided by your operating system kernel and, if you are using virtualization, by the hypervisor you are using.

While this latency can’t be removed it is important to study it because it is the baseline, or in other words, you won’t be able to achieve a Redis latency that is better than the latency that every process running in your environment will experience because of the kernel or hypervisor implementation or setup.

We call this kind of latency intrinsic latency, and redis-cli starting from Redis version 2.8.7 is able to measure it. This is an example run under Linux 3.11.0 running on an entry level server.

Note: the argument 100 is the number of seconds the test will be executed. The more time we run the test, the more likely we’ll be able to spot latency spikes. 100 seconds is usually appropriate, however you may want to perform a few runs at different times. Please note that the test is CPU intensive and will likely saturate a single core in your system.

$ ./redis-cli --intrinsic-latency 100
Max latency so far: 1 microseconds.
Max latency so far: 16 microseconds.
Max latency so far: 50 microseconds.
Max latency so far: 53 microseconds.
Max latency so far: 83 microseconds.
Max latency so far: 115 microseconds.

Note: redis-cli in this special case needs to run on the server where you run or plan to run Redis, not on the client. In this special mode redis-cli does not connect to a Redis server at all: it just tries to measure the longest time during which the kernel does not give CPU time to the redis-cli process itself.

In the above example, the intrinsic latency of the system is just 0.115 milliseconds (or 115 microseconds), which is good news. However, keep in mind that the intrinsic latency may change over time depending on the load of the system.
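
The measurement idea behind this mode can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the redis-cli implementation: the function name measure_max_gap_us is hypothetical, and interpreter overhead means the numbers it reports will typically be higher than those printed by redis-cli --intrinsic-latency.

```python
import time


def measure_max_gap_us(seconds: float) -> float:
    """Busy-loop for `seconds` and return the largest gap, in microseconds,
    observed between two consecutive clock reads.

    A large gap suggests the process was not given CPU time for that long.
    """
    deadline = time.perf_counter_ns() + int(seconds * 1e9)
    previous = time.perf_counter_ns()
    max_gap_ns = 0
    while previous < deadline:
        now = time.perf_counter_ns()
        gap = now - previous
        if gap > max_gap_ns:
            max_gap_ns = gap
        previous = now
    return max_gap_ns / 1000.0


if __name__ == "__main__":
    # Short demonstration run; a longer duration is more likely to catch spikes.
    print(f"Max gap observed: {measure_max_gap_us(0.2):.1f} microseconds")
```

Like the real test, this loop is CPU intensive and will keep one core busy for the whole duration.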

Virtualized environments will not show such good numbers, especially under high load or if there are noisy neighbors. The following is a run on a Linode 4096 instance running Redis and Apache:

$ ./redis-cli --intrinsic-latency 100
Max latency so far: 573 microseconds.
Max latency so far: 695 microseconds.
Max latency so far: 919 microseconds.
Max latency so far: 1606 microseconds.
Max latency so far: 3191 microseconds.
Max latency so far: 9243 microseconds.
Max latency so far: 9671 microseconds.

Here we have an intrinsic latency of 9.7 milliseconds: this means that we cannot ask Redis for better than that. However, other runs at different times in different virtualization environments with higher load or with noisy neighbors can easily show even worse values. We were able to measure up to 40 milliseconds in systems otherwise apparently running normally.

Latency induced by network and communication

Clients connect to Redis using a TCP/IP connection or a Unix domain connection. The typical latency of a 1 Gbit/s network is about 200 us, while the latency with a Unix domain socket can be as low as 30 us. It actually depends on your network and system hardware. On top of the communication itself, the system adds some more latency (due to thread scheduling, CPU caches, NUMA placement, etc …). System induced latencies are significantly higher on a virtualized environment than on a physical machine.

The consequence is that even if Redis processes most commands in the sub-microsecond range, a client performing many roundtrips to the server will have to pay for these network and system related latencies.

An efficient client will therefore try to limit the number of roundtrips by pipelining several commands together. This is fully supported by the servers and most clients. Aggregated commands like MSET/MGET can also be used for that purpose. Starting with Redis 2.4, a number of commands also support variadic parameters for all data types.

Here are some guidelines:

If you can afford it, prefer a physical machine over a VM to host the server.
Do not systematically connect/disconnect to the server (especially true for web based applications). Keep your connections as long lived as possible.
If your client is on the same host as the server, use Unix domain sockets.
Prefer to use aggregated commands (MSET/MGET), or commands with variadic parameters (if possible) over pipelining.
Prefer to use pipelining (if possible) over sequence of roundtrips.
Redis supports Lua server-side scripting to cover cases that are not suitable for raw pipelining (for instance when the result of a command is an input for the following commands).
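
To see why the number of roundtrips dominates, here is a back-of-the-envelope model in Python. It is only a sketch: the function name total_time_us is hypothetical, the 200 us and 30 us figures are the typical latencies quoted above, and the 1 us per-command execution cost is an assumed round number.

```python
def total_time_us(commands: int, rtt_us: float, batch_size: int = 1,
                  exec_us: float = 1.0) -> float:
    """Estimate the wall-clock time, in microseconds, to run `commands` commands.

    Each batch of `batch_size` commands costs one network roundtrip (`rtt_us`);
    every command also costs `exec_us` of server time (an assumed figure).
    """
    round_trips = -(-commands // batch_size)  # ceiling division
    return round_trips * rtt_us + commands * exec_us


# 10,000 commands over a 1 Gbit/s network, one roundtrip per command.
print(total_time_us(10_000, rtt_us=200.0))                  # 2010000.0
# Same commands, pipelined in batches of 100.
print(total_time_us(10_000, rtt_us=200.0, batch_size=100))  # 30000.0
# One roundtrip per command over a Unix domain socket.
print(total_time_us(10_000, rtt_us=30.0))                   # 310000.0
```

In this model, pipelining 100 commands per roundtrip cuts the total from about 2 seconds to about 30 milliseconds, which is a larger gain than switching transports alone.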

On Linux, some people can achieve better latencies by playing with process placement (taskset), cgroups, real-time priorities (chrt), NUMA configuration (numactl), or by using a low-latency kernel. Please note vanilla Redis is not really suitable to be bound on a single CPU core. Redis can fork background tasks that can be extremely CPU consuming like BGSAVE or BGREWRITEAOF. These tasks must never run on the same core as the main event loop.

In most situations, these kinds of system-level optimizations are not needed. Only do them if you require them, and if you are familiar with them.

Single threaded nature of Redis

Redis uses a mostly single threaded design. This means that a single process serves all the client requests, using a technique called multiplexing. As a result, Redis can serve only a single request at any given moment, so all the requests are served sequentially. This is very similar to how Node.js works as well. However, neither product is often perceived as being slow. This is in part because of the small amount of time needed to complete a single request, but primarily because these products are designed to not block on system calls, such as reading data from or writing data to a socket.

I said that Redis is mostly single threaded since actually from Redis 2.4 we use threads in Redis in order to perform some slow I/O operations in the background, mainly related to disk I/O, but this does not change the fact that Redis serves all the requests using a single thread.

Latency generated by slow commands

A consequence of being single threaded is that when a request is slow to serve, all the other clients will wait for this request to be served. When executing normal commands, like GET or SET or LPUSH, this is not a problem at all since these commands are executed in constant (and very small) time. However there are commands operating on many elements, like SORT, LREM, SUNION and others. For instance taking the intersection of two big sets can take a considerable amount of time.

The algorithmic complexity of all commands is documented. A good practice is to systematically check it when using commands you are not familiar with.

If you have latency concerns you should either not use slow commands against values composed of many elements, or you should run a replica using Redis replication where you run all your slow queries.

It is possible to monitor slow commands using the Redis Slow Log feature.

Additionally, you can use your favorite per-process monitoring program (top, htop, prstat, etc …) to quickly check the CPU consumption of the main Redis process. If it is high while the traffic is not, it is usually a sign that slow commands are used.

IMPORTANT NOTE: a VERY common source of latency generated by the execution of slow commands is the use of the KEYS command in production environments. KEYS, as documented in the Redis documentation, should only be used for debugging purposes. Since Redis 2.8, new commands were introduced in order to iterate the key space and other large collections incrementally. Please check the SCAN, SSCAN, HSCAN and ZSCAN commands for more information.

Latency generated by fork

In order to generate the RDB file in background, or to rewrite the Append Only File if AOF persistence is enabled, Redis has to fork background processes. The fork operation (running in the main thread) can induce latency by itself.

Forking is an expensive operation on most Unix-like systems, since it involves copying a good number of objects linked to the process. This is especially true for the page table associated to the virtual memory mechanism.

For instance on a Linux/AMD64 system, the memory is divided in 4 kB pages. To convert virtual addresses to physical addresses, each process stores a page table (actually represented as a tree) containing at least a pointer per page of the address space of the process. So a large 24 GB Redis instance requires a page table of 24 GB / 4 kB * 8 = 48 MB.

When a background save is performed, this instance will have to be forked, which will involve allocating and copying 48 MB of memory. It takes time and CPU, especially on virtual machines where allocation and initialization of a large memory chunk can be expensive.
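
The page table arithmetic above can be reproduced with a small Python helper. This is a sketch under the same simplifying assumptions as the text (4 kB pages, one 8-byte pointer per page); the function name page_table_bytes is hypothetical, and the result is a lower bound because the intermediate levels of the tree are ignored.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def page_table_bytes(instance_bytes: int, page_bytes: int = 4096,
                     entry_bytes: int = 8) -> int:
    """Lower bound on the page table size: one entry per page of address space."""
    pages = -(-instance_bytes // page_bytes)  # ceiling division
    return pages * entry_bytes


# A 24 GB instance needs at least 48 MB of page table entries.
print(page_table_bytes(24 * GIB) / MIB)  # 48.0
```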

Fork time in different systems

Modern hardware is pretty fast at copying the page table, but Xen is not. The problem with Xen is not virtualization-specific, but Xen-specific. For instance, using VMware or VirtualBox does not result in slow fork time. The following is a table that compares fork time for different Redis instance sizes. Data is obtained by performing a BGSAVE and looking at the latest_fork_usec field in the INFO command output.

However the good news is that new types of EC2 HVM based instances are much better with fork times, almost on par with physical servers, so for example using m3.medium (or better) instances will provide good results.

Linux beefy VM on VMware 6.0GB RSS forked in 77 milliseconds (12.8 milliseconds per GB).
Linux running on physical machine (Unknown HW) 6.1GB RSS forked in 80 milliseconds (13.1 milliseconds per GB).
Linux running on physical machine (Xeon @ 2.27Ghz) 6.9GB RSS forked in 62 milliseconds (9 milliseconds per GB).
Linux VM on 6sync (KVM) 360 MB RSS forked in 8.2 milliseconds (23.3 milliseconds per GB).
Linux VM on EC2, old instance types (Xen) 6.1GB RSS forked in 1460 milliseconds (239.3 milliseconds per GB).
Linux VM on EC2, new instance types (Xen) 1GB RSS forked in 10 milliseconds (10 milliseconds per GB).
Linux VM on Linode (Xen) 0.9GB RSS forked in 382 milliseconds (424 milliseconds per GB).

As you can see, certain VMs running on Xen have a performance hit of between one and two orders of magnitude. For EC2 users the suggestion is simple: use modern HVM based instances.
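
If you want to compute the same "milliseconds per GB" figure for your own instance, the following Python sketch parses INFO output that you have saved as text. The helper names are hypothetical and the sample values are made up to mirror the first row above; latest_fork_usec and used_memory_rss are the INFO fields the sketch assumes are present.

```python
def parse_info(text: str) -> dict:
    """Parse 'key:value' lines from INFO output, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def fork_ms_per_gb(info: dict) -> float:
    """Fork time in milliseconds divided by resident set size in GB."""
    fork_ms = int(info["latest_fork_usec"]) / 1000.0
    rss_gb = int(info["used_memory_rss"]) / 1024 ** 3
    return fork_ms / rss_gb


# Hypothetical sample: a 6 GB RSS instance that forked in 77 milliseconds.
SAMPLE_INFO = """# Memory
used_memory_rss:6442450944
# Stats
latest_fork_usec:77000
"""

print(round(fork_ms_per_gb(parse_info(SAMPLE_INFO)), 1))  # 12.8
```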

Latency induced by transparent huge pages

Unfortunately when a Linux kernel has transparent huge pages enabled, Redis incurs a big latency penalty after the fork call is used in order to persist on disk. Huge pages are the cause of the following issue:

Fork is called, two processes with shared huge pages are created.
In a busy instance, a few event loop runs will cause commands to target a few thousand pages, causing the copy on write of almost the whole process memory.
This will result in big latency and big memory usage.

Make sure to disable transparent huge pages using the following command:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
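
To verify the current setting, you can read the same file back: the kernel lists the available modes and marks the active one with square brackets. The Python sketch below parses that format; the function names are hypothetical.

```python
THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(contents: str) -> str:
    """Return the active mode from a line such as 'always madvise [never]'."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {contents!r}")


def read_thp_mode(path: str = THP_PATH) -> str:
    """Read the sysfs file and return the active mode (Linux only)."""
    with open(path) as handle:
        return parse_thp_mode(handle.read())


print(parse_thp_mode("always madvise [never]\n"))  # never
```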

Latency induced by swapping (operating system paging)

Linux (and many other modern operating systems) is able to relocate memory pages from the memory to the disk, and vice versa, in order to use the system memory efficiently.

If a Redis page is moved by the kernel from the memory to the swap file, when the data stored in this memory page is used by Redis (for example accessing a key stored into this memory page) the kernel will stop the Redis process in order to move the page back into the main memory. This is a slow operation involving random I/Os (compared to accessing a page that is already in memory) and will result in anomalous latency experienced by Redis clients.

The kernel relocates Redis memory pages to disk mainly for three reasons:

The system is under memory pressure since the running processes are demanding more physical memory than the amount that is available. The simplest instance of this problem is simply Redis using more memory than is available.
The Redis instance data set, or part of the data set, is mostly completely idle (never accessed by clients), so the kernel could swap idle memory pages on disk. This problem is very rare since even a moderately slow instance will touch all the memory pages often, forcing the kernel to retain all the pages in memory.
Some processes are generating massive read or write I/Os on the system. Because files are generally cached, it tends to put pressure on the kernel to increase the filesystem cache, and therefore generate swapping activity. Please note it includes Redis RDB and/or AOF background threads which can produce large files.

Fortunately Linux offers good tools to investigate the problem, so the simplest thing to do when latency due to swapping is suspected is just to check if this is the case.

The first thing to do is to check the amount of Redis memory that is swapped on disk. In order to do so you need to obtain the Redis instance pid:

$ redis-cli info | grep process_id
process_id:5454

Now enter the /proc file system directory for this process:

$ cd /proc/5454

Here you’ll find a file called smaps that describes the memory layout of the Redis process (assuming you are using Linux 2.6.16 or newer). This file contains very detailed information about our process memory maps, and one field called Swap is exactly what we are looking for. However there is not just a single swap field since the smaps file contains the different memory maps of our Redis process (The memory layout of a process is more complex than a simple linear array of pages).

Since we are interested in all the memory swapped by our process, the first thing to do is to grep for the Swap field across the whole file:

$ cat smaps | grep 'Swap:'
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                 12 kB
Swap:                156 kB
Swap:                  8 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB
Swap:                  4 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB

If everything is 0 kB, or if there are sporadic 4k entries, everything is perfectly normal. Actually in our example instance (the one of a real web site running Redis and serving hundreds of users every second) there are a few entries that show more swapped pages. To investigate if this is a serious problem or not we change our command in order to also print the size of the memory map:

$ cat smaps | egrep '^(Swap|Size)'
Size:                316 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  8 kB
Swap:                  0 kB
Size:                 40 kB
Swap:                  0 kB
Size:                132 kB
Swap:                  0 kB
Size:             720896 kB
Swap:                 12 kB
Size:               4096 kB
Swap:                156 kB
Size:               4096 kB
Swap:                  8 kB
Size:               4096 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:               1272 kB
Swap:                  0 kB
Size:                  8 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                 16 kB
Swap:                  0 kB
Size:                 84 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  8 kB
Swap:                  4 kB
Size:                  8 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  4 kB
Size:                144 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  4 kB
Size:                 12 kB
Swap:                  4 kB
Size:                108 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                272 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB

As you can see from the output, there is a map of 720896 kB (with just 12 kB swapped) and 156 kB more swapped in another map: basically a very small amount of our memory is swapped so this is not going to create any problem at all.
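
Reading a long list of Size and Swap lines by eye is error prone, so it can help to add them up. The Python sketch below sums both fields over the text of a smaps file; the function name swap_summary is hypothetical, and the sample reuses the three largest maps from the output above.

```python
def swap_summary(smaps_text: str) -> tuple:
    """Return (total_size_kb, total_swap_kb) summed over all memory maps."""
    size_kb = 0
    swap_kb = 0
    for line in smaps_text.splitlines():
        parts = line.split()
        # Lines of interest look like 'Swap:    12 kB' or 'Size:   316 kB'.
        if len(parts) == 3 and parts[2] == "kB":
            if parts[0] == "Size:":
                size_kb += int(parts[1])
            elif parts[0] == "Swap:":
                swap_kb += int(parts[1])
    return size_kb, swap_kb


SAMPLE = """Size:             720896 kB
Swap:                 12 kB
Size:               4096 kB
Swap:                156 kB
Size:               4096 kB
Swap:                  8 kB
"""

size_kb, swap_kb = swap_summary(SAMPLE)
print(size_kb, swap_kb)  # 729088 176
```

For a live process, the input would be the contents of /proc/<pid>/smaps, which usually requires the same user as the Redis process or root.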

If instead a non trivial amount of the process memory is swapped on disk your latency problems are likely related to swapping. If this is the case with your Redis instance you can further verify it using the vmstat command:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0   3980 697932 147180 1406456    0    0     2     2    2    0  4  4 91  0
 0  0   3980 697428 147180 1406580    0    0     0     0 19088 16104  9  6 84  0
 0  0   3980 697296 147180 1406616    0    0     0    28 18936 16193  7  6 87  0
 0  0   3980 697048 147180 1406640    0    0     0     0 18613 15987  6  6 88  0
 2  0   3980 696924 147180 1406656    0    0     0     0 18744 16299  6  5 88  0
 0  0   3980 697048 147180 1406688    0    0     0     4 18520 15974  6  6 88  0
^C

The interesting parts of the output for our needs are the two columns si and so, which count the amount of memory swapped from/to the swap file. If you see non zero counts in those two columns then there is swapping activity in your system.

Finally, the iostat command can be used to check the global I/O activity of the system.

$ iostat -xk 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.55    0.04    2.92    0.53    0.00   82.95

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.77     0.00    0.01    0.00     0.40     0.00    73.65     0.00    3.62   2.58   0.00
sdb               1.27     4.75    0.82    3.54    38.00    32.32    32.19     0.11   24.80   4.24   1.85

If your latency problem is due to Redis memory being swapped on disk you need to lower the memory pressure in your system, either by adding more RAM if Redis is using more memory than is available, or by avoiding running other memory hungry processes in the same system.

Latency due to AOF and disk I/O

Another source of latency is due to the Append Only File support on Redis. The AOF basically uses two system calls to accomplish its work. One is write(2) that is used in order to write data to the append only file, and the other one is fdatasync(2) that is used in order to flush the kernel file buffer on disk in order to ensure the durability level specified by the user.

Both the write(2) and fdatasync(2) calls can be sources of latency. For instance write(2) can block both when there is a system wide sync in progress, or when the output buffers are full and the kernel needs to flush to disk in order to accept new writes.

The fdatasync(2) call is a worse source of latency as with many combinations of kernels and file systems used it can take from a few milliseconds to a few seconds to complete, especially in the case of some other process doing I/O. For this reason when possible Redis does the fdatasync(2) call in a different thread since Redis 2.4.

We’ll see how configuration can affect the amount and source of latency when using the AOF file.

The AOF can be configured to perform an fsync on disk in three different ways using the appendfsync configuration option (this setting can be modified at runtime using the CONFIG SET command).

When appendfsync is set to the value of no Redis performs no fsync. In this configuration the only source of latency can be write(2). When this happens usually there is no solution since simply the disk can’t cope with the speed at which Redis is receiving data, however this is uncommon if the disk is not seriously slowed down by other processes doing I/O.
When appendfsync is set to the value of everysec Redis performs an fsync every second. It uses a different thread, and if the fsync is still in progress Redis uses a buffer to delay the write(2) call up to two seconds (since write would block on Linux if an fsync is in progress against the same file). However if the fsync is taking too long Redis will eventually perform the write(2) call even if the fsync is still in progress, and this can be a source of latency.
When appendfsync is set to the value of always an fsync is performed at every write operation before replying back to the client with an OK code (actually Redis will try to cluster many commands executed at the same time into a single fsync). In this mode performance is very low in general and it is strongly recommended to use a fast disk and a file system implementation that can perform the fsync in a short time.

Most Redis users will use either the no or everysec setting for the appendfsync configuration directive. The suggestion for minimum latency is to avoid other processes doing I/O in the same system. Using an SSD can help as well, but usually even non SSD disks perform well with the append only file if the disk is otherwise idle, as Redis writes to the append only file without performing any seek.

If you want to investigate your latency issues related to the append only file you can use the strace command under Linux:

sudo strace -p $(pidof redis-server) -T -e trace=fdatasync

The above command will show all the fdatasync(2) system calls performed by Redis in the main thread. With the above command you’ll not see the fdatasync system calls performed by the background thread when the appendfsync config option is set to everysec. In order to do so just add the -f switch to strace.

If you wish you can also see both fdatasync and write system calls with the following command:

sudo strace -p $(pidof redis-server) -T -e trace=fdatasync,write

However since write(2) is also used in order to write data to the client sockets this will likely show too many things unrelated to disk I/O. Apparently there is no way to tell strace to just show slow system calls so I use the following command:

sudo strace -f -p $(pidof redis-server) -T -e trace=fdatasync,write 2>&1 | grep -v '0.0' | grep -v unfinished
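
As an alternative to filtering by text pattern, the durations that the -T switch appends in angle brackets at the end of each line can be parsed and compared against an explicit threshold. The Python sketch below does that; the function name slow_syscalls, the 1 millisecond default, and the sample lines are all illustrative assumptions rather than real captured output.

```python
import re

# strace -T appends the time spent in the call, for example '<0.013492>'.
_DURATION = re.compile(r"<(\d+\.\d+)>\s*$")


def slow_syscalls(lines, threshold_s: float = 0.001) -> list:
    """Return (seconds, line) pairs for calls that took at least `threshold_s`.

    Lines without a trailing duration, such as '<unfinished ...>', are skipped.
    """
    slow = []
    for line in lines:
        match = _DURATION.search(line)
        if match and float(match.group(1)) >= threshold_s:
            slow.append((float(match.group(1)), line.rstrip()))
    return slow


# Illustrative lines in the general shape of strace -f -T output.
SAMPLE_LINES = [
    "fdatasync(7) = 0 <0.000321>",
    "[pid  4242] fdatasync(7) = 0 <0.153000>",
    '[pid  4242] write(8, "abc", 3 <unfinished ...>',
    'write(8, "abc", 3) = 3 <0.002500>',
]

for seconds, line in slow_syscalls(SAMPLE_LINES):
    print(f"{seconds:.6f}s  {line}")
```

Note that strace writes its trace to standard error, so the 2>&1 redirection shown above is still needed when piping its output into a script like this.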

Latency generated by expires

Redis evicts expired keys in two ways:

One lazy way expires a key when it is requested by a command, but it is found to be already expired.
One active way expires a few keys every 100 milliseconds.

The active expiring is designed to be adaptive. An expire cycle is started every 100 milliseconds (10 times per second), and will do the following:

Sample ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP keys, evicting all the keys already expired.
If more than 25% of the keys were found expired, repeat.

Given that ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP is set to 20 by default, and the process is performed ten times per second, usually just 200 keys per second are actively expired. This is enough to clean the DB fast enough even when already expired keys are not accessed for a long time, so that the lazy algorithm does not help. At the same time expiring just 200 keys per second has no effect on the latency of a Redis instance.

However the algorithm is adaptive and will loop if it finds more than 25% of keys already expired in the set of sampled keys. But given that we run the algorithm ten times per second, this means that the unlucky event is more than 25% of the keys in our random sample expiring within the same second.

Basically this means that if the database has many many keys expiring in the same second, and these make up at least 25% of the current population of keys with an expire set, Redis can block in order to get the percentage of keys already expired below 25%.

This approach is needed in order to avoid using too much memory for keys that are already expired, and usually is absolutely harmless since it’s strange that a big number of keys are going to expire in the same exact second, but it is not impossible that the user used EXPIREAT extensively with the same Unix time.

In short: be aware that many keys expiring at the same moment can be a source of latency.
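
The adaptive loop can be illustrated with a simplified Python simulation. This is only a sketch of the two steps described above, not the Redis implementation: the names are hypothetical, the key space is a plain dictionary mapping keys to expire times, and details of the real server, such as any limit on the time spent per cycle, are left out.

```python
import random

LOOKUPS_PER_LOOP = 20  # mirrors the ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP default


def active_expire_cycle(expires: dict, now: float, rng: random.Random) -> int:
    """Run one simplified expire cycle over `expires` (key -> expire time).

    Removes expired keys in place and returns how many were removed.
    """
    removed = 0
    while expires:
        sample = rng.sample(list(expires), min(LOOKUPS_PER_LOOP, len(expires)))
        expired = [key for key in sample if expires[key] <= now]
        for key in expired:
            del expires[key]
        removed += len(expired)
        # Repeat only if more than 25% of the sampled keys were expired.
        if len(expired) * 4 <= len(sample):
            break
    return removed


rng = random.Random(42)

# Every key expired in the same second: the cycle keeps looping until done.
burst = {f"key:{i}": 0.0 for i in range(1000)}
print(active_expire_cycle(burst, now=1.0, rng=rng))  # 1000

# No key expired: a single sample of 20 keys and the cycle stops.
calm = {f"key:{i}": 10.0 for i in range(1000)}
print(active_expire_cycle(calm, now=1.0, rng=rng))   # 0
```

In the first case a single cycle does the work of removing all 1000 keys before returning, which is the blocking behavior the text warns about.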

Redis software watchdog

Redis 2.6 introduces the Redis Software Watchdog that is a debugging tool designed to track those latency problems that for one reason or the other escaped an analysis using normal tools.

The software watchdog is an experimental feature. While it is designed to be used in production environments, care should be taken to back up the database before proceeding as it could possibly have unexpected interactions with the normal execution of the Redis server.

It is important to use it only as last resort when there is no way to track the issue by other means.

This is how this feature works:

The user enables the software watchdog using the CONFIG SET command.
Redis starts monitoring itself constantly.
If Redis detects that the server is blocked in some operation that is not returning fast enough, and that may be the source of the latency issue, a low level report about where the server is blocked is dumped on the log file.
The user contacts the developers writing a message in the Redis Google Group, including the watchdog report in the message.

Note that this feature cannot be enabled using the redis.conf file, because it is designed to be enabled only in already running instances and only for debugging purposes.

To enable the feature just use the following:

CONFIG SET watchdog-period 500

The period is specified in milliseconds. In the above example I specified to log latency issues only if the server detects a delay of 500 milliseconds or greater. The minimum configurable period is 200 milliseconds.

When you are done with the software watchdog you can turn it off setting the watchdog-period parameter to 0. Important: remember to do this because keeping the instance with the watchdog turned on for a longer time than needed is generally not a good idea.

The following is an example of what you’ll see printed in the log file once the software watchdog detects a delay longer than the configured one:

[8547 | signal handler] (1333114359)
--- WATCHDOG TIMER EXPIRED ---
/lib/libc.so.6(nanosleep+0x2d) [0x7f16b5c2d39d]
/lib/libpthread.so.0(+0xf8f0) [0x7f16b5f158f0]
/lib/libc.so.6(nanosleep+0x2d) [0x7f16b5c2d39d]
/lib/libc.so.6(usleep+0x34) [0x7f16b5c62844]
./redis-server(debugCommand+0x3e1) [0x43ab41]
./redis-server(call+0x5d) [0x415a9d]
./redis-server(processCommand+0x375) [0x415fc5]
./redis-server(processInputBuffer+0x4f) [0x4203cf]
./redis-server(readQueryFromClient+0xa0) [0x4204e0]
./redis-server(aeProcessEvents+0x128) [0x411b48]
./redis-server(aeMain+0x2b) [0x411dbb]
./redis-server(main+0x2b6) [0x418556]
/lib/libc.so.6(__libc_start_main+0xfd) [0x7f16b5ba1c4d]
./redis-server() [0x411099]
------

Note: in the example the DEBUG SLEEP command was used in order to block the server. The stack trace is different if the server blocks in a different context.

If you happen to collect multiple watchdog stack traces you are encouraged to send everything to the Redis Google Group: the more traces we obtain, the simpler it will be to understand what the problem with your instance is.

4 - Redis latency monitoring

Discovering slow server events in Redis

Redis is often used for demanding use cases, where it serves a large number of queries per second per instance, but also has strict latency requirements for the average response time and the worst-case latency.

While Redis is an in-memory system, it deals with the operating system in different ways, for example, in the context of persisting to disk. Moreover Redis implements a rich set of commands. Certain commands are fast and run in constant or logarithmic time. Other commands are slower O(N) commands that can cause latency spikes.

Finally, Redis is single threaded. This is usually an advantage from the point of view of the amount of work it can perform per core, and in the latency figures it is able to provide. However, it poses a challenge for latency, since the single thread must be able to perform certain tasks incrementally, for example key expiration, in a way that does not impact the other clients that are served.

For all these reasons, Redis 2.8.13 introduced a new feature called Latency Monitoring, that helps the user to check and troubleshoot possible latency problems. Latency monitoring is composed of the following conceptual parts:

Latency hooks that sample different latency-sensitive code paths.
Time series recording of latency spikes, split by different events.
Reporting engine to fetch raw data from the time series.
Analysis engine to provide human-readable reports and hints according to the measurements.

The rest of this document covers the latency monitoring subsystem details. For more information about the general topic of Redis and latency, see Redis latency problems troubleshooting.

Events and time series

Different monitored code paths have different names and are called events. For example, command is an event that measures latency spikes of possibly slow command executions, while fast-command is the event name for the monitoring of the O(1) and O(log N) commands. Other events are less generic and monitor specific operations performed by Redis. For example, the fork event only monitors the time taken by Redis to execute the fork(2) system call.

A latency spike is an event that takes more time to run than the configured latency threshold. There is a separate time series associated with every monitored event. This is how the time series work:

Every time a latency spike happens, it is logged in the appropriate time series.
Every time series is composed of 160 elements.
Each element is a pair made of a Unix timestamp of the time the latency spike was measured and the number of milliseconds the event took to execute.
Latency spikes for the same event that occur in the same second are merged by taking the maximum latency. Even if continuous latency spikes are measured for a given event, which could happen with a low threshold, at least 160 seconds of history are available.
The all-time maximum latency is recorded for every event.
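
The rules above can be modeled with a small Python class. This is a sketch of the described behavior rather than the server's data structure: the class and method names are hypothetical, and a bounded deque stands in for the fixed-size series of 160 elements.

```python
from collections import deque

TS_LEN = 160  # number of elements in every time series, as stated above


class LatencyTimeSeries:
    """Time series for a single monitored event."""

    def __init__(self) -> None:
        self.samples = deque(maxlen=TS_LEN)  # (unix_time, latency_ms) pairs
        self.all_time_max = 0

    def add_sample(self, unix_time: int, latency_ms: int) -> None:
        self.all_time_max = max(self.all_time_max, latency_ms)
        if self.samples and self.samples[-1][0] == unix_time:
            # Same second as the previous spike: keep only the maximum.
            prev_time, prev_latency = self.samples[-1]
            self.samples[-1] = (prev_time, max(prev_latency, latency_ms))
        else:
            # New second: append, dropping the oldest element when full.
            self.samples.append((unix_time, latency_ms))


series = LatencyTimeSeries()
for latency in (5, 9, 7):
    series.add_sample(1000, latency)   # three spikes in the same second
print(list(series.samples))            # [(1000, 9)]
```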

The framework monitors and logs latency spikes in the execution time of these events:

command: regular commands.
fast-command: O(1) and O(log N) commands.
fork: the fork(2) system call.
rdb-unlink-temp-file: the unlink(2) system call.
aof-write: writing to the AOF - a catchall event for write(2) system calls.
aof-fsync-always: the fsync(2) system call when invoked by the appendfsync always policy.
aof-write-pending-fsync: the fsync(2) system call when there are pending writes.
aof-write-active-child: the fsync(2) system call when performed by a child process.
aof-write-alone: the fsync(2) system call when performed by the main process.
aof-fstat: the fstat(2) system call.
aof-rename: the rename(2) system call for renaming the temporary file after completing BGREWRITEAOF.
aof-rewrite-diff-write: writing the differences accumulated while performing BGREWRITEAOF.
active-defrag-cycle: the active defragmentation cycle.
expire-cycle: the expiration cycle.
eviction-cycle: the eviction cycle.
eviction-del: deletes during the eviction cycle.

How to enable latency monitoring

What is high latency for one use case may not be considered high latency for another. Some applications may require that all queries be served in less than 1 millisecond. For other applications, it may be acceptable for a small number of clients to experience a 2 second latency on occasion.

The first step to enable the latency monitor is to set a latency threshold in milliseconds. Only events that take longer than the specified threshold will be logged as latency spikes. The user should set the threshold according to their needs. For example, if the application requires a maximum acceptable latency of 100 milliseconds, the threshold should be set to log all the events blocking the server for a time equal to or greater than 100 milliseconds.

Enable the latency monitor at runtime in a production server with the following command:

CONFIG SET latency-monitor-threshold 100

Monitoring is turned off by default (threshold set to 0), even if the actual cost of latency monitoring is near zero. While the memory requirements of latency monitoring are very small, there is no good reason to raise the baseline memory usage of a Redis instance that is working well.

Report information with the LATENCY command

The user interface to the latency monitoring subsystem is the LATENCY command. Like many other Redis commands, LATENCY accepts subcommands that modify its behavior. These subcommands are:

LATENCY LATEST - returns the latest latency samples for all events.
LATENCY HISTORY - returns latency time series for a given event.
LATENCY RESET - resets latency time series data for one or more events.
LATENCY GRAPH - renders an ASCII-art graph of an event’s latency samples.
LATENCY DOCTOR - replies with a human-readable latency analysis report.

Refer to each subcommand’s documentation page for further information.

5 - Memory optimization

Strategies for optimizing memory usage in Redis

Special encoding of small aggregate data types

Since Redis 2.2 many data types are optimized to use less space up to a certain size. Hashes, Lists, Sets composed of just integers, and Sorted Sets, when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory (with 5 times less memory used being the average saving).

This is completely transparent from the point of view of the user and API. Since this is a CPU / memory trade-off, it is possible to tune the maximum number of elements and maximum element size for special encoded types using the following redis.conf directives.

hash-max-ziplist-entries 512
hash-max-ziplist-value 64
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
set-max-intset-entries 512

If a specially encoded value overflows the configured max size, Redis will automatically convert it into normal encoding. This operation is very fast for small values, but if you change the settings in order to use specially encoded values for much larger aggregate types, run some benchmarks and tests to check the conversion time.
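The rule above can be sketched as a small predicate. This is a simplified model of the decision, not Redis source code: it assumes a hash keeps its compact encoding while it has no more entries than hash-max-ziplist-entries and no field or value longer than hash-max-ziplist-value bytes. The method name compact_encoding? is hypothetical.

```ruby
# Limits taken from the redis.conf directives shown above.
MAX_ENTRIES = 512 # hash-max-ziplist-entries
MAX_VALUE   = 64  # hash-max-ziplist-value, in bytes

# Simplified model: true while the hash stays within both limits.
def compact_encoding?(hash, max_entries: MAX_ENTRIES, max_value: MAX_VALUE)
  return false if hash.size > max_entries

  hash.all? do |field, value|
    field.to_s.bytesize <= max_value && value.to_s.bytesize <= max_value
  end
end

small     = { 'name' => 'Ada', 'email' => 'ada@example.com' }
big_value = { 'bio' => 'x' * 65 }
many      = (1..513).to_h { |i| ["f#{i}", 'v'] }

puts compact_encoding?(small)     # true
puts compact_encoding?(big_value) # false: one value is longer than 64 bytes
puts compact_encoding?(many)      # false: more than 512 entries
```

On a live server, the OBJECT ENCODING command reports which encoding a key currently uses, which is a more reliable check than any client-side model.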

Using 32 bit instances

Redis compiled with a 32 bit target uses a lot less memory per key, since pointers are small, but such an instance will be limited to 4 GB of maximum memory usage. To compile Redis as a 32 bit binary use make 32bit. RDB and AOF files are compatible between 32 bit and 64 bit instances (and between little and big endian of course) so you can switch from 32 to 64 bit, or the other way around, without problems.

Bit and byte level operations

Redis 2.2 introduced new bit and byte level operations: GETRANGE, SETRANGE, GETBIT and SETBIT. Using these commands you can treat the Redis string type as a random access array. For instance, if you have an application where users are identified by a unique progressive integer number, you can use a bitmap to record whether each user is subscribed to a mailing list, setting the bit for subscribed and clearing it for unsubscribed, or the other way around. With 100 million users this data will take just 12 megabytes of RAM in a Redis instance (100 million bits is 12.5 million bytes). You can do the same using GETRANGE and SETRANGE in order to store one byte of information for each user. This is just an example, but it is actually possible to model a number of problems in very little space with these primitives.
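The arithmetic behind the 12 megabyte figure, and the way a bit offset maps onto bytes, can be illustrated in plain Ruby. The helpers below (bitmap_bytes, set_bit, get_bit) are hypothetical stand-ins that operate on a local binary string rather than on a Redis key; they assume offset 0 is the most significant bit of the first byte, as with SETBIT and GETBIT.

```ruby
# Bytes needed to hold one bit per user, rounding up to whole bytes.
def bitmap_bytes(user_count)
  (user_count + 7) / 8
end

# Set or clear one bit, growing the string with zero bytes when needed.
def set_bit(bitmap, offset, bit)
  byte_index, bit_index = offset.divmod(8)
  if bitmap.bytesize <= byte_index
    bitmap << ("\x00".b * (byte_index + 1 - bitmap.bytesize))
  end
  mask = 0x80 >> bit_index
  byte = bitmap.getbyte(byte_index)
  bitmap.setbyte(byte_index, bit == 1 ? byte | mask : byte & ~mask)
  bitmap
end

# Read one bit; offsets past the end of the string read as 0.
def get_bit(bitmap, offset)
  byte_index, bit_index = offset.divmod(8)
  byte = bitmap.getbyte(byte_index) || 0
  (byte >> (7 - bit_index)) & 1
end

bytes = bitmap_bytes(100_000_000)
mib = (bytes / 1024.0 / 1024.0).round(1)
puts bytes # 12500000
puts mib   # 11.9

subscribers = String.new(encoding: Encoding::BINARY)
set_bit(subscribers, 10, 1)   # user 10 subscribes
puts get_bit(subscribers, 10) # 1
puts get_bit(subscribers, 11) # 0
puts subscribers.bytesize     # 2
```

Against a real instance the same idea would use SETBIT and GETBIT on a single key, with the user ID as the bit offset.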

Use hashes when possible

Small hashes are encoded in a very small space, so you should try representing your data using hashes whenever possible. For instance if you have objects representing users in a web application, instead of using different keys for name, surname, email, password, use a single hash with all the required fields.

If you want to know more about this, read the next section.

Using hashes to abstract a very memory efficient plain key-value store on top of Redis

I understand the title of this section is a bit scary, but I’m going to explain in detail what this is about.

Basically it is possible to model a plain key-value store using Redis, where values can only be strings, that is not just more memory efficient than Redis plain keys but also much more memory efficient than memcached.

Let’s start with some facts: a few keys use a lot more memory than a single key containing a hash with a few fields. How is this possible? We use a trick. In theory in order to guarantee that we perform lookups in constant time (also known as O(1) in big O notation) there is the need to use a data structure with a constant time complexity in the average case, like a hash table.

But many times hashes contain just a few fields. When hashes are small we can instead just encode them in an O(N) data structure, like a linear array with length-prefixed key value pairs. Since we do this only when N is small, the amortized time for HGET and HSET commands is still O(1): the hash will be converted into a real hash table as soon as the number of elements it contains grows too large (you can configure the limit in redis.conf).

This does not only work well from the point of view of time complexity, but also from the point of view of constant times, since a linear array of key value pairs happens to play very well with the CPU cache (it has a better cache locality than a hash table).

However, since hash fields and values are not (always) represented as full featured Redis objects, hash fields can’t have an associated time to live (expire) like a real key, and can only contain a string. But we are okay with this; it was the intention anyway when the hash data type API was designed (we trust simplicity more than features, so nested data structures are not allowed, just as expires on single fields are not allowed).

So hashes are memory efficient. This is useful when using hashes to represent objects or to model other problems where there are groups of related fields. But what if we just need a plain key-value store?

Imagine we want to use Redis as a cache for many small objects, which can be JSON encoded objects, small HTML fragments, simple key -> boolean values and so forth. Basically anything that is a string -> string map with small keys and values.

Now let’s assume the objects we want to cache are numbered, like:

object:102393
object:1234
object:5

This is what we can do. Every time we perform a SET operation to set a new value, we actually split the key into two parts, one part used as a key, and the other part used as the field name for the hash. For instance the object named “object:1234” is actually split into:

a Key named object:12
a Field named 34

So we use all the characters but the last two for the key, and the final two characters for the hash field name. To set our key we use the following command:

HSET object:12 34 somevalue

As you can see every hash will end up containing 100 fields, which is an optimal compromise between CPU and memory saved.

There is another important thing to note: with this schema every hash will have more or less 100 fields regardless of the number of objects we cached. This is because our objects will always end with a number, and not a random string. In some way the final number can be considered as a form of implicit pre-sharding.

What about small numbers? Like object:2? We handle this case using just “object:” as a key name, and the whole number as the hash field name. So object:2 and object:10 will both end inside the key “object:”, but one as field name “2” and one as “10”.
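To check the claim that hashes stay at roughly 100 fields, the splitting rule can be applied to a range of IDs without touching a server. The sketch below is illustrative only; split_key is a hypothetical helper that returns the key and field as a two-element array.

```ruby
# All digits but the last two go into the key; the last two become the
# field name. IDs of one or two digits share the hash named "object:".
def split_key(name)
  prefix, id = name.split(':')
  if id.length > 2
    ["#{prefix}:#{id[0..-3]}", id[-2..-1]]
  else
    ["#{prefix}:", id]
  end
end

# Group 100,001 object names by hash key and count the fields in each.
fields_per_hash = Hash.new(0)
(0..100_000).each do |id|
  key, _field = split_key("object:#{id}")
  fields_per_hash[key] += 1
end

p split_key('object:1234')      # ["object:12", "34"]
p split_key('object:2')         # ["object:", "2"]
puts fields_per_hash.size       # 1001
puts fields_per_hash.values.max # 100
```

So 100,001 objects end up in 1,001 hashes, none of which holds more than 100 fields.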

How much memory do we save this way?

I used the following Ruby program to test how this works:

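require 'rubygems'
require 'redis'

# Set to false to store every object as a plain top-level key instead.
USE_OPTIMIZATION = true

# Split "object:1234" into a hash key ("object:12") and a field ("34").
# IDs of one or two digits all go into the hash named "object:".
def hash_get_key_field(key)
  s = key.split(':')
  if s[1].length > 2
    { key: s[0] + ':' + s[1][0..-3], field: s[1][-2..-1] }
  else
    { key: s[0] + ':', field: s[1] }
  end
end

# Store value under the hash key and field derived from key.
def hash_set(r, key, value)
  kf = hash_get_key_field(key)
  r.hset(kf[:key], kf[:field], value)
end

# Read back the value stored by hash_set.
def hash_get(r, key)
  kf = hash_get_key_field(key)
  r.hget(kf[:key], kf[:field])
end

# Populate the instance, then compare used memory for both settings.
r = Redis.new
(0..100_000).each do |id|
  key = "object:#{id}"
  if USE_OPTIMIZATION
    hash_set(r, key, 'val')
  else
    r.set(key, 'val')
  end
end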

This is the result against a 64 bit instance of Redis 2.2:

USE_OPTIMIZATION set to true: 1.7 MB of used memory
USE_OPTIMIZATION set to false: 11 MB of used memory

This is a saving of more than six times, close to an order of magnitude. I think this makes Redis more or less the most memory efficient plain key value store out there.

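WARNING: for this to work, make sure that your redis.conf contains something like the following. The directive names below use the zipmap spelling from the Redis 2.2 instance tested above, while the directives listed earlier in this section use the ziplist spelling, so use whichever name your version's redis.conf recognizes: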

hash-max-zipmap-entries 256

Also remember to set the following field according to the maximum size of your keys and values:

hash-max-zipmap-value 1024

Every time a hash exceeds the number of elements or element size specified it will be converted into a real hash table, and the memory saving will be lost.

You may ask, why don’t you do this implicitly in the normal key space so that I don’t have to care? There are two reasons: one is that we tend to make tradeoffs explicit, and this is a clear tradeoff between many things: CPU, memory, max element size. The second is that the top level key space must support a lot of interesting things like expires, LRU data, and so forth so it is not practical to do this in a general way.

But the Redis Way is that users must understand how things work so that they are able to pick the best compromise, and to understand exactly how the system will behave.

Memory allocation

To store user keys, Redis allocates at most as much memory as the maxmemory setting enables (however there are small extra allocations possible).

The exact value can be set in the configuration file or set later via CONFIG SET (see Using memory as an LRU cache for more info). There are a few things that should be noted about how Redis manages memory:

Redis will not always free up (return) memory to the OS when keys are removed. This is not something special about Redis, but it is how most malloc() implementations work. For example, if you fill an instance with 5GB worth of data, and then remove the equivalent of 2GB of data, the Resident Set Size (also known as the RSS, which is the number of memory pages consumed by the process) will probably still be around 5GB, even if Redis will claim that the user memory is around 3GB. This happens because the underlying allocator can’t easily release the memory. For example, often most of the removed keys were allocated in the same pages as other keys that still exist.
The previous point means that you need to provision memory based on your peak memory usage. If your workload from time to time requires 10GB, even if most of the time 5GB could do, you need to provision for 10GB.
However, allocators are smart and are able to reuse free chunks of memory, so after you have freed 2GB of your 5GB data set, when you start adding more keys again, you’ll see the RSS stay steady and not grow as you add up to 2GB of additional keys. The allocator is basically trying to reuse the 2GB of memory previously (logically) freed.
Because of all this, the fragmentation ratio is not reliable when peak memory usage was much larger than the currently used memory. The fragmentation is calculated as the physical memory actually used (the RSS value) divided by the amount of memory currently in use (as the sum of all the allocations performed by Redis). Because the RSS reflects the peak memory, when the (virtually) used memory is low since a lot of keys / values were freed, but the RSS is high, the ratio RSS / mem_used will be very high.
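As a worked example of the ratio described in the last point, the following sketch plugs in the 5GB and 3GB figures from the scenario above. The method name fragmentation_ratio is hypothetical, and the byte counts are illustrative rather than measured.

```ruby
GB = 1024**3

# Ratio as described above: resident set size divided by the memory that
# Redis itself accounts for.
def fragmentation_ratio(rss_bytes, used_bytes)
  rss_bytes.to_f / used_bytes
end

peak_rss          = 5 * GB # RSS stays near the peak after the deletes
used_after_delete = 3 * GB # logical usage after removing 2GB of keys

puts fragmentation_ratio(peak_rss, used_after_delete).round(2) # 1.67
```

A ratio of 1.67 here says nothing about allocator inefficiency; it only reflects that the process once held 5GB. On a live instance, the used_memory_rss and used_memory fields of INFO memory supply the two inputs.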

If maxmemory is not set, Redis will keep allocating memory as it sees fit and thus it can (gradually) eat up all your free memory. Therefore it is generally advisable to configure some limit. You may also want to set maxmemory-policy to noeviction (which is not the default value in some older versions of Redis).

With noeviction, Redis returns an out of memory error for write commands if and when it reaches the limit. This may in turn result in errors in the application, but it will not render the whole machine dead because of memory starvation.
