Improving Firecracker MicroVM Launch Time with Memory Snapshotting

Introduction

Firecracker microVMs are a popular solution for launching lightweight virtual machines with fast boot times and low overhead. However, for certain use cases, even the few seconds of boot time can be a bottleneck, especially in highly dynamic environments where quick provisioning of virtualized environments is essential.

One promising method to significantly reduce launch time is memory snapshotting. By taking a snapshot of a fully initialized microVM, including its memory and CPU state, you can resume the VM almost instantly. This eliminates the time-consuming steps of kernel boot and user space initialization, allowing the microVM to be available within milliseconds.

Current Problem: Long Boot Time

Currently, launching a microVM takes about 7 to 10 seconds on average, primarily due to the following stages:

Kernel boot time: Initializing the Linux kernel, which typically takes around 2 seconds.
User space initialization: Services like SSH and networking daemons, taking an additional 4 to 5 seconds.

While optimizations such as adjusting kernel parameters and streamlining services can reduce this time, further speedups are limited due to the inherent initialization requirements. This is where memory snapshotting comes into play.

Solution: MicroVM Memory Snapshotting

Memory snapshotting is a feature provided by Firecracker that allows you to pause a running microVM, save its state (including memory and CPU), and later resume it from this exact state. This approach bypasses the need for booting the VM from scratch and skips kernel initialization, resulting in sub-second startup times.

The benefits include:

Reduced launch time: VMs can be resumed in milliseconds instead of seconds.
Preserved state: The VM resumes exactly where it left off, with all applications and services already running.

How it Works

Initialize the VM: A microVM is launched and fully initialized with all required services, such as the internal daemon, SSH, and network interfaces.
Pause and Snapshot: The microVM is paused, and a snapshot of its memory and CPU state is written to disk.
Resume: When a VM is needed later, the saved state is loaded and the VM resumes from it, making it available almost instantaneously.
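The stages above map onto a handful of API calls. The following is a minimal sketch, assuming the socket path and placeholder file paths used in this article; fc_request, snapshot_vm, restore_vm, and the FC_SOCKET variable are hypothetical names chosen for illustration, not part of Firecracker.

```shell
# Hypothetical helpers; FC_SOCKET is an assumed variable name.
FC_SOCKET="${FC_SOCKET:-/tmp/firecracker.socket}"

# Send one JSON request to the Firecracker API socket.
# usage: fc_request METHOD PATH JSON_BODY
fc_request() {
    curl --silent --show-error --fail \
        --unix-socket "$FC_SOCKET" \
        -X "$1" "http://localhost$2" \
        -H 'Accept: application/json' \
        -H 'Content-Type: application/json' \
        -d "$3"
}

# Pause the running microVM, then write a full snapshot.
# usage: snapshot_vm SNAPSHOT_FILE MEMORY_FILE
snapshot_vm() {
    fc_request PATCH /vm '{"state": "Paused"}' &&
    fc_request PUT /snapshot/create \
        "{\"snapshot_type\": \"Full\", \"snapshot_path\": \"$1\", \"mem_file_path\": \"$2\"}"
}

# Load the snapshot and resume the microVM in one call.
# usage: restore_vm SNAPSHOT_FILE MEMORY_FILE
restore_vm() {
    fc_request PUT /snapshot/load \
        "{\"snapshot_path\": \"$1\", \"mem_backend\": {\"backend_path\": \"$2\", \"backend_type\": \"File\"}, \"resume_vm\": true}"
}
```

The sketch interpolates paths straight into the JSON body, so it assumes they contain no quotes or backslashes. The sections below walk through each request individually.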

Implementing Memory Snapshotting in Firecracker

Step 1: Pausing the MicroVM

Once the microVM is running and all services are initialized, you can pause it using Firecracker’s API:

curl --unix-socket /tmp/firecracker.socket -i \
-X PATCH 'http://localhost/vm' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "state": "Paused"
}'

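This command pauses the microVM, freezing its current state. Firecracker requires the microVM to be paused before a snapshot can be created, so that the state does not change while it is being written.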
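The same endpoint also accepts "Resumed", which is useful if the original microVM should keep running after the snapshot is taken. As a rough sketch, the helpers below wrap the state change and query the instance information endpoint, whose JSON response includes a "state" field. The function names and the FC_SOCKET variable are hypothetical.

```shell
# Hypothetical helpers; FC_SOCKET is an assumed variable name.
FC_SOCKET="${FC_SOCKET:-/tmp/firecracker.socket}"

# Set the microVM state; the API accepts "Paused" and "Resumed".
# usage: fc_set_state Paused|Resumed
fc_set_state() {
    case "$1" in
        Paused|Resumed) ;;
        *) echo "fc_set_state: expected Paused or Resumed, got '$1'" >&2
           return 2 ;;
    esac
    curl --silent --show-error --fail \
        --unix-socket "$FC_SOCKET" \
        -X PATCH 'http://localhost/vm' \
        -H 'Content-Type: application/json' \
        -d "{\"state\": \"$1\"}"
}

# Print instance information as JSON (includes a "state" field).
fc_instance_info() {
    curl --silent --show-error --fail \
        --unix-socket "$FC_SOCKET" \
        -H 'Accept: application/json' \
        'http://localhost/'
}
```

Calling fc_instance_info after fc_set_state Paused is one way to confirm the pause took effect before moving on to the snapshot.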

Step 2: Creating the Snapshot

After pausing the VM, create the snapshot, which saves the guest memory and the metadata about the VM configuration:

curl --unix-socket /tmp/firecracker.socket -i \
-X PUT 'http://localhost/snapshot/create' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "snapshot_type": "Full",
    "snapshot_path": "/path/to/snapshot_file",
    "mem_file_path": "/path/to/memory_file",
    "version": "1.2.0"
}'

The "version" field is optional and names the Firecracker version the snapshot should target. Newer Firecracker releases have deprecated it, so check the API specification for the release you run.

This creates a snapshot of the current state of the VM, written as two files:

The memory file (mem_file_path): the guest memory, which will be the size of the VM’s memory allocation.
The snapshot file (snapshot_path): the CPU and device state, along with metadata about attached drives, network devices, and the microVM configuration.
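Before relying on a snapshot, it can be worth confirming that both output files were actually written. A minimal sketch, using a hypothetical helper name and whatever paths were passed to the create request:

```shell
# Hypothetical helper: fail if either snapshot file is missing or empty.
# usage: check_snapshot_files SNAPSHOT_FILE MEMORY_FILE
check_snapshot_files() {
    for f in "$1" "$2"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    # Report sizes in bytes; the memory file is expected to dominate.
    wc -c "$1" "$2"
}
```

This only checks that the files exist and are non-empty. It does not validate their contents.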

Step 3: Loading the Snapshot

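When a new VM instance is required, start a fresh Firecracker process and, instead of configuring and booting a VM, load the saved snapshot. Firecracker only accepts a snapshot load before the microVM has been configured or started: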

curl --unix-socket /tmp/firecracker.socket -i \
-X PUT 'http://localhost/snapshot/load' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "snapshot_path": "/path/to/snapshot_file",
    "mem_backend": {
        "backend_path": "/path/to/memory_file",
        "backend_type": "File"
    },
    "enable_diff_snapshots": false,
    "resume_vm": true
}'

Because "resume_vm" is set to true, the microVM starts running as soon as the snapshot is loaded. It continues from the exact point where it was paused, effectively skipping the entire boot sequence.
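The load request goes to a newly started Firecracker process, so a script usually has to wait for that process to create its API socket first. The helper below is one hedged way to do that. Its name and the polling interval are assumptions, and fractional sleep values rely on a sleep implementation that supports them, as GNU coreutils does.

```shell
# Hypothetical helper: wait until a Unix socket appears at PATH.
# Polls every 0.1 s; TRIES defaults to 50 (about 5 seconds).
# usage: wait_for_socket PATH [TRIES]
wait_for_socket() {
    tries="${2:-50}"
    while [ "$tries" -gt 0 ]; do
        [ -S "$1" ] && return 0
        sleep 0.1
        tries=$((tries - 1))
    done
    return 1
}

# Example use (not executed here):
#   firecracker --api-sock /tmp/firecracker.socket &
#   wait_for_socket /tmp/firecracker.socket || echo "API socket never appeared" >&2
```

Once the socket exists, the snapshot/load request can be sent to it.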

Benefits of Snapshotting

1. Dramatic Reduction in Launch Time

By using snapshots, we reduce the microVM launch time from 7-10 seconds down to less than a second. The microVM is immediately ready with all necessary services running, including network configurations, SSH, and any internal daemons.
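To check the launch-time figure in your own environment, the restore request can be timed. The sketch below assumes GNU date, whose %N format gives nanoseconds. time_ms is a hypothetical helper name.

```shell
# Hypothetical helper: print the wall-clock milliseconds a command took.
# Assumes GNU date (%N = nanoseconds). The command's own stdout is discarded.
# usage: time_ms COMMAND [ARGS...]
time_ms() {
    start_ns=$(date +%s%N)
    status=0
    "$@" >/dev/null || status=$?
    end_ns=$(date +%s%N)
    echo $(( (end_ns - start_ns) / 1000000 ))
    return "$status"
}
```

Wrapping the snapshot/load curl command from Step 3 in time_ms measures only how long the API call takes to return. Time until a guest service answers, for example an SSH connection succeeding, would need to be measured separately.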

2. Increased Efficiency for Dynamic Environments

In environments where rapid scaling is required, such as cloud-native applications or isolated dev environments, memory snapshotting allows us to provision VMs faster than traditional methods, making this approach highly scalable.

3. State Preservation

Snapshots preserve the exact state of the running VM. Any applications that were running continue to function after the VM is resumed. This can be useful for environments where users need to save and restore their sessions.

Caveats and Considerations

1. Disk Space

Each snapshot includes a full memory dump, which can be as large as the memory allocated to the VM (e.g., 3 GB for a VM with 3 GB of memory). Keeping many snapshots therefore requires significant storage.
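A quick estimate helps here: the storage needed is roughly the VM memory size multiplied by the number of snapshots kept. The helpers below are an illustrative sketch with hypothetical names. The estimate ignores the much smaller snapshot state files.

```shell
# Hypothetical helper: estimate storage (MiB) for COUNT full snapshots.
# usage: snapshot_storage_mib VM_MEMORY_MIB COUNT
snapshot_storage_mib() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: succeed if DIR's filesystem has room for them.
# usage: has_room_for_snapshots DIR VM_MEMORY_MIB COUNT
has_room_for_snapshots() {
    needed_kib=$(( $2 * $3 * 1024 ))
    # Column 4 of POSIX-format df output is available space in KiB.
    avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge "$needed_kib" ]
}

# Example: 20 snapshots of a 3072 MiB VM
#   snapshot_storage_mib 3072 20    # prints 61440 (MiB, i.e. 60 GiB)
```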

2. Snapshot Compatibility

Snapshots are version-specific. A snapshot created in one version of Firecracker may not be compatible with other versions. It's essential to ensure compatibility when upgrading Firecracker or changing configurations.
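One simple way to guard against version mismatches is to record the Firecracker version next to each snapshot and compare it before loading. The sketch below is an assumption-laden illustration: the helper names and the ".fc-version" sidecar file are hypothetical, and an exact string match is deliberately conservative and may reject combinations that would in fact work.

```shell
# Hypothetical helper: store the creating version in a sidecar file.
# usage: record_snapshot_version SNAPSHOT_FILE VERSION_STRING
record_snapshot_version() {
    printf '%s\n' "$2" > "$1.fc-version"
}

# Hypothetical helper: succeed only if the recorded version matches.
# Returns 2 if nothing was recorded, 1 on a mismatch.
# usage: check_snapshot_version SNAPSHOT_FILE VERSION_STRING
check_snapshot_version() {
    if [ ! -f "$1.fc-version" ]; then
        echo "no version recorded for $1" >&2
        return 2
    fi
    recorded=$(cat "$1.fc-version")
    if [ "$recorded" != "$2" ]; then
        echo "snapshot created by '$recorded', now running '$2'" >&2
        return 1
    fi
    return 0
}

# Example use (assumes the first line of `firecracker --version` names the version):
#   record_snapshot_version /path/to/snapshot_file "$(firecracker --version | head -n 1)"
```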

3. Dynamic Resources

If the microVM relies on dynamic resources like network interfaces or external drives, they must be present and correctly configured when the snapshot is loaded. If these resources are not available, the VM may fail to resume.
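A preflight check before loading can catch missing resources early. The following sketch assumes a Linux host, where network interfaces appear under /sys/class/net. The helper name and its arguments are hypothetical, and the interface name and drive paths must be the ones the original microVM was configured with.

```shell
# Hypothetical helper: verify the host resources a snapshot expects.
# usage: check_restore_resources TAP_NAME DRIVE_PATH...
check_restore_resources() {
    tap="$1"
    shift
    status=0
    if [ ! -e "/sys/class/net/$tap" ]; then
        echo "missing network interface: $tap" >&2
        status=1
    fi
    for drive in "$@"; do
        if [ ! -r "$drive" ]; then
            echo "missing or unreadable drive: $drive" >&2
            status=1
        fi
    done
    return "$status"
}
```

The function reports every missing resource rather than stopping at the first, so one run shows everything that needs fixing.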

Conclusion

Firecracker’s memory snapshotting is a powerful tool for reducing the boot time of microVMs. By pausing and resuming virtual machines from a saved state, we can effectively reduce launch times to less than a second, significantly improving performance in dynamic and high-demand environments. While this solution comes with its own set of challenges, such as managing disk space and resource dependencies, the benefits far outweigh the drawbacks for use cases requiring rapid provisioning of isolated environments.