I spent last weekend staring at Datadog dashboards, watching our Java microservices take their sweet, agonizing 14 seconds to spin up during a sudden traffic spike. By the time Kubernetes provisioned the new pods and the JVM warmed up, the user surge was already dropping requests. It's 2026. We shouldn't be dealing with startup penalties like it's 2015.

Well, that's not entirely accurate: I've historically been skeptical of the whole "just compile it to native" push. GraalVM native images are fantastic until you hit reflection hell with an older dependency, and suddenly you're writing custom reachability metadata for three days straight. So when Azul Zulu started heavily pushing their Coordinated Restore at Checkpoint (CRaC) support recently, I decided to actually benchmark it. Combine that with the recent Spring Boot updates that quietly patched a handful of nasty CVEs, and I finally had an excuse to upgrade our core pricing service.

The 450ms Reality Check

I took our heaviest legacy pricing API, a massive monolith running Spring Boot 3.5.3 that normally takes 12.4 seconds to report healthy, and set up Azul Zulu 21 on my M3 Max MacBook running Sonoma 14.4. The process is straightforward but weird if you haven't done it before: you run the app, let the application context fully load, hit it with some dummy traffic to warm up the JIT compiler, and then trigger the checkpoint. The JVM dumps its entire state to disk and kills the process.

The restore? 450 milliseconds. I actually thought it had crashed. I checked the logs three times to verify it had actually bound to port 8080 and was accepting traffic. Going from a 12.4-second cold start to sub-half-second availability completely changes the math on whether Java is viable for auto-scaling and scale-to-zero workloads in a cloud environment.
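For reference, the whole checkpoint/restore cycle boils down to three commands on a CRaC-enabled JDK. The image path and jar name here are placeholders; the `-XX:CRaC*` flags and the `jcmd JDK.checkpoint` command are the documented CRaC interface:

```shell
# 1. Start the app on a CRaC-enabled JDK, telling it where to write the image
java -XX:CRaCCheckpointTo=/opt/crac-image -jar pricing-service.jar

# 2. After warming up the JIT with traffic, trigger the checkpoint
#    (this dumps the process state to disk and kills the JVM)
jcmd pricing-service.jar JDK.checkpoint

# 3. Restore: the JVM resumes from the image instead of booting from scratch
java -XX:CRaCRestoreFrom=/opt/crac-image
```

Note that these only work on a CRaC-enabled build like Azul Zulu; a stock OpenJDK will reject the flags.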

The Database Pool Gotcha

There is a catch, though. The documentation glosses over exactly how much manual intervention you need for external connections. If you just checkpoint a running Spring Boot app, the JVM saves the state of your HikariCP database connections. When you restore that image on a different machine (or even the same machine ten minutes later), those TCP connections are dead. The database dropped them. Your app will throw a massive wall of SocketException errors the second a user tries to load a page. You have to explicitly tell your app to close its pools before checkpointing and reopen them on restore. Spring Framework handles a lot of this automatically now, but if you have custom thread pools or raw socket connections, you need to implement the CRaC Resource interface.
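As a sketch of that pattern, here is roughly what the lifecycle hook looks like with the `org.crac` API. The `Resource` interface and `Core.getGlobalContext()` are the real library entry points; the class name and the Hikari wiring are illustrative, and Spring's built-in CRaC support does the equivalent for the beans it manages:

```java
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Illustrative: closes the Hikari pool before a checkpoint and rebuilds it on
// restore, so the restored image never tries to reuse dead TCP connections.
public class PoolCheckpointHandler implements Resource {

    private final HikariConfig config;
    private volatile HikariDataSource dataSource;

    public PoolCheckpointHandler(HikariConfig config) {
        this.config = config;
        this.dataSource = new HikariDataSource(config);
        // Registration is what makes the JVM call us around checkpoint/restore.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        dataSource.close(); // drop all pooled TCP connections before the snapshot
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        dataSource = new HikariDataSource(config); // fresh connections post-restore
    }

    public HikariDataSource dataSource() {
        return dataSource;
    }
}
```

One gotcha within the gotcha: the global context holds registered resources weakly, so keep a strong reference to the handler (e.g. as a Spring bean) or the callbacks may never fire.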

Surviving the CVE Fire Drills

And while I was messing around with JVM flags, the security team flagged two new high-severity CVEs in our dependency tree. The usual drill. But what I appreciated about the recent Spring Boot updates is how they handled the mitigations. They patched the path traversal vulnerability in the embedded Tomcat layer without breaking our custom servlet filters. I was terrified I’d have to rewrite our authentication middleware, which is tightly coupled to how Tomcat parses headers. I bumped the version in our pom.xml, ran the test suite, and everything just passed. It’s rare that a security update doesn’t break at least one obscure integration in a five-year-old codebase.
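In Maven terms the entire mitigation was a one-line change to the parent version. The version below is illustrative, not the specific patched release; use whatever your security scanner points at:

```xml
<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <!-- bumped from 3.5.3 to the patched release -->
    <version>3.5.x</version>
</parent>
```

Because the parent pins the embedded Tomcat version transitively, the Tomcat fix rides along without touching any other dependency declarations.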

Looking at JDK 26

With OpenJDK 26 dropping right around now, the backlog of JEPs (JDK Enhancement Proposals) is getting interesting. Value Objects (JEP 401) are finally taking shape, which will eventually make memory layouts way more efficient. But honestly, I'm ignoring most of the syntax sugar for now. What I'm watching is the tooling maturing around the Foreign Function & Memory (FFM) API, which was finalized back in JDK 22. We have a Python microservice that exists purely to call a C++ library for heavy matrix math, and once the FFM tooling matures a bit more I can probably retire that service entirely by Q2 2027.

I'm migrating our staging cluster to Azul Zulu with CRaC enabled next week. The infrastructure team is already arguing about how to manage the checkpoint image files in our CI/CD pipeline, which is a fair concern: the checkpoint files are huge, often matching the allocated heap size. But I'll deal with the storage costs. I'm just thrilled I don't have to rewrite this monolith in Go.
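For anyone who hasn't touched the FFM API yet, calling into native code no longer needs JNI glue. A minimal sketch, calling libc's `strlen` (runs on JDK 22+; nothing here is specific to our pricing service):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class FfmDemo {

    // Looks up strlen in the standard C library and invokes it on a Java string.
    static long nativeStrlen(String s) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        // Confined arena: the native copy of the string is freed when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(s); // NUL-terminated UTF-8
            return (long) strlen.invokeExact(cString);
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(nativeStrlen("pricing")); // prints 7
    }
}
```

The same `downcallHandle` pattern is what would replace the Python-to-C++ bridge: describe the native function's signature with a `FunctionDescriptor`, get a `MethodHandle`, and call it like any other Java method.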