Dockerize Cassandra troubleshooting on waitingforcode.com

Versions: Cassandra 3.10

Some time ago I tried to create a Docker image with Cassandra and some other programs. For the others, the operation was quite easy, but Cassandra caused some problems because of several configuration properties.

This post explains what made creating the Cassandra Docker image difficult. It starts by showing what happens with the default configuration and then follows how the setup evolved as the parameters changed.

The first victory - --net=host

The code below shows the first step of putting Cassandra into a Docker container:

RUN wget --quiet http://mirrors.standaloneinstaller.com/apache/cassandra/3.10/apache-cassandra-3.10-bin.tar.gz -P /home/streaming_user/installs
RUN tar -xvzf /home/streaming_user/installs/apache-cassandra-3.10-bin.tar.gz -C /home/streaming_user/programs
RUN sed -i -E "s/MAX_HEAP_SIZE=\".*\"/MAX_HEAP_SIZE=\"1000M\"/" $cassandraConfig/cassandra-env.sh

It simply downloads a specific Cassandra version, untars it to the configured location and sets the maximum heap size to 1000M in cassandra-env.sh. The Docker container was first run with this command: docker run -p 127.0.0.1:9042:9042 --name my_streaming_context -i -t streaming_context. The problem was that port 9042, even though explicitly published by the command, was not reachable from the client application.

After some research on the configuration, the --net=host option appeared to be the solution. And it was, but only in the sense of making something work, not necessarily as a proper solution. The --net=host flag makes the container share the network namespace of the host, i.e. every port opened inside the container is opened directly on the host's network interfaces.

That wasn't necessarily what I wanted, so I continued the tests.
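For comparison, the two ways of starting the container described in this section can be sketched as below. The helper build_run_cmd is hypothetical and only prints the command instead of executing it, so the snippet can be tried without a Docker daemon; the container and image names are the ones used earlier in the post.

```shell
#!/bin/sh
# Sketch: the two container start modes discussed in this section.
# build_run_cmd is a hypothetical helper; it prints the docker command
# for a given networking mode rather than running it.
build_run_cmd() {
  mode="$1"
  case "$mode" in
    published)
      # Publish only the CQL port, bound to the host's loopback interface.
      net_opts="-p 127.0.0.1:9042:9042" ;;
    host)
      # Share the host's network namespace; no port publishing is involved.
      net_opts="--net=host" ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1 ;;
  esac
  echo "docker run $net_opts --name my_streaming_context -i -t streaming_context"
}

build_run_cmd published
build_run_cmd host
```

The only difference between the two invocations is the networking option: the first keeps the container on its own network and forwards a single port, while the second removes the network isolation altogether.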

Solution that fits

After reading some blogs and analyzing different Docker images, I found Hugo Picado's blog post about Containerizing Spring Boot and Cassandra. It gives some insight into the properties to change to make Cassandra work without exposing the container's ports publicly. Among these properties we can find:

- listen_address - its default value is localhost, which causes problems since this property defines the address the Cassandra process binds to. It is also the address other Cassandra nodes connect to.
- broadcast_address - it's not specified by default, in which case the value of listen_address is used. The broadcast_address property defines the address broadcast to other Cassandra nodes.

Broadcast address

Broadcast address is especially useful when not all nodes can reach the other nodes through their private addresses (listen_address). It can happen when they're deployed in data centers located in different geographical regions.

- rpc_address - defines the listen address for client connections. Its default value is also localhost, so with the default configuration port 9042 is bound only to the container's loopback interface, which the port published by Docker does not reach.
- broadcast_rpc_address - the RPC address broadcast to drivers and other Cassandra nodes.
- seeds - a discovery property: it serves to learn about the topology of the ring. This configuration is especially useful when Cassandra runs with multiple nodes.

Seeds

Seeds are used to discover the topology of the ring. Thus in distributed mode at least one node from each data center should be defined there. It's advised to define more than one to achieve fault tolerance. But it's not recommended to define every node as a seed, since it degrades the performance of the gossip protocol (the internode communication protocol).

Not all of the above properties are useful in our simple case of containerizing a database. But to remain fully consistent, they were all changed as advised. And without any surprise, port 9042 was then correctly exposed to the client program.
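The exact values are not given here, so the following is only a minimal sketch of how such changes could be scripted with sed, in the same spirit as the heap size change shown earlier. Everything specific in it is an assumption: the helper name patch_cassandra_yaml is hypothetical, 172.17.0.2 stands for whatever address the container gets, setting rpc_address to 0.0.0.0 is one possible choice (Cassandra then requires broadcast_rpc_address to be set), and the sample file is a trimmed-down stand-in for the real cassandra.yaml.

```shell
#!/bin/sh
# Sketch: rewrite the network-related keys of cassandra.yaml.
# patch_cassandra_yaml is a hypothetical helper, not part of Cassandra or Docker.
patch_cassandra_yaml() {
  conf="$1"   # path to cassandra.yaml
  addr="$2"   # address the node binds to and advertises (assumed value)
  # Each expression also matches the commented-out form ("# key: value")
  # and replaces the whole line with an active setting.
  sed -i -E \
    -e "s/^(# *)?listen_address:.*/listen_address: ${addr}/" \
    -e "s/^(# *)?broadcast_address:.*/broadcast_address: ${addr}/" \
    -e "s/^(# *)?rpc_address:.*/rpc_address: 0.0.0.0/" \
    -e "s/^(# *)?broadcast_rpc_address:.*/broadcast_rpc_address: ${addr}/" \
    -e "s/^( *- seeds:).*/\1 \"${addr}\"/" \
    "$conf"
}

# Demo on a trimmed-down stand-in for the default configuration file.
sample_conf="$(mktemp)"
cat > "$sample_conf" <<'EOF'
listen_address: localhost
# broadcast_address: 1.2.3.4
rpc_address: localhost
# broadcast_rpc_address: 1.2.3.4
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "127.0.0.1"
EOF

patch_cassandra_yaml "$sample_conf" "172.17.0.2"
cat "$sample_conf"
```

In a Dockerfile the same sed expressions could go into a RUN instruction, but the container's address is usually known only at start time, so running them from a startup script is probably the more practical place.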

This post presented the integration of Cassandra into a Docker container composed of different data sources, focusing mostly on the troubleshooting part. The first section described the port accessibility problem and a hack that consisted of sharing the host's network with the container. The second part gave another, more proper solution, which consisted of changing Cassandra's default properties.
