Dockerize Cassandra troubleshooting

Some time ago I tried to create a Docker image with Cassandra and some other programs. For the others, the operation was quite easy, but Cassandra caused some problems because of several configuration properties.

Bartosz Konieczny

This post explains what made creating a Cassandra Docker image difficult. It starts by showing what happens with the default configuration and then follows the changes made to the parameters.

The first victory - --net=host

The code below shows the first step of putting Cassandra into a Docker container:

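# Download the Cassandra 3.10 binary archive
RUN wget --quiet http://mirrors.standaloneinstaller.com/apache/cassandra/3.10/apache-cassandra-3.10-bin.tar.gz -P /home/streaming_user/installs
# Unpack it into the programs directory
RUN tar -xvzf /home/streaming_user/installs/apache-cassandra-3.10-bin.tar.gz -C /home/streaming_user/programs
# Cap the JVM heap at 1000M ($cassandraConfig is assumed to be defined earlier in the Dockerfile)
RUN sed -i -E "s/MAX_HEAP_SIZE=\".*\"/MAX_HEAP_SIZE=\"1000M\"/" $cassandraConfig/cassandra-env.sh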

It simply downloads a specific version of Cassandra, untars it to the configured location and sets the maximum heap size to 1000M in cassandra-env.sh. The Docker container was first run with this command: docker run -p 127.0.0.1:9042:9042 --name my_streaming_context -i -t streaming_context. The problem was that port 9042, even though explicitly published by the command, was not reachable from the client application.

After some research into the configuration, the --net=host flag appeared to be the solution. And it was, but only in the sense of making something work, not necessarily as a proper solution. In fact, the --net=host flag makes the container share the network namespace of the host: the container is no longer isolated from the host's network, and every port it listens on is opened directly on the host.

That wasn't really what I wanted, which is why I continued the tests.
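
The two ways of starting the container discussed in this section can be sketched side by side. This is only an illustration: the image and container names come from the command quoted above, while the function names are hypothetical. Nothing is started until one of the functions is called.

```shell
#!/bin/sh
# Image and container names are taken from the command quoted in the post.
IMAGE=streaming_context
NAME=my_streaming_context

# Attempt 1: default bridge network, CQL port 9042 published on the
# host's loopback interface.
run_with_published_port() {
  docker run -p 127.0.0.1:9042:9042 --name "$NAME" -i -t "$IMAGE"
}

# Attempt 2: share the host's network namespace. Docker discards
# published ports in this mode, so no -p mapping is passed.
run_with_host_network() {
  docker run --net=host --name "$NAME" -i -t "$IMAGE"
}
```

With the second variant, Cassandra binds directly to the host's interfaces, which is why the client could connect but also why the container loses its network isolation.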

Solution that fits

After reading some blogs and analyzing different Docker images, I found Hugo Picado's blog post about Containerizing Spring Boot and Cassandra, which gives insight into the properties to change to make Cassandra work without publicly exposing the container's ports. Among these properties we can find:

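Not all of the above properties are useful in our case of a simple containerized database. But to remain fully consistent, they were all changed as advised. And, without any surprise, port 9042 was then correctly exposed to the client program. The most likely reason for the initial failure is that Cassandra by default binds its client port to localhost inside the container, an address that Docker's port mapping cannot reach.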
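
To give an idea of what such a change can look like, here is a minimal sketch that rewrites the client-facing addresses in cassandra.yaml with sed, in the same style as the heap size change shown earlier. It is not the list from the referenced post: rpc_address and broadcast_rpc_address are real cassandra.yaml settings, but picking these two is an assumption, and the helper name patch_cassandra_yaml and the 127.0.0.1 value are hypothetical. The demonstration runs on a throwaway file rather than a real installation and assumes GNU sed.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the client-facing addresses in a cassandra.yaml.
# $1 = path to cassandra.yaml, $2 = address advertised to clients
patch_cassandra_yaml() {
  config_file=$1
  broadcast_ip=$2
  # Listen for CQL clients on every interface of the container.
  sed -i -E "s/^rpc_address:.*/rpc_address: 0.0.0.0/" "$config_file"
  # Cassandra requires a concrete broadcast_rpc_address once rpc_address
  # is 0.0.0.0; the line may be commented out, so the pattern accepts both.
  sed -i -E "s/^(# *)?broadcast_rpc_address:.*/broadcast_rpc_address: $broadcast_ip/" "$config_file"
}

# Demonstration on a temporary file holding only the two relevant lines.
demo_file=$(mktemp)
printf 'rpc_address: localhost\n# broadcast_rpc_address: 1.2.3.4\n' > "$demo_file"
patch_cassandra_yaml "$demo_file" 127.0.0.1
cat "$demo_file"
```

In a Dockerfile the same sed expressions could be placed in RUN instructions targeting $cassandraConfig/cassandra.yaml, assuming that is where the file lives in the image.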

This post presented the integration of Cassandra into a Docker container composed of different data sources. It mostly focused on the troubleshooting part. The first section described the problem of port accessibility and a hack solution that consisted of exposing the container's ports publicly. The second part gave another, more proper solution, which consisted of changing Cassandra's default properties.
