What Apache Spark version did you use?
I was working with Apache Spark 2.0+.
Are you still using Apache Spark?
Yes, but I'm also using different technologies.
Like what?
For some months I've been playing around with serverless data processing frameworks like Dataflow. I'm also studying Apache Beam and its streaming model, which is explained pretty well in the "Streaming Systems" book.
So was Apache Spark inefficient?
Not really. I think they are complementary: Apache Beam helps me better understand some aspects of Apache Spark and of data processing in general. Also, Dataflow gives me some clues about serverless data processing that I hope will soon be available with the Kubernetes scheduler.
What modules did you use?
Mainly SQL, Streaming and Structured Streaming.
What is the runtime environment of your pipelines?
I'm used to working with AWS and the EMR service, but I also did some experiments with Kubernetes at the beginning of the integration.
Where did you use Apache Spark?
Initially I used it only for data processing pipelines, but for several months now I've also been actively using the framework for data exploration from notebooks. Since EMR offers a built-in Zeppelin installation, I mostly work with that.
How do you evaluate the performance of an Apache Spark-based pipeline?
The framework performs quite well, but you must be careful about data sources and sinks. For instance, in the past I did some experiments with PostgreSQL and, despite pretty fast data transformations, the loading and writing jobs were very slow. The same is true for files compressed with a non-splittable format like GZIP.
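On the PostgreSQL point, here is a minimal sketch of the kind of JDBC sink tuning that helps; the S3 path, the database endpoint, the table and the credentials are purely hypothetical:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object JdbcSinkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-sink-example")
      .getOrCreate()

    // Hypothetical dataset; in practice this comes from the pipeline's transformations.
    val orders = spark.read.parquet("s3://my-bucket/orders/")

    orders
      // Cap the number of concurrent connections opened against PostgreSQL.
      .repartition(8)
      .write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics") // hypothetical endpoint
      .option("dbtable", "public.orders")                        // hypothetical table
      .option("user", "spark_writer")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      // Larger JDBC batches reduce round trips and usually speed up the load.
      .option("batchsize", "10000")
      .save()
  }
}
```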
What Apache Spark features caused the most problems? How did you solve them?
At the beginning I had a lot of difficulty writing serializable code; I had plenty of "not serializable" exceptions. Another problem was the change of logic. Previously I worked with, at worst, multi-threaded code, so everything was on the same machine. With Apache Spark I had to make a mind switch and start thinking about the code as if it were executed in real separation. I'm saying this especially with regard to stateful objects like lists, maps and so on that, once the pipeline was executed, were always empty :)
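A minimal sketch of that last gotcha, with hypothetical data: the driver-side buffer is only mutated inside the executors' copies of the closure, so it stays empty after the action; bringing the data back explicitly (or using an accumulator) is the distributed-friendly alternative.

```scala
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.SparkSession

object ClosureGotchas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("closure-gotchas").getOrCreate()
    val numbers = spark.sparkContext.parallelize(1 to 100)

    // This buffer lives on the driver; each task mutates its own deserialized copy
    // of the closure, so after the action the driver-side buffer is still empty.
    val seen = new ListBuffer[Int]()
    numbers.foreach(n => seen += n)
    println(s"Driver-side buffer size: ${seen.size}") // prints 0

    // Distributed-friendly alternative: let Spark bring the data back explicitly.
    val evens = numbers.filter(_ % 2 == 0).collect()
    println(s"Collected even numbers: ${evens.length}")
  }
}
```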
What Apache Spark features caused the most problems at runtime? How did you solve them?
When I did some streaming processing, it was the difficulty of handling extra load. True, at that time I was rather a beginner data engineer and didn't know concepts like elasticity. Still on streaming processing, it was also hard to catch up on the processing lag after a code failure.
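Two settings address exactly those problems; the broker, topic and checkpoint values below are hypothetical. spark.streaming.backpressure.enabled adapts the ingestion rate for the legacy DStream API, and maxOffsetsPerTrigger on a Kafka source in Structured Streaming bounds each micro-batch, so catching up after a failure happens gradually instead of in one huge batch.

```scala
import org.apache.spark.sql.SparkSession

object StreamingRateLimiting {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rate-limiting-example")
      // Legacy DStream API: adapt the ingestion rate to the observed processing rate.
      .config("spark.streaming.backpressure.enabled", "true")
      .getOrCreate()

    // Structured Streaming on Kafka: cap how many offsets each micro-batch reads.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092") // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .option("maxOffsetsPerTrigger", "50000")
      .load()

    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "s3://my-bucket/checkpoints/events/") // hypothetical path
      .start()

    query.awaitTermination()
  }
}
```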
Have you ever needed to scale your pipelines? If yes, how did you do it?
Yes, and by the way that's a topic that interests me a lot at the moment. For now I've only had to implement an auto-scaling policy based on the hours of the day, using the auto-scaling policies provided by AWS.
What did you learn from using Apache Spark? What tips could you share with the community?
Do not work on suppositions, especially when you have just started. I saw some people around me trying to work with Spark and trying to solve performance issues with sentences like "I think that this part will speed up the processing", and in the end that part had a very small impact. Instead, familiarize yourself with the logs and the Spark UI, and always try to back up your reasoning with numbers.
Another tip is to avoid building too many objects. It has an impact on the JVM, and in some scenarios, like streaming ones, GC can become a real bottleneck. Instead, use the Row abstraction as much as possible; with a simple mapper it can reduce the memory pressure, as sketched below.
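A minimal sketch of what I mean, with a hypothetical input and schema: the typed version materializes one domain object per record, while the untyped Row/Column version lets Spark work on its internal binary representation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RowAbstractionTip {
  // Hypothetical domain class; instantiating one per record adds GC pressure.
  case class Visit(userId: String, durationMs: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-abstraction").getOrCreate()
    import spark.implicits._

    val visits = spark.read.json("s3://my-bucket/visits/") // hypothetical input

    // Allocation-heavy version: every record becomes a new Visit instance.
    val typedCount = visits.as[Visit].filter(_.durationMs > 1000L).count()

    // Lighter version: stay on the untyped Row/Column API so Catalyst/Tungsten
    // can filter on the internal representation without building domain objects.
    val untypedCount = visits.filter(col("durationMs") > 1000L).count()

    println(s"typed=$typedCount untyped=$untypedCount")
  }
}
```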
How do you monitor your Apache Spark pipelines?
Mainly some business metrics included in the code, collected with accumulators, plus the job execution state. To expose the metrics I often use already available monitoring tools like CloudWatch (AWS) or Datadog dashboards. Integrating their APIs into Apache Spark code is quite easy.
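A minimal sketch of that pattern, with a hypothetical input and a placeholder publishMetric helper standing in for the CloudWatch or Datadog API call:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("accumulator-metrics").getOrCreate()
    val sc = spark.sparkContext

    // Business metric: registered on the driver, incremented on the executors.
    val invalidRecords = sc.longAccumulator("invalid_records")

    val lines = sc.textFile("s3://my-bucket/raw/events/") // hypothetical input
    val parsed = lines.flatMap { line =>
      val fields = line.split(",")
      if (fields.length != 3) {
        invalidRecords.add(1L)
        None
      } else {
        Some((fields(0), fields(1), fields(2)))
      }
    }

    // Accumulator values are only reliable after an action has run the pipeline.
    val validCount = parsed.count()

    // Hypothetical helper: in a real job this would call the CloudWatch or Datadog API.
    publishMetric("valid_records", validCount)
    publishMetric("invalid_records", invalidRecords.value)
  }

  def publishMetric(name: String, value: Long): Unit =
    println(s"metric $name = $value")
}
```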
Do you have some special configuration tips to share?
Not yet.
How do you manage the deployment of a new version of your application?
I experienced two models, manual and automatic, and needless to say I was convinced by the latter. You can use any deployment tool like Jenkins or CircleCI, version your code with Git, and deploy a new artifact of your code that way. It guarantees much better traceability than manual work.
Did you use some advanced testing strategies like integration or regression tests in addition to unit tests?
Once I tried regression testing with simple SQL queries. It was quite easy to put in place.
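A minimal sketch of what such a test can look like, assuming ScalaTest and purely hypothetical data and query: the frozen input and expected output are versioned with the code, and the test fails if the query's behaviour drifts.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// Hypothetical regression test: the query under test must keep returning
// the same aggregates as the frozen expectation.
class DailyRevenueRegressionSpec extends AnyFlatSpec with Matchers {

  private val spark = SparkSession.builder()
    .master("local[2]")
    .appName("regression-test")
    .getOrCreate()

  "the daily revenue query" should "match the frozen snapshot" in {
    import spark.implicits._

    // Frozen input, kept small and versioned with the code.
    Seq(("2023-01-01", 10.0), ("2023-01-01", 15.0), ("2023-01-02", 7.5))
      .toDF("day", "amount")
      .createOrReplaceTempView("sales")

    val result = spark.sql(
        "SELECT day, SUM(amount) AS revenue FROM sales GROUP BY day ORDER BY day")
      .collect()
      .map(r => (r.getString(0), r.getDouble(1)))

    result should contain theSameElementsInOrderAs
      Seq(("2023-01-01", 25.0), ("2023-01-02", 7.5))
  }
}
```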
If you had a magic wand, what would you change in Apache Spark?
Get rid of the executors abstraction exposed to end users. That was one of my first issues with the framework, i.e. how to compute the optimal amount of resources. Instead, I would rather opt for an AWS Lambda-like configuration where the application declares the memory and CPUs to use, and the split into executors is delegated to the framework.