Data+AI Summit 2022 - Top Announcements and Recap
Data+AI Summit 2022 [https://databricks.com/dataaisummit/] is the world’s largest gathering among…
Databricks recently announced the release of Apache Spark 3.0 [https://databricks.com/blog/2020/…
This post demonstrates a cost-effective, automated solution for running Spark jobs on an EMR cluster on a daily basis using CloudWatch, Lambda, EMR, S3, and SNS.…
Performance Tweaking Apache Spark
Apache Spark Streaming applications need to be monitored frequently to be certain that they are…
Incrementally loaded Parquet files
In this post, I explore how you can leverage Parquet [https://parquet.apache.org/] when…
MongoDB and Apache Spark - Getting started tutorial
MongoDB and Apache Spark are two popular Big Data technologies. In my previous post [https:…
Introduction to the MongoDB connector for Apache Spark
MongoDB is one of the most popular NoSQL databases. Its unique capabilities to store document-oriented…
Spark Summit East 2017 - A summary
I attended Spark Summit East 2017 last week. This 2-day conference - February 8th…
A tour of Databricks Community Edition: a hosted Spark service
With the recent announcement [https://databricks.com/blog/2016/02/17/introducing-databricks-community-edition-apache-spark-for-all.html] of the…
Testing strategy for Spark Streaming - Part 2 of 2
In a previous post [https://test-ippon.ghost.io/testing-strategy-apache-spark-jobs/], we’ve seen why it’s…
Testing strategy for Apache Spark jobs - Part 1 of 2
Like any other application, Apache Spark jobs deserve good testing practices and coverage. Indeed, the…
Applying Data Science with Apache Spark Coding Dojo
This week, at the power plant (Ippon Technologies USA headquarters), we had the pleasure of…