Distributed Computing with Spark SQL (Coursera)

Overview

This course was a pleasant surprise, especially in the way the practical exams worked. You can think of it as a great introduction to Databricks as well as Spark. Here is a fact I think few people know: you can use Databricks for free if you create a community account.

Spark is one of the most important tools (if not the most important) for large-scale data processing (a.k.a. Big Data). Its main advantage over traditional data processing tools is horizontal scaling: you add more machines to the cluster instead of buying a bigger one, which makes scalability almost limitless. Spark is a basic tool for Data Engineers, and one of the biggest obstacles when learning Data Engineering is that the problems we are trying to solve usually belong to big corporations with a huge pile of data, so simulating such situations is often expensive.

This course is part of the SQL Basics for Data Science specialization.


Pros

As mentioned before, the best part of this course was the practical exercises done in Databricks. They were an excellent opportunity to get to know Spark a little better, especially partition tuning and shuffling.
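To give a rough idea of what partitioning and shuffling mean, here is a minimal sketch in plain Python. It is not Spark code and not taken from the course material: the function names, the sample rows, and the CRC32 hash are my own assumptions for illustration. The idea is that a shuffle sends every row with the same key to the same partition, so each partition can be aggregated independently. In Spark itself, the number of partitions used after a shuffle is controlled by the `spark.sql.shuffle.partitions` setting.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, num_partitions):
    """Assign each (key, value) row to a partition by hashing its key.

    Hypothetical helper: it only mimics the idea of a shuffle,
    it is not how Spark is implemented.
    """
    partitions = defaultdict(list)
    for key, value in rows:
        # crc32 is deterministic across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def grouped_sum(rows, num_partitions):
    """Sum values per key, one partition at a time."""
    totals = {}
    for part in hash_partition(rows, num_partitions).values():
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # A key never spans two partitions, so the partial results never conflict.
        totals.update(local)
    return totals


# Made-up sample data: (country code, number of visitors).
rows = [("BR", 3), ("US", 5), ("BR", 4), ("DE", 1), ("US", 2)]

print(grouped_sum(rows, num_partitions=2))
print(grouped_sum(rows, num_partitions=4))
```

The totals are the same whatever the partition count. What changes is how the rows are spread across partitions, and that is the part you end up tuning: too few partitions and each one is too big, too many and the overhead of managing them dominates.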

There are many notebooks in the course material that are not used in the classes, and those are totally worth studying on your own. The fact that Databricks has a community account is great for integrating a more sophisticated solution into your personal projects and boosting your portfolio on GitHub.

The whole course has a "hands-on" approach, which I personally like a lot. Each topic gets a brief theoretical explanation, and soon after we are going through the example notebooks with real code.

Cons

The one thing the course could improve is the number of exercises on fine-tuning and solving performance issues with Spark. No one uses Spark just for fun, so there is usually a performance component involved. It would be frustrating if, after moving to Spark, an engineer still faced performance issues and could not solve them quickly.

Conclusion

I definitely recommend this course if you have never had any experience with Spark and feel this knowledge is missing from your resume. Bear in mind that this is an introductory course, and it does not go deep into one of the most important parts of Spark: improving performance. If you feel something is missing in this review, please write a note in the comments!
