Posts

Showing posts from February, 2022

SQL For Data Science (Coursera)

Overview SQL is one of the most basic skills for developers, DBAs, and the whole data team (analysts, engineers, and scientists), yet it is constantly underestimated. In general, people are reluctant to keep different languages in the same project to avoid staffing issues, and sometimes that forces the use of ORMs when the fastest option would be to run some stored procedures. This makes sense from a management perspective: if all engineers use the same technology, there are fewer problems with work allocation and absences (when only a few people master a key technology, they can't all go on vacation at the same time). The growth of cloud services and the introduction of new tools to query massive amounts of data have made system designs that delegate a large portion of business logic to databases and SQL-based tools popular again. Concepts like "data lakes" and "data warehouses" are starting to be applied in smal
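
To illustrate the idea of delegating logic to the database, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and its rows are hypothetical and exist only for this example; SQLite has no stored procedures, so a plain aggregate query stands in for database-side logic.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 5.0), ("bob", 7.5)],
)

# Application-side approach: fetch every row and aggregate in Python.
totals_app = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals_app[customer] = totals_app.get(customer, 0.0) + amount

# Database-side approach: let SQL do the aggregation and return only the result.
totals_sql = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
```

Both approaches produce the same totals, but the second one moves only the aggregated rows out of the database, which tends to matter more as the table grows.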

Distributed Computing with Spark SQL (Coursera)

Overview This course was a great positive surprise regarding the way the practical exams worked. You can think of it as a great introduction to Databricks, along with Spark, and here is a fact I think few people know: you can use Databricks for free if you create a community account. Spark is one of the most important tools (if not the most important) for large-scale data processing (a.k.a. Big Data). The main advantage it provides over traditional data processing tools is horizontal scaling, which makes scalability almost limitless. Spark is a basic tool for Data Engineers, and one of the largest problems one can face when learning Data Engineering is that the problems we are trying to solve usually belong to big corporations with a huge pile of data, so simulating such situations often involves high costs. This course is part of the SQL Basics for Data Science specialization. Here are the course details from Coursera: Pros As mentioned before, the best part of this course was the

Building Batch Data Pipelines on GCP (Coursera)

Overview This is another course that Google developed with Coursera, and it has quizzes and practical lab activities. Batch processing (the "opposite" of stream processing) is a way to process huge amounts of data that are available in the form of files, locally or in buckets in the cloud. The main advantage of modern batch processing techniques is the use of parallelism for such tasks, which greatly improves performance and scalability. Google Cloud has several tools to approach this problem, and I was surprised to learn that it is possible to use Spark in GCP Dataproc. Pros The great difference between this course and the "regular" Coursera ones is the integration with Qwiklabs, which lets the student access GCP and run all exercises there. This is definitely an advantage for students who are reluctant to provide credit card numbers (or, depending on their age, don't even have a credit card) in order to test and learn GCP. Another great difference is that
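
As a rough illustration of batch processing with parallelism, the sketch below processes a small set of local files concurrently using only the Python standard library. It is not taken from the course: the `count_lines` and `run_batch` helpers and the `part-*.txt` files are hypothetical, and a thread pool on one machine merely stands in for the cluster-level parallelism that tools such as Spark on Dataproc provide.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_lines(path):
    """Process one file of the batch: here, just count its lines."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def run_batch(paths, workers=4):
    """Process the files concurrently and combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_lines, paths))


# Hypothetical input: three small files standing in for a large batch.
batch_dir = Path(tempfile.mkdtemp())
for i in range(3):
    (batch_dir / f"part-{i}.txt").write_text("a\nb\n", encoding="utf-8")

total_lines = run_batch(sorted(batch_dir.glob("part-*.txt")))
```

The pattern is the same at larger scale: split the input into independent files, process each piece separately, then merge the partial results.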