djsarkar

237 points

Dipanjan (DJ) Sarkar is a Data Scientist at Red Hat, a published author, consultant and trainer. He has consulted and worked with several startups as well as Fortune 500 companies like Intel. He primarily works on leveraging data science, machine learning and deep learning to build large-scale intelligent systems. He holds a master of technology degree with specializations in Data Science and Software Engineering. He is also an avid supporter of self-learning and massive open online courses. He has recently ventured into the world of open-source products to improve the productivity of developers across the world.

Dipanjan has been an analytics practitioner for several years now, specializing in machine learning, natural language processing, statistical methods and deep learning. Having a passion for data science and education, he also acts as an AI Consultant and Mentor at various organizations like Springboard, where he helps people build their skills in areas like Data Science and Machine Learning. He also acts as a key contributor and Editor for Towards Data Science, a leading online journal focusing on Artificial Intelligence and Data Science. Dipanjan has also authored several books on R, Python, Machine Learning, Social Media Analytics, Natural Language Processing and Deep Learning.

Dipanjan's interests include learning about new technology, financial markets, disruptive start-ups, data science, artificial intelligence and deep learning. In his spare time, he loves reading, gaming, watching popular sitcoms and football, and writing interesting articles on https://medium.com/@dipanzan.sarkar and https://www.linkedin.com/in/dipanzan. He is also a strong supporter of open source and publishes the code and analyses from his books and articles on GitHub at https://github.com/dipanjanS.

Open Minded, Author, Contributor Club

Authored Content

How to analyze log data with Python and Apache Spark

Case study with NASA logs to show how Spark can be leveraged for analyzing data at scale.

How to wrangle log data with Python and Apache Spark

Case study with NASA logs to show how Spark can be leveraged for analyzing data at scale.

Detecting malaria with deep learning

Artificial intelligence combined with open source tools can improve diagnosis of the fatal disease malaria.

Scaling relational databases with Apache Spark SQL and DataFrames

Wrangle, aggregate, and filter data at scale using your friendly SQL with a twist.

How to use Spark SQL: A hands-on tutorial

This tutorial explains how to leverage relational databases at scale using Spark SQL and DataFrames.

Authored Comments

djsarkar

22 Mar 2019

Scaling relational databases with Apache Spark SQL and DataFrames

Absolutely. Overall, the data structures are similar yet different, which makes things a bit confusing. But if you check the history of the evolution of Spark (https://stackoverflow.com/questions/31508083/difference-between-datafra…), we first had RDDs, then DataFrames came into the picture in 2013, and finally Datasets spun off from DataFrames in 2015 as a type-safe version of DataFrames.

Datasets are pretty good and work quite well in native Spark (leveraging Scala), but since we use Python in our example, we have to go with Spark DataFrames. Traditionally, though, Datasets have been slightly slower than DataFrames, but their performance is catching up (https://databricks.com/session/demystifying-dataframe-and-dataset). Hope this helps!