Spark 3.0 released

2020-06-20 / Spark / 2 minutes

A new major version of Spark was released on Thursday. Version 3.0 of the popular big data processing and ML engine improves speed and functionality, primarily in Spark SQL and the Python API. Below, I briefly present some highlights for SQL, DataFrames and MLlib.

DataFrame/SQL

  • The documentation has been extended to include a comprehensive SQL reference. It provides an overview of available aggregate, window, map, JSON, array and date functions.
  • New built-in functions are available. count_if, any (or some or bool_or) and every (or bool_and) are aggregate functions over Boolean values. map_filter and map_zip_with provide functionality known from Scala maps. typeof returns the data type of a column. Instead of using the date functions year, month, hour etc. to extract date information, you can now use date_part and extract.

MLlib

  • Three new algorithms are available: Gaussian Naive Bayes and Complement Naive Bayes for classification, and Factorization Machines for both classification and regression problems.

  • A RobustScaler transformation is available that uses the median and interquartile range to center and scale a variable. It is more robust to outliers than StandardScaler, which uses the mean and standard deviation.

  • The transformers Binarizer and StringIndexer can now handle multiple input columns at once, which makes it easier to apply these common transformations to a large set of variables. Use the parameter inputCols rather than inputCol (and thresholds rather than threshold) to submit multiple columns for transformation.

  • For classification problems where a sample can be mapped to multiple labels instead of just one, the MultilabelClassificationEvaluator can be used to measure model performance.

  • For ranking problems where a sample is mapped to an ordered set of classes (e.g. top 5 recommended movies for a customer), the RankingEvaluator is now available to simplify model evaluation.