A new major version of Spark was released on Thursday. Version 3.0 of the popular big data processing and ML engine improves speed and functionality, primarily for Spark SQL and the Python API. Below, I briefly present some highlights for SQL, DataFrames and MLlib.
DataFrame/SQL
- The documentation has been extended to include a comprehensive SQL reference. It provides an overview of available aggregate, window, map, JSON, array and date functions.
- New built-in functions are available. `count_if`, `any` (or `some` or `bool_or`) and `every` (or `bool_and`) are aggregate functions over Boolean values. `map_filter` and `map_zip_with` provide functionality known from Scala maps. `typeof` returns the data type of a column. Instead of using date functions `year`, `month`, `hour` etc. to extract date information, you can now use `date_part` and `extract`.
MLlib
- Three new algorithms are available: Gaussian Naive Bayes, Complement Naive Bayes and Factorization Machines for classification, as well as Factorization Machines for regression problems.
- A RobustScaler transformation is available that uses median and interquartile range to center and scale a variable. It is more robust to outliers than StandardScaler, which uses mean and standard deviation.
- Transformers Binarizer and StringIndexer can now handle multiple input columns at once. This makes it easier to apply these common transformations to a large set of variables. Use the parameter `inputCols` rather than `inputCol` (and `thresholds` rather than `threshold`) to submit multiple columns for transformation.
- For classification problems where a sample can be mapped to multiple labels instead of just one, the MultilabelClassificationEvaluator can be used to measure model performance.
- For ranking problems where a sample is mapped to an ordered set of classes (e.g. top 5 recommended movies for a customer), the RankingEvaluator is now available to simplify model evaluation.