Python Dataclasses

2021-03-07 / Python / Dataclass / Scala / Case Class

Recently, I have started using Python’s dataclasses, a module providing decorators and functions to easily create classes similar to Scala’s case class, including immutability features. It has become a valuable tool for data-intensive applications without the boilerplate code of regular classes.
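
As a rough illustration of the case-class-like behaviour described above, here is a minimal sketch using a frozen dataclass. The `Point` class and its fields are hypothetical names chosen for this example and are not taken from the post itself.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


# frozen=True makes instances immutable, loosely comparable to a Scala case class.
# `Point`, `x` and `y` are hypothetical names used only for illustration.
@dataclass(frozen=True)
class Point:
    x: float
    y: float = 0.0  # fields may carry default values


p = Point(1.0, 2.0)

# __init__, __repr__ and __eq__ are generated automatically.
print(p)
print(p == Point(1.0, 2.0))

# replace() returns a modified copy, similar in spirit to Scala's copy().
q = replace(p, y=5.0)
print(q)

# Assigning to a field of a frozen instance raises FrozenInstanceError.
try:
    p.x = 3.0
except FrozenInstanceError as err:
    print(f"immutable: {err}")
```

Because frozen dataclasses with the default `eq=True` also get a generated `__hash__`, instances like `p` can be used as dictionary keys or set members.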


A simple real-time web application using WebSockets

2021-01-10 / Python / JavaScript / AWS API Gateway / AWS Lambda / AWS DynamoDB

In this post, we develop a simple real-time web application. Multiple users can simultaneously increment or decrement a counter, and the result is immediately published to all users. Instead of retrieving the updated results via frequent HTTP GET requests (“excessive polling”), we use a WebSocket API implemented in AWS API Gateway, with AWS Lambda and AWS DynamoDB as backend services to persist the state of the counter.


Tables and Graphs on a Webpage

2021-01-01 / JavaScript

In this post, I show how to convert data in a JavaScript object to a table and how to visualize the data with plotly.


Statistical models as a serverless service (v1)

2020-11-08 / AWS SAM / AWS Lambda / AWS S3 / AWS API Gateway / Python / statsmodels

In this post, I outline an implementation of a serverless service to run regression models (fitting and predicting) in AWS. Under the hood, it uses Python’s statsmodels package and its formula interface. Sending JSON data and a model formula to the API fits a linear model using ordinary least squares (OLS) and saves the model to S3. The model is then available to predict new observations.


Working with Python DataFrames and AWS: `awswrangler`

2020-10-25 / AWS / Python / pandas / AWS Athena / AWS Lambda / AWS Glue

Just recently, I discovered the awswrangler Python package, which provides functions to easily work with AWS analytics services using pandas data frames.


Deploying Python with openpyxl on AWS Lambda

2020-09-27 / AWS Lambda / Python / openpyxl

AWS Lambda is a powerful function-as-a-service (FaaS) tool that allows you to run code in the cloud without having to maintain a server and its computing environment. One limitation is that the provided runtimes do not always contain all the packages you need. In these cases, you have to create your deployment package outside of the AWS console and upload it manually. In this post, we see how to do this for a Python function that uses the openpyxl package.


Athena Queries via Python

2020-09-05 / AWS Athena / Python / pandas

Python’s boto3 library provides an easy way to submit SQL queries to your databases on AWS Athena. In some cases, though, instead of just submitting the query and letting the results be written to S3, you want to work with the output immediately. This post shows how to download the result after the query is processed and how to import it as a pandas DataFrame.


Converting JSON to JSON Lines file format

2020-08-29 / AWS Athena / Python / JSON

A Python function to convert a JSON file to a JSON Lines file where each item is stored on a separate line.
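
A conversion of this kind could be sketched as follows. This is not the function from the post: the name `json_to_jsonl` is hypothetical, and the sketch assumes the input file holds a single top-level JSON array that fits into memory.

```python
import json
from pathlib import Path


def json_to_jsonl(src, dst):
    """Write each item of a top-level JSON array in `src` as one line in `dst`.

    Hypothetical helper for illustration; assumes the whole input fits in memory.
    Returns the number of items written.
    """
    items = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError("expected a top-level JSON array")

    with open(dst, "w", encoding="utf-8") as out:
        for item in items:
            # json.dumps emits no raw newlines by default, so each item stays on one line.
            out.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)
```

Calling `json_to_jsonl("data.json", "data.jsonl")` (both file names are placeholders) would produce a file with one JSON object per line, a layout that line-oriented tools such as Athena can read.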


Resolving Timestamp Issues on Athena

2020-08-23 / AWS Athena / Python / Pandas / Parquet

How to save pandas dataframes with timestamps to parquet so that AWS Athena interprets timestamp columns correctly.


Running PowerShell on AWS Lambda

2020-07-09 / AWS Lambda / AWS S3 / PowerShell / DESTATIS

How to run a simple PowerShell function on AWS Lambda. It retrieves data from the DESTATIS database via its REST API and stores the tabular data in S3.


Update to Windows Download Tool for DESTATIS

2020-07-09

I’ve released a new version of my Windows download tool for the DESTATIS database (see launch post). It allows you to interactively provide your credentials via the command line and creates a credentials file for you.


A Windows Download Tool for DESTATIS

2020-06-26 / Destatis / PowerShell / Windows

The new REST API released by the German Federal Statistical Office DESTATIS (see my previous blog post) allowed me to create a standalone Windows binary (.exe) that makes it easy to automate downloads of tidy datasets from the DESTATIS database.


A RESTful API and tidy datasets for DESTATIS

2020-06-25 / Destatis / REST / API

The German Federal Statistical Office (Statistisches Bundesamt, DESTATIS) has released a RESTful/JSON API to access its databases. Additionally, a new option is available to retrieve tidy tabular datasets rather than spreadsheet-like worksheets. Workarounds for tidying datasets, such as the R package destatiscleanr, are now becoming obsolete. The post explores key features of the new API and how to retrieve the tidy datasets. It assumes the reader has little to no experience with REST APIs.


Spark 3.0 released

2020-06-20 / Spark

A new major version of Spark was released on Thursday. Version 3.0 of the popular big data processing and ML engine improves speed and functionality, primarily for Spark SQL and the Python API. Below, I briefly present some highlights for SQL, DataFrames and MLlib.


Running Scala On AWS Lambda

2020-06-14 / AWS Lambda / Scala

How to run a simple Scala function on AWS Lambda. Since no Scala runtime is provided out-of-the-box in AWS Lambda, the Scala program needs to be compiled for the Java 8 runtime. Function input and output are JSON strings which are encoded and decoded using Scala’s circe library.