Python Dataclasses

2021-03-07 / Python / Dataclass / Scala / Case Class / 6 minutes

Recently, I have started using Python’s dataclasses, a module providing decorators and functions to easily create classes similar to Scala’s case class, including immutability features. It has become a valuable tool for data-intensive applications without the boilerplate code of regular classes.

Dataclasses

The dataclasses module was introduced in Python 3.7 and has been backported to 3.6. Dataclasses are a useful tool for functional programming, where classes are mainly used to define data structures without any encapsulated functionality (methods), unlike in object-oriented programming. While this can be achieved with regular classes, dataclasses provide a simpler syntax that avoids boilerplate code, especially for initialization.

Dataclasses vs Regular Classes

Using regular classes, you could create a class that contains a user’s ID and their age as follows:

class User:
    def __init__(self, id: str, age: int):
        self.id = id
        self.age = age

In functional programming __init__ doesn’t do a lot except for assigning the values provided at initialization to members of the class. Having to do this assignment for every field quickly becomes tedious, especially in applications where you have many classes with many fields.

The dataclass decorator generates this simple __init__ for you, so you don’t have to write it yourself. Hence, you can define the same data structure as follows:

from dataclasses import dataclass

@dataclass
class User:
    id: str
    age: int

Initializing an object of class User works the same as for regular classes:

>>> User(id='id001', age=35)
User(id='id001', age=35)
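
Besides __init__, the decorator also generates __repr__ and __eq__ by default, which is why the instance above prints in a readable form. The following minimal sketch compares this with a hand-written class; the name PlainUser is hypothetical and only used for the comparison:

```python
from dataclasses import dataclass


class PlainUser:
    """Hand-written class with only __init__ (hypothetical name, for comparison)."""

    def __init__(self, id: str, age: int):
        self.id = id
        self.age = age


@dataclass
class User:
    id: str
    age: int


# Regular classes compare by identity unless __eq__ is defined.
plain_equal = PlainUser("id001", 35) == PlainUser("id001", 35)

# Dataclasses compare field by field through the generated __eq__.
data_equal = User("id001", 35) == User("id001", 35)

print(plain_equal)  # False
print(data_equal)   # True
```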

Initialization

If a field should be computed rather than passed in at initialization, you can declare it with field(init=False) and set its value in the method __post_init__, which the generated __init__ calls as its last step. If we want a field called is_18 that is calculated from the field age at initialization, we can do the following:

from dataclasses import dataclass, field

@dataclass
class User:
    id: str
    age: int
    is_18: bool = field(init=False)  # not a parameter of __init__

    def __post_init__(self):
        self.is_18 = self.age >= 18

After initialization, you can access (and set) this field as any other:

>>> x = User(id='id001', age=35)
>>> x
User(id='id001', age=35, is_18=True)
>>> x.is_18
True
>>> x.is_18 = False
>>> x
User(id='id001', age=35, is_18=False)

Immutability

The previous example demonstrates a problem with mutable fields. We were able to set is_18 independently of age, which can lead to inconsistent data. As seen above, we could create a user who is 35 years old but doesn’t seem to be of age.
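
The inconsistency also arises in the other direction: __post_init__ runs only once, so changing age afterwards leaves is_18 untouched. A short sketch using the mutable class from the previous section:

```python
from dataclasses import dataclass, field


@dataclass
class User:
    id: str
    age: int
    is_18: bool = field(init=False)

    def __post_init__(self):
        self.is_18 = self.age >= 18


x = User(id="id001", age=35)
x.age = 14  # __post_init__ is not re-run on assignment

print(x)  # User(id='id001', age=14, is_18=True)
```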

Fortunately, there is an option to declare a dataclass immutable by using the decorator @dataclass(frozen=True):

@dataclass(frozen=True)
class User:
    id: str
    age: int

Trying to mutate a field of a frozen class fails, which prevents inconsistent class instances:

>>> x = User(id='id001', age=35)
>>> x.age = 15
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'age'

Creating fields after initialization is a little trickier, though, as the assignment to self.is_18 in __post_init__ would fail due to immutability. Hence, we have to bypass the frozen class’s own __setattr__ by calling object.__setattr__ directly:

@dataclass(frozen=True)
class User:
    id: str
    age: int
    is_18: bool = field(init=False)
    def __post_init__(self):
        object.__setattr__(self, 'is_18', self.age >= 18)

Initializing frozen dataclasses works as before:

>>> x = User(id='id001', age=35)
>>> x
User(id='id001', age=35, is_18=True)

but assignment to any field will fail:

>>> x.age = 30
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'age'
>>> x.is_18 = False
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'is_18'
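
A side effect of frozen=True worth knowing: with the default eq=True, a frozen dataclass also gets a generated __hash__, so instances can be put into sets or used as dictionary keys. A non-frozen dataclass with the default settings is unhashable. A minimal sketch; MutableUser and the visits dictionary are hypothetical names for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    age: int


@dataclass
class MutableUser:  # hypothetical non-frozen counterpart
    id: str
    age: int


a = User(id="id001", age=35)
b = User(id="id001", age=35)

# Equal instances hash equally, so the set keeps only one of them.
unique_users = {a, b}
print(len(unique_users))  # 1

# Frozen instances can serve as dictionary keys.
visits = {a: 3}
print(visits[b])  # 3

# The mutable variant defines __eq__ but no __hash__.
try:
    hash(MutableUser(id="id001", age=35))
    mutable_is_hashable = True
except TypeError:
    mutable_is_hashable = False
print(mutable_is_hashable)  # False
```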

Creating instance copies

You can create copies using the replace() function without any additional arguments except the class instance to copy:

>>> from dataclasses import replace
>>> y = replace(x)
>>> y
User(id='id001', age=35, is_18=True)

As the name suggests, replace() allows you to create copies where some fields are changed:

>>> z = replace(x, age=10)
>>> z
User(id='id001', age=10, is_18=False)

Note that the field is_18 was changed as well. Fields declared with init=False are not copied from the source object but are initialized in __post_init__, as if we had created z using User(id='id001', age=10) instead.
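
For the same reason, replace() refuses to set an init=False field directly. The sketch below catches both TypeError and ValueError, on the assumption that the exact exception type differs between Python versions:

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class User:
    id: str
    age: int
    is_18: bool = field(init=False)

    def __post_init__(self):
        object.__setattr__(self, "is_18", self.age >= 18)


x = User(id="id001", age=35)

try:
    replace(x, is_18=False)  # is_18 is not a parameter of __init__
    replace_failed = False
except (TypeError, ValueError) as err:
    replace_failed = True
    print(f"replace() rejected is_18: {err}")

# Changing age is allowed, and is_18 is recomputed in __post_init__.
z = replace(x, age=10)
print(z.is_18)  # False
```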

InitVar

In cases where values need to be provided at initialization but should not become part of the class instance, you can declare a “field” as an InitVar. In our example, let’s assume that we want to hash the user ID rather than storing it in the instance in plain text. The hashing function will be

import uuid
import hashlib
def hash_password(password):
    salt = uuid.uuid4().hex
    return hashlib.sha256(salt.encode() + password.encode()).hexdigest() + ':' + salt

and the class definition is

from dataclasses import InitVar, dataclass, field

@dataclass(frozen=True)
class User:
    id: InitVar[str]
    age: int
    hashed_id: str = field(init=False)

    def __post_init__(self, id):
        object.__setattr__(self, 'hashed_id', hash_password(id))

Note that in __post_init__ we refer to the parameter id rather than self.id, as id is not a member of the class instance.

After initialization, we will not be able to access id but only hashed_id:

>>> x = User(id='id001', age=35)
>>> x
User(age=35, hashed_id='5fe717290f1274b1f4363623bcacf3c72304881421a371a41ccdeb5867f384e9:a92b39d9e0d046518810613e94edd583')
>>> x.id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'User' object has no attribute 'id'
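
Two consequences of this follow from how the module treats InitVar pseudo-fields: fields() does not list them, and replace() needs the value to be passed again because it re-runs __init__. A minimal sketch, reusing the hash_password function from above:

```python
import hashlib
import uuid
from dataclasses import InitVar, dataclass, field, fields, replace


def hash_password(password):
    salt = uuid.uuid4().hex
    return hashlib.sha256(salt.encode() + password.encode()).hexdigest() + ":" + salt


@dataclass(frozen=True)
class User:
    id: InitVar[str]
    age: int
    hashed_id: str = field(init=False)

    def __post_init__(self, id):
        object.__setattr__(self, "hashed_id", hash_password(id))


x = User(id="id001", age=35)

# fields() lists only real fields; the InitVar pseudo-field is absent.
field_names = [f.name for f in fields(x)]
print(field_names)  # ['age', 'hashed_id']

# replace() calls __init__ again, so the InitVar has to be supplied again.
y = replace(x, id="id001", age=36)
print(y.age)  # 36
```

Because hash_password draws a new random salt on every call, y.hashed_id will differ from x.hashed_id even though the same ID was passed.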

Converting dataclass To and From dict

To serialize a dataclass instance, you can use the function asdict. To deserialize, pass **source_dict as the argument at initialization:

>>> d_source = {'id': 'id002', 'age': 45}
>>> z = User(**d_source)
>>> z
User(age=45, hashed_id='1798f729ecd7ed5f7e28e42f95f9b952b5319012658439856ace9dbad4d79048:600dcc2a435e4603a215f764e8bdf830')

and

>>> from dataclasses import asdict
>>> d_target = asdict(z)
>>> d_target
{'age': 45, 'hashed_id': '1798f729ecd7ed5f7e28e42f95f9b952b5319012658439856ace9dbad4d79048:600dcc2a435e4603a215f764e8bdf830'}
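
Note that d_target no longer contains id, so the InitVar version of User cannot be rebuilt from its own asdict() output. For a dataclass without InitVar fields, the round trip works, and the intermediate dict can be passed through JSON. A minimal sketch using the simple frozen User from earlier:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class User:
    id: str
    age: int


original = User(id="id002", age=45)

# Dataclass -> dict -> JSON string
payload = json.dumps(asdict(original))
print(payload)  # {"id": "id002", "age": 45}

# JSON string -> dict -> dataclass
restored = User(**json.loads(payload))
print(restored == original)  # True
```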

Full Code

import uuid
import hashlib

from dataclasses import dataclass, asdict, field, replace, InitVar

# simple dataclass
@dataclass
class User:
    id: str
    age: int

x = User(id='id001', age=35)

x
x.id
x.id = 'id002'

@dataclass
class User:
    id: str
    age: int
    is_18: bool = field(init=False)
    def __post_init__(self):
        self.is_18 = self.age >= 18

User(id='id001', age=35)

x = User(id='id001', age=35)
x.is_18

x.is_18 = False

x.age = 14
x


@dataclass(frozen=True)
class User:
    id: str
    age: int

x = User(id='id001', age=35)
x.age = 15  # raises FrozenInstanceError

# Initialization

@dataclass(frozen=True)
class User:
    id: str
    age: int
    is_18: bool = field(init=False)
    def __post_init__(self):
        object.__setattr__(self, 'is_18', self.age >= 18)

x = User(id='id001', age=35)
x

x.age = 30  # raises FrozenInstanceError
x.is_18 = False  # raises FrozenInstanceError

# Copy and Replace
y = replace(x)
z = replace(x, age=10)

# InitVar

def hash_password(password):
    salt = uuid.uuid4().hex
    return hashlib.sha256(salt.encode() + password.encode()).hexdigest() + ':' + salt

hash_password('id001')

@dataclass(frozen=True)
class User:
    id: InitVar[str]
    age: int
    hashed_id: str = field(init=False) 
    def __post_init__(self, id):
        object.__setattr__(self, 'hashed_id', hash_password(id))

x = User(id='id001', age=35)
x
x.id  # raises AttributeError: id is an InitVar, not a field

# to and from dict
d_source = {'id': 'id002', 'age': 45}
z = User(**d_source)

d_target = asdict(z)
d_target

