For the last four years I have been jumping back and forth between Python and Go. The majority of the work I carry out as a Machine Learning Engineer at Octopus Energy is focused on Python and the typical data science stack. We are working on medium/big scale data processing projects using all the usual suspects: keras, scikit-learn, pandas, airflow etc. But at the moment, I’m not doing any Go.
There are two major things that I miss from Go when working on a Python project: static typing and interfaces. At Octopus Energy we use Python’s type hinting in order to avoid silly mistakes (the ones I do more often) with some success but there are intrinsic limitations.
While dealing with data streams and especially with the
io package in Python,
it becomes a burden trying to keep a flexible interface while reusing elements
of the standard library. Consider the following example:
import pandas as pd def transform(input_file: str) -> pd.DataFrame: df = pd.read_csv(input_file, index="index") return _transform(df)
This implementation has some drawbacks: passing a file-path string for pandas to load a CSV file makes testing difficult as you would need to create an actual CSV. Also, we restrict ourselves to local files for applying our data transform - however you might want to load the stream from other sources, such as HTTP connections or a S3 bucket.
You might be tempted to rewrite this function using
import pandas as pd import io def transform(input_io: io.TextIO) -> pd.DataFrame: df = pd.read_csv(input_io, index="index") return _transform(df)
This is more sensible: we’re using a class offered by the standard library and
pandas will like it as it uses an intrinsic reading protocol (more on this
soon). This means that as long as whatever you send down to
pd.read_csv has a
read, pandas will be able to load the data. But it’s
difficult to extend the classes in the
S3 readers won’t inherit from these classes
as typing is not really big in the python ecosystem. Also, building fake
subclasses to test is an ordeal as you need to create a bunch of empty stubs to
Let’s focus on the concept of protocol in python. This not something new, you
might know that calling
len() in any object that contains the method
will just work;
__getitem__ and the pickling protocol are more examples.
Python, being dynamically typed, will check programmatically that
before calling it during runtime.
If you’ve written any Go code, this will sound familiar, but of course you would enforce your protocol into an actual interface. In Go, there’s no concept of classes or inheritance, just structs. Structs can have functions attached to them called methods (sorry the nomenclature makes the comparison a bit messy). You define interfaces as contracts that your struct needs to meet, just like protocols! But being strictly typed, this is enforced during compilation time.
Relying on type hinting to play along with protocols is known as structural subtyping in python.
You might be wondering how this is different from inheritance. Let’s have a look at the following example where we will create a useful function to print the length of different objects.
class LenMixin: def __len__(self) -> int: raise NotImplementedError() class OneLengther(LenMixin): def __len__(self) -> int: return 1 def print_with_len(has_len: LenMixin): print(len(has_len)) print_with_len(OneLengther()) # yay print_with_len(list([1, 2, 3])) # ney! type error
Mypy will complain about the last line in this example. The reason being the
list object is not a subclass of
LenMixin. That’s unfortunate as the list
class actually offers all the functionality we need. Of course, we could create
MyList class that inherits from both list and
LenMixin but it adds tons of
typing_extensions allows us to use protocols in a strictly typed
Protocol defines a set of functions that a class needs to have in
order to be used as an argument to another function.
from typing_extensions import Protocol class Sizable(Protocol): """A custom definition of https://mypy.readthedocs.io/en/latest/protocols.html#sized """ def __len__(self) -> int: pass class MySizable: # No superclass def __len__(self) -> int: return 1 def print_sizable(sizable: Sizable): print(len(sizable)) print_sizable(MySizable()) # yay print_sizable(list([1, 2, 3])) # yaaaay!
In this way we just need to have the
__len__ method in our class (or function
in our module) in order to fulfill the
The whole concept of dependency injection in Go revolves around the idea of having interfaces as a contract. Your python object or your Go struct needs to fulfill this contract in order to be used as a parameter to a function. This blog post by Christoph Berger explains it way better, and this video by the great Francesc Campoy will give you a deep insight of the concept.
Coming back to the reader example, here we define a protocol that allows reading streams.
from abc import abstractmethod from typing_extensions import Protocol from typing import Any class Reader(Protocol): """Reader protocol sets the contract to read streams.""" @abstractmethod def __iter__(self): """Make the reader an iterable.""" pass @abstractmethod def read(self, size: int = -1) -> Any: """Read the given amount of bytes.""" pass
The reason why our reader needs to be iterable comes from panda’s file objects.
Now we can rewrite our
tranform_cvs function to use this protocol as an input.
import pandas as pd from abc import abstractmethod from typing_extensions import Protocol from typing import Any class Reader(Protocol): """Reader protocol sets the contract to read streams.""" @abstractmethod def __iter__(self): """Make the reader an iterable.""" pass @abstractmethod def read(self, size: int = -1) -> Any: """Read the given amount of bytes.""" pass def transform(input_reader: Reader) -> pd.DataFrame: df = pd.read_csv(input_reader, index="index") return _transform(df)
From this implementation we obtain a function that accepts not only any
file-like-object, such as files,
io.BytesIO, but also any
other third party libraries to stream S3 buckets, FTP, sFTP, HTTP, as long as
they are iterables and have a
read(i:int=-1) method. No inheritance needed, no
extra glue code to make this work with third party libraries or testing fakes.
Protocols and structural subtyping make injecting dependencies and creating fake objects for testing easy and elegant.